server-side crawlers: tolerate corrupted shares, verify shares #812
Reference: tahoe-lafs/trac-2024-07-25#812
From twistd.log on prodtahoe17 data6:
No incident logs.
Here is the mutable share in question, attached.
Attachment 5 (2942 bytes) added: /Users/wonwinmcbrootles/5
hm, the share appears to be truncated. I looked at sh7 (in /data5 on that same box), which appears to be intact. Both are for seqnum 6, and both appear to match up until the truncation point (which occurs in the middle of the encrypted private key).
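For reference, a minimal sketch (not from the ticket; the paths passed in are up to the operator) of the kind of byte-by-byte comparison used above to confirm that the damaged share is just a truncated prefix of an intact copy:

```python
def find_divergence(path_a, path_b, chunk=4096):
    """Return the first byte offset at which the two files differ, or the
    length of the shorter file if one is a truncated prefix of the other;
    return None if they are identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if not a and not b:
                return None          # identical all the way to EOF
            n = min(len(a), len(b))
            for i in range(n):
                if a[i] != b[i]:
                    return offset + i
            if len(a) != len(b):
                return offset + n    # one file simply stops here
            offset += n

# e.g. find_divergence(damaged_share_path, intact_share_path)
# The damaged copy attached here is 0xb7e (2942) bytes long.
```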
These shares are supposed to be written atomically, with a write vector of length one (consisting of the entire share) that is written in a single Python f.write call. Perhaps the write got interrupted (node reboot or system crash) in such a bad way that only part of the data was written out to disk? The truncation point isn't on a particularly round boundary (the file is 0xb7e bytes long), so it doesn't feel like a disk block size or anything like that. Weird.
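As an illustration only (this is not Tahoe's storage code, and both function names are made up): a single f.write() is not crash-atomic, so an interruption can leave a truncated prefix on disk; one common mitigation is to write a temporary file and rename it into place:

```python
import os
import tempfile

def write_share_naive(path, data):
    # If the process or machine dies partway through this write, only a
    # prefix of `data` may reach the disk -- a truncated share like the
    # one attached to this ticket.
    with open(path, "wb") as f:
        f.write(data)

def write_share_atomic(path, data):
    # Crash-safer variant: write a temp file in the same directory, fsync
    # it, then rename over the old file.  A rename within one filesystem
    # is atomic, so readers see either the old share or the new one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

The naive variant matches the failure mode described above; the atomic variant is only one possible mitigation, not something this ticket proposes.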
I suppose the important part is to recover gracefully from it. I believe the share-crawler should keep going after the error; that'd be the first thing to verify.
I guess the second step would be to build a validating share crawler, and/or have some code in the lease-checking share crawler which would respond to problems like this by moving the corrupt share into a junkpile and logging the issue somewhere.
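A rough sketch of that idea, with assumed names and paths (this is not the actual crawler code): catch the parse failure per share, move the share into a quarantine directory, log it, and keep crawling:

```python
import logging
import os
import shutil

QUARANTINE_DIR = "/storage/quarantined-shares"   # hypothetical location
log = logging.getLogger(__name__)                # Tahoe would use its own logging

def check_share(path):
    # Stand-in for real share parsing: here we only insist the file is
    # non-empty; a real verifier would parse the header and lease area.
    if os.path.getsize(path) == 0:
        raise ValueError("zero-length share file")

def crawl_bucket(sharedir):
    for fn in sorted(os.listdir(sharedir)):
        path = os.path.join(sharedir, fn)
        try:
            check_share(path)
        except Exception as e:
            log.warning("corrupt share %s (%s): quarantining", path, e)
            if not os.path.isdir(QUARANTINE_DIR):
                os.makedirs(QUARANTINE_DIR)
            shutil.move(path, QUARANTINE_DIR)
            continue                 # the crawler keeps going
```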
Let's make this ticket be about recovering from this sort of corruption:
In the meantime, I assume the workaround is to rm that file, right?
Changed title from "exception from attempt to parse leases" to "handle corrupted lease files".

Sounds good. To be specific, this is unrelated to leases; it's just that the lease-expiring crawler is what first noticed the corruption. So this ticket is about:
And yeah, just rm the file, it's useless to anyone. The next time that directory is modified, a new copy of that share will be created.
Changed title from "handle corrupted lease files" to "server-side crawlers: tolerate corrupted shares, verify shares".

#1834 would remove the lease-checking and bucket-counting crawlers, making this ticket irrelevant. However, we might then want to invent a share-verifying crawler, just for the purpose of looking for corrupted shares, which would make this ticket relevant again.
My consulting client (codename "WAG") (see comment:6:ticket:1278) has corrupt shares. This message is in the storage server's twistd.log:
And this message is in an incident report file generated by the node (which is both the storage server and the gateway):
This is with this version of Tahoe-LAFS:
Okay, I've been looking into this, and I see that this kind of corruption is in fact handled: it is logged, accounted for in a count called corrupt-shares, and the corrupted share is skipped. See expirer.py (source:trunk/src/allmydata/storage/expirer.py?annotate=blame&rev=d5651a0d0eebdc144db53425ee461e186319e5fd#L127). The only reason we've thought it was not being handled all these years is that it calls twisted.python.log.err(), which emits a string to the twistd.log that says "Unhandled Error", plus a stack trace. So I propose that we just remove that call to twisted.python.log.err() and add a unit test which requires the code under test to detect and skip over corrupted shares without emitting this error log.

Wait, why do we ever log directly to the Twisted log? Let's remove all the places that do that.
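A self-contained sketch of the proposed test, assuming we only want to pin down the observable behaviour (corrupt share counted and skipped, nothing sent to log.err()); scan_shares below is a stand-in for the real crawler pass, not Tahoe's code:

```python
import os
import struct
from twisted.trial import unittest

def scan_shares(sharedir):
    """Stand-in for one crawler pass: count shares whose (fake, 8-byte)
    header cannot be read, skip them, and never call log.err()."""
    corrupt = 0
    for fn in sorted(os.listdir(sharedir)):
        with open(os.path.join(sharedir, fn), "rb") as f:
            header = f.read(8)
        if len(header) < 8:
            corrupt += 1      # note it and move on
            continue
        struct.unpack(">LL", header)
    return corrupt

class CorruptShareSkipped(unittest.TestCase):
    def test_truncated_share_is_counted_not_logged(self):
        sharedir = self.mktemp()
        os.makedirs(sharedir)
        with open(os.path.join(sharedir, "0"), "wb") as f:
            f.write(struct.pack(">LL", 1, 0) + b"rest of a healthy share")
        with open(os.path.join(sharedir, "1"), "wb") as f:
            f.write(b"\x00\x00\x00")      # truncated mid-header
        self.assertEqual(scan_shares(sharedir), 1)
        # trial fails any test that leaves unflushed log.err() calls, so an
        # empty flush list here proves nothing was sent to the error log:
        self.assertEqual(self.flushLoggedErrors(), [])
```

The real test would drive the lease-checking crawler against a storage-server fixture instead of scan_shares, but the flushLoggedErrors() assertion is the part that captures "without emitting this error log".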
Filed #2343 (remove all direct logging to twisted.python.log/err).