Not Enough Shares when repairing a file which has 7 shares on 2 servers #732

Closed
opened 2009-06-10 14:23:41 +00:00 by zooko · 16 comments

My demo at the Northern Colorado Linux Users Group had an unfortunate climactic conclusion when someone (whose name I didn't catch) asked about repairing damaged files, so I clicked the check button with the "repair" checkbox turned on, and got this:

```
NotEnoughSharesError: no shares could be found. Zero shares usually indicates a corrupt URI, or that no servers were connected, but it might also indicate severe corruption. You should perform a filecheck on this object to learn more.
```

I couldn't figure it out and had to just bravely claim that Tahoe had really great test coverage and this sort of unpleasant surprise wasn't common. I also promised to email them all with the explanation, so I'm subscribing to the NCLUG mailing list so that I can e-mail the URL to this ticket. :-)

The problem remains reproducible today. I have a little demo grid with an introducer, a gateway, and two storage servers. The gateway has its storage service turned off. I have a file stored therein with 3-of-10 encoding, and I manually `rm`'ed three shares from one of the storage servers. Check correctly reports:

```
"summary": "Not Healthy: 7 shares (enc 3-of-10)"
```

Check also works with the "verify" checkbox turned on.
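For the record, the same check and repair can be driven through the webapi rather than the WUI checkboxes. A rough Python sketch, assuming a gateway listening on localhost:3456 and the standard `t=check` query arguments (the cap is a placeholder, not the one from this ticket):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

GATEWAY = "http://127.0.0.1:3456"  # assumed local gateway port
CAP = "URI:CHK:..."                # placeholder read cap of the damaged 3-of-10 file

def check(cap, verify=False, repair=False):
    # check operations are POSTs; output=JSON requests machine-readable results
    url = "%s/uri/%s?t=check&output=JSON" % (GATEWAY, quote(cap, safe=""))
    if verify:
        url += "&verify=true"
    if repair:
        url += "&repair=true"
    with urlopen(url, data=b"") as resp:
        return json.loads(resp.read())

print(check(CAP)["summary"])   # "Not Healthy: 7 shares (enc 3-of-10)"
check(CAP, verify=True)        # verification also succeeds
check(CAP, repair=True)        # the failing repair: at the time of this ticket
                               # the POST came back 410 with the error above
```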

When I try to repair I get the Not Enough Shares error and an incident report like this one (full incident report file attached):

```
07:03:12.747 [5977]: web: 127.0.0.1 GET /uri/[CENSORED].. 200 308553
07:03:25.604 [5978]: <Repairer #6>(u7rxp): starting repair
07:03:25.604 [5979]: CHKUploader starting
07:03:25.604 [5980]: starting upload of <DownUpConnector #6>
07:03:25.604 [5981]: creating Encoder <Encoder for unknown storage index>
07:03:25.604 [5982]: <CiphertextDownloader #22>(u7rxpbtbw5wb): starting download
07:03:25.613 [5983]: SCARY <CiphertextDownloader #22>(u7rxpbtbw5wb): download failed! FAILURE:
[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/trunk/src/allmydata/immutable/repairer.py", line 69, in start
    d2 = dl.start()
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/trunk/src/allmydata/immutable/download.py", line 715, in start
    d.addCallback(self._got_all_shareholders)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 195, in addCallback
    callbackKeywords=kw)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-macosx-10.3-i386.egg/twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/Users/wonwinmcbrootles/playground/allmydata/tahoe/trunk/trunk/src/allmydata/immutable/download.py", line 810, in _got_all_shareholders
    self._verifycap.needed_shares)
allmydata.interfaces.NotEnoughSharesError: Failed to get enough shareholders
]
[INCIDENT-TRIGGER]
07:03:26.253 [5984]: web: 127.0.0.1 POST /uri/[CENSORED].. 410 234
```
zooko added the code-encoding, major, defect, 1.4.1 labels 2009-06-10 14:23:41 +00:00
zooko added this to the 1.5.0 milestone 2009-06-10 14:23:41 +00:00
zooko self-assigned this 2009-06-10 14:23:41 +00:00
Author

Attachment incident-2009-06-10-070325-bch7emy.flog.bz2 (29828 bytes) added

Author

#736 (UnrecoverableFileError on directory which has 6 shares (3 needed)) may be related.

Author

Here is [the mailing list message on nclug@nclug.org](http://nclug.org/pipermail/nclug/2009-June/009354.html) where I posted the promised follow-up.

I have the same problem; here are [my incident reports](http://127.0.0.1:9797/uri/URI%3ADIR2-RO%3Awz2jevwzhgzdkpocyvadxjx6sm%3Aicljlu7etpouvvnduhuzyfgyyv5bvqp4iophltfdbtrwdjy3wuea/) (volunteer grid). Here is [the troublesome directory](http://localhost:9797/uri/URI%3ADIR2-RO%3Artpse34ww74nvcbauyd7cimkra%3Auhpplqddxfieeszln5kanypycxgfhaylwf6zxpg2ui6spnhwtwnq/). (My repair attempts have been from the CLI using the RW cap, not the RO.) Note that the files are all readable, and tahoe deep-check agrees they are recoverable; only repair fails.

kpreid: could you also upload those incident reports to this ticket? I don't have a volunteergrid node running on localhost.

Author

I just set up a public web gateway for volunteergrid:

http://nooxie.zooko.com:9798/

It's running on nooxie.zooko.com, the same host that runs the introducer for the volunteergrid.

Thanks for the gateway! In the spirit of fewer clicks, http://nooxie.zooko.com:9798/uri/URI%3ADIR2-RO%3Awz2jevwzhgzdkpocyvadxjx6sm%3Aicljlu7etpouvvnduhuzyfgyyv5bvqp4iophltfdbtrwdjy3wuea/ contains kpreid's incident reports, as long as the volunteer grid and zooko's gateway stay up..

(in case it wasn't clear, I'm arguing that the volunteer grid is not as good a place to put bug report data as, say, this bug report :-).

Attachment incident-2009-06-19-165434-sa36v7a.flog.bz2 (22631 bytes) added

kpreid's incident #1

Attachment incident-2009-06-19-165531-faf6sda.flog.bz2 (23465 bytes) added

kpreid's incident 2

Attachment incident-2009-06-19-165803-2mz3eay.flog.bz2 (26384 bytes) added

kpreid's incident 3

Attachment incident-2009-06-19-165823-mi5gqdq.flog.bz2 (26901 bytes) added

kpreid's incident 4

Done.

ok, I found the bug.. repairer.py instantiates CiphertextDownloader with a Client instance, when it's supposed to be passing a StorageFarmBroker instance. They both happen to have a method named get_servers, and the methods have similar signatures (they both accept a single string and return an iterable), but different semantics. The result is that the repairer's downloader was getting zero servers, because it was asking with a storage_index where it should have been asking with a service name. Therefore the downloader didn't send out any queries, so it got no responses, so it concluded that there were no shares available.
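To make the failure mode concrete, here is a self-contained toy (not Tahoe code; the class bodies are invented purely for illustration) of how two objects with a same-named, same-shaped `get_servers()` method but different semantics let the wrong one slip through and silently produce zero servers:

```python
class StorageFarmBroker:
    """get_servers() expects a storage index and returns servers to query."""
    def __init__(self, servers):
        self._servers = servers

    def get_servers(self, storage_index):
        return list(self._servers)  # every known storage server, whatever the index


class Client:
    """get_servers() expects a service name such as 'storage'."""
    def __init__(self, services):
        self._services = services

    def get_servers(self, service_name):
        return self._services.get(service_name, [])  # unknown name -> nothing


def start_download(server_source, storage_index):
    servers = server_source.get_servers(storage_index)
    if not servers:
        raise RuntimeError("NotEnoughSharesError: no shares could be found")
    return servers


broker = StorageFarmBroker(["server-1", "server-2"])
client = Client({"storage": ["server-1", "server-2"]})

print(start_download(broker, "u7rxp"))  # works: queries both servers
print(start_download(client, "u7rxp"))  # raises: "u7rxp" is not a service name
```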

test_repairer still passes because it's using a NoNetworkClient instead of the regular Client, and NoNetworkClient isn't behaving quite the same way as Client (the get_servers method happens to behave like StorageFarmBroker).

This was my bad.. I updated a number of places but missed repairer.py. The StorageFarmBroker thing, in general, should remove much of the need for a separate no-network test client (the plan is to use the regular client but configure it to not talk to an introducer and stuff in a bunch of loopback'ed storage servers). But in the transition period, this one fell through.

My plan is to change NoNetworkClient first, so that test_repairer fails like it's supposed to, then change one of the get_servers to a different name (so that the failure turns into an AttributeError), then finally fix repairer.py to pass in the correct object. Hopefully I'll get that done tomorrow.

If you'd like to just fix it (for local testing), edit repairer.py line 53 (in `Repairer.start`) and change the first argument of `download.CiphertextDownloader` from `self._client` to `self._client.get_storage_broker()`. A quick test here suggests that this should fix the error.
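A sketch of that one-argument change at the call site (the surrounding arguments are elided because they aren't quoted in this ticket, and the exact line may differ between revisions):

```python
# allmydata/immutable/repairer.py, Repairer.start(), around line 53

# before: the downloader is handed the whole Client, whose get_servers()
# expects a service name, so it never finds any servers to query
#   dl = download.CiphertextDownloader(self._client, ...)

# after: the downloader is handed the StorageFarmBroker it actually expects
#   dl = download.CiphertextDownloader(self._client.get_storage_broker(), ...)
```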

the patches I pushed in the last few days should fix this problem. Zooko, kpreid, could you upgrade and try the repair again? And if that works, close this ticket?

Seems to work. On my test case I get a different error: `ERROR: MustForceRepairError(There were unrecoverable newer versions, so force=True must be passed to the repair() operation)`, but I assume this is unrelated.

Yeah, MustForceRepairError is indicated for mutable files, when there are fewer than 'k' shares of some version N, but k or more shares of some version N-1. (the version numbers are slightly more complicated than that, but that's irrelevant). This means that the repairer sees evidence of a newer version, but is unable to recover it, and passing in force=True to the repair() call will knowingly give up on that version.
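A toy illustration of that condition (not Tahoe code; `k` is the number of shares needed to recover a version):

```python
def needs_force(shares_seen, k):
    """shares_seen maps sequence number -> count of shares observed.

    True when the newest visible version has fewer than k shares (so it is
    unrecoverable) while some older version is recoverable -- the situation
    where repair() must be told force=True to knowingly abandon the newer one.
    """
    recoverable = {seq for seq, count in shares_seen.items() if count >= k}
    return bool(recoverable) and max(shares_seen) not in recoverable

print(needs_force({5: 7, 6: 1}, k=3))  # True: seq6 is visible but unrecoverable
print(needs_force({5: 7}, k=3))        # False: the newest version is recoverable
```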

I don't think there is yet a webapi to pass force=True. Also, I think there might be situations in which the repairer fails to look far enough for newer versions. Do a "check" and look at the version numbers (seqNNN in the share descriptions), to see if the message seems correct.

This can occur when a directory update occurs while the node is not connected to the usual storage nodes, especially if the nodes that are available then go away later.

zooko added the fixed label 2009-06-30 12:38:12 +00:00
zooko closed this issue 2009-06-30 12:38:12 +00:00