Not Enough Shares when repairing a file which has 7 shares on 2 servers #732
My demo at the Northern Colorado Linux Users Group had an unfortunate climactic conclusion when someone (whose name I didn't catch) asked about repairing damaged files, so I clicked the check button with the "repair" checkbox turned on, and got this:
I couldn't figure it out and had to just bravely claim that Tahoe had really great test coverage and this sort of unpleasant surprise wasn't common. I also promised to email them all with the explanation, so I'm subscribing to the NCLUG mailing list so that I can e-mail the URL to this ticket. :-)
The problem remains reproducible today. I have a little demo grid with an introducer, a gateway, and two storage servers. The gateway has its storage service turned off. I have a file stored there with 3-of-10 encoding, and I manually rm'ed three shares from one of the storage servers. Check correctly reports that shares are missing, and check also works with the "verify" checkbox turned on.
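(For reference, the manual share deletion amounts to something like the following Python sketch. This is not from the ticket itself; it assumes the usual on-disk layout of a Tahoe-LAFS storage server, NODEDIR/storage/shares/<2-char prefix>/<storage index>/<share number>, and the node directory and storage index below are made-up placeholders.)

    import os
    import glob

    node_dir = "/home/demo/.tahoe-server1"        # hypothetical storage node dir
    storage_index = "b2lgh6nbhfdz6iuleoghsbdzge"  # hypothetical base32 storage index

    share_dir = os.path.join(node_dir, "storage", "shares",
                             storage_index[:2], storage_index)
    shares = sorted(glob.glob(os.path.join(share_dir, "*")))

    # Remove three of the ten shares, leaving 7 (still >= k=3, so the file
    # stays recoverable and merely needs repair).
    for path in shares[:3]:
        print("removing", path)
        os.remove(path)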
When I try to repair, I get the Not Enough Shares error and an incident report like this one (full incident report file attached):
Attachment incident-2009-06-10-070325-bch7emy.flog.bz2 (29828 bytes) added
#736 (UnrecoverableFileError on directory which has 6 shares (3 needed)) may be related.
Here is the mailing list message on nclug@nclug.org where I posted the promised follow-up.
I have the same problem; here are my incident reports (volunteer grid). Here is the troublesome directory. (My repair attempts have been from the CLI using the RW cap, not the RO.) Note that the files are all readable, and tahoe deep-check agrees they are recoverable; only repair fails.
kpreid: could you also upload those incident reports to this ticket? I don't have a volunteergrid node running on localhost.
I just set up a public web gateway for volunteergrid:
http://nooxie.zooko.com:9798/
It's running on nooxie.zooko.com, the same host that runs the introducer for the volunteergrid.
Thanks for the gateway! In the spirit of fewer clicks, http://nooxie.zooko.com:9798/uri/URI%3ADIR2-RO%3Awz2jevwzhgzdkpocyvadxjx6sm%3Aicljlu7etpouvvnduhuzyfgyyv5bvqp4iophltfdbtrwdjy3wuea/ contains kpreid's incident reports, as long as the volunteer grid and zooko's gateway stay up.
(In case it wasn't clear, I'm arguing that the volunteer grid is not as good a place to put bug-report data as, say, this bug report. :-)
Attachment incident-2009-06-19-165434-sa36v7a.flog.bz2 (22631 bytes) added
kpreid's incident #1
Attachment incident-2009-06-19-165531-faf6sda.flog.bz2 (23465 bytes) added
kpreid's incident 2
Attachment incident-2009-06-19-165803-2mz3eay.flog.bz2 (26384 bytes) added
kpreid's incident 3
Attachment incident-2009-06-19-165823-mi5gqdq.flog.bz2 (26901 bytes) added
kpreid's incident 4
Done.
OK, I found the bug. repairer.py instantiates CiphertextDownloader with a Client instance when it's supposed to be passing a StorageFarmBroker instance. They both happen to have a method named get_servers, and the methods have similar signatures (they both accept a single string and return an iterable), but different semantics. The result is that the repairer's downloader was getting zero servers, because it was asking with a storage_index where it should have been asking with a service name. Therefore the downloader didn't send out any queries, so it got no responses, so it concluded that there were no shares available.
test_repairer still passes because it uses a NoNetworkClient instead of the regular Client, and NoNetworkClient doesn't behave quite the same way as Client (its get_servers method happens to behave like StorageFarmBroker's).
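(To make the confusion concrete, here is a toy sketch of how two get_servers methods with matching signatures but different semantics can silently yield an empty server list. These are stand-in classes, not the real Tahoe-LAFS Client and StorageFarmBroker.)

    class FakeStorageFarmBroker:
        """Toy stand-in: get_servers() takes a storage index."""
        def __init__(self, servers):
            self._servers = servers

        def get_servers(self, storage_index):
            # The real broker permutes the server list per storage index;
            # here we just return everything.
            return list(self._servers)


    class FakeClient:
        """Toy stand-in: get_servers() takes a *service name*."""
        def __init__(self, services):
            self._services = services   # e.g. {"storage": [...]}

        def get_servers(self, service_name):
            return self._services.get(service_name, [])


    def fetch_shares(servers_source, storage_index):
        # The downloader only knows it can call .get_servers(<some string>).
        servers = servers_source.get_servers(storage_index)
        if not servers:
            raise RuntimeError("NotEnoughSharesError: 0 servers to query")
        return servers


    servers = ["server-1", "server-2"]
    print(fetch_shares(FakeStorageFarmBroker(servers), "si_abc"))   # works
    try:
        fetch_shares(FakeClient({"storage": servers}), "si_abc")    # the bug
    except RuntimeError as e:
        print(e)   # no service named "si_abc", so zero servers, so "not enough shares"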
This was my bad: I updated a number of places but missed repairer.py. The StorageFarmBroker change should, in general, remove much of the need for a separate no-network test client (the plan is to use the regular client, but configure it not to talk to an introducer and to stuff in a bunch of loopbacked storage servers). But in the transition period, this one fell through.
My plan is to change NoNetworkClient first, so that test_repairer fails like it's supposed to, then rename one of the get_servers methods (so that the failure turns into an AttributeError), and finally fix repairer.py to pass in the correct object. Hopefully I'll get that done tomorrow.
If you'd like to just fix it (for local testing), edit repairer.py line 53 (in Repairer.start) and change the first argument of download.CiphertextDownloader from self._client to self._client.get_storage_broker().
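(A minimal sketch of the shape of that edit, using hypothetical stand-in classes; the real CiphertextDownloader constructor takes additional arguments that are omitted here.)

    class CiphertextDownloader:
        def __init__(self, storage_broker):
            # Expects the broker, whose get_servers(storage_index) it will call.
            self._storage_broker = storage_broker


    class Repairer:
        def __init__(self, client):
            self._client = client

        def start(self):
            # Buggy version (what repairer.py line 53 was doing):
            #   dl = CiphertextDownloader(self._client)
            # Fixed version -- pass the object the downloader actually expects:
            dl = CiphertextDownloader(self._client.get_storage_broker())
            return dl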
A quick test here suggests that this should fix the error.

The patches I pushed in the last few days should fix this problem. Zooko, kpreid: could you upgrade and try the repair again? And if that works, close this ticket?
Seems to work. On my test case I get a different error:
ERROR: MustForceRepairError(There were unrecoverable newer versions, so force=True must be passed to the repair() operation)
but I assume this is unrelated.

Yeah, MustForceRepairError is indicated for mutable files when there are fewer than 'k' shares of some version N, but k or more shares of some version N-1. (The version numbers are slightly more complicated than that, but that's irrelevant here.) This means the repairer sees evidence of a newer version but is unable to recover it; passing force=True to the repair() call will knowingly give up on that version.
I don't think there is yet a webapi to pass force=True. Also, I think there might be situations in which the repairer fails to look far enough for newer versions. Do a "check" and look at the version numbers (seqNNN in the share descriptions), to see if the message seems correct.
This can happen when a directory update occurs while the node is not connected to its usual storage servers, especially if the servers that were available at the time go away later.
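(To make the force=True decision described above concrete, here is a toy model of the choice the repairer faces. It is not the real mutable-file repairer code; share_counts, k, and plan_repair are illustrative names.)

    def plan_repair(share_counts, k, force=False):
        """share_counts maps sequence number -> number of distinct shares seen."""
        newest = max(share_counts)
        recoverable = [seq for seq, n in share_counts.items() if n >= k]
        if not recoverable:
            raise RuntimeError("UnrecoverableFileError: no version has >= k shares")
        best_recoverable = max(recoverable)
        if newest > best_recoverable and not force:
            # Evidence of a newer version exists, but it cannot be recovered.
            raise RuntimeError("MustForceRepairError: there were unrecoverable "
                               "newer versions, so force=True must be passed")
        # With force=True we knowingly give up on the newer version and
        # re-encode the best recoverable one.
        return "repair version seq%d" % best_recoverable


    # Example: seq9 has only 2 shares (unrecoverable with k=3), seq8 has 7 shares.
    try:
        plan_repair({9: 2, 8: 7}, k=3)
    except RuntimeError as e:
        print(e)                                          # MustForceRepairError ...
    print(plan_repair({9: 2, 8: 7}, k=3, force=True))     # repair version seq8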