TypeError when repairing an (unrecoverable?) directory #786
I had just brought my laptop out of sleep and it hadn't yet connected to a wireless network when I clicked on the bookmark to take me to my blog. It said something to the effect that the file was unrecoverable, and I saw that the network had just about finished coming up (according to the little radio wave icon thingie at the upper-right-hand corner of my Mac OS X desktop), so I hit reload.
It said:
I'll attach the full resulting error page and the two incident report files that were generated.
Attachment Exception.html (8183 bytes) added
Attachment incident-2009-07-29-104334-hjflzua.flog.bz2 (43666 bytes) added
Attachment incident-2009-07-29-104230-vyc6byy.flog.bz2 (44195 bytes) added
allmydata-tahoe: 1.4.1-r3997, foolscap: 0.4.2, pycryptopp: 0.5.15, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.31-r15675, zope.interface: 3.1.0c1, python: 2.5.4, platform: Darwin-8.11.1-i386-32bit, sqlite: 3.1.3, simplejson: 2.0.9, argparse: 0.8.0, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.0, setuptools: 0.6c12dev, pysqlite: 2.3.2
Hrm. verinfo=None is a likely way for one piece of code to tell another that there are no recoverable versions, and something should notice that and raise a file-not-recoverable error instead of a confusing+ugly not-iterable error, but of course the real question is why this behavior occurred the second time you tried to download it, when presumably all your connections had become established.
Maybe only some of your connections were established by the second attempt, and this is an error which occurs when some but not all of the shares were retrievable.
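The kind of guard being suggested might look like the following. This is only a minimal sketch; the names UnrecoverableFileError and best_recoverable_version() are illustrative assumptions, not necessarily the real Tahoe-LAFS identifiers.

```python
# Minimal sketch only: turn the verinfo=None sentinel into a clear error.
# UnrecoverableFileError and best_recoverable_version() are assumed names,
# not necessarily the real ones in the codebase.

class UnrecoverableFileError(Exception):
    """No version of this mutable file can be recovered."""

def pick_version_to_download(servermap):
    verinfo = servermap.best_recoverable_version()
    if verinfo is None:
        # Raise a file-not-recoverable error here, instead of letting a
        # caller iterate over None and hit a confusing TypeError later.
        raise UnrecoverableFileError("no recoverable versions of this file")
    return verinfo
```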
When I get a chance, I'll try to look at the incidents you attached and see if I can distinguish between these two situations. Thanks for including all the details and incident reports; that will greatly help to analyze this one!
Hm, I think I might have misremembered the sequence of events. I don't think I "hit reload". I think instead I started a deep-check-verify-repair-renew on the directory that contains my blog. I'm sorry that I don't remember for sure.
By the way, if you get a chance to reconsider #653 (introducer client: connection count is wrong, VersionedRemoteReference needs EQ), the answers to the questions on that ticket might be relevant to this ticket.
Brian: any ideas or suggestions of things I can do to help on this one?
Here's another report of this issue:
http://allmydata.org/pipermail/tahoe-dev/2009-December/003420.html
François reported on IRC that it was the Tahoe-LAFS 1.5.0-0ubuntu1 package that ships with Ubuntu Karmic, and that the latest trunk of Tahoe-LAFS didn't have this bug. However, his follow-up on the list said that it was a specific command --
deep-check -v --add-lease --repair
-- that failed, but that doing the same actions in subsequent commands worked: http://allmydata.org/pipermail/tahoe-dev/2009-December/003421.html
So I'm not sure if this behavior really does differ between Tahoe-LAFS v1.5.0 and current trunk. Assigning to François for clarification.
Title changed from "TypeError when loading a directory while my wireless network was down" to "TypeError when loading a directory".
I currently believe that both of these exceptions were the result of a repair attempted on a mutable file (probably a directory) which was unrecoverable. Francois' crash uses download_version, and the only place where that is used in the tahoe codebase is in the repairer. If there were no shares available, it would call download_version with None (instead of the version to download, which is expressed as a tuple of things), and you'd see this kind of crash.
Incidentally, I found a bug in the repairer that incorrectly classifies an unrecoverable-but-not-completely-gone file (i.e. one with 1 or 2 shares, when k=3). It raises a MustForceRepairError with an explanation about there being unrecoverable newer versions. The intention was to raise this error when e.g. there are 9 shares of version 1 and 2 shares of version 2, since in that situation, repairing the only recoverable version (1) will knowingly discard the remaining shares of version 2, abandoning any hope of recovering the current contents of the file.
Title changed from "TypeError when loading a directory" to "TypeError when repairing an (unrecoverable?) directory".
I looked at zooko's Incidents, and I think they're showing an entirely
different bug. The 104230-vyc6byy incident shows a mutable file being read
and written several times (there's no MODE_CHECK in there, which suggests
that it's not a check-and-repair operation, just a regular read-modify-write
call).
The relevant parts of the events leading up to the Incident are:
This suggests that we got two answers from the 6fyx server. It feels like
we sent two identical requests to it. The first one succeeded normally, the
second one failed. I suspect the logging code (which provides the "I thought
they had.." comment) is not accurately remembering the test vector that was
sent with the original message, instead it is using a local variable that has
already been updated by the time the log event is emitted. So I suspect that
the second answer is in response to a query which said "I think you should
have version 1831, please update to v1832", and of course since the first
message was processed by then, the server would already be at v1832.
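The logging suspicion above can be illustrated with a small sketch: capture the test vector at the moment it is sent, rather than reading a mutable attribute at logging time. The names here (send(), the log callable) are assumptions for illustration, not the real publish code; send() is assumed to return a Twisted Deferred.

```python
# Illustrative only: avoid the "I thought they had..." confusion by logging
# the value that was actually sent, not an attribute that may already have
# been advanced (e.g. to the next seqnum) before the answer arrives.

def send_update(server, testv, writev, log):
    sent_testv = testv                     # capture what we actually sent
    d = server.send(testv, writev)         # assumed API returning a Deferred
    def _log_answer(answer):
        log("sent test vector %r, got answer %r" % (sent_testv, answer))
        return answer
    d.addBoth(_log_answer)
    return d
```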
The other incident (104334-hjflzua) shows the same behavior a minute later, this time trying to update to v1833.
My current guess is that we're somehow getting two copies of the same server
in our peerlist, and sending updates to both.
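If that guess is right, de-duplicating the peer list before sending would avoid the duplicate request. A rough sketch only, assuming the list holds (peerid, remote_reference) pairs; the real publish code is of course more involved.

```python
# Illustrative only: drop duplicate entries for the same server so it does
# not receive two identical test-and-set requests (the second of which
# fails, since the first already advanced the server to the new version).

def dedupe_peers(peers):
    seen = set()
    unique = []
    for peerid, rref in peers:
        if peerid in seen:
            continue
        seen.add(peerid)
        unique.append((peerid, rref))
    return unique
```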
The "error during repair" that this ticket might be about isn't reflected in
these two incidents. It's likely that it wouldn't show up as an incident at
all, just the traceback. I'll investigate further.
Investigating with Francois (on the 25c3 grid that he set up) is showing that a shallow check of a healthy directory is nevertheless reporting zero shares present when --add-lease is included. I now suspect problems in the code I added to source:src/allmydata/mutable/servermap.py#L548 (in _do_read) to tolerate servers who don't understand the add-lease message, since when you do add-lease on a mutable file, it sends the do-you-have-share message pipelined with the add-lease message. Older servers don't understand add-lease, so I wanted to ignore such an error, but to propagate any other errors that occurred (like local code bugs). I think that something is wrong in this code, and errors are being thrown (therefore disrupting the normal DYHB responses) when they shouldn't be.
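The intended behavior is roughly an errback attached only to the pipelined add-lease call: swallow "old server doesn't know add-lease" failures, propagate everything else, and never disturb the DYHB response. A minimal sketch under those assumptions; NoSuchMethodError and the function name are illustrative, not the actual servermap.py code.

```python
from twisted.python import log

class NoSuchMethodError(Exception):
    """Stand-in for the remote failure an old (pre-add-lease) server returns."""

def add_lease_errback(failure, peerid):
    # Attached only to the add-lease deferred; the pipelined DYHB query has
    # its own callback chain and must not be disrupted by this handler.
    if failure.check(NoSuchMethodError):
        # Old server that predates add-lease: ignore it.
        log.msg("peer %s does not understand add-lease; ignoring" % peerid)
        return None
    # Anything else (e.g. a local code bug) should still propagate.
    return failure
```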
Replying to warner:
This is now #874.
Ok, so the current related problems are:
- MustForceRepairError when repairing 0 < numshares < k mutable files
- TypeError (instead of simple failure) when repairing numshares = 0 mutable files
Francois' node experienced #875 (because all the shares were on a tahoe-1.2.0 server) followed by #786. Zooko's exception was probably either a deep-repair when the file's servers were entirely offline, or the same #875 problem, followed by #786. Zooko's incident reports capture the "?" issue.
changeset:ba0690c9d7a3bc28 should fix the TypeError problem: unrepairable files are now reported with success=False instead of a weird exception (it also fixes #874). The remaining question in my mind is where the multiple-answers-from-the-same-server incidents came from.
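The shape of that fix is roughly to check for a recoverable version before starting the download step. This is a sketch of the idea only; RepairResults and the method names are assumptions, not a copy of changeset ba0690c9d7a3bc28.

```python
# Sketch only: report an unrepairable mutable file as a failed repair
# instead of calling download_version(None) and dying with a TypeError.

class RepairResults:
    def __init__(self, successful, message):
        self.successful = successful
        self.message = message

def repair(filenode, servermap):
    best = servermap.best_recoverable_version()
    if best is None:
        return RepairResults(False, "no recoverable versions; cannot repair")
    contents = filenode.download_version(servermap, best)
    # ... re-publish `contents` so that fresh shares get placed ...
    return RepairResults(True, "repair attempted")
```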
I've opened #877 to handle the multiple-answers bug. It turns out that a DeadReferenceError can cause the publish loop to reenter, causing multiple requests to be sent and then an incorrect UCWE to be declared. That means this ticket can be closed, since the only remaining issue now has its own ticket.
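For reference, the re-entrancy hazard behind #877 looks roughly like this; a schematic sketch with made-up names, not the real publish.py.

```python
# Schematic only: if an errback (e.g. triggered by a DeadReferenceError)
# calls loop() while a pass is still in progress, the same outstanding
# queries can be sent twice, and the extra answers later look like an
# uncoordinated write error (UCWE).  A simple guard defers the extra pass.

class PublishLoop:
    def __init__(self, servers):
        self.servers = servers
        self._running = False
        self._rerun = False

    def loop(self):
        if self._running:
            self._rerun = True     # don't re-enter; schedule another pass
            return
        self._running = True
        try:
            for server in self.servers:
                self.send_query(server)
        finally:
            self._running = False
        if self._rerun:
            self._rerun = False
            self.loop()

    def send_query(self, server):
        # Placeholder: the real code sends a test-and-set write and attaches
        # callbacks/errbacks that may call loop() again.
        pass
```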