'tahoe deep-check --repair --add-lease' triggers an exception in the client #889
After an upgrade to the latest trunk and a new backup using immutable dirnodes, the daily cronjob which verifies that my data is safely stored logged the following error in twistd.log.
checker.py near line 13 looks like this:
And no incident file was created.
The offending lines come from this patch.
Brian, what do you think? I'm not sure whether this patch should be rolled back.
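For readers following the thread, here is a minimal sketch of the pattern under discussion, assuming a util.log.err-style helper that mirrors errors into both twistd.log and foolscap's log, plus an add-lease errback that passes its Failure only as a keyword argument. The names and arguments are illustrative, reconstructed from the comments below, not the actual checker.py contents.

```python
from twisted.python import log as tw_log
from foolscap.logging import log as flog

def err(*args, **kwargs):
    # Sketch of a dual-logging helper: mirror the error into twistd.log
    # and then into foolscap's log.  If tw_log.err() raises (e.g. it was
    # given no Failure while outside an exception context), the foolscap
    # call below never runs -- the behaviour reported in this ticket.
    tw_log.err(*args, **kwargs)
    return flog.err(*args, **kwargs)

# Typical caller: a Deferred errback, so the exception has already been
# captured into the Failure `f` and there is no "current" exception for
# tw_log.err to pick up on its own.
def _add_lease_failed(f, peerid, storage_index):
    err(format="local error in add_lease to [%(peerid)s]: %(f_value)s",
        peerid=peerid, f_value=str(f.value), failure=f)
```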
Looks critical to me, because it looks like we can't add leases to immutable directories (guessing that this is the underlying WEIRD that didn't get logged). Hopefully François will report more after removing the tw_log.err() feature.

François: don't wait for Brian to look at the tw_log.err() thing -- investigating the underlying error is more urgent. Also, I hereby take responsibility if Brian later says that removing that line was totally wrong. :-)

Huh, I didn't know that twisted.python.log was so picky about tw_log.err being called outside of an exception context. Looks like this has been around since twisted-2.4.0, though.
So, we want the tw_log.err to be there, but we also want to make sure that an exception in it won't cause the subsequent foolscap log.err to get bypassed (and worse, any other code following the call to allmydata.util.log.err will also be bypassed, although usually log.err lives at the end of the callback chain so there is rarely much code sitting after it). log.msg and log.err must never raise exceptions.
It's probably still a good idea to move this functionality into Foolscap. But for now, I'd do two things; the one most relevant here is to pull the failure= argument out of its kwargs and pass it as the first positional argument to tw_log.err. The root problem here is that the _add_lease_failed function is an errback, so the relevant exception has already been captured and bundled up into a Failure instance, but tw_log.err isn't being told about it properly. Foolscap's log.err looks for a failure in the failure= argument, but tw_log.err wants it to be the first positional argument. If it doesn't find one there, it tries to create a new Failure, which explodes when you're in an errback and not in a synchronous exception handler.

I agree with Zooko that commenting out the tw_log.err and reproducing the problem is a good direction to go. I'd like to know what the local error was.
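A sketch of the direction described above, applied to the same hypothetical helper: hand the Failure to tw_log.err as the first positional argument, and guard the twisted call so an exception in it can never bypass the foolscap call. This is illustrative only, not the committed fix.

```python
from twisted.python import log as tw_log
from foolscap.logging import log as flog

def err(failure=None, **kwargs):
    # Pass the Failure positionally, so tw_log.err never has to synthesize
    # one from a (possibly missing) current exception context.
    try:
        tw_log.err(failure, **kwargs)
    except Exception:
        # log.msg/log.err must never raise: swallow any problem here so
        # the foolscap call below (and the caller's remaining code) runs.
        pass
    return flog.err(failure=failure, **kwargs)
```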
Ok, after commenting out tw_log.err(), the same exception occurs, but in foolscap this time. I'll try to add some code to dump more information, but it's going to be slow because it takes about 30 minutes before the problem occurs.

Here are the arguments that log.err() received:
The class 'foolscap.ipb.DeadReferenceError' suggests that the connection to a storage server turned out to be broken during the add-lease process, and that there is some part of the add-lease code which doesn't have an .addErrback() which knows how to deal with the connection failing.

Maybe try the fix to logging that Brian suggested and see if the correctly-logged diagnostics still suggest that this is the problem?
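In Deferred terms, the missing errback Zooko describes would look roughly like the sketch below. The remote-reference and lease-secret arguments are assumptions about the call site, not quoted code.

```python
from foolscap.ipb import DeadReferenceError

def add_lease_to(rref, storage_index, renew_secret, cancel_secret):
    # Issue the remote add_lease call against one storage server.
    d = rref.callRemote("add_lease", storage_index,
                        renew_secret, cancel_secret)

    def _ignore_dead_reference(f):
        # The connection went away mid-operation; treat it like any other
        # lost server instead of letting the Failure escape to the logger.
        f.trap(DeadReferenceError)
        return None

    d.addErrback(_ignore_dead_reference)
    return d
```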
Attachment debug-log.txt (19685 bytes) added
Debug log coming from 'flogtool tail' running during the error.
The peer in question, 'wpxpieqx', is running the following Tahoe version and is still accepting new shares, i.e. its disk is not full.
This is what gets logged on the storage node itself.
I also saw occurrences of this exception from long ago when the Tahoe client was running 1.5.0, so it isn't likely to be a new bug introduced in 1.6.0.
Summary changed from "'tahoe deep-check --repair --add-lease' triggers a logging error" to "'tahoe deep-check --repair --add-lease' triggers an exception in the client".

hrm, DeadReferenceError is supposed to be a remote failure, not a local one. It should have followed the other error path (the one which says "error in add_lease from ..." instead of "local error in add_lease to .."). That said, yeah, even the remote-failure branch isn't looking for DeadReferenceError. That should certainly be ignored.

Ok, so I think we have three issues here:
1. util.log.err needs to be fixed (the tw_log.err problem discussed above),
2. DeadReferenceError is supposed to pass f.check(RemoteException): fix it to do so,
3. change _add_lease_failed to ignore DeadReferenceError, in both the immutable and mutable cases.

I've opened (and closed as WONTFIX) Foolscap#145 for the second issue: I decided that DeadReferenceError should remain separate. And therefore I just pushed changeset:bacb6fe5aad1f416 to cover the last issue: ignore DeadReferenceError in add_lease calls. So the remaining issue is to get util.log.err fixed.
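The add_lease change described here ("ignore DeadReferenceError in add_lease calls") has roughly the following shape. This is a sketch: the log formats are the ones quoted above, while the import paths and surrounding details are assumptions, not the literal changeset.

```python
from foolscap.api import DeadReferenceError, RemoteException
from allmydata.util import log

def _add_lease_failed(f, peerid, storage_index):
    # A server dropping its connection mid-operation is an expected,
    # uninteresting event: ignore it entirely.
    if f.check(DeadReferenceError):
        return
    if f.check(RemoteException):
        # The server received the request and raised: a remote failure.
        log.err(format="error in add_lease from [%(peerid)s]: %(f_value)s",
                peerid=peerid, f_value=str(f.value), failure=f,
                level=log.WEIRD)
    else:
        # Anything else blew up on our side of the wire: a local failure.
        log.err(format="local error in add_lease to [%(peerid)s]: %(f_value)s",
                peerid=peerid, f_value=str(f.value), failure=f,
                level=log.WEIRD)
```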
changeset:d888bf33772327f2 includes the other fix, to call log.err properly (either with a Failure as the first positional argument, or in an exception-handler context).

I think that should fix this ticket, but I'm hoping François can confirm that the original problem goes away. It looks like the error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently.
Replying to warner:

Yes, the original problem (the exception triggered on the client) is fixed now, thanks for that. I still see foolscap.ipb.DeadReferenceError on the storage nodes a few times a day; does that qualify as frequently? Any hint on how to debug those foolscap disconnections would be very welcome.

Anyway, it's probably time to move the disconnection problem to another ticket: see #896.