uncaught exception in mutable-retrieve: UCW between mapupdate and retrieve #474
Reference: tahoe-lafs/trac-2024-07-25#474
If a mutable file is modified (by some external uncoordinated writer) after
we've done a servermap update but before we do a retrieve, then the exception
raised by mutable.Retrieve._maybe_send_more_queries
(source:src/allmydata/mutable/retrieve.py#L415) is not caught by the Deferred
chain. This means that the caller (who is probably in a filenode.modify
sequence) will be waiting forever for the answer.
The exception that shows up in the twisted log looks like this:
The Deferred chaining needs to be investigated to make sure that this
exception is properly returned to the caller via the errback on their
Deferred.
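The failure mode can be sketched without Twisted. The following is only a toy model under stated assumptions: `concurrent.futures.Future` stands in for the caller's Deferred, `run_event` stands in for the reactor (which logs an exception raised by a callback instead of delivering it), and `send_more_queries`, `start_retrieve`, and the local `UncoordinatedWriteError` class are hypothetical stand-ins, not the real code in retrieve.py.

```python
from concurrent.futures import Future


class UncoordinatedWriteError(Exception):
    """Hypothetical stand-in for the real exception class."""


def send_more_queries(shares_changed):
    # Stand-in for the step that notices the shares no longer match
    # the servermap and raises.
    if shares_changed:
        raise UncoordinatedWriteError("shares changed after mapupdate")
    return b"contents"


def start_retrieve(shares_changed, guarded):
    # 'done' plays the role of the Deferred handed back to the caller.
    done = Future()

    def on_response():
        # Unguarded: an exception raised here escapes to the event loop.
        done.set_result(send_more_queries(shares_changed))

    def guarded_on_response():
        # Guarded: any exception is routed to the caller's Future,
        # which is the equivalent of firing the errback.
        try:
            on_response()
        except Exception as exc:
            done.set_exception(exc)

    return done, (guarded_on_response if guarded else on_response)


def run_event(callback, log):
    # Stand-in for the reactor: an exception from a callback is
    # recorded in the log but never delivered to anyone.
    try:
        callback()
    except Exception as exc:
        log.append(repr(exc))
```

With `guarded=False` and changed shares, the log gains an entry but `done` is never resolved, which matches the "waiting forever" symptom. With `guarded=True`, `done` carries the exception and the caller can react to it.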
In addition, the code in filenode.modify needs to be examined to make sure
that this kind of uncoordinated write error is caught and retried. My concern
is that Retrieve is returning a NotEnoughSharesError that wraps an
UncoordinatedWriteError, rather than the UCWE directly, and that the f.trap
in modify() might not know to look for that.
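One possible shape for a retry loop that looks through a wrapper is sketched below. This is an illustration only: the two exception classes are local stand-ins, the wrapping is assumed to be expressed through Python's `__cause__` chain (the real code may carry the inner error differently), and `find_cause` and `modify_with_retry` are hypothetical helpers, not functions from filenode.py.

```python
class UncoordinatedWriteError(Exception):
    """Hypothetical stand-in for the real exception class."""


class NotEnoughSharesError(Exception):
    """Hypothetical stand-in for the real exception class."""


def find_cause(exc, wanted):
    # Walk the __cause__ chain and return the first exception of the
    # wanted type, or None if there is none.
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, wanted):
            return exc
        seen.add(id(exc))
        exc = exc.__cause__
    return None


def modify_with_retry(attempt, max_tries=3):
    # Retry when the failure is a UCWE, whether raised directly or
    # wrapped in another error; re-raise anything else immediately.
    last = None
    for _ in range(max_tries):
        try:
            return attempt()
        except Exception as exc:
            if find_cause(exc, UncoordinatedWriteError) is None:
                raise
            last = exc
    raise last
```

A trap that only checks the outermost exception type would treat the wrapped case as a non-retryable failure, which is the concern raised above.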
In addition, I think we need test coverage for a UCWE that occurs between mapupdate and retrieve. I believe that we already have coverage for one that occurs between retrieve and publish, but not for one that occurs before the retrieve. Note that this will require using a file that is large enough to not fit in the filenode's read-cache (20kB should be enough).
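Why the file size matters can be shown with a toy model. Everything here is an assumption made for illustration: `FakeMutableFile`, the `CACHE_LIMIT` value, and the idea that the cache is filled during mapupdate are hypothetical and do not describe the real filenode. The only point is that a read served from a cache cannot notice a write that happened after the mapupdate.

```python
CACHE_LIMIT = 2000  # hypothetical threshold, in bytes


class UncoordinatedWriteError(Exception):
    """Hypothetical stand-in for the real exception class."""


class FakeMutableFile:
    def __init__(self, contents):
        self.shares = contents
        self.version = 1
        self.cache = None

    def mapupdate(self):
        # Assumed behavior: small files are cached as a side effect.
        if len(self.shares) <= CACHE_LIMIT:
            self.cache = self.shares
        return self.version

    def external_write(self, new_contents):
        # An uncoordinated writer replaces the shares behind our back.
        self.shares = new_contents
        self.version += 1

    def retrieve(self, expected_version):
        # A cached read never touches the servers, so it cannot
        # observe the version change.
        if self.cache is not None:
            return self.cache
        if self.version != expected_version:
            raise UncoordinatedWriteError("shares changed after mapupdate")
        return self.shares
```

In this model a 100-byte file returns stale data without any error, while a 20000-byte file raises the UCWE, so only the larger file exercises the code path under test.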
Unit tests to write:
Hrm, a quick check of the code doesn't reveal any paths by which _check_for_done is called outside of a Deferred chain. I added a few unit tests (changeset:dba0ad3d2841b7a0) that corrupt all shares between the mapupdate and the retrieve, and they pass. I also increased the size of the test file to 20kB so that the post-mapupdate corruption actually gets noticed.
At least one of these errors occurred while the webapi server in question was throwing out-of-memory errors, so I think I'm going to wait until it happens again before investigating it deeply.
For reference, the times (UTC) at which these errors occurred were:
with a MemoryError occurring at 2008-06-22 00:25:46.020Z. We have no events logged from this node (2b6w) between 17:02 and 17:36. Because of the foolscap log-buffer overflow problem, we have no foolscap log events past about 15:25.
It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)