reentrant confusion in mutable publish causes incorrect UCWE #877
The incident reports in #786 reveal a problem in the mutable-file publish
control loop which is likely to trigger an inappropriate Uncoordinated Write
Error when a server goes away between mapupdate and publish. Here's the
setup:
1. After the Retrieve phase has fetched the existing data, Publish prepares to upload a new version. Let's say each share N goes to server N (sh1 to server 1, etc.), and that there's an extra copy of share 3 on server 11. And let's say that server 3 disconnected while the Retrieve was running (so it's still in the sharemap but the RemoteReference is now dead).
2. `loop()` is entered the first time and calls `update_goal()` to figure out where all the shares want to live. Since mapupdate found all shares, no new shares need to be placed: each known share will be updated, including the doubled share 3.
3. `loop()` then calls `_send_shares()` to deliver messages to the 11 servers that we're using.
4. `_send_shares()` builds a dictionary named `all_tw_vectors` that contains all of the messages it will send, batched by serverid (so it can send exactly one message to each server, even if there are multiple shares per server). Then it loops through this dictionary (let's pretend in sorted order, even though `.items()` is random), and for each peerid it updates `self.outstanding` to remember that the message is in flight (so our next pass through `loop()` won't send duplicate messages), and then sends the message. The relevant code looks like the sketch after this list.
5. When the message for server 3 is sent, the `_do_testreadwrite` call will see the disconnected server and errback with `DeadReferenceError`. This immediately fires `_got_write_error`, which log.UNUSUALs the event, removes the share from `self.outstanding`, and adds the peer to `self.bad_peers` so it becomes ineligible for other shares. Then the Deferred chain moves on to the next call and invokes `self.loop`. This all happens while the for loop is on the stack: the rest of the server messages have not been processed yet.
6. `loop()` is entered the second time, reentrantly. It updates the goal, which removes the sh3-on-server-3 mapping but doesn't add any new shares, because sh3 is already happily living on server 11. It computes the "needed" list of shares to send, which does not include 1 or 2 or 3 (because those messages are in `self.outstanding`), but does include all of the others (because they haven't been added to `self.outstanding` yet, even though they will be as soon as the suspended call to `_send_shares` gets a chance).
7. `loop()` calls `_send_shares()`, which dutifully sends requests to the needed list: 4, 5, etc., up to 10. The inner call to `_send_shares` exits, as does the inner call to `loop()`.
8. The original call to `_send_shares()` regains control, in the middle of its for loop. It moves on to server 4, adds the messages to the `self.outstanding` set (which is a no-op, because they were already there from the inner call), sends the (duplicate) message, and continues onwards.
9. Now there are two copies of each request in flight to the remaining servers.
10. For a given server, the first of the two responses arrives and is processed normally: the test vector matches, so the write was performed. That message is removed from `self.outstanding` with "discard". The servermap is updated with the new share we know to be in place on this server.
11. The second (duplicate) response arrives. It too is removed from `self.outstanding`, but it was already gone, and "discard" ignores this. This message has failed, because the test vector it provided did not match, because the first message had been processed before the second message arrived.
12. The incident log then shows things like: `somebody modified the share on us: shnum=8: I thought they had #1832:R=lzzf, but testv reported #1832:R=lzzf`. The first version in this message comes from the servermap, and indicates that it had been updated upon receipt of the first response. The second version in this message comes from the server, and indicates that the first response was successful.
13. Publish concludes that someone else has modified the share since our mapupdate was done, and declares an `UncoordinatedWriteError`.
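The dispatch loop in `_send_shares()` referenced in step 4 looks roughly like the following. This is a sketch reconstructed from the description above, not the verbatim `publish.py` source; details such as the exact callback arguments are assumptions.

```python
# Rough shape of the _send_shares() dispatch loop described in step 4
# (a reconstruction, not the verbatim Tahoe-LAFS code).
for peerid, tw_vectors in all_tw_vectors.items():
    shnums = list(tw_vectors.keys())
    for shnum in shnums:
        # Mark each message as in-flight so the next pass through loop()
        # will not send a duplicate.
        self.outstanding.add((peerid, shnum))

    # Send the batched test-and-write request to this server.
    d = self._do_testreadwrite(peerid, secrets, tw_vectors, read_vector)
    d.addCallbacks(self._got_write_answer, self._got_write_error,
                   callbackArgs=(peerid,), errbackArgs=(peerid, shnums))
    # If the request fails immediately (e.g. the RemoteReference is already
    # dead, so an already-failed Deferred comes back), this callback
    # re-enters loop() while the for loop above is still on the stack --
    # the reentrancy described in step 5.
    d.addCallback(self.loop)
    d.addErrback(self._fatal_error)
```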
The actual numbers of shares in Zooko's #786 104230-vyc6byy incident were
different from those in this simplified example. There were actually 20 shares
scattered among 12 servers due to lots of servers being offline during
previous updates. The server which was lost only held one share (sh0), and
that share was also stored on a different server. The order in which the for
loop was traversed is unknown. There were two server messages still
outstanding when the duplicate response was received, ending the Publish
process. There may have been lots of duplicate messages sent, for which the
responses didn't arrive until after the Incident log was closed.
The upshot is that the reentrant call to `loop()` can cause trouble. The best fix is probably to change the code in `_send_shares` to fire `loop()` via an eventual-send, specifically with foolscap's `fireEventually` Deferred-creating function. Since this is the only place in Publish that invokes `loop()` (other than the initial priming call), this should be sufficient to prevent reentrancy.
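A minimal sketch of that change, assuming the Deferred wiring from the earlier sketch (the real fix may differ in detail):

```python
from foolscap.eventual import fireEventually

# In _send_shares(), instead of letting the response chain call self.loop
# synchronously (possibly while the dispatch for loop is still on the
# stack), bounce through the reactor first so loop() always starts on a
# fresh stack.
d.addCallbacks(self._got_write_answer, self._got_write_error,
               callbackArgs=(peerid,), errbackArgs=(peerid, shnums))
d.addCallback(lambda res: fireEventually(res))
d.addCallback(self.loop)
d.addErrback(self._fatal_error)
```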
Fixed, in changeset:419a076f59b93e66. Unfortunately I couldn't think of an easy way to test it, so there's not yet a unit test. I'll think about this further. It will probably require a `NoNetworkGrid` and a monkeypatched server reference that will pretend to be dead at just the right time. It might also require changing the for loop to be deterministic (instead of walking the serverlist in random dictionary-key order).
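A hypothetical sketch of the kind of monkeypatched server reference such a test would need; the wrapper class and the point at which it is installed are illustrative assumptions, not existing test helpers:

```python
from twisted.internet import defer
from foolscap.api import DeadReferenceError


class DeadAtPublishWrapper:
    """Wrap a storage server reference and pretend it died after mapupdate.

    Hypothetical test helper: once kill() is called, any mutable-slot write
    (slot_testv_and_readv_and_writev) errbacks with DeadReferenceError,
    simulating a connection lost between mapupdate and publish."""

    def __init__(self, original_rref):
        self.original = original_rref
        self.dead = False

    def kill(self):
        self.dead = True

    def callRemote(self, methname, *args, **kwargs):
        if self.dead and methname == "slot_testv_and_readv_and_writev":
            # Simulate the dead RemoteReference: return an already-failed
            # Deferred, which is what triggers the synchronous errback path.
            return defer.fail(DeadReferenceError("pretend disconnect"))
        return self.original.callRemote(methname, *args, **kwargs)
```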
Nejucomo experienced a bug with similar symptoms -- UncoordinatedWriteError and incident report files that mentioned that the uploader was surprised. However, his problem was reproducible: five consecutive runs of `tahoe cp -r $LOCAL $TAHOE` (on volunteergrid) produced the same error and two or three incident report files for each run. So it was not caused by a server disconnecting in the middle of a mutable write. Here are the incident report files. Note this may turn out to be a different bug with similar symptoms, but it is definitely related, so I'm taking the liberty of reopening this ticket at least until we can determine that this one is a "different bug". Nejucomo was using Tahoe-LAFS v1.5.0 as packaged by Ubuntu in Ubuntu 9.10 Karmic.
By the way, Brian, was the bug that you already fixed in this ticket introduced since Tahoe-LAFS v1.5?
Attachment incident-2009-12-30-004046-sdjqheq.flog.bz2 (31250 bytes) added
nejucomo's mutable write error, part 1
Attachment incident-2009-12-30-004051-2jaqkfq.flog.bz2 (30141 bytes) added
nejucomo's mutable write error, part 2
Attachment incident-2009-12-30-004101-4xcfo3a.flog.bz2 (30211 bytes) added
nejucomo's mutable write error, part 3
Attachment incident-2009-12-30-004120-n3ajkpa.flog.bz2 (30734 bytes) added
nejucomo's mutable write error, part 4
Attachment incident-2009-12-30-005544-3ytbnua.flog.bz2 (33865 bytes) added
nejucomo's mutable write error, part 5
Attachment incident-2009-12-30-005558-k2wfwga.flog.bz2 (36451 bytes) added
nejucomo's mutable write error, part 6
Attachment incident-2009-12-30-005558-k2wfwga.flog.2.bz2 (36451 bytes) added
nejucomo's mutable write error, part 6
Attachment incident-2009-12-30-005604-anowdpi.flog.bz2 (33736 bytes) added
nejucomo's mutable write error, part 7
Attachment incident-2009-12-30-005613-salpsqq.flog.bz2 (35808 bytes) added
nejucomo's mutable write error, part 8
Attachment incident-2009-12-30-005636-cqt6dna.flog.bz2 (36933 bytes) added
nejucomo's mutable write error, part 9
Tomorrow (Tahoe-LAFS Hacking Weekend!) I'll try reproducing Nathan's problem, and if I can I'll try different versions of Tahoe-LAFS, especially the current trunk which has your fix from earlier in this ticket.
I've opened #893 with the specific problem found in Nathan's incident reports. It's not the reentrant problem that #877 is about, but actually a different issue with mutable file share placement and error-handling, so I'm re-closing this ticket in favor of the new #893.