reentrant confusion in mutable publish causes incorrect UCWE #877
The incident reports in #786 reveal a problem in the mutable-file publish
control loop which is likely to trigger an inappropriate Uncoordinated Write
Error when a server goes away between mapupdate and publish. Here's the
setup:
1. After the Retrieve phase has fetched the existing data, Publish prepares to upload a new version. Let's say each share N goes to server N (sh1 to server 1, etc.), and that there's an extra copy of share 3 on server 11. And let's say that server 3 disconnected while the Retrieve was running (so it's still in the sharemap but the RemoteReference is now dead).
2. `loop()` is entered the first time and calls `update_goal()` to figure out where all the shares want to live. Since mapupdate found all shares, no new shares need to be placed: each known share will be updated, including the doubled share 3.
3. `loop()` then calls `_send_shares()` to deliver messages to the 11 servers that we're using.
4. `_send_shares()` builds a dictionary named `all_tw_vectors` that contains all of the messages it will send, batched by serverid (so it can send exactly one message to each server, even if there are multiple shares per server). Then it loops through this dictionary (let's pretend in sorted order, even though `.items()` is random), and for each peerid it updates `self.outstanding` to remember that the message is in flight (so our next pass through `loop()` won't send duplicate messages), and then sends the message. The relevant code looks like the sketch after this list.
5. When the message for server 3 is sent, the `_do_testreadwrite` call will see the disconnected server and errback with `DeadReferenceError`. This immediately fires `_got_write_error`, which log.UNUSUALs the event, removes the share from `self.outstanding`, and adds the peer to `self.bad_peers` so it becomes ineligible for other shares. Then the Deferred chain moves on to the next call and invokes `self.loop`. This all happens while the for loop is on the stack: the rest of the server messages have not been processed yet.
6. `loop()` is entered the second time, reentrantly. It updates the goal, which removes the sh3-on-server-3 mapping but doesn't add any new shares, because sh3 is already happily living on server 11. It computes the "needed" list of shares to send, which does not include 1 or 2 or 3 (because those messages are in `self.outstanding`), but does include all of the others (because they haven't been added to `self.outstanding` yet, even though they will be as soon as the suspended call to `_send_shares` gets a chance).
7. `loop()` calls `_send_shares()`, which dutifully sends requests to the needed list: 4, 5, etc., up to 10. The inner call to `_send_shares` exits, as does the inner call to `loop()`.
8. The original call to `_send_shares()` regains control, in the middle of its for loop. It moves on to server 4, adds the messages to the `self.outstanding` set (which is a no-op, because they were already there from the inner call), sends the (duplicate) message, and continues onwards.
9. Now there are two copies of each request in flight to the remaining servers.
10. For a given server, the first of the two responses arrives and is processed normally: the test vector matches, so the write was performed. That message is removed from `self.outstanding` with "discard". The servermap is updated with the new share we know to be in place on this server.
11. The second (duplicate) response arrives. It too is removed from `self.outstanding`, but it was already gone, and "discard" ignores this. This message has failed, because the test vector it provided did not match, because the first message had been processed before the second message arrived.
12. The incident log then shows things like: `somebody modified the share on us: shnum=8: I thought they had #1832:R=lzzf, but testv reported #1832:R=lzzf`. The first version in this message comes from the servermap, and indicates that it had been updated upon receipt of the first response. The second version in this message comes from the server, and indicates that the first response was successful.
13. Publish concludes that someone else has modified the share since our mapupdate was done, and declares an `UncoordinatedWriteError`.
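The dispatch loop in `_send_shares()` referenced in step 4 looks roughly like the following. This is a sketch reconstructed from the description above, not the verbatim `publish.py` source; details such as the exact callback arguments are assumptions.

```python
# Rough shape of the _send_shares() dispatch loop described in step 4
# (a reconstruction, not the verbatim Tahoe-LAFS code).
for peerid, tw_vectors in all_tw_vectors.items():
    shnums = list(tw_vectors.keys())
    for shnum in shnums:
        # Mark each message as in-flight so the next pass through loop()
        # will not send a duplicate.
        self.outstanding.add((peerid, shnum))

    # Send the batched test-and-write request to this server.
    d = self._do_testreadwrite(peerid, secrets, tw_vectors, read_vector)
    d.addCallbacks(self._got_write_answer, self._got_write_error,
                   callbackArgs=(peerid,), errbackArgs=(peerid, shnums))
    # If the request fails immediately (e.g. the RemoteReference is already
    # dead, so an already-failed Deferred comes back), this callback
    # re-enters loop() while the for loop above is still on the stack --
    # the reentrancy described in step 5.
    d.addCallback(self.loop)
    d.addErrback(self._fatal_error)
```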
The actual numbers of shares in Zooko's #786 104230-vyc6byy incident were
different from those in this simplified example. There were actually 20 shares
scattered among 12 servers due to lots of servers being offline during
previous updates. The server which was lost only held one share (sh0), and
that share was also stored on a different server. The order in which the for
loop was traversed is unknown. There were two server messages still
outstanding when the duplicate response was received, ending the Publish
process. There may have been lots of duplicate messages sent, for which the
responses didn't arrive until after the Incident log was closed.
The upshot is that the reentrant call to `loop()` can cause trouble. The best fix is probably to change the code in `_send_shares` to fire `loop()` via an eventual-send, specifically with foolscap's `fireEventually` Deferred-creating function. Since this is the only place in Publish that invokes `loop()` (other than the initial priming call), this should be sufficient to prevent reentrancy.
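A minimal sketch of that change, assuming the Deferred wiring from the earlier sketch (the real fix may differ in detail):

```python
from foolscap.eventual import fireEventually

# In _send_shares(), instead of letting the response chain call self.loop
# synchronously (possibly while the dispatch for loop is still on the
# stack), bounce through the reactor first so loop() always starts on a
# fresh stack.
d.addCallbacks(self._got_write_answer, self._got_write_error,
               callbackArgs=(peerid,), errbackArgs=(peerid, shnums))
d.addCallback(lambda res: fireEventually(res))
d.addCallback(self.loop)
d.addErrback(self._fatal_error)
```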
Fixed, in changeset:419a076f59b93e66. Unfortunately I couldn't think of an easy way to test it, so there's not yet a unit test. I'll think about this further. It will probably require a `NoNetworkGrid` and a monkeypatched server reference that will pretend to be dead at just the right time. It might also require changing the for loop to be deterministic (instead of walking the serverlist in random dictionary-key order).
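A hypothetical sketch of the kind of monkeypatched server reference such a test would need; the wrapper class and the point at which it is installed are illustrative assumptions, not existing test helpers:

```python
from twisted.internet import defer
from foolscap.api import DeadReferenceError


class DeadAtPublishWrapper:
    """Wrap a storage server reference and pretend it died after mapupdate.

    Hypothetical test helper: once kill() is called, any mutable-slot write
    (slot_testv_and_readv_and_writev) errbacks with DeadReferenceError,
    simulating a connection lost between mapupdate and publish."""

    def __init__(self, original_rref):
        self.original = original_rref
        self.dead = False

    def kill(self):
        self.dead = True

    def callRemote(self, methname, *args, **kwargs):
        if self.dead and methname == "slot_testv_and_readv_and_writev":
            # Simulate the dead RemoteReference: return an already-failed
            # Deferred, which is what triggers the synchronous errback path.
            return defer.fail(DeadReferenceError("pretend disconnect"))
        return self.original.callRemote(methname, *args, **kwargs)
```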
Nejucomo experienced a bug with similar symptoms -- UncoordinatedWriteError and incident report files that mentioned that the uploader was surprised. However, his problem was reproducible: five consecutive runs of `tahoe cp -r $LOCAL $TAHOE` (on volunteergrid) produced the same error and two or three incident report files for each run. So it was not caused by a server disconnecting in the middle of a mutable write. Here are the incident report files. Note this may turn out to be a different bug with similar symptoms, but it is definitely related, so I'm taking the liberty of reopening this ticket at least until we can determine that this one is a "different bug". Nejucomo was using Tahoe-LAFS v1.5.0 as packaged by Ubuntu in Ubuntu 9.10 Karmic.
By the way, Brian, was the bug that you already fixed in this ticket introduced since Tahoe-LAFS v1.5?
Attachment incident-2009-12-30-004046-sdjqheq.flog.bz2 (31250 bytes) added
nejucomo's mutable write error, part 1
Attachment incident-2009-12-30-004051-2jaqkfq.flog.bz2 (30141 bytes) added
nejucomo's mutable write error, part 2
Attachment incident-2009-12-30-004101-4xcfo3a.flog.bz2 (30211 bytes) added
nejucomo's mutable write error, part 3
Attachment incident-2009-12-30-004120-n3ajkpa.flog.bz2 (30734 bytes) added
nejucomo's mutable write error, part 4
Attachment incident-2009-12-30-005544-3ytbnua.flog.bz2 (33865 bytes) added
nejucomo's mutable write error, part 5
Attachment incident-2009-12-30-005558-k2wfwga.flog.bz2 (36451 bytes) added
nejucomo's mutable write error, part 6
Attachment incident-2009-12-30-005558-k2wfwga.flog.2.bz2 (36451 bytes) added
nejucomo's mutable write error, part 6
Attachment incident-2009-12-30-005604-anowdpi.flog.bz2 (33736 bytes) added
nejucomo's mutable write error, part 7
Attachment incident-2009-12-30-005613-salpsqq.flog.bz2 (35808 bytes) added
nejucomo's mutable write error, part 8
Attachment incident-2009-12-30-005636-cqt6dna.flog.bz2 (36933 bytes) added
nejucomo's mutable write error, part 9
Tomorrow (Tahoe-LAFS Hacking Weekend!) I'll try reproducing Nathan's problem, and if I can I'll try different versions of Tahoe-LAFS, especially the current trunk which has your fix from earlier in this ticket.
I've opened #893 with the specific problem found in Nathan's incident reports. It's not the reentrant problem that #877 is about, but actually a different issue with mutable file share placement and error-handling, so I'm re-closing this ticket in favor of the new #893.