reentrant confusion in mutable publish causes incorrect UCWE #877

Closed
opened 2009-12-30 03:58:24 +00:00 by warner · 14 comments

The incident reports in #786 reveal a problem in the mutable-file publish
control loop which is likely to trigger an inappropriate Uncoordinated Write
Error when a server goes away between mapupdate and publish. Here's the
setup:

  • the user modifies an existing healthy mutable file, 3-of-10 encoding
  • Mapupdate identifies a bunch of shares on their servers, Retrieve gets the
    data, Publish prepares to upload a new version. Let's say each share N
    goes to server N (sh1 to server 1, etc), and that there's an extra copy of
    share 3 on server 11. And let's say that server 3 disconnected while the
    Retrieve was running (so it's still in the sharemap but the
    RemoteReference is now dead).
  • loop() is entered the first time
  • it calls update_goal() to figure out where all the shares want to
    live. Since mapupdate found all shares, no new shares need to be placed:
    each known share will be updated, including the doubled share 3.
  • it then calls _send_shares() to deliver messages to the 11 servers
    that we're using
  • _send_shares() builds a dictionary named all_tw_vectors that
    contains all of the messages it will send, batched by serverid (so it can
    send exactly one message to each server, even if there are multiple
    shares per server). Then it loops through this dictionary (let's pretend
    in sorted order even though .items() is random), and for each
    peerid, it updates self.outstanding to remember that the message is
    in flight (so our next pass through loop() won't send duplicate
    messages), and then sends the message. The relevant code looks like this:
        for (peerid, tw_vectors) in all_tw_vectors.items():
            for shnum in shnums:
                self.outstanding.add( (peerid, shnum) )
            d = self._do_testreadwrite(peerid, secrets, tw_vectors, read_vector)
            d.addCallbacks(self._got_write_answer, self._got_write_error)
            d.addCallback(self.loop)
  • when we've processed peerid 1 and 2, and we're working on 3, the
    _do_testreadwrite call will see the disconnected server and errback
    with DeadReferenceError. This immediately fires
    _got_write_error, which log.UNUSUALs the event, removes the share
    from self.outstanding, and adds the peer to self.bad_peers so
    it becomes ineligible for other shares. Then the Deferred chain moves
    onto the next call, and invokes self.loop. This all happens while
    the for loop is on the stack: the rest of the server messages have not
    been processed yet.
  • now loop() is entered the second time, reentrantly. It updates the
    goal, which removes the sh3-on-X mapping but doesn't add any new shares
    because sh3 is already happily living on server3. It computes the
    "needed" list of shares to send, which does not include 1 or 2 or 3
    (because those messages are in self.outstanding, but does include
    all of the others (because they haven't been added to
    self.outstanding yet, even though they will be as soon as the
    suspended call to _send_shares gets a chance).
  • Then it (re)enters _send_shares(), which dutifully sends requests
    to the needed list: 4, 5, etc, up to 10. The inner call to
    _send_shares exits, as does the inner call to loop()
  • now the outer call to _send_shares() regains control, in the middle
    of its for loop. It moves to server 4, adds the messages to the
    self.outstanding set (which is ignored, because it was already
    there from the inner call), sends the (duplicate) message, and continues
    onwards.
  • The result is that we get duplicate identical messages for the later
    servers.
  • The first message to server 4 comes back with a successful result: the
    test vector matches, so the write was performed. That message is removed
    from self.outstanding with "discard". The servermap is updated with
    the new share we know to be in place on this server.
  • Later, the second message to server 4 comes back. It is removed from
    self.outstanding, but it was already gone, and "discard" ignores
    this. This message has failed, because the test vector it provided did not
    match, because the first message had been processed before the second
    message arrived.
  • this log.WEIRDs an event (and triggers an Incident). The log messages say
    things like: somebody modified the share on us: shnum=8: I thought they
    had #1832:R=lzzf, but testv reported #1832:R=lzzf . The first version in
    this message comes from the servermap, and indicates that it had been
    updated upon receipt of the first response. The second version in this
    message come from the server, and indicates that the first response was
    successful.
  • it also marks the upload as being "surprised", since it appears that
    someone else has modified the share since our mapupdate was done
  • the surprise causes this Publish to fail with
    UncoordinatedWriteError

The actual numbers of shares in Zooko's #786 104230-vyc6byy incident were
different than this simplified example. There were actually 20 shares
scattered among 12 servers due to lots of servers being offline during
previous updates. The server which was lost only held one share (sh0), and
that share was also stored on a different server. The order in which the for
loop was traversed is unknown. There were two server messages still
outstanding when the duplicate response was received, ending the Publish
process. There may have been lots of duplicate messages sent, for which the
responses didn't arrive until after the Incident log was closed.

The upshot is that the reentrant call to loop() can cause trouble. The
best fix is probably to change the code in _send_shares to fire
loop() via an eventual-send, specifically with foolscap's
fireEventually Deferred-creating function. Since this is the only place
in Publish that invokes loop() (other than the initial priming call),
this should be sufficient to prevent reentrancy.

The incident reports in #786 reveal a problem in the mutable-file publish control loop which is likely to trigger an inappropriate Uncoordinated Write Error when a server goes away between mapupdate and publish. Here's the setup: * the user modifies an existing healthy mutable file, 3-of-10 encoding * Mapupdate identifies a bunch of shares on their servers, Retrieve gets the data, Publish prepares to upload a new version. Let's say each share N goes to server N (sh1 to server 1, etc), and that there's an extra copy of share 3 on server 11. And let's say that server 3 disconnected while the Retrieve was running (so it's still in the sharemap but the `RemoteReference` is now dead). * `loop()` is entered the first time * it calls `update_goal()` to figure out where all the shares want to live. Since mapupdate found all shares, no new shares need to be placed: each known share will be updated, including the doubled share 3. * it then calls `_send_shares()` to deliver messages to the 11 servers that we're using * `_send_shares()` builds a dictionary named `all_tw_vectors` that contains all of the messages it will send, batched by serverid (so it can send exactly one message to each server, even if there are multiple shares per server). Then it loops through this dictionary (let's pretend in sorted order even though `.items()` is random), and for each peerid, it updates `self.outstanding` to remember that the message is in flight (so our next pass through `loop()` won't send duplicate messages), and then sends the message. The relevant code looks like this: ``` for (peerid, tw_vectors) in all_tw_vectors.items(): for shnum in shnums: self.outstanding.add( (peerid, shnum) ) d = self._do_testreadwrite(peerid, secrets, tw_vectors, read_vector) d.addCallbacks(self._got_write_answer, self._got_write_error) d.addCallback(self.loop) ``` * when we've processed peerid 1 and 2, and we're working on 3, the `_do_testreadwrite` call will see the disconnected server and errback with `DeadReferenceError`. This immediately fires `_got_write_error`, which log.UNUSUALs the event, removes the share from `self.outstanding`, and adds the peer to `self.bad_peers` so it becomes ineligible for other shares. Then the Deferred chain moves onto the next call, and invokes `self.loop`. This all happens while the for loop is on the stack: the rest of the server messages have not been processed yet. * now `loop()` is entered the second time, reentrantly. It updates the goal, which removes the sh3-on-X mapping but doesn't add any new shares because sh3 is already happily living on server3. It computes the "needed" list of shares to send, which does not include 1 or 2 or 3 (because those messages are in `self.outstanding`, but does include all of the others (because they haven't been added to `self.outstanding` yet, even though they will be as soon as the suspended call to `_send_shares` gets a chance). * Then it (re)enters `_send_shares()`, which dutifully sends requests to the needed list: 4, 5, etc, up to 10. The inner call to `_send_shares` exits, as does the inner call to `loop()` * now the outer call to `_send_shares()` regains control, in the middle of its for loop. It moves to server 4, adds the messages to the `self.outstanding` set (which is ignored, because it was already there from the inner call), sends the (duplicate) message, and continues onwards. * The result is that we get duplicate identical messages for the later servers. * The first message to server 4 comes back with a successful result: the test vector matches, so the write was performed. That message is removed from `self.outstanding` with "discard". The servermap is updated with the new share we know to be in place on this server. * Later, the second message to server 4 comes back. It is removed from `self.outstanding`, but it was already gone, and "discard" ignores this. This message has failed, because the test vector it provided did not match, because the first message had been processed before the second message arrived. * this log.WEIRDs an event (and triggers an Incident). The log messages say things like: somebody modified the share on us: shnum=8: I thought they had #1832:R=lzzf, but testv reported #1832:R=lzzf . The first version in this message comes from the servermap, and indicates that it had been updated upon receipt of the first response. The second version in this message come from the server, and indicates that the first response was successful. * it also marks the upload as being "surprised", since it appears that someone else has modified the share since our mapupdate was done * the surprise causes this Publish to fail with `UncoordinatedWriteError` The actual numbers of shares in Zooko's #786 104230-vyc6byy incident were different than this simplified example. There were actually 20 shares scattered among 12 servers due to lots of servers being offline during previous updates. The server which was lost only held one share (sh0), and that share was also stored on a different server. The order in which the for loop was traversed is unknown. There were two server messages still outstanding when the duplicate response was received, ending the Publish process. There may have been lots of duplicate messages sent, for which the responses didn't arrive until after the Incident log was closed. The upshot is that the reentrant call to `loop()` can cause trouble. The best fix is probably to change the code in `_send_shares` to fire `loop()` via an eventual-send, specifically with foolscap's `fireEventually` Deferred-creating function. Since this is the only place in Publish that invokes `loop()` (other than the initial priming call), this should be sufficient to prevent reentrancy.
warner added the
code-mutable
major
defect
1.5.0
labels 2009-12-30 03:58:24 +00:00
warner added this to the undecided milestone 2009-12-30 03:58:24 +00:00
Author

Fixed, in changeset:419a076f59b93e66. Unfortunately I couldn't think of an easy way to test it, so there's not yet a unit test. I'll think about this further.. it will probably require a NoNetworkGrid and a monkeypatched server reference that will pretend to be dead at just the right time. It might also require changing the for loop to be deterministic (instead of walking the serverlist in random dictionary-key order).

Fixed, in changeset:419a076f59b93e66. Unfortunately I couldn't think of an easy way to test it, so there's not yet a unit test. I'll think about this further.. it will probably require a `NoNetworkGrid` and a monkeypatched server reference that will pretend to be dead at just the right time. It might also require changing the for loop to be deterministic (instead of walking the serverlist in random dictionary-key order).
warner added the
fixed
label 2010-01-02 22:11:26 +00:00
warner modified the milestone from undecided to 1.6.0 2010-01-02 22:11:26 +00:00

Nejucomo experienced a bug with similar symptoms -- UncoordinatedWriteError and incident report files that mentioned that the uploader was surprised. However, his problem was reproducible: five consecutive runs of tahoe cp -r $LOCAL $TAHOE (on volunteergrid) produced the same error and two or three incident report files for each run. So it was not caused by a server disconnecting in the middle of of a mutable write. Here are the incident report files. Note this may turn out to be a different bug with similar symptoms, but it is definitely related, so I'm taking the liberty of reopening this ticket at least until we can determine that this one is a "different bug".

Nejucomo was using Tahoe-LAFS v1.5.0 as packaged by Ubuntu in Ubuntu 9.10 Karmic.

By the way, Brian, was the bug that you already fixed in this ticket introduced since Tahoe-LAFS v1.5?

Nejucomo experienced a bug with similar symptoms -- UncoordinatedWriteError and incident report files that mentioned that the uploader was surprised. However, his problem was reproducible: five consecutive runs of `tahoe cp -r $LOCAL $TAHOE` (on volunteergrid) produced the same error and two or three incident report files for each run. So it was *not* caused by a server disconnecting in the middle of of a mutable write. Here are the incident report files. Note this may turn out to be a different bug with similar symptoms, but it is definitely related, so I'm taking the liberty of reopening this ticket at least until we can determine that this one is a "different bug". Nejucomo was using Tahoe-LAFS v1.5.0 as packaged by Ubuntu in Ubuntu 9.10 Karmic. By the way, Brian, was the bug that you already fixed in this ticket introduced since Tahoe-LAFS v1.5?
zooko added
critical
and removed
major
fixed
labels 2010-01-09 04:44:01 +00:00
zooko reopened this issue 2010-01-09 04:44:01 +00:00

Attachment incident-2009-12-30-004046-sdjqheq.flog.bz2 (31250 bytes) added

nejucomo's mutable write error, part 1

**Attachment** incident-2009-12-30-004046-sdjqheq.flog.bz2 (31250 bytes) added nejucomo's mutable write error, part 1

Attachment incident-2009-12-30-004051-2jaqkfq.flog.bz2 (30141 bytes) added

nejucomo's mutable write error, part 2

**Attachment** incident-2009-12-30-004051-2jaqkfq.flog.bz2 (30141 bytes) added nejucomo's mutable write error, part 2

Attachment incident-2009-12-30-004101-4xcfo3a.flog.bz2 (30211 bytes) added

nejucomo's mutable write error, part 3

**Attachment** incident-2009-12-30-004101-4xcfo3a.flog.bz2 (30211 bytes) added nejucomo's mutable write error, part 3

Attachment incident-2009-12-30-004120-n3ajkpa.flog.bz2 (30734 bytes) added

nejucomo's mutable write error, part 4

**Attachment** incident-2009-12-30-004120-n3ajkpa.flog.bz2 (30734 bytes) added nejucomo's mutable write error, part 4

Attachment incident-2009-12-30-005544-3ytbnua.flog.bz2 (33865 bytes) added

nejucomo's mutable write error, part 5

**Attachment** incident-2009-12-30-005544-3ytbnua.flog.bz2 (33865 bytes) added nejucomo's mutable write error, part 5

Attachment incident-2009-12-30-005558-k2wfwga.flog.bz2 (36451 bytes) added

nejucomo's mutable write error, part 6

**Attachment** incident-2009-12-30-005558-k2wfwga.flog.bz2 (36451 bytes) added nejucomo's mutable write error, part 6

Attachment incident-2009-12-30-005558-k2wfwga.flog.2.bz2 (36451 bytes) added

nejucomo's mutable write error, part 6

**Attachment** incident-2009-12-30-005558-k2wfwga.flog.2.bz2 (36451 bytes) added nejucomo's mutable write error, part 6

Attachment incident-2009-12-30-005604-anowdpi.flog.bz2 (33736 bytes) added

nejucomo's mutable write error, part 7

**Attachment** incident-2009-12-30-005604-anowdpi.flog.bz2 (33736 bytes) added nejucomo's mutable write error, part 7

Attachment incident-2009-12-30-005613-salpsqq.flog.bz2 (35808 bytes) added

nejucomo's mutable write error, part 8

**Attachment** incident-2009-12-30-005613-salpsqq.flog.bz2 (35808 bytes) added nejucomo's mutable write error, part 8

Attachment incident-2009-12-30-005636-cqt6dna.flog.bz2 (36933 bytes) added

nejucomo's mutable write error, part 9

**Attachment** incident-2009-12-30-005636-cqt6dna.flog.bz2 (36933 bytes) added nejucomo's mutable write error, part 9

Tomorrow (Tahoe-LAFS Hacking Weekend!) I'll try reproducing Nathan's problem, and if I can I'll try different versions of Tahoe-LAFS, especially the current trunk which has your fix from earlier in this ticket.

Tomorrow (Tahoe-LAFS Hacking Weekend!) I'll try reproducing Nathan's problem, and if I can I'll try different versions of Tahoe-LAFS, especially the current trunk which has your fix from earlier in this ticket.
Author

I've opened #893 with the specific problem found in Nathan's incident reports. It's not the reentrant problem that #877 is about, but actually a different issue with mutable file share placement and error-handling, so I'm re-closing this ticket in favor of the new #893.

I've opened #893 with the specific problem found in Nathan's incident reports. It's not the reentrant problem that #877 is about, but actually a different issue with mutable file share placement and error-handling, so I'm re-closing this ticket in favor of the new #893.
warner added the
fixed
label 2010-01-10 07:39:22 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#877
No description provided.