implement mutable-file recovery: update can't recover from <k new shares #272
If a mutable slot has one or two shares at a newer version than the rest, our
current SDMF update code is unable to recover: all attempts to write a newer
version will fail with an UncoordinatedWriteError. The necessary fix is
twofold: the first part is to implement mutable-slot recovery (replace all
shares with a version that's newer than the loner, by promoting the
older-but-more-popular-and-still-recoverable version to a later seqnum); the
second part is to jump directly to recovery when the post-query pre-write
sharemap shows that someone has a newer version than the one we want to
write (in Publish._got_all_query_results), and then either retry the
read-and-replace or defer to the application.
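Roughly, that second part would look something like this. This is only a
sketch under assumed names: `sharemap`, `recover`, and `push_shares` are
placeholders, not the real Publish machinery.

```python
# Hedged sketch only: `sharemap` maps shnum -> (seqnum, roothash) as seen
# during the pre-write query phase; `recover` and `push_shares` stand in for
# whatever the real Publish code provides.

class UncoordinatedWriteError(Exception):
    """Placeholder for Tahoe's real exception of the same name."""

def publish_or_recover(sharemap, my_seqnum, recover, push_shares):
    newest_seen = max(seqnum for (seqnum, _roothash) in sharemap.values())
    if newest_seen >= my_seqnum:
        # Someone holds a share at least as new as the one we want to write:
        # jump to recovery instead of failing outright...
        recover(sharemap)
        # ...but still report the collision, so the application can re-read
        # and retry its modification.
        raise UncoordinatedWriteError("newer shares found during query phase")
    push_shares(my_seqnum)
```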
Oh how I want the thorough unit tests envisioned in #270! They would have
caught this earlier.
This problem happened to Peter's directory because of the accidental
nodeid-change described in #269. But it could occur without that problem with
the right sort of network partition at the right time.
Ok, on to the bug:
Suppose that share1 is placed on peer1, share2 on peer2, and so on, and that
we're doing 3-of-10 encoding. Now suppose that an update is interrupted in
such a way that peer9 is updated but peers 0 through 8 are not. Most peers
are now at version 4, but that one loner peer is at version 5. (In the case
of Peter's directory, most of the servers were upgraded incorrectly and
changed their nodeids, thus changing the write_enablers; the loner peer was
the one that was either upgraded correctly or not upgraded at all.)
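For concreteness, the share layout in that scenario looks something like
this (illustrative values only, not a real Tahoe data structure):

```python
# 3-of-10 encoding: shares 0..8 (on peers 0..8) still at seqnum 4,
# the loner share 9 on peer9 already at seqnum 5.
k, N = 3, 10
share_seqnums = {shnum: 4 for shnum in range(9)}
share_seqnums[9] = 5
# seqnum 5 has only one share (fewer than k=3), so it is unrecoverable;
# seqnum 4 has nine shares and is easily recoverable.
```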
If we read this slot, we read 'k' plus epsilon shares (four or five, I
think). We see everyone is at version 4 (rather, everyone we see is at
version 4), so we conclude that 4 is the most recent version. This is fine,
because in the face of uncoordinated writes, we're allowed to return any
version that was ever written.
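The read-side rule being applied there amounts to something like the
following. This is a paraphrase, not the actual retrieval code;
`observed_shares` is assumed to map shnum to the (seqnum, roothash) pair
reported by whichever k-plus-epsilon servers we happened to query.

```python
from collections import defaultdict

def best_recoverable_version(observed_shares, k):
    """Group the shares we happened to see by (seqnum, roothash) and return
    the newest version with at least k shares, or None if nothing is
    recoverable yet and we should keep querying.  With only k+epsilon
    queries we may never see the loner seqnum-5 share at all, which is why
    returning version 4 here is legitimate."""
    by_version = defaultdict(list)
    for shnum, (seqnum, roothash) in observed_shares.items():
        by_version[(seqnum, roothash)].append(shnum)
    recoverable = [v for v, shnums in by_version.items() if len(shnums) >= k]
    if not recoverable:
        return None
    return max(recoverable, key=lambda version: version[0])
```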
Now, the next time we want to update this slot, mutable.Publish gets control.
This does read-before-replace, and it is written to assume that we can use a
seqnum one larger than the value that was last read. So we prepare to send
out shares at version "5".
We start by querying N+epsilon peers, to build up a sharemap of the
seqnum+roothash of every server we're thinking of sending a share to, so we
can decide which shares to send to whom, and to build up the testv list. This
sharemap is processed in Publish._got_all_query_results. It has a check that
raises UncoordinatedWriteError if one of the queries reports the existence
of a share that is newer than the seqnum that we already knew about. In this
case, the response from peer9 shows seqnum=5, which is equal-or-greater than
the "5" that we wanted to send out. This is evidence of an uncoordinated
write, because our read pass managed to extract version 4 from the grid, but
our query pass shows evidence of a version 5. We can tolerate lots of 5s and
one or two 4s (because then the read pass would have been unable to
reconstruct a version 4, and would have kept searching, and would have
eventually reconstructed a version 5), but we can't tolerate lots of 4s and
one or two 5s.
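The check in question has roughly this shape. This is a paraphrase of what
Publish._got_all_query_results does, not the actual source; `query_results`
is assumed to map peerid to the (seqnum, roothash) pairs that peer already
holds.

```python
class UncoordinatedWriteError(Exception):
    """Placeholder for Tahoe's real exception of the same name."""

def check_for_surprise_shares(query_results, seqnum_we_want_to_write):
    for peerid, shares in query_results.items():
        for (seqnum, _roothash) in shares:
            if seqnum >= seqnum_we_want_to_write:
                # Our read pass reconstructed an older version, yet this peer
                # already holds something at least as new: evidence of an
                # uncoordinated write, so today we just give up here.
                raise UncoordinatedWriteError(
                    "peer %s has seqnum %d >= our %d"
                    % (peerid, seqnum, seqnum_we_want_to_write))
```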
So this is the problem: we're spooked into an UncoordinatedWriteError by the
loner new share, and since we don't yet have any recovery code, we can't fix
the situation. If we tried to write out the new shares anyway, we could
probably get a quorum of our new seqnum=5 shares, and if on the next update
we managed to reconstruct version 5, then we'd push out seqnum=6 shares to
everybody and the problem would go away.
We need recovery to handle this. When the UncoordinatedWriteError is
detected during the query phase, we should pass the sharemap to the recovery
code, which should pick a version to reinforce (according to the rules we
came up with in mutable.txt), and then send out shares as necessary to make
that version the dominant one. (If I recall correctly, I think that means we
would take the data from version 4, re-encode it into a version 6, then push
6 out to everyone).
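In code, the proposed recovery step would look roughly like this.
`retrieve_version` and `publish_version` are hypothetical stand-ins for the
real download and upload paths, and `best_recoverable_version` is the helper
sketched a few paragraphs up.

```python
def recover(observed_shares, k, retrieve_version, publish_version):
    """Reinforce the best recoverable version: re-encode its contents at a
    seqnum strictly greater than anything seen, so the loner seqnum-5 shares
    can no longer spook future writers.  In the example above this takes the
    version-4 data and pushes it back out as seqnum 6."""
    best = best_recoverable_version(observed_shares, k)    # e.g. (4, roothash4)
    highest_seen = max(seqnum for (seqnum, _) in observed_shares.values())
    contents = retrieve_version(best)
    publish_version(contents, seqnum=highest_seen + 1)     # 5 + 1 = 6
```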
Once recovery is done, the Publish attempt should still fail with an
UncoordinatedWriteError, as a signal to the application that their write
attempt might have failed. The application should perform a new read and
possibly attempt to make its modification again.
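From the application's point of view, that turns into an ordinary
read-modify-write retry loop, roughly as below (placeholder callables, not
real Tahoe API calls):

```python
class UncoordinatedWriteError(Exception):
    """Placeholder for Tahoe's real exception of the same name."""

def modify_with_retries(read, modify, publish, max_tries=3):
    """On UncoordinatedWriteError (raised even after recovery has run),
    re-read the slot and re-apply the modification to the fresh contents."""
    for _ in range(max_tries):
        current = read()
        try:
            publish(modify(current))
            return
        except UncoordinatedWriteError:
            continue  # someone else wrote concurrently: read again and retry
    raise UncoordinatedWriteError("still colliding after %d tries" % max_tries)
```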
The presence of the #269 bug would interact with recovery code in a bad way:
it is likely that many of these shares are now immutable, and thus recovery
would be unable to get a quorum of the new version. But since we could easily
get into this state without #269 (by terminating a client just after it sends
out the first share update, before it manages to get 'k' share-update
messages out), this remains a serious problem.
Changed title from "mutable update can't recover from <k new shares" to "implement mutable-file recovery: update can't recover from <k new shares".
See also related issue #232 -- "http://allmydata.org/trac/tahoe/ticket/232".
I mean #232 -- "Peer selection doesn't rebalance shares on overwrite of mutable file."
Rob hit a bad case today which might have been caused by this issue. He was doing parallel, uncoordinated writes to directories (by accident), and he reported:
For Milestone 1.0.0, let's see if we can figure out what happened with Rob's client there, and see if it is likely to cause any problems for clients in Tahoe v1.0.0, and if not then bump this ticket.
that "I was surprised" is certainly an indication of detected collision. Not
being able to find the encrypted privkey.. oh, I think I understand that one.
The encrypted private key is stored at the end of the share, since it's kind
of large, and the dominant use case (reading) doesn't need to retrieve it. On
the first pass (the "reconnaissance phase"), we find out what the data size
is, and therefore which offset the encprivkey is at. On the second pass, we
use that offset to retrieve the encprivkey. But if someone sneaks in between
our first and second passes (and writes new data, of a different size), then
the key will have moved, and our attempt to read the encprivkey will grab the
wrong data. I'm not sure what exactly will happen (i.e. how we validate the
encprivkey), but somewhere in there it ought to throw a hash check exception
and drop that alleged copy of the privkey, causing the code to try and find a
better one. If this process discards all potential copies, the upload will
fail for lack of a privkey.
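The race can be pictured like this. The names are illustrative only; the
real share layout and privkey validation live in the mutable-file code, and
this just shows the shape of the failure.

```python
import hashlib

def fetch_encprivkey(read_share_bytes, offset, length, expected_hash):
    """Pass 1 learned `offset` from the share's offset table.  If another
    writer replaces the share with different-sized data between pass 1 and
    pass 2, the bytes now at `offset` are not the encrypted privkey, the
    hash check below rejects them, and the caller has to go look for a
    better copy (or give up, failing the upload for lack of a privkey)."""
    candidate = read_share_bytes(offset, length)
    if hashlib.sha256(candidate).digest() != expected_hash:
        raise ValueError("alleged encprivkey failed its hash check")
    return candidate
```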
So, if Rob could identify the exact time of the collision, we could probably
look at the server logs and confirm the server-side collision sequence.
The rule remains: Don't Do That. :)
Brian:
Thanks for the analysis. The collision sequence that you describe should result in this error message once, but then subsequent attempts to overwrite should succeed. But I thought Rob described a problem where the directory could not be written again. Perhaps I imagined that last part.
If it is a transient problem then I am satisfied with this for allmydata.org "Tahoe" 1.0.
What? No it doesn't! We've decided that it is okay for users of Tahoe to do this, as long as they don't do it with too many simultaneous writers at too fast a rate, and they don't mind all but one of their colliding writes disappearing without a trace.
Right?
You're right. The rule is: Don't Do That Very Much. :)
Okay, we talked about this and concluded that what Rob saw could be explained by the uncoordinated writes that his script accidentally generated. We updated the error messages to be clearer about what was surprising about the sequence numbers. Bumping this ticket out of Milestone 1.0.0.
Brian: should this ticket be updated to include anything from your recent refactoring of the mutable file upload/download code?
It's almost closed. I believe the new servermap scheme should fix this case.
I'm working on a test case to specifically exercise it now.
Closed, by the test added in changeset:ff0b9e25499c7e5f.