implement mutable-file recovery: update can't recover from <k new shares #272
If a mutable slot has one or two shares at a newer version than the rest, our
current SDMF update code is unable to recover: all attempts to write a newer
version will fail with an UncoordinatedWriteError. The necessary fix is
twofold: the first part is to implement mutable-slot recovery (replace all
shares with a version that's newer than the loner, by promoting the
older-but-more-popular-and-still-recoverable version to a later seqnum); the
second part is to jump directly to recovery when the post-query pre-write
sharemap shows that someone has a newer version than the one we want to
write (in Publish._got_all_query_results), and then either retry the
read-and-replace or defer to the application.
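Roughly, that second part would look something like this. This is only a
sketch under assumed names: `sharemap`, `recover`, and `push_shares` are
placeholders, not the real Publish machinery.

```python
# Hedged sketch only: `sharemap` maps shnum -> (seqnum, roothash) as seen
# during the pre-write query phase; `recover` and `push_shares` stand in for
# whatever the real Publish code provides.

class UncoordinatedWriteError(Exception):
    """Placeholder for Tahoe's real exception of the same name."""

def publish_or_recover(sharemap, my_seqnum, recover, push_shares):
    newest_seen = max(seqnum for (seqnum, _roothash) in sharemap.values())
    if newest_seen >= my_seqnum:
        # Someone holds a share at least as new as the one we want to write:
        # jump to recovery instead of failing outright...
        recover(sharemap)
        # ...but still report the collision, so the application can re-read
        # and retry its modification.
        raise UncoordinatedWriteError("newer shares found during query phase")
    push_shares(my_seqnum)
```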
Oh how I want the thorough unit tests envisioned in #270! They would have
caught this earlier.
This problem happened to Peter's directory because of the accidental
nodeid-change described in #269. But it could occur without that problem with
the right sort of network partition at the right time.
Ok, on to the bug:
Suppose that share1 is placed on peer1, share2 on peer2, and so on, and that
we're doing 3-of-10 encoding. Now suppose that an update is interrupted in
such a way that peer9 is updated but peers 0 through 8 are not. Most peers
are now at version 4, but that one loner peer is at version 5. (In the case
of Peter's directory, most of the servers were upgraded incorrectly and
changed their nodeids, thus changing the write_enablers; the loner peer was
the one that was either upgraded correctly or not upgraded at all.)
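For concreteness, the share layout in that scenario looks something like
this (illustrative values only, not a real Tahoe data structure):

```python
# 3-of-10 encoding: shares 0..8 (on peers 0..8) still at seqnum 4,
# the loner share 9 on peer9 already at seqnum 5.
k, N = 3, 10
share_seqnums = {shnum: 4 for shnum in range(9)}
share_seqnums[9] = 5
# seqnum 5 has only one share (fewer than k=3), so it is unrecoverable;
# seqnum 4 has nine shares and is easily recoverable.
```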
If we read this slot, we read 'k' plus epsilon shares (four or five, I
think). We see everyone is at version 4 (rather, everyone we see is at
version 4), so we conclude that 4 is the most recent version. This is fine,
because in the face of uncoordinated writes, we're allowed to return any
version that was ever written.
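The read-side rule being applied there amounts to something like the
following. This is a paraphrase, not the actual retrieval code;
`observed_shares` is assumed to map shnum to the (seqnum, roothash) pair
reported by whichever k-plus-epsilon servers we happened to query.

```python
from collections import defaultdict

def best_recoverable_version(observed_shares, k):
    """Group the shares we happened to see by (seqnum, roothash) and return
    the newest version with at least k shares, or None if nothing is
    recoverable yet and we should keep querying.  With only k+epsilon
    queries we may never see the loner seqnum-5 share at all, which is why
    returning version 4 here is legitimate."""
    by_version = defaultdict(list)
    for shnum, (seqnum, roothash) in observed_shares.items():
        by_version[(seqnum, roothash)].append(shnum)
    recoverable = [v for v, shnums in by_version.items() if len(shnums) >= k]
    if not recoverable:
        return None
    return max(recoverable, key=lambda version: version[0])
```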
Now, the next time we want to update this slot, mutable.Publish gets control.
This does read-before-replace, and it is written to assume that we can use a
seqnum one larger than the value that was last read. So we prepare to send
out shares at version "5".
We start by querying N+epsilon peers, to build up a sharemap of the
seqnum+roothash of every server we're thinking of sending a share to, so we
can decide which shares to send to whom, and to build up the testv list. This
sharemap is processed in Publish._got_all_query_results. It has a check that
raises UncoordinatedWriteError if one of the queries reports the existence
of a share that is newer than the seqnum that we already knew about. In this
case, the response from peer9 shows seqnum=5, which is equal-or-greater than
the "5" that we wanted to send out. This is evidence of an uncoordinated
write, because our read pass managed to extract version 4 from the grid, but
our query pass shows evidence of a version 5. We can tolerate lots of 5s and
one or two 4s (because then the read pass would have been unable to
reconstruct a version 4, and would have kept searching, and would have
eventually reconstructed a version 5), but we can't tolerate lots of 4s and
one or two 5s.
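The check in question has roughly this shape. This is a paraphrase of what
Publish._got_all_query_results does, not the actual source; `query_results`
is assumed to map peerid to the (seqnum, roothash) pairs that peer already
holds.

```python
class UncoordinatedWriteError(Exception):
    """Placeholder for Tahoe's real exception of the same name."""

def check_for_surprise_shares(query_results, seqnum_we_want_to_write):
    for peerid, shares in query_results.items():
        for (seqnum, _roothash) in shares:
            if seqnum >= seqnum_we_want_to_write:
                # Our read pass reconstructed an older version, yet this peer
                # already holds something at least as new: evidence of an
                # uncoordinated write, so today we just give up here.
                raise UncoordinatedWriteError(
                    "peer %s has seqnum %d >= our %d"
                    % (peerid, seqnum, seqnum_we_want_to_write))
```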
So this is the problem: we're spooked into an UncoordinatedWriteError by the
loner new share, and since we don't yet have any recovery code, we can't fix
the situation. If we tried to write out the new shares anyway, we could
probably get a quorum of our new seqnum=5 shares, and if on the next update
we managed to reconstruct version 5, then we'd push out seqnum=6 shares to
everybody and the problem would go away.
We need recovery to handle this. When the UncoordinatedWriteError is
detected during the query phase, we should pass the sharemap to the recovery
code, which should pick a version to reinforce (according to the rules we
came up with in mutable.txt), and then send out shares as necessary to make
that version the dominant one. (If I recall correctly, I think that means we
would take the data from version 4, re-encode it into a version 6, then push
6 out to everyone).
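In code, the proposed recovery step would look roughly like this.
`retrieve_version` and `publish_version` are hypothetical stand-ins for the
real download and upload paths, and `best_recoverable_version` is the helper
sketched a few paragraphs up.

```python
def recover(observed_shares, k, retrieve_version, publish_version):
    """Reinforce the best recoverable version: re-encode its contents at a
    seqnum strictly greater than anything seen, so the loner seqnum-5 shares
    can no longer spook future writers.  In the example above this takes the
    version-4 data and pushes it back out as seqnum 6."""
    best = best_recoverable_version(observed_shares, k)    # e.g. (4, roothash4)
    highest_seen = max(seqnum for (seqnum, _) in observed_shares.values())
    contents = retrieve_version(best)
    publish_version(contents, seqnum=highest_seen + 1)     # 5 + 1 = 6
```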
Once recovery is done, the Publish attempt should still fail with an
UncoordinatedWriteError, as a signal to the application that their write
attempt might have failed. The application should perform a new read and
possibly attempt to make its modification again.
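From the application's point of view, that turns into an ordinary
read-modify-write retry loop, roughly as below (placeholder callables, not
real Tahoe API calls):

```python
class UncoordinatedWriteError(Exception):
    """Placeholder for Tahoe's real exception of the same name."""

def modify_with_retries(read, modify, publish, max_tries=3):
    """On UncoordinatedWriteError (raised even after recovery has run),
    re-read the slot and re-apply the modification to the fresh contents."""
    for _ in range(max_tries):
        current = read()
        try:
            publish(modify(current))
            return
        except UncoordinatedWriteError:
            continue  # someone else wrote concurrently: read again and retry
    raise UncoordinatedWriteError("still colliding after %d tries" % max_tries)
```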
The presence of the #269 bug would interact with recovery code in a bad way:
it is likely that many of these shares are now immutable, and thus recovery
would be unable to get a quorum of the new version. But since we could easily
get into this state without #269 (by terminating a client just after it sends
out the first share update, before it manages to get 'k' share-update
messages out), this remains a serious problem.
Changed title from "mutable update can't recover from <k new shares" to "implement mutable-file recovery: update can't recover from <k new shares".
See also related issue #232 -- "http://allmydata.org/trac/tahoe/ticket/232".
I mean #232 -- "Peer selection doesn't rebalance shares on overwrite of mutable file."
Rob hit a bad case today which might have been caused by this issue. He was doing parallel, uncoordinated writes to directories (by accident), and he reported:
For Milestone 1.0.0, let's see if we can figure out what happened with Rob's client there, and see if it is likely to cause any problems for clients in Tahoe v1.0.0, and if not then bump this ticket.
that "I was surprised" is certainly an indication of detected collision. Not
being able to find the encrypted privkey.. oh, I think I understand that one.
The encrypted private key is stored at the end of the share, since it's kind
of large, and the dominant use case (reading) doesn't need to retrieve it. On
the first pass (the "reconnaissance phase"), we find out what the data size
is, and therefore which offset the encprivkey is at. On the second pass, we
use that offset to retrieve the encprivkey. But if someone sneaks in between
our first and second passes (and writes new data, of a different size), then
the key will have moved, and our attempt to read the encprivkey will grab the
wrong data. I'm not sure what exactly will happen (i.e. how we validate the
encprivkey), but somewhere in there it ought to throw a hash check exception
and drop that alleged copy of the privkey, causing the code to try and find a
better one. If this process discards all potential copies, the upload will
fail for lack of a privkey.
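The race can be pictured like this. The names are illustrative only; the
real share layout and privkey validation live in the mutable-file code, and
this just shows the shape of the failure.

```python
import hashlib

def fetch_encprivkey(read_share_bytes, offset, length, expected_hash):
    """Pass 1 learned `offset` from the share's offset table.  If another
    writer replaces the share with different-sized data between pass 1 and
    pass 2, the bytes now at `offset` are not the encrypted privkey, the
    hash check below rejects them, and the caller has to go look for a
    better copy (or give up, failing the upload for lack of a privkey)."""
    candidate = read_share_bytes(offset, length)
    if hashlib.sha256(candidate).digest() != expected_hash:
        raise ValueError("alleged encprivkey failed its hash check")
    return candidate
```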
So, if Rob could identify the exact time of the collision, we could probably
look at the server logs and confirm the server-side collision sequence.
The rule remains: Don't Do That. :)
Brian:
Thanks for the analysis. The collision sequence that you describe should result in this error message once, but then subsequent attempts to overwrite should succeed. But I thought Rob described a problem where the directory could not be written again. Perhaps I imagined that last part.
If it is a transient problem then I am satisfied with this for allmydata.org "Tahoe" 1.0.
What? No it doesn't! We've decided that it is okay for users of Tahoe to do this, as long as they don't do it with too many simultaneous writers at too fast a rate, and they don't mind all but one of their colliding writes disappearing without a trace.
Right?
You're right. The rule is: Don't Do That Very Much. :)
Okay, we talked about this and concluded that what Rob saw could be explained by the uncoordinated writes that his script accidentally generated. We updated the error messages to be clearer about what was surprising about the sequence numbers. Bumping this ticket out of Milestone 1.0.0.
Brian: should this ticket be updated to include anything from your recent refactoring of the mutable file upload/download code?
It's almost closed. I believe the new servermap scheme should fix this case.
I'm working on a test case to specifically exercise it now.
Closed, by the test added in changeset:ff0b9e25499c7e5f.