mutable-file surprise shares raise inappropriate UCWE #546
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#546
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
We added a "Thumper" to the allmydata.com prodnet today: 47 new nodes,
bringing the grid to a total of 111 nodes. Shortly afterwards, our
"trunk-prodnet" automated grid-tester failed:
http://allmydata.org/buildbot/builders/gutsy/builds/1195 . There were a
couple of different problems here, at least three tickets worth. This ticket
is about a problem that is revealed by the addition of a large number of
nodes.
First, a quick summary of the problems:
problem 1: mapupdate(MODE_WRITE) triggers on 1000, when it should use 1000$
(to avoid triggering on 10001) (ticket #547)
problem 2: the mapupdate can hit a false boundary because the inserted gap
was too large. It might be a good idea to increase epsilon for MODE_WRITE
to reduce the chance of this. (ticket #549)
problem 3: when mapupdate hits a false boundary (because of either of the
above problems), the subsequent publish may put shares to servers which
actually already have them. The code looks for these "surprise" shares and
raises UCWE if it sees them. The UCWE is probably inappropriate, instead a
new writev should be sent to update the old share. (this ticket, #546)
problem 4: delete() which hits UCWE (because of problem 3, or other issues)
will retry, but will fail the second time because must_exist=True is the
default (ticket #550)
Now, some background. When a directory or mutable file is created, gets a
public key, and from that it gets a storage index. The storage index defines
a permutation of storage servers: a sequence of serverids. A permutation
function is used which accepts as inputs a set of serverids and the storage
index, and it produces an ordered list of serverids (the same size as the
input set). The partial ordering of the serverids is always the same for any
given storage index. This means that adding new servers (i.e. re-running the
function with a larger set) will insert new serverids into the output list at
various places, but it will not change the relative ordering.
A new mutable file's shares are placed on the first N (=10) writeable servers
in permuted order. Later, when the file is subsequently read or written, a
"share map update" operation is performed, to take the then-current list of
servers and build up a partial map of which shares are on which servers. This
mapupdate operation has several different modes, depending upon what needs to
be done with the servermap: MODE_READ is for download-only, MODE_WRITE is for
read-modify-write, and MODE_CHECK is for file-checking and repair operations.
The mapupdate operation is designed to balance efficiency, speed, and safety.
Efficiency is served by querying as few servers as possible. Speed is served
mainly by minimizing round trips. Safety is a question of improving
consistency: there may be shares of various versions present on the servers,
and we desire to have a pretty good chance of seeing the "current" version,
and we want to update "most" shares when we modify the file.
In MODE_READ, the mapupdate operation queries servers, in permuted order,
until it has received evidence of k (=3) plus "epsilon" (=k) shares of the
highest version, and has not seen evidence of any competing versions (those
with a higher sequence number). This is enough to retrieve the file and also
to reduce the possibility of retrieving an old version (which, although
annoying, would not really threaten data-loss, because MODE_READ is not used
for directory modifications). Without the epsilon, a small number of old
shares would be enough to trick the operation into not looking for a newer
version (equivalently, a small group of colluding servers could accomplish a
rollback attack).
In MODE_WRITE, the mapupdate operation needs to be more thorough, for two
reasons. The first is that MODE_WRITE is for modification, which means
somebody is going to take the version that is returned and build a new
version on top of it (by modifying the contents). When that new version is
published, it will (ideally) completely replace any old versions. So if the
version that gets retrieved is not the current one, we'll lose the contents
of the current version, which will be a regression in the state of the file.
The second reason is that the eventual publish needs to update all shares, to
reduce the chance of leaving around shares of old versions (and thus
improving the safety of later operations).
So MODE_WRITE must try harder to locate all shares: it tries to strike a
different balance between efficiency, speed, and safety. In this mode, the
code is supposed to query servers in permuted order until:
can conclude we've searched beyond the active set of servers.
Specifically, the code looks for a range of servers in which the first
server has a share, and the following epsilon (=k) servers do not have a
share, then it declares the search complete.
There is a bug in the current implementation of MODE_WRITE, in which the last
condition (a "boundary" between the active set and the remaining unused
servers) is incorrectly triggered (the pattern-match looks for "1000" but it
should really look for "1000$", to avoid matching on "10001"). But that is in
a different ticket (#547).
This ticket is about the fact that the publish process doesn't keep track of
which shares it's sent to whom, so if it needs to make a second pass (to
place shares that were not accepted by their first destination), it will get
confused and signal an inappropriate UCWE.
When a single new server is added, it is effectively inserted at a random
location into the permuted peer list. The list that used to be "A B C D E"
becomes "A B new C D E". The mapupdate operation should tolerate this.
Ideally, some kind of rebalancing or repair operation should move the shares
up in the server list, continually moving the shares to their "ideal"
location (the one that improves the efficiency, speed, and safety of
publish/retrieve operations).
The publish operation will leave shares in place whenever possible. (it
currently has no mechanism to remove extra shares, and is obligated to modify
all shares in place, so it won't intentionally duplicate a share). So if we
pretend that N=5 and the file was first created on "A1 B2 C3 D4 E5" (i.e.
server A has share 1, server B has share 2, etc), then when "new" is
inserted, the map will look like "A1 B2 new(none) C3 D4 E5", and no share
will be placed on "new".
If lots of new servers are inserted, such that the map looks like "A B C new
new new D E", then the MODE_WRITE mapupdate may conclude that the active set
ends with server C: unless it spends the time+bandwidth to query more
servers, it won't see the shares on D and E. (if a gap of three servers isn't
convincing, imagine that the gap had 100 servers in it). In this case, the
mapupdate will stop after the sixth query, and report shares at "A1 B2 C3
new(none) new(none) new(none)". When the subsequent publish takes place, it
needs to put shares 4 and 5 somewhere, so it will put them on the first two
new servers, and the visible map will wind up as "A1 B2 C3 new4 new5
new(none)". In fact, the full map will be "A1 B2 C3 new4 new5 new(none)
D(old4) E(old5)", since D and E would not have been updated.
In effect, the shares have been migrated closer to the start of the permuted
list, and a few old shares will get leftover towards the end of the list. As
long as the new servers stick around, the old shares will never be seen. Some
day, we'll have a repairer or a share-migrator service that will spot these
and delete them, but for now they're fairly harmless. (if enough new servers
disappear, then the old shares could be dangerous, since they might cause
rollback). This process may happen multiple times, as new batches of servers
are added.
Now, if a medium number of new servers are added, the "unhoused shares" (4
and 5 in the above example) might be assigned to servers which have not been
queried. If we use N=7 and start with "A1 B2 C3 new new new D4 E5 F6 G7",
then the mapupdate code would stop at "A1 B2 C3 new new new", and the publish
code would need to find homes for shares 4-7. The three new servers would get
shares 4 5 and 6, and share 7 would be assigned to the pre-existing (but
unqueried) server D. The publish code makes the dubious assumption that if
server D wasn't queried, then server D does not have any shares.
As a result, a readv-and-testv-and-writev request is sent to server D that
asks it to accept the new version of share 7 (but only if it did not already
have a copy of share 7). The readv portion of the request reveals that server
D has a copy (now old) of share 4. The publish code responds to this
"surprise share" by raising UncoordinatedWriteError, since normally (i.e.
when a server was queried and reported having no share, then a tv-a-rv-a-wv
reports yes having a share) this indicates that someone else has written a
share in the interim.
This UCWE is "problem 3" above. If the mapupdate had queried this server, it
would have learned about share 4, and it would have done two things
differently: it would update D with the new version of share 4, and it would
not have sent share 4 to one of the new servers.
The trick is how to give it the opportunity to learn about this share while
still achieving the desired speed and efficiency. Maybe the rule should be
that no tv-a-rv-a-wv requests shall ever be sent to a server that hasn't
already been queried (i.e. require two round trips, the readv of which might
either be done in the mapupdate or in the publish). Another possibility is to
do it reactively: instead of throwing UCWE when we see the surprise share,
just send out another query to replace it.
Reducing the likelihood of this situation can help, but won't remove the
issue entirely. Even if the MODE_WRITE mapupdate used an epsilon equal to N
(so that we were sure we'd have answers from N servers, in the hopes of not
having to walk beyond the end of the set we've already queried), some of
those servers might not be able to accept the write, forcing us off into
uncharted territory.
update bug numbers, clarify the specific focus of this bug (surprise shares cause UCWE)
Francois Deppierraz
experienced
this same problem in a different form: one of his storage
servers was full, so the writev() failed with a "Disk Quota Exceeded"
IOError
. This caused the loop to place the unhoused share on adifferent server, one which had already been given a share. It hit the same
surprise-share failure that we hit due to the not-looking-far-enough problems
described above.
I can think of two kinds of fix right now:
The first would also have the effect of tolerating UCWE between two uncoordinated writers who happened to be writing the same thing, which seems unlikely but might happen if both clients are trying to recover or repair the file at the same time.
changeset:278c47b9bd859343 adds one of these fixes: tolerate "surprise" shares if their roothash matches the version we were trying to publish. This should fix the problem that Francois say, in which we re-use a server and send it multiple shares in separate requests.
There's a comment in publish.py (around line 716, in _got_write_answer) about what needs to be done for the other half of the fix (to deal with servers that we failed to query during the mapupdate step). There's a way to proceed correctly without needing to signal UCWE (see the comment for details, basically if the surprise share is old enough we can replace it without doing another mapupdate-read-modify-write cycle). But the shorter/more-complete fix is to raise UCWE but mark the servermap to:
I think this only affects whether we can successfully write an update, rather than whether we preserve updates that were claimed to be successfully written.
It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)
If you like this ticket, you might like #540 (inappropriate "uncoordinated write error" after handling a server failure).