UCWE when mapupdate gives up too early, then server errors require replacement servers #893
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#893
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
The Incidents that were reported in #877 (by Zooko, referring to
mutable write errors experienced by nejucomo) indicate a thorny
problem that is distinct from both #877 (caused by a reentrancy
error) and #540 (caused by a logic bug that affects small grids
where a publish wraps around the peer circle). Here's the setup:
be updated. Ideally, the shares are concentrated at the beginning
of the permuted peerlist, on the first N servers.
heuristic: if we've seen enough shares, and we've also seen a
span of contiguous servers (in permuted order) that tell us they
do not have a share, then we stop the search. The size of this
span was intended to be N+epsilon (where 'epsilon' is a tradeoff
between performance and safety, and is set to k). Unfortunately
1.5.0 had a bug, and the span size was set to k+epsilon instead.
In the Incident:
the responses had returned, because it found a boundary early.
I'm not entirely sure how the sharemap was shaped, but it looks
like it stopped with a span of 3 servers ("found our boundary,
11000"), where k=3 and N=10, and returned a sharemap with 8
shares in it, some doubled up (I think there were 5 servers
involved)
(from
w6o6
, with two shares, but was ignoredUnfortunately, two of these (both owned by secorp) experienced
"Permission Denied" errors when attempting to write out the new
shares, suggesting a configuration error (maybe the tahoe node
process is owned by the wrong user)
picks suffer from the same error, so the fallover process repeats
a few times.
w6o6
, and sends it anew share, thinking that w6o6 has no shares (because the
servermap was not updated to include w6o6's late response).
shares that the test vector did not expect, causing a UCWE error
happened, even though UCWE was raised.
The biggest problem with this failure is that it is persistent. We
don't record any information that would tell a subsequent operation
to look further for existing shares, so exactly the same thing will
happen the next time we try to modify or repair the directory.
If secorp's servers weren't throwing errors, then I think the
condition would eventually fix itself: new shares would be placed on
his servers, bridging the span of servers without shares, and then
later mapupdate calls would keep going until they'd really seen all
of the shares.
Recently (changeset:eb1868628465a243, 08-Dec-2009) I fixed mapupdate to use N+epsilon
instead of k+epsilon. But the incident report suggests that it
stopped with a span of only 3 no-share servers. Looking more closely
at the code, I think it only waits for a span of epsilon (not
k+epsilon or N+epsilon), and that changeset:eb1868628465a243 changed something
different. I don't know if the thing that was changed would have
prevented this issue or not. It's possible that this is a
manifestation of #547 (mapupdate triggers on a false boundary), or
#549, or one of the other problems described in #546.
In general, we need to query more servers. But even if we increase
the span size or epsilon or whatever, there will always be a weird
situtation that could be handled better if we queried more servers.
We'd like to have something more adaptive: if the code hits UCWE
because it didn't try hard enough, then it should try harder next
time.
How should we deal with this? We need something to persist from one
operation on a given mutable filenode to the next, some sort of hint
that says "Hey, last time we were surprised, so next time you should
look further". Or something that tells us that we learned about
shares on servers X+Y+Z, and so the next time we do a mapupdate, we
shouldn't consider it complete until we've gotten responses from
those servers (in addition to any others that we might decide to
query).
The most natural place to keep this state would be on the mutable
filenode instance. This would help with UCWE that occurs inside a
modify() call, because the same filenode is used for each retry, but
in general filenodes are pretty short-lived. We don't want to keep
the mutable filenode around in RAM forever. Maybe a LRU cache that
keeps filenodes around for a few minutes, so that users who
experience UCWE and retry the operation can benefit from recent
history.
A storage protocol that included "where-are-the-other-shares" hints
(#599) would help: this would improve the reliability of mapupdate,
since the persistent information would be kept on the storage
servers, next to the shares.
A publish process which rebalanced the shares (#232, or
#661/#543/#864) might help, by filling in the gaps, except that here
the gap was caused by a batch of servers all suffering from the same
configuration problem.
The right answer probably lies in having UCWE triggering an
immediate repair, and having repair fill in the gaps. But it'd be
nice if there were a way to stash some information on the shares
before the gaps that let later operations know that they should look
past the gap.
looking at the changeset:eb1868628465a243 change more closely, I think it could be improved. The change increases
self.num_peers_to_query
, but that's only actually used by MODE_READ. The finish-criteria code for MODE_WRITE (ServermapUpdater._check_for_done
) should not let the process finish unless we've received answers from at leastself.num_peers_to_query
, just like MODE_READ (or if we've run out of servers to query).The gap size was always defined to be epsilon, but the minimum
num_peers_to_query
value was intended to make sure we'd looked far enough.I don't know if this is really a 1.6.0 or critical issue, but since #877 from whence it spawned was both, I guess I should set it to the same. Feel free to downgrade or defer it. I don't think we can find a good long-term solution in the 1.6.0 timeframe, but maybe the
num_peers_to_query
change could fit in.Maybe the uploader should have been unsurprised to find shares on
w6o6
. "Surprises" are when a server told you something during the mapupdate and then told you something different during the next phase. If you didn't ask a server during the mapupdate, then you shouldn't be surprised what he says in the next phase.I guess this is because I think of surprises as indicating UncoordinatedWriteError, rather than indicating "the world of storage servers are in a stranger state than you assumed".
What do you think of that, Brian? If you already decided to upload version
V+1
of your file, and then during the process you find some previously unused storage servers which are holding shares<= V
then you should just unsurprisedly resend your write request updated to say "Yes I know you have version U, and please replace it with this version V+1.". Now if you find previously unused servers that have a version> V+1
or a version==V
and with a different root hash then you should stop (but not with UncoordinatedWriteError).I think extending the finish criteria for MODE_WRITE (
ServermapUpdater._check_for_done
) would be a good improvement (it would have solved nejucomo's problem) and is probably feasible for Tahoe-LAFS v1.6.hm. Yeah, I suppose that if we were writing out version V+1, then it would be safe to overwrite any version (<V) or (=V,=oldroothash) that we saw. This would require either a clever test vector or some code to receive the "test failed, so no write" response and send out a new request with a new test vector. Not trivial, but not too hard.
I think that discovering a version (=V+1,!=newroothash) or (>V+1) should fire UCWE.. sounds to me like the exact definition of UCWE. Why do you think this should use a different error?
I'm not sure what it should do if it sees (=V,!=oldroothash). This indicates that a UCWE occurred in the past, and now there are two equal-version contenders for the file. Ideally the earlier write should have noticed this and performed an immediate repair. For the current publish, it should stop, but I'll agree that UCWE might not be the most accurate way to report it. Dunno.
Hm, the server doesn't currently support an operation that will apply a test-and-set if the share is present or else just apply the set if the share is absent, does it? source:src/allmydata/storage/server.py?rev=4118#L475 Oh look! It does! "
# compare the vectors against an empty share, in which all reads return empty strings
" So the current test vector should already work for this case, right? The client just needs to send a test vector saying if the server has a version less than the client's version then overwrite.Nejucomo expostulated: "UncoordinatedWriteError! I'm the only person in the universe who knows this write cap!"
And he was right -- there was never an uncoordinated write event. The client can't tell for sure, but if you never saw a server claim that it had version
V
and then claim that it had versionV+u
, then you should perhaps blame your problem on random network problems or bugs sooner than onUncoordinatedWriteError
. Could we have a related exception named something likeMultipleVersionsExist
of whichUncoordinatedWriteError
is the subclass that we use when we know that either some other client just now wrote when we were writing or else some server is doing a "roll-forward" in order to make it look like some other client has done so?This is the same case as server's V > client's V, right? So having the client send a testv which was
< V
should handle both cases.So there are three proposals for code changes in this ticket.
ServermapUpdater._check_for_done
) should not let the process finish unless we've received answers from at leastself.num_peers_to_query
, just like MODE_READ (or if we've run out of servers to query).MultipleVersionsExisted
and raise that instead ofUncoordinatedWriteError
if you didn't see actual evidence of a server changing its store between your mapupdate and your write. I earlier proposed thatUncoordinatedWriteError
could be a subtype ofMultipleVersionsExisted
, but I'm not sure whether it should be a subtype or just a separate type. Either way is fine with me.This might be related to #899, newly reported by Kyle Markley and Andrej Falout.
Not sure if we can fix this for 1.6.1, but it's definitely a candidate.
We're bumping this out of v1.6.1 just because we're not 100% sure that proposals 2 and 3 from comment:74529 are actually easy and safe.
It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)