Incomplete ServerMap triggers UncoordinatedWriteError upon mutable Publish #1795
Reference: tahoe-lafs/trac-2024-07-25#1795
This error has been seen in the wild while working on the Tamias system, which uses tahoe-lafs as a storage layer. It shows up much more often in our testing environment, where many clients (client nodes, not storage nodes) connect to and leave the network at high frequency.
Before overwriting a mutable file, the client builds a servermap using MODE_WRITE. This mode does not query every server: it stops querying once 'epsilon' consecutive servers have reported that they hold no share. When that happens ("hit boundary" in the log), the servermap is considered done, provided that all servers to the left of the boundary have answered. A simplified sketch of this stopping rule is shown below.
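For readers unfamiliar with the servermap update, here is a minimal, self-contained sketch of the stopping rule described above. It is not the real tahoe-lafs code; the names `permuted_servers`, `query`, and the value of `EPSILON` are illustrative only.

```python
EPSILON = 3  # illustrative value; tahoe-lafs computes its own epsilon

def build_servermap(permuted_servers, query):
    """query(server) -> set of share numbers held by that server."""
    servermap = {}          # server -> shares found
    consecutive_empty = 0   # servers in a row that reported no shares
    for server in permuted_servers:
        shares = query(server)
        if shares:
            servermap[server] = shares
            consecutive_empty = 0
        else:
            consecutive_empty += 1
            if consecutive_empty >= EPSILON:
                break       # "hit boundary": assume no shares further out
    return servermap
```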
In some corner cases all of those servers have answered, but the timing is such that a server has been marked as holding a share while that share's information has not been processed yet. Because there are several concurrent calls to check_for_done, one of them may decide that the servermap update can stop running, which prevents the last share from ever being processed (illustrated in the sketch below).
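The following is an illustrative sketch of the ordering hazard, not the actual updater code: if "this server has answered" is recorded before its share payload is merged into the map, a check_for_done() call that runs in that window sees every server left of the boundary as answered and declares the update finished, so the last share is silently discarded. All class and method names here are made up for the illustration.

```python
class SketchUpdater:
    def __init__(self, servers_left_of_boundary):
        self.pending = set(servers_left_of_boundary)
        self.servermap = {}
        self.done = False

    def _got_response(self, server, shares):
        self.pending.discard(server)          # marked as answered here ...
        self.check_for_done()                 # ... so this call can finish early
        self._process_shares(server, shares)  # ... before the shares are recorded

    def _process_shares(self, server, shares):
        if self.done:
            return                            # late share data is silently dropped
        self.servermap[server] = shares

    def check_for_done(self):
        if not self.pending:
            self.done = True                  # servermap may still be partial

u = SketchUpdater(["A", "B"])
u._got_response("A", {0})
u._got_response("B", {1})   # update is declared done before B's share lands
print(u.servermap)          # {'A': {0}} -- B's share was dropped
```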
This results in a partial servermap. When the Publish operation starts, it may select that last server - the one missing from the servermap - as a candidate for the missing share. It then issues a testv that checks for the absence of a share. The testv fails because a share is in fact present, and an UncoordinatedWriteError (UCW) is triggered; the sketch below models that failing check.
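A hedged, self-contained model of the failing check: the (offset, length, operator, specimen) tuples are loosely modeled on the mutable test-and-set vectors of the storage protocol, but the helper names and values here are illustrative and not tahoe-lafs API.

```python
class UncoordinatedWriteError(Exception):
    pass

def run_testv(existing_share_data, test_vector):
    """Return True if every (offset, length, op, specimen) test passes."""
    for (offset, length, op, specimen) in test_vector:
        assert op == "eq"
        if existing_share_data[offset:offset + length] != specimen:
            return False
    return True

# Publish believed the server held no share, so it tests for absence:
absence_test = [(0, 1, "eq", b"")]

# ... but the server does hold the share that the servermap failed to record:
existing = b"<existing mutable share bytes>"

if not run_testv(existing, absence_test):
    # This corresponds to the UCW triggered by the failed testv in the log.
    raise UncoordinatedWriteError("test vector failed: share already present")
```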
This can be seen in the attached log: the sequence starts at event 6750, the boundary is found at 6898, and 6899 stops the servermap update. Event 6900 shows the partial servermap, and events 6903 and 6904 show the last share being processed and then filtered out because the servermap update has already been stopped. Events 6918 and 6920 show the servermap before (partial) and after (unfortunately choosing the 'hidden' server whose answer was discarded). This leads to the eventual UCW at event 6955, triggered by the failed testv at 6953.
In our testing environment we use the attached workaround, which moves the addition to the good_servers list to the very bottom of the DeferredList that is built per server. This is expected to cause problems when servers hold multiple shares, but it is only a temporary fix anyway. A rough sketch of the idea follows below.
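The attached diff is the authoritative change; the following is only a rough sketch of its idea using Twisted primitives. The callback that records a server in good_servers is chained after the per-server DeferredList that processes its shares, so check_for_done cannot count the server as answered while a share is still being processed. All class and method names here are illustrative, not the real updater code.

```python
from twisted.internet import defer

class SketchedWorkaround:
    """Illustrative stand-in for the servermap updater; not tahoe-lafs code."""

    def __init__(self):
        self.good_servers = set()

    def _process_share(self, server, data):
        # Stand-in for the real per-share processing; returns a fired Deferred.
        return defer.succeed((server, data))

    def check_for_done(self):
        pass  # placeholder; the real method decides whether the update is over

    def handle_server_response(self, server, share_datas):
        # One Deferred per share held by this server.
        share_deferreds = [self._process_share(server, d) for d in share_datas]
        dl = defer.DeferredList(share_deferreds, consumeErrors=True)

        def _mark_good(results):
            # Only after every share is processed may this server satisfy
            # check_for_done(), closing the window described above.
            self.good_servers.add(server)
            self.check_for_done()
            return results

        dl.addCallback(_mark_good)
        return dl
```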
Attachment changeset_r9db2f65ebb8eaa4f6094f2f99eff928ba285f5f5.diff (904 bytes) added
workaround
Attachment ucw_text_transcript.log (14887 bytes) added
Transcript of the incident report