uploader should keep trying other servers if its initially-chosen servers fail during the "scan" phase #2108
Reference: tahoe-lafs/trac-2024-07-25#2108
In [source:trunk/src/allmydata/immutable/upload.py?annotate=blame&rev=196bd583b6c4959c60d3f73cdcefc9edda6a38ae#L390 v1.10 upload.py], during the uploader's "scan" phase (asking storage servers whether they already have, or would be willing to accept upload of, shares of this file), if the uploader's first-chosen servers answer "no can do" or fail, it keeps asking more and more servers until it either succeeds at uploading or runs out of candidates.
In 1382-rewrite-2 upload.py (which will hopefully be merged into trunk soon and released in the upcoming Tahoe-LAFS v1.11), it instead chooses a few servers that it is going to ask, and if all of them fail it gives up on the upload.
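To make the behavioral difference concrete, here is a hedged sketch of the two scan strategies. All names (`scan_v1_10`, `scan_1382_rewrite`, the `ask` callback, `num_to_ask`) are illustrative, not the actual `allmydata.immutable.upload` API, and `ConnectionError` stands in for any server failure:

```python
def scan_v1_10(candidates, ask):
    """1.10-style scan: keep querying servers until we run out of candidates.

    `ask(server)` returns True if the server already has, or will accept,
    a share; it raises ConnectionError if the server fails.
    """
    accepted = []
    remaining = list(candidates)
    while remaining:
        server = remaining.pop(0)
        try:
            if ask(server):
                accepted.append(server)
        except ConnectionError:
            continue  # a failed server just means: try the next one

    return accepted


def scan_1382_rewrite(candidates, ask, num_to_ask=10):
    """1382-rewrite-2-style scan: query only a fixed prefix of candidates.

    If all of the initially chosen servers fail, there is no fallback to
    the rest of the list -- the upload gives up.
    """
    accepted = []
    for server in candidates[:num_to_ask]:
        try:
            if ask(server):
                accepted.append(server)
        except ConnectionError:
            pass  # no retry against servers beyond the initial choice

    return accepted
```

With four candidate servers of which the first two are down, the 1.10-style scan still finds the two healthy servers, while the fixed-prefix scan with `num_to_ask=2` comes back empty and the upload fails, which is exactly the regression this ticket describes.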
(I have a vague memory of discussing this on a conference call with the other Google Summer of Code Mentors and Mark Berger, and telling him to go ahead and do it this way, as it is simpler to implement. That might be a false memory.)
Anyway, I'd like to revisit this issue. For some situations, this would be a regression from 1.10 to 1.11, i.e. 1.10 would successfully upload and 1.11 would say that the upload failed. Therefore I'm adding the keywords "regression" and "blocks-release" to this ticket.
The reason to do it this way, with a finite "scan" phase, is that by first establishing which servers either already have or are willing to accept shares, we can then use our upload-strategy-of-happiness computation to plan which servers we want to upload to. Mixing planning with action is confusing, and the old 1.10 algorithm was hard to understand and had some undesirable behaviors. I suspect this is why we instructed Mark to go ahead with the simpler "phased" approach in #1382.
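For readers unfamiliar with the happiness computation mentioned above: "servers of happiness" is measured as the size of a maximum matching in the bipartite graph pairing servers with the shares they hold or would accept. The following toy matcher (names are ours, not Tahoe-LAFS's; it uses simple augmenting paths rather than the project's actual implementation) illustrates the idea:

```python
def happiness(server_shares):
    """Size of a maximum server<->share matching.

    server_shares: dict mapping server name -> set of share numbers that
    server already has or is willing to accept.
    """
    match = {}  # share number -> server currently matched to it

    def try_assign(server, shares, seen):
        # Classic augmenting-path step: give `server` some share, evicting
        # and re-homing a previous owner if necessary.
        for sh in shares:
            if sh in seen:
                continue
            seen.add(sh)
            owner = match.get(sh)
            if owner is None or try_assign(owner, server_shares[owner], seen):
                match[sh] = server
                return True
        return False

    count = 0
    for server, shares in server_shares.items():
        if try_assign(server, shares, set()):
            count += 1
    return count
```

For example, if server s1 can hold shares {0, 1}, s2 can hold only {0}, and s3 only {2}, a maximum matching pairs s2 with 0, s1 with 1, and s3 with 2, giving a happiness of 3; two servers that can both hold only share 0 give a happiness of 1.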
However, now that I've seen the 1382-rewrite-2 branch up close, I think I'm starting to see how a variant of it wouldn't be too complicated, would have the property of "always achieves Happiness if it is possible to do so", and would avoid this regression.
The idea would be that instead of a single up-front "scan" phase followed by upload, we would have a state machine which does something like this:

Let R be the set of servers who have responded to our queries by indicating that they either already have shares or would be willing to hold a share. At the beginning of the state machine, R is ∅ (the empty set). Let A be the set of all servers that we have heard about.

1. Compute whether Happiness is achievable using the servers in R. If it is, stop scanning and proceed with the upload.
2. If it is not, and A is non-empty, pick a server from A, remove it from A, and send it a query. When the query comes back (adding the server to R if it answered affirmatively), go to step 1. If A is empty, give up and report the upload as failed.

Daira said, at some point, I think, that this is not an important regression for Tahoe-LAFS v1.11, because the only cases where this would occur are when the grid has a high rate of churn (servers coming and going), and in those cases, Tahoe-LAFS v1.10 immutable upload probably has other problems. I think that's what she said. Anyway, it sounded right to me at the time and I agreed with it, but apparently we forgot to write it down on this ticket. Assigning to Daira to confirm and moving this ticket out of v1.11.
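The incremental state machine described above can be sketched as a small loop. This is an illustration only, under assumed names: `ask`, `happiness_achievable`, and `incremental_scan` are not Tahoe-LAFS's actual API, and `ConnectionError` stands in for any kind of server failure:

```python
def incremental_scan(all_servers, ask, happiness_achievable):
    """Keep querying one server at a time until Happiness is reachable.

    Returns the set R of responsive servers to upload to, or None if we
    ran out of candidates before Happiness became achievable.
    """
    R = set()              # servers that have, or will accept, shares
    A = list(all_servers)  # servers heard about but not yet asked
    while not happiness_achievable(R):
        if not A:
            return None    # out of candidates: the upload fails
        server = A.pop(0)  # remove one server from A and send it a query
        try:
            if ask(server):
                R.add(server)   # positive answer: grow R, then re-check
        except ConnectionError:
            continue            # a failed server just shrinks A
    return R               # Happiness achievable: upload using R
```

Because a failed or negative answer only removes a candidate from A and never aborts the loop, this variant degrades exactly like the 1.10 scan (it keeps trying other servers), while the explicit Happiness check on each iteration preserves the planning step that the phased #1382 design introduced.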
I said I didn't think it should be a blocker for 1.11 (not that it wasn't important). Zooko accurately described my reasoning.