upload needs to be tolerant of lost peers #17
When we upload a file, we can tolerate not having enough peers (or those peers not offering enough space), based upon a threshold named "shares_of_happiness". We want to place 100 shares by default, and as long as we can find homes for at least 75 of them, we're happy.
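For illustration, here is a minimal sketch of that acceptance check; the constant and function names are assumptions for this example, not the actual upload.py identifiers:

```python
# Hypothetical sketch of the "shares_of_happiness" threshold described above.
TOTAL_SHARES = 100          # how many shares we try to place by default
SHARES_OF_HAPPINESS = 75    # the fewest placements we will accept

def upload_is_happy(placed_shares):
    """Return True if enough shares found homes for the upload to succeed."""
    return placed_shares >= SHARES_OF_HAPPINESS
```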
But in the current source:src/allmydata/encode.py, if any of those peers go away while we're uploading, the entire upload fails (worse yet, the failure is not reported properly: there are a lot of Deferreds in there with unhandled errbacks).
encode.Encoder._encoded_segment needs to be changed to count failures rather than allowing them to kill off the whole segment (and thus the whole file). When the encode/upload process finishes, it needs to return both the roothash and a count of how many shares were successfully placed, so that the enclosing upload.py code can decide whether it's done or whether it needs to try again.
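A rough Twisted-flavored sketch of that counting approach follows; the class, the put_block call, and the return shape are illustrative assumptions, not the real encode.py API (the real fix would also return the roothash alongside the placed-share count, which is omitted here):

```python
# Illustrative only: count per-shareholder failures instead of letting one
# errback kill the whole segment (and thus the whole file).
from twisted.internet import defer

class SegmentSender:
    def __init__(self, shareholders):
        self.shareholders = shareholders   # dict: shnum -> remote bucket proxy
        self.lost_shnums = set()

    def _send_block(self, shnum, block):
        d = self.shareholders[shnum].put_block(0, block)  # hypothetical remote call
        def _drop_server(failure):
            # Record the loss and forget the server; do not propagate the error.
            self.lost_shnums.add(shnum)
            del self.shareholders[shnum]
        d.addErrback(_drop_server)
        return d

    def send_segment(self, blocks):
        dl = [self._send_block(shnum, block) for shnum, block in blocks.items()]
        d = defer.DeferredList(dl)
        # Report how many shares are still alive so the caller (upload.py)
        # can decide whether it is done or needs to try again.
        d.addCallback(lambda _results: len(self.shareholders))
        return d
```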
At the moment, since we're bouncing storage nodes every hour to deal with the silent-lost connection issues, any upload that is in progress at :10 or :20 or :30 is going to fail in this fashion.
Oh, I think it gets worse: from some other tests I was doing, it looks like if you lose all peers, the upload process goes into an infinite loop and slowly consumes more and more memory.
changeset:6bb9debc166df756 and changeset:f4c048bbeba15f51 should address this: now we keep going as long as we can still place 'shares_of_happiness' shares (which defaults to 75 in our 25-of-100 encoding). Log messages are generated when shares are lost, indicating how close we are to giving up.
If we lose so many peers that we go below shares-of-happiness, the upload fails with a NotEnoughPeersError exception.
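For concreteness, that threshold check looks roughly like the sketch below; the function and its arguments are assumptions for the example, only NotEnoughPeersError and shares_of_happiness come from the comment above:

```python
class NotEnoughPeersError(Exception):
    """Too many shareholders were lost to place shares_of_happiness shares."""

def check_happiness(remaining_shareholders, shares_of_happiness=75):
    remaining = len(remaining_shareholders)
    if remaining < shares_of_happiness:
        raise NotEnoughPeersError(
            "lost too many peers: %d shares still placeable, need %d"
            % (remaining, shares_of_happiness))
    # Otherwise keep going, and log how close we are to giving up.
    return remaining - shares_of_happiness
```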
Oops, it turns out that there is still a problem: if the peer is quietly lost before the upload starts, then the initial storage.WriteBucketProxy.start call (which writes a bunch of offsets into the remote share) will fail with some sort of connection-lost error (either when TCP times out, or when the storage server reconnects and replaces the existing connection). Failures in this particular method call are not caught in the same way as later failures, and any such failure will cause the upload to fail.
The task is to modify encode.Encoder.start():213, where those start() calls are made, to wrap them in the same kind of drop-that-server-on-error code that all the other remote calls use.
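Roughly, that wrapping would look like the sketch below; the helper name and its arguments are hypothetical, only WriteBucketProxy.start() comes from the ticket:

```python
# Hypothetical sketch: give the initial start() calls the same
# drop-that-server-on-error treatment as the later remote calls.
from twisted.internet import defer

def start_all_shareholders(shareholders, lost_shnums):
    """shareholders: dict mapping shnum -> WriteBucketProxy-like object.
    lost_shnums: set that collects shares whose servers failed."""
    def _start_one(shnum, proxy):
        d = proxy.start()   # writes the offset table into the remote share
        def _drop(failure):
            # Treat a connection-lost error here like any later failure:
            # forget this server and keep uploading to the rest.
            lost_shnums.add(shnum)
            del shareholders[shnum]
        d.addErrback(_drop)
        return d
    dl = [_start_one(shnum, proxy) for shnum, proxy in list(shareholders.items())]
    return defer.DeferredList(dl)
```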
This might be the cause of #193 (if the upload was stalled waiting for the lost peer's TCP connection to close), although I kind of doubt it. It might also be the cause of #253.
Fixed, in changeset:4c5518faefebc1c7. I think that's the last of them, so I'm closing out this ticket again.