upload needs to be tolerant of lost peers #17

Closed
opened 2007-04-28 19:10:27 +00:00 by warner · 4 comments

When we upload a file, we can tolerate not having enough peers (or those peers not offering enough space), based upon a threshold named "shares_of_happiness". We want to place 100 shares by default, and as long as we can find homes for at least 75 of them, we're happy.

But in the current source:src/allmydata/encode.py, if any of those peers go away while we're uploading, the entire upload fails. (Worse yet, the failure is not reported properly: there are a lot of unhandled errbacks in those Deferreds.)

`encode.Encoder._encoded_segment` needs to be changed to count failures rather than allowing them to kill off the whole segment (and thus the whole file). When the encode/upload process finishes, it needs to return both the roothash and a count of how many shares were successfully placed, so that the enclosing `upload.py` code can decide whether it's done or whether it needs to try again.
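
A minimal sketch of that failure-counting shape (hypothetical names such as `SegmentPusher` and `send_block`; not the real encode.py code), just to make the intent concrete:

```
# Minimal sketch of the failure-counting idea (assumed names, not the
# real encode.py code): push one segment's blocks to every surviving
# landlord, drop any landlord whose remote call fails, and report how
# many shares are still placed instead of erroring out.

from twisted.internet import defer


class SegmentPusher:
    def __init__(self, landlords):
        # share number -> remote bucket writer (e.g. a WriteBucketProxy)
        self.landlords = dict(landlords)

    def send_segment(self, segnum, blocks):
        dl = []
        for shnum, landlord in list(self.landlords.items()):
            d = landlord.send_block(segnum, blocks[shnum])
            # Count the failure instead of letting it kill the segment.
            d.addErrback(self._lost_landlord, shnum)
            dl.append(d)
        d = defer.DeferredList(dl)
        d.addCallback(self._count_shares)
        return d

    def _lost_landlord(self, failure, shnum):
        # Forget this peer; the rest of the segment continues without it.
        self.landlords.pop(shnum, None)

    def _count_shares(self, results):
        # The enclosing upload code compares this against
        # shares_of_happiness to decide whether to finish or retry.
        return len(self.landlords)
```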

At the moment, since we're bouncing storage nodes every hour to deal with the silent-lost connection issues, any upload that is in progress at :10 or :20 or :30 is going to fail in this fashion.

warner added the major, defect labels 2007-04-28 19:10:27 +00:00
warner self-assigned this 2007-04-28 19:10:27 +00:00
warner added the code label 2007-04-28 19:15:36 +00:00
warner added critical and removed major labels 2007-05-04 05:15:12 +00:00
Author

Oh, I think it gets worse: from some other tests I was doing, it looks like if you lose **all** peers, then the upload process goes into an infinite loop and slowly consumes more and more memory.

Author

changeset:6bb9debc166df756 and changeset:f4c048bbeba15f51 should address this: now we keep going as long as we can still place 'shares_of_happiness' shares (which defaults to 75, in our 25-of-100 encoding). There are log messages generated when this happens, to indicate how close we are to giving up.

If we lose so many peers that we go below shares-of-happiness, the upload fails with a NotEnoughPeersError exception.
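
Roughly, the per-peer drop described above could look like this (a sketch with assumed names, using stdlib logging as a stand-in for Tahoe's own log facility; not the changeset's actual code):

```
# Sketch of the behaviour described above (assumed names): each time a
# peer is lost we log how close we are to giving up, and only raise
# NotEnoughPeersError once fewer than shares_of_happiness shares remain.

import logging

log = logging.getLogger("upload")


class NotEnoughPeersError(Exception):
    pass


def drop_landlord(landlords, shnum, shares_of_happiness=75):
    landlords.pop(shnum, None)
    remaining = len(landlords)
    log.warning("lost share %d, %d shares still placed (need %d to stay happy)",
                shnum, remaining, shares_of_happiness)
    if remaining < shares_of_happiness:
        raise NotEnoughPeersError("only %d shares placed, below the happiness "
                                  "threshold of %d"
                                  % (remaining, shares_of_happiness))
```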

warner added the fixed label 2007-06-06 19:50:05 +00:00
Author

Oops, it turns out that there is still a problem: if the peer is quietly lost before the upload starts, then the initial `storage.WriteBucketProxy.start` call (which writes a bunch of offsets into the remote share) will fail with some sort of connection-lost error (either when TCP times out, or when the storage server reconnects and replaces the existing connection). Failures in this particular method call are not caught in the same way as later failures, and any such failure will cause the upload to fail.

The task is to modify `encode.Encoder.start():213`, where it says:

```
for l in self.landlords.values():
    d.addCallback(lambda res, l=l: l.start())
```

to wrap the `start()` calls in the same kind of drop-that-server-on-error code
that all the other remote calls use.
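
For illustration, the wrapped version could look roughly like this (class, method, and helper names are assumptions, not necessarily what the eventual patch uses):

```
# Sketch of the wrapping described above (assumed names): each
# landlord.start() call gets an errback that drops that landlord, and the
# upload only fails if we then fall below the shares_of_happiness threshold.

from twisted.internet import defer


class NotEnoughPeersError(Exception):
    pass


class EncoderStartSketch:
    def __init__(self, landlords, shares_of_happiness):
        self.landlords = dict(landlords)  # shareid -> WriteBucketProxy-like
        self.shares_of_happiness = shares_of_happiness

    def start(self):
        d = defer.succeed(None)
        for shareid in list(self.landlords):
            # Wrap each start() call so a lost peer is dropped rather than
            # killing the whole upload.
            d.addCallback(lambda res, s=shareid: self._start_one(s))
        return d

    def _start_one(self, shareid):
        d = self.landlords[shareid].start()
        d.addErrback(self._remove_shareholder, shareid, "start")
        return d

    def _remove_shareholder(self, why, shareid, where):
        # Same drop-that-server-on-error handling the later remote calls
        # use: forget the peer, then fail only if too few shares remain.
        self.landlords.pop(shareid, None)
        if len(self.landlords) < self.shares_of_happiness:
            raise NotEnoughPeersError("lost too many peers during %s" % where)
```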

This might be the cause of #193 (if the upload was stalled waiting for the
lost peer's TCP connection to close), although I kind of doubt it. It might
also be the cause of #253.

warner added code-encoding, major, 0.7.0 and removed code, critical, fixed labels 2008-01-26 00:15:28 +00:00
warner modified the milestone from 0.3.0 to 0.9.0 (Allmydata 3.0 final) 2008-01-26 00:15:28 +00:00
warner reopened this issue 2008-01-26 00:15:28 +00:00
Author

Fixed, in changeset:4c5518faefebc1c7. I *think* that's the last of them, so I'm closing out this ticket again.

warner added the fixed label 2008-01-28 19:24:50 +00:00
Reference: tahoe-lafs/trac-2024-07-25#17