the immutable uploader should call remote_abort on buckets that it knows it won't be using #1117
The immutable upload code doesn't seem to call the abort method of buckets on remote storage servers when an upload fails (e.g. because the user aborted it, or because peer selection didn't work out). This means that the placeholder sharefiles stick around and reduce available space until the client disconnects or they expire in some other way. The immutable upload code should call abort on any buckets it knows it won't be using, so that the space is released promptly.
(this was first reported in http://tahoe-lafs.org/pipermail/tahoe-dev/2010-July/004656.html)
Kevan: is this a regression vs. v1.6.0? My guess is that this bug was already present in v1.6.0, but if not please add the "regression" tag to this ticket!
The storage server doesn't exactly use placeholder files, but the internal how-much-space-have-i-committed-to code will indeed keep counting that space until an abort() is sent, so the uploader should definitely abort the shares as soon as it realizes it isn't going to use them. Otherwise the allocation will stick around until the server connection is dropped.
The share that's hanging out may also convince later uploaders to refrain from uploading a new copy of that same share. I think the server reports in-progress shares in exactly the same way as it reports really-complete shares. So failing to abort a share is likely to confuse later uploads too.
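To make that accounting concrete, here is a toy model (not Tahoe's actual StorageServer or BucketWriter; all names below are made up for illustration) of how space committed to in-progress uploads stays reserved until the writer is closed or aborted:

```python
# A toy model showing why space committed to an in-progress upload stays
# reserved until the writer is closed or aborted.  All names are illustrative.

class ToyBucketWriter:
    def __init__(self, server, allocated_size):
        self.server = server
        self.allocated_size = allocated_size

    def close(self):
        # Upload finished: the share becomes permanent, and the server
        # stops counting the reservation.
        self.server.writer_finished(self)

    def abort(self):
        # Upload abandoned: the partial share is discarded, and the server
        # should likewise stop counting the reservation.
        self.server.writer_finished(self)


class ToyStorageServer:
    def __init__(self, total_space):
        self.total_space = total_space
        self.active_writers = []

    def remaining_space(self):
        committed = sum(w.allocated_size for w in self.active_writers)
        return self.total_space - committed

    def allocate_bucket(self, size):
        if size > self.remaining_space():
            return None  # looks like "server is full" to later uploaders
        writer = ToyBucketWriter(self, size)
        self.active_writers.append(writer)
        return writer

    def writer_finished(self, writer):
        self.active_writers.remove(writer)
```

If the client never calls close() or abort(), the writer stays in active_writers until the connection is dropped, and remaining_space() keeps shrinking, which is the leak described above.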
From what I understand, the storage server only stops counting the space allocated to a bucket writer when that writer's remote_close method is called, since that causes the server's bucket_writer_closed method to be invoked, which removes the bucket writer from the active writers list. remote_abort, on the other hand, only deletes the incoming file associated with the bucket. If I haven't misunderstood, this issue then breaks down into: making the uploader call remote_abort on buckets that it knows it won't be using, and making remote_abort more like remote_close, only instead of moving the file from the incoming dir to the share dir, it needs to delete that file.

I've attached a patch that addresses both of these issues. This can be considered a backward-compatibility break for clients that were relying on the fact that abort()ing a bucket didn't cause the server to stop counting the space assigned to that bucket. I'm not sure how likely it is that there are any such clients.

In the tests for the client-side modifications, I use a fireEventually() to make sure that the abort messages have reached the storage server before I check that they were sent (the bucket writer proxy's abort call uses callRemoteOnly instead of callRemote, because it doesn't care so much about the result, which makes it harder to reason about when the messages reach their destination when testing). Is this reasonable? Is there a better way?

zooko: I think you're right; this bug seems to exist in 1.6.0 too, so this isn't a regression.
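For reference, here is a rough sketch of the server-side half of that breakdown, i.e. making remote_abort release the allocation the same way remote_close does (the class and attribute names are simplified stand-ins, not the code from Kevan's patch):

```python
import os

class BucketWriterSketch:
    """Simplified stand-in for a storage server's per-share writer."""

    def __init__(self, server, incominghome, finalhome):
        self.server = server
        self.incominghome = incominghome  # partial share under incoming/
        self.finalhome = finalhome        # final location under shares/

    def remote_close(self):
        # Finished share: move it out of incoming/ and tell the server to
        # stop counting the space it had reserved for this writer.
        os.rename(self.incominghome, self.finalhome)
        self.server.bucket_writer_closed(self)

    def remote_abort(self):
        # Old behaviour: only delete the partially written incoming file.
        if os.path.exists(self.incominghome):
            os.remove(self.incominghome)
        # Proposed behaviour: also leave the server's accounting, exactly
        # as remote_close does, so the reserved space is freed immediately.
        self.server.bucket_writer_closed(self)
```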
Attachment 1117patch.dpatch (12597 bytes) added
Since it is a bugfix and a patch exists, it is still a candidate for 1.7.1. Someone should review it carefully!
I just read through 1117patch.dpatch. I didn't see any mistakes, and the patch adds three unit tests: test_abort(), test_peer_selector_bucket_abort(), and test_encoder_bucket_abort(). However, I'm too sleepy to double-check all the uses of self.buckets in source:src/allmydata/immutable/upload.py and to understand exactly what those tests do, so I'm putting this off until tomorrow. (Anyone else should feel free to review this before I get around to it.)
Thinking about this a bit further (and in light of the persistent-failure-to-upload described in #1118), it's not a space problem, but rather a consequence of the way the server handles shares that it thinks are already in the process of being uploaded.
If an upload fails partway through (after allocate_buckets), such as how #1118 stopped at an assert statement, the storage servers will have BucketWriter objects with open filehandles to partially- (perhaps not-at-all-) written shares in incoming/. The client will neither close those shares, nor abort them, nor drop the connection, so they'll stick around until the client next restarts. When the client tries to upload the file a second time, its allocate_buckets call will hit source:src/allmydata/storage/server.py#L335, where the presence of the incoming/ file will cause the server to refuse to accept a new share, but not to claim that it already has the share (which is indistinguishable from the case where the server does not have enough space to accept the share).

This effectively disables those servers for that one file (i.e. for that one storage index). If the grid only has 10 servers, then a single failed upload is enough to leave the client with no servers that will accept shares, and all subsequent uploads of that file (until the client is restarted, severing the TCP connections and aborting the shares) will fail. If the grid has 20 servers, then two failed uploads are enough to get into this nothing-will-work state.
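The per-share decision described above boils down to something like the following sketch (this is not the actual code at server.py, just an illustration of its effect; the function and its parameters are invented for the example):

```python
import os

def decide_on_share(sharedir, incomingdir, storage_index, shnum, have_space):
    """Illustrative version of the server's choice for one requested share."""
    final = os.path.join(sharedir, storage_index, "%d" % shnum)
    partial = os.path.join(incomingdir, storage_index, "%d" % shnum)

    if os.path.exists(final):
        return "alreadygot"   # client is told this share is already stored
    if os.path.exists(partial):
        # A (possibly abandoned) upload left a file in incoming/: refuse to
        # allocate a new writer, but don't claim we already have the share.
        # To the client this is indistinguishable from "not enough space".
        return "refused"
    if not have_space:
        return "refused"
    return "allocated"        # create a new BucketWriter for this share
```

With a leftover file in incoming/, every retry falls into the "refused" branch, which is why repeated uploads of the same file keep failing until the client reconnects.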
As the rest of this ticket points out, the necessary fix is to examine the error paths out of the uploader code, to make sure that all paths result in the shares either being closed or aborted. This is non-trivial. We need to accumulate a list of remote BucketWriters as soon as they are received from the server (in response to the allocate_buckets call), and then have an addBoth handler (like a 'finally' block in a synchronous try/except clause) that aborts anything left in the list. When the upload succeeds, entries in this list should be removed as soon as the response to the close() message is received. Since the BucketWriter is received by the peer-selection code, whereas the best place for the addBoth handler is elsewhere (in the CHKUploader, or maybe the Uploader), it's not clear where this list ought to live.
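A minimal sketch of that accumulate-then-abort pattern, assuming hypothetical peer-selection and encoder collaborators that return Twisted Deferreds (this is not the actual CHKUploader code):

```python
class AbortingUploadSketch:
    """Hypothetical uploader wrapper that makes sure every bucket writer is
    either closed (on success) or aborted (on any failure)."""

    def __init__(self, peer_selector, encoder):
        self.peer_selector = peer_selector
        self.encoder = encoder
        self.outstanding = []   # remote bucket-writer references not yet closed

    def start(self):
        d = self.peer_selector.get_bucket_writers()  # wraps allocate_buckets()
        d.addCallback(self._remember)
        d.addCallback(self.encoder.push_shares)      # close()s writers it finishes
        d.addCallback(self._forget_closed)
        # addBoth behaves like a 'finally' clause: it runs on success and on
        # failure, so leftover writers always get aborted.
        d.addBoth(self._abort_leftovers)
        return d

    def _remember(self, writers):
        self.outstanding.extend(writers)
        return writers

    def _forget_closed(self, closed):
        for w in closed:
            self.outstanding.remove(w)
        return closed

    def _abort_leftovers(self, result):
        for w in self.outstanding:
            # fire-and-forget, as the ticket describes for abort messages
            w.callRemoteOnly("abort")
        self.outstanding = []
        return result   # pass the original success or Failure through
```

Wherever the list ends up living, the important property is that the abort step runs on every exit path, so a failed upload can no longer leave BucketWriters open on the servers.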
Attachment test-1117.diff (1604 bytes) added
test case to check that the code fails without the abort-shares patch
the test case in the patch I just attached fails without Kevan's patch. I have not yet confirmed that it passes with his patch.
I have just confirmed that my new test case does indeed pass when Kevan's patch is applied. The tests that are in his patch are ok, but they focus on allocated size, rather than the ability to perform a second upload (i.e. the lack of the buggy prevent-all-further-allocate_buckets behavior). So I think we should apply both patches.
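To make that distinction concrete, here is a toy test along the lines warner describes, checking the ability to perform a second upload rather than the allocated size (everything here is illustrative; it is not Tahoe's real test suite or storage server):

```python
import unittest

class ToyServer:
    """Minimal server model: a share is unavailable while an earlier,
    never-aborted upload attempt still holds it in incoming/."""
    def __init__(self):
        self.incoming = set()
        self.shares = set()

    def allocate(self, shnum):
        if shnum in self.shares or shnum in self.incoming:
            return False          # refused (or already stored)
        self.incoming.add(shnum)
        return True

    def abort(self, shnum):
        self.incoming.discard(shnum)

    def close(self, shnum):
        self.incoming.discard(shnum)
        self.shares.add(shnum)


class SecondUploadTest(unittest.TestCase):
    def test_second_upload_after_aborted_failure(self):
        server = ToyServer()
        # first attempt: allocate, then fail partway and abort the share
        self.assertTrue(server.allocate(0))
        server.abort(0)
        # because of the abort, a retry can allocate the same share
        self.assertTrue(server.allocate(0))

    def test_second_upload_after_unaborted_failure(self):
        server = ToyServer()
        # first attempt: allocate, then fail without aborting (the bug)
        self.assertTrue(server.allocate(0))
        # the leftover incoming share blocks the retry
        self.assertFalse(server.allocate(0))


if __name__ == "__main__":
    unittest.main()
```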
In changeset:16bb529339e6cbd5: