new immutable file upload protocol: streaming, fewer round-trips, quota-respecting #1851
Reference: tahoe-lafs/trac-2024-07-25#1851
Here is a letter Brian wrote in 2008 about an improved upload protocol:
https://tahoe-lafs.org/pipermail/tahoe-dev/2008-May/000630.html
The letter describes several improvements. The first couple of improvements are about disk-full conditions, quotas, and read-only mode, and we've implemented most or all of that. The second part of the letter describes a new upload protocol that would be more efficient. Let's implement that! Then you can close this ticket.
Here's the part of Brian's 2008-May letter that I mean for this ticket (the rest of his letter is already implemented):
"""
Then we plan to modify the immutable-share storage server protocol (which
currently consists of allocate_buckets() and get_buckets()) to get rid of the
RIBucketWriter objects and instead use a single method as follows:
The "upload_index" is an as-yet-unfinished token that allows a server to upload a share in pieces (one segment per message) without holding a foolscap Referenceable the whole time. This should improve resumed uploads. "writev=" is your usual write vector, a list of (offset, data) pairs. The "close=" flag indicates whether this is the last segment or not, serving the same purpose as the IPv4 "no more fragments" bit: when the server sees close=True, it should terminate the upload_index and make the finished share visible to other clients. If the client doesn't close the upload_index in a timely fashion, the server can delete the partial share.
expected_size= is advisory, and tells the storage server how large the client expects this share to become. It is optional: if the client is streaming a file, it may not know how large the file will be, and cannot provide an expected size. The server uses this advice to make a guess about how much free space is left.
If the server accepts the write (i.e. it did not run out of space while writing the share to disk, and it wasn't in a read-only mode), it returns accepted=True. It also returns an indication of how much free space it thinks it has left: this will be the 'df' space, minus the reserved space, minus the sum of all other expected_size= values (TODO: maybe it should include this one too, obviously we must be clear about which approach we take).
The client will use the remaining_space= response to decide whether it should continue sending segments to this server, or if it thinks that the server is likely to run out of space before it finishes sending the share (and therefore might want to switch to a different server before wasting too much work on the full one).
For single-segment files, the client will generate all shares, then send them speculatively to N candidate servers (i.e. peer selection will just return the first N servers in the introducer's list of non-readonly storage servers). Each share will have just one block, and just one upload call, in which the close= flag is True. These servers will either accept the share or reject it (because of insufficient space). Any share which is rejected will be submitted to the next candidate server on the permuted list. This approach gets us a single roundtrip for small files when all servers have free space. When some servers are full, we lose one block of network bandwidth for each full server, and add at least one roundtrip. If clients think that servers are likely to be full and want to avoid the wasted bandwidth, they could spend an extra roundtrip by doing a small write and checking the accepted= response before committing to sending the full block.
For multi-segment files, the client will generate the first segment's blocks, and send it speculatively to N candidate servers, along with its expected_size= (if available). These blocks will be retained in memory until a server accepts them. The client has a choice about how much pipelining it will do: it may encode additional segments and send them to the same servers, or it might wait until the responses to the first segment come back. When those responses come back, the client will drop any servers which reject the first block, or whose remaining_space= indicates that the share won't fit.
Dropped servers will be replaced by the next candidate in the permuted list, and the same blocks are sent again. The client will pipeline some number of blocks (allowing multiple upload messages to be outstanding at once, each being retired by a successful ack response) that depends upon how much memory it wants to spend vs how much of the bandwidth-delay product it wants to utilize.
The client has a "client soft threshold", which is the minimum remaining_space= value that it is willing to tolerate. This implements a tradeoff between storage utilization and chance of uploading the file successfully on the first try. If this margin is too small, the client might send the whole share to the server only to have the very last block be rejected due to lack of space. But if the margin is too high, the client may forego using mostly-full-but-still-useable servers.
The server cannot provide a guarantee of space. But the probability that a non-initial block will be rejected can be made very small by:
If a client loses this gamble (i.e. the server rejects one of their non-initial blocks), they must either abandon that share (and wind up with a less-than-100%-health file, in which fewer than N shares were placed), or they must find a new home for that share and restart the encoder (which means more round-trips and possibly more memory consumption.. one approach would be to stall all other shares while we re-encode the earlier segments for the new server and catch them up, then proceed forwards with the remaining segments for all servers in parallel).
Since the chance of being rejected is highest for the first block (since the client does not yet have any information about the server, indeed they cannot be sure that the server is still online), it makes sense to hold on to the first segment's blocks until that response has been received. An optimistic client which was desperate to reduce memory footprint and improve throughput could conceivably stream the whole file to candidate servers without waiting for an ack, then look for responses and restart encoding if there were any failures.
For streaming/resumeability, the storage protocol could also use a way to abort an upload (to accelerate the share-unfinished-for-too-long timeout) when the client decides to move to some other server (because there is not enough space left).
"""
As discussed on comment:4:ticket:2110, another desirable feature of a new upload protocol is that a second, concurrent, upload of the same immutable file succeeds. I think that the New Upload Protocol sketched out above has this property.
... multiple concurrent uploads, to be more precise.
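To make the concurrent-upload point concrete, here is a rough in-memory sketch of the call the letter describes. It is only a model, not Tahoe-LAFS code: the class name, the method name `upload`, the argument order, and the `keep_server` helper are all made up for illustration, since the actual method signature is not shown in the quoted text. It also assumes that only the not-yet-written part of each other upload's expected_size= counts against free space, which is one possible reading of the "minus the sum of all other expected_size= values" rule. Partial shares are keyed by upload_index, so two uploads of the same share do not collide at the protocol level.

```python
from dataclasses import dataclass, field


@dataclass
class PartialShare:
    data: bytearray = field(default_factory=bytearray)
    expected_size: int = 0  # advisory; 0 when the client did not provide one


class SketchStorageServer:
    """Hypothetical in-memory model of the proposed single upload method."""

    def __init__(self, capacity, reserved=0):
        self.capacity = capacity   # stands in for the size of the disk
        self.reserved = reserved   # reserved space the server never hands out
        self.readonly = False
        self.partial = {}          # upload_index -> PartialShare
        self.finished = {}         # (storage_index, sharenum) -> bytes

    def _df(self):
        # Model of 'df': capacity minus every byte written so far.
        written = sum(len(b) for b in self.finished.values())
        written += sum(len(p.data) for p in self.partial.values())
        return self.capacity - written

    def remaining_space(self, upload_index):
        # 'df' space, minus reserved space, minus what *other* in-progress
        # uploads still expect to write (assumption: unwritten part only).
        others = sum(max(p.expected_size - len(p.data), 0)
                     for idx, p in self.partial.items() if idx != upload_index)
        return self._df() - self.reserved - others

    def upload(self, storage_index, sharenum, upload_index, writev,
               expected_size=None, close=False):
        """Apply one write vector; return (accepted, remaining_space)."""
        if self.readonly:
            return False, self.remaining_space(upload_index)
        share = self.partial.get(upload_index)
        if share is None:
            share = PartialShare(expected_size=expected_size or 0)
        # How many new bytes would this write vector add to the share?
        end = max([len(share.data)] + [off + len(d) for off, d in writev])
        growth = end - len(share.data)
        if growth > self.remaining_space(upload_index):
            return False, self.remaining_space(upload_index)
        self.partial[upload_index] = share
        share.data.extend(bytes(growth))          # zero-fill up to the new end
        for off, d in writev:
            share.data[off:off + len(d)] = d      # same-length overwrite
        if close:
            # Last segment: retire the upload_index, publish the share.
            done = self.partial.pop(upload_index)
            self.finished[(storage_index, sharenum)] = bytes(done.data)
        return True, self.remaining_space(upload_index)


def keep_server(accepted, remaining_space, bytes_left, soft_threshold):
    """Client-side check (one reading of the 'client soft threshold')."""
    return accepted and remaining_space >= max(soft_threshold, bytes_left)
```

A client would call `upload(...)` once per segment, pass `close=True` on the last one, and feed each `(accepted, remaining_space)` response into something like `keep_server` to decide whether to stay with that server or move to the next candidate on the permuted list.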
Actually I think that, although the above protocol allows correct behaviour for concurrent uploads of a file, it doesn't actually make the correct behaviour any easier than for the current protocol. (The problem is the fact that there's only provision for one copy of a given file in the 'incoming' directory, which is an implementation issue on the server side, not a protocol issue.)
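One way the server-side issue might be addressed is to give each upload its own file under 'incoming', so that concurrent uploads of the same share never share a partial file. The sketch below is only an illustration of that idea under stated assumptions: the directory layout, the function names, and the "first close wins, later closes are no-ops" rule are all hypothetical and are not how the current storage server works. It relies on the fact that every correct upload of a given immutable share writes the same bytes.

```python
import os


def incoming_path(root, storage_index, sharenum, upload_index):
    # Hypothetical layout: one partial file per (share, upload_index).
    return os.path.join(root, "incoming", storage_index,
                        f"{sharenum}.{upload_index}")


def final_path(root, storage_index, sharenum):
    return os.path.join(root, "shares", storage_index, str(sharenum))


def write_segment(root, storage_index, sharenum, upload_index, offset, data):
    """Write one (offset, data) pair into this upload's private partial file."""
    path = incoming_path(root, storage_index, sharenum, upload_index)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(offset)
        f.write(data)


def close_upload(root, storage_index, sharenum, upload_index):
    """Publish the share; return True if this upload was the one that did so."""
    src = incoming_path(root, storage_index, sharenum, upload_index)
    dst = final_path(root, storage_index, sharenum)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if os.path.exists(dst):
        # Another upload already finished this share. The contents are
        # assumed identical, so just discard our partial copy.
        os.remove(src)
        return False
    # Note: a real server would need to close the check-then-rename race;
    # this single-threaded sketch does not attempt that.
    os.replace(src, dst)
    return True
```

With this layout, two clients can interleave `write_segment` calls for the same share under different upload_index values, and both see their upload succeed from their own point of view even though only one partial file is promoted.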