new immutable file upload protocol: streaming, fewer round-trips, quota-respecting #1851

Open
opened 2012-11-09 06:53:17 +00:00 by zooko · 4 comments

Here is a letter Brian wrote in 2008 about an improved upload protocol:

https://tahoe-lafs.org/pipermail/tahoe-dev/2008-May/000630.html

The letter describes several improvements. The first couple of improvements are about disk-full conditions, quotas, and read-only mode, and we've implemented most or all of that. The second part of the letter describes a new upload protocol that would be more efficient. Let's implement that! Then you can close this ticket.

Here is a letter Brian wrote in 2008 about an improved upload protocol: <https://tahoe-lafs.org/pipermail/tahoe-dev/2008-May/000630.html> The letter describes several improvements. The first couple of improvements are about disk-full conditions, quotas, and read-only mode, and we've implemented most or all of that. The second part of the letter describes a new upload protocol that would be more efficient. Let's implement that! Then you can close this ticket.
zooko added the
code-storage
normal
enhancement
1.9.2
labels 2012-11-09 06:53:17 +00:00
zooko added this to the undecided milestone 2012-11-09 06:53:17 +00:00
Author

Here's the part of Brian's 2008-May letter that I mean for this ticket (the rest of his letter is already implemented):

"""
Then we plan to modify the immutable-share storage server protocol (which
currently consists of allocate_buckets() and get_buckets()) to get rid of the
RIBucketWriter objects and instead use a single method as follows:

 def upload_immutable_share(upload_index, storage_index, sharenum,
                            writev, close=bool, expected_size=bool_or_None):
     return (accepted=bool, remaining_space=int)

The "upload_index" is an as-yet-unfinished token that allows a server to upload a share in pieces (one segment per message) without holding a foolscap Referenceable the whole time. This should improve resumed uploads. "writev=" is your usual write vector, a list of (offset, data) pairs. The "close=" flag indicates whether this is the last segment or not, serving the same purpose as the IPv4 "no more fragments" bit: when the server sees close=True, it should terminate the upload_index and make the finished share visible other clients. If the client doesn't close the upload_index in a timely fashion, the server can delete the partial share.

expected_size= is advisory, and tells the storage server how large the client expects this share to become. It is optional: if the client is streaming a file, it may not know how large the file will be, and cannot provide an expected size. The server uses this advice to make a guess about how much free space is left.

If the server accepts the write (i.e. it did not run out of space while writing the share to disk, and it wasn't in a read-only mode), it returns accepted=True. It also returns an indication of how much free space it thinks it has left: this will be the 'df' space, minus the reserved space, minus the sum of all other expected_size= values (TODO: maybe it should include this one too, obviously we must be clear about which approach we take).

The client will use the remaining_space= response to decide whether it should continue sending segments to this server, or if it thinks that the server is likely to run out of space before it finishes sending the share (and therefore might want to switch to a different server before wasting too much work on the full one).

For single-segment files, the client will generate all shares, then send them speculatively to N candidate servers (i.e. peer selection will just return the first N servers in the introducer's list of non-readonly storage servers). Each share will have just one block, and just one upload call, in which the close= flag is True. These servers will either accept the share or reject it (because of insufficient space). Any share which is rejected will
be submitted to the next candidate server on the permuted list. This approach gets us a single roundtrip for small files when all servers have free space. When some servers are full, we lose one block of network bandwidth for each full server, and add at least one roundtrip. If clients think that servers are likely to be full and want to avoid the wasted bandwidth, they could spend an extra roundtrip by doing a small write and checking the accepted= response before committing to sending the full block.

For multi-segment files, the client will generate the first segment's blocks, and send it speculatively to N candidate servers, along with its expected_size= (if available). These blocks will be retained in memory until a server accepts them. The client has a choice about how much pipelining it will do: it may encode additional segments and send them to the same servers, or it might wait until the responses to the first segment come back. When those responses come back, the client will drop any servers which reject the first block, or whose remaining_space= indicates that the share won't fit.

Dropped servers will be replaced by the next candidate in the permuted list, and the same blocks are sent again. The client will pipeline some number of blocks (allowing multiple upload messages to be outstanding at once, each being retired by a successful ack response) that depends upon how much memory it wants to spend vs how much of the bandwidth-delay product it wants to utilize.

The client has a "client soft threshold", which is the minimum remaining_space= value that it is willing to tolerate. This implements a tradeoff between storage utilization and chance of uploading the file successfully on the first try. If this margin is too small, the client might send the whole share to the server only to have the very last block be rejected due to lack of space. But if the margin is too high, the client may forego using mostly-full-but-still-useable servers.

The server cannot provide a guarantee of space. But the probability that a non-initial block will be rejected can be made very small by:

  • all clients providing accurate expected_size= information
  • the server maintaining accurate df measurements
  • clients paying close attention to the remaining_space= responses

If a client loses this gamble (i.e. the server rejects one of their non-initial blocks), they must either abandon that share (and wind up with a less-than-100%-health file, in which fewer than N shares were placed), or they must find a new home for that share and restart the encoder (which means more round-trips and possibly more memory consumption.. one approach would be to stall all other shares while we re-encode the earlier segments for the new server and catch them up, then proceed forwards with the remaining segments for all servers in parallel).

Since the chance of being rejected is highest for the first block (since the client does not yet have any information about the server, indeed they cannot be sure that the server is still online), it makes sense to hold on to the first segment's blocks until that response has been received. An optimistic client which was desperate to reduce memory footprint and improve throughput could conceivably stream the whole file to candiate servers without waiting for an ack, then look for responses and restart encoding if there were any failures.

For streaming/resumeability, the storage protocol could also use a way to abort an upload (to accelerate the share-unfinished-for-too-long timeout) when the client decides to move to some other server (because there is not enough space left).
"""

Here's the part of Brian's 2008-May letter that I mean for this ticket (the rest of his letter is already implemented): """ Then we plan to modify the immutable-share storage server protocol (which currently consists of allocate_buckets() and get_buckets()) to get rid of the RIBucketWriter objects and instead use a single method as follows: ``` def upload_immutable_share(upload_index, storage_index, sharenum, writev, close=bool, expected_size=bool_or_None): return (accepted=bool, remaining_space=int) ``` The "`upload_index`" is an as-yet-unfinished token that allows a server to upload a share in pieces (one segment per message) without holding a foolscap Referenceable the whole time. This should improve resumed uploads. "`writev=`" is your usual write vector, a list of `(offset, data)` pairs. The "`close=`" flag indicates whether this is the last segment or not, serving the same purpose as the IPv4 "no more fragments" bit: when the server sees `close=True`, it should terminate the `upload_index` and make the finished share visible other clients. If the client doesn't close the `upload_index` in a timely fashion, the server can delete the partial share. `expected_size=` is advisory, and tells the storage server how large the client expects this share to become. It is optional: if the client is streaming a file, it may not know how large the file will be, and cannot provide an expected size. The server uses this advice to make a guess about how much free space is left. If the server accepts the write (i.e. it did not run out of space while writing the share to disk, and it wasn't in a read-only mode), it returns `accepted=True`. It also returns an indication of how much free space it thinks it has left: this will be the 'df' space, minus the reserved space, minus the sum of all other `expected_size=` values (TODO: maybe it should include this one too, obviously we must be clear about which approach we take). The client will use the `remaining_space=` response to decide whether it should continue sending segments to this server, or if it thinks that the server is likely to run out of space before it finishes sending the share (and therefore might want to switch to a different server before wasting too much work on the full one). For single-segment files, the client will generate all shares, then send them speculatively to N candidate servers (i.e. peer selection will just return the first N servers in the introducer's list of non-readonly storage servers). Each share will have just one block, and just one upload call, in which the `close=` flag is `True`. These servers will either accept the share or reject it (because of insufficient space). Any share which is rejected will be submitted to the next candidate server on the permuted list. This approach gets us a single roundtrip for small files when all servers have free space. When some servers are full, we lose one block of network bandwidth for each full server, and add at least one roundtrip. If clients think that servers are likely to be full and want to avoid the wasted bandwidth, they could spend an extra roundtrip by doing a small write and checking the `accepted=` response before committing to sending the full block. For multi-segment files, the client will generate the first segment's blocks, and send it speculatively to N candidate servers, along with its `expected_size=` (if available). These blocks will be retained in memory until a server accepts them. The client has a choice about how much pipelining it will do: it may encode additional segments and send them to the same servers, or it might wait until the responses to the first segment come back. When those responses come back, the client will drop any servers which reject the first block, or whose `remaining_space=` indicates that the share won't fit. Dropped servers will be replaced by the next candidate in the permuted list, and the same blocks are sent again. The client will pipeline some number of blocks (allowing multiple upload messages to be outstanding at once, each being retired by a successful ack response) that depends upon how much memory it wants to spend vs how much of the bandwidth-delay product it wants to utilize. The client has a "client soft threshold", which is the minimum `remaining_space=` value that it is willing to tolerate. This implements a tradeoff between storage utilization and chance of uploading the file successfully on the first try. If this margin is too small, the client might send the whole share to the server only to have the very last block be rejected due to lack of space. But if the margin is too high, the client may forego using mostly-full-but-still-useable servers. The server cannot provide a guarantee of space. But the probability that a non-initial block will be rejected can be made very small by: * all clients providing accurate expected_size= information * the server maintaining accurate df measurements * clients paying close attention to the remaining_space= responses If a client loses this gamble (i.e. the server rejects one of their non-initial blocks), they must either abandon that share (and wind up with a less-than-100%-health file, in which fewer than N shares were placed), or they must find a new home for that share and restart the encoder (which means more round-trips and possibly more memory consumption.. one approach would be to stall all other shares while we re-encode the earlier segments for the new server and catch them up, then proceed forwards with the remaining segments for all servers in parallel). Since the chance of being rejected is highest for the first block (since the client does not yet have any information about the server, indeed they cannot be sure that the server is still online), it makes sense to hold on to the first segment's blocks until that response has been received. An optimistic client which was desperate to reduce memory footprint and improve throughput could conceivably stream the whole file to candiate servers without waiting for an ack, then look for responses and restart encoding if there were any failures. For streaming/resumeability, the storage protocol could also use a way to abort an upload (to accelerate the share-unfinished-for-too-long timeout) when the client decides to move to some other server (because there is not enough space left). """
Author

As discussed on comment:4:ticket:2110, another desirable feature of a new upload protocol is that a second, concurrent, upload of the same immutable file succeeds. I think that the New Upload Protocol sketched out above has this property.

As discussed on comment:4:ticket:2110, another desirable feature of a new upload protocol is that a second, concurrent, upload of the same immutable file succeeds. I *think* that the New Upload Protocol sketched out above has this property.
daira commented 2013-11-28 22:00:49 +00:00
Owner

... multiple concurrent uploads, to be more precise.

... multiple concurrent uploads, to be more precise.
daira commented 2013-11-28 22:04:50 +00:00
Owner

Actually I think that, although the above protocol allows correct behaviour for concurrent uploads of a file, it doesn't actually make the correct behaviour any easier than for the current protocol. (The problem is the fact that there's only provision for one copy of a given file in the 'incoming' directory, which is an implementation issue on the server side, not a protocol issue.)

Actually I think that, although the above protocol allows correct behaviour for concurrent uploads of a file, it doesn't actually make the correct behaviour any easier than for the current protocol. (The problem is the fact that there's only provision for one copy of a given file in the 'incoming' directory, which is an implementation issue on the server side, not a protocol issue.)
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1851
No description provided.