support streaming uploads in uploader #1288

New Issue

tahoe-lafs · 2011-01-03T05:07:54Z

davidsarah commented

2011-01-03 05:07:54 +00:00

#320 is about supporting streaming uploads in the HTTP web-API.

In the case of the SFTP frontend, there is no problem with getting at the upload stream, unlike HTTP (see /tahoe-lafs/trac-2024-07-25/issues/5382#comment:21). So we could implement streaming upload immediately for SFTP at least in some cases (see #1041 for details), if the uploader itself supported it.

This ticket is about streaming support in the uploader itself. It looks like the current IUploadable interface isn't really suited to streaming (for example it has a get_size method, and it pulls the data when a "push" approach would be more appropriate), so there is some design work to do that is independent of HTTP.

#320 is about supporting streaming uploads in the HTTP web-API. In the case of the SFTP frontend, there is no problem with getting at the upload stream, unlike HTTP (see [/tahoe-lafs/trac-2024-07-25/issues/5382](/tahoe-lafs/trac-2024-07-25/issues/5382)#comment:21). So we could implement streaming upload immediately for SFTP at least in some cases (see #1041 for details), if the uploader itself supported it. This ticket is about streaming support in the uploader itself. It looks like the current `IUploadable` interface isn't really suited to streaming (for example it has a `get_size` method, and it pulls the data when a "push" approach would be more appropriate), so there is some design work to do that is independent of HTTP.

tahoe-lafs added the

labels 2011-01-03 05:07:54 +00:00

tahoe-lafs added this to the undecided milestone 2011-01-03 05:07:54 +00:00

zooko commented

2011-07-28 04:35:20 +00:00

Note that doing a streaming upload—where the storage servers are accepting and storing the first blocks of your file from you before you (the storage client) have even looked at the last blocks of that file—is inherently incompatible with client-side deduplication—where you realize that the file is already stored before you upload the first block.

If we wanted to implement this ticket and still to support client-side deduplication, which saves upload bandwidth and server-side storage space, then we'd have to make it be an option. For this upload do you want to make a pass over the data first, to see if it is already stored and you might be able to skip the upload, or do you want to do a streaming upload, where the storage client (== Tahoe-LAFS gateway) does not have to store temporary copy of the entire file in order to make two passes over it?

A streaming upload could be compatible with server-side deduplication, where after the last block of the share is uploaded, the server says "Oh look, I already have a copy of this share. I'll just delete the new one and add a new lease to the old one.". This doesn't help with upload bandwidth but conserves server-side storage space.

Note that doing a streaming upload—where the storage servers are accepting and storing the first blocks of your file from you before you (the storage client) have even looked at the last blocks of that file—is inherently incompatible with client-side deduplication—where you realize that the file is already stored before you upload the first block. If we wanted to implement this ticket and still to support client-side deduplication, which saves upload bandwidth and server-side storage space, then we'd have to make it be an option. For this upload do you want to make a pass over the data first, to see if it is already stored and you might be able to skip the upload, or do you want to do a streaming upload, where the storage client (== Tahoe-LAFS gateway) does not have to store temporary copy of the entire file in order to make two passes over it? A streaming upload could be compatible with server-side deduplication, where after the last block of the share is uploaded, the server says "Oh look, I already have a copy of this share. I'll just delete the new one and add a new lease to the old one.". This doesn't help with upload bandwidth but conserves server-side storage space.

davidsarah commented

2011-07-28 18:50:14 +00:00

Replying to zooko:

If we wanted to implement this ticket and still to support client-side deduplication, which saves upload bandwidth and server-side storage space, then we'd have to make it be an option. For this upload do you want to make a pass over the data first, to see if it is already stored and you might be able to skip the upload, or do you want to do a streaming upload, where the storage client (== Tahoe-LAFS gateway) does not have to store temporary copy of the entire file in order to make two passes over it?

Well, another possibility is that the client starts to upload the file, but aborts the upload if it finishes making a pass over the data and detects that it was already stored. That might make sense if the client is receiving the file faster than it is able to upload it.

A difficulty here is that without knowing the file's hash, the client can't determine the optimum set of servers to store shares on. But if the number of servers on the grid were not much greater than shares.total, then that might not matter, because it could start uploading shares to all servers. (Or there could be some cleverer way to work around this problem that I'm not seeing right now.)

Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/1288#issuecomment-81677): > If we wanted to implement this ticket and still to support client-side deduplication, which saves upload bandwidth and server-side storage space, then we'd have to make it be an option. For this upload do you want to make a pass over the data first, to see if it is already stored and you might be able to skip the upload, or do you want to do a streaming upload, where the storage client (== Tahoe-LAFS gateway) does not have to store temporary copy of the entire file in order to make two passes over it? Well, another possibility is that the client starts to upload the file, but aborts the upload if it finishes making a pass over the data and detects that it was already stored. That might make sense if the client is receiving the file faster than it is able to upload it. A difficulty here is that without knowing the file's hash, the client can't determine the optimum set of servers to store shares on. But if the number of servers on the grid were not much greater than `shares.total`, then that might not matter, because it could start uploading shares to all servers. (Or there could be some cleverer way to work around this problem that I'm not seeing right now.)

zooko commented

2011-07-28 20:02:07 +00:00

Replying to [davidsarah]comment:2:

Well, another possibility is that the client starts to upload the file, but aborts the upload if it finishes making a pass over the data and detects that it was already stored. That might make sense if the client is receiving the file faster than it is able to upload it.

Hey, that is a very good idea. If you're streaming a file to a gateway for it to encrypt, erasure-code, and distribute among servers, then the gateway could dynamically choose to what degree it wanted to read the file from you faster than it can upload it, store it in temporary storage, and precompute the hash of it for deduplication purposes and to what degree it wanted to read the file from you only as fast as it could upload it to storage servers.

A difficulty here is that without knowing the file's hash, the client can't determine the optimum set of servers to store shares on. But if the number of servers on the grid were not much greater than shares.total, then that might not matter, because it could start uploading shares to all servers. (Or there could be some cleverer way to work around this problem that I'm not seeing right now.)

Brian and I have discussed this. I think we should start by conceiving of the "server selector" as a potentially different thing from the "file identifier". The former is what you need to have to choose which servers to contact first. The latter is what you send to a server to indicate to the server which file out of all the files it knows about.

Only the "server selector" part has to be known before upload begins. Another fact is that the server selector does not necessarily need to have a lot of information in it. For example, what if it were a 2-byte random value? That would define 65,536 ways to search any given set of servers (e.g. permute the list of servers according to this 2-byte server selector).

(The "file identifier" part does need to be collision-free: #753.)

There are some notes about these topics: #654, #482, wiki/ServerSelection, #467, #872.

Replying to [davidsarah]comment:2: > > Well, another possibility is that the client starts to upload the file, but aborts the upload if it finishes making a pass over the data and detects that it was already stored. That might make sense if the client is receiving the file faster than it is able to upload it. Hey, that is a very good idea. If you're streaming a file to a gateway for it to encrypt, erasure-code, and distribute among servers, then the gateway could dynamically choose to what degree it wanted to read the file from you faster than it can upload it, store it in temporary storage, and precompute the hash of it for deduplication purposes and to what degree it wanted to read the file from you only as fast as it could upload it to storage servers. > A difficulty here is that without knowing the file's hash, the client can't determine the optimum set of servers to store shares on. But if the number of servers on the grid were not much greater than `shares.total`, then that might not matter, because it could start uploading shares to all servers. (Or there could be some cleverer way to work around this problem that I'm not seeing right now.) Brian and I have discussed this. I think we should start by conceiving of the "server selector" as a potentially different thing from the "file identifier". The former is what you need to have to choose which servers to contact first. The latter is what you send to a server to indicate to the server which file out of all the files it knows about. Only the "server selector" part has to be known before upload begins. Another fact is that the server selector does not necessarily need to have a lot of information in it. For example, what if it were a 2-byte random value? That would define 65,536 ways to search any given set of servers (e.g. permute the list of servers according to this 2-byte server selector). (The "file identifier" part does need to be collision-free: #753.) There are some notes about these topics: #654, #482, [wiki/ServerSelection](wiki/ServerSelection), #467, #872.

davidsarah commented

2011-07-28 22:19:34 +00:00

I'd been thinking that to support convergence, the server selector has to be based on a hash of the whole file. But that's not necessarily true: it could be based on a hash of a prefix of the file (say the first segment), and the convergence secret. This would make no difference for small files, but for large files it would allow the server selector to be calculated more quickly, perhaps before the rest of the file is known.

This would mean that files with the same prefix in a given convergence set would always have the same selector. But arguably the server selector only has to have sufficient diversity to average out the consumed space among servers for a reasonably large collection of files. If it's sufficiently unusual to have lots of files with the same prefix in a given convergence set, that this goal is achieved even if the server selector only depends on the prefix (within that set).

I'd been thinking that to support convergence, the server selector has to be based on a hash of the whole file. But that's not necessarily true: it could be based on a hash of a prefix of the file (say the first segment), and the convergence secret. This would make no difference for small files, but for large files it would allow the server selector to be calculated more quickly, perhaps before the rest of the file is known. This would mean that files with the same prefix in a given convergence set would always have the same selector. But arguably the server selector only has to have sufficient diversity to average out the consumed space among servers for a reasonably large collection of files. If it's sufficiently unusual to have lots of files with the same prefix in a given convergence set, that this goal is achieved even if the server selector only depends on the prefix (within that set).

davidsarah commented

2012-04-12 01:16:43 +00:00

Zooko, Brian and I discussed this again on #tahoe-lafs.

Goals:

The SI should act as a verify cap and be sufficient for servers to verify the whole contents of any share they hold.
For converging files, the share with a given shnum should be recognizable as the same share, and never stored more than once on a given server.
If the client knows a suitable hash of the file plaintext and convergence secret before the upload, then it should be able to avoid the cost of uploading shares to servers that already have them, and also avoid encryption and erasure coding costs if shares have already been uploaded to enough servers. (If the protocol involves revealing this hash to storage servers, then it must not allow performing a guessing attack against the plaintext without knowing the convergence secret.)
If the client learns that hash during the upload, then it should be able to abort any uploads of shares to servers that already have them.
Streaming uploads should be possible, i.e. a client should be able to start to upload a large file without knowing its whole contents or length.
For each file, the ideal share placement should be a pseudorandom distribution among servers that is a deterministic function of that file and the convergence secret. It is not necessary for this to be a function that uses the whole file contents, or for it to be cryptographically random.
When doing an upload, the ideal share placement should be computable quickly by the client (e.g. using only a prefix of the file contents).
When doing a download, the ideal share placement must be computable from the SI.
Each server should be able to account for the space used by a given accounting principal, including the space that it uses temporarily during an upload (even if the upload will fail).
j. The sizes of read and write caps should be minimized. The size of a verify cap/SI is less important but should still be fairly small.
k. A downloader should have sufficient information, given the read cap and downloaded shares, to be able to check the integrity of the plaintext even if its decryption and erasure decoding routines are incorrect.
l. The verify cap for a file should be derivable off-line from the read cap.
m. If deep-verify caps are supported, the deep-verify cap for a file should be derivable off-line from the read cap, and the verify cap from the deep-verify vap.

All of these goals can be achieved simultaneously by a variation on Rainhill 3, or the simpler Rainhill 3x that does not support deep-verify caps. For simplicity, I'll just describe the variation on Rainhill 3x here:

EncP_R is used as the plaintext hash for goals c and d. The client can at any time ask whether a server has a share with a given EncP_R. (To do this lookup efficiently, the storage server must index shares by this as well as the SI, but that is easy if we're using a database.)
the server selection index is computed as SSI = hash(CS, first segment of Plain_R, Params), using a hash distinct from any other in the protocol. SSI is included as an extra field in all cap types.
before uploading, a client reserves a certain amount of space for that upload with its accounting principal credentials (e.g. using a signed message). If it did not know the file size in advance and needs more space, it can increase the reservation.
during the upload both the client and server compute the SI. At the end, the server discards the share if it is a duplicate of one it already has; otherwise, it compares the SI it computed with the one the client tells it, and retains the share if they match.

Zooko, Brian and I [discussed this again on #tahoe-lafs](http://fred.submusic.ch/irc/tahoe-lafs/2012-04-11#i_585815). Goals: 1. The SI should act as a verify cap and be sufficient for servers to verify the whole contents of any share they hold. 2. For converging files, the share with a given shnum should be recognizable as the same share, and never **stored** more than once on a given server. 3. If the client knows a suitable hash of the file plaintext and convergence secret before the upload, then it should be able to avoid the cost of uploading shares to servers that already have them, and also avoid encryption and erasure coding costs if shares have already been uploaded to enough servers. (If the protocol involves revealing this hash to storage servers, then it must not allow performing a guessing attack against the plaintext without knowing the convergence secret.) 4. If the client learns that hash during the upload, then it should be able to abort any uploads of shares to servers that already have them. 5. Streaming uploads should be possible, i.e. a client should be able to start to upload a large file without knowing its whole contents or length. 6. For each file, the ideal share placement should be a pseudorandom distribution among servers that is a deterministic function of that file and the convergence secret. It is not necessary for this to be a function that uses the whole file contents, or for it to be cryptographically random. 7. When doing an upload, the ideal share placement should be computable quickly by the client (e.g. using only a prefix of the file contents). 8. When doing a download, the ideal share placement must be computable from the SI. 1. Each server should be able to account for the space used by a given accounting principal, including the space that it uses temporarily during an upload (even if the upload will fail). j. The sizes of read and write caps should be minimized. The size of a verify cap/SI is less important but should still be fairly small. k. A downloader should have sufficient information, given the read cap and downloaded shares, to be able to check the integrity of the plaintext even if its decryption and erasure decoding routines are incorrect. l. The verify cap for a file should be derivable off-line from the read cap. m. If deep-verify caps are supported, the deep-verify cap for a file should be derivable off-line from the read cap, and the verify cap from the deep-verify vap. All of these goals can be achieved simultaneously by a variation on [Rainhill 3](https://tahoe-lafs.org/~davidsarah/immutable-rainhill-3.svg), or the simpler [Rainhill 3x](https://tahoe-lafs.org/~davidsarah/immutable-rainhill-3x.svg) that does not support deep-verify caps. For simplicity, I'll just describe the variation on Rainhill 3x here: * EncP_R is used as the plaintext hash for goals c and d. The client can at any time ask whether a server has a share with a given EncP_R. (To do this lookup efficiently, the storage server must index shares by this as well as the SI, but that is easy if we're using a database.) * the server selection index is computed as SSI = hash(CS, first segment of Plain_R, Params), using a hash distinct from any other in the protocol. SSI is included as an extra field in all cap types. * before uploading, a client reserves a certain amount of space for that upload with its accounting principal credentials (e.g. using a signed message). If it did not know the file size in advance and needs more space, it can increase the reservation. * during the upload both the client and server compute the SI. At the end, the server discards the share if it is a duplicate of one it already has; otherwise, it compares the SI it computed with the one the client tells it, and retains the share if they match.

Sign in to join this conversation.