support streaming uploads in uploader #1288
Reference: tahoe-lafs/trac-2024-07-25#1288
#320 is about supporting streaming uploads in the HTTP web-API.
In the case of the SFTP frontend, there is no problem with getting at the upload stream, unlike HTTP (see /tahoe-lafs/trac-2024-07-25/issues/5382#comment:21). So we could implement streaming upload immediately for SFTP at least in some cases (see #1041 for details), if the uploader itself supported it.
This ticket is about streaming support in the uploader itself. It looks like the current `IUploadable` interface isn't really suited to streaming (for example it has a `get_size` method, and it pulls the data when a "push" approach would be more appropriate), so there is some design work to do that is independent of HTTP.
Note that doing a streaming upload (where the storage servers are accepting and storing the first blocks of your file from you before you, the storage client, have even looked at the last blocks of that file) is inherently incompatible with client-side deduplication (where you realize that the file is already stored before you upload the first block).
If we wanted to implement this ticket and still support client-side deduplication, which saves upload bandwidth and server-side storage space, then we'd have to make it an option. For each upload, do you want to make a pass over the data first, to see whether it is already stored so that you might be able to skip the upload? Or do you want to do a streaming upload, where the storage client (== Tahoe-LAFS gateway) does not have to store a temporary copy of the entire file in order to make two passes over it?
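To make the two options concrete, here is a minimal sketch in Python. It is not the Tahoe-LAFS uploader: `upload_two_pass`, `upload_streaming`, `known_hashes`, and `send_block` are hypothetical names, and a plain SHA-256 of the plaintext stands in for whatever identifier a real deduplication check would use.

```python
import hashlib
import io
import tempfile


def upload_two_pass(source, known_hashes, send_block, block_size=4):
    """First pass: spool the data to temporary storage and hash it.
    Second pass: upload it, unless the hash shows it is already stored."""
    hasher = hashlib.sha256()
    with tempfile.TemporaryFile() as spool:
        for block in iter(lambda: source.read(block_size), b""):
            hasher.update(block)
            spool.write(block)
        if hasher.hexdigest() in known_hashes:
            return "deduplicated"  # nothing was sent to any server
        spool.seek(0)
        for block in iter(lambda: spool.read(block_size), b""):
            send_block(block)
    return "uploaded"


def upload_streaming(source, send_block, block_size=4):
    """Single pass: push each block as soon as it is read.
    No temporary copy is kept, so no up-front duplicate check is possible."""
    for block in iter(lambda: source.read(block_size), b""):
        send_block(block)
    return "uploaded"


data = b"hello world!"
known = {hashlib.sha256(data).hexdigest()}
sent = []
print(upload_two_pass(io.BytesIO(data), known, sent.append), len(sent))
print(upload_streaming(io.BytesIO(data), sent.append), len(sent))
```

The two-pass variant needs temporary space for the whole file but can skip the upload entirely; the streaming variant needs no temporary space but always sends every block.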
A streaming upload could be compatible with server-side deduplication, where after the last block of the share is uploaded, the server says "Oh look, I already have a copy of this share. I'll just delete the new one and add a new lease to the old one." This doesn't help with upload bandwidth, but it conserves server-side storage space.
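The server-side behaviour described above could look roughly like the following toy model. `ToyStorageServer` and its methods are hypothetical and do not correspond to the real storage server protocol; a SHA-256 over the share contents is assumed as the share identifier.

```python
import hashlib


class ToyStorageServer:
    """Toy model of a server that deduplicates after a streamed upload."""

    def __init__(self):
        self.shares = {}         # share id -> stored bytes
        self.leases = {}         # share id -> set of lease holders
        self.bytes_received = 0  # upload bandwidth actually consumed

    def upload_share(self, blocks, lease_holder):
        hasher = hashlib.sha256()
        incoming = bytearray()
        for block in blocks:     # blocks arrive one at a time
            hasher.update(block)
            incoming.extend(block)
            self.bytes_received += len(block)
        share_id = hasher.hexdigest()
        duplicate = share_id in self.shares
        if not duplicate:
            self.shares[share_id] = bytes(incoming)
        # If it was a duplicate, the new copy is simply dropped here;
        # either way the uploader gets a lease on the stored share.
        self.leases.setdefault(share_id, set()).add(lease_holder)
        return share_id, duplicate


server = ToyStorageServer()
first_id, first_dup = server.upload_share(iter([b"ab", b"cd"]), "alice")
second_id, second_dup = server.upload_share(iter([b"abc", b"d"]), "bob")
print(first_dup, second_dup, len(server.shares), server.bytes_received)
```

Both uploads consume bandwidth (`bytes_received` counts all eight bytes), but only one copy of the share is kept and both uploaders hold a lease on it.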
Replying to zooko:
Well, another possibility is that the client starts to upload the file, but aborts the upload if it finishes making a pass over the data and detects that it was already stored. That might make sense if the client is receiving the file faster than it is able to upload it.
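A rough simulation of this idea, under simplifying assumptions: time advances in ticks, the client receives `receive_rate` chunks and uploads `upload_rate` chunks per tick (both assumed positive), and the duplicate check is a lookup of a SHA-256 digest in a hypothetical `known_hashes` set. None of these names come from the Tahoe-LAFS code.

```python
import hashlib


def simulate_upload(chunks, known_hashes, receive_rate, upload_rate):
    """Upload while receiving; abort once the full hash is known to be stored.

    Returns (outcome, number_of_chunks_uploaded). Rates must be positive.
    """
    hasher = hashlib.sha256()
    total = len(chunks)
    received = 0
    uploaded = 0
    while uploaded < total:
        # Receive (and hash) up to receive_rate chunks this tick.
        for _ in range(receive_rate):
            if received < total:
                hasher.update(chunks[received])
                received += 1
        # Once the whole file has been seen, the duplicate check is possible.
        if received == total and hasher.hexdigest() in known_hashes:
            return ("aborted", uploaded)
        # Upload up to upload_rate chunks, never more than were received.
        uploaded = min(received, uploaded + upload_rate)
    return ("completed", uploaded)


chunks = [bytes([i]) * 3 for i in range(8)]
digest = hashlib.sha256(b"".join(chunks)).hexdigest()
print(simulate_upload(chunks, {digest}, receive_rate=4, upload_rate=1))
print(simulate_upload(chunks, {digest}, receive_rate=1, upload_rate=1))
```

In this model the saving depends on the gap between the two rates: when the client receives four times faster than it uploads, it aborts after one chunk, whereas with equal rates it has already sent all but the last chunk by the time the hash is known.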
A difficulty here is that without knowing the file's hash, the client can't determine the optimum set of servers to store shares on. But if the number of servers on the grid were not much greater than `shares.total`, then that might not matter, because it could start uploading shares to all servers. (Or there could be some cleverer way to work around this problem that I'm not seeing right now.)
Replying to davidsarah (comment:2):
Hey, that is a very good idea. If you're streaming a file to a gateway for it to encrypt, erasure-code, and distribute among servers, then the gateway could dynamically choose how far to read ahead of the upload (buffering the file in temporary storage and precomputing its hash for deduplication purposes) and how far to read the file from you only as fast as it can upload it to storage servers.
Brian and I have discussed this. I think we should start by conceiving of the "server selector" as a potentially different thing from the "file identifier". The former is what you need to have to choose which servers to contact first. The latter is what you send to a server to indicate which file you mean, out of all the files it knows about.
Only the "server selector" part has to be known before upload begins. Another fact is that the server selector does not necessarily need to have a lot of information in it. For example, what if it were a 2-byte random value? That would define 65,536 ways to search any given set of servers (e.g. permute the list of servers according to this 2-byte server selector).
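As an illustration of the 2-byte selector idea, the sketch below permutes a server list by sorting on a hash of the selector concatenated with each server id. The function name `permute_servers` and the choice of SHA-256 are assumptions made for this example, not a description of the existing permutation code.

```python
import hashlib
import secrets


def permute_servers(server_ids, selector):
    """Order servers by SHA-256(selector + server_id).

    selector is a 2-byte value, so there are 65,536 possible orderings
    to choose from for any given set of servers.
    """
    if len(selector) != 2:
        raise ValueError("selector must be exactly 2 bytes")
    return sorted(
        server_ids,
        key=lambda server_id: hashlib.sha256(selector + server_id).digest(),
    )


servers = [b"server-%d" % i for i in range(10)]
selector = secrets.token_bytes(2)  # random, chosen before the upload begins
print(selector.hex(), permute_servers(servers, selector)[:3])
```

The ordering depends only on the selector and the server ids, so a downloader who knows the selector can reproduce the same search order without knowing anything else about the file.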
(The "file identifier" part does need to be collision-free: #753.)
There are some notes about these topics: #654, #482, wiki/ServerSelection, #467, #872.
I'd been thinking that to support convergence, the server selector has to be based on a hash of the whole file. But that's not necessarily true: it could be based on a hash of a prefix of the file (say the first segment), and the convergence secret. This would make no difference for small files, but for large files it would allow the server selector to be calculated more quickly, perhaps before the rest of the file is known.
This would mean that files with the same prefix in a given convergence set would always have the same selector. But arguably the server selector only has to have sufficient diversity to average out the consumed space among servers for a reasonably large collection of files. If it's sufficiently unusual to have lots of files with the same prefix in a given convergence set, then this goal is achieved even if the server selector only depends on the prefix (within that set).
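A minimal sketch of a prefix-based selector, assuming SHA-256 as the hash, a 2-byte selector, and an illustrative segment size of 128 KiB. `SEGMENT_SIZE` and `prefix_server_selector` are hypothetical names, and the length prefix on the secret is just one way to keep the two inputs unambiguous.

```python
import hashlib

SEGMENT_SIZE = 128 * 1024  # illustrative value, not taken from this ticket


def prefix_server_selector(data_prefix, convergence_secret, selector_len=2):
    """Derive the selector from the convergence secret and the first segment.

    Only the first SEGMENT_SIZE bytes of data_prefix are used, so the
    selector is available as soon as the first segment has been read.
    """
    hasher = hashlib.sha256()
    hasher.update(len(convergence_secret).to_bytes(4, "big"))
    hasher.update(convergence_secret)
    hasher.update(data_prefix[:SEGMENT_SIZE])
    return hasher.digest()[:selector_len]


secret = b"s" * 32
file_a = b"x" * SEGMENT_SIZE + b"tail A"
file_b = b"x" * SEGMENT_SIZE + b"a different tail"
print(prefix_server_selector(file_a, secret) == prefix_server_selector(file_b, secret))
```

The two files share a first segment and therefore a selector, which is the trade-off discussed above; for files no longer than one segment, the selector still depends on the entire contents.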
Zooko, Brian and I discussed this again on #tahoe-lafs.
Goals:
j. The sizes of read and write caps should be minimized. The size of a verify cap/SI is less important but should still be fairly small.
k. A downloader should have sufficient information, given the read cap and downloaded shares, to be able to check the integrity of the plaintext even if its decryption and erasure decoding routines are incorrect.
l. The verify cap for a file should be derivable off-line from the read cap.
m. If deep-verify caps are supported, the deep-verify cap for a file should be derivable off-line from the read cap, and the verify cap from the deep-verify cap.
All of these goals can be achieved simultaneously by a variation on Rainhill 3, or the simpler Rainhill 3x that does not support deep-verify caps. For simplicity, I'll just describe the variation on Rainhill 3x here: