Protocol is potentially high-latency and high bandwidth overhead for small files #3766

Open
opened 2021-08-16 19:33:34 +00:00 by itamarst · 1 comment

Imagine uploading a new, small file. As I understand it, this will require:

  1. Create a storage index.
  2. Upload each of the shares, e.g. 10 HTTP queries if there are 10 shares.

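One can't do all of the queries in parallel, only the share uploads, because an upload sent at the same time as the create request would race with the storage index coming into existence. So even a clever, async client implementation will still require two sequential HTTP roundtrips for each file upload: one to create the storage index, then one for the parallel batch of share uploads.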
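The ordering constraint can be sketched with an in-memory stand-in for the server. Everything here (`FakeStorageServer`, its method names, the single-server setup) is hypothetical and only illustrates the create-then-upload dependency; it is not the proposed HTTP API.

```python
import asyncio


class FakeStorageServer:
    """Hypothetical in-memory stand-in for a storage server (not a real API)."""

    def __init__(self):
        self.indexes = set()
        self.shares = {}
        self.log = []

    async def create_storage_index(self, storage_index):
        await asyncio.sleep(0)  # stands in for one HTTP round trip
        self.indexes.add(storage_index)
        self.log.append("create")

    async def upload_share(self, storage_index, share_num, data):
        await asyncio.sleep(0)  # stands in for one HTTP round trip
        if storage_index not in self.indexes:
            # This is the race: the upload arrived before the index existed.
            raise KeyError("storage index does not exist yet")
        self.shares[(storage_index, share_num)] = data
        self.log.append(f"share {share_num}")


async def upload_small_file(server, storage_index, shares):
    # Round trip 1: must finish before any share upload is sent.
    await server.create_storage_index(storage_index)
    # Round trip 2: all share uploads are issued concurrently.
    await asyncio.gather(
        *(server.upload_share(storage_index, num, data)
          for num, data in enumerate(shares))
    )
```

Calling `asyncio.run(upload_small_file(FakeStorageServer(), "si-1", [b"a", b"b", b"c"]))` completes in two sequential steps; moving the create call into the `gather` would let a share upload reach the server first and fail.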

In addition to double latency (or 11× latency for a naive client that sends all 11 requests one after another, which maybe we don't care about), every one of those requests carries its own HTTP protocol overhead, which adds up quickly relative to a small file.

One can imagine an optimized variant of the API that includes both storage index and share creation in a single HTTP API call, for smaller files. This is, however, an optimization, and probably needn't exist in the first version.
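As a rough model of the latency argument above, the number of sequential round trips can be counted per strategy. The function and the strategy names are made up for illustration, and the count ignores connection setup and retries.

```python
def sequential_round_trips(num_shares, strategy):
    """Count round trips that must happen one after another.

    Strategy names are hypothetical labels, not part of any protocol.
    """
    if strategy == "naive":
        # Create the storage index, then upload shares one at a time.
        return 1 + num_shares
    if strategy == "parallel":
        # Create the storage index, then upload all shares concurrently.
        return 2
    if strategy == "combined":
        # Imagined single call that creates the index and stores the shares.
        return 1
    raise ValueError(f"unknown strategy: {strategy}")
```

For 10 shares this gives 11, 2 and 1 round trips respectively, which matches the 11× and 2× figures in the description.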

itamarst added the unknown, normal, enhancement, n/a labels 2021-08-16 19:33:34 +00:00
itamarst added this to the HTTP Storage Protocol milestone 2021-08-16 19:33:34 +00:00
exarkun was assigned by itamarst 2021-08-16 19:33:34 +00:00

> (or 11× latency for a naive client, which maybe we don't care about)

Just on the point of such a naive client specifically: there are other reasons not to be this naive. Primarily, all shares are produced at the same time, as the cleartext is processed. If you only upload one of them at a time, you have to store all the rest of them locally until you're ready to upload them. If you upload in parallel (which the current Tahoe-LAFS does using the Foolscap protocol) then you never have to store any of them locally; you can stream them all up as they're generated.

For small files, who cares? But for large files this is likely to be pretty crummy, especially given ZFEC expansion, which means you might end up storing 2x or 3x or more of the original file size locally (technically the maximum is 255x I think, but that's not a very likely client configuration).
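To put rough numbers on the local storage cost, here is a back-of-the-envelope sketch. It assumes k-of-n erasure coding where each share is about `file_size / k` bytes, and it ignores hashes, headers and in-flight segment buffers; the function names are invented for this example.

```python
def share_size(file_size, k):
    # Each of the n shares holds roughly 1/k of the file (rounded up).
    return -(-file_size // k)


def expansion_factor(k, n):
    # Total bytes across all shares, relative to the original file size.
    return n / k


def local_buffer_bytes(file_size, k, n, parallel):
    """Approximate bytes a client must hold locally while uploading."""
    if parallel:
        # Shares are streamed out as they are generated.
        return 0
    # One share is in flight; the other n - 1 wait on local storage.
    return (n - 1) * share_size(file_size, k)
```

With a 3-of-10 encoding and a 3,000,000 byte file, `local_buffer_bytes(3_000_000, 3, 10, parallel=False)` comes to 9,000,000 bytes, i.e. about 3x the file size, while the parallel case needs none under these assumptions.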

exarkun modified the milestone from HTTP Storage Protocol to HTTP Storage Protocol v2 2021-08-18 13:41:50 +00:00

Reference: tahoe-lafs/trac-2024-07-25#3766