eliminate hard limit on size of SDMFs #359

Closed
opened 2008-03-20 02:47:45 +00:00 by zooko · 5 comments

We currently impose a hard limit on SDMFs of 3.5 MB. (It was recently raised from the initial value of 1 MB in order to support directories with up to 10,000 entries.)

We could remove this artificial limit entirely. There would remain "soft limits":

  1. Creating or updating an SDMF would take approximately (1 + N/K) * filesize of RAM.

  2. It would take approximately N/K * filesize of upload bandwidth to change even a single byte of the file. (If/when we implement a mutable upload helper, the client-to-helper bandwidth will equal the filesize.)
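The two soft limits above can be sketched as a back-of-the-envelope calculation. This is a hypothetical helper for illustration, not Tahoe-LAFS code; `k` and `n` are the erasure-coding parameters (default 3-of-10).

```python
def sdmf_update_costs(filesize, k=3, n=10):
    """Approximate soft-limit costs of creating or updating an SDMF file."""
    # The whole plaintext plus all N/K-expanded share data is held in memory.
    ram = (1 + n / k) * filesize
    # Every share is regenerated and re-uploaded, even for a one-byte change.
    bandwidth = (n / k) * filesize
    return ram, bandwidth
```

For the old 3.5 MB hard limit at 3-of-10 encoding, that works out to roughly 15 MB of RAM and 12 MB of upload bandwidth per modification.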

zooko added the
code-encoding
major
enhancement
0.9.0
labels 2008-03-20 02:47:45 +00:00
zooko added this to the eventually milestone 2008-03-20 02:47:45 +00:00

FYI, we don't have a mutable-file upload helper yet.

warner added
code-mutable
and removed
code-encoding
labels 2008-04-24 23:46:34 +00:00
davidsarah commented 2009-12-13 05:15:58 +00:00
Owner

What's the limit on an immutable file?

Author

It was ticket #346 to raise it to an extremely high limit. The current limit is that there is a 64-bit unsigned field which holds the offset in bytes of the next data element that comes after the share contents on the storage server's disk. See the implementation and the in-line docs in source:src/allmydata/immutable/layout.py@3864.

This means that each individual share is limited to a few bytes less than 2^64^. Therefore the overall file is limited to k*2^64^. There might be some other limitation that I've forgotten about, but we haven't encountered it in practice, where people have many times uploaded files in excess of 12 GiB.
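The limit described above falls directly out of the field width. The following is illustrative arithmetic only, not the actual layout code (see source:src/allmydata/immutable/layout.py for that):

```python
import struct

# The immutable share header stores each offset as a 64-bit unsigned
# big-endian integer, so a single share is capped just below 2**64 bytes.
MAX_OFFSET = 2**64 - 1
packed = struct.pack(">Q", MAX_OFFSET)  # ">Q" = big-endian unsigned 64-bit

# Since each share holds 1/k of the file's data, the whole file is
# bounded by roughly k * 2**64 bytes (k=3 is the default).
k = 3
max_file = k * 2**64
```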


Note that zooko's recent comments are about immutable files and their shares, whereas this ticket is about mutable files and shares, which use a different layout. However, the same general statements are true. Mutable files were designed after we had some experience with immutable files, but before I learned to always use 64-bit fields for everything. They've used somewhat larger offset fields since day 1, which are big enough to accommodate very large shares. The layout is described in source:src/allmydata/mutable/layout.py.

To be precise, they use 32-bit fields to hold the offsets of the signature, share_hash_chain, block_hash_tree, and share_data, then use a 64-bit field to hold the offsets of the enc_privkey and EOF. So they can tolerate 2^64^-byte share_data sections, which is where the bulk of the share's data lives. The block_hash_tree section is smaller than the share_data section, but still scales linearly with filesize. Because the share_data offset is a 32-bit field, the block_hash_tree section must be somewhat smaller than 2^32^ bytes, limiting it to 2^27^ hashes, so 2^26^ segments, which at our default 128KiB (2^17^) segsize means 2^43^ bytes, which is the limiting factor. By raising the segsize to e.g. 4MB (2^22^) this limit grows to 2^48^ bytes.
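The chain of limits above can be re-derived numerically. This is illustrative arithmetic based on the comment, assuming Tahoe's 32-byte SHA-256d hashes:

```python
HASH_SIZE = 32                     # each block hash is a 32-byte SHA-256d digest

# The block_hash_tree section must fit below the 32-bit share_data offset.
max_hashes = 2**32 // HASH_SIZE    # 2**27 hashes fit in the section

# A binary Merkle tree with ~2**27 nodes has 2**26 leaves, i.e. segments.
max_segments = max_hashes // 2

default_segsize = 128 * 1024       # 2**17 bytes
per_share_limit = max_segments * default_segsize  # 2**43 bytes per share

k = 3
file_limit = k * per_share_limit   # ~24 TiB, matching the figure below
```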

So, SDMF mutable files are limited by the share format to k*2^43^ bytes, or about 24TiB. Until we implement MDMF and can process mutable files one segment at a time (instead of holding the whole file in RAM), we'll be soft-limited by available memory, so practically speaking the limit is a couple of GB.

If we stick with the same share format for MDMF (which was our goal: old clients should be able to keep using their SDMF code to read MDMF-generated files, unless we really do need a separate salt for each segment: #393), then MDMF files will be limited to k*2^43^ bytes with a RAM footprint of about x*128KiB (where "x" is probably 2 or 3). An uploader-side max_segsize configuration change can scale those two values together up to a filesize limit of k*2^64^ bytes and a RAM footprint of x*256GiB.
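The segsize trade-off in the previous paragraph can be sketched as follows. This is a hypothetical function, not Tahoe-LAFS code; the 2^26^ segment count is fixed by the 32-bit offset field discussed earlier, so raising max_segsize scales the file-size limit and the RAM footprint together:

```python
def mdmf_limits(segsize, k=3, x=3):
    """Filesize limit and approximate RAM footprint for a given segsize."""
    max_segments = 2**26                     # fixed by the 32-bit offset field
    file_limit = k * max_segments * segsize  # per-file upper bound
    ram = x * segsize                        # "x" segments held in RAM at once
    return file_limit, ram
```

At the default 128KiB segsize this gives k*2^43^ bytes; at a 256GiB (2^38^) segsize it reaches the k*2^64^ ceiling with an x*256GiB RAM footprint.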

If we do change the share format for MDMF, then we should of course use 64-bit fields everywhere and remove this 2^43^ limit.

Finally, it turns out that this ticket is actually a dupe of #694, which was closed when we removed the hard limit on SDMF files in changeset:db939750a8831c1e back in June 2009. I'd initially imposed the arbitrary 3.5MB limit to discourage people from using the (inefficient, memory-hungry) SDMF format in ways that would disappoint their hopes for high-performance behavior, but I was talked out of this and Kevan implemented the fix, which was first released in 1.5.0.

warner added the
duplicate
label 2009-12-26 03:52:24 +00:00
warner modified the milestone from eventually to 1.5.0 2009-12-26 03:52:24 +00:00
Author

For the record, my comment:65482 was about immutable files because David-Sarah asked about them in comment:65481. :-) Thanks for the description of the mutable file size limits.

Reference: tahoe-lafs/trac-2024-07-25#359