increase share-size field to 8 bytes, remove 12GiB filesize limit #346
The version=0 storage.ShareFile disk format currently uses a 4-byte field to store the size of the share, which limits total file size (when k=3) to 12GiB. We should implement a version=1 which uses an 8-byte field, to remove this limit.

The ShareFile class will need to read the version number first, then use that to decide how to read the size and num_leases fields. Then it should set self._data_offset and self._lease_offset accordingly. I estimate that this change (plus testing) should take about one day.
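A minimal sketch of the version-dispatch idea follows; the field layouts, header sizes, and offsets here are assumptions for illustration, not the actual Tahoe-LAFS on-disk format.

```python
import struct

def read_share_header(f):
    """Hypothetical reader for a versioned share-file header.

    The layouts assumed here: version=0 uses a 4-byte share-size field,
    version=1 uses an 8-byte one; both keep a 4-byte lease count.
    """
    f.seek(0)
    (version,) = struct.unpack(">L", f.read(4))
    if version == 0:
        # assumed v0 layout: 4-byte share size, 4-byte lease count
        share_size, num_leases = struct.unpack(">LL", f.read(8))
        data_offset = 0x0c
    elif version == 1:
        # assumed v1 layout: 8-byte share size, 4-byte lease count
        share_size, num_leases = struct.unpack(">QL", f.read(12))
        data_offset = 0x10
    else:
        raise ValueError("unknown share-file version %d" % version)
    # leases are assumed to be appended after the share data
    lease_offset = data_offset + share_size
    return version, share_size, num_leases, data_offset, lease_offset
```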
The next limitation will be in the share structure (as opposed to the on-disk structure), implemented in WriteBucketProxy, in the 4-byte offsets for the hash tree sections. I estimate that this represents a limit of about 9TB.

After that, I think we're using 8-byte offsets for everything, so we'll be good for something like 16EiB.
It's a pity that this isn't fixed yet: somebody is uploading an 18GB (18491058490-byte) file through the prodtahoe3 helper right now (SI=5bsz7dptf235innpmcvmfjea74), and it's going to fail badly in a few days when they finally finish pushing the ciphertext to the helper.
OTOH, 18GB is a stupidly large file. I know my sense of "reasonable" file sizes is different from the rest of the world's, but files of that size are hard to deal with even locally: it must take hours just to read the contents out from disk.
General note: in the future, I think it is probably a good idea to use 8 bytes for all counts and sizes. Sometimes there are things that will be okay with 4 bytes, but instead of spending the time to figure out whether 4 bytes will be safe, you should probably just use 8 bytes and move on to other topics.
Occasionally, there may actually be some field which gets used in such a way that conserving the 4 extra bytes is valuable enough that we should take the time to think it through and decide to go to 4 bytes (or 2, or 3, 5, or 6) instead of using our 8 byte default. I'm not aware of any field like that in the current Tahoe design -- everything currently should default to 8 bytes as far as I can think off the top of my head.
(And yes, Brian needs to get over his feeling that people are wrong to use files that large.)
The sad tale of 5bsz7 gets worse: struct.pack does not raise an exception when you give it a number that won't fit in the field you're creating, it just truncates the number. So this 18GB upload is going to look like it succeeds, and all the share data will be there, but the share size will be wrong. As a result, later attempts to read past the (size % 4GB) point will get a precondition exception, and the file will be unretrievable.
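To make the failure mode concrete, here is the wrap-around arithmetic (note that modern Python's struct.pack raises struct.error for out-of-range values, so this snippet just reproduces the arithmetic of the old silent truncation):

```python
# The 18GB upload mentioned above, squeezed into a 4-byte size field:
size = 18491058490
stored = size % 2**32   # what a silently-truncating 4-byte field records
print(stored)           # 1311189306 -- reads past the (size % 4GB) point fail
```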
In looking at storage.py to see what could go wrong, I think I may have missed a limitation. The "segment_size" and "data_size" fields in the share structure (in WriteBucketProxy) are also 4 bytes long, not just the offsets of the hash trees. That imposes a 4GB size limit on the overall file.
Hm. So I wonder if those 4.7GB file uploads that were taking place actually succeeded or not. Rats.
So if this is causing a silent failure where the fact that the file wasn't uploaded is hidden from the user and they think the file was successfully uploaded, then we need to elevate this issue from "major" to "critical".
I've looked over the code again, and we're ok with up to 12GiB files. The variable names in WriteBucketProxy are confusing: they do not make it clear whether a variable is per-share or per-file. The segment_size and data_size fields in question turn out to be per-share, so the fact that they (and the 4-byte entries in the offset table) are limited to 4GiB only limits the shares to 4GiB, meaning an overall file-size limit of 12GiB.
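For the record, the arithmetic behind the 12GiB figure (assuming the default k=3 encoding):

```python
per_share_limit = 2**32           # a 4-byte size field caps each share at 4 GiB
k = 3                             # shares needed to reconstruct the file
file_limit = k * per_share_limit  # 12884901888 bytes
print(file_limit // 2**30)        # 12 (GiB)
```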
I'm running a test now with a 5GiB file to confirm. It took about 2 hours to upload (on a local testnet, entirely on one machine), and will take about 16 hours to download.
To get past the 12GiB limit, there are actually two places in the share format that need fixing: the self._data_size value is stored in the third word of the share, in a 4-byte field that needs to be expanded. This is in addition to the 4-byte entries in the offset table that I identified earlier.
In addition, it would be a great idea to change the variable names in WriteBucketProxy and ReadBucketProxy to avoid this sort of confusion in the future.
The 5GB file (not 5GiB) downloaded successfully after about 19 hours. Checksums match. So 5GB works in 1.1.
I mentioned this ticket as one of the most important-to-me improvements that we could make in the Tahoe code: http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
Fixed by changeset:6c4019ec33e7a253, but not by the expected technique of making a new server-side share file format with 8-byte share-data-size fields and making the server able to read the old format as well as the new format while writing the new format. Instead it was fixed by using os.path.getsize() instead of using the share-data-size field.

If the server can now handle larger shares, it needs to advertise this fact, or else the clients won't take advantage of it. That means the StorageServer.VERSION dictionary needs to be updated; in particular, the "maximum-immutable-share-size" value should be raised.

In addition, the part of changeset:6c4019ec33e7a253 which provides backwards compatibility by writing size mod 2**32 into the old share-size slot smells funny: what is this "4294967295" number? The fact that it is odd (and not even) is suspicious. Why not just use 2**32?

Other than that, nice patch! I'm glad there was enough redundancy left over in the original share format to allow this trick to work.
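For illustration, a sketch of the getsize() trick described above; the header layout and lease-record size here are assumptions for the example, not the exact on-disk constants:

```python
import os
import struct

HEADER_SIZE = 0x0c   # assumed: three 4-byte words (version, size, num_leases)
LEASE_SIZE = 72      # assumed size of one appended lease record

def true_share_data_size(home):
    """Derive the share-data size from the file's real length on disk,
    ignoring the possibly-wrapped 4-byte size field in the header."""
    filesize = os.path.getsize(home)
    with open(home, "rb") as f:
        version, wrapped_size, num_leases = struct.unpack(
            ">LLL", f.read(HEADER_SIZE))
    # leases are assumed to be appended after the share data
    return filesize - HEADER_SIZE - num_leases * LEASE_SIZE
```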
We should arrange to do a manual test of uploading a 13GiB file to a private grid and see if it works. I expect that test will take several hours, making it unsuitable for the regular unit test suite, but it would be nice to know it worked at least once.
changeset:c7cd3f38e7b8768d adds the new advertisement, so all that's left is to do a manual test.
changeset:cc50e2f4aa96dd66 adds code to use a WriteBucketProxy_v2 when necessary; the lack of that dispatch was what was causing my manual test to fail. A new test is now running; I estimate it will take about 5 hours to complete, but it seems to have gotten off to a decent start.

This was fixed for 1.3.0.
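The dispatch presumably looks something like the following sketch; the module path is the one cited below in this ticket, but the threshold and the bare class selection are illustrative, not the real constructor call:

```python
from allmydata.immutable.layout import WriteBucketProxy, WriteBucketProxy_v2

def choose_bucket_proxy_class(share_size):
    # If every size field and offset still fits in 4 bytes, the original
    # format is fine; otherwise use the v2 format with 8-byte fields.
    if share_size < 2**32:
        return WriteBucketProxy
    return WriteBucketProxy_v2
```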
The current limit is that there is a 64-bit unsigned field which holds the offset in bytes of the next data element that comes after the share contents on the storage server's disk. See the implementation and the in-line docs in src/allmydata/immutable/layout.py@3864.
This means that each individual share is limited to a few bytes less than 2^64^. Therefore the overall file is limited to k*2^64^ (where k is the number of shares). There might be some other limitation that I've forgotten about, but we haven't encountered it in practice, where people have many times uploaded files in excess of 12 GiB.
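Rough arithmetic for that ceiling, assuming the default k=3 encoding:

```python
per_share_limit = 2**64              # 64-bit offsets: just under 16 EiB per share
k = 3
print(per_share_limit // 2**60)      # 16 (EiB per share)
print(k * per_share_limit // 2**60)  # 48 (EiB for the whole file when k=3)
```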