storage format is awfully inefficient for small shares #80
Eventually we're going to need to revisit our StorageServer implementation. The current approach stores each share in a separate directory, puts the share itself in a file named 'data', and puts each piece of metadata in its own file. This results in about 7 files per share.
This approach is nice and simple and understandable and browsable, but not particularly efficient (at least under ext3). For a 20-byte share (resulting from a 476-byte file), the directory appears to consume about 33kB, and the parent directory (which holds 58 such shares for the same file) appears to consume 2MB. This is probably just the basic disk-block quantization that most filesystems suffer from. Lots of small files are expensive.
Testing locally, it looks like concatenating all of the files for a single (884-byte) share reduces the space consumed by that share from 33kB to 8.2kB. If we move that file up a level, so that we don't have a directory-per-share, just one file-per-share, then the space consumed drops to 4.1kB.
So I'm thinking that in the medium term, we either need to move to reiserfs (which might handle small files more efficiently) or change our StorageServer to try to put all the data in a single file, which means committing to some of the metadata and pre-allocating space for it in the sharefile.
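To make the single-file idea concrete, here is a minimal sketch (purely illustrative, not the actual Tahoe share-file format) of what committing to a fixed-size header and pre-allocating metadata space might look like:

```python
# Illustrative only: a hypothetical one-file-per-share layout, with a small
# fixed header and pre-allocated lease/metadata slots, so no extra files
# (and no extra disk blocks) are needed per share.
import struct

HEADER = struct.Struct(">LLL")   # (format version, share-data length, lease-slot count)
LEASE_SLOT_SIZE = 72             # made-up size for a serialized lease record

def write_share_file(f, share_data, num_lease_slots=4):
    f.write(HEADER.pack(1, len(share_data), num_lease_slots))
    f.write(share_data)
    # reserve space now so leases can be filled in later without rewriting the file
    f.write(b"\x00" * (num_lease_slots * LEASE_SLOT_SIZE))

def read_share_data(f):
    version, datalen, _ = HEADER.unpack(f.read(HEADER.size))
    assert version == 1
    return f.read(datalen)
```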
Ooh, it gets worse. I was trying to upload a copy of the 13MB tahoe source tree into testnet (which has about 1620 files, two thirds of which are patches under _darcs/). The upload failed about two thirds of the way through because of a zero-length file (see #81), but just 2/3rds of the upload consumes 1.2GB per storageserver (when clearly that should be closer to 13MB * 4/3 * 2/3, say 11.5MB).
This 100x overhead is going to be a problem...
Oh, and the tahoe-storagespace munin plugin stops working with that many directories. I rewrote it to use the native /usr/bin/du program instead of doing the directory traversal in python, and it still takes 63 seconds to measure the size of all three storageservers on testnet, which is an order of magnitude more than munin will allow before it gives up on the plugin (remember these get run every 5 minutes). It looks like three storageservers' worth of share directories is too large to fit in the kernel's filesystem cache, so measuring all of them causes thrashing. (In contrast, measuring just one node's space takes 14s the first time and just 2s each time thereafter.)
So the reason that the storage space graph is currently broken is because the munin plugin can't keep up.
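For reference, the du-based approach amounts to something like the following sketch (the node paths and graph labels are placeholders, not the actual tahoe-storagespace plugin):

```python
#!/usr/bin/env python
# Sketch of a munin plugin that shells out to /usr/bin/du rather than
# walking the share directories in Python. Paths and labels are placeholders.
import subprocess, sys

NODES = [("tahoe1", "/var/tahoe/node1/storage"),
         ("tahoe2", "/var/tahoe/node2/storage"),
         ("tahoe3", "/var/tahoe/node3/storage")]

def du_bytes(path):
    # "du -s --block-size=1 PATH" prints "<bytes>\t<path>"
    out = subprocess.check_output(["/usr/bin/du", "-s", "--block-size=1", path])
    return int(out.split()[0])

if len(sys.argv) > 1 and sys.argv[1] == "config":
    print("graph_title Tahoe storage space consumed")
    print("graph_vlabel bytes")
    for name, _ in NODES:
        print("%s.label %s" % (name, name))
else:
    for name, path in NODES:
        print("%s.value %d" % (name, du_bytes(path)))
```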
Zooko and I did some more analysis:
Our plans to improve this:
Our guess is that this will reduce the minimum space consumed to 40960 bytes (41kB), occurring when the filesize is 10134 bytes (10kB) or smaller.
The URI:LIT fix will cover the 0-to-80ish byte files efficiently. It may be the case that we just accept the overhead for 80-to-10134 byte files, or perhaps we could switch to a different algorithm (simple replication instead of FEC?) for those files. We'll have to run some more numbers and look at the complexity burden first.
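The gist of the URI:LIT approach, roughly sketched below (this shows the idea, not necessarily Tahoe's exact cap encoding): for tiny files we skip FEC and the storage servers entirely and carry the file contents inside the URI itself.

```python
# Rough illustration of a "literal" URI: the file's bytes ride inside the
# cap string, so nothing at all is stored on any server.
import base64

def make_lit_uri(data):
    b32 = base64.b32encode(data).decode("ascii").rstrip("=").lower()
    return "URI:LIT:" + b32

def read_lit_uri(uri):
    assert uri.startswith("URI:LIT:")
    b32 = uri[len("URI:LIT:"):].upper()
    b32 += "=" * (-len(b32) % 8)       # restore the stripped base32 padding
    return base64.b32decode(b32)

print(make_lit_uri(b"hello world"))    # a short cap, zero bytes stored remotely
```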
I've fixed the main problems here. My plan is to do some more tests, measure the current overhead (and record the results here), then close this ticket. #87 is a future change, since we want to retain the validation for a while, until we feel super-confident about the intermediate steps.
copy of a message I sent to tahoe-dev:
I've just upgraded testnet to the most recent code, and have been playing
with larger uploads (now that they're finally possible). A couple of
performance numbers:
uploading a copy of the tahoe source tree (created with 'darcs dist'),
telling the node to copy the files directly from disk, using:
time curl -T /dev/null 'http://localhost:8011/vdrive/global/tahoe?t=upload&localdir=/home/warner/tahoe'
384 files
63 directories
about 4.6MB of data
upload takes 117 seconds
about 30MB consumed on the storage servers
0.3 seconds per file, 3.3 files per second
39kB per second
With the 3-out-of-10 encoding we're now using by default, we expect a 3.3x
expansion from FEC, so we'd expect those 4.6MB to expand to 15.3MB. The 30MB
that was actually consumed (a 2x overhead) is the effect of the 4096-byte
disk blocksize, since the tahoe tree contains a number of small files.
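A quick back-of-the-envelope check (rough numbers, using only the figures quoted above) suggests the block-rounding slack is indeed about the right size to explain the gap between the expected 15.3MB and the measured 30MB:

```python
# Rough sanity check of the 2x overhead, using the numbers quoted above.
files       = 384
shares_each = 10          # 3-of-10 encoding
blocksize   = 4096

fec_expected = 4.6e6 * 10 / 3.0                              # ~15.3MB of share data
max_rounding_slack = files * shares_each * (blocksize - 1)   # ~15.7MB worst case

print(fec_expected, max_rounding_slack)
# ~15.3MB of share data plus up to ~15.7MB of per-share block rounding
# lands right around the ~30MB actually consumed.
```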
Uploading a copy of a recent linux kernel (linux-2.6.22.1.tar.bz2, 45.1MB)
tests out the large-file performance, this time sending the bytes over the
network (albeit from the same host as the node), using an actual http PUT:
time curl -T linux-2.6.22.1.tar.bz2 'http://localhost:8011/vdrive/global/big/linux-2.6.22.1.tar.bz2'
The 3.3x expansion of a 45.1MB file would lead us to expect 150.3MB consumed,
so the 151MB that was actually consumed is spot on.
Downloading the kernel image took place at 4.39MBps on the same host as the node, and at 4.46MBps on a separate host (the introducer).
Please note that these speed numbers are somewhat unrealistic: on our
testnet, we have three storage servers running on one machine, and an
introducer/vdrive-server running on a second. Both machines live in the same
cabinet and are connected to each other by a gigabit-speed network (not that
it matters, because the introducer/vdrive-server holds minimal amounts of
data). So what we're measuring here is the speed at which a node can do FEC
and encryption, and the overhead of Foolscap's SSL link encryption, and maybe
the rate at which we can write shares to disk (although these files are small
enough that the kernel can probably buffer them entirely in memory and then
write them to disk at its leisure).
Having storageservers on separate machines would be both better and worse:
worse because the shares would have to be transmitted over an actual wire
(instead of through the loopback interface), and better because then the
storage servers wouldn't be fighting with each other for access to the shared
disk and CPU. When we get more machines to dedicate to this purpose, we'll do
some more performance testing.
here's a graph of overhead (although I'll be the first to admit it's not the best conceivable way to present this information...): overhead1.png
The blue line is URI length. This grows from about 16 characters for a tiny (2-byte) file, to about 160 characters for everything longer than 55 bytes.
The pink line is the effective expansion ratio. This is zero for small (<55-byte) files, since we use LIT URIs. Then it gets really big, because we consume 40960 bytes for a 56-byte file, and that consumption stays constant up to a 10095-byte file. Then it jumps to 81920 bytes, until we hit 122880 bytes at about 22400-byte files. It asymptotically approaches 3.3x (from above) as the filesize gets larger (and the effect of the 4kB blocksize gets smaller).
Attachment overhead1.png (14114 bytes) added
Attachment overhead2.png (18836 bytes) added
ok, this one is more readable. The two axes are in bytes, and you can see how we get constant 41kB storage space until we hit 10k files, then 82kB storage space (two disk blocks per share) until we hit 22k files, then the stairstep continues until the shares get big enough for the disk blocks to not matter. We approach the intended 3.3x as the files get bigger, getting close enough by about 1MB files that the difference no longer matters.
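The stairstep can be approximated with a simple formula: each of the 10 shares holds about filesize/3 bytes and occupies a whole number of 4096-byte disk blocks. Ignoring per-share hash-tree and lease metadata (which is why the measured step boundaries land a bit earlier, e.g. at 10095 bytes rather than 3 * 4096 = 12288), something like this sketch reproduces the curve:

```python
# Approximate on-disk consumption for a file of the given size, ignoring
# per-share metadata/hash overhead (so the step boundaries fall slightly
# later than the measured numbers above).
import math

def consumed(filesize, k=3, n=10, blocksize=4096):
    share_size = math.ceil(filesize / k)
    blocks_per_share = max(1, math.ceil(share_size / blocksize))
    return n * blocks_per_share * blocksize

for size in [56, 10000, 20000, 100000, 1000000]:
    print(size, consumed(size), consumed(size) / size)
# 56 bytes -> 40960 consumed (the 41kB floor)
# 1MB      -> ~3.36MB consumed, i.e. close to the intended 3.3x
```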
I'm adding a tool called source:misc/storage-overhead.py to produce these measurements. To run it, use
PYTHONPATH=instdir/lib python misc/storage-overhead.py 1234
and it will print useful storage-usage numbers for each filesize you give it. You can also pass 'chart' instead of a filesize to produce a CSV file suitable for passing into gnumeric or some other spreadsheet (which is how I produced the graphs attached here).
and now I'm going to close out this ticket, because I think we've improved the situation well enough for now.