storage format is awfully inefficient for small shares #80

Closed
opened 2007-07-08 08:30:14 +00:00 by warner · 11 comments

Eventually we're going to need to revisit our [StorageServer](wiki/StorageServer) implementation. The current approach stores each share in a separate directory, puts the share itself in 'data', and puts each piece of metadata in its own file. This results in about 7 files per share.

This approach is nice and simple and understandable and browsable, but not particularly efficient (at least under ext3). For a 20-byte share (resulting from a 476-byte file), the directory appears to consume about 33kB, and the parent directory (which holds 58 such shares for the same file) appears to consume 2MB. This is probably just the basic disk-block quantization that most filesystems suffer from. Lots of small files are expensive.

Testing locally, it looks like concatenating all of the files for a single (884-byte) share reduces the space consumed by that share from 33kB to 8.2kB. If we move that file up a level, so that we don't have a directory-per-share, just one file-per-share, then the space consumed drops to 4.1kB.

So I'm thinking that in the medium term, we either need to move to reiserfs (which might handle small files more efficiently) or change our [StorageServer](wiki/StorageServer) to try to put all the data in a single file, which means committing to some of the metadata and pre-allocating space for it in the sharefile.
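
For illustration, here's a minimal sketch of what a packed single-file share could look like, assuming a small fixed header of section offsets; the field names and layout are hypothetical, not a committed format:

```python
import struct

# Hypothetical single-file share layout: a fixed-size header of offsets,
# followed by the share data and the metadata blobs packed back to back.
HEADER_FMT = ">4I"   # offsets of: data, block hashes, share hashes, uri_extension
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def write_sharefile(path, data, block_hashes, share_hashes, uri_extension):
    sections = [data, block_hashes, share_hashes, uri_extension]
    offsets, pos = [], HEADER_SIZE
    for s in sections:
        offsets.append(pos)
        pos += len(s)
    with open(path, "wb") as f:
        f.write(struct.pack(HEADER_FMT, *offsets))
        for s in sections:
            f.write(s)
```

One file per share also means a single disk block can hold the whole thing for small shares, which is where the 33kB-to-4.1kB drop measured above comes from.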

warner added the code, major, defect, 0.4.0 labels 2007-07-08 08:30:14 +00:00
warner added this to the undecided milestone 2007-07-08 08:30:14 +00:00
Author

Ooh, it gets worse, I was trying to upload a copy of the 13MB tahoe source tree into testnet (which has about 1620 files, two thirds of which are patches under _darcs/). The upload failed about two thirds of the way through because of a zero-length file (see #81), but just 2/3rds of the upload consumes 1.2GB per storageserver (when clearly that should be closer to 13MB*4/3*2/3, say 11.5MB).

This 100x overhead is going to be a problem...
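
(For reference, that expected figure is just the FEC expansion times the fraction of the tree that made it up before the failure; a quick check:)

```python
# 13MB of source data, expanded 4/3 by FEC, but only ~2/3 of the tree uploaded
expected_mb = 13 * (4.0 / 3) * (2.0 / 3)
print("%.1f MB expected" % expected_mb)         # ~11.6 MB
print("%.0fx overhead" % (1200 / expected_mb))  # observed 1.2GB => roughly 100x
```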

Author

Oh, and the tahoe-storagespace munin plugin stops working with that many directories. I rewrote it to use the native /usr/bin/du program instead of doing the directory traversal in python, and it still takes 63 seconds to measure the size of all three storageservers on testnet, which is an order of magnitude more than munin will allow before it gives up on the plugin (remember these get run every 5 minutes). It looks like three storageservers-worth of share directories is too large to fit in the kernel's filesystem cache, so measuring all of them causes thrashing. (in contrast, measuring just one node's space takes 14s the first time and just 2s each time thereafter).

So the reason that the [storage space graph](http://allmydata.org/tahoe-munin/tahoebs1.allmydata.com-tahoe_storagespace.html) is currently broken is that the munin plugin can't keep up.
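
A minimal sketch of the du-based measurement, assuming hypothetical node names and storage paths (the real plugin will differ):

```python
import subprocess

def measure_kilobytes(sharedir):
    # Let the native du produce a summary total (in 1K blocks) instead of
    # walking the tree in Python; still slow when the dentry cache is cold.
    out = subprocess.check_output(["/usr/bin/du", "-sk", sharedir])
    return int(out.split()[0])

for node in ["bs1", "bs2", "bs3"]:   # illustrative node names
    kb = measure_kilobytes("/home/amduser/%s/storage" % node)   # hypothetical path
    print("%s.value %d" % (node, kb))
```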

Author

Zooko and I did some more analysis:

  • at the moment, each file is encoded into 100 shares, regardless of how many nodes are present in the mesh at that time
  • for small files (less than 2MB) each share has 846 bytes of overhead.
  • each share is stored in a separate directory, in 7 separate files (of which the actual share data is one, the uri_extension is a second, and the various hashes and pieces of metadata are others)
    • on most filesystems (e.g. ext3), each file and directory consumes at minimum a single disk block
    • on filesystems that are larger than a few gigabytes, each disk block is 4096 bytes
  • as a result, in the 0.4.0 release, each share consumes a minimum of 8*4096=32768 bytes.
    • for tiny files, 1 of these bytes is share data, 846 is validation overhead, and 31921 are filesystem quantization lossage
  • so for small files, we incur 1511424 (1.5MB) of disk usage per file (totalled across all 100 shares, on all the blockservers). This usage is constant for filesizes up to about 100kB.

Our plans to improve this:

  • #84: produce fewer shares in small networks, by having the introducer suggest 3-of-10 instead of 25-of-100 by default, for a 10x improvement
  • #85: store shares in a single file rather than 7 files and a directory, for an 8x improvement
  • #81: implement LIT uris, which hold the body of the file inside the URI. To measure the improvement of this we need to collect some filesize histograms from real disk images.
  • (maybe) #87: store fewer validation hashes in each share, to reduce that 846-byte overhead to 718 bytes.

Our guess is that this will reduce the minimum space consumed to 40960 bytes (41kB), occurring when the filesize is 10134 (10kB) or smaller.

The URI:LIT fix will cover the 0-to-80ish byte files efficiently. It may be the case that we just accept the overhead for 80-to-10134 byte files, or perhaps we could switch to a different algorithm (simple replication instead of FEC?) for those files. We'll have to run some more numbers and look at the complexity burden first.
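
One way to arrive at those two numbers, assuming 3-of-10 encoding and the 718-byte per-share overhead from #87:

```python
BLOCK = 4096            # ext3 disk block size
SHARES, NEEDED = 10, 3  # 3-of-10 encoding (#84)
OVERHEAD = 718          # per-share validation overhead after #87

print(SHARES * BLOCK)               # 40960: each share costs at least one block
print((BLOCK - OVERHEAD) * NEEDED)  # 10134: largest file whose shares fit in one block apiece
```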

warner modified the milestone from undecided to 0.5.0 2007-07-12 18:59:55 +00:00
Author

I've fixed the main problems here. My plan is to do some more tests, measure the current overhead (and record the results here), then close this ticket. #87 is a future change, since we want to retain the validation for a while, until we feel super-confident about the intermediate steps.

Author

copy of a message I sent to tahoe-dev:

I've just upgraded testnet to the most recent code, and have been playing
with larger uploads (now that they're finally possible). A couple of
performance numbers:

  • uploading a copy of the tahoe source tree (created with 'darcs dist'),

  • telling the node to copy the files directly from disk, using:
    `time curl -T /dev/null 'http://localhost:8011/vdrive/global/tahoe?t=upload&localdir=/home/warner/tahoe'`

  • 384 files

  • 63 directories

  • about 4.6MB of data

  • upload takes 117 seconds

  • about 30MB consumed on the storage servers

  • 0.3 seconds per file, 3.3 files per second

  • 39kB per second

With the 3-out-of-10 encoding we're now using by default, we expect a 3.3x
expansion from FEC, so we'd expect those 4.6MB to expand to 15.3MB. The 30MB
that was actually consumed (a 2x overhead) is the effect of the 4096-byte
disk blocksize, since the tahoe tree contains a number of small files.
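
That 2x factor falls straight out of the numbers:

```python
data_mb = 4.6
expected_mb = data_mb * 10.0 / 3    # 3-of-10 FEC: ~15.3 MB of share data
observed_mb = 30.0
print("%.1f MB expected, %.1fx block-quantization overhead"
      % (expected_mb, observed_mb / expected_mb))
```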

Uploading a copy of a recent linux kernel (linux-2.6.22.1.tar.bz2, 45.1MB)
tests out the large-file performance, this time sending the bytes over the
network (albeit from the same host as the node), using an actual http PUT:
`time curl -T linux-2.6.22.1.tar.bz2 'http://localhost:8011/vdrive/global/big/linux-2.6.22.1.tar.bz2'`

  • 1 file
  • 1 new directory
  • 45.1MB of data
  • upload takes 44 seconds
  • 151MB consumed on the storage servers
  • 1.04MB per second

The 3.3x expansion of a 45.1MB file would lead us to expect 150.3MB consumed,
so the 151MB that was actually consumed is spot on.

Downloading the kernel image took place at 4.39MBps on the same host as the node, and at 4.46MBps from a separate host (the introducer).

Please note that these speed numbers are somewhat unrealistic: on our
testnet, we have three storage servers running on one machine, and an
introducer/vdrive-server running on a second. Both machines live in the same
cabinet and are connected to each other by a gigabit-speed network (not that
it matters, because the introducer/vdrive-server holds minimal amounts of
data). So what we're measuring here is the speed at which a node can do FEC
and encryption, and the overhead of Foolscap's SSL link encryption, and maybe
the rate at which we can write shares to disk (although these files are small
enough that the kernel can probably buffer them entirely in memory and then
write them to disk at its leisure).

Having storageservers on separate machines would be both better and worse:
worse because the shares would have to be transmitted over an actual wire
(instead of through the loopback interface), and better because then the
storage servers wouldn't be fighting with each other for access to the shared
disk and CPU. When we get more machines to dedicate to this purpose, we'll do
some more performance testing.

Author

here's a graph of overhead (although I'll be the first to admit it's not the best conceivable way to present this information..): [overhead1.png](/tahoe-lafs/trac-2024-07-25/attachments/000078ac-b888-b298-26bb-cf5fbfffa123).

The blue line is URI length. This grows from about 16 characters for a tiny (2-byte) file, to about 160 characters for everything longer than 55 bytes.

The pink line is effective expansion ratio. This is zero for small (<55-byte) files, since we use LIT uris. Then it gets really big, because we consume 40960 bytes for a 56-byte file, and that consumption stays constant up to a 10095-byte file. Then it jumps to 81920 bytes until we hit 122880 bytes at about 22400-byte files. It asymptotically approaches 3.3x (from above) as the filesize gets larger (and the effect of the 4kB blocksize gets smaller).

Author

Attachment overhead1.png (14114 bytes) added

Author

Attachment overhead2.png (18836 bytes) added

Author

ok, [this one](/tahoe-lafs/trac-2024-07-25/attachments/000078ac-b888-b298-26bb-151415c154be) is more readable. The two axes are in bytes, and you can see how we get constant 41kB storage space until we hit 10k files, then 82kB storage space (two disk blocks per share) until we hit 22k files, then the stairstep continues until the shares get big enough for the disk blocks to not matter. We approach the intended 3.3x as the files get bigger, getting close enough not to matter by about 1MB files.
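
The stair-step can be approximated by rounding each share up to whole 4096-byte blocks; the per-share overhead below is an assumption chosen to roughly match the measured thresholds, not an exact figure:

```python
import math

BLOCK, SHARES, NEEDED = 4096, 10, 3
OVERHEAD = 718   # assumed per-share overhead; the measured ~10k threshold implies something close

def consumed(filesize):
    share_size = int(math.ceil(filesize / float(NEEDED))) + OVERHEAD
    blocks = int(math.ceil(share_size / float(BLOCK)))
    return SHARES * blocks * BLOCK

for size in [56, 10134, 10135, 22000, 1000000]:
    c = consumed(size)
    print("%8d-byte file -> %8d bytes consumed (%.1fx)" % (size, c, c / float(size)))
```

For a 1MB file this gives about 3.36MB consumed, which is why the curve settles onto the intended 3.3x around that size.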

Author

I'm adding a tool called source:misc/storage-overhead.py to produce these measurements. To run it, use

`PYTHONPATH=instdir/lib python misc/storage-overhead.py 1234`

and it will print useful storage-usage numbers for each filesize you give it. You can also pass 'chart' instead of a filesize to produce a CSV file suitable for passing into gnumeric or some other spreadsheet (which is how I produced the graphs attached here).

Author

and now I'm going to close out this ticket, because I think we've improved the situation well enough for now.

warner added the fixed label 2007-07-16 20:45:43 +00:00
Reference: tahoe-lafs/trac-2024-07-25#80