rearrange share format to make downloads faster #1543

Open
opened 2011-09-26 00:02:01 +00:00 by warner · 6 comments

(I thought I had a ticket for this already, but now I can't find it)

In a new share format, we could arrange the layout of the various share pieces to benefit download speed. The biggest slowdowns are:

  • extra round trips required by unpredictable/overflexible layouts: the downloader must first fetch an offset table, and then issue a second query to fetch the actual data. Speculative fetches can mitigate this, but a deterministic layout is preferable.
  • scatter/gather reads (e.g. hash-tree nodes spread out over a large area) require discontiguous read operations, which means larger read vectors, more disk IO operations, and in the worst case more disk seeks.

Our current designs treat whole-file downloads and random-access as equal peers: we design first for minimum alacrity for the single-segment case. We could instead recognize that most downloads are whole-file, and accept a slight alacrity hit in exchange for faster throughput and higher transport efficiency:

  • we could fetch multiple segments at once, increasing alacrity and memory footprint for faster throughput
  • we could fetch hash nodes for multiple segments ahead of time

and we could rearrange the share to help:

  • Zooko's idea was to store extra hash nodes. We could put the full merkle chain after each data block; my thought was to go further and append those nodes with the bottom-most ones first. This would increase storage requirements, because we'd be storing multiple redundant copies of the hash tree nodes. But reads would be much faster: each segment would require a block fetch of the data block plus 0 or more trailing hash nodes (depending upon which nodes were already known). That means one contiguous IO operation per block, which will be the fastest/most-efficient option for the storage servers.
  • integrity storage overhead with this approach becomes O(n log n) instead of O(n). Rough estimate: for a 10MB file with segsize=128KiB, I think that needs 18kB of overhead instead of 5kB (per share), about N*0.1% total, maybe double that because there are two file-sized hash trees (crypttext_hash_tree, block_hash_tree). (See the sketch below.)
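For concreteness, here is a rough back-of-the-envelope sketch of where those 5kB/18kB figures come from. This is not code from the ticket; it assumes 32-byte hashes, one block per segment per share, treats "10MB" as 10 MiB, and ignores padding of the tree to a power of two:

```python
from math import ceil, log2

HASH_SIZE = 32  # bytes per hash value

def shared_tree_overhead(num_blocks):
    # one shared binary Merkle tree: roughly 2*n nodes of HASH_SIZE bytes each
    return 2 * num_blocks * HASH_SIZE

def per_block_chain_overhead(num_blocks):
    # full uncle chain (one hash per tree level above the leaves, root
    # excluded) repeated after every data block
    depth = ceil(log2(num_blocks))
    return num_blocks * depth * HASH_SIZE

segsize = 128 * 1024
num_blocks = ceil(10 * 1024 * 1024 / segsize)    # 80 blocks per share

print(shared_tree_overhead(num_blocks))     # 5120  -> ~5kB per share
print(per_block_chain_overhead(num_blocks)) # 17920 -> ~18kB per share
```

The exact numbers shift a little depending on padding and on whether the 10MB is decimal or binary, but the O(n) vs O(n log n) ratio is the point.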
warner added the code-encoding, major, enhancement, 1.9.0a1 labels 2011-09-26 00:02:01 +00:00
warner added this to the undecided milestone 2011-09-26 00:02:01 +00:00

Replying to warner:

Our current designs treat whole-file downloads and random-access as equal peers: we design first for minimum alacrity for the single-segment case. We could instead recognize that most downloads are whole-file, and accept a slight alacrity hit in exchange for faster throughput and higher transport efficiency

Just for context about my thinking, I am generally not that keen on increasing performance for some usages while decreasing it for others, even if the former usages are more common. One thing that I don't like about that is that people may make assumptions about expected performance based on past observed performance, and if the performance is more consistent over different use cases, their assumptions are less wrong.

This goes against a common (haha) engineering maxim of "optimize the common case". I think that maxim makes sense when a single user action is going to exercise a whole smorgasbord of different cases inside a system, so optimizing the most common of those internal cases results in a speedup for most or all possible user actions. But it is potentially problematic when different user actions exercise substantially different distributions of the internal cases.

In this particular issue I'll bet it doesn't matter much, since the extra round trips and the scatter/gather reads (the first two problems that Brian mentioned) are probably the big issues, and improving them would probably not increase the performance gap between different usages.

  • Zooko's idea was to store extra hashnodes. We could put the full merkle-chain after each data block. My thought was to further append them with the bottom-most nodes first. This would increase storage requirements, because we'd be storing multiple redundant copies of the hash tree nodes. But reads would be much faster: each segment would require a block fetch of the data block plus 0 or more trailing hash nodes (depending upon which nodes were already known). That means one contiguous IO operation per block, which will be the fastest/most-efficient for the storage servers
  • integrity storage overhead with this approach becomes O(nlogn) instead of O(n). Rough estimate: for a 10MB file, segsize=128KiB, I think that needs 18kB overhead instead of 5kB (per share), about N*0.1% total, maybe double that because there are two file-sized hash trees (crypttext_hash_tree, block_hash_tree).

Another alternative would be to start using multiple variable-length records (i.e. files) on the backend. In all three of our current backend designs, we chose to use the minimal number of variable length records (i.e. files) per share. That minimal number is 1 -- each share is stored in one file, including the share data, the hash tree nodes, and the leases. This choice minimizes the number of disk seeks necessary to find the share data and the hash tree nodes, but if the share data is large and the reader isn't going to be reading all of it in order then we may end up needing to jump around within it (or read over data that we don't need), which might end up costing us an extra seek or a few anyway.

Another problem with storing both share data and hash tree nodes together in a single file is what to do if the share data shrinks or grows, which can happen with SDMF and MDMF but not with immutables, or if the uploader doesn't know the size of the share data when they begin the upload, which currently can't happen but I hope will happen in the future (#1288).

Designing a layout, and an algorithm to manage it, that optimizes performance can get complicated, and the recent security issue #1528 has really driven home to me how even moderate complexity hides bugs. (Mea culpa: it hid the bug from me, not from Brian -- Brian implemented this moderate complexity correctly, originally.) Even if we get it right, it takes engineering time to do so.

So an alternative which makes it simpler for us is to use more variable-length records (files) per share, i.e. the storage server creates one file to hold the share data and another to hold the hash tree nodes. This costs, I think, one extra disk seek to find the second file (given that you've already paid the several disk seeks to find the first file), but it makes things easier and might actually win back some of the performance lost to the extra seek (see the sketch after the list below):

  • for the writer it makes things simpler because you can append more and more share data without pre-allocating the space, which allows you to start appending share data without even knowing how much you'll ultimately append (re: #1288) and which means you can shrink or extend an existing mutable file without needing to move the existing hash tree. Whenever the writer appends new share data, it also appends the corresponding hash tree nodes to the file that stores the hash tree nodes. Likewise, whenever an MDMF writer truncates share data, it truncates the corresponding hash tree nodes from the file that holds the hash tree. (Overwriting existing data requires changes that touch more parts of the hash tree, so the effect on the hash tree isn't as localized as appending or truncating.)
  • for the reader it makes things simple and efficient because:
      • It knows where all the share data and all the hash tree data live before it receives the results of any reads back, and even (for mutables) before it knows the current size of the file! This minimizes the number of required round trips.
      • The share data is contiguous -- uninterrupted by hash node data (as it would be in my proposal to intersperse the two, mentioned in the initial description of this ticket) -- so bulk reads of share data get optimal throughput.
      • The hash tree data can still be laid out in interesting ways to facilitate fast reads. There is never bulk data interspersed with the hash tree nodes that you are trying to read.
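A minimal sketch of that two-file split (Python; the directory layout, file names, and class here are hypothetical illustrations, not the existing Tahoe-LAFS storage-server code):

```python
import os

HASH_SIZE = 32  # bytes per hash value

class TwoFileShareWriter:
    """Hypothetical backend layout: the share data and the hash-tree nodes
    live in two separate files, so either one can grow by plain appends
    without pre-allocating space or relocating the other."""

    def __init__(self, sharedir):
        self.data_f = open(os.path.join(sharedir, "share.data"), "ab")
        self.hash_f = open(os.path.join(sharedir, "share.hashes"), "ab")

    def append_block(self, block, new_hash_nodes):
        # block: the erasure-coded block for one new segment
        # new_hash_nodes: any hash-tree nodes that became final with it
        self.data_f.write(block)
        for node in new_hash_nodes:
            assert len(node) == HASH_SIZE
            self.hash_f.write(node)

    def close(self):
        self.data_f.close()
        self.hash_f.close()
```

With this split, the offset of block i in share.data is a simple function of i, so the reader can compute its read vectors without first fetching an offset table.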

Replying to zooko (comment 2):

  • The hash tree data can still be laid out in interesting ways to facilitate fast reads. There is never bulk data interspersed with the hash tree nodes that you are trying to read.

For example, you could say:

A typical reader might as well pull down a couple thousand bytes if it is going to pull down 1 byte. We will lay out 32 consecutive leaf hashes in a row, followed by the "uncle chain" which is necessary to validate those leaf hashes. So for example if there is a 10 GiB file with 128 KiB segments and K=3, then there are 245,763 blocks per share, which means the binary Merkle Tree will be 18 levels deep. 32 leaf hashes are sufficient to allow the downloader to compute the bottom-most 5 levels, which leaves 12 uncle hash tree nodes (the root node is not needed for the uncle chain, which is why it is 12 instead of 13). We could write out the 32 leaves, followed by the 12 uncles, followed by space sufficient for 52 more uncles (in case the file grows). Each of these 64-hashvalue-wide records would take a total of 2048 bytes, and a typical reader would need to slurp only (part of) one or two of these records in order to validate up to 32 or 64 blocks of data, which would be around 1.3 or 2.6 MiB.
(This is analogous to my earlier suggestion of storing the complete uncle chain after each block of data, but now it is storing the complete uncle chain after each run of leaf hashes.)

The drawback of optimizing for the reader like that is that the writer then has to update all of those records whenever it changes the contents of any leaf (block). This is the opposite extreme of the layout where the writer has to do minimal work and the reader has to read from each of the separately located uncles. I think there may be a sweet spot tradeoff in which the first five levels of the binary tree are stored in the first record, so the writer always has to update that first record as well as the record(s) that are more local to its write and the reader always has to read that first record as well as the record(s) that are more local to its read.
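One way to make that sweet-spot split concrete is a little index arithmetic. The sketch below is illustrative only (the helper names, the 1-based heap indexing, and the five-level cutoff are assumptions, not anything specified in this ticket); it classifies, for one block, where each uncle hash would come from:

```python
def uncle_chain(leaf_index, depth):
    """Uncle nodes needed to verify one leaf of a complete binary Merkle
    tree (root at level 0, leaves at level `depth`), returned as
    (level, node) pairs using 1-based heap indexing."""
    node = (1 << depth) + leaf_index
    uncles = []
    while node > 1:
        sibling = node ^ 1
        uncles.append((sibling.bit_length() - 1, sibling))
        node //= 2
    return uncles

def classify_uncles(leaf_index, depth, top_levels=5, leaves_per_record=32):
    """Sweet-spot split: uncles in the top `top_levels` levels live in a
    shared first record, uncles inside the record's own 32-leaf subtree
    are recomputed from the record's leaf hashes, and the rest are stored
    in the record covering this leaf."""
    record_levels = leaves_per_record.bit_length() - 1   # 5 for 32 leaves
    shared, local, computed = [], [], []
    for level, node in uncle_chain(leaf_index, depth):
        if level <= top_levels:
            shared.append(node)
        elif level > depth - record_levels:
            computed.append(node)
        else:
            local.append(node)
    return shared, local, computed

# For the 18-level tree from the comment above (leaves at level 17):
shared, local, computed = classify_uncles(12345, depth=17)
print(len(shared), len(local), len(computed))   # 5, 7, 5
```

Under these assumptions a read needs 5 uncles from the shared first record and 7 stored locally, with the bottom 5 recomputed from the record's own 32 leaf hashes.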


Replying to zooko (comment 3):

We could write out the 32 leaves, followed by the 12 uncles, followed by space sufficient for 52 more uncles (in case the file grows). Each of these 64-hashvalue-wide records would take a total of 2048 bytes, and a typical reader would need to slurp only (part of) one or two of these records in order to validate up to 32 or 64 blocks of data, which would be around 1.3 or 2.6 MiB.

Anybody who followed this idea closely may have noticed a minor error here -- the records would be 96 hashvalues wide, not 64, so to download an entire record would take 3072 bytes, not 2048 bytes.
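With that correction, the record arithmetic from these comments works out as follows (a sketch of the proposed layout's numbers, not existing code; the offset helper is hypothetical):

```python
HASH_SIZE = 32           # bytes per hash value
LEAVES_PER_RECORD = 32   # consecutive leaf hashes per record
STORED_UNCLES = 12       # uncles for an 18-level tree, root excluded,
                         # bottom 5 levels recomputed from the 32 leaves
RESERVED_UNCLES = 52     # pre-allocated space in case the file grows

hashes_per_record = LEAVES_PER_RECORD + STORED_UNCLES + RESERVED_UNCLES
record_size = hashes_per_record * HASH_SIZE
print(hashes_per_record, record_size)   # 96 hashes, 3072 bytes

def record_offset(block_index):
    """Start of the hash record covering a given block, relative to the
    beginning of the hash-tree area (hypothetical layout)."""
    return (block_index // LEAVES_PER_RECORD) * record_size
```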


Replying to zooko (comment 2):

Another alternative would be to start using multiple variable-length records (i.e. files) on the backend.

I just noticed that Brian wrote some ideas about this in #600. He emphasized the importance of keeping things together in a single file in order to ease moving shares around. I agree that this is an important reason, and it was one of the reasons uppermost in our minds back then. Nowadays I perceive this as not such a big deal -- shares get migrated rarely, and when they do, a simple rsync does the right thing.

davidsarah commented 2012-03-29 19:20:44 +00:00
Owner

Duplicate of #1687?

tahoe-lafs added the normal label and removed the major label 2012-03-29 19:20:44 +00:00
Reference: tahoe-lafs/trac-2024-07-25#1543