converge same file, same K, different N #678

New Issue

zooko · 2009-04-08T23:02:46Z

zooko commented

2009-04-08 23:02:46 +00:00

The underlying erasure code is "systematic", which means the first K shares are simply the segments of the file (except that the last one, which contains the end of the file, is padded out to be the same size as the others). The erasure code also has the property (I don't know what it is called) that the "check shares" or "secondary shares" -- the ones after the first K -- are also the same regardless of what N is.

Therefore if you upload a file with, e.g. K=13, N=16 and then you re-upload it with K=13, N=26, then the index 0 through the index 15 share that you upload would be exactly the same as they were in the original upload (if you used the same encryption key to encrypt the file before erasure-coding it).

However, Tahoe currently doesn't take advantage of this coincidence at all, because it doesn't use the same encryption key for those two files. Instead it includes N in the generation of the encryption key, and in the resulting immutable-file read-cap, so that the two uploads of the same file with the same K and different N result in completely different sets of shares and different read-caps. specs: [file-encoding.txt]source:docs/specifications/file-encoding.txt@20090222054054-4233b-ce16f0d882f804485e792782c1a9527db25114d0; source code: [upload.py]source:src/allmydata/immutable/upload.py@20090304013715-4233b-fbbf44eba801fb41795d4421bbcb7a028d45e6ff#L1116

To fix this ticket, figure out how to retain all of the safety and security properties that Tahoe immutable-file currently have, while also letting those two uploads share their first 16 shares.

(By the way, the reason I was reminded of this is that I'm currently doing an upload exactly like this: dogfood tasting report . :-))

The underlying erasure code is "systematic", which means the first `K` shares are simply the segments of the file (except that the last one, which contains the end of the file, is padded out to be the same size as the others). The erasure code also has the property (I don't know what it is called) that the "check shares" or "secondary shares" -- the ones after the first `K` -- are also the same regardless of what `N` is. Therefore if you upload a file with, e.g. `K=13`, `N=16` and then you *re-upload* it with `K=13`, `N=26`, then the index 0 through the index 15 share that you upload would be exactly the same as they were in the original upload (if you used the same encryption key to encrypt the file before erasure-coding it). However, Tahoe currently doesn't take advantage of this coincidence at all, because it doesn't use the same encryption key for those two files. Instead it includes `N` in the generation of the encryption key, and in the resulting immutable-file read-cap, so that the two uploads of the same file with the same `K` and different `N` result in completely different sets of shares and different read-caps. specs: [file-encoding.txt]source:docs/specifications/file-encoding.txt@20090222054054-4233b-ce16f0d882f804485e792782c1a9527db25114d0; source code: [upload.py]source:src/allmydata/immutable/upload.py@20090304013715-4233b-fbbf44eba801fb41795d4421bbcb7a028d45e6ff#L1116 To fix this ticket, figure out how to retain all of the safety and security properties that Tahoe immutable-file currently have, while also letting those two uploads share their first 16 shares. (By the way, the reason I was reminded of this is that I'm currently doing an upload exactly like this: [dogfood tasting report](http://allmydata.org/pipermail/tahoe-dev/2009-April/001554.html) . :-))

zooko added the

labels 2009-04-08 23:02:46 +00:00

zooko added this to the undecided milestone 2009-04-08 23:02:46 +00:00

zooko commented

2009-08-10 15:29:00 +00:00

The following clump of tickets might be of interest to people who are interested in this ticket: #711 (repair to different levels of M), #699 (optionally rebalance during repair or upload), #543 ('rebalancing manager'), #232 (Peer selection doesn't rebalance shares on overwrite of mutable file.), #678 (converge same file, same K, different M), #610 (upload should take better advantage of existing shares), #573 (Allow client to control which storage servers receive shares).

zooko commented

2009-08-10 15:45:46 +00:00

Also related: #778 ("shares of happiness" is the wrong measure; "servers of happiness" is better).

warner commented

2009-08-25 08:37:44 +00:00

One significant barrier to this is the share hash tree. When a
file is uploaded, we take the hash of all shares (well, the block
hash tree root for all shares) and put them into a list, then we
build another merkle tree over that list, and store the root of
that tree in the UEB (where it gets folded into the
integrity-checking part of the filecap).

This list is of length "N", the number of shares that were
created, padded up to a power-of-two boundary (for the merkle
tree). This makes it impossible to add more shares to the same
hash tree.

So if a file has already been uploaded with 3-of-10, and now you
decide you'd like to encode it with 3-of-16, then shares 0..9
will be the same, but the merkle tree that was distributed with
the first upload won't contain the hashes for shares 10..15, and
the UEB will be different for the two uploads, so the filecaps
will be different (to be specific, the readkey [therefore the
storage-index]and part will be the same, but the UEB hash that
provides all the integrity information will be different).

Let's analyze the compatibility grid. One one axis is the person
doing the download: either Sally holding the "short" 3-of-10
filecap, or Lucy holding the "long" 3-of-16 filecap. The other is
the class of share being used:

A: the original 10 shares from the 3-of-10 upload
B: shares 0..9 from the new 3-of-16 upload
C: shares 10..15 from the new 3-of-16 upload

(in ideal circumstances, the 3-of-16 upload would skip the "B"
shares, but if that uploader didn't happen to see the "A" shares
out there already, they might generate "B" shares)

Sally handles "A" shares just fine, as usual. When she sees "B"
shares, the current code will reject them because the UEB is
different than the filecap. The shares don't contain a full share
hash tree, but rather just the minimal chain necessary to
validate that one share. If she has enough "A" shares to know the
expected value of the "B" share in question (i.e. that leaf of
the short tree), then she can use the "B" share with confidence
(she'd use the share, but ignore the long hash chain that comes
with it). This basically means she must get that share's sibling:
if she has A0, she can validate B1; if she has A3, she can
validate B2, etc.

Should she see "C" shares (say that 5 times fast!), she's stuck.
There is no hash chain that will get her from the C share's hash
to her filecap's UEB hash. She could download the C share anyways
and throw it into FEC, and then test the resulting ciphertext
segment against the ciphertext hash tree: if it matches, great,
the share was good. If it doesn't, then she won't even know which
share had a problem.

Now, what about Lucy? When she sees an "A" share, she'll be in
the same boat as Sally looking at a "B" share. If Lucy manages to
find a sibling "B" share for each potential "A" share, then she
can validate the share hash, and use the share. But if she can't
(such as if there are no "B" shares, because the uploader
didn't create any because it saw the "A" shares out there), then
she won't be able to validate the "A" shares, and then she'll be
in the Sally-vs-C-shares boat: try FEC and catch problems with
the ciphertext hash.

So, what could be done to improve this?

include a full copy of the share hash tree in each share, not
just the minimal chain: this would let Sally use arbitrary "B"
shares and Lucy use arbitrary "A" shares, without forcing them
to locate a sibling share. It still wouldn't let Sally use "C"
shares with confidence.
allow the Downloader to speculate a bit: keep track of which
shares went into FEC, if the ciphertext hash fails then
iterate through the various combinations of shares, use some
sort of constraint-based logic to figure out which
combinations are still worth trying. The combinatorics will be
smaller with fewer non-validated shares, so the peer-selection
code should prioritize validatable shares.
if you plan ahead, you can pretend you're encoding to 3-of-16
or 3-of-32 or whatever, but then only actually upload the
first 10 shares. You still have to do all the encoding work,
however, because you have to compute the correct merkle tree
to put in those 10 shares, but you can save the bandwidth and
storage space for the shares that you'll create later. It's
not clear where the break-even point would be.

I'm not sure what else could be done.

For mutable files, it's a slightly different story. The encoding
scheme is basically the same, but if you're willing to modify all
the existing shares, you can increase the size of the share hash
tree later on (build a bigger tree, sign the root, then update
all shares with the new tree). Unfortunately I believe that the
share format puts the share hash tree before the share data, so
if you increase its size, you must also move the data (and the
remote API has no insert() method), so you'll basically have to
re-upload everything, including the old share data. So it's
possible, but expensive.

One significant barrier to this is the share hash tree. When a file is uploaded, we take the hash of all shares (well, the block hash tree root for all shares) and put them into a list, then we build another merkle tree over that list, and store the root of that tree in the UEB (where it gets folded into the integrity-checking part of the filecap). This list is of length "N", the number of shares that were created, padded up to a power-of-two boundary (for the merkle tree). This makes it impossible to add more shares to the same hash tree. So if a file has already been uploaded with 3-of-10, and now you decide you'd like to encode it with 3-of-16, then shares 0..9 will be the same, but the merkle tree that was distributed with the first upload won't contain the hashes for shares 10..15, and the UEB will be different for the two uploads, so the filecaps will be different (to be specific, the readkey [therefore the storage-index]and part will be the same, but the UEB hash that provides all the integrity information will be different). Let's analyze the compatibility grid. One one axis is the person doing the download: either Sally holding the "short" 3-of-10 filecap, or Lucy holding the "long" 3-of-16 filecap. The other is the class of share being used: * A: the original 10 shares from the 3-of-10 upload * B: shares 0..9 from the new 3-of-16 upload * C: shares 10..15 from the new 3-of-16 upload (in ideal circumstances, the 3-of-16 upload would skip the "B" shares, but if that uploader didn't happen to see the "A" shares out there already, they might generate "B" shares) Sally handles "A" shares just fine, as usual. When she sees "B" shares, the current code will reject them because the UEB is different than the filecap. The shares don't contain a full share hash tree, but rather just the minimal chain necessary to validate that one share. If she has enough "A" shares to know the expected value of the "B" share in question (i.e. that leaf of the short tree), then she can use the "B" share with confidence (she'd use the share, but ignore the long hash chain that comes with it). This basically means she must get that share's sibling: if she has A0, she can validate B1; if she has A3, she can validate B2, etc. Should she see "C" shares (say that 5 times fast!), she's stuck. There is no hash chain that will get her from the C share's hash to her filecap's UEB hash. She could download the C share anyways and throw it into FEC, and then test the resulting ciphertext segment against the ciphertext hash tree: if it matches, great, the share was good. If it doesn't, then she won't even know which share had a problem. Now, what about Lucy? When she sees an "A" share, she'll be in the same boat as Sally looking at a "B" share. If Lucy manages to find a sibling "B" share for each potential "A" share, then she can validate the share hash, and use the share. But if she can't (such as if there are *no* "B" shares, because the uploader didn't create any because it saw the "A" shares out there), then she won't be able to validate the "A" shares, and then she'll be in the Sally-vs-C-shares boat: try FEC and catch problems with the ciphertext hash. So, what could be done to improve this? * include a full copy of the share hash tree in each share, not just the minimal chain: this would let Sally use arbitrary "B" shares and Lucy use arbitrary "A" shares, without forcing them to locate a sibling share. It still wouldn't let Sally use "C" shares with confidence. * allow the Downloader to speculate a bit: keep track of which shares went into FEC, if the ciphertext hash fails then iterate through the various combinations of shares, use some sort of constraint-based logic to figure out which combinations are still worth trying. The combinatorics will be smaller with fewer non-validated shares, so the peer-selection code should prioritize validatable shares. * if you plan ahead, you can pretend you're encoding to 3-of-16 or 3-of-32 or whatever, but then only actually upload the first 10 shares. You still have to do all the encoding work, however, because you have to compute the correct merkle tree to put in those 10 shares, but you can save the bandwidth and storage space for the shares that you'll create later. It's not clear where the break-even point would be. I'm not sure what else could be done. For mutable files, it's a slightly different story. The encoding scheme is basically the same, but if you're willing to modify all the existing shares, you can increase the size of the share hash tree later on (build a bigger tree, sign the root, then update all shares with the new tree). Unfortunately I believe that the share format puts the share hash tree before the share data, so if you increase its size, you must also move the data (and the remote API has no insert() method), so you'll basically have to re-upload everything, including the old share data. So it's possible, but expensive.

davidsarah commented

2009-10-28 04:10:07 +00:00

Tagging issues relevant to new cap protocol design.