safely add plaintext_hash to immutable UEB #453

New Issue

warner · 2008-06-04T22:06:02Z

warner commented

2008-06-04 22:06:02 +00:00

We used to put a hash of the plaintext in the UEB, both a flat hash of the
whole file, and a merkle tree over the plaintext segments. This helps guard
against bugs in the decryption process, for example the bug that pycryptopp
had when compiled on certain compilers (I can't find a reference for it right
now).

But we removed it, because exposing it to the whole world enables the
"partial-information guessing attack", in exactly the same way that
universally-convergent encryption keys would do.

Robk pointed out a long while back that we could solve both problems (i.e.
get plaintext validation without exposure to the guessing attack) by
encrypting the plaintext hash with (a derivative of) the read-key. This way,
only the actual reader would get to see the plaintext hash. We could do this
by adding a field named 'protected_plaintext_hash' (and also
'protected_plaintext_roothash', for the merkle tree) to the UEB, both of
which are encrypted. Similarly we'd need to encrypt the plaintext merkle tree
(which normally lives in the share but outside the UEB).

Zooko pointed out that we could generalize this to think of the share as
containing several parts: one is unencrypted and visible to the whole world
(i.e. access is protected by only the verify-cap, which is generally
unguarded), but another is encrypted and visible only to the holder of the
read-cap. (For mutable files, we could have a third section, visible only to
the holder of the write-cap). The plaintext hash information would then go
into the read-cap section.

We also considered that, because the UEB is encoded as a dictionary, we can
add new keys to it without affecting earlier versions of the code: they will
ignore the keys that they don't know about.

We used to put a hash of the plaintext in the UEB, both a flat hash of the whole file, and a merkle tree over the plaintext segments. This helps guard against bugs in the decryption process, for example the bug that pycryptopp had when compiled on certain compilers (I can't find a reference for it right now). But we removed it, because exposing it to the whole world enables the "partial-information guessing attack", in exactly the same way that universally-convergent encryption keys would do. Robk pointed out a long while back that we could solve both problems (i.e. get plaintext validation without exposure to the guessing attack) by encrypting the plaintext hash with (a derivative of) the read-key. This way, only the actual reader would get to see the plaintext hash. We could do this by adding a field named 'protected_plaintext_hash' (and also 'protected_plaintext_roothash', for the merkle tree) to the UEB, both of which are encrypted. Similarly we'd need to encrypt the plaintext merkle tree (which normally lives in the share but outside the UEB). Zooko pointed out that we could generalize this to think of the share as containing several parts: one is unencrypted and visible to the whole world (i.e. access is protected by only the verify-cap, which is generally unguarded), but another is encrypted and visible only to the holder of the read-cap. (For mutable files, we could have a third section, visible only to the holder of the write-cap). The plaintext hash information would then go into the read-cap section. We also considered that, because the UEB is encoded as a dictionary, we can add new keys to it without affecting earlier versions of the code: they will ignore the keys that they don't know about.

warner added the

labels 2008-06-04 22:06:02 +00:00

warner added this to the undecided milestone 2008-06-04 22:06:02 +00:00

davidsarah commented

2009-10-28 04:09:45 +00:00

Tagging issues relevant to new cap protocol design.

davidsarah commented

2009-12-07 03:13:45 +00:00

Se #658 for a way to use this to improve performance, by optimizing uploads/downloads where the target file already exists with the same contents.

zooko commented

2010-02-23 04:42:09 +00:00

If we require a "flat hash" of anything (i.e. a Merkle–Damgård construction such as SHA-256 or one of the Merkle–Damgård, HAIFA, or other "chain-like" or "sponge-like" constructions that are SHA-3 candidates) then compatible implementations will be unable to parallelize the computation of the secure hash.

Most of our data structures are parallelizable. It would be kind of a shame if some future implementation of Tahoe-LAFS on a [144-core ColorForth chip](http://greenarraychips.com/home/documents/greg/GA144.htm) had to stop processing a Tahoe-LAFS data structure in 144 parallel streams and wait for one of its cores to process the whole input by itself. :-)

In contrast, if we require a Merkle Tree hash of the thing, then future implementations will have the option of parallelizing the computation of it.
A properly specified Merkle Tree has the same security properties as a flat hash has, although it doesn't have the property that it will detect bugs in the Merkle Tree implementation.

Another property that it doesn't have is that it computes a widely understood, standardized hash value of the input. (Unless, of course, we can convince the SHA-3 standards body to standardize on a tree hash instead of a linear hash.)
So anyway, in the short term I think we should either not have a plaintext hash at all (which is the current situation) or have one which is optional so future implementations can skip it or define a Merkle Tree hash of the plaintext instead of a linear hash.

If we require a "flat hash" of anything (i.e. a Merkle–Damgård construction such as SHA-256 or one of the Merkle–Damgård, HAIFA, or other "chain-like" or "sponge-like" constructions that are SHA-3 candidates) then compatible implementations will be unable to parallelize the computation of the secure hash. Most of our data structures are parallelizable. It would be kind of a shame if some future implementation of Tahoe-LAFS on a [144-core [ColorForth](wiki/ColorForth) chip](http://greenarraychips.com/home/documents/greg/GA144.htm) had to stop processing a Tahoe-LAFS data structure in 144 parallel streams and wait for one of its cores to process the whole input by itself. :-) In contrast, if we require a Merkle Tree hash of the thing, then future implementations will have the option of parallelizing the computation of it. A properly specified Merkle Tree has the same security properties as a flat hash has, although it doesn't have the property that it will detect bugs in the Merkle Tree implementation. Another property that it doesn't have is that it computes a widely understood, standardized hash value of the input. (Unless, of course, we can convince the SHA-3 standards body to standardize on a tree hash instead of a linear hash.) So anyway, in the short term I think we should either not have a plaintext hash at all (which is the current situation) or have one which is optional so future implementations can skip it or define a Merkle Tree hash of the plaintext instead of a linear hash.

warner commented

2010-02-23 09:43:57 +00:00

I like to support parallelism and performance, so if there really is a
tradeoff between having a flat hash and being able to do high-speed
super-parallel uploads, I'll prefer the choice that gives us performance.
This may mean having the flat hash be optional: if the uploader chooses to
provide it, and if the downloader chooses to download the entire file, then
the downloader will check it. This also gives enough information for a very
simple downloader to validate the whole file (or for a little shell script
that's comparing hashes of files on disk against data in the UEB).

So I think there should be four hash-like items in the shares:

flat plaintext hash: optional
tree plaintext hash: optional
flat ciphertext hash: optional
tree ciphertext hash: mandatory

and the flat hashes are only checked by a downloader who is fetching the
entire file. (note that my new downloader code is very segment-at-a-time
random-access oriented, so even in the near term the downloader might start
ignoring the flat hashes).

Then, on the day that we write a super-parallelized upload hasher (one day
after all tahoe users install an 18 exabyte-per-second DSL line, and two days
after we reduce the protocol to a single roundtrip, otherwise it wouldn't
make any significant difference), we also add a tahoe.cfg option that enables
or disables the generation of the flat hashes. Enabled would result in slower
upload hashing (one core would have to linearly see every byte of the file).
Disabled would result in faster uploads but would lose those simple flat
hashes that some downloaders might want to use.

Oh, and I forgot to mention this in the original description: we can put the
encrypted plaintext merkle tree hashes in the old plaintext_hash_tree
section, which will be ignored by old clients as long as they don't see a
plaintext_root_hash key in the UEB. This will let us quietly add an
encrypted plaintext hash tree without impacting compatibility with older
clients.

I like to support parallelism and performance, so if there really is a tradeoff between having a flat hash and being able to do high-speed super-parallel uploads, I'll prefer the choice that gives us performance. This may mean having the flat hash be optional: if the uploader chooses to provide it, and if the downloader chooses to download the entire file, then the downloader will check it. This also gives enough information for a very simple downloader to validate the whole file (or for a little shell script that's comparing hashes of files on disk against data in the UEB). So I think there should be four hash-like items in the shares: * flat plaintext hash: optional * tree plaintext hash: optional * flat ciphertext hash: optional * tree ciphertext hash: mandatory and the flat hashes are only checked by a downloader who is fetching the entire file. (note that my new downloader code is very segment-at-a-time random-access oriented, so even in the near term the downloader might start ignoring the flat hashes). Then, on the day that we write a super-parallelized upload hasher (one day after all tahoe users install an 18 exabyte-per-second DSL line, and two days after we reduce the protocol to a single roundtrip, otherwise it wouldn't make any significant difference), we also add a tahoe.cfg option that enables or disables the generation of the flat hashes. Enabled would result in slower upload hashing (one core would have to linearly see every byte of the file). Disabled would result in faster uploads but would lose those simple flat hashes that some downloaders might want to use. Oh, and I forgot to mention this in the original description: we can put the encrypted plaintext merkle tree hashes in the old `plaintext_hash_tree` section, which will be ignored by old clients as long as they don't see a `plaintext_root_hash` key in the UEB. This will let us quietly add an encrypted plaintext hash tree without impacting compatibility with older clients.

Sign in to join this conversation.