safely add plaintext_hash to immutable UEB #453
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#453
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
We used to put a hash of the plaintext in the UEB, both a flat hash of the
whole file, and a merkle tree over the plaintext segments. This helps guard
against bugs in the decryption process, for example the bug that pycryptopp
had when compiled on certain compilers (I can't find a reference for it right
now).
But we removed it, because exposing it to the whole world enables the
"partial-information guessing attack", in exactly the same way that
universally-convergent encryption keys would do.
Robk pointed out a long while back that we could solve both problems (i.e.
get plaintext validation without exposure to the guessing attack) by
encrypting the plaintext hash with (a derivative of) the read-key. This way,
only the actual reader would get to see the plaintext hash. We could do this
by adding a field named 'protected_plaintext_hash' (and also
'protected_plaintext_roothash', for the merkle tree) to the UEB, both of
which are encrypted. Similarly we'd need to encrypt the plaintext merkle tree
(which normally lives in the share but outside the UEB).
Zooko pointed out that we could generalize this to think of the share as
containing several parts: one is unencrypted and visible to the whole world
(i.e. access is protected by only the verify-cap, which is generally
unguarded), but another is encrypted and visible only to the holder of the
read-cap. (For mutable files, we could have a third section, visible only to
the holder of the write-cap). The plaintext hash information would then go
into the read-cap section.
We also considered that, because the UEB is encoded as a dictionary, we can
add new keys to it without affecting earlier versions of the code: they will
ignore the keys that they don't know about.
Tagging issues relevant to new cap protocol design.
Se #658 for a way to use this to improve performance, by optimizing uploads/downloads where the target file already exists with the same contents.
If we require a "flat hash" of anything (i.e. a Merkle–Damgård construction such as SHA-256 or one of the Merkle–Damgård, HAIFA, or other "chain-like" or "sponge-like" constructions that are SHA-3 candidates) then compatible implementations will be unable to parallelize the computation of the secure hash.
Most of our data structures are parallelizable. It would be kind of a shame if some future implementation of Tahoe-LAFS on a [144-core ColorForth chip](http://greenarraychips.com/home/documents/greg/GA144.htm) had to stop processing a Tahoe-LAFS data structure in 144 parallel streams and wait for one of its cores to process the whole input by itself. :-)
In contrast, if we require a Merkle Tree hash of the thing, then future implementations will have the option of parallelizing the computation of it.
A properly specified Merkle Tree has the same security properties as a flat hash has, although it doesn't have the property that it will detect bugs in the Merkle Tree implementation.
Another property that it doesn't have is that it computes a widely understood, standardized hash value of the input. (Unless, of course, we can convince the SHA-3 standards body to standardize on a tree hash instead of a linear hash.)
So anyway, in the short term I think we should either not have a plaintext hash at all (which is the current situation) or have one which is optional so future implementations can skip it or define a Merkle Tree hash of the plaintext instead of a linear hash.
I like to support parallelism and performance, so if there really is a
tradeoff between having a flat hash and being able to do high-speed
super-parallel uploads, I'll prefer the choice that gives us performance.
This may mean having the flat hash be optional: if the uploader chooses to
provide it, and if the downloader chooses to download the entire file, then
the downloader will check it. This also gives enough information for a very
simple downloader to validate the whole file (or for a little shell script
that's comparing hashes of files on disk against data in the UEB).
So I think there should be four hash-like items in the shares:
and the flat hashes are only checked by a downloader who is fetching the
entire file. (note that my new downloader code is very segment-at-a-time
random-access oriented, so even in the near term the downloader might start
ignoring the flat hashes).
Then, on the day that we write a super-parallelized upload hasher (one day
after all tahoe users install an 18 exabyte-per-second DSL line, and two days
after we reduce the protocol to a single roundtrip, otherwise it wouldn't
make any significant difference), we also add a tahoe.cfg option that enables
or disables the generation of the flat hashes. Enabled would result in slower
upload hashing (one core would have to linearly see every byte of the file).
Disabled would result in faster uploads but would lose those simple flat
hashes that some downloaders might want to use.
Oh, and I forgot to mention this in the original description: we can put the
encrypted plaintext merkle tree hashes in the old
plaintext_hash_tree
section, which will be ignored by old clients as long as they don't see a
plaintext_root_hash
key in the UEB. This will let us quietly add anencrypted plaintext hash tree without impacting compatibility with older
clients.