get_hash method in webapi for extension caching logic. #280

Open
opened 2008-01-17 05:01:47 +00:00 by nejucomo · 11 comments

The webapi could provide a call which returns the content's hash for a given capability:

get_hash(cap, hashtype) -> hash

cap - A string containing a capability.

hashtype - An enumeration type specifying the hash algorithm; example "sha256" (more below).

hash - The result of applying the specified hash to the contents referred to by cap.

Support for different hashtypes allows the backend to implement whichever types are convenient, and extension writers can request that specific types be supported in future versions.

As long as the hashtype is convenient for extensions to compute on their own, this allows them to make "smart" caching decisions. For instance, a local file system synchronization command could choose to only download (or upload) a file if get_hash returns a different hash than one computed from the local file.
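A minimal sketch of that sync decision, in Python. The `t=get_hash` webapi parameter, its URL shape, and its response format here are assumptions about the proposed call, not an existing Tahoe endpoint:

```python
import hashlib
import urllib.parse
import urllib.request

def local_hash(path, hashtype="sha256"):
    h = hashlib.new(hashtype)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

def remote_hash(node_url, cap, hashtype="sha256"):
    # Hypothetical HTTP form of the proposed get_hash(cap, hashtype) call.
    url = "%s/uri/%s?t=get_hash&hashtype=%s" % (
        node_url, urllib.parse.quote(cap, safe=""), hashtype)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("ascii").strip()

def needs_transfer(node_url, cap, path):
    # The "smart" caching decision: transfer only when the hashes differ.
    return local_hash(path) != remote_hash(node_url, cap)
```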

The Tahoe architecture may provide support for certain algorithms efficiently (because they are innate to the data structures).

nejucomo added the
unknown
minor
enhancement
0.7.0
labels 2008-01-17 05:01:47 +00:00
nejucomo added this to the eventually milestone 2008-01-17 05:01:47 +00:00

Yeah, my concern is that I'm not sure where we would store these hashes. We
could stash them as metadata on directory edges, but then the API is more
like:

 hash = dirnode.get_hash_of_child("foo.txt", "sha1")

and of course you have to have the dirnode around to ask it anything.

To have a function that just takes an arbitrary cap would either mean that
these hashes are contained inside the cap (so the cap would have to get
bigger), or that there's some magic table somewhere that maps caps to hashes
(and where do we keep this table, who gets to add to it, who gets to read
from it, etc).

I completely agree with the utility of this feature, I just don't yet see how
to implement it.

Here's something we could do:

Store such hashes (encrypted by the readcap) in the UEB (which will hopefully be renamed CEB), so Tahoe can answer queries like

get_hash(cap, hashtype) -> hash

by making a single request (typically) to a storage server. The supported hashtypes would be limited to the hashtypes that were supported by the uploader when they uploaded the file -- either just one (sha256), or maybe two or three (sha256 and Tiger and RIPEMD-160?). Most code which does file validation stuff nowadays still uses MD5, SHA-1, or Tiger, but the first two really shouldn't be used for secure file validation in the future, so I would be happy to not support them.

By the way, storing an encrypted sha256 hash of the plaintext in the CEB is something that Rob and perhaps Brian and perhaps I want to do anyway in order to give further assurance that there wasn't a bug or wrong symmetric key in our decryption of the validated ciphertext.
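To illustrate the "encrypted by the readcap" part, a rough sketch: the tag string and pad construction here are made up for illustration, not Tahoe's actual scheme.

```python
import hashlib
import hmac

def encrypt_plaintext_hash(readkey: bytes, plaintext_hash: bytes) -> bytes:
    # A deterministic pad derived from the per-file readkey, used to
    # encrypt exactly one fixed-length value (the plaintext hash).
    pad = hmac.new(readkey, b"plaintext-hash-v1", hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(pad, plaintext_hash))

# XOR with the same pad decrypts, so only readcap holders can recover it.
decrypt_plaintext_hash = encrypt_plaintext_hash
```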

A user of allmydata.com's consumer backup service just requested that it display the md5sum of a file on the web site so that he could use that to assure himself that the file had uploaded completely and correctly.

warner modified the milestone from eventually to undecided 2008-06-01 20:58:01 +00:00
warner added
code-frontend-web
and removed
unknown
labels 2009-03-08 22:02:51 +00:00
Author

The comments above seem to only consider a well-known hash function, like SHA256, and indeed it seems like including such a hash would add some overhead or complexity to the storage format. This might be worth it.

However, when I originally wrote this, I imagined there was some hashtype which was "innate" to Tahoe storage structures, and therefore this call could extract that information efficiently from a cap.

After a quick skim of the architecture doc, it sounds like there is a Merkle tree stored in the capability extension block. If this is a tree over the plaintext, then the root of this tree could be efficiently returned by the proposed call, such as:

get_hash(myCap, "tahoe_content_merkle_tree_root")

Clients would then need to compute a Merkle tree themselves, but I expect this would be fairly simple and efficient, given the right library for computing Merkle trees.
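For instance, a client-side Merkle root over a local file might look like the sketch below. The leaf size, tag strings, and odd-level padding rule here are assumptions; to match Tahoe's actual tree a client would need the exact parameters Tahoe uses.

```python
import hashlib

LEAF_SIZE = 65536  # assumed; must match the uploader's segmentation

def _h(tag, data):
    return hashlib.sha256(tag + data).digest()

def merkle_root(f):
    # Hash fixed-size leaves, then pair nodes upward until one root remains.
    leaves = []
    while True:
        block = f.read(LEAF_SIZE)
        if not block:
            break
        leaves.append(_h(b"leaf:", block))
    level = leaves or [_h(b"leaf:", b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels (padding rule assumed)
        level = [_h(b"node:", level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()

# e.g. with open("myblob.bin", "rb") as f: print(merkle_root(f))
```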

Because I've noticed a thread on tahoe-dev about caching, and I've seen some tickets related to caching, I'm going to link all of these related tickets and threads together.

Author

See ticket #316 for a built-in caching feature proposal.

I personally prefer this minimal code change which makes it easier for clients to do caching versus a built-in caching feature. Fewer features, fewer configuration states, and more test-coverage per component.

There is currently no hash of the plaintext stored. See http://allmydata.org/~zooko/lafs.pdf diagram 1 for what is stored for an immutable file currently. We used to have one, but we took it out because it was visible to anyone (it was stored on storage servers unencrypted), and this enabled anyone to mount guess-and-check attacks (per http://hacktahoe.org/drew_perttula.html). #453 (safely add plaintext_hash to immutable UEB) is a ticket to add plaintext hashes back but store them encrypted under the read-cap.

If we had #453, we could easily give out the hash-of-plaintext or else the root-of-merkle-tree-of-plaintext to serve this API. But wait a minute, what's the use case of this proposed API again? How come the user can't just use the verify cap instead of this hash-of-the-plaintext?

davidsarah commented 2009-10-28 04:09:24 +00:00
Owner

Tagging issues relevant to new cap protocol design.

I still don't understand why the use case for this isn't satisfied by verify caps.

Author

Replying to zooko:

I still don't understand why the use case for this isn't satisfied by verify caps.

Here's a use case I advocate:

  • I have a large file called myblob.bin and a capability, $C (of any kind), which I believe is associated with some revision of myblob.bin.
  • I use a commandline tool to calculate a cryptographic-hash-like value. Example alternatives:
      • $ md5sum myblob.bin > local_hash
      • $ pyeval 'hashlib.sha256(ri).hexdigest()' < myblob.bin > local_hash
      • $ tahoe calculate_hashlike_thingy --input-file myblob.bin > local_hash
  • I then ask tahoe for the hash-like value given the capability:
      • $ tahoe calculate_hashlike_thingy --input-uri $C > lafs_hash
      • NOTE: For my use case, I want this command to not do any networking, if possible.
  • Compare the results for equality:
      • $ if ! diff -q local_hash lafs_hash ; then echo 'This revision of myblob.bin is not stored at that capability.' ; fi

So for this use case to be satisfied by verify caps I need this command:

$ tahoe spit_out_verify_cap < myblob.bin

This command should only read myblob.bin but should not do any networking or use any state other than the cap and myblob.bin (so that any tahoe user on any grid can run it).

Is it feasible to make this command? That would satisfy my goal for this ticket.

zooko self-assigned this 2012-02-21 22:09:08 +00:00
davidsarah commented 2012-02-22 00:53:10 +00:00
Owner

Replying to nejucomo (comment 12):

So for this use case to be satisfied by verify caps I need this command:

$ tahoe spit_out_verify_cap < myblob.bin

This command should only read myblob.bin but should not do any networking or use any state other than the cap and myblob.bin (so that any tahoe user on any grid can run it).

Is it feasible to make this command? That would satisfy my goal for this ticket.

Yes, it is feasible to make this command. Depending on the cap protocol, it might have to do all the work of erasure coding the file and computing a Merkle hash of the ciphertext shares before it can compute the verify cap.

Your use case could also be met with a Merkle hash of the plaintext and convergence secret, which could be computed without erasure coding. But there's a tradeoff between being able to do that and the cap size: in order to be able to recover the plaintext hash from the read cap without network access, the encryption bits and the integrity bits of the read cap must be separate, which means that the minimum immutable read cap size for a security level of 2^K against 2^T targets is 3K + T (2K integrity bits and K+T confidentiality bits). In contrast, the scheme with the shortest read caps so far without this constraint is Rainhill 3, which has an immutable read cap size of only 2K, the minimum possible to achieve 2^K security against collision attacks.

(A simplified version of Rainhill 3 without traversal caps is here: https://tahoe-lafs.org/~davidsarah/immutable-rainhill-3x.png. It does allow you to compute a plaintext hash P, or an encrypted hash EncP_R, before doing erasure coding, but in order to recover that value from the read cap, you also need EncK_R, which is stored on the server.)
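As a worked example of that tradeoff (the numbers are illustrative, not from the comment above): with K = 128 and T = 64, the separate-bits scheme needs read caps of at least 3·128 + 64 = 448 bits (2·128 = 256 integrity bits plus 128 + 64 = 192 confidentiality bits), whereas Rainhill 3's immutable read caps are only 2·128 = 256 bits.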

davidsarah commented 2012-02-22 01:01:21 +00:00
Owner

BTW, if you drop the feature of being able to derive a verify cap from a read cap off-line, then a verify cap could include the information normally stored on the server that allows verifying a plaintext off-line without doing erasure coding, and read caps could still be optimally short. However, in practice I think off-line derivation of verify caps is the more useful feature.
