use longer storage index / cap for collision resistance #753

Open
opened 2009-07-10 09:41:24 +00:00 by warner · 8 comments

As we work on new encoding schemes (like DSA-based mutable files), I'm
thinking that we want to put a lower bound on the cap/SI length to maintain
reasonable margin against collisions. 256 bits would be more than enough. 128
is ok, but a bit tight. 92 bits would make me nervous.

robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation
before it) because, for something that is meant as a backup system, even the
slightest possibility of the CHK-based indexing scheme mapping two documents
to the same storage index was too high for him. (I believe he would be
more satisfied with a scheme that used centrally-allocated guaranteed-unique
storage indices, which we could do but would require more coordination and
longer caps, since we could no longer use a randomly-generated readkey to
derive the storage index. In exchange for a controllable but non-zero
probability of collision, we get to avoid central coordination and use
smaller caps).

The specific places where collisions could occur are:

  • mapping from file contents to CHK-derived readkey
  • mapping from readkey (CHK-derived or randomly generated) to storage index
  • mapping from randomly-generated mutable writekey to storage index

The "birthday paradox" determines the chance of collision. If I'm doing my
math right, if you want less than 'p' chance of getting any collisions
when selecting items out of a namespace of size 'N', then you can't
select more than C = sqrt(2*N*p) items. This is called a "paradox"
(where "surprise" would be a better term) because that square root causes C
to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to C=19.
In the Tahoe context, C is the number of files you can add to the grid.

In the current case, our 128-bit storage index (N=2^128^) means that p=0.5
gets us a nice large 2^64^ number of files, except that p=0.5 is insufficient
margin: we'd much prefer a vanishingly small chance of collision, like
p=2^-64^. Fortunately we get two bits of margin for every one bit we reduce
from C. The table looks like:

N numfiles prob(collision)
96 2^48^ -> 2^-1^ (0.5)
96 2^48^ -> 2^-17^
96 2^32^ -> 2^-33^
96 2^24^ -> 2^-49^
128 2^64^ -> 2^-1^ (0.5)
128 2^56^ -> 2^-17^
128 2^48^ -> 2^-33^
128 2^32^ -> 2^-65^
192 2^96^ -> 2^-1^
192 2^80^ -> 2^-33^
192 2^64^ -> 2^-65^
256 2^128^ -> 2^-1^ (0.5)
256 2^96^ -> 2^-65^

Note that our N is the minimum of the storage-index size and the
top-most cap value (i.e. the readkey for immutable files, or the writekey for
mutable files). So a DSA-based mutable file with a 92-bit writecap gives us
an N of 2^92^, even if it is expanded into a storage-index of 128 or
256 bits.

Also note that the allmydata.com grid currently has something like 10M
objects in it, about C=2^23^.

So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap
would be nice, it's too short to provide enough collision margin. I want to
be able to put trillions of files into a grid, and I want a the chance of
collision to be so small that I don't ever need to worry about it, and 96
bits isn't really there. 128 bits is probably good enough, but doesn't have
enough margin to be obviously and unquestionably safe (C=2^32^ is a lot of
files but you can imagine people wanting more, p=2^-64^ is a tiny probability
but you can imagine people wanting a bit better). 256 would be plenty (but of
course I want my filecaps to be shorter than that).

As we work on new encoding schemes (like DSA-based mutable files), I'm thinking that we want to put a lower bound on the cap/SI length to maintain reasonable margin against collisions. 256 bits would be more than enough. 128 is ok, but a bit tight. 92 bits would make me nervous. robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation before it) because, for something that is meant as a backup system, even the slightest possibility of the CHK-based indexing scheme mapping two documents to the same storage index was too high for him. (I believe he would be more satisfied with a scheme that used centrally-allocated guaranteed-unique storage indices, which we could do but would require more coordination and longer caps, since we could no longer use a randomly-generated readkey to derive the storage index. In exchange for a controllable but non-zero probability of collision, we get to avoid central coordination and use smaller caps). The specific places where collisions could occur are: * mapping from file contents to CHK-derived readkey * mapping from readkey (CHK-derived or randomly generated) to storage index * mapping from randomly-generated mutable writekey to storage index The "birthday paradox" determines the chance of collision. If I'm doing my math right, if you want less than '`p`' chance of getting any collisions when selecting items out of a namespace of size '`N`', then you can't select more than `C = sqrt(2*N*p)` items. This is called a "paradox" (where "surprise" would be a better term) because that square root causes C to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to C=19. In the Tahoe context, `C` is the number of files you can add to the grid. In the current case, our 128-bit storage index (N=2^128^) means that p=0.5 gets us a nice large 2^64^ number of files, except that p=0.5 is insufficient margin: we'd much prefer a vanishingly small chance of collision, like p=2^-64^. Fortunately we get two bits of margin for every one bit we reduce from C. The table looks like: | | | | | |---|---|---|---| |N|numfiles| |prob(collision)| |96|2^48^|->|2^-1^ (0.5)| |96|2^48^|->|2^-17^| |96|2^32^|->|2^-33^| |96|2^24^|->|2^-49^| |128|2^64^|->|2^-1^ (0.5)| |128|2^56^|->|2^-17^| |128|2^48^|->|2^-33^| |128|2^32^|->|2^-65^| |192|2^96^|->|2^-1^| |192|2^80^|->|2^-33^| |192|2^64^|->|2^-65^| |256|2^128^|->|2^-1^ (0.5)| |256|2^96^|->|2^-65^| Note that our `N` is the minimum of the storage-index size and the top-most cap value (i.e. the readkey for immutable files, or the writekey for mutable files). So a DSA-based mutable file with a 92-bit writecap gives us an `N` of 2^92^, even if it is expanded into a storage-index of 128 or 256 bits. Also note that the allmydata.com grid currently has something like 10M objects in it, about C=2^23^. So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap would be nice, it's too short to provide enough collision margin. I want to be able to put trillions of files into a grid, and I want a the chance of collision to be so small that I don't ever need to worry about it, and 96 bits isn't really there. 128 bits is probably good enough, but doesn't have enough margin to be obviously and unquestionably safe (C=2^32^ is a lot of files but you can imagine people wanting more, p=2^-64^ is a tiny probability but you can imagine people wanting a bit better). 256 would be plenty (but of course I want my filecaps to be shorter than that).
warner added the
code-encoding
major
defect
1.4.1
labels 2009-07-10 09:41:24 +00:00
warner added this to the undecided milestone 2009-07-10 09:41:24 +00:00

Thanks for this analysis. I like your comments at the end that we want a bit of "overkill" in number of files and chance of collision. People who don't want to rely on the collision-resistance of secure hash functions at all are probably not part of our market, but people who are willing to do so in principle, but would feel better with a nice fat margin of safety are definitely in our market.

Note that if you generate a new write cap (private key) and then check and it turns out that the same write cap has been generated by someone else then now you have gained the ability to write to their mutable file or directory! That's why I have been thinking that 96-bits was too few for write caps. Originally I had been thinking something like "It would probably not be worthwhile for any attacker to spend 2^96^ computer power trying to forge a write cap.". But this way of thinking discounts two important factors: chance of success and number of targets. If there are 2^40^ writecaps in use, then if an attacker does a mere 2^36^ work (excluding the cost of checking whether each writecap that they generate is already out there), then they have a 2^-20^ chance of forging someone's writecap. (Thanks to Daniel J. Bernstein's papers and mailing list postings for helping me understand this important point. http://cr.yp.to/snuffle/bruteforce-20050425.pdf )

However, a storage-index collision doesn't sound nearly as serious to me. No integrity or confidentiality is necessarily lost due to storage-index collision, right? Well, it could complicate the question of "which servers are handling which shares of this mutable file" -- an issue that is already not well managed.

Anyway, nowadays I think of storage-indexes as a separate layer built on top of the crypto layer. People can define their storage indexes as secure hashes of some pieces of the capabilities if they want, or choose random storage indices, or hierarchical ones based on the DNS names, or just not have storage indices at all and require every downloader to query every server. It shouldn't impact the security of the crypto layer, if the crypto layer includes integrity checking using the verifycap itself on the storage-server side.

I think we should write a document called something like "crypto failure modes (What could possibly go wrong?)" that explains what the consequences are of each different type of failure. (As requested by Nathan: http://allmydata.org/pipermail/tahoe-dev/2009-April/001661.html .)

The one discussed above is caused by "two people choose the same write cap (signing key) seed (possibly due to malice)". That one leads to an integrity failure, where one of the people thinks that they are the only one with a write-cap to a specific file or directory, but actually someone else also has the same write-cap.

So I think that is the worst one (because I value integrity highly). Another similar integrity failure could come about from a failure of the digital signature algorithm -- i.e. if someone were able to forge digital signatures even without the writecap. (Note that a collision in the hash function used by the digital signature algorithm could cause this. People who don't want to rely on collision-resistance of secure hash functions at all can't even rely on rsa, dsa, ssh, ssl, or gpg, although I should hasten to add that those algorithms typically include some randomness in the input to their secure hash function, to make it that much harder for attackers to cause collisions.)

After those integrity failures, there are confidentiality failures. The obvious one is someone being able to crack AES-128-CTR without having the readkey. Another one is if the content-hash-key encryption where to generate the same readkey for two different immutable files. I suppose that's another reason why using an added convergence secret is safer (although I hasten to add that I see no reason to distrust the collision-resistance of SHA-256d-truncated-to-128-bits at the present).

Note that of course in practice the dangers from bugs and from operator error (i.e. misplacing your keys) are a lot greater than these algorithmic crypto risks. So much greater that the former are pretty much guaranteed to happen and the latter will probably never. Nonetheless, I value getting the crypto part right so that it is secure and also so that everyone who is willing to rely on crypto in principle is willing to rely on our crypto, so thanks for you help with this.

Thanks for this analysis. I like your comments at the end that we want a bit of "overkill" in number of files and chance of collision. People who don't want to rely on the collision-resistance of secure hash functions at *all* are probably not part of our market, but people who are willing to do so in principle, but would feel better with a nice fat margin of safety are definitely in our market. Note that if you generate a new write cap (private key) and then check and it turns out that the same write cap has been generated by someone else then now you have gained the ability to write to their mutable file or directory! That's why I have been thinking that 96-bits was too few for write caps. Originally I had been thinking something like "It would probably not be worthwhile for any attacker to spend 2^96^ computer power trying to forge a write cap.". But this way of thinking discounts two important factors: chance of success and number of targets. If there are 2^40^ writecaps in use, then if an attacker does a mere 2^36^ work (excluding the cost of checking whether each writecap that they generate is already out there), then they have a 2^-20^ chance of forging *someone's* writecap. (Thanks to Daniel J. Bernstein's papers and mailing list postings for helping me understand this important point. <http://cr.yp.to/snuffle/bruteforce-20050425.pdf> ) However, a storage-index collision doesn't sound nearly as serious to me. No integrity or confidentiality is necessarily lost due to storage-index collision, right? Well, it could complicate the question of "which servers are handling which shares of this mutable file" -- an issue that is already not well managed. Anyway, nowadays I think of storage-indexes as a separate layer built on top of the crypto layer. People can define their storage indexes as secure hashes of some pieces of the capabilities if they want, or choose random storage indices, or hierarchical ones based on the DNS names, or just not have storage indices at all and require every downloader to query every server. It shouldn't impact the security of the crypto layer, if the crypto layer includes integrity checking using the verifycap itself on the storage-server side. I think we should write a document called something like "crypto failure modes (What could possibly go wrong?)" that explains what the consequences are of each different type of failure. (As requested by Nathan: <http://allmydata.org/pipermail/tahoe-dev/2009-April/001661.html> .) The one discussed above is caused by "two people choose the same write cap (signing key) seed (possibly due to malice)". That one leads to an integrity failure, where one of the people thinks that they are the only one with a write-cap to a specific file or directory, but actually someone else also has the same write-cap. So I think that is the worst one (because I value integrity highly). Another similar integrity failure could come about from a failure of the digital signature algorithm -- i.e. if someone were able to forge digital signatures even without the writecap. (Note that a collision in the hash function used by the digital signature algorithm could cause this. People who don't want to rely on collision-resistance of secure hash functions at *all* can't even rely on rsa, dsa, ssh, ssl, or gpg, although I should hasten to add that those algorithms typically include some randomness in the input to their secure hash function, to make it that much harder for attackers to cause collisions.) After those integrity failures, there are confidentiality failures. The obvious one is someone being able to crack AES-128-CTR without having the readkey. Another one is if the content-hash-key encryption where to generate the same readkey for two different immutable files. I suppose that's another reason why using an `added convergence secret` is safer (although I hasten to add that I see no reason to distrust the collision-resistance of SHA-256d-truncated-to-128-bits at the present). Note that of course in practice the dangers from bugs and from operator error (i.e. misplacing your keys) are a lot greater than these algorithmic crypto risks. So much greater that the former are pretty much guaranteed to happen and the latter will probably never. Nonetheless, I value getting the crypto part right so that it is secure and also so that everyone who is willing to rely on crypto in principle is willing to rely on our crypto, so thanks for you help with this.

P.S. Oh, what I wrote about people choosing whatever storage index they want is possibly confused because I'm conflating two ideas into the word "storage index". Let's talk about those two uses as "server selector" and "file identifier". The former is how you decide which servers you're going to use (either when uploading a file for the first time or when using a cap to find a file that's already up there). The latter is how you tell a storage server which of the shares that it has are the shares that you are currently interested in. I currently think that the verify cap is a good thing to use for the latter role, but perhaps not for the former.

P.S. Oh, what I wrote about people choosing whatever storage index they want is possibly confused because I'm conflating two ideas into the word "storage index". Let's talk about those two uses as "server selector" and "file identifier". The former is how you decide which servers you're going to use (either when uploading a file for the first time or when using a cap to find a file that's already up there). The latter is how you tell a storage server which of the shares that it has are the shares that you are currently interested in. I currently think that the verify cap is a good thing to use for the latter role, but perhaps not for the former.
Author

as before, I think I'd like to continue using "storage index" for what you're
calling the "file identifier", but yeah split out "server selector" or
"peer-selection index" or some similar term for the purpose of determining
which servers you're going to be talking to. One way of describing this would
be "we used to use the storage-index as the peer-selection index, but these
days we put two separate values in the filecap".

I am also starting to think of these as separate concepts, but remember that
we've yet to actually implement such a split.

Sylvan's concern was about availability: he considered a backup system to be
broken if its design has a built-in probability of file unrecoverability.
It's easier to see the problem if we set the encryption-key and hash lengths
to infinity, but restrict the storage index to say four bits. Then upload two
files, and try to download one of them.. you've got a 1/16 chance of getting
a download failure because your two files had the same storage-index, you
downloaded the wrong bits, and now they won't pass the integrity check.

Also, when we talk about this, we should be careful to distinguish between
the failure modes of mutable files versus immutable files.. they're very
distinct. And, collisions at different levels have very different
consequences: if the storage index is too small, we'll get availability
failures; if the immutable encryption key or mutable writekey is too small,
we'll get confidentiality failures. I've been assuming that we'll keep the
security parameters sufficiently large.. this ticket was specifically about
the availability concerns resulting from a too-small storage index.

If we compress the filecap by deriving the storage-index from the writekey,
then clearly we're limited by min(len(writekey),len(storage-index)).

Mostly I ticketed this issue because it's something I want to keep in mind as
we design the next revision of the filecap format. If we don't already have a
wiki page for it, I'll add one to organize our ideas.. I think they're
currently spread across half a dozen tickets.

I updated the table in the description: I think 192-bit caps will let us have
an effectively-infinite number of files (2^64^) with an effectively-zero
chance of collision (2^-65^).

[to fix trac markup]edited

as before, I think I'd like to continue using "storage index" for what you're calling the "file identifier", but yeah split out "server selector" or "peer-selection index" or some similar term for the purpose of determining which servers you're going to be talking to. One way of describing this would be "we used to use the storage-index as the peer-selection index, but these days we put two separate values in the filecap". I am also starting to think of these as separate concepts, but remember that we've yet to actually implement such a split. Sylvan's concern was about availability: he considered a backup system to be broken if its design has a built-in probability of file unrecoverability. It's easier to see the problem if we set the encryption-key and hash lengths to infinity, but restrict the storage index to say four bits. Then upload two files, and try to download one of them.. you've got a 1/16 chance of getting a download failure because your two files had the same storage-index, you downloaded the wrong bits, and now they won't pass the integrity check. Also, when we talk about this, we should be careful to distinguish between the failure modes of mutable files versus immutable files.. they're very distinct. And, collisions at different levels have very different consequences: if the storage index is too small, we'll get availability failures; if the immutable encryption key or mutable writekey is too small, we'll get confidentiality failures. I've been assuming that we'll keep the security parameters sufficiently large.. this ticket was specifically about the availability concerns resulting from a too-small storage index. If we compress the filecap by deriving the storage-index from the writekey, then clearly we're limited by `min(len(writekey),len(storage-index))`. Mostly I ticketed this issue because it's something I want to keep in mind as we design the next revision of the filecap format. If we don't already have a wiki page for it, I'll add one to organize our ideas.. I think they're currently spread across half a dozen tickets. I updated the table in the description: I think 192-bit caps will let us have an effectively-infinite number of files (2^64^) with an effectively-zero chance of collision (2^-65^). [to fix trac markup]edited

Wait, if you you have 4-bit storage indexes, and servers verify integrity of shares, then you don't get an unrecoverable file, you get a failure to upload the second one, right? I really like the idea of servers verifying the integrity of shares, because without that then even if you have a nice fat storage index, the first server that you upload to could quickly turn around and upload garbage to all the other servers, under the same storage index that you just told him.

A failure to upload (backup) is a much better problem to have than a failure to download (recover), as long as it can be detected, understood, and fixed by the operators. I.e. "Oh, we need to add servers", or "Oh, we need to start using longer storage indexes". :-)

Wait, if you you have 4-bit storage indexes, *and servers verify integrity of shares*, then you don't get an unrecoverable file, you get a failure to upload the second one, right? I really like the idea of servers verifying the integrity of shares, because without that then even if you have a nice fat storage index, the first server that you upload to could quickly turn around and upload garbage to all the other servers, under the same storage index that you just told him. A failure to upload (backup) is a much better problem to have than a failure to download (recover), as long as it can be detected, understood, and fixed by the operators. I.e. "Oh, we need to add servers", or "Oh, we need to start using longer storage indexes". :-)
Author

Well, "It Depends". Current servers do no checking of shares.

(I agree, I like server-side share-verification, except for the load and
complexity it adds to the storage servers)

Immutable Files

Current code has uploaders quietly+simplisticly believing servers who say "I
already have a share by that SI", without additional checking. (they don't
search very far for existing shares, but if the server list is the same as it
was for the first file, they'll get complete overlap). This results in the
second file appearing to upload correctly but being completely unrecoverable
(all shares for that SI are for the first file, so a download of the second
file will get shares that fail the hash checks and fail).

The uploader that we haven't written yet will probably search further for
existing shares (especially once it's seen evidence of one), and will do some
amount of verification of those shares (at the very least pulling down the
UEB to make sure it's for the same file). This uploader will see evidence of
an SI collision and probably fail the upload (with an exception that
basically says "please pick a different SI, perhaps a random one"). Unless,
of course, you're unlucky and wind up talking to a different servers during
the two uploads. The subsequent downloader will either succeed but notice a
lot of "corrupted" shares (if they can talk to the second uploader's servers)
or fail due to a lot of corrupted shares (if they can't).

If we further improve the servers to verify the integrity of immutable shares
upon receipt, and we change the protocol to split server-selection-index from
storage-index, and assume that the SI is required to be derived from the
verifycap (so that servers can validate the share all the way up to the SI),
then we're still in the same boat as the previous paragraph: servers who
don't yet have a share will accept either the first upload or the second
(since we're assuming a too-short SI, and these two files are colliding), and
if the second upload succeeded due to disjoint peersets, then second-file
downloaders will succeed or fail depending upon which servers they're able to
talk to.

Mutable Files

We get the same issues here: if the storage-index is too small, then there
will be multiple RSA keys which map to the same SI. A validating server which
has accepted a share for key1 will reject an update for key2 (because it's
looking at the full pubkey inside the share), but other servers might accept
key2 (if they don't already have a share).

Well, "It Depends". Current servers do no checking of shares. (I agree, I like server-side share-verification, except for the load and complexity it adds to the storage servers) ## Immutable Files Current code has uploaders quietly+simplisticly believing servers who say "I already have a share by that SI", without additional checking. (they don't search very far for existing shares, but if the server list is the same as it was for the first file, they'll get complete overlap). This results in the second file appearing to upload correctly but being completely unrecoverable (all shares for that SI are for the first file, so a download of the second file will get shares that fail the hash checks and fail). The uploader that we haven't written yet will probably search further for existing shares (especially once it's seen evidence of one), and will do some amount of verification of those shares (at the very least pulling down the UEB to make sure it's for the same file). This uploader will see evidence of an SI collision and probably fail the upload (with an exception that basically says "please pick a different SI, perhaps a random one"). Unless, of course, you're unlucky and wind up talking to a different servers during the two uploads. The subsequent downloader will either succeed but notice a lot of "corrupted" shares (if they can talk to the second uploader's servers) or fail due to a lot of corrupted shares (if they can't). If we further improve the servers to verify the integrity of immutable shares upon receipt, and we change the protocol to split server-selection-index from storage-index, and assume that the SI is required to be derived from the verifycap (so that servers can validate the share all the way up to the SI), then we're still in the same boat as the previous paragraph: servers who don't yet have a share will accept either the first upload or the second (since we're assuming a too-short SI, and these two files are colliding), and if the second upload succeeded due to disjoint peersets, then second-file downloaders will succeed or fail depending upon which servers they're able to talk to. ## Mutable Files We get the same issues here: if the storage-index is too small, then there will be multiple RSA keys which map to the same SI. A validating server which has accepted a share for key1 will reject an update for key2 (because it's looking at the full pubkey inside the share), but other servers might accept key2 (if they don't already have a share).
davidsarah commented 2009-10-28 04:13:23 +00:00
Owner

Tagging issues relevant to new cap protocol design.

Tagging issues relevant to new cap protocol design.

fixed superscripts

fixed superscripts
davidsarah commented 2011-07-28 21:51:53 +00:00
Owner

Use trac markup for superscripts instead of Unicode; it's usually rendered much more clearly, and doesn't depend on font support.

Use trac markup for superscripts instead of Unicode; it's usually rendered much more clearly, and doesn't depend on font support.
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#753
No description provided.