small mutable files #197

Closed
opened 2007-10-29 22:29:57 +00:00 by warner · 17 comments

Last week, zooko and I finished designing "Small Distributed Mutable Files".
The design notes are in source:docs/mutable.txt . Now it's time to implement
them.

Our hope is to have this in an 0.6.2 release within two weeks (since I'm
giving a talk on 9-Nov and it'd be nice to have it ready by then), but if
that doesn't work out, the goal is for the 0.7 release.

Here's the task list, so we can claim different pieces without colliding too
much. Please edit this summary when you claim a piece.

  • adding RSA to allmydata.Crypto
    • not sure quite how yet: see #11
  • backend: new methods in IStorageServer to allocate/locate/read/write
    mutable buckets, MutableBucketWriter class to implement the server-side
    container format (see the interface sketch below) -CLAIMED BY warner,
    80% done-
  • client-side SDMF slot format wrangling, given a chunk of data and the
    right keys, generate the slot contents
  • client-side peer selection: walk through peers, find existing shares,
    decide upon which shares go where
    • recovery algorithm
  • client-side filenode class: API should have replace method.
  • client-side dirnode class -CLAIMED BY warner-

Distributed Dirnodes (#115) will be implemented on top of this.
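
The backend item above mentions new IStorageServer methods for mutable
buckets; here is a minimal sketch of what that slice of the interface could
look like. The method names, argument shapes, and the test-and-set semantics
are assumptions for illustration only; the authoritative design is whatever
source:docs/mutable.txt and the eventual interfaces.py specify.

```python
from zope.interface import Interface

class IMutableSlotStorageServer(Interface):
    """Hypothetical mutable-slot slice of IStorageServer (sketch only)."""

    def slot_readv(storage_index, shnums, readv):
        """Read from shares of a mutable slot.

        'readv' is a list of (offset, length) pairs.  Returns a dict mapping
        shnum -> list of data strings, one per read vector.
        """

    def slot_writev(storage_index, write_enabler, test_and_write_vectors,
                    new_length):
        """Test-and-set write to a mutable slot.

        'test_and_write_vectors' maps shnum -> (test_vector, write_vector);
        the writes are applied only if every test vector matches, which is
        how a writer notices a colliding write.  Returns (success, the
        current data at the tested offsets).
        """
```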

warner added the
code-encoding
major
enhancement
0.6.1
labels 2007-10-29 22:29:57 +00:00
warner added this to the 0.7.0 milestone 2007-10-29 22:29:57 +00:00
warner self-assigned this 2007-10-29 22:29:57 +00:00
Author

I'll start with the backend. We're putting off RSA for a day to give zooko some time to look at alternatives to pycrypto

Author

got most of the server-side slots in place. Needs more test coverage to make sure we don't corrupt shares on resize, etc. Also leases need a lot more test coverage.

Tomorrow I'll start looking at the layers above that.


This is the definitive feature for v0.7.0 release.

If we get this running in time for the next public presentation of Tahoe (which Brian will be giving), then we'll release v0.7. If we don't, then we'll release v0.6.2.

Oh, except that there might be a backwards-incompatibility due to sha-256. I haven't yet determined if our use of sha-256 would run afoul of that bug. But anyway, if there is a backwards-incompatibility then we have to bump the minor version number, so then it will be v0.7 regardless.

zooko added
blocker
and removed
major
labels 2007-10-31 15:26:34 +00:00

This will fix [#187 security flaw: directory server can escalate read access into write access]. Merged #187 into this ticket.


Please see comment:/tahoe-lafs/trac-2024-07-25/issues/5164:9, the end of which mentions that perhaps URI isn't a good place for the "dir-or-file" type.

Author

Whew. Two weeks of intense design, followed by two weeks of very intense
coding, and Tahoe now has a shiny new mutable file implementation!

It isn't quite ready yet.. there are about two days of integration work left.
So no release this week, I'm afraid. But early next week.

The integration tasks left:

  • the Client class is where upload/download/create methods should live.
    That means client.upload(target), client.download_to(target),
    client.create_dirnode(), client.get_node_for_uri(uri).
    Direct access to the uploader/downloader should go away, and everything
    should go through the Client. We might want to move this to a separate
    object (or at least a separate Interface), but everything that accesses
    the grid layer should go through the same object (see the sketch after
    this list).
  • the client should create a new-style dirnode upon first boot instead of
    an old-style one
  • the old dirnode code should be removed, along with the vdrive client-side
    code and the vdrive server (and the vdrive.furl config file)
  • dirnode2.py should replace dirnode.py
  • URIs for the new mutable filenodes and dirnodes are a bit goofy-looking.
    URI:DIR2:... . I suggest we leave them as they are for this release
    but set a task for the next release to go through and clean up all URIs
    (also settle on shorter hash lengths to make the URIs smaller, and
    probably also start storing binary representations of the read/write caps
    in dirnodes rather than printable URI strings)
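
The Client-facade item above names four methods; here is a minimal sketch of
such a facade. Only the method names come from the comment: the collaborator
objects (uploader, downloader, nodemaker) and their methods are hypothetical
placeholders, and download_to is shown taking the URI as well, which the
comment elides.

```python
class Client:
    """Single front door to the grid layer (sketch only)."""

    def __init__(self, uploader, downloader, nodemaker):
        # hypothetical collaborators; the real client would wire these up
        self._uploader = uploader
        self._downloader = downloader
        self._nodemaker = nodemaker

    def upload(self, target):
        # returns a Deferred that fires with the URI of the uploaded file
        return self._uploader.upload(target)

    def download_to(self, uri, target):
        # returns a Deferred that fires once 'target' has all the data
        return self._downloader.download_to(uri, target)

    def create_dirnode(self):
        # returns a Deferred that fires with a new mutable dirnode
        return self._nodemaker.create_new_mutable_directory()

    def get_node_for_uri(self, uri):
        # wraps an existing URI (file or directory, mutable or immutable)
        # in the appropriate node class
        return self._nodemaker.create_from_uri(uri)
```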

Testing tasks left around this stuff:

  • the solaris buildslave is not actually testing new code (#145), so we need
    to know that pycryptopp is working on solaris before release
  • docs/README about the new dependency on libcrypto++
  • .debs for pycryptopp so we can install on the test grid machines

Mutable file tasks, security vulnerabilities, not for this release but soon
thereafter:

  • check the share hash chain (Retrieve._validate_share_and_extract_data). At
    the moment we check the block hash tree but not the share hash chain, so
    byzantine storage servers could trivially corrupt our data.
  • rollback-attacks: we chose a policy of "first retrievable version wins"
    on download, but for small grids and large expansion factors (i.e. small
    values of k) this makes it awfully easy for a single out-of-date server to
    effectively perform a rollback attack against you. I think we should
    define some parameter epsilon and use the highest-seqnum'ed retrievable
    version from k+epsilon servers (see the sketch after this list).
  • consider verifying the signature in Publish._got_query_results. It's
    slightly expensive, so I don't want to do it without thinking it through.
    It would help prevent DoS attacks where a server makes us think that
    there's a colliding write taking place.
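
The rollback item above proposes querying at least k+epsilon servers before
choosing a version. A minimal sketch of that rule, assuming a hypothetical
dict of verified answers (peerid -> (seqnum, verinfo)) and a helper that says
whether a given version is already retrievable:

```python
def choose_version(answers, k, epsilon, is_retrievable):
    """answers: dict peerid -> (seqnum, verinfo), verified responses only.
    is_retrievable(verinfo): True if we already hold >= k shares of it.
    Returns the verinfo to download, or None to keep querying."""
    if len(answers) < k + epsilon:
        # not enough evidence yet: with fewer answers, a single stale
        # server could roll us back to an old version
        return None
    # prefer the newest sequence number among recoverable versions
    for seqnum, verinfo in sorted(answers.values(),
                                  key=lambda sv: sv[0], reverse=True):
        if is_retrievable(verinfo):
            return verinfo
    return None
```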

Mutable file tasks, lower priority:

  • analyze control flow to count the round trips. I was hoping we could get
    an update done in just one RTT but at the moment it's more like 3 or 4.
    It's much more challenging than I originally thought.
  • try to always push a share (perhaps N+1) to ourselves, so we'll have the
    private key around. It would be sad to have a directory that had contents
    that were unrecoverable but which we could no longer modify because we
    couldn't get the privkey anymore.
  • choose one likely server (specifically ourselves) during publish to
    use to fetch our encprivkey. This means doing an extra readv (or perhaps
    just an extra-large readv) for that one server in _query_peers: the rest
    can use pretty small reads, like 1000 bytes. This ought to save us a
    round-trip.
  • error-handling. peers throwing random remote exceptions should not cause
    our publish to fail unless it's for NotEnoughPeersError.
  • the notion of "container size" in the mutable-slot storage API is pretty
    fuzzy. One idea was to allow read vectors to refer to the end of the
    segment (like python string slices using negative index values), for which
    we'd need a well-defined container size. I'm not sure this is actually
    useful for anything, though. (maybe grabbing the encrypted privkey, since
    it's always at the end?). Probably not useful until MDMF where you'd want
    to grab the encprivkey without having to grab the whole share too.
  • tests, tests, tests. There are LOTS of corner cases that I want coverage
    on. The easy ones are what download does in the face of out-of-date
    servers. The hard ones are what upload does in the face of simultaneous
    writers.
  • write-time collision recovery. We designed an algorithm (in
    docs/mutable.txt) to handle this well. There is a place for it to be
    implemented in allmydata.mutable.Publish._maybe_recover .
  • Publish peer selection: rebalance shares on each publish, by noticing when
    there are multiple shares on a single peer and also unused peers in the
    permuted list. The idea is that shares created on a small grid should
    automatically spread out when updated after the grid has grown (see the
    sketch after this list).
  • RSA key generation takes an unfortunately long time (between 0.8 and 3.2
    seconds in my casual tests). This will make a RW deepcopy of a large
    directory structure pretty slow. We should do some benchmarking of this
    thing to determine key size / speed tradeoffs, and maybe at some point
    consider ECC if it could be faster.
  • code terminology: share vs slot vs container, "SSK" vs mutable file vs
    slot. We need to nail down the meanings of some of these and clean up
    the code to match.
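
The rebalance-on-publish item above can be sketched as follows, assuming a
hypothetical share map (shnum -> peerid) and a permuted peer list; the real
Publish peer-selection code will look different:

```python
def rebalance(sharemap, permuted_peers):
    """Return {shnum: new_peerid} moves that spread doubled-up shares onto
    peers that currently hold nothing."""
    loads = {}
    for shnum, peerid in sharemap.items():
        loads.setdefault(peerid, []).append(shnum)
    empty_peers = [p for p in permuted_peers if p not in loads]
    moves = {}
    for peerid, shnums in loads.items():
        # keep one share where it is, move the extras to unused peers
        for shnum in shnums[1:]:
            if not empty_peers:
                return moves
            moves[shnum] = empty_peers.pop(0)
    return moves
```
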
Author

If we provide web UI access to mutable files (as opposed to merely mutable
directories), say, so a certain developer could put an HTML presentation
together without being limited to a strict tree structure in the HREFs, I
think it should look like the following:

  • the "Upload File" button either acquires a "Mutable?" checkbox, or there
    are two separate buttons, one for immutable, one for mutable.

  • all entries in the directory listing that are mutable files should provide
    a way to upload new contents: either a (choose-file, "Replace"-button)
    pair, or a button labeled "Replace.." that takes you to a separate page
    from which you can choose a file and hit the real "Replace" button.

Author

This is an issue for a different ticket, but I'd also like to see our dirnode
URIs changed to be a clear wrapper around a filenode URI. When the internal
filenode URI is a mutable file (RW or RO), the dirnode is mutable (RW or RO).
When the internal filenode URI is an immutable CHK file, that dirnode is
immutable. When we implement the deep-copy facility, it will have an option
to either create a new tree of mutable RW dirnodes, or a tree of immutable
dirnodes, and the latter can be done much more efficiently by creating CHK
files (no expensive RSA key generation) instead of mutable files.
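
To make the wrapper idea concrete: a dirnode cap would be nothing more than a
directory marker around the inner filenode cap, so the dirnode's mutability is
read straight off the inner cap. The prefixes and the literal string
concatenation below are assumptions for illustration, not the real URI format
(which, as noted above, is due for a cleanup anyway):

```python
DIR_MARKER = "URI:DIR2:"

def make_dirnode_uri(filenode_uri):
    # the dirnode cap is just the filenode cap plus a directory marker
    return DIR_MARKER + filenode_uri

def unwrap_dirnode_uri(dirnode_uri):
    if not dirnode_uri.startswith(DIR_MARKER):
        raise ValueError("not a dirnode URI")
    return dirnode_uri[len(DIR_MARKER):]

def dirnode_is_mutable(dirnode_uri):
    # hypothetical inner-cap prefixes: SSK caps are mutable, CHK caps are not
    inner = unwrap_dirnode_uri(dirnode_uri)
    return inner.startswith("URI:SSK:") or inner.startswith("URI:SSK-RO:")
```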

Author

I mentioned this one above:

* consider verifying the signature in Publish._got_query_results. It's
  slightly expensive, so I don't want to do it without thinking it
  through. It would help prevent DoS attacks where a server makes us
  think that there's a colliding write taking place.

I've decided we should verify it here; the cryptopp benchmarks show that my
Mac at home takes 560 microseconds per RSA2048 verification, so adding an
extra 10 checks will only add 5.6ms to the publish time, which is small
compared to the 42ms that the same computer will take to do the RSA2048
signature needed to publish the new contents.
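
For anyone who wants to reproduce that comparison locally, a rough timing
sketch follows. It assumes pycryptopp's RSA API as Tahoe uses it
(rsa.generate(sizeinbits=...), SigningKey.sign(), get_verifying_key(),
VerifyingKey.verify(msg, sig)); if your pycryptopp differs, adjust the calls.

```python
import time
from pycryptopp.publickey import rsa

def bench(func, n=50):
    # average wall-clock seconds per call over n calls
    start = time.time()
    for _ in range(n):
        func()
    return (time.time() - start) / n

sk = rsa.generate(sizeinbits=2048)
vk = sk.get_verifying_key()
msg = "x" * 1000
sig = sk.sign(msg)

print("sign:   %.2f ms" % (bench(lambda: sk.sign(msg)) * 1000))
print("verify: %.3f ms" % (bench(lambda: vk.verify(msg, sig)) * 1000))
```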

Author

I went ahead and pushed the verify-signatures-in-Publish._got_query_results change. It still needs testing, of course, just like all the other failure cases.


I've created ticket #207 to hold all the parts of this task that can be deferred until v0.7.1.

Here are the remaining things that I think we need to fix to close this ticket:

  • There is currently [a test failure on solaris](http://allmydata.org/buildbot/builders/solaris/builds/91/steps/test/logs/test.log). I also get a (different) unit test failure from the foolscap tests on that machine ([foolscap #31](http://foolscap.lothar.com/trac/ticket/31)).
  • check the share hash chain (Retrieve._validate_share_and_extract_data). At the moment we check the block hash tree but not the share hash chain, so byzantine storage servers could trivially corrupt our data.
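
For reference, "check the share hash chain" boils down to walking a Merkle
path: hash the share's block-hash-tree root, combine it with the sibling
hashes in the chain, and confirm the result equals the share-hash-tree root
covered by the signature. The real code lives in allmydata's hashtree module
and Retrieve, and uses tagged SHA-256d hashes, so the plain SHA-256 below is a
simplifying assumption:

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def check_share_hash_chain(leaf_hash, chain, signed_root):
    """chain: list of (sibling_is_on_left, sibling_hash) from leaf to root.
    Returns True if the chain connects leaf_hash to signed_root."""
    node = leaf_hash
    for sibling_is_on_left, sibling in chain:
        if sibling_is_on_left:
            node = h(sibling + node)
        else:
            node = h(node + sibling)
    return node == signed_root
```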

The test failure on solaris had to do with old, incompatible versions of packages being loaded from the system instead of the current versions of the packages. I removed the old packages from the host, so that problem is currently not happening (although we would still like to fix it so that the unit tests do not load packages from the system -- #145).


I believe Brian implemented "check the share hash chain" in changeset:8ba10d0155cddeb3, but I'll leave it to him to close this ticket.

Author

nope, not yet. I added a test which provokes a failure in a single share, to see if download succeeds anyways, but that test does not specifically check to make sure that the corrupted share is detected. There is not yet code to check this hash chain. Maybe today, more likely late tomorrow.

Author

Ok, I just pushed the validate-the-share_hash_chain fix in changeset:2eeac5cff8baf05d, so this check is now being done.
I did an eyeball test, but we still need unit tests (but not for 0.7.0).

#207 has the remaining mutable file issues.

I will close this one shortly.. there's an OS-X buildbot failure in the test I just pushed, which I need to investigate before I'll consider this part done.

Author

it looks like the code which is supposed to fall back to other shares when a corrupt share is detected is not working. The lack of deterministic tests (included in #207) makes this a random failure rather than an easily repeatable one.

still investigating..

Author

Ok, the last issue is resolved, and getting multiple shares from a single server (in which one is bad and the other good) no longer causes the good shares to be ignored. The remainder of the mutable file issues (for release after this one) are in #207, I'm closing this one out.

warner added the
fixed
label 2007-11-17 00:27:04 +00:00