update webapi docs for distributed dirnodes #115

Closed
opened 2007-08-20 19:16:59 +00:00 by warner · 23 comments

Our current (temporary) approach puts all vdrive "directory node"
information into an encrypted data structure that lives on a specific server.
This was fairly easy to implement, but it lacks properties that we want: in
particular, that server is a single point of failure.

We want to improve the availability of dirnodes. There are a number of ways
to accomplish this, some cooler than others. One approach is to leave the
vdrive-server scheme in place but have multiple servers (each providing the
same TubID, using separate connection hints, or the usual sort of IP-based
load-balancer frontend box). This requires no change in code on the client
side, but puts a significant burden on the operators of the network: they
must run multiple machines.

A niftier approach would be to distribute the dirnode data in the same way we
distribute file data. This requires distributed mutable files (i.e. SSK
files), which will require a bunch of new code. It also opens up difficult
questions about synchronized updates when race conditions result in different
storage servers recording different versions of the directory.

The source:docs/dirnodes.txt file describes some of our goals and proposals.

warner added the
code
major
enhancement
0.4.0
labels 2007-08-20 19:16:59 +00:00
warner added this to the undecided milestone 2007-08-20 19:16:59 +00:00
Author

I'm starting to think that a reasonable solution is to distribute the data
with SSK files, but have an optional central-coordinator node.

Small grids who don't want any centralization just don't use the coordinator.
They run the risk of two people changing the same dirnode in incompatible
ways, in which case they have to revert to an earlier version or the like.
We'll need some tools to display the situation to the user, but not tools to
resolve it automatically.

Large grids who are willing to accept some centralization do use the
coordinator. Dirnode reads are still fully distributed and reliable; however,
the ability to modify a dirnode is contingent upon the coordinator being
available. In addition, dirnode-modification may be vulnerable to an attacker
who just claims the lock all day long (however we can probably rig this so
that only people with the dirnode's write-key can perform this attack, making
it a non-issue).

Each SSK could have the FURL of a coordinator in it, and clients who want to
change the SSK shares are supposed to first contact the coordinator and
obtain a temporary lock on the storage index. Then they're only supposed to
send the "SSK_UPDATE" message to the shareholders while they hold that lock.
The full sequence of events would look like:

  1. user provides desired change (add/rename/delete)
  2. see if change is applicable (can't delete non-existent file)
  3. do peer selection, compute list of likely SSK shareholders
  4. contact first shareholder, discover coordinator FURL
  5. contact coordinator, attempt to claim the lock
     • if unsuccessful, wait a random number of seconds, then repeat at step 2
  6. if successful, send SSK_UPDATE messages to all shareholders
  7. when all responses come back (or timeout?), release the lock

Clients who are moving a file from one dirnode to another are allowed to
claim multiple locks at once, as long as they drop all locks while they wait
to retry.

If the coordinator is unavailable, the clients can proceed to update anyways,
and just run the risk of conflicts.
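
To make the locking protocol concrete, here is a minimal client-side sketch
of the loop above. All of the names (claim_lock, release_lock,
send_ssk_update, is_applicable) are hypothetical placeholders, not existing
Tahoe APIs:

    import random
    import time

    def update_dirnode(change, coordinator, shareholders, max_tries=10):
        for _ in range(max_tries):
            # step 2: see if the change is still applicable
            if not change.is_applicable():
                raise ValueError("change no longer applicable")
            # steps 3-4 (peer selection, FURL discovery) elided here
            # step 5: attempt to claim the lock on the storage index
            lock = coordinator.claim_lock(change.storage_index)
            if lock is None:
                # unsuccessful: wait a random interval, then retry at step 2
                time.sleep(random.uniform(1, 10))
                continue
            try:
                # step 6: send SSK_UPDATE to all shareholders under the lock
                for server in shareholders:
                    server.send_ssk_update(change)
            finally:
                # step 7: release the lock once the responses are in
                coordinator.release_lock(lock)
            return
        raise RuntimeError("could not claim the lock")

A real client would also handle the coordinator being unreachable by
proceeding without the lock, per the previous paragraph.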

We have two current ideas about implementing SSKs. In the simplest form, we
store the same data on all shareholders (1-of-N encoding), and each
degenerate share has a sequence number. Downloaders look for the highest
sequence number they can find, and pick one of those shares at random.
Conflicts are expressed as two different shares with the same sequence
number.
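
The downloader's rule in that simple form might look like this (a sketch;
the share representation here is made up):

    import random

    class ConflictError(Exception):
        pass

    def pick_share(shares):
        # shares: list of (seqnum, data) pairs fetched from shareholders
        best = max(seqnum for (seqnum, _) in shares)
        candidates = [data for (seqnum, data) in shares if seqnum == best]
        if len(set(candidates)) > 1:
            # two different shares with the same sequence number: surface
            # the conflict to the user rather than resolving it automatically
            raise ConflictError("conflicting shares at seqnum %d" % best)
        return random.choice(candidates)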

In the more complex form, we continue to use k-of-N encoding, thus reducing
the amount of data stored on each host. In this form, it is important to add
a hash of the data (a hash of the crypttext is fine) to the version number,
because if there are conflicts, the client needs to make sure the k shares
they just pulled down are all for the same version (otherwise FEC will
produce complete garbage).
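
In other words, shares must be grouped by (sequence number, crypttext hash)
before FEC decoding, roughly like so (fec_decode is a hypothetical stand-in
for the real decoder):

    def choose_version(shares, k):
        # shares: list of (seqnum, crypttext_hash, sharenum, data) tuples
        by_version = {}
        for (seqnum, ct_hash, sharenum, data) in shares:
            by_version.setdefault((seqnum, ct_hash), {})[sharenum] = data
        # try versions newest-first; decode the first with at least k shares
        for version in sorted(by_version, reverse=True):
            if len(by_version[version]) >= k:
                return fec_decode(by_version[version], k)
        raise IOError("no version has k matching shares")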

Personally, I'm not convinced k-of-N SSK is a good idea, but we should
explore it fully before dismissing it.

Author

I'm working on a design for a large mutable versioned distributed SSK-style
data structure. This could be used for either mutable files or for mutable
dirnodes. It allows fairly efficient access (both read and write) of
arbitrary bytes, even inserts/deletes of byteranges, and lets you refer to
older versions of the file. The design is inspired by Mercurial's "revlog"
format.

In working on it, I realized that you want your dirnodes to have higher
reliability and availability than the files they contain. Specifically, you
don't want the availability of a file to be significantly impacted by the
unavailability of one of its parent directories. This implies that the root
dirnode should be the most reliable thing of all, followed by the
intermediate directories, followed by the file itself. For example, we might
require that the dirnodes be 20dBA better than whatever we pick for the CHK
files. One way to think about this: pretend we have a directory hierarchy
that is 10 deep, and a file at the bottom, like
/1/2/3/4/5/6/7/8/9/10/file.txt. Now if the file has 40dBA availability
(99.99%), that means that out of one million attempts to retrieve it, we'd
expect to see 100 failures. If each dirnode has 60dBA, then we'd expect to
see 110 failures: 10 failures because an intermediate dirnode was
unavailable, 100 because the CHK shares were unavailable.
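
That arithmetic, spelled out (reading "dBA" as decibels of availability,
i.e. -10*log10 of the failure probability):

    def p_failure(dba):
        return 10 ** (-dba / 10.0)

    depth = 10                        # /1/2/.../10/file.txt
    file_fail = p_failure(40)         # 1e-4 -> 100 failures per million
    dir_fail = depth * p_failure(60)  # 10 * 1e-6 -> 10 failures per million
    print(round(1e6 * (file_fail + dir_fail)))  # -> 110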

Given the same expansion factor and servers that are mostly available, FEC
gets you much, much better availability than simple replication. For
example, 1-of-3 encoding (i.e. 3x replication) for 99% available servers gets
you 60dBA (i.e. 99.9999%), but 3-of-9 encoding for 99% servers gets you about
125dBA. The reason is easy to visualize: start killing off servers one at a
time; how many can you kill before the file is dead? 1-of-3 is a loss once
you've killed off 3 servers, whereas 3-of-9 is ok until you've lost 7
servers. If we use 1-of-6 encoding (6x replication), we get about 120dBA,
comparable to 3-of-9.
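
Those figures can be checked with a little binomial arithmetic (a sketch,
not Tahoe code):

    import math

    def dba(k, n, p_up=0.99):
        # the file is lost when fewer than k of its n shareholders are up
        p_lost = sum(math.comb(n, i) * p_up**i * (1 - p_up)**(n - i)
                     for i in range(k))
        return -10 * math.log10(p_lost)

    print(dba(1, 3))  # ~60 dBA   (1-of-3, i.e. 3x replication)
    print(dba(3, 9))  # ~125 dBA
    print(dba(1, 6))  # ~120 dBA  (1-of-6, i.e. 6x replication)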

Anyways, the design I'm working on is complicated by FEC, and much simpler to
implement with straight replication. To get comparable availability, we need
to use more replication. So maybe dirnodes using this design should be
encoded with 1-of-5 or so.

Author

These will be implemented on top of Small Mutable Files (#197), which are mutable but replace-only.

zooko added this to the 0.7.0 milestone 2007-11-13 18:34:16 +00:00

As mentioned in #207:

* create new-style dirnode upon first boot instead of old-style one
* remove old dirnode code, replace with dirnode2 

These last two tasks were completed in changeset:3605354a952d8efd, but there are a few more things to do:

  • extend the POST command to enable upload of a file without linking it into a directory
  • put a form to do that on the front page, next to the form to download a file given only its URI ("cap")
  • better test coverage -- Brian has been rocking on this

Also to do for v0.7.0:

update the docs to describe the new kind of directories. I have "XXX change this" marked in a few places in the docs in my sandbox, but I haven't started writing replacement text yet.

Author

Things left to do for 0.7.0:

  • document POST /uri in webapi.txt (upload a file without attaching it to a directory)
  • add form to the welcome page to use POST /uri
  • document+test+implement POST /uri?t=mkdir (create a new unattached directory)
    • return new URI in response body
  • add form to the welcome page to use POST /uri?t=mkdir
    • adds a special kind of when_done flag that means "please redirect me
      to the directory page for the dirnode that I just created"

maybe for the future (post-0.7.0):

  • rename PUT into POST for certain things like t=mkdir
    • (for "functions that aren't methods", so to speak)
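
For illustration, the two POST /uri operations above might be exercised like
this (a hedged sketch: the node's port and the exact request encoding are
assumptions, to be pinned down in webapi.txt):

    import urllib.request

    BASE = "http://127.0.0.1:3456"  # hypothetical local webapi port

    # POST /uri : upload a file without linking it into any directory;
    # the response body is the new file's URI
    req = urllib.request.Request(BASE + "/uri", data=b"file contents",
                                 method="POST")
    file_uri = urllib.request.urlopen(req).read()

    # POST /uri?t=mkdir : create a new unattached directory;
    # the response body is the new directory's URI
    req = urllib.request.Request(BASE + "/uri?t=mkdir", data=b"",
                                 method="POST")
    dir_uri = urllib.request.urlopen(req).read()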

First priority is #231.

Then:

  • document+test+implement POST /uri?t=mkdir (create a new unattached directory)
  • return new URI in response body
  • add form to the welcome page to use POST /uri?t=mkdir
  • adds a special kind of when_done flag that means "please redirect me to the directory page for the dirnode that I just created"

Oh, insert #232 as top-priority, even above #231.

add:

  • if the client is configured to create no private directory, then do not put a link from the welcome page to the start.html page
  • if the client is configured to create a private directory, then put a note on the welcome page which says "private directory will be created once we are connected to X servers...", which note is replaced by a link to start.html after the private directory is created.

Finished the part about "If the client is configured to create no private directory, then do not put a link from the welcome page to the start.html page", in changeset:9848d2043df42bc3.

I bumped the part about showing the pending creation of the private directory into #234 -- "Nice UI for creation of private directory.".

#232 -- "peer selection doesn't rebalance shares on overwrite of mutable file" has been bumped out of Milestone 0.7.0 in favor of #233 -- "work-around the poor handling of weird server sets in v0.7.0".

Still to do in this ticket:

  • document+test+implement POST /uri?t=mkdir (create a new unattached directory)
    • return new URI in response body
  • add form to the welcome page to use POST /uri?t=mkdir
    • adds a special kind of when_done flag that means "please redirect me to the directory page for the dirnode that I just created"

changeset:50bc0d2fb34d2018 finishes test+implement POST /uri?t=mkdir, returning new URI (soon to be called "cap") in the response body

Still to do in this ticket:

  • document POST /uri?t=mkdir
  • add ?redirect_to_result=true flag to request an HTTP 303 See Other redirect to the resulting newly created directory
  • add a form to the welcome page to create a new directory and redirect to it
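
From a client's point of view, the proposed flag (its name is still
hypothetical until documented) would behave roughly like this:

    import urllib.request

    BASE = "http://127.0.0.1:3456"  # hypothetical local webapi port

    # without the flag, the new cap comes back in the response body
    req = urllib.request.Request(BASE + "/uri?t=mkdir", data=b"",
                                 method="POST")
    new_cap = urllib.request.urlopen(req).read()

    # with redirect_to_result=true, the server would instead answer
    # "303 See Other" with a Location header naming the new directory's
    # page; urlopen follows the redirect transparently
    req = urllib.request.Request(
        BASE + "/uri?t=mkdir&redirect_to_result=true", data=b"",
        method="POST")
    print(urllib.request.urlopen(req).geturl())  # .../uri/<new-dir-cap>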

So currently there is a POST /uri/?t=mkdir which works and has unit tests, but it encodes the arguments into the URL; it needs to switch to encoding the arguments into the request body, which is the standard for POSTs. There is also a button (a form) in my local sandbox, but that form produces POST requests with the arguments encoded into the body, so it doesn't work with the current implementation.
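
The two request styles differ like this (a sketch; an HTML form POST puts
its arguments in the body):

    import urllib.parse
    import urllib.request

    BASE = "http://127.0.0.1:3456"  # hypothetical local webapi port

    # current implementation: 't' is encoded in the URL's query string
    urllib.request.urlopen(urllib.request.Request(
        BASE + "/uri/?t=mkdir", data=b"", method="POST"))

    # what the form produces: 't' is form-encoded into the request body
    body = urllib.parse.urlencode({"t": "mkdir"}).encode()
    urllib.request.urlopen(urllib.request.Request(
        BASE + "/uri/", data=body, method="POST"))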

Author

I just pushed a change to make /uri look for the 't' argument in either the
queryargs or the form fields, using a utility function named get_arg() that
we could use to refactor other places that need args out of a request.

I think that "/uri" is the correct target of these commands. Note that
"/uri/" is a different place. Our current docs/webish.txt (section 1.g) says
that /uri?t=mkdir is the right place to do this, and the welcome page's form
(as rendered by Root.render_mkdir_form) winds up pointing at /uri, so I'm
going with "/uri" instead of "/uri/".

To that end, I've changed the redirection URL that /uri?t=mkdir creates to
match: this redirection is emitted by the /uri page, and therefore needs to
be to "uri/$URI" instead of just "$URI". (The latter works if we were hitting
/uri/?t=mkdir, but not when we hit /uri?t=mkdir).

I've also changed the unit test to exercise "/uri?t=mkdir" instead of
"/uri/?t=mkdir", and to examine the redirection that comes back to make sure
it is correct.
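
For reference, such a get_arg() might look roughly like this on a
twisted.web-style request object (the attribute names here are assumptions,
not necessarily what the pushed change uses):

    def get_arg(req, argname, default=None):
        # check the parsed query-string arguments first
        if argname in req.args:
            return req.args[argname][0]
        # then fall back to fields parsed out of a form-encoded body
        fields = getattr(req, "fields", None)
        if fields is not None and argname in fields:
            return fields[argname].value
        return default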

See #233 -- "creation and management of "root" directories -- directories without parents".

Still to do:

  • Document POST /uri?t=mkdir in webapi.txt [2].
  • Lots of other documentation updates, many of which Josh and I have in local sandboxes here at my mom's farm in New Mexico.

I'm going to do this webapi.txt update on the plane tomorrow.

putting off updating webapi until after this release

zooko changed title from distributed dirnodes to update webapi docs for distributed dirnodes 2008-01-23 02:26:55 +00:00
zooko added this to the undecided milestone 2008-01-23 02:29:05 +00:00

Brian: I think you might have finished this ticket.

zooko removed their assignment 2008-02-07 23:37:22 +00:00
warner was assigned by zooko 2008-02-07 23:37:22 +00:00
Author

yup, just pushing the final docs changes now.

warner added the
fixed
label 2008-02-08 02:15:12 +00:00