maybe add share-metadata: "where-are-the-other-shares" hints #599

Open
opened 2009-01-28 23:18:56 +00:00 by warner · 1 comment

An idea that we've kicked around before came back to me today, in a different
form. What if each share (living on some server somewhere), in addition to
the data necessary to recover the file (blocks and hashes and signatures),
also contained hints about the locations of other shares? These hints could
take the form of publically-visible FURLs of the storage servers that are
holding the other shares.

These "hints" would not be authoritative, because shares may have been
deleted, copied, or regenerated without necessarily updating all the other
shares. But, assuming that shares tend to stick around for a while, these
hints could provide a high-probability way to locate all the other shares.

The basic addition would be an extra method on the WriteBucket object,
something like set_share_location_hints, which accepts a list of FURLs,
and stores it next to the share. The retrieve side would be an extra return
argument to the get_buckets call, to return the list of hints.

The download algorithm (the new state-machine oriented one that we're
thinking of building, to help with #287 and #193) would be changed. The first
phase of the download process is responsible for acquiring ReadBucket
objects, each of which connects to some remote server. This phase would be
changed to use the regular "Tahoe-2" peer-selection algorithm to find a batch
of servers (perhaps 5 servers) to whom get_buckets messages will be
sent in parallel. Each time a response comes back, a new message will be sent
to the next unasked peer in the permuted list. But if a response includes
hints about the locations of other shares, then all of those hinted servers
will be asked. The download process can begin once at least "k" shares have
been located.

For mutable files, this increases the chance that we'll locate all shares of
the file, which helps reduce the danger of accidental-rollback.

The hints can contain FURLs, but in general only the tubid portion will be
used. The client will look in its list of connections to see if it has any
which connect to the same tubid as in the hint. This minimizes the importance
of servers maintaining their same IP address and port number for long periods
of time. If the client hasn't heard about the tubid from the Introducer, it
could conceivably try to connect to the given FURL anyways. This would cause
servers which have left the grid (or merely been unable to connect to the
introducer recently) to still be used, however it would also make it slightly
easier to cause mischief by publishing FURLs that point to port 25, etc.

For maximum benefit, hints should be updated when the file is repaired, but
we must be careful to define the necessary authority correctly. For mutable
files, the authority to modify the share is sufficient: anyone who can
clobber the file is also allowed to clobber the hints. For immutable files,
the question is more difficult. One possibility is that the repairer submits
potential hints to the servers that hold old shares, and the server validates
the hint itself before committing them to the share's metadata. The server
would do this by connecting to the FURL in question and performing the same
get_buckets that a downloading client would do. The server could then
randomly check a few segments against their block hash trees, and make sure
the remote share hash fits into its local share hash tree. This verification
process is made marginally easier by the fact that the verifying server
already has a copy of the share hash tree.

This last feature (servers checking up on each other) is reminiscent of the
original "Tahoe 1" design, in which each file had a cabal of servers holding
its shares, and the members of this cabal kept track of each other
(validating each others shares, repairing the file when necessary, recruiting
new members if too many servers dropped out). We abandoned that design
because it required each server to keep track of too much information, but if
the location-of-other-shares list is treated merely as a hint, then perhaps
it wouldn't be too bad.

An idea that we've kicked around before came back to me today, in a different form. What if each share (living on some server somewhere), in addition to the data necessary to recover the file (blocks and hashes and signatures), also contained hints about the locations of other shares? These hints could take the form of publically-visible FURLs of the storage servers that are holding the other shares. These "hints" would not be authoritative, because shares may have been deleted, copied, or regenerated without necessarily updating all the other shares. But, assuming that shares tend to stick around for a while, these hints could provide a high-probability way to locate all the other shares. The basic addition would be an extra method on the `WriteBucket` object, something like `set_share_location_hints`, which accepts a list of FURLs, and stores it next to the share. The retrieve side would be an extra return argument to the `get_buckets` call, to return the list of hints. The download algorithm (the new state-machine oriented one that we're thinking of building, to help with #287 and #193) would be changed. The first phase of the download process is responsible for acquiring `ReadBucket` objects, each of which connects to some remote server. This phase would be changed to use the regular "Tahoe-2" peer-selection algorithm to find a batch of servers (perhaps 5 servers) to whom `get_buckets` messages will be sent in parallel. Each time a response comes back, a new message will be sent to the next unasked peer in the permuted list. But if a response includes hints about the locations of other shares, then all of those hinted servers will be asked. The download process can begin once at least "k" shares have been located. For mutable files, this increases the chance that we'll locate all shares of the file, which helps reduce the danger of accidental-rollback. The hints can contain FURLs, but in general only the tubid portion will be used. The client will look in its list of connections to see if it has any which connect to the same tubid as in the hint. This minimizes the importance of servers maintaining their same IP address and port number for long periods of time. If the client hasn't heard about the tubid from the Introducer, it could conceivably try to connect to the given FURL anyways. This would cause servers which have left the grid (or merely been unable to connect to the introducer recently) to still be used, however it would also make it slightly easier to cause mischief by publishing FURLs that point to port 25, etc. For maximum benefit, hints should be updated when the file is repaired, but we must be careful to define the necessary authority correctly. For mutable files, the authority to modify the share is sufficient: anyone who can clobber the file is also allowed to clobber the hints. For immutable files, the question is more difficult. One possibility is that the repairer submits potential hints to the servers that hold old shares, and the server validates the hint itself before committing them to the share's metadata. The server would do this by connecting to the FURL in question and performing the same `get_buckets` that a downloading client would do. The server could then randomly check a few segments against their block hash trees, and make sure the remote share hash fits into its local share hash tree. This verification process is made marginally easier by the fact that the verifying server already has a copy of the share hash tree. This last feature (servers checking up on each other) is reminiscent of the original "Tahoe 1" design, in which each file had a cabal of servers holding its shares, and the members of this cabal kept track of each other (validating each others shares, repairing the file when necessary, recruiting new members if too many servers dropped out). We abandoned that design because it required each server to keep track of too much information, but if the location-of-other-shares list is treated merely as a hint, then perhaps it wouldn't be too bad.
warner added the
code-storage
major
enhancement
1.2.0
labels 2009-01-28 23:18:56 +00:00
warner added this to the undecided milestone 2009-01-28 23:18:56 +00:00
Author

I was thinking more about this last night, in the context of #872, #447,
#573, and other enhanced peer-selection goals. Many of these tickets express
a desire to improve certain properties of the distributed files, typically
reliability (by improving geographical diversity), reduced repair cost
(getting k+1 shares into each datacenter), increased download speed (placing
several shares near each downloader), and increased fairness of reliability
over time (by trying to fill all servers at the same time instead of at the
same rate, flattening out the entropy(peer-selection-process)-vs-time curve
as described in comment🎫872:4).

All of these goals are hard to accomplish right now, because we want small
fixed readcaps for each file, and that doesn't leave the downloader with a
lot of information available to do the matching peer-selection algorithm. The
fact that readcaps are immutable means that the repairer/rebalancer (which
need to add shares to new places, and sometimes remove them from old ones)
are changing the serverlist, and yet the downloader must be able to find the
new shares efficiently using the old (fixed) readcap. Even if the file isn't
repaired and all shares exist in the same places, the system needs to be
tolerant of server churn.

The Tahoe2 selection algorithm provides small fixed readcaps, handles share
movement due to repair (assuming the repairer uses Tahoe2 too), and has
reasonable tolerance of server churn (adding 10% new servers will increase
the downloader's search distance by an average of 10%). And it provides
fill-at-same-rate load-balancing behavior, which is easy to understand and
happens to give you fill-at-same-time behavior if you have homogeneous
servers. But it is hard to accomodate any of the other goals contained in the
tickets above, or to trade off some goals against others.

"All problems in computer science can be solved by introducing another layer
of abstraction". In the context of this ticket, we could accomplish more of
these goals simultaneously by introducing an intermediate
where-are-the-shares table. This table would contain a mutable mapping from
storage index to:

  • verifycap
  • share locations: list of (serverid, confidence level)
  • other serverlist locations: list of serverids

The serverid strings would merely need to be short prefixes of the full
serverids, perhaps as little as 4 bytes, since they would be expanded by
looking for matches in the full serverlist, and infrequent collisions would
merely increase the search distance slightly. (this approach, rather than
including a full FURL, trades off space against continued dependence on an
Introducer and/or a notion of enumerable grid membership).

Each storage server would offer access to "location slots" in this table,
just like they currently offer mutable-share slots. However, the slots
contain plaintext, and would use have writecaps and readcaps. Instead, the
remote operations would look like:

  • allocate location slot
  • later, set storage index and verifycap (once only)
  • propose locations (serverid1, serverid2, ...)
  • propose removal (serverid1, serverid2, ...)
  • fetch share location list
  • fetch other serverlist location list

The storage servers would be responsible for confirming the presence or
absence of shares themselves. To this end, each table entry has a "confidence
level", which the server uses to keep track of how much validation it has
performed. When a "propose location" message arrives, the serverids are added
with the lowest confidence level (0). The server sends do-you-have-share
messages to those servers, and if they claim "YES", the level is raised to 1.
Later, when bandwidth allows, the server can fetch a couple of blocks at
random, and the hash chains, and make sure everything verifies up to the
locally-held verifycap, and then raises the level to 2. Eventually, the
server can fetch and verify the entire share, and raise the level to 3.

Upon receipt of a "propose removal" message, the server marks the entry as
"possibly removed", and queries the indicated server directly. If and only if
the server response to the DYHB with a "NO", then the entry is actually
removed.

The server would need to run a service to manage this verification process,
and to allocate a certain amount of bandwidth to it. This manager should also
query the other serverlist holders, to update each other on the current share
locations.

The idea is that shares would be uploaded to completely arbitrary locations,
to accomplish whatever placement goals were desired (load balancing, download
speed, datacenter diversity, etc). Then the location slots would be allocated
on well-known servers (using the normal Tahoe2 algorithm), and filled with
the locations of the shares. The location slot holders would tell each other
about the shares to flood this information out to all slots.

A downloader would use Tahoe2 to find at least one location slot, read the
serverlist out of it, and then query those servers for shares. They could
read from multiple location slots if they somehow got out of sync. The likely
implementation would be to send "fetch share location list" messages in
parallel to the first N servers in the permuted list, and upon receipt of the
first positive response, send DYHB messages in parallel to everything it
mentions.

After the repairer uploaded new shares, it would locate the location slots
and inform them about the new share. Through flooding, eventually all
location slots would learn about the new share. When a share is removed from
a server (via lease expiration, or mutable rebalancing), the "propose
removal" message would be sent, and the location slot would confirm.

If the repairer notices that there are too few location slots, it can
allocate and fill new ones. It would be appropriate to have "N" location
slots. Since we're using 1-of-N encoding here (pure replication), the
reliability of the location slots should be much higher than the reliability
of the target file, and the extra layer of indirection should not
significantly reduce reliability.

Having the location slot servers perform their own verification removes the
need for an extra layer of writecaps (which would be needed by the repairer
to update the location list, making it harder to have storage servers drive
the repair process themselves). However it does increase their workload
significantly, making this look more like the old Tahoe1 scheme as mentioned
above. On the other hand, this could provide the basis for server-driven
repair, where their periodic check on each other could provoke repair of
unhealthy files without client involvement.

What will prevent us from wanting to manipulate the choice of location-slot
holders in the same way that those tickets want to manipulate the choice of
share holders? Location slots are very small, so we'd be unlikely to want to
balance the storage load as we do with (large) shares. They'd use 1-of-N
encoding, so certain performance and reliability concerns are reduced. But
for the others, we'd just have to hope that providing arbitrary location of
shares is enough to get folks to tolerate these fixed location of
location-slot holders.

There would be a nontrivial performance hit to use location-slots. Uploaders
would require one extra RTT, because the "propose location" messages could
not be sent until all shares had been closed. Servers would spend
considerable bandwidth and CPU checking up on each others shares. And
downloaders would incur an extra RTT to query the location-slot holders in
addition to the existing DYHB queries, however the DYHB queries would be
highly likely to succeed, making the combination potentially faster in
certain not-ideally-Tahoe2-distributed server selection schemes.

Other ideas:

  • use a limited subset of servers for the location slots (i.e. nodes might
    offer share-storage, or location-slot-storage, but not both)
  • alternatively, try to put the location-slots on the same servers as the
    shares, so they share fate and we avoid the reliability hit

Anyways, this feels like a lot of moving parts (and the server load worries
me), but a layer like this would let us use arbitrary server-selection
policies. It would also help us scale to millions of storage nodes, if the
Tahoe2 permutation and frequent queries only needed to be done on a few
hundred location-slot servers instead of the entire set of storage servers.

I was thinking more about this last night, in the context of #872, #447, #573, and other enhanced peer-selection goals. Many of these tickets express a desire to improve certain properties of the distributed files, typically reliability (by improving geographical diversity), reduced repair cost (getting k+1 shares into each datacenter), increased download speed (placing several shares near each downloader), and increased fairness of reliability over time (by trying to fill all servers at the same time instead of at the same rate, flattening out the entropy(peer-selection-process)-vs-time curve as described in comment:ticket:872:4). All of these goals are hard to accomplish right now, because we want small fixed readcaps for each file, and that doesn't leave the downloader with a lot of information available to do the matching peer-selection algorithm. The fact that readcaps are immutable means that the repairer/rebalancer (which need to add shares to new places, and sometimes remove them from old ones) are changing the serverlist, and yet the downloader must be able to find the new shares efficiently using the old (fixed) readcap. Even if the file isn't repaired and all shares exist in the same places, the system needs to be tolerant of server churn. The Tahoe2 selection algorithm provides small fixed readcaps, handles share movement due to repair (assuming the repairer uses Tahoe2 too), and has reasonable tolerance of server churn (adding 10% new servers will increase the downloader's search distance by an average of 10%). And it provides fill-at-same-rate load-balancing behavior, which is easy to understand and happens to give you fill-at-same-time behavior if you have homogeneous servers. But it is hard to accomodate any of the other goals contained in the tickets above, or to trade off some goals against others. "All problems in computer science can be solved by introducing another layer of abstraction". In the context of this ticket, we could accomplish more of these goals simultaneously by introducing an intermediate where-are-the-shares table. This table would contain a mutable mapping from storage index to: * verifycap * share locations: list of (serverid, confidence level) * other serverlist locations: list of serverids The serverid strings would merely need to be short prefixes of the full serverids, perhaps as little as 4 bytes, since they would be expanded by looking for matches in the full serverlist, and infrequent collisions would merely increase the search distance slightly. (this approach, rather than including a full FURL, trades off space against continued dependence on an Introducer and/or a notion of enumerable grid membership). Each storage server would offer access to "location slots" in this table, just like they currently offer mutable-share slots. However, the slots contain plaintext, and would use have writecaps and readcaps. Instead, the remote operations would look like: * allocate location slot * later, set storage index and verifycap (once only) * propose locations (serverid1, serverid2, ...) * propose removal (serverid1, serverid2, ...) * fetch share location list * fetch other serverlist location list The storage servers would be responsible for confirming the presence or absence of shares themselves. To this end, each table entry has a "confidence level", which the server uses to keep track of how much validation it has performed. When a "propose location" message arrives, the serverids are added with the lowest confidence level (0). The server sends do-you-have-share messages to those servers, and if they claim "YES", the level is raised to 1. Later, when bandwidth allows, the server can fetch a couple of blocks at random, and the hash chains, and make sure everything verifies up to the locally-held verifycap, and then raises the level to 2. Eventually, the server can fetch and verify the entire share, and raise the level to 3. Upon receipt of a "propose removal" message, the server marks the entry as "possibly removed", and queries the indicated server directly. If and only if the server response to the DYHB with a "NO", then the entry is actually removed. The server would need to run a service to manage this verification process, and to allocate a certain amount of bandwidth to it. This manager should also query the other serverlist holders, to update each other on the current share locations. The idea is that shares would be uploaded to completely arbitrary locations, to accomplish whatever placement goals were desired (load balancing, download speed, datacenter diversity, etc). Then the location slots would be allocated on well-known servers (using the normal Tahoe2 algorithm), and filled with the locations of the shares. The location slot holders would tell each other about the shares to flood this information out to all slots. A downloader would use Tahoe2 to find at least one location slot, read the serverlist out of it, and then query those servers for shares. They could read from multiple location slots if they somehow got out of sync. The likely implementation would be to send "fetch share location list" messages in parallel to the first N servers in the permuted list, and upon receipt of the first positive response, send DYHB messages in parallel to everything it mentions. After the repairer uploaded new shares, it would locate the location slots and inform them about the new share. Through flooding, eventually all location slots would learn about the new share. When a share is removed from a server (via lease expiration, or mutable rebalancing), the "propose removal" message would be sent, and the location slot would confirm. If the repairer notices that there are too few location slots, it can allocate and fill new ones. It would be appropriate to have "N" location slots. Since we're using 1-of-N encoding here (pure replication), the reliability of the location slots should be much higher than the reliability of the target file, and the extra layer of indirection should not significantly reduce reliability. Having the location slot servers perform their own verification removes the need for an extra layer of writecaps (which would be needed by the repairer to update the location list, making it harder to have storage servers drive the repair process themselves). However it does increase their workload significantly, making this look more like the old Tahoe1 scheme as mentioned above. On the other hand, this could provide the basis for server-driven repair, where their periodic check on each other could provoke repair of unhealthy files without client involvement. What will prevent us from wanting to manipulate the choice of location-slot holders in the same way that those tickets want to manipulate the choice of share holders? Location slots are very small, so we'd be unlikely to want to balance the storage load as we do with (large) shares. They'd use 1-of-N encoding, so certain performance and reliability concerns are reduced. But for the others, we'd just have to hope that providing arbitrary location of shares is enough to get folks to tolerate these fixed location of location-slot holders. There would be a nontrivial performance hit to use location-slots. Uploaders would require one extra RTT, because the "propose location" messages could not be sent until all shares had been closed. Servers would spend considerable bandwidth and CPU checking up on each others shares. And downloaders would incur an extra RTT to query the location-slot holders in addition to the existing DYHB queries, however the DYHB queries would be highly likely to succeed, making the combination potentially faster in certain not-ideally-Tahoe2-distributed server selection schemes. Other ideas: * use a limited subset of servers for the location slots (i.e. nodes might offer share-storage, or location-slot-storage, but not both) * alternatively, try to put the location-slots on the same servers as the shares, so they share fate and we avoid the reliability hit Anyways, this feels like a lot of moving parts (and the server load worries me), but a layer like this would let us use arbitrary server-selection policies. It would also help us scale to millions of storage nodes, if the Tahoe2 permutation and frequent queries only needed to be done on a few hundred location-slot servers instead of the entire set of storage servers.
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#599
No description provided.