Servers of happiness share placement distributes storage load unevenly in small grids #3022

Open
opened 2019-04-03 12:12:36 +00:00 by exarkun · 2 comments

Originally posted to tahoe-dev: https://tahoe-lafs.org/pipermail/tahoe-dev/2019-April/009937.html

Imagine a grid of 4 storage nodes and a client using parameters of needed=2 total=3 happy=3. Given 4 different uploads, I propose that an optimal distribution of shares is:

  • Server 0: uploadA-shareX, uploadB-shareY, uploadC-shareZ
  • Server 1: uploadA-shareY, uploadB-shareZ, uploadD-shareX
  • Server 2: uploadA-shareZ, uploadC-shareX, uploadD-shareY
  • Server 3: uploadB-shareX, uploadC-shareY, uploadD-shareZ

Happiness is maximized because no server has more than one share from any single upload. Additionally, storage utilization is optimized because each server has the same number of shares (of course different uploads need not be the same size but by spreading shares for each upload across different servers we also do about the best we can do with the current encoding scheme to equally share the load of large uploads).

However, this is not the distribution of shares that I observe in practice and I believe I see the cause of this in the implementation of servers of happiness.

First, the distribution of shares that I actually observe in practice is:

  • Server 0: uploadA-shareX, uploadB-shareX, uploadC-shareX, uploadD-shareX
  • Server 1: uploadA-shareY, uploadB-shareY, uploadC-shareY, uploadD-shareY
  • Server 2: uploadA-shareZ, uploadB-shareZ, uploadC-shareZ, uploadD-shareZ
  • Server 3:

Happiness has still been maximized because it is still the case that no server has more than one share from any single upload. However, it's clear that storage utilization has not been optimized because Server 3 has taken on no storage responsibilities at all.

Regarding the cause of this, here's my analysis.

Before "servers of happiness" starts, 2N storage servers are selected. Tahoe2ServerSelector.get_shareholders is mostly responsible for this, with the help of StorageFarmBroker.get_servers_for_psi. get_servers_for_psi produces a permuted list of all all connected storage servers. I think that it is the permutation of this list that is intended to provide storage load balancing across the grid. get_shareholders then grabs the first 2N of these for its purposes.

At this point, I think everything is still fine. In the scenario I outlined above, there are only 4 storage servers and 2N is 6. This means the same storage servers are always going to be returned by get_servers_for_psi - but they're not always going to be in the same order. Likewise, the first 2N of them is exactly the same thing and the same qualifications apply.

get_shareholders does a little more work to make sure the storage servers are actually available, splits them into read/write and read-only servers. None of this really makes a difference to the outcome of the scenario described above though.

The next relevant thing that happens is that get_shareholders uses PeerSelector.get_share_placements to compute a placement map that achieves the happiness target.

I think PeerSelector.get_share_placements is where things go wrong. The PeerSelector does not have the ordered list of 2N servers. Instead, it has some sets of servers. It passes those sets along to the servers of happiness algorithm (happiness_upload.share_placement) but the problem isn't solveable at this point. With only sets of servers and the original permuted list ordering lost, share_placement makes up an order (lexicographical on server id, I think). This is where the preference for servers 0-2 comes in and server 3 gets left out - every time.

Originally posted to tahoe-dev: <https://tahoe-lafs.org/pipermail/tahoe-dev/2019-April/009937.html> Imagine a grid of 4 storage nodes and a client using parameters of needed=2 total=3 happy=3. Given 4 different uploads, I propose that an optimal distribution of shares is: * Server 0: uploadA-shareX, uploadB-shareY, uploadC-shareZ * Server 1: uploadA-shareY, uploadB-shareZ, uploadD-shareX * Server 2: uploadA-shareZ, uploadC-shareX, uploadD-shareY * Server 3: uploadB-shareX, uploadC-shareY, uploadD-shareZ Happiness is maximized because no server has more than one share from any single upload. Additionally, storage utilization is optimized because each server has the same number of shares (of course different uploads need not be the same size but by spreading shares for each upload across different servers we also do about the best we can do with the current encoding scheme to equally share the load of large uploads). However, this is not the distribution of shares that I observe in practice and I believe I see the cause of this in the implementation of servers of happiness. First, the distribution of shares that I actually observe in practice is: * Server 0: uploadA-shareX, uploadB-shareX, uploadC-shareX, uploadD-shareX * Server 1: uploadA-shareY, uploadB-shareY, uploadC-shareY, uploadD-shareY * Server 2: uploadA-shareZ, uploadB-shareZ, uploadC-shareZ, uploadD-shareZ * Server 3: <empty> Happiness has still been maximized because it is still the case that no server has more than one share from any single upload. However, it's clear that storage utilization has not been optimized because Server 3 has taken on no storage responsibilities at all. Regarding the cause of this, here's my analysis. Before "servers of happiness" starts, 2N storage servers are selected. Tahoe2ServerSelector.get_shareholders is mostly responsible for this, with the help of [StorageFarmBroker](wiki/StorageFarmBroker).get_servers_for_psi. get_servers_for_psi produces a permuted list of all all connected storage servers. I think that it is the permutation of this list that is intended to provide storage load balancing across the grid. get_shareholders then grabs the first 2N of these for its purposes. At this point, I think everything is still fine. In the scenario I outlined above, there are only 4 storage servers and 2N is 6. This means the same storage servers are always going to be returned by get_servers_for_psi - but they're not always going to be in the same order. Likewise, the first 2N of them is exactly the same thing and the same qualifications apply. get_shareholders does a little more work to make sure the storage servers are actually available, splits them into read/write and read-only servers. None of this really makes a difference to the outcome of the scenario described above though. The next relevant thing that happens is that get_shareholders uses [PeerSelector](wiki/PeerSelector).get_share_placements to compute a placement map that achieves the happiness target. I think [PeerSelector](wiki/PeerSelector).get_share_placements is where things go wrong. The [PeerSelector](wiki/PeerSelector) does not have the ordered list of 2N servers. Instead, it has some sets of servers. It passes those sets along to the servers of happiness algorithm (happiness_upload.share_placement) but the problem isn't solveable at this point. With only sets of servers and the original permuted list ordering lost, share_placement makes up an order (lexicographical on server id, I think). This is where the preference for servers 0-2 comes in and server 3 gets left out - every time.
exarkun added the
unknown
normal
defect
1.12.1
labels 2019-04-03 12:12:36 +00:00
exarkun added this to the undecided milestone 2019-04-03 12:12:36 +00:00
Owner

To me there seems to be conflict of requirements, or maybe just
underspecification.
The configuration documentation about [#pref peers.preferred] outlines
behavior and some use-cases of putting some peers on top of list of those
considered for reading and writing.
The current algorithm, as described informally above and in length in
[#placements SoH document], selects 2*N top peers of the list
and makes arbitrary (likely pythonhash-based) selection from them as the
desired share to peer mapping.
In /tahoe-lafs/trac-2024-07-25/issues/5672#comment:16 it's suggested that window for considering peers for
holding data should be even larger if existing shares are found.
There is obvious issue of the ordering of peers as returned from
get_servers_for_psi not being respected, but I'd like us to pause for a moment
and look at whether fixing that actually addresses underlying issues.

The Servers of Happiness was devised as a metric of healthiness of a file,
namely for the purposes of "upload".
It consists of the number of peers that have distinct shares which are in the top 2*N
of the peer list for given Storage Index.
Furthermore a Servers of Happiness Placement Algorithm was designed that uses
this metric internally as a constraint on where to upload shares.
From reading the relevant tickets and paper, I gather there were two goals:

  • maximize the redundancy for the cases of peers becoming unavailable
  • minimize the amount of shares in need of uploading should some be already

available

Not only does the design of that algorithm ignore the preferred ordering of
peers for assigning new uploads, but I find the second goal is in conflict of
the use-case described in [#pref peers.preferred] of "… prefer their local
servers so that they can maintain access to all of their uploads without using
the internet."

It's worth noting that the "upload" mechanism is used for a lot of things other
than fresh upload to a grid.
Notably the check-and-repair mechanism from what I understood hooks download
and upload component together in a way that relies on the upload mechanism to
reuse shares already accessible in the top 2N (or should it be 4N?)
peers for given Storage Index.
This possibly achieves, given low enough peer turnover and frequent repairs,
moving the shares to the servers that are among the ones that are searched
first but not necessarily the actual top of the list.
Garbage collection could potentially remove the old shares on peers down the
list, but the current checker can only renew leases on all reachable peers or
not at all.
The comments in happiness_upload suggest that the attempted reupload actually
refreshes the leases, so counterintuitively you should not specify that you
want leases refreshed on repair unless you want redundant placement!
But as far as I understood the reupload is not attempted unless the file is
considered unhealthy by some metric so there doesn't seem to be a mechanism for
letting only redundant shares be collected.

Also of note is that various checkers use it's own implementation of Servers of
Happiness metric calculation in src/allmydata/util/happinessutil.py that only
partly reuses the happiness_upload code.
I assume this is to eliminate the 2*N limit that is imposed on uploads.

The deeper I dove in the related tickets (there are many!) the more of ad-hoc
unspecified requirement discussion I discovered.
Here I try to distill those that I've found that may affect desired placement
of shares over peers:

  • resilience (happiness)
    • The happiness number (cardinality of maximum match) presents a metric of
      resiliency against peers becoming unavailable.
    • Peers are considered atomic units of failure, which while oversimplified is
      pretty much the underlying abstraction for Tahoe-LAFS as it is now.
    • In face of more shares on a single peer the happiness number is not actual
      resiliency metric but a lower bound of it. (The spec discusses this)
  • permutation distance
    • Distance of available shares from the peers chosen by Storage Index based
      permutation.
  • locality
    • Ability to reconstruct files from local peers only (possibly specified by
      peers.preferred or some other mechanism).
    • This means having k shares available locally, even if there is less than k
      local peers.
  • localized redundancy
    • Amount of shares of single file placed on a single peer.
    • If this is larger than k then it's superfluous according to the failure
      model as we can reconstruct all shares from k.
  • global redundancy
    • Amount of shares duplicated across the grid.
  • bandwidth efficiency
    • There is incentive to not waste bandwidth by uploading shares that are
      already found on some peers.
  • balance
    • Even spread of shares across peers. The issue at the origin of this ticket.
    • Should by achievable by minimizing redundancy and permutation distance.

The /tahoe-lafs/trac-2024-07-25/issues/6274#comment:26 discusses the difference of requirements between
user-initiated upload and a repair.
While there seemed to be an agreement to split them, I don't think that got
worked on.

There are quite a few tickets for rebalancing files after initial upload, e.g.:
/tahoe-lafs/trac-2024-07-25/issues/5761, /tahoe-lafs/trac-2024-07-25/issues/5294 and even /tahoe-lafs/trac-2024-07-25/issues/5605 which proposes central rebalancing
manager.
I should add that the concept of central balancer is incompatible with the
feature of locality, while relocating check-and-repair would support it.

While extreme imbalance should be less of an issue with Servers of Happiness
Share Placement than it was with previous algorithm, the share_placement
function will try to always allocate all of the shares, which in absence of
enough connected peers can mean they all end up clustered on one or few peers, possibly with superfluous redundancy (>k).
This can happen both when having very low required happiness value and on repair where that value is artificially set to zero.

There were some proposals to disallow multiple shares per peer
(e.g. /tahoe-lafs/trac-2024-07-25/issues/6274#comment:35) or to allow capping their amount to supplied
number.
This would make the behavior closer to RAID and make the space usage
predictable despite shaky interconnection.

There is a lot of requests for allowing uploader select peers specifically,
usually for purposes of locality (/tahoe-lafs/trac-2024-07-25/issues/5635, /tahoe-lafs/trac-2024-07-25/issues/5529) but curiously also
for anti-locality (making sure of having shares non-locally) in /tahoe-lafs/trac-2024-07-25/issues/5543 for
purpose of migrating data from dying drive.

I didn't dive deeply into garbage collection tickets, but I presume the
above mentioned issues with rebalancing and removing redundant shares would be
well represented.

The upload.py API is mostly described in /tahoe-lafs/trac-2024-07-25/issues/6444, while bulk of rationale
for Servers of Happiness as a metric is in /tahoe-lafs/trac-2024-07-25/issues/5840.

The share placement is determined once, at allocation time, and does not reflect possibility of upload failing (/tahoe-lafs/trac-2024-07-25/issues/7138)

That's about what I managed to discover while spelunking the code and tickets
that may affect how to actually define the requirements (and their relative
precedence) for the share placement function(s) that we want.


To me there seems to be conflict of requirements, or maybe just underspecification. The configuration documentation about [#pref peers.preferred] outlines behavior and some use-cases of putting some peers on top of list of those considered for reading and writing. The current algorithm, as described informally above and in length in [#placements SoH document], selects 2*N top peers of the list and makes arbitrary (likely pythonhash-based) selection from them as the desired share to peer mapping. In [/tahoe-lafs/trac-2024-07-25/issues/5672](/tahoe-lafs/trac-2024-07-25/issues/5672)#comment:16 it's suggested that window for considering peers for holding data should be even larger if existing shares are found. There is obvious issue of the ordering of peers as returned from get_servers_for_psi not being respected, but I'd like us to pause for a moment and look at whether fixing that actually addresses underlying issues. The Servers of Happiness was devised as a metric of healthiness of a file, namely for the purposes of "upload". It consists of the number of peers that have distinct shares which are in the top 2*N of the peer list for given Storage Index. Furthermore a Servers of Happiness Placement Algorithm was designed that uses this metric internally as a constraint on where to upload shares. From reading the relevant tickets and paper, I gather there were two goals: * maximize the redundancy for the cases of peers becoming unavailable * minimize the amount of shares in need of uploading should some be already > available Not only does the design of that algorithm ignore the preferred ordering of peers for assigning new uploads, but I find the second goal is in conflict of the use-case described in [#pref peers.preferred] of "… prefer their local servers so that they can maintain access to all of their uploads without using the internet." It's worth noting that the "upload" mechanism is used for a lot of things other than fresh upload to a grid. Notably the check-and-repair mechanism from what I understood hooks download and upload component together in a way that relies on the upload mechanism to reuse shares already accessible in the top 2*N (or should it be 4*N?) peers for given Storage Index. This possibly achieves, given low enough peer turnover and frequent repairs, moving the shares to the servers that are among the ones that are searched first but not necessarily the actual top of the list. Garbage collection could potentially remove the old shares on peers down the list, but the current checker can only renew leases on all reachable peers or not at all. The comments in happiness_upload suggest that the attempted reupload actually refreshes the leases, so counterintuitively you should not specify that you want leases refreshed on repair unless you want redundant placement! But as far as I understood the reupload is not attempted unless the file is considered unhealthy by some metric so there doesn't seem to be a mechanism for letting only redundant shares be collected. Also of note is that various checkers use it's own implementation of Servers of Happiness metric calculation in src/allmydata/util/happinessutil.py that only partly reuses the happiness_upload code. I assume this is to eliminate the 2*N limit that is imposed on uploads. The deeper I dove in the related tickets (there are many!) the more of ad-hoc unspecified requirement discussion I discovered. Here I try to distill those that I've found that may affect desired placement of shares over peers: * resilience (happiness) - The happiness number (cardinality of maximum match) presents a metric of resiliency against peers becoming unavailable. - Peers are considered atomic units of failure, which while oversimplified is pretty much the underlying abstraction for Tahoe-LAFS as it is now. - In face of more shares on a single peer the happiness number is not actual resiliency metric but a lower bound of it. (The spec discusses this) * permutation distance - Distance of available shares from the peers chosen by Storage Index based permutation. * locality - Ability to reconstruct files from local peers only (possibly specified by peers.preferred or some other mechanism). - This means having k shares available locally, even if there is less than k local peers. * localized redundancy - Amount of shares of single file placed on a single peer. - If this is larger than k then it's superfluous according to the failure model as we can reconstruct all shares from k. * global redundancy - Amount of shares duplicated across the grid. * bandwidth efficiency - There is incentive to not waste bandwidth by uploading shares that are already found on some peers. * balance - Even spread of shares across peers. The issue at the origin of this ticket. - Should by achievable by minimizing redundancy and permutation distance. The [/tahoe-lafs/trac-2024-07-25/issues/6274](/tahoe-lafs/trac-2024-07-25/issues/6274)#comment:26 discusses the difference of requirements between user-initiated upload and a repair. While there seemed to be an agreement to split them, I don't think that got worked on. There are quite a few tickets for rebalancing files after initial upload, e.g.: [/tahoe-lafs/trac-2024-07-25/issues/5761](/tahoe-lafs/trac-2024-07-25/issues/5761), [/tahoe-lafs/trac-2024-07-25/issues/5294](/tahoe-lafs/trac-2024-07-25/issues/5294) and even [/tahoe-lafs/trac-2024-07-25/issues/5605](/tahoe-lafs/trac-2024-07-25/issues/5605) which proposes central rebalancing manager. I should add that the concept of central balancer is incompatible with the feature of locality, while relocating check-and-repair would support it. While extreme imbalance should be less of an issue with Servers of Happiness Share Placement than it was with previous algorithm, the share_placement function will try to always allocate all of the shares, which in absence of enough connected peers can mean they all end up clustered on one or few peers, possibly with superfluous redundancy (>k). This can happen both when having very low required happiness value and on repair where that value is artificially set to zero. There were some proposals to disallow multiple shares per peer (e.g. [/tahoe-lafs/trac-2024-07-25/issues/6274](/tahoe-lafs/trac-2024-07-25/issues/6274)#comment:35) or to allow capping their amount to supplied number. This would make the behavior closer to RAID and make the space usage predictable despite shaky interconnection. There is a lot of requests for allowing uploader select peers specifically, usually for purposes of locality ([/tahoe-lafs/trac-2024-07-25/issues/5635](/tahoe-lafs/trac-2024-07-25/issues/5635), [/tahoe-lafs/trac-2024-07-25/issues/5529](/tahoe-lafs/trac-2024-07-25/issues/5529)) but curiously also for anti-locality (making sure of having shares non-locally) in [/tahoe-lafs/trac-2024-07-25/issues/5543](/tahoe-lafs/trac-2024-07-25/issues/5543) for purpose of migrating data from dying drive. I didn't dive deeply into garbage collection tickets, but I presume the above mentioned issues with rebalancing and removing redundant shares would be well represented. The upload.py API is mostly described in [/tahoe-lafs/trac-2024-07-25/issues/6444](/tahoe-lafs/trac-2024-07-25/issues/6444), while bulk of rationale for Servers of Happiness as a metric is in [/tahoe-lafs/trac-2024-07-25/issues/5840](/tahoe-lafs/trac-2024-07-25/issues/5840). The share placement is determined once, at allocation time, and does not reflect possibility of upload failing ([/tahoe-lafs/trac-2024-07-25/issues/7138](/tahoe-lafs/trac-2024-07-25/issues/7138)) That's about what I managed to discover while spelunking the code and tickets that may affect how to actually define the requirements (and their relative precedence) for the share placement function(s) that we want. ---- * <a name="pref">peers.preferred</a> <https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/configuration.rst?rev=fad4ffe37edaf7ca7497f08bc543062d93ca4f99#L662> * <a name="placements">SoH document</a>:: <https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/specifications/servers-of-happiness.rst?rev=fad4ffe37edaf7ca7497f08bc543062d93ca4f99#L106>
Owner

There are several related questions I have about the desired behavior:

First: Is it desirable to upload more than one share to each server and if so, under which conditions?

Second: If so, what will be the mechanism for relocating some of the shares from servers that have more than one once more servers become available so that resilience is maximized? Preferably in a way that doesn't waste space by having too many useless copies of certain shares.

Third: What will be the mechanism to optimizing space usage by removing redundant copies of each share? Taking into account some servers might want to hold on to extra copies as a cache for quick access, as defined by peers.preferred currently.

By reading the source code I understood the current behavior to be:

  • On upload distribute shares across servers using the SoH algorithm.
  • If there are any shares left without servers assigned, just round-robin through the server list, regardless of how many shares the server already has.
  • On check-and-repair shares are checked for readability.
    • Whether all of the shares are reachable is reported as boolean "healthy".
    • Happiness is calculated and returned, but it doesn't seem to affect anything.
  • If the file is not healthy, repair is run. Repair uses the SoH algorithm to maximize happiness and round-robins for the rest.

To me it seems that this algorithm will on files where N is close to the size of grid will result in large amount of duplication unless every server has flawless connectivity, with no real means of reclaiming the space once the grid recovers from a partition.

Moreover, uploading more than max(k, SoH) shares seems pointless to me if it will result in more than one share per server. Frankly I'm not really sure whether the practice of uploading more than once to a server matches the usage model for Tahoe-LAFS at all.

Thinking about how to redistribute shares I've come to conclusion that it's probably best if the node initiating the upload/check-and-repair is also in charge of the reallocation. If there was a way for node to ask for content to be moved to them it would open (or rather widen) the possibility for malicious actor with introducer access to create DoS/dataloss by creating lot of nodes hoping to get to top of the permuted list and then throwing the data away. It would also mean that all nodes need to agree on the algorithm for the permuted list and which nodes are trusted to hold the data, rather than just the uploader deciding.

I've been thinking of possible improvements of the garbage collection mechanism (mainly in terms of quick space reclamation) and incorporating explicit mapping of shares onto specific servers through updateable data structure might be the way to address the issue of data relocation too.

There are several related questions I have about the desired behavior: First: Is it desirable to upload more than one share to each server and if so, under which conditions? Second: If so, what will be the mechanism for relocating some of the shares from servers that have more than one once more servers become available so that resilience is maximized? Preferably in a way that doesn't waste space by having too many useless copies of certain shares. Third: What will be the mechanism to optimizing space usage by removing redundant copies of each share? Taking into account some servers might want to hold on to extra copies as a cache for quick access, as defined by peers.preferred currently. By reading the source code I understood the current behavior to be: * On upload distribute shares across servers using the SoH algorithm. * If there are any shares left without servers assigned, just round-robin through the server list, regardless of how many shares the server already has. * On check-and-repair shares are checked for readability. * Whether all of the shares are reachable is reported as boolean "healthy". * Happiness is calculated and returned, but it doesn't seem to affect anything. * If the file is not healthy, repair is run. Repair uses the SoH algorithm to maximize happiness and round-robins for the rest. To me it seems that this algorithm will on files where N is close to the size of grid will result in large amount of duplication unless every server has flawless connectivity, with no real means of reclaiming the space once the grid recovers from a partition. Moreover, uploading more than max(k, SoH) shares seems pointless to me if it will result in more than one share per server. Frankly I'm not really sure whether the practice of uploading more than once to a server matches the usage model for Tahoe-LAFS at all. Thinking about how to redistribute shares I've come to conclusion that it's probably best if the node initiating the upload/check-and-repair is also in charge of the reallocation. If there was a way for node to ask for content to be moved to them it would open (or rather widen) the possibility for malicious actor with introducer access to create DoS/dataloss by creating lot of nodes hoping to get to top of the permuted list and then throwing the data away. It would also mean that all nodes need to agree on the algorithm for the permuted list and which nodes are trusted to hold the data, rather than just the uploader deciding. I've been thinking of possible improvements of the garbage collection mechanism (mainly in terms of quick space reclamation) and incorporating explicit mapping of shares onto specific servers through updateable data structure might be the way to address the issue of data relocation too.
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#3022
No description provided.