Adjust the probability of selecting a node according to its storage capacity (or other fitness measure) #872
If the probability of the peer selection algorithm putting a node close to the beginning of the list were proportional to its storage capacity, then that would better tolerate grids with a wide range of node capacities.
With a uniform selection probability, as at present, small-capacity nodes will be expected to receive many requests to store shares that they don't have room for, and to download shares that they don't have.
See http://allmydata.org/pipermail/tahoe-dev/2009-December/003408.html and followups for mailing list discussion.
Also see /tahoe-lafs/trac-2024-07-25/issues/5364#comment:-1.
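As a toy illustration of "proportional to its storage capacity" (names and numbers here are made up; the proposals discussed in the comments below bias the whole permuted server list rather than a single draw):

```python
import random

# Made-up example: server C has 4x the capacity of A, so it should be
# roughly 4x as likely to land near the front of the candidate list.
capacities = {"A": 1, "B": 2, "C": 4}
servers = list(capacities)
weights = [capacities[s] for s in servers]
print(random.choices(servers, weights=weights, k=1)[0])
```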
bwahaha, welcome to a big can of worms :)
source:docs/specifications/outline.txt (section 3: "Server Selection
Algorithm, filecap format") is worth reading. It points out the requirement
that all of the uploader's choices are somehow recorded and made available to
the downloader. Or, rather, the downloader's sequence of servers needs to be
"well" correlated with the uploader's sequence.
So any upload-time code which is influenced by things like current remaining
server space will need a way to record its choices (or the information which
went into that choice) in the filecap, so it can influence the downloader in
the same way.
That said, choosing servers according to capacity would serve the purpose of
filling servers at the same time as opposed to filling them at the same rate.
(i.e., all servers become full at the same moment, versus each server sees
the same inbound bytes-per-second rate). If all servers had the same
capacity, these two options would be identical.
Part of the discussion in #302 is about whether this is good, important, or
irrelevant. In general, I think that full-at-the-same-time is good, but I'm
not sure it's actually better than fill-at-the-same-rate. I believe that
maximum reliability occurs when each file has as many choices for servers as
possible, but those options will dwindle over time as servers get full. A
system which probabilistically favors some servers over others (based upon
capacity or whatever) will have fewer choices to work with.
Hm, I think there's a rigorous argument in there somewhere. The entropy of
the server-selection process (given a random storage-index) should be fairly
well-defined. A non-probabilistic algorithm will just give you log2 of
the number of possible choices. A probabilistic algorithm would be like that,
but with each choice weighted by the probability of its selection. (I used to
know this stuff, really I did.. I'll look up my old Information Theory
textbook when I get home).
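To make that concrete, a toy sketch with made-up numbers: a uniform choice among n servers gives log2(n) bits, and any non-uniform weighting gives less.

```python
import math

def selection_entropy(probs):
    """Shannon entropy (in bits) of a single server-selection draw."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

# Uniform choice among 15 servers: log2(15) ~= 3.91 bits
print(selection_entropy([1.0 / 15] * 15))
# Capacity-weighted choice, 10 big servers and 5 tiny ones (100:1): ~3.36 bits
weights = [100] * 10 + [1] * 5
total = float(sum(weights))
print(selection_entropy([w / total for w in weights]))
```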
With that sort of definition, we could evaluate different algorithms
according to how well they maximize that entropy. Moreover, the entropy is a
function of the current state of the grid (like how many free servers are
left), and that state will evolve in different ways according to the
algorithm we choose. So we can further evaluate that entropy over time. Any
non-homogeneous grid will see the entropy drop over time, as the grid fills
up and the choices dwindle. We could develop a metric to talk about the
entropy averaged across all files: maybe the best algorithm is the one that
manages the highest average entropy, or perhaps the lowest variance, or
something.
A probabilistic selection algorithm will always have lower per-file entropy
than a non-probabilistic one, given the same number of potential servers
(well, a non-uniform-probability algorithm, to be precise). But if it manages
to preserve server availability longer, then the entropy averaged over the
life of the grid (from empty to full) might be higher. That's probably the
way we should investigate the value of a different algorithm.
Replying to warner:
I agree with the first sentence, but not the second. The "expected full at the same time" property will tend to maximize the number of storage nodes available to accept shares for as long as possible; that's why I believe it is better for file preservation.
There's a fairly straightforward way to change the selection algorithm to achieve this property. Suppose that capacity estimates are multiples of some unit C. If a storage node has capacity estimate e·C, we give it e entries in the list to be permuted (there's a way to make this more efficient; see below). That is, permute the list and then discard duplicates later than the first occurrence of a given node. The effect is similar to splitting each storage server into e virtual nodes that share the same disk space, but with the important difference that the upload algorithm will still try not to put multiple shares on a given server.
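A minimal sketch of that permutation (function names and the exact hash keying are illustrative assumptions; capacity estimates are assumed to be small positive integers):

```python
from hashlib import sha1

def permuted_servers(storage_index, capacity_units):
    """capacity_units maps serverid (bytes) -> capacity estimate e (int).
    Each server gets e entries keyed by hash(storage_index + serverid + i);
    after sorting, only the first occurrence of each server is kept."""
    entries = []
    for serverid, e in capacity_units.items():
        for i in range(e):
            key = sha1(storage_index + serverid + str(i).encode()).digest()
            entries.append((key, serverid))
    entries.sort()
    result, seen = [], set()
    for _key, serverid in entries:
        if serverid not in seen:
            seen.add(serverid)
            result.append(serverid)
    return result
```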
This means that the capacity estimate of a given storage node can't change, and must be known by the Introducer so that it can tell all other nodes. (The actual capacity can be different to the estimate; that won't cause any greater problems than at present.)
The performance of this algorithm as given above is poor when the sum of all e is large, but it can be improved by selecting the servers using a binary search tree rather than an explicit list. That is, each step of a Fisher-Yates shuffle would choose a random element from the search tree weighted by its capacity estimate, then delete that element from the tree. This is equivalent to using an arbitrarily small C.
I'm not sure that selection entropy is the main issue. The two most important things we want are:
Both of these are affected primarily by the proportion of servers that are available, not by their probability of selection.
Replying to [davidsarah]comment:3:
I think there are two levels of reliability in action here. The first and
most important is to avoid doubling-up of shares (the "servers of happiness"
threshold, but really it's strictly >=N servers). Certainly your reliability
drops significantly when you go below this number of available servers.
The second order effect is the decorrelation of per-file server sets, which
is the entropy thing I'm talking about. It only makes sense to talk about
this one after you've ensured that you have the first level for everything.
Imagine that you had 20 servers, the usual 3-of-10 encoding, and the
selection rule was that on even days you used servers 1-10, and on odd days
you used servers 11-20. Each file would have the first kind of reliability
(every file would use N distinct servers). But the second kind of reliability
would be marginal: an attacker who destroys the right 50% of the servers
would completely kill half the files (in fact they could fatally wound half
the files with just 40% of the servers).
In contrast, if each file gets a random selection of all twenty servers, then
there's minimal correlation between the servers used by any two files. An
attacker who destroys servers 1-10 would expect to kill just 2126/184756 =
1.15% of the files.
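A quick check of that 2126/184756 figure (assuming a file with 3-of-10 encoding dies only if at least 8 of its 10 servers are among the 10 destroyed):

```python
from math import comb  # Python 3.8+

destroyed, total_servers, shares = 10, 20, 10
killed = sum(comb(destroyed, j) * comb(total_servers - destroyed, shares - j)
             for j in range(8, shares + 1))
print(killed, comb(total_servers, shares))  # 2126 184756, ratio ~1.15%
```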
So I think the first goal is to keep >=N servers free for as long as possible
(ideally until the very last file fills the grid), but if we can achieve
that, then our second goal should be to maximize the number of ways in which
files are uploaded.
Yeah, I like the simplicity of that. But we need a stable way to inform the
downloader about the capacities we saw, so they can get to the same list.
Maybe a layer of indirection could help: the serverlist is stored in stable,
well-known places that do not depend upon server capacity (and the serverlist
isn't big enough to fill those places much), but the shares can go elsewhere
(to places chosen for the fill-at-the-same-time goal).
I'll agree with all of that. Certainly selection entropy is less important
than the servers-of-happiness (really >=N) criterion. I don't know how it
should compare against download performance.. probably below. I guess I'd put
selection entropy as the third item in your list.
I hope to explain my entropy concept better.
Here's another example of why I think the probabilistic approach needs to be
evaluated against the entropy concept. Imagine that you've got 3-of-10
encoding and 15 servers: 10 big ones and 5 tiny ones. The probabilistic
algorithm will almost always pick the 10 big ones and ignore the 5 tiny ones.
So even though we've nominally got 15 free servers, we rarely actually use
them all. So almost every file we upload will share a server-set
(big1-big10), making them more vulnerable (as a group). The entropy of the
selection algorithm will be nearly zero, since the tiny servers are chosen
with such low probability. The entropy will remain mostly constant over time,
though, since you'll probably fill the tiny servers at the same time as the
big servers, so your choices will remain about the same for the whole time.
Of course, if you send any more traffic towards those tiny servers (such as
if you went for same-rate instead of same-time), they'll fill up sooner than
the big ones, and they'll quickly be full. At that point, the entropy drops
to zero, because you have exactly one option.
Since the servers in this example are not of uniform size, this loss of
entropy is inevitable. There's a finite number of possibilities, and each
byte you upload consumes some of them. A completely homogeneous grid with
uniformly-sized uploads will run out of selection entropy all at the same
time, just as the last file causes the grid to be full. The
entropy-versus-time graph (from t=0 to t=grid-is-full) is flat. For
heterogeneous grids and a non-probabilistic algorithm the graph looks like a
step-wise decrementing function, starting high, dropping a bit after each
server fills up, but flat inside each region (the last plateau is at 0, when
there are exactly N servers left, then there's a region when we're doubling
up shares that we'd represent with a red line or negative numbers or
something else). I think a capacity-sensitive algorithm's graph would look
completely flat: since all servers should fill at the same time, there should
be no steps, but the overall entropy will be lower than if you chose freely
between the initially-available servers.
A flat graph would mean that late-uploaded files are just as good as
early-uploaded files. A decreasing curve means that early files have it
better than late files (or, to be more precise, that a batch of files
uploaded early will have less mutual-correlation than a similar batch
uploaded late: killing a random set of servers would be expected to kill more
of the late files than the early ones).
I suspect that the area under this curve is constant, independent of the
selection algorithm, and that the area is a function of the set of server
capacities. It would be maximal for homogeneous servers.
I'm merely thinking that it might be possible to measure the shape of this
curve for different selection algorithms, and that it's something to keep in
mind when picking one. If my suspicion about those shapes is correct, then
the probabilistic approach seems the "fairest", in that it would give equal
distributions to both early and late files.
I still don't know how to record enough information about the server choices
into the filecap, though. Capacities will change over time, and to make this
work right for the uploader, they'll need to keep changing their
probabilities in response to new "how much space do you have left" updates
from the servers.
Replying to [warner]comment:4:
Agreed, but I think that "expected full at the same time" is likely to help with this decorrelation as well. The reason is that if some servers are full -- even if >= N servers are not full -- then the choice of servers has been reduced.
For example, suppose you have N small-capacity servers and N large-capacity servers. If you choose servers uniformly, then all of the small-capacity servers will fill up first, and then the choice of servers for shares of subsequent files will be reduced to N. So by attacking only the N large-capacity servers, the attacker is disproportionately likely to kill the more recently added files. (Reading the later part of your comment, it seems we're in violent agreement on this.)
Note that this is a very likely situation in practice given the tendency to add servers in batches with increasing capacities (as in the case of allmydata); and in that case even rebalancing all shares would not help. With the non-uniform choice, OTOH, then rebalancing would restore the random distribution of all shares (regardless of when their file was originally uploaded) across servers.
The attacker still gets a greater advantage from killing a server with a higher capacity, but only to the extent that we would expect because it holds more shares. When the servers are mostly full, we cannot avoid that property.
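A toy simulation of the small-vs-large scenario above (all parameters made up) shows the effect: with uniform selection the small servers fill early, after which every file is confined to the same reduced server set.

```python
import random

def open_server_counts(capacities, shares_per_file=10, seed=0):
    """Upload files until fewer than shares_per_file servers have room,
    placing one share on each of shares_per_file randomly chosen non-full
    servers; return the number of non-full servers seen by each file."""
    rng = random.Random(seed)
    free = list(capacities)          # remaining shares each server can hold
    counts = []
    while True:
        open_servers = [i for i, f in enumerate(free) if f > 0]
        if len(open_servers) < shares_per_file:
            return counts
        counts.append(len(open_servers))
        for i in rng.sample(open_servers, shares_per_file):
            free[i] -= 1

# 10 small servers (100 shares each) and 10 large ones (1000 shares each):
counts = open_server_counts([100] * 10 + [1000] * 10)
# The small servers fill after roughly 200 files; from then on every file
# draws from the same 10 large servers until the grid is full.
print(len(counts), counts[0], counts[-1])
```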
The extent of the bias is also limited by the attempt to place only one share on each server when uploading. Suppose you have one server that has 100 times the capacity of all the others. The algorithm I suggested will almost always place one share of each file on that server -- but only one. This seems like reasonable behaviour (even for this unreasonably extreme configuration): it uses the capacity of the supersized server as well as possible without relying on it excessively.
If capacity estimates are fixed, then informing the downloader about them is no more difficult than informing the downloader about public keys. If the actual capacity of a server increases relative to its estimate, then the effect of that is never any worse than the uniform-probability selection. So I think there's a good case for just following the "worse is better" approach of assuming fixed capacity estimates (which makes the selection algorithm stable -- or as stable given server changes as it is now).
Agreed.
The same is true of the uniform algorithm as soon as the tiny servers fill up, which will be soon. The main difference seems to be that the non-uniform algorithm spreads out the non-uniformity in server selection over time.
[...]
You mean it will start at a lower value, I assume? That also matches my intuition (although we should do some simulations), and I think that the flat graph is preferable, because we don't want to favour earlier files at the expense of later ones.
Now there's a bold prediction (that we should be able to test by simulation).
As explained above, I don't think this is necessary.
Replying to [davidsarah]comment:3:
I didn't explain this well. The idea is that for each node of the search tree, you keep track of the total weights of its left and right subtrees. That allows you to pick a random node in the tree with probability proportional to its weight, by making depth binary choices where depth ~= log2 n for n servers. Deleting a node also takes depth time, because you have to update the total weights on the ancestors of the deleted node. The overall time is therefore O(n log n).
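A sketch of that structure, substituting a Fenwick (binary indexed) tree of remaining weights for the explicit search tree (same O(log n) cost per pick and per deletion); integer weights and all names here are illustrative assumptions:

```python
import random

class WeightedSampler:
    """Weighted sampling without replacement via a Fenwick tree of weights."""
    def __init__(self, weights):             # weights: positive integers
        self.n = len(weights)
        self.weights = list(weights)
        self.tree = [0] * (self.n + 1)
        for i, w in enumerate(weights, 1):
            self._add(i, w)

    def _add(self, i, delta):
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _prefix(self, i):                     # sum of weights[0:i]
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def pop(self, rng):
        """Remove and return an index, chosen with probability weight/total."""
        target = rng.randrange(self._prefix(self.n))
        lo, hi = 0, self.n                    # smallest i with prefix(i) > target
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            if self._prefix(mid) <= target:
                lo = mid
            else:
                hi = mid
        self._add(hi, -self.weights[hi - 1])  # delete the chosen element
        return hi - 1

def weighted_shuffle(weights, rng=None):
    """Permutation biased by weight, like a weighted Fisher-Yates shuffle."""
    rng = rng or random.Random()
    sampler = WeightedSampler(weights)
    return [sampler.pop(rng) for _ in range(len(weights))]
```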
(This is more like the original Fisher-Yates shuffle than the Durstenfeld variant.)
Part of the goal of #543 ('rebalancing manager') is "to smooth out disk usage among all servers (more by percentage than by absolute usage)." This ticket might help by giving such a rebalancer less to do.
Ah, there is another constraint on the shuffle algorithm: it must be approximately stable when servers are added or removed. The existing algorithm (essentially, hash each peerid and sort by the hashes) is stable because adding or removing a server just adds or removes its hash, and the other hashes are sorted in the same order. The first algorithm described in comment:3 is also stable in this sense, because it can be defined in a similar way by hashing the peerid and a small integer. (It's easy to make this compatible with the existing scheme when all capacity estimates are equal.)
However the Fisher-Yates-based algorithm described in comment:6 is not stable in the required sense, and I don't see how to make it so (a pity, because I'd just finished implementing it :-/ ).
I'm not convinced that this stability or "consistent hashing" property is a hard requirement. All of the Tahoe-LAFS grids that have been deployed so far (with one exception) have so few storage servers that most reads query every server. The one exception is the allmydata.com production grid, which has about a hundred servers. It might work just fine to query all one hundred of them on every read, too.
Whether the consistent hashing property is important to real deployments is an empirical measurement question, IMO, and my guess is that for all of the current small grids the answer is "no measurable impact" and for allmydata.com the answer is "measurable impact, but not a critical problem".
Even if this stability property is not critical, it seems that losing it would be an unnecessary regression that might prevent us from scaling up to larger grids.
The original algorithm in comment:3 keeps this property while still meeting the goals of this ticket. I don't think the fact that it is less efficient when (sum of all e) is large would be a serious obstacle. Besides, I have an idea about how to do better, but I'll have to think about it some more.
Replying to davidsarah:
The comment:3 algorithm is equivalent to picking the minimum hash value out of e independent hash values for each server. We can get the same result by taking a single hash, and transforming it so that it follows the same distribution as the minimum of e hashes would have done.
Let Xe be the distribution given by the minimum of e independent uniform distributions U1..Ue, each on [0, 1). The cumulative distribution function of Xe is given by:

  F_Xe(x) = 1 - (1-x)^e

Then we can use inverse transform sampling to generate samples from Xe. For that we need the inverse of F_Xe, which is:

  F_Xe^-1(y) = 1 - (1-y)^(1/e)

So if we let y be the peer id hash for a given server scaled to the range [0, 1), and e be its capacity estimate, then sorting according to 1 - (1-y)^(1/e) will give the same distribution of permutations that we would have got by the comment:3 algorithm.

Plotting F_Xe^-1 for various e (the curves 1 - (1-x)^(1/e) for e = 1..5) gives results that are intuitively reasonable in order for this to work: increasing e biases the transformed hash toward lower values that are more likely to be near the start of the list (but for any e, there is still some chance that the server will not be picked).
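A quick empirical sanity check of that equivalence (toy code, nothing here is from the patch):

```python
import random

rng = random.Random(0)
e, n = 4, 200000
min_of_e = sorted(min(rng.random() for _ in range(e)) for _ in range(n))
transformed = sorted(1 - (1 - rng.random()) ** (1.0 / e) for _ in range(n))
# Both empirical medians should be close to 1 - 0.5**(1/4) ~= 0.159
print(min_of_e[n // 2], transformed[n // 2])
```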
Also notice that:
To be more concrete, the permutation key computed at source:src/allmydata/storage_client.py#L121 would change to something like
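the following rough sketch (the names `_permuted`, `serverid`, and `capacity_estimate` here are illustrative assumptions, not quotes of storage_client.py):

```python
from hashlib import sha1

def _permuted(peer_selection_index, serverid, capacity_estimate):
    """Sketch only: bias the sha1-based permutation key by the server's
    capacity estimate e, using the 1 - (1-y)**(1/e) transform above."""
    digest = sha1(peer_selection_index + serverid).digest()
    y = bytes_to_long(digest) / float(1 << (8 * len(digest)))  # scale to [0, 1)
    return 1 - (1 - y) ** (1.0 / capacity_estimate)            # e >= 1 assumed

# servers would then be sorted by this key instead of by the raw digest:
# sorted(servers, key=lambda s: _permuted(psi, s.serverid, s.capacity_estimate))
```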
using this utility function to convert a binary string to an integer (which inexplicably doesn't seem to be in the stdlib):
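A straightforward version of such a helper (illustrative; Python 3's int.from_bytes(s, 'big') now covers this):

```python
def bytes_to_long(s):
    """Interpret a big-endian byte string as a non-negative integer."""
    n = 0
    for byte in bytearray(s):
        n = (n << 8) | byte
    return n
```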
Incidentally, I know that Python floating point arithmetic might not give exactly the same results between machines. That shouldn't matter because it can only have the effect of swapping two servers next to each other in the permuted order, which we should be tolerant of.
neat!
The stability of this all still depends upon the stability of the capacity
estimates, right? I gather you've been assuming that any given server would
publish its total capacity in a fixed record, along with its nodeid. I've
been assuming that each server would publish its current-remaining-capacity
periodically, in a record that changes over time, like the one that contains
the location hints and version info.
I like the adaptiveness of schemes that keep publishing updates as remaining
space dwindles. There will be a lot of random noise in our traffic rates, and
if these rates are adjusted over time to match the remaining space, then
we'll get a nice feedback loop to compensate for accidental fluctuations.
Also, server operators are likely to add or remove space at various times,
and it would be nice to be able to adapt to that.
But I don't know how to build a system with all of these nice properties at
once: smooth filling of servers by percentage instead of rate, stable
ordering of servers between upload time and download time, short filecaps,
minimal auxiliary information (like an explicit serverlist stored in some
intermediate location).
Good argument. (Zooko and I have discussed download-time query flooding
before, and we usually tend to land on the same sides each time). I don't yet
know how to scale tahoe up to millions of nodes, but I think it will be
important to have a well-defined and stable place to find your shares (even
if you have to do O(log(N)) queries to find them). Giving up on that now, by
requiring an O(N) search, feels like it will make that sort of scaling much
much harder.
Maybe we should discuss the properties of a protocol with an intermediate
step. I wrote up some of this in #599. The idea would be that upload-time
could place shares anywhere it likes (to achieve good load-balancing, or
geographical diversity, or ideal download bandwidth, whatever), but it would
then write down a list of which servers got used, and store that list in a
specific (well-known, stable) set of places.
Download reliability would depend upon having both one copy of the sharelist
available and >=k shares. But the list should be small enough to let us
afford 1-of-N encoding and have lots of copies, so the sharelist's impact on
reliability should be dwarfed by the share's impact (if you can get 10 or 20
dBA better, it'll be lost in the noise).
However, repairers and rebalancers need to participate in the protocol: they
have to find and update the sharelist. And we have to quantify how much of a
problem it would be for the sharelist to be wrong, because some of the
sharelist-holding servers might be unavailable when you go to move or create
some shares. It's effectively adding a bit of mutable state to your
normally-immutable shares, with all the CAP Theorem consequences that
entails.
#599 suggests putting list of "where are the other shares" hints on each
share, which would turn your download algorithm into "search normally for the
first share, then use the hints to accelerate the search for the others".
This would get rid of the potential reliability penalty (since you get
fate-sharing between sharelists and shares), but couldn't accommodate
completely arbitrary share placement: you'd have to find at least one share
before you found all the other ones. So it might help improve performance on
large grids (where, due to regular churn, you might normally have to query
hundreds or thousands of servers to find enough shares), but still wouldn't
really permit the use of fancy load-balancing share-placement algorithms like
what we're discussing here.
Replying to warner:
Thanks.
Increasing the capacity estimate of a node can only move it nearer the start of the list for any given file (storage id). Similarly, decreasing the capacity estimate of a node can only move it further from the start of the list for any given file. I think this is the strongest stability property that could be expected.
(When I said "the peer id hash for a given server" in comment:11, I actually meant the hash of the peer id and the storage id. The properties above hold because the sample biasing is done after computing the hash, and doesn't affect its input.)
Using the remaining capacity rather than the total capacity would make the peer selection less stable. It should still be quite stable while the grid is not close to being full (since servers will tend to fill at a rate proportional to their initial capacity), but when any given server is nearly full, its remaining space relative to other servers would no longer be a good approximation to its initial capacity relative to other servers.
Yes. I'm not sure yet whether this outweighs the stability issue.
That could work if file caps contain an epoch number, and if the history of all remaining capacities at each epoch can be obtained by all nodes. However,
The epoch number scheme has most of these properties, with the caveats above, but the auxiliary information is the full history of remaining capacities (or fitness values; see below) at each epoch.
The nodes don't need to maintain connections to all other nodes, so that's not the scaling constraint. The obvious scaling constraint is the size of the location info for other nodes. When you get to the point where that information takes an unreasonable amount of memory, you can split the network into subgrids, with each subgrid having a set of supernodes (which should be the most available nodes in each subgrid). Then each node only needs to know the locations of all the supernodes, and each supernode only needs to know the locations of other nodes within its subgrid. This creates a small-world network in which any node is at most two hops from any other. So, you can scale to roughly the square of the number of nodes that would otherwise be feasible: use the permuted list algorithm to pick the supernodes for a given file, and have them use the algorithm again to route to the actual storage servers.
But this isn't the right ticket to discuss scaling to many nodes; that would be #235.
I think that the epoch scheme is a refinement of that. Note that the bias doesn't have to be by capacity; it could use any fitness function. Using a single fitness value for each server wouldn't give you geographic diversity, but it would allow biasing by bandwidth etc.
Yes. The size of a history of fitness values need not be much greater than the location and public key hash info, as long as there are not too many epochs.
Count me as a skeptic of the relevance of the CAP theorem; it depends on a very strong notion of consistency. In any case, if the auxiliary information is the history of fitness values, then that history is only extended, not changed, so we don't really have mutable state.
I'm not sure this has a significant advantage over the epoch number approach. It has the same disadvantages wrt. convergent immutable files, although it wouldn't necessarily leak information about when a file was uploaded.
Changed title from "Adjust the probability of selecting a node according to its storage capacity" to "Adjust the probability of selecting a node according to its storage capacity (or other fitness measure)".
Fun analogy for anyone who knows image processing: if the cdf of the uniform distribution corresponds to a greyscale ramp, then applying the 1 - (1-x)^e bias is effectively applying a gamma function to lighten or darken it, increasing the chance of picking a lighter or darker shade.
On the p2p-hackers list, Tony Arcieri wrote: