stop permuting peerlist, use SI as offset into ring instead? #302
We were chatting today about a subject that comes up about every three
months: why do we use a consistently-permuted peerlist when choosing where
shares should be placed?
The StorageIndex is the output of a hash function and therefore randomly
distributed, so the set of all storage indices in the network should be (in
the long run) evenly distributed across the SHA-256 2**256 numberspace.
Our current ["Tahoe2"]source:git/docs/architecture.rst#server-selection algorithm uses the file's
Storage Index to permute a list of all known storage servers (by hashing the
(SI+peerid) string and sorting the result). This adds an additional level of
random-distribution: each file gets a different ordering of storage servers.
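For concreteness, here is a minimal sketch of that permuted selection. The function name is made up and the exact hash and tagging details in the real Tahoe code may differ (a bare SHA-256 stands in here), but the shape is the same: hash (SI + peerid) for every server and sort by the digest.

```python
import hashlib

def permuted_server_list(storage_index: bytes, server_ids: list) -> list:
    # Rank every known server by the hash of (SI + peerid); each file's SI
    # therefore produces its own ordering of the same set of servers.
    return sorted(server_ids,
                  key=lambda peerid: hashlib.sha256(storage_index + peerid).digest())
```

The uploader then walks this per-file ordering, asking each server in turn to hold a share until all N shares are placed.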
The alternate approach that Zooko suggested today was to skip the permutation and instead define the algorithm to be: treat the nodeid-sorted list of servers as a ring, use the Storage Index as an offset into that ring (i.e. start at the first server whose nodeid is clockwise of the SI), and place shares on consecutive servers from there.
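A corresponding sketch of the non-permuted proposal, under the same assumptions (illustrative function name; nodeids and storage indexes treated as comparable byte strings):

```python
def ring_offset_server_list(storage_index: bytes, server_ids: list) -> list:
    # Sort the servers once by nodeid; every file starts at the first nodeid
    # clockwise of (>=) its SI and proceeds around the ring from there.
    ring = sorted(server_ids)
    start = next((i for i, peerid in enumerate(ring) if peerid >= storage_index), 0)
    return ring[start:] + ring[:start]
```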
The reason we went with the permuted list was because we were concerned about
the effects of non-uniform storage server capacities. There are two
situations to look at: light loading (i.e. nobody is full and all lease
requests are accepted), and heavy loading (some or most storage servers are
full, and are rejecting lease requests, so the uploader will visit other
servers to place their shares).
For light loading, the non-permuted approach causes any individual storage
server to experience a load roughly equal to the percentage of the ring that
is covered by its N counter-clockwise neighbors. On average, this will be the
same for all peers, but some peerids might be clustered more than others,
resulting in more traffic to those peers than the rest.
For heavy loading, once a server is full, all of the traffic that would have
landed on it will be redirected clockwise around the ring (imagine rain
flowing across the ground until it finds a hole large enough to form a
puddle). This will result in a concentration of load on the server just past
the full region, and may cause that server to fill quickly. The likely result
is a large section of the ring which is full, while there may be more space
available on the other side of the ring.
In contrast, the permuted approach removes the correlation between server
locations: each file sees all servers in different locations. So instead of
having "hot spots" (either caused by randomly-clustered peerids, or
randomly-clustered full servers), the load will be distributed more evenly.
We would expect the heavily-loaded grid to see all servers get full at
roughly the same time.
There are two arguments in favor of switching to the non-permuted approach.
The first is simplicity: it is slightly easier to explain the non-permuted
algorithm, and it is easier to predict where any given file's shares will
wind up. The second is a potential benefit for repair. The issue is as
follows: a storage server has just died (the hard drive experienced a fatal
error), and all of those shares are gone. Repair can be invoked to replace
the missing shares and bring the file back up to full strength, but which
files need to be repaired? The most convenient list of storage indexes that
need repairing was on the server that just died. Is there some other way to
construct this list of repair work?
(The assumption is that failure-driven repair is more sustainable than
constant repair of all known files. This depends heavily upon the numbers we
use: how many servers, how many files, how many clients, how many repair
processes, what bandwidth is consumed by repair, etc).
The benefit that non-permuted share distribution would offer is in the
resulting correlation between shares held by server B and those held by its
neighbors A and C. In a lightly-loaded grid, if all servers A+B+C have never
rejected a lease request and are equally old, then every storage index on
server B will also be on either A or C (assuming k>=2). Thus, if B dies, we
can use A and C to construct the list of repair work that needs to be done.
However, Rob astutely pointed out that there are plenty of other ways to
accomplish this same job. For example, each server could be assigned a
"buddy", using a simple foolscap protocol, and each time server B adds a new
share, it tells its buddy "D" about the storage index. If B dies, we ask the
buddy for the share list and dump it into the repair queue. We could use a
central server for this purpose, or distribute it out: what really matters is
that we have a way to find out who the buddy is.
We need to think more about this. I like the load distribution properties of
permuting, and I'm not particularly concerned about the descriptive
complexity, but I too am concerned about the repair design, and would like to
leave ourselves some tricks to pull out in case we run into future problems.
The B-probably-has-A+C benefit of non-permutation breaks down if any of the
servers are full or new, so I'm less convinced that it will be a significant
help. But so much of this is guesswork; we don't really know.
Suppose you are going to tell some database -- either a very decentralized special-purpose database such as the "buddy system", or a centralized general database such as an Oracle DB -- which files need to be checked for repair if you die.
If we permute the peerlist based on the fileid, then you need to tell that database the complete set of files that you are holding shares on, and you need to update that database whenever you add a share, or else lazily allow a certain degree of lag between receiving shares of a new file and updating that database. Likewise, it might be good to update that database whenever you remove a share.
If we don't permute the peerlist based on the fileid, then you could summarize by notifying that database about the upper bound and lower bound of the shares that you hold. If due to some exceptional condition (you are the only storage server reachable to some uploader) you accept shares that are outside your normal range, then you might want to notify the database about those ones individually, but in the common case you don't need to notify the database about changes because the changes fall into your normal bounds.
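A hypothetical sketch of that bookkeeping difference (the class names and notify_db callback are invented for illustration; Tahoe has no such API):

```python
class PermutedPlacementReporter:
    # With permuted placement, the repair database has to learn about every
    # individual storage index this server holds.
    def __init__(self, notify_db):
        self.notify_db = notify_db

    def share_added(self, storage_index: bytes) -> None:
        self.notify_db(storage_index)


class RingPlacementReporter:
    # With ring placement, a one-time announcement of the server's normal SI
    # range usually suffices; only out-of-range shares need individual notes.
    def __init__(self, notify_db, low: bytes, high: bytes):
        self.notify_db, self.low, self.high = notify_db, low, high
        notify_db((low, high))

    def share_added(self, storage_index: bytes) -> None:
        if not (self.low <= storage_index <= self.high):
            self.notify_db(storage_index)
```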
I'm not convinced that the load distribution properties of permuted-peerlist are actually better. Intuitively it feels better, and especially if you call correlations of maxed out servers "hot spots", but I don't think we really know whether one server that is out of space shedding overload to consecutive servers will be better or worse than that server shedding overload evenly to all other servers.
I will offer more detail about my last comment, in which I say that I'm not convinced that the load-distribution properties of permuted-peerlist are better.
Which load-distribution pattern is better will depend on two things: our monitoring/alerting mechanism, and our provisioning mechanism. We don't know yet what either of those two will look like.
Let me give a couple of concrete examples of how certain plausible such mechanisms would make the load-distribution pattern of the linear peerlist better than the load-distribution pattern of the permuted peerlist:
Monitoring/alerting mechanism: suppose we have a red light that goes on (e.g. an alert that triggers, sending e-mail to the operations team, for example), when a server reaches a critical low level of available storage. Suppose that, unbeknownst to the ops folks, the user access patterns have recently changed so that the grid as a whole is rapidly running out of space.
a. In universe A, with the linear peer list, one of those red lights goes on, because one of the storage servers is out of space. Subsequently, the red light from the next server over goes on, because the load shed from the first one has caused the second one to run out of space. And so forth, at a linear rate -- a new server goes red every K minutes or hours.
b. In universe B, with the permuted peer list, one of those red lights goes on, because one of the storage servers is out of space. No other alarms go off, because its load is shed evenly onto all other servers. Later, after K minutes or hours times the number of storage servers that we have, they all go red simultaneously.
Provisioning: suppose one of the ways that we like to provision more storage space is to roll out a new server. If we have a linear peer list then you know that it will soak up overload which is shed from peers counterclockwise from it in the ring. If we have a permuted peer list then it will soak up overload which is shed evenly from all peers. The former knowledge may be useful, for example if some of your current nodes have smaller capacity (i.e. are older) than others -- you might want to deploy the new node clockwise from them.
Now, I'm not saying that our monitoring/alerting/provisioning mechanisms will necessarily be like this. I'm saying that since we don't know that much about what those mechanisms will be like, we should choose the simpler scheme, even though personally the "more distributed" feel of the permuted scheme appeals to me.
An important point in the above hypothetical scenarios is that the linear ordering allows the ops team to conceptualize and predict the states of their servers. Simplicity is valuable for such reasons.
Zooko and I discussed this in more depth in Boulder the other week, and he's
convinced me to at least put a space for "which peer-selection method should
be used" into the next version of the URI string format. One setting would
mean the permuted-index that Tahoe currently uses. Another setting would mean
skip the permutation step.
It will take some further convincing before I'm ready to believe that
non-permuted should be the default. But the main argument that made me feel
that non-permuted should definitely be an option was that it may give grid
operators more control over the peer-selection process.
Specifically: imagine that we give storage servers two separate published
identifiers (included in their Introducer announcements). The first is their
cryptographically-secure foolscap nodeid, as usual. The second is a new
"storage index ring position", and is an arbitrary 160-bit number, not
authenticated at all. The default would be for both to be the same, but grid
operators could change the ring position all they wanted.
The non-permuted peer-selection algorithm would key off these ring position
numbers. They would still use the nodeid for things like computing shared
secrets (for write-enablers, etc), but they'd use the ring position numbers
to decide which servers they should talk to.
Now, suppose the grid operators bring 8 machines online, 4 in an east-coast
colo facility, and 4 in a west-coast one (always plan for meteors). Each
machine has 4 hard drives. How do you obtain maximum distribution of the
shares? With the permuted peer list, you just hope you get lucky.. on average
the shares will be fairly well distributed, but sometimes you'll wind up with
4 shares on the same machine (one per spindle), and much of the time you'll
get more shares on one coast than the other.
But with non-permuted peer-selection, you just have the grid operators assign
ring positions to be equally distributed around the ring, in an order that
achieves the distribution you want:
east-coast machine-1 spindle-1
west-coast machine-5 spindle-1
east-coast machine-2 spindle-1
west-coast machine-6 spindle-1
...
west-coast machine-8 spindle-1
east-coast machine-1 spindle-2
...
Until the servers fill up, this will provide maximum distribution of files
across failure boundaries. If the servers are fairly homogeneous, they're
likely to fill up at about the same time.
Now, there are things to be discussed about non-homogeneous servers, and
friendnet/non-managed grids in general, since I'm still concerned about what
happens when a big server shadows a second one by sitting right before it in
the ring. In a non-permuted scheme, the second server will see less traffic
than you'd expect (not zero traffic, the actual amount depends upon how much
space is between the N'th CCW node and the N-1'th CCW node). With random
ring-location allocations, I fear that this non-uniform distribution may be a
problem, but it will require some more analysis before I'd feel I understood
the issue well enough to know. The kind of simulation Zooko was doing
recently on the mailing list is a start, but he and I identified some
fundamental problems with that work (starting with "what are we trying to
measure?") that meant we can't really draw any conclusions from the results.
This is not a win for ops, actually.
Let's discuss our current case. I have 40 storage nodes running now. In order to avoid lumpy distributions of blocks, we need to distribute the ring positions of these nodes evenly. (There is some discussion about this, but I'm with Brian: uneven distribution of nodes will result in uneven distribution of blocks)
So now, I bring up 10 new nodes in LA. I want them spread around the ring, but there's no easy way to make that distribution uniform. So to get back to a uniform distribution, I need to change every node's index.
That's a HUGE admin overhead, which will cause me to request a tool to do it for me. So now I have some foolscap-speaking control mechanism, or we end up with a queen setting blockserver ranges again. Both of those are significant complexity for very little real gain.
One thing that Brian and I discussed as an alternative was to have a tag associated with a storage server that described it. I can think of maybe three tiers, and values up to 20 or so for each tier, so we should probably use 8 bytes. :) So byte 1 would be site number, byte two would be chassis number, etc. We'd then select a set of peers with a large average distance in those tags, possibly by selecting one peer, then reshuffling the list so the next peer is as far as possible from the first, etc. This isn't a well-formed suggestion yet, but something to think about.
I actually think that the current system is fine, and works well. My only real concern with the current mechanism is that I'm somewhat concerned about different drive sizes. It feels like disks should fill at a uniform rate on a percentage basis, not on a bytes basis. I'm willing to be convinced this isn't a problem, however.
To follow up on the "lumpy distribution" point: Zooko and I have gone back
and forth on this a bit. My current belief is that randomly-distributed
nodeids will not provide uniform share-upload request rates. Let me throw out
some rough analysis.
Imagine N=2 and four servers uniformly spread around the ring at A:90deg,
B:180deg, C:270deg, D:0/360deg. Clearly any SI that lands in the 1-90 range
will put a share on A and B, likewise if it lands in 90-180 the shares will
go to B and C, etc.
Now if we move A to 80deg, then SIs from 1-80 will land on A and B, 80-180
will land on B and C, 181-270 on C and D, 271-360 will land on D and A.
Assuming SIs are uniformly distributed, we get the following upload rates: A receives shares from 170/360 of all SIs, B from 180/360, C from 190/360, and D from 180/360.
So we get non-uniform upload rates: server C (which is N shares around the
ring from the larger-than-average gap) will fill up faster than the rest, and
server A (which sits just after a smaller gap) will fill up slower.
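A quick back-of-the-envelope check of that example (not from the ticket; just a few lines reproducing the rates above, assuming each SI lands its N=2 shares on the next two servers clockwise):

```python
# Servers A..D at 80, 180, 270, 360 degrees; each SI places shares on the
# next N=2 servers clockwise of where it lands on the ring.
servers = [("A", 80), ("B", 180), ("C", 270), ("D", 360)]
N = 2
load = {name: 0.0 for name, _ in servers}
prev = 0
for i, (name, deg) in enumerate(servers):
    arc = (deg - prev) / 360.0   # fraction of SIs whose first share lands on this server
    prev = deg
    for j in range(N):           # those SIs also hit the next N-1 servers clockwise
        load[servers[(i + j) % len(servers)][0]] += arc
print(load)  # A: ~0.47, B: 0.50, C: ~0.53, D: 0.50 -- C fills fastest, A slowest
```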
If N=10 and we have 80 servers spread around, the same effect holds: the
server that is 10 away from the larger gap will get more shares. It's
possible that the effect is diluted a bit by large values of N (although I
don't currently believe it), but the effect will be persistent: as long as
those servers sit at those ring positions, their relative upload rates will
be different.
The balancing tool that Zandr refers to would basically need to look at the
set of servers and shift their ring positions to make the gaps all uniform.
Each time a server was added, these positions would need to be readjusted.
OTOH, you could roughly control the traffic rate for any particular server by
adjusting the N'th CCW gap a bit.
But, this tool would be obligatory: the unmanaged behavior is bad. Using
(effectively) randomly-chosen serverids as ring positions will result in
(significantly, I believe) different upload rates, whereas we'd really prefer
the upload rates to be completely uniform (specifically we want to fill all
servers at the same time, to put off doubling up shares as much as possible).
Whereas with the current permuted-list algorithm, the default unmanaged
behavior is good.
So, I'm not yet convinced either way. I'm willing to leave space in the next
version of the URI for a "peer-selection-algorithm number", so we can have
some files uploaded without permutation. But I'm less convinced that it's a
good idea than I was last month when I was in Boulder.
Zooko suggested I add a note about the following idea which came up in tahoe-dev:
Suppose an attacker gets to kill N servers of their choosing, and want to cause as much damage as possible. And suppose that there were far more than N servers in the grid, and we're using 1-of-N encoding. Now, if we're using the permuted-list algorithm, they could pick one file to completely kill (choose an arbitrary file, locate its servers, kill them all; boom, the file is dead). But killing two files is awfully hard: you'd have to be lucky and find two files that happen to permute to the same first N servers. I think the chance of killing a second file is like 1 over (M choose N), where M is the size of the grid: i.e., the number of permutations is huge.
And of course killing a third file is that probability squared, etc.
Whereas if you aren't using the permuted-list algorithm, and shares are placed on consecutive servers starting at the SI, the attacker can do a lot more damage. They just take out any N consecutive servers. They'll completely kill 1/M of the files on the grid (since there are only M total permutations in use, one for each server). And they'll kill all-but-one of the shares for another 2/M files (the two immediate neighbors), and all-but-two of another 2/M files, etc, in a sort of triangularly-shaped distribution.
So I still think that permuted-list provides better properties.
The simulator that I wrote showed that the effect Brian described in comment:64611 was lost in the noise -- the permuted-per-file uploads and the flat ring uploads filled up their servers in indistinguishably variable and noisy patterns. The two share placement strategies were identical for the first 96% of the run (that is: zero servers were full after the first 96% of the time, and then all servers were full at 100% of the run, regardless of which share placement strategy was used). Also, the difference between the two strategies in that four-percent interval was not salient. It was entirely up to luck (i.e., the pattern of which storage indexes appeared at random) whether some servers would fill up significantly faster than others, regardless of which share placement strategy was used.
Here is my note about it which includes a link to the simulator code: http://allmydata.org/pipermail/tahoe-dev/2008-July/000676.html .
Thanks for providing a link to the code! That will help me a lot.
I have failed to take the time to review the simulator code properly (as I said I'd try to do the last time this was brought up), but what I remember at the time was not believing that it was testing the right thing: I seem to recall believing that it was averaging together several values or several runs and thereby destroying the information which we were most interested in. So I need to examine the results carefully (and probably write a simulator or two of my own) before I'll be willing to act upon them.
I think it would be helpful to simulate a progression of encoding parameters, where we increase N from 1 to 10, and see how it affects the share distribution. The "lumpiness" that I mentioned earlier would be most noticeable when N=1 and become less noticeable as N grows. Perhaps Zooko's simulator was run only with N=10. Knowing just how much this effect goes away with higher values of N would be useful (as well as knowing how much it increases with other parameters, like the variance of storage-server capacities).
Attachment ringsim.py (10261 bytes) added
Brian's v1 simulator
ok, so the "ringsim.py" simulator that I just attached to this ticket
demonstrates one of the concerns I described above: non-permuted
peer-selection will result in higher bytes-per-second upload rates to some
servers than to others. (I haven't yet built a simulator to investigate the
effect of full servers shedding load onto their clockwise neighbors).
Run the code like "python ./ringsim.py --seed=abc --permute=1". It will create a ring of 100 servers using "abc" as a seed to decide their nodeids (any specific seed will result in a consistent distribution of nodeids). Then it will upload files (each with a size in randrange(2GiB), mean size 1GiB) one at a time. Every few thousand uploads it will analyze the space used per-server and emit a report line like:
The first block of numbers is "usage-per-file-per-server", meaning how much
storage space was used on each server, divided by the total number of files
that had been uploaded so far. If we pretend that we're uploading files at a
rate of one per second, this is actually measuring bytes-per-second. The
"min" value of 33.86 MB means that the least-used server had received 33.86MB
per file (i.e. per second). The most-used (fullest) server had received
38.66MB per second. Our average filesize of 1GiB and 3-of-10 encoding
parameters means that we'd expect to place 35.72MB per-server per-file.
The "spread-pf" is the difference between the least-used and most-used
servers: 4.80MB = 38.66-33.86. 13.45% is that spread expressed as a
percentage of the expected usage value.
The "stddev" is the standard deviation of all 100 servers' usage values. If
usage were perfectly uniform, this would be zero. 2.81% is the standard
deviation expressed as a percentage of the expected usage value.
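As a sanity check on those numbers (not part of the original report; this just recomputes the expected value and spread under the stated assumptions of 3-of-10 encoding, 100 servers, and a ~1GiB mean filesize):

```python
k, N, num_servers = 3, 10, 100
mean_filesize = 2**30                    # ~1 GiB, the mean of randrange(2GiB)
expected = mean_filesize / k * N / num_servers
print(expected / 1e6)                    # ~35.8 MB per server per file, near the quoted 35.72MB

min_usage, max_usage = 33.86e6, 38.66e6  # the min/max from the report described above
spread = max_usage - min_usage           # 4.80 MB
print(100 * spread / expected)           # ~13.4%, i.e. roughly the quoted 13.45% spread
```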
The simulator will run nearly forever. Run it with --permute=1 and notice how the min/max values converge on the expected value over time, and how the spread and stddev drop towards zero. In my test run, after 200000 files, the spread was down to 1.61MB (4.5%) and the stddev down to 265kB (0.74%). This is the law of large numbers in action.
Now, re-run the simulator with --permute=0 and --seed=abc. It runs much faster (because linear ring-offset selection is a lot easier than hash-based permutation). Look at the usage report for 16000 files:
The spread is enormous, as is the standard deviation. The least-used server
is using roughly an eighth as much space as the most-full server, whereas in
the permuted case they were using within 15% of each other.
And if you let it run for a while and look at the 200000 file report, it
doesn't get better over time:
Even after 200k files, the least-to-most-used ratio is 8x. And the stddev is
basically constant.
A bit of extra code reveals why. The least-used server has a nodeid that
starts with dff96a, and the neighboring portion of the sorted list of
serverids (with the separation between each node and its CCW neighbor) shows:
100 uniformly-distributed servers would have a separation of 028F5C28F6...
but the randomly-chosen nodeids in our --seed=abc ring are not uniformly distributed. In this case, lucky node dff96a happened to land unusually close after node dfba0, with a separation of just 003f5d..., about one tenth the ideal (uniform) separation. In fact it sits at the end of an unusually dense cluster of nodeids.
(What we actually care about is the separation between node dff96a and its 10th CCW neighbor, since we're encoding each file into 10 shares. The separation between dfba0 and dff96a is a big contributor to this, but not the whole thing.)
And similarly, the most-used server was 4f5ab8, and that portion of the ring
looks like:
The 4f5ab8 node is sitting just clockwise of an unusually large gap, from
4681ad, with a separation of 08d90b, about 3.75 times the ideal (uniform)
separation.
This is the "lumpy distribution" problem that I was worried about. The effect
is reduced when shares are spread over more servers. If I re-run the
simulation with
--N=40
(3-of-40 encoding), I see a spread of about 50%the expected value, and a stddev of about 15%. There is a corresponding
increase in the effect when shares are spread over fewer servers:
--N=5
gives me a spread of 195% and a sddev of 46%.
The effect is easiest to understand when k=N=1. In that case, the inlet rate
for any given server is strictly equal to the total upload rate times the
fraction of the ring that lies between that server and its nearest CCW
neighbor. For our "abc" seed, the smallest separation is between node b80ea
and b8159, with a gap of 0006ea (which is 1/95 of the uniform gap), and the
largest is between 6ef56 and 7b41d, with a gap of 0c4c6 (about 4.8 times the
uniform gap). So we'd expect lucky node b8159 to get about 1/95 of the average per-server traffic, and unlucky 7b41d to get about 4.8 times the average (about 0.048 of the total traffic), and a ratio between the two of about 455x.
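A small sketch of that gap arithmetic (illustrative only -- random toy nodeids in a 32-bit ring stand in for the real 160-bit ids):

```python
import random

def inlet_fractions(node_ids, ring_size):
    # For k=N=1, each server's share of the upload traffic is just the fraction
    # of the ring between it and its nearest counter-clockwise neighbor.
    ring = sorted(node_ids)
    return {nodeid: ((nodeid - ring[i - 1]) % ring_size) / ring_size
            for i, nodeid in enumerate(ring)}

random.seed("abc")
toy_ids = [random.randrange(2**32) for _ in range(100)]
fractions = inlet_fractions(toy_ids, 2**32)
print(min(fractions.values()), max(fractions.values()))
# The min/max ratio is typically very large -- the same lumpiness that produces
# the 95x and 4.8x figures discussed above.
```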
And indeed, although b8159 and 5a82c are in competition for least-used
server, after about 300000 files, we get this report:
And 10.68MB/153kB is 70x, and 52.44/10.68 is 4.9x, matching our expectations
of 95x and 4.8x pretty closely.
If you re-run the program with e.g. --seed=def --permute=0, you'll get a different distribution of nodeids, which happens to get a 4x ratio between most-full and least-full, and a stddev of about 30%. Better, but still pretty bad. --seed=def --permute=1 behaves just as well as the "abc" seed: stddev is again down to 0.74% after 200k files.
If you're lucky and find a seed that gives you a uniform distribution, then --permute=0 should give you the same statistics as --permute=1. But for most seeds (i.e. most grids), you'll get a very lumpy distribution. --permute=1 tolerates arbitrarily lumpy server distributions.
So, based upon this simulation, I'm fairly convinced that permuted-list is
necessary to avoid long-term uneven upload rates to different servers. A
simple linear ring-offset algorithm will subject servers to vastly different
loads unless the nodeids can be controlled to maintain a uniform distribution
(which means changing them every time a server is added or removed).
Now, do we actually need uniform upload rates? What we really want, to attain
maximum reliability, is to never double-up shares. That means we want all
servers to become full at the same time, so instead of equal bytes-per-second
for all servers, we actually want equal percentage-of-space-per-second for
all servers. That's much trickier. If we could completely (and continuously)
control nodeids (by decoupling peer-selection index values from
cryptographic-backed server pubkeys), we could adjust them to achieve
inter-server gaps that compensate for how much space they have remaining:
small servers would be clustered closer together, large servers would be
placed CW from large gaps. The math necessary to do this strikes me as pretty
complicated, and I think that changing nodeids over time would damage
efficient retrievability, since shares will no longer be in the ideal places
when the downloader tries to perform the same peer-selection routine as the
uploader did.
We could also have servers refuse some percentage of incoming shares even if
they had space for them, to get their percentage-full-per-second rates down
to match the grid-wide average. This would induce the same problems that the
ring-offset and lumpy-distribution scheme has: servers which happen to sit CW
of a self-throttling node will get more traffic than usual.
OTOH, it's a little bit easier than that: we don't need to engage in this
load-shaping work until we start to run out of servers. If we have at least
"N" servers with space available, then reliability is unaffected by the rate
at which they're filling up. So we could have servers accept shares at full
speed until it looked like the grid was starting to fill up, then have them
switch into a mode where they defer requests to other servers more and more
(to obtain uniform fill rates) as the remaining space dwindles. The shaping
effect would be negligible in a grid with lots of free space. A managed grid,
for which new servers are added before the grid gets full, would never need
to engage in load shaping. But any amount of load shaping that was being
performed would put off the day at which the first server gets full.
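One plausible shape for such a throttle (purely hypothetical, not anything Tahoe implements): accept everything while the grid has plenty of room, then scale each server's acceptance probability down by how far its fill fraction is ahead of the grid-wide average.

```python
def acceptance_probability(my_fill: float, grid_fill: float, threshold: float = 0.8) -> float:
    # Hypothetical load-shaping rule: no throttling until the grid as a whole is
    # mostly full; after that, servers filling ahead of the grid-wide average
    # defer an increasing fraction of lease requests to other servers.
    if grid_fill < threshold:
        return 1.0
    if grid_fill >= 1.0:
        return 0.0
    lead = my_fill - grid_fill
    if lead <= 0.0:
        return 1.0
    return max(0.0, 1.0 - lead / (1.0 - grid_fill))
```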
So, in summary, I am re-convinced that linear ring-offset has real problems,
and that permuted-list provides a more uniform bytes-per-second inlet rate,
which is easier to deal with and gives better system-wide properties.
Attachment ringsim.2.py (10953 bytes) added
Brian's v2 simulator, prints nodeid gaps and min/max nodeid
Thanks for doing this work to simulate it and write up such a detailed and useful report! I think you are right that the unpermuted share placement can often (depending on node id placement and N) result in significantly higher inlet rates to some storage servers than others. But as you say it isn't clear how much this matters: "Now, do we actually need uniform upload rates? What we really want, to attain maximum reliability, is to never double-up shares. That means we want all servers to become full at the same time, so instead of equal bytes-per-second for all servers, we actually want equal percentage-of-space-per-second for all servers."
Note that in actual deployment, storage servers end up being of multiple generations. For example, on the allmydata.com prodgrid the oldest servers are running 1 TB hard drives; once those started filling up we deployed the thumper, which comprises about 48 storage servers each with a 0.5 TB hard drive; and once the thumper started getting full we deployed a few more servers, including ten which each had a 2 TB hard drive. The point is that there was never a time (after the initial deployment started to fill up) where we had similar amounts of free space on lots of servers, so that equal inlet rates would lead to equal time-to-full.
My simulator (mentioned earlier in this thread) reported time-to-full instead of reporting inlet rate, and it indicated that regardless of whether you have permuted or non-permuted share placement, if you start with a large set of empty, same-sized servers and start filling them, then once the first one gets full then very quickly they all get full.
Note that there are two separate arguments: 1. A more uniform inlet rate might not be so important. 2. The time between the first one filling and the last one filling is a small fraction of the time between the start of the grid and the last one filling (regardless of share placement strategy).
I guess I'm not sure how you got from "do we actually need uniform upload rates?" to "easier to deal with and gives better system-wide properties" in your comment:64615.
Oh! Also note that "What we really want, to attain maximum reliability, is to never double-up shares" is at least partially if not fully addressed by #778.
The results of Brian's and Zooko's simulations seem contradictory to me. Suppose that one server is filling up x times faster than the average of all the servers. Then, for the case where all servers have the same amount of storage, I'd expect that server to fill up at the point where the grid is only approximately 1/x full. Is it my expectation that is wrong, or is it one of the simulations (hopefully not both :-)?
In any case, if #778 is fixed, then a non-uniform upload rate shouldn't affect preservation of files, for as long as there are still happy servers that are not full. But if some servers become full early, then that will affect upload performance because uploads will be expected to have to contact more servers before finding one that is non-full.
Also, if #778 is fixed but #543 and #699 aren't, then in order to recover from a full grid you would need to simultaneously add happy servers (no matter how much extra space each new server has). That sounds like it could be an operational problem. So fixing #543 and #699 should be a priority after releasing 1.6, I think.
Attachment ringsim.3.py (8203 bytes) added
v3 of brian's simulator
Replying to davidsarah's point that servers filling early will hurt upload performance: it will also hurt download performance, because fewer shares will be in the expected places near the front of the permuted list.
Zooko: you make excellent points about heterogeneous server capacities. I've always gone back and forth between different mental images of how a Tahoe grid "ought" to work:
* a managed universe of homogeneous servers that grows in batches: buy a wave of identical servers, fill them up, add more, repeat (the first few years of allmydata.com, in which each "phase" or "wave" of servers usually consisted of 20 drives, 1TB each, four drives to a 1U chassis)
* a managed grid that grows in one large jump (a later year of allmydata.com, with the 48*500GB thumper)
* a loose grid of heterogeneous volunteer servers, letting them come and go over time (testgrid, volunteer grid)
I've also gone back and forth in my assumptions about how files will come and
go. In the latest-snapshot-only backup world, we generally assume that a new
user starts by uploading a whole bunch of files, then switches into a
"maintenance" mode in which a small percentage of those files are replaced by
new versions every few days. In maintenance mode, you're probably using more
space than you're deleting, but at a much slower rate than during the initial
upload. If you're adding more users than you're removing, the overall usage
rate will remain positive, so you'll have to add more disk space eventually.
It's like economic inflation, but for disk space.
Each of these styles will result in different sorts of share placement
behavior. The allmydata.com phase1/phase2 style has resulted in files living
in a specific generation of server, which of course is not ideal for
downloads (since the shares are never in the ideal place): rebalancing is the
nominal answer, but the traffic costs may be prohibitive.
Zooko asks how I got from "do we actually need uniform upload rates?" to "easier to deal with and gives better system-wide properties".
Let me first correct something I said: earlier I wrote that what we really want, to attain maximum reliability, is to never double-up shares.
Actually, we want more than that. We want to maximize diversity of share
placement. If we never doubled-up shares, but always put files on the same
sets of servers (either servers 1-10, or servers 11-20, or 21-30, but never
any other sets like 1,3,7,14,16,21,26,28,29,30), then random server failures
have a higher probability of killing files. We get the lowest probability of
losing files when there is minimal correlation between share placement of
each file. We've discussed this before, partially in this ticket (the
"Suppose an attacker gets to kill N servers of their choosing" comment:64612),
and partially elsewhere. I've built a simulator to prove this before, so I
can expand on the idea if it's contentious.
So for maximum reliability, which I'll put into the category of "better
system-wide properties", we want to keep as many servers available and in the
pool of non-full servers for as long as possible.
In the allmydata.com phase1/phase2 scenario, all the servers had the same
sized drives. We started with 20 servers (1TB each), and they all started to
get full at about the same time. I think it took something like 20 weeks to
fill those. When they were about 90% full (say two weeks to go), we bought a
new batch of servers and brought them online. This dropped the uniform
(per-server) inlet rate in half (since we then had 40 servers instead of 20).
Four weeks later, the old batch finally filled up, at which point we were
back down to 20 free servers and the inlet rate bumped back up to normal. We
had 18 weeks of files spread among just the phase1 servers, then 2 weeks of
files spread among phase1+phase2, then a duration (17 weeks? 19? math is
hard) when files were only stored on the phase2 servers, then we spun up a
phase3, etc. Even when we added the thumper, with its 48 drives of 500GB
each, the available server space consisted mostly of uniformly-sized drives
(i.e. phase4 was entirely the thumper).
To get maximum reliability from that environment, you want to fill them at
equal bytes-per-second rates, because that way they'll all get full at the
same time (and you'll always have a maximum number of hosts to choose from).
You'd get even better reliability if you bought all your phases ahead of
time, and filled 40 drives at half the rate for 40 weeks, instead of 20
drives for 20 weeks and then a different 20 drives for the next 20 weeks. But
capital costs, interest rates, predicting future needs, etc (all the usual
business decisions) encourage just-in-time procurement policies, so 20 active
servers was a reasonable compromise.
If we had upload rates that were as non-uniform as the ring-offset algorithm
would provide, we'd fill some servers in just a few weeks, reducing the pool
of active servers from 20 down to 18 or 16 or something, which would reduce
overall reliability; not as badly as losing 11 servers such that we were down
to 9 active ones and started doubling up shares, but still a measurable loss.
Now, if you have a bunch of heterogeneous servers, like on the testgrid or
the volunteergrid, it's more complicated. There is no way to maintain maximum
diversity: either you shape your traffic (intentionally sending more shares
to large servers than small ones, in an attempt to fill everybody at the same
time), or you send uniform rates to everybody and fill the small ones first
and then end up sending more shares for the later files to the large servers.
Both approaches eventually result in non-ideal distribution. I think we just
have to accept this fact: files that are uploaded during periods of more free
servers will be more reliable. This might be at the beginning of time, when
small servers aren't full yet, or later when new servers join the grid, or
just after some sort of rebalancing process causes small servers to become
briefly available again.
Having non-uniform upload rates in a heterogeneous grid means that a lot of
your properties become the luck of the draw. Really small servers will be
filled so quickly that you can just ignore them. The number of available
servers at any given point in time will depend upon both the variation
between upload rates and the variation in server capacity. My previous
simulation showed that k=3,N=10,servers=100 could easily result in an 8x
variation in upload rate. We could imagine that the smallest server anyone
might run would be say 10GB, and the largest would be say 4TB, giving a 400x
range in capacity. Both are huge.
So having uniform traffic rates is a good thing for homogeneous grids (or
piecewise homogeneous grids like allmydata.com's successive waves of
servers), and it would be hard to predict its effects on a heterogeneous
grid. Good for one and neutral/unknown for the other.
As for "easier to deal with", I think it will be easier for us to tell server
operators that they'll each be receiving 1/X of the total upload traffic
(where X is the mostly-constant number of non-full servers, something that
will be easy to display on the Welcome page), than to tell them that their
traffic will be different for each server and will depend in hard-to-predict
ways upon the exact distribution of nodeids and the encoding parameters
chosen by each uploader. We could build a tool that took k+N+serverlist and
computed a rate for any given server, and display that on the welcome page,
but I'd expect users to look at us funny and wonder why it keeps changing and
why they can't control it (assuming that it proves hard to decouple nodeid
from ring position). I think users would understand the practical effects of
a uniform rate much more easily, and then be able to answer useful questions
like "how long will it be until my server is full".
Zooko writes that, according to his simulator, once the first server gets full then very quickly they all get full, regardless of the share placement strategy.
With all due respect, your conclusion is wrong. In fact, the message you wrote in http://allmydata.org/pipermail/tahoe-dev/2008-July/000676.html doesn't actually support this conclusion: once you remove the text that was negated by the misleading averaging-of-many-iterations, what remains is hard to draw any conclusions from. (In the future, if you conclude A and then realize that A is wrong and then examine B, please write a message that says "B" instead of a message that says "A! Not A! B". I'll get more out of it and I'll be more convinced by your B conclusions. Much of my vague disbelief about your simulator's results was due to the misleading train of thought through A+!A territory :-).)
Now, the examples of --iters=1 you attached do appear to support your
conclusion, in that it showed about 137k files before the first server was
filled, and about 140k files when the first wrap occurred. I ran your
simulator some more myself (with the same settings as my own: 3-of-10, 100
servers) to learn more, and saw similar results, all of which were at odds
with the simulator that I wrote yesterday.
When I run the current version (v3) of my ringsim.py with --permute=0, I see
the first server being filled at about 83k files, whereas the first wrapped
file (which occurs when the numservers-N'th server is filled) occurs at about
128k files, and the entire grid fills at about 129k files. This hardly counts
as "once the first one gets full then very quickly they all get full": the
two events are separated by a third of the total uploads.
Now, why did your simulation results differ so greatly from mine? I was missing this until just today, but the source of the problem is a bug in your simulator. The SI-as-offset-into-ring algorithm, as defined in the Description of this ticket, starts at the first server whose nodeid is clockwise of the SI and places shares on consecutive servers around the ring from there. However, the simulate_load.py code which was supposed to simulate this algorithm actually implements an entirely different one, which is clearly insensitive to the distribution of the nodeids. My simulator implements the real rotate-by-SI algorithm, and is thus sensitive to the lumpy distribution. That's why I was seeing a big difference between first-full and last-full, and you were not.
Your simulator's algorithm is even simpler. Maybe we should consider pursuing it, instead of these other two? Let's call it "random-spin". The full specification would be something like: sort the servers by nodeid, derive a spin count from the SI, rotate the sorted list by that many places, and place shares on consecutive servers from there, ignoring where the nodeids actually fall on the ring.
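A minimal sketch of that idea, assuming the spin count is taken mod the number of servers (function name invented for illustration):

```python
def random_spin_server_list(storage_index: bytes, server_ids: list) -> list:
    # Rotate the nodeid-sorted server list by a spin count derived from the SI;
    # unlike ring-offset, the result never depends on the gaps between nodeids.
    ring = sorted(server_ids)
    spin = int.from_bytes(storage_index, "big") % len(ring)
    return ring[spin:] + ring[:spin]
```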
Unfortunately, using SI % len(servers) for the spincount calculation would be massively sensitive to the number of servers: adding or removing a single server would effectively randomize its value, causing the downloader to look in entirely the wrong place. We need something more stable.
Perhaps if the spin count were recorded in the filecap? We could either
generate it randomly (and give up any pretense of convergence), or use the SI
to calculate it in the first place (and get location-of-storage convergence
iff the serverlist remained exactly the same). Either way, if we write it
down, the downloader could use that copy from the filecap, making it a bit
more stable.
How sensitive is this to server churn? Specifically, if a server gets added,
how much overlap will there be between the downloader's servers-to-query list
and the uploader's? If the new node appears CW of the last (most-CW)
uploader-server, then nobody will notice. If it is added CCW of the last
uploader-server, then the overlap will be reduced by one. Adding multiple
nodes in a single area affects the overlap to a varying extent depending upon
the insertion point: it hurts the most if it's CCW of the first
uploader-server, dropping to zero impact if it's CW of the last
uploader-server.
This is a bit troubling, because it means that some files are going to be
more sensitive to churn than others. A file that has a spincount of zero is
most tolerant: 10% churn means only 10% loss of overlap. But if the spincount
is high, so the set of uploader-servers starts at the 90% point of the ring
(11 o'clock, nodeid 0xff..), then every single inserted server will mess up
the overlap. With 3-of-10 encoding, adding 10 servers anywhere on the ring
(even if there were a million servers there already) would double the
downloader's search distance, and removing 10 servers would cause the
downloader to have to query every single node in the grid to find its shares
(the starting point would be beyond all the right nodes, so it'd have to
search the long way around).
I believe the permuted-list algorithm is more uniformly sensitive to churn:
regardless of where the file lives, adding 10% new nodes (i.e. adding 10
nodes to a grid that used to have 100 nodes) will reduce the overlap by 10%.
Adding 10 servers to a million-node grid would have negligible effect.
I suspect that random-spin is, on average, equally sensitive to churn as
permuted-list, but I believe there would be a huge variation from one file to
the next, enough to cause significant problems for downloaders.
So again I'm back to favoring permuted-list.
Well, no, #778 ("use servers-of-happiness instead of shares-of-happiness") is
a question of what qualifies as a successful upload versus a failed upload.
It doesn't say anything about how to choose the servers. Instead, as I
understand it, it's about evaluating the results of an upload to decide
whether it needs to be repeated (by reporting it as a failure to the user,
who will presumably try it again). #778 seems to provide a function that is run after the upload is complete, one which takes the set of surviving
servers (i.e. the ones which stuck around until the end of the upload
process) and returns a success/fail boolean.
To attain maximum reliability, in the long run, requires a server-selection
policy/algorithm that gives us good results and maximum diversity across
thousands and millions of uploads, with continual server churn. #778 is too
narrowly scoped for that.
davidsarah asked whether a server that fills up X times faster than the average would become full when the grid is only about 1/X full.
Yes, that's correct. When my ring-offset simulation (ringsim.py) is run with parameters --seed=abc --servers=40 --k=1 --N=1 --permute=0 --fileseed=123, then X is about 4 (about 9MB/upload vs an expected average of 2.28MB/file). We'd expect the grid to become full after 439201 uploads, but the first server becomes full after only 110344 uploads. (The current version of ringsim.py is fully deterministic: you should be able to run it with the same parameters and get exactly the same results.)
Note that my previous simulator didn't make it easy to tell when a server
became full, and had set the server capacity ridiculously high (1PB) to focus
on the long-term bytes-per-second inlet rates (it also didn't actually
implement the fall-over-to-other-servers code that would be necessary with
finite capacity). The current version puts more attention on the grid-is-full
period and handles server fall-over accurately.
Zooko's simulation was actually of a different algorithm, "random-spin",
which has uniform inlet rates, but has other problems, described above.
On a per-file basis, yes. I think a non-uniform upload rate in a homogeneous
grid will result in the (fewer than $HAPPY non-full servers) state happening
slightly earlier. As evidence, I look at ringsim.py's results for the "FIRST
FILE WRAPPED" event, which is a more stringent condition than $HAPPY would be
(it's equivalent to setting $HAPPY=N). With --seed=abc, I see ring-offset
(i.e. --permute=0) getting the first wrap at 124801 files, and the grid being
full at 126738 files, and I see permuted-list get the wrap at 126528 and
grid-full at 126738. So permuted-list was able to upload about 1700 more
files without doubling-up shares than ring-offset could do.
But I'll continue to argue that, when you look at multiple files as a group,
you get better reliability when you have more servers, in which case the
filling-some-servers-earlier behavior will be detrimental to even non-wrapped
files (ones that #778 would declare as successful).
I haven't even been thinking about upload performance, but yeah, that's
right. I suspect that random-spin and ring-offset will suffer problems here:
some section of the ring will get full, and then any upload which touches
that section will have to hunt a long distance for free servers (whereas
other portions of the ring will have short searches because they're all
free). Whereas permuted-list will effectively get a random list of servers
every time, so the uploader's search distance should grow smoothly as the
grid fills up, each file getting roughly the same distance. In other words,
the mean of the upload search distance will grow as the grid fills, but for
permuted-list the standard deviation will remain low, whereas for random-spin
and ring-offset the stdev will be high.
This is essentially #872.
Oh! You're right, my simulator was not simulating random-index-into-ring correctly. Sorry about that.
Okay, I was wrong because my simulator had a bug, and Brian was right that a flat ring has substantially worse load-balancing properties. I think we should close this ticket as wontfix, but I'd like to read through it one last time and extract any valuable bits first.
In Boris Mejías' thesis, the non-uniform distribution problem in ring-based DHTs is mentioned in section 2.3.1. He references this paper on how to solve it, although I haven't checked whether that would be applicable to Tahoe.
I think this ticket should now be wontfixed.
Okay, I'm finally closing this ticket! Brian was totally right, and his original idea of permuting the ring per-file-id was great, we've used it all along, and it does help a lot with issues of load-balancing storage which are endemic to the whole consistent-hashing concept. Thanks, Brian! ☺