"shares of happiness" is the wrong measure; "servers of happiness" is better #778
metcarob posted a nice clear bug report to the list:
http://allmydata.org/pipermail/tahoe-dev/2009-August/002494.html
Zooko writing: I've mentioned this a few times before, but I don't think there is a ticket about this exact issue. Basically, the concept of "shares of happiness" (the number of shares, the successful upload of which means that your file is safely uploaded) is almost never what people want. A more useful concept is "servers of happiness" (the number of servers, the survival and correct function of which will guarantee that your file is available).
"Servers of happiness" isn't going to be right for everyone, though. Eventually some people are going to need #573 (Allow client to control which storage servers receive shares), where they get to specify complex policies like "This upload was successful if at least three different geographic sites have
K+1
shares each.".I'm marking this as "priority: critical" instead of the standard priority (which is called "major"), because it could be a reliability problem for users such as metcarob who aren't already familiar enough with the details of Tahoe-LAFS architecture to avoid the problem. I'm also marking it with the keyword "reliability".
The following clump of tickets might be of interest to people who are interested in this ticket: #711 (repair to different levels of M), #699 (optionally rebalance during repair or upload), #543 ('rebalancing manager'), #232 (Peer selection doesn't rebalance shares on overwrite of mutable file.), #678 (converge same file, same K, different M), #610 (upload should take better advantage of existing shares), #573 (Allow client to control which storage servers receive shares).
I'd be interested in trying to fix this.
From what I can tell, share creation and uploading happens in two places:
Am I missing any?
I'm also a bit confused at the logic in [publish.py]source:/src/allmydata/mutable/publish.py. It doesn't seem to refer anywhere to the idea of a happiness value for shares -- is there a reason for this?
In any case, this ticket (unless I'm misunderstanding something) breaks down something like this:
Thoughts?
Hooray! Thank you for volunteering, Kevan!
Uploading an immutable file is a safe thing to do -- each share that you upload will either end up being properly served by a server or it won't. Success means enough shares are hosted by enough servers. If the upload fails, no harm was done. Publishing a new version of a mutable file is less safe, because each share that gets accepted by a server overwrites any share of the same file and the same sharenumber that was previously held by that server. Once you've started overwriting older shares then you really ought to finish -- you can't undo, and if you don't finish writing all of your shares then your new version will be unhealthy.
So this ticket is easier than you thought. First of all, you can ignore mutable files entirely, and second, the only change to be made to the share placement algorithm for publish.py is "when to give up and report failure".
Here is a recent motivating example, in case you missed it:
http://allmydata.org/pipermail/tahoe-dev/2009-August/002494.html # Why no error message when I have only one storage node?
So the basic test is: with 3-of-10 encoding but only 1 storage server, try to upload the file; the upload should give up and return an error message to the user because 1 < servers_of_happiness. (By the way, a sane default value for "servers of happiness" might be ceil((k + m) / 2).)
I have an alternative idea which I described on the mailing list:
http://allmydata.org/pipermail/tahoe-dev/2009-August/002605.html
In my view shares and servers of happiness are both wrong, though servers is less wrong than shares.
I think the right view is "probability of survival of happiness". Given some assumptions about server reliability, we can calculate for any given distribution of shares what the probability of file survival is over a given interval. That means that we can also calculate (perhaps directly, or perhaps through a quick numerical search) the number of shares we need to distribute in order to achieve a given level of reliability.
This would fix the original problem of this thread: If it is impossible to distribute enough shares to enough servers to achieve the required reliability, then the upload would fail. Ultimately, I think it's a far more useful and flexible approach to the issue. A future Tahoe incarnation that tracks statistics on the availability of peers could estimate their reliability individually, and generate and distribute additional shares to attain the required file reliability. Heck, given aggregate data from a large enough number of grids, we might be able to estimate reliabilities for given hardware/OS configurations to feed into the mix. A share on a Debian stable machine with RAID-6 storage and 100 days of uptime is worth more than a share on a laptop running a copy of Vista which was re-installed last week. Actually implementing the code required to gather all of that sort of data and usefully synthesize it would be a significant project all on its own, but the point is that it would fit within the reliability-based framework, as would anything else we could dream up to make reliability estimates more accurate.
The main reason this is better, though, is that even if the file reliability is computed from estimates that are of somewhat questionable value, it still gives the end user a better way to know/specify what reliability they want to obtain than simply specifying "servers/shares of happiness" and fixed FEC parameters.
If this is of interest, I will be happy to write the code to compute M given K, r (the desired reliability threshold) and estimates of server reliability.
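Roughly, the calculation being proposed can be done by brute force over server-failure combinations. The following sketch is illustrative only -- the per-server survival probabilities are assumptions, and it treats every share as living on exactly one server:

```python
def file_survival_probability(share_counts, survival_probs, k):
    """Probability that at least k distinct shares remain retrievable over
    the interval, assuming independent server failures and that each share
    is held by exactly one server.  share_counts[i] and survival_probs[i]
    describe server i."""
    n = len(share_counts)
    total = 0.0
    for mask in range(2 ** n):
        p = 1.0
        shares = 0
        for i in range(n):
            if mask & (1 << i):
                p *= survival_probs[i]
                shares += share_counts[i]
            else:
                p *= 1.0 - survival_probs[i]
        if shares >= k:
            total += p
    return total

# e.g. k=3, ten servers holding one share each, 90% per-server survival:
print(file_survival_probability([1] * 10, [0.9] * 10, 3))
```

Computing M given K and a desired reliability threshold r, as offered above, would then just be a search over candidate share distributions until this probability exceeds r.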
I was working on the documentation updates for this ticket earlier today, and:
started bugging me.
As I understand it, the purpose of this ticket is to fix the issue that metacrob mentioned on the mailing list -- that Tahoe should not consider an upload successful if it cannot place shares on at least servers of happiness servers. The naive and, to me, obvious way to do this is to keep track of how many distinct servers we use when uploading a file, and then to fail if, when there are no more outstanding shares, we haven't used servers of happiness servers.
However, this technique doesn't seem to achieve the goal of the wording above.
Let's suppose that I have a file f, default encoding parameters k=3, M=10, happy=2, and that there are enough servers in my grid so that I can find a distinct server to accept each share. If I'm using tahoe patched with my naive solution above, the upload will succeed, but since every server only has one of 10 shares, and k=3, the correct functioning of 2 servers is not enough to guarantee the availability of my file.

An obvious enough solution to this is to ignore values of happy that are less than k. But there are use cases for values of happy that are less than k -- e.g., if I just want an offsite backup somewhere and don't care very much about reliability beyond that. Maybe we should just change the description of happy to remove any guarantees about reliability, leaving that instead for some of the tickets referenced in the first comment.

Hopefully this doesn't come off as me being pedantic about wording -- I'm just trying to make sure that I'm on the same page as you on what needs to be done with this ticket.
Hm, good points, Kevan.
So, for the case of off-site backup where you don't care how many servers need to stay up in order to guarantee the availability of your data, then you should set servers_of_happiness=1, right?

And for the case that you have K=3 and M=10, then we could extend the upload peer selection algorithm so that if you have servers_of_happiness=2 then it has to put more than one share on each server, and in such a way that there are no two servers which have the same two shares. But instead we could make it so that your upload fails with the error "Didn't upload it in such a way that the survival of any 2 servers was sufficient for the survival of the file." Then you realize that if that's what you want you ought to set K=2 and re-upload.

How does that sound?
I like the idea of extending the peer selection algorithm -- it'd be cool to be able to support choices of servers_of_happiness that are less than k, but I'm not sure how to do that. An algorithm for that probably wouldn't be too difficult, but maybe beyond the scope of this ticket? I'm fine with failing with an error, too.

As you say, what we're effectively saying with the backup scenario is that we don't care how many servers need to stay up to guarantee availability of data -- i.e.: we don't have a servers_of_happiness. Saying servers_of_happiness=1 is certainly a way of supporting that, but it seems confusing to adopt that as the standard -- we're special-casing a configuration value to mean something when set to a particular value that it doesn't mean with other values (not to mention the fact that there might be users who would want the correct functioning of any one of the servers that they've uploaded shares to for a file to guarantee availability of that file). I think it'd be better if we supported the backup scenario by not performing the servers_of_happiness behavior unless the user has explicitly set a servers_of_happiness value in the config file (we could support this in code by using 0 as a default or something). This seems more consistent with what the user is saying with that value -- if they don't set it, they don't care, and if they do, then we can interpret it in a way that doesn't involve too many special cases.

I'm confused -- I think that servers_of_happiness=1 is not a special case -- that it is the normal meaning of servers_of_happiness. I definitely think changing the upload algorithm is out of scope of this ticket. Now I've gotta run to a friend's birthday so I'll re-read this when I return...

Oh, I guess my idea leads us to conclude that if you are doing the "I'm doing a backup and I don't care how many servers" scenario then you have to set k=1, too. Okay I'll think about this more later.
Based on the definition given above (i.e.: servers_of_happiness=n implies that the survival of n servers is sufficient for my file to be available), it doesn't seem unreasonable to interpret servers_of_happiness=1 to mean that if any one of the servers that my file is initially uploaded to is online, then my file is still available. This is not the same as saying that I do not care about servers_of_happiness -- solving the first interpretation would require placing k shares on each server that received any shares at all, while the second case is solved if the upload of the file is successful at all, regardless of how many shares end up on how many servers. Or at least that's how I interpret it. Hopefully that's clearer.

Okay, I propose that for now we just check whether servers_of_happiness < k and if so error out with an error message explaining that we don't currently support that option. That means that the "I don't care" use case that you describe is not currently supported.
How does that sound?
Okay. This ticket is kind of a specific case of some of the tickets you mentioned up there, so we can always revisit the issue later, when solving them.
hm. How would that work for users like metacrob, who have only one storage node? If they set servers_of_happiness=1, they get an error, and if they set servers_of_happiness=3 (assuming that k stays as a default), then they get another error (because there aren't enough servers to honor their preference). Maybe I'm misunderstanding something.

To fix this, I propose that if there is no happy config value set, we ignore it (by, e.g., or something similar), and do not enforce servers_of_happiness. If the value is set, and is less than k, we error out as you suggest.

Wouldn't metacrob getting an error because there aren't enough servers to honor their preference be exactly what we want?
I think that's what metacrob is asking for.
What do you think?
I suppose we should probably set k=3, m=10, servers_of_happiness=7 as the default.

Yes -- sorry, I didn't think that comment through.
What I meant to say is that in fixing the problem for metacrob (in that tahoe will report failure for his grid), we end up making tahoe unusable for people who happen to have small grids (i.e.: one or two or whatever storage nodes, where whatever < k) and don't especially care about the behavior that we're implementing. That's why I think it is a good idea to support the "don't care" behavior somehow -- it doesn't seem particularly hard to support, and not doing it potentially makes tahoe a lot harder to use for some users. Hopefully that's clearer.
Okay, I understand what you mean. This hypothetical use case is in a sense the exact opposite of metacrob's: he wants Tahoe-LAFS to give an error when there are only one or two storage servers, while this other user would want Tahoe-LAFS to silently succeed in the same case. If we implement shares_of_happiness as described so far in this ticket -- with no special case and with no change to the upload peer selection algorithm -- then metacrob's use case will be satisfied by the default settings (k=3, servers_of_happiness=7, m=10), and the other use case would have to change their k and their servers_of_happiness to both be <= the number of servers that they actually have. Is that right?

We could implement metacrob's use case right now without the "servers_of_happiness = dont_care" feature, and then wait until we have more information about how it works, for example a bug report from someone who expected upload to work with the default parameters when they had only one or two servers.
What do you think of that?
Replying to zooko:
Yes, exactly.
That sounds fine.
I'm also fine with setting the defaults to be k = 3, m = 10, happy = 7 -- it's pretty easy to change them if they end up being poor choices.

A summary of the discussion, for those of you following along at home (zooko, feel free to add to this if you think I've missed something):
The Problem
Tahoe-LAFS will store a file on a grid in an unreliable way (specifically, at least for this bug report, uploading everything associated with a file f to only one storage node) without reporting anything to the user.
The Solution
We will change shares.happy in tahoe.cfg to mean servers_of_happiness. servers_of_happiness means two things:

1. Shares from the upload are distributed across at least servers_of_happiness distinct storage nodes.
2. The file remains available even if no more than servers_of_happiness of the storage nodes uploaded to in the initial upload remain functioning.
Both of these conditions are necessary to solve metacrob's use case. He should be able to tell Tahoe-LAFS that he does not consider an upload successful unless shares from that upload were distributed across at least n servers (> 1 for the bug report, but in general): the first condition addresses this.
This is not enough to solve metacrob's use case, though -- if he has uploaded shares from a file to 5 servers, but cannot recover that file unless one particular server is online and working, then he is no better off when that server fails than he would be if that server held every share of his file. The second condition addresses this.
If we remove the first condition, then servers_of_happiness is satisfied if the file is uploaded entirely to only one server (since 1 < "servers of happiness"): clearly, this is undesirable -- indeed, it is the exact behavior mentioned in the bug report.
Implementation issues
Supporting this in Tahoe-LAFS is fairly trivial if servers_of_happiness is greater than or equal to k, the number of distinct shares generated from a file f necessary to recover f: the first condition (servers_of_happiness distinct servers having a distinct share of a file f) implies the second (servers_of_happiness distinct servers being enough to reconstruct f), because no more than servers_of_happiness distinct pieces of f are necessary to reconstruct f.

Supporting servers_of_happiness values less than k is harder -- the first condition no longer implies the second. To see why this is, consider uploading a file f onto a grid of 10 well-behaved storage nodes with encoding parameters (happy=2, k=3, m=10). Suppose that each storage node accepts one share. Then each pair of server nodes has only two distinct shares between them -- not enough to reconstruct f.

We could support these values if we ensured that some servers had more than one share, in such a way as to ensure that the cardinality of the union of the shares held by any servers_of_happiness servers is at least k, but this is tricky, and, for the moment, beyond the scope of this ticket. As a stop-gap, Tahoe-LAFS will fail with an error if asked to upload a file when a user has servers_of_happiness set to a value less than k.

The proposed default encoding parameters for servers_of_happiness are (k=3, happy=7, m=10). One consequence of these defaults and the stop-gap described above is that users of small grids (where there are one or two storage nodes) will by default not be able to upload files unless they change their k to 2 or 1. If bug reports surface about this decision, we'll revisit it.
(I'm hoping that there aren't any more points to discuss in there -- my goal in writing it was to summarize the discussion so that I know what I need to do in fixing this ticket)
What is the problem/shortcoming with forcing k<=happy<=m? And make the error message simply state that this has to be true for Tahoe to succeed?

Document some easily-expected example settings and ramifications:
- k=3, happy=7, m=10 -- default
- k=1, happy=1, m=X -- not recommended, no redundancy
- k=2, happy=2, m=X -- friendnet, backup guaranteed (minimal protection)
- k=3, happy=3, m=X -- friendnet, backup guaranteed (better protection)
- etc...
Replying to terrell:
Without specifying m, you can't really be sure that the "protection level" statements are true. For example, with k=2, happy=2, m=3, only one of the two guaranteed servers has enough shares to reconstruct the file, so if that server fails, you've lost the file, which I would call "single point of failure", not "minimal protection". If happy == k, you need m > k + 1 or you have a single point of failure.

Perhaps it would be better to measure happiness by "number of servers that can fail without losing the file"? I think that makes the implications of setting the parameter much easier for users to understand, and it's not complicated for Tahoe to compute: just generate the peer list, assign share counts to the peers, sort by share count (ascending), and sum up the list until total >= k. The number of remaining, unsummed servers is the maximum that can be lost without losing the file.

Obviously in the case that spawned this ticket, metacarob's max-losable would be zero, a very bad situation for reliability. Setting max-losable to one would provide a guarantee of minimal redundancy; setting it to two or three would provide good reliability.
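A minimal sketch of that max-losable computation (illustrative names; it assumes the shares held by different servers are distinct):

```python
def max_losable(share_counts, k):
    """Number of servers that can fail without losing the file, given the
    number of distinct shares each server holds and the recovery threshold k."""
    counts = sorted(share_counts)     # ascending: weakest servers first
    total, used = 0, 0
    for c in counts:
        if total >= k:
            break
        total += c
        used += 1
    if total < k:
        return -1                     # the file isn't recoverable at all
    return len(counts) - used         # everything after the summed prefix may be lost

print(max_losable([10], 3))       # metacarob's case: 0 -- no server may be lost
print(max_losable([1] * 10, 3))   # 3-of-10, one share per server: 7
```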
"""
servers_of_happiness means two things:
"""
Hm.... I see that I have not understood what I want all this time.
I guess what I've really been wanting all this time is "The Erasure Coding Property" as applied by counting servers instead of by counting shares. That is: k=3, m=10 means that after this upload there will ideally be 10 distinct servers, the survival of any 3 of which guarantees my file. The addition of servers of happiness h=7 means that if, after the upload, there are 7 distinct servers, the survival of any 3 of which guarantees my file, I will consider the upload a success, but if there are only 6 I will consider the upload a failure.

So I think what I really want is to make servers_of_happiness mean "The number of servers in set X" and k mean "Any subset of X of size k is sufficient to recover the file". I guess that means servers_of_happiness is "The minimum number of servers (such that any k of those servers is sufficient) to consider this upload a success." Is that right?

Kevan: I really like your method of working in which you thoroughly document it first, then write the tests, then the code (then someone reviews it for you). It reminds me of Brian's method of working.
Trel: It makes sense to me that k <= h <= m, but at the moment I'm not sure my intuitions about these things are right at all!

Replying to zooko:

I'm not sure how this k interacts with the number-of-shares k. Are you assuming implicitly that there is only one share per server? I think there are good reasons to have more than one share per server (to maximize download performance while still assuring reliability), in which case the number-of-shares k must be distinct from, and larger than, the number-of-servers k. So it appears that this formulation of the approach requires two parameters: minimum-recoverable-server-set-size (k_server) and servers-of-happiness (h). These are in addition to the FEC parameters, k_share and m.

The more I think about this, the more confusing servers-of-happiness becomes as a measure or selector of reliability. But then, I'm easily confused :-)
Replying to [swillden]comment:21:
This sounds like the right thing to me.
Treat that as an "implementation detail" that Tahoe-LAFS can optimize as it likes, as long as the guarantee that it offers to the user still holds: that there are h servers, the survival of any k of which will guarantee the survival of the file. (Where there will be m total servers if possible, but at least h servers, or else the upload fails.)

That seems to be what users like metacrob think that we are already offering, and it is also the property that I would like from my backups. It is "the erasure-coding property" as applied to servers, not to shares. (Under the hood, of course, Tahoe-LAFS has to produce shares and then decide how to upload them in order to achieve this property and also to optimize other things like up- and down-transfer performance.)
(Also, I think that it can be implemented as a small patch to the current upload algorithm.)
Shawn: what do you think?
Kevan: you're now my test to see whether I'm making sense about this topic. Is the above coherent? :-)
Replying to zooko:
I like it. Not as much as a statistical reliability threshold, but I like it.
However, I think that it's worth working out how to implement the "implementation detail".
So, to be clear, this is my understanding of your proposal:
We still have three configured parameters, k, h, m, which mean, respectively, "Minimum number of servers required to recover a file", "Servers of Happiness (aka the repair threshold)" and "Maximum number of servers to use". FEC encoding parameters k_e, m_e must be chosen by Tahoe such that it's possible to satisfy k, h, m.

My suggested algorithm for selecting k_e is to set it to the number of servers chosen to receive shares, n, which must satisfy k <= h <= n == k_e <= m. Setting k_e to the number of servers chosen will maximize download performance when all servers are available (assuming all servers have similar performance -- and it could do a reasonable job even if they don't).

m_e must then be chosen such that when those m_e shares are equally distributed across the n servers, any k-server subset of them has sufficient shares to recover the file. That is, each of the n servers must have at least k_e // k shares. If k_e / k is an integer, that means that m_e = n * (k_e / k).

If k_e / k is not an integer, the minimal m_e is a little more complicated. In that case, most of the servers must have 1 + k_e // k shares, but k-1 of them only need k_e // k shares. So m_e = (n-k-1)*(1 + k_e // k) + (k-1)*(1 + k_e // k). Rearranging gives m_e = n - k + 1 + n * (k_e // k).

In Python code:
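Something along these lines, using the rearranged expression (the helper name is illustrative):

```python
def choose_m_e(k, n, k_e=None):
    """k = number of servers whose survival must suffice, n = number of
    servers receiving shares; k_e defaults to n so a download can pull one
    share from every server in parallel."""
    if k_e is None:
        k_e = n
    if k_e % k == 0:
        return k_e, n * (k_e // k)            # every server gets k_e // k shares
    return k_e, n - k + 1 + n * (k_e // k)    # most servers get one extra share

print(choose_m_e(3, 10))   # (10, 38), as in the example below
```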
So, for k=3, n=10, and assuming we set k_e = n to maximize download performance, k_e = 10, m_e = 38. Eight of the 10 servers will get four shares and two will get three shares. The fewest shares that any three servers can have is 10, but three-server sets may have 11 or (most commonly) 12 shares.

This multiplication of shares may even make it easy to address performance mismatches between servers. The client can request one share from each of 10 servers, but if one of the servers is very fast and completes first, while another is very slow, the client could request another share from the fast server.
One downside of this approach is that it makes the expansion factor somewhat unpredictable, since it's dependent on the number of servers available. More servers means more expansion -- also more reliability.
For completeness, we should also define what this approach means for repair. An algorithm for deciding if a file needs to be repaired is: take the first k servers in the list; if they collectively have fewer than k_e shares, remove the first one and repeat this step. If len(list) >= h, the file is happy.

The first expression for m_e is wrong, even though the rest of the argument and the rearranged version is correct. What you mean is: ... So m_e = (n-(k-1))*(1 + k_e // k) + (k-1)*(k_e // k). Rearranging gives m_e = n - k + 1 + n * (k_e // k).

That's right. I meant to type n-k+1, but typed n-k-1.

Replying to swillden:
zooko: I think you're being clear + coherent.
To summarize the recent discussion.
Share uploading and encoding will be controlled by five parameters.
The three that the user can directly specify are:
- k: The number of servers that must survive for the file to survive (this is an upper bound, so another way of thinking of it is "I should be able to retrieve my file if at most k of the original servers that received parts of it continue to function").
- m: The ideal number of distinct servers (or: an upper bound on distinct servers).
- h: A lower bound on distinct servers: at least h servers must receive shares in order for an upload to be considered successful.

Two others, k_e and m_e (wouldn't it be cool if you could use LaTeX in trac tickets?), are calculated at upload time by tahoe based on user-defined parameters and grid data (namely, n, the number of servers receiving shares). These are defined as:
- k_e = n
- m_e = n * (k_e / k) if k divides k_e, otherwise m_e = n - k + 1 + n * (k_e // k)

We impose the constraint that k <= m, that h <= m, and that h, k, m > 0. We do not say that h >= k because a user who simply wants a backup and doesn't care about the specific dispersal or replication of their shares would say h=1 (i.e., the upload is successful if the shares are somewhere). Additionally, we will say that the upload is a failure if n < h.

Does anyone disagree with my summary?
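To make the summary concrete, here is a small sketch of those rules and constraints (names are illustrative, not Tahoe-LAFS APIs):

```python
def plan_upload(k, h, m, n):
    """k, h, m are the user-facing server parameters described above; n is
    the number of servers that will actually receive shares.  Returns the
    FEC parameters (k_e, m_e), or raises if the upload must be reported as
    a failure."""
    if not (0 < k <= m and 0 < h <= m):
        raise ValueError("need 0 < k <= m and 0 < h <= m")
    if n < h:
        raise ValueError("upload fails: %d server(s) is fewer than h=%d" % (n, h))
    k_e = n
    if k_e % k == 0:
        m_e = n * (k_e // k)
    else:
        m_e = n - k + 1 + n * (k_e // k)
    return k_e, m_e
```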
I like this.
It elegantly supports the use case of someone who doesn't care much about their files beyond the fact that they're on the grid somewhere by decoupling h from most of the logic (since it remains, unless I'm paraphrasing badly, only in checks at the end of file upload and in the repairer) -- they'd set (k=10, h=1, m=10), and be on their way. It also gives metacrob the tools he needs to support his use case.

What sort of defaults seem reasonable for this new behavior?
I feel a lot better about this ticket, now -- it will be pretty cool once we get it working, thanks to all the new feedback. :-)
Kevan:
I think Shawn and you and I (at least) agree on the desired result -- "k-out-of-m" servers is the desired goal of upload and at least "k-out-of-h" servers is the least reliable upload that still counts as a success.

I think I like your and Shawn's proposed algorithm for how to choose k_e and m_e and to map shares onto servers. However, it should be a separate ticket from this one. The reason is that there is (I think) a very easy way to implement this ticket without implementing that improved algorithm. That is: let k_e = k, m_e = m (the same as it is now), then run the current algorithm for mapping-shares-to-servers, then check if the result satisfies the new criteria for success.

This will be much easier to implement because the algorithm for mapping shares to servers can be complex. If I recall correctly, it currently adaptively deals with servers which fail to respond to the initial "Will you hold these shares?" requests, and with servers that respond with "Yes, but actually I already have shares number 1 and 4.", and with servers that respond with "Yes, I'll hold that one, Yes, I'll hold that one, No, I'm full go away.". (I could be wrong about some of those details.)

So if this ticket is just about changing the criteria of upload success and not changing the share-to-server-mapping algorithm then you can finish this ticket sooner and we can have greater confidence that it didn't cause regressions. Note that in the "normal" cases that I can think of, the current share-to-server-mapping algorithm will meet the new criteria, except of course in the case of fewer than h servers which can accept shares.

By the way, I'd like to have Brian Warner review this ticket once we've finalized the design.
That's a good summary, kevan, but my method for calculating k_e and m_e assumes that n >= k and k_e >= k. If h < k, then presumably there are situations where n = k_e < k.

It's kind of odd to specify that k is the number of servers needed for the file to survive, but then to say that we're happy even if there are less than k servers available. It seems like if you want to allow that, then what you really want is the "shares of happiness" algorithm, not the "servers of happiness" algorithm.

Also, if there's only one server available, is there any sense in doing FEC coding at all? Or, at least, is there any sense in setting k_e > 1 or m_e > 1? I suppose if there were a mechanism to redistribute the shares later it would. And Zooko's observation (in another ticket) that m_e can be cheaply increased would imply that it makes sense to set k_e = m_e = k, and then let a future repair cycle increase m_e when more servers are available. So, perhaps the k_e, m_e selection algorithm should have as a special case that k_e = m_e = k when n = 1. That would result in no FEC expansion when only a single server is being used, which is nice.

But what about when n > 1, n < k? k_e = m_e = k would result in lower reliability than only using one server, which seems like a bad idea. For n=2, any allocation that doesn't allow either server to reconstruct the file is worse than using only one server. For that case, k_e = k, m_e = k_e * n would make sense, which covers the n=1 case as well. For larger values of n (but still n < k), what to do is less clear. Suppose the user specified k = 10, h = 1, m = 10, as you suggested, and there were nine servers available. k_e = k is fine in that case, but m_e = k_e * n would result in FEC expansion of 9!

In that case, what the user appears to be asking for, with k = m, is distribution with no redundancy at all, meaning that if any of the servers fails, the file is lost. If that's what they really want, then k_e = m_e = k is fine. But what if they specified k = 5, h = 1, m = 10? That would indicate a desire for some redundancy, but a willingness to accept no redundancy rather than failing the upload.

Implicit in the choices of k and m is an acceptable FEC expansion amount and a desired redundancy level. Perhaps what makes the most sense is to try to honor the indicated FEC expansion limit, capping m_e at k_e * m / k.

So the rule would be, if n < k, then k_e = k, m_e = min(k_e * n, k_e * m / k). Of course, if n < h, then the upload should fail.

I'll attach a file with some code that implements this, and computes the reliability for various options. It seems to behave reasonably for a wide variety of inputs.
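For the record, that n < k rule reads roughly like this (the attached fecparams.py is the real implementation; this is only an illustration with made-up names):

```python
def fec_params_too_few_servers(k, m, n):
    """n < k case: fall back to k_e = k and cap the expansion at the level
    implied by the user's choice of m and k."""
    assert 0 < n < k
    k_e = k
    m_e = min(k_e * n, k_e * m // k)
    return k_e, m_e

print(fec_params_too_few_servers(5, 10, 3))   # (5, 10): ten shares spread over three servers
```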
Attachment fecparams.py (3075 bytes) added
Implementation of proposed k_e, m_e selection algorithm, per comment 32
Replying to zooko:
That seems like it works fine. The only major advantage of the "new" algorithm is increasing parallelism for downloads, which is a separate issue, and should be a separate ticket (I'll open one). In fact, I think there's a much simpler way to increase parallelism.
To be clear, the "new criteria for success" are, I believe:

1. Any k-server subset of the n successful servers has sufficient shares to construct the file. If k = k_e, n >= k, this is trivially guaranteed to be satisfied. If n < k, then we don't have the FEC survivability guarantee, but survivability degrades fairly gracefully.
2. n >= h.

Where n is the number of servers that receive at least one share, of course.

I just wanted to point out that metacarob's original problem could be solved much more simply:
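Going back to the two criteria just listed, a post-selection check along these lines would express them directly (hypothetical names; share_map maps each server to the set of share numbers placed on it):

```python
from itertools import combinations

def upload_is_happy(share_map, k, h, k_e):
    """Criterion 1: every k-server subset of the servers that got shares
    holds at least k_e distinct shares.  Criterion 2: at least h servers
    got shares.  Below k servers there is no FEC guarantee, so only the
    h check applies (the 'degrades gracefully' case)."""
    servers = [s for s, shares in share_map.items() if shares]
    if len(servers) < h:
        return False
    if len(servers) < k:
        return True
    for subset in combinations(servers, k):
        held = set()
        for s in subset:
            held |= set(share_map[s])
        if len(held) < k_e:
            return False
    return True
```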
I'd be cautious about a constraint-oriented algorithm: the implementation can get very tricky. The approach described above seems to avoid that rabbit hole, by merely testing an assertion after peer-selection has taken place (and presumably again after the shares have finished uploading, in case you lost servers during the upload).
The terminology is confusing to me, because tahoe uses "k-of-N" encoding everywhere internally. I know that this ticket got started with the alternate notation, but when it comes time to implement this, could you find a different name for the new "n" (number of servers which end up holding a share) value? I'd suggest leaving "k" and "N" referring to shares, and introduce a new "k_s" and "N_s" to refer to servers. In the code (as opposed to Trac discussions), verboser is better (up to a point), so you could use "k_servers" and "N_servers" to avoid the ambiguity of the "s" prefix.
I'm not sure that changing the encoding parameters on the fly is a great idea: it may surprise people who thought they had already made their decision about the expansion-vs-reliability tradeoff. OTOH, I suppose my attitude reflects my bias towards stable grids, where the number of servers is supposedly known in advance, and doesn't change very much.. which may not be the case. I imagine that Shawn's efforts will result in a web page on the client node which has knobs for encoding parameters and displays probable reliability properties as an output, to help users make those decisions. Or the other way around, where the user dials in the reliability requirements, and is informed of the encoding strategy (and resulting expansion factor).
Replying to zooko:
This sounds fine. Does anyone have any objections to me starting along this path?
Kevan: go for it!
Everyone: please move discussion of improving FEC parameters and/or server selection to #791 (Optimize FEC parameters), and leave this ticket to track Kevan's progress on documenting, testing, and implementing the simpler goal. That goal is:
- Use the same parameters k, n, and h ("happiness"), just as Tahoe-LAFS v1.5 does.
- After the existing share-placement algorithm has run, check whether the result would give "k-out-of-h servers". If it wouldn't, abort the upload.

Cool.
I've written some tests for the new behavior, and implemented the behavior itself in the places that I think it needs to be implemented (some checks in [upload.py]source:/src/allmydata/immutable/upload.py, and one in [encode.py]source:/src/allmydata/immutable/encode.py, as well as some jiggling to handle the dual meanings of k and m) -- the roadblock now is improved documentation, and changing some unit tests that were written with the old happy behavior in mind (I think I have three of those left to fix).

Kevan: how is it going? Do you need help improving the documentation?
I'm pretty much done, aside from two issues.
Other than that, everything seems fine -- the new behavior is there and working, there are tests for it, and all of the unit tests that relied on the old behavior have been fixed to pass with the new behavior.
Okay, there are three files, containing a few patches.
- behavior.txt implements the behavior that we were discussing (at least that part of it that isn't relegated to #791). The bulk of this consists of changes in [upload.py]source:/src/allmydata/immutable/upload.py and [encode.py]source:/src/allmydata/immutable/encode.py, with a minor change to [download.py]source:/src/allmydata/immutable/download.py (which is list item 1 above).
- docs.txt, which updates [configuration.txt]source:/docs/configuration.txt and [interfaces.py]source:/src/allmydata/interfaces.py to reflect servers_of_happiness.
- tests.txt, which updates a bunch of existing tests to work with the new semantics, and also adds a couple of new ones.

(apologies for the amount of time that it has taken for a simple task; I've been on vacation for the past week or so, and haven't had as much time to work on this as I would have wanted)
Putting this in 1.5.1 Milestone (it looks like it is going to be done by 1.5.1 release and I don't want to hold it out of trunk).
Hm, actually maybe we want to release a Tahoe-LAFS named "v1.5.1" which doesn't have any user-visible behavior changes/feature changes aside from bugfixes, in which case maybe we should hold this patch (safely stored in this ticket), until after that release. Please don't put off reviewing this patch, though, if you were planning to review this patch, do it now!
I'm happy with where the documentation is for this. Perhaps someone can proofread?
I think the only stumbling block left here is the servers_of_happiness in CiphertextDownloader -- I think it's probably best to set this to whatever the user-defined servers_of_happiness is, but I can't think of a clean way to do that. Perhaps someone with a fresh perspective can look and see if there's an obvious way that I've missed?

(maybe writing up the problem in detail will help me to think of a solution)
If I understand the code, Tahoe-LAFS (in [checker.py]source:src/allmydata/immutable/checker.py@4045#L287) defines an immutable file node as healthy if all of the shares originally placed onto the grid (m) are still available (for some definition of available, depending on the verify flag), unhealthy if fewer than m but more than k shares are still available, and unrecoverable if fewer than k shares are still available.

In source:src/allmydata/interfaces.py@4045#L1628, ICheckable defines the method check_and_repair, which tells the receiving object to check and attempt to repair (if necessary) itself: this interface and method are implemented by [FileNode]source:src/allmydata/immutable/filenode.py@4045#L181, which represents an immutable file node on the Tahoe-LAFS grid.

The check and repair process proceeds something like this (again, if I understand the logic):

(I know there's a bit of hand waving in there, but hopefully I got the gist of it)

The repairer (source:src/allmydata/immutable/repairer.py@4045#L14) is pretty simple: it downloads the content associated with the FileNode in the normal way using a DownUpConnector as a target, and then uploads the DownUpConnector in the normal way. Since [DownUpConnector]source:src/allmydata/immutable/repairer.py@4045#L87 implements [IEncryptedUploadable]source:src/allmydata/interfaces.py@4045#L1400, it is responsible for providing the encoding and uploading operations with encoding parameters, including servers_of_happiness.

The problem that this long-winded comment is getting to is here: source:src/allmydata/immutable/download.py@4048#L869. The CiphertextDownloader sets the happy encoding parameter of its target to be k. Since k can be bigger than servers_of_happiness, this isn't good. In most cases, the accompanying comment is right; in the case of a file repair, it isn't, because the encoding parameters stored in the DownUpConnector are used by the encoding + upload process.

I think that the CiphertextDownloader should ideally be following the user's configured happy value instead of setting it to something else. Where I'm stuck is in figuring out a way to tell it what happy should be. Some ideas:

happy.

In any case, that's basically the one stumbling block that I'm aware of in this ticket.
Hm. I don't know the answer off the top of my head. There seem to be too many ways to configure the encoding parameters in upload.py. Perhaps some of them are vestigial? I'm going to assign this ticket to Brian. Brian: please just tell Kevan your opinion about how the repairer should learn the user's configuration of happy and then assign this ticket back to Kevan. Thanks!

Brian doesn't know the answer to this off the top of his head, either. Help! Assigning this ticket to "nobody".
Oh wait, maybe repairer doesn't need to know servers of happiness. Maybe we should set it to 0. Repair is currently intended to finish regardless of how well it is doing and then report to the user how well it did, so then it should just report the resulting servers of happiness when it is done. That would, I think, be analogous to the way the current repairer works with regard to shares of happiness (which by the way I wrote). Kevan: does that make any sense?
Assigning this ticket to Kevan.
It seems consistent enough with the current repairer, which (at first glance) does what you say.
So, if I understand your point: the repairer's job is to attempt to repair an unhealthy file, and report its results -- presumably other code, if it desires, can then determine whether the repair was successful or not. I think that this makes sense, but the repairer (or at least the code I've skimmed before leaving for work) has an internal definition of what is healthy -- a file with all of its shares intact -- and what isn't. This is necessary, because otherwise the repairer wouldn't know when to attempt to repair a file, but I'd feel better if the repairer tried to use the same definition of healthy as the user, I guess.
I realize that the same argument applies to shares_of_happiness, too (though perhaps less so -- though the repairer doesn't respect shares_of_happiness directly, it is likely to be indirectly respected in the sense that the repairer's threshold of health is probably stricter than shares_of_happiness, while the repairer's threshold of health does not care about the distribution of shares): it is really more a gripe with the way the repairer works in general than an issue specific to this ticket. Given that, I'm fine with that solution to this ticket.

(apologies if this is less than clear: I wrote this on my way out the door to work :)
I'm updating the behavior.txt patch to implement Zooko's suggestion.

Great! Looking over your patch, this part catches my eye:
I guess this is the critical part of this patch.
If the number of landlords falls below H, then we definitely won't get at least K-out-of-H reliability at the end, so it should indeed treat this as a failure here. But, what if len(servers_with_shares) >= self.servers_of_happiness, and yet we still don't have K-out-of-H reliability because some of the servers have the same shares as each other (instead of different share numbers)? Can that happen? I think so from reading the patch -- maybe we should have a test of this case.

The case you describe, if I understand it correctly, would be fit by this example:
Suppose I have four servers, s_1 through s_4. Suppose I have k = 3, h = 3, N = 10. Suppose that the shares generated from a file f are numbered f_n (so f_1 is share 1, and so on). Then the situation you describe is a problem if shares are distributed as follows:
- s_1: f_1
- s_2: f_1
- s_3: f_1
- s_4: f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9, f_10

because the set {s_1, s_2, s_3} is not sufficient to recover the file, though it is the same size as k.

Do you think we're on the same page?
From what I understand of the placement algorithm, Tahoe-LAFS wouldn't normally distribute one share to more than one server.
[_loop]source:src/allmydata/immutable/upload.py@4045#L225 in Tahoe2PeerSelector contains the guts of the peer selection algorithm. As I understand it, it only has one request for a particular share out to peers at a time -- either by using pop() on the list containing the homeless shares, or by setting the slice of the homeless shares list out on the wire at the moment to the empty list. If this is the case, then it shouldn't allocate one share to more than one server.
f
to a grid that has the following layout:s_1
:f_1
s_2
:f_1
s_3
:f_1
s_4
:\emptyset
I'll also stipulate that
s_1, s_2
ands_3
will not accept any more shares, whiles_4
will accept every share it is asked to accept. Let's walk through what happens when I try to uploadf
.I'll assume for simplicity that
_loop
starts withself.homeless_shares = [f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9, f_10]f_1, self.uncontacted_peers = [s_2, s_3, s_4]s_1,
_loop
will start by checking to see if there are any homeless shares. There are, of course. It then checks to see if there are any uncontacted peers. At this point in the execution, there are, so it pops the first uncontacted peer (s_1
) off of the list of uncontacted peers, and pops the first homeless share (f_1
) off of the list of homeless shares. It then askss_1
to storef_1
. Recall thats_1
already hasf_1
. Since theStorageServer
([source:src/allmydata/storage/server.py@3841#L35]) tells remote callers about all of the shares it has for a given storage index,s_1
tells theTahoe2PeerSelector
instance that it hasf_1
, and that it has not allocated any shares. This is handled by the_got_response
method ofTahoe2PeerSelector
, which sees thats_1
has not allocated anything, but that it already hasf_1
: that means thatf_1
is not added to the homeless list again, and thats_1
is set to be the prexisting peer withf_1
._loop
will now asks_2
if it can storef_1
in the same way.s_2
will reply (in the same way ass_1
) that it can't (i.e.: that it hasn't allocatedf_2
), but that it happens to havef_1
. This causes_got_response
to sets_2
as the prexisting share withf_1
, overwriting the previously set value (s_1
)The same process occurs with
s_3
: the end result of the_loop/_got_response
combination in that case is thats_3
is now the prexisting share withf_1
.Finally,
_loop
asks the last uncontacted peer,s_4
, to storef_2
.s_4
replies that it doesn't have any preexisting shares for the storage index, and that it has allocatedf_2
.s_4
is recorded inself.use_peers
as having done this.There are now no more uncontacted peers, so the
Tahoe2PeerSelector
instance will attempt to store a proportional amount of the homeless shares (see [source:src/allmydata/immutable/upload.py@4045#L263]) on each of the remaining servers.s_1, s_2
, ands_3
will refuse to accept any of these, and will not be used in any further queries as a result.s_4
will accept all of the shares thatTahoe2PeerSelector
wants to put on it, so it will be re-appended to the list of peers to consider. On the next pass,_loop
will attempt to store all currently homeless shares ons_4
, and will succeed.Tahoe2PeerSelector
now has no more shares to allocate, and executes the comparison that Zooko mentions. To getservers_with_shares
,self._servers_with_shares
takes the union of the set of peers that we're actually uploading shares to and the set of peers that we've recorded as already having shares. The latter set is built using the prexisting shares structure that is updated each time a response is received froms_1
,s_2
, ors_3
: this will never have more than one server forf_1
(or anyf_n
). The only other server in this list will bes_4}}, since it has all of the other shares. ```_servers_with_shares
, then, returns a list of two servers, andTahoe2PeerSelector
correctly fails, because 2 is less than 3.One example is obviously not proof that the peer selection algorithm guards against this in all cases, though, so maybe an additional test (or revising that one) would be a good idea. Did you have anything in particular in mind that would be a better indicator?
This is perhaps off-topic, but if I understand the execution of _got_response, it seems like we don't say that a query has made progress (or that it is a good query) unless we know that the remote server has either allocated shares for us, or we learn that it already has shares that we think are homeless. This means that if I ask a server to store one share, and it happens to have that share (but no others), its response isn't considered successful, because the share I asked it to store isn't in homeless shares (since we popped it/destructively modified homeless shares when we removed it), and nothing is allocated. Is this intentional?

(whew, that was a lot longer than I was planning on it being :-/)
Kevan:
It's great to see that you study the code so carefully. This gives me a nice warm fuzzy feeling that more eyeballs have looked at it. Anyway, I'm very sorry it has been two weeks since it was my turn to reply on this ticket and I haven't done so. I've been really busy.
I guess the key fact that you've shown that I didn't appreciate is that the variable _servers_with_shares holds only servers that have a new share that isn't already held by one of the servers in that set. Perhaps it should be renamed to something like _servers_with_unique_shares. (If you do that, please use the darcs replace command to rename it.)

Now I can think of one more issue. You've pretty much convinced me that this way of counting _servers_with_shares can't overcount unique shares which are available on separate servers, but it could undercount. For example, suppose s_1: f_1, s_2: f_2, s_3: f_3, s_4: f_1, f_2, f_3. Then if s_4 is counted first it will prevent s_1, s_2, and s_3 from being counted because they don't have any new shares, so the final value of _servers_with_shares will be 1. On the other hand if s_1, then s_2, then s_3 are counted first the final value will be 3. If this is right, then it means that sometimes an upload could be reported as failing (because the uploader happened to talk to s_4 first) when it should have been reported as succeeding.

What do you think? It might be worth committing your patch as is, meaning that trunk would then potentially suffer from uploads spuriously failing when they shouldn't have (but they never suffer from uploads spuriously succeeding when they shouldn't have), and then starting on a separate patch to avoid that problem. Or, perhaps we should keep this patch out of trunk even longer and think about that issue.
Regards,
Zooko
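One order-independent way to do that count, for comparison, is to treat it as a matching problem between servers and share numbers -- each counted server must be credited with a different share. This is only an illustration of the idea, not what the current patch does:

```python
def servers_with_unique_shares(server_shares):
    """Largest number of servers that can each be credited with a distinct
    share (a maximum bipartite matching, via simple augmenting paths).
    server_shares maps server -> iterable of share numbers."""
    share_owner = {}

    def assign(server, seen):
        for sh in server_shares[server]:
            if sh in seen:
                continue
            seen.add(sh)
            if sh not in share_owner or assign(share_owner[sh], seen):
                share_owner[sh] = server
                return True
        return False

    return sum(1 for server in server_shares if assign(server, set()))

# Zooko's example: the greedy count gives 1 or 3 depending on the order in
# which servers are examined, but the matching-based count is 3 either way.
layout = {"s_4": [1, 2, 3], "s_1": [1], "s_2": [2], "s_3": [3]}
print(servers_with_unique_shares(layout))   # 3
```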
I interpret Shawn Willden's message http://allmydata.org/pipermail/tahoe-dev/2009-October/002972.html as being a problem that would be solved by this ticket. Shawn -- care to comment?
Hm. That scenario would be a problem, and I don't really see an obvious solution to it.
We could alter the logic at source:src/allmydata/immutable/upload.py@4045#L225 to not just give up after determining that there are no homeless shares, but that there aren't enough distinct servers with shares to consider the upload a success.
We could, for example, figure out how many more servers need to have shares on them for the upload to work (n = servers_of_happiness - servers_with_shares). We could then unallocate n shares from servers that have more than one share allocated, stick them back in self.homeless_shares, and then let the selection process continue as normal. We'd need a way to prevent it from looping, though -- maybe it should only do this if there are uncontacted peers. Would we want to remove shares from servers that happen to already have them if we're not counting them in the upload? If so, is there a way to do that?

Does that idea make sense?
Regarding holding up this patch versus committing now and making it a separate issue:
I have school and work to keep me busy, so I'd be able to dedicate maybe an afternoon or two a week to keep working on this issue. I'm happy to do that -- I'd like to finish it -- but it would probably be a little while before we ended up committing a fix if we waited for that to be done (if someone with more time on their hands wanted to take over, that issue would be solved, I guess). So I guess that's one argument for making it a separate issue. On the other hand, it'd be nice to eliminate edge cases before committing. So there's that. I'm not sure which way I lean.
Oops -- I forgot that I made the naming changes that you suggested while writing that, and tried to upload a patch with them, but then I realized that there was a typo in the patch name, and that I missed some of the names. Here's a revised patch without those issues.
Attachment behavior.2.txt (20043 bytes) added
The behavior discussed in this ticket
Yes, we definitely want tests. Check out [GridTestMixin]source:src/allmydata/test/no_network.py@20090815112846-66853-7015fcf1322720ece28def7b8f2e4955b4689862#L242 and the way that it is used by [test_repairer.py]source:src/allmydata/test/test_repairer.py?rev=20091005221849-66853-3d1e85b7a2af40ddd07f07676ffb8f6dcc57d983#L397. I guess we could use that pattern to set up a test grid with the shares stored on servers in a specific pattern such as discussed in comment:72465 and comment:72466.
Kevan and I chatted on IRC and we both independently thought that his patch shouldn't go in when it is susceptible to that "spurious failure to upload" problem, because if that problem were to occur the user would have no good way to work-around it and would be stuck.
Okay, I'm updating two patches.
I updated my tests patch to include a test for the scenario Zooko proposed in comment:72466. It's not quite ideal (I need to figure out a way to make the Tahoe2PeerSelector pop server 0 off the peers list first for it to be perfect), but it fails with the current code.

I also noticed that my _servers_with_unique_shares method in upload.py was comparing peerids with things that weren't peerids, so I made a minor change to the behavior.txt patch to address that.

My todo list is basically:
I welcome any comments.
In http://allmydata.org/pipermail/tahoe-dev/2009-October/002998.html I suggested this measure of "servers of happiness":
First of all, let's call a set of servers "sufficient" if you can download the file from that set (i.e. if at least K distinct shares are hosted in that set of servers).

Now consider the largest set of servers such that every K-sized subset of it is sufficient.

Let's call the size of that largest set S. Now my intuition about "Happyness" is that I configure a Happyness number H, and if an upload results in S >= H then I'm happy.

I think this is also Robert Metcalf's intuition. It may also be Shawn Willden's intuition, but on the other hand perhaps Shawn Willden's intuition is something more sophisticated. ;-)

A neat thing about this way of thinking is that the number S could represent the "health" or "robustness" of the file. An upload or a file-check operation could report S to the user.

I replied to this comment in http://allmydata.org/pipermail/tahoe-dev/2009-October/003000.html, but I want to reformat it for Trac before posting it here. In the meantime, there's that.
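For small grids, S as defined here can be computed directly by brute force (illustrative only -- exponential in the number of servers):

```python
from itertools import combinations

def happiness_S(server_shares, k):
    """Size of the largest set of servers such that every k-sized subset of
    it is sufficient (i.e. holds at least k distinct shares between them).
    server_shares maps server -> set of share numbers; sets smaller than k
    are not considered."""
    servers = list(server_shares)
    for size in range(len(servers), k - 1, -1):
        for candidate in combinations(servers, size):
            if all(len(set().union(*(server_shares[s] for s in subset))) >= k
                   for subset in combinations(candidate, k)):
                return size
    return 0

# seven servers, one distinct share each, 3-of-10 encoding: S == 7
print(happiness_S({i: {i} for i in range(7)}, 3))
```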
Attachment tests.2.txt (34070 bytes) added
I've updated tests.txt to have a test for the layout I discuss in comment:72465. This required a minor change to a couple of the methods in NoNetworkGrid, which that patch also contains.
(is there a way of deleting files from a ticket? Maybe I should just work harder at remembering to check the replace checkbox when I update tickets :-).
I'm updating tests.txt to fix some bugs where I was mixing callbacks with synchronous code where I shouldn't have been. I also tweaked the share distribution for the comment:72466 test so that the Tahoe2PeerSelector sees the servers in the right order.
The simplest fix I can think of for the comment:72466 issue changes the way that the [_got_response]source:src/allmydata/immutable/upload.py@4045#L313 method in a Tahoe2PeerSelector instance handles existing shares in a response.
Right now, if a peer tells Tahoe2PeerSelector that it has a share that it wasn't asked to allocate (in the alreadygot part of the response), then the logic in _got_response will alter the entry for that share in preexisting_shares to point to the peer, regardless of whether or not that entry was already pointing to something else.

It's kind of rough, but this check fixes the issue for me (in that it makes the test pass). Thoughts?
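Presumably the rough check is something along these lines (an illustration, not the actual patch hunk): only credit a share to the first server seen holding it, so that a later, larger response cannot clobber entries already pointing at other servers.

```python
def record_alreadygot(preexisting_shares, peerid, alreadygot):
    """Only record the first server reported as holding a given share
    number, so a later response cannot overwrite an entry that already
    points at a different server."""
    for shnum in alreadygot:
        if shnum not in preexisting_shares:
            preexisting_shares[shnum] = peerid

# With the comment:72466 layout and s_1..s_3 responding before s_4, the
# three shares stay credited to three distinct servers instead of all
# collapsing onto s_4:
shares = {}
for peer, got in [("s_1", [1]), ("s_2", [2]), ("s_3", [3]), ("s_4", [1, 2, 3])]:
    record_alreadygot(shares, peer, got)
print(shares)                        # {1: 's_1', 2: 's_2', 3: 's_3'}
print(len(set(shares.values())))     # 3 distinct servers, not 1
```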
I'm updating behavior.txt to have the rough fix that I mention in comment:72476, and tests.txt to add a test for the logic that calculates servers_of_happiness in Tahoe2PeerSelector. I think this fixes comment:72466. Thoughts/feedback?
I was thinking about this the other day, and got to wondering about how the Encoder handles preexisting shares in the event of some servers failing during an upload.
(Note that the following example is in terms of the existing shares_of_happiness behavior -- it is easier to link to that code than to my patches.)

As an example, we first look at [start_encrypted]source:src/allmydata/immutable/upload.py@4045#L711 in CHKUploader. This method creates and runs a Tahoe2PeerSelector to distribute shares of an IEncryptedUploadable across the grid. The results of this are handled in [set_shareholders]source:src/allmydata/immutable/upload.py@4045#L753. Note that the PeerTracker instances in use_peers are sent to the Encoder instance, while the peerids in already_peers are only used in the upload results. In any case, after invoking set_shareholders on the Encoder, the CHKUploader starts the upload.

The part of the encoding process that concerns me is [_remove_shareholder]source:src/allmydata/immutable/encode.py@4045#L489. This method is called when there is an error sending data to one of the shareholders. If a shareholder is lost, the Encoder will check to make sure that shares_of_happiness is still met even with the lost server -- if not, it will abort the upload. The problem with this check is that the Encoder, from what I can tell, has no way of knowing about the shares that already exist on the grid, and thus can't take them into account when making this check. So, if I have (say) 8 shares for my storage index already on the grid, shares_of_happiness = 7, only two things for the Encoder to actually send, and one (or both) of those transfers fail, my upload will fail when it shouldn't.

Does it seem like I'm off-base there? If not, then it certainly seems like my implementation of servers_of_happiness would fall victim to pretty much the same issue. Is there an obvious way to fix that?

(This comment should have a unit test or two written for it, so that what I'm saying is demonstrated/more easily understood, but I need to leave for class now.)
Okay I read your explanation and followed along using the links that you embedded and I think you are right. I hope you write that unit test or two when you get back from class!
To fix this in terms of shares-of-happiness I guess [CHKUploader.set_shareholders()]source:src/allmydata/immutable/upload.py?rev=20090815112846-66853-7015fcf1322720ece28def7b8f2e4955b4689862#L753 could also give an integer "already uploaded shares" to the encoder, and the encoder could add that integer to the len(self.landlords) that it uses to decide if there is still a chance of success.

To fix it in terms of servers-of-happiness, I suspect that CHKUploader.set_shareholders() would need to pass more information to the encoder, perhaps telling the encoder everything it knows about which servers already have which shares. I'm not entirely sure how your current patch works, but I'll write more about that in a separate comment.

I'm uploading the new tests. I'll need to modify them again when we define a way to give the Encoder more information about which servers have which shares.
It occurred to me when writing that test that my patch to Encoder doesn't do what I thought it did when I wrote it. Specifically, since self.landlords in an Encoder is simply a list of IStorageBucketWriters (I'm not sure what I thought they were, but it wasn't that), doing set(self.landlords) doesn't really do anything (let alone give a good value for the number of servers that are left). So I need to think of a better way of doing that, too.
Attachment tests.3.txt (42374 bytes) added
I altered the set_shareholders method in [IEncoder]source:src/allmydata/interfaces.py@4088#L1224 to require a servermap argument. servermap is a mapping of shnum to a string (the peerid, ideally) that will be storing (whether by result of an upload or by result of already having it) a share. This gives the Encoder enough information to make an accurate check for servers_of_happiness when it loses a peer and (combined with modifications to make code that used the Encoder use the new form of set_shareholders) also makes the tests pass.

Comments? Is there anything else in the way of this issue? Is there a cleaner way of altering the Encoder to do what I want it to do?
I didn't understand your proposed solution in comment:72476 to my proposed problem in comment:72466. Doesn't your solution mean that if the uploader encounters the servers in the opposite order, ending with s_4, the server that has one copy of each share number, that the upload will then abort thinking that the "sufficiency number" S is a mere 1?

If that's the case, then a test just like the second layout in your nice new test_problem_layouts but with the order reversed should fail.

Yes, that's right -- I misread your comment, and the test you suggest does indeed fail (I'll update tests.txt with the new test).
I think that the general solution I propose in comment:72468 (but didn't use because, given my view of comment:72466 at the time, the one I ended up implementing seemed easier and just as effective) would still work for that issue -- upon seeing that there are no homeless shares, but a too-small S, the Tahoe2PeerSelector would check to see if there are more uncontacted peers than servers_of_happiness - S, and, if so, stick servers_of_happiness - S shares back into homeless shares for the algorithm to try to distribute.

This is (at best) likely to be inefficient, though. If the Tahoe2PeerSelector has seen (as an example) s_4 with all the shares, and there are s_1, s_2, s_3 with f_1, f_2, f_3 respectively, we'd want to take advantage of the existing shares on s_1, s_2, s_3 and not allocate different shares to them. In this scenario, though, Tahoe2PeerSelector doesn't know anything about these servers other than the fact that it hasn't contacted them, so it can't do much more than guess. If the servers accept the shares, though, the upload will work as it should.

In the worst case -- that of full s_1, s_2, s_3 -- this will fail, too, because unless we happen to choose the right shares from those already allocated to s_4 to put back into homeless shares, we'll end up with homeless shares. Only I'm forgetting that the peer selection process has multiple phases. What I think will happen in that case is that when the Tahoe2PeerSelector attempts to store shares on those servers, it'll fail, but it will also discover that those servers have f_1, f_2, f_3, and record those in its mapping of already allocated shares. It will then ask s_4 to store the shares that it wasn't able to store on s_1, s_2, s_3 -- shares which s_4 already has. So these shares will be removed from the list of homeless shares, the list of homeless shares will be empty, and the servers_of_happiness check should succeed.

Does that seem right?
I think comment:72466 generalizes to any instance where the first server (or really the first n servers, where n is less than servers_of_happiness) happens to store all of the shares associated with a storage index. To that end, I'm also adding a test where s_1, s_2, s_3 are empty. I'm also adding a test for the "worst case" described above.

I'm attaching an updated behavior.txt with my solution from comment:72468. It passes most of the new tests, with the exception of the one for the worst case in comment:72484. This fails because of [a check in Tahoe2PeerSelector]source:src/allmydata/immutable/upload.py@4045#L195 that filters the list of peers given to the selector instance to only those peers capable of accepting a share of this file, based on the largest sharesize they can accept, according to the results of [remote_get_version()]source:src/allmydata/storage/server.py@3871#L232 in the storage server. If peers are full, they will be weeded out by this check, and their existing shares will not be accounted for in the peer selection process. However, removing the check entirely isn't a good option, either; according to the comment, the check is necessary to ensure that #439 is handled.
Ideas? It'd be nice to handle this case, though it is kind of an edge case, so maybe it's okay to not handle it? That feels kind of dirty, though. Is there another criterion that we could use to address #439? Maybe we could modify Tahoe2PeerSelector to view these peers as "read-only" peers; that is, it will ask them if they have any existing shares, but not attempt to allocate any shares to them?
So I was playing with this yesterday and thought of a potential fix:
Instead of throwing away the peers that aren't able to accept a share of a potential upload, the Tahoe2PeerSelector should keep them (temporarily) around in a list of read-only peers: peers which may already have shares of a particular file, but which won't accept any new shares. We can then ask these peers (using [remote_get_buckets()]source:src/allmydata/storage/server.py?rev=3871#L413) which buckets (if any) they have for our storage index, using the results to populate self.preexisting_shares. The peer selection process then runs as it normally does, knowing about the shares on the peers that it would have thrown away before. I'm attaching updated behavior.txt and tests.txt with these fixes.

Does this seem reasonable? Does anyone have other suggestions?
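Purely as an illustration of that flow (hypothetical data shapes and helper names, not the code in behavior.txt), the idea is roughly:

    def plan_with_readonly_peers(peers, existing_buckets, allocated_size):
        """peers: list of (peerid, max_accepted_size) tuples (hypothetical shape).
        existing_buckets: peerid -> set of share numbers that peer already holds,
        standing in for the results of remote_get_buckets().
        Returns (writable peerids, preexisting_shares)."""
        writable = [pid for (pid, max_size) in peers if max_size >= allocated_size]
        readonly = [pid for (pid, max_size) in peers if max_size < allocated_size]

        preexisting_shares = {}  # shnum -> peerid
        for pid in readonly:
            for shnum in existing_buckets.get(pid, ()):
                # credit the full/read-only peer so its shares still count
                # toward servers-of-happiness
                preexisting_shares.setdefault(shnum, pid)
        return writable, preexisting_shares

    # toy example: s2 is full but already holds shares 0 and 1
    peers = [("s1", 10**9), ("s2", 0)]
    print(plan_with_readonly_peers(peers, {"s2": {0, 1}}, allocated_size=1000))
    # (['s1'], {0: 's2', 1: 's2'})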
I haven't reviewed your code yet, but what you wrote in comment:72486 sounds right to me.
By the way I spoke with Brian Warner in person last night and we agreed that he will review this ticket once Kevan and I are satisfied with it just to make extra double sure that we aren't screwing something up. Brian is a bit skeptical about our whole approach (e.g. see his cautionary comments in comment:72449) but I'm pretty sure we're on the right track.
Hey check it out I just happened to notice #608 (premature abort of upload if some shares were already present and some servers fail). I think Kevan's patch might also fix #608.
I made a tweak to one of the utility functions in upload.py to make sure that servers_of_happiness is not overcounted -- if the peer selection process yielded both a bucket in self.use_peers and an existing share in self.preexisting_shares, and those happened to be different peers, both of those peers would be counted. I added a test to demonstrate this behavior. My fix was to alter a utility function to check for this situation, and only count the peer associated with the bucket if it is detected. If anyone has a more elegant fix for this, please post it.
I also did a bit of cleanup and reorganization. In particular:
Thoughts?
Also, I think that my patch fixes #608 -- I added a test called test_dropped_servers_in_encoder that does the same thing (but with different numbers) as Zooko's suggested test in #608, and it passes now.

Marking this ticket for review (by me). Once I approve it then let's make sure the docs and unit tests are so thorough that Brian can review it without having to read this entire thread. :-)
Cool! The patches aren't necessarily the most intuitive to read -- if you want me to do anything to them to make your review process easier, just let me know. :)
We need to raise a different exception than NotEnoughSharesError in the case that upload fails due to not enough servers being available. In case you think I am being overly picky, consider the fact that I am currently answering support mail in another window from an allmydata.com customer who is asking about this "NotEnoughSharesError" that he sees. :-)

I opened ticket #834 (clarify error message when upload fails due to insufficient servers of happiness) and assigned it to Kevan.
See http://allmydata.org/trac/tahoe/ticket/834#comment:-1 for an explanation of these updates.
I'm reviewing doc.txt.
Other than that it looks right!
I think you're right on both points, and I've updated docs.txt to make those changes.
I'm reviewing tests.txt.
- In def _remove_share_0(), I got confused for a while because the comment numbers the servers as 1, 2, 3, 4 but the code indexes them as 1, 2, 3, 0. Maybe change the docs to use the code's indexes, or maybe add a comment in the docs explaining which server number corresponds to which index in self.shares? Oh, even easier would be to rename _remove_share_0() to _remove_share_0_from_server_4().
- UploadHappinessError is raised instead of NoSharesError during the upload process, but the test.txt attachment (the one that I am looking at right now -- unfortunately I don't know how to indicate which version of test.txt I mean) is testing for NoSharesError, like this:
- In test_dropped_servers_in_encoder() I wasn't sure I remembered what was being tested -- please add some in-line comments at the beginning saying what the uploader ought to do in what case. Otherwise test_dropped_servers_in_encoder() looks fine.

More to come -- I'm not done yet reviewing tests.txt. I've reviewed it from the top down to test_dropped_servers_in_encoder().

Replying to zooko:
I think the message needs to be split into two cases. Technically, there are always at least $K-1 servers such that any $K of them will be able to reconstruct the file (vacuously, since there are no $K-element subsets of a ($K-1)-element set). However, it is confusing to say that. So, if $X < $K, we should use a different message, something like "We were only able to locate or upload shares to $S servers. We were asked to make it so that there would be at least $H servers, any $K of which are sufficient to download the file."
Oh, and the special case "We were only able to locate or upload shares to 0 servers." would be better expressed as "We were not able to locate or upload shares to any servers."
From docs.txt:
I don't think it makes sense to allow shares.happy to be less than k, since (if I understand correctly) that is equivalent to letting all uploads succeed even if they place no shares.
Shouldn't UploadHappinessError be named UploadUnhappinessError?

(Apologies for the delay in replying to these comments -- I've been a bit slammed with schoolwork lately, so I'm getting to them as I have time.)
Replying to zooko:
Good catch -- I thought I'd caught all those. I changed the test to do that.
This is actually changed in a later patch in tests.txt. Sorry -- making a lot of small patches made it easier to keep track of things as I was working on this, but it probably makes it harder to review. If you like, I could give a flattened, diff-style "patch" for you to review instead?
Okay, that's done too.
Thanks for reviewing.
I agree with David-Sarah on the idea of having two error messages. For the error message where there are fewer servers than k, it now prints something like:
and, in the other case, it prints something like:
Note that this particular error message is only output if the peer selection process managed to place all of the shares -- this means that there will always be at least one server, so we don't need to worry about special-casing it for 0 servers. In cases where the selection process wasn't able to place all of the shares, the error message looks like:
Here, servers_of_happiness is handled by "want to place on 4 servers"; this is a necessary condition for happy=4 to be satisfied, but I think it could probably be improved, so I'll look into doing that if no one disagrees.
That message is very similar to a message logged when an upload succeeds: the success message can be seen here: http://allmydata.org/trac/tahoe/browser/src/allmydata/immutable/upload.py?rev=c4d38ad4c56233e3#L228, while the message above can be seen here: http://allmydata.org/trac/tahoe/browser/src/allmydata/immutable/upload.py?rev=c4d38ad4c56233e3#L288
I abstracted these into one message, which reads like this:
This is now logged when an upload is successful, and stored in the UploadHappinessError when an upload fails for not having placed all shares. The messages that Zooko and David-Sarah commented on earlier do not include this message (they didn't before, either), but could easily be made to do so (maybe as a footnote to the more informative error text), if we wanted to.
Replying to davidsarah:
I agree. I'm changing that. I'm also changing the upper bound to "N"; the peer selector doesn't know how to place more than N shares, so anything greater than that would not be satisfiable, either.
We should probably make the client do a bounds check on that value before using it, in case it is set to something that we don't support (especially if there were different acceptable defaults before). Is there a preferred way to do that?
Replying to davidsarah:
That is more consistent with the naming of the other exceptions. I'll go ahead and change it.
Thanks for the feedback so far.
Dear Kevan:
I'm sorry it has taken me so long to do more review.
From comment:91:
"This is actually changed in a later patch in tests.txt. Sorry -- making a lot of small patches made it easier to keep track of things as I was working on this, but it probably makes it harder to review. If you like, I could give a flattened, diff-style "patch" for you to review instead?"
Either way will work fine for me. I didn't realize there were subsequent patches in tests.txt, but now that I know that I can handle it.
I just realized that there is another ticket that this ticket will help with -- #614 (redefine "Healthy" to be "Happy" for checker/verifier/repairer). http://allmydata.org/pipermail/tahoe-dev/2009-December/003435.html
Okay I've consolidated the tests.txt patches into one diff (with the appended darcs command line) and I'm reading through it. There's an error in the comments in test_upload.py:test_problem_layout_comment_52(). It says:

I think it ought to say:
I think that the error message is still not quite accurate enough. (Sorry, I know explaining this in an error message is tricky.)
Currently in test_problem_layout_comment_52(), when there is a comment:72465-style situation (per the comments quoted in comment:72511), the error message is "shares could only be placed or found on 2 server(s). We were asked to place shares on at least 4 servers such that any 3 of them have enough shares to recover the file". This is wrong because shares were actually placed or found on 4 servers, right? How about "shares could be placed or found on 4 server(s), but the shares are not spread out evenly enough to ensure that any 3 of those servers have enough shares to recover the file. We were asked to place shares on at least 4 servers such that any 3 of them have enough shares to recover the file."?

Here's the darcs command-line to show the combined diff of the eleven different test patches in tests.txt:
Good catch on the comment -- I'll update the tests patch to fix that.
I like your error message better for the case where shares are on (or could be on) more than k servers. It is nonsensical, though, when we have placed shares on fewer than that -- you'd get something like "shares could be placed or found on 2 server(s), but the shares are not spread out evenly enough to ensure that any 3 of those...". So I left the existing message in for that condition.

I'll update behavior.txt and tests.txt with these changes now.
I just updated docs.txt and tests.txt to fix some conflicts that they had with the current state of the project. If you're working with them, you should download the new versions.
Brian: I'm not ready for you to review tests.txt or behavior.txt yet, but I am ready for you to review docs.txt, which I've already reviewed. See if you can read docs.txt and yet refrain from going on to read tests.txt and behavior.txt in order to understand the patch better. :-)
I'm reviewing tests.txt now.
Kevan: I got confused about the server layouts in the comments in the tests. Here is a diff that shows the notes I made in the file as I went. Please consider these as questions or suggestions rather than as an actual diff for you to automatically apply. Thanks!
And, on line 897:
I was confused for a while because I didn't realize the servers were read-only.
From behavior.txt:

Reading the following function signature and docstring:

I don't understand the meaning of "unique" and whether the return value will just be every server whose peerid appeared in the values of existing_shares. Also, what is the meaning of the list of PeerTracker instances?

Reading the body of servers_with_unique_shares() I still don't understand what it is calculating. Even though it has a few comments like:

It isn't clear to me why favoring the pre-existing share would lead to overcounting, or even what the two different things peer.buckets and existing_shares are.

By the way I am sleep-deprived, I'm tired from the effort of hobbling around on crutches (now that I have a job and I have to actually leave the house every day) and I have a cold. This makes me the perfect patch reviewer! If you can fix up the code and doc so that even I, in my foggy state, can understand it then anyone can understand it.
Okay, looking at the way that servers_with_unique_shares() is invoked in upload.py, it kind of seems like its name could be changed to servers_of_happiness()! And maybe a docstring could be written for it explaining its semantics in more detail.

servers_with_unique_shares() (which per the previous comment should perhaps be renamed servers_of_happiness()) returns a set, but the various callers of it never do anything but take the length of that set. Refactor it so that this function returns just the length. (This might contribute to optimization of the implementation of the function, but more importantly it will make it easier for the confused reader, like me, to figure out what it does and what it is used for.)

Okay, having read the test function from test_upload.py named test_servers_with_unique_shares() I am left with some questions in mind about the intended semantics of servers_with_unique_shares(). The test function asserts that an existing_shares of {1:"server1", 2:"server2", 3:"server3", 4:"server4"} results in 4 and that an existing_shares of {1:"server1", 2:"server2", 3:"server3", 4:"server1"} results in 3.

Then it adds in the used_peers option and asserts that an existing_shares of {1:"server1", 2:"server2", 3:"server3", 4:"server1"} plus a set of peers {"server5":[5], "server6":[6], "server7":[7], "server8":[8]} results in 7. That makes sense.

Then it asserts that an existing_shares of {1:"server1", 2:"server2", 3:"server3", 4:"server1"} plus a set of peers {"server5":[5], "server6":[6], "server7":[7], "server8":[8], "server1":[1]} also results in 7. That makes sense too.

Finally it asserts that when the inputs are empty the result is 0.

Okay, but now what about something like existing_shares of {1:"server1", 2:"server2", 3:"server3", 4:"server4"} plus a set of peers {"server5":[4,5], "server6":[3,5]}? Should that result in 5 or 6? Or what about just a set of peers {"server0":[0,1,3], "server1":[1,2,4], "server2":[0,2,5]}? How many "servers with unique shares" is that? I guess it is 3 because each of the servers has at least one unique share.

Okay, how about {"server0":[0,1], "server1":[1,2], "server2":[0,2]}? None of the servers have a share that isn't offered by any other server. So is the answer 0? I guess not!

Is this the function that actually implements the core "servers of happiness" logic? If so perhaps it should enumerate all relevant edge cases (that we've discovered so far in the very long discussion on this ticket!) or else have a comment directing the reader to the relevant other tests (such as test_problem_layout_comment_53()). And the test code -- or more likely the function itself, servers_with_unique_shares() -- should have doc describing these edge cases or directing the reader to the relevant doc (perhaps source:docs/configuration.txt).

But maybe the edge cases discussed in this ticket don't apply to this function, which is a pure function mapping the input sets to the output integer (and without regard to the order of the input sets), when many of the edge cases such as test_problem_layout_comment_53() have to do with what order the peers are found in.

So, maybe documentation could be added specifying what behavior this pure function has to provide in order to allow the code that uses it to satisfy those edge cases.
As you can see I'm in the perfect state to review docs and tests patches to ask for more clarification. I'm perhaps not in the perfect state to review code and look for bugs or improvements. If I get through all the docs and tests I'll probably have some sleep. :-)
A more idiomatic, and I think faster, and I think correct implementation of shares_by_server() would be:

(In fact, it is such a natural thing to want to do that I wonder if there is an even more idiomatic way to do it in Python or some builtin function that just does exactly this transformation. :-))
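The snippet itself didn't survive the ticket export, but the transformation being talked about is presumably inverting a shareid -> peerid map into peerid -> set of shareids. A minimal sketch of that inversion (my wording, not necessarily Zooko's exact code):

    def shares_by_server(sharemap):
        """Invert a {shareid: peerid} map into {peerid: set of shareids}."""
        servermap = {}
        for shareid, peerid in sharemap.items():
            servermap.setdefault(peerid, set()).add(shareid)
        return servermap

    print(shares_by_server({0: "A", 1: "A", 2: "B"}))
    # {'A': {0, 1}, 'B': {2}}

As for a builtin that does exactly this: collections.defaultdict(set) gets close, but you still need the loop.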
I guess the servers_of_happiness number can never be decreased by adding a server?

Regarding should_add_server(), I don't think that the name bucket is right. That parameter appears to hold a share number. A "bucket" refers to a service running on a storage server which holds a single share. source:src/allmydata/interfaces.py@4147#L35

Re: test_should_add_server: again, like with test_servers_with_unique_shares(), when I'm looking at this test I wonder if it shouldn't exercise more interesting cases or edge cases of the underlying should_add_server(). But again, maybe those edge cases are better tested in the "higher level" tests earlier in test_upload.py. And again, maybe doc about the underlying function or the testing strategy would allay my concern and tell me where to look to make sure that the edge cases are covered.

HOORAY! I'm finally finished reviewing all of Kevan's tests!
Now there are two things that can happen in parallel:
I think it might be a good idea to make docs/specifications/servers-of-happiness.txt at some point, so that all of the edge cases and proofs and whatnot in this ticket are summarized in one place, and explained more clearly than in this ticket.
That specification document might be a good idea, yes.
Thanks for the comments on the test annotations. I've made some changes that will hopefully make the issues you addressed clearer. I've also annotated the functions in upload.py that you mentioned; I'd try to answer your questions in this comment, but it would probably be a better test for my modifications if you instead tried to see if they answer your questions. I like your revised implementation of should_add_server, though I think you probably meant to type:

because yours didn't work as written.
I'll upload the changes I've made now.
In behavior.txt a line of source:src/allmydata/immutable/download.py gets changed to say:
If you are going to create a servers-of-happiness spec doc then I guess that line should refer to it instead of to the ticket.
Reviewing the latest tests.txt:

Around line 986 it looks like the server numbers are 1-indexed instead of 0-indexed. Is that right or am I confused? I think they should all be consistent, and I guess that means all 0-indexed to match the Python code, e.g. self._add_server_with_share(server_number=2, share_number=0), which is adding the server whose index in a Python list is "2", not the second server (the second server's index in a Python list is "1").

I read the comment at def test_servers_of_happiness_utility_function() and was confused about what servers_of_happiness() does, but then I discovered the docstring of servers_of_happiness(), which explains it better. So I suggest reducing the comment at def test_servers_of_happiness_utility_function() to something like:

    This function is concerned with the one-to-one (a.k.a. "injective") function
    servers_of_happiness(); many issues that deal with other aspects of the
    servers of happiness behavior that happen to use this function are tested
    elsewhere. These tests exist to ensure that upload.servers_of_happiness
    doesn't under- or overcount the happiness value for given inputs.
The tests and the comments thereof are now looking very good! It is time for Brian and/or David-Sarah to review docs.txt and tests.txt!
The apparent 1-indexing around 986 is intentional. I delete server 0 from the grid in that part of the test because it has all of the shares associated with the file, which isn't what I want it to have for that particular scenario. There is a comment to that effect:
What can I write there to make what I'm doing clearer? Or would that part of the test be clearer if I just manually deleted shares from the storage directory associated with server 0? (To me, adding server 4 made the test easier to read than doing that, but maybe I'm wrong.)
I'm glad that the new docstring for upload.servers_of_happiness does the job, and I've removed the redundant comment at the top of its test.
I made #911 for the task of making a specification file for the servers of happiness behavior.
Okay I think you've done enough to make it clear and I just wasn't reading your comments carefully enough.
I've reviewed docs.txt and cannot find any problems. However there is some text in source:docs/frontends/webapi.txt that might need updating:
Is this still an accurate description of what ?t=check does?

Replying to davidsarah:
Actually, changing that would be part of #614 (redefine "Healthy" to be "Happy" for checker/verifier/repairer). So LGTM for docs.txt.
A few notes about behavior.txt:

Write del(self.servermap[shareid]) as del self.servermap[shareid].

servers_left = list(set(self.servermap.values())) and then taking len(servers_left) looks like it could be much simplified and optimized to num_servers_left = len(self.servermap). :-)

could be simplified to

About servers_of_happiness():

used_peers doesn't need to be copied because servers_of_happiness() never mutates it.

Agreed, but if there were an easy way to assert inside servers_of_happiness() that no share is assigned to multiple servers, that would be cool too.

For bucketcountdict, please put a comment saying what the keys and values are and, if it seems helpful, what bucketcountdict is for. Ditto for sharecountdict.

Oh, I see that sharedict already has a comment, which starts with "# We build a dict of shareid => peerid mappings...". Maybe move that comment up to the declaration of sharedict.

Tiny style detail:

is written as

in the rest of our code, so it would be good to do so here for consistency.

could be:

Hm, hey doesn't this boil down to checking whether sharedict[k] is in there or not?...

Replying to zooko:
If I'm not mistaken, this will simply take the number of shares in the sharemap -- the set call is necessary so that we consider the number of distinct servers. It can be optimized to

though (I don't know why I thought that you couldn't take the length of a set), and I've done that.
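A tiny illustration of the distinction (assuming the shnum -> peerid servermap shape described earlier in this ticket):

    servermap = {0: "server_a", 1: "server_a", 2: "server_b"}
    print(len(servermap))                # 3 -- counts shares (the keys)
    print(len(set(servermap.values())))  # 2 -- counts distinct servers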
Replying to zooko:
Something like that, actually. I shied away from that when I was writing that snippet because I thought that it would end up being really fragile, but it turned out not to be. Thanks.
I liked your other suggestions, and implemented them, including the assertion in upload.servers_with_shares and a test for it. I've also fixed a few other issues that I found while I was fixing the ones you pointed out. I'm attaching new patches with all of these changes.

Attachment tests.4.txt (137343 bytes) added
I realized as I was driving home just now that I don't know what the code will do, after Kevan's behavior.txt patch is applied, when "servers of happiness" can be achieved only by uploading redundant shares. For example, tests.txt adds a test in "test_upload.py" named test_problem_layout_comment_52 which creates a server layout like this:

Where server 0 is read-write and servers 1, 2 and 3 are read-only. (And by the way Kevan, please make the comments state that servers 1, 2 and 3 are read-only.)

In this scenario (with K == 3) the uploader can't achieve "servers of happiness" == 4 even though it can immediately see that all M == 10 of the shares are hosted on the grid.

But what about the case that servers 1, 2 and 3 were still able to accept new shares? Then our uploader could either abort and say "servers of happiness couldn't be satisfied", due to the fact that it can't achieve "servers of happiness" without uploading redundant copies of shares that are already on the grid, or it could succeed by uploading a new copy of shares 2 and 3.
We should have a test for this case. If our uploader gives up in this case then we should assert that the uploader gives up with a reasonable error message and without wasting bandwidth by uploading shares. If it proceeds in this case then we should assert that it succeeds and that it doesn't upload more shares than it has to (which is two in this case).
From behavior.txt:

should be

(The former constructs a new array (Python list) of the keys, tests for bucket in it and then immediately discards it. Also it is wordier.)

From upload.py:

I don't understand this comment. It appears that the code is blindly clobbering entries in self.preexisting_shares. Does that mean that the code is not making sure that we don't unintentionally report a lower happiness value than actually exists? I think it would make sense to turn self.preexisting_shares into a map from sharenum to a set of serverids instead of from sharenum to serverid.

If the code above is reporting a lower happiness value than it ought then we should be able to write a unit test that fails in which the file is actually happy but the upload reports it as unhappy.
There is a similar issue in set_shareholders(). Here is the code snippet with a giant comment that I typed into it while doing code review:

In this case I think that it is okay to blindly clobber the entry in servermap, but only because we will do so only in the case that that entry wasn't doing any good in terms of servers of happiness. However, I am really not sure, and I wonder if redefining servermap to be a map from sharenum to a set of serverids instead of a map from sharenum to a serverid wouldn't make things clearer and possibly also less buggy.

Kevan:
Here is a patch which attempts to make construction of the set of read-only peers easier to understand (for me). Please treat this patch merely as suggestions, not as an actual patch that I want you to apply. (If, for example, you think that it is better to have the self.readonly_peers member variable declared at the top of the function with the other member variables, that's okay with me, although then it should not be initialized to an empty list and then later redirected to point to a different object -- either the empty list should later be filled or else it should be initialized to None.)

servers_of_happiness() ends with:

It seems like the servers local variable should be a set instead of a list from the start, and then the final line of the function can be:

Kevan:
I've been struggling and struggling to understand the servers_of_happiness() function. The documentation -- that it attempts to find a 1-to-1 (a.k.a. "injective") function from servers to shares -- sounds great! But, despite many attempts, I have yet to understand if the code is actually doing the right thing. (Note: this may well be in part my fault for being thick-headed. Especially these days, when I am very sleep-deprived and stressed and busy. But if we can make a function that even I can understand then we'll be golden.)

So, one thing that occurs to me as I look at this function today is that it might help if existing_shares and used_peers had more consistent data types and names. If I understand correctly what they do (which is a big 'if' at this point), they could each be a map from shareid to serverid, or possibly a map from shareid to a set of serverid's, and their names could be existing_shares and planned_shares, and the doc could explain that existing_shares describes shares that are already alleged to be hosted by servers, and planned_shares describes shares that we are currently planning to upload to servers.

Would that be correct? It raises the question in my mind as to why servers_of_happiness() distinguishes between those two inputs instead of just generating its injective function from the union of those two inputs. I suspect that this is because we want to prefer existing shares instead of new shares when the two collide (i.e. when uploading a new share would be redundant) in the interests of upload efficiency. Is that true? Perhaps a note to that effect could be added to the servers_of_happiness() doc.

I realize that I have asked so many times for further explanation of servers_of_happiness() that it has become comment-heavy. Oh well! If we see ways to make the comments more concise and just as explanatory that would be cool, but better too many comments than too little, for this particular twisty little core function. :-)

Thanks!
(I'm working on these as I have time -- I usually have a lot to do during the week)
Replying to zooko:
There is a test for this (or something very like this) in test_problem_layouts_comment_53:
Note that this is slightly different from your case, in that the other servers have no shares at all. So the correct number of shares for the encoder to push is 3, not 2. I didn't have the assertion in there, though, so I'll go ahead and attach a patch where the assertion is there. This also uncovered a bug in should_add_server, in which should_add_server would not approve of adding unknown shares to the existing_shares dict if they were on a server that was already in existing_shares. I've fixed this, and added a test for it.

Replying to zooko:
The point of should_add_server is to make sure that that assignment only happens if it would make self.preexisting_shares happier than it already is -- that's what the if-statement is checking. Maybe I don't understand what you mean when you say blindly clobbering shares?

I'm not sure what I think about making preexisting_shares a map to a set of peerids. It would remove the need for should_add_server, but would also make the end calculations more complicated. I'll think about that when I've worked through the rest of your comments.

(Of course, if I'm wrong in my last comment, maybe it makes sense to do that regardless of complexity.)
Replying to [kevan]comment:145:
Shouldn't the point be to make the union of self.preexisting_shares and used_peers (a.k.a. planned_shares) happier than it already is?

By "blindly clobbering" I mean what your comment said originally: "blindly clobbering entries in self.preexisting_shares", i.e. doing a self.preexisting_shares[shareid] = something without first checking whether there is already an entry in that dict under the key shareid.

Replying to kevan:
Okay, think about it!
Yeah, there are two notions of "complexity". One is computational -- basically "the size of the code for the shortest implementation of this". The other is cognitive -- basically "the difficulty for the Tahoe-LAFS hackers to learn this and then to hold it correctly in their heads".
It might be the case that making both data structures be maps from shareid to set of serverid would make the algorithm "simpler" in the latter way -- easier to comprehend. I'm not sure.
Replying to [zooko]comment:148:
I thought about it more, and found a lot to like about this idea. Not only did it eliminate should_add_server, but it made servers_of_happiness (to me, at least) much clearer. I implemented that earlier today, and I'll attach updated patches with that shortly. Please read servers_of_happiness (now at src/allmydata/util/happinessutil.py, since there was no good reason not to calculate happiness the same way in the encoder as we do in the peer selection process) again, and if it is still confusing, hopefully we can fix it so that it isn't.

The description needs to be something like "or upload was unable to meet the 'servers_of_happiness' threshold." The name of the exception is also not quite right, but I don't have a good suggestion if it still needs to be the same exception for upload and download. Perhaps it doesn't, in which case how about just "FailedDownloadError" and "FailedUploadError"? Why it failed is probably not something that client code is likely to switch based on.
Replying to zooko:
In the current formulation of servers_of_happiness, I think this needs to be a list until it is returned, because we don't want the properties of a set until we've dealt with multiply-hosted shares. Perhaps you can read the revised function and let me know if you agree?

Replying to zooko:
I like having self.readonly_peers mentioned at the top of the function, if for no other reason than consistency, so if it's all the same to you, I'd like to leave that. I'll change it to initialize to None, though, and I'll amend that comment to mention that it is eventually a list, so the self-documenting aspects of that are preserved for the most part.

I think your other changes are a good idea, though, so I'll update my patches to incorporate them.
Replying to davidsarah:
Actually, I think that the upload process was switched in an earlier patch to use UploadUnhappinessError instead of that, so it should only be raised in the context of a download (from what I can tell, it isn't used out of that context outside of the immutable file code, either, but I might have missed something). Then it should be fine to just get rid of that part of the comment entirely. If it is not raised in the context of an upload, do you still think the name is inappropriate?
Aha, I apparently already made that change -- it just made it into a later part of behavior.txt. I noticed that I had neglected to rename something that referenced NotEnoughSharesError in the context of an upload while researching that, though, so I'll go ahead and incorporate that fix into the others that I'm uploading.
Tomorrow morning we'll see whether behavior.txt passes the "bus to work" test. I have about 20 minutes on the bus to work. Can I read and understand the patch in 20 minutes? While sleep-deprived and distracted? If so, then it is a great patch in terms of ease of comprehension! Whether I can comprehend it or not, I'll try to post some comment about it when I get to work. :-)
(Brian, David-Sarah, others, you are still more than welcome to review docs.txt and tests.txt. Also you are welcome to review behavior.txt, but I would suggest that you post your feedback about docs.txt and tests.txt first, if you haven't already.)
Okay a few quick notes:
I'm getting closer! For one thing, there is less code now, and it is starting to make more sense to me. But I didn't manage to really understand the whole thing on this bus ride.
I got stuck on this line for a while:
In general when reading Python I find it easier to understand "imperative style", in which variables get initialized and mutated, than "functional style", in which values get computed as compound functions of their inputs. After I figured out what the line above does, I saw that the later code uses count() on servers, and so I think the intended semantics of servers is a list where each serverid appears one time for each unique share that the server holds.

I would find it easier to understand if servers were a map from serverid to the number of unique shares that that server holds (or will hold according to the current plan as represented by used_peers). Oh, in fact, how about calling servers = shares_by_server(existing_shares) (after existing_shares has been updated to include the information from used_peers), then using len(servers[serverid]) to get the number of shares?

Oh, this existing_shares has the wrong name now that it includes information about planned shares from the used_peers argument. Maybe just shares. :-)

Okay, then I ran out of time and didn't read the new changes to upload.py.
I think it would help if the algorithm for calculating servers-of-happiness were written out in docs. I mean the algorithm as described imperatively as a sequence of calculations and mutations. Let me try to explain the algorithm myself as a way to get you started on documenting it:
The algorithm for calculating servers-of-happiness starts with a map from servers to shares that the server holds. More than one server can map to a single share and more than one share can be mapped to from a single server.
The algorithm looks for a share that has more than one server mapping to it and then excludes all but one of those links (from server to share). Then it iterates this until there are no more cases of a share with more than one server mapping to it, and then it is done and the happiness value is the number of servers mapping to a share.
Now to finish fully defining the algorithm we have to explain how it chooses which link to retain when it finds multiple links pointing to a share.
By the way last night a servers-of-happiness puzzle occurred to me. Suppose server A maps to shares 0 and 1, and server B maps to shares 1 and 2, and server C maps to share 2. Now if the servers-of-happiness algorithm described above first looks at share 1, it has to decide whether to exclude the link from server A to share 1 or from server B to share 1. If it chooses to exclude the link from server B to share 1 then it is going to end up returning a servers-of-happiness value of 2. But if it chooses to exclude the link from server A to share 1 then it is going to end up returning a servers-of-happiness value of 3. But I don't think the algorithm can figure that out just by examining share 2. We should write a unit test of this case and (a) see whether the current algorithm passes that test and (b) if it does, figure out why it does and see if we can write another unit test that it won't pass. :-)
replying to myself:
Replying to zooko:
Oh, and we have to specify how it chooses which share to examine next. Presumably a good way to choose the next share plus a good way to choose which links to exclude can solve the puzzle I posted in comment:72570 and other similar puzzles.
A simpler puzzle with the same issue is: server A maps to share 0 and share 1, server B maps to share 1.
I think we should not include #778 in Tahoe-LAFS v1.6. Even after I finish reviewing the patch and am willing to commit it to trunk, then we should probably give it some time to be manually tested before putting into a stable release. So let's plan to put it into the next stable release.
Tahoe-LAFS v1.6 needs to be released by (I think, probably) Thursday to have a reasonable chance of inclusion in Ubuntu Lucid.
Okay. Then I'm going to focus on finishing my review of #833 instead of this.
I was sure that computing servers-of-happiness must correspond to some well-understood mathematical problem, and I've now discovered which one -- it's called "maximum bipartite matching".
Since servers are distinct from shares, the relation between servers and shares corresponds to a bipartite graph.
Any bijective function from a subset of the servers to a subset of the shares that is included in this relation is called a matching of the bipartite graph.
The servers-of-happiness number is the maximum size of any such matching. Intuitively, that's because a matching gives a set of H servers that have at least H distinct shares between them.
So, apply a maximum bipartite matching algorithm to the sharemap (to find some maximum matching), take its size, and you're done.
(Actually, the servers-of-happiness value as we originally defined it is max(k-1, max_bipartite_matching_size). But the max_bipartite_matching_size works as a measure of the quality of the server->share mapping even when it is less than k-1. In that case there won't be enough shares to recover the file, of course.)
I doubt that the algorithm outlined by Zooko always computes max_bipartite_matching_size correctly. That's because it is a greedy algorithm -- it works by removing edges from the bipartite graph that are not in some maximum matching, and assumes that it's possible to make a correct local decision of which edge to remove. There are definitely greedy algorithms that result in a near-maximum matching, but it won't always be maximum.
Algorithms to find maximum bipartite matchings are really simple, just a few lines of code, and have reasonable algorithmic complexity. I will try to find one and translate it to Python.
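For reference, here is what such a translation might look like -- a generic augmenting-path ("Kuhn's algorithm") sketch, not the Edmonds-Karp/flow-based code that ended up in the later patches. It takes a sharemap of serverid -> set of share numbers and returns the size of a maximum matching:

    def max_matching_size(sharemap):
        """Size of a maximum matching between servers and shares, i.e. the
        servers-of-happiness measure discussed above.  sharemap maps each
        serverid to the set of share numbers it holds (or will hold)."""
        share_to_server = {}  # current matching: shnum -> serverid

        def try_to_match(server, visited):
            for shnum in sharemap.get(server, ()):
                if shnum in visited:
                    continue
                visited.add(shnum)
                # take the share if it is free, or if its current owner can
                # be rerouted to some other share (the augmenting step)
                if (shnum not in share_to_server or
                        try_to_match(share_to_server[shnum], visited)):
                    share_to_server[shnum] = server
                    return True
            return False

        return sum(1 for server in sharemap if try_to_match(server, set()))

    # Zooko's puzzle from comment:72570: A->{0,1}, B->{1,2}, C->{2}
    print(max_matching_size({"A": {0, 1}, "B": {1, 2}, "C": {2}}))  # 3
    # and the simpler one: A->{0,1}, B->{1}
    print(max_matching_size({"A": {0, 1}, "B": {1}}))  # 2

On the comment:72570 puzzle a greedy pass that resolved share 1 by dropping the B->1 link would get stuck at 2, but the augmenting step above reroutes A back to share 0 and reaches 3.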
this is brilliant. well done sir.
Oops, I missed Zooko's message disagreeing with doing this for 1.6.1.
I've been playing with this over the weekend after (very slowly) brushing up on graph theory enough to understand bipartite matching algorithms. I'm attaching updated behavior.txt and tests.txt patches with an implementation of the Edmonds-Karp algorithm for computing a maximum bipartite matching. My implementation is fairly general, so there is probably room for optimization. Also, David-Sarah were working on an implementation which may be clearer/more efficient than mine. I have not added any of Zooko's puzzles to the tests yet, but my changes pass all of the existing tests.
Attachment behavior.txt (139094 bytes) added
adding #834 behavior to the #778 patches
I closed #834 as a duplicate of this ticket.
I'm sorry, I guess I was waiting for you to add my puzzles to the tests, but I guess you are waiting for me to review the current patches! I will do so when I have the opportunity--probably this coming weekend.
oops, I'd forgotten about those completely. Sorry! I'll add them now.
Attachment tests.txt (169252 bytes) added
tests updated to be current
I was describing this ticket to Kris Nuttycombe and he wants to know why we have a separately configurable h parameter instead of using n for the "servers of happiness". That is: if your erasure coding parameter n is 10, then your servers of happiness ought to be 10 too. This sounds like a really good idea to me because it is what people who (like Kris) know only the basic architectural fact of Tahoe-LAFS's use of erasure coding assume.

The fact that the docs in the current patch on this ticket warn the user not to set h higher than the number of servers they have seems to play into this.

(I have the vague feeling that we've already considered this option and rejected it for some reason or another in this giant nine-month-long, one-hundred-and-seventy-message-long thread, but I thought it would be worth writing it down, so here's message one hundred and seventy-two!)
The docs from the doc patch advise you what you ought to set N equal to, but don't explicitly advise you what to set h
to:"""
we'll try to place 10 shares, we'll be happy if we can place shares on enough
servers that there are 7 different servers, the correct functioning of any 3 of
which guarantee the availability of the file, and we need to get back any 3 to
recover the file. This results in a 3.3x expansion factor. In general, you
should set N about equal to the number of peers in your grid, then set N/k to
achieve your desired availability goals.
"""
Kris points out that having each upload touch every one of the servers in your distributed filesystem seems like a strange thing to do. That's a good point! It probably makes good sense to set N to the number of servers when the number of servers that you have is between 6 and 16. It probably makes less sense when you have 50 or more.

Okay I've read through the latest test.txt and I didn't see anything wrong with it. Nice work, Kevan!
Replying to zooko:
The desired behaviour is that an upload should try to place n shares, but settle for placing h shares mapping to unique servers. If they are the same, that would mean that we always only try to obtain the minimum preservation and availability that the user has requested. It seems to me to be a good idea to attempt to obtain a higher safety margin whenever the requests to the additional n-h servers succeed immediately.

(If #946 were implemented in a way that did not increase memory usage, then the only additional cost would be bandwidth, not upload latency.)
Decoupling the n and h parameters might also be useful in the context of share redistribution and friendnets -- i.e.: if you have 10 mostly-reliable storage servers on your friendnet, then you'd ideally want one share on each of them, but you expect 2 or 3 of them to be offline sometimes, so you settle for h = 7, then let a redistribution process move the doubly-placed shares onto other storage servers as they become available. Using multiply-placed shares would allow a redistribution process to accomplish this without you giving it anything beyond the verify-cap.
It would, AFAIK, be possible to get as far as getting the unencoded ciphertext of a mutable or immutable file with only a verify cap, which also means that a redistribution process, given the verify cap, could re-encode the file if necessary, but it could not go farther than that. In the case of an immutable file, the block hash tree, share hash tree, and root hashes would all change, which means that the hash of the UEB would change, which means that the redistributed version of the file would be a new file, since a capability designating it would be distinct from a capability designating another encoding of the same ciphertext. In the case of a mutable file, the redistribution process would need a writecap to get the RSA private key necessary to sign the new encoding of the file, since the hash trees also change in the mutable file.
(in writing the above, I'm imagining a "redistributor" helper; you would give your verify caps to the redistributor, and it would, given the n encoded in the cap, try to make sure that h = n for you automatically)
OTOH, we don't have a redistribution helper right now, so maybe that's not a good reason to keep them separate.
(I don't think we've discussed this yet, so it is good that it was brought up.)
when I say "doubly-placed shares", I mean "multiply-placed shares", and when I say "multiply-placed shares", I mean "shares that have been placed on storage servers that are already storing shares for that upload", which is another way of saying "shares that won't make servers-of-happiness go up".
David-Sarah and Kevan: thanks for your explanations (comment:175 and comment:72590) of why it is useful to have h < n. That makes sense. (It's roughly similar to what I sleepily told Kris when he asked the other night.) I liked Kevan's friendnet motivating example.

Kevan: I don't entirely follow what you are talking about in comment:72590 about repairing using a verify cap and the immutable file cap changing. There are at least two tickets that sound relevant: #568 (converge same file, same K, different M), #678 (converge same file, same K, different M).
I intend to start reviewing Kevan's latest behavior.txt patch on the way to work this morning.
If I'm not mistaken, the issue is essentially the one in ticket 678, comment 3: a new encoding of the same file yields a different UEB hash than the old encoding; this causes clients with the old filecap to reject the shares from the new encoding because, from their perspective, the UEB is wrong.
If I'm not mistaken, the current repairer regenerates shares that were generated during the initial upload but are no longer available. In the first case in comment:72590 (where we decouple h and n to tolerate multiply placed shares so we can move them around later), we're redistributing shares to achieve optimal reliability, even though all of the shares may still be available. In the second case (where h and n are the same thing, so we have to generate a new encoding of the same file to take advantage of opportunities to improve reliability), we're re-encoding the file so we can take advantage of an opportunity to improve reliability. The file is not necessarily damaged or unhealthy in either of these cases, so while a redistribution operation could conceptually be called a repair, it is important to distinguish that from file repair as currently implemented in Tahoe-LAFS. I mention this in case conflating this operation and the repair facilities currently implemented in Tahoe-LAFS led to your confusion.
Does that make more sense?
This also suggests a value for h -- maybe it can be the smallest number of servers that you reasonably expect to be on your grid when it is functioning correctly.

Attachment docs.txt (7075 bytes) added
update documentation patches per comment:-1 and comment:-1
Here is a partial review of the latest behavior.txt.
I did this review on the bus after work, when I was tired, and now it is past bed-time and I'm even more tired, so take it with a grain of salt. The line numbers in the following are from viewing this URL: http://allmydata.org/trac/tahoe-lafs/attachment/ticket/778/behavior.txt . By the way Kevan, please start versioning your patch attachments to this ticket, e.g. calling the next one "behavior2.txt", so that I can refer back to the one that I just partially reviewed, and so that these line numbers will continue to be correct for that URL.
Here is a suggested edit to an error message:
change to
small improvement in Python usage:
replace with
Can peerid be false? Well, perhaps this question is out of scope of this ticket, since I see that get_peerid() thinks that it can return None, although I can't see how it could ever do that. Maybe add precondition(nodeid) in WriteBucketProxy.__init__().

Does this mean we'll get a less detailed error message if we become unhappy due to server loss during upload than if we become unhappy due to not being able to find sufficient servers to begin the upload? E.g. the way in the latter case we distinguish between peer_count < self.needed_shares vs. happiness < self.needed_shares vs. len(x-happy-subset) < servers_of_happiness.
Maybe we should extract the reporting code that produces a message for each of those three cases into a function, and call that function from both of the places where we decide to give up (a rough illustration follows at the end of this comment).
Make self.peers_with_shares be a set in the first place.
Okay I got down to line 1427 of the representation of behavior.txt at http://allmydata.org/trac/tahoe-lafs/attachment/ticket/778/behavior.txt . Hopefully I'll pick it up from there tomorrow!
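For illustration, the extraction might look something like this; the variable names and message wording below are hypothetical stand-ins for this sketch, not the actual code in upload.py:

```python
def peer_selection_failure_message(peer_count, needed_shares,
                                   effective_happiness, servers_of_happiness):
    """Build one failure message covering the three give-up cases discussed
    above. All of these names are placeholders, not real upload.py variables."""
    if peer_count < needed_shares:
        return ("we found only %d peer(s), but we need at least %d just to "
                "hold one share each" % (peer_count, needed_shares))
    if effective_happiness < needed_shares:
        return ("shares were placed or found on only %d distinct server(s), "
                "which is fewer than the %d needed to recover the file at all"
                % (effective_happiness, needed_shares))
    if effective_happiness < servers_of_happiness:
        return ("shares were placed or found on only %d distinct server(s), "
                "but the happiness threshold requires %d"
                % (effective_happiness, servers_of_happiness))
    return "upload succeeded"
```

Both of the places that decide to give up could then raise their exception with this one message-builder instead of duplicating the three-way distinction.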
Replying to zooko:
Done.
Done.
In the context of that statement, it can't be false. In general, it can, though -- see, for example, Tahoe2PeerSelector.get_shareholders, where the wbp is used without a peerid to get a size value that is then used to filter out servers that don't accept a share as large as those associated with the file it is trying to place. So I don't think adding that precondition to WriteBucketProxy.__init__ is necessarily right -- we could still support the use case in get_shareholders with a dummy peerid if we did that, though. I'll remove that if-statement and replace it with an assert peerid to make things clearer.

Good idea; done.
Done.
Thanks for the review.
Attachment behavior2.txt (97514 bytes) added
Attachment tests2.txt (116478 bytes) added
For your information, I decided to put Kevan's patches (from behavior2.txt) into trac so that I could explore them with a nice web UI:
http://tahoe-lafs.org/trac/tahoe-lafs-ticket778/changeset/4293
Don't get confused -- URLs that begin with <http://tahoe-lafs.org/trac/tahoe-lafs-ticket778> are a trac instance that has Kevan's patches and is therefore divergent from trunk, which has URL <http://tahoe-lafs.org/trac/tahoe-lafs>. The patches shown in the trac of Kevan's branch may mutate before they actually get applied to trunk.

Okay, and I also just applied the 12 tests patches from tests2.txt:
http://tahoe-lafs.org/trac/tahoe-lafs-ticket778/log/
Currently error messages say something like "shares could only be placed on 2 servers". If you shift the word "only" to the right as far as possible then it becomes less ambiguous, e.g.:
"shares could only be placed on 2 servers"
versus:
"shares could be placed on only 2 servers"
Moral: always try to shift the word "only" as far to the right as possible.
I get confused by this logic in _loop() on line 315 of upload.py on Kevan's branch.

Questions:

1. You're counting only uncontacted_peers in if delta <= len(self.uncontacted_peers), and not also considering peers that have already been contacted but could accept another share, right? So what if we had a situation like {server0: (share0, share1, share2), server1: (share0), server2: (share0)} and the only way to achieve happiness is to upload share1 or share2 to either server1 or server2? Would the if evaluate to False, since len(self.uncontacted_peers) is zero, and the upload fail when it shouldn't? This would be similar to the situation tested in line 1026 of test_upload.py on Kevan's branch, but in that test the servers have no shares instead of having one share each. It would also be similar to the situation tested in line 853 of test_upload.py on Kevan's branch, but in that test the servers which have one share each are read-only, so the upload must fail.

2. Is _loop() guaranteed to terminate? Could it endlessly cycle, reassigning some shares to some servers and then recursing in line 356 (Kevan's branch) until it hits the Python maximum recursion limit? If it is guaranteed to terminate, how can we tell?

3. Isn't _loop(), which calls itself, really sort of like the greedy algorithm that I proposed back in comment:157, which is not guaranteed to find a solution even if one exists? Now that we have a maximum bipartite matching algorithm, can't we do away with _loop() entirely?

Hm, I see that in addition to the self-call on line 356, _loop() is also called from line 411 (Kevan's branch) and from line 504 (Kevan's branch). Line 411 is when we've finished assigning shares to servers which already had shares, and line 504 is when we get a response from a remote call to a server's allocate_buckets().

It kind of seems to me that only one of these three calls to _loop() should still be made in the new world of maximum bipartite matching, and that is the one where we just got new information from a server in response to our attempt to allocate_buckets(). (That new information could be that the server already has some shares, that the server hereby agrees to hold some shares that we will upload, or that the server has failed.) The other two calls to _loop() seem to be part of an iterative/imperative/greedy algorithm which is hopefully superseded by maximum bipartite matching.

Well, one way to tell if what I just said is true is to add a unit test of a case like the one I mentioned above, where there are servers that are already holding shares (so they will not be in self.uncontacted_peers), but need to receive more shares in order to achieve happiness. Oh wait, what does it take for a server to be absent from self.uncontacted_peers? Well, per line 237 you get added to self.uncontacted_peers if you are writable, and then in line 369 you get popped off of self.uncontacted_peers whenever we decide to ask you to hold a share. So it seems like we need a test of a case where, to succeed, you have to first upload one share to every writable server, and then you do not have any homeless shares (so that the if on line 316 will be True), and then you have not already achieved servers-of-happiness (so that the if on line 319 will be False), and then if you upload more shares to one or more of the servers that you've already uploaded a share to, you can achieve servers-of-happiness.

Is it even possible to have such a situation? I can't come up with an example in my head right now. Perhaps this means the current implementation is correct? :-)
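To make the example layout above concrete, here is a minimal sketch -- assuming nothing about the real upload.py or happinessutil.py code -- of computing servers-of-happiness as the size of a maximum bipartite matching between servers and share numbers:

```python
def servers_of_happiness(placement):
    """placement maps a server id to the set of share numbers it holds
    (or has agreed to hold); returns the size of a maximum bipartite
    matching between servers and distinct share numbers."""
    share_owner = {}  # current matching: share number -> server id

    def assign(server, seen):
        # Try to match `server` to one of its shares, re-homing that
        # share's current owner along an augmenting path if necessary.
        for share in placement[server]:
            if share in seen:
                continue
            seen.add(share)
            if share not in share_owner or assign(share_owner[share], seen):
                share_owner[share] = server
                return True
        return False

    return sum(1 for server in placement if assign(server, set()))

layout = {"server0": {0, 1, 2}, "server1": {0}, "server2": {0}}
print(servers_of_happiness(layout))        # -> 2: shares are bunched on server0

# Uploading share 1 to server1 as well (the redistribution in question):
layout_after = {"server0": {0, 1, 2}, "server1": {0, 1}, "server2": {0}}
print(servers_of_happiness(layout_after))  # -> 3
```

The second print shows that uploading share 1 to server1 would raise happiness from 2 to 3, which is exactly the kind of improvement the if delta <= len(self.uncontacted_peers) check might miss once every server has already been contacted.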
Taking a step back: we now have a nice algorithm for determining how many servers-of-happiness we have for a given static mapping from servers to shares. Do we also have a nice algorithm for dynamically choosing which servers should receive which shares? Or do we have an ad-hoc and underspecified state machine encoded into _loop()? :-) (Mind you, if it is the latter, but it behaves at least as well as v1.6.1 does for all the tests we can throw at it, then it is probably still worth shipping in v1.7.0 beta.)

This dynamic algorithm has to do the following:

- send get_buckets() requests to every read-only server (step 2 in the numbering used below);
- send allocate_buckets() requests to any servers that it wants (step 3) -- each allocate_buckets() will eventually return a tuple of "already got these shares" and "I hereby agree to hold those shares", or else fail (per RIStorageServer). The interface doesn't say this, but the current storage servers will tell you that they already have shares from this file even if you didn't request that they hold those shares from this file -- see StorageServer.remote_allocate_buckets(). If you asked a server to hold a share and it doesn't tell you that it already has the share nor that it hereby agrees to hold the share, then one of two things happened: it turned out to be full and couldn't hold that share, or someone else is simultaneously uploading that share right now but hasn't finished;
- send more allocate_buckets() requests in response to earlier requests resolving or failing (step 6);
- and finally decide whether the upload was a success (step 8).

So: can we describe what the current implementation on Kevan's branch does in terms of how it chooses its step 3 and step 6 moves? I remember that when we originally started on this ticket nine months ago I boldly stated that we could leave the current upload logic from Tahoe-LAFS v1.6.1 (i.e. the step 3 and step 6 choices) in place and just change the evaluation at the end (i.e. the step 8 decision). This might still be true, but at this point I'm not sure if we've introduced a regression in the step 3 and step 6 behavior. Can we convince ourselves that we haven't, by analysis or testing? Or could we define newer, simpler step 3 and step 6 behavior that would be easier to convince ourselves is correct behavior?
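For illustration, the bookkeeping for the responses described above might be summarized like this (a toy helper with made-up names, not the uploader's actual code; the real allocate_buckets() returns sets of share numbers as described):

```python
def classify_allocate_response(requested, alreadygot, allocated):
    """Interpret an allocate_buckets()-style response. `requested` is the
    set of share numbers we asked the server to hold, `alreadygot` is the
    set it reports already having (possibly including shares we never asked
    about), and `allocated` is the set it agreed to accept. Anything we
    requested that shows up in neither set was either refused because the
    server is full, or is being uploaded by someone else right now."""
    return {
        "already_present": alreadygot,
        "accepted": requested & allocated,
        "unaccounted_for": requested - alreadygot - allocated,
    }

# e.g. we asked for shares {3, 4, 5}; the server already had 3 and 9,
# and agreed to take 4 but not 5:
print(classify_allocate_response({3, 4, 5}, {3, 9}, {4}))
```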
Considering that we really want to release Tahoe-LAFS v1.7 beta about 48 hours from now it would probably be better to go down the first route of analyzing the current implementation to convince ourselves that it is not worse-behaved than the v1.6.1 implementation instead of changing the upload behavior from the current implementation. :-)
Okay, here is a more specific question: what is the difference, if any, between Tahoe-LAFS v1.6.1 and Kevan's #778 branch with regard to steps 3 and 6? Can you answer that question, Kevan?
Hm. Wow, there is a difference between 1.6.1 and Kevan's #778 branch with regard to step 3. In Kevan's branch it issues get_buckets() queries to the read-only servers (step 2) serially. It waits to hear back from the first read-only server before it queries the second read-only server, and so on. That could be a performance problem when there are a lot of read-only servers. On the other hand, v1.6.1 doesn't appear to query read-only servers at all, which means that if you've uploaded a file and then the servers to which you uploaded it fill up, then the next time you upload that file you will upload all of its shares to new not-yet-full servers instead of being happy with the current servers. Can that be right? Kevan has added a test of that sort of situation, so if it is true that v1.6.1 behaves so badly then v1.6.1 should fail Kevan's new test. :-)

Okay folks, I could use some help reviewing this branch. :-) Nobody can be expected to read through the history of this ticket, but perhaps you could start by just reading the diff of all of Kevan's patches put together. Kevan's patches include very good docs and tests, which will hopefully serve as your guide.
(I got that combined diff showing all of Kevan's patches by looking at the log and selecting two versions to diff.)
This change removes progress when a server already has a share. Intentional?
The algorithm as written appears to make an assumption about happiness: specifically, that happiness is the number of distinct servers with at least one share. If that's not the case, then this algorithm will probably break badly, for the reasons you specified. Otherwise, it appears to be correct.
My only recommendation would be to sort the servers in order of number of shares held (descending) before picking shares to redistribute.
I did not review the testcases.
Dear taral: thank you for the code review!
Replying to [taral]comment:189:
Hm... Good catch. That looks unintentional to me. At least it is inconsistent with the comment left at line 477. I think removing progress = True from there means that when you get a response from a server which did not accept any of your requests to hold a share, but which did inform you that it already had some shares, then you'll treat that as "not progress". Oh, but it looks like the only effect of "progress" is to change good_query_count, bad_query_count, and full_count, which are used only for reporting progress, so maybe it is intentional or at least harmless.

We'll leave it to Kevan to answer definitively.
That's not exactly the definition of happiness. I forgot to push Kevan's doc patches into the branch. Here they are. They state what happiness means.
Hm, I'm not sure about that -- I'll leave that to Kevan to think about.
Replying to zooko:
Replying to [zooko]comment:189:
Hm.
First note that, if we're in this situation, we have only 3 shares to place -- that condition, I think, is only hit when there are no more homeless shares, and the only shares that we've accounted for so far are 0, 1, and 2. Then the happiest upload we can make has happiness of 3, since happiness is bounded from above by the size of the smaller of the two vertex sets in the bipartite graph, and we know that the share set has size exactly 3 and the server set has size at least 3. The happiness that we see above is 2.
If this is fixable by redistribution, at least one of server 1 and server 2 must accept new shares; otherwise, the algorithm is right to declare the upload unhappy and give up. Also, at least one of server 1 and server 2 must be uncontacted (_loop has not yet asked it to hold a share), since _loop will not in its first pass (corresponding to step 3 in your terminology below) ask a peer to hold a share that it has already placed. So our possible causes for a layout like this are something like:
We see server 0 first; we either attempt to send share 0 to server 0, or server 0 is readonly but already has all of them.
We see server 1 (or server 2) first; we either store share 0 on it, or it is readonly and we detect it during step 2. We then see server 0, and stop placing shares because we have already accounted for them.
Then, since server 1 (or server 2) is both writable and not yet contacted, we can't know that it holds share 0.
So we'll remove shares from server 0 (because it's the only one with more than one share), and put them in homeless_shares for distribution to server 2 (for the sake of argument, without loss of generality, etc). If we opt to redistribute share 1 or share 2, we're okay; happiness is 3. If we opt to redistribute share 0, we're not, because (though we wouldn't know it) server 2 already has that one.
So I guess that's not the right way of going about share redistribution. Makes sense; I spent a while looking for an example that would break that logic, but I couldn't find one.
The reassignment is contingent upon there being uncontacted servers. Regardless of how we reassign shares, the loop will contact at least one uncontacted server on each iteration if there are any -- see line 368 of _loop. Since the number of uncontacted servers must either decrease or already be zero on each iteration of the loop, it will eventually terminate (or, if it does not terminate, it is for reasons other than that bit of logic).
Yes, I think so, after thinking about it, but I'll detail that in another comment.
Unless I'm misunderstanding your example, I think my case above would do the trick.
We have one nice, well-defined algorithm for computing the happiness of a static share assignment and one poorly defined, ad-hoc, and suboptimal algorithm for redistributing shares grafted onto an existing state machine. This is how I learn, I guess. :-/
It wouldn't be hard to extend the bipartite matching algorithm to do this, I think, but I will outline the extension in another comment, since this comment is concerned with the ad-hoc state machine algorithm. :-)
Step 3, if I understand right, corresponds to the initial attempts to place shares on the grid. Both #778 and 1.6.1 do this in the same way; they loop through the list of uncontacted peers (peers which have not yet been asked to store a share for this upload), and ask each of them to store a share. This proceeds until either all shares are placed or all peers are contacted. The peers that they choose to ask are also the same -- specifically, the peers that aren't readonly.
The difference between #778 and 1.6.1 in step 6 is the halfhearted attempt to redistribute shares if servers-of-happiness is not satisfied while there are still uncontacted peers. This affects how _loop opts to send more allocate_buckets() requests.
In 1.6.1, after the initial series of requests, if there were failures, or there are still homeless shares, _loop will repeatedly ask each of the peers that accepted shares on the last pass to accept a proportional (shares left / peers that accepted shares on the last pass) number of shares, and continue to do this until there are no more homeless shares or until there are no more accepting peers.
This logic is not sufficient to guarantee that a completed upload implies acceptable file health when our health metric is servers of happiness instead of shares of happiness, so #778 adds more logic to step 6. Now, in addition to checking for homeless shares, we check for the case where all shares were placed, but not all peers were contacted, and attempt to remedy the case where that results in an unhealthy upload by issuing allocate_buckets() requests to peers that we have not yet contacted.
So, in summary, step 6 differs in #778 in that the upload process will send allocate_buckets() queries to peers in more cases than in 1.6.1.
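For what it's worth, the 1.6.1-style proportional follow-up ask described a couple of paragraphs up amounts to something like this (illustrative only; the real _loop tracks quite a bit more state than this):

```python
import math

def shares_to_ask_for(num_homeless_shares, num_peers_that_accepted):
    # 1.6.1-style proportional ask: split the remaining homeless shares
    # evenly across the peers that accepted shares on the last pass.
    if num_peers_that_accepted == 0:
        return 0
    return math.ceil(num_homeless_shares / num_peers_that_accepted)
```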
If I'm understanding your steps right, wouldn't it be more correct here to say that #778 has a step 2, and 1.6.1 doesn't? It seems like this behavior is a part of step 2, not step 3. I think that that is the case, though -- 1.6.1 simply filters the peers that cannot accept shares for an upload out of the list of peers that it will check.
You are right that there is no reason for that to be done serially. I must have been trying to make it look like _loop, but it doesn't have the same need to coordinate as _loop.
Replying to [zooko]comment:190:
That's correct -- progress = True implies that the query placed shares, while progress = False implies that it did not. This is used to know how to increment self.good_query_count and self.bad_query_count. I will fix the comment so that it is less misleading.

Though, I guess to be consistent with the error message that these figures are collected for, I should say that the request makes progress if it finds shares that were not already placed, and does not make progress in other cases. That is consistent with the comment.
It is a bit confusing to say that a request that only finds shares that have already been placed is a bad query. However, the distinction between a good query and a bad query isn't as clear-cut in the world of servers of happiness as it was in the world of shares of happiness. A share discovered on a server incidentally as a result of share placement is not necessarily contributing to the health of the upload -- its discovery may not make the file any healthier. However, if a share is allocated as a result of a request to allocate that share, it is much more likely (perhaps even true in general, but I'm not quite sure of that) that its placement will help contribute to the health of the upload.
I don't think it would hurt -- it would help to spread out big concentrations of shares amongst more peers, which would reduce the number of shares lost if a peer that formerly had a big concentration of shares went down.
The treatment of progress is also wrong because it does not account for the case where the share that we wanted to allocate space for is not in allocated because it is in alreadygot, and falsely causes this case to be recorded as the server being full, which is not necessarily true.

Replying to zooko:
Here is what I had in mind for a nice algorithm for dynamically choosing which servers should receive which shares.

Input: A list of peers.
Output: A share -> peer assignment.

The assigned shares are then sent to their peers via allocate_buckets(). If allocate_buckets() fails due to the peer being full, remove all of the edges added for the peer in step 5 and resume from step 6. If allocate_buckets() fails due to peer failure, remove the peer from the graph, and resume from step 6.

I think the algorithm halts; if it does not find a solution in step 7, it removes either a vertex or some edges (together with one peer to which shares can be assigned) from the graph. Eventually the graph will either run out of vertices in the peer set altogether, or will not have any bucket allocations that can fail, and will finish. It should be optimal for the same reason that the check now is optimal; only now we allow it to find an optimal matching instead of verifying that a matching of a certain size exists.
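A minimal sketch of the failure handling in this proposal, under some assumptions: maximum_matching() and send_allocate_buckets() are hypothetical stand-ins (the former returning a maximum-size {share: peer} assignment for a set of (peer, share) edges, the latter returning "ok", "full", or "failed"); I'm treating the edges added in step 5 as the new-placement "candidate" edges; requests are shown one share at a time; and a real implementation would remember which placements already succeeded rather than re-sending them:

```python
def assign_until_stable(existing_edges, candidate_edges,
                        send_allocate_buckets, maximum_matching):
    """existing_edges: (peer, share) pairs for shares peers already hold.
    candidate_edges: (peer, share) pairs for placements we could still make."""
    existing = set(existing_edges)
    candidates = set(candidate_edges)
    while True:
        assignment = maximum_matching(existing | candidates)  # {share: peer}
        retry = False
        for share, peer in assignment.items():
            if (peer, share) in existing:
                continue                  # the peer already has this share
            outcome = send_allocate_buckets(peer, share)
            if outcome == "full":
                # The peer keeps whatever it already holds, but we stop
                # proposing new placements on it, then re-run the matching.
                candidates = {(p, s) for (p, s) in candidates if p != peer}
                retry = True
            elif outcome == "failed":
                # Drop the peer from the graph entirely.
                existing = {(p, s) for (p, s) in existing if p != peer}
                candidates = {(p, s) for (p, s) in candidates if p != peer}
                retry = True
        if not retry:
            return assignment
```

Each retry removes either a peer or all of a peer's remaining candidate edges, which is the same reason the prose argument above concludes that the algorithm halts.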
Thoughts?
Looks good to me. Can #8 be partially folded into #6 as some kind of heuristic?
It occurs to me that bipartite matching isn't quite right... let me think about this.
Okay, bipartite matching is sufficient but not necessary. I'm trying to see if something better exists.
Replying to taral:
It is true that maximum_matching_size >= servers_of_happiness_threshold is not necessary for the original definition of happiness that we started with. Here's a counterexample to it being necessary:
Let N = 4, k = 2, H = 3.
I.e. we have 4 shares, which need to be placed on at least 3 servers such that the shares from any two of those servers are sufficient to reconstruct the file.
If we place copies of the same two shares on three of the servers, then the maximum matching size is only 2, not 3 as required. However, the original happiness criterion is met, because any two of the three servers (actually only one is needed) will have enough shares to reconstruct the file.
OTOH, this is not a good share placement!
So, I believe that maximum matching size is a good measure of happiness even though it's not necessary to meet the original happiness criterion, because it also ensures placements that don't use excess storage. (If there are share copies that aren't in the maximum matching, they can be deleted.)
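A quick brute-force check of both claims in this counterexample (illustrative only; at this size, trying every assignment of distinct shares to servers is an easy way to get the maximum matching size):

```python
from itertools import combinations, permutations

# The counterexample above: N = 4 shares, k = 2, H = 3, and copies of
# shares 0 and 1 placed on each of three servers.
placement = {"s1": {0, 1}, "s2": {0, 1}, "s3": {0, 1}}
k, H, N = 2, 3, 4

# Original criterion: any k of the servers together hold at least k
# distinct shares, so the file can be reconstructed from them.
original_ok = all(len(set().union(*(placement[s] for s in subset))) >= k
                  for subset in combinations(placement, k))

# Matching measure: the largest number of (server, distinct share) pairs.
matching = max(sum(1 for s, share in zip(placement, shares) if share in placement[s])
               for shares in permutations(range(N), len(placement)))

print(original_ok, matching >= H)  # True, False: happy by the old criterion, not by matching
```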
Replying to [kevan]comment:194:
To be precise, a maximal bipartite matching is not the same thing as a maximum bipartite matching. We want the latter.
My major concern at this point is: Does the ticket778 branch as it currently stands contain regressions vs. Tahoe-LAFS v1.6.1?
Replying to [kevan]comment:191:
Can you write a unit test that causes this behavior?
How does v1.6.1 behave in a case like this? I guess it ignores read-only servers entirely during upload. But if it were a case like this one involving only writable servers, then v1.6.1 would go ahead and stop happily with "shares of happiness" even though there isn't actually "servers of happiness". So as far as I currently understand (which isn't that far), the current ticket778 branch does not have any regression vs. v1.6.1 with regard to this case.
Okay, I agree that _loop() will terminate.

Replying to [kevan]comment:191:
As mentioned, if possible, I would like to see a unit test of this case, even if that test is going to be marked as "TODO" for the v1.7 release.
Yeah it seems like your thoughts along those lines are destined to grow into a ticket for after v1.7 release which is something like "rewrite uploader to make it awesome". :-)
Okay I think that I'm convinced that the ticket778 branch doesn't lead to any regressions with regard to step 6. I withhold my final opinion on that until I re-re-re-re-re-read your patches with this conversation in mind, though.
You are right—my mistake. I meant step 2.
Doing it serially could be a significant performance regression vs. 1.6.1 if there are many read-only servers. In fact, waiting for all the answers to your step 2 queries—even if done in parallel instead of in serial—could be a performance regression (if there is a hung server which is marked read-only and which never replies to the query). However, while I think it might be worth changing your branch to issue the step 2 queries in parallel instead of serial, I do not think it would be worth changing your branch (before v1.7) to proceed with the later steps before all of the step 2 queries have been answered.
Okay I intend to review the branch one last time tomorrow hunting for significant regressions from v1.6.1.
Replying to [zooko]comment:201:
Agreed. Improving the availability of upload in the presence of hung or very slow servers is ticket #873 (also ticket #946 might help). If this affected download, I'd be reluctant to accept any regression, but for upload we know there is a lot more work to do to improve availability.
[Amber and Zooko are pair-programming on this review.]

- query_allocated should be named ask_about_existing_shares.
- self.peers_with_shares = set([]) should be self.peers_with_shares = set().
- When querying readonly servers, we need to limit the number of readonly servers queried so that we don't query every readonly server on the grid. How about if we just query those that are among the first 2*n in all_peers? (Something like the sketch after this list would do it.)
- [Zooko adds: I still think we should parallelize that initial query to readonly servers.]
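The 2*n restriction mentioned in the list above could look something like this sketch (the names all_peers, readonly_peers, and total_shares are assumptions for illustration, not the exact identifiers in upload.py):

```python
def readonly_peers_to_query(all_peers, readonly_peers, total_shares):
    """Only ask the read-only peers that fall within the first 2*n entries
    of the (already permuted) peer list about existing shares, so that an
    upload does not query every read-only server on the grid."""
    candidates = set(all_peers[:2 * total_shares])
    return [peer for peer in readonly_peers if peer in candidates]
```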
Greedy algorithms are known to construct maximal (but not necessarily maximum) bipartite matching.
I think step 6 can be improved by using a weighted algorithm where existing shares are "heavier" than new shares.
Oh the docs in [architecture.txt]source:docs/architecture.txt@4222#L140 should be updated to describe the new step 2 in the upload algorithm.
I'm attaching docs3.txt, behavior3.txt, and tests3.txt. These are the canonical versions of the #778 patches.

Things that are changed:
Per the focus of this ticket on things that might be regressions against 1.6.1, I did not change the redistribution algorithm to take the number of shares on a peer into account when reassigning them any more than it already does.
Thanks for the feedback so far.
Attachment behavior3.txt (100052 bytes) added
Attachment tests3.txt (128585 bytes) added
Attachment docs3.txt (10821 bytes) added
Replying to [zooko]comment:200:
If we're comparing 1.6.1 and 1.7.0's ability to determine servers of happiness, you're right; 1.6.1 would call the upload successful even though the shares are not distributed in the required way, while these patches will call it unsuccessful because the share layout is unhappy, and the peer selection logic doesn't know how to distribute shares in such a way as to make it happy.
I put Kevan's latest patches into a new branch:
http://tahoe-lafs.org/trac/tahoe-lafs-ticket778-3/log/?action=stop_on_copy&mode=follow_copy&rev=4308&stop_rev=&limit=30
Okay, here I am reviewing the #778 patches for what seems like the one millionth time. This time through, the core "put them back so we can try to redistribute them" code in _loop() finally makes sense to me!

Let's start a new ticket named something like "make upload awesome" so that we don't forget details such as Kevan's idea in comment:194 and Taral's suggestion in comment:72619.
Now reviewing the rest of the branch... :-)
Okay! I'm done reviewing it! Way to go, Kevan! I'm running unit tests with code coverage reporting and then I'll push your branch into trunk!
Here's a code coverage report about immutable/upload.py from running the entire unit test suite:
http://tahoe-lafs.org/~zooko/src_allmydata_immutable_upload.html
Two interesting bits: it claims that line 286 and line 344 are only partially covered -- i.e. that for each one, of the two possible lines that could be executed next after that line, only one of those possible next-lines was actually executed during the tests. It gives a line number (on the right-hand side) for which line of code it could have jumped to but didn't, but that number is bogus. However, its claim that line 344 is only partially covered is substantiated by its claim that lines 355-357 are never executed.
I'm going to proceed with applying your branch to trunk, but you might want to go ahead and investigate this.
Fixes applied in changeset:7c4c6f393ec2ad2a through changeset:77aabe7066e539c0. Kevan: please review my changeset:77aabe7066e539c0. Thanks a lot for all your hard work on this Kevan! Thanks also to Shawn Willden, David-Sarah, Brian, Taral, and to Terrell for some drive-by encouragement. :-)
In the shower this morning I suddenly realized I hadn't actually reviewed source:src/allmydata/util/happinessutil.py! Fortunately, a quick glance at the code coverage results from running just the allmydata.test.test_upload tests shows that happinessutil.py has 100% line- and branch-coverage. :-)

http://tahoe-lafs.org/~zooko/htmlcov-just-allmydata.test.test_upload/src_allmydata_util_happinessutil.html
I still plan to review happinessutil.py at some point.
Replying to zooko:
Change "servers nodes" to "server nodes" in the first paragraph.
"Each server has announced its available space when it connected to the introducer" sounds clunky. Maybe "Each server announced its available space when it connected to the introducer", or "Servers announce their available space when they connect to the introducer".
(The announcement field that contains the space announcement is actually called "maximum-immutable-share-size", though it is set to the amount of space the storage server thinks it has available, so what we have is still technically correct)
The rest of changeset:77aabe7066e539c0 looks good to me.
Replying to zooko:
I think the line number is right with line 343, since it would jump back to the start of the while loop if it didn't take the branch leading into the body of the if-statement. I'll add a test for that.
Also, I need to add a test for the situation where it can't distribute all of the shares, but still has a happy layout.
I think the other instances of partial coverage are benign -- do you agree?
Yes—good enough.
Patch attached -- I also removed a comment that didn't make much sense. This fixes both of the code coverage issues that I saw in your report.
Attachment 778codecoverage.darcspatch.txt (18121 bytes) added
See also #911 (Create a specification for the servers of happiness behavior).
A user with handle "bjp" joined the IRC channel and asked how to set up a simple grid in order to test SFTP. Reading the transcript the next day, I realized that this ticket (#778) has broken the setup instructions in http://tahoe-lafs.org/source/tahoe-lafs/trunk/docs/running.html because the default configuration of K, H, and N can't work without at least H storage servers, but the running.html instructions don't say that you need multiple storage servers.

Kevan: do you feel like updating running.html?
Patch attached. I made a separate patch (within the single patch file, but distinct from the one that fixes the issue above) that changes uses of tahoe and Tahoe (where appropriate) to Tahoe-LAFS, to be consistent with the title and how Tahoe-LAFS is referred to elsewhere; feel free to apply or not apply that one as you see fit.
Attachment 778running.darcspatch.txt (18138 bytes) added
To test this feature on your system, build current Tahoe-LAFS trunk (changeset:e225f573b9c3fb0e or newer) and see how uploads perform, and whether they fail with a clean error message if there aren't enough storage servers available to spread your file out over H different servers such that any K of those servers can reconstruct your file.

Attachment mutabledocs.dpatch (18575 bytes) added
make a distinction between immutable file uploads and mutable file uploads wrt servers of happiness
David-Sarah pointed out in IRC that the fact that this only applies to immutable files at the moment is not well-documented. The last attachment tries to rectify this.
From mutabledocs.dpatch:
This can be read as implying that the mutable placement algorithm still gives at least the same guarantees as the immutable one. But in fact it succeeds even when there are fewer than "servers of happiness" connected servers.
'mutable' -> 'Mutable' and ')' -> '.)'
I reworded the blurb in architecture.txt, and made the changes you suggested to configuration.txt -- see attachment:mutabledocsv2.dpatch.
Thanks for the feedback.
Attachment mutabledocsv2.dpatch (19144 bytes) added