redefine "Healthy" to be "Happy" for checker/verifier/repairer #614

Open
opened 2009-02-10 06:17:43 +00:00 by zooko · 14 comments

Part of dreid's performance problem (in addition to the major part: #610, and the other consideration: #613) is that his client is re-uploading every file he has ever uploaded whenever the checker reports that a file is not "Healthy" because it has only 9 of its M=10 shares (K=3). Maybe we should redefine "Healthy" to be 7 shares and let numbers of shares greater than 7 be "super extra Healthy".

I choose 7 because that is the current default value of "shares of happiness". "Shares of happiness" is a related notion: when you are doing an upload, if some of the attempts to upload shares fail but you are left with 7 or more shares at the end, then you report to the user that the upload succeeded. If so many share uploads fail that you can't get more than 6 shares placed, then you immediately abort and report to the user that the upload failed. Maybe the repairer ought to use the same heuristic as the uploader does with regard to how many shares is enough to "call it good".
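For illustration, a minimal sketch of that upload heuristic, assuming the current default of 7; the names here are invented for illustration, not the actual Tahoe-LAFS upload API:

```python
# Hypothetical sketch of the uploader's "shares of happiness" rule; the
# names are illustrative, not the actual Tahoe-LAFS upload API.
SHARES_OF_HAPPINESS = 7  # current default

def upload_outcome(num_shares_placed, happiness=SHARES_OF_HAPPINESS):
    """Report success if at least `happiness` shares were placed; the real
    uploader aborts as soon as it knows it cannot reach that threshold."""
    return "success" if num_shares_placed >= happiness else "failure"

print(upload_outcome(9))  # success
print(upload_outcome(6))  # failure
```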

zooko added the code-network, major, defect, 1.3.0 labels 2009-02-10 06:17:43 +00:00
zooko added this to the undecided milestone 2009-02-10 06:17:43 +00:00
warner added code-encoding and removed code-network labels 2009-03-24 20:15:00 +00:00
Author

Per #778 ("shares of happiness" is the wrong measure; "servers of happiness" is better), the definition of a "well stored" file should be a file for which there are "servers of happiness" distinct servers such that any K of them are sufficient to recover your file. This would be a good definition for "healthy" in checker/verifier/repairer -- don't bother repairing a file if there are already "servers of happiness" for that file.

Per #778 ("shares of happiness" is the wrong measure; "servers of happiness" is better), the definition of a "well stored" file should be a file for which there are "servers of happiness" distinct servers such that any `K` of them are sufficient to recover your file. This would be a good definition for "healthy" in checker/verifier/repairer -- don't bother repairing a file if there are already "servers of happiness" for that file.
zooko changed title from redefine "Healthy" to be 7 in 3-of-10 encoding to redefine "Healthy" to be "Happy" in 3-of-10 encoding 2009-12-29 19:12:11 +00:00
zooko changed title from redefine "Healthy" to be "Happy" in 3-of-10 encoding to redefine "Healthy" to be "Happy" for checker/verifier/repairer 2009-12-29 19:12:35 +00:00
davidsarah commented 2010-01-17 14:41:24 +00:00
Owner

When this is fixed, remember to change the webapi.txt doc for `POST $URL?t=check`.
Author

#778 ("shares of happiness" is the wrong measure; "servers of happiness" is better) is fixed! Now we should change checker/verifier/repairer to report health (for the first two) and to trigger a repair (for the last one) only if the file in question lacks sufficient servers-of-happiness.

Kevan: if you are interested, you could investigate whether this is such an easy change to make that we could squeeze it into the v1.7 release.

(I doubt it.)

#778 ("shares of happiness" is the wrong measure; "servers of happiness" is better) is fixed! Now we should change checker/verifier/repairer to report health (for the first two) and to trigger a repair (for the last one) only if the file in question lacks sufficient servers-of-happiness. Kevan: if you are interested, you could investigate whether this is such an easy change to make that we could squeeze it into the v1.7 release. (I doubt it.)
zooko modified the milestone from undecided to 1.7.0 2010-05-16 05:17:43 +00:00
kevan commented 2010-05-16 18:23:53 +00:00
Owner

The checker/verifier would be straightforward to modify. The repairer works by downloading and re-uploading the file, regardless of how healthy it is (relying on the immutable file upload code to not do more work than it needs to), and doesn't explicitly consider file health in doing that. However, from what I can tell, you can't repair an immutable file without first checking its health, and the immutable filenode code won't repair a file unless the Checker/Verifier says that it is unhealthy, so it is probably enough to just change the definition of health in the Checker/Verifier.
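A schematic sketch of the control flow described above, with placeholder classes standing in for the real immutable filenode and checker-results interfaces:

```python
# Schematic only: these placeholder classes stand in for the real immutable
# filenode / checker interfaces; nothing here is the actual Tahoe-LAFS API.
class FakeCheckResults:
    def __init__(self, healthy):
        self.healthy = healthy
    def is_healthy(self):
        return self.healthy

class FakeFileNode:
    def __init__(self, healthy):
        self._healthy = healthy
    def check(self):
        return FakeCheckResults(self._healthy)
    def repair(self, check_results):
        # repair = download + re-upload; the upload code itself avoids
        # doing more work than it needs to
        self._healthy = True
        return FakeCheckResults(True)

def check_and_maybe_repair(node):
    results = node.check()        # a check always happens first
    if results.is_healthy():      # "healthy" as defined by the Checker/Verifier
        return results            # so the repairer is never invoked
    return node.repair(results)
```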

I might have some time later this week to work on this, but getting it implemented correctly and clearly, testing it thoroughly, and having adequate code review by the 23rd might be optimistic.

Author

Yeah, let's plan to finish this up in v1.8.0.

zooko modified the milestone from 1.7.0 to 1.8.0 2010-05-16 18:34:33 +00:00
Author

I'm just wondering if *not* doing this for v1.7.0 means there is a regression in v1.7.0. I don't think so, but I thought I should write down my notes here to be sure. The notion of whether a file is "healthy" according to the checker-verifier and the notion of whether an upload was "successful" according to the uploader-repairer differ (this has also been true in earlier releases, ever since the repairer was originally implemented).

There are two ways the difference could manifest:

  1. Checker-verifier could say the file is Ok even though uploader-repairer would not be satisfied with it and would either strengthen/rebalance it or report failure.
  2. Checker-verifier could say that the file is not-Ok even though uploader-repairer would be satisfied with it and would not change it when asked to upload it.

For a while I was worried about the second case as a potential regression in v1.7.0 because the new uploader (and therefore repairer) has more stringent requirements for what constitutes a successful upload than the v1.6 one did. I imagined that maybe every time someone's checker-repairer process ran the checker would report "This is not healthy" and then the repairer would run and do a lot of work but leave the file in such a state that the next time the checker looked at it the checker would still consider it to be unhealthy.

However, I don't believe this is a risk for v1.7, because the new uploader (and repairer), while it could be *satisfied* with fewer shares available than the checker would be, actually goes ahead and uploads all shares if possible, which is enough to satisfy the v1.7 checker. In fact, this is the same pattern as the old v1.6 uploader, which would be *satisfied* with only shares-of-happiness shares being available (which is not enough to satisfy a checker/verifier) but normally goes ahead and uploads all `N` shares.

So, there's no problem, but I thought I should write all this down just in case someone else detects a flaw in my thinking. Also if we ever implement #946 (upload should succeed as soon as the servers-of-happiness criterion is met) then we'll have to revisit this issue!
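To make the comparison concrete, here is a toy restatement of the two predicates and why case 2 cannot cause a repair loop; the thresholds are the then-current defaults (3-of-10 encoding, happiness 7) and the function names are invented:

```python
# Toy restatement; thresholds are the then-current defaults and both
# predicates are simplified to a single count of "good" shares/servers.
N, K, HAPPY = 10, 3, 7

def checker_says_healthy(num_good):
    return num_good >= N       # the v1.7 checker wants all N shares present

def uploader_says_success(num_good):
    return num_good >= HAPPY   # the uploader/repairer is satisfied earlier

# Case 2 at 9 good shares: the checker says "unhealthy" even though the
# uploader would already be satisfied.  But the repairer re-uploads all N
# shares when it can, so the next check passes and the repair work does
# not repeat on every run.
assert not checker_says_healthy(9) and uploader_says_success(9)
```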

tahoe-lafs modified the milestone from 1.8.0 to 1.9.0 2010-08-10 04:13:13 +00:00
david-sarah@jacaranda.org commented 2010-09-11 00:52:36 +00:00
Owner

In changeset:31f66c5470641890:

```
docs/frontends/webapi.txt: document that the meaning of the 'healthy' field may change in future to reflect servers-of-happiness; refs #614
```
kevan commented 2010-10-13 23:07:36 +00:00
Owner

See also: #1212, where we discuss and ultimately agree (I think?) on how a happiness-aware repairer/checker ought to work.


not making it into 1.9

warner modified the milestone from 1.9.0 to 1.10.0 2011-10-13 17:03:38 +00:00
Author

There was discussion of this issue on tahoe-dev: [//pipermail/tahoe-dev/2013-March/008091.html]


I have started working on this ticket. You can view my progress here: https://github.com/markberger/tahoe-lafs/tree/614-happy-is-healthy

As of right now I've only written a couple of unit tests that should pass before this ticket is closed. If someone could glance over them to make sure they're implemented correctly, I would really appreciate it. The solution to this problem will probably be a little hacky because `h` is not stored in the verify cap while `k` and `n` are. Whoever is working on the new cap design might want to consider storing `h` in the verify cap if servers-of-happiness is going to be used with the new caps.

Author

markberger: great!

I wouldn't call it "hacky". Just say that the checker/verifier/repairer is judging whether a given file is happy according to its current user. Different users may have different happiness requirements for the same file when they run the checker/verifier/repairer. It is up to the person running the checker/verifier/repairer, not up to the person who originally uploaded the file, to decide what constitutes happiness for them.
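A small sketch of that interpretation, assuming the happiness threshold comes from the checking user's own configuration (for example their shares.happy setting) rather than from anything stored in the cap; the function name is invented:

```python
# Illustrative: the happiness threshold belongs to whoever runs the check
# (e.g. their shares.happy setting), since h is not recorded in the verify
# cap the way k and n are.
def needs_repair(observed_happiness, my_happiness):
    """observed_happiness: servers-of-happiness as measured by the checker;
    my_happiness: the checking user's own configured threshold."""
    return observed_happiness < my_happiness

print(needs_repair(8, 7))   # False -- happy enough for this user
print(needs_repair(8, 10))  # True  -- a stricter user would repair the same file
```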


After thinking about this ticket, I believe that it should be blocked by #1057 (use servers-of-happiness for mutable files). Continuing to diverge the behavior of immutable and mutable files only creates more trouble for users. Also, it doesn't make sense for an upload to succeed on a file that the checker might then deem unhealthy.

My github branch has a working patch for immutable files (some tests need to be altered because repair does not occur when there are 9 out of 10 shares) but I am going to put this ticket on hold and focus on #1057.

Author

Let's revisit this ticket after we land #1382.
