Failure to achieve happiness in upload or repair #1130
Prior to Tahoe-LAFS v1.7.1, the immutable uploader would sometimes raise an assertion error (#1118). We fixed that problem, and we also fixed the problem of the uploader uploading an insufficiently well-distributed set of shares while believing it had achieved servers-of-happiness. But now the uploader gives up and doesn't upload at all, reporting that it hasn't achieved happiness, even in cases where a smarter share placement would achieve it. This ticket is to make the upload succeed in such cases.
Log excerpt:
Attachment stuff.flog.bz2 (10011 bytes) added
Log from flogtool
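For context on the description above: servers-of-happiness is the size of a maximum matching in the bipartite graph between servers and the shares they hold. Here is a minimal Python sketch (not Tahoe's implementation; the layout and server names are made up) of that metric, showing how a bunched-up distribution like the one described can fall below the threshold even though a better placement exists:

```python
# Minimal sketch (not Tahoe's implementation) of the servers-of-happiness
# metric: the size of a maximum matching in the bipartite graph whose edges
# connect each server to the shares it holds (Kuhn's augmenting-path method).

def happiness(server_to_shares):
    match = {}  # share number -> server currently matched to it

    def try_assign(server, seen):
        for share in server_to_shares[server]:
            if share in seen:
                continue
            seen.add(share)
            # Claim the share if it is free, or if its current holder can be
            # re-matched to a different share.
            if share not in match or try_assign(match[share], seen):
                match[share] = server
                return True
        return False

    return sum(try_assign(server, set()) for server in server_to_shares)

# Hypothetical bunched-up layout: four shares exist and four servers are
# reachable, but only two (server, share) pairs can be matched, so
# happiness is 2, below a threshold of H = 4.
layout = {
    "server A": {0, 1, 2, 3},
    "server B": {0},
    "server C": set(),
    "server D": set(),
}
print(happiness(layout))  # -> 2
```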
I think I had originally uploaded this file when I was configured to use encoding parameters 2/3/4. That may explain the original distribution of the shares. I assume it's legal for a client to change its parameters (as I did, to 2/4/4) and continue using the grid. In that case the shares need to be migrated, but the migration doesn't happen.
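For reference, and assuming "2/3/4" here means shares.needed / shares.happy / shares.total, the change described corresponds roughly to this edit in the [client] section of tahoe.cfg (a sketch, not the reporter's actual configuration):

```ini
[client]
# old encoding parameters: 2-of-4 with a happiness threshold of 3
#shares.needed = 2
#shares.happy = 3
#shares.total = 4

# new parameters: happiness raised to 4, so the four shares must end up on
# four distinct servers for an upload or repair to count as happy
shares.needed = 2
shares.happy = 4
shares.total = 4
```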
This issue reinforces Brian's doubts about servers-of-happiness: http://tahoe-lafs.org/pipermail/tahoe-dev/2010-December/005704.html . This bothers me! I want Brian to love servers of happiness and revel in its excellence. Perhaps fixing this ticket would help.
According to David-Sarah in this tahoe-dev message, this issue is nearly the same as the one tested in test_problem_layout_ticket_1128. So anybody who wants to fix this can start by running that one unit test.
Yes, #1128 had already been closed as a duplicate of this ticket. The name of the unit test should probably be changed (although I hope we fix it before the next release anyway).
Upload and repair are sufficiently similar that I think they can be covered by the same ticket for this issue. They are implemented mostly by the same code, and both should change to take existing shares into account in the same way, probably along the lines of ticket:1212#comment:-1. The difference is that when happiness is not achieved, an upload should fail, while a repair should still make a best effort to improve the preservation of the file. But that needn't stop them from sharing the same improvement to the share placement algorithm.
Title changed from "Failure to achieve happiness in upload" to "Failure to achieve happiness in upload or repair".

[copying the algorithm from ticket:1212#comment:-1 here, with some minor refinements, for ease of reference]
This is how I think the repairer should work:
tahoe.cfg
The while loop should be done in parallel, with up to N - |M| outstanding requests.
Upload would work in the same way (for the general case where there may be existing shares), except that it would fail if it is not possible to achieve |M| >= H.
edit: [numbered the steps]
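As one concrete reading of this proposal (and of the comment:78773 algorithm discussed below), here is a rough, sequential Python sketch; the names plan_placement, maximum_matching, and NotHappyError are made up for illustration, and a real uploader would issue the placement requests in parallel as noted above:

```python
# Sketch only, not Tahoe's code: keep existing shares that form a maximum
# matching M, hand the remaining shares to servers not already in M, and
# fail an upload (but not a repair) if the happiness threshold H is not met.

class NotHappyError(Exception):
    """Stand-in for the uploader's 'happiness not achieved' failure."""

def maximum_matching(server_to_shares):
    """Return {share: server} for a maximum bipartite matching."""
    match = {}

    def try_assign(server, seen):
        for share in server_to_shares.get(server, ()):
            if share not in seen:
                seen.add(share)
                if share not in match or try_assign(match[share], seen):
                    match[share] = server
                    return True
        return False

    for server in server_to_shares:
        try_assign(server, set())
    return match

def plan_placement(servermap, all_shares, writable_servers, H, is_repair):
    matching = maximum_matching(servermap)      # existing shares left in place
    placements = dict(matching)
    unplaced = sorted(s for s in all_shares if s not in placements)
    unused = [srv for srv in writable_servers if srv not in placements.values()]

    # The real loop would run with up to N - |M| requests outstanding in
    # parallel; this sketch is sequential for clarity.
    while unplaced and unused:
        placements[unplaced.pop(0)] = unused.pop(0)

    # Any shares still unplaced would be handled by "step 5" (see the two
    # options discussed further down in this ticket).
    achieved = len(set(placements.values()))
    if achieved < H and not is_repair:
        raise NotHappyError("placed on %d distinct servers, need %d" % (achieved, H))
    return placements  # a repair proceeds best-effort with whatever it has

# Example: one existing share on server B, threshold H = 3.
print(plan_placement({"B": {0}}, all_shares={0, 1, 2, 3},
                     writable_servers=["A", "C", "D"], H=3, is_repair=False))
```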
The algorithm David-Sarah proposes in comment:78773 sounds fine to me.
not making it into 1.9
Kevan: would the algorithm from your master's thesis solve this ticket? Would it be compatible with, or equivalent to, the algorithm that David-Sarah proposed in comment:78773?
I just thought of another wrinkle: the initial servermap in step 2 may contain shares with leases that are about to expire. The repairer should attempt to renew any leases on shares that are still needed, and only then (once it knows which renew operations succeeded) decide which new or replacement shares need to be stored.
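A small sketch of that ordering, where renew_lease(server, share) -> bool is an assumed helper (not a Tahoe API) reporting whether the renewal succeeded; only shares whose renewal succeeded are treated as existing when deciding what to store:

```python
# Sketch only: renew leases on still-needed existing shares first, and build
# the servermap used for placement from the shares whose renewal succeeded.

def confirmed_servermap(servermap, needed_shares, renew_lease):
    confirmed = {}
    for server, shares in servermap.items():
        kept = {sh for sh in shares & needed_shares if renew_lease(server, sh)}
        if kept:
            confirmed[server] = kept
    return confirmed
```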
The comment:78773 algorithm would fix #699. Note that in the case where there are existing shares that don't contribute to the maximum matching found in step 3, those shares (which are redundant if the repair is successful) will not be deleted. However, any redundant shares would not have their leases renewed.
Step 5 in the comment:78773 algorithm isn't very specific about where the remaining shares are placed. I can think of two possibilities:
a) continue the loop in step 4, i.e. place in the order of the permuted list with wrap-around.
b) sort the servers by the number of shares they have at that point (breaking ties in some deterministic way) and place on the servers with fewest shares first (see the sketch below).
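A minimal sketch of option (b); the tie-break by server id is only an assumption for illustration:

```python
# Visit servers holding the fewest shares first, breaking ties
# deterministically (here by server id).

def placement_order(server_to_shares):
    return sorted(server_to_shares,
                  key=lambda srv: (len(server_to_shares[srv]), srv))

current = {"server A": {0, 1}, "server B": {2}, "server C": set()}
print(placement_order(current))  # ['server C', 'server B', 'server A']
```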
This would be fixed by #1382, right?
Daira thinks it's the same problem as #1124, so yes.
Milestone renamed
moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders
Moving open issues out of closed milestones.
Ticket retargeted after milestone closed