Failure to achieve happiness in upload or repair #1130
Prior to Tahoe-LAFS v1.7.1, the immutable uploader would sometimes raise an assertion error (#1118). We fixed that problem, and we also fixed the problem of the uploader uploading an insufficiently well-distributed set of shares while believing it had achieved servers-of-happiness. But now the uploader gives up and doesn't upload at all, reporting that it hasn't achieved happiness, even in cases where a smarter share placement would achieve it. This ticket is to make the upload succeed in such cases.
Log excerpt:
Attachment stuff.flog.bz2 (10011 bytes) added
Log from flogtool
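For context on the description above: servers-of-happiness is the size of a maximum matching in the bipartite graph between servers and the shares they hold. Here is a minimal Python sketch (not Tahoe's implementation; the layout and server names are made up) of that metric, showing how a bunched-up distribution like the one described can fall below the threshold even though a better placement exists:

```python
# Minimal sketch (not Tahoe's implementation) of the servers-of-happiness
# metric: the size of a maximum matching in the bipartite graph whose edges
# connect each server to the shares it holds (Kuhn's augmenting-path method).

def happiness(server_to_shares):
    match = {}  # share number -> server currently matched to it

    def try_assign(server, seen):
        for share in server_to_shares[server]:
            if share in seen:
                continue
            seen.add(share)
            # Claim the share if it is free, or if its current holder can be
            # re-matched to a different share.
            if share not in match or try_assign(match[share], seen):
                match[share] = server
                return True
        return False

    return sum(try_assign(server, set()) for server in server_to_shares)

# Hypothetical bunched-up layout: four shares exist and four servers are
# reachable, but only two (server, share) pairs can be matched, so
# happiness is 2, below a threshold of H = 4.
layout = {
    "server A": {0, 1, 2, 3},
    "server B": {0},
    "server C": set(),
    "server D": set(),
}
print(happiness(layout))  # -> 2
```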
I think I had originally uploaded this file when I was configured to use encoding parameters 2/3/4. That may explain the original distribution of the shares. I assume it's legal for a client to change its parameters (as I did, to 2/4/4) and continue using the grid. In that case the shares need to be migrated, but the migration doesn't happen.
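For reference, and assuming "2/3/4" here means shares.needed / shares.happy / shares.total, the change described corresponds roughly to this edit in the [client] section of tahoe.cfg (a sketch, not the reporter's actual configuration):

```ini
[client]
# old encoding parameters: 2-of-4 with a happiness threshold of 3
#shares.needed = 2
#shares.happy = 3
#shares.total = 4

# new parameters: happiness raised to 4, so the four shares must end up on
# four distinct servers for an upload or repair to count as happy
shares.needed = 2
shares.happy = 4
shares.total = 4
```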
This issue reinforces Brian's doubts about servers-of-happiness: http://tahoe-lafs.org/pipermail/tahoe-dev/2010-December/005704.html . This bothers me! I want Brian to love servers of happiness and revel in its excellence. Perhaps fixing this ticket would help.
According to David-Sarah in this tahoe-dev message, this issue is nearly the same as the one tested in test_problem_layout_ticket_1128. So anybody who wants to fix this can start by running that one unit test.
Yes, #1128 had already been closed as a duplicate of this ticket. The name of the unit test should probably be changed (although I hope we fix it before the next release anyway).
Upload and repair are sufficiently similar that I think they can be covered by the same ticket for this issue. They are implemented mostly by the same code, and both should change to take existing shares into account in the same way, probably along the lines of ticket:1212#comment:-1. The difference is that when happiness is not achieved, an upload should fail, while a repair should still make a best effort to improve the preservation of the file. But that needn't stop them from sharing the same improvement to the share placement algorithm.
Title changed from "Failure to achieve happiness in upload" to "Failure to achieve happiness in upload or repair".

[copying the algorithm from ticket:1212#comment:-1 here, with some minor refinements, for ease of reference]
This is how I think the repairer should work:
tahoe.cfg
The while loop should be done in parallel, with up to N - |M| outstanding requests.
Upload would work in the same way (for the general case where there may be existing shares), except that it would fail if it is not possible to achieve |M| >= H.
edit: [numbered the steps]
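As one concrete reading of this proposal (and of the comment:78773 algorithm discussed below), here is a rough, sequential Python sketch; the names plan_placement, maximum_matching, and NotHappyError are made up for illustration, and a real uploader would issue the placement requests in parallel as noted above:

```python
# Sketch only, not Tahoe's code: keep existing shares that form a maximum
# matching M, hand the remaining shares to servers not already in M, and
# fail an upload (but not a repair) if the happiness threshold H is not met.

class NotHappyError(Exception):
    """Stand-in for the uploader's 'happiness not achieved' failure."""

def maximum_matching(server_to_shares):
    """Return {share: server} for a maximum bipartite matching."""
    match = {}

    def try_assign(server, seen):
        for share in server_to_shares.get(server, ()):
            if share not in seen:
                seen.add(share)
                if share not in match or try_assign(match[share], seen):
                    match[share] = server
                    return True
        return False

    for server in server_to_shares:
        try_assign(server, set())
    return match

def plan_placement(servermap, all_shares, writable_servers, H, is_repair):
    matching = maximum_matching(servermap)      # existing shares left in place
    placements = dict(matching)
    unplaced = sorted(s for s in all_shares if s not in placements)
    unused = [srv for srv in writable_servers if srv not in placements.values()]

    # The real loop would run with up to N - |M| requests outstanding in
    # parallel; this sketch is sequential for clarity.
    while unplaced and unused:
        placements[unplaced.pop(0)] = unused.pop(0)

    # Any shares still unplaced would be handled by "step 5" (see the two
    # options discussed further down in this ticket).
    achieved = len(set(placements.values()))
    if achieved < H and not is_repair:
        raise NotHappyError("placed on %d distinct servers, need %d" % (achieved, H))
    return placements  # a repair proceeds best-effort with whatever it has

# Example: one existing share on server B, threshold H = 3.
print(plan_placement({"B": {0}}, all_shares={0, 1, 2, 3},
                     writable_servers=["A", "C", "D"], H=3, is_repair=False))
```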
The algorithm David-Sarah proposes in comment:78773 sounds fine to me.
not making it into 1.9
Kevan: would the algorithm from your master's thesis solve this ticket? Would it be compatible with, or equivalent to, the algorithm that David-Sarah proposed in comment:78773?
I just thought of another wrinkle: the initial servermap in step 2 may contain shares with leases that are about to expire. The repairer should attempt to renew any leases on shares that are still needed, and only then (once it knows which renew operations succeeded) decide which new or replacement shares need to be stored.
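A small sketch of that ordering, where renew_lease(server, share) -> bool is an assumed helper (not a Tahoe API) reporting whether the renewal succeeded; only shares whose renewal succeeded are treated as existing when deciding what to store:

```python
# Sketch only: renew leases on still-needed existing shares first, and build
# the servermap used for placement from the shares whose renewal succeeded.

def confirmed_servermap(servermap, needed_shares, renew_lease):
    confirmed = {}
    for server, shares in servermap.items():
        kept = {sh for sh in shares & needed_shares if renew_lease(server, sh)}
        if kept:
            confirmed[server] = kept
    return confirmed
```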
The comment:78773 algorithm would fix #699. Note that in the case where there are existing shares that don't contribute to the maximum matching found in step 3, those shares (which are redundant if the repair is successful) will not be deleted. However, any redundant shares would not have their leases renewed.
Step 5 in the comment:78773 algorithm isn't very specific about where the remaining shares are placed. I can think of two possibilities:
a) continue the loop in step 4, i.e. place in the order of the permuted list with wrap-around.
b) sort the servers by the number of shares they have at that point (breaking ties in some deterministic way) and place on the servers with fewest shares first (see the sketch below).
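A minimal sketch of option (b); the tie-break by server id is only an assumption for illustration:

```python
# Visit servers holding the fewest shares first, breaking ties
# deterministically (here by server id).

def placement_order(server_to_shares):
    return sorted(server_to_shares,
                  key=lambda srv: (len(server_to_shares[srv]), srv))

current = {"server A": {0, 1}, "server B": {2}, "server C": set()}
print(placement_order(current))  # ['server C', 'server B', 'server A']
```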
This would be fixed by #1382, right?
Daira thinks it's the same problem as #1124, so yes.
Milestone renamed
moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders
Moving open issues out of closed milestones.
Ticket retargeted after milestone closed