Failure to achieve happiness in upload or repair #1130

Open
opened 2010-07-20 02:33:07 +00:00 by kmarkley86 · 19 comments
kmarkley86 commented 2010-07-20 02:33:07 +00:00
Owner

Prior to Tahoe-LAFS v1.7.1, the immutable uploader would sometimes raise an assertion error (#1118). We fixed that problem, and we also fixed the problem of the uploader uploading an insufficiently well-distributed set of shares while thinking that it had achieved servers-of-happiness. But now the uploader gives up and doesn't upload at all, saying that it hasn't achieved happiness, when, if it were smarter, it could achieve happiness. This ticket is to make it successfully upload in this case.

Log excerpt:

```
19:12:35.519 L20 []#1337 CHKUploader starting
19:12:35.519 L20 []#1338 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x20886b5a8>
19:12:35.520 L20 []#1339 creating Encoder <Encoder for unknown storage index>
19:12:35.520 L20 []#1340 file size: 106
19:12:35.520 L10 []#1341 my encoding parameters: (2, 4, 4, 106)
19:12:35.520 L20 []#1342 got encoding parameters: 2/4/4 106
19:12:35.520 L20 []#1343 now setting up codec
19:12:35.520 L20 []#1344 using storage index 5xpii
19:12:35.520 L20 []#1345 <Tahoe2PeerSelector for upload 5xpii> starting
19:12:35.633 L10 []#1346 response from peer 47cslusc: alreadygot=(), allocated=(0,)
19:12:36.590 L10 []#1347 response from peer vjqcroal: alreadygot=(0, 3), allocated=(1,)
19:12:37.119 L10 []#1348 response from peer sn4ana4b: alreadygot=(1,), allocated=(2,)
19:12:37.124 L20 []#1349 storage: allocate_buckets 5xpiivbjrybcmy4ws7xp7dxez4
19:12:37.130 L10 []#1350 response from peer yuzbctlc: alreadygot=(2,), allocated=(0,)
19:12:37.130 L25 []#1351 server selection unsuccessful for <Tahoe2PeerSelector for upload 5xpii>: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged={0: set(['\xc52\x11Mb\xa1\xff\x8d\xafn\x0b#s\x17\xbe\x82\x85\x93G0']), 1: set(['\xaa`(\xb8\x0b\x89\x98Y\xfb\xcc2,T\xd0\xde\xf7\xca\xbfA#', '\x93x\x06\x83\x81\xdb\x12*\xe5\xb095T\xf0\x1e\xa5\x00V+\x0f']), 2: set(['\xc52\x11Mb\xa1\xff\x8d\xafn\x0b#s\x17\xbe\x82\x85\x93G0', '\x93x\x06\x83\x81\xdb\x12*\xe5\xb095T\xf0\x1e\xa5\x00V+\x0f']), 3: set(['\xaa`(\xb8\x0b\x89\x98Y\xfb\xcc2,T\xd0\xde\xf7\xca\xbfA#'])}
19:12:37.133 L20 []#1352 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1826
19:12:37.148 L23 []#1353 storage: aborting sharefile /home/tahoe/.tahoe/storage/shares/incoming/5x/5xpiivbjrybcmy4ws7xp7dxez4/0
```
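
For reference, servers-of-happiness is the size of a maximum matching between servers and shares. The sketch below (illustrative Python, not Tahoe-LAFS code) computes that matching for the layout in the `merged=` dict above, with `A`, `B`, `C` standing in for the three distinct peer ids that appear there, and reproduces the "only 3 server(s)" figure from the log; with N=4 and happy=4, a successful upload would have to spread shares across all four responding servers.

```python
# A minimal sketch (not Tahoe-LAFS code) of the servers-of-happiness metric:
# happiness is the size of a maximum matching in the bipartite graph linking
# shares to the servers that hold or accepted them. The layout below is
# transcribed from the merged= dict in the log, with A/B/C standing in for
# the three distinct peer ids.

def happiness(share_to_servers):
    """Size of a maximum share<->server matching, found via augmenting paths."""
    matched = {}  # server -> share currently assigned to it

    def place(share, visited):
        for server in share_to_servers[share]:
            if server in visited:
                continue
            visited.add(server)
            # Use this server if it is free, or if the share it currently
            # holds can be re-placed on some other server.
            if server not in matched or place(matched[server], visited):
                matched[server] = share
                return True
        return False

    return sum(place(share, set()) for share in share_to_servers)

layout = {0: {"C"}, 1: {"A", "B"}, 2: {"C", "B"}, 3: {"A"}}
print(happiness(layout))  # -> 3, matching "shares could be placed on only 3 server(s)"
```
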
tahoe-lafs added the
unknown
major
defect
1.7.0
labels 2010-07-20 02:33:07 +00:00
tahoe-lafs added this to the undecided milestone 2010-07-20 02:33:07 +00:00
kmarkley86 commented 2010-07-20 02:33:43 +00:00
Author
Owner

Attachment stuff.flog.bz2 (10011 bytes) added

Log from flogtool

kmarkley86 commented 2010-07-20 02:44:44 +00:00
Author
Owner

I think I had originally uploaded this file when I was configured to use encoding parameters 2/3/4. That may explain the original distribution of the shares. I assume it's legal for a client to change their parameters (as I did, to 2/4/4) and continue using the grid. In this case the share needs to be migrated, but the migration doesn't happen.

tahoe-lafs added
code-peerselection
1.7.1
and removed
unknown
1.7.0
labels 2010-07-20 03:04:36 +00:00
tahoe-lafs modified the milestone from undecided to 1.9.0 2010-08-12 23:34:04 +00:00

This issue reinforces Brian's sense of dubiousness about servers-of-happiness: <http://tahoe-lafs.org/pipermail/tahoe-dev/2010-December/005704.html>. This bothers me! I want Brian to love servers of happiness and revel in its excellence. Perhaps fixing this ticket would help.

According to David-Sarah in [this tahoe-dev message](http://tahoe-lafs.org/pipermail/tahoe-dev/2010-December/005698.html), this issue is nearly the same as the one tested in [test_problem_layout_ticket_1128](http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/test/test_upload.py?annotate=blame&rev=4657#L1913). So anybody who wants to fix this can start by running that one unit test.
davidsarah commented 2010-12-29 20:06:45 +00:00
Author
Owner

Yes, #1128 had already been closed as a duplicate of this ticket. The name of the unit test should probably be changed (although I hope we fix it before the next release anyway).

davidsarah commented 2011-06-09 00:11:37 +00:00
Author
Owner

Upload and repair are sufficiently similar that I think they can be covered by the same ticket for this issue. They are implemented mostly by the same code, and they both should change to take into account existing shares in the same way, probably along the lines of ticket:1212#[comment:-1](/tahoe-lafs/trac-2024-07-25/issues/1130#issuecomment--1). The difference is when happiness is not achieved, upload should fail, while repair should still make a best effort to improve preservation of the file. But that needn't stop them from using the same improvement to the share placement algorithm.
tahoe-lafs changed title from Failure to achieve happiness in upload to Failure to achieve happiness in upload or repair 2011-06-09 00:11:37 +00:00
davidsarah commented 2011-06-09 00:28:36 +00:00
Author
Owner

[copying the algorithm from ticket:1212#comment:-1 here, with some minor refinements, for ease of reference]

This is how I think the repairer should work:

  1. let k and N be the shares-needed and total number of shares for this file, and let H be the happiness threshold read from tahoe.cfg.
  2. construct a server map for this file by asking all connected servers which shares they have. (In the case of a mutable file, construct a server map for the latest retrievable version.)
  3. construct a maximum matching M : server -> share, of size |M|, for this file (preferring to include servers that are earlier on the permuted list when there is a choice).
  4. while |M| < N, and we have not tried to put shares on all connected servers:
    • pick a share not in M, and the server not in M that is next on the permuted list, wrapping around if necessary. Try to extend M by putting that share onto that server.
  5. place any remaining shares on servers that are already in the map (don't count these in |M|).
  6. if the file is not retrievable, report that the repair failed completely. If k <= |M| < H, report that the file is retrievable but unhealthy. In any case report what |M| is.

The while loop should be done in parallel, with up to N - |M| outstanding requests.

Upload would work in the same way (for the general case where there may be existing shares), except that it would fail if it is not possible to achieve |M| >= H.

edit: [numbered the steps]
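
To make the steps above concrete, here is a rough sketch of steps 4 through 6 (illustrative only, not the Tahoe-LAFS implementation; `M`, `permuted_servers`, and `try_to_place` are hypothetical stand-ins, and the real step 4 would run its requests in parallel as noted above):

```python
# Illustrative only. M is assumed to be a server -> share dict for a maximum
# matching computed as in step 3 (the augmenting-path matcher sketched in the
# description can produce one), and try_to_place(server, sharenum) is a
# hypothetical callback that asks one server to accept one share and reports
# success. None of these names are real Tahoe-LAFS APIs.

class UnhappinessError(Exception):
    """Stand-in for the error the uploader would raise when H cannot be met."""

def extend_placement(k, N, H, permuted_servers, M, try_to_place, repairing):
    placed = set(M.values())

    # Step 4: offer unplaced shares to servers not yet in M, in permuted-list
    # order. (The real loop would issue these queries in parallel.)
    for server in (s for s in permuted_servers if s not in M):
        if len(M) >= N:
            break
        remaining = [sh for sh in range(N) if sh not in placed]
        if not remaining:
            break
        if try_to_place(server, remaining[0]):
            M[server] = remaining[0]
            placed.add(remaining[0])

    # Step 5: any shares still unplaced go onto servers that are already in
    # the map; these placements do not increase |M|.
    for sh in (sh for sh in range(N) if sh not in placed):
        for server in M:
            if try_to_place(server, sh):
                break

    # Step 6: an upload must reach the happiness threshold H; a repair is
    # best-effort and only fails outright if the file is not retrievable.
    if len(M) < k:
        raise UnhappinessError("not retrievable: |M|=%d < k=%d" % (len(M), k))
    if not repairing and len(M) < H:
        raise UnhappinessError("unhappy: |M|=%d < H=%d" % (len(M), H))
    return M
```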

The algorithm David-Sarah proposes in [comment:78773](/tahoe-lafs/trac-2024-07-25/issues/1130#issuecomment-78773) sounds fine to me.

not making it into 1.9

warner modified the milestone from 1.9.0 to 1.10.0 2011-10-13 17:05:29 +00:00

Kevan: would the algorithm from your master's thesis solve this ticket? Would it be compatible with, or equivalent to, the algorithm that David-Sarah proposed in [comment:78773](/tahoe-lafs/trac-2024-07-25/issues/1130#issuecomment-78773)?
davidsarah commented 2012-09-29 20:50:08 +00:00
Author
Owner

I just thought of another wrinkle: the initial servermap in step 2 may contain shares with leases that are about to expire. The repairer should attempt to renew any leases on shares that are still needed, and only then (once it knows which renew operations succeeded) decide which new or replacement shares need to be stored.
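
A minimal sketch of that ordering (hypothetical names, not real Tahoe-LAFS APIs): renew leases on the still-needed shares first, and only plan placement from the shares whose renewals succeeded.

```python
# Hypothetical helpers: servermap maps server -> set of share numbers it
# reported, and renew_lease(server, sharenum) returns whether the renewal
# succeeded. Only shares that are still needed and successfully renewed are
# treated as existing when deciding what to place.
def renew_then_plan(servermap, needed_shares, renew_lease):
    surviving = {}
    for server, shares in servermap.items():
        kept = set(sh for sh in (shares & needed_shares) if renew_lease(server, sh))
        if kept:
            surviving[server] = kept
    return surviving  # feed this map into the placement step
```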

davidsarah commented 2013-02-15 03:50:21 +00:00
Author
Owner

The [comment:78773](/tahoe-lafs/trac-2024-07-25/issues/1130#issuecomment-78773) algorithm would fix #699. Note that in the case where there are existing shares that don't contribute to the maximum matching found in step 3, those shares (which are redundant if the repair is successful) will not be deleted. However, any redundant shares would not have their leases renewed.
daira commented 2013-06-27 17:11:31 +00:00
Author
Owner

Step 5 in the [comment:78773](/tahoe-lafs/trac-2024-07-25/issues/1130#issuecomment-78773) algorithm isn't very specific about where the remaining shares are placed. I can think of two possibilities:

a) continue the loop in step 4, i.e. place in the order of the permuted list with wrap-around.

b) sort the servers by the number of shares they have at that point (breaking ties in some deterministic way) and place on the servers with fewest shares first.
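
A minimal sketch of option (b), assuming a hypothetical `shares_held` map from each server already in the map to the share numbers it holds, with the server id serving as the deterministic tie-break:

```python
# Hypothetical sketch of option (b): offer each leftover share to the servers
# holding the fewest shares first, breaking ties by server id.
def placement_order(shares_held):
    return sorted(shares_held, key=lambda server: (len(shares_held[server]), server))
```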

tahoe-lafs modified the milestone from soon to 1.11.0 2013-09-01 05:30:02 +00:00

This would be fixed by #1382, right?

Daira thinks it's the same problem as #1124, so yes.

warner modified the milestone from 1.10.1 to 1.11.0 2015-01-20 17:26:08 +00:00

Milestone renamed

warner modified the milestone from 1.11.0 to 1.12.0 2016-03-22 05:02:52 +00:00

moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders

warner modified the milestone from 1.12.0 to 1.13.0 2016-06-28 18:20:37 +00:00

Moving open issues out of closed milestones.

exarkun modified the milestone from 1.13.0 to 1.15.0 2020-06-30 14:45:13 +00:00
Owner

Ticket retargeted after milestone closed

meejah modified the milestone from 1.15.0 to soon 2021-03-30 18:40:19 +00:00