Not enough available servers are found #2016

tahoe-lafs · 2013-07-05T19:36:56Z

kapiteined commented

2013-07-05 19:36:56 +00:00

When uploading a file, it fails with the following error:

<class 'allmydata.interfaces.UploadUnhappinessError'>: shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers. (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error))

There are 12 servers connected to this grid (pubgrid), yet only 6 queries are sent, and because two of those servers are full the upload fails (if I interpreted the error correctly).

Shouldn't there be another round of queries if the first round does not yield enough available servers?

tahoe-lafs added the

labels 2013-07-05 19:36:56 +00:00

tahoe-lafs added this to the undecided milestone 2013-07-05 19:36:56 +00:00

kapiteined commented

2013-07-05 19:57:05 +00:00

Replying to kapiteined:

When uploading a file, it fails with the following error:

<class 'allmydata.interfaces.UploadUnhappinessError'>: shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers. (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error))

There are 12 servers connected to this grid (pubgrid), yet only 6 queries are sent, and because two of those servers are full the upload fails (if I interpreted the error correctly).

Shouldn't there be another round of queries if the first round does not yield enough available servers?

Somehow attaching a file to this ticket failed, so I put the error report
(incident-2013-07-05--19-34-13Z-7o6admq.flog.bz2)
at URI:CHK:7tbpjhxokkmpere6nxwfa5cvey:37ypgfhpwg67veqpyhjve22edmh3w3jwpbds47yfnvjussvalmaq:3:5:74128
in the pubgrid.

daira commented

2013-07-05 20:49:40 +00:00

Here's the most important part of the log:

local#675113 20:33:49.785: CHKUploader starting
local#675114 20:33:49.786: starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x31a3378>
local#675115 20:33:49.786: creating Encoder <Encoder for unknown storage index>
local#675116 20:33:49.787: file size: 658086
local#675117 20:33:49.789: my encoding parameters: (3, 5, 5, 131073)
local#675118 20:33:49.790: got encoding parameters: 3/5/5 131073
local#675119 20:33:49.790: now setting up codec
local#675120 20:33:49.878: using storage index jbljj
local#675121 20:33:49.878: <Tahoe2ServerSelector for upload jbljj>(jbljj): starting
local#675122 20:33:49.927: <Tahoe2ServerSelector for upload jbljj>(jbljj): asking server psdgefgf for any existing shares
local#675123 20:33:49.954: <Tahoe2ServerSelector for upload jbljj>(jbljj): asking server 5sqtlw for any existing shares
local#675124 20:33:49.964: got result from [hrtib2], 0 shares
local#675125 20:33:49.965: but we're not running, so we'll ignore it
local#675126 20:33:49.966: _check_for_done, mode is 'MODE_READ', 2 queries outstanding, 2 extra servers available, 0 'must query' servers left, need_privkey=False
local#675127 20:33:49.967: but we're not running
local#675128 20:33:49.988: got result from [nszizg], 0 shares
local#675129 20:33:49.989: but we're not running, so we'll ignore it
local#675130 20:33:49.990: _check_for_done, mode is 'MODE_READ', 1 queries outstanding, 2 extra servers available, 0 'must query' servers left, need_privkey=False
local#675131 20:33:49.990: but we're not running
local#675132 20:33:50.083: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to get_buckets() from server psdgefgf: alreadygot=()
local#675133 20:33:50.112: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to get_buckets() from server 5sqtlw: alreadygot=()
local#675134 20:33:50.216: got result from [r7cddi], 0 shares
local#675135 20:33:50.217: but we're not running, so we'll ignore it
local#675136 20:33:50.218: _check_for_done, mode is 'MODE_READ', 0 queries outstanding, 2 extra servers available, 0 'must query' servers left, need_privkey=False
local#675137 20:33:50.219: but we're not running
local#675138 20:33:50.290: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server i76mi6: alreadygot=(0,), allocated=()
local#675139 20:33:50.457: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server lxmst5: alreadygot=(2,), allocated=(1,)
local#675140 20:33:50.667: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server sf7ehc: alreadygot=(3,), allocated=()
local#675141 20:33:50.822: <Tahoe2ServerSelector for upload jbljj>(jbljj): response to allocate_buckets() from server ddvfcd: alreadygot=(4,), allocated=()
local#675142 20:33:50.839: <Tahoe2ServerSelector for upload jbljj>(jbljj): server selection unsuccessful for <Tahoe2ServerSelector for upload jbljj>:
 shares could be placed on only 4 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to place shares on at least 5 such servers.
 (placed all 5 shares, want to place shares on at least 5 servers such that any 3 of them have enough shares to recover the file, sent 6 queries to 6 servers, 4 queries placed some shares, 2 placed none (of which 2 placed none due to the server being full and 0 placed none due to an error)),
 merged=sh0: i76mi6en, sh1: lxmst5bx, sh2: lxmst5bx, sh3: sf7ehcpn, sh4: ddvfcdns
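
To make the final `merged=` line easier to read, here is a minimal Python sketch that turns it into a share-to-server map and groups the shares by server. The helper names `parse_merged` and `shares_by_server` are made up for this illustration and are not part of Tahoe-LAFS; the input string is copied from the log above.

```python
import re
from collections import defaultdict

# Copied verbatim from the last log line above.
MERGED = "merged=sh0: i76mi6en, sh1: lxmst5bx, sh2: lxmst5bx, sh3: sf7ehcpn, sh4: ddvfcdns"


def parse_merged(line):
    """Return {share_number: server_id} from a 'merged=...' log fragment.

    Hypothetical helper for reading the log, not a Tahoe-LAFS API.
    """
    return {int(num): server for num, server in re.findall(r"sh(\d+): (\w+)", line)}


def shares_by_server(placement):
    """Group share numbers by the server that holds them."""
    servers = defaultdict(list)
    for share, server in sorted(placement.items()):
        servers[server].append(share)
    return dict(servers)


placement = parse_merged(MERGED)
by_server = shares_by_server(placement)

# Servers that ended up with more than one share.
doubled = {server: shares for server, shares in by_server.items() if len(shares) > 1}

print(len(placement), "shares on", len(by_server), "servers")
print("doubled up:", doubled)
```

All five shares are accounted for, but they sit on only four distinct servers, because `lxmst5bx` holds both sh1 and sh2.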

daira commented

2013-07-05 20:59:39 +00:00

Here's my interpretation: with h = N = 5, as soon as the Tahoe2ServerSelector decides to put two shares on the same server (here sh1 and sh2 on lxmst5bx), the upload is doomed. The shares all have to be on different servers whenever h = N, but the termination condition is just that all shares have been placed, not that they have been placed in a way that meets the happiness condition.

If that's the problem, then #1382 should fix it. This would also explain why VG2 was unreliable with h close to N.
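
The happiness condition can be checked by hand for the placement in the log. The sketch below assumes the usual servers-of-happiness definition (the size of a maximum matching between servers and the shares they hold) and computes it with a plain augmenting-path search. The function `happiness` and the server name `fifthsrv` are hypothetical, chosen only for this illustration; this is not the code in `Tahoe2ServerSelector`.

```python
def happiness(server_shares):
    """Size of a maximum matching between servers and shares.

    server_shares maps a server id to the set of share numbers it holds.
    Illustrative re-implementation, not the Tahoe-LAFS code.
    """
    match = {}  # share number -> server currently matched to it

    def try_assign(server, seen):
        # Try to give this server a share, re-routing earlier servers if needed.
        for share in server_shares[server]:
            if share in seen:
                continue
            seen.add(share)
            if share not in match or try_assign(match[share], seen):
                match[share] = server
                return True
        return False

    return sum(try_assign(server, set()) for server in server_shares)


# Placement reported in the "merged=" log line.
observed = {
    "i76mi6en": {0},
    "lxmst5bx": {1, 2},
    "sf7ehcpn": {3},
    "ddvfcdns": {4},
}

# Hypothetical alternative: sh1 goes to a fifth server instead of lxmst5bx.
spread = {
    "i76mi6en": {0},
    "fifthsrv": {1},
    "lxmst5bx": {2},
    "sf7ehcpn": {3},
    "ddvfcdns": {4},
}

print(happiness(observed))  # 4, below the requested h = 5
print(happiness(spread))    # 5, meets h = 5
```

With h = N = 5, the observed placement can never reach happiness 5 once two shares share a server, even though every share has been placed.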

zooko commented

2013-07-05 21:03:15 +00:00

Daira: excellent work diagnosing this!! Ed: thanks so much for the bug report. Daira: it looks like you are right, and I think this does explain those bugs that the volunteergrid2 people reported and that I never understood. Thank you!

kapiteined commented

2013-07-05 21:08:50 +00:00

To check whether that is the case, I changed to 3-7-10 encoding, and now the upload succeeds!
Success: file copied

Does this call for a change in the code, or for a big warning sticker:
"don't choose h and N too close together"?
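
A rough way to see why 3-7-10 succeeds where 3-5-5 fails: if every one of the N shares is placed on exactly one server, the happiness value equals the number of distinct servers used, so at most N - h shares may land on a server that already holds another share. The sketch below assumes "3-7-10" means needed = 3, happy = 7, total = 10, matching the order of the encoding parameters in the log; `doubling_headroom` is a hypothetical helper, not a Tahoe-LAFS function.

```python
def doubling_headroom(needed, happy, total):
    """How many shares may double up on an already-used server.

    Assumes all `total` shares are placed, each on exactly one server.
    `needed` does not affect the result; it is kept so the call mirrors
    the needed/happy/total triple.
    """
    if happy > total:
        raise ValueError("happy cannot exceed the total number of shares")
    return total - happy


print(doubling_headroom(3, 5, 5))   # 0: a single doubled-up share dooms the upload
print(doubling_headroom(3, 7, 10))  # 3: up to three shares may double up
```

This only explains why leaving a gap between h and N works around the problem; it does not change the selector's termination condition.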

daira commented

2013-07-07 19:40:32 +00:00

We intend to fix it for v1.11 (Mark Berger's branch for #1382 already basically works), but there would be no harm in pointing out this problem on tahoe-dev in the meantime.

daira commented

2013-07-09 14:33:42 +00:00

Same bug as #1791?

tahoe-lafs added

and removed

labels 2013-07-09 14:33:42 +00:00

tahoe-lafs modified the milestone from undecided to 1.11.0

2013-07-09 14:33:42 +00:00

daira commented

2013-07-09 14:38:12 +00:00

Replying to daira:

Same bug as #1791?

Yes, that bug also had h = N and two shares placed on the same server, so it is almost identical. I'll copy the conclusions from here to that ticket.

tahoe-lafs added the duplicate label 2013-07-09 14:38:12 +00:00

daira closed this issue

2013-07-09 14:38:12 +00:00
