Making requests too soon after startup can fail #719

Open
opened 2009-05-31 15:42:35 +00:00 by bewst · 7 comments
bewst commented 2009-05-31 15:42:35 +00:00
Owner
$ tahoe start
STARTING /export/home/dave/.tahoe
client node probably started
$ tahoe ls
Error during GET: 410 Gone [UnrecoverableFileError](wiki/UnrecoverableFileError): the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.
$ tahoe ls
Welcome_to_Allmydata.pdf
_My Shared Files_
_Recycle bin_
bak
c++std2003.pdf
$
tahoe-lafs added the
unknown
major
defect
1.4.1
labels 2009-05-31 15:42:35 +00:00
tahoe-lafs added this to the undecided milestone 2009-05-31 15:42:35 +00:00

This is an issue with hidden depths: how should the client node know that it
has connected to every server that it's ever going to need?

But it should be easy to improve the situation somewhat. To start with, there
should be some internal function that keeps track of "progress towards full
connection":

  • have we connected to the introducer? how long ago did we connect? do we
    even have an introducer.furl?
  • how many storage servers have we been told about? how many are connected?
    how many are left? how long have we been trying to connect to them?

Then, when a directory retrieve or a file download fails due to insufficient
shares, this function could provide additional human-useful data, like saying
"we couldn't retrieve that directory right now, but since it looks like we've
only been connected to the introducer for two seconds, maybe we just don't
know about enough servers yet. You should try again in ten seconds."
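
A minimal sketch of such a tracker, to make the idea concrete. Everything here is hypothetical: `ConnectionProgress`, its method names, and the injected `clock` are illustrative and are not part of the Tahoe-LAFS codebase. It only records the facts listed above and renders them as advisory text; it makes no attempt to decide whether a failure is transient.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class ConnectionProgress:
    """Hypothetical record of "progress towards full connection"."""

    has_introducer_furl: bool = False
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    introducer_connected_at: Optional[float] = None
    first_announcement_at: Optional[float] = None
    announced: Set[str] = field(default_factory=set)
    connected: Set[str] = field(default_factory=set)

    def introducer_connected(self) -> None:
        self.introducer_connected_at = self.clock()

    def server_announced(self, server_id: str) -> None:
        # Remember when we first started trying to reach storage servers.
        if self.first_announcement_at is None:
            self.first_announcement_at = self.clock()
        self.announced.add(server_id)

    def server_connected(self, server_id: str) -> None:
        self.server_announced(server_id)
        self.connected.add(server_id)

    def server_disconnected(self, server_id: str) -> None:
        self.connected.discard(server_id)

    def describe(self) -> str:
        """Return advisory text for a human; no transient/permanent verdict."""
        if not self.has_introducer_furl:
            return "No introducer.furl is configured."
        if self.introducer_connected_at is None:
            return "Not yet connected to the introducer."
        now = self.clock()
        pending = len(self.announced - self.connected)
        parts = [
            f"Connected to the introducer {now - self.introducer_connected_at:.0f}s ago.",
            f"Storage servers: {len(self.announced)} announced, "
            f"{len(self.connected)} connected, {pending} pending.",
        ]
        if pending and self.first_announcement_at is not None:
            waited = now - self.first_announcement_at
            parts.append(f"Still trying to connect after {waited:.0f}s.")
        return " ".join(parts)
```

The code that raises the insufficient-shares error could append `describe()` to its message and leave the interpretation to whoever reads it.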

I'm not sure how to deliver that extra information. Specifically, the tahoe
node should not try to guess whether this is a transient failure or a
permanent one: we don't want to resort to heuristics or fixed timeouts. So
this extra data is advisory and should be interpreted by a human rather than
a piece of code.

So from the webapi point of view, 410 still seems like the right response
code, but maybe we can add the text to the response body, and make sure that
the CLI tools will deliver this body to stderr.
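
On the CLI side, a rough sketch of what "deliver this body to stderr" could look like, using only the Python standard library. The helper names `report_http_error` and `get_or_report` are made up for illustration and are not the actual tahoe CLI code; the sketch assumes the node puts the advisory text in the body of the 410 response.

```python
import sys
import urllib.error
import urllib.request


def report_http_error(exc, err=None):
    """Write the status line and any response body of an HTTPError to `err`."""
    err = sys.stderr if err is None else err
    # HTTPError doubles as a response object, so the error body is readable.
    body = exc.read().decode("utf-8", "replace").strip()
    print(f"Error during GET: {exc.code} {exc.reason}", file=err)
    if body:
        print(body, file=err)


def get_or_report(url, err=None):
    """Return the response body, or None after reporting an HTTP error."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:
        report_http_error(exc, err)
        return None
```

The status code stays 410 either way; only the human-readable explanation travels in the body.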

We have similar issues in a browser. I don't know when browsers will show the
response body for things like 410 GONE, but maybe we can use the same
technique.

warner added
code-network
and removed
unknown
labels 2009-05-31 21:11:52 +00:00
davidsarah commented 2010-04-04 17:10:48 +00:00
Author
Owner

This issue also affects the WUI. Some browsers (in particular IE) hide response bodies for HTTP errors by default, but that doesn't mean the response body is the wrong place to put human-readable information about the error; the HTTP spec specifically says that browsers SHOULD display the entity body for errors (see the end of [RFC 2616 section 6.1.1](https://tools.ietf.org/html/rfc2616#section-6.1.1)).
tahoe-lafs modified the milestone from undecided to 1.7.0 2010-04-04 17:10:48 +00:00
zooko modified the milestone from 1.7.0 to eventually 2010-06-18 23:28:17 +00:00
tahoe-lafs added
code-frontend-web
and removed
code-network
labels 2010-07-24 00:43:37 +00:00
tahoe-lafs modified the milestone from eventually to soon 2010-07-24 00:43:37 +00:00
davidsarah commented 2010-08-16 20:53:33 +00:00
Author
Owner

This issue affects all operations including check and repair, and all frontends.

tahoe-lafs added
code-frontend
and removed
code-frontend-web
labels 2010-08-16 20:53:33 +00:00
davidsarah commented 2012-04-23 18:40:35 +00:00
Author
Owner

See also #1596 for the error-reporting aspect (not just on start-up).

daira commented 2013-08-02 04:27:41 +00:00
Author
Owner

#2043 was a duplicate.

dawuud commented 2015-04-07 00:02:44 +00:00
Author
Owner

Daira and I are working on the related ticket #1449. Can we also satisfy this ticket?

Brian Warner <warner@lothar.com> commented 2017-09-19 17:20:49 +00:00
Author
Owner

In 04b34b6/trunk:

Merge PR417: rewrite tahoe start/stop/daemonize

* refs ticket:1148 (splits up startstop_node, improves coverage)
* refs ticket:275 ('start' probably doesn't exit until furl is written)
* refs ticket:1121 (probably improves coverage)
* refs ticket:1377 (probably fixed)
* refs ticket:2149 (might influence, probably won't fix)
* refs ticket:719 (probably improved)
Reference: tahoe-lafs/trac-2024-07-25#719