Making requests too soon after startup can fail #719


tahoe-lafs · 2009-05-31T15:42:35Z

bewst commented

2009-05-31 15:42:35 +00:00

$ tahoe start
STARTING /export/home/dave/.tahoe
client node probably started
$ tahoe ls
Error during GET: 410 Gone [UnrecoverableFileError](wiki/UnrecoverableFileError): the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.
$ tahoe ls
Welcome_to_Allmydata.pdf
_My Shared Files_
_Recycle bin_
bak
c++std2003.pdf
$


tahoe-lafs added the

labels 2009-05-31 15:42:35 +00:00

tahoe-lafs added this to the undecided milestone 2009-05-31 15:42:35 +00:00

warner commented

2009-05-31 21:11:52 +00:00

This is an issue with hidden depths: how should the client node know that it
has connected to every server that it's ever going to need?

But it should be easy to improve the situation somewhat. To start with, there
should be some internal function that keeps track of "progress towards full
connection":

* have we connected to the introducer? how long ago did we connect? do we
  even have an introducer.furl?
* how many storage servers have we been told about? how many are connected?
  how many are left? how long have we been trying to connect to them?

Then, when a directory retrieve or a file download fails due to insufficient
shares, this function could provide additional human-useful data, like saying
"we couldn't retrieve that directory right now, but since it looks like we've
only been connected to the introducer for two seconds, maybe we just don't
know about enough servers yet. You should try again in ten seconds."
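
A minimal sketch of what such a progress-tracking function might look like, in Python. Every name here (`ConnectionProgress`, its fields, `advisory`) is hypothetical and not part of the Tahoe-LAFS codebase. It only reports what the node has observed and leaves the interpretation to the reader.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionProgress:
    """Hypothetical snapshot of "progress towards full connection"."""

    has_introducer_furl: bool
    # time.time() value at which the introducer connected, or None if it has not
    introducer_connected_at: Optional[float]
    servers_announced: int   # storage servers we have been told about
    servers_connected: int   # storage servers we are connected to

    def advisory(self, now: Optional[float] = None) -> str:
        """Return human-readable context for an insufficient-shares error."""
        now = time.time() if now is None else now
        if not self.has_introducer_furl:
            return "No introducer.furl is configured, so no servers can be discovered."
        if self.introducer_connected_at is None:
            return "Not yet connected to the introducer; you may want to try again shortly."
        elapsed = now - self.introducer_connected_at
        pending = self.servers_announced - self.servers_connected
        return (
            f"Connected to the introducer {elapsed:.0f} seconds ago; "
            f"{self.servers_connected} of {self.servers_announced} announced "
            f"storage servers are connected ({pending} still pending)."
        )
```

The string returned by `advisory()` could then be appended to the existing error text when a retrieve fails.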

I'm not sure how to deliver that extra information. Specifically, the tahoe
node should not try to guess whether this is a transient failure or a
permanent one: we don't want to resort to heuristics or fixed timeouts. So
this extra data is advisory and should be interpreted by a human rather than
a piece of code.

So from the webapi point of view, 410 still seems like the right response
code, but maybe we can add the text to the response body, and make sure that
the CLI tools will deliver this body to stderr.
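
As a rough illustration of the CLI side, here is a sketch using only the Python standard library. It is not the actual `tahoe` CLI code: `format_http_error` and `get_or_report` are made-up helper names, and the URL is whatever webapi endpoint the caller passes in.

```python
import sys
import urllib.error
import urllib.request


def format_http_error(code: int, reason: str, body: bytes) -> str:
    """Combine the status line with whatever text the server put in the body."""
    text = body.decode("utf-8", errors="replace").strip()
    header = f"Error during GET: {code} {reason}"
    return f"{header}\n{text}" if text else header


def get_or_report(url: str) -> int:
    """Fetch url; on an HTTP error, print the response body to stderr."""
    try:
        with urllib.request.urlopen(url) as response:
            sys.stdout.buffer.write(response.read())
        return 0
    except urllib.error.HTTPError as e:
        # HTTPError is file-like, so the error response body can be read from it.
        print(format_http_error(e.code, e.reason, e.read()), file=sys.stderr)
        return 1
```

The status code stays machine-readable, while any advisory text in the body reaches the person at the terminal.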

We have similar issues in a browser. I don't know when browsers will show the
response body for things like 410 GONE, but maybe we can use the same
technique.


warner added code-network and removed unknown labels 2009-05-31 21:11:52 +00:00

davidsarah commented

2010-04-04 17:10:48 +00:00

This issue also affects the WUI. Some browsers (in particular IE) hide response bodies for HTTP errors by default, but the response body is still the right place to put human-readable information about the error: the HTTP spec specifically says that browsers SHOULD display the entity body for errors (see the end of [RFC 2616 section 6.1.1](https://tools.ietf.org/html/rfc2616#section-6.1.1)).

tahoe-lafs modified the milestone from undecided to 1.7.0

2010-04-04 17:10:48 +00:00

zooko modified the milestone from 1.7.0 to eventually

2010-06-18 23:28:17 +00:00

tahoe-lafs added code-frontend-web and removed code-network labels 2010-07-24 00:43:37 +00:00

tahoe-lafs modified the milestone from eventually to soon

2010-07-24 00:43:37 +00:00

davidsarah commented

2010-08-16 20:53:33 +00:00

This issue affects all operations including check and repair, and all frontends.

tahoe-lafs added code-frontend and removed code-frontend-web labels 2010-08-16 20:53:33 +00:00

davidsarah commented

2012-04-23 18:40:35 +00:00

See also #1596 for the error-reporting aspect (not just on start-up).

daira commented

2013-08-02 04:27:41 +00:00

#2043 was a duplicate.

dawuud commented

2015-04-07 00:02:44 +00:00

Daira and I are working on the related ticket #1449. Can we also satisfy this ticket?

Brian Warner <warner@lothar.com> commented

2017-09-19 17:20:49 +00:00

In [04b34b6/trunk](/tahoe-lafs/trac-2024-07-25/commit/04b34b6fd2bb112942d2cb8ea41ef11ee4c72347):

Merge PR417: rewrite tahoe start/stop/daemonize

* refs ticket:1148 (splits up startstop_node, improves coverage)
* refs ticket:275 ('start' probably doesn't exit until furl is written)
* refs ticket:1121 (probably improves coverage)
* refs ticket:1377 (probably fixed)
* refs ticket:2149 (might influence, probably won't fix)
* refs ticket:719 (probably improved)

