when an upload or check fails, link to a full explanation of what happened #1941


zooko · 2013-04-09T19:23:40Z

zooko commented

2013-04-09 19:23:40 +00:00

I heard that the volunteergrid2 project has shut down. The participants, in explaining why they gave up on it, said that they often got "unhappiness errors" when they tried to upload files, so they never trusted the grid with their backups.

There are two problems here that this ticket attempts to address:

1. They didn't trust the grid. Why? Not because the upload failed, but because they didn't know why the upload had failed. They interpreted this as evidence that Tahoe-LAFS was buggy or unreliable. If they had seen a clear, understandable explanation that said "This upload failed because you specified you required at least 15 servers, and of the 20 servers on your grid, 10 of them are currently unreachable.", then they would have continued to trust the Tahoe-LAFS software and they would have known what changes to make (to their grid or their happiness parameter) to get what they wanted. (Note that information was actually already in those "unhappiness errors", but they didn't read or understand it. See below.)
2. We (the tahoe-lafs developers) don't know why their uploads failed. Perhaps Tahoe-LAFS was harboring some previously unknown bug. Perhaps too many of their servers were on flaky home DSL that timed out on most requests. Perhaps it was something else. We can't improve the software without a working feedback loop whereby we can learn the details of failures.

This ticket is to make it so that when an upload fails, you can read an understandable story of what happened that led to the failure, specifying which servers your client tried to use and what each server did.

Note that the basic information of how many servers were reachable, etc., is encoded into the error message that users currently see, but users do not read that error message, because it contains a Python traceback, so they just gloss over it. So this ticket is to make two changes to that:

1. Add more information. Not just the number of servers that failed, but which specific servers (identifiers, nicknames, IP addresses) and when.
2. Make it a human-oriented HTML page, not a Python traceback. Most users will not read anything that contains a Python traceback.
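
A rough sketch of the kind of explanation this ticket asks for, in Python (the language Tahoe-LAFS is written in). Everything named here (`ServerAttempt`, `explain_upload_failure`, the outcome strings) is hypothetical and not part of the Tahoe-LAFS codebase. It emits plain text rather than an HTML page, and it simplifies "happiness" to a count of servers that accepted shares, which is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerAttempt:
    """One server the client tried to use (hypothetical record type)."""
    nickname: str
    server_id: str
    address: str
    outcome: str    # assumed values: "accepted", "unreachable", "full"
    timestamp: str  # when the client observed this outcome


def explain_upload_failure(happy: int, attempts: List[ServerAttempt]) -> str:
    """Build a human-readable story of a failed upload, with no traceback."""
    accepted = [a for a in attempts if a.outcome == "accepted"]
    lines = [
        f"This upload failed because you required at least {happy} servers, "
        f"but only {len(accepted)} of the {len(attempts)} servers on your grid "
        f"accepted shares."
    ]
    # One line per server: which server, and what it did, and when.
    for a in attempts:
        lines.append(
            f"  {a.timestamp}  {a.nickname} ({a.server_id}, {a.address}): {a.outcome}"
        )
    return "\n".join(lines)
```

The same per-server records could be kept by the client and passed along to developers, which would address the missing feedback loop as well as the user-facing explanation.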


zooko added the

labels 2013-04-09 19:23:40 +00:00

zooko added this to the undecided milestone 2013-04-09 19:23:40 +00:00

zooko added

code-network

and removed

unknown

labels 2013-04-09 19:24:13 +00:00

zooko modified the milestone from undecided to 1.11.0

2013-04-09 19:24:13 +00:00

tahoe-lafs changed title from ~~when an upload fails, link to a full explanation of what happened~~ to when an upload or check fails, link to a full explanation of what happened

2013-04-11 20:45:36 +00:00

daira commented

2013-04-11 21:15:15 +00:00

See also #1821. The implementation can probably be (partly) shared between check and upload.

kpreid commented

2013-04-26 03:14:32 +00:00

This is my UI design opinion:

The most fundamental thing that needs to be done is to format the error for the web, not as a Python traceback. Tracebacks are for when the program itself does not understand what went wrong, which is not the case here.

This is a random idea I had which Zooko liked:

The error page should prominently contain a bar graph, which contains the same information as the current error, but in a graphical and explained format. For example:

You have this many storage servers online: ###......
You need this many to upload this file:    \---/

An actual graphical rather than ASCII-art version would have the resolution to distinguish between needed and happy, and storage servers which are full vs. offline. Furthermore, each box could be labeled with the storage server's name, perhaps turned sideways (or perhaps turn the entire graph sideways).
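
For illustration, the ASCII-art version of the bar graph could be generated along these lines. This is only a sketch: `render_server_bar` and its parameters are made-up names, and it does not attempt the finer distinctions (needed vs. happy, full vs. offline, per-server labels) that a real graphical version would show.

```python
def render_server_bar(online: int, total: int, needed: int) -> str:
    """Render a two-line ASCII bar comparing online servers to the number needed.

    Hypothetical helper: one character per storage server.
    """
    if not 0 <= online <= total:
        raise ValueError("online must be between 0 and total")
    if needed < 0:
        raise ValueError("needed must be non-negative")

    # '#' marks an online server, '.' marks one that is not online.
    online_bar = "#" * online + "." * (total - online)

    # A bracket spanning `needed` columns; fall back to '|' for widths under 2.
    if needed >= 2:
        needed_bar = "\\" + "-" * (needed - 2) + "/"
    else:
        needed_bar = "|" * needed

    label_have = "You have this many storage servers online: "
    label_need = "You need this many to upload this file: "
    width = max(len(label_have), len(label_need))  # align both bars
    return (
        f"{label_have.ljust(width)}{online_bar}\n"
        f"{label_need.ljust(width)}{needed_bar}"
    )


if __name__ == "__main__":
    # Same situation as the example above: 3 of 9 servers online, 5 needed.
    print(render_server_bar(online=3, total=9, needed=5))
```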


daira commented

2014-12-11 23:23:06 +00:00

Duplicate of #2101.

tahoe-lafs added the

duplicate

label 2014-12-11 23:23:06 +00:00

daira closed this issue

2014-12-11 23:23:06 +00:00
