improve error messages from failed uploads #2101

Open
opened 2013-11-12 01:45:59 +00:00 by zooko · 10 comments

The error message when an upload fails is a "wall of text". It is hard to read. It looks like this:

[instance: Traceback: <class 'allmydata.interfaces.UploadUnhappinessError'>: server selection failed for <Tahoe2ServerSelector for upload dglev>: shares could be placed or found on only 0 server(s). We were asked to place shares on at least 4 server(s) such that any 3 of them have enough shares to recover the file. (placed 0 shares out of 10 total (10 homeless), want to place shares on at least 4 servers such that any 3 of them have enough shares to recover the file, sent 5 queries to 5 servers, 5 queries asked about existing shares (of which 0 failed due to an error), 0 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

Daira pointed out that even though the current error message is too long, it still lacks the most important information for responding to the error: the identities of which servers failed and which succeeded.

A possible improvement to this would be to return a data structure instead of a string, similar to the [source:trunk/src/allmydata/check_results.py?annotate=blame&rev=188c7fecf5d2e62d8dcfbff0791fe1125def971b#L7 CheckResults] Failure.

There is probably a related data structure already being produced and displayed over on the "Recent Uploads and Downloads" page for the failed upload.
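A structured result along those lines might look like the following. This is a hypothetical sketch loosely modeled on CheckResults; the names (`UploadFailureReport`, `ServerAttempt`) are illustrative, not existing Tahoe-LAFS API.

```python
# Hypothetical sketch of a structured upload-failure result, loosely modeled
# on CheckResults. All names here are illustrative, not real Tahoe-LAFS API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServerAttempt:
    server_id: str                 # e.g. base32 node id
    nickname: str                  # human-readable server nickname
    shares_placed: List[int] = field(default_factory=list)
    error: Optional[str] = None    # set if the query to this server failed

@dataclass
class UploadFailureReport:
    happiness_required: int        # the "servers of happiness" parameter
    total_shares: int
    attempts: List[ServerAttempt] = field(default_factory=list)

    def summary(self) -> str:
        # A renderer can use the structured fields directly; this string
        # form names the failed servers, which the current message omits.
        ok = [a for a in self.attempts if a.error is None]
        failed = [a for a in self.attempts if a.error is not None]
        return ("placed shares on %d server(s), required %d; failed servers: %s"
                % (len(ok), self.happiness_required,
                   ", ".join("%s (%s)" % (a.nickname, a.error)
                             for a in failed) or "none"))
```

A web page or CLI could then render the same report object however suits the medium, instead of parsing a flat string.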

zooko added the unknown, normal, defect, 1.10.0 labels 2013-11-12 01:45:59 +00:00
zooko added this to the undecided milestone 2013-11-12 01:45:59 +00:00
Author

Mark pointed out that this is a bad error message, too:

    def _get_progress_message(self):
        if not self.homeless_shares:
            # XXX "placed" might actually mean "found out some or all of them were already there"
            msg = "placed all %d shares, " % (self.total_shares)
Author

related tickets: #1596, #1116

tahoe-lafs added the code-peerselection label and removed the unknown label 2013-11-14 23:53:15 +00:00
tahoe-lafs modified the milestone from undecided to 1.12.0 2013-11-14 23:53:15 +00:00
Author

related tickets: #2130 #1821

Author

The best solution would be for it to just show, visually, a complete map of which shares went to which servers, and which shares failed when it attempted to send them to which servers, and how each one failed.

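Such a map could be rendered as simple text, one line per server. A hypothetical sketch (the input format is invented for illustration):

```python
# Hypothetical sketch of the suggested share map: for each server, show
# which shares went there, or how the attempt failed. The placements
# input format is invented for illustration.
def render_share_map(placements):
    """placements: list of (server_nickname, placed_share_numbers, error_or_None)"""
    lines = []
    for nickname, shares, error in placements:
        if error is not None:
            lines.append("%-10s FAILED (%s)" % (nickname, error))
        elif shares:
            lines.append("%-10s shares %s"
                         % (nickname, ", ".join(map(str, sorted(shares)))))
        else:
            lines.append("%-10s no shares placed" % nickname)
    return "\n".join(lines)
```

A visual (HTML table) rendering could be built from the same per-server tuples.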
Author

User complaining about this on the mailing list: https://tahoe-lafs.org/pipermail/tahoe-dev/2014-December/009279.html

daira commented 2014-12-11 23:24:52 +00:00
Owner

#1941 was a duplicate. Its description was:

> I heard that the volunteergrid2 project has shut down. The participants, in explaining why they gave up on it, said that they often got "unhappiness errors" when they tried to upload files, so therefore they never trusted the grid with their backups.
>
> There are two problems here that this ticket attempts to address:
>
> 1. They didn't trust the grid. Why? Not because the upload failed, but because they didn't know why the upload had failed. They interpreted this as evidence that Tahoe-LAFS was buggy or unreliable. If they had seen a clear, understandable explanation that said "This upload failed because you specified you required at least 15 servers, and of the 20 servers on your grid, 10 of them are currently unreachable.", then they would have continued to trust the Tahoe-LAFS software and they would have known what changes to make (to their grid or their happiness parameter) to get what they wanted. (Note that information was actually already in those "unhappiness errors", but they didn't read or understand it. See below.)
>
> 2. We (the tahoe-lafs developers) don't know why their uploads failed. Perhaps Tahoe-LAFS was harboring some previously-unknown bug. Perhaps too many of their servers were on flaky home DSL that timed out most requests. Perhaps it was something else. We can't improve the software without a working feedback loop whereby we can learn the details of failures.
>
> This ticket is to make it so that when an upload fails, you can read an understandable story of what happened that led to the failure, specifying which servers your client tried to use and what each server did.
>
> Note that the basic information of how many servers were reachable, etc., is encoded into the error message that users currently see, but users do not read that error message, because it contains a Python traceback, so they just gloss over it. So this ticket is to make two changes to that:
>
> 1. Add more information. Not just the number of servers that failed, but which specific servers (identifiers, nicknames, IP addresses) and when.
>
> 2. Make it a human-oriented HTML page, not a Python traceback. Most users will not read anything that contains a Python traceback.
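The kind of explanation quoted in that description could be generated mechanically from numbers the uploader already has. A hypothetical sketch (the function and parameter names are illustrative, not Tahoe-LAFS internals):

```python
# Hypothetical sketch of the plain-language explanation proposed above.
# Names (explain_unhappiness, happy, total_servers, reachable_servers)
# are invented for illustration.
def explain_unhappiness(happy, total_servers, reachable_servers):
    if reachable_servers < happy:
        return ("This upload failed because you required at least %d servers, "
                "and of the %d servers on your grid, %d are currently "
                "unreachable."
                % (happy, total_servers, total_servers - reachable_servers))
    return "Enough servers were reachable; the failure had another cause."
```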


Milestone renamed

warner modified the milestone from 1.12.0 to 1.13.0 2016-03-22 05:02:25 +00:00

renaming milestone

warner modified the milestone from 1.13.0 to 1.14.0 2016-06-28 18:17:14 +00:00

Moving open issues out of closed milestones.

exarkun modified the milestone from 1.14.0 to 1.15.0 2020-06-30 14:45:13 +00:00
Owner

Ticket retargeted after milestone closed

meejah modified the milestone from 1.15.0 to soon 2021-03-30 18:40:19 +00:00
Reference: tahoe-lafs/trac-2024-07-25#2101