S3 backend: intermittent "We encountered an internal error. Please try again." from S3 #1590
Reference: tahoe-lafs/trac-2024-07-25#1590
Traceback from CLI (`tahoe backup`):

Incident report: https://tahoe-lafs.org/~zooko/incident-2011-11-19--01%3a16%3a49Z-xrkocuy.flog.bz2
Summary: judging from traffic on the AWS forum, 500 and 503 errors from S3 do happen, but they usually indicate a bug or failure on the AWS side rather than a "normal" transient error that should just be ignored. One AWS tech gave a clue when he wrote: "Receiving this error more frequently than 1 in 5000 requests may indicate an error."
Conclusion for Least Authority Enterprises' purposes: we should log as much data as we can about each failure, aggregate the occurrences of these failures to generate statistics and look for patterns, and have monitoring and alerting in place to show us the historical record of these failures and to call us if the failure rate gets worse.
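A minimal sketch of the aggregation idea: bucket each 5xx occurrence by hour and status code so spikes stand out against the expected ~1-in-5000 background rate. This is not Tahoe-LAFS code; `record_s3_failure` and the bucket format are hypothetical, invented here for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

def record_s3_failure(stats, status_code, when=None):
    """Count one S3 5xx failure in an hourly bucket (hypothetical helper).

    `stats` is a Counter keyed by (hour-bucket string, HTTP status code);
    a monitoring job could scan it for buckets whose counts exceed the
    expected rate and page an operator.
    """
    when = when or datetime.now(timezone.utc)
    bucket = when.strftime("%Y-%m-%dT%H:00Z")
    stats[(bucket, status_code)] += 1

stats = Counter()
# Two 500s in the same hour land in one bucket; the 503 gets its own.
record_s3_failure(stats, 500, datetime(2011, 11, 19, 1, 16, tzinfo=timezone.utc))
record_s3_failure(stats, 500, datetime(2011, 11, 19, 1, 45, tzinfo=timezone.utc))
record_s3_failure(stats, 503, datetime(2011, 11, 19, 2, 5, tzinfo=timezone.utc))
```

Hourly buckets are just one choice; any fixed window works, as long as the same windowing is used when comparing against the historical record.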
(In addition to all that, we should probably go ahead and retry the failed request...)
I searched the AWS forums' S3 sub-forum for the following search terms, constrained to the year 2012:
The same searches for the year 2011:
#1589 should improve the logging of these errors.
These errors are transient, at least in some cases on secorp's server. I've updated that server (only) to retry once after each 5xx failure; we'll see how that goes. The original error is logged and still causes an incident in any case.
Assigning to secorp to report how well that retrying change went.
Here's a link to the code that probably fixed this issue:
https://tahoe-lafs.org/trac/tahoe-lafs/changeset/5647/ticket999-S3-backend
We're pretty satisfied with this fix/workaround ([20120510215310-93fa1-a61c9bd9f506b1f2366233e6e338683029c78e94]), which is in production use in the Tahoe-LAFS-on-S3 product of Least Authority Enterprises.