S3 backend: intermittent "We encountered an internal error. Please try again." from S3 #1590
Reference: tahoe-lafs/trac-2024-07-25#1590
Traceback from CLI (`tahoe backup`):

Incident report: https://tahoe-lafs.org/~zooko/incident-2011-11-19--01%3a16%3a49Z-xrkocuy.flog.bz2
Summary: judging from traffic on the AWS forum, 500 and 503 errors from S3 do happen, but they usually indicate a bug or failure on the AWS side rather than a "normal" transient error that should just be ignored. One AWS tech gave a clue when he wrote: "Receiving this error more frequently than 1 in 5000 requests may indicate an error."
Conclusion for Least Authority Enterprises' purposes: we should log as much data as we can about each failure, aggregate the occurrences of these failures to generate statistics and look for patterns, and have monitoring and alerting in place to show us the historical record of these failures and to call us if the failure rate gets worse.
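A minimal sketch of the aggregation idea: bucket each 5xx occurrence by hour and status code so spikes stand out against the expected ~1-in-5000 background rate. This is not Tahoe-LAFS code; `record_s3_failure` and the bucket format are hypothetical, invented here for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

def record_s3_failure(stats, status_code, when=None):
    """Count one S3 5xx failure in an hourly bucket (hypothetical helper).

    `stats` is a Counter keyed by (hour-bucket string, HTTP status code);
    a monitoring job could scan it for buckets whose counts exceed the
    expected rate and page an operator.
    """
    when = when or datetime.now(timezone.utc)
    bucket = when.strftime("%Y-%m-%dT%H:00Z")
    stats[(bucket, status_code)] += 1

stats = Counter()
# Two 500s in the same hour land in one bucket; the 503 gets its own.
record_s3_failure(stats, 500, datetime(2011, 11, 19, 1, 16, tzinfo=timezone.utc))
record_s3_failure(stats, 500, datetime(2011, 11, 19, 1, 45, tzinfo=timezone.utc))
record_s3_failure(stats, 503, datetime(2011, 11, 19, 2, 5, tzinfo=timezone.utc))
```

Hourly buckets are just one choice; any fixed window works, as long as the same windowing is used when comparing against the historical record.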
(In addition to all that, we should probably go ahead and retry the failed request...)
I searched the AWS forums' S3 sub-forum for the following search terms, constrained to the year 2012:
The same searches for the year 2011:
#1589 should improve the logging of these errors.
These errors are transient, at least in some cases on secorp's server. I've updated that server (only) to retry once after each 5xx failure; we'll see how that goes. The original error is logged and still causes an incident in any case.
Assigning to secorp to report how well that retrying change went.
Here's a link to the code that probably fixed this issue:
https://tahoe-lafs.org/trac/tahoe-lafs/changeset/5647/ticket999-S3-backend
We're pretty satisfied with this fix/workaround ([20120510215310-93fa1-a61c9bd9f506b1f2366233e6e338683029c78e94]), which is in production use in the Tahoe-LAFS-on-S3 product of Least Authority Enterprises.