xml parse error from S4 service #2116
Reference: tahoe-lafs/trac-2024-07-25#2116
I tried the same command-line again, and this time it worked. I had started my tahoe-lafs gateway a few seconds before I got this xml parse error, so I suspect that this is a case of #719.
I don't think it is a case of #719: the S3 container code in the server got as far as making an S3 request and trying to parse the result, which was not valid XML. #719 refers to the case where the client can't reach enough servers.
Actually the response from S3 seems to have been empty.
Hmm, maybe an empty S3 response should trigger a retry?
Sigh, I wish we could reduce that error report to something more concise -- although I suppose we should be grateful that the actual error was in there somewhere.
Replying to daira:
I guess. :-/
This is what I think of as "snowshoes": having our upper layer perform some redundancy or retrying just to work around what appear to be bugs or unreliability in the lower layer. Ick.
Well, I'm philosophically resistant to that as well, but I have no confidence that we can get the underlying problem with S3 fixed.
Replying to daira:
Yeah, I agree that we should go ahead and add a limited number of exponential-backoff retries for this.
Note that this isn't the first time I've seen an empty response from S3. They are rare, but they've come up before - we just haven't recorded that problem until now.
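For concreteness, here is a minimal sketch of the kind of limited exponential-backoff retry being discussed. The function and parameter names are hypothetical, and the real cloud backend is Twisted-based, so it would schedule the delays via the reactor rather than blocking:

```python
import time
from xml.etree import ElementTree


def get_bucket_with_retries(do_request, max_tries=4, first_delay=1.0):
    """Call do_request() (assumed to return the raw S3 response body as a
    string) and parse it as XML, retrying with exponential backoff when the
    body is empty or cannot be parsed."""
    delay = first_delay
    for attempt in range(1, max_tries + 1):
        body = do_request()
        if body:
            try:
                return ElementTree.fromstring(body)
            except ElementTree.ParseError:
                pass  # malformed XML: treat the same as an empty response
        if attempt == max_tries:
            raise IOError("S3 returned an empty or unparseable response "
                          "%d times in a row" % max_tries)
        time.sleep(delay)   # the real code would use Twisted's deferLater
        delay *= 2          # back off: 1s, 2s, 4s, ...
```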
I just got a similar failure again:
Hm, I seem to get this every time that I run a full `tahoe backup` on my data now. ☹ It seems like the LeastAuthority S4 service is suffering badly from this.

I just got this same error again, but this time on read instead of on upload. Manually retrying a couple of times (by hitting C-r in my web browser) got it to work.
CyberAxe just saw the same error: http://sebsauvage.net/paste/?e301277924eb38dc#6W5WPJkf1HgW3+rdFVvQ/2BqzYIZ0ZwQyPUb08+kY0A=
I'm going to make it retry.
Note that the error occurs in txAWS when it tries to parse an empty response body for a 500 error as XML. The fix might need to be in txAWS as well.
It's a bit difficult to test this, because the tests intentionally do not depend on txAWS (currently they only test the mock cloud backend, not the service container implementations).
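That said, a retry wrapper along the lines of the sketch above can be unit-tested without txAWS at all, by feeding it a stub request function. A hypothetical example (plain unittest, not the actual test suite; it assumes the earlier sketch was saved as retry_sketch.py):

```python
import unittest

# Hypothetical module containing the get_bucket_with_retries sketch above.
from retry_sketch import get_bucket_with_retries


class TestEmptyResponseRetry(unittest.TestCase):
    def test_retries_until_nonempty(self):
        # The first two "responses" are empty; the third is valid XML.
        responses = iter(["", "", "<ListBucketResult/>"])
        root = get_bucket_with_retries(
            lambda: next(responses), max_tries=4, first_delay=0)
        self.assertEqual(root.tag, "ListBucketResult")

    def test_gives_up_eventually(self):
        # Every response is empty, so the wrapper should raise after max_tries.
        self.assertRaises(
            IOError,
            get_bucket_with_retries, lambda: "", 3, 0)


if __name__ == "__main__":
    unittest.main()
```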
Replying to daira:

No, this is a wrong inference. The error occurs in `txaws.s3.client._parse_get_bucket`, which should not be reached if the S3 server responded with a 500 error. (The 500 error in the stack trace is from the Tahoe-LAFS gateway, not S3.) So it looks like S3 returned success with an empty body.

Pasting the definitions from `txaws.util` into a Python interactive session: it looks like only an empty string will give the `xml.etree.ElementTree.ParseError: no element found: line 1, column 0` error.
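For reference, that behaviour is easy to reproduce with nothing but the standard-library parser (this is just `xml.etree.ElementTree`, not the `txaws.util` helpers themselves; tracebacks abridged):

```python
>>> from xml.etree import ElementTree
>>> ElementTree.fromstring("")
Traceback (most recent call last):
  ...
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
>>> ElementTree.fromstring("this is not XML")
Traceback (most recent call last):
  ...
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
```

A non-empty but malformed body produces a "syntax error" message instead, which supports the conclusion that the response body was empty.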
Got this last night. Did Za do the update on my backend? If so, it didn't work.
:: --------------------- START of LOG ------------
:: START Date: 20142103 Time: 21327
:: Backup-Now assuming tahoe: as alias and BackedUp as Target Location in Grid
:: c:\Python27\ must be in system PATH: see tahoe-lafs Windows wiki for help
:: Will Backup everything in folder c:\BackMeUp\
:: v0040
:: --------------------- Start Backup ------------
:: --------------------- Errors Backup -----------
(3:10:26 PM) CyberAxe: Seems like instead of the 500 error now I'm getting it just stuck over and over; maybe I'm not waiting long enough for the 500 error to generate, but I've waited hours. http://sebsauvage.net/paste/?f38e2b33f87e1422#A8iAnduHoydJrE6XejuwYX6CHoRx/iWBxUb+OLx1mvM= I have to terminate the process; here's what it says when I do that.
(3:12:20 PM) CyberAxe: the last line it's stuck on in this example is a create-folder action; when I restart the backup it does use that folder, so it is creating it: re-using old directory for 'c:\BackMeUp\WesternColorado_files\images\WaspWar3_files\9'
(3:12:20 PM) CyberAxe: re-using old directory for 'c:\BackMeUp\WesternColorado_files\images\WaspWar3_files'
(3:12:20 PM) CyberAxe: re-using old directory for 'c:\BackMeUp\WesternColorado_files\images'
(3:12:20 PM) CyberAxe: re-using old directory for 'c:\BackMeUp\WesternColorado_files'
(3:12:20 PM) CyberAxe: re-using old directory for 'c:\BackMeUp' <----- where it got stuck.
Here's another traceback error I got after the backend update to fix the 500 error.
Here's another error after the last one.
I found that my script wasn't showing me it had moved on to a deep-check and was done.
Replying to CyberAxe:
Here's what I'm getting now:
Replying to CyberAxe:
This is a separate bug in the S3 retry logic. Filed as #2206.
Fixed in https://github.com/tahoe-lafs/tahoe-lafs/commit/ea3111d24a97bf9873964383430a9f2e9ff5eb70 on the 1819-cloud-merge branch.