server-side crawlers: tolerate corrupted shares, verify shares #812
Reference: tahoe-lafs/trac-2024-07-25#812
From twistd.log on prodtahoe17 data6:
No incident logs.
Here is the mutable share in question, attached.
Attachment 5 (2942 bytes) added: /Users/wonwinmcbrootles/5
hm, the share appears to be truncated. I looked at sh7 (in /data5 on that same box), which appears to be intact. Both are for seqnum 6, and both appear to match up until the truncation point (which occurs in the middle of the encrypted private key).
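For reference, a minimal sketch (not from the ticket; the paths passed in are up to the operator) of the kind of byte-by-byte comparison used above to confirm that the damaged share is just a truncated prefix of an intact copy:

```python
def find_divergence(path_a, path_b, chunk=4096):
    """Return the first byte offset at which the two files differ, or the
    length of the shorter file if one is a truncated prefix of the other;
    return None if they are identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if not a and not b:
                return None          # identical all the way to EOF
            n = min(len(a), len(b))
            for i in range(n):
                if a[i] != b[i]:
                    return offset + i
            if len(a) != len(b):
                return offset + n    # one file simply stops here
            offset += n

# e.g. find_divergence(damaged_share_path, intact_share_path)
# The damaged copy attached here is 0xb7e (2942) bytes long.
```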
These shares are supposed to be written atomically, with a write vector of length one (consisting of the entire share) that is written in a single Python f.write call. Perhaps the write got interrupted (node reboot or system crash) in such a bad way that only part of the data was written out to disk? The truncation point isn't on a particularly round boundary (the file is 0xb7e bytes long), so it doesn't feel like a disk block size or anything like that. Weird.
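As an illustration only (this is not Tahoe's storage code, and both function names are made up): a single f.write() is not crash-atomic, so an interruption can leave a truncated prefix on disk; one common mitigation is to write a temporary file and rename it into place:

```python
import os
import tempfile

def write_share_naive(path, data):
    # If the process or machine dies partway through this write, only a
    # prefix of `data` may reach the disk -- a truncated share like the
    # one attached to this ticket.
    with open(path, "wb") as f:
        f.write(data)

def write_share_atomic(path, data):
    # Crash-safer variant: write a temp file in the same directory, fsync
    # it, then rename over the old file.  A rename within one filesystem
    # is atomic, so readers see either the old share or the new one.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

The naive variant matches the failure mode described above; the atomic variant is only one possible mitigation, not something this ticket proposes.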
I suppose the important part is to recover gracefully from it. I believe the share-crawler should keep going after the error; that'd be the first thing to verify.
I guess the second step would be to build a validating share crawler, and/or have some code in the lease-checking share crawler which would respond to problems like this by moving the corrupt share into a junkpile and logging the issue somewhere.
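A rough sketch of that idea, with assumed names and paths (this is not the actual crawler code): catch the parse failure per share, move the share into a quarantine directory, log it, and keep crawling:

```python
import logging
import os
import shutil

QUARANTINE_DIR = "/storage/quarantined-shares"   # hypothetical location
log = logging.getLogger(__name__)                # Tahoe would use its own logging

def check_share(path):
    # Stand-in for real share parsing: here we only insist the file is
    # non-empty; a real verifier would parse the header and lease area.
    if os.path.getsize(path) == 0:
        raise ValueError("zero-length share file")

def crawl_bucket(sharedir):
    for fn in sorted(os.listdir(sharedir)):
        path = os.path.join(sharedir, fn)
        try:
            check_share(path)
        except Exception as e:
            log.warning("corrupt share %s (%s): quarantining", path, e)
            if not os.path.isdir(QUARANTINE_DIR):
                os.makedirs(QUARANTINE_DIR)
            shutil.move(path, QUARANTINE_DIR)
            continue                 # the crawler keeps going
```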
Let's make this ticket be about recovering from this sort of corruption:
In the meantime, I assume the workaround is to rm that file, right?
Changed title from "exception from attempt to parse leases" to "handle corrupted lease files".

Sounds good. To be specific, this is unrelated to leases; it's just that the lease-expiring crawler is what first noticed the corruption. So this ticket is about:
And yeah, just rm the file, it's useless to anyone. The next time that directory is modified, a new copy of that share will be created.
Changed title from "handle corrupted lease files" to "server-side crawlers: tolerate corrupted shares, verify shares".

#1834 would remove the lease-checking and bucket-counting crawlers, making this ticket irrelevant. However, we might then want to invent a share-verifying crawler, just for the purpose of looking for corrupted shares, which would make this ticket relevant again.
My consulting client (codename "WAG") (see comment:6:ticket:1278) has corrupt shares. This message is in the storage server's twistd.log:
And this message is in an incident report file generated by the node (which is both the storage server and the gateway):
This is with this version of Tahoe-LAFS:
Okay, I've been looking into this, and I see that this kind of corruption is in fact handled: it is logged, accounted for in a count called corrupt-shares, and the corrupted share is skipped. See expirer.py (source:trunk/src/allmydata/storage/expirer.py?annotate=blame&rev=d5651a0d0eebdc144db53425ee461e186319e5fd#L127). The only reason we've thought it was not being handled all these years is that it calls twisted.python.log.err(), which emits a string to the twistd.log that says "Unhandled Error", plus a stack trace. So I propose that we just remove that call to twisted.python.log.err() and add a unit test which requires the code under test to detect and skip over corrupted shares without emitting this error log.

Wait, why do we ever log directly to the Twisted log? Let's remove all the places that do that.
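A self-contained sketch of the proposed test, assuming we only want to pin down the observable behaviour (corrupt share counted and skipped, nothing sent to log.err()); scan_shares below is a stand-in for the real crawler pass, not Tahoe's code:

```python
import os
import struct
from twisted.trial import unittest

def scan_shares(sharedir):
    """Stand-in for one crawler pass: count shares whose (fake, 8-byte)
    header cannot be read, skip them, and never call log.err()."""
    corrupt = 0
    for fn in sorted(os.listdir(sharedir)):
        with open(os.path.join(sharedir, fn), "rb") as f:
            header = f.read(8)
        if len(header) < 8:
            corrupt += 1      # note it and move on
            continue
        struct.unpack(">LL", header)
    return corrupt

class CorruptShareSkipped(unittest.TestCase):
    def test_truncated_share_is_counted_not_logged(self):
        sharedir = self.mktemp()
        os.makedirs(sharedir)
        with open(os.path.join(sharedir, "0"), "wb") as f:
            f.write(struct.pack(">LL", 1, 0) + b"rest of a healthy share")
        with open(os.path.join(sharedir, "1"), "wb") as f:
            f.write(b"\x00\x00\x00")      # truncated mid-header
        self.assertEqual(scan_shares(sharedir), 1)
        # trial fails any test that leaves unflushed log.err() calls, so an
        # empty flush list here proves nothing was sent to the error log:
        self.assertEqual(self.flushLoggedErrors(), [])
```

The real test would drive the lease-checking crawler against a storage-server fixture instead of scan_shares, but the flushLoggedErrors() assertion is the part that captures "without emitting this error log".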
Filed #2343 (remove all direct logging to twisted.python.log/err).