deal with fragile, but disposable, bucket state files #1280
If a bucket state file can't be loaded and parsed for any reason (usually because it is 0-length, but any other sort of error should also be handled similarly), then just blow it away and start fresh.
------- original post by François below:
After a hard system shutdown due to a power failure, a Tahoe node might not be able to start again automatically because the files storage/bucket_counter.state or storage/lease_checker.state are empty.
The easy workaround is to manually delete the empty files before restarting the node.
Here is what a startup attempt looks like in such a case.
oops. Yeah, if pickle.load fails at startup, we should probably delete the file so it can be regenerated. We should probably also investigate using the "write-to-temp-file/close/rename" dance. One reason not to do this is performance (we should measure the disk-I/O hit for a big server). We should also think through the atomicity expectations: if the rename fails, will the state of the pickle adequately match the state of the servers? (Probably yes, but it's worth a few more minutes of thought.)
See also #1290 (replace all use of pickles with JSON).
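A minimal sketch of that first suggestion (if pickle.load fails, discard the file and start fresh so it gets regenerated); load_state, statefile, and default_state are illustrative names, not the actual crawler API:

    import os
    import pickle

    def load_state(statefile, default_state):
        """Load the pickled crawler state, discarding the file if it is
        empty or unparseable so a fresh one is written on the next pass."""
        try:
            with open(statefile, "rb") as f:
                return pickle.load(f)
        except (EnvironmentError, EOFError, pickle.UnpicklingError):
            # 0-length or corrupted file (e.g. after a power failure):
            # blow it away and start from defaults.
            try:
                os.remove(statefile)
            except OSError:
                pass
            return dict(default_state)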
I think the current code uses the "write-to-temp-file&&close&&rename" dance. Zancas was working on a patch to use FilePath (#1437) and I saw that his patch simplified the extant "write-to-temp-file&&close&&rename" dance by using FilePath.setContents. I'll look for the details.
[Here]source:trunk/src/allmydata/storage/crawler.py?annotate=blame&rev=4164#L225 is where the current trunk code uses the "write-to-temp-file&&close&&rename" dance. Closing this ticket as fixed. :-) Re-open if you disagree.
Replying to warner:
It appears to me that Brian had already implemented the "write-to-temp-file&&close&&rename" dance [20090219034633-4233b-3abc69ace6ea2f453b1fdbefa754e7ecbc6b4516 about a year before he wrote the comment above]. :-)
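For reference, a rough sketch of the "write-to-temp-file&&close&&rename" dance being discussed, plus the FilePath.setContents shortcut mentioned above; this is an illustration with made-up names, not the crawler's actual save path:

    import os
    import pickle

    def save_state(statefile, state):
        # Write the pickle to a sibling temp file, close it, then rename it
        # over the real file, so a reader sees either the old state or the
        # new state, never a partially written pickle.
        tmpfile = statefile + ".tmp"
        with open(tmpfile, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmpfile, statefile)  # atomic rename (os.rename on older Pythons)

    # The FilePath.setContents simplification mentioned above wraps up the
    # same dance:
    #   from twisted.python.filepath import FilePath
    #   FilePath(statefile).setContents(pickle.dumps(state))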
This is a crawler state file, so if the rename fails then the previous version of the file is left there, right? So the crawler will re-do whatever work it did since the last successful rename. If there is a persistent problem which causes rename to fail (e.g. the permissions on the old state file and its parent directory forbid you to overwrite it?), then this would be a bad failure mode where the crawler always appears to be doing work but never finishes. (The "Sisyphus" Failure Mode. :-))
Oh, except the failure to do the rename will cause the node to stop with an error message, right? Re-opening this ticket and changing it to say "If the write fails, stop the node with a clear error message.". Why stop the node? Because there is no other reliable way to get the operator's attention. Also because there is something screwed-up here, and stopping is the safest course to prevent worse failures.
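A hedged sketch of what "stop the node with a clear error message" could look like, assuming a Twisted reactor is running; the helper name and message are illustrative:

    import os
    import pickle

    from twisted.internet import reactor
    from twisted.python import log

    def save_state_or_stop(statefile, state):
        tmpfile = statefile + ".tmp"
        try:
            with open(tmpfile, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmpfile, statefile)
        except EnvironmentError:
            # Something is persistently wrong (permissions, full disk, ...):
            # get the operator's attention by logging and stopping the node.
            log.err(None, "unable to write %s; stopping the node" % statefile)
            reactor.stop()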
Changed title from "bucket_counter.state and lease_checker.state might get corrupted after hard system shutdown" to "if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message".
Replying to [zooko]comment:6:
Seems to me that's a bug right there. There should be some other way to get the operator's attention, and if there were, there would be lots of things that could use it without introducing SPoFs by stopping the node. Straw-man suggestion: send (rate-limited) email to an admin address giving a link to the log.
Replying to [davidsarah]comment:7:
Well, let's open a ticket for it, but I still think that even if we fix that issue by providing a reliable way to get the operator's attention, stopping the node is still the right thing to do when we discover something wrong like this, because (a) it increases the likelihood of getting the operator's attention, (b) it reduces the chance of damage or exploitation, and (c) it helps the operator with post-mortem diagnostics.
What do you mean by SPoFs?
I guess the [Incident Gatherer]source:trunk/docs/logging.rst#incident-gatherer is perfect for this.
Changed title from "if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message" to "deal with fragile, but disposable, bucket state files".
I suspect that changeset [3cb99364]changeset:3cb99364e6a83d0064d2838a0c470278903e19ac/trunk effectively fixed this ticket. (It doesn't delete a corrupted state file, but it does use default values, and the corrupted file will be overwritten on the next crawler pass.) Please close this ticket as fixed in milestone 1.9.2 if you agree.
See also #1290.
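A hedged sketch of the behaviour described for that changeset (not its actual diff): on any load failure, keep default values and let the next crawler pass overwrite the bad file; note the broad except Exception, which the next comment objects to. load_state and default_state are illustrative names:

    import pickle

    def load_state(statefile, default_state):
        state = dict(default_state)
        try:
            with open(statefile, "rb") as f:
                state.update(pickle.load(f))
        except Exception:
            # Missing, empty, or corrupted state file: keep the defaults.
            # Nothing is deleted; the next save of the crawler state simply
            # overwrites the bad file.
            pass
        return state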
Replying to daira:
I agree, so I'll close this ticket, but there's something that I don't like about [3cb99364]changeset:3cb99364e6a83d0064d2838a0c470278903e19ac/trunk, which is that it catches arbitrary Exception instead of a set of exceptions that we think it should handle. Catching of general Exception has caused problems in the past, such as when the exception being caught is due to a shallow bug in the code (like an AttributeError, an out-of-memory exception, etc.). How about if we change it from catching Exception to catching (EnvironmentError, EOFError, [KeyError](wiki/KeyError))?
Short of reading the stdlib source code, I don't know the set of exceptions that pickle.load can raise. I'd be open to changing it so that the try ... except Exception is only around the pickle.load call, though. Perhaps something like the sketch after this comment.
#1834 would fix this.
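A hedged reconstruction of the kind of change being proposed here, with illustrative names: the broad except wraps only the pickle.load call, while errors opening the file are handled by the narrower EnvironmentError:

    import pickle

    def load_state(statefile, default_state):
        state = dict(default_state)
        try:
            f = open(statefile, "rb")
        except EnvironmentError:
            return state                 # no readable state file: use defaults
        with f:
            try:
                saved = pickle.load(f)   # only the unpickling is guarded broadly
            except Exception:
                return state             # unparseable file: fall back to defaults
        state.update(saved)
        return state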