deal with fragile, but disposable, bucket state files #1280

Open
opened 2010-12-20 09:54:30 +00:00 by francois · 11 comments
francois commented 2010-12-20 09:54:30 +00:00
Owner

If a bucket state file can't be loaded and parsed for any reason (usually because it is 0-length, but any other sort of error should also be handled similarly), then just blow it away and start fresh.

------- original post by François below:

After a hard system shutdown due to power failure, a Tahoe node might not be able to start again automatically because the files storage/bucket_counter.state or storage/lease_checker.state are empty.

The easy workaround is to manually delete the empty files before restarting nodes.

find /srv/tahoe/*/storage/bucket_counter.state -size 0 -exec rm {} \;
find /srv/tahoe/*/storage/lease_checker.state -size 0 -exec rm {} \;

Here is what a startup attempt looks like in such a case.

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 614, in run
    runApp(config)
  File "/usr/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 330, in run
    self.application = self.createOrGetApplication()
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 416, in createOrGetApplication
    application = getApplication(self.config, passphrase)
--- <exception caught here> ---
  File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 427, in getApplication
    application = service.loadApplication(filename, style, passphrase)
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 368, in loadApplication
    application = sob.loadValueFromFile(filename, 'application', passphrase)
  File "/usr/lib/python2.5/site-packages/twisted/persisted/sob.py", line 214, in loadValueFromFile
    exec fileObj in d, d
  File "tahoe-client.tac", line 10, in <module>
    c = client.Client()
  File "/opt/tahoe-lafs/src/allmydata/client.py", line 140, in __init__
    self.init_storage()
  File "/opt/tahoe-lafs/src/allmydata/client.py", line 269, in init_storage
    expiration_sharetypes=expiration_sharetypes)
  File "/opt/tahoe-lafs/src/allmydata/storage/server.py", line 97, in __init__
    self.add_bucket_counter()
  File "/opt/tahoe-lafs/src/allmydata/storage/server.py", line 114, in add_bucket_counter
    self.bucket_counter = BucketCountingCrawler(self, statefile)
  File "/opt/tahoe-lafs/src/allmydata/storage/crawler.py", line 449, in __init__
    ShareCrawler.__init__(self, server, statefile)
  File "/opt/tahoe-lafs/src/allmydata/storage/crawler.py", line 86, in __init__
    self.load_state()
  File "/opt/tahoe-lafs/src/allmydata/storage/crawler.py", line 195, in load_state
    state = pickle.load(f)
exceptions.EOFError: 
tahoe-lafs added the
code-nodeadmin
major
defect
1.8.1
labels 2010-12-20 09:54:30 +00:00
tahoe-lafs added this to the 1.9.0 milestone 2010-12-20 09:54:30 +00:00

oops. Yeah, if pickle.load fails at startup, we should probably delete the file so it can be regenerated.

We should probably also investigate using the "write-to-temp-file/close/rename" dance. One reason not to do this is performance (we should measure the disk I/O hit for a big server). We should also think through the atomicity expectations: if the rename fails, will the state of the pickle adequately match the state of the servers? (Probably yes, but it's worth a few more minutes of thought.)
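For concreteness, here is a minimal sketch of that temp-file/close/rename dance; `write_state_atomically` is a hypothetical helper for illustration, not the code Tahoe actually uses:

```python
import os

def write_state_atomically(statefile, data):
    """Write `data` to `statefile` via a temp file plus rename, so a crash
    mid-write leaves either the old contents or the new ones, never a torn file."""
    tmpfile = statefile + ".tmp"
    f = open(tmpfile, "wb")
    try:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # push the bytes to disk before renaming
    finally:
        f.close()
    os.rename(tmpfile, statefile)  # atomic replace on POSIX filesystems
```

On POSIX the final rename either installs the new file or leaves the old one in place, so a crash can leave a stale state file but never a truncated, zero-length one.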

davidsarah commented 2011-01-04 02:14:04 +00:00
Author
Owner

See also #1290 (replace all use of pickles with JSON).

I think the current code uses the "write-to-temp-file&&close&&rename" dance. Zancas was working on a patch to use [FilePath](http://twistedmatrix.com/documents/current/api/twisted.python.filepath.FilePath.html) (#1437) and I saw that his patch simplified the extant "write-to-temp-file&&close&&rename" dance by using FilePath.setContents. I'll look for the details.

[Here]source:trunk/src/allmydata/storage/crawler.py?annotate=blame&rev=4164#L225 is where the current trunk code uses the "write-to-temp-file&&close&&rename" dance. Closing this ticket as fixed. :-) Re-open if you disagree.

zooko added the
fixed
label 2011-08-18 05:47:15 +00:00
zooko closed this issue 2011-08-18 05:47:15 +00:00

Replying to warner:

> We should probably also investigate using the "write-to-temp-file/close/rename" dance.

It appears to me that Brian had already implemented the "write-to-temp-file&&close&&rename" dance [20090219034633-4233b-3abc69ace6ea2f453b1fdbefa754e7ecbc6b4516 about a year before he wrote the comment above]. :-)

> One reason not to do this is performance (we should measure the disk I/O hit for a big server). We should also think through the atomicity expectations: if the rename fails, will the state of the pickle adequately match the state of the servers? (Probably yes, but it's worth a few more minutes of thought.)

This is a crawler state file, so if the rename fails then the previous version of the file is left there, right? So the crawler will re-do whatever work it did since the last successful rename. If there is a persistent problem which causes rename to fail (e.g. the permissions on the old state file and its parent directory forbid you to overwrite it?), then this would be a bad failure mode where the crawler always appears to be doing work but never finishes. (The "Sisyphus" Failure Mode. :-))

Oh, except the failure to do the rename will cause the node to stop with an error message, right? Re-opening this ticket and changing it to say "If the write fails, stop the node with a clear error message." Why stop the node? Because there is no other reliable way to get the operator's attention. Also because there is something screwed up here, and stopping is the safest course to prevent worse failures.
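A rough sketch of the behaviour this re-scoped ticket asks for, assuming a Twisted node; `save_state_or_stop` and the `save_state` callable are made-up names, not the actual Tahoe shutdown path:

```python
from twisted.internet import reactor
from twisted.python import log

def save_state_or_stop(save_state):
    """If persisting crawler state fails, log a clear error and stop the node
    instead of silently retrying forever."""
    try:
        save_state()   # whatever routine does the temp-file/close/rename dance
    except EnvironmentError as e:
        log.err(e, "unable to write crawler state file; stopping the node")
        reactor.stop()
```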

zooko removed the
fixed
label 2011-08-18 16:41:26 +00:00
zooko reopened this issue 2011-08-18 16:41:26 +00:00
zooko changed title from bucket_counter.state and lease_checker.state might get corrupted after hard system shutdown to if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message 2011-08-18 16:41:26 +00:00
davidsarah commented 2011-08-19 02:02:54 +00:00
Author
Owner

Replying to [zooko]comment:6:

> Why stop the node? Because there is no other reliable way to get the operator's attention.

Seems to me that's a bug right there. There should be some other way to get the operator's attention, and if there were, there would be lots of things that could use it without introducing SPoFs by stopping the node. Straw-man suggestion: send (rate-limited) email to an admin address giving a link to the log.
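As a toy illustration of the straw-man only (nothing like this exists in Tahoe; `RateLimitedAlert` and `send_alert` are made-up names):

```python
import time

class RateLimitedAlert(object):
    """Invoke an alert hook (e.g. something that emails an admin address,
    pointing at the log) at most once per `interval` seconds."""
    def __init__(self, send_alert, interval=3600):
        self.send_alert = send_alert   # e.g. a function that mails the operator
        self.interval = interval
        self.last_sent = 0

    def alert(self, message):
        now = time.time()
        if now - self.last_sent >= self.interval:
            self.last_sent = now
            self.send_alert(message)
        # otherwise drop it; the earlier alert already pointed at the log
```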

Replying to [davidsarah]comment:7:

> Replying to [zooko]comment:6:
>
> > Why stop the node? Because there is no other reliable way to get the operator's attention.
>
> Seems to me that's a bug right there.

Well, let's open a ticket for it, but I still think that even if we fix that issue by providing a reliable way to get the operator's attention, that stopping the node is still the right thing to do when we discover something wrong like this, because (a) it increases the likelihood of getting the operator's attention, (b) it reduces the chance of damage or exploitation, and (c) it helps the operator with post-mortem diagnostics.

> There should be some other way to get the operator's attention, and if there were, there would be lots of things that could use it without introducing SPoFs by stopping the node.

What do you mean *S*PoFs?

> Straw-man suggestion: send (rate-limited) email to an admin address giving a link to the log.

I guess the [Incident Gatherer]source:trunk/docs/logging.rst#incident-gatherer is perfect for this.

tahoe-lafs modified the milestone from 1.9.0 to 1.10.0 2011-10-13 19:53:12 +00:00
tahoe-lafs added
normal
and removed
major
labels 2012-04-01 05:02:29 +00:00
zooko changed title from if bucket_counter.state or lease_checker.state can't be written, stop the node with an error message to deal with fragile, but disposable, bucket state files 2012-10-16 02:36:28 +00:00
daira commented 2013-07-09 14:25:47 +00:00
Author
Owner

I suspect that changeset [3cb99364]changeset:3cb99364e6a83d0064d2838a0c470278903e19ac/trunk effectively fixed this ticket. (It doesn't delete a corrupted state file, but it does use default values, and the corrupted file will be overwritten on the next crawler pass.) Please close this ticket as fixed in milestone 1.9.2 if you agree.

See also #1290.

Replying to daira:

> I suspect that changeset [3cb99364]changeset:3cb99364e6a83d0064d2838a0c470278903e19ac/trunk effectively fixed this ticket. (It doesn't delete a corrupted state file, but it does use default values, and the corrupted file will be overwritten on the next crawler pass.) Please close this ticket as fixed in milestone 1.9.2 if you agree.

I agree, so I'll close this ticket, but there's something that I don't like about [3cb99364]changeset:3cb99364e6a83d0064d2838a0c470278903e19ac/trunk, which is that it catches arbitrary Exception instead of a set of exceptions that we think it should handle. Catching general Exception has caused problems in the past, such as when the exception being caught is due to a shallow bug in the code (like an AttributeError, an out-of-memory exception, etc.). How about if we change it from catching Exception to catching (EnvironmentError, EOFError, KeyError)?
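For illustration, that proposal might look roughly like this; `statefile` and `default_state` are placeholders, not the real crawler attributes:

```python
import pickle

def load_state(statefile, default_state):
    """Catch only the failures a missing or corrupt state file is expected to
    produce, so unrelated bugs still raise loudly."""
    try:
        f = open(statefile, "rb")
        try:
            return pickle.load(f)
        finally:
            f.close()
    except (EnvironmentError, EOFError, KeyError):
        return default_state()
```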

daira commented 2013-07-14 16:14:44 +00:00
Author
Owner

Short of reading the stdlib source code, I don't know the set of exceptions that pickle.load can raise. I'd be open to changing it so that the try ... except Exception is only around the pickle.load call, though. Perhaps something like:

        state = {"version": 1,
                 "last-cycle-finished": None,
                 "current-cycle": None,
                 "last-complete-prefix": None,
                 "last-complete-bucket": None,
                 }
        try:
            f = open(self.statefile, "rb")
        except EnvironmentError:
            pass
        else:
            try:
                state = pickle.load(f)
            except Exception:
                pass
            finally:
                f.close()

#1834 would fix this.

Reference: tahoe-lafs/trac-2024-07-25#1280