'tahoe deep-check --repair --add-lease' triggers an exception in the client #889

New Issue

tahoe-lafs · 2010-01-09T23:51:28Z

francois commented

2010-01-09 23:51:28 +00:00

After an upgrade to the latest trunk and a new backup using immutable dirnodes, the daily cronjob which verifies that my data is safely stored logged the following error in twistd.log.

2010-01-10 00:08:27+0100 [-] Unhandled error in Deferred:
2010-01-10 00:08:27+0100 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn
            cb(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail
            self.deferred.errback(why)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed
            level=log.WEIRD, umid="hEGuQg")
          File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 24, in err
            tw_log.err(*args, **kwargs)
          File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 153, in err
            _stuff = failure.Failure()
          File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__
            raise NoCurrentExceptionError()
        twisted.python.failure.NoCurrentExceptionError:

checker.py near line 13 looks like that:

        log.err(format="local error in add_lease to [%(peerid)s]: %(f_value)s",
                peerid=idlib.shortnodeid_b2a(peerid),
                f_value=str(f.value),
                failure=f,
                level=log.WEIRD, umid="hEGuQg")

And no accident file was created.

After an upgrade to the latest trunk and a new backup using immutable dirnodes, the daily cronjob which verifies that my data is safely stored logged the following error in twistd.log. ``` 2010-01-10 00:08:27+0100 [-] Unhandled error in Deferred: 2010-01-10 00:08:27+0100 [-] Unhandled Error Traceback (most recent call last): File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn cb(*args, **kwargs) File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail self.deferred.errback(why) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback self._startRunCallbacks(fail) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed level=log.WEIRD, umid="hEGuQg") File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 24, in err tw_log.err(*args, **kwargs) File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 153, in err _stuff = failure.Failure() File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__ raise NoCurrentExceptionError() twisted.python.failure.NoCurrentExceptionError: ``` checker.py near line 13 looks like that: ``` log.err(format="local error in add_lease to [%(peerid)s]: %(f_value)s", peerid=idlib.shortnodeid_b2a(peerid), f_value=str(f.value), failure=f, level=log.WEIRD, umid="hEGuQg") ``` And no accident file was created.

tahoe-lafs added the

labels 2010-01-09 23:51:28 +00:00

tahoe-lafs added this to the 1.6.0 milestone 2010-01-09 23:51:28 +00:00

francois commented

2010-01-10 02:09:01 +00:00

(01:06:02 AM) zooko_afk: francois: i'm looking at util/log.py
(01:06:20 AM) zooko_afk: it calls twisted's log.err just to force any unit test that might be running to fail
(01:06:37 AM) zooko_afk: but twisted's log.err expects there to be an exception in existence
(01:06:43 AM) zooko_afk: which in your case there isn't
(01:06:55 AM) zooko_afk: I suggestr that...  we either remove that 
(01:07:13 AM) zooko_afk: or catch this error frpom twoisted and ignore it (ugh)
(01:07:38 AM) zooko_afk: I suggest the former, as a patch for tahoe-lafs v1.6, to be reviewed by brian.
(01:07:41 AM) zooko_afk: want to test it?
(01:07:54 AM) zooko_afk: thos will also reveal the original exception that you found.
(01:08:09 AM) zooko_afk: gotta go do something w irby for a while...
(01:08:39 AM) francois: it won't breaks things if I completely remove it?
(01:08:46 AM) francois: I'll give it a try
(01:44:54 AM) zooko_afk: It looks to me like the tw_log.err() line is unnecessary and is the cause of the problem where the original error is not reported.

``` (01:06:02 AM) zooko_afk: francois: i'm looking at util/log.py (01:06:20 AM) zooko_afk: it calls twisted's log.err just to force any unit test that might be running to fail (01:06:37 AM) zooko_afk: but twisted's log.err expects there to be an exception in existence (01:06:43 AM) zooko_afk: which in your case there isn't (01:06:55 AM) zooko_afk: I suggestr that... we either remove that (01:07:13 AM) zooko_afk: or catch this error frpom twoisted and ignore it (ugh) (01:07:38 AM) zooko_afk: I suggest the former, as a patch for tahoe-lafs v1.6, to be reviewed by brian. (01:07:41 AM) zooko_afk: want to test it? (01:07:54 AM) zooko_afk: thos will also reveal the original exception that you found. (01:08:09 AM) zooko_afk: gotta go do something w irby for a while... (01:08:39 AM) francois: it won't breaks things if I completely remove it? (01:08:46 AM) francois: I'll give it a try (01:44:54 AM) zooko_afk: It looks to me like the tw_log.err() line is unnecessary and is the cause of the problem where the original error is not reported. ```

francois commented

2010-01-10 02:26:59 +00:00

The incriminated lines comes from this patch.

Sat Sep 20 19:34:27 CEST 2008  warner@lothar.com          
  * util.log: send log.err to Twisted too, so that Trial tests are flunked
diff -rN old-tahoe-upstream/src/allmydata/util/log.py new-tahoe-upstream/src/allmydata/util/log.py
2a3
> from twisted.python import log as tw_log
15a17,21
> # If log.err() happens during a unit test, the unit test should fail. We
> # accomplish this by sending it to twisted.log too. When a WEIRD/SCARY/BAD
> # thing happens that is nevertheless handled, use log.msg(failure=f,
> # level=WEIRD) instead.
> 
17c23
<     # this should probably be in foolscap
---
>     tw_log.err(*args, **kwargs)

Brian, what do you think? I'm not sure whether this patch should be rollback.

The incriminated lines comes from this patch. ``` Sat Sep 20 19:34:27 CEST 2008 warner@lothar.com * util.log: send log.err to Twisted too, so that Trial tests are flunked diff -rN old-tahoe-upstream/src/allmydata/util/log.py new-tahoe-upstream/src/allmydata/util/log.py 2a3 > from twisted.python import log as tw_log 15a17,21 > # If log.err() happens during a unit test, the unit test should fail. We > # accomplish this by sending it to twisted.log too. When a WEIRD/SCARY/BAD > # thing happens that is nevertheless handled, use log.msg(failure=f, > # level=WEIRD) instead. > 17c23 < # this should probably be in foolscap --- > tw_log.err(*args, **kwargs) ``` Brian, what do you think? I'm not sure whether this patch should be rollback.

zooko commented

2010-01-10 05:58:16 +00:00

Looks critical to me, because it looks like we can't add leases to immutable directories (guessing that this is the underlying WEIRD that didn't get logged). Hopefully François will report more after removing the tw_log.err() feature.

Looks critical to me, because it looks like we can't add leases to immutable directories (guessing that this is the underlying WEIRD that didn't get logged). Hopefully François will report more after removing the `tw_log.err()` feature.

zooko added

critical

and removed

major

labels 2010-01-10 05:58:16 +00:00

zooko commented

2010-01-10 05:59:13 +00:00

François: don't wait for Brian to look at the tw_log.err() thing -- investigating the underlying error is more urgent. Also I hereby take responsibility if Brian later says that removing that line was totally wrong. :-)

François: don't wait for Brian to look at the `tw_log.err()` thing -- investigating the underlying error is more urgent. Also I hereby take responsibility if Brian later says that removing that line was totally wrong. :-)

warner commented

2010-01-10 08:03:22 +00:00

Huh, I didn't know that twisted.python.log was so picky about tw_log.err being called outside of an exception context. Looks like this has been around since twisted-2.4.0, though.

So, we want the tw_log.err to be there, but we also want to make sure that an exception in it won't cause the subsequent foolscap log.err to get bypassed (and worse, any other code following the call to allmydata.util.log.err will also be bypassed, although usually log.err lives at the end of the callback chain so there is rarely much code sitting after it). log.msg and log.err must never raise exceptions.

It's probably still a good idea to move this functionality into Foolscap. But for now, I'd do two things:

have util.log.err catch-and-ignore any exceptions raised by tw_log.err, or maybe just this NoCurrentExceptionError
have util.log.err pull any failure= argument out of its kwargs and pass it as the first positional argument to tw_log.err . The root problem here is that the _add_lease_failed function is an errback, so the relevant exception has already been captured and bundled up into a Failure instance, but tw_log.err isn't being told about it properly. Foolscap's log.err looks for a failure in the failure= argument, but tw_log.err wants it to be the first positional argument. If it doesn't find one there, then it tries to create a new Failure, which explodes when you're in an errback and not in a synchronous exception handler.

I agree with Zooko that commenting out the tw_log.err and reproducing the problem is a good direction to go. I'd like to know what the local error was.

Huh, I didn't know that twisted.python.log was so picky about tw_log.err being called outside of an exception context. Looks like this has been around since twisted-2.4.0, though. So, we want the tw_log.err to be there, but we also want to make sure that an exception in it won't cause the subsequent foolscap log.err to get bypassed (and worse, any other code following the call to allmydata.util.log.err will also be bypassed, although usually log.err lives at the end of the callback chain so there is rarely much code sitting after it). log.msg and log.err must never raise exceptions. It's probably still a good idea to move this functionality into Foolscap. But for now, I'd do two things: * have util.log.err catch-and-ignore any exceptions raised by tw_log.err, or maybe just this NoCurrentExceptionError * have util.log.err pull any `failure=` argument out of its kwargs and pass it as the first positional argument to tw_log.err . The root problem here is that the `_add_lease_failed` function is an errback, so the relevant exception has already been captured and bundled up into a Failure instance, but tw_log.err isn't being told about it properly. Foolscap's log.err looks for a failure in the `failure=` argument, but tw_log.err wants it to be the first positional argument. If it doesn't find one there, then it tries to create a new Failure, which explodes when you're in an errback and not in a synchronous exception handler. I agree with Zooko that commenting out the tw_log.err and reproducing the problem is a good direction to go. I'd like to know what the local error was.

francois commented

2010-01-10 14:51:37 +00:00

Ok, after commenting tw_log.err(), the same exception occurs, but in foolscap this time. I'll try to add some code to dump more informations but it's gonna be slow because it takes about 30 minutes before the problem occurs.

2010-01-10 14:38:24+0100 [-] Unhandled error in Deferred:
2010-01-10 14:38:24+0100 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn
            cb(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail
            self.deferred.errback(why)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed
            level=log.WEIRD, umid="hEGuQg")
          File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 27, in err
            return log.err(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/logging/log.py", line 218, in err
            _stuff = failure.Failure()
          File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__
            raise NoCurrentExceptionError()
        twisted.python.failure.NoCurrentExceptionError:

Ok, after commenting `tw_log.err()`, the same exception occurs, but in foolscap this time. I'll try to add some code to dump more informations but it's gonna be slow because it takes about 30 minutes before the problem occurs. ``` 2010-01-10 14:38:24+0100 [-] Unhandled error in Deferred: 2010-01-10 14:38:24+0100 [-] Unhandled Error Traceback (most recent call last): File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn cb(*args, **kwargs) File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail self.deferred.errback(why) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback self._startRunCallbacks(fail) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed level=log.WEIRD, umid="hEGuQg") File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 27, in err return log.err(*args, **kwargs) File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/logging/log.py", line 218, in err _stuff = failure.Failure() File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__ raise NoCurrentExceptionError() twisted.python.failure.NoCurrentExceptionError: ```

francois commented

2010-01-10 20:58:30 +00:00

Here is the arguments that log.err() received.

2010-01-10 20:08:21+0100 [-] {'level': 30, 'format': 'local error in add_lease to [%(peerid)s]: %(f_value)s', 'failure': <twisted.python.failure.Failure <class 'foolscap.ipb.DeadReferenceError'>>, 'f_value': 'Connection was lost', 'peerid': 'wpxpieqx', 'umid': 'hEGuQg'}

Here is the arguments that log.err() received. ``` 2010-01-10 20:08:21+0100 [-] {'level': 30, 'format': 'local error in add_lease to [%(peerid)s]: %(f_value)s', 'failure': <twisted.python.failure.Failure <class 'foolscap.ipb.DeadReferenceError'>>, 'f_value': 'Connection was lost', 'peerid': 'wpxpieqx', 'umid': 'hEGuQg'} ```

zooko commented

2010-01-10 22:27:57 +00:00

class 'foolscap.ipb.DeadReferenceError' suggests that the connection to a storage server turned out to be broken during the add lease process, and there is some part of the add-lease code which doesn't have a .addErrback() which knows how to deal with the connection failing.

Maybe try the fix to logging that Brian suggested and see if the correctly-logged diagnostics still suggest that this is the problem?

`class 'foolscap.ipb.DeadReferenceError'` suggests that the connection to a storage server turned out to be broken during the add lease process, and there is some part of the add-lease code which doesn't have a `.addErrback()` which knows how to deal with the connection failing. Maybe try the fix to logging that Brian suggested and see if the correctly-logged diagnostics still suggest that this is the problem?

francois commented

2010-01-10 22:34:05 +00:00

Attachment debug-log.txt (19685 bytes) added

Debug log coming from 'flogtool tail' running during the error.

**Attachment** debug-log.txt (19685 bytes) added Debug log coming from 'flogtool tail' running during the error.

debug-log.txt

19 KiB

francois commented

2010-01-10 22:37:04 +00:00

The incriminated peer 'wpxpieqx' is running the following Tahoe version and is still accepting new shares, ie the disk is not full.

allmydata-tahoe: 1.5.0-r4054, foolscap: 0.4.2, pycryptopp: 0.5.16-r672, zfec: 1.4.6-r333, Twisted: 8.1.0, Nevow: 0.9.31, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-debian_5.0.3-i686-32bit, sqlite: 3.5.9, simplejson: 1.9.2, argparse: 1.0.1, pyOpenSSL: 0.7, pyutil: 1.4.1, zbase32: 1.1.1, setuptools: 0.6c8, pysqlite: 2.3.2

The incriminated peer 'wpxpieqx' is running the following Tahoe version and is still accepting new shares, ie the disk is not full. ``` allmydata-tahoe: 1.5.0-r4054, foolscap: 0.4.2, pycryptopp: 0.5.16-r672, zfec: 1.4.6-r333, Twisted: 8.1.0, Nevow: 0.9.31, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-debian_5.0.3-i686-32bit, sqlite: 3.5.9, simplejson: 1.9.2, argparse: 1.0.1, pyOpenSSL: 0.7, pyutil: 1.4.1, zbase32: 1.1.1, setuptools: 0.6c8, pysqlite: 2.3.2 ```

francois commented

2010-01-10 22:54:41 +00:00

This is what gets logged in the storage node itself.

2010-01-10 20:08:23+0100 [-] Unhandled Error
        Traceback (most recent call last):
        Failure: foolscap.ipb.DeadReferenceError: [b5le] replaced by a new connection

I also saw occurrences of this exception from long ago when the Tahoe client was running 1.5.0, so it isn't likely to be a new bug introduced in 1.6.0.

This is what gets logged in the storage node itself. ``` 2010-01-10 20:08:23+0100 [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: [b5le] replaced by a new connection ``` I also saw occurrences of this exception from long ago when the Tahoe client was running 1.5.0, so it isn't likely to be a new bug introduced in 1.6.0.

zooko changed title from ~~'tahoe deep-check --repair --add-lease' triggers a logging error~~ to 'tahoe deep-check --repair --add-lease' triggers an exception in the client

2010-01-11 01:27:59 +00:00

warner commented

2010-01-11 19:13:48 +00:00

hrm, DeadReferenceError is supposed to be a remote failure, not a local one. It should have followed the other error path (the one which says "error in add_lease from ..." instead of "local error in add_lease to..").

That said, yeah, even the remote-failure branch isn't looking for DeadReferenceError. That should certainly be ignored.

Ok, so I think we have three issues here:

allmydata.util.log.err must be fixed to pass a failure into tw_log.err properly
decide if DeadReferenceError is supposed to pass f.check(RemoteException), fix it to do so
fix _add_lease_failed to ignore DeadReferenceError, in both immutable and mutable cases

hrm, `DeadReferenceError` is supposed to be a remote failure, not a local one. It should have followed the other error path (the one which says "error in add_lease from ..." instead of "local error in add_lease to.."). That said, yeah, even the remote-failure branch isn't looking for `DeadReferenceError`. That should certainly be ignored. Ok, so I think we have three issues here: * allmydata.util.log.err must be fixed to pass a failure into tw_log.err properly * decide if `DeadReferenceError` is supposed to pass `f.check(RemoteException)`, fix it to do so * fix `_add_lease_failed` to ignore `DeadReferenceError`, in both immutable and mutable cases

warner commented

2010-01-12 00:15:35 +00:00

I've opened (and closed as WONTFIX) Foolscap#145 for the second issue: I decided that DeadReferenceError should remain separate. And therefore I just pushed changeset:bacb6fe5aad1f416 to cover the last issue: ignore DeadReferenceError in add_lease calls.

So the remaining issue is to get util.log.err fixed.

I've opened (and closed as WONTFIX) [Foolscap#145](http://foolscap.lothar.com/trac/ticket/145) for the second issue: I decided that `DeadReferenceError` should remain separate. And therefore I just pushed changeset:bacb6fe5aad1f416 to cover the last issue: ignore `DeadReferenceError` in add_lease calls. So the remaining issue is to get util.log.err fixed.

warner commented

2010-01-12 01:45:13 +00:00

changeset:d888bf33772327f2 includes the other fix, to call log.err properly (either with a Failure as the first positional argument, or in an exception-handler context).

I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).

changeset:d888bf33772327f2 includes the other fix, to call `log.err` properly (either with a Failure as the first positional argument, or in an exception-handler context). I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).

francois commented

2010-01-12 13:37:43 +00:00

Replying to warner:

I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).

Yes, the original problem (exception trigged on the client) is fixed now, thanks for that. I still see foolscap.ipb.DeadReferenceError on the storage nodes a few times a day, does that qualify as frequently? Any hint on how to debug those foolscap disconnections would be very welcome.

Anyway, it's probably time move the disconnection problem to another ticket, see #896.

Replying to [warner](/tahoe-lafs/trac-2024-07-25/issues/889#issuecomment-74460): > I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently). Yes, the original problem (exception trigged on the **client**) is fixed now, thanks for that. I still see `foolscap.ipb.DeadReferenceError` on the storage nodes a few times a day, does that qualify as *frequently*? Any hint on how to debug those foolscap disconnections would be very welcome. Anyway, it's probably time move the disconnection problem to another ticket, see #896.

tahoe-lafs added the

fixed

label 2010-01-12 13:37:43 +00:00

francois closed this issue

2010-01-12 13:37:43 +00:00

Sign in to join this conversation.