'tahoe deep-check --repair --add-lease' triggers an exception in the client #889

Closed
opened 2010-01-09 23:51:28 +00:00 by francois · 15 comments
francois commented 2010-01-09 23:51:28 +00:00
Owner

After an upgrade to the latest trunk and a new backup using immutable dirnodes, the daily cronjob which verifies that my data is safely stored logged the following error in twistd.log.

2010-01-10 00:08:27+0100 [-] Unhandled error in Deferred:
2010-01-10 00:08:27+0100 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn
            cb(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail
            self.deferred.errback(why)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed
            level=log.WEIRD, umid="hEGuQg")
          File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 24, in err
            tw_log.err(*args, **kwargs)
          File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 153, in err
            _stuff = failure.Failure()
          File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__
            raise NoCurrentExceptionError()
        twisted.python.failure.NoCurrentExceptionError:

checker.py near line 13 looks like that:

        log.err(format="local error in add_lease to [%(peerid)s]: %(f_value)s",
                peerid=idlib.shortnodeid_b2a(peerid),
                f_value=str(f.value),
                failure=f,
                level=log.WEIRD, umid="hEGuQg")

And no accident file was created.

After an upgrade to the latest trunk and a new backup using immutable dirnodes, the daily cronjob which verifies that my data is safely stored logged the following error in twistd.log. ``` 2010-01-10 00:08:27+0100 [-] Unhandled error in Deferred: 2010-01-10 00:08:27+0100 [-] Unhandled Error Traceback (most recent call last): File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn cb(*args, **kwargs) File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail self.deferred.errback(why) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback self._startRunCallbacks(fail) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed level=log.WEIRD, umid="hEGuQg") File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 24, in err tw_log.err(*args, **kwargs) File "/usr/lib/python2.5/site-packages/twisted/python/log.py", line 153, in err _stuff = failure.Failure() File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__ raise NoCurrentExceptionError() twisted.python.failure.NoCurrentExceptionError: ``` checker.py near line 13 looks like that: ``` log.err(format="local error in add_lease to [%(peerid)s]: %(f_value)s", peerid=idlib.shortnodeid_b2a(peerid), f_value=str(f.value), failure=f, level=log.WEIRD, umid="hEGuQg") ``` And no accident file was created.
tahoe-lafs added the
code
major
defect
1.5.0
labels 2010-01-09 23:51:28 +00:00
tahoe-lafs added this to the 1.6.0 milestone 2010-01-09 23:51:28 +00:00
francois commented 2010-01-10 02:09:01 +00:00
Author
Owner
(01:06:02 AM) zooko_afk: francois: i'm looking at util/log.py
(01:06:20 AM) zooko_afk: it calls twisted's log.err just to force any unit test that might be running to fail
(01:06:37 AM) zooko_afk: but twisted's log.err expects there to be an exception in existence
(01:06:43 AM) zooko_afk: which in your case there isn't
(01:06:55 AM) zooko_afk: I suggestr that...  we either remove that 
(01:07:13 AM) zooko_afk: or catch this error frpom twoisted and ignore it (ugh)
(01:07:38 AM) zooko_afk: I suggest the former, as a patch for tahoe-lafs v1.6, to be reviewed by brian.
(01:07:41 AM) zooko_afk: want to test it?
(01:07:54 AM) zooko_afk: thos will also reveal the original exception that you found.
(01:08:09 AM) zooko_afk: gotta go do something w irby for a while...
(01:08:39 AM) francois: it won't breaks things if I completely remove it?
(01:08:46 AM) francois: I'll give it a try
(01:44:54 AM) zooko_afk: It looks to me like the tw_log.err() line is unnecessary and is the cause of the problem where the original error is not reported.
``` (01:06:02 AM) zooko_afk: francois: i'm looking at util/log.py (01:06:20 AM) zooko_afk: it calls twisted's log.err just to force any unit test that might be running to fail (01:06:37 AM) zooko_afk: but twisted's log.err expects there to be an exception in existence (01:06:43 AM) zooko_afk: which in your case there isn't (01:06:55 AM) zooko_afk: I suggestr that... we either remove that (01:07:13 AM) zooko_afk: or catch this error frpom twoisted and ignore it (ugh) (01:07:38 AM) zooko_afk: I suggest the former, as a patch for tahoe-lafs v1.6, to be reviewed by brian. (01:07:41 AM) zooko_afk: want to test it? (01:07:54 AM) zooko_afk: thos will also reveal the original exception that you found. (01:08:09 AM) zooko_afk: gotta go do something w irby for a while... (01:08:39 AM) francois: it won't breaks things if I completely remove it? (01:08:46 AM) francois: I'll give it a try (01:44:54 AM) zooko_afk: It looks to me like the tw_log.err() line is unnecessary and is the cause of the problem where the original error is not reported. ```
francois commented 2010-01-10 02:26:59 +00:00
Author
Owner

The incriminated lines comes from this patch.

Sat Sep 20 19:34:27 CEST 2008  warner@lothar.com          
  * util.log: send log.err to Twisted too, so that Trial tests are flunked
diff -rN old-tahoe-upstream/src/allmydata/util/log.py new-tahoe-upstream/src/allmydata/util/log.py
2a3
> from twisted.python import log as tw_log
15a17,21
> # If log.err() happens during a unit test, the unit test should fail. We
> # accomplish this by sending it to twisted.log too. When a WEIRD/SCARY/BAD
> # thing happens that is nevertheless handled, use log.msg(failure=f,
> # level=WEIRD) instead.
> 
17c23
<     # this should probably be in foolscap
---
>     tw_log.err(*args, **kwargs)

Brian, what do you think? I'm not sure whether this patch should be rollback.

The incriminated lines comes from this patch. ``` Sat Sep 20 19:34:27 CEST 2008 warner@lothar.com * util.log: send log.err to Twisted too, so that Trial tests are flunked diff -rN old-tahoe-upstream/src/allmydata/util/log.py new-tahoe-upstream/src/allmydata/util/log.py 2a3 > from twisted.python import log as tw_log 15a17,21 > # If log.err() happens during a unit test, the unit test should fail. We > # accomplish this by sending it to twisted.log too. When a WEIRD/SCARY/BAD > # thing happens that is nevertheless handled, use log.msg(failure=f, > # level=WEIRD) instead. > 17c23 < # this should probably be in foolscap --- > tw_log.err(*args, **kwargs) ``` Brian, what do you think? I'm not sure whether this patch should be rollback.

Looks critical to me, because it looks like we can't add leases to immutable directories (guessing that this is the underlying WEIRD that didn't get logged). Hopefully François will report more after removing the tw_log.err() feature.

Looks critical to me, because it looks like we can't add leases to immutable directories (guessing that this is the underlying WEIRD that didn't get logged). Hopefully François will report more after removing the `tw_log.err()` feature.
zooko added
critical
and removed
major
labels 2010-01-10 05:58:16 +00:00

François: don't wait for Brian to look at the tw_log.err() thing -- investigating the underlying error is more urgent. Also I hereby take responsibility if Brian later says that removing that line was totally wrong. :-)

François: don't wait for Brian to look at the `tw_log.err()` thing -- investigating the underlying error is more urgent. Also I hereby take responsibility if Brian later says that removing that line was totally wrong. :-)

Huh, I didn't know that twisted.python.log was so picky about tw_log.err being called outside of an exception context. Looks like this has been around since twisted-2.4.0, though.

So, we want the tw_log.err to be there, but we also want to make sure that an exception in it won't cause the subsequent foolscap log.err to get bypassed (and worse, any other code following the call to allmydata.util.log.err will also be bypassed, although usually log.err lives at the end of the callback chain so there is rarely much code sitting after it). log.msg and log.err must never raise exceptions.

It's probably still a good idea to move this functionality into Foolscap. But for now, I'd do two things:

  • have util.log.err catch-and-ignore any exceptions raised by tw_log.err, or maybe just this NoCurrentExceptionError
  • have util.log.err pull any failure= argument out of its kwargs and pass it as the first positional argument to tw_log.err . The root problem here is that the _add_lease_failed function is an errback, so the relevant exception has already been captured and bundled up into a Failure instance, but tw_log.err isn't being told about it properly. Foolscap's log.err looks for a failure in the failure= argument, but tw_log.err wants it to be the first positional argument. If it doesn't find one there, then it tries to create a new Failure, which explodes when you're in an errback and not in a synchronous exception handler.

I agree with Zooko that commenting out the tw_log.err and reproducing the problem is a good direction to go. I'd like to know what the local error was.

Huh, I didn't know that twisted.python.log was so picky about tw_log.err being called outside of an exception context. Looks like this has been around since twisted-2.4.0, though. So, we want the tw_log.err to be there, but we also want to make sure that an exception in it won't cause the subsequent foolscap log.err to get bypassed (and worse, any other code following the call to allmydata.util.log.err will also be bypassed, although usually log.err lives at the end of the callback chain so there is rarely much code sitting after it). log.msg and log.err must never raise exceptions. It's probably still a good idea to move this functionality into Foolscap. But for now, I'd do two things: * have util.log.err catch-and-ignore any exceptions raised by tw_log.err, or maybe just this NoCurrentExceptionError * have util.log.err pull any `failure=` argument out of its kwargs and pass it as the first positional argument to tw_log.err . The root problem here is that the `_add_lease_failed` function is an errback, so the relevant exception has already been captured and bundled up into a Failure instance, but tw_log.err isn't being told about it properly. Foolscap's log.err looks for a failure in the `failure=` argument, but tw_log.err wants it to be the first positional argument. If it doesn't find one there, then it tries to create a new Failure, which explodes when you're in an errback and not in a synchronous exception handler. I agree with Zooko that commenting out the tw_log.err and reproducing the problem is a good direction to go. I'd like to know what the local error was.
francois commented 2010-01-10 14:51:37 +00:00
Author
Owner

Ok, after commenting tw_log.err(), the same exception occurs, but in foolscap this time. I'll try to add some code to dump more informations but it's gonna be slow because it takes about 30 minutes before the problem occurs.

2010-01-10 14:38:24+0100 [-] Unhandled error in Deferred:
2010-01-10 14:38:24+0100 [-] Unhandled Error
        Traceback (most recent call last):
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn
            cb(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail
            self.deferred.errback(why)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed
            level=log.WEIRD, umid="hEGuQg")
          File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 27, in err
            return log.err(*args, **kwargs)
          File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/logging/log.py", line 218, in err
            _stuff = failure.Failure()
          File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__
            raise NoCurrentExceptionError()
        twisted.python.failure.NoCurrentExceptionError:
Ok, after commenting `tw_log.err()`, the same exception occurs, but in foolscap this time. I'll try to add some code to dump more informations but it's gonna be slow because it takes about 30 minutes before the problem occurs. ``` 2010-01-10 14:38:24+0100 [-] Unhandled error in Deferred: 2010-01-10 14:38:24+0100 [-] Unhandled Error Traceback (most recent call last): File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py", line 26, in _turn cb(*args, **kwargs) File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/call.py", line 90, in fail self.deferred.errback(why) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback self._startRunCallbacks(fail) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "/home/francois/dev/tahoe/src/allmydata/immutable/checker.py", line 123, in _add_lease_failed level=log.WEIRD, umid="hEGuQg") File "/home/francois/dev/tahoe/src/allmydata/util/log.py", line 27, in err return log.err(*args, **kwargs) File "/home/francois/dev/tahoe/support/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/logging/log.py", line 218, in err _stuff = failure.Failure() File "/usr/lib/python2.5/site-packages/twisted/python/failure.py", line 174, in __init__ raise NoCurrentExceptionError() twisted.python.failure.NoCurrentExceptionError: ```
francois commented 2010-01-10 20:58:30 +00:00
Author
Owner

Here is the arguments that log.err() received.

2010-01-10 20:08:21+0100 [-] {'level': 30, 'format': 'local error in add_lease to [%(peerid)s]: %(f_value)s', 'failure': <twisted.python.failure.Failure <class 'foolscap.ipb.DeadReferenceError'>>, 'f_value': 'Connection was lost', 'peerid': 'wpxpieqx', 'umid': 'hEGuQg'}
Here is the arguments that log.err() received. ``` 2010-01-10 20:08:21+0100 [-] {'level': 30, 'format': 'local error in add_lease to [%(peerid)s]: %(f_value)s', 'failure': <twisted.python.failure.Failure <class 'foolscap.ipb.DeadReferenceError'>>, 'f_value': 'Connection was lost', 'peerid': 'wpxpieqx', 'umid': 'hEGuQg'} ```

class 'foolscap.ipb.DeadReferenceError' suggests that the connection to a storage server turned out to be broken during the add lease process, and there is some part of the add-lease code which doesn't have a .addErrback() which knows how to deal with the connection failing.

Maybe try the fix to logging that Brian suggested and see if the correctly-logged diagnostics still suggest that this is the problem?

`class 'foolscap.ipb.DeadReferenceError'` suggests that the connection to a storage server turned out to be broken during the add lease process, and there is some part of the add-lease code which doesn't have a `.addErrback()` which knows how to deal with the connection failing. Maybe try the fix to logging that Brian suggested and see if the correctly-logged diagnostics still suggest that this is the problem?
francois commented 2010-01-10 22:34:05 +00:00
Author
Owner

Attachment debug-log.txt (19685 bytes) added

Debug log coming from 'flogtool tail' running during the error.

**Attachment** debug-log.txt (19685 bytes) added Debug log coming from 'flogtool tail' running during the error.
francois commented 2010-01-10 22:37:04 +00:00
Author
Owner

The incriminated peer 'wpxpieqx' is running the following Tahoe version and is still accepting new shares, ie the disk is not full.

allmydata-tahoe: 1.5.0-r4054, foolscap: 0.4.2, pycryptopp: 0.5.16-r672, zfec: 1.4.6-r333, Twisted: 8.1.0, Nevow: 0.9.31, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-debian_5.0.3-i686-32bit, sqlite: 3.5.9, simplejson: 1.9.2, argparse: 1.0.1, pyOpenSSL: 0.7, pyutil: 1.4.1, zbase32: 1.1.1, setuptools: 0.6c8, pysqlite: 2.3.2
The incriminated peer 'wpxpieqx' is running the following Tahoe version and is still accepting new shares, ie the disk is not full. ``` allmydata-tahoe: 1.5.0-r4054, foolscap: 0.4.2, pycryptopp: 0.5.16-r672, zfec: 1.4.6-r333, Twisted: 8.1.0, Nevow: 0.9.31, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-debian_5.0.3-i686-32bit, sqlite: 3.5.9, simplejson: 1.9.2, argparse: 1.0.1, pyOpenSSL: 0.7, pyutil: 1.4.1, zbase32: 1.1.1, setuptools: 0.6c8, pysqlite: 2.3.2 ```
francois commented 2010-01-10 22:54:41 +00:00
Author
Owner

This is what gets logged in the storage node itself.

2010-01-10 20:08:23+0100 [-] Unhandled Error
        Traceback (most recent call last):
        Failure: foolscap.ipb.DeadReferenceError: [b5le] replaced by a new connection

I also saw occurrences of this exception from long ago when the Tahoe client was running 1.5.0, so it isn't likely to be a new bug introduced in 1.6.0.

This is what gets logged in the storage node itself. ``` 2010-01-10 20:08:23+0100 [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: [b5le] replaced by a new connection ``` I also saw occurrences of this exception from long ago when the Tahoe client was running 1.5.0, so it isn't likely to be a new bug introduced in 1.6.0.
zooko changed title from 'tahoe deep-check --repair --add-lease' triggers a logging error to 'tahoe deep-check --repair --add-lease' triggers an exception in the client 2010-01-11 01:27:59 +00:00

hrm, DeadReferenceError is supposed to be a remote failure, not a local one. It should have followed the other error path (the one which says "error in add_lease from ..." instead of "local error in add_lease to..").

That said, yeah, even the remote-failure branch isn't looking for DeadReferenceError. That should certainly be ignored.

Ok, so I think we have three issues here:

  • allmydata.util.log.err must be fixed to pass a failure into tw_log.err properly
  • decide if DeadReferenceError is supposed to pass f.check(RemoteException), fix it to do so
  • fix _add_lease_failed to ignore DeadReferenceError, in both immutable and mutable cases
hrm, `DeadReferenceError` is supposed to be a remote failure, not a local one. It should have followed the other error path (the one which says "error in add_lease from ..." instead of "local error in add_lease to.."). That said, yeah, even the remote-failure branch isn't looking for `DeadReferenceError`. That should certainly be ignored. Ok, so I think we have three issues here: * allmydata.util.log.err must be fixed to pass a failure into tw_log.err properly * decide if `DeadReferenceError` is supposed to pass `f.check(RemoteException)`, fix it to do so * fix `_add_lease_failed` to ignore `DeadReferenceError`, in both immutable and mutable cases

I've opened (and closed as WONTFIX) Foolscap#145 for the second issue: I decided that DeadReferenceError should remain separate. And therefore I just pushed changeset:bacb6fe5aad1f416 to cover the last issue: ignore DeadReferenceError in add_lease calls.

So the remaining issue is to get util.log.err fixed.

I've opened (and closed as WONTFIX) [Foolscap#145](http://foolscap.lothar.com/trac/ticket/145) for the second issue: I decided that `DeadReferenceError` should remain separate. And therefore I just pushed changeset:bacb6fe5aad1f416 to cover the last issue: ignore `DeadReferenceError` in add_lease calls. So the remaining issue is to get util.log.err fixed.

changeset:d888bf33772327f2 includes the other fix, to call log.err properly (either with a Failure as the first positional argument, or in an exception-handler context).

I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).

changeset:d888bf33772327f2 includes the other fix, to call `log.err` properly (either with a Failure as the first positional argument, or in an exception-handler context). I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).
francois commented 2010-01-12 13:37:43 +00:00
Author
Owner

Replying to warner:

I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently).

Yes, the original problem (exception trigged on the client) is fixed now, thanks for that. I still see foolscap.ipb.DeadReferenceError on the storage nodes a few times a day, does that qualify as frequently? Any hint on how to debug those foolscap disconnections would be very welcome.

Anyway, it's probably time move the disconnection problem to another ticket, see #896.

Replying to [warner](/tahoe-lafs/trac-2024-07-25/issues/889#issuecomment-74460): > I think that should fix this ticket, but I'm hoping Francois can confirm that the original problem goes away. (since it looks like that error was being raised because one of the storage servers was disconnecting during the deep-check, and the new code should correctly ignore that disconnection. If you were seeing that message a lot, then there will still be an open question as to why a server is coming and going frequently). Yes, the original problem (exception trigged on the **client**) is fixed now, thanks for that. I still see `foolscap.ipb.DeadReferenceError` on the storage nodes a few times a day, does that qualify as *frequently*? Any hint on how to debug those foolscap disconnections would be very welcome. Anyway, it's probably time move the disconnection problem to another ticket, see #896.
tahoe-lafs added the
fixed
label 2010-01-12 13:37:43 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#889
No description provided.