webish directory page can't display more than 192 children #237

Closed
opened 2007-12-18 23:58:37 +00:00 by warner · 10 comments

While doing some load testing (which involves writing lots and lots of files
into a few directories), I discovered an exception when using the webish
interface to list one of these large directories. The exception came out of
nevow as it attempted to render the 193rd child of the directory:

2007-12-18 23:48:36.611Z [-] Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 191, in addCallback
    callbackKeywords=kw)
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 182, in addCallbacks
    self._runCallbacks()
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 317, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/twist.py", line 41, in cb
    _drive(iterable, finished)
--- <exception caught here> ---
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/twist.py", line 23, in _drive
    next = iterable.next()
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 83, in iterflatten
    for item in gen:
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 103, in TagSerializer
    yield serialize(toBeRenderedBy, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 70, in serialize
    return partialflatten(context, obj)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 61, in partialflatten
    return flattener(obj, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 262, in DirectiveSerializer
    return serialize(renderer, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 70, in serialize
    return partialflatten(context, obj)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 61, in partialflatten
    return flattener(obj, context)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 245, in MethodSerializer
    return FunctionSerializer(original, context, nocontext)
  File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 234, in FunctionSerializer
    result = original(context, data)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/webish.py", line 227, in render_row
    ctx.fillSlots("size", target.get_size())
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/filenode.py", line 25, in get_size
    return IFileURI(self.uri).get_size()
  File "/usr/lib/python2.5/site-packages/twisted/python/components.py", line 113, in _hook
    return factory(ob)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 428, in from_string_filenode
    u = from_string(s)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 391, in from_string
    return CHKFileURI().init_from_string(s)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 52, in init_from_string
    self.key = idlib.a2b(key_s)
  File "/home/amduser/tahoe-loadnet/source/src/allmydata/util/idlib.py", line 190, in a2b
    precondition(could_be_base32_encoded(cs), "cs is required to be possibly base32 encoded data.", cs=cs)
exceptions.RuntimeError: maximum recursion depth exceeded

I'm not sure if there is actual recursion taking place.. it may just be a
function that's returning defer.succeed(result) and this basically surprises
some other piece of code that doesn't expect the Deferred to fire so quickly.

While doing some load testing (which involves writing lots and lots of files into a few directories), I discovered an exception when using the webish interface to list one of these large directories. The exception came out of nevow as it attempted to render the 193rd child of the directory: ``` 2007-12-18 23:48:36.611Z [-] Unhandled Error Traceback (most recent call last): File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 191, in addCallback callbackKeywords=kw) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 182, in addCallbacks self._runCallbacks() File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 317, in _runCallbacks self.result = callback(self.result, *args, **kw) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/twist.py", line 41, in cb _drive(iterable, finished) --- <exception caught here> --- File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/twist.py", line 23, in _drive next = iterable.next() File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 83, in iterflatten for item in gen: File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 103, in TagSerializer yield serialize(toBeRenderedBy, context) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 70, in serialize return partialflatten(context, obj) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 61, in partialflatten return flattener(obj, context) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 262, in DirectiveSerializer return serialize(renderer, context) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 70, in serialize return partialflatten(context, obj) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/ten.py", line 61, in partialflatten return flattener(obj, context) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 245, in MethodSerializer return FunctionSerializer(original, context, nocontext) File "/home/amduser/tahoe-loadnet/source/support/lib/python2.5/site-packages/Nevow-0.9.0-py2.5.egg/nevow/flat/flatstan.py", line 234, in FunctionSerializer result = original(context, data) File "/home/amduser/tahoe-loadnet/source/src/allmydata/webish.py", line 227, in render_row ctx.fillSlots("size", target.get_size()) File "/home/amduser/tahoe-loadnet/source/src/allmydata/filenode.py", line 25, in get_size return IFileURI(self.uri).get_size() File "/usr/lib/python2.5/site-packages/twisted/python/components.py", line 113, in _hook return factory(ob) File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 428, in from_string_filenode u = from_string(s) File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 391, in from_string return CHKFileURI().init_from_string(s) File "/home/amduser/tahoe-loadnet/source/src/allmydata/uri.py", line 52, in init_from_string self.key = idlib.a2b(key_s) File "/home/amduser/tahoe-loadnet/source/src/allmydata/util/idlib.py", line 190, in a2b precondition(could_be_base32_encoded(cs), "cs is required to be possibly base32 encoded data.", cs=cs) exceptions.RuntimeError: maximum recursion depth exceeded ``` I'm not sure if there is actual recursion taking place.. it may just be a function that's returning defer.succeed(result) and this basically surprises some other piece of code that doesn't expect the Deferred to fire so quickly.
warner added the
code-frontend-web
major
defect
0.7.0
labels 2007-12-18 23:58:37 +00:00
warner added this to the undecided milestone 2007-12-18 23:58:37 +00:00
Author

hrm, the exception got mangled during pasting..

hrm, the exception got mangled during pasting..
Author

grr. I can reproduce this reliably with a live node (the failure sometimes
has a nevow traceback on the web page, sometimes it hits the stack limit
while rendering the failure and you get the "additionally an error occurred
rendering the error page" message). But I can't get my test case to fail
(test_web.WebTest.test_GET_DIRURL_large) at all.

The Deferred fired by webish's data_children() is .paused=1 in the live node,
because somewhere in the upstream chain a callback returned another Deferred
that hasn't fired yet. I can simulate this in test_web by modifying
test.common.FakeMutableFileNode.download_to_data() to add a callback rung
that returns fireEventually(data), and that makes .paused=1 like in the live
node, but it doesn't trigger the failure.

On the "I'm desperate so try to fix it even without a failing test case"
front, adding fireEventually() to data_children() doesn't help. I'm pretty
sure the problem is that Nevow's flattening code (iterflatten?) sees a
Deferred that fires with a big list of stuff, and when the top deferred
fires, it does an addCallback for every item inside it. Each addCallback
fires immediately, which reenters the loop, and that's what blows the stack.

I was able to make the live node stop failing with the following ugly hack
in data_children():

    def data_children(self, ctx, data):
        d = self._dirnode.list()
        from foolscap.eventual import fireEventually
        d.addCallback(lambda dict: sorted(dict.items()))
        def _stall(items):
            return [fireEventually(i) for i in items]
        d.addCallback(_stall)
        return d

I.e. force iterflatten to slow down by firing each individual item in a
separate turn. Inefficient!

Ok, I have to run, so I'm not going to get this one fixed before I get back
from vacation. The fireEventually hack will work if we're desperate, but I
really don't want to add it, and also I don't want to add it without being
able to produce a failing test case. (the next step for a test case is to do
it in test_system() where we use real dirnodes, but that's slower and results
in failures that are harder to track down than a focussed test like
test_web).

grr. I can reproduce this reliably with a live node (the failure sometimes has a nevow traceback on the web page, sometimes it hits the stack limit while rendering the failure and you get the "additionally an error occurred rendering the error page" message). But I can't get my test case to fail (test_web.WebTest.test_GET_DIRURL_large) at all. The Deferred fired by webish's data_children() is .paused=1 in the live node, because somewhere in the upstream chain a callback returned another Deferred that hasn't fired yet. I can simulate this in test_web by modifying test.common.FakeMutableFileNode.download_to_data() to add a callback rung that returns fireEventually(data), and that makes .paused=1 like in the live node, but it doesn't trigger the failure. On the "I'm desperate so try to fix it even without a failing test case" front, adding fireEventually() to data_children() doesn't help. I'm pretty sure the problem is that Nevow's flattening code (iterflatten?) sees a Deferred that fires with a big list of stuff, and when the top deferred fires, it does an addCallback for every item inside it. Each addCallback fires immediately, which reenters the loop, and that's what blows the stack. I *was* able to make the live node stop failing with the following ugly hack in data_children(): ``` def data_children(self, ctx, data): d = self._dirnode.list() from foolscap.eventual import fireEventually d.addCallback(lambda dict: sorted(dict.items())) def _stall(items): return [fireEventually(i) for i in items] d.addCallback(_stall) return d ``` I.e. force iterflatten to slow down by firing each individual item in a separate turn. Inefficient! Ok, I have to run, so I'm not going to get this one fixed before I get back from vacation. The fireEventually hack will work if we're desperate, but I really don't want to add it, and also I don't want to add it without being able to produce a failing test case. (the next step for a test case is to do it in test_system() where we use real dirnodes, but that's slower and results in failures that are harder to track down than a focussed test like test_web).
Author

I think this is important enough to bring forward into the 0.7.1 release.

I think this is important enough to bring forward into the 0.7.1 release.
warner self-assigned this 2007-12-31 21:39:58 +00:00
Author

robk pointed out that the test might be passing (when I'd expect it to fail) because the client/consumer is in the same process as the server/producer. Basically the producer flattens a number of entries and writes it to the HTTP socket, but then it has to pause until the socket is drained, and since the reader of that socket is in the same process, it gets a chance to retire more deferreds before it is run again.

If we look at Nevow and don't see any evidence of Producer/Consumer code, that blows this theory out the window. If this is the reason the test is not failing, then a better way to trigger the bug might be to ask nevow to render the directory page to a string, instead of hitting it over HTTP. That might cause it to do the flattening all in a single turn, which ought to trigger the problem.
I seem to recall that Nevow had some way of doing this (so that it can be used in library-mode, driven by something other than twisted.web), but I don't remember exactly how to do it right now.

robk pointed out that the test might be passing (when I'd expect it to fail) because the client/consumer is in the same process as the server/producer. Basically the producer flattens a number of entries and writes it to the HTTP socket, but then it has to pause until the socket is drained, and since the reader of that socket is in the same process, it gets a chance to retire more deferreds before it is run again. If we look at Nevow and don't see any evidence of Producer/Consumer code, that blows this theory out the window. If this *is* the reason the test is not failing, then a better way to trigger the bug might be to ask nevow to render the directory page to a string, instead of hitting it over HTTP. That might cause it to do the flattening all in a single turn, which ought to trigger the problem. I seem to recall that Nevow had some way of doing this (so that it can be used in library-mode, driven by something other than twisted.web), but I don't remember exactly how to do it right now.
tahoe-lafs added this to the undecided milestone 2008-01-23 02:40:34 +00:00
Author

I tried using renderString(), but that didn't trigger the problem either:


        def _check(dirnode):
            #large_url = "/uri/" + dirnode.get_uri() + "/"
            #return self.GET(large_url)
            # we play some games here to trigger nevow's flattener
            r = webish.Directory("rootname", dirnode, ("dirpath",))
            import nevow.context
            ctx = nevow.context.WovenContext()
            from nevow import inevow
            class MyRequest:
                pass
            req = MyRequest()
            req.prePathURL = lambda: ""
            req.uri = ""
            req.args = {}
            req.fields = {}
            ctx.remember(req, inevow.IRequest)
            ctx.remember(self.s, webish.IClient)
            return r.renderString(ctx)
        d.addCallback(_check)
I tried using renderString(), but that didn't trigger the problem either: ``` def _check(dirnode): #large_url = "/uri/" + dirnode.get_uri() + "/" #return self.GET(large_url) # we play some games here to trigger nevow's flattener r = webish.Directory("rootname", dirnode, ("dirpath",)) import nevow.context ctx = nevow.context.WovenContext() from nevow import inevow class MyRequest: pass req = MyRequest() req.prePathURL = lambda: "" req.uri = "" req.args = {} req.fields = {} ctx.remember(req, inevow.IRequest) ctx.remember(self.s, webish.IClient) return r.renderString(ctx) d.addCallback(_check) ```
Author

I just thought of a less-inefficient form of that fireEventually()
workaround: only insert the turn break once every 100 items, instead of on
every item. The following code seems to work:

        def _stall_some(items):
            output = []
            for i,item in enumerate(items):
                if i % 100 == 0:
                    output.append(fireEventually(item))
                else:
                    output.append(item)
            return output
        d.addCallback(_stall_some)
        return d

I still haven't been able to come up with a failing unit test for this, but
I'm going to commit this fix and close the ticket anyways.

Fixed in changeset:124b2160b930c180

I just thought of a less-inefficient form of that fireEventually() workaround: only insert the turn break once every 100 items, instead of on every item. The following code seems to work: ``` def _stall_some(items): output = [] for i,item in enumerate(items): if i % 100 == 0: output.append(fireEventually(item)) else: output.append(item) return output d.addCallback(_stall_some) return d ``` I still haven't been able to come up with a failing unit test for this, but I'm going to commit this fix and close the ticket anyways. Fixed in changeset:124b2160b930c180
warner added the
fixed
label 2008-02-13 03:28:48 +00:00
warner modified the milestone from undecided to 0.8.0 (Allmydata 3.0 Beta) 2008-02-13 03:28:48 +00:00
Author

Some more notes on this problem, so I can find them again later:

Problem One: long list of callbacks, all of them are ready

Each Deferred (we'll call the first one Deferred A) has a list of callback
functions. Each time you do d.addCallback(), this list grows by one element.

When Deferred A fires, the list is executed in a 'while' loop, in
Deferred._runCallbacks. If the callbacks all return either a normal value or
a Failure, then the list is completely consumed during the one call to
_runCallbacks, and everything is fine.

However, when a callback returns another Deferred B (chaining), the first
Deferred A must wait for the second to finish. The code that does this looks
like:

    if isinstance(self.result, Deferred):
        self.pause()
        self.result.addBoth(self._continue)
        break

The second Deferred B might have already been fired by this point, either
because it was born ready (created with defer.succeed, or
defer.maybeDeferred), or because whatever was being waited upon has already
occurred.

If this occurs, the subsequent callback in Deferred A's chain will fire (with
B's result), but it will fire through a 6-frame recursive loop instead of
firing on the next pass of the 'while' loop. As a result, each such
ready-to-fire Deferred will add 6 stack frames. 166 such loops are enough to
put more than 1000 frames on the stack, which will exceed Python's default
sys.getrecursionlimit() .

The 6-frame cycle is:

  File "twisted/internet/defer.py", line 214, in addBoth
    callbackKeywords=kw, errbackKeywords=kw)
  File "twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 341, in _runCallbacks
    self.result.addBoth(self._continue)

The following sample code will cause this situation:

import traceback
from twisted.internet import defer

def fire(res, which):
    print "FIRE", which, "stack:", len(traceback.extract_stack())
    if which == 2:
        traceback.print_stack()
    return defer.succeed(None)

d = defer.Deferred()
for i in range(5): # set this to 170 to see failure
    d.addCallback(fire, i)

d.callback("go")

The exception that this provokes is caught by the Deferred's Failure
mechanisms, but then Twisted has an internal failure while trying to capture
it. The actual Unhandled error in Deferred that gets put into the logs is:

Unhandled error in Deferred:
Traceback (most recent call last):
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 341, in _runCallbacks
    self.result.addBoth(self._continue)
  File "twisted/internet/defer.py", line 214, in addBoth
    callbackKeywords=kw, errbackKeywords=kw)
  File "twisted/internet/defer.py", line 186, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 344, in _runCallbacks
    self.result = failure.Failure()
  File "twisted/python/failure.py", line 265, in __init__
    parentCs = reflect.allYourBase(self.type)
  File "twisted/python/reflect.py", line 542, in allYourBase
    accumulateBases(classObj, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
exceptions.RuntimeError: maximum recursion depth exceeded

Problem Two: deep chain of callbacks, e.g. recursive delayed polling

The other kind of recursion-limit-violation failures that occurrs with
Deferreds involves a long chain that finally fires. The most common way to
generate such a chain is with a recursive method that separates each call
with a Deferred, such as a polling function that returns a Deferred:

   def wait_until_done(self, ignored=None):
       if self.done:
          return True
       else:
          d = Deferred()
          reactor.callLater(1.0, d.callback, None)
          d.addCallback(self.wait_until_done)
          return d

If this function must poll more than 331 times, the reactor tick which
notices the expired timer and fires d.callback will see a
recursion-depth-exceeded exception. The last Deferred fires, which triggers
the _continue callback on the next-to-last Deferred, which allows it to fire,
which triggers the [-2] Deferred, etc. This recursive cycle is of
length 3 and has the following frames:

  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()

This one is trickier to find, because the root of the stack will be some
internal reactor call rather than application code. In particular, the bottom
of the stack will look like:

  File "/tmp/t.py", line 26, in <module>
    reactor.run()
  File "twisted/internet/base.py", line 1048, in run
    self.mainLoop()
  File "twisted/internet/base.py", line 1057, in mainLoop
    self.runUntilCurrent()
  File "twisted/internet/base.py", line 705, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "twisted/internet/defer.py", line 243, in callback
    self._startRunCallbacks(result)
  File "twisted/internet/defer.py", line 312, in _startRunCallbacks
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()

The other tricky thing about this failure is that the application code is
sitting on the end of the stack: any callback that is attached to the
Deferred that wait_until_done returns will run in a low-stack
environment. As a result, recursion-depth-exceeded exceptions will be
triggered by seemingly innocent application code. Note how the "DONE" number
changes as you modify the self.count comparsion value in this example:


import traceback
from twisted.internet import reactor
from twisted.internet.defer import Deferred

class Poller:
    count = 0
    def wait_until_done(self, ignored=None):
        self.count += 1
        if self.count > 301: # 331 works, 332 fails.
            return True
        else:
            d = Deferred()
            reactor.callLater(0.0, d.callback, None)
            d.addCallback(self.wait_until_done)
            return d

p = Poller()
def done(res):
    #traceback.print_stack()
    print "DONE", len(traceback.extract_stack())
d = p.wait_until_done()
d.addCallback(done)
reactor.run()

When this fails, the traceback that shows up in the logs looks like:

Unhandled error in Deferred:
Traceback (most recent call last):
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
--- <exception caught here> ---
  File "twisted/internet/defer.py", line 328, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "twisted/internet/defer.py", line 289, in _continue
    self.unpause()
  File "twisted/internet/defer.py", line 285, in unpause
    self._runCallbacks()
  File "twisted/internet/defer.py", line 344, in _runCallbacks
    self.result = failure.Failure()
  File "twisted/python/failure.py", line 265, in __init__
    parentCs = reflect.allYourBase(self.type)
  File "twisted/python/reflect.py", line 542, in allYourBase
    accumulateBases(classObj, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
  File "twisted/python/reflect.py", line 550, in accumulateBases
    accumulateBases(base, l, baseClass)
exceptions.RuntimeError: maximum recursion depth exceeded

Combinations

Note that these two problems can interact. Each ready-to-fire callback
attached to a single Deferred uses 6 stack frames, and each chained callback
uses 3 stack frames. If X6+Y3 > 1000, the code will fail.

Solutions

For problem one, the requirement is that Deferreds never wind up with more
than 166 callbacks that are ready to fire. In other words, there must be at
least one not-ready-to-fire Deferred in each span of 166 callbacks.

One way to accomplish this is to have every 100th call return
foolscap.eventual.fireEventually(result) instead of
defer.succeed(result). Having every call do this works too, it just
slows things down a bit.

def fire(res, which):
    return defer.fireEventually(None)

d = defer.Deferred()
for i in range(170):
    d.addCallback(fire, i)

For problem two, the requirement is that the depth of the tail-recursion
chain not exceed 331 cycles, minus some room for the code you're eventually
going to attach to the end. One way to accomplish this is to have every 300th
call (or every single call, if you are willing to accept the slowdown) add an
additional fireEventually to slow things down.

    def wait_until_done(self, ignored=None):
        self.count += 1
        if self.count > 301: # 331 works, 332 fails.
            return True
        else:
            d = Deferred()
            reactor.callLater(0.0, d.callback, None)
            d.addCallback(self.wait_until_done)
            d.addCallback(lambda res: fireEventually(res))
            return d
Some more notes on this problem, so I can find them again later: ## Problem One: long list of callbacks, all of them are ready Each Deferred (we'll call the first one Deferred A) has a list of callback functions. Each time you do d.addCallback(), this list grows by one element. When Deferred A fires, the list is executed in a 'while' loop, in Deferred._runCallbacks. If the callbacks all return either a normal value or a Failure, then the list is completely consumed during the one call to _runCallbacks, and everything is fine. However, when a callback returns another Deferred B (chaining), the first Deferred A must wait for the second to finish. The code that does this looks like: ``` if isinstance(self.result, Deferred): self.pause() self.result.addBoth(self._continue) break ``` The second Deferred B might have already been fired by this point, either because it was born ready (created with defer.succeed, or defer.maybeDeferred), or because whatever was being waited upon has already occurred. If this occurs, the subsequent callback in Deferred A's chain will fire (with B's result), but it will fire through a 6-frame recursive loop instead of firing on the next pass of the 'while' loop. As a result, each such ready-to-fire Deferred will add 6 stack frames. 166 such loops are enough to put more than 1000 frames on the stack, which will exceed Python's default sys.getrecursionlimit() . The 6-frame cycle is: ``` File "twisted/internet/defer.py", line 214, in addBoth callbackKeywords=kw, errbackKeywords=kw) File "twisted/internet/defer.py", line 186, in addCallbacks self._runCallbacks() File "twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "twisted/internet/defer.py", line 289, in _continue self.unpause() File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() File "twisted/internet/defer.py", line 341, in _runCallbacks self.result.addBoth(self._continue) ``` The following sample code will cause this situation: ``` import traceback from twisted.internet import defer def fire(res, which): print "FIRE", which, "stack:", len(traceback.extract_stack()) if which == 2: traceback.print_stack() return defer.succeed(None) d = defer.Deferred() for i in range(5): # set this to 170 to see failure d.addCallback(fire, i) d.callback("go") ``` The exception that this provokes is caught by the Deferred's Failure mechanisms, but then Twisted has an internal failure while trying to capture it. The actual Unhandled error in Deferred that gets put into the logs is: ``` Unhandled error in Deferred: Traceback (most recent call last): File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() File "twisted/internet/defer.py", line 341, in _runCallbacks self.result.addBoth(self._continue) File "twisted/internet/defer.py", line 214, in addBoth callbackKeywords=kw, errbackKeywords=kw) File "twisted/internet/defer.py", line 186, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "twisted/internet/defer.py", line 289, in _continue self.unpause() File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() File "twisted/internet/defer.py", line 344, in _runCallbacks self.result = failure.Failure() File "twisted/python/failure.py", line 265, in __init__ parentCs = reflect.allYourBase(self.type) File "twisted/python/reflect.py", line 542, in allYourBase accumulateBases(classObj, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) exceptions.RuntimeError: maximum recursion depth exceeded ``` ## Problem Two: deep chain of callbacks, e.g. recursive delayed polling The other kind of recursion-limit-violation failures that occurrs with Deferreds involves a long chain that finally fires. The most common way to generate such a chain is with a recursive method that separates each call with a Deferred, such as a polling function that returns a Deferred: ``` def wait_until_done(self, ignored=None): if self.done: return True else: d = Deferred() reactor.callLater(1.0, d.callback, None) d.addCallback(self.wait_until_done) return d ``` If this function must poll more than 331 times, the reactor tick which notices the expired timer and fires d.callback will see a recursion-depth-exceeded exception. The last Deferred fires, which triggers the _continue callback on the next-to-last Deferred, which allows it to fire, which triggers the `[-2]` Deferred, etc. This recursive cycle is of length 3 and has the following frames: ``` File "twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "twisted/internet/defer.py", line 289, in _continue self.unpause() File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() ``` This one is trickier to find, because the root of the stack will be some internal reactor call rather than application code. In particular, the bottom of the stack will look like: ``` File "/tmp/t.py", line 26, in <module> reactor.run() File "twisted/internet/base.py", line 1048, in run self.mainLoop() File "twisted/internet/base.py", line 1057, in mainLoop self.runUntilCurrent() File "twisted/internet/base.py", line 705, in runUntilCurrent call.func(*call.args, **call.kw) File "twisted/internet/defer.py", line 243, in callback self._startRunCallbacks(result) File "twisted/internet/defer.py", line 312, in _startRunCallbacks self._runCallbacks() File "twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "twisted/internet/defer.py", line 289, in _continue self.unpause() File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() ``` The other tricky thing about this failure is that the application code is sitting on the end of the stack: any callback that is attached to the Deferred that `wait_until_done` returns will run in a low-stack environment. As a result, recursion-depth-exceeded exceptions will be triggered by seemingly innocent application code. Note how the "DONE" number changes as you modify the self.count comparsion value in this example: ``` /usr/bin/python import traceback from twisted.internet import reactor from twisted.internet.defer import Deferred class Poller: count = 0 def wait_until_done(self, ignored=None): self.count += 1 if self.count > 301: # 331 works, 332 fails. return True else: d = Deferred() reactor.callLater(0.0, d.callback, None) d.addCallback(self.wait_until_done) return d p = Poller() def done(res): #traceback.print_stack() print "DONE", len(traceback.extract_stack()) d = p.wait_until_done() d.addCallback(done) reactor.run() ``` When this fails, the traceback that shows up in the logs looks like: ``` Unhandled error in Deferred: Traceback (most recent call last): File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() File "twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "twisted/internet/defer.py", line 289, in _continue self.unpause() File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() --- <exception caught here> --- File "twisted/internet/defer.py", line 328, in _runCallbacks self.result = callback(self.result, *args, **kw) File "twisted/internet/defer.py", line 289, in _continue self.unpause() File "twisted/internet/defer.py", line 285, in unpause self._runCallbacks() File "twisted/internet/defer.py", line 344, in _runCallbacks self.result = failure.Failure() File "twisted/python/failure.py", line 265, in __init__ parentCs = reflect.allYourBase(self.type) File "twisted/python/reflect.py", line 542, in allYourBase accumulateBases(classObj, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) File "twisted/python/reflect.py", line 550, in accumulateBases accumulateBases(base, l, baseClass) exceptions.RuntimeError: maximum recursion depth exceeded ``` ## Combinations Note that these two problems can interact. Each ready-to-fire callback attached to a single Deferred uses 6 stack frames, and each chained callback uses 3 stack frames. If X*6+Y*3 > 1000, the code will fail. ## Solutions For problem one, the requirement is that Deferreds never wind up with more than 166 callbacks that are ready to fire. In other words, there must be at least one not-ready-to-fire Deferred in each span of 166 callbacks. One way to accomplish this is to have every 100th call return `foolscap.eventual.fireEventually(result)` instead of `defer.succeed(result)`. Having every call do this works too, it just slows things down a bit. ``` def fire(res, which): return defer.fireEventually(None) d = defer.Deferred() for i in range(170): d.addCallback(fire, i) ``` For problem two, the requirement is that the depth of the tail-recursion chain not exceed 331 cycles, minus some room for the code you're eventually going to attach to the end. One way to accomplish this is to have every 300th call (or every single call, if you are willing to accept the slowdown) add an additional `fireEventually` to slow things down. ``` def wait_until_done(self, ignored=None): self.count += 1 if self.count > 301: # 331 works, 332 fails. return True else: d = Deferred() reactor.callLater(0.0, d.callback, None) d.addCallback(self.wait_until_done) d.addCallback(lambda res: fireEventually(res)) return d ```

Twisted 10.2.0 has the fix for the stack overflow in chained deferreds: http://twistedmatrix.com/trac/ticket/1833

I suppose this means that some year, when everyone has upgraded to Twisted >= v10.2.0 (which I estimate will be in about 2015), we can simplify the Tahoe-LAFS code by depending on that behavior.

Twisted 10.2.0 has the fix for the stack overflow in chained deferreds: <http://twistedmatrix.com/trac/ticket/1833> I suppose this means that some year, when everyone has upgraded to Twisted >= v10.2.0 (which I estimate will be in about 2015), we can simplify the Tahoe-LAFS code by depending on that behavior.
davidsarah commented 2010-11-30 05:09:40 +00:00
Owner

Keywords so I can find this ticket again. (Yes, I know it is fixed, but comment:63648 is a really good description of the problem.)

Keywords so I can find this ticket again. (Yes, I know it is fixed, but [comment:63648](/tahoe-lafs/trac-2024-07-25/issues/237#issuecomment-63648) is a really good description of the problem.)
Author
hm, I think you meant <http://twistedmatrix.com/trac/ticket/411>
Sign in to join this conversation.
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#237
No description provided.