"recursion depth exceeded" failure during t=manifest of large directory tree #535
While working on the new "tahoe manifest" CLI command, I noticed some
stack-exhaustion exceptions that occurred as the code was traversing a large
(but not huge) directory structure.
There were about 700 files, 60 directories, and the largest directory had
about 150 children.
#237 has some other notes about how Deferreds can wind up using unusually
deep stacks (and thus run into Python's default 1000-frame "recursion
limit"). But it turns out that this problem was slightly different.
The key to tracking down these sorts of problems is to start by adding the
following code to Deferred._runCallbacks:
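Roughly like this — the hook point (Deferred._runCallbacks) is the one named above, but the wrapper shape, the depth threshold, and the use of traceback.print_stack() are illustrative assumptions rather than the exact snippet:

```python
# Illustrative instrumentation sketch: wrap Deferred._runCallbacks so that
# when the interpreter stack gets close to the recursion limit, we dump the
# frames that led here. Threshold and logging target are assumptions.
import sys
import traceback

from twisted.internet import defer

_orig_runCallbacks = defer.Deferred._runCallbacks

def _traced_runCallbacks(self, *args, **kwargs):
    # If the current stack is within ~50 frames of the recursion limit,
    # record how we got here before the "recursion depth exceeded" error fires.
    if len(traceback.extract_stack()) > sys.getrecursionlimit() - 50:
        traceback.print_stack(file=sys.stderr)
    return _orig_runCallbacks(self, *args, **kwargs)

defer.Deferred._runCallbacks = _traced_runCallbacks
```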
This will log the stack that leads up to the failure. In the two cases I
wrote about in the #237 comment, the problem occurs entirely in Twisted code
(even though the root cause was set up in application code).
In this case, I saw allmydata.util.limiter:43 Limiter.maybe_start_task() and
limiter.py:48 _done in the 5-frame recursion cycle. It turns out that a
ready-to-fire Deferred (the result of a limiter.add() of an immediate
function, in this case ManifestWalker.add_node) causes maybe_start_task to
synchronously invoke self._done, which synchronously invokes
maybe_start_task again. This cycle repeated for every pending task, adding
5 stack frames each time, until the stack was exhausted.
The solution was to interrupt the cycle with an eventual-send:
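In sketch form — this is a minimal reconstruction of the limiter pattern, not the exact limiter.py change; eventually() is foolscap's eventual-send helper, roughly reactor.callLater(0, ...) under the hood:

```python
# Sketch of the fixed limiter (structure assumed). The one important change
# is the eventually() call in _done: it re-enters maybe_start_task on a
# later reactor turn (a fresh stack) instead of recursing through
# ready-to-fire Deferreds on the current one.
from foolscap.eventual import eventually
from twisted.internet import defer

class Limiter:
    def __init__(self, maxparallel=10):
        self.maxparallel = maxparallel
        self.active = 0
        self.waiting = []  # queued (callable, args, kwargs, done_deferred)

    def add(self, cb, *args, **kwargs):
        d = defer.Deferred()
        self.waiting.append((cb, args, kwargs, d))
        self.maybe_start_task()
        return d

    def maybe_start_task(self):
        while self.active < self.maxparallel and self.waiting:
            cb, args, kwargs, done_d = self.waiting.pop(0)
            self.active += 1
            d = defer.maybeDeferred(cb, *args, **kwargs)
            # If cb is immediate, d has already fired and _done runs
            # synchronously, right here inside maybe_start_task.
            d.addBoth(self._done, done_d)

    def _done(self, res, done_d):
        self.active -= 1
        # The fix: eventual-send instead of a synchronous recursive call.
        # Previously maybe_start_task() was invoked directly here, adding
        # ~5 stack frames per already-completed task until the stack filled.
        eventually(self.maybe_start_task)
        done_d.callback(res)
```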
changeset:fc7cd23bd1906351 fixes this, but I'm trying to write a unit test to prove it. I'm having problems setting up a test environment that triggers the issue: my approach is to create a top-level directory that contains 20 subdirectories, each of which contains 100 LIT files, and then do a t=manifest of the whole thing. The 20 subdirs ought to fill the Limiter queue, and then the 2000 objects should hit ManifestWalker.add_node and finish synchronously. That ought to flood the Limiter's queue with ready-to-fire Deferreds. But something's screwy in the admittedly weird "fake directory node" test harness I wrote, and it isn't working yet.
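A narrower regression test that exercises only the limiter (sidestepping the fake-directory-node harness entirely) might look something like the following; the names build on the Limiter sketch above and are assumptions, not actual Tahoe test code:

```python
# Minimal, assumed regression test for just the limiter cycle: queue far
# more immediately-completing tasks than the recursion limit could absorb
# and check that they all finish. Before the eventual-send fix, this
# pattern exhausted the stack.
from foolscap.eventual import flushEventualQueue
from twisted.internet import defer
from twisted.trial import unittest

class LimiterRecursionTest(unittest.TestCase):
    def test_many_immediate_tasks(self):
        limiter = Limiter(maxparallel=10)
        # Each task returns synchronously, like ManifestWalker.add_node did.
        dl = [limiter.add(lambda i=i: i) for i in range(2000)]
        d = defer.gatherResults(dl)
        d.addCallback(lambda results: self.assertEqual(len(results), 2000))
        # Drain foolscap's eventual-send queue so trial's reactor
        # cleanliness check does not see leftover delayed calls.
        d.addCallback(lambda ign: flushEventualQueue())
        return d
```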
I guess this is fixed, but Brian couldn't figure out how to write a unit test for it. Shall we close it as fixed?
yeah, I guess. I haven't seen the bug occur since I changed that code.