"recursion depth exceeded" failure during t=manifest of large directory tree #535
While working on the new "tahoe manifest" CLI command, I noticed some
stack-exhaustion exceptions that occurred as the code was traversing a large
(but not huge) directory structure.
There were about 700 files, 60 directories, and the largest directory had
about 150 children.
#237 has some other notes about how Deferreds can wind up using unusually
deep stacks (and thus run into Python's default 1000-frame "recursion
limit"). But it turns out that this problem was slightly different.
The key to tracking down these sorts of problems is to start by adding the
following code to Deferred._runCallbacks:
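Roughly like this — the hook point (Deferred._runCallbacks) is the one named above, but the wrapper shape, the depth threshold, and the use of traceback.print_stack() are illustrative assumptions rather than the exact snippet:

```python
# Illustrative instrumentation sketch: wrap Deferred._runCallbacks so that
# when the interpreter stack gets close to the recursion limit, we dump the
# frames that led here. Threshold and logging target are assumptions.
import sys
import traceback

from twisted.internet import defer

_orig_runCallbacks = defer.Deferred._runCallbacks

def _traced_runCallbacks(self, *args, **kwargs):
    # If the current stack is within ~50 frames of the recursion limit,
    # record how we got here before the "recursion depth exceeded" error fires.
    if len(traceback.extract_stack()) > sys.getrecursionlimit() - 50:
        traceback.print_stack(file=sys.stderr)
    return _orig_runCallbacks(self, *args, **kwargs)

defer.Deferred._runCallbacks = _traced_runCallbacks
```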
This will log the stack that leads up to the failure. In the two cases I
wrote about in the #237 comment, the problem occurs entirely in Twisted code
(even though the root cause was set up in application code).
In this case, I saw allmydata.util.limiter:43 Limiter.maybe_start_task() and
limiter.py:48 _done in the 5-frame recursion cycle. It turns out that a
ready-to-fire Deferred (the result of a limiter.add() of an immediate
function, in this case ManifestWalker.add_node) causes maybe_start_task to
synchronously invoke self._done, which synchronously invokes
maybe_start_task again. This cycle repeated for every pending task, adding
5 stack frames each time, until the stack was exhausted.
The solution was to interrupt the cycle with an eventual-send:
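In sketch form — this is a minimal reconstruction of the limiter pattern, not the exact limiter.py change; eventually() is foolscap's eventual-send helper, roughly reactor.callLater(0, ...) under the hood:

```python
# Sketch of the fixed limiter (structure assumed). The one important change
# is the eventually() call in _done: it re-enters maybe_start_task on a
# later reactor turn (a fresh stack) instead of recursing through
# ready-to-fire Deferreds on the current one.
from foolscap.eventual import eventually
from twisted.internet import defer

class Limiter:
    def __init__(self, maxparallel=10):
        self.maxparallel = maxparallel
        self.active = 0
        self.waiting = []  # queued (callable, args, kwargs, done_deferred)

    def add(self, cb, *args, **kwargs):
        d = defer.Deferred()
        self.waiting.append((cb, args, kwargs, d))
        self.maybe_start_task()
        return d

    def maybe_start_task(self):
        while self.active < self.maxparallel and self.waiting:
            cb, args, kwargs, done_d = self.waiting.pop(0)
            self.active += 1
            d = defer.maybeDeferred(cb, *args, **kwargs)
            # If cb is immediate, d has already fired and _done runs
            # synchronously, right here inside maybe_start_task.
            d.addBoth(self._done, done_d)

    def _done(self, res, done_d):
        self.active -= 1
        # The fix: eventual-send instead of a synchronous recursive call.
        # Previously maybe_start_task() was invoked directly here, adding
        # ~5 stack frames per already-completed task until the stack filled.
        eventually(self.maybe_start_task)
        done_d.callback(res)
```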
changeset:fc7cd23bd1906351 fixes this, but I'm trying to write a unit test to prove it. I'm having problems setting up a test environment that triggers the issue: my approach is to create a top-level directory that contains 20 subdirectories, each of which contains 100 LIT files, and then do a t=manifest of the whole thing. The 20 subdirs ought to fill the Limiter queue, and then the 2000 objects should hit ManifestWalker.add_node and finish synchronously. That ought to flood the Limiter's queue with ready-to-fire Deferreds. But something's screwy in the admittedly weird "fake directory node" test harness I wrote, and it isn't working yet.
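A narrower regression test that exercises only the limiter (sidestepping the fake-directory-node harness entirely) might look something like the following; the names build on the Limiter sketch above and are assumptions, not actual Tahoe test code:

```python
# Minimal, assumed regression test for just the limiter cycle: queue far
# more immediately-completing tasks than the recursion limit could absorb
# and check that they all finish. Before the eventual-send fix, this
# pattern exhausted the stack.
from foolscap.eventual import flushEventualQueue
from twisted.internet import defer
from twisted.trial import unittest

class LimiterRecursionTest(unittest.TestCase):
    def test_many_immediate_tasks(self):
        limiter = Limiter(maxparallel=10)
        # Each task returns synchronously, like ManifestWalker.add_node did.
        dl = [limiter.add(lambda i=i: i) for i in range(2000)]
        d = defer.gatherResults(dl)
        d.addCallback(lambda results: self.assertEqual(len(results), 2000))
        # Drain foolscap's eventual-send queue so trial's reactor
        # cleanliness check does not see leftover delayed calls.
        d.addCallback(lambda ign: flushEventualQueue())
        return d
```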
I guess this is fixed, but Brian couldn't figure out how to write a unit test for it. Shall we close it as fixed?
yeah, I guess. I haven't seen the bug occur since I changed that code.