webish directory page can't display more than 192 children #237
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#237
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
While doing some load testing (which involves writing lots and lots of files
into a few directories), I discovered an exception when using the webish
interface to list one of these large directories. The exception came out of
nevow as it attempted to render the 193rd child of the directory:
I'm not sure if there is actual recursion taking place.. it may just be a
function that's returning defer.succeed(result) and this basically surprises
some other piece of code that doesn't expect the Deferred to fire so quickly.
hrm, the exception got mangled during pasting..
grr. I can reproduce this reliably with a live node (the failure sometimes
has a nevow traceback on the web page, sometimes it hits the stack limit
while rendering the failure and you get the "additionally an error occurred
rendering the error page" message). But I can't get my test case to fail
(test_web.WebTest.test_GET_DIRURL_large) at all.
The Deferred fired by webish's data_children() is .paused=1 in the live node,
because somewhere in the upstream chain a callback returned another Deferred
that hasn't fired yet. I can simulate this in test_web by modifying
test.common.FakeMutableFileNode.download_to_data() to add a callback rung
that returns fireEventually(data), and that makes .paused=1 like in the live
node, but it doesn't trigger the failure.
On the "I'm desperate so try to fix it even without a failing test case"
front, adding fireEventually() to data_children() doesn't help. I'm pretty
sure the problem is that Nevow's flattening code (iterflatten?) sees a
Deferred that fires with a big list of stuff, and when the top deferred
fires, it does an addCallback for every item inside it. Each addCallback
fires immediately, which reenters the loop, and that's what blows the stack.
I was able to make the live node stop failing with the following ugly hack
in data_children():
I.e. force iterflatten to slow down by firing each individual item in a
separate turn. Inefficient!
Ok, I have to run, so I'm not going to get this one fixed before I get back
from vacation. The fireEventually hack will work if we're desperate, but I
really don't want to add it, and also I don't want to add it without being
able to produce a failing test case. (the next step for a test case is to do
it in test_system() where we use real dirnodes, but that's slower and results
in failures that are harder to track down than a focussed test like
test_web).
I think this is important enough to bring forward into the 0.7.1 release.
robk pointed out that the test might be passing (when I'd expect it to fail) because the client/consumer is in the same process as the server/producer. Basically the producer flattens a number of entries and writes it to the HTTP socket, but then it has to pause until the socket is drained, and since the reader of that socket is in the same process, it gets a chance to retire more deferreds before it is run again.
If we look at Nevow and don't see any evidence of Producer/Consumer code, that blows this theory out the window. If this is the reason the test is not failing, then a better way to trigger the bug might be to ask nevow to render the directory page to a string, instead of hitting it over HTTP. That might cause it to do the flattening all in a single turn, which ought to trigger the problem.
I seem to recall that Nevow had some way of doing this (so that it can be used in library-mode, driven by something other than twisted.web), but I don't remember exactly how to do it right now.
I tried using renderString(), but that didn't trigger the problem either:
I just thought of a less-inefficient form of that fireEventually()
workaround: only insert the turn break once every 100 items, instead of on
every item. The following code seems to work:
I still haven't been able to come up with a failing unit test for this, but
I'm going to commit this fix and close the ticket anyways.
Fixed in changeset:124b2160b930c180
Some more notes on this problem, so I can find them again later:
Problem One: long list of callbacks, all of them are ready
Each Deferred (we'll call the first one Deferred A) has a list of callback
functions. Each time you do d.addCallback(), this list grows by one element.
When Deferred A fires, the list is executed in a 'while' loop, in
Deferred._runCallbacks. If the callbacks all return either a normal value or
a Failure, then the list is completely consumed during the one call to
_runCallbacks, and everything is fine.
However, when a callback returns another Deferred B (chaining), the first
Deferred A must wait for the second to finish. The code that does this looks
like:
The second Deferred B might have already been fired by this point, either
because it was born ready (created with defer.succeed, or
defer.maybeDeferred), or because whatever was being waited upon has already
occurred.
If this occurs, the subsequent callback in Deferred A's chain will fire (with
B's result), but it will fire through a 6-frame recursive loop instead of
firing on the next pass of the 'while' loop. As a result, each such
ready-to-fire Deferred will add 6 stack frames. 166 such loops are enough to
put more than 1000 frames on the stack, which will exceed Python's default
sys.getrecursionlimit() .
The 6-frame cycle is:
The following sample code will cause this situation:
The exception that this provokes is caught by the Deferred's Failure
mechanisms, but then Twisted has an internal failure while trying to capture
it. The actual Unhandled error in Deferred that gets put into the logs is:
Problem Two: deep chain of callbacks, e.g. recursive delayed polling
The other kind of recursion-limit-violation failures that occurrs with
Deferreds involves a long chain that finally fires. The most common way to
generate such a chain is with a recursive method that separates each call
with a Deferred, such as a polling function that returns a Deferred:
If this function must poll more than 331 times, the reactor tick which
notices the expired timer and fires d.callback will see a
recursion-depth-exceeded exception. The last Deferred fires, which triggers
the _continue callback on the next-to-last Deferred, which allows it to fire,
which triggers the
[-2]
Deferred, etc. This recursive cycle is oflength 3 and has the following frames:
This one is trickier to find, because the root of the stack will be some
internal reactor call rather than application code. In particular, the bottom
of the stack will look like:
The other tricky thing about this failure is that the application code is
sitting on the end of the stack: any callback that is attached to the
Deferred that
wait_until_done
returns will run in a low-stackenvironment. As a result, recursion-depth-exceeded exceptions will be
triggered by seemingly innocent application code. Note how the "DONE" number
changes as you modify the self.count comparsion value in this example:
When this fails, the traceback that shows up in the logs looks like:
Combinations
Note that these two problems can interact. Each ready-to-fire callback
attached to a single Deferred uses 6 stack frames, and each chained callback
uses 3 stack frames. If X6+Y3 > 1000, the code will fail.
Solutions
For problem one, the requirement is that Deferreds never wind up with more
than 166 callbacks that are ready to fire. In other words, there must be at
least one not-ready-to-fire Deferred in each span of 166 callbacks.
One way to accomplish this is to have every 100th call return
foolscap.eventual.fireEventually(result)
instead ofdefer.succeed(result)
. Having every call do this works too, it justslows things down a bit.
For problem two, the requirement is that the depth of the tail-recursion
chain not exceed 331 cycles, minus some room for the code you're eventually
going to attach to the end. One way to accomplish this is to have every 300th
call (or every single call, if you are willing to accept the slowdown) add an
additional
fireEventually
to slow things down.Twisted 10.2.0 has the fix for the stack overflow in chained deferreds: http://twistedmatrix.com/trac/ticket/1833
I suppose this means that some year, when everyone has upgraded to Twisted >= v10.2.0 (which I estimate will be in about 2015), we can simplify the Tahoe-LAFS code by depending on that behavior.
Keywords so I can find this ticket again. (Yes, I know it is fixed, but comment:63648 is a really good description of the problem.)
hm, I think you meant http://twistedmatrix.com/trac/ticket/411