python2.5 setup.py test runs CPU to 100% on 32-bit single-core NetBSD "4" #737
Running as in summary:
python2.5 setup.py test
... inside allmydata-tahoe-1.4.1
Eventually it gets stuck on a specific test and drives the CPU up to 100%. I am attaching a log of the session.
Details of the machine:
NetBSD quickie 4.99.7 NetBSD 4.99.7 (quickie) #0: Tue Jan 2 14:47:23 PST 2007 root@quickie:/v/src/sys/arch/i386/compile/quickie i386
AMD Athlon(tm) XP 2500+ (single-core, 32-bit)
2.5GB RAM
Python 2.5.2
This is a transitional pthread machine, partway through the M:N -> 1:1 threading model transition. The M:N threads should be functional, and for all system purposes and most application purposes (KDE, etc.) they are. However, on occasion some software makes bad assumptions, or is built without threading support because configure detected anomalous behaviour.
Attachment file (6480 bytes) added
Aha! Now this has happened on Black Dew's Debian buildslave:
http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/test/logs/stdio
I think I (partially) figured this out...
Looks like there is a bug somewhere in either Tahoe or Foolscap that causes foolscap.eventual._SimpleCallQueue._turn to be scheduled with a delay of NaN.
From what I can see, the methods in _SimpleCallQueue always install it with reactor.callLater(0, self._turn) -- so it is somewhere else.
Anyway, this triggers a bug in Twisted that makes the delayed call stick in reactor._pendingTimedCalls forever, so no later timed calls get processed. Also, in this state reactor.timeout() returns 0 and the reactor starts iterating without delay, causing 100% CPU load.
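A minimal, self-contained model of that failure mode (an illustrative sketch only, not Twisted's actual reactor code): a call whose firing time is NaN never compares as due, and the computed sleep interval collapses to 0, so the loop spins.

import time

# A pending call whose firing time is NaN, in a loop shaped like Twisted's:
# fire everything whose time has arrived, then sleep until the next firing.
nan = float("nan")
pending = [nan]                            # hypothetical firing times of queued calls
now = time.time()

due = [t for t in pending if t <= now]     # NaN <= now is always False
print(due)                                 # [] -- the NaN-timed call never looks due

sleep_for = max(0, min(pending) - now)     # NaN - now is NaN; max(0, NaN) keeps the 0
print(sleep_for)                           # 0 -- the loop never sleeps, so the CPU spins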
Nice work, bdew! It's interesting that this happens only in [allmydata.test.test_cli.Check.test_check]source:src/allmydata/test/test_cli.py@20090609210509-92b7f-540d8762ba78bcb54be6775a376332d453ea2f7b#L1134, and only sometimes. Examining that particular test case doesn't suggest anything special to me. bdew: could you run the tests with the --random and --until-failure options of trial?

Assigning this ticket to Brian -- it sounds like the foolscap eventual-send mechanism or Twisted might have a bug.

Oh, that's a good question -- what versions of Twisted have we seen this on? bdew's buildbot has Twisted 8.2.0: http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/tahoe-version/logs/stdio . midnightmagic's Twisted version isn't known. There were two other people who reported this bug to me, but they didn't mention Twisted versions.
Attachment bdew-test-random-unpached.txt (5765 bytes) added
trial with --random and --until-failure (got stuck too)
Attachment bdew-patched-twisted.txt (84672 bytes) added
A log of tests with twisted patched to show when it happens
I've attached the requested log. (This is from before trying to catch it in Twisted.)
I've also patched my local Twisted installation to detect the situation, fix it (by resetting the time to 0), and scream about it (look for NAN TIME DETECTED). As you can see from the second log, it happens in other tests too.
yowza. Yeah, since foolscap's eventual() stuff always uses reactor.callLater(0), it must be in Twisted somewhere. Twisted will basically do (time.time()+delay), and we're passing in delay=0, so maybe time.time() is returning NaN?

First step, confirm that you can reproduce this in Foolscap (without Tahoe). Download the latest foolscap-0.4.2 (from pypi.python.org ... I'm having Trac problems on http://foolscap.lothar.com/trac and I can't edit the front page to reflect the 0.4.2 release). Then do trial --until-failure foolscap.test.test_eventual, which will run just the eventual-send-using tests.

Assuming that fails in the same way, try this:
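(A minimal sketch of the kind of standalone check meant here -- per the later discussion, a simple test that just calls time.time() repeatedly and flags a NaN result. repr() is compared against 'nan' because NaN never compares equal to anything, itself included; interrupt with Ctrl-C if it never triggers.)

import time

count = 0
while True:
    t = time.time()
    count += 1
    if repr(t) == 'nan':
        print("time.time() returned NaN after %d calls: %r" % (count, t))
        break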
and see if that explodes. It's also likely that Twisted is subtracting a timestamp at some point (i.e. delay = next_event - time.time()) .. maybe subtraction of numbers that are too close to each other is creating a NaN?

This reminds me of a problem I once had on a MIPS box running a not-quite-ready version of Linux (one of the old Cobalt Qubes). I rebooted it once and Mailman (of all things) threw an exception saying "Eek! PI is greater than four!". I tracked it down to a bug in the kernel: it wasn't saving the floating-point registers during a context switch, so from a single process's point of view, its fp registers were corrupted at random times. Mailman happens to import the "math" module (for its exponential backoff algorithm), which gets its value of Pi from a C extension module (for improved accuracy or something). To make sure that the extension module was compiled and loaded properly, there was an import-time assertion like 3 <= PI <= 4. And the floating-point register which held that value managed to be clobbered by a task-switch in the tiny interval between being loaded from memory and being tested by that assertion.

Sometimes the fringe is a dangerous place :)
I've run the test for a while and it doesn't seem to trigger the bug.
Also, if anyone else runs that, note that t != float("NaN") won't work, because NaN never compares equal to anything (so that test is true even when t really is NaN); use repr(t) != 'nan' instead.
PS: I've reported the bug on the Twisted tracker too: http://twistedmatrix.com/trac/ticket/3884
The test has been running for quite some time... no error.
I'm afraid I'm not good enough with Python to know how to run the foolscap test, though. Can I request just a couple more pointers? Assume I'm very facile with the command line and system-level stuff (and C, for that matter), but not the magic incantations to get Python to know where zope.interface is for importing foolscap, how to actually run "trial --until-failure", and so on.
righto. You had the Tahoe unit tests running, which means it was able to
import or build Foolscap and everything that Foolscap depends upon, so things
should probably be pretty easy.
First step, download the foolscap-0.4.2 tarball from
http://foolscap.lothar.com/trac (the front page has a download link), or from
http://pypi.python.org (search for "foolscap" and you'll get a page with
download links).
Then unpack it and try running "make test". If that works, jump to the
"trial" step below. (/usr/bin/trial is the program that runs the unit tests.
it comes from Twisted. "make test" is a simple wrapper for invoking trial.)
If that can't find a "trial" executable, or if it complains about
ImportErrors, then you may need to install one or more of the Foolscap
dependencies. The dependency graph looks like:
  foolscap -> twisted -> zope.interface
  foolscap -> pyopenssl (optional; Foolscap skips the crypto tests without it)

(zope.interface is a tiny portion of the Zope package, less than 1% by weight. We
don't need all of Zope. Don't let the enormous size of Zope scare you.)
Tahoe goes to great lengths to hide the dependencies from you, by downloading
and installing everything that isn't already available on the system (and
sometimes even things which are on the system, but that's a bug). It can
only install python things, though.. it doesn't know how to build
libopenssl.a .
If you were to install Tahoe, you'd get everything that Foolscap needs too
(Twisted, zope.interface, pyopenssl). That might not include executable
scripts like /usr/bin/trial though.
Foolscap does not behave this way: you have to satisfy its dependencies
yourself.
So, if there's no "trial" on your $PATH, then download the Twisted tarball
from http://twistedmatrix.com , unpack it, and do "sudo python setup.py
install" (if you're willing to install stuff to /usr/lib and /usr/bin .. I
prefer packages or GNU stow, so I usually do "sudo python setup.py install
--prefix /usr/local/stow/XYZ" and then "cd /usr/local/stow && stow -v XYZ" to
put it into place in /usr/local/bin and /usr/local/lib). I believe that
Twisted includes a copy of zope.interface and will install it if necessary..
no need to download something extra.
After that, "
trial --version
" should work, so try foolscap's "make test"again.
Foolscap will skip the crypto tests if pyopenssl is not installed, but for
this bug, we don't care about those.
Then running "
trial --until-failure foolscap.test.test_eventual
" in thefoolscap source directory will tell it to run that one specific test_eventual
over and over again until something fails.
hope that helps.. grab me on IRC if you want to walk through this together.
Ha! It looks like time() is returning NaN on my system once in a while after all.
I went and added a bunch of asserts to find where those NaNs are coming from, and narrowed it down to reactor.callLater -- the input was valid (there is actually a check there already), but the resulting DelayedCall was created with a NaN timeout.
So with another assert there I got:
Note that I added 3 more calls to reactor.seconds() (which is really time.time) and they returned valid results.
So it looks like on this system, something (happening either before this code, or in parallel in another thread) is breaking exactly one call to time().
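For reference, a rough Python-level equivalent of that kind of instrumentation -- a hypothetical wrapper around reactor.callLater that asserts the resulting DelayedCall did not end up with a NaN firing time (the actual asserts described here were added directly inside twisted.internet.base):

from twisted.internet import reactor

_orig_callLater = reactor.callLater

def checked_callLater(delay, func, *args, **kw):
    call = _orig_callLater(delay, func, *args, **kw)
    t = call.getTime()
    # NaN is the only float that is not equal to itself.
    assert t == t, "DelayedCall scheduled with NaN time (delay=%r)" % (delay,)
    return call

reactor.callLater = checked_callLater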
Wow, fun! A quick look at the python-2.6 source (Modules/timemodule.c:floattime) doesn't suggest any obvious way to get a NaN.. it calls the C gettimeofday/ftime/time (depending upon what your platform has), adds the pieces together, and returns the result.
You said that a simple test case that just calls time.time() repeatedly didn't ever fail, right? That's unfortunate.. if we didn't think Tahoe was involved then I'd suggest instrumenting timemodule.c to remember the pieces it got, build the PyFloat, then if it's NaN immediately print out the pieces, so we could figure out what gettimeofday() returned that provoked a NaN.
If there were a low-level threading bug that was clobbering memory, I'd expect to see exceptions or deeper errors than just a NaN. If time() couldn't allocate the memory for the PyFloat object, it would raise an exception instead of returning NaN.
Hm, it's worth noting that floats are formatted to strings (in Objects/floatobject.c:format_float) by doing snprintf(), so if your platform's libc does something funky with snprintf(), that might cause problems. Also, Python doesn't appear to do anything to define or test NaN directly: it just tells C to do a+b or a>=b or whatever. So something weird in your C compiler's implementation of floating-point math (or your CPU) could get involved too.
If you get into this, you might try that sort of instrumentation. If we catch gettimeofday() returning something insane, it's either the kernel or some weird memory corruption that's just not causing anything else to catch fire. If gettimeofday() is behaving, then we should suspect the floatseconds() math or the floating-point operations done afterward.
Another idea is to add code to floatseconds() that stringifies the float and compares it against NaN right away. Then run everything under gdb and put a breakpoint on the 'if' side of that comparison, then start using the tahoe node until it fails in this way. Then look at the local variables in the debugger and see if anything looks suspicious.
boy, you know how to find the fun bugs, don't you? :-)
oh, also, in the python assertion, use tmp=self.seconds() and do your checks on tmp. If things are this weird, we should suspect the stringification routines too, so print tmp several times and make sure it holds still. (i.e. if there were a bug in float-to-string, could that manifest as the same symptoms?)

Okie doke. Had a chance to run things on one of the test-failing machines:
===============================================================================
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_boundMethod_newstyle
new-style classes still broken
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_classMethod_newstyle
new-style classes still broken
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_class_newstyle
new-style classes still broken
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_instance_newstyle
new-style classes still broken
Ran 460 tests in 168.897s
PASSED (skips=4, successes=456)
So, I've done this. I had pyopenssl installed and it succeeded.
I'm running this now and can leave it going overnight. :-) I've nice'd it right up to 20 so I can keep using the machine otherwise. I will report on the results.
I ran the tests overnight and they were still running when I arrived in the morning. The final output was:
Ran 3 tests in 0.024s
PASSED (successes=3)
Test Pass 4390063
foolscap.test.test_eventual
^C TestEventual
testFire ... OK
testFlush ... OK
testSend ... OK
^C------------------------------------------------------------------------------
Ran 3 tests in 0.024s
PASSED (successes=3)
Test Pass 4390064
foolscap.test.test_eventual
TestEventual
testFire ... OK
^C testFlush ... ERROR
===============================================================================
ERROR: foolscap.test.test_eventual.TestEventual.testFlush
Traceback (most recent call last):
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.
Ran 3 tests in 0.048s
^CFAILED (errors=1, successes=2)
^C
Hm, it looks like the buildbot that mm set up just locked up in allmydata.test.test_cli.Check.test_check ... http://allmydata.org/buildbot/builders/MM%20netbsd4%20i386%20warp/builds/6/steps/test/logs/stdio

There's no indication from the build logs whether it was at max CPU usage, of course.

It's strange that mm running it in a shell should behave differently than the buildslave running it.

mm: could you try the same command-line and environment that the buildslave did? It is all listed at the top of that web page. The command was python2.5 setup.py test --reporter=timing and among the environment variables was PYTHONPATH=/home/tahoe/lib/python.

By the way, now that we have both Black Dew's Debian builder and mm's NetBSD builder hooked up, I see that it was also allmydata.test.test_cli.Check.test_check that locked up on Black Dew's: http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/test/logs/stdio

That's the one that Black Dew tracked down to the Python time.time() call returning NaN instead of seconds-since-epoch. Black Dew tracked that down by adding debugging instrumentation in Python and re-running. Brian suggested (in this ticket) further debuggery. Perhaps you two guys could supply mm with a patch that he could apply to determine if his lock-up is of the same source?

Attachment twisted.patch (3269 bytes) added
I've attached the patch (for twisted.internet.base) -- there are a bunch of asserts that will try to catch where the NaN comes from, and if that fails it will yell about it when it reaches the reactor's loop.
On my buildbot only the last piece is used (not the asserts).
Replying to zooko:
eek! the text is getting smaller and smaller!
don't worry about the dropin.cache.new problem, it's a known Twisted issue that's harmless. If you really want to, running "trial --help" as root should make it go away, by giving trial the ability to write to the cache file that it so desperately wants to see.
Hmm, now it segfaulted on the first problematic test:
http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/21/steps/test/logs/stdio
Sadly, I didn't get a dump and can't reproduce the crash.
See also #738, which may be due to the same cause as this issue.
Update: since the fix mentioned in #738 (disabling crypto++ assembly optimizations) was implemented in my local copy here, python2.5 setup.py test NOW COMPLETES THROUGH THE SPOT WHERE IT FROZE BEFORE!!
So I think this specific issue, if those other ones can be closed, can also be closed once we verify that the same fix works for bdew.
Using trunk from pycryptopp and trunk tahoe (via PYTHONPATH and deleting tahoe/support/lib/site-packages/pycrypto*) the test no longer fails and pushes through the spot where it failed before.
.. with full crypto optimizations.
Other things fail, but.. looks like it's just because it's trunk and certainly not the same problem this ticket references.
Same here: with the fixed pycryptopp, all the tests pass without hangs/crashes:
http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/40/steps/test/logs/stdio
Fixed by changeset:9578e70161009035 which increases the requirement on pycryptopp to >= 0.5.15. Note that if you are building pycryptopp against an external libcryptopp, however, then you may still have this bug if your libcryptopp has it.
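To confirm that the environment being tested actually carries the fixed pycryptopp, a quick check (a sketch; it assumes setuptools' pkg_resources is importable, which it is anywhere setup.py test runs):

import pkg_resources

# Raises VersionConflict/DistributionNotFound if the installed pycryptopp
# is older than the release carrying the fix.
pkg_resources.require("pycryptopp>=0.5.15")
print(pkg_resources.get_distribution("pycryptopp").version)

Note that this only checks the Python package version; as mentioned above, building against an external libcryptopp that has the bug would not be caught by it.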