python2.5 setup.py test runs CPU to 100% on 32-bit single-core NetBSD "4" #737
Running as in summary:
python2.5 setup.py test
... inside allmydata-tahoe-1.4.1
Eventually it gets stuck on a specific test and drives the CPU up to 100%. I am attaching a log of the session.
Details of the machine:
NetBSD quickie 4.99.7 NetBSD 4.99.7 (quickie) #0: Tue Jan 2 14:47:23 PST 2007 root@quickie:/v/src/sys/arch/i386/compile/quickie i386
AMD Athlon(tm) XP 2500+ (single-core, 32-bit)
2.5GB RAM
Python 2.5.2
This is a transitional pthread machine, partway through the M:N -> 1:1 threading model transition. The M:N threads should be functional, and for all system purposes and most application purposes (KDE, etc.) they are. However, on occasion some software makes bad assumptions, or is built without threading support because configure detected anomalous behaviour.
Attachment file (6480 bytes) added
Aha! Now this has happened on Black Dew's Debian buildslave:
http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/test/logs/stdio
I think I (partially) figured this out...
Looks like there is a bug somewhere in either Tahoe or Foolscap that causes foolscap.eventual._SimpleCallQueue._turn to be scheduled with a delay of NaN.
From what I can see, the methods in _SimpleCallQueue always install it with reactor.callLater(0, self._turn) -- so it is somewhere else.
Anyway, this triggers a bug in Twisted that makes the delayed call stick in reactor._pendingTimedCalls forever, so no later timed calls get processed. Also, in this state reactor.timeout() returns 0 and the reactor starts iterating without delay, causing 100% CPU load.
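A minimal, self-contained model of that failure mode (an illustrative sketch only, not Twisted's actual reactor code): a call whose firing time is NaN never compares as due, and the computed sleep interval collapses to 0, so the loop spins.

import time

# A pending call whose firing time is NaN, in a loop shaped like Twisted's:
# fire everything whose time has arrived, then sleep until the next firing.
nan = float("nan")
pending = [nan]                            # hypothetical firing times of queued calls
now = time.time()

due = [t for t in pending if t <= now]     # NaN <= now is always False
print(due)                                 # [] -- the NaN-timed call never looks due

sleep_for = max(0, min(pending) - now)     # NaN - now is NaN; max(0, NaN) keeps the 0
print(sleep_for)                           # 0 -- the loop never sleeps, so the CPU spins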
Nice work, bdew! It's interesting that this happens only in [allmydata.test.test_cli.Check.test_check]source:src/allmydata/test/test_cli.py@20090609210509-92b7f-540d8762ba78bcb54be6775a376332d453ea2f7b#L1134, and only sometimes. Examining that particular test case doesn't suggest anything special to me. bdew: could you run the tests with the --random and --until-failure options of trial?

Assigning this ticket to Brian -- it sounds like the foolscap eventual-send mechanism or Twisted might have a bug.

Oh, that's a good question -- what versions of Twisted have we seen this on? bdew's buildbot has Twisted 8.2.0: http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/tahoe-version/logs/stdio . midnightmagic's Twisted version isn't known. There were two other people who reported this bug to me, but they didn't mention Twisted versions.
Attachment bdew-test-random-unpached.txt (5765 bytes) added
trial with --random and --until-failure (got stuck too)
Attachment bdew-patched-twisted.txt (84672 bytes) added
A log of tests with twisted patched to show when it happens
I've attached the requested log. (This is from before trying to catch it in Twisted.)
I've also patched my local Twisted installation to detect the situation, fix it (by resetting the time to 0), and scream about it (look for NAN TIME DETECTED). As you can see from the second log, it happens in other tests too.
yowza. Yeah, since foolscap's eventual() stuff always uses reactor.callLater(0), it must be in Twisted somewhere. Twisted will basically do (time.time()+delay), and we're passing in delay=0, so maybe time.time() is returning NaN?

First step, confirm that you can reproduce this in Foolscap (without Tahoe). Download the latest foolscap-0.4.2 (from pypi.python.org ... I'm having Trac problems on http://foolscap.lothar.com/trac and I can't edit the front page to reflect the 0.4.2 release). Then do trial --until-failure foolscap.test.test_eventual, which will run just the eventual-send-using tests.

Assuming that fails in the same way, try this:
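(A minimal sketch of the kind of standalone check meant here -- per the later discussion, a simple test that just calls time.time() repeatedly and flags a NaN result. repr() is compared against 'nan' because NaN never compares equal to anything, itself included; interrupt with Ctrl-C if it never triggers.)

import time

count = 0
while True:
    t = time.time()
    count += 1
    if repr(t) == 'nan':
        print("time.time() returned NaN after %d calls: %r" % (count, t))
        break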
and see if that explodes. It's also likely that Twisted is subtracting a timestamp at some point (i.e. delay = next_event - time.time()) .. maybe subtraction of numbers that are too close to each other is creating a NaN?

This reminds me of a problem I once had on a MIPS box running a not-quite-ready version of Linux (one of the old Cobalt Qubes). I rebooted it once and Mailman (of all things) threw an exception saying "Eek! PI is greater than four!". I tracked it down to a bug in the kernel: it wasn't saving the floating-point registers during a context switch, so from a single process's point of view, its fp registers were corrupted at random times. Mailman happens to import the "math" module (for its exponential backoff algorithm), which gets its value of Pi from a C extension module (for improved accuracy or something). To make sure that the extension module was compiled and loaded properly, there was an import-time assertion like 3 <= PI <= 4. And the floating-point register which held that value managed to be clobbered by a task-switch in the tiny interval between being loaded from memory and being tested by that assertion.

Sometimes the fringe is a dangerous place :)
I've run the test for a while and it doesn't seem to trigger the bug.
Also, if anyone else runs that, note that t != float("NaN") won't work, because NaN never compares equal to anything (so that test is true even when t really is NaN); use repr(t) != 'nan' instead.
PS: I've reported the bug on the Twisted tracker too: http://twistedmatrix.com/trac/ticket/3884
The test has been running for quite some time... no error.
I'm afraid I'm not good enough with Python to know how to run the foolscap test, though. Can I request just a couple more pointers? Assume I'm very facile with the command line and system-level stuff (and C, for that matter), but not the magic incantations to get Python to know where zope.interface is for importing foolscap, how to actually run "trial --until-failure", and so on.
righto. You had the Tahoe unit tests running, which means it was able to
import or build Foolscap and everything that Foolscap depends upon, so things
should probably be pretty easy.
First step, download the foolscap-0.4.2 tarball from
http://foolscap.lothar.com/trac (the front page has a download link), or from
http://pypi.python.org (search for "foolscap" and you'll get a page with
download links).
Then unpack it and try running "make test". If that works, jump to the
"trial" step below. (/usr/bin/trial is the program that runs the unit tests.
it comes from Twisted. "make test" is a simple wrapper for invoking trial.)
If that can't find a "trial" executable, or if it complains about
ImportErrors, then you may need to install one or more of the Foolscap
dependencies. The dependency graph looks like:
  foolscap -> twisted -> zope.interface
  foolscap -> pyopenssl (optional; Foolscap skips the crypto tests without it)

(zope.interface is a tiny portion of the Zope package, less than 1% by weight. We
don't need all of Zope. Don't let the enormous size of Zope scare you.)
Tahoe goes to great lengths to hide the dependencies from you, by downloading
and installing everything that isn't already available on the system (and
sometimes even things which are on the system, but that's a bug). It can
only install python things, though.. it doesn't know how to build
libopenssl.a .
If you were to install Tahoe, you'd get everything that Foolscap needs too
(Twisted, zope.interface, pyopenssl). That might not include executable
scripts like /usr/bin/trial though.
Foolscap does not behave this way: you have to satisfy its dependencies
yourself.
So, if there's no "trial" on your $PATH, then download the Twisted tarball
from http://twistedmatrix.com , unpack it, and do "sudo python setup.py
install" (if you're willing to install stuff to /usr/lib and /usr/bin .. I
prefer packages or GNU stow, so I usually do "sudo python setup.py install
--prefix /usr/local/stow/XYZ" and then "cd /usr/local/stow && stow -v XYZ" to
put it into place in /usr/local/bin and /usr/local/lib). I believe that
Twisted includes a copy of zope.interface and will install it if necessary..
no need to download something extra.
After that, "
trial --version
" should work, so try foolscap's "make test"again.
Foolscap will skip the crypto tests if pyopenssl is not installed, but for
this bug, we don't care about those.
Then running "
trial --until-failure foolscap.test.test_eventual
" in thefoolscap source directory will tell it to run that one specific test_eventual
over and over again until something fails.
hope that helps.. grab me on IRC if you want to walk through this together.
Ha! It looks like time() is returning NaN on my system once in a while after all.
I went and added a bunch of asserts to find where those NaNs are coming from, and narrowed it down to reactor.callLater -- the input was valid (there is actually a check there already), but the resulting DelayedCall was created with a NaN timeout.
So with another assert there I got:
Note that I added 3 more calls to reactor.seconds() (which is really time.time) and they returned valid results.
So it looks like on this system, something (happening either before this code, or in parallel in another thread) is breaking exactly one call to time().
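For reference, a rough Python-level equivalent of that kind of instrumentation -- a hypothetical wrapper around reactor.callLater that asserts the resulting DelayedCall did not end up with a NaN firing time (the actual asserts described here were added directly inside twisted.internet.base):

from twisted.internet import reactor

_orig_callLater = reactor.callLater

def checked_callLater(delay, func, *args, **kw):
    call = _orig_callLater(delay, func, *args, **kw)
    t = call.getTime()
    # NaN is the only float that is not equal to itself.
    assert t == t, "DelayedCall scheduled with NaN time (delay=%r)" % (delay,)
    return call

reactor.callLater = checked_callLater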
Wow, fun! A quick look at the python-2.6 source (Modules/timemodule.c:floattime) doesn't suggest any obvious way to get a NaN.. it calls the C gettimeofday/ftime/time (depending upon what your platform has), adds the pieces together, and returns the result.
You said that a simple test case that just calls time.time() repeatedly didn't ever fail, right? That's unfortunate.. if we didn't think Tahoe was involved then I'd suggest instrumenting timemodule.c to remember the pieces it got, build the PyFloat, then if it's NaN immediately print out the pieces, so we could figure out what gettimeofday() returned that provoked a NaN.
If there were a low-level threading bug that was clobbering memory, I'd expect to see exceptions or deeper errors than just a NaN. If time() couldn't allocate the memory for the PyFloat object, it would raise an exception instead of returning NaN.
Hm, it's worth noting that floats are formatted to strings (in Objects/floatobject.c:format_float) by doing snprintf(), so if your platform's libc does something funky with snprintf(), that might cause problems. Also, Python doesn't appear to do anything to define or test NaN directly: it just tells C to do a+b or a>=b or whatever. So something weird in your C compiler's implementation of floating-point math (or your CPU) could get involved too.
If you get into this, you might try that sort of instrumentation. If we catch gettimeofday() returning something insane, it's either the kernel or some weird memory corruption that's just not causing anything else to catch fire. If gettimeofday() is behaving, then we should suspect the floatseconds() math or the floating-point operations done afterward.
Another idea is to add code to floatseconds() that stringifies the float and compares it against NaN right away. Then run everything under gdb and put a breakpoint on the 'if' side of that comparison, then start using the tahoe node until it fails in this way. Then look at the local variables in the debugger and see if anything looks suspicious.
boy, you know how to find the fun bugs, don't you? :-)
oh, also, in the python assertion, use tmp=self.seconds() and do your checks on tmp. If things are this weird, we should suspect the stringification routines too, so print tmp several times and make sure it holds still. (i.e. if there were a bug in float-to-string, could that manifest as the same symptoms?)

Okie doke. Had a chance to run things on one of the test-failing machines:
===============================================================================
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_boundMethod_newstyle
new-style classes still broken
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_classMethod_newstyle
new-style classes still broken
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_class_newstyle
new-style classes still broken
SKIPPED: foolscap.test.test_banana.ThereAndBackAgain.test_instance_newstyle
new-style classes still broken
Ran 460 tests in 168.897s
PASSED (skips=4, successes=456)
So, I've done this. I had pyopenssl installed and it succeeded.
I'm running this now and can leave it going overnight. :-) I've nice'd it right up to 20 so I can keep using the machine otherwise. I will report on the results.
I ran the tests overnight and they were still running when I arrived in the morning. The final output was:
Ran 3 tests in 0.024s
PASSED (successes=3)
Test Pass 4390063
foolscap.test.test_eventual
^C TestEventual
testFire ... OK
testFlush ... OK
testSend ... OK
^C------------------------------------------------------------------------------
Ran 3 tests in 0.024s
PASSED (successes=3)
Test Pass 4390064
foolscap.test.test_eventual
TestEventual
testFire ... OK
^C testFlush ... ERROR
===============================================================================
ERROR: foolscap.test.test_eventual.TestEventual.testFlush
Traceback (most recent call last):
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.
Ran 3 tests in 0.048s
^CFAILED (errors=1, successes=2)
^C
Hm, it looks like the buildbot that mm set up just locked up in allmydata.test.test_cli.Check.test_check ... http://allmydata.org/buildbot/builders/MM%20netbsd4%20i386%20warp/builds/6/steps/test/logs/stdio

There's no indication from the build logs whether it was at max CPU usage, of course.

It's strange that mm running it in a shell should behave differently than the buildslave running it.

mm: could you try the same command-line and environment that the buildslave did? It is all listed at the top of that web page. The command was python2.5 setup.py test --reporter=timing and among the environment variables was PYTHONPATH=/home/tahoe/lib/python.

By the way, now that we have both Black Dew's Debian builder and mm's NetBSD builder hooked up, I see that it was also allmydata.test.test_cli.Check.test_check that locked up on Black Dew's: http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/test/logs/stdio

That's the one that Black Dew tracked down to the Python time.time() call returning NaN instead of seconds-since-epoch. Black Dew tracked that down by adding debugging instrumentation in Python and re-running. Brian suggested (in this ticket) further debuggery. Perhaps you two guys could supply mm with a patch that he could apply to determine if his lock-up is of the same source?

Attachment twisted.patch (3269 bytes) added
I've attached the patch (for twisted.internet.base) -- there are a bunch of asserts that will try to catch where the NaN comes from, and if that fails it will yell about it when it reaches the reactor's loop.
On my buildbot only the last piece is used (not the asserts).
Replying to zooko:
eek! the text is getting smaller and smaller!
don't worry about the dropin.cache.new problem, it's a known Twisted issue that's harmless. If you really want to, running "trial --help" as root should make it go away, by giving trial the ability to write to the cache file that it so desperately wants to see.
Hmm, now it segfaulted on the first problematic test:
http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/21/steps/test/logs/stdio
Sadly, I didn't get a dump and can't reproduce the crash.
See also #738, which may be due to the same cause as this issue.
Update: since the fix mentioned in #738 (disabling crypto++ assembly optimizations) was implemented in my local copy here, python2.5 setup.py test NOW COMPLETES THROUGH THE SPOT WHERE IT FROZE BEFORE!!
So I think this specific issue, if those other ones can be closed, can also be closed once we verify that the same fix works for bdew.
Using trunk from pycryptopp and trunk tahoe (via PYTHONPATH and deleting tahoe/support/lib/site-packages/pycrypto*) the test no longer fails and pushes through the spot where it failed before.
.. with full crypto optimizations.
Other things fail, but.. looks like it's just because it's trunk and certainly not the same problem this ticket references.
Same here: with the fixed pycryptopp, all the tests pass without hangs/crashes:
http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/40/steps/test/logs/stdio
Fixed by changeset:9578e70161009035 which increases the requirement on pycryptopp to >= 0.5.15. Note that if you are building pycryptopp against an external libcryptopp, however, then you may still have this bug if your libcryptopp has it.
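To confirm that the environment being tested actually carries the fixed pycryptopp, a quick check (a sketch; it assumes setuptools' pkg_resources is importable, which it is anywhere setup.py test runs):

import pkg_resources

# Raises VersionConflict/DistributionNotFound if the installed pycryptopp
# is older than the release carrying the fix.
pkg_resources.require("pycryptopp>=0.5.15")
print(pkg_resources.get_distribution("pycryptopp").version)

Note that this only checks the Python package version; as mentioned above, building against an external libcryptopp that has the bug would not be caught by it.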