occasional failure in iputil (timeout in test_runner): use 'netifaces' package? #532

Closed
opened 2008-11-03 22:00:13 +00:00 by warner · 8 comments

I'm seeing very occasional failures in the allmydata.test.test_runner.RunNode.test_introducer test. To reproduce it, in one shell I run:

run_to_death.pl 'make quicktest TEST=allmydata.test.test_runner.RunNode.test_introducer'

(where run_to_death.pl is a little Perl script I've got to just keep running the same command over and over again until the exit status is nonzero)

while in another shell I slow things down by doing python -c "while 1: pass".

This usually fails after about 10 minutes.
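
The retry loop described above could be sketched in Python roughly as follows. This is only a stand-in: run_to_death.pl itself is not shown in this ticket, and the function name `run_until_failure` and the `max_runs` parameter are hypothetical.

```python
import subprocess
import sys


def run_until_failure(command, max_runs=None):
    """Run `command` through the shell until it exits with a nonzero status.

    Returns (runs, exit_status). exit_status is 0 only if `max_runs`
    was reached without any failure.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        status = subprocess.call(command, shell=True)
        if status != 0:
            return runs, status
    return runs, 0


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: run_until_failure.py 'command to repeat'")
    runs, status = run_until_failure(" ".join(sys.argv[1:]))
    print("stopped after %d runs with exit status %d" % (runs, status))
    sys.exit(status)
```

It would be invoked the same way as the Perl script, with the `make quicktest ...` command as its argument.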

The actual failure is a timeout. It appears that the iputil.py routine that uses reactor.spawnProcess to run /sbin/ifconfig (to figure out which interfaces are available and therefore what local IP addresses we should advertise) just plain fails: the Deferred never fires. My hunch is that somehow the SIGCHLD handler is broken, so the child process has finished but the parent doesn't notice.

This doesn't happen frequently enough to really worry about, but some day it'd be nice to fix it.

One possibility is to switch to the 'python-netifaces' tool, which unfortunately has compiled C code, but which claims to be fairly cross-platform and probably doesn't require a separate command to be spawned.
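
For illustration only, here is one standard-library way to learn a local address without spawning any child process. Note that this is a different technique from netifaces: it asks the OS which source address it would use for a given destination, so it yields a single address rather than a list of interfaces. The function name `local_address_for` is hypothetical, and the default target 192.0.2.1 is just a documentation-range address picked as a placeholder.

```python
import socket


def local_address_for(target="192.0.2.1", port=9):
    """Return the local IPv4 address the OS would route `target` through.

    Returns None if no route is available. `target` never receives any
    traffic: connect() on a UDP socket only selects a route.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((target, port))
        return sock.getsockname()[0]
    except OSError:
        return None
    finally:
        sock.close()


if __name__ == "__main__":
    print(local_address_for())
```

Because nothing is spawned, no SIGCHLD handling is involved at all, which is the property that makes a netifaces-style approach attractive here.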

warner added the code, minor, defect, 1.2.0 labels 2008-11-03 22:00:13 +00:00
warner added this to the undecided milestone 2008-11-03 22:00:13 +00:00

So if your hunch is correct then this reveals the existence of a bug in Twisted?

Author

Seems plausible, yes. A smaller test case (which I don't quite have the time to build right now) would be to just run the /sbin/ifconfig command via reactor.spawnProcess, gathering but mostly ignoring the output, and then see if that can be made to fail.
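
A minimal sketch of what such a smaller test case might look like, assuming a Twisted recent enough to provide `Deferred.addTimeout` and `task.react`. It uses `twisted.internet.utils.getProcessOutput`, which spawns the child through `reactor.spawnProcess`, and turns a Deferred that never fires into a `TimeoutError`. The timeout and iteration count are made-up values, and this is not the code in iputil.py.

```python
"""Run /sbin/ifconfig repeatedly under Twisted until one run hangs."""

IFCONFIG = "/sbin/ifconfig"  # path quoted in the report
TIMEOUT_SECONDS = 30         # assumed value, not from the report
ITERATIONS = 1000            # assumed value, not from the report


def run_once(reactor):
    """Spawn ifconfig once; the Deferred fires with the output size in bytes."""
    # Imported lazily so this module still loads where Twisted is absent.
    from twisted.internet.utils import getProcessOutput

    # getProcessOutput() starts the child via reactor.spawnProcess.
    d = getProcessOutput(IFCONFIG, args=(), env={}, reactor=reactor,
                         errortoo=True)
    # If the Deferred never fires (the suspected bug), fail with TimeoutError.
    d.addTimeout(TIMEOUT_SECONDS, reactor)
    # Gather the output but mostly ignore it: keep only its length.
    d.addCallback(len)
    return d


def main(reactor):
    """Run the spawn in a loop; the first timeout ends the loop with a failure."""
    from twisted.internet import defer

    @defer.inlineCallbacks
    def loop():
        for i in range(ITERATIONS):
            size = yield run_once(reactor)
            print("run %d: %d bytes of output" % (i + 1, size))

    return loop()


if __name__ == "__main__":
    from twisted.internet.task import react
    react(main)
```

Running this next to a CPU-bound `python -c "while 1: pass"` would mimic the load used in the original reproduction.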


Should trial --until-failure allmydata.test.test_runner.RunNode.test_introducer also trigger the bug, then?

I'm running trial --until-failure pyutil.test.test_iputil.


trial --until-failure pyutil.test.test_iputil wasn't able to reproduce this failure after about an hour of running. I'll try Brian's script next.


Okay, now I'm running this script:

time ( /bin/true; while [ $? = 0 ] ; do trial pyutil.test.test_iputil; done ) &> x.txt

Okay, I let that script run all day and it didn't fail. Also the workstation (yukyuk) was loaded down with other jobs at the same time.

Author

I ran this test in a loop on a loaded box for a while and it didn't fail either, so maybe it's been fixed in whatever new version of Twisted I'm using now. It sounds like we can let this one go. Closing as "works for me".

warner added the worksforme label 2009-06-21 20:22:02 +00:00
warner modified the milestone from undecided to 1.5.0 2009-06-21 20:22:02 +00:00
davidsarah commented 2012-12-16 15:22:50 +00:00
Owner

Note that there definitely are nondeterministic bugs due to how we spawn the command for iputil; see #1381. I think that bug would not cause a timeout, though.

Reference: tahoe-lafs/trac-2024-07-25#532