occasional failure in iputil (timeout in test_runner): use 'netifaces' package? #532
I'm seeing very occasional failures in the allmydata.test.test_runner.RunNode.test_introducer test. To reproduce it, in one shell I run the failing test in a loop via run_to_death.pl (a little perl script I've got that just keeps running the same command over and over again until the exit status is nonzero), while in another shell I slow things down by doing

python -c "while 1: pass"

This usually fails after about 10 minutes.
The actual failure is a timeout. It appears that the iputil.py routine that uses reactor.spawnProcess to run /sbin/ifconfig (to figure out which interfaces are available, and therefore which local IP addresses we should advertise) just plain fails: the Deferred never fires. My hunch is that somehow the SIGCHLD handler is broken, so the child process has finished but the parent never notices. This doesn't happen frequently enough to really worry about, but some day it would be nice to fix it.
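For context, the spawn-and-collect pattern being described looks roughly like the sketch below. This is not the actual iputil.py code; the list_interfaces name and the ifconfig arguments are illustrative. The point is that the Deferred is only fired from processEnded, and processEnded is driven by the reactor's SIGCHLD handling, so a lost signal leaves the Deferred unfired forever.

from twisted.internet import defer, protocol, reactor

class _IfconfigProtocol(protocol.ProcessProtocol):
    # Collects stdout from the child and fires a Deferred when it exits.
    def __init__(self, done):
        self.done = done
        self.output = []

    def outReceived(self, data):
        self.output.append(data)

    def processEnded(self, reason):
        # Reached only after the reactor reaps the child (SIGCHLD); if that
        # never happens, self.done never fires and the caller times out.
        self.done.callback(b"".join(self.output))

def list_interfaces():
    done = defer.Deferred()
    proto = _IfconfigProtocol(done)
    reactor.spawnProcess(proto, "/sbin/ifconfig", ["/sbin/ifconfig", "-a"])
    return done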
One possibility is to switch to the 'python-netifaces' package, which unfortunately includes compiled C code, but which claims to be fairly cross-platform and probably doesn't require a separate command to be spawned.
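For comparison, a netifaces-based version would stay entirely in-process, something like this rough sketch (assuming the netifaces package is installed; the function name is made up, not Tahoe's actual API):

import netifaces

def get_local_ipv4_addresses():
    # Enumerate IPv4 addresses of all interfaces without spawning ifconfig.
    addresses = []
    for iface in netifaces.interfaces():
        # ifaddresses() returns a dict keyed by address family; AF_INET is
        # absent on interfaces that have no IPv4 address.
        for entry in netifaces.ifaddresses(iface).get(netifaces.AF_INET, []):
            addresses.append(entry["addr"])
    return addresses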
So if your hunch is correct, then this reveals the existence of a bug in Twisted?
Seems plausible, yes. A smaller test case (which I don't quite have the time to build right now) would be to just run the /sbin/ifconfig command via reactor.spawnProcess, gathering but mostly ignoring the output, and then see whether that can be made to fail.
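A rough sketch of that smaller test case, reusing the hypothetical list_interfaces() from the sketch above: spawn ifconfig in a loop, put a generous timeout on each run, and report if any run's Deferred never fires.

from twisted.internet import defer, reactor

TIMEOUT = 60  # seconds; generous enough that hitting it means a real hang

@defer.inlineCallbacks
def stress(iterations=10000):
    for i in range(iterations):
        d = list_interfaces()  # spawn /sbin/ifconfig, as sketched above
        timer = reactor.callLater(TIMEOUT, d.cancel)
        try:
            yield d            # normally fires almost immediately
        except defer.CancelledError:
            print("run %d hung: the Deferred never fired" % i)
            break
        if timer.active():
            timer.cancel()
    reactor.stop()

reactor.callWhenRunning(stress)
reactor.run()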
Should trial --until-failure allmydata.test.test_runner.RunNode.test_introducer also trigger the bug, then? I'm running trial --until-failure pyutil.test.test_iputil.

trial --until-failure pyutil.test.test_iputil wasn't able to reproduce this failure after about an hour of running. I'll try Brian's script next.

Okay, now I'm running this script:
Okay, I let that script run all day and it didn't fail. Also the workstation (yukyuk) was loaded down with other jobs at the same time.
I ran this test in a loop on a loaded box for a while and it didn't fail either, so maybe it's been fixed in whatever new version of Twisted I'm using now. It sounds like we can let this one go. Closing as "works for me".
Note that there definitely are nondeterministic bugs due to how we spawn the command for iputil; see #1381. I think that bug would not cause a timeout, though.