allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a slow machine #846
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#846
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
On François's lenny-armv5tel box,
allmydata.test.test_system.SystemTest.test_mutable
sometimes stops making progress and then gets timed out after 3600 seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/16 and many more. In the cases where that test does pass it takes only a couple of hundred seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/stdio where it took 227 seconds. (In that same passing test run other tests took longer than 227 seconds -- see http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/timings .)Brian looked at the test.log files from passing and failing examples and said that there was little information there, but that one difference was that in the passing cases that he saw, the time between the beginning of the test case (e.g.
2009-11-20 18:08:54.346Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--
) and the first message from Node startup (e.g.2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403
) was about 1 second, and in the failing cases, e.g. start test2009-11-28 13:36:48.970Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--
and Node startup2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397
was about 5 seconds.So it could be that there is some sort of race condition where if it takes the Node longer than 5 seconds to start up (perhaps waiting to bind to a TCP port or something) then some other part of the test gets confused by having won a race that it didn't expect to win.
Hm, I wonder if I could simulate that on a fast computer by inserting some sort of 10s delay before allowing Node startup to complete...
The next step is to make this test reproducible. François, could you please run just this one test, such as with
trial --reporter=verbose --until-failure allmydata.test.test_system.SystemTest.test_mutable
and see if you can tell when it passes vs. when it fails? (Maybe it has to do with other processes loading the CPU?) Note that which version of Tahoe-LAFS gets imported and tested by that command-line will be determined by your PYTHONPATH.François: I'd like to get this fixed so that ARM can be a supported platform for the upcoming v1.6 release, so if you can't do this soon then please either give me or Brian an ssh account on your box or just say "Can't work on this now" so that we can think of some alternative strategies. Thanks!
Zooko, that's right they are other CPU intensive processes like Mediatomb and MLDonkey running on this box for about 2 weeks. This probably explain why those tests have been failing recently.
The 'trial --until-failure' is currently running on the box, I'll post the results tomorrow morning (CET).
The test 'allmydata.test.test_system.SystemTest.test_mutable' failed after 20 successful runs, it definitely looks like a race condition. See the attached logfile for details.
Attachment tahoe-846-test-fail.txt (7943 bytes) added
Attachment test.log (19363 bytes) added
The failure seems related to the connect_timeout expiration in foolscap. This is currently outside of my understanding of the Tahoe codebase
Brian, do you see any hint in these logfiles ? Should you need an SSH account on this box, just send me your public key by email
Why do you say that it seems related to the
connect_timeout
expiration?Because of that kind of log messages which appears in test.log:
#838 was a duplicate. It has another log attached to it; I don't know whether that provides any more information.
Replying to davidsarah:
I meant, it links to http://allmydata.org/buildbot/builders/Fran%C3%A7ois%20lenny-armv5tel/builds/12/steps/test/logs/stdio
Well, I haven't had time to investigate this, and I guess [Buildbot Policy wiki/BuildbotPolicy]our says that we need to demote ARM from the Supported Platforms category until someone has time to fix it. :-(
I'll do the demotion sometime today unless someone jumps in to rescue the ARM platform from the ignominy.
It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)
ticket:1084#comment:73736 (but probably not the main issue in that ticket) may be an instance of the bug in this ticket. It occurs in a different method of SystemTest, but the error is almost identical to the first one in tahoe-846-test-fail.txt.