allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a slow machine #846

Open
opened 2009-11-28 22:03:42 +00:00 by zooko · 12 comments

On François's lenny-armv5tel box, allmydata.test.test_system.SystemTest.test_mutable sometimes stops making progress and then gets timed out after 3600 seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/16 and many more. In the cases where that test does pass it takes only a couple of hundred seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/stdio where it took 227 seconds. (In that same passing test run other tests took longer than 227 seconds -- see http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/timings .)

Brian looked at the test.log files from passing and failing examples and said that there was little information there, but that one difference was that in the passing cases that he saw, the time between the beginning of the test case (e.g. 2009-11-20 18:08:54.346Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--) and the first message from Node startup (e.g. 2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403) was about 1 second, and in the failing cases, e.g. start test 2009-11-28 13:36:48.970Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <-- and Node startup 2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397 was about 5 seconds.

So it could be that there is some sort of race condition where if it takes the Node longer than 5 seconds to start up (perhaps waiting to bind to a TCP port or something) then some other part of the test gets confused by having won a race that it didn't expect to win.

Hm, I wonder if I could simulate that on a fast computer by inserting some sort of 10s delay before allowing Node startup to complete...

The next step is to make this test reproducible. François, could you please run just this one test, such as with trial --reporter=verbose --until-failure allmydata.test.test_system.SystemTest.test_mutable and see if you can tell when it passes vs. when it fails? (Maybe it has to do with other processes loading the CPU?) Note that which version of Tahoe-LAFS gets imported and tested by that command-line will be determined by your PYTHONPATH.

François: I'd like to get this fixed so that ARM can be a supported platform for the upcoming v1.6 release, so if you can't do this soon then please either give me or Brian an ssh account on your box or just say "Can't work on this now" so that we can think of some alternative strategies. Thanks!

On François's lenny-armv5tel box, `allmydata.test.test_system.SystemTest.test_mutable` sometimes stops making progress and then gets timed out after 3600 seconds, e.g.: <http://allmydata.org/buildbot/builders/Fran>çois lenny-armv5tel/builds/16 and many more. In the cases where that test does pass it takes only a couple of hundred seconds, e.g.: <http://allmydata.org/buildbot/builders/Fran>çois lenny-armv5tel/builds/8/steps/test/logs/stdio where it took 227 seconds. (In that same passing test run other tests took longer than 227 seconds -- see <http://allmydata.org/buildbot/builders/Fran>çois lenny-armv5tel/builds/8/steps/test/logs/timings .) Brian looked at the test.log files from passing and failing examples and said that there was little information there, but that one difference was that in the passing cases that he saw, the time between the beginning of the test case (e.g. `2009-11-20 18:08:54.346Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--`) and the first message from Node startup (e.g. `2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403`) was about 1 second, and in the failing cases, e.g. start test `2009-11-28 13:36:48.970Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--` and Node startup `2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397` was about 5 seconds. So it could be that there is some sort of race condition where if it takes the Node longer than 5 seconds to start up (perhaps waiting to bind to a TCP port or something) then some other part of the test gets confused by having won a race that it didn't expect to win. Hm, I wonder if I could simulate that on a fast computer by inserting some sort of 10s delay before allowing Node startup to complete... The next step is to make this test reproducible. François, could you please run just this one test, such as with `trial --reporter=verbose --until-failure allmydata.test.test_system.SystemTest.test_mutable` and see if you can tell when it passes vs. when it fails? (Maybe it has to do with other processes loading the CPU?) Note that which version of Tahoe-LAFS gets imported and tested by that command-line will be determined by your PYTHONPATH. François: I'd like to get this fixed so that ARM can be a supported platform for the upcoming v1.6 release, so if you *can't* do this soon then please either give me or Brian an ssh account on your box or just say "Can't work on this now" so that we can think of some alternative strategies. Thanks!
zooko added the
unknown
major
defect
1.5.0
labels 2009-11-28 22:03:42 +00:00
zooko added this to the 1.6.0 milestone 2009-11-28 22:03:42 +00:00
francois commented 2009-11-29 23:47:04 +00:00
Owner

Zooko, that's right they are other CPU intensive processes like Mediatomb and MLDonkey running on this box for about 2 weeks. This probably explain why those tests have been failing recently.

The 'trial --until-failure' is currently running on the box, I'll post the results tomorrow morning (CET).

Zooko, that's right they are other CPU intensive processes like Mediatomb and MLDonkey running on this box for about 2 weeks. This probably explain why those tests have been failing recently. The 'trial --until-failure' is currently running on the box, I'll post the results tomorrow morning (CET).
francois commented 2009-11-30 10:58:14 +00:00
Owner

The test 'allmydata.test.test_system.SystemTest.test_mutable' failed after 20 successful runs, it definitely looks like a race condition. See the attached logfile for details.

The test 'allmydata.test.test_system.SystemTest.test_mutable' failed after 20 successful runs, it definitely looks like a race condition. See the attached logfile for details.
francois commented 2009-11-30 10:58:36 +00:00
Owner

Attachment tahoe-846-test-fail.txt (7943 bytes) added

**Attachment** tahoe-846-test-fail.txt (7943 bytes) added
francois commented 2009-11-30 11:02:42 +00:00
Owner

Attachment test.log (19363 bytes) added

**Attachment** test.log (19363 bytes) added
19 KiB
tahoe-lafs added
code-mutable
and removed
unknown
labels 2009-12-01 00:16:53 +00:00
francois commented 2009-12-05 17:22:20 +00:00
Owner

The failure seems related to the connect_timeout expiration in foolscap. This is currently outside of my understanding of the Tahoe codebase

Brian, do you see any hint in these logfiles ? Should you need an SSH account on this box, just send me your public key by email

The failure seems related to the connect_timeout expiration in foolscap. This is currently outside of my understanding of the Tahoe codebase Brian, do you see any hint in these logfiles ? Should you need an SSH account on this box, just send me your public key by email
Author

Why do you say that it seems related to the connect_timeout expiration?

Why do you say that it seems related to the `connect_timeout` expiration?
francois commented 2009-12-06 11:47:09 +00:00
Owner

Because of that kind of log messages which appears in test.log:

2009-11-30 01:07:46.730Z [-] Reconnector._failed (furl=pb://z3g3meylekyi7yjo4azff2xbqwjbuioa@62.220.138.135:42639,127.0.0.1:42639/kdnymib6zqlgwo6nrihhofctqoo7chfe): [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.NegotiationError'>: no connection established within client timeout
        ]
Because of that kind of log messages which appears in test.log: ``` 2009-11-30 01:07:46.730Z [-] Reconnector._failed (furl=pb://z3g3meylekyi7yjo4azff2xbqwjbuioa@62.220.138.135:42639,127.0.0.1:42639/kdnymib6zqlgwo6nrihhofctqoo7chfe): [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.NegotiationError'>: no connection established within client timeout ] ```
davidsarah commented 2009-12-06 16:50:55 +00:00
Owner

#838 was a duplicate. It has another log attached to it; I don't know whether that provides any more information.

#838 was a duplicate. It has another log attached to it; I don't know whether that provides any more information.
davidsarah commented 2009-12-06 16:52:09 +00:00
Owner
Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/846#issuecomment-73738): > #838 ... has another log attached to it I meant, it links to <http://allmydata.org/buildbot/builders/Fran%C3%A7ois%20lenny-armv5tel/builds/12/steps/test/logs/stdio>
Author

Well, I haven't had time to investigate this, and I guess [Buildbot Policy wiki/BuildbotPolicy]our says that we need to demote ARM from the Supported Platforms category until someone has time to fix it. :-(

I'll do the demotion sometime today unless someone jumps in to rescue the ARM platform from the ignominy.

Well, I haven't had time to investigate this, and I guess [Buildbot Policy [wiki/BuildbotPolicy](wiki/BuildbotPolicy)]our says that we need to demote ARM from the Supported Platforms category until someone has time to fix it. :-( I'll do the demotion sometime today unless someone jumps in to rescue the ARM platform from the ignominy.
zooko modified the milestone from 1.6.0 to eventually 2010-01-26 15:44:40 +00:00
Author

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)
zooko modified the milestone from eventually to 1.8.0 2010-05-27 22:11:00 +00:00
zooko modified the milestone from 1.8.0 to soon 2010-08-14 06:45:57 +00:00
davidsarah commented 2011-09-09 18:26:02 +00:00
Owner

ticket:1084#comment:73736 (but probably not the main issue in that ticket) may be an instance of the bug in this ticket. It occurs in a different method of SystemTest, but the error is almost identical to the first one in tahoe-846-test-fail.txt.

ticket:1084#[comment:73736](/tahoe-lafs/trac-2024-07-25/issues/846#issuecomment-73736) (but probably not the main issue in that ticket) may be an instance of the bug in this ticket. It occurs in a different method of SystemTest, but the error is almost identical to the first one in [tahoe-846-test-fail.txt](/tahoe-lafs/trac-2024-07-25/attachments/000078ac-8d16-3d76-c569-31b4b655df12).
warner was unassigned by zooko 2011-09-28 16:23:03 +00:00
zooko self-assigned this 2011-09-28 16:23:03 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#846
No description provided.