allmydata.test.test_system.SystemTest.test_mutable sometimes hangs on a slow machine #846

New Issue

zooko · 2009-11-28T22:03:42Z

zooko commented

2009-11-28 22:03:42 +00:00

On François's lenny-armv5tel box, allmydata.test.test_system.SystemTest.test_mutable sometimes stops making progress and then gets timed out after 3600 seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/16 and many more. In the cases where that test does pass it takes only a couple of hundred seconds, e.g.: http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/stdio where it took 227 seconds. (In that same passing test run other tests took longer than 227 seconds -- see http://allmydata.org/buildbot/builders/François lenny-armv5tel/builds/8/steps/test/logs/timings .)

Brian looked at the test.log files from passing and failing examples and said that there was little information there, but that one difference was that in the passing cases that he saw, the time between the beginning of the test case (e.g. 2009-11-20 18:08:54.346Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--) and the first message from Node startup (e.g. 2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403) was about 1 second, and in the failing cases, e.g. start test 2009-11-28 13:36:48.970Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <-- and Node startup 2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397 was about 5 seconds.

So it could be that there is some sort of race condition where if it takes the Node longer than 5 seconds to start up (perhaps waiting to bind to a TCP port or something) then some other part of the test gets confused by having won a race that it didn't expect to win.

Hm, I wonder if I could simulate that on a fast computer by inserting some sort of 10s delay before allowing Node startup to complete...

The next step is to make this test reproducible. François, could you please run just this one test, such as with trial --reporter=verbose --until-failure allmydata.test.test_system.SystemTest.test_mutable and see if you can tell when it passes vs. when it fails? (Maybe it has to do with other processes loading the CPU?) Note that which version of Tahoe-LAFS gets imported and tested by that command-line will be determined by your PYTHONPATH.

François: I'd like to get this fixed so that ARM can be a supported platform for the upcoming v1.6 release, so if you can't do this soon then please either give me or Brian an ssh account on your box or just say "Can't work on this now" so that we can think of some alternative strategies. Thanks!

On François's lenny-armv5tel box, `allmydata.test.test_system.SystemTest.test_mutable` sometimes stops making progress and then gets timed out after 3600 seconds, e.g.: <http://allmydata.org/buildbot/builders/Fran>çois lenny-armv5tel/builds/16 and many more. In the cases where that test does pass it takes only a couple of hundred seconds, e.g.: <http://allmydata.org/buildbot/builders/Fran>çois lenny-armv5tel/builds/8/steps/test/logs/stdio where it took 227 seconds. (In that same passing test run other tests took longer than 227 seconds -- see <http://allmydata.org/buildbot/builders/Fran>çois lenny-armv5tel/builds/8/steps/test/logs/timings .) Brian looked at the test.log files from passing and failing examples and said that there was little information there, but that one difference was that in the passing cases that he saw, the time between the beginning of the test case (e.g. `2009-11-20 18:08:54.346Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--`) and the first message from Node startup (e.g. `2009-11-20 18:08:55.475Z [-] foolscap.pb.Listener starting on 35403`) was about 1 second, and in the failing cases, e.g. start test `2009-11-28 13:36:48.970Z [-] --> allmydata.test.test_system.SystemTest.test_mutable <--` and Node startup `2009-11-28 13:36:53.516Z [-] foolscap.pb.Listener starting on 55397` was about 5 seconds. So it could be that there is some sort of race condition where if it takes the Node longer than 5 seconds to start up (perhaps waiting to bind to a TCP port or something) then some other part of the test gets confused by having won a race that it didn't expect to win. Hm, I wonder if I could simulate that on a fast computer by inserting some sort of 10s delay before allowing Node startup to complete... The next step is to make this test reproducible. François, could you please run just this one test, such as with `trial --reporter=verbose --until-failure allmydata.test.test_system.SystemTest.test_mutable` and see if you can tell when it passes vs. when it fails? (Maybe it has to do with other processes loading the CPU?) Note that which version of Tahoe-LAFS gets imported and tested by that command-line will be determined by your PYTHONPATH. François: I'd like to get this fixed so that ARM can be a supported platform for the upcoming v1.6 release, so if you *can't* do this soon then please either give me or Brian an ssh account on your box or just say "Can't work on this now" so that we can think of some alternative strategies. Thanks!

zooko added the

labels 2009-11-28 22:03:42 +00:00

zooko added this to the 1.6.0 milestone 2009-11-28 22:03:42 +00:00

francois commented

2009-11-29 23:47:04 +00:00

Zooko, that's right they are other CPU intensive processes like Mediatomb and MLDonkey running on this box for about 2 weeks. This probably explain why those tests have been failing recently.

The 'trial --until-failure' is currently running on the box, I'll post the results tomorrow morning (CET).

Zooko, that's right they are other CPU intensive processes like Mediatomb and MLDonkey running on this box for about 2 weeks. This probably explain why those tests have been failing recently. The 'trial --until-failure' is currently running on the box, I'll post the results tomorrow morning (CET).

francois commented

2009-11-30 10:58:14 +00:00

The test 'allmydata.test.test_system.SystemTest.test_mutable' failed after 20 successful runs, it definitely looks like a race condition. See the attached logfile for details.

francois commented

2009-11-30 10:58:36 +00:00

Attachment tahoe-846-test-fail.txt (7943 bytes) added

**Attachment** tahoe-846-test-fail.txt (7943 bytes) added

tahoe-846-test-fail.txt

7.8 KiB

francois commented

2009-11-30 11:02:42 +00:00

Attachment test.log (19363 bytes) added

**Attachment** test.log (19363 bytes) added

test.log

19 KiB

tahoe-lafs added

code-mutable

and removed

unknown

labels 2009-12-01 00:16:53 +00:00

francois commented

2009-12-05 17:22:20 +00:00

The failure seems related to the connect_timeout expiration in foolscap. This is currently outside of my understanding of the Tahoe codebase

Brian, do you see any hint in these logfiles ? Should you need an SSH account on this box, just send me your public key by email

The failure seems related to the connect_timeout expiration in foolscap. This is currently outside of my understanding of the Tahoe codebase Brian, do you see any hint in these logfiles ? Should you need an SSH account on this box, just send me your public key by email

zooko commented

2009-12-06 04:23:56 +00:00

Why do you say that it seems related to the connect_timeout expiration?

Why do you say that it seems related to the `connect_timeout` expiration?

francois commented

2009-12-06 11:47:09 +00:00

Because of that kind of log messages which appears in test.log:

2009-11-30 01:07:46.730Z [-] Reconnector._failed (furl=pb://z3g3meylekyi7yjo4azff2xbqwjbuioa@62.220.138.135:42639,127.0.0.1:42639/kdnymib6zqlgwo6nrihhofctqoo7chfe): [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.NegotiationError'>: no connection established within client timeout
        ]

Because of that kind of log messages which appears in test.log: ``` 2009-11-30 01:07:46.730Z [-] Reconnector._failed (furl=pb://z3g3meylekyi7yjo4azff2xbqwjbuioa@62.220.138.135:42639,127.0.0.1:42639/kdnymib6zqlgwo6nrihhofctqoo7chfe): [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.NegotiationError'>: no connection established within client timeout ] ```

davidsarah commented

2009-12-06 16:50:55 +00:00

#838 was a duplicate. It has another log attached to it; I don't know whether that provides any more information.

davidsarah commented

2009-12-06 16:52:09 +00:00

Replying to davidsarah:

#838 ... has another log attached to it

I meant, it links to http://allmydata.org/buildbot/builders/Fran%C3%A7ois%20lenny-armv5tel/builds/12/steps/test/logs/stdio

Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/846#issuecomment-73738): > #838 ... has another log attached to it I meant, it links to <http://allmydata.org/buildbot/builders/Fran%C3%A7ois%20lenny-armv5tel/builds/12/steps/test/logs/stdio>

zooko commented

2009-12-12 16:26:52 +00:00

Well, I haven't had time to investigate this, and I guess [Buildbot Policy wiki/BuildbotPolicy]our says that we need to demote ARM from the Supported Platforms category until someone has time to fix it. :-(

I'll do the demotion sometime today unless someone jumps in to rescue the ARM platform from the ignominy.

Well, I haven't had time to investigate this, and I guess [Buildbot Policy [wiki/BuildbotPolicy](wiki/BuildbotPolicy)]our says that we need to demote ARM from the Supported Platforms category until someone has time to fix it. :-( I'll do the demotion sometime today unless someone jumps in to rescue the ARM platform from the ignominy.

zooko modified the milestone from 1.6.0 to eventually

2010-01-26 15:44:40 +00:00

zooko commented

2010-05-27 22:11:00 +00:00

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

zooko modified the milestone from eventually to 1.8.0

2010-05-27 22:11:00 +00:00

zooko modified the milestone from 1.8.0 to soon

2010-08-14 06:45:57 +00:00

davidsarah commented

2011-09-09 18:26:02 +00:00

ticket:1084#comment:73736 (but probably not the main issue in that ticket) may be an instance of the bug in this ticket. It occurs in a different method of SystemTest, but the error is almost identical to the first one in tahoe-846-test-fail.txt.

ticket:1084#[comment:73736](/tahoe-lafs/trac-2024-07-25/issues/846#issuecomment-73736) (but probably not the main issue in that ticket) may be an instance of the bug in this ticket. It occurs in a different method of SystemTest, but the error is almost identical to the first one in [tahoe-846-test-fail.txt](/tahoe-lafs/trac-2024-07-25/attachments/000078ac-8d16-3d76-c569-31b4b655df12).

warner was unassigned by zooko

2011-09-28 16:23:03 +00:00

zooko self-assigned this 2011-09-28 16:23:03 +00:00

Sign in to join this conversation.