nondeterministic failure of allmydata.test.test_download.DownloadTest.test_lost_servers with NotEnoughSharesError #1197

Closed
opened 2010-09-11 02:22:08 +00:00 by davidsarah · 6 comments
davidsarah commented 2010-09-11 02:22:08 +00:00
Owner

(http://tahoe-lafs.org/buildbot/builders/FreeStorm%20WinXP-x86%20py2.6/builds/273/steps/test/logs/stdio) :

[ERROR]: allmydata.test.test_download.DownloadTest.test_lost_servers

Traceback (most recent call last):
Failure: allmydata.interfaces.NotEnoughSharesError: ran out of
shares: complete=sh4 pending=Share(sh7-on-k6vb2) overdue= unused=
need 3. Last failure: [Failure instance: Traceback (failure with no
frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException
around '[Failure instance: Traceback: <class
'allmydata.test.no_network.IntentionalError'>: I was asked to break
c:\python26\lib\site-packages\twisted\internet\base.py:778:runUntilCurrent
build\bdist.win32\egg\foolscap\eventual.py:26:_turn
c:\python26\lib\site-packages\twisted\internet\defer.py:280:callback
c:\python26\lib\site-packages\twisted\internet\defer.py:354:_startRunCallbacks
--- <exception caught here> ---
c:\python26\lib\site-packages\twisted\internet\defer.py:371:_runCallbacks
c:\buildbot_tahoe\freestorm_winxp-x86_py2.6\build\src\allmydata\test\no_network.py:81:<lambda>
c:\buildbot_tahoe\freestorm_winxp-x86_py2.6\build\src\allmydata\test\no_network.py:68:_call
]'>
]

This test succeeds on other builders, and in the next build of the same source.

(http://tahoe-lafs.org/buildbot/builders/FreeStorm%20WinXP-x86%20py2.6/builds/273/steps/test/logs/stdio) : ``` [ERROR]: allmydata.test.test_download.DownloadTest.test_lost_servers Traceback (most recent call last): Failure: allmydata.interfaces.NotEnoughSharesError: ran out of shares: complete=sh4 pending=Share(sh7-on-k6vb2) overdue= unused= need 3. Last failure: [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException around '[Failure instance: Traceback: <class 'allmydata.test.no_network.IntentionalError'>: I was asked to break c:\python26\lib\site-packages\twisted\internet\base.py:778:runUntilCurrent build\bdist.win32\egg\foolscap\eventual.py:26:_turn c:\python26\lib\site-packages\twisted\internet\defer.py:280:callback c:\python26\lib\site-packages\twisted\internet\defer.py:354:_startRunCallbacks --- <exception caught here> --- c:\python26\lib\site-packages\twisted\internet\defer.py:371:_runCallbacks c:\buildbot_tahoe\freestorm_winxp-x86_py2.6\build\src\allmydata\test\no_network.py:81:<lambda> c:\buildbot_tahoe\freestorm_winxp-x86_py2.6\build\src\allmydata\test\no_network.py:68:_call ]'> ] ``` This test succeeds on other builders, and in the [next build](http://tahoe-lafs.org/buildbot/builders/FreeStorm%20WinXP-x86%20py2.6/builds/274/steps/test/logs/stdio) of the same source.
tahoe-lafs added the
code-network
major
defect
1.8β
labels 2010-09-11 02:22:08 +00:00
tahoe-lafs added this to the 1.8.0 milestone 2010-09-11 02:22:08 +00:00

I have seen this a couple of times now. I first saw it after applying changeset:0475bd8e27ae8307. I suspect changeset:0475bd8e27ae8307 causes it. :-(

I have seen this a couple of times now. I first saw it after applying changeset:0475bd8e27ae8307. I suspect changeset:0475bd8e27ae8307 causes it. :-(
zooko self-assigned this 2010-09-11 04:22:39 +00:00

Ugh, I spent a few hours just now trying to make allmydata.test.test_download.DownloadTest.test_lost_servers deterministic and was not able to do so. Now that I'm writing this, I think maybe I don't even want allmydata.test.test_download.DownloadTest.test_lost_servers to be deterministic. It is basically a "system test". Maybe it is okay to leave it with some non-determinism (which I assume comes from iterating over a dict or a set of servers) in order to show up issues like this one, as long as also write a deterministic, narrowed test for each such issue that shows up.

Okay I'll try that next...

Ugh, I spent a few hours just now trying to make `allmydata.test.test_download.DownloadTest.test_lost_servers` deterministic and was not able to do so. Now that I'm writing this, I think maybe I don't even *want* `allmydata.test.test_download.DownloadTest.test_lost_servers` to be deterministic. It is basically a "system test". Maybe it is okay to leave it with some non-determinism (which I assume comes from iterating over a dict or a set of servers) in order to show up issues like this one, as long as also write a deterministic, narrowed test for each such issue that shows up. Okay I'll try that next...

Here is the last part of _trial_temp/test.log (created with FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1 PYTHONPATH=src/ trial --rterror allmydata.test.test_download.DownloadTest.test_lost_servers) of a run in which the download succeeded:

2010-09-11 00:00:19-0600 [-] imm Node(54cbd6tx).get_segment(1)
2010-09-11 00:00:19-0600 [-] error requesting 60+24 from 6jdsp for si 54cbd6tx
2010-09-11 00:00:19-0600 [-] abandoning Share(sh0-on-6jdsp)
2010-09-11 00:00:19-0600 [-] error requesting 60+24 from ysbz4 for si 54cbd6tx
2010-09-11 00:00:19-0600 [-] abandoning Share(sh1-on-ysbz4)
2010-09-11 00:00:19-0600 [-] error requesting 60+24 from k6vb2 for si 54cbd6tx
2010-09-11 00:00:19-0600 [-] abandoning Share(sh2-on-k6vb2)
2010-09-11 00:00:19-0600 [-] error requesting 0+1024 from alg6u for si 54cbd6tx
2010-09-11 00:00:19-0600 [-] abandoning Share(sh3-on-alg6u)
2010-09-11 00:00:19-0600 [-] got error from [xgru5adv]
2010-09-11 00:00:19-0600 [-] storage: get_buckets 54cbd6txyhbfqrdoclhpinsux4
2010-09-11 00:00:19-0600 [-] storage: get_buckets 54cbd6txyhbfqrdoclhpinsux4
2010-09-11 00:00:19-0600 [-] got error from [rvsry4kn]
2010-09-11 00:00:19-0600 [-] storage: get_buckets 54cbd6txyhbfqrdoclhpinsux4
2010-09-11 00:00:19-0600 [-] got error from [fob7vqgd]
2010-09-11 00:00:19-0600 [-] delivering segment(1)
2010-09-11 00:00:19-0600 [-] Segmentation got data: want [72-310), given [72-144), for segnum=1
2010-09-11 00:00:19-0600 [-] imm Node(54cbd6tx).get_segment(2)
2010-09-11 00:00:19-0600 [-] delivering segment(2)

Here is the comparable section of a run in which the download failed:

2010-09-11 00:05:55-0600 [-] imm Node(x36qlfft).get_segment(1)
2010-09-11 00:05:55-0600 [-] error requesting 60+24 from alg6u for si x36qlfft
2010-09-11 00:05:55-0600 [-] abandoning Share(sh0-on-alg6u)
2010-09-11 00:05:55-0600 [-] error requesting 60+24 from egjce for si x36qlfft
2010-09-11 00:05:55-0600 [-] abandoning Share(sh1-on-egjce)
2010-09-11 00:05:55-0600 [-] error requesting 60+24 from k6vb2 for si x36qlfft
2010-09-11 00:05:55-0600 [-] abandoning Share(sh2-on-k6vb2)
2010-09-11 00:05:55-0600 [-] error requesting 0+1024 from 7r4gd for si x36qlfft
2010-09-11 00:05:55-0600 [-] abandoning Share(sh3-on-7r4gd)
2010-09-11 00:05:55-0600 [-] storage: get_buckets x36qlfft7g5c7raim6y7yb25ju
2010-09-11 00:05:55-0600 [-] storage: get_buckets x36qlfft7g5c7raim6y7yb25ju
2010-09-11 00:05:55-0600 [-] got error from [fob7vqgd]
2010-09-11 00:05:55-0600 [-] got error from [rvsry4kn]
2010-09-11 00:05:55-0600 [-] got error from [ysbz4st7]
2010-09-11 00:05:55-0600 [-] got error from [xgru5adv]
2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever
2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever
2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever
2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever
2010-09-11 00:05:55-0600 [-] ran out of shares: complete= pending=Share(sh4-on-6jdsp),Share(sh5-on-b3llg) overdue= unused= need 3. Last failure: [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException around '[Failure instance: Traceback: <class 'allmydata.test.no_network.IntentionalError'>: I was asked to break
        /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py:789:runUntilCurrent
        /Library/Python/2.6/site-packages/foolscap-0.5.1-py2.6.egg/foolscap/eventual.py:26:_turn
        /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:351:callback
        /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:450:_startRunCallbacks
        --- <exception caught here> ---
        /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:468:_runCallbacks
        /Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/test/no_network.py:81:<lambda>
        /Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/test/no_network.py:68:_call
        ]'>
        ]
Here is the last part of `_trial_temp/test.log` (created with `FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1 PYTHONPATH=src/ trial --rterror allmydata.test.test_download.DownloadTest.test_lost_servers`) of a run in which the download succeeded: ``` 2010-09-11 00:00:19-0600 [-] imm Node(54cbd6tx).get_segment(1) 2010-09-11 00:00:19-0600 [-] error requesting 60+24 from 6jdsp for si 54cbd6tx 2010-09-11 00:00:19-0600 [-] abandoning Share(sh0-on-6jdsp) 2010-09-11 00:00:19-0600 [-] error requesting 60+24 from ysbz4 for si 54cbd6tx 2010-09-11 00:00:19-0600 [-] abandoning Share(sh1-on-ysbz4) 2010-09-11 00:00:19-0600 [-] error requesting 60+24 from k6vb2 for si 54cbd6tx 2010-09-11 00:00:19-0600 [-] abandoning Share(sh2-on-k6vb2) 2010-09-11 00:00:19-0600 [-] error requesting 0+1024 from alg6u for si 54cbd6tx 2010-09-11 00:00:19-0600 [-] abandoning Share(sh3-on-alg6u) 2010-09-11 00:00:19-0600 [-] got error from [xgru5adv] 2010-09-11 00:00:19-0600 [-] storage: get_buckets 54cbd6txyhbfqrdoclhpinsux4 2010-09-11 00:00:19-0600 [-] storage: get_buckets 54cbd6txyhbfqrdoclhpinsux4 2010-09-11 00:00:19-0600 [-] got error from [rvsry4kn] 2010-09-11 00:00:19-0600 [-] storage: get_buckets 54cbd6txyhbfqrdoclhpinsux4 2010-09-11 00:00:19-0600 [-] got error from [fob7vqgd] 2010-09-11 00:00:19-0600 [-] delivering segment(1) 2010-09-11 00:00:19-0600 [-] Segmentation got data: want [72-310), given [72-144), for segnum=1 2010-09-11 00:00:19-0600 [-] imm Node(54cbd6tx).get_segment(2) 2010-09-11 00:00:19-0600 [-] delivering segment(2) ``` Here is the comparable section of a run in which the download failed: ``` 2010-09-11 00:05:55-0600 [-] imm Node(x36qlfft).get_segment(1) 2010-09-11 00:05:55-0600 [-] error requesting 60+24 from alg6u for si x36qlfft 2010-09-11 00:05:55-0600 [-] abandoning Share(sh0-on-alg6u) 2010-09-11 00:05:55-0600 [-] error requesting 60+24 from egjce for si x36qlfft 2010-09-11 00:05:55-0600 [-] abandoning Share(sh1-on-egjce) 2010-09-11 00:05:55-0600 [-] error requesting 60+24 from k6vb2 for si x36qlfft 2010-09-11 00:05:55-0600 [-] abandoning Share(sh2-on-k6vb2) 2010-09-11 00:05:55-0600 [-] error requesting 0+1024 from 7r4gd for si x36qlfft 2010-09-11 00:05:55-0600 [-] abandoning Share(sh3-on-7r4gd) 2010-09-11 00:05:55-0600 [-] storage: get_buckets x36qlfft7g5c7raim6y7yb25ju 2010-09-11 00:05:55-0600 [-] storage: get_buckets x36qlfft7g5c7raim6y7yb25ju 2010-09-11 00:05:55-0600 [-] got error from [fob7vqgd] 2010-09-11 00:05:55-0600 [-] got error from [rvsry4kn] 2010-09-11 00:05:55-0600 [-] got error from [ysbz4st7] 2010-09-11 00:05:55-0600 [-] got error from [xgru5adv] 2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever 2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever 2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever 2010-09-11 00:05:55-0600 [-] ShareFinder.loop: no_more_shares, ever 2010-09-11 00:05:55-0600 [-] ran out of shares: complete= pending=Share(sh4-on-6jdsp),Share(sh5-on-b3llg) overdue= unused= need 3. Last failure: [Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException around '[Failure instance: Traceback: <class 'allmydata.test.no_network.IntentionalError'>: I was asked to break /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py:789:runUntilCurrent /Library/Python/2.6/site-packages/foolscap-0.5.1-py2.6.egg/foolscap/eventual.py:26:_turn /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:351:callback /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:450:_startRunCallbacks --- <exception caught here> --- /Library/Python/2.6/site-packages/Twisted-10.1.0_r29874-py2.6-macosx-10.6-universal.egg/twisted/internet/defer.py:468:_runCallbacks /Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/test/no_network.py:81:<lambda> /Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/test/no_network.py:68:_call ]'> ] ```
david-sarah@jacaranda.org commented 2010-09-12 00:52:32 +00:00
Author
Owner

In changeset:6c4ba62c5d3a7624:

test_download.py: simplify and possibly fix test_lost_servers; refs #1197
In changeset:6c4ba62c5d3a7624: ``` test_download.py: simplify and possibly fix test_lost_servers; refs #1197 ```
davidsarah commented 2010-09-12 01:42:31 +00:00
Author
Owner

That seems to have worked.

That seems to have worked.

reviewed. Thanks!! :-)

reviewed. Thanks!! :-)
zooko added the
fixed
label 2010-09-12 07:25:32 +00:00
zooko closed this issue 2010-09-12 07:25:32 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1197
No description provided.