maximum recursion depth exceeded in Tahoe2PeerSelector #758

Closed
opened 2009-07-14 04:19:24 +00:00 by zooko · 2 comments

I just got this traceback from a node using the volunteergrid:

/usr/local/lib/python2.6/dist-packages/Twisted-8.2.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py, line 328 in _runCallbacks
326                    self._runningCallbacks = True
327                    try:
328                        self.result = callback(self.result, *args, **kw)
329                    finally:
Locals
callback	<bound method Tahoe2PeerSelector._got_response of <Tahoe2PeerSelector for upload nztp5>>
self	<Deferred at 0x4d93a70 current result: None>
args	(<PeerTracker for peer xjy2clbq and SI nztp5>, set([19, 20]), [<PeerTracker for peer gapnio7p and SI nztp5>])
kw	{}
/home/volunteergrid/src/tahoe/src/allmydata/immutable/upload.py, line 384 in _got_response
382
383        # now loop
384        return self._loop()
385
Locals
self	<Tahoe2PeerSelector for upload nztp5>
/home/volunteergrid/src/tahoe/src/allmydata/immutable/upload.py, line 284 in _loop
282            self.contacted_peers.extend(self.contacted_peers2)
283            self.contacted_peers[:] = []
284            return self._loop()
285        else:
Locals
self	<Tahoe2PeerSelector for upload nztp5>
/home/volunteergrid/src/tahoe/src/allmydata/immutable/upload.py, line 284 in _loop
282            self.contacted_peers.extend(self.contacted_peers2)
283            self.contacted_peers[:] = []
284            return self._loop()
285        else:
Locals
self	<Tahoe2PeerSelector for upload nztp5>

(And so forth until maximum recursion depth exceeded.)

There are only 15 servers on the volunteergrid right now. The clause shown, around line 279 of upload.py (source:src/allmydata/immutable/upload.py#L279), handles the case where all servers have been asked to hold a share and then all servers have been asked to hold a second share. It is meant to start the next pass, asking them to hold a third-or-greater share.

It appears that this loop never terminated before the recursion depth was exceeded. We have tests of this case (source:src/allmydata/tahoe/test/test_upload.py@20090625021809-4233b-9cdbf53c54025466fea8ab97bed668cd0017b142#L483), but... Hey waitaminute! That code in upload.py says:

elif self.contacted_peers2:
    # we've finished the second-or-later pass. Move all the remaining
    # peers back into self.contacted_peers for the next pass
    self.contacted_peers.extend(self.contacted_peers2)
    self.contacted_peers[:] = []
    return self._loop()

That can't be right. It probably means to say:

    self.contacted_peers.extend(self.contacted_peers2)
    del self.contacted_peers2[:]
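
To make the difference concrete, here is a toy model of just the list handling in that clause. It is a sketch, not the real uploader: `PassModel`, `run`, and the `fixed` flag are hypothetical names made up for illustration, and the other branches of `_loop` are reduced to placeholder return values. With the original line, `contacted_peers` is emptied right after being filled while `contacted_peers2` stays non-empty, so the same branch is taken on every call.

```python
class PassModel(object):
    """Toy stand-in for the list handling in the quoted clause (hypothetical)."""

    def __init__(self, peers, fixed):
        self.contacted_peers = []
        self.contacted_peers2 = list(peers)
        self.fixed = fixed

    def _loop(self):
        if self.contacted_peers:
            # Placeholder for "ask an already-contacted peer".
            return "next pass with %d peers" % len(self.contacted_peers)
        elif self.contacted_peers2:
            self.contacted_peers.extend(self.contacted_peers2)
            if self.fixed:
                # Proposed fix: empty the list we just copied from.
                del self.contacted_peers2[:]
            else:
                # Original line: empties the list we just filled.
                self.contacted_peers[:] = []
            return self._loop()
        else:
            # Placeholder for "no more peers".
            return "no peers left"


def run(fixed):
    model = PassModel(["peer-a", "peer-b"], fixed)
    try:
        return model._loop()
    except RuntimeError:
        # Python 3 raises RecursionError, which is a subclass of RuntimeError.
        return "recursion limit hit"
```

Under these assumptions, `run(False)` ends at the recursion limit and `run(True)` moves on to the next pass with both peers.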

Why doesn't that test catch this bug?

But it is too late at night for me to be messing with such stuff.

If someone in a different timezone or a different sleep schedule wants to fix the test to catch this bug while I sleep, that would be great! :-)

zooko added the code-peerselection, major, defect, 1.4.1 labels 2009-07-14 04:19:24 +00:00
zooko added this to the 1.5.0 milestone 2009-07-14 04:19:24 +00:00
tahoe-lafs changed title from maxmimum recursion depth exceeded in Tahoe2PeerSelector to maximum recursion depth exceeded in Tahoe2PeerSelector 2009-07-15 03:45:54 +00:00

Huh, yeah, that code *is* odd. Your analysis feels right, but I'm too jetlagged to understand this code right now either. I want to rewrite the uploader anyway, but that's not going to happen for 1.5.

This should be fixed by changeset:1192b61dfed62a49.
warner added the fixed label 2009-07-17 05:13:14 +00:00
Reference: tahoe-lafs/trac-2024-07-25#758