memcheck-64 fails sporadically #250

Closed
opened 2007-12-29 06:17:15 +00:00 by zooko · 6 comments

Brian knows a little bit more about this. There's some sort of race condition in shutting down old test runs and starting new ones, or something like that.

zooko added the operational, major, defect, 0.7.0 labels 2007-12-29 06:17:15 +00:00
zooko added this to the undecided milestone 2007-12-29 06:17:15 +00:00
warner was assigned by zooko 2007-12-29 06:17:15 +00:00

We fixed one possible source of failures: the pre-determined webport. This should fix the failures that say "address already in use" in the nodelog. Let's watch the buildbot and see whether any new failures show up.

Author

I think this has been fixed.

zooko added the fixed label 2008-05-31 01:21:24 +00:00
zooko closed this issue 2008-05-31 01:21:24 +00:00
zooko modified the milestone from undecided to 1.1.0 2008-05-31 01:21:29 +00:00
Author

This just happened on a different builder:

http://allmydata.org/buildbot/builders/feisty2.5/builds/1557/steps/test/logs/stdio

```
allmydata.test.test_client.Run.test_reloadable ... Traceback (most recent call last):
  File "/home/buildslave/tahoe/feisty2.5/build/src/allmydata/test/test_client.py", line 194, in _restart
    c2.setServiceParent(self.sparent)
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 148, in setServiceParent
    self.parent.addService(self)
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 259, in addService
    service.privilegedStartService()
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 228, in privilegedStartService
    service.privilegedStartService()
  File "/usr/lib/python2.5/site-packages/twisted/application/service.py", line 228, in privilegedStartService
    service.privilegedStartService()
  File "/usr/lib/python2.5/site-packages/twisted/application/internet.py", line 68, in privilegedStartService
    self._port = self._getPort()
  File "/usr/lib/python2.5/site-packages/twisted/application/internet.py", line 86, in _getPort
    return getattr(reactor, 'listen'+self.method)(*self.args, **self.kwargs)
  File "/usr/lib/python2.5/site-packages/twisted/internet/posixbase.py", line 467, in listenTCP
    p.startListening()
  File "/usr/lib/python2.5/site-packages/twisted/internet/tcp.py", line 733, in startListening
    raise CannotListenError, (self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on any:43755: (98, 'Address already in use').
```

Is it possible that this fault happens whenever the same port number is chosen at random by two successive tests?
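That collision is plausible: if each test draws its port at random from a fixed range, two tests running close together can draw the same number, and the second `listenTCP` fails exactly like this. A minimal stdlib sketch of the usual remedy (this is illustrative, not Tahoe or Twisted code; the helper name is made up): bind to port 0 so the kernel hands out a free ephemeral port, which cannot collide with any port that is currently in use.

```python
import socket

def listen_on_free_port(host="127.0.0.1"):
    # Port 0 asks the kernel for any free ephemeral port, so two
    # concurrently running tests can never pick the same number.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))
    s.listen(1)
    return s, s.getsockname()[1]

s1, port1 = listen_on_free_port()
s2, port2 = listen_on_free_port()
assert port1 != port2  # both sockets are live, so the ports must differ
s1.close()
s2.close()
```

The test would then read the assigned port back from the listener instead of writing a pre-determined one into the config.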

zooko removed the fixed label 2008-07-14 16:09:21 +00:00
zooko reopened this issue 2008-07-14 16:09:21 +00:00

There are comments in the test with more detail. The issue is some absolute timeouts that were not easy to get rid of. The problem is most likely the old instance not completely shutting down before the new one is started up.

source:/src/allmydata/test/test_client.py@2712#L166 has details:

```
def test_reloadable(self):
    basedir = "test_client.Run.test_reloadable"
    os.mkdir(basedir)
    dummy = "pb://wl74cyahejagspqgy4x5ukrvfnevlknt@127.0.0.1:58889/bogus"
    open(os.path.join(basedir, "introducer.furl"), "w").write(dummy)
    c1 = client.Client(basedir)
    c1.setServiceParent(self.sparent)

    # delay to let the service start up completely. I'm not entirely sure
    # this is necessary.
    d = self.stall(delay=2.0)
    d.addCallback(lambda res: c1.disownServiceParent())
    # the cygwin buildslave seems to need more time to let the old
    # service completely shut down. When delay=0.1, I saw this test fail,
    # probably due to the logport trying to reclaim the old socket
    # number. This suggests that either we're dropping a Deferred
    # somewhere in the shutdown sequence, or that cygwin is just cranky.
    d.addCallback(self.stall, delay=2.0)
    def _restart(res):
        # TODO: pause for slightly over one second, to let
        # Client._check_hotline poll the file once. That will exercise
        # another few lines. Then add another test in which we don't
        # update the file at all, and watch to see the node shutdown. (to
        # do this, use a modified node which overrides Node.shutdown(),
        # also change _check_hotline to use it instead of a raw
        # reactor.stop, also instrument the shutdown event in an
        # attribute that we can check)
        c2 = client.Client(basedir)
        c2.setServiceParent(self.sparent)
        return c2.disownServiceParent()
    d.addCallback(_restart)
    return d
```
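The fixed 2.0-second stalls in that test are the "absolute timeouts" mentioned above: the test sleeps and hopes the old service has released its port before `_restart` runs. A more robust pattern, sketched here with only the stdlib (a hypothetical helper, not the fix that actually landed), is to poll until the port is genuinely free before starting the successor:

```python
import socket
import time

def wait_until_port_free(port, host="127.0.0.1", attempts=50, interval=0.05):
    # Poll by attempting to bind; return True once the kernel has
    # released the port, instead of stalling for a fixed 2 seconds
    # and hoping that was long enough.
    for _ in range(attempts):
        probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            probe.bind((host, port))
        except OSError:
            probe.close()
            time.sleep(interval)
        else:
            probe.close()
            return True
    return False

# Simulate the restart race: hold a port, release it, then confirm
# the poller observes it becoming free.
old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old.bind(("127.0.0.1", 0))
port = old.getsockname()[1]
held = wait_until_port_free(port, attempts=3)   # still bound: expect False
old.close()
freed = wait_until_port_free(port)              # released: expect True
```

In Twisted itself the cleaner fix is to chain the restart on the Deferred returned by the old service's `stopService()` (which `disownServiceParent()` propagates), so that `c2` only starts once the listening port has actually been torn down and no polling or stalling is needed.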
Author

Hm, this ticket was last touched 9 months ago, and I haven't been seeing this failure in practice recently, as far as I recall. Close this as fixed?


I don't remember seeing this failure for a while either. I think it's safe to close. Feel free to reopen if it appears again.

warner added the fixed label 2009-04-08 02:18:08 +00:00
Reference: tahoe-lafs/trac-2024-07-25#250