Tahoe process gone wild #1824
I woke up this morning (~07:20 PDT) to find my machine sluggish to respond. It turned out to be my Volunteer Grid 1 tahoe node, having grown to 24GB virtual, 12GB (of 16GB) real. It was not responsive to the web port. It was using about 0.3% CPU.
It appears to have hung in the middle of my regularly scheduled repair/lease process, which started 14 hours ago; here is the output from that. Note that this cronjob mail didn't come to me until I killed the process, so I suspect it hung in the middle of the third repair, and the tracebacks are merely due to the death of the TCP server.
Further details will be added later; I have no time to hunt for logs now.
When you get back to this, please attach any incident report files from the relevant time period, and please paste in the output of tahoe --version. By the way, I didn't know Volunteer Grid 1 was still in operation! Does it work reliably for you? How many storage servers and how many users does it have?
Replying to kpreid:
This confusing report is probably due to #766, and probably unrelated to the main issue in this ticket.
There are no recent incident report files. There are no recent twistd.log entries. My tahoesvc.log is more interesting, and is full of repetitions (0.11 second intervals) of the following sequence, followed at the end by the startup log after I killed it.
Going back to the beginning of this error (85 megabytes of logs ago), I find this as immediately preceding the spew, and nothing recent before it:
OK, so: None. Wasn't there a previous ticket about that?
This tahoe-dev thread might be the same problem: https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007829.html
Marking this as Priority: Critical, because it seems more important to me than most of the tickets marked Priority: Major.
Iantcho posted an incident report (trigger: he pushed the button to generate an incident report):
https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007836.html
Had this happen again today. However, slightly different logs — there was nothing immediately before the repeating message by timestamp. Here is the first repeating message and a couple of its predecessors:
Thank you, kpreid! That last exception showing that rref was None is a smoking gun that may be useful!
Okay, I just (re-?)noticed that the "rref is None" exception was reported by kpreid earlier (comment:89750), and that davidsarah then wrote:
Replying to davidsarah:
Why do you say there is an infinite deferred loop? Your proposed approach sounds good to me!
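To make the question concrete, here is a hedged, generic illustration of the kind of pattern "infinite deferred loop" usually refers to; this is not Tahoe's actual code, and fetch_share, start, and retry_forever are made-up names. An errback that unconditionally reschedules the same failing operation makes the reactor repeat an identical traceback at a fixed interval while the process otherwise appears idle, which is at least consistent with the 0.11-second repetition seen in the logs.

```python
# Hypothetical illustration only; none of these names exist in Tahoe-LAFS.
from twisted.internet import defer, reactor

def fetch_share():
    # Stand-in for an operation that keeps failing the same way every time.
    return defer.fail(RuntimeError("share validation failed"))

def start():
    d = fetch_share()
    d.addErrback(retry_forever)
    return d

def retry_forever(failure):
    print("retrying after:", failure.value)
    # No termination condition: the same failing call is scheduled again.
    reactor.callLater(0.11, start)

if __name__ == "__main__":
    reactor.callLater(0, start)
    reactor.run()   # repeats the retry indefinitely; Ctrl-C to stop
```

Whether anything like this actually exists in retrieve.py is what the rest of this ticket tries to establish.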
#1939 is a possible duplicate.
kpreid, can you check whether the incident logs on #1939 are consistent with this ticket?
They aren't, I think. There's no major spam and the errors don't look similar.
Replying to kpreid:
Okay, thanks for looking at that.
Replying to davidsarah:
#1636, which was fixed in Tahoe-LAFS 1.9.2. kpreid: what version of Tahoe-LAFS generated the output you show here?
The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't, the git revision is changeset:dd2e7a1127353a60 plus the patch at https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:89755.
(Why yes, I'm not especially keeping my Tahoe in tip-top shape these days...)
Replying to kpreid:
Great! Then I believe this issue to be fixed in 1.9.2, because it is an instance of #1636. Daira, what did you mean in comment:89751 by "the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled."? Do you think we can close this ticket as a dup of #1636 now?
kpreid: you could try upgrading to 1.9.2 (or the shiny new 1.10 pre-release!) and then let us know if this problem or anything like it ever recurs. Thanks for the bug report!
The tracebacks in comment:89750 and comment:89761 do look the same as in #1636.
What I meant was: why was the traceback repeating every 0.11 seconds in comment:89750? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in /tahoe-lafs/trac-2024-07-25/issues/6668#comment:89757 would have incidentally fixed it.
Replying to daira:
Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.
That seems unlikely: I have a total of 15 files.
I looked at the code a bit and I didn't notice a way for a loop to happen. The stack trace from the original report features this line:
But in the 1.9.2 version of retrieve.py (source:1.9.2/src/allmydata/mutable/retrieve.py?rev=5480#L725) and in the current version of retrieve.py (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c), the same code is no longer there.
Hm... let's see. There is no way that I can see that _validation_or_decoding_failed could lead to a loop in 1.9.0 (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L720). Is there any way that there could be a loop that calls out to _validation_or_decoding_failed over and over? Well, this stuff here (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L612) is complicated. Could it have a loop? No, I don't see it. Any errback from the call to _validation_or_decoding_failed should... Wait, what happens when you get an exception from your errback? I assume it just errbacks again? Hm, this code is very confusing because there are two variables in different scopes named dl. Anyway, I think the intent is that the errback goes here, where it terminates and doesn't loop: loop() (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L252).
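On the "exception from your errback" question, here is a minimal sketch of generic Twisted Deferred behaviour (not Tahoe's retrieve.py; handle_bad_share below is a made-up stand-in). An exception raised inside an errback does not re-enter that errback; it just becomes the new Failure handed to the next errback in the chain, or gets logged as an unhandled error if there is none.

```python
# Minimal sketch, generic Twisted only; handle_bad_share is a hypothetical
# stand-in, not the real function from retrieve.py.
from twisted.internet import defer
from twisted.python.failure import Failure

def handle_bad_share(failure):
    # An errback that itself blows up while handling the original failure.
    raise RuntimeError("secondary failure raised inside the errback")

d = defer.Deferred()
d.addErrback(handle_bad_share)

seen = []
d.addErrback(seen.append)   # receives a Failure wrapping the RuntimeError

d.errback(Failure(ValueError("original validation failure")))
print(seen[0].value)        # RuntimeError(...): the error chains, it doesn't loop
```

So by itself, a raising errback produces one new failure per firing, not a loop; a loop needs something that re-triggers the whole operation.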
The modern code that replaced this (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L614) doesn't have the two variables named dl. It still looks like you'll get an exception from the errback (named _handle_bad_shares in this version). Hm, in the current version that exception will then get ignored by line 642 (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L642), dl = deferredutil.gatherResults(ds), since deferredutil.gatherResults() always sets consumeErrors=True. Could that lead to a loop?
Daira: help!
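For reference, here is a minimal sketch of the consumeErrors=True semantics, using Twisted's own defer.gatherResults rather than Tahoe's deferredutil wrapper (which, per the comment above, always sets consumeErrors=True). With consumeErrors=True the per-Deferred failure is not reported as "Unhandled error in Deferred", and the aggregate Deferred errbacks with a FirstError, so the original exception is only truly lost if nothing handles that aggregate errback either.

```python
# Minimal sketch, generic Twisted only; not Tahoe's deferredutil.gatherResults.
from twisted.internet import defer
from twisted.python.failure import Failure

good, bad = defer.Deferred(), defer.Deferred()
dl = defer.gatherResults([good, bad], consumeErrors=True)

seen = []
dl.addErrback(seen.append)   # the aggregate failure still arrives here

good.callback("share 0 ok")
bad.errback(Failure(ValueError("share 1 failed validation")))
print(seen[0].value)         # defer.FirstError wrapping the ValueError
```

Whether the exception is effectively ignored therefore depends on what, if anything, is attached to dl afterwards.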