Tahoe process gone wild #1824

Open
opened 2012-10-15 15:04:12 +00:00 by kpreid · 21 comments

I woke up this morning (~07:20 PDT) to find my machine sluggish to respond. It turned out to be my Volunteer Grid 1 tahoe node, having grown to 24GB virtual, 12GB (of 16GB) real. It was not responsive to the web port. It was using about 0.3% CPU.

It appears to have hung in the middle of my regularly scheduled repair/lease process, which started 14 hours ago; here is the output from that. Note that this cronjob mail didn't come to me until I killed the process, so I suspect it hung in the middle of the third repair, and the tracebacks are merely due to the death of the TCP server.

Further details will be added later; I have no time to hunt for logs now.

Randomised start: Sun Oct 14 18:42:35 PDT 2012
*** calendar-publish-dir
	done: 3 objects checked
	 pre-repair: 3 healthy, 0 unhealthy
	 0 repairs attempted, 0 successful, 0 failed
	 post-repair: 3 healthy, 0 unhealthy

*** family
	 repair successful
	done: 2 objects checked
	 pre-repair: 1 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 1 healthy, 1 unhealthy

*** publish
Traceback (most recent call last):
 File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module>
   load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')()
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run
   rc = runner(sys.argv[1:], install_node_control=install_node_control)
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner
   rc = cli.dispatch[command](so)
 File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck
   rc = tahoe_check.deepcheck(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck
   return DeepCheckStreamer().run(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run
   resp = do_http("POST", url)
 File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 68, in do_http
   return c.getresponse()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1013, in getresponse
   response.begin()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 402, in begin
   version, status, reason = self._read_status()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 366, in _read_status
   raise BadStatusLine(line)
httplib.BadStatusLine: ''

*** tahoe-illustration
Traceback (most recent call last):
 File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module>
   load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')()
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run
   rc = runner(sys.argv[1:], install_node_control=install_node_control)
 File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner
   rc = cli.dispatch[command](so)
 File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck
   rc = tahoe_check.deepcheck(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck
   return DeepCheckStreamer().run(options)
 File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run
   resp = do_http("POST", url)
 File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 60, in do_http
   c.endheaders()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 937, in endheaders
   self._send_output(message_body)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 797, in _send_output
   self.send(msg)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 759, in send
   self.connect()
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 740, in connect
   self.timeout, self.source_address)
 File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 571, in create_connection
   raise err
socket.error: [Errno 61] Connection refused

I woke up this morning (~07:20 PDT) to find my machine sluggish to respond. It turned out to be my Volunteer Grid 1 tahoe node, having grown to 24GB virtual, 12GB (of 16GB) real. It was not responsive to the web port. It was using about 0.3% CPU. It appears to have hung in the middle of my regularly scheduled repair/lease process, which started 14 hours ago; here is the output from that. Note that this cronjob mail didn't come to me until I killed the process, so I suspect it hung in the middle of the third repair, and the tracebacks are merely due to the death of the TCP server. Further details will be added later; I have no time to hunt for logs now. ``` Randomised start: Sun Oct 14 18:42:35 PDT 2012 *** calendar-publish-dir done: 3 objects checked pre-repair: 3 healthy, 0 unhealthy 0 repairs attempted, 0 successful, 0 failed post-repair: 3 healthy, 0 unhealthy *** family repair successful done: 2 objects checked pre-repair: 1 healthy, 1 unhealthy 1 repairs attempted, 1 successful, 0 failed post-repair: 1 healthy, 1 unhealthy *** publish Traceback (most recent call last): File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module> load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')() File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run rc = runner(sys.argv[1:], install_node_control=install_node_control) File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner rc = cli.dispatch[command](so) File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck rc = tahoe_check.deepcheck(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck return DeepCheckStreamer().run(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run resp = do_http("POST", url) File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 68, in do_http return c.getresponse() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1013, in getresponse response.begin() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 402, in begin version, status, reason = self._read_status() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 366, in _read_status raise BadStatusLine(line) httplib.BadStatusLine: '' *** tahoe-illustration Traceback (most recent call last): File "/External/Projects/tahoe/support/bin/tahoe", line 9, in <module> load_entry_point('allmydata-tahoe==1.9.0.post28', 'console_scripts', 'tahoe')() File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 113, in run rc = runner(sys.argv[1:], install_node_control=install_node_control) File "/External/Projects/tahoe/src/allmydata/scripts/runner.py", line 99, in runner rc = cli.dispatch[command](so) File "/External/Projects/tahoe/src/allmydata/scripts/cli.py", line 594, in deepcheck rc = tahoe_check.deepcheck(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 306, in deepcheck return DeepCheckStreamer().run(options) File "/External/Projects/tahoe/src/allmydata/scripts/tahoe_check.py", line 287, in run resp = do_http("POST", url) File "/External/Projects/tahoe/src/allmydata/scripts/common_http.py", line 60, in do_http c.endheaders() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 937, in endheaders self._send_output(message_body) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 797, in _send_output self.send(msg) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 759, in send self.connect() File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 740, in connect self.timeout, self.source_address) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 571, in create_connection raise err socket.error: [Errno 61] Connection refused ```
kpreid added the
unknown
normal
defect
unknown
labels 2012-10-15 15:04:12 +00:00
kpreid added this to the undecided milestone 2012-10-15 15:04:12 +00:00

When you back to this, please attach any incident report files from the relevant time period, and please paste in the output of tahoe --version. By the way, I didn't know Volunteer Grid 1 was still in operation! Does it work reliably for you? How many storage servers and how many users does it gave?

When you back to this, please attach any incident report files from the relevant time period, and please paste in the output of `tahoe --version`. By the way, I didn't know Volunteer Grid 1 was still in operation! Does it work reliably for you? How many storage servers and how many users does it gave?
davidsarah commented 2012-10-15 15:23:40 +00:00
Owner

Replying to kpreid:

*** family
	 repair successful
	done: 2 objects checked
	 pre-repair: 1 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 1 healthy, 1 unhealthy

This confusing report is probably due to #766, and probably unrelated to the main issue in this ticket.

Replying to [kpreid](/tahoe-lafs/trac-2024-07-25/issues/6856): > ``` > *** family > repair successful > done: 2 objects checked > pre-repair: 1 healthy, 1 unhealthy > 1 repairs attempted, 1 successful, 0 failed > post-repair: 1 healthy, 1 unhealthy > ``` This confusing report is probably due to #766, and probably unrelated to the main issue in this ticket.
tahoe-lafs added
code
1.9.2
and removed
unknown
unknown
labels 2012-10-15 15:23:40 +00:00
Author

There are no recent incident report files. There are no recent twistd.log entries. My tahoesvc.log is more interesting, and is full of repetitions (0.11 second intervals) of the following sequence, followed at the end by the startup log after I killed it.

2012-10-15 14:05:34.469Z [-] Unhandled error in Deferred:
2012-10-15 14:05:38.016Z [-] Unhandled Error
	Traceback (most recent call last):
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment
	    d = self._process_segment(self._current_segment)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment
	    dl.addErrback(self._validation_or_decoding_failed, [reader])
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback
	    errbackKeywords=kw)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
	    self._mark_bad_share(reader.server, reader.shnum, reader, f)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share
	    self.notify_server_corruption(server, shnum, str(f.value))
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption
	    rref.callRemoteOnly("advise_corrupt_share",
	exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'

Going back to the beginning of this error (85 megabytes of logs ago), I find this as immediately preceding the spew, and nothing recent before it:

2012-10-15 01:43:37.119Z [-] Unhandled error in Deferred:
2012-10-15 01:43:37.136Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=gtnn) (during method=RIStorageServer.tahoe.allmydata.com:slot_readv)
There are no recent incident report files. There are no recent twistd.log entries. My tahoesvc.log is more interesting, and is full of repetitions (0.11 second intervals) of the following sequence, followed at the end by the startup log after I killed it. ``` 2012-10-15 14:05:34.469Z [-] Unhandled error in Deferred: 2012-10-15 14:05:38.016Z [-] Unhandled Error Traceback (most recent call last): File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment d = self._process_segment(self._current_segment) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment dl.addErrback(self._validation_or_decoding_failed, [reader]) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback errbackKeywords=kw) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed self._mark_bad_share(reader.server, reader.shnum, reader, f) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share self.notify_server_corruption(server, shnum, str(f.value)) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption rref.callRemoteOnly("advise_corrupt_share", exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly' ``` Going back to the beginning of this error (85 megabytes of logs ago), I find this as immediately preceding the spew, and nothing recent before it: ``` 2012-10-15 01:43:37.119Z [-] Unhandled error in Deferred: 2012-10-15 01:43:37.136Z [-] Unhandled Error Traceback (most recent call last): Failure: foolscap.ipb.DeadReferenceError: Connection was lost (to tubid=gtnn) (during method=RIStorageServer.tahoe.allmydata.com:slot_readv) ```
davidsarah commented 2012-10-16 15:07:55 +00:00
Owner

OK, so:

  • we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?
  • the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled.
OK, so: - we still have cases where a disconnected rref is causing errors due to trying to call `None`. Wasn't there a previous ticket about that? - the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled.
tahoe-lafs added
major
and removed
normal
labels 2012-10-17 02:08:45 +00:00
tahoe-lafs modified the milestone from undecided to 1.11.0 2012-10-17 02:08:45 +00:00
kpreid was unassigned by tahoe-lafs 2012-10-17 02:09:30 +00:00
davidsarah commented 2012-11-12 23:24:46 +00:00
Owner

This tahoe-dev thread might be the same problem: https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007829.html

This tahoe-dev thread *might* be the same problem: <https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007829.html>

Marking this as Priority: Critical, because it seems more important to me than most of the tickets marked Priority: Major.

Marking this as Priority: Critical, because it seems more important to me than most of the tickets marked Priority: Major.
zooko added
critical
and removed
major
labels 2012-11-13 23:25:08 +00:00

Iantcho posted an incident report (trigger: he pushed the button to generate an incident report):

[//pipermail/tahoe-dev/2012-November/007836.html https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007836.html]

Iantcho posted an incident report (trigger: he pushed the button to generate an incident report): [//pipermail/tahoe-dev/2012-November/007836.html <https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007836.html>]
Author

Had this happen again today. However, slightly different logs — there was nothing immediately before the repeating message by timestamp. Here is the first repeating message and a couple of its predecessors:


2012-12-25 08:04:08.619Z [-] Unhandled Error
	Traceback (most recent call last):
	Failure: allmydata.interfaces.DownloadStopped: our Consumer called stopProducing()
	
2012-12-25 08:04:08.677Z [-] Unhandled error in Deferred:
2012-12-25 08:04:08.677Z [-] Unhandled Error
	Traceback (most recent call last):
	  File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 163, in gotPageContext
	    pageContext.tag.renderHTTP, pageContext
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 133, in maybeDeferred
	    result = f(*args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 490, in renderHTTP
	    d.addCallbacks(_finished, _error)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 489, in _error
	    eh.renderHTTP_exception(ctx, f)
	  File "/External/Projects/tahoe/src/allmydata/web/common.py", line 326, in renderHTTP_exception
	    return super.renderHTTP_exception(self, ctx, f)
	  File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 61, in renderHTTP_exception
	    request.finishRequest( False )
	  File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 176, in finishRequest
	    server.Request.finish(self)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/web/http.py", line 866, in finish
	    "Request.finish called on a request after its connection was lost; "
	exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.
	
2012-12-25 09:16:06.623Z [-] Unhandled error in Deferred:
2012-12-25 09:16:06.909Z [-] Unhandled Error
	Traceback (most recent call last):
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment
	    d = self._process_segment(self._current_segment)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment
	    dl.addErrback(self._validation_or_decoding_failed, [reader])
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback
	    errbackKeywords=kw)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
	    self._mark_bad_share(reader.server, reader.shnum, reader, f)
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share
	    self.notify_server_corruption(server, shnum, str(f.value))
	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption
	    rref.callRemoteOnly("advise_corrupt_share",
	exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'
Had this happen again today. However, slightly different logs — there was nothing immediately before the repeating message by timestamp. Here is the first repeating message and a couple of its predecessors: ``` 2012-12-25 08:04:08.619Z [-] Unhandled Error Traceback (most recent call last): Failure: allmydata.interfaces.DownloadStopped: our Consumer called stopProducing() 2012-12-25 08:04:08.677Z [-] Unhandled error in Deferred: 2012-12-25 08:04:08.677Z [-] Unhandled Error Traceback (most recent call last): File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 163, in gotPageContext pageContext.tag.renderHTTP, pageContext File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 133, in maybeDeferred result = f(*args, **kw) File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 490, in renderHTTP d.addCallbacks(_finished, _error) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/External/Projects/tahoe/src/allmydata/web/filenode.py", line 489, in _error eh.renderHTTP_exception(ctx, f) File "/External/Projects/tahoe/src/allmydata/web/common.py", line 326, in renderHTTP_exception return super.renderHTTP_exception(self, ctx, f) File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 61, in renderHTTP_exception request.finishRequest( False ) File "/External/Projects/tahoe/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/appserver.py", line 176, in finishRequest server.Request.finish(self) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/web/http.py", line 866, in finish "Request.finish called on a request after its connection was lost; " exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this. 2012-12-25 09:16:06.623Z [-] Unhandled error in Deferred: 2012-12-25 09:16:06.909Z [-] Unhandled Error Traceback (most recent call last): File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment d = self._process_segment(self._current_segment) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 638, in _process_segment dl.addErrback(self._validation_or_decoding_failed, [reader]) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 308, in addErrback errbackKeywords=kw) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 286, in addCallbacks self._runCallbacks() --- <exception caught here> --- File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed self._mark_bad_share(reader.server, reader.shnum, reader, f) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share self.notify_server_corruption(server, shnum, str(f.value)) File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption rref.callRemoteOnly("advise_corrupt_share", exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly' ```

Thank you, kpreid! That last exception showing that rref was None is a smoking gun that may be useful!

Thank you, kpreid! That last exception showing that `rref` was `None` is a smoking gun that may be useful!

Okay, I just (re-?) noticed that the rref is None was reported by kpreid earlier (comment:89750) and that then davidsarah wrote:

Replying to davidsarah:

OK, so:

  • we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?
  • the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled.

Why do you say there is an infinite deferred loop? Your proposed approach sounds good to me!

Okay, I just (re-?) noticed that the `rref` is `None` was reported by kpreid earlier ([comment:89750](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89750)) and that then davidsarah wrote: Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89751): > OK, so: > - we still have cases where a disconnected rref is causing errors due to trying to call `None`. Wasn't there a previous ticket about that? > - the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled. Why do you say there is an infinite deferred loop? Your proposed approach sounds good to me!
daira commented 2013-04-03 01:48:55 +00:00
Owner

#1939 is a possible duplicate.

#1939 is a possible duplicate.
daira commented 2013-04-03 01:52:08 +00:00
Owner

kpreid, can you check whether the incident logs on #1939 are consistent with this ticket?

kpreid, can you check whether the incident logs on #1939 are consistent with this ticket?
kpreid was assigned by tahoe-lafs 2013-04-03 01:52:08 +00:00
Author

They aren't, I think. There's no major spam and the errors don't look similar.

They aren't, I think. There's no major spam and the errors don't look similar.
daira commented 2013-04-03 04:07:51 +00:00
Owner

Replying to kpreid:

They aren't, I think. There's no major spam and the errors don't look similar.

Okay, thanks for looking at that.

Replying to [kpreid](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89767): > They aren't, I think. There's no major spam and the errors don't look similar. Okay, thanks for looking at that.

Replying to davidsarah:

OK, so:

  • we still have cases where a disconnected rref is causing errors due to trying to call None. Wasn't there a previous ticket about that?

#1636. Which was fixed in Tahoe-LAFS 1.9.2. kpreid: what version of Tahoe-LAFS generated the output you show here?

Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89751): > OK, so: > - we still have cases where a disconnected rref is causing errors due to trying to call `None`. Wasn't there a previous ticket about that? #1636. Which was fixed in Tahoe-LAFS 1.9.2. kpreid: what version of Tahoe-LAFS generated the output you show here?
Author

kpreid: what version of Tahoe-LAFS generated the output you show here?

The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is changeset:dd2e7a1127353a60 plus the patch at (@@https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:89755@@) .

(Why yes, I'm not especially keeping my Tahoe in tip-top shape these days...)

> kpreid: what version of Tahoe-LAFS generated the output you show here? The traceback contains a mention of "1.9.0.post28". The same string is still in my `build/lib/allmydata/_version.py`, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is changeset:dd2e7a1127353a60 plus the patch at (@@https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#[comment:89755](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89755)@@) . (Why yes, I'm not especially keeping my Tahoe in tip-top shape these days...)

Replying to kpreid:

kpreid: what version of Tahoe-LAFS generated the output you show here?

The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is changeset:dd2e7a1127353a60 plus the patch at (@@https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:89755@@) .

Great! Then I believe this issue to be fixed in 1.9.2, because it is an instance of #1636. Daira, what did you mean in comment:89751 by "the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled."? Do you think we can close this ticket as as dup of #1636 now?

kpreid: you could try upgrading to 1.9.2 (or the shiny new 1.10 pre-release!) and then let us know if this problem or anything like it ever recurs. Thanks for the bug report!

Replying to [kpreid](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89770): > > kpreid: what version of Tahoe-LAFS generated the output you show here? > > The traceback contains a mention of "1.9.0.post28". The same string is still in my `build/lib/allmydata/_version.py`, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't the git revision is changeset:dd2e7a1127353a60 plus the patch at (@@https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#[comment:89755](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89755)@@) . Great! Then I believe this issue to be fixed in 1.9.2, because it is an instance of #1636. Daira, what did you mean in [comment:89751](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89751) by "the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled."? Do you think we can close this ticket as as dup of #1636 now? kpreid: you could try upgrading to 1.9.2 (or the shiny new 1.10 pre-release!) and then let us know if this problem or anything like it ever recurs. Thanks for the bug report!
daira commented 2013-04-25 01:35:11 +00:00
Owner

The tracebacks in comment:89750 and comment:89761 do look the same as in #1636.

What I meant was: why was the traceback repeating every 0.11 seconds in comment:89750? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in /tahoe-lafs/trac-2024-07-25/issues/6668#comment:89757 would have incidentally fixed it.

The tracebacks in [comment:89750](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89750) and [comment:89761](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89761) do look the same as in #1636. What I meant was: why was the traceback repeating every 0.11 seconds in [comment:89750](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89750)? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in [/tahoe-lafs/trac-2024-07-25/issues/6668](/tahoe-lafs/trac-2024-07-25/issues/6668)#[comment:89757](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89757) would have incidentally fixed it.

Replying to daira:

The tracebacks in comment:89750 and comment:89761 do look the same as in #1636.

What I meant was: why was the traceback repeating every 0.11 seconds in comment:89750? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in /tahoe-lafs/trac-2024-07-25/issues/6668#comment:89757 would have incidentally fixed it.

Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.

Replying to [daira](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89772): > The tracebacks in [comment:89750](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89750) and [comment:89761](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89761) do look the same as in #1636. > > What I meant was: why was the traceback repeating every 0.11 seconds in [comment:89750](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89750)? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in [/tahoe-lafs/trac-2024-07-25/issues/6668](/tahoe-lafs/trac-2024-07-25/issues/6668)#[comment:89757](/tahoe-lafs/trac-2024-07-25/issues/1824#issuecomment-89757) would have incidentally fixed it. Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.
Author

Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.

That seems unlikely: I have a total of 15 files.

> Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds. That seems unlikely: I have a total of 15 files.

I looked at the code a bit and I didn't notice a way for a loop to happen. The stack trace from the original report features this line:

	  File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed

But in [the 1.9.2 version of retrieve.py]source:1.9.2/src/allmydata/mutable/retrieve.py?rev=5480#L725 and in the [current version of retrieve.py]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c there isn't the same code.

Hm... let's see. There is no way that I can see that _validation_or_decoding_failed could lead to a loop, [in 1.9.0]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L720. Is there any way that there could be a loop that calls out to _validation_or_decoding_failed over and over? Well, [this stuff here]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L612 is complicated. Could it have a loop? No, I don't see it. Any errback from the call to _validation_or_decoding_failed should... Wait, what happens when you get an exception from your errback? I assume it just errbacks again? Hm, this code is very confusing because there are two variables in different scopes named dl. Anyway, I think the intent is that the errback goes here, where it terminates and doesn't loop: [loop()]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L252.

The [modern code]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L614 that replaced this doesn't have the two variables named dl. It still looks like you'll get an exception from the errback (named _handle_bad_shares in this version). Hm, in the current version that exception will then get ignored by [line 642]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L642 dl = deferredutil.gatherResults(ds) since deferredutil.gatherResults() always sets consumeErrors=True. Could that lead to a loop?

Daira: help!

I looked at the code a bit and I didn't notice a way for a loop to happen. The stack trace from the original report features this line: ``` File "/External/Projects/tahoe/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed ``` But in [the 1.9.2 version of retrieve.py]source:1.9.2/src/allmydata/mutable/retrieve.py?rev=5480#L725 and in the [current version of retrieve.py]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c there isn't the same code. Hm... let's see. There is no way that I can see that `_validation_or_decoding_failed` could lead to a loop, [in 1.9.0]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L720. Is there any way that there could be a loop that calls out to `_validation_or_decoding_failed` over and over? Well, [this stuff here]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L612 is complicated. Could it have a loop? No, I don't see it. Any errback from the call to `_validation_or_decoding_failed` should... Wait, what happens when you get an exception from your errback? I assume it just errbacks again? Hm, this code is very confusing because there are two variables in different scopes named `dl`. Anyway, I think the *intent* is that the errback goes here, where it terminates and doesn't loop: [loop()]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L252. The [modern code]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L614 that replaced this doesn't have the two variables named `dl`. It still looks like you'll get an exception from the errback (named `_handle_bad_shares` in this version). Hm, in the current version that exception will then get ignored by [line 642]source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L642 `dl = deferredutil.gatherResults(ds)` since `deferredutil.gatherResults()` always sets `consumeErrors=True`. Could that lead to a loop? Daira: help!
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1824
No description provided.