Unhandled error in Deferred during retrieve #1636

Closed
opened 2011-12-14 21:08:06 +00:00 by killyourtv · 43 comments
killyourtv commented 2011-12-14 21:08:06 +00:00
Owner
allmydata-tahoe: 1.9.0-r5387,
foolscap: 0.6.2,
pycryptopp: 0.5.29,
zfec: 1.4.22,
Twisted: 11.0.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.7.2+,
platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF,
pyOpenSSL: 0.13,
simplejson: 2.3.0,
pycrypto: 2.4.1,
pyasn1: unknown,
mock: 0.7.2,
sqlite3: 2.6.0 [sqlite 3.7.9],
setuptools: 0.6c16dev3

...with patches from #1007, #1010, and #1628 applied.

When uploading files to the volunteer grid on I2P, I received the following exception:

2011-12-14 13:34:49+0000 [-] Unhandled error in Deferred:
2011-12-14 13:34:49+0000 [-] Unhandled Error
	Traceback (most recent call last):
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 610, in _download_current_segment
	    d = self._process_segment(self._current_segment)
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 638, in _process_segment
	    dl.addErrback(self._validation_or_decoding_failed, [reader])
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 308, in addErrback
	    errbackKeywords=kw)
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 286, in addCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 736, in _validation_or_decoding_failed
	    self._mark_bad_share(reader.server, reader.shnum, reader, f)
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 595, in _mark_bad_share
	    self.notify_server_corruption(server, shnum, str(f.value))
	  File "$INSTALL_LOCATION/tahoe-lafs/src/allmydata/mutable/retrieve.py", line 938, in notify_server_corruption
	    rref.callRemoteOnly("advise_corrupt_share",
	exceptions.AttributeError: 'NoneType' object has no attribute 'callRemoteOnly'

I do not see any flog files from the time of the error (and it took place when I was AFK).

At http://127.0.0.1:3456/status/ it's still visible in the "Active Operations" section, even though the event took place hours ago.

Mutable File Publish Status

    Started: 13:34:51 14-Dec-2011
    Storage Index: 7kkjs3zair343ujzz7ghzwx5si
    Helper?: No
    Current Size: 2972
    Progress: 0.0%
    Status: Pushing shares

Retrieve Results

    Encoding: 3 of 10
    Sharemap:
        0 -> Placed on [q5b52rmg], [bgahn3ft]
        4 -> Placed on [q5b52rmg]
        7 -> Placed on [5bqq6b7f]
        9 -> Placed on [jvgf7m73]
    Timings:
        Total: ()
            Setup: 6.2ms
            Encrypting: 310us (9.59MBps)
            Encoding: 264us (11.25MBps)
            Packing Shares: ()
                RSA Signature: 6.7ms
            Pushing: ()

For what it's worth, there are other failed upload attempts still sitting in the list of "Active Operations" (even though they failed many, many hours ago). I'm retrying the batch upload; I'll try to generate the flog files if it errors out while I'm nearby.

(Occasional upload failures on networks like I2P or Tor aren't unusual; what is unusual is that they're still shown as active long after failing, which is not something I saw with 1.8.x.)

tahoe-lafs added the
unknown
major
defect
1.9.0
labels 2011-12-14 21:08:06 +00:00
tahoe-lafs added this to the undecided milestone 2011-12-14 21:08:06 +00:00
killyourtv commented 2011-12-16 14:30:41 +00:00
Author
Owner

I don't know how to best explain this (it may be related to the regression being tackled in #1628), but:

During my attempted re-upload, my system became really sluggish. After what seemed like an eternity I managed to get "htop" running. The system load was at 35 with the tahoe process using ~1.2GB of RAM.

Unfortunately, after 15 minutes or so of trying to get back to a prompt (or ssh in), I grew impatient and hit the power switch.

It may not be very helpful, but attached is a snippet of twistd.log from that period. The fun started at 18:43:28 UTC. The logging stopped at 18:44:48 UTC. The attached file has 49679 lines.

There isn't anything in ~/.tahoe/logs/incidents from that time.


I haven't seen any "Unhandled error in Deferred:" entries in twistd.log from my use of 1.8.x.

killyourtv commented 2011-12-16 14:31:18 +00:00
Author
Owner

Attachment twistd.log.old.xz (1992 bytes) added

I'm promoting this to priority critical and adding the keyword regression and putting it into the Milestone 1.10.0.

zooko added
critical
and removed
major
labels 2011-12-16 16:09:54 +00:00
zooko modified the milestone from undecided to 1.10.0 2011-12-16 16:09:54 +00:00
zooko modified the milestone from 1.10.0 to 1.9.1 2011-12-16 16:53:46 +00:00
davidsarah commented 2011-12-17 05:29:02 +00:00
Author
Owner

Replying to killyourtv:

I don't know how to best explain this (it may be related to the regression being tackled in #1628), ...

I can't see any evidence that it is the same bug. What makes you suspect that it is?

(#1583 seems to be the same bug as #1628, but they have different symptoms to this one.)

Ok, I think we have two things going on here:

  • notify_server_corruption needs to check if the rref it gets is None before invoking callRemoteOnly on it. This was probably exposed by my IServer refactoring, since before that, the servermap was holding actual RemoteReference objects, and now it just holds IServer objects (from which you can get a RemoteReference, if it is still connected, with .get_rref()).
  • certain errors during the retrieve code get dropped, and the retrieve hangs. To fix this, the control flow needs to be carefully structured to make sure that all exceptions will get caught in a Deferred chain that leads into the main Retrieve deferred.

Zooko is going to write a patch for the first problem. Brian will walk through the code to fix the second problem.
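For illustration, here is a minimal sketch of the guard for the first problem; the method shape and the argument list are assumed from the traceback above, not taken from an actual patch:

    def notify_server_corruption(self, server, shnum, reason):
        # Sketch: get_rref() may return None once the server has
        # disconnected, so skip the advisory call rather than letting
        # the AttributeError propagate.
        rref = server.get_rref()
        if rref is None:
            return
        rref.callRemoteOnly("advise_corrupt_share",
                            "mutable", self._storage_index, shnum, reason)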

I believe the current state of this ticket is blocked on someone verifying that it occurs with upstream Tahoe-LAFS (no i2p-specific patches or other patches that aren't in trunk).

Is that right?

killyourtv appears to be a reliable reporter of bugs so far -- his reports are low on speculation and include actual facts like the twistd file...

Okay, just had a conversation with Brian and David-Sarah on IRC, and it seems clear that there is a bug in the code for mutable files (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=5bf1ffbc879cf08253eb695d8b0742629f26cb84#L936), inasmuch as it assumes that the server object's (source:git/src/allmydata/storage_client.py?annotate=blame&rev=5bf1ffbc879cf08253eb695d8b0742629f26cb84#L149) .rref member will be an object, but instead it may be None due to normal, inevitable network entropy (source:git/src/allmydata/storage_client.py?annotate=blame&rev=5bf1ffbc879cf08253eb695d8b0742629f26cb84#L254).

There are two possible fixes:

  1. Find all places that call .get_rref and make sure they sensibly handle the case where they get None.
  2. Change the server instance to keep a RemoteReference object (http://foolscap.lothar.com/trac/browser/foolscap/referenceable.py?annotate=blame&rev=03d9fdb5a5bb089f708614667830477dbd0bff63#L412) in .rref even after it has been notified that the connection was lost; that object will raise DeadReferenceError (http://foolscap.lothar.com/trac/browser/foolscap/ipb.py?annotate=blame&rev=10a3bfb6ab75b8997e4995e4c77ab8bf02f1bdcb#L10) if it is used.

I'll investigate both approaches and produce a patch for one or both of them.

I've cleaned up the control flow in retrieve.py, in a series of 6 patches ending in changeset:893eea849ba7cf78 / changeset:8db39f0e. I still need to do one more careful inspection, but I think it's pretty solid now.
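As a generic illustration of the kind of structure those patches aim for (not the actual retrieve.py code), the goal is for every per-segment errback to feed the single Deferred that Retrieve returns, so no failure is stranded in a side chain; method names here are hypothetical:

    from twisted.internet import defer

    def _download_all_segments(self, segnums):
        # Chain the segment downloads serially so that any failure
        # propagates into the one returned Deferred instead of being
        # dropped as an "Unhandled error in Deferred".
        d = defer.succeed(None)
        for segnum in segnums:
            d.addCallback(lambda ign, segnum=segnum:
                          self._download_segment(segnum))
        d.addErrback(self._failed)  # single terminal errback
        return d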

On the .get_rref front, I had a thought. Since my long-term goal with IServer is to remove knowledge of server protocols from the upload/download code and move it into the IServer class (to facilitate having multiple backends), why not make it IServer.callRemote(..) instead of IServer.get_rref().callRemote(..)? The current IServer class would just do:

    def callRemote(self, methname, *args, **kwargs):
        if self.rref:
            return self.rref.callRemote(methname, *args, **kwargs)
        else:
            return defer.fail(DeadReferenceError())

It'd be easy to grep for that callRemote later, when it's time to hoist the abstraction and turn those calls into putShare or whatever.
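As a hedged illustration of what that wrapper would mean at call sites (the helper functions and arguments below are hypothetical, not from an actual patch):

    def report_corruption(server, si, shnum, reason):
        # Today: every caller must guard against a disconnected server.
        rref = server.get_rref()
        if rref is not None:
            return rref.callRemote("advise_corrupt_share",
                                   "mutable", si, shnum, reason)

    def report_corruption_with_wrapper(server, si, shnum, reason):
        # With the proposed wrapper: the guard lives in one place inside
        # IServer.callRemote(), and callers always get back a Deferred,
        # which fails with DeadReferenceError when the server is gone.
        return server.callRemote("advise_corrupt_share",
                                 "mutable", si, shnum, reason)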

killyourtv commented 2012-02-14 00:43:15 +00:00
Author
Owner

Replying to davidsarah (comment:5):

Replying to killyourtv:

I don't know how to best explain this (it may be related to the regression being tackled in #1628), ...

I can't see any evidence that it is the same bug. What makes you suspect that it is?

(#1583 seems to be the same bug as #1628, but they have different symptoms to this one.)

Sorry, just saw this reply. :|

It may have been stupid speculation on my part. I don't remember why I suspected it might be. Disregard (as you likely have by now).

killyourtv commented 2012-02-14 01:14:09 +00:00
Author
Owner

Replying to zooko:

[...]

killyourtv appears to be a reliable reporter of bugs so far -- his reports are low on speculation

[...]

That was until comment:86749 above. :| Sigh.

Anyhow... I'm back to testing with the latest git revision as of a few days ago (the webui shows 1.9.0.post77, rev 2da8339dd762aa96e73b1c1d98e217df43802418).

allmydata-tahoe: 1.9.0.post77,
foolscap: 0.6.2,
pycryptopp: 0.5.29,
zfec: 1.4.22,
Twisted: 11.1.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.7.2+,
platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF,
pyOpenSSL: 0.13,
simplejson: 2.3.2,
pycrypto: 2.5,
pyasn1: unknown,
mock: 0.7.2,
sqlite3: 2.6.0 [sqlite 3.7.10],
setuptools: 0.6c16dev3

Once again, patches from #1007, #1010, and #68 have been applied. Everything, so far, seems to be working fine. Or...I should say that things are much better than they had been. It's been running for ~36 hours now.

The only problems I've had (a couple of error 500s during upload or repair) can probably be attributed to our not-always-as-stable-as-we'd-like grid. When the failures occurred, the tasks remained on the status page, such as:

publish 	3l6y3gkgdxxwwsn2mwtdmssjse 	No 	0B 	0.0% 	Started
publish 	zyt7tuse472ikqnjc7dmw3ai4y 	No 	589B 	0.0% 	Pushing shares

These entries are from around two hours ago (it's currently just after 1am UTC). These tasks are no longer running but they're still in the "Active Operations" section.

    Started: 22:53:57 13-Feb-2012
    Storage Index: 3l6y3gkgdxxwwsn2mwtdmssjse
    Helper?: No
    Current Size: 0
    Progress: 0.0%
    Status: Started

Retrieve Results

    Encoding: 5 of 10
    Sharemap:
        0 -> Placed on [3t4enhba]
        1 -> Placed on [eyp5vu5v], [bgahn3ft]
        2 -> Placed on [3ejzcskd]
        3 -> Placed on [uxqmbg6c]
        5 -> Placed on [jvgf7m73]
        6 -> Placed on [3teqr2cf], [n64vwepc]
        7 -> Placed on [q5b52rmg]
        8 -> Placed on [ql3cupul], [p6463dyg]
        9 -> Placed on [poeonyl6]
    Timings:
        Total: ()
            Setup:
            Encrypting: 0us ()
            Encoding: 0us ()
            Packing Shares: ()
                RSA Signature:
            Pushing: ()
    Started: 23:18:25 13-Feb-2012
    Storage Index: zyt7tuse472ikqnjc7dmw3ai4y
    Helper?: No
    Current Size: 589
    Progress: 0.0%
    Status: Pushing shares

Retrieve Results

    Encoding: 3 of 10
    Sharemap:
        0 -> Placed on [p6463dyg], [3t4enhba]
        1 -> Placed on [n64vwepc]
        2 -> Placed on [basz6g7o]
        3 -> Placed on [5mefqfis]
        4 -> Placed on [eyp5vu5v]
        5 -> Placed on [3ejzcskd]
        6 -> Placed on [jvgf7m73]
        7 -> Placed on [vvnvgne4]
        8 -> Placed on [iqhk5e2t]
        9 -> Placed on [uxqmbg6c]
    Timings:
        Total: ()
            Setup: 5.7ms
            Encrypting: 283us (2.08MBps)
            Encoding: 151us (3.90MBps)
            Packing Shares: ()
                RSA Signature: 6.5ms
            Pushing: ()

This is the only entry in twistd.log from around that time:

2012-02-13 23:37:48+0000 [-] Unhandled error in Deferred:
2012-02-13 23:37:48+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/nevow/flat/twist.py", line 42, in cb
            _drive(iterable, finished)
          File "/usr/lib/python2.7/dist-packages/nevow/flat/twist.py", line 26, in _drive
            finished.callback('')
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
            self._startRunCallbacks(result)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/nevow/appserver.py", line 181, in _cbFinishRender
            self.finishRequest(  True )
          File "/usr/lib/python2.7/dist-packages/nevow/appserver.py", line 176, in finishRequest
            server.Request.finish(self)
          File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 866, in finish
            "Request.finish called on a request after its connection was lost; "
        exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this

If I'm around if/when it happens again I'll generate an incident report.

All that said, things look much better.

Replying to killyourtv (comment:11):

The only problems I've had (a couple error 500s during upload or repair) can probably be attributed to our not-always-as-stable-as-we'd-like grid. When the failures occurred the tasks remained on the status page, such as

Hrm. I don't think bad behavior on the part of the servers should ever cause a 5xx error code from the gateway. Maybe I'm wrong about that, but at the very least the gateway should emit a clear explanation of what went wrong. Please report any 5xx errors you see so we can figure out if they actually represent bugs in your gateway or if they represent bad behavior on the part of the servers. In the latter case, the gateway ought to explain clearly in the error message what went wrong.

killyourtv commented 2012-02-14 10:34:38 +00:00
Author
Owner

Replying to zooko (comment:12):

[...]

Please report any 5xx errors you see so we can figure out if they actually represent bugs in your gateway or if they represent bad behavior on the part of the servers. In the latter case, the gateway ought to explain clearly in the error message what went wrong.

Filed #1675.

davidsarah commented 2012-03-31 23:00:18 +00:00
Author
Owner

comment:86758 seems to suggest that this is fixed. Is that right? Do we have adequate tests for the original problem?

zooko was unassigned by tahoe-lafs 2012-03-31 23:00:18 +00:00
warner was assigned by tahoe-lafs 2012-03-31 23:00:18 +00:00
davidsarah commented 2012-03-31 23:04:41 +00:00
Author
Owner

Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:86758 only changed the retrieve code.

killyourtv commented 2012-04-20 09:11:39 +00:00
Author
Owner

For what it's worth, I cannot reproduce this anymore (current git) so I think this may have indeed been fixed.

allmydata-tahoe: 1.9.0.post133.dev0,
foolscap: 0.6.3,
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958,
zfec: 1.4.24,
Twisted: 11.1.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.7.3rc2,
platform: Linux-debian_wheezy/sid-x86_64-64bit_ELF,
pyOpenSSL: 0.13,
simplejson: 2.5.0,
pycrypto: 2.5,
pyasn1: unknown,
mock: 0.8.0,
sqlite3: 2.6.0 [sqlite 3.7.11],
setuptools: 0.6c16dev3

Everything has been working wonderfully during the last week that I've been on this version. Thanks for all of your (collective) work on this. :)

tahoe-lafs added the
fixed
label 2012-04-20 09:11:39 +00:00
killyourtv closed this issue 2012-04-20 09:11:39 +00:00

Thanks for the update (comment:86767), kytv. I hope someone who understands the code better than I can tell us how likely it is that some specific patch or set of patches fixed this specific problem. If not, I'll try to dig into the question myself at some point. Adding the keyword "review-needed" to indicate that we need someone to look into the source code and try to connect this particular issue with some specific source code changes.

zooko removed the
fixed
label 2012-04-22 18:10:23 +00:00
zooko reopened this issue 2012-04-22 18:10:23 +00:00

Just to explain why I re-opened this ticket:

While a report from a user like kytv, who actually experiences the bug in the wild, is extremely valuable, and I would prefer not to close a ticket without such a report, I would also prefer not to close a ticket without a source-code-oriented explanation of what the bug was and how it was fixed.

Bugs are sneaky things, and I don't believe they are really gone until I see the bodies.

killyourtv commented 2012-04-22 22:55:25 +00:00
Author
Owner

OK. Should I refrain from closing bugs (that I reported) in the future? (I don't want to create "more work" for others--in fact, I was trying to help by closing it. ;))

davidsarah commented 2012-04-22 23:53:13 +00:00
Author
Owner

As I implied in comment:86766, I don't think the patches warner mentioned in comment:86758 can have fixed this bug, since the bug occurs on upload, not retrieve. I don't know of any other changes that would have affected it, so it probably still exists.

killyourtv: in general closing a bug with "cannot reproduce" should be done with great reluctance, since bugs don't fix themselves, and we like to know what change fixed something. That resolution status is really intended for the case where the relevant code has changed so much since the bug was reported that the original report is probably no longer relevant. It's fine to close bugs after having verified that a particular fix works, though.

Okay, since David-Sarah says (per comment:86775) that the patches mentioned in comment:86758 can't have fixed this issue, I'm removing the review-needed tag. (I was reading the most recent Tahoe-LAFS Weekly News (/~marlowe/TWN29.html) and saw this ticket listed as one of the review-needed tickets.)

killyourtv: yeah, one practice that we have in this project which is different from a lot of other projects is that we focus on reducing uncertainty about bugs. A lot of projects promote a sort of vague notion of "degree of bugginess" with associated folklore, like "Oh yeah, version 13.1.6 was way more crashy than version 13.1.3, but then version 22.0.7 was way more stable...".

I prefer for our collective understanding of bugs to be as specific and reproducible as possible. Therefore, since you definitely spotted a bug originally (thank you!) and since we don't have a specific reason to believe that the bug is fixed, we should leave this ticket open.

Thanks again for your good bug reports!

Replying to davidsarah:

Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:86758 only changed the retrieve code.

Look at the traceback: the exception happens in Retrieve. I imagine that the WUI operation was to upload a file and add it to the directory, so a Retrieve operation occurs when it fetches the previous contents of the directory (so it can modify and publish the new contents). No (immutable) Upload code was implicated in the traceback.

I gather zooko has some specific idea in mind for how this bug can be closed, handing it off to him.

warner removed their assignment 2012-05-13 20:53:23 +00:00
zooko was assigned by warner 2012-05-13 20:53:23 +00:00
davidsarah commented 2012-05-14 00:52:13 +00:00
Author
Owner

Replying to warner (comment:24):

Replying to davidsarah:

Wait, I'm confused. The description says that the error occurs on "upload" (i.e. publish), but Brian's patches mentioned in comment:86758 only changed the retrieve code.

Look at the traceback: the exception happens in Retrieve.

Ah, I see. ~~OK, I agree that the patches referenced in comment:86758 address the problem in the traceback.~~ However, I don't see any regression tests for that problem.

tahoe-lafs changed title from Unhandled error in Deferred while uploading (current darcs build) to Unhandled error in Deferred during retrieve 2012-05-14 01:05:44 +00:00

I don't think the patches referenced in comment:86758 fix this bug. I read the patches and don't see a fix for this. The path that led to this exception is still in place in trunk. That is: IServer.get_rref() can return None (because the connection-lost handling in storage_client.py sets self.rref to None), and the send-corruption-report-to-server functionality doesn't test for None.

This is an instance of a problem that appears in many places in trunk; the problem and some proposed solutions are described in comment:86756, comment:86757, and comment:11.

zooko modified the milestone from 1.9.2 to 2.0.0 2012-05-16 22:02:18 +00:00
zooko modified the milestone from 2.0.0 to 1.9.2 2012-05-16 22:02:35 +00:00
zooko was unassigned by tahoe-lafs 2012-06-12 16:48:17 +00:00
warner was assigned by tahoe-lafs 2012-06-12 16:48:17 +00:00

Attachment dead-rref.diff (3048 bytes) added

retain rref even after disconnect

dead-rref.diff implements zooko's second approach from comment:11, namely to make the IServer retain the dead RemoteReference after disconnect, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers).

This patch should remove the problem. A later cleanup step could go through the callers of get_rref() and remove the now-redundant guards, but a better change would be to remove get_rref() and have callers do server.callRemote(), which will then be replaced by higher-level actions like server.putShare() as we move away from Foolscap.
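A minimal sketch of the idea, with class and method names assumed from the discussion rather than copied from the patch:

    class NativeStorageServer(object):  # sketch of the IServer implementation
        def __init__(self):
            self.rref = None  # no RemoteReference until the first connect

        def _got_connection(self, rref):
            self.rref = rref  # live RemoteReference

        def _lost(self):
            # Keep the dead RemoteReference instead of resetting it to None.
            # Foolscap errbacks any callRemote() on a dead reference with
            # DeadReferenceError, so callers that were connected at least
            # once always get a usable (if dead) object from get_rref().
            pass

        def get_rref(self):
            return self.rref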

davidsarah commented 2012-06-12 22:51:34 +00:00
Author
Owner

Will review.

I reviewed attachment:dead-rref.diff. I see no bugs in it, it is a small patch, and I (still) like the idea of doing it that way. It does make me mildly uncomfortable that there are no unit tests in the patch.

Attachment dead-rref-test.diff (5107 bytes) added

same patch, plus a test

In the new patch (which includes a test), the assignment of node_key_s will eventually need to move out of the conditional init_storage(), which only happens if the node is running a storage server (client-only nodes will get a node-key too). But that fix can happen after 1.10, and will happen naturally as part of the client-side accounting work.

davidsarah commented 2012-06-15 01:44:04 +00:00
Author
Owner

I reviewed dead-rref-test.diff and approve it for 1.9.2. Will apply.

tahoe-lafs added
code-network
and removed
unknown
labels 2012-06-15 01:44:04 +00:00
david-sarah@jacaranda.org commented 2012-06-15 02:05:59 +00:00
Author
Owner

In changeset:1b07d307619049dd:

After a server disconnects, make the IServer retain the dead RemoteReference, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test. fixes #1636
tahoe-lafs added the
fixed
label 2012-06-15 02:05:59 +00:00
david-sarah@jacaranda.org closed this issue 2012-06-15 02:05:59 +00:00
david-sarah <david-sarah@jacaranda.org> commented 2012-06-15 02:09:02 +00:00
Author
Owner

In changeset:1b07d307619049dd:

After a server disconnects, make the IServer retain the dead RemoteReference, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test. fixes #1636
davidsarah commented 2012-06-15 02:58:26 +00:00
Author
Owner

It looks like this test failure (https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/FreeStorm%20CentOS6-amd64/builds/98/steps/test/logs/stdio) is in the new test:

[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x70d3cf8 [0.972191095352s] called=0 cancelled=0 Reconnector._timer_expired()>

allmydata.test.test_system.Connections.test_rref
tahoe-lafs removed the
fixed
label 2012-06-15 02:58:26 +00:00
davidsarah reopened this issue 2012-06-15 02:58:26 +00:00

I traced this to a bug in foolscap, foolscap#196, and fixed it. I'll try to make a new release this weekend, and we can bump tahoe's dependency if you like. (It should only affect tests.)

(The bug was that Tub.stopService was shutting off Reconnectors by iterating over the self.reconnectors list, but the Reconnector shutdown process was removing items from that list during the iteration. The shared-state hazard resulted in some Reconnectors being dropped, so their timers weren't shut down, so one was still running at the end of the test, triggering the DirtyReactorError.)
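The hazard and the usual fix, as a generic sketch (not foolscap's actual code):

    class Tub(object):  # simplified sketch
        def __init__(self):
            self.reconnectors = []

        def stopService(self):
            # Buggy version: each stopConnecting() removes its Reconnector
            # from self.reconnectors while we are iterating over it, so
            # elements get skipped and their timers keep running:
            #
            #   for rc in self.reconnectors:
            #       rc.stopConnecting()
            #
            # Fixed version: iterate over a snapshot of the list, so
            # removals during the loop cannot affect the iteration.
            for rc in list(self.reconnectors):
                rc.stopConnecting()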

david-sarah@jacaranda.org commented 2012-06-15 23:21:36 +00:00
Author
Owner

In changeset:3ab0f33413de23dd:

Skip allmydata.test.test_system.Connections.test_rref unless we have foolscap >= 0.6.4, because of http://foolscap.lothar.com/trac/ticket/196 . refs #1636
davidsarah commented 2012-06-16 04:36:29 +00:00
Author
Owner

Review needed for fix-1636-for-1.9.2.darcs.patch.

davidsarah commented 2012-06-16 04:38:39 +00:00
Author
Owner

warner: I took your suggestion to compare self.c0.nodeid and s.get_serverid(), which seems to work. Please test with your candidate foolscap 0.6.4 release.

david-sarah <david-sarah@jacaranda.org> commented 2012-06-16 17:33:30 +00:00
Author
Owner

In changeset:3ab0f33413de23dd:

Skip allmydata.test.test_system.Connections.test_rref unless we have foolscap >= 0.6.4, because of http://foolscap.lothar.com/trac/ticket/196 . refs #1636
david-sarah@jacaranda.org commented 2012-06-16 18:20:43 +00:00
Author
Owner

In changeset:323774698ee0a2e8:

Make the intent of the loop over servers in test_system.Connections.test_rref clearer, and able to be the same in 1.9.2 and trunk. Remove the now-unused node_key_s attribute of Client. refs #1636
davidsarah commented 2012-06-16 19:06:39 +00:00
Author
Owner

Attachment fix-1636-for-1.9.2.darcs.patch (106553 bytes) added

[for 1.9.2, rebased] After a server disconnects, make the IServer retain the dead RemoteReference, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test.

david-sarah@jacaranda.org commented 2012-06-21 19:33:29 +00:00
Author
Owner

In changeset:5520/1.9.2:

[rebased for 1.9.2] After a server disconnects, make the IServer retain the dead RemoteReference, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test, which is now the same as on trunk. fixes #1636 for 1.9.2.
tahoe-lafs added the
fixed
label 2012-06-21 19:33:29 +00:00
david-sarah@jacaranda.org closed this issue 2012-06-21 19:33:29 +00:00
david-sarah@jacaranda.org commented 2012-07-16 16:33:54 +00:00
Author
Owner

In changeset:5882/cloud-backend:

[rebased for cloud-branch] After a server disconnects, make the IServer retain the dead RemoteReference, and continue to return it to anyone who calls get_rref(). This removes the need for callers to guard against receiving a None (as long as the server was connected at least once, which is always the case for servers returned by get_servers_for_psi(), which is how all upload/download code gets servers). Includes test, which is now the same as on trunk. fixes #1636 for cloud-branch