uncaught exception in mutable-retrieve: UCW between mapupdate and retrieve #474

Open
opened 2008-06-23 20:28:52 +00:00 by warner · 5 comments

If a mutable file is modified (by some external uncoordinated writer) after
we've done a servermap update but before we do a retrieve, then the exception
raised by mutable.Retrieve._maybe_send_more_queries
(source:src/allmydata/mutable/retrieve.py#L415) is not caught by the Deferred
chain. This means that the caller (who is probably in a filenode.modify
sequence) will be waiting forever for the answer.

The exception that shows up in the twisted log looks like this:


2008-06-22 00:19:31.726Z [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 667, in _done
            self.request.complete(res)
          File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 53, in complete
            self.deferred.callback(res)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 239, in callback
            self._startRunCallbacks(result)
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 304, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 317, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py", line 325, in _check_for_done
            return self._maybe_send_more_queries(k)
          File "/usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py", line 415, in _maybe_send_more_queries
            raise err
        allmydata.encode.NotEnoughSharesError: ran out of peers: have 0 shares (k=3), 2 queries in flight, need 1 more, found 0 bad shares, last failure: [Failure instance: Traceback: <class 'allmydata.mutable.common.UncoordinatedWriteError'>: someone wrote to the data since we read the servermap: prefix changed
        /usr/lib/python2.5/site-packages/foolscap/call.py:667:_done
        /usr/lib/python2.5/site-packages/foolscap/call.py:53:complete
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
        --- <exception caught here> ---
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
        /usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py:246:_got_results
        /usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py:268:_got_results_one_share
        ]

The Deferred chaining needs to be investigated to make sure that this
exception is properly returned to the caller via the errback on their
Deferred.

In addition, the code in filenode.modify needs to be examined to make sure
that this kind of uncoordinated write error is caught and retried. My concern
is that Retrieve is returning a NotEnoughSharesError that wraps a
UncoordinatedWriteError, rather than the UCWE directly, and that the
f.trap in modify() might not know to look for that.

If a mutable file is modified (by some external uncoordinated writer) after we've done a servermap update but before we do a retrieve, then the exception raised by mutable.Retrieve._maybe_send_more_queries (source:src/allmydata/mutable/retrieve.py#L415) is not caught by the Deferred chain. This means that the caller (who is probably in a filenode.modify sequence) will be waiting forever for the answer. The exception that shows up in the twisted log looks like this: ``` 2008-06-22 00:19:31.726Z [-] Unhandled Error Traceback (most recent call last): File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 667, in _done self.request.complete(res) File "/usr/lib/python2.5/site-packages/foolscap/call.py", line 53, in complete self.deferred.callback(res) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 239, in callback self._startRunCallbacks(result) File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 304, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 317, in _runCallbacks self.result = callback(self.result, *args, **kw) File "/usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py", line 325, in _check_for_done return self._maybe_send_more_queries(k) File "/usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py", line 415, in _maybe_send_more_queries raise err allmydata.encode.NotEnoughSharesError: ran out of peers: have 0 shares (k=3), 2 queries in flight, need 1 more, found 0 bad shares, last failure: [Failure instance: Traceback: <class 'allmydata.mutable.common.UncoordinatedWriteError'>: someone wrote to the data since we read the servermap: prefix changed /usr/lib/python2.5/site-packages/foolscap/call.py:667:_done /usr/lib/python2.5/site-packages/foolscap/call.py:53:complete /usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback /usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks --- <exception caught here> --- /usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks /usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py:246:_got_results /usr/lib/python2.5/site-packages/allmydata/mutable/retrieve.py:268:_got_results_one_share ] ``` The Deferred chaining needs to be investigated to make sure that this exception is properly returned to the caller via the errback on their Deferred. In addition, the code in filenode.modify needs to be examined to make sure that this kind of uncoordinated write error is caught and retried. My concern is that Retrieve is returning a `NotEnoughSharesError` that wraps a `UncoordinatedWriteError`, rather than the UCWE directly, and that the f.trap in modify() might not know to look for that.
warner added the
code-encoding
critical
defect
1.1.0
labels 2008-06-23 20:28:52 +00:00
Author

in addition, I think we need test coverage for an UCWE that occurs between mapupdate and retrieve. I believe that we already have coverage for one that occurs between retrieve and publish, but not before the retrieve. Note that this will require using a file that is large enough to not fit in the filenode's read-cache (20kB should be enough).

in addition, I think we need test coverage for an UCWE that occurs between mapupdate and retrieve. I believe that we already have coverage for one that occurs between retrieve and publish, but not before the retrieve. Note that this will require using a file that is large enough to not fit in the filenode's read-cache (20kB should be enough).
Author

Unit tests to write:

  • do mapupdate(MODE_READ), remove all storage servers, do Retrieve -> should fail
  • do mapupdate, disable all storage servers (return error), do Retrieve -> failure
  • do mapupdate, rollback all storage servers, do Retrieve(ver) -> failure
  • do mapupdate, corrupt all shares, do Retrieve -> failure
  • do modify, between the mapupdate and the retrieve rollback all servers -> retry
Unit tests to write: * do mapupdate(MODE_READ), remove all storage servers, do Retrieve -> should fail * do mapupdate, disable all storage servers (return error), do Retrieve -> failure * do mapupdate, rollback all storage servers, do Retrieve(ver) -> failure * do mapupdate, corrupt all shares, do Retrieve -> failure * do modify, between the mapupdate and the retrieve rollback all servers -> retry
Author

hrm, a quick check of the code doesn't reveal any paths by which _check_for_done is called outside of a Deferred chain. I added a few unit tests changeset:dba0ad3d2841b7a0 to corrupt all shares between the mapupdate and retrieve, and they pass. I also increased the size of the test file to 20kB to allow the post-mapupdate corruption to actually get noticed.

hrm, a quick check of the code doesn't reveal any paths by which _check_for_done is called outside of a Deferred chain. I added a few unit tests changeset:dba0ad3d2841b7a0 to corrupt all shares between the mapupdate and retrieve, and they pass. I also increased the size of the test file to 20kB to allow the post-mapupdate corruption to actually get noticed.
Author

At least one of these errors occurred while the webapi server in question was throwing out-of-memory errors, so I think I'm going to wait until it happens again before investigating it deeply.
For reference, the times (UTC) at which these errors occurred were:

2008-06-21 22:53:42.575Z    (06-21 15:53 PDT)
2008-06-22 00:00:57.964Z    (06-21 17:00 PDT)
2008-06-22 00:05:58.949Z    (06-21 17:05 PDT)
2008-06-22 00:19:31.726Z    (06-21 17:19 PDT)

with a MemoryError occurring at 2008-06-22 00:25:46.020Z . We have no events logged from this node (2b6w) between 17:02 and 17:36. Because of the foolscap log-buffer overflow problem, we have no foolscap log events past about 15:25.

At least one of these errors occurred while the webapi server in question was throwing out-of-memory errors, so I think I'm going to wait until it happens again before investigating it deeply. For reference, the times (UTC) at which these errors occurred were: ``` 2008-06-21 22:53:42.575Z (06-21 15:53 PDT) 2008-06-22 00:00:57.964Z (06-21 17:00 PDT) 2008-06-22 00:05:58.949Z (06-21 17:05 PDT) 2008-06-22 00:19:31.726Z (06-21 17:19 PDT) ``` with a MemoryError occurring at 2008-06-22 00:25:46.020Z . We have no events logged from this node (2b6w) between 17:02 and 17:36. Because of the foolscap log-buffer overflow problem, we have no foolscap log events past about 15:25.
warner added
major
and removed
critical
labels 2008-06-25 00:20:36 +00:00
warner added this to the eventually milestone 2008-06-25 00:20:36 +00:00
warner added
code-mutable
and removed
code-encoding
labels 2009-12-29 20:07:36 +00:00
tahoe-lafs modified the milestone from eventually to 1.7.0 2010-02-27 08:45:23 +00:00

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)
zooko modified the milestone from 1.7.0 to 1.8.0 2010-05-26 14:51:46 +00:00
tahoe-lafs modified the milestone from 1.8.0 to 1.9.0 2010-08-10 04:16:32 +00:00
tahoe-lafs modified the milestone from 1.9.0 to soon 2011-07-16 21:02:52 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#474
No description provided.