test_repairer test_repair_from_corruption_of_1 sometimes errors #611

Closed
opened 2009-02-09 09:27:34 +00:00 by warner · 4 comments

On some builders, this test_repairer unit test sometimes appears to hang, for long enough to trigger the timeout error. It might just be that the test is taking a long time, but I doubt it, because its neighboring tests (which should take about the same amount of time) never seem to fail. Also, I've seen it fail on an otherwise fast build slave.

http://allmydata.org/buildbot/builders/hardy2.6/builds/308 is the most recent failure I've seen.

We know that many of the repairer tests are asserting more functionality than the repairer has right now. Also, some of them wind up depending upon random values (probably in the corrupt-share routine), so they pass or fail at random. However, neither of these explains a hang.
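
As a hedged illustration of that randomness concern (the real corruption helper lives in test_repairer.py and may look quite different; `corrupt_share` here is a made-up stand-in), a corruptor that flips a byte at a random offset will sometimes hit a field the verifier actually checks and sometimes hit one it currently ignores, so the same test can pass or fail from run to run:

```
import os
import random

def corrupt_share(path):
    # Hypothetical sketch: flip one random byte of a stored share file.
    # Depending on which offset the RNG picks, the damage may land in a
    # field the verifier checks (e.g. the version byte or a block hash)
    # or in one it currently ignores, so detection -- and therefore the
    # test outcome -- varies from run to run.
    size = os.path.getsize(path)
    offset = random.randrange(size)
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        f.seek(offset)
        f.write(bytes([original[0] ^ 0xFF]))
```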

One hunch is that certain failure modes (e.g. corrupting a version byte) cause some state-machine-shaped code to forget to do a callback. Another is that we're missing an errback handler somewhere.
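
For illustration only (this is not the repairer's actual code; `check_version` is a made-up stand-in), the hang that hunch predicts looks roughly like this under Twisted trial: a Deferred that is never fired, or an exception with no errback attached, leaves the test waiting until trial's timeout expires:

```
from twisted.internet import defer
from twisted.trial import unittest

def check_version(share):
    # Hypothetical state-machine step: returns a Deferred that fires
    # once the share's version byte has been validated.
    d = defer.Deferred()
    if share[:1] == b"\x01":
        d.callback("ok")
    # BUG: on a corrupted version byte we fall through without calling
    # either d.callback() or d.errback(), so the Deferred never fires.
    return d

class Example(unittest.TestCase):
    timeout = 10  # trial aborts the test after this many seconds

    def test_corrupt_version_byte(self):
        # Returning an unfired Deferred makes trial wait for the full
        # timeout and then report an error -- which, on a slow
        # buildslave, looks exactly like a hang.
        return check_version(b"\xff corrupted")
```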

The ideal fix will be to understand why this is causing a hang and fix the bug. The second-best fix will be to remove or disable these tests, because of the other problems described above.

warner added the code-encoding, major, defect, 1.2.0 labels 2009-02-09 09:27:34 +00:00
warner added this to the 1.3.0 milestone 2009-02-09 09:27:34 +00:00
zooko self-assigned this 2009-02-09 17:12:33 +00:00
Author

I've added notes to test_repairer and NEWS to explain the current state of
these tests: the repair tests never pass (because they are very strenuous and
the repairer is incomplete), and sometimes they appear to hang.

I've disabled the repair-from-corruption tests.

I've seen one hang in the repair-from-deletion test, but it took an hour to
trigger, and I haven't been able to capture it (with enough logging to
reproduce it directly) despite another hour or two of trying. I'll continue
to hammer on it, but I don't think this is going to be a 1.3.0 fix.

So I'm going to push this ticket out to 1.3.1.

Here's a copy of my notes from test_repairer.py:

As recently documented in NEWS for the 1.3.0 release, the current
immutable repairer suffers from several limitations:

  • minimalistic verifier: it's just download without decryption, so we
    don't look for corruption in N-k shares, and for many fields (those
    which are the same in all shares) we only look for corruption in a
    single share

  • some kinds of corruption cause download to fail (when it ought to
    just switch to a different share), so repair will fail on these too

  • RIStorageServer doesn't offer a way to delete old corrupt immutable
    shares (the authority model is not at all clear), so the best the
    repairer can do is to put replacement shares on new servers,
    unfortunately leaving the corrupt shares in place

This test is pretty strenuous: it asserts that the repairer does the
ideal thing in 8 distinct situations, with randomized corruption in
each. Because of the aforementioned limitations, it is highly unlikely
to pass any of these. We're also concerned that the download-fails case
can provoke a lost-progress bug (one was fixed, but there might be more
lurking), which will cause the test to fail despite a ".todo" marker,
and will probably cause subsequent unrelated tests to fail too (due to
"unclean reactor" problems).

So we're turning this test off until we've done one or more of the
following:

  • remove some of these limitations
  • break the test up into smaller, more functionally-oriented pieces
  • simplify the repairer enough to let us be confident that it is free
    of lost-progress bugs
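
For context on the ".todo" and "unclean reactor" concerns in those notes, here is a rough sketch of Twisted trial behavior (not Tahoe code): an ordinary failing assertion under a todo marker is reported as an expected failure, whereas a lost-progress hang times out, leaves the reactor dirty, and gets reported as an error anyway.

```
from twisted.internet import defer, reactor
from twisted.trial import unittest

class Example(unittest.TestCase):
    timeout = 5

    def test_expected_failure(self):
        # A plain failing assertion in a test carrying a .todo marker is
        # reported as an expected failure, not as a test failure.
        self.fail("repairer is not this capable yet")
    test_expected_failure.todo = "repairer is incomplete"

    def test_lost_progress(self):
        # But if the code under test stalls -- say it scheduled a timed
        # call and never fires its Deferred -- trial times the test out,
        # and the leftover DelayedCall leaves the reactor "unclean".
        # Per the notes above, that error is not absorbed by .todo and
        # can bleed into later, unrelated tests.
        reactor.callLater(300, lambda: None)
        return defer.Deferred()
    test_lost_progress.todo = "repairer is incomplete"
```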

There was a typo in this. Brian wrote:

```I've added notes to test_repairer and NEWS to explain the current state of these tests: the repair tests never pass (because they are very strenuous and the repairer is incomplete), and sometimes they appear to hang.

I've disabled the repair-from-corruption tests.```

He meant:

```I've added notes to test_repairer and NEWS to explain the current state of these tests: the repair-from-corruption tests never pass (because they are very strenuous and the repairer is incomplete), and sometimes they appear to hang.

I've disabled the repair-from-corruption tests.```

I wasn't able to reproduce Brian's hang in `allmydata.test.test_repairer.Repairer.test_repair_from_deletion_of_1`. I ran it overnight (700 minutes); it ran the test 9443 times (taking around 3.4s per run, by the way) and didn't hang, fail, or otherwise do anything funny. I think I'll relaunch it after removing the debug printouts that I added, but currently I'm satisfied that the repairer is Good Enough for the tahoe-1.3.0 release.

Okay, I found the bug. If the uploader requests a read from the DownUpConnector after the downloader has finished writing and before the downloader closes, then that request will never be satisfied. changeset:d7dbd6675efa2f25 fixes this and tests it. Please review! This is the last code change that is scheduled to go into tahoe-1.3.0 and I would hate to have inserted a bug into tahoe-1.3.0 at the last moment, so please review this patch while I work on CREDITS and relnotes.txt. :-)
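
As a rough sketch of the shape of that bug and its fix (a simplified stand-in, not the real DownUpConnector), the buffer between downloader and uploader has to remember pending read requests and flush them once the writer signals it is done; otherwise a read issued between the last write and the close waits forever:

```
from twisted.internet import defer

class BufferConnector:
    # Simplified stand-in for a download->upload buffer: the downloader
    # calls write() and then close(); the uploader calls read(), which
    # returns a Deferred.
    def __init__(self):
        self._buffer = b""
        self._closed = False
        self._pending = []  # list of (bytes_wanted, Deferred)

    def write(self, data):
        self._buffer += data
        self._satisfy_reads()

    def read(self, wanted):
        d = defer.Deferred()
        self._pending.append((wanted, d))
        self._satisfy_reads()
        return d

    def close(self):
        self._closed = True
        # This is the shape of the fix described above: once the writer
        # is done, any still-pending reads must be fired with whatever
        # data remains (possibly a short read).  Without this, a read()
        # issued after the final write() simply never gets an answer.
        self._satisfy_reads()

    def _satisfy_reads(self):
        while self._pending:
            wanted, d = self._pending[0]
            if len(self._buffer) >= wanted:
                data, self._buffer = self._buffer[:wanted], self._buffer[wanted:]
            elif self._closed:
                data, self._buffer = self._buffer, b""
            else:
                return  # not enough data yet, and the writer is still active
            self._pending.pop(0)
            d.callback(data)
```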

By the way, I also ran another 4876 iterations of `test_repair_from_deletion_of_1` (this time without the debug printouts) with no deviation in behavior, and I am wondering why there is any variation at all in the behavior of this test on Brian's machine. Could it be that the system test is using the loopback interface, and that the loopback device drops or reorders packets very rarely on Brian's machine but never, or at least even more rarely, on mine?

zooko added the fixed label 2009-02-12 23:34:44 +00:00
zooko closed this issue 2009-02-12 23:34:44 +00:00
zooko added this to the 1.3.0 milestone 2009-02-12 23:34:48 +00:00