bug in repairer causes sporadic hangs in unit tests #616
There is a bug in DownUpConnector._satisfy_reads_if_possible() (source:src/allmydata/immutable/repairer.py@20090112214120-e01fd-7d241072d30b14d3e243829e952e8c8440e6c461#L127). It should put the leftover bytes back into self.bufs and hand the rest to the result, not put all-but-leftover bytes back and hand the rest to the result! In cases where the input chunks arrive in different sizes than the read requests, this bug could cause a read request to get more or fewer bytes than it asked for. That could lead to data corruption (although not irreversibly so -- it would then upload the same sequence of bytes but in different-sized blocks, which would break the integrity checking but not the ciphertext).
Fortunately, in our current code the writes and the read requests are always of the same size (the block size), so this doesn't happen in practice. I've added an assertion in changeset:c59940852b94ba45 just to make it fail safely if it ever did. I have started writing unit tests for DownUpConnector._satisfy_reads_if_possible() -- it turns out we need unit tests in addition to the functional tests I already wrote in source:src/allmydata/test/test_repairer.py.
This explains the sporadic "lost progress" failure in the functional tests. Hm... could it also explain the "lost progress" behavior that Brian and I witnessed on the testgrid when this code was newly committed to trunk? I hope not, because that would mean I am wrong about the writes and reads always having the same sizes. But I'm pretty sure I am right about that.
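For illustration, here is a minimal sketch of the intended buffer-splitting behavior. This is not the actual DownUpConnector code; the bufs list and the satisfy_read name are hypothetical stand-ins. The idea is: when enough bytes are buffered to satisfy a pending read, the requested prefix goes to the reader and only the leftover tail goes back into the buffer. The bug was, roughly, swapping those two slices, which is harmless only when every write happens to be exactly the size of every read.

    # Minimal sketch of the intended read-satisfaction logic, assuming a
    # hypothetical "bufs" list of byte chunks. Not the real DownUpConnector
    # code, only an illustration of the fix described above.
    def satisfy_read(bufs, bytes_requested):
        """Return exactly bytes_requested bytes if enough are buffered,
        leaving any leftover in bufs; return None if not enough yet."""
        buffered = sum(len(chunk) for chunk in bufs)
        if buffered < bytes_requested:
            return None  # wait for the next write to supply more data
        data = b"".join(bufs)
        result = data[:bytes_requested]    # the bytes the read asked for
        leftover = data[bytes_requested:]  # the *leftover* goes back into bufs
        bufs[:] = [leftover] if leftover else []
        return result

    # The buggy version effectively returned the wrong slice and re-buffered
    # the other, which only matters when write and read sizes differ.
    bufs = [b"abcdef"]
    assert satisfy_read(bufs, 4) == b"abcd"
    assert bufs == [b"ef"]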
As mentioned in #611, we disabled the repair-from-corruption tests, and have only rarely seen lost-progress in the remaining repair-from-deletion test.
Zooko fixed one bug in the repairer which would have caused lost-progress, but didn't see any other obvious ones.
I've seen lost-progress in repair-from-deletion twice now (after zooko's fix), but it's pretty rare (and therefore hard to analyze). Since repair-from-deletion is supposed to be deterministic, the only entropy source remaining is the order in which download reads and upload writes are interleaved, which means it's going to be a long hard struggle to capture enough information for analysis.
So we're going to push this one out to 1.3.1. We'd like to have a perfect repairer in 1.3.0, but we also want to have 1.3.0 soon, and a repairer which hangs once out of every thousand uses might be good enough for that.
fixed in 1.3.0 by changeset:d7dbd6675efa2f25