bug in repairer causes sporadic hangs in unit tests #616

Closed
opened 2009-02-10 08:48:40 +00:00 by zooko · 2 comments

There is a bug in `DownUpConnector._satisfy_reads_if_possible()`:

source:src/allmydata/immutable/repairer.py@20090112214120-e01fd-7d241072d30b14d3e243829e952e8c8440e6c461#L127

It should be putting the `leftover` bytes back into `self.bufs` and the rest into the result, not putting all-but-`leftover` bytes back and the rest into the result! In cases where the input chunks arrive in sizes different from the read requests, this bug could give a read request more or fewer bytes than it requested. That could lead to data corruption (although not irreversibly so -- it would then upload the same sequence of bytes but in different-sized blocks, which would screw up the integrity-checking code but not the ciphertext).
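To make the intended split concrete, here is a minimal standalone sketch of the buffering logic (assuming `self.bufs` is a FIFO list of byte strings filled by the downloader; `BufferedConnector`, `write()`, and `_satisfy_read()` are illustrative names, not the actual repairer.py code):

```python
# Minimal sketch of the intended buffering logic -- illustrative names,
# not the actual repairer.py code. Assumes self.bufs is a FIFO list of
# byte strings fed by the downloader's write() calls.

class BufferedConnector:
    def __init__(self):
        self.bufs = []     # chunks written by the downloader, in order
        self.bufsiz = 0    # total number of bytes currently buffered

    def write(self, data):
        self.bufs.append(data)
        self.bufsiz += len(data)

    def _satisfy_read(self, bytes_wanted):
        """Pop exactly bytes_wanted bytes off the front of self.bufs."""
        assert self.bufsiz >= bytes_wanted
        result = []
        bytes_so_far = 0
        while bytes_so_far < bytes_wanted:
            buf = self.bufs.pop(0)
            bytes_so_far += len(buf)
            result.append(buf)
        leftover = bytes_so_far - bytes_wanted
        if leftover:
            # The last chunk overshot the request: keep its head in the
            # result and push its leftover-byte tail back onto self.bufs
            # for the next read. (The bug did this split backwards.)
            last = result.pop()
            result.append(last[:-leftover])
            self.bufs.insert(0, last[-leftover:])
        self.bufsiz -= bytes_wanted
        return b''.join(result)
```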

Fortunately, in our current code, the writes and the read requests are always the same size (the block size), so this doesn't happen in practice. I've added an assertion in changeset:c59940852b94ba45 just to make it fail safely if this were ever to happen. I have started writing unit tests for `DownUpConnector._satisfy_reads_if_possible()` -- it turns out that we need unit tests in addition to the functional tests that I already wrote in source:src/allmydata/test/test_repairer.py.
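For illustration, here is a sketch of the kind of unit test meant here, driving a chunk/read size mismatch through the `BufferedConnector` sketch above (the real tests target `DownUpConnector` in source:src/allmydata/test/test_repairer.py):

```python
# Sketch of the kind of unit test meant here, exercising the chunk/read
# size mismatch against the BufferedConnector sketch above. The real
# tests target DownUpConnector in test_repairer.py.

import unittest

class LeftoverTest(unittest.TestCase):
    def test_read_smaller_than_chunk(self):
        c = BufferedConnector()
        c.write(b'abcdefgh')                       # one 8-byte chunk
        self.assertEqual(c._satisfy_read(5), b'abcde')
        # the 3-byte leftover must stay buffered, not leak into the read
        self.assertEqual(c.bufs, [b'fgh'])

    def test_read_spanning_chunks(self):
        c = BufferedConnector()
        c.write(b'abc')
        c.write(b'defgh')
        self.assertEqual(c._satisfy_read(4), b'abcd')
        self.assertEqual(c.bufs, [b'efgh'])

if __name__ == '__main__':
    unittest.main()
```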

This explains the sporadic "lost progress" failure in the functional tests. Hm... Could it also explain the "lost progress" behavior that Brian and I witnessed on the testgrid when this code was newly committed to trunk? I hope not, because that would mean that I am wrong about the writes and reads always having the same sizes. But I'm pretty sure I am right about that.

zooko added the code-encoding, major, defect, 1.2.0 labels 2009-02-10 08:48:40 +00:00
zooko added this to the 1.3.0 milestone 2009-02-10 08:48:40 +00:00

as mentioned in #611, we disabled the repair-from-corruption tests, and have only rarely seen lost-progress in the remaining repair-from-deletion test.

Zooko fixed one bug in the repairer which would have caused lost-progress, but didn't see any other obvious ones.

I've seen lost-progress in repair-from-deletion twice now (after zooko's fix), but it's pretty rare (and therefore hard to analyze). Since repair-from-deletion is supposed to be deterministic, the only entropy source remaining is the order in which download reads and upload writes are interleaved, which means it's going to be a long hard struggle to capture enough information for analysis.

So we're going to push this one out to 1.3.1. We'd like to have a perfect repairer in 1.3.0, but we also want to have a 1.3.0 soon, and a repairer which hangs once out of every thousand uses might be good enough for that.


fixed in 1.3.0 by changeset:d7dbd6675efa2f25

zooko added the fixed label 2009-02-20 21:41:06 +00:00
zooko added this to the 1.3.0 milestone 2009-02-20 21:41:06 +00:00
zooko closed this issue 2009-02-20 21:41:06 +00:00