mutable: tolerate mixed corrupt/good shares from any given peer #211

Closed
opened 2007-11-15 20:54:00 +00:00 by warner · 2 comments

The current mutable file Retrieve code has a control flow problem that causes
it to respond to a corrupt share by ignoring any remaining shares from the
same peer. This causes unnecessary problems for small grids, because it makes
fewer shares available for use. In the worst case, this could make files
unavailable.

This worst case is only likely to be exercised in a unit test, but that's
what is happening in our test_mutable, where we use 5 nodes, 10 shares (of
which 7 are corrupt), 3-of-10 encoding.

To fix this, we need to modify the control flow in Retrieve._got_results so
that a CorruptShareError no longer stops processing of the remaining shares,
but the exception is still raised at the end of the loop (to notify
_query_failed, which cares about the peerid but not the share number).
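
A minimal sketch of that control flow, not the actual Retrieve code: the
names `got_results`, `validate_share`, and the `BAD` prefix check are
hypothetical stand-ins, and the `CorruptShareError` class here is a
simplified local definition. It only shows the idea of remembering the error,
continuing through the peer's remaining shares, and raising once the loop is
done.

```python
class CorruptShareError(Exception):
    """Simplified stand-in for the real exception (assumed shape)."""

    def __init__(self, peerid, shnum, reason):
        super().__init__(f"share {shnum} from peer {peerid}: {reason}")
        self.peerid = peerid
        self.shnum = shnum


def validate_share(peerid, shnum, data):
    # Hypothetical check; the real code verifies hashes and signatures.
    if data.startswith(b"BAD"):
        raise CorruptShareError(peerid, shnum, "failed validation")
    return data


def got_results(peerid, shares, good_shares):
    """Process every share from one peer, then report any corruption.

    shares: dict mapping share number -> share bytes from this peer.
    good_shares: dict that collects the shares which validated.
    """
    last_error = None
    for shnum, data in sorted(shares.items()):
        try:
            good_shares[shnum] = validate_share(peerid, shnum, data)
        except CorruptShareError as e:
            # Remember the failure, but keep going so that good shares
            # from the same peer are not thrown away.
            last_error = e
    if last_error is not None:
        # Raised after the loop so the caller still learns that this
        # peer returned at least one corrupt share.
        raise last_error
```

With a peer that returns one corrupt and one good share, the good share is
still collected before the exception reaches the caller.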

The current workaround is to use 10 nodes in that test instead of 5. Once we
fix this control flow, test_system.SystemTest.test_mutable should be restored
to using 5 nodes instead of 10, because the memory footprint of a 10-node test
is considerably larger than a 5-node test (233MB instead of 77MB).

warner added the code, major, defect, 0.6.1 labels 2007-11-15 20:54:00 +00:00
warner self-assigned this 2007-11-15 20:54:00 +00:00
Author

The workaround was introduced in changeset:59d6c3c8229d8457 to fix #209 in time for the 0.7.0 release.

Author

Fixed in changeset:e3037a7541d2a37c. I also reduced the test case back down to 5 nodes. To exercise the recent resource.setrlimit code in node.py, you'll want to raise that back up to 10 briefly.

warner added the fixed label 2007-11-16 23:14:52 +00:00
warner added this to the 0.7.0 milestone 2007-11-16 23:14:52 +00:00
Reference: tahoe-lafs/trac-2024-07-25#211