mutable: tolerate mixed corrupt/good shares from any given peer #211
Reference: tahoe-lafs/trac-2024-07-25#211
The current mutable file Retrieve code has a control flow problem that causes
it to respond to a corrupt share by ignoring any remaining shares from the
same peer. This causes unnecessary problems for small grids, because it makes
fewer shares available for use. In the worst case, this could make files
unavailable.
This worst case is only likely to be exercised in a unit test, but it is
exactly what happens in our test_mutable, which uses 5 nodes, 10 shares
(7 of which are corrupt), and 3-of-10 encoding: only 3 good shares exist,
so discarding even one good share because it lives on the same peer as a
corrupt one leaves fewer than k=3 usable shares and makes the file
unretrievable.
To fix this, we need to modify the control flow in Retrieve._got_results so
that a CorruptShareError still allows the remaining shares from that peer to
be processed, with the exception re-raised at the end of the loop (to notify
_query_failed, which cares about the peerid but not the share number).
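A minimal sketch of that intended control flow, using hypothetical
validate() and accept() callables in place of the real Retrieve internals:
process every share the peer returned, remember the first corruption, and
only raise after the loop.

```python
class CorruptShareError(Exception):
    def __init__(self, peerid, shnum, reason):
        Exception.__init__(self, peerid, shnum, reason)
        self.peerid = peerid
        self.shnum = shnum
        self.reason = reason

def got_results(peerid, shares, validate, accept):
    """shares: dict mapping share number -> share data from one peer."""
    first_error = None
    for shnum, data in shares.items():
        try:
            validate(peerid, shnum, data)   # may raise CorruptShareError
        except CorruptShareError as e:
            if first_error is None:
                first_error = e             # remember it, keep processing
            continue
        accept(shnum, data)                 # good share: keep it usable
    if first_error is not None:
        # Re-raise after the loop so the error path (_query_failed in the
        # real code) still learns which peer returned corruption.
        raise first_error
```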
The current workaround is to use 10 nodes in that test instead of 5. Once we
fix this control flow, test_system.SystemTest.test_mutable should be restored
to using 5 nodes instead of 10, because the memory footprint of a 10-node
test is considerably larger than that of a 5-node test (233MB instead of
77MB).
The workaround was introduced in changeset:59d6c3c8229d8457 to fix #209 in time for the 0.7.0 release.
Fixed in changeset:e3037a7541d2a37c. I also reduced the test case back down to 5 nodes; to exercise the recent resource.setrlimit code in node.py, you'll want to briefly raise it back up to 10.
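For context, a sketch of the kind of rlimit bump that comment refers to
(hypothetical; the actual node.py code may differ): a many-node test opens
many sockets and share files, so the code raises the soft RLIMIT_NOFILE
toward the hard limit, which is all an unprivileged process may do.

```python
import resource

def increase_rlimit_nofile(target=1024):
    # Raise the soft file-descriptor limit toward the hard limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < target:
        if hard == resource.RLIM_INFINITY:
            new_soft = target
        else:
            new_soft = min(target, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```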