Update for mutable files is sensitive to the number of servers #2034
When the number of servers in a grid is greater than or equal to N + k, there is a race condition when mutable files are updated. While waiting for responses, ServermapUpdater hits a threshold once EPSILON servers have reported that they have no shares (currently EPSILON = k). This causes the update process to finish and reject any remaining responses.
However, the threshold can be hit before ServermapUpdater has finished processing a server that does hold shares and added it to the servermap, so that server's response is rejected. Since the server never makes it into the servermap, no writer is created for it, and the updater raises a KeyError when it later tries to look up the writer for each share.
This race condition can be replicated by setting the number of servers for allmydata.test.test_mutable.Update to 13.
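To make the failure mode concrete, here is a deliberately simplified sketch of the logic described above; `SketchUpdater`, `got_empty_response`, and the other names are invented for illustration and are not the real ServermapUpdater API.

```python
# Simplified sketch of the race described above -- not the real
# allmydata.mutable.servermap.ServermapUpdater, just its shape.

class SketchUpdater:
    def __init__(self, epsilon):
        self.epsilon = epsilon      # in Tahoe-LAFS this is currently k
        self.empty_responses = 0
        self.done = False
        self.servermap = {}         # server -> shares fully processed
        self.writers = {}           # created only for servers in the servermap

    def got_empty_response(self, server):
        self.empty_responses += 1
        if self.empty_responses >= self.epsilon:
            # The updater declares itself finished, even if another server's
            # "I have shares" response is still being processed.
            self.done = True

    def got_shares(self, server, shares):
        if self.done:
            return                  # rejected: server never enters the map
        self.servermap[server] = shares
        self.writers[server] = object()   # placeholder for a real writer

    def writer_for(self, server):
        # KeyError here is the failure the updater hits for each share.
        return self.writers[server]
```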
Changed title from "Mutable update tests are sensitive to the number of servers" to "Update for mutable files is sensitive to the number of servers".
Possible related tickets: #549, #1042
Here is a patch that will solve this problem: https://github.com/markberger/tahoe-lafs/tree/2034-mutable-update-race-condition
This patch is necessary in order to have tests for #1057 with a reasonably large number of servers.
I just had a look at https://github.com/markberger/tahoe-lafs/commit/08f70b511345c9ae267ae90fe31adf5cee1fd723, but I wasn't sure I understood the issue. I thought maybe looking at a unit test would help me understand what this patch is fixing, but there is no unit test included in the patch. Then I came back and read this ticket's original message and started to understand it better.
However, I still don't really understand it.
When can this happen? Is this if there are > K servers in existence, and K servers have responded to queries, saying that they have no shares (or, perhaps some servers fail instead of responding "I have no shares"), and then… what happens? Is this server that "is not added to the servermap", mentioned above, one of the servers that has not yet been queried? Or it has been queried but it has not yet responded?
Ideas:
* Should we patch the unit tests, as described in the original message ("This race condition can be replicated by setting the number of servers for allmydata.test.test_mutable.Update to 13")?
* Should the resulting unit test have a comment giving this ticket number and the URL to this ticket?
Wild stab in the dark, but this couldn't have anything to do with #1670, could it?
If I am remembering this correctly, ServermapUpdater starts querying servers and it will stop after epsilon (k) servers have responded with no shares. However, the querying process can end while we know a server has shares, but we haven't finished communicating with it yet.
So let's say our epsilon is 2, and we have three servers: A, B, and C. ServermapUpdater sends queries to servers A, B, and C. Server A responds that it has shares, so ServermapUpdater starts making more requests to get info about each share on server A. But before we can finish our transactions with server A, servers B and C respond that they have no shares. After servers B and C respond, ServermapUpdater checks whether it has hit the epsilon limit. Since our limit is 2, the limit has been hit, and we start rejecting all responses from server A. Since we never finished our transaction with server A, a writer for server A was never created.
This is an issue when the number of servers is greater than or equal to N + k because the querying process will not end until it receives feedback from N + k servers during a share write.
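Replaying that timeline against the simplified SketchUpdater sketched under the ticket description above (epsilon = 2, servers A, B, and C; again, purely illustrative):

```python
u = SketchUpdater(epsilon=2)

# Server A answers "I have shares"; fetching the share details is still
# in flight when the empty responses arrive:
u.got_empty_response("B")        # empty_responses == 1
u.got_empty_response("C")        # empty_responses == 2  ->  done = True

# Server A's share details finally arrive, but they are rejected,
# so no writer is ever created for A:
u.got_shares("A", ["sh0", "sh1"])

u.writer_for("A")                # KeyError -- the failure in this ticket
```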
Replying to zooko:
Because it is a key error, this ticket could be causing #1670. If I had to guess though, I don't think this is causing it because one of the error reports has k = 2, N = 4, with 5 servers.
I'm bumping this out of the 1.11.0 release because it depends on #1057 which has also been bumped out.
Replying to daira:
Actually, I think I was mistaken: this doesn't depend on #1057. We should re-review it to see whether it can go into 1.11.
I'll review markberger's patch.
Ok, I think the patch is a step forward. It basically changes the point at which we say "yes, this server gave us a good share". It used to set this mark as soon as the first response came back (saying "I think I have a share, fetch it if you want"). With the patch, it sets the mark later, after enough of the share has been fetched to verify its signature. In between those two points, the client is grabbing more data (the pubkey, maybe more), so there's a chance for other code to run, including the "are we done yet" checker. The problem here occurred when that other code ran too early.
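To make that concrete, here is a hedged sketch of the shape of the change, using Twisted Deferreds for the in-flight fetch; the class, method, and attribute names here are invented for illustration and are not the actual servermap code.

```python
from twisted.internet import defer

class Sketch:
    def __init__(self):
        self.good_servers = set()

    def mark_good(self, server):
        self.good_servers.add(server)

    def fetch_and_verify(self, share_hint):
        # stands in for fetching the pubkey/signature and verifying the share
        return defer.succeed(share_hint)

    # Before the patch (roughly): credit the server on its first response,
    # while the verification work is still pending.
    def got_initial_response_old(self, server, share_hint):
        self.mark_good(server)                 # too early
        return self.fetch_and_verify(share_hint)

    # After the patch (roughly): credit the server only once enough of the
    # share has been fetched to verify its signature, so the "are we done
    # yet" check cannot discard a half-processed server.
    def got_initial_response_new(self, server, share_hint):
        d = self.fetch_and_verify(share_hint)
        d.addCallback(lambda _verified: self.mark_good(server))
        return d
```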
I have a feeling that the entire mutable tree is due for an overhaul, although I can't currently think of a better way to organize it. Mostly I'm bothered by the large Deferred chains in there; maybe a state machine (like the "new" immutable downloader) would provide fewer hiding places for concurrency/interleaving hazards to lurk.
Landed in db12f1c.
Do we need a regression test for this bug, or can this ticket be closed?
I would feel better if we had an accompanying regression test…
markberger, can you write a test?
I tried Mark's original advice (increasing the number of servers in allmydata.test.test_mutable.Update from the default of 10 to 13). Both 10 and 13 pass with current trunk (including the db12f1c fix). If I back out db12f1c, then num_servers=13 causes the original failure listed above, and num_servers=10 does not. So I think that simply changing num_servers to 13 is enough to test this. https://github.com/tahoe-lafs/tahoe-lafs/pull/149 does that.
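For reference, the essence of that pull request is just the server count in the test's grid setup. A sketch, assuming the Update test case uses GridTestMixin.set_up_grid; the exact surrounding code in allmydata/test/test_mutable.py may differ.

```python
from twisted.trial import unittest
from allmydata.test.no_network import GridTestMixin

class Update(GridTestMixin, unittest.TestCase):
    def setUp(self):
        GridTestMixin.setUp(self)
        self.basedir = self.mktemp()
        # 13 servers >= N + k; with the db12f1c fix backed out this
        # reproduces the race in ticket #2034, while 10 servers does not.
        self.set_up_grid(num_servers=13)
```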
Landed in 2d3c805.