upload is unhappy even though the shares are already distributed #1124

Open
opened 2010-07-18 16:18:57 +00:00 by zooko · 16 comments

Here's a test case that I added to source:src/allmydata/test/test_upload.py:

    def test_problem_layout_0123_03_2_1(self):
        # Zooko stumbled on this case when investigating #1118.
        self.basedir = self.mktemp()
        d = self._setup_and_upload(k=2, n=4)

        # server 0: shares 0, 1, 2, 3
        # server 1: shares 0, 3
        # server 2: share 1
        # server 3: share 2
        # With this layout, an upload should just be satisfied that the current distribution is good enough, right?
        def _setup(ign):
            self._add_server_with_share(server_number=0, share_number=None)
            self._add_server_with_share(server_number=1, share_number=0)
            self._add_server_with_share(server_number=2, share_number=1)
            self._add_server_with_share(server_number=3, share_number=2)
            # Copy shares
            self._copy_share_to_server(3, 1)
            client = self.g.clients[0]
            client.DEFAULT_ENCODING_PARAMETERS['happy'] = 4
            return client

        d.addCallback(_setup)
        d.addCallback(lambda client:
            client.upload(upload.Data("data" * 10000, convergence="")))
        return d

When I run it, it fails like this:

[ERROR]: allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1118

Traceback (most recent call last):
  File "/Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/immutable/upload.py", line 510, in _got_response
    return self._loop()
  File "/Users/zooko/playground/tahoe-lafs/trunk/src/allmydata/immutable/upload.py", line 363, in _loop
    self._get_progress_message()))
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error))

Why does the upload not succeed?
I added debug prints; here are some of them:

2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer xgru5adv: alreadygot=(0, 1, 2, 3), allocated=()
2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer ysbz4st7: alreadygot=(0, 3), allocated=()
2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer b3llgpww: alreadygot=(2,), allocated=(1,)
2010-07-18 09:54:15-0600 [-] response to allocate_buckets() from peer rvsry4kn: alreadygot=(1,), allocated=(0,)
2010-07-18 09:54:15-0600 [-] server selection unsuccessful for <Tahoe2PeerSelector for upload t2uvf>: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers.: placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error): sh0: rvsry4kn, sh1: b3llgpww+rvsry4kn, sh2: b3llgpww, sh3: ysbz4st7+xgru5adv
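
For reference, the "servers of happiness" count in these messages is the size of a maximum matching between shares and the servers that hold (or have accepted) them. Below is a minimal brute-force sketch of that metric (an illustration only, not the happinessutil implementation; the `final` and `existing` maps are transcribed from the log above and from the layout comments in the test). It shows that the placement the selector settled on only reaches 3, while the shares that are already distributed would reach 4:

    from itertools import product

    def happiness(sharemap):
        # sharemap: share number -> set of servers holding (or accepting) that share.
        # Returns the largest number of shares that can each be paired with a
        # distinct server, i.e. the size of a maximum bipartite matching.
        shares = sorted(sharemap)
        best = 0
        for choice in product(*[[None] + sorted(sharemap[s]) for s in shares]):
            placed = [c for c in choice if c is not None]
            if len(placed) == len(set(placed)):   # all chosen servers distinct
                best = max(best, len(placed))
        return best

    # Placement reported by the selector ("sh0: rvsry4kn, sh1: ..." above):
    final = {0: set(['rvsry4kn']),
             1: set(['b3llgpww', 'rvsry4kn']),
             2: set(['b3llgpww']),
             3: set(['ysbz4st7', 'xgru5adv'])}
    print(happiness(final))     # 3 -- hence "could be placed on only 3 server(s)"

    # Shares already on the grid (server 0: 0,1,2,3; server 1: 0,3; server 2: 1; server 3: 2):
    existing = {0: set([0, 1]),
                1: set([0, 2]),
                2: set([0, 3]),
                3: set([0, 1])}
    print(happiness(existing))  # 4 -- the existing distribution already satisfies happy=4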
zooko added the code-peerselection, major, defect, 1.7.0 labels 2010-07-18 16:18:57 +00:00
zooko added this to the undecided milestone 2010-07-18 16:18:57 +00:00
davidsarah commented 2010-07-18 19:54:59 +00:00
Owner

(Nitpick: the name of the test method should be 0123_03_1_2.)

Is it the maximum matching implementation that is incorrect? If it is, then it should be possible to write a narrower test case for that.

davidsarah commented 2010-07-18 20:07:57 +00:00
Owner

Replying to davidsarah:

Is it the maximum matching implementation that is incorrect? If it is, then it should be possible to write a narrower test case for that.

That isn't the problem:

>>> from allmydata.util import happinessutil
>>> sharemap = {0: set([0, 1, 2, 3]), 1: set([0, 3]), 2: set([1]), 3: set([2])}
>>> happinessutil.servers_of_happiness(sharemap)
4
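
Phrased as the narrower test case suggested above, the same check might look like the following sketch (the class and method names here are illustrative; it assumes the sharemap orientation used in the session above and trial as the test runner):

    from twisted.trial import unittest
    from allmydata.util import happinessutil

    class Happiness1124(unittest.TestCase):
        def test_layout_0123_03_1_2(self):
            # the layout from this ticket, expressed as the sharemap above
            sharemap = {0: set([0, 1, 2, 3]), 1: set([0, 3]),
                        2: set([1]), 3: set([2])}
            self.assertEqual(happinessutil.servers_of_happiness(sharemap), 4)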
Author

Swapping the order of the last two servers in the test makes the code under test succeed in uploading instead of failing. This fails:

    def test_problem_layout_0123_03_1_2(self):
        # Zooko stumbled on this case when investigating #1118.
        self.basedir = self.mktemp()
        d = self._setup_and_upload(k=2, n=4)

        # server 0: shares 0, 1, 2, 3
        # server 1: shares 0, 3
        # server 2: share 1
        # server 3: share 2
        # With this layout, an upload should just be satisfied that the current distribution is good enough, right?
        def _setup(ign):
            self._add_server_with_share(server_number=0, share_number=None)
            self._add_server_with_share(server_number=1, share_number=0)
            self._add_server_with_share(server_number=2, share_number=1)
            self._add_server_with_share(server_number=3, share_number=2)
            # Copy shares
            self._copy_share_to_server(3, 1)
            client = self.g.clients[0]
            client.DEFAULT_ENCODING_PARAMETERS['happy'] = 4
            return client

        d.addCallback(_setup)
        d.addCallback(lambda client:
            client.upload(upload.Data("data" * 10000, convergence="")))
        return d

while logging the following:

$ grep -Ee"response|success" _trial_temp/test.log
2010-07-18 15:16:55-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(0,)
2010-07-18 15:16:55-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(1, 2, 3)
2010-07-18 15:16:55-0600 [-] peer selection successful for <Tahoe2PeerSelector for upload t2uvf>: placed all 4 shares, want to place shares on at least 1 servers such that any 2 of them have enough shares to recover the file, sent 2 queries to 1 peers, 2 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)
2010-07-18 15:16:55-0600 [-] response from peer xgru5adv: alreadygot=(0, 1, 2, 3), allocated=()
2010-07-18 15:16:55-0600 [-] response from peer ysbz4st7: alreadygot=(0, 3), allocated=()
2010-07-18 15:16:55-0600 [-] response from peer b3llgpww: alreadygot=(2,), allocated=(1,)
2010-07-18 15:16:55-0600 [-] response from peer rvsry4kn: alreadygot=(1,), allocated=(0,)

This succeeds:

    def test_problem_layout_0123_03_2_1(self):
        # Zooko stumbled on this case when investigating #1118.
        self.basedir = self.mktemp()
        d = self._setup_and_upload(k=2, n=4)

        # server 0: shares 0, 1, 2, 3
        # server 1: shares 0, 3
        # server 2: share 2
        # server 3: share 1
        # With this layout, an upload should just be satisfied that the current distribution is good enough, right?
        def _setup(ign):
            self._add_server_with_share(server_number=0, share_number=None)
            self._add_server_with_share(server_number=1, share_number=0)
            self._add_server_with_share(server_number=2, share_number=2)
            self._add_server_with_share(server_number=3, share_number=1)
            # Copy shares
            self._copy_share_to_server(3, 1)
            client = self.g.clients[0]
            client.DEFAULT_ENCODING_PARAMETERS['happy'] = 4
            return client

        d.addCallback(_setup)
        d.addCallback(lambda client:
            client.upload(upload.Data("data" * 10000, convergence="")))
        return d

while logging the following:

$ grep -Ee"response|success" _trial_temp/test.log 
2010-07-18 15:16:01-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(0,)
2010-07-18 15:16:01-0600 [-] response from peer xgru5adv: alreadygot=(), allocated=(1, 2, 3)
2010-07-18 15:16:01-0600 [-] peer selection successful for <Tahoe2PeerSelector for upload t2uvf>: placed all 4 shares, want to place shares on at least 1 servers such that any 2 of them have enough shares to recover the file, sent 2 queries to 1 peers, 2 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)
2010-07-18 15:16:02-0600 [-] response from peer xgru5adv: alreadygot=(0, 1, 2, 3), allocated=()
2010-07-18 15:16:02-0600 [-] response from peer ysbz4st7: alreadygot=(0, 3), allocated=()
2010-07-18 15:16:02-0600 [-] response from peer b3llgpww: alreadygot=(1,), allocated=()
2010-07-18 15:16:02-0600 [-] response from peer rvsry4kn: alreadygot=(2,), allocated=()
2010-07-18 15:16:02-0600 [-] peer selection successful for <Tahoe2PeerSelector for upload t2uvf>: placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)
Author

Attachment test-1124.dpatch.txt (11983 bytes) added

Author

test-1124.dpatch.txt is a test for this issue, marked TODO.

davidsarah commented 2010-07-18 23:45:15 +00:00
Owner

The problem seems to be the share redistribution algorithm implemented by Tahoe2PeerSelector (as modified by #778). If we instrument it to print out the sharemap (i.e. sharenum -> set of peerids) after the call to merge_peers in each iteration of _loop, we get (abbreviating each peerid to its first hex byte):

{0: set(['b9']), 1: set(['b9']), 2: set(['b9']), 3: set(['b9'])}
{0: set(['c4']), 1: set(['0e']), 2: set(['0e']), 3: set(['c4', 'b9'])}
{0: set(['8d']), 1: set(['0e', '8d']), 2: set(['0e']), 3: set(['c4', 'b9'])}

Presumably 'b9' is server 0. So we merge in all of the shares found for that server, but then, in the same iteration of _loop, we move all but one of them into homeless_shares. I don't fully understand the intent of the algorithm, but it seems as though we're forgetting that server 0 had additional shares that, if taken into account, would have increased servers_of_happiness -- even though they didn't increase it in that particular iteration.
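
Feeding those three instrumented sharemaps through a brute-force version of the happiness metric (again only an illustration, not happinessutil itself) makes the effect visible: the selector's working view of the placement never reaches 4 once server 0's extra shares have been dropped:

    from itertools import product

    def happiness(sharemap):
        # largest number of shares that can each be paired with a distinct server
        shares = sorted(sharemap)
        best = 0
        for choice in product(*[[None] + sorted(sharemap[s]) for s in shares]):
            placed = [c for c in choice if c is not None]
            if len(placed) == len(set(placed)):
                best = max(best, len(placed))
        return best

    iterations = [
        {0: set(['b9']), 1: set(['b9']), 2: set(['b9']), 3: set(['b9'])},
        {0: set(['c4']), 1: set(['0e']), 2: set(['0e']), 3: set(['c4', 'b9'])},
        {0: set(['8d']), 1: set(['0e', '8d']), 2: set(['0e']), 3: set(['c4', 'b9'])},
    ]
    for i, sm in enumerate(iterations):
        print("iteration %d: happiness %d" % (i + 1, happiness(sm)))
    # prints 1, 3, 3 -- never 4, even though 4 is achievable with these shares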

kevan commented 2010-07-19 02:48:38 +00:00
Owner

The intent of the algorithm is to identify servers with more than one share, and make some of the shares on those servers homeless so that they can be redistributed to peers that might not have had any shares assigned to them yet. It is a greedy algorithm that doesn't quite do the trick in a lot of situations, and it seems like this is one of them. test_problem_layout_comment_187 is another such layout; it is marked as todo, because we hope to change the uploader to do a better job of share redistribution in 1.8. This might be a feature of #1126, or might be another ticket that hasn't been made yet.
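
As a deliberately oversimplified sketch of the greedy step described above (not the actual Tahoe2PeerSelector code): servers holding more than one assigned share give up all but one of them, and the freed shares go back into the homeless pool. Applied to the layout in this ticket, that step throws away the knowledge that server 0 held every share:

    def redistribute_greedily(placements):
        # placements: server -> set of shares currently assigned to that server.
        # Each server keeps a single share; the rest become homeless and are
        # later offered to servers that have no shares assigned yet.
        homeless = set()
        for server, shares in placements.items():
            while len(shares) > 1:
                homeless.add(shares.pop())
        return placements, homeless

    # server 0: shares 0,1,2,3; server 1: shares 0,3; server 2: share 1; server 3: share 2
    layout = {0: set([0, 1, 2, 3]), 1: set([0, 3]), 2: set([1]), 3: set([2])}
    kept, homeless = redistribute_greedily(layout)
    print(kept)      # each server keeps one share
    print(homeless)  # the rest are re-queued, and the fact that server 0 already
                     # held all of them is lost -- the loss davidsarah points out above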

Author

There was a bug in the code that david-sarah was testing in comment:78663. That bug was fixed by changeset:13b5e44fbc2effd0. We should re-evaluate this ticket.

davidsarah commented 2010-07-24 00:27:33 +00:00
Owner

Test was committed in changeset:0f46766a51792eb5.

tahoe-lafs modified the milestone from undecided to 1.9.0 2010-08-12 23:33:50 +00:00

What needs to happen to resolve this? Do we have a plan to improve the share-distribution algorithm? It seems to me that there's no chance of this being a small safe fix, so I'm kicking it out of 1.9.

warner modified the milestone from 1.9.0 to 1.10.0 2011-08-29 15:33:15 +00:00
davidsarah commented 2011-08-30 23:15:04 +00:00
Owner

Replying to warner:

what needs to happen to resolve this? do we have a plan to improve the share-distribution algorithm?

That was being done in #1382. If I understand correctly, Kevan's patches there try to implement the algorithm of ticket:1130#comment:-1, or something very much like it.

tahoe-lafs modified the milestone from soon to 1.11.0 2013-09-01 05:29:51 +00:00
Author

I believe this will be fixed by #1382.

warner modified the milestone from 1.10.1 to 1.11.0 2015-01-20 17:25:53 +00:00

Milestone renamed

warner modified the milestone from 1.11.0 to 1.12.0 2016-03-22 05:02:52 +00:00

moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders

warner modified the milestone from 1.12.0 to 1.13.0 2016-06-28 18:20:37 +00:00

Moving open issues out of closed milestones.

exarkun modified the milestone from 1.13.0 to 1.15.0 2020-06-30 14:45:13 +00:00
Owner

Ticket retargeted after milestone closed

meejah modified the milestone from 1.15.0 to soon 2021-03-30 18:40:19 +00:00