count-good-share-hosts is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts) #1115
Reference: tahoe-lafs/trac-2024-07-25#1115
No description provided.
I guess the web page reporting repair results is putting in the number of shares where it ought to have the number of servers.
This would be a bugfix, and it would be a shallow patch with little danger of introducing worse bugs, so it would still be welcome for v1.7.1.
Out of time.
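A minimal sketch of the mix-up, using a hypothetical sharemap whose numbers match the title (10 good shares spread over only 4 servers); the names here are illustrative, not taken from the codebase:

```python
# Hypothetical sharemap: share number -> set of servers holding a good copy.
# Ten distinct good shares live on only four servers.
sharemap = {shnum: {"server-%d" % (shnum % 4)} for shnum in range(10)}

count_shares_good = len(sharemap)                              # 10
hosts_with_good_shares = len(set().union(*sharemap.values()))  # 4
# The buggy report filled count-good-share-hosts with the first number (10)
# where it should have reported the second (4).
```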
The bug appears to be here in immutable/filenode.py's check_and_repair method (source:src/allmydata/immutable/filenode.py@4248#L262): `count-good-share-hosts` should not be the same as `count-shares-good`. It possibly should be `len(servers_responding)`, but I'm not sure of that.

Hm, it is documented in webapi.txt (source:docs/frontends/webapi.txt@4508#L1175):
So it should be ... Hrm... I'm not 100% certain, but I think that `count-good-share-hosts` should be `len(reduce(set.union, sm.itervalues()))`. Clearly we need a unit test for this.

Hmm, isn't that documentation misleading? Suppose, for example, a share->server mapping in which the number of good shares is 4 (0, 1, 2, 3) and the number of servers that hold good shares is 4 (A, B, C, D), but server D holds shares that should ideally be moved to other servers.
If the "if" in the comment is interpreted as a sufficient but not necessary condition for shares to be doubled up, then it is strictly speaking correct, but the wording tends to imply a necessary-and-sufficient condition.
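The exact mapping from that comment was not preserved in this import; the following is a hypothetical mapping with the properties described above, evaluated with the formula proposed earlier (Python 3 sketch; the original code is Python 2, hence itervalues()):

```python
from functools import reduce

# Hypothetical share -> server mapping (an illustration, not the original example):
# shares 0-3 are all good, servers A-D all hold good shares, but D also carries
# doubled-up copies of shares 0-2.
sm = {0: {"A", "D"}, 1: {"B", "D"}, 2: {"C", "D"}, 3: {"D"}}

count_shares_good = len(sm)                                   # 4
count_good_share_hosts = len(reduce(set.union, sm.values()))  # 4
# Both counts are 4, yet shares are doubled up on D -- equality of the two
# counts does not mean the shares are well distributed.
```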
The measure that should really be used to decide whether the shares for a file are properly distributed is servers-of-happiness. We should add that to the repair report, and have the docs de-emphasize the usefulness of `count-good-share-hosts` -- or perhaps even remove it, given that it was calculated incorrectly.

Replying to davidsarah:
+1
I agree to de-emphasize the usefulness of it for deciding if you should rebalance. However, maybe we should go ahead and leave it (corrected) in the report, as someone might find that information useful for something.
Changed title from "count-good-share-hosts is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)" to "add servers-of-happiness to reports (post-repair says 10 hosts have good shares but there are only 4 hosts)".

We have no unit test, so not for 1.8.
In changeset:0091205e3c56a4de:
Right now there are two conflicting definitions for computing `count-good-share-hosts`.

So, I wrote a test for both of these conditions. Per comment:78474 I'm providing a patch to fix the former (incorrect) one.

https://github.com/amiller/tahoe-lafs/pull/4/files
`len(reduce(set.union, sm.itervalues()))` as used in the pull request is wrong for the empty map. `len(set(sm.itervalues()))` is correct. However, the docs should make clear that a high value of `count-good-share-hosts` does not imply good share distribution (although a low value does imply bad distribution or insufficient shares).

Good catch. I fixed the problem by adding an initial argument to reduce:
len(reduce(set.union, sm.itervalues(), set()))
I modified the unit test to cover this as well.
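For reference, a short Python 3 sketch of the failure mode and the fix (the tahoe-lafs code of this era uses Python 2's itervalues(), but reduce behaves the same way):

```python
from functools import reduce

sm = {}  # empty share -> servers map, e.g. when no good shares were found at all

# Without an initial value, reduce() raises on an empty sequence:
#   reduce(set.union, sm.values())
#   TypeError: reduce() of empty iterable with no initial value
# With set() as the initial value it simply reports zero hosts:
count_good_share_hosts = len(reduce(set.union, sm.values(), set()))  # 0
```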
Apparently GitHub pull requests are not stateless, so the previous link now includes my new commit; perhaps that's not ideal for archiving progress. Here's the commit:
https://github.com/amiller/tahoe-lafs/commit/3898753b9549f5b355a0233b835da8fc38faa5c6
Yeah, that describes what I had in mind for count-good-share-hosts, but
I agree that it probably isn't as meaningful as I'd originally thought.
The sharemap described in comment:78475 is a great example, although I think
it'd be awfully hard to get into this situation (it would take some
serious contortions to get an uploader+repairer to leave shares in that
state). count-good-share-hosts probably covered the situations I was
able to imagine happening after a typical upload process, but if there's
a better "what needs to happen to make your file healthier" value, I'm
happy to emphasize that instead.
Anyways, I reviewed the patch and it looks good. I added a few cosmetic
fixes (whitespace, name of the test file), and will land it in a minute.
In changeset:fcc7e6475918eab1:
Want to double-check the correctness of this.
Changed title from "add servers-of-happiness to reports (post-repair says 10 hosts have good shares but there are only 4 hosts)" to "count-good-shares is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)", then to "count-good-share-hosts is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)".

Adding an entry for the happiness count has been split to #1784.
The change to the computation of `count-good-share-hosts` at source:1.9.2/src/allmydata/immutable/checker.py@5499#L765 looks correct. A minor simplification is possible: the expression used there could just be a shorter one, since the `servers` dict never includes any entries that are empty sets. I don't think this simplification is needed for 1.9.2.
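The two expressions being compared were elided in this import; as an assumption about what the simplification amounts to, it may just be taking the length of the servers dict directly, along these lines:

```python
from functools import reduce

# Assumed shapes (not copied from checker.py): sm maps share number -> set of
# servers holding a good copy; servers maps server -> set of share numbers it
# holds, and never contains empty sets.
sm = {0: {"A"}, 1: {"B"}, 2: {"C", "D"}, 3: {"D"}}
servers = {"A": {0}, "B": {1}, "C": {2}, "D": {2, 3}}

count_via_reduce = len(reduce(set.union, sm.values(), set()))  # 4
count_via_servers = len(servers)  # 4 -- the same value, because every entry
                                  # in `servers` holds at least one share
```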
I spotted another problem, though: the value of `needs-rebalancing` is computed inconsistently between checker.py and filenode.py. In checker.py it is computed as one expression; in filenode.py it is computed as another, in which `len(sm)` is equal to `count-shares-good`. I don't understand this latter definition at all; it looks completely wrong. The definition in checker.py is more subtly wrong, because it credits servers that only have duplicated shares as contributing to existing balance. The correct definition should be something like 'iff the happiness count is less than the number of uniquely-numbered good shares available'.

I propose to change the definition in filenode.py to be consistent with checker.py in 1.9.2, and then change it to use the happiness count in 1.10, as part of #1784.
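For contrast with `count-good-share-hosts`, here is a rough sketch of a happiness-style count: the size of a maximum matching between distinct share numbers and servers, so a server that holds only duplicated shares contributes nothing. This illustrates the idea, not tahoe-lafs's actual servers-of-happiness implementation:

```python
def happiness(sharemap):
    """sharemap: dict mapping share number -> set of servers holding that share.

    Returns the size of a maximum matching between share numbers and servers
    (each server is counted for at most one distinct share)."""
    match = {}  # server -> share number currently assigned to it

    def assign(shnum, seen):
        for server in sharemap[shnum]:
            if server in seen:
                continue
            seen.add(server)
            # Use a free server, or try to re-route the share occupying it.
            if server not in match or assign(match[server], seen):
                match[server] = shnum
                return True
        return False

    return sum(assign(shnum, set()) for shnum in sharemap)

# Four good shares on four servers, but three of them exist only on server D:
print(happiness({0: {"A", "B", "C", "D"}, 1: {"D"}, 2: {"D"}, 3: {"D"}}))  # -> 2
# count-good-share-hosts would report 4 here, yet the distribution is poor.
```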
The definition of `needs-rebalancing` in source:1.9.2/docs/frontends/webapi.rst does not match what either checker.py or filenode.py computes :-(

Replying to davidsarah:
Actually, sod it, I can't decide how to fix `needs-rebalancing` for 1.9.2, so I'll put a caveat in webapi.rst and bump it to 1.10 (as part of #1784).

In changeset:5532/1.9.2:
In changeset:5886/cloud-backend:
In changeset:b06f8cd8d03a6239:
The `needs-rebalancing` issue is now #2105.