servers-responding list is wrong #1739
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#1739
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
I ran into a bug, hunted it down, and squashed it. Here's the
description and the patch:
If a server did not respond to the pre-repair filecheck, but did
respond to the repair, that server was not correctly added to the
RepairResults.data["servers-responding"]
list. (This resulted froma buggy usage of
DictOfSets.union()
in filenode.py).In addition, servers to which filecheck queries were sent, but did
not respond, were incorrectly added to the servers-responding list
anyawys. (This resulted from code in the checker.py not paying
attention to the 'responded' flag).
The first bug was neatly masked by the second: it's pretty rare to
have a server suddenly start responding in the one-second window
between a filecheck and a subsequent repair, and if the server was
around for the filecheck, you'd never notice the problem. I only
spotted the smelly code while I was changing it for IServer cleanup
purposes.
I added coverage to test_repairer.py for this. Trying to get that
test to fail before fixing the first bug is what led me to discover
the second bug. I also had to update test_corrupt_file_verno, since
it was incorrectly asserting that 10 servers responded, when in
fact one of them throws an error (but the second bug was causing it
to be reported anyways).
ugh, the nginx bug (#1581) prevented me from uploading a diff. Look at the github pull-request in https://github.com/tahoe-lafs/tahoe-lafs/pull/9 instead.
will review
+1 on removing DictOfSets.union; I was looking at using DictOfSets for something and also found that method confusing. I confirmed that the patches remove all uses.
Would
d['servers-responding'] = sorted(list(servers_responding))
be better, for determinism? (I know the previous code also wasn't deterministic here.)I think the overloading of
.broken
as an int or boolean will not work as intended for the boolean case, because:Otherwise looks good.
wow, good catch, I thought booleans were distinct from ints for sure. Thanks!
Yeah, determinism is helpful. I had to add similar
sorted()
calls in the webapi code (in particular in places where we switch from a set, to a JSON list) so the tests can match against something stable. I'll move that up to the code that creates servers-responding.Actually the
list()
isn't necessary;sorted(servers_responding)
does the same thing.Landed in changeset:cc366903 and changeset:9acf5bee.
Also applied to 1.9.2: [5500/1.9.2], [5499/1.9.2]