servers-responding list is wrong #1739

Closed
opened 2012-05-16 22:40:01 +00:00 by warner · 7 comments

I ran into a bug, hunted it down, and squashed it. Here's the
description and the patch:

If a server did not respond to the pre-repair filecheck, but did
respond to the repair, that server was not correctly added to the
RepairResults.data["servers-responding"] list. (This resulted from
a buggy usage of DictOfSets.union() in filenode.py).

In addition, servers to which filecheck queries were sent, but did
not respond, were incorrectly added to the servers-responding list
anyawys. (This resulted from code in the checker.py not paying
attention to the 'responded' flag).

The first bug was neatly masked by the second: it's pretty rare to
have a server suddenly start responding in the one-second window
between a filecheck and a subsequent repair, and if the server was
around for the filecheck, you'd never notice the problem. I only
spotted the smelly code while I was changing it for IServer cleanup
purposes.

I added coverage to test_repairer.py for this. Trying to get that
test to fail before fixing the first bug is what led me to discover
the second bug. I also had to update test_corrupt_file_verno, since
it was incorrectly asserting that 10 servers responded, when in
fact one of them throws an error (but the second bug was causing it
to be reported anyways).

I ran into a bug, hunted it down, and squashed it. Here's the description and the patch: If a server did not respond to the pre-repair filecheck, but did respond to the repair, that server was not correctly added to the `RepairResults.data["servers-responding"]` list. (This resulted from a buggy usage of `DictOfSets.union()` in filenode.py). In addition, servers to which filecheck queries were sent, but did not respond, were incorrectly added to the servers-responding list anyawys. (This resulted from code in the checker.py not paying attention to the 'responded' flag). The first bug was neatly masked by the second: it's pretty rare to have a server suddenly start responding in the one-second window between a filecheck and a subsequent repair, and if the server was around for the filecheck, you'd never notice the problem. I only spotted the smelly code while I was changing it for IServer cleanup purposes. I added coverage to test_repairer.py for this. Trying to get that test to fail before fixing the first bug is what led me to discover the second bug. I also had to update test_corrupt_file_verno, since it was incorrectly asserting that 10 servers responded, when in fact one of them throws an error (but the second bug was causing it to be reported anyways).
warner added the
code
minor
defect
1.9.1
labels 2012-05-16 22:40:01 +00:00
warner added this to the soon milestone 2012-05-16 22:40:01 +00:00
warner self-assigned this 2012-05-16 22:40:01 +00:00
Author

ugh, the nginx bug (#1581) prevented me from uploading a diff. Look at the github pull-request in https://github.com/tahoe-lafs/tahoe-lafs/pull/9 instead.

ugh, the nginx bug (#1581) prevented me from uploading a diff. Look at the github pull-request in <https://github.com/tahoe-lafs/tahoe-lafs/pull/9> instead.

will review

will review
zooko modified the milestone from soon to 1.9.2 2012-05-16 23:06:15 +00:00
warner was unassigned by zooko 2012-05-16 23:06:15 +00:00
zooko self-assigned this 2012-05-16 23:06:15 +00:00
davidsarah commented 2012-05-16 23:36:01 +00:00
Owner

+1 on removing DictOfSets.union; I was looking at using DictOfSets for something and also found that method confusing. I confirmed that the patches remove all uses.

Would d['servers-responding'] = sorted(list(servers_responding)) be better, for determinism? (I know the previous code also wasn't deterministic here.)

I think the overloading of .broken as an int or boolean will not work as intended for the boolean case, because:

$ python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> broken = True
>>> isinstance(broken, int)
True
>>> broken -= 1
>>> broken
0

Otherwise looks good.

+1 on removing DictOfSets.union; I was looking at using DictOfSets for something and also found that method confusing. I confirmed that the patches remove all uses. Would `d['servers-responding'] = sorted(list(servers_responding))` be better, for determinism? (I know the previous code also wasn't deterministic here.) I think the overloading of `.broken` as an int or boolean will not work as intended for the boolean case, because: ``` $ python Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56) [GCC 4.4.5] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> broken = True >>> isinstance(broken, int) True >>> broken -= 1 >>> broken 0 ``` Otherwise looks good.
tahoe-lafs added
normal
and removed
minor
labels 2012-05-16 23:36:01 +00:00
zooko was unassigned by tahoe-lafs 2012-05-16 23:36:01 +00:00
warner was assigned by tahoe-lafs 2012-05-16 23:36:01 +00:00
Author

wow, good catch, I thought booleans were distinct from ints for sure. Thanks!

Yeah, determinism is helpful. I had to add similar sorted() calls in the webapi code (in particular in places where we switch from a set, to a JSON list) so the tests can match against something stable. I'll move that up to the code that creates servers-responding.

wow, good catch, I thought booleans were distinct from ints for sure. Thanks! Yeah, determinism is helpful. I had to add similar `sorted()` calls in the webapi code (in particular in places where we switch from a set, to a JSON list) so the tests can match against something stable. I'll move that up to the code that creates servers-responding.
davidsarah commented 2012-05-16 23:48:43 +00:00
Owner

Would d['servers-responding'] = sorted(list(servers_responding)) be better, for determinism?

Actually the list() isn't necessary; sorted(servers_responding) does the same thing.

> Would `d['servers-responding'] = sorted(list(servers_responding))` be better, for determinism? Actually the `list()` isn't necessary; `sorted(servers_responding)` does the same thing.
Author

Landed in changeset:cc366903 and changeset:9acf5bee.

Landed in changeset:cc366903 and changeset:9acf5bee.
warner added the
fixed
label 2012-05-17 00:02:43 +00:00
davidsarah commented 2012-05-17 00:19:31 +00:00
Owner

Also applied to 1.9.2: [5500/1.9.2], [5499/1.9.2]

Also applied to 1.9.2: [5500/1.9.2], [5499/1.9.2]
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1739
No description provided.