introducer client: connection count is wrong, VersionedRemoteReference needs EQ #653
Zandr noticed that the prodnet webapi nodes are reporting 111 known servers (which is correct) and 135 connected servers (which is not). The root problem is that the IntroducerClient uses object identity (specifically RemoteReference identity) to manage the set of current connections. Specifically, IntroducerClient._connections is a set of (peerid, service_name, rref) tuples, and the insertion/removal code assumes that 'rref' values are comparable (it uses _connections.add() and assumes that duplicates will be removed by the set class). Unfortunately, the recent introduction of VersionedRemoteReference wrappers breaks this: the 135-111=24 extra "ghost" servers are actually cases of two different VersionedRemoteReference objects wrapping the same original RemoteReference instance.
I'm not quite sure how to fix this. My first inclination is to stop using VersionedRemoteReference wrappers, and instead assign .versions attributes to the regular RemoteReference instances to achieve the versioning functionality. However, I believe that VersionedRemoteReference also serves another purpose, which is to transform all remote exceptions (everything returned by callRemote, incidentally including local schema Violations) into a special form which is easily recognizable. This is not as simple to achieve without a wrapper.
Another approach would be to modify __eq__ and __hash__ (and also __ne__? I'm never sure..) on VersionedRemoteReference to pass through to the original. A third approach would be to maintain a singleton weak dict of some sort, which would help us make sure that we have at most one VersionedRemoteReference per RemoteReference.
I prefer the first (no-wrapper) approach, but it may take some doing.
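For reference, a minimal sketch of the second (pass-through) approach, assuming the wrapper keeps the original reference in an attribute such as self.rref and its version data in self.version (attribute and class names here are illustrative, not the actual foolscap/Tahoe code):

```python
class VersionedRemoteReferenceSketch(object):
    """Illustrative wrapper: carries version info, delegates equality/hashing."""
    def __init__(self, rref, version):
        self.rref = rref          # the original foolscap RemoteReference
        self.version = version    # whatever version data the handshake produced

    # Pass equality and hashing through to the wrapped RemoteReference, so
    # two wrappers around the same connection compare (and hash) equal and
    # set membership in IntroducerClient._connections deduplicates them.
    def __eq__(self, other):
        return self.rref == getattr(other, "rref", other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash(self.rref)
```

With something like this in place, _connections.add() would discard a second wrapper around an already-known RemoteReference instead of counting it as an extra connection.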
Which object is creating a new instance of VersionedRemoteReference when there already exists one for that RemoteReference? Maybe we should consider this a bug in that object. This is very close to your "singleton, weak dict" approach, but it narrows the scope of who is supposed to be weakly tracking which remote refs.
P.S. Yes, I know it is very likely a class that I wrote which is responsible...
It's probably the introducer client, specifically RemoteServiceConnector._got_service, which reacts to the reconnector-driven connection event by calling get_versioned_remote_reference. The problem is the separate manager object (IntroducerClient.add_connection), which doesn't ever see the real RemoteReference, just the wrapper. Maybe we could pass back both, and have it use the real RemoteReference as a dict key, and put the wrapper in a value. It used to use a dictionary, but I replaced it with a set because RemoteReferences compare cleanly and because that made the code easier to read.
I suspect that I rewrote all of this code in my signed-announcements branch (#466), so maybe what I ought to do is examine that branch and copy code from it.
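One possible shape of that "pass back both" idea, keeping the existing (peerid, service_name, ...) keying but using the real RemoteReference (which compares cleanly) as part of the dict key and storing the wrapper as the value. All names here are illustrative, not the actual IntroducerClient code:

```python
class ConnectionTableSketch(object):
    """Illustrative manager-side bookkeeping for introducer connections."""
    def __init__(self):
        # (peerid, service_name, real_rref) -> VersionedRemoteReference wrapper
        self._connections = {}

    def add_connection(self, peerid, service_name, real_rref, wrapper):
        # Re-adding the same underlying connection merely overwrites the
        # stored wrapper, so duplicate wrappers cannot inflate the count.
        self._connections[(peerid, service_name, real_rref)] = wrapper

    def lose_connection(self, peerid, service_name, real_rref):
        # Called from a notifyOnDisconnect handler that was given the real rref.
        self._connections.pop((peerid, service_name, real_rref), None)

    def count_connections(self):
        return len(self._connections)
```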
I think this is fixed by foolscap >= 0.4.0.

Probably yes, but I haven't written a unit test to confirm it yet. I'll close this ticket when I've done that.
Hm. http://testgrid.allmydata.org:3567 currently says:
And its version numbers are:
The source:NEWS file says that this ticket -- #653 -- has been fixed. But perhaps it hasn't?
the testgrid is now behaving properly - but there is still no test case.
Per #748#comment:-1 -- I upgraded and rebooted the testgrid web gateway server. However, the previous version that was running already had foolscap v0.4.1, so there is apparently some issue, still present in that version of foolscap, which results in more connections than servers.
FYI, this should have been fixed by changeset:1863aee0aae1cc02, which depended upon a feature added to foolscap-0.4.0 (i.e. foolscap itself does not fix the issue.. it required a tahoe-side change).
I guess something may still be broken. We'll probably have to wait for it to become weird again and then look at the introducerclient's internal state.
Attachment incident-2009-07-16-002846-pypgfay.flog.bz2 (78470 bytes) added
Attachment incident-2009-07-16-002829-gc4xv5y.flog.bz2 (79877 bytes) added
:-( I guess I shouldn't have rebooted it then. Does that mean we should go ahead and boot this issue out of v1.5 Milestone? I have an idea -- let's add an assertion in the code that the number of connected storage servers <= number of known storage servers. Perhaps even a stronger, more detailed assertion if that's feasible.
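A minimal sketch of both the basic and the stronger form of that check. The argument names are illustrative; in the real code the two counts are rendered by src/allmydata/web/root.py:

```python
def check_storage_server_counts(connected_servers, known_servers):
    # Basic check: the connected count can never exceed the known count.
    assert len(connected_servers) <= len(known_servers), (
        "connected (%d) > known (%d) storage servers"
        % (len(connected_servers), len(known_servers)))
    # Stronger check: every connected server must also be a known server.
    assert set(connected_servers) <= set(known_servers)
```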
Here's what evidence I can gather about the problem that was exhibited.
I upgraded and restarted testgrid.allmydata.org:3567 at 2009-07-16 16:17:18.835Z (on testgrid.allmydata.org's clock). There was nothing that looked too unusual in the twistd.log for that day. There are two incidents reported, attached, from that day: incident-2009-07-16-002829-gc4xv5y.flog.bz2 and incident-2009-07-16-002846-pypgfay.flog.bz2.
Here is a foolscap log-viewer web service showing each of those logfiles: http://testgrid.allmydata.org:10000/ http://testgrid.allmydata.org:10001 . I have a hard time learning what I want to know from these logfiles. What I want to know (at least at first) is mostly about temporal coincidence. For starters, I'd like to be sure that these incidents occurred before I rebooted and upgraded the server, not after. However, timestamps such as "# 19:19:47.489 [23537270]: UNUSUAL excessive reactor delay ({'args': (25.734021902084351,), 'format': 'excessive reactor delay (%ss)', 'incarnation': ('\xf5\x16\x1dl\xb2\xf5\x85\xf9', None), 'num': 23537270, 'time': 1247710787.4896951, 'level': 23}s)" don't tell me what day it was nor what timezone the timestamp is in. Checking the status of http://foolscap.lothar.com/trac/ticket/90 suggests that the timestamps might be in Pacific time (PDT, which is UTC-7). If that's the case then ... no, some of the (causally) earliest events in the log are from 18:01. Perhaps they were from 2009-07-15? Argh, I give up. Please tell me how to understand the timing of events in foolscap incident report files. I updated http://foolscap.lothar.com/trac/ticket/90 to plead for fully-qualified, unambiguous timestamps.
The triggering incident is "download failure", but scarier to me is that there was a 26-second reactor delay.
Oookay, I read the relevant source code and the miscounting of number of connected storage servers was fixed between changeset:35b3f7f426c193cf (the version that exhibited the problem) and current HEAD. However, I'm not sure if that also means that whatever caused the failures on TestGrid was also fixed. Read on. Here is the path of the code that shows how "Connected to %s of %s known storage servers" was produced at version changeset:35b3f7f426c193cf:
source:src/allmydata/web/welcome.xhtml@3897#L55
source:src/allmydata/web/root.py@3897#L240
source:src/allmydata/introducer/client.py@3897#L277
Here is how it is produced today:
source:src/allmydata/web/welcome.xhtml@3997#L55
source:src/allmydata/web/root.py@3997#L240
source:src/allmydata/storage_client.py@3997#L41
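For illustration, a simplified sketch of the single-dict shape described below: one dict maps each serverid to a descriptor (NativeStorageClientDescriptor in the real code), and both numbers are derived from that one dict, with "connected" meaning the descriptor currently holds an rref. The class and method names here are illustrative, not the actual storage_client.py code:

```python
class StorageServerTrackerSketch(object):
    """Illustrative only: one dict is the single source of truth."""
    def __init__(self):
        # serverid -> descriptor; descriptor.rref is None while disconnected
        self.descriptors = {}

    def count_known(self):
        return len(self.descriptors)

    def count_connected(self):
        return len([d for d in self.descriptors.values() if d.rref is not None])
```

With this structure, "connected" can never exceed "known", because the connected servers are by construction a subset of the known ones.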
The old way could potentially have accumulated lots of tuples of (serverid, servicename, rref) for the same serverid and servicename, if new connections were established to the same serverid but the old connection was not lost, i.e. notifyOnDisconnect() wasn't getting called. The new way cannot possibly have an inconsistency between the number of known storage servers and the number of connected storage servers, since both are computed from the same dict -- the known servers are all the items of the dict and the connected storage servers are the ones that have an rref.
Brian: what do you think about notifyOnDisconnect() not getting called even while new connections to the same foolscap peer are being established? That's the only explanation I can see for the observed miscounting on the web gateway that was running allmydata-tahoe changeset:35b3f7f426c193cf and foolscap 0.4.1.
Hm, note that at least one of the storage servers was in a messed-up state in which it had incurred MemoryErrors, so it might have been failing to participate in some sort of foolscap "good-bye" negotiation, thus preventing the older connections to it from being cleaned up. This reminds me of these mysterious warnings in the incident reports (I've seen a lot of this kind of message in foolscap logs):
(flogtool note: until we get foolscap#90 resolved, you can use "flogtool dump --verbose" to get the raw seconds-since-epoch timestamps out of the event logs, which may give you more confidence in comparing them across servers. There is not yet something similar for "flogtool web-viewer", unfortunately. Also note that incident files always store seconds-since-epoch, so any conversion to a string time format is being done by the "flogtool" program that is processing the incident file.)
Yeah, one of the improvements in changeset:8df15e9f30a3bda7 was to use a single data structure to keep track of storage servers, instead of two separate ones. Seemed more robust.
I'd have to look more closely at the old code to figure out how notifyOnDisconnect could be confused. It gets fired when the TCP connection is lost, so there's no requirement that the remote side participate in any shutdown-time protocol, but it does get fired through an eventual-send, so there could conceivably have been a race condition or two.
(incidentally, I removed those "Tub.connectorFinished" WEIRD messages in foolscap-0.4.2, the current release.. they were causing more problems than they were worth)
I spoke with Brian on IRC this morning before his gallivanting schedule, and he said that it was unlikely that this problem will recur with the current versions of foolscap and of TahoeLAFS. I agree and am going to bump this ticket out of the v1.5 Milestone.
Note that we can't actually close this ticket yet because there isn't a test for the original issue: duplicate entries in IntroducerClient._connections because RemoteReferences don't compare equal when they denote the same underlying TCP connection to the same peer, much less for the newly discussed issue: duplicate RemoteReferences because (somehow) the peer is setting up new RemoteReferences while notifyOnDisconnect() is not getting called for the old ones.
However, I'm fairly confident that this is (a) not a regression versus TahoeLAFS v1.4.1, and (b) not a critical bug. (In fact, it might actually be already fixed in current trunk.)
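For concreteness, here is a rough sketch of the shape such a test could take for the original issue. Everything here is hypothetical (a toy wrapper with pass-through equality instead of the real classes); a real test would exercise the actual introducer client:

```python
class FakeRemoteReference(object):
    """Stand-in for a foolscap RemoteReference (default identity-based eq/hash)."""

class PassThroughWrapper(object):
    """Toy wrapper whose equality and hashing delegate to the wrapped reference."""
    def __init__(self, rref):
        self.rref = rref
    def __eq__(self, other):
        return self.rref == getattr(other, "rref", other)
    def __ne__(self, other):
        return not self.__eq__(other)
    def __hash__(self):
        return hash(self.rref)

def test_duplicate_wrappers_do_not_inflate_connections():
    # Two wrappers around the same underlying reference must collapse to a
    # single entry in a set of (peerid, service_name, rref) tuples.
    real_rref = FakeRemoteReference()
    connections = set()
    connections.add(("peerid", "storage", PassThroughWrapper(real_rref)))
    connections.add(("peerid", "storage", PassThroughWrapper(real_rref)))  # a reconnection delivers a fresh wrapper
    assert len(connections) == 1
```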
hm, I do hope we'll be able to close this ticket eventually, without reverting back to the old code, analyzing the problem, fixing it, then discarding the fix to go back to the new code :-). Let's consider closing this as INVALID after a couple of months of correct operation of the testgrid.
The only part that I'm not sure about is the way that multiple RemoteReferences piled up. I.e., I'm confident that there is not a bug in TahoeLAFS with regard to this issue, but I'm not confident that there isn't a bug in foolscap about this issue. Note that we didn't change the version of foolscap on the testgrid webapi (foolscap 0.4.1). Maybe if I write down what I know about it in this ticket that will help me understand it better.
Could you summarize what was changed which made me think (in comment:70180) that foolscap v0.4.0 fixed the original issue of having multiple RemoteReferences over the same TCP connection to the same peer?
Whatever it was, it didn't fix the problem that was demonstrated using TahoeLAFS changeset:35b3f7f426c193cf and foolscap 0.4.1, in which there were many more tuples of (serverid, servicetype, rref) than there were servers. This suggests that foolscap 0.4.1 must have been creating new RemoteReferences and not calling back the notifyOnDisconnect() method to clean up old RemoteReferences to the same peers. Now, TahoeLAFS in version changeset:1192b61dfed62a49 no longer keeps tuples of (peerid, servicetype, rref); instead it keeps a dict mapping from peerid to NativeStorageClientDescriptor, where a NativeStorageClientDescriptor has at most one RemoteReference.
When foolscap calls NativeStorageClientDescriptor._got_connection(), the NativeStorageClientDescriptor eventually sets self.rref to reference the new RemoteReference (see [storage_client.py line 218]source:src/allmydata/storage_client.py@20090717050709-66853-80eda37caf0df376be79f45cbed728999b68a843#L218). If its self.rref already had a reference to a RemoteReference then it would be overwritten by the new one, so the problem of lots of redundant RemoteReferences piling up certainly can't happen in TahoeLAFS anymore.
But if foolscap is generating redundant RemoteReferences and passing them to TahoeLAFS, could this cause other problems? For example, I see that NativeStorageClientDescriptor calls rref.notifyOnDisconnect(self._lost). If foolscap is generating redundant rrefs, then maybe self._lost will eventually be called by one of the old ones that is no longer referenced by the NativeStorageClientDescriptor, and if so then the current rref will get spuriously removed?
Maybe we should add some sanity checking to NativeStorageClientDescriptor, such as asserting that when its _got_connection() is called its self.rref must currently be None, or asserting that when its self._lost() method is called the current value of self.rref is the same as it was when the notifyOnDisconnect() was registered? In the interests of a stable v1.5, maybe we could make it so that if these checks fail it does something "safe": for example, if self._lost() is called when self.rref is not equal to what it was when the notifyOnDisconnect() was registered, it leaves self.rref alone (as well as logging a high-scariness incident).
What do you think?
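A minimal sketch of those defensive checks, under the assumption that the descriptor keeps its current remote reference in self.rref. Names mirror the discussion above, but this is an illustration, not the actual storage_client.py code; in particular, the real code passes self._lost directly to notifyOnDisconnect(), whereas this sketch binds the specific rref so that a stale notification can be recognized:

```python
class NativeStorageClientDescriptorSketch(object):
    """Illustration of the proposed sanity checks, not the real class."""
    def __init__(self, log):
        self.rref = None
        self._log = log   # e.g. a function that records a high-scariness incident

    def _got_connection(self, rref):
        if self.rref is not None:
            # We did not expect a second connection before losing the first one.
            self._log("WEIRD: new rref arrived while an old one is still present")
        self.rref = rref
        # Bind the rref this disconnect notification belongs to, so a late
        # notification from an already-replaced rref can be recognized.
        rref.notifyOnDisconnect(lambda r=rref: self._lost(r))

    def _lost(self, which_rref):
        if which_rref is not self.rref:
            # A stale rref disconnected: log loudly, but leave the current
            # self.rref alone ("safe" behavior for a stable v1.5).
            self._log("WEIRD: disconnect notification from a stale rref")
            return
        self.rref = None
```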
Brian: I hope you get some time to read my comment and consider whether my analysis is right, whether it shows that there is still a bug in foolscap 0.4.1, and whether, if there is, it might still cause problems for Tahoe-LAFS.
Zooko and I talked and did some more analysis. Based on that, we think there's a high probability of a foolscap bug (still present in the latest 0.4.2) that causes notifyOnDisconnect to sometimes not get called, probably triggered by "replacement connections" (i.e. where NAT table expiries or something cause an asymmetric close, one side reconnects, and the other side must displace an existing-but-really-dead connection with the new inbound one).
The tahoe code was rewritten to reduce the damage caused by this sort of thing. We could change it further, to remove the use of notifyOnDisconnect altogether, with two negative consequences:
The first problem would be annoying, so I think we're going to leave tahoe alone for now. I'll add a note to the foolscap docs to warn users about the notifyOnDisconnect bug, and encourage people not to rely upon it in environments where replacement connections are likely.
I created #816 (don't rely on notifyOnDisconnect()). We can close this ticket as fixed by the rewrite of the server management between changeset:35b3f7f426c193cf and changeset:1192b61dfed62a49. (Therefore the first release with this bug fixed was v1.5.0.)
Brian, I was just looking at a tahoe-lafs v1.4.1 storage server, and it has an incident report triggered by this:
I remember seeing this WEIRDness a lot in the past, and in fact I think you might have downgraded it to be not considered weird so that it would stop generating so many incident reports. Anyway, I was wondering whether the hypothesis that notifyOnDisconnect() sometimes doesn't fire when it is expected to would cast light on this weirdness. Thanks!
Maybe. Yeah, I remember seeing this particular message an awful lot, and couldn't figure it out, and decided that it was pretty harmless, so I downgraded it. I vaguely recall thinking that it was related to multiple simultaneous connections, in which the losing connection attempts were somehow being doubly-removed from the list of remaining attempts.
I suppose a direction to investigate might be to see what happens to those connectors when the primary connection finally closes. Or if the path that triggers the WEIRD warning might also somehow clobber the notifyOnDisconnect handler.