introducer doesn't seem to forget about old peers, or peers don't forget about old peers #26
Reference: tahoe-lafs/trac-2024-07-25#26
If you restart a node, the new instance retains the same TubID as the old one (since we stash the SSL certificate in the 'client.pem' file), but it gets a new so-called "Swiss Number". The introducer node is then informed of both, and this gets announced to everybody else.
The problem is that it looks like the rest of the world doesn't forget about the old swissnumber when the first instance of the node is shut down. They keep trying to reconnect to both the old one and the new one. The connection attempts to the old one are able to attach to the Tub (since the certificate is still the same), but of course the old swissnumber is gone, and thus you get benign but annoying error messages in everybody's log files as the getReferenceByName call fails.
I think that our introducer scheme is the absolute simplest thing that could possibly work, and as such it doesn't ever send out negative announcements. Without these, the clients will not know they should turn off their Reconnectors for the missing peers, and they'll keep trying to hit them over and over again.
We should fix this.
I checked the code, and the introducer does indeed forget about peers that go away, but it does not tell anyone about their loss.
I think we need to add a 'lost_peers' method next to the existing 'new_peers' one. It should take a set of FURLs that are no longer in the mesh. Each IntroducerClient should shut down any reconnector it has for a FURL that appears in a 'lost_peers' message.
The fact that the introducer sends these out in a strict order means (I believe) that the new_peers/lost_peers messages for a given FURL should be strictly interleaved: no add/add/lost/lost situations. I think that makes this safe from races.
I'm working on a patch for this. It will change the RIIntroducerClient protocol, though, by adding a lost_peers message.
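For illustration only, the client-side behaviour proposed above could be sketched like this. It is not the patch that was checked in and it does not use Foolscap: `ToyReconnector`, `ToyIntroducerClient`, and the `stop()` method are hypothetical names, and only `new_peers` and `lost_peers` come from the discussion.

```python
class ToyReconnector:
    """Hypothetical stand-in for a per-FURL reconnector (not Foolscap's class)."""

    def __init__(self, furl):
        self.furl = furl
        self.active = True

    def stop(self):
        # Stop retrying connections to this FURL.
        self.active = False


class ToyIntroducerClient:
    """Hypothetical model of the client side of new_peers/lost_peers."""

    def __init__(self):
        self.reconnectors = {}  # FURL -> ToyReconnector

    def new_peers(self, furls):
        # Start one reconnector per FURL we have not seen before.
        for furl in furls:
            if furl not in self.reconnectors:
                self.reconnectors[furl] = ToyReconnector(furl)

    def lost_peers(self, furls):
        # Shut down and drop the reconnector for each departed FURL.
        # Unknown FURLs are ignored, so a stray message is harmless.
        for furl in furls:
            reconnector = self.reconnectors.pop(furl, None)
            if reconnector is not None:
                reconnector.stop()
```

Because the messages for a given FURL are assumed to arrive strictly interleaved (add, then lost), a plain dictionary keyed by FURL is enough here; no reference counting is attempted.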
I've checked in that patch, so this issue is closed. I'll probably wait until the 0.2.1 release before upgrading testnet, though.
I'm not sure what the problem is exactly:
?
Oh I didn't realize you'd already implemented the lost_peers message. So far as I understand the issue, that doesn't seem like a good idea. Let's re-open this ticket and talk about it.
The log messages are annoying, but that's just a symptom of the real problem. #3 is the real issue, although it would probably be more accurate to describe the new connection as "better" because the old one is completely useless.
The way we create the IntroducerClient FURLs causes us to pick a new one each time the node boots, so once a node has shut down, the FURL that was used for that incarnation will never be used again, so it's inappropriate for anyone else to remember it. The log messages show up because the new incarnation of the node does use the same SSL certificate, and therefore gets the same TubID, and probably has the same IP address, so every other node in the system is trying to talk to the ghosts of our previous lives, and we log a getReferenceByName failure for each attempt.
Our current peer-connection/mesh-maintenance algorithm (v1 = fully-connected mesh) obligates us to treat nodes which are connected to the Introducer as active, and nodes which are not as inactive. So for both these reasons, once the node goes away, we need to stop trying to talk to it.
Part of this problem could be addressed by persisting the randomly-generated FURL so that each node's IntroducerClient would have the same name each time. (I'm not convinced this is the right approach for this issue; independent of that, I think this sort of persistence is an important pattern that I want Foolscap to provide convenient access to.) If we had that, then the client would get the same FURL each time, and the other nodes would only keep trying to connect to a single object per Tub instead of lots of them. This would still treat nodes that are no longer in contact with the Introducer as being active, however, which doesn't match what our v1 specs call for, and will cause 1 connection attempt per hour per active peer for each of the IP addresses that have been used by nodes in the past.
I've rolled back the lost_peers change pending further discussion.
Chatting with Brian on the phone.
A better way to describe this problem is "#4: wasted effort (on both initiator and recipient) trying to use a connection which you know can never succeed".
The annoying logging is also very important -- we want to improve logging. We want to add ways to control what gets logged and the way it gets logged, and it is also important to reduce false positives in the log so as to facilitate effective use of the logs.
The centralized introducer is an implementation expedient -- the long-term goal is to reduce and eliminate such points of centralization.
Restating my original list of possible problems that we might want to solve:
We'll get back to this when Brian has more design bandwidth. For the moment we're leaving ugly log messages and wasted effort in place.
Zooko and I chatted a bit, and we've settled on a short-term (0.2.1) fix: just persist the IntroducerClient's swissnumbers. That will get rid of the noise from clients who restart (and would thus change their FURLs). It will leave the noise from clients who shut down and never restart, but I think that's less of an issue.
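As a rough sketch of what "persist the swissnumber" means, assuming the name is simply kept in a small file in the node's base directory: the helper name `get_or_create_swissnum`, the filename, and the 32-hex-character format are all assumptions made for this example, not what the actual fix does.

```python
import os
import secrets
import tempfile


def get_or_create_swissnum(path):
    """Return the name saved at `path`, creating and saving one on first use."""
    try:
        with open(path, "r", encoding="ascii") as f:
            return f.read().strip()
    except FileNotFoundError:
        swissnum = secrets.token_hex(16)  # 32 hex characters; format is an assumption
        with open(path, "w", encoding="ascii") as f:
            f.write(swissnum + "\n")
        return swissnum


def demo():
    """Simulate two boots of the same node sharing one base directory."""
    with tempfile.TemporaryDirectory() as basedir:
        # Hypothetical filename, chosen only for this sketch.
        path = os.path.join(basedir, "introducer_client.swissnum")
        first_boot = get_or_create_swissnum(path)
        second_boot = get_or_create_swissnum(path)
        return first_boot, second_boot
```

Both calls in `demo()` return the same value, which is the property that keeps a restarted node's FURL stable. A node that never restarts still leaves its old announcement behind, as noted above.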
One datapoint:
Zooko is concerned about inadvertently adding (one might say "introducing") centralization to our architecture. Giving the Introducer the ability to forcibly disconnect peers turns it from an Introducer into a Dictator-Of-Who-Gets-To-Be-In-The-Mesh, and if we actually want that, it should be explicit and distinct. There are several reasons for wanting to create "private" meshes, but these will be implemented with membership credentials rather than by trying to control introduction.
The next connection-management milestone will be updated to specify that the Introducer exists to improve peer-discovery, but the mesh should not be dependent upon it. In particular, if the Introducer goes away, all existing connections should continue to work, and nodes should maintain their own independent decisions about which peers are useful to connect to and which are not. Eventually we want the mesh to grow via peer-learned gossip and reduce our dependence upon the Introducer until it goes away completely or only exists to bootstrap the network.
The "short term fix" has been in place successfully for 5 years now, and is working fine. Nodes use stable FURLs, so the only deadwood/noise is from nodes that have left for good. Until clients remember servers independently of the Introducer, occasional reboots will clear out dead announcements (specifically: after the node leaves, after the next Introducer reboot, then after the following client reboot, the client will no longer attempt to contact the missing node).
When clients are modified to remember servers on their own (as part of the #68 distributed-introduction work), they should include some timeouts, so nodes that haven't been heard from in any form (gossip/introduction or actual connection) can be forgotten.
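One way such a timeout might look, as a minimal sketch: the `ServerMemory` class, its method names, and the timeout value used in any example are hypothetical, since the thread does not specify them. Time is passed in explicitly to keep the behaviour easy to follow.

```python
class ServerMemory:
    """Hypothetical record of when each server was last heard from."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heard = {}  # server id -> timestamp of last contact

    def heard_from(self, server_id, now):
        # Call on any sign of life: gossip, introduction, or a live connection.
        self.last_heard[server_id] = now

    def forget_stale(self, now):
        # Drop servers that have been silent for longer than the timeout
        # and return their ids so callers can stop any reconnection attempts.
        stale = [s for s, t in self.last_heard.items() if now - t > self.timeout]
        for server_id in stale:
            del self.last_heard[server_id]
        return stale
```

With a one-hour timeout, a server last heard at time 0 would be forgotten by a `forget_stale(now=4000)` call, while one heard at time 3000 would be kept.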