introducer doesn't seem to forget about old peers, or peers don't forget about old peers #26

warner · 2007-05-02T21:46:39Z

warner commented

2007-05-02 21:46:39 +00:00

If you restart a node, the new instance retains the same TubID as the old one (since we stash the SSL certificate in the 'client.pem' file), but it gets a new so-called "Swiss Number". The introducer node is then informed of both, and this gets announced to everybody else.

The problem is that it looks like the rest of the world doesn't forget about the old swissnumber when the first instance of the node is shut down. They keep trying to reconnect to both the old one and the new one. The connection attempts to the old one are able to attach to the Tub (since the certificate is still the same), but of course the old swissnumber is gone, and thus you get benign but annoying error messages in everybody's log files as the getReferenceByName call fails.

I think that our introducer scheme is the absolute simplest thing that could possibly work, and as such it doesn't ever send out negative announcements. Without these, the clients will not know they should turn off their Reconnectors for the missing peers, and they'll keep trying to hit them over and over again.

We should fix this.

warner added the code, minor, defect labels 2007-05-02 21:46:39 +00:00

warner commented

2007-05-02 23:25:00 +00:00

I checked the code, and the introducer does indeed forget about peers that go away, but it does not tell anyone about their loss.

I think we need to add a 'lost_peers' method next to the existing 'new_peers' one. It should take a set of FURLs that are no longer in the mesh. The IntroducerClient should shut down any reconnector they have for a FURL that appears in a 'lost_peers' message.

The fact that the introducer sends these out in a strict order means (I believe) that the new_peers/lost_peers messages for a given FURL should be strictly interleaved: no add/add/lost/lost situations. I think that makes this safe from races.
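A rough sketch of the client side of this proposal, for illustration only. It is not the patch itself: `IntroducerClientSketch`, `FakeReconnector`, and its `stop()` method are hypothetical stand-ins, no Foolscap calls are used, and it assumes the introducer delivers `new_peers`/`lost_peers` messages in order as described above.

```python
class FakeReconnector:
    """Hypothetical stand-in for a reconnector; it only tracks whether it is running."""

    def __init__(self, furl):
        self.furl = furl
        self.running = True

    def stop(self):
        self.running = False


class IntroducerClientSketch:
    """Keeps one reconnector per announced FURL (illustrative, not the real class)."""

    def __init__(self):
        self.reconnectors = {}  # furl -> FakeReconnector

    def new_peers(self, furls):
        # Start a reconnector for each FURL we have not seen before.
        for furl in furls:
            if furl not in self.reconnectors:
                self.reconnectors[furl] = FakeReconnector(furl)

    def lost_peers(self, furls):
        # Stop and forget the reconnector for each FURL that left the mesh.
        # Unknown FURLs are ignored rather than treated as errors.
        for furl in furls:
            reconnector = self.reconnectors.pop(furl, None)
            if reconnector is not None:
                reconnector.stop()
```

With strictly ordered delivery, a FURL is either in `reconnectors` or not at any given moment, so a `lost_peers` message never has to undo more than one `new_peers` announcement.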

warner commented

2007-05-04 05:17:50 +00:00

I'm working on a patch for this. It will change the RIIntroducerClient protocol, though, by adding a lost_peers message.

warner added major and removed minor labels 2007-05-04 05:17:50 +00:00

warner commented

2007-05-08 02:18:41 +00:00

I've checked in that patch, so this issue is closed. I'll probably wait until the 0.2.1 release before upgrading testnet, though.

warner added the fixed label 2007-05-08 02:18:41 +00:00

warner closed this issue

2007-05-08 02:18:41 +00:00

zooko commented

2007-05-08 15:27:29 +00:00

I'm not sure what the problem is exactly:

1. annoying messages in logs
2. wasted effort trying to reconnect to a node that never comes back
3. wasted effort trying to reconnect to a node that you already have a new, better connection to

?

zooko commented

2007-05-08 15:28:39 +00:00

Oh I didn't realize you'd already implemented the lost_peers message. So far as I understand the issue, that doesn't seem like a good idea. Let's re-open this ticket and talk about it.

zooko removed the fixed label 2007-05-08 15:28:39 +00:00

zooko reopened this issue

2007-05-08 15:28:39 +00:00

warner commented

2007-05-09 07:42:24 +00:00

The log messages are annoying, but that's just a symptom of the real problem. #3 is the real issue, although it would probably be more accurate to describe the new connection as "better" because the old one is completely useless.

The way we create the IntroducerClient FURLs causes us to pick a new one each time the node boots, so once a node has shut down, the FURL that was used for that incarnation will never be used again, so it's inappropriate for anyone else to remember it. The log messages show up because the new incarnation of the node does use the same SSL certificate, and therefore gets the same TubID, and probably has the same IP address, so every other node in the system is trying to talk to the ghosts of our previous lives, and we log a getReferenceByName failure for each attempt.

Our current peer-connection/mesh-maintenance algorithm (v1 = fully-connected mesh) obligates us to treat nodes which are connected to the Introducer as active, and nodes which are not as inactive. So for both these reasons, once the node goes away, we need to stop trying to talk to it.

Part of this problem could be addressed by persisting the randomly-generated FURL so that each node's IntroducerClient would have the same name each time. (I'm not convinced this is the right approach for this issue; independent of that I think this sort of persistence is an important pattern that I want Foolscap to provide convenient access to). If we had that, then the client would get the same FURL each time, and the other nodes could only keep trying to connect to a single object per Tub instead of lots of them. This would still treat nodes that are no longer in contact with the Introducer as being active, however, which doesn't match what our v1 specs call for, and will cause 1 connection attempt per hour per active peer for each of the IP addresses that have been used by nodes in the past.

warner commented

2007-05-09 18:39:47 +00:00

I've rolled back the lost_peers change pending further discussion.

zooko commented

2007-05-09 18:50:00 +00:00

Chatting with Brian on the phone.

A better way to describe this problem is "#4: wasted effort (on both initiator and recipient) trying to use a connection which you know can never succeed".

The annoying logging is also very important -- we want to improve logging. We want to add ways to control what gets logged and the way it gets logged, and it is also important to reduce false positives in the log so as to facilitate effective use of the logs.

The centralized introducer is an implementation expedient -- the long-term goal is to reduce and eventually eliminate such points of centralization.

Restating my original list of possible problems that we might want to solve:

1.  False alarms in log.
2.  Wasted effort trying to connect and failing.
3.  Wasted effort trying to connect on a channel which has been superseded by a "better" channel.
4.  Wasted effort trying to connect on a channel which you know can never succeed.

We'll get back to this when Brian has more design bandwidth. For the moment we're leaving ugly log messages and wasted effort in place.

warner commented

2007-06-07 19:02:57 +00:00

Zooko and I chatted a bit, and we've settled on a short-term (0.2.1) fix: just persist the IntroducerClient's swissnumbers. That will get rid of the noise from clients who restart (and would thus change their FURLs). It will leave the noise from clients who shut down and never restart, but I think that's less of an issue.
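A minimal sketch of what persisting a swissnumber could look like, assuming it is simply written to a file in the node's base directory on first boot and re-read on later boots. The file name `introducer_client.swissnum` and the helpers `load_or_create_swissnum`, `make_furl`, and `simulate_two_boots` are hypothetical, chosen for illustration; this is not the code that went into 0.2.1.

```python
import os
import secrets
import tempfile


def load_or_create_swissnum(path):
    """Return the swissnumber stored at `path`, generating one on first boot."""
    try:
        with open(path, "r") as f:
            return f.read().strip()
    except FileNotFoundError:
        swissnum = secrets.token_hex(16)  # 32 hex characters, unguessable
        with open(path, "w") as f:
            f.write(swissnum + "\n")
        return swissnum


def make_furl(tubid, location, swissnum):
    """Assemble a FURL-shaped string: pb://TUBID@LOCATION/SWISSNUM."""
    return "pb://%s@%s/%s" % (tubid, location, swissnum)


def simulate_two_boots():
    """Boot the same (pretend) node twice and return both FURLs."""
    with tempfile.TemporaryDirectory() as basedir:
        path = os.path.join(basedir, "introducer_client.swissnum")
        first = make_furl("tubid", "example.invalid:1234",
                          load_or_create_swissnum(path))
        second = make_furl("tubid", "example.invalid:1234",
                           load_or_create_swissnum(path))
    return first, second
```

Because the second boot reads the stored value instead of generating a new one, both boots yield the same FURL, so peers holding the old announcement reconnect to a live object rather than a ghost.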

One datapoint:

The testnet machines are emitting 300kB of logs per day, with no user activity. This is entirely a result of these wasted connections.

Zooko is concerned about inadvertently adding (one might say "introducing") centralization to our architecture. Giving the Introducer the ability to forcibly disconnect peers turns it from an Introducer into a Dictator-Of-Who-Gets-To-Be-In-The-Mesh, and if we actually want that, it should be explicit and distinct. There are several reasons for wanting to create "private" meshes, but these will be implemented with membership credentials rather than by trying to control introduction.

The next connection-management milestone will be updated to specify that the Introducer exists to improve peer-discovery, but the mesh should not be dependent upon it. In particular, if the Introducer goes away, all existing connections should continue to work, and nodes should maintain their own independent decisions about which peers are useful to connect to and which are not. Eventually we want the mesh to grow via peer-learned gossip and reduce our dependence upon the Introducer until it goes away completely or only exists to bootstrap the network.

warner added code-network and removed code labels 2007-08-14 18:53:15 +00:00

zooko added the 0.6.0 label 2007-09-25 04:18:16 +00:00

zooko added this to the undecided milestone 2007-09-25 04:18:16 +00:00

warner commented

2012-06-12 22:56:04 +00:00

The "short term fix" has been in place successfully for 5 years now, and is working fine. Nodes use stable FURLs, so the only deadwood/noise is from nodes that have left for good. Until clients remember servers independently of the Introducer, occasional reboots will clear out dead announcements (specifically: after the node leaves, after the next Introducer reboot, then after the following client reboot, the client will no longer attempt to contact the missing node).

When clients are modified to remember servers on their own (as part of the #68 distributed-introduction work), they should include some timeouts, so nodes that haven't been heard from in any form (gossip/introduction or actual connection) can be forgotten.
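One way such a timeout might work, as a sketch under assumptions: the client records when it last heard from each server in any form and periodically drops entries older than a cutoff. `PeerTable`, `heard_from`, and `expire` are hypothetical names, and the injectable `clock` parameter exists only to make the example testable.

```python
import time


class PeerTable:
    """Remembers peers by FURL and forgets those not heard from recently."""

    def __init__(self, timeout_seconds, clock=time.time):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_heard = {}  # furl -> timestamp of last gossip or connection

    def heard_from(self, furl):
        # Call on any sign of life: an announcement, gossip, or a live connection.
        self.last_heard[furl] = self.clock()

    def expire(self):
        # Drop every peer whose last sign of life is older than the timeout,
        # and return the FURLs that were forgotten.
        now = self.clock()
        stale = [furl for furl, seen in self.last_heard.items()
                 if now - seen > self.timeout]
        for furl in stale:
            del self.last_heard[furl]
        return stale
```

The choice of timeout is left open here; it would need to be long enough that a server which is merely offline for a while is not forgotten.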

warner added the fixed label 2012-06-12 22:56:04 +00:00

warner closed this issue

2012-06-12 22:56:04 +00:00
