reconnecting to one server should trigger reconnection attempts to all #374

Closed
opened 2008-03-30 18:34:47 +00:00 by warner · 3 comments

The unwriteable-directory error that Mike saw last week, which is prompting us to bring forward the read-k-plus-epsilon ticket, was probably caused by a network flap. I think his laptop was offline for a while, causing all the storage-server Reconnectors to ratchet back to some long retry interval (perhaps as long as an hour). He got back online, then some time later tried to write to that directory. At that point, I think some of the Reconnectors had fired, while the others were still waiting. The Publish process only saw a subset of the servers, so it placed new shares on new machines, leaving the partitioned servers with their old shares. It was the presence of recoverable shares for both versions that triggered the problem we saw.

I'm thinking that we need to find a way to accelerate the reconnection process, so that there is a smaller window of time during which we're connected to some servers but not others. From the client's point of view, we can't tell the difference between a server being down and the client itself being offline (hmm, or can we?). But we could say that the Reconnector callback being fired for any server (or the Introducer) should make us reset the retry timers for all other Reconnectors. For the client-comes-back-online case, this would result in a random delay, followed by one connect, followed by a thundering herd of other connects.

Foolscap already has support for resetting the Reconnector, although it was intended for debugging purposes, so we need to make sure it works properly here (specifically, that a reset is ignored if the connection is already established).
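
Roughly, a sketch of that approach (the `Tub.connectTo`/`Reconnector.reset` calls are Foolscap's; the surrounding client class and wiring are hypothetical):

```python
# Sketch: when any server's Reconnector callback fires, reset every other
# Reconnector so they retry immediately instead of waiting out their backoff.
# Assumes Foolscap's Tub.connectTo(), which returns a Reconnector, and
# Reconnector.reset(); everything else here is illustrative.

class StorageClient:
    def __init__(self, tub):
        self.tub = tub
        self.reconnectors = []  # one per known storage server

    def add_server(self, furl):
        # connectTo returns a Reconnector; _connected fires on each (re)connect
        rc = self.tub.connectTo(furl, self._connected)
        self.reconnectors.append(rc)

    def _connected(self, rref):
        # One server came back: hurry the rest along. reset() needs to be a
        # no-op for connections that are already established.
        for rc in self.reconnectors:
            rc.reset()
```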

Other ideas: the client could watch the network-interface list for changes and reset the Reconnectors a few seconds after it notices a change. It could also use a shorter retry delay for the Introducer, and reset the storage-server Reconnectors once the Introducer connection is reestablished.
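
A sketch of the interface-watching variant, assuming a plain polling loop (the `psutil` dependency, the poll interval, and the settle delay are illustrative choices, not anything Tahoe currently does):

```python
import psutil
from twisted.internet import reactor, task

class InterfaceWatcher:
    """Poll the network-interface list; reset all Reconnectors a few
    seconds after the list changes (e.g. the laptop rejoins a network)."""

    def __init__(self, reset_all_reconnectors, settle_delay=5):
        self._reset_all = reset_all_reconnectors
        self._settle = settle_delay
        self._last = self._snapshot()

    def _snapshot(self):
        # interface names plus their addresses, so address changes count too
        return {name: tuple(a.address for a in addrs)
                for name, addrs in psutil.net_if_addrs().items()}

    def start(self, poll_interval=10):
        task.LoopingCall(self._poll).start(poll_interval, now=False)

    def _poll(self):
        current = self._snapshot()
        if current != self._last:
            self._last = current
            # give addresses/routes a moment to settle, then reconnect
            reactor.callLater(self._settle, self._reset_all)
```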

It might also be a good idea to have the Publish process notice when it has to use new servers and schedule a repair pass for some time afterwards. Perhaps a "repair after reconnection" queue.
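
A sketch of such a queue (the names are hypothetical; it assumes some `repair(cap)` entry point exists and uses Twisted's reactor for scheduling):

```python
from twisted.internet import reactor

class RepairAfterReconnectionQueue:
    """Remember files whose shares were placed on substitute servers, and
    schedule a repair pass once the original servers reconnect."""

    def __init__(self, repair_fn, delay=60):
        self._repair = repair_fn   # hypothetical: re-checks and repairs a cap
        self._delay = delay        # grace period after reconnection
        self._pending = set()

    def note_degraded_publish(self, cap):
        # Publish noticed it had to place shares on new servers
        self._pending.add(cap)

    def on_reconnection(self):
        # a partitioned server came back; repair everything pending soon
        reactor.callLater(self._delay, self._drain)

    def _drain(self):
        pending, self._pending = self._pending, set()
        for cap in pending:
            self._repair(cap)
```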

It might also be useful to display "connected to 8 out of 20 known servers" to the user (perhaps through a JSON-based machine-readable interface, so that front-ends can be involved), and to have a "try to reconnect" button on the front page.
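
For the machine-readable side, the status could be as simple as a JSON document like the one below (the class and field names are made up for illustration):

```python
import json

class ConnectionStatus:
    """Track per-server connection state and render it for front-ends."""

    def __init__(self):
        self._connected = {}  # serverid -> bool

    def note_connect(self, serverid):
        self._connected[serverid] = True

    def note_disconnect(self, serverid):
        self._connected[serverid] = False

    def to_json(self):
        up = sum(1 for ok in self._connected.values() if ok)
        return json.dumps({
            "servers-known": len(self._connected),
            "servers-connected": up,  # e.g. 8 out of 20
        })
```

A front page could render that as "connected to 8 out of 20 known servers", with the "try to reconnect" button simply invoking the reset-all-Reconnectors hook from the first sketch.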
warner added the code-network, critical, defect, 0.9.0 labels 2008-03-30 18:34:47 +00:00
Author

done, in changeset:5578559b8566b7dc
warner added the fixed label 2008-03-31 22:33:18 +00:00

The changeset in trac for this patch is now numbered changeset:f9e261d939412e27.

Milestone 1.0.1 deleted
zooko added this to the 1.1.0 milestone 2008-05-05 21:08:36 +00:00
Reference: tahoe-lafs/trac-2024-07-25#374