reconnecting to one server should trigger reconnection attempts to all #374
The unwriteable-directory error that Mike saw last week, which is prompting us to bring forward the read-k-plus-epsilon ticket, was probably caused by a network flap. I think that his laptop was offline for a while, causing all the storage-server Reconnectors to ratchet back to some long retry interval (perhaps as long as an hour). He got back online, then some time later tried to write to that directory. At that point, I think some of the Reconnectors had fired, while the others were still waiting. The Publish process only saw a subset of the servers, so it placed new shares on new machines, leaving the partitioned servers with their old shares. It was the presence of recoverable shares for both versions that triggered the problem we saw.
I'm thinking that we need to find a way to accelerate the reconnection process, so that there is a smaller window of time during which we're connected to some servers but not others. From the client's point of view, we can't tell the difference between a server being down and the client being offline (hmm, or can we?). But we could say that the Reconnector callback being fired for any server (or the Introducer) should make us reset the retry timers for all other Reconnectors. For the client-comes-back-online case, this would result in a random delay, then one connection, then a thundering herd of other connection attempts.
Foolscap already has support for resetting the Reconnector, although it was intended for debugging purposes, so we need to make sure that it works properly here (specifically that a reset is ignored if the connection is already established).
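To make the reset-everything-on-any-connection idea concrete, here is a rough sketch of the client-side bookkeeping (not a patch, just an illustration). The ReconnectorPool class, the serverid/furl bookkeeping, and reset_all() are made up for this ticket; the only Foolscap calls assumed are Tub.connectTo(), Reconnector.reset(), and RemoteReference.notifyOnDisconnect(). Note that it tracks connected servers itself and only resets the disconnected ones, since we still need to confirm that reset() is harmless on an already-established connection.

```python
# Hypothetical helper: one Reconnector per storage server. When any
# connection comes up, poke all the Reconnectors that are still down.

class ReconnectorPool(object):
    def __init__(self, tub):
        self.tub = tub
        self.reconnectors = {}   # serverid -> foolscap Reconnector
        self.connected = set()   # serverids we currently hold an RRef for

    def add_server(self, serverid, furl):
        rc = self.tub.connectTo(furl, self._got_connection, serverid)
        self.reconnectors[serverid] = rc

    def reset_all(self):
        # make every not-currently-connected Reconnector retry right away
        for sid, rc in self.reconnectors.items():
            if sid not in self.connected:
                rc.reset()

    def _got_connection(self, rref, serverid):
        self.connected.add(serverid)
        rref.notifyOnDisconnect(self._lost_connection, serverid)
        # one server (or the introducer) is back: assume the network came
        # back too, and reset the retry timers of everything else
        self.reset_all()

    def _lost_connection(self, serverid):
        self.connected.discard(serverid)
```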
Other ideas: the client could watch the network interface list for changes, and reset the Reconnectors a few seconds after it notices a change. We could also use a shorter retry delay for the introducer connection, and reset the other Reconnectors once the introducer connection is reestablished.
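A sketch of the interface-watching variant, building on the ReconnectorPool above. Polling with psutil.net_if_addrs() is just one easy way to get an address snapshot (a platform-specific notification API would be better), and the poll interval and settle delay are arbitrary.

```python
# Poll the interface/address list every few seconds; when it changes, wait
# a moment for routes/DHCP to settle and then retry all dropped connections.

import psutil
from twisted.internet import reactor, task

class InterfaceWatcher(object):
    def __init__(self, pool, poll_interval=5, settle_delay=3):
        self.pool = pool              # the ReconnectorPool sketched above
        self.settle_delay = settle_delay
        self.last_snapshot = self._snapshot()
        self.loop = task.LoopingCall(self._check)
        self.loop.start(poll_interval, now=False)

    def _snapshot(self):
        return frozenset((name, addr.address)
                         for name, addrs in psutil.net_if_addrs().items()
                         for addr in addrs)

    def _check(self):
        current = self._snapshot()
        if current != self.last_snapshot:
            self.last_snapshot = current
            reactor.callLater(self.settle_delay, self.pool.reset_all)
```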
It might also be a good idea to have the publish process notice when it has to use new servers, and schedule a repair pass to occur some time afterwards. Perhaps a "repair after reconnection" queue.
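Something like the following could serve as the "repair after reconnection" queue; the note_new_servers_used() hook and the client.repair() entry point are placeholders for whatever the publish and repair code actually expose.

```python
# When a publish had to place shares on servers that didn't hold the old
# version, remember the cap and schedule a repair once things settle down.

from twisted.internet import reactor

class RepairQueue(object):
    def __init__(self, client, delay=15*60):
        self.client = client
        self.delay = delay        # e.g. wait 15 minutes before repairing
        self.pending = set()

    def note_new_servers_used(self, cap):
        # called by the publish process when shares landed on new servers
        if cap not in self.pending:
            self.pending.add(cap)
            reactor.callLater(self.delay, self._repair, cap)

    def _repair(self, cap):
        self.pending.discard(cap)
        self.client.repair(cap)   # placeholder for the real repair call
```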
It might also be useful to display "connected to 8 out of 20 known servers" to the user (perhaps through a JSON-based machine-readable interface, so that front-ends can be involved), and have a "try to reconnect" button on the front page.
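For the status display, a small JSON resource on the web frontend would be enough for both humans and front-ends; the field names and the POST-to-reconnect behavior here are just strawmen, again leaning on the hypothetical ReconnectorPool.

```python
# twisted.web resource returning connection counts as JSON; a POST acts as
# the "try to reconnect" button by resetting the pool's Reconnectors.

import json
from twisted.web import resource

class ConnectionStatus(resource.Resource):
    isLeaf = True

    def __init__(self, pool):
        resource.Resource.__init__(self)
        self.pool = pool

    def render_GET(self, request):
        request.setHeader(b"content-type", b"application/json")
        status = {"known_servers": len(self.pool.reconnectors),
                  "connected_servers": len(self.pool.connected)}
        return json.dumps(status).encode("utf-8")

    def render_POST(self, request):
        self.pool.reset_all()           # "try to reconnect" button
        return b"reconnection attempts triggered\n"
```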
done, in changeset:5578559b8566b7dc
The changeset in Trac for this patch is now numbered changeset:f9e261d939412e27.
Milestone 1.0.1 deleted