everything stalls after abrupt disconnect #253
We just set up a 3-node network at Seb's house with my laptop, Seb's, and Josh's. When I turned off the AirPort on my Mac, Seb and Josh subsequently couldn't do anything -- uploads, downloads, and "check this file" operations all hung silently.
After a few minutes I reconnected my laptop, but the problem persisted for several minutes -- perhaps 5 -- before Seb's tahoe node recovered and was able to function normally. Josh's node, however, still had not recovered by the time we called it a night (maybe 15 minutes later).
I'm attaching all three logs.
Attachment twistd.log-from-sebs-laptop (87409 bytes) added
twistd.log from seb's laptop
Attachment twistd.log-from-zookos-laptop.gz (54086 bytes) added
twistd.log from zooko's laptop (gzipped)
Attachment twistd.log-from-joshes-laptop.gz (4303 bytes) added
twistd.log from josh's laptop (gzipped)
I've assigned this to Brian in order to draw his attention to it, since it probably involves foolscap connection management, but I'm going to try to reproduce it now.
Were you all running the latest Foolscap? This was definitely a problem in older versions, but the hope was that we fixed it in 0.2.0 or so.
I'll try to look at those logs when I get a chance. The real information may well be in the foolscap log events, though, which we aren't recording by default, so the important data may be missing; reproducing the problem would be a big help.
I looked into the logs in order to answer the question of which versions of foolscap were in use, and that information isn't there! I thought that we logged all version numbers. I'll investigate that.
So you shut down your laptop, after which the other nodes would see their TCP
packets go unacknowledged. TCP takes about 15 minutes to break the connection
in this state (see
Foolscap#28 for some
experimental timing data). During this period, the other nodes cannot
distinguish between your laptop being slow and it being gone.
Assuming your laptop doesn't come back, to allow the other nodes to make
progress, we need to modify the Tahoe download code to switch to an alternate
source of shares when a callRemote takes too long. One heuristic might be to
keep track of how long it took to acquire the previous share, and if the next
share takes more than 150% as long, move that peer to the bottom of the list
and ask somebody else for it. To allow uploads to make progress, we'd want to
do something similar: if our remote_write call doesn't complete within 150%
of the time the previous one did (or 150% of the max time that the other
peers handled it), assume that this peer is dubious. Either we consider it
dead (and wind up with a slightly-unhealthy file), or we buffer the shares
that we wanted to send to them (and consume storage in the hopes that they'll
come back).
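A minimal sketch of that heuristic might look like the following. It assumes a hypothetical blocking `fetch_share(peer, timeout)` helper that raises `TimeoutError` when a peer doesn't answer in time; none of these names come from Tahoe, and the real code would be Deferred-based rather than blocking.

```python
# Hypothetical sketch of the "150% of the previous fetch" heuristic described
# above.  fetch_share(peer, timeout) stands in for the real callRemote-based
# share retrieval; these names are illustrative, not Tahoe's.
import time
from collections import deque

SLOWDOWN_FACTOR = 1.5    # "more than 150% as long", from the comment above
INITIAL_TIMEOUT = 30.0   # assumed fallback before we have any timing data

def download_shares(peers, fetch_share, shares_needed):
    """Collect shares, demoting any peer that turns noticeably slow."""
    last_duration = {}    # peer -> seconds its previous fetch took
    strikes = {}          # peer -> how many times it has timed out on us
    queue = deque(peers)
    shares = []
    while queue and len(shares) < shares_needed:
        peer = queue.popleft()
        baseline = last_duration.get(peer)
        timeout = SLOWDOWN_FACTOR * baseline if baseline else INITIAL_TIMEOUT
        started = time.monotonic()
        try:
            share = fetch_share(peer, timeout=timeout)
        except TimeoutError:
            strikes[peer] = strikes.get(peer, 0) + 1
            if strikes[peer] < 2:
                queue.append(peer)  # dubious: bottom of the list, ask others
            continue                # after two strikes we stop re-queueing it
        except Exception:
            continue                # broken peer: stop asking it altogether
        last_duration[peer] = time.monotonic() - started
        queue.appendleft(peer)      # still responsive: keep asking it
        shares.append(share)
    return shares
```

The key point is that each timeout is relative to that peer's own recent performance, so a uniformly slow network doesn't trigger spurious failovers.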
Now, when your laptop did come back, did you restart your tahoe node? If so,
the new node's connections should have displaced the ones from the old node,
and any uploads/downloads in progress should have seen immediate
connectionLost errors. For upload I think we handle this properly (we abandon
that share, resulting in a slightly unhealthy file, and if we don't achieve
shares_of_happiness then we declare the upload to have failed). For download
I think we explode pretty violently: an indefinite hang is a distinct
possibility. (At the very least we should flunk the download, but really we
should switch over to other peers as described above.) New operations (done
after your new node finished connecting) should have worked normally. If
not, perhaps our Introducer code isn't properly replacing the peer reference
when the reconnector fires the callback a second time.
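As a rough illustration of the upload rule described above (abandon shares on lost connections, then fail only if fewer than shares_of_happiness shares were placed), here is a sketch with invented names; it is not Tahoe's actual code.

```python
# Hypothetical illustration of the upload rule above; names are not Tahoe's.
class UploadUnhappy(Exception):
    """Raised when too few shares survive to call the upload a success."""

def finish_upload(placed, lost_peers, shares_of_happiness):
    # `placed` maps share numbers to the peer each share was sent to.
    surviving = {sharenum: peer for sharenum, peer in placed.items()
                 if peer not in lost_peers}
    if len(surviving) < shares_of_happiness:
        raise UploadUnhappy("only %d shares placed, wanted %d"
                            % (len(surviving), shares_of_happiness))
    # Enough shares landed: the upload succeeds, possibly as a slightly
    # unhealthy file that a later repair can bring back to full health.
    return surviving
```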
But, if you didn't restart your tahoe node when you reconnected, now we're
in a different state. The other nodes would have outstanding data trying to
get to your node, and TCP will retransmit that with an exponential backoff
(doubling the delay each time). If your machine was off-net for 4 minutes,
you could expect those nodes to not try again for a further 4 minutes. If
your node sent data of its own, that might trigger a fast retry, but maybe
not, and your node might not have needed to talk to them at that point. Once
a retry was attempted, I'd expect data to start flowing quickly and normal
operations to resume.
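To make the backoff arithmetic concrete, here is a toy calculation. The 1-second initial retransmission timeout and the absence of an RTO cap are simplifying assumptions (real TCP stacks do cap it); the doubling is the part that matters.

```python
# Toy illustration of the doubling retransmission schedule described above.
def retransmit_times(initial_rto=1.0, horizon=15 * 60):
    """Return the times (in seconds) at which unanswered retransmits fire."""
    times, t, rto = [], 0.0, initial_rto
    while t + rto <= horizon:
        t += rto
        times.append(t)
        rto *= 2.0          # each unanswered attempt doubles the wait
    return times

print(retransmit_times())
# -> [1.0, 3.0, 7.0, 15.0, 31.0, 63.0, 127.0, 255.0, 511.0]
```

By the time a peer has been unreachable for about four minutes, the gap between successive attempts is itself several minutes, which is consistent with traffic not resuming immediately after the laptop rejoined.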
Any idea which case it was?
I didn't restart my Tahoe node. Seb's tahoe node reconnected within a few minutes of my turning my wireless card back on, but Josh's still hadn't even after maybe 15 minutes.
Cc: josh (arch)
I'm ready to call this a known issue for v0.7.0. Bumping it to v0.7.1 Milestone.
See also #193, #287, and #521.
I haven't tried to reproduce this problem or further diagnose it in two years, and much has changed since then. I'm going to presumptively close it as 'wontfix'. A future rewrite of the download logic may also fix it if it isn't already fixed -- see comment:62885.