everything stalls after abrupt disconnect #253

Closed
opened 2008-01-04 17:23:30 +00:00 by zooko · 11 comments

We just set up a 3-node network at Seb's house with my laptop, Seb's, and Josh's. When I turned off the AirPort on my Mac, Seb and Josh subsequently couldn't do anything -- uploads, downloads, and "check this file" operations all hung silently.

After a few minutes I reconnected my laptop, but the problem persisted for several minutes -- perhaps 5 -- before Seb's tahoe node recovered and was able to function normally. However, Josh's node still hadn't recovered by the time we called it a night (maybe 15 minutes later).

I'm attaching all three logs.

zooko added the code-network, major, defect, 0.7.0 labels 2008-01-04 17:23:30 +00:00
zooko added this to the undecided milestone 2008-01-04 17:23:30 +00:00
warner was assigned by zooko 2008-01-04 17:23:30 +00:00
Author

Attachment twistd.log-from-sebs-laptop (87409 bytes) added

twistd.log from seb's laptop

Author

Attachment twistd.log-from-zookos-laptop.gz (54086 bytes) added

twistd.log from zooko's laptop (gzipped)

Author

Attachment twistd.log-from-joshes-laptop.gz (4303 bytes) added

twistd.log from josh's laptop (gzipped)

Author

I've assigned this to Brian in order to draw his attention to it, since it probably involves foolscap connection management, but I'm going to try to reproduce it now.

zooko modified the milestone from undecided to 0.7.0 2008-01-04 18:09:21 +00:00

Were you all running the latest Foolscap? This was definitely a problem in older versions, but the hope was that we fixed it in 0.2.0 or so.

I'll try to look at those logs when I get a chance. The real information may well be in the foolscap log events, though, which we aren't recording by default; the important data may be missing, so reproducing the problem would be a big help.

Author

I looked into the logs in order to answer the question of which versions of foolscap were in use, and that information isn't there! I thought that we logged all version numbers. I'll investigate that.


So you shut down your laptop, after which the other nodes would see their TCP
packets go unacknowledged. TCP takes about 15 minutes to break the connection
in this state (see [Foolscap#28](http://foolscap.lothar.com/trac/ticket/28)
for some experimental timing data). During this period, the other nodes cannot
distinguish between your laptop being slow and it being gone.
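
For reference, a standard socket-level way to shorten that window is TCP
keepalive. This is a minimal sketch, not something Tahoe or foolscap did at
the time; the TCP_KEEP* constants are Linux-specific, and keepalive probes
only fire on an *idle* connection, so they don't help while unacknowledged
data is still being retransmitted:

```python
import socket

def enable_fast_keepalive(sock, idle=30, interval=10, count=3):
    # Probe an idle connection after `idle` seconds of silence, then
    # every `interval` seconds, and declare it dead after `count`
    # failed probes (~60s total here, vs. the ~15-minute default).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```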

Assuming your laptop doesn't come back, to allow the other nodes to make
progress, we need to modify the Tahoe download code to switch to an alternate
source of shares when a callRemote takes too long. One heuristic might be to
keep track of how long it took to acquire the previous share, and if the next
share takes more than 150% as long, move that peer to the bottom of the list
and ask somebody else for it. To allow uploads to make progress, we'd want to
do something similar: if our remote_write call doesn't complete within 150%
of the time the previous one did (or 150% of the max time that the other
peers handled it), assume that this peer is dubious. Either we consider it
dead (and wind up with a slightly-unhealthy file), or we buffer the shares
that we wanted to send to them (and consume storage in the hopes that they'll
come back).
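
A minimal sketch of that 150% heuristic, with all names hypothetical (this
is not actual Tahoe code):

```python
class PeerLatencyTracker:
    # Hypothetical sketch: demote a peer when its latest share fetch
    # takes more than 150% as long as its previous one did.
    SLOWDOWN_FACTOR = 1.5

    def __init__(self, peers):
        self.peers = list(peers)   # peers in preference order
        self.last_duration = {}    # peer -> seconds taken by previous share

    def record(self, peer, duration):
        previous = self.last_duration.get(peer)
        self.last_duration[peer] = duration
        if previous is not None and duration > self.SLOWDOWN_FACTOR * previous:
            # suspiciously slow: move this peer to the bottom of the
            # list so the next share is requested from somebody else
            self.peers.remove(peer)
            self.peers.append(peer)
```

A real download loop would call record() after each fetch and pair it with a
hard timeout on the callRemote itself, so a peer that hangs entirely (rather
than merely slowing down) is also demoted.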

Now, when your laptop did come back, did you restart your tahoe node? If so,
the new node's connections should have displaced the ones from the old node,
and any uploads/downloads in progress should have seen immediate
connectionLost errors. For upload I think we handle this properly (we abandon
that share, resulting in a slightly unhealthy file, and if we don't achieve
shares_of_happiness then we declare the upload to have failed). For download
I think we explode pretty violently: an indefinite hang is a distinct
possibility. (At the very least we should flunk the download, but really we
should switch over to other peers as described above.) New operations (done
after your new node finished connecting) should have worked normally. If
not, perhaps our Introducer code isn't properly replacing the peer reference
when the reconnector fires the callback a second time.
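
For the Introducer case, the pattern needed looks roughly like this:
foolscap's connectTo() re-fires its callback with a fresh RemoteReference
after every reconnection, and the table entry must be overwritten each time.
The class and attribute names here are hypothetical, not the actual
Introducer code:

```python
class PeerTracker:
    def __init__(self, tub):
        self.tub = tub
        self.peer_table = {}  # peerid -> currently-live RemoteReference

    def watch(self, peerid, furl):
        # connectTo() calls _got_peer now, and again after each reconnect
        self.tub.connectTo(furl, self._got_peer, peerid)

    def _got_peer(self, rref, peerid):
        self.peer_table[peerid] = rref  # displace any stale reference
        rref.notifyOnDisconnect(self._lost_peer, peerid)

    def _lost_peer(self, peerid):
        self.peer_table.pop(peerid, None)
```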

But, if you *didn't* restart your tahoe node when you reconnected, now we're
in a different state. The other nodes would have outstanding data trying to
get to your node, and TCP will retransmit that with an exponential backoff
(doubling the delay each time). If your machine was off-net for 4 minutes,
you could expect those nodes to not try again for a further 4 minutes. If
your node sent data of its own, that might trigger a fast retry, but maybe
not, and your node might not have needed to talk to them at that point. Once
a retry was attempted, I'd expect data to start flowing quickly and normal
operations to resume.
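
The doubling arithmetic is easy to see. This is illustrative only (real TCP
seeds the timer from the measured round-trip time and caps it), but it shows
why a few minutes off-net can delay the next retry by several more:

```python
# Retransmit schedule starting at 1s and doubling each time: the gap
# between attempts passes 4 minutes by roughly the ninth retry, so a
# peer that was unreachable for 4 minutes may not be probed again for
# about that long after it returns.
delay, elapsed = 1.0, 0.0
for attempt in range(1, 11):
    elapsed += delay
    print("retry %2d at t=%6.1fs (previous gap %5.1fs)" % (attempt, elapsed, delay))
    delay *= 2
```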

Any idea which case it was?

Author

I didn't restart my Tahoe node. Seb's tahoe node reconnected within a few minutes of my turning on my wireless card, but Josh's still hadn't after maybe 15 minutes.

Cc: josh (arch)

Author

I'm ready to call this a known issue for v0.7.0. Bumping it to v0.7.1 Milestone.

zooko added this to the undecided milestone 2008-01-23 02:43:48 +00:00
Author

See also #193, #287, and #521.

Author

I haven't tried to reproduce this problem or further diagnose it in two years, and much has changed since then. I'm going to presumptively close it as 'wontfix'. A future rewrite of the download logic may also fix it if it isn't already fixed -- see [comment:62885](/tahoe-lafs/trac-2024-07-25/issues/193#issuecomment-62885).

zooko added the wontfix label 2009-12-12 04:33:43 +00:00
zooko closed this issue 2009-12-12 04:33:43 +00:00