connection lost during "tahoe backup" #782

Open
opened 2009-08-06 15:48:10 +00:00 by zooko · 3 comments

Andrej Falout [reported](http://allmydata.org/pipermail/tahoe-dev/2009-August/002531.html) this to tahoe-dev. Andrej: could you please look for "incident report files" which were created around the time of the problem, in your `$TAHOEBASEDIR/logs/incidents` directory. If there is an incident report file created about the same time as the (first) failure you encounter, please attach it to this ticket. Thanks!
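For anyone doing this check by hand, here is a rough sketch of how the incident directory could be filtered by time. It is only an illustration: `incidents_near` and its parameters are made-up names, not part of Tahoe, and it assumes incident reports are ordinary files whose modification time is close to the time of the failure.

```python
import os
import time
from pathlib import Path


def incidents_near(incident_dir, when, window_seconds=3600):
    """Return files in incident_dir modified within window_seconds of `when`.

    `when` is a Unix timestamp (seconds since the epoch). This helper is
    hypothetical and only inspects file modification times.
    """
    directory = Path(incident_dir)
    if not directory.is_dir():
        # No incidents directory means there is nothing to report.
        return []
    matches = [
        p for p in directory.iterdir()
        if p.is_file() and abs(p.stat().st_mtime - when) <= window_seconds
    ]
    # Oldest first, so the first failure's report comes first.
    return sorted(matches, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Assumes TAHOEBASEDIR is set in the environment, as in the path above.
    base = os.path.expandvars("$TAHOEBASEDIR/logs/incidents")
    for path in incidents_near(base, time.time()):
        print(path)
```

Passing the approximate time of the first failure as `when`, instead of the current time, narrows the list to the reports worth attaching.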
zooko added the unknown, major, defect, 1.5.0 labels 2009-08-06 15:48:10 +00:00
zooko added this to the undecided milestone 2009-08-06 15:48:10 +00:00
Author

andrej: the allmydata.com servers have occasionally been full and rejecting new uploads. This may have caused your problem. Did you look for incident report files? Does this problem still occur? Thanks.

Author

andrej sent me this note in private email:

"The issue is caused in the large majority of cases by Tahoe's poor resistance to concurrent traffic; put simply, if I have a p2p client running with more than a few hundred open connections, Tahoe starts losing connections. I stop p2p, Tahoe immediately starts working again.

Please note that this is not a bad-router kind of issue; I tested it extensively while debugging another issue. Nor is it a saturated connection: there is plenty of headroom left, and no other network app I use exhibits this kind of sensitivity. It simply looks like Tahoe wants a response NOW, and if it does not get it NOW, it just gives up.

I'd suspect more tolerant timeouts plus connection retry handling would go a long way in fixing this."

afalout commented 2009-11-04 00:11:24 +00:00
Owner

In response to Zooko's comments:

"I don't see how your theory can fit with my mental model of the Tahoe-LAFS network code. Maybe if you turn on some extra logging and then stimulate it to fail and then post the logs then I can figure it out."

I can confirm without any uncertainty that running a P2P app with a large number of connections kills Tahoe. I even scripted this into my backup scripts so all P2P traffic is stopped when running Tahoe. Light P2P (5 files/500 connections or so) is OK but anything significantly over this is a killer.

Now whether this means something can or even should be changed in Tahoe is another matter entirely.

I would argue that for an application that is supposed to transfer a large amount of data over a long period of time, the ability to recover from any sort of network interruption is paramount.

I would even go so far as not to allow Tahoe to quit for this reason at all, instead preferring it to retry the action indefinitely, until it either completes the requested operation, or the user interrupts it.
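To make the proposal concrete, the retry behavior described above could look roughly like the sketch below. This is not Tahoe's code and does not use any Tahoe or Foolscap API; `retry_until_done`, its parameters, and the choice of `ConnectionError`/`TimeoutError` as the "transient" failures are all assumptions made for illustration.

```python
import time


def retry_until_done(operation, base_delay=1.0, max_delay=300.0, sleep=time.sleep):
    """Call operation() until it succeeds, waiting between failed attempts.

    Only ConnectionError and TimeoutError are treated as transient. Anything
    else, including KeyboardInterrupt (the user pressing Ctrl-C), propagates
    to the caller, so the loop ends on success or on user interruption.
    """
    delay = base_delay
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            # Wait, then double the delay up to max_delay so a long outage
            # does not turn into a tight reconnect loop.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The capped exponential backoff is a design choice of this sketch, not something proposed in the ticket; a fixed delay would satisfy the "retry indefinitely" request just as well.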

tahoe-lafs added code-network and removed unknown labels 2009-12-01 00:13:57 +00:00
Reference: tahoe-lafs/trac-2024-07-25#782