Timeout of Servermap Update #1138
I wrote to the mailing list but no one answered. I hope this is a better place for such a report.

I have a problem with updating directories. Sometimes a server drops out of the network for various reasons but still appears connected on the welcome page. When I try to access any directory, my node starts a servermap update, which can be a very long operation during which all work with directories hangs. It takes from 10 to 14 minutes, and I think that is unacceptable for such a system. Where can I set the timeout for this? I set up:
But this doesn't help. One of the servers suddenly lost its internet connection, and this is what I got when accessing one of the directories (there were 4 servers and only 3 requests succeeded):

Almost 15 minutes! That becomes critical if you mount a directory via sshfs+sftpd: any process that lists the mounted directory may get stuck, and the only options are to kill sshfs (you can't kill the process itself, since it's in the D+ state) or to wait (without knowing for how long). Please point me to the right timeout option for such requests! Something in the range of 10 to 30 seconds would be very nice.
Thank you for the bug report. I think you are right that a ticket is a better way to get answers about this kind of thing than a mailing list message. (Although sending a mailing list message first is often a good way to start.)
Our long-term plan to fix this problem is to make it so that uploads don't wait a long time for a response from a specific server but instead fail over to another server. That's #873 (upload: tolerate lost or unacceptably slow servers).
In the short term, I'm not sure why setting the foolscap timeouts did not cause the upload to complete (whether successfully or with a failure) more quickly. It is a mystery to me. Perhaps someone else could dig into the upload code and figure it out. One potentially productive way to do that would be to add more diagnostics to the status page, showing which requests to servers are currently outstanding and how long they have been outstanding.
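As a rough illustration of the kind of diagnostic described above (a sketch only; the class and method names are hypothetical and not part of the actual Tahoe-LAFS code), the status page could keep a table of outstanding server requests and their start times:

```python
import time

class OutstandingRequestTracker:
    """Hypothetical sketch: record which requests to servers are still
    outstanding and for how long, so a status page could display them."""

    def __init__(self):
        # (serverid, request_name) -> time the request was sent
        self._outstanding = {}

    def request_started(self, serverid, request_name):
        self._outstanding[(serverid, request_name)] = time.time()

    def request_finished(self, serverid, request_name):
        self._outstanding.pop((serverid, request_name), None)

    def report(self):
        """Return one line per outstanding request, oldest first."""
        now = time.time()
        items = sorted(self._outstanding.items(), key=lambda kv: kv[1])
        return "\n".join(
            "%s %s: outstanding for %.1fs" % (serverid, name, now - started)
            for (serverid, name), started in items
        )
```

With something like this, a 14-minute hang would show up immediately as one request to the lost server sitting outstanding for hundreds of seconds while the others completed.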
Replying to zooko:
From the description, the case at hand seems to be a mutable download. Will this be addressed by the New Downloader, or does that only handle immutable downloads?
The #798 new-downloader is only for immutable files, sorry.
The timeout.disconnect timer, due to the low-overhead way it is implemented, may take up to twice the configured value to finally sever a connection. So a value of 300 could take up to 10 minutes to disconnect the server connection. But it shouldn't have let the connection stay up for 14 minutes. Two ideas come to mind: the timeout.disconnect clause might have been in the wrong section (it should be in the [node] section), or there might have been other traffic on that connection that kept it alive (but not the response to the mutable read query). Neither seems likely: the only way I can imagine traffic keeping it alive is if the server were having weird out-of-memory or hardware errors and dropped one request while accepting others (we've seen things like this happen before, but it was on a server that had run out of memory).

It might help to collect some log information from your box after it does this. If you go to the front "Welcome" page, there's a button at the bottom that says "Report An Incident". Push that, and a few seconds later a new "flogfile" will appear in your BASEDIR/logs/incidents/ directory. Upload and attach that file here: it will contain a record of the important events that occurred up to the moment you hit the button. We're looking for information about any messages sent to the lost server. If there's something weird like an out-of-memory condition, it might show up in the logs.
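For reference, a sketch of where such timeout settings would live in tahoe.cfg, assuming the [node] section mentioned above (the 300-second disconnect value is the one discussed in this comment; the keepalive value is just an example, adjust to taste):

```ini
# BASEDIR/tahoe.cfg (sketch; only the relevant lines are shown)
[node]
# Send a foolscap keepalive if the connection has been idle this long (seconds).
timeout.keepalive = 120
# Drop the connection if it has been silent this long (seconds). As described
# above, actually severing the connection may take up to twice this value.
timeout.disconnect = 300
```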