two-hour delay to connect to a grid from Win32, if there are many storage servers unreachable #605
Reference: tahoe-lafs/trac-2024-07-25#605
Current trunk changeset:91a28da3aa2e98c9 takes something like two hours to connect to all of the testgrid storage servers. Two seconds would be more like it! Strangely, it spent a long time (tens of minutes) with a connection open to only one of the three tahoe server nodes running on each server. Eventually it connected to a second server node on each server, and finally to the third.
I'm trying to reproduce it with tahoe-1.2.0 on the same Windows laptop (zaulaw).
A-ha! That's interesting! The same behavior is occurring with tahoe-1.2.0. Here is the full version string:
My versions: allmydata: 1.2.0, foolscap: 0.3.2, pycryptopp: 0.5.1, zfec: 1.3.4, twisted: 8.1.0, nevow: 0.9.32, simplejson: 1.7.3, pyopenssl: 0.8, setuptools: 0.7a1
The next experiment is to try an older version of foolscap. Let's see... tahoe-1.2.0 shipped saying that it required foolscap >= 0.2.9: source:_auto_deps@20080722010403-92b7f-a74ec17a8a9cff0834ecc122ffa275280f563cea, so I'll try that.
That link should have been source:_auto_deps.py@20080722010403-92b7f-a74ec17a8a9cff0834ecc122ffa275280f563cea.
could there be some sort of firewall that's rate-limiting connections to a single IP address? or a NAT table which can only hold one entry per destination address?
Try using telnet or netcat to connect to something on the not-yet-connected servers while it's in this state, or point a browser on that box at the server's webapi ports.
Those are good ideas. The builtin Windows XP firewall is turned off. I can netcat to one of the servers that hasn't connected and the TCP connection is established, although I don't know how to type "hello foolscap begin your STARTTLS negotiation now", so all I can say is that the connection was established and then a while later was dropped.
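That kind of raw-TCP reachability check can also be scripted. Here is a small Python sketch that does what netcat's connect test does; it is demonstrated against a throwaway local listener rather than a real storage server, and none of it is Tahoe-LAFS code:

```python
import socket

def probe(host, port, timeout=5.0):
    """Attempt a plain TCP connect, like netcat; return True on success."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

# Demo against a throwaway local listener (a stand-in for a storage server).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

print(probe("127.0.0.1", port))   # True: the listener accepts the handshake
listener.close()
print(probe("127.0.0.1", port))   # False: connection refused once it is gone
```

Note that, as observed above, this only proves the TCP handshake completes; it says nothing about whether the foolscap/STARTTLS negotiation would succeed afterward.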
It certainly can't be the fault of the wireless router or the DSL modem on my LAN, because four other computers on the same LAN connect to all blockservers quickly, as expected.
Next step is to examine the flog tail while this is happening. If someone else wanted to try to reproduce it on their Windows machine (including allmydata.com's virtual Windows machine), that would be helpful.
Attachment flogtail.txt (450242 bytes) added
Here is the output from flogtool tail while the client is running and connecting to several servers, including: but not:
Here is an excerpt from this flogtail file:
That's the complete output of that grep command at this time.
Attached is the current flogtail.txt file.
So it looks like there's something funny about foolscap's attempts to connect to xikt but not lwkv. What's the next step? Perhaps it is to attach flog gatherers to xikt as well as to this client? Also, of course, I would like someone to try to reproduce this problem elsewhere. To reproduce it:
Of course, even better would be automated tests of this functionality! We currently have automated tests of this functionality, but they are run by buildbot only on a Linux system. Perhaps we could run them on a Windows buildslave as well.
Okay, at 20:52:54 it connected:
ndurner on IRC said that Windows XP SP 2 limits the number of concurrent TCP connections, and said that if this had happened then there would be a note in Start -> Control Panel -> Administrative Tools -> Event Viewer. Sure enough there it is:
"TCP/IP has reached the security limit imposed on the number of concurrent TCP connection attempts."
It offers to connect to a Microsoft server and fetch more information, and I can't find a URL for this information on the public web, so here it is:
Note that even though the production grid has more servers than the test grid, and even though this happened with tahoe-1.2.0 and foolscap-0.2.9 in my own tests, this doesn't seem to be happening with tahoe-prod-3.0.0 on the production grid.
Ah, here's why it doesn't affect the prod grid:
"""During normal operation, when programs are connecting to available hosts at valid IP addresses, no limit is imposed on the number of connections in the incomplete state. When the number of incomplete connections exceeds the limit, for example, as a result of programs connecting to IP addresses that are not valid, connection-rate limitations are invoked, and this event is logged."""
There are too many no-longer-reachable servers on the test grid.
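In principle a gateway could avoid tripping that limit by capping its own number of in-flight connection attempts. Tahoe-LAFS does not do this; the following is a hypothetical sketch in which a semaphore keeps at most nine attempts outstanding at once, safely under XP SP2's cap of ten half-open connections:

```python
import threading

# Windows XP SP2 caps concurrent *incomplete* (half-open) outbound TCP
# connection attempts at 10; stay one below it.
HALF_OPEN_LIMIT = 10

_attempt_gate = threading.BoundedSemaphore(HALF_OPEN_LIMIT - 1)

def connect_with_gate(connect_fn, server):
    """Run connect_fn(server), holding a slot only while the attempt is in flight."""
    with _attempt_gate:
        return connect_fn(server)

# Demo with a fake connector; real code would do the TCP connect inside it.
results = []
threads = [
    threading.Thread(target=lambda s=s: results.append(connect_with_gate(lambda x: x, s)))
    for s in range(25)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results) == list(range(25)))  # True: all 25 attempts completed
```

With many unreachable servers, each gated attempt would hold its slot until its own timeout expired, so this trades the two-hour OS-imposed stall for an explicit, bounded queue of attempts.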
Moving this out of the 1.3.0 Milestone.
Changing the name to indicate the severity: renamed from "delayed connection on Windows" to "two-hour delay to connect to a grid from Win32, if there are many storage servers unreachable".
I'm not sure whether this is actually fixable; it might be an unavoidable problem with Windows TCP/IP, but we should investigate it further.
Changes which reduce the number of attempts to connect to servers, such as #448 (download: speak to as few servers as possible), might help. If a Windows user starts up their Tahoe-LAFS gateway and then tries to download a single file, it might not start too many attempts to connect to unreachable servers and thus avoid this penalty from Windows. However, if the user tries to download many files at once, it would still incur this penalty (since each file would trigger attempts to connect to a different set of servers).
Another way we could fix, or at least work around, the problem is by implementing some sort of presence and routing protocol other than the current naïve one.
The current design of introduction (which I rather like for its simplicity) separates out three notions: announcement of a node's existence, routing information (currently: the IP addresses and port numbers you can use to reach the node), and presence notification (whether the node is currently on-line or off-line). In this design, announcement-of-existence and propagation-of-routing-information are both done by the introducer, and presence is determined directly between each node and each of its peers.
You could imagine an alternative design in which the Introducer (or hopefully instead a decentralized introduction scheme (#68, #295)) informed you "Here is a server, but it is currently off-line", or else didn't tell you about off-line servers at all. In that case, the storage client running on Windows would usually open connections only to currently-live servers, and avoid this penalty from Windows entirely.
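Such a presence-aware introducer would let the storage client filter announcements before ever opening a socket. A minimal sketch of that filtering, with made-up announcement records and field names (nothing here is real Tahoe-LAFS data or API):

```python
# Hypothetical announcements as the imagined introducer might publish them:
# each carries routing information plus a presence hint. The serverids echo
# the ones discussed above; the "online" field is the new, imagined part.
announcements = [
    {"serverid": "xikt", "furl": "pb://xikt@10.0.0.1:1234/storage", "online": False},
    {"serverid": "lwkv", "furl": "pb://lwkv@10.0.0.2:1234/storage", "online": True},
]

def servers_to_connect(announcements):
    """Open TCP connections only to servers the introducer believes are live."""
    return [a for a in announcements if a["online"]]

for ann in servers_to_connect(announcements):
    # A real client would start its TCP/foolscap connection here.
    print(ann["serverid"])  # prints only "lwkv"
```

The client never attempts the half-open connection to xikt at all, so the Windows limit is never approached, at the cost of trusting the introducer's (possibly stale) view of liveness.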
This isn't going to happen for v1.7.0, and also it is blocked on #448 because I want to see what effect #448 has on this issue.
If you like this ticket, you might also like the "Brian's New Downloader" bundle of tickets: #800 (improve alacrity by downloading only the part of the Merkle Tree that you need), #798 (improve random-access download to retrieve/decrypt less data), #809 (Measure how segment size affects upload/download speed.), #287 (download: tolerate lost or missing servers), and #448 (download: speak to as few servers as possible).
#448 is fixed. I looked in the code, and this issue (#605) is still going to be unchanged by the fix for #448, because the storage client starts attempting to connect to each storage server as soon as it learns about the server: source:trunk/src/allmydata/storage_client.py@4131#L102. One way to improve the client's behavior with regard to this ticket would be to invoke that call to start_connection() lazily, only when the client actually wants to send a message to that particular server. However, this would harm alacrity: it would move TCP and foolscap handshaking from introduction time to request time, and thus increase the delay between making the first request to each server and receiving the response. I don't think that is a good trade-off.
That means this ticket is going to stay open until we implement something more along the lines of comment:69317: a way for clients to get hints that certain servers are currently off-line, without attempting to open a TCP connection to those servers. Oh, also, something that might help with this would be μTP (#1179).
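For concreteness, the lazy approach rejected above could look something like this sketch; the class, the connector, and the request interface are all hypothetical stand-ins, not actual Tahoe-LAFS code:

```python
class LazyServerConnection:
    """Defer TCP/foolscap handshaking until the first request to a server.

    This avoids the Windows half-open-connection penalty for servers that
    are never actually used, at the cost of first-request alacrity (the
    trade-off the comment above argues against).
    """

    def __init__(self, serverid, connector):
        self.serverid = serverid
        self._connector = connector   # performs the TCP + foolscap handshake
        self._remote = None

    def send_request(self, request):
        if self._remote is None:      # first request pays the handshake cost
            self._remote = self._connector(self.serverid)
        return self._remote(request)

# Demo with a fake connector that records each handshake it performs.
handshakes = []
def fake_connector(serverid):
    handshakes.append(serverid)
    return lambda req: (serverid, req)

conn = LazyServerConnection("xikt", fake_connector)
print(handshakes)                 # []: no connection made at creation time
print(conn.send_request("get"))   # ('xikt', 'get')
print(conn.send_request("put"))   # ('xikt', 'put')
print(handshakes)                 # ['xikt']: the handshake happened exactly once
```

The sketch makes the cost visible: creating the object is free, but the first send_request blocks on the full handshake, which is exactly the alacrity hit described above.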