intermittent "Address Already In Use" error during tests #2787
Reference: tahoe-lafs/trac-2024-07-25#2787
I'm seeing occasional errors during tests like this:

I'm still tracing this down, but it looks like `allocate_tcp_port()` in iputil.py (which I wrote for Foolscap and copied over a few months ago) is sometimes giving us port numbers that are actually already in use. Those ports are coming from the kernel (we do a bind(port=0) and then ask what port got allocated).

One problem that I know about is that we're binding the test port to 127.0.0.1, and using SO_REUSEADDR, and the combination of those two might make the kernel think it's OK to give us a port that's already bound to something other than 127.0.0.1. But in some tests, replacing that with 0.0.0.0 didn't help: I was still given ports that were already in use.
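The allocation technique being described can be sketched in a few lines. This is an illustration of the bind(port=0) idea, not the actual iputil.py code:

```python
import socket

def allocate_tcp_port():
    """Ask the kernel for a currently-unused TCP port number.

    A minimal sketch of the bind(port=0) technique described above.
    Note the race: the socket is closed before the caller ever binds
    the returned port, so anything else on the host can grab the
    number in between.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # SO_REUSEADDR is what can let the kernel hand back a port
        # that is in TIME_WAIT or bound to a different specific address.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))    # port 0: let the kernel pick
        return s.getsockname()[1]   # ask which port we were given
    finally:
        s.close()
```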
I have to experiment some more to figure out what's going on. I think in the long run, `allocate_tcp_port()` might need to actually try to listen on the port, and if that fails, grab a different one.

It's not possible to fix this inside `allocate_tcp_port` itself. So I'm planning to close this ticket. Instead, we'll have a ticket for each test which can fail this way, and they'll have to be fixed one by one.

The reason we cannot fix this inside `allocate_tcp_port` is that the approach it is a component of suffers from an unavoidable race condition. `allocate_tcp_port` tries to figure out a specific TCP port number which will not be in use at a later point in time. Since there is no part of the system which allows the port number to be reserved or otherwise kept out of use except by the one piece of code we intend to use it, it cannot actually know whether any port number it selects will satisfy this requirement.

In practice, it does succeed with high probability. However, due to the large number of cases in which it is used (many times per test suite run, and the test suite itself is run many times), even this high probability of success is not good enough. I will make an incredibly naive estimate that there are 2^15^ ports available for "random" assignment and that the chance of an unrelated intermediate assignment being made is about 1 in 2 (I suspect some tests themselves trigger an unrelated intermediate port assignment). The chance of a collision is therefore 1 in 2^16^ (around a thousandth of a percent). If there are 100 users of `allocate_tcp_port` in the test suite, then the chance of a collision anywhere in the test suite is 100 in 2^16^. There are about 15 different CI runners of the test suite, so the chance of a failure on any of them for one build set is 15 * 100 in 2^16^. The test suite is run for every pull request and every master revision. If there is one PR merged a day, the chance of a failure in a week is at least 14 * 15 * 100 in 2^16^, which works out to around 32%. Quite easily high enough to be disruptive to development.

There are several possible general fixes for this issue.
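The back-of-the-envelope arithmetic above can be checked directly (these are the ticket's own naive inputs, not measurements; the final step treats the expected failure count as a probability, which is a fine approximation for small per-try probabilities):

```python
# Reproduce the collision estimate from the text.
p_collision = 1 / 2**16           # one allocation colliding: 1 in 2^16
per_suite   = 100 * p_collision   # ~100 allocate_tcp_port users per run
per_build   = 15 * per_suite      # ~15 CI runners per build set
per_week    = 14 * per_build      # 2 suite triggers/day (PR + master) * 7 days

print(round(per_week, 2))         # prints 0.32, i.e. about 32% per week
```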
1. Add retry logic. If a test randomly allocates a port and then discovers it cannot bind that port, just try the whole process over again. A small number of retries should drive the failure rate down dramatically (the chance of success of each try should be independent; if the chance of failure of one try is a thousandth of a percent, the chance of failure of three tries is the cube of that, under a billionth of a percent). This solution is conceptually simple, but the implementation might not be. Detecting the failure (asynchronously, often across process boundaries) and backing up to a point where a retry may be made will probably take a lot of effort.
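A minimal sketch of the retry idea, assuming the failure can be detected synchronously at bind time (which, as noted above, it often cannot be in the real tests). `bind_with_retries` and its `allocate` parameter are hypothetical glue, not the actual Tahoe-LAFS fix:

```python
import errno
import socket

def bind_with_retries(allocate, tries=3):
    """Allocate a port number and bind it, retrying on EADDRINUSE.

    `allocate` is any callable returning a candidate port number,
    e.g. an allocate_tcp_port-style helper.  Returns the bound socket
    so the caller never has to re-bind (and re-race) the port.
    """
    for attempt in range(tries):
        port = allocate()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return s  # caller owns the bound socket
        except OSError as e:
            s.close()
            if e.errno != errno.EADDRINUSE or attempt == tries - 1:
                raise
```

Each retry's success is roughly independent, which is what cubes the failure probability for three tries.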
2. Switch to pre-allocated sockets. Note that `allocate_tcp_port` is really trying to allocate a TCP port number. If it allocated a bound TCP socket (perhaps marked as listening) and this socket were handed to application code, there would be no possibility of a collision in the application code, because there is no longer any need to bind there. There is still the possibility of a collision inside the allocation function, but it is much reduced compared to the current situation, and it is much more amenable to the addition of retry logic. The most likely downside to this approach is lack of support for the underlying operation on Windows.

3. Switch to UNIX sockets. It's much easier to avoid collisions with UNIX sockets. When using TCP, we are working with only 2^15^ possible values, they are assigned roughly randomly, and we compete with all other users of the system for them. When using UNIX sockets, we have at least 255^108^ possible values, we can allocate them with structure that inherently avoids self-collision, and we need not compete with anyone else on the system. However, UNIX sockets are not necessarily compatible with all of the components which need to accept connections (for example, their "socket name" necessarily differs from TCP/IPv4; and being inherently private, there is less support in tools like HTTP clients for accessing them).
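The pre-allocated-socket approach (option 2) could look roughly like this. The handoff mechanism (here, just passing the socket object within one process) is the part that would need real design work:

```python
import socket

def allocate_listening_socket():
    """Return a socket already bound and listening on a kernel-chosen port.

    Because the socket stays open, nothing else can be given the same
    port; the application accepts on this socket instead of re-binding
    a bare port number, eliminating the race in application code.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen(5)
    return s

listener = allocate_listening_socket()
print("listening on port", listener.getsockname()[1])
listener.close()
```

For Twisted-based code like Tahoe-LAFS, `reactor.adoptStreamPort` is one existing mechanism for handing an already-bound, listening file descriptor to application code; whether it fits here (and on Windows) is exactly the open question.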
4. Reverse the allocation relationship. Let the application code randomly allocate a port number. Arrange for the test code to somehow learn of the allocated value. As with option (2), this dramatically reduces the possibility of a collision and makes it significantly easier to add retry logic at the point where the collision may occur. In contrast to (2), it may require implementation of this allocation and retry logic at multiple code sites. There is also the matter of conveying the allocated port number back to the test code, which probably also requires several different implementations.
Considering all of these, (2) is my preference. However, there is the matter of Windows support to contend with in that case.
5. Create a private network namespace for the test suite. This removes the possibility of a port collision involving unrelated activities on the same host. It does not, however, remove the possibility of a port collision of Tahoe-LAFS code with other Tahoe-LAFS code. Network namespaces are highly platform-specific, and this would likely involve three or more implementations of the same idea. Also, creating network namespaces likely requires elevated privileges, imposing a practical barrier to deployment.
6. Avoid binding to INADDR_ANY. Instead, bind to a specific interface. This avoids collisions with other ports bound to different specific interfaces. It doesn't avoid collisions with other ports bound to INADDR_ANY, and since most collisions are probably with INADDR_ANY-bound sockets, this probably doesn't help much.
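A quick way to see why option (6) helps less than it sounds: on typical Unix systems, a wildcard (INADDR_ANY) bind of a port conflicts with a specific-address bind of the same port when SO_REUSEADDR is not set, so specific binds still collide with wildcard users (exact semantics are platform-dependent; this sketch assumes Linux-like behavior):

```python
import errno
import socket

def wildcard_bind_conflicts():
    """Bind 127.0.0.1:<port>, then try to bind 0.0.0.0 on the same port.

    Returns True if the wildcard bind fails with EADDRINUSE, i.e. the
    specific binding did not keep us clear of wildcard-bound sockets.
    """
    specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        specific.bind(("127.0.0.1", 0))
        port = specific.getsockname()[1]
        try:
            wildcard.bind(("0.0.0.0", port))
            return False  # no conflict on this platform
        except OSError as e:
            return e.errno == errno.EADDRINUSE
    finally:
        wildcard.close()
        specific.close()

print(wildcard_bind_conflicts())
```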