Too many open files #2342
I'm working with a client company that says hundreds of thousands of people are being inconvenienced by their Tahoe-LAFS installation failing. Initial investigation turned up this in the twistd.log of their gateway node:
Here's a branch that tweaks the increase_rlimits() code and prints out what it does, for diagnostics: https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py
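For readers who want the gist without opening the branch: the general shape of such an rlimit-raising helper is sketched below. This is only an illustration of the approach (raise the soft RLIMIT_NOFILE toward the hard limit and log the before/after values), not the code on the linked branch; the fallback targets and the `log` callable are placeholders of mine.

```python
import resource

def increase_rlimits(log=print):
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    log("RLIMIT_NOFILE before: soft=%r hard=%r" % (soft, hard))
    # Try the hard limit first, then a few conservative fallback values.
    for target in (hard, 4096, 2048, 1024):
        if target == resource.RLIM_INFINITY:
            continue  # some kernels refuse an unlimited NOFILE value
        if soft != resource.RLIM_INFINITY and soft >= target:
            break  # already at or above this value
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
            break
        except (ValueError, OSError):
            continue  # the OS refused this value; try the next smaller one
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    log("RLIMIT_NOFILE after: soft=%r hard=%r" % (soft, hard))
```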
My consulting client (codenamed "WAG") reported this result on their RHEL server:
Also, this tweep ran it on CentOS and reported that, with the patch (https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py), the limit was raised to 4096:
https://twitter.com/brouhaha/status/538085487622107136
See also #1794, #812, and #1278. I currently believe the underlying problem has to do with bad handling of corrupted shares (per comment:73185).
Seems like a decent theory.. you might add a timed loop that counts/logs the number of allmydata.immutable.upload.Uploader instances, and/or upload.FileHandle + subclasses (specifically FileName). If an upload gets wedged and stops making any progress, it will hold a filehandle open forever, and eventually an open() will fail like that.

You might also lsof the client in question and see what filehandles it has open: if it's this problem, there'll be a lot of /tmp files in the list, recently opened by HTTP uploads but not yet pushed out to the grid.

Also, it might be appropriate to add a failsafe timer to Uploaders, something that fires once every minute or five minutes, checks to see if any progress has been made, and self-destructs if not. We don't like heuristics, but sometimes they're a good hedge against weird unpredictable things happening.
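A minimal sketch of the first suggestion above (a timed loop that counts and logs live Uploader/FileHandle objects), assuming it is wired into the gateway's running Twisted reactor. The gc-based counting, the `log` callable, and the five-minute interval are illustrative choices, not existing Tahoe-LAFS machinery; the failsafe/self-destruct timer would additionally need a real notion of "progress" from the Uploader, which isn't sketched here.

```python
import gc
from twisted.internet.task import LoopingCall
from allmydata.immutable import upload

def count_upload_objects(log=print):
    # Walk the live object graph and count uploader-related instances.
    uploaders = 0
    filehandles = 0
    for obj in gc.get_objects():
        if isinstance(obj, upload.Uploader):
            uploaders += 1
        elif isinstance(obj, upload.FileHandle):  # FileName is a subclass
            filehandles += 1
    log("live Uploader instances: %d, FileHandle instances: %d"
        % (uploaders, filehandles))

# From inside the running client (the reactor must already be running):
# LoopingCall(count_upload_objects).start(300)  # log every five minutes
```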