Too many open files #2342

Open
opened 2014-11-27 16:11:00 +00:00 by zooko · 5 comments

I'm working with a client company that says they have hundreds of thousands of people inconvenienced by their Tahoe-LAFS installation failing. Initial investigation reveals this in the twistd.log of their gateway node:

2014-11-20 13:33:36+0530 [HTTPChannel,143180,192.168.51.230] Unhandled Error
        Traceback (most recent call last):
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 88, in cal
lWithLogger
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 73, in cal
lWithContext
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 118, i
n callWithContext
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 81, in
 callWithContext
            
        --- <exception caught here> ---
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/posixbase.py", line 61
4, in _doReadOrWrite
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 215, in 
doRead
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 221, in 
_dataReceived
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/protocols/basic.py", line 581, 
in dataReceived
            
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 1609, in lin
eReceived
            
HeadersReceived

          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 694, in gotLength

          File "/usr/lib64/python2.6/tempfile.py", line 475, in TemporaryFile

          File "/usr/lib64/python2.6/tempfile.py", line 228, in _mkstemp_inner

        exceptions.OSError: [Errno 24] Too many open files: '/GFS2_tahoe/.tahoe-filestore1/tmp/tmpmKDgf6'

I'm working with a client company that says they have hundreds of thousands of people inconvenienced by their Tahoe-LAFS installation failing. Initial investigation reveals this in the twistd.log of their gateway node: ``` 2014-11-20 13:33:36+0530 [HTTPChannel,143180,192.168.51.230] Unhandled Error Traceback (most recent call last): File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 88, in cal lWithLogger File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/log.py", line 73, in cal lWithContext File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 118, i n callWithContext File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 81, in callWithContext --- <exception caught here> --- File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/posixbase.py", line 61 4, in _doReadOrWrite File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 215, in doRead File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/tcp.py", line 221, in _dataReceived File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/protocols/basic.py", line 581, in dataReceived File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 1609, in lin eReceived HeadersReceived File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/web/http.py", line 694, in gotLength File "/usr/lib64/python2.6/tempfile.py", line 475, in TemporaryFile File "/usr/lib64/python2.6/tempfile.py", line 228, in _mkstemp_inner exceptions.OSError: [Errno 24] Too many open files: '/GFS2_tahoe/.tahoe-filestore1/tmp/tmpmKDgf6' ```
zooko added the
code
normal
defect
1.10.0
labels 2014-11-27 16:11:00 +00:00
zooko added this to the undecided milestone 2014-11-27 16:11:00 +00:00
Author

Here's a branch to tweak the increase_rlimits() code and to printout what it does for diagnostics: https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py

Here's a branch to tweak the increase_rlimits() code and to printout what it does for diagnostics: <https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py>
Author

My consulting client (codenamed "WAG") reported this result on their RHEL server:

# python iputil.py v110
HELLO WORLD 000
HELLO WORLD 011
012 getrlimit(RLIMIT_NOFILE) before setrlimit: (1024, 4096)

Also this tweep ran it on CentOS and reported that with the patch (https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py) the limit was raised to 4096:

https://twitter.com/brouhaha/status/538085487622107136

CentOS 6.5: before setrlimit: (1024, 4096) ... after setrlimit(4096, 4096): (4096, 4096)
My consulting client (codenamed "WAG") reported this result on their RHEL server: ``` # python iputil.py v110 HELLO WORLD 000 HELLO WORLD 011 012 getrlimit(RLIMIT_NOFILE) before setrlimit: (1024, 4096) ``` Also this tweep ran it on CentOS and reported that with the patch (<https://github.com/zooko/tahoe-lafs/blob/2342-Too-many-open-files/src/allmydata/util/iputil.py>) the limit was raised to 4096: <https://twitter.com/brouhaha/status/538085487622107136> ``` CentOS 6.5: before setrlimit: (1024, 4096) ... after setrlimit(4096, 4096): (4096, 4096) ```
Author

See also #1794, #812, and #1278. I currently believe the underlying problem has to do with bad handling of corrupted shares (per comment:73185).

See also #1794, #812, and #1278. I currently believe the underlying problem has to do with bad handling of corrupted shares (per [comment:73185](/tahoe-lafs/trac-2024-07-25/issues/812#issuecomment-73185)).

Seems like a decent theory.. you might add a timed loop that counts/logs the number of allmydata.immutable.upload.Uploader instances, and/or upload.FileHandle+subclasses (specifically FileName). If an upload gets wedged and stops making any progress, it will hold a filehandle open forever, and eventually an open() will fail like that.

You might also lsof the client in question and see what filehandles it has open: if it's this problem, there'll be a lot of /tmp files in the list, recently opened by HTTP uploads but not yet pushed out to the grid.

Seems like a decent theory.. you might add a timed loop that counts/logs the number of `allmydata.immutable.upload.Uploader` instances, and/or `upload.FileHandle`+subclasses (specifically `FileName`). If an upload gets wedged and stops making any progress, it will hold a filehandle open forever, and eventually an `open()` will fail like that. You might also `lsof` the client in question and see what filehandles it has open: if it's this problem, there'll be a lot of `/tmp` files in the list, recently opened by HTTP uploads but not yet pushed out to the grid.

Also, it might be appropriate to add a failsafe timer to Uploaders, something that fires once every minute or five minutes, and checks to see if any progress has been made, and self-destructs if not. We don't like heuristics, but sometimes they're a good hedge against weird unpredictable things happening.

Also, it might be appropriate to add a failsafe timer to `Uploader`s, something that fires once every minute or five minutes, and checks to see if any progress has been made, and self-destructs if not. We don't like heuristics, but sometimes they're a good hedge against weird unpredictable things happening.
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#2342
No description provided.