leasedb: high file descriptor usage exposed by tests #2015
Running tests on the leasedb or cloud-merge branches may cause the trial process to run out of file descriptors. This happens more readily on platforms where the default fd limit is small (e.g. 256 on Mac OS X), or when using the synchronous = NORMAL or FULL options to sqlite for the leasedb (as opposed to OFF).

If there is a file descriptor leak, we should fix it or work around it; otherwise we should use [iputil.increase_rlimits()]source:src/allmydata/util/iputil.py to increase the number of fds available on platforms where the limits are too small (as we do in node processes).
See also #1870 for a possibly-related performance regression on the leasedb branches.
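For reference, the rlimit bump performed in node processes amounts to something like the following simplified sketch using the stdlib resource module (this is not the actual iputil.increase_rlimits() code, and the target value of 1024 is an assumption):

```python
# Simplified sketch (not the actual iputil.increase_rlimits()): raise the
# soft limit on open fds toward the hard limit on platforms where the
# default is too small.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
desired = 1024  # hypothetical target; the real code chooses its own value
if hard != resource.RLIM_INFINITY:
    # The new soft limit may not exceed the hard limit.
    desired = min(desired, hard)
if desired > soft:
    resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
```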
I learned that to recreate the big performance regression, you need to apply this patch to current 1819-cloud-merge:
I used my fdleakfinder (https://github.com/zooko/fdleakfinder) on the full test suite, with optimized sqlite settings (journal_mode="WAL", synchronous="NORMAL", see below), and with the fd limit bumped up to the maximum that my Linux allows -- 2000.
fdleakfinder definitely reports that the sqlite files opened during test cases are left open instead of being closed at the end of each test case. It is reporting the "high water mark", which is a list of all the fds that were open at one particular time. See the attached output from fdleakfinder.
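Those sqlite settings can be applied as PRAGMAs right after opening the database; a minimal illustration (the filename is assumed, and the leasedb code may apply them differently):

```python
# Minimal illustration of the sqlite settings mentioned above.
import sqlite3

db = sqlite3.connect("leasedb.sqlite")  # hypothetical filename
db.execute("PRAGMA journal_mode = WAL")
db.execute("PRAGMA synchronous = NORMAL")
```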
Attachment fdleakfinder.out.txt.bz2 (26009 bytes) added
Replying to zooko:
That's understandable, because there is no code to close them! (This shouldn't matter for a node which only has one DB connection open; it only matters for tests.)
Replying to daira (comment:2):
I would have assumed that the fds would be closed automatically when the Python objects were garbage-collected.
markberger: it sounds like there might be a bug in the Python sqlite module such that it leaks fds. Or at least a misunderstanding on my part. You could investigate this hypothesis by writing a program that opens a million sqlite databases in a row while losing its reference to the previous one. Something like this (untested, written by cutting and pasting a bit of code from dbutil.py):
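A minimal stand-in for such a probe, using the stdlib sqlite3 module directly rather than dbutil.get_db (so the exact get_db signature isn't assumed), might look like this:

```python
# Hypothetical leak probe: open many sqlite databases one after another,
# dropping the reference to each connection before opening the next. If fds
# are not released when the connection objects are garbage-collected, this
# eventually fails with EMFILE ("Too many open files").
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

for i in range(1000000):
    dbfile = os.path.join(workdir, "probe-%d.sqlite" % i)
    db = sqlite3.connect(dbfile)
    db.execute("CREATE TABLE version (version INTEGER)")
    db.commit()
    # Deliberately do NOT call db.close(); just drop the reference.
    db = None
```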
If that script runs out of fds, then indeed the underlying database files are not getting closed when the accompanying Python objects get garbage collected. (Or the Python objects are not getting garbage-collected in a timely way, due to reference cycles preventing their de-allocation by reference-counting…)
I called dbutil.get_db 4000 times like in the above example and I did not receive any fd errors.
markberger: interesting! Did it create 4000 files in the local filesystem? Could you cause it to hang (for example with time.sleep(1000)) and see how many fds it has open with the lsof command-line tool?
I guess I'll run your script under strace (on my linux system) and use my fdleakfinder script on it and see what I learn…
With my script the total number of fds appears to remain constant during execution when I call sudo lsof | wc -l from another terminal. This isn't the case when running the test suite (as your fdleakfinder has shown).
Replying to markberger:
Okay, markberger. Good job disproving the hypothesis that there is a leak in the Python sqlite module, or even in our allmydata/util/dbutil.py! Now what's the next step to find the problem? The fdleakfinder outputs show which specific files are still being held open after tests end…
By the way, here is the output from fdleakfinder on markberger's script that opened 4000 dbs one after another:
I'm working on this right now. FYI here's the patch I'm testing.
Attachment patch.txt (1675 bytes) added
Here is a patch for your enjoyment:
https://github.com/zooko/tahoe-lafs/tree/2015-high-fd-usage
However, it is not yet suitable for merge to trunk since it lacks a unit test.
The patch also causes the following tests to fail:
On this branch (https://github.com/markberger/tahoe-lafs/tree/2015-high-fd-usage) there is a patch by daira which fixes the problem that caused the tests to fail. I've also added a unit test for this ticket which ensures that the database connection and the cursor are closed when the service stops.
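The shape of such a test, sketched against a stand-in service rather than the real leasedb classes (all names here are assumptions, not the code on the branch):

```python
# Illustrative sketch only: a tiny service that owns a sqlite connection,
# plus a trial test asserting the connection and cursor are unusable after
# stopService().
import sqlite3

from twisted.application import service
from twisted.trial import unittest


class _DBService(service.Service):
    """Hypothetical stand-in for a leasedb-style service."""
    def __init__(self, dbfile):
        self._dbfile = dbfile
        self._db = None
        self._cursor = None

    def startService(self):
        service.Service.startService(self)
        self._db = sqlite3.connect(self._dbfile)
        self._cursor = self._db.cursor()

    def stopService(self):
        # Close the cursor and connection so their fds are released promptly.
        self._cursor.close()
        self._db.close()
        return service.Service.stopService(self)


class CloseOnStop(unittest.TestCase):
    def test_connection_closed_on_stop(self):
        svc = _DBService(self.mktemp())
        svc.startService()
        svc.stopService()
        # A closed sqlite connection/cursor raises ProgrammingError on use.
        self.assertRaises(sqlite3.ProgrammingError,
                          svc._db.execute, "SELECT 1")
        self.assertRaises(sqlite3.ProgrammingError,
                          svc._cursor.execute, "SELECT 1")
```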
Good job, Mark_B! I hope someone reviews this soon… I personally am going to focus on imminent commercial launch at https://LeastAuthority.com instead, for the next week or so.
Are you going to move on to… something else, now? Oh, I know, go back to fixing the regression in new-server-selector (#1382) about contacting too many servers?
Reviewed with comments on github. Please squash the commits from and including zooko's into a single commit. (Ask on irc if you don't know how.)
Zooko: tut tut, no chitchat unrelated to the ticket in ticket comments :-)
The commits have been squashed and I opened a pull request against the LeastAuthority 1819-cloud-merge branch. The pull request is here: https://github.com/LeastAuthority/tahoe-lafs/pull/1
Fixed in e6da7e2e on the 1819-cloud-merge branch.