our automated memory measurements might be measuring the wrong thing #227
As visible in the memory usage graphs, pycryptopp increased the static memory footprint by about 6 MiB when we added it in early November (I think it was November 6, although the Performance page says November 9), and removing pycrypto on 2007-12-03 seems to have had almost no benefit in reducing memory footprint.
This reminds me of the weirdness about the 64-bit version using way more memory than we expected.
Hm. I think maybe we are erring by using "VmSize" (from /proc/*/status) as our proxy for memory usage. That number is the total size of the virtual address space requested by the process, if I understand correctly. So for example, mmap'ing a file adds the file's size to your VmSize, although it does not (by itself) use any memory.
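For concreteness, here is a minimal Linux-only sketch (not code from the Tahoe tree) of the effect: it reads VmSize and VmRSS from /proc/self/status, then mmaps a 64 MiB anonymous region. VmSize jumps immediately, but VmRSS only grows once the pages are actually touched.

```python
# Hedged illustration only: show that VmSize counts address space that has
# merely been mapped, while VmRSS only counts pages actually resident.
import mmap
import re

def read_status_kib(field):
    """Return the given Vm* field from /proc/self/status, in KiB."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(re.search(r"(\d+)", line).group(1))
    raise KeyError(field)

def show(label):
    print("%-8s VmSize=%d KiB  VmRSS=%d KiB"
          % (label, read_status_kib("VmSize"), read_status_kib("VmRSS")))

show("before")
m = mmap.mmap(-1, 64 * 1024 * 1024)   # map 64 MiB, but don't touch it yet
show("mapped")
m.write(b"\0" * len(m))               # touch every page; now RSS catches up
show("touched")
```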
Linux kernel hackers seem to be in universal agreement that it is a bad idea to use VmSize for anything:
http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html
http://lwn.net/Articles/230975/
But what's the alternative? We could read "smaps" and see if we can get a better metric out of that.
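As a rough illustration of what reading smaps could look like (the helper name is mine, and which per-mapping counters are present depends on the kernel version):

```python
# Sketch: sum the per-mapping counters from /proc/<pid>/smaps.
# Lines of interest look like "Rss:    1234 kB"; mapping header lines don't
# match the three-token "<Field>: <n> kB" shape and are skipped.
def smaps_totals(pid="self"):
    totals = {}
    with open("/proc/%s/smaps" % pid) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[2] == "kB":
                field = parts[0].rstrip(":")
                totals[field] = totals.get(field, 0) + int(parts[1])
    return totals  # e.g. {"Rss": ..., "Pss": ..., "Private_Dirty": ...} in KiB

totals = smaps_totals()
print("Rss=%s KiB  Pss=%s KiB" % (totals.get("Rss"), totals.get("Pss")))
```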
By the way, if anyone wants to investigate more closely the memory usage, the valgrind tool named massif has been rewritten so maybe it will work this time.
Changed title from "pycryptopp uses up 6 MB of memory (or at least it increases VmSize by 6 MB)" to "our automated memory measurements might be measuring the wrong thing".

Here is a way to test whether your memory measurement is giving you useful answers. Take a machine with little physical RAM -- I have one here with 500 MB -- turn off swap, and start more and more Tahoe clients, each one doing the "upload" operation, until eventually you get malloc failures or Linux OOM kills or whatever.
Now divide your physical RAM by the number of Tahoe clients that you were able to run without incurring memory problems. The result of that division is a reasonable approximation of the "memory requirements" of the current Tahoe client.
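For example (the client count below is made up purely for illustration):

```python
# Purely illustrative numbers: 500 MiB of physical RAM, swap off, and
# suppose the 10th simultaneously-uploading client is the one that dies.
ram_mib = 500
clients_that_fit = 9
print("~%.0f MiB per Tahoe client" % (ram_mib / float(clients_that_fit)))
```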
This sounds like fun -- I'll accept this ticket.
Ooh, and as Seb just reminded me, I can turn off overcommit first too, to make it more deterministic/analyzable.
Please see:
http://allmydata.org/pipermail/tahoe-dev/2008-January/000341.html
Zandr: how would you feel about turning off swap for tahoeslave-feisty and for zandr-64? I believe that turning off swap is necessary in order to get a useful measurement of memory. (Personally, I turn off swap on my Linux systems anyway.)
I think that turning off memory overcommit isn't strictly necessary for doing measurements, but it might help by showing memory exhaustion errors in a more deterministic way than the Linux OOM killer.
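If the measurement harness wanted to notice when a host isn't configured the way we expect, a small check along these lines could read the current swap and overcommit state (the helper names are hypothetical; actually changing the settings still needs root and the usual tools, i.e. swapoff and sysctl):

```python
# Sketch: read (not change) the host's swap and overcommit configuration,
# so measurement runs on differently-configured machines can be flagged.
def swap_total_kib():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("SwapTotal:"):
                return int(line.split()[1])
    return None

def overcommit_mode():
    # 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
    with open("/proc/sys/vm/overcommit_memory") as f:
        return int(f.read().strip())

print("swap total: %s KiB, overcommit mode: %s"
      % (swap_total_kib(), overcommit_mode()))
```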
Adding Cc: zandr
By the way, I don't think I succeeded at boiling down the results of my research for the consumption of others. Here's the boiled-down version: measuring the vsize as we do on our Performance page gives a number much higher than what we actually want to know, and it changes even when the thing that we care about hasn't changed, so it is either useless or only barely useful. Measuring the resident set size would give something probably smaller or possibly larger than the thing we want to know, and it too would change randomly when the thing we care about hasn't changed. The two of them put together and then eyeballed might give you insight, or might just mislead you.
The idea that I had and wrote up in this ticket (above) was a third option: turn off swap and measure resident. That gives you a number that is probably pretty close to what you care about, if what you care about is something like "How much RAM do I need in my machine to run one Tahoe node without it needing to swap?" (If you have a different idea of what you want to know then by all means speak up.)
Anyway, that's all my attempt to restate the history of this ticket and explain why you shouldn't pay much if any attention to the numbers on the Performance page. The new news is that Matt Mackall has been working on this problem and has a new tool that can help (on Linux):
http://lwn.net/SubscriberLink/329458/d28c2d45a663045a
If you love this ticket (#227), then you might like tickets #54 (port memory usage tests to windows), #419 (pycryptopp uses up too much RAM), #478 (add memory-usage to stats-provider numbers), and #97 (reducing memory footprint in share reception).
Here's the permanent URL for that LWN.net article about "smem", the tool Matt Mackall invented that provides measurements of memory usage that are actually useful: http://lwn.net/Articles/329458/
I took a quick look at smem today, seems pretty nice. I think the "USS" (Unique Set Size) might be a good thing to track: it's the amount of memory you'd get back by killing the process. For Tahoe, the main thing we care about is that the client process isn't leaking or over-allocating the memory used to hold files during the upload/download process, and that memory isn't going to be shared with any other process. So even if it doesn't answer the "can I fit this tahoe node/workload on my NN-MB computer", it does answer the question of whether we're meeting our memory-complexity design goals.
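If we'd rather not depend on the smem package itself, the same USS number can be computed directly from smaps: it is the sum of Private_Clean and Private_Dirty over all mappings, i.e. the memory that would actually come back if the process were killed. A hedged sketch (function name is mine):

```python
# Sketch of the USS ("Unique Set Size") calculation, straight from smaps.
def uss_kib(pid="self"):
    uss = 0
    with open("/proc/%s/smaps" % pid) as f:
        for line in f:
            if line.startswith(("Private_Clean:", "Private_Dirty:")):
                uss += int(line.split()[1])
    return uss

print("USS: %d KiB" % uss_kib())
```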
Installing smem requires a bunch of other stuff (python-gtk2, python-tk, matplotlib), since it has a graphical mode that we don't care about, but that's not a big deal. There's a process-filter option which I can't find documentation for, which we'd need in order to limit the output to the Tahoe client's own PID. And then the main downside I can think of is that you have to shell out to a not-small Python program for each sample (vs. reading /proc/self/status, which is basically free), so somebody might be worried about the performance impact.

This ticket is about operational visibility for operations that are no longer operational.
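For comparison, the "basically free" path would look something like this sketch (illustrative names only, not anything in the Tahoe tree): sample VmRSS straight from /proc/self/status and time how cheap that is.

```python
# Sketch: sample resident-set size from /proc/self/status on a timer,
# as the low-overhead alternative to shelling out to smem per sample.
import time

def sample_rss_kib():
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

start = time.time()
for _ in range(1000):
    sample_rss_kib()
print("1000 samples in %.3f s" % (time.time() - start))
```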
requires a bunch of other stuff (python-gtk2, python-tk, matplotlib), since it has a graphical mode that we don't care about, but that's not a big deal. There's a process-filter thing which I can't find documentation on, which we'd need to limit the output to the tahoe client's own PID. And then the main downside I can think of is that you have to shell out to a not-small python program for each sample (vs reading /proc/self/status, which is basically free), so somebody might be worried about the performance impact.This ticket is about operational visibility for operations that are no longer operational.