web gateway memory grows without bound under load #891

Open
opened 2010-01-10 06:16:08 +00:00 by zooko · 6 comments

I watched as two allmydata.com web gateways slowly grew to multiple GB of RAM while consuming max CPU. I kept watching until their behavior killed my ssh session. Fortunately I left a `flogtool tail` running, so we got to capture one gateway's final minutes. It looks to me like a client is able to initiate jobs faster than the web gateway can complete them, and the client kept this up at a steady rate until the web gateway died.
zooko added the code-frontend-web, critical, defect, 1.5.0 labels 2010-01-10 06:16:08 +00:00
zooko added this to the undecided milestone 2010-01-10 06:16:08 +00:00
warner was assigned by zooko 2010-01-10 06:16:08 +00:00
Author

**Attachment** dump.flog.bz2 (86911 bytes) added "flogtool tail --save-as=dump.flog" of the final minutes of the web gateway's life
Author

**Attachment** dump-2.flog.bz2 (32391 bytes) added Another "flogtool tail --save-as=dump-2.log" run which *overlaps* with the previous one (named dump.log) but which has different contents...
Author

So while I was running `flogtool tail --save-as=dump.flog` I started a *second* tail, like this: `flogtool tail --save-as=dump-2.flog`. Here is the result of that second tail, which confusingly doesn't seem to be a contiguous subset of the first, although maybe I'm just reading it wrong.
tahoe-lafs modified the milestone from undecided to 1.7.0 2010-02-27 09:07:13 +00:00
tahoe-lafs modified the milestone from 1.7.0 to soon 2010-06-16 03:58:49 +00:00

Incidentally, the best way to grab logs from a doomed system like this is to get the target node's logport.furl (from BASEDIR/private/logport.furl), and then run the `flogtool tail` command from another computer altogether. That way the flogtool command isn't competing with the doomed process for memory. You might have done it this way; it's not immediately obvious to me.

I'll take a look at the logs as soon as I can.
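(For reference: assuming the standard foolscap CLI, that would look like copying BASEDIR/private/logport.furl to a second machine and running `flogtool tail --save-as=dump.flog logport.furl` there; if I remember right, `flogtool tail` accepts either a FURL or the name of a file containing one.)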
Author

No, I ran `flogtool tail` on the same system. If I recall correctly the system had enough memory available -- it was just that the python process was approaching its 3 GB limit (a per-process VM limit, though I forget why it exists).

Hm, assuming we can reproduce this after two years, and assuming there's no bug causing pathological memory leaks, what would be the best sort of fix? We could impose an arbitrary limit on the number of parallel operations that the gateway is willing to perform. Or (on some OSes) have it monitor its own memory usage and refuse new operations when the footprint grows above a certain threshold. Both seem a bit unclean, but might be practical.
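For concreteness, here is a rough sketch of what those two options might look like in a Twisted-based process. The names (`MAX_PARALLEL`, `MAX_RSS_BYTES`, `run_operation`) are made up for illustration and are not the actual web-gateway code:

```python
# Illustrative sketch only: MAX_PARALLEL, MAX_RSS_BYTES, and run_operation
# are invented names, not the real tahoe web-gateway API.
import resource
from twisted.internet import defer

MAX_PARALLEL = 50              # cap on concurrent upload/download jobs
MAX_RSS_BYTES = 2 * 1024 ** 3  # refuse new work above ~2 GiB resident

_slots = defer.DeferredSemaphore(MAX_PARALLEL)

def _resident_bytes():
    # Linux-only: the second field of /proc/self/statm is resident pages,
    # which is why this option only works "on some OSes".
    with open("/proc/self/statm") as f:
        pages = int(f.read().split()[1])
    return pages * resource.getpagesize()

def run_operation(operation, *args, **kwargs):
    # Refuse outright if the process is already too big...
    if _resident_bytes() >= MAX_RSS_BYTES:
        return defer.fail(RuntimeError("gateway overloaded, try again later"))
    # ...otherwise queue behind at most MAX_PARALLEL in-flight operations,
    # so a fast client can no longer pile up unbounded per-request state.
    return _slots.run(operation, *args, **kwargs)
```

At the web layer, the refusal would presumably surface to the client as a 503 so it knows to back off and retry.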

Reference: tahoe-lafs/trac-2024-07-25#891