Make operation-handle-querying use only a little memory #857
The documentation on operation handles starting at source:docs/frontends/webapi.txt@4112#L203 says:
hm, interesting. I have no idea how to improve this. There are two sources of memory usage. The first is the underlying results list, to which a new record is appended for each file/directory that is traversed. This one grows over time, unrelated to the act of querying the operation.
The second appears when the operation is queried: the API specifies a JSON string that basically copies the underlying results list (converting some fields into a more JSON-representable format). The problem here is that the simplejson.dumps() call produces a single large string, probably built with StringIO, which will probably use (briefly) about twice the memory of the original results list: one copy for lots of little stringlets, a second copy for the merged result, then the first copy is released.

Maybe there's a way to use simplejson.dump instead (which takes a file-like object as a target for .write() calls) and glue it onto the HTTP response channel, as sketched below. simplejson is going to run synchronously anyway, so it won't save us from one copy of that string (living in the HTTP transport write buffer), but maybe it could save us from the temporary condition of having two copies of that string.

OTOH, maybe we should give up the convenience of doing slow deep-traversal operations within the node, and require the webapi client to do it, moving the buffering requirements out to their side. Or make an API for slow deep traversals that streams the results, but pauses the operation if/when the HTTP channel is lost, to avoid the need to store unclaimed results. Or an API that requests results in chunks and explicitly releases earlier chunks (see the second sketch below), so that the node could discard old results that the client has safely retrieved.