t=deep-size needs rate-limiting #384
The webapi "?t=deep-size" feature (as well as the t=manifest feature from which it is derived) needs to be rate-limited. I saw the prodnet webapi machine fire off about 300 directory retrievals in a single tick, which is enough of a load spike to stall the node for a few dozen seconds.
It might be useful to rebuild something like the old slowjob, but in a form that's easier to use this time around. Maybe an object which accepts a (callable, args, kwargs) tuple, and returns a Deferred that fires with the results. The call is not invoked until later, however, and the object has a limit on the number of simultaneous requests that will be outstanding, or perhaps a maximum rate at which requests will be released.
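A minimal sketch of the kind of object described above, assuming Twisted Deferreds; the names (ConcurrencyLimiter, add) are illustrative and not necessarily the actual Tahoe-LAFS API:

```python
from twisted.internet import defer

class ConcurrencyLimiter:
    """Accepts (callable, args, kwargs) submissions and returns a Deferred
    that fires with the result. At most 'limit' calls are outstanding at
    the same time; the rest wait in a FIFO queue."""

    def __init__(self, limit=10):
        self.limit = limit
        self.active = 0
        self.pending = []  # queued (result Deferred, callable, args, kwargs)

    def add(self, cb, *args, **kwargs):
        d = defer.Deferred()
        self.pending.append((d, cb, args, kwargs))
        self._maybe_start_one()
        return d

    def _maybe_start_one(self):
        if self.pending and self.active < self.limit:
            d, cb, args, kwargs = self.pending.pop(0)
            self.active += 1
            d2 = defer.maybeDeferred(cb, *args, **kwargs)
            d2.addBoth(self._one_done)   # release the slot first
            d2.chainDeferred(d)          # then hand the result to the caller

    def _one_done(self, res):
        self.active -= 1
        self._maybe_start_one()
        return res
```

For comparison, Twisted's built-in defer.DeferredSemaphore.run() provides similar bounded-concurrency behaviour.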
Mike says that he saw similar problems on the Windows client, before changing it to offload the t=deep-size queries to the prodnet webapi server. The trouble is, that machine gets overloaded by it too. So managing the parallelism would help both issues.
He saw a request use 50% of the local CPU for about 60 seconds. The same deep-size request took about four minutes when using a remote server, if I'm understanding his message correctly.
One important point to take away is that deep-size should not be called on every modification: we should really be caching the size of the filesystem and applying deltas as we add and remove files, then only doing a full deep-size every once in a while (maybe once a day) to correct the value.
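As an illustration of that caching idea (not actual Tahoe-LAFS code, and simplified to synchronous calls), the client could keep a running total, adjust it on each add/remove, and only occasionally reconcile against a real deep-size traversal; compute_deep_size here is a hypothetical callable standing in for the expensive full pass:

```python
import time

class CachedDeepSize:
    """Illustrative sketch: keep a running total of the filesystem size,
    apply deltas on add/remove, and only occasionally correct it with a
    real (expensive) deep-size traversal."""

    RECONCILE_INTERVAL = 24 * 60 * 60  # roughly once a day

    def __init__(self, compute_deep_size):
        # compute_deep_size does the full t=deep-size traversal and
        # returns the total size in bytes (hypothetical helper)
        self._compute_deep_size = compute_deep_size
        self._cached = compute_deep_size()
        self._last_full = time.time()

    def file_added(self, size):
        self._cached += size

    def file_removed(self, size):
        self._cached -= size

    def deep_size(self):
        # serve the cheap cached value, recomputing only when the cache
        # is old enough to have drifted
        if time.time() - self._last_full > self.RECONCILE_INTERVAL:
            self._cached = self._compute_deep_size()
            self._last_full = time.time()
        return self._cached
```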
I implemented this, in changeset:3cb361e233054121. I did some experiments to decide upon a reasonable value for the default limit, and settled upon allowing 10 simultaneous requests per call to deep-size.
From my desktop machine (fluxx, Athlon 64 3500+ in 32bit mode), which has a pretty fast pipe to the storage servers in our colo, t=deep-size on a rather large directory tree (~1700 directories, including one that has at least 300 children) takes:
The same test done from a machine in colo (tahoecs2, P4 3.4GHz), which probably gets lower latency to the storage servers but might have a slower CPU, gets:
So increasing the concurrency limit causes:
* more parallelism, which fills the pipe better
* more directory retrievals happening at the same time
Therefore I think limit=10 is a reasonable choice.
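A hedged sketch of how the limiter above might be wired into a deep traversal, with limit=10 matching the default chosen here; fetch_children(), is_directory(), and size_of() are hypothetical helpers, not the real Tahoe-LAFS API:

```python
from twisted.internet import defer

limiter = ConcurrencyLimiter(limit=10)  # default chosen above

def measure_tree(dirnode):
    """Walk the tree, routing every directory retrieval through the
    limiter so that at most 10 retrievals are outstanding at once."""
    d = limiter.add(fetch_children, dirnode)

    def _children_fetched(children):
        total = 0
        subtree_deferreds = []
        for child in children:
            if child.is_directory():
                subtree_deferreds.append(measure_tree(child))
            else:
                total += size_of(child)
        d2 = defer.gatherResults(subtree_deferreds)
        d2.addCallback(lambda subtotals: total + sum(subtotals))
        return d2

    d.addCallback(_children_fetched)
    return d
```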
It is useful to note that the CPU was pegged at 100% for all trials. The current bottleneck is the CPU, not the network. I suspect that the mostly-Python unpacking of dirnodes is taking up most of the CPU.