t=deep-size needs rate-limiting #384
The webapi "?t=deep-size" feature (as well as the t=manifest feature from which it is derived) needs to be rate-limited. I saw the prodnet webapi machine fire off about 300 directory retrievals in a single tick, which is enough of a load spike to stall the node for a few dozen seconds.
It might be useful to rebuild something like the old slowjob, but in a form that's easier to use this time around. Maybe an object which accepts a (callable, args, kwargs) tuple, and returns a Deferred that fires with the results. The call is not invoked until later, however, and the object has a limit on the number of simultaneous requests that will be outstanding, or perhaps a maximum rate at which requests will be released.
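A minimal sketch of the kind of object described above, assuming Twisted Deferreds; the names (ConcurrencyLimiter, add) are illustrative and not necessarily the actual Tahoe-LAFS API:

```python
from twisted.internet import defer

class ConcurrencyLimiter:
    """Accepts (callable, args, kwargs) submissions and returns a Deferred
    that fires with the result. At most 'limit' calls are outstanding at
    the same time; the rest wait in a FIFO queue."""

    def __init__(self, limit=10):
        self.limit = limit
        self.active = 0
        self.pending = []  # queued (result Deferred, callable, args, kwargs)

    def add(self, cb, *args, **kwargs):
        d = defer.Deferred()
        self.pending.append((d, cb, args, kwargs))
        self._maybe_start_one()
        return d

    def _maybe_start_one(self):
        if self.pending and self.active < self.limit:
            d, cb, args, kwargs = self.pending.pop(0)
            self.active += 1
            d2 = defer.maybeDeferred(cb, *args, **kwargs)
            d2.addBoth(self._one_done)   # release the slot first
            d2.chainDeferred(d)          # then hand the result to the caller

    def _one_done(self, res):
        self.active -= 1
        self._maybe_start_one()
        return res
```

For comparison, Twisted's built-in defer.DeferredSemaphore.run() provides similar bounded-concurrency behaviour.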
Mike says that he saw similar problems on the Windows client, before changing it to offload the t=deep-size queries to the prodnet webapi server. The trouble is, that machine gets overloaded by it too. So managing the parallelism would help both issues.
He saw a request use 50% of the local CPU for about 60 seconds. The same deep-size request took about four minutes when using a remote server, if I'm understanding his message correctly.
One important point to take away is that deep-size should not be called on every modification: we should really be caching the size of the filesystem and applying deltas as we add and remove files, then only doing a full deep-size every once in a while (maybe once a day) to correct the value.
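As an illustration of that caching idea (not actual Tahoe-LAFS code, and simplified to synchronous calls), the client could keep a running total, adjust it on each add/remove, and only occasionally reconcile against a real deep-size traversal; compute_deep_size here is a hypothetical callable standing in for the expensive full pass:

```python
import time

class CachedDeepSize:
    """Illustrative sketch: keep a running total of the filesystem size,
    apply deltas on add/remove, and only occasionally correct it with a
    real (expensive) deep-size traversal."""

    RECONCILE_INTERVAL = 24 * 60 * 60  # roughly once a day

    def __init__(self, compute_deep_size):
        # compute_deep_size does the full t=deep-size traversal and
        # returns the total size in bytes (hypothetical helper)
        self._compute_deep_size = compute_deep_size
        self._cached = compute_deep_size()
        self._last_full = time.time()

    def file_added(self, size):
        self._cached += size

    def file_removed(self, size):
        self._cached -= size

    def deep_size(self):
        # serve the cheap cached value, recomputing only when the cache
        # is old enough to have drifted
        if time.time() - self._last_full > self.RECONCILE_INTERVAL:
            self._cached = self._compute_deep_size()
            self._last_full = time.time()
        return self._cached
```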
I implemented this, in changeset:3cb361e233054121. I did some experiments to decide upon a reasonable value for the default limit, and settled upon allowing 10 simultaneous requests per call to deep-size.
From my desktop machine (fluxx, Athlon 64 3500+ in 32bit mode), which has a pretty fast pipe to the storage servers in our colo, t=deep-size on a rather large directory tree (~1700 directories, including one that has at least 300 children) takes:
The same test done from a machine in colo (tahoecs2, P4 3.4GHz), which probably gets lower latency to the storage servers but might have a slower CPU, gets:
So increasing the concurrency limit causes:
* more parallelism, which fills the pipe better
* more directory retrievals happening at the same time
Therefore I think limit=10 is a reasonable choice.
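A hedged sketch of how the limiter above might be wired into a deep traversal, with limit=10 matching the default chosen here; fetch_children(), is_directory(), and size_of() are hypothetical helpers, not the real Tahoe-LAFS API:

```python
from twisted.internet import defer

limiter = ConcurrencyLimiter(limit=10)  # default chosen above

def measure_tree(dirnode):
    """Walk the tree, routing every directory retrieval through the
    limiter so that at most 10 retrievals are outstanding at once."""
    d = limiter.add(fetch_children, dirnode)

    def _children_fetched(children):
        total = 0
        subtree_deferreds = []
        for child in children:
            if child.is_directory():
                subtree_deferreds.append(measure_tree(child))
            else:
                total += size_of(child)
        d2 = defer.gatherResults(subtree_deferreds)
        d2.addCallback(lambda subtotals: total + sum(subtotals))
        return d2

    d.addCallback(_children_fetched)
    return d
```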
It is useful to note that the CPU was pegged at 100% for all trials. The current bottleneck is the CPU, not the network. I suspect that the mostly-Python unpacking of dirnodes is taking up most of the CPU.