very large memory usage when adding to large directories #379
We've seen the webapi servers that are participating in the allmydata
migration effort have memory usage spikes that jump to an incredible 3.1GB
over the past two days. The CPU usage goes to 100% for about 5 minutes while
this happens. This is occurring about once every 15 minutes, causing the
migration process to run significantly slower.
We've isolated the problem to the following items:

* The storage server's remote interface (specifically slot_testv_and_readv_and_writev) declares a maximum share data size of 1 MiB, i.e. 1048576 bytes, but the maximum size of a mutable file (3.5 MB) leads to shares that can exceed this. This occurs for directories of about 9600 entries.
* Attempting to write such an oversized share triggers a foolscap Violation, because the share-size constraint is being violated.
* The way this Violation is handled (via maybeDeferred) causes the cleanFailure method to be run, which turns the entire stack trace (including local variables for each frame) into a bunch of strings.
* Several copies of the share data are sitting in those stack frames (in the locals). Every one of these strings gets repr'ed, and since they're binary data, each repr'ing gets a 4x expansion.
* The resulting Failure therefore carries multiple expanded copies of the share data inside the Failure's .stack and .frames attributes.
* The DeferredList in mutable.py fails to catch errors in callRemote: servers fail to accept shares, but we don't notice.
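As a rough illustration (a minimal Python sketch, not Tahoe code; current Twisted only captures frame variables when asked via captureVars=True), this is how a Failure ends up holding repr'ed copies of large binary locals:

```python
import os
from twisted.python.failure import Failure

def write_share():
    # hypothetical stand-in for the failing code path: a large binary share
    # sits in a local variable on the frame that raises the Violation
    share = os.urandom(1024 * 1024)            # ~1 MiB of binary data
    raise ValueError("share-size constraint violated")

try:
    write_share()
except ValueError:
    f = Failure(captureVars=True)   # records each frame's locals and globals
    f.cleanFailure()                # replaces every captured object with repr()
    # repr() of binary data expands it roughly 3-4x ('\xNN' escapes), so each
    # frame that had the share in scope now contributes several MB of text.
```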
What we need to do to address this:

* Remove (or raise) the size constraint on ShareData in the remote interface. This will stop these particular Violations from happening.
* Get the Twisted Failure/cleanFailure problem fixed somehow, then require the resulting release of Twisted.
* Fix the use of DeferredList in mutable.py, to properly catch exceptions: any exception in share transmission is likely to consume this sort of memory.
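A minimal sketch of the DeferredList pitfall (generic Twisted, not the actual mutable.py code): with default arguments a DeferredList always fires with a list of (success, result) pairs, so a failed callRemote is silently recorded unless the caller inspects the flags:

```python
from twisted.internet import defer

def push_share(server):
    # stand-in for a callRemote() that may fail with a Violation
    if server == "bad":
        return defer.fail(RuntimeError("server rejected share"))
    return defer.succeed("ok")

dl = defer.DeferredList([push_share(s) for s in ("good", "bad")],
                        consumeErrors=True)

def check(results):
    # results is a list of (success, value_or_failure) pairs; unless we look
    # at the success flags, the rejected share goes completely unnoticed
    for success, value in results:
        if not success:
            value.raiseException()   # surface the failure to whoever waits on dl
    return [value for _, value in results]

dl.addCallback(check)
```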
Workarounds we could conceivably implement before the Twisted problem gets fixed:

* Don't hold share data in local variables: keep it in instance attributes instead, and be careful to delete it from locals as soon as possible.
* Wrap the share data in an object whose repr is small (i.e. doesn't show the whole thing) instead of a real string, and teach Foolscap (via ISliceable) to serialize these wrappers as strings, so multiple full copies don't end up repr'ed on the same stack.
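For the second workaround, a wrapper along these lines (purely hypothetical names; the ISliceable half, teaching Foolscap to send the wrapper over the wire as a plain string, is not shown) would keep any captured repr small:

```python
class WrappedShareData(object):
    """Hypothetical holder for share bytes whose repr stays tiny, so a
    captured stack trace doesn't embed megabytes of escaped binary data."""
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __repr__(self):
        # show only the size and a short prefix instead of the whole buffer
        return "<WrappedShareData %d bytes %r...>" % (len(self.data), self.data[:8])
```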
Our current plan is to fix the constraint and then hope that we don't trigger other failures while we contribute to the Twisted ticket and wait for a new release. Later refactorings of share management will probably put more data in instance attributes rather than passing it through method arguments. If necessary, we can ship Tahoe with a patched version of Twisted.
I changed the remote interface to remove the size constraint, and I fixed the use of DeferredList to propagate errors back to the caller. I updated our production webapi servers with these changes, and they stopped using 3GB of memory every time someone tried to exceed the directory size limit.
I didn't update our storage servers to match, however, so what's probably happening right now is that the remote end is raising the same Violation. This should involve much less memory, though: the inbound constraint is tested against each incoming token, and the message is rejected immediately (precisely what the foolscap constraint mechanism is designed for). So the caller is still likely to get a rather weird error (a foolscap constraint Violation) instead of hitting the earlier "mutable files must be < 3.5MB" check.
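For reference, one way to make the DeferredList propagate failures back to the caller (a sketch against Twisted's public API, not the actual patch) is fireOnOneErrback, which errbacks with a FirstError wrapping the first underlying failure:

```python
from twisted.internet import defer

def publish_all(deferreds):
    # errback as soon as any share push fails, instead of silently recording
    # it; consumeErrors keeps the other failures from being logged as unhandled
    d = defer.DeferredList(deferreds, fireOnOneErrback=True, consumeErrors=True)
    def unwrap(f):
        f.trap(defer.FirstError)
        return f.value.subFailure       # hand the real failure to the caller
    d.addErrback(unwrap)
    return d
```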
The newly-refactored mutable files no longer use DeferredList. I think this ticket has served its purpose, and hopefully future searchers will find it in the archives when they run into this sort of problem again.