error doing a check --verify on files bigger than about 1Gbyte #1395
Reference: tahoe-lafs/trac-2024-07-25#1395
On a grid with 10 storage nodes, all on win32, and a gateway/client on win32, with all the nodes up and all the shares OK, I can download all the files and check all of them without errors.
The only thing that doesn't work is check --verify of files bigger than 1 Gbyte.
Here's the error log of the gateway, followed by that of a storage node:
######################################################################
######################################################################
now for the storage log:
######################################################################
######################################################################
The gateway/client process quickly grew to over 1.5 Gbyte of RAM, then shrank to 180 Mbyte; the storage process grew to 300 Mbyte and stayed there.
I suspect that this isn't specific to win32, although the limit on usable process address space for 32-bit Windows might be smaller than for other 32-bit platforms.
There was also an error while reporting the incident:
but it isn't clear whether that's due to the low-memory condition or an independent issue.
'wb' is a valid mode, so probably the path is incorrect. Does `C:\tahoeclient\logs\incidents\` exist?

Yeah, I can confirm that `C:\tahoeclient\logs\incidents\` exists.
I see the problem. '`:`' is not valid in a Windows filename. Will file another ticket for that.

Replying to davidsarah:
#1396 and http://foolscap.lothar.com/trac/ticket/177
Title changed from "error doing a check --verify on files bigger than 1Gb on win32" to "error doing a check --verify on files bigger than about 1Gbyte".

How was this reported by `tahoe check --verify`?

It simply reported that 0 shares could be found for that file.
So both the gateway and the storage node had MemoryError. Can you reproduce it and tell what the Process Explorer says about how much memory those two different processes are using at different times in the operation? Are they using a large amount of memory continuously, or does it go up just before this failure, or what?
Thank you!
#1229 is another checker-related memory issue. I've no evidence that it's the same issue, but it's possible that they might interact (i.e. the fact that we need memory proportional to the file size might increase the size of the #1229 memory leak, or the leak might increase the memory required for a single check).
Anyway, we should try to reproduce this on a non-Windows system to confirm that it's not win32-specific.
Note that performance.rst claims "memory footprint: `N/K*S`" for verifying a file (where S is the segment size). That seems to be contradicted by the behaviour in this bug.
Not that it's likely to be the problem here, but we should update that claim to be "`N/K*S` times a small multiple". I think the multiple is currently about 2 or 3. During encryption, we hold both a plaintext share and a ciphertext share in RAM at the same time (so 2S), then we drop the plaintext. During erasure-coding, we hold a whole S of ciphertext in memory at the same time as the `N/K*S` shares, then we drop the ciphertext before pushing. We also pipeline the sends a little bit, I think 10kB or 50kB per server, to get better utilization out of a non-zero-latency wire.

Also, Python's memory-management strategy interacts weirdly. Dropping the plaintext segment may not be enough: Python might not re-use that memory space for anything else right away. Although I'd expect it to de-fragment or coalesce free blocks before asking the OS for so much memory that it crashed.
Replying to warner:
I would be willing to update these docs to be more precise or more correct, but I'm not entirely sure what you want them to say. `(N/K)*S*3+50KB`? (But only for immutable repair.)

Note that some of the other numbers in there are marked as approximate by a preceding tilde `~`, e.g. performance.rst "Repairing an A-byte file". Maybe we should use the computer science tradition of ignoring constant factors which are independent of the variables (`K`, `N`, `S`, `A`, `B`, and `G`). However, I would rather follow that tradition only when the constant factors that we're ignoring are sufficiently small that our users will be willing to ignore them too. :-)

So in short: +0 from me, but you would need to write a patch for `performance.rst` yourself. Your attention to that document would be much appreciated by me because I would like for your admirable concern for precision in resource usage to be better represented there.

The pipeline size, which applies only to immutable objects and only to uploads, is 50 KB: [immutable/layout.py WriteBucketProxy](http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/immutable/layout.py?annotate=blame&rev=4655#L99).
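For a sense of scale, the formula floated above can be evaluated directly. This is only a back-of-the-envelope sketch: the function name is made up for illustration, the 3-of-10 encoding and 128 KiB segment size are assumed example values, and "50KB" is taken as 50,000 bytes. It shows the proposed estimate, not a measured footprint.

```python
def estimated_footprint_bytes(n, k, segment_size, multiple=3, pipeline_bytes=50_000):
    """Evaluate (N/K)*S*multiple + pipeline, in bytes.

    `multiple` and `pipeline_bytes` follow the figures suggested in the
    comments above; they are estimates, not measurements.
    """
    return (n / k) * segment_size * multiple + pipeline_bytes


# Assumed example parameters (not taken from the reporter's grid).
n, k, s = 10, 3, 128 * 1024

print(f"N/K*S          = {(n / k) * s / 1024:.0f} KiB")
print(f"(N/K)*S*3+50KB = {estimated_footprint_bytes(n, k, s) / 1024:.0f} KiB")
```

Either figure is far below the 1.5 Gbyte seen in the report, which is why the constant factor alone is unlikely to explain this bug.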
I've opened #1398 (make docs/performance.rst more precise and accurate) for the documentation issues and attached a patch there. I do hope Brian looks at it, because my previous pass seems to have left a lot of incorrect stuff in that document and I would like the next version of it to be less wrong.
Attachment serialize-verifier.diff (1614 bytes) added
patch to serialize block-fetches in Verifier. Multiple shares are still done in parallel.
I think I see the problem. The immutable Verifier code (here in checker.py) is overly parallelized. It uses a DeferredList to work on all shares in parallel, and each share worker uses a DeferredList to work on all blocks in parallel. The result is that every single byte of every single share is fetched at the same time, completely blowing our memory budget. As to why the server is crashing, I suspect that when the server gets a gigantic batch of requests for every single byte of the file, it responds to all of them, queueing a massive amount of data in the output buffers, which blows the memory space. A separate issue is to protect our servers against this sort of DoS, but I'm not sure how (we'd need to delay responding to a request if there were more than a certain number of bytes sitting in the output queue for that connection, which jumps wildly across the abstraction boundaries).
The Verifier should work just like the Downloader: one segment at a time, all blocks for a single segment being fetched in parallel. That approach gives a memory footprint of about `S*N/k` (whereas regular download is about `S`). We could reduce the footprint to `S/k` (at the expense of speed) by doing just one block at a time (i.e. completely verify share 1 before touching share 2, and within share 1 we completely verify block 1 before touching block 2), but I think that's too much.

I've attached a patch which limits parallelism to approximately the right thing, given the slightly funky design of the Verifier (the verifier iterates primarily over shares, not segments). The patch continues to verify all shares in parallel. However, within each share, it serializes the handling of blocks, so that each share-handler will only look at one block at a time.
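The difference between the two scheduling shapes can be illustrated with a small standalone sketch. It uses `asyncio` from the standard library rather than Twisted's `DeferredList`, and `FakeServer`, `verify_all_at_once`, and `verify_serialized` are hypothetical names invented here; none of this is the actual `checker.py` code or the attached patch. It only counts how many block reads are outstanding at once under each strategy.

```python
import asyncio


class FakeServer:
    """Hypothetical stand-in for a storage server; tracks in-flight block reads."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def get_block(self, share, block):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0)  # yield so other pending fetches can start
        self.in_flight -= 1
        return (share, block)


async def verify_all_at_once(server, num_shares, num_blocks):
    # Over-parallel shape: every block of every share is requested up front.
    await asyncio.gather(*[
        server.get_block(s, b)
        for s in range(num_shares)
        for b in range(num_blocks)
    ])


async def verify_share_serially(server, share, num_blocks):
    # Within one share, finish each block before asking for the next.
    for b in range(num_blocks):
        await server.get_block(share, b)


async def verify_serialized(server, num_shares, num_blocks):
    # Shares still run in parallel; blocks within each share do not.
    await asyncio.gather(*[
        verify_share_serially(server, s, num_blocks)
        for s in range(num_shares)
    ])


def peak_in_flight(strategy, num_shares=10, num_blocks=100):
    server = FakeServer()
    asyncio.run(strategy(server, num_shares, num_blocks))
    return server.peak


print("all at once:", peak_in_flight(verify_all_at_once))
print("serialized per share:", peak_in_flight(verify_serialized))
```

In this toy model the first strategy has `num_shares * num_blocks` reads outstanding at its peak, while the second never exceeds `num_shares`, which is the bound the patch is aiming for.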
The patch needs tests, which should verify a moderate-size artificially-small-segment (thus high-number-of-segment) file, probably with N=1 for simplicity. It needs to confirm that one block is completed before the next begins: I don't know an easy way to do that.. probably needs some instrumentation in `checker.py`. My manual tests just added some printfs, one just before the call to `vrbp.get_block()`, another inside `_discard_result()`, and noticed that there were lots of `get_block`s without interleaved `_discard_result`s.

Could this patch cause a performance regression for small files, or is each server going to serialize the requests to it anyway?
This is a major bug. I really think we should fix this for v1.9.0.
+1 on fixing this for 1.9.0.
YAY, this patch solves my problem! (It also speeds up the testing and greatly reduces the memory usage of the storage node =)
Cannot reproduce the bug on my virtual machine: Windows 7 x64, 1G RAM. Test files were 1G and 2G.
ok, so sickness' report suggests this was the right problem to fix. Now to figure out how to write a test for it.
Attachment 1395-overparallel.diff (4744 bytes) added
serialize block-fetches, add test
Ok, that latest patch should be ready: it adds a test which includes a tiny bit of instrumentation on the immutable verifier (to measure how many block fetches are active at any one time), and it fails without the verifier fix, and passes once that fix is applied. Ready for review.
comment:83410 "How was this reported by `tahoe check --verify`?"

comment:83411 "it simply reported that 0 shares could be found for that file"

That seems like an error-reporting bug that wouldn't be fixed by attachment:1395-overparallel.diff, even if it would happen in fewer cases.
davidsarah: hmm, if the storage server got clobbered by all the parallel fetches, then it would throw `MemoryError` (and either crash completely, or at least return Failures to the `remote_read()` calls). From the client's point of view, this is identical to there being no shares available: crashing servers are not providing shares.

Maybe the CLI command output could be extended to mention how many servers were contacted, and how many answered, so the user could distinguish crashing-servers (or no-network) from solid NAKs.
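As a rough idea of what such output could look like, here is a hypothetical sketch. The function name, the per-server outcome labels (`"share"`, `"no-share"`, `"error"`), and the message wording are all invented for illustration; this is not existing `tahoe check` behaviour.

```python
from collections import Counter


def summarize_responses(responses):
    """Summarize per-server outcomes of a check.

    `responses` maps a server id to one of three made-up labels:
    "share" (answered and holds a share), "no-share" (answered, a solid
    NAK), or "error" (crashed, failed, or unreachable).
    """
    counts = Counter(responses.values())
    contacted = len(responses)
    answered = counts["share"] + counts["no-share"]
    return (
        f"contacted {contacted} servers, {answered} answered "
        f"({counts['share']} with shares, {counts['no-share']} without), "
        f"{counts['error']} failed"
    )


# Every server failed: "0 shares found" here means something very
# different from ten servers answering "no share".
print(summarize_responses({f"server{i}": "error" for i in range(10)}))
```

With a line like this, a user seeing zero shares could tell at a glance whether the servers said "no" or simply never answered.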
This patch is still important to apply, though.. zooko: please don't defer reviewing it because of the CLI deficiency. It'll be easier to test a CLI change with special-purpose test code (building a client with no servers, or with intentionally-failing servers) than by trying to provoke a `MemoryError`.

Attachment not-too-parallel-test-by-patching.diff (6944 bytes) added
Okay, I've reviewed the patch and it is good. Please review this alternate patch (not-too-parallel-test-by-patching.diff) in which the test patches the ValidatedReadBucketClass instead of the code under test (Checker) having code in it for testing purposes. Unfortunately this patch depends on a variant of `mock.py` named `mockutil.py` which knows what to do when a test case returns a Deferred instance, so I don't think we should commit this patch as is. I've submitted a patch to the `mock` project (http://code.google.com/p/mock/issues/detail?id=96 ) and I'll proceed to work up a patch for Tahoe-LAFS which uses a lower-tech patching tool. But please review this patch because it most clearly shows why I like this approach. (If you ignore the whole `mockutil.py` part.)

Attachment not-too-parallel-test-by-low-tech-patching.diff (5149 bytes) added
Okay, here's a version that doesn't depend on `mockutil.py`: attachment:not-too-parallel-test-by-low-tech-patching.diff. As you can see it makes the test code a little uglier and it doesn't handle exceptions (it'll leave the `ValidatedReadBucketProxy` patched out for the rest of the `trial` process if there is an exception raised from the test). I guess I'm a bit on the fence about which of these two patches I prefer. What's your preference?

To clarify, the original patch 1395-overparallel.diff passes my review and I approve of applying it to trunk. I also think that not-too-parallel-test-by-low-tech-patching.diff is better, and hope you (warner) will consider it as a replacement for attachment:1395-overparallel.diff. I also like not-too-parallel-test-by-patching.diff, but it has the flaw that it will probably break if someone uses it with a newer version of `mock.py`, so we should probably avoid it.

I'm -1 on a test that leaves the code in a broken/patched state after an
exception.. too much opportunity for confusion and hair-tearing debugging of
other problems later. And I'm -1 on a test that depends deeply upon internals
of a library that we don't include a copy of (note that I'm historically
inconsistent about this one, q.v. Foolscap vs Twisted, but I'm trying to mend
my ways).
At first, I found your -low-tech-patching test hard to follow, but looking
more closely, I'm growing fond of it. The original code isn't too hard to
patch in the right way, so the mock subclass isn't too too weird. It might be
nice if the new variable names were shorter (and since they only appear in the
context of the TooParallel unit test, they don't need to be as
fully-qualified as if they were in the original `checker.py`). But after
getting used to it, I think I'm ok with the way it stands.
I'd be ok with a simple partial fix to the cleanup-after-exception problem.
Instead of using `setUp/tearDown` or pushing the whole middle of the test
method into a sub-method, just put the `set_up_grid` part into a small
function so you can get it into the Deferred chain. Something like:
Since the "critical region" where exceptions could cause problems doesn't
start until after the `...checker.ValidatedReadBucketProxy = make_mock_VRBP`
line, it's good enough to just capture the first chunk of the code after that
in a function which is run from within the Deferred chain. Any exceptions in
`set_up_grid` or `c0.upload` (which are fairly big and complex) will
still get caught, so the cleanup can happen.
I'll modify the low-tech patcher approach (attachment:not-too-parallel-test-by-low-tech-patching.diff) to clean up in case of exception, like Brian suggested.
Note that Michael Foord, author of mock.py, has proposed a patch for mock.py that would make it do everything we need. He just wants code-review and unit tests. See the mock.py ticket to help with that.
Once a version of mock.py is out with that functionality and we are willing to make Tahoe-LAFS depend on that version of mock, then we can have the more succinct not-too-parallel-test-by-patching.diff style of mockery in our unit tests.
In changeset:f426e82287c11b11:
In changeset:9f8d34e63aa1aeeb:
In changeset:c7f65ee8ad254f3f:
changeset:c7f65ee8ad254f3f seems to have fixed the failures of `allmydata.test.test_repairer.Verifier.test_corrupt_sharedata` on the builders. (Some remaining nondeterministic failures are due to #1084.)