performance measurement of directories #327

New Issue

zooko · 2008-02-28T22:25:29Z

zooko commented

2008-02-28 22:25:29 +00:00

Make a speedtest which adds files one by one to a directory.

zooko added the

labels 2008-02-28 22:25:29 +00:00

zooko added this to the 0.9.0 (Allmydata 3.0 final) milestone 2008-02-28 22:25:29 +00:00

warner commented

2008-02-29 02:21:25 +00:00

yeah! I'd like to make this part of the regular 'make check-speed' test. To
do this, we need a test that will run in 1 to 5 minutes. We can grab multiple
timing numbers from a single test run, for example if we do 100 additions, we
can measure the time it took to do the first one, and also to do the first
ten, and all 100, and also record the time taken by the last one.

The performance problems that we've seen (#321) are the clearly present in
the tahoe node, but the solution we've proposed is to make the next layer up
(in this case the windows SMB/FUSE plugin) do write-caching/batching to
reduce the number of dirnode modifications that must be done. Should this
test include these changes?

Specifically, do we expect this test to improve once we fix #321? I suspect
the answer is "no", that continuing results of this test should highlight the
need for app-level cacheing. In that case, this test should use the HTTP
interface and simply do a series of PUT /uri/TESTDIR/FILENUM calls.

How many files should we add? Some tests that Peter just ran suggest the
following total times (using the office DSL and uploading random 1024 files):

1 file : 9s
10 files : 44s
20 files : 1m21s
100 files : 15m14s
200 files : 32m11s
300 files : 47m55s

20 files seems like it would hit the automated-test-friendly goals of a few
minutes per test. However the significant slowdowns we've seen in #321 were
mostly triggered by having a few hundred entries in a directory. Once we have
a web interface to add multiple children, I think we should do 20 files, then
switch to adding 10 or maybe 50 children at a time and do another dozen
iterations.

yeah! I'd like to make this part of the regular 'make check-speed' test. To do this, we need a test that will run in 1 to 5 minutes. We can grab multiple timing numbers from a single test run, for example if we do 100 additions, we can measure the time it took to do the first one, and also to do the first ten, and all 100, and also record the time taken by the last one. The performance problems that we've seen (#321) are the clearly present in the tahoe node, but the solution we've proposed is to make the next layer up (in this case the windows SMB/FUSE plugin) do write-caching/batching to reduce the number of dirnode modifications that must be done. Should this test include these changes? Specifically, do we expect this test to improve once we fix #321? I suspect the answer is "no", that continuing results of this test should highlight the need for app-level cacheing. In that case, this test should use the HTTP interface and simply do a series of PUT /uri/TESTDIR/FILENUM calls. How many files should we add? Some tests that Peter just ran suggest the following total times (using the office DSL and uploading random 1024 files): * 1 file : 9s * 10 files : 44s * 20 files : 1m21s * 100 files : 15m14s * 200 files : 32m11s * 300 files : 47m55s 20 files seems like it would hit the automated-test-friendly goals of a few minutes per test. However the significant slowdowns we've seen in #321 were mostly triggered by having a few hundred entries in a directory. Once we have a web interface to add multiple children, I think we should do 20 files, then switch to adding 10 or maybe 50 children at a time and do another dozen iterations.

warner added

code-dirnodes

and removed

code

labels 2008-04-24 23:51:16 +00:00

warner commented

2008-05-09 02:12:36 +00:00

some random datapoints:

t=deep-size on a large tree (about 1700 directories) takes about 140s
30s (about 20%) of this is spent in _unpack_contents
- we could switch to JSON-encoded dirnodes, to take advantage of parsers written in C
we could bypass child writecap decryption (by using deep-size on a readonly dirnode instead of
a read-write one), but there's no point
- _decrypt_rwcap used 2.6s
- retrieve zfec decoding used 233ms
- retrieve AES decryption used 272ms

So beyond the unpacking loop (which unpacks a lot of netstrings, mostly using string.index), I don't yet know where the time is being spent. We only pulled down about 9MB of directory data, and we've got a link that can do about 20Mbps down, and increasing the parallelism doesn't speed things up very much, so the wire isn't the bottleneck. The CPU is pegged at 100% for this trial.

some random datapoints: * t=deep-size on a large tree (about 1700 directories) takes about 140s * 30s (about 20%) of this is spent in _unpack_contents * we could switch to JSON-encoded dirnodes, to take advantage of parsers written in C * we could bypass child writecap decryption (by using deep-size on a readonly dirnode instead of a read-write one), but there's no point * _decrypt_rwcap used 2.6s * retrieve zfec decoding used 233ms * retrieve AES decryption used 272ms So beyond the unpacking loop (which unpacks a lot of netstrings, mostly using string.index), I don't yet know where the time is being spent. We only pulled down about 9MB of directory data, and we've got a link that can do about 20Mbps down, and increasing the parallelism doesn't speed things up very much, so the wire isn't the bottleneck. The CPU is pegged at 100% for this trial.

warner commented

2008-05-09 16:20:41 +00:00

Oh, I just remembered why we didn't go with JSON for the dirnodes: much of the data that they're encoding is binary (well, at least the encrypted child write-cap is, and we were anticipating more densely packed forms of URIs in the future), and JSON isn't particularly efficient with binary data. (In fact, len(simplejson.dumps(os.urandom(100))) fails with a UnicodeDecodeError.. can JSON be used for binary strings at all?).

So rather than switching to JSON, we could write a custom directory unpacker in C, or give psycho a try.

Oh, I just remembered why we didn't go with JSON for the dirnodes: much of the data that they're encoding is binary (well, at least the encrypted child write-cap is, and we were anticipating more densely packed forms of URIs in the future), and JSON isn't particularly efficient with binary data. (In fact, len(simplejson.dumps(os.urandom(100))) fails with a `UnicodeDecodeError`.. can JSON be used for binary strings at all?). So rather than switching to JSON, we could write a custom directory unpacker in C, or give psycho a try.

warner modified the milestone from 1.1.0 to 1.2.0

2008-05-29 22:18:16 +00:00

zooko commented

2009-05-04 16:54:35 +00:00

See also #383 (large directories take a long time to modify), #414 (profiling on directory unpacking), and #329 (dirnodes could cache encrypted/serialized entries for speed).

zooko modified the milestone from 1.5.0 to eventually

2009-06-30 18:11:39 +00:00

zooko commented

2009-06-30 19:23:02 +00:00

I've started writing a script to benchmark directory packing and unpacking.

zooko self-assigned this 2009-06-30 19:23:02 +00:00

kevan commented

2009-07-05 22:39:40 +00:00

(this is in response to the first iteration of the benchmarking script; the issue is addressed in r3965)

Cool!

I've run it on my machine, and noticed that it actually shows slower results for the optimized code. I think that's a matter of methodology, though.

From what I understand (having never used benchutil), to test pack_contents you're building a list of (name, child) tuples, feeding that into a dict, and then feeding the dict to pack_contents, and you're testing how long that takes for increasing numbers of tuples. To test unpack_contents, you're doing that, but saving the result of pack_contents in a global variable, then feeding that to unpack_contents to see how long it takes.

If I'm right, we aren't seeing any speed improvements because the benchmark isn't actually testing the optimizations. In order to do that, we need to feed pack_contents a dictionary that was actually output from unpack_contents (or else built with set_both_items instead of setitem) -- that way, the set_both_items method of the dict wrapper will have been used to set the serialized value, and pack_contents will find and use that value, thus (ideally) making it faster.

One way to do this might be to stick the logic for child creation into a more general setup method -- something which, when called, would generate + pack the list of children, and return the results of pack_contents on that list, for example. Then the init method for the unpack test could store that value in the global packstr variable, and work as it does now, while the init method for the pack test could unpack that, then feed the resulting dictionary into pack_contents again, where the optimizations would start working.

(this is in response to the first iteration of the benchmarking script; the issue is addressed in r3965) Cool! I've run it on my machine, and noticed that it actually shows slower results for the optimized code. I think that's a matter of methodology, though. From what I understand (having never used benchutil), to test pack_contents you're building a list of (name, child) tuples, feeding that into a dict, and then feeding the dict to pack_contents, and you're testing how long that takes for increasing numbers of tuples. To test unpack_contents, you're doing that, but saving the result of pack_contents in a global variable, then feeding that to unpack_contents to see how long it takes. If I'm right, we aren't seeing any speed improvements because the benchmark isn't actually testing the optimizations. In order to do that, we need to feed pack_contents a dictionary that was actually output from unpack_contents (or else built with set_both_items instead of *setitem*) -- that way, the set_both_items method of the dict wrapper will have been used to set the serialized value, and pack_contents will find and use that value, thus (ideally) making it faster. One way to do this might be to stick the logic for child creation into a more general setup method -- something which, when called, would generate + pack the list of children, and return the results of pack_contents on that list, for example. Then the init method for the unpack test could store that value in the global packstr variable, and work as it does now, while the init method for the pack test could unpack that, then feed the resulting dictionary into pack_contents again, where the optimizations would start working.

Sign in to join this conversation.