Poor performance with large number of files via Windows FUSE? #321
Peter and Fabrice have reported problems with dragging a large folder into
the Windows FUSE frontend. We're still collecting data, but the implication
is that there is a super-linear slowdown somewhere, maybe in the FUSE plugin,
maybe in the local Tahoe node that it connects to. We expect to spend roughly
one second per file right now: our automated perfnet tests show 600ms per
immutable file upload and 300ms per directory update; prodnet has a different
number of servers, but I'd expect the values to be fairly close. Peter says
that one second per file is not enough to explain the slowdowns he is seeing.
We are currently running tests with additional instrumentation to figure out
where this time is being spent.
It's too bad we didn't implement #273 -- "How does tahoe handle lots of simultaneous file-upload tasks?" -- before now. If we had, then we would know already how the Tahoe node itself handles this load.
Err, I mean #173 -- "How does tahoe handle lots of simultaneous file-upload tasks?".
Unfortunately no: the FUSE plugin only gives the tahoe node one task at a time. No parallelism here.
Fine then -- let us add an automated performance measurement that asks "How does tahoe handle lots of sequential file-upload tasks?".
#327 -- "performance measurement of directories"
We've performed some log analysis and identified that the problem is simply
the dirnodes becoming too large. A directory with 353 children consumes
114305 bytes, and at 3-of-10 encoding requires about 400kB to be written on
each update. A 1Mbps SDSL line can deliver about 100kB/s, so it takes about 4
seconds to send out all the shares. The Retrieve that precedes the Publish
takes a third of this time, so it needs 1 or 2 seconds. The total time to
update a dirnode of this size is about 10 seconds; small directories take
about 2 seconds.
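To make the arithmetic explicit, here is a back-of-the-envelope check of those figures (purely illustrative; the dirnode size and the ~100kB/s uplink are the numbers quoted above):

```python
# Rough arithmetic behind the numbers above (illustrative only).
dirnode_bytes = 114305                        # measured size of a 353-child dirnode
k, n = 3, 10                                  # 3-of-10 erasure coding
total_share_bytes = dirnode_bytes * n / k     # ~381 kB pushed per update
uplink_bytes_per_sec = 100 * 1000             # ~100 kB/s on a 1Mbps SDSL line
publish_seconds = total_share_bytes / uplink_bytes_per_sec
print(round(total_share_bytes / 1000), "kB per update,",
      round(publish_seconds, 1), "seconds to push shares")   # ~381 kB, ~3.8 s
```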
One thing that surprised me was that dirnodes are twice as large as I'd
thought: 324 bytes per child. I guess my previous estimates (of 100-150) were
based on a design that we haven't yet implemented, in which we store binary
child caps instead of ASCII ones. So the contents of a dirnode are large enough
to take a non-trivial amount of time to upload. Also note that this means our
1MB limit on SDMF files imposes a roughly 3000-child limit on dirnodes (but
this could be easily raised by allowing larger segments).
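The same sort of arithmetic for the per-child size and the implied child-count ceiling (again just a check of the numbers above):

```python
# Per-child cost and the SDMF-imposed ceiling (illustrative only).
dirnode_bytes = 114305
children = 353
bytes_per_child = dirnode_bytes / children        # ~324 bytes with ASCII caps
sdmf_limit = 1000 * 1000                          # current 1MB SDMF file limit
max_children = sdmf_limit // int(bytes_per_child) # ~3000 children per dirnode
print(int(bytes_per_child), "bytes/child, max", max_children, "children")
```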
There are four things we can do about this.
The most significant is to do fewer dirnode updates. A FUSE plugin (with a
POSIX-like API) doesn't give us any advance notice of how many child
entries are going to be added, so the best we can do is a Nagle-like
algorithm that tries to batch writes together for efficiency. The basic
idea is that when a dirnode update request comes in, start a timer
(perhaps five seconds). Merge in any other update requests that arrive
during that time. When the timer expires, do the actual update. This will
help the lots-of-small-files case as long as the files are fairly small
and upload quickly. In the test we ran (with 1024 byte files), this would
probably have reduced the number of dirnode updates by a factor of 5.
The biggest problem is that this can't be done completely safely: it
requires lying to the close() call and pretending that the child has been
added when it actually hasn't. We could recover some safety by adding a
flush() or sync() call of some sort, and not returning from it until all
the pending Nagle timers have been fired early and their updates have completed.
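A minimal sketch of what that Nagle-like batching could look like, assuming some batched add-children call along the lines of the set_uris() interface mentioned later in this ticket; the class and method signatures here are illustrative, not the real IDirectoryNode API, and error handling is omitted:

```python
from twisted.internet import reactor, defer

class BatchingDirnodeUpdater:
    """Nagle-like batching sketch: collect add-child requests for a few
    seconds, then publish them all in a single dirnode update."""

    def __init__(self, dirnode, delay=5.0):
        self.dirnode = dirnode      # assumed to offer a batched set_uris()
        self.delay = delay
        self.pending = {}           # child name -> URI, merged until flush
        self.waiters = []           # Deferreds fired once the batch publishes
        self.timer = None

    def add_child(self, name, uri):
        """Queue a child. NOTE: the returned Deferred fires only after the
        batch publishes, but close() callers will have returned earlier."""
        self.pending[name] = uri
        d = defer.Deferred()
        self.waiters.append(d)
        if self.timer is None:
            self.timer = reactor.callLater(self.delay, self._flush)
        return d

    def flush_now(self):
        """Accelerate the timer, e.g. to honor an explicit flush()/sync()."""
        if self.timer is not None:
            self.timer.cancel()
            self._flush()

    def _flush(self):
        self.timer = None
        batch, self.pending = self.pending, {}
        waiters, self.waiters = self.waiters, []
        d = self.dirnode.set_uris(sorted(batch.items()))  # illustrative call
        def _fire(res):
            for w in waiters:
                w.callback(res)
            return res
        d.addCallback(_fire)
```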
Make dirnodes smaller. DSA-based mutable files (#217) and packing binary caps
into dirnodes (no ticket yet) would cut the per-child size roughly in half
(assuming I'm remembering my numbers correctly). Once dirnodes grow large
enough that the per-child entries dominate the fixed overhead (about 2kB,
so roughly 6 entries), this will cut about 50% off the large-dirnode update time.
We discovered an unnecessary retrieve during the directory-update
process. We need to update the API (#328) to remove it and provide the
safe-update semantics that were intended. Fixing this would shave about
10%-15% off the time needed to do a dirnode update (both large and small).
Serializing the directory contents (including encrypting the writecaps)
took 500ms for 353 entries. The dirnode could cache and reuse the
encrypted strings instead of generating new ones each time. This might
save about 5% of the large-dirnode update time. Ticket #329 describes
this.
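For the record, the caching idea in #329 might look something like the following sketch; the injected encryption routine is purely a stand-in for the real dirnode serialization code:

```python
class WritecapEncryptionCache:
    """Cache the encrypted form of each child's writecap so that unchanged
    children are not re-encrypted on every dirnode serialization (sketch)."""

    def __init__(self, encrypt):
        self._encrypt = encrypt   # stand-in for the real cap-encryption routine
        self._cache = {}          # (name, rwcap) -> encrypted string

    def encrypted_writecap(self, name, rwcap):
        key = (name, rwcap)
        if key not in self._cache:
            self._cache[key] = self._encrypt(name, rwcap)
        return self._cache[key]
```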
Zooko has started work on reducing the number of dirnode updates by adding an HTTP
interface to IDirectoryNode.set_uris() (allowing the HTTP client to add
multiple children at once). Mike is going to make the winFUSE plugin split the
upload process into separate upload-file-get-URI and dirnode-add-child
phases, which will make it possible for him to implement the Nagle-like timer
and batch the updates.
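From the plugin's side, the split might look roughly like this. "PUT /uri" is the WAPI's unlinked-upload endpoint; the batched add-children request is the new interface Zooko is working on, so its URL and payload shape below are assumptions for illustration only:

```python
import requests

NODE = "http://127.0.0.1:3456"   # local tahoe node's web gateway (assumed)

def upload_file_get_uri(data):
    # Phase 1: upload the file body on its own; the response body is the
    # new file's URI/cap.
    resp = requests.put(NODE + "/uri", data=data)
    resp.raise_for_status()
    return resp.text.strip()

def add_children(dir_cap, children):
    # Phase 2: attach many (name -> cap) entries in one dirnode update.
    # Endpoint name and JSON shape are illustrative, not the final API.
    resp = requests.post(
        NODE + "/uri/%s?t=set_children" % dir_cap,
        json={name: ["filenode", {"ro_uri": cap}] for name, cap in children},
    )
    resp.raise_for_status()
```

With that split, the plugin can upload several files, collect their caps, and issue a single add_children() call when the Nagle-like timer fires.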
Attachment NOTES (2195 bytes) added
some timing notes from our logfile analysis
Oh, we also noticed a large number of t=json queries being submitted by the
winFUSE plugin. At the beginning of the test (when the directory only had a
few entries, and updates took about 3 seconds), we were seeing about 5 such
queries per child entry. All of these queries require a directory fetch, and
most resulted in a 404 because the target filename wasn't present in the
directory. When dirnode updates started taking longer (10 seconds), we saw
fewer of these per update (maybe 1).
Early in the test, these queries took 210ms each; by the end of the test they
were taking one or two seconds each. This might represent 15%-30% of the time spent
doing the dirnode updates.
The plugin should do fewer of these queries: they are consuming network
bandwidth and slowing down the directory update. If it is doing them to see
if the file has been added to the directory yet, then it would be far more
efficient to simply wait for the response to the PUT call. If they are being
done for some other reason, then we should consider some sort of read cache
to reduce their impact.
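If a read cache does turn out to be the answer, a minimal sketch might look like this; the TTL, cache key, and invalidation policy here are all assumptions, not a design decision:

```python
import time

class DirnodeReadCache:
    """Short-lived cache of directory listings so repeated t=json probes
    of the same directory don't each trigger a full dirnode fetch (sketch)."""

    def __init__(self, fetch_listing, ttl=10.0):
        self._fetch = fetch_listing   # callable: dircap -> listing (e.g. via t=json)
        self._ttl = ttl
        self._cache = {}              # dircap -> (timestamp, listing)

    def get(self, dircap):
        now = time.time()
        hit = self._cache.get(dircap)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        listing = self._fetch(dircap)
        self._cache[dircap] = (now, listing)
        return listing

    def invalidate(self, dircap):
        # Call this after the plugin itself modifies the directory.
        self._cache.pop(dircap, None)
```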
MikeB: is this issue handled Well Enough for v0.9.0 now?
This issue is somewhat improved, and is hereby considered Good Enough for allmydata.org "Tahoe" v0.9.0.
(Further performance tuning might be applied before the Allmydata.com 3.0 product release, but that can be done after the allmydata.org "Tahoe" v0.9.0 release.)