macfuse: need some sort of caching #300

Closed
opened 2008-02-06 00:25:50 +00:00 by robk · 4 comments
Owner

So, doing some initial experiments with MacFUSE and the Python FUSE bindings, it seems like the simple act of viewing a directory in Finder generates a large number of calls through the FUSE API.

I ran a stub (loopback) fs with instrumentation of each FUSE call and opened a directory or two, each with only a few files. (I also tested a much larger directory and saw correspondingly larger numbers of calls.) The tool inserted a 100ms delay in answering each call, which explains the spacing of calls over time.

See attached log.
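
For reference, the instrumentation amounted to wrapping every FUSE handler with a logger and a fixed 100ms delay. A minimal sketch of the idea (not the actual tool that produced tfuse.log; the class and handler names are illustrative, and wiring the object up to the Python FUSE bindings is omitted):

```python
import functools
import logging
import os
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("tfuse")

CALL_DELAY = 0.100  # the 100ms per-call delay mentioned above


def instrumented(method):
    """Log each call (handler name plus arguments) and delay before answering."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        log.info("%s%r", method.__name__, args)
        time.sleep(CALL_DELAY)
        return method(self, *args, **kwargs)
    return wrapper


class InstrumentedLoopbackFS:
    """Loopback fs whose handlers are wrapped by @instrumented.

    Only two handlers are shown; a real loopback fs would implement the full
    set (access, getattr, readdir, statfs, open, read, ...) and register the
    object with the FUSE bindings in the usual way.
    """

    def __init__(self, root):
        self.root = root

    @instrumented
    def getattr(self, path):
        return os.lstat(self.root + path)

    @instrumented
    def readdir(self, path):
        return [".", ".."] + os.listdir(self.root + path)
```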

tahoe-lafs added the code-frontend, major, defect, 0.7.0 labels 2008-02-06 00:25:50 +00:00
tahoe-lafs added this to the 0.9.0 (Allmydata 3.0 final) milestone 2008-02-06 00:25:50 +00:00
Author
Owner

Attachment tfuse.log (37981 bytes) added

log of fuse calls


So if I'm reading that log right, when the finder looks in a directory, it
makes the following calls:

about 102 calls to access(DIR)
14 calls to getattr(DIR)

3 calls to getattr(.DS_Store)
1 call to getattr(.hidden)
1 call to readdir(DIR)
21 calls to statfs()

for FILE in DIR:
    24 calls to access(FILE)
    6 calls to getattr(FILE)
    12 calls to access(FILE.swp)
    3 calls to getattr(FILE.swp)

And displaying that 5-file directory resulted in about 330 system calls.
Impressive! :-)

It sounds like everything except statfs() can be handled with the data from a
single dirnode, so caching it long enough to make sure that this batch of
330-ish calls can be fed with a single Tahoe dirnode fetch is an important
goal. We have a few numbers to suggest how long it takes to perform this
fetch:
http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_delay_SSK.html
suggests that it takes about 70ms for a Tahoe node to retrieve a small
mutable file over a DSL line. There will be some extra delays involved if we include web API time, or more servers than those used on our speed-net
test, but I believe that any given directory should be fully retrievable in
under a second.
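
Back-of-the-envelope: if each of the roughly 330 calls above triggered its own dirnode fetch at ~70ms, a single Finder directory view would cost on the order of 330 × 0.07s ≈ 23 seconds; feeding the whole batch from one cached fetch drops that to a single 70-100ms retrieval plus near-instant local lookups.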

So we'll need to choose a caching policy based upon the following criteria:

  • displaying a directory requires several hundred system calls that refer to
    the same dirnode contents, in rapid succession
  • fetching the dirnode contents probably takes less than a second, closer
    to 100ms

The cache entries should expire after some reasonable period of time. Longer
expiration times will produce surprises and frustration when a user updates a
directory on one machine and then fails to see the updates on a different
machine.

If the expiration time is more than a few seconds, the implementation will
require some sort of forced-expiration or local-update in the face of
locally-caused changes to the directory, to make sure you can see the changes
you just made. (If we didn't have caching, we wouldn't need this relatively
complicated feature.)

My straw-man suggestion is the following:

  • index the cache by the URI of the directory
  • expire the cache entries 10 seconds after they are retrieved
  • expire the cache entries immediately if the directory is modified
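
A minimal sketch of that policy, assuming a hypothetical DirnodeCache wrapper around whatever function actually fetches dirnode contents (fetch_dirnode, get, and invalidate are illustrative names, not existing Tahoe or FUSE APIs):

```python
import time

CACHE_LIFETIME = 10.0  # seconds, per the straw-man above


class DirnodeCache:
    """Cache dirnode contents, keyed by directory URI."""

    def __init__(self, fetch_dirnode, lifetime=CACHE_LIFETIME):
        self._fetch = fetch_dirnode   # callable: URI -> dirnode contents
        self._lifetime = lifetime
        self._entries = {}            # URI -> (fetched_at, contents)

    def get(self, dir_uri):
        """Return cached contents if still fresh, otherwise re-fetch and cache."""
        now = time.time()
        entry = self._entries.get(dir_uri)
        if entry is not None:
            fetched_at, contents = entry
            if now - fetched_at < self._lifetime:
                return contents
        contents = self._fetch(dir_uri)
        self._entries[dir_uri] = (now, contents)
        return contents

    def invalidate(self, dir_uri):
        """Drop an entry immediately after a local modification of the directory."""
        self._entries.pop(dir_uri, None)
```

The FUSE layer would then answer every access()/getattr()/readdir() in a batch from cache.get(dir_uri), and any write path that modifies the directory would call cache.invalidate(dir_uri) so you immediately see your own changes.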

More data (specifically system-call traces) would be useful on the following
cases:

  • opening a child directory directly (perhaps through a symlink). Does the
    finder make a lot of calls for the ancestor directories? If so, that will
    increase the pressure to retain cached entries longer.
  • when writing to a file in a directory, how much (and when) is the
    directory re-read? That will influence the modify-the-cache vs.
    expire-the-cache design decisions.
warner modified the milestone from 1.1.0 to 1.2.0 2008-05-29 22:21:00 +00:00
zooko modified the milestone from 1.5.0 to eventually 2009-06-30 12:38:59 +00:00

If you like this ticket, you might also like #606 (backupdb: add directory cache), #465 (add a mutable-file cache), and #316 (add caching to tahoe proper?).


The direct FUSE support in Tahoe-LAFS was removed in 4f8e3e5ae8fefc01df3177e737d8ce148edd60b9 (2011). The preferred way to get a native filesystem-like interface is via the SFTP frontend and something like sshfs.

exarkun added the wontfix label 2020-01-16 19:39:16 +00:00
Reference: tahoe-lafs/trac-2024-07-25#300