zandr's FUSE/NAS idea #935
At lunch today, Zandr and I were talking about an interesting approach to a
tahoe frontend.
Imagine, if you will, a NAS box, to which your client connects via webdav or
some other convenient protocol. On this box sits a specialized webdav
server, a Tahoe node, and a bunch of (real) disk.
The server maintains a database. For each pathname visible to the client, the
database records two things: "file present on disk?" and "filecap in grid?".
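A minimal sketch of what that per-path record might look like, using SQLite; the table and column names are illustrative only, not an existing Tahoe schema:

```python
import sqlite3

def open_cache_db(path="cache.sqlite"):
    # Illustrative schema only: one row per pathname visible to the client.
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS cache_state (
            pathname    TEXT PRIMARY KEY,  -- path as the NAS client sees it
            local_file  TEXT,              -- on-disk copy, or NULL if evicted
            filecap     TEXT,              -- Tahoe filecap, or NULL if not yet uploaded
            last_write  REAL,              -- feeds the "has it stopped twitching?" timer
            last_access REAL               -- feeds the "is it still hot?" timer
        )
    """)
    return db
```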
When the client reads a file, the server checks to see if a real file is
present on disk, and if so, it satisfies the read with that file. If not, it
uses the filecap to satisfy whatever piece of the data the client requested
(e.g. with a Range: header), returns it to the client, writes it to local
disk, then (in the background) fills the rest of the local disk file with
data from the grid.
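A rough sketch of that read path, against the `cache_state` table above; `grid_read_range`, `write_range_to_local_copy`, and `schedule_background_fill` are hypothetical hooks, passed in here as callables:

```python
import os

def read(db, pathname, offset, length,
         grid_read_range, write_range_to_local_copy, schedule_background_fill):
    row = db.execute("SELECT local_file, filecap FROM cache_state "
                     "WHERE pathname = ?", (pathname,)).fetchone()
    if row is None:
        raise FileNotFoundError(pathname)
    local_file, filecap = row
    # Prefer the real file on local disk.
    if local_file and os.path.exists(local_file):
        with open(local_file, "rb") as f:
            f.seek(offset)
            return f.read(length)
    # No local copy: serve just the requested piece from the grid (e.g. the
    # range a Range: header asked for), cache it, and fill in the rest of the
    # local file in the background.
    data = grid_read_range(filecap, offset, length)
    write_range_to_local_copy(db, pathname, offset, data)
    schedule_background_fill(pathname, filecap)
    return data
```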
On write, the server writes data to a real local file. Later, when the file
has stopped twitching, the server uploads the file into the grid and updates
the database to reflect the filecap.
Much later, when the server concludes that this file is no longer "hot", it
removes the local disk copy. There are two separate timers: one to decide
when the contents are stable, another to decide when the file is no longer
interesting enough to spend local disk space on. The latter timer is likely
to be related to the amount of disk space available.
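The two timers might look something like this, again against the sketched `cache_state` table; `upload_to_grid` and `remove_local_copy` are hypothetical hooks, and the thresholds are placeholders (the second one would in practice track free disk space):

```python
import time

STABLE_AFTER = 60            # seconds with no writes before uploading (placeholder)
COLD_AFTER = 7 * 24 * 3600   # seconds with no activity before eviction (placeholder)

def run_maintenance_pass(db, upload_to_grid, remove_local_copy):
    now = time.time()

    # Timer 1: the file has stopped twitching, so push it into the grid.
    stable = db.execute(
        "SELECT pathname, local_file FROM cache_state "
        "WHERE local_file IS NOT NULL AND filecap IS NULL AND ? - last_write > ?",
        (now, STABLE_AFTER)).fetchall()
    for pathname, local_file in stable:
        filecap = upload_to_grid(local_file)
        db.execute("UPDATE cache_state SET filecap = ? WHERE pathname = ?",
                   (filecap, pathname))

    # Timer 2: the file is no longer hot, so reclaim its local disk space.
    cold = db.execute(
        "SELECT pathname, local_file FROM cache_state "
        "WHERE local_file IS NOT NULL AND filecap IS NOT NULL AND ? - last_access > ?",
        (now, COLD_AFTER)).fetchall()
    for pathname, local_file in cold:
        remove_local_copy(local_file)
        db.execute("UPDATE cache_state SET local_file = NULL WHERE pathname = ?",
                   (pathname,))
    db.commit()
```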
From the client's point of view, this is just a NAS box that occasionally
suffers from higher-than-normal latency, but all of its contents eventually
show up on a tahoe backup grid.
Shared directories must be tolerated somehow. I imagine that the server
maintains a cache of dirnode contents (so that the client sees directories
load quickly), but when a client references a given path, the cached dirnodes
on that path are refreshed more quickly than the others. And of course any
UCWE surprises are cause for refreshing a lot of dirnodes. With a real on-disk
copy of the file, the server could deal with collisions by presenting the old
version, the new local version, and the new upstream version, and let the
user sort it out.
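For the collision case, the server could simply materialise all three versions under names the client can see; a tiny sketch, with a made-up naming convention:

```python
import shutil

def expose_conflict(client_path, old_copy, local_copy, upstream_copy):
    # Keep all three versions visible and let the user sort it out.
    # The ".conflict-*" suffixes are made up for illustration.
    shutil.copyfile(old_copy, client_path)                          # last agreed version
    shutil.copyfile(local_copy, client_path + ".conflict-local")    # this NAS's edits
    shutil.copyfile(upstream_copy, client_path + ".conflict-grid")  # edits from the grid
```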
This idea has been partially explored before, both by the Windows FUSE-like code that
shipped with the allmydata.com client, and in the OS X FUSE code
("blackmatch") written by Rob Kinninmont. But neither of these is
particularly general or available for widespread use.
This is not really a different frontend; it's just adding a cache to client storage nodes. The cache would be shared between frontend protocols (HTTP(S)/WebDAV, SFTP, and FTP), which is an advantage over putting a forward proxy between the node and the frontend client. Being able to cache only ciphertext is also an advantage.
The ticket description essentially describes write-behind caching. For this kind of system, there are two significant 'commit' events for any write: the point at which the data is durably stored on the gateway's local disk (it is "locally durable"), and the point at which enough shares are durably stored on grid servers (it is "globally durable").
Unfortunately, existing apps and filesystem protocols don't distinguish these events. Ideally an operation such as an HTTP PUT or a file close() would notify its client twice, once for the local commit and once for the global commit (this abstraction would be a good fit for distributed storage systems in general, not just Tahoe). But since they don't, we have to choose when to notify the client that the operation has succeeded. The client doesn't have any way to tell us which kind of notification it wants (there are no relevant flags defined by POSIX, HTTP, or SFTP), so we are stuck with an unsatisfying choice between performance and durability.
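Tahoe's code is Twisted-based, so the two notifications could naturally be modelled as a pair of Deferreds. This is only a sketch of the abstraction, not an existing Tahoe interface:

```python
from twisted.internet.defer import Deferred

class WriteReceipt:
    """Hypothetical result of a write: two separate commit notifications."""
    def __init__(self):
        self.locally_durable = Deferred()   # fires when the data is safe on the
                                            # gateway's local disk
        self.globally_durable = Deferred()  # fires when enough shares are stored
                                            # on grid servers

# A WebDAV PUT handler would then answer the HTTP request when
# locally_durable fires (fast, write-behind semantics) or when
# globally_durable fires (slow, write-through semantics), which is exactly
# the choice that HTTP, SFTP, and POSIX give the client no way to express.
```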
I don't know how to resolve this, but I do think that caching on the gateway is going to be essential for good WebDAV or SFTP performance. sshfs does maintain a local cache, but when that cache is switched on it essentially assumes that only this instance of sshfs is accessing the filesystem; unless we switch it off (see http://code.google.com/p/macfuse/wiki/FAQ#sshfs), sshfs users would not see updates made to the Tahoe filesystem by other users. davfs2 would presumably have similar issues.
Ah, I didn't notice that warner had opened a ticket on this; I'll add some comments.
In the use case I was imagining, 'locally durable' is all that matters. While there may be multiple readers/writers on the local network, this is a grid-backed NAS application, not a global sharing application. I know that Tahoe gives us the latter for free (though with some caveats); the point is that I'm quite happy to make the experience from other gateways suboptimal for this application.
Further, I wasn't imagining that we would stop spending local storage on the file, but rather that we would replace the plaintext copy with k shares when a file had 'cooled'.
Thus, a file is cached (available on a local NAS as plaintext) while it's hot, available from local ciphertext when cool, and recoverable from the grid in the event of a device failure. I know this is dependent on a bunch of other tickets I don't have at my fingertips.
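The local-storage lifecycle described here might be summarised like this (the state names are made up; in every state a full set of shares also lives in the grid, which is what makes the data recoverable after a device failure):

```python
from enum import Enum, auto

class LocalCopy(Enum):
    HOT = auto()   # plaintext file on the NAS disk: reads are plain local reads
    COOL = auto()  # plaintext removed, but k shares kept locally, so the file
                   # can be rebuilt without touching the grid (decode cost only)
```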
The functionality described in this ticket could alternatively be achieved by detecting changes to local files using inotify/pnotify/NTFS change journals and queueing them for upload to Tahoe. This is the approach that Dropbox uses. It has some significant advantages over a network filesystem in application compatibility and performance.
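A minimal sketch of that change-detection approach, using the third-party watchdog package (which wraps inotify on Linux); the worker that drains the queue and actually uploads to Tahoe (e.g. via the web-API) is left out:

```python
import queue
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

upload_queue = queue.Queue()

class QueueChanges(FileSystemEventHandler):
    # Push every changed file onto a queue for later upload to Tahoe.
    def on_modified(self, event):
        if not event.is_directory:
            upload_queue.put(event.src_path)
    on_created = on_modified

def watch(directory):
    observer = Observer()
    observer.schedule(QueueChanges(), directory, recursive=True)
    observer.start()
    return observer
```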
There is prior art that should be studied, in particular Coda. Coda's primary concept is that files are stored on replicated servers, and that clients have a cache. In this way it is similar to AFS, which can satisfy reads from the cache when no servers are reachable. Coda adds the ability to do disconnected writes, where changes are cached and reintegrated. This in turn requires fairly complex conflict detection and resolution code.
Not required by the above vision, but also present in Coda, is a kernel module (for Linux and for various BSDs) that implements vnodeops and passes operations to userspace. This is similar to FUSE, but predates it; I've been using it since probably 1997. One of Coda's design goals is to be efficient once you have the file: operations on a file in Coda are actually done on the container file, which is a normal file on the local disk, and these operations are short-circuited in the kernel so they are almost as fast as local file operations. On read of a file that doesn't have a container file, there is a pause while it's faulted in, and on close of a file that was opened for writing, a store operation begins.
Were Coda to start over now, it would use FUSE, and FUSE would be extended to have fast container-file redirects. I would then argue that the caching FUSE module is generic and could serve both a Coda backend and a Tahoe backend, or at least could be written so that most code is shared.
Replying to davidsarah's comment above about detecting local changes with inotify: this approach is now implemented by #1429, on Linux, albeit only for a single directory at present.