"tahoe cp" should avoid full upload/download when the destination already exists (using backupdb and/or plaintext hashes) #658

warner · 2009-03-10T20:41:01Z

warner commented

2009-03-10 20:41:01 +00:00

Now that the backupdb seems to be working well for "tahoe backup", it's time to extend "tahoe cp" to use it too.

In the upload direction (tahoe cp LOCAL REMOTE), the backupdb should be used to let us skip a new upload of a file that's already been uploaded. The goal is to allow periodic "tahoe cp LOCAL REMOTE" (with fixed values of LOCAL and REMOTE) to do as little work as possible.

In the download direction (tahoe cp REMOTE LOCAL), the backupdb should also be used, to let us skip a download of a file that's already been downloaded. When a Tahoe file is downloaded and written to local disk, a path+timestamps-to-URI entry should be added to the db. Before downloading a file to local disk, the disk should be checked for an existing file with the same timestamps: if present, and if the URI matches the URI that was going to be downloaded, the download should be skipped.
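The download-side check described above could be sketched roughly as follows. This is only an illustration, not the actual backupdb schema or API: the `downloads` table, its columns, and the helper names `open_db`, `record_download`, and `can_skip_download` are all hypothetical, and file size is stored alongside the mtime as an extra sanity check.

```python
import os
import sqlite3


def open_db(path=":memory:"):
    # Hypothetical table; the real backupdb schema may look quite different.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, uri TEXT)"
    )
    return db


def record_download(db, path, uri):
    # Call after a file has been downloaded and written to local disk.
    st = os.stat(path)
    db.execute(
        "INSERT OR REPLACE INTO downloads VALUES (?, ?, ?, ?)",
        (os.path.abspath(path), st.st_size, st.st_mtime_ns, uri),
    )
    db.commit()


def can_skip_download(db, path, uri):
    # Skip only if the local file exists, its size and mtime match the
    # recorded entry, and the recorded URI is the one about to be downloaded.
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    row = db.execute(
        "SELECT size, mtime_ns, uri FROM downloads WHERE path = ?",
        (os.path.abspath(path),),
    ).fetchone()
    return row is not None and row == (st.st_size, st.st_mtime_ns, uri)
```

A caller would run `can_skip_download(db, local_path, uri)` before fetching and `record_download(db, local_path, uri)` after each successful write. The upload direction would presumably use the mirror-image lookup.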

warner added this to the undecided milestone 2009-03-10 20:41:01 +00:00

davidsarah commented

2009-12-07 02:49:43 +00:00

I think this should be gated by an option that is not the default (or else make it the default for a new command called something other than `cp`). Otherwise, if anything goes wrong then it won't be obvious that the backupdb could be at fault; users are likely to consider `tahoe cp` to be a lower-level operation that copies files unconditionally, like Unix `cp` does.

davidsarah commented

2009-12-07 03:08:46 +00:00

Plaintext hashes would be a more robust way of doing this than URI+timestamp (but dependent on #453).

IOW, for downloading a file:

* if the source cap is to an immutable file, the read cap might be sufficient to verify that the existing copy has the same plaintext hash.
* if the source cap is to a mutable file, `cp` would need to go to the servers to find the consensus value for the plaintext hash of the current version. Then it would proceed as for an immutable file.

If the existing file is the correct one, it should still be touched to update its mtime.
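A rough sketch of the local half of that check, under loud assumptions: plain SHA-256 over the file contents stands in for whatever plaintext hash #453 ends up defining, and `plaintext_hash` and `skip_if_unchanged` are hypothetical names, not existing Tahoe functions. How the expected hash is obtained from the read cap or the servers is not shown.

```python
import hashlib
import os


def plaintext_hash(path, chunk_size=1 << 16):
    # Stand-in hash: plain SHA-256 of the file contents, read in chunks.
    # The real plaintext hash construction (#453) may differ.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def skip_if_unchanged(local_path, expected_hash):
    # Returns True (after touching the file) if the existing local copy
    # already matches the expected hash, so the download can be skipped.
    if not os.path.isfile(local_path):
        return False
    if plaintext_hash(local_path) != expected_hash:
        return False
    os.utime(local_path, None)  # like `touch`: set atime/mtime to now
    return True
```

Hashing the local file costs a full read of it, but that is still much cheaper than a download whenever the network is the bottleneck.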

For uploading a file, if there is an existing copy then you would have to verify it.

The storage server protocol and webapi would need to be able to return a hash of the file first. (See http://www.usenix.org/events/nsdi04/tech/full_papers/mogul/mogul.pdf for a similar protocol with some relevant discussion of design issues.)

tahoe-lafs changed title from ~~"tahoe cp" should use backupdb, in both directions~~ to "tahoe cp" should avoid full upload/download when the destination already exists (using backupdb and/or plaintext hashes)

2009-12-07 03:22:42 +00:00

daira commented

2015-04-17 22:54:51 +00:00

This may interact with the planned magic folder db (see source:docs/proposed/magic-folder/filesystem-integration.rst).
