add 'tahoe backup' command: fast versioned readonly backups #598
Reference: tahoe-lafs/trac-2024-07-25#598
As a complement to the only-the-latest-version 'tahoe sync' command described
in #597, I'd like to have a full-featured multiple-version 'tahoe backup'
command too. This would behave like the existing Windows-only allmydata.com
backup tool:
tahoe backup LOCALDIR ALIAS:BACKUPBASEDIR
LOCALDIR refers to a directory on the local disk. ALIAS:BACKUPBASEDIR will
refer to a writeable Tahoe directory; it will be created if it does not
already exist.
Each time this is run, ALIAS:BACKUPBASEDIR/$TIMESTAMP will be created, as a
read-only directory that contains an exact mirror of the local disk's
LOCALDIR subtree. In addition, ALIAS:BACKUPBASEDIR/Latest will be a read-only
reference to the same directory. Over time, BACKUPBASEDIR/ will be filled
with a series of timestamped directories, containing historical backups.
Whenever possible, $TIMESTAMPn will contain references to files and
directories created under $TIMESTAMPn-1; i.e. backups will share unchanged
objects with earlier backups. Each backup, once finished, will not be changed
again. If/when Tahoe acquires immutable dirnodes, 'tahoe backup' will take
advantage of them. Meanwhile, it will use read-only dirnodes, by throwing out
the write-cap for the $TIMESTAMP directory when the backup is done.
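To illustrate the sharing between $TIMESTAMPn and $TIMESTAMPn-1, here is a minimal sketch that models BACKUPBASEDIR as a plain Python dict rather than a real Tahoe directory. Everything in it is an assumption made for illustration: the `take_snapshot` name, the `upload` callable (a stand-in for whatever actually stores a file and returns a cap), the size+mtime comparison, and the timestamp format, which the ticket does not specify.

```python
import os
import time


def take_snapshot(localdir, basedir, upload, now=None):
    """Add a timestamped snapshot of localdir to basedir (a plain dict here).

    basedir maps snapshot names to {relpath: (size, mtime, cap)}.
    upload(path) is a hypothetical callable that returns a cap string.
    """
    previous = basedir.get("Latest", {})
    snapshot = {}
    for root, _dirs, files in os.walk(localdir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, localdir)
            st = os.stat(path)
            old = previous.get(rel)
            if old is not None and old[:2] == (st.st_size, st.st_mtime):
                # Looks unchanged: share the object from the earlier backup.
                cap = old[2]
            else:
                cap = upload(path)
            snapshot[rel] = (st.st_size, st.st_mtime, cap)
    # The timestamp format is an assumption, not taken from the ticket.
    stamp = time.strftime("%Y-%m-%d_%H:%M:%SZ", time.gmtime(now))
    basedir[stamp] = snapshot
    basedir["Latest"] = snapshot  # Latest refers to the same snapshot
    return stamp
```

Running this twice over an unchanged LOCALDIR calls `upload` only during the first run; the second snapshot reuses every cap from the first.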
This will use the same backupdb as described in #597 to reduce the amount of
work that must be done for unchanged files.
A basic backup system could be constructed by simply running 'tahoe backup'
in a cron job. It might be a good idea to have a lockfile of some sort to
make this usage safer (i.e. prevent a backup that overruns its interval from
overlapping with the next scheduled run).
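One possible shape for such a lockfile, as a sketch only: a non-blocking exclusive `flock` on a file whose path the caller chooses. The `acquire_lock` name and the lock path are hypothetical, and this relies on the Unix-only `fcntl` module, so it would not help on Windows.

```python
import fcntl
import os


def acquire_lock(lock_path):
    """Return an open fd holding an exclusive lock, or None if it is held."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes this fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A cron-driven wrapper would call `acquire_lock` first and exit quietly when it returns None. The kernel drops the lock when the process exits, so a crashed backup does not leave a stale lock behind.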
Looks like this is more important than #597.
The basic flowchart I've got in mind:
If upload-with-backupdb works as described in
http://allmydata.org/pipermail/tahoe-dev/2008-May/000620.html , then the
workload of a null backup will be the recursive read of the entire
most-recent-version subtree. To avoid even that:
and look for the result in the table
which is not used in a 'tahoe backup' run should be discarded at the end
of that run)
With that in place, a null backup should involve nothing but local stat()
calls.
Some data points: home directory sizes on some developers' machines:
So, to use "tahoe backup" on these systems, the backupdb must be able to efficiently manage a million entries. I think this is too big for a simple pickle to handle well.
I'll do some experimentation, but my current plan is to use two sqlite databases: one for the file-oriented backupdb, and a second for the directory-contents db.
Going forward, of course, it would be nice to allow the use of mysql or postgres. But sqlite is in the python2.5 stdlib, has a synchronous interface (which makes the implementation of tahoe_backup.py a bit easier), and doesn't require any external setup, whereas mysql/postgres would require a separate process to be configured and a DB to be set up, along with user-account setup. Another question is whether to use sqlite directly or to use the Axiom layer (which we're using as an experiment in the disk-watcher). I'm inclined to use sqlite directly, again to avoid lots of new dependencies.
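As a rough sketch of what the file-oriented backupdb could look like with the stdlib `sqlite3` module: the table name, columns, and helper functions below are all hypothetical and are not the schema that was eventually implemented. The idea is only that a file whose size and mtime match the stored row is treated as unchanged, so checking it costs one local stat() and one indexed lookup.

```python
import os
import sqlite3

# Hypothetical schema: one row per local file that has been uploaded.
SCHEMA = """
CREATE TABLE IF NOT EXISTS local_files (
    path    TEXT PRIMARY KEY,
    size    INTEGER NOT NULL,
    mtime   REAL NOT NULL,
    filecap TEXT NOT NULL
)
"""


def open_backupdb(dbfile):
    db = sqlite3.connect(dbfile)
    db.execute(SCHEMA)
    return db


def check_file(db, path):
    """Return the stored filecap if path looks unchanged, else None."""
    st = os.stat(path)
    row = db.execute(
        "SELECT size, mtime, filecap FROM local_files WHERE path = ?",
        (path,),
    ).fetchone()
    if row is not None and row[0] == st.st_size and row[1] == st.st_mtime:
        return row[2]
    return None


def record_upload(db, path, filecap):
    """Remember the size and mtime that path had when it was uploaded."""
    st = os.stat(path)
    db.execute(
        "INSERT OR REPLACE INTO local_files VALUES (?, ?, ?, ?)",
        (path, st.st_size, st.st_mtime, filecap),
    )
    db.commit()
```

With `path` as the primary key, lookups stay indexed even with a million or so rows, which is the scale the home-directory numbers above suggest.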
zooko's system with 153k dirs and 1306k files has about 69GB of data
changeset:cfce8b5eab431772 has the first cut: no backupdb, but the other functionality is there.
My system with 153k dirs and 1306k files has 35,350 files which are duplicates -- that set of 35,350 files has only 17,675 unique md5 hashes.
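For anyone who wants to gather the same kind of number on their own machine, here is one way it could be done. This is a sketch and not necessarily how the figures above were produced; the `duplicate_stats` name is made up, and symlinks are skipped as an arbitrary choice.

```python
import hashlib
import os
from collections import defaultdict


def duplicate_stats(root):
    """Return (files that have a duplicate, unique md5 hashes among them)."""
    by_hash = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.md5()
            with open(path, "rb") as f:
                # Hash in 1 MiB chunks so large files are not read at once.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
    groups = [paths for paths in by_hash.values() if len(paths) > 1]
    return sum(len(g) for g in groups), len(groups)
```

With the numbers above, 35,350 duplicated files over 17,675 hashes works out to an average of exactly two copies per duplicated file.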
Note that I'm adding Cc: tahoe-dev@allmydata.org to this ticket, so until that Cc: is removed any comments posted here will be mailed to the list.
Done. changeset:177ffa0870390c6e was the last patch: the "tahoe backup" command now uses the backupdb and avoids uploading any file that appears to be unchanged. I'll create a separate ticket (#606) for adding a directory cache to the backupdb. That can be a future enhancement that will improve performance even further.
I've run a few small benchmarks uploading one of my darcs repos to the production grid. I first uploaded it using "tahoe cp -r -v", and then I uploaded a tar (not compressed) of the same data. The repo is composed of 67 dirs and 4098 files; the tar size is 27 MB. The "cp -r -v" took roughly 3.5 hours; the "cp repo.tar" took 760 seconds.
The client is configured to use a helper.
Here are the stats for one of the files involved in the first upload:
Next, the stats for the tar upload:
This small test demonstrated an overhead of 1.5 to 2 seconds for every upload operation.
Lastly, here are the results of a "du --si $repo; find $repo -type f |wc -l; find $repo -type d |wc -l" command: