tahoe backup should be able to backup symlinks #641

Open
opened 2009-02-24 01:23:27 +00:00 by francois · 20 comments
francois commented 2009-02-24 01:23:27 +00:00
Owner

Running tahoe backup on a directory containing a symbolic link currently doesn't work. It raises the following exception instead.

Traceback (most recent call last):
  File "/home/francois/dev/tahoe/support/bin/tahoe", line 8, in <module>
    load_entry_point('allmydata-tahoe==1.2.0-r3615', 'console_scripts', 'tahoe')()
  File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 91, in run
    rc = runner(sys.argv[1:])
  File "/home/francois/dev/tahoe/src/allmydata/scripts/runner.py", line 78, in runner
    rc = cli.dispatch[command](so)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/cli.py", line 359, in backup
    rc = tahoe_backup.backup(options)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_backup.py", line 353, in backup
    return bu.run()
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_backup.py", line 198, in run
    new_backup_dircap = self.process(options.from_dir, latest_backup_dircap)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_backup.py", line 245, in process
    newchilddircap = self.process(childpath, oldchildcap)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_backup.py", line 245, in process
    newchilddircap = self.process(childpath, oldchildcap)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_backup.py", line 245, in process
    newchilddircap = self.process(childpath, oldchildcap)
  File "/home/francois/dev/tahoe/src/allmydata/scripts/tahoe_backup.py", line 251, in process
    raise RuntimeError("how do I back this up?" % childpath)
RuntimeError: how do I back this up?
tahoe-lafs added the
code-frontend-cli
minor
defect
1.3.0
labels 2009-02-24 01:23:27 +00:00
tahoe-lafs added this to the undecided milestone 2009-02-24 01:23:27 +00:00
francois commented 2009-02-24 01:28:53 +00:00
Author
Owner

Well, it's perhaps easier to discard them for now and simply display a warning message.

francois commented 2009-02-24 18:35:20 +00:00
Author
Owner

Attachment bug-641.dpatch (19195 bytes) added

francois commented 2009-02-24 18:36:12 +00:00
Author
Owner

Here's a patch which makes tahoe backup ignore symlinks.

azazel commented 2009-02-25 01:09:29 +00:00
Author
Owner

I've made a patch which, unlike yours, skips everything that isn't a file or a directory. This also works for files that are Unix sockets, devices and so on.
Please note that non-dangling links (that is, their targets) get backed up with or without your patch. Only dangling links are dangerous.
I've attached to this ticket a patch file, 'small_symlink_test.patch', which is really a hack that alters your code to do a much simpler test without using any other function call or temp dir. When run under Linux, it demonstrates that links with a real target work, and that maybe your test code fails somewhere at being a really useful test?
Now I'm too tired and I'll look at it in more detail tomorrow; maybe I'll end up with a franken-patch that glues together the best of the two. Have a look at my patches.
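
To make the distinction above concrete, here is a minimal sketch of how a path could be classified before deciding whether to back it up. It is not taken from either attached patch; the function name `classify` and its return labels are made up for illustration, and it assumes a POSIX system.

```python
import os
import stat


def classify(path):
    """Return 'file', 'directory', 'symlink', 'dangling-symlink' or 'special'.

    Hypothetical helper for illustration only; not part of tahoe.
    """
    mode = os.lstat(path).st_mode  # lstat() does not follow symlinks
    if stat.S_ISLNK(mode):
        # os.path.exists() follows the link, so False means the target is missing.
        return "symlink" if os.path.exists(path) else "dangling-symlink"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "file"
    # Sockets, FIFOs, block and character devices all end up here.
    return "special"
```

A backup loop built on this could upload 'file' entries, recurse into 'directory' entries, and print a warning for everything else instead of raising.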

azazel commented 2009-02-25 01:09:49 +00:00
Author
Owner

Attachment small_symlink_test.patch (2493 bytes) added

azazel commented 2009-02-25 01:10:03 +00:00
Author
Owner

Attachment half-fix-for-bug-641.dpatch (23934 bytes) added

swillden commented 2009-02-25 02:20:36 +00:00
Author
Owner

While you're at it, you might want to consider also skipping directories which are on other devices. I think it's generally a bad idea to recurse into a network share unless it's been specifically requested. To do that, just look at the st_dev field from lstat. If it doesn't match the st_dev of the parent directory, skip it.

This one is somewhat debatable. For me, I'd rather have it skip network shares because my file server has terabytes of stuff on it and if the backup process goes in there it will never get to the rest of the stuff I want it to back up. Perhaps others have a different perspective.
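
The st_dev comparison described above could look roughly like the following. This is only a sketch under the assumption of a POSIX system; `same_device_subdirs` is a hypothetical name and nothing like it exists in tahoe.

```python
import os
import stat


def same_device_subdirs(parent):
    """Return the subdirectories of `parent` that are on the same device.

    Hypothetical helper for illustration only.
    """
    parent_dev = os.lstat(parent).st_dev
    kept = []
    for name in sorted(os.listdir(parent)):
        child = os.path.join(parent, name)
        st = os.lstat(child)  # lstat() so that symlinks are not followed
        if not stat.S_ISDIR(st.st_mode):
            continue
        if st.st_dev != parent_dev:
            # A different st_dev means a mount point (for example a
            # network share), so it is skipped.
            continue
        kept.append(child)
    return kept
```

Because the check uses lstat, a symlink pointing at a directory is not treated as a directory here; whether that is desirable is a separate question from crossing devices.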


What's the status of this patch? I've been running it in one of my local sandboxes for weeks now, and I just now obliterated those patches in order to test something closer to current trunk. It looks like none of the patches in this ticket has good unit tests yet.

francois commented 2009-05-24 21:20:32 +00:00
Author
Owner

What about mimicking rsync's behavior? It's probably much more intuitive for users to have a consistent default behavior, with special cases enabled through additional CLI arguments.

By default, if no special argument is given, follow symlinks, cross filesystem boundaries and don't save any special files (FIFOs, devices and sockets). If a dangling symlink is encountered, display a warning and continue.

Implement new CLI arguments to change this behavior:

 -x, --one-file-system   don’t cross filesystem boundaries
 --devices               preserve device files
 --specials              preserve special files
 -l, --links             copy symlinks as symlinks

Note that implementing the last three options requires a way to store the file type and its associated parameters in metadata.
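
As a rough illustration of the proposed flags, here is how they could be declared with Python's standard argparse module. This is only a sketch: it does not reflect how tahoe's CLI options are actually defined, and `build_parser`, the program name and the `from_dir` positional are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser with the four rsync-style flags proposed above.

    Hypothetical sketch; tahoe's real option handling is not shown here.
    """
    parser = argparse.ArgumentParser(prog="backup-sketch")
    parser.add_argument("-x", "--one-file-system", action="store_true",
                        help="don't cross filesystem boundaries")
    parser.add_argument("--devices", action="store_true",
                        help="preserve device files")
    parser.add_argument("--specials", action="store_true",
                        help="preserve special files")
    parser.add_argument("-l", "--links", action="store_true",
                        help="copy symlinks as symlinks")
    parser.add_argument("from_dir", help="directory to back up")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

All four flags default to off, which matches the proposed default of following symlinks, crossing filesystem boundaries and skipping special files.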


I've started using 'tahoe backup' for serious personal use, so I'm starting
to run into these sorts of problems. My first workaround was to hack my
"tahoe backup" client to skip over symlinks.

I like the idea of matching rsync's options, except that we don't have a way
to record non-files yet, so we can't actually implement --devices,
--specials, or --links. Our current default behavior is to follow
directory symlinks, but abort when we encounter a file symlink.

If our cap-string scheme were general enough, I'd say we should create a cap
type that says "here is a filecap, treat its contents as the target of a
symlink" (just like our dircaps say "here is a filecap, treat its contents as
an encoded directory table"). But that's a deeper change.. still appropriate
for this ticket, which after all says "tahoe backup should be able to backup
symlinks", but represents more work than I want to do right now.

Right now, I just want to be able to use "tahoe backup" even though my home
directory has a couple of symlinks in it. I'd be happy with an option to skip
symlinks altogether (whether they point to files or directories), or to skip
file-symlinks. And I'd be happy if we always skipped the special things like
devices and sockets.. I don't have any of those in my home directory..
they're only in /tmp/ and /dev/ and places that I'm not yet trying to back
up.

davidsarah commented 2009-12-21 00:30:04 +00:00
Author
Owner

#729 is an instance of the same problem.


For now (i.e. for 1.6.0), I'm going to have "tahoe backup" skip all symlinks, emitting the same "WARNING: cannot backup special file %s" message that you get with device files and named pipes.

davidsarah commented 2011-03-21 16:26:08 +00:00
Author
Owner

From the duplicate #1380 filed by gdt:

When running backup on a directory (which is in coda, which probably doesn't matter), I get

WARNING: cannot backup symlink '/blah/blah'

I consider symlinks important. I realize tahoe doesn't have them (maybe it should) but this points out that tahoe backup is not a satisfactory general solution.
(I would argue that the backup program and the filesystem used to store the files comprising the backup database should be independent anyway.)

socrates1024 commented 2011-11-30 21:12:04 +00:00
Author
Owner

Attachment 641-symlink-depth-limit-1.darcs.patch (67525 bytes) added

socrates1024 commented 2011-11-30 21:15:30 +00:00
Author
Owner

I would also like for "tahoe backup" to handle symlinks. Most specifically, I like to symlink directories I want backed-up into my main "Dropbox" folder (the target of "tahoe backup" in my crontab).

After a few experiments with Dropbox, it seems that Dropbox 'follows' symlinks up to a limited depth, but it doesn't 'preserve' the symlinks (i.e. it does not behave like rsync --links). There seem to be a handful of hazards with following symlinks: you can have infinite recursion if circular symlinks aren't detected, and even without recursion, symlinks can cause redundant data to be stored.

I'm attaching a patch just to show my approach so far, to enforce a symlink depth limit of 3 (for directories only). I'll look into making tests that show how this approach behaves. For my immediate personal needs, this is already a solution.
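
To illustrate the general idea of a symlink depth limit (this is a standalone sketch, not the code from the attached patch), the walk below follows directory symlinks but stops once more than `limit` of them have been traversed on the current path. The names `walk_with_symlink_limit` and `MAX_SYMLINK_DEPTH` are hypothetical, and file symlinks are simply skipped here.

```python
import os

MAX_SYMLINK_DEPTH = 3  # assumed default, matching the limit mentioned above


def walk_with_symlink_limit(path, symlink_depth=0, limit=MAX_SYMLINK_DEPTH):
    """Yield regular files under `path`, following at most `limit`
    directory symlinks along any one path. Illustration only."""
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        depth = symlink_depth
        if os.path.islink(child):
            if not os.path.isdir(child):
                continue  # file symlinks and dangling links are skipped
            depth += 1
            if depth > limit:
                continue  # too many symlinks traversed on this path
        if os.path.isdir(child):
            yield from walk_with_symlink_limit(child, depth, limit)
        elif os.path.isfile(child):
            yield child
```

With a directory that contains one file and a symlink pointing back at the directory itself, this yields the same file four times under different paths, which shows both hazards at once: the cycle is cut off only by the limit, and the data is stored redundantly.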


I rewrote my previous patch from 4 months ago (I forgot I ever posted it here) but nothing has changed in my approach.

I have now added a unit test that creates a directory with a symlink cycle and shows what happens. Cycles are only followed up to 3 levels deep. Other notable behavior is that multiple symlinks to the same file will be uploaded to Tahoe-LAFS multiple times as separate files.

https://github.com/amiller/tahoe-lafs/pull/1.patch

tahoe-lafs added
normal
and removed
minor
labels 2012-03-31 03:07:37 +00:00

Per this mailing list discussion (https://tahoe-lafs.org/pipermail/tahoe-dev/2012-April/007233.html), a better way to detect cycles than counting how many symlinks you've traversed is to examine the dev and inode of each thing and raise an exception about recursive symlinks if you encounter the same one a second time. That way we can handle an arbitrarily deep nest of symlinks.

Here's some code I wrote for a different tool that uses dev and inode to identify files:

https://tahoe-lafs.org/trac/dupfilefind/browser/trunk/dupfilefind/dff.py?annotate=blame
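
A minimal sketch of the dev/inode idea, assuming a POSIX system, might look like the following. It is not taken from dupfilefind or from any patch on this ticket; `walk_detecting_cycles` and `RecursiveSymlinkError` are hypothetical names. One design choice to note: the sketch only remembers the directories on the current path from the root, so two separate symlinks to the same directory are both followed, and only a genuine cycle raises.

```python
import os


class RecursiveSymlinkError(Exception):
    """Raised when a directory is reached again from inside itself."""


def walk_detecting_cycles(path, _ancestors=frozenset()):
    """Yield regular files under `path`, following directory symlinks
    and raising RecursiveSymlinkError on a cycle. Illustration only."""
    st = os.stat(path)  # stat() follows symlinks, identifying the real directory
    key = (st.st_dev, st.st_ino)
    if key in _ancestors:
        raise RecursiveSymlinkError("recursive symlink at %r" % (path,))
    ancestors = _ancestors | {key}
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            yield from walk_detecting_cycles(child, ancestors)
        elif os.path.isfile(child):
            yield child
```

Since this is a generator, the exception surfaces while the results are being consumed rather than when the function is first called. A real implementation might prefer to warn and skip the offending link instead of aborting the whole backup.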
Author
Owner

Replying to zooko:

imho, it would be a good idea to keep backing up symlinks optional (well, /make/ it optional)


I don't want the "limit it to K levels deep" approach, so I'm unsetting review-needed. Thank you for your contribution, amiller!


I'm not sure of the status of this ticket, but I wanted to pass along my GitHub commit, which includes tests and is currently rebased against master.
https://github.com/amiller/tahoe-lafs/commit/3deafed1c790e076481032536260a29ba2007401

Reference: tahoe-lafs/trac-2024-07-25#641