consider removing some st_* fields from metadata #1946
Reference: tahoe-lafs/trac-2024-07-25#1946
Here is typical metadata for a Tahoe directory entry:
We should consider removing the `st_gid`, `st_uid`, `st_ino`, and `st_dev` fields, since they are not very meaningful and may constitute a privacy or anonymity risk for some uses of Tahoe.
I would advocate retaining a way to set those fields, and restore them, as they are useful and even desirable when using Tahoe for backups. They need not be stored by default for this use case.
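
To make the proposal concrete, here is a minimal sketch of "drop the four fields by default, keep them on request". Only the four field names come from this ticket; the helper name `strip_sensitive_fields`, the `keep_for_backup` flag, and the sample values are hypothetical, and this is not how Tahoe itself is implemented.

```python
# Field names are the ones proposed for removal in this ticket.
PRIVACY_SENSITIVE_FIELDS = frozenset({"st_gid", "st_uid", "st_ino", "st_dev"})


def strip_sensitive_fields(metadata, keep_for_backup=False):
    """Return a copy of a metadata dict, without the sensitive st_* fields
    unless the caller explicitly asks to keep them (hypothetical opt-in)."""
    if keep_for_backup:
        return dict(metadata)
    return {k: v for k, v in metadata.items()
            if k not in PRIVACY_SENSITIVE_FIELDS}


# Made-up sample values for illustration, not real Tahoe output.
sample = {
    "mtime": 1360183107.0,
    "st_mode": 0o100644,
    "st_uid": 1000,
    "st_gid": 1000,
    "st_ino": 123456,
    "st_dev": 2049,
}
stripped = strip_sensitive_fields(sample)
kept = strip_sensitive_fields(sample, keep_for_backup=True)
```

With the default, only `mtime` and `st_mode` survive; with `keep_for_backup=True` the dict is copied unchanged, which is the opt-in behaviour the backup use case would want.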
Note that for backup, we really need to store more than this, in particular permissions. (Maybe we can just support that for Unix; Windows NT ACLs are complicated.) So the amount of data we are currently storing falls between two stools.
Replying to daira:
Oh, sorry, Unix permissions are what `st_mode` is.
elb: for the record, I would want my 'tahoe backup' to record gid uid and mode
zooko: But, I use "tahoe backup" as my normal way to upload things to the grid.
zooko: Even though I consider uid and gid meaningless once the file has been uploaded to the tahoe-lafs grid.
cehteh: you could add another 'privacy' attribute which other tools then respect
I think this discussion is pointing out a fundamental confusion in tahoe. On one hand, it's a filesystem, which doesn't implement POSIX semantics, and we're told that FUSE/vfs access is troubling. On the other hand, it has a builtin backup program, even though there aren't really any good reasons to blur backup programs and filesystems (consider bup with the BUPDIR on tahoe).
I think `tahoe backup` was originally conceived as being, well, for "backup". And there was originally the idea that there would eventually be a `tahoe restore`. The assumption of a "backup and restore" use case is that the set of files and directories probably won't be shared or published while it is backed-up, and that it will eventually be restored to the same or a "similar" system as the one it came from.
However, I don't use `tahoe backup` that way. I use it as a good way to publish files to my LAFS grid. I almost always use `tahoe backup` whenever I'm uploading files, and I never "restore" them, and I often share them. Even if a `tahoe restore` command existed, and even if I decided to use it, then the system I was restoring to might not have the same uid and gid set as the original system.
So why do I use `tahoe backup` then if not for backup? Well, it is faster than "tahoe cp" or "tahoe put" or FUSE because it maintains a local cache db of files that it has already backed-up. It also keeps time-stamp-keyed snapshots of all previous versions that have been written to the grid. These are two very nice features.
What if we create a command named something like `tahoe snapshot` or `tahoe mirror` that has those two features, but does not have the `st_*` metadata, which I do not think is meaningful in my use case, and which could be a privacy leak? An advantage of this new `tahoe snapshot` command is that then more people would discover its existence and its usefulness, even when they have a "publish" kind of use-case instead of a "backup" kind of use case.
If you like this ticket, you might also like #1865, #897, or even [/query?status=!closed&keywords=~tahoe-backup&order=priority all open tickets about tahoe-backup]
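
For readers who have not used it, the two features described above (skip files that were already uploaded, and keep time-stamp-keyed snapshots) can be sketched roughly as follows. This is a simplified illustration only: the in-memory dict cache, the `(size, mtime)` comparison, and every function name are assumptions, not the actual `tahoe backup` implementation or its cache db schema.

```python
def needs_upload(cache, path, size, mtime):
    """True if the cache has no matching (size, mtime) record for this path."""
    return cache.get(path) != (size, mtime)


def take_snapshot(cache, snapshots, timestamp, files):
    """Record a snapshot under `timestamp` and return the paths that had
    to be uploaded. `files` is an iterable of (path, size, mtime) tuples."""
    uploaded = []
    listing = []
    for path, size, mtime in files:
        if needs_upload(cache, path, size, mtime):
            uploaded.append(path)          # a real tool would upload here
            cache[path] = (size, mtime)    # remember what was uploaded
        listing.append(path)
    snapshots[timestamp] = listing         # earlier snapshots are kept
    return uploaded


cache, snapshots = {}, {}
first = take_snapshot(cache, snapshots, "2013-04-01T00:00:00Z",
                      [("a.txt", 10, 100.0), ("b.txt", 20, 100.0)])
second = take_snapshot(cache, snapshots, "2013-04-02T00:00:00Z",
                       [("a.txt", 10, 100.0), ("b.txt", 25, 200.0)])
```

The second run uploads only `b.txt`, because `a.txt` is unchanged according to the cache, and both snapshots remain addressable by their timestamps. Nothing in this sketch needs uid, gid, inode, or device numbers.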
Here's why I don't like uid and gid the way they are currently stored here. This is kind of like Daira's complaint about falling between two stools. It might be cool to store extra information if it were unambiguously interpretable. For example, maybe if you had uid, gid, and "UUID of the disk partition from which the root filesystem was loaded", then you would later be able to tell (in a hypothetical future "restore" command) whether the stored uid and gid could be meaningfully copied back into the target of the restore. Or maybe that wouldn't work, I don't know. Maybe instead you need to store a copy of the `/etc/passwd` and `/etc/group` files so that you can check whether the target system has a sufficiently similar entry in its files as the source system had, for this uid and gid? But in any case I don't like "floating pieces of data which have broken from their anchors", because you can never safely re-anchor them, except by guessing or by asking a human user to guess. To me, uid and gid numbers without any way to recognize their context are that sort of "floating pieces of data". I know it's the Unix Way, but that doesn't mean I have to like it. Also, that tradition originated in a use-case where you might reasonably expect the sysadmin to write down what is needed to recognize their context (i.e. the name of the system from which this tarball was produced), and that's less true, at least in my experience, for the way `tahoe backup` is used.
zooko:
I get that, but I think it neglects the use case where a user really is backing up and restoring a particular system. That's a use case I care about. Now, I don't care if the stored information is a uid or a username or gid or group name or what (and the symbolic names may provide some measure of robustness to the disassociation you describe, but can't really be said to fix the problem), but I do care that the information is stored.
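
One way to picture the "anchoring" idea from the two comments above is to store the symbolic name next to the number and only reuse the number when the target system agrees. This is a sketch under assumptions: the function names and the dict-based lookups are hypothetical, and as noted above a matching name does not really prove it is the same account.

```python
def record_owner(uid, lookup_name):
    """Store the uid together with the name it had on the source system.
    `lookup_name` maps a uid to a name, or to None if the uid is unknown."""
    return {"uid": uid, "name": lookup_name(uid)}


def can_restore_uid(owner, lookup_name):
    """True only if the target system maps the same uid to the same name."""
    if owner["name"] is None:
        return False  # no anchor was recorded, so any reuse would be a guess
    return lookup_name(owner["uid"]) == owner["name"]


# Stand-ins for the user databases of three hypothetical systems.
source_users = {1000: "alice"}
target_same = {1000: "alice"}
target_other = {1000: "bob"}

owner = record_owner(1000, source_users.get)
ok_same = can_restore_uid(owner, target_same.get)
ok_other = can_restore_uid(owner, target_other.get)
```

On a Unix system the lookup could presumably be built on the standard `pwd.getpwuid`, which raises `KeyError` for an unknown uid; the dicts here just keep the example runnable anywhere.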
Stepping back, I think metadata gets used for three things, and we should think about them separately:
My complaint about 'tahoe backup' is that I don't see any good reason to couple the backup program and the filesystem. All the reasons about local storage of metadata about what's there, etc. should apply to any back-end storage.
By the way, I have, in one tree, 695,000 files. Can that be right? Anyway, I have a lot of them, so the space cost of a little bit of metadata on each one might be significant for me.
I just checked one of the trees I want to back up; 1.1M files. That's certainly an argument for some efficiency in metadata, anyway. :-)
Replying to zooko:
A deep-stats operation will tell you the total size of directories, which is an upper bound on the amount of space that could possibly be saved by reducing metadata, in the "size-directories" field.
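
As a rough illustration of that bound: divide the directory bytes by the number of files to get the most that could be saved per file. The "size-directories" field is the one named above; "count-files" is assumed to be present in the same deep-stats result, and the byte total below is invented to go with the 695,000-file tree mentioned earlier.

```python
def metadata_upper_bound(deep_stats):
    """Return (total_bytes, bytes_per_file): the most space that trimming
    per-entry metadata could possibly save, given a deep-stats dict."""
    dir_bytes = deep_stats["size-directories"]
    file_count = deep_stats["count-files"]
    per_file = dir_bytes / file_count if file_count else 0.0
    return dir_bytes, per_file


# Hypothetical numbers; only the file count echoes this thread.
sample_stats = {"size-directories": 347_500_000, "count-files": 695_000}
total, per_file = metadata_upper_bound(sample_stats)
```

With these made-up numbers the bound is 500 bytes per file. The real saving from dropping four `st_*` fields would be smaller, since directories also store the child names, caps, and the remaining metadata.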
Replying to gdt:
gdt, you might be interested in [//pipermail/tahoe-dev/2008-September/000814.html this old thread], where I failed to deter Brian from inventing `tahoe backup`. The thing is, neither bup, duplicity, nor any of the other ones that I named would have the same behavior that `tahoe backup` does. I use `tahoe backup` a lot, and I like its behavior, and I suspect we couldn't get the same behavior by composing a backup tool and a storage system through a generic POSIX API.
Replying to elb:
Maybe that use case is better served by an archive-oriented backup tool such as duplicity or duplicati instead of `tahoe backup`. All of the other backup tools that I know of are archive-oriented instead of file-oriented. `tahoe backup`, on the other hand, makes a separate copy of each file that it encounters. So there are a bunch of things which archivers (including venerable "tar") can handle that `tahoe backup` can't handle, such as symlinks. Traditional backup is better thought of as something that you do to a filesystem or at least a subtree, rather than to a set of independent files. They also achieve much better efficiency by compressing between files and across subsequent versions than `tahoe backup` does, for the same reason: `tahoe backup` can't compress across files or across subsequent versions of the filesystem, because it has to store every file as an independently addressable and accessible entity.
You can use duplicity or duplicati to archive, delta-compress, store metadata, etc. and use their tahoe-lafs backends to securely store the actual data. That sounds like a good solution for this use case.
I kind of think that Brian was right to invent `tahoe backup`, because it is cool and useful, but that I was right that it isn't a good solution for the "backup a filesystem" use case. Maybe it should be renamed to `tahoe mirror` or `tahoe snapshot`.
Opened #1952 to track the issue of "shall we rename `tahoe backup` to `tahoe snapshot`".