"tahoe backup" thinks "ctime" means "creation time" #897
Reference: tahoe-lafs/trac-2024-07-25#897
backupdb seems to think "ctime" means "creation time", which it does, but only on Windows.
This means that "tahoe backup" is unnecessarily re-uploading files in the case that the ownership or permission bits have changed but the file contents haven't, and that "tahoe backup" is incorrectly mapping between "unix change time" and "file creation time" when used on Windows. So this ticket is for two bugs, but they are closely related and should probably be fixed at once.

I noticed in source:docs/backupdb.txt@4111#L84 that the backupdb docs mention "creation time". POSIX doesn't provide a "creation time" but it does provide a "change time", abbreviated "ctime", which most people mistakenly think is a "creation time". Windows does provide a "creation time", and unfortunately Python provides unix "change time" and Windows "creation time" in the same slot -- the `st_ctime` slot of the `stat` module. Here is my bug report saying that the Python stdlib is wrong to do this, and that any Python code which uses the Python stdlib is wrong unless it immediately disambiguates.

In particular, it is a bug for any Tahoe-LAFS code to read the `st_ctime` member without immediately switching on whether the current platform is Windows or not. If you read the `st_ctime` member and do not use the current platform to disambiguate, then you have a value whose semantics are uninterpretable without guessing what platform that value was generated on.

In particular, for "tahoe backup" purposes, it is probably a mistake to say that a new `ctime` means that the file needs to be uploaded again. Unix and Windows both guarantee that the `mtime` will be changed if the file contents have changed, and therefore if `mtime` is unchanged then the file contents are unchanged, even if the `ctime` has changed. On the other hand the `ctime` changes on Unix even when the file contents have not changed, such as if ownership or permission bits have changed. So if only the `ctime` has changed then "tahoe backup" might want to set the new `ctime` value on the link leading to that file, but it should not reupload the file contents.

In addition, I think "tahoe backup" should disambiguate between "unix change time" and "creation time" in the metadata that it stores. Why not change the name of the metadata stored in the tahoe-lafs filesystem edge from the ambiguous and widely misunderstood "ctime" to something like "unix change time", and then if you are on non-Windows you can set that from the local filesystem's `ctime` on upload and set the local filesystem's `ctime` from that on download? On the other hand if you are on Windows then it is a bug to set the "unix change time" from the local filesystem's `ctime`, although it would be correct to set a different metadata entry named `file creation time` from the local filesystem's `ctime`.

See also #628, which is about the same issue in "tahoe cp", includes a taxonomy of filesystem "ctime" semantics, and includes a satisfactory backward-compatible solution that was shipped in Tahoe-LAFS v1.4.1.
I'm tagging this ticket with "forward-compatibility" because we'll eventually have to clarify these semantics and the longer we ship a tool that uploads ambiguous data the harder it will be to fix.
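A minimal sketch of the decision proposed in this report, not Tahoe-LAFS code. The names `Snapshot`, `ctime_meaning` and `classify_change` are hypothetical, and treating a changed Windows creation time as a reason to upload is an assumption of the sketch. It shows `st_ctime` being interpreted only after switching on the platform, and content being considered changed only when size or mtime differ.

```python
import sys
from collections import namedtuple

# Hypothetical record of what was observed for a file at the previous backup.
Snapshot = namedtuple("Snapshot", ["size", "mtime", "ctime"])


def ctime_meaning(platform=sys.platform):
    """Say what st_ctime means on the platform where it was read."""
    if platform.startswith("win"):
        return "creation time"
    return "unix change time"


def classify_change(old, new, platform=sys.platform):
    """Return 'upload', 'metadata-only' or 'unchanged'.

    `platform` must be the platform on which both snapshots were taken.
    """
    if (old.size, old.mtime) != (new.size, new.mtime):
        # Size or mtime differ: assume the contents changed.
        return "upload"
    if old.ctime != new.ctime:
        if platform.startswith("win"):
            # Assumption: a new creation time means a different file.
            return "upload"
        # Unix change time moved (chown, chmod, ...), contents assumed intact.
        return "metadata-only"
    return "unchanged"
```

For example, `classify_change(Snapshot(10, 100, 100), Snapshot(10, 100, 200), "linux")` returns `"metadata-only"`, while the same pair with `"win32"` returns `"upload"`.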
I suggest naming as few things as possible “ctime”.
:-)
Even though Mac OS X is a Unix, it keeps file creation time metadata, at least on its native HFS+ filesystems. I would guess it does not have the `st_ctime` confusion, but I don't know how the creation time actually is accessed. If Tahoe backups have a field for creation time, it would be good to preserve this information (I often find it useful as a user, and would be irritated if my hypothetical Tahoe-based personal backups failed to preserve it).

Hrm.
using ctime/mtime in backupdb
So, first, let's make the docs (source:docs/backupdb.txt#L84) clearer,
by replacing the reference to "creation time, and modification time"
with just "ctime/mtime". The backupdb does not care about the semantics
of these timestamps. All it cares about is having a cheap
sometimes-false-positive proxy for detecting changes to file contents.
In particular, I'm not worried about trying to avoid re-uploading in the
face of user-triggered changes to metadata that doesn't actually change
file contents. If someone does a "chown" or "chmod" or "touch" on a
bunch of files, I think they'll accept the fact that "tahoe backup" will
subsequently do more work on those files than if they had not gone and
run those commands.
So I think that comparing the (size/ctime/mtime) tuple (specifically the
`(stat.ST_SIZE, stat.ST_MTIME, stat.ST_CTIME)` tuple) will serve this
purpose, regardless of what `os.stat(fn)[stat.ST_CTIME]` actually means.
We could change the backupdb to record more semantically-accurate fields,
and fill in some but not others depending upon which platform we were
using, but since we're only comparing this data against itself, I don't
see enough value in adding that complexity.
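A minimal sketch of the comparison described above, assuming hypothetical helper names `file_fingerprint` and `probably_changed` (the real backupdb code is not shown in this ticket). The tuple is only ever compared against an earlier value taken on the same machine, so the platform-specific meaning of the ctime slot does not matter here.

```python
import os
import stat


def file_fingerprint(path):
    """Return the (size, mtime, ctime) tuple used as a cheap change proxy."""
    st = os.stat(path)
    return (st[stat.ST_SIZE], st[stat.ST_MTIME], st[stat.ST_CTIME])


def probably_changed(path, recorded):
    """True if the file no longer matches the recorded fingerprint.

    False positives are possible (e.g. after chmod or touch); they only
    cost an extra upload.
    """
    return file_fingerprint(path) != recorded
```

A caller would store `file_fingerprint(path)` after a successful upload and consult `probably_changed(path, stored)` on the next run.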
putting timestamp metadata into backups created by "tahoe backup"
As a separate issue, I guess I'm +0 on changing the metadata that "tahoe
backup" creates to have more accurate names. Thanks to the patch from
#628, "tahoe backup" is actually the only place that even reads local
filesystem metadata (i.e. `find src -name '*.py' |xargs grep os.stat`
is almost all tahoe internal files). "tahoe backup" currently
does the simplistic thing of copying `stat.st_ctime` into
`metadata["ctime"]`, etc.

I'm not sure how to value timestamps (or other metadata) in backups.
When you restore from a backup, do you expect all of the files to have
the same creation/modification timestamps as they did on the original
disk? The same permission bits? The same owner? The same inode numbers?
The same `atime`? (I'd guess a survey would show users expecting
these properties in descending order, from like 70% of users for
timestamps to 1% of users for atime).
But I think most users of a "tahoe cp" tool would expect the
newly-generated local files to have all timestamps set to the present
moment (as /bin/cp does), and for permission bits/owner to be set by the
current umask setting/login.
Other tools that I use for backup purposes (like version-control
systems) don't record this metadata, because it doesn't generally make
sense to restore it (when I do an 'svn update', I really don't want the
timestamps of the newly-modified files to wind up in the past, because
then my builds will get messed up. Likewise, changing the mode bits,
other than sometimes the execute bit, is probably a bad idea).
So this suggests that we'd need a special "tahoe restore" (or maybe an
option on "tahoe cp", like /bin/cp's --preserve) to use this extended
metadata. And then, if we had that, it would make sense for "tahoe
backup" to record more accurate information about platform-specific
timestamps, such that "tahoe cp --preserve tahoe:backups/Latest
./local-restore" could take your Unix-generated backup and copy it onto
your windows box and reset as much metadata as made sense.
Eh, I dunno.
Incidentally, part of the "timestamps are unimportant" philosophy
described above is embedded in "tahoe backup"'s design: if the local
timestamps have changed but file contents have not, we won't upload
anything new, so the backup snapshot will continue to have the same
timestamps from the original upload. This may mean that you shouldn't
put too much trust in the tahoe-side timestamp metadata anyways. We
could change this to upload more frequently, but personally I prefer the
performance wins of sharing directories between snapshots.
Ok, Zooko and I had a long discussion about this in IRC. There's a bit of
tension between three goals:
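1. future developers can figure out where the timestamps came from
2. developers correctly interpret each field, rather than being misled by the name "ctime"
3. irrelevant platform details are hidden from developers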
Goal 1 is about not trying to be too clever. The original problem here is
that Python tries to be too clever and reports a windows os.stat field (named
`ftCreationTime` in the underlying API) as `st_ctime`, the same way
that POSIX's st_ctime is reported. This decision was probably based on
mistakenly believing that they have the same semantics, and a desire to hide
irrelevant platform details from developers who shouldn't have to care.
However, if they hadn't done that (i.e. report `st_creationtime` on
windows and `st_ctime` on unix), then we'd have less-convenient but
less-ambiguous os.stat results.
Systems which try to hide details from developers can cause frustration,
especially if the developers understand the quirks and foibles of the
underlying system, because then the "helpful" intermediate layers are really
just getting in the way.
To implement goal 1, we would copy all of the `os.stat()` fields into the
metadata as-is, and probably include an extra field (perhaps labeled
`st_platform`) as a hint to cyber-historians who know better than we do
what os.stat returns on various platforms, and how to interpret it.
Goal 2 would be accomplished by never using the word "ctime" in our metadata,
even though it's used in two other places (`os.stat` return value, and
POSIX's stat(2) call). Evidence suggests that the majority of developers
believe the wrong thing about what POSIX's ctime means (and I've certainly
been in this camp). So giving them a word other than "ctime" will either be
more meaningful (e.g. if we called it posix-metadata-change-time) or will
force them to look up our actual definition (e.g. if we called it
tahoe-bagel-kumquat and dared them to search webapi.txt for details).
Goal 3 would be accomplished by using a common, easy-to-understand word like
"changetime" or "creationtime" for all platforms, despite whatever name is
used by the underlying system call. POSIX and windows return "mtime" values
with (as far as I've been told) the same semantics. So it's probably fair to
say that the fact that (A: POSIX stat() returns it in st_mtime, while B:
windows returns it in ftModificationTime or something) is an "irrelevant
platform detail", and that developers' lives are easier if this distinction is
hidden from them.
So, as a compromise between these goals, we settled on the following keys:
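- st_platform
- st_dev, st_mode, etc. (copied directly from os.stat)
- modification-time
- posix-change-time (on POSIX)
- windows-creation-time (on windows)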
The synthetic "st_platform" key will contain `sys.platform`, so something
like "linux2" or "darwin" or "win32". The hope is that this is a cheap way
to provide some useful information to future developers and cyber-historians
to interpret the rest of the st_* fields in some meaningful way.
st_dev, st_mode, etc, will be copied directly from the os.stat call. Other
attributes (perhaps platform-specific fields like OS-X's st_creator and
st_type) will be copied here too.
`modification-time` will be copied from st_mtime on all platforms, based
on the conclusion that it represents the same concept on all platforms: the
most recent time that the file's contents have been modified.

`posix-change-time` will be present for files that came from a POSIX
filesystem, and will be copied from st_ctime.

`windows-creation-time` will be present for files that came from a
windows filesystem, and will be copied from st_ctime.
Having longer and more-detailed names for the ctime values will help with
goal 2 (help developers correctly interpret this field). Not calling them
"ctime" will help developers who would otherwise misinterpret
`posix-change-time` as if it were the mythical "posix-creation-time" that
everyone really wants. We cannot provide goal 3 here, because there is no
common semantic between POSIX and windows.
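The compromise above could be sketched roughly as follows. This is an illustration only: `timestamp_metadata` and `backup_metadata` are hypothetical names, not existing "tahoe backup" functions, and the sketch assumes the key names exactly as listed in this comment.

```python
import os
import sys


def timestamp_metadata(st, platform):
    """Map an os.stat() result to the unambiguous timestamp keys.

    `platform` must be the sys.platform value of the machine where `st`
    was obtained, since that decides what st_ctime means.
    """
    md = {"modification-time": st.st_mtime}
    if platform.startswith("win"):
        md["windows-creation-time"] = st.st_ctime
    else:
        md["posix-change-time"] = st.st_ctime
    return md


def backup_metadata(path):
    """Build the metadata dict for one local file."""
    st = os.stat(path)
    md = {"st_platform": sys.platform}
    # Copy every raw st_* field as-is, including platform-specific ones.
    for name in dir(st):
        if name.startswith("st_"):
            md[name] = getattr(st, name)
    md.update(timestamp_metadata(st, sys.platform))
    return md
```

Exactly one of `posix-change-time` and `windows-creation-time` appears in the result, so a reader never has to guess which meaning a stored value carries.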
(note for future discussion: some POSIX-ish filesystems do provide
creation-time, in the form of OS-X's st_birthtime, and supposedly something
that ZFS offers. If we can determine that the semantics of these are the
same, it could be argued that windows-creation-time should be renamed
`creation-time`, and only populated on platforms that offer it, which
would be st_birthtime from HFS+/OS-X, st_ctime on windows, and something else
on ZFS)
(and note that, if we cannot determine that the semantics are the same,
then we should probably refrain from trying to coerce them into the same
field, lest we make the same mistake that Python's os.stat did, making life
more difficult for somebody in the future who is trying to figure out whether
a given file's so-called "creation-time" was really the ZFS notion, or the
HFS+ notion, or whatever).
Replying to warner:
I know Tahoe doesn't attempt to provide anonymity of file uploaders, but from a privacy point of view, why should the holder of the directory read cap have this information?
I think they are the same. Also I think `posix-change-time` should be `metadata-change-time`, since a system that provides it isn't necessarily a POSIX system.

The only argument for `st_platform` is to give readers (including dirreadcap holders) a hedge against failures in our current understanding of what these fields mean (well, our+python's understanding). I'm +0 on it now.. I could be talked out of including it.

I suspect that our inclusion of st_ino and st_dev and the other fields will reveal much information about the value of st_platform anyways, since I imagine that windows filesystems use these fields in very different ways. An enthusiastic reader who is trying to un-"fix" our+python's attempt at being helpful might be able to recognize these characteristic st_ino/st_dev values and improve the accuracy of their unwinding without being told explicitly what st_platform is. On the other hand, st_platform may not be enough information to do this hypothetical job well, and that future reader may complain to our ghosts that we should have included even more information.
So I'm ok with not including st_platform.
I suspect they are the same too, but I'm afraid of committing the same mistake that Python did, especially when OS-X is the only documentation that I've seen personally, and nobody that I know has even seen the hypothetical ZFS docs (and ZFS is losing ground now that Apple abandoned it). Again, I'm willing to go this way, but I'm going to be awfully embarrassed if our attempt to fix python's semantics proves (years from now) to be adding yet another layer of brokenness.
I'm also willing to go along with this one, but I'm even more hesitant. First, I think the casual reader who sees "metadata-change-time" will incorrectly assume that it is referring to the Tahoe metadata in which this key is embedded, rather than the original disk filesystem from which the file was copied. Second, do we have examples of non-POSIX systems which provide this "ctime" that behave exactly like POSIX ones? We'd be making a bolder claim by going with "metadata-change-time".. there are surely some pieces of metadata that might be changed without updating this timestamp.. if some other system uses a different set of metadata than our POSIX systems, would we ignore the differences and represent both as "metadata-change-time"? Or add a "non-POSIX-metadata-change-time"?
Oh, incidentally, the wikipedia page on ext3 uses the term "attribute modification" to describe ctime. Maybe "attribute-change-time" would be a suitable replacement for "metadata-change-time"?
It isn't "change time from a POSIX system" it is "change time matching the POSIX semantics". Does that answer your objection David-Sarah?
You have a point about `posix-change-time` -- it's a POSIXism that is fairly unlikely to be duplicated precisely. For creation time, I'll try and find the relevant ZFS and OS-X docs.

I think we are out of time to do this for 1.7.0. Brian is busy with new-downloader (unless he decides to save new-downloader for after the v1.7.0 release and work on something else for v1.7.0). David-Sarah and I are also busy with other 1.7.0 work.

I'm putting this in 1.8.0 instead of in "eventually" because it is a `forward-compatibility` issue and I really like to fix those as early as possible...

Since this is a forward-compatibility issue I'm still interested in getting it fixed. By the way, Linux is growing a creation-time field:
Hm. What's the status of this ticket? It is assigned to Brian. Is Brian intending to do anything with it? Did we achieve consensus on what should be done? Do we all agree with what Brian wrote in comment:74576 plus the various follow-ups? I no longer remember. Maybe someone should write up a new summary of what we intend to do. Brian: if you don't intend to write up a new summary (or otherwise move this ticket forward) then please assign it to me.
Replying to warner:
But why should we re-upload those files? The unix operating system is asserting, by giving us a changed `change time` and an unchanged `modify time`, that the file contents have not changed. If we are relying on the operating system about this sort of thing, for improved efficiency, then why not believe it about this and skip the re-upload?

Obviously if the operating system asserts that the `creation time` has changed, then we should re-upload.

Zooko reminded me of this ticket in IRC today, so I re-read
everything. I think we have the following tasks to finish for this
ticket:
- ... `st_platform` in the `tahoe-backup` metadata
- ... `posix-change-time` ...
- ... `tahoe-backup` to record the keys described in comment:74576 (modification-time, windows-creation-time or ...
and then a separate ticket can be created to build some sort of
restore command (maybe an option for `tahoe cp`, maybe a
separate `tahoe restore`) that reads this metadata and applies
it to the resulting files.
See also #2250 (don't re-use metadata from earlier snapshots, in a "tahoe backup").
Replying to [zooko]comment:18:
It is possible for `ctime` to change but `mtime` not, when the file contents change. In particular, suppose there are two files `foo` and `bar` that have different contents but the same size and `mtime` (because they were last modified at the same time to within the `mtime` resolution). Then, `mv bar foo` will not change `foo`'s `mtime`, but will set its `ctime` to the current time. (Verified on Linux.)
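The scenario can be reproduced with a short script. This is a sketch under stated assumptions: the helper name `rename_over` is hypothetical, `os.utime` is used to force identical mtimes rather than relying on timestamp resolution, and `os.replace` stands in for `mv bar foo`. Whether `st_ctime` differs afterwards depends on the platform and filesystem, so the sketch reports it rather than assuming it.

```python
import os


def rename_over(directory):
    """Replace foo with a same-size, same-mtime file that has other contents."""
    foo = os.path.join(directory, "foo")
    bar = os.path.join(directory, "bar")
    with open(foo, "w") as f:
        f.write("aaaa")
    with open(bar, "w") as f:
        f.write("bbbb")

    # Give both files the same atime/mtime (whole seconds, in nanoseconds).
    stamp = 1_000_000_000 * 10**9
    os.utime(foo, ns=(stamp, stamp))
    os.utime(bar, ns=(stamp, stamp))

    before = os.stat(foo)
    os.replace(bar, foo)  # equivalent of `mv bar foo`
    after = os.stat(foo)

    with open(foo) as f:
        contents = f.read()
    return before, after, contents


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        before, after, contents = rename_over(d)
        print("contents now:", contents)
        print("size same:", before.st_size == after.st_size)
        print("mtime same:", before.st_mtime_ns == after.st_mtime_ns)
        print("ctime changed:", before.st_ctime_ns != after.st_ctime_ns)
```

A check based only on size and mtime would miss this change, which is the argument for keeping ctime in the comparison.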