"tahoe backup" thinks "ctime" means "creation time" #897

Open
opened 2010-01-12 23:36:42 +00:00 by zooko · 15 comments

backupdb seems to think "ctime" means "creation time", which it does, but only on Windows.

This means ~~there is an incorrect statement in the documentation,~~ that "tahoe backup" is unnecessarily re-uploading files in the case that the ownership or permission bits have changed but the file contents haven't, and that "tahoe backup" is incorrectly mapping between "unix change time" and "file creation time" when used on Windows. So this ticket is for ~~three~~ two bugs, but they are all closely related and should probably be fixed at once.

I noticed in source:docs/backupdb.txt@4111#L84 that the backupdb docs mention "creation time". POSIX doesn't provide a "creation time" but it does provide a "change time", abbreviated "ctime", which most people mistakenly think is a "creation time". Windows *does* provide a "creation time", and unfortunately Python provides unix "change time" and Windows "creation time" in the same slot -- the st_ctime slot of the stat module. Here is my [bug report](http://bugs.python.org/issue5720) saying that the Python stdlib is wrong to do this, and that any Python code which uses the Python stdlib is wrong unless it immediately disambiguates.

In particular, it is a bug for any Tahoe-LAFS code to read the st_ctime member without immediately switching on whether the current platform is Windows or not. If you read the st_ctime member and do not use the current platform to disambiguate, then you have a value whose semantics are uninterpretable without guessing what platform that value was generated on.
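A minimal sketch of what "immediately switching on the platform" could look like. The helper name `classify_ctime` and the two labels are made up for illustration; they are not existing Tahoe-LAFS code or metadata keys. The only assumption about the platform is that `sys.platform` is `"win32"` on Windows.

```python
import os
import sys

def classify_ctime(path, platform=None):
    """Return (label, value) so the st_ctime value never travels unlabeled.

    The labels are illustrative, not existing Tahoe-LAFS metadata keys.
    """
    platform = sys.platform if platform is None else platform
    value = os.stat(path).st_ctime
    if platform == "win32":
        # On Windows, Python has reported the file creation time here.
        return ("creation time", value)
    # On POSIX systems this is the inode change time.
    return ("unix change time", value)

label, value = classify_ctime(os.curdir)
print(label, value)
```

The `platform` parameter exists only so the branch can be exercised on any machine; real callers would leave it at its default.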

In particular, for "tahoe backup" purposes, it is probably a mistake to say that a new ctime means that the file needs to be uploaded again. Unix and Windows both guarantee that the mtime will be changed if the file contents have changed, and therefore if mtime is unchanged then the file contents are unchanged, even if the ctime has changed. On the other hand the ctime changes on Unix even when the file contents have not changed, such as if ownership or permission bits have changed. So if only the ctime has changed then "tahoe backup" might want to set the new ctime value on the link leading to that file, but it should not reupload the file contents.
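To make the proposed rule concrete, here is a rough sketch, assuming (as argued above) that an unchanged size and mtime mean unchanged contents. The function `plan_backup_action` and its three return values are hypothetical; this is not how "tahoe backup" is currently written.

```python
def plan_backup_action(old, new):
    """Decide what to do with a file given its recorded and current stat data.

    old and new are dicts with 'size', 'mtime' and 'ctime' entries.
    """
    if (old["size"], old["mtime"]) != (new["size"], new["mtime"]):
        # Contents may have changed: upload again.
        return "upload"
    if old["ctime"] != new["ctime"]:
        # Only ownership/permissions (or similar) changed: refresh the
        # metadata on the link, but do not reupload the contents.
        return "update-metadata"
    return "skip"

recorded = {"size": 100, "mtime": 1263339402, "ctime": 1263339402}
after_chmod = {"size": 100, "mtime": 1263339402, "ctime": 1263340000}
print(plan_backup_action(recorded, after_chmod))
```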

In addition, I think "tahoe backup" should disambiguate between "unix change time" and "creation time" in the metadata that it stores. Why not change the name of the metadata stored in the tahoe-lafs filesystem edge from the ambiguous and widely misunderstood "ctime" to something like "unix change time", and then if you are on non-Windows you can set that from the local filesystem's ctime on upload and set the local filesystem's ctime from that on download. On the other hand if you are on Windows then it is a bug to set the "unix change time" from the local filesystem's ctime, although it would be correct to set a different metadata entry named file creation time from the local filesystem's ctime.

See also #628, which is about the same issue in "tahoe cp", includes a taxonomy of filesystem "ctime" semantics, and includes a satisfactory backward-compatible solution that was shipped in Tahoe-LAFS v1.4.1.

I'm tagging this ticket with "forward-compatibility" because we'll eventually have to clarify these semantics and the longer we ship a tool that uploads ambiguous data the harder it will be to fix.

zooko added the
unknown
major
defect
1.5.0
labels 2010-01-12 23:36:42 +00:00
zooko added this to the undecided milestone 2010-01-12 23:36:42 +00:00
  • I suggest naming as few things as possible “ctime”. :-)

  • Even though Mac OS X is a Unix, it keeps file creation time metadata, at least on its native HFS+ filesystems. I would guess it does not have the st_ctime confusion, but I don't know how the creation time actually is accessed. If Tahoe backups have a field for creation time, it would be good to preserve this information (I often find it useful as a user, and would be irritated if my hypothetical Tahoe-based personal backups failed to preserve it).


Hrm.

## using ctime/mtime in backupdb

So, first, let's make the docs (source:docs/backupdb.txt#L84) clearer,
by replacing the reference to "creation time, and modification time"
with just "ctime/mtime". The backupdb does not care about the semantics
of these timestamps. All it cares about is having a cheap
sometimes-false-positive proxy for detecting changes to file contents.

In particular, I'm not worried about trying to avoid re-uploading in the
face of user-triggered changes to metadata that doesn't actually change
file contents. If someone does a "chown" or "chmod" or "touch" on a
bunch of files, I think they'll accept the fact that "tahoe backup" will
subsequently do more work on those files than if they had not gone and
run those commands.

So I think that comparing the (size/ctime/mtime) tuple (specifically the
(stat.ST_SIZE, stat.ST_MTIME, stat.ST_CTIME) tuple) will serve
this purpose, regardless of what os.stat(fn)[stat.ST_CTIME]
actually means. We could change the backupdb to record more
semantically-accurate fields, and fill in some but not others depending
upon which platform we were using, but since we're only comparing this
data against itself, I don't see enough value in adding that complexity.
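For illustration, the comparison amounts to something like the sketch below. The function names are invented here and are not the backupdb's actual API; the point is only that the tuple is compared against an earlier copy of itself, so the meaning of the third element never matters.

```python
import os
import stat

def change_fingerprint(path):
    """Return the (size, mtime, ctime) tuple used as a cheap change detector."""
    s = os.stat(path)
    return (s[stat.ST_SIZE], s[stat.ST_MTIME], s[stat.ST_CTIME])

def probably_unchanged(path, recorded):
    """True if the file still matches the fingerprint recorded earlier.

    A mismatch can be a false positive (e.g. after chmod), which only
    costs extra work; it never causes a changed file to be skipped.
    """
    return change_fingerprint(path) == tuple(recorded)

recorded = change_fingerprint(os.curdir)
print(probably_unchanged(os.curdir, recorded))
```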

## putting timestamp metadata into backups created by "tahoe backup"

As a separate issue, I guess I'm +0 on changing the metadata that "tahoe
backup" creates to have more accurate names. Thanks to the patch from
#628, "tahoe backup" is actually the only place that even reads local
filesystem metadata (i.e. find src -name '*.py' |xargs grep os.stat
is almost all tahoe internal files). "tahoe backup" currently
does the simplistic thing of copying stat.st_ctime into
metadata["ctime"], etc.

I'm not sure how to value timestamps (or other metadata) in backups.
When you restore from a backup, do you expect all of the files to have
the same creation/modification timestamps as they did on the original
disk? The same permission bits? The same owner? The same inode numbers?
The same atime? (I'd guess a survey would show users expecting
these properties in descending order, from like 70% of users for
timestamps to 1% of users for atime).

But I think most users of a "tahoe cp" tool would expect the
newly-generated local files to have all timestamps set to the present
moment (as /bin/cp does), and for permission bits/owner to be set by the
current umask setting/login.

Other tools that I use for backup purposes (like version-control
systems) don't record this metadata, because it doesn't generally make
sense to restore it (when I do an 'svn update', I really don't want the
timestamps of the newly-modified files to wind up in the past, because
then my builds will get messed up. Likewise, changing the mode bits,
other than sometimes the execute bit, is probably a bad idea).

So this suggests that we'd need a special "tahoe restore" (or maybe an
option on "tahoe cp", like /bin/cp's --preserve) to use this extended
metadata. And then, if we had that, it would make sense for "tahoe
backup" to record more accurate information about platform-specific
timestamps, such that "tahoe cp --preserve tahoe:backups/Latest
./local-restore" could take your Unix-generated backup and copy it onto
your windows box and reset as much metadata as made sense.
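As a rough idea of what such a restore step could do, here is a sketch that puts a stored modification time back onto a restored file with `os.utime`. The helper `restore_mtime` and the `"mtime"` key are assumptions made for this example; no "tahoe restore" or `--preserve` option exists in the discussion above beyond the suggestion. Note that on POSIX systems an application cannot set ctime at all: the kernel updates it, including as a side effect of the `os.utime` call itself.

```python
import os
import tempfile

def restore_mtime(path, metadata):
    """Apply a stored modification time to path, keeping its current atime.

    Returns False if the metadata carries no usable timestamp.
    """
    mtime = metadata.get("mtime")  # hypothetical key name
    if mtime is None:
        return False
    atime = os.stat(path).st_atime
    os.utime(path, (atime, mtime))
    return True

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "restored.txt")
    with open(target, "w") as f:
        f.write("restored contents")
    restore_mtime(target, {"mtime": 1263339402})  # arbitrary example value
    print(int(os.stat(target).st_mtime))
```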

Eh, I dunno.

Incidentally, part of the "timestamps are unimportant" philosophy
described above is embedded in "tahoe backup"'s design: if the local
timestamps have changed but file contents have not, we won't upload
anything new, so the backup snapshot will continue to have the same
timestamps from the original upload. This may mean that you shouldn't
put too much trust in the tahoe-side timestamp metadata anyways. We
could change this to upload more frequently, but personally I prefer the
performance wins of sharing directories between snapshots.


Ok, Zooko and I had a long discussion about this in IRC. There's a bit of
tension between three goals:

  1. preserving information, even if it is confusing or badly labeled, so that
    future developers can figure out where the timestamps came from
  2. not confusing busy developers by perpetuating ambiguous labels like
    "ctime"
  3. hiding irrelevant platform details, making life easier for developers

Goal 1 is about not trying to be too clever. The original problem here is
that Python tries to be too clever and reports a windows os.stat field (named
ftCreationTime in the underlying API) as st_ctime, the same way
that POSIX's st_ctime is reported. This decision was probably based on
mistakenly believing that they have the same semantics, and a desire to hide
irrelevant platform details from developers who shouldn't have to care.
However, if they hadn't done that (i.e. report st_creationtime on
windows and st_ctime on unix), then we'd have less-convenient but
less-ambiguous os.stat results.

Systems which try to hide details from developers can cause frustration,
especially if the developers understand the quirks and foibles of the
underlying system, because then the "helpful" intermediate layers are really
just getting in the way.

To implement goal 1, we would copy all of the os.stat() fields into the
metadata as-is, and probably include an extra field (perhaps labeled
st_platform) as a hint to cyber-historians who know better than we do
what os.stat returns on various platforms, and how to interpret it.

Goal 2 would be accomplished by never using the word "ctime" in our metadata,
even though it's used in two other places (os.stat return value, and
POSIX's stat(2) call). Evidence suggests that the majority of developers
believe the wrong thing about what POSIX's ctime means (and I've certainly
been in this camp). So giving them a word other than "ctime" will either be
more meaningful (e.g. if we called it posix-metadata-change-time) or will
force them to look up our actual definition (e.g. if we called it
tahoe-bagel-kumquat and dared them to search webapi.txt for details).

Goal 3 would be accomplished by using a common, easy-to-understand word like
"changetime" or "creationtime" for all platforms, despite whatever name is
used by the underlying system call. POSIX and windows return "mtime" values
with (as far as I've been told) the same semantics. So it's probably fair to
say that the fact that (A: POSIX stat() returns it in st_mtime, while B:
windows returns it in ftModificationTime or something) is an "irrelevant
platform detail", and that developers' lives are easier if this distinction is
hidden from them.

So, as a compromise between these goals, we settled on the following keys:

  • unix: (st_platform, st_dev, st_mode, st_ino.., modification-time,
    posix-change-time)
  • windows: (st_platform, st_dev, st_mode, st_ino.., modification-time,
    windows-creation-time)

The synthetic "st_platform" key will contain sys.platform, so something
like "linux2" or "darwin" or "win32". The hope is that this is a cheap way
to provide some useful information to future developers and cyber-historians
to interpret the rest of the st_* fields in some meaningful way.

st_dev, st_mode, etc, will be copied directly from the os.stat call. Other
attributes (perhaps platform-specific fields like OS-X's st_creator and
st_type) will be copied here too.

modification-time will be copied from st_mtime on all platforms, based
on the conclusion that it represents the same concept on all platforms: the
most recent time that the file's contents have been modified.

posix-change-time will be present for files that came from a POSIX
filesystem, and will be copied from st_ctime.

windows-creation-time will be present for files that came from a
windows filesystem, and will be copied from st_ctime.

Having longer and more-detailed names for the ctime values will help with
goal 2 (help developers correctly interpret this field). Not calling them
"ctime" will help developers who would otherwise misinterpret
posix-change-time as if it were the mythical "posix-creation-time" that
everyone really wants. We cannot provide goal 3 here, because there is no
common semantic between POSIX and windows.
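A sketch of how the timestamp part of that compromise might be filled in, using the key names from the list above. The function `timestamp_metadata` is hypothetical, the st_* passthrough fields and st_platform are left out for brevity, and it assumes `sys.platform` is `"win32"` on Windows.

```python
import os
import sys

def timestamp_metadata(path, platform=None):
    """Build the timestamp keys described above for one local file."""
    platform = sys.platform if platform is None else platform
    s = os.stat(path)
    # mtime is treated as having the same meaning on every platform.
    metadata = {"modification-time": s.st_mtime}
    if platform == "win32":
        metadata["windows-creation-time"] = s.st_ctime
    else:
        metadata["posix-change-time"] = s.st_ctime
    return metadata

print(sorted(timestamp_metadata(os.curdir)))
```

Only one of the two ctime-derived keys is ever present, so a reader of the metadata never has to guess which meaning applies.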

(note for future discussion: some POSIX-ish filesystems do provide
creation-time, in the form of OS-X's st_birthtime, and supposedly something
that ZFS offers. If we can determine that the semantics of these are the
same, it could be argued that windows-creation-time should be renamed
creation-time, and only populated on platforms that offer it, which
would be st_birthtime from HFS+/OS-X, st_ctime on windows, and something else
on ZFS)

(and note that, if we cannot determine that the semantics are the same,
then we should probably refrain from trying to coerce them into the same
field, lest we make the same mistake that Python's os.stat did, making life
more difficult for somebody in the future who is trying to figure out whether
a given file's so-called "creation-time" was really the ZFS notion, or the
HFS+ notion, or whatever).
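One way to keep the option open without coercing anything is to record where a creation time came from next to the value. This is only a sketch: `creation_time_with_origin` and its "origin" labels are invented for this example, and it relies on `st_birthtime` being an optional attribute of the `os.stat` result that is simply absent on platforms that do not offer it.

```python
import os
import sys

def creation_time_with_origin(path, platform=None):
    """Return {'value': ..., 'origin': ...} or None if no creation time is known."""
    platform = sys.platform if platform is None else platform
    s = os.stat(path)
    birth = getattr(s, "st_birthtime", None)  # not present everywhere
    if birth is not None:
        return {"value": birth, "origin": "st_birthtime"}
    if platform == "win32":
        return {"value": s.st_ctime, "origin": "st_ctime (win32)"}
    # Plain POSIX ctime is a change time, so report nothing rather than guess.
    return None

print(creation_time_with_origin(os.curdir))
```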

davidsarah commented 2010-01-13 19:16:59 +00:00
Owner

Replying to warner:

> The synthetic "st_platform" key will contain sys.platform, so something
> like "linux2" or "darwin" or "windows". The hope is that this is a cheap way
> to provide some useful information to future developers and cyber-historians
> to interpret the rest of the st_* fields in some meaningful way.

I know Tahoe doesn't attempt to provide anonymity of file uploaders, but from a privacy point of view, why should the holder of the directory read cap have this information?

> (note for future discussion: some POSIX-ish filesystems do provide
> creation-time, in the form of OS-X's st_birthtime, and supposedly something
> that ZFS offers. If we can determine that the semantics of these are the
> same, it could be argued that windows-creation-time should be renamed
> creation-time, and only populated on platforms that offer it, which
> would be st_birthtime from HFS+/OS-X, st_ctime on windows, and something else
> on ZFS)
>
> (and note that, if we cannot determine that the semantics are the same,
> then we should probably refrain from trying to coerce them into the same
> field, lest we make the same mistake that Python's os.stat did, making life
> more difficult for somebody in the future who is trying to figure out whether
> a given file's so-called "creation-time" was really the ZFS notion, or the
> HFS+ notion, or whatever).

I think they are the same. Also I think posix-change-time should be metadata-change-time, since a system that provides it isn't necessarily a POSIX system.


The only argument for st_platform is to give readers (including dirreadcap holders) a hedge against failures in our current understanding of what these fields mean (well, our+python's understanding). I'm +0 on it now.. I could be talked out of including it.

I suspect that our inclusion of st_ino and st_dev and the other fields will reveal much information about the value of st_platform anyways, since I imagine that windows filesystems use these fields in very different ways. An enthusiastic reader who is trying to un-"fix" our+python's attempt at being helpful might be able to recognize these characteristic st_ino/st_dev values and improve the accuracy of their unwinding without being told explicitly what st_platform is. On the other hand, st_platform may not be enough information to do this hypothetical job well, and that future reader may complain to our ghosts that we should have included even more information.

So I'm ok with not including st_platform.

> I think they are the same.

I suspect they are the same too, but I'm afraid of committing the same mistake that Python did, especially when OS-X is the only documentation that I've seen personally, and nobody that I know has even seen the hypothetical ZFS docs (and ZFS is losing ground now that Apple abandoned it). Again, I'm willing to go this way, but I'm going to be awfully embarrassed if our attempt to fix python's semantics proves (years from now) to be adding yet another layer of brokenness.

> Also I think posix-change-time should be metadata-change-time,

I'm also willing to go along with this one, but I'm even more hesitant. First, I think the casual reader who sees "metadata-change-time" will incorrectly assume that it is referring to the Tahoe metadata in which this key is embedded, rather than the original disk filesystem from which the file was copied. Second, do we have examples of non-POSIX systems which provide this "ctime" that behave exactly like POSIX ones? We'd be making a bolder claim by going with "metadata-change-time".. there are surely some pieces of metadata that might be changed without updating this timestamp.. if some other system uses a different set of metadata than our POSIX systems, would we ignore the differences and represent both as "metadata-change-time"? Or add a "non-POSIX-metadata-change-time"?


Oh, incidentally, the wikipedia page on ext3 uses the term "attribute modification" to describe ctime. Maybe "attribute-change-time" would be a suitable replacement for "metadata-change-time"?

Author

It isn't "change time from a POSIX system"; it is "change time matching the POSIX semantics". Does that answer your objection, David-Sarah?

davidsarah commented 2010-01-13 21:50:20 +00:00
Owner

You have a point about posix-change-time -- it's a POSIXism that is fairly unlikely to be duplicated precisely. For creation time, I'll try and find the relevant ZFS and OS-X docs.

zooko modified the milestone from undecided to 1.7.0 2010-01-27 05:58:25 +00:00
Author

I think we are out of time to do this for 1.7.0. Brian is busy with new-downloader (unless he decides to save new-downloader for after the v1.7.0 release and work on something else for v1.7.0). David-Sarah and I are also busy with other 1.7.0 work.

I'm putting this in 1.8.0 instead of in "eventually" because it is a forward-compatibility issue and I really like to fix those as early as possible...

zooko modified the milestone from 1.7.0 to 1.8.0 2010-05-05 05:51:23 +00:00
tahoe-lafs added the code-frontend-cli label and removed the unknown label 2010-06-21 03:30:53 +00:00
Author

Since this is a forward-compatibility issue I'm still interested in getting it fixed. By the way, Linux is growing a creation-time field:

  • <http://lwn.net/Articles/394298/>
  • <http://article.gmane.org/gmane.linux.nfs/33964>
  • <http://article.gmane.org/gmane.linux.nfs/33965>
zooko modified the milestone from 1.8.0 to soon 2010-08-13 18:50:41 +00:00
Author

Hm. What's the status of this ticket? It is assigned to Brian. Is Brian intending to do anything with it? Did we achieve consensus on what should be done? Do we all agree with what Brian wrote in comment:74576 plus the various follow-ups? I no longer remember. Maybe someone should write up a new summary of what we intend to do. Brian: if you don't intend to write up a new summary (or otherwise move this ticket forward) then please assign it to me.

zooko added the 1.6.1 label and removed the 1.5.0 label 2012-02-24 06:06:48 +00:00
warner was unassigned by zooko 2012-02-24 06:06:48 +00:00
zooko self-assigned this 2012-02-24 06:06:48 +00:00
Author

Replying to warner:

> == using ctime/mtime in backupdb ==
>
> In particular, I'm not worried about trying to avoid re-uploading in the face of user-triggered changes to metadata that doesn't actually change file contents. If someone does a "chown" or "chmod" or "touch" on a bunch of files, I think they'll accept the fact that "tahoe backup" will subsequently do more work on those files than if they had not gone and run those commands.

But why should we re-upload those files? The unix operating system is asserting, by giving us a changed change time and an unchanged modify time, that the file contents have not changed. If we are relying on the operating system about this sort of thing, for improved efficiency, then why not believe it about this and skip the re-upload?

Obviously if the operating system asserts that the creation time has changed, then we should re-upload.
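
To make the proposal above concrete, here is a minimal sketch of the decision rule it implies. This is not the actual backupdb code: the `FileState` record, its field names, and the `decide` function are hypothetical names invented for illustration. It assumes the stored timestamps have already been disambiguated into a POSIX change time and an optional creation time.

```python
from collections import namedtuple

# Hypothetical record of what was remembered about a file at the last backup.
# creation_time is None on platforms that do not report one.
FileState = namedtuple("FileState", "size mtime change_time creation_time")


def decide(previous, current):
    """Return 'upload', 'update-metadata', or 'skip' under the proposed policy."""
    if previous is None:
        return "upload"  # file has never been backed up
    if (current.size, current.mtime) != (previous.size, previous.mtime):
        return "upload"  # contents may have changed
    if current.creation_time != previous.creation_time:
        return "upload"  # a different file now lives at this path
    if current.change_time != previous.change_time:
        # Only ownership/permissions (or similar) changed: keep the stored
        # contents and refresh the metadata on the link.
        return "update-metadata"
    return "skip"
```

With this rule a `chmod` or `chown` alone yields `update-metadata` rather than a re-upload. The rule trusts an unchanged size and mtime to mean unchanged contents, which is exactly the assumption under debate in this ticket.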

zooko removed their assignment 2012-06-04 04:18:46 +00:00
warner was assigned by zooko 2012-06-04 04:18:46 +00:00

Zooko reminded me of this ticket in IRC today, so I re-read
everything. I think we have the following tasks to finish for this
ticket:

  • achieve consensus upon the inclusion of st_platform in the
    tahoe-backup metadata
  • achieve consensus upon the spelling of posix-change-time in
    the tahoe-backup metadata
  • change tahoe-backup to record the keys described in
    comment:74576 (modification-time, windows-creation-time or
    posix-change-time)
  • argue and achieve consensus on the when-to-re-upload question

and then a separate ticket can be created to build some sort of
restore command (maybe an option for tahoe cp, maybe a
separate tahoe restore) that reads this metadata and applies
it to the resulting files.
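
As a rough illustration of the third task, the sketch below records the keys named above by switching on the platform as soon as st_ctime is read. It is an assumption-laden sketch, not the tahoe-backup implementation: the function name `timestamp_metadata` and the `include_platform` flag are hypothetical, and whether st_platform is recorded at all is still undecided in this ticket.

```python
import os
import sys


def timestamp_metadata(path, include_platform=False):
    """Build timestamp keys for one file, disambiguating st_ctime by platform."""
    st = os.stat(path)
    metadata = {"modification-time": st.st_mtime}
    if sys.platform == "win32":
        # On Windows, Python has historically reported creation time here.
        metadata["windows-creation-time"] = st.st_ctime
    else:
        # On POSIX systems, st_ctime is the inode change time.
        metadata["posix-change-time"] = st.st_ctime
    if include_platform:
        # Optional hedge for future readers; inclusion is not yet agreed.
        metadata["st_platform"] = sys.platform
    return metadata
```

Because exactly one of the two ctime-derived keys is ever written, a reader of the metadata never has to guess which platform produced the value.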

Author

See also #2250 (don't re-use metadata from earlier snapshots, in a "tahoe backup").

daira commented 2015-09-24 19:43:39 +00:00
Owner

Replying to zooko (comment:18):

> Replying to warner:
>
> > == using ctime/mtime in backupdb ==
> >
> > In particular, I'm not worried about trying to avoid re-uploading in the face of user-triggered changes to metadata that doesn't actually change file contents. If someone does a "chown" or "chmod" or "touch" on a bunch of files, I think they'll accept the fact that "tahoe backup" will subsequently do more work on those files than if they had not gone and run those commands.
>
> But why should we re-upload those files? The unix operating system is asserting, by giving us a changed change time and an unchanged modify time, that the file contents have not changed. If we are relying on the operating system about this sort of thing, for improved efficiency, then why not believe it about this and skip the re-upload?
>
> Obviously if the operating system asserts that the creation time has changed, then we should re-upload.

It is possible for ctime to change but mtime not, when the file contents change. In particular, suppose there are two files foo and bar that have different contents but the same size and mtime (because they were last modified at the same time to within the mtime resolution). Then, mv bar foo will not change foo's mtime, but will set its ctime to the current time. (Verified on Linux.)
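
The scenario above can be reproduced with a short script. This is only a sketch: the function name `rename_over_demo` and the fixed timestamp are made up for the demonstration, and `os.utime` is used to force the two mtimes to be identical instead of relying on timer resolution. Whether ctime visibly changes depends on the platform and on timestamp granularity, so the script reports that value without relying on it.

```python
import os
import tempfile


def rename_over_demo():
    """Rename bar over foo and report which stat fields of foo changed."""
    with tempfile.TemporaryDirectory() as d:
        foo = os.path.join(d, "foo")
        bar = os.path.join(d, "bar")
        with open(foo, "w") as f:
            f.write("aaaa")
        with open(bar, "w") as f:
            f.write("bbbb")  # same size, different contents
        stamp = 1_000_000_000  # arbitrary fixed time, forces equal mtimes
        os.utime(foo, (stamp, stamp))
        os.utime(bar, (stamp, stamp))

        before = os.stat(foo)
        with open(foo) as f:
            old_contents = f.read()
        os.replace(bar, foo)  # the equivalent of "mv bar foo"
        after = os.stat(foo)
        with open(foo) as f:
            new_contents = f.read()

        return {
            "contents_changed": old_contents != new_contents,
            "size_same": before.st_size == after.st_size,
            "mtime_same": before.st_mtime_ns == after.st_mtime_ns,
            "ctime_changed": before.st_ctime_ns != after.st_ctime_ns,
        }
```

In this run the contents of `foo` differ while its size and mtime are unchanged, so a check based on size and mtime alone would skip the upload.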

Reference: tahoe-lafs/trac-2024-07-25#897