what to do with filenames that are illegal on some systems #731
Reference: tahoe-lafs/trac-2024-07-25#731
If someone copies a file from system A into Tahoe-LAFS and then later someone tries to copy that file from Tahoe-LAFS into system B, then a problem could arise if the filename from system A is illegal on system B. This can happen in a few ways:
1. The filename could be illegal on Windows (http://msdn.microsoft.com/en-us/library/aa365247.aspx ), and system B could be Windows and system A non-Windows.
2. The filename could be illegal on Mac (http://developer.apple.com/technotes/tn/tn1150table.html ).
3. The filename could case-collide with another filename in the same directory, and system B could be a case-insensitive filesystem. (Note that Tahoe's current naïve approach will result in a randomly-chosen one of the files overwriting the other if the target system is Windows or Macintosh.)
4. If we allowed undecodable bytestring filenames from a POSIX system A, either by storing bytestring (non-unicode) filenames or by some escaping mechanism such as utf8b, then a non-POSIX system B would not be able to accept that name (or at least we should not write that name into that system). Likewise some users of POSIX have a policy that only correctly encoded unicode filenames should be stored in their filesystem, so for them we should not write that name even though we can do so by using the POSIX byte-oriented APIs.
Here are someone else's notes about these sorts of issues:
http://www.portfoliofaq.com/pfaq/FAQ00352.htm
See also David A. Wheeler's excellent article arguing that we should start being pickier about filenames in POSIX systems:
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
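The failure modes in the numbered list above can be made concrete with a small check. This is only a sketch, not anything Tahoe-LAFS ships: the function name `nonportable_reasons` is hypothetical, the Windows rules are a non-exhaustive subset of the MSDN page linked above, and the last test assumes undecodable bytes were carried through Python's `surrogateescape` handler (the utf8b scheme).

```python
import re

# Characters Windows forbids in file names, plus ASCII control characters.
_WINDOWS_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Reserved DOS device names (not exhaustive); reserved even with an extension.
_WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}


def nonportable_reasons(name):
    """Return a list of reasons why `name` may not be writable everywhere."""
    reasons = []
    if _WINDOWS_ILLEGAL.search(name):
        reasons.append("contains a character that is illegal on Windows")
    if name.split(".")[0].upper() in _WINDOWS_RESERVED:
        reasons.append("is a reserved device name on Windows")
    if name.endswith((" ", ".")):
        reasons.append("ends with a space or period, which Windows rejects")
    # Lone surrogates U+DC80..U+DCFF are how 'surrogateescape' (utf8b)
    # represents bytes that did not decode as text.
    if any("\udc80" <= ch <= "\udcff" for ch in name):
        reasons.append("contains bytes that did not decode as text")
    return reasons
```

An empty list means none of these particular checks fired; it does not prove the name is legal on every system.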
There are various ways Tahoe can deal with this. It can do something about it on the Tahoe -> system B leg of the trip, such as by stopping with an error, offering to rename the offending files, etc. It could also do something about it on the system A -> Tahoe leg of the trip.
I think in the short term it might be better if Tahoe rejected non-portable filenames in the system A -> Tahoe leg of the trip, because we don't yet know how we want to handle them. By rejecting them, we avoid the current random-overwrite issue and we don't constrain future versions of Tahoe-LAFS as much in terms of what sorts of filenames it has to support. (There might already be some problematic filenames stored in Tahoe and we might want to extend Tahoe to deal with these better in the future, but if Tahoe-v1.5 starts rejecting new ones then the problem will probably be less widespread and less severe in the future.)
On the other hand, rejecting them would be a UI/API regression, so we would probably want to add a --force-nonportable-filenames option to make it behave like Tahoe-v1.4 currently does. Help!?
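As a rough illustration of rejecting on the system A -> Tahoe leg with an override, the sketch below refuses a directory listing whose names collide case-insensitively unless the caller forces it. Everything here is hypothetical: `check_before_upload`, `NonPortableNameError`, and the `force_nonportable` parameter merely mirror the proposed option, and `str.casefold()` only approximates how real case-insensitive filesystems compare names.

```python
class NonPortableNameError(ValueError):
    """Raised when a directory listing contains names that may clash."""


def find_case_collisions(names):
    """Group names that differ only by case; returns a list of sorted groups."""
    groups = {}
    for name in names:
        groups.setdefault(name.casefold(), []).append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def check_before_upload(names, force_nonportable=False):
    """Refuse case-colliding names unless the caller explicitly forces it."""
    collisions = find_case_collisions(names)
    if collisions and not force_nonportable:
        raise NonPortableNameError(f"case-colliding names: {collisions}")
    return collisions
```

With the override set, the function returns the collisions instead of raising, so a caller could still warn about them.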
This is a "backwards-compatibility" issue. Doing the easy and lazy thing now could make things harder for future versions of Tahoe. Adding the "backwards-compatibility" Keyword and leaving this ticket in the "1.5.0" Milestone. Help!?
I meant "forward-compatibility": [//pipermail/tahoe-dev/2009-June/001968.html]
A few notes:
It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case. In the other cases, it would probably be a good idea to provide a hook in the Python API for handling filenames that can't be represented, and when using the CLI, etc., there should be at least two options: translate the name via some encoding, with a warning, and cause a hard error.
My 2c.
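One way to picture the hook suggested above is a policy callable that the caller passes in. This is a sketch under assumptions: `local_name`, `strict_policy`, and `translate_policy` are invented names, not part of any Tahoe-LAFS API, and `backslashreplace` is just one possible translation.

```python
import warnings


def strict_policy(name, encoding):
    """Hard error: refuse any name the target encoding cannot represent."""
    return name.encode(encoding)  # raises UnicodeEncodeError on failure


def translate_policy(name, encoding):
    """Translate unrepresentable characters to escapes, with a warning."""
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        translated = name.encode(encoding, errors="backslashreplace")
        warnings.warn(f"renamed {name!r} to {translated!r} for {encoding}")
        return translated


def local_name(name, encoding, on_unrepresentable=strict_policy):
    """Hypothetical hook point: the caller chooses the policy callable."""
    return on_unrepresentable(name, encoding)
```

A CLI could map its two options onto these two callables, while Python API users could supply their own.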
Replying to bewst:
This is my thought as well, at least for backup use cases. Tahoe in general has a broader usage model, and so solutions appropriate for backup may not be adequate for those other use cases, but for backups, I think the top priority is ensuring that backups succeed reliably and don't lose any data -- including file name data.
That's why the approach I've chosen for GridBackup (which, BTW, is finally starting to write to a grid, Yay!) is to make sure that:
ALL names can be backed up, regardless of whether or not they make any sense on any filesystem in existence.
When restoring to a system that uses the same encoding as the backup source, all names are restored byte-for-byte identically to what was read from the file system during backup.
When restoring to a system that uses a different encoding, I try to transcode the names but just error out if it doesn't work. Eventually my plan is to give the user a list of paths that broke and let them decide what to name each of them, with some suggestions based on attempts to decode the name with all Python-supported codecs.
During a restore, there's room for human intervention to address naming problems, but during backup, I just want to get the data. I'm taking a similar approach to other metadata. Extended attributes, ACLs, resource forks, even POSIX permissions -- there are destination systems to which none of these things will make sense, but that's okay. The backup will grab everything and we can deal with how to make use of the data, if possible, during restore.
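The three rules above could look roughly like this in Python. This is not GridBackup's actual code: the record layout and the function names are assumptions made for illustration, and the candidate codec list is an arbitrary small sample rather than every Python-supported codec.

```python
import codecs


def backup_name(raw_name, source_encoding):
    """Record the name exactly as read, plus the encoding it was read under."""
    return {"raw": raw_name, "encoding": source_encoding}


def restore_name(record, target_encoding):
    """Return the bytes to write on the target system, or raise on failure."""
    source = codecs.lookup(record["encoding"]).name
    target = codecs.lookup(target_encoding).name
    if source == target:
        return record["raw"]  # same encoding: byte-for-byte identical
    # Different encoding: transcode, and let any failure propagate.
    return record["raw"].decode(source).encode(target)


def suggest_decodings(raw_name, candidates=("utf-8", "cp1252", "shift_jis", "koi8_r")):
    """Offer candidate readings of a broken name for the user to choose from."""
    suggestions = {}
    for codec in candidates:
        try:
            suggestions[codec] = raw_name.decode(codec)
        except UnicodeDecodeError:
            pass  # this codec cannot read the bytes at all
    return suggestions
```

Note that the same-encoding path never decodes, so even names that are invalid in their own encoding are restored unchanged.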
Replying to swillden:
It's what I want for all the use cases I can think of, and especially so while GridBackup isn't ready for primetime.
I'm not going to do anything about this for v1.7.0. I still think the current behavior is problematic (there are normal, not-uncommon use cases where some files are unexpectedly overwritten and others where download/restore fails). But I don't have time to work on it for v1.7.0.
I almost hesitate to mention this, because I'm not at all sure that it is a good idea, but with regard to problem 4 from the initial comment, we could just try to autodetect the real encoding (if any) using this package I just discovered: http://chardet.feedparser.org/ . It is probably an even worse idea for filenames than for other strings, because filenames can be short and non-linguistic (e.g. "f954b.c" is a reasonable filename for an English speaker to use but not a reasonable string to find in English prose in a newspaper or web page).
(copying some comments that I wrote over on #1072...)
It is worth considering the five possible Requirements in this message. With our current unicode support as of Tahoe-LAFS v1.7.0 we have achieved Requirement 1 (unicode) and Requirement 2 (faithful if unicode). We have not achieved Requirement 3 (no file left behind), Requirement 4 (faithful bytes if not unicode), or Requirement 5 (no loss of information).
Nowadays I am pretty skeptical of the value of Requirement 4.
After I wrote that message I subsequently realized that a good behavior would be that if you load an ill-encoded filename into Tahoe-LAFS then its representation looks identical to or similar to the representation of that file when you view it with Nautilus, GNU ls, or whatever other tools would have the same problem with ill-encoded filenames. I think this should be added as Requirement 6 (familiar gibberish): "If you copy an ill-encoded filename into Tahoe-LAFS, its filename looks identical to or similar to what you see when you view it with other tools (e.g. Nautilus, GNU ls, etc.)".
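To make the difference between Requirement 6 and the byte-preserving requirements more tangible, here is a sketch of two ways Python can decode an ill-encoded name. The `display_name` helper and its `?` placeholder are assumptions for illustration; the exact gibberish that Nautilus or GNU ls shows varies by tool and version.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, ill-encoded where UTF-8 is expected

# Lossy display form: the undecodable byte becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# Reversible form (utf8b): the undecodable byte 0xE9 becomes U+DCE9.
escaped = raw.decode("utf-8", errors="surrogateescape")


def display_name(name, placeholder="?"):
    """Show escaped bytes as a placeholder, as many tools do when listing."""
    return "".join(
        placeholder if "\udc80" <= ch <= "\udcff" else ch for ch in name
    )


# Only the surrogateescape form can be turned back into the original bytes.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")
```

The replaced form is fine for showing to a user but cannot be used to recover the original bytes, which is the tension between "familiar gibberish" and "no loss of information".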
Here are some more notes from someone else about these sorts of surprises: http://www.ericsink.com/entries/quirky.html
stringprep (RFC 3454) seems like a useful standard:
http://www.ietf.org/rfc/rfc3454.txt
And it is implemented in the Python standard library:
http://docs.python.org/library/stringprep.html
Here are monotone's rules about filename handling:
http://www.monotone.ca/docs/Internationalization.html
Replying to zooko:
stringprep is one of the worst ideas ever to come out of an IETF Working Group.
Unicode is a semantic character encoding standard; that is, it makes a valiant attempt to unify or disunify characters based on distinctions in meaning and usage, as opposed to visual appearance. A simple example of this is that Latin 'p' looks identical to Cyrillic 'р', but they are completely different letters that don't even sound the same. Some people might consider that to be a problem, but actually it's just a fact about human scripts.
The International Domain Names Working Group got a bee in their bonnet about it being a problem that some characters are "confusingly" similar. Now, given that some commonly used characters are semantically distinct but look identical in related fonts, you might think it to be a quixotic task to somehow deal with the tens of thousands of characters that only look similar to some other character, but that didn't stop the WG arguing about it interminably, and coming up with stringprep in order to placate the people on one side of the argument -- even though stringprep doesn't really solve that issue at all.
There are indeed some characters, I call them "junk characters", that we don't want to use. The polite term for junk characters is "compatibility characters", most of which are "compatibility composites" as defined in section 2.3 of the Unicode Standard. These characters are only in Unicode because some national body insisted on round-tripping between Unicode and their misdesigned legacy standard (which could have been done in other ways that would have been more technically elegant than assigning many ad-hoc character variants, but that's water under the bridge).
The right place to implement "don't use junk characters" is in input methods. That is, if a user can never type a junk character, then it's much less likely that its existence will cause a problem. More specifically, if a user can only type non-junk characters in some normalization form (preferably NFC), then name lookups based on exact matching, as needed for filenames and other identifiers, are more likely to work.
The wrong thing to do is what stringprep tries to do, which is to map junk characters to somebody's idea of the nearest non-junk characters. This just causes unintended name collisions and breakage, and doesn't get any closer to solving the unsolvable issue of confusable characters.
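The argument above can be checked with the standard `unicodedata` module. In this sketch NFKC stands in for the kind of compatibility mapping that stringprep profiles apply (RFC 3454 specifies NFKC as its normalization form); the helper names and the sample filenames are made up for illustration.

```python
import unicodedata


def nfc(name):
    """Canonical composition only; compatibility characters are left alone."""
    return unicodedata.normalize("NFC", name)


def nfkc(name):
    """Compatibility composition; maps compatibility characters away."""
    return unicodedata.normalize("NFKC", name)


# Canonically equivalent spellings of the same name unify under NFC.
composed, decomposed = "caf\u00e9", "cafe\u0301"
same_under_nfc = nfc(composed) == nfc(decomposed)

# A compatibility character (U+FB01, the "fi" ligature) stays distinct
# under NFC but collides with plain "fi" under NFKC.
ligature, plain = "\ufb01le.txt", "file.txt"
distinct_under_nfc = nfc(ligature) != nfc(plain)
collide_under_nfkc = nfkc(ligature) == nfkc(plain)

# Latin 'p' and Cyrillic U+0440 look alike, and neither form unifies them.
still_confusable = nfkc("p") != nfkc("\u0440")
```

So mapping compatibility characters produces new name collisions, while the look-alike letters remain distinct either way.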
Before we dig into this hard, what is special about tahoe, compared to the other 12 distributed filesystems out there, and what problem do we have that they don't, and why do their approaches not map?
Here's a good summary of Windows paths: https://googleprojectzero.blogspot.co.uk/2016/02/the-definitive-guide-on-win32-to-nt.html
See also #1840