what to do with filenames that are illegal on some systems #731

Open
opened 2009-06-09 21:19:25 +00:00 by zooko · 14 comments

If someone copies a file from system A into Tahoe-LAFS and then later someone tries to copy that file from Tahoe-LAFS into system B, then a problem could arise if the filename from system A is illegal on system B. This can happen in a few ways:

  1. The filename could be illegal on Windows (http://msdn.microsoft.com/en-us/library/aa365247.aspx ), and system B could be Windows and system A non-Windows.

  2. The filename could be illegal on Mac (http://developer.apple.com/technotes/tn/tn1150table.html ).

  3. The filename could case-collide with another filename in the same directory, and system B could be a case-insensitive filesystem. (Note that Tahoe's current naïve approach will result in a randomly-chosen one of the files overwriting the other if the target system is Windows or Macintosh.)

  4. If we allowed undecodable bytestring filenames from POSIX system A's, either by storing bytestring (non-unicode) filenames, or by some escaping mechanism such as utf8b, then a non-POSIX system B would not be able to accept that name (or at least we should not write that name into that system). Likewise some users of POSIX have a policy that only correctly encoded unicode filenames should be stored in their filesystem, so for them we should not write that name even though we can do so by using the POSIX byte-oriented APIs.
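As a rough illustration of cases 1 and 3 above, here is a minimal Python sketch of a portability check. The function names are hypothetical and are not part of Tahoe-LAFS; the Windows rules are a non-exhaustive reading of the MSDN page linked above, and `str.casefold()` only approximates the case folding that a real case-insensitive filesystem applies.

```python
import re

# Characters Windows rejects in a name component, plus ASCII control characters.
_WINDOWS_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Device names Windows reserves, with or without an extension (not exhaustive).
_WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}


def portability_problems(name):
    """Return reasons why `name` may be illegal on Windows (empty list if none found)."""
    problems = []
    if _WINDOWS_ILLEGAL.search(name):
        problems.append("contains a character that Windows forbids")
    if name.split(".")[0].upper() in _WINDOWS_RESERVED:
        problems.append("is a reserved Windows device name")
    if name.endswith((" ", ".")):
        problems.append("ends with a space or a period")
    return problems


def case_collisions(names):
    """Group names that would collide on a case-insensitive filesystem."""
    groups = {}
    for name in names:
        groups.setdefault(name.casefold(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

A check like this could run on either leg of the trip: before accepting a name into Tahoe-LAFS, or before writing it out to system B.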

Here are someone else's notes about these sorts of issues:

http://www.portfoliofaq.com/pfaq/FAQ00352.htm

See also David A. Wheeler's excellent article arguing that we should start being pickier about filenames in POSIX systems:

http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

There are various ways Tahoe can deal with this. It can do something about it on the Tahoe -> system B leg of the trip, such as by stopping with an error, offering to rename the offending files, etc. It could also do something about it on the system A -> Tahoe leg of the trip.

I think in the short term it might be better if Tahoe rejected non-portable filenames in the system A -> Tahoe leg of the trip, because we don't yet know how we want to handle them. By rejecting them, we avoid the current random-overwrite issue and we don't constrain future versions of Tahoe-LAFS as much in terms of what sorts of filenames it has to support. (There might already be some problematic filenames stored in Tahoe and we might want to extend Tahoe to deal with these better in the future, but if Tahoe-v1.5 starts rejecting new ones then the problem will probably be less widespread and less severe in the future.)

On the other hand, rejecting them would be a UI/API regression, so we would probably want to add a --force-nonportable-filenames option to make it behave like Tahoe-v1.4 currently does.

Help!?

zooko added the code-dirnodes, major, defect, 1.4.1 labels 2009-06-09 21:19:25 +00:00
zooko added this to the 1.5.0 milestone 2009-06-09 21:19:25 +00:00
Author

This is a "backwards-compatibility" issue. Doing the easy and lazy thing now could make things harder for future versions of Tahoe. Adding the "backwards-compatibility" Keyword and leaving this ticket in the "1.5.0" Milestone. Help!?

Author

I meant "forward-compatibility": [//pipermail/tahoe-dev/2009-June/001968.html]

bewst commented 2009-06-14 15:57:53 +00:00
Owner

A few notes:

  • My first reaction was to say you had the right idea in rejecting nonportable names, but then I thought about how it might affect me. Although rejecting nonportable names on the way in is "safe" from a design-evolution point of view, it probably won't make customers happy when their backup fails partway through because some file has a name tahoe didn't like. It'll also be a problem for some people if files that used to save just fine start producing error messages.
  • You might want to decide what "portable" means before trying to solve this problem. For example, are you planning to support VMS? That changes what it means to be a legal filename. One ambitious definition could be: works wherever Python works.
  • Many people have had to solve this sort of problem before you; this is one of those areas where you can benefit from their research, e.g. http://www.boost.org/doc/libs/1_39_0/libs/filesystem/doc/portability_guide.htm#recommendations.
  • FWIW, last I heard, Samba had given up on solving this problem correctly, though that may have changed.

It seems to me that tahoe probably has enough flexibility to store any filename, and many people will only be using it to store and retrieve files to/from the same system, so it should "just work" for that use case. In the other cases, it would probably be a good idea to provide a hook in the Python API for handling filenames that can't be represented, and when using the CLI, etc., there should be at least two options: translate the name via some encoding, with a warning, and cause a hard error.

My 2c.
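The hook and the two CLI policies suggested above could be sketched as follows. Every name here (`name_for_target`, `fail_hard`, `translate_with_warning`, `UnrepresentableNameError`) is hypothetical and does not exist in the Tahoe-LAFS API, and "representable" is narrowed to "encodable in the target encoding", which is only one of the ways a name can fail on system B.

```python
import warnings


class UnrepresentableNameError(Exception):
    """Raised when a stored name cannot be written to the target system."""


def fail_hard(name, target_encoding):
    """Policy 1: refuse to continue."""
    raise UnrepresentableNameError(f"{name!r} cannot be encoded as {target_encoding}")


def translate_with_warning(name, target_encoding):
    """Policy 2: substitute an escaped form of the name and warn about it."""
    escaped = name.encode(target_encoding, errors="backslashreplace")
    translated = escaped.decode(target_encoding)
    warnings.warn(f"{name!r} will be written as {translated!r}")
    return translated


def name_for_target(name, target_encoding, on_unrepresentable=fail_hard):
    """Return `name` unchanged if it is representable, else defer to the hook."""
    try:
        name.encode(target_encoding)
    except UnicodeEncodeError:
        return on_unrepresentable(name, target_encoding)
    return name
```

Passing the policy as a callable keeps the same-system case on the fast path (the name is returned untouched) while letting Python API users supply their own handler.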

swillden commented 2009-06-14 16:49:38 +00:00
Owner

Replying to bewst:

> It seems to me that tahoe probably has enough flexibility to store any filename, and many
> people will only be using it to store and retrieve files to/from the same system, so it should
> "just work" for that use case.

This is my thought as well, at least for backup use cases. Tahoe in general has a broader usage model, and so solutions appropriate for backup may not be adequate for those other use cases, but for backups, I think the top priority is ensuring that backups succeed reliably and don't lose any data -- including file name data.

That's why the approach I've chosen for GridBackup (which, BTW, is finally starting to write to a grid, Yay!) is to make sure that:

  1. ALL names can be backed up, regardless of whether or not they make any sense on any filesystem in existence.

  2. When restoring to a system that uses the same encoding as the backup source, all names are restored byte-for-byte identically to what was read from the file system during backup.

  3. When restoring to a system that uses a different encoding, I try to transcode the names but just error out if it doesn't work. Eventually my plan is to give the user a list of paths that broke and let them decide what to name each of them, with some suggestions based on attempts to decode the name with all Python-supported codecs.

During a restore, there's room for human intervention to address naming problems, but during backup, I just want to get the data. I'm taking a similar approach to other metadata. Extended attributes, ACLs, resource forks, even POSIX permissions -- there are destination systems to which none of these things will make sense, but that's okay. The backup will grab everything and we can deal with how to make use of the data, if possible, during restore.
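The three restore rules above might be sketched like this. This is not GridBackup's actual code: the function names and the short `CANDIDATE_CODECS` list are assumptions made for illustration, whereas the plan described above is to try every Python-supported codec.

```python
# A short, arbitrary sample of codecs; a fuller version would iterate over all of them.
CANDIDATE_CODECS = ("utf-8", "cp1252", "shift_jis", "koi8_r", "latin-1")


def transcode_name(raw_name, source_encoding, target_encoding):
    """Re-encode a backed-up name for the restore target, or raise UnicodeError."""
    if source_encoding == target_encoding:
        # Same encoding: restore byte-for-byte, even if the bytes are ill-encoded.
        return raw_name
    return raw_name.decode(source_encoding).encode(target_encoding)


def suggest_names(raw_name, codecs=CANDIDATE_CODECS):
    """Decode the raw bytes with each candidate codec and keep the readings that succeed."""
    suggestions = {}
    for codec in codecs:
        try:
            suggestions[codec] = raw_name.decode(codec)
        except UnicodeDecodeError:
            continue
    return suggestions
```

When `transcode_name` raises, the caller can collect the path and later show the user the output of `suggest_names` as starting points for a rename.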

bewst commented 2009-06-15 09:39:12 +00:00
Owner

Replying to swillden:

> Replying to bewst:
>
> > It seems to me that tahoe probably has enough flexibility to store any filename, and many
> > people will only be using it to store and retrieve files to/from the same system, so it should
> > "just work" for that use case.
>
> This is my thought as well, at least for backup use cases.

It's what I want for all the use cases I can think of, and especially so while GridBackup isn't ready for primetime.

zooko modified the milestone from 1.5.0 to 1.6.0 2009-06-30 12:38:02 +00:00
zooko modified the milestone from 1.6.0 to eventually 2010-01-26 15:44:07 +00:00
zooko modified the milestone from eventually to 1.7.0 2010-01-27 06:01:13 +00:00
Author

I'm not going to do anything about this for v1.7.0. I still think the current behavior is problematic (there are normal, not-uncommon use cases where some files are unexpectedly overwritten and others where download/restore fails). But I don't have time to work on it for v1.7.0.

zooko modified the milestone from 1.7.0 to eventually 2010-05-05 05:47:01 +00:00
Author

I almost hesitate to mention this, because I'm not at all sure that it is a good idea, but with regard to problem 4 from the initial comment, we could just try to autodetect the real encoding (if any) using this package I just discovered: http://chardet.feedparser.org/ . It is probably an even worse idea for filenames, which can be short and non-linguistic, than for other strings (e.g. "f954b.c" is a reasonable filename for an English speaker to use, but not a reasonable string to find in the English prose of a newspaper or web page).

Author

(copying some comments that I wrote over on #1072...)

It is worth considering the five possible Requirements in this message: http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001670.html . With our current unicode support as of Tahoe-LAFS v1.7.0 we have achieved Requirement 1 (unicode) and Requirement 2 (faithful if unicode). We have not achieved Requirement 3 (no file left behind), Requirement 4 (faithful bytes if not unicode), or Requirement 5 (no loss of information).

Nowadays I am pretty skeptical of the value of Requirement 4.

After I wrote that message I subsequently realized that a good behavior would be that if you load an ill-encoded filename into Tahoe-LAFS then its representation looks identical to or similar to the representation of that file when you view it with Nautilus, GNU ls, or whatever other tools would have the same problem with ill-encoded filenames. I think this should be added as Requirement 6 (familiar gibberish): "If you copy an ill-encoded filename into Tahoe-LAFS, its filename looks identical to or similar to what you see when you view it with other tools (e.g. Nautilus, GNU ls, etc.)".
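To make the trade-off between "no loss of information" and "familiar gibberish" concrete, here is a small sketch using Python 3's built-in `surrogateescape` and `replace` error handlers. It is illustrative only and is not how Tahoe-LAFS handles names; `surrogateescape` (PEP 383) is based on the UTF-8b idea mentioned in the initial comment, and the helper names are made up for this example.

```python
def decode_lossless(raw_name, encoding="utf-8"):
    """Decode a name, smuggling each undecodable byte through as a lone surrogate."""
    return raw_name.decode(encoding, errors="surrogateescape")


def encode_lossless(name, encoding="utf-8"):
    """Invert decode_lossless, recovering the original bytes exactly."""
    return name.encode(encoding, errors="surrogateescape")


def display_form(raw_name, encoding="utf-8"):
    """Decode a name for display, showing U+FFFD where the bytes are ill-encoded."""
    return raw_name.decode(encoding, errors="replace")
```

The lossless form round-trips ill-encoded bytes but contains code points that are not valid in well-formed Unicode text, so it should not be written to a system that expects correctly encoded names. The display form is closer to what many tools show for such a file, at the cost of discarding the original bytes.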

Author

Here are some more notes from someone else about these sorts of surprises: http://www.ericsink.com/entries/quirky.html

Author

stringprep (RFC 3454) seems like a useful standard:

http://www.ietf.org/rfc/rfc3454.txt

And it is implemented in the Python standard library:

http://docs.python.org/library/stringprep.html

Here are monotone's rules about filename handling:

http://www.monotone.ca/docs/Internationalization.html

davidsarah commented 2012-01-09 20:04:22 +00:00
Owner

Replying to zooko:

> stringprep (RFC 3454) seems like a useful standard:
>
> http://www.ietf.org/rfc/rfc3454.txt

stringprep is one of the worst ideas ever to come out of an IETF Working Group.

Unicode is a semantic character encoding standard; that is, it makes a valiant attempt to unify or disunify characters based on distinctions in meaning and usage, as opposed to visual appearance. A simple example of this is that Latin 'p' looks identical to Cyrillic 'р', but they are completely different letters that don't even sound the same. Some people might consider that to be a problem, but actually it's just a fact about human scripts.

The International Domain Names Working Group got a bee in their bonnet about it being a problem that some characters are "confusingly" similar. Now, given that some commonly used characters are semantically distinct but look identical in related fonts, you might think it to be a quixotic task to somehow deal with the tens of thousands of characters that only look similar to some other character, but that didn't stop the WG arguing about it interminably, and coming up with stringprep in order to placate the people on one side of the argument -- even though stringprep doesn't really solve that issue at all.

There are indeed some characters, I call them "junk characters", that we don't want to use. The polite term for junk characters is "compatibility characters", most of which are "compatibility composites" as defined in section 2.3 of the Unicode Standard. These characters are only in Unicode because some national body insisted on round-tripping between Unicode and their misdesigned legacy standard (which could have been done in other ways that would have been more technically elegant than assigning many ad-hoc character variants, but that's water under the bridge).

The right place to implement "don't use junk characters" is in input methods. That is, if a user can never type a junk character, then it's much less likely that its existence will cause a problem. More specifically, if a user can only type non-junk characters in some normalization form (preferably NFC), then name lookups based on exact matching, as needed for filenames and other identifiers, are more likely to work.

The wrong thing to do is what stringprep tries to do, which is to map junk characters to somebody's idea of the nearest non-junk characters. This just causes unintended name collisions and breakage, and doesn't get any closer to solving the unsolvable issue of confusable characters.
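The collision risk described above can be seen with Python's `unicodedata` module. This sketch compares NFC with NFKC, the compatibility normalization that stringprep profiles such as Nameprep apply; the `collisions` helper and the sample names are made up for illustration.

```python
import unicodedata


def collisions(names, form):
    """Group names that become identical after normalizing to `form`."""
    groups = {}
    for name in names:
        groups.setdefault(unicodedata.normalize(form, name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


# "\ufb01le" starts with the compatibility ligature U+FB01 (LATIN SMALL LIGATURE FI);
# "\u0440" is Cyrillic ER, which looks like Latin "p" but is a different letter.
SAMPLE = ["file", "\ufb01le", "p", "\u0440"]

nfc_collisions = collisions(SAMPLE, "NFC")    # no names merge
nfkc_collisions = collisions(SAMPLE, "NFKC")  # "file" and "\ufb01le" merge
```

NFC leaves all four names distinct, while NFKC folds the ligature into "fi" and so makes two different directory entries collide. Neither form merges Latin "p" with Cyrillic "р", which is the confusable-character case that mapping cannot solve.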

Owner

Before we dig into this hard, what is special about tahoe, compared to the other 12 distributed filesystems out there, and what problem do we have that they don't, and why do their approaches not map?

Author
Here's a good summary of Windows paths: <https://googleprojectzero.blogspot.co.uk/2016/02/the-definitive-guide-on-win32-to-nt.html>
Author

See also #1840

Reference: tahoe-lafs/trac-2024-07-25#731