Unicode normalization needs to be applied to filenames in more cases #1076

Closed
opened 2010-06-10 00:01:50 +00:00 by davidsarah · 15 comments
davidsarah commented 2010-06-10 00:01:50 +00:00
Owner

Currently, the CLI normalizes filenames to NFC when listing the contents of a local directory in [listdir_unicode]source:src/allmydata/util/stringutils.py?rev=4464#L193 (used by tahoe cp and tahoe backup), but that is the only point at which filenames are normalized.

So, unnormalized filenames can get into Tahoe directories via CLI arguments, SFTP, FTP, and the WUI.

This is a forward-compatibility issue because, if we have any non-NFC filenames stored in Tahoe directories that we need to maintain compatibility with, then we would have to normalize when reading filenames out of Tahoe directories and not just when putting filenames into them.

Currently, the CLI normalizes filenames to NFC when listing the contents of a **local** directory in [listdir_unicode]source:src/allmydata/util/stringutils.py?rev=4464#L193 (used by `tahoe cp` and `tahoe backup`), but that is the only point at which filenames are normalized. So, unnormalized filenames can get into Tahoe directories via CLI arguments, SFTP, FTP, and the WUI. This is a forward-compatibility issue because, if we have any non-NFC filenames stored in Tahoe directories that we need to maintain compatibility with, then we would have to normalize when reading filenames out of Tahoe directories and not just when putting filenames into them.
tahoe-lafs added the
code-frontend
major
defect
1.7β
labels 2010-06-10 00:01:50 +00:00
tahoe-lafs added this to the undecided milestone 2010-06-10 00:01:50 +00:00
davidsarah commented 2010-06-10 00:04:57 +00:00
Author
Owner

It should probably be the dirnode implementation that enforces this, so that we are not having to normalize in multiple frontends.

It should probably be the dirnode implementation that enforces this, so that we are not having to normalize in multiple frontends.
tahoe-lafs added
code-dirnodes
and removed
code-frontend
labels 2010-06-10 00:04:57 +00:00

Since this is a potentially significant forward-compatibility issue and potentially significant bug, we're going to fix it for 1.7.0-final.

Since this is a potentially significant forward-compatibility issue and potentially significant bug, we're going to fix it for 1.7.0-final.
zooko modified the milestone from undecided to 1.7.0 2010-06-10 19:14:59 +00:00
davidsarah commented 2010-06-16 03:29:12 +00:00
Author
Owner

Attachment nfc-normalization.dpatch (68260 bytes) added

Provisional patch to NFC-normalize filenames going in and out of Tahoe directories.

**Attachment** nfc-normalization.dpatch (68260 bytes) added Provisional patch to NFC-normalize filenames going in and out of Tahoe directories.
davidsarah commented 2010-06-16 03:44:31 +00:00
Author
Owner

nfc-normalization.dpatch also normalizes names to NFC when unpacking them from directories. This isn't absolutely necessary, but if a name contains characters that are unassigned in the version of Unicode used by the client that wrote the directory, then they might not be normalized wrt a later version of Unicode.

The patch does not remove the normalization from listdir_unicode in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because:

  • if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to.
  • if the local filesystem API is not normalizing, then users of listdir_unicode such as tahoe cp and tahoe backup may get a 'file not found' error when they try to read the file by its normalized name.

The patch does not have any tests.

nfc-normalization.dpatch also normalizes names to NFC when unpacking them from directories. This isn't absolutely necessary, but if a name contains characters that are unassigned in the version of Unicode used by the client that wrote the directory, then they might not be normalized wrt a later version of Unicode. The patch does not remove the normalization from `listdir_unicode` in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because: * if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to. * if the local filesystem API is not normalizing, then users of `listdir_unicode` such as `tahoe cp` and `tahoe backup` may get a 'file not found' error when they try to read the file by its normalized name. The patch does not have any tests.

Replying to davidsarah:

The patch does not remove the normalization from listdir_unicode in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because:

  • if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to.

Don't we need to re-normalize to NFC before putting that name into a Tahoe-LAFS directory?

Oh, but that will happen at the other call site -- at the Tahoe-LAFS directory insertion point. Right?

Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/1076#issuecomment-77885): > The patch does not remove the normalization from `listdir_unicode` in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because: > * if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to. Don't we need to re-normalize to NFC before putting that name into a Tahoe-LAFS directory? Oh, but that will happen at the other call site -- at the Tahoe-LAFS directory insertion point. Right?

Review:

Review: * Please write down into some doc why we chose NFC for our normalization scheme. * It's not clear that you intended for the new child name to get normalized twice when it gets set to the current child name, but I guess it doesn't hurt: <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L425> * I appreciate the addition of little clarifying comments like this: <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L133> * Is it even *possible* to have a name containing characters that are unassigned in the current version of unicode? Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158> * In any case, I look forward to seeing how you are going to unit-test this case! I guess the test needs to stick the utf-8 encoding of a non-normalized string into a directory so that it will exercise this case. * And of course we need unit tests. Otherwise, this looks like a good patch! Thank you!

Replying to zooko:

  • Is it even possible to have a name containing characters that are unassigned in the current version of unicode? Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158
  • In any case, I look forward to seeing how you are going to unit-test this case! I guess the test needs to stick the utf-8 encoding of a non-normalized string into a directory so that it will exercise this case.

Oh, this method of testing also suggests a reason why we need the code: because releases of Tahoe-LAFS < v1.7 might put non-normalized names into directories.

Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/1076#issuecomment-77887): > * Is it even *possible* to have a name containing characters that are unassigned in the current version of unicode? Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158> > * In any case, I look forward to seeing how you are going to unit-test this case! I guess the test needs to stick the utf-8 encoding of a non-normalized string into a directory so that it will exercise this case. Oh, this method of testing also suggests a reason why we need the code: because releases of Tahoe-LAFS < v1.7 might put non-normalized names into directories.
davidsarah commented 2010-06-16 05:05:26 +00:00
Author
Owner

Replying to zooko:

Review:

  • Please write down into some doc why we chose NFC for our normalization scheme.

Will do (probably in webapi.txt).

It's quite difficult to avoid all possible cases of double-normalization without breaking abstraction. (You would have to add another method that assumed its argument was already normalized, and ensure that assumption was always met.)

  • Is it even possible to have a name containing characters that are unassigned in the current version of unicode?

Yes, we don't check for unassigned characters.

Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158

<http://unicode.org/policies/stability_policy.html>, see the note in section 'Normalization Stability'.

  • In any case, I look forward to seeing how you are going to unit-test this case! I guess the test needs to stick the utf-8 encoding of a non-normalized string into a directory so that it will exercise this case.

Right. We can also check that we handle other misencoded directory contents that way (which is a test that was left undone in 1.6.0 and .1).

Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/1076#issuecomment-77887): > Review: > * Please write down into some doc why we chose NFC for our normalization scheme. Will do (probably in webapi.txt). > * It's not clear that you intended for the new child name to get normalized twice when it gets set to the current child name, but I guess it doesn't hurt: <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L425> It's quite difficult to avoid all possible cases of double-normalization without breaking abstraction. (You would have to add another method that assumed its argument was already normalized, and ensure that assumption was always met.) > * Is it even *possible* to have a name containing characters that are unassigned in the current version of unicode? Yes, we don't check for unassigned characters. > Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158> <<http://unicode.org/policies/stability_policy.html>>, see the note in section 'Normalization Stability'. > * In any case, I look forward to seeing how you are going to unit-test this case! I guess the test needs to stick the utf-8 encoding of a non-normalized string into a directory so that it will exercise this case. Right. We can also check that we handle other misencoded directory contents that way (which is a test that was left undone in 1.6.0 and .1).
davidsarah commented 2010-06-16 05:06:43 +00:00
Author
Owner

Replying to [zooko]comment:5:

Replying to davidsarah:

The patch does not remove the normalization from listdir_unicode in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because:

  • if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to.

Don't we need to re-normalize to NFC before putting that name into a Tahoe-LAFS directory?

Oh, but that will happen at the other call site -- at the Tahoe-LAFS directory insertion point. Right?

Right.

Replying to [zooko]comment:5: > Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/1076#issuecomment-77885): > > The patch does not remove the normalization from `listdir_unicode` in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because: > > * if the local filesystem API is normalizing (even using NFD rather than NFC, as in the case of Mac OS X), then we don't need to. > > Don't we need to re-normalize to NFC before putting that name into a Tahoe-LAFS directory? > > Oh, but that will happen at the other call site -- at the Tahoe-LAFS directory insertion point. Right? Right.

Replying to [davidsarah]comment:8:

Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158

<http://unicode.org/policies/stability_policy.html>, see the note in section 'Normalization Stability'.

Please put this reference into the comments or docs somewhere. Thanks!

Replying to [davidsarah]comment:8: > > Perhaps you could explain further in this comment or point to a larger explanation in some doc text file or at least refer to some external 3rd party document that describes this situation? <http://tahoe-lafs.org/trac/tahoe-lafs/attachment/ticket/1076/nfc-normalization.dpatch#L158> > > <<http://unicode.org/policies/stability_policy.html>>, see the note in section 'Normalization Stability'. Please put this reference into the comments or docs somewhere. Thanks!
davidsarah commented 2010-06-17 04:31:09 +00:00
Author
Owner

Attachment nfc-normalization-2.dpatch (88673 bytes) added

Patch bundle for normalization changes including tests. Also work around a bug in locale.getpreferredencoding, and add support for Unicode 'exclude' patterns in 'tahoe backup'.

**Attachment** nfc-normalization-2.dpatch (88673 bytes) added Patch bundle for normalization changes including tests. Also work around a bug in locale.getpreferredencoding, and add support for Unicode 'exclude' patterns in 'tahoe backup'.

Too tired. Will review tomorrow morning on the bus to work. I hope there are or will be tests for these new things: "work around a bug in locale.getpreferredencoding" and "add support for Unicode 'exclude' patterns in 'tahoe backup'"...

Too tired. Will review tomorrow morning on the bus to work. I hope there are or will be tests for these new things: "work around a bug in locale.getpreferredencoding" and "add support for Unicode 'exclude' patterns in 'tahoe backup'"...
zooko self-assigned this 2010-06-17 05:14:31 +00:00
davidsarah commented 2010-06-18 02:17:53 +00:00
Author
Owner

Attachment nfc-normalization-3.dpatch (99246 bytes) added

Patch bundle for normalization changes including tests (and a new test for normalization of names coming out of a directory). Also work around a bug in locale.getpreferredencoding. Fixes a hold in the previous patch where childnames in directories created by modemaker.create_mutable/immutable_directory would not be normalized. Does not include the 'tahoe backup' change.

**Attachment** nfc-normalization-3.dpatch (99246 bytes) added Patch bundle for normalization changes including tests (and a new test for normalization of names coming out of a directory). Also work around a bug in locale.getpreferredencoding. Fixes a hold in the previous patch where childnames in directories created by modemaker.create_mutable/immutable_directory would not be normalized. Does not include the 'tahoe backup' change.

Reviewed! This patch set is GREAT. Applying...

changeset:c8d99b77a32b9bfd, changeset:e2c7ad1d881312b3, changeset:9f5488b2d1493d14, changeset:025aede9e40c5749, changeset:5ada31034b0bc043, changeset:7e7644589a365371, changeset:718870a796151e84, changeset:a9fe3792ded50a26,

Reviewed! This patch set is GREAT. Applying... changeset:c8d99b77a32b9bfd, changeset:e2c7ad1d881312b3, changeset:9f5488b2d1493d14, changeset:025aede9e40c5749, changeset:5ada31034b0bc043, changeset:7e7644589a365371, changeset:718870a796151e84, changeset:a9fe3792ded50a26,
zooko added the
fixed
label 2010-06-18 05:14:33 +00:00
zooko closed this issue 2010-06-18 05:14:33 +00:00
davidsarah commented 2010-06-18 05:25:05 +00:00
Author
Owner

and changeset:390fc78a9a68b42c.

and changeset:390fc78a9a68b42c.
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1076
No description provided.