Unicode normalization needs to be applied to filenames in more cases #1076
Reference: tahoe-lafs/trac-2024-07-25#1076
Currently, the CLI normalizes filenames to NFC when listing the contents of a local directory in listdir_unicode (source:src/allmydata/util/stringutils.py?rev=4464#L193, used by tahoe cp and tahoe backup), but that is the only point at which filenames are normalized. So unnormalized filenames can get into Tahoe directories via CLI arguments, SFTP, FTP, and the WUI.
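To illustrate why unnormalized names matter, here is a minimal sketch (Python 3, standard library only; the filename and the helper name to_nfc are made up for illustration) showing two spellings of the same visible filename that compare unequal until both are normalized to NFC:

```python
import unicodedata

# Two spellings of the same visible filename "café.txt".
composed = "caf\u00e9.txt"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301.txt"  # "e" followed by U+0301 COMBINING ACUTE ACCENT


def to_nfc(name):
    """Return the NFC form of a filename (illustrative helper)."""
    return unicodedata.normalize("NFC", name)


# As raw strings the two names differ, so a directory keyed on the raw
# string would treat them as two separate entries.
same_before = composed == decomposed

# After NFC normalization both spellings become the same string.
same_after = to_nfc(composed) == to_nfc(decomposed)
```

A frontend that passes either spelling through unchanged would therefore be able to create two directory entries that look identical to the user.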
This is a forward-compatibility issue because, if we have any non-NFC filenames stored in Tahoe directories that we need to maintain compatibility with, then we would have to normalize when reading filenames out of Tahoe directories and not just when putting filenames into them.
It should probably be the dirnode implementation that enforces this, so that we are not having to normalize in multiple frontends.
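As a rough sketch of what enforcing this in one place could look like, the class below normalizes names on every insertion and lookup, so no frontend has to do it. NormalizingDirectory and its methods are hypothetical stand-ins, not the actual Tahoe-LAFS dirnode API:

```python
import unicodedata


def normalize_name(name):
    """Normalize a child name to NFC."""
    return unicodedata.normalize("NFC", name)


class NormalizingDirectory:
    """Hypothetical stand-in for a dirnode; not Tahoe-LAFS's real class."""

    def __init__(self):
        self._children = {}

    def set_child(self, name, child):
        # Normalize on the way in, whatever frontend supplied the name.
        self._children[normalize_name(name)] = child

    def get_child(self, name):
        # Normalize on lookup too, so either spelling finds the entry.
        return self._children[normalize_name(name)]

    def list_names(self):
        return sorted(self._children)
```

With this shape, a name inserted in decomposed form can be looked up in composed form, and the directory only ever lists the NFC spelling.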
Since this is a potentially significant forward-compatibility issue and potentially significant bug, we're going to fix it for 1.7.0-final.
Attachment nfc-normalization.dpatch (68260 bytes) added
Provisional patch to NFC-normalize filenames going in and out of Tahoe directories.
nfc-normalization.dpatch also normalizes names to NFC when unpacking them from directories. This isn't absolutely necessary, but if a name contains characters that are unassigned in the version of Unicode used by the client that wrote the directory, then they might not be normalized wrt a later version of Unicode.
The patch does not remove the normalization from listdir_unicode in source:src/allmydata/util/stringutils.py. We should not be normalizing at that point, because callers of listdir_unicode such as tahoe cp and tahoe backup may get a 'file not found' error when they try to read the file by its normalized name.
The patch does not have any tests.
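One way to avoid the 'file not found' problem is to keep the on-disk name for local file access and use the normalized name only for the Tahoe directory entry. The sketch below assumes that approach; pair_local_and_tahoe_names is a hypothetical helper, not a function in the Tahoe-LAFS source:

```python
import unicodedata


def pair_local_and_tahoe_names(local_names):
    """For each on-disk name, return (name_to_open_locally, name_to_store).

    The first element is left exactly as the filesystem reported it, so
    opening the file still works; the second is the NFC form intended for
    the Tahoe directory entry.
    """
    return [(name, unicodedata.normalize("NFC", name)) for name in local_names]


# Example: a name that the local filesystem stores in decomposed form.
pairs = pair_local_and_tahoe_names(["cafe\u0301.txt"])
```

Here the caller would open the file with the first element of each pair and insert the child under the second.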
Replying to davidsarah:
Don't we need to re-normalize to NFC before putting that name into a Tahoe-LAFS directory?
Oh, but that will happen at the other call site -- at the Tahoe-LAFS directory insertion point. Right?
Review:
Otherwise, this looks like a good patch! Thank you!
Replying to zooko:
Oh, this method of testing also suggests a reason why we need the code: because releases of Tahoe-LAFS < v1.7 might put non-normalized names into directories.
Replying to zooko:
Will do (probably in webapi.txt).
It's quite difficult to avoid all possible cases of double-normalization without breaking abstraction. (You would have to add another method that assumed its argument was already normalized, and ensure that assumption was always met.)
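For what it's worth, double normalization costs some time but does not change the result, because NFC normalization is idempotent. A small Python 3 sketch (the filename and the helper name are illustrative only):

```python
import unicodedata


def normalize(name):
    """Normalize a name to NFC."""
    return unicodedata.normalize("NFC", name)


name = "cafe\u0301.txt"   # decomposed spelling
once = normalize(name)    # normalized at one call site
twice = normalize(once)   # normalized again at another call site

# Normalizing an already-normalized name returns an equal string.
unchanged_by_second_pass = once == twice
```

So the redundant call is a performance concern rather than a correctness one.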
Yes, we don't check for unassigned characters.
<http://unicode.org/policies/stability_policy.html>, see the note in section 'Normalization Stability'.
Right. We can also check that we handle other misencoded directory contents that way (which is a test that was left undone in 1.6.0 and 1.6.1).
Replying to zooko (comment:5):
Right.
Replying to davidsarah (comment:8):
Please put this reference into the comments or docs somewhere. Thanks!
Attachment nfc-normalization-2.dpatch (88673 bytes) added
Patch bundle for normalization changes including tests. Also work around a bug in locale.getpreferredencoding, and add support for Unicode 'exclude' patterns in 'tahoe backup'.
Too tired. Will review tomorrow morning on the bus to work. I hope there are or will be tests for these new things: "work around a bug in locale.getpreferredencoding" and "add support for Unicode 'exclude' patterns in 'tahoe backup'"...
Attachment nfc-normalization-3.dpatch (99246 bytes) added
Patch bundle for normalization changes including tests (and a new test for normalization of names coming out of a directory). Also work around a bug in locale.getpreferredencoding. Fixes a hole in the previous patch where childnames in directories created by nodemaker.create_mutable/immutable_directory would not be normalized. Does not include the 'tahoe backup' change.
Reviewed! This patch set is GREAT. Applying...
changeset:c8d99b77a32b9bfd, changeset:e2c7ad1d881312b3, changeset:9f5488b2d1493d14, changeset:025aede9e40c5749, changeset:5ada31034b0bc043, changeset:7e7644589a365371, changeset:718870a796151e84, changeset:a9fe3792ded50a26, and changeset:390fc78a9a68b42c.