'tahoe backup' doesn't tolerate 8-bit filenames #629
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#629
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Sigh, another unicode problem, not unlike #565 or #534, but this time
affecting 'tahoe backup'. Sometimes the sqlite database doesn't like to
accept 8-bit strings.
I think that the database should hold whatever pathname we get back from
os.listdir(), i.e. if os.listdir() returns bytestrings that contain UTF-8
encoded filenames, then the database should hold bytestrings that happen to
contain UTF-8 encoded filenames. This means modifying the way we use sqlite
to tolerate bytestrings. According to this traceback provided by Andrej
Falout, this involves a custom text_factory:
This still leaves the question of what filenames we should pass to Tahoe. The
Tahoe webapi expects URL-encoded UTF-8-encoded bytestrings in the URL, and
the set_children() command expects JSON-encoded unicode strings as
childnames. The main question (as explored by #565/#534) is how to get from
the return value of os.listdir() (and the starting point in sys.argv) to a
tahoe-suitable unicode object.. that part depends upon what the local
encoding is, and on what convention is in use on any particular local
filesystem.
Why not take sqlite's advice and store only unicode strings in the db? This would be consistent with the direction that we're going on the
tahoe cp
issues: basically we're not going to support people who want to have ill-encoded filenames or a wrongly configured locale and who want their filenames to make round trips unchanged. I don't think that class of user is big enough (is it even non-null?) or their demands are reasonable enough that we should keep trying to satisfy them as well as satisfying people who have well-encoded filenames.Uh.. one might be me. :-) I have ill-encoded filenames as a leftover from old, old data stores: I actually have direct backups from my Amiga, for example, that have survived all the machine disasters I've had (so far) and persist in my backup stores. I have files from my old Commodore-64 too, for that matter. Sometimes it's a matter of trying to keep as pristine a backup of those files as possible so that I can use them in things like emulators, or even actual antique hardware, that keeps me from doing any sorts of conversions. Also, the OS filesystem seems to ignore encodings for the most part in its command-line activities: and so these artifacts persist. (And I'd love to store them in a tahoe backup grid without transforming them with.. for example, tar or something.)
I can think of two reasons:
I think the simplest approach is to take sqlite's advice and "use a text_factory that can interpret 8-bit bytestrings (like text_factory=str)". If I'd known about this up front (or if my development platform had signalled this error), that's what I would have done.
Hey, can someone who's experienced this problem see if the attached unit test fails? It passes on my system, but I think I have a less-picky version of sqlite, or something (maybe my system is giving my unicode filenames, instead of UTF-8-encoded filenames).
Also, http://docs.python.org/library/sqlite3.html is the place to go for text_factory information. I think we may want to define the
local_files.path
column to be of a special type, and then tell sqlite that we want to interact with that type as a bytestring. Also, if we do get unicode from os.listdir(), I guess we have to encode it (as UTF-8, doesn't really matter as long as it's one-way consistent) before passing it in as this special type.Attachment test_backupdb.py.diff (1589 bytes) added
updated test case
It makes sense to fix this with #534 and #565.
This patch fixes the crash I see with utf-8 encoded characters.
Replying to jsgf:
Hello,
I've tested this patch, it work fine for me.
My system: Windows XP SP2 French
Tahoe --version:
allmydata-tahoe: 1.6.1, foolscap: 0.4.2, pycryptopp: 0.5.17-r683, zfec: 1.4.5,
Twisted: 8.2.0, Nevow: 0.9.33, zope.interface: 3.5.2, python: 2.6.2,
platform: Windows-XP-5.1.2600-SP3, sqlite: 3.5.9, simplejson: 2.0.9, pyopenssl: 0.
9, argparse: 0.9.1, twisted: 8.2.0, nevow: 0.9.33-r17222, pyOpenSSL: 0.9,
pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1
I have tested with this big file (all chars that Windows accept in filename):
"LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´±¾¶§÷¸°¨·¹³² .txt"
The result:
C:>tahoe backup -v c:\temp\tahoe ROOT:TEST/TEST_CHARS
processing c:\temp\tahoe
skipping c:\temp\tahoe\LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´±¾¶§÷¸°¨·¹³² .txt..
re-using old directory for c:\temp\tahoe
0 files uploaded (1 reused), 0 files skipped, 0 directories created (1 reused), 0 directories skipped
0 files checked, 0 directories checked
backup done, elapsed time: 0:00:06
C:>
Fred
I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.
Attachment tahoe-backup-unicode-exclude-and-tests.dpatch (60796 bytes) added
Make 'tahoe backup --exclude' option accept Unicode (the other arguments were already fixed as part of #534/#565). Add tests, based partly on the existing #629 patch.
Reviewed! Should we update NEWS to include this change? Otherwise it is good. Applying...
applied as changeset:178401eb4e87589c, changeset:1a0674bf37779751
Replying to zooko:
It was already mentioned in the NEWS for 1.7beta, even though there were no tests then.
Hang on. Why does this work?
The field in the sqlite3 db that holds local filenames is declared as:
VARCHAR
is 8-bit, but source:src/allmydata/scripts/tahoe_backup.py is callingcheck_file
in source:src/allmydata/scripts/backupdb.py with a unicode argument.os.path.abspath
returns unicode when given a unicode path, so how come we are able to look that up in the database?(If it is being converted to UTF-8, that is fine and will work correctly.)
This change caused a test failure: http://tahoe-lafs.org/buildbot/builders/Kyle%20OpenBSD-4.6%20amd64/builds/258/steps/test-coverage/logs/stdio . Investigating.
Replying to davidsarah:
Fixed in changeset:6b2f99fa9a1f9552.