'tahoe backup' doesn't tolerate 8-bit filenames #629

Closed
opened 2009-02-16 23:09:47 +00:00 by warner · 15 comments

Sigh, another unicode problem, not unlike #565 or #534, but this time
affecting 'tahoe backup'. Sometimes the sqlite database doesn't like to
accept 8-bit strings.

I think that the database should hold whatever pathname we get back from
os.listdir(), i.e. if os.listdir() returns bytestrings that contain UTF-8
encoded filenames, then the database should hold bytestrings that happen to
contain UTF-8 encoded filenames. This means modifying the way we use sqlite
to tolerate bytestrings. According to this traceback provided by Andrej
Falout, this involves a custom text_factory:

  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 243, in process
    newfilecap, metadata = self.upload(childpath)
  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 322, in upload
    must_upload, bdb_results = self.check_backupdb(childpath)
  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 268, in check_backupdb
    r = self.backupdb.check_file(childpath, use_timestamps)
  File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/backupdb.py", line 168, in check_file
    (path,))
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless
you use a text_factory that can interpret 8-bit bytestrings (like
text_factory = str). It is highly recommended that you instead just
switch your application to Unicode strings.
Command exited with non-zero status 1

This still leaves the question of what filenames we should pass to Tahoe. The
Tahoe webapi expects URL-encoded UTF-8-encoded bytestrings in the URL, and
the set_children() command expects JSON-encoded unicode strings as
childnames. The main question (as explored by #565/#534) is how to get from
the return value of os.listdir() (and the starting point in sys.argv) to a
tahoe-suitable unicode object.. that part depends upon what the local
encoding is, and on what convention is in use on any particular local
filesystem.

Sigh, another unicode problem, not unlike #565 or #534, but this time affecting 'tahoe backup'. Sometimes the sqlite database doesn't like to accept 8-bit strings. I think that the database should hold whatever pathname we get back from os.listdir(), i.e. if os.listdir() returns bytestrings that contain UTF-8 encoded filenames, then the database should hold bytestrings that happen to contain UTF-8 encoded filenames. This means modifying the way we use sqlite to tolerate bytestrings. According to this traceback provided by Andrej Falout, this involves a custom text_factory: ``` File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 243, in process newfilecap, metadata = self.upload(childpath) File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 322, in upload must_upload, bdb_results = self.check_backupdb(childpath) File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/tahoe_backup.py", line 268, in check_backupdb r = self.backupdb.check_file(childpath, use_timestamps) File "/usr/src/tahoe/allmydata-tahoe-1.2.0-r3558/src/allmydata/scripts/backupdb.py", line 168, in check_file (path,)) sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. Command exited with non-zero status 1 ``` This still leaves the question of what filenames we should pass to Tahoe. The Tahoe webapi expects URL-encoded UTF-8-encoded bytestrings in the URL, and the set_children() command expects JSON-encoded unicode strings as childnames. The main question (as explored by #565/#534) is how to get from the return value of os.listdir() (and the starting point in sys.argv) to a tahoe-suitable unicode object.. that part depends upon what the local encoding is, and on what convention is in use on any particular local filesystem.
warner added the
code-frontend-cli
major
defect
1.3.0
labels 2009-02-16 23:09:47 +00:00
warner added this to the undecided milestone 2009-02-16 23:09:47 +00:00

Why not take sqlite's advice and store only unicode strings in the db? This would be consistent with the direction that we're going on the tahoe cp issues: basically we're not going to support people who want to have ill-encoded filenames or a wrongly configured locale and who want their filenames to make round trips unchanged. I don't think that class of user is big enough (is it even non-null?) or their demands are reasonable enough that we should keep trying to satisfy them as well as satisfying people who have well-encoded filenames.

Why not take sqlite's advice and store only unicode strings in the db? This would be consistent with the direction that we're going on the `tahoe cp` issues: basically we're not going to support people who want to have ill-encoded filenames or a wrongly configured locale and who want their filenames to make round trips unchanged. I don't think that class of user is big enough (is it even non-null?) or their demands are reasonable enough that we should keep trying to satisfy them as well as satisfying people who have well-encoded filenames.
midnightmagic commented 2009-07-20 21:45:29 +00:00
Owner

Uh.. one might be me. :-) I have ill-encoded filenames as a leftover from old, old data stores: I actually have direct backups from my Amiga, for example, that have survived all the machine disasters I've had (so far) and persist in my backup stores. I have files from my old Commodore-64 too, for that matter. Sometimes it's a matter of trying to keep as pristine a backup of those files as possible so that I can use them in things like emulators, or even actual antique hardware, that keeps me from doing any sorts of conversions. Also, the OS filesystem seems to ignore encodings for the most part in its command-line activities: and so these artifacts persist. (And I'd love to store them in a tahoe backup grid without transforming them with.. for example, tar or something.)

Uh.. one might be me. :-) I have ill-encoded filenames as a leftover from old, old data stores: I actually have direct backups from my Amiga, for example, that have survived all the machine disasters I've had (so far) and persist in my backup stores. I have files from my old Commodore-64 too, for that matter. Sometimes it's a matter of trying to keep as pristine a backup of those files as possible so that I can use them in things like emulators, or even actual antique hardware, that keeps me from doing any sorts of conversions. Also, the OS filesystem seems to ignore encodings for the most part in its command-line activities: and so these artifacts persist. (And I'd love to store them in a tahoe backup grid without transforming them with.. for example, tar or something.)
Author

I can think of two reasons:

  • deciding what unicode to use for a given os.listdir() return value is equivalent to solving #731/#734, and that is proving to be difficult and lengthy
  • backupdb paths are entirely local: they are not shared with other Tahoe users. They will only be compared against other strings coming from the same host's os.listdir(). So we aren't obligated to find a way to make them useful to other platforms with different encoding preferences.

I think the simplest approach is to take sqlite's advice and "use a text_factory that can interpret 8-bit bytestrings (like text_factory=str)". If I'd known about this up front (or if my development platform had signalled this error), that's what I would have done.

I can think of two reasons: * deciding what unicode to use for a given os.listdir() return value is equivalent to solving #731/#734, and that is proving to be difficult and lengthy * backupdb paths are entirely local: they are not shared with other Tahoe users. They will only be compared against other strings coming from the same host's os.listdir(). So we aren't obligated to find a way to make them useful to other platforms with different encoding preferences. I think the simplest approach is to take sqlite's advice and "use a text_factory that can interpret 8-bit bytestrings (like text_factory=str)". If I'd known about this up front (or if my development platform had signalled this error), that's what I would have done.
warner modified the milestone from undecided to eventually 2009-08-07 02:03:58 +00:00
Author

Hey, can someone who's experienced this problem see if the attached unit test fails? It passes on my system, but I think I have a less-picky version of sqlite, or something (maybe my system is giving my unicode filenames, instead of UTF-8-encoded filenames).

Also, http://docs.python.org/library/sqlite3.html is the place to go for text_factory information. I think we may want to define the local_files.path column to be of a special type, and then tell sqlite that we want to interact with that type as a bytestring. Also, if we do get unicode from os.listdir(), I guess we have to encode it (as UTF-8, doesn't really matter as long as it's one-way consistent) before passing it in as this special type.

Hey, can someone who's experienced this problem see if the attached unit test fails? It passes on my system, but I think I have a less-picky version of sqlite, or something (maybe my system is giving my unicode filenames, instead of UTF-8-encoded filenames). Also, <http://docs.python.org/library/sqlite3.html> is the place to go for text_factory information. I think we may want to define the `local_files.path` column to be of a special type, and then tell sqlite that we want to interact with that type as a bytestring. Also, if we *do* get unicode from os.listdir(), I guess we have to encode it (as UTF-8, doesn't really matter as long as it's one-way consistent) before passing it in as this special type.
Author

Attachment test_backupdb.py.diff (1589 bytes) added

updated test case

**Attachment** test_backupdb.py.diff (1589 bytes) added updated test case
davidsarah commented 2010-02-02 00:16:55 +00:00
Owner

It makes sense to fix this with #534 and #565.

It makes sense to fix this with #534 and #565.
tahoe-lafs modified the milestone from eventually to 1.7.0 2010-02-02 00:16:55 +00:00
Owner

This patch fixes the crash I see with utf-8 encoded characters.

diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py
--- old-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.371372576 -0800
+++ new-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.529372862 -0800
@@ -279,7 +279,7 @@
         self['exclude'] = set()
 
     def parseArgs(self, localdir, topath):
-        self.from_dir = localdir
+        self.from_dir = localdir.decode(sys.getfilesystemencoding())
         self.to_dir = topath
 
     def getSynopsis(Self):
This patch fixes the crash I see with utf-8 encoded characters. ``` diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py --- old-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.371372576 -0800 +++ new-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.529372862 -0800 @@ -279,7 +279,7 @@ self['exclude'] = set() def parseArgs(self, localdir, topath): - self.from_dir = localdir + self.from_dir = localdir.decode(sys.getfilesystemencoding()) self.to_dir = topath def getSynopsis(Self): ```
freestorm commented 2010-03-25 22:58:28 +00:00
Owner

Replying to jsgf:

Hello,
I've tested this patch, it work fine for me.

My system: Windows XP SP2 French

Tahoe --version:
allmydata-tahoe: 1.6.1, foolscap: 0.4.2, pycryptopp: 0.5.17-r683, zfec: 1.4.5,
Twisted: 8.2.0, Nevow: 0.9.33, zope.interface: 3.5.2, python: 2.6.2,
platform: Windows-XP-5.1.2600-SP3, sqlite: 3.5.9, simplejson: 2.0.9, pyopenssl: 0.
9, argparse: 0.9.1, twisted: 8.2.0, nevow: 0.9.33-r17222, pyOpenSSL: 0.9,
pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1

I have tested with this big file (all chars that Windows accept in filename):

"LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² .txt"

The result:

C:>tahoe backup -v c:\temp\tahoe ROOT:TEST/TEST_CHARS
processing c:\temp\tahoe
skipping c:\temp\tahoe\LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² .txt..

re-using old directory for c:\temp\tahoe
0 files uploaded (1 reused), 0 files skipped, 0 directories created (1 reused), 0 directories skipped
0 files checked, 0 directories checked
backup done, elapsed time: 0:00:06

C:>

Fred

This patch fixes the crash I see with utf-8 encoded characters.

diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py
--- old-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.371372576 -0800
+++ new-tahoe-lafs/src/allmydata/scripts/cli.py	2010-03-09 10:59:58.529372862 -0800
@@ -279,7 +279,7 @@
         self['exclude'] = set()
 
     def parseArgs(self, localdir, topath):
-        self.from_dir = localdir
+        self.from_dir = localdir.decode(sys.getfilesystemencoding())
         self.to_dir = topath
 
     def getSynopsis(Self):
Replying to [jsgf](/tahoe-lafs/trac-2024-07-25/issues/629#issuecomment-69741): Hello, I've tested this patch, it work fine for me. My system: Windows XP SP2 French Tahoe --version: allmydata-tahoe: 1.6.1, foolscap: 0.4.2, pycryptopp: 0.5.17-r683, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33, zope.interface: 3.5.2, python: 2.6.2, platform: Windows-XP-5.1.2600-SP3, sqlite: 3.5.9, simplejson: 2.0.9, pyopenssl: 0. 9, argparse: 0.9.1, twisted: 8.2.0, nevow: 0.9.33-r17222, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1 I have tested with this big file (all chars that Windows accept in filename): "LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² .txt" The result: C:\>tahoe backup -v c:\temp\tahoe ROOT:TEST/TEST_CHARS processing c:\temp\tahoe skipping c:\temp\tahoe\LONG_TEST_STRING !#$%&'()+,-.012345789=@ABCDEFGHIJKQRSTYZbcdefghijklmnopqrstuvwxyz{}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×áíóúñѪº¿®¬½¼¡«»ÁÂÀ©¢¥ãäðÐÊËÈÍÎϦÌÓßÔÒõÕµþÞÚÛÙýݯ´­±¾¶§÷¸°¨·¹³² .txt.. re-using old directory for c:\temp\tahoe 0 files uploaded (1 reused), 0 files skipped, 0 directories created (1 reused), 0 directories skipped 0 files checked, 0 directories checked backup done, elapsed time: 0:00:06 C:\> Fred > This patch fixes the crash I see with utf-8 encoded characters. > > ``` > diff -rN -u old-tahoe-lafs/src/allmydata/scripts/cli.py new-tahoe-lafs/src/allmydata/scripts/cli.py > --- old-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.371372576 -0800 > +++ new-tahoe-lafs/src/allmydata/scripts/cli.py 2010-03-09 10:59:58.529372862 -0800 > @@ -279,7 +279,7 @@ > self['exclude'] = set() > > def parseArgs(self, localdir, topath): > - self.from_dir = localdir > + self.from_dir = localdir.decode(sys.getfilesystemencoding()) > self.to_dir = topath > > def getSynopsis(Self): > ```

I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.

I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't review it myself right now. Unassigning myself and writing a note to tahoe-dev pleading for someone else to review it.
zooko modified the milestone from 1.7.0 to 1.8.0 2010-05-14 04:28:00 +00:00
tahoe-lafs modified the milestone from 1.8.0 to 1.7.0 2010-06-12 21:00:00 +00:00
davidsarah commented 2010-06-18 04:09:59 +00:00
Owner

Attachment tahoe-backup-unicode-exclude-and-tests.dpatch (60796 bytes) added

Make 'tahoe backup --exclude' option accept Unicode (the other arguments were already fixed as part of #534/#565). Add tests, based partly on the existing #629 patch.

**Attachment** tahoe-backup-unicode-exclude-and-tests.dpatch (60796 bytes) added Make 'tahoe backup --exclude' option accept Unicode (the other arguments were already fixed as part of #534/#565). Add tests, based partly on the existing #629 patch.

Reviewed! Should we update NEWS to include this change? Otherwise it is good. Applying...

applied as changeset:178401eb4e87589c, changeset:1a0674bf37779751

Reviewed! Should we update NEWS to include this change? Otherwise it is good. Applying... applied as changeset:178401eb4e87589c, changeset:1a0674bf37779751
zooko added the
fixed
label 2010-06-18 05:09:36 +00:00
zooko closed this issue 2010-06-18 05:09:36 +00:00
davidsarah commented 2010-06-18 05:27:45 +00:00
Owner

Replying to zooko:

Reviewed! Should we update NEWS to include this change?

It was already mentioned in the NEWS for 1.7beta, even though there were no tests then.

Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/629#issuecomment-69746): > Reviewed! Should we update NEWS to include this change? It was already mentioned in the NEWS for 1.7beta, even though there were no tests then.
davidsarah commented 2010-06-18 05:49:21 +00:00
Owner

Hang on. Why does this work?

The field in the sqlite3 db that holds local filenames is declared as:

 path  VARCHAR(1024) PRIMARY KEY, -- index, this is os.path.abspath(fn)

VARCHAR is 8-bit, but source:src/allmydata/scripts/tahoe_backup.py is calling check_file in source:src/allmydata/scripts/backupdb.py with a unicode argument. os.path.abspath returns unicode when given a unicode path, so how come we are able to look that up in the database?

(If it is being converted to UTF-8, that is fine and will work correctly.)

Hang on. Why does this work? The field in the sqlite3 db that holds local filenames is declared as: ``` path VARCHAR(1024) PRIMARY KEY, -- index, this is os.path.abspath(fn) ``` `VARCHAR` is 8-bit, but source:src/allmydata/scripts/tahoe_backup.py is calling `check_file` in source:src/allmydata/scripts/backupdb.py with a unicode argument. `os.path.abspath` returns unicode when given a unicode path, so how come we are able to look that up in the database? (If it is being converted to UTF-8, that is fine and will work correctly.)
davidsarah commented 2010-06-19 02:03:00 +00:00
Owner
This change caused a test failure: <http://tahoe-lafs.org/buildbot/builders/Kyle%20OpenBSD-4.6%20amd64/builds/258/steps/test-coverage/logs/stdio> . Investigating.
tahoe-lafs removed the
fixed
label 2010-06-19 02:03:00 +00:00
davidsarah reopened this issue 2010-06-19 02:03:00 +00:00
davidsarah commented 2010-06-19 03:52:09 +00:00
Owner

Replying to davidsarah:

This change caused a test failure: http://tahoe-lafs.org/buildbot/builders/Kyle%20OpenBSD-4.6%20amd64/builds/258/steps/test-coverage/logs/stdio . Investigating.

Fixed in changeset:6b2f99fa9a1f9552.

Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/629#issuecomment-69751): > This change caused a test failure: <http://tahoe-lafs.org/buildbot/builders/Kyle%20OpenBSD-4.6%20amd64/builds/258/steps/test-coverage/logs/stdio> . Investigating. Fixed in changeset:6b2f99fa9a1f9552.
tahoe-lafs added the
fixed
label 2010-06-19 03:52:09 +00:00
davidsarah closed this issue 2010-06-19 03:52:09 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#629
No description provided.