Unicode bug in grid to grid copies #1224

Closed
opened 2010-10-08 18:26:47 +00:00 by francois · 11 comments
francois commented 2010-10-08 18:26:47 +00:00
Owner

A grid to grid copy involving non-ASCII filenames fails. This is likely another occurrence of bug #534.

$ tahoe cp -rv tahoe:Blah tahoe:Blah2
/usr/lib/python2.5/urllib.py:1205: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  res = map(safe_map.__getitem__, s)
Traceback (most recent call last):
  File "/root/tahoe-lafs/support/bin/tahoe", line 9, in <module>
    load_entry_point('allmydata-tahoe==1.8.0-r4751', 'console_scripts', 'tahoe')()
  File "/root/tahoe-lafs/src/allmydata/scripts/runner.py", line 118, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/root/tahoe-lafs/src/allmydata/scripts/runner.py", line 104, in runner
    rc = cli.dispatch[command](so)
  File "/root/tahoe-lafs/src/allmydata/scripts/cli.py", line 493, in cp
    rc = tahoe_cp.copy(options)
  File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 762, in copy
    return Copier().do_copy(options)
  File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 442, in do_copy
    status = self.try_copy()
  File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 485, in try_copy
    return self.copy_to_directory(sources, target)
  File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 649, in copy_to_directory
    self.assign_targets(source, target)
  File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 684, in assign_targets
    subtarget = target.get_child_target(name)
  File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 378, in get_child_target
    writecap = make_tahoe_subdirectory(self.nodeurl, self.writecap, name)
  File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 55, in make_tahoe_subdirectory
    ]) + "?t=mkdir"
  File "/usr/lib/python2.5/urllib.py", line 1205, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xe9'
A grid to grid copy involving non-ASCII filenames fails. This is likely another occurrence of bug #534. ``` $ tahoe cp -rv tahoe:Blah tahoe:Blah2 ``` ``` /usr/lib/python2.5/urllib.py:1205: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal res = map(safe_map.__getitem__, s) Traceback (most recent call last): File "/root/tahoe-lafs/support/bin/tahoe", line 9, in <module> load_entry_point('allmydata-tahoe==1.8.0-r4751', 'console_scripts', 'tahoe')() File "/root/tahoe-lafs/src/allmydata/scripts/runner.py", line 118, in run rc = runner(sys.argv[1:], install_node_control=install_node_control) File "/root/tahoe-lafs/src/allmydata/scripts/runner.py", line 104, in runner rc = cli.dispatch[command](so) File "/root/tahoe-lafs/src/allmydata/scripts/cli.py", line 493, in cp rc = tahoe_cp.copy(options) File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 762, in copy return Copier().do_copy(options) File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 442, in do_copy status = self.try_copy() File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 485, in try_copy return self.copy_to_directory(sources, target) File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 649, in copy_to_directory self.assign_targets(source, target) File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 684, in assign_targets subtarget = target.get_child_target(name) File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 378, in get_child_target writecap = make_tahoe_subdirectory(self.nodeurl, self.writecap, name) File "/root/tahoe-lafs/src/allmydata/scripts/tahoe_cp.py", line 55, in make_tahoe_subdirectory ]) + "?t=mkdir" File "/usr/lib/python2.5/urllib.py", line 1205, in quote res = map(safe_map.__getitem__, s) KeyError: u'\xe9' ```
tahoe-lafs added the
code-frontend-cli
major
defect
1.8.0
labels 2010-10-08 18:26:47 +00:00
tahoe-lafs added this to the 1.8.1 milestone 2010-10-08 18:26:47 +00:00
davidsarah commented 2010-10-14 01:40:50 +00:00
Author
Owner

I had assumed that urllib.quote was supposed to UTF-8-then-percent-encode Unicode strings, but it's not documented as doing so, so that was probably wishful thinking.

This seems to be http://bugs.python.org/issue1712522. Apparently you have to convert to UTF-8 manually.

Note that we have a unicode_to_url method in source:src/allmydata/util/encodingutil.py that should probably be used for this (or maybe we should add a quote_unicode_url method, if it turns out that we normally need to convert and percent-escape at the same time).

I had assumed that `urllib.quote` was supposed to UTF-8-then-percent-encode Unicode strings, but [it's not documented as doing so](http://docs.python.org/library/urllib.html#urllib.quote), so that was probably wishful thinking. This seems to be <http://bugs.python.org/issue1712522>. Apparently [you have to convert to UTF-8 manually](http://old.nabble.com/Re:-Problem:-neither-urllib2.quote-nor-urllib.quote-encode-the--unicode-strings-arguments-p19823144.html). Note that we have a `unicode_to_url` method in source:src/allmydata/util/encodingutil.py that should probably be used for this (or maybe we should add a `quote_unicode_url` method, if it turns out that we normally need to convert and percent-escape at the same time).

This isn't actually a regression from v1.7.1 to v1.8.0 is it?

(Maybe we should fix it in v1.8.1 anyway, just because it is easy to fix, impacts actual users like François, the fix is unlikely to cause other problems, and it is "unfinished business" from the new univode support in v1.7.0.)

This isn't actually a regression from v1.7.1 to v1.8.0 is it? (Maybe we should fix it in v1.8.1 anyway, just because it is easy to fix, impacts actual users like François, the fix is unlikely to cause other problems, and it is "unfinished business" from the new univode support in v1.7.0.)
francois commented 2010-10-16 01:01:05 +00:00
Author
Owner

A patch to fix this bug and add a test has been pushed in my git repository which is available there:

http://github.com/ctrlaltdel/tahoe-lafs/tree/ticket/1224

A patch to fix this bug and add a test has been pushed in my git repository which is available there: <http://github.com/ctrlaltdel/tahoe-lafs/tree/ticket/1224>
davidsarah commented 2010-10-16 04:18:42 +00:00
Author
Owner

There are other instances of urllib.quote with a name (as opposed to a cap URI) as argument, in tahoe_backup.py, tahoe_mkdir.py, tahoe_put.py, and web/directory.py I think.

There are other instances of `urllib.quote` with a name (as opposed to a cap URI) as argument, in `tahoe_backup.py`, `tahoe_mkdir.py`, `tahoe_put.py`, and `web/directory.py` I think.
francois commented 2010-10-16 09:34:50 +00:00
Author
Owner

Replying to davidsarah:

There are other instances of urllib.quote with a name (as opposed to a cap URI) as argument, in tahoe_backup.py, tahoe_mkdir.py, tahoe_put.py, and web/directory.py I think.

I already did a grep in the whole tree to find other occurrences of this bug, here's what I came up with.

  1. tahoe_backup.py

Function put_child gets only called with path="Latest" or path=now which are both ASCII strings. But you're right, this is probably safer to use unicode_to_url there as well. I pushed a new commit in my git branch with this change.

  1. tahoe_mkdir.py

The path variable comes from the get_alias function which already returns an UTF-8 encoded string.

def get_alias(aliases, path_unicode, default):
    """
    Transform u"work:path/filename" into (aliases[u"work"], u"path/filename".encode('utf-8')).
  1. tahoe_put.py

It uses the get_alias function as well.

  1. web/directory.py

In this file, the name is always encoded as an UTF-8 string before use.

name = name.encode("utf-8")
Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/1224#issuecomment-80516): > There are other instances of `urllib.quote` with a name (as opposed to a cap URI) as argument, in `tahoe_backup.py`, `tahoe_mkdir.py`, `tahoe_put.py`, and `web/directory.py` I think. I already did a grep in the whole tree to find other occurrences of this bug, here's what I came up with. 1. `tahoe_backup.py` Function `put_child` gets only called with `path="Latest"` or `path=now` which are both ASCII strings. But you're right, this is probably safer to use `unicode_to_url` there as well. I pushed a new commit in my git branch with this change. 2. `tahoe_mkdir.py` The `path` variable comes from the `get_alias` function which already returns an UTF-8 encoded string. ``` def get_alias(aliases, path_unicode, default): """ Transform u"work:path/filename" into (aliases[u"work"], u"path/filename".encode('utf-8')). ``` 3. `tahoe_put.py` It uses the `get_alias` function as well. 4. `web/directory.py` In this file, the `name` is always encoded as an UTF-8 string before use. ``` name = name.encode("utf-8") ```
davidsarah commented 2010-10-16 15:49:50 +00:00
Author
Owner

I reviewed the git commit and it looks good.

I reviewed [the git commit](http://github.com/ctrlaltdel/tahoe-lafs/commit/642f60b4c5e26aed188a9f8f6879ce661540fd1d) and it looks good.

Brian, could you merge this patch into trunk and push it into the darcs repo at dev.allmydata.org:/home/darcs/tahoe-lafs/trunk? Thanks!

Brian, could you merge this patch into trunk and push it into the darcs repo at dev.allmydata.org:/home/darcs/tahoe-lafs/trunk? Thanks!
davidsarah commented 2010-10-23 04:29:01 +00:00
Author
Owner

I reviewed the change to tahoe_backup.py and that also looks good.

I reviewed [the change to tahoe_backup.py](http://github.com/ctrlaltdel/tahoe-lafs/commit/3e6ad1e5c1b64a11bf283dc2e3bbcade41fe0365) and that also looks good.

Okay, Brian could you also push that one from comment:80520 into trunk then? :-)

Oh, do these need a NEWS entry?

Okay, Brian could you also push that one from [comment:80520](/tahoe-lafs/trac-2024-07-25/issues/1224#issuecomment-80520) into trunk then? :-) Oh, do these need a `NEWS` entry?
Brian Warner <warner@lothar.com> commented 2010-10-29 09:09:38 +00:00
Author
Owner

In changeset:14ee763c542b61c5:

tahoe_cp.py: Don't call urllib.quote with an Unicode argument, fix #1224
tahoe_backup.py: Fix another (potential) occurrence of calling urllib.quote()
with an Unicode parameter
In changeset:14ee763c542b61c5: ``` tahoe_cp.py: Don't call urllib.quote with an Unicode argument, fix #1224 tahoe_backup.py: Fix another (potential) occurrence of calling urllib.quote() with an Unicode parameter ```
tahoe-lafs added the
fixed
label 2010-10-29 09:09:38 +00:00
david-sarah@jacaranda.org commented 2010-10-29 19:43:14 +00:00
Author
Owner

In changeset:2610f8e0aa6e2221:

NEWS: clarify (strengthen) description of what backdoors.rst declares, and add bugfix entries for 'tahoe cp' and Windows console bugs. refs #1216, #1224, #1232
In changeset:2610f8e0aa6e2221: ``` NEWS: clarify (strengthen) description of what backdoors.rst declares, and add bugfix entries for 'tahoe cp' and Windows console bugs. refs #1216, #1224, #1232 ```
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1224
No description provided.