unicode arguments on the command-line #565

Closed
opened 2008-12-23 16:25:35 +00:00 by zooko · 13 comments

How do we know what encoding was used to encode the filenames or other arguments that are passed in via Python 2's sys.argv? If we don't know, do we assume that it is utf-8, thus making it incompatible with platforms that don't encode arguments with utf-8? Or do we leave it undecoded, thus making it impossible to correctly inspect the string for the presence of '/' chars?
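To make the last point concrete, here is a minimal sketch (written for Python 3, not taken from the Tahoe-LAFS code) of why looking for '/' in undecoded bytes can give the wrong answer. The helper name `has_path_separator` is hypothetical, and UTF-16 is used only because it makes the effect easy to see; it is an assumption for illustration, not a claim about what any platform actually passes in argv.

```python
# Illustrative sketch (Python 3); the function name is hypothetical.
def has_path_separator(raw_arg: bytes, encoding: str) -> bool:
    """Decode the argument first, then look for '/' in the text."""
    return "/" in raw_arg.decode(encoding, "strict")

# U+2F2F is a single non-ASCII character; it contains no '/'.
name = "\u2f2f"
raw = name.encode("utf-16-le")  # both bytes happen to be 0x2F

naive = b"/" in raw                               # False positive on raw bytes
decoded = has_path_separator(raw, "utf-16-le")    # Correct answer after decoding
print(naive, decoded)
```

The byte-level check reports a separator that is not there, which is the risk of leaving the argument undecoded.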

zooko added the code-frontend-cli, major, defect, 1.2.0 labels 2008-12-23 16:25:35 +00:00
zooko added this to the undecided milestone 2008-12-23 16:25:35 +00:00
francois commented 2008-12-28 00:43:18 +00:00
Owner

As a data point, here's how it is handled in Python 3.0.

Some system APIs like os.environ and sys.argv can also present problems when the bytes made available by the system is not interpretable using the default encoding. Setting the LANG variable and rerunning the program is probably the best approach.

Source: What's new in Python 3.0

$ LANG=en_US.UTF-8 python3.0 -c "import sys; print(sys.argv[1])" ärtonwall
ärtonwall
$ LANG=C python3.0 -c "import sys; print(sys.argv[1])" ärtonwall
Could not convert argument 3 to string

We should probably implement something that works in a similar way for Python 2.
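A rough sketch of what "working in a similar way" could look like: decode each raw argument with the locale's preferred encoding and stop with a clear message if one cannot be decoded. This is written for Python 3 with byte strings standing in for Python 2's `sys.argv` entries; `decode_args` and the wording of the error message are assumptions for illustration only.

```python
import locale

def decode_args(raw_args, encoding=None):
    """Strictly decode byte-string arguments, reporting the first failure."""
    # Fall back to UTF-8 if the locale encoding cannot be determined.
    encoding = encoding or locale.getpreferredencoding(False) or "utf-8"
    decoded = []
    for index, raw in enumerate(raw_args):
        try:
            decoded.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            raise SystemExit(
                "Could not convert argument %d to string using %s"
                % (index, encoding)
            )
    return decoded

print(decode_args([b"\xc3\xa4rtonwall"], "utf-8"))
```

Failing loudly here mirrors the Python 3.0 behaviour shown above, instead of silently guessing an encoding.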

davidsarah commented 2009-12-07 04:57:53 +00:00
Owner

Windows-only

http://bugs.python.org/issue2128 suggests that on Python 2.6.x for Windows, any non-ASCII characters will have been irretrievably mangled to question-marks in sys.argv. Unfortunately win32api.GetCommandLine seems to call GetCommandLineA, not GetCommandLineW. The bzr project solved this problem by using ctypes to call GetCommandLineW: https://bugs.launchpad.net/bzr/+bug/375934 . (bzr is GPL'd, so we can use that code.)

Note that this would require passing the correct unicode argv into twisted.python.usage.Options.parseOptions from source:src/allmydata/scripts/runner.py , i.e. change source:windows/tahoe.py to do

argv = get_cmdline_unicode()  # from bzr patch
rc = runner(argv[1:], install_node_control=False)
sys.exit(rc)

(assuming that twisted.python.usage.Options handles Unicode correctly, which I haven't tested).

davidsarah commented 2010-02-02 00:15:29 +00:00
Owner

Needed for #534 which has milestone 1.7.0.

tahoe-lafs modified the milestone from undecided to 1.7.0 2010-02-02 00:15:29 +00:00
davidsarah commented 2010-04-30 18:24:04 +00:00
Owner

Here's some code to get Unicode argv that should work on both Windows (including cygwin) and Unix. On Unix, it assumes that arguments are encoded according to the current locale encoding (or UTF-8 if that could not be determined by Python).

import sys, locale

if sys.platform == "win32":
    from ctypes import WINFUNCTYPE, POINTER, byref, c_wchar_p, c_int, windll
    def get_unicode_argv():
        GetCommandLineW = WINFUNCTYPE(c_wchar_p)(("GetCommandLineW", windll.kernel32))
        CommandLineToArgvW = WINFUNCTYPE(POINTER(c_wchar_p), c_wchar_p, POINTER(c_int)) \
          (("CommandLineToArgvW", windll.shell32))
        argc = c_int(0)
        argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
        return [argv[i] for i in xrange(1, argc.value)]
else:
    def get_unicode_argv():
        encoding = locale.getpreferredencoding()
        if not encoding:
            encoding = "utf-8"
        # This throws UnicodeError if any argument cannot be decoded.
        return [arg.decode(encoding, 'strict') for arg in sys.argv]

print get_unicode_argv()
Author

I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't contribute to it myself right now.

davidsarah commented 2010-06-08 08:57:35 +00:00
Owner

Getting this working on Windows is more difficult than I thought. I have successfully got it to work by hacking the setuptools-generated entry script like this:

#!c:\Python26\python.exe
# EASY-INSTALL-ENTRY-SCRIPT: 'allmydata-tahoe==1.6.1-r4452','console_scripts','tahoe'
__requires__ = 'allmydata-tahoe==1.6.1-r4452'
import sys
from pkg_resources import load_entry_point

### start extra code
from ctypes import WINFUNCTYPE, POINTER, byref, c_wchar_p, c_int, windll

GetCommandLineW = WINFUNCTYPE(c_wchar_p)(("GetCommandLineW", windll.kernel32))
CommandLineToArgvW = WINFUNCTYPE(POINTER(c_wchar_p), c_wchar_p, POINTER(c_int)) \
                         (("CommandLineToArgvW", windll.shell32))

argc = c_int(0)
argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
sys.argv = [argv[i].encode('utf-8') for i in xrange(1, argc.value)]
### end extra code

sys.exit(
   load_entry_point('allmydata-tahoe==1.6.1-r4452', 'console_scripts', 'tahoe')()
)

but only by invoking this script directly from the command line, not via the tahoe.exe wrapper. The latter mangles the arguments beyond hope of recovery.
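One consequence of the hack above is that `sys.argv` then holds UTF-8 byte strings regardless of the console code page, so whatever consumes it has to decode as UTF-8 and not with the locale encoding. A small Python 3 sketch of that round trip follows; both helper names are hypothetical and not part of Tahoe-LAFS.

```python
def reencode_for_byte_argv(unicode_args):
    """Mimic the entry-script hack: store each argument as UTF-8 bytes."""
    return [arg.encode("utf-8") for arg in unicode_args]

def decode_byte_argv(byte_args):
    """Consumer side: must decode with the same encoding (UTF-8)."""
    return [arg.decode("utf-8") for arg in byte_args]

wide = ["tahoe", "ls", "\u00e4rtonwall"]
stored = reencode_for_byte_argv(wide)

round_tripped = decode_byte_argv(stored)
# Decoding with a different encoding yields mojibake instead of the original.
wrong = [arg.decode("cp1252") for arg in stored]
print(round_tripped == wide, wrong[2])
```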

davidsarah commented 2010-06-08 18:25:38 +00:00
Owner

It isn't necessary for the extra code to be in the entry script; it could be in source:allmydata/scripts/runner.py . However, Zooko and I decided that changing how the CLI entry works on Windows would be too disruptive for 1.7, so we're dropping support for Unicode args on Windows until the next release.

This ticket is fixed for other platforms in 1.7.

davidsarah commented 2010-06-09 00:20:35 +00:00
Owner

Attachment back-out-windows-specific-unicode-argv.dpatch (47775 bytes) added

Back out Windows-specific Unicode argument support for v1.7.

Author

The patch looks correct.

davidsarah commented 2010-06-12 20:48:23 +00:00
Owner

back-out-windows-specific-unicode-argv.dpatch was applied in changeset:32d9deace3d82637.

See #1074 for a patch that reenables Unicode argument support on Windows (but requires further discussion and refinement).

tahoe-lafs modified the milestone from 1.7.0 to 1.7.1 2010-06-12 20:48:23 +00:00
davidsarah commented 2010-07-14 02:44:22 +00:00
Owner

The #1074 patch is now finished.

tahoe-lafs modified the milestone from 1.7.1 to 1.8β 2010-07-17 03:50:28 +00:00
david-sarah@jacaranda.org commented 2010-08-02 07:23:26 +00:00
Owner

In [4627/ticket798]:

Bundle setuptools-0.6c16dev (with Windows script changes, and the change to only warn if site.py wasn't generated by setuptools) instead of 0.6c15dev. addresses #565, #1073, #1074
davidsarah commented 2010-08-08 00:37:52 +00:00
Owner

Fixed; see ticket:1074#comment:29 for changesets.

tahoe-lafs added the fixed label 2010-08-08 00:37:52 +00:00
Reference: tahoe-lafs/trac-2024-07-25#565