reduce unnecessary escaping in quote_output #1135

Closed
opened 2010-07-22 01:36:05 +00:00 by davidsarah · 16 comments
davidsarah commented 2010-07-22 01:36:05 +00:00
Owner

The [quote_output]source:src/allmydata/util/encodingutil.py@4543#L140 function, which is used to output possibly-Unicode strings to stdout and stderr, does more backslash-escaping than is necessary for the output to be interpreted unambiguously. In particular, it will double any backslashes in the string, which makes Windows paths look ugly.

We can reduce this by making it follow POSIX-like escaping rules: within single quotes, backslashes are not special, and the only printable character that cannot appear is a single quote. If we do have a single quote or control characters in the string, then we use double quotes with backslash-escaping.

The [quote_output]source:src/allmydata/util/encodingutil.py@4543#L140 function, which is used to output possibly-Unicode strings to `stdout` and `stderr`, does more backslash-escaping than is necessary for the output to be interpreted unambiguously. In particular, it will double any backslashes in the string, which makes Windows paths look ugly. We can reduce this by making it follow POSIX-like escaping rules: within single quotes, backslashes are not special, and the only printable character that cannot appear is a single quote. If we do have a single quote or control characters in the string, then we use double quotes with backslash-escaping.
tahoe-lafs added the
code-frontend-cli
major
defect
1.7.1
labels 2010-07-22 01:36:05 +00:00
tahoe-lafs added this to the 1.8β milestone 2010-07-22 01:36:05 +00:00
davidsarah commented 2010-07-22 01:51:32 +00:00
Author
Owner

Please hold off reviewing this; the _double_quote function is broken for some inputs.

Please hold off reviewing this; the `_double_quote` function is broken for some inputs.
davidsarah commented 2010-07-23 08:17:42 +00:00
Author
Owner

Attachment encodingutil-reduce-quote-escaping.dpatch (10514 bytes) added

util.encodingutil: change quote_output to do less unnecessary escaping, and to use double-quotes more consistently when needed. This version avoids u-escaping for characters that are representable in the output encoding, when double quotes are used; and includes tests. fixes #1135

**Attachment** encodingutil-reduce-quote-escaping.dpatch (10514 bytes) added util.encodingutil: change quote_output to do less unnecessary escaping, and to use double-quotes more consistently when needed. This version avoids u-escaping for characters that are representable in the output encoding, when double quotes are used; and includes tests. fixes #1135

This seems like a utility which is not at all Tahoe-LAFS-specific and could be useful to lots of other Python programmers.

Do you have any ideas about how to make it available to them?

I've been trying to do that sort of thing by publishing the utility in my pyutil library. (See also #47 use pyutil as a separate package and contribute src/allmydata/util/ into pyutil*) I'm not sure if this has helped anyone! Nobody uses pyutil as far as I am aware. (Hm, although http://pypi.python.org/pypi/pyutil says that it has been download more than a thousand times. I wonder.) Recently when I added a new feature to pyutil that I wanted people to be aware of I also made a new separate package named jsonutil and separately uploaded it to pypi and announced it on the python mailing list: http://pypi.python.org/pypi/jsonutil .

Maybe the Right thing to do for this is to package up everything into pyutil (per #47) and then to open requests of the Python upstream maintainers that they consider some (or all) of those functions as additions to the Python Standard Library.

Over the years (almost ten years, now) of my maintaining pyutil, I've seen many of the features of pyutil get added as standard features of the Python Standard Library, but in every case because someone else independently discovered this idea, independently implemented it, and pushed it into the Python Standard Library. I guess if we want the Python folks to notice features like this in pyutil we have to actually tell them. :-)

This seems like a utility which is not at all Tahoe-LAFS-specific and could be useful to lots of other Python programmers. Do you have any ideas about how to make it available to them? I've been trying to do that sort of thing by publishing the utility in [my pyutil library](http://tahoe-lafs.org/trac/pyutil). (See also #47 *use pyutil as a separate package and contribute src/allmydata/util/* into pyutil*) I'm not sure if this has helped anyone! Nobody uses pyutil as far as I am aware. (Hm, although <http://pypi.python.org/pypi/pyutil> says that it has been download more than a thousand times. I wonder.) Recently when I added a new feature to pyutil that I wanted people to be aware of I also made a new separate package named jsonutil and separately uploaded it to pypi and announced it on the python mailing list: <http://pypi.python.org/pypi/jsonutil> . Maybe the Right thing to do for this is to package up everything into pyutil (per #47) and then to open requests of the Python upstream maintainers that they consider some (or all) of those functions as additions to the Python Standard Library. Over the years (almost ten years, now) of my maintaining pyutil, I've seen many of the features of pyutil get added as standard features of the Python Standard Library, but in every case because someone else independently discovered this idea, independently implemented it, and pushed it into the Python Standard Library. I guess if we want the Python folks to notice features like this in pyutil we have to actually tell them. :-)
davidsarah commented 2010-07-31 00:40:03 +00:00
Author
Owner

For reviewers: the [test_quote_output_mock]attachment:encodingutil-reduce-quote-escaping.dpatch#L107 method (which tests quote_output with the default output encoding) has been simplified on the ticket1074 branch to not use mocks:

    def test_quote_output_default(self):
        encodingutil.output_encoding = 'ascii'
        self.test_quote_output_ascii(None)

        encodingutil.output_encoding = 'latin1'
        self.test_quote_output_latin1(None)

        encodingutil.output_encoding = 'utf-8'
        self.test_quote_output_utf8(None)
For reviewers: the [test_quote_output_mock]attachment:encodingutil-reduce-quote-escaping.dpatch#L107 method (which tests `quote_output` with the default output encoding) has been simplified on the ticket1074 branch to not use mocks: ``` def test_quote_output_default(self): encodingutil.output_encoding = 'ascii' self.test_quote_output_ascii(None) encodingutil.output_encoding = 'latin1' self.test_quote_output_latin1(None) encodingutil.output_encoding = 'utf-8' self.test_quote_output_utf8(None) ```
davidsarah commented 2010-07-31 00:46:23 +00:00
Author
Owner

It's probably easier to review this code [as it is on the ticket1074 branch]source:ticket1074/src/allmydata/util/encodingutil.py#L164 ([tests here]source:ticket1074/src/allmydata/test/test_encodingutil.py#L294), rather than the patch.

It's probably easier to review this code [as it is on the ticket1074 branch]source:ticket1074/src/allmydata/util/encodingutil.py#L164 ([tests here]source:ticket1074/src/allmydata/test/test_encodingutil.py#L294), rather than the patch.
david-sarah@jacaranda.org commented 2010-08-01 20:54:29 +00:00
Author
Owner

In [4600/ticket798]:

util.encodingutil: change quote_output to do less unnecessary escaping, and to use double-quotes more consistently when needed. This version avoids u-escaping for characters that are representable in the output encoding, when double quotes are used, and includes tests. fixes #1135
In [4600/ticket798]: ``` util.encodingutil: change quote_output to do less unnecessary escaping, and to use double-quotes more consistently when needed. This version avoids u-escaping for characters that are representable in the output encoding, when double quotes are used, and includes tests. fixes #1135 ```
tahoe-lafs added the
fixed
label 2010-08-01 20:54:29 +00:00
davidsarah commented 2010-08-02 19:34:15 +00:00
Author
Owner

Replying to david-sarah@…:

In [4600/ticket798]:

#CommitTicketReference repository="ticket798" revision="4600"
util.encodingutil: change quote_output to do less unnecessary escaping, and to use double-quotes more consistently when needed. This version avoids u-escaping for characters that are representable in the output encoding, when double quotes are used, and includes tests. fixes #1135

This ticket should not have been auto-closed yet, because the commit was on a branch.

Replying to [david-sarah@…](/tahoe-lafs/trac-2024-07-25/issues/1135#issuecomment-78847): > In [4600/ticket798]: > ``` > #CommitTicketReference repository="ticket798" revision="4600" > util.encodingutil: change quote_output to do less unnecessary escaping, and to use double-quotes more consistently when needed. This version avoids u-escaping for characters that are representable in the output encoding, when double quotes are used, and includes tests. fixes #1135 > ``` This ticket should not have been auto-closed yet, because the commit was on a branch.
tahoe-lafs removed the
fixed
label 2010-08-02 19:34:15 +00:00
davidsarah commented 2010-08-07 21:37:27 +00:00
Author
Owner

Applied to trunk in changeset:28e6ad51a794067f. Still could do with review.

Applied to trunk in changeset:28e6ad51a794067f. Still could do with review.
davidsarah commented 2010-12-30 23:43:48 +00:00
Author
Owner

For the benefit of reviewers, the POSIX shell escaping that this is approximately trying to emulate is specified here, except that that spec doesn't discuss Unicode. Generally when a Unicode character can be represented in the output encoding as itself, we do so, otherwise we use an \x, \u or \U escape (whichever is shortest). Interesting questions are:

  • is this a sensible way of escaping strings?
  • does it have any gratuitous inconsistencies with POSIX, C99, Python, or other escaping conventions that users might be used to, that could easily be fixed?
  • does the [implementation]source:src/allmydata/util/encodingutil.py@4616#L164 correspond to the comment?
  • are there important cases missing in the [tests]source:src/allmydata/test/test_encodingutil.py@4761#L294?
For the benefit of reviewers, the POSIX shell escaping that this is *approximately* trying to emulate is specified [here](http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_02), except that that spec doesn't discuss Unicode. Generally when a Unicode character can be represented in the output encoding as itself, we do so, otherwise we use an \x, \u or \U escape (whichever is shortest). Interesting questions are: * is this a sensible way of escaping strings? * does it have any gratuitous inconsistencies with POSIX, C99, Python, or other escaping conventions that users might be used to, that could easily be fixed? * does the [implementation]source:src/allmydata/util/encodingutil.py@4616#L164 correspond to the comment? * are there important cases missing in the [tests]source:src/allmydata/test/test_encodingutil.py@4761#L294?
Zarutian commented 2011-01-08 17:10:26 +00:00
Author
Owner

Minimalizing, or omitting, the quoteing of unicode characters that can be represented in the output encoding is pretty neat.

However regarding backslash quoting behaviour: wont this break other stuff due to things such as "C:\new folder\fav_pet\targh42" being interpreted as "C:" followed by a newline, followed by "ew folder", followed by formfeed, followed by "av_pet", followed by tab and finally ending with "argh42"

I think doubling the backslashes in Windows pathnames is unavoidable.

Minimalizing, or omitting, the quoteing of unicode characters that can be represented in the output encoding is pretty neat. However regarding backslash quoting behaviour: wont this break other stuff due to things such as "C:\new folder\fav_pet\targh42" being interpreted as "C:" followed by a newline, followed by "ew folder", followed by formfeed, followed by "av_pet", followed by tab and finally ending with "argh42" I think doubling the backslashes in Windows pathnames is unavoidable.

Perhaps we (probably David-Sarah) could add documentation (in the form of doc file, docstring, or comment) to the patch addressing Zarutian's concern.

Perhaps we (probably David-Sarah) could add documentation (in the form of doc file, docstring, or comment) to the patch addressing Zarutian's concern.
david-sarah@jacaranda.org commented 2011-01-09 01:34:21 +00:00
Author
Owner

In changeset:6ce3ec6d0d363e68:

docs/frontends/CLI.rst: discuss commandline/output quoting issues and wildcards. refs #1135
In changeset:6ce3ec6d0d363e68: ``` docs/frontends/CLI.rst: discuss commandline/output quoting issues and wildcards. refs #1135 ```
davidsarah commented 2011-01-09 01:35:57 +00:00
Author
Owner

Replying to Zarutian:

Minimalizing, or omitting, the quoteing of unicode characters that can be represented in the output encoding is pretty neat.

However regarding backslash quoting behaviour: wont this break other stuff due to things such as "C:\new folder\fav_pet\targh42" being interpreted as "C:" followed by a newline, followed by "ew folder", followed by formfeed, followed by "av_pet", followed by tab and finally ending with "argh42"

No, because single-quote escaping single-quoting doesn't give a special meaning to backslashes.

This is described in the comment at source:src/allmydata/util/encodingutil.py@4616#L195, but not in any end-user documentation. I've now added some in source:docs/frontends/CLI.rst.

Replying to [Zarutian](/tahoe-lafs/trac-2024-07-25/issues/1135#issuecomment-78855): > Minimalizing, or omitting, the quoteing of unicode characters that can be represented in the output encoding is pretty neat. > > However regarding backslash quoting behaviour: wont this break other stuff due to things such as "C:\new folder\fav_pet\targh42" being interpreted as "C:" followed by a newline, followed by "ew folder", followed by formfeed, followed by "av_pet", followed by tab and finally ending with "argh42" No, because ~~single-quote escaping~~ single-quoting doesn't give a special meaning to backslashes. This is described in the comment at source:src/allmydata/util/encodingutil.py@4616#L195, but not in any end-user documentation. I've now added some in source:docs/frontends/CLI.rst.
Zarutian commented 2011-01-15 00:39:10 +00:00
Author
Owner

Replying to davidsarah:

-snip-

No, because single-quote escaping single-quoting doesn't give a special meaning to backslashes.

This is described in the comment at source:src/allmydata/util/encodingutil.py@4616#L195, but not in any end-user documentation. I've now added some in source:docs/frontends/CLI.rst.

I say that having consistant handling of backslashes and having to deal with some doubling of them is preferable to other issues such as subtle modal context changes. Because I know in the future it will happen that some poor sod will miss the subtle distinction between single- and double-quotes in the quoted output. (I know I have, in the past elsewhere)

However if you want to use single quotes to turn of backslash escaping when it might not be warrented then feel free to do so. (But dont blame me when the poor sod comes along ;)

Replying to [davidsarah](/tahoe-lafs/trac-2024-07-25/issues/1135#issuecomment-78858): -snip- > > No, because ~~single-quote escaping~~ single-quoting doesn't give a special meaning to backslashes. > > This is described in the comment at source:src/allmydata/util/encodingutil.py@4616#L195, but not in any end-user documentation. I've now added some in source:docs/frontends/CLI.rst. I say that having consistant handling of backslashes and having to deal with some doubling of them is preferable to other issues such as subtle modal context changes. Because I know in the future it will happen that some poor sod will miss the subtle distinction between single- and double-quotes in the quoted output. (I know I have, in the past elsewhere) However if you want to use single quotes to turn of backslash escaping when it might not be warrented then feel free to do so. (But dont blame me when the poor sod comes along ;)
davidsarah commented 2011-01-15 01:10:12 +00:00
Author
Owner

Replying to [Zarutian]comment:19:

... However if you want to use single quotes to turn of backslash escaping when it might not be warrented then feel free to do so. (But dont blame me when the poor sod comes along ;)

I guess that's a -0 on this change (which was made in 1.8) :-/. FWIW, I think that casual users will probably not notice that the single-quoted strings are quoted at all, and so they won't be confused by them being different to double-quoted strings.

(Previous to quote_output being added changeset:80252debcd94fc28, in many cases we just had names being printed to stdout/err using "... '%s' ..." % (name,), which was definitely wrong.)

I think we can probably close this ticket now, unless anyone wants to do a more technical code review.

Replying to [Zarutian]comment:19: > ... However if you want to use single quotes to turn of backslash escaping when it might not be warrented then feel free to do so. (But dont blame me when the poor sod comes along ;) I guess that's a -0 on this change (which was made in 1.8) :-/. FWIW, I think that casual users will probably not notice that the single-quoted strings are quoted at all, and so they won't be confused by them being different to double-quoted strings. (Previous to `quote_output` being added changeset:80252debcd94fc28, in many cases we just had names being printed to stdout/err using `"... '%s' ..." % (name,)`, which was definitely wrong.) I think we can probably close this ticket now, unless anyone wants to do a more technical code review.
Zarutian commented 2011-01-16 17:41:57 +00:00
Author
Owner

A quick note: I cant see any obvious bugs (such as off by one, fence posting or logic ones) in this patch so I think it is most propably okay.

(Please note that I do not know Python and have to infer what I have seen in other similiar languages both syntactically and semantically.)

A quick note: I cant see any obvious bugs (such as off by one, fence posting or logic ones) in this patch so I think it is most propably okay. (Please note that I do not know Python and have to infer what I have seen in other similiar languages both syntactically and semantically.)
tahoe-lafs added the
fixed
label 2011-01-22 04:50:49 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1135
No description provided.