capabilities from the future could have non-ascii characters #1051
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#1051
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
In the work around ticket #833 we implemented forward-compatibility for future capability formats. However we required all future cap formats to be expressed in ASCII without (as far as I know) actually thinking it through and deciding that we really wanted to constrain future cap formats in that way.
This ticket is to loosen that constraint on future cap formats. Note that this doesn't require future cap formats to have non-ASCII characters in them -- it just makes it so that if they do then they will still enjoy the same (limited) backward-compatibility to Tahoe-LAFS v1.7 that pure-ASCII caps from the future enjoy back to Tahoe-LAFS v1.6.1.
Attachment refactor-test-web.dpatch (24338 bytes) added
Please review attachment:refactor-test-web.dpatch.
Attachment test-nonascii-future-caps.dpatch (35715 bytes) added
Here are unit tests which trunk currently fails. They test what happens if someone gives you a cap (through wui/wapi, cli, or in a child of a dir) that has non-ascii chars in the cap itself. Please review these test patches!
FWIW I agree with Brian's and David-Sarah's comments on IRC that the next version of caps (hopefully coming out this year) should not use non-ASCII chars.
I would still like to see this forward-compatibility feature in Tahoe-LAFS as soon as possible though to give our future selves and our successors more graceful options.
Tests look ok, but we don't have a patch that makes them pass. This may have to wait until 1.8.
refactor-test-web.dpatch can be applied now.
When we parse a JSON bytestring using
simplejson.loads
, the result is a mixture of ASCII and Unicode strings. Therefore any "rw_uri
" and "ro_uri
" fields in the result should be coerced usingallmydata.util.stringutils.to_str
, which is defined like this:I've been doing this for the CLI scripts as part of fixing the ticket534 branch.
(The
stringutils
module is likely to be renamed, maybe toencodingutil
.)The test patch is testing that (some of) our internal APIs directly support URIs that can be Unicode strings as well as byte strings. I don't think we should do it that way: we should represent URIs as UTF-8 (also in the encoded form of directories), and convert when reading caps/paths from the command line, and when displaying caps to stdout/stderr and in the WUI. (I think the former may already work, due to the changes for #534 and #565.)
refactor-test-web.dpatch was applied in changeset:9e2da058372cad56.
I concur: filecaps are either ASCII or arbitrary bytestrings, not unicode
objects.
Actually, I think of it this way:
files/directories (think about our URI objects)
which starts with "URI:" and always contains printable ASCII
dense binary form (call it V2), and a more-official
follows-the-RFC-about-URIs URI form, with
://
and everything (callit V3)
used in the dirnodes that it unpacks. The dirnodes contain a bytestring
(packed with netstrings, not JSON). This is the first constraint on what
future code can put into dirnodes
cap it doesn't recognize. This is the second constraint.
(subdir path and/or filename). HTTP enforces a specific type here:
%-encoding of a bytestring, and it is common to expect the bytestring to
be a UTF-8 encoding of a unicode string.
explode when they see unrecognized caps, in the . Furthermore, it'd be
nice if they could treat such caps as opaque objects and still be able to
do certain manipulation of them (like copying/moving them from one
directory to another).
The way we build dirnodes tells us that we can put arbitrary bytestrings into
them: whatever syntax we use is not constrained by the container they are put
in, as long as any non-bytestring syntax we use is encoded into a bytestring
on the way in.
The WUI display is not a significant constraint: we rarely accept filecaps as
input in the WUI, and we only display them on the "More Info" page, which
could use repr(). We don't need to construct URLs out of unrecognized caps,
because the WUI doesn't provide any operations to work with them.
The WAPI t=json encoding is big constraint: since we pass filecaps as
dictionary values in the JSON body, which limits them to being ASCII or
unicode objects, unless we change the definition of the t=json API to declare
that the values it contains are something else.
(sometimes you can retroactively change protocol definitions like this in
ways that magically retain backwards compatibility: for example, if we
declared that the V1 encoding of filecaps has in fact actually been unicode,
but that everybody thought it was just ASCII because nobody had ever created
a unicode filecap before, then the t=json body definition could similarly be
retroactively defined as "UTF-8 encoding of the V1 filecap", and all the
ASCII filecaps coming from those APIs would still look the same)
So anyways, the reason that I think filecaps are ASCII or bytestrings is
because they contain machine-readable data like cryptovalues and numbers, not
human-generated/human-readable data like names.
I'd strongly recommend that, if we're going to plan for expansion of the "V1"
filecap-as-string syntax beyond ASCII, then we should plan for them to be
bytestrings. You can reliably compare bytestrings for equality (which is not
generally the case for unicode strings), there is an unambigious mapping from
filecap-as-bundle-of-data to filecap-as-bytestring (which is not the case for
unicode: even if we tell everyone to use UTF-8 instead of UTF-16/etc, there
are still too many options).
A related but separate issue is how to plan for expansion to the V2/V3/etc
syntaxes. The V1 syntax, as currently (narrowly) defined, is always printable
ASCII and always starts with "URI:". We could define a V2 dense-binary syntax
which, given a single leading version byte, would not overlap with the V1
syntax. Likewise a V3 real-URI syntax, which started with
tahoe://
,would not overlap. We might then retroactively define the filecaps stored in
dirnodes to be bytestrings that parse in one of these three forms (allowing
smaller dirnodes with dense binary caps). If the current dirnode-handling
code can tolerate+ignore arbitrary bytestrings, then this might be safe.
(t=json might not, however).
I need to read, understand, respond to Brian's objection.
Okay I've read this through a few times now and I'm not sure I understand all of it.
To start with, the "related but separate issue" at the end of comment:77457 can safely go into a separate ticket, right?
Next, I'm fairly sure that this can also go into a separate ticket, possibly that same one: I'd strongly recommend that, if we're going to plan for expansion of the "V1" filecap-as-string syntax beyond ASCII, then we should plan for them to be bytestrings..
Maybe this ticket could be named "capabilities from the future could be non-ascii and non-unicode bytestrings". Does that make sense at all?
So, to focus on what I see as the point of this ticket I would like to ask Brian and David-Sarah a few questions. "Socratic" questioning often sounds condescending and irritating to me. These are actual questions that I don't already know the "right answers" to.
Suppose Alice is running Tahoe-LAFS v1.8.0, in the year 2020, and suppose hypothetically that for some reason that is currently unimaginable to us, we have in the year 2019 defined an "expression syntax" for Tahoe-LAFS caps which are unicode, like this:
lafs://from_the_future_fw-蜔쳨欝遃䝦舜琇襇邤䍏㵦☚✸킾궑蒴犏띎냔㳆㼿졨浴䒉ΐ屝稜퍙鉧
.Now suppose, that Bob has a cap like that, and he conveys it to Alice, either by sending it to her and inviting her to click on it or cut and paste it (or wave her magic wand at it or whatever they do in 2020) to enter it into her Tahoe-LAFS v1.8.0 client. Or, suppose Bob puts it into a Tahoe-LAFS directory which Alice has read-access to and asks her to look at that directory.
Question 1: What would you want to happen (in this hypothetical scenario) when Alice waves her wand at it or lists that Tahoe-LAFS directory?
Question 2: What would happen if Alice were using Tahoe-LAFS v1.7.0? (I don't know the answer to this question. Wouldn't her client incur an internal TypeError of some kind?)
Question 3: How would you write a unit test which answers Question 2? My attempt at that was test-nonascii-future-caps.dpatch , but maybe that test doesn't actually answer Question 2. I'm not sure.
My objection was slightly different to Brian's. I was objecting to the internal representation of URIs being "either a Unicode string or a bytestring", as the patch assumed. From experience, that's extremely error-prone and leads to horrible problems with implicit conversions.
Representing URIs as UTF-8 doesn't have that problem, and satisfies Brian's criteria in comment:77457 :
I can't speak for Brian, but I believe from IRC conversations that he was skeptical of the whole idea of Unicode in caps. I share some of that skepticism, but I think it's relatively harmless to allow the possibility; it doesn't add much complexity.
Note that if we do use Unicode in caps in future, we should limit the character set to characters for which normalization is not an issue. (There are big blocks of Han characters with no equivalences, for example.)
Re: the JSON encoding -- JSON strings are by definition Unicode, and the JSON bodies are already assumed to be UTF-8 (which is necessary for filenames). The only other compatible option for encoding URIs in JSON would be ISO-Latin-1 (i.e. encode bytes 0x80..0xFF as \x80..\xFF), but it makes no sense to me to use a mixture of UTF-8 and ISO-Latin-1. Also see below for the current behaviour when using
simplejson.dumps
.Dense cap encoding is a separate issue that has nothing to do with Unicode. At some point we will probably change the dirnode format for other reasons (e.g. to support deep-verify caps), and then we can consider whether the new format uses dense encoding. I don't think there's much space to be saved, though.
Replying to warner:
That's not quite true; the WUI accepts unrecognized caps in the form field to link a cap into a directory.
Replying to zooko:
I think so.
No; we have to make a choice between UTF-8 and arbitrary bytestrings. That's part of this ticket.
Err, no. All bytestrings are non-Unicode. The question is whether they represent Unicode, i.e. whether they need to be valid UTF-8.
[...]
OK.
She should get an unlinked entry with '?', '?-IMM', or '?-RO' in the first column.
I haven't tried an end-to-end test, but from browsing the code for HTML directory listings (web/directory.py), I believe that it won't attempt to decode the URI, so it will be treated like any other unknown URI -- i.e. she will get the desired unlinked entry. I don't know what will happen for the Info page.
For JSON directory listings, again I haven't tried an end-to-end test, but the behaviour of
simplejson.dumps
is to assume that bytestrings in the input are UTF-8, and (if they are valid UTF-8) encode them with a\u
escape in the resulting JSON. In other words, I think this may entirely accidentally do the right thing. If the directory contains an URI that is not valid UTF-8, then aUnicodeDecodeError
will probably occur [here]source:src/allmydata/web/directory.py@4527#L851.I'd use something like that patch, but with
.encode('utf-8')
added to all of the Unicode URI strings. (Also I prefer to use Unicode escapes rather than UTF-8 in source files.)Attachment test-utf8-future-caps.dpatch (162820 bytes) added
Add tests of caps from the future that have non-ASCII characters in them (encoded as UTF-8). The changes to test_uri .py, test_client.py, and test_dirnode.py add tests of non-ASCII future caps in addition to the current tests. The change s to test_web.py just replace the tests of all-ASCII future caps with tests of non-ASCII future caps. We also change use s of failUnlessEqual to failUnlessReallyEqual, in order to catch cases where the type of a string is not as expected.
Attachment behaviour-utf8-future-caps.dpatch (5160 bytes) added
Allow URIs passed in the initial JSON for t=mkdir-with-children, t=mkdir-immutable to be Unicode (this makes 'test-utf8-future-caps.dpatch' pass). Also pass the name of each child into nodemaker.create_from_cap for error reporting.
Incidentally, these tests show that the directory listing case already worked in 1.7.0 (as far as I can tell having only run them on Windows). I think the only thing that didn't work was passing Unicode URIs in the JSON for
t=mkdir-with-children
andt=mkdir-immutable
.Okay, I still haven't fully understood Brian's objection. Brian: please review!
As far as I dimly understand the whole issue, with these patches we will effectively have no constraints on future representations of caps (except that they can't start with 'URI:'). Tahoe-LAFS v1.7.0 turns out to already allow any future-caps in Tahoe-LAFS dirs and just show them as '?', but would have raised an exception if you tried to write such future-caps in with the WAPI's
t=mkdir-with-children
andt=mkdir-immutable
. With these patches, Tahoe-LAFS v1.7.1 would also accept any future-caps through the WAPI.I reviewed behaviour-utf8-future-caps.dpatch and it looks correct to me.
David-Sarah's analysis in comment:14 is mostly in line with my thinking.
I object less to "filecaps are UTF-8 encoding of some unicode string" than
"filecaps are unicode strings". This would let us say that filecaps are
bytestrings but with a constraint that
filecap.decode("utf-8")
must notthrow an exception, and perhaps the additional constraint that
filecap.decode("utf-8").encode("utf-8")==filecap
. If we went this way,we should say that the UTF-8 -encoded form is the primary one (i.e., if you
want to compare two filecaps, use
filecap1==filecap2
, notfilecap1.decode("utf-8")==filecap2.decode("utf-8")
.That still feels weird, though: UTF-8 is an encoding of something else, and
in general you want to be comparing the primary form, not some encoding
thereof. And filecaps must be unambiguous. If you wanted to visually
compare two ASCII filecaps, you could do it easily (in fact the base32 takes
out the o/0 1/i/I/l/L homoglyphs). While I don't expect people to do this
much, the fact that two unicode strings simply cannot be safely compared this
way has got to be a bad sign.
If we really must accept more than just ASCII, then I'd prefer to accept
completely arbitrary bytestrings. The biggest problem with doing this is the
t=json WAPI: if I'd taken this issue at all seriously when I built the
webapi, I would have defined the t=json format to emit base64-encoded
filecaps or something similar. (actually, at that point I did not yet realize
that JSON could not handle arbitrary binary data.. if I had, I might have
skipped JSON altogether and used protocol buffers or netstrings or
something).
But one option would be to have the t=json response leave out any filecap
that cannot be expressed in printable ASCII (i.e., run a regexp against it
before populating the child-info dictionary, replace it with an "unknown cap"
marker if that fails). I can't remember if we covered this one during the
earlier caps-from-the-future discussion.
If we go with "filecaps are UTF-8 encoding of a unicode string", then the
t=json API doesn't give enough information to clients to compare the real
filecaps: all they can get is
filecap.decode("utf-8")
. In addition, atsome point inside the webapi, we'd have to convert the filecaps into unicode
before adding them to the JSON response. I'm really nervous about the
information-losing behavior of unicode conversions, and security problems
that can result.
Ugh.. how can we make this safe? That is, when somebody pastes in a cap, how
do we verify that it isn't using any characters in this set? Is this set even
constant? When we're all speaking Lojban or Ilaksh or Marain or something in
the future, won't there be new codepoints which the old code can't recognize
as being non-normalizable?
While parts of this may belong in other tickets, I think it remains relevant
for this one. Your desire to plan for new things in our V1 filecaps might
actually be a desire to define and implement those V2/V3 syntaxes (and
improve the webapi to accept them, etc). So it may be better to leave the V1
syntax definition alone, leave certain Tahoe interfaces intolerant to the
potential new forms, and declare that we'll replace those interfaces with
V2+-tolerant ones before we start using those forms.
== Re: behaviour-utf8-future-caps.dpatch ==
Why the s/name/namex/g ? Did you maybe mean to say "
name = unicode(namex)
"to highlight the transition from "unicode or bytestring" to "really unicode",
and then leave the other instances of "name" alone?
The
writecap = to_str(propropdict.get("rw_uri"))
line performs theunicode-to-UTF8 conversion. This means that webapi users calling
t=mkdir-with-children
ort=mkdir-immutable
are giving us unicode,not UTF-8 bytestrings (i.e. tahoe gets
callerwritecap.decode("utf-8").encode("utf-8")
, because the JSONlibrary is doing a decode before tahoe proper sees the data). Worse yet, the
decode and the encode are being done by different pieces of code (I'd hope
that the JSON library uses python's
.decode
logic, but who knows?).That's the best way to implement the unicode-caps design, but it also makes
it clear that this is not an exact transformation.
I didn't review it earlier, but nodemaker.create_from_cap(name=) is weird.
I'd be concerned about unicode creeping into an exception instance and then
causing bytestring-only logging to break (such as when it is written to
twistd.log). I'm not sure what a good solution is: I see how it's a bit
easier to pass "extraneous" information down into a function that might raise
an exception (and stuff it into the exception message down there), rather
than e.g. catch the exception higher up (where knowing name= is a bit more
natural) and somehow gluing the name into the already-constructed exception
object.
== Re: test-utf8-future-caps.dpatch ==
Hrm, could you reduce the instances of "failUnlessReallyEqual" to things that
just test caps? Seeing it on things like
(c.getServiceNamed("storage"}.reserved_space, 0)
makes the patchawfully big. Hm, and if there were some clever way to make it the same length
as "failUnlessEqual", that would reduce the noise even further (if you do
this, which I don't think you should, note that
len(assertTypeEqual)==len(failUnlessEqual)).
I don't think using
failUnlessReallyEqual
in test_dirnode.py on thingslike
set(metadata.keys()
does everything you want it to: it will assertthat both sides are of type Set, but it won't assert that the members of
those sets are both of type string.
In test_dirnode.py, I would call the new variables
"future_unicode_write_uri", rather than "future_nonascii_write_uri", to make
it clear that this is one possible direction (and that there are others).
== Conclusions ==
behaviour-utf8-future-caps.dpatch: yes, this patch is pretty harmless, I
don't mind it going in.
test-utf8-future-caps.dpatch: I see no problems with the patch per se, but I
think the examples it uses set a bad precedent, by causing anyone reading the
test to believe that tahoe's future caps will be unicode, which I think is a
bad idea.
I don't object to these two patches going in, but I will continue to object
to the idea that the filecaps accepted by our existing interfaces (and stored
in existing dirnodes) should be defined as unicode-encoded-to-UTF8. I think
the best approaches are, in order of preference:
interface which is unable to tolerate such a wide range
I don't want to define filecaps to be unicode. Unicode exists to represent
strings of written human languages. Filecaps are records/structs of
cryptovalues. We have more tools to manipulate printable/copypastable strings
than to manipulate abstract records of cryptovalues, so expressing filecaps
as strings is convenient, but we should pick the encoding to serve tahoe's
needs, rather than trying to make any conceivable written-human-language
string meaningful as a tahoe filecap.
That said, for users who have a solid unicode-friendly set of tools and want
to tweet their filecaps, I don't object to an encoding scheme that somehow
takes a filecap and expresses it as a string of unicode characters (this
would be a "V4", in my V1/V2/V3 scheme from comment:11). But the tahoe
interfaces that accept this need to be clearly marked, and I think the
current t=json is not one of them.
Wow! Thanks for the detailed review.
I think we need to take the broader precedent-setting discussion somewhere else, such as the mailing list and then once it starts to gel move it to the wiki/NewCapDesign page.
I'll try to focus on the narrower issues in this comment.
Replying to warner:
What purpose would that serve?
Maybe add a comment saying that we do not intend to invent caps like these--these are only examples of possibilities for testing.
behaviour-utf8-future-caps.dpatch applied in changeset:fa0fd66e17fe845b.
applied tests as changeset:d346e0853d9b0b4b. added comment about the tests of caps "from the future" being actually from an alternate reality future: changeset:7cc98759bd1baca3
Some of the tests added in changeset:d346e0853d9b0b4b were too strict in testing the type of values parsed from JSON (which is different depending on the
simplejson
version). This was fixed in changeset:74c41ebb8bb772c2.