make sure the new MDMF extension field is forward-compatible and safe #1526
Reference: tahoe-lafs/trac-2024-07-25#1526
In #393 we've added "extension fields" to the MDMF caps. If I recall correctly, the original motivation was that future writers of MDMFs might want to include the K and segsize parameters in the cap itself (in addition to inside the share data), and future readers might want to take advantage of that information so that they don't need to do an extra round trip to learn it by fetching some of the share data.

It is important that such future writers of MDMFs don't exclude older (v1.9) readers of MDMFs from being able to read those caps, or even (gah!) cause them to get the wrong data or to incur an error when they try to read those caps. Therefore we need a way to include data in the caps which older (v1.9) readers will reliably and safely ignore (and still be able to read the file data correctly) but future readers can use if they want.
I thought we had decided to make a generic field for "extensions" in the MDMF caps, and not to make the current (1.9) reader or writer actually use this extension field yet. But the current code in trunk constrains that field instead of allowing it to be generically extensible, and it seems to try to use the numbers contained therein for its K and segsize values in some (?) cases.

The MDMF filecap format is currently defined as:

URI:MDMF:$writekey:$fingerprint(:$extension)*
and all existing caps are created with two extension fields: the first is K (as an integer), the second is segsize (also as an integer). The intention is to allow additional extension fields to be added in the future, perhaps hints that could speed up a download.
But the current code first constrains each extension field to [contain only the characters 0-9 and :]source:trunk/src/allmydata/uri.py?annotate=blame&rev=5220#L31, and then it requires there to be [exactly two such fields]source:trunk/src/allmydata/mutable/filenode.py?annotate=blame&rev=5227#L120.
This means future MDMF writers can't use the extension fields to communicate anything other than the currently defined K and segsize. They might allow some future use if each field could hold a larger set of characters, and if there were space for messages that Tahoe-LAFS v1.9 readers will parse and then ignore.
I'd like to have this in 1.9 because then 1.9 will be tolerant of caps generated by future versions that have a different number of extension fields.
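The tolerant-parsing idea can be sketched as follows. This is a minimal illustration, not the actual uri.py code; the character classes for the writekey and fingerprint, and the function name, are assumptions for the sake of the example. The point is that the regexp pins down only the parts v1.9 understands and captures any trailing colon-separated fields verbatim, so an old reader can carry them along or ignore them without erroring.

```python
import re

# Hypothetical sketch of a forward-tolerant MDMF cap parser.  Only the
# writekey and fingerprint are constrained; any number of extension
# fields after the fingerprint are accepted and returned as raw strings.
MDMF_RE = re.compile(
    r"^URI:MDMF:(?P<writekey>[A-Za-z0-9]+):(?P<fingerprint>[A-Za-z0-9]+)"
    r"(?P<extensions>(:[^:]+)*)$"
)

def parse_mdmf_cap(cap):
    """Return (writekey, fingerprint, list-of-extension-strings)."""
    m = MDMF_RE.match(cap)
    if m is None:
        raise ValueError("not an MDMF cap: %r" % (cap,))
    ext = m.group("extensions")
    extensions = ext.lstrip(":").split(":") if ext else []
    return m.group("writekey"), m.group("fingerprint"), extensions
```

For example, `parse_mdmf_cap("URI:MDMF:abc:def:3:131073")` yields the two current extension fields as opaque strings, and a cap with no extension fields parses just as happily.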
As for the use of that field to initialize the K and segsize values, I haven't read through the code carefully enough to see if it does that correctly and if it has good tests. This is potentially complicated. What, for example, happens if the segsize indicated in the cap and the segsize indicated [in the version info]source:trunk/src/allmydata/mutable/filenode.py?annotate=blame&rev=5227#L1107 differ? Can the segsize or the K change in different versions of the same MDMF? (I'm pretty sure it can't, but if it can't then maybe the value in the cap should be the only place that K or segsize exist.) Does the current trunk MDMF reader actually really use this value? Scanning through the code, I don't think so, but I'm not 100% sure yet.

There are three possibilities that I think we should consider for v1.9.0:
Proposal 1 (extension field for future use—currently unused): […] K:segsize when generating a URL. (The way K and segsize are encoded into the extension field has, of course, to fit into the constraints of the extension field. In addition to that, it should not consume the entire extension field, but should allow a safe and easy way for other fields to be added into the extension field such that they can be unambiguously parsed apart from the K and segsize fields.)

Proposal 2 (K and segsize in cap): […] K and segsize. This is in fact exactly the same as the "extension field" in the current trunk, but we stop calling it the "extension field" and start calling it K and segsize. […] K and segsize values from anywhere other than the cap when reading.

Proposal 3 (K and segsize in cap, plus an extension field): […] K and segsize into the cap, and then also do proposal 1.a. to provide an extension field for future use.

[Zooko was talking below about slightly different versions of the proposals -- he wasn't sure whether the existing code included K and segsize, which in fact it does.]

Note: I think that Proposal 1 is the least likely to delay or destabilize Tahoe-LAFS v1.9, especially if we leave out the optional 1.c. step. If v1.9 does not attempt to use the extension field in any way other than telling where it begins and ends, then future MDMF users will not have to worry that what they put in there will cause problems for old v1.9 users. By removing all the code that does anything with the extension field (aside from the regex which allows the extension field to be present in an MDMF cap), we can simplify the current 1.9 alpha code for easier review.

I'd like to hear your opinion about this! (Especially if you are Kevan, David-Sarah, or Brian.)
Current trunk does put K and segsize into the extension field when generating caps (source:src/allmydata/mutable/filenode.py@5227#L702).
Also see #1509 ("MDMF filecaps should tolerate extra parameters"). This ticket contains more information than #1509, though, so if we close one as a dup, I'd close #1509.
I like the conservatism of precisely specified and restrictive extension fields; it makes me more comfortable with the idea of extension fields, since it's easier for me to reason about the form of a valid cap and feel confident that we haven't inadvertently allowed caps that we don't want to allow. It is kind of silly to add a restriction like that before we even start writing the code that might one day use the extensions, though, and my objection isn't anything more than a vague unease with the idea.
I think that the code at the moment simply populates the k and segsize values -- it doesn't try to use them. Actually, aside from the restrictive nature of the extension fields at the moment, proposal 1 is what the current code does.
As I understand it, the extension fields are meant to be advisory. That is, any part of Tahoe-LAFS that uses them to work with a file should tolerate cases in which they're wrong. So we'd treat the remote share's representation of the segsize, k, or whatever else we choose to stick in the cap as canonical, and defer to it in the event of an inconsistency.
I'm a bit confused as to what optimizations are available from including K and segsize in the cap. Doesn't the downloader get K and segsize on the first round-trip anyway? Why would it be helpful to know them before that?
For any future extensions, I would prefer each field to be labelled. A one-letter label as the first character of the field would probably be sufficient. The advantage of this is that if we decide that a particular field is no longer necessary, we can just omit it without making the parse ambiguous.
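The labelled-field suggestion might look like this sketch. The particular labels ("k" for k, "s" for segsize) and the function name are my assumptions for illustration, not a spec; the property being demonstrated is that a field can be omitted, or carry an unknown label, without making the parse ambiguous.

```python
def parse_labelled_extensions(fields):
    """Parse one-letter-labelled extension fields; ignore unknown labels.

    Hypothetical encoding: "k3" means k=3, "s131073" means
    segsize=131073.  A reader that does not know a label simply skips
    that field, and a writer may omit any field it no longer needs.
    """
    known = {}
    for field in fields:
        label, value = field[:1], field[1:]
        if label == "k":
            known["k"] = int(value)
        elif label == "s":
            known["segsize"] = int(value)
        # any other label: a future extension, safely ignored
    return known
```

So a v1.9-era reader given `["k3", "s131073", "zfuture0"]` would use k and segsize and silently pass over the unfamiliar "z" field.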
[edited to merge in comments from the duplicate #1509, and take into account comment:85514.]
On IRC just now, I concluded that a "k" hint is useful to the mapupdate step (so it can make a reasonable number of parallel DYHB/checkstring requests), but that the "segsize" hint isn't so useful.
Ideally, the hints would let us build a one-round-trip downloader. The share contains a bunch of static data (whose size depends upon "N" and the keysize, but not the current filesize), followed by the share data, finally followed by the block hash tree (which depends upon the current filesize). This order lets us grow the file without needing to move all the sharedata. So 1 RTT is really hard: you have to correctly guess where the block hash tree data is, which means knowing the filesize to within a segment, in addition to knowing the segsize.
So 2 RTT is a more reasonable goal: the first request tells you the current filesize and the offset table, which lets you make an accurate second request. For the current Retrieve code, the first request is done during the mapupdate phase, which gets the checkstring and offset table for each known share. Then the retrieve phase can make an accurate initial request, and return the first segment of data in just 2 RTT.
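As an illustration of the 2-RTT plan, once the first round-trip has returned the offset table and current size, the second request can name exact byte spans. The field names and layout below are hypothetical stand-ins, not Tahoe-LAFS's real share format.

```python
def second_request_spans(offsets, segsize, segnum=0):
    """Given an offset table learned in the first round-trip, return the
    (offset, length) spans the second request needs for one segment:
    the share-data block itself plus the block hash tree at the end.
    """
    block_start = offsets["sharedata"] + segnum * segsize
    bht_start = offsets["block_hash_tree"]
    return [
        (block_start, segsize),
        (bht_start, offsets["EOF"] - bht_start),
    ]
```

The point of the sketch is that both spans are computable only after the offset table is known, which is why one round-trip is so hard and two is the natural target.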
So anyways, I think "k" is a useful hint, but I'm no longer so sure about including "segsize". I think "k" is so useful that I'm willing to make it mandatory, or at least untagged, so the MDMF filecap would be URI:MDMF:$writekey:$verfkeyfingerprint[:$k][:$EXTNAME=$VALUE]*.

The downside of making "k" so non-optional is that a repairer/reencoder which changes the file's encoding (replacing every single share) would then cause the filecap's "k" value to be stale. The retrieve code only treats it as a hint, of course (including the mapupdate step but not e.g. decoding), so it couldn't hurt anything but efficiency (the mapupdate's first batch of requests might not be enough to find enough shares, and we'll have to wait for the first batch to return before we learn of the larger real "k" and send out a second batch). But I think that's not a significant issue.
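The proposed shape could be parsed along these lines. Treating a leading all-digit field as the optional untagged k, and NAME=VALUE fields as tagged extensions, is my reading of the format above; the function is a sketch, not real Tahoe-LAFS code.

```python
def split_extensions(fields):
    """Split the trailing cap fields into (k, {name: value}).

    Fields look like ["3", "hint=abc"]: an optional bare integer k
    first, then zero or more NAME=VALUE extension pairs.  Unknown names
    are kept in the dict so callers can ignore what they don't
    understand without losing it.
    """
    k = None
    named = {}
    for field in fields:
        if "=" in field:
            name, _, value = field.partition("=")
            named[name] = value
        elif field.isdigit() and k is None:
            k = int(field)
    return k, named
```

Because k is only a hint, a caller that gets `k is None` (or a stale value after a re-encoding repair) just falls back to learning the real k from the shares.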
After talking with zooko on IRC, I'm in favor of making 1.9 tolerate and ignore everything past the fingerprint. That puts no constraints on the extension fields (although 1.10 or some later version might impose some, and including "k" will be the first extension I'd recommend).
If we're really not using those hints at all right now, then the code changes to support this should be pretty simple. I think we may be parsing 'k' and 'segsize' and populating the Filenode with them, in which case it's possible for behavior to be affected, and we should change it to not pre-populate those values.
I'll figure out how to implement this: I think it means a test that the URI parser accepts (and equates) filecaps with extra stuff after the fingerprint (so URI:MDMF:$writekey:$fingerprint:blahblah), changes to the URI parser's regexp, and changes to the Filenode constructor (to not extract+populate the extension fields). It might also involve changes to the URI constructor, to not provide those fields either.

Kevan: I'd like to talk with you about this. Maybe a phone call would work better than trac comments. Maybe IRC.
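The "accepts and equates" test described above might start out like this sketch. Here `canonical()` is a hypothetical stand-in for the real uri.py parser, which would return a filenode/URI object rather than a string; the shape of the assertion is the point.

```python
import unittest

def canonical(cap):
    """Stand-in parser: keep only URI:MDMF:$writekey:$fingerprint and
    tolerate (by discarding) anything after the fingerprint."""
    parts = cap.split(":")
    if parts[:2] != ["URI", "MDMF"] or len(parts) < 4:
        raise ValueError("not an MDMF cap: %r" % (cap,))
    return ":".join(parts[:4])

class ExtensionTolerance(unittest.TestCase):
    def test_extra_fields_accepted_and_equated(self):
        bare = "URI:MDMF:wk:fp"
        # a cap with trailing extension fields must parse, and must be
        # equated with the same cap without them
        self.assertEqual(canonical("URI:MDMF:wk:fp:blahblah"), bare)
        self.assertEqual(canonical(bare), bare)
```

Run with `python -m unittest` in the usual way; the real version would live alongside the existing uri.py tests.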
Attachment no-extensions.diff (43052 bytes) added
tolerate-but-ignore extensions in MDMF filecaps: remove extension-handling code and tests
In changeset:416701e404c74a3e:
In [5393/ticket999-S3-backend]:
In changeset:de00b277cc9adfb0:
In [5418/ticket999-S3-backend]:
I reviewed changeset:0716c496c8a39758 and it looks good to me (except see below). It has a net reduction of source code, which is always good.
One thing I noticed was that with this patch, if you have a cap with an extension field, and you parse it using [uri.py]source:trunk/src/allmydata/uri.py?annotate=blame&rev=5305 and then produce a cap from the resulting Python object, the cap produced will not have the extension on it. Could this be a problem if, in the future, someone sends a cap with an extension, and a user of Tahoe-LAFS v1.9 inputs that cap into their Tahoe-LAFS implementation? Regardless of whether their Tahoe-LAFS v1.9 will show the extension field in copies of the cap that it produces/exports/displays, it will certainly not use the extension field for anything, so the only question is whether it is okay for the Tahoe-LAFS v1.9 implementation to omit the extension, which it is ignoring, from productions/exports/displays of that cap.
Kevan: I would appreciate it if you would review this issue, and this patch, even though (or especially because) you earlier said you preferred the original behavior of parsing the extension field as a two-tuple of integers.
David-Sarah: I would appreciate it if you would review this because you did such a great job of reasoning about forward-compatibility in previous iterations.
I'm inclined to think that we should only be parsing caps with uri.py if we're about to consume them somehow, and never re-stringify them unless we've just created one (i.e. uploaded a new file). Then we won't be obligated to have uri.py retain information that it doesn't actually use. Things like directory-listing should just dump the filecap strings.

In fact, I kind of regret having URI objects in the first place.. they're really just a way to encapsulate the parsing/serialization functions. Having a separate class has caused a lot of confusion in things like filenode.py's get_uri() method (does it return a string? a URI object?). So I'd like to reduce their use, and eventually get rid of them completely, somehow.

Re: comment:85530, that sounds good! It means that code which is going to consume caps and later produce them again has to keep a verbatim copy of the (string) cap that they started with. But this is a question of how storage and reproduction of caps should be implemented -- it is orthogonal to the question of whether that storage and reproduction should include or omit the extension field.
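The verbatim-copy idea could look like this minimal sketch. The class and attribute names are hypothetical, not uri.py's actual API; the design choice it illustrates is that re-stringifying reproduces the original cap, extensions included, even though parsing ignored them.

```python
class ParsedCap:
    """Keep the original cap string alongside the parsed fields, so
    that producing the cap again never loses extension fields that
    this version of the code does not understand."""
    def __init__(self, capstring):
        self._original = capstring          # preserved verbatim
        parts = capstring.split(":")
        self.writekey = parts[2]
        self.fingerprint = parts[3]
        self.extensions = parts[4:]         # ignored, but not discarded
    def to_string(self):
        return self._original
```

With this shape, a future cap like URI:MDMF:wk:fp:k3 round-trips unchanged through a reader that knows nothing about the "k3" field.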