Store Content-Type as part of directory entries #992
Reference: tahoe-lafs/trac-2024-07-25#992
Some apps, particularly those using the webapi, will want to associate proper content types with files.
Issue #947 proposes a complex scheme which ties metadata with the actual file object in some way.
I propose in this issue a simpler scheme:
In the directory entry, have "content-type" and "content-encoding" entries in the normal metadata hash.
On HTTP PUT, create a directory entry inheriting the content-type and content-encoding from the PUT request (if present). If one or both are not present, then leave the corresponding entries absent.
On GET, use the content-type and -encoding from the metadata if present. Otherwise use the current scheme (guess the content type from the filename, defaulting to text/plain; no encoding).
To fill this out, the command-line tools would need some option to do nothing (current behaviour), explicitly set the type and encoding, or guess based on extension, magic-number sniffing, etc.
The content-type would be full content-type syntax with type/subtype and parameters.
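The PUT and GET rules proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `metadata_from_put` and `headers_for_get` are hypothetical, headers and metadata are modelled as plain dicts rather than the webapi's real request objects, and the two fields are treated independently, which the proposal does not spell out.

```python
import mimetypes

# Hypothetical helper: build the directory-entry metadata for a PUT.
# `request_headers` is assumed to be a plain dict of header name -> value.
def metadata_from_put(request_headers):
    metadata = {}
    for header, key in (("Content-Type", "content-type"),
                        ("Content-Encoding", "content-encoding")):
        value = request_headers.get(header)
        if value is not None:
            # Only record what the client actually sent; absent stays absent.
            metadata[key] = value
    return metadata

# Hypothetical helper: choose response headers for a GET.
def headers_for_get(metadata, filename):
    content_type = metadata.get("content-type")
    if content_type is None:
        # Fall back to the current scheme: guess from the filename,
        # defaulting to text/plain.
        guessed, _ = mimetypes.guess_type(filename)
        content_type = guessed or "text/plain"
    headers = {"Content-Type": content_type}
    content_encoding = metadata.get("content-encoding")
    if content_encoding is not None:
        headers["Content-Encoding"] = content_encoding
    return headers
```

An entry written by an older client has no such metadata, so `headers_for_get({}, name)` takes the same guessing path as today.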
Oh, I was going to comment on backwards and forwards compatibility:
If a newer client sees old files without this metadata, then it will behave just as an older client would. If the metadata is present, it will be returned with a GET/HEAD request, exactly as normal.
If an older client reads entries with the metadata, it will ignore the metadata and behave as if it weren't there (i.e., it will guess its own MIME type). It will be no worse off than it is now. It will create new entries without the metadata.
The main problem is that a mixture of old and new clients will see different metadata for the same files. This seems unavoidable.
#994 discusses Content-Encoding in more detail -- it is not sufficient to just store the Content-Encoding as metadata; the frontends also have to be able to decompress a compressed file in some cases.
Also, I don't think that storing Content-Encoding in edge metadata can work. The edge metadata isn't known when referring to a file directly by its cap, rather than via a directory, so the interpretation of the file as a sequence of bytes (never mind as a MIME object) would be ambiguous.
Changing this ticket to just be about Content-Type (and perhaps other edge metadata that Tahoe would not need to understand).
Title changed from "Store content-type and encoding as part of directory entries" to "Store Content-Type as part of directory entries".
Replying to davidsarah:
I see this as a feature, to an extent, as it allows the same bucket of bits to be presented in multiple ways depending on the path used to reference it. For example, you may want to refer to the same file as: "bigfile.txt.gz" with a content-type of application/gzip and no encoding, or as "bigfile.txt" with a content-type of text/plain and an encoding of gzip.
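To make the bigfile example concrete, here is a small sketch of two directory entries that point at the same file cap but carry different edge metadata. The dict layout, the `cap` key, and the placeholder cap string are assumptions made for illustration; this is not Tahoe's actual dirnode format.

```python
# Placeholder standing in for a real file cap (hypothetical value).
FILE_CAP = "URI:CHK:example-placeholder"

# Two edges to the same bytes, each presenting them differently.
directory = {
    "bigfile.txt.gz": {
        "cap": FILE_CAP,
        "metadata": {"content-type": "application/gzip"},
    },
    "bigfile.txt": {
        "cap": FILE_CAP,
        "metadata": {"content-type": "text/plain",
                     "content-encoding": "gzip"},
    },
}

def presentation(name):
    """Return (content-type, content-encoding) for a child name."""
    metadata = directory[name]["metadata"]
    return metadata.get("content-type"), metadata.get("content-encoding")
```

Here `presentation("bigfile.txt.gz")` yields a gzip archive with no encoding, while `presentation("bigfile.txt")` yields plain text with a gzip encoding, even though both names resolve to the same cap.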
I agree that it is a pain that there's no way to have a raw cap of a file with associated metadata, but I think that's the topic of #947. I have a half-formed idea about using a DIR cap of a vestigial directory containing a single nameless entry (but with metadata) pointing to the final file. But I haven't really thought it through.
In this conversation on Google+ it occurred to me that if we had that metadata in the directory, then the URLs to children that are served up by a directory, e.g. this directory, which URLs currently contain a suggested filename, could also contain a suggested content-type, e.g. (@@https://lafsgateway.zooko.com/file/URI%3ACHK%3Ags6rtdc74o4jxuv2frmni45tyu%3Apaqq4o4tquin7cqnigoogvuzsidikiss5yidsxskfurwcsl6g6ua%3A1%3A1%3A8294637/@@named=/Weinbergerinternet.mp3.wav.opus?and-by-the-way-mister-http-server-please-set-content-type=audio/ogg@@)
It is important for security for the web gateway to validate the syntax of the header in order to prevent response splitting attacks. Response splitting is an injection attack where the input spliced into a header field contains '\r\n' then possibly more headers, then possibly a complete response body.
This would allow a malicious directory (or file-cap-associated metadata) to impersonate the web gateway.
And of course for user-friendliness and defense in depth it would be nice if all clients and server-side metadata storage used the same validation parser. (i.e., "tahoe put --content-type 'barf\0\r\nWhee!' myfile" would report an invalid content type before attempting any network I/O.)
Content-Type syntax is defined in http://tools.ietf.org/html/rfc2045#section-5.1. It's a bit overcomplicated so I suggest just restricting to printable characters (ASCII 0x20..0x7E), and possibly imposing a maximum length. That should be sufficient to prevent splitting attacks (and buffer overflow attacks against carelessly written parsers).
Replying to davidsarah:
... except that the ABNF grammar there does not allow spaces, and in practice all implementations do allow spaces, at least after ';'.
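The restriction suggested above (printable ASCII only, spaces allowed, plus a length cap) might look like the following sketch. The function name `is_safe_content_type` and the 256-character limit are assumptions chosen for illustration; the thread does not fix a maximum length, and this check deliberately does not attempt to parse the full RFC 2045 grammar.

```python
# Assumed limit; the discussion only says "possibly imposing a maximum length".
MAX_CONTENT_TYPE_LENGTH = 256

def is_safe_content_type(value, max_length=MAX_CONTENT_TYPE_LENGTH):
    """Accept only non-empty strings of printable ASCII (0x20..0x7E).

    Rejecting everything outside that range rules out '\r' and '\n',
    which is what a response-splitting payload needs, as well as NUL.
    """
    if not isinstance(value, str):
        return False
    if not 0 < len(value) <= max_length:
        return False
    return all(0x20 <= ord(ch) <= 0x7E for ch in value)
```

With this check, a value such as `"text/plain; charset=utf-8"` passes (the space after `;` is in range), while `"barf\0\r\nWhee!"` is rejected. Sharing one such function between the CLI and the web gateway would let both apply the same rule.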
Notice that GET /file/...@@named=... is similar to directory-associated edge metadata. A URL is an edge, even when it is not stored in a directory.
So in some sense, adding more features like @@named is similar to adding edge-associated metadata in a directory.
If we add some metadata in the @@-style and different conventions in the dirnode metadata, the interface will grow more complex and confusing over time. For that reason, I'm a fan of jsgf's "anonymous singleton directory" idea because it could replace the ad-hoc @@ requests with a single standard for directory metadata. (Maybe there are still webapi / ui headaches around this approach, though.)
Maybe a better approach than a "singleton directory" is to just ensure that every kind of dirnode edge metadata is exposed to the /file/...@@ interface in a uniform way.