Store Content-Type as part of directory entries #992


tahoe-lafs · 2010-03-12T03:35:04Z

jsgf commented

2010-03-12 03:35:04 +00:00

Some apps, particularly those using the webapi, will want to associate proper content types with files.

Issue #947 proposes a complex scheme which ties metadata with the actual file object in some way.

I propose in this issue a simpler scheme:

1. In the directory entry, have "content-type" and "content-encoding" entries in the normal metadata hash.
2. On http PUT, create a directory entry inheriting the content-type and -encoding from the PUT request (if present). If one or both are not present, then leave the entries absent.
3. On GET, use the content-type and -encoding from the metadata if present. Otherwise use the current scheme (guess the content type from the filename, defaulting to text/plain; no encoding).

To fill this out, the command-line tools would need some option to do nothing (current behaviour), explicitly set the type and encoding, or guess based on extension, magic-number sniffing, etc.

The content-type would use the full Content-Type syntax, with type/subtype and parameters.
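The PUT and GET rules above could be sketched roughly as follows. This is only an illustration, not Tahoe's webapi code: the function names `metadata_from_put` and `headers_for_get` are hypothetical, and plain dicts stand in for the request headers and the directory entry's metadata hash.

```python
import mimetypes

def metadata_from_put(request_headers, metadata=None):
    """Copy Content-Type / Content-Encoding from a PUT request into edge metadata.

    Headers that are absent from the request leave the matching entries absent.
    """
    metadata = dict(metadata or {})
    for header, key in (("Content-Type", "content-type"),
                        ("Content-Encoding", "content-encoding")):
        value = request_headers.get(header)
        if value is not None:
            metadata[key] = value
    return metadata

def headers_for_get(filename, metadata):
    """Pick response headers: stored metadata first, filename guess otherwise."""
    headers = {}
    content_type = metadata.get("content-type")
    if content_type is None:
        # Fallback scheme: guess from the filename, defaulting to text/plain.
        guessed, _ = mimetypes.guess_type(filename)
        content_type = guessed or "text/plain"
    headers["Content-Type"] = content_type
    if "content-encoding" in metadata:
        headers["Content-Encoding"] = metadata["content-encoding"]
    return headers
```

An entry created by a PUT without either header ends up with empty metadata, so a later GET falls through to the filename guess.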



tahoe-lafs added this to the undecided milestone 2010-03-12 03:35:04 +00:00

jsgf commented

2010-03-12 05:12:58 +00:00

Oh, I was going to comment on backwards and forwards compatibility:

If a newer client sees old files without this metadata, then it will behave just as an older client would. If the metadata is present, it will be returned with a GET/HEAD request, exactly as normal.

If an older client reads entries with the metadata, it will ignore the metadata and behave as if it weren't there (i.e., guessing its own MIME type). It will be no worse off than it is now. It will create new entries without the metadata.

The main problem is that a mixture of old and new clients will see different metadata for the same files. This seems unavoidable.


davidsarah commented

2010-03-12 06:41:50 +00:00

#994 discusses Content-Encoding in more detail -- it is not sufficient to just store the Content-Encoding as metadata; the frontends also have to be able to decompress a compressed file in some cases.

Also, I don't think that storing Content-Encoding in edge metadata can work. The edge metadata isn't known when referring to a file directly by its cap, rather than via a directory, so the interpretation of the file as a sequence of bytes (never mind as a MIME object) would be ambiguous.

Changing this ticket to just be about Content-Type (and perhaps other edge metadata that Tahoe would not need to understand).



tahoe-lafs changed title from ~~Store content-type and encoding as part of directory entries~~ to Store Content-Type as part of directory entries

2010-03-12 06:41:50 +00:00

jsgf commented

2010-03-12 17:44:40 +00:00

Replying to davidsarah:

> Also, I don't think that storing Content-Encoding in **edge** metadata can work. The edge metadata isn't known when referring to a file directly by its cap, rather than via a directory, so the interpretation of the file as a sequence of bytes (never mind as a MIME object) would be ambiguous.

I see this as a feature, to an extent, as it allows the same bucket of bits to be presented in multiple ways depending on the path used to reference it. For example, you may want to refer to the same file as: "bigfile.txt.gz" with a content-type of application/gzip and no encoding, or as "bigfile.txt" with a content-type of text/plain and an encoding of gzip.
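As a rough picture of that example, two directory entries could point at the same cap while carrying different metadata. The cap string and the entry layout below are made up for illustration and do not reflect Tahoe's real dirnode format.

```python
# Hypothetical cap string and entry layout, for illustration only.
FILE_CAP = "URI:CHK:example"

directory = {
    # Served as an opaque gzip archive.
    "bigfile.txt.gz": {
        "cap": FILE_CAP,
        "metadata": {"content-type": "application/gzip"},
    },
    # Same bytes, served as gzip-encoded plain text.
    "bigfile.txt": {
        "cap": FILE_CAP,
        "metadata": {"content-type": "text/plain",
                     "content-encoding": "gzip"},
    },
}

def presentation(name):
    """Return (cap, content-type, content-encoding) for a child name."""
    entry = directory[name]
    md = entry["metadata"]
    return entry["cap"], md.get("content-type"), md.get("content-encoding")
```

Both names resolve to the same stored bytes; only the edge metadata, and therefore the headers a gateway would send, differ.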

I agree that it is a pain there's no way to have a raw cap of a file with associated metadata, but I think that's the topic of #947. I have a half-formed idea about using a DIR cap of a vestigial directory containing a single nameless entry (but with metadata) pointing to the final file. But I haven't really thought it through.


zooko commented

2012-05-17 17:30:49 +00:00

In [this conversation on Google+](https://plus.google.com/u/0/108313527900507320366/posts/b4KxkLCvrJD) it occurred to me that if we had that metadata in the directory, then the URLs to children that are served up by a directory, e.g. [this directory](https://lafsgateway.zooko.com/uri/URI%3ADIR2-MDMF-RO%3Ahd5kspvx5bymvat6omak5nfni4%3Awtieioxcnw5tukgjyqxp3ql3fu6ndkcdalvhisaqbiod3ldjytna), which URLs currently contain a suggested filename, could also contain a suggested content-type, e.g. (@@https://lafsgateway.zooko.com/file/URI%3ACHK%3Ags6rtdc74o4jxuv2frmni45tyu%3Apaqq4o4tquin7cqnigoogvuzsidikiss5yidsxskfurwcsl6g6ua%3A1%3A1%3A8294637/@@named=/Weinbergerinternet.mp3.wav.opus?and-by-the-way-mister-http-server-please-set-content-type=audio/ogg@@)

nejucomo commented

2012-05-17 21:29:24 +00:00

It is important for security for the web gateway to validate the syntax of the header in order to prevent response splitting attacks. Response splitting is an injection attack where the input spliced into a header field contains '\r\n' then possibly more headers, then possibly a complete response body.

This would allow a malicious directory (or file-cap-associated metadata) to impersonate the web gateway.

And of course for user-friendliness and defense in depth it would be nice if all clients and server-side metadata storage used the same validation parser. (ie: "tahoe put --content-type 'barf\0\r\nWhee!' myfile" would say something about an invalid content type before attempting any network io.)
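To make the attack concrete, here is a small self-contained sketch of what goes wrong when a stored value is spliced into the header block unchecked. `naive_response_head` is a deliberately unsafe, hypothetical function and not how any real gateway builds responses.

```python
def naive_response_head(content_type):
    # Deliberately unsafe: the stored value goes straight into the header block.
    return "HTTP/1.1 200 OK\r\nContent-Type: " + content_type + "\r\n\r\n"

# A value an attacker might store as the content type of a directory entry.
malicious = ("text/plain\r\n"
             "Set-Cookie: session=attacker\r\n"
             "\r\n"
             "<html>forged body</html>")

head = naive_response_head(malicious)

# Everything before the first blank line is what a client parses as headers.
header_lines = head.split("\r\n\r\n", 1)[0].split("\r\n")
```

Here `header_lines` contains an injected `Set-Cookie` header, and the text after the first blank line would be read as the start of the response body.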


davidsarah commented

2012-05-17 22:01:17 +00:00

Content-Type syntax is defined in http://tools.ietf.org/html/rfc2045#section-5.1. It's a bit overcomplicated so I suggest just restricting to printable characters (ASCII 0x20..0x7E), and possibly imposing a maximum length. That should be sufficient to prevent splitting attacks (and buffer overflow attacks against carelessly written parsers).
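A minimal sketch of that restriction might look like the following. The function name and the 256-character limit are assumptions chosen for illustration; the ticket does not specify either.

```python
MAX_CONTENT_TYPE_LENGTH = 256  # arbitrary illustrative limit, not from the ticket

def is_acceptable_content_type(value, max_length=MAX_CONTENT_TYPE_LENGTH):
    """Accept only printable ASCII (0x20..0x7E) up to max_length characters."""
    if not isinstance(value, str):
        return False
    if len(value) > max_length:
        return False
    # CR, LF, NUL and every other control character fall outside 0x20..0x7E.
    return all(0x20 <= ord(ch) <= 0x7E for ch in value)
```

Because CR and LF are outside the allowed range, a value that passes this check cannot terminate the header line it is written into.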


davidsarah commented

2012-05-17 22:07:16 +00:00

Replying to davidsarah:

> Content-Type syntax is defined in <http://tools.ietf.org/html/rfc2045#section-5.1>.

... except that the ABNF grammar there [does not allow spaces](http://tools.ietf.org/html/rfc5234#section-3.1), and in practice all implementations do allow spaces, at least after ';'.
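As one example of that leniency, Python's standard `email.message.Message` parser accepts a parameter with or without a space after ';'. The helper name `split_content_type` is hypothetical; the sketch only shows one implementation's behaviour and says nothing about others.

```python
from email.message import Message

def split_content_type(value):
    """Return (type/subtype, charset parameter) as parsed by the stdlib."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_param("charset")

with_space = split_content_type("text/html; charset=utf-8")
without_space = split_content_type("text/html;charset=utf-8")
```

Both calls yield the same type and parameter, so a stored value should not be rejected merely for containing that space.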

nejucomo commented

2012-05-17 22:07:41 +00:00

Notice that GET /file/...@@named=... is similar to directory-associated edge metadata. A URL is an edge, even when it is not stored in a directory.

So in some sense, adding more features like @@named is similar to adding edge-associated metadata in a directory.

If we add some metadata in the @@-style and different conventions in the dirnode metadata, the interface will grow more complex and confusing over time. For that reason, I'm a fan of jsgf's "anonymous singleton directory" idea because it could replace the ad-hoc @@ requests with a single standard for directory metadata. (Maybe there are still webapi / ui headaches around this approach, though.)


nejucomo commented

2012-05-17 22:09:17 +00:00

Maybe a better approach than a "singleton directory" is to just ensure that every kind of dirnode edge metadata is exposed to the /file/...@@ interface in a uniform way.

