support precompressed files #994
A "precompressed file" is a file where the plaintext has been compressed using an algorithm supported by HTTP (gzip or deflate -- we'd probably support only one). When the file is served via the webapi, it is served in compressed form with the Content-Encoding HTTP header set appropriately. The Content-Encoding can also be set in a PUT or POST request to upload a precompressed file.
Storage servers would be completely ignorant of precompressed files. The CLI, SFTP, and FTP frontends would have to decompress them. The gateway would also have to decompress if it receives an HTTP request that does not have an Accept-Encoding header allowing the compression algorithm used for that file.
This would provide a performance improvement as long as the HTTP clients have enough CPU capacity that the time they spend decompressing is outweighed by the savings in bandwidth. CPU-constrained clients (connecting to a less CPU-constrained gateway) are not a problem, because they can simply not send Accept-Encoding.

This would rely on HTTP clients implementing decompression correctly; if they don't, then there is a potential loss of integrity, and the possibility of attacks against the client from maliciously constructed compressed data. It is possible to protect against "decompression bombs" if that is required.
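On the "decompression bombs" point, one standard defence is to decompress incrementally and refuse once the output exceeds a cap. A minimal sketch using the standard library, assuming gzip; the 100 MiB limit and the function name are arbitrary choices for illustration:

```python
import zlib

MAX_OUTPUT = 100 * 1024 * 1024  # arbitrary 100 MiB cap for this sketch

def bounded_gunzip(data, limit=MAX_OUTPUT):
    # wbits=16+MAX_WBITS makes zlib expect a gzip wrapper.
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out = []
    total = 0
    while data:
        # Produce at most 64 KiB per step so a bomb is caught early,
        # long before it has expanded into memory.
        chunk = d.decompress(data, 1 << 16)
        total += len(chunk)
        if total > limit:
            raise ValueError("decompression bomb: output exceeds limit")
        out.append(chunk)
        data = d.unconsumed_tail
        if d.eof:
            break
    if not d.eof:
        raise ValueError("truncated gzip stream")
    return b"".join(out)
```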
Note that, as pointed out in /tahoe-lafs/trac-2024-07-25/issues/6054#comment:2, the Content-Encoding must be a property of a file, not of the metadata stored in directory entries. (I think there are ways to compatibly store this in the UEB.)
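Purely to illustrate that parenthetical: the UEB is, roughly, a dictionary of encoding and integrity parameters stored with every share, so one could imagine recording the plaintext compression there. The "plaintext_encoding" key below is invented for this sketch; the other keys are real CHK UEB fields. Note the interaction with the next comment: old clients would presumably ignore a UEB key they don't recognize, which is exactly the silent misinterpretation we want to avoid.

```python
# Illustrative UEB-like dict; "plaintext_encoding" is a hypothetical
# field, not part of Tahoe's actual UEB.
ueb = {
    "codec_name": "crs",           # zfec Reed-Solomon, as in real CHK files
    "needed_shares": 3,
    "total_shares": 10,
    "segment_size": 131072,
    "size": 4096,                  # would be the *compressed* size here
    "plaintext_encoding": "gzip",  # hypothetical new field
}
```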
Replying to davidsarah:
Actually, we want old clients to fail to download these files (rather than to misinterpret the compressed data as uncompressed).
Replying to davidsarah (comment:2):
That seems like a pretty big semantic change for Tahoe. Thus far it is more or less a transparent container for arrays of bytes, with a bit of advisory metadata sprinkled on top. Changing that so that some byte arrays have an innate property which prevents some clients from being able to download them is a big change.
Given that the widespread convention is that content type and encoding are stored (to some extent) in the filename itself as extensions, making these properties more fully expanded in the directory entries has an internal consistency.
As I mention in /tahoe-lafs/trac-2024-07-25/issues/6054#comment:3, the same bits can be represented as either "foo.txt" "text/plain" "encoding: gzip" or "foo.txt.gz" "application/gzip". The former could be misinterpreted by an old client which fails to pay attention to content-encoding.
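The two representations described above, side by side as illustrative dicts (not any actual Tahoe metadata schema):

```python
# Same stored bytes, two equivalent descriptions of them:
as_encoded_text = {
    "name": "foo.txt",
    "content-type": "text/plain",
    "content-encoding": "gzip",  # an encoding-unaware client may ignore this
}
as_opaque_archive = {
    "name": "foo.txt.gz",
    "content-type": "application/gzip",  # safe even for encoding-unaware clients
}
```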
But I don't think this is a huge problem; I suspect most webapi clients are already using a general-purpose HTTP library, which will already have to deal with content encoding. We'd need to test that the CLI ends up doing the right thing, of course. I don't know what would happen to apps directly using the python APIs.
Replying to jsgf (comment:3):
The effect of making the file data (as an uncompressed sequence of bytes) dependent on metadata that is detached from the file URI would be an even bigger semantic change. The file URI has to unambiguously determine the file data.
One way of achieving that would be to put the bit that determines whether a file has been stored compressed in the URI: for example, "UCHK:gz:..." could be the gzip-decompressed version of "CHK:...".

We can't send Content-Encoding: gzip if the client hasn't sent an Accept-Encoding that includes gzip; that would obviously be incorrect and not compliant with RFC 2616. We can't do much about clients that are sometimes unable to correctly decompress encodings that they advertise they accept, such as Netscape 4.x (well, we could blacklist such clients by User-Agent, but yuck).

There's no usable consistency in file extensions.
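For the RFC 2616 point, a minimal sketch of the negotiation check being described: only serve Content-Encoding: gzip when the request's Accept-Encoding names gzip (or "*") with a nonzero qvalue. The function name and the treat-missing-header-as-identity policy are choices made for this sketch:

```python
def client_accepts_gzip(accept_encoding):
    # accept_encoding: the Accept-Encoding header value, or None if absent.
    if not accept_encoding:
        # No header: the server may choose any coding, but serving the
        # identity (uncompressed) form is the safe default.
        return False
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        coding = parts[0].lower()
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if coding in ("gzip", "*") and q > 0:
            return True
    return False

assert client_accepts_gzip("gzip, deflate")
assert client_accepts_gzip("*;q=0.5")
assert not client_accepts_gzip("gzip;q=0")
assert not client_accepts_gzip("identity")
```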
#1354 is about supporting compression at the storage layer.