support precompressed files #994
A "precompressed file" is a file where the plaintext has been compressed using an algorithm supported by HTTP (gzip or deflate -- we'd probably support only one). When the file is served via the webapi, it is served in compressed form with the Content-Encoding HTTP header set appropriately. The Content-Encoding can also be set in a PUT or POST request to upload a precompressed file.
Storage servers would be completely ignorant of precompressed files. The CLI, SFTP, and FTP frontends would have to decompress them. The gateway would also have to decompress if it receives an HTTP request that does not have an Accept-Encoding header allowing the compression algorithm used for that file.
This would provide a performance improvement as long as the HTTP clients have enough CPU capacity that the time they spend decompressing is outweighed by the savings in bandwidth. CPU-constrained clients (connecting to a less CPU-constrained gateway) are not a problem, because they can simply not send Accept-Encoding.

This would rely on HTTP clients implementing decompression correctly; if they don't, then there is a potential loss of integrity, and the possibility of attacks against the client from maliciously constructed compressed data. It is possible to protect against "decompression bombs" if that is required.
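On the "decompression bombs" point, one standard defence is to decompress incrementally and refuse once the output exceeds a cap. A minimal sketch using the standard library, assuming gzip; the 100 MiB limit and the function name are arbitrary choices for illustration:

```python
import zlib

MAX_OUTPUT = 100 * 1024 * 1024  # arbitrary 100 MiB cap for this sketch

def bounded_gunzip(data, limit=MAX_OUTPUT):
    # wbits=16+MAX_WBITS makes zlib expect a gzip wrapper.
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out = []
    total = 0
    while data:
        # Produce at most 64 KiB per step so a bomb is caught early,
        # long before it has expanded into memory.
        chunk = d.decompress(data, 1 << 16)
        total += len(chunk)
        if total > limit:
            raise ValueError("decompression bomb: output exceeds limit")
        out.append(chunk)
        data = d.unconsumed_tail
        if d.eof:
            break
    if not d.eof:
        raise ValueError("truncated gzip stream")
    return b"".join(out)
```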
Note that, as pointed out in /tahoe-lafs/trac-2024-07-25/issues/6054#comment:2, the Content-Encoding must be a property of a file, not of the metadata stored in directory entries. (I think there are ways to compatibly store this in the UEB.)
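Purely to illustrate that parenthetical: the UEB is, roughly, a dictionary of encoding and integrity parameters stored with every share, so one could imagine recording the plaintext compression there. The "plaintext_encoding" key below is invented for this sketch; the other keys are real CHK UEB fields. Note the interaction with the next comment: old clients would presumably ignore a UEB key they don't recognize, which is exactly the silent misinterpretation we want to avoid.

```python
# Illustrative UEB-like dict; "plaintext_encoding" is a hypothetical
# field, not part of Tahoe's actual UEB.
ueb = {
    "codec_name": "crs",           # zfec Reed-Solomon, as in real CHK files
    "needed_shares": 3,
    "total_shares": 10,
    "segment_size": 131072,
    "size": 4096,                  # would be the *compressed* size here
    "plaintext_encoding": "gzip",  # hypothetical new field
}
```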
Replying to davidsarah:
Actually, we want old clients to fail to download these files (rather than to misinterpret the compressed data as uncompressed).
Replying to davidsarah (comment:2):
That seems like a pretty big semantic change for Tahoe. Thus far it is more or less a transparent container for arrays of bytes, with a bit of advisory metadata sprinkled on top. Changing that so that some byte arrays have an innate property which prevents some clients from being able to download them is a big change.
Given that the widespread convention is that content type and encoding are stored (to some extent) in the filename itself as extensions, making these properties more fully expanded in the directory entries has an internal consistency.
As I mention in /tahoe-lafs/trac-2024-07-25/issues/6054#comment:3, the same bits can be represented as either "foo.txt" "text/plain" "encoding: gzip" or "foo.txt.gz" "application/gzip". The former could be misinterpreted by an old client which fails to pay attention to content-encoding.
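The two representations described above, side by side as illustrative dicts (not any actual Tahoe metadata schema):

```python
# Same stored bytes, two equivalent descriptions of them:
as_encoded_text = {
    "name": "foo.txt",
    "content-type": "text/plain",
    "content-encoding": "gzip",  # an encoding-unaware client may ignore this
}
as_opaque_archive = {
    "name": "foo.txt.gz",
    "content-type": "application/gzip",  # safe even for encoding-unaware clients
}
```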
But I don't think this is a huge problem; I suspect most webapi clients are already using a general-purpose HTTP library, which will already have to deal with content encoding. We'd need to test that the CLI ends up doing the right thing, of course. I don't know what would happen to apps directly using the python APIs.
Replying to jsgf (comment:3):
The effect of making the file data (as an uncompressed sequence of bytes) dependent on metadata that is detached from the file URI would be an even bigger semantic change. The file URI has to unambiguously determine the file data.
One way of achieving that would be to put the bit that determines whether a file has been stored compressed in the URI: for example, "UCHK:gz:..." could be the gzip-decompressed version of "CHK:...".

We can't send Content-Encoding: gzip if the client hasn't sent an Accept-Encoding that includes gzip; that would obviously be incorrect and not compliant with RFC 2616. We can't do much about clients that are sometimes unable to correctly decompress encodings that they advertise they accept, such as Netscape 4.x (well, we could blacklist such clients by User-Agent, but yuck).

There's no usable consistency in file extensions.
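For the RFC 2616 point, a minimal sketch of the negotiation check being described: only serve Content-Encoding: gzip when the request's Accept-Encoding names gzip (or "*") with a nonzero qvalue. The function name and the treat-missing-header-as-identity policy are choices made for this sketch:

```python
def client_accepts_gzip(accept_encoding):
    # accept_encoding: the Accept-Encoding header value, or None if absent.
    if not accept_encoding:
        # No header: the server may choose any coding, but serving the
        # identity (uncompressed) form is the safe default.
        return False
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        coding = parts[0].lower()
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if coding in ("gzip", "*") and q > 0:
            return True
    return False

assert client_accepts_gzip("gzip, deflate")
assert client_accepts_gzip("*;q=0.5")
assert not client_accepts_gzip("gzip;q=0")
assert not client_accepts_gzip("identity")
```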
#1354 is about supporting compression at the storage layer.