download: support Range header, in a quick hackish tempfile way #527
Reference: tahoe-lafs/trac-2024-07-25#527
Currently, tahoe supports "streaming download", which really means we can serve the first byte of the response quickly (without needing to process the whole file; this is enabled by the use of merkle hash trees). But the real form of streaming download that we want is to support the HTTP "Range" header, which a client uses to tell the server that they only want a specific chunk of the target file, rather than the whole thing.
It turns out that the Quicktime media player on the iPhone depends upon this feature: it is unable to play music or movies served by a webserver that does not honor the Range header.
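For reference, the semantics such a client relies on: it sends a `Range: bytes=start-end` header and expects a `206 Partial Content` response carrying only that slice. A minimal sketch of the byte-range arithmetic (a hypothetical helper, not Tahoe code; single ranges only):

```python
def parse_range(header, filesize):
    """Parse a simple single-range 'bytes=start-end' header.

    Returns (offset, length) of the slice to serve, or None if the
    header is unsatisfiable.  Suffix ranges ('bytes=-N') and open-ended
    ranges ('bytes=N-') are handled; multi-range requests are not.
    """
    units, _, spec = header.partition("=")
    if units.strip() != "bytes" or "," in spec:
        return None
    start_s, _, end_s = spec.strip().partition("-")
    if start_s == "":                      # suffix range: last N bytes
        length = min(int(end_s), filesize)
        return (filesize - length, length)
    start = int(start_s)
    if start >= filesize:
        return None                        # would be 416 Range Not Satisfiable
    end = int(end_s) if end_s else filesize - 1
    end = min(end, filesize - 1)
    return (start, end - start + 1)
```

A GET of "just the first byte" is then `Range: bytes=0-0`, which maps to offset 0, length 1.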
The long-term way to address this is by rewriting the download code to allow random-access (this will be done as a side-effect of the effort to make download tolerate stalled/slow servers). But we thought of a short-term approach this afternoon, which is worth doing sooner.
The Twisted web server code, in static.File, knows how to handle the Range header when it is serving a real file from local disk. So the idea is that:

- when the webapi code (specifically web.filenode.FileDownloader) sees a Range header in the request, it bypasses the usual producer/consumer DownloadTarget code
- instead, it downloads the whole file into a tempfile on local disk
- then it creates a static.File resource for the target, enabling the Range header to be processed
- if the file is already in the cache, the DownloadTarget can use the cached copy directly and don't use a tempfile
- some manual process (i.e. a cronjob) would be responsible for deleting old files from the cache every once in a while
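The cache-fill step can be sketched in stdlib-only Python (names like `fetch_whole_file` are hypothetical; in the real webapi the final step would hand the cached path to `twisted.web.static.File`, which does the actual Range processing):

```python
import os
import tempfile

def get_cached_copy(cache_dir, storage_index, fetch_whole_file):
    """Return the path to a local copy of the file, filling the cache
    on a miss by downloading the whole thing into a tempfile first.

    fetch_whole_file(fileobj) is assumed to write the complete file
    into fileobj -- the 'quick hackish' full download.
    """
    cached = os.path.join(cache_dir, storage_index)
    if os.path.exists(cached):
        return cached                       # cache hit: no tempfile needed
    # Download into a tempfile in the same directory, then rename it
    # into place, so readers never observe a partially written file.
    fd, tmpname = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        fetch_whole_file(f)
    os.replace(tmpname, cached)
    return cached
```

A static.File resource pointed at the returned path would then honor any Range header for free.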
The nice thing about this approach is that clients won't even notice when we get it fixed correctly and have proper random-access support. They'll just see a shorter turnaround time.
changeset:37e3d8e47c489ad8 adds the first part of this: a new filenode read() method, a replacement for download/download_to_data/etc. The current implementation special-cases (offset=0, size=None) and does a regular streaming download of the whole file. Anything else uses a new cache file.

This first pass will not give fast service for the initial Range read: a GET of just the first byte of the file will have to wait for the whole file to be downloaded before it is satisfied. (If a second read arrives after the cache fill has begun, that one will return quickly.) The next step is to fix this behavior. I'll work on this in the next few days.
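The special-casing described above might look roughly like this (a sketch with invented names, not the actual changeset):

```python
class CachingFileNode:
    """Sketch of the read() dispatch: whole-file reads stream
    directly, anything else goes through the on-disk cache file."""

    def __init__(self, stream_whole_file, fill_cache_file):
        self._stream = stream_whole_file   # fast streaming path
        self._fill = fill_cache_file       # fills the cache, returns its path

    def read(self, consumer, offset=0, size=None):
        if offset == 0 and size is None:
            # common case: ordinary full download, no cache involved
            return self._stream(consumer)
        # anything else: make sure the cache file exists, then slice it
        path = self._fill()
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(size)
        consumer.write(data)
        return consumer
```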
Another thing which remains to be done is to expire the cache files after a while.
Also, note that this puts plaintext files on disk. They are written into the private/ subdirectory (which is chmod'ed go-rx), but I'm still not very comfortable with the risk.
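The go-rx protection mentioned above amounts to an owner-only directory mode; a sketch of creating such a cache directory (the path name here is hypothetical):

```python
import os

def make_private_cache_dir(basedir):
    """Create basedir/private/download-cache with owner-only access
    (mode 0700, i.e. group/other cannot list or read the cached
    plaintext files inside)."""
    private = os.path.join(basedir, "private")
    cachedir = os.path.join(private, "download-cache")
    os.makedirs(cachedir, exist_ok=True)
    os.chmod(private, 0o700)
    os.chmod(cachedir, 0o700)
    return cachedir
```

Of course this only protects against other local users; the plaintext is still sitting on disk for anyone with root or physical access.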
Eventually, when we rewrite the immutable download code, filenode.read() will be the preferred interface: the filenode will be allowed to cache some of its state (we'll probably put the hash trees in memory and the ciphertext on disk), and it will have a queue of segments that should be downloaded next, fed by the read() calls.

changeset:b1ca238176bfea38 adds the second part: the GET request will be processed as soon as its range is available, rather than waiting for the whole file to be retrieved.
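That "as soon as its range is available" behavior boils down to a gate on whether the cache file yet covers offset+size. A deterministic toy sketch (no Twisted deferreds; callbacks stand in for resuming the pending GET):

```python
class PartialCache:
    """Sketch of satisfying Range reads while the cache is still
    filling: a read fires as soon as its bytes have arrived."""

    def __init__(self):
        self._buf = b""
        self._waiting = []   # (offset, size, callback) for pending reads

    def read_when_available(self, offset, size, callback):
        # Fire immediately if the bytes are already cached,
        # otherwise queue the request until enough data arrives.
        if len(self._buf) >= offset + size:
            callback(self._buf[offset:offset + size])
        else:
            self._waiting.append((offset, size, callback))

    def write(self, chunk):
        # Called by the downloader as data arrives; wake up any
        # pending reads whose range is now fully present.
        self._buf += chunk
        still_waiting = []
        for offset, size, callback in self._waiting:
            if len(self._buf) >= offset + size:
                callback(self._buf[offset:offset + size])
            else:
                still_waiting.append((offset, size, callback))
        self._waiting = still_waiting
```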
The next part is to expire the cache files after a while. It isn't safe to do this with, say, cron+find+rm, at least not on a running node, since the node might get confused if a file it is writing to suddenly disappears. Also, I'd prefer the tahoe node to be more self-sufficient: the helper currently uses an external cron job to delete old files, and I always worry whether the cronjob is really still running or not.
changeset:ba019bfd3ad6fa74 adds the last part: expire the cache files after a while. The timers are set to check once per hour, and to delete the file if it is unused and more than an hour old. The file is touched each time a new Filenode instance is created, and assuming that we don't wind up with reference cycles that keep instances around longer than they should be, this means one hour since the last download attempt was started.
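The expiry policy described above (checked hourly, delete anything unused and more than an hour old, mtime refreshed per download attempt) can be sketched as a plain function; in the real node a Twisted timer would invoke it:

```python
import os
import time

CACHE_LIFETIME = 3600  # seconds: the one-hour policy described above

def expire_cache_files(cache_dir, now=None, in_use=frozenset()):
    """Delete cache files not touched in the last hour.  Files whose
    names appear in in_use (i.e. a live filenode still references
    them) are skipped; everything else is judged by its mtime, which
    each new download attempt is assumed to refresh."""
    now = time.time() if now is None else now
    for name in os.listdir(cache_dir):
        if name in in_use:
            continue
        path = os.path.join(cache_dir, name)
        if now - os.path.getmtime(path) > CACHE_LIFETIME:
            os.remove(path)
```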
I'm going to declare this one as done, although if I get some more time next week I might try to replace the plaintext cache with a ciphertext one; I'd be more comfortable with that.
Replying to warner:
Did you get around to doing this, or is there still a plaintext cache?
What happens if a node crashes -- do the old cache files stay around?