download: support Range header, in a quick hackish tempfile way #527

Closed
opened 2008-10-08 22:58:32 +00:00 by warner · 4 comments

Currently, tahoe supports "streaming download", which really means we can serve the first byte of the response quickly (without needing to process the whole file; this is enabled by the use of Merkle hash trees). But the real form of streaming download that we want is support for the HTTP "Range" header, which a client uses to tell the server that it only wants a specific chunk of the target file, rather than the whole thing.

It turns out that the QuickTime media player on the iPhone depends upon this feature: it is unable to play music or movies served by a webserver that does not honor the Range header.

The long-term way to address this is by rewriting the download code to allow random-access (this will be done as a side-effect of the effort to make download tolerate stalled/slow servers). But we thought of a short-term approach this afternoon, which is worth doing sooner.

The Twisted web server code, in `static.File`, knows how to handle the Range header when it is serving a real file from local disk. So the idea is that:

  • if the Tahoe web code (specifically `web.filenode.FileDownloader`) sees a Range header in the request, it bypasses the usual producer/consumer `DownloadTarget` code
  • it looks in the cache directory to see if the file is already present there. The readcap is used as the cache file name.
  • if not, the file is downloaded into the cache. It must be downloaded into a tempfile first, and the code must avoid doing multiple downloads of the same file
  • once the file is present in the cache, the code returns a `static.File` resource for the target, enabling the Range header to be processed
  • if there is no Range header, we use the normal `DownloadTarget` and don't use a tempfile
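The cache-fill step above can be sketched as follows. This is a minimal illustration, not Tahoe's actual code; `cache_path`, `ensure_cached`, and the `fetch` callable are hypothetical names, and real code would also need to coordinate concurrent downloads of the same readcap:

```python
import os
import tempfile

def cache_path(cache_dir, readcap):
    # The readcap doubles as the cache file name.
    return os.path.join(cache_dir, readcap)

def ensure_cached(cache_dir, readcap, fetch):
    """Download the file into the cache at most once.

    'fetch' is a hypothetical callable that writes the plaintext
    into the file object it is given.
    """
    target = cache_path(cache_dir, readcap)
    if os.path.exists(target):
        return target
    # Download into a tempfile first, then rename into place, so a
    # concurrent reader never sees a partially written cache file.
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            fetch(f)
        os.rename(tmp, target)
    except BaseException:
        os.unlink(tmp)
        raise
    return target
```

Writing to a tempfile in the same directory keeps the final rename atomic on POSIX filesystems, which is what makes the "avoid partially visible downloads" property work.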

Some manual process (e.g. a cronjob) would be responsible for deleting old files from the cache every once in a while.

The nice thing about this approach is that clients won't even notice when we get it fixed correctly and have proper random-access support. They'll just see a shorter turnaround time.

warner added the code-frontend-web, major, defect, 1.2.0 labels 2008-10-08 22:58:32 +00:00
Author

changeset:37e3d8e47c489ad8 adds the first part of this:

  • it creates `IFilenode.read(consumer, offset=0, size=None)`, which can be the new
    replacement for `download`/`download_to_data`/etc. The current implementation special-cases
    (offset=0, size=None) and does a regular streaming download of the whole file. Anything
    else uses a new cache file:
    • if the range is already present in the cache, the read is satisfied from the cachefile
    • start a full download (to the cachefile) if one is not already running
    • wait for the full download to finish, then satisfy the read
  • add Range: support in the regular filenode GET webapi command
  • add tests
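The range-read half of the `read()` contract above boils down to slicing the cache file once it holds the data. A minimal sketch (the `ListConsumer` class is a hypothetical stand-in for a Twisted consumer, not Tahoe's actual API):

```python
class ListConsumer:
    # Minimal consumer: just collects the bytes written to it.
    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)

def read_from_cache(cache_data, consumer, offset=0, size=None):
    """Sketch of satisfying an IFilenode.read(consumer, offset, size)
    call from a fully populated cache file: the requested range is
    just a slice of the cached plaintext."""
    end = None if size is None else offset + size
    consumer.write(cache_data[offset:end])
    return consumer
```

In the first-pass implementation described above, this slice only happens after the full download has finished filling the cache; the second changeset removes that wait.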

This first pass will not give fast service for the initial Range: read: a GET of just the first byte of the file will have to wait for the whole file to be downloaded before it is satisfied. (If a second read arrives after the cache fill has begun, that one will return quickly.) The next step is to fix this behavior; I'll work on it in the next few days.

Another thing which remains to be done is to expire the cache files after a while.

Also, note that this puts plaintext files on disk. They are written into the private/ subdirectory (which is chmod'ed go-rx), but I'm still not very comfortable with the risk.

Eventually, when we rewrite the immutable download code, filenode.read() will be the preferred interface: the filenode will be allowed to cache some of its state (we'll probably put the hash trees in memory and the ciphertext on disk), and it will have a queue of segments that should be downloaded next, fed by the read() calls.

Author

changeset:b1ca238176bfea38 adds the second part: the GET request will be processed as soon as its range is available, rather than waiting for the whole file to be retrieved.

The next part is to expire the cache files after a while. It isn't safe to do this with, say, cron+find+rm, at least not on a running node, since the node might get confused if a file being written to suddenly disappears. Also, I'd prefer the tahoe node to be more self-sufficient: the helper currently uses an external cron job to delete old files, and I always worry whether that cronjob is really still running.
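The "serve the GET as soon as its range is available" behavior amounts to a readiness test against how much of the file a contiguous-from-zero download has fetched so far. A sketch under that assumption (not the actual changeset code):

```python
def range_ready(bytes_downloaded, offset, size, total_size):
    """True once a download that fills the cache file from byte 0
    onward has fetched enough bytes to satisfy a read of 'size'
    bytes starting at 'offset'. size=None means 'to end of file'."""
    end = total_size if size is None else min(offset + size, total_size)
    return bytes_downloaded >= end
```

A request whose range is not yet ready simply waits and re-checks as more of the file arrives, instead of waiting for the entire download to complete.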

Author

changeset:ba019bfd3ad6fa74 adds the last part: expire the cache files after a while. The timers are set to check once per hour, and to delete the file if it is unused and more than an hour old. The file is touched each time a new Filenode instance is created, and assuming that we don't wind up with reference cycles that keep instances around longer than they should be, this means one hour since the last download attempt was started.
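The expiry pass described above can be sketched like this. The function name and constants are illustrative, not the changeset's actual code, and a real implementation must also skip files that are currently in use:

```python
import os
import time

EXPIRE_AGE = 3600  # delete cache files unused for more than an hour

def expire_cache(cache_dir, now=None):
    """Sketch of the hourly expiry pass: delete any cache file whose
    mtime (touched each time a Filenode instance is created) is more
    than an hour in the past. Returns the names removed."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if now - os.path.getmtime(path) > EXPIRE_AGE:
            os.unlink(path)
            removed.append(name)
    return removed
```

Running this from a timer inside the node (rather than an external cronjob) keeps the node self-sufficient and lets it avoid deleting files it is still writing.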

I'm going to declare this one as done, although if I get some more time next week I might try to replace the plaintext cache with a ciphertext one; I'd be more comfortable with that.

warner added fixed, enhancement and removed defect labels 2008-10-30 20:47:28 +00:00
warner added this to the 1.3.0 milestone 2008-10-30 20:47:28 +00:00
davidsarah commented 2009-12-20 17:24:16 +00:00
Owner

Replying to warner:

I'm going to declare this one as done, although if I get some more time next week I might try to replace the plaintext cache with a ciphertext one.. I'd be more comfortable with that.

Did you get around to doing this, or is there still a plaintext cache?

What happens if a node crashes -- do the old cache files stay around?

Reference: tahoe-lafs/trac-2024-07-25#527