add a mutable-file cache #465

Open
opened 2008-06-14 17:18:11 +00:00 by warner · 2 comments

If the mutable-file retrieval process were allowed to keep an on-disk cache (indexed by SI+roothash, populated by either publish or retrieve), then a lot of directory-traversal operations would run a lot faster.

The cache could store plaintext, or ciphertext (or, if/when we implement deep-traversal caps, it could contain the intermediate form). For a node that runs on behalf of a single user, the plaintext cache would be fastest. For a webapi node that works for multiple users, I wouldn't feel comfortable holding on to the plaintext for any longer than necessary (i.e. we want to maintain forward secrecy), so if a cache still seemed useful then I'd want it to be just a ciphertext cache.

The cache should only store one version per SI, so once we publish or discover a different roothash, that should replace the old cache entry. The cache should have a bounded size, with a random-discard policy. Of course we need some efficient way to manage that size (doing 'du -s' on the cache directory would be slow for a large cache, either we should keep the cache from getting that large or do something more clever).

It would probably be enough to declare that the cache is implemented as a single directory, with one file per SI, and each file contains the base32-encoded roothash + newline + plaintext/ciphertext . The size bound is imposed by limiting the number of files in this directory, and it is counted at startup.

Note that because the cache is indexed by (SI,roothash), it is an accurate cache: a servermap-update is always performed (incurring one round-trip), and only the retrieve phase is bypassed upon a cache hit. This only helps improve retrieves of directories that are large enough to not fit in the initial read that the servermap-update performs (since there is already a small share-cache that holds these reads), which probably means 6 children or more per directory. These not-so-small directories could be fetched in a single round-trip instead of two RTT.

If we allowed the cache to be indexed by just SI (or if we were to introduce a separate cache that mapped from SI to somewhat-current-roothash), it would be an inaccurate cache, implementing a tradeoff between up-to-dateness and performance. In this mode, the node would be allowed to cache the state of a directory for some amount of time. We'd get zero-RTT retrieves for some directories, but we'd also run the risk of not noticing updates that someone else had made.

If the mutable-file retrieval process were allowed to keep an on-disk cache (indexed by SI+roothash, populated by either publish or retrieve), then a lot of directory-traversal operations would run a lot faster. The cache could store plaintext, or ciphertext (or, if/when we implement deep-traversal caps, it could contain the intermediate form). For a node that runs on behalf of a single user, the plaintext cache would be fastest. For a webapi node that works for multiple users, I wouldn't feel comfortable holding on to the plaintext for any longer than necessary (i.e. we want to maintain forward secrecy), so if a cache still seemed useful then I'd want it to be just a ciphertext cache. The cache should only store one version per SI, so once we publish or discover a different roothash, that should replace the old cache entry. The cache should have a bounded size, with a random-discard policy. Of course we need some efficient way to manage that size (doing 'du -s' on the cache directory would be slow for a large cache, either we should keep the cache from getting that large or do something more clever). It would probably be enough to declare that the cache is implemented as a single directory, with one file per SI, and each file contains the base32-encoded roothash + newline + plaintext/ciphertext . The size bound is imposed by limiting the number of files in this directory, and it is counted at startup. Note that because the cache is indexed by (SI,roothash), it is an accurate cache: a servermap-update is always performed (incurring one round-trip), and only the retrieve phase is bypassed upon a cache hit. This only helps improve retrieves of directories that are large enough to not fit in the initial read that the servermap-update performs (since there is already a small share-cache that holds these reads), which probably means 6 children or more per directory. These not-so-small directories could be fetched in a single round-trip instead of two RTT. If we allowed the cache to be indexed by just SI (or if we were to introduce a separate cache that mapped from SI to somewhat-current-roothash), it would be an inaccurate cache, implementing a tradeoff between up-to-dateness and performance. In this mode, the node would be allowed to cache the state of a directory for some amount of time. We'd get zero-RTT retrieves for some directories, but we'd also run the risk of not noticing updates that someone else had made.
warner added the
code-mutable
major
enhancement
1.1.0
labels 2008-06-14 17:18:11 +00:00
warner added this to the undecided milestone 2008-06-14 17:18:11 +00:00

If you like this ticket, you might also like #606 (backupdb: add directory cache), #316 (add caching to tahoe proper?), and #300 (macfuse: need some sort of caching).

If you like this ticket, you might also like #606 (backupdb: add directory cache), #316 (add caching to tahoe proper?), and #300 (macfuse: need some sort of caching).
tahoe-lafs modified the milestone from undecided to eventually 2010-02-11 03:44:13 +00:00
davidsarah commented 2010-10-24 16:56:07 +00:00
Owner

See also #1045 (Memory leak during massive file upload (or download) through SFTP frontend).

It appears that that problem is due to the current design of the ResponseCache in source:allmydata/mutable/common.py, and might be solved by replacing that cache (which stores share responses) with a mutable file cache as described in this ticket.

See also #1045 (Memory leak during massive file upload (or download) through SFTP frontend). It appears that that problem is due to the current design of the `ResponseCache` in source:allmydata/mutable/common.py, and might be solved by replacing that cache (which stores share responses) with a mutable file cache as described in this ticket.
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#465
No description provided.