lease expiration / deletion / garbage-collection #119

warner · 2007-08-20T20:23:17Z

warner commented

2007-08-20 20:23:17 +00:00

I think the last Big Thing we need to develop (as opposed to implement or
fix) is a structure to both maintain the long-term health of files and also
ensure their eventual deletion. I think these need to be developed together,
since they are closely related.

Leases need to expire after a while (we're thinking of one month as a good
timeout). Files that are supposed to stick around longer than this either
need to be kept alive by the original uploader or by someone to whom they've
delegated this task. If the original uploader expects to be around at least
once a month, they can do it themselves, but for a backup application we
can't impose this requirement. We refer to this task as "refreshing", and the
provider of this service is either doing it out of the kindness of their
heart (in the friend-net use case) or as part of a paid service (in the
commercial-offering use case).
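
A minimal sketch of the lease timer and the "refreshing" step described above. The `Lease` class, its field names, and the 30-day approximation of "one month" are assumptions made for illustration, not an existing API:

```python
from dataclasses import dataclass

# Assumption: "one month" approximated as 30 days, in seconds.
LEASE_DURATION = 30 * 24 * 60 * 60


@dataclass
class Lease:
    """Hypothetical lease record a storage server keeps for one share."""
    holder: str        # the uploader, or whoever they delegated refreshing to
    expires_at: float  # absolute expiration time, seconds since the epoch

    def is_expired(self, now: float) -> bool:
        return now >= self.expires_at

    def refresh(self, now: float) -> None:
        # Refreshing restarts the timer; it never shortens an existing lease.
        self.expires_at = max(self.expires_at, now + LEASE_DURATION)
```

A refresher that visits every lease at least once per `LEASE_DURATION` keeps the file alive; a lease nobody refreshes simply runs out.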

The refreshing process will also perform "file checking", which is simply
counting the number of shares that are available for any given file. This
gives a rough measure of the "health" of the file. The process may also
perform "file verification" from time to time, which is downloading the
crypttext and checking its hash against the value in the URI extension block.

If either checking/verification process discovers a problem, the "file
repairer" may be triggered, which uses the remaining shares to reconstruct
the correct crypttext, then re-encodes and re-uploads any shares which have
been lost.
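
The checker/verifier/repairer split could be sketched roughly as follows. The function names, the `needed` and `repair_threshold` parameters, and the use of plain SHA-256 as the crypttext hash are all assumptions for illustration; the ticket does not specify them:

```python
import hashlib


def check_file(shares_found: int, needed: int, repair_threshold: int) -> str:
    """Cheap 'file checking': classify a file from its share count alone."""
    if shares_found < needed:
        return "unrecoverable"  # too few shares left to rebuild the crypttext
    if shares_found < repair_threshold:
        return "repair"         # still recoverable, but lost shares should be regenerated
    return "healthy"


def verify_crypttext(crypttext: bytes, expected_hash: bytes) -> bool:
    """Expensive 'file verification': hash the downloaded crypttext and compare.

    Plain SHA-256 is a stand-in for whatever hash the URI extension block stores.
    """
    return hashlib.sha256(crypttext).digest() == expected_hash
```

Only a `"repair"` result (or a failed verification) would need to trigger the costly re-encode and re-upload path.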

This series of processes all serve to improve the health of the file, at
various bandwidth/CPU costs: refreshing/checking is cheap, repair/re-upload
is expensive. The intent is to use the refreshing service to keep the file as
healthy as possible at low cost, and use the checker results to trigger more
costly repair operations as little as possible. Refreshing must take place at
least once a month to keep the leases alive. The required filecheck frequency
will depend upon how quickly storage servers drop out of the grid: we expect
that files will undergo an exponential decay curve, so we must do checks
frequently enough to reduce the chance that the health will decay beyond
repair. The exact parameters will be tunable, of course, to pick a tradeoff
between bandwidth consumed and the chance that a file will decay too quickly
to be saved.
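
To make the tradeoff concrete, here is a rough model, assuming each share sits on a different server, servers fail independently, and server lifetimes are exponentially distributed. The function names and the default `total`/`needed` values are assumptions chosen for illustration:

```python
import math


def server_survival(t_days: float, mean_lifetime_days: float) -> float:
    """Probability that one server is still in the grid after t_days."""
    return math.exp(-t_days / mean_lifetime_days)


def prob_unrecoverable(total: int, needed: int, p_alive: float) -> float:
    """Probability that fewer than `needed` of `total` shares survive."""
    return sum(
        math.comb(total, i) * p_alive**i * (1 - p_alive) ** (total - i)
        for i in range(needed)
    )


def risk_between_checks(interval_days: float, mean_lifetime_days: float,
                        total: int = 10, needed: int = 3) -> float:
    """Chance a fully healthy file decays beyond repair before the next check."""
    p = server_survival(interval_days, mean_lifetime_days)
    return prob_unrecoverable(total, needed, p)
```

Under this model the risk grows with the check interval, so the interval can be chosen as the longest one whose `risk_between_checks` stays under an acceptable bound.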

Files that are deleted from a vdrive need to have their shares dereferenced
in a timely fashion (I'm thinking by the end of the day for this). If the
reference count drops to zero, the share should be deleted immediately (for a
storage server on a home user's machine who wants their disk for other
purposes), or marked for deletion as soon as the storage is needed for
something else (for a dedicated commercial server with nothing better to do
with that disk space; there's a chance that someone will re-upload the file
that was just deleted, and if the share is still around then we can avoid
repeating the upload). Deleted files should also be removed from the
filechecker and repair mechanisms.
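
The two deletion policies might look like this in outline. `ShareStore` and its methods are hypothetical names used only for this sketch:

```python
class ShareStore:
    """Hypothetical per-server bookkeeping for share reference counts."""

    def __init__(self, delete_immediately: bool):
        # True: home-user server that wants its disk back right away.
        # False: dedicated server that keeps dead shares until space is needed.
        self.delete_immediately = delete_immediately
        self.refcounts = {}       # share id -> number of live references
        self.reclaimable = set()  # zero-reference shares still on disk

    def add_reference(self, share_id):
        # A re-upload of a recently deleted file revives a lingering share.
        self.reclaimable.discard(share_id)
        self.refcounts[share_id] = self.refcounts.get(share_id, 0) + 1

    def drop_reference(self, share_id):
        count = self.refcounts[share_id] - 1
        if count > 0:
            self.refcounts[share_id] = count
            return
        del self.refcounts[share_id]
        if not self.delete_immediately:
            self.reclaimable.add(share_id)  # deleted lazily, when space is needed

    def has_share(self, share_id) -> bool:
        return share_id in self.refcounts or share_id in self.reclaimable
```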

Note that files should be deleted promptly, rather than allowing their leases
to expire on their own, to reduce the storage overhead (storage consumed
beyond that required to store the desired files). The lease expiration mechanism is a
necessary fallback to keep storage usage from growing without bound, but
without prompt deletion, high churn rates could cause actual storage consumed
to grow larger than desired.

Finally, many of our use cases will want to enforce a utilization quota on
each user, limiting the amount of storage space they are allowed to consume.
The file-repair service may be a good place to enforce this (with a rule
saying that you can upload as much as you want, but the repair service won't
help you exceed your quota). Eventually we may want each client to have
membership credentials which would allow storage servers to measure how much
space each client is consuming: with this, a daily (or slower) process could
calculate how much global space is consumed by each client, and flag or
revoke membership for clients which use more space than they've contracted
for.
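
The daily accounting pass could be sketched as below, assuming each storage server can report bytes used per client credential. The report format and the function name are assumptions for illustration:

```python
from collections import defaultdict


def clients_over_quota(server_reports, quotas):
    """Sum per-client usage across all servers and list clients over quota.

    server_reports: iterable of {client_id: bytes_used} dicts, one per server.
    quotas: {client_id: contracted_bytes}; unknown clients get a quota of 0.
    """
    totals = defaultdict(int)
    for report in server_reports:
        for client, used in report.items():
            totals[client] += used
    return sorted(c for c, used in totals.items() if used > quotas.get(c, 0))
```

The returned list is what a slower background process would then flag or revoke.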

warner added the

labels 2007-08-20 20:23:17 +00:00

zooko added

0.6.0

and removed

0.5.0

labels 2007-09-25 04:19:56 +00:00

zooko added this to the 0.7.0 milestone 2007-09-25 04:19:56 +00:00

zooko commented

2007-11-01 20:14:26 +00:00

We're focussing on an imminent v0.7.0 (see [the roadmap](http://allmydata.org/trac/tahoe/roadmap)) which hopefully has [#197 #197 -- Small Distributed Mutable Files] and also a fix for [#199 #199 -- bad SHA-256]. So I'm bumping less urgent tickets to v0.7.1.

zooko added

0.6.1

and removed

0.6.0

labels 2007-11-01 20:14:26 +00:00

zooko commented

2007-11-13 18:23:23 +00:00

This is an important, required feature, but it is a big feature to implement, and I don't think we are going to get it done in the next six weeks, so I'm putting it in Milestone 1.0.

zooko added

0.7.0

and removed

0.6.1

labels 2007-11-13 18:23:23 +00:00

warner commented

2008-01-09 01:09:21 +00:00

we've decided to push this out past 0.9.0

warner commented

2008-05-09 00:09:49 +00:00

this isn't a 1.1.0 thing

warner modified the milestone from 1.1.0 to undecided

2008-05-09 00:09:49 +00:00

warner commented

2008-06-03 05:26:22 +00:00

Here are some random notes that used to be in roadmap.txt:


 multiple categories of leases:
  1: committed leases -- we will not delete these in any case, but will instead
     tell an uploader that we are full
   1a: active leases
   1b: in-progress leases (partially filled, not closed, pb connection is
       currently open)
  2: uncommitted leases -- we will delete these in order to make room for new
     lease requests
   2a: interrupted leases (partially filled, not closed, pb connection is
       currently not open, but they might come back)
   2b: expired leases

  (I'm not sure about the precedence of these last two. Probably deleting
  expired leases instead of deleting interrupted leases would be okay.)
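
One way to read these categories as an eviction policy, assuming expired leases are deleted before interrupted ones (the note above is only tentative on that order). The enum and function names are hypothetical:

```python
from enum import Enum


class LeaseState(Enum):
    ACTIVE = "1a"       # committed: never deleted
    IN_PROGRESS = "1b"  # committed: never deleted
    INTERRUPTED = "2a"  # uncommitted: may be deleted to make room
    EXPIRED = "2b"      # uncommitted: may be deleted to make room


# Assumption: expired leases are sacrificed first, then interrupted ones.
EVICTION_ORDER = [LeaseState.EXPIRED, LeaseState.INTERRUPTED]


def choose_evictions(leases, bytes_needed):
    """leases: list of (lease_id, state, size_in_bytes) tuples.

    Returns the lease ids to delete, or None when uncommitted leases cannot
    free enough space and the uploader must be told the server is full.
    """
    victims, freed = [], 0
    for state in EVICTION_ORDER:
        for lease_id, lease_state, size in leases:
            if freed >= bytes_needed:
                return victims
            if lease_state is state:
                victims.append(lease_id)
                freed += size
    return victims if freed >= bytes_needed else None
```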

warner commented

2008-09-03 01:36:58 +00:00

We've basically split lease/gc into a separate task from checker/repairer, so I'm removing the checker/repairer aspects of this ticket. This ticket will focus on lease/gc work.

warner changed title from ~~lease expiration / deletion / filechecking / quotas~~ to lease expiration / deletion / garbage-collection / quotas

2008-09-03 01:36:58 +00:00

zooko commented

2008-09-24 13:19:26 +00:00

I'm not sure, but I think we've tentatively agreed to focus on garbage collection separately from the notion of accounting or quotas, so I'm changing the name of this ticket.

zooko changed title from ~~lease expiration / deletion / garbage-collection / quotas~~ to lease expiration / deletion / garbage-collection

2008-09-24 13:19:26 +00:00

zooko commented

2008-09-24 13:51:20 +00:00

I mentioned this ticket as one of the most important-to-me improvements that we could make in the Tahoe code: http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html

warner commented

2009-03-24 00:52:05 +00:00

I recently pushed a number of changes that roughly implement this. What we have right now (and will be in 1.3.1 or whatever-comes-after-1.3.0) is:

* uploading a new immutable share, or creating a new mutable slot, results in a fixed-duration anonymous 31-day lease
* the "tahoe check/deep-check --add-lease" CLI command (and some webapi equivalents) will add new fixed-duration anonymous 31-day leases to shares of existing files and directories
* the storage server can optionally be configured to expire leases and delete shares when the last lease expires, in one of three modes:
  * honor the original 31-day timer
  * use an alternative timeout (perhaps 60 days)
  * expire leases that were created/renewed before an absolute cutoff date
* storage server has a webapi page to display expiration status, space recovered, etc
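
As a rough illustration of the three expiration modes, here is a sketch that decides whether a single lease counts as expired. The mode names, parameter names, and the choice to track a renewal timestamp are assumptions for this example, not the storage server's actual configuration keys:

```python
DEFAULT_LEASE_SECONDS = 31 * 24 * 60 * 60  # the fixed 31-day lease


def lease_is_expired(renewed_at, now, mode="original",
                     override_seconds=None, cutoff=None):
    """renewed_at, now, cutoff: seconds since the epoch."""
    if mode == "original":   # honor the original 31-day timer
        return now >= renewed_at + DEFAULT_LEASE_SECONDS
    if mode == "override":   # use an alternative timeout (e.g. 60 days)
        if override_seconds is None:
            raise ValueError("override mode needs override_seconds")
        return now >= renewed_at + override_seconds
    if mode == "cutoff":     # created/renewed before an absolute date
        if cutoff is None:
            raise ValueError("cutoff mode needs cutoff")
        return renewed_at < cutoff
    raise ValueError(f"unknown mode: {mode!r}")


def share_is_garbage(lease_renewal_times, now, **policy):
    # A share is deleted only once its last lease has expired.
    # Note: a share with no leases at all counts as garbage here.
    return all(lease_is_expired(t, now, **policy) for t in lease_renewal_times)
```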

There are lots of details about how GC currently works in source:docs/garbage-collection.txt . There are ways it can be improved (in particular by associating leases with account identifiers, to reduce the scope of the lease, to make it easier for leaseholders to safely cancel leases; also to reduce renewal traffic by switching to an expire-the-account mode instead of the current expire-the-file mode). But for moderate-sized grids, the mark-and-sweep lease/GC approach ought to be sufficient.

warner added the

fixed

label 2009-03-24 00:52:05 +00:00

warner closed this issue

2009-03-24 00:52:05 +00:00
