.. -*- coding: utf-8-with-signature -*-

=====================
Lease database design
=====================

The target audience for this document is developers who wish to understand
the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.

Introduction
------------

A "lease" is a request by an account that a share not be deleted before a
specified time. Each storage server stores leases in order to know which
shares to spare from garbage collection.

Motivation
----------

The leasedb will replace the current design in which leases are stored in
the storage server's share container files. That design has several
disadvantages:

- Updating a lease requires modifying a share container file (even for
  immutable shares). This complicates the implementation of share classes.
  The mixing of share contents and lease data in share files also led to a
  security bug (ticket `#1528`_).

- When only the disk backend is supported, it is possible to read and
  update leases synchronously because the share files are stored locally
  to the storage server. For the cloud backend, accessing share files
  requires an HTTP request, and so must be asynchronous. Accepting this
  asynchrony for lease queries would be both inefficient and complex.
  Moving lease information out of shares and into a local database allows
  lease queries to stay synchronous.

Also, the current cryptographic protocol for renewing and cancelling leases
(based on shared secrets derived from secure hash functions) is complex,
and the cancellation part was never used.

The leasedb solves the first two problems by storing the lease information in
a local database instead of in the share container files. The share data
itself is still held in the share container file.

At the same time as implementing leasedb, we devised a simpler protocol for
allocating and cancelling leases: a client can use a public key digital
signature to authenticate access to a foolscap object representing the
authority of an account. This protocol is not yet implemented; at the time
of writing, only an "anonymous" account is supported.

The leasedb also provides an efficient way to get summarized information,
such as total space usage of shares leased by an account, for accounting
purposes.
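
As an illustration of the kind of summary query a local database makes cheap,
the following minimal sketch sums the space used by the shares an account
leases. The table layout, the column names (``used_space``, ``account_id``)
and the function name are assumptions made for this example, not the actual
leasedb schema.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative only.
SCHEMA = """
CREATE TABLE shares (
    storage_index TEXT NOT NULL,
    shnum INTEGER NOT NULL,
    used_space INTEGER NOT NULL,
    PRIMARY KEY (storage_index, shnum)
);
CREATE TABLE leases (
    storage_index TEXT NOT NULL,
    shnum INTEGER NOT NULL,
    account_id INTEGER NOT NULL,
    PRIMARY KEY (storage_index, shnum, account_id)
);
"""


def total_leased_space(conn, account_id):
    """Return the total bytes used by shares that the given account leases."""
    row = conn.execute(
        "SELECT COALESCE(SUM(s.used_space), 0)"
        " FROM leases l JOIN shares s"
        " ON s.storage_index = l.storage_index AND s.shnum = l.shnum"
        " WHERE l.account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO shares VALUES (?, ?, ?)",
    [("si1", 0, 1000), ("si1", 1, 2500), ("si2", 0, 400)],
)
conn.executemany(
    "INSERT INTO leases VALUES (?, ?, ?)",
    [("si1", 0, 0), ("si1", 1, 0), ("si2", 0, 1)],
)
print(total_leased_space(conn, 0))  # 3500
```

Because the query touches only the local database, it needs no access to the
share container files, which is what lets it stay synchronous.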

.. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528

Design constraints
------------------

A share is stored as a collection of objects. The persistent storage may be
remote from the server (for example, cloud storage).

Writing to the persistent store objects is in general not an atomic
operation. So the leasedb also keeps track of which shares are in an
inconsistent state because they have been partly written. (This may
change in future when we implement a protocol to improve atomicity of
updates to mutable shares.)

Leases are no longer stored in shares. The same share format is used as
before, but the lease slots are ignored, and are cleared when rewriting a
mutable share. The new design also does not use lease renewal or cancel
secrets. (They are accepted as parameters in the storage protocol interfaces
for backward compatibility, but are ignored. Cancel secrets were already
ignored due to the fix for `#1528`_.)

The new design needs to be fail-safe in the sense that if the lease database
is lost or corruption is detected, no share data will be lost (even though
the metadata about leases held by particular accounts has been lost).

Accounting crawler
------------------

A "crawler" is a long-running process that visits share container files at a
slow rate, so as not to overload the server by trying to visit all share
container files one after another immediately.

The accounting crawler replaces the previous "lease crawler". It examines
each share container file and compares it with the state of the leasedb, and
may update the state of the share and/or the leasedb.

The accounting crawler may perform the following functions (but see ticket
#1834 for a proposal to reduce the scope of its responsibility):

- Remove leases that are past their expiration time. (Currently, this is
  done automatically before deleting shares, but we plan to allow expiration
  to be performed separately for individual accounts in future.)

- Delete the objects containing unleased shares, that is, shares that have
  stable entries in the leasedb but no current leases (see below for the
  definition of "stable" entries).

- Discover shares that have been manually added to storage, via ``scp`` or
  some other out-of-band means.

- Discover shares that are present when a storage server is upgraded to
  a leasedb-supporting version from a previous version, and give them
  "starter leases".

- Recover from a situation where the leasedb is lost or detectably
  corrupted. This is handled in the same way as upgrading from a previous
  version.

- Detect shares that have unexpectedly disappeared from storage. The
  disappearance of a share is logged, and its entry and leases are removed
  from the leasedb.
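
The discovery and disappearance checks listed above amount to comparing two
views of the same shares. The following is a rough sketch of that comparison
only; it does not cover lease expiration or deletion of unleased shares. The
function name, the action names, and the dict-based inputs are hypothetical
and stand in for whatever the real crawler and leasedb interfaces provide.

```python
def plan_crawler_actions(found_in_storage, leasedb_states):
    """List follow-up actions from comparing storage with the leasedb.

    found_in_storage: dict mapping (storage_index, shnum) to True if the
        share looks complete, or False if it looks partly written.
    leasedb_states: dict mapping (storage_index, shnum) to a state name.
    Both structures are assumptions made for this sketch.
    """
    actions = []
    # Shares present in storage but unknown to the leasedb were added
    # out-of-band, predate the leasedb, or the leasedb was lost.
    for key in sorted(found_in_storage):
        if key in leasedb_states:
            continue
        if found_in_storage[key]:
            actions.append(("add_stable_entry_with_starter_lease", key))
        else:
            actions.append(("add_coming_entry", key))
    # Stable entries whose store objects are gone have disappeared.
    for key in sorted(leasedb_states):
        if key not in found_in_storage and leasedb_states[key] == "STATE_STABLE":
            actions.append(("log_disappearance_and_remove_entry", key))
    return actions


found = {("si1", 0): True, ("si2", 0): False, ("si3", 0): True}
known = {("si3", 0): "STATE_STABLE", ("si4", 0): "STATE_STABLE"}
for action in plan_crawler_actions(found, known):
    print(action)
```

Returning a list of actions rather than applying them directly keeps the
comparison easy to test; a real crawler would also spread this work out over
time so as not to overload the server.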

Accounts
--------

An account holds leases for some subset of shares stored by a server. The
leasedb schema can handle many distinct accounts, but for the time being we
create only two accounts: an anonymous account and a starter account. The
starter account is used for leases on shares discovered by the accounting
crawler; the anonymous account is used for all other leases.

The leasedb has at most one lease entry per account per (storage_index,
shnum) pair. This entry stores the times when the lease was last renewed and
when it is set to expire (if the expiration policy does not force it to
expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.

For more on expiration policy, see :doc:`../garbage-collection`.
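
One way to enforce "at most one lease entry per account per (storage_index,
shnum) pair" is a composite primary key, so that renewing a lease replaces
the existing row instead of adding a second one. The sketch below assumes a
hypothetical ``leases`` table and helper names; it ignores any expiration
policy that might force an earlier expiry.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical table: column names are assumptions for this example.
conn.execute(
    """
    CREATE TABLE leases (
        storage_index TEXT NOT NULL,
        shnum INTEGER NOT NULL,
        account_id INTEGER NOT NULL,
        renewal_time INTEGER NOT NULL,     -- Unix seconds since epoch (UTC)
        expiration_time INTEGER NOT NULL,  -- Unix seconds since epoch (UTC)
        PRIMARY KEY (storage_index, shnum, account_id)
    )
    """
)


def add_or_renew_lease(conn, storage_index, shnum, account_id, duration, now=None):
    """Create the lease, or overwrite its timestamps if it already exists."""
    now = int(time.time()) if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?, ?)",
        (storage_index, shnum, account_id, now, now + duration),
    )


def has_current_lease(conn, storage_index, shnum, now):
    """Return True if any account holds an unexpired lease on the share."""
    row = conn.execute(
        "SELECT 1 FROM leases"
        " WHERE storage_index = ? AND shnum = ? AND expiration_time > ?",
        (storage_index, shnum, now),
    ).fetchone()
    return row is not None


add_or_renew_lease(conn, "si1", 0, account_id=0, duration=3600, now=1000)
add_or_renew_lease(conn, "si1", 0, account_id=0, duration=3600, now=2000)
print(conn.execute("SELECT COUNT(*) FROM leases").fetchone()[0])  # 1
```

The second call renews the lease in place, so the table still holds a single
row for that account and share.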

Share states
------------

The leasedb holds an explicit indicator of the state of each share.

The diagram and descriptions below give the possible values of the "state"
indicator, what that value means, and transitions between states, for any
(storage_index, shnum) pair on each server::

  #        STATE_STABLE -------.
  #         ^   |   ^ |        |
  #         |   v   | |        v
  #    STATE_COMING | |     STATE_GOING
  #         ^       | |        |
  #         |       | v        |
  #         '----- NONE <------'

**NONE**: There is no entry in the ``shares`` table for this (storage_index,
shnum) in this server's leasedb. This is the initial state.

**STATE_COMING**: The share is being created or (if a mutable share)
updated. The store objects may have been at least partially written, but
the storage server doesn't have confirmation that they have all been
completely written.

**STATE_STABLE**: The store objects have been completely written and are
not in the process of being modified or deleted by the storage server. (They
could have been modified or deleted behind the back of the storage server,
but if they have, the server has not noticed that yet.) The share may or may
not be leased.

**STATE_GOING**: The share is being deleted.
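
The state descriptions above can be modelled as three stored values plus the
absence of a row for **NONE**. This is a minimal sketch under that reading:
the integer encoding, the ``state`` column, and the helper name are
assumptions, and only the ``shares`` table name comes from this document.

```python
import enum
import sqlite3


class ShareState(enum.IntEnum):
    # The integer values are arbitrary placeholders for this sketch.
    COMING = 0
    STABLE = 1
    GOING = 2


def get_share_state(conn, storage_index, shnum):
    """Return the ShareState of a share, or None when it has no entry (NONE)."""
    row = conn.execute(
        "SELECT state FROM shares WHERE storage_index = ? AND shnum = ?",
        (storage_index, shnum),
    ).fetchone()
    return None if row is None else ShareState(row[0])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shares ("
    " storage_index TEXT NOT NULL,"
    " shnum INTEGER NOT NULL,"
    " state INTEGER NOT NULL,"
    " PRIMARY KEY (storage_index, shnum))"
)
conn.execute("INSERT INTO shares VALUES (?, ?, ?)", ("si1", 0, int(ShareState.STABLE)))

print(get_share_state(conn, "si1", 0))  # the STABLE member
print(get_share_state(conn, "si1", 1))  # None, i.e. the NONE state
```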

State transitions
-----------------

**STATE_GOING** → **NONE**

    trigger: The storage server gains confidence that all store objects for
    the share have been removed.

    implementation:

    1. Remove the entry in the leasedb.

**STATE_STABLE** → **NONE**

    trigger: The accounting crawler noticed that all the store objects for
    this share are gone.

    implementation:

    1. Remove the entry in the leasedb.

**NONE** → **STATE_COMING**

    triggers: A new share is being created, as explicitly signalled by a
    client invoking a creation command, *or* the accounting crawler discovers
    an incomplete share.

    implementation:

    1. Add an entry to the leasedb with **STATE_COMING**.

    2. (In case of explicit creation) begin writing the store objects to hold
       the share.

**STATE_STABLE** → **STATE_COMING**

    trigger: A mutable share is being modified, as explicitly signalled by a
    client invoking a modification command.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_STABLE** to **STATE_COMING**.

    2. Begin updating the store objects.

**STATE_COMING** → **STATE_STABLE**

    trigger: All store objects have been written.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_COMING** to **STATE_STABLE**.

**NONE** → **STATE_STABLE**

    trigger: The accounting crawler discovers a complete share.

    implementation:

    1. Add an entry to the leasedb with **STATE_STABLE**.

**STATE_STABLE** → **STATE_GOING**

    trigger: The share should be deleted because it is unleased.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_STABLE** to **STATE_GOING**.

    2. Initiate removal of the store objects.
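
The seven transitions listed above can be written down as a table and checked
before any state change is applied. The sketch below keeps share states in a
plain dict instead of a database, and the function name is hypothetical; it
models only the state bookkeeping, not the writing or removal of store
objects.

```python
NONE = "NONE"
COMING = "STATE_COMING"
STABLE = "STATE_STABLE"
GOING = "STATE_GOING"

# (old state, new state) pairs taken from the transition list above.
ALLOWED_TRANSITIONS = {
    (GOING, NONE),
    (STABLE, NONE),
    (NONE, COMING),
    (STABLE, COMING),
    (COMING, STABLE),
    (NONE, STABLE),
    (STABLE, GOING),
}


def apply_transition(states, key, new_state):
    """Move the share identified by key to new_state, or raise ValueError.

    states maps (storage_index, shnum) to a state name; a missing key
    represents NONE, mirroring a missing leasedb entry.
    """
    old_state = states.get(key, NONE)
    if (old_state, new_state) not in ALLOWED_TRANSITIONS:
        raise ValueError("illegal transition %s -> %s" % (old_state, new_state))
    if new_state == NONE:
        del states[key]  # removing the entry is how NONE is represented
    else:
        states[key] = new_state


states = {}
share = ("si1", 0)
apply_transition(states, share, COMING)  # creation begins
apply_transition(states, share, STABLE)  # all store objects written
apply_transition(states, share, GOING)   # unleased, deletion begins
apply_transition(states, share, NONE)    # store objects confirmed removed
print(states)  # {}
```

Note that the table has no **STATE_COMING** → **STATE_GOING** pair, so
attempting that change raises ``ValueError`` in this sketch.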

The following constraints are needed to avoid race conditions:

- While a share is being deleted (entry in **STATE_GOING**), we do not accept
  any requests to recreate it. That would result in add and delete requests
  for store objects being sent concurrently, with undefined results.

- While a share is being added or modified (entry in **STATE_COMING**), we
  treat it as leased.

- Creation or modification requests for a given mutable share are serialized.

Unresolved design issues
------------------------

- What happens if a write to store objects for a new share fails
  permanently? If we delete the share entry, then the accounting crawler
  will eventually get to those store objects and see that their lengths
  are inconsistent with the length in the container header. This will cause
  the share to be treated as corrupted. Should we instead attempt to
  delete those objects immediately? If so, do we need a direct
  **STATE_COMING** → **STATE_GOING** transition to handle this case?

- What happens if only some store objects for a share disappear
  unexpectedly? This case is similar to only some objects having been
  written when we get an unrecoverable error during creation of a share, but
  perhaps we want to treat it differently in order to preserve information
  about the storage service having lost data.

- Does the leasedb need to track corrupted shares?

Future directions
-----------------

Clients will have key pairs identifying accounts, and will be able to add
leases for a specific account. Various space usage policies can be defined.

Better migration tools ('tahoe storage export'?) will create export files
that include both the share data and the lease data, and then an import tool
will both put the share in the right place and update the recipient node's
leasedb.