.. -*- coding: utf-8-with-signature -*-

=====================
Lease database design
=====================

The target audience for this document is developers who wish to understand
the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.


Introduction
------------

A "lease" is a request by an account that a share not be deleted before a
specified time. Each storage server stores leases in order to know which
shares to spare from garbage collection.

Motivation
----------

The leasedb will replace the current design in which leases are stored in
the storage server's share container files. That design has several
disadvantages:

- Updating a lease requires modifying a share container file (even for
  immutable shares). This complicates the implementation of share classes.
  The mixing of share contents and lease data in share files also led to a
  security bug (ticket `#1528`_).

- When only the disk backend is supported, it is possible to read and
  update leases synchronously because the share files are stored locally
  to the storage server. For the cloud backend, accessing share files
  requires an HTTP request, and so must be asynchronous. Accepting this
  asynchrony for lease queries would be both inefficient and complex.
  Moving lease information out of shares and into a local database allows
  lease queries to stay synchronous.

Also, the current cryptographic protocol for renewing and cancelling leases
(based on shared secrets derived from secure hash functions) is complex,
and the cancellation part was never used.

The leasedb solves the first two problems by storing the lease information in
a local database instead of in the share container files. The share data
itself is still held in the share container file.

At the same time as implementing leasedb, we devised a simpler protocol for
allocating and cancelling leases: a client can use a public key digital
signature to authenticate access to a foolscap object representing the
authority of an account. This protocol is not yet implemented; at the time
of writing, only an "anonymous" account is supported.

The leasedb also provides an efficient way to get summarized information,
such as total space usage of shares leased by an account, for accounting
purposes.

.. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528


Design constraints
------------------

A share is stored as a collection of objects. The persistent storage may be
remote from the server (for example, cloud storage).

Writing to the persistent store objects is in general not an atomic
operation. So the leasedb also keeps track of which shares are in an
inconsistent state because they have been partly written. (This may
change in future when we implement a protocol to improve atomicity of
updates to mutable shares.)

Leases are no longer stored in shares. The same share format is used as
before, but the lease slots are ignored, and are cleared when rewriting a
mutable share. The new design also does not use lease renewal or cancel
secrets. (They are accepted as parameters in the storage protocol interfaces
for backward compatibility, but are ignored. Cancel secrets were already
ignored due to the fix for `#1528`_.)

The new design needs to be fail-safe in the sense that if the lease database
is lost or corruption is detected, no share data will be lost (even though
the metadata about leases held by particular accounts has been lost).
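
The fail-safe requirement can be illustrated with a minimal sketch, assuming
the leasedb is kept in a single SQLite file. The function name
``open_leasedb``, the ``.corrupt`` suffix and the one-table schema are
hypothetical and chosen only for illustration; they are not the actual
implementation. The point is that recovery touches only the database file
and never the share data:

.. code-block:: python

    import os
    import sqlite3

    # Hypothetical, simplified schema: one row per (storage_index, shnum).
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS shares (
        storage_index TEXT NOT NULL,
        shnum INTEGER NOT NULL,
        state INTEGER NOT NULL,
        PRIMARY KEY (storage_index, shnum)
    )
    """


    def open_leasedb(path):
        """Open the leasedb, starting from an empty one if it is unusable.

        A missing file is simply created. A file that SQLite rejects or
        reports as damaged is moved aside; share files are never touched.
        """
        conn = sqlite3.connect(path)
        try:
            row = conn.execute("PRAGMA integrity_check").fetchone()
            healthy = row is not None and row[0] == "ok"
        except sqlite3.DatabaseError:
            healthy = False
        if not healthy:
            conn.close()
            # Keep the damaged file for inspection (assumed policy).
            os.replace(path, path + ".corrupt")
            conn = sqlite3.connect(path)
        conn.execute(SCHEMA)
        conn.commit()
        return conn

After such a reset the database is empty, so every existing share looks like
a newly discovered one to the accounting crawler described in the next
section.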


Accounting crawler
------------------

A "crawler" is a long-running process that visits share container files at a
slow rate, so as not to overload the server by trying to visit all share
container files one after another immediately.

The accounting crawler replaces the previous "lease crawler". It examines
each share container file and compares it with the state of the leasedb, and
may update the state of the share and/or the leasedb.

The accounting crawler may perform the following functions (but see ticket
#1834 for a proposal to reduce the scope of its responsibility):

- Remove leases that are past their expiration time. (Currently, this is
  done automatically before deleting shares, but we plan to allow expiration
  to be performed separately for individual accounts in future.)

- Delete the objects containing unleased shares, that is, shares that have
  stable entries in the leasedb but no current leases (see below for the
  definition of "stable" entries).

- Discover shares that have been manually added to storage, via ``scp`` or
  some other out-of-band means.

- Discover shares that are present when a storage server is upgraded to
  a leasedb-supporting version from a previous version, and give them
  "starter leases".

- Recover from a situation where the leasedb is lost or detectably
  corrupted. This is handled in the same way as upgrading from a previous
  version.

- Detect shares that have unexpectedly disappeared from storage. The
  disappearance of a share is logged, and its entry and leases are removed
  from the leasedb.
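
The functions listed above can be sketched as one reconciliation pass over
in-memory data. This is only an illustration under simplifying assumptions:
the name ``crawl_once``, the dictionary layout, the state strings and the
``starter_duration`` parameter are hypothetical, ``stored`` is assumed to
contain only completely written shares, and a real crawler would visit
shares gradually rather than in one call:

.. code-block:: python

    STATE_COMING = "STATE_COMING"
    STATE_STABLE = "STATE_STABLE"
    STATE_GOING = "STATE_GOING"
    STARTER_ACCOUNT = "starter"  # hypothetical account identifier


    def crawl_once(stored, db_shares, leases, now, starter_duration):
        """Reconcile storage with the leasedb; return the actions taken.

        stored:    set of (storage_index, shnum) found on storage
        db_shares: dict mapping (storage_index, shnum) -> state
        leases:    dict mapping ((storage_index, shnum), account) -> expiry
        """
        actions = []

        # Remove leases that are past their expiration time.
        for lease_key, expires in list(leases.items()):
            if expires <= now:
                del leases[lease_key]
                actions.append(("expire-lease", lease_key))

        # Discover shares unknown to the leasedb (added out of band, present
        # before an upgrade, or orphaned by a lost leasedb) and give them
        # starter leases.
        for share in sorted(stored - set(db_shares)):
            db_shares[share] = STATE_STABLE
            leases[(share, STARTER_ACCOUNT)] = now + starter_duration
            actions.append(("discover", share))

        # Detect stable shares that have disappeared from storage, and
        # remove their entries and leases.
        for share in sorted(set(db_shares) - stored):
            if db_shares[share] == STATE_STABLE:
                del db_shares[share]
                for lease_key in [k for k in leases if k[0] == share]:
                    del leases[lease_key]
                actions.append(("disappeared", share))

        # Mark stable shares with no current leases for deletion.
        leased = {lease_key[0] for lease_key in leases}
        for share, state in sorted(db_shares.items()):
            if state == STATE_STABLE and share not in leased:
                db_shares[share] = STATE_GOING
                actions.append(("delete", share))

        return actions

The returned list stands in for the log: for example, a ``"disappeared"``
entry corresponds to the logged disappearance of a share.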


Accounts
--------

An account holds leases for some subset of shares stored by a server. The
leasedb schema can handle many distinct accounts, but for the time being we
create only two accounts: an anonymous account and a starter account. The
starter account is used for leases on shares discovered by the accounting
crawler; the anonymous account is used for all other leases.

The leasedb has at most one lease entry per account per (storage_index,
shnum) pair. This entry stores the times when the lease was last renewed and
when it is set to expire (if the expiration policy does not force it to
expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.

For more on expiration policy, see :doc:`../garbage-collection`.
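
As a rough illustration of the "at most one lease entry per account per
share" rule and of per-account space summaries, the sketch below uses an
in-memory SQLite database. The table and column names, the numeric account
identifiers and the helper functions are assumptions made for this example
and may not match the real leasedb schema:

.. code-block:: python

    import sqlite3

    # Hypothetical account identifiers.
    ANONYMOUS_ACCOUNT = 0
    STARTER_ACCOUNT = 1

    # Hypothetical schema. The composite primary key on the leases table
    # enforces at most one lease per account per (storage_index, shnum).
    SCHEMA = """
    CREATE TABLE shares (
        storage_index TEXT NOT NULL,
        shnum INTEGER NOT NULL,
        used_space INTEGER NOT NULL,
        PRIMARY KEY (storage_index, shnum)
    );
    CREATE TABLE leases (
        storage_index TEXT NOT NULL,
        shnum INTEGER NOT NULL,
        account_id INTEGER NOT NULL,
        renewal_time INTEGER NOT NULL,     -- Unix seconds since epoch
        expiration_time INTEGER NOT NULL,  -- Unix seconds since epoch
        PRIMARY KEY (storage_index, shnum, account_id)
    );
    """


    def create_leasedb():
        conn = sqlite3.connect(":memory:")
        conn.executescript(SCHEMA)
        return conn


    def add_or_renew_lease(conn, storage_index, shnum, account_id, now, duration):
        # Renewing replaces the existing row instead of adding a second one.
        conn.execute(
            "INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?, ?)",
            (storage_index, shnum, account_id, now, now + duration),
        )


    def account_space_usage(conn, account_id):
        # Total space used by the shares this account currently leases.
        row = conn.execute(
            "SELECT COALESCE(SUM(s.used_space), 0) "
            "FROM leases AS l JOIN shares AS s "
            "ON s.storage_index = l.storage_index AND s.shnum = l.shnum "
            "WHERE l.account_id = ?",
            (account_id,),
        ).fetchone()
        return row[0]

Because the summary is a single aggregate query over local tables, it does
not need to read any share container file.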


Share states
------------

The leasedb holds an explicit indicator of the state of each share.

The diagram and descriptions below give the possible values of the "state"
indicator, what that value means, and transitions between states, for any
(storage_index, shnum) pair on each server::

  #      STATE_STABLE -------.
  #       ^   |   ^ |        |
  #       |   v   | |        v
  #  STATE_COMING | |   STATE_GOING
  #       ^       | |        |
  #       |       | v        |
  #       '----- NONE <------'


**NONE**: There is no entry in the ``shares`` table for this (storage_index,
shnum) in this server's leasedb. This is the initial state.

**STATE_COMING**: The share is being created or (if a mutable share)
updated. The store objects may have been at least partially written, but
the storage server doesn't have confirmation that they have all been
completely written.

**STATE_STABLE**: The store objects have been completely written and are
not in the process of being modified or deleted by the storage server. (The
share could have been modified or deleted behind the back of the storage
server, but if it has, the server has not noticed that yet.) The share may
or may not be leased.

**STATE_GOING**: The share is being deleted.

State transitions
-----------------

• **STATE_GOING** → **NONE**

    trigger: The storage server gains confidence that all store objects for
    the share have been removed.

    implementation:

    1. Remove the entry in the leasedb.

• **STATE_STABLE** → **NONE**

    trigger: The accounting crawler notices that all the store objects for
    this share are gone.

    implementation:

    1. Remove the entry in the leasedb.

• **NONE** → **STATE_COMING**

    triggers: A new share is being created, as explicitly signalled by a
    client invoking a creation command, *or* the accounting crawler discovers
    an incomplete share.

    implementation:

    1. Add an entry to the leasedb with **STATE_COMING**.

    2. (In case of explicit creation) begin writing the store objects to hold
       the share.

• **STATE_STABLE** → **STATE_COMING**

    trigger: A mutable share is being modified, as explicitly signalled by a
    client invoking a modification command.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_STABLE** to **STATE_COMING**.

    2. Begin updating the store objects.

• **STATE_COMING** → **STATE_STABLE**

    trigger: All store objects have been written.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_COMING** to **STATE_STABLE**.

• **NONE** → **STATE_STABLE**

    trigger: The accounting crawler discovers a complete share.

    implementation:

    1. Add an entry to the leasedb with **STATE_STABLE**.

• **STATE_STABLE** → **STATE_GOING**

    trigger: The share should be deleted because it is unleased.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_STABLE** to **STATE_GOING**.

    2. Initiate removal of the store objects.
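
The seven transitions above can be summarized as a lookup table. The sketch
below is illustrative only: the ``ShareState`` enum, the
``ALLOWED_TRANSITIONS`` set and ``check_transition`` are hypothetical names,
and **NONE** is modelled as an enum member although in the leasedb it simply
means that no entry exists:

.. code-block:: python

    from enum import Enum


    class ShareState(Enum):
        NONE = "NONE"  # no entry in the shares table
        COMING = "STATE_COMING"
        STABLE = "STATE_STABLE"
        GOING = "STATE_GOING"


    S = ShareState

    # (current state, new state) pairs, in the order listed in this section.
    ALLOWED_TRANSITIONS = frozenset({
        (S.GOING, S.NONE),     # all store objects confirmed removed
        (S.STABLE, S.NONE),    # crawler finds the store objects gone
        (S.NONE, S.COMING),    # creation requested, or incomplete share found
        (S.STABLE, S.COMING),  # mutable share being modified
        (S.COMING, S.STABLE),  # all store objects written
        (S.NONE, S.STABLE),    # crawler discovers a complete share
        (S.STABLE, S.GOING),   # share is unleased and should be deleted
    })


    def check_transition(current, new):
        """Return ``new`` if the move is listed above, else raise ValueError."""
        if (current, new) not in ALLOWED_TRANSITIONS:
            raise ValueError(
                "illegal transition: %s -> %s" % (current.value, new.value)
            )
        return new

With this table, a direct **STATE_COMING** → **STATE_GOING** move is
rejected, which matches the list above.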


The following constraints are needed to avoid race conditions:

- While a share is being deleted (entry in **STATE_GOING**), we do not accept
  any requests to recreate it. That would result in add and delete requests
  for store objects being sent concurrently, with undefined results.

- While a share is being added or modified (entry in **STATE_COMING**), we
  treat it as leased.

- Creation or modification requests for a given mutable share are serialized.
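
A minimal sketch of how these three constraints could be enforced, assuming
a threaded server and in-memory state. The class ``ShareWriteGate`` and its
methods are hypothetical; an event-driven server would serialize requests
by other means than locks:

.. code-block:: python

    import threading


    class ShareWriteGate:
        """Hypothetical helper applying the three constraints on one server."""

        def __init__(self):
            self._guard = threading.Lock()  # protects the _locks dict
            self._locks = {}   # share key -> per-share lock
            self._states = {}  # share key -> state name; absent means NONE

        def _lock_for(self, share):
            with self._guard:
                return self._locks.setdefault(share, threading.Lock())

        def mark_going(self, share):
            with self._lock_for(share):
                self._states[share] = "STATE_GOING"

        def treated_as_leased(self, share, current_lease_count):
            # A share in STATE_COMING counts as leased whatever its leases.
            if self._states.get(share) == "STATE_COMING":
                return True
            return current_lease_count > 0

        def write(self, share, do_write):
            # Holding the per-share lock serializes creation and
            # modification requests for this share.
            with self._lock_for(share):
                if self._states.get(share) == "STATE_GOING":
                    raise RuntimeError(
                        "share is being deleted; refusing to recreate it"
                    )
                self._states[share] = "STATE_COMING"
                do_write()
                self._states[share] = "STATE_STABLE"

If ``do_write`` raises, this sketch leaves the entry in **STATE_COMING**; it
does not attempt to resolve the failed-write question raised in the next
section.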


Unresolved design issues
------------------------

- What happens if a write to store objects for a new share fails
  permanently? If we delete the share entry, then the accounting crawler
  will eventually get to those store objects and see that their lengths
  are inconsistent with the length in the container header. This will cause
  the share to be treated as corrupted. Should we instead attempt to
  delete those objects immediately? If so, do we need a direct
  **STATE_COMING** → **STATE_GOING** transition to handle this case?

- What happens if only some store objects for a share disappear
  unexpectedly? This case is similar to only some objects having been
  written when we get an unrecoverable error during creation of a share, but
  perhaps we want to treat it differently in order to preserve information
  about the storage service having lost data.

- Does the leasedb need to track corrupted shares?


Future directions
-----------------

Clients will have key pairs identifying accounts, and will be able to add
leases for a specific account. Various space usage policies can be defined.

Better migration tools ('tahoe storage export'?) will create export files
that include both the share data and the lease data, and then an import tool
will both put the share in the right place and update the recipient node's
leasedb.