handle disk-full situations properly #426
Reference: tahoe-lafs/trac-2024-07-25#426
We need to add code that implements a "min-free-space=" disk usage model.
Specifically, you should be able to tell a Tahoe node that it must refuse new
leases if its remaining disk space is less than some threshold.
We need this in place before any of the allmydata.com prodnet storage servers
get close to running out of space, because otherwise the out-of-space error
raised during a write() call will interact badly with the client's upload
algorithm: worst case, the upload will fail, but unless the client restarts,
the servers will claim that the upload is still in progress, and the client
won't try to use other servers.
The plan is to have a config file of some sort that specifies a minimum free
space. The server will use 'df' or its python equivalent to measure the free
space in storage/ before each allocate_bucket() call, and if the free space
minus the request size is below this threshold, the lease request will be
rejected.
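A minimal sketch of the check described above, assuming Python's standard `shutil.disk_usage` as the "python equivalent" of 'df'. The function name `would_accept_lease` and its parameters are hypothetical and are not taken from the Tahoe codebase.

```python
import shutil


def would_accept_lease(storage_dir, request_size, min_free_space,
                       disk_usage=shutil.disk_usage):
    """Return True if a lease of request_size bytes may be accepted.

    Hypothetical helper: the request is rejected when the free space
    minus the request size falls below the configured threshold.
    disk_usage is injectable so the check can be exercised without
    touching a real filesystem.
    """
    free = disk_usage(storage_dir).free
    return (free - request_size) >= min_free_space
```

A server would call something like this with the path of storage/ before each allocate_buckets() call, passing the total size of the requested shares.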
We also need to make sure that mutable file lease requests can be rejected
properly.
code review:
Server side 1.0 Behavior
In 1.0, if the server were to run out of room, or if the partition it is
using for NODEDIR/storage/ were to be mounted read-only, or if
NODEDIR/storage/ were chmoded o-r, then:
* remote_allocate_buckets(), remote_renew_lease(), and remote_cancel_lease()
  would raise an IOError exception
* remote_slot_testv_and_readv_and_writev() would raise IOError instead of
  writing (if the test vectors did not match, it would return the usual
  non-exception response)
* BucketWriter.remote_write would also raise IOError, if the
  allocate_buckets succeeded but we ran out of space later.

If NODEDIR/readonly_storage exists, then:

* remote_allocate_buckets() would return the usual non-exception response
  (i.e. an empty 'bucketwriters' dict), indicating that the lease is rejected
* remote_renew_lease() and remote_cancel_lease() would succeed
* remote_slot_testv_and_readv_and_writev() would succeed.

Our proposal for 1.1 is to transform the IOError that is triggered by writes
to a full or readonly filesystem into a well-defined remote exception, and to
react to NODEDIR/readonly_storage by raising the same IOError. In addition,
we plan to add a "df"-based reserved-space threshold, and if this plus the
size of all current reservations is exceeded, to raise the same IOError.
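One possible reading of this proposal, as a rough sketch: the server tracks the sizes of outstanding reservations and raises a single well-defined error when the storage is read-only, or when a new reservation would eat into the reserved space. The names `OutOfSpaceError` and `SpaceAccountant` are hypothetical, and the exact comparison is an assumption about what "exceeded" means here.

```python
class OutOfSpaceError(IOError):
    """Hypothetical well-defined error; subclasses IOError so existing
    handlers that catch IOError keep working."""


class SpaceAccountant:
    """Hypothetical tracker of reserved space and pending reservations."""

    def __init__(self, reserved_space, get_free_space, readonly=False):
        self.reserved_space = reserved_space
        self.get_free_space = get_free_space  # callable returning free bytes
        self.readonly = readonly
        self._reservations = {}  # bucket id -> reserved bytes

    def reserve(self, bucket_id, size):
        if self.readonly:
            raise OutOfSpaceError("storage is read-only")
        pending = sum(self._reservations.values())
        # Assumption: free space, less everything already promised and
        # this request, must not drop below the reserved threshold.
        if self.get_free_space() - pending - size < self.reserved_space:
            raise OutOfSpaceError("reserved-space threshold would be exceeded")
        self._reservations[bucket_id] = size

    def release(self, bucket_id):
        # Called when a share is closed or abandoned.
        self._reservations.pop(bucket_id, None)
```

Counting pending reservations matters because shares that have been allocated but not yet written do not show up in the 'df' figure.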
http://allmydata.org/pipermail/tahoe-dev/2008-May/000630.html contains some
relevant discussion, as well as some API plans for post-1.1
So the requirement is that all supported client versions must tolerate an
exception during write.
Client side 1.0 Behavior
In 1.0, if an immutable upload receives an exception during
allocate_buckets(), a log.UNUSUAL message is logged ("got error during peer
selection"), but otherwise the peer selection code will proceed normally. If
an exception is received during share write, another log.UNUSUAL message is
logged, and the shareholder is dropped. However, since this takes place after
peer selection, no new shareholder will be found to take their place, and
that share will not be uploaded, resulting in a slightly unhealthy file
(fewer than N shares present). If this happens to enough shares, the
shares_of_happiness threshold will not be met, and the upload will fail.
Since uploads do not automatically abandon their partial shares, the
server will still see a non-zero reference count for the BucketWriter
object, so the partial share data will remain in
NODEDIR/storage/shares/incoming/, and therefore it is likely that the next
allocate_buckets() call will fail. However, the partial shares in incoming/
will cause allocate_buckets to believe that someone else is currently
uploading those shares, and the client will treat them as "alreadygot", which
means it will not attempt to find new (better) homes for them. So, worst case,
the first upload will fail, the second upload will appear to succeed, but the
file will not actually be retrievable from the grid. Badness.
If a mutable publish receives an exception during
remote_slot_testv_and_readv_and_writev, the unfortunate DeferredList created
by Publish._send_shares() will fill with (False, Failure) pairs, and the lack
of code to detect this condition means that the publish will appear to
succeed when in fact the file is still in its original state.
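The missing detection step could look roughly like the sketch below. A Twisted DeferredList fires with a list of (success, result) pairs, so the check only needs to separate the two kinds of entry. The helper name `split_write_results` is hypothetical, and plain exception objects stand in for Twisted Failure instances so the example runs without Twisted installed.

```python
def split_write_results(results):
    """Split DeferredList-style (success, result) pairs.

    Returns (answers, failures): the results of the writes that
    succeeded, and the error objects of the writes that did not.
    """
    answers = [result for (success, result) in results if success]
    failures = [result for (success, result) in results if not success]
    return answers, failures


# Example: one server accepted the write, one raised IOError.
answers, failures = split_write_results([
    (True, "write ok"),
    (False, IOError("No space left on device")),
])
publish_succeeded = not failures
```

With a check like this, the caller can report the publish as failed (or retry on other servers) instead of treating a list full of failures as success.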
Client side 1.1 (current trunk) Behavior
The immutable upload code in 1.1 is the same as in 1.0. A storage server
which discovers that it is full after allocate_buckets will cause silent
failures.
The mutable upload code in 1.1 is new. The servermap update phase has no way
to ask if the server will accept a new share or not, but the publish phase
uses a full peerlist, and will fall back to later peers if earlier ones have
problems. The IOError will cause a log.UNUSUAL event to be recorded, but
otherwise peer-selection will work correctly. Since mutable share writes are
performed by a single remote_slot_testv_and_readv_and_writev call (instead of
being broken up into allocate, write, and close calls like immutable shares),
they are not vulnerable to the problems that will occur with immutable files
and late exceptions. Mutable file publish in the face of IOError will require
multiple roundtrips, though, since we must wait until the publish phase to
determine which peers will help. I expect this to make the publish phase
require 2 RTT instead of 1, bringing the total from 2 to 3.
Out-of-space exceptions for initial mutable-file creation should be tolerated
well; however, out-of-space during subsequent modification calls is a problem.
The client will detect the error and find another server to put the new
(larger) share on, but it does not then remove the old (smaller) share from
the server that raised IOError. As a result, the old version will still be
there, and once this happens to several primary servers, rollback will occur
(i.e. the first few k+epsilon shares that the client sees will be old ones,
so it won't see the later version).
Necessary Changes
We need to reduce the chance that (immutable) allocate_buckets will succeed
but a later write() call will fail, since that will cause significant
problems. Likewise, assuming that we can't get rid of all 1.0 clients for a
while, we need to reduce the chance that mutable r_s_t_a_r_a_w() will get an
exception.
To do this, we should set NODEDIR/readonly_storage on storage servers that
are getting close to full (say, with about 10GB to spare). That will cause
allocate_buckets() to start rejecting shares, avoiding failures in write().
readonly_storage does not yet affect r_s_t_a_r_a_w(), so clients will
continue to write mutable shares to the somewhat-readonly servers.
The next phase is to get rid of all the 1.0 peers, to avoid the bad behavior
that occurs when they experience an exception during publish.
Then we can change the server-side storage code in trunk to partially respect
NODEDIR/readonly_storage by rejecting new mutable shares (raising an
exception) but have it continue to accept modifications of existing shares.
This will allow 1.1 clients to behave well, while still avoiding the problems
that occur when 1.1 clients get errors while modifying existing shares.
Next, we change the client-side mutable publish code in trunk to be able to
move shares (specifically give it the ability to delete shares). This
requires new server-side methods. Publish should respond to an out-of-space
error by locating a server which can hold the share, uploading it to them,
then deleting the old one. Once all clients are able to do this, it will
become safe to allow servers to raise an out-of-space exception in
r_s_t_a_r_a_w.
Then, we can change the server-side code to fully respect readonly_storage,
except that we need to change its meaning: something more like "stop getting
bigger". The new flag must allow the deletion of mutable shares, and could
possibly allow modifications to them as long as the shares do not get bigger.
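The "stop getting bigger" rule could be expressed as a small policy function. This is only a sketch: the function name and the convention of using None for a share that does not exist (before or after the operation) are assumptions made for illustration.

```python
def allow_mutable_operation(old_size, new_size, stop_growing):
    """Decide whether a mutable-share operation is permitted.

    old_size: current share size in bytes, or None if the share is new.
    new_size: size after the operation, or None if the share is deleted.
    stop_growing: True when the (hypothetical) "stop getting bigger"
        flag is set on the server.
    """
    if not stop_growing:
        return True
    if new_size is None:
        return True   # deletions are always allowed
    if old_size is None:
        return False  # creating a new share makes the store bigger
    return new_size <= old_size  # modify in place only if it does not grow
```

Allowing same-size or smaller modifications is the optional part of the proposal; a stricter server could return False for every case except deletion.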
Eventually, we want the server to pay attention to its free space (the 'df'
reserved threshold) and reject allocation requests when they would cause this
threshold to be exceeded.
Ok, so the plan is:
readonly_storage is set), but continue to allow modifications to existing mutable shares.
This will reduce the rate of inbound data to practically nothing, and if we make this change
with perhaps 10GB left, we can probably survive in this state for years.
Later, we'll be overhauling the storage API to handle all of this better. We'll probably deploy that change through the introducer (so that 1.1 clients will see different storage objects than newer clients).
We don't really need to make any changes to 1.1 to make it behave well according to this plan, so I'm closing this ticket.