handle out-of-disk-space condition #871
How does a Tahoe-LAFS node handle it when it runs out of disk space? This happens somewhat frequently with the allmydata.com nodes, because they are configured to keep about 10 GB of space free (in order to allow updates to mutable shares, using the reserved_space configuration), and they are all configured to serve as web gateways. When someone uses a storage server as a web gateway, the download cache sometimes fills up the remaining 10 GB and causes the download to fail. The cache then doesn't get cleaned up, so from that point on the node keeps hitting out-of-disk-space problems, such as being unable to open the twistd.log file. I will open another ticket about the fact that the cache isn't getting cleaned up; this ticket is about making the Tahoe-LAFS node fail gracefully, and with a useful error message, when there is no disk space.

heh, I think you mean "fail gracefully without an error message".. where would the message go? :)
More seriously though, this is a tricky situation. A lot of operations can continue to work normally. We certainly want storage server reads to keep working, and these should never require additional disk space. Many client operations should work: full immutable downloads are held entirely in RAM (since we do streaming downloads and pause the process until the HTTP client accepts each segment), and small uploads are entirely RAM. Large uploads (either mutable or immutable) cause twisted.web to use a tempfile, and random-access immutable downloads currently use a tempfile. All mutable downloads are RAM based, as are all directory operations.
I suppose that when the log stops working due to a full disk, it would be nice if we could connect via 'flogtool tail' and find out about the node's predicament. The easiest way to emit a message that will be retrievable a long time later is to emit one at a very high severity level. This will trigger an incident, which won't be writable because the disk is full, so we need to make sure foolscap handles that reasonably.
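For reference, emitting an event at that sort of severity through foolscap's logging API looks roughly like the sketch below (a sketch only; the exact level at which the incident machinery triggers depends on how the node's incident qualifier is configured):

```
# Sketch: emit a high-severity foolscap log event, the kind that is
# meant to trigger an incident and be retrievable later via
# 'flogtool tail' (relying on the in-RAM copy of the event when the
# incident file itself cannot be written).
from foolscap.logging import log

def warn_out_of_space(where):
    # WEIRD and above are the levels the default incident qualifier
    # reacts to; adjust if the node is configured differently.
    log.msg("out of disk space while writing to " + where,
            level=log.WEIRD)
```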
I hesitate to suggest something so complex, but perhaps we should consider a second form of reserved-space parameter, which is applied to the non-storage-server disk consumers inside the node. Or maybe we could track down the non-storage-server disk consumers and make them all obey the same reserved-space parameter that the storage server tries to obey. With this sort of feature, the node would fail sort-of gracefully when the reserved-space limit was exceeded, by refusing to accept large uploads or perform large random-access downloads that would require more disk space. We'd have to decide what sorts of logging would be subject to this limit. Maybe a single Incident when the threshold was crossed (which would be logged successfully, using some of the remaining space), would at least put notice of impending space exhaustion on the disk where it could be found by operators later.
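As a rough illustration of that kind of guard, using only the standard library (the function names and structure here are assumptions for the sketch, not Tahoe's actual code):

```
# Sketch of a reserved-space guard that non-storage-server disk
# consumers (upload tempfiles, the download cache, etc.) could call
# before writing.
import shutil

def space_available(path, reserved_space):
    """Bytes usable at `path` after honoring the reserve."""
    free = shutil.disk_usage(path).free
    return max(0, free - reserved_space)

def check_or_refuse(path, nbytes, reserved_space):
    # Refuse work that would eat into the reserve, instead of running
    # the disk all the way down to zero.
    if space_available(path, reserved_space) < nbytes:
        raise IOError("refusing to write %d bytes: would violate the "
                      "reserved_space limit" % nbytes)
```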
If the fix causes operations other than share upload to respect the reserved_space setting, then there should still be enough space to log the failure. (There can be a slightly smaller reserved_space limit for writing to the logfile.)

well, unless the full disk is the result of some other non-Tahoe process altogether, which is completely ignorant of tahoe's reserved_space concept. Gotta plan for the worst..
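A sketch of that tiering, with purely illustrative numbers (the size of the gap between the two reserves is an assumption, not anything Tahoe defines):

```
# Two tiers of reserve: ordinary disk consumers (upload tempfiles, the
# download cache) must leave reserved_space free, while the logfile
# writer is allowed to dip slightly below it, so that the failure
# itself can still be logged. The 1 MiB gap is arbitrary.
RESERVED_SPACE = 10 * 1024**3        # e.g. the ~10 GB reserve mentioned above
LOG_RESERVED_SPACE = RESERVED_SPACE - 1024**2

def reserve_for(purpose):
    return LOG_RESERVED_SPACE if purpose == "logging" else RESERVED_SPACE
```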
So your idea is to make all of the node operations respect the reserved_space parameter except for the logging operations, and then add a (high-severity) log message showing that the reserved_space limit has been reached? That sounds good. Oh, yeah, but as you say, what should the node do when there isn't any disk space? What would be complicated about triggering a very high-severity incident when an out-of-disk-space condition is detected? That sounds straightforward to me, and as far as I understand foolscap, an investigator who later connected with a flogtool tail would then see that high-severity incident report, right?

Yes to all of that. I hadn't been thinking of two separate messages, but maybe that makes sense.. one when reserved_space is exceeded the first time, another when, hm, well, when disk_avail == 0 (or disk_avail < REALLYSMALL), but since we'd already be guarding all our writes with reserved_space, I don't know where exactly we'd be checking for the second threshold.
Anyways, the requirement on Foolscap is that its "Incident Reporter" (the piece that tries to write the .flog file into BASEDIR/logs/incidents/) must survive the out-of-disk condition without breaking logging or losing the in-RAM copy of the incident. As long as that incident is in RAM, a flogtool tail process should see it later. (I just added foolscap#144 (http://foolscap.lothar.com/trac/ticket/144) to track this requirement.)
The only other thing I'd want to think about is how to keep the message (or messages) from being emitted over and over. The obvious place to put this message would be in the storage server (where it tests disk-space-remaining against reserved_space) and in the cachey thing (where we're going to add code to do the same). But should there be a flag of some sort to say "only emit this message once"? And if something resolves the too-full condition and then, a month later, it gets full again, would we want the message to be re-emitted?

It almost seems like we'd want a switch that the operator resets when they fix the overfull condition, sort of like the "Check Engine" light on your car's dashboard that stays on until the mechanic fixes everything. (Or, assuming your mechanic is a concurrency expert with a healthy fear of race conditions, they'll turn off the light and then fix everything.)

Maybe the rule should be that if you see this incident, you should do something to free up space, and then restart the node (to reset the flag).
The flag should be reset when the free space is observed to be above a threshold (reserved_space plus a constant) at the moment we test it. I think there's no need to poll the free space -- testing it when we are about to write something should be sufficient. There's also no need to remember the flag across restarts. So, something like this?:
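As a rough illustration of that reset rule (the names below are assumptions for the sketch, not the actual code being proposed):

```
# Sketch of the "check engine light" flag: warn once when the reserve
# is violated, and re-arm only after free space recovers above
# reserved_space plus a margin. Checked opportunistically at write
# time; nothing is polled, and nothing persists across restarts.
import shutil

RESET_MARGIN = 100 * 1024**2   # illustrative constant

class SpaceWarningFlag:
    def __init__(self, path, reserved_space):
        self.path = path
        self.reserved_space = reserved_space
        self.warned = False

    def check(self, emit_warning):
        free = shutil.disk_usage(self.path).free
        if free < self.reserved_space:
            if not self.warned:
                self.warned = True
                emit_warning()        # e.g. log a high-severity event
        elif free > self.reserved_space + RESET_MARGIN:
            self.warned = False       # space recovered: re-arm the flag
```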
Replying to warner:

Yes. Nitpicks: "Making all of the node operations respect the reserved_space parameter" includes #390 ('readonly_storage' and 'reserved_space' not honored for mutable-slot write requests).

Bumping this from v1.6.1 because it isn't a regression and we have other tickets to do in v1.6.1.
This is a "reliability" issue, meaning that it is one of those things that developers can get away with ignoring most of the time because most of the time they aren't encountering the conditions which cause this issue to arise.
Therefore, it's the kind of ticket that I value highly so that we don't forget about it and allow users to suffer the consequences. But, v1.7 is over and I'm moving this to "eventually" instead of to v1.8 because I'm not sure of the priority of this ticket vs. the hundreds of other tickets that I'm not looking at right now, and because I don't want the "bulldozer effect" of a big and growing pile of tickets getting pushed from one Milestone to the next. :-)
See also #1279 which is more about what happens if the disk is full when the server is trying to start.