writing of shares is fragile and "tahoe stop" is unnecessarily harsh #200
As per comment:/tahoe-lafs/trac-2024-07-25/issues/5243:8, the updating of share data is an incremental in-place process on disk, which means that if the node crashes while updating a share, the share will be corrupted. Also, there is currently no way to deliberately stop (or restart) a node without crashing it.
I'm inclined to measure the I/O cost of a more robust, atomic update of shares, but I'll leave that decision up to Brian and assign this ticket to him.
This isn't an integrity issue: even if a share is corrupted this way, the integrity of the file it belongs to is not threatened.
Note that there are in general two possible ways to reduce the problem of shares being corrupted during a shutdown or crash. One is to make the writing of shares be more robust, for example by writing out a complete new copy of the share to a new temporary location and then renaming it into place. This is the option that increases I/O costs as discussed in the initial comment. Another is to add a "graceful shutdown" option where the storage server gets a chance to finish (or abort) updating a share before its process is killed.
I'm currently opposed to the latter and would be happier keeping the current fragile update than adding a graceful-shutdown option.
I agree that "graceful shutdown" is not the right solution.
Changed title from "writing of shares is fragile and/or there is no graceful shutdown" to "writing of shares is fragile".
Hrmph, I guess this is one of my hot buttons. Zooko and I have discussed the
"crash-only" approach before, and I think we're still circling around each
other's opinions. I currently feel that any approach that prefers fragility
is wrong. Intentionally killing the server with no warning whatsoever (i.e.
the SIGKILL that "tahoe stop" does), when it is perfectly reasonable to
provide some warning and tolerate a brief delay, is equal to intentionally
causing data loss and damaging shares for the sake of some sort of
ideological purity that I don't really understand.
Be nice to your server! Don't shoot it in the head just to prove that you
can. :-)
Yes, sometimes the server will die abruptly. But it will be manually
restarted far more frequently than that. Here's my list of
running-to-not-running transition scenarios, in roughly increasing order of
frequency:
* kernel crash or power loss (no warning at all; nothing gets flushed)
* process killed with SIGKILL (no warning; in-progress writes are cut off)
* graceful host/kernel shutdown (all disk writes
  completed and buffers flushed)
* graceful process shutdown, e.g. "tahoe stop" or "tahoe restart" (all local
  disk writes completed)
The tradeoff is between:

* performance and code complexity, and
* correctness (i.e. resistance to corruption: what is the probability that a
  share will survive intact?)
Modern disk filesystems effectively write a bunch of highly-correct
corruption-resistant but poor-performance data to disk (i.e. the journal),
then write a best-effort performance-improving index to very specific places
(i.e. the inodes and dirnodes and free-block-tables and the rest). In the
good case, the filesystem uses the index and gets high performance. In the bad case (i.e.
the fsck that happens after it wakes up and learns that it didn't shut down
gracefully), it spends a lot of time on recovery but maximizes the
correctness by using the journal. The shutdown time is pretty small but
depends upon how much buffered data is waiting to be written (it tends to be
insignificant for hard drives, but annoyingly long for removable USB drives).
A modern filesystem could achieve its correctness goals purely by using the
journal, with zero shutdown time (umount == poweroff), and would never spend
any time recovering anything, and would be completely "crash-only", but of
course the performance would be so horrible that nobody would ever use it.
Each open() or read() would involve a big fsck process, and it would probably
have to keep the entire directory structure in RAM.
So it's an engineering tradeoff. In Tahoe, we've got a layer of reliability
over and above the individual storage servers, which lets us deprioritize the
per-server correctness/corruption-resistance goal a little bit.
If correctness were infinitely important, we'd write out each new version of
a mutable share to a separate file, then do an fsync(), then perform an
atomic rename (except on platforms that are too stupid to provide such a
feature, of course), then do fsync() again, to maximize the period of time
when the disk contained a valid monotonically-increasing version of the
share.
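Concretely, that correctness-first option would look something like the
following sketch (Python, illustrative only; the function
replace_share_atomically and its layout are hypothetical, not actual Tahoe
code):

```python
import os
import tempfile

def replace_share_atomically(share_path, new_contents):
    # Write the new version to a tempfile in the same directory as the
    # share: rename() is only atomic within a single filesystem.
    share_dir = os.path.dirname(os.path.abspath(share_path))
    fd, temp_path = tempfile.mkstemp(dir=share_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_contents)
            f.flush()
            os.fsync(f.fileno())       # first fsync: the new file's contents
        os.rename(temp_path, share_path)   # atomic replace on POSIX
        dir_fd = os.open(share_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)           # second fsync: make the rename durable
        finally:
            os.close(dir_fd)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)       # don't leave an abandoned tempfile
        raise
```

The two fsync() calls are what buy the extra durability; dropping them gets
back most of the performance at the cost of the kernel-crash and power-loss
cases.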
If performance or code complexity were infinitely important, we'd modify the
share in-place with as few writes and syscalls as possible, and leave the
flushing up to the filesystem and kernel, to do at the most efficient time
possible.
If performance and correctness were top goals, but not code complexity, you
could imagine writing out a journal of mutable share updates, and somehow
replaying it on restart if we didn't see the "clean" bit that means we'd
finished doing all updates before shutdown.
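For what it's worth, a journal of that shape could be as small as this sketch
(entirely hypothetical names and record layout, just to show the idea;
nothing here is Tahoe's actual on-disk format):

```python
import os
import struct

RECORD_HEADER = struct.Struct(">QI")   # 8-byte offset, 4-byte data length

def journaled_write(share_path, journal_path, offset, data):
    # Append the edit to the journal and fsync it *before* touching the
    # share in place; the journal is the corruption-resistant copy.
    with open(journal_path, "ab") as j:
        j.write(RECORD_HEADER.pack(offset, len(data)) + data)
        j.flush()
        os.fsync(j.fileno())
    with open(share_path, "r+b") as s:
        s.seek(offset)
        s.write(data)

def mark_clean(journal_path):
    # Graceful shutdown: every edit has been applied, so discard the
    # journal ("clean" simply means "no journal present").
    if os.path.exists(journal_path):
        os.remove(journal_path)

def replay_journal(share_path, journal_path):
    # Startup: a surviving journal means we didn't shut down cleanly, so
    # re-apply every record (replaying is idempotent).
    if not os.path.exists(journal_path):
        return
    with open(journal_path, "rb") as j, open(share_path, "r+b") as s:
        while True:
            header = j.read(RECORD_HEADER.size)
            if len(header) < RECORD_HEADER.size:
                break                      # ignore a truncated tail record
            offset, length = RECORD_HEADER.unpack(header)
            data = j.read(length)
            if len(data) < length:
                break
            s.seek(offset)
            s.write(data)
        s.flush()
        os.fsync(s.fileno())
    mark_clean(journal_path)
```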
So anyways, those are my feelings in the abstract. As for the specific, I
strongly feel that "tahoe stop" should be changed to send SIGINT and give the
process a few seconds to finish any mutable-file-modification operation it
was doing before sending it SIGKILL. (as far as I'm concerned, the only
reason to ever send SIGKILL is because you're impatient and don't want to
wait for it to clean up, possibly because you believe that the process has
hung or stopped making progress, and you can't or don't wish to look at the
logs to find out what's going on).
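In other words, something like this sketch (illustrative only; stop_node and
the five-second grace period are made up, not the real tahoe CLI code):

```python
import os
import signal
import time

def stop_node(pid, grace_period=5.0):
    # Ask the node to shut down, give it a few seconds to finish any
    # in-progress share modification, and only then fall back to SIGKILL.
    os.kill(pid, signal.SIGINT)
    deadline = time.time() + grace_period
    while time.time() < deadline:
        try:
            os.kill(pid, 0)          # signal 0: just check liveness
        except OSError:
            return True              # process exited gracefully
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)     # impatient fallback
    return False
```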
I don't yet have an informed opinion about copy-before-write or
edit-in-place. As Zooko points out, it would be appropriate to measure the IO
costs of writing out a new copy of each share, and see how bad it looks.

Code notes:

* The simplest way to implement copy-before-write would be to first copy the
  entire share, then apply in-place edits to the new version, then
  atomically rename it into place. We'd want to consider a recovery-like
  scan for abandoned editing files (i.e.
  "find storage/shares -name '*.tmp' | xargs rm") at startup, to avoid
  unbounded accumulation of those tempfiles, except that such a scan would
  be expensive to perform and would rarely find much.
* Another option is to make a backup copy of the entire share, apply
  in-place edits to the old version, then delete the backup (and establish
  a recovery procedure that looks for backup copies and uses them to replace
  the presumably-incompletely-edited original). This would be easier to
  implement if the backup copies were all placed in a single central
  directory, so the recovery process can scan for them quickly, perhaps in
  storage/shares/updates/$SI. (A sketch of this approach follows the list.)
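Here is that sketch of the second option (illustrative only; the paths and
the find_share_path callback are hypothetical, not the storage server's real
layout):

```python
import os
import shutil

UPDATES_DIR = "storage/shares/updates"   # stands in for .../updates/$SI parent

def edit_share_in_place(share_path, storage_index, apply_edits):
    # 1. back up the whole share, 2. edit the original in place,
    # 3. drop the backup once the edit is complete.
    backup_dir = os.path.join(UPDATES_DIR, storage_index)
    os.makedirs(backup_dir, exist_ok=True)
    backup_path = os.path.join(backup_dir, os.path.basename(share_path))
    shutil.copy2(share_path, backup_path)
    apply_edits(share_path)
    os.remove(backup_path)
    if not os.listdir(backup_dir):
        os.rmdir(backup_dir)

def recover_interrupted_edits(find_share_path):
    # Startup scan: any surviving backup means an edit was interrupted, so
    # restore it over the presumably-incompletely-edited original.
    # find_share_path(storage_index, share_name) -> path is a hypothetical
    # callback into the storage server's share-location logic.
    if not os.path.isdir(UPDATES_DIR):
        return
    for storage_index in os.listdir(UPDATES_DIR):
        si_dir = os.path.join(UPDATES_DIR, storage_index)
        for name in os.listdir(si_dir):
            backup_path = os.path.join(si_dir, name)
            shutil.copy2(backup_path, find_share_path(storage_index, name))
            os.remove(backup_path)
        os.rmdir(si_dir)
```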
However, my suspicion is that edit-in-place is the appropriate tradeoff,
because that will lead to simpler code (i.e. fewer bugs) and better
performance, while only making us vulnerable to share corruption during the
rare events that don't give the server time to finish its write() calls (i.e.
kernel crash, power loss, and SIGKILL). Similarly, I suspect that it is not
appropriate to call fsync(), because we lose performance everywhere but only
improve correctness in the kernel crash and power loss scenarios. (a graceful
kernel shutdown, or arbitrary process shutdown followed by enough time for
the kernel/filesystem to flush its buffers, would provide for all write()s to
be flushed even without a single fsync() call).
Changed title from "writing of shares is fragile" to "writing of shares is fragile and 'tahoe stop' is unnecessarily harsh".
I'm sorry if this topic makes you feel unhappy. For what it is worth, I am satisfied with the current behavior: dumb writes, stupid shutdown, simple startup. :-) This scores highest on simplicity, highest on performance, and not so great on preserving mutable shares.
This seems okay to me, because I consider shares to be expendable -- files are what we care about, and those are preserved by verification and repair at the Tahoe-LAFS layer rather than by having high-quality storage at the storage layer. allmydata.com uses cheap commodity PC kit, such as a 2 TB hard drive for a mere $200. Enterprise storage people consider it completely irresponsible and wrong to use such kit for "enterprise" purposes. They buy "enterprise" SCSI drives from their big equipment provider (Sun, HP, IBM) with something like 300 GB of capacity for something like $500. Then they add RAID-5 or RAID-6 or RAID-Z, redundant power supplies, yadda yadda yadda.
So anyway, allmydata.com buys these commodity PCs -- basically the same hardware you can buy retail at Fry's or Newegg -- which are quite inexpensive and suffer a correspondingly higher failure rate. In one memorable incident, one of these 1U servers from SuperMicro failed in such a way that all four of the commodity 1 TB hard drives in it were destroyed. That means lots of mutable shares -- maybe something on the order of 10,000 -- were destroyed in an instant! But none of the allmydata.com customer files were harmed.
The hard shutdown behavior that is currently in Tahoe-LAFS would have to be exercised quite a lot while under high load before it would come close to destroying that many mutable shares. :-)
I would accept changing it to do robust writes such as the simple "write-new-then-relink-into-place". (My guess is that this will not cause a noticeable performance degradation.)
I would accept changing it to do traditional unixy two-phase graceful shutdown as you describe, with misgivings, as I think I've already made clear to you in personal conversation and in comment:/tahoe-lafs/trac-2024-07-25/issues/5243:8.
To sum my misgivings: 1. our handling of hard shutdown (e.g. power off, out of disk space, kernel crash) is not thereby improved, and 2. if we come to rely on "graceful shutdown" then our "robust startup" muscles atrophy.
Consider this: we currently have no automated tests of what happens when servers get shut down in the middle of their work. So we should worry that as the code evolves, someone could commit a patch which causes bad behavior in that case and we wouldn't notice.
However, we do know that every time anyone runs "tahoe stop" or "tahoe restart", it exercises the hard shutdown case. The fact that allmydata.com has hundreds of servers with this behavior, and has had for years, gives me increased confidence that the current code doesn't do anything catastrophically wrong in this case.
If we improved "tahoe stop" to be a graceful shutdown instead of a hard shutdown, then of course the current version of Tahoe-LAFS would still be just as good as ever, but as time went on and the code evolved I would start worrying more and more about how Tahoe servers handle the hard shutdown case. Maybe this means we need automated tests of that case.
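Such a test might look roughly like this sketch (hypothetical, not from the Tahoe test suite): start a child process that repeatedly rewrites a fake share in place, SIGKILL it mid-stream, and then check whether the surviving file is one consistent version or a mix of two:

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

CHILD_CODE = r"""
import sys
path = sys.argv[1]
pattern = 0
while True:
    data = bytes([pattern % 256]) * 65536   # one consistent version is uniform
    with open(path, "r+b") as f:
        f.write(data)
    pattern += 1
"""

def run_hard_shutdown_test():
    # Create a fake "share" and a child that keeps rewriting it in place.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * 65536)
        share_path = f.name
    child = subprocess.Popen([sys.executable, "-c", CHILD_CODE, share_path])
    time.sleep(1.0)                       # let it get into the middle of a write
    os.kill(child.pid, signal.SIGKILL)    # the hard shutdown under test
    child.wait()
    data = open(share_path, "rb").read()
    os.remove(share_path)
    # If the kill landed mid-write, old and new versions are interleaved.
    return len(set(data)) == 1
```

Running that in a loop a few dozen times would give a rough measure of how often the in-place update actually leaves a torn share behind.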