writing of shares is fragile and "tahoe stop" is unnecessarily harsh #200

Open
opened 2007-11-01 17:58:53 +00:00 by zooko · 4 comments

As per comment:/tahoe-lafs/trac-2024-07-25/issues/5243:8, the updating of share data is an incremental in-place process on disk, which means that if the node crashes while updating a share, the share will be corrupted. Also, there is currently no way to deliberately stop (or restart) a node without crashing it.

I'm inclined to measure the I/O cost of a more robust atomic update of shares, but I'll leave it up to Brian and assign this ticket to him.
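
For a ballpark of that I/O cost, something like the following standalone micro-benchmark could be used. This is only a sketch, not Tahoe code; the 1 MiB share size, 4 KiB edit, and iteration count are arbitrary assumptions.

```python
# Hypothetical micro-benchmark (not Tahoe code): time a small in-place edit
# against a full copy-edit-rename of a share-sized file. The 1 MiB share
# size, 4 KiB edit, and 100 iterations are arbitrary assumptions.
import os
import tempfile
import time

SHARE_SIZE = 1024 * 1024     # pretend a mutable share is 1 MiB
EDIT = b"x" * 4096           # a small mutable-share update
ITERATIONS = 100

def edit_in_place(path):
    with open(path, "r+b") as f:
        f.seek(SHARE_SIZE // 2)
        f.write(EDIT)

def write_new_then_rename(path):
    tmp = path + ".tmp"
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        dst.write(src.read())
        dst.seek(SHARE_SIZE // 2)
        dst.write(EDIT)
        dst.flush()
        os.fsync(dst.fileno())
    os.replace(tmp, path)    # atomic rename into place

def bench(fn, path):
    start = time.time()
    for _ in range(ITERATIONS):
        fn(path)
    return time.time() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "share")
        with open(path, "wb") as f:
            f.write(b"\0" * SHARE_SIZE)
        print("in-place edit:   %6.3fs" % bench(edit_in_place, path))
        print("copy-and-rename: %6.3fs" % bench(write_new_then_rename, path))
```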

zooko added the unknown, major, enhancement, 0.6.1 labels 2007-11-01 17:58:53 +00:00
zooko added this to the eventually milestone 2007-11-01 17:58:53 +00:00
warner was assigned by zooko 2007-11-01 17:58:53 +00:00
zooko added the code-storage label and removed the unknown label 2007-12-04 21:39:14 +00:00
zooko (Author) commented:

This isn't an integrity issue, because even if a share is corrupted due to this bug, that doesn't threaten the integrity of the file.

Note that there are in general two possible ways to reduce the problem of shares being corrupted during a shutdown or crash. One is to make the writing of shares be more robust, for example by writing out a complete new copy of the share to a new temporary location and then renaming it into place. This is the option that increases I/O costs as discussed in the initial comment. Another is to add a "graceful shutdown" option where the storage server gets a chance to finish (or abort) updating a share before its process is killed.

I'm currently opposed to the latter and would be happier with the current fragile update than with a graceful-shutdown option.

davidsarah (Owner) commented 2009-10-28 21:22:11 +00:00

I agree that "graceful shutdown" is not the right solution.

tahoe-lafs changed title from writing of shares is fragile and/or there is no graceful shutdown to writing of shares is fragile 2009-10-28 21:22:11 +00:00

warner commented:

Hrmph, I guess this is one of my hot buttons. Zooko and I have discussed the
"crash-only" approach before, and I think we're still circling around each
other's opinions. I currently feel that any approach that prefers fragility
is wrong. Intentionally killing the server with no warning whatsoever (i.e.
the SIGKILL that "tahoe stop" does), when it is perfectly reasonable to
provide some warning and tolerate a brief delay, is equal to intentionally
causing data loss and damaging shares for the sake of some sort of
ideological purity that I don't really understand.

Be nice to your server! Don't shoot it in the head just to prove that you
can. :-)

Yes, sometimes the server will die abruptly. But it will be manually
restarted far more frequently than that. Here's my list of
running-to-not-running transition scenarios, in roughly increasing order of
frequency:

  • kernel crash (some disk writes completed, in temporal order if you're lucky)
  • power loss (like kernel crash)
  • process crash / SIGSEGV (all disk writes completed)
  • kernel shutdown (process gets SIGINT, then SIGKILL, all disk writes
    completed and buffers flushed)
  • process shutdown (SIGINT, then SIGKILL: process can choose what to do, all
    disk writes completed)

The tradeoff is between:

  • performance in the good case
  • shutdown time in the "graceful shutdown" case
  • recovery time after something unexpected/rare happens
  • correctness: amount of corruption when something unexpected/rare happens
    (i.e. resistance to corruption: what is the probability that a share will
    survive intact?)
  • code complexity

Modern disk filesystems effectively write a bunch of highly-correct
corruption-resistant but poor-performance data to disk (i.e. the journal),
then write a best-effort performance-improving index to very specific places
(i.e. the inodes and dirnodes and free-block-tables and the rest). In the
good case, it uses the index and gets high performance. In the bad case (i.e.
the fsck that happens after it wakes up and learns that it didn't shut down
gracefully), it spends a lot of time on recovery but maximizes the
correctness by using the journal. The shutdown time is pretty small but
depends upon how much buffered data is waiting to be written (it tends to be
insignificant for hard drives, but annoyingly long for removable USB drives).

A modern filesystem could achieve its correctness goals purely by using the
journal, with zero shutdown time (umount == poweroff), and would never spend
any time recovering anything, and would be completely "crash-only", but of
course the performance would be so horrible that nobody would ever use it.
Each open() or read() would involve a big fsck process, and it would probably
have to keep the entire directory structure in RAM.

So it's an engineering tradeoff. In Tahoe, we've got a layer of reliability
over and above the individual storage servers, which lets us deprioritize the
per-server correctness/corruption-resistance goal a little bit.

If correctness were infinitely important, we'd write out each new version of
a mutable share to a separate file, then do an fsync(), then perform an
atomic rename (except on platforms that are too stupid to provide such a
feature, of course), then do fsync() again, to maximize the period of time
when the disk contained a valid monotonically-increasing version of the
share.
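
For concreteness, that sequence might look like the following sketch (not Tahoe's actual storage code; the helper name is hypothetical, POSIX rename semantics are assumed, and I'm reading "do fsync() again" as fsyncing the containing directory so the rename itself becomes durable):

```python
# Sketch of the maximally-durable update (hypothetical helper, POSIX
# assumed): write a complete new copy, fsync it, rename it into place,
# then fsync the containing directory so the rename itself is on disk.
import os

def durable_replace_share(share_path, new_contents):
    tmp_path = share_path + ".new"
    with open(tmp_path, "wb") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())           # the new copy's data is durable
    os.replace(tmp_path, share_path)   # atomic rename
    dir_fd = os.open(os.path.dirname(share_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)               # the directory entry update is durable
    finally:
        os.close(dir_fd)
```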

If performance or code complexity were infinitely important, we'd modify the
share in-place with as few writes and syscalls as possible, and leave the
flushing up to the filesystem and kernel, to do at the most efficient time
possible.

If performance and correctness were top goals, but not code complexity, you
could imagine writing out a journal of mutable share updates, and somehow
replaying it on restart if we didn't see the "clean" bit that means we'd
finished doing all updates before shutdown.
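
A minimal sketch of that journal idea, assuming an invented record format, file names, and "clean" marker (Tahoe has no such journal today):

```python
# Hypothetical journal-and-replay scheme for in-place share edits; the
# record format (offset, length, data), file names, and "clean" marker
# are all invented for illustration.
import os
import struct

JOURNAL = "share.journal"
CLEAN_MARKER = "share.clean"

def journaled_write(share_path, offset, data):
    # 1. Append the intended edit to the journal and make it durable.
    with open(JOURNAL, "ab") as j:
        j.write(struct.pack(">QI", offset, len(data)) + data)
        j.flush()
        os.fsync(j.fileno())
    # 2. Apply the edit in place; if we die here, replay redoes it.
    with open(share_path, "r+b") as f:
        f.seek(offset)
        f.write(data)

def mark_clean():
    # Graceful shutdown: every journaled edit has been applied.
    open(CLEAN_MARKER, "wb").close()
    if os.path.exists(JOURNAL):
        os.remove(JOURNAL)

def replay_if_needed(share_path):
    # Startup: if the clean marker is missing, re-apply the journal.
    if os.path.exists(CLEAN_MARKER):
        os.remove(CLEAN_MARKER)
        return
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL, "rb") as j, open(share_path, "r+b") as f:
        while True:
            header = j.read(12)
            if len(header) < 12:
                break                  # ignore a truncated final record
            offset, length = struct.unpack(">QI", header)
            data = j.read(length)
            if len(data) < length:
                break
            f.seek(offset)
            f.write(data)              # re-applying an edit is idempotent
    os.remove(JOURNAL)
```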

So anyways, those are my feelings in the abstract. As for the specific, I
strongly feel that "tahoe stop" should be changed to send SIGINT and give the
process a few seconds to finish any mutable-file-modification operation it
was doing before sending it SIGKILL. (as far as I'm concerned, the only
reason to ever send SIGKILL is because you're impatient and don't want to
wait for it to clean up, possibly because you believe that the process has
hung or stopped making progress, and you can't or don't wish to look at the
logs to find out what's going on).
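
A sketch of what a gentler "tahoe stop" could do, assuming the node's pid lives in a twistd.pid-style file and a five-second grace period; neither detail is meant as the definitive CLI behavior:

```python
# Hypothetical gentler "tahoe stop": SIGINT first, SIGKILL only if the
# process is still alive after a grace period. The twistd.pid location
# and the five-second timeout are assumptions, not the actual CLI.
import os
import signal
import time

def stop_node(pidfile="twistd.pid", grace_seconds=5.0):
    with open(pidfile) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGINT)        # ask politely first
    deadline = time.time() + grace_seconds
    while time.time() < deadline:
        try:
            os.kill(pid, 0)            # signal 0 only checks for existence
        except ProcessLookupError:
            return "stopped gracefully"
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)       # impatience, as a last resort
    return "killed"
```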

I don't yet have an informed opinion about copy-before-write or
edit-in-place. As Zooko points out, it would be appropriate to measure the IO
costs of writing out a new copy of each share, and see how bad it looks.

Code notes:

  • the simplest way to implement copy-before-write would be to first copy the
    entire share, then apply in-place edits to the new version, then
    atomically rename it into place. We'd want to consider a recovery-like
    scan for abandoned editing files (i.e.
    find storage/shares -name '*.tmp' | xargs rm) at startup, to avoid
    unbounded accumulation of those tempfiles, except that such a scan would
    be expensive to perform and would rarely yield many results.

  • another option is to make a backup copy of the entire share, apply
    in-place edits to the old version, then delete the backup (and establish
    a recovery procedure that looks for backup copies and uses them to replace
    the presumably-incompletely-edited original; a sketch of such a recovery
    scan follows this list). This would be easier to implement if the backup
    copies are all placed in a single central directory, so the recovery
    process can scan for them quickly, perhaps in storage/shares/updates/$SI.
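
Here is a sketch of the recovery scan mentioned in the second bullet, assuming the storage/shares/updates/$SI layout proposed there and a hypothetical helper that maps a storage index and share number to the live share's path:

```python
# Hypothetical startup recovery scan: any backup left under
# storage/shares/updates/ means the live share was being edited in place
# when we died, so the pre-edit backup replaces the half-edited original.
import os

UPDATES_DIR = "storage/shares/updates"

def recover_interrupted_updates(share_path_for):
    # share_path_for(si, shnum) -> path of the live share (assumed helper).
    if not os.path.isdir(UPDATES_DIR):
        return
    for si in os.listdir(UPDATES_DIR):
        si_dir = os.path.join(UPDATES_DIR, si)
        for shnum in os.listdir(si_dir):
            backup = os.path.join(si_dir, shnum)
            os.replace(backup, share_path_for(si, shnum))   # roll back
        os.rmdir(si_dir)
```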

However, my suspicion is that edit-in-place is the appropriate tradeoff,
because that will lead to simpler code (i.e. fewer bugs) and better
performance, while only making us vulnerable to share corruption during the
rare events that don't give the server time to finish its write() calls (i.e.
kernel crash, power loss, and SIGKILL). Similarly, I suspect that it is not
appropriate to call fsync(), because we lose performance everywhere but only
improve correctness in the kernel crash and power loss scenarios. (a graceful
kernel shutdown, or arbitrary process shutdown followed by enough time for
the kernel/filesystem to flush its buffers, would provide for all write()s to
be flushed even without a single fsync() call).

warner changed title from writing of shares is fragile to writing of shares is fragile and "tahoe stop" is unnecessarily harsh 2009-11-02 08:16:25 +00:00
zooko (Author) commented:

I'm sorry if this topic makes you feel unhappy. For what it is worth, I am satisfied with the current behavior: dumb writes, stupid shutdown, simple startup. :-) This scores highest on simplicity, highest on performance, and not so great on preserving mutable shares.

This seems okay to me, because I consider shares to be expendable -- files are what we care about, and those are preserved by verification and repair at the Tahoe-LAFS layer rather than by having high-quality storage at the storage layer. allmydata.com uses cheap commodity PC kit, such as a 2 TB hard drive for a mere $200. Enterprise storage people consider it to be completely irresponsible and wrong to use such kit for "enterprise" purposes. They buy "enterprise" SCSI drives from their big equipment provider (Sun, HP, IBM) with something like 300 GB capacity for something like $500. Then they add RAID-5 or RAID-6 or RAID-Z, redundant power supplies, yadda yadda yadda.

So anyway, allmydata.com buys these commodity PCs -- basically the same hardware you can buy retail at Fry's or Newegg -- which are quite inexpensive and suffer a correspondingly higher failure rate. In one memorable incident, one of these 1U servers from SuperMicro failed in such a way that all four of the commodity 1 TB hard drives in it were destroyed. This means lots of mutable shares -- maybe something on the order of 10,000 mutable shares -- were destroyed in an instant! But none of the allmydata.com customer files were harmed.

The hard shutdown behavior that is currently in Tahoe-LAFS would have to be exercised quite a lot while under high load before it would come close to destroying that many mutable shares. :-)

I would accept changing it to do robust writes such as the simple "write-new-then-relink-into-place". (My guess is that this will not cause a noticeable performance degradation.)

I would accept changing it to do traditional unixy two-phase graceful shutdown as you describe, with misgivings, as I think I've already made clear to you in personal conversation and in comment:/tahoe-lafs/trac-2024-07-25/issues/5243:8.

To sum up my misgivings: 1. our handling of hard shutdown (e.g. power off, out of disk space, kernel crash) is not thereby improved, and 2. if we come to rely on "graceful shutdown" then our "robust startup" muscles atrophy.

Consider this: we currently have no automated tests of what happens when servers get shut down in the middle of their work. So we should worry that as the code evolves, someone could commit a patch which causes bad behavior in that case and we wouldn't notice.

However, we do know that every time anyone runs `tahoe stop` or `tahoe restart` it exercises the hard shutdown case. The fact that allmydata.com has had hundreds of servers with this behavior for years gives me increased confidence that the current code doesn't do anything catastrophically wrong in this case.

If we improved tahoe stop to be a graceful shutdown instead of a hard shutdown, then of course the current version of Tahoe-LAFS would still be just as good as ever, but as time went on and the code evolved I would start worrying more and more about how tahoe servers handle the hard shutdown case. Maybe this means we need automated tests of that case.
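
As a starting point, such a test doesn't even need the real storage server: a standalone sketch like the one below SIGKILLs a child that keeps updating a file via write-new-then-rename and asserts that the survivor is never torn (POSIX assumed; the writer loop and the one-second delay are arbitrary). The same harness pointed at a node doing in-place edits would need a different oracle, e.g. a share-verification pass.

```python
# Standalone sketch of a hard-shutdown test (POSIX assumed, not a Tahoe
# test): SIGKILL a child that keeps rewriting a file via
# write-new-then-rename, then check the survivor is never torn.
import os
import signal
import subprocess
import sys
import tempfile
import time

WRITER = r"""
import os, sys
path = sys.argv[1]
version = 0
while True:
    version += 1
    body = ("version %d\n" % version) * 1000
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(body)
    os.replace(tmp, path)   # readers see the old or new version, never a mix
"""

def test_kill_during_updates():
    with tempfile.TemporaryDirectory() as d:
        share = os.path.join(d, "share")
        proc = subprocess.Popen([sys.executable, "-c", WRITER, share])
        time.sleep(1.0)                      # let it churn through versions
        proc.send_signal(signal.SIGKILL)
        proc.wait()
        with open(share) as f:
            lines = f.read().splitlines()
        # Every surviving line should name the same version.
        assert lines and len(set(lines)) == 1, "share was torn/corrupted"

if __name__ == "__main__":
    test_kill_during_updates()
    print("ok")
```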

Reference: tahoe-lafs/trac-2024-07-25#200