Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information #755
Reference: tahoe-lafs/trac-2024-07-25#755
If I do a deep-check on a directory, I start getting results reported on the web page showing the files and subdirectories within that directory. Reloading (or waiting for the automatic self-reload) shows more and more results. That continues until the check reaches a subdirectory that is unrecoverable, at which point the web page containing the deep-check results is replaced with a web page saying only this:
To close this ticket, make it so that I can still see all the other results that have already been generated, plus further results about other files and subdirectories that haven't yet been checked, even while there is an unrecoverable subdirectory present.
I'm using the current trunk: 1.4.1-r3982.
Brian: are you willing to take this ticket?
yeah, I'll work on this. Basically traversal failures during a deep-check or deep-repair operation should increment a counter and move on, instead of throwing an exception and stopping the walker. I don't know if I can finish it in time for 1.5.0 though.
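The "increment a counter and move on" behaviour described above can be sketched roughly as follows. This is only an illustration on a toy in-memory tree, not the actual Tahoe-LAFS walker: `Node`, `WalkStats`, `UnrecoverableError` and `deep_walk` are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


class UnrecoverableError(Exception):
    """Hypothetical stand-in for whatever a failed directory read raises."""


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    readable: bool = True  # False models an unrecoverable directory

    def list_children(self):
        if not self.readable:
            raise UnrecoverableError(self.name)
        return self.children


@dataclass
class WalkStats:
    visited: list = field(default_factory=list)
    failures: list = field(default_factory=list)  # (path, error text) pairs


def deep_walk(root):
    """Visit every reachable node; record traversal failures and keep going."""
    stats = WalkStats()
    stack = [(root.name, root)]
    while stack:
        path, node = stack.pop()
        stats.visited.append(path)
        try:
            children = node.list_children()
        except UnrecoverableError as e:
            # Count the failure, remember which path it was, and move on
            # instead of letting the exception stop the whole walk.
            stats.failures.append((path, str(e)))
            continue
        for child in reversed(children):
            stack.append((f"{path}/{child.name}", child))
    return stats


tree = Node("root", [
    Node("a.txt"),
    Node("broken", [Node("lost.txt")], readable=False),
    Node("docs", [Node("b.txt")]),
])
stats = deep_walk(tree)
print(stats.visited)
print(stats.failures)
```

Recording the failing path alongside the error is what would let a report say which subdirectory was unrecoverable, rather than only that something failed.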
This isn't really a blocker for v1.5.0.
On the mailing list Ludo reported:
Is this an example of the issue in this ticket?
By the way, see also #583 (repairer: test cancel, upload failure, download failure).
I just got bitten by this bug again. I have a directory (on the volunteergrid) that has an unrecoverable subdirectory in it. When I do a deep check in the WUI then it shows useful information about the other contents of the directory until it reaches that subdirectory, at which point I lose the other information. Also, the resulting error message doesn't tell me any identifying information about which file or directory was unrecoverable!
This is persistently causing problems for me. I have several important directory structures in which some of the directories or files are sometimes unrecoverable. I really need to be able to see information about the rest of them even at these times. Raising priority to critical to remind myself that I really care about this.
Unifying this with #880; this ticket now covers both CLI and WUI.
Title changed from "if there is an unrecoverable subdirectory, the web deep-check report loses other information" to "if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information".
This might be too ambitious to finish for v1.6.1. I would like to get v1.6.1 released this coming weekend of 2010-02-20 so that people who have started packaging or deploying v1.6.0 have the option of quickly upgrading to v1.6.1 before their packages/deployments of v1.6.0 spread too far.
However, I'm leaving it in the Milestone v1.6.1 for now because I don't object to fixing it in v1.6.1.
We're not going to fix this in time for v1.6.1. Hopefully in time for v1.7.0!
This is one of our more commonly encountered usability problems, so I think it should be a priority for 1.9.0.
I'm willing to try to fix this bug.
Attachment 755-fix-for-review.diff (4524 bytes) added
The patch 755-fix-for-review.diff is how I intend to fix this bug. The associated tests are still being worked on.
Attachment patch-755.darcs.diff (30858 bytes) added
The patch patch-755.darcs.diff contains the fix for this issue and associated tests.
Good patch! I like the approach of making filenode.check_and_repair() signal inability to repair by returning CheckAndRepairResults.repair_successful=False instead of by throwing an exception. A few things I'd like to see changed:
we usually repair files that are unhealthy but recoverable. If repair fails, the file should still be recoverable. The post-repair-results are pessimistically being set to healthy=False recoverable=False needs_rebalancing=False, when it's probably (and sometimes certainly) more accurate to copy these values from the pre-repair-results. In particular, we shouldn't scare users into thinking that repair failures of "scratched" files (unhealthy but recoverable) indicate unrecoverable files: this makes benign things like UnhappinessError look like data loss. This should be fixed in both mutable and immutable files.
the newly-enabled test in test_repairer.Repairer.test_harness (which previously got a self.shouldFail()) should be slightly
it's probably worth checking the code coverage when we exercise test_mutable and make sure the new code is getting run
do we have any tests that confirm deep-repair on a tree with an
test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()
Otherwise, looks great! With those few changes we can land this one for 1.8.2!
Replying to warner:
+1
If there's a failure, then we don't know whether the file is healthy, recoverable or needs rebalancing. Shouldn't unknown fields simply be missing from the results?
(Note: needs_rebalancing=False is not pessimistic.)
I thought it was both.
Thanks for the review! My comments are inline.
Replying to warner:
I agree with what davidsarah said in comment:23: it is difficult to know the actual status when an exception was raised during the check operation. However, it seems that simply removing the fields from the results would necessitate other changes, because I guess that many parts of the code expect them to be present.
What would you think about setting healthy to its value before the repair (most likely False) and the other fields to None?
Something along those lines?
Good point, will be done in the next patch.
Will be done in the next patch.
I don't know either, will try to look in details into this.
I don't remember how the code coverage infrastructure in the build system actually works. It would be very kind of you to tell me which command I should run.
This is what I think calling do_web_stream_check() inside DeepCheckWebBad.test_bad() should be doing, isn't it?
Yes, the traversal must continue in both cases. I was under the impression that unrecoverable immutable files were already supported, and I understand this issue as being about unrecoverable dirnodes.
Replying to francois (comment:24):
Ok, but set_recoverable() and set_needs_rebalancing() should be copied from the pre-repair values too. For immutable files it's certainly the case that repair cannot make things any worse, so if the file was recoverable before repair, it will be recoverable afterwards too. For mutable files, it's fuzzier, but once we get #1209 fixed, then repair that doesn't involve UCWE collisions or multiple versions should be strictly an improvement too. I think set_needs_rebalancing() is roughly the same.
My big concern is doing a deep-repair while you're missing a few
servers: all files are missing a few shares, so they aren't healthy and
we try to repair them, but you're missing too many servers to
successfully meet the servers-of-happiness threshold, so repair fails.
On every single file. All the files are actually recoverable, but the
post-repair results suggest that they are not. What I want to avoid is
the deep-repair summary message telling users that 4000 out of 4000
files are now unrecoverable and scaring the socks off them.
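To make the concern concrete, here is a minimal sketch of copying the pre-repair values into the post-repair results when repair fails. The field layout is an assumption made for illustration: only the name CheckAndRepairResults and the healthy / recoverable / needs_rebalancing / repair_successful attributes come from this discussion; `CheckResults`, `results_after_failed_repair` and `count_unrecoverable` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CheckResults:
    healthy: bool
    recoverable: bool
    needs_rebalancing: bool


@dataclass(frozen=True)
class CheckAndRepairResults:
    pre_repair: CheckResults
    post_repair: CheckResults
    repair_attempted: bool
    repair_successful: bool


def results_after_failed_repair(pre):
    # Assumption from the thread: a failed repair does not make the file
    # worse, so the post-repair state is a copy of the pre-repair state.
    return CheckAndRepairResults(
        pre_repair=pre,
        post_repair=replace(pre),
        repair_attempted=True,
        repair_successful=False,
    )


def count_unrecoverable(results):
    """Number of files the summary would report as unrecoverable."""
    return sum(1 for r in results if not r.post_repair.recoverable)


# 4000 "scratched" files: unhealthy but recoverable, and repair fails on all.
scratched = CheckResults(healthy=False, recoverable=True, needs_rebalancing=False)
all_results = [results_after_failed_repair(scratched) for _ in range(4000)]

# For comparison, the pessimistic variant that the review objects to.
pessimistic = [
    replace(r, post_repair=CheckResults(False, False, False)) for r in all_results
]

print(count_unrecoverable(all_results), "of", len(all_results))
print(count_unrecoverable(pessimistic), "of", len(pessimistic))
```

With the copied values the summary reports failed repairs without claiming data loss; the pessimistic variant reports every file as unrecoverable even though none of them is.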
I usually do 'make quicktest-coverage', but I think "python setup.py trial --coverage" (or perhaps "python setup.py trial --coverage --test-suite test_mutable" to be a bit more selective) should do the same. That will create a .coverage file with the raw data. "make coverage-output", or following the commands listed in that section of the Makefile, will give you an HTML summary with color-coded source lines.
I think that's mostly correct: it looks like set_up_damaged_tree() creates a root directory with 8 files (half mutable, half immutable), some of which are unrecoverable. But 1: do_web_stream_check() doesn't attempt repair, merely deep-check, and 2: there are no directories in that root, only files. Adding an unrecoverable directory is the important bit, since I think deep-repair and deep-check have enough common code paths that exercising deep-check is sufficient. (note that I think the 'broken' directory set up there is not used by do_web_stream_check()).
Yeah, do_web_stream_check() should cover the unrecoverable-immutable-file case (well, unless there's a difference in behavior between a web-based t=stream-deep-check and an internal dirnode-based dirnode.start_deep_check(), which is worth testing).
So I agree, unrecoverable dirnodes is the important thing to check.
So my hunch here is that we should add an unrecoverable directory to the 'root' tree created in set_up_damaged_tree(), and adjust the counters to match, and then maybe we should get rid of the 'broken' tree and do_deepcheck_broken().
BTW, if we get a patch for this on monday, I'll review and land it, and it'll be in 1.8.2. If it's not ready by monday or tuesday, then we may need to push it out until after 1.8.2. I want to make sure we get at least a few days of testing on this, since it's kind of invasive.
Replying to warner:
I guess that it's going to have to wait until after 1.8.2 because spare time in the coming week looks pretty scarce.
This needs some work to address the comments and to be rebased to trunk, but has a good chance of getting into 1.9.
I have a patch in progress that builds on patch-755.darcs.diff and fixes the review comments, including skipping unrecoverable directories and including information that they've been skipped in the output. It's not ready for 1.9 though.
I'll try to find the patch mentioned in comment:71987, but if I haven't done so in two weeks, it can be assumed that I've lost it.
#1955 was a duplicate.
#2337 was a duplicate.
Title changed from "if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information" to "Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information".
Kyle Markley wrote on tahoe-dev:
Kyle: It won't have to re-upload all the healthy files. The deduplication algorithm will find that the data for any unchanged files is already available and will re-use whatever shares it can. It'll just take a bit longer to run because it'll have to scan and encode every file.
Meanwhile: I just lost a bunch of stuff because I didn't know about this issue and assumed a deep-check --repair --add-lease cronjob would take care of things. One file near the beginning of the directory structure got damaged somehow, so neither repair nor leasing was done on the rest, and by the time I came back to check on it, chunks had expired and been deleted and I have to re-upload everything, which will take about a month.
This bug has been open for almost 8 years, and I see a patch for it in the discussion thread... If it's not going to be fixed in the next release, I recommend adding a warning about it to the documentation so new users don't do something stupid like expect the repair operation to behave in a sane manner.
As a work-around, I use:
This, of course, requires time and CPU to start a separate instance of the tahoe program for every data object being checked, so going over the entire directory takes days instead of hours, but at least it actually works.
Ok, so tahoe manifest also gives up on the first error it encounters, it just only encounters errors on damaged directories. But it will still bite you hard if you are actually stupid enough to rely on it.
So I've resorted to the following bash script:
The careful observer will notice that this script calls "check --add-lease" first and then only calls --repair if that returns an error. This is due to another bug in the --repair functionality which I will be filing shortly.
Is making deep-check note the unrepairable nodes, but then continue to check the rest of the tree really that difficult? I wouldn't think the average user should have to resort to writing their own tools to avoid cascade failures of the storage system...
If you guys want to bundle this tool or some clone or variant thereof into your packages you are more than welcome to do so. We need something to actually keep people's data safe until this bug is fixed.
Edit: Oh for Pete's Sake! tahoe check exits with a 0 even when the checked objects are unhealthy, so I have to scan the output myself to assess it. I sense that at some point I'm going to need to rewrite this in Python or something and use the REST API. Hopefully that's at least somewhat sane...
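A Python rewrite of the per-object approach described above might be structured along these lines. This is a hedged sketch, not the bash script from this comment: `run_check` and `run_repair` are hypothetical callables that would wrap either the tahoe CLI or the web API, and the JSON shape with a `results.healthy` field is an assumption about the checker output that should be verified against a real node. Health is judged from the parsed output, not from an exit code, and one bad object never stops the loop.

```python
import json


def is_healthy(raw_output):
    """Return True only if the output parses as JSON and reports healthy.

    Assumed shape (verify against a real node): {"results": {"healthy": bool}}
    """
    try:
        data = json.loads(raw_output)
    except ValueError:
        return False
    if not isinstance(data, dict):
        return False
    results = data.get("results")
    return isinstance(results, dict) and results.get("healthy") is True


def check_all(caps, run_check, run_repair):
    """Check every cap; repair only the unhealthy ones; never stop early."""
    report = {"healthy": [], "repaired": [], "failed": []}
    for cap in caps:
        try:
            if is_healthy(run_check(cap)):
                report["healthy"].append(cap)
            elif is_healthy(run_repair(cap)):
                report["repaired"].append(cap)
            else:
                report["failed"].append(cap)
        except Exception:
            # Any error on one object is recorded and the walk continues.
            report["failed"].append(cap)
    return report


# Fake runners standing in for real check/repair calls, for demonstration.
fake_state = {"cap-a": "ok", "cap-b": "fixable", "cap-c": "broken", "cap-d": "error"}


def fake_check(cap):
    if fake_state[cap] == "error":
        raise RuntimeError("node unreachable")
    return json.dumps({"results": {"healthy": fake_state[cap] == "ok"}})


def fake_repair(cap):
    return json.dumps({"results": {"healthy": fake_state[cap] == "fixable"}})


report = check_all(list(fake_state), fake_check, fake_repair)
print(report)
```

Keeping the runners injectable means the same loop can be exercised without a grid, and the "failed" list gives a cron job something concrete to alert on.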