Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information #755

Open
opened 2009-07-11 21:00:14 +00:00 by zooko · 28 comments

If I do a deep-check on a directory, I start getting results reported on the web page showing the files and subdirectories within that directory. Reloading (or waiting for the automatic self-reload) shows more and more results, until the walk reaches an unrecoverable subdirectory, at which point the web page containing the deep-check results is replaced with a web page saying only this:

UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.

To close this ticket, make it so that I can still see all the other results that have already been generated, plus further results about other files and subdirectories that haven't yet been checked, even while there is an unrecoverable subdirectory present.

I'm using the current trunk: 1.4.1-r3982.

Brian: are you willing to take this ticket?

zooko added the
code-frontend-web
major
defect
1.4.1
labels 2009-07-11 21:00:14 +00:00
zooko added this to the 1.5.0 milestone 2009-07-11 21:00:14 +00:00
warner was assigned by zooko 2009-07-11 21:00:14 +00:00

yeah, I'll work on this. Basically traversal failures during a deep-check or deep-repair operation should increment a counter and move on, instead of throwing an exception and stopping the walker. I don't know if I can finish it in time for 1.5.0 though.
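
The "increment a counter and move on" behaviour can be sketched roughly as below. This is a simplified, synchronous illustration only: the real walker is Deferred-based, and `UnrecoverableError`, `deep_walk`, `list_children`, and the toy `TREE` are hypothetical names invented for this sketch, not Tahoe-LAFS APIs.

```python
class UnrecoverableError(Exception):
    """Hypothetical stand-in for UnrecoverableFileError (sketch only)."""


def deep_walk(root, list_children):
    """Walk a tree depth-first, recording unreadable nodes instead of aborting.

    list_children(node) returns the children of a node, or raises
    UnrecoverableError if the node cannot be read.
    """
    visited = []
    failures = []  # nodes whose contents could not be retrieved
    stack = [root]
    while stack:
        node = stack.pop()
        visited.append(node)
        try:
            children = list_children(node)
        except UnrecoverableError:
            failures.append(node)  # note the failure and keep walking
            continue
        stack.extend(children)
    return visited, failures


# Toy tree: "root/bad" is unreadable, everything else is fine.
TREE = {
    "root": ["root/a", "root/bad", "root/c"],
    "root/a": [],
    "root/c": ["root/c/d"],
    "root/c/d": [],
}


def list_children(node):
    if node == "root/bad":
        raise UnrecoverableError(node)
    return TREE[node]


visited, failures = deep_walk("root", list_children)
```

With this shape, the report can still list every node that was reached, and the failure list says which node was unrecoverable.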

warner added
code-dirnodes
and removed
code-frontend-web
labels 2009-07-11 23:28:09 +00:00
Author

This isn't really a blocker for v1.5.0.

zooko modified the milestone from 1.5.0 to eventually 2009-07-15 05:24:36 +00:00
Author

On the mailing list (http://allmydata.org/pipermail/tahoe-dev/2009-August/002593.html) Ludo reported:

$ tahoe deep-check
ERROR: UnrecoverableFileError(no recoverable versions)
[Failure instance: Traceback: <class 'allmydata.mutable.common.UnrecoverableFileError'>: no recoverable versions
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/base.py:757:runUntilCurrent
/nix/store/nk39m80fi7ll7460713djzw3qzwgb4kr-python-foolscap-0.4.2/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py:26:_turn
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:243:callback
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:312:_startRunCallbacks
--- <exception caught here> ---
/nix/store/i1bkz12nx2vbih8aj37c9gpqnzbjshkx-python-twisted-8.2.0/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-x86_64.egg/twisted/internet/defer.py:328:_runCallbacks
/nix/store/yj6q079b58rfnnf8g70ib5vaah6gxlhq-tahoe-1.5.0/lib/python2.5/site-packages/allmydata_tahoe-1.5.0-py2.5.egg/allmydata/mutable/filenode.py:312:_once_updated_download_best_version

Is this an example of the issue in this ticket?

By the way, see also #583 (repairer: test cancel, upload failure, download failure).

Author

I just got bitten by this bug again. I have a directory (on the volunteergrid) that has an unrecoverable subdirectory in it. When I do a deep check in the WUI then it shows useful information about the other contents of the directory until it reaches that subdirectory, at which point I lose the other information. Also, the resulting error message doesn't tell me any identifying information about which file or directory was unrecoverable!

UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.
tahoe-lafs modified the milestone from eventually to 1.7.0 2010-02-02 03:08:44 +00:00
Author

This is persistently causing problems for me. I have several important directory structures in which some of the directories or files are sometimes unrecoverable. I really need to be able to see information about the rest of them even at these times. Raising priority to critical to remind myself that I really care about this.

zooko added
critical
and removed
major
labels 2010-02-14 20:36:22 +00:00
tahoe-lafs modified the milestone from 1.7.0 to 1.6.1 2010-02-15 18:51:02 +00:00
davidsarah commented 2010-02-15 19:50:05 +00:00
Owner

Unifying this with #880; this ticket now covers both CLI and WUI.

tahoe-lafs changed title from if there is an unrecoverable subdirectory, the web deep-check report loses other information to if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information 2010-02-15 19:50:05 +00:00
Author

This might be too ambitious to finish for v1.6.1. I would like to get v1.6.1 released this coming weekend of 2010-02-20 so that people who have started packaging or deploying v1.6.0 have the option of quickly upgrading to v1.6.1 before their packages/deployments of v1.6.0 spread too far.

However, I'm leaving it in the Milestone v1.6.1 for now because I don't object to fixing it in v1.6.1.

Author

We're not going to fix this in time for v1.6.1. Hopefully in time for v1.7.0!

zooko modified the milestone from 1.6.1 to 1.7.0 2010-02-22 05:04:34 +00:00
zooko modified the milestone from 1.7.0 to eventually 2010-05-16 23:40:04 +00:00
tahoe-lafs modified the milestone from eventually to soon 2010-05-17 02:15:24 +00:00
davidsarah commented 2010-10-28 23:14:39 +00:00
Owner

This is one of our more commonly encountered usability problems, so I think it should be a priority for 1.9.0.

tahoe-lafs modified the milestone from soon to 1.9.0 2010-10-28 23:14:39 +00:00
francois commented 2010-11-01 11:12:49 +00:00
Owner

I'm willing to try to fix this bug.

francois commented 2010-11-20 23:42:41 +00:00
Owner

Attachment 755-fix-for-review.diff (4524 bytes) added

francois commented 2010-11-20 23:44:34 +00:00
Owner

The patch 755-fix-for-review.diff is how I intend to fix this bug. The associated tests are still being worked on.

francois commented 2010-11-21 22:46:52 +00:00
Owner

Attachment patch-755.darcs.diff (30858 bytes) added

francois commented 2010-11-21 22:48:49 +00:00
Owner

The patch patch-755.darcs.diff contains the fix for this issue and associated tests.

tahoe-lafs modified the milestone from 1.9.0 to 1.8.2 2011-01-06 00:31:29 +00:00

Good patch! I like the approach of making filenode.check_and_repair()
signal inability to repair by returning
CheckAndRepairResults.repair_successful=False instead of by
throwing an exception. A few things I'd like to see changed:

  • we usually repair files that are unhealthy but recoverable. If repair
    fails, the file should still be recoverable. The post-repair-results
    are pessimistically being set to healthy=False recoverable=False
    needs_rebalancing=False, when it's probably (and sometimes certainly)
    more accurate to copy these values from the pre-repair-results. In
    particular, we shouldn't scare users into thinking that repair
    failures of "scratched" files (unhealthy but recoverable) indicate
    unrecoverable files: this makes benign things like
    UnhappinessError look like data loss. This should be fixed in
    both mutable and immutable files.

  • the newly-enabled test in test_repairer.Repairer.test_harness
    (which previously got a self.shouldFail()) should be slightly
    enhanced to check the return value of check_and_repair(). We
    should verify that it has crr.repair_attempted=True,
    crr.repair_successful=False, and
    crr.post_repair_results.recoverable=False

  • we should add a similar test for mutable files that have had 8 shares
    deleted. There's something awfully close in
    test_mutable.Repair.test_unrepairable_1share .. it should be
    changed to use self._fn.check_and_repair() instead of
    self._fn.repair(). To be honest, I'm not sure why that test was
    passing before, because from what I can tell it should have been
    behaving the same way as immutable repair on an unrecoverable file.

  • it's probably worth checking the code coverage when we exercise
    test_mutable and make sure the new code is getting run

  • do we have any tests that confirm deep-repair on a tree with an
    unrecoverable file (or directory) makes it through to the end without
    an errback? We probably do but I'd like to be sure.. probably
    something in test_deepcheck exercises this.

  • I see test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()
    asserts that an unrecoverable dirnode causes the traversal to halt. Is
    this what we want? Is this ticket about making sure an unrecoverable
    file doesn't halt a deep-repair, or about an unrecoverable
    dirnode? (broken dirnodes are more significant than files, because
    it means you've probably lost access to even more data). We certainly
    want the deep-traversal to keep going and repair more things, but we
    also need to make sure the user learns about the dead dirnode.

Otherwise, looks great! With those few changes we can land this one for
1.8.2!
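
For readers following along, the "return a result instead of throwing" pattern praised above can be illustrated with a small synchronous sketch. The class and function here are simplified, hypothetical stand-ins that only borrow the attribute names mentioned in this comment; the actual patch and the real `CheckAndRepairResults` class are Deferred-based and are not reproduced here.

```python
class CheckAndRepairResults:
    """Simplified stand-in (sketch only) carrying the flags discussed above."""

    def __init__(self):
        self.repair_attempted = False
        self.repair_successful = False
        self.repair_failure = None


def check_and_repair(is_healthy, repair):
    """Run repair() if needed and report the outcome without raising.

    is_healthy: result of a prior check (bool).
    repair: zero-argument callable that raises on failure.
    """
    crr = CheckAndRepairResults()
    if is_healthy:
        return crr  # nothing to do, no repair attempted
    crr.repair_attempted = True
    try:
        repair()
    except Exception as e:
        crr.repair_failure = e  # keep the cause for the report
    else:
        crr.repair_successful = True
    return crr


def failing_repair():
    raise RuntimeError("not enough servers")


crr = check_and_repair(is_healthy=False, repair=failing_repair)
```

A test in this style asserts on the flags of the returned object rather than expecting an exception, which is the kind of check requested for `test_harness` above.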

davidsarah commented 2011-01-07 05:55:02 +00:00
Owner

Replying to warner:

> Good patch! I like the approach of making filenode.check_and_repair()
> signal inability to repair by returning
> CheckAndRepairResults.repair_successful=False instead of by
> throwing an exception.

+1

> A few things I'd like to see changed:
>
>   • we usually repair files that are unhealthy but recoverable. If repair
>     fails, the file should still be recoverable. The post-repair-results
>     are pessimistically being set to healthy=False recoverable=False
>     needs_rebalancing=False, when it's probably (and sometimes certainly)
>     more accurate to copy these values from the pre-repair-results.

If there's a failure, then we don't know whether the file is healthy, recoverable or needs rebalancing. Shouldn't unknown fields simply be missing from the results?

(Note: needs_rebalancing=False is not pessimistic.)

>   • I see test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()
>     asserts that an unrecoverable dirnode causes the traversal to halt. Is
>     this what we want? Is this ticket about making sure an unrecoverable
>     file doesn't halt a deep-repair, or about an unrecoverable
>     dirnode?

I thought it was both.

francois commented 2011-01-15 16:20:04 +00:00
Owner

Thanks for the review! My comments are inline.

Replying to warner:

>   • we usually repair files that are unhealthy but recoverable. If repair
>     fails, the file should still be recoverable. The post-repair-results
>     are pessimistically being set to healthy=False recoverable=False
>     needs_rebalancing=False, when it's probably (and sometimes certainly)
>     more accurate to copy these values from the pre-repair-results.

I agree with what davidsarah said in comment:23: it is
difficult to know the actual status when an exception was raised during
the check operation. However, it seems that simply removing the fields
from the results would necessitate other changes, because I guess that
many parts of the code expect them to be present.

What would you think about setting healthy to its value before the
repair (most likely False) and the other fields to None?
Something along these lines?

  # errback for a failed repair; cr and crr are taken from the enclosing scope
  def _repair_error(f):
    prr = CheckResults(cr.uri, cr.storage_index)
    prr.data = copy.deepcopy(cr.data)
    # healthy keeps its pre-repair value; the other fields are unknown
    prr.set_healthy(crr.pre_repair_results.is_healthy())
    prr.set_recoverable(None)
    prr.set_needs_rebalancing(None)
    crr.post_repair_results = prr
    crr.repair_successful = False
    crr.repair_failure = f
    return crr

>   • the newly-enabled test in test_repairer.Repairer.test_harness
>     (which previously got a self.shouldFail()) should be slightly
>     enhanced to check the return value of check_and_repair(). We
>     should verify that it has crr.repair_attempted=True,
>     crr.repair_successful=False, and
>     crr.post_repair_results.recoverable=False

Good point, will be done in the next patch.

>   • we should add a similar test for mutable files that have had 8 shares
>     deleted. There's something awfully close in
>     test_mutable.Repair.test_unrepairable_1share .. it should be
>     changed to use self._fn.check_and_repair() instead of
>     self._fn.repair().

Will be done in the next patch.

> To be honest, I'm not sure why that test was passing before, because
> from what I can tell it should have been behaving the same way as
> immutable repair on an unrecoverable file.

I don't know either; I will try to look into this in detail.

>   • it's probably worth checking the code coverage when we exercise
>     test_mutable and make sure the new code is getting run

I don't remember how the code coverage infrastructure in the build
system actually works. Could you tell me which command I should run?

>   • do we have any tests that confirm deep-repair on a tree with an
>     unrecoverable file (or directory) makes it through to the end without
>     an errback? We probably do but I'd like to be sure.. probably
>     something in test_deepcheck exercises this.

This is what I think calling do_web_stream_check() inside
DeepCheckWebBad.test_bad() should be doing, isn't it?

>   • I see test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()
>     asserts that an unrecoverable dirnode causes the traversal to halt. Is
>     this what we want? Is this ticket about making sure an unrecoverable
>     file doesn't halt a deep-repair, or about an unrecoverable
>     dirnode? (broken dirnodes are more significant than files, because
>     it means you've probably lost access to even more data). We certainly
>     want the deep-traversal to keep going and repair more things, but we
>     also need to make sure the user learns about the dead dirnode.

Yes, the traversal must continue in both cases. I was under the impression that unrecoverable immutable files were already supported, and I understand this issue as being about unrecoverable dirnodes.


Replying to francois (comment:24):

> Thanks for the review! My comments are inline.
>
> Replying to warner:
>
> >   • we usually repair files that are unhealthy but recoverable. If
> >     repair fails, the file should still be recoverable. The
> >     post-repair-results are pessimistically being set to healthy=False
> >     recoverable=False needs_rebalancing=False, when it's probably (and
> >     sometimes certainly) more accurate to copy these values from the
> >     pre-repair-results.
>
> I agree with what davidsarah said in comment:23: it is difficult to
> know the actual status when an exception was raised during the check
> operation. However, it seems that simply removing the fields from the
> results would necessitate other changes, because I guess that many
> parts of the code expect them to be present.
>
> What would you think about setting healthy to its value before the
> repair (most likely False) and the other fields to None?
> Something along these lines?
>
>   def _repair_error(f):
>     prr = CheckResults(cr.uri, cr.storage_index)
>     prr.data = copy.deepcopy(cr.data)
>     prr.set_healthy(crr.pre_repair_results.is_healthy())
>     prr.set_recoverable(None)
>     prr.set_needs_rebalancing(None)
>     crr.post_repair_results = prr
>     crr.repair_successful = False
>     crr.repair_failure = f
>     return crr

Ok, but set_recoverable() and set_needs_rebalancing() should
be copied from the pre-repair values too. For immutable files it's
certainly the case that repair cannot make things any worse, so if the
file was recoverable before repair, it will be recoverable afterwards
too. For mutable files, it's fuzzier, but once we get #1209 fixed, then
repair that doesn't involve UCWE collisions or multiple versions should
be strictly an improvement too. I think set_needs_rebalancing() is
roughly the same.

My big concern is doing a deep-repair while you're missing a few
servers: all files are missing a few shares, so they aren't healthy and
we try to repair them, but you're missing too many servers to
successfully meet the servers-of-happiness threshold, so repair fails.
On every single file. All the files are actually recoverable, but the
post-repair results suggest that they are not. What I want to avoid is
the deep-repair summary message telling users that 4000 out of 4000
files are now unrecoverable and scaring the socks off them.
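
To make the concern concrete, here is a minimal sketch of the "carry the pre-repair values forward" policy argued for above. `Status` and `post_repair_status_on_failure` are hypothetical names for this illustration and do not correspond to the real `CheckResults` API; the 4000-file figure is taken from the scenario described above.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Hypothetical stand-in for the three flags discussed (sketch only)."""
    healthy: bool
    recoverable: bool
    needs_rebalancing: bool


def post_repair_status_on_failure(pre):
    """After a failed repair, assume nothing got worse: copy the pre-repair flags."""
    return Status(pre.healthy, pre.recoverable, pre.needs_rebalancing)


# Scenario from the comment: every file is missing a few shares
# (unhealthy but recoverable) and every repair attempt fails.
pre_repair = [Status(healthy=False, recoverable=True, needs_rebalancing=False)
              for _ in range(4000)]
post_repair = [post_repair_status_on_failure(s) for s in pre_repair]

unrecoverable_count = sum(1 for s in post_repair if not s.recoverable)
unhealthy_count = sum(1 for s in post_repair if not s.healthy)
```

Under this policy the summary would report 4000 unhealthy files and no unrecoverable ones, rather than presenting failed repairs as data loss.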

> >   • it's probably worth checking the code coverage when we exercise
> >     test_mutable and make sure the new code is getting run
>
> I don't remember how the code coverage infrastructure in the build
> system actually works. Could you tell me which command I should run?

I usually do "make quicktest-coverage", but I think
"python setup.py trial --coverage" (or perhaps
"python setup.py trial --coverage --test-suite test_mutable" to be a bit
more selective) should do the same. That will create a .coverage file
with the raw data. "make coverage-output", or following the commands
listed in that section of the Makefile, will give you an HTML summary
with color-coded source lines.

> >   • do we have any tests that confirm deep-repair on a tree with an
> >     unrecoverable file (or directory) makes it through to the end
> >     without an errback? We probably do but I'd like to be sure..
> >     probably something in test_deepcheck exercises this.
>
> This is what I think calling do_web_stream_check() inside
> DeepCheckWebBad.test_bad() should be doing, isn't it?

I think that's mostly correct: it looks like set_up_damaged_tree()
creates a root directory with 8 files (half mutable, half immutable),
some of which are unrecoverable. But 1: do_web_stream_check()
doesn't attempt repair, merely deep-check, and 2: there are no
directories in that root, only files. Adding an unrecoverable directory
is the important bit, since I think deep-repair and deep-check have
enough common code paths that exercising deep-check is sufficient. (note
that I think the 'broken' directory set up there is not used by
do_web_stream_check()).

> >   • I see
> >     test_deepcheck.py:DeepCheckWebBad.do_deepcheck_broken()
> >     asserts that an unrecoverable dirnode causes the traversal to
> >     halt. Is this what we want? Is this ticket about making sure an
> >     unrecoverable file doesn't halt a deep-repair, or about an
> >     unrecoverable dirnode? (broken dirnodes are more significant
> >     than files, because it means you've probably lost access to even
> >     more data). We certainly want the deep-traversal to keep going and
> >     repair more things, but we also need to make sure the user learns
> >     about the dead dirnode.
>
> Yes, the traversal must continue in both cases. I was under the
> impression that unrecoverable immutable files were already supported,
> and I understand this issue as being about unrecoverable dirnodes.

Yeah, do_web_stream_check() should cover the
unrecoverable-immutable-file case (well, unless there's a difference in
behavior between a web-based t=stream-deep-check and an internal
dirnode-based dirnode.start_deep_check(), which is worth testing).
So I agree, unrecoverable dirnodes is the important thing to check.

So my hunch here is that we should add an unrecoverable directory to the
'root' tree created in set_up_damaged_tree(), and adjust the
counters to match, and then maybe we should get rid of the 'broken' tree
and do_deepcheck_broken().
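The behaviour such a test would pin down can be sketched roughly as below. This is not Tahoe's dirnode code: `Node`, `Report`, `UnrecoverableError` and `deep_check()` are hypothetical in-memory stand-ins, used only to show a traversal that records an unrecoverable directory and then carries on with its siblings.

```python
from dataclasses import dataclass, field
from typing import List


class UnrecoverableError(Exception):
    """Hypothetical stand-in for UnrecoverableFileError."""


@dataclass
class Node:
    # Hypothetical in-memory node, not a real Tahoe dirnode/filenode.
    name: str
    is_dir: bool = False
    healthy: bool = True
    recoverable: bool = True
    children: List["Node"] = field(default_factory=list)

    def list_children(self):
        # An unrecoverable directory cannot be read at all.
        if not self.recoverable:
            raise UnrecoverableError(self.name)
        return self.children


@dataclass
class Report:
    checked: list = field(default_factory=list)
    unrecoverable_dirs: list = field(default_factory=list)


def deep_check(root):
    """Walk the whole tree, noting unreadable directories instead of halting."""
    report = Report()
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        path = prefix + "/" + node.name if prefix else node.name
        report.checked.append((path, node.healthy, node.recoverable))
        if not node.is_dir:
            continue
        try:
            kids = node.list_children()
        except UnrecoverableError:
            # Tell the user about the dead dirnode, then keep going.
            report.unrecoverable_dirs.append(path)
            continue
        stack.extend((path, kid) for kid in kids)
    return report
```

The counters a real test adjusts would correspond to `report.checked` and `report.unrecoverable_dirs` here: the broken directory shows up in both, and everything next to it is still visited.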


BTW, if we get a patch for this on monday, I'll review and land it, and it'll be in 1.8.2. If it's not ready by monday or tuesday, then we may need to push it out until after 1.8.2. I want to make sure we get at least a few days of testing on this, since it's kind of invasive.

francois commented 2011-01-17 20:47:28 +00:00
Owner

Replying to warner:

BTW, if we get a patch for this on monday, I'll review and land it, and it'll be in 1.8.2. If it's not ready by monday or tuesday, then we may need to push it out until after 1.8.2. I want to make sure we get at least a few days of testing on this, since it's kind of invasive.

I guess that it's going to have to wait until after 1.8.2 because spare time in the coming week looks pretty scarce.

tahoe-lafs modified the milestone from 1.8.2 to 1.9.0 2011-01-17 20:47:41 +00:00
davidsarah commented 2011-07-16 20:49:20 +00:00
Owner

This needs some work to address the comments and to be rebased to trunk, but has a good chance of getting into 1.9.

davidsarah commented 2011-08-02 15:43:50 +00:00
Owner

I have a patch in progress that builds on patch-755.darcs.diff and fixes the review comments, including skipping unrecoverable directories and including information that they've been skipped in the output. It's not ready for 1.9 though.

tahoe-lafs modified the milestone from 1.9.0 to 1.10.0 2011-08-02 15:44:19 +00:00
davidsarah commented 2012-08-22 02:12:52 +00:00
Owner

I'll try to find the patch mentioned in comment:71987, but if I haven't done so in two weeks, it can be assumed that I've lost it.

daira commented 2013-04-26 02:00:35 +00:00
Owner

#1955 was a duplicate.

daira commented 2014-11-19 07:26:39 +00:00
Owner

#2337 was a duplicate.

zooko changed title from if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information to Allow deep-check to continue after error, and: if there is an unrecoverable subdirectory, the deep-check report (both WUI and CLI) loses other information 2015-02-03 17:43:33 +00:00
daira commented 2016-01-14 17:45:29 +00:00
Owner

Kyle Markley wrote on tahoe-dev:

When tahoe deep-check --repair encounters a file it can't repair, it stops without reporting anything about what file gave it trouble.
What do I do about this? I rerun, this time with -v, so I get a listing of what files it is working on. From that list I can often infer which file had the error. Assuming I still have the original file, the corrective action is to tahoe put the file. Then I can restart the deep-check.
But in a directory tree with thousands of files, that takes forever! Instead, I can restart the deep-check in a subdirectory closer to the previous failure. But this is a lot of tedious work.

I wish that tahoe deep-check would:

  1. Report which file is unrepairable.
  2. Not stop at the first error, but continue and report all errors upon completion.

When an unrepairable file is an immutable directory, what corrective action should be taken? I have resorted to modifying the directory by creating an empty file, performing a tahoe backup, and then continuing the deep-check --repair. But I cannot then remove the empty file, because that would cause the next backup to point to the original (unrepaired) directory. Can this be improved?

I wish that tahoe backup could be combined with tahoe deep-check --repair. The behavior would be like deep-check, but if any file is unrepairable yet exists in the local filesystem at the corresponding path, upload it. And for bonus points this should guarantee happiness, not just healthiness. Or, it would be almost as good if deep-check would update the backup database so the next invocation of tahoe backup would re-upload the appropriate files and directories.

Essentially, I struggle with the fact that "tahoe backup" completes successfully without guaranteeing the recoverability of files it claims to have backed up. The backup database is out-of-sync with the healthiness of files on the grid, and there is no way to bring them in-sync. Sure, I can delete the backup database, but I don't want to pointlessly re-upload all the healthy files.

tlhonmey commented 2017-03-09 18:55:14 +00:00
Owner

Kyle: It won't have to re-upload all the healthy files. The deduplication algorithm will find that the data for any unchanged files is already available and will re-use whatever shares it can. It'll just take a bit longer to run because it'll have to scan and encode every file.

Meanwhile: I just lost a bunch of stuff because I didn't know about this issue and assumed a deep-check --repair --add-lease cronjob would take care of things. One file near the beginning of the directory structure got damaged somehow, so neither repair nor leasing was done on the rest, and by the time I came back to check on it, chunks had expired and been deleted and I have to re-upload everything, which will take about a month.

This bug has been open for almost 8 years, and I see a patch for it in the discussion thread... If it's not going to be fixed in the next release, I recommend adding a warning about it to the documentation so new users don't do something stupid like expect the repair operation to behave in a sane manner.

As a work-around, I use:

tahoe manifest alias: | cut -d" " -f 1 | xargs -L1 -P5 tahoe check --add-lease --repair

This, of course, requires time and CPU to start a separate instance of the tahoe program for every data object being checked, so going over the entire directory takes days instead of hours, but at least it actually works.
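For anyone who wants the same fan-out without a shell pipeline, here is a rough Python sketch. It assumes manifest output with the cap as the first space-separated field (which is what the `cut` above relies on). `check_one` is a placeholder supplied by the caller, for example something that shells out to `tahoe check` or talks to the web API; it is not a Tahoe function.

```python
from concurrent.futures import ThreadPoolExecutor


def caps_from_manifest(manifest_text):
    """Take the first space-separated field of each non-empty line."""
    return [line.split(" ", 1)[0]
            for line in manifest_text.splitlines() if line.strip()]


def check_all(caps, check_one, workers=5):
    """Run check_one(cap) for every cap; return the caps that failed.

    check_one is a caller-supplied (hypothetical) callable returning a
    truthy value on success. An exception counts as a failure for that
    cap only and never stops the remaining checks.
    """
    def guarded(cap):
        try:
            return cap, bool(check_one(cap))
        except Exception:
            return cap, False

    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so failures come back in order.
        for cap, ok in pool.map(guarded, caps):
            if not ok:
                failed.append(cap)
    return failed
```

Since one bad object only adds an entry to the returned list, the rest of the manifest still gets checked and leased.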

tlhonmey commented 2018-08-21 21:48:05 +00:00
Owner

Ok, so tahoe manifest also gives up on the first error it encounters; it just only encounters errors on damaged directories. But it will still bite you hard if you are actually stupid enough to rely on it.

So I've resorted to the following bash script:

#!/bin/bash
tahoe="/home/tahoe/tahoe/bin/tahoe"
THREADS=5
FAILEDLOG="/tmp/failed.txt"


recurser() {
  # Keep the PID list local so a recursive call does not clobber the caller's list.
  local CHILDREN=""
  echo "checking directory: $1"
  # If both check and repair fail, give it 5 minutes before continuing to let the
  # grid come back up if this is a connection failure. This prevents the entire
  # script from finishing as failures if the network connection goes down.
  $tahoe check --add-lease "$1" || $tahoe check --add-lease --repair "$1" || sleep 5m
  local ITEM
  for ITEM in $($tahoe ls -F "$1"); do
    echo "checking: ${1}${ITEM}"
    echo "$ITEM" | grep "/" > /dev/null && echo "  Is a directory..." && recurser "${1}${ITEM}"
    ( $tahoe check --add-lease "${1}${ITEM}" | grep -n10 healthy || $tahoe check --repair --add-lease "${1}${ITEM}" || echo "${1}${ITEM}" >> $FAILEDLOG ) &
    # $! is the PID of the background job just started ($? would only be its launch status).
    CHILDREN="$! $CHILDREN"
    if [[ $(echo "$CHILDREN" | wc -w) == "$THREADS" ]]; then
      wait
      CHILDREN=""
    fi
  done
}


echo "If it blows up immediately when passed a URI make sure you end it with a /"
recurser "$1"
# Let any remaining background checks finish before the script exits.
wait

The careful observer will notice that this script calls "check --add-lease" first and then only calls --repair if that returns an error. This is due to another bug in the --repair functionality which I will be filing shortly.

Is making deep-check note the unrepairable nodes, but then continue to check the rest of the tree really that difficult? I wouldn't think the average user should have to resort to writing their own tools to avoid cascade failures of the storage system...

If you guys want to bundle this tool or some clone or variant thereof into your packages you are more than welcome to do so. We need something to actually keep people's data safe until this bug is fixed.

Edit: Oh for Pete's Sake! tahoe check exits with a 0 even when the checked objects are unhealthy, so I have to scan the output myself to assess it. I sense that at some point I'm going to need to rewrite this in Python or something and use the REST API. Hopefully that's at least somewhat sane...
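A rough idea of what the Python side could look like, as a sketch only. It assumes the `t=stream-deep-check` response is one JSON object per line, that each per-object unit carries a `path` list and `check-results` → `results` with `healthy` and `recoverable` booleans, and that a failed traversal ends with a line starting with `ERROR:`. Those field names are from memory of the web API docs and should be checked against the real output. The function only parses lines it is given; fetching them over HTTP is left out.

```python
import json


def summarize_stream(lines):
    """Collect unhealthy/unrecoverable paths from stream-deep-check lines.

    Returns (unhealthy, unrecoverable, error); error is None unless the
    stream ended with an 'ERROR:' line (assumed format, see above).
    """
    unhealthy, unrecoverable, error = [], [], None
    it = iter(lines)
    for raw in it:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("ERROR:"):
            # Keep the error line plus whatever traceback text follows it.
            error = "\n".join([line] + [rest.rstrip("\n") for rest in it])
            break
        unit = json.loads(line)
        results = unit.get("check-results", {}).get("results")
        if results is None:
            continue  # e.g. a trailing stats unit with no check results
        path = "/".join(unit.get("path", []))
        if not results.get("healthy", False):
            unhealthy.append(path)
        if not results.get("recoverable", False):
            unrecoverable.append(path)
    return unhealthy, unrecoverable, error


def exit_code(unhealthy, unrecoverable, error):
    """Non-zero if anything at all went wrong, unlike the CLI's plain 0."""
    return 1 if (unhealthy or unrecoverable or error) else 0
```

With something like this a cron job can fail loudly on the first unhealthy object or on an aborted traversal, without grepping the human-readable output.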

Reference: tahoe-lafs/trac-2024-07-25#755