UCWE on deep check with recent version #1628

Closed
opened 2011-11-30 22:54:54 +00:00 by davidsarah · 29 comments
davidsarah commented 2011-11-30 22:54:54 +00:00
Owner

Reported by kpreid:
I upgraded my tahoe to a recent development version, in the git mirror:

  git://github.com/warner/tahoe-lafs.git
  commit b73aba98de93c4c0b0013f1dd435c64e73e48f4c

Now, my daily deep-check --repair --add-lease on my four aliases on the volunteer grid consistently fails as follows. The first might have a legitimate uncoordinated write, but the last two are not regularly touched by anything but the repair process, and this identical failure has occurred for the past 4 days.

I'd appreciate it if this is fixed before my leases expire. :-)

ERROR: UncoordinatedWriteError()
"[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "

ERROR: AssertionError()
"[Failure instance: Traceback: <type 'exceptions.AssertionError'>: "
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:563:upload
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:661:_do_serialized
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:298:addCallback
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:287:addCallbacks
--- <exception caught here> ---
/Volumes/Opp/External/Projects/tahoe/support/lib/python2.6/site-packages/Twisted-11.1.0-py2.6-macosx-10.6-x86_64.egg/twisted/internet/defer.py:545:_runCallbacks
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:661:<lambda>
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/filenode.py:689:_upload
/Volumes/Opp/External/Projects/tahoe/src/allmydata/mutable/publish.py:402:publish

ERROR: UncoordinatedWriteError()
"[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "

ERROR: UncoordinatedWriteError()
"[Failure instance: Traceback (failure with no frames): <class 'allmydata.mutable.common.UncoordinatedWriteError'>: "

Before these failures, here is a typical example of normal results. (The indentation is added by my script.) I had understood the post-repair unhealthiness to be due to disagreement between the "healthy file" and "repair (successful re-upload)", a to-be-fixed bug.

	done: 3 objects checked
	 pre-repair: 3 healthy, 0 unhealthy
	 0 repairs attempted, 0 successful, 0 failed
	 post-repair: 3 healthy, 0 unhealthy

	 repair successful
	done: 1 objects checked
	 pre-repair: 0 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 0 healthy, 1 unhealthy

	 repair successful
	done: 5 objects checked
	 pre-repair: 4 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 4 healthy, 1 unhealthy

	 repair successful
	done: 5 objects checked
	 pre-repair: 4 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 4 healthy, 1 unhealthy
tahoe-lafs added the code, major, defect, 1.9.0 labels 2011-11-30 22:54:54 +00:00
tahoe-lafs added this to the undecided milestone 2011-11-30 22:54:54 +00:00
davidsarah commented 2011-11-30 22:58:04 +00:00
Author
Owner

> I had understood the post-repair unhealthiness to be due to disagreement between the "healthy file" and "repair (successful re-upload)", a to-be-fixed bug.

Yes, that sounds like #766.

killyourtv commented 2011-12-01 10:17:26 +00:00
Author
Owner

I had the same problem after upgrading to 1.9, reported in #1583 (though I neglected to add the full traceback).


Thanks for the information, kpreid and killyourtv. Kevin, would you please attempt to reproduce it and see if a foolscap incident report file is generated in `~/.tahoe/logs/incidents/`, and then push the "Report A Problem" button as killyourtv did in #1583?


Attachment incident-2011-12-06--13-30-43Z-kt36tjq.flog.bz2 (56514 bytes) added

Repair failure #1 (UncoordinatedWriteError)


Attachment incident-2011-12-06--13-30-54Z-uzuao2q.flog.bz2 (55321 bytes) added

Repair failure 3 (UncoordinatedWriteError)


Attachment incident-2011-12-06--13-31-29Z-oq7v3wi.flog.bz2 (55136 bytes) added

Pushed the "Report an Incident" button


For some reason I cannot upload the second of the four incident files, which contained an AssertionError from mutable/filenode.py:563:upload. I have tried several times, including with a different format and filename, and Trac acts as if it succeeded but doesn't show the file. I have temporarily uploaded it here: http://switchb.org/pri/incident-2011-12-06--13-30-50Z-fhhyszi.flog.bz2

kevan commented 2011-12-13 01:22:29 +00:00
Author
Owner

Hm, it looks like the publisher lost the ability to keep track of multiple copies of a single share during the MDMF implementation. So, in your case, it knows about one copy of each of a few shares that actually exist on more than one server; it then gets surprised when it encounters the other copies while pushing other shares, interprets them as evidence of an uncoordinated write, and breaks. Can you test fix-1628.darcs.patch and let me know if it resolves the issue for you?
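To make the failure mode concrete, here is a minimal sketch (hypothetical names and data structures only, not the actual publisher code) of how remembering a single location per share turns a legitimate second copy into an apparent uncoordinated write:

```python
# Toy illustration of the failure mode described above; all names here are
# hypothetical and this is not the real Tahoe-LAFS publisher code.

# Bookkeeping that remembers every known location of each share number:
known_all = {0: {"server-A", "server-C"}, 1: {"server-B"}}

# Bookkeeping that remembers only one location per share number:
known_one = {0: "server-A", 1: "server-B"}

def surprised_by(shnum, server, single_map):
    """With single-location bookkeeping, a second legitimate copy of a share
    (e.g. left behind by an earlier repair) looks like someone else's write."""
    return shnum in single_map and single_map[shnum] != server

print(surprised_by(0, "server-C", known_one))  # True  -> spurious UCWE
print("server-C" in known_all[0])              # True  -> known copy, no surprise
```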

kevan commented 2011-12-13 01:23:13 +00:00
Author
Owner

Attachment fix-1628.darcs.patch (84710 bytes) added


I'm currently trying not to have to deal with darcs. If you can supply that as a unified diff or against the git mirror I can test it.

kevan commented 2011-12-13 02:19:28 +00:00
Author
Owner

Attachment fix-1628.diff (18873 bytes) added

darcs-free version of fix-1628.darcs.patch

kevan commented 2011-12-13 02:23:18 +00:00
Author
Owner

Does fix-1628.diff work for you? Don't mind the references to a fifth patch; it's not related to this issue.


fix-1628.diff appears to have eliminated the problem.


Replying to kpreid:

> For some reason I cannot upload the second of the four incident files, which contained an AssertionError from `mutable/filenode.py:563:upload`. I have tried several times, including with a different format and filename, and Trac acts as if it succeeded but doesn't show the file.

That's due to a misconfiguration of this Trac (#1581). My apologies.


So is this a regression in 1.9.0 vs. earlier releases, and could it result in data loss, and should we plan a 1.9.1 release to fix it?

killyourtv commented 2011-12-13 22:54:46 +00:00
Author
Owner

Replying to kpreid:

> fix-1628.diff appears to have eliminated the problem.

I concur, this seems to have solved my problem as well (though I want to do a bit more testing).

I assume that #1583 is probably the same as this bug. I'll close mine since this one has had a bit more activity.

davidsarah commented 2011-12-14 00:47:54 +00:00
Author
Owner

From the patch comments:

> This tests for two regressions resulting from a design flaw in the 1.9
> mutable publisher; specifically, that the publisher doesn't keep track
> of more than one server for each share. This can lead to spurious UCWEs,
> as seen in ticket #1628. This also means that the publisher will no
> longer write shares associated with a new version of a mutable file over
> all of the existing shares that it can find, which potentially decreases
> the robustness of the new version of the mutable file.

This sounds to me like a regression serious enough to justify a 1.9.1. Although multiple servers holding the same share shouldn't happen if there have been only publish operations with a stable set of servers, it can easily happen if the grid membership is less stable and there have been repairs.
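As a rough illustration of that last point (a toy sketch with made-up server names, not Tahoe-LAFS code): a repair performed while one server is unreachable places a fresh copy of its share elsewhere, and when the original server returns the grid holds two copies of the same share.

```python
# Toy simulation of how a repair can leave two copies of one share; the
# placement dict and server names are invented for illustration only.
placements = {0: {"server-A"}, 1: {"server-B"}}   # shnum -> servers holding it
offline = {"server-B"}                            # server-B is unreachable

# The repairer sees share 1 as missing and re-places it on a live server.
for shnum, servers in placements.items():
    if not (servers - offline):
        servers.add("server-C")

offline.clear()                                   # server-B later rejoins
print(placements)  # -> {0: {'server-A'}, 1: {'server-B', 'server-C'}} (set order may vary)
# Share 1 now legitimately lives on two servers; a publisher that remembers
# only one location per share will be "surprised" by the extra copy.
```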

tahoe-lafs modified the milestone from undecided to soon 2011-12-14 00:48:51 +00:00
kevan commented 2011-12-15 03:01:03 +00:00
Author
Owner

I agree with comment:86623. While investigating this issue, I noticed a potential regression in the way we handle #546 situations. I haven't had time to investigate that yet, and probably won't have time to investigate until this weekend. Can we wait on 1.9.1 until I make a ticket for that issue, so we can decide if it belongs in 1.9.1?

zooko added critical and removed major labels 2011-12-16 16:52:21 +00:00
zooko modified the milestone from soon to 1.9.1 2011-12-16 16:52:21 +00:00
kevan commented 2011-12-27 20:38:48 +00:00
Author
Owner

Attachment fix-1628.darcs.2.patch (86211 bytes) added

kevan commented 2011-12-27 20:40:14 +00:00
Author
Owner

fix-1628.darcs.2.patch fixes a flaw in my initial patch. I think it's ready for review.


oops, I just reviewed and landed the *first* patch, in changeset:e29323f68fc5447b. I'll see if I can deduce a delta between the two darcs patches...

Brian Warner <warner@lothar.com> commented 2011-12-28 05:50:43 +00:00
Author
Owner

In changeset:147670fd89a04bad:

mutable publish: fix not-enough-shares detection. Refs #1628.

This should match the "fix-1628.darcs.2.patch" attachment on that ticket.

kevan: can you double-check that I got that delta right? I think the
only part that changed was this bit:

        all_shnums = filter(lambda sh: len(self.writers[sh]) > 0,
                            self.writers.iterkeys())
        if len(all_shnums) < self.required_shares or self.surprised:
            return self._failure()

with which I fully concur. Since empty lists are falsey, you could also
express it like:

all_shnums = set([shnum for shnum in self.writers if self.writers[shnum]])
# or
all_shnums = set([shnum for shnum,writers in self.writers.items() if writers])
# or, relying upon the uniqueness of dict keys:
all_shnums = [shnum for shnum,writers in self.writers.items() if writers]
# or, since we only actually care about the count of unique shnums:
shares = len([shnum for shnum,writers in self.writers.items() if writers])

(also, be aware of the DictOfSets that I use in the immutable code
for tracking the shnum->servers mapping)
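For readers who haven't seen it, a DictOfSets-style structure is essentially a mapping from key to a set of values; a minimal stand-in (not necessarily the exact API of the helper in the Tahoe-LAFS codebase) can be built on collections.defaultdict:

```python
from collections import defaultdict

# Minimal stand-in for a DictOfSets-style shnum -> servers mapping; the real
# helper class in the codebase may expose a different API.
share_map = defaultdict(set)
share_map[0].add("server-A")
share_map[0].add("server-C")   # the second copy of share 0 is tracked, not lost
share_map[1].add("server-B")

# Count distinct share numbers that still have at least one holder:
placed = len([shnum for shnum, servers in share_map.items() if servers])
print(placed)  # 2
```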

Should I leave this ticket open until we get that second test written?

kevan commented 2011-12-28 19:26:39 +00:00
Author
Owner

I altered test_multiply_placed_shares to fail if some of the shares aren't updated to the newest version on an update, so we don't need to wait for another test. I guess the git changelogs are a little stale; sorry for any confusion from that.

You caught the only important change with your delta. I also removed

        self.g.clients[0].DEFAULT_ENCODING_PARAMETERS['n'] = 75

from test_multiply_placed_shares. Placing a lot of shares (so each server holds a few shares) made the test yield an UCWE more reliably, but it still sometimes made it to the multiple version check due to #1641. It didn't seem worthwhile to set a magical encoding parameter if it didn't always work, and the test always failed without the fix in any case, so I took it out. It probably doesn't matter either way, but the test might be a little faster without that line.

Thanks for the review, the suggested alternatives, and for landing the fixes.


Ok, I applied that change too, in changeset:7989fe21cc1465ac. So I think we can close this one now. Thanks!

warner added the fixed label 2011-12-29 00:00:40 +00:00

Replying to kevan:

> I agree with comment:86623. While investigating this issue, I noticed a potential regression in the way we handle #546 situations. I haven't had time to investigate that yet, and probably won't have time to investigate until this weekend. Can we wait on 1.9.1 until I make a ticket for that issue, so we can decide if it belongs in 1.9.1?

Kevan: did you do this investigation? Release Manager Brian [said](https://tahoe-lafs.org/pipermail/tahoe-dev/2011-December/006901.html) "a week or two" and the Milestone is currently marked as due on 2012-01-15, so I think we have time.

kevan commented 2011-12-31 22:06:27 +00:00
Author
Owner

I did -- the result is ticket #1641.

Author
Owner

Hi all,

I'm a tahoe-lafs novice, but while playing with my first shares (4 storage servers, 2 clients on 2 of the storage servers, k=2, H=4, N=5) I managed to mess up my shares within a short time frame. I'm not yet fully sure what we did to confuse them, but in the end we had one share of seq2 and 5 shares of seq10. Running deep-check --repair on the alias always resulted in this:

	$ tahoe deep-check --repair -v sound:28C3
	'<root>': not healthy
	 repair successful
	'28c3.Pausenmusik.mp3': healthy
	done: 2 objects checked
	 pre-repair: 1 healthy, 1 unhealthy
	 1 repairs attempted, 1 successful, 0 failed
	 post-repair: 1 healthy, 1 unhealthy

It was always the mutable files that became unhealthy, and no number of repairs could get them fixed.

jg71 suggested using git HEAD, as some bugs were fixed there. I did as proposed and just found out that git HEAD fixes all the above problems without fuss. Great work, guys!
