UCWE on deep check with recent version #1628
Reference: tahoe-lafs/trac-2024-07-25#1628
Reported by kpreid:
I upgraded my tahoe to a recent development version, in the git mirror:
Now, my daily deep-check --repair --add-lease on my four aliases on the volunteer grid consistently fails as follows. The first might have seen a legitimate uncoordinated write, but the last two are not regularly touched by anything except the repair process, and this identical failure has occurred for the past 4 days.
I'd appreciate it if this is fixed before my leases expire. :-)
Before these failures, here is a typical example of normal results. (The indentation is added by my script.) I had understood the post-repair unhealthiness to be due to disagreement between the "healthy file" and "repair (successful re-upload)", a to-be-fixed bug.
Yes, that sounds like #766.
I had the same problem after upgrading to 1.9, reported in #1583 (though I neglected to add the full traceback).
Thanks for the information, kpreid and killyourtv. Kevin, would you please attempt to reproduce it and see if a foolscap incident report file is generated in ~/.tahoe/logs/incidents/, and then push the "Report A Problem" button as killyourtv did in #1583?

Attachment incident-2011-12-06--13-30-43Z-kt36tjq.flog.bz2 (56514 bytes) added
Repair failure #1 (UncoordinatedWriteError)
Attachment incident-2011-12-06--13-30-54Z-uzuao2q.flog.bz2 (55321 bytes) added
Repair failure 3 (UncoordinatedWriteError)
Attachment incident-2011-12-06--13-31-29Z-oq7v3wi.flog.bz2 (55136 bytes) added
Pushed the "Report an Incident" button
For some reason I cannot upload the second of the four incident files, which contained an AssertionError from mutable/filenode.py:563:upload. I have tried several times, including with a different format and filename, and Trac acts as if it succeeded but doesn't show the file. I have temporarily uploaded it here: http://switchb.org/pri/incident-2011-12-06--13-30-50Z-fhhyszi.flog.bz2

Hm, it looks like the publisher forgot how to keep track of multiple copies of a single share during the MDMF implementation. So, in your case, it knows about one copy of each of a few shares that each exist on more than one server, then gets surprised when it encounters the other copies while pushing other shares, interprets those shares as evidence of an uncoordinated write, and breaks. Can you test fix-1628.darcs.patch and let me know if it resolves the issue for you?
Attachment fix-1628.darcs.patch (84710 bytes) added
I'm currently trying not to have to deal with darcs. If you can supply that as a unified diff or against the git mirror I can test it.
Attachment fix-1628.diff (18873 bytes) added
darcs-free version of fix-1628.darcs.patch
Does fix-1628.diff work for you? Don't mind the references to a fifth patch; it's not related to this issue.
fix-1628.diff appears to have eliminated the problem.
Replying to kpreid:
That's due to a misconfiguration of this Trac (#1581). My apologies.
So is this a regression in 1.9.0 vs. earlier releases, and could it result in data loss, and should we plan a 1.9.1 release to fix it?
Replying to kpreid:
I concur, this seems to have solved my problem as well (though I want to do a bit more testing).
I assume that #1583 is probably the same as this bug. I'll close mine since this one has had a bit more activity.
From the patch comments:
This sounds to me like a regression serious enough to justify a 1.9.1. Although multiple servers holding the same share shouldn't happen if there have been only publish operations with a stable set of servers, it can easily happen if the grid membership is less stable and there have been repairs.
I agree with comment:86623. While investigating this issue, I noticed a potential regression in the way we handle #546 situations. I haven't had time to investigate that yet, and probably won't have time to investigate until this weekend. Can we wait on 1.9.1 until I make a ticket for that issue, so we can decide if it belongs in 1.9.1?
Attachment fix-1628.darcs.2.patch (86211 bytes) added
fix-1628.darcs.2.patch fixes a flaw in my initial patch. I think it's ready for review.
Oops, I just reviewed and landed the first patch, in changeset:e29323f68fc5447b. I'll see if I can deduce a delta between the two darcs patches.
In changeset:147670fd89a04bad:
kevan: can you double-check that I got that delta right? I think the
only part that changed was this bit:
with which I fully concur. Since empty lists are falsey, you could also express it like:

(also, be aware of the DictOfSets that I use in the immutable code for tracking the shnum->servers mapping)
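Tahoe-LAFS has its own DictOfSets utility; the minimal sketch below just shows the idea being suggested (the class body and the example keys are illustrative): a dict mapping each key to a set of values, which is a natural fit for a shnum -> servers mapping, plus the "empty containers are falsey" idiom mentioned above.

```python
# Minimal sketch of a DictOfSets-style helper (illustrative, not the
# Tahoe-LAFS implementation): each key maps to a set of values.

class DictOfSets(dict):
    def add(self, key, value):
        self.setdefault(key, set()).add(value)

shares = DictOfSets()
shares.add(0, "server-A")
shares.add(0, "server-B")  # second copy of share 0 is fine: it's a set
shares.add(1, "server-A")

# Empty/missing containers are falsey, so "no servers hold this share"
# needs no explicit len() comparison:
missing = [shnum for shnum in (0, 1, 2) if not shares.get(shnum)]
print(missing)  # [2]
```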
Should I leave this ticket open until we get that second test written?
I altered test_multiply_placed_shares to fail if some of the shares aren't updated to the newest version on an update, so we don't need to wait for another test. I guess the git changelogs are a little stale; sorry for any confusion from that.

You caught the only important change with your delta. I also removed

from test_multiply_placed_shares. Placing a lot of shares (so each server holds a few shares) made the test yield an UCWE more reliably, but it still sometimes made it to the multiple-version check due to #1641. It didn't seem worthwhile to set a magical encoding parameter if it didn't always work, and the test always failed without the fix in any case, so I took it out. It probably doesn't matter either way, but the test might be a little faster without that line.

Thanks for the review, the suggested alternatives, and for landing the fixes.
Ok, I applied that change too, in changeset:7989fe21cc1465ac. So I think we can close this one now. Thanks!
Replying to kevan:
Kevan: did you do this investigation? Release Manager Brian said "a week or two" (//pipermail/tahoe-dev/2011-December/006901.html) and the Milestone is currently marked as due on 2012-01-15, so I think we have time.
I did -- the result is ticket #1641.
Hi all,
I'm a tahoe-lafs novice, but playing with my first shares (4 storage servers, 2 clients on 2 of the storage servers, k=2, H=4, N=5) I managed within a short time frame to mess up my shares. I'm not yet fully sure what we did to confuse them, but in the end we had one share of seq2 and 5 shares of seq10. A deep-check --repair alias: always resulted in this:
$ tahoe deep-check --repair -v sound:28C3
'': not healthy
repair successful
'28c3.Pausenmusik.mp3': healthy
done: 2 objects checked
pre-repair: 1 healthy, 1 unhealthy
1 repairs attempted, 1 successful, 0 failed
post-repair: 1 healthy, 1 unhealthy
It was always the mutable files which became unhealthy, and no number of repairs could get them fixed.
jg71 suggested using git HEAD, as some bugs were fixed there. I did as proposed and just found out that git HEAD fixes all the above problems without fuss. Great work, guys!