failure in block hash tree #738
Running tahoe on the machine on which "python2.5 setup.py test" fails (as reported in ticket #737) generates the attached incident report.
Brief summary from flog debugger viewer:
.. and so on.
The actual CLI error message is:
tahoe get URI:CHK:lapry55oui4psmeiyxhvitfmpi:75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq:3:5:99758080 meh
Finally, dump-share on the 33MB file:
Machine details:
This is a transitional pthread machine, partway through the M:N -> 1:1 threading model transition. The M:N threads should be functional, and for all system purposes and most application purposes (KDE, etc.) they are. However, on occasion some software makes assumptions, or is built without threading support, because configure detected anomalous behaviour.
NOTE: The share file IS AVAILABLE UPON REQUEST. I will save it for posterity.
Attachment incident-2009-06-16-211442-qvfj7eq.flog.bz2 (286034 bytes) added
Incident report. View with: flogtool web-viewer -p 8081 incident-2009-06-16-211442-qvfj7eq.flog.bz2
I want to know if Tahoe-LAFS unit tests pass (excluding the one that locks up as described in #737) and if pycryptopp unit tests pass.
(wrapped some of the description text to improve formatting)
I've looked a bit at the share file you sent me, and it seems ok (no corruption that I've seen so far). My next step is to examine the Incident report and see if I can figure out exactly which hash check failed, and compare them against hashes that I'll generate locally from that share.
Another approach will be to get a copy of two more shares, put them in a private grid, and attempt to download the file. If successful, the shares must be ok, and we'll focus on how the download process might be acting differently on your host.
I've looked at that report and compared it against the scrubber that I wrote (a server-side share verification tool). It looks like your client is generating a different hash for the first data block than it's supposed to. The incident report contains a copy of the first 50 bytes and the last 50 bytes of the block, and they match what I'm getting out of the share.
So, either your client is incorrectly computing the SHA256d hash of that 43kB-ish data, or it is using a block of data that is corrupted somewhere in the middle. Your client seems to compute the rest of the hash tree correctly (and I think you might have said that pycryptopp tests pass on this platform), so it seems like SHA256d is working in general. So that points to either the wrong hash tag (in allmydata.util.hashutil), or some sort of transport-level error that is corrupting or inserting/deleting data in the middle of the block.
I've just pushed some known-answer-tests to confirm that allmydata.util.hashutil is working correctly: could you pull a trunk tree, build, and then specifically run "make test TEST=allmydata.test.test_util" ? I know that an earlier test is hanging on your system; by running the later test_util directly, we can rule out this one hypothesis.
If that passes, the next step will be to patch the download code to save the full block to disk, so we can examine it and see if it matches what it's supposed to be.
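(For context, here is a rough sketch of the tagged SHA256d construction that allmydata.util.hashutil performs; the tag string below is illustrative, not the real one, and the real implementation lives in hashutil. The netstring-encoded tag is the fixed-length prefix that comes up later in this ticket as the 33-byte tag.)

```python
# Illustrative sketch of tagged SHA256d hashing (see allmydata.util.hashutil
# for the real implementation and the real tag strings).
import hashlib

def netstring(s):
    # Netstring framing: b"<len>:<payload>,"
    return b"%d:%s," % (len(s), s)

def tagged_sha256d(tag, data):
    # SHA256d = SHA-256 applied twice, over the netstring-prefixed input.
    inner = hashlib.sha256(netstring(tag) + data).digest()
    return hashlib.sha256(inner).digest()

# Hypothetical usage: hashing one encoded block with an illustrative tag.
block_hash = tagged_sha256d(b"example_block_tag_v1", b"...block data...")
```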
Replying to warner:
My system make is a form of bmake. I had to run gmake to execute your tests. However, the result is apparently success:
Ran 63 tests in 5.852s
PASSED (successes=63)
In doing so, I have found that the default, lone python interpreter was actually different from the one I was manually using to build and install tahoe. Additionally, my tahoe installation is installed system-wide. I will (hrm.. manually?) de-install it and try again with a proper symlink from /usr/pkg/bin/python to python2.5. (Long shot with no evidence, I know.. but still.)
Nope. Same problem.
I'm sorry I don't know more about Python or I would be a lot more useful to you.
Attachment 738-dumpblock.diff (3718 bytes) added
patch to dump the offending block into logs/
ok, if you would, please apply the 738-dumpblock.diff patch to a trunk tree, then perform the failing download again. That patch will write out the block-with-bad-hash into logs/badhash-STUFF .. then please attach that file to this ticket, and we'll compare it against the block that we were supposed to get, to see if your machine is receiving bad data, or if it's computing the hashes incorrectly.
Also, double-check that "test_known_answers" were in the output of the "test_util" run that you just did, to make sure that your tree was new enough to have the tests I added.
thanks!
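(A rough sketch of what a block-dumping hook like 738-dumpblock.diff does; the function and its parameters below are illustrative, not the actual patch. The idea is that on a block-hash failure the raw block is written to logs/ so it can be compared byte-for-byte against the copy in the share.)

```python
# Illustrative sketch, not the actual 738-dumpblock.diff: on a block-hash
# failure, write the offending block to logs/ under a name that records
# where it came from, so it can be diffed against the block in the share.
import os

def dump_bad_block(peerid, storage_index, shnum, blocknum, blockdata,
                   logdir="logs"):
    fn = "badhash-from-%s-SI-%s-shnum-%d-blocknum-%d" % (
        peerid, storage_index, shnum, blocknum)
    path = os.path.join(logdir, fn)
    with open(path, "wb") as f:
        f.write(blockdata)
    return path
```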
Excellent! I got some badhashes! I believe they're from that same file. I recognise the beginning of the hash reference anyway (jo blah blah).
I'll try to attach them to this note.
Attachment badhash-from-2ccpv6ww-SI-jow42sylefxjxsns3alv5ptghe-shnum-0-blocknum-0 (43691 bytes) added
first badhash
Attachment badhash-from-r4tndnav-SI-jow42sylefxjxsns3alv5ptghe-shnum-1-blocknum-0 (43691 bytes) added
Attachment badhash-from-rzozr3qr-SI-jow42sylefxjxsns3alv5ptghe-shnum-2-blocknum-0 (43691 bytes) added
should be last attachment for this..
Indeed it was. Here is an incident from a new test with the latest trunk (via darcs pull), which includes the patch that generated the badhash-* files:
Attachment incident-2009-06-29-161825-brn5ypi.flog.bz2 (336781 bytes) added
incident file to go with badhashes (i believe.)
pycryptopp#24 opened.
midnightmagic and I were able to narrow this down to a failure in pycryptopp, in which hashing a 128-byte string in two chunks of sizes (33, 95) gets the wrong value on NetBSD. The block-data hasher uses a tag (including netstring padding) of length 33, so I suspect the actual problem occurs for any block size BS such that (33+BS) % 128 == 0.
This smells a lot like pycryptopp#17, which was an ARM-specific alignment issue that corrupted AES output on certain chunk sizes. I haven't looked deeply at the SHA256 code yet, but I suspect the same sort of bug, this time affecting i386.
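(A minimal way to reproduce the kind of failure described above: hash the same 128 bytes in one pass with hashlib and in 33+95-byte chunks with pycryptopp, then compare. This is a sketch, not the actual pycryptopp#24 test, and it assumes pycryptopp's SHA256 class with its update()/digest() interface.)

```python
# Sketch of the failure mode: chunked hashing (33 + 95 = 128 bytes) should
# give the same digest as one-pass hashing; on the affected builds it did not.
import hashlib
from binascii import hexlify
from pycryptopp.hash.sha256 import SHA256   # assumed pycryptopp API

data = b"\x5f" * 128
one_pass = hashlib.sha256(data).digest()

h = SHA256()
h.update(data[:33])    # 33 bytes, the size of the netstring-padded block tag
h.update(data[33:])    # the remaining 95 bytes
chunked = h.digest()

print("one-pass:", hexlify(one_pass))
print("chunked: ", hexlify(chunked))
print("MATCH" if one_pass == chunked else "MISMATCH -- the bug in this ticket")
```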
How interesting.. Black Dew's debian/i386 buildslave (which has experienced hangs in the test suite that look fairly similar to the hangs midnightmagic has seen) fails the new pycryptopp#24 test in exactly the same way.
Adding Cc: bdew, midnightmagic so they will know that there is something they can do to help. Setting 'assigned to' to bdew at random.
A-ha! Now Black Dew's buildslave got an internal compiler error in g++ while building Crypto++:
http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/25/steps/build/logs/stdio
This suggests to me that the machine has hardware problems.
This raises an interesting question of: what if anything can Tahoe-LAFS do to be robust and to fail clearly and nicely in the presence of hardware problems such as flaky RAM?
See also Black Dew's discoveries over in http://allmydata.org/trac/pycryptopp/ticket/24 . Crypto++ is being built to use the MOVDQA instruction, which may be buggy on his AthlonXP.
So! Just as an update, this particular issue may be solved by the fact that Crypto++ on my machine actually does fail (when ASM optimizations are turned on) on a test program I wrote. See the aforementioned pycryptopp ticket for more details and the test program.
SUCCESS! Rebuilding pycryptopp without ASM optimizations makes it pass the chunked SHA256 test, and setting PYTHONPATH to that top-level directory makes "tahoe ls" Just Work on an existing grid, and the failing command NOW WORKS PERFECTLY.
So there is a patch for setup.py in pycryptopp#24 which detects the platform involved and turns off assembly optimizations on just that specific platform and bit-width (32-bit).
I would say that, if bdew could do the same and it works, we can add platform detection for his machine as well and likely close all these tickets out until Crypto++ fixes their CPU feature detection.
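(For illustration, a hedged sketch of what such a platform check in setup.py might look like; the real patch is in pycryptopp#24, and the platform list here is illustrative. CRYPTOPP_DISABLE_ASM is a standard Crypto++ configuration macro that turns off the assembly paths.)

```python
# Sketch only -- see pycryptopp#24 for the actual patch. Disable Crypto++
# assembly optimizations on 32-bit builds of the platforms where the chunked
# SHA256 bug was observed; the exact platform list below is illustrative.
import platform
import struct

extra_compile_args = []

is_32bit = struct.calcsize("P") * 8 == 32
if is_32bit and platform.system() in ("NetBSD",):
    # CRYPTOPP_DISABLE_ASM makes the embedded Crypto++ fall back to its
    # portable C++ implementations.
    extra_compile_args.append("-DCRYPTOPP_DISABLE_ASM")
```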
Wow, Wei Dai is fast. Check it out, he's fixed the problem already in SVN:
Crypto++ SVN Rev 470
Impressive.
pycryptopp trunk now works perfectly (well, the tests don't fail, anyway) on all three machines, as listed in pycryptopp#24. Using pycryptopp trunk, I now have apparently perfectly-working tahoe nodes where before they were only remotely usable.
Therefore, I believe this ticket can be closed, from my perspective. If 1.5.0 is going to include these fixes, then all's well!
Fixed by changeset:9578e70161009035, which raises the pycryptopp requirement to >= 0.5.15. Note, however, that if you build pycryptopp against an external libcryptopp, you may still have this bug if your libcryptopp has it.