failure in block hash tree #738

Closed
opened 2009-06-17 18:37:55 +00:00 by midnightmagic · 23 comments
midnightmagic commented 2009-06-17 18:37:55 +00:00
Owner

Running tahoe on the machine on which python2.5 setup.py test fails (as reported in trac ticket #737) generates the attached incident report.

Brief summary from flog debugger viewer:

<ValidatedReadBucketProxy #1>
(2-<ReadBucketProxy 150097484 to peer [mgnbpxki] SI jow42sylefxjxsns3alv5ptghe>-m2st7xk76cvd):
 hash failure in block=0, shnum=2
 on <ReadBucketProxy 150097484 to peer [mgnbpxki] SI jow42sylefxjxsns3alv5ptghe>

[...]

<BlockDownloader #2>(<ValidatedReadBucketProxy #3>-2): failure to get block

[... etc ...]

failure in block hash tree

.. and so on.

The actual CLI error message is:

tahoe get URI:CHK:lapry55oui4psmeiyxhvitfmpi:75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq:3:5:99758080 meh

Error, got 410 Gone
NotEnoughSharesError: no shares could be found. Zero shares usually indicates
a corrupt URI, or that no servers were connected, but it might also indicate
severe corruption. You should perform a filecheck on this object to learn
more.

Finally, dump-share on the 33MB file:

share filename: /v/tahoe/.tahoe/storage/shares/jo/jow42sylefxjxsns3alv5ptghe/2
             version: 1
           file_size: 99758080
        num_segments: 762
        segment_size: 131073
       needed_shares: 3
        total_shares: 5

          codec_name: crs
        codec_params: 131073-3-5
   tail_codec_params: 11529-3-5

      crypttext_hash: zodzh33f7mnowsxine5mzejiahlxsilgggpxmop5bfrh4zzzdpha
 crypttext_root_hash: nuqsysg5zqkz5nsvpi32n5h6h5ilbepsbvmssji2xv773kqw53tq
     share_root_hash: m2st7xk76cvdutgf5lzmkdzbf72h75cxpkytwjegi5jgntir3u5q
            UEB_hash: 75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq
          verify-cap: URI:CHK-Verifier:jow42sylefxjxsns3alv5ptghe:75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq:3:5:99758080

 Size of data within the share:
                data: 33252694
       uri-extension: 325
          validation: 196648

 Lease #0: owner=0, expire in 2607324s (30 days)

Machine details:

NetBSD quickie 4.99.7 NetBSD 4.99.7 (quickie) #0: Tue Jan 2 14:47:23 PST 2007 root@quickie:/v/src/sys/arch/i386/compile/quickie i386 

AMD Athlon(tm) XP 2500+ (single-core, 32-bit) 2.5GB RAM 

Python 2.5.2 

This is a transitional pthread machine, partway through the M:N -> 1:1 threading model transition. The M:N threads should be functional, and for all system and most application purposes (KDE, etc.) they are. However, on occasion some software makes bad assumptions, or is built without threading support because configure detected anomalous behaviour.

NOTE: The share file IS AVAILABLE UPON REQUEST. I will save it for posterity.

tahoe-lafs added the
unknown
major
defect
1.4.1
labels 2009-06-17 18:37:55 +00:00
tahoe-lafs added this to the undecided milestone 2009-06-17 18:37:55 +00:00
midnightmagic commented 2009-06-17 18:38:54 +00:00
Author
Owner

Attachment incident-2009-06-16-211442-qvfj7eq.flog.bz2 (286034 bytes) added

Incident report. View with: flogtool web-viewer -p 8081 incident-2009-06-16-211442-qvfj7eq.flog.bz2

tahoe-lafs added
code-storage
and removed
unknown
labels 2009-06-17 18:39:42 +00:00

I want to know if Tahoe-LAFS unit tests pass (excluding the one that locks up as described in #737) and if pycryptopp unit tests pass.

zooko added
code-encoding
critical
and removed
code-storage
major
labels 2009-06-17 19:22:32 +00:00
zooko modified the milestone from undecided to 1.5.0 2009-06-17 19:22:32 +00:00

(wrapped some of the description text to improve formatting)

I've looked a bit at the share file you sent me, and it seems ok (no corruption that I've seen so far). My next step is to examine the Incident report and see if I can figure out exactly which hash check failed, and compare them against hashes that I'll generate locally from that share.

Another approach will be to get a copy of two more shares, put them in a private grid, and attempt to download the file. If successful, the shares must be ok, and we'll focus on how the download process might be acting differently on your host.


I've looked at that report and compared it against the scrubber that I wrote (a server-side share verification tool). It looks like your client is generating a different hash for the first data block than it's supposed to. The incident report contains a copy of the first 50 bytes and the last 50 bytes of the block, and they match what I'm getting out of the share.

So, either your client is incorrectly computing the SHA256d hash of that 43kB-ish data, or it is using a block of data that is corrupted somewhere in the middle. Your client seems to compute the rest of the hash tree correctly (and I think you might have said that pycryptopp tests pass on this platform), so it seems like SHA256d is working in general. So that points to either the wrong hash tag (in allmydata.util.hashutil), or some sort of transport-level error that is corrupting or inserting/deleting data in the middle of the block.

I've just pushed some known-answer tests to confirm that allmydata.util.hashutil is working correctly: could you pull a trunk tree, build it, and then specifically run "make test TEST=allmydata.test.test_util"? I know that an earlier test is hanging on your system; by running the later test_util directly, we can rule out this one hypothesis.

If that passes, the next step will be to patch the download code to save the full block to disk, so we can examine it and see if it matches what it's supposed to be.
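
For background on the "hash tag" mentioned above: Tahoe's hashes are SHA-256d (double SHA-256) computed over a netstring-framed tag followed by the data. A rough standard-library sketch of that scheme, with illustrative function names rather than the exact allmydata.util.hashutil API:

```
import hashlib

def netstring(s):
    # netstring framing: "<decimal length>:<payload>,"
    return "%d:%s," % (len(s), s)

def tagged_sha256d(tag, data):
    # Hash the framed tag plus the data, then hash the digest again (SHA-256d).
    inner = hashlib.sha256(netstring(tag) + data).digest()
    return hashlib.sha256(inner).digest()

# A 29-byte tag becomes a 33-byte prefix once framed ("29:" + tag + ","),
# which is where the 33-byte tag length in the later analysis comes from.
print tagged_sha256d("x" * 29, "some block data").encode("hex")
```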

midnightmagic commented 2009-06-19 19:27:28 +00:00
Author
Owner

Replying to warner:

> I've just pushed some known-answer tests to confirm that allmydata.util.hashutil is working
> correctly: could you pull a trunk tree, build it, and then specifically run "make test
> TEST=allmydata.test.test_util"? I know that an earlier test is hanging on your system; by
> running the later test_util directly, we can rule out this one hypothesis.
>
> If that passes, the next step will be to patch the download code to save the full block to
> disk, so we can examine it and see if it matches what it's supposed to be.

My system make is a form of bmake. I had to run gmake to execute your tests. However, the result is apparently success:

Ran 63 tests in 5.852s

PASSED (successes=63)

In doing so, I have found that the default, lone python interpreter was actually different from the one I was manually using to build and install tahoe. Additionally, my tahoe installation is installed system-wide. I will (hrm.. manually?) de-install it and try again with a proper symlink from /usr/pkg/bin/python to python2.5. (Long shot with no evidence, I know.. but still.)

Nope. Same problem.

I'm sorry I don't know more about Python or I would be a lot more useful to you.


Attachment 738-dumpblock.diff (3718 bytes) added

patch to dump the offending block into logs/


ok, if you would, please apply the 738-dumpblock.diff patch to a trunk tree, then perform the failing download again. That patch will write out the block-with-bad-hash into logs/badhash-STUFF .. then please attach that file to this ticket, and we'll compare it against the block that we were supposed to get, to see if your machine is receiving bad data, or if it's computing the hashes incorrectly.

Also, double-check that "test_known_answers" was in the output of the "test_util" run that you just did, to make sure that your tree was new enough to have the tests I added.

thanks!

midnightmagic commented 2009-06-29 23:22:10 +00:00
Author
Owner

Excellent! I got some badhashes! I believe they're the same ones from that same file. I recognise the beginning of the hash reference anyway (jo blah blah).

I'll try to attach them to this note.

midnightmagic commented 2009-06-29 23:22:39 +00:00
Author
Owner

Attachment badhash-from-2ccpv6ww-SI-jow42sylefxjxsns3alv5ptghe-shnum-0-blocknum-0 (43691 bytes) added

first badhash

midnightmagic commented 2009-06-29 23:22:46 +00:00
Author
Owner

Attachment badhash-from-r4tndnav-SI-jow42sylefxjxsns3alv5ptghe-shnum-1-blocknum-0 (43691 bytes) added

midnightmagic commented 2009-06-29 23:22:59 +00:00
Author
Owner

Attachment badhash-from-rzozr3qr-SI-jow42sylefxjxsns3alv5ptghe-shnum-2-blocknum-0 (43691 bytes) added

should be last attachment for this..

midnightmagic commented 2009-06-29 23:36:11 +00:00
Author
Owner

> Also, double-check that "test_known_answers" was in the output of the "test_util" run that
> you just did, to make sure that your tree was new enough to have the tests I added.

Indeed it was, and here it is again in a new test run with the latest trunk (via darcs pull), which includes the patch that generated the badhash-* files:

  HashUtilTests
    test_chk ...                                      [OK]
    test_constant_time_compare ...                    [OK]
    test_hashers ...                                  [OK]
    test_known_answers ...                            [OK]
    test_random_key ...                               [OK]
    test_sha256d ...                                  [OK]
    test_sha256d_truncated ...                        [OK]
[ ... and etc ...]
  TimeFormat
    test_epoch ...                                    [OK]
    test_epoch_in_London ...                          [OK]

---------------------------------------------------------
Ran 63 tests in 5.815s

PASSED (successes=63)
midnightmagic commented 2009-06-29 23:37:51 +00:00
Author
Owner

Attachment incident-2009-06-29-161825-brn5ypi.flog.bz2 (336781 bytes) added

Incident file to go with the badhashes (I believe).


pycryptopp#24 (http://allmydata.org/trac/pycryptopp/ticket/24) opened.

midnightmagic and I were able to narrow this down to a failure in pycryptopp, in which hashing a 128-byte string in two chunks of sizes (33, 95) produces the wrong value on NetBSD. The block data hasher uses a tag (including netstring padding) of length 33, so I suspect the actual problem occurs for any block size BS such that (33+BS)%128==0 (a minimal reproduction sketch is below).

This smells a lot like pycryptopp#17 (http://allmydata.org/trac/pycryptopp/ticket/17), which was an ARM-specific alignment issue that corrupted AES output for certain chunk sizes. I haven't looked deeply at the SHA256 code yet, but I suspect the same sort of bug, this time affecting i386.
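
To make that concrete, here is a sketch of the kind of minimal check that separates chunked from one-shot hashing. This is not the actual pycryptopp#24 test; the zero-filled input is an arbitrary illustrative choice, and only the split sizes (33, 95) matter:

```
import hashlib
from pycryptopp.hash.sha256 import SHA256

data = "\x00" * 128             # any 128-byte input; 33 + 95 == 128

one_shot = SHA256()
one_shot.update(data)

chunked = SHA256()
chunked.update(data[:33])       # stands in for the 33-byte netstring-framed tag
chunked.update(data[33:])       # followed by the rest of the block data

reference = hashlib.sha256(data).digest()
print "one-shot matches hashlib:", one_shot.digest() == reference
print "chunked matches hashlib: ", chunked.digest() == reference
```

On a healthy build all three digests agree; on an affected Crypto++ assembly build the chunked digest comes out wrong.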


How interesting: Black Dew's debian/i386 buildslave (which has experienced hangs in the test suite that look fairly similar to the hangs midnightmagic has seen) fails the new pycryptopp#24 test in exactly the same way.


Adding Cc: bdew, midnightmagic so they will know there is something they can do to help. Setting 'assigned to' to bdew at random.


A-ha! Now Black Dew's buildslave got an internal compiler error in g++ while building Crypto++:

http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/25/steps/build/logs/stdio

This suggests to me that the machine has hardware problems.

This raises an interesting question: what, if anything, can Tahoe-LAFS do to be robust, and to fail clearly and nicely, in the presence of hardware problems such as flaky RAM?


See also Black Dew's discoveries over in http://allmydata.org/trac/pycryptopp/ticket/24. Crypto++ is being built to use the MOVDQA instruction, which may be buggy on his AthlonXP.

midnightmagic commented 2009-07-03 22:29:53 +00:00
Author
Owner

So! Just as an update: this particular issue may be explained by the fact that Crypto++ on my machine actually fails (when ASM optimizations are turned on) on a test program I wrote. See the aforementioned pycryptopp ticket for details and the test program.

midnightmagic commented 2009-07-03 23:01:19 +00:00
Author
Owner

SUCCESS! Rebuilding pycryptopp without ASM optimizations makes it pass the chunked SHA256 test; setting PYTHONPATH to that top-level directory makes "tahoe ls" Just Work on an existing grid, and the previously failing command NOW WORKS PERFECTLY.

There is now a patch in pycryptopp#24 for setup.py which detects the platform involved and turns off assembly optimizations for just that specific platform and bit-width (32-bit); a rough sketch of that kind of check follows below.

I would say that, if bdew can do the same and it works, we can add platform detection for his machine as well and likely close all these tickets out until Crypto++ fixes its CPU feature detection.
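
For reference, the general shape of that workaround in setup.py is sketched here. This is hedged: the platform test and the way the define is threaded through are illustrative, not the actual pycryptopp#24 patch; CRYPTOPP_DISABLE_ASM itself is Crypto++'s own switch for skipping the hand-written assembly.

```
import platform

# Sketch only: skip Crypto++'s assembly paths on the 32-bit x86 platforms where
# the chunked-SHA256 miscomputation was observed.
define_macros = []
if platform.system() in ("NetBSD",) and \
        platform.machine() in ("i386", "i486", "i586", "i686"):
    define_macros.append(("CRYPTOPP_DISABLE_ASM", 1))

# define_macros would then be passed to the distutils Extension() that compiles
# the bundled Crypto++ sources.
```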

midnightmagic commented 2009-07-06 05:07:00 +00:00
Author
Owner

Wow, Wei Dai is fast. Check it out, he's fixed the problem already in SVN:

Crypto++ SVN Rev 470: http://cryptopp.svn.sourceforge.net/viewvc/cryptopp?view=rev&sortby=date&revision=470

Impressive.

midnightmagic commented 2009-07-06 23:18:28 +00:00
Author
Owner

pycryptopp in trunk now works perfectly (well the tests don't fail anyway) on all three machines, as listed in pycryptopp ticket 24. Using pycryptopp trunk, I now have apparently perfectly-working tahoe nodes where before they were only remotely usable.

Therefore, I believe this ticket can be closed, from my perspective. If 1.5.0 is going to include these fixes, then all's well!


Fixed by changeset:9578e70161009035, which raises the pycryptopp requirement to >= 0.5.15. Note, however, that if you are building pycryptopp against an external libcryptopp, you may still have this bug if your libcryptopp has it.
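
If you want to confirm which pycryptopp your node actually imports (and therefore whether it carries the fix), a quick check along these lines works; pkg_resources comes with setuptools, and this is only a sanity check, not part of the fix:

```
import pkg_resources
import pycryptopp

dist = pkg_resources.get_distribution("pycryptopp")
print "pycryptopp", dist.version, "loaded from", pycryptopp.__file__
# Per this ticket, versions >= 0.5.15 include the fix; a pycryptopp built
# against an external libcryptopp may still have the bug if that library does.
```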

zooko added the
fixed
label 2009-07-07 19:09:34 +00:00
zooko closed this issue 2009-07-07 19:09:34 +00:00