Bitcoin Forum
September 20, 2024, 05:02:33 AM *
News: Latest Bitcoin Core release: 27.1 [Torrent]
 
   Home   Help Search Login Register More  
Pages: [1] 2 3 4 5 »  All
  Print  
Author Topic: 4 hashes parallel on SSE2 CPUs for 0.3.6  (Read 22024 times)
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 30, 2010, 09:23:10 PM
 #1

This patch will calculate four hashes on one core using vector instructions. There's a test programm included that validates the new hash function against the old one so it should be correct.

The patch is against 0.3.6. Improves khash/s by roughly 115%.

http://pastebin.com/XN1JDb53
knightmb
Sr. Member
****
Offline Offline

Activity: 308
Merit: 258



View Profile WWW
July 30, 2010, 09:33:29 PM
 #2

I take it that you've already tested the hash limit before performance starts to suffer against the stock code? I'm just curious myself.

Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 30, 2010, 09:47:22 PM
 #3

Performance of stock code (as measured by my test/benchmark program) is about 1500khash/s.
My code does 3500khash/s. Both figures are for one core. It scales well because I do 128 hashes at once and keep the datastructures small enough to fit in the CPU cache.

I have two local collision attacks which will squeeze another 300khash/s out, but they are not stable yet.
knightmb
Sr. Member
****
Offline Offline

Activity: 308
Merit: 258



View Profile WWW
July 30, 2010, 09:51:10 PM
 #4

Awesome, I'll have to give it a try myself then.  Shocked

Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 30, 2010, 10:00:24 PM
 #5

Tell me if it works Smiley
Donations are welcome. 17asVKkzRGTFvvGH9dMGQaHe78xzfvgSSA
satoshi
Founder
Sr. Member
*
qt
Offline Offline

Activity: 364
Merit: 7065


View Profile
July 31, 2010, 12:29:20 AM
 #6

That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once?  I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
knightmb
Sr. Member
****
Offline Offline

Activity: 308
Merit: 258



View Profile WWW
July 31, 2010, 04:49:33 AM
 #7

Darn, it means the next release, the difficulty is going to have to increase to 1000 or so to keep up, LOL  Grin

Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 10:12:38 AM
 #8

That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once?  I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
That's how it works. Four 32 bit values in a 128 bit vector. They're calculated independently, but at the same time.

Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compiletime?
em3rgentOrdr
Sr. Member
****
Offline Offline

Activity: 434
Merit: 252


youtube.com/ericfontainejazz now accepts bitcoin


View Profile WWW
July 31, 2010, 01:42:48 PM
 #9

hmm...I wasn't able to apply the patch (I'm a noobie). Here's the command I ran from bitcoin-0.3.6/src # patch < XN1JDb53.txt

Output:

1 out of 1 hunk ignored
(Stripping trailing CRs from patch.)
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2701.
2 out of 2 hunks FAILED
(Stripping trailing CRs from patch.)
patching file makefile.unix
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 58.

What's the proper command to type into linux?  Or do you have linux binaries?

"We will not find a solution to political problems in cryptography, but we can win a major battle in the arms race and gain a new territory of freedom for several years.

Governments are good at cutting off the heads of a centrally controlled networks, but pure P2P networks are holding their own."
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 02:18:03 PM
 #10

the mean client would send all generated bitcoins to a certain address Wink

@em3rgent0rder: i don't know why it fails, but it should be easy to patch it manually...
jgarzik
Legendary
*
qt
Offline Offline

Activity: 1596
Merit: 1099


View Profile
July 31, 2010, 05:18:30 PM
 #11

hmm...I wasn't able to apply the patch (I'm a noobie). Here's the command I ran from bitcoin-0.3.6/src # patch < XN1JDb53.txt

Output:

1 out of 1 hunk ignored
(Stripping trailing CRs from patch.)
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2701.
2 out of 2 hunks FAILED
(Stripping trailing CRs from patch.)
patching file makefile.unix
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 58.

It definitely does not apply to the SVN trunk.  Maybe tcatm could post the main.cpp itself?

Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own.
Visit bloq.com / metronome.io
Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 05:40:27 PM
 #12

Looks like pastebin.com messes up the patch...
Code:
diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..15f8be1
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,443 @@
+#include <string.h>
+#include <assert.h>
+
+#include <xmmintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+ 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+ 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+ 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+ 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+ 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+ 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+ 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+ 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+ 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+ 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+ 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+ 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+ 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+ 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+ 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (~b & d);
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (b & d) ^ (c & d);
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define BIGSIGMA0_256(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define BIGSIGMA1_256(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define SIGMA0_256(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define SIGMA1_256(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+ return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ *x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+ return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+ unsigned int value = (*((unsigned int *)(addr)));
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline void dumpreg(__m128i x, char *msg) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x ;
+ printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+ __func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate()
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[8][NPAR], const void *init)
+{
+ unsigned int* In = (unsigned int*)pin;
+ unsigned int* Pad = (unsigned int*)pad;
+ unsigned int* hPre = (unsigned int*)pre;
+ unsigned int* hInit = (unsigned int*)init;
+ unsigned int i, j, k;
+
+ /* vectors used in calculation */
+ __m128i w0, w1, w2, w3, w4, w5, w6, w7;
+ __m128i w8, w9, w10, w11, w12, w13, w14, w15;
+ __m128i T1, T2;
+ __m128i a, b, c, d, e, f, g, h;
+
+ /* nonce offset for vector */
+ __m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+ for(k = 0; k<NPAR; k+=4) {
+ w0 = load_epi32(In[0], In[0], In[0], In[0]);
+ w1 = load_epi32(In[1], In[1], In[1], In[1]);
+ w2 = load_epi32(In[2], In[2], In[2], In[2]);
+ w3 = load_epi32(In[3], In[3], In[3], In[3]);
+ w4 = load_epi32(In[4], In[4], In[4], In[4]);
+ w5 = load_epi32(In[5], In[5], In[5], In[5]);
+ w6 = load_epi32(In[6], In[6], In[6], In[6]);
+ w7 = load_epi32(In[7], In[7], In[7], In[7]);
+ w8 = load_epi32(In[8], In[8], In[8], In[8]);
+ w9 = load_epi32(In[9], In[9], In[9], In[9]);
+ w10 = load_epi32(In[10], In[10], In[10], In[10]);
+ w11 = load_epi32(In[11], In[11], In[11], In[11]);
+ w12 = load_epi32(In[12], In[12], In[12], In[12]);
+ w13 = load_epi32(In[13], In[13], In[13], In[13]);
+ w14 = load_epi32(In[14], In[14], In[14], In[14]);
+ w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+ /* hack nonce into lowest byte of w3 */
+ __m128i k_vec = load_epi32(k, k, k, k);
+ w3 = _mm_add_epi32(w3, offset);
+ w3 = _mm_add_epi32(w3, k_vec);
+
+ a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+ b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+ c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+ d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+ e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+ f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+ g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+ h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+ w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+ dest = _mm_add_epi32(w8, x);
+
+ store_load(a, 0, w0);
+ store_load(b, 1, w1);
+ store_load(c, 2, w2);
+ store_load(d, 3, w3);
+ store_load(e, 4, w4);
+ store_load(f, 5, w5);
+ store_load(g, 6, w6);
+ store_load(h, 7, w7);
+
+ w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+ w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+ w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+ w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+ w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+ w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+ w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+ w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+ a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+ b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+ c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+ d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+ e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+ f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+ g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+ h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+ /* store resulsts directly in thash */
+#define store_2(x,i)  \
+ w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+ *(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x);
+
+ store_2(a, 0);
+ store_2(b, 1);
+ store_2(c, 2);
+ store_2(d, 3);
+ store_2(e, 4);
+ store_2(f, 5);
+ store_2(g, 6);
+ store_2(h, 7);
+ }
+
+}
diff --git a/main.cpp b/main.cpp
index ddc359a..d30d642 100755
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,123 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
+        // Cache for NPAR hashes
+        unsigned int thash[8][NPAR];
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlocSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
-                    }
-                    SetThreadPriority(THREAD_PRIORITY_LOWEST);
-
-                    Sleep(500);
-                    break;
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
+                  }
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break;
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index 597a0ea..8fb0aa6 100755
--- a/makefile.unix
+++ b/makefile.unix
@@ -45,7 +45,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+ cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -58,18 +59,20 @@ obj/%.o: %.cpp $(HEADERS) headers.h.gch
  g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
- g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+ g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
  g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
 
-
 obj/nogui/%.o: %.cpp $(HEADERS)
  g++ -c $(CFLAGS) -o $@ $<
 
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
  g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+   g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+
 
 clean:
  -rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100755
index 0000000..7cab332
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,237 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+ template <size_t nBytes, typename T>
+T* alignup(T* p)
+{
+ union
+ {   
+ T* ptr;
+ size_t n;
+ } u;
+ u.ptr = p;
+ u.n = (u.n + (nBytes-1)) & ~(nBytes-1);
+ return u.ptr;
+}
+
+int FormatHashBlocks(void* pbuffer, unsigned int len)
+{
+ unsigned char* pdata = (unsigned char*)pbuffer;
+ unsigned int blocks = 1 + ((len + 8) / 64);
+ unsigned char* pend = pdata + 64 * blocks;
+ memset(pdata + len, 0, 64 * blocks - len);
+ pdata[len] = 0x80;
+ unsigned int bits = len * 8;
+ pend[-1] = (bits >> 0) & 0xff;
+ pend[-2] = (bits >> 8) & 0xff;
+ pend[-3] = (bits >> 16) & 0xff;
+ pend[-4] = (bits >> 24) & 0xff;
+ return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+ memcpy(pstate, pinit, 32);
+ CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+ printf("SHA256 test started\n");
+
+ struct tmpworkspace
+ {
+ struct unnamed2
+ {
+ int nVersion;
+ uint256 hashPrevBlock;
+ uint256 hashMerkleRoot;
+ unsigned int nTime;
+ unsigned int nBits;
+ unsigned int nNonce;
+ }
+ block;
+ unsigned char pchPadding0[64];
+ uint256 hash1;
+ unsigned char pchPadding1[64];
+ };
+ char tmpbuf[sizeof(tmpworkspace)+16];
+ tmpworkspace& tmp = *(tmpworkspace*)alignup<16>(tmpbuf);
+
+
+ char line[180];
+ ifstream fin(filename);
+ char *p;
+ unsigned long int totalhashes= 0;
+ unsigned long int found = 0;
+ clock_t start, end;
+ unsigned long int cpu_time_used;
+ unsigned int tnonce;
+ start = clock();
+
+ while( fin.getline(line, 180))
+ {
+ string in(line);
+ //printf("%s\n", in.c_str());
+ tmp.block.nVersion       = strtol(in.substr(0,8).c_str(), &p, 16);
+ tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+ tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+ tmp.block.nTime          = strtol(in.substr(128+8,8).c_str(), &p, 16);
+ tmp.block.nBits          = strtol(in.substr(128+16,8).c_str(), &p, 16);
+ tnonce = strtol(in.substr(128+24,8).c_str(), &p, 16);
+ tmp.block.nNonce         = tnonce;
+
+ unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+ unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+ // Byte swap all the input buffer
+ for (int i = 0; i < sizeof(tmp)/4; i++)
+ ((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+ // Precalc the first half of the first hash, which stays constant
+ uint256 midstatebuf[2];
+ uint256& midstate = *alignup<16>(midstatebuf);
+ SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+ uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+ // printf("target %s\n", hashTarget.GetHex().c_str());
+ uint256 hash;
+ uint256 hashbuf[2];
+ uint256& refhash = *alignup<16>(hashbuf);
+
+ unsigned int thash[8][NPAR];
+ int done = 0;
+ unsigned int i, j;
+
+ /* reference */
+ SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+ SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+ for (int i = 0; i < sizeof(refhash)/4; i++)
+ ((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+ //printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+ tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+ for(;;)
+ {
+
+ Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+ for(i = 0; i<NPAR; i++) {
+ /* fast hash checking */
+ if(thash[7][i] == 0) {
+ // printf("found something... ");
+
+ for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+ // printf("%s\n", hash.GetHex().c_str());
+
+ if (hash <= hashTarget)
+ {
+ found++;
+ if(tnonce == ByteReverse(tmp.block.nNonce + i) ) {
+ if(hash == refhash) {
+ printf("\r%lu", found);
+ totalhashes += NPAR;
+ done = 1;
+ } else {
+ printf("Hashes do not match!\n");
+ }
+ } else {
+ printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+ }
+ break;
+ }
+ }
+ }
+ if(done) break;
+
+ tmp.block.nNonce+=NPAR;
+ totalhashes += NPAR;
+ if(tmp.block.nNonce == 0) {
+ printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+ return;
+ }
+ }
+ }
+ printf("\n");
+ end = clock();
+ cpu_time_used += (unsigned int)(end - start);
+ cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+ printf("found solutions = %lu\n", found);
+ printf("total hashes = %lu\n", totalhashes);
+ printf("total time = %lu ms\n", cpu_time_used);
+ printf("average speed: %lu khash/s\n", (totalhashes)/cpu_time_used);
+}
+
+int main(int argc, char* argv[]) {
+ if(argc == 2) {
+ BitcoinTester(argv[1]);
+ } else
+ printf("Missing filename!\n");
+ return 0;
+}
nelisky
Legendary
*
Offline Offline

Activity: 1540
Merit: 1002


View Profile
July 31, 2010, 07:17:17 PM
 #13

Had to manually patch, as I'm not using git for bitcoin and 'patch' doesn't munch this format, I guess. Anyway, got almost double speed on the OSX side, (i5 2.4, now ~2400 from ~1400), but my linux on Q6600 quad 2.4Ghz was pumping ~2500 with 0.3.6 (from source) and now, with the patch it's... ~2400. Need I tweak anything to take advantage on this?
nelisky
Legendary
*
Offline Offline

Activity: 1540
Merit: 1002


View Profile
July 31, 2010, 08:34:06 PM
 #14

ahm, let me correct myself: on the quad core linux, I went from ~4400 with svn trunk @ 119 to ~2400 with the patch... not exactly what I hoped for after the success in OSX.
aceat64
Full Member
***
Offline Offline

Activity: 307
Merit: 102



View Profile
July 31, 2010, 08:57:49 PM
 #15

ahm, let me correct myself: on the quad core linux, I went from ~4400 with svn trunk @ 119 to ~2400 with the patch... not exactly what I hoped for after the success in OSX.

I noticed the same, I went from about 4300 to 2100 when I tested it on Linux.
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 10:38:29 PM
 #16

What CPUs are you running it on? Could you send me sha256.o (compiled object of the algorithm)?
nelisky
Legendary
*
Offline Offline

Activity: 1540
Merit: 1002


View Profile
July 31, 2010, 11:18:07 PM
 #17

I'm running on the Intel Q6600 2.4Ghz, how shall I get the file to you?
Mionione
Newbie
*
Offline Offline

Activity: 10
Merit: 1


View Profile
July 31, 2010, 11:30:11 PM
 #18

care with __attribute__ ((aligned (16))) , it doesn't work with local variable, gcc doesn't align the stack
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 11:37:02 PM
 #19

I'm running on the Intel Q6600 2.4Ghz, how shall I get the file to you?
yes. i will look at the assembler code. maybe the compiler did something "wrong".
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
August 01, 2010, 12:00:10 AM
 #20

Patch against SVN. Maybe it'll work now...
Code:
diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..6735678
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,447 @@
+#include <string.h>
+#include <assert.h>
+
+#include <xmmintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+ 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+ 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+ 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+ 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+ 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+ 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+ 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+ 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+ 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+ 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+ 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+ 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+ 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+ 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+ 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (~b & d);
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (b & d) ^ (c & d);
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define BIGSIGMA0_256(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define BIGSIGMA1_256(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define SIGMA0_256(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define SIGMA1_256(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+ return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ *x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+ return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+ unsigned int value = (*((unsigned int *)(addr)));
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline void dumpreg(__m128i x, char *msg) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x ;
+ printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+ __func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate()
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[9][NPAR], const void *init)
+{
+ unsigned int* In = (unsigned int*)pin;
+ unsigned int* Pad = (unsigned int*)pad;
+ unsigned int* hPre = (unsigned int*)pre;
+ unsigned int* hInit = (unsigned int*)init;
+ unsigned int i, j, k;
+
+ /* vectors used in calculation */
+ __m128i w0, w1, w2, w3, w4, w5, w6, w7;
+ __m128i w8, w9, w10, w11, w12, w13, w14, w15;
+ __m128i T1, T2;
+ __m128i a, b, c, d, e, f, g, h;
+  __m128i nonce;
+
+ /* nonce offset for vector */
+ __m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+ for(k = 0; k<NPAR; k+=4) {
+ w0 = load_epi32(In[0], In[0], In[0], In[0]);
+ w1 = load_epi32(In[1], In[1], In[1], In[1]);
+ w2 = load_epi32(In[2], In[2], In[2], In[2]);
+ //w3 = load_epi32(In[3], In[3], In[3], In[3]); nonce will be later hacked into the hash
+ w4 = load_epi32(In[4], In[4], In[4], In[4]);
+ w5 = load_epi32(In[5], In[5], In[5], In[5]);
+ w6 = load_epi32(In[6], In[6], In[6], In[6]);
+ w7 = load_epi32(In[7], In[7], In[7], In[7]);
+ w8 = load_epi32(In[8], In[8], In[8], In[8]);
+ w9 = load_epi32(In[9], In[9], In[9], In[9]);
+ w10 = load_epi32(In[10], In[10], In[10], In[10]);
+ w11 = load_epi32(In[11], In[11], In[11], In[11]);
+ w12 = load_epi32(In[12], In[12], In[12], In[12]);
+ w13 = load_epi32(In[13], In[13], In[13], In[13]);
+ w14 = load_epi32(In[14], In[14], In[14], In[14]);
+ w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+ /* hack nonce into lowest byte of w3 */
+ nonce = load_epi32(In[3], In[3], In[3], In[3]);
+ __m128i k_vec = load_epi32(k, k, k, k);
+ nonce = _mm_add_epi32(nonce, offset);
+ nonce = _mm_add_epi32(nonce, k_vec);
+    w3 = nonce;
+
+ a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+ b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+ c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+ d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+ e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+ f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+ g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+ h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+ w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+ dest = _mm_add_epi32(w8, x);
+
+ store_load(a, 0, w0);
+ store_load(b, 1, w1);
+ store_load(c, 2, w2);
+ store_load(d, 3, w3);
+ store_load(e, 4, w4);
+ store_load(f, 5, w5);
+ store_load(g, 6, w6);
+ store_load(h, 7, w7);
+
+ w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+ w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+ w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+ w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+ w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+ w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+ w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+ w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+ a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+ b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+ c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+ d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+ e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+ f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+ g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+ h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+ /* store resulsts directly in thash */
+#define store_2(x,i)  \
+ w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+ *(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x);
+
+ store_2(a, 0);
+ store_2(b, 1);
+ store_2(c, 2);
+ store_2(d, 3);
+ store_2(e, 4);
+ store_2(f, 5);
+ store_2(g, 6);
+ store_2(h, 7);
+ *(__m128i *)&(thash)[8][0+k] = nonce;
+ }
+
+}
diff --git a/main.cpp b/main.cpp
index 0239915..50db1a3 100644
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,128 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
+        // Cache for NPAR hashes
+        unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlocSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                /* get nonce from hash */
+                pblock->nNonce = ByteReverse((unsigned int)thash[8][j]);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
                     }
                     SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
                     Sleep(500);
                     break;
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break;
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index e965287..04dac86 100644
--- a/makefile.unix
+++ b/makefile.unix
@@ -41,7 +41,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+    cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -51,7 +52,7 @@ obj/%.o: %.cpp $(HEADERS)
  g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
- g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+ g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
  g++ $(CFLAGS) -o $@ $^ $(WXLIBS) $(LIBS)
@@ -63,6 +64,9 @@ obj/nogui/%.o: %.cpp $(HEADERS)
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
  g++ $(CFLAGS) -o $@ $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+ g++ $(CFLAGS) -o $@ $^ $(LIBS)
+
 
 clean:
  -rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100644
index 0000000..a55e972
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,221 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+
+int FormatHashBlocks(void* pbuffer, unsigned int len)
+{
+ unsigned char* pdata = (unsigned char*)pbuffer;
+ unsigned int blocks = 1 + ((len + 8) / 64);
+ unsigned char* pend = pdata + 64 * blocks;
+ memset(pdata + len, 0, 64 * blocks - len);
+ pdata[len] = 0x80;
+ unsigned int bits = len * 8;
+ pend[-1] = (bits >> 0) & 0xff;
+ pend[-2] = (bits >> 8) & 0xff;
+ pend[-3] = (bits >> 16) & 0xff;
+ pend[-4] = (bits >> 24) & 0xff;
+ return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+ memcpy(pstate, pinit, 32);
+ CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+ printf("SHA256 test started\n");
+
+ struct tmpworkspace
+ {
+ struct unnamed2
+ {
+ int nVersion;
+ uint256 hashPrevBlock;
+ uint256 hashMerkleRoot;
+ unsigned int nTime;
+ unsigned int nBits;
+ unsigned int nNonce;
+ }
+ block;
+ unsigned char pchPadding0[64];
+ uint256 hash1;
+ unsigned char pchPadding1[64];
+ }
+  tmp __attribute__ ((aligned (16)));
+
+ char line[180];
+ ifstream fin(filename);
+ char *p;
+ unsigned long int totalhashes= 0;
+ unsigned long int found = 0;
+ clock_t start, end;
+ unsigned long int cpu_time_used;
+ unsigned int tnonce;
+ start = clock();
+
+ while( fin.getline(line, 180))
+ {
+ string in(line);
+ //printf("%s\n", in.c_str());
+ tmp.block.nVersion       = strtol(in.substr(0,8).c_str(), &p, 16);
+ tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+ tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+ tmp.block.nTime          = strtol(in.substr(128+8,8).c_str(), &p, 16);
+ tmp.block.nBits          = strtol(in.substr(128+16,8).c_str(), &p, 16);
+ tnonce = strtol(in.substr(128+24,8).c_str(), &p, 16);
+ tmp.block.nNonce         = tnonce;
+
+ unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+ unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+ // Byte swap all the input buffer
+ for (int i = 0; i < sizeof(tmp)/4; i++)
+ ((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+ // Precalc the first half of the first hash, which stays constant
+ uint256 midstate __attribute__ ((aligned(16)));
+ SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+ uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+ // printf("target %s\n", hashTarget.GetHex().c_str());
+ uint256 hash;
+ uint256 refhash __attribute__ ((aligned(16)));
+
+ unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+ int done = 0;
+ unsigned int i, j;
+
+ /* reference */
+ SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+ SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+ for (int i = 0; i < sizeof(refhash)/4; i++)
+ ((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+ //printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+ tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+ for(;;)
+ {
+
+ Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+ for(i = 0; i<NPAR; i++) {
+ /* fast hash checking */
+ if(thash[7][i] == 0) {
+ // printf("found something... ");
+
+ for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+ // printf("%s\n", hash.GetHex().c_str());
+
+ if (hash <= hashTarget)
+ {
+ found++;
+ if(tnonce == ByteReverse((unsigned int)thash[8][i]) ) {
+ if(hash == refhash) {
+ printf("\r%lu", found);
+ totalhashes += NPAR;
+ done = 1;
+ } else {
+ printf("Hashes do not match!\n");
+ }
+ } else {
+ printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+ }
+ break;
+ }
+ }
+ }
+ if(done) break;
+
+ tmp.block.nNonce+=NPAR;
+ totalhashes += NPAR;
+ if(tmp.block.nNonce == 0) {
+ printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+ return;
+ }
+ }
+ }
+ printf("\n");
+ end = clock();
+ cpu_time_used += (unsigned int)(end - start);
+ cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+ printf("found solutions = %lu\n", found);
+ printf("total hashes = %lu\n", totalhashes);
+ printf("total time = %lu ms\n", cpu_time_used);
+ printf("average speed: %lu khash/s\n", (totalhashes)/cpu_time_used);
+}
+
+int main(int argc, char* argv[]) {
+ if(argc == 2) {
+ BitcoinTester(argv[1]);
+ } else
+ printf("Missing filename!\n");
+ return 0;
+}
Pages: [1] 2 3 4 5 »  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!