Bitcoin Forum
May 04, 2024, 04:26:42 PM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: [1] 2 3 4 5 »  All
  Print  
Author Topic: 4 hashes parallel on SSE2 CPUs for 0.3.6  (Read 22003 times)
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 30, 2010, 09:23:10 PM
 #1

This patch will calculate four hashes on one core using vector instructions. There's a test programm included that validates the new hash function against the old one so it should be correct.

The patch is against 0.3.6. Improves khash/s by roughly 115%.

http://pastebin.com/XN1JDb53
1714840002
Hero Member
*
Offline Offline

Posts: 1714840002

View Profile Personal Message (Offline)

Ignore
1714840002
Reply with quote  #2

1714840002
Report to moderator
1714840002
Hero Member
*
Offline Offline

Posts: 1714840002

View Profile Personal Message (Offline)

Ignore
1714840002
Reply with quote  #2

1714840002
Report to moderator
The Bitcoin network protocol was designed to be extremely flexible. It can be used to create timed transactions, escrow transactions, multi-signature transactions, etc. The current features of the client only hint at what will be possible in the future.
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
1714840002
Hero Member
*
Offline Offline

Posts: 1714840002

View Profile Personal Message (Offline)

Ignore
1714840002
Reply with quote  #2

1714840002
Report to moderator
1714840002
Hero Member
*
Offline Offline

Posts: 1714840002

View Profile Personal Message (Offline)

Ignore
1714840002
Reply with quote  #2

1714840002
Report to moderator
knightmb
Sr. Member
****
Offline Offline

Activity: 308
Merit: 256



View Profile WWW
July 30, 2010, 09:33:29 PM
 #2

I take it that you've already tested the hash limit before performance starts to suffer against the stock code? I'm just curious myself.

Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 30, 2010, 09:47:22 PM
 #3

Performance of stock code (as measured by my test/benchmark program) is about 1500khash/s.
My code does 3500khash/s. Both figures are for one core. It scales well because I do 128 hashes at once and keep the datastructures small enough to fit in the CPU cache.

I have two local collision attacks which will squeeze another 300khash/s out, but they are not stable yet.
knightmb
Sr. Member
****
Offline Offline

Activity: 308
Merit: 256



View Profile WWW
July 30, 2010, 09:51:10 PM
 #4

Awesome, I'll have to give it a try myself then.  Shocked

Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 30, 2010, 10:00:24 PM
 #5

Tell me if it works Smiley
Donations are welcome. 17asVKkzRGTFvvGH9dMGQaHe78xzfvgSSA
satoshi
Founder
Sr. Member
*
qt
Offline Offline

Activity: 364
Merit: 6723


View Profile
July 31, 2010, 12:29:20 AM
 #6

That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once?  I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
knightmb
Sr. Member
****
Offline Offline

Activity: 308
Merit: 256



View Profile WWW
July 31, 2010, 04:49:33 AM
 #7

Darn, it means the next release, the difficulty is going to have to increase to 1000 or so to keep up, LOL  Grin

Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 10:12:38 AM
 #8

That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once?  I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
That's how it works. Four 32 bit values in a 128 bit vector. They're calculated independently, but at the same time.

Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compiletime?
em3rgentOrdr
Sr. Member
****
Offline Offline

Activity: 434
Merit: 251


youtube.com/ericfontainejazz now accepts bitcoin


View Profile WWW
July 31, 2010, 01:42:48 PM
 #9

hmm...I wasn't able to apply the patch (I'm a noobie). Here's the command I ran from bitcoin-0.3.6/src # patch < XN1JDb53.txt

Output:

1 out of 1 hunk ignored
(Stripping trailing CRs from patch.)
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2701.
2 out of 2 hunks FAILED
(Stripping trailing CRs from patch.)
patching file makefile.unix
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 58.

What's the proper command to type into linux?  Or do you have linux binaries?

"We will not find a solution to political problems in cryptography, but we can win a major battle in the arms race and gain a new territory of freedom for several years.

Governments are good at cutting off the heads of a centrally controlled networks, but pure P2P networks are holding their own."
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 02:18:03 PM
 #10

the mean client would send all generated bitcoins to a certain address Wink

@em3rgent0rder: i don't know why it fails, but it should be easy to patch it manually...
jgarzik
Legendary
*
qt
Offline Offline

Activity: 1596
Merit: 1091


View Profile
July 31, 2010, 05:18:30 PM
 #11

hmm...I wasn't able to apply the patch (I'm a noobie). Here's the command I ran from bitcoin-0.3.6/src # patch < XN1JDb53.txt

Output:

1 out of 1 hunk ignored
(Stripping trailing CRs from patch.)
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2701.
2 out of 2 hunks FAILED
(Stripping trailing CRs from patch.)
patching file makefile.unix
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 58.

It definitely does not apply to the SVN trunk.  Maybe tcatm could post the main.cpp itself?

Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own.
Visit bloq.com / metronome.io
Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 05:40:27 PM
 #12

Looks like pastebin.com messes up the patch...
Code:
diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..15f8be1
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,443 @@
+#include <string.h>
+#include <assert.h>
+
+#include <xmmintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+ 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+ 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+ 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+ 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+ 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+ 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+ 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+ 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+ 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+ 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+ 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+ 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+ 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+ 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+ 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (~b & d);
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (b & d) ^ (c & d);
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define BIGSIGMA0_256(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define BIGSIGMA1_256(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define SIGMA0_256(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define SIGMA1_256(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+ return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ *x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+ return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+ unsigned int value = (*((unsigned int *)(addr)));
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline void dumpreg(__m128i x, char *msg) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x ;
+ printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+ __func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate()
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[8][NPAR], const void *init)
+{
+ unsigned int* In = (unsigned int*)pin;
+ unsigned int* Pad = (unsigned int*)pad;
+ unsigned int* hPre = (unsigned int*)pre;
+ unsigned int* hInit = (unsigned int*)init;
+ unsigned int i, j, k;
+
+ /* vectors used in calculation */
+ __m128i w0, w1, w2, w3, w4, w5, w6, w7;
+ __m128i w8, w9, w10, w11, w12, w13, w14, w15;
+ __m128i T1, T2;
+ __m128i a, b, c, d, e, f, g, h;
+
+ /* nonce offset for vector */
+ __m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+ for(k = 0; k<NPAR; k+=4) {
+ w0 = load_epi32(In[0], In[0], In[0], In[0]);
+ w1 = load_epi32(In[1], In[1], In[1], In[1]);
+ w2 = load_epi32(In[2], In[2], In[2], In[2]);
+ w3 = load_epi32(In[3], In[3], In[3], In[3]);
+ w4 = load_epi32(In[4], In[4], In[4], In[4]);
+ w5 = load_epi32(In[5], In[5], In[5], In[5]);
+ w6 = load_epi32(In[6], In[6], In[6], In[6]);
+ w7 = load_epi32(In[7], In[7], In[7], In[7]);
+ w8 = load_epi32(In[8], In[8], In[8], In[8]);
+ w9 = load_epi32(In[9], In[9], In[9], In[9]);
+ w10 = load_epi32(In[10], In[10], In[10], In[10]);
+ w11 = load_epi32(In[11], In[11], In[11], In[11]);
+ w12 = load_epi32(In[12], In[12], In[12], In[12]);
+ w13 = load_epi32(In[13], In[13], In[13], In[13]);
+ w14 = load_epi32(In[14], In[14], In[14], In[14]);
+ w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+ /* hack nonce into lowest byte of w3 */
+ __m128i k_vec = load_epi32(k, k, k, k);
+ w3 = _mm_add_epi32(w3, offset);
+ w3 = _mm_add_epi32(w3, k_vec);
+
+ a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+ b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+ c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+ d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+ e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+ f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+ g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+ h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+ w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+ dest = _mm_add_epi32(w8, x);
+
+ store_load(a, 0, w0);
+ store_load(b, 1, w1);
+ store_load(c, 2, w2);
+ store_load(d, 3, w3);
+ store_load(e, 4, w4);
+ store_load(f, 5, w5);
+ store_load(g, 6, w6);
+ store_load(h, 7, w7);
+
+ w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+ w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+ w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+ w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+ w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+ w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+ w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+ w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+ a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+ b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+ c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+ d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+ e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+ f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+ g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+ h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+ /* store resulsts directly in thash */
+#define store_2(x,i)  \
+ w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+ *(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x);
+
+ store_2(a, 0);
+ store_2(b, 1);
+ store_2(c, 2);
+ store_2(d, 3);
+ store_2(e, 4);
+ store_2(f, 5);
+ store_2(g, 6);
+ store_2(h, 7);
+ }
+
+}
diff --git a/main.cpp b/main.cpp
index ddc359a..d30d642 100755
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,123 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
+        // Cache for NPAR hashes
+        unsigned int thash[8][NPAR];
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlocSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
-                    }
-                    SetThreadPriority(THREAD_PRIORITY_LOWEST);
-
-                    Sleep(500);
-                    break;
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
+                  }
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break;
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index 597a0ea..8fb0aa6 100755
--- a/makefile.unix
+++ b/makefile.unix
@@ -45,7 +45,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+ cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -58,18 +59,20 @@ obj/%.o: %.cpp $(HEADERS) headers.h.gch
  g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
- g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+ g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
  g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
 
-
 obj/nogui/%.o: %.cpp $(HEADERS)
  g++ -c $(CFLAGS) -o $@ $<
 
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
  g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+   g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+
 
 clean:
  -rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100755
index 0000000..7cab332
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,237 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+ template <size_t nBytes, typename T>
+T* alignup(T* p)
+{
+ union
+ {   
+ T* ptr;
+ size_t n;
+ } u;
+ u.ptr = p;
+ u.n = (u.n + (nBytes-1)) & ~(nBytes-1);
+ return u.ptr;
+}
+
+int FormatHashBlocks(void* pbuffer, unsigned int len)
+{
+ unsigned char* pdata = (unsigned char*)pbuffer;
+ unsigned int blocks = 1 + ((len + 8) / 64);
+ unsigned char* pend = pdata + 64 * blocks;
+ memset(pdata + len, 0, 64 * blocks - len);
+ pdata[len] = 0x80;
+ unsigned int bits = len * 8;
+ pend[-1] = (bits >> 0) & 0xff;
+ pend[-2] = (bits >> 8) & 0xff;
+ pend[-3] = (bits >> 16) & 0xff;
+ pend[-4] = (bits >> 24) & 0xff;
+ return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+ memcpy(pstate, pinit, 32);
+ CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+ printf("SHA256 test started\n");
+
+ struct tmpworkspace
+ {
+ struct unnamed2
+ {
+ int nVersion;
+ uint256 hashPrevBlock;
+ uint256 hashMerkleRoot;
+ unsigned int nTime;
+ unsigned int nBits;
+ unsigned int nNonce;
+ }
+ block;
+ unsigned char pchPadding0[64];
+ uint256 hash1;
+ unsigned char pchPadding1[64];
+ };
+ char tmpbuf[sizeof(tmpworkspace)+16];
+ tmpworkspace& tmp = *(tmpworkspace*)alignup<16>(tmpbuf);
+
+
+ char line[180];
+ ifstream fin(filename);
+ char *p;
+ unsigned long int totalhashes= 0;
+ unsigned long int found = 0;
+ clock_t start, end;
+ unsigned long int cpu_time_used;
+ unsigned int tnonce;
+ start = clock();
+
+ while( fin.getline(line, 180))
+ {
+ string in(line);
+ //printf("%s\n", in.c_str());
+ tmp.block.nVersion       = strtol(in.substr(0,8).c_str(), &p, 16);
+ tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+ tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+ tmp.block.nTime          = strtol(in.substr(128+8,8).c_str(), &p, 16);
+ tmp.block.nBits          = strtol(in.substr(128+16,8).c_str(), &p, 16);
+ tnonce = strtol(in.substr(128+24,8).c_str(), &p, 16);
+ tmp.block.nNonce         = tnonce;
+
+ unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+ unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+ // Byte swap all the input buffer
+ for (int i = 0; i < sizeof(tmp)/4; i++)
+ ((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+ // Precalc the first half of the first hash, which stays constant
+ uint256 midstatebuf[2];
+ uint256& midstate = *alignup<16>(midstatebuf);
+ SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+ uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+ // printf("target %s\n", hashTarget.GetHex().c_str());
+ uint256 hash;
+ uint256 hashbuf[2];
+ uint256& refhash = *alignup<16>(hashbuf);
+
+ unsigned int thash[8][NPAR];
+ int done = 0;
+ unsigned int i, j;
+
+ /* reference */
+ SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+ SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+ for (int i = 0; i < sizeof(refhash)/4; i++)
+ ((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+ //printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+ tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+ for(;;)
+ {
+
+ Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+ for(i = 0; i<NPAR; i++) {
+ /* fast hash checking */
+ if(thash[7][i] == 0) {
+ // printf("found something... ");
+
+ for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+ // printf("%s\n", hash.GetHex().c_str());
+
+ if (hash <= hashTarget)
+ {
+ found++;
+ if(tnonce == ByteReverse(tmp.block.nNonce + i) ) {
+ if(hash == refhash) {
+ printf("\r%lu", found);
+ totalhashes += NPAR;
+ done = 1;
+ } else {
+ printf("Hashes do not match!\n");
+ }
+ } else {
+ printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+ }
+ break;
+ }
+ }
+ }
+ if(done) break;
+
+ tmp.block.nNonce+=NPAR;
+ totalhashes += NPAR;
+ if(tmp.block.nNonce == 0) {
+ printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+ return;
+ }
+ }
+ }
+ printf("\n");
+ end = clock();
+ cpu_time_used += (unsigned int)(end - start);
+ cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+ printf("found solutions = %lu\n", found);
+ printf("total hashes = %lu\n", totalhashes);
+ printf("total time = %lu ms\n", cpu_time_used);
+ printf("average speed: %lu khash/s\n", (totalhashes)/cpu_time_used);
+}
+
+int main(int argc, char* argv[]) {
+ if(argc == 2) {
+ BitcoinTester(argv[1]);
+ } else
+ printf("Missing filename!\n");
+ return 0;
+}
nelisky
Legendary
*
Offline Offline

Activity: 1540
Merit: 1001


View Profile
July 31, 2010, 07:17:17 PM
 #13

Had to manually patch, as I'm not using git for bitcoin and 'patch' doesn't munch this format, I guess. Anyway, got almost double speed on the OSX side, (i5 2.4, now ~2400 from ~1400), but my linux on Q6600 quad 2.4Ghz was pumping ~2500 with 0.3.6 (from source) and now, with the patch it's... ~2400. Need I tweak anything to take advantage on this?
nelisky
Legendary
*
Offline Offline

Activity: 1540
Merit: 1001


View Profile
July 31, 2010, 08:34:06 PM
 #14

ahm, let me correct myself: on the quad core linux, I went from ~4400 with svn trunk @ 119 to ~2400 with the patch... not exactly what I hoped for after the success in OSX.
aceat64
Full Member
***
Offline Offline

Activity: 307
Merit: 102



View Profile
July 31, 2010, 08:57:49 PM
 #15

ahm, let me correct myself: on the quad core linux, I went from ~4400 with svn trunk @ 119 to ~2400 with the patch... not exactly what I hoped for after the success in OSX.

I noticed the same, I went from about 4300 to 2100 when I tested it on Linux.
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 10:38:29 PM
 #16

What CPUs are you running it on? Could you send me sha256.o (compiled object of the algorithm)?
nelisky
Legendary
*
Offline Offline

Activity: 1540
Merit: 1001


View Profile
July 31, 2010, 11:18:07 PM
 #17

I'm running on the Intel Q6600 2.4Ghz, how shall I get the file to you?
Mionione
Newbie
*
Offline Offline

Activity: 10
Merit: 0


View Profile
July 31, 2010, 11:30:11 PM
 #18

care with __attribute__ ((aligned (16))) , it doesn't work with local variable, gcc doesn't align the stack
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
July 31, 2010, 11:37:02 PM
 #19

I'm running on the Intel Q6600 2.4Ghz, how shall I get the file to you?
yes. i will look at the assembler code. maybe the compiler did something "wrong".
tcatm (OP)
Sr. Member
****
qt
Offline Offline

Activity: 337
Merit: 265


View Profile
August 01, 2010, 12:00:10 AM
 #20

Patch against SVN. Maybe it'll work now...
Code:
diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..6735678
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,447 @@
+#include <string.h>
+#include <assert.h>
+
+#include <xmmintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+ 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+ 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+ 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+ 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+ 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+ 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+ 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+ 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+ 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+ 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+ 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+ 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+ 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+ 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+ 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (~b & d);
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (b & d) ^ (c & d);
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define BIGSIGMA0_256(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define BIGSIGMA1_256(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define SIGMA0_256(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define SIGMA1_256(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+ return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ *x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+ return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+ unsigned int value = (*((unsigned int *)(addr)));
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline void dumpreg(__m128i x, char *msg) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x ;
+ printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+ __func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate()
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[9][NPAR], const void *init)
+{
+ unsigned int* In = (unsigned int*)pin;
+ unsigned int* Pad = (unsigned int*)pad;
+ unsigned int* hPre = (unsigned int*)pre;
+ unsigned int* hInit = (unsigned int*)init;
+ unsigned int i, j, k;
+
+ /* vectors used in calculation */
+ __m128i w0, w1, w2, w3, w4, w5, w6, w7;
+ __m128i w8, w9, w10, w11, w12, w13, w14, w15;
+ __m128i T1, T2;
+ __m128i a, b, c, d, e, f, g, h;
+  __m128i nonce;
+
+ /* nonce offset for vector */
+ __m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+ for(k = 0; k<NPAR; k+=4) {
+ w0 = load_epi32(In[0], In[0], In[0], In[0]);
+ w1 = load_epi32(In[1], In[1], In[1], In[1]);
+ w2 = load_epi32(In[2], In[2], In[2], In[2]);
+ //w3 = load_epi32(In[3], In[3], In[3], In[3]); nonce will be later hacked into the hash
+ w4 = load_epi32(In[4], In[4], In[4], In[4]);
+ w5 = load_epi32(In[5], In[5], In[5], In[5]);
+ w6 = load_epi32(In[6], In[6], In[6], In[6]);
+ w7 = load_epi32(In[7], In[7], In[7], In[7]);
+ w8 = load_epi32(In[8], In[8], In[8], In[8]);
+ w9 = load_epi32(In[9], In[9], In[9], In[9]);
+ w10 = load_epi32(In[10], In[10], In[10], In[10]);
+ w11 = load_epi32(In[11], In[11], In[11], In[11]);
+ w12 = load_epi32(In[12], In[12], In[12], In[12]);
+ w13 = load_epi32(In[13], In[13], In[13], In[13]);
+ w14 = load_epi32(In[14], In[14], In[14], In[14]);
+ w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+ /* hack nonce into lowest byte of w3 */
+ nonce = load_epi32(In[3], In[3], In[3], In[3]);
+ __m128i k_vec = load_epi32(k, k, k, k);
+ nonce = _mm_add_epi32(nonce, offset);
+ nonce = _mm_add_epi32(nonce, k_vec);
+    w3 = nonce;
+
+ a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+ b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+ c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+ d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+ e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+ f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+ g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+ h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+ w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+ dest = _mm_add_epi32(w8, x);
+
+ store_load(a, 0, w0);
+ store_load(b, 1, w1);
+ store_load(c, 2, w2);
+ store_load(d, 3, w3);
+ store_load(e, 4, w4);
+ store_load(f, 5, w5);
+ store_load(g, 6, w6);
+ store_load(h, 7, w7);
+
+ w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+ w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+ w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+ w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+ w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+ w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+ w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+ w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+ a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+ b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+ c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+ d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+ e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+ f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+ g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+ h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+ /* store resulsts directly in thash */
+#define store_2(x,i)  \
+ w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+ *(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x);
+
+ store_2(a, 0);
+ store_2(b, 1);
+ store_2(c, 2);
+ store_2(d, 3);
+ store_2(e, 4);
+ store_2(f, 5);
+ store_2(g, 6);
+ store_2(h, 7);
+ *(__m128i *)&(thash)[8][0+k] = nonce;
+ }
+
+}
diff --git a/main.cpp b/main.cpp
index 0239915..50db1a3 100644
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,128 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
+        // Cache for NPAR hashes
+        unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlocSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                /* get nonce from hash */
+                pblock->nNonce = ByteReverse((unsigned int)thash[8][j]);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
                     }
                     SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
                     Sleep(500);
                     break;
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break;
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index e965287..04dac86 100644
--- a/makefile.unix
+++ b/makefile.unix
@@ -41,7 +41,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+    cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -51,7 +52,7 @@ obj/%.o: %.cpp $(HEADERS)
  g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
- g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+ g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
  g++ $(CFLAGS) -o $@ $^ $(WXLIBS) $(LIBS)
@@ -63,6 +64,9 @@ obj/nogui/%.o: %.cpp $(HEADERS)
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
  g++ $(CFLAGS) -o $@ $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+ g++ $(CFLAGS) -o $@ $^ $(LIBS)
+
 
 clean:
  -rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100644
index 0000000..a55e972
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,221 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+
+int FormatHashBlocks(void* pbuffer, unsigned int len)
+{
+ unsigned char* pdata = (unsigned char*)pbuffer;
+ unsigned int blocks = 1 + ((len + 8) / 64);
+ unsigned char* pend = pdata + 64 * blocks;
+ memset(pdata + len, 0, 64 * blocks - len);
+ pdata[len] = 0x80;
+ unsigned int bits = len * 8;
+ pend[-1] = (bits >> 0) & 0xff;
+ pend[-2] = (bits >> 8) & 0xff;
+ pend[-3] = (bits >> 16) & 0xff;
+ pend[-4] = (bits >> 24) & 0xff;
+ return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+ memcpy(pstate, pinit, 32);
+ CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+ printf("SHA256 test started\n");
+
+ struct tmpworkspace
+ {
+ struct unnamed2
+ {
+ int nVersion;
+ uint256 hashPrevBlock;
+ uint256 hashMerkleRoot;
+ unsigned int nTime;
+ unsigned int nBits;
+ unsigned int nNonce;
+ }
+ block;
+ unsigned char pchPadding0[64];
+ uint256 hash1;
+ unsigned char pchPadding1[64];
+ }
+  tmp __attribute__ ((aligned (16)));
+
+ char line[180];
+ ifstream fin(filename);
+ char *p;
+ unsigned long int totalhashes= 0;
+ unsigned long int found = 0;
+ clock_t start, end;
+ unsigned long int cpu_time_used;
+ unsigned int tnonce;
+ start = clock();
+
+ while( fin.getline(line, 180))
+ {
+ string in(line);
+ //printf("%s\n", in.c_str());
+ tmp.block.nVersion       = strtol(in.substr(0,8).c_str(), &p, 16);
+ tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+ tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+ tmp.block.nTime          = strtol(in.substr(128+8,8).c_str(), &p, 16);
+ tmp.block.nBits          = strtol(in.substr(128+16,8).c_str(), &p, 16);
+ tnonce = strtol(in.substr(128+24,8).c_str(), &p, 16);
+ tmp.block.nNonce         = tnonce;
+
+ unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+ unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+ // Byte swap all the input buffer
+ for (int i = 0; i < sizeof(tmp)/4; i++)
+ ((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+ // Precalc the first half of the first hash, which stays constant
+ uint256 midstate __attribute__ ((aligned(16)));
+ SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+ uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+ // printf("target %s\n", hashTarget.GetHex().c_str());
+ uint256 hash;
+ uint256 refhash __attribute__ ((aligned(16)));
+
+ unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+ int done = 0;
+ unsigned int i, j;
+
+ /* reference */
+ SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+ SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+ for (int i = 0; i < sizeof(refhash)/4; i++)
+ ((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+ //printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+ tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+ for(;;)
+ {
+
+ Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+ for(i = 0; i<NPAR; i++) {
+ /* fast hash checking */
+ if(thash[7][i] == 0) {
+ // printf("found something... ");
+
+ for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+ // printf("%s\n", hash.GetHex().c_str());
+
+ if (hash <= hashTarget)
+ {
+ found++;
+ if(tnonce == ByteReverse((unsigned int)thash[8][i]) ) {
+ if(hash == refhash) {
+ printf("\r%lu", found);
+ totalhashes += NPAR;
+ done = 1;
+ } else {
+ printf("Hashes do not match!\n");
+ }
+ } else {
+ printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+ }
+ break;
+ }
+ }
+ }
+ if(done) break;
+
+ tmp.block.nNonce+=NPAR;
+ totalhashes += NPAR;
+ if(tmp.block.nNonce == 0) {
+ printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+ return;
+ }
+ }
+ }
+ printf("\n");
+ end = clock();
+ cpu_time_used += (unsigned int)(end - start);
+ cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+ printf("found solutions = %lu\n", found);
+ printf("total hashes = %lu\n", totalhashes);
+ printf("total time = %lu ms\n", cpu_time_used);
+ printf("average speed: %lu khash/s\n", (totalhashes)/cpu_time_used);
+}
+
+int main(int argc, char* argv[]) {
+ if(argc == 2) {
+ BitcoinTester(argv[1]);
+ } else
+ printf("Missing filename!\n");
+ return 0;
+}
Pages: [1] 2 3 4 5 »  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!