Title: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 30, 2010, 09:23:10 PM

This patch calculates four hashes at once on a single core using vector instructions. It includes a test program that validates the new hash function against the old one, so it should be correct.

The patch is against 0.3.6. It improves khash/s by roughly 115%.

http://pastebin.com/XN1JDb53

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: knightmb on July 30, 2010, 09:33:29 PM

I take it that you've already tested the hash limit before performance starts to suffer against the stock code? I'm just curious myself.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 30, 2010, 09:47:22 PM

Performance of the stock code (as measured by my test/benchmark program) is about 1500 khash/s.
My code does 3500 khash/s. Both figures are for one core. It scales well because I do 128 hashes at once and keep the data structures small enough to fit in the CPU cache.
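
As a rough illustration of the cache argument, the sketch below works out how much memory the digests of one batch occupy. It assumes each digest is stored as eight 32-bit words and that one 128-bit register covers four hashes; batch_bytes and vector_steps are hypothetical helper names, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define DIGEST_WORDS 8   /* a SHA-256 digest is eight 32-bit words */
#define SIMD_LANES   4   /* 32-bit lanes in one 128-bit SSE2 register */

/* Bytes needed to hold the digests of one batch of npar hashes. */
static size_t batch_bytes(size_t npar) {
    return DIGEST_WORDS * npar * sizeof(uint32_t);
}

/* Number of 4-wide vector iterations needed to cover one batch. */
static size_t vector_steps(size_t npar) {
    return npar / SIMD_LANES;
}
```

For a batch of 128 hashes this gives 4096 bytes of digests and 32 vector iterations, which would fit comfortably in an L1 data cache of a few tens of kilobytes.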

I have two local collision attacks which will squeeze out another 300 khash/s, but they are not stable yet.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: knightmb on July 30, 2010, 09:51:10 PM

Awesome, I'll have to give it a try myself then. :o

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 30, 2010, 10:00:24 PM

Tell me if it works :)
Donations are welcome. 17asVKkzRGTFvvGH9dMGQaHe78xzfvgSSA

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on July 31, 2010, 12:29:20 AM

That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once? I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
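
To make the carry concern concrete, here is a minimal sketch in plain C (not taken from the patch). Two 32-bit values are packed into one 64-bit word, and a plain 64-bit addition is compared with an addition that treats each half as its own lane. The names pack2, add_plain and add_lanes are made up for illustration.

```c
#include <stdint.h>

/* Pack two 32-bit values into one 64-bit word: lo in bits 0..31, hi in bits 32..63. */
static uint64_t pack2(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* Plain 64-bit add: a carry out of the low half spills into the high half. */
static uint64_t add_plain(uint64_t x, uint64_t y) {
    return x + y;
}

/* Lane-wise add: each half wraps modulo 2^32 on its own, no carry crosses over. */
static uint64_t add_lanes(uint64_t x, uint64_t y) {
    uint32_t lo = (uint32_t)x + (uint32_t)y;
    uint32_t hi = (uint32_t)(x >> 32) + (uint32_t)(y >> 32);
    return pack2(lo, hi);
}
```

With lo = 0xFFFFFFFF and hi = 5, adding 1 to the low half turns hi into 6 under the plain add but leaves it at 5 under the lane-wise add.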

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: knightmb on July 31, 2010, 04:49:33 AM

Darn, it means the next release, the difficulty is going to have to increase to 1000 or so to keep up, LOL ;D

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 31, 2010, 10:12:38 AM

Quote from: satoshi on July 31, 2010, 12:29:20 AM

That's how it works. Four 32-bit values in a 128-bit vector. They're calculated independently, but at the same time.
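
A small sketch of that lane independence, assuming an x86 compiler with SSE2 enabled (a scalar fallback is included so the snippet still compiles elsewhere). The function names add4_lanes and demo_lane and the input values are hypothetical; _mm_add_epi32 is the same intrinsic the patch uses for its additions.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 integer intrinsics */
#endif

/* out[i] = a[i] + b[i] (mod 2^32) for i = 0..3 */
static void add4_lanes(const uint32_t a[4], const uint32_t b[4], uint32_t out[4]) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* one instruction, four independent 32-bit additions */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* scalar fallback with the same lane-wise result */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}

/* Adds two fixed test vectors and returns the requested lane of the sum. */
static uint32_t demo_lane(int lane) {
    const uint32_t a[4] = { 0xFFFFFFFFu, 1u, 0x80000000u, 7u };
    const uint32_t b[4] = { 1u,          1u, 0x80000000u, 0u };
    uint32_t out[4];
    add4_lanes(a, b, out);
    return out[lane];
}
```

Lanes 0 and 2 overflow and wrap to 0, while lanes 1 and 3 come out as 2 and 7, so the overflow of one lane never reaches its neighbour.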

By the way, why are you using the alignup<16> function when __attribute__ ((aligned (16))) tells the compiler to align at compile time?
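
The two approaches can be compared in a short sketch. It assumes GCC or Clang, since the aligned attribute is a compiler extension. align_up_16 is a hypothetical stand-in for what an alignup<16>() helper does (rounding a pointer up at run time), and the other function names are also made up for illustration.

```c
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Run-time approach: round a pointer up to the next 16-byte boundary. */
static void *align_up_16(void *p) {
    return (void *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
}

/* Over-allocate by 16 bytes, then pick the aligned start inside the buffer. */
static int runtime_aligned_ok(void) {
    unsigned int raw[8 + 4];
    unsigned int *p = align_up_16(raw);
    p[0] = 0;
    /* p is aligned and the 8 words we need still fit inside raw */
    return is_aligned_16(p) && p + 8 <= raw + 12;
}

/* Compile-time approach: the compiler places the buffer on a 16-byte boundary. */
static int attribute_aligned_ok(void) {
    unsigned int buf[8] __attribute__((aligned(16)));
    buf[0] = 0;
    return is_aligned_16(buf);
}
```

The run-time version costs some spare bytes and a pointer adjustment, while the attribute version leaves the placement to the compiler.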

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: em3rgentOrdr on July 31, 2010, 01:42:48 PM

Hmm... I wasn't able to apply the patch (I'm a newbie). Here's the command I ran from bitcoin-0.3.6/src:

# patch < XN1JDb53.txt

Output:

1 out of 1 hunk ignored
(Stripping trailing CRs from patch.)
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2701.
2 out of 2 hunks FAILED
(Stripping trailing CRs from patch.)
patching file makefile.unix
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 58.

What's the proper command to type on Linux? Or do you have Linux binaries?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 31, 2010, 02:18:03 PM

the mean client would send all generated bitcoins to a certain address ;)

@em3rgentOrdr: I don't know why it fails, but it should be easy to patch it manually...

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: jgarzik on July 31, 2010, 05:18:30 PM

Quote from: em3rgentOrdr on July 31, 2010, 01:42:48 PM

It definitely does not apply to the SVN trunk. Maybe tcatm could post the main.cpp itself?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 31, 2010, 05:40:27 PM

Looks like pastebin.com messes up the patch...

Code:

diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..15f8be1
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,443 @@
+#include <string.h>
+#include <assert.h>
+
+#include <emmintrin.h> /* SSE2 integer intrinsics (__m128i, _mm_add_epi32) */
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+	0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+	0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+	0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+	0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+	0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+	0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+	0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+	0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+	return (b & c) ^ (~b & d);
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+	return (b & c) ^ (b & d) ^ (c & d);
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+	return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+	return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define	BIGSIGMA0_256(x)	(ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define	BIGSIGMA1_256(x)	(ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define	SIGMA0_256(x)		(ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define	SIGMA1_256(x)		(ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+	return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+	union { unsigned int ret[4]; __m128i x; } box;
+	box.x = x;
+	return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+	union { unsigned int ret[4]; __m128i x; } box;
+	box.x = x;
+	*x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+	return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define	SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+	T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w);	\
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define	SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+	T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w);	\
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define	SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+	T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w);	\
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+	__asm__ ("bswap %0" : "=r" (value) : "0" (value));
+	return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+	unsigned int value = (*((unsigned int *)(addr)));
+	__asm__ ("bswap %0" : "=r" (value) : "0" (value));
+	return value;
+}
+
+static inline void dumpreg(__m128i x, char *msg) {
+	union { unsigned int ret[4]; __m128i x; } box;
+	box.x = x ;
+	printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+		__func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate()
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[8][NPAR], const void *init)
+{
+	unsigned int* In = (unsigned int*)pin;
+	unsigned int* Pad = (unsigned int*)pad;
+	unsigned int* hPre = (unsigned int*)pre;
+	unsigned int* hInit = (unsigned int*)init;
+	unsigned int i, j, k;
+
+	/* vectors used in calculation */
+	__m128i w0, w1, w2, w3, w4, w5, w6, w7;
+	__m128i w8, w9, w10, w11, w12, w13, w14, w15;
+	__m128i T1, T2;
+	__m128i a, b, c, d, e, f, g, h;
+
+	/* nonce offset for vector */
+	__m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+	for(k = 0; k<NPAR; k+=4) {
+		w0 = load_epi32(In[0], In[0], In[0], In[0]);
+		w1 = load_epi32(In[1], In[1], In[1], In[1]);
+		w2 = load_epi32(In[2], In[2], In[2], In[2]);
+		w3 = load_epi32(In[3], In[3], In[3], In[3]);
+		w4 = load_epi32(In[4], In[4], In[4], In[4]);
+		w5 = load_epi32(In[5], In[5], In[5], In[5]);
+		w6 = load_epi32(In[6], In[6], In[6], In[6]);
+		w7 = load_epi32(In[7], In[7], In[7], In[7]);
+		w8 = load_epi32(In[8], In[8], In[8], In[8]);
+		w9 = load_epi32(In[9], In[9], In[9], In[9]);
+		w10 = load_epi32(In[10], In[10], In[10], In[10]);
+		w11 = load_epi32(In[11], In[11], In[11], In[11]);
+		w12 = load_epi32(In[12], In[12], In[12], In[12]);
+		w13 = load_epi32(In[13], In[13], In[13], In[13]);
+		w14 = load_epi32(In[14], In[14], In[14], In[14]);
+		w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+		/* hack nonce into lowest byte of w3 */
+		__m128i k_vec = load_epi32(k, k, k, k);
+		w3 = _mm_add_epi32(w3, offset);
+		w3 = _mm_add_epi32(w3, k_vec);
+
+		a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+		b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+		c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+		d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+		e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+		f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+		g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+		h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+		SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);    
+		SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+		w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+		dest = _mm_add_epi32(w8, x);
+
+		store_load(a, 0, w0);
+		store_load(b, 1, w1);
+		store_load(c, 2, w2);
+		store_load(d, 3, w3);
+		store_load(e, 4, w4);
+		store_load(f, 5, w5);
+		store_load(g, 6, w6);
+		store_load(h, 7, w7);
+
+		w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+		w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+		w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+		w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+		w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+		w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+		w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+		w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+		a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+		b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+		c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+		d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+		e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+		f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+		g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+		h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+		SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);    
+		SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+		/* store results directly in thash */
+#define store_2(x,i)  \
+		w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+		*(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x); 
+
+		store_2(a, 0);
+		store_2(b, 1);
+		store_2(c, 2);
+		store_2(d, 3);
+		store_2(e, 4);
+		store_2(f, 5);
+		store_2(g, 6);
+		store_2(h, 7);
+	}
+
+}
diff --git a/main.cpp b/main.cpp
index ddc359a..d30d642 100755
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,123 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
+        // Cache for NPAR hashes
+        unsigned int thash[8][NPAR] __attribute__((aligned(16))); // written with 128-bit vector stores
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlockSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
-                    }
-                    SetThreadPriority(THREAD_PRIORITY_LOWEST);
-
-                    Sleep(500);
-                    break;
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
+                  }
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break;
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index 597a0ea..8fb0aa6 100755
--- a/makefile.unix
+++ b/makefile.unix
@@ -45,7 +45,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+    cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -58,18 +59,20 @@ obj/%.o: %.cpp $(HEADERS) headers.h.gch
 	g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
-	g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+	g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
 	g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
 
-
 obj/nogui/%.o: %.cpp $(HEADERS)
 	g++ -c $(CFLAGS) -o $@ $<
 
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
 	g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+	g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+
 
 clean:
 	-rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100755
index 0000000..7cab332
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,237 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+	template <size_t nBytes, typename T>
+T* alignup(T* p)
+{
+	union
+	{   
+		T* ptr;
+		size_t n;
+	} u;
+	u.ptr = p;
+	u.n = (u.n + (nBytes-1)) & ~(nBytes-1);
+	return u.ptr;
+}
+
+int FormatHashBlocks(void* pbuffer, unsigned int len) 
+{
+	unsigned char* pdata = (unsigned char*)pbuffer;
+	unsigned int blocks = 1 + ((len + 8) / 64); 
+	unsigned char* pend = pdata + 64 * blocks;
+	memset(pdata + len, 0, 64 * blocks - len);
+	pdata[len] = 0x80;
+	unsigned int bits = len * 8; 
+	pend[-1] = (bits >> 0) & 0xff;
+	pend[-2] = (bits >> 8) & 0xff;
+	pend[-3] = (bits >> 16) & 0xff;
+	pend[-4] = (bits >> 24) & 0xff;
+	return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32 
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+	memcpy(pstate, pinit, 32); 
+	CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+	printf("SHA256 test started\n");
+
+	struct tmpworkspace
+	{
+		struct unnamed2
+		{
+			int nVersion;
+			uint256 hashPrevBlock;
+			uint256 hashMerkleRoot;
+			unsigned int nTime;
+			unsigned int nBits;
+			unsigned int nNonce;
+		}
+		block;
+		unsigned char pchPadding0[64];
+		uint256 hash1;
+		unsigned char pchPadding1[64];
+	};
+	char tmpbuf[sizeof(tmpworkspace)+16];
+	tmpworkspace& tmp = *(tmpworkspace*)alignup<16>(tmpbuf);
+
+
+	char line[180];
+	ifstream fin(filename);
+	char *p;
+	unsigned long int totalhashes= 0;
+	unsigned long int found = 0;
+	clock_t start, end;
+	unsigned long int cpu_time_used;
+	unsigned int tnonce;
+	start = clock();
+
+	while( fin.getline(line, 180)) 
+	{
+		string in(line);
+		//printf("%s\n", in.c_str());
+		tmp.block.nVersion       = strtoul(in.substr(0,8).c_str(), &p, 16);
+		tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+		tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+		tmp.block.nTime          = strtoul(in.substr(128+8,8).c_str(), &p, 16);
+		tmp.block.nBits          = strtoul(in.substr(128+16,8).c_str(), &p, 16);
+		tnonce = strtoul(in.substr(128+24,8).c_str(), &p, 16);
+		tmp.block.nNonce         = tnonce;
+
+		unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+		unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+		// Byte swap all the input buffer
+		for (int i = 0; i < sizeof(tmp)/4; i++) 
+			((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+		// Precalc the first half of the first hash, which stays constant
+		uint256 midstatebuf[2];
+		uint256& midstate = *alignup<16>(midstatebuf);
+		SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+		uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+		//	printf("target %s\n", hashTarget.GetHex().c_str());
+		uint256 hash;
+		uint256 hashbuf[2];
+		uint256& refhash = *alignup<16>(hashbuf);
+
+		unsigned int thash[8][NPAR];
+		int done = 0;
+		unsigned int i, j;
+
+		/* reference */
+		SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+		SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+		for (int i = 0; i < sizeof(refhash)/4; i++)
+			((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+		//printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+		tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+		for(;;)
+		{
+
+			Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+			for(i = 0; i<NPAR; i++) {
+				/* fast hash checking */
+				if(thash[7][i] == 0) {
+			//		printf("found something... ");
+
+					for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+				//	printf("%s\n", hash.GetHex().c_str());
+
+					if (hash <= hashTarget)
+					{
+						found++;
+						if(tnonce == ByteReverse(tmp.block.nNonce + i) ) {
+							if(hash == refhash) {
+								printf("\r%lu", found);
+								totalhashes += NPAR;
+								done = 1;
+							} else {
+								printf("Hashes do not match!\n");
+							}
+						} else {
+							printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+						}
+						break;
+					}
+				}
+			}
+			if(done) break;
+
+			tmp.block.nNonce+=NPAR;
+			totalhashes += NPAR;
+			if(tmp.block.nNonce == 0) {
+				printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+				return;
+			}
+		}
+	}
+	printf("\n");
+	end = clock();
+	cpu_time_used = (unsigned long)(end - start);
+	cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+	printf("found solutions = %lu\n", found);
+	printf("total hashes = %lu\n", totalhashes);
+	printf("total time = %lu ms\n", cpu_time_used);
+	printf("average speed: %lu khash/s\n", cpu_time_used ? totalhashes/cpu_time_used : 0UL);
+}
+
+int main(int argc, char* argv[]) {
+	if(argc == 2) {
+		BitcoinTester(argv[1]);
+	} else 
+		printf("Missing filename!\n");
+	return 0;
+}
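
For anyone following FormatHashBlocks in the test program above, here is a minimal standalone sketch of the same padding rule: append 0x80, zero-fill, and put the message length in bits at the end of the last 64-byte block. The name PadForSha256 is made up for illustration and is not part of the patch. Unlike the patch, which writes only the low 32 bits of the length, this version writes all 64 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not in the patch): returns msg padded the SHA-256 way.
// Layout: message bytes, one 0x80 byte, zeros, then the bit length stored
// big-endian in the last 8 bytes of the final 64-byte block.
std::vector<unsigned char> PadForSha256(const unsigned char* msg, std::size_t len)
{
    // Same block count as FormatHashBlocks: room for 0x80 plus 8 length bytes.
    std::size_t blocks = 1 + (len + 8) / 64;
    std::vector<unsigned char> out(64 * blocks, 0);

    if (len != 0)
        std::memcpy(out.data(), msg, len);
    out[len] = 0x80;

    // Write the bit length, least significant byte last.
    unsigned long long bits = static_cast<unsigned long long>(len) * 8;
    for (int i = 0; i < 8; ++i)
        out[out.size() - 1 - i] = static_cast<unsigned char>((bits >> (8 * i)) & 0xff);

    return out;
}
```

With this rule the 80-byte block header pads to two 64-byte blocks and the 32-byte intermediate hash pads to one, which matches the "precalc the first half of the first hash" step in the test: only the second block contains the nonce.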

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on July 31, 2010, 07:17:17 PM

Had to patch manually, as I'm not using git for bitcoin and 'patch' doesn't munch this format, I guess. Anyway, I got almost double the speed on the OS X side (i5 2.4 GHz, now ~2400 khash/s, up from ~1400), but my Linux box, a Q6600 quad at 2.4 GHz, was pumping ~2500 with 0.3.6 (from source) and now, with the patch, it's... ~2400. Do I need to tweak anything to take advantage of this?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on July 31, 2010, 08:34:06 PM

Ahem, let me correct myself: on the quad-core Linux box, I went from ~4400 with svn trunk @ 119 to ~2400 with the patch... not exactly what I hoped for after the success on OS X.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: aceat64 on July 31, 2010, 08:57:49 PM

Quote from: nelisky on July 31, 2010, 08:34:06 PM

Ahem, let me correct myself: on the quad-core Linux box, I went from ~4400 with svn trunk @ 119 to ~2400 with the patch... not exactly what I hoped for after the success on OS X.

I noticed the same: I went from about 4300 to 2100 when I tested it on Linux.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 31, 2010, 10:38:29 PM

What CPUs are you running it on? Could you send me sha256.o (compiled object of the algorithm)?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on July 31, 2010, 11:18:07 PM

I'm running on an Intel Q6600 at 2.4 GHz. How shall I get the file to you?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: Mionione on July 31, 2010, 11:30:11 PM

Careful with __attribute__ ((aligned (16))): it doesn't work on local variables, because gcc doesn't align the stack.
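
If the attribute can't be trusted on a local, one workaround is the trick the miner already uses for its other buffers (alignup): over-allocate by 16 bytes and round the pointer up by hand. A minimal sketch follows. AlignUp, AlignedWords and IsAligned16 are names made up for this example, and I have not measured whether this changes anything on the affected machines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round p up to the next multiple of nBytes (nBytes must be a power of two).
template <std::size_t nBytes, typename T>
T* AlignUp(T* p)
{
    static_assert((nBytes & (nBytes - 1)) == 0, "nBytes must be a power of two");
    std::uintptr_t n = reinterpret_cast<std::uintptr_t>(p);
    n = (n + (nBytes - 1)) & ~static_cast<std::uintptr_t>(nBytes - 1);
    return reinterpret_cast<T*>(n);
}

// Hypothetical holder: raw storage with 16 spare bytes, so a 16-byte aligned
// run of kWords unsigned ints always fits, wherever the stack places it.
template <std::size_t kWords>
struct AlignedWords
{
    unsigned char raw[kWords * sizeof(unsigned int) + 16];

    unsigned int* data()
    {
        return reinterpret_cast<unsigned int*>(AlignUp<16>(raw));
    }
};

// True when p sits on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

In the miner this would mean declaring something like AlignedWords<9 * NPAR> as the local and handing data() to the hash routine, at the cost of 16 wasted bytes.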

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on July 31, 2010, 11:37:02 PM

Quote from: nelisky on July 31, 2010, 11:18:07 PM

I'm running on an Intel Q6600 at 2.4 GHz. How shall I get the file to you?

Yes. I will look at the assembler code. Maybe the compiler did something "wrong".

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 01, 2010, 12:00:10 AM

Patch against SVN. Maybe it'll work now...

Code:

diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..6735678
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,447 @@
+#include <string.h>
+#include <assert.h>
+
+#include <emmintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+	0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+	0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+	0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+	0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+	0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+	0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+	0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+	0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+	return (b & c) ^ (~b & d);
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+	return (b & c) ^ (b & d) ^ (c & d);
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+	return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+	return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define	BIGSIGMA0_256(x)	(ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define	BIGSIGMA1_256(x)	(ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define	SIGMA0_256(x)		(ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define	SIGMA1_256(x)		(ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+	return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+	union { unsigned int ret[4]; __m128i x; } box;
+	box.x = x;
+	return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+	union { unsigned int ret[4]; __m128i x; } box;
+	box.x = x;
+	*x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+	return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define	SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+	T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w);	\
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define	SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+	T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w);	\
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define	SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+	T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w);	\
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+	__asm__ ("bswap %0" : "=r" (value) : "0" (value));
+	return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+	unsigned int value = (*((unsigned int *)(addr)));
+	__asm__ ("bswap %0" : "=r" (value) : "0" (value));
+	return value;
+}
+
+static inline void dumpreg(__m128i x, char *msg) {
+	union { unsigned int ret[4]; __m128i x; } box;
+	box.x = x ;
+	printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+		__func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate(i)
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[9][NPAR], const void *init)
+{
+	unsigned int* In = (unsigned int*)pin;
+	unsigned int* Pad = (unsigned int*)pad;
+	unsigned int* hPre = (unsigned int*)pre;
+	unsigned int* hInit = (unsigned int*)init;
+	unsigned int i, j, k;
+
+	/* vectors used in calculation */
+	__m128i w0, w1, w2, w3, w4, w5, w6, w7;
+	__m128i w8, w9, w10, w11, w12, w13, w14, w15;
+	__m128i T1, T2;
+	__m128i a, b, c, d, e, f, g, h;
+	__m128i nonce;
+
+	/* nonce offset for vector */
+	__m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+	for(k = 0; k<NPAR; k+=4) {
+		w0 = load_epi32(In[0], In[0], In[0], In[0]);
+		w1 = load_epi32(In[1], In[1], In[1], In[1]);
+		w2 = load_epi32(In[2], In[2], In[2], In[2]);
+		//w3 = load_epi32(In[3], In[3], In[3], In[3]); nonce will be later hacked into the hash
+		w4 = load_epi32(In[4], In[4], In[4], In[4]);
+		w5 = load_epi32(In[5], In[5], In[5], In[5]);
+		w6 = load_epi32(In[6], In[6], In[6], In[6]);
+		w7 = load_epi32(In[7], In[7], In[7], In[7]);
+		w8 = load_epi32(In[8], In[8], In[8], In[8]);
+		w9 = load_epi32(In[9], In[9], In[9], In[9]);
+		w10 = load_epi32(In[10], In[10], In[10], In[10]);
+		w11 = load_epi32(In[11], In[11], In[11], In[11]);
+		w12 = load_epi32(In[12], In[12], In[12], In[12]);
+		w13 = load_epi32(In[13], In[13], In[13], In[13]);
+		w14 = load_epi32(In[14], In[14], In[14], In[14]);
+		w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+		/* hack nonce into lowest byte of w3 */
+		nonce = load_epi32(In[3], In[3], In[3], In[3]);
+		__m128i k_vec = load_epi32(k, k, k, k);
+		nonce = _mm_add_epi32(nonce, offset);
+		nonce = _mm_add_epi32(nonce, k_vec);
+		w3 = nonce;
+
+		a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+		b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+		c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+		d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+		e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+		f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+		g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+		h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+		SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);    
+		SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+		w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+		dest = _mm_add_epi32(w8, x);
+
+		store_load(a, 0, w0);
+		store_load(b, 1, w1);
+		store_load(c, 2, w2);
+		store_load(d, 3, w3);
+		store_load(e, 4, w4);
+		store_load(f, 5, w5);
+		store_load(g, 6, w6);
+		store_load(h, 7, w7);
+
+		w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+		w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+		w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+		w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+		w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+		w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+		w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+		w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+		a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+		b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+		c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+		d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+		e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+		f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+		g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+		h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+		SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);    
+		SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+		w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+		w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+		w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+		w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+		w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+		w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+		w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+		w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+		w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+		SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+		w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+		SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+		w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+		SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+		w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+		SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+		w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+		SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+		w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+		SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+		w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+		SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+		w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+		SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+		/* store results directly in thash */
+#define store_2(x,i)  \
+		w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+		*(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x); 
+
+		store_2(a, 0);
+		store_2(b, 1);
+		store_2(c, 2);
+		store_2(d, 3);
+		store_2(e, 4);
+		store_2(f, 5);
+		store_2(g, 6);
+		store_2(h, 7);
+		*(__m128i *)&(thash)[8][0+k] = nonce;
+	}
+
+}
diff --git a/main.cpp b/main.cpp
index 0239915..50db1a3 100644
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,128 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
+        // Cache for NPAR hashes
+        unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlockSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                /* get nonce from hash */
+                pblock->nNonce = ByteReverse((unsigned int)thash[8][j]);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
                     }
                     SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
                     Sleep(500);
                     break;
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break;
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index e965287..04dac86 100644
--- a/makefile.unix
+++ b/makefile.unix
@@ -41,7 +41,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+    cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -51,7 +52,7 @@ obj/%.o: %.cpp $(HEADERS)
 	g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
-	g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+	g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
 	g++ $(CFLAGS) -o $@ $^ $(WXLIBS) $(LIBS)
@@ -63,6 +64,9 @@ obj/nogui/%.o: %.cpp $(HEADERS)
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
 	g++ $(CFLAGS) -o $@ $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+	g++ $(CFLAGS) -o $@ $^ $(LIBS)
+
 
 clean:
 	-rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100644
index 0000000..a55e972
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,221 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+
+int FormatHashBlocks(void* pbuffer, unsigned int len) 
+{
+	unsigned char* pdata = (unsigned char*)pbuffer;
+	unsigned int blocks = 1 + ((len + 8) / 64); 
+	unsigned char* pend = pdata + 64 * blocks;
+	memset(pdata + len, 0, 64 * blocks - len);
+	pdata[len] = 0x80;
+	unsigned int bits = len * 8; 
+	pend[-1] = (bits >> 0) & 0xff;
+	pend[-2] = (bits >> 8) & 0xff;
+	pend[-3] = (bits >> 16) & 0xff;
+	pend[-4] = (bits >> 24) & 0xff;
+	return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32 
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+	memcpy(pstate, pinit, 32); 
+	CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+	printf("SHA256 test started\n");
+
+	struct tmpworkspace
+	{
+		struct unnamed2
+		{
+			int nVersion;
+			uint256 hashPrevBlock;
+			uint256 hashMerkleRoot;
+			unsigned int nTime;
+			unsigned int nBits;
+			unsigned int nNonce;
+		}
+		block;
+		unsigned char pchPadding0[64];
+		uint256 hash1;
+		unsigned char pchPadding1[64];
+	}
+  tmp __attribute__ ((aligned (16)));
+
+	char line[180];
+	ifstream fin(filename);
+	char *p;
+	unsigned long int totalhashes= 0;
+	unsigned long int found = 0;
+	clock_t start, end;
+	unsigned long int cpu_time_used = 0;
+	unsigned int tnonce;
+	start = clock();
+
+	while( fin.getline(line, 180)) 
+	{
+		string in(line);
+		//printf("%s\n", in.c_str());
+		tmp.block.nVersion       = strtol(in.substr(0,8).c_str(), &p, 16);
+		tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+		tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+		tmp.block.nTime          = strtol(in.substr(128+8,8).c_str(), &p, 16);
+		tmp.block.nBits          = strtol(in.substr(128+16,8).c_str(), &p, 16);
+		tnonce = strtol(in.substr(128+24,8).c_str(), &p, 16);
+		tmp.block.nNonce         = tnonce;
+
+		unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+		unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+		// Byte swap all the input buffer
+		for (int i = 0; i < sizeof(tmp)/4; i++) 
+			((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+		// Precalc the first half of the first hash, which stays constant
+		uint256 midstate __attribute__ ((aligned(16)));
+		SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+		uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+		//	printf("target %s\n", hashTarget.GetHex().c_str());
+		uint256 hash;
+		uint256 refhash __attribute__ ((aligned(16)));
+
+		unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+		int done = 0;
+		unsigned int i, j;
+
+		/* reference */
+		SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+		SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+		for (int i = 0; i < sizeof(refhash)/4; i++)
+			((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+		//printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+		tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+		for(;;)
+		{
+
+			Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+			for(i = 0; i<NPAR; i++) {
+				/* fast hash checking */
+				if(thash[7][i] == 0) {
+			//		printf("found something... ");
+
+					for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+				//	printf("%s\n", hash.GetHex().c_str());
+
+					if (hash <= hashTarget)
+					{
+						found++;
+						if(tnonce == ByteReverse((unsigned int)thash[8][i]) ) {
+							if(hash == refhash) {
+								printf("\r%lu", found);
+								totalhashes += NPAR;
+								done = 1;
+							} else {
+								printf("Hashes do not match!\n");
+							}
+						} else {
+							printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+						}
+						break;
+					}
+				}
+			}
+			if(done) break;
+
+			tmp.block.nNonce+=NPAR;
+			totalhashes += NPAR;
+			if(tmp.block.nNonce == 0) {
+				printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+				return;
+			}
+		}
+	}
+	printf("\n");
+	end = clock();
+	cpu_time_used += (unsigned int)(end - start);
+	cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+	printf("found solutions = %lu\n", found);
+	printf("total hashes = %lu\n", totalhashes);
+	printf("total time = %lu ms\n", cpu_time_used);
+	printf("average speed: %lu khash/s\n", (totalhashes)/cpu_time_used);
+}
+
+int main(int argc, char* argv[]) {
+	if(argc == 2) {
+		BitcoinTester(argv[1]);
+	} else 
+		printf("Missing filename!\n");
+	return 0;
+}
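
For readers wondering how four hashes fit into one register: the SSE2 intrinsic _mm_add_epi32 (used by store_2 in the patch) adds four 32-bit words lane by lane, and each lane wraps modulo 2^32 on its own, so no carry leaks into the neighbouring value. Below is a minimal sketch of that behaviour. It is not part of the patch, and the helper name add4_lanes is made up for illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Adds four pairs of 32-bit words with a single SSE2 add.
// Each lane wraps modulo 2^32 independently of its neighbours.
// Hypothetical helper for illustration only; unaligned loads/stores are
// used so the sketch does not depend on how the caller aligned its arrays.
static void add4_lanes(const std::uint32_t a[4], const std::uint32_t b[4],
                       std::uint32_t out[4])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```

Feeding 0xffffffff + 1 into one lane yields 0 in that lane and leaves the other three lanes untouched, which is the same result SHA-256's 32-bit additions need.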

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: wereHamster on August 01, 2010, 10:16:48 AM

Quote from: Mionione on July 31, 2010, 11:30:11 PM

care with __attribute__ ((aligned (16))) , it doesn't work with local variable, gcc doesn't align the stack

Maybe gcc doesn't align the stack, but it can (and automatically does) align variables on the stack.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: Mionione on August 01, 2010, 12:53:40 PM

That's what it is supposed to do, but it doesn't always do it. The issues are on the gcc bugzilla:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43798
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16660
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: Ground Loop on August 02, 2010, 12:22:43 AM

No joy against SVN tip here.

Code:

patching file sha256.cpp
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2703.
2 out of 2 hunks FAILED -- saving rejects to file main.cpp.rej
patching file makefile.unix
Hunk #1 FAILED at 41.
Hunk #2 FAILED at 52.
Hunk #3 FAILED at 64.
3 out of 3 hunks FAILED -- saving rejects to file makefile.unix.rej
patching file test.cpp

Trying manually now.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: Ground Loop on August 02, 2010, 06:57:20 AM

I got the patch knitted in, and I think I did it correctly.. wasn't complicated.

Regrettably, the hash rate has decreased by almost half. I'm down from 2071 (stock build, svn tip) to 1150 khash/sec with the patch.

It's an Intel Xeon 3 GHz, Linux, with these proc flags:

Code:

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr

Has anyone seen gains?

Did I botch it? Missing CPU capabilities? Wrong compiler options?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 02, 2010, 08:13:11 AM

I have been able to apply the patch against SVN (r121) and I tested it on 2 machines:

on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

The strange thing is that despite the fact that I have been running it on 6 Opterons (i.e. 6x4=24 cores) for 40 hours with an average rate of 51,000 khash/s, I still haven't generated any blocks. The probability of this (no blocks, 40 hours, 51,000 khash/s and difficulty=244.2) is 0.09% or 1/1098. Are you sure this thing works correctly and that the reported rate is correct? How do I run the included test program?
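
For anyone who wants to redo this estimate, here is a rough sketch. It assumes block finding behaves like a Poisson process and that a difficulty-1 block takes about 2^32 hashes on average (an approximation); prob_no_block is just an illustrative name.

```cpp
#include <cassert>
#include <cmath>

// Probability of finding zero blocks after hashing for `hours` at a
// steady `hashes_per_sec`, given the network difficulty.
// Assumptions: Poisson arrivals, and roughly difficulty * 2^32 hashes
// needed per block on average.
static double prob_no_block(double hashes_per_sec, double hours, double difficulty)
{
    assert(difficulty > 0.0);
    const double hashes_per_block = difficulty * 4294967296.0;  // difficulty * 2^32
    const double expected_blocks = hashes_per_sec * hours * 3600.0 / hashes_per_block;
    return std::exp(-expected_blocks);  // P(0 events) for a Poisson count
}
```

Plugging in 51,000 khash/s (5.1e7 hashes/s), 40 hours and difficulty 244.2 gives an expected count of about 7 blocks and a probability of roughly 0.09% of finding none.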

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: knightmb on August 02, 2010, 08:47:04 AM

Is it an AMD-only optimization perhaps?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 02, 2010, 09:00:55 AM

Quote from: knightmb on August 02, 2010, 08:47:04 AM

Is it an AMD-only optimization perhaps?

Or a 64-bit only optimization.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: Ground Loop on August 02, 2010, 09:17:07 AM

With the patch above, I was unable to build the test program. You?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: petree on August 02, 2010, 09:22:29 AM

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.

Is there a way we can confirm that the variables are being aligned properly? I'm wondering if the Intel procs are less tolerant of misalignment than the AMDs.
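
One way to check this at run time is to test the low bits of the address. The following is a minimal sketch that assumes a power-of-two boundary; is_aligned and align_up are illustrative names (align_up only mimics the idea behind the client's alignup<16>, it is not the client's code).

```cpp
#include <cassert>
#include <cstdint>

// True if p sits on an n-byte boundary; n must be a power of two.
static bool is_aligned(const void* p, std::uintptr_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (n - 1)) == 0;
}

// Rounds p up to the next n-byte boundary. The caller must over-allocate
// by at least n - 1 bytes so the rounded pointer stays inside the buffer.
template <std::uintptr_t n, typename T>
static T* align_up(T* p)
{
    const std::uintptr_t u = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<T*>((u + n - 1) & ~(n - 1));
}
```

Calling is_aligned(&thash[0][0], 16) right after the declaration in BitcoinMiner would show whether the aligned attribute took effect for that local on a given compiler and platform.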

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 02, 2010, 09:31:44 AM

Quote from: Ground Loop on August 02, 2010, 09:17:07 AM

With the patch above, I was unable to build the test program. You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.

Quote from: petree on August 02, 2010, 09:22:29 AM

As I said above I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on August 02, 2010, 01:05:48 PM

Quote from: impossible7 on August 02, 2010, 09:00:55 AM

Quote from: knightmb on August 02, 2010, 08:47:04 AM

Is it an AMD-only optimization perhaps?

Or a 64-bit only optimization.

I'm trying on a Q6600 running 64bit linux (ubuntu server) and it makes things slower there, so not 64bit only. And I'm running on my mac laptop which sports an Intel i5 (also 64 bit OSX 10.6), with a great speed improvement there, so not AMD only.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: petree on August 02, 2010, 04:12:33 PM

Quote from: impossible7 on August 02, 2010, 09:31:44 AM

Quote from: Ground Loop on August 02, 2010, 09:17:07 AM

With the patch above, I was unable to build the test program. You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.

Quote from: petree on August 02, 2010, 09:22:29 AM

As I said above I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

Yes, since applying this patch I've generated 2 blocks.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 02, 2010, 07:02:46 PM

Is it 2x fast on AMD and 1/2 fast on Intel?

Quote from: tcatm on July 31, 2010, 10:12:38 AM

Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compiletime?

Tried that, but it doesn't work for things on the stack. I ran some tests.

It doesn't even cause an error, it just doesn't align it.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: jgarzik on August 02, 2010, 07:15:23 PM

FWIW, there exists -mstackrealign and -mpreferred-stack-boundary=NUM

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 02, 2010, 08:49:25 PM

After 52 hours of trying with no blocks generated, I give up and I am switching back to the vanilla bitcoin.

The probability of getting no blocks within 52 hours at 51,000 khash/s is 0.011%. So I conclude that the patch doesn't work and I am 99.989% confident about that. I hope that tcatm provides some explanation on how to use the supplied test program.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 02, 2010, 09:07:56 PM

To use the test program download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg
The program will try to find the correct nonce in each block and detect if the hash function does work correctly. It'll also benchmark the algorithm.
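
The per-batch check that the test program (and the patched miner) performs can be sketched roughly as follows. This is an illustration based on the loops in main.cpp and test.cpp, not code from the patch: extract_candidate and byte_reverse are hypothetical names, and byte_reverse stands in for CryptoPP::ByteReverse.

```cpp
#include <cassert>
#include <cstdint>

const int NPAR = 32;  // batch size, must match NPAR in the patch

// Swaps the byte order of a 32-bit word.
static std::uint32_t byte_reverse(std::uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

// The batch is stored transposed: thash[word][lane]. Rows 0..7 hold the
// eight hash words of each lane, row 8 holds the nonce that produced it.
// Returns false if lane j fails the cheap "word 7 is zero" pre-check;
// otherwise copies out the byte-swapped hash words and nonce for the
// full comparison against the target.
static bool extract_candidate(const std::uint32_t thash[9][NPAR], int j,
                              std::uint32_t hash_words[8], std::uint32_t* nonce)
{
    assert(j >= 0 && j < NPAR);
    if (thash[7][j] != 0)
        return false;
    for (int i = 0; i < 8; i++)
        hash_words[i] = byte_reverse(thash[i][j]);
    *nonce = byte_reverse(thash[8][j]);
    return true;
}
```

Only lanes that pass the pre-check need the full 256-bit comparison against hashTarget, so most of the batch is rejected after a single 32-bit compare.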

From what I've heard the patch does not work on 32 bit systems. I don't know why. I've developed it on an AMD64 machine and it works fine. If it's slower on Intel, try to disable Hyperthreading. The big loop in the SSE2 code doesn't contain any "normal" x86 except for one jump at the end.

Btw, there's a git repo at http://github.com/tcatm/bitcoin-cruncher/

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on August 02, 2010, 11:52:27 PM

I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:

Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux

Anything I can try to help and debug this?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 03, 2010, 12:21:41 AM

Quote from: nelisky on August 02, 2010, 11:52:27 PM

Can you mail me a copy of cryptopp/obj/sha256.o to tcatm@gawab.com? I still fear Intel's microcode in their CPUs wasn't made for such tight loops of SSE code. Have you run the test program? How many khash/s does it crunch?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on August 03, 2010, 12:46:12 AM

datla@bah:~/src/bitcoin/bitcoin-cruncher$ ./test blocks.txt
SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 235480 ms
average speed: 592 khash/s

I'll send you the obj file now

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 03, 2010, 01:17:53 AM

Thanks for the object!

There are two things I noticed:
1) The Intel object runs at 3269khash/s on my AMD64 (vs. 3778khash/s) so it's less optimized than the AMD64 code.

2) AMD64 moves less data around and does more calculations. Sometimes it even abuses floating point instructions for integers.

Could you drop in my sha256.o from http://ul.to/2ckndx to cryptopp/obj/, delete test (not the .cpp!!) and recompile test using make -f makefile.unix test (take care it doesn't recompile sha256.cpp to sha256.o)? Then run test again. It should be using AMD64 code now. Maybe it works better...

If not we've found that AMD64 is about four times faster than Intel at SSE2 integer vector arithmetic. Anyone working on a floating point SHA256 implementation? ;)

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on August 03, 2010, 02:04:22 AM

SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 222110 ms
average speed: 627 khash/s

So slightly better, but still far from good... As for AMD vs Intel, on my Mac, which is an Intel i5, the performance boost was almost 100%, so maybe some compiler thing? I did have to remove the -arch i386 from makefile.osx to have it build on osx 10.6, but there's no such flag on Linux's g++ and I'm pretty sure the 64bit g++ will not compile 32bit anyway.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 03, 2010, 02:21:50 AM

i5 is a different architecture than Core2. Maybe SSE in Core2 is broken and was fixed in i5. That means the original client is close to the fastest you can get on Core2. It's not a compiler thing. I compared the output for different architectures and -march=amdfam10 produces the fastest and smallest code. I would be surprised if a longer loop using the same instructions was faster on an older CPU.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on August 03, 2010, 02:23:48 AM

Well, kudos to you for trying. Now if I can just get your code merged with the old cuda version on my macbook pro, I'll be a happy camper :)

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: vess on August 03, 2010, 08:36:04 PM

Anyone able to send me a compiled version of this for windows? I'm interested to try it out on my AMD server.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 03, 2010, 09:49:12 PM

I kept running the patched version on 2 machines and the following has happened 5 times: bitcoind crashes and debug.log contains the following:

Code:

proof-of-work found
  hash: 00000000001c3530e42b2c7e1a20de01436882d0c1de0b63db6be8e6194255dd
target: 00000000010c5a00000000000000000000000000000000000000000000000000
CBlock(hash=00000000001c3530, ver=1, hashPrevBlock=0000000000253ab5, hashMerkleRoot=89541f, nTime=1280867359, nBits=1c010c5a, nNonce=3915571979, vtx=2)
  CTransaction(hash=4fcb8e, ver=1, vin.size=1, vout.size=1, nLockTime=0)
    CTxIn(COutPoint(000000, -1), coinbase 045a0c011c021b04)
    CTxOut(nValue=50.00000000, scriptPubKey=0xCE5264238BAC29160CDC9C)
  CTransaction(hash=8f2466, ver=1, vin.size=1, vout.size=1, nLockTime=0)
    CTxIn(COutPoint(77aaae, 1), scriptSig=0x01F561A9044BF348CEF6F4)
    CTxOut(nValue=5.00000000, scriptPubKey=OP_DUP OP_HASH160 0xB13A)
  vMerkleTree: 4fcb8e 8f2466 89541f
08/03/10 20:29 generated 50.00
AddToWallet 4fcb8e  new
AddToBlockIndex: new best=00000000001c3530  height=72112
ProcessBlock: ACCEPTED
sending: inv

I guess this means that a new block has been generated. But when I restart bitcoind the balance is still zero. When I ask for a list of generated blocks I get the following:

Code:

$ ./bitcoind listgenerated
[
    {
        "value" : 50.00000000000000,
        "maturesIn" : -1,
        "accepted" : false,
        "confirmations" : 0,
        "genTime" : 1280867359
    }
]

(listgenerated is from the patch at http://www.alloscomp.com/bitcoin/)

I guess this means that my client produced a block but it crashed before it was able to broadcast it.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 03, 2010, 09:53:54 PM

Did you run it on 32 bit machines? Which version of the patch did you use?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 03, 2010, 09:56:47 PM

r121 from the svn patched with the patch from the post #21 running on an Opteron/x86_64

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 03, 2010, 10:00:35 PM

Did it crash with a segfault and can you provide a backtrace (gdb bitcoind; run; bt)?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 03, 2010, 11:51:28 PM

It did segfault according to dmesg:

Code:

bitcoind[2469]: segfault at 0 ip 00007fe92c5b3f32 sp 00007fff15e5f6b0 error 4 in libc-2.11.2.so[7fe92c57e000+150000]

I don't have a stack trace.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 04, 2010, 12:03:07 AM

Did your kernel write a coredump and if so can you mail me the binary + coredump to tcatm@gawab.com?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 04, 2010, 02:52:17 AM

I modified bitcoind so that it doesn't fork to the background and now I can debug it with gdb. Next time it crashes gdb will give me a backtrace.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 04, 2010, 03:00:49 AM

GDB can also generate coredumps with the command generate-core-file. It might be useful to reconstruct the cause for the segfault. Please note that the coredump might include your wallet, so it's probably a good idea to run bitcoind on a separate datadir.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: vess on August 04, 2010, 01:34:44 PM

Just a comment that this would be easier to test if difficulty were set to '1' in the client.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 04, 2010, 01:43:26 PM

Oh, that's easy: Get two nodes on a separate network and connect them using -connect=other_nodes_ip and a separate/empty datadir.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 05, 2010, 06:40:16 AM

I had 5 machines running today and when I checked back 10 hours later, 4 of them had crashed, in the same way as the previous times (i.e. right after they had generated a new block but before they broadcast it).

I created a tarball containing the coredump, backtrace, binary and the sources I used to compile it including the compiled object files. You can get it from ~~here~~ (link removed). Hope that helps.

EDIT: Ok, I just noticed that both your patch and the patch for the getkhps rpc (from http://www.alloscomp.com/bitcoin/) modify the function BitcoinMiner in main.cpp (which is where the segfault occurs) so this must be the reason for the segfaults. I will try to test it without the getkhps patch.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 05, 2010, 01:35:50 PM

Ok I tried again, this time with no extra patches. I just cloned your git tree and compiled it. It still crashes. Here's the stack backtrace:

Code:

#0  0x00007ffff710b1b5 in raise () from /lib/libc.so.6
#1  0x00007ffff710c5e0 in abort () from /lib/libc.so.6
#2  0x00007ffff71042d1 in __assert_fail () from /lib/libc.so.6
#3  0x00000000004628de in BitcoinMiner () at main.cpp:2741
#4  0x0000000000462d70 in ThreadBitcoinMiner (parg=0x391e) at main.cpp:2518
#5  0x00007ffff6ec3894 in start_thread () from /lib/libpthread.so.0
#6  0x00007ffff71aa07d in clone () from /lib/libc.so.6

I have also uploaded ~~here~~ (link removed) the sources with the object files as well as a core dump and the binary.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 05, 2010, 02:06:14 PM

That's not a real crash this time. It's an assert that fails in the miner. Most likely assert(hash == pblock->GetHash());. Can you run the test program (explained in http://bitcointalk.org/index.php?topic=648.msg7096#msg7096)? If it fails, can you change -march=amdfam10 back to -march=native in makefile.unix, rm cryptopp/obj/*.o and recompile everything? What CPU are you running it on?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 06, 2010, 02:49:32 AM

The test program does not fail

Code:

$ ./test ../blocks.txt 
SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 63250 ms
average speed: 2204 khash/s

Does the test program run on a single thread?

Finally, I have the same problem with both -march=amdfam10 and -march=native. The CPU is an Opteron 2374.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 06, 2010, 03:38:53 AM

Yes, test is single threaded. Is there any output on stderr? From the coredumps I can tell that there must be some output.

The problem seems to be hard to debug, though. Is the khash/s you get worth it anymore at 352 difficulty? I'm only getting a block once a week now. If you want to keep the block chain working, you should use the original client. If you want to gain lots of bitcoins you should use a GPU.
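For a rough sense of the numbers: finding a block at difficulty D takes about D × 2^32 hashes on average. The sketch below plugs in difficulty 352 and the roughly 3500 khash/s per core quoted earlier in the thread. The function names are made up for illustration, and 2^32 is the usual approximation rather than the exact constant.

```cpp
#include <cassert>

// Expected number of hashes needed to find one block at a given difficulty.
// Approximation: difficulty * 2^32.
double expected_hashes_per_block(double difficulty) {
    assert(difficulty > 0.0);
    return difficulty * 4294967296.0;  // 2^32
}

// Expected time between blocks, in days, for a miner with the given hash rate.
double expected_days_per_block(double difficulty, double hashes_per_second) {
    assert(hashes_per_second > 0.0);
    const double seconds = expected_hashes_per_block(difficulty) / hashes_per_second;
    return seconds / 86400.0;  // seconds per day
}
```

With expected_days_per_block(352.0, 3.5e6) this comes out at roughly 5 days per block for a single core, which is in line with "a block once a week". Since it is an average, actual gaps between blocks vary widely.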

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: knightmb on August 06, 2010, 06:50:03 AM

Quote from: tcatm on August 06, 2010, 03:38:53 AM

<snip> Is the khash/s you get worth it anymore at 352 difficulty? I'm only getting a block once a week now. If you want to keep the block chain working, you should use the original client. If you want to gain lots of bitcoins you should use a GPU.

When the difficulty changed, the first machine in my group to generate a block was the slowest one running the stock client (800 MHz E-machine) and it hadn't generated anything in over a week itself. So I guess every little bit helps, and that's why so many are interested in getting this to work on their machines.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 06, 2010, 09:00:52 AM

Here's the output on stderr:

Code:

bitcoind: main.cpp:2741: void BitcoinMiner(): Assertion `("break caught by CRITICAL_BLOCK!", !fcriticalblockonce)' failed.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 06, 2010, 10:59:18 AM

Oh, that's a part in the code my patch doesn't touch. You could try to remove line 2741 (CRITICAL_BLOCK(cs_main)).

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 06, 2010, 11:37:20 AM

CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. There is nothing like that in the original main.cpp.
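To illustrate how a for-loop macro can catch a stray break, here is a minimal sketch. GUARDED_BLOCK is a hypothetical stand-in built on std::mutex, not the CRITICAL_BLOCK definition from the Bitcoin source; the helper functions are also invented for the example.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical simplified stand-in for a lock-holding block macro.
// The inner loop holds the lock and runs the body exactly once. It clears
// `once` in its increment step, which a `break` in the body would skip.
// The outer loop's increment step then asserts that `once` was cleared.
#define GUARDED_BLOCK(mtx)                                              \
    for (bool once = true; once;                                        \
         assert(!once && "break caught by GUARDED_BLOCK!"), once = false) \
        for (std::lock_guard<std::mutex> guard(mtx); once; once = false)

std::mutex g_mutex;

// Runs a guarded body and reports how many times it executed.
int count_body_runs() {
    int runs = 0;
    GUARDED_BLOCK(g_mutex) {
        ++runs;  // executes with g_mutex held
        // A `break;` here would leave `once` set and trip the assert above.
        // A `continue;` would silently end the block instead of continuing
        // an enclosing loop.
    }
    return runs;
}

// True if the mutex was released when the block ended.
bool mutex_is_free() {
    if (!g_mutex.try_lock()) return false;
    g_mutex.unlock();
    return true;
}
```

On the normal path the body runs once and the lock is released at the closing brace. The assert only fires when control leaves the inner loop through break.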

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 06, 2010, 11:54:04 AM

Thanks! That probably got mixed up when I patched the file with an older diff. I fixed the git: http://github.com/tcatm/bitcoin-cruncher

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 07, 2010, 09:16:01 PM

Quote from: impossible7 on August 06, 2010, 11:37:20 AM

Sorry about that. CRITICAL_BLOCK isn't perfect. You have to be careful not to break or continue out of it. There's an assert that catches and warns about break. I can be criticized for using it, but the syntax would be so much more bloated and error prone without it.

Is there a chance the SSE2 code is slow on Intel because of some quirk that could be worked around? For instance, if something works but is slow if it's not aligned, or thrashing the cache, or one type of instruction that's really slow? I'm not sure how available it is, but I think Intel used to have a profiler for profiling on a per instruction level. I guess if tcatm doesn't have a system with the slow processor to test with, there's not much hope. But it would be really nice if this was working on most CPUs.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: impossible7 on August 07, 2010, 10:51:07 PM

I can confirm that the patch now works just fine. I just generated my first 50 BTC with it. And since this patch doubles the speed I think it's only fair if I donated half of that to tcatm.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 07, 2010, 10:59:55 PM

Quote from: impossible7 on August 07, 2010, 10:51:07 PM

I can confirm that the patch now works just fine. I just generated my first 50 BTC with it. And since this patch doubles the speed I think it's only fair if I donated half of that to tcatm.

That's nice to hear. Thanks for the donation and thanks to everyone else who donated!

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: aceat64 on August 08, 2010, 06:37:13 AM

I just tried again with the latest version from your github. I'm still seeing a drop in performance compared to the vanilla source.

My system went from ~7100 to ~4200.

This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 08, 2010, 11:52:53 AM

It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nimnul on August 12, 2010, 12:18:23 PM

Can we implement a speed test, so different hashing engines are tried and the fastest is chosen?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: nelisky on August 12, 2010, 12:57:58 PM

Quote from: tcatm on August 08, 2010, 11:52:53 AM

My Core2Quad (Q6600) slowed down 50%, my i5 improved ~200%, thus I don't think what you state is accurate. Maybe starting at some specific Core2?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 12, 2010, 10:07:23 PM

That big of a difference in speed, by a factor of 4 or 6, feels like it's likely to be some quirky weak spot or instruction that the old chip is slow with. Unless it's a touted feature of the i5 that they made SSE2 six times faster.

A quick summary:
Xeon Quad 41% slower
Core 2 Duo 55% slower
Core 2 Duo same (vess)
Core 2 Quad 50% slower
Core i5 200% faster (nelisky)
Core i5 100% faster (vess)
AMD Opteron 105% faster

aceat64:
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

impossible7:
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

nelisky:
My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,

impossible7:
on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
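The summary mixes "slower" and "faster" percentages, so the "factor of 4 or 6" is easier to see when everything is converted to a speed multiplier relative to stock. A small sketch, with helper names invented for illustration:

```cpp
#include <cassert>

// Percentage change from `before` to `after` (negative means slower).
double percent_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// Convert a percentage change into a multiplier of the stock speed:
// -50% becomes 0.5x, +100% becomes 2x, +200% becomes 3x.
double speed_multiplier(double percent) {
    return 1.0 + percent / 100.0;
}
```

For the Xeon report, percent_change(7100, 4200) is about -41. Comparing a Core 2 at 0.5x with an i5 at 2x or 3x gives speed_multiplier(100) / speed_multiplier(-50) = 4 and speed_multiplier(200) / speed_multiplier(-50) = 6.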

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: vess on August 12, 2010, 10:09:50 PM

My core i5 doubled in speed. My Core 2 Duo is the same speed.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 13, 2010, 12:42:47 AM

Would be interesting to try it out on older AMD64. There's been a change that would explain it there:
http://developer.amd.com/documentation/articles/pages/682007171.aspx

Maybe Intel did something similar without announcing it?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: Cheater on August 13, 2010, 06:27:23 AM

I'll just pitch in that Phenom and Phenom II processors doubled (roughly).
No difference between the two that I can tell.

Sorry, I don't have anything older than Phenoms available right now.
Might be able to access an old X2 in a week.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: sgtstein on August 13, 2010, 09:10:33 PM

Just a question for whoever, trying to wrap up the information in this thread.

Does this:

1. Work on 32-bit?
2. Patch the SVN (r130 as of current) or Git?
3. Compile on CentOS?

If anyone has any answers I would greatly appreciate them.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 13, 2010, 09:27:14 PM

1. Does not work on 32-bit (though that's not a problem with the algorithm).
2. Patch is against older SVN. There's a git repo at http://github.com/tcatm/bitcoin-cruncher
3. Compiles on every 64bit Linux.

It's not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I'm planning a pluggable bitcoinminer someday. But at current difficulty it's easier to work for bitcoins than finding faster ways for mining.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: NewLibertyStandard on August 13, 2010, 10:27:25 PM

I would really like to have this feature included in an official build sometime soon along with an internal speed test to determine which algorithm to use. You can always remove the speed test later once you figure out how to determine whether it will be faster or slower without running the speed test.
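A minimal sketch of what such a startup speed test could look like, assuming each hashing engine can be wrapped behind a common function pointer. The Engine struct, pick_fastest and the two toy engines are hypothetical placeholders and not part of the patch or the client.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// One candidate hashing engine: a name plus a function that performs
// `rounds` units of work and returns a value so the work is not optimized out.
struct Engine {
    const char* name;
    std::uint32_t (*run)(std::uint32_t rounds);
};

// Toy stand-in for a fast engine.
std::uint32_t toy_engine_a(std::uint32_t rounds) {
    volatile std::uint32_t acc = 0;
    for (std::uint32_t i = 0; i < rounds; ++i) acc = acc * 33u + i;
    return acc;
}

// Toy stand-in for a slow engine: eight times the work of toy_engine_a.
std::uint32_t toy_engine_b(std::uint32_t rounds) {
    std::uint32_t acc = 0;
    for (int pass = 0; pass < 8; ++pass) acc ^= toy_engine_a(rounds);
    return acc;
}

// Time every engine a few times and return the index of the one with the
// shortest best run. Taking the minimum reduces scheduling noise.
std::size_t pick_fastest(const std::vector<Engine>& engines,
                         std::uint32_t rounds, int repeats = 3) {
    assert(!engines.empty());
    using Clock = std::chrono::steady_clock;
    std::size_t best = 0;
    Clock::duration best_time = Clock::duration::max();
    for (std::size_t i = 0; i < engines.size(); ++i) {
        Clock::duration fastest_run = Clock::duration::max();
        for (int r = 0; r < repeats; ++r) {
            const Clock::time_point start = Clock::now();
            volatile std::uint32_t sink = engines[i].run(rounds);
            (void)sink;
            const Clock::duration elapsed = Clock::now() - start;
            if (elapsed < fastest_run) fastest_run = elapsed;
        }
        if (fastest_run < best_time) {
            best_time = fastest_run;
            best = i;
        }
    }
    return best;
}
```

In a real miner the two entries would wrap the stock hash loop and the 4-way loop, and the test would run once at startup on the actual CPU, which sidesteps having to predict which processor families benefit.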

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: sgtstein on August 13, 2010, 11:17:51 PM

Quote from: tcatm on August 13, 2010, 09:27:14 PM

1. Do we know why it doesn't work on 32bit? Is it because it's using 128bits and if so, would it help if we dropped it to 64?

2. Thank you, I will look into implementing it on my 64bit systems.
3. Excellent to hear. I'm looking forward to using it.

I was planning on using it on a PE2650 dual proc Xeon @3.2GHz w/HT. I would really like to get this figured out to utilize that system. I am planning one as well. At current difficulty I would agree, except when the system needs to be run anyway and latency isn't an issue.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 14, 2010, 12:49:18 AM

MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://www.mingw.org/bugs.shtml> for instructions.
make: *** [obj/sha256.o] Error 1

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 14, 2010, 12:50:28 AM

Quote from: sgtstein on August 13, 2010, 11:17:51 PM

1. Do we know why it doesn't work on 32bit? Is it because it's using 128bits and if so, would it help if we dropped it to 64?

No idea, maybe some alignment problem. Someone was trying to figure it out on IRC. I don't have a SSE2 capable 32bit system. The additional registers in 64bit mode are also useful. I don't know if your PE2650 has a recent enough CPU. You might experience a performance drop of 50% if the CPU is too old.

Btw, did anyone with Intel CPU compare performance with Hyperthreading enabled/disabled? The SSE2 loop keeps the arithmetic units and pipelines pretty busy and I can imagine Hyperthreading might decrease performance.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: tcatm on August 14, 2010, 12:53:07 AM

Quote from: satoshi on August 14, 2010, 12:49:18 AM

Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 14, 2010, 04:22:29 AM

If you haven't already, try aligning thash. It might matter. Couldn't hurt.

Quote from: tcatm on August 14, 2010, 12:53:07 AM

Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?

No help from -O0, same error.

MinGW is GCC 3.4.5. Probably the problem.

I'll see if I can get a newer version of MinGW.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 14, 2010, 05:55:37 PM

Got the test working on 32-bit with MinGW GCC 4.5. Exactly 50% slower than stock with Core 2.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 14, 2010, 10:06:13 PM

MinGW GCC 4.5.0:
Crypto++ doesn't work, X86_SHA256_HashBlocks() never returns
I only got 4-way working with test.cpp but not when called by BitcoinMiner

MinGW GCC 4.4.1:
Crypto++ works
4-way SIGSEGV

GCC is definitely not aligning __m128i.

Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.

By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only:
#define Ch(b, c, d) ((b & c) ^ (~b & d))
#define Maj(b, c, d) ((b & c) ^ (b & d) ^ (c & d))
#define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n))
#define SHR(x, n) _mm_srli_epi32(x, n)

But that's with -O0.
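For comparison, the same helpers can be written with explicit SSE2 intrinsics instead of relying on GCC's operators for __m128i. This is only a sketch under assumptions: the names Ch4, Maj4, Rotr4 and self_check are invented here, and the code is not taken from the patch. It uses unaligned loads for the inputs and an explicitly aligned buffer for the outputs, which avoids alignment faults at those points but does not control temporaries the compiler may spill to the stack.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Four 32-bit lanes at once: (b & c) ^ (~b & d).
// _mm_andnot_si128(b, d) computes (~b) & d.
static inline __m128i Ch4(__m128i b, __m128i c, __m128i d) {
    return _mm_xor_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
}

// (b & c) ^ (b & d) ^ (c & d) per lane.
static inline __m128i Maj4(__m128i b, __m128i c, __m128i d) {
    return _mm_xor_si128(_mm_xor_si128(_mm_and_si128(b, c), _mm_and_si128(b, d)),
                         _mm_and_si128(c, d));
}

// Rotate each 32-bit lane right by N bits (0 < N < 32).
template <int N>
static inline __m128i Rotr4(__m128i x) {
    static_assert(N > 0 && N < 32, "rotation must be between 1 and 31");
    return _mm_or_si128(_mm_srli_epi32(x, N), _mm_slli_epi32(x, 32 - N));
}

// Scalar reference versions for checking.
static inline std::uint32_t ch1(std::uint32_t b, std::uint32_t c, std::uint32_t d) {
    return (b & c) ^ (~b & d);
}
static inline std::uint32_t maj1(std::uint32_t b, std::uint32_t c, std::uint32_t d) {
    return (b & c) ^ (b & d) ^ (c & d);
}
static inline std::uint32_t rotr1(std::uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

// Returns true if every lane of the 4-way helpers matches the scalar result.
bool self_check() {
    const std::uint32_t b[4] = {0x6a09e667u, 0xffffffffu, 0x00000000u, 0x12345678u};
    const std::uint32_t c[4] = {0xbb67ae85u, 0x0f0f0f0fu, 0xdeadbeefu, 0x9abcdef0u};
    const std::uint32_t d[4] = {0x3c6ef372u, 0xf0f0f0f0u, 0x01234567u, 0x0fedcba9u};

    // Unaligned loads: safe regardless of how the arrays are placed.
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));
    const __m128i vd = _mm_loadu_si128(reinterpret_cast<const __m128i*>(d));

    // Aligned stores require 16-byte alignment, requested explicitly here.
    alignas(16) std::uint32_t ch[4];
    alignas(16) std::uint32_t maj[4];
    alignas(16) std::uint32_t rot[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(ch), Ch4(vb, vc, vd));
    _mm_store_si128(reinterpret_cast<__m128i*>(maj), Maj4(vb, vc, vd));
    _mm_store_si128(reinterpret_cast<__m128i*>(rot), Rotr4<7>(vb));

    for (int i = 0; i < 4; ++i) {
        if (ch[i] != ch1(b[i], c[i], d[i])) return false;
        if (maj[i] != maj1(b[i], c[i], d[i])) return false;
        if (rot[i] != rotr1(b[i], 7)) return false;
    }
    return true;
}
```

Calling self_check() on an SSE2-capable machine compares all four lanes against the scalar formulas. Making the rotation count a template parameter keeps the shift amounts compile-time constants, similar to what the macro versions achieve.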

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: sgtstein on August 15, 2010, 03:19:31 AM

Well, reporting back.

I got it to compile by specifying -msse and -msse2 to gcc when compiling. I first was hashing about 692kh/s (50% of SVN r130[1400kh/s]) but recompiled and am now receiving about ~1120kh/s. This is currently the equivalent of using both of my CPUs without HyperThreading, though I can verify that it IS using HyperThreading. With HyperThreading turned off, I get ~1350kh/s. Pretty close to the stock build.

Also, does the git contain the patched and updated code?

Code:

// SVN r130 Using HT.
08/14/10 19:02 hashmeter   4 CPUs   1392 khash/s
08/14/10 19:32 hashmeter   4 CPUs   1387 khash/s
08/14/10 20:02 hashmeter   4 CPUs   1386 khash/s
08/14/10 20:32 hashmeter   4 CPUs   1380 khash/s
08/14/10 21:02 hashmeter   4 CPUs   1363 khash/s
// With -msse -msse2, first run. Using HT.
08/14/10 21:32 hashmeter   4 CPUs    692 khash/s
08/14/10 22:06 hashmeter   4 CPUs   1011 khash/s
08/14/10 22:11 hashmeter   4 CPUs   1104 khash/s
08/14/10 22:16 hashmeter   4 CPUs   1120 khash/s
// NOT using HT.
08/14/10 22:21 hashmeter   2 CPUs   1359 khash/s
08/14/10 22:26 hashmeter   2 CPUs   1340 khash/s

Just wanted to tell my story and help with whatever information I could.

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: satoshi on August 15, 2010, 03:40:29 AM

On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner. So now it doesn't look like it's the version of GCC, it's something else, maybe just the luck of how the stack is aligned.

I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.

I found the problem with Crypto++ on MinGW 4.5.0. Here's the patch for that:

Code:

--- \old\sha.cpp	Mon Jul 26 13:31:11 2010
+++ \new\sha.cpp	Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
 	ROUND(14, 0, eax, ecx, edi, edx)
 	ROUND(15, 0, ecx, eax, edx, edi)
 
-	ASL(1)
+    ASL(label1)   // Bitcoin: fix for MinGW GCC 4.5
 	AS2(add WORD_REG(si), 4*16)
 	ROUND(0, 1, eax, ecx, edi, edx)
 	ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
 	ROUND(14, 1, eax, ecx, edi, edx)
 	ROUND(15, 1, ecx, eax, edx, edi)
 	AS2(	cmp		WORD_REG(si), K_END)
-	ASJ(	jne,	1, b)
+    ASJ(    jne,    label1,  )   // Bitcoin: fix for MinGW GCC 4.5
 
 	AS2(	mov		WORD_REG(dx), DATA_SAVE)
 	AS2(	add		WORD_REG(dx), 64)

Title: Re: 4 hashes parallel on SSE2 CPUs for 0.3.6
Post by: hugolp on August 17, 2010, 06:23:11 PM

Tried Bitcoin 3.10 on Ubuntu Lucid 64 bit on Intel Atom 330.

Using the option -4way produces about half the hash/s of running without it. I tried using 1 to 4 (virtual) cores and the -4way option always produces less than no option (around half). It's probably due to the Intel thing.

Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: tcatm on July 30, 2010, 09:23:10 PM