Bitcoin Forum
321  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 04, 2010, 12:03:07 AM
Did your kernel write a core dump? If so, can you mail the binary and the core dump to tcatm@gawab.com?
322  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 03, 2010, 10:00:35 PM
Did it crash with a segfault? If so, can you provide a backtrace (gdb bitcoind; run; bt)?
323  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 03, 2010, 09:53:54 PM
Did you run it on 32-bit machines? Which version of the patch did you use?
324  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 03, 2010, 02:21:50 AM
i5 is a different architecture than Core2. Maybe SSE in Core2 is broken and was fixed in i5. That means the original client is close to the fastest you can get on Core2. It's not a compiler thing. I compared the output for different architectures and -march=amdfam10 produces the fastest and smallest code. I would be surprised if a longer loop using the same instructions was faster on an older CPU.
325  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 03, 2010, 01:17:53 AM
Thanks for the object!

There are two things I noticed:
1) The Intel object runs at 3269 khash/s on my AMD64 (vs. 3778 khash/s for the AMD64-compiled object), so it's less optimized than the AMD64 code.

2) The AMD64 code moves less data around and does more calculations. Sometimes it even abuses floating-point instructions for integers.

Could you drop my sha256.o from http://ul.to/2ckndx into cryptopp/obj/, delete the test binary (not the .cpp!!) and rebuild it with make -f makefile.unix test? Take care that make doesn't recompile sha256.cpp into sha256.o. Then run test again. It should be using the AMD64 code now. Maybe it works better...

If not, we've found that AMD64 is about four times faster than Intel at SSE2 integer vector arithmetic. Anyone working on a floating-point SHA256 implementation? ;)
326  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 03, 2010, 12:21:41 AM
Quote:
I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:

Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux

Anything I can try to help and debug this?

Can you mail me a copy of cryptopp/obj/sha256.o to tcatm@gawab.com? I still fear the microcode in Intel's CPUs wasn't made for such tight loops of SSE code. Have you run the test program? How many khash/s does it crunch?
327  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 02, 2010, 09:07:56 PM
To use the test program, download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg
The program tries to find the correct nonce in each block, which checks that the hash function works correctly. It also benchmarks the algorithm.
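
What such a test loop boils down to can be sketched roughly as follows. This is not the actual test program: `HashFn`, `toy_hash`, `scan_nonces`, `rediscovers_nonce` and `khash_per_sec` are hypothetical names, and `toy_hash` is only a cheap stand-in for the double SHA-256 of a block header, so that the sketch stays short and compiles on its own.

```cpp
#include <cassert>
#include <cstdint>

// Signature of the hash under test: (header seed, nonce) -> 32-bit digest word.
// Hypothetical interface; the real program hashes full block headers.
using HashFn = std::uint32_t (*)(std::uint32_t seed, std::uint32_t nonce);

// Stand-in mixer so the sketch compiles on its own. This is NOT SHA-256.
inline std::uint32_t toy_hash(std::uint32_t seed, std::uint32_t nonce) {
    std::uint32_t x = seed ^ (nonce * 0x9E3779B1u);
    x ^= x >> 15;
    x *= 0x85EBCA6Bu;
    x ^= x >> 13;
    return x;
}

struct ScanResult {
    bool found;                 // true if some nonce met the target
    std::uint32_t nonce;        // the winning nonce (valid only if found)
    std::uint64_t hashes_done;  // number of hash evaluations performed
};

// Try nonces first, first+1, ... (count of them); stop at the first hash <= target.
inline ScanResult scan_nonces(HashFn hash, std::uint32_t seed, std::uint32_t first,
                              std::uint32_t count, std::uint32_t target) {
    ScanResult r = {false, 0, 0};
    for (std::uint32_t i = 0; i < count; ++i) {
        const std::uint32_t nonce = first + i;
        ++r.hashes_done;
        if (hash(seed, nonce) <= target) {
            r.found = true;
            r.nonce = nonce;
            break;
        }
    }
    return r;
}

// Correctness check: a working hash should rediscover the nonce stored in the block.
inline bool rediscovers_nonce(HashFn hash, std::uint32_t seed, std::uint32_t known_nonce,
                              std::uint32_t first, std::uint32_t count,
                              std::uint32_t target) {
    const ScanResult r = scan_nonces(hash, seed, first, count, target);
    return r.found && r.nonce == known_nonce;
}

// Hashes per millisecond is numerically the same as khash/s.
inline double khash_per_sec(std::uint64_t hashes, double elapsed_ms) {
    return static_cast<double>(hashes) / elapsed_ms;
}
```

To benchmark, one would time a call to `scan_nonces` and pass `hashes_done` and the elapsed milliseconds to `khash_per_sec`. A broken hash implementation shows up as `rediscovers_nonce` returning false for a block whose nonce is known.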

From what I've heard, the patch does not work on 32-bit systems. I don't know why. I developed it on an AMD64 machine and it works fine there. If it's slower on Intel, try disabling Hyper-Threading. The big loop in the SSE2 code doesn't contain any "normal" x86 instructions except for one jump at the end.
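
The lane layout behind the four-way parallel loop can be modelled in portable scalar C++. The sketch below makes no use of SSE2 and is not code from the patch. It only mirrors how the patch appears to assign nonces: each 128-bit register holds four 32-bit lanes, lane i gets the base nonce plus the loop counter k plus i, and the results land in consecutive columns of one row. The names `kNpar`, `kLanes`, `lane_nonce` and `nonce_row` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kNpar = 32;   // mirrors NPAR in the patch
constexpr unsigned kLanes = 4;   // 32-bit lanes in one 128-bit SSE2 register

// Nonce carried by lane `lane` during loop step k (k = 0, 4, 8, ...).
// Models: nonce = broadcast(base) + {0, 1, 2, 3} + broadcast(k).
inline std::uint32_t lane_nonce(std::uint32_t base, unsigned k, unsigned lane) {
    // Unsigned addition wraps modulo 2^32, like a packed 32-bit add.
    return static_cast<std::uint32_t>(base + k + lane);
}

// Build the row of nonces that one call would cover: one entry per candidate hash.
inline std::array<std::uint32_t, kNpar> nonce_row(std::uint32_t base) {
    std::array<std::uint32_t, kNpar> row{};
    for (unsigned k = 0; k < kNpar; k += kLanes) {
        for (unsigned lane = 0; lane < kLanes; ++lane) {
            row[k + lane] = lane_nonce(base, k, lane);
        }
    }
    return row;
}
```

With NPAR set to 32 the outer loop runs eight times, so one call covers 32 consecutive nonces. That matches the miner loop in the patch advancing `tmp.block.nNonce` by NPAR after each call.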

Btw, there's a git repo at http://github.com/tcatm/bitcoin-cruncher/
328  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 01, 2010, 12:00:10 AM
Patch against SVN. Maybe it'll work now...
Code:
diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..6735678
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,447 @@
+#include <string.h>
+#include <assert.h>
+
+#include <emmintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+ 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+ 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+ 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+ 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+ 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+ 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+ 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+ 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+ 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+ 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+ 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+ 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+ 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+ 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+ 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+ return _mm_xor_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+ return _mm_xor_si128(_mm_xor_si128(_mm_and_si128(b, c), _mm_and_si128(b, d)), _mm_and_si128(c, d));
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+ return _mm_or_si128(_mm_srli_epi32(x, n), _mm_slli_epi32(x, 32 - n));
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define BIGSIGMA0_256(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define BIGSIGMA1_256(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define SIGMA0_256(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define SIGMA1_256(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+ return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ *x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+ return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+ unsigned int value = (*((unsigned int *)(addr)));
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline void dumpreg(__m128i x, const char *msg) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x ;
+ printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+ __func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate(i)
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[9][NPAR], const void *init)
+{
+ unsigned int* In = (unsigned int*)pin;
+ unsigned int* Pad = (unsigned int*)pad;
+ unsigned int* hPre = (unsigned int*)pre;
+ unsigned int* hInit = (unsigned int*)init;
+ unsigned int i, j, k;
+
+ /* vectors used in calculation */
+ __m128i w0, w1, w2, w3, w4, w5, w6, w7;
+ __m128i w8, w9, w10, w11, w12, w13, w14, w15;
+ __m128i T1, T2;
+ __m128i a, b, c, d, e, f, g, h;
+  __m128i nonce;
+
+ /* nonce offset for vector */
+ __m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+ for(k = 0; k<NPAR; k+=4) {
+ w0 = load_epi32(In[0], In[0], In[0], In[0]);
+ w1 = load_epi32(In[1], In[1], In[1], In[1]);
+ w2 = load_epi32(In[2], In[2], In[2], In[2]);
+ //w3 = load_epi32(In[3], In[3], In[3], In[3]); nonce will be later hacked into the hash
+ w4 = load_epi32(In[4], In[4], In[4], In[4]);
+ w5 = load_epi32(In[5], In[5], In[5], In[5]);
+ w6 = load_epi32(In[6], In[6], In[6], In[6]);
+ w7 = load_epi32(In[7], In[7], In[7], In[7]);
+ w8 = load_epi32(In[8], In[8], In[8], In[8]);
+ w9 = load_epi32(In[9], In[9], In[9], In[9]);
+ w10 = load_epi32(In[10], In[10], In[10], In[10]);
+ w11 = load_epi32(In[11], In[11], In[11], In[11]);
+ w12 = load_epi32(In[12], In[12], In[12], In[12]);
+ w13 = load_epi32(In[13], In[13], In[13], In[13]);
+ w14 = load_epi32(In[14], In[14], In[14], In[14]);
+ w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+ /* build four consecutive nonces, one per 32-bit lane, and use them as w3 */
+ nonce = load_epi32(In[3], In[3], In[3], In[3]);
+ __m128i k_vec = load_epi32(k, k, k, k);
+ nonce = _mm_add_epi32(nonce, offset);
+ nonce = _mm_add_epi32(nonce, k_vec);
+ w3 = nonce;
+
+ a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+ b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+ c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+ d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+ e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+ f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+ g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+ h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+ w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+ dest = _mm_add_epi32(w8, x);
+
+ store_load(a, 0, w0);
+ store_load(b, 1, w1);
+ store_load(c, 2, w2);
+ store_load(d, 3, w3);
+ store_load(e, 4, w4);
+ store_load(f, 5, w5);
+ store_load(g, 6, w6);
+ store_load(h, 7, w7);
+
+ w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+ w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+ w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+ w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+ w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+ w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+ w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+ w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+ a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+ b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+ c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+ d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+ e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+ f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+ g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+ h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+ /* store results directly in thash */
+#define store_2(x,i)  \
+ w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+ *(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x);
+
+ store_2(a, 0);
+ store_2(b, 1);
+ store_2(c, 2);
+ store_2(d, 3);
+ store_2(e, 4);
+ store_2(f, 5);
+ store_2(g, 6);
+ store_2(h, 7);
+ *(__m128i *)&(thash)[8][0+k] = nonce;
+ }
+
+}
diff --git a/main.cpp b/main.cpp
index 0239915..50db1a3 100644
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,128 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
+        // Cache for NPAR hashes
+        unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlockSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                /* get nonce from hash */
+                pblock->nNonce = ByteReverse((unsigned int)thash[8][j]);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
                     }
                     SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
                     Sleep(500);
                     break;
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break; // note: exits only the for (j...) loop, not the enclosing mining loop
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index e965287..04dac86 100644
--- a/makefile.unix
+++ b/makefile.unix
@@ -41,7 +41,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+    cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -51,7 +52,7 @@ obj/%.o: %.cpp $(HEADERS)
  g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
- g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+ g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
  g++ $(CFLAGS) -o $@ $^ $(WXLIBS) $(LIBS)
@@ -63,6 +64,9 @@ obj/nogui/%.o: %.cpp $(HEADERS)
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
  g++ $(CFLAGS) -o $@ $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+ g++ $(CFLAGS) -o $@ $^ $(LIBS)
+
 
 clean:
  -rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100644
index 0000000..a55e972
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,221 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+
+int FormatHashBlocks(void* pbuffer, unsigned int len)
+{
+ unsigned char* pdata = (unsigned char*)pbuffer;
+ unsigned int blocks = 1 + ((len + 8) / 64);
+ unsigned char* pend = pdata + 64 * blocks;
+ memset(pdata + len, 0, 64 * blocks - len);
+ pdata[len] = 0x80;
+ unsigned int bits = len * 8;
+ pend[-1] = (bits >> 0) & 0xff;
+ pend[-2] = (bits >> 8) & 0xff;
+ pend[-3] = (bits >> 16) & 0xff;
+ pend[-4] = (bits >> 24) & 0xff;
+ return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[9][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+ memcpy(pstate, pinit, 32);
+ CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+ printf("SHA256 test started\n");
+
+ struct tmpworkspace
+ {
+ struct unnamed2
+ {
+ int nVersion;
+ uint256 hashPrevBlock;
+ uint256 hashMerkleRoot;
+ unsigned int nTime;
+ unsigned int nBits;
+ unsigned int nNonce;
+ }
+ block;
+ unsigned char pchPadding0[64];
+ uint256 hash1;
+ unsigned char pchPadding1[64];
+ }
+  tmp __attribute__ ((aligned (16)));
+
+ char line[180];
+ ifstream fin(filename);
+ char *p;
+ unsigned long int totalhashes= 0;
+ unsigned long int found = 0;
+ clock_t start, end;
+ unsigned long int cpu_time_used;
+ unsigned int tnonce;
+ start = clock();
+
+ while( fin.getline(line, 180))
+ {
+ string in(line);
+ //printf("%s\n", in.c_str());
+ tmp.block.nVersion       = strtol(in.substr(0,8).c_str(), &p, 16);
+ tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+ tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+ tmp.block.nTime          = strtol(in.substr(128+8,8).c_str(), &p, 16);
+ tmp.block.nBits          = strtol(in.substr(128+16,8).c_str(), &p, 16);
+ tnonce = strtol(in.substr(128+24,8).c_str(), &p, 16);
+ tmp.block.nNonce         = tnonce;
+
+ unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+ unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+ // Byte swap all the input buffer
+ for (int i = 0; i < sizeof(tmp)/4; i++)
+ ((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+ // Precalc the first half of the first hash, which stays constant
+ uint256 midstate __attribute__ ((aligned(16)));
+ SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+ uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+ // printf("target %s\n", hashTarget.GetHex().c_str());
+ uint256 hash;
+ uint256 refhash __attribute__ ((aligned(16)));
+
+ unsigned int thash[9][NPAR] __attribute__ ((aligned (16)));
+ int done = 0;
+ unsigned int i, j;
+
+ /* reference */
+ SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+ SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+ for (int i = 0; i < sizeof(refhash)/4; i++)
+ ((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+ //printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+ tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+ for(;;)
+ {
+
+ Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+ for(i = 0; i<NPAR; i++) {
+ /* fast hash checking */
+ if(thash[7][i] == 0) {
+ // printf("found something... ");
+
+ for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+ // printf("%s\n", hash.GetHex().c_str());
+
+ if (hash <= hashTarget)
+ {
+ found++;
+ if(tnonce == ByteReverse((unsigned int)thash[8][i]) ) {
+ if(hash == refhash) {
+ printf("\r%lu", found);
+ totalhashes += NPAR;
+ done = 1;
+ } else {
+ printf("Hashes do not match!\n");
+ }
+ } else {
+ printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+ }
+ break;
+ }
+ }
+ }
+ if(done) break;
+
+ tmp.block.nNonce+=NPAR;
+ totalhashes += NPAR;
+ if(tmp.block.nNonce == 0) {
+ printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+ return;
+ }
+ }
+ }
+ printf("\n");
+ end = clock();
+ cpu_time_used = (unsigned long int)(end - start);
+ cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+ printf("found solutions = %lu\n", found);
+ printf("total hashes = %lu\n", totalhashes);
+ printf("total time = %lu ms\n", cpu_time_used);
+ printf("average speed: %lu khash/s\n", (totalhashes)/cpu_time_used);
+}
+
+int main(int argc, char* argv[]) {
+ if(argc == 2) {
+ BitcoinTester(argv[1]);
+ } else
+ printf("Missing filename!\n");
+ return 0;
+}
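A note on the inner check in test.cpp above: Double_BlockSHA256 fills thash column-wise, so thash[w][i] is word w of the hash for nonce base + i, and the tester rejects on thash[7][i] before it gathers and byte-swaps the whole column. The following is only a minimal standalone sketch of that gather step, not code from the patch. kLanes, byte_reverse and first_candidate are made-up names (byte_reverse stands in for CryptoPP::ByteReverse), and no hashing is done here.

```cpp
#include <cassert>
#include <cstdint>

static const int kLanes = 32;  // hypothetical name; plays the role of NPAR in the patch

// Reverse the byte order of a 32-bit word (stand-in for CryptoPP::ByteReverse).
static inline uint32_t byte_reverse(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

// thash[w][lane] holds word w of the hash computed for nonce base + lane.
// Returns the first lane whose last word is zero and copies that lane's
// eight words, byte-reversed, into out. Returns -1 if no lane qualifies.
static int first_candidate(const uint32_t thash[8][kLanes], uint32_t out[8]) {
    for (int lane = 0; lane < kLanes; lane++) {
        if (thash[7][lane] != 0)
            continue;  // cheap reject: one 32-bit compare per hash
        for (int w = 0; w < 8; w++)
            out[w] = byte_reverse(thash[w][lane]);
        return lane;
    }
    return -1;
}
```

The cheap reject works because, after the byte swap, word 7 holds the most significant 32 bits of the 256-bit value, and those are zero for any hash at or below a difficulty-1 target. The full comparison against hashTarget is still needed afterwards.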
329  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 31, 2010, 11:37:02 PM
I'm running on the Intel Q6600 2.4Ghz, how shall I get the file to you?
Yes. I will look at the assembler code. Maybe the compiler did something "wrong".
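For anyone who wants to compare compiler output on their own CPU: the full 64-round routine is hard to read in assembly, so a tiny function with the same building blocks can be easier to inspect. What follows is a minimal sketch, assuming SSE2 on x86. rotr4, nonce_lanes and store_lanes are hypothetical names, not part of the patch. SSE2 has no packed rotate instruction, so the rotate is two shifts and an OR, as in the patch's ROTR helper.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 integer intrinsics

// Scalar reference: rotate a 32-bit word right by n bits (0 < n < 32).
static inline uint32_t rotr_scalar(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

// Rotate each of the four 32-bit lanes right by N bits (0 < N < 32).
// N is a template parameter so the shifts can use the immediate form.
template <int N>
static inline __m128i rotr4(__m128i x) {
    return _mm_or_si128(_mm_srli_epi32(x, N), _mm_slli_epi32(x, 32 - N));
}

// Four consecutive nonces in one register: lane i holds base + i.
// _mm_set_epi32 takes the highest lane first, so lane 0 is the last argument.
static inline __m128i nonce_lanes(uint32_t base) {
    const __m128i offset = _mm_set_epi32(3, 2, 1, 0);
    return _mm_add_epi32(_mm_set1_epi32(static_cast<int>(base)), offset);
}

// Copy the four lanes to memory, lane 0 first (unaligned store, so any
// destination address is fine).
static inline void store_lanes(__m128i v, uint32_t out[4]) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Building this with g++ -O3 -msse2 -S and reading the generated assembly shows what the compiler makes of the shift/OR pair for a given -march setting, without having to dig through the unrolled rounds.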
330  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 31, 2010, 10:38:29 PM
What CPUs are you running it on? Could you send me sha256.o (compiled object of the algorithm)?
331  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 31, 2010, 05:40:27 PM
Looks like pastebin.com messes up the patch...
Code:
diff --git a/cryptopp/sha256.cpp b/cryptopp/sha256.cpp
new file mode 100644
index 0000000..15f8be1
--- /dev/null
+++ b/cryptopp/sha256.cpp
@@ -0,0 +1,443 @@
+#include <string.h>
+#include <assert.h>
+
+#include <emmintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#define NPAR 32
+
+static const unsigned int sha256_consts[] = {
+ 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, /*  0 */
+ 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+ 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, /*  8 */
+ 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+ 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, /* 16 */
+ 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+ 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, /* 24 */
+ 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+ 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, /* 32 */
+ 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+ 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, /* 40 */
+ 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+ 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, /* 48 */
+ 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+ 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, /* 56 */
+ 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+};
+
+
+static inline __m128i Ch(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (~b & d);
+}
+
+static inline __m128i Maj(const __m128i b, const __m128i c, const __m128i d) {
+ return (b & c) ^ (b & d) ^ (c & d);
+}
+
+static inline __m128i ROTR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n);
+}
+
+static inline __m128i SHR(__m128i x, const int n) {
+ return _mm_srli_epi32(x, n);
+}
+
+/* SHA256 Functions */
+#define BIGSIGMA0_256(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
+#define BIGSIGMA1_256(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
+#define SIGMA0_256(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ SHR((x), 3))
+#define SIGMA1_256(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ SHR((x), 10))
+
+static inline __m128i load_epi32(const unsigned int x0, const unsigned int x1, const unsigned int x2, const unsigned int x3) {
+ return _mm_set_epi32(x0, x1, x2, x3);
+}
+
+static inline unsigned int store32(const __m128i x, int i) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ return box.ret[i];
+}
+
+static inline void store_epi32(const __m128i x, unsigned int *x0, unsigned int *x1, unsigned int *x2, unsigned int *x3) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x;
+ *x0 = box.ret[3]; *x1 = box.ret[2]; *x2 = box.ret[1]; *x3 = box.ret[0];
+}
+
+static inline __m128i SHA256_CONST(const int i) {
+ return _mm_set1_epi32(sha256_consts[i]);
+}
+
+#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
+#define add5(x0, x1, x2, x3, x4) _mm_add_epi32(add4(x0, x1, x2, x3), x4)
+
+#define SHA256ROUND(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_lastd(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+d = _mm_add_epi32(d, T1);                                           
+//T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 
+//h = _mm_add_epi32(T1, T2);
+
+#define SHA256ROUND_last(a, b, c, d, e, f, g, h, i, w)                       \
+ T1 = add5(h, BIGSIGMA1_256(e), Ch(e, f, g), SHA256_CONST(i), w); \
+T2 = _mm_add_epi32(BIGSIGMA0_256(a), Maj(a, b, c));                 \
+h = _mm_add_epi32(T1, T2);
+
+static inline unsigned int swap(unsigned int value) {
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline unsigned int SWAP32(const void *addr) {
+ unsigned int value = (*((unsigned int *)(addr)));
+ __asm__ ("bswap %0" : "=r" (value) : "0" (value));
+ return value;
+}
+
+static inline void dumpreg(__m128i x, char *msg) {
+ union { unsigned int ret[4]; __m128i x; } box;
+ box.x = x ;
+ printf("%s %08x %08x %08x %08x\n", msg, box.ret[0], box.ret[1], box.ret[2], box.ret[3]);
+}
+
+#if 1
+#define dumpstate(i) printf("%s: %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", \
+ __func__, store32(w0, i), store32(a, i), store32(b, i), store32(c, i), store32(d, i), store32(e, i), store32(f, i), store32(g, i), store32(h, i));
+#else
+#define dumpstate(i)
+#endif
+void Double_BlockSHA256(const void* pin, void* pad, const void *pre, unsigned int thash[8][NPAR], const void *init)
+{
+ unsigned int* In = (unsigned int*)pin;
+ unsigned int* Pad = (unsigned int*)pad;
+ unsigned int* hPre = (unsigned int*)pre;
+ unsigned int* hInit = (unsigned int*)init;
+ unsigned int i, j, k;
+
+ /* vectors used in calculation */
+ __m128i w0, w1, w2, w3, w4, w5, w6, w7;
+ __m128i w8, w9, w10, w11, w12, w13, w14, w15;
+ __m128i T1, T2;
+ __m128i a, b, c, d, e, f, g, h;
+
+ /* nonce offset for vector */
+ __m128i offset = load_epi32(0x00000003, 0x00000002, 0x00000001, 0x00000000);
+
+
+ for(k = 0; k<NPAR; k+=4) {
+ w0 = load_epi32(In[0], In[0], In[0], In[0]);
+ w1 = load_epi32(In[1], In[1], In[1], In[1]);
+ w2 = load_epi32(In[2], In[2], In[2], In[2]);
+ w3 = load_epi32(In[3], In[3], In[3], In[3]);
+ w4 = load_epi32(In[4], In[4], In[4], In[4]);
+ w5 = load_epi32(In[5], In[5], In[5], In[5]);
+ w6 = load_epi32(In[6], In[6], In[6], In[6]);
+ w7 = load_epi32(In[7], In[7], In[7], In[7]);
+ w8 = load_epi32(In[8], In[8], In[8], In[8]);
+ w9 = load_epi32(In[9], In[9], In[9], In[9]);
+ w10 = load_epi32(In[10], In[10], In[10], In[10]);
+ w11 = load_epi32(In[11], In[11], In[11], In[11]);
+ w12 = load_epi32(In[12], In[12], In[12], In[12]);
+ w13 = load_epi32(In[13], In[13], In[13], In[13]);
+ w14 = load_epi32(In[14], In[14], In[14], In[14]);
+ w15 = load_epi32(In[15], In[15], In[15], In[15]);
+
+ /* hack nonce into lowest byte of w3 */
+ __m128i k_vec = load_epi32(k, k, k, k);
+ w3 = _mm_add_epi32(w3, offset);
+ w3 = _mm_add_epi32(w3, k_vec);
+
+ a = load_epi32(hPre[0], hPre[0], hPre[0], hPre[0]);
+ b = load_epi32(hPre[1], hPre[1], hPre[1], hPre[1]);
+ c = load_epi32(hPre[2], hPre[2], hPre[2], hPre[2]);
+ d = load_epi32(hPre[3], hPre[3], hPre[3], hPre[3]);
+ e = load_epi32(hPre[4], hPre[4], hPre[4], hPre[4]);
+ f = load_epi32(hPre[5], hPre[5], hPre[5], hPre[5]);
+ g = load_epi32(hPre[6], hPre[6], hPre[6], hPre[6]);
+ h = load_epi32(hPre[7], hPre[7], hPre[7], hPre[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+#define store_load(x, i, dest) \
+ w8 = load_epi32((hPre)[i], (hPre)[i], (hPre)[i], (hPre)[i]); \
+ dest = _mm_add_epi32(w8, x);
+
+ store_load(a, 0, w0);
+ store_load(b, 1, w1);
+ store_load(c, 2, w2);
+ store_load(d, 3, w3);
+ store_load(e, 4, w4);
+ store_load(f, 5, w5);
+ store_load(g, 6, w6);
+ store_load(h, 7, w7);
+
+ w8 = load_epi32(Pad[8], Pad[8], Pad[8], Pad[8]);
+ w9 = load_epi32(Pad[9], Pad[9], Pad[9], Pad[9]);
+ w10 = load_epi32(Pad[10], Pad[10], Pad[10], Pad[10]);
+ w11 = load_epi32(Pad[11], Pad[11], Pad[11], Pad[11]);
+ w12 = load_epi32(Pad[12], Pad[12], Pad[12], Pad[12]);
+ w13 = load_epi32(Pad[13], Pad[13], Pad[13], Pad[13]);
+ w14 = load_epi32(Pad[14], Pad[14], Pad[14], Pad[14]);
+ w15 = load_epi32(Pad[15], Pad[15], Pad[15], Pad[15]);
+
+ a = load_epi32(hInit[0], hInit[0], hInit[0], hInit[0]);
+ b = load_epi32(hInit[1], hInit[1], hInit[1], hInit[1]);
+ c = load_epi32(hInit[2], hInit[2], hInit[2], hInit[2]);
+ d = load_epi32(hInit[3], hInit[3], hInit[3], hInit[3]);
+ e = load_epi32(hInit[4], hInit[4], hInit[4], hInit[4]);
+ f = load_epi32(hInit[5], hInit[5], hInit[5], hInit[5]);
+ g = load_epi32(hInit[6], hInit[6], hInit[6], hInit[6]);
+ h = load_epi32(hInit[7], hInit[7], hInit[7], hInit[7]);
+
+ SHA256ROUND(a, b, c, d, e, f, g, h, 0, w0);   
+ SHA256ROUND(h, a, b, c, d, e, f, g, 1, w1);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 2, w2);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 3, w3);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 4, w4);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 5, w5);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 6, w6);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 7, w7);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 8, w8);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 9, w9);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 10, w10);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 11, w11);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 12, w12);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 13, w13);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 14, w14);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 15, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 16, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 17, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 18, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 19, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 20, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 21, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 22, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 23, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 24, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 25, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 26, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 27, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 28, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 29, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 30, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 31, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 32, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 33, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 34, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 35, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 36, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 37, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 38, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 39, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 40, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 41, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 42, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 43, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 44, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 45, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 46, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 47, w15);
+
+ w0 = add4(SIGMA1_256(w14), w9, SIGMA0_256(w1), w0);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 48, w0);
+ w1 = add4(SIGMA1_256(w15), w10, SIGMA0_256(w2), w1);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 49, w1);
+ w2 = add4(SIGMA1_256(w0), w11, SIGMA0_256(w3), w2);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 50, w2);
+ w3 = add4(SIGMA1_256(w1), w12, SIGMA0_256(w4), w3);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 51, w3);
+ w4 = add4(SIGMA1_256(w2), w13, SIGMA0_256(w5), w4);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 52, w4);
+ w5 = add4(SIGMA1_256(w3), w14, SIGMA0_256(w6), w5);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 53, w5);
+ w6 = add4(SIGMA1_256(w4), w15, SIGMA0_256(w7), w6);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 54, w6);
+ w7 = add4(SIGMA1_256(w5), w0, SIGMA0_256(w8), w7);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 55, w7);
+ w8 = add4(SIGMA1_256(w6), w1, SIGMA0_256(w9), w8);
+ SHA256ROUND(a, b, c, d, e, f, g, h, 56, w8);
+ w9 = add4(SIGMA1_256(w7), w2, SIGMA0_256(w10), w9);
+ SHA256ROUND(h, a, b, c, d, e, f, g, 57, w9);
+ w10 = add4(SIGMA1_256(w8), w3, SIGMA0_256(w11), w10);
+ SHA256ROUND(g, h, a, b, c, d, e, f, 58, w10);
+ w11 = add4(SIGMA1_256(w9), w4, SIGMA0_256(w12), w11);
+ SHA256ROUND(f, g, h, a, b, c, d, e, 59, w11);
+ w12 = add4(SIGMA1_256(w10), w5, SIGMA0_256(w13), w12);
+ SHA256ROUND(e, f, g, h, a, b, c, d, 60, w12);
+ w13 = add4(SIGMA1_256(w11), w6, SIGMA0_256(w14), w13);
+ SHA256ROUND(d, e, f, g, h, a, b, c, 61, w13);
+ w14 = add4(SIGMA1_256(w12), w7, SIGMA0_256(w15), w14);
+ SHA256ROUND(c, d, e, f, g, h, a, b, 62, w14);
+ w15 = add4(SIGMA1_256(w13), w8, SIGMA0_256(w0), w15);
+ SHA256ROUND(b, c, d, e, f, g, h, a, 63, w15);
+
+ /* store results directly in thash */
+#define store_2(x,i)  \
+ w0 = load_epi32((hInit)[i], (hInit)[i], (hInit)[i], (hInit)[i]); \
+ *(__m128i *)&(thash)[i][0+k] = _mm_add_epi32(w0, x);
+
+ store_2(a, 0);
+ store_2(b, 1);
+ store_2(c, 2);
+ store_2(d, 3);
+ store_2(e, 4);
+ store_2(f, 5);
+ store_2(g, 6);
+ store_2(h, 7);
+ }
+
+}
diff --git a/main.cpp b/main.cpp
index ddc359a..d30d642 100755
--- a/main.cpp
+++ b/main.cpp
@@ -2555,8 +2555,10 @@ inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
     CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
 }
 
+// !!!! NPAR must match NPAR in cryptopp/sha256.cpp !!!!
+#define NPAR 32
 
-
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
 
 
 void BitcoinMiner()
@@ -2701,108 +2703,123 @@ void BitcoinMiner()
         uint256 hashTarget = CBigNum().SetCompact(pblock->nBits).getuint256();
         uint256 hashbuf[2];
         uint256& hash = *alignup<16>(hashbuf);
+
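+        // Cache for NPAR hashes; 16-byte aligned because Double_BlockSHA256 writes it with aligned SSE2 stores
+        unsigned int thash[8][NPAR] __attribute__ ((aligned (16)));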
+
+        unsigned int j;
         loop
         {
-            SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
-            SHA256Transform(&hash, &tmp.hash1, pSHA256InitState);
+          Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
 
-            if (((unsigned short*)&hash)[14] == 0)
+          for(j = 0; j<NPAR; j++) {
+            if (thash[7][j] == 0)
             {
-                // Byte swap the result after preliminary check
-                for (int i = 0; i < sizeof(hash)/4; i++)
-                    ((unsigned int*)&hash)[i] = ByteReverse(((unsigned int*)&hash)[i]);
-
-                if (hash <= hashTarget)
+              // Byte swap the result after preliminary check
+              for (int i = 0; i < sizeof(hash)/4; i++)
+                ((unsigned int*)&hash)[i] = ByteReverse((unsigned int)thash[i][j]);
+
+              if (hash <= hashTarget)
+              {
+                // Double_BlockSHA256 might only calculate parts of the hash.
+                // We'll insert the nonce and get the real hash.
+                //pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                //hash = pblock->GetHash();
+
+                pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
+                assert(hash == pblock->GetHash());
+
+                //// debug print
+                printf("BitcoinMiner:\n");
+                printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
+                pblock->print();
+                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
+
+                SetThreadPriority(THREAD_PRIORITY_NORMAL);
+                CRITICAL_BLOCK(cs_main)
                 {
-                    pblock->nNonce = ByteReverse(tmp.block.nNonce);
-                    assert(hash == pblock->GetHash());
-
-                        //// debug print
-                        printf("BitcoinMiner:\n");
-                        printf("proof-of-work found  \n  hash: %s  \ntarget: %s\n", hash.GetHex().c_str(), hashTarget.GetHex().c_str());
-                        pblock->print();
-                        printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                        printf("generated %s\n", FormatMoney(pblock->vtx[0].vout[0].nValue).c_str());
-
-                    SetThreadPriority(THREAD_PRIORITY_NORMAL);
-                    CRITICAL_BLOCK(cs_main)
-                    {
-                        if (pindexPrev == pindexBest)
-                        {
-                            // Save key
-                            if (!AddKey(key))
-                                return;
-                            key.MakeNewKey();
-
-                            // Track how many getdata requests this block gets
-                            CRITICAL_BLOCK(cs_mapRequestCount)
-                                mapRequestCount[pblock->GetHash()] = 0;
-
-                            // Process this block the same as if we had received it from another node
-                            if (!ProcessBlock(NULL, pblock.release()))
-                                printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
-                        }
-                    }
-                    SetThreadPriority(THREAD_PRIORITY_LOWEST);
-
-                    Sleep(500);
-                    break;
+                  if (pindexPrev == pindexBest)
+                  {
+                    // Save key
+                    if (!AddKey(key))
+                      return;
+                    key.MakeNewKey();
+
+                    // Track how many getdata requests this block gets
+                    CRITICAL_BLOCK(cs_mapRequestCount)
+                      mapRequestCount[pblock->GetHash()] = 0;
+
+                    // Process this block the same as if we had received it from another node
+                    if (!ProcessBlock(NULL, pblock.release()))
+                      printf("ERROR in BitcoinMiner, ProcessBlock, block not accepted\n");
+
+                  }
                 }
-            }
+                SetThreadPriority(THREAD_PRIORITY_LOWEST);
 
-            // Update nTime every few seconds
-            const unsigned int nMask = 0xffff;
-            if ((++tmp.block.nNonce & nMask) == 0)
+                Sleep(500);
+                break;
+              }
+            }
+          }
+
+          // Update nonce
+          tmp.block.nNonce += NPAR;
+
+          // Update nTime every few seconds
+          const unsigned int nMask = 0xffff;
+          if ((tmp.block.nNonce & nMask) == 0)
+          {
+            // Meter hashes/sec
+            static int64 nTimerStart;
+            static int nHashCounter;
+            if (nTimerStart == 0)
+              nTimerStart = GetTimeMillis();
+            else
+              nHashCounter++;
+            if (GetTimeMillis() - nTimerStart > 4000)
             {
-                // Meter hashes/sec
-                static int64 nTimerStart;
-                static int nHashCounter;
-                if (nTimerStart == 0)
-                    nTimerStart = GetTimeMillis();
-                else
-                    nHashCounter++;
+              static CCriticalSection cs;
+              CRITICAL_BLOCK(cs)
+              {
                 if (GetTimeMillis() - nTimerStart > 4000)
                 {
-                    static CCriticalSection cs;
-                    CRITICAL_BLOCK(cs)
-                    {
-                        if (GetTimeMillis() - nTimerStart > 4000)
-                        {
-                            double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
-                            nTimerStart = GetTimeMillis();
-                            nHashCounter = 0;
-                            string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
-                            UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
-                            static int64 nLogTime;
-                            if (GetTime() - nLogTime > 30 * 60)
-                            {
-                                nLogTime = GetTime();
-                                printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
-                                printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
-                            }
-                        }
-                    }
+                  double dHashesPerSec = 1000.0 * (nMask+1) * nHashCounter / (GetTimeMillis() - nTimerStart);
+                  nTimerStart = GetTimeMillis();
+                  nHashCounter = 0;
+                  string strStatus = strprintf("    %.0f khash/s", dHashesPerSec/1000.0);
+                  UIThreadCall(bind(CalledSetStatusBar, strStatus, 0));
+                  static int64 nLogTime;
+                  if (GetTime() - nLogTime > 30 * 60)
+                  {
+                    nLogTime = GetTime();
+                    printf("%s ", DateTimeStrFormat("%x %H:%M", GetTime()).c_str());
+                    printf("hashmeter %3d CPUs %6.0f khash/s\n", vnThreadsRunning[3], dHashesPerSec/1000.0);
+                  }
                 }
-
-                // Check for stop or if block needs to be rebuilt
-                if (fShutdown)
-                    return;
-                if (!fGenerateBitcoins)
-                    return;
-                if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
-                    return;
-                if (vNodes.empty())
-                    break;
-                if (tmp.block.nNonce == 0)
-                    break;
-                if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
-                    break;
-                if (pindexPrev != pindexBest)
-                    break;
-
-                pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
-                tmp.block.nTime = ByteReverse(pblock->nTime);
+              }
             }
+
+            // Check for stop or if block needs to be rebuilt
+            if (fShutdown)
+              return;
+            if (!fGenerateBitcoins)
+              return;
+            if (fLimitProcessors && vnThreadsRunning[3] > nLimitProcessors)
+              return;
+            if (vNodes.empty())
+              break;
+            if (tmp.block.nNonce == 0)
+              break;
+            if (nTransactionsUpdated != nTransactionsUpdatedLast && GetTime() - nStart > 60)
+              break;
+            if (pindexPrev != pindexBest)
+              break;
+
+            pblock->nTime = max(pindexPrev->GetMedianTimePast()+1, GetAdjustedTime());
+            tmp.block.nTime = ByteReverse(pblock->nTime);
+          }
         }
     }
 }
diff --git a/makefile.unix b/makefile.unix
index 597a0ea..8fb0aa6 100755
--- a/makefile.unix
+++ b/makefile.unix
@@ -45,7 +45,8 @@ OBJS= \
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
-    cryptopp/obj/cpu.o
+    cryptopp/obj/cpu.o \
+ cryptopp/obj/sha256.o
 
 
 all: bitcoin
@@ -58,18 +59,20 @@ obj/%.o: %.cpp $(HEADERS) headers.h.gch
  g++ -c $(CFLAGS) -DGUI -o $@ $<
 
 cryptopp/obj/%.o: cryptopp/%.cpp
- g++ -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_SSE2 -o $@ $<
+ g++ -c $(CFLAGS) -frename-registers -funroll-all-loops -fomit-frame-pointer  -march=native -msse2 -msse3  -ffast-math -O3 -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
  g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
 
-
 obj/nogui/%.o: %.cpp $(HEADERS)
  g++ -c $(CFLAGS) -o $@ $<
 
 bitcoind: $(OBJS:obj/%=obj/nogui/%)
  g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(LIBS)
 
+test: cryptopp/obj/sha.o cryptopp/obj/sha256.o test.cpp
+   g++ $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+
 
 clean:
  -rm -f obj/*.o
diff --git a/test.cpp b/test.cpp
new file mode 100755
index 0000000..7cab332
--- /dev/null
+++ b/test.cpp
@@ -0,0 +1,237 @@
+// Copyright (c) 2009-2010 Satoshi Nakamoto
+// Distributed under the MIT/X11 software license, see the accompanying
+// file license.txt or http://www.opensource.org/licenses/mit-license.php.
+#include <assert.h>
+#include <openssl/ecdsa.h>
+#include <openssl/evp.h>
+#include <openssl/rand.h>
+#include <openssl/sha.h>
+#include <openssl/ripemd.h>
+#include <db_cxx.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+#include <float.h>
+#include <assert.h>
+#include <memory>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <list>
+#include <deque>
+#include <map>
+#include <set>
+#include <algorithm>
+#include <numeric>
+#include <boost/foreach.hpp>
+#include <boost/lexical_cast.hpp>
+#include <boost/tuple/tuple.hpp>
+#include <boost/fusion/container/vector.hpp>
+#include <boost/tuple/tuple_comparison.hpp>
+#include <boost/tuple/tuple_io.hpp>
+#include <boost/array.hpp>
+#include <boost/bind.hpp>
+#include <boost/function.hpp>
+#include <boost/filesystem.hpp>
+#include <boost/filesystem/fstream.hpp>
+#include <boost/algorithm/string.hpp>
+#include <boost/interprocess/sync/interprocess_mutex.hpp>
+#include <boost/interprocess/sync/interprocess_recursive_mutex.hpp>
+#include <boost/date_time/gregorian/gregorian_types.hpp>
+#include <boost/date_time/posix_time/posix_time_types.hpp>
+#include <sys/resource.h>
+#include <sys/time.h>
+using namespace std;
+using namespace boost;
+#include "cryptopp/sha.h"
+#include "strlcpy.h"
+#include "serialize.h"
+#include "uint256.h"
+#include "bignum.h"
+
+#undef printf
+ template <size_t nBytes, typename T>
+T* alignup(T* p)
+{
+ union
+ {   
+ T* ptr;
+ size_t n;
+ } u;
+ u.ptr = p;
+ u.n = (u.n + (nBytes-1)) & ~(nBytes-1);
+ return u.ptr;
+}
+
+int FormatHashBlocks(void* pbuffer, unsigned int len)
+{
+ unsigned char* pdata = (unsigned char*)pbuffer;
+ unsigned int blocks = 1 + ((len + 8) / 64);
+ unsigned char* pend = pdata + 64 * blocks;
+ memset(pdata + len, 0, 64 * blocks - len);
+ pdata[len] = 0x80;
+ unsigned int bits = len * 8;
+ pend[-1] = (bits >> 0) & 0xff;
+ pend[-2] = (bits >> 8) & 0xff;
+ pend[-3] = (bits >> 16) & 0xff;
+ pend[-4] = (bits >> 24) & 0xff;
+ return blocks;
+}
+
+using CryptoPP::ByteReverse;
+static int detectlittleendian = 1;
+
+#define NPAR 32
+
+extern void Double_BlockSHA256(const void* pin, void* pout, const void *pinit, unsigned int hash[8][NPAR], const void *init2);
+
+using CryptoPP::ByteReverse;
+
+static const unsigned int pSHA256InitState[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
+
+inline void SHA256Transform(void* pstate, void* pinput, const void* pinit)
+{
+ memcpy(pstate, pinit, 32);
+ CryptoPP::SHA256::Transform((CryptoPP::word32*)pstate, (CryptoPP::word32*)pinput);
+}
+
+void BitcoinTester(char *filename)
+{
+ printf("SHA256 test started\n");
+
+ struct tmpworkspace
+ {
+ struct unnamed2
+ {
+ int nVersion;
+ uint256 hashPrevBlock;
+ uint256 hashMerkleRoot;
+ unsigned int nTime;
+ unsigned int nBits;
+ unsigned int nNonce;
+ }
+ block;
+ unsigned char pchPadding0[64];
+ uint256 hash1;
+ unsigned char pchPadding1[64];
+ };
+ char tmpbuf[sizeof(tmpworkspace)+16];
+ tmpworkspace& tmp = *(tmpworkspace*)alignup<16>(tmpbuf);
+
+
+ char line[180];
+ ifstream fin(filename);
+ char *p;
+ unsigned long int totalhashes= 0;
+ unsigned long int found = 0;
+ clock_t start, end;
+ unsigned long int cpu_time_used;
+ unsigned int tnonce;
+ start = clock();
+
+ while( fin.getline(line, 180))
+ {
+ string in(line);
+ //printf("%s\n", in.c_str());
+ tmp.block.nVersion       = strtol(in.substr(0,8).c_str(), &p, 16);
+ tmp.block.hashPrevBlock.SetHex(in.substr(8,64));
+ tmp.block.hashMerkleRoot.SetHex(in.substr(64+8,64));
+ tmp.block.nTime          = strtol(in.substr(128+8,8).c_str(), &p, 16);
+ tmp.block.nBits          = strtol(in.substr(128+16,8).c_str(), &p, 16);
+ tnonce = strtol(in.substr(128+24,8).c_str(), &p, 16);
+ tmp.block.nNonce         = tnonce;
+
+ unsigned int nBlocks0 = FormatHashBlocks(&tmp.block, sizeof(tmp.block));
+ unsigned int nBlocks1 = FormatHashBlocks(&tmp.hash1, sizeof(tmp.hash1));
+
+ // Byte swap all the input buffer
+ for (int i = 0; i < sizeof(tmp)/4; i++)
+ ((unsigned int*)&tmp)[i] = ByteReverse(((unsigned int*)&tmp)[i]);
+
+ // Precalc the first half of the first hash, which stays constant
+ uint256 midstatebuf[2];
+ uint256& midstate = *alignup<16>(midstatebuf);
+ SHA256Transform(&midstate, &tmp.block, pSHA256InitState);
+
+
+ uint256 hashTarget = CBigNum().SetCompact(ByteReverse(tmp.block.nBits)).getuint256();
+ // printf("target %s\n", hashTarget.GetHex().c_str());
+ uint256 hash;
+ uint256 hashbuf[2];
+ uint256& refhash = *alignup<16>(hashbuf);
+
+ unsigned int thash[8][NPAR];
+ int done = 0;
+ unsigned int i, j;
+
+ /* reference */
+ SHA256Transform(&tmp.hash1, (char*)&tmp.block + 64, &midstate);
+ SHA256Transform(&refhash, &tmp.hash1, pSHA256InitState);
+ for (int i = 0; i < sizeof(refhash)/4; i++)
+ ((unsigned int*)&refhash)[i] = ByteReverse(((unsigned int*)&refhash)[i]);
+
+ //printf("reference nonce %08x:\n%s\n\n", tnonce, refhash.GetHex().c_str());
+
+ tmp.block.nNonce = ByteReverse(tnonce) & 0xfffff000;
+
+
+ for(;;)
+ {
+
+ Double_BlockSHA256((char*)&tmp.block + 64, &tmp.hash1, &midstate, thash, pSHA256InitState);
+
+ for(i = 0; i<NPAR; i++) {
+ /* fast hash checking */
+ if(thash[7][i] == 0) {
+ // printf("found something... ");
+
+ for(j = 0; j<8; j++) ((unsigned int *)&hash)[j] = ByteReverse((unsigned int)thash[j][i]);
+ // printf("%s\n", hash.GetHex().c_str());
+
+ if (hash <= hashTarget)
+ {
+ found++;
+ if(tnonce == ByteReverse(tmp.block.nNonce + i) ) {
+ if(hash == refhash) {
+ printf("\r%lu", found);
+ totalhashes += NPAR;
+ done = 1;
+ } else {
+ printf("Hashes do not match!\n");
+ }
+ } else {
+ printf("nonce does not match. %08x != %08x\n", tnonce, ByteReverse(tmp.block.nNonce + i));
+ }
+ break;
+ }
+ }
+ }
+ if(done) break;
+
+ tmp.block.nNonce+=NPAR;
+ totalhashes += NPAR;
+ if(tmp.block.nNonce == 0) {
+ printf("ERROR: Hash not found for:\n%s\n", in.c_str());
+ return;
+ }
+ }
+ }
+ printf("\n");
+ end = clock();
+ cpu_time_used = (unsigned long int)(end - start);
+ cpu_time_used /= ((CLOCKS_PER_SEC)/1000);
+ printf("found solutions = %lu\n", found);
+ printf("total hashes = %lu\n", totalhashes);
+ printf("total time = %lu ms\n", cpu_time_used);
+ printf("average speed: %lu khash/s\n", (totalhashes)/cpu_time_used);
+}
+
+int main(int argc, char* argv[]) {
+ if(argc == 2) {
+ BitcoinTester(argv[1]);
+ } else
+ printf("Missing filename!\n");
+ return 0;
+}
332  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 31, 2010, 02:18:03 PM
The mean client would send all generated bitcoins to a certain address ;)

@em3rgent0rder: I don't know why it fails, but it should be easy to patch it manually...
333  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 31, 2010, 10:12:38 AM
That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once?  I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
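That's how it works: four 32-bit values in one 128-bit vector. They're calculated independently, but at the same time. The packed integer instructions (PADDD, for example) work per 32-bit lane, so a carry never spills into the neighbouring value.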
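The lane independence can be checked with a few lines of C++. This is a minimal sketch, not code from the patch: `add4` is a hypothetical helper name, and the scalar fallback is only there so the snippet also builds on targets without SSE2.

```cpp
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Adds four 32-bit values lane by lane. Each lane wraps modulo 2^32 on its
// own; nothing is carried into the neighbouring lane.
// add4 is a hypothetical helper, not part of the patch.
inline void add4(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // _mm_add_epi32 compiles to PADDD: four independent 32-bit additions
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
#else
    // Scalar fallback with the same per-lane semantics
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Calling it with `a = {0xffffffff, 1, 0x80000000, 7}` and `b = {1, 2, 0x80000000, 0}` makes lane 0 wrap around to 0 while lane 1 is still 3, so the overflow in lane 0 did not leak into lane 1.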

By the way, why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compile time?
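For comparison, here is a rough sketch of both approaches side by side. It assumes GCC or Clang (the attribute is a compiler extension). `Workspace`, `aligned_ws` and `is_aligned16` are hypothetical names used only for illustration, and `alignup` is rewritten here with `uintptr_t` rather than the union used in the patch.

```cpp
#include <stddef.h>
#include <stdint.h>

// Runtime approach: round a pointer up to the next multiple of nBytes
// (nBytes must be a power of two). The caller over-allocates the buffer.
template <size_t nBytes, typename T>
T* alignup(T* p)
{
    uintptr_t n = reinterpret_cast<uintptr_t>(p);
    n = (n + (nBytes - 1)) & ~static_cast<uintptr_t>(nBytes - 1);
    return reinterpret_cast<T*>(n);
}

// Hypothetical stand-in for the miner's work buffer
struct Workspace { unsigned int words[16]; };

// Compile-time approach: ask the compiler to place the object on a
// 16-byte boundary (GCC/Clang extension).
static Workspace aligned_ws __attribute__((aligned(16)));

// True if p sits on a 16-byte boundary
inline bool is_aligned16(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) & 15) == 0;
}
```

The runtime version costs up to 15 spare bytes per buffer, but it does not depend on the compiler or ABI honouring the requested alignment for stack objects, which may be one reason to keep it.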
334  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 30, 2010, 10:00:24 PM
Tell me if it works :)
Donations are welcome. 17asVKkzRGTFvvGH9dMGQaHe78xzfvgSSA
335  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 30, 2010, 09:47:22 PM
Performance of the stock code (as measured by my test/benchmark program) is about 1500 khash/s.
My code does 3500 khash/s. Both figures are for one core. It scales well because I do 128 hashes at once and keep the data structures small enough to fit in the CPU cache.

I have two local collision attacks which will squeeze out another 300 khash/s, but they are not stable yet.
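The batch layout can be illustrated with a short sketch. The test program in the patch stores results word-major as `unsigned int thash[8][NPAR]` with `NPAR` set to 32, and only builds the full 256-bit hash for lanes whose last word is zero. The helper below mirrors that pre-check; `first_candidate` is a hypothetical name, and the reading that word 7 ends up as the most significant word after the byte swap is an assumption based on how the test program uses it.

```cpp
#include <stdint.h>

const int NPAR = 32;  // batch width used by test.cpp in the patch

// Scans a batch stored word-major (hash[word][lane]) and returns the first
// lane whose last state word is zero, or -1 if there is none. Only such
// lanes need the byte swap and the full comparison against the target.
// first_candidate is a hypothetical helper, not part of the patch.
inline int first_candidate(const uint32_t hash[8][NPAR])
{
    for (int lane = 0; lane < NPAR; lane++)
        if (hash[7][lane] == 0)
            return lane;
    return -1;
}
```

Keeping each word of all lanes contiguous means the pre-check touches a single row of `NPAR` integers (128 bytes here), which fits comfortably in cache.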
336  Bitcoin / Development & Technical Discussion / 4 hashes parallel on SSE2 CPUs for 0.3.6 on: July 30, 2010, 09:23:10 PM
This patch will calculate four hashes on one core using vector instructions. There's a test program included that validates the new hash function against the old one, so it should be correct.

The patch is against 0.3.6 and improves khash/s by roughly 115%.

http://pastebin.com/XN1JDb53
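The idea of validating a four-lane function against the scalar one can be sketched on a single SHA-256 building block. This is not the code from the patch: it only shows the standard Σ0 function (rotations by 2, 13 and 22) in scalar and four-lane form, with hypothetical names `Sigma0x4` and `lanes_match_reference`, and a scalar fallback so it builds without SSE2.

```cpp
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Scalar reference: rotate right (n in 1..31) and SHA-256 "big sigma 0"
inline uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
inline uint32_t Sigma0(uint32_t x) { return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22); }

// Four-lane version. SSE2 has no packed rotate, so each rotation is built
// from a right shift, a left shift and an OR.
inline void Sigma0x4(const uint32_t in[4], uint32_t out[4])
{
#if defined(__SSE2__)
    __m128i x   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    __m128i r2  = _mm_or_si128(_mm_srli_epi32(x, 2),  _mm_slli_epi32(x, 30));
    __m128i r13 = _mm_or_si128(_mm_srli_epi32(x, 13), _mm_slli_epi32(x, 19));
    __m128i r22 = _mm_or_si128(_mm_srli_epi32(x, 22), _mm_slli_epi32(x, 10));
    __m128i res = _mm_xor_si128(_mm_xor_si128(r2, r13), r22);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), res);
#else
    for (int i = 0; i < 4; i++)
        out[i] = Sigma0(in[i]);
#endif
}

// Cross-check: every lane must agree with the scalar reference
inline bool lanes_match_reference(const uint32_t in[4])
{
    uint32_t out[4];
    Sigma0x4(in, out);
    for (int i = 0; i < 4; i++)
        if (out[i] != Sigma0(in[i]))
            return false;
    return true;
}
```

The test program in the patch applies the same principle at a larger scale: it compares the output of the parallel routine for a known nonce with the hash produced by the stock transform.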
337  Bitcoin / Bitcoin Discussion / Solution to lost Bitcoins / double spending on: July 29, 2010, 08:53:50 PM
I have found a solution to reclaim "lost" Bitcoins after double spending.
Basically the solution is to remove all transactions from the wallet and disconnect their inputs. If someone has lost serious money because of double spending, I can help.