Bitcoin Forum
June 16, 2024, 08:10:27 PM *
1  Bitcoin / Development & Technical Discussion / Re: 4 hashes parallel on SSE2 CPUs for 0.3.6 on: August 01, 2010, 10:16:48 AM
Take care with __attribute__ ((aligned (16))): it doesn't work with local variables, because gcc doesn't align the stack.

Maybe gcc doesn't align the stack, but it can (and automatically does) align variables on the stack.
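One way to settle this kind of disagreement is to test it on the compiler and target in question. The following is a minimal sketch, assuming GCC (or a compatible compiler) and a platform where `uintptr_t` is available. The helper names `is_aligned16`, `local_is_aligned` and `check_local_alignment` are made up for illustration. A passing check only says something about the build it ran on, not about every GCC version or ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Declares a local (stack) array with a 16-byte alignment request and
   reports whether the compiler honored it on this build. */
static int local_is_aligned(void)
{
    uint32_t state[4] __attribute__ ((aligned (16))) = {0, 0, 0, 0};
    /* The volatile pointer keeps the array from being optimized away. */
    uint32_t *volatile p = state;
    return is_aligned16(p);
}

/* Hypothetical startup check: abort early if aligned locals are not
   actually aligned, instead of crashing later in an SSE load/store. */
static void check_local_alignment(void)
{
    assert(local_is_aligned());
}
```

Calling `check_local_alignment()` once at startup (and once from a worker thread, since thread entry points may start with a differently aligned stack) would show whether aligned SSE loads on locals are safe in that build.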
2  Bitcoin / Bitcoin Discussion / Re: Post Your Hash/Sec and Hardware on: July 28, 2010, 05:11:31 PM

Actually, I re-examined my changes based on your results and I think I may have messed up the rate calculation. Since it calculates four hashes per loop iteration I multiplied the increment to nHashCounter by four, but in retrospect it's already accounting for that by incrementing the nonce four times as quickly. Ergo, the rate was displayed as 4x higher than actual. For reference, I've generated 300 BTC (six blocks) since I started using the program on the 11th of this month, which works out to about one block out of every 400 (0.25%). Is that about what you're getting? It looks like your version should be a bit faster than mine, which is only to be expected--this was my first attempt at using SSE, or for that matter any kind of SIMD optimization.

I should've known it wouldn't be quite so simple. :)

Oh yeah, I had to review the khash/s code a few times to get it right. At one point it displayed 12000, but I somehow didn't believe it :)
I settled for this: in each thread, I increase a counter for each hash that I calculate. Every couple of iterations, the thread saves its own khash/s value in an array. Then, every 30 seconds, one thread sums up all the khash/s values and prints the total.
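The bookkeeping described above could look roughly like this. It is only a sketch of the idea, not the actual code from either miner: `MAX_THREADS`, `thread_khash_rate`, `publish_rate` and `total_khash_rate` are hypothetical names, and the threading itself is left out. In a real multi-threaded build the shared array would need atomics or a lock.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 16  /* hypothetical upper bound on worker threads */

/* One slot per worker thread, holding that thread's latest khash/s. */
static double thread_khash_rate[MAX_THREADS];

/* Called by worker `id` every few iterations. `hashes` is the number of
   hashes the thread actually computed in `elapsed_sec` seconds. A 4-wide
   SSE loop adds 4 per iteration here, and nowhere else, so the rate is
   not multiplied twice. */
static void publish_rate(size_t id, unsigned long hashes, double elapsed_sec)
{
    assert(id < MAX_THREADS && elapsed_sec > 0.0);
    thread_khash_rate[id] = (double)hashes / elapsed_sec / 1000.0;
}

/* Called by a single reporting thread (every 30 seconds in the post):
   sums the per-thread rates into one total. */
static double total_khash_rate(size_t nthreads)
{
    double total = 0.0;
    assert(nthreads <= MAX_THREADS);
    for (size_t i = 0; i < nthreads; i++)
        total += thread_khash_rate[i];
    return total;
}
```

Counting hashes instead of loop iterations or nonce steps keeps the displayed rate independent of how many hashes one iteration produces, which is where the 4x overestimate mentioned earlier came from.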

I don't think pure SSE can be exactly four times as fast as a well optimized C code, mostly because SSE lacks rotate instructions. Did you implement SHA completely in SSE or are you mixing SSE and C?

I implemented the rotates as ((x >> y) | (x << (32-y))), using SSE opcodes for the shifts. It's implemented completely in C, using GCC's vector extensions--no direct assembly code. I did use intrinsics for the shift operations, since the shift operators aren't implemented for vectors. The rest looks much like the original version.

Yep, that's more or less what I did, except that I used intrinsics instead of the gcc vector extension; I think that should be more portable. It was pretty easy: I took an existing implementation as a base and then only had to change some macros. Comparing my SSE version with the base yields a speedup of (only) 2.5x. It's not 4x, mainly because of the lack of rotate operations in SSE. Packing and unpacking also cause a small decrease in speed.
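For reference, the rotate workaround discussed here can be sketched with SSE2 intrinsics as follows. This is an illustration of the ((x >> y) | (x << (32-y))) trick, not the code from either poster: the names `rotr32`, `rotr32x4` and `rotr32_four` are invented, and it assumes an x86 target with SSE2 and a compiler that ships `<emmintrin.h>`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: rotate a 32-bit word right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t x, int n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}

/* Rotate each of the four 32-bit lanes right by n bits. SSE2 has no
   rotate instruction, so this costs two shifts and an OR. */
static __m128i rotr32x4(__m128i x, int n)
{
    assert(n > 0 && n < 32);
    __m128i right = _mm_srl_epi32(x, _mm_cvtsi32_si128(n));
    __m128i left  = _mm_sll_epi32(x, _mm_cvtsi32_si128(32 - n));
    return _mm_or_si128(right, left);
}

/* Convenience wrapper: rotate four words from memory at once.
   Unaligned loads/stores are used so callers need no special alignment. */
static void rotr32_four(const uint32_t in[4], uint32_t out[4], int n)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, rotr32x4(v, n));
}
```

Each vector rotate takes three instructions where a scalar rotate takes one, which is consistent with the point that a 4-wide SSE version ends up well short of a 4x speedup.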

I also heard that some compilers can generate suboptimal, sometimes even outright wrong, code from intrinsics. I was advised to stay away from them and use pure assembler instead.
3  Bitcoin / Bitcoin Discussion / Re: Post Your Hash/Sec and Hardware on: July 28, 2010, 03:39:51 PM
1 thread: ~4200 khash/s
2 threads: ~7700 khash/s

Suffice it to say that I leverage SSE instructions to calculate four hashes at once, per thread.

I tried to implement your idea, and my SSE code is almost exactly 4x as fast as the vanilla code (~2000 khash/s with one thread, up from ~500). However, when running two threads I only get ~3000 khash/s. There is room to optimize my code, but still, the improvement is way lower than yours.

I don't think pure SSE can be exactly four times as fast as a well optimized C code, mostly because SSE lacks rotate instructions. Did you implement SHA completely in SSE or are you mixing SSE and C?