Bitcoin Forum
Author Topic: 4 hashes parallel on SSE2 CPUs for 0.3.6  (Read 22024 times)
wereHamster
August 01, 2010, 10:16:48 AM
 #21

Take care with __attribute__ ((aligned (16))): it doesn't work with local variables, because gcc doesn't align the stack.

Maybe gcc doesn't align the stack, but it can (and automatically does) align variables on the stack.
Mionione
August 01, 2010, 12:53:40 PM
 #22

That's what it is supposed to do, but it doesn't always do it; the issues are on the gcc bugzilla:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43798
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16660
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838
Ground Loop
August 02, 2010, 12:22:43 AM
 #23

No joy against SVN tip here.

Code:
patching file sha256.cpp
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2703.
2 out of 2 hunks FAILED -- saving rejects to file main.cpp.rej
patching file makefile.unix
Hunk #1 FAILED at 41.
Hunk #2 FAILED at 52.
Hunk #3 FAILED at 64.
3 out of 3 hunks FAILED -- saving rejects to file makefile.unix.rej
patching file test.cpp

Trying manually now.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
Ground Loop
August 02, 2010, 06:57:20 AM
 #24

I got the patch knitted in, and I think I did it correctly; it wasn't complicated.

Regrettably, the hash rate has decreased by almost half.  I'm down from 2071 (stock build, svn tip) to 1150 khash/sec with the patch.

It's an Intel Xeon 3 GHz, Linux, with these proc flags:
Code:
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr

Has anyone seen gains?

Did I botch it?  Missing CPU capabilities?  Wrong compiler options?

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
impossible7
August 02, 2010, 08:13:11 AM
 #25

I have been able to apply the patch against SVN (r121) and I tested it on 2 machines:
  • on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
  • on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

The strange thing is that despite running it on 6 Opterons (i.e. 6x4=24 cores) for 40 hours at an average rate of 51,000 khash/s, I still haven't generated any blocks. The probability of this (no blocks, 40 hours, 51,000 khash/s and difficulty=244.2) is 0.09% or 1/1098. Are you sure this thing works correctly and that the reported rate is correct? How do I run the included test program?
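
For anyone who wants to check the 0.09% figure, here is a minimal sketch of the arithmetic. It assumes block finding behaves like a Poisson process and that difficulty 1 costs roughly 2^32 hashes per block (an approximation). The function names are made up for illustration and are not part of the patch or the client.

```cpp
#include <cassert>
#include <cmath>

// Expected number of blocks found in `hours` at `khash_per_sec`.
// Assumption: one block at difficulty 1 takes about 2^32 hashes on average.
double expected_blocks(double khash_per_sec, double hours, double difficulty) {
    assert(difficulty > 0.0);
    const double hashes_tried = khash_per_sec * 1000.0 * hours * 3600.0;
    const double hashes_per_block = difficulty * 4294967296.0; // 2^32
    return hashes_tried / hashes_per_block;
}

// Poisson probability of finding zero blocks: P(0) = exp(-lambda),
// where lambda is the expected number of blocks.
double prob_no_blocks(double khash_per_sec, double hours, double difficulty) {
    return std::exp(-expected_blocks(khash_per_sec, hours, difficulty));
}
```

Calling prob_no_blocks(51000.0, 40.0, 244.2) gives an expected count of about 7 blocks and a no-block probability of roughly 0.09%, in line with the number quoted above.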
knightmb
August 02, 2010, 08:47:04 AM
 #26

Is it an AMD-only optimization perhaps?

Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
impossible7
August 02, 2010, 09:00:55 AM
 #27

Is it an AMD-only optimization perhaps?

Or a 64-bit only optimization.
Ground Loop
August 02, 2010, 09:17:07 AM
 #28

With the patch above, I was unable to build the test program.  You?

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
petree
August 02, 2010, 09:22:29 AM
 #29

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client.  I was even able to port its minor changes to 0.3.7 successfully, with the same results.

Is there a way we can confirm that the variables are being aligned properly?  I'm wondering if the Intel procs are less tolerant of misalignment than the AMDs.
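
One way to confirm alignment at runtime is to look at the low bits of the address. This is only a sketch: is_aligned is a hypothetical helper, not something from the patch, and you would have to call it yourself on whatever buffers the SSE2 code loads from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `p` sits on an `alignment`-byte boundary.
// `alignment` must be a power of two (16 for aligned SSE2 loads and stores).
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Printing is_aligned(&some_local, 16) from inside the hashing function on both the Intel and the AMD build would show whether the compiler honoured the requested alignment for that variable.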
impossible7
August 02, 2010, 09:31:44 AM
 #30

With the patch above, I was unable to build the test program.  You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.


The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client.  I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
nelisky
August 02, 2010, 01:05:48 PM
 #31

Is it an AMD-only optimization perhaps?

Or a 64-bit only optimization.

I'm trying on a Q6600 running 64-bit Linux (Ubuntu Server) and it makes things slower there, so it's not 64-bit only. I'm also running it on my Mac laptop, which sports an Intel i5 (also 64-bit, OS X 10.6), with a great speed improvement there, so it's not AMD only.
petree
August 02, 2010, 04:12:33 PM
 #32

With the patch above, I was unable to build the test program.  You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.


The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client.  I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

Yes, since applying this patch I've generated 2 blocks.
satoshi
August 02, 2010, 07:02:46 PM
Last edit: August 02, 2010, 07:14:55 PM by satoshi
 #33

Is it 2x fast on AMD and 1/2 fast on Intel?

Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compiletime?
Tried that, but it doesn't work for things on the stack.  I ran some tests.

It doesn't even cause an error, it just doesn't align it.
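
For readers who haven't seen the manual approach: a sketch of what an align-up helper can look like. This is NOT the patch's actual alignup<16> code, just a hypothetical align_up written for illustration. The idea is to over-allocate the buffer by N bytes and round the pointer up to the next N-byte boundary, so the result is aligned no matter where the compiler put the array on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds `p` up to the next N-byte boundary. N must be a power of two.
// Hypothetical helper for illustration; the caller must make sure the
// underlying buffer has at least N - 1 spare bytes.
template <std::size_t N, typename T>
T* align_up(T* p) {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
    assert(p != nullptr);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(N - 1);
    return reinterpret_cast<T*>((addr + mask) & ~mask);
}
```

Typical use would be something like unsigned char raw[64 + 16]; followed by unsigned char* buf = align_up<16>(raw); and then only touching buf[0..63].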
jgarzik
August 02, 2010, 07:15:23 PM
 #34

FWIW, there exists -mstackrealign and -mpreferred-stack-boundary=NUM

Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own.
Visit bloq.com / metronome.io
Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
impossible7
August 02, 2010, 08:49:25 PM
 #35

After 52 hours of trying with no blocks generated, I give up and I am switching back to the vanilla bitcoin.

The probability of getting no blocks within 52 hours at 51,000 khash/s is 0.011%. So I conclude that the patch doesn't work and I am 99.989% confident about that. I hope that tcatm provides some explanation on how to use the supplied test program.
tcatm (OP)
August 02, 2010, 09:07:56 PM
 #36

To use the test program download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg
The program will try to find the correct nonce in each block and check whether the hash function works correctly. It'll also benchmark the algorithm.

From what I've heard, the patch does not work on 32-bit systems; I don't know why. I developed it on an AMD64 machine, where it works fine. If it's slower on Intel, try disabling Hyper-Threading. The big loop in the SSE2 code doesn't contain any "normal" x86 instructions except for one jump at the end.

Btw, there's a git repo at http://github.com/tcatm/bitcoin-cruncher/
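
To give an idea of what "4 hashes parallel" means in practice, here is a small sketch using SSE2 intrinsics. It is not taken from the patch or the repo above; it only shows one SHA-256 building block (the "big sigma 0" function) computed on four independent 32-bit lanes at once, plus a scalar version for comparison. The function names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Rotate each of the four 32-bit lanes right by n bits (0 < n < 32).
// SSE2 has no vector rotate, so it is built from two shifts and an OR.
static inline __m128i rotr4(__m128i x, int n) {
    return _mm_or_si128(_mm_srli_epi32(x, n), _mm_slli_epi32(x, 32 - n));
}

// SHA-256 big sigma 0: rotr(x,2) ^ rotr(x,13) ^ rotr(x,22), on four lanes.
// Each lane would belong to a different hash being computed in parallel.
static inline __m128i big_sigma0_x4(__m128i x) {
    return _mm_xor_si128(_mm_xor_si128(rotr4(x, 2), rotr4(x, 13)), rotr4(x, 22));
}

// Scalar reference for a single lane, useful for checking the vector code.
static inline std::uint32_t big_sigma0(std::uint32_t x) {
    auto rotr = [](std::uint32_t v, int n) { return (v >> n) | (v << (32 - n)); };
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22);
}
```

Note that every operation here is a vector shift, OR or XOR on whole registers, which fits the remark that the inner loop is almost entirely SSE2 instructions.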
nelisky
August 02, 2010, 11:52:27 PM
 #37

I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:

Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux

Anything I can try to help and debug this?
tcatm (OP)
August 03, 2010, 12:21:41 AM
 #38

I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:

Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux

Anything I can try to help and debug this?
Can you mail me a copy of cryptopp/obj/sha256.o to tcatm@gawab.com? I still fear the microcode in Intel's CPUs wasn't made for such tight loops of SSE code. Have you run the test program? How many khash/s does it crunch?
nelisky
August 03, 2010, 12:46:12 AM
 #39

datla@bah:~/src/bitcoin/bitcoin-cruncher$ ./test blocks.txt
SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 235480 ms
average speed: 592 khash/s

I'll send you the obj file now
tcatm (OP)
August 03, 2010, 01:17:53 AM
 #40

Thanks for the object!

There are two things I noticed:
1) The Intel object runs at 3269 khash/s on my AMD64 (vs. 3778 khash/s), so it's less optimized than the AMD64 code.

2) AMD64 moves less data around and does more calculations. Sometimes it even abuses floating point instructions for integers.

Could you drop my sha256.o from http://ul.to/2ckndx into cryptopp/obj/, delete the test binary (not the .cpp!!) and rebuild it using make -f makefile.unix test (take care that it doesn't recompile sha256.cpp to sha256.o)? Then run test again. It should be using the AMD64 code now. Maybe it works better...

If not, we've found that AMD64 is about four times faster than Intel at SSE2 integer vector arithmetic. Anyone working on a floating point SHA256 implementation? ;)