wereHamster
Newbie
Offline
Activity: 3
Merit: 0
|
|
August 01, 2010, 10:16:48 AM |
|
Take care with __attribute__ ((aligned (16))): it doesn't work with local variables, because gcc doesn't align the stack.
Maybe gcc doesn't align the stack, but it can (and automatically does) align variables on the stack.
|
|
|
|
Mionione
Newbie
Offline
Activity: 10
Merit: 1
|
|
August 01, 2010, 12:53:40 PM |
|
|
|
|
|
Ground Loop
Member
Offline
Activity: 111
Merit: 10
|
|
August 02, 2010, 12:22:43 AM |
|
No joy against SVN tip here.
patching file sha256.cpp
patching file main.cpp
Hunk #1 FAILED at 2555.
Hunk #2 FAILED at 2703.
2 out of 2 hunks FAILED -- saving rejects to file main.cpp.rej
patching file makefile.unix
Hunk #1 FAILED at 41.
Hunk #2 FAILED at 52.
Hunk #3 FAILED at 64.
3 out of 3 hunks FAILED -- saving rejects to file makefile.unix.rej
patching file test.cpp
Trying manually now.
|
Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
|
|
|
Ground Loop
Member
Offline
Activity: 111
Merit: 10
|
|
August 02, 2010, 06:57:20 AM |
|
I got the patch knitted in, and I think I did it correctly; it wasn't complicated. Regrettably, the hash rate has dropped by almost half: I'm down from 2071 khash/sec (stock build, SVN tip) to 1150 khash/sec with the patch.
It's an Intel Xeon 3 GHz running Linux, with these proc flags:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
Has anyone seen gains? Did I botch it? Missing CPU capabilities? Wrong compiler options?
|
|
|
|
impossible7
Newbie
Offline
Activity: 18
Merit: 0
|
|
August 02, 2010, 08:13:11 AM |
|
I have been able to apply the patch against SVN (r121) and I tested it on 2 machines:
- on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
- on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)
The strange thing is that although I have been running it on 6 Opterons (i.e. 6x4 = 24 cores) for 40 hours at an average rate of 51,000 khash/s, I still haven't generated any blocks. The probability of this (no blocks in 40 hours at 51,000 khash/s and difficulty = 244.2) is 0.09%, or 1/1098. Are you sure this thing works correctly and that the reported rate is correct? How do I run the included test program?
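For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. It assumes block discovery can be modelled as a Poisson process and that difficulty 1 corresponds to roughly 2^32 hashes per block (an approximation). The function names are made up for illustration and are not part of the patch or the client.

```cpp
#include <cassert>
#include <cmath>

// Expected number of hashes per block at a given difficulty.
// Assumption: difficulty 1 is approximately 2^32 hashes per block.
double expected_hashes_per_block(double difficulty) {
    assert(difficulty > 0.0);
    return difficulty * 4294967296.0;  // 2^32
}

// Probability of finding zero blocks after hashing for `hours` at a steady
// rate, treating block discovery as a Poisson process: P(0) = exp(-lambda),
// where lambda is the expected number of blocks in that time.
double prob_no_blocks(double khash_per_sec, double hours, double difficulty) {
    const double hashes_tried = khash_per_sec * 1000.0 * hours * 3600.0;
    const double expected_blocks =
        hashes_tried / expected_hashes_per_block(difficulty);
    return std::exp(-expected_blocks);
}
```

Calling prob_no_blocks(51000.0, 40.0, 244.2) gives about 0.0009, i.e. roughly 0.09%, with about 7 blocks expected over that period under these assumptions.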
|
|
|
|
knightmb
|
|
August 02, 2010, 08:47:04 AM |
|
Is it an AMD-only optimization perhaps?
|
Timekoin - The World's Most Energy Efficient Encrypted Digital Currency
|
|
|
impossible7
Newbie
Offline
Activity: 18
Merit: 0
|
|
August 02, 2010, 09:00:55 AM |
|
Is it an AMD-only optimization perhaps?
Or a 64-bit only optimization.
|
|
|
|
Ground Loop
Member
Offline
Activity: 111
Merit: 10
|
|
August 02, 2010, 09:17:07 AM |
|
With the patch above, I was unable to build the test program. You?
|
|
|
|
petree
Newbie
Offline
Activity: 4
Merit: 0
|
|
August 02, 2010, 09:22:29 AM |
|
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
Is there a way we can confirm that the variables are being aligned properly? I'm wondering if the Intel procs are less tolerant of misalignment than the AMDs.
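One way to check this at run time is to look at the address itself. The sketch below is only an illustration: is_aligned and local_buffer_is_aligned are hypothetical names, not functions from the patch, and the second one assumes a GCC-compatible compiler because it uses the __attribute__ ((aligned (16))) syntax discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if the address p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical probe: declare a local buffer with GCC's aligned attribute
// and report whether it really landed on a 16-byte boundary on this
// compiler and platform.
bool local_buffer_is_aligned() {
    unsigned char buf[64] __attribute__((aligned(16)));
    buf[0] = 0;  // touch the buffer so it is not optimised away
    return is_aligned(buf, 16);
}
```

Logging the result of local_buffer_is_aligned() from the patched build on each machine would show whether the attribute is honoured for stack variables there. The result can differ between compiler versions, flags, and 32-bit versus 64-bit targets, so it is worth checking on the exact build that shows the slowdown.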
|
|
|
|
impossible7
Newbie
Offline
Activity: 18
Merit: 0
|
|
August 02, 2010, 09:31:44 AM |
|
With the patch above, I was unable to build the test program. You?
Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
|
|
|
|
nelisky
Legendary
Offline
Activity: 1540
Merit: 1002
|
|
August 02, 2010, 01:05:48 PM |
|
Is it an AMD-only optimization perhaps?
Or a 64-bit only optimization.
I'm trying it on a Q6600 running 64-bit Linux (Ubuntu server) and it makes things slower there, so it's not 64-bit only. I'm also running it on my Mac laptop, which sports an Intel i5 (also 64-bit, OS X 10.6), with a great speed improvement there, so it's not AMD only.
|
|
|
|
petree
Newbie
Offline
Activity: 4
Merit: 0
|
|
August 02, 2010, 04:12:33 PM |
|
With the patch above, I was unable to build the test program. You?
Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise "make test" would fail. Under x86_64 I had no such issue.
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
As I said above, I did notice an improvement in performance too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
Yes, since applying this patch I've generated 2 blocks.
|
|
|
|
satoshi
Founder
Sr. Member
Offline
Activity: 364
Merit: 7118
|
|
August 02, 2010, 07:02:46 PM Last edit: August 02, 2010, 07:14:55 PM by satoshi |
|
Is it 2x fast on AMD and 1/2 fast on Intel?
Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compile time?
Tried that, but it doesn't work for things on the stack. I ran some tests. It doesn't even cause an error, it just doesn't align it.
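For readers wondering what aligning by hand looks like, here is a rough sketch of a helper in the spirit of alignup<16>. It is an assumption about the general technique (round a pointer up to the next multiple of N inside an over-sized buffer), not the code from the patch; align_up and demo_scratch_is_aligned are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round p up to the next address that is a multiple
// of N bytes. N must be a power of two so the bit mask below is valid.
template <std::size_t N, typename T>
T* align_up(T* p) {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t aligned =
        (addr + (N - 1)) & ~static_cast<std::uintptr_t>(N - 1);
    assert(aligned % N == 0 && aligned - addr < N);
    return reinterpret_cast<T*>(aligned);
}

// Hypothetical use: over-allocate a stack buffer by N bytes, then work
// only inside the part that starts at the aligned address.
bool demo_scratch_is_aligned() {
    unsigned char raw[64 + 16];              // 64 usable bytes + 16 slack
    unsigned char* buf = align_up<16>(raw);  // first 16-byte boundary in raw
    buf[0] = 0;
    return reinterpret_cast<std::uintptr_t>(buf) % 16 == 0;
}
```

The cost of this approach is up to N - 1 wasted bytes per buffer, but it does not depend on the compiler aligning the stack frame.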
|
|
|
|
jgarzik
Legendary
Offline
Activity: 1596
Merit: 1099
|
|
August 02, 2010, 07:15:23 PM |
|
FWIW, there exists -mstackrealign and -mpreferred-stack-boundary=NUM
|
Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own. Visit bloq.com / metronome.io Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
|
|
|
impossible7
Newbie
Offline
Activity: 18
Merit: 0
|
|
August 02, 2010, 08:49:25 PM |
|
After 52 hours of trying with no blocks generated, I give up and I am switching back to the vanilla bitcoin.
The probability of getting no blocks within 52 hours at 51,000 khash/s is 0.011%. So I conclude that the patch doesn't work and I am 99.989% confident about that. I hope that tcatm provides some explanation on how to use the supplied test program.
|
|
|
|
tcatm (OP)
|
|
August 02, 2010, 09:07:56 PM |
|
To use the test program, download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg
The program will try to find the correct nonce in each block and detect whether the hash function works correctly. It'll also benchmark the algorithm.
From what I've heard, the patch does not work on 32-bit systems. I don't know why. I developed it on an AMD64 machine and it works fine there. If it's slower on Intel, try disabling Hyperthreading. The big loop in the SSE2 code doesn't contain any "normal" x86 instructions except for one jump at the end.
Btw, there's a git repo at http://github.com/tcatm/bitcoin-cruncher/
|
|
|
|
nelisky
Legendary
Offline
Activity: 1540
Merit: 1002
|
|
August 02, 2010, 11:52:27 PM |
|
I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:
Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux
Anything I can try to help and debug this?
|
|
|
|
tcatm (OP)
|
|
August 03, 2010, 12:21:41 AM |
|
I've tried the git branch and results stay the same, almost half of what the vanilla svn can pump out. I'm running Intel and not AMD, but I am on 64bit:
Linux bah 2.6.32-22-server #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010 x86_64 GNU/Linux
Anything I can try to help and debug this?
Can you mail me a copy of cryptopp/obj/sha256.o to tcatm@gawab.com? I still fear Intel's microcode in their CPUs wasn't made for such tight loops of SSE code. Have you run the test program? How many khash/s does it crunch?
|
|
|
|
nelisky
Legendary
Offline
Activity: 1540
Merit: 1002
|
|
August 03, 2010, 12:46:12 AM |
|
datla@bah:~/src/bitcoin/bitcoin-cruncher$ ./test blocks.txt
SHA256 test started
70293
found solutions = 70293
total hashes = 139463136
total time = 235480 ms
average speed: 592 khash/s
I'll send you the obj file now
|
|
|
|
tcatm (OP)
|
|
August 03, 2010, 01:17:53 AM |
|
Thanks for the object! There are two things I noticed:
1) The Intel object runs at 3269 khash/s on my AMD64 (vs. 3778 khash/s), so it's less optimized than the AMD64 code.
2) The AMD64 code moves less data around and does more calculations. Sometimes it even abuses floating point instructions for integers.
Could you drop my sha256.o from http://ul.to/2ckndx into cryptopp/obj/, delete test (not the .cpp!!) and recompile test using make -f makefile.unix test (take care it doesn't recompile sha256.cpp to sha256.o)? Then run test again. It should be using the AMD64 code now. Maybe it works better... If not, we've found that AMD64 is about four times faster than Intel at SSE2 integer vector arithmetic.
Anyone working on a floating point SHA256 implementation?
|
|
|
|
|