satoshi (OP)
Founder
Sr. Member
Offline
Activity: 364
Merit: 7248
|
|
August 15, 2010, 03:52:09 PM Last edit: August 16, 2010, 02:52:09 AM by satoshi |
|
0.3.10 has tcatm's 4-way SSE2 as an option switch. Use the switch "-4way" to turn it on. Without the switch you get Crypto++ ASM SHA-256. I could only get this working with Linux.
Download: Get 0.3.10 from http://bitcointalk.org/index.php?topic=827.0
Please report back your CPU and results! I think it's pretty clear that Core 2 and lower are slower, i5 faster. I don't think we've heard any i7 results yet. We need to know about the different models of AMD or other less common CPUs.
|
|
|
|
knightmb
|
|
August 15, 2010, 05:02:16 PM Last edit: August 15, 2010, 05:29:51 PM by knightmb |
|
I did a quick test, will report back when I try it on more machines.
Pentium E5300 Dual-Core 2.6 GHz (2MB cache, FSB 800MHz)
Processor info: http://en.wikipedia.org/wiki/Wolfdale_%28microprocessor%29
Stock = 2261 khash/s
4-way = 1103 khash/s (64 bit)
Pentium 4 - 3.0GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 1024 khash/s (32 bit)
4-way = 658 khash/s (32 bit)
Pentium 4 - 2.8GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 917 khash/s (64 bit)
4-way = 747 khash/s (64 bit)
If I didn't know better, I would say the key is the CPU cache size. Seems all the CPUs that run slower have 2 MB or less onboard cache, whereas the Core i5 starts with at least 3MB of onboard CPU cache.
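The figures in this post can be reduced to one ratio per machine, which makes the slowdown easier to compare. A minimal Python sketch, using only the numbers reported here; the names `results` and `speedup` are illustrative and not part of Bitcoin:

```python
# khash/s figures copied from this post: (stock, 4-way)
results = {
    "Pentium E5300 2.6 GHz": (2261, 1103),
    "Pentium 4 3.0 GHz": (1024, 658),
    "Pentium 4 2.8 GHz": (917, 747),
}


def speedup(stock, four_way):
    """Return 4-way throughput relative to stock; below 1.0 means -4way is slower."""
    return four_way / stock


ratios = {cpu: speedup(stock, four_way) for cpu, (stock, four_way) in results.items()}

for cpu, ratio in ratios.items():
    verdict = "faster" if ratio > 1.0 else "slower"
    print(f"{cpu}: -4way runs at {ratio:.2f}x stock ({verdict})")
```

On all three machines the ratio comes out below 1.0, so -4way is slower on each of them.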
|
|
|
|
satoshi (OP)
|
|
August 15, 2010, 06:23:26 PM |
|
I hope someone can test an i5 or AMD to check that I built it right. I don't have either to test with.
I'm also curious if it performs much worse on 32-bit linux vs 64-bit.
|
|
|
|
sgtstein
Member
Offline
Activity: 61
Merit: 10
|
|
August 15, 2010, 06:26:40 PM |
|
Where is the code for this? I'm on a CentOS 5.5 box and need to build it myself. Once I do that I will report back with results from 32-bit Linux on a Xeon with 1MB cache.
|
|
|
|
satoshi (OP)
|
|
August 15, 2010, 06:43:27 PM |
|
I just uploaded a quick build so testers can check if I built it right. (I don't have an i5 or AMD) If it checks out, I'll put together the full package and do all the release stuff.
|
|
|
|
sgtstein
|
|
August 15, 2010, 06:46:25 PM |
|
Okay, makes sense. I have an i7 930 that I'll try to test with too.
|
|
|
|
tcatm
|
|
August 15, 2010, 09:50:41 PM |
|
Quote from: knightmb
If I didn't know better, I would say the key is the CPU cache size. Seems all the CPUs that run slower have 2 MB or less onboard cache, whereas the Core i5 starts with at least 3MB of onboard CPU cache.

That's unlikely. The loop accesses 432 bytes of data. That should fit in most caches.
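To put the 432-byte figure in proportion, here is a rough Python comparison against the cache sizes mentioned in this thread. The 32 KiB L1 data cache entry is an assumption about typical CPUs of this era, not something reported here, and `fraction_used` is just an illustrative helper:

```python
WORKING_SET_BYTES = 432  # data touched by the hashing loop, per the post above

# Cache sizes in bytes. The 1/2/3 MB figures come from the thread;
# the L1 size is an assumed typical value, not a measured one.
caches = {
    "assumed L1 data cache (32 KiB)": 32 * 1024,
    "Pentium 4 (1 MB)": 1 * 1024 * 1024,
    "Pentium E5300 (2 MB)": 2 * 1024 * 1024,
    "Core i5 (3 MB)": 3 * 1024 * 1024,
}


def fraction_used(working_set, cache_size):
    """Return the share of the cache occupied by the working set."""
    return working_set / cache_size


for name, size in caches.items():
    print(f"{name}: {fraction_used(WORKING_SET_BYTES, size):.4%} of the cache")
```

Even against the assumed L1 size the working set is under 2% of the cache, which is why cache capacity is an unlikely explanation for the slowdown.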
|
|
|
|
Ground Loop
Member
Offline
Activity: 111
Merit: 10
|
|
August 15, 2010, 11:49:40 PM |
|
Dual Xeon E5450, 64-bit, 8 threads:
with -4way: 5,911 khash/s
without: 11,260 khash/s
|
|
|
|
tcatm
|
|
August 16, 2010, 12:03:18 AM Last edit: August 16, 2010, 12:25:59 AM by tcatm |
|
-4way: 12518 khash/s
without: 6550 khash/s
It's a little bit slower than my patch (~14000 khash/s).
Edit: I ran the binary on an older AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ with the same effect we see on older Intel CPUs:
-4way: 1120 khash/s
without: 2012 khash/s
|
|
|
|
tcatm
|
|
August 16, 2010, 12:08:38 AM |
|
Did anyone verify that it produces correct results on 32-bit hosts?
|
|
|
|
gebler
Newbie
Offline
Activity: 16
Merit: 0
|
|
August 16, 2010, 12:32:57 AM |
|
Running 32-bit Linux on an AMD Athlon 64 X2, I get the following results:
normal: 2850 khash/s
with -4way: 1708 khash/s
I haven't checked if the hashes are correct, just the speed.
|
|
|
|
|
tcatm
|
|
August 16, 2010, 12:43:39 AM |
|
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
|
|
|
|
HostFat
Staff
Legendary
Offline
Activity: 4270
Merit: 1209
I support freedom of choice
|
|
August 16, 2010, 12:47:23 AM |
|
Quote from: tcatm
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.

Good. Will this also work on Windows OS?
|
|
|
|
tcatm
|
|
August 16, 2010, 12:50:04 AM |
|
Quote from: tcatm
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
Quote from: HostFat
Good. Will this also work on Windows OS?

Didn't try it, but CFLAGS are not OS-dependent at all, so I guess it'll work.
|
|
|
|
NewLibertyStandard
|
|
August 16, 2010, 01:49:01 AM |
|
You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.
|
|
|
|
aceat64
|
|
August 16, 2010, 02:13:28 AM |
|
Quote from: NewLibertyStandard
You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

I've updated the page with your suggestions and added footnotes to explain some of the fields.
|
|
|
|
jgarzik
Legendary
Offline
Activity: 1596
Merit: 1100
|
|
August 16, 2010, 02:30:52 AM |
|
My -4way results: slower for two older boxes, faster for newer one.
("model name" comes from Linux's /proc/cpuinfo, which reports directly from CPU)
1) model name : Intel(R) Pentium(R) D CPU 3.00GHz
total cores: 2
without -4way: 0.999 Mhash/sec
with -4way: 0.850 Mhash/sec
2) model name : Dual Core AMD Opteron(tm) Processor 280
total cores: 4
without -4way: 4.6 Mhash/sec
with -4way: 4.0 Mhash/sec
3) model name : Genuine Intel(R) CPU 000 @ 3.20GHz
total cores: 4
without -4way: 5.7 Mhash/sec
with -4way: 7.0 Mhash/sec
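For anyone who wants to report the same "model name" string, a small Python sketch of pulling it out of /proc/cpuinfo text. It assumes the usual x86 Linux layout of one "model name" line per logical CPU; the function name `model_names` and the sample text are illustrative only:

```python
def model_names(cpuinfo_text):
    """Return the 'model name' value for each logical CPU listed in the text."""
    names = []
    for line in cpuinfo_text.splitlines():
        # Split at the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            names.append(value.strip())
    return names


# Hypothetical two-CPU excerpt in the /proc/cpuinfo format.
sample = (
    "processor\t: 0\n"
    "model name\t: Intel(R) Pentium(R) D CPU 3.00GHz\n"
    "processor\t: 1\n"
    "model name\t: Intel(R) Pentium(R) D CPU 3.00GHz\n"
)

names = model_names(sample)
print(names[0], "| logical CPUs:", len(names))
```

On a Linux machine the same function can be fed the real file, for example `model_names(open("/proc/cpuinfo").read())`. Note that the count is of logical CPUs, so it includes hyper-threads.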
|
Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own. Visit bloq.com / metronome.io Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
|
|
|
satoshi (OP)
|
|
August 16, 2010, 02:57:57 AM |
|
Quote from: tcatm
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.

GCC 4.3.3 doesn't support -march=amdfamk10. I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch

Quote from: NewLibertyStandard
With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

Hey, you may be onto something! Hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share. tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.
How much of an improvement do you get with hyperthreading? Some numbers? What CPU is that?
|
|
|
|
lfm
|
|
August 16, 2010, 03:10:34 AM |
|
model name : AMD Phenom(tm) II X4 940 Processor at 3.0 GHz, 64-bit Linux
with -4way "hashespersec" : 11132770
without "hashespersec" : 5877668
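These "hashespersec" values are in hashes per second, while most other reports in the thread use khash/s. A short Python sketch of the conversion and the resulting ratio, using only the two numbers above; `to_khash` is an illustrative helper name:

```python
with_4way = 11132770     # "hashespersec" reported with -4way
without_4way = 5877668   # "hashespersec" reported without


def to_khash(hashes_per_sec):
    """Convert hashes per second to khash/s."""
    return hashes_per_sec / 1000


ratio = with_4way / without_4way
print(
    f"-4way: {to_khash(with_4way):.0f} khash/s, "
    f"without: {to_khash(without_4way):.0f} khash/s, "
    f"ratio: {ratio:.2f}x"
)
```

The ratio works out to roughly 1.9x in favor of -4way on this Phenom II.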
|
|
|
|
|