Topic: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10
satoshi (OP)
August 15, 2010, 03:52:09 PM
Last edit: August 16, 2010, 02:52:09 AM by satoshi
 #1

0.3.10 has tcatm's 4-way SSE2 as an option switch.

Use the switch "-4way" to turn it on.  Without the switch you get Crypto++ ASM SHA-256.

I could only get this working with Linux.

Download:
Get 0.3.10 from http://bitcointalk.org/index.php?topic=827.0

Please report back your CPU and results!  I think it's pretty clear that Core 2 and lower are slower, i5 faster.  I don't think we've heard any i7 results yet.  We need to know about the different models of AMD or other less common CPUs.
knightmb
August 15, 2010, 05:02:16 PM
Last edit: August 15, 2010, 05:29:51 PM by knightmb
 #2

I did a quick test, will report back when I try it on more machines.

Pentium E5300 Dual-Core 2.6 GHz (2MB cache, FSB 800MHz)
Processor info: http://en.wikipedia.org/wiki/Wolfdale_%28microprocessor%29
Stock = 2261 khash/s
4-way = 1103 khash/s (64 bit)

Pentium 4 - 3.0GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 1024 khash/s (32 bit)
4-way = 658 khash/s (32 bit)

Pentium 4 - 2.8GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 917 khash/s (64 bit)
4-way = 747 khash/s (64 bit)
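
To make the three results above easier to compare, the ratio of the -4way rate to the stock rate can be worked out directly. This is only a small sketch; the function name `speedup` is made up for illustration and is not part of Bitcoin.

```cpp
#include <cassert>

// Ratio of the -4way hash rate to the stock hash rate (same unit for both).
// Above 1.0 means -4way is faster; below 1.0 means it is slower.
// "speedup" is a hypothetical helper name, not something from the Bitcoin source.
double speedup(double fourway_khash, double stock_khash) {
    assert(stock_khash > 0.0);  // guard against division by zero
    return fourway_khash / stock_khash;
}
```

With the figures reported above, `speedup(1103, 2261)` is roughly 0.49 for the E5300, `speedup(658, 1024)` roughly 0.64 for the 3.0GHz Pentium 4, and `speedup(747, 917)` roughly 0.81 for the 2.8GHz Pentium 4, so -4way is slower on all three machines.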


If I didn't know better, I would say the key is the CPU cache size. Seems all the CPUs that run slower have 2 MB or less onboard cache, whereas the Core i5 starts with at least 3MB of onboard CPU cache.

satoshi (OP)
August 15, 2010, 06:23:26 PM
 #3

I hope someone can test an i5 or AMD to check that I built it right.  I don't have either to test with.

I'm also curious if it performs much worse on 32-bit linux vs 64-bit.
sgtstein
August 15, 2010, 06:26:40 PM
 #4

Where is the code for this? I'm on a CentOS 5.5 box and need to build it myself. Once I do that I will report back with linux 32-bit and 1MB cache Xeon.
satoshi (OP)
August 15, 2010, 06:43:27 PM
 #5

I just uploaded a quick build so testers can check if I built it right.  (I don't have an i5 or AMD)  If it checks out, I'll put together the full package and do all the release stuff.
sgtstein
August 15, 2010, 06:46:25 PM
 #6

Okay, makes sense. I have an i7 930 I'll try and test out with too.
tcatm
August 15, 2010, 09:50:41 PM
 #7

Quote from knightmb: "If I didn't know better, I would say the key is the CPU cache size. Seems all the CPUs that run slower have 2 MB or less onboard cache, whereas the Core i5 starts with at least 3MB of onboard CPU cache."

That's unlikely. The loop accesses 432 bytes of data. That should fit in most caches.
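
A rough back-of-the-envelope check of the 432-byte figure, as a sketch only. The 64-byte cache line and the 8 KiB L1 data cache are assumptions picked for illustration (a deliberately small cache), not numbers from this thread, and the line count assumes the data is contiguous and line-aligned.

```cpp
#include <cstddef>

// Figure from the post above: bytes of data the hashing loop touches.
constexpr std::size_t kLoopBytes = 432;

// Assumptions for illustration only (not from the thread):
constexpr std::size_t kCacheLineBytes = 64;      // typical x86 cache line size
constexpr std::size_t kSmallL1Bytes = 8 * 1024;  // a deliberately small L1 data cache

// Number of cache lines needed to hold `bytes`, rounding up.
constexpr std::size_t cache_lines_needed(std::size_t bytes, std::size_t line) {
    return (bytes + line - 1) / line;
}

static_assert(cache_lines_needed(kLoopBytes, kCacheLineBytes) == 7,
              "432 contiguous, aligned bytes occupy 7 lines of 64 bytes");
static_assert(kLoopBytes < kSmallL1Bytes,
              "the working set is far smaller than the assumed L1 cache");
```

Under those assumptions the working set is a handful of cache lines, which is why a 2 MB versus 3 MB difference in the larger caches is unlikely to explain the results.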
Ground Loop
August 15, 2010, 11:49:40 PM
 #8

5,911 khash with -4way
11,260 without
(Dual Xeon E5450, 64-bit, 8 threads)

tcatm
August 16, 2010, 12:03:18 AM
Last edit: August 16, 2010, 12:25:59 AM by tcatm
 #9

-4way: 12518 khash/s
without: 6550 khash/s

It's a little bit slower than my patch (~14000 khash/s).

edit: I ran the binary on an older AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ with the same effect we see on older intel cpus:
-4way: 1120khash/s
without: 2012khash/s
tcatm
August 16, 2010, 12:08:38 AM
 #10

Did anyone verify it to produce correct results on 32 bit hosts?
gebler
August 16, 2010, 12:32:57 AM
 #11

Running 32-bit Linux on an AMD Athlon 64 X2, I get the following results:

  normal: 2850 khash/s
  with -4way: 1708 khash/s

I haven't checked if the hashes are correct, just the speed.
aceat64
August 16, 2010, 12:37:54 AM
 #12

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2
tcatm
August 16, 2010, 12:43:39 AM
 #13

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
HostFat
August 16, 2010, 12:47:23 AM
 #14

Quote from tcatm: "I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%."
Good :D
Will this also work on Windows OS?

tcatm
August 16, 2010, 12:50:04 AM
 #15

Quote from tcatm: "I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%."
Quote from HostFat: "Good :D Will this also work on Windows OS?"
Didn't try it, but CFLAGS are not OS dependent at all so I guess it'll work.
NewLibertyStandard
August 16, 2010, 01:49:01 AM
 #16

Quote from aceat64: "I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2"
You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

aceat64
August 16, 2010, 02:13:28 AM
 #17

Quote from aceat64: "I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2"
Quote from NewLibertyStandard: "You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way."

I've updated the page with your suggestions, I've also added footnotes to explain some of the fields.
jgarzik
August 16, 2010, 02:30:52 AM
 #18


My -4way results:  slower for two older boxes, faster for newer one.


("model name" comes from Linux's /proc/cpuinfo, which reports directly from CPU)

1) model name   : Intel(R) Pentium(R) D CPU 3.00GHz

total cores: 2
without -4way:    0.999 Mhash/sec
with -4way: 0.850 Mhash/sec

2) model name   : Dual Core AMD Opteron(tm) Processor 280

total cores: 4
without -4way:   4.6 Mhash/sec
with -4way:    4.0 Mhash/sec

3) model name   : Genuine Intel(R) CPU             000  @ 3.20GHz

total cores: 4
without -4way:   5.7 Mhash/sec
with -4way:    7.0 Mhash/sec
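
For anyone who wants to pull the same "model name" string programmatically when reporting results, here is a minimal sketch. It assumes the usual Linux /proc/cpuinfo layout of "key : value" lines; the function name `first_model_name` is hypothetical.

```cpp
#include <sstream>
#include <string>

// Returns the value of the first "model name" line in cpuinfo-style text,
// or an empty string if there is no such line.
// "first_model_name" is a hypothetical helper name used for illustration.
std::string first_model_name(const std::string& cpuinfo_text) {
    const std::string key = "model name";
    std::istringstream in(cpuinfo_text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, key.size(), key) != 0) continue;  // not a "model name" line
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;             // malformed line, skip it
        const std::string::size_type start = line.find_first_not_of(" \t", colon + 1);
        return start == std::string::npos ? std::string() : line.substr(start);
    }
    return std::string();
}
```

To use it on a real machine, read the contents of /proc/cpuinfo into a string first and pass that in. Only the first CPU entry is reported, which is normally enough because all cores on these boxes show the same model name.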


satoshi (OP)
August 16, 2010, 02:57:57 AM
 #19

Quote from tcatm: "I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%."
GCC 4.3.3 doesn't support -march=amdfamk10.  I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch


Quote from NewLibertyStandard: "With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way."
Hey, you may be onto something!

hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers?  What CPU is that?
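
For readers unfamiliar with the term, "4-way" here means running one SHA-256 step on four independent 32-bit words at once in a 128-bit SSE2 register. The sketch below is not tcatm's code; it only illustrates the idea on a single SHA-256 function (the "big sigma 0" rotation mix), with a plain scalar fallback when SSE2 is not available at compile time. The function names are made up for this example.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Scalar rotate-right of a 32-bit word (valid for 0 < n < 32).
static inline std::uint32_t rotr32(std::uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

// SHA-256 "big sigma 0" on one word: ROTR2 ^ ROTR13 ^ ROTR22.
static inline std::uint32_t big_sigma0(std::uint32_t x) {
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

// The same function applied to four independent words at once.
// Hypothetical name; an illustration of the technique, not the Bitcoin source.
void big_sigma0_x4(const std::uint32_t in[4], std::uint32_t out[4]) {
#if defined(__SSE2__)
    const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // SSE2 has no rotate instruction, so each rotate is two shifts plus an OR.
    const __m128i r2  = _mm_or_si128(_mm_srli_epi32(x, 2),  _mm_slli_epi32(x, 30));
    const __m128i r13 = _mm_or_si128(_mm_srli_epi32(x, 13), _mm_slli_epi32(x, 19));
    const __m128i r22 = _mm_or_si128(_mm_srli_epi32(x, 22), _mm_slli_epi32(x, 10));
    const __m128i s = _mm_xor_si128(_mm_xor_si128(r2, r13), r22);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), s);
#else
    for (int i = 0; i < 4; ++i) out[i] = big_sigma0(in[i]);  // portable fallback
#endif
}
```

Each rotate costs three SSE2 instructions instead of one scalar rotate, but the result covers four hashes, so whether it wins depends on how fast a given CPU executes 128-bit SSE2 operations. That may be part of why the results in this thread differ so much between CPU generations.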
lfm
August 16, 2010, 03:10:34 AM
 #20

model name      : AMD Phenom(tm) II X4 940 Processor  at 3.0 ghz  linux 64

with -4way     "hashespersec" : 11132770

without      "hashespersec" : 5877668
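
Since results in this thread are posted in three different units (khash/s, Mhash/sec and raw "hashespersec"), a tiny conversion sketch may help when lining them up. The helper names are made up for illustration.

```cpp
// Convert a raw hashes-per-second figure into the other units used in this thread.
// "to_khash" and "to_mhash" are hypothetical helper names.
double to_khash(double hashes_per_sec) { return hashes_per_sec / 1e3; }
double to_mhash(double hashes_per_sec) { return hashes_per_sec / 1e6; }
```

For the Phenom II figures above, 11132770 hashes/s is about 11133 khash/s (11.1 Mhash/s) with -4way, against 5877668 hashes/s, about 5878 khash/s (5.9 Mhash/s), without it.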
