Bitcoin Forum
Topic: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10
satoshi
August 15, 2010, 03:52:09 PM
 #1

0.3.10 has tcatm's 4-way SSE2 as an option switch.

Use the switch "-4way" to turn it on.  Without the switch you get Crypto++ ASM SHA-256.

I could only get this working with Linux.

Download:
Get 0.3.10 from http://bitcointalk.org/index.php?topic=827.0

Please report back your CPU and results!  I think it's pretty clear that Core 2 and lower are slower, i5 faster.  I don't think we've heard any i7 results yet.  We need to know about the different models of AMD or other less common CPUs.
knightmb
August 15, 2010, 05:02:16 PM
 #2

I did a quick test, will report back when I try it on more machines.

Pentium E5300 Dual-Core 2.6 GHz (2MB cache, FSB 800MHz)
Processor info: http://en.wikipedia.org/wiki/Wolfdale_%28microprocessor%29
Stock = 2261 khash/s
4-way = 1103 khash/s (64 bit)

Pentium 4 - 3.0GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 1024 khash/s (32 bit)
4-way = 658 khash/s (32 bit)

Pentium 4 - 2.8GHz (hyper-threading off) 1MB Cache, FSB 800MHz
Processor info: http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29
Stock = 917 khash/s (64 bit)
4-way = 747 khash/s (64 bit)


If I didn't know better, I would say the key is the CPU cache size. It seems all the CPUs that run slower have 2 MB or less of onboard cache, whereas the Core i5 starts with at least 3 MB of onboard CPU cache.




satoshi
August 15, 2010, 06:23:26 PM
 #3

I hope someone can test an i5 or AMD to check that I built it right.  I don't have either to test with.

I'm also curious if it performs much worse on 32-bit linux vs 64-bit.
sgtstein
August 15, 2010, 06:26:40 PM
 #4

Where is the code for this? I'm on a CentOS 5.5 box and need to build it myself. Once I do that I will report back with linux 32-bit and 1MB cache Xeon.
satoshi
August 15, 2010, 06:43:27 PM
 #5

I just uploaded a quick build so testers can check if I built it right.  (I don't have an i5 or AMD)  If it checks out, I'll put together the full package and do all the release stuff.
sgtstein
August 15, 2010, 06:46:25 PM
 #6

Okay, makes sense. I have an i7 930 I'll try and test out with too.
tcatm
August 15, 2010, 09:50:41 PM
 #7

Quote from: knightmb
> If I didn't know better, I would say the key is the CPU cache size. It seems all the CPUs that run slower have 2 MB or less of onboard cache, whereas the Core i5 starts with at least 3 MB of onboard CPU cache.

That's unlikely. The loop accesses 432 bytes of data. That should fit in most caches.
Ground Loop
August 15, 2010, 11:49:40 PM
 #8

5,911 khash with -4way
11,260 without
(Dual Xeon E5450, 64-bit, 8 threads)

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
tcatm
August 16, 2010, 12:03:18 AM
 #9

-4way: 12518 khash/s
without: 6550 khash/s

It's a little bit slower than my patch (~14000 khash/s).

edit: I ran the binary on an older AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ with the same effect we see on older Intel CPUs:
-4way: 1120 khash/s
without: 2012 khash/s
tcatm
August 16, 2010, 12:08:38 AM
 #10

Did anyone verify it to produce correct results on 32 bit hosts?
gebler
August 16, 2010, 12:32:57 AM
 #11

Running 32-bit Linux on an AMD Athlon 64 X2, I get the following results:

  normal: 2850 khash/s
  with -4way: 1708 khash/s

I haven't checked if the hashes are correct, just the speed.

18h6HV7RPU4xFwiap2RdzsqDjcgsEmh3vy
aceat64
August 16, 2010, 12:37:54 AM
 #12

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2

tcatm
August 16, 2010, 12:43:39 AM
 #13

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
HostFat
August 16, 2010, 12:47:23 AM
 #14

Quote from: tcatm
> I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
Good :D
Will this also work on Windows OS?

tcatm
August 16, 2010, 12:50:04 AM
 #15

Quote from: HostFat
> Quote from: tcatm
> > I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
> Good :D
> Will this also work on Windows OS?
Didn't try it, but CFLAGS are not OS dependent at all so I guess it'll work.
NewLibertyStandard
August 16, 2010, 01:49:01 AM
 #16

Quote from: aceat64
> I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2
You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

Treazant: A Fullever Rewarding Bitcoin - Backup Your Wallet TODAY to Double Your Money! - Dual Currency Donation Address: 1Dnvwj3hAGSwFPMnkJZvi3KnaqksRPa74p
aceat64
August 16, 2010, 02:13:28 AM
 #17

Quote from: NewLibertyStandard
> Quote from: aceat64
> > I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2
> You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

I've updated the page with your suggestions, I've also added footnotes to explain some of the fields.

jgarzik
August 16, 2010, 02:30:52 AM
 #18


My -4way results:  slower for two older boxes, faster for newer one.


("model name" comes from Linux's /proc/cpuinfo, which reports directly from CPU)

1) model name   : Intel(R) Pentium(R) D CPU 3.00GHz

total cores: 2
without -4way:    0.999 Mhash/sec
with -4way: 0.850 Mhash/sec

2) model name   : Dual Core AMD Opteron(tm) Processor 280

total cores: 4
without -4way:   4.6 Mhash/sec
with -4way:    4.0 Mhash/sec

3) model name   : Genuine Intel(R) CPU             000  @ 3.20GHz

total cores: 4
without -4way:   5.7 Mhash/sec
with -4way:    7.0 Mhash/sec


Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own.
Visit bloq.com / metronome.io
Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
satoshi
August 16, 2010, 02:57:57 AM
 #19

Quote from: tcatm
> I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
GCC 4.3.3 doesn't support -march=amdfamk10.  I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch


Quote from: NewLibertyStandard
> With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.
Hey, you may be onto something!

Hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers?  What CPU is that?
lfm
August 16, 2010, 03:10:34 AM
 #20

model name      : AMD Phenom(tm) II X4 940 Processor  at 3.0 ghz  linux 64

with -4way     "hashespersec" : 11132770

without      "hashespersec" : 5877668

gridecon
August 16, 2010, 03:15:44 AM
 #21

I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I'm suspicious. I get about 5-6khash/sec on these boxes previously and without -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?
Vasiliev
August 16, 2010, 03:17:07 AM
 #22

Quote from: satoshi
> Quote from: tcatm
> > I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
> GCC 4.3.3 doesn't support -march=amdfamk10.  I get:
> sha256.cpp:1: error: bad value (amdfamk10) for -march= switch
try -march=amdfam10

1DiaPwjDV6rLB2KrBA7gzitzSWCJmXnESy
satoshi
August 16, 2010, 03:23:04 AM
 #23

Quote from: Vasiliev
> try -march=amdfam10
That works.

That's strange...  are we sure that's the same thing?  tcatm, try amdfam10 and make sure you get the same speed measurement.
Vasiliev
August 16, 2010, 03:27:35 AM
 #24

http://www.google.com/search?q=amdfamk10

I think he misremembered it since AMD arches are K#.

lfm
August 16, 2010, 03:30:35 AM
 #25

model name      : Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz,   linux 64

no difference at about 4950 khash/s


jgarzik
August 16, 2010, 03:35:28 AM
 #26


Update for
Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU             000  @ 3.20GHz
stepping : 4

Machine has 4 cores, each with 2 hyperthreads.  /proc/cpuinfo shows 8 virtual processors.

without -4way, setgen 4:    5.7 Mhash/sec
without -4way, setgen 8:    5.0 Mhash/sec

with -4way, setgen 4:   7.0 Mhash/sec
with -4way, setgen 8:   9.3 Mhash/sec

So, the old wisdom of "hyperthreading slows things down" is now shattered, on this machine.

Ground Loop
August 16, 2010, 04:34:20 AM
 #27

No winners for 4way in my other three Intel machines either:

Intel(R) Core(TM)2 Duo CPU     E8500 @ 3.16GHz (64-bit Linux)
4way: 1565  std: 3002

Intel(R) Xeon(TM) CPU 3.00GHz (32-bit Linux)
4way: 1243  std: 2048

Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
4way: 932   std: 1733

(All running 0.3.10, -1 proclimit)
Experiments with proclimit weren't any better.


satoshi
August 16, 2010, 04:36:59 AM
 #28

Quote from: jgarzik
> Code:
> cpu family : 6
> model : 26
> model name : Genuine Intel(R) CPU             000  @ 3.20GHz
> stepping : 4
cpu family 6 model 26 stepping 4 is an Intel Core i7.
That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
33% faster with hyperthreading than without it.
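
The percentages above follow directly from jgarzik's posted measurements. A minimal sketch of that arithmetic, assuming nothing beyond the Mhash/sec figures in this thread (the helper `speedup_percent` and the constant names are made up for illustration and are not part of Bitcoin's source):

```cpp
#include <cassert>
#include <cmath>

// Percentage gain of `rate` over `baseline`; both must use the same unit.
inline double speedup_percent(double baseline, double rate) {
    assert(baseline > 0.0);
    return (rate / baseline - 1.0) * 100.0;
}

// jgarzik's Core i7 measurements from this thread, in Mhash/sec.
const double kStockSetgen4 = 5.7;   // without -4way, setgen 4
const double kFourWaySetgen4 = 7.0; // with -4way, setgen 4
const double kFourWaySetgen8 = 9.3; // with -4way, setgen 8 (hyperthreads in use)
```

Rounded to whole percent, `speedup_percent(kStockSetgen4, kFourWaySetgen4)` gives 23, `speedup_percent(kStockSetgen4, kFourWaySetgen8)` gives 63, and `speedup_percent(kFourWaySetgen4, kFourWaySetgen8)` gives 33, matching the figures in the post.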
NewLibertyStandard
August 16, 2010, 05:02:31 AM
 #29

Quote from: gridecon
> I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I'm suspicious. I get about 5-6khash/sec on these boxes previously and without -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?
o_O... good luck hashing, you're gonna need it!

Quote from: satoshi
> Quote from: NewLibertyStandard
> > With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.
> Hey, you may be onto something!
>
> Hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.
>
> tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.
>
> How much of an improvement do you get with hyperthreading?
>
> Some numbers?  What CPU is that?
Here are the results from my very poor memory on an i7 860 2.8 GHz with Ubuntu 10.04 amd64. Some of the numbers may be a bit off.

Without 4way, with HT, 4/8 virtual cores, 4.5-5 Mhash/sec
Without 4way, with HT, 8/8 virtual cores, a bit less than above, but basically the same

With 4way, with HT, 8/8 virtual cores, 6.5-8 Mhash/sec (It may be my imagination, but it seems noticeably more variable.)
With 4way, with HT, 4/8 virtual cores, 5-6 Mhash/sec

Without 4way, without HT, 4/4 physical cores, 4.5-5 Mhash/sec (But a bit slower than the first result.)
With 4way, without HT, 4/4 physical cores, 5-6 Mhash/sec

gridecon
August 16, 2010, 05:30:15 AM
 #30

Quote from: NewLibertyStandard
> Quote from: gridecon
> > I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I'm suspicious. I get about 5-6khash/sec on these boxes previously and without -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?
> o_O... good luck hashing, you're gonna need it!
I guess that should read either mhash/sec or THOUSANDS of khash/sec...but hey, what's 3 orders of magnitude among friends?

Perhaps that typographical error is why nobody has answered whether or not a nearly 100% speedup from the -4way option is at all realistic? I'm not convinced the crypto hashing is really taking place at the rate of 11000 khash/sec on my desktop box.
jgarzik
August 16, 2010, 05:33:56 AM
 #31

Quote from: satoshi
> Quote from: jgarzik
> > Code:
> > cpu family : 6
> > model : 26
> > model name : Genuine Intel(R) CPU             000  @ 3.20GHz
> > stepping : 4
> cpu family 6 model 26 stepping 4 is an Intel Core i7.
> That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
> 33% faster with hyperthreading than without it.


Does bitcoin perform any self-tests at startup, to verify that hashing is working?



NewLibertyStandard
August 16, 2010, 06:16:44 AM
 #32

More importantly, about how long should it take 10 Mhash/sec to verify difficulty 1 blocks?

After the 64-bit Linux hashing bug was fixed I generated a block or two in short order, but since that one or two blocks, I have not generated a single block. It's starting to seem a little fishy.

I'm currently testing Bitcoin on two Linux 64-bit computers. Is there anything in the code blocking early block verification?

Edit: Never mind. I used the Bitcoin Generation Calculator and divided out the difficulty. Everything is fine here, I've generated a couple blocks with 4way. About to start testing without 4way.

Another Edit: My test only verifies that hashing works. It does not verify whether I'm really getting the displayed speed.
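
For a rough feel of the waiting time asked about above: a common rule of thumb, not stated in this thread, is that a miner needs about difficulty × 2^32 hash attempts per block on average. A minimal sketch under that assumption (the function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Expected seconds to find one block, using the approximation that a block
// takes difficulty * 2^32 hash attempts on average.
inline double expected_seconds_per_block(double difficulty, double hashes_per_sec) {
    assert(difficulty > 0.0 && hashes_per_sec > 0.0);
    const double kHashesPerDifficulty = 4294967296.0; // 2^32
    return difficulty * kHashesPerDifficulty / hashes_per_sec;
}
```

Under that approximation, `expected_seconds_per_block(1.0, 10e6)` is roughly 429 seconds, about seven minutes per block on average at difficulty 1. This only estimates the expected wait; finding blocks shows that hashing works, not that the displayed rate is accurate.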

tcatm
August 16, 2010, 11:15:04 AM
 #33

@satoshi: Oops, I meant -march=amdfam10. Sorry.

@everyone confused about improvement on Phenoms: I developed the code on a Phenom (940) and verified it (at least in 64bit mode) and the improvement you see is real.

Concerning Hyperthreading: It seems to give a little performance gain, maybe from running load/store instructions in parallel with arithmetic instructions. There's only a tiny bit of plain x86 instructions for gluing the function into the ABI. They take less than ~2% of the total CPU time (measured with gprof).
teknohog
August 16, 2010, 12:31:51 PM
 #34

On a Core 2 Duo T7200, the default code gives about 1.8 Mhash/s, and 4way is slower at 1.0 Mhash/s. It has 4 MB of L2 cache, so it is probably not a question of cache size, as suggested at some point.

Unfortunately, the code (from svn) no longer compiles on ARM, as it now has SSE intrinsics hardcoded. I have removed the -msse2 and -DFOURWAYSSE2 flags from the makefile, and it still produces errors like this:

Code:
sha256.cpp:8:23: error: xmmintrin.h: No such file or directory
sha256.cpp:34: error: ‘__m128i’ does not name a type

but hopefully this is easy to fix.

satoshi
August 16, 2010, 01:38:01 PM
 #35

I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.
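
The idea behind that guard can be sketched as follows. This is an illustration, not the actual sha256.cpp: the `add4` helper and its scalar fallback are hypothetical, and it includes `emmintrin.h` (the SSE2 header) rather than the `xmmintrin.h` named in the error above. The point is only that the SSE2 header and the `__m128i` type are seen by the compiler when `FOURWAYSSE2` is defined and skipped otherwise, so a non-x86 build such as ARM never touches them.

```cpp
#include <cassert>
#include <cstdint>

#ifdef FOURWAYSSE2
#include <emmintrin.h> // SSE2 intrinsics, only pulled in when the flag is set

// Add four 32-bit lanes with a single SSE2 instruction.
inline void add4(const uint32_t* a, const uint32_t* b, uint32_t* out) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
#else
// Portable fallback so the file still compiles on targets without SSE2.
inline void add4(const uint32_t* a, const uint32_t* b, uint32_t* out) {
    assert(a && b && out);
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
}
#endif
```

Both branches give the same results, with each lane wrapping around on 32-bit overflow; only the build with `-DFOURWAYSSE2` (plus `-msse2` where the compiler needs it) uses the intrinsics.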
tommy
August 16, 2010, 03:42:55 PM
 #36

model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+

w/o -4way  "hashespersec" : 2539397

with -4way  "hashespersec" : 2108791

Linux, Debian, 32 bit.
teknohog
August 16, 2010, 04:41:31 PM
 #37

Quote from: satoshi
> I wrapped sha256.cpp in
> #ifdef FOURWAYSSE2
> #endif // FOURWAYSSE2
>
> try it now.

Thanks, works fine now.

hugolp
August 17, 2010, 06:26:27 PM
 #38

Model: Intel Atom n330 (2 cores, 4 virtual).

OS: Ubuntu 10.04 64bit

Using the -4way option I get half the speed I get without it.
denaje
August 18, 2010, 07:06:04 PM
 #39

64-bit Gentoo / Intel Core i7

W/O 4way: 4324294
With 4way: 7649415



32-bit Ubuntu VM on XP host /  Intel Core 2 Duo

W/O 4way: 1751518
With 4way: 793100
Ground Loop
August 18, 2010, 11:00:08 PM
 #40

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?

nelisky
August 18, 2010, 11:02:25 PM
 #41

Quote from: Ground Loop
> So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?


And i5, at least on my macbookpro
Ground Loop
August 18, 2010, 11:14:26 PM
 #42

Any non-Mac i5 love?
Windows i5 64-bit got slower here.
[correction -- not true.  Windows doesn't have -4way, and the Linux machines are Xeons.]

vess
August 19, 2010, 06:41:27 AM
 #43

My Core i5 laptop (Ubuntu) doubled in speed. Actually, it didn't double in speed. It stayed the same speed, but only uses half the CPU now. I can't get it to go back to full CPU usage. That said, my laptop is a lot cooler when generating blocks now. I'll post back if I see it successfully go up to 100% usage.

I'm the CEO of CoinLab (www.coinlab.com) and the Executive Director of the Bitcoin Foundation, I will identify if I'm speaking for myself or one of the organizations when I post from this account.
satoshi
August 19, 2010, 07:07:43 PM
 #44

Quote from: Ground Loop
> Any non-Mac i5 love?
> Windows i5 64-bit got slower here.
That's the first I've heard anyone say i5 was slower.  Everyone else has said 4way was faster on i5.  More so with hyperthreading enabled.

Quote from: nelisky
> And i5, at least on my macbookpro
Good, so I take it that's a confirmation that it's working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac.  I don't think makefile.osx on SVN has it yet, just the built version.
nelisky
August 19, 2010, 08:28:19 PM
 #45

Quote from: satoshi
> Quote from: nelisky
> > And i5, at least on my macbookpro
> Good, so I take it that's a confirmation that it's working on Mac as well?
>
> Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac.  I don't think makefile.osx on SVN has it yet, just the built version.

Yep, it's working all right. The numbers I had posted were from an old svn revision patched with tcatm's changes, but today I compiled trunk and, while I had to tweak the makefile once again, it works great afterwards, with numbers matching what I experienced before.

Changes I did for my system are below, and while some are cosmetic, like removing wx-config from making bitcoind, just to avoid the warnings if you don't have it installed, others are system specific, like the DEPS dir, and the fact I don't have 32bit libs which makes the link step fail if -arch i386 is there.
The bsddb changes are, I believe, a typo. Includes and Libs point to db46, but then the object list for the linker states db48. Anyway, here's the diff for what got me going:

Code:
Index: makefile.osx
===================================================================
--- makefile.osx (revision 139)
+++ makefile.osx (working copy)
@@ -6,29 +6,29 @@
 # Laszlo Hanyecz (solar@heliacal.net)
 
 CXX=llvm-g++
-DEPSDIR=/Users/macosuser/bitcoin/deps
+DEPSDIR=/opt/local
 
 INCLUDEPATHS= \
- -I"$(DEPSDIR)/include"
+ -I"$(DEPSDIR)/include"  -I"$(DEPSDIR)/include/db46"
 
 LIBPATHS= \
- -L"$(DEPSDIR)/lib"
+ -L"$(DEPSDIR)/lib"  -L"$(DEPSDIR)/lib/db46"
 
-WXLIBS=$(shell $(DEPSDIR)/bin/wx-config --libs --static)
+WXLIBS=
 
 LIBS= -dead_strip \
- $(DEPSDIR)/lib/libdb_cxx-4.8.a \
- $(DEPSDIR)/lib/libboost_system.a \
- $(DEPSDIR)/lib/libboost_filesystem.a \
- $(DEPSDIR)/lib/libboost_program_options.a \
- $(DEPSDIR)/lib/libboost_thread.a \
+ $(DEPSDIR)/lib/db46/libdb_cxx-4.6.a \
+ $(DEPSDIR)/lib/libboost_system-mt.a \
+ $(DEPSDIR)/lib/libboost_filesystem-mt.a \
+ $(DEPSDIR)/lib/libboost_program_options-mt.a \
+ $(DEPSDIR)/lib/libboost_thread-mt.a \
  $(DEPSDIR)/lib/libcrypto.a
 
-DEFS=$(shell $(DEPSDIR)/bin/wx-config --cxxflags) -D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0
+DEFS=-D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0 -DFOURWAYSSE2
 
 DEBUGFLAGS=-g -DwxDEBUG_LEVEL=0
 # ppc doesn't work because we don't support big-endian
-CFLAGS=-mmacosx-version-min=10.5 -arch i386 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
+CFLAGS=-mmacosx-version-min=10.5 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
 HEADERS=headers.h strlcpy.h serialize.h uint256.h util.h key.h bignum.h base58.h \
     script.h db.h net.h irc.h main.h rpc.h uibase.h ui.h noui.h init.h
 
@@ -42,6 +42,7 @@
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
+    obj/sha256.o \
     cryptopp/obj/cpu.o
 
 
@@ -55,7 +56,7 @@
  $(CXX) -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_ASM -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
- $(CXX) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+ $(CXX) $(shell $(DEPSDIR)/bin/wx-config --cxxflags) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(shell $(DEPSDIR)/bin/wx-config --libs --static) $(LIBS)
 
 
 obj/nogui/%.o: %.cpp $(HEADERS)
ArtForz
August 21, 2010, 04:56:31 PM
 #46

The difference between new and older CPUs is pretty easy to explain.
Older microarchitectures have 64-bit MMX/SSE execution units and split 128-bit SSE ops into two 64-bit micro-ops.
Newer architectures have 128-bit SSE units.
  • AMD K8: 2 64-bit units
  • Intel Core/Core2: 3 64-bit units
  • AMD K10: 2 128-bit units
  • Intel Nehalem: 3 128-bit units
K10 = Opterons with 4 or more cores, Phenom, Phenom II, Athlon II
Nehalem = Xeon 34xx/35xx/36xx/55xx/56xx/65xx/75xx, i3/i5/i7
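
As a back-of-the-envelope illustration of the list above, the sketch below turns the unit counts into micro-ops per 128-bit instruction and peak SIMD bits per cycle. The struct and function names are made up, and the model is deliberately crude: it assumes every unit can execute every SSE2 operation each cycle and ignores latency, decoding and memory, so it shows why 64-bit units are at a disadvantage rather than predicting real hash rates.

```cpp
#include <cassert>

// Hypothetical description of a core's SSE hardware, using the numbers listed above.
struct SimdUnits {
    int units;      // number of SSE execution units
    int width_bits; // width of each unit in bits
};

// How many micro-ops one 128-bit SSE2 instruction is split into.
inline int uops_per_128bit_op(const SimdUnits& u) {
    assert(u.width_bits > 0 && 128 % u.width_bits == 0);
    return 128 / u.width_bits;
}

// Best-case SIMD bits processed per cycle if every unit is kept busy.
inline int peak_bits_per_cycle(const SimdUnits& u) {
    return u.units * u.width_bits;
}

const SimdUnits kAmdK8 = {2, 64};
const SimdUnits kIntelCore2 = {3, 64};
const SimdUnits kAmdK10 = {2, 128};
const SimdUnits kIntelNehalem = {3, 128};
```

In this toy model K8 and Core/Core2 need two micro-ops per 128-bit instruction while K10 and Nehalem need one, which is consistent with -4way only paying off on the newer parts.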

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw
satoshi
August 22, 2010, 11:21:50 PM
 #47

Thanks for clearing that up.  I read the link someone posted about AMD making that change around 2007, but I didn't know what the story was for Intel.

There's no hope for Core/Core2 then.  They only have half the SSE2 hardware.

Strange that Intel has 3 128bit units, but AMD with 2 128bit units is the faster one.
Ground Loop
August 23, 2010, 05:45:17 AM
 #48

Intel Atom 230 @ 1.60GHz.  Linux 32-bit.
(Acer Aspire Revo)

Stock: 438 khash/sec (1 proc gives 354)
4way:  254 khash/sec

So you can take this one off the powerhouse list. :) ;D

sgtstein
August 24, 2010, 05:31:08 PM
 #49

Anybody catch the new AMD Bulldozer press release? If I understand correctly, it should be capable of processing 8 64bit hashes, per core, at the same time. Would be quite a speed boost using this same code design.

Slashdot has the article.
PC Perspective has the details.

Was also covered by AnandTech back in November, 2009.
satoshi
August 24, 2010, 10:43:56 PM
 #50

Quote from: ArtForz
>   • AMD K10: 2 128-bit units
>   • Intel Nehalem: 3 128-bit units
This probably explains why hyperthreading increases performance with -4way.  If three SSE2 units is excessive, then hyperthreading would help keep them all busy.
tcatm
August 28, 2010, 12:27:08 AM
 #51

I just reviewed the sourcecode as I had a few ideas to optimize it further and I noticed that 4way is partly broken:

from main.cpp:
Code:
                for (int j = 0; j < NPAR; j++)
                {   
                    if (thash[7][j] == 0)
                    {                       
                        for (int i = 0; i < sizeof(hash)/4; i++)
                          ((unsigned int*)&hash)[i] = thash[i][j];
                        pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                    }   
                }   

The code will only process one hash (the last with thash[7] == 0) out of 32 hashes even when there is more than one hash that might be a correct one.

Something like this should fix it but it won't be safe at higher difficulties. Also, I'm not sure whether the byte order should be reversed or not. Could someone review this?
Code:
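                unsigned int min_hash = ~0u;  // largest value (the original ~1 was 0xFFFFFFFE)
                for (int j = 0; j < NPAR; j++)
                {
                    if (thash[7][j] == 0)
                    {
                        // keep the candidate with the smallest thash[6];
                        // <= so that a candidate equal to the initial maximum is still accepted
                        if (thash[6][j] <= min_hash) {
                          min_hash = thash[6][j];
                          for (int i = 0; i < sizeof(hash)/4; i++)
                            ((unsigned int*)&hash)[i] = thash[i][j];
                          pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                        }
                    }
                }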
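
To make the difference between the two loops concrete, here is a small self-contained sketch. It is not the code from main.cpp: the `HashBatch` struct and the function names are hypothetical, NPAR is taken to be 32 as stated above, and the second function uses an explicit found index instead of a sentinel value, which is a variant of the proposal rather than the proposal itself. Like the snippet above, it compares only word 6, not the full 256-bit hash.

```cpp
#include <cassert>
#include <cstdint>

const int NPAR = 32; // hashes per batch, per the post above

// Hypothetical container: thash[word][lane], word 7 being the one tested against zero.
struct HashBatch {
    uint32_t thash[8][NPAR];
};

// A lane is a candidate when its top word is zero.
inline bool is_candidate(const HashBatch& b, int lane) {
    assert(lane >= 0 && lane < NPAR);
    return b.thash[7][lane] == 0;
}

// Behaviour of the first loop: the last candidate lane wins; -1 if there is none.
inline int pick_last_candidate(const HashBatch& b) {
    int found = -1;
    for (int j = 0; j < NPAR; j++)
        if (is_candidate(b, j)) found = j;
    return found;
}

// Variant of the proposed loop: the candidate with the smallest word 6 wins.
inline int pick_min_candidate(const HashBatch& b) {
    int found = -1;
    for (int j = 0; j < NPAR; j++) {
        if (!is_candidate(b, j)) continue;
        if (found < 0 || b.thash[6][j] < b.thash[6][found]) found = j;
    }
    return found;
}
```

With two candidate lanes in one batch, the first function reports the later lane regardless of its value, while the second reports the lane whose word 6 is smaller.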
satoshi
August 28, 2010, 02:27:15 PM
 #52

The simplification is intentional.  There will only be more than one thash[7]=0 in one out of 134,217,728 cases.  It only makes it 0.0000007% slower.
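
One plausible reading of those figures, assuming hash output behaves like uniformly random bits and a batch holds 32 hashes: a given hash has thash[7] == 0 with probability 2^-32, so once one hit is found, the chance that the same batch holds another is about 32 × 2^-32 = 2^-27. The sketch below checks that arithmetic; the names are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// Chance that one 32-bit word of a hash is zero, treating the hash as uniformly random.
const double kZeroWordChance = 1.0 / 4294967296.0; // 2^-32

// Rough chance that a batch already containing one hit contains another,
// approximated as batch_size * 2^-32.
inline double second_hit_chance(int batch_size) {
    assert(batch_size > 0);
    return batch_size * kZeroWordChance;
}
```

`second_hit_chance(32)` is exactly 1 in 134,217,728, or about 0.0000007 percent, in line with the numbers quoted above.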
Gespenster
August 29, 2010, 11:11:32 AM
 #53

@sgtstein: Intel's Sandy Bridge (to be released Q4 2010) will also support AVX 256-bit SIMD registers. That means 8 simultaneous hash calculations/thread would be possible, in principle.

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if the performance is OK. It probably won't be the most efficient in khash/Watt though.
lfm
Full Member
***
Offline Offline

Activity: 196
Merit: 100



View Profile
August 29, 2010, 04:55:41 PM
 #54

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if the performance is OK. It probably won't be the most efficient in khash/Watt though.

My Pentium D died, but it was basically just two P4s in one package and would probably do bitcoin like that. Yes, it was terribly power hungry. The 4way code doesn't do very well on P4s in general. I only get about 900 khash/s on a 3.4 GHz P4 without -4way. With -4way it's in the 600s.

Ground Loop
Member
**
Offline Offline

Activity: 112
Merit: 10


View Profile
August 31, 2010, 12:48:33 AM
 #55

Seriously?  Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
lfm
Full Member
***
Offline Offline

Activity: 196
Merit: 100



View Profile
September 01, 2010, 05:23:26 PM
 #56

Seriously?  Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.


I don't have free electricity but I am running a number of electric heaters that look like computers.

The bits produced are a by-product.
Ground Loop
Member
**
Offline Offline

Activity: 112
Merit: 10


View Profile
September 16, 2010, 07:08:19 PM
 #57

4way pays off on one of the HP blade machines..
It's a 12-core Intel(R) Xeon(R) CPU X5650  @ 2.67GHz

Running 24 threads with 4way, I get 22,569 khash/sec.

Yow.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
LZ
Staff
Legendary
*
Offline Offline

Activity: 1722
Merit: 1008


P2P Cryptocurrency


View Profile
September 16, 2010, 08:20:19 PM
 #58

Cool! This means that the processor in practice can catch up with the video card!

Ground Loop
Member
**
Offline Offline

Activity: 112
Merit: 10


View Profile
September 16, 2010, 09:24:30 PM
 #59

It's still not cost effective.
These are HP BL460c blades.. around $6k each.  That buys a lot of fresh CUDA!

It's a fun way to do "burn in", but not a smart use of resources.
12 hyperthreading Xeon cores, though..  each.

22,500 khash/sec with -4way, and only 13,400 without, so yeah, it's not subtle.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
BeeCee1
Member
**
Offline Offline

Activity: 116
Merit: 10


View Profile
January 21, 2011, 02:11:10 AM
 #60

I made a couple of changes to the sse2 source that sped it up about 5 or 6%:

I changed:
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
to
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(x0, x1),_mm_add_epi32( x2,x3))

It is just re-ordering the adds.  In the original there is a data dependency: each add depends on the result of the one before it.  The way I reordered it, two of the adds are independent of each other.  This is used a lot of times, so that little change can add up. (On an older machine it made no difference, so YMMV.)
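
A small standalone check that the two orderings give the same result, written as functions rather than macros so they can sit side by side. The names add4_chain, add4_tree and lanes_equal are made up for this sketch. Because 32-bit integer addition wraps around and is associative, the results match even when lanes overflow. Only the dependency structure differs.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Original ordering: three adds, each waiting on the previous result.
static inline __m128i add4_chain(__m128i x0, __m128i x1, __m128i x2, __m128i x3)
{
    return _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3);
}

// Reordered: (x0 + x1) and (x2 + x3) are independent of each other, so
// the CPU is free to overlap them before the final add.
static inline __m128i add4_tree(__m128i x0, __m128i x1, __m128i x2, __m128i x3)
{
    return _mm_add_epi32(_mm_add_epi32(x0, x1), _mm_add_epi32(x2, x3));
}

// True if all four 32-bit lanes of a and b are equal.
static inline bool lanes_equal(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}
```

Whether the reordered form is actually faster depends on the CPU and on what the compiler does with the surrounding code, so it is worth measuring on the target machine.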

Part of the nonce calculation is repeated on every loop iteration, even though its result never changes.  I moved
nonce = _mm_set1_epi32(In[3]);
nonce = _mm_add_epi32(nonce, offset);
out of the "for(k = 0; k<NPAR; k+=4) {" loop.

Here's a diff
153c153
<     __m128i nonce;
---
>     __m128i nonce,preNonce;
156a157
>     preNonce = _mm_add_epi32(_mm_set1_epi32(In[3]),offset);
178,180c179,182
<         nonce = _mm_set1_epi32(In[3]);
<         nonce = _mm_add_epi32(nonce, offset);
<         nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
---
>         //nonce = _mm_set1_epi32(In[3]);
>         //nonce = _mm_add_epi32(nonce, offset);
>         //nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
>         nonce = _mm_add_epi32(preNonce,_mm_set1_epi32(k));
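
The nonce change can be tried in isolation. This is a sketch, not the actual SSE2 source: NPAR = 32 is assumed, the function names and the out array are invented for the comparison, and base stands in for In[3]. Both functions should fill out with identical values. The second one computes the loop-invariant part (base plus offset) once, before the loop.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

const int NPAR = 32;  // assumed batch size

// Before: base and offset are combined again on every iteration.
void nonces_naive(uint32_t base, __m128i offset, uint32_t out[NPAR])
{
    for (int k = 0; k < NPAR; k += 4)
    {
        __m128i nonce = _mm_set1_epi32(static_cast<int>(base));
        nonce = _mm_add_epi32(nonce, offset);
        nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&out[k]), nonce);
    }
}

// After: the part that does not depend on k is computed once.
void nonces_hoisted(uint32_t base, __m128i offset, uint32_t out[NPAR])
{
    const __m128i preNonce =
        _mm_add_epi32(_mm_set1_epi32(static_cast<int>(base)), offset);
    for (int k = 0; k < NPAR; k += 4)
    {
        __m128i nonce = _mm_add_epi32(preNonce, _mm_set1_epi32(k));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&out[k]), nonce);
    }
}
```

With an offset of _mm_set_epi32(3, 2, 1, 0), which is an assumption about how offset is set up, both versions produce base, base + 1, ..., base + 31. An optimizing compiler may already hoist this on its own, so any gain from the manual change should be measured.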


I have been running this for a couple of days on the mining pool and have generated shares.
jwalck
Newbie
*
Offline Offline

Activity: 17
Merit: 0


View Profile
January 22, 2011, 11:23:16 AM
 #61

I made a couple of changes to the sse2 source that sped it up about 5 or 6%:

Nice find of data dependencies!

Get about the same increase here on my AMD Opteron 6128 server. From ~32.4M to ~34.2M with all 16 cores. Too bad it will have plenty of other things to do soon. ;)