Bitcoin Forum
Topic: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10
gridecon
August 16, 2010, 03:15:44 AM
 #21

I have two quad-core Phenom II 64-bit Linux machines (both Ubuntu 9.10), and the -4way option increases my hashing speed so much that I'm suspicious. Previously, and without the -4way option, I got about 5-6khash/sec on these boxes. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder whether my boxes are really hashing that much faster, or whether the math operations could somehow be getting skipped, causing illusory speed and an inability to actually generate blocks.
Vasiliev
August 16, 2010, 03:17:07 AM
 #22

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
GCC 4.3.3 doesn't support -march=amdfamk10.  I get:
sha256.cpp:1: error: bad value (amdfamk10) for -march= switch
try -march=amdfam10
satoshi (OP)
August 16, 2010, 03:23:04 AM
 #23

Quote from: Vasiliev
try -march=amdfam10

That works.

That's strange...  are we sure that's the same thing?  tcatm, try amdfam10 and make sure you get the same speed measurement.
Vasiliev
August 16, 2010, 03:27:35 AM
 #24

http://www.google.com/search?q=amdfamk10

I think he misremembered it since AMD arches are K#.
lfm
August 16, 2010, 03:30:35 AM
 #25

model name      : Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz,   linux 64

No difference with -4way: about 4950 khash/s either way.


jgarzik
August 16, 2010, 03:35:28 AM
 #26


Update for
Code:
cpu family	: 6
model : 26
model name : Genuine Intel(R) CPU             000  @ 3.20GHz
stepping : 4

Machine has 4 cores, each with 2 hyperthreads.  /proc/cpuinfo shows 8 virtual processors.

without -4way, setgen 4:    5.7 Mhash/sec
without -4way, setgen 8:    5.0 Mhash/sec

with -4way, setgen 4:   7.0 Mhash/sec
with -4way, setgen 8:   9.3 Mhash/sec

So, the old wisdom of "hyperthreading slows things down" is now shattered, on this machine.

Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own.
Visit bloq.com / metronome.io
Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
Ground Loop
August 16, 2010, 04:34:20 AM
 #27

No winners for 4way in my other three Intel machines either:

Intel(R) Core(TM)2 Duo CPU     E8500 @ 3.16GHz (64-bit Linux)
4way: 1565  std: 3002

Intel(R) Xeon(TM) CPU 3.00GHz (32-bit Linux)
4way: 1243  std: 2048

Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
4way: 932   std: 1733

(All running 0.3.10, -1 proclimit)
Experiments with proclimit weren't any better.


Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
satoshi (OP)
August 16, 2010, 04:36:59 AM
 #28

Quote from: jgarzik
Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU             000  @ 3.20GHz
stepping : 4

cpu family 6 model 26 stepping 4 is an Intel Core i7.
That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
33% faster with hyperthreading than without it.
NewLibertyStandard
August 16, 2010, 05:02:31 AM
 #29

Quote from: gridecon
I have two quad-core Phenom II 64-bit Linux machines (both Ubuntu 9.10), and the -4way option increases my hashing speed so much that I'm suspicious. Previously, and without the -4way option, I got about 5-6khash/sec on these boxes. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder whether my boxes are really hashing that much faster, or whether the math operations could somehow be getting skipped, causing illusory speed and an inability to actually generate blocks.

o_O... good luck hashing, you're gonna need it!

With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.
Hey, you may be onto something!

hyperthreading didn't help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm's SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers?  What CPU is that?
Here are the results from my very poor memory on an i7 860 2.8 GHz with Ubuntu 10.04 amd64. Some of the numbers may be a bit off.

Without 4way, with HT, 4/8 virtual cores, 4.5-5 Mhash/sec
Without 4way, with HT, 8/8 virtual cores, a bit less than above, but basically the same

With 4way, with HT, 8/8 virtual cores, 6.5-8 Mhash/sec (It may be my imagination, but it seems noticeably more variable.)
With 4way, with HT, 4/8 virtual cores, 5-6 Mhash/sec

Without 4way, without HT, 4/4 physical cores, 4.5-5 Mhash/sec (But a bit slower than the first result.)
With 4way, without HT, 4/4 physical cores, 5-6 Mhash/sec

Treazant: A Fullever Rewarding Bitcoin - Backup Your Wallet TODAY to Double Your Money! - Dual Currency Donation Address: 1Dnvwj3hAGSwFPMnkJZvi3KnaqksRPa74p
gridecon
August 16, 2010, 05:30:15 AM
 #30

Quote from: NewLibertyStandard
Quote from: gridecon
I have two quad-core Phenom II 64-bit Linux machines (both Ubuntu 9.10), and the -4way option increases my hashing speed so much that I'm suspicious. Previously, and without the -4way option, I got about 5-6khash/sec on these boxes. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder whether my boxes are really hashing that much faster, or whether the math operations could somehow be getting skipped, causing illusory speed and an inability to actually generate blocks.
o_O... good luck hashing, you're gonna need it!

I guess that should read either Mhash/sec or THOUSANDS of khash/sec... but hey, what's 3 orders of magnitude among friends?

Perhaps that typographical error is why nobody has answered whether a nearly 100% speedup from the -4way option is at all realistic? I'm not convinced the crypto hashing is really taking place at the rate of 11000 khash/sec on my desktop box.
jgarzik
August 16, 2010, 05:33:56 AM
 #31

Quote from: satoshi
Quote from: jgarzik
Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU             000  @ 3.20GHz
stepping : 4
cpu family 6 model 26 stepping 4 is an Intel Core i7.
That's a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading.
33% faster with hyperthreading than without it.


Does bitcoin perform any self-tests at startup, to verify that hashing is working?
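
The thread does not say whether 0.3.10 runs such a check. As a rough sketch of what a startup self-test could look like, the idea is to feed the same inputs to a trusted scalar routine and to the 4-way routine and compare the results lane by lane. Everything named here (`toy_hash`, `toy_hash_4way`, `selftest_4way`) is hypothetical, and the toy mixer merely stands in for SHA-256 so the example stays short:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-in for a scalar reference hash. NOT SHA-256: a toy mixer used only
// so the sketch is self-contained.
std::uint32_t toy_hash(std::uint32_t x) {
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

// Stand-in for an optimized routine that handles four inputs per call.
std::array<std::uint32_t, 4> toy_hash_4way(const std::array<std::uint32_t, 4>& in) {
    std::array<std::uint32_t, 4> out{};
    for (std::size_t i = 0; i < 4; ++i) out[i] = toy_hash(in[i]);
    return out;
}

// Returns true if the 4-way routine agrees with the reference on every lane
// for `rounds` batches of consecutive nonces starting at `first_nonce`.
template <typename Scalar, typename FourWay>
bool selftest_4way(Scalar scalar, FourWay four_way,
                   std::uint32_t first_nonce, int rounds) {
    std::uint32_t nonce = first_nonce;
    for (int r = 0; r < rounds; ++r) {
        const std::array<std::uint32_t, 4> in{nonce, nonce + 1, nonce + 2, nonce + 3};
        const auto out = four_way(in);
        for (std::size_t lane = 0; lane < 4; ++lane) {
            if (out[lane] != scalar(in[lane])) return false;  // lane mismatch
        }
        nonce += 4;
    }
    return true;
}
```

A check like this only shows that the fast path computes the same values as the slow one. It says nothing about whether the displayed hash rate is accurate.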



NewLibertyStandard
August 16, 2010, 06:16:44 AM
Last edit: August 16, 2010, 06:40:13 AM by NewLibertyStandard
 #32

More importantly, about how long should it take 10 Mhash/sec to verify difficulty 1 blocks?

After the 64-bit Linux hashing bug was fixed I generated a block or two in short order, but since that one or two blocks, I have not generated a single block. It's starting to seem a little fishy.

I'm currently testing Bitcoin on two Linux 64-bit computers. Is there anything in the code blocking early block verification?

Edit: Never mind. I used the Bitcoin Generation Calculator and divided out the difficulty. Everything is fine here, I've generated a couple blocks with 4way. About to start testing without 4way.

Another Edit: My test only verifies that hashing works. It does not verify whether I'm really getting the displayed speed.

tcatm
August 16, 2010, 11:15:04 AM
 #33

@satoshi: Oops, I meant -march=amdfam10. Sorry.

@everyone confused about improvement on Phenoms: I developed the code on a Phenom (940) and verified it (at least in 64bit mode) and the improvement you see is real.

Concerning hyperthreading: it seems to give a small performance gain, maybe from running load/store instructions in parallel with arithmetic instructions. There are only a few plain x86 instructions, for gluing the function into the ABI. They take less than about 2% of the total CPU time (measured with gprof).
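
To illustrate what "4-way" means here, the sketch below applies one of the SHA-256 round functions to four independent 32-bit values at once, one per lane of a 128-bit SSE2 register. This is not tcatm's code. The names `rotr4`, `big_sigma0_4way` and `big_sigma0` are made up for the example, and it assumes an x86 or x86-64 compiler with SSE2 enabled (for 32-bit GCC builds that typically means passing -msse2):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; x86/x86-64 only
#include <cstdint>

// Rotate each of the four 32-bit lanes right by N bits. SSE2 has no packed
// rotate instruction, so it is built from two shifts and an OR.
template <int N>
__m128i rotr4(__m128i x) {
    return _mm_or_si128(_mm_srli_epi32(x, N), _mm_slli_epi32(x, 32 - N));
}

// SHA-256 "big sigma 0" (ROTR 2 ^ ROTR 13 ^ ROTR 22) on four values at once.
__m128i big_sigma0_4way(__m128i x) {
    return _mm_xor_si128(_mm_xor_si128(rotr4<2>(x), rotr4<13>(x)), rotr4<22>(x));
}

// Scalar version of the same function, for comparison.
std::uint32_t big_sigma0(std::uint32_t x) {
    auto rotr = [](std::uint32_t v, int n) { return (v >> n) | (v << (32 - n)); };
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22);
}
```

Nearly every operation in this style of code is a packed SSE2 instruction, which fits the observation that plain x86 instructions account for only a small share of the run time.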
teknohog
August 16, 2010, 12:31:51 PM
 #34

On a Core 2 Duo T7200, the default code gives about 1.8 Mhash/s, and 4way is slower at 1.0 Mhash/s. It has 4 MB of L2 cache, so it is probably not a question of cache size, as suggested at some point.

Unfortunately, the code (from svn) no longer compiles on ARM, as it now has SSE intrinsics hardcoded. I have removed the -msse2 and -DFOURWAYSSE2 flags from the makefile, and it still produces errors like this:

Code:
sha256.cpp:8:23: error: xmmintrin.h: No such file or directory
sha256.cpp:34: error: ‘__m128i’ does not name a type

but hopefully this is easy to fix.

world famous math art | masternodes are bad, mmmkay?
Every sha(sha(sha(sha()))), every ho-o-o-old, still shines
satoshi (OP)
August 16, 2010, 01:38:01 PM
 #35

I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.
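
The guard works because the preprocessor drops everything between `#ifdef` and `#endif` when the macro is not defined, so the SSE headers and the `__m128i` type are never seen on ARM. The sketch below shows the same pattern on a tiny function. `add4` is a made-up example, and unlike the fix described above (which wraps the whole file) it also supplies a portable fallback:

```cpp
#include <cstdint>

#ifdef FOURWAYSSE2
#include <emmintrin.h>  // only pulled in when the build asks for SSE2

// Adds four pairs of 32-bit values with one packed SSE2 addition.
void add4(const std::uint32_t* a, const std::uint32_t* b, std::uint32_t* out) {
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
#else
// Portable fallback: no SSE2 headers or types are mentioned, so this
// compiles on ARM and other non-x86 targets.
void add4(const std::uint32_t* a, const std::uint32_t* b, std::uint32_t* out) {
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
}
#endif  // FOURWAYSSE2
```

Building with -DFOURWAYSSE2 selects the SSE2 version. Building without it selects the loop, and both give the same results.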
tommy
August 16, 2010, 03:42:55 PM
Last edit: August 16, 2010, 04:08:35 PM by tommy
 #36

model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 5600+

w/o -4way  "hashespersec" : 2539397

with -4way  "hashespersec" : 2108791

Linux, Debian, 32 bit.
teknohog
August 16, 2010, 04:41:31 PM
 #37

Quote from: satoshi
I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.

Thanks, works fine now.

hugolp
August 17, 2010, 06:26:27 PM
 #38

Model: Intel Atom n330 (2 cores, 4 virtual).

OS: Ubuntu 10.04 64bit

With the -4way option I get half the speed I get without it.


denaje
August 18, 2010, 07:06:04 PM
 #39

64-bit Gentoo / Intel Core i7

W/O 4way: 4324294
With 4way: 7649415



32-bit Ubuntu VM on XP host /  Intel Core 2 Duo

W/O 4way: 1751518
With 4way: 793100
Ground Loop
August 18, 2010, 11:00:08 PM
 #40

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?
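
One way to check that summary against the thread is to tabulate the with/without ratio for every report that gave both numbers. The sketch below does that. The struct and function names are invented for illustration, and the figures are copied from the posts above in whatever unit each poster used:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One reported pair of hash rates, in whatever unit the poster used.
struct Report {
    std::string cpu;
    double without_4way;
    double with_4way;
};

// A ratio above 1.0 means -4way was faster on that machine.
double ratio(const Report& r) { return r.with_4way / r.without_4way; }

// Figures copied from the thread; only posts that gave both numbers are listed.
std::vector<Report> thread_reports() {
    return {
        {"Core i7, setgen 4 (jgarzik)", 5.7, 7.0},
        {"Core i7, setgen 8 (jgarzik)", 5.0, 9.3},
        {"Core i7 (denaje)", 4324294.0, 7649415.0},
        {"Core 2 Duo E8500", 3002.0, 1565.0},
        {"Xeon 3.00GHz", 2048.0, 1243.0},
        {"Core 2 6300", 1733.0, 932.0},
        {"Core 2 Duo T7200", 1.8, 1.0},
        {"Core 2 Duo, 32-bit VM (denaje)", 1751518.0, 793100.0},
        {"Athlon 64 X2 5600+", 2539397.0, 2108791.0},
    };
}

// Prints one line per report with its ratio and a faster/slower verdict.
void print_ratios() {
    for (const Report& r : thread_reports()) {
        std::printf("%-32s %.2fx %s\n", r.cpu.c_str(), ratio(r),
                    ratio(r) > 1.0 ? "faster" : "slower");
    }
}
```

On these figures only the Core i7 entries come out above 1.0. The Q9450 ("no difference"), Atom N330 ("half the speed") and Phenom II (roughly double) reports are left out because they did not give an exact pair of numbers.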
