Bitcoin Forum
Topic: Regarding the theoretical maximum Performance of GPUs
ElectricMucus
August 20, 2011, 04:59:37 PM
 #1

I am interested in how effectively the ALUs on a card can be utilized, so I did some calculations:

Considering the SHA256 loop, it contains the following:

Operations:
1 not, 5 and, 7 xor
6 rotations by 2, 13, 22 and 6, 11, 25

makes 19

32 Bit words. (register access)
5 A, 2 B, 5 E, 1 F, 1 G

makes 14

4 additions, 2 LUT accesses
8 Memory accesses, 2 extra additions

makes 16
---------
49 total
run 64 times
------
3136 cycles


HD 5970 with 3200 ALUs at 725 MHz:

3200 * 725 / 3136 = 739.795918 Mhash/s
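
The arithmetic above can be replayed as a short Python sketch. It only reproduces the counts given in this post; the variable names are illustrative, and it assumes (as the post implicitly does) that every counted item costs one ALU cycle and that 725 is the core clock in MHz.

```python
# Back-of-the-envelope estimate, using only the counts from the post.
# Assumption: each counted item costs exactly one ALU cycle.
logic_ops = 1 + 5 + 7 + 6             # not, and, xor, rotations -> 19
register_accesses = 5 + 2 + 5 + 1 + 1  # A, B, E, F, G            -> 14
arith_and_memory = 4 + 2 + 8 + 2       # adds, LUT, memory, extra adds -> 16

ops_per_round = logic_ops + register_accesses + arith_and_memory  # 49
rounds = 64
cycles_per_hash = ops_per_round * rounds                          # 3136

alus = 3200
clock_mhz = 725  # assumed to be the core clock in MHz


def estimate_mhash(alus, clock_mhz, cycles_per_hash):
    """ALU-cycles available per microsecond divided by cycles per hash."""
    return alus * clock_mhz / cycles_per_hash


estimate = estimate_mhash(alus, clock_mhz, cycles_per_hash)
print(cycles_per_hash)      # 3136
print(round(estimate, 2))   # 739.8
```

Changing any of the per-round counts shows how sensitive the final figure is to what is counted as a cycle.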

Is this calculation correct or is there more/less done on the gpu?

Because according to this, the utilization would already be nearly optimal, which makes claims of awesome optimizations dubious (ArtForz's entry on the wiki, for ex...)

First they ignore you, then they laugh at you, then they keep laughing, then they start choking on their laughter, and then they go and catch their breath. Then they start laughing even more.
CanaryInTheMine
August 20, 2011, 06:24:27 PM
 #2

take a look at:

https://bitcointalk.org/index.php?topic=33817.0

| In Default we Trust | Need gold/silver for btc? | Buy bitcoins |
ElectricMucus
August 20, 2011, 06:46:48 PM
 #3

Thanks, that's nearly the same result. Obviously I forgot some things that are architecture-specific to the cards.

ArtForz
August 20, 2011, 06:59:55 PM
 #4

You realize a Bitcoin hash is *2* SHA256 block operations, right?
Well, not exactly 2, thanks to some possible optimizations:
you can drop the last 3 rounds completely (they don't change H) and skip part of the round before them (you only need the E output of the 4th-to-last).
The initial rounds can be optimized as well: the last DWORD of hMerkleRoot and nTime/nBits don't change between loops, so you can drop the equivalent of ~3 rounds there too.
The same goes for optimizing/precalculating parts of the W mangling, as we're feeding in quite a few constants.
Register access is basically free on GPUs (they mask register read/write latency by pipelining 4 "threads" on the shader pipeline).
Ch() can be done in 1 cycle, and Maj() in 2.
Also, what LUT accesses? Just hardcode the K constants in the instruction stream.
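
One way to read the Ch()/Maj() cycle counts is in terms of a bitwise-select primitive. The Python sketch below is only an illustration under that assumption: `bitselect` is a hypothetical helper standing in for a single hardware instruction, and the post itself does not say which instruction is meant. It checks that Ch is one select and Maj is one xor plus one select.

```python
import random

MASK = 0xFFFFFFFF  # keep everything in 32 bits


def bitselect(mask, a, b):
    """Hypothetical primitive: take bits of a where mask is 1, else bits of b."""
    return ((mask & a) | (~mask & b)) & MASK


def ch_textbook(e, f, g):
    return ((e & f) ^ (~e & g)) & MASK


def maj_textbook(a, b, c):
    return ((a & b) ^ (a & c) ^ (b & c)) & MASK


def ch_select(e, f, g):
    # One select: e chooses between f and g.
    return bitselect(e, f, g)


def maj_select(a, b, c):
    # One xor plus one select: where a and c agree the majority is a,
    # where they differ b breaks the tie.
    return bitselect(a ^ c, b, a)


def check(trials=10000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, z = (rng.getrandbits(32) for _ in range(3))
        if ch_select(x, y, z) != ch_textbook(x, y, z):
            return False
        if maj_select(x, y, z) != maj_textbook(x, y, z):
            return False
    return True


print(check())  # True
```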

So while you came up with a somewhat reasonable result, you did so by pure chance using invalid assumptions and numbers.
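
The "2 block operations" point can be sketched in Python with `hashlib`. This is a rough illustration, not miner code: the 76-byte `header76` below is dummy data, and `make_hasher` / `hash_nonce` are hypothetical names. The 80-byte header spans two 64-byte SHA256 blocks; the first block does not contain the nonce, so its state can be computed once and reused, leaving one block for the header tail and one for the second hash per nonce.

```python
import hashlib
import struct


def double_sha256(data: bytes) -> bytes:
    """Reference: SHA256 applied twice to the full 80-byte header."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def make_hasher(header76: bytes):
    """header76: the 76 header bytes that precede the 4-byte nonce."""
    # State after absorbing the first 64 bytes; independent of the nonce.
    midstate = hashlib.sha256(header76[:64])
    # Remaining 12 bytes: last DWORD of hMerkleRoot, nTime, nBits.
    tail = header76[64:]

    def hash_nonce(nonce: int) -> bytes:
        h = midstate.copy()                        # reuse the shared state
        h.update(tail + struct.pack("<I", nonce))  # second block of hash 1
        return hashlib.sha256(h.digest()).digest()  # hash 2: one block

    return hash_nonce


header76 = bytes(range(76))  # dummy data, not a real block header
hash_nonce = make_hasher(header76)

nonce = 12345
full = double_sha256(header76 + struct.pack("<I", nonce))
print(hash_nonce(nonce) == full)  # True
```

The round-dropping optimizations described in the post go further than this and need a hand-written compression function, which `hashlib` does not expose.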

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw
ElectricMucus
August 20, 2011, 07:09:54 PM
 #5

Well, thanks for the pointers, and excuse my noobish rants. I'll be back once I understand what is being said.

CanaryInTheMine
August 20, 2011, 07:35:07 PM
 #6

ArtForz, you are quite the legend on these forums... Glad to see you here!
