Bitcoin Forum

Other => CPU/GPU Bitcoin mining hardware => Topic started by: ElectricMucus on August 20, 2011, 04:59:37 PM



Title: Regarding the theoretical maximum Performance of GPUs
Post by: ElectricMucus on August 20, 2011, 04:59:37 PM
I am interested in how effectively the ALUs on a card can be utilized, and did some calculations:

Considering the SHA256 loop, there are the following things in there:

Operations:
1 not, 5 and, 7 xor
6 rotations by 2, 13, 22 and 6, 11, 25

makes 19

32 Bit words. (register access)
5 A, 2 B, 5 E, 1 F, 1 G

makes 14

4 additions, 2 LUT accesses
8 Memory accesses, 2 extra additions

makes 16
---------
49 total
run 64 times
------
3136 cycles


5970 with 3200 ALUs:

3200*725/3136 = 739.795918 Mhash/s

Is this calculation correct or is there more/less done on the gpu?

Because according to this the code utilization would be nearly optimal, which makes claims of awesome optimizations dubious (ArtForz entry on the wiki for ex...)
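
The arithmetic in this post can be reproduced with a short Python sketch. It only restates the poster's own tallies (49 operations per round, 64 rounds, one SHA256 block per hash, 3200 ALUs at 725 MHz); the variable names are made up for illustration and none of the counts are measured GPU instruction counts.

```python
# Back-of-the-envelope estimate, using the tallies from the post as-is.
# These are the poster's assumptions, not real instruction counts.

logic_ops = 1 + 5 + 7 + 6            # not, and, xor, rotations -> 19
register_reads = 5 + 2 + 5 + 1 + 1   # A, B, E, F, G            -> 14
adds_and_memory = 4 + 2 + 8 + 2      # adds, LUT, memory, extra -> 16

ops_per_round = logic_ops + register_reads + adds_and_memory  # 49
rounds = 64
cycles_per_block = ops_per_round * rounds                     # 3136

alus = 3200       # ALU count quoted for the 5970
clock_mhz = 725   # core clock in MHz

# Assumes one SHA256 block per hash and one operation per ALU per cycle.
mhash_per_s = alus * clock_mhz / cycles_per_block
print(f"{cycles_per_block} cycles per block, {mhash_per_s:.2f} Mhash/s")
```

The result depends entirely on the 49-per-round tally and on the one-block-per-hash assumption, so changing either one moves the estimate proportionally.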


Title: Re: Regarding the theoretical maximum Performance of GPUs
Post by: CanaryInTheMine on August 20, 2011, 06:24:27 PM
take a look at:

https://bitcointalk.org/index.php?topic=33817.0


Title: Re: Regarding the theoretical maximum Performance of GPUs
Post by: ElectricMucus on August 20, 2011, 06:46:48 PM
Thanks, that's nearly the same result. Obviously I forgot some architecture-specific things about the cards.  8)


Title: Re: Regarding the theoretical maximum Performance of GPUs
Post by: ArtForz on August 20, 2011, 06:59:55 PM
You realize a bitcoin hash is *2* SHA256 block operations, right?
Well, not exactly 2, thanks to some possible optimizations:
you can drop the last 3 rounds completely (they don't change H) and lose part of the previous round (you only need the E output of the 4th-to-last).
Initial rounds can be optimized as well, since the last DWORD of hMerkleRoot and nTime/nBits don't change between loops, so you can drop the equivalent of ~3 rounds there as well.
Same thing goes for optimizing/precalculating parts of the W mangling, as we're feeding in quite a bit of constants.
Register access is basically free on GPUs (they mask reg r/w by pipelining 4 "threads" on the shader pipeline).
Ch() can be done in 1 cycle, and Maj() in 2.
Also, what LUT accesses? Just hardcode the K constants in the instruction stream.
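
One way to arrive at the "Ch() in 1, Maj() in 2" counts is to express both functions through a bitwise select. This is a sketch under the assumption that the GPU has a single-instruction select; the post does not say which instruction is meant. The helper `select_bits` and the `*_fast` names are hypothetical, and Python is used only to check the identities, not to model GPU timing.

```python
import random

MASK = 0xFFFFFFFF  # SHA256 works on 32-bit words

def select_bits(mask, if_set, if_clear):
    """Per bit: take if_set where mask is 1, otherwise if_clear."""
    return ((mask & if_set) | (~mask & if_clear)) & MASK

def ch_reference(e, f, g):
    # Textbook definition: (e AND f) XOR (NOT e AND g)
    return ((e & f) ^ (~e & g)) & MASK

def maj_reference(a, b, c):
    # Textbook definition: (a AND b) XOR (a AND c) XOR (b AND c)
    return ((a & b) ^ (a & c) ^ (b & c)) & MASK

def ch_fast(e, f, g):
    # Ch is exactly a select: one operation if the hardware has one.
    return select_bits(e, f, g)

def maj_fast(a, b, c):
    # Where a and b differ, c decides the majority; where they agree, b does.
    # That is one XOR plus one select: two operations.
    return select_bits(a ^ b, c, b)

# Spot-check the identities on random 32-bit words.
for _ in range(1000):
    x, y, z = (random.getrandbits(32) for _ in range(3))
    assert ch_fast(x, y, z) == ch_reference(x, y, z)
    assert maj_fast(x, y, z) == maj_reference(x, y, z)
```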

So while you came up with a somewhat reasonable result, you did so by pure chance using invalid assumptions and numbers.
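
The "2 block operations" figure can be checked with a small sketch. It assumes the standard 80-byte block header and SHA256's 64-byte block size with its usual padding; the function names are made up for illustration, and the all-zero header is a dummy value, not real block data.

```python
import hashlib

HEADER_BYTES = 80  # size of a Bitcoin block header
BLOCK_BYTES = 64   # SHA256 processes the message in 64-byte blocks

def sha256_block_count(message_len):
    """Number of 64-byte blocks SHA256 processes, padding included."""
    # Padding adds at least 1 marker byte plus an 8-byte length field.
    return (message_len + 1 + 8 + BLOCK_BYTES - 1) // BLOCK_BYTES

def bitcoin_hash(header):
    """Double SHA256 of a block header."""
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

first_pass = sha256_block_count(HEADER_BYTES)  # 2 blocks for 80 bytes
second_pass = sha256_block_count(32)           # 1 block for the 32-byte digest

# The nonce sits in the last 4 header bytes, so the first 64-byte block
# is the same for every nonce and can be computed once and reused.
per_nonce = (first_pass - 1) + second_pass

dummy_header = bytes(HEADER_BYTES)  # placeholder input, not a real header
print(per_nonce, bitcoin_hash(dummy_header).hex())
```

With the first block reused, each nonce costs two block operations before any of the round-dropping tricks are applied.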


Title: Re: Regarding the theoretical maximum Performance of GPUs
Post by: ElectricMucus on August 20, 2011, 07:09:54 PM
Well thanks for the pointers, excuse my noobish rants. I'll be back once I understand what is being said  :-*


Title: Re: Regarding the theoretical maximum Performance of GPUs
Post by: CanaryInTheMine on August 20, 2011, 07:35:07 PM
ArtForz,  you are quite the legend on these forums...  Glad to see you here! :)