Regarding the theoretical maximum Performance of GPUs

ElectricMucus (OP)

Legendary

Offline

Activity: 1666
Merit: 1057

Marketing manager - GO MP

Regarding the theoretical maximum Performance of GPUs

August 20, 2011, 04:59:37 PM

#1

I am interested in how effectively the ALUs on a card can be utilized, so I did some calculations:

Looking at the SHA256 round loop, each iteration contains the following:

Operations:
1 not, 5 and, 7 xor
6 rotations by 2, 13, 22 and 6, 11, 25

makes 19

32-bit words (register accesses):
5 A, 2 B, 5 E, 1 F, 1 G

makes 14

4 additions, 2 LUT accesses
8 Memory accesses, 2 extra additions

makes 16
---------
49 total
run 64 times
------
3136 cycles

For a 5970 with 3200 ALUs at 725 MHz:

3200*725/3136 = 739.795918 Mhash/s

Is this calculation correct or is there more/less done on the gpu?

Because according to this, the utilization achieved by current code would already be nearly optimal, which makes claims of awesome optimizations dubious (ArtForz's entry on the wiki, for example).
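
For anyone who wants to re-run the numbers, here is a minimal Python sketch of the tally above. It only reproduces the counts and the 3200 ALUs at 725 MHz given in the post, and it keeps the post's own assumptions: one counted item per ALU per clock, and a single 64-round SHA256 block per hash. The variable names are made up for illustration.

```python
# Per-round tally exactly as counted in the post (the poster's assumptions,
# not a verified instruction count for any real GPU).
logic_ops = 1 + 5 + 7 + 6            # not, and, xor, rotations
register_reads = 5 + 2 + 5 + 1 + 1   # A, B, E, F, G
arith_and_memory = 4 + 2 + 8 + 2     # additions, LUT accesses, memory accesses, extra additions

ops_per_round = logic_ops + register_reads + arith_and_memory
ops_per_block = ops_per_round * 64   # 64 rounds in one SHA256 block

ALUS = 3200        # figure from the post
CLOCK_MHZ = 725    # figure from the post

# Assumes every ALU retires one counted item per clock.
mhash_per_s = ALUS * CLOCK_MHZ / ops_per_block
print(ops_per_round, ops_per_block, round(mhash_per_s, 6))
```

Changing any of the three per-round sums shows how sensitive the final figure is to what gets counted as a "cycle".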

CanaryInTheMine

Donator
Legendary

Offline

Activity: 2352
Merit: 1060

between a rock and a block!

Re: Regarding the theoretical maximum Performance of GPUs

August 20, 2011, 06:24:27 PM

#2

take a look at:

https://bitcointalk.org/index.php?topic=33817.0

ElectricMucus (OP)

Re: Regarding the theoretical maximum Performance of GPUs

August 20, 2011, 06:46:48 PM

#3

Quote from: CanaryInTheMine on August 20, 2011, 06:24:27 PM

take a look at:

https://bitcointalk.org/index.php?topic=33817.0

Thanks, that's nearly the same result. Obviously I forgot some things specific to the cards' architecture.

ArtForz

Sr. Member

Offline

Activity: 406
Merit: 257

Re: Regarding the theoretical maximum Performance of GPUs

August 20, 2011, 06:59:55 PM

#4

You realize a bitcoin hash is *2* SHA-256 block operations, right?
Well, not exactly 2, thanks to some possible optimizations:
you can drop the last 3 rounds completely (they don't change H), and lose part of the previous round (you only need the E output of the 4th-to-last).
Initial rounds can be optimized as well, as the last DWORD of hMerkleRoot and nTime/nBits don't change between loops, so you can drop the equivalent of ~3 rounds there as well.
Same thing goes for optimizing/precalculating parts of the W mangling, as we're feeding in quite a bit of constants.
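
The point about the final rounds can be checked with a small Python sketch. It assumes that the only value of interest is the last state word (H) of the block, which is how the "they don't change H" remark reads; that reading is an assumption, as are all the helper names here. The round constants are derived from the cube roots of the first 64 primes rather than typed in, and the result is checked against `hashlib` so the round function is known to be real SHA-256.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF

def _primes(n):
    """First n primes by trial division (fine for n = 64)."""
    out, c = [], 2
    while len(out) < n:
        if all(c % p for p in out):
            out.append(c)
        c += 1
    return out

def _iroot(x, k):
    """Integer floor of the k-th root of x, by binary search."""
    lo, hi = 0, 1 << (x.bit_length() // k + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** k <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = _primes(64)
K = [_iroot(p << 96, 3) & MASK for p in PRIMES]       # frac(cbrt(p)) * 2^32
H0 = [_iroot(p << 64, 2) & MASK for p in PRIMES[:8]]  # frac(sqrt(p)) * 2^32

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    """One SHA-256 block. Returns (new_state, value of e after each round)."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    e_trace = []
    for i in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[i] + w[i]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        # Registers shift down by one; only a and e receive new values.
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
        e_trace.append(e)
    new_state = [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]
    return new_state, e_trace

# Single padded block for a short test message.
msg = b"abc"
block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
state, e_trace = compress(H0, block)

# Sanity check: this really is SHA-256.
assert struct.pack(">8I", *state) == hashlib.sha256(msg).digest()

# The last state word is fixed once e of the 4th-to-last round is known,
# so the final 3 rounds add nothing if only that word is tested.
assert state[7] == (H0[7] + e_trace[-4]) & MASK
```

The reason is just the register shift: e becomes f, then g, then h over the next three rounds without being modified, so nothing computed in those rounds can reach the H word.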
Register access is basically free on GPUs (they mask reg r/w by pipelining 4 "threads" on the shader pipeline).
Ch() can be done in 1 cycle, and Maj() in 2.
Also, what LUT accesses? Just hardcode the K constants in the instruction stream.

So while you came up with a somewhat reasonable result, you did so by pure chance using invalid assumptions and numbers.
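
The post does not say which instructions give "Ch() in 1 cycle, and Maj() in 2". One plausible reading, assuming the hardware has a single-cycle bit-select (OpenCL exposes this as `bitselect`, and as far as I know AMD miners of that time used the BFI_INT instruction for it), is sketched below in Python. The `*_fast` names are hypothetical, and the sketch only checks that the identities are bit-for-bit equal to the textbook definitions; it says nothing about real cycle counts.

```python
import random

MASK = 0xFFFFFFFF

def bitselect(a, b, c):
    """Per bit: take b where c is 1, else a (same convention as OpenCL's bitselect)."""
    return ((a & ~c) | (b & c)) & MASK

def ch_ref(e, f, g):
    """Textbook Ch: 2 and, 1 not, 1 xor."""
    return ((e & f) ^ (~e & g)) & MASK

def maj_ref(a, b, c):
    """Textbook Maj: 3 and, 2 xor."""
    return (a & b) ^ (a & c) ^ (b & c)

def ch_fast(e, f, g):
    # One select: where e is 1 take f, otherwise g.
    return bitselect(g, f, e)

def maj_fast(a, b, c):
    # One xor plus one select: where a and c differ, b decides the majority;
    # where they agree, a (equal to c) already is the majority.
    return bitselect(a, b, a ^ c)

rng = random.Random(0)
for _ in range(10000):
    x, y, z = (rng.getrandbits(32) for _ in range(3))
    assert ch_fast(x, y, z) == ch_ref(x, y, z)
    assert maj_fast(x, y, z) == maj_ref(x, y, z)
```

Under that assumption the 5 and, 1 not, and 3 of the 7 xor counted in the first post collapse into 3 operations per round.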


ElectricMucus (OP)

Re: Regarding the theoretical maximum Performance of GPUs

August 20, 2011, 07:09:54 PM

#5

Well, thanks for the pointers, and excuse my noobish rants. I'll be back once I understand what is being said.

CanaryInTheMine

Re: Regarding the theoretical maximum Performance of GPUs

August 20, 2011, 07:35:07 PM

#6

Quote from: ArtForz on August 20, 2011, 06:59:55 PM

You realize a bitcoin hash is *2* SHA-256 block operations, right?
Well, not exactly 2, thanks to some possible optimizations:
you can drop the last 3 rounds completely (they don't change H), and lose part of the previous round (you only need the E output of the 4th-to-last).
Initial rounds can be optimized as well, as the last DWORD of hMerkleRoot and nTime/nBits don't change between loops, so you can drop the equivalent of ~3 rounds there as well.
Same thing goes for optimizing/precalculating parts of the W mangling, as we're feeding in quite a bit of constants.
Register access is basically free on GPUs (they mask reg r/w by pipelining 4 "threads" on the shader pipeline).
Ch() can be done in 1 cycle, and Maj() in 2.
Also, what LUT accesses? Just hardcode the K constants in the instruction stream.

So while you came up with a somewhat reasonable result, you did so by pure chance using invalid assumptions and numbers.

ArtForz, you are quite the legend on these forums... Glad to see you here!