I am interested in how effectively the ALUs on a card can be utilized, so I did some calculations.

Looking at the SHA256 loop, each round contains the following:

Operations:

1 not, 5 and, 7 xor

6 rotations by 2, 13, 22 and 6, 11, 25

makes 19
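
For reference, these counts match the four bitwise functions of a SHA256 round as given in the standard. Here is a minimal Python sketch of them; the function names are my own and not taken from any miner kernel, and the per-function counts in the comments are just my tally:

```python
MASK = 0xFFFFFFFF  # keep every value inside a 32-bit word


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def big_sigma0(a):
    # 3 rotations, 2 xor
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)


def big_sigma1(e):
    # 3 rotations, 2 xor
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)


def ch(e, f, g):
    # 1 not, 2 and, 1 xor
    return ((e & f) ^ (~e & g)) & MASK


def maj(a, b, c):
    # 3 and, 2 xor
    return (a & b) ^ (a & c) ^ (b & c)
```

Adding up the comments gives 6 rotations, 1 not, 5 and, 7 xor, i.e. the 19 above.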

32-bit word reads (register accesses):

5 A, 2 B, 5 E, 1 F, 1 G

makes 14

4 additions, 2 LUT accesses

8 Memory accesses, 2 extra additions

makes 16

---------

49 total

run 64 times

------

3136 cycles

On a 5970 with 3200 ALUs at 725 MHz:

3200 * 725 / 3136 = 739.795918 Mhash/s
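
The same estimate as a small Python sketch, so the numbers are easy to change. It bakes in the assumptions of the calculation above: every counted item costs exactly one ALU cycle, one hash is 64 rounds, and every ALU is busy on every clock. The function and variable names are only for illustration.

```python
def estimate_mhash(alus, clock_mhz, ops_per_round, rounds=64):
    """Upper-bound hash rate in Mhash/s, assuming one ALU cycle per counted op."""
    cycles_per_hash = ops_per_round * rounds
    return alus * clock_mhz / cycles_per_hash


logic_ops = 1 + 5 + 7 + 6           # not, and, xor, rotations -> 19
register_reads = 5 + 2 + 5 + 1 + 1  # A, B, E, F, G            -> 14
arith_and_mem = 4 + 2 + 8 + 2       # adds, LUT, memory, extra -> 16

ops_per_round = logic_ops + register_reads + arith_and_mem  # 49
cycles_per_hash = ops_per_round * 64                        # 3136

rate = estimate_mhash(alus=3200, clock_mhz=725, ops_per_round=ops_per_round)
print(f"{ops_per_round} ops/round, {cycles_per_hash} cycles/hash, {rate:.6f} Mhash/s")
```

If any of the per-round counts turn out to be off, changing them here shows right away how much the estimate moves.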

Is this calculation correct, or is there more/less work done on the GPU?

I ask because, according to this, the existing code would already be using the ALUs nearly optimally, which makes claims of awesome optimizations look dubious (ArtForz's entry on the wiki, for example).