further improved phatk_dia kernel for Phoenix + SDK 2.6

If it works, please post here and consider a small donation @ 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x :).

Important (2012-01-13): The FASTLOOP=False parameter is not needed anymore, because FASTLOOP defaults to false in this version. Update: FASTLOOP=True works now, I uploaded a fixed version!

Important: since OpenCL SDK / Runtime version 2.6 AMD updated their OpenCL compiler, so that some older kernels and optimizations in them seem to not work anymore or are not needed anymore. In order to reflect this change I had to edit the kernel performance section of this thread.

Important: since version 2011-08-27 you don't need to supply the BFI_INT switch anymore. If your HW supports it, it's enabled automatically. To disable it use BFI_INT=false.

Important: since version 2011-08-04 (pre-release) you have to use the switch VECTORS2 instead of VECTORS. I made this change to be clear what vectors are used in the kernel (2- or 4-component). To use 4-component vectors use switch VECTORS4.

Important: since version 2011-07-17 a modified version of __init__.py (for the Phoenix miner) is included in this package and has to be used! The kernel won't work for other Miners without modifications to them, see kernel.cl for further infos.

This is the preferred switch for Phoenix with phatk_dia in order to achieve comparable performance:

Code:

-k phatk AGGRESSION=12 VECTORS2 WORKSIZE=128

Download version 2012-01-13: http://www.mediafire.com/?xzk6b1yvb24r4dg
Download version 2011-12-21: http://www.mediafire.com/?r3n2m5s2y2b32d9
Download version 2011-08-27: http://www.mediafire.com/?697r8t2pdk419ji
Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j
Download version 2011-08-04 (pre-release): http://www.mediafire.com/?upwwud7kfyx7788
Download version 2011-07-17: http://www.mediafire.com/?4zxdd5557243has
Download version 2011-07-11: http://www.mediafire.com/?k404b6lqn8vu6z6
Download version 2011-07-07: http://www.mediafire.com/?o7jfp60s7xefrg4
Download version 2011-07-06: http://www.mediafire.com/?f8b8q3w5u5p0ln0
Download version 2011-07-03: http://www.mediafire.com/?xlkcc08jvp5a43v
Download version 2011-07-01: http://www.mediafire.com/?5jmt7t0e83k3eox

Kernel performance (BFI_INT / VECTORS2 / WORKSIZE=128 / SDK 2.6 / APP KernelAnalyzer 1.11 - Cal 11.12 profile):
HD5870
2011-08-20: 22 GPR / 1427 ALU OPs / 66 CF OPs
2011-08-27: 22 GPR / 1426 ALU OPs / 66 CF OPs
2011-12-21: 20 GPR / 1400 ALU OPs / 66 CF OPs
2012-01-13: 21 GPR / 1394 ALU OPs / 67 CF OPs

HD6970
2011-08-20: 21 GPR / 1687 ALU OPs / 66 CF OPs
2011-08-27: 23 GPR / 1688 ALU OPs / 68 CF OPs
2011-12-21: 21 GPR / 1687 ALU OPs / 66 CF OPs
2012-01-13: 20 GPR / 1687 ALU OPs / 66 CF OPs

Kernel performance (BFI_INT / VECTORS2 / SDK 2.5 / APP KernelAnalyzer 1.9 - Cal 11.7 profile):
HD5870
original phatk 1.X: 1393 ALU OPs
2011-07-01: 1389 ALU OPs
2011-07-03: 1385 ALU OPs
2011-07-06: 1380 ALU OPs
2011-07-07: 1380 ALU OPs
2011-07-11: 1378 ALU OPs
2011-07-17: 1376 ALU OPs
2011-08-04 (pre-release): 1368 ALU OPs
2011-08-11: 1364 ALU OPs
2011-08-27: 1363 ALU OPs (30 less compared to original phatk 1.X)
HD6970
original phatk 1.X: 1707 ALU OPs
2011-07-01: 1710 ALU OPs
2011-07-03: 1706 ALU OPs
2011-07-06: 1702 ALU OPs
2011-07-07: 1702 ALU OPs
2011-07-11: 1701 ALU OPs
2011-07-17: 1699 ALU OPs
2011-08-04 (pre-release): 1689 ALU OPs
2011-08-11: 1687 ALU OPs
2011-08-27: 1687 ALU OPs (20 less compared to original phatk 1.X)

changelog:

2012-01-13
Kernel:
- modified: Disclaimer is now the same as in original Phoenix package
- removed: all (u) typecasts in front of scalars, where vectors and scalars were used together because per OpenCL definition this is not needed
- removed: all () brackets around n in the #define parts of the kernel
- removed: S0(), which is now again merged into s0()
- removed: brackets around the commands in t1W(), t1(), t2() and W() were removed, to allow the compiler to reorder these
- added: W() function missed an ; at it's end
- added: init variable B1addK6 used in round 6 to save an add -> THX to DiabloD3
- added: a (uint) typecast in front of get_local_id() and get_group_id() calls, because return value could be 64 bits long, which is not wanted
- modified: replaced all ma() + s0() or s0() + ma() calls with t2()
- modified: round 6 now uses the new new B1addK6 variable
- modified: reordered W[] calculation for rounds 32, 91 and 92
- modified: rounds 121, 122 and 123 to not compute Vals[4], Vals[5] and Vals[6], because they are not needed for final computation of Vals[7] -> THX to jhajduk
- modified: removed + H[7] from round 124 and use -0xec9fcd13 to check for valid nonces
- added: result_r124 variable to take the result of the last round 124, this saves a few ALU OPs on VLIW5 GPUs
Python Init:
- modified: replaced spaces with tabs in the source code formatting (I really dislike this part in Python ^^)
- modified: a few comments and commands were reformatted for better readability or to be better understandable
- modified: FASTLOOP parameter now defaults to False, which means you don't need to supply FASTLOOP=False anymore
- removed: OUTPUT_SIZE is not used anymore so all references to it were removed
- modified: changed REVISION to 122
- modified: moved the WORKSIZE checks below the part where the check, if and which vectors should be used is performed
   this takes into account, that the global worksize passed to the kernel is influenced by vector usage and vectorsize
   (currently the use of FASTLOOP can break this, because of the "dynamic" number of iterations)
- added: some debug info about worksize and pyOpenCL is displayed at the start
- added: B1 + K[6] is passed as new kernel parameter
- modified: made enqueue_read_buffer() / enqueue_write_buffer() blocking and removed finish() after the read, as per AMDs recommendations
   to minimize API overhead

2011-08-27:
Kernel:
- added: code path for 3-component Vectors, activated via VECTORS3 (currently not usable, because of a bug in the AMD drivers up to Cat 11.8)
- removed: BITALIGN option from the kernel, BFI_INT is now used automatically, if the HW supports it (disabled via BFI_INT=false)
- modified: non BFI_INT Ch() function, which was broken in 2011-08-11 -> THX to Vince
- modified: kernel output buffer is now an ulong array and not an uint array
- removed: OUTPUT_SIZE argument is not passed and used in the kernel anymore
- modified: WORKSIZEx4, WORKSIZEx3 and WORKSIZEx2 arguments were merged into WORKSIZExVECSIZE
- modified: removed, reordered and added some brackets and type-casting stuff in the kernel
- modified: restored command order for round 108 - 123 to free a GPR
- modified: added H[7] into round 124 calculation
- modified: changed the checking for positive nonces again to cover the H[7] change
- modified: writing of nonces to output now uses 1 write for Vec2 and max. 2 writes for Vec4, because 2x uints are now encoded into 1x ulong
Python Init:
- added: code for 3-component Vectors, activated via VECTORS3 (currently not usable, because of a bug in the AMD drivers up to Cat 11.8)
- removed: BITALIGN option from the Python init, BFI_INT is now used automatically, if the HW supports it (disabled via BFI_INT=false)
- added: detection of maximum supported WORKSIZE per Device, which is used if no WORKSIZE is supplied, if supplied WORKSIZE > max. supported WORKSIZE
    or if WORKSIZE is not a power of 2
- added: code to decode the ulong from the output buffer into 2x uint and process the results
- modified: comments, code formating and line breaks for better readability
- modified: output buffer size is now the WORKSIZE -> THX to Phaetus

2011-08-11:
- modified: reverted a former change to the Ma() function to save an ALU OP for 69XX cards
- added: S0() and S1() function, which is a compiler help -> THX Phateus
- modified: a few brackets and layout of all helper functions for better readability and compatibility
- added: t2() function, which is (s0(n) + ma(n)) and saves a few GPRs -> THX Phateus and myself (had this in earlier, but removed it sometime ^^)
- modified: changed layout of kernel definition for better readability
- modified: all values which for example had a 10u now have a 10U (uppercase) to be consistent in the whole kernel
- modified: modified round 94 W calculation for better performance
- modified: round 108 - 123 now consists of 2 W() blocks followed by 2 sharoundW() blocks to save a GPR
- modified: changed the checking for positive nonces again to never create an invalid share and lower ALU OP usage

2011-08-04 (pre-release):
- added: user Vince into disclaimer -> THX Vince :)
- added: kernel is now able to work with 4-component vectors (switch VECTORS4) -> THX to Phateus
- modified: to use 2-component vectors I renamed the switch VECTORS to VECTORS2
- added: __attribute__((reqd_work_group_size(WORKSIZE, 1, 1))) -> THX to Phateus
- added: constants PreW31 and PreW32, which store P2() + P4() for round 31 and 32 -> THX to Phateus
- renamed - modified: W17_2 is now PreW19, W2 is now PreW18, PreVal4addT1 is now PreVal4 (= PreVal4 + T1), state0subT1 is now PreVal0 (= Preval4 + state0)
- modified: base is now declared as u to save the addidion of uint2(0, 1) or uint4(0, 1, 2, 3) for W_3 init -> THX to Phateus
- modified: nonce calculation now uses the local Work-Item ID, the group ID and the WORKSIZE instead of only the global Work-Item ID -> THX to Phateus
- added: saved a multiplication by passing WORKSIZEx2 and WORKSIZEx4 constants to the kernel
- modified: calculation for W[18 - O] was optimized so that P2(18) is only calculated for x component (if Vectors are used), because x and y only differ
    in the LSB and afterwards Bit 14 and 25 are rotated for W[18 - O].y -> THX to Phateus
- modified: saved an addition for Vals[0] init, because of the change to PreVal0
- modified: reordered code for round 4 - 95 to optimize for less ALU OPs used -> THX Phateus and myself ^^
- modified: ordering of variables in additions for Round 124 was changed to optimize for less ALU OPs used
- modified: rewrote the part where nonces are checked, if they are positive and where they are written into output buffer
    (saves 2 global writes per work-item and saves additional ALU OPs)
- modified: changed variables W_3, P2_18_x, P2_18 and nonce into a constant
- modified: changed code formating for rounds 4 - 124 better readability
- removed: some comments to cleanup the code

2011-07-17:
- added: offset for W[] array to reduce it's size -> THX to user Vince
- modified: function t1() renamed to t1W() / function sharound() renamed to sharoundW()
- added: function t1() and sharound() which are used where the W[] addition can be left out, because W[] == 0
    I guess the compiler already does this optimization, but doesn't hurt) -> THX to user Vince
- modified: P1() - P4() and W() to male use of the offset
- modified: quite a few kernel parameters have new values or were added (mixed ideas from User Vince with own ones)
    C1addK5: C1addK5 = C1 + K[5]: C1addK5 = C1 + 0x59f111f1
    D1: D1 = D1 + K[4] + W[4]: D1 = D1 + 0xe9b5dba5 + 0x80000000U
    W2: W2 + W16 in P1(): W2 = P1(18) + P4(18)
    W17_2: 0x80000000U in P2() = 0x11002000 + W17 in P1(): W17_2 = P1(19) + P2(19)
    PreValaddT1: PreValaddT1 = PreVal4 + T1
    T1substate0: T1substate0 = T1 - substate0
- added: variable W_3, which stores the first value formely held in W[3]
- added: Temp variable used to speed up calculation for rounds 4 and 5
- modified: changed round 3 so that it's more efficient (uses: Vals[0] and Vals[4])
- modified: W[0] - W[14] are now kind of hard-coded or left out, where they were 0
- modified: optimized P1(18) + P2(18) + P4(18)
- modified: optimized P1(19) + P2(19) + P4(19)
- modified: optimized round 4 + 5
- modified: rounds 6 - 14 and 73 - 78 now use new sharound() without W[] addition
- modified: offset added for all parts, where W[] is used
- modified: W_3 is used as result instead of W[3] (W[3] is still used to generate random possition in output buffer) -> THX to user Vince

2011-07-11:
- modified: constant H[7] has a new value (saves an addition in round 124)
- modified: non BFI_INT Ch() function now uses OpenCL built-in bitselect
- modified: reordered W[] calculations for round 18 - 30, 87 and 94
- modified: reordered calculation for round 5
- modified: W[] calculation for round 80 - 86 is now a block before sharound() is called
- removed: K[60] from round 124 (because of new H[7] value)

2011-07-07:
- removed: some large comments in the source were removed
- modified: Ma() function is now unique in the kernel, no matter if BFI_INT is used or not -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- added: Ch() function which uses OpenCL bitselect() command (but it's not active, so you are free to try it) -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- modified: u W[128] is replaced with u W[124] because no more than 124 values are used
- modified: initialisation for Vals[0], Vals[3], Vals[4] and Vals[7] is now processed in other places to save some unneeded writes to these variables
- fixed: some hex values, which were used in vector additions are now properly type-casted, which hopefully restores AMD APP SDK 2.1 compatibility
- modified: rounds 3, 4 and 5 were modified for better performance (guess this can be tuned, if I have a working KernelAnalyzer)

2011-07-06:
- modified: H[] constants were reordered (2 were not used because of earlier mods)
- added: ulong L constant added (it's value doesn't fit into an uint)
- modified: new Ma() for non BFI_INT capable cards, should be faster -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- removed: t1W()
- modified: t1() reordered function calls for better performance
- modified: W() reordered function calls for better performance
- modified: sharound() removed writing to t1, now t1() is called twice, which makes this function FASTER (OpenCL compiler optimization)
- removed: sharound2() (if needed W() + sharound() is used instead)
- removed: partround() not needed because of another solution for round 3 and 124
- removed: t1 and t1W variabled
- modified: rounds 3, 19, 30, 81, 87, 94 and 124 were modified for better performance

2011-07-03:
- removed: t2(), w(n), r0(x), r1(x), R0(n) and R1(n)
- renamed - modified: R(x) to W(x) plus now uses P1, P2, P3 and P4 directly
- modified: P1(x) and P2(x) to not use R1(x - 2), R0(x - 15) but do that directly
- modified: SHA rounds 31, 32, 47 - 61, 86, 87, 114 - 119 now use sharound2() instead of W() + sharound()
- modified: reordered code for SHA rounds 66 - 94 -> saw no decrease in performance -> better readability
- modified: SHA rounds 18, 19, 20, 80, 93, 94 now use a simpler calculation because of removed zero addions
--> 1x P1(x), 2x P2(x), 4x P3(x) and 2x P4(x) were removed which should give a little MHash/sec boost
- modified: sharound() so that a double execution of t1() is avoided -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- added: "u t1W" variable, which is used in sharound2() to avoid double execution of t1W()

2011-07-01:

Code:

Vals[7] = 0xb0edbdd0 + K[0] +  W[64] + 0x08909ae5U; -> Vals[7] = 0xfc08884d + W[64];
Vals[3] = 0xa54ff53a + 0xb0edbdd0 + K[0] +  W[64]; -> Vals[3] = 0x198c7e2a2 + W[64];

- removed the

Code:

Vals[7] += H[7]

addition and replaced the final if-statements in the Kernel
- reordered some W[n] = statements to remove some unneeded additions
- replaced all additions like 64 + 5 with the corresponding integer value (guess it was in there for readability reasons, so here it got worse :D)
- removed some unneeded brackets
- re-formatted for better readability

If it works, please post here and consider a small donation @ 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x :).

Thanks,
Dia