Bitcoin Forum
December 10, 2016, 09:07:53 AM *
News: To be able to use the next phase of the beta forum software, please ensure that your email address is correct/functional.
 
   Home   Help Search Donate Login Register  
Pages: [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 »  All
  Print  
Author Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13  (Read 101490 times)
Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
July 04, 2011, 08:15:55 AM
 #1

This is a repost from the Newbies forum, because I'm now allowed to post here :).
original Thread is located here: http://forum.bitcoin.org/index.php?topic=25135.0

If it works, please post here and consider a small donation @ 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x :).



Important (2012-01-13): The FASTLOOP=False parameter is not needed anymore, because FASTLOOP defaults to false in this version. Update: FASTLOOP=True works now, I uploaded a fixed version!

Important: since OpenCL SDK / Runtime version 2.6 AMD updated their OpenCL compiler, so that some older kernels and optimizations in them seem to not work anymore or are not needed anymore. In order to reflect this change I had to edit the kernel performance section of this thread.

Important: since version 2011-08-27 you don't need to supply the BFI_INT switch anymore. If your HW supports it, it's enabled automatically. To disable it use BFI_INT=false.

Important: since version 2011-08-04 (pre-release) you have to use the switch VECTORS2 instead of VECTORS. I made this change to be clear what vectors are used in the kernel (2- or 4-component). To use 4-component vectors use switch VECTORS4.

Important: since version 2011-07-17 a modified version of __init__.py (for the Phoenix miner) is included in this package and has to be used! The kernel won't work for other Miners without modifications to them, see kernel.cl for further infos.



This is the preferred switch for Phoenix with phatk_dia in order to achieve comparable performance:
Code:
-k phatk AGGRESSION=12 VECTORS2 WORKSIZE=128


Download version 2012-01-13: http://www.mediafire.com/?xzk6b1yvb24r4dg
Download version 2011-12-21: http://www.mediafire.com/?r3n2m5s2y2b32d9
Download version 2011-08-27: http://www.mediafire.com/?697r8t2pdk419ji
Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j
Download version 2011-08-04 (pre-release): http://www.mediafire.com/?upwwud7kfyx7788
Download version 2011-07-17: http://www.mediafire.com/?4zxdd5557243has
Download version 2011-07-11: http://www.mediafire.com/?k404b6lqn8vu6z6
Download version 2011-07-07: http://www.mediafire.com/?o7jfp60s7xefrg4
Download version 2011-07-06: http://www.mediafire.com/?f8b8q3w5u5p0ln0
Download version 2011-07-03: http://www.mediafire.com/?xlkcc08jvp5a43v
Download version 2011-07-01: http://www.mediafire.com/?5jmt7t0e83k3eox

Kernel performance (BFI_INT / VECTORS2 / WORKSIZE=128 / SDK 2.6 / APP KernelAnalyzer 1.11 - Cal 11.12 profile):
HD5870
2011-08-20: 22 GPR / 1427 ALU OPs / 66 CF OPs
2011-08-27: 22 GPR / 1426 ALU OPs / 66 CF OPs
2011-12-21: 20 GPR / 1400 ALU OPs / 66 CF OPs
2012-01-13: 21 GPR / 1394 ALU OPs / 67 CF OPs

HD6970
2011-08-20: 21 GPR / 1687 ALU OPs / 66 CF OPs
2011-08-27:  23 GPR / 1688 ALU OPs / 68 CF OPs
2011-12-21: 21 GPR / 1687 ALU OPs / 66 CF OPs
2012-01-13: 20 GPR / 1687 ALU OPs / 66 CF OPs



Kernel performance (BFI_INT / VECTORS2 / SDK 2.5 / APP KernelAnalyzer 1.9 - Cal 11.7 profile):
HD5870
original phatk 1.X: 1393 ALU OPs
2011-07-01: 1389 ALU OPs
2011-07-03: 1385 ALU OPs
2011-07-06: 1380 ALU OPs
2011-07-07: 1380 ALU OPs
2011-07-11: 1378 ALU OPs
2011-07-17: 1376 ALU OPs
2011-08-04 (pre-release): 1368 ALU OPs
2011-08-11: 1364 ALU OPs
2011-08-27: 1363 ALU OPs (30 less compared to original phatk 1.X)
HD6970
original phatk 1.X: 1707 ALU OPs
2011-07-01: 1710 ALU OPs
2011-07-03: 1706 ALU OPs
2011-07-06: 1702 ALU OPs
2011-07-07: 1702 ALU OPs
2011-07-11: 1701 ALU OPs
2011-07-17: 1699 ALU OPs
2011-08-04 (pre-release): 1689 ALU OPs
2011-08-11: 1687 ALU OPs
2011-08-27: 1687 ALU OPs (20 less compared to original phatk 1.X)



changelog:

2012-01-13
Kernel:
- modified: Disclaimer is now the same as in original Phoenix package
- removed: all (u) typecasts in front of scalars, where vectors and scalars were used together because per OpenCL definition this is not needed
- removed: all () brackets around n in the #define parts of the kernel
- removed: S0(), which is now again merged into s0()
- removed: brackets around the commands in t1W(), t1(), t2() and W() were removed, to allow the compiler to reorder these
- added: W() function missed an ; at it's end
- added: init variable B1addK6 used in round 6 to save an add -> THX to DiabloD3
- added: a (uint) typecast in front of get_local_id() and get_group_id() calls, because return value could be 64 bits long, which is not wanted
- modified: replaced all ma() + s0() or s0() + ma() calls with t2()
- modified: round 6 now uses the new new B1addK6 variable
- modified: reordered W[] calculation for rounds 32, 91 and 92
- modified: rounds 121, 122 and 123 to not compute Vals[4], Vals[5] and Vals[6], because they are not needed for final computation of Vals[7] -> THX to jhajduk
- modified: removed + H[7] from round 124 and use -0xec9fcd13 to check for valid nonces
- added: result_r124 variable to take the result of the last round 124, this saves a few ALU OPs on VLIW5 GPUs
Python Init:
- modified: replaced spaces with tabs in the source code formatting (I really dislike this part in Python ^^)
- modified: a few comments and commands were reformatted for better readability or to be better understandable
- modified: FASTLOOP parameter now defaults to False, which means you don't need to supply FASTLOOP=False anymore
- removed: OUTPUT_SIZE is not used anymore so all references to it were removed
- modified: changed REVISION to 122
- modified: moved the WORKSIZE checks below the part where the check, if and which vectors should be used is performed
            this takes into account, that the global worksize passed to the kernel is influenced by vector usage and vectorsize
            (currently the use of FASTLOOP can break this, because of the "dynamic" number of iterations)
- added: some debug info about worksize and pyOpenCL is displayed at the start
- added: B1 + K[6] is passed as new kernel parameter
- modified: made enqueue_read_buffer() / enqueue_write_buffer() blocking and removed finish() after the read, as per AMDs recommendations
            to minimize API overhead

2011-08-27:
Kernel:
- added: code path for 3-component Vectors, activated via VECTORS3 (currently not usable, because of a bug in the AMD drivers up to Cat 11.8)
- removed: BITALIGN option from the kernel, BFI_INT is now used automatically, if the HW supports it (disabled via BFI_INT=false)
- modified: non BFI_INT Ch() function, which was broken in 2011-08-11 -> THX to Vince
- modified: kernel output buffer is now an ulong array and not an uint array
- removed: OUTPUT_SIZE argument is not passed and used in the kernel anymore
- modified: WORKSIZEx4, WORKSIZEx3 and WORKSIZEx2 arguments were merged into WORKSIZExVECSIZE
- modified: removed, reordered and added some brackets and type-casting stuff in the kernel
- modified: restored command order for round 108 - 123 to free a GPR
- modified: added H[7] into round 124 calculation
- modified: changed the checking for positive nonces again to cover the H[7] change
- modified: writing of nonces to output now uses 1 write for Vec2 and max. 2 writes for Vec4, because 2x uints are now encoded into 1x ulong
Python Init:
- added: code for 3-component Vectors, activated via VECTORS3 (currently not usable, because of a bug in the AMD drivers up to Cat 11.8)
- removed: BITALIGN option from the Python init, BFI_INT is now used automatically, if the HW supports it (disabled via BFI_INT=false)
- added: detection of maximum supported WORKSIZE per Device, which is used if no WORKSIZE is supplied, if supplied WORKSIZE > max. supported WORKSIZE
    or if WORKSIZE is not a power of 2
- added: code to decode the ulong from the output buffer into 2x uint and process the results
- modified: comments, code formating and line breaks for better readability
- modified: output buffer size is now the WORKSIZE -> THX to Phaetus

2011-08-11:
- modified: reverted a former change to the Ma() function to save an ALU OP for 69XX cards
- added: S0() and S1() function, which is a compiler help -> THX Phateus
- modified: a few brackets and layout of all helper functions for better readability and compatibility
- added: t2() function, which is (s0(n) + ma(n)) and saves a few GPRs -> THX Phateus and myself (had this in earlier, but removed it sometime ^^)
- modified: changed layout of kernel definition for better readability
- modified: all values which for example had a 10u now have a 10U (uppercase) to be consistent in the whole kernel
- modified: modified round 94 W calculation for better performance
- modified: round 108 - 123 now consists of 2 W() blocks followed by 2 sharoundW() blocks to save a GPR
- modified: changed the checking for positive nonces again to never create an invalid share and lower ALU OP usage

2011-08-04 (pre-release):
- added: user Vince into disclaimer -> THX Vince :)
- added: kernel is now able to work with 4-component vectors (switch VECTORS4) -> THX to Phateus
- modified: to use 2-component vectors I renamed the switch VECTORS to VECTORS2
- added: __attribute__((reqd_work_group_size(WORKSIZE, 1, 1))) -> THX to Phateus
- added: constants PreW31 and PreW32, which store P2() + P4() for round 31 and 32 -> THX to Phateus
- renamed - modified: W17_2 is now PreW19, W2 is now PreW18, PreVal4addT1 is now PreVal4 (= PreVal4 + T1), state0subT1 is now PreVal0 (= Preval4 + state0)
- modified: base is now declared as u to save the addidion of uint2(0, 1) or uint4(0, 1, 2, 3) for W_3 init -> THX to Phateus
- modified: nonce calculation now uses the local Work-Item ID, the group ID and the WORKSIZE instead of only the global Work-Item ID -> THX to Phateus
- added: saved a multiplication by passing WORKSIZEx2 and WORKSIZEx4 constants to the kernel
- modified: calculation for W[18 - O] was optimized so that P2(18) is only calculated for x component (if Vectors are used), because x and y only differ
       in the LSB and afterwards Bit 14 and 25 are rotated for W[18 - O].y -> THX to Phateus
- modified: saved an addition for Vals[0] init, because of the change to PreVal0
- modified: reordered code for round 4 - 95 to optimize for less ALU OPs used -> THX Phateus and myself ^^
- modified: ordering of variables in additions for Round 124 was changed to optimize for less ALU OPs used
- modified: rewrote the part where nonces are checked, if they are positive and where they are written into output buffer
       (saves 2 global writes per work-item and saves additional ALU OPs)
- modified: changed variables W_3, P2_18_x, P2_18 and nonce into a constant
- modified: changed code formating for rounds 4 - 124 better readability
- removed: some comments to cleanup the code

2011-07-17:
- added: offset for W[] array to reduce it's size -> THX to user Vince
- modified: function t1() renamed to t1W() / function sharound() renamed to sharoundW()
- added: function t1() and sharound() which are used where the W[] addition can be left out, because W[] == 0
    I guess the compiler already does this optimization, but doesn't hurt) -> THX to user Vince
- modified: P1() - P4() and W() to male use of the offset
- modified: quite a few kernel parameters have new values or were added (mixed ideas from User Vince with own ones)
    C1addK5: C1addK5 = C1 + K[5]: C1addK5 = C1 + 0x59f111f1
    D1: D1 = D1 + K[4] + W[4]: D1 = D1 + 0xe9b5dba5 + 0x80000000U
    W2: W2 + W16 in P1(): W2 = P1(18) + P4(18)
    W17_2: 0x80000000U in P2() = 0x11002000 + W17 in P1(): W17_2 = P1(19) + P2(19)
    PreValaddT1: PreValaddT1 = PreVal4 + T1
    T1substate0: T1substate0 = T1 - substate0
- added: variable W_3, which stores the first value formely held in W[3]
- added: Temp variable used to speed up calculation for rounds 4 and 5
- modified: changed round 3 so that it's more efficient (uses: Vals[0] and Vals[4])
- modified: W[0] - W[14] are now kind of hard-coded or left out, where they were 0
- modified: optimized P1(18) + P2(18) + P4(18)
- modified: optimized P1(19) + P2(19) + P4(19)
- modified: optimized round 4 + 5
- modified: rounds 6 - 14 and 73 - 78 now use new sharound() without W[] addition
- modified: offset added for all parts, where W[] is used
- modified: W_3 is used as result instead of W[3] (W[3] is still used to generate random possition in output buffer) -> THX to user Vince

2011-07-11:
- modified: constant H[7] has a new value (saves an addition in round 124)
- modified: non BFI_INT Ch() function now uses OpenCL built-in bitselect
- modified: reordered W[] calculations for round 18 - 30, 87 and 94
- modified: reordered calculation for round 5
- modified: W[] calculation for round 80 - 86 is now a block before sharound() is called
- removed: K[60] from round 124 (because of new H[7] value)

2011-07-07:
- removed: some large comments in the source were removed
- modified: Ma() function is now unique in the kernel, no matter if BFI_INT is used or not -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- added: Ch() function which uses OpenCL bitselect() command (but it's not active, so you are free to try it) -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- modified: u W[128] is replaced with u W[124] because no more than 124 values are used
- modified: initialisation for Vals[0], Vals[3], Vals[4] and Vals[7] is now processed in other places to save some unneeded writes to these variables
- fixed: some hex values, which were used in vector additions are now properly type-casted, which hopefully restores AMD APP SDK 2.1 compatibility
- modified: rounds 3, 4 and 5 were modified for better performance (guess this can be tuned, if I have a working KernelAnalyzer)

2011-07-06:
- modified: H[] constants were reordered (2 were not used because of earlier mods)
- added: ulong L constant added (it's value doesn't fit into an uint)
- modified: new Ma() for non BFI_INT capable cards, should be faster -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- removed: t1W()
- modified: t1() reordered function calls for better performance
- modified: W() reordered function calls for better performance
- modified: sharound() removed writing to t1, now t1() is called twice, which makes this function FASTER (OpenCL compiler optimization)
- removed: sharound2() (if needed W() + sharound() is used instead)
- removed: partround() not needed because of another solution for round 3 and 124
- removed: t1 and t1W variabled
- modified: rounds 3, 19, 30, 81, 87, 94 and 124 were modified for better performance

2011-07-03:
- removed: t2(), w(n), r0(x), r1(x), R0(n) and R1(n)
- renamed - modified: R(x) to W(x) plus now uses P1, P2, P3 and P4 directly
- modified: P1(x) and P2(x) to not use R1(x - 2), R0(x - 15) but do that directly
- modified: SHA rounds 31, 32, 47 - 61, 86, 87, 114 - 119 now use sharound2() instead of W() + sharound()
- modified: reordered code for SHA rounds 66 - 94 -> saw no decrease in performance -> better readability
- modified: SHA rounds 18, 19, 20, 80, 93, 94 now use a simpler calculation because of removed zero addions
--> 1x P1(x), 2x P2(x), 4x P3(x) and 2x P4(x) were removed which should give a little MHash/sec boost
- modified: sharound() so that a double execution of t1() is avoided -> THX to User: 1MLyg5WVFSMifFjkrZiyGW2nw
- added: "u t1W" variable, which is used in sharound2() to avoid double execution of t1W()

2011-07-01:
Code:
Vals[7] = 0xb0edbdd0 + K[0] +  W[64] + 0x08909ae5U; -> Vals[7] = 0xfc08884d + W[64];
Vals[3] = 0xa54ff53a + 0xb0edbdd0 + K[0] +  W[64]; -> Vals[3] = 0x198c7e2a2 + W[64];
- removed the
Code:
Vals[7] += H[7]
addition and replaced the final if-statements in the Kernel
- reordered some W[n] = statements to remove some unneeded additions
- replaced all additions like 64 + 5 with the corresponding integer value (guess it was in there for readability reasons, so here it got worse :D)
- removed some unneeded brackets
- re-formatted for better readability

If it works, please post here and consider a small donation @ 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x :).

Thanks,
Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
1481360873
Hero Member
*
Offline Offline

Posts: 1481360873

View Profile Personal Message (Offline)

Ignore
1481360873
Reply with quote  #2

1481360873
Report to moderator
1481360873
Hero Member
*
Offline Offline

Posts: 1481360873

View Profile Personal Message (Offline)

Ignore
1481360873
Reply with quote  #2

1481360873
Report to moderator
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction. Advertise here.
1481360873
Hero Member
*
Offline Offline

Posts: 1481360873

View Profile Personal Message (Offline)

Ignore
1481360873
Reply with quote  #2

1481360873
Report to moderator
1481360873
Hero Member
*
Offline Offline

Posts: 1481360873

View Profile Personal Message (Offline)

Ignore
1481360873
Reply with quote  #2

1481360873
Report to moderator
1481360873
Hero Member
*
Offline Offline

Posts: 1481360873

View Profile Personal Message (Offline)

Ignore
1481360873
Reply with quote  #2

1481360873
Report to moderator
lebuen
Member
**
Offline Offline

Activity: 106


View Profile
July 04, 2011, 08:38:12 AM
 #2

Works perfectly for me, although only slight increase in performance (371 -> 373 MH/s). But I had the previous patch already in place. Thanks!
Fletch
Full Member
***
Offline Offline

Activity: 168


I'll have a steak sandwich and a... steak sandwich


View Profile
July 04, 2011, 08:40:32 AM
 #3

Great, thanks. I went from 240 -> 242.5 on my 5850's. I was using only the "3% Ma-function patch" before.

- replaced all additions like 64 + 5 with the corresponding integer value (guess it was in there for readability reasons, so here it got worse Cheesy)
I've never developed any code for GPUs or used OpenCL, but wouldn't the compiler take care of that for you? At least all C compilers would.

HashPeak - GPU mining hashrate peak detector
BTC: 1FLETCHvcUKosefrcZCLUQTtvx4WvgnYMC
Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
July 04, 2011, 09:38:52 AM
 #4

Great, thanks. I went from 240 -> 242.5 on my 5850's. I was using only the "3% Ma-function patch" before.

- replaced all additions like 64 + 5 with the corresponding integer value (guess it was in there for readability reasons, so here it got worse Cheesy)
I've never developed any code for GPUs or used OpenCL, but wouldn't the compiler take care of that for you? At least all C compilers would.

Great that you benefit from the modifications Smiley. Your comment about the compiler stuff IS right, but at least it doesn't make things worse Cheesy.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
teukon
Legendary
*
Offline Offline

Activity: 1246



View Profile
July 04, 2011, 10:37:43 AM
 #5

No change for me at all unfortunately, not even 0.1 MH/s.  I copied the file in place of phoenix-1.50/kernels/phatk/kernel.cl and restarted my instances of phoenix.

Settings:
Linux
Catalyst 11.6
SDK 2.1
phatk (bundled with phoenix 1.50 with MA tweak)
VECTORS BFI_INT FASTLOOP=false AGGRESSION=12 WORKSIZE=256
Solo mining

5850 Xtreme @ 1.0875V - 970MHz (399.4 MH/s) (75*C)
5850 Xtreme @ 1.1625V - 1050MHz (432.8 MH/s) (60*C)
OCedHrt
Member
**
Offline Offline

Activity: 98


View Profile
July 04, 2011, 10:41:57 AM
 #6

No change for me at all unfortunately, not even 0.1 MH/s.  I copied the file in place of phoenix-1.50/kernels/phatk/kernel.cl and restarted my instances of phoenix.

Settings:
Linux
Catalyst 11.6
SDK 2.1
phatk (bundled with phoenix 1.50 with MA tweak)
VECTORS BFI_INT FASTLOOP=false AGGRESSION=12 WORKSIZE=256
Solo mining

5850 Xtreme @ 1.0875V - 970MHz (399.4 MH/s) (75*C)
5850 Xtreme @ 1.1625V - 1050MHz (432.8 MH/s) (60*C)


I went from 56->61Mhash/s on my 4850. Phatk is now faster than DiabloMiner. Getting 59Mhash/s with DiabloMiner.
Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
July 04, 2011, 10:44:37 AM
 #7

No change for me at all unfortunately, not even 0.1 MH/s.  I copied the file in place of phoenix-1.50/kernels/phatk/kernel.cl and restarted my instances of phoenix.

Settings:
Linux
Catalyst 11.6
SDK 2.1
phatk (bundled with phoenix 1.50 with MA tweak)
VECTORS BFI_INT FASTLOOP=false AGGRESSION=12 WORKSIZE=256
Solo mining

5850 Xtreme @ 1.0875V - 970MHz (399.4 MH/s) (75*C)
5850 Xtreme @ 1.1625V - 1050MHz (432.8 MH/s) (60*C)


Why are you using SDK 2.1? The phatk Kernel likes 2.4 best. Your Phoenix settings look good though.
Strange that your MH/s didn't change at all. Did Phoenix apply the BFI_INT patch the first time you startet with the new Kernel?

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
July 04, 2011, 10:46:19 AM
 #8

I went from 56->61Mhash/s on my 4850. Phatk is now faster than DiabloMiner. Getting 59Mhash/s with DiabloMiner.

Good to know that 4XXX series get a boost, too Smiley. What SDK are you on?

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
teukon
Legendary
*
Offline Offline

Activity: 1246



View Profile
July 04, 2011, 10:49:47 AM
 #9

No change for me at all unfortunately, not even 0.1 MH/s.  I copied the file in place of phoenix-1.50/kernels/phatk/kernel.cl and restarted my instances of phoenix.

Settings:
Linux
Catalyst 11.6
SDK 2.1
phatk (bundled with phoenix 1.50 with MA tweak)
VECTORS BFI_INT FASTLOOP=false AGGRESSION=12 WORKSIZE=256
Solo mining

5850 Xtreme @ 1.0875V - 970MHz (399.4 MH/s) (75*C)
5850 Xtreme @ 1.1625V - 1050MHz (432.8 MH/s) (60*C)


Why are you using SDK 2.1? The phatk Kernel likes 2.4 best. Your Phoenix settings look good though.
Strange that your MH/s didn't change at all. Did Phoenix apply the BFI_INT patch the first time you startet with the new Kernel?

Dia

I'm using SDK 2.1 because when I compared it with 2.4 I found a slight speed improvement with both phatk and poclbm (much larger with poclbm).

I haven't tried SDK 2.4 since applying the MA patch so that is maybe worth a try.

BFI_INT is definitely being used.  If I restart the command without BFI_INT on my 399.4 MH/s gpu I get 354.5 MH/s instead.
teukon
Legendary
*
Offline Offline

Activity: 1246



View Profile
July 04, 2011, 10:52:56 AM
 #10

No change for me at all unfortunately, not even 0.1 MH/s.  I copied the file in place of phoenix-1.50/kernels/phatk/kernel.cl and restarted my instances of phoenix.

Settings:
Linux
Catalyst 11.6
SDK 2.1
phatk (bundled with phoenix 1.50 with MA tweak)
VECTORS BFI_INT FASTLOOP=false AGGRESSION=12 WORKSIZE=256
Solo mining

5850 Xtreme @ 1.0875V - 970MHz (399.4 MH/s) (75*C)
5850 Xtreme @ 1.1625V - 1050MHz (432.8 MH/s) (60*C)


Why are you using SDK 2.1? The phatk Kernel likes 2.4 best. Your Phoenix settings look good though.
Strange that your MH/s didn't change at all. Did Phoenix apply the BFI_INT patch the first time you startet with the new Kernel?

Dia

I'm using SDK 2.1 because when I compared it with 2.4 I found a slight speed improvement with both phatk and poclbm (much larger with poclbm).

I haven't tried SDK 2.4 since applying the MA patch so that is maybe worth a try.

BFI_INT is definitely being used.  If I restart the command without BFI_INT on my 399.4 MH/s gpu I get 354.5 MH/s instead.


Just tried SDK 2.4 on my 399.4 MH/s gpu.  It went down to 393.7 MH/s (actually 394.3 MH/s but occasionally dropping off to 390-391, I took an average).

Edit: I compared the new kernel with the old one using SDK 2.4 and the improvement was 3.1 MH/s (+0.79%).  This is a nice improvement but not enough to make me move away from SDK 2.1.  Also, SDK 2.4 causes the MH/s to drop suddenly by 3 or 4 MH/s every so often (variance is within 0.5 MH/s with SDK 2.1 for me) and even the peak values I achieve with SDK 2.4, new kernel or not, are below my SDK 2.1 average.

Ah well, good work though.  I sent you some BTC anyway simply because you tried to help me fix my problem.
OCedHrt
Member
**
Offline Offline

Activity: 98


View Profile
July 04, 2011, 11:48:23 AM
 #11

I have a really high reject rate with this kernel on my 4850. I've tried alternatively running this kernel and DiabloMiner. I'm getting about 30%-50% reject on your kernel vs 10% on DiabloMiner. Could just be bad luck. Right now it's 4 accepted and 5 rejected after 10 minutes.
teukon
Legendary
*
Offline Offline

Activity: 1246



View Profile
July 04, 2011, 12:00:17 PM
 #12

I have a really high reject rate with this kernel on my 4850. I've tried alternatively running this kernel and DiabloMiner. I'm getting about 30%-50% reject on your kernel vs 10% on DiabloMiner. Could just be bad luck. Right now it's 4 accepted and 5 rejected after 10 minutes.

I found when pool mining that most of my rejects came shortly after new work was pushed.  As a result I had to try much longer test runs (3 hours or so) before coming to a conclusion about the miner's efficiency.
OCedHrt
Member
**
Offline Offline

Activity: 98


View Profile
July 04, 2011, 12:08:45 PM
 #13

I have a really high reject rate with this kernel on my 4850. I've tried alternatively running this kernel and DiabloMiner. I'm getting about 30%-50% reject on your kernel vs 10% on DiabloMiner. Could just be bad luck. Right now it's 4 accepted and 5 rejected after 10 minutes.

I found when pool mining that most of my rejects came shortly after new work was pushed.  As a result I had to try much longer test runs (3 hours or so) before coming to a conclusion about the miner's efficiency.


I'm aware of that, but I've had more rejects in the 10 minutes I'm running it than I've had all day. 411 shares, 22 stale. 13 of those are from testing with this kernel in the last 45 minutes or so.

I'm trying another pool and it's a bit better at 6 Accepted 2 Rejected. On bitclockers I was running at 300+ shares and 11 rejects with DM but getting the numbers above with the modded phatk kernel using phoenix. Trying it again.

Quote
[04/07/2011 05:05:06] Phoenix 1.50 starting...
[04/07/2011 05:05:06] Connected to server
[04/07/2011 05:05:55] Result: 12d3028b accepted
[04/07/2011 05:06:46] Result: e2513abe accepted
[04/07/2011 05:09:05] LP: New work pushed
[04/07/2011 05:09:09] Result: 55153340 accepted
[04/07/2011 05:09:48] Result: c49813f2 accepted
[04/07/2011 05:11:17] Result: 3e257a0d rejected
[04/07/2011 05:11:25] Result: 28da50c1 rejected
[04/07/2011 05:11:26] Result: e062d59e rejected
[04/07/2011 05:11:48] LP: New work pushed
[04/07/2011 05:12:05] Result: 75d54b7b rejected
[04/07/2011 05:12:08] Result: c832f2b0 rejected
teukon
Legendary
*
Offline Offline

Activity: 1246



View Profile
July 04, 2011, 12:15:40 PM
 #14

I have a really high reject rate with this kernel on my 4850. I've tried alternatively running this kernel and DiabloMiner. I'm getting about 30%-50% reject on your kernel vs 10% on DiabloMiner. Could just be bad luck. Right now it's 4 accepted and 5 rejected after 10 minutes.

I found when pool mining that most of my rejects came shortly after new work was pushed.  As a result I had to try much longer test runs (3 hours or so) before coming to a conclusion about the miner's efficiency.


I'm aware of that, but I've had more rejects in the 10 minutes I'm running it than I've had all day. 411 shares, 22 stale. 13 of those are from testing with this kernel in the last 45 minutes or so.

I'm trying another pool and it's a bit better at 6 Accepted 2 Rejected. On bitclockers I was running at 300+ shares and 11 rejects with DM but getting the numbers above with the modded phatk kernel using phoenix. Trying it again.

Yes, after 45 minutes things look highly suspect to me.  A good 3 hour test is useful for comparing different very-good setups but this tweak seems to have really hurt your accept/reject ratio.  Let us know your best when you're done testing.
OCedHrt
Member
**
Offline Offline

Activity: 98


View Profile
July 04, 2011, 12:23:02 PM
 #15

With the original phatk kernel (ma patched)

Quote
[04/07/2011 05:13:33] Phoenix 1.50 starting...
[04/07/2011 05:13:33] Connected to server
[04/07/2011 05:14:21] Result: 8c4fd15d accepted
[04/07/2011 05:14:45] Result: f15aefe5 accepted
[04/07/2011 05:15:35] Result: 9d5dfc38 rejected
[04/07/2011 05:19:21] LP: New work pushed
[04/07/2011 05:19:54] Result: 8adaadf6 accepted
[04/07/2011 05:19:57] Result: 5382cf90 accepted
[04/07/2011 05:22:54] Result: 2d0233f8 rejected
[04/07/2011 05:24:03] Result: 28c05c3a rejected
[04/07/2011 05:25:41] Result: 9dff1142 rejected
[04/07/2011 05:25:54] Result: 33095b05 accepted
[04/07/2011 05:26:05] Result: 3ec67e7e accepted
[04/07/2011 05:27:33] Result: 5307e072 accepted
[04/07/2011 05:27:37] Result: 20237b07 accepted
[04/07/2011 05:29:18] Result: c8abce0f rejected

It's actually 05:22 now so duration is same but number of results is significantly less though total accepted is same. Hashrate is 57 vs 61 with your kernel.

Update: Some more results...seems like it may just be a phoenix thing with bitclockers. I will get some data from DM for reference but I may have to go ask in Phoenix/GUIMiner thread.

From DiabloMiner
Quote
[7/4/11 5:31:58 AM] Started
[7/4/11 5:31:58 AM] Connecting to: http://pool.bitclockers.com:8332/
[7/4/11 5:31:58 AM] Using AMD Accelerated Parallel Processing OpenCL 1.1 AMD-APP
-SDK-v2.5 (684.211)
[7/4/11 5:32:00 AM] Added ATI RV770 (#1) (10 CU, local work size of 128)
[7/4/11 5:33:20 AM] Accepted block 1 found on ATI RV770 (#1)
[7/4/11 5:35:41 AM] Accepted block 2 found on ATI RV770 (#1)
[7/4/11 5:36:49 AM] Accepted block 3 found on ATI RV770 (#1)
[7/4/11 5:36:49 AM] Accepted block 4 found on ATI RV770 (#1)
[7/4/11 5:37:54 AM] Rejected block 1 found on ATI RV770 (#1)
[7/4/11 5:39:40 AM] Accepted block 5 found on ATI RV770 (#1)
[7/4/11 5:40:23 AM] Accepted block 6 found on ATI RV770 (#1)
[7/4/11 5:40:34 AM] Accepted block 7 found on ATI RV770 (#1)
[7/4/11 5:40:56 AM] Accepted block 8 found on ATI RV770 (#1)
[7/4/11 5:41:53 AM] Accepted block 9 found on ATI RV770 (#1)
[7/4/11 5:42:15 AM] Accepted block 10 found on ATI RV770 (#1)
[7/4/11 5:42:38 AM] Accepted block 11 found on ATI RV770 (#1)
[7/4/11 5:42:59 AM] Accepted block 12 found on ATI RV770 (#1)
[7/4/11 5:43:57 AM] Accepted block 13 found on ATI RV770 (#1)
[7/4/11 5:44:48 AM] Accepted block 14 found on ATI RV770 (#1)

Trying your kernel with poclbm miner now. Actually getting 62 on here vs 61 from Phoenix.

Update again: No issues on poclbm using your kernel so I guess it's Phoenix.
Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
July 04, 2011, 03:22:12 PM
 #16

Trying your kernel with poclbm miner now. Actually getting 62 on here vs 61 from Phoenix.

Update again: No issues on poclbm using your kernel so I guess it's Phoenix.

Dunno what the difference is between the 2 OpenCL wise. Setup, command queues, perhaps kernel result download or processing.
If you see a problem you should contact jedi95, perhaps he can clear this up?

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
huayra.agera
Full Member
***
Offline Offline

Activity: 154



View Profile
July 04, 2011, 04:20:35 PM
 #17

Hi! Just tested this, it did boost my 5850 @ 960/300/1.174v from 390 > 395. However, it seems to drop off to 380 all of a sudden like 3-4 seconds then back up to 395. Still, an increase, thanks for the work!

BTC: 1JMPScxohom4MXy9X1Vgj8AGwcHjT8XTuy
OCedHrt
Member
**
Offline Offline

Activity: 98


View Profile
July 04, 2011, 05:48:07 PM
 #18

Trying your kernel with poclbm miner now. Actually getting 62 on here vs 61 from Phoenix.

Update again: No issues on poclbm using your kernel so I guess it's Phoenix.

Dunno what the difference is between the 2 OpenCL wise. Setup, command queues, perhaps kernel result download or processing.
If you see a problem you should contact jedi95, perhaps he can clear this up?

Dia

I will look into it. GUIMiner has their own Phoenix and that may be related.
gfaust
Newbie
*
Offline Offline

Activity: 24


View Profile
July 04, 2011, 11:50:16 PM
 #19

I was already running the "#define Ma" optimized kernel, and this is good for another .5% on top of that.
shakaru
Sr. Member
****
Offline Offline

Activity: 364


View Profile WWW
July 04, 2011, 11:58:33 PM
 #20

went from 306-314 on my 5830's Sent you a little thank you. People PAY THIS MAN!

Pages: [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 »  All
  Print  
 
Jump to:  

Sponsored by , a Bitcoin-accepting VPN.
Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!