Diapolo (OP)
July 11, 2011, 04:51:53 PM

Quote from: OCedHrt
> Quote from: Diapolo
> > Download version 2011-07-11: http://www.mediafire.com/?k404b6lqn8vu6z6
> > This could be the last version, because there seems to be no more room for big jumps. I thought I could remove some more additions, but the OpenCL compiler does a better job than I do. This version is faster than all previous kernels (it uses the fewest ALU OPs on 69XX and 58XX). It should also work with SDK 2.1. If it throws an error with 2.1, please post here and include the error message!
> > Thanks to all donors and for your feedback!
> > Dia
> Awesome work. By the way, could you explain why Vals[2] = C1; is needed at the beginning when it is reassigned in round 5? I know removing this line breaks the hashing, but I don't see why, since it isn't used prior to round 5.
Various functions before the reassignment use Vals[2]; search the kernel for "Vals[" and you will see them. I tried to remove that assignment, but as you discovered, it is needed there.
Dia
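The register layout makes it easier to see why Vals[2] is live before round 5. The sketch below is a plain-Python SHA-256 compression function that, like the kernel, keeps the eight working variables in a fixed Vals[8]-style array and rotates the index instead of moving values. It assumes that register a of round t sits in Vals[(0 - t) % 8], b in Vals[(1 - t) % 8], and so on. The kernel's real index expressions are not quoted in this thread and may use different offsets. Under this assumed layout, slot 2 holds f in round 3 and g in round 4, where ch() reads it, and h in round 5, where it is read into T1 and only then overwritten with the new a. The same layout also puts the round-3 writes into Vals[0] and Vals[4], which matches Vince's round-3 snippet later in the thread. The constants are derived from primes as the SHA-256 spec defines them, so no table has to be typed in. The helper names are hypothetical.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF


def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def icbrt(n):
    """Exact integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# SHA-256 constants: first 32 fractional bits of cube roots / square roots of primes
K = [icbrt(p << 96) & MASK for p in PRIMES]
IV = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)

    vals = list(state)  # plays the role of Vals[8]; values never move between slots
    for t in range(64):
        idx = [(r - t) % 8 for r in range(8)]  # slot holding a..h in this round
        a, b, c, d, e, f, g, h = (vals[i] for i in idx)
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        vals[idx[3]] = (d + t1) & MASK   # the old d slot becomes the new e
        vals[idx[7]] = (t1 + t2) & MASK  # the old h slot becomes the new a
    # 64 rounds is a multiple of 8, so slot i holds register i again
    return [(s + v) & MASK for s, v in zip(state, vals)]


def register_in_slot(slot, t):
    """Name of the SHA-256 register that Vals[slot] holds during round t."""
    return "abcdefgh"[(slot + t) % 8]


if __name__ == "__main__":
    block = b"abc" + b"\x80" + b"\x00" * 52 + struct.pack(">Q", 24)
    digest = b"".join(struct.pack(">I", x) for x in compress(IV, block))
    print(digest == hashlib.sha256(b"abc").digest())  # True
    for t in range(3, 6):
        print(t, register_in_slot(2, t))  # 3 f, 4 g, 5 h
```

The digest check against hashlib confirms that rotating the index is equivalent to the usual variable shuffle. The last loop shows that the value stored in slot 2 is read in rounds 3, 4 and 5 before it is replaced.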

teukon
July 11, 2011, 05:06:15 PM

This does work with SDK 2.1, but it might be a tiny bit slower than your previous version.
HD5850, 1.0875V, 975MHz clock, 360 MHz RAM, aggression=14, worksize=256, Catalyst 11.6 (Linux)
SDK 2.1: 404.6 MH/s -> 404.5 MH/s
SDK 2.4: 401.8 MH/s -> 402.2 MH/s
Note that at aggression=14 my rate can sometimes drop suddenly by as much as 1 MH/s before recovering, but it usually varies by only about 0.2 MH/s, so the apparent decrease with SDK 2.1 could well be statistical noise.
I might also have to play with the RAM frequency again.
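Whether a change like 404.6 -> 404.5 MH/s means anything depends on how it compares with the normal fluctuation. Below is a tiny sketch using the figures above and the typical 0.2 MH/s variation. The function name and the simple "within the jitter band" rule are assumptions for illustration, not a proper statistical test.

```python
# 5850 figures from the post above, in MH/s: (previous kernel, 2011-07-11 kernel)
RESULTS = {"SDK 2.1": (404.6, 404.5), "SDK 2.4": (401.8, 402.2)}
TYPICAL_JITTER_MHS = 0.2  # usual run-to-run variation reported above


def compare(before, after, jitter):
    """Return (delta in MH/s, delta in percent, True if within the jitter band)."""
    delta = after - before
    percent = 100.0 * delta / before
    return delta, percent, abs(delta) <= jitter


if __name__ == "__main__":
    for sdk, (before, after) in RESULTS.items():
        delta, percent, noise = compare(before, after, TYPICAL_JITTER_MHS)
        verdict = "within normal jitter" if noise else "larger than normal jitter"
        print(f"{sdk}: {delta:+.1f} MH/s ({percent:+.3f} %), {verdict}")
```

The SDK 2.1 drop of 0.1 MH/s (about 0.025 %) is half the usual variation, which supports reading it as noise.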

OCedHrt
July 11, 2011, 05:14:01 PM

Quote from: Diapolo
> (the 2011-07-11 announcement, the question about Vals[2] = C1; and Diapolo's answer; see the first post above)
I only searched for "Vals[2]", so I did not see the % XD
Very small increase for me on this one: 278 -> 278.5 MH/s. Interestingly, a lower memory clock on my 6870 actually has a detrimental effect: downclocking from 1050 to 800 MHz reduces the hash rate by 0.5 MH/s. I can't clock it any lower, so I don't know whether 300 MHz would be better or not.

dishwara
July 11, 2011, 05:22:22 PM

Quote from: Diapolo
> (2011-07-11 announcement, see the first post above)
I hope you will still find some ways to get more hashes. Thanks.

dikidera
July 11, 2011, 05:23:18 PM

At 2 megahashes per second, your device produces 2 million hashes per second, and at 1 MH/s it still produces 1 million hashes per second. So even a small percentage gain means a lot of hashes: on a card doing around 400 MH/s, an increase of 0.2% or so yields about 800 thousand more hashes per second.
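The conversion is plain arithmetic. 1 MH/s is 10^6 hashes per second, so the extra throughput from a relative gain is rate × 10^6 × gain / 100. Here is a small sketch; the function name is illustrative.

```python
def extra_hashes_per_second(rate_mhs, gain_percent):
    """Extra hashes per second from a relative speed-up.

    rate_mhs: current rate in MH/s (1 MH/s = 1,000,000 hashes per second).
    gain_percent: relative improvement, e.g. 0.2 for 0.2 %.
    """
    hashes_per_second = rate_mhs * 1_000_000
    return hashes_per_second * gain_percent / 100


if __name__ == "__main__":
    for rate in (2, 278, 400):
        extra = extra_hashes_per_second(rate, 0.2)
        print(f"{rate} MH/s, +0.2 %: {extra:,.0f} extra hashes/s")
```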

Vince
July 11, 2011, 05:39:13 PM

Don't give up! There's still more to optimize; I'm at 1694 ALU OPs (HD6970) at the moment.

hugolp
July 11, 2011, 06:19:54 PM

Quote from: Diapolo
> (2011-07-11 announcement, see the first post above)
Previous patches had solved one of my cards crashing the miner every 30 minutes or so, but this one has reintroduced the problem: one miner running on a 5870 crashed after 20 minutes.

BOARBEAR
July 11, 2011, 06:28:47 PM

I wonder why we need const uint D1. It is only used once.

Vince
July 11, 2011, 06:50:09 PM

Quote from: BOARBEAR
> I wonder why we need const uint D1. It is only used once.
It's part of the precalculation. It's needed.
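The exact definition of D1 is not shown in this thread, so the sketch below only illustrates the general idea behind "precalculation". It assumes, consistent with Vince's later round-3 snippet where the nonce enters as W_3, that the nonce is word 3 of the second SHA-256 block. In that case rounds 0 to 2 never see the nonce. Their results are therefore the same for every nonce in a work unit, and the host can compute them once and pass them to the kernel as constants. A value that is used only once in the kernel is still worth precomputing, because the saving applies to every nonce. The midstate and W0 to W2 values below are hypothetical placeholders.

```python
MASK = 0xFFFFFFFF
# First four SHA-256 round constants K[0..3]
K = (0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5)


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha_round(state, k, w):
    """One standard SHA-256 round on the registers (a, b, c, d, e, f, g, h)."""
    a, b, c, d, e, f, g, h = state
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + k + w) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def trace_rounds(midstate, w0, w1, w2, nonce):
    """States after rounds 0..3 of the second block; only round 3 sees the nonce."""
    states = []
    state = tuple(midstate)
    for k, w in zip(K, (w0, w1, w2, nonce)):
        state = sha_round(state, k, w)
        states.append(state)
    return states


if __name__ == "__main__":
    # Hypothetical work unit: the SHA-256 IV stands in for a real midstate
    midstate = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
                0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
    w0, w1, w2 = 0x12345678, 0x4E1A2B3C, 0x1A0D3E2F
    run_a = trace_rounds(midstate, w0, w1, w2, nonce=0)
    run_b = trace_rounds(midstate, w0, w1, w2, nonce=1)
    # Rounds 0-2 agree for every nonce, so they can be computed once on the host
    print(run_a[:3] == run_b[:3], run_a[3] == run_b[3])  # True False
```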

Diapolo (OP)
July 11, 2011, 07:31:01 PM

Quote from: Vince
> Don't give up! There's still more to optimize; I'm at 1694 ALU OPs (HD6970) at the moment.
I can't read or edit Python, so yes, there is room if one could alter or add some more kernel arguments. The strange thing is that I saw some additions of known values, which I tried to eliminate via constants, but this led to lower kernel performance. I played around with this today and saw no more improvement... too bad, it was real fun the last few days! If you would like to share your work, we will all be happy. What is your kernel doing for 58XX cards? I thought it makes no sense to optimize for one over the other, so I tried to reduce the ALU OP count for both platforms.
Dia

Diapolo (OP)
July 11, 2011, 07:31:55 PM

Quote from: hugolp
> Quote from: Diapolo
> > (2011-07-11 announcement, see the first post above)
> Previous patches had solved one of my cards crashing the miner every 30 minutes or so, but this one has reintroduced the problem: one miner running on a 5870 crashed after 20 minutes.
Sorry about that, but I have no idea what would cause this. Perhaps the card is faulty? Does FurMark "crash" the card or show artifacts?
Dia

error
July 11, 2011, 08:25:28 PM

Quote from: Diapolo
> (2011-07-11 announcement, see the first post above)
This one seems to act rather strangely. With 2011-07-06 I had a pretty consistent 360 MHash/sec with little variation. With 2011-07-11 the card varies dramatically between 355 and 363, usually at the lower end of that range. Overall it may be slightly slower. The cards are 5850s clocked at 900/300, using Catalyst 11.4 and SDK 2.4 on Linux. I'm going to let it run a while longer.

3KzNGwzRZ6SimWuFAgh4TnXzHpruHMZmV8

pennytrader
July 11, 2011, 09:22:54 PM

On 5830 + SDK 2.1, it's slightly slower than the previous version. Guess I'll revert it back.

please donate to 1P3m2resGCP2o2sFX324DP1mfqHgGPA8BL

CYPER
July 11, 2011, 09:53:22 PM

Quote from: pennytrader
> On 5830 + SDK 2.1, it's slightly slower than the previous version. Guess I'll revert it back.
Same here. With the previous version my average was 1758 MH/s and now it is 1756 MH/s. This is for 4x 5870 @ 960 MHz core and 300 MHz memory, SDK 2.1, Ubuntu 32-bit.

wazoo42
July 11, 2011, 10:18:46 PM

7/4/11: a 1-2 MH/s increase
7/6/11: no increase (maybe a slight decrease)
7/11/11: a further 1-2 MH/s increase over 7/4/11
These are on 2x 5830s, and 3x 5770s using ati-drivers-11.6, phoenix-1.50, pyopencl-0.92, and ati-stream-sdk-bin-2.4.

error
July 11, 2011, 10:21:38 PM

Quote from: error
> Quote from: Diapolo
> > (2011-07-11 announcement, see the first post above)
> (earlier report comparing 2011-07-11 with 2011-07-06 on 5850s, see above)
Sorry, but this one is still slower; my cards were running at around 355 MH/sec and even down to 350. I went back to 2011-07-06.

Vince
July 11, 2011, 11:31:37 PM

Decreasing hashrates... that's really strange. These 58xx cards sometimes behave quite strangely.
I can't test it because all my rigs run on 6950s unlocked to 6970s.

erek
July 11, 2011, 11:32:18 PM

Quote from: error
> (error's two reports above that 2011-07-11 runs slower than 2011-07-06 on 5850s)
I totally disagree; for me each version has been getting faster and faster. 7-11-11 is the fastest yet for me.

Wildvest
July 12, 2011, 12:25:55 AM

2011-07-11: I can now report a 0.5-1% increase over the improved phatk kernel.

Vince
July 12, 2011, 02:27:20 AM

So here are some more changes:

I introduced const uint W17_2, containing P1(19) + 0x11002000. That's 3 shifts, 2 XORs and 1 add traded against one extra parameter, well worth it. I extended self.f from

    self.f = np.zeros(5, np.uint32)

to

    self.f = np.zeros(6, np.uint32)

and added this just after the W17 calculation in calculateF:

    #W17_2
    self.f[5] = np.uint32(0x11002000+( rot(self.f[2], 32-13) ^ rot(self.f[2], 32-15) ^ (self.f[2] >> 10) ))

Then I added the parameter (right after W17) in both the call and the kernel function. => Effectively 3 ops saved.

Next change: you can cut out all of W0 to W14! Most of them are zero anyway; I just needed to hardcode the first ones. Also, after some small changes W[73] to W[78] are no longer used, so there is no need to initialize them. => Less memory use, but the same speed for me.

Next one: round 3. The relevant parts of the current kernel are:

    #ifdef VECTORS
    Vals[4] = (W_3 = ((base + get_global_id(0)) << 1) + (uint2)(0, 1)) + PreVal4;
    #else
    Vals[4] = (W_3 = base + get_global_id(0)) + PreVal4;
    #endif
    // ...
    // Round 3
    Vals[0] = state0 + Vals[4];
    Vals[4] += T1;
    // ...
    W[64 - O] = state0 + Vals[0];

You can reorganize and shorten round 3 to:

    Vals[0] = T1 + Vals[4];

with these changes in the precalculation:

    PreVal4 += T1
    T1 = state0 - T1

=> Another addition saved, almost effortlessly.

Here are the files with these changes: http://www.filesonic.com/file/1423103594
Still some more to come!
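Both precalculation tricks above can be checked on the host in plain Python. This is a sketch under stated assumptions, not Vince's code. It assumes the standard Bitcoin second-block layout: W0 to W2 from the header tail, the nonce in W3 (as W_3 = base + get_global_id(0) suggests), the padding word 0x80000000 in W4, zeros in W5 to W14, and the bit length 0x280 in W15. It also assumes that P1(n) is SHA-256's small sigma1 applied to W[n-2]. Under those assumptions, 0x11002000 is simply sigma0(W4), W17 does not depend on the nonce, and W19 = W17_2 + nonce. The round-3 rewrite also returns the same Vals[0] and Vals[4] as the original for any 32-bit inputs, because uint arithmetic wraps modulo 2^32, so the subtraction in T1 = state0 - T1 is safe. The helper names (schedule, w17_2, round3_original, round3_reorganized) are hypothetical.

```python
import random

MASK = 0xFFFFFFFF


def rotr(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma0(x):
    # SHA-256 small sigma0 (message schedule)
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x):
    # SHA-256 small sigma1; assumed to be what P1(n) applies to W[n-2]
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule(w0, w1, w2, nonce):
    """W[0..19] of the second SHA-256 block of a Bitcoin header (assumed layout)."""
    w = [w0, w1, w2, nonce, 0x80000000] + [0] * 10 + [0x00000280]
    for n in range(16, 20):
        w.append((sigma1(w[n - 2]) + w[n - 7] + sigma0(w[n - 15]) + w[n - 16]) & MASK)
    return w


def w17_2(w17):
    """Hypothetical host-side constant: everything in W19 except the nonce."""
    return (0x11002000 + sigma1(w17)) & MASK


def round3_original(nonce, pre_val4, state0, t1):
    vals4 = (nonce + pre_val4) & MASK  # Vals[4] = W_3 + PreVal4
    vals0 = (state0 + vals4) & MASK    # Vals[0] = state0 + Vals[4]
    vals4 = (vals4 + t1) & MASK        # Vals[4] += T1
    return vals0, vals4


def round3_reorganized(nonce, pre_val4, state0, t1):
    # Precalculation, done once per work unit on the host
    pre_val4 = (pre_val4 + t1) & MASK  # PreVal4 += T1
    t1 = (state0 - t1) & MASK          # T1 = state0 - T1
    # Per-nonce kernel work: two additions instead of three
    vals4 = (nonce + pre_val4) & MASK
    vals0 = (t1 + vals4) & MASK        # Vals[0] = T1 + Vals[4]
    return vals0, vals4


if __name__ == "__main__":
    rng = random.Random(2011)
    assert sigma0(0x80000000) == 0x11002000  # the constant is sigma0(W4)
    for _ in range(1000):
        w0, w1, w2, nonce = (rng.getrandbits(32) for _ in range(4))
        w = schedule(w0, w1, w2, nonce)
        # W17 does not depend on the nonce, and W19 = W17_2 + nonce
        assert w[17] == schedule(w0, w1, w2, 0)[17]
        assert w[19] == (w17_2(w[17]) + nonce) & MASK
        args = [rng.getrandbits(32) for _ in range(4)]
        assert round3_original(*args) == round3_reorganized(*args)
    print("W17_2 and the round-3 rewrite agree with the original formulas")
```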