Diapolo


July 08, 2011, 11:54:04 AM 

Guys, I introduced a small glitch, which produces an OpenCL compiler warning in version 0707. For stability reasons please change line 77:
old: u W[123];
new: u W[124];
I missed sharound(123), which writes to W[123]; that element is undefined because it is out of range. Sorry for that! I will upload a fixed version shortly (it only includes the change above and stays 0707).

Edit: Download 0707 fixed: http://www.mediafire.com/?o7jfp60s7xefrg4

Dia
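The off-by-one in this fix can be illustrated with a minimal Python sketch. It is only a model of the array bound: the real declaration is `u W[...]` in the OpenCL kernel, and the helper names `required_length` and `can_write` are hypothetical. In OpenCL an out-of-range write is undefined behavior rather than a clean error, which is why the declaration has to grow by one element.

```python
# Model of the array bound only; the real kernel declares `u W[...]` in OpenCL C.
HIGHEST_INDEX_WRITTEN = 123  # sharound(123) writes to W[123]


def required_length(highest_index: int) -> int:
    """An array indexed from 0 up to highest_index needs highest_index + 1 slots."""
    return highest_index + 1


def can_write(buf: list, index: int) -> bool:
    """True if index is a valid position inside buf."""
    return 0 <= index < len(buf)


old_w = [0] * 123  # valid indices are 0..122, so W[123] is out of range
new_w = [0] * 124  # valid indices are 0..123, so W[123] is fine

print(required_length(HIGHEST_INDEX_WRITTEN))
print(can_write(old_w, 123), can_write(new_w, 123))
```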










John (John K.)


July 08, 2011, 12:08:39 PM 

Thanks for the updates. Hashing rate increased by about 12 MH/s.

My BTC Tip Jar: 1Pgvfy19uwtYe5o9dg3zZsAjgCPt3XZqz9 , GPG ID: B3AAEEB0 ,OTC ID: johnthedong Escrow service is available on a case by case basis! (PM Me to verify I'm the escrow!)



mitchel


July 08, 2011, 02:38:37 PM 

Thanks man, this is awesome! Hashing rate increased by 10 for each 5830 that I have.
I'm at work right now, but I will definitely donate when I get home.




Maxim Gladkov


July 08, 2011, 02:40:01 PM 

Thank you for these improvements!




gominoa


July 09, 2011, 08:59:24 AM 

Works great!




Diapolo


July 09, 2011, 02:19:50 PM 

Quote from: Bert
I've been toying with an idea, but I don't have the necessary programming skills (or knowledge of the SHA256 algorithm) to implement anything.

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf
41. What is the difference between 24-bit and 32-bit integer operations?
24-bit operations are faster because they use floating point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, the 32-bit instruction executes only one per cycle.
43. Do 24-bit integers exist in hardware?
No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32 bits.
75. Is it possible to use all 256 registers in a thread?
No, the compiler limits a wavefront to half of the register pool, so there can always be at least two wavefronts executing in parallel.

http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Page 462: 24-bit integer MULs and MADs have five times the throughput of 32-bit integer multiplies. 24-bit unsigned integers are natively supported only on the Evergreen family of devices and later. Signed 24-bit integers are supported only on the Northern Islands family of devices and later. The use of OpenCL built-in functions for mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing operations.

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=144722
On the 5800 series, signed mul24(a,b) is turned into (((a<<8)>>8)*((b<<8)>>8)). This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s

So anyhow, what I was thinking was the following:
Current kernel: 1 * 256 bit hash / 32int = 8 32-bit operations (speed 100%)
Possible kernel: 3 * 256 bit hash / 24int = 32 24-bit operations (speed a maximum of 166% [5 times faster divided by 3 SHA256 operations in parallel]) **
** It may actually end up being slower than the current kernel.cl if 32-bit and 24-bit operations are sent as wavefronts at the same time.

There may be some merit in trying to write a new kernel.cl that uses 32 x 24-bit integers to carry out 3 parallel SHA256 operations at once, faster than one SHA256 operation using 8 32-bit integers. But not everything can be carried out as 24-bit operations, only mul24(a,b) and mad24(a,b), so the 166% speed-up would only be achieved if every SHA256 operation was covered by these two operations. The new kernel.cl would be limited to modern ATI hardware (54xx-59xx, 67xx-69xx), which is generally what miners are using. But to be honest I haven't looked into the SHA256 algorithm, so I'm not sure if parts of it could ever be rewritten to utilise mad24(a,b) or mul24(a,b). But I like thinking outside the box.

Hi Bert, sorry for not directly answering you. I checked the OpenCL 1.1 specs and yes, the faster 24-bit integer operations are mentioned there, too:
mul24 (Fast integer function.) Multiply 24-bit integer values a and b
mad24 (Fast integer function.) Multiply 24-bit integer then add the 32-bit result to 32-bit integer

But here lies the problem: AFAIK there are only additions, bit shifting and other bitwise operations used for the current kernel (no multiplications).
So at first sight there should be no use for it.

Dia
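To make the mul24/mad24 discussion concrete, here is a rough Python model of the operations mentioned in this post. It is a sketch under assumptions, not the OpenCL built-ins themselves: the function names are hypothetical, and the model simply masks each operand to its low 24 bits, whereas OpenCL leaves the result implementation-defined when operands fall outside the 24-bit range. The signed variant mirrors the `(((a<<8)>>8)*((b<<8)>>8))` expansion quoted from the AMD forum, and the last line reproduces the "5 times faster divided by 3" estimate.

```python
MASK32 = 0xFFFFFFFF
MASK24 = 0x00FFFFFF


def mul24_unsigned(a: int, b: int) -> int:
    """Model: multiply the low 24 bits of each operand, keep the low 32 bits."""
    return ((a & MASK24) * (b & MASK24)) & MASK32


def mad24_unsigned(a: int, b: int, c: int) -> int:
    """Model: 24-bit multiply, then add a 32-bit value (wrapping at 32 bits)."""
    return (mul24_unsigned(a, b) + c) & MASK32


def sign_extend_24(x: int) -> int:
    """Equivalent of (x << 8) >> 8 on a signed 32-bit int: keep 24 bits, sign-extend bit 23."""
    x &= MASK24
    return x - (1 << 24) if x & (1 << 23) else x


def mul24_signed_emulated(a: int, b: int) -> int:
    """Model of the emulated signed path described for the 5800 series."""
    return (sign_extend_24(a) * sign_extend_24(b)) & MASK32


# Best-case estimate from the post: 5x multiply throughput shared by 3 parallel hashes.
best_case_speedup = 5 / 3

print(hex(mul24_unsigned(0x01000003, 5)))   # the bit above bit 23 is ignored by this model
print(hex(mul24_signed_emulated(-2 & MASK32, 3)))
print(f"{best_case_speedup:.3f}")
```

Note that this only models multiplication. As the reply above points out, the kernel consists of additions, shifts and other bitwise operations, so there is nothing for these instructions to replace directly.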




KKAtan


July 09, 2011, 05:26:21 PM 

I've tested your patch, and there are indeed some great improvements over the original phoenix 1.50 miner.
My 6950 gets an improvement.
My 5870 gets an improvement.
My 5850 gets an improvement.
But I have noticed a regression with my 6870 (1005 MHz core / 200 MHz mem).

Configuration:
Using Windows 7
Using Catalyst 11.6
Using the aoclbf 1.74 frontend for phoenix 1.50
phatk
Vector
BFI_INT
Aggression 13
Work size 128

Results (all three numbers are peak values):
20110703 kernel: 317 MHash/s
20110706 kernel: 283 MHash/s
20110707 kernel: 283 MHash/s

...Needless to say, something bad happened between 0703 and 0706. I hope we can get to the bottom of this. If you need me to test something, I will be happy to do what I can for you.

1Pote63ZeU4fgFHnTuV6sB6T7duegwi5vc



Bert


July 09, 2011, 06:18:15 PM 

Quote from: Diapolo
... snip ... But here lies the problem: AFAIK there are only additions, bit shifting and other bitwise operations used for the current kernel (no multiplications). So at first sight there should be no use for it.

Yeah, I was thinking the same, but then I thought that there may be some smart way to rearrange some of the SHA256 algorithm to change simple bit shifts, exclusive-ors and additions into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA256 algorithm and binary maths. Something along the lines of the way the Laplace transform can be used to solve differential equations: transfer everything into the S-domain, solve with addition and subtraction, and then transfer back for the answer.

Tip jar: 1BW6kXgUjGrFTqEpyP8LpVEPQDLTkbATZ6



1MLyg5WVFSMifFjkrZiyGW2nw


July 09, 2011, 10:44:39 PM 

Quote from: Bert
Yeah, I was thinking the same, but then I thought that there may be some smart way to rearrange some of the SHA256 algorithm to change simple bit shifts, exclusive-ors and additions into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA256 algorithm and binary maths. Something along the lines of the way the Laplace transform can be used to solve differential equations: transfer everything into the S-domain, solve with addition and subtraction, and then transfer back for the answer.

While not a math guru, I am certain this can't be done. The algorithm uses these kinds of operations:
- rotate or shift bits right by three different numbers of places, then XOR together
- select bits from one of two values, depending on bits in a third
- majority of bits set/clear in three values
- addition of the result of these operations and constant values for each round

You can build multiplication out of these operations if they are combined in a certain way, but the SHA256 algorithm does not use them like this. If (parts of) SHA were equivalent to something as simple as multiplication, I'd say it could be broken in no time.

Also, SHA256 uses 32-bit values for everything. You could of course implement it on an 8-bit machine, but this would make it much slower. And having 24-bit wide registers does not even mean you could run three 8-bit ops at the same time.

Another thought I had: is aggressive loop unrolling really helping performance? At least for FPGAs, I guess that lots of very small units that maybe do one hash every 64 clock cycles could be better than a much bigger unrolled design, and the same could be true for GPUs. Was this already tested, or did everyone start with the assumption that unrolling is the way to go?
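For reference, the four kinds of SHA-256 operations listed in this post (rotate/shift and XOR, bit selection, majority, and modular addition) can be written out as follows. This is a plain Python sketch of the standard SHA-256 round functions, not code taken from the phatk kernel; the function names are my own.

```python
MASK32 = 0xFFFFFFFF


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n places."""
    return ((x >> n) | (x << (32 - n))) & MASK32


# Rotate or shift right by three different amounts, then XOR together.
def big_sigma0(a: int) -> int:
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)


def big_sigma1(e: int) -> int:
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)


def small_sigma0(w: int) -> int:
    return rotr(w, 7) ^ rotr(w, 18) ^ (w >> 3)


def small_sigma1(w: int) -> int:
    return rotr(w, 17) ^ rotr(w, 19) ^ (w >> 10)


def ch(e: int, f: int, g: int) -> int:
    """Select each bit from f where e has a 1, from g where e has a 0."""
    return ((e & f) ^ (~e & g)) & MASK32


def maj(a: int, b: int, c: int) -> int:
    """Each output bit is the majority vote of the three input bits."""
    return (a & b) ^ (a & c) ^ (b & c)


def t1(e: int, f: int, g: int, h: int, k: int, w: int) -> int:
    """Addition of the results above with the round constant k and message word w."""
    return (h + big_sigma1(e) + ch(e, f, g) + k + w) & MASK32


print(hex(ch(0xFFFFFFFF, 0x12345678, 0x9ABCDEF0)))
print(hex(maj(0xFFFF0000, 0xFFFF0000, 0x00001234)))
```

None of these is a multiplication, which is the point being made above: there is no 32-bit multiply in the round function for mul24 or mad24 to speed up.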

oops, username was cut off 1MLyg5WVFSMifFjkrZiyGW2nw7WnsU8AZ4



SeriousWorm


July 10, 2011, 12:23:26 AM 

Quote from: KKAtan
I've tested your patch, and there are indeed some great improvements over the original phoenix 1.50 miner. My 6950 gets an improvement. My 5870 gets an improvement. My 5850 gets an improvement.
But I have noticed a regression with my 6870 (1005 MHz core / 200 MHz mem).
Configuration: Windows 7, Catalyst 11.6, the aoclbf 1.74 frontend for phoenix 1.50, phatk, Vector, BFI_INT, Aggression 13, Work size 128.
20110703 kernel: 317 MHash/s, 20110706 kernel: 283, 20110707 kernel: 283 (all three numbers are peak values).
...Needless to say, something bad happened between 0703 and 0706. I hope we can get to the bottom of this. If you need me to test something, I will be happy to do what I can for you.

Try setting the worksize to 256 and upping your memory to 350 MHz. I get the most MHash/sec using that.




bmgjet


July 10, 2011, 12:29:20 AM 

I've found 500 MHz to be the best memory clock for GDDR5 cards and 800 MHz for GDDR3 cards. It runs a bit hotter than 350 MHz, but it's worth it IMO.




ssateneth


July 10, 2011, 04:19:33 AM 

Came across this topic when browsing https://en.bitcoin.it/wiki/Mining_hardware_comparison and the comment on the last 5830 entry (not the CrossFire one). Here are my results.

Baseline: 11.6 drivers. Not sure which SDK it is (I'm going to assume 2.4; I haven't done anything to my knowledge to change the SDK, and it seems 2.4 is what comes with 11.6).
5870 @ 1015 core, 300 memory, with the original phatk and phoenix 1.5
VECTORS BFI_INT WORKSIZE=128 FASTLOOP=false AGGRESSION=13
441 MHash/sec

0707 build: immediate gain to 450 MHash/sec
Increased memory to 350: 459 MHash/sec
Increased WORKSIZE to 256: 463 MHash/sec

I found that any increase in memory clock beyond 360 causes weird things to happen (an almost certain crash; I panicked to get it back to 350 before it crashed), but I see people posting about 500 MHz, so I'm going to try that out and hopefully not crash.

Edit: 500 MHz memory speed causes a decrease to 452 MHash/sec. It leads me to think that certain dividers/timings are changed at certain thresholds. This would explain why 500 MHz and 350 MHz appear to be stable, but 360+ and 600 MHz are unstable. (I was limited to 600 MHz for a while because I didn't know how to push memory lower than 600. I used MSI Afterburner.) Now I use MSI Afterburner to alter voltage and AMD GPU Clock Tool to set frequencies. I don't suppose anyone knows an all-in-one solution? AMD GPU Clock Tool doesn't seem to want to set custom voltages. It just has 0.9500, 1.0630, and Max VDCC. It won't accept custom numbers.




Diapolo


July 10, 2011, 11:39:06 AM 

Next kernel version will, once more, be faster for 69XX and 58XX cards. Stay tuned!

Dia




BitCoinJack.com


July 10, 2011, 12:00:13 PM 

Quote from: Diapolo
Next kernel version will, once more, be faster for 69XX and 58XX cards. Stay tuned! Dia

Great, looking forward to it!




Nialpo


July 11, 2011, 05:54:53 AM 

Also, you can simplify the last sharound() calls; there is no need to calculate the second variables.
Change:
W(120); sharound(120);
W(121); sharound(121);
W(122); sharound(122);
W(123); sharound(123);
To:
W(120); sharound(120);
W(121); Vals[2] += t1(121);
W(122); Vals[1] += t1(122);
W(123); Vals[0] += t1(123);
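Why the second value can be skipped in the last rounds can be checked with a small Python model. This is a sketch of a generic SHA-256-style round loop, not the phatk `sharound()` macro: the round constants and message words are arbitrary pseudo-random values (the property is structural and does not depend on them), and the names `run` and `e_only_tail` are hypothetical. Each round produces two new values, `e = d + t1` and `a = t1 + t2`. A new `a` needs four more rounds before it feeds into an `e`, so if only the `e` side of the final state is checked, the `a` computation in the last few rounds is dead code.

```python
import random

MASK32 = 0xFFFFFFFF


def rotr(x: int, n: int) -> int:
    return ((x >> n) | (x << (32 - n))) & MASK32


def run(state, k, w, e_only_tail=0):
    """Run len(w) rounds; in the last e_only_tail rounds, skip the new-a computation."""
    n = len(w)
    for i in range(n):
        a, b, c, d, e, f, g, h = state
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = ((e & f) ^ (~e & g)) & MASK32
        t1 = (h + s1 + ch + k[i] + w[i]) & MASK32
        if i >= n - e_only_tail:
            new_a = 0  # placeholder: t2 is not computed in these rounds
        else:
            s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
            maj = (a & b) ^ (a & c) ^ (b & c)
            new_a = (t1 + s0 + maj) & MASK32
        state = (new_a, a, b, c, (d + t1) & MASK32, e, f, g)
    return state


rng = random.Random(0)  # fixed seed so the run is repeatable
K = [rng.getrandbits(32) for _ in range(64)]
W = [rng.getrandbits(32) for _ in range(64)]
INIT = tuple(rng.getrandbits(32) for _ in range(8))

full = run(INIT, K, W)                    # every round computes both values
short = run(INIT, K, W, e_only_tail=3)    # last three rounds update only e

# The e, f, g, h half of the state is identical either way.
print(full[4:] == short[4:])
```

As the next posts note, a compiler may already eliminate the unused computation on its own, in which case the manual change will not reduce the ALU operation count.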




Diapolo


July 11, 2011, 06:38:50 AM 

Quote from: Nialpo
Also, you can simplify the last sharound() calls; there is no need to calculate the second variables.
Change:
W(120); sharound(120);
W(121); sharound(121);
W(122); sharound(122);
W(123); sharound(123);
To:
W(120); sharound(120);
W(121); Vals[2] += t1(121);
W(122); Vals[1] += t1(122);
W(123); Vals[0] += t1(123);

Seems like a good idea, but I checked via KernelAnalyzer and it doesn't lower the number of ALU operations needed ... perhaps it will help once the commands are reordered. I'm looking into it, and thanks for your post!

Dia




Nialpo


July 11, 2011, 07:03:14 AM 

The compiler has probably already removed this, since there is no dependency. Also, the dimension of W[] can be lowered, down to 16: in each step, no values before W[n-16] are used. I've tried this, but it's not faster. Maybe with W[32] or W[64] it will be better. At least the compiled .elf file is shorter.
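The observation that no values before W[n-16] are used can be sketched in Python with the standard SHA-256 message schedule. This is an illustration of the idea only, not the kernel's `W()` macro (the kernel indexes W up to 123 across two chained hashes), and the function names are hypothetical. The rolling version keeps just 16 words and overwrites the oldest slot with each new word.

```python
MASK32 = 0xFFFFFFFF


def rotr(x: int, n: int) -> int:
    return ((x >> n) | (x << (32 - n))) & MASK32


def small_sigma0(w: int) -> int:
    return rotr(w, 7) ^ rotr(w, 18) ^ (w >> 3)


def small_sigma1(w: int) -> int:
    return rotr(w, 17) ^ rotr(w, 19) ^ (w >> 10)


def expand_full(block16):
    """Standard schedule: keep all 64 words in one array."""
    w = list(block16)
    for t in range(16, 64):
        w.append((small_sigma1(w[t - 2]) + w[t - 7]
                  + small_sigma0(w[t - 15]) + w[t - 16]) & MASK32)
    return w


def expand_rolling(block16):
    """Same words, but only a 16-entry circular buffer is kept as working storage."""
    buf = list(block16)
    produced = list(block16)  # collected only so the two versions can be compared
    for t in range(16, 64):
        new = (small_sigma1(buf[(t - 2) % 16]) + buf[(t - 7) % 16]
               + small_sigma0(buf[(t - 15) % 16]) + buf[(t - 16) % 16]) & MASK32
        buf[t % 16] = new  # slot t % 16 held W[t-16], which is never needed again
        produced.append(new)
    return produced


block = [(i * 0x01234567) & MASK32 for i in range(16)]  # arbitrary test input
print(expand_full(block) == expand_rolling(block))
```

Whether the smaller array is faster on a GPU depends on how the compiler allocates registers, which is consistent with the result reported above that it made no difference in speed.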




zephyr4


July 11, 2011, 07:08:14 AM 

Thanks, I'll give it a try.




Diapolo


July 11, 2011, 02:09:28 PM 

Download version 20110711: http://www.mediafire.com/?k404b6lqn8vu6z6

This could be the last version, because there seems to be no more room for big jumps. I thought I could remove some more additions, but the OpenCL compiler does a better job than I do. This version is faster than all previous kernels (it uses the fewest ALU OPs for 69XX and 58XX). It should also work with SDK 2.1. If it throws an error with 2.1, please post here and include the error message! Thanks to all donators, and thanks for your feedback!

Dia




Bert


July 11, 2011, 05:40:01 PM 

Thanks for the hard work, sent you another donation. With the last update I went from 456.75 MH/s up to 458.88 MH/s on overclocked 5870s running at 1000/347.

Tip jar: 1BW6kXgUjGrFTqEpyP8LpVEPQDLTkbATZ6



