Bitcoin Forum
Author Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13  (Read 51259 times)
Diapolo (OP)
July 08, 2011, 11:54:04 AM
 #81

Guys, I introduced a small glitch, which produces an OpenCL compiler warning in version 07-07. For stability reasons please change line 77:

old:
u W[123];

new:
u W[124];

I missed that sharound(123) writes to W[123], which is out of range for an array declared as W[123] (valid indices are 0 to 122), so the behavior is undefined. Sorry for that!
Will upload a fixed version shortly (only includes the change above and stays 07-07).
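
To illustrate the off-by-one, here is a minimal Python sketch (not the kernel itself; `make_w` and `can_access` are hypothetical helpers introduced only for this example). An array declared with 123 elements has valid indices 0 through 122, so index 123 only exists once the size is raised to 124.

```python
def make_w(size):
    """Stand-in for the kernel's `u W[size];` declaration (hypothetical helper)."""
    return [0] * size


def can_access(w, index):
    """Return True if w[index] lies inside the array bounds."""
    return 0 <= index < len(w)


old_w = make_w(123)  # old declaration: u W[123];
new_w = make_w(124)  # fixed declaration: u W[124];

print(can_access(old_w, 123))  # False: the last valid index is 122
print(can_access(new_w, 123))  # True: index 123 now exists
```

Python would raise an IndexError on the out-of-range access; OpenCL C gives no such guarantee, which is why the larger declaration matters for stability.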

Edit:
Download 07-07 fixed: http://www.mediafire.com/?o7jfp60s7xefrg4

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
John (John K.)
July 08, 2011, 12:08:39 PM
 #82

Thanks for the updates. My hash rate increased by about 1-2 MH/s ;D
mitchel
July 08, 2011, 02:38:37 PM
 #83

Thanks man, this is awesome! Hashing rate increased by 10 for each 5830 that I have.

I'm at work right now, but I will definitely donate when I get home.
Maxim Gladkov
July 08, 2011, 02:40:01 PM
 #84

Thank you for these improvements!
gominoa
July 09, 2011, 08:59:24 AM
 #85

Works great!
Diapolo (OP)
July 09, 2011, 02:19:50 PM
 #86

Quote from: Bert
I've been toying with an idea, but I don't have the necessary programming skills (or knowledge of the SHA-256 algorithm) to implement anything.

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_FAQ.pdf
Quote
41. What is the difference between 24-bit and 32-bit integer operations?

    24-bit operations are faster because they use floating point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, the 32-bit instruction executes only one per cycle.

43. Do 24-bit integers exist in hardware?

    No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32-bits.

75. Is it possible to use all 256 registers in a thread?

    No, the compiler limits a wavefront to half of the register pool, so there can always be at least two wavefronts executing in parallel.
http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Page 4-62
Quote
24-bit integer MULs and MADs have five times the throughput of 32-bit integer multiplies. 24-bit unsigned integers are natively supported only on the Evergreen family of devices and later. Signed 24-bit integers are supported only on the Northern Island family of devices and later. The use of OpenCL built-in functions for mul24 and mad24 is encouraged. Note that mul24 can be useful for array indexing operations.

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=144722
Quote
On the 5800 series, signed mul24(a,b) is turned into
Code:
(((a<<8)>>8)*((b<<8)>>8))
This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s
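
To make the quoted expression concrete, here is a small Python sketch that models `(((a<<8)>>8)*((b<<8)>>8))` on 32-bit signed integers. It only emulates the arithmetic of the expression; it says nothing about how the hardware or compiler actually schedules it. The helper names (`to_signed32`, `sign_extend24`, `emulated_signed_mul24`) are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def sign_extend24(x):
    """Model (x << 8) >> 8 on a 32-bit signed int.

    The left shift discards the top 8 bits; the arithmetic right shift
    then sign-extends bit 23 back across them.
    """
    return to_signed32((x << 8) & MASK32) >> 8


def emulated_signed_mul24(a, b):
    """Model the expanded signed mul24: two sign extensions, then a 32-bit multiply."""
    return to_signed32(sign_extend24(a) * sign_extend24(b))


print(emulated_signed_mul24(1000, -2000))
print(sign_extend24(0x01000005))  # the upper 8 bits are ignored
```

The two extra shift pairs per multiply are the additional work the quoted benchmark attributes the slowdown to.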

So anyhow what I was thinking was the following

Current kernel: 1 * 256-bit hash / 32-bit int = 8 32-bit operations (speed 100%)
Possible kernel: 3 * 256-bit hash / 24-bit int = 32 24-bit operations (speed a maximum of 166% [5 times faster divided by 3 SHA-256 operations in parallel])*

* It may actually end up being slower than the current kernel.cl if 32bit and 24bit operations are sent as wavefronts at the same time.
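
The arithmetic behind those two lines can be checked with a few lines of Python. This is only a restatement of the estimate above; the factor of 5 is the throughput figure from the quoted programming guide, and the result is an upper bound that assumes every operation could be expressed as a 24-bit multiply, which is not established here.

```python
HASH_BITS = 256

# Number of machine words needed to hold the hash state.
words_32 = HASH_BITS // 32        # one hash in 32-bit words
words_24 = 3 * HASH_BITS // 24    # three hashes packed into 24-bit words

# Quoted figure: 24-bit MUL/MAD has five times the throughput of a 32-bit multiply.
mul24_speedup = 5
parallel_hashes = 3

# Best case: 5x faster operations spread over 3 hashes computed side by side.
best_case = mul24_speedup / parallel_hashes

print(words_32, words_24, best_case)
```

5 / 3 is about 1.67, which the estimate above truncates to 166%.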

There may be some merit in trying to write a new kernel.cl that uses 32 x 24-bit integers to carry out 3 parallel SHA-256 operations at once, faster than one SHA-256 operation using 8 32-bit integers.

But not everything can be carried out as 24bit operations, only mul24(a,b) and mad24(a,b), so the 166% speed up would only be achieved if every SHA-256 operation was covered by these two operations. The new kernel.cl would be limited to modern ATI hardware (54xx-59xx,67xx-69xx), which is generally what miners are using.

But to be honest I haven't looked into the SHA-256 algorithm, so I'm not sure if parts of it could ever be rewritten to utilise mad24(a,b) or mul24(a,b). But I like thinking outside the box.

Hi Bert, sorry for not answering you directly. I checked the OpenCL 1.1 specs and yes, faster 24-bit integer operations are mentioned there, too.

mul24    (Fast integer function.) Multiply 24-bit integer values a and b
mad24    (Fast integer function.) Multiply 24-bit integer then add the 32-bit result to 32-bit integer

But here lies the problem: AFAIK the current kernel uses only additions, bit shifts, and other bitwise operations (no multiplications). So at first sight there should be no use for it.

Dia

KKAtan
July 09, 2011, 05:26:21 PM
 #87

I've tested your patch, and there are some great improvements from the original phoenix 1.50 miner indeed.
My 6950 gets improvement.
My 5870 gets improvement.
My 5850 gets improvement.

But I have noticed a regression with my 6870 (1005 MHz core / 200 MHz mem).

Configuration:
-Using Windows 7
-Using Catalyst 11.6
-Using the aoclbf 1.74 frontend for phoenix 1.50
phatk
Vector
BFI_INT
Aggression 13
Work size 128

2011-07-03 kernel: 317 Mhash/s (all three numbers are peak values)
2011-07-06 kernel: 283
2011-07-07 kernel: 283

...Needless to say, something bad happened between 07-03 and 07-06. I hope we can get to the bottom of this. If you need me to test something, I will be happy to do what I can for you.
Bert
July 09, 2011, 06:18:15 PM
 #88

Quote from: Diapolo
... snip ...
But here lies the problem: AFAIK the current kernel uses only additions, bit shifts, and other bitwise operations (no multiplications). So at first sight there should be no use for it.
Yeah, I was thinking the same, but then I thought that there may be some smart way to rearrange parts of the SHA-256 algorithm to turn simple bit shifts, exclusive-ORs, and additions into more complex multiplies, maybe by carrying out two or three operations at once. But it would require a deep knowledge of the SHA-256 algorithm and binary maths. Something along the lines of the way the Laplace transform can be used to solve differential equations: transfer everything into the s-domain, solve with addition and subtraction, and then transfer back for the answer.

Tip jar: 1BW6kXgUjGrFTqEpyP8LpVEPQDLTkbATZ6
1MLyg5WVFSMifFjkrZiyGW2nw
July 09, 2011, 10:44:39 PM
Last edit: July 09, 2011, 10:55:31 PM by 1MLyg5WVFSMifFjkrZiyGW2nw
 #89

Quote from: Bert (#88)

While not a math guru, I am certain this can't be done. The algorithm uses these kinds of operations:

- rotate or shift bits right by three different numbers of places, then XOR together
- select bits from one of two values, depending on bits in a third
- majority of bits set/clear in three values
- addition of the result of these operations and constant values for each round

You can build multiplication out of these operations if they are combined in a certain way, but the SHA-256 algorithm does not use them like this. If (parts of) SHA were equivalent to something as simple as multiplication, I'd say it could be broken in no time.
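
The operation types listed above can be sketched in Python as follows. The rotation and shift amounts are the standard SHA-256 ones; the function names are chosen for this example and are not taken from the kernel.

```python
MASK32 = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


# Rotate/shift by three different amounts, then XOR together.
def big_sigma0(x):
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def big_sigma1(x):
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def small_sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def small_sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def ch(e, f, g):
    """Choose: for each bit, take f where e is 1 and g where e is 0."""
    return ((e & f) ^ (~e & g)) & MASK32


def maj(a, b, c):
    """Majority: each result bit is the value held by at least two inputs."""
    return (a & b) ^ (a & c) ^ (b & c)


def add32(*terms):
    """Addition modulo 2**32, as used to combine the results each round."""
    return sum(terms) & MASK32
```

None of these involves a multiplication, which is why mul24/mad24 have nothing obvious to attach to.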

Also, SHA-256 uses 32-bit values for everything. You could of course implement it on an 8-bit machine, but this would make it much slower. And having 24-bit-wide registers does not even mean you could run three 8-bit ops at the same time.
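
As a rough illustration of why narrower words cost more, here is a Python sketch of one 32-bit addition carried out in 8-bit pieces. `add32_with_8bit_limbs` is a hypothetical name, and the step count is just a tally of the loop iterations, not a cycle measurement.

```python
def add32_with_8bit_limbs(a, b):
    """Add two 32-bit values using only 8-bit additions plus a carry.

    Returns (sum modulo 2**32, number of 8-bit additions performed).
    """
    result = 0
    carry = 0
    steps = 0
    for i in range(4):  # four 8-bit limbs, least significant first
        limb_a = (a >> (8 * i)) & 0xFF
        limb_b = (b >> (8 * i)) & 0xFF
        total = limb_a + limb_b + carry
        result |= (total & 0xFF) << (8 * i)
        carry = total >> 8  # each limb must wait for the previous carry
        steps += 1
    return result, steps  # the final carry is dropped: arithmetic is mod 2**32


print(add32_with_8bit_limbs(0x12345678, 0x11111111))
```

One native 32-bit add becomes four dependent steps, and the rotations in SHA-256 would need similar cross-limb handling.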

Another thought I had:
Is aggressive loop unrolling really helping performance? At least for FPGAs, I guess that lots of very small units that maybe do one hash every 64 clock cycles could be better than a much bigger unrolled design, and the same could be true for GPUs. Was this already tested or did everyone start with the assumption that unrolling is the way to go?
SeriousWorm
July 10, 2011, 12:23:26 AM
 #90

Quote from: KKAtan (#87)

Try setting worksize to 256 and upping your memory to 350 MHz. I get the most Mhash/sec using that.
bmgjet
July 10, 2011, 12:29:20 AM
 #91

I've found 500 MHz to be the best memory clock for GDDR5 cards and 800 MHz for GDDR3 cards. It runs a bit hotter than 350 but is worth it IMO.

Donations to: 1BMGjetfht9XLkGBYR4TSsuXjrYEKACcow
1stbits: 1bmgjet
300MHash/s 6850 http://www.techpowerup.com/gpuz/5u6wr/
Overclocked for 6 years and still strong http://valid.canardpc.com/show_oc.php?id=1931458 & http://valid.canardpc.com/show_oc.php?id=285337
ssateneth
July 10, 2011, 04:19:33 AM
Last edit: July 10, 2011, 06:58:16 AM by ssateneth
 #92

Came across this topic when browsing https://en.bitcoin.it/wiki/Mining_hardware_comparison and the comment on the last 5830 entry (not the crossfire one). Here are my results.

Baseline: 11.6 drivers. Not sure which SDK it is (I'm going to assume 2.4; to my knowledge I haven't done anything to change the SDK, and 2.4 seems to be what comes with 11.6). 5870 @ 1015 core, 300 memory, with the original phatk on Phoenix 1.5: VECTORS BFI_INT WORKSIZE=128 FASTLOOP=false AGGRESSION=13
441 MHash/sec
7-07 build: immediate gain to 450 MHash/sec
Increased memory to 350: 459 MHash/sec
Increased WORKSIZE to 256: 463 MHash/sec

I found that any memory increase past 360 causes weird things to happen (an almost certain crash; I panicked to get it back to 350 before it crashed), but I see people posting about 500 MHz, so I'm going to try that out and hopefully not crash.

Edit: 500 MHz memory speed causes a -decrease- to 452 MHash/sec. It leads me to think that certain dividers/timings change at certain thresholds. This would explain why 500 MHz and 350 MHz appear to be stable, but 360+ and 600 MHz are unstable (I was limited to 600 MHz for a while because I didn't know how to push memory lower than 600. I used MSI Afterburner).

Now I use MSI Afterburner to alter voltage and AMD GPU Clock Tool to set frequencies. I don't suppose anyone knows an all-in-one solution? AMD GPU Clock Tool doesn't seem to want to set custom voltages. It just has 0.9500, 1.0630, and Max VDCC. It won't accept custom numbers.


Diapolo (OP)
July 10, 2011, 11:39:06 AM
 #93

Next kernel version will, once more, be faster for 69XX and 58XX cards :). Stay tuned!

Dia

BitCoinJack.com
July 10, 2011, 12:00:13 PM
 #94

Quote from: Diapolo (#93)

Great, looking forward to it! :D
Nialpo
July 11, 2011, 05:54:53 AM
 #95

Also, you can simplify the last sharound() calls: there is no need to calculate the second value in each of them.

Change
   W(120);
   sharound(120);
   W(121);
   sharound(121);
   W(122);
   sharound(122);
   W(123);
   sharound(123);

To
   W(120);
   sharound(120);
   W(121);
   Vals[2] += t1(121);
   W(122);
   Vals[1] += t1(122);
   W(123);
   Vals[0] += t1(123);
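
Why this is safe can be shown with a generic model of the SHA-256 round in Python. This is a sketch under assumptions, not the phatk macros: `full_round`, `trimmed_round`, and `run` are hypothetical names, and the per-round inputs `kw` are arbitrary stand-ins for the real K[i] + W[i] values. As I read the suggestion, the `Vals[...] += t1(...)` lines correspond to the `d + t1` update below, while the skipped part is the `t1 + t2` value for the new `a`.

```python
MASK32 = 0xFFFFFFFF

# Standard SHA-256 initial hash values, used here only as a starting state.
IV = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK32


def t1(state, kw):
    """t1 depends only on e, f, g, h and the round input."""
    _, _, _, _, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    return (h + s1 + ch + kw) & MASK32


def t2(state):
    """t2 depends only on a, b, c."""
    a, b, c = state[0], state[1], state[2]
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    return (s0 + maj) & MASK32


def full_round(state, kw):
    a, b, c, d, e, f, g, h = state
    x = t1(state, kw)
    return ((x + t2(state)) & MASK32, a, b, c, (d + x) & MASK32, e, f, g)


def trimmed_round(state, kw):
    """Same as full_round, but the new 'a' is deliberately left uncomputed."""
    a, b, c, d, e, f, g, h = state
    x = t1(state, kw)
    return (0, a, b, c, (d + x) & MASK32, e, f, g)


def run(rounds, trimmed_tail):
    """Run `rounds` rounds, using trimmed_round for the last `trimmed_tail` of them."""
    state = IV
    for i in range(rounds):
        kw = (i * 0x9E3779B1) & MASK32  # arbitrary stand-in for K[i] + W[i]
        step = trimmed_round if i >= rounds - trimmed_tail else full_round
        state = step(state, kw)
    return state


# The e, f, g, h words match even though 'a' was skipped in the last three rounds.
print(run(60, 0)[4:] == run(60, 3)[4:])
```

A value written to `a` needs four more rounds before it reaches `e`, so the `a` values from the last three rounds can never influence the `e`, `f`, `g`, `h` words that remain at the end.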
Diapolo (OP)
July 11, 2011, 06:38:50 AM
 #96

Quote from: Nialpo (#95)


Seems like a good idea, but I checked via KernelAnalyzer and it doesn't lower the number of ALU operations needed ... perhaps it will help once the commands are reordered. I'm looking into it, and thanks for your post!

Dia

Nialpo
July 11, 2011, 07:03:14 AM
 #97

The compiler has probably already removed this, since there is no dependency.

Also, the dimension of W[] can be lowered, down to 16: no step uses values before W[n-16]. I've tried this, but it's not faster. Maybe it will be better with W[32] or W[64]. At least the compiled .elf file is shorter :)
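
The 16-entry idea can be sketched in Python with the standard SHA-256 message schedule. This is a generic illustration rather than the kernel code; `schedule_full` and `schedule_rolling` are hypothetical names, and the input block is arbitrary test data.

```python
MASK32 = 0xFFFFFFFF


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK32


def s0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def schedule_full(block_words):
    """Message schedule with one array slot per round (64 entries)."""
    w = list(block_words) + [0] * 48
    for t in range(16, 64):
        w[t] = (s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16]) & MASK32
    return w


def schedule_rolling(block_words):
    """Same schedule using a 16-entry ring buffer.

    Slot t & 15 still holds W[t-16] when round t starts, so it can be
    overwritten in place with W[t].
    """
    ring = list(block_words)
    out = list(block_words)  # collected only so the two versions can be compared
    for t in range(16, 64):
        ring[t & 15] = (s1(ring[(t - 2) & 15]) + ring[(t - 7) & 15]
                        + s0(ring[(t - 15) & 15]) + ring[t & 15]) & MASK32
        out.append(ring[t & 15])
    return out


block = [(i * 0x01010101) & MASK32 for i in range(16)]  # arbitrary test data
print(schedule_full(block) == schedule_rolling(block))
```

Both versions produce the same words; whether the smaller array is faster on a GPU depends on how the compiler allocates registers, which matches the observation above that it made no difference in speed.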

zephyr4
July 11, 2011, 07:08:14 AM
 #98

Thanks, I'll give it a try.
Diapolo (OP)
July 11, 2011, 02:09:28 PM
 #99

Download version 2011-07-11: http://www.mediafire.com/?k404b6lqn8vu6z6

This could be the last version, because there seems to be no more room for big jumps. I thought I could remove some more additions, but the OpenCL compiler does a better job than I do :D. This version is faster than all previous kernels (it uses the fewest ALU OPs for 69XX and 58XX). It should also work with SDK 2.1. If it throws an error with 2.1, please post here and include the error message!

Thanks to all donors, and thanks for your feedback!

Dia

Bert
July 11, 2011, 05:40:01 PM
 #100

Thanks for the hard work, sent you another donation. 456.75Mh/sec up to 458.88 with the last update on overclocked 5870's running at 1000/347.
