Bitcoin Forum
Author Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13  (Read 106834 times)
dishwara
November 04, 2011, 03:11:47 PM
 #321

Sorry for not donating.
I tried & used your software when I was mining.
2 btc sent to the address in signature. 1B6LEGEUu1USreFNaUfvPWLu6JZb7TLivM
Something is better than nothing.
Thank you for your valuable kernel.
gat3way
November 05, 2011, 08:27:23 PM
 #322

I did :), liked getting (positive) feedback and being an interesting part of Bitcoin for a short period of time.
Dia

Still coding OpenCL stuff, Diapolo?
Diapolo (OP)
November 06, 2011, 10:22:05 AM
 #323

I did :), liked getting (positive) feedback and being an interesting part of Bitcoin for a short period of time.
Dia

Still coding OpenCL stuff, Diapolo?

I had no time to make any further progress. From time to time I visit AMD's OpenCL forum to stay a little up to date, but I'm currently not coding. The last thing I tried was to implement 3-component vectors in the kernel, but AMD's drivers still seem buggy there.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo (OP)
November 06, 2011, 10:22:30 AM
 #324

Sorry for not donating.
I tried & used your software when I was mining.
2 btc sent to the address in signature. 1B6LEGEUu1USreFNaUfvPWLu6JZb7TLivM
Something is better than nothing.
Thank you for your valuable kernel.

I received your donation, a warm thank you :).

Dia

d3m0n1q_733rz
December 08, 2011, 08:50:39 AM
Last edit: December 08, 2011, 09:01:41 AM by d3m0n1q_733rz
 #325

Small change I could suggest just looking at some of the code.  I notice that some variables use simple addition and subtraction a few times.  For example:  
Code:
// intermediate W calculations
#define P1(n) (rot(W[(n) - 2 - O], 15U) ^ rot(W[(n) - 2 - O], 13U) ^ (W[(n) - 2 - O] >> 10U))
#define P2(n) (rot(W[(n) - 15 - O], 25U) ^ rot(W[(n) - 15 - O], 14U) ^ (W[(n) - 15 - O] >> 3U))
#define P3(n) W[n - 7 - O]
#define P4(n) W[n - 16 - O]

// full W calculation
#define W(n) (W[n - O] = P4(n) + P3(n) + P2(n) + P1(n))
You notice that n - O comes up about 3 times in a row.  Wouldn't it be better to just combine n - O into its own variable and subtract from it to reduce the number of variables that have to be read?  After all, n - 7 - O and n - 16 - O share the same n - O term.  If you combine n - O, it's just a simple "pull from buffer and subtract this number" problem.
I haven't tested it, but I imagine it could speed things along slightly.
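
For reference, the quoted macros can be modelled in plain Python to see what they compute. This is only a sketch under assumptions that the thread does not state: `rot` is taken to be a 32-bit left rotate (like OpenCL's `rotate`), `O` is treated as a constant index offset into `W`, and the function names and the sample values for `W` are made up for illustration.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic


def rot(x, y):
    """32-bit left rotate (assumed meaning of rot in the kernel)."""
    return ((x << y) | (x >> (32 - y))) & MASK


def P1(W, n, O):
    x = W[n - 2 - O]
    return rot(x, 15) ^ rot(x, 13) ^ (x >> 10)


def P2(W, n, O):
    x = W[n - 15 - O]
    return rot(x, 25) ^ rot(x, 14) ^ (x >> 3)


def P3(W, n, O):
    return W[n - 7 - O]


def P4(W, n, O):
    return W[n - 16 - O]


def expand_W(W, n, O):
    """Model of the W(n) macro: store and return the new schedule word."""
    W[n - O] = (P4(W, n, O) + P3(W, n, O) + P2(W, n, O) + P1(W, n, O)) & MASK
    return W[n - O]


# Hypothetical usage: O = 0 and a 64-entry W whose first 16 words are set.
W = list(range(1, 17)) + [0] * 48
for n in range(16, 64):
    expand_W(W, n, 0)
```

Under the left-rotate assumption, P1 and P2 correspond to the SHA-256 small sigma functions, since rotating a 32-bit word left by 15 is the same as rotating it right by 17, and so on.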

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
kano
December 08, 2011, 11:27:41 AM
 #326

Small change I could suggest just looking at some of the code.  I notice that some variables use simple addition and subtraction a few times.  For example:  
Code:
// intermediate W calculations
#define P1(n) (rot(W[(n) - 2 - O], 15U) ^ rot(W[(n) - 2 - O], 13U) ^ (W[(n) - 2 - O] >> 10U))
#define P2(n) (rot(W[(n) - 15 - O], 25U) ^ rot(W[(n) - 15 - O], 14U) ^ (W[(n) - 15 - O] >> 3U))
#define P3(n) W[n - 7 - O]
#define P4(n) W[n - 16 - O]

// full W calculation
#define W(n) (W[n - O] = P4(n) + P3(n) + P2(n) + P1(n))
You notice that n - O comes up about 3 times in a row.  Wouldn't it be better to just combine n - O into its own variable and subtract from it to reduce the number of variables that have to be read?  After all, n - 7 - O and n - 16 - O share the same n - O term.  If you combine n - O, it's just a simple "pull from buffer and subtract this number" problem.
I haven't tested it, but I imagine it could speed things along slightly.
Just looking at that from a standard compiler point of view
(I have no idea how good or bad or literal the OpenCL compiler is)
The compiler would probably notice that anyway if the OpenCL compiler was able to do basic optimisations.

Actually, Diapolo, do you know if it has an optimiser in it? (and how good it is?)
i.e. would that change suggested make a difference, or would the compiler work it out itself?
... just curious :)

Pool: https://kano.is - low 0.5% fee PPLNS 3 Days - Most reliable Solo with ONLY 0.5% fee   Bitcointalk thread: Forum
Discord support invite at https://kano.is/ Majority developer of the ckpool code - k for kano
The ONLY active original developer of cgminer. Original master git: https://github.com/kanoi/cgminer
deepceleron
December 08, 2011, 12:15:20 PM
 #327

Yes, in fact there are some tweaks done in the code now to make the OpenCL compiler produce more optimized code than it normally does.
Diapolo (OP)
December 08, 2011, 06:25:19 PM
 #328

Small change I could suggest just looking at some of the code.  I notice that some variables use simple addition and subtraction a few times.  For example:  
Code:
// intermediate W calculations
#define P1(n) (rot(W[(n) - 2 - O], 15U) ^ rot(W[(n) - 2 - O], 13U) ^ (W[(n) - 2 - O] >> 10U))
#define P2(n) (rot(W[(n) - 15 - O], 25U) ^ rot(W[(n) - 15 - O], 14U) ^ (W[(n) - 15 - O] >> 3U))
#define P3(n) W[n - 7 - O]
#define P4(n) W[n - 16 - O]

// full W calculation
#define W(n) (W[n - O] = P4(n) + P3(n) + P2(n) + P1(n))
You notice that n - O comes up about 3 times in a row.  Wouldn't it be better to just combine n - O into its own variable and subtract from it to reduce the number of variables that have to be read?  After all, n - 7 - O and n - 16 - O share the same n - O term.  If you combine n - O, it's just a simple "pull from buffer and subtract this number" problem.
I haven't tested it, but I imagine it could speed things along slightly.
Just looking at that from a standard compiler point of view
(I have no idea how good or bad or literal the OpenCL compiler is)
The compiler would probably notice that anyway if the OpenCL compiler was able to do basic optimisations.

Actually, Diapolo, do you know if it has an optimiser in it? (and how good it is?)
i.e. would that change suggested make a difference, or would the compiler work it out itself?
... just curious :)

It's not a beneficial change, because the compiler optimizes this out anyway, and the current form keeps the code a bit more readable.
I'm pretty sure the easy optimizations are all done, but if you guys prove me wrong it would be nice ;).

Dia

d3m0n1q_733rz
December 09, 2011, 04:54:42 AM
 #329

Is there a way to disassemble the compiled version into a readable format so that I can search for things to optimize?  I've learned never to leave to a compiler what you can do yourself.  Sometimes compilers will take you at your word.

kano
December 09, 2011, 08:14:08 AM
 #330

True - however, consider this little comparison ...
A reasonably simple version of sha256 in C, when compiled with -O2 versus without, almost doubles in performance.
(yeah I spent a couple of weeks recently playing with sha256 in C code and seeing what I could do with it ... and early on wondering why I was getting such bad results when I noticed I stupidly left out -O2 ... :P)
Their compiler may not be as good as gcc, but hopefully not much worse.

Of course, yes, do try; many will be interested in your results :)

gat3way
December 09, 2011, 01:55:33 PM
 #331

The OpenCL compiler does perform constant folding as an optimization pass. It is an obvious optimization, so there is no need to try this.
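
What constant folding means for the suggestion above can be sketched in a few lines of Python. This models only the index arithmetic, not AMD's compiler; the values n = 40 and O = 15 are arbitrary placeholders, and both function names are invented for illustration. The point is that if n and O are known at compile time, each index collapses to a literal whether or not n - O is first stored in a helper variable.

```python
def indices_as_written(n, O):
    """Indices exactly as the macros spell them out."""
    return {
        "dst": n - O,
        "P1": n - 2 - O,
        "P2": n - 15 - O,
        "P3": n - 7 - O,
        "P4": n - 16 - O,
    }


def indices_with_base(n, O):
    """Same indices with n - O hoisted into one variable, as suggested."""
    base = n - O
    return {
        "dst": base,
        "P1": base - 2,
        "P2": base - 15,
        "P3": base - 7,
        "P4": base - 16,
    }


n, O = 40, 15  # placeholder values, not taken from the kernel
print(indices_as_written(n, O))
print(indices_as_written(n, O) == indices_with_base(n, O))
```

Both spellings reduce to the same five constants, so a compiler that folds constants would be expected to emit the same code for either one.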
d3m0n1q_733rz
December 10, 2011, 11:50:49 PM
 #332

Anyone know of a decompiler I can use to look at the compiled source?  It'll help me remove unnecessary variables and the like.  Granted, I'm only decent with assembly at the moment, but I wouldn't mind seeing the finished product when the optimizer takes hold.

Diapolo (OP)
December 11, 2011, 09:05:42 AM
 #333

Anyone know of a decompiler I can use to look at the compiled source?  It'll help me remove unnecessary variables and the like.  Granted, I'm only decent with assembly at the moment, but I wouldn't mind seeing the finished product when the optimizer takes hold.

Take a look at AMD APP KernelAnalyzer 1.9. It creates assembly-like output for OpenCL kernels and gives register information and that stuff ... it's in the AMD APP SDK.

Dia

gat3way
December 19, 2011, 09:54:19 PM
 #334

Is anyone interested in keeping that kernel up to date with SDK 2.6? 3-component vectors are working now, and it would need to be reordered a bit again to get better ALUPacking, as the compiler backend has apparently changed in some way. I lost my interest in bitcoin, but it would be an interesting experiment. I believe pre-2.6 speeds can easily be regained.
Diapolo (OP)
December 20, 2011, 04:07:11 PM
 #335

I made a few quick performance checks on a 6950 + a 6650D (APU) and it's weird. CGMINER is quite a bit slower with phatk2, compared to Phoenix 1.7 with my latest kernel on my rig.

For the 6950, CGMINER 2.0.8 is @ 330 MH/s with:
Code:
-I 8 -d 0 -v 2 -w 128 --auto-gpu --gpu-fan 25-50 --gpu-engine 800 --gpu-memclock 1250 --temp-target 70
For the 6950, Phoenix 1.7 is @ 355 MH/s with:
Code:
-a 50 -k phatk AGGRESSION=12 DEVICE=0 FASTLOOP=false VECTORS2 WORKSIZE=128

Am I missing something? Both run with 800 / 1250 and 2-component Vectors + Worksize of 128. I'm using SDK 2.6 installed with Cat 12.1 Preview!

Dia

Dexter770221
December 20, 2011, 09:43:22 PM
 #336

For 6950 and cgminer I have identical hashrate. But memclock is set to 690. Catalyst 11.9

Under development Modular UPGRADEABLE Miner (MUM). Looking for investors.
Changing one PCB with screwdriver and you have brand new miner in hand... Plug&Play, scalable from one module to thousands.
Diapolo (OP)
December 20, 2011, 10:23:48 PM
 #337

For 6950 and cgminer I have identical hashrate. But memclock is set to 690. Catalyst 11.9

Could you give Phoenix 1.7 with the latest version posted in post 1 a try and report back :)?

Thanks,
Dia

Btw.: Is anyone able to help me get 3-component vectors to work? The kernel should be valid, but in __init__.py line 50
Code:
self.size = (nonceRange.size / rateDivisor) / self.iterations
it seems that
Code:
nonceRange.size / rateDivisor
(rateDivisor == 3 if VECTORS3 is used as kernel argument instead of VECTORS2) generates a problem, because nonceRange.size is a multiple of 256, which is not divisible by 3.
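
The divisibility problem can be reproduced outside the miner. The sketch below is not Phoenix code: `split_range` is a hypothetical helper, the range size 256 * 1024 is a made-up example, and rounding the range down to a multiple of the divisor is just one possible workaround, an untested assumption rather than a fix.

```python
def split_range(size, rate_divisor, iterations):
    """Return (work size per iteration, nonces left over).

    The range is rounded down to a multiple of rate_divisor first,
    so the division by the vector width is exact.
    """
    leftover = size % rate_divisor
    usable = size - leftover
    return (usable // rate_divisor) // iterations, leftover


# 256 leaves remainder 1 when divided by 3, so 256 * k is divisible by 3
# only when k itself is.
for k in (1, 2, 3, 4):
    print(k, (256 * k) % 3)

print(split_range(256 * 1024, 2, 1))  # VECTORS2: divides exactly
print(split_range(256 * 1024, 3, 1))  # VECTORS3: one nonce left over
```

With this approach the leftover nonces would simply go unchecked unless the caller handles them separately, which is a trade-off the real miner would have to decide on.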

ssateneth
December 27, 2011, 03:21:39 AM
 #338

So what's the latest kernel? 8-27? Or is there a secret newer version that I'm not seeing? Because according to the main page, there is an unreleased kernel that's faster than 8-27, which is also called current. Where can I get the current kernel?

Diapolo (OP)
December 27, 2011, 09:37:44 AM
 #339

So what's the latest kernel? 8-27? Or is there a secret newer version that I'm not seeing? Because according to the main page, there is an unreleased kernel that's faster than 8-27, which is also called current. Where can I get the current kernel?

It's not released, because I had no time over Christmas ;) ... I guess I can put it up later today or tomorrow.

Dia

naz86
December 27, 2011, 10:52:27 AM
 #340

Hi Diapolo,

Do you think we can still get such big improvements as in the past?