dishwara
Legendary
Offline
Activity: 1855
Merit: 1016
|
|
November 04, 2011, 03:11:47 PM |
|
Sorry for not donating. I tried & used your software when I was mining. 2 BTC sent to the address in your signature: 1B6LEGEUu1USreFNaUfvPWLu6JZb7TLivM. Something is better than nothing. Thank you for your valuable kernel.
|
|
|
|
gat3way
|
|
November 05, 2011, 08:27:23 PM |
|
I did, liked getting (positive) feedback and being an interesting part of Bitcoin for a short period of time. Dia
Still coding OpenCL stuff, Diapolo?
|
|
|
|
Diapolo (OP)
|
|
November 06, 2011, 10:22:05 AM |
|
I did, liked getting (positive) feedback and being an interesting part of Bitcoin for a short period of time. Dia
Still coding OpenCL stuff, Diapolo?
I had no time to make any further progress. From time to time I visit AMD's OpenCL forum to stay a little up to date, but I'm currently not coding. The last thing I tried was to implement 3-component vectors in the kernel, but AMD's drivers still seem buggy there. Dia
|
|
|
|
Diapolo (OP)
|
|
November 06, 2011, 10:22:30 AM |
|
Sorry for not donating. I tried & used your software when I was mining. 2 BTC sent to the address in your signature: 1B6LEGEUu1USreFNaUfvPWLu6JZb7TLivM. Something is better than nothing. Thank you for your valuable kernel.
I received your donation, a warm thank you. Dia
|
|
|
|
d3m0n1q_733rz
|
|
December 08, 2011, 08:50:39 AM Last edit: December 08, 2011, 09:01:41 AM by d3m0n1q_733rz |
|
Small change I could suggest just from looking at some of the code. I notice that some variables use simple addition and subtraction a few times. For example:

// intermediate W calculations
#define P1(n) (rot(W[(n) - 2 - O], 15U) ^ rot(W[(n) - 2 - O], 13U) ^ (W[(n) - 2 - O] >> 10U))
#define P2(n) (rot(W[(n) - 15 - O], 25U) ^ rot(W[(n) - 15 - O], 14U) ^ (W[(n) - 15 - O] >> 3U))
#define P3(n) W[n - 7 - O]
#define P4(n) W[n - 16 - O]

// full W calculation
#define W(n) (W[n - O] = P4(n) + P3(n) + P2(n) + P1(n))

You'll notice that n - O comes up several times. Wouldn't it be better to combine n - O into its own variable and subtract from that, to reduce the number of computations required per read? After all, n - 7 - O has the same form as n - 16 - O. If you combine n - O, it's just a simple "pull from buffer and subtract this number" problem. I haven't tested it, but I imagine it could speed things along slightly.
|
Funroll_Loops, the theoretically quicker breakfast cereal! Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
|
|
|
kano
Legendary
Offline
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
|
|
December 08, 2011, 11:27:41 AM |
|
Small change I could suggest just from looking at some of the code. I notice that some variables use simple addition and subtraction a few times. For example:

// intermediate W calculations
#define P1(n) (rot(W[(n) - 2 - O], 15U) ^ rot(W[(n) - 2 - O], 13U) ^ (W[(n) - 2 - O] >> 10U))
#define P2(n) (rot(W[(n) - 15 - O], 25U) ^ rot(W[(n) - 15 - O], 14U) ^ (W[(n) - 15 - O] >> 3U))
#define P3(n) W[n - 7 - O]
#define P4(n) W[n - 16 - O]

// full W calculation
#define W(n) (W[n - O] = P4(n) + P3(n) + P2(n) + P1(n))

You'll notice that n - O comes up several times. Wouldn't it be better to combine n - O into its own variable and subtract from that, to reduce the number of computations required per read? After all, n - 7 - O has the same form as n - 16 - O. If you combine n - O, it's just a simple "pull from buffer and subtract this number" problem. I haven't tested it, but I imagine it could speed things along slightly.

Just looking at that from a standard compiler point of view (I have no idea how good, bad, or literal the OpenCL compiler is): the compiler would probably notice that anyway, if it is able to do basic optimisations. Actually, Diapolo, do you know if it has an optimiser in it (and how good it is)? I.e., would the suggested change make a difference, or would the compiler work it out itself? ... just curious
|
|
|
|
deepceleron
Legendary
Offline
Activity: 1512
Merit: 1036
|
|
December 08, 2011, 12:15:20 PM |
|
Yes, in fact there are some tweaks done in the code now to make the OpenCL compiler produce more optimized code than it normally does.
|
|
|
|
Diapolo (OP)
|
|
December 08, 2011, 06:25:19 PM |
|
Small change I could suggest just from looking at some of the code. I notice that some variables use simple addition and subtraction a few times. For example:

// intermediate W calculations
#define P1(n) (rot(W[(n) - 2 - O], 15U) ^ rot(W[(n) - 2 - O], 13U) ^ (W[(n) - 2 - O] >> 10U))
#define P2(n) (rot(W[(n) - 15 - O], 25U) ^ rot(W[(n) - 15 - O], 14U) ^ (W[(n) - 15 - O] >> 3U))
#define P3(n) W[n - 7 - O]
#define P4(n) W[n - 16 - O]

// full W calculation
#define W(n) (W[n - O] = P4(n) + P3(n) + P2(n) + P1(n))

You'll notice that n - O comes up several times. Wouldn't it be better to combine n - O into its own variable and subtract from that? I haven't tested it, but I imagine it could speed things along slightly.

Just looking at that from a standard compiler point of view: the compiler would probably notice that anyway, if it is able to do basic optimisations. Actually, Diapolo, do you know if it has an optimiser in it (and how good it is)? I.e., would the suggested change make a difference, or would the compiler work it out itself? ... just curious

It's not a beneficial change, because the compiler optimizes this out, plus the current form makes the code a bit more readable. I'm pretty sure the easy optimizations are all done, but if you guys prove me wrong, that would be nice. Dia
|
|
|
|
d3m0n1q_733rz
|
|
December 09, 2011, 04:54:42 AM |
|
Is there a way to disassemble the compiled version into a readable format, so that I can do a little searching for things to optimize? I've learned never to leave to a compiler what you can do yourself. Sometimes compilers will take you at your word.
|
|
|
|
kano
Legendary
Offline
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
|
|
December 09, 2011, 08:14:08 AM |
|
True - however, consider this little comparison ... A reasonably simple version of sha256 in C, compiled with -O2 versus without, almost doubles in performance. (Yeah, I recently spent a couple of weeks playing with sha256 in C code to see what I could do with it ... early on I was wondering why I was getting such bad results, until I noticed I had stupidly left out -O2 ...) Their compiler may not be as good as gcc, but hopefully not much worse. Of course, yes, do try, and many will be interested in your results.
|
|
|
|
gat3way
|
|
December 09, 2011, 01:55:33 PM |
|
The OpenCL compiler does perform constant folding as an optimization pass. It is an obvious optimization; no need to try this.
|
|
|
|
d3m0n1q_733rz
|
|
December 10, 2011, 11:50:49 PM |
|
Anyone know of a decompiler I can use to look at the compiled output? It'll help me remove unnecessary variables and the like. Granted, I'm only decent with assembly at the moment, but I wouldn't mind seeing the finished product once the optimizer takes hold.
|
|
|
|
Diapolo (OP)
|
|
December 11, 2011, 09:05:42 AM |
|
Anyone know of a decompiler I can use to look at the compiled source? It'll help me remove unnecessary variables and the like. Granted, I'm only decent with assembly at the moment, but I wouldn't mind seeing the finished product when the optimizer takes hold.
Take a look at the AMD APP KernelAnalyzer 1.9, it creates assembly-like output for OpenCL kernels and gives register information and the like ... it's in the AMD APP SDK. Dia
|
|
|
|
gat3way
|
|
December 19, 2011, 09:54:19 PM |
|
Someone interested in keeping that kernel up to date with 2.6? 3-component vectors are working now, and the kernel would need to be reordered a bit again to get better ALUPacking, as the compiler backend has apparently changed somewhat. I lost my interest in bitcoin, but it would be an interesting experiment. I believe pre-2.6 speeds can easily be regained.
|
|
|
|
Diapolo (OP)
|
|
December 20, 2011, 04:07:11 PM |
|
I made a few quick performance checks on a 6950 + a 6650D (APU) and it's weird. CGMINER is quite a bit slower with phatk2, compared to Phoenix 1.7 with my latest kernel on my rig.
For the 6950, CGMINER 2.0.8 is @ 330 MH/s with: -I 8 -d 0 -v 2 -w 128 --auto-gpu --gpu-fan 25-50 --gpu-engine 800 --gpu-memclock 1250 --temp-target 70
For the 6950, Phoenix 1.7 is @ 355 MH/s with: -a 50 -k phatk AGGRESSION=12 DEVICE=0 FASTLOOP=false VECTORS2 WORKSIZE=128
Am I missing something? Both run at 800 / 1250 with 2-component vectors + a worksize of 128. I'm using SDK 2.6, installed with the Cat 12.1 preview! Dia
|
|
|
|
Dexter770221
Legendary
Offline
Activity: 1029
Merit: 1000
|
|
December 20, 2011, 09:43:22 PM |
|
For the 6950 and cgminer I get an identical hashrate, but memclock is set to 690. Catalyst 11.9.
|
Under development: Modular UPGRADEABLE Miner (MUM). Looking for investors. Change one PCB with a screwdriver and you have a brand-new miner in hand ... Plug & Play, scalable from one module to thousands.
|
|
|
Diapolo (OP)
|
|
December 20, 2011, 10:23:48 PM |
|
For the 6950 and cgminer I get an identical hashrate, but memclock is set to 690. Catalyst 11.9.
Could you give Phoenix 1.7 with my latest version from post 1 a try and report back? Thanks, Dia
Btw.: Is anyone able to help me get 3-component vectors to work? The kernel should be valid, but in __init__.py line 50:
self.size = (nonceRange.size / rateDivisor) / self.iterations
it seems that nonceRange.size / rateDivisor (rateDivisor == 3 if VECTORS3 is used as the kernel argument instead of VECTORS2) causes a problem, because nonceRange.size is a multiple of 256, which is not divisible by 3.
|
|
|
|
ssateneth
Legendary
Offline
Activity: 1344
Merit: 1004
|
|
December 27, 2011, 03:21:39 AM |
|
So what's the latest kernel? 8-27? Or is there a secret newer version that I'm not seeing? Because according to the main page, there is an unreleased kernel that's faster than 8-27, which is also called "current". Where can I get the current kernel?
|
|
|
|
Diapolo (OP)
|
|
December 27, 2011, 09:37:44 AM |
|
So what's the latest kernel? 8-27? Or is there a secret newer version that I'm not seeing? Because according to the main page, there is an unreleased kernel that's faster than 8-27, which is also called "current". Where can I get the current kernel?
It's not released, because I had no time over Christmas ... I guess I can put it up later today or tomorrow. Dia
|
|
|
|
naz86
Member
Offline
Activity: 111
Merit: 10
|
|
December 27, 2011, 10:52:27 AM |
|
Hi Diapolo,
do you think we can still get improvements as big as the ones in the past?
|
|
|
|
|