Bitcoin Forum
Topic: -- Optimized poclbm kernel! Another 5 Mhash/s --
Vince (OP)
July 02, 2011, 05:24:48 AM
Last edit: July 02, 2011, 06:06:52 AM by Vince
 #1

Want some more Mhash/s? Try this optimized kernel!

Tested with the Phoenix miner: my HD6950 (stock clocks, locked shaders) went from 343 Mhash/s to 349 Mhash/s!

This kernel also contains the optimization already posted on this forum, namely "Ma z^x". It is not mine and I'm not taking credit for it! The 343 Mhash/s figure already included this patch.

What's new:
Lots of small changes, some of which save only a single addition.

Code:
#1:
Before:
H = 0xb0edbdd0 + K[ 0] +  W0; D = 0xa54ff53a + H; H = H + 0x08909ae5U;

After:
H = W0 + 4228417613; D = W0 + 2563236514;
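The folded constants in #1 can be checked on the host. Below is a minimal Python sketch that emulates OpenCL's 32-bit uint wrap-around with a mask. K[0] = 0x428a2f98 is the standard SHA-256 round constant. The function names are made up for illustration. The sketch compares the before and after expressions for a few sample W0 values:

```python
# Emulate 32-bit unsigned wrap-around, as uint arithmetic does in OpenCL.
MASK32 = 0xFFFFFFFF
K0 = 0x428A2F98  # SHA-256 round constant K[0]


def change1_before(w0):
    """H and D as computed by the original three statements."""
    h = (0xB0EDBDD0 + K0 + w0) & MASK32
    d = (0xA54FF53A + h) & MASK32
    h = (h + 0x08909AE5) & MASK32
    return h, d


def change1_after(w0):
    """H and D with all constants folded into one addend each."""
    return (w0 + 4228417613) & MASK32, (w0 + 2563236514) & MASK32


# Addition mod 2**32 is associative, so the two must agree for every W0.
for w0 in (0, 1, 0x7FFFFFFF, 0x80000000, 0xDEADBEEF, MASK32):
    assert change1_before(w0) == change1_after(w0)
print(hex(4228417613), hex(2563236514))  # 0xfc08884d 0x98c7e2a2
```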

#2:
Before:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1) + K[ 4] +  0x80000000;

After:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1);
+ The constant K[ 4] + 0x80000000 moves into the Python pre-calculation:
-> self.state2[3] = np.uint32(self.state2[3]+3109470811);
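Change #2 uses the same trick. K[ 4] + 0x80000000 does not depend on the nonce, so the host can add it once per work unit instead of the kernel adding it for every hash. The sketch below assumes, as the post implies, that self.state2[3] is the value the kernel receives as D1. `sigma_ch` is a hypothetical stand-in for the rotr/Ch part of the expression:

```python
MASK32 = 0xFFFFFFFF
K4 = 0x3956C25B  # SHA-256 round constant K[4]

# Host-side constant from change #2.
FOLDED = (K4 + 0x80000000) & MASK32  # == 3109470811


def d_before(d1, sigma_ch):
    """Original kernel line: both constants added for every nonce."""
    return (d1 + sigma_ch + K4 + 0x80000000) & MASK32


def precalc_d1(d1):
    """Once per work unit on the host: fold the constant into D1."""
    return (d1 + FOLDED) & MASK32


def d_after(d1_folded, sigma_ch):
    """Patched kernel line: one addition fewer per nonce."""
    return (d1_folded + sigma_ch) & MASK32


for d1, s in ((0, 0), (0x12345678, 0x9ABCDEF0), (MASK32, MASK32)):
    assert d_before(d1, s) == d_after(precalc_d1(d1), s)
```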

#3:
Before:
H = ....   K[60] + W12;
H+=0x5be0cd19U;
if (H == 0)

After:
if (H == 325071597)
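Comparing against a constant in #3 works because x + c == 0 (mod 2^32) exactly when x == -c (mod 2^32). The value 325071597 equals -(K[60] + 0x5be0cd19) mod 2^32. This suggests that the K[60] term is also dropped from H, not just the H += line. That reading is inferred from the arithmetic and is not stated in the post. In the minimal Python sketch below, `h_partial` is a hypothetical name for H without those two constants:

```python
MASK32 = 0xFFFFFFFF
K60 = 0x90BEFFFA  # SHA-256 round constant K[60]
H7 = 0x5BE0CD19   # the constant added by H += 0x5be0cd19U (SHA-256 initial H7)

# Negate the folded constants in 32-bit arithmetic.
TARGET = -(K60 + H7) & MASK32  # == 325071597


def found_before(h_partial):
    """Original: add K[60], then 0x5be0cd19, then test for zero."""
    h = (h_partial + K60) & MASK32
    h = (h + H7) & MASK32
    return h == 0


def found_after(h_partial):
    """Patched: a single comparison, no additions."""
    return h_partial == TARGET


for x in (0, 1, TARGET - 1, TARGET, TARGET + 1, MASK32):
    assert found_before(x) == found_after(x)
```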

#4:
Before:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        else if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

After:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

Why stop checking once we have found a result? It's unlikely, but we could have found two, and the extra check adds almost no overhead.
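To see what #4 changes, here is a small Python simulation of the result buffer. OUTPUT_SIZE = 0x100 and OUTPUT_MASK = OUTPUT_SIZE - 1 are assumed values for illustration, and the sketch treats a 0 entry as an empty slot. When both lanes of a vector hit, the else-if version records only nonce.x. The two independent ifs record both nonces:

```python
OUTPUT_SIZE = 0x100            # assumed buffer size, for illustration only
OUTPUT_MASK = OUTPUT_SIZE - 1  # assumed to be OUTPUT_SIZE - 1


def record(output, nonce):
    # Mirrors: output[OUTPUT_SIZE] = output[nonce & OUTPUT_MASK] = nonce;
    output[nonce & OUTPUT_MASK] = nonce
    output[OUTPUT_SIZE] = nonce  # "something was found" flag slot


def check_else_if(output, h, nonce):
    if h[0] == 0:
        record(output, nonce[0])
    elif h[1] == 0:
        record(output, nonce[1])


def check_if_if(output, h, nonce):
    if h[0] == 0:
        record(output, nonce[0])
    if h[1] == 0:
        record(output, nonce[1])


def found(output):
    # Scan the slots; 0 is treated as "empty" in this sketch.
    return sorted(v for v in output[:OUTPUT_SIZE] if v)


# Both lanes of one vector produce a hit (rare, but possible).
a = [0] * (OUTPUT_SIZE + 1)
b = [0] * (OUTPUT_SIZE + 1)
check_else_if(a, (0, 0), (1000, 1001))
check_if_if(b, (0, 0), (1000, 1001))
print(found(a), found(b))  # [1000] [1000, 1001]
```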

#5:
Lots of small changes (the compiler already optimized some of them before, but anyway).


For #2 I changed the pre-calculation in __init__.py, so take a look at that too! You can use diff; it's just 2 lines.

Please note: this is part of the result of >100 hours of hard work. If you want me to keep posting patches, say thank you in the form of a small donation. Anything above 0.01 is just fine ;)
-> 1Dsxro7GvNDaxWkvMgkraEttAA4xqagxVp

Btw, I already have some more (minor) optimizations.

Here it is:
http://www.filesonic.com/file/1348177284/poclbm_kernel.zip


Please post some results!
bitless
July 02, 2011, 05:44:45 AM
 #2

I've sent you a small donation for your hard work, but...

Do pools accept any hashes generated with your kernel? Really? For instance, the 'local' optimization was declared invalid (it messes up the calculation, so the thread got locked by the moderator), etc.

EDIT: As to why exit early on the if()s... well, if you have already found a solution, why do you need a second one? Branches on the GPU are very expensive (threads may diverge, etc.), so two branches may, and most likely will, end up being worse than one. May I suggest if(min(x,y)==0) { output x; }? Assuming min can be done without branching, this is a single branch when neither x nor y holds a solution (if min is not a single instruction, find another function to replace it). The CPU side then tries both x and (x+1) to figure out which one is the real solution.
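Here is a minimal Python sketch of the min() idea. It assumes nonce.y == nonce.x + 1, as implied by "try both x and (x+1)". `is_solution` is a hypothetical stand-in for whatever full hash check the host performs, and the function names are made up:

```python
MASK32 = 0xFFFFFFFF


def kernel_min_check(output, h, nonce_x):
    # One test instead of two: for unsigned values, min(H.x, H.y) == 0
    # exactly when at least one lane hit. Only nonce.x is reported.
    if min(h) == 0:
        output.append(nonce_x)


def host_resolve(candidates, is_solution):
    # The host re-checks nonce.x and nonce.x + 1 to find the real winner(s).
    winners = []
    for n in candidates:
        for cand in (n, (n + 1) & MASK32):
            if is_solution(cand):
                winners.append(cand)
    return winners


out = []
kernel_min_check(out, (5, 0), 4242)   # lane y hit, so nonce 4243 is the winner
kernel_min_check(out, (7, 9), 5000)   # no hit, nothing reported
print(host_resolve(out, lambda n: n == 4243))  # [4243]
```

The trade-off is the one described in the post: the kernel does one comparison instead of two, and the rare extra work of telling the lanes apart moves to the CPU.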



fascistmuffin
July 02, 2011, 05:45:39 AM
 #3

I'm confused by change #4.

The else if should save a few operations by itself, and if the code already runs correctly with the else if in place, letting a double assignment into the arrays run would be costly.
Vince (OP)
July 02, 2011, 05:53:14 AM
 #4

Code:
if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}
else if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}

The first condition, H.x == 0, is almost always false, so it's 2 comparisons almost every time: exactly the same speed as without the "else".

The assignments are only done when a hash is found, this does not affect speed at all.

It's not a double assignment: the second branch writes to output[nonce.y & OUTPUT_MASK]. That's just to get around race conditions.
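The cost argument above can be illustrated by counting the tests each variant executes over a batch in which hits are rare. This Python sketch only counts comparisons and deliberately ignores GPU effects such as divergence, so it illustrates the argument and is not a benchmark. The `first` parameter selects which lane is tested first, so it also covers flipping the order:

```python
def count_else_if(pairs, first=0):
    # Comparisons made by: if (H[first] == 0) {...} else if (H[other] == 0) {...}
    total = 0
    for h in pairs:
        total += 1              # the first test always runs
        if h[first] != 0:
            total += 1          # the else-if test runs whenever the first misses
    return total


def count_if_if(pairs):
    # Two independent ifs: both tests always run.
    return 2 * len(pairs)


# 999 passes with no hit, then one pass where lane x hits.
pairs = [(1, 2)] * 999 + [(0, 5)]
print(count_else_if(pairs), count_else_if(pairs, first=1), count_if_if(pairs))
# 1999 2000 2000
```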

The __local patch is invalid? First time I've heard of it. Sure, I'll remove it then. The other changes produce valid hashes; I see no reason why they should not be valid. It's all calculated on paper, step by step, and it looks equivalent to me.

Thanks for the donation!
bitless
July 02, 2011, 05:58:50 AM
 #5

Actually, *with* the min() used as I suggested earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it :)

For the local, search the board.

EDIT: I'll claim the donations for the min anyway, if (and only if) it works and helps anyone :)
fascistmuffin
July 02, 2011, 05:59:44 AM
 #6

If H.x == 0 is almost always false, flip the if...else statement:

Code:
if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}
else if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}

It'd be faster than a double if statement since it'd be a single comparison in most cases.
Vince (OP)
July 02, 2011, 06:05:14 AM
 #7

Quote from: bitless
Actually, *with* the min() used as I suggested earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it :)

For the local, search the board.

EDIT: I'll claim the donations for the min anyway, if (and only if) it works and helps anyone :)

I've seen the issues with __local and removed it from the zip.

The min() has exactly the same speed here; maybe you can get it faster?
bitless
July 02, 2011, 06:14:09 AM
 #8

Yeah, the min() seems to help, but so little that I can't see the difference without a profiler :) And since I haven't tested the change nearly enough...
Vince (OP)
July 02, 2011, 06:22:20 AM
 #9

I tested this one on the pools (even with __local) without any problems, and generated a block on testnet.
bitless
July 02, 2011, 06:25:55 AM
 #10

Well, I meant I haven't tested my min() long enough... and I probably won't test it at all, because the difference is not worth the effort (well, I'll test it together with other kernel mods, if I have any).

I haven't tried your changes yet. I honestly don't understand why it works with the local for anyone, but I like the constant thing you've done :)
Diapolo
July 02, 2011, 07:37:12 AM
 #11

Seems like you used some ideas similar to the ones I had for phatk :)

Look here: http://forum.bitcoin.org/index.php?topic=25135.msg314520#msg314520

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo
July 02, 2011, 07:39:36 AM
 #12

Quote from: fascistmuffin
I'm confused by change #4.

The else if should save a few operations by itself, and if the code already runs correctly with the else if in place, letting a double assignment into the arrays run would be costly.

Control flow (if/else statements) in OpenCL kernels always slows down computation. Both paths need to be evaluated, so using if/else or if/if should make little or no difference.

Dia

Vince (OP)
July 02, 2011, 11:17:47 PM
 #13

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!
server
July 02, 2011, 11:38:10 PM
 #14

Quote from: Vince
Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on a 5870 and my hash rate went down from 392 Mhash/s (Dia's kernel) to 374 Mhash/s.

Anibalayl
July 02, 2011, 11:41:06 PM
 #15

interesting
Diapolo
July 03, 2011, 07:39:53 AM
 #16

Quote from: server
Yup, but sorry... I tried your kernel on a 5870 and my hash rate went down from 392 Mhash/s (Dia's kernel) to 374 Mhash/s.

What were your values with the stock kernel, if I may ask?

Dia

server
July 03, 2011, 11:11:52 AM
 #17

Quote from: Diapolo
What were your values with the stock kernel, if I may ask?

Dia

I use this:

phoenix.exe -u ... -k poclbm VECTORS BFI_INT FASTLOOP=false AGGRESSION=11 DEVICE=0

(long-term rejection rate is between 1 and 1.5%)

