Bitcoin Forum
December 11, 2016, 12:06:12 PM
Topic: -- Optimized poclbm kernel! Another 5 Mhash/s --  (Read 1853 times)
Vince
Jr. Member
Activity: 38
July 02, 2011, 05:24:48 AM
#1

Want some more Mhash/s? Try this optimized kernel!

Tested with Phoenix miner: my HD6950 (stock clocks, locked shaders) went from 343 Mhash/s to 349 Mhash/s!

This kernel also contains the optimization already posted on this forum, namely "Ma z^x". That one is not mine and I'm not taking credit for it! The 343 Mhash/s baseline already included this patch.

What's new:
Lots of small changes, some only save a single addition.

Code:
#1:
Before:
H = 0xb0edbdd0 + K[ 0] +  W0; D = 0xa54ff53a + H; H = H + 0x08909ae5U;

After:
H = W0 + 4228417613; D = W0 + 2563236514;
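As a sanity check on #1, the two folded constants can be recomputed from the "Before" line with 32-bit wrap-around arithmetic. A minimal Python sketch, assuming K[0] is the standard SHA-256 round constant 0x428a2f98 and that the kernel's uint additions wrap modulo 2^32; the function names are only for illustration:

```python
MASK32 = 0xFFFFFFFF   # OpenCL uint arithmetic wraps modulo 2**32
K0 = 0x428a2f98       # SHA-256 round constant K[0]


def before(w0):
    """#1 'Before': add the constants one at a time, as in the original kernel."""
    h = (0xb0edbdd0 + K0 + w0) & MASK32
    d = (0xa54ff53a + h) & MASK32          # D uses H before the last addition
    h = (h + 0x08909ae5) & MASK32
    return h, d


def after(w0):
    """#1 'After': each output is W0 plus one pre-folded constant."""
    h = (w0 + 4228417613) & MASK32
    d = (w0 + 2563236514) & MASK32
    return h, d


# Fold the constants from the 'Before' line by hand.
H_CONST = (0xb0edbdd0 + K0 + 0x08909ae5) & MASK32
D_CONST = (0xa54ff53a + 0xb0edbdd0 + K0) & MASK32
print(H_CONST, D_CONST)  # 4228417613 2563236514

for w0 in (0, 1, 0xDEADBEEF, MASK32):
    assert before(w0) == after(w0)
```

Because addition modulo 2^32 is associative and commutative, every constant term can be summed once up front; the kernel is left with a single add per output.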

#2:
Before:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1) + K[ 4] +  0x80000000;

After:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1);
+ the constant K[ 4] + 0x80000000 moves into the Python pre-calculation:
-> self.state2[3] = np.uint32(self.state2[3]+3109470811);
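The 3109470811 in #2 is K[4] + 0x80000000 reduced modulo 2^32, where 0x80000000 is W[4], the fixed padding word in the second 64-byte block of the 80-byte block header. A minimal Python check, assuming the standard SHA-256 value K[4] = 0x3956c25b and assuming (as the post implies) that the kernel's D1 is built from self.state2[3]; the function names and sample inputs are made up for illustration:

```python
MASK32 = 0xFFFFFFFF
K4 = 0x3956c25b       # SHA-256 round constant K[4]
W4 = 0x80000000       # W[4] of the header's second block: the fixed padding word

FOLDED = (K4 + W4) & MASK32
print(FOLDED)  # 3109470811


def kernel_before(d1, s):
    """#2 'Before': s stands for (rotr(A,6)^rotr(A,11)^rotr(A,25)) + Ch(A,B1,C1)."""
    return (d1 + s + K4 + W4) & MASK32


def prefold(d1):
    """Hypothetical host-side step mirroring self.state2[3] + 3109470811, with the wrap made explicit."""
    return (d1 + FOLDED) & MASK32


def kernel_after(d1_folded, s):
    """#2 'After': the kernel no longer adds the constants itself."""
    return (d1_folded + s) & MASK32


for d1 in (0, 0xa54ff53a, MASK32):
    for s in (0, 0x12345678, MASK32):
        assert kernel_before(d1, s) == kernel_after(prefold(d1), s)
```

Masking with `& MASK32` keeps the sketch independent of how a particular NumPy version handles uint32 overflow.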

#3:
Before:
H = ....   K[60] + W12;
H+=0x5be0cd19U;
if (H == 0)

After:
H = ....   W12;
if (H == 325071597)
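The magic number in #3 is the 32-bit two's-complement negation of K[60] + 0x5be0cd19, where 0x5be0cd19 is the SHA-256 initial hash word H7. So H + K[60] + 0x5be0cd19 == 0 is the same test as H == 325071597, provided K[60] is also dropped from the H sum (the "...." part). A Python sketch assuming the standard constant K[60] = 0x90befffa; h_partial is a hypothetical name for the sum before K[60] is added:

```python
MASK32 = 0xFFFFFFFF
K60 = 0x90befffa   # SHA-256 round constant K[60]
H7 = 0x5be0cd19    # SHA-256 initial hash word H7 (the "H += 0x5be0cd19U" term)

# -(K[60] + H7) modulo 2**32: the value H must have for the final sum to be 0
TARGET = -(K60 + H7) & MASK32
print(TARGET)  # 325071597


def found_before(h_partial):
    """#3 'Before': add K[60] and H7, then compare against zero."""
    return ((h_partial + K60 + H7) & MASK32) == 0


def found_after(h_partial):
    """#3 'After': one comparison against the pre-computed constant."""
    return h_partial == TARGET


for h in (0, 1, TARGET - 1, TARGET, TARGET + 1, MASK32):
    assert found_before(h) == found_after(h)
```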

#4:
Before:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        else if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

After:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

Why stop checking once we have found one result? It's unlikely, but we could have found two, and checking both adds almost no overhead.
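To make the difference in #4 concrete, here is a Python simulation of one work-item holding a 2-wide vector (H.x/H.y, nonce.x/nonce.y). OUTPUT_SIZE, OUTPUT_MASK and the helper names are hypothetical stand-ins (the real constants come from the miner), and the sketch only models what gets written, not GPU timing:

```python
OUTPUT_SIZE = 0x100            # hypothetical: the real value comes from the miner's host code
OUTPUT_MASK = OUTPUT_SIZE - 1  # assumes OUTPUT_SIZE is a power of two


def new_output():
    """Slots 0..OUTPUT_MASK hold nonces; output[OUTPUT_SIZE] flags 'something found'."""
    return [0] * (OUTPUT_SIZE + 1)


def report(output, nonce):
    # Same effect as: output[OUTPUT_SIZE] = output[nonce & OUTPUT_MASK] = nonce;
    output[nonce & OUTPUT_MASK] = nonce
    output[OUTPUT_SIZE] = nonce


def check_else_if(output, h, nonce):
    """Original #4: stop after the first lane whose H is zero."""
    if h[0] == 0:
        report(output, nonce[0])
    elif h[1] == 0:
        report(output, nonce[1])


def check_both(output, h, nonce):
    """Changed #4: test each lane independently."""
    if h[0] == 0:
        report(output, nonce[0])
    if h[1] == 0:
        report(output, nonce[1])


def found(output):
    """Host side: collect nonzero slots if the flag is set (nonce 0 cannot be seen this way)."""
    if not output[OUTPUT_SIZE]:
        return []
    return sorted(v for v in output[:OUTPUT_SIZE] if v)


lanes_h = (0, 0)                # both lanes produced H == 0 (rare, but possible)
lanes_nonce = (0x1000, 0x1001)

out_a = new_output()
check_else_if(out_a, lanes_h, lanes_nonce)
out_b = new_output()
check_both(out_b, lanes_h, lanes_nonce)
print(found(out_a), found(out_b))  # [4096] [4096, 4097]
```

With only one lane hitting, both variants record the same nonce; they differ only in the rare double hit. Whether the extra test costs anything on the GPU is a separate question.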

#5:
Lots of small changes (the compiler probably already optimized some of them, but anyway).


For #2 I changed the pre-calculation in __init__.py. Take a look! You can use diff; it's just 2 lines.

Please note: this is part of the result of more than 100 hours of hard work. If you want me to keep posting patches, say thank you in the form of a small donation. Anything above 0.01 is just fine ;)
-> 1Dsxro7GvNDaxWkvMgkraEttAA4xqagxVp

By the way, I already have some more (minor) optimizations.

Here it is:
http://www.filesonic.com/file/1348177284/poclbm_kernel.zip


Please post some results!

Like what I'm posting? -> 1DtHZgVufX1tc5jRCoNiUnygenwRWj939C
bitless
Newbie
Activity: 28
July 02, 2011, 05:44:45 AM
#2

I've sent you a small donation for your hard work, but...

Do pools accept any hashes generated with your kernel? Really? For instance, the 'local' optimization was declared invalid (it messes up the calculation, so the thread got locked by the moderator), etc.

EDIT - As to why exit early on the if()s: if you have already found a solution, why do you need a second one? Branching on the GPU is very expensive (threads may diverge, etc.), so two branches may, and most likely will, end up worse than one. May I suggest if (min(x, y) == 0) { output x; }? Assuming min can be done without branching, that is a single branch when neither x nor y holds a solution (if min is not a single instruction, find another function to replace it). The CPU side then tries both x and (x+1) to figure out which one is the real solution.
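The min() idea can be modelled on the host. A Python sketch: OpenCL C does provide a built-in integer min(), but whether it compiles to a branch-free instruction on a given GPU is the post's assumption, not something this shows. It also assumes, as the post does, that nonce.y == nonce.x + 1; kernel_min_check, host_resolve and the fake H values are hypothetical:

```python
MASK32 = 0xFFFFFFFF


def kernel_min_check(hx, hy, nonce_x):
    """Proposed check: one comparison on min(H.x, H.y) instead of two.

    For unsigned values, min(hx, hy) == 0 exactly when at least one lane is zero.
    Only nonce.x is reported; the host works out which lane actually hit.
    """
    if min(hx, hy) == 0:
        return nonce_x
    return None


def host_resolve(reported, h_of):
    """Try nonce.x and nonce.x + 1 (assumes nonce.y == nonce.x + 1).

    h_of(nonce) -> final 32-bit H word; a stand-in for recomputing the hash on the CPU.
    """
    candidates = (reported, (reported + 1) & MASK32)
    return [n for n in candidates if h_of(n) == 0]


# The single min() test agrees with the two separate tests on edge cases.
samples = (0, 1, 2, MASK32)
for hx in samples:
    for hy in samples:
        assert (min(hx, hy) == 0) == (hx == 0 or hy == 0)

fake_h = {100: 7, 101: 0}          # made-up H words: only lane y (nonce 101) hits
reported = kernel_min_check(fake_h[100], fake_h[101], 100)
print(reported, host_resolve(reported, fake_h.get))  # 100 [101]
```

The trade-off is that the GPU does one test per work-item, and the rare hit costs the CPU one extra hash check.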



fascistmuffin
Jr. Member
Activity: 56
July 02, 2011, 05:45:39 AM
#3

I'm confused by change #4.

The else if should save a few operations just in itself, and if the code already runs correctly with the else if in it, then a double assignment into arrays would be costly if allowed to run.
Vince
Jr. Member
Activity: 38
July 02, 2011, 05:53:14 AM
#4

Code:
if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}
else if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}

The first condition, H.x == 0, is almost always false, so it's 2 comparisons almost every time, exactly the same speed as without the "else".

The assignments are only done when a hash is found, this does not affect speed at all.

It's not a double assignment: the second one goes to output[nonce.y & OUTPUT_MASK]. That's just to get around race conditions.
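The race-condition point can be illustrated with a small Python model: when several work-items find a hit in the same run, writing everything to one slot keeps only the last writer, while indexing by nonce & OUTPUT_MASK keeps every hit whose low bits differ. The OUTPUT_SIZE/OUTPUT_MASK values and function names are hypothetical, and the sequential loop only stands in for an unspecified GPU write order:

```python
OUTPUT_SIZE = 0x100            # hypothetical value
OUTPUT_MASK = OUTPUT_SIZE - 1  # assumes a power-of-two OUTPUT_SIZE


def write_one_slot(nonces):
    """Every hit writes the same slot: whichever write lands last wins."""
    output = [0] * (OUTPUT_SIZE + 1)
    for n in nonces:             # loop order stands in for an arbitrary write order
        output[0] = n
        output[OUTPUT_SIZE] = n
    return output


def write_masked(nonces):
    """Each hit writes slot nonce & OUTPUT_MASK, as in the kernel."""
    output = [0] * (OUTPUT_SIZE + 1)
    for n in nonces:
        output[n & OUTPUT_MASK] = n
        output[OUTPUT_SIZE] = n
    return output


def recovered(output):
    """Nonces the host can still read back from the slots."""
    return sorted(v for v in output[:OUTPUT_SIZE] if v)


hits = [0x2003, 0x5007]                            # two work-items hit in the same run
print(recovered(write_one_slot(hits)))             # [20487]
print(recovered(write_masked(hits)))               # [8195, 20487]
print(recovered(write_masked([0x2003, 0x3003])))   # [12291]  (same low bits: one is lost)
```

The masked scheme does not eliminate collisions; it only makes losing a hit require two hits with identical low bits in the same run.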

The __local patch is invalid? That's the first time I've heard of it... Sure, I'll remove it then. The other changes produce valid hashes; I see no reason why they should not be valid. It's all calculated on paper, step by step, and it looks equivalent to me.

Thanks for the donation!

bitless
Newbie
Activity: 28
July 02, 2011, 05:58:50 AM
#5

Actually, *with* the min() used like I said earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it :)

For the local, search the board.

EDIT - I'll claim the donations for the min anyway, if (and only if) it works and helps anyone :)
fascistmuffin
Jr. Member
Activity: 56
July 02, 2011, 05:59:44 AM
#6

If H.x == 0 is almost always false, flip the if...else statement:

Code:
if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}
else if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}

It'd be faster than a double if statement since it'd be a single comparison in most cases.
Vince
Jr. Member
Activity: 38
July 02, 2011, 06:05:14 AM
#7

Quote from: bitless
> Actually, *with* the min() used like I said earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it :)
>
> For the local, search the board.
>
> EDIT - I'll claim the donations for the min anyway, if (and only if) it works and helps anyone :)

I've seen the issues with __local and removed it from the zip.

The min() has exactly the same speed here - maybe you can get it faster?

bitless
Newbie
Activity: 28
July 02, 2011, 06:14:09 AM
#8

Yeah, the min() seems to help, but it helps so little that I can't see the difference without a profiler :) And since I haven't tested the change nearly enough...
Vince
Jr. Member
Activity: 38
July 02, 2011, 06:22:20 AM
#9

I tested this one on the pools (even with __local) without any problems - and generated a block on testnet.

bitless
Newbie
Activity: 28
July 02, 2011, 06:25:55 AM
#10

Well, I meant I haven't tested my min() for long enough... and I probably won't test it at all because the difference is not worth the effort (well, I'll test together with other kernel mods, if I have any).

I haven't tried your changes yet. I honestly don't understand why it works with the local for anyone, but I like the constant thing you've done :)
Diapolo
Hero Member
Activity: 769
July 02, 2011, 07:37:12 AM
#11

Seems like you used some ideas similar to the ones I had for phatk :)

Look here: http://forum.bitcoin.org/index.php?topic=25135.msg314520#msg314520

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo
Hero Member
Activity: 769
July 02, 2011, 07:39:36 AM
#12

Quote from: fascistmuffin
> I'm confused by change #4.
>
> The else if should save a few operations just in itself, and if the code already runs correctly with the else if in it, then a double assignment into arrays would be costly if allowed to run.

If/else statements (control flow) in OpenCL kernels always slow down computation. Both paths need to be examined, so using if/else or if/if should make only a small difference or none at all.

Dia

Vince
Jr. Member
Activity: 38
July 02, 2011, 11:17:47 PM
#13

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

server
Hero Member
Activity: 813
July 02, 2011, 11:38:10 PM
#14

Quote from: Vince
> Did anyone even try this version??
>
> I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on 5870 and Mhash/s went down from 392 (Dia's kernel) to 374.
Anibalayl
Newbie
Activity: 7
July 02, 2011, 11:41:06 PM
#15

interesting
Diapolo
Hero Member
Activity: 769
July 03, 2011, 07:39:53 AM
#16

Quote from: server
> Quote from: Vince
> > Did anyone even try this version??
> >
> > I was waiting for results from HD5xxx and NVIDIA owners!
>
> Yup, but sorry... I tried your kernel on 5870 and Mhash/s went down from 392 (Dia's kernel) to 374.

What were your values with the stock kernel, if I may ask?

Dia

server
Hero Member
Activity: 813
July 03, 2011, 11:11:52 AM
#17

Quote from: Diapolo
> What were your values with the stock kernel, if I may ask?
>
> Dia

I use this:

phoenix.exe -u ... -k poclbm VECTORS BFI_INT FASTLOOP=false AGGRESSION=11 DEVICE=0

(long term rejection rate is between 1-1.5%)
