Title: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Vince on July 02, 2011, 05:24:48 AM

Want some more Mhash/s? Try this optimized kernel!

Tested with Phoenix miner - got my HD6950, stock speed, locked shaders, from 343Mhash/s to 349Mhash/s!

This kernel also contains the optimization already posted on this forum, namely "Ma z^x". That one is not mine and I'm not taking credit for it! The 343Mhash/s baseline already included that patch.

What's new:
Lots of small changes, some only save a single addition.

Code:

#1:
Before:
H = 0xb0edbdd0 + K[ 0] +  W0; D = 0xa54ff53a + H; H = H + 0x08909ae5U;

After:
H = W0 + 4228417613; D = W0 + 2563236514;
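
The folding in #1 can be checked with a few lines of Python. This is only a sketch: it assumes K[ 0] is the standard SHA-256 round constant 0x428a2f98 and that every addition wraps modulo 2^32, as it does for uint values in the kernel. The helper names (add32, before, after) are made up for illustration.

```python
MASK32 = 0xFFFFFFFF
K0 = 0x428A2F98  # assumption: standard SHA-256 round constant K[0]


def add32(*terms):
    """Add 32-bit words with wrap-around, like uint arithmetic in the kernel."""
    return sum(terms) & MASK32


def before(w0):
    # The three original statements, in order.
    h = add32(0xB0EDBDD0, K0, w0)
    d = add32(0xA54FF53A, h)
    h = add32(h, 0x08909AE5)
    return h, d


def after(w0):
    # The two folded statements from the patch.
    return add32(w0, 4228417613), add32(w0, 2563236514)


# Both versions agree, including at the wrap-around edge.
for w0 in (0, 1, 0x12345678, 0xFFFFFFFF):
    assert before(w0) == after(w0)
```

The saving is two additions per hash: three constants are combined into two at edit time instead of being summed on the GPU.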

#2:
Before:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1) + K[ 4] +  0x80000000;

After:
D = D1 + (rotr(A, 6) ^ rotr(A, 11) ^ rotr(A, 25)) + Ch(A, B1, C1);
+ Move the constant K[ 4] + 0x80000000 into the Python pre-calculation:
-> self.state2[3] = np.uint32(self.state2[3]+3109470811);
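
Why #2 is safe can be sketched like this, using plain Python integers instead of np.uint32. The assumptions: K[ 4] is the standard SHA-256 constant 0x3956c25b, state2[3] is the value the kernel sees as D1, and `rest` is a stand-in for the rotr/Ch terms that the patch does not touch. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
K4 = 0x3956C25B  # assumption: standard SHA-256 round constant K[4]
FOLDED = (K4 + 0x80000000) & MASK32  # the constant moved to the host side


def kernel_before(d1, rest):
    # Original kernel line: the constants are added on the GPU for every hash.
    return (d1 + rest + K4 + 0x80000000) & MASK32


def host_precalc(d1):
    # Plain-int equivalent of np.uint32(self.state2[3] + 3109470811),
    # done once per work unit on the CPU.
    return (d1 + FOLDED) & MASK32


def kernel_after(d1_precalc, rest):
    # Patched kernel line: no constants left to add.
    return (d1_precalc + rest) & MASK32


d1, rest = 0x89ABCDEF, 0xFEDCBA98
assert kernel_after(host_precalc(d1), rest) == kernel_before(d1, rest)
```

Because addition modulo 2^32 is associative and commutative, it does not matter whether the constant is added before or after the other terms.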

#3:
Before:
H = ....   K[60] + W12;
H+=0x5be0cd19U;
if (H == 0)

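After (the constant equals -(K[60] + 0x5be0cd19) mod 2^32, so K[60] has to be dropped from the sum as well):
if (H == 325071597)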
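
The comparison constant in #3 can be re-derived in Python. A sketch, assuming K[60] is the standard SHA-256 constant 0x90befffa; `partial` stands for the part of the sum written as "...." plus W12, and the names TARGET, hit_before and hit_after are made up here.

```python
MASK32 = 0xFFFFFFFF
K60 = 0x90BEFFFA     # assumption: standard SHA-256 round constant K[60]
H_INIT = 0x5BE0CD19  # the value the original code adds before comparing with 0

# H + K60 + H_INIT == 0 (mod 2^32)  <=>  H == -(K60 + H_INIT) (mod 2^32)
TARGET = (-(K60 + H_INIT)) & MASK32


def hit_before(partial):
    h = (partial + K60) & MASK32
    h = (h + H_INIT) & MASK32
    return h == 0


def hit_after(partial):
    return partial == TARGET


assert TARGET == 325071597
assert hit_before(TARGET) and hit_after(TARGET)
```

So the patched test fires for exactly the same inputs, while the kernel skips two additions per hash.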

#4:
Before:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        else if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

After:
        if (H.x == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
        }
        if (H.y == 0)
        {
                output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
        }

Why stop checking once we have found a result? It is unlikely, but both could be solutions, and the second check adds almost no overhead.

#5:
Lots of small changes (some of them were optimized by the compiler before, but anyway)

For #2 I changed the precalculation in __init__.py. Take a look at it! You can use diff, it's just 2 lines.

Please note: This is part of the result of >100 hours of hard work. If you want me to keep posting patches, say thank you in the form of a small donation. Everything above 0.01 is just fine ;)
-> 1Dsxro7GvNDaxWkvMgkraEttAA4xqagxVp

Btw, I already got some more - minor - optimizations.

Here it is:
http://www.filesonic.com/file/1348177284/poclbm_kernel.zip

Please post some results!

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: bitless on July 02, 2011, 05:44:45 AM

I've sent you a small donation for your hard work, but...

Do pools accept any hashes generated with your kernel? Really? For instance, the 'local' optimization was declared invalid (it messes up the calculation, so the thread got locked by the moderator), etc.

EDIT - As to why exit early on the if()-s... well, if you found a solution already, why do you need a second solution? Doing branches on the GPU is very expensive (threads may diverge, etc.), so two branches may and most likely will end up being worse than one. May I suggest if(min(x,y)==0) { output x; }? Assuming min can be done without branching, this is one branch if you don't have a solution in either x or y (if min is not 1 instruction, find another function to replace the min...), then try both x and (x+1) on the CPU side to figure out which one of these is the real solution.
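
The min() idea can be modelled on the host side in Python. This is a sketch under a few assumptions: the hash words are unsigned, nonce.y equals nonce.x + 1 (which is what "x and (x+1)" implies), and is_solution is a hypothetical CPU-side check that rehashes a single nonce.

```python
def kernel_report(h_x, h_y, nonce_x):
    """Single branch: report nonce_x if either lane hit, otherwise None."""
    # For unsigned values, min(h_x, h_y) == 0 exactly when h_x == 0 or h_y == 0.
    if min(h_x, h_y) == 0:
        return nonce_x
    return None


def host_resolve(reported, is_solution):
    """Try both candidate nonces on the CPU to see which one really solved it."""
    return [n for n in (reported, reported + 1) if is_solution(n)]


# Lane y hit, but the kernel only reports x; the host sorts it out.
reported = kernel_report(h_x=7, h_y=0, nonce_x=1000)
assert host_resolve(reported, lambda n: n == 1001) == [1001]
```

The trade-off is one extra hash or two on the CPU per reported hit, which is rare, against one fewer branch in every kernel invocation.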

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: fascistmuffin on July 02, 2011, 05:45:39 AM

I'm confused by change #4.

The else if should save a few operations just in itself, and if the code already runs correctly with the else if in it, then a double assignment into arrays would be costly if allowed to run.

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Vince on July 02, 2011, 05:53:14 AM

Code:

if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}
else if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}

The first condition, H.x == 0, is almost always false, so it's 2 comparisons in almost every cycle, exactly the same speed as without the "else".

The assignments are only done when a hash is found, so they do not affect speed at all.

It's not a double assignment: the second one goes to output[nonce.y & OUTPUT_MASK], a different slot. That's just to get around race conditions.
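
A small Python model of the output buffer shows why two hits do not overwrite each other. The buffer size is an assumption (OUTPUT_SIZE = 256 is not stated in this thread), as is treating 0 as "empty slot"; record and found_nonces are hypothetical names.

```python
OUTPUT_SIZE = 256              # assumption: the real value is not given here
OUTPUT_MASK = OUTPUT_SIZE - 1  # valid as a mask because the size is a power of two


def record(output, nonce):
    # Mirrors: output[OUTPUT_SIZE] = output[nonce & OUTPUT_MASK] = nonce;
    output[nonce & OUTPUT_MASK] = nonce  # slot chosen by the low bits of the nonce
    output[OUTPUT_SIZE] = nonce          # extra last slot: "something was found"


def found_nonces(output):
    # Assumption: 0 marks an empty slot; the flag slot at the end is skipped.
    return [n for n in output[:OUTPUT_SIZE] if n != 0]


output = [0] * (OUTPUT_SIZE + 1)
record(output, 0x1234)  # nonce.x
record(output, 0x1235)  # nonce.y = nonce.x + 1 lands in the next slot
assert found_nonces(output) == [0x1234, 0x1235]
```

Only the flag slot is written twice, and for that one it does not matter which nonce ends up there, since the host just checks that it is non-zero.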

The __local patch is invalid? First time I've heard of it... Sure, I'll remove it then. The other ones produce valid hashes, and I see no reason why they should not be valid. It's all calculated on paper, step by step, and it looks equal to me.

Thanks for the donation!

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: bitless on July 02, 2011, 05:58:50 AM

Actually, *with* the min() used like I said earlier, the kernel compiles into something quite a lot shorter... I'm gonna *test* it overnight and claim the donations *if and only if it works*, unless you want to test it :)

For the local, search the board.

EDIT - I'll claim the donations for the min anyway, if (and only if) it works and helps anyone :)

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: fascistmuffin on July 02, 2011, 05:59:44 AM

If H.x == 0 is almost always false, flip the if...else statement:

Code:

if (H.y == 0)
{
        output[OUTPUT_SIZE] = output[nonce.y & OUTPUT_MASK] = nonce.y;
}
else if (H.x == 0)
{
        output[OUTPUT_SIZE] = output[nonce.x & OUTPUT_MASK] = nonce.x;
}

It'd be faster than a double if statement since it'd be a single comparison in most cases.

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Vince on July 02, 2011, 06:05:14 AM

Quote from: bitless on July 02, 2011, 05:58:50 AM

Seen the issues on __local, removed it from the zip.

The min() has exactly the same speed here - maybe you can get it faster?

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: bitless on July 02, 2011, 06:14:09 AM

Yeah, the min() seems to help, but it helps so little that I can't see the difference without a profiler :) And since I haven't tested the change nearly enough...

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Vince on July 02, 2011, 06:22:20 AM

I tested this one on the pools (even with __local) without any problems - and generated a block on testnet.

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: bitless on July 02, 2011, 06:25:55 AM

Well, I meant I haven't tested my min() for long enough... and I probably won't test it at all because the difference is not worth the effort (well, I'll test together with other kernel mods, if I have any).

I haven't tried your changes yet. I honestly don't understand why it works with the local for anyone, but I like the constant thing you've done :)

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Diapolo on July 02, 2011, 07:37:12 AM

Seems like you used some similar ideas that I had for phatk :).

Look here: http://forum.bitcoin.org/index.php?topic=25135.msg314520#msg314520

Dia

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Diapolo on July 02, 2011, 07:39:36 AM

Quote from: fascistmuffin on July 02, 2011, 05:45:39 AM

If/else statements (control flow) in OpenCL kernels always slow down computation. Both paths need to be evaluated, so it should make little or no difference whether you use if/else or two ifs.

Dia

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Vince on July 02, 2011, 11:17:47 PM

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: server on July 02, 2011, 11:38:10 PM

Quote from: Vince on July 02, 2011, 11:17:47 PM

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on 5870 and Mhash/s went down from 392 (Dia's kernel) to 374.

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Anibalayl on July 02, 2011, 11:41:06 PM

interesting

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: Diapolo on July 03, 2011, 07:39:53 AM

Quote from: server on July 02, 2011, 11:38:10 PM

Quote from: Vince on July 02, 2011, 11:17:47 PM

Did anyone even try this version??

I was waiting for results from HD5xxx and NVIDIA owners!

Yup, but sorry... I tried your kernel on 5870 and Mhash/s went down from 392 (Dia's kernel) to 374.

What were your values with the stock kernel, if I may ask?

Dia

Title: Re: -- Optimized poclbm kernel! Another 5 Mhash/s --
Post by: server on July 03, 2011, 11:11:52 AM

Quote from: Diapolo on July 03, 2011, 07:39:53 AM

What were your values with the stock kernel, if I may ask?

Dia

I use this:

phoenix.exe -u ... -k poclbm VECTORS BFI_INT FASTLOOP=false AGGRESSION=11 DEVICE=0

(long term rejection rate is between 1-1.5%)

http://dl.dropbox.com/u/28686048/dia-392.png

Bitcoin Forum

Other => Beginners & Help => Topic started by: Vince on July 02, 2011, 05:24:48 AM