further improved phatk_dia kernel for Phoenix + SDK 2.6

Phateus

Newbie

Offline

Activity: 52
Merit: 0

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 08, 2011, 10:43:00 PM

#281

Quote from: jedi95 on August 08, 2011, 06:57:58 PM

The steps:

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 produce an 8-bit index that is 0 for all invalid nonces and hopefully unique for each valid nonce, assuming there are only a small number of valid nonces. Even in the worst case (more than 1 hash found in a single execution) at least 1 will be returned, and if 3 or fewer nonces are found per execution, all of them should be returned in most cases.

Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do...
Can't you just replace all of the steps with:

Code:

Valid = 1 - min(H, 1u);
Nonce = W[3];
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

if you are trying to remove all control flow? Any invalid nonce will be written into Output[0] and the valid nonces will be randomly distributed through the rest of the array.
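
A minimal host-side C sketch of the branch-free write above, for trying the index arithmetic off the GPU. The value of OUTPUT_MASK and the helper names min_u32 and output_slot are assumptions made for illustration, not part of any kernel; h stands in for H and nonce for W[3].

```c
#include <stdint.h>

/* Assumed for illustration: slot 0 collects rejects, slots 1..256 hold hits. */
#define OUTPUT_MASK 0xFFu

/* Stand-in for OpenCL's built-in min() on uints. */
static inline uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Returns the OUTPUT index a work-item would write its nonce to:
   0 when h != 0 (invalid), 1..OUTPUT_MASK+1 when h == 0 (valid). */
static inline uint32_t output_slot(uint32_t h, uint32_t nonce)
{
    uint32_t valid = 1u - min_u32(h, 1u);         /* 1 if h == 0, else 0 */
    return ((nonce & OUTPUT_MASK) + 1u) * valid;  /* invalid collapses to 0 */
}
```

With this layout the OUTPUT array needs OUTPUT_MASK + 2 entries, since the highest index used is OUTPUT_MASK + 1.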

I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...

Also, it is easy enough to make it work with VECTORS:

Code:

Valid = 1 - min(min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x + (1u - min(H.y, 1u));
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

(or you could just double the code for .x and .y)

OR

Code:

Valid = 1 - min(min(H.x, H.y), 1u);
//Always store the .x nonce; the host checks both Nonce and Nonce+1.
Nonce = W[3].x;
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

and have the __init__ file check both Nonce and Nonce+1
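
For the first vector variant, here is a small host-side C sketch of the two-lane logic, written with a nested min() so that either lane being zero counts as a hit. It assumes, as the posts imply, that lane .y tests the nonce of lane .x plus 1. OUTPUT_MASK and the names min_u32, pick_nonce and slot_2lane are hypothetical.

```c
#include <stdint.h>

#define OUTPUT_MASK 0xFFu   /* assumed size mask, for illustration only */

/* Stand-in for OpenCL's built-in min() on uints. */
static inline uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* hy: final hash word of lane .y; nonce_x: nonce tested by lane .x.
   Lane .y is assumed to test nonce_x + 1. */
static inline uint32_t pick_nonce(uint32_t hy, uint32_t nonce_x)
{
    return nonce_x + (1u - min_u32(hy, 1u));   /* +1 only when lane .y hit */
}

/* Slot written by a two-lane work-item: 0 if neither lane hit. */
static inline uint32_t slot_2lane(uint32_t hx, uint32_t hy, uint32_t nonce_x)
{
    uint32_t valid = 1u - min_u32(min_u32(hx, hy), 1u);  /* 1 if either lane is 0 */
    return ((pick_nonce(hy, nonce_x) & OUTPUT_MASK) + 1u) * valid;
}
```

If both lanes hit at once, this sketch reports only the .y nonce; having the host check Nonce and Nonce+1 avoids that corner case.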

another way of doing it would be (the compiler should replace the if statement with a set conditional):

Code:

Nonce = W[3];
Position = W[3] & OUTPUT_MASK;
if(H)
   Position = OUTPUT_MASK + 1;
//Invalid nonces are at the last position of the array, valid ones are distributed at the front
OUTPUT[Position] = Nonce;

Slightly faster would be to use the local thread ID as the Position (since you save an &) and make sure that the output array has WORKSIZE + 1 elements:

Code:

Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position = WORKSIZE; //last slot of the WORKSIZE + 1 element array
OUTPUT[Position] = Nonce;

EDIT: Ooh, just thought of something else:

If it doesn't like writing everything to the same address: Make the buffer size = 2*WORKSIZE...

Code:

Nonce = W[3];
Position = get_local_id(0);
if(H)
   Position += WORKSIZE;
OUTPUT[Position] = Nonce;

Then all of the threads in a workgroup will write to a different address. The valid nonces will be in the first half, and the invalid will be in the second.
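
The 2*WORKSIZE layout can be simulated on the host in plain C. This sketch models a single work-group only; how several work-groups would share the buffer is not covered in the post. The tiny WORKSIZE, the zeroed buffer and the names slot_for and run_group are assumptions made for illustration.

```c
#include <stdint.h>

#define WORKSIZE 4u   /* tiny value for illustration; real runs use e.g. 128 or 256 */

/* Slot for one work-item: local_id in the first half when h == 0 (valid),
   local_id + WORKSIZE in the second half otherwise. */
static inline uint32_t slot_for(uint32_t local_id, uint32_t h)
{
    uint32_t position = local_id;
    if (h)
        position += WORKSIZE;
    return position;
}

/* Simulates one work-group writing into a 2*WORKSIZE buffer and returns
   the number of valid nonces found in the first half. The buffer is
   assumed to be zeroed first, so a nonce of 0 cannot be reported. */
static inline uint32_t run_group(const uint32_t h[WORKSIZE],
                                 const uint32_t nonce[WORKSIZE],
                                 uint32_t output[2u * WORKSIZE])
{
    uint32_t found = 0;
    for (uint32_t i = 0; i < 2u * WORKSIZE; i++)
        output[i] = 0;
    for (uint32_t id = 0; id < WORKSIZE; id++)
        output[slot_for(id, h[id])] = nonce[id];   /* every id hits its own slot */
    for (uint32_t i = 0; i < WORKSIZE; i++)
        found += (output[i] != 0);
    return found;
}
```

The host then only has to scan the first WORKSIZE entries; the second half is scratch space.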

Now I have no idea if any of these things would be faster, but I think all of them would work...

Sorry to put so much code down... but this kind of coding isn't really an exact science...

bcforum

Full Member

Offline

Activity: 140
Merit: 100

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 02:50:54 AM

#282

Quote from: jedi95 on August 08, 2011, 06:57:58 PM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.
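
A quick C check of why it matters (the helper names are made up for this sketch): a nonzero hash word such as 0x00010000 has no bits in common between its two halves, so the AND version folds it to 0 and it would wrongly look like a hit, while the OR version keeps it nonzero.

```c
#include <stdint.h>

/* Step 1 as originally written: AND of the two 16-bit halves. */
static inline uint32_t fold16_and(uint32_t h) { return (h & 0xFFFFu) & (h >> 16); }

/* Step 1 with OR: the result is zero only when every bit of h is zero. */
static inline uint32_t fold16_or(uint32_t h)  { return (h & 0xFFFFu) | (h >> 16); }
```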

If you found this post useful, feel free to share the wealth: 1E35gTBmJzPNJ3v72DX4wu4YtvHTWqNRbM

jedi95

Full Member

Offline

Activity: 219
Merit: 120

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 02:56:28 AM

#283

Quote from: bcforum on August 09, 2011, 02:50:54 AM

Quote from: jedi95 on August 08, 2011, 06:57:58 PM

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU

indio007

Full Member

Offline

Activity: 224
Merit: 100

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 04:14:13 PM

#284

Sent you half a bit to keep you motivated. ;D

Keep up the good work Diapolo

Diapolo (OP)

Hero Member

Offline

Activity: 769
Merit: 500

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 04:26:31 PM

#285

Quote from: indio007 on August 09, 2011, 04:14:13 PM

Sent you half a bit to keep you motivated. ;D

Keep up the good work Diapolo

Woohoo I feel damn motivated ;)

... thanks mate!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo

indio007

Full Member

Offline

Activity: 224
Merit: 100

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 04:35:18 PM

#286

Just out of curiosity, how many unique downloads of your modification have there been? If you know, of course.

Diapolo (OP)

Hero Member

Offline

Activity: 769
Merit: 500

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 05:33:03 PM

#287

Quote from: indio007 on August 09, 2011, 04:35:18 PM

Just out of curiosity, how many unique downloads of your modification have there been? If you know, of course.

The sum of all downloads is > 5500 (for all released versions).

Dia


Diapolo (OP)

Hero Member

Offline

Activity: 769
Merit: 500

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 07:49:07 PM

#288

Quote from: jedi95 on August 09, 2011, 02:56:28 AM

Quote from: bcforum on August 09, 2011, 02:50:54 AM

Quote from: jedi95 on August 08, 2011, 06:57:58 PM

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel crashes the display driver so hard that I get a bluescreen every time ... weird.

Code:

	// Round 124
	Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

	...

	// lo 16 Bits OR hi 16 Bits
	uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	positive = (positive & 0x00FFU) | (positive & 0xFF00U);
	// lo 4 Bits OR hi 4 Bits
	positive = (positive & 0x0FU) | (positive & 0xF0U);
	// lo 2 Bits OR hi 2 Bits
	positive = (positive & 0x3U) | (positive & 0xCU);
	// lo 1 Bit NOR hi 1 Bit
	positive = ~((positive & 0x1U) | (positive & 0x2U));

	// nonce AND positive
	uint position = W_3.x & positive;
	// lo 16 Bits XOR hi 16 Bits
	position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	position = (position & 0x00FFU) | (position & 0xFF00U);
		
	output[position] = W_3.x;

Dia


Phateus

Newbie

Offline

Activity: 52
Merit: 0

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-04

August 09, 2011, 08:20:46 PM

#289

Quote from: Diapolo on August 09, 2011, 07:49:07 PM

Quote from: jedi95 on August 09, 2011, 02:56:28 AM

Quote from: bcforum on August 09, 2011, 02:50:54 AM

Quote from: jedi95 on August 08, 2011, 06:57:58 PM

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel crashes the display driver so hard that I get a bluescreen every time ... weird.

Code:

	// Round 124
	Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

	...

	// lo 16 Bits OR hi 16 Bits
	uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	positive = (positive & 0x00FFU) | (positive & 0xFF00U);
	// lo 4 Bits OR hi 4 Bits
	positive = (positive & 0x0FU) | (positive & 0xF0U);
	// lo 2 Bits OR hi 2 Bits
	positive = (positive & 0x3U) | (positive & 0xCU);
	// lo 1 Bit NOR hi 1 Bit
	positive = ~((positive & 0x1U) | (positive & 0x2U));

	// nonce AND positive
	uint position = W_3.x & positive;
	// lo 16 Bits XOR hi 16 Bits
	position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
	// lo 8 Bits OR hi 8 Bits
	position = (position & 0x00FFU) | (position & 0xFF00U);
		
	output[position] = W_3.x;

Dia

You need to shift the bits for each stage.

For example, ORing the top bits into the bottom bits should be:

Code:

uint positive = (Vals[7].x & 0x0000FFFFU) | ((Vals[7].x & 0xFFFF0000U) >> 16);

or just:

Code:

uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);

because the upper 16 bits are already 0 after the shift.

Without the shift, you just get the original Vals[7] value back.
If you want to do it that way, the code would be:

Code:

	uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
	// lo 8 Bits OR hi 8 Bits
	positive = (positive & 0x00FFU) | (positive >> 8);
	// lo 4 Bits OR hi 4 Bits
	positive = (positive & 0x0FU) | (positive >> 4);
	// lo 2 Bits OR hi 2 Bits
	positive = (positive & 0x3U) | (positive >> 2);
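	// lo 1 Bit NOR hi 1 Bit, widened to a full mask: 0xFFFFFFFF if Vals[7].x == 0, else 0
	// (a plain ~ would leave 0xFFFFFFFE for a nonzero hash)
	positive = ((positive & 0x1U) | (positive >> 1)) - 1U;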
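
To see the effect of the shifts, here is a small host-side C sketch (function names are hypothetical). It shows that masking without shifting just reassembles the input, and that the shifted OR-fold reduces any nonzero word to 1. A bitwise NOT of that 1-bit result on a 32-bit uint gives 0xFFFFFFFE for a nonzero hash rather than 0, so the sketch subtracts 1 to get a clean all-ones or all-zeros mask; that substitution is an assumption of this sketch.

```c
#include <stdint.h>

/* Unshifted masking just reassembles the input: the halves never overlap. */
static inline uint32_t fold_no_shift(uint32_t v)
{
    return (v & 0x0000FFFFu) | (v & 0xFFFF0000u);   /* always equals v */
}

/* Shifted OR-fold: returns 1 if any bit of v is set, 0 if v == 0. */
static inline uint32_t any_bit_set(uint32_t v)
{
    uint32_t p = (v & 0xFFFFu) | (v >> 16);
    p = (p & 0xFFu) | (p >> 8);
    p = (p & 0xFu)  | (p >> 4);
    p = (p & 0x3u)  | (p >> 2);
    return (p & 0x1u) | (p >> 1);
}

/* Full-width mask for the later "nonce AND positive" step:
   0xFFFFFFFF when v == 0, 0 otherwise. */
static inline uint32_t zero_mask(uint32_t v)
{
    return any_bit_set(v) - 1u;   /* 0 - 1 wraps to 0xFFFFFFFF, 1 - 1 is 0 */
}
```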

However, similar to what I said earlier, the following code does the same thing:

Code:

	uint positive = 0xFFFFFFFFU + min(Vals[7].x, 1u);

If Vals[7].x == 0, then min(Vals[7].x, 1u) == 0; otherwise it equals 1:
0xFFFFFFFF + 0 = 0xFFFFFFFF
0xFFFFFFFF + 1 = 0 (the addition wraps around)
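
The same arithmetic as a tiny C sketch, relying on unsigned addition wrapping modulo 2^32 (true in both C and OpenCL C). min_u32 stands in for OpenCL's min(), and positive_mask is a made-up name.

```c
#include <stdint.h>

/* Stand-in for OpenCL's built-in min() on uints. */
static inline uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* 0xFFFFFFFF + 0 stays all ones; 0xFFFFFFFF + 1 wraps around to 0. */
static inline uint32_t positive_mask(uint32_t h)
{
    return 0xFFFFFFFFu + min_u32(h, 1u);
}
```

ANDing a nonce with this mask returns the nonce unchanged for a valid hash and 0 for an invalid one.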

Oh yeah... you are getting blue screens because, without the shifts, your address is an essentially random 32-bit number, so the kernel was probably trying to access memory that your video card doesn't have.
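
For comparison, a host-side C sketch of steps 6-8 with the shifts in place. The result always fits in 8 bits, so a 256-entry output buffer would be enough. The function name and the mask parameter (0xFFFFFFFF for a valid nonce, 0 otherwise) are assumptions made for illustration.

```c
#include <stdint.h>

/* Steps 6-8 with shifts: the returned index is always in 0..255. */
static inline uint32_t output_index(uint32_t nonce, uint32_t mask)
{
    uint32_t pos = nonce & mask;            /* step 6: 0 for invalid nonces */
    pos = (pos & 0xFFFFu) ^ (pos >> 16);    /* step 7: 16-bit result */
    pos = (pos & 0xFFu) | (pos >> 8);       /* step 8: 8-bit result */
    return pos;
}
```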

Diapolo (OP)

Hero Member

Offline

Activity: 769
Merit: 500

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 11, 2011, 03:42:23 PM

#290

Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

New version for your testing pleasure ;). Remember to use VECTORS2 as switch!
This one should be a bit faster for 58XX and 69XX cards compared to earlier versions, PLUS it should not generate invalid shares if more than 1 positive nonce is found in a work-group!

If a few of you could make a comparison (with older or other kernel versions) of accepted shares over a certain period of time, this would be pretty cool!

Dia


miscreanity

Legendary

Offline

Activity: 1316
Merit: 1005

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 11, 2011, 04:20:20 PM

#291

Quote from: Diapolo on August 11, 2011, 03:42:23 PM

Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

New version for your testing pleasure ;)

6950 @ 920/300; Linux 2.6.38, 11.6/2.4; 2x 5 min runs for each setting with Phoenix 1.50

AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS2

WORKSIZE=128
[374.89 Mhash/sec] [28 Accepted] [0 Rejected] [RPC (+LP)]
- Negligible difference from 2011-08-02 kernel.

WORKSIZE=256
[344.50 Mhash/sec] [25 Accepted] [0 Rejected] [RPC (+LP)]
- Significant drop of ~25-30 Mh/s from 08-02 kernel.

Tx2000

Full Member

Offline

Activity: 182
Merit: 100

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 11, 2011, 04:53:46 PM

#292

11.8 / SDK 2.4 920c/320m 5850 reference

-k phatk VECTORS VECTORS2 BFI_INT FASTLOOP=false AGGRESSION=10 WORKSIZE=256

393-394 Mh/s, compared to 398-399 with prior version.

dishwara

Legendary

Offline

Activity: 1855
Merit: 1016

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 11, 2011, 06:16:37 PM

#293

436 & 426 MH/s using Diapolo 2011-08-11,
while phatk 2.2 by Phateus gives 448 & 433 MH/s.

Windows 7, 64 bit, AERO enabled, AOCLBF 1.75, for diapolo used vectors2 & removed check mark for vectors in AOCLBF.
Aggression=12, worksize=256
11.8 catalyst beta.
MSI R5870 Lightning & Sapphire HD 5870.
975/325 & 939/313.

talldude

Member

Offline

Activity: 224
Merit: 10

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 11, 2011, 07:10:33 PM

#294

I'm giving this a go.

5850, all the usual flags, aggression 11. Dropped 1.5 MH/s compared to phatk 2.2, but we'll see if invalid shares also drop and/or valid shares go up. I'll edit this post in a day or so (if I remember).

Diapolo (OP)

Hero Member

Offline

Activity: 769
Merit: 500

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 11, 2011, 08:03:52 PM

#295

Did anyone with SDK 2.5 check this out? I get better results on 5870 and 5830 than with former kernels and I had hoped 69XX would be really faster :-/.

Dia


moomoocow

Newbie

Offline

Activity: 18
Merit: 0

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 11, 2011, 10:54:39 PM

#296

I was running the 8-4-2011 pre-release before this and the new 8-11-2011 release yields identical hash rates on my 6950.

Cat 11.8 preview.

PcChip

Sr. Member

Offline

Activity: 418
Merit: 250

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 18, 2011, 02:03:05 AM

#297

On Cat 11.8 Preview:

Your latest: 307 MH/s
Phateus 2.2: 312 MH/s

(5830 @ 965/300, Worksize 256)

Legacy signature from 2011:
All rates with Phoenix 1.50 / PhatK
5850 - 400 MH/s | 5850 - 355 MH/s | 5830 - 310 MH/s | GTX570 - 115 MH/s | 5770 - 210 MH/s | 5770 - 200 MH/s

iopq

Hero Member

Offline

Activity: 658
Merit: 500

Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2011-08-11

August 18, 2011, 03:59:37 AM

#298

I am getting about the same with diapolo's as I do with phatk2.2 on a 5750 with memory clock at 200, worksize 256, vectors2

I'm using fpgaminer's modified poclbm

Parja

Newbie

Offline

Activity: 36
Merit: 0

Re: further improved phatk OpenCL Kernel (> 2% increase) for Phoenix - 2011-07-06

August 20, 2011, 04:36:36 PM

#299

Quote from: Diapolo on July 06, 2011, 07:07:04 PM

I made an interesting discovery during my own tests with the new kernel version. I had to up the memory clock of my 5870 from 200 to 350 MHz in order to achieve the highest hashing values. Another thing to mention is, that I drive a Phenom II X6 1090T with only 800 MHz for every core, due to power saving, while mining. If I let the CPU use full speed, MHash/s goes even higher, let's say 3-4 MH/s.

Conclusion: Perhaps you guys should try to raise your mem speeds + experiment with CPU clocks, too. I know it has to be a good balance, so that higher MH/s values are not eaten by higher energy costs.

Dia

I'm actually finding with the 8-11 kernel that memory speed can be dropped down very low and still maintain optimal performance. I've got a total of 5 58X0 cards running, and they're all perfectly content to max out the MH/s at 150MHz memory speed.

So while I've found that phatk 2.2 can do about 1-1.5% higher than 8-11 at the same core speed, phatk likes a memory speed up around 430MHz for optimal performance. So with that memory speed drop, I'm seeing about 2-3C lower core temps on my cards...or about 20MHz higher core speeds for the same temps, which more than makes up for the performance gap.

Diapolo (OP)

Hero Member

Offline

Activity: 769
Merit: 500

Re: further improved phatk OpenCL Kernel (> 2% increase) for Phoenix - 2011-07-06

August 23, 2011, 05:48:08 AM

#300

Quote from: Parja on August 20, 2011, 04:36:36 PM

Quote from: Diapolo on July 06, 2011, 07:07:04 PM

Very interesting, but I guess currently the focus for most users is on phatk2, even if your observation could turn out to change some users' minds ;). I'm still working on the kernel, but the really big jumps are hard to do these days :D.

Dia

