Bitcoin Forum
October 22, 2017, 03:44:22 PM
 News: Latest stable version of Bitcoin Core: 0.15.0.1  [Torrent]. (New!)
 Home Help Search Donate Login Register
 Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 [15] 16 17 18 19 20 21  All
 Author Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13  (Read 105435 times)
Phateus
Jr. Member

Offline

Activity: 52

 August 08, 2011, 10:43:00 PM

The steps:

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefuly unique for each valid nonce assuming there are a small number of valid nonces. However in the worst case (more than 1 hash found in a single execution) at least 1 will be returned. However if 3 or less nonces are found per execution all of them should be returned in most cass.

Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do...
Can't you just replace all of the steps  with:
Code:
Valid = 1 - min(H, 1u);
Nonce = W[3];
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
if you are trying to remove all control flow?  Any invalid nonce will be written into Output[0] and the valid nonces will be randomly distributed through the rest of the array.

I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...

Also, it is easy enough to make it work with VECTORS ;

Code:
Valid = 1 - (min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x + min(H.y, 1);
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
(or you could just double the code for .x and .y)

OR
Code:
Valid = 1 - (min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x;
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
and have the __init__ file check both Nonce and Nonce+1

another way of doing it would be (the compiler should replace the if statement with a set conditional):
Code:
Nonce = W[3];
if(H)
//Invalid nonce are at the last position of the array, valid are distributed at the front
OUTPUT[Position] = Nonce;

Slightly faster would be to have the Position = the local thread # (since you save an &) and make sure that the size of the output* array is WORKSIZE + 1:
Code:
Nonce = W[3];
Position = get_local_id(0);
if(H)
Position = WORKSIZE + 1;
OUTPUT[Position] = Nonce;

EDIT:  Ooh, just thought of something else:

If it doesn't like writing everything to the same address: Make the buffer size = 2*WORKSIZE...
Code:
Nonce = W[3];
Position = get_local_id(0);
if(H)
Position += WORKSIZE;
OUTPUT[Position] = Nonce;
Then all of the threads in a workgroup will write to a different address.  The valid nonces will be in the first half, and the invalid will be in the second.

Now I have no idea if any of these things would be faster, but I think all of them would work...

Sorry to put so much code down... but this kind of coding isn't really an exact science...
1508687062
Hero Member

Offline

Posts: 1508687062

Ignore
 1508687062

1508687062
 Report to moderator
1508687062
Hero Member

Offline

Posts: 1508687062

Ignore
 1508687062

1508687062
 Report to moderator
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction. Advertise here.
1508687062
Hero Member

Offline

Posts: 1508687062

Ignore
 1508687062

1508687062
 Report to moderator
1508687062
Hero Member

Offline

Posts: 1508687062

Ignore
 1508687062

1508687062
 Report to moderator
1508687062
Hero Member

Offline

Posts: 1508687062

Ignore
 1508687062

1508687062
 Report to moderator
bcforum
Full Member

Offline

Activity: 140

 August 09, 2011, 02:50:54 AM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.

If you found this post useful, feel free to share the wealth: 1E35gTBmJzPNJ3v72DX4wu4YtvHTWqNRbM
jedi95
Full Member

Offline

Activity: 219

 August 09, 2011, 02:56:28 AM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU
indio007
Full Member

Offline

Activity: 224

 August 09, 2011, 04:14:13 PM

Sent you half a bit to keep you motivated.

Keep up the good work Diapolo
Diapolo
Hero Member

Offline

Activity: 769

 August 09, 2011, 04:26:31 PM

Sent you half a bit to keep you motivated.

Keep up the good work Diapolo

Woohoo I feel damn motivated ... thanks mate!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
indio007
Full Member

Offline

Activity: 224

 August 09, 2011, 04:35:18 PM

Just out of curiosity , how many unique downloads of your modification have there been? If you know of course.
Diapolo
Hero Member

Offline

Activity: 769

 August 09, 2011, 05:33:03 PM

Just out of curiosity , how many unique downloads of your modification have there been? If you know of course.

The sum of all downloads is > 5500 (for all released versions).

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo
Hero Member

Offline

Activity: 769

 August 09, 2011, 07:49:07 PM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel only crashes the display driver THAT hard, I get a Bluescreen everytime ... weird.

Code:
// Round 124
Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

...

// lo 16 Bits OR hi 16 Bits
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive & 0xFF00U);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive & 0xF0U);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive & 0xCU);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive & 0x2U));

// nonce AND positive
uint position = W_3.x & positive;
// lo 16 Bits XOR hi 16 Bits
position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
position = (position & 0x00FFU) | (position & 0xFF00U);

output[position] = W_3.x;

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Phateus
Jr. Member

Offline

Activity: 52

 August 09, 2011, 08:20:46 PM

1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel only crashes the display driver THAT hard, I get a Bluescreen everytime ... weird.

Code:
// Round 124
Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];

...

// lo 16 Bits OR hi 16 Bits
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive & 0xFF00U);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive & 0xF0U);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive & 0xCU);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive & 0x2U));

// nonce AND positive
uint position = W_3.x & positive;
// lo 16 Bits XOR hi 16 Bits
position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
// lo 8 Bits OR hi 8 Bits
position = (position & 0x00FFU) | (position & 0xFF00U);

output[position] = W_3.x;

Dia

You need to shift the the bits for each stage:

For example, oring the top bits to the bottom bits should be:

Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | ((Vals[7].x & 0xFFFF0000U) >> 16);
or just:
Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
because the upper 16 bits will already be 0 because of the shift;

Otherwise, you will just get the original Vals[7] value;
if you want to do it that way, the code would be:
Code:
uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
// lo 8 Bits OR hi 8 Bits
positive = (positive & 0x00FFU) | (positive >> 8);
// lo 4 Bits OR hi 4 Bits
positive = (positive & 0x0FU) | (positive >> 4);
// lo 2 Bits OR hi 2 Bits
positive = (positive & 0x3U) | (positive >> 2);
// lo 1 Bit NOR hi 1 Bit
positive = ~((positive & 0x1U) | (positive >> 1));

However, similar to what I said earlier, the following code does the same thing:
Code:
uint positive = 0xFFFFFFFF + min(Vals[7], 1u);
if Vals[7] ==0, then min(Vals[7], 1u) == 0, otherwise it equals 1
0xFFFFFFFF + 0 = 0xFFFFFFFF
0xFFFFFFFF + 1 = 0

oh yeah...  you are getting blue screens because your address would be a random 32 bit number and it was probably trying to access memory that your video card doesn't have
Diapolo
Hero Member

Offline

Activity: 769

 August 11, 2011, 03:42:23 PM

New version for your testing pleasure . Remember to use VECTORS2 as switch!
This one should be a bit faster for 58XX and 69XX cards compared to earlier versions PLUS it should not generate invalid shares, if more than 1 positve nonce is found in a work-group!

If a few of you could make a comparison (with older or other kernel versions) of accepted shares over a certain period of time, this woule be pretty cool!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
miscreanity
Legendary

Offline

Activity: 1260

 August 11, 2011, 04:20:20 PM

New version for your testing pleasure . Remember to use VECTORS2 as switch!
This one should be a bit faster for 58XX and 69XX cards compared to earlier versions PLUS it should not generate invalid shares, if more than 1 positve nonce is found in a work-group!

If a few of you could make a comparison (with older or other kernel versions) of accepted shares over a certain period of time, this woule be pretty cool!

Dia

6950 @ 920/300; Linux 2.6.38, 11.6/2.4; 2x 5 min runs for each setting with Phoenix 1.50

AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS2

WORKSIZE=128
[374.89 Mhash/sec] [28 Accepted] [0 Rejected] [RPC (+LP)]
- Negligible difference from 2011-08-02 kernel.

WORKSIZE=256
[344.50 Mhash/sec] [25 Accepted] [0 Rejected] [RPC (+LP)]
- Significant drop of ~25-30 Mh/s from 08-02 kernel.
Tx2000
Full Member

Offline

Activity: 182

 August 11, 2011, 04:53:46 PM

11.8 / SDK 2.4   920c/320m  5850 reference

-k phatk VECTORS VECTORS2 BFI_INT FASTLOOP=false AGGRESSION=10 WORKSIZE=256

393-394 Mh/s, compared to 398-399 with prior version.
dishwara
Legendary

Offline

Activity: 1526

 August 11, 2011, 06:16:37 PM

436 & 426 using diapolo 2011-8-11.
While phatk 2.2 of Phateus gives 448 & 433.

Windows 7, 64 bit, AERO enabled, AOCLBF 1.75, for diapolo used vectors2 & removed check mark for vectors in AOCLBF.
Aggression=12, worksize=256
11.8 catalyst beta.
MSI R5870 Lightning & Sapphore HD 5870.
975/325 & 939/313.

 BitSend ◢◤www.bitsend.info ▄█▄█████▄████████▄███████████▄█████████████████████████▀████████▀█████▀▀█▀ ████████████ Your Digital Network | 10MB Blocks Algo: XEVAN | DK3 | Masternodes Bitcore - BTX/BTC -Project ████████████ BSD -USDT | Bittrex | C.Gather | S.Exchange Cryptopia | NovaExchange | Livecoin Bitsend Airdrop ████████████ JOIN OURSLACK ████ ████  ████   ████    ████     ████      ████       ████        ████       ████      ████     ████    ████   ████  ████ ████████ ████ ████  ████   ████    ████     ████      ████       ████        ████       ████      ████     ████    ████   ████  ████ ████████
talldude
Member

Offline

Activity: 76

 August 11, 2011, 07:10:33 PM

I'm giving this a go.

5850, all the usual flags, aggression 11. Dropped 1.5mhash compared to phatk 2.2 but we'll see if invalid shares also drop and/or valid shares go up. I'll edit this post in a day or so (if I remember).
Diapolo
Hero Member

Offline

Activity: 769

 August 11, 2011, 08:03:52 PM

Did anyone with SDK 2.5 check this out? I get better results on 5870 and 5830 than with former kernels and I had hoped 69XX would be really faster :-/.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
moomoocow
Newbie

Offline

Activity: 18

 August 11, 2011, 10:54:39 PM

I was running the 8-4-2011 pre-release before this and the new 8-11-2011 release yields identical hash rates on my 6950.

Cat 11.8 preview.
PcChip
Sr. Member

Offline

Activity: 402

 August 18, 2011, 02:03:05 AM

On Cat 11.8 Preview:

Phateus 2.2: 312 MH/s

(5830 @ 965/300, Worksize 256)

Legacy signature from 2011:
All rates with Phoenix 1.50 / PhatK
5850 - 400 MH/s  |  5850 - 355 MH/s | 5830 - 310 MH/s  |  GTX570 - 115 MH/s | 5770 - 210 MH/s | 5770 - 200 MH/s
iopq
Hero Member

Offline

Activity: 658

 August 18, 2011, 03:59:37 AM

I am getting about the same with diapolo's as I do with phatk2.2 on a 5750 with memory clock at 200, worksize 256, vectors2

I'm using fpgaminer's modified poclbm
Parja
Jr. Member

Offline

Activity: 36

 August 20, 2011, 04:36:36 PM

I made an interesting discovery during my own tests with the new kernel version. I had to up the memory clock of my 5870 from 200 to 350 MHz in order to achieve the highest hashing values. Another thing to mention is, that I drive a Phenom II X6 1090T with only 800 MHz for every core, due to power saving, while mining. If I let the CPU use full speed, MHash/s goes even higher, let's say 3-4 MH/s.

Conclusion: Perhaps you guys should try to raise your mem speeds + experiment with CPU clocks, too. I know it has to be a good balance, so that higher MH/s values are not eaten by higher energy costs.

Dia

I'm actually finding with the 8-11 kernel that memory speed can be dropped down very low and still maintain optimal performance.  I've got a total of 5 58X0 cards running, and they're all perfectly content to max out the MH/s at 150MHz memory speed.

So while I've found that phatk 2.2 can do about 1-1.5% higher than 8-11 at the same core speed, phatk likes a memory speed up around 430MHz for optimal performance.  So with that memory speed drop, I'm seeing about 2-3C lower core temps on my cards...or about 20MHz higher core speeds for the same temps, which more than makes up for the performance gap.
Diapolo
Hero Member

Offline

Activity: 769

 August 23, 2011, 05:48:08 AM

I made an interesting discovery during my own tests with the new kernel version. I had to up the memory clock of my 5870 from 200 to 350 MHz in order to achieve the highest hashing values. Another thing to mention is, that I drive a Phenom II X6 1090T with only 800 MHz for every core, due to power saving, while mining. If I let the CPU use full speed, MHash/s goes even higher, let's say 3-4 MH/s.

Conclusion: Perhaps you guys should try to raise your mem speeds + experiment with CPU clocks, too. I know it has to be a good balance, so that higher MH/s values are not eaten by higher energy costs.

Dia

I'm actually finding with the 8-11 kernel that memory speed can be dropped down very low and still maintain optimal performance.  I've got a total of 5 58X0 cards running, and they're all perfectly content to max out the MH/s at 150MHz memory speed.

So while I've found that phatk 2.2 can do about 1-1.5% higher than 8-11 at the same core speed, phatk likes a memory speed up around 430MHz for optimal performance.  So with that memory speed drop, I'm seeing about 2-3C lower core temps on my cards...or about 20MHz higher core speeds for the same temps, which more than makes up for the performance gap.

Very interesting, but I guess currently the focus for most users is on phatk2, even if your observation could turn out to change some users mind . I'm still working on the kernel, but the really big jumps are hard to do these days .

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
 Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 [15] 16 17 18 19 20 21  All