Phateus
Newbie
Offline
Activity: 52
Merit: 0
|
|
August 08, 2011, 10:43:00 PM |
|
The steps:
1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit
6. Do a bitwise AND of the resulting 1-bit number against the nonce
7. Take the result from #6 and XOR the low 16-bits against the high 16-bits
8. Take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. Store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce
Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)
Steps 7-8 produce an 8-bit index that is 0 for all invalid nonces and hopefully unique for each valid nonce, assuming there are only a small number of valid nonces. Even in the worst case (more than 1 hash found in a single execution), at least 1 will be returned. If 3 or fewer nonces are found per execution, all of them should be returned in most cases.
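For anyone following along, here is a minimal C sketch of the fold described in the steps above. It is not the kernel code. It assumes step 1 is an OR like steps 2-4 (with an AND there, bits set in only one half of H would be lost), and it widens the 1-bit result to a full 32-bit mask before step 6, since ANDing a nonce with a literal 1 would keep only its lowest bit. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Steps 1-5: fold the 32-bit hash word down to one bit that is 1 only if h == 0.
   Assumption: every fold is an OR, so any set bit in h survives to the end. */
static uint32_t is_zero_bit(uint32_t h)
{
    h = (h & 0xFFFFu) | (h >> 16);          /* 32 -> 16 bits */
    h = (h & 0xFFu) | (h >> 8);             /* 16 -> 8 bits  */
    h = (h & 0xFu) | (h >> 4);              /* 8 -> 4 bits   */
    h = (h & 0x3u) | (h >> 2);              /* 4 -> 2 bits   */
    return ((h & 0x1u) | (h >> 1)) ^ 0x1u;  /* NOR of the last two bits */
}

/* Steps 6-8: 8-bit output index, 0 for every invalid nonce. */
static uint32_t output_index(uint32_t h, uint32_t nonce)
{
    /* Assumption: widen the bit to a mask (1 -> 0xFFFFFFFF, 0 -> 0). */
    uint32_t mask = (uint32_t)(0u - is_zero_bit(h));
    uint32_t v = nonce & mask;              /* step 6 */
    v = (v & 0xFFFFu) ^ (v >> 16);          /* step 7: XOR the two halves */
    v = (v & 0xFFu) | (v >> 8);             /* step 8: OR the two bytes */
    assert(v <= 0xFFu);                     /* always fits a 256-entry table */
    return v;
}
```

Note that a valid nonce whose bytes happen to fold to 0 would share slot 0 with the invalid ones, which is one way a result could be missed.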
Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do... Can't you just replace all of the steps with:

    Valid = 1 - min(H, 1u);
    Nonce = W[3];
    OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

if you are trying to remove all control flow? Any invalid nonce will be written into OUTPUT[0] and the valid nonces will be randomly distributed through the rest of the array. I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...

Also, it is easy enough to make it work with VECTORS:

    Valid = 1 - min(min(H.x, H.y), 1u);
    // If .x is not valid (so .y must be), add 1 to the nonce.
    Nonce = W[3].x + min(H.x, 1u);
    OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

(or you could just double the code for .x and .y)

OR

    Valid = 1 - min(min(H.x, H.y), 1u);
    Nonce = W[3].x;
    OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

and have the __init__ file check both Nonce and Nonce+1.

Another way of doing it would be (the compiler should replace the if statement with a set conditional):

    Nonce = W[3];
    Position = W[3] & OUTPUT_MASK;
    // Invalid nonces go to the last position of the array; valid ones are distributed at the front.
    if (H) Position = OUTPUT_MASK + 1;
    OUTPUT[Position] = Nonce;
Slightly faster would be to have the Position = the local thread # (since you save an &) and make sure that the size of the OUTPUT array is WORKSIZE + 2, so that index WORKSIZE + 1 exists:

    Nonce = W[3];
    Position = get_local_id(0);
    if (H) Position = WORKSIZE + 1;
    OUTPUT[Position] = Nonce;
EDIT: Ooh, just thought of something else: if it doesn't like writing everything to the same address, make the buffer size = 2*WORKSIZE...

    Nonce = W[3];
    Position = get_local_id(0);
    if (H) Position += WORKSIZE;
    OUTPUT[Position] = Nonce;
Then all of the threads in a workgroup will write to a different address. The valid nonces will be in the first half, and the invalid will be in the second. Now I have no idea if any of these things would be faster, but I think all of them would work... Sorry to put so much code down... but this kind of coding isn't really an exact science...
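To make the index arithmetic concrete, here is a small host-side C sketch of two of the variants above. It only mimics the slot calculation. OUTPUT_MASK and WORKSIZE are given hypothetical values, min_u32 stands in for OpenCL's min, and the split-buffer variant uses a multiply in place of the if, which is an assumption about how one might keep it branch-free.

```c
#include <assert.h>
#include <stdint.h>

#define OUTPUT_MASK 0xFFu   /* hypothetical: valid nonces use slots 1..256 */
#define WORKSIZE    128u    /* hypothetical work-group size */

/* Stand-in for OpenCL's min() on unsigned integers. */
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Slot 0 collects every invalid nonce; valid ones land in 1..OUTPUT_MASK + 1,
   so the buffer needs OUTPUT_MASK + 2 entries. */
static uint32_t slot_masked(uint32_t h, uint32_t nonce)
{
    uint32_t valid = 1u - min_u32(h, 1u);   /* 1 if h == 0, else 0 */
    return ((nonce & OUTPUT_MASK) + 1u) * valid;
}

/* Split buffer of 2 * WORKSIZE entries: valid results in the first half,
   invalid ones in the second, one distinct slot per work-item. */
static uint32_t slot_split(uint32_t h, uint32_t local_id)
{
    assert(local_id < WORKSIZE);
    return local_id + WORKSIZE * min_u32(h, 1u);
}
```

With these values, a valid nonce of 0x1234 maps to slot 0x35, and an invalid result from work-item 5 maps to slot 133 in the split layout.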
|
|
|
|
bcforum
|
|
August 09, 2011, 02:50:54 AM |
|
I don't claim to understand this, but step (1) should be an OR, not an AND.
|
If you found this post useful, feel free to share the wealth: 1E35gTBmJzPNJ3v72DX4wu4YtvHTWqNRbM
|
|
|
jedi95
|
|
August 09, 2011, 02:56:28 AM |
|
Quote from: bcforum
I don't claim to understand this, but step (1) should be an OR, not an AND.

Yeah, that's right. Must have missed that when I went over the post. I had it correct in the example though.
|
Phoenix Miner developer Donations appreciated at: 1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU
|
|
|
indio007
|
|
August 09, 2011, 04:14:13 PM |
|
Sent you half a bit to keep you motivated. Keep up the good work Diapolo
|
|
|
|
Diapolo (OP)
|
|
August 09, 2011, 04:26:31 PM |
|
Quote from: indio007
Sent you half a bit to keep you motivated. Keep up the good work Diapolo

Woohoo, I feel damn motivated ... thanks mate!

Dia
|
|
|
|
indio007
|
|
August 09, 2011, 04:35:18 PM |
|
Just out of curiosity, how many unique downloads of your modification have there been? If you know, of course.
|
|
|
|
Diapolo (OP)
|
|
August 09, 2011, 05:33:03 PM |
|
Quote from: indio007
Just out of curiosity, how many unique downloads of your modification have there been? If you know, of course.

The sum of all downloads is > 5500 (for all released versions).

Dia
|
|
|
|
Diapolo (OP)
|
|
August 09, 2011, 07:49:07 PM |
|
Quote from: bcforum
I don't claim to understand this, but step (1) should be an OR, not an AND.

Quote from: jedi95
Yeah that's right. Must have missed that when I went over the post. I had it correct in the example though.

I tried to implement this, but the kernel crashes the display driver so hard that I get a bluescreen every time ... weird.

    // Round 124
    Vals[7] += Vals[3] + P4(124) + P3(124) + P1(124) + P2(124) + ch(124) + s1(124) + H[7];
    ...
    // lo 16 Bits OR hi 16 Bits
    uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x & 0xFFFF0000U);
    // lo 8 Bits OR hi 8 Bits
    positive = (positive & 0x00FFU) | (positive & 0xFF00U);
    // lo 4 Bits OR hi 4 Bits
    positive = (positive & 0x0FU) | (positive & 0xF0U);
    // lo 2 Bits OR hi 2 Bits
    positive = (positive & 0x3U) | (positive & 0xCU);
    // lo 1 Bit NOR hi 1 Bit
    positive = ~((positive & 0x1U) | (positive & 0x2U));

    // nonce AND positive
    uint position = W_3.x & positive;
    // lo 16 Bits XOR hi 16 Bits
    position = (position & 0x0000FFFFU) ^ (position & 0xFFFF0000U);
    // lo 8 Bits OR hi 8 Bits
    position = (position & 0x00FFU) | (position & 0xFF00U);
    output[position] = W_3.x;

Dia
|
|
|
|
Phateus
|
|
August 09, 2011, 08:20:46 PM |
|
You need to shift the bits for each stage. For example, ORing the top bits into the bottom bits should be:

    uint positive = (Vals[7].x & 0x0000FFFFU) | ((Vals[7].x & 0xFFFF0000U) >> 16);

or just:

    uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);

because the upper 16 bits will already be 0 because of the shift. Otherwise, you will just get the original Vals[7] value. If you want to do it that way, the code would be:

    uint positive = (Vals[7].x & 0x0000FFFFU) | (Vals[7].x >> 16);
    // lo 8 Bits OR hi 8 Bits
    positive = (positive & 0x00FFU) | (positive >> 8);
    // lo 4 Bits OR hi 4 Bits
    positive = (positive & 0x0FU) | (positive >> 4);
    // lo 2 Bits OR hi 2 Bits
    positive = (positive & 0x3U) | (positive >> 2);
    // lo 1 Bit NOR hi 1 Bit
    positive = ~((positive & 0x1U) | (positive >> 1));

However, similar to what I said earlier, the following code does the same thing:

    uint positive = 0xFFFFFFFF + min(Vals[7], 1u);

If Vals[7] == 0, then min(Vals[7], 1u) == 0, otherwise it equals 1:

    0xFFFFFFFF + 0 = 0xFFFFFFFF
    0xFFFFFFFF + 1 = 0

Oh yeah... you are getting blue screens because your address would be a random 32-bit number and it was probably trying to access memory that your video card doesn't have.
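A quick host-side C sketch of the two points above, for anyone who wants to check them off the GPU: masking both halves without a shift just reassembles the input, and the 0xFFFFFFFF + min(...) trick gives an all-ones or all-zero mask through unsigned wrap-around. The function names are hypothetical, and plain uint32_t stands in for the kernel's uint.

```c
#include <assert.h>
#include <stdint.h>

/* Masking the two halves without a shift simply rebuilds the original value. */
static uint32_t fold_without_shift(uint32_t v)
{
    return (v & 0x0000FFFFu) | (v & 0xFFFF0000u);
}

/* Shifting the upper half down actually ORs the two halves together. */
static uint32_t fold_with_shift(uint32_t v)
{
    return (v & 0x0000FFFFu) | (v >> 16);
}

/* All ones when h == 0, zero otherwise; relies on unsigned wrap-around. */
static uint32_t valid_mask(uint32_t h)
{
    uint32_t clamped = h < 1u ? h : 1u;              /* min(h, 1u) */
    uint32_t mask = (uint32_t)(0xFFFFFFFFu + clamped);
    assert(mask == 0u || mask == 0xFFFFFFFFu);
    return mask;
}
```

One thing worth double-checking when porting the bitwise version: a 32-bit ~ applied to a 0-or-1 value gives 0xFFFFFFFF or 0xFFFFFFFE, not 0xFFFFFFFF or 0, so only the min-based mask zeroes the nonce for an invalid hash.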
|
|
|
|
Diapolo (OP)
|
|
August 11, 2011, 03:42:23 PM |
|
Download version 2011-08-11: http://www.mediafire.com/?s5c7h4r91r4ad4j

New version for your testing pleasure. Remember to use VECTORS2 as switch! This one should be a bit faster for 58XX and 69XX cards compared to earlier versions PLUS it should not generate invalid shares if more than 1 positive nonce is found in a work-group! If a few of you could make a comparison (with older or other kernel versions) of accepted shares over a certain period of time, this would be pretty cool!

Dia
|
|
|
|
miscreanity
Legendary
Offline
Activity: 1316
Merit: 1005
|
|
August 11, 2011, 04:20:20 PM |
|
6950 @ 920/300; Linux 2.6.38, 11.6/2.4; 2x 5 min runs for each setting with Phoenix 1.50
AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS2

WORKSIZE=128 [374.89 Mhash/sec] [28 Accepted] [0 Rejected] [RPC (+LP)] - Negligible difference from 2011-08-02 kernel.
WORKSIZE=256 [344.50 Mhash/sec] [25 Accepted] [0 Rejected] [RPC (+LP)] - Significant drop of ~25-30 Mh/s from 08-02 kernel.
|
|
|
|
Tx2000
|
|
August 11, 2011, 04:53:46 PM |
|
11.8 / SDK 2.4 920c/320m 5850 reference
-k phatk VECTORS VECTORS2 BFI_INT FASTLOOP=false AGGRESSION=10 WORKSIZE=256
393-394 Mh/s, compared to 398-399 with prior version.
|
|
|
|
dishwara
Legendary
Offline
Activity: 1855
Merit: 1016
|
|
August 11, 2011, 06:16:37 PM |
|
436 & 426 using diapolo 2011-8-11. While phatk 2.2 of Phateus gives 448 & 433.
Windows 7, 64 bit, AERO enabled, AOCLBF 1.75. For diapolo I used vectors2 and removed the check mark for vectors in AOCLBF. Aggression=12, worksize=256, 11.8 catalyst beta. MSI R5870 Lightning & Sapphire HD 5870. 975/325 & 939/313.
|
|
|
|
talldude
Member
Offline
Activity: 224
Merit: 10
|
|
August 11, 2011, 07:10:33 PM |
|
I'm giving this a go.
5850, all the usual flags, aggression 11. Dropped 1.5mhash compared to phatk 2.2 but we'll see if invalid shares also drop and/or valid shares go up. I'll edit this post in a day or so (if I remember).
|
|
|
|
Diapolo (OP)
|
|
August 11, 2011, 08:03:52 PM |
|
Did anyone with SDK 2.5 check this out? I get better results on 5870 and 5830 than with former kernels and I had hoped 69XX would be really faster :-/.
Dia
|
|
|
|
moomoocow
Newbie
Offline
Activity: 18
Merit: 0
|
|
August 11, 2011, 10:54:39 PM |
|
I was running the 8-4-2011 pre-release before this and the new 8-11-2011 release yields identical hash rates on my 6950.
Cat 11.8 preview.
|
|
|
|
PcChip
|
|
August 18, 2011, 02:03:05 AM |
|
On Cat 11.8 Preview:
Your latest: 307 MH/s
Phateus 2.2: 312 MH/s
(5830 @ 965/300, Worksize 256)
|
Legacy signature from 2011: All rates with Phoenix 1.50 / PhatK 5850 - 400 MH/s | 5850 - 355 MH/s | 5830 - 310 MH/s | GTX570 - 115 MH/s | 5770 - 210 MH/s | 5770 - 200 MH/s
|
|
|
iopq
|
|
August 18, 2011, 03:59:37 AM |
|
I am getting about the same with diapolo's as I do with phatk2.2 on a 5750 with memory clock at 200, worksize 256, vectors2
I'm using fpgaminer's modified poclbm
|
|
|
|
Parja
Newbie
Offline
Activity: 36
Merit: 0
|
|
August 20, 2011, 04:36:36 PM |
|
Quote from: Diapolo
I made an interesting discovery during my own tests with the new kernel version. I had to up the memory clock of my 5870 from 200 to 350 MHz in order to achieve the highest hashing values. Another thing to mention is that I drive a Phenom II X6 1090T with only 800 MHz for every core, due to power saving, while mining. If I let the CPU use full speed, MHash/s goes even higher, let's say 3-4 MH/s.
Conclusion: Perhaps you guys should try to raise your mem speeds + experiment with CPU clocks, too. I know it has to be a good balance, so that higher MH/s values are not eaten by higher energy costs.
Dia
I'm actually finding with the 8-11 kernel that memory speed can be dropped down very low and still maintain optimal performance. I've got a total of 5 58X0 cards running, and they're all perfectly content to max out the MH/s at 150MHz memory speed. So while I've found that phatk 2.2 can do about 1-1.5% higher than 8-11 at the same core speed, phatk likes a memory speed up around 430MHz for optimal performance. So with that memory speed drop, I'm seeing about 2-3C lower core temps on my cards...or about 20MHz higher core speeds for the same temps, which more than makes up for the performance gap.
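A rough way to sanity-check that trade-off, as a C sketch. It assumes hash rate scales linearly with core clock (a simplification) and uses a hypothetical 900 MHz base clock, since the actual core clocks are not stated here. The function names are made up.

```c
#include <assert.h>

/* Throughput of the slower kernel relative to the faster one, when the slower
   kernel is `deficit` behind at equal clocks but can run `extra_mhz` higher.
   Assumption: hash rate is proportional to core clock. */
static double net_ratio(double base_mhz, double extra_mhz, double deficit)
{
    assert(base_mhz > 0.0 && deficit >= 0.0 && deficit < 1.0);
    return (1.0 - deficit) * (base_mhz + extra_mhz) / base_mhz;
}

/* Extra core clock needed to exactly cancel the deficit. */
static double break_even_mhz(double base_mhz, double deficit)
{
    assert(base_mhz > 0.0 && deficit >= 0.0 && deficit < 1.0);
    return base_mhz * deficit / (1.0 - deficit);
}
```

Under those assumptions, a 1.5% deficit at 900 MHz breaks even at roughly 13.7 MHz of extra core clock, so an extra 20 MHz would come out around 0.7% ahead.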
|
|
|
|
Diapolo (OP)
|
|
August 23, 2011, 05:48:08 AM |
|
Very interesting, but I guess currently the focus for most users is on phatk2, even if your observation could turn out to change some users' minds. I'm still working on the kernel, but the really big jumps are hard to do these days.

Dia
|
|
|
|
|