The steps:
1. AND the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit
6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce
Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)
Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefuly unique for each valid nonce assuming there are a small number of valid nonces. However in the worst case (more than 1 hash found in a single execution) at least 1 will be returned. However if 3 or less nonces are found per execution all of them should be returned in most cass.
Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do...
Can't you just replace all of the steps with:
Valid = 1 - min(H, 1u);
Nonce = W[3];
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
if you are trying to remove all control flow? Any invalid nonce will be written into Output[0] and the valid nonces will be randomly distributed through the rest of the array.
I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...
Also, it is easy enough to make it work with VECTORS ;
Valid = 1 - (min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x + min(H.y, 1);
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
(or you could just double the code for .x and .y)
OR
Valid = 1 - (min(H.x, H.y), 1u);
//If .y is valid, add 1 to the nonce.
Nonce = W[3].x;
OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;
and have the __init__ file check both Nonce and Nonce+1
another way of doing it would be (the compiler should replace the if statement with a set conditional):
Nonce = W[3];
Position = W[3] & OUTPUT_MASK;
if(H)
Position = OUTPUT_MASK + 1;
//Invalid nonce are at the last position of the array, valid are distributed at the front
OUTPUT[Position] = Nonce;
Slightly faster would be to have the Position = the local thread # (since you save an &) and make sure that the size of the output* array is WORKSIZE + 1:
Nonce = W[3];
Position = get_local_id(0);
if(H)
Position = WORKSIZE + 1;
OUTPUT[Position] = Nonce;
EDIT: Ooh, just thought of something else:
If it doesn't like writing everything to the same address: Make the buffer size = 2*WORKSIZE...
Nonce = W[3];
Position = get_local_id(0);
if(H)
Position += WORKSIZE;
OUTPUT[Position] = Nonce;
Then all of the threads in a workgroup will write to a different address. The valid nonces will be in the first half, and the invalid will be in the second.
Now I have no idea if any of these things would be faster, but I think all of them would work...
Sorry to put so much code down... but this kind of coding isn't really an exact science...