The steps:

1. AND the low 16-bits of H against the high 16 bits

2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits

3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits

4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits

5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce

7. take the result from #6 and XOR the low 16-bits against the high 16-bits

8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits

9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefuly unique for each valid nonce assuming there are a small number of valid nonces. However in the worst case (more than 1 hash found in a single execution) at least 1 will be returned. However if 3 or less nonces are found per execution all of them should be returned in most cass.

Sorry to jump in in the middle of the conversation, but if I understand what you are trying to do...

Can't you just replace all of the steps with:

Valid = 1 - min(H, 1u);

Nonce = W[3];

OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

if you are trying to remove all control flow? Any invalid nonce will be written into Output[0] and the valid nonces will be randomly distributed through the rest of the array.

I really don't know how the architecture handles having 4 billion threads writing to the same address, but... you may want to try it out...

Also, it is easy enough to make it work with VECTORS ;

Valid = 1 - (min(H.x, H.y), 1u);

//If .y is valid, add 1 to the nonce.

Nonce = W[3].x + min(H.y, 1);

OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

(or you could just double the code for .x and .y)

OR

Valid = 1 - (min(H.x, H.y), 1u);

//If .y is valid, add 1 to the nonce.

Nonce = W[3].x;

OUTPUT[((Nonce & OUTPUT_MASK) + 1) * Valid] = Nonce;

and have the __init__ file check both Nonce and Nonce+1

another way of doing it would be (the compiler should replace the if statement with a set conditional):

Nonce = W[3];

Position = W[3] & OUTPUT_MASK;

if(H)

Position = OUTPUT_MASK + 1;

//Invalid nonce are at the last position of the array, valid are distributed at the front

OUTPUT[Position] = Nonce;

Slightly faster would be to have the Position = the local thread # (since you save an &) and make sure that the size of the output* array is WORKSIZE + 1:

Nonce = W[3];

Position = get_local_id(0);

if(H)

Position = WORKSIZE + 1;

OUTPUT[Position] = Nonce;

EDIT: Ooh, just thought of something else:

If it doesn't like writing everything to the same address: Make the buffer size = 2*WORKSIZE...

Nonce = W[3];

Position = get_local_id(0);

if(H)

Position += WORKSIZE;

OUTPUT[Position] = Nonce;

Then all of the threads in a workgroup will write to a different address. The valid nonces will be in the first half, and the invalid will be in the second.

Now I have no idea if any of these things would be faster, but I think all of them would work...

Sorry to put so much code down... but this kind of coding isn't really an exact science...