You have to compare the loss of valid nonces to the higher efficiency because of the removed control flow in the kernel (all current GPUs dislike if/else and so on). I thought this tradeoff would be well worth it, but you could prove me wrong. I was thinking about a better way of writing the positive nonces into output, but that didn't work.
Any good ideas for that part of the kernel will be a big plus!
After looking at the code more carefully your method is only problematic if more than 1 vector component returns a valid nonce. The odds of this happening are EXTREMELY small, since you would have to find more than 1 valid hash in a range of only 2 or 4 hashes.
That said, I have devised a way to remove the if(nonce) control structure entirely. This makes a couple assumptions:
1. Control flow instructions have a large clock cycle penalty regardless of the branch taken (so you get 44 cycle penalty on Cypress and Cayman regardless of if H == 0)
2. Writing values to output for every nonce even if the nonce is invalid does not incur a significant clock cycle cost relative to the control flow instructions. (ideally <10 clocks, but if it's below ~30 the code below will still be faster than the current code)
1. OR the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit
6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = OUTPUT[result of #8] = nonce
Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)
Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefuly unique for each valid nonce assuming there are a small number of valid nonces. However in the worst case (more than 1 hash found in a single execution) at least 1 will be returned. However if 3 or less nonces are found per execution all of them should be returned in most cass.
output will be overwritten constantly by invalid nonces (since the 1-bit number from step 5 will be 0 unless the hash satisfies H == 0, the resulting 8-bit number will also be 0) output[>0] will contain valid nonces will a small chance of collisions.
Cypress and Cayman (58xx and 69xx respectively) have a 44 cycle latency for control flow instructions
Steps 1 - 8 should execute in 1 clock each (however they can't be vectorized, so this won't exploit any ILP)
Step 9 takes no longer than the current code for valid nonces, but this will now also apply to invalid nonces.
overall this should be fast, return only valid nonces, and retain the capability to return more than one nonce if the assumptions above are true.
An example of how even a single 1 in the input will cause the output of steps 1-5 to be 0:
H = 0000000000000001 0000000000000000