Looking at the OpenCL kernel, could the belowOrEquals function avoid the endian-related comparison of separate bytes by moving the endian switch into the Python code that creates targetH and targetG? That would mean fewer branches in the kernel and perhaps better stream usage.
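To make the idea concrete, here is a rough host-side sketch. It assumes the per-byte comparison is only needed because the target words reach the kernel in the opposite byte order from the hash words; I have not confirmed that against the kernel. The names swap32, make_targets and bytewise_below_or_equals are made up for illustration, and below_or_equals is only a Python model of what the simplified kernel test might look like.

```python
import struct


def swap32(word: int) -> int:
    """Reverse the byte order of a 32-bit unsigned word."""
    return struct.unpack("<I", struct.pack(">I", word & 0xFFFFFFFF))[0]


def make_targets(target: int) -> tuple:
    """Split a 64-bit target into (targetH, targetG), byte-swapped on the host.

    Assumption: the kernel wants each word in the opposite byte order
    from the one the host computes it in.
    """
    high = (target >> 32) & 0xFFFFFFFF
    low = target & 0xFFFFFFFF
    return swap32(high), swap32(low)


def below_or_equals(h: int, target_h: int, g: int, target_g: int) -> bool:
    """Model of the simplified kernel test: two whole-word comparisons."""
    if h != target_h:
        return h < target_h
    return g <= target_g


def bytewise_below_or_equals(h: int, raw_h: int, g: int, raw_g: int) -> bool:
    """Model of a per-byte test against target words left unswapped."""
    # Hash bytes, most significant byte of each word first.
    hash_bytes = struct.pack(">II", h, g)
    # Target bytes read in reversed order within each word.
    target_bytes = struct.pack("<I", raw_h) + struct.pack("<I", raw_g)
    # bytes objects compare lexicographically, byte by byte.
    return hash_bytes <= target_bytes
```

Under that assumption the two models agree: bytewise_below_or_equals(h, raw_h, g, raw_g) gives the same result as below_or_equals(h, swap32(raw_h), g, swap32(raw_g)), so the swap only has to happen once per target on the host. One caveat: reversing bytes does not preserve numeric ordering, so if the hash words themselves have to be read in reversed order, the kernel would still need to swap H and G once before comparing. That would still be branch-free.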