Since your revision of my source has nothing to offer on the majority of GPUs, <edit>even 6 months after the release</edit> (except the 1.7% fee you take in your pocket).
Can we see the source code please?
Why don't you extract the PTX? You will see that it is different.
I left the exe wide open.
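For anyone who wants to check this: dump the PTX from each exe (cuobjdump from the CUDA toolkit has a -ptx option for that) and compare the two text dumps. Below is a rough Python sketch of the compare step only. The function names and the kernel_a.ptx / kernel_b.ptx labels are made up for illustration and are not part of either miner.

```python
import difflib


def ptx_body(text):
    """Keep only PTX code lines: drop blank lines and // comment lines."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("//"):
            lines.append(line)
    return lines


def ptx_diff(ptx_a, ptx_b):
    """Return unified-diff lines between two PTX dumps (empty list if identical)."""
    return list(
        difflib.unified_diff(
            ptx_body(ptx_a),
            ptx_body(ptx_b),
            fromfile="kernel_a.ptx",  # hypothetical label for the first dump
            tofile="kernel_b.ptx",    # hypothetical label for the second dump
            lineterm="",
        )
    )
```

Read the two dumps into strings and pass them to ptx_diff. An empty result means the code lines are identical once comments are stripped; anything else lists the instructions that differ.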
My kernel is 2-6% faster on all the cards I have tested: GTX 1060 3GB, GTX 1070, GTX 970.
100% TDP, small OC, -500 on the memory.
My kernel vs your kernel (GTX 1060 3GB), same launch config, compiled with the same compiler (CUDA 7.5), same driver:
You can profile it yourself.