As with my previous release, this may be Kepler-only, but I think the algorithm will work on Fermi. Part of why I want to open-source it is to let the community build upon it. I know there's more performance to be tuned out of it, probably on the order of 30%, but maybe as much as 200%.
Really??
Yeah. I don't overlap kernel execution with CPU work, or use multiple kernels. During one of the phases the memory is locked up 100% and the compute cores are bored; in another phase the cores are locked up and the memory is bored, etc.
But, as yvg1900 convinced me, it's healthier to let these things evolve gradually anyway.
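For anyone wondering what that kind of overlap might look like, here is a minimal CUDA sketch. It is not taken from the released code: the kernel names `memoryBoundPhase` and `computeBoundPhase`, the buffer sizes, and the loop count are all hypothetical stand-ins. It only illustrates the general idea of putting a memory-heavy kernel and an arithmetic-heavy kernel in separate streams, and doing CPU work while they run. The hardware is allowed, not required, to run the two kernels concurrently; whether they actually overlap depends on how many resources each one occupies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical memory-bound phase: a plain copy with almost no arithmetic.
__global__ void memoryBoundPhase(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Hypothetical compute-bound phase: many arithmetic steps per element loaded.
__global__ void computeBoundPhase(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 4096; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

int main() {
    const int n = 1 << 20;              // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(c, 0, bytes);

    // One stream per phase, so neither launch has to wait for the other.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    memoryBoundPhase<<<blocks, threads, 0, s0>>>(a, b, n);
    computeBoundPhase<<<blocks, threads, 0, s1>>>(c, n);

    // Kernel launches return immediately, so the CPU can work here
    // while the GPU is busy. This loop is a placeholder for that work.
    double hostSum = 0.0;
    for (int i = 0; i < n; ++i) hostSum += i * 0.5;

    // Wait for both streams to finish and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("GPU status: %s, host sum: %.1f\n", cudaGetErrorString(err), hostSum);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return err == cudaSuccess ? 0 : 1;
}
```

With grids this large, the first kernel may fill the GPU on its own, so a real tuning pass would probably need to size the launches so both phases fit on the device at once, and then confirm the overlap in a profiler.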