There is also a great pull request that was submitted a day or two ago. It allows the design to scale down to fit into smaller chips, which I know a lot of people have been waiting for. I'm just waiting for some free time to open up so I can dive in, test the new patch, and merge it. Many thanks to udif for submitting such a wonderful improvement!
The Verilog code has been updated on my fork of fpgaminer's git repository, and it seems to be working in the simulator. I will try it on real hardware later today.
You should now be able to fold the 2x64 pipe stages down to 2xN stages, where N is a power of two: 1, 2, 4, ..., 64 (for N=64 it behaves like the original code).
Of course, folding the hardware pipe into loops means that it will run 64/N times slower.
With N=8, I was able to fit the design into an EP3C25 at >90% capacity.
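
To make the 64/N trade-off concrete, here is a rough Python sketch of the arithmetic only (it is not part of the Verilog patch). The function names are made up for illustration, and the hash-rate estimate assumes the unfolded N=64 design produces one hash per clock cycle; the 100 MHz clock in the usage line is just a placeholder, not a measured figure.

```python
def valid_fold_factors(full_depth=64):
    """Return the allowed values of N: powers of two from 1 up to full_depth."""
    factors = []
    n = 1
    while n <= full_depth:
        factors.append(n)
        n *= 2
    return factors


def slowdown(n, full_depth=64):
    """Return how many times slower a 2xN design is than the full 2x64 one."""
    if n not in valid_fold_factors(full_depth):
        raise ValueError("N must be a power of two between 1 and %d" % full_depth)
    return full_depth // n


def hashes_per_second(clock_hz, n, full_depth=64):
    """Estimate hash rate for a folded design.

    Assumption: the unfolded design (N == full_depth) completes one hash
    per clock, so a folded design completes one hash every 64/N clocks.
    """
    return clock_hz / slowdown(n, full_depth)


if __name__ == "__main__":
    # Hypothetical 100 MHz clock, only to show how the rate scales with N.
    for n in valid_fold_factors():
        print(n, slowdown(n), hashes_per_second(100e6, n))
```

So for N=8 the pipe is 8 times slower than the original, and each halving of N halves the hash rate again in exchange for a smaller design.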