Tl;dr: Some explanations as to why there is currently still no GPU miner for primecoin.
Some of you may know me from the other GPU miner thread, where I tried to accelerate the primecoin miner with CUDA.
While I have no idea how mtrlt is doing with his port, I can only imagine that he is running into similar problems. A primecoin miner has two stages: filtering candidates with a sieve, and testing the survivors for primality with Fermat tests (base 2). With the latest default settings you spend about 30% of the time sieving and 70% testing candidates on testnet, and about 70% sieving and 30% testing on mainnet. On mainnet the sieve filters out roughly 97% of all candidates you started with.
In order to produce a meaningful speedup, you will have to port both stages to the GPU, or make one of them very small. Let's say you keep the original settings, port the sieve to the GPU, and interleave the Fermat tests on the CPU.
In this scenario your speedup is at most ~3x, because you still have to spend that 30% of the time testing candidates on the CPU. OK, you can sieve much faster now, so why not sieve with 10x more primes? You already filter out 97%, and each new prime p will not kill more than roughly 1/p of the remaining candidates. The low-hanging fruit are the small primes, and you have already used them, so you get diminishing returns for the extra computation as the primes get bigger. At some point, even if you sieve 100x faster, the extra sieving isn't worth it and simply checking the number with a Fermat test makes more sense. (>90% of all Fermat tests still find the candidate composite even after sieving, i.e. the start of that possible chain is definitely composite; only for the remaining <10% do you check whether you hit a chain of some type.) Bottom line: to produce a meaningful speedup with GPU mining, you will have to port both the sieve and the Fermat tests to the GPU. Another argument for this is that data transfer between CPU and GPU is costly and can quickly become a bottleneck.
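To get a feeling for these diminishing returns, here is a rough Python sketch. It models a plain divisibility sieve rather than the actual primecoin chain sieve (which sieves chain origins across several chain positions), and the helper functions are just for this illustration, so the exact numbers don't mean much. The trend is the real point: by Mertens' theorem the surviving fraction only shrinks like 1/ln of the sieving bound.

# Rough illustration only, not the actual primecoin sieve: the fraction of numbers
# with no prime factor below B shrinks roughly like 1/ln(B), so every extra order
# of magnitude of sieving primes buys less and less.

def primes_up_to(limit):
    """Simple sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [p for p, flag in enumerate(is_prime) if flag]

def surviving_fraction(prime_limit):
    """Fraction of candidates left after removing multiples of all primes <= prime_limit."""
    frac = 1.0
    for p in primes_up_to(prime_limit):
        frac *= 1.0 - 1.0 / p
    return frac

for limit in (10_000, 100_000, 1_000_000):
    print(limit, surviving_fraction(limit))
# Sieving with 100x more primes only removes a couple more percentage points of candidates.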
OK, so why is porting the Fermat test to the GPU so hard?
The Fermat test is actually very simple in principle; it's a one-liner in Python:
2==pow(2,n,n)
In Python, it can also easily be used with bigger numbers:
>>> n = 8229677002348755372716479379024478187948307769969295540186468240943252147753090395716866121321015192338179630120010069556821511582091
>>> 2 == pow(2,n,n)
True
>>>
But a lot of magic is running under the hood, which makes this a quite complex problem for GPUs: the numbers involved are bigger than what fits into standard 32/64-bit values, and 2^n can't be evaluated directly. For the first problem, you represent the above n with e.g. seven 64-bit integers and emulate multiplication and addition on them with 64-bit instructions.
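Just to illustrate that representation (this is not the miner's actual code, and Python's native big integers do the real work behind the scenes here), a minimal sketch of splitting n into 64-bit limbs and adding two such numbers limb by limb:

# Illustrative only: a ~442-bit n stored as seven 64-bit limbs, least significant
# first, with addition emulated limb by limb and explicit carry propagation.

MASK64 = (1 << 64) - 1

def to_limbs(x, count=7):
    """Split a big integer into `count` 64-bit limbs, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    """Schoolbook multi-precision addition with carry propagation."""
    result, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        result.append(s & MASK64)
        carry = s >> 64
    return result  # final carry out ignored here for brevity

n = 8229677002348755372716479379024478187948307769969295540186468240943252147753090395716866121321015192338179630120010069556821511582091
limbs = to_limbs(n)
assert from_limbs(limbs) == n
assert from_limbs(add_limbs(limbs, limbs)) == 2 * n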
A fast algorithm is known for the second problem, modular exponentiation (http://en.wikipedia.org/wiki/Modular_exponentiation): square-and-multiply, where you never compute 2^n itself but reduce modulo n after every step. The basic algorithm is simple, but there are many more complex tricks to speed up the computation. For example, with a Montgomery reduction (which, at least on a CPU, is faster than a Barrett reduction but only works for odd moduli) no divisions are required at all, which is great, since multiplications are much faster than divisions. Memory can also be traded for execution time by precomputing some values, which however won't help much on a GPU.
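For reference, here is a minimal Python sketch of the square-and-multiply loop that pow(2, n, n) performs for us; real bignum libraries typically layer windowed exponentiation and Montgomery or Barrett reduction on top of this, but the core idea is the same:

def modexp(base, exponent, modulus):
    """Right-to-left binary (square-and-multiply) modular exponentiation."""
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                     # multiply only when the current bit is set
            result = (result * base) % modulus
        base = (base * base) % modulus       # square every iteration
        exponent >>= 1
    return result

def fermat_test_base2(n):
    """Fermat probable-prime test with base 2, as used by the miner."""
    return modexp(2, n, n) == 2

n = 8229677002348755372716479379024478187948307769969295540186468240943252147753090395716866121321015192338179630120010069556821511582091
print(fermat_test_base2(n))   # matches the pow() one-liner above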
So doing many Fermat tests means evaluating many independent modular exponentiations. At first sight this seems to be a great fit for a GPU, since each exponentiation can run independently in its own thread. The root of the problem is that modular exponentiation is an inherently sequential computation (each squaring depends on the previous one, so a single exponentiation can't easily be parallelized) and that the execution flow is strongly dependent on the input data. That is a really tough problem, since GPUs are "single instruction, multiple data" devices. A CUDA GPU executes threads in warps of 32; if the threads within a warp diverge, which they will if each thread's execution flow is strongly data dependent, their execution gets serialized.
Whoops, that's up to a 32x penalty.
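To make the divergence argument concrete, a toy Python illustration (the candidate values are made up): it records which operation sequence square-and-multiply would execute for a few different candidates and counts the steps at which they disagree; on a GPU, every such step has to be executed once per branch for the whole warp.

# Illustration (not CUDA) of why neighbouring Fermat tests diverge: the
# square-and-multiply schedule is dictated by the bit pattern of n itself,
# so each thread in a warp wants to take a different branch sequence.

def op_trace(exponent):
    """Sequence of operations square-and-multiply performs for this exponent."""
    ops = []
    while exponent > 0:
        ops.append("square+multiply" if exponent & 1 else "square")
        exponent >>= 1
    return ops

candidates = [1000003, 1000033, 1000037, 1000039]   # hypothetical sieve survivors
traces = [op_trace(n) for n in candidates]
# Count positions where not all "threads" agree -> those steps get serialized on a GPU.
divergent_steps = sum(1 for step in zip(*traces) if len(set(step)) > 1)
print(divergent_steps, "of", len(traces[0]), "steps diverge")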
The second problem is that at least most NVIDIA GPUs synthesize 64-bit integer instructions from 32-bit instructions. 64-bit addition and subtraction can be synthesized efficiently (2 or 3 instructions depending on compute capability), but 64-bit multiplication requires much longer instruction sequences (on the order of 10 to 20 instructions depending on the architecture). That is roughly another 10x penalty compared to a 64-bit CPU, which executes these instructions natively.
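To see where those instruction counts come from, here is a sketch of what has to be synthesized for a single 64-bit multiply out of 32-bit pieces (Python integers standing in for 32-bit hardware registers; the mask-and-shift juggling is the point):

# Emulating a 64-bit multiply from 32-bit halves: already three 32-bit multiplies
# plus shifts and adds just for the low 64 bits of the product (the full 128-bit
# product needed inside a multi-precision multiply needs four, plus carry handling).

MASK32 = (1 << 32) - 1

def mul64_from_32(a, b):
    """Low 64 bits of a*b computed only from 32-bit pieces."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    lo = a_lo * b_lo                              # 32x32 -> 64
    mid = (a_lo * b_hi + a_hi * b_lo) & MASK32    # cross terms, only the low half matters
    return (lo + (mid << 32)) & ((1 << 64) - 1)

assert mul64_from_32(0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF) == \
       (0xDEADBEEFCAFEBABE * 0x0123456789ABCDEF) & ((1 << 64) - 1)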
So that's easily a ~320x combined penalty: even if a GPU is 100x faster than a CPU at simple instructions, with the approach outlined above it will still be ~3 times slower than a CPU. There might be a clever idea to do something radically different from what was outlined above, but at this point I'm very sceptical that we will see a GPU miner soon, let alone one that is practical given the higher electricity use of a GPU compared to a CPU.