Bitcoin Forum

Alternate cryptocurrencies => Altcoin Discussion => Topic started by: primedigger on August 22, 2013, 12:26:44 AM



Title: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 22, 2013, 12:26:44 AM
Tl;dr: Some explanations as to why there currently is still no GPU miner for primecoins


Some of you may know me from the other GPU miner thread, where I tried to accelerate the primecoin miner with CUDA.

While I have no idea how mtrlt is doing on his port, I can only image that he encounters similar problems. A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2). Latest default setting are: On testnet you spend 30% sieving and 70% testing candidates, on mainnet 70% sieving and 30% testing. On mainnet you will filter out 97% of all candidates you started with by sieving.

In order to produce a meaningful speedup, you will have to port both stages to the GPU or make one of them very small. Lets say you keep the original settings, port the sieve to GPU, interleave Fermat tests on the CPU. In this scenario your speedup is at most ~3x, because you still have to do that 30% candidate testing. Ok, you can sieve much faster now, so why not sieve with 10x more primes? You already filter 97%, and each new prime will not kill more than 1/p more candidates. The low-hanging fruits are the smaller primes and you have already used them. So, dimishing returns for the extra computations if these primes get bigger. At some point, even if you sieve 100x faster it won't be worth the extra computations and just checking if that number is prime with a Fermat test makes more sense. (>90% of all Fermat tests come back positive after sieving, that is, the start of that possible chain is definitely composite. Only for the remaining <10% you check if you hit a chain of some type.). Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.

Ok, why is porting Fermats test to the GPU so hard?  

That Fermat test is actually very simple in principle, it's a one liner in python:

2==pow(2,n,n)

In python, it can be also easily used with bigger numbers:

>>> n = 8229677002348755372716479379024478187948307769969295540186468240943252147753090 395716866121321015192338179630120010069556821511582091
>>> 2 == pow(2,n,n)
True
>>>

But a lot of magic is running under the hood, which makes this a quite complex problem for GPUs: the numbers involved are bigger than what fits into standard 32/64bit values and 2^n can't be evaluated directly. For the first problem, you use e.g. seven 64bit integers that represent the above number n and emulate multiplication and addition on them with 64-instructions.

A fast algorithm is known for the second problem, called Modular exponentiation (http://en.wikipedia.org/wiki/Modular_exponentiation). The algorithm is simple, but there are many more complex tricks to speed up the computation. For example, with a Montgomery reduction (which, at least on a CPU, is faster than a Barrett reduction but works only on odd numbers) no divisions are required, which is great, since multiplications are much faster than divisions. Memory can be traded for faster execution time by precomputing some values, which however won't make much sense on a GPU.

So doing many Fermat tests means evaluating many independent modular exponentiations. At first, doing many independent modular exponentiation seems to be a great fit for a GPU, since each calculation could be done independently in its own thread. The root of the problem is that Modular exponentiation is a recursive evaluation (= can't be easily parallelized) and that the execution flow is strongly dependent on the input data. That is really a tough problem, since GPU's are "Single instruction, multiple data" devices. A CUDA GPU will execute many threads in warps, which on newer CUDA GPU's are in the order of 48 threads. If these warps divergate, which they will do if each threads execution flow is strongly data dependent, then these threads will be serialized. Whoops, that's a 48x penalty.

Second problem is that at least most NVIDIA GPU's synthesise 64-bit integer instructions from 32-bit instructions. 64-bit addition and subtraction can be synthesized efficiently from 32-bit operations (2 or 3 instructions depending on compute capability). Multiplication requires longer instruction sequences (on the order of 10 to 20 instructions depending on architecture), so that is another 10x penalty compared to a 64-bit CPU which can execute these instructions natively.

So that's easily a 480x combined penalty, even if a GPU is 100x faster calculating simple instructions than a CPU, it will be ~5 times slower with the above idea than a CPU. There might be a clever idea to do something radically different than what was outlined above, but at this point I'm very sceptical that we will see a GPU miner soon which would be practical given the higher electricity use of a GPU compared to a CPU.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: jh00 on August 22, 2013, 12:57:04 AM
You sum up the problem quite nicely. But I have to disagree in one point. The fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors you will still spend a majority of time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data with varying length and many conditional branches like it is the case for the primality tests. But then again I am not sure if the overall speed gain will actually be worth the increased electricity bill.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: SynergyCores on August 22, 2013, 12:59:40 AM
So........

https://i.imgur.com/WCxqZ8e.gif
??


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: tyranosaur on August 22, 2013, 01:00:45 AM

 The hot spot in a primecoin miner is testing numbers for primality by using Fermat tests (with base 2). Porting the sieve to the GPU is almost pointless, because only 1% of the computing time is spend there.




I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?



Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: reb0rn21 on August 22, 2013, 01:01:58 AM
Maybe some low end GPU even integrated one can help in sieva part, like haswell HD4600 opencl or AMD integrated?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: billym2k on August 22, 2013, 01:09:14 AM
This is a great thread to show the complexity of the problem...

Though, out of curiosity, why are people wanting a GPU miner so badly for Primecoin anyway? I kinda enjoy a CPU only coin, since it lets me mine different coins with my graphics card and my processor at the same time...


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 22, 2013, 01:17:49 AM

 The hot spot in a primecoin miner is testing numbers for primality by using Fermat tests (with base 2). Porting the sieve to the GPU is almost pointless, because only 1% of the computing time is spend there.




I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?



Redacted that statement, but still, even if you made that 75% sieving 1000 times faster, you would still need those other 25% of the time and wouldn't mine much faster.




Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: jh00 on August 22, 2013, 01:21:45 AM
...
I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?
Redacted that statement, but still, even if you made that 25% sieving 1000 times faster, you would still need those other 75% of the time and wouldn't mine much faster.
It's 75% sieving, not 25%. If you make the sieve 10000 times faster you can eliminate 99.9999% of candidates that need testing by filtering more prime divisors. The purpose of the sieve is to avoid having to do the slow tests, so improving the sieve does actually make more sense.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: LlamaMaster on August 22, 2013, 01:22:20 AM
This is a great thread to show the complexity of the problem...

Though, out of curiosity, why are people wanting a GPU miner so badly for Primecoin anyway? I kinda enjoy a CPU only coin, since it lets me mine different coins with my graphics card and my processor at the same time...
People mainly just wanted to chase after the large short term profits a GPU may provide (assuming the increase in performance was sufficiently large).  There was also the argument that a GPU miner would open up the coin to more people are drive up the price, but I can't wrap my head around that logic when everybody already has CPU.  All in all I don't see any advantage to releasing a GPU miner at this point, and I think that it will just lead to a higher difficulty, a lower market price, or both.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 22, 2013, 01:34:38 AM
...
I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?
Redacted that statement, but still, even if you made that 25% sieving 1000 times faster, you would still need those other 75% of the time and wouldn't mine much faster.
It's 75% sieving, not 25%. If you make the sieve 10000 times faster you can eliminate 99.9999% of candidates that need testing by filtering more prime divisors. The purpose of the sieve is to avoid having to do the slow tests, so improving the sieve does actually make more sense.

Ah, sorry, need some sleep... But you'd still have dimishing returns for sieving, as filtering by 2x as many primes won't filter out 2x more candidates.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Alex P on August 22, 2013, 03:54:18 AM
I would offer 5 BTC to the first to provide me with a working GPU that the public does not know about.  This will need to be atleast 10x faster than CPU.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: bcp19 on August 22, 2013, 11:25:43 AM
I would offer 5 BTC to the first to provide me with a working GPU that the public does not know about.  This will need to be atleast 10x faster than CPU.
mtrlt have already been donated over 75BTC and he's not been able to accomplish more that ~2x speed increase at last report.  Good luck with that :P.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: bcp19 on August 22, 2013, 11:28:21 AM
You sum up the problem quite nicely. But I have to disagree in one point. The fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors you will still spend a majority of time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data with varying length and many conditional branches like it is the case for the primality tests. But then again I am not sure if the overall speed gain will actually be worth the increased electricity bill.
mtrlt already accomplished sieving on the GPU and fermat on the CPU and it came out to the same speed as the CPU alone.  Data flow between GPU/CPU is also a bottleneck.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: CoinBuzz on August 22, 2013, 11:39:27 AM
i am wondering if in primecoin, it find just a probablity of primility, so one can start from 1 and go to next, next,next and test each one. am i wrong ?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: usahero on August 22, 2013, 11:51:13 AM
Could you explain what primecoin algorithm actually does?

Why doesn't primecoin for example just read some known prime numbers from some database, compare them if there is any Cunningham chain candidate and ... offer result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains..?

If you could describe this in layman's terms, I'd be very happy.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: 18RATTT on August 22, 2013, 12:02:50 PM
I would offer 5 BTC to the first to provide me with a working GPU that the public does not know about.  This will need to be atleast 10x faster than CPU.
mtrlt have already been donated over 75BTC and he's not been able to accomplish more that ~2x speed increase at last report.  Good luck with that :P.
alex p, better donate that btc into the development process.

let me tell you something, people like you are bad for the ecosystem, people who only want to have the best thing delivered for them first and paid it afterward, but dont want to contribute during the development process.

i have a quote for you,

Code:
"if you are absent during my struggle, dont expect my presence during my success"


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: reb0rn21 on August 22, 2013, 12:25:25 PM
You sum up the problem quite nicely. But I have to disagree in one point. The fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors you will still spend a majority of time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data with varying length and many conditional branches like it is the case for the primality tests. But then again I am not sure if the overall speed gain will actually be worth the increased electricity bill.
mtrlt already accomplished sieving on the GPU and fermat on the CPU and it came out to the same speed as the CPU alone.  Data flow between GPU/CPU is also a bottleneck.

If this is true then only CPU + integreted GPU might get "some" benefit... data flow should be a lot better


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: CoinBuzz on August 22, 2013, 12:36:11 PM
Could you explain what primecoin algorithm actually does?

Why doesn't primecoin for example just read some known prime numbers from some database, compare them if there is any Cunningham chain candidate and ... offer result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains..?

If you could describe this in layman's terms, I'd be very happy.

+1


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: #Darren on August 22, 2013, 12:53:04 PM
 I am happy it is proving GPU resistant  :D


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: FreedomCoin on August 22, 2013, 01:11:55 PM
I am happy it is proving GPU resistant  :D

Just like LTC is currently ASIC resistant :-)


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 22, 2013, 01:15:28 PM
So, to get my numbers straight (bottom line still is that porting Fermat's test is though):

A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2). Latest default setting are: On testnet you spend 30% sieving and 70% testing candidates, on mainnet 70% sieving and 30% testing. On mainnet you will filter out 97% of all candidates you started with by sieving.

In order to produce a meaningful speedup, you will have to port both stages to the GPU or make one of them very small. Lets say you keep the original settings, port the sieve to GPU, interleave Fermat tests on the CPU. In this scenario your speedup is at most ~3x, because you still have to do that 30% candidate testing. Ok, you can sieve much faster now, so why not sieve with 10x more primes? You already filter 97%, and each new prime will not kill more than 1/p more candidates. The low-hanging fruits are the smaller primes and you have already used them. So, dimishing returns for the extra computations if these primes get bigger. At some point, even if you sieve 100x faster it won't be worth the extra computations and just checking if that number is prime with a Fermat test makes more sense. (90% of all Fermat tests come back positive, that is, the start of that possible chain is definitely composite. Only for the remaining 10% you check if you hit a chain of some type.). Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: cryptocoinsnews on August 22, 2013, 01:36:33 PM
http://www.cryptocoinsnews.com/2013/08/22/xpm-so-why-is-porting-primecoin-mining-to-the-gpu-so-difficult/


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Sy on August 22, 2013, 01:46:10 PM
So once we are at it - what about fpgas?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: ilostcoins on August 22, 2013, 01:52:13 PM
Interesting insight, thank you.

Does Nvidia or AMD have a better chance of boosting efficiency in primecoin mining? Is CUDA more flexible in some ways and would that help?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 22, 2013, 02:36:53 PM
So once we are at it - what about fpgas?

I don't know much about FPGA's, but in theory, if you build a simple 1024-bit processor that can multiply,add,subtract and shift, then you could also do Fermat tests. 1024-bit would be enough for current prime records on the primecoin chain.

RSA also heavily relies on modular expansion and a basic FPGA implementation would start with 1024bit, so if you search for research papers on the topic you find papers like that one: http://irep.iium.edu.my/29540/1/FPGA_Implementation_of_RSA_Encryption.pdf

No idea how fast that would be though.



Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: pyromaniac on August 22, 2013, 03:28:42 PM
I am happy it is proving GPU resistant  :D

Just like LTC is currently ASIC resistant :-)
Yep. Like it was GPU resistant just one and half year ago.  ::)


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: smolen on August 22, 2013, 03:34:49 PM
So once we are at it - what about fpgas?

I don't know much about FPGA's, but in theory, if you build a simple 1024-bit processor that can multiply,add,subtract and shift, then you could also do Fermat tests. 1024-bit would be enough for current prime records on the primecoin chain.

RSA also heavily relies on modular expansion and a basic FPGA implementation would start with 1024bit, so if you search for research papers on the topic you find papers like that one: http://irep.iium.edu.my/29540/1/FPGA_Implementation_of_RSA_Encryption.pdf

No idea how fast that would be though.


http://www-mtl.mit.edu/researchgroups/icsystems/pubs/conferences/2001/goodman_isscc01_slides.pdf (http://www-mtl.mit.edu/researchgroups/icsystems/pubs/conferences/2001/goodman_isscc01_slides.pdf)
Primitives performance points (@50 MHz):
Modexp (IF/DL)
 GFexp (DL)
 ECmult (ECDL)
8.2/5.2 ms @ 512b
 8.0ms @ 512b
 7.0ms @ 176b
32.1/17 ms @ 1024b
 31.7ms @ 1024b
 14.5ms @ 256b


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Schleicher on August 22, 2013, 07:49:02 PM
Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.
I imagine that it can work like this:
Run the sieve on the GPU until you have lots of candidates.
Then run a single fermat test on the GPU for all of these candidates.
All candidates passing the first test will be tested for chain length 2.
And so on.
I'm not sure how much memory we will need on the GPU if we store all the candidates there.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Koooooj on August 22, 2013, 10:20:27 PM
Could you explain what primecoin algorithm actually does?

Why doesn't primecoin for example just read some known prime numbers from some database, compare them if there is any Cunningham chain candidate and ... offer result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains..?

If you could describe this in layman's terms, I'd be very happy.

The algorithm is kind of complicated to explain in lay terms.  First, I'd like to point out that there is no database of primes of this size that would be even kind of useful.  The numbers used are on the order of 10100.  In that range about 1 in 180 numbers is prime.  That means that in order to store all of the prime numbers between 10100 and 10101 you would more memory than there could ever exist in the universe--more bits than there are atoms.  No such database can ever exist within the current understanding of physics.

So, the algorithm is somewhat of a brute force.  It has some elegance to it, though.  The chains have a very specific form, which gives a lot of opportunity for speedups.  For example, if your chain starts at N then 2N-1 is in the chain (for one of the chain types).  It would be inefficient to check the primality of 2N-1 if N is already shown to be composite--the chain is already proven to be invalid.  Many optimizations take this general form.

The algorithm itself is broken into two main parts: sieving and primality tests.  Sieving is a really fast way to show that lots of numbers are composite, but it is impractical to show that any number is prime by a sieve alone.  The most famous (and oldest) sieve is the Sieve of Eratosthenes, which has a lovely Gif and explanation on its wikipedia page.  Sieves work by eliminating numbers that have small factors;  it is easy and computationally cheap to carry out a single division to determine if a big number, N, is divisible by a small number, k, and about 1 in k divisions will show that N is divisible by k and that N is therefore composite.  This can be taken a step farther by noting that if N is divisible by k then N + a*k is also divisible by k, allowing many candidates to be eliminated by a single division and a few multiplications.  Thus, the mining algorithm first generates a set of candidate numbers based on a hash function and the data in the block, then runs that set of numbers through a sieve.

The next step is to actually test the remaining numbers after the sieve takes out most of them.  This is accomplished by the Fermat (Probable) Primality Test, which holds that if aP-1 ≡ 1 (mod P) for all a and prime P.  That is to say, you can take any number (a) and raise it to the power of your candidate (P) minus one, then divide the result by the candidate; if you get anything other than 1 as the remainder of that division then P is composite, otherwise P is probably prime.  For PrimeCoin a is always taken to be 2, and "probable" is taken to be good enough--the odds that 2 is a "Fermat Liar" for P are much less than the odds that P is prime.

Thus, the general flow is (Protocol to build a block and find the appropriate hash value to start from) -> (form a set of numbers that could be valid as part of prime chains for this block) -> (run the numbers through a sieve) -> (Eliminate numbers that passed the sieve but aren't parts of chains of length 9 or longer (current integer difficulty)) -> (for each chain of possibly prime numbers remaining, execute a Fermat Primality test until you either find a complete chain or find a composite number that proves that no chain long enough will be found there).


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: ilostcoins on August 23, 2013, 12:02:48 AM

The algorithm is kind of complicated to explain in lay terms.  First, I'd like to point out that there is no database of primes of this size that would be even kind of useful.  The numbers used are on the order of 10100.  In that range about 1 in 180 numbers is prime.  That means that in order to store all of the prime numbers between 10100 and 10101 you would more memory than there could ever exist in the universe--more bits than there are atoms.  No such database can ever exist within the current understanding of physics.

So, the algorithm is somewhat of a brute force.  It has some elegance to it, though.  The chains have a very specific form, which gives a lot of opportunity for speedups.  For example, if your chain starts at N then 2N-1 is in the chain (for one of the chain types).  It would be inefficient to check the primality of 2N-1 if N is already shown to be composite--the chain is already proven to be invalid.  Many optimizations take this general form.

The algorithm itself is broken into two main parts: sieving and primality tests.  Sieving is a really fast way to show that lots of numbers are composite, but it is impractical to show that any number is prime by a sieve alone.  The most famous (and oldest) sieve is the Sieve of Eratosthenes, which has a lovely Gif and explanation on its wikipedia page.  Sieves work by eliminating numbers that have small factors;  it is easy and computationally cheap to carry out a single division to determine if a big number, N, is divisible by a small number, k, and about 1 in k divisions will show that N is divisible by k and that N is therefore composite.  This can be taken a step farther by noting that if N is divisible by k then N + a*k is also divisible by k, allowing many candidates to be eliminated by a single division and a few multiplications.  Thus, the mining algorithm first generates a set of candidate numbers based on a hash function and the data in the block, then runs that set of numbers through a sieve.

The next step is to actually test the remaining numbers after the sieve takes out most of them.  This is accomplished by the Fermat (Probable) Primality Test, which holds that if aP-1 ≡ 1 (mod P) for all a and prime P.  That is to say, you can take any number (a) and raise it to the power of your candidate (P) minus one, then divide the result by the candidate; if you get anything other than 1 as the remainder of that division then P is composite, otherwise P is probably prime.  For PrimeCoin a is always taken to be 2, and "probable" is taken to be good enough--the odds that 2 is a "Fermat Liar" for P are much less than the odds that P is prime.

Thus, the general flow is (Protocol to build a block and find the appropriate hash value to start from) -> (form a set of numbers that could be valid as part of prime chains for this block) -> (run the numbers through a sieve) -> (Eliminate numbers that passed the sieve but aren't parts of chains of length 9 or longer (current integer difficulty)) -> (for each chain of possibly prime numbers remaining, execute a Fermat Primality test until you either find a complete chain or find a composite number that proves that no chain long enough will be found there).

You explained very well. Thank you.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 23, 2013, 12:33:13 AM
Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.
I imagine that it can work like this:
Run the sieve on the GPU until you have lots of candidates.
Then run a single fermat test on the GPU for all of these candidates.
All candidates passing the first test will be tested for chain length 2.
And so on.
I'm not sure how much memory we will need on the GPU if we store all the candidates there.

Problem being, that Fermat tests are not a good fit for current GPUs, because these tests have a strongly data-dependent execution flow. Memory isn't a problem, even for many many candidates.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: runlinux on August 23, 2013, 12:36:51 AM
Why couldn't each core / thread on the GPU run the whole algorithm from start to finish? yeah, each one would be slow, but as a whole, everything wins. It gets around the threading issue that I know plagues most of algorithm being ported. Then again, I could be totally out in left field.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: tacotime on August 23, 2013, 09:28:39 PM
There is already a fast implementation of the Lucas–Lehmer primality test on AMD cards, I'd guess one for the Fermat primality test would be a matter of time.

http://mersenneforum.org/showthread.php?t=12576&page=176

Mfakto implements sieving as well: http://www.mersennewiki.org/index.php/Mfakto


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Sharky444 on August 24, 2013, 12:44:53 AM
So, to get my numbers straight (bottom line still is that porting Fermat's test is though):

A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2).

It's not sufficient to do the test with base 2 only. You get just about 99% prim probability with that.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: ilostcoins on August 24, 2013, 12:49:37 AM
So, to get my numbers straight (bottom line still is that porting Fermat's test is though):

A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2).

It's not sufficient to do the test with base 2 only. You get just about 99% prim probability with that.

I think that's what this coin does? I mean some chains accepted in this coin are composite?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Sunny King on August 24, 2013, 01:16:07 AM
Base 2 Fermat test is strong enough. It's known that pseudoprimes are extremely rare compared to primes. The term probable prime is misleading to layman because they are actually almost certain primes.

In fact I would be very surprised if you can find a pseudoprime in the block chain.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: bcp19 on August 24, 2013, 02:12:03 AM
There is already a fast implementation of the Lucas–Lehmer primality test on AMD cards, I'd guess one for the Fermat primality test would be a matter of time.

http://mersenneforum.org/showthread.php?t=12576&page=176

Mfakto implements sieving as well: http://www.mersennewiki.org/index.php/Mfakto
The Lucas-Lehmer test currently available on GPU is very simple compared to the fermat testing needed for Primecoin.  A mersenne number is stated as 2p-1.  In order for a mersenne number to be prime, P must also be prime.  As stated on http://Mersenne.org/various/math.php, The Lucas-Lehmer primality test is remarkably simple. It states that for P > 2, 2P-1 is prime if and only if Sp-2 is zero in this sequence: S0 = 4, SN = (SN-12 - 2) mod (2P-1).  Trust me when I say the explanation is a lot harder than the actual math involved.  To prove 27-1 is prime take 6 steps, which is 1 less that the exponent of 7.  The current record Mersenne Prime, 257,885,161-1 took 57,885,160 calculations to prove as prime, requiring 15 days on my overclocked i5-2500k (4.3GHz) compared to ~4 days on an nVidia GTX 480 or ~3.5 days on a GTX 580.

Trial factoring is even easier, as all factors of Mersenne numbers can be expressed as 2 * k * P - 1.  Since 2 and P are givens, you simply sieve out prime k's from whatever bit depth you are working at and take 2kp-1 mod (2^P-1).  The fast fournier transforms available make GPUs operates at speeds in excess of 100 times a CPU.

The more complicate P-1 test, which has 'somewhat' successfully been ported to GPU can barely get 2-3x the speed of a good i5/i7.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Sharky444 on August 24, 2013, 02:38:00 AM
Base 2 Fermat test is strong enough. It's known that pseudoprimes are extremely rare compared to primes. The term probable prime is misleading to layman because they are actually almost certain primes.

In fact I would be very surprised if you can find a pseudoprime in the block chain.
Pseudoprimes are not that rare! Here is an example:

341, 561, 645, 1105, 1387, 1729, 1905, 2047, 2465, 2701, 2821, 3277, 4033, 4369, 4371, 4681, 5461, 6601, 7957, 8321, 8481, 8911


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Sunny King on August 24, 2013, 02:44:48 AM
Pseudoprimes are not that rare! Here is an example:

341, 561, 645, 1105, 1387, 1729, 1905, 2047, 2465, 2701, 2821, 3277, 4033, 4369, 4371, 4681, 5461, 6601, 7957, 8321, 8481, 8911

Asymptotically they are extremely rare compared to primes. Please refer to my design paper.

Primecoin is dealing with numbers over 256 bits long, so you cannot compare the probability of the really small numbers.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: bcp19 on August 24, 2013, 02:47:47 AM
Base 2 Fermat test is strong enough. It's known that pseudoprimes are extremely rare compared to primes. The term probable prime is misleading to layman because they are actually almost certain primes.

In fact I would be very surprised if you can find a pseudoprime in the block chain.
Pseudoprimes are not that rare! Here is an example:

341, 561, 645, 1105, 1387, 1729, 1905, 2047, 2465, 2701, 2821, 3277, 4033, 4369, 4371, 4681, 5461, 6601, 7957, 8321, 8481, 8911
22 Pseudoprimes by 8911 as compared to 1384 prime numbers.  Definitely very uncommon.  Plus, similar to Mersenne Primes, while there are a lot of psuedoprimes at the lower range, they become a lot more scarce:  22 mersenne primes had an exponent less than 10,000.  #48, found Jan this year, and has an exponent of 57,885,161


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: usahero on August 24, 2013, 02:54:45 AM

The algorithm is kind of complicated to explain in lay terms....


Thank you. You explained very well.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Sunny King on August 24, 2013, 03:06:06 AM
The Lucas-Lehmer test currently available on GPU is very simple compared to the fermat testing needed for Primecoin.  A mersenne number is stated as 2p-1.  In order for a mersenne number to be prime, P must also be prime.  As stated on http://Mersenne.org/various/math.php, The Lucas-Lehmer primality test is remarkably simple. It states that for P > 2, 2P-1 is prime if and only if Sp-2 is zero in this sequence: S0 = 4, SN = (SN-12 - 2) mod (2P-1).  Trust me when I say the explanation is a lot harder than the actual math involved.  To prove 27-1 is prime take 6 steps, which is 1 less that the exponent of 7.  The current record Mersenne Prime, 257,885,161-1 took 57,885,160 calculations to prove as prime, requiring 15 days on my overclocked i5-2500k (4.3GHz) compared to ~4 days on an nVidia GTX 480 or ~3.5 days on a GTX 580.

Lucas-Lehmer test itself is computationally similar to Fermat test in the sense that they both rely on modular multiplication/square. But in the case of Mersenne the modulo is of a special form 2p-1, so the modulo operation can be done through only shift and addition.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: bcp19 on August 24, 2013, 03:28:43 AM
The Lucas-Lehmer test currently available on GPU is very simple compared to the fermat testing needed for Primecoin.  A mersenne number is stated as 2p-1.  In order for a mersenne number to be prime, P must also be prime.  As stated on http://Mersenne.org/various/math.php, The Lucas-Lehmer primality test is remarkably simple. It states that for P > 2, 2P-1 is prime if and only if Sp-2 is zero in this sequence: S0 = 4, SN = (SN-12 - 2) mod (2P-1).  Trust me when I say the explanation is a lot harder than the actual math involved.  To prove 27-1 is prime take 6 steps, which is 1 less that the exponent of 7.  The current record Mersenne Prime, 257,885,161-1 took 57,885,160 calculations to prove as prime, requiring 15 days on my overclocked i5-2500k (4.3GHz) compared to ~4 days on an nVidia GTX 480 or ~3.5 days on a GTX 580.

Lucas-Lehmer test itself is computationally similar to Fermat test in the sense that they both rely on modular multiplication/square. But in the case of Mersenne the modulo is of a special form 2p-1, so the modulo operation can be done through only shift and addition.
I will admit that I don't undertand the fermat test well... my brain locks up at The triple bar, ≡, and the set membership symbol (∈).  :/


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Sunny King on August 24, 2013, 04:08:28 AM
:) Though I think the Lucas-Lehmer used for GIMPS is very different because Mersenne is dealing with multiplications of tens of millions of bits. I think the kernel is probably spending most of its time doing a special FFT (floating point) algorithm for big number multiplications.

For primecoin mining it only needs modular multiplication of a thousand bits.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 24, 2013, 11:01:26 PM
Lucas–Lehmer–Riesel test is deterministic and requires that numbers are in a special form. This is both a good fit for Mersenne primes, which have that special form and deterministic means that the test is proof for primality. The class of probabilistic primality tests is usually faster, but doesn't prove primality. Fermat's test is a good fit for primecoin, because it is general purpose. Also Fermat's test should be the fastest test among all probabilistic primality tests with the price of a high false positive rate (it's still very small in practise). This is ignored in Primecoins, as all numbers passing a Fermat test in base 2 are threaded as primes.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: Dumbo on August 24, 2013, 11:22:06 PM
For primecoin mining it only needs modular multiplication of a thousand bits.

Unfortunately the only modular multiplication implementation on CUDA I saw involved 32-bit integers :( 


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 26, 2013, 01:33:09 AM
Mtrlt has a new status on his miner. 2x speedup for testnet, ? for mainnet. I hope that we will see the source soon, so that I can port it over to CUDA. If we can get this to 5x-10x, it will still be a game changer! But based on my analysis, we won't see the extreme speedups which we have seen for hash based coins (atleast not without some clever tricks I'm unaware of). Sunny may have unintentionally created the best CPU-coin there currently is! ;)


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: cryptrol on August 26, 2013, 01:54:46 PM
Mtrlt has a new status on his miner. 2x speedup for testnet, ? for mainnet. I hope that we will see the source soon, so that I can port it over to CUDA. If we can get this to 5x-10x, it will still be a game changer! But based on my analysis, we won't see the extreme speedups which we have seen for hash based coins (atleast not without some clever tricks I'm unaware of). Sunny may have unintentionally created the best CPU-coin there currently is! ;)
He reports a 2x performance gain against a phenom x6, that's not much considering that a 6990 has 2 cores .
If the final results go on these lines, it will not be better to mine on GPU than on CPU.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: primedigger on August 26, 2013, 04:15:43 PM
Thanks, I didn't notice that he compared his speedup to a dual GPU. I wonder with what metric he compared it and if he factored in the new advances in hp-10 that brought a >2x speedup with optimisations to the sieve.

Still, the sieve looks quite easy to port to the GPU and I will try to do that if my time permits it. Let's better have a definite prove that GPU sieving is ineffective right now, if that is really the case. Although, a >10 difficulty could make the sieving part more important than now, that could change the current situation.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: arnuschky on August 26, 2013, 08:38:06 PM
Thanks, I didn't notice that he compared his speedup to a dual GPU. I wonder with what metric he compared it and if he factored in the new advances in hp-10 that brought a >2x speedup with optimisations to the sieve.

hp10 didn't improve that much; the metric is wrong. We're getting pretty much the same results with hp9 than hp10. Others have confirmed this.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: bcp19 on August 26, 2013, 10:07:29 PM
Thanks, I didn't notice that he compared his speedup to a dual GPU. I wonder with what metric he compared it and if he factored in the new advances in hp-10 that brought a >2x speedup with optimisations to the sieve.

hp10 didn't improve that much; the metric is wrong. We're getting pretty much the same results with hp9 than hp10. Others have confirmed this.
doubling the sieve speed would not equate to a 2x speedup overall.  The deeper you end up seiving the less increased result you get from it.  Imagine the normal sieve eliminating 99% of candidates.  The added seive could then amount to 99.1% or 99.2% of all candidates. 


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: drumaliens on August 29, 2013, 06:23:26 PM
From reading all of the information out there I think I understand the basics of primecoin. Is there a worked example, using an existing block which shows how they fit together ?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: patapato on August 29, 2013, 10:38:21 PM
He reports a 2x performance gain against a phenom x6, that's not much considering that a 6990 has 2 cores .
If the final results go on these lines, it will not be better to mine on GPU than on CPU.

In fact the price of a Radeon HD 6990 is more than twice that of a Phenom II X6. Probably also the power consumption is 2x.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: crendore on September 03, 2013, 12:29:00 PM
Hey primedigger, now that mtrlt's miner is out, have you taken a look at the source (http://cryptobro.com/reaperprime.zip)? Was it what you expected (https://bitcointalk.org/index.php?topic=258982.msg2940118#msg2940118)?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: YipYip on September 03, 2013, 01:22:52 PM
I would offer 5 BTC to the first to provide me with a working GPU that the public does not know about.  This will need to be atleast 10x faster than CPU.

WOW u are cooolzz


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: bcp19 on September 07, 2013, 06:12:30 PM
Thanks, I didn't notice that he compared his speedup to a dual GPU. I wonder with what metric he compared it and if he factored in the new advances in hp-10 that brought a >2x speedup with optimisations to the sieve.

Still, the sieve looks quite easy to port to the GPU and I will try to do that if my time permits it. Let's better have a definite prove that GPU sieving is ineffective right now, if that is really the case. Although, a >10 difficulty could make the sieving part more important than now, that could change the current situation.
Since the openCL miner is in the wild now, have you had a chance too look at it and see if a CUDA miner is possible?


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: tugvarish on June 17, 2014, 08:06:58 PM
I know this is and old topic, but I just now read it.
I am a long time Boinc-er (using the Berkley distributed computing platform) and between the projects I participate, there is one called PrimeGrid and it is all about various prime number mathematical work, one of them is finding various kind of them.
The projects uses GPU in some of their calculation, so maybe it could interesting to look at what they are using it for and get some new ideas.


Title: Re: [XPM] So, why is porting primecoin mining to the GPU so difficult?
Post by: smoothie on June 17, 2014, 09:41:53 PM
I know this is and old topic, but I just now read it.
I am a long time Boinc-er (using the Berkley distributed computing platform) and between the projects I participate, there is one called PrimeGrid and it is all about various prime number mathematical work, one of them is finding various kind of them.
The projects uses GPU in some of their calculation, so maybe it could interesting to look at what they are using it for and get some new ideas.

As far as I am aware there are a few Primecoin GPU mining code bases to use:

Claymore and PrimeGPU