Bitcoin Forum
June 22, 2024, 03:19:44 AM *
News: Voting for pizza day contest
 
   Home   Help Search Login Register More  
Pages: [1] 2 3 »  All
  Print  
Author Topic: [XPM] So, why is porting primecoin mining to the GPU so difficult?  (Read 18678 times)
primedigger (OP)
Member
**
Offline Offline

Activity: 75
Merit: 10


View Profile
August 22, 2013, 12:26:44 AM
Last edit: August 22, 2013, 10:55:50 PM by primedigger
 #1

Tl;dr: Some explanations as to why there currently is still no GPU miner for primecoins


Some of you may know me from the other GPU miner thread, where I tried to accelerate the primecoin miner with CUDA.

While I have no idea how mtrlt is doing on his port, I can only image that he encounters similar problems. A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2). Latest default setting are: On testnet you spend 30% sieving and 70% testing candidates, on mainnet 70% sieving and 30% testing. On mainnet you will filter out 97% of all candidates you started with by sieving.

In order to produce a meaningful speedup, you will have to port both stages to the GPU or make one of them very small. Lets say you keep the original settings, port the sieve to GPU, interleave Fermat tests on the CPU. In this scenario your speedup is at most ~3x, because you still have to do that 30% candidate testing. Ok, you can sieve much faster now, so why not sieve with 10x more primes? You already filter 97%, and each new prime will not kill more than 1/p more candidates. The low-hanging fruits are the smaller primes and you have already used them. So, dimishing returns for the extra computations if these primes get bigger. At some point, even if you sieve 100x faster it won't be worth the extra computations and just checking if that number is prime with a Fermat test makes more sense. (>90% of all Fermat tests come back positive after sieving, that is, the start of that possible chain is definitely composite. Only for the remaining <10% you check if you hit a chain of some type.). Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.

Ok, why is porting Fermats test to the GPU so hard?  

That Fermat test is actually very simple in principle, it's a one liner in python:

2==pow(2,n,n)

In python, it can be also easily used with bigger numbers:

>>> n = 8229677002348755372716479379024478187948307769969295540186468240943252147753090 395716866121321015192338179630120010069556821511582091
>>> 2 == pow(2,n,n)
True
>>>

But a lot of magic is running under the hood, which makes this a quite complex problem for GPUs: the numbers involved are bigger than what fits into standard 32/64bit values and 2^n can't be evaluated directly. For the first problem, you use e.g. seven 64bit integers that represent the above number n and emulate multiplication and addition on them with 64-instructions.

A fast algorithm is known for the second problem, called Modular exponentiation (http://en.wikipedia.org/wiki/Modular_exponentiation). The algorithm is simple, but there are many more complex tricks to speed up the computation. For example, with a Montgomery reduction (which, at least on a CPU, is faster than a Barrett reduction but works only on odd numbers) no divisions are required, which is great, since multiplications are much faster than divisions. Memory can be traded for faster execution time by precomputing some values, which however won't make much sense on a GPU.

So doing many Fermat tests means evaluating many independent modular exponentiations. At first, doing many independent modular exponentiation seems to be a great fit for a GPU, since each calculation could be done independently in its own thread. The root of the problem is that Modular exponentiation is a recursive evaluation (= can't be easily parallelized) and that the execution flow is strongly dependent on the input data. That is really a tough problem, since GPU's are "Single instruction, multiple data" devices. A CUDA GPU will execute many threads in warps, which on newer CUDA GPU's are in the order of 48 threads. If these warps divergate, which they will do if each threads execution flow is strongly data dependent, then these threads will be serialized. Whoops, that's a 48x penalty.

Second problem is that at least most NVIDIA GPU's synthesise 64-bit integer instructions from 32-bit instructions. 64-bit addition and subtraction can be synthesized efficiently from 32-bit operations (2 or 3 instructions depending on compute capability). Multiplication requires longer instruction sequences (on the order of 10 to 20 instructions depending on architecture), so that is another 10x penalty compared to a 64-bit CPU which can execute these instructions natively.

So that's easily a 480x combined penalty, even if a GPU is 100x faster calculating simple instructions than a CPU, it will be ~5 times slower with the above idea than a CPU. There might be a clever idea to do something radically different than what was outlined above, but at this point I'm very sceptical that we will see a GPU miner soon which would be practical given the higher electricity use of a GPU compared to a CPU.
jh00
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
August 22, 2013, 12:57:04 AM
 #2

You sum up the problem quite nicely. But I have to disagree in one point. The fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors you will still spend a majority of time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data with varying length and many conditional branches like it is the case for the primality tests. But then again I am not sure if the overall speed gain will actually be worth the increased electricity bill.
SynergyCores
Newbie
*
Offline Offline

Activity: 20
Merit: 0



View Profile
August 22, 2013, 12:59:40 AM
 #3

So........

https://i.imgur.com/WCxqZ8e.gif
??
tyranosaur
Newbie
*
Offline Offline

Activity: 12
Merit: 0


View Profile
August 22, 2013, 01:00:45 AM
 #4


 The hot spot in a primecoin miner is testing numbers for primality by using Fermat tests (with base 2). Porting the sieve to the GPU is almost pointless, because only 1% of the computing time is spend there.




I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?

reb0rn21
Legendary
*
Offline Offline

Activity: 1898
Merit: 1024


View Profile
August 22, 2013, 01:01:58 AM
 #5

Maybe some low end GPU even integrated one can help in sieva part, like haswell HD4600 opencl or AMD integrated?

              ▄▄▄ ▀▀▀▀▀▀▀▀▀ ▄▄▄
           ▄▀▀    ▄▄▄▄▄▄▄▄▄    ▀▀▄
        ▄▀▀  ▄▄▀█          ▀█▀▄▄  ▀▀▄
      ▄▀▀ ▄▄▀    ▀▀▄▄▄▄▄▄▄▀▀    ▀▄▄ ▀▀▄
     █   █            ▀            █   █
   ▄▀ █  ▀▄▄                     ▄█▀  █ ▀▄
  ▄▀ ▄▀ █▄ ▀▀▀██▄▄▄       ▄▄▄██▀▀  ██ ▀▄ ▀▄
  ▀▄▀▀▄ ██ ▄▄▄▄▄▄  ▀▄   ▄▀  ▄▄▄▄▄▄ ██ ▄▀▀▄▀
 ██   █ ██ ▀▄    ▀▄ █   █ ▄▀    ▄▀ ██ █  ▀██
 █  ▄█  ▀█  ▀▀▀▀▀▀▀ █   █ ▀▀▀▀▀▀▀  █   █▄  █
█▀ █  █  █          █   █          █  █  █ ▀▀
 █▀  ▄▀  █▀▄        █   █        ▄▀█  ▀▄  ▀█
 ▄  █▀   █ ▀█▄      ▀   ▀      ▄█▀ █  ▄▀█  ▄
 █▄▀  █  █                         █  █  ▀▄█
 ▀▄  █   ▀█        ▄▄▀▄▀▄▄        █▀   █  ▄
  ▀▄▀▀  █▄ █     ▀█  ▀▀▀  █▀     █ ▄█ ▄▀▀▄▀
   ▀ ▄  ██ █▀▄     ▀▀▄▄▄▀▀     ▄▀█ ██ ▀▄ ▀
    ▀█  ██ █ █▀▄    ▄▄▄▄▄    ▄▀█ █ ██  █▀
      ▀▄ ▀ █ █ ██▄         ▄██ █ █ ▀ ▄▀
        ▀▄ █ █ █ ▀█▄     ▄█▀ █ █ █ ▄▀
          ▀▀▄█ █    ▀▀▀▀▀    █ █▄▀▀
              ▀▀ ▄▄▄▄▄▄▄▄▄▄▄ ▀▀
   
..I  D  E  N  A..
   
Proof-of-Person Blockchain

Join the mining of the first human-centric
cryptocurrency
 



 
▲    2 3 2 2

..N  O  D  E  S..
   
                ██
                ██
                ██
                ██
                ██
         ▄      ██      ▄
         ███▄   ██   ▄███
          ▀███▄ ██ ▄███▀
            ▀████████▀
              ▀████▀
                ▀▀
██▄                            ▄██
███                            ███
███                            ███
███                            ███
 ███▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄███
  ▀▀██████████████████████████▀▀
   
D O W N L O A D

Idena node

   
   
▄▄▄██████▄▄▄
▄▄████████████████▄▄
▄█████▀▀        ▀▀█████▄
████▀                ▀████
███▀    ▄▄▄▄▄▄▄▄▄       ▀███
███      █   ▄▄ █▀▄        ███
██▀      █  ███ █  ▀▄      ▀██
███       █   ▀▀ ▀▀▀▀█       ███
███       █  ▄▄▄▄▄▄  █       ███
███       █  ▄▄▄▄▄▄  █       ███
██▄      █  ▄▄▄▄▄▄  █      ▄██
███      █          █      ███
███▄    ▀▀▀▀▀▀▀▀▀▀▀▀    ▄███
████▄                ▄████
▀█████▄▄        ▄▄█████▀
▀▀████████████████▀▀
▀▀▀██████▀▀▀
   
    .REQUEST INVITATION.
billym2k
Full Member
***
Offline Offline

Activity: 126
Merit: 100



View Profile
August 22, 2013, 01:09:14 AM
 #6

This is a great thread to show the complexity of the problem...

Though, out of curiosity, why are people wanting a GPU miner so badly for Primecoin anyway? I kinda enjoy a CPU only coin, since it lets me mine different coins with my graphics card and my processor at the same time...
primedigger (OP)
Member
**
Offline Offline

Activity: 75
Merit: 10


View Profile
August 22, 2013, 01:17:49 AM
Last edit: August 22, 2013, 01:36:45 AM by primedigger
 #7


 The hot spot in a primecoin miner is testing numbers for primality by using Fermat tests (with base 2). Porting the sieve to the GPU is almost pointless, because only 1% of the computing time is spend there.




I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?



Redacted that statement, but still, even if you made that 75% sieving 1000 times faster, you would still need those other 25% of the time and wouldn't mine much faster.


jh00
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
August 22, 2013, 01:21:45 AM
 #8

...
I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?
Redacted that statement, but still, even if you made that 25% sieving 1000 times faster, you would still need those other 75% of the time and wouldn't mine much faster.
It's 75% sieving, not 25%. If you make the sieve 10000 times faster you can eliminate 99.9999% of candidates that need testing by filtering more prime divisors. The purpose of the sieve is to avoid having to do the slow tests, so improving the sieve does actually make more sense.
LlamaMaster
Newbie
*
Offline Offline

Activity: 51
Merit: 0


View Profile
August 22, 2013, 01:22:20 AM
 #9

This is a great thread to show the complexity of the problem...

Though, out of curiosity, why are people wanting a GPU miner so badly for Primecoin anyway? I kinda enjoy a CPU only coin, since it lets me mine different coins with my graphics card and my processor at the same time...
People mainly just wanted to chase after the large short term profits a GPU may provide (assuming the increase in performance was sufficiently large).  There was also the argument that a GPU miner would open up the coin to more people are drive up the price, but I can't wrap my head around that logic when everybody already has CPU.  All in all I don't see any advantage to releasing a GPU miner at this point, and I think that it will just lead to a higher difficulty, a lower market price, or both.
primedigger (OP)
Member
**
Offline Offline

Activity: 75
Merit: 10


View Profile
August 22, 2013, 01:34:38 AM
 #10

...
I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?
Redacted that statement, but still, even if you made that 25% sieving 1000 times faster, you would still need those other 75% of the time and wouldn't mine much faster.
It's 75% sieving, not 25%. If you make the sieve 10000 times faster you can eliminate 99.9999% of candidates that need testing by filtering more prime divisors. The purpose of the sieve is to avoid having to do the slow tests, so improving the sieve does actually make more sense.

Ah, sorry, need some sleep... But you'd still have dimishing returns for sieving, as filtering by 2x as many primes won't filter out 2x more candidates.
Alex P
Member
**
Offline Offline

Activity: 97
Merit: 10



View Profile
August 22, 2013, 03:54:18 AM
 #11

I would offer 5 BTC to the first to provide me with a working GPU that the public does not know about.  This will need to be atleast 10x faster than CPU.
bcp19
Hero Member
*****
Offline Offline

Activity: 532
Merit: 500



View Profile
August 22, 2013, 11:25:43 AM
 #12

I would offer 5 BTC to the first to provide me with a working GPU that the public does not know about.  This will need to be atleast 10x faster than CPU.
mtrlt have already been donated over 75BTC and he's not been able to accomplish more that ~2x speed increase at last report.  Good luck with that Tongue.

I do not suffer fools gladly... "Captain!  We're surrounded!"
I embrace my inner Kool-Aid.
bcp19
Hero Member
*****
Offline Offline

Activity: 532
Merit: 500



View Profile
August 22, 2013, 11:28:21 AM
 #13

You sum up the problem quite nicely. But I have to disagree in one point. The fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors you will still spend a majority of time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data with varying length and many conditional branches like it is the case for the primality tests. But then again I am not sure if the overall speed gain will actually be worth the increased electricity bill.
mtrlt already accomplished sieving on the GPU and fermat on the CPU and it came out to the same speed as the CPU alone.  Data flow between GPU/CPU is also a bottleneck.

I do not suffer fools gladly... "Captain!  We're surrounded!"
I embrace my inner Kool-Aid.
CoinBuzz
Sr. Member
****
Offline Offline

Activity: 490
Merit: 250



View Profile
August 22, 2013, 11:39:27 AM
 #14

i am wondering if in primecoin, it find just a probablity of primility, so one can start from 1 and go to next, next,next and test each one. am i wrong ?

Join ASAP: FREE BITCOIN
usahero
Sr. Member
****
Offline Offline

Activity: 434
Merit: 250


View Profile
August 22, 2013, 11:51:13 AM
 #15

Could you explain what primecoin algorithm actually does?

Why doesn't primecoin for example just read some known prime numbers from some database, compare them if there is any Cunningham chain candidate and ... offer result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains..?

If you could describe this in layman's terms, I'd be very happy.
18RATTT
Sr. Member
****
Offline Offline

Activity: 282
Merit: 250



View Profile
August 22, 2013, 12:02:50 PM
 #16

I would offer 5 BTC to the first to provide me with a working GPU that the public does not know about.  This will need to be atleast 10x faster than CPU.
mtrlt have already been donated over 75BTC and he's not been able to accomplish more that ~2x speed increase at last report.  Good luck with that Tongue.
alex p, better donate that btc into the development process.

let me tell you something, people like you are bad for the ecosystem, people who only want to have the best thing delivered for them first and paid it afterward, but dont want to contribute during the development process.

i have a quote for you,

Code:
"if you are absent during my struggle, dont expect my presence during my success"

reb0rn21
Legendary
*
Offline Offline

Activity: 1898
Merit: 1024


View Profile
August 22, 2013, 12:25:25 PM
 #17

You sum up the problem quite nicely. But I have to disagree in one point. The fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors you will still spend a majority of time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data with varying length and many conditional branches like it is the case for the primality tests. But then again I am not sure if the overall speed gain will actually be worth the increased electricity bill.
mtrlt already accomplished sieving on the GPU and fermat on the CPU and it came out to the same speed as the CPU alone.  Data flow between GPU/CPU is also a bottleneck.

If this is true then only CPU + integreted GPU might get "some" benefit... data flow should be a lot better

              ▄▄▄ ▀▀▀▀▀▀▀▀▀ ▄▄▄
           ▄▀▀    ▄▄▄▄▄▄▄▄▄    ▀▀▄
        ▄▀▀  ▄▄▀█          ▀█▀▄▄  ▀▀▄
      ▄▀▀ ▄▄▀    ▀▀▄▄▄▄▄▄▄▀▀    ▀▄▄ ▀▀▄
     █   █            ▀            █   █
   ▄▀ █  ▀▄▄                     ▄█▀  █ ▀▄
  ▄▀ ▄▀ █▄ ▀▀▀██▄▄▄       ▄▄▄██▀▀  ██ ▀▄ ▀▄
  ▀▄▀▀▄ ██ ▄▄▄▄▄▄  ▀▄   ▄▀  ▄▄▄▄▄▄ ██ ▄▀▀▄▀
 ██   █ ██ ▀▄    ▀▄ █   █ ▄▀    ▄▀ ██ █  ▀██
 █  ▄█  ▀█  ▀▀▀▀▀▀▀ █   █ ▀▀▀▀▀▀▀  █   █▄  █
█▀ █  █  █          █   █          █  █  █ ▀▀
 █▀  ▄▀  █▀▄        █   █        ▄▀█  ▀▄  ▀█
 ▄  █▀   █ ▀█▄      ▀   ▀      ▄█▀ █  ▄▀█  ▄
 █▄▀  █  █                         █  █  ▀▄█
 ▀▄  █   ▀█        ▄▄▀▄▀▄▄        █▀   █  ▄
  ▀▄▀▀  █▄ █     ▀█  ▀▀▀  █▀     █ ▄█ ▄▀▀▄▀
   ▀ ▄  ██ █▀▄     ▀▀▄▄▄▀▀     ▄▀█ ██ ▀▄ ▀
    ▀█  ██ █ █▀▄    ▄▄▄▄▄    ▄▀█ █ ██  █▀
      ▀▄ ▀ █ █ ██▄         ▄██ █ █ ▀ ▄▀
        ▀▄ █ █ █ ▀█▄     ▄█▀ █ █ █ ▄▀
          ▀▀▄█ █    ▀▀▀▀▀    █ █▄▀▀
              ▀▀ ▄▄▄▄▄▄▄▄▄▄▄ ▀▀
   
..I  D  E  N  A..
   
Proof-of-Person Blockchain

Join the mining of the first human-centric
cryptocurrency
 



 
▲    2 3 2 2

..N  O  D  E  S..
   
                ██
                ██
                ██
                ██
                ██
         ▄      ██      ▄
         ███▄   ██   ▄███
          ▀███▄ ██ ▄███▀
            ▀████████▀
              ▀████▀
                ▀▀
██▄                            ▄██
███                            ███
███                            ███
███                            ███
 ███▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄███
  ▀▀██████████████████████████▀▀
   
D O W N L O A D

Idena node

   
   
▄▄▄██████▄▄▄
▄▄████████████████▄▄
▄█████▀▀        ▀▀█████▄
████▀                ▀████
███▀    ▄▄▄▄▄▄▄▄▄       ▀███
███      █   ▄▄ █▀▄        ███
██▀      █  ███ █  ▀▄      ▀██
███       █   ▀▀ ▀▀▀▀█       ███
███       █  ▄▄▄▄▄▄  █       ███
███       █  ▄▄▄▄▄▄  █       ███
██▄      █  ▄▄▄▄▄▄  █      ▄██
███      █          █      ███
███▄    ▀▀▀▀▀▀▀▀▀▀▀▀    ▄███
████▄                ▄████
▀█████▄▄        ▄▄█████▀
▀▀████████████████▀▀
▀▀▀██████▀▀▀
   
    .REQUEST INVITATION.
CoinBuzz
Sr. Member
****
Offline Offline

Activity: 490
Merit: 250



View Profile
August 22, 2013, 12:36:11 PM
 #18

Could you explain what primecoin algorithm actually does?

Why doesn't primecoin for example just read some known prime numbers from some database, compare them if there is any Cunningham chain candidate and ... offer result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains..?

If you could describe this in layman's terms, I'd be very happy.

+1

Join ASAP: FREE BITCOIN
#Darren
Sr. Member
****
Offline Offline

Activity: 784
Merit: 250


DIA | Data infrastructure for DeFi


View Profile
August 22, 2013, 12:53:04 PM
 #19

 I am happy it is proving GPU resistant  Cheesy


      ████████████████
   ████████████████████████
██████████████████████████████
█████████████████████████████████
███████████████████████████████████
█████████████████████████████████████
██████████████████████████████████████
███████████████████████████████████████
█████████████████   ███████████████████
█████████████████        ███████████████
█████████████████          █████████████
█████████████████            ████████████
█████████████████             ███████████
█████████████████             ███████████
█████████████████              ██████████
█████████████████              ██████████
█████████████████             ███████████
█████████████████            ███████████
█████████████████          █████████████
█████████████████      ███████████████
█████████████████ ████████████████████
█████████████████████████████████████
███████████████████████████████████
█████████████████████████████████
██████████████████████████████
██████████████████████████
     █████████████████
          ████
DIA | OPEN ACCESS FINANCIAL DATA
FreedomCoin
Hero Member
*****
Offline Offline

Activity: 675
Merit: 507


Freedom to choose


View Profile
August 22, 2013, 01:11:55 PM
 #20

I am happy it is proving GPU resistant  Cheesy

Just like LTC is currently ASIC resistant :-)

Pages: [1] 2 3 »  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!