Bitcoin Forum
  Show Posts
41  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] So, why is porting primecoin mining to the GPU so difficult? on: August 24, 2013, 11:01:26 PM
The Lucas–Lehmer–Riesel test is deterministic and requires numbers of a special form. That makes it a good fit for Mersenne primes, which have that special form, and being deterministic means that a passing test is a proof of primality. The class of probabilistic primality tests is usually faster, but doesn't prove primality. Fermat's test is a good fit for primecoin because it is general purpose. Fermat's test should also be the fastest among all probabilistic primality tests, at the price of a higher false positive rate (still very small in practice). This is ignored in primecoin: all numbers passing a Fermat test in base 2 are treated as primes.
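
To make that concrete, here is a toy sketch in Python (illustration only, not the primecoin code): the base-2 Fermat test, together with 341 = 11*31, the smallest composite number that fools it.

Code:
# Toy illustration, not taken from the primecoin sources.
def fermat_base2(n):
    """Base-2 Fermat test: n is a 'probable prime' if 2^(n-1) == 1 (mod n)."""
    return pow(2, n - 1, n) == 1

print(fermat_base2(101))  # True  -- 101 really is prime
print(fermat_base2(341))  # True  -- but 341 = 11 * 31: a base-2 pseudoprime (false positive)
print(fermat_base2(9))    # False -- 9 = 3 * 3 is correctly rejected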
42  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] So, why is porting primecoin mining to the GPU so difficult? on: August 23, 2013, 12:33:13 AM
Bottom line: to produce meaningful speedups with GPU mining, you will have to port both the sieve and the Fermat tests to the GPU. Another argument for this is that transfers between CPU and GPU are costly and can quickly become a bottleneck.
I imagine that it can work like this:
Run the sieve on the GPU until you have lots of candidates.
Then run a single Fermat test on the GPU for all of these candidates.
All candidates passing the first test will be tested for chain length 2.
And so on.
I'm not sure how much memory we will need on the GPU if we store all the candidates there.

The problem is that Fermat tests are not a good fit for current GPUs, because these tests have a strongly data-dependent execution flow. Memory isn't a problem, even for very many candidates.
43  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] So, why is porting primecoin mining to the GPU so difficult? on: August 22, 2013, 02:36:53 PM
So once we are at it - what about fpgas?

I don't know much about FPGAs, but in theory, if you build a simple 1024-bit processor that can multiply, add, subtract and shift, then you could also do Fermat tests. 1024 bits would be enough for the current prime records on the primecoin chain.

RSA also relies heavily on modular exponentiation, and a basic FPGA implementation would start with 1024 bits, so if you search for research papers on the topic you find papers like this one: http://irep.iium.edu.my/29540/1/FPGA_Implementation_of_RSA_Encryption.pdf

No idea how fast that would be though.

44  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] So, why is porting primecoin mining to the GPU so difficult? on: August 22, 2013, 01:15:28 PM
So, to get my numbers straight (the bottom line is still that porting Fermat's test is tough):

A primecoin miner has two stages: filtering candidates and testing numbers for primality using Fermat tests (with base 2). The latest default settings are: on testnet you spend 30% of the time sieving and 70% testing candidates, on mainnet 70% sieving and 30% testing. On mainnet, sieving filters out 97% of all candidates you started with.

In order to produce a meaningful speedup, you will have to port both stages to the GPU or make one of them very small. Let's say you keep the original settings, port the sieve to the GPU and interleave Fermat tests on the CPU. In this scenario your speedup is at most ~3x, because you still have to do that 30% of candidate testing. OK, you can sieve much faster now, so why not sieve with 10x more primes? You already filter out 97%, and each new prime will not remove more than 1/p of the remaining candidates. The low-hanging fruit are the small primes, and you have already used them, so there are diminishing returns on the extra computation as these primes get bigger. At some point, even if you sieve 100x faster, it won't be worth the extra work and just checking whether the number is prime with a Fermat test makes more sense. (90% of all candidates fail their Fermat test, i.e. the start of that possible chain is definitely composite; only for the remaining 10% do you check whether you hit a chain of some type.) Bottom line: to produce meaningful speedups with GPU mining, you will have to port both the sieve and the Fermat tests to the GPU. Another argument for this is that transfers between CPU and GPU are costly and can quickly become a bottleneck.
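
To put a number on that "at most ~3x", a quick back-of-the-envelope calculation (plain Python, using the rough 70%/30% mainnet split from above, not actual measurements):

Code:
# Amdahl-style estimate: ~70% of the time is sieving, ~30% is Fermat testing.
# If only the sieve is accelerated by a factor s, the miner as a whole speeds
# up by 1 / (0.3 + 0.7 / s).
def overall_speedup(sieve_fraction, sieve_speedup):
    test_fraction = 1.0 - sieve_fraction
    return 1.0 / (test_fraction + sieve_fraction / sieve_speedup)

for s in (2, 10, 100, 1000000):
    print("sieve %7dx faster -> miner ~%.2fx faster" % (s, overall_speedup(0.7, s)))
# Even an infinitely fast sieve tops out at 1 / 0.3 = ~3.3x overall.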
45  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] So, why is porting primecoin mining to the GPU so difficult? on: August 22, 2013, 01:34:38 AM
...
I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?
Redacted that statement, but still, even if you made that 25% sieving 1000 times faster, you would still need those other 75% of the time and wouldn't mine much faster.
It's 75% sieving, not 25%. If you make the sieve 10000 times faster you can eliminate 99.9999% of candidates that need testing by filtering more prime divisors. The purpose of the sieve is to avoid having to do the slow tests, so improving the sieve does actually make more sense.

Ah, sorry, need some sleep... But you'd still have diminishing returns for sieving, as filtering by 2x as many primes won't filter out 2x more candidates.
46  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] So, why is porting primecoin mining to the GPU so difficult? on: August 22, 2013, 01:17:49 AM

The hot spot in a primecoin miner is testing numbers for primality by using Fermat tests (with base 2). Porting the sieve to the GPU is almost pointless, because only 1% of the computing time is spent there.

I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?

Redacted that statement, but still, even if you made that 75% sieving 1000 times faster, you would still need those other 25% of the time and wouldn't mine much faster.

47  Alternate cryptocurrencies / Altcoin Discussion / [XPM] So, why is porting primecoin mining to the GPU so difficult? on: August 22, 2013, 12:26:44 AM
Tl;dr: Some explanations as to why there is currently still no GPU miner for primecoins


Some of you may know me from the other GPU miner thread, where I tried to accelerate the primecoin miner with CUDA.

While I have no idea how mtrlt is doing on his port, I can only imagine that he encounters similar problems. A primecoin miner has two stages: filtering candidates and testing numbers for primality using Fermat tests (with base 2). The latest default settings are: on testnet you spend 30% of the time sieving and 70% testing candidates, on mainnet 70% sieving and 30% testing. On mainnet, sieving filters out 97% of all candidates you started with.

In order to produce a meaningful speedup, you will have to port both stages to the GPU or make one of them very small. Let's say you keep the original settings, port the sieve to the GPU and interleave Fermat tests on the CPU. In this scenario your speedup is at most ~3x, because you still have to do that 30% of candidate testing. OK, you can sieve much faster now, so why not sieve with 10x more primes? You already filter out 97%, and each new prime will not remove more than 1/p of the remaining candidates. The low-hanging fruit are the small primes, and you have already used them, so there are diminishing returns on the extra computation as these primes get bigger. At some point, even if you sieve 100x faster, it won't be worth the extra work and just checking whether the number is prime with a Fermat test makes more sense. (>90% of all candidates fail their Fermat test even after sieving, i.e. the start of that possible chain is definitely composite; only for the remaining <10% do you check whether you hit a chain of some type.) Bottom line: to produce meaningful speedups with GPU mining, you will have to port both the sieve and the Fermat tests to the GPU. Another argument for this is that transfers between CPU and GPU are costly and can quickly become a bottleneck.
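
To illustrate the diminishing returns, a small idealized Python sketch (it simply assumes each sieving prime p removes about 1/p of whatever candidates are still left, which is roughly what happens; it is not the miner's sieve):

Code:
# Idealized sieve effectiveness: the surviving fraction of candidates is
# roughly the product of (1 - 1/p) over all sieving primes p.
def small_primes(limit):
    """Primes up to limit via a plain sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [p for p in range(limit + 1) if is_prime[p]]

surviving = 1.0
for p in small_primes(100000):
    surviving *= 1.0 - 1.0 / p
    if p in (97, 997, 9973, 99991):  # largest primes below 10^2 .. 10^5
        print("sieving with primes up to %6d: ~%.4f of candidates survive" % (p, surviving))
# Using 1000x more (and much larger) primes barely improves the filtering.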

OK, why is porting Fermat's test to the GPU so hard?

The Fermat test is actually very simple in principle; it's a one-liner in Python:

2==pow(2,n,n)

In Python, it can also easily be used with bigger numbers:

>>> n = 8229677002348755372716479379024478187948307769969295540186468240943252147753090395716866121321015192338179630120010069556821511582091
>>> 2 == pow(2,n,n)
True
>>>

But a lot of magic is running under the hood, which makes this a quite complex problem for GPUs: the numbers involved are bigger than what fits into standard 32/64-bit values, and 2^n can't be evaluated directly. For the first problem, you use e.g. seven 64-bit integers to represent the above number n and emulate multiplication and addition on them with 64-bit instructions.
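
Sketched in Python for readability (the real thing would of course be CUDA/C, and the limb count of seven is just the example from above):

Code:
# Illustration of the limb representation: store a number wider than 64 bits as
# a little-endian list of 64-bit words and emulate addition with an explicit
# carry chain, the way the GPU code has to.
MASK64 = (1 << 64) - 1

def to_limbs(n, count=7):
    """Split n into 'count' 64-bit limbs, least significant limb first."""
    return [(n >> (64 * i)) & MASK64 for i in range(count)]

def from_limbs(limbs):
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    """Multi-word addition: add limb by limb, propagating the carry."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & MASK64)
        carry = s >> 64
    return out

a, b = (1 << 400) + 12345, (1 << 399) + 67890  # two ~400-bit toy numbers
assert from_limbs(add_limbs(to_limbs(a), to_limbs(b))) == a + b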

A fast algorithm is known for the second problem, called modular exponentiation (http://en.wikipedia.org/wiki/Modular_exponentiation). The algorithm is simple, but there are many more complex tricks to speed up the computation. For example, with a Montgomery reduction (which, at least on a CPU, is faster than a Barrett reduction but works only for odd moduli) no divisions are required, which is great, since multiplications are much faster than divisions. Memory can be traded for faster execution time by precomputing some values, which however won't make much sense on a GPU.
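
For reference, the basic square-and-multiply form of modular exponentiation looks like this (Python's built-in pow(b, e, m) already does this plus the fancier tricks, so the sketch is only meant to show the structure and, in particular, the data-dependent branch that matters for the GPU discussion below):

Code:
# Plain binary (square-and-multiply) modular exponentiation -- no Montgomery
# reduction, no windowing, just the textbook algorithm.
def mod_pow(base, exponent, modulus):
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                        # multiply only when the exponent bit is set --
            result = (result * base) % modulus  # this branch makes the flow data dependent
        base = (base * base) % modulus          # square on every iteration
        exponent >>= 1
    return result

assert mod_pow(2, 340, 341) == pow(2, 340, 341) == 1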

So doing many Fermat tests means evaluating many independent modular exponentiations. At first glance, that seems to be a great fit for a GPU, since each exponentiation could be done independently in its own thread. The root of the problem is that modular exponentiation is an inherently sequential computation (it can't easily be parallelized internally) and that its execution flow is strongly dependent on the input data. That is a really tough problem, since GPUs are "single instruction, multiple data" devices. A CUDA GPU executes threads in warps of 32 threads. If these warps diverge, which they will if each thread's execution flow is strongly data dependent, then these threads are serialized. Whoops, that's up to a 32x penalty.

The second problem is that most NVIDIA GPUs synthesize 64-bit integer instructions from 32-bit instructions. 64-bit addition and subtraction can be synthesized efficiently (2 or 3 instructions depending on compute capability), but multiplication requires longer instruction sequences (on the order of 10 to 20 instructions depending on architecture), so that is roughly another 10x penalty compared to a 64-bit CPU, which can execute these instructions natively.

So that's easily a 320x combined penalty: even if a GPU is 100x faster than a CPU at executing simple instructions, with the above approach it will still be ~3 times slower than a CPU. There might be a clever idea to do something radically different from what was outlined above, but at this point I'm very sceptical that we will see a GPU miner soon that is practical, given the higher electricity use of a GPU compared to a CPU.
48  Alternate cryptocurrencies / Altcoin Discussion / Re: What happened to mtrlt? Did he just scam us of 75 BTC? on: August 21, 2013, 11:43:09 PM
I still believe that Mtrlt didn't intend to scam you out of your money and that he will give you an update soon. However, I'm not sure if a fast GPU miner for primecoins is feasible at all: https://bitcointalk.org/index.php?topic=279194.msg2983323#msg2983323

Edit: Moved my lengthy explanation into its own thread.
49  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: August 16, 2013, 11:48:06 AM
If you can't wait for Mikaelh's fix: the fast div can be disabled easily. Open prime.cpp in an editor, comment out the whole "if (fFastDiv) {...}" block in the function FermatProbablePrimalityTestFast, and comment out the block after the comment "// Compute parameters for fast div test" in MineProbablePrimeChain.

The number of candidates it filtered out depended on "nFastDivPrimes", which by default is chosen fairly low, so only ~6% of candidates are (wrongly) discarded. But hp-9 still does unnecessary divisions on all candidates, so you can speed it up by commenting out those divisions.
50  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: August 15, 2013, 07:20:27 PM
There are also a couple of papers on Fermat tests on the GPU (e.g. http://www.gpgpgpu.com/gecco2009/6.pdf), however these implementations usually assume that n is smaller than 32 or 64 bits, which makes the test much easier.

I just skimmed over that paper.  Their results are novel, but almost useless in application.
If you're doing Fermat tests, there's a good chance the numbers you want to analyze are greater than 2^64.   Sad

Exactly, and they are much greater than 2^64 in primecoin.

As for CUMP, it is completely useless for primecoin, as it only implements floating point arithmetic and then only addition, multiplication and subtraction.
51  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: August 15, 2013, 05:10:59 PM
So, sad news:

I think there might be a bug in hp-9 somewhere, so that the trial division doesn't work quite right in that version and sorts out wrong candidates. I will need to confirm this with the hp-9 sources without my changes, so that I'm sure I didn't introduce that bug. The CUDA trial division seems to be doing the right thing, and it doesn't find any candidates to discard, because, if I understand it right, the sieve already did that. To put it differently: this idea is likely a dead end.

I pushed my changes for anyone who wants to play with it. There is also still a very slow ported version of the Fermat test, which is easily outperformed by GMP's implementation on the CPU. I think there is no easy way to avoid doing Fermat tests on the GPU. So for now, there is sadly nothing for the GPU that is faster than hp-9 on the CPU.

I will have a close look at Mtrlt's project, but as it seems, he might have similar problems. It would be a major achievement if he gets a GPU Fermat test working with a good speedup. It would mean that a fast GPU "exponentiation by squaring" algorithm is available to the research community, and prime research in general would benefit from that, as most primality tests (not only Fermat's test) need it. There are also a couple of papers on Fermat tests on the GPU (e.g. http://www.gpgpgpu.com/gecco2009/6.pdf), however these implementations usually assume that n is smaller than 32 or 64 bits, which makes the test much easier.

Also, if Mtrlt succeeds, he really deserves his prize money... it's really not an easy task and I doubt other GPU implementations are in the wild. I would then also port his method over to my CUDA project.
52  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: August 13, 2013, 03:00:41 PM
I was away for the past week and will look into it again this week. Yes, it's just a hobby project and it got bigger than I expected. Currently, I'm the only one working on this, so if someone wants to chip in and help (programming), send me a PM.

Status:

I will push my latest changes soon. I have updated my codebase to hp-9 and implemented a fast big-num small-prime trial division for the GPU. Depending on the settings, this can filter out 10-90% of all candidates. The CPU then computes the Fermat tests on the remaining candidates. I was under the impression that the sieve would already filter out all chains against small primes, but apparently the high performance client still filters out some candidates with trial divisions, and does this before doing Fermat tests.
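
The idea in plain Python (concept only; the actual GPU version works on the big-num limb representation, and both the prime list and the candidate numbers below are made up for the example):

Code:
# Concept sketch of the small-prime trial division filter: cheaply discard
# candidates that have a small prime divisor before spending a Fermat test on them.
SMALL_PRIMES = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59]

def passes_trial_division(n, primes=SMALL_PRIMES):
    """True if none of the small primes divides n."""
    return all(n % p != 0 for p in primes)

candidates = range(10**6 + 1, 10**6 + 20001, 2)  # odd dummy candidates
survivors = [n for n in candidates if passes_trial_division(n)]
print("%d of %d candidates left for Fermat testing" % (len(survivors), len(candidates)))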

If a fast Fermat test for the GPU surfaces, then filtering + Fermat tests could be chained directly on the GPU to give a better speedup.

To clarify: I didn't push my changes because I still have a silly bug somewhere, so that apparently not all prime divisors are found. But doing more prime division tests than the high performance client does by default already yields better speedups directly on the CPU.
53  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 31, 2013, 09:52:58 AM
I'm still on it - with a different idea. As it turns out, doing Fermat tests on the GPU is not a no-brainer and getting them fast requires too much effort for now, so I'll try to port something else to the GPU.

I'm still sure a GPU miner is possible, but right now I would say it's a lot harder than for the other coins. The other OpenCL miner project is (amusingly!) also having problems.
54  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 26, 2013, 12:20:14 PM
Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0 and I'm using 5.5 now, which works for me.
Just curious, have you looked at the Mfaktc source code at all?  While it is used for trial factoring Mersenne Primes, which may not be helpful, the writer did get it to sieve completely on the GPU, which may.

I looked into it, yes. The code is not very understandable, though...
55  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 25, 2013, 03:27:25 PM
Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0 and I'm using 5.5 now, which works for me.
56  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 25, 2013, 01:45:44 PM
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as well on the GPU. However, I think if the GPU can do the sieve much more efficiently it could generate a lot fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available; the student library I found and started to extend uses the most basic ones.

mpz_powmod needs fast multiplication of big ints. GMP's multiplication is most likely O(n*log(n)), while the schoolbook multiplication the GPU currently uses is O(n^2). I hoped that for the ~400-bit numbers involved it wouldn't make such a difference. Currently, the new version in my repo does Fermat tests on the GPU (at a rate of 10 per second), but my CPU is still faster due to better algorithms and a better big-num implementation.

But don't worry, I won't give up so fast! The current situation is that I either need to look into porting better algorithms to the GPU or do something other than Fermat tests on the GPU to sieve candidates (e.g. trial division by the most common primes).

Anybody with a better GPU than the Geforce 570 TI I own, please test this! My stats (base version is still hp4):

2013-07-24 21:53:38 primemeter     24303 prime/h    490729 test/h        47 5-chains/h

prime/h  and test/h seem to fluctuate enormously and seem to be rather meaningless. As most tests are on the GPU, I have no idea if this is even measuring the tests right. 5-chains is accurate though.

You have to use setgenerate true 1, i.e. one CPU thread for mining.  
Running current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU, no heat nor used-memory increase when using the Qt client. Anything special to activate so it mines with the GPU? I got a powerful GPU to test with Wink

EDIT:
Code:
2013-07-25 13:35:43 primemeter         0 prime/h   34261932 test/h         0 5-chains/h
seems the miner thread which should launch the CUDA is borked?

EDIT2:
Code:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated
from debug log

You can also run it with -printmining -printtoconsole to see that output directly. Could you compile the CUDA portion with -G -g (change the Qt project file where it invokes nvcc) and give me the output of cuda-memcheck?

You can also #define CUDA_DEBUG in the .cu file to see the GPU printfs on the console.
57  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 25, 2013, 01:20:57 PM
Good job primedigger! Just looking at your 5-chains score, I think that means that this is slower than most CPUs though, right? My crappy 4-core AMD does 400-800 5-chains/hour (around 3000 PPS).

Yes, it's an extremely crappy score, but it's a start. The high performance client does a good job of squeezing as much as possible out of your CPU, though!
58  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 25, 2013, 01:06:02 PM
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as well on the GPU. However, I think if the GPU can do the sieve much more efficiently it could generate a lot fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available; the student library I found and started to extend uses the most basic ones.

mpz_powmod needs fast multiplication of big ints. GMP's multiplication is most likely O(n*log(n)), while the schoolbook multiplication the GPU currently uses is O(n^2). I hoped that for the ~400-bit numbers involved it wouldn't make such a difference. Currently, the new version in my repo does Fermat tests on the GPU (at a rate of 10 per second), but my CPU is still faster due to better algorithms and a better big-num implementation.
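
For anyone wondering what "schoolbook" means here, a Python stand-in for the O(n^2) limb multiplication (the idea only -- the actual GPU code does this on machine words in CUDA and is not reproduced here):

Code:
# Schoolbook multiplication on 32-bit limbs: every limb of a times every limb
# of b, accumulated with carries -- n limbs by n limbs means n*n limb products.
MASK32 = (1 << 32) - 1

def to_limbs32(n, count):
    return [(n >> (32 * i)) & MASK32 for i in range(count)]

def schoolbook_mul(a, b):
    """Multiply two little-endian lists of 32-bit limbs."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = out[i + j] + ai * bj + carry
            out[i + j] = t & MASK32
            carry = t >> 32
        out[i + len(b)] += carry
    return out

x, y = (1 << 200) + 3, (1 << 190) + 7  # two toy ~200-bit operands
limbs = schoolbook_mul(to_limbs32(x, 13), to_limbs32(y, 13))
assert sum(v << (32 * i) for i, v in enumerate(limbs)) == x * y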

But don't worry, I won't give up so fast! The current situation is that I either need to look into porting better algorithms to the GPU or do something other than Fermat tests on the GPU to sieve candidates (e.g. trial division by the most common primes).

Anybody with a better GPU than the Geforce 570 TI I own, please test this! My stats (base version is still hp4):

2013-07-24 21:53:38 primemeter     24303 prime/h    490729 test/h        47 5-chains/h

prime/h  and test/h seem to fluctuate enormously and seem to be rather meaningless. As most tests are on the GPU, I have no idea if this is even measuring the tests right. 5-chains is accurate though.

You have to use setgenerate true 1, i.e. one CPU thread for mining. 
59  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 24, 2013, 02:55:20 PM
Progress update:

- The GPU does Fermat tests now and the CPU does the rest. Fermat tests seem to work fine now.
- I can find prime chains, but couldn't find a block on testnet in a timely manner

Todo:
- Transferring mpz types as strings is slow, so I will transform GMP mpz values directly into my CUDA format on the CPU.
- A lot has happened with the high performance client; I will update my codebase to hp7.
- The changes in hp6 could also be useful on the GPU (-> fast divisibility tests before doing the expensive Fermat test).
- Interleave CPU+GPU computations and async memory copies. Without this, my client won't be very fast.

I don't like to put an ETA on this, it's done when it's done. But I hope to have something by next week which outperforms my old Intel Core 2 quad core.

My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Edit: And thanks for the donations / pledges guys!
60  Alternate cryptocurrencies / Altcoin Discussion / Re: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP on: July 23, 2013, 11:03:58 AM
Hi,
if you guys don't mind I would also try it on an NVIDIA card...
thanks

Of course, if you want to hack on this - go ahead! Just to clarify: the project isn't at a stage where you can start mining with this, but we're not far away from that either.