FreeTrade (OP)
Legendary
Offline
Activity: 1470
Merit: 1030
December 19, 2013, 02:14:33 PM
A bit of (not so) bad news: by coalescing main RAM access I have sped it up by ~40% and opened up some more avenues for minor optimization. For now it runs at 5.86hpm on a 7870.
Okay, thanks. Worth noting our CPU algorithm hasn't been optimised or tuned at all, so we may have some room to catch up. What are you using for the AES encryption?
RepNet is a reputational social network blockchain for uncensored Twitter/Reddit style discussion. 10% Interest On All Balances. 100% Distributed to Users and Developers.
reorder
December 19, 2013, 02:24:37 PM
And by the way, when trying to build an optimized Quark miner, I noticed that the AES-NI version of the Groestl hash performed worse than the AVX version on Intel when called from multiple threads; for a single thread it was the other way around. It could be an implementation fault, but it could also mean, for example, a single on-chip AES module shared by the hyperthreads with serialized access. Maybe that has some implications for MemoryCoin as well.
Hmm - seeing hashing improvements linear with the number of cores, so I think those NI instructions must be part of each core.
Well, here is what I get on two E5-2620 Xeons (6 cores, 12 threads each):
[root@xxx ~]# openssl speed aes-256-cbc -multi 12
...
aes-256 cbc      382580.34k   517842.22k   521875.46k   525670.06k   527021.40k
[root@xxx ~]# openssl speed aes-256-cbc -multi 24
...
aes-256 cbc      588586.78k   611764.04k   617288.53k   618816.17k   619241.47k
Not linear at all. Hashing also does not scale linearly with the number of threads.
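For anyone who wants to reproduce this kind of scaling measurement programmatically rather than via the openssl binary, here is a minimal sketch using OpenSSL's EVP API and pthreads; the buffer size, iteration count and thread cap are arbitrary assumptions, and build flags may differ per system (roughly: gcc -O2 aes_bench.c -lcrypto -lpthread).
Code:
/* Hypothetical sketch, not the MemoryCoin miner: measure aggregate AES-256-CBC
 * throughput for N threads, similar in spirit to `openssl speed -multi N`.   */
#include <openssl/evp.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE    (64 * 1024)   /* 64 KB per EncryptUpdate call (assumed) */
#define ITERATIONS  20000         /* work per thread (assumed)              */
#define MAX_THREADS 256

static void *worker(void *arg)
{
    unsigned char key[32] = {0}, iv[16] = {0};
    unsigned char *in  = calloc(1, BUF_SIZE);
    unsigned char *out = malloc(BUF_SIZE + 16);
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int outlen;

    for (int i = 0; i < ITERATIONS; i++) {
        EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, out, &outlen, in, BUF_SIZE);
    }
    EVP_CIPHER_CTX_free(ctx);
    free(in); free(out);
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    pthread_t tid[MAX_THREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++) pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb   = (double)nthreads * ITERATIONS * BUF_SIZE / (1024.0 * 1024.0);
    /* If AES throughput scaled per thread, MB/s would rise linearly with the
     * thread count; a plateau past the physical core count is consistent with
     * hyperthreads sharing one AES unit per core.                             */
    printf("%d threads: %.1f MB/s\n", nthreads, mb / secs);
    return 0;
}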
reorder
December 19, 2013, 02:32:52 PM
A bit of (not so) bad news: by coalescing main RAM access I have sped it up by ~40% and opened up some more avenues for minor optimization. For now it runs at 5.86hpm on a 7870.
Okay, thanks. Worth noting our CPU algorithm hasn't been optimised or tuned at all, so we may have some room to catch up. What are you using for the AES encryption?
I use a bitsliced implementation from OpenSSL, reversed to little-endian to avoid conversion. I am still bounded by RAM latencies rather than computation. IIRC, random global memory access is 0.18 words per cycle on Radeons, despite the huge bandwidth, so it is the major bottleneck. In fact, I do not see room for much improvement in CPU hashing; OpenSSL is already (almost) perfect at AES.
FreeTrade (OP)
Legendary
Offline
Activity: 1470
Merit: 1030
December 19, 2013, 04:30:15 PM
Well, here is what I get on two E5-2620 Xeons (6 cores, 12 threads each):
[root@xxx ~]# openssl speed aes-256-cbc -multi 12
...
aes-256 cbc      382580.34k   517842.22k   521875.46k   525670.06k   527021.40k
[root@xxx ~]# openssl speed aes-256-cbc -multi 24
...
aes-256 cbc      588586.78k   611764.04k   617288.53k   618816.17k   619241.47k
Not linear at all. Hashing also does not scale linearly with the number of threads.
Scales linearly with the number of cores, maybe - so each core might have dedicated AES-NI instructions, but 2 or 4 processes on each core might not be able to access them. I bet you see linear scaling with 1, 2, and 4 cores... then a little drop-off at 8, a bigger one at 16, and a massive drop-off at 32.
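For context on that conjecture: AES-NI is exposed as ordinary per-core instructions, so each physical core runs them in its own execution units, which its hyperthreads share. A minimal sketch of the instruction itself via the _mm_aesenc_si128 intrinsic (the round key is a dummy, not a real AES key schedule; compile with something like gcc -maes -O2):
Code:
/* Illustrative only: one AESENC round applied repeatedly to a 128-bit state. */
#include <wmmintrin.h>   /* AES-NI intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i state    = _mm_set_epi32(0x00112233, 0x44556677, 0x089abcde, 0x0cdef012);
    __m128i roundkey = _mm_set1_epi32(0x0f0f0f0f);   /* dummy round key */

    /* Ten applications of a single AES round -- enough to show that AES-NI is
     * just an instruction issued by whichever core runs this thread.          */
    for (int i = 0; i < 10; i++)
        state = _mm_aesenc_si128(state, roundkey);

    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, state);
    for (int i = 0; i < 16; i++) printf("%02x", out[i]);
    printf("\n");
    return 0;
}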
RepNet is a reputational social network blockchain for uncensored Twitter/Reddit style discussion. 10% Interest On All Balances. 100% Distributed to Users and Developers.
FreeTrade (OP)
Legendary
Offline
Activity: 1470
Merit: 1030
December 19, 2013, 04:31:32 PM
In fact, I do not see room for much improvement in CPU hashing; OpenSSL is already (almost) perfect at AES.
Yeah, but you should see my code!
RepNet is a reputational social network blockchain for uncensored Twitter/Reddit style discussion. 10% Interest On All Balances. 100% Distributed to Users and Developers.
reorder
December 19, 2013, 04:52:24 PM
Well, here is what I get on two E5-2620 Xeons (6 cores, 12 threads each):
[root@xxx ~]# openssl speed aes-256-cbc -multi 12
...
aes-256 cbc      382580.34k   517842.22k   521875.46k   525670.06k   527021.40k
[root@xxx ~]# openssl speed aes-256-cbc -multi 24
...
aes-256 cbc      588586.78k   611764.04k   617288.53k   618816.17k   619241.47k
Not linear at all. Hashing also does not scale linearly with the number of threads.
Scales linearly with the number of cores, maybe - so each core might have dedicated AES-NI instructions, but 2 or 4 processes on each core might not be able to access them. I bet you see linear scaling with 1, 2, and 4 cores... then a little drop-off at 8, a bigger one at 16, and a massive drop-off at 32.
So this is what I was trying to say - a single AES circuit shared by both hyperthreads. Of course, all cores are identical and there has to be such a circuit in each.
In fact, I do not see room for much improvement in CPU hashing; OpenSSL is already (almost) perfect at AES.
Yeah, but you should see my code!
It's not like I could skip your momentum.cpp while writing the miner. Other than being somewhat hard to read (all those constants), it is pretty straightforward. I cannot think of a way it could be optimized significantly. Page-lock those caches, maybe, but you cannot do that in a portable way.
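A rough sketch of the "page-lock those caches" idea: on POSIX systems, mlock() pins a buffer into physical RAM so it cannot be paged out, which is exactly the part that is not portable (Windows would need VirtualLock, and the lock limit usually has to be raised). The 1 GB size just mirrors the working-set size discussed in this thread, not the miner's actual buffers.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define CACHE_BYTES (1UL << 30)   /* 1 GB scratchpad (assumed size) */

int main(void)
{
    unsigned char *cache = malloc(CACHE_BYTES);
    if (!cache) { perror("malloc"); return 1; }

    memset(cache, 0, CACHE_BYTES);   /* touch the pages so there is something to pin */

    if (mlock(cache, CACHE_BYTES) != 0) {
        /* Typically fails unless RLIMIT_MEMLOCK is raised or the process has CAP_IPC_LOCK. */
        perror("mlock");
    } else {
        puts("scratchpad locked in RAM");
        munlock(cache, CACHE_BYTES);
    }
    free(cache);
    return 0;
}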
eddilicious
December 20, 2013, 06:23:21 AM
Hey wait...
MemoryCoin = the coin that requires memory
I just figured out, as I started to read this thread, that MemoryCoin is not my good old memory of my first girlfriend, but computer memory by the latest standard. So, to be a miner, I either buy 16GB of memory or buy an R9 280. I either complain about people setting up 125-GPU farms (my rig only has 4 so far), or complain about people who have lots of credit on an Amazon server farm, because I only have one droplet in a cloud hashing at 0.26. It is a money game, one way or another, and a faith game: how much am I willing to commit to it? So, as a small-time miner, all we need is just a pool.
Stinky_Pete
December 21, 2013, 12:05:59 AM
Hey wait...
MemoryCoin = the coin that requires memory
I just figured out, as I started to read this thread, that MemoryCoin is not my good old memory of my first girlfriend, but computer memory by the latest standard. So, to be a miner, I either buy 16GB of memory or buy an R9 280. I either complain about people setting up 125-GPU farms (my rig only has 4 so far), or complain about people who have lots of credit on an Amazon server farm, because I only have one droplet in a cloud hashing at 0.26. It is a money game, one way or another, and a faith game: how much am I willing to commit to it? So, as a small-time miner, all we need is just a pool.
MemoryCoin only needs 1GB to run.
reorder
December 22, 2013, 01:47:13 PM
I have some more numbers from the GPU mining field to share. Currently a 7870GE mines at 8.42hpm at stock clocks, which is 7.12s per work unit. Four of those seven seconds are spent loading precalculated hashes from global RAM. I have also tried calculating the SHA hashes on the fly instead of storing them, but, unsurprisingly, calculating each hash 50 times is about 20 times slower than caching it once.
Essentially, it is not AES that makes it GPU-hostile but the huge amount of random RAM access required.
FreeTrade (OP)
Legendary
Offline
Activity: 1470
Merit: 1030
December 22, 2013, 02:44:12 PM
I have some more numbers from the GPU mining field to share. Currently a 7870GE mines at 8.42hpm at stock clocks, which is 7.12s per work unit. Four of those seven seconds are spent loading precalculated hashes from global RAM. I have also tried calculating the SHA hashes on the fly instead of storing them, but, unsurprisingly, calculating each hash 50 times is about 20 times slower than caching it once.
Essentially, it is not AES that makes it GPU-hostile but the huge amount of random RAM access required.
Music to my ears, thank you! As for commercialization, it looks like we're going to have pools soon, so there might be a good opportunity to run a GPU miners' pool. Alternatively, you could consider a binary release of the GPU miner that sends a small percentage of each block mined to you.
RepNet is a reputational social network blockchain for uncensored Twitter/Reddit style discussion. 10% Interest On All Balances. 100% Distributed to Users and Developers.
AnonyMint
December 27, 2013, 09:53:19 AM Last edit: January 08, 2014, 06:18:46 AM by AnonyMint
Essentially, it is not AES that makes it GPU-hostile but the huge amount of random RAM access required.
On the GPU, increase the parallelization of computation with random access, so that the random access is masked by computation. Also try to coalesce memory accesses so that latency is masked by the memory bandwidth, which can load data faster than the CPU can. If I am not mistaken, I believe you can essentially accomplish this statistically by running more copies of the same hash simultaneously, which means you need to increase the amount of memory on your GPU to, say, 16 or 32GB. I believe if you increase this enough (128GB?), you will eventually become computation bound. From the upthread conjecture, we would expect performance to top out at roughly 12 hashes per minute and be AES computation bound. Can you experiment and confirm, as it impacts what I am designing as well as MemoryCoin 2.0?
Also a reminder of the upthread conjecture that an ASIC would not need to use a GPU's high-latency memory design, so MemoryCoin remains vulnerable to ASICs.
Okay, 4 hashes per minute is very slow, but that is okay if validation is much faster than the search for a hash solution. How much faster? Denial-of-service is a threat: if each peer can only validate, say, 10 hashes per second, how will your system fend off a botnet denial-of-service attack that floods the network with bogus hashes?
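A generic CPU-side illustration of the statistical latency-masking idea (not the MemoryCoin PoW, and not GPU code): several independent pointer-chase "streams" are interleaved so the memory system can overlap their latencies, and the aggregate rate of dependent random loads grows with the number of streams until some other limit is hit. Table size and step count are arbitrary assumptions.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_WORDS (1u << 26)   /* 64M uint32 entries = 256 MB table */
#define STEPS       (1u << 22)   /* dependent loads per stream        */
#define MAX_STREAMS 16

int main(void)
{
    unsigned *table = malloc(sizeof(unsigned) * (size_t)TABLE_WORDS);
    if (!table) { perror("malloc"); return 1; }

    /* Random successor for each slot; a rigorous benchmark would build a
     * full-cycle permutation, but this is enough for an illustration.    */
    for (unsigned i = 0; i < TABLE_WORDS; i++)
        table[i] = (unsigned)rand() % TABLE_WORDS;

    for (int streams = 1; streams <= MAX_STREAMS; streams *= 2) {
        unsigned pos[MAX_STREAMS];
        for (int s = 0; s < streams; s++)
            pos[s] = (unsigned)rand() % TABLE_WORDS;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned i = 0; i < STEPS; i++)
            for (int s = 0; s < streams; s++)
                pos[s] = table[pos[s]];          /* dependent random load */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double loads = (double)STEPS * streams;
        printf("%2d streams: %6.1f M loads/s (ignore: %u)\n",
               streams, loads / secs / 1e6, pos[0]);  /* pos[0] printed to defeat dead-code elimination */
    }
    free(table);
    return 0;
}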
reorder
December 27, 2013, 10:31:18 AM
Essentially, it is not AES that makes it GPU-hostile but the huge amount of random RAM access required.
You've discovered my key insight (which I had mentioned during the Protoshares launch). But you probably don't yet know how to capitalize on it to make a CPU-only coin that is ASIC-resistant. You will eventually figure it out, but probably not before I have released the whitepaper. On the GPU, increase the parallelization of computation with random access, so that the random access is masked by computation. Also try to coalesce memory accesses so that latency is masked by the memory bandwidth, which can load data faster than the CPU can. If I am not mistaken, I believe you can essentially accomplish this statistically by running more copies of the same hash simultaneously, which means you need to increase the amount of memory on your GPU to, say, 16 or 32GB. I believe if you increase this enough (128GB?), you will eventually become computation bound. From the upthread conjecture, we would expect performance to top out at roughly 12 hashes per minute and be AES computation bound. Can you experiment and confirm, as it impacts what I am designing as well as MemoryCoin 2.0?
Also a reminder of the upthread conjecture that an ASIC would not need to use a GPU's high-latency memory design, so MemoryCoin remains vulnerable to ASICs.
Also, I wasn't paying attention before. 4 hashes per minute! Are you kidding me? I assumed per second. That is much too slow to fend off denial-of-service attacks. How are you going to test whether hash solutions are valid fast enough to prevent a denial-of-service attack on the proof-of-work?
To begin with, there are no consumer-grade GPUs with more than 6GB on the market. Besides, I have already done all the coalescing possible, both statistically and logically, and overall it yields only a 3x-4x advantage over the CPU (10hpm on a 7870). The PoW itself is not parallelizable due to CBC encryption. Of course, an ASIC may employ different techniques to reduce the latency (3D memory, etc.), since the memory is not accessed entirely at random as in scrypt, but in 64-byte chunks within 64KB linear ranges. It can even ditch the RAM entirely, replacing it with SHA calculation on the fly. Good luck designing such an ASIC, though. But a GPU is pretty much limited in what it can and cannot do.
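A hypothetical sketch of the access pattern as described here, purely to illustrate the locality argument (this is not the actual MemoryCoin PoW; the buffer contents and the mixing are placeholders): reads are 64-byte chunks taken at random offsets inside a randomly chosen 64 KB linear range of a ~1 GB scratchpad, which is far friendlier to caches, prefetchers and stacked memory than scrypt's fully random accesses.
Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SCRATCH_BYTES (1UL << 30)   /* ~1 GB scratchpad (assumed)   */
#define RANGE_BYTES   (64 * 1024)   /* 64 KB linear range           */
#define CHUNK_BYTES   64            /* one 64-byte chunk per access */

int main(void)
{
    uint8_t *scratch = calloc(1, SCRATCH_BYTES);
    if (!scratch) { perror("calloc"); return 1; }

    uint64_t acc = 0;
    for (int i = 0; i < 1000; i++) {
        /* Pick a random 64 KB-aligned range... */
        size_t range = ((size_t)rand() % (SCRATCH_BYTES / RANGE_BYTES)) * RANGE_BYTES;
        /* ...then read random 64-byte chunks inside that range. */
        for (int j = 0; j < 1024; j++) {
            size_t off = ((size_t)rand() % (RANGE_BYTES / CHUNK_BYTES)) * CHUNK_BYTES;
            uint64_t chunk[CHUNK_BYTES / 8];
            memcpy(chunk, scratch + range + off, CHUNK_BYTES);
            for (size_t w = 0; w < CHUNK_BYTES / 8; w++)
                acc ^= chunk[w];                 /* stand-in for the real mixing */
        }
    }
    printf("checksum: %llu\n", (unsigned long long)acc);
    free(scratch);
    return 0;
}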
AnonyMint
December 27, 2013, 10:57:35 AM Last edit: December 27, 2013, 11:17:49 AM by AnonyMint
To begin with, there are no consumer-grade GPUs with more than 6GB on the market.
Demand is a funny Economics 101 thing; it causes supply to rise to meet it along the price curve, assuming MemoryCoin became significant, although ASICs would probably take over before that point anyway. The point is that we want to test the technical limitations, not just what the market currently bears, because a coin needs to be future-proof: if GPUs can become more efficient at solving the hash by adding more memory, then we need to factor that into our analysis. However, see below; I now don't think more memory is necessary to increase parallelization.
Besides, I have already done all the coalescing possible, both statistically and logically, and overall it yields only a 3x-4x advantage over the CPU (10hpm on a 7870).
10hpm is 2.5x, correct? (FreeTrade reported 4hpm on the CPU.) That is faster than the last report I had seen from you in this thread, and it is congruent with the conjecture for the case where it is AES computation bound. Do you have any measurement giving an estimate of how close to compute bound your implementation is?
The PoW itself is not parallelizable due to CBC encryption.
I had forgotten from the upthread discussion that the hash can run up to 16,384 threads simultaneously without needing more than 1GB. How many threads are you running? Did you try increasing the number of threads? The point, I believe, is to get multiple random memory accesses to overlap statistically so that they are stored in the 768KB cache and latency is masked by memory bandwidth. I am not sure, though, how sophisticated the GPU is at merging coincident random memory accesses across threads into a single sequential access.
Of course, an ASIC may employ different techniques to reduce the latency (3D memory, etc.),
As far as I can see, it simply needs to have main memory similar to the CPU's (and perhaps an L2 cache), or perhaps even be a PCIe card that runs in your PC. The point is that AES can be made to run much faster if the CPU is compute bound, as I showed (see the link to the upthread post).
but in 64-byte chunks within 64KB linear ranges.
I thought it was working on a random chunk 64KB in size? If so, the random-access latency shouldn't be a factor, except that perhaps the 64KB is loaded quickly because of the very high memory bandwidth of the GPU. I wonder whether you did something wrong, whether you are misinterpreting some statistics you've analyzed, or whether I am not understanding the algorithm. Or are you not running enough threads to statistically mask the latency?
Good luck designing such an ASIC, though.
Upthread I cited references for low-transistor-count ASIC designs which run AES much faster.
But a GPU is pretty much limited in what it can and cannot do.
The GPU is limited only by its very high memory latency and the lack of specialized AES instructions. The former can't be rectified, as it is fundamental to what makes the memory bandwidth so high. The latter could perhaps be added to GPUs, since the transistor counts required are relatively small, as I cited with references upthread.
AnonyMint
December 27, 2013, 11:26:57 AM
Okay, 4 hashes per minute is very slow, but that is okay if validation is much faster than the search for a hash solution. How much faster? Denial-of-service is a threat: if each peer can only validate, say, 10 hashes per second, how will your system fend off a botnet denial-of-service attack that floods the network with bogus hashes?
Additionally, how will pool share hashes work if the hash rate is only 4 per minute for each pool miner? Won't the variance be incredibly high for block times in the few-minute range? Aren't you going to need at least a 10-minute block time, and thus no improvement over Bitcoin?
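One way to make the variance worry concrete, using standard Poisson counting rather than any figure from the thread: a miner producing shares at rate r over a block time T submits on average r*T shares per block, with a relative spread of about 1/sqrt(r*T). At 4 shares per minute and a hypothetical 3-minute block time, r*T = 12, so the per-block share count fluctuates by roughly 1/sqrt(12), i.e. about 29%, and a solo miner's time to a full solution is far more variable still.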
reorder
December 27, 2013, 11:27:53 AM
Do you have any measurement giving an estimate of how close to compute bound your implementation is?
It is about 50% now. How many threads are you running? Did you try increasing the number of threads?
As many as the GPU can start. The 7870 has 4 CUs of 256 threads each (plus some more not exposed to OpenCL, I believe).
The point, I believe, is to get multiple random memory accesses to overlap statistically so that they are stored in the 768KB cache and latency is masked by memory bandwidth.
GPUs do not have an automatically controlled L2 cache. I'd guess you are referring to 'local' memory, which is a totally different beast. The controller is smart enough, though, to 'stream' simultaneous accesses to adjacent RAM areas.
I thought it was working on a random chunk 64KB in size? If so, the random-access latency shouldn't be a factor, except that perhaps the 64KB is loaded quickly because of the very high memory bandwidth of the GPU. I wonder whether you did something wrong, whether you are misinterpreting some statistics you've analyzed, or whether I am not understanding the algorithm. Or are you not running enough threads to statistically mask the latency?
You cannot load 64KB anywhere; you have about 240*4 bytes of registers per thread and about 128 bytes of that 'local' memory per thread, and that's it.
AnonyMint
December 27, 2013, 11:35:39 AM Last edit: December 27, 2013, 11:47:00 AM by AnonyMint
Thanks for the feedback. It would be interesting to see if more latency is statistically masked with a higher number of threads. I wonder if there is any GPU simulator or actual GPU which can run more than 1024 threads?
In any case, your results are in the range expected by FreeTrade, even if you eliminate the remaining 50% that is memory-latency bound, so I suppose he is happy. It appears to be an improvement over Litecoin, yet I have some pending questions above about the impact of the slow hash rate.
Did you do any power measurements? One of the points I made upthread is that the GPU may be less power-efficient even though it achieves a faster hash rate. However, if it is latency bound (idle) 50% of the time, it may not be maxing out its power consumption.
reorder
December 27, 2013, 11:52:46 AM
Thanks for the feedback. It would be interesting to see if more latency is statistically masked with a higher number of threads. I wonder if there is any GPU simulator or actual GPU which can run more than 1024 threads?
Of course it would be masked. Random global RAM access averages about 0.18 words/thread (<1 byte) per cycle on Radeons, while sequentially you can load/store 16 bytes/thread. Higher-range Teslas have ~2048 threads, I believe (and a compiler that crashes on my kernel, so I have not tested with it yet). Also, Nvidia ships a pretty nice analyzer for CUDA where you can see subsystem utilisation at runtime.
In any case, your results are in the range expected by FreeTrade, even if you eliminate the remaining 50% that is memory-latency bound, so I suppose he is happy. It appears to be an improvement over Litecoin, yet I have some pending questions above about the impact of the slow hash rate.
Yes, it is like Litecoin without the infamous lookup-gap shortcut.
Did you do any power measurements? One of the points I made upthread is that the GPU may be less power-efficient even though it achieves a faster hash rate.
~200W at 10hpm for the 7870, but this varies across GPUs, of course.
AnonyMint
December 27, 2013, 11:58:34 AM
~200W at 10hpm for the 7870, but this varies across GPUs, of course.
So, very roughly, parity with the CPU on power efficiency, assuming the CPU is maxed out at 80W. CPU systems usually consume more than 100W when they are not idle, though, so a rack of GPUs might be slightly more power-efficient.
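A quick sanity check of that parity, using the 200W / 10hpm GPU figure above and the ~4hpm CPU rate mentioned earlier in the thread (the 80W CPU draw is the assumption from the post above):
GPU: 200 W / 10 hpm = 20 watt-minutes per hash
CPU:  80 W /  4 hpm = 20 watt-minutes per hash
i.e. essentially the same energy per hash, before counting the rest of the system's draw.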
hownowbrowncow
January 14, 2014, 10:16:31 AM
The latest CPU optimizations are a fantastic improvement.
I am running close to 2000HPM without using any GPU...
Thank you all!!!
What is your setup?
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
January 14, 2014, 07:35:37 PM Last edit: April 17, 2016, 09:24:27 PM by Evil-Knievel
This message was too old and has been purged