Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: aliashraf on August 19, 2018, 07:42:03 PM



Title: A framework for designing CPU only mining algorithms
Post by: aliashraf on August 19, 2018, 07:42:03 PM
Hi all,

I'm not thinking of replacing Bitcoin's SHA256 or releasing a brand new coin, ... and yet I need a proof-of-work algorithm that resists parallelism, i.e. one where it is not feasible for GPUs to compete with a modern CPU.

I'm already aware of the extensive literature and the mostly failed efforts on this issue, but I can't get rid of one simple idea I have for designing such an algorithm. I just don't understand why it shouldn't be considered safe against GPU mining.

It will be highly appreciated if somebody proves me wrong  :D

Two Phase Mining
Suppose we have a memory-hard algorithm, something like Dagger Hashimoto, which utilizes a big chunk of memory. As we already know, GPUs mine such an algorithm far more efficiently and faster than a CPU, because their thousands of cores can share the same memory bank. For Ethash (a Dagger Hashimoto variant) this makes the whole mining process bound by memory bus performance, which resists ASICs but leaves CPUs behind: the GPU's huge number of cores keep a bus that is dedicated to the GPU almost fully utilized without hurting the miner's throughput.
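To make the rest of the post concrete, here is a toy Python sketch of the kind of memory-hard loop I mean. memhash, WORD and N_ACCESSES are made-up names and constants for illustration only; this is not the real Ethash/DAG construction.

Code:
import hashlib

# Toy constants -- placeholders, not real Ethash parameters
WORD = 32          # bytes fetched per memory access
N_ACCESSES = 20    # memory accesses per hash

def memhash(dag, header: bytes, nonce: int) -> int:
    """Toy memory-hard hash: seed from header|nonce, then repeatedly mix in
    pseudo-randomly addressed words of a large dataset (the 'DAG')."""
    mix = hashlib.sha256(header + nonce.to_bytes(8, 'little')).digest()
    for _ in range(N_ACCESSES):
        # the address depends on the current mix state, forcing a memory fetch
        addr = (int.from_bytes(mix[:8], 'little') % (len(dag) // WORD)) * WORD
        mix = hashlib.sha256(mix + bytes(dag[addr:addr + WORD])).digest()
    return int.from_bytes(mix, 'big')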

Now we change this algorithm such that it goes through two phases: an estimation phase and a targeting phase.

Estimation Phase is a normal run of the algorithm, but instead of looking for a hash below the network target dn, we look for a hash that meets a much lighter difficulty, like 2^16 times easier, i.e. d0 = dn << 16 (actually it is difficulty-1, the target, that we are talking about). We assume the shift/multiplication doesn't overflow, i.e. the network difficulty is greater than 2^16.

A typical GPU with enough RAM will substantially outperform any CPU here, obviously, because of its huge number of cores. After each hit (which happens very frequently), we have a nonce n1 that satisfies H(s|n1) < d0. Up to this point, everything is still in favor of GPUs.
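Phase 1, sketched with the same toy memhash as above (the linear nonce scan and parameter names are just illustration):

Code:
def estimation_phase(dag, header: bytes, dn: int, start_nonce: int = 0) -> int:
    """Phase 1: find a nonce whose hash beats the relaxed target
    d0 = dn << 16 (2^16 times easier than the network target dn).
    Targets are 'difficulty-1' values here: a bigger target is easier."""
    d0 = dn << 16
    n1 = start_nonce
    while memhash(dag, header, n1) >= d0:
        n1 += 1
    return n1   # the estimate nonce; hits are expected to be frequent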

But the Targeting Phase is supposed to be much harder to run using a shared chunk (like 1 GB) of memory. For this (a sketch of the whole phase follows the list):
1- We first set n2 = n1 << 16

2- Suppose we have a function f(M, n, s, e, flag) that partially changes a chunk of memory M (say 20% of it), using the supplied nonce n, from address s to address e; the flag determines whether the function only computes and returns the range or also modifies it in memory. This function is supposed to be complicated enough that running it is hundreds of thousands of times harder and more time-consuming than fetching a page from memory. We change the memory chunk (the DAG in Ethash) by applying this function with n1 as the second parameter, the start and end addresses of the whole chunk, and flag set to true to modify it. Now we have a chunk of memory dedicated to this nonce.

3- We run the original memory-hard algorithm with a special restriction: only the last 16 bits of n2 may be set to generate a new nonce n, i.e. n - n2 < 2^16
 
4- We need H(s|n) <= dn, i.e. a hash 2^16 times harder to find than in the estimation phase
5- Rebuild the memory chunk (e.g. restore from a backup)
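Putting steps 1-5 together, here is a toy sketch under the same assumptions as above. In particular, this placeholder f rewrites the whole supplied range and is nowhere near as expensive (or as partial, ~20%) as the real f is meant to be; it only shows the structure.

Code:
def f(mem, n: int, s: int, e: int, flag: bool):
    """Placeholder for f(M, n, s, e, flag): derives a nonce-dependent version
    of the words in [s, e). With flag=True it rewrites them in place, with
    flag=False it only computes and returns them. The real f must be vastly
    costlier than a memory fetch and touch only ~20% of the range."""
    view = bytearray()
    for addr in range(s, e, WORD):
        word = hashlib.sha256(n.to_bytes(8, 'little')
                              + addr.to_bytes(8, 'little')
                              + bytes(mem[addr:addr + WORD])).digest()
        if flag:
            mem[addr:addr + WORD] = word
        else:
            view += word
    return None if flag else bytes(view)

def targeting_phase(dag: bytearray, backup: bytes, header: bytes, n1: int, dn: int):
    """Phase 2: dedicate the chunk to n1, then brute-force only the low
    16 bits of the nonce against the real network target dn."""
    n2 = n1 << 16                                # step 1
    f(dag, n1, 0, len(dag), flag=True)           # step 2: specialise the chunk for n1
    try:
        for k in range(1 << 16):                 # step 3: n - n2 < 2^16
            n = n2 | k
            if memhash(dag, header, n) <= dn:    # step 4: full network target
                return n
        return None
    finally:
        dag[:] = backup                          # step 5: rebuild the chunk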

Validation
Validating the hash includes the following steps (a sketch of the verifier follows the list):
1- Calculate n1 = n >> 16 and d0 = dn << 16
2- Check that the supplied block header with its nonce yields a hash H(s|n) <= dn. For this, on each memory access, f(M, n1, address, address, false) should be called instead of a plain memory read.
3- If step 2 passes, check H(s|n1) < d0
4- Continue checking the other consensus rules.
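A sketch of the verifier under the same toy memhash and placeholder f as above (note the real f is called with s = e = address; my placeholder needs an explicit one-word range):

Code:
def validate(dag, header: bytes, n: int, dn: int) -> bool:
    """Verifier side: check both phases against the *unmodified* chunk,
    replacing every memory read by a calculate-only call to f."""
    n1, d0 = n >> 16, dn << 16                                  # step 1
    # step 2: recompute the memory-hard hash of (header, n), but read each
    # word through f(dag, n1, addr, addr + WORD, flag=False)
    mix = hashlib.sha256(header + n.to_bytes(8, 'little')).digest()
    for _ in range(N_ACCESSES):
        addr = (int.from_bytes(mix[:8], 'little') % (len(dag) // WORD)) * WORD
        word = f(dag, n1, addr, addr + WORD, flag=False)
        mix = hashlib.sha256(mix + word).digest()
    if int.from_bytes(mix, 'big') > dn:
        return False
    # step 3: the estimate nonce must beat the relaxed target on the raw chunk
    if memhash(dag, header, n1) >= d0:
        return False
    return True                                                 # step 4: other rules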

Discussion:
Note that the targeting phase above is only efficient if we follow the algorithm and actually apply the ~20% change to the memory chunk (the DAG file for Ethash). Otherwise we would have to recompute the values we read from memory in every single run of the algorithm, which is supposed to access memory many times (otherwise it would not be memory-hard at all).

If our algorithm accesses memory N times per hash (20 for Ethash, I suppose), then applying f on every access in each round of the targeting phase costs N executions of f per round, and we have almost 2^16 rounds. Obviously it wouldn't be cost-effective for a GPU to run f in calculate-only mode that many times.

Alternatively, modifying memory by calling f once is a single-threaded job, because f has to hold a lock on the memory and the GPU's many cores are useless during this process. If f is defined properly, a CPU running the second phase would outperform a GPU, because setting up thousands of cores to search a 64K (2^16) space simply isn't worth it.
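A back-of-the-envelope comparison of the two GPU strategies, taking (as the argument above seems to) one call to f as a fixed cost on the order of hundreds of thousands of fetches, whether it maps one word or rewrites its whole range. Every constant below is an assumption for illustration, not a measurement.

Code:
C_FETCH = 1            # cost of one plain memory fetch (arbitrary unit)
C_F     = 200_000      # assumed cost of one call to f, in fetch-units
N_READS = 20           # memory accesses per hash (Ethash-like)
ROUNDS  = 1 << 16      # low-16-bit nonces searched per estimate n1

calc_only   = ROUNDS * N_READS * C_F            # f recomputed on every single read
modify_once = C_F + ROUNDS * N_READS * C_FETCH  # one locked single-threaded pass, then cheap reads

print(f"calc-only:   {calc_only:,} fetch-units")    # ~262 billion
print(f"modify-once: {modify_once:,} fetch-units")  # ~1.5 million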

Conclusion

We use a two-phase algorithm: in phase one, an estimate nonce is generated that is useless without a complementary 64K (2^16) search that is practically single-threaded. Although GPUs keep their advantage in phase 1, the estimates they generate pile up uselessly, queued behind a single-threaded task that is deliberately designed to be the bottleneck.

I expect a 2-core CPU to beat a single GPU with up to 10 thousand cores.