Hi!
I took the time to go through MTP v1.2 in-depth a few days ago from an AMD implementation perspective, posting the results below.
Cheers,
K
---------------
Static analysis of an AMD GCN assembly implementation of MTP v1.2
=================================================================
TL;DR
- MTP v1.2 will be an intensive algorithm with very high power draw and general GPU load.
- An optimized AMD miner with a similar profile is Claymore's Neoscrypt miner.
- I would suggest looking into reducing the full Blake2b used in the main rounds to the 4-round Blake2b used for Argon2d.
Analysis
========
MTP v1.2 as defined in the paper [1] pulls 70 quasi-random selections of 1024 bytes from a generated Argon2d block of 2 GB, i.e. the full block can be seen as 2097152 segments of 1024 bytes. Each of the 70 main rounds necessary for a single hash does a full Blake2b hash of 32+1024 bytes of data: the Y-1 state from the previous round (32 bytes) and the loaded Argon2d segment (1024 bytes). A single Blake2b compress call can process a block of max 128 bytes. This means we need 9 blocks to hash the full 32+1024 bytes.
Blake2b is quite trivial to implement. Assuming a fully unrolled version with no extra overhead:
- 12 rounds needed for a single block of 128 bytes to compress.
- 8 G-function calls per round.
- 26 four-cycle asm instructions per G-function call.
Expanding to MTP v1.2:
- We need 12*8*26 = 2,496 four-cycle instructions per block of 128 bytes.
- For hashing a full round in MTP v1.2, we need nine blocks to cover 32+1024 bytes, resulting in 9*2496 = 22,464 four-cycle instructions.
- For a single MTP v1.2 hash (including only the main 70 rounds of computation with random Argon2d selections) we need 1,572,480 four-cycle instructions.
A single AMD GCN Compute Unit (CU) at X MHz clk can process max X/4 * 4 SIMDs * 64 threads four-cycle instructions/sec. This means that 64X / 1572480 is the theoretical max nr of hashes possible for one CU. Note that this assumes zero cost for all setup code, initial hash, memory loads etc so it's very much a theoretical upper limit - the true number will be lower.
Reference cards max hash rates:
Vega64 @ 1408 main clk: 64 CUs * 64*1408*10^6 / 1572480 =~ 3.67 MH/s
Rx580 @ 1256 main clk: 36 CUs * 64*1256*10^6 / 1572480 =~ 1.84 MH/s
The numbers above are upper bounds for the maximum possible hashrate when each card is running with a 100% busy VALU, i.e. doing nothing but crunching numbers and never stalling for memory loads. Next, we analyze what mem bandwidth is necessary to sustain a 100% busy VALU for the above cards and clocks. If the mem controller will be able to provide enough bandwidth, the algorithm will be computationally bound, otherwise memory bound.
For a single hash, you need to process 70kb of Argon2d segments. For the Vega64, this means approx 70*1024*3.67*10^6 = 244 GB/sec. For the Rx580, it's 70*1024*1.84*10^6 = 122 GB/s.
These numbers are by no means high enough to indicate that the mem controller will be the bottleneck in MTP v1.2. Moreover, with the relatively high cost of 2496 instructions between memory loads, internal latency hiding is trivial. This means you issue the next 128 byte memory load before you do a full Blake2b hash. Hence, given this information I draw the conclusion that an optimized AMD implementation of MTP v1.2 will _not_ be memory bound and bottlenecked by available memory bandwidth. Instead, all stream processors will continuously be running at 100% with the mem controller working quite hard to keep up as well. Assuming you want maximum possible hashrate, the end result is a very high power draw and high pressure on the GPU.
Comparisons to other hash functions
===================================
For further analysis, some relevant comparisons to other memory-intensive hash functions follow below. A relevant metric used in the comparisons is the nr of four-cycle instructions of raw calculations needed per loaded cacheline before you can make further progress in the algorithm. For MTP v1.2 this is 1248 instrs/cacheline, assuming we do the natural thing and load 128 bytes at the time for Blake2b hashing.
Ethash
------
Ethash loads 128 bytes per round in its main part. It does 64 rounds instead of 70 as in MTP v1.2. For an Rx 580, we can assume a max hashrate of 30 MH/s at 1256 clk. We know that this is a low estimate of the point where memory becomes the bottleneck. The effective bandwidth is then 30*10^6 * 64 * 128 = 228 GB/s, far above the 122 GB/s necessary for MTP v1.2 running on the same card. For a Vega64, the standard ethash reference hashrate is ~40 MH/s. This means a bandwidth of 40*10^6 * 64 * 128 = 305 GB/s, again more than enough to drive MTP v1.2 as a computationally bound algorithm.
The main difference between the two hash functions (in the analyzed main parts) is the nr of four-cycle instructions necessary to process 128 bytes. A decent asm implementation will do FNV in 5 four-cycle instruction equivalents, so 64*5 = 320 instructions, then assume 10 instructions for the set up for the next round = 330 instructions. Per loaded cacheline, this means 165 instructions, or only 13.2% of the 1248 instructions necessary in MTP v1.2.
Neoscrypt
---------
Neoscrypt does 512 main rounds divided into four distinct parts: store 128 rows of 256 bytes, read 128 quasi-random rows of 256 bytes, store 128 rows of 256 bytes, read 128 quasi-random rows of 256 bytes. For simplicity, we treat reading/writing as an aggregated single bandwidth. In each of the 512 rounds, either a full salsa or chacha round is calculated on 256 bytes, 64 bytes at the time. Salsa and chacha do 10 rounds of 96 instructions, so 960 instructions per call, and there are four such calls in each of the 512 rounds, which also means 960 instructions/loaded cacheline.
Hence, this algorithm, that we know becomes computationally bound in Claymore's asm miner, still needs LESS VALU-bound work per loaded cacheline vs MTP v1.2: 960 instrs vs 1248 instrs. And, the power draw profile of Claymore's miner is well known - to maximize your hashrate, your GPU needs to be running full throttle for the VALU (main clock) and preferable also for the mem controller, the end result being a very high power draw. An optimized AMD miner for MTP v1.2 will exhibit the same characteristics and power draw profile.
Discussion
==========
I don't believe the Zcoin community is fully aware of the characteristics of the upcoming algo, hence this post. I can only speak for how this would be implemented on modern AMD GPUs, but I see no reason for why the implementation would behave differently on Nvidia cards - the issue is in the algorithm design. Imho, the main issue is the full Blake2b hash with 12 rounds requiring too much computational work to process each loaded chunk of memory. Therefore, before MTP is pushed out in its currently suggested form it would certainly be interesting to hear what the researchers behind MTP think about e.g. reducing the hash used to the same 4-round Blake2b that they suggest themselves for the first Argon2d part. If still deemed cryptograhically safe, this would reduce the key metric used above with -67%. MTP would still be above ethash in terms of computations vs memory loads and power draw characteristics, but it would significantly lower the stress on mining GPUs.
Disclaimer
==========
I might very well have made an error or two in the analysis and calculations above, like a bad assumption on the nr of instructions necessary for ethash. However, I do not believe any such error would have an impact on the drawn conclusions.
References
==========
[1]
https://arxiv.org/pdf/1606.03588.pdf