limits of ZEC mining

nerdralph (OP)


November 14, 2016, 01:28:53 AM

As I write this, the fastest ZEC miners are Claymore v5 and Optiminer v0.3.1. Just yesterday silentarmy v5 was faster than Claymore v4, but then Claymore v5 leapfrogged silentarmy. But the days of doubling the performance of ZEC miners are over, as the software is approaching the hardware performance limits of the GPUs (at least AMD GPUs). To see why, it helps to understand a bit about the zcash equihash algorithm. For the math nerds, it's based on Wagner's algorithm for solving the generalized birthday problem. Specifically, 2 million (2^21) pseudo-random numbers are generated using blake2b (see http://blake2.net/). Each of these numbers is 200 bits (25 bytes), and they are sorted to find pairs of numbers that collide on the first 20 bits. On average, there are about 2 million pairs that collide on the first 20 bits. Those pairs are XORed, and the resulting numbers are sorted on the next 20 bits. This continues for 8 rounds, until 40 bits are left; the 9th and final round then finds, on average, 2 (actually 1.88) collisions on those last 40 bits. These final collisions are the solutions to the equihash proof of work.
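To make the "about 2 million pairs" figure less abstract, here is a scaled-down sketch of a single collision round in Python. It is a toy, not how zcash or any miner actually generates inputs: the function name `toy_first_round`, the `seed` value, and hashing one index per blake2b call are all assumptions made for illustration. It keeps the same 2:1 ratio of inputs to bins (2^17 inputs into 2^16 bins instead of 2^21 into 2^20), where the expected number of colliding pairs is N(N-1)/(2*bins), which is roughly N.

```python
import hashlib
from collections import Counter


def toy_first_round(num_inputs_log2=17, seed=b"toy-header"):
    """Count pairs that collide on the leading bits in one scaled-down round.

    Hypothetical helper: the real algorithm uses 2^21 inputs and 20-bit bins.
    """
    n_inputs = 1 << num_inputs_log2
    bin_bits = num_inputs_log2 - 1  # half as many bins as inputs, like 2^21 vs 2^20
    counts = Counter()
    for i in range(n_inputs):
        # 25-byte (200-bit) pseudo-random number derived from the index
        digest = hashlib.blake2b(seed + i.to_bytes(4, "little"), digest_size=25).digest()
        value = int.from_bytes(digest, "big")
        counts[value >> (200 - bin_bits)] += 1  # leading bits select the bin
    # a bin holding c numbers contributes c*(c-1)/2 colliding pairs
    pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return n_inputs, pairs


if __name__ == "__main__":
    n, pairs = toy_first_round()
    print(f"{n} inputs -> {pairs} colliding pairs")
```

The pair count should land within a few percent of the input count, which is why the amount of data stays at roughly 2 million records from round to round.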

Starting with 25 bytes of data, the natural choice for a data structure is a record of 32 bytes. In the silentarmy implementation (https://github.com/mbevand/silentarmy) these records are called slots. Although the original (CPU-based) equihash algorithm uses a radix sort, the fastest sorting algorithm for equihash is a bin sort with 2^20 (1 million) bins (silentarmy calls them rows). At each round, the next 20 bits determine the bin to save the XOR data to, followed by a scan of all bins to find those with at least 2 records (slots) filled. With an average of 2 million records of 32 bytes each, that's 64MB of data to scan each round. You might think that there's also 64MB of data to write (into the bins) each round, but on an AMD GPU there will be 128MB of writes to RAM for storing data in the bins. The reason is that AMD memory channels are 64 bits wide, and GDDR5 transfers data in bursts of at least 8 transfers, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time. In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 bytes back to RAM. Writing two 32-byte slots to a bin therefore involves reading 64 bytes, writing 64 bytes, and then doing both once more, for 256 bytes of traffic per bin. Adding the 64 bytes per bin read back during the scan, a reasonably efficient equihash implementation will do 5 * 64 * 1 million bytes (320MB) of IO per round. With 9 rounds that means 2.88GB per iteration, or 77.8 iterations per second on a Rx 470 with RAM clocked at 7Gbps (224GB/s memory bandwidth). At 1.88 solutions per iteration, that's an average of 146 solutions/second, or about 25% faster than Claymore v5.
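The arithmetic above can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope estimate, not a model of a real GPU: the function name `bandwidth_limit` and its parameters are made up for illustration, and it follows the post in rounding 2^20 bins to 1 million.

```python
CHANNEL_BYTES = 64 // 8                      # 64-bit memory channel = 8 bytes per transfer
BURST_LENGTH = 8                             # minimum GDDR5 burst of 8 transfers
BURST_BYTES = CHANNEL_BYTES * BURST_LENGTH   # 64 bytes minimum per RAM access


def bandwidth_limit(bandwidth_bytes_per_s=7e9 * 256 / 8,  # 7Gbps on a 256-bit bus = 224GB/s
                    bins=1_000_000,                        # 2^20, rounded as in the post
                    rounds=9,
                    solutions_per_iteration=1.88):
    """Return (bytes per round, bytes per iteration, iterations/s, solutions/s)."""
    # two slot writes per bin, each a 64-byte read plus a 64-byte write,
    # plus one 64-byte read when the bin is scanned: 5 bursts in total
    per_bin = 5 * BURST_BYTES
    per_round = per_bin * bins
    per_iteration = per_round * rounds
    iterations_per_s = bandwidth_bytes_per_s / per_iteration
    return per_round, per_iteration, iterations_per_s, iterations_per_s * solutions_per_iteration


per_round, per_iter, its, sols = bandwidth_limit()
print(f"{per_round / 1e6:.0f}MB per round, {per_iter / 1e9:.2f}GB per iteration")
print(f"{its:.1f} iterations/s, {sols:.1f} solutions/s")
```

Passing `bins=2**20` instead of the rounded 1 million lowers the estimate slightly, to about 74 iterations/s and 139 solutions/s, so the figures are best read as approximate.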

The theoretical equihash performance limit on a Rx 470 is likely about 25% faster than 146 solutions/second, but reaching it involves using 64-byte data structures that require a lot more memory. So much memory that I think it will not be possible with 4GB cards. At least it will be something for owners of 8GB Rx 480 cards to be happy about.
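The post does not spell out where the extra 25% comes from. One possible reading, offered here as an assumption rather than the author's stated reasoning, is that a 64-byte slot fills a whole memory burst, so the read-modify-write penalty disappears while the scan has twice as much data to read. The helper `io_per_bin` below is hypothetical and only compares the memory traffic under that assumption.

```python
def io_per_bin(slot_bytes, slots_per_bin=2, burst_bytes=64):
    """Estimated bytes of RAM traffic per bin per round (hypothetical model)."""
    if slot_bytes < burst_bytes:
        # partial-burst write: controller reads a burst, modifies it, writes it back
        write_io = slots_per_bin * 2 * burst_bytes
    else:
        # full-burst write: no read needed first
        write_io = slots_per_bin * slot_bytes
    scan_io = slots_per_bin * slot_bytes  # reading the slots back during the scan
    return write_io + scan_io


small = io_per_bin(32)   # 256 written/read for the two slot writes + 64 scanned
large = io_per_bin(64)   # 128 written + 128 scanned
speedup = small / large
print(f"{small} vs {large} bytes per bin, ratio {speedup:.2f}")
print(f"estimated limit: {146 * speedup:.1f} solutions/s")
```

Under this model the traffic drops from 320 to 256 bytes per bin, a ratio of 1.25, which would put the limit at roughly 182 solutions/second. The model says nothing about the memory footprint, which is the constraint the post raises for 4GB cards.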

Subw
