Bitcoin Forum
Topic: limits of ZEC mining
nerdralph (OP)
November 14, 2016, 01:28:53 AM
 #1

As I write this, the fastest ZEC miners are Claymore v5 and Optiminer v0.3.1.  Just yesterday silentarmy v5 was faster than Claymore v4, but then Claymore v5 leapfrogged silentarmy.  But the days of doubling the performance of ZEC miners are over, as the software is approaching the hardware performance limits of the GPUs (at least AMD GPUs).  To understand why, it helps to understand a bit about the zcash equihash algorithm.  For the math nerds, it's based on Wagner's algorithm for solving the generalized birthday problem.  Specifically, 2 million (2^21) pseudo-random numbers are generated using blake2b (see http://blake2.net/).  Each of these numbers is 200 bits (25 bytes), and they are sorted to find pairs of numbers that collide on the first 20 bits.  On average, there are about 2 million pairs that collide on the first 20 bits.  Those pairs are XORed, and the resulting numbers are sorted on the next 20 bits.  This continues for 8 rounds, until 40 bits are left.  In the 9th and final round there will be 2 (actually 1.88) collisions on the last 40 bits.  These last 2 collisions are the solutions to the equihash proof of work.
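To make the collide-and-XOR rounds concrete, here is a toy sketch of Wagner's algorithm in Python. It is not the Zcash implementation: the parameters are shrunk to 40-bit numbers and 4 rounds (Zcash uses 200 bits and 9 rounds), the hash input layout is a simplification (real Equihash uses a personalized blake2b and packs two numbers per digest), and the names toy_hash and toy_wagner are made up for this illustration.

```python
import hashlib
from collections import defaultdict


def toy_hash(seed: bytes, index: int, n_bits: int) -> int:
    """One n-bit pseudo-random number from blake2b (simplified input layout)."""
    digest = hashlib.blake2b(seed + index.to_bytes(4, "little"),
                             digest_size=n_bits // 8).digest()
    return int.from_bytes(digest, "big")


def toy_wagner(seed: bytes, n_bits: int = 40, k: int = 4):
    """Return index tuples whose hashes XOR to zero (toy parameters)."""
    cbits = n_bits // (k + 1)      # bits matched per round (20 for Zcash)
    count = 1 << (cbits + 1)       # starting list size (2^21 for Zcash)
    # Each entry: (remaining bits of the XOR, indices that produced it).
    entries = [(toy_hash(seed, i, n_bits), (i,)) for i in range(count)]
    remaining = n_bits
    for rnd in range(k):
        # The last round matches all remaining bits (2 * cbits of them).
        take = remaining if rnd == k - 1 else cbits
        shift = remaining - take
        bins = defaultdict(list)
        for value, idx in entries:
            bins[value >> shift].append((value, idx))   # bin on the top bits
        mask = (1 << shift) - 1
        entries = []
        for bucket in bins.values():
            for a in range(len(bucket)):
                for b in range(a + 1, len(bucket)):
                    (va, ia), (vb, ib) = bucket[a], bucket[b]
                    if set(ia) & set(ib):
                        continue        # skip pairs that reuse an index
                    entries.append(((va ^ vb) & mask, ia + ib))
        remaining = shift
    return [idx for _, idx in entries]


solutions = toy_wagner(b"demo")
print(len(solutions), "toy solutions, each combining", 2 ** 4, "indices")
```

The number of solutions varies with the seed and can be zero for a given run, which mirrors the "2 (actually 1.88) on average" figure above.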

Starting with 25 bytes of data, the natural choice for a data structure would be records of 32 bytes each.  In the silentarmy implementation (https://github.com/mbevand/silentarmy) these records are called slots.  Although the original (CPU-based) equihash algorithm uses a radix sort, the fastest sorting algorithm for equihash is a bin sort, with 2^20 (1 million) bins (silentarmy calls them rows).  At each round, the next 20 bits determine the bin to save the XOR data to, followed by a scan of all bins to find those with at least 2 records (slots) filled.  With an average of 2 million records of 32 bytes each, that's 64MB of data to scan each round.  You might think that there's also 64MB of data to write (into the bins) each round, but on an AMD GPU, there will be 128MB of writes to RAM for storing data in the bins.  The reason is that AMD memory channels are 64 bits wide, and GDDR5 has a minimum burst length of 8 transfers, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time.  In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 back to RAM.  So writing two 32-byte slots to a bin involves reading 64 bytes, writing 64 bytes, and repeating once more.  Together with the 64-byte read for the scan, that is 5 transfers of 64 bytes per bin.  Therefore a reasonably efficient equihash implementation will do 5 * 64 * 1 million bytes (320MB) of IO per round.  With 9 rounds that means 2.88GB per iteration, or 77.8 iterations per second on a Rx 470 with RAM clocked at 7Gbps (224GB/s memory bandwidth).  At 1.88 solutions per iteration, that's an average of 146 solutions/second, or about 25% faster than Claymore v5.
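The arithmetic in the paragraph above can be reproduced with a few lines of Python. This is only a restatement of the post's estimate, using its round figure of 1 million bins; the function and parameter names are illustrative.

```python
def burst_bytes(channel_bits: int = 64, burst_length: int = 8) -> int:
    """Smallest RAM transfer: channel width times burst length, in bytes."""
    return channel_bits * burst_length // 8


def sols_per_second(bandwidth_gb_s: float, transfers_per_bin: int,
                    rounds: int = 9, bins: int = 1_000_000,
                    sols_per_iteration: float = 1.88) -> float:
    """Upper bound on solutions/second if memory bandwidth is the only limit."""
    bytes_per_round = transfers_per_bin * burst_bytes() * bins
    bytes_per_iteration = bytes_per_round * rounds
    iterations_per_second = bandwidth_gb_s * 1e9 / bytes_per_iteration
    return iterations_per_second * sols_per_iteration


# 1 scan read + 2 x (read + write) for the two slots = 5 transfers per bin
print(burst_bytes())                                        # 64
print(round(sols_per_second(224, transfers_per_bin=5), 1))  # 146.2
```

Because the estimate is a pure bandwidth bound, the result scales linearly with the bandwidth_gb_s argument.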

The theoretical equihash performance limit on a Rx 470 is likely about 25% faster than 146 solutions/second, but it involves using 64-byte data structures that require a lot more memory.  So much memory that I think it will not be possible with 4GB cards.  At least it will be something for owners of 8GB Rx 480 cards to be happy about.
Subw
November 14, 2016, 01:52:52 AM
 #2

"it involves using 64-byte data structures"

How much change/coding would the transition to 64-byte data structures require?
Tmdz
November 14, 2016, 01:59:35 AM
 #3

Interesting to see that after 2 weeks we are fairly close to the limits.
nerdralph (OP)
November 14, 2016, 03:22:34 AM
 #4

"it involves using 64-byte data structures"

How much change/coding would the transition to 64-byte data structures require?

Someone like eXtremal could probably do it in a week, re-using parts of silentarmy.  It would take me 2-3 times longer.  I can write top-quality code, but I don't pump it out as fast as some other coders.
TheRider
November 14, 2016, 05:21:51 AM
 #5

...

Therefore a reasonably efficient equihash implementation will do 5 * 64 * 1 million bytes (320MB) of IO per round.  With 9 rounds that means 2.88GB per iteration, or 77.8 iterations per second on a Rx 470 with RAM clocked at 7Gbps (224GB/s memory bandwidth).  At 1.88 solutions per iteration, that's an average of 146 solutions/second, or about 25% faster than Claymore v5.

The theoretical equihash performance limit on a Rx 470 is likely about 25% faster than 146 solutions/second, but it involves using 64-byte data structures that require a lot more memory.  So much memory that I think it will not be possible with 4GB cards.  At least it will be something for owners of 8GB Rx 480 cards to be happy about.

A few noob questions if you don't mind.
What's the theoretical limit on the RX 470 8G Nitro cards with RAM clocked at 8Gbps (256GB/s)? Also, does overclocking the memory result in a linear increase in performance?
Does this all mean that equihash solving isn't GPU compute limited, but rather memory limited? If so, I wonder why GPU-Z shows 100% GPU load vs sub-40% memory controller load (whereas mining Eth fully loads both core and mem controller...)

Fascinating stuff. Thanks in advance.
nerdralph (OP)
November 14, 2016, 01:55:19 PM
 #6

...

Therefore a reasonably efficient equihash implementation will do 5 * 64 * 1 million bytes (320MB) of IO per round.  With 9 rounds that means 2.88GB per iteration, or 77.8 iterations per second on a Rx 470 with RAM clocked at 7Gbps (224GB/s memory bandwidth).  At 1.88 solutions per iteration, that's an average of 146 solutions/second, or about 25% faster than Claymore v5.

The theoretical equihash performance limit on a Rx 470 is likely about 25% faster than 146 solutions/second, but it involves using 64-byte data structures that require a lot more memory.  So much memory that I think it will not be possible with 4GB cards.  At least it will be something for owners of 8GB Rx 480 cards to be happy about.

A few noob questions if you don't mind.
What's the theoretical limit on the RX 470 8G Nitro cards with RAM clocked at 8Gbps (256GB/s)? Also, does overclocking the memory result in a linear increase in performance?
Does this all mean that equihash solving isn't GPU compute limited, but rather memory limited? If so, I wonder why GPU-Z shows 100% GPU load vs sub-40% memory controller load (whereas mining Eth fully loads both core and mem controller...)

Fascinating stuff. Thanks in advance.

A Rx 470 at 8Gbps would have a theoretical limit 8/7 times faster than one at 7Gbps.
The only part of equihash that is compute limited is the blake2b initialization.  The intention of the authors was for the algorithm to be limited by memory bandwidth.
https://www.internetsociety.org/sites/default/files/blogs-media/equihash-asymmetric-proof-of-work-based-generalized-birthday-problem.pdf

As for what GPU-Z shows, you'll have to figure out how to correctly interpret what it reports on your own.  I do my OpenCL development on Linux, and even if there were a Linux version, I don't consider GPU-Z a useful tool for kernel developers.
Katadin
November 14, 2016, 05:05:02 PM
 #7

Does it mean the R9 390, which has a 512-bit memory bus at 1500 MHz, should be faster than the 470?
nerdralph (OP)
November 14, 2016, 05:17:32 PM
 #8

Does it mean the R9 390, which has a 512-bit memory bus at 1500 MHz, should be faster than the 470?

An optimal implementation should be faster.
adaseb
November 14, 2016, 05:22:14 PM
 #9

What role does the memory bus width play in the speeds? Because many of these old 7950s are getting almost the same speeds as the 470/390.


CoinPro69
November 14, 2016, 05:27:48 PM
 #10

What role does the memory bus width play in the speeds? Because many of these old 7950s are getting almost the same speeds as the 470/390.

I just bought an older 3GB version of the 7950 to see how it performs with your optimized memory straps.
I also have the 470 4GB Nitro, which makes 110-120 sols/s.

I was wondering about the memory bus as well.
Genoil
November 14, 2016, 06:59:23 PM
Last edit: November 14, 2016, 07:54:50 PM by Genoil
 #11

Dude, where is your own miner?

Next coin I expect you to be one of the top dogs in the pit.

On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d
BTC: 1Nu2fMCEBjmnLzqb8qUJpKgq5RoEWFhNcW
nerdralph (OP)
November 14, 2016, 08:21:30 PM
 #12

What role does the memory bus width play in the speeds? Because many of these old 7950s are getting almost the same speeds as the 470/390.

That makes sense, since a 384-bit wide memory bus at 1.5GHz (6Gbps) has a bit more bandwidth (288GB/s) than a 256-bit wide bus at 8Gbps (256GB/s).
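The bus-width comparison is simple arithmetic: peak bandwidth is the bus width in bits times the per-pin data rate, divided by 8. A small sketch with the configurations mentioned in this thread (the function name and the dictionary are just for illustration, and these are peak figures, not measured throughput):

```python
def peak_bandwidth_gb_s(bus_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s for a given bus width and data rate."""
    return bus_bits * gbps_per_pin / 8


configs = {
    "256-bit @ 7Gbps (Rx 470)": (256, 7.0),
    "256-bit @ 8Gbps": (256, 8.0),
    "384-bit @ 6Gbps (7950 at 1.5GHz)": (384, 6.0),
    "512-bit @ 6Gbps (R9 390 at 1.5GHz)": (512, 6.0),
}
for name, (bits, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(bits, rate):.0f} GB/s")
```

This prints 224, 256, 288 and 384 GB/s respectively.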
nerdralph (OP)
November 14, 2016, 08:30:40 PM
Last edit: November 14, 2016, 08:46:33 PM by nerdralph
 #13

On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there are still some parts of the GCN architecture that I'm figuring out.

Eth miners max out at around 93% of the theoretical maximum.  24Mh/s is the theoretical max for a R9 380 with 6Gbps memory, and I've been able to get 22.3Mh/s out of a couple of cards.  You'll never reach 100% because refresh consumes some of the bandwidth, perhaps as much as 5%.

p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.
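The slot-ordering idea in the p.s. can be sketched in Python as follows. This is an illustration of the scheme as described, not code from any miner: the function names are hypothetical, and counting one scan read plus two plain writes per bin is one plausible reading of the post's "3 * 64 bytes" figure.

```python
SLOTS_PER_BIN = 12   # 12 slots of 32 bytes per bin, as in the post
LINE_BYTES = 64      # minimum RAM transfer on the AMD cards discussed


def slot_for_write(n: int) -> int:
    """Slot used by the n-th write (0-based) to a bin: even slots first."""
    half = SLOTS_PER_BIN // 2
    return 2 * n if n < half else 2 * (n - half) + 1


def needs_read_before_write(n: int) -> bool:
    """Even slots open a fresh 64-byte line (record + 32 dummy bytes), so
    no read is needed; odd slots share a line that already holds data."""
    return slot_for_write(n) % 2 == 1


def write_io_bytes(records: int) -> int:
    """RAM traffic for storing this many records in one bin."""
    return sum(LINE_BYTES * (2 if needs_read_before_write(n) else 1)
               for n in range(records))


def iterations_per_second(bandwidth_gb_s: float, bytes_per_bin: int,
                          bins: int = 1_000_000, rounds: int = 9) -> float:
    return bandwidth_gb_s * 1e9 / (bytes_per_bin * bins * rounds)


fill_order = [slot_for_write(n) for n in range(SLOTS_PER_BIN)]
per_bin = LINE_BYTES + write_io_bytes(2)   # one scan read + two plain writes
rate = iterations_per_second(224, per_bin)
print(fill_order)   # [0, 2, 4, 6, 8, 10, 1, 3, 5, 7, 9, 11]
print(per_bin, round(rate, 1), round(rate * 1.88), round(rate * 1.88 * 0.93))
```

With 224GB/s this works out to about 129.6 iterations per second, roughly 244 solutions per second at the theoretical limit and about 227 at 93%, which the post rounds to 130, 240 and 225.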
deadsix
November 14, 2016, 11:06:00 PM
 #14

Fascinating stuff, really. Thank you for trying to explain to us laymen how this stuff works. I'm not much of a programmer myself, but I've always wanted to better understand how miners work, what kind of data is processed, and how. I'll be following this thread closely.
Also, I have huge respect for people like you, genoil, mrvb etc. who work hard on these complex problems and still release stuff for free.

Ethereum/Zcash/Monero Mining Bangalore https://bitcointalk.org/index.php?topic=1703592
mirny
November 14, 2016, 11:23:42 PM
 #15

Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

This is my signature...
philipma1957
November 15, 2016, 01:15:30 AM
 #16

On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there are still some parts of the GCN architecture that I'm figuring out.

Eth miners max out at around 93% of the theoretical maximum.  24Mh/s is the theoretical max for a R9 380 with 6Gbps memory, and I've been able to get 22.3Mh/s out of a couple of cards.  You'll never reach 100% because refresh consumes some of the bandwidth, perhaps as much as 5%.

p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.


So 225 would be the max for both the 4GB and the 8GB cards?

nerdralph (OP)
November 15, 2016, 01:23:59 AM
 #17

Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.
nerdralph (OP)
November 15, 2016, 01:28:15 AM
 #18

p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.


So 225 would be the max for both the 4GB and the 8GB cards?

Yes.  I'm pretty sure with 3.5GB for the table data that the remaining 0.5GB on a 4GB card would be enough for the row counters and any other small data structures required.
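As a rough check of the memory budget, the table sizes can be computed directly. The sketch below assumes 2^20 rows and a card with exactly 4096 MiB usable, which is optimistic since some VRAM is normally reserved; the helper names are illustrative.

```python
MIB = 1 << 20


def table_bytes(rows: int = 1 << 20, slots_per_row: int = 12,
                slot_bytes: int = 32) -> int:
    """Size of one round's table: rows x slots x slot size."""
    return rows * slots_per_row * slot_bytes


def remaining_bytes(card_mib: int = 4096, rounds: int = 9) -> int:
    """Memory left after allocating one table per round (assumed card size)."""
    return card_mib * MIB - rounds * table_bytes()


print(table_bytes() // MIB)        # 384
print(9 * table_bytes() // MIB)    # 3456
print(remaining_bytes() // MIB)    # 640
```

So the nine tables take 3456 MiB, in line with the roughly 3.5GB quoted, and about 640 MiB would remain for row counters and other small structures under these assumptions.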
Amph
November 15, 2016, 07:41:03 AM
 #19

Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.


Even if the potential might be higher? Or maybe you already know that this will never be the case?
krnlx
November 15, 2016, 07:54:28 AM
 #20

Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.


There is no difference in speed between the CUDA and OpenCL implementations of silentarmy.  CUDA has an advantage in compute-bound algorithms, where its inline assembly (LOP3 and others) can be used.  In memory-hard algorithms it does not matter whether you use CUDA or OpenCL.

On heavily overclocked 1070s (Samsung memory) I'm getting ~590 s/s from 6 cards, which is about 97-98 s/s per card.  eXtremal got 90 s/s from a BIOS-tuned RX 480.
I don't have a 480, but I have a 470.  In Ethereum they get 27 MH/s, while an overclocked 1070 with Samsung memory gets 31-32 MH/s.  That is about 15-18% more than AMD.  So 92-98 s/s on a 1070 vs ~80-85 s/s on a 470 is proportional to the Ethereum hashrate difference.
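The proportionality claim can be checked with the figures quoted in this post. The snippet only restates those numbers; it does not add any measurements.

```python
# Ethereum hashrates from the post (MH/s)
eth_470 = 27.0
eth_1070_low, eth_1070_high = 31.0, 32.0

# Equihash rates from the post (solutions/second)
zec_470_low, zec_470_high = 80.0, 85.0
zec_1070_low, zec_1070_high = 92.0, 98.0

eth_ratio = (eth_1070_low / eth_470, eth_1070_high / eth_470)
zec_ratio = (zec_1070_low / zec_470_high, zec_1070_high / zec_470_low)

print(f"per-card rate from 6 cards: {590 / 6:.1f} s/s")
print(f"ETH 1070/470 ratio: {eth_ratio[0]:.2f} to {eth_ratio[1]:.2f}")
print(f"ZEC 1070/470 ratio: {zec_ratio[0]:.2f} to {zec_ratio[1]:.2f}")
```

The Ethereum ratio range falls inside the wider equihash ratio range, which is consistent with both workloads being limited by memory bandwidth.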