Bitcoin Forum
Author Topic: limits of ZEC mining  (Read 9969 times)
QuintLeo
Legendary
*
Offline Offline

Activity: 1498
Merit: 1030


View Profile
November 15, 2016, 10:56:21 AM
 #21

And the power usage on ETH for the 1070 vs the RX 480 is also very similar - pretty much a dead heat on a hash/watt basis.

 Unfortunately for ETH or ZEC miners the 1070 is almost twice the cost of the RX480 while not offering comparable hash/$.




I'm no longer legendary just in my own mind!
Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin)
1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)
tromp
Legendary
*
Offline Offline

Activity: 978
Merit: 1087


View Profile
November 15, 2016, 02:21:20 PM
 #22

On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there's still some parts of the GCN architecture that I'm figuring out.


Hi, nerdralph! Would you be interested in going after my Cuckoo Cycle bounties?

https://github.com/tromp/cuckoo
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 15, 2016, 03:40:02 PM
 #23

Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.


even if the potential might be higher? or maybe you know already that this will be never the case?

Based on my limited knowledge of Nvidia GPUs, I think they have 32-byte cache lines, which should give them an advantage for equihash.
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 15, 2016, 03:51:00 PM
 #24

On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there's still some parts of the GCN architecture that I'm figuring out.


Hi, nerdralph! Would you be interested in going after my Cuckoo Cycle bounties?

https://github.com/tromp/cuckoo

After a quick scan of your readme, it doesn't look appealing.  While it might be fun, it doesn't seem to have any practical application in a popular cryptocurrency like BTC, ETH, or ZEC.
mrb
Legendary
*
Offline Offline

Activity: 1512
Merit: 1027


View Profile WWW
November 15, 2016, 04:28:34 PM
 #25

The reason is that AMD memory channels are 64-bits wide, and GDDR5 transfers a minimum of an 8-bit burst, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time.  In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 back to RAM.

This is actually incorrect. The GDDR4-5 channels are 32 bits wide, so the data granularity is 32 bytes. However, cache lines are 64 bytes long, so in practice you often have 2 consecutive bursts, hence 64 bytes read or written.

And for HBM memory (R9 Nano & R9 Fury) the channels are 128 bits wide, however the burst length is reduced to 2, maintaining the same granularity of 32 bytes.
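
The granularity figures being argued over here all come from the same product: channel width times burst length. A minimal sketch of that arithmetic, using only the widths and burst lengths quoted in this thread (the function name is made up for illustration):

```python
def access_granularity_bytes(channel_width_bits: int, burst_length: int) -> int:
    """Smallest amount of data one burst moves over a channel, in bytes."""
    return channel_width_bits * burst_length // 8


# Widths and burst lengths as quoted in the posts above.
gddr5_channel = access_granularity_bytes(32, 8)    # one 32-bit GDDR5 channel, burst of 8
hbm_channel = access_granularity_bytes(128, 2)     # one 128-bit HBM channel, burst of 2
wide_channel = access_granularity_bytes(64, 8)     # the 64-bit width claimed in the quoted text

print(gddr5_channel, hbm_channel, wide_channel)    # prints: 32 32 64
```

So the disagreement is only about which width applies per access: 32 bits gives 32 bytes, 64 bits gives 64 bytes.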
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 15, 2016, 05:18:49 PM
Last edit: November 15, 2016, 06:34:56 PM by nerdralph
 #26

The reason is that AMD memory channels are 64-bits wide, and GDDR5 transfers a minimum of an 8-bit burst, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time.  In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 back to RAM.

This is actually incorrect. The GDDR4-5 channels are 32 bits wide, so the data granularity is 32 bytes. However, cache lines are 64 bytes long, so in practice you often have 2 consecutive bursts, hence 64 bytes read or written.

And for HBM memory (R9 Nano & R9 Fury) the channels are 128 bits wide, however the burst length is reduced to 2, maintaining the same granularity of 32 bytes.

You are incorrect.  Cache lines are 64 bytes long because AMD memory channels are 64 bits wide (i.e. 2 GDDR5 chips).  The GCN memory controller is not 32 bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

edit: see also slide #34
http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah
mrb
Legendary
*
Offline Offline

Activity: 1512
Merit: 1027


View Profile WWW
November 15, 2016, 05:57:50 PM
 #27

You are incorrect.  Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 DDR5 chips).  The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I read tons and tons of docs (including this whitepaper), but somehow missed that one line. Ok. Misconception clarified :)
Kyubey
Newbie
*
Offline Offline

Activity: 35
Merit: 0


View Profile
November 15, 2016, 06:02:27 PM
 #28

So, looks like Claymore v6 reached the limit?
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 15, 2016, 06:07:21 PM
 #29

So, looks like Claymore v6 reached the limit?

I was just about to post that, as I'm getting ~145 sols/s on a Rx 470 clocked at 1250/1750.  I expect it will take much longer now to reach the 225 sols/s limit for 64-byte IO optimization.

nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 18, 2016, 02:59:31 PM
 #30

p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.

Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.  It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes.  About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration.  That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio.  Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.
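
The arithmetic in this post can be reproduced with a short script. This is only a rough reconstruction, not anything posted in the thread. It assumes the thread's unit convention (2^20 bytes counted as 1 MB, then 1000 MB per GB), the 224GB/s bandwidth of a Rx 470 with 7Gbps memory, and a solutions-per-iteration factor of about 1.87. That last number is never stated in the thread; it is back-solved so that the results land near the quoted 240 and 173 sols/s.

```python
ROWS = 2**20            # hash table rows per round (from the post)
CACHE_LINE = 64         # bytes per cache-line transfer
ROUNDS = 9              # rounds per iteration, as in "384MB * 9"
BANDWIDTH_GB_S = 224.0  # Rx 470 with 7Gbps memory
SOLS_PER_ITER = 1.87    # ASSUMPTION: back-solved from the thread's figures
REAL_WORLD = 0.93       # efficiency ratio the post borrows from ETH mining


def table_mb(slots: int = 12, slot_bytes: int = 32) -> float:
    """Size of one round's table, counting 2**20 bytes as 1 MB like the post."""
    return ROWS * slots * slot_bytes / 2**20


def sols_limit(transfers_per_row: float) -> float:
    """Bandwidth-bound solutions per second for a given IO cost per row."""
    mb_per_round = ROWS * transfers_per_row * CACHE_LINE / 2**20
    gb_per_iter = mb_per_round * ROUNDS / 1000
    iterations_per_second = BANDWIDTH_GB_S / gb_per_iter
    return iterations_per_second * SOLS_PER_ITER


if __name__ == "__main__":
    print(f"table per round: {table_mb():.0f} MB")
    for transfers in (3, 4.2):
        limit = sols_limit(transfers)
        print(f"{transfers} transfers/row: about {limit:.0f} sols/s max, "
              f"about {limit * REAL_WORLD:.0f} at 93%")
```

With strictly binary or strictly decimal units the results come out roughly 5% lower, so the figures should be read as estimates rather than hard bounds.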
xeridea
Sr. Member
****
Offline Offline

Activity: 449
Merit: 251


View Profile WWW
November 19, 2016, 01:17:17 AM
 #31

With Claymore 7.0 I get around 155 Sol/s on a 470 4GB and 164 Sol/s on a 480 4GB at default clocks.  So either V7 is at or above the theoretical limit, there are additional tricks to gain speed, or the theoretical limit is actually higher.

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  bc1qr2xwjwfmjn43zhrlp6pn7vwdjrjnv5z0anhjhn LTC:  LXDm6sR4dkyqtEWfUbPumMnVEiUFQvxSbZ Eth:  0x44cCe2cf90C8FEE4C9e4338Ae7049913D4F6fC24
reb0rn21
Legendary
*
Offline Offline

Activity: 1898
Merit: 1024


View Profile
November 19, 2016, 01:25:21 AM
 #32

The RX 480 has 8Gbps memory, which is faster than the RX 470's, so the speed is not over the theoretical limit; the 480 should be faster than the 470 at default clocks.

xeridea
Sr. Member
****
Offline Offline

Activity: 449
Merit: 251


View Profile WWW
November 19, 2016, 02:16:52 PM
 #33

rx480 have 8Gbs memory which is fater compared to rx470 so speed is not over theoretical, 480 should be faster then 470 on default
Both the 470 and 480 have 4GB and 8GB variants.  The 8GB 480 gets 180 Sol/s.  4GB cards have 6.6-7 GHz effective memory speed and 8GB cards typically 8GHz, but some manufacturers (MSI 470 ...) cheaped out.  On Ethereum, the 470 and 480 are nearly identical given the same memory.

With Sapphire Nitro cards (7 or 8GHz), this is what I get (Sol/s):

470 4GB 155
470 8GB 160

480 4GB 165
480 8GB 178

All have the memory strap mod but run at the default memory clock, so since the theory discussed is based only on bandwidth, the mod shouldn't affect the limit.
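
As a rough cross-check, these measurements can be compared with the bandwidth-based limit from earlier in the thread. The sketch below assumes the 4GB cards run 7Gbps memory and the 8GB cards 8Gbps (the post only says "typically"), and that the 173 sols/s figure for 7Gbps scales linearly with memory data rate. Both are assumptions for illustration.

```python
LIMIT_AT_7GBPS = 173.0   # sols/s, the pre-refresh estimate from the earlier post

# (card, ASSUMED memory data rate in Gbps, measured Sol/s from the list above)
MEASURED = [
    ("470 4GB", 7, 155),
    ("470 8GB", 8, 160),
    ("480 4GB", 7, 165),
    ("480 8GB", 8, 178),
]


def scaled_limit(gbps: float) -> float:
    """Scale the 7Gbps estimate linearly with memory data rate."""
    return LIMIT_AT_7GBPS * gbps / 7


def fraction_of_limit(gbps: float, sols: float) -> float:
    """Measured rate as a fraction of the scaled estimate."""
    return sols / scaled_limit(gbps)


for card, gbps, sols in MEASURED:
    print(f"{card}: {sols} Sol/s is {fraction_of_limit(gbps, sols):.0%} "
          f"of an estimated {scaled_limit(gbps):.0f}")
```

Under these assumptions every card stays below its scaled estimate, with the 480 4GB coming closest.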

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  bc1qr2xwjwfmjn43zhrlp6pn7vwdjrjnv5z0anhjhn LTC:  LXDm6sR4dkyqtEWfUbPumMnVEiUFQvxSbZ Eth:  0x44cCe2cf90C8FEE4C9e4338Ae7049913D4F6fC24
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 19, 2016, 04:34:56 PM
 #34

Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.  It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes.  About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration.  That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio.  Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.
Claymore 7.0 I get around 155 Sol/s on 470 4GB, and 164 Sol/s on 480 4GB, default clocks.  So either V7 is at/above theoretical limit, there are additional tricks to get speed, or theoretical limit is actually higher.

I'm getting 150 sols/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated.  If the 164 sols/s you are seeing on the Rx 480 is with a 1750MHz memory clock, then that is ~95% of my 173 sols/s instead of 93%.  It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (e.g. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it.  I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements.  I doubt anyone will find a serious mistake now.  Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga.  That is still too small, IMO, to help by more than a few %.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes.  With some work it should be possible to implement an equihash algorithm where the average amount of data manipulated per round is 16 bytes per record.  A GPU with a 32MB cache would then be limited by its cache bandwidth instead of the external GDDR5 bandwidth.
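
One plausible reading of the 32MB figure, offered as an inference rather than something stated in the post: Zcash uses Equihash with n=200 and k=9, which starts from 2^21 records, and 2^21 records at 16 bytes each is exactly 32MB. A minimal sketch of that arithmetic, with hypothetical names:

```python
def equihash_initial_records(n: int = 200, k: int = 9) -> int:
    """Number of initial hash records for Equihash(n, k): 2**(n/(k+1) + 1)."""
    return 2 ** (n // (k + 1) + 1)


BYTES_PER_RECORD = 16   # average per-round figure proposed in the post

records = equihash_initial_records()
working_set_mib = records * BYTES_PER_RECORD / 2**20

# "n reads plus n writes": every record is read once and written once per round.
min_io_mib_per_round = 2 * working_set_mib

print(records, working_set_mib, min_io_mib_per_round)
```

If the whole working set fits in cache, that read-plus-write traffic never has to reach external memory, which is the point of the post.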
xeridea
Sr. Member
****
Offline Offline

Activity: 449
Merit: 251


View Profile WWW
November 19, 2016, 09:59:17 PM
 #35

Yeah, I was basing that on the 480 4GB, since the 470 seems somewhat compute-limited on CM 7.0.  Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM.  He says he figures 300 H/s is possible on the 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalk.org/index.php?topic=1670733.msg16924145#msg16924145

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  bc1qr2xwjwfmjn43zhrlp6pn7vwdjrjnv5z0anhjhn LTC:  LXDm6sR4dkyqtEWfUbPumMnVEiUFQvxSbZ Eth:  0x44cCe2cf90C8FEE4C9e4338Ae7049913D4F6fC24
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 20, 2016, 02:01:14 AM
 #36

300 for a 390x is 1.73x the 173 limit I calculated for a Rx 470.  The 384GB/s memory bandwidth of the 390x is 1.71x the 224GB/s memory bandwidth of the Rx 470.  I'd say that's not a coincidence, and Claymore has come to the same conclusions that I have about the limits of ZEC mining performance.
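
The comparison can be written out as a quick calculation. The bus widths and per-pin data rates below are the cards' published specs, not figures from this thread, and the linear scaling of the 173 sols/s estimate with bandwidth is an assumption.

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits * data_rate_gbps / 8


RX_470 = bandwidth_gb_s(256, 7)     # 256-bit bus, 7Gbps GDDR5
R9_390X = bandwidth_gb_s(512, 6)    # 512-bit bus, 6Gbps GDDR5
LIMIT_RX_470 = 173.0                # sols/s, the estimate from this thread

# ASSUMPTION: the limit scales linearly with memory bandwidth.
scaled_390x = LIMIT_RX_470 * R9_390X / RX_470

print(f"Rx 470: {RX_470:.0f} GB/s, 390x: {R9_390X:.0f} GB/s")
print(f"bandwidth ratio: {R9_390X / RX_470:.2f}, claimed rate ratio: {300 / LIMIT_RX_470:.2f}")
print(f"173 sols/s scaled to the 390x: about {scaled_390x:.0f} sols/s")
```

The scaled value comes out a little under 300, which fits a rounded estimate on Claymore's side.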
xeridea
Sr. Member
****
Offline Offline

Activity: 449
Merit: 251


View Profile WWW
November 20, 2016, 03:35:18 AM
 #37

Hmm, I was thinking the 390x had 320GB/s memory bandwidth; I guess that's the 290x.  It seemed right since the 390x isn't much faster than the 480 at Ethereum.  It's about 27% faster at Ethereum and 27% faster at Equihash (as of CM 7.0 and many revisions before; this is the general trend).  Both algorithms are affected by timings, though less so with newer CM versions.  Your estimate seems reasonable, I am just speculating. I don't do side-project programming these days because I'm developing hand issues, so I can only speculate :(

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  bc1qr2xwjwfmjn43zhrlp6pn7vwdjrjnv5z0anhjhn LTC:  LXDm6sR4dkyqtEWfUbPumMnVEiUFQvxSbZ Eth:  0x44cCe2cf90C8FEE4C9e4338Ae7049913D4F6fC24
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 20, 2016, 04:18:41 PM
 #38

You are incorrect.  Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 DDR5 chips).  The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I read tons and tons of docs (including this whitepaper), but somehow missed that one line. Ok. Misconception clarified Smiley

Marc did some testing following this post and determined that, while reads result in 32 bytes being read from each of the two GDDR5 memory channels to fill a 64-byte cache line, writes are different.  When data is written to only half of a cache line (32 bytes), the dirty byte mask tells the controller that only one of the 2 GDDR5 memory channels is affected, so it writes to only that one.  However, this does not mean the write bandwidth is double what I originally calculated, as writing 2 32-byte chunks of memory to the memory controller requires 2 core clocks.  This would require the GPU core to be clocked the same as the memory, i.e. 2GHz for a Rx 480 with 8Gbps RAM.  This is due to the core:memory clock ratio limit I described here:
https://bitcointalk.org/index.php?topic=1682003.0
nerdralph (OP)
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 22, 2016, 08:15:08 PM
 #39

While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus.  On AMD GCN, each memory controller can transfer 64 bytes (1 cache line) per clock.  In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple of rounds).  All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock is being transferred to the L2 cache.  A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction.  This means a modified ht_store thread could update a row slot in 2 clocks.  If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block.  This would mean each row update operation could be done in 2 GPU core clock cycles: one for the counter update, and one for updating the row slot.
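
The clock counting in the paragraph above can be modelled in a few lines. This is a simplified sketch, not code from any miner: it assumes 32-byte row slots (as in the silentarmy layout discussed earlier), one 16-byte store per thread per clock, and one extra clock for the counter update.

```python
import math

MAX_BYTES_PER_STORE = 16   # one dwordX4 store per thread, per the post
SLOT_BYTES = 32            # ASSUMPTION: 32-byte row slots, as in silentarmy


def slot_write_clocks(threads: int) -> int:
    """Clocks to write one slot when the stores are spread over `threads` threads."""
    return math.ceil(SLOT_BYTES / (MAX_BYTES_PER_STORE * threads))


def row_update_clocks(threads: int) -> int:
    """One clock for the row counter update plus the slot write itself."""
    return 1 + slot_write_clocks(threads)


for threads in (1, 2, 4):
    print(f"{threads} thread(s): slot write {slot_write_clocks(threads)} clock(s), "
          f"row update {row_update_clocks(threads)} clock(s)")
```

Splitting the write over more than 2 threads gains nothing further in this model, since the slot already fits in a single clock.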

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory ratio of approximately 5:6.  In other words, when a Rx 470 has a memory clock of 1750MHz, the core would need to be clocked at 1750 * 5/6 = 1458MHz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth.  There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU.  However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group.  There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler.  Canis lupus?
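
The core clock requirements mentioned in this post and in the earlier one about 32-byte writes follow from simple ratios. A small sketch, assuming the memory clock is a quarter of the GDDR5 data rate (1750MHz for 7Gbps, as used throughout this thread); the function names are made up for illustration:

```python
from fractions import Fraction


def memory_clock_mhz(data_rate_gbps: float) -> float:
    """GDDR5 memory clock as reported by tuning tools: a quarter of the data rate."""
    return data_rate_gbps * 1000 / 4


def required_core_mhz(mem_clock_mhz: float, core_to_mem: Fraction) -> float:
    """Core clock needed to keep up with memory at a given core:memory ratio."""
    return float(mem_clock_mhz * core_to_mem)


mem_7gbps = memory_clock_mhz(7)   # Rx 470 with 7Gbps memory
mem_8gbps = memory_clock_mhz(8)   # Rx 480 with 8Gbps memory

print(f"5:6 ratio at 7Gbps: {required_core_mhz(mem_7gbps, Fraction(5, 6)):.0f} MHz")
print(f"1:2 ratio at 7Gbps (counters in LDS/GDS): {required_core_mhz(mem_7gbps, Fraction(1, 2)):.0f} MHz")
print(f"1:1 ratio at 8Gbps (32-byte writes): {required_core_mhz(mem_8gbps, Fraction(1, 1)):.0f} MHz")
```

Only the 1:2 case falls comfortably inside stock Polaris core clocks, which is why keeping the counters out of global memory matters.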
zawawa
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
November 22, 2016, 08:53:07 PM
 #40

This is juicy stuff! Thanks, nerdralph!

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ