Bitcoin Forum
Topic: AMD core/memory clock ratio fundamental limit 0.5
nerdralph (OP)
November 15, 2016, 07:07:29 PM
 #1

I previously posted that for modern AMD GPUs like Tonga and Polaris that have a 256-bit memory interface, the minimum core/memory clock ratio for ETH mining is ~0.56.  After digging through lots of GCN architecture documents for my analysis of ZEC mining, I have determined that the fundamental limit for fully utilizing the GDDR5 memory is 0.5.  This means that on an RX 480 with memory clocked at 2GHz, the core has to be at least 1GHz in order to fully utilize the GDDR5 memory bandwidth.  The reason is that the L2 cache can transfer at most 64 bytes per core clock.  GDDR5 transfers 4 bits per pin per clock cycle in a 2-cycle burst.  The GDDR5 memory interface is 32 bits wide, so each chip xfers 32 bytes per 2 clocks.  Tonga and Polaris use 8 GDDR5 chips, for 8 x 32 = 256 bits on the external memory bus.
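
The arithmetic in the post above can be sketched in a few lines of Python. This is only an illustration: the constants are the figures quoted in the post, the function names are made up for the example, and the 0.5 ratio is taken as stated rather than derived from the L2 figure. It also assumes the "memory clock" is the value that tuning tools report (2000 for an RX 480).

```python
# Figures quoted in the post; helper names are hypothetical.
PINS_PER_CHIP = 32            # each GDDR5 chip has a 32-bit interface
BITS_PER_PIN_PER_CLOCK = 4    # 4 bits per pin per memory clock
CLOCKS_PER_BURST = 2          # one burst spans 2 memory clocks
CHIPS = 8                     # Tonga and Polaris boards use 8 chips
MIN_CORE_TO_MEM_RATIO = 0.5   # the limit stated in the post (assumed, not derived)


def bytes_per_burst_per_chip() -> int:
    """Bytes one chip moves in a single 2-clock burst."""
    bits = PINS_PER_CHIP * BITS_PER_PIN_PER_CLOCK * CLOCKS_PER_BURST
    return bits // 8


def bus_width_bits() -> int:
    """Width of the external memory bus across all chips."""
    return CHIPS * PINS_PER_CHIP


def peak_bandwidth_gb_per_s(mem_clock_mhz: float) -> float:
    """Peak memory bandwidth in GB/s at the given memory clock."""
    bytes_per_clock = CHIPS * bytes_per_burst_per_chip() / CLOCKS_PER_BURST
    return bytes_per_clock * mem_clock_mhz * 1e6 / 1e9


def min_core_clock_mhz(mem_clock_mhz: float,
                       ratio: float = MIN_CORE_TO_MEM_RATIO) -> float:
    """Lowest core clock that could still saturate memory, per the post."""
    return mem_clock_mhz * ratio


if __name__ == "__main__":
    print(bytes_per_burst_per_chip())     # 32 bytes per burst per chip
    print(bus_width_bits())               # 256-bit bus
    print(peak_bandwidth_gb_per_s(2000))  # 256.0 GB/s at a 2000MHz memory clock
    print(min_core_clock_mhz(2000))       # 1000.0 MHz
```

Changing `MIN_CORE_TO_MEM_RATIO` to 0.56 gives the earlier ETH figure (1120MHz core for 2000MHz memory).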

olcaytu2005
November 15, 2016, 07:14:00 PM
 #2

Can you comment on the fact that with the latest optiminer/claymore releases it seems better to have a higher core/lower mem? Reported by me and several other people.
nerdralph (OP)
November 15, 2016, 07:19:38 PM
 #3

Can you comment on the fact that with the latest optiminer/claymore releases it seems better to have a higher core/lower mem? Reported by me and several other people.

This just indicates their kernel code could be more efficient, i.e. better optimized.

adaseb
November 15, 2016, 07:22:28 PM
 #4

What about the Tahiti and Hawaii GPUs? What's the ratio there?


batko
November 15, 2016, 07:29:39 PM
 #5

RX 480 1100/2000 = 135 Sol/s
RX 480 1300/2000 = 155 Sol/s

Donation: ETH: 0x0c8ce94dd3d1bfd09c0f887559c61d1b551e4b4d
nerdralph (OP)
November 15, 2016, 07:40:07 PM
 #6

What about the Tahiti and Hawaii GPUs? What's the ratio there?

Haven't looked at the Tahiti specs.  Hawaii uses the same L2 architecture as Tonga, except with twice the number of memory channels.  Therefore the same 0.5x ratio would apply.  I know an R9 290 won't get the maximum hashrate for ETH with a core clock at 1/2 the memory clock, but it likely would if it had more compute units.  Since an R9 380 with 28 CUs gets the max performance on ETH at the 0.56 ratio, a Hawaii GPU with 56 CUs would probably max out at the same 0.56 ratio.

I also suspect a more efficient miner (i.e. Wolf's miner written in GCN assembler) would make better use of the 40 CUs on an R9 290, and therefore could still get 36-37MH/s mining ETH with a 1250MHz memory clock and a 700-750MHz core clock.
nerdralph (OP)
November 15, 2016, 07:42:15 PM
 #7

RX 480 1100/2000 = 135 Sol/s
RX 480 1300/2000 = 155 Sol/s

This is not due to any fundamental limit of the GPU architecture, just how well optimized the miner code is.
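
One way to read the two RX 480 data points is to compare the relative hashrate gain with the relative core clock gain at the same 2000 memory clock. The sketch below does that; the function name is made up for illustration, and with only two measurements it is a rough indication at best.

```python
def scaling_efficiency(core_a: float, rate_a: float,
                       core_b: float, rate_b: float) -> float:
    """Relative hashrate gain divided by relative core clock gain.

    A value near 1.0 means hashrate follows the core clock almost linearly,
    which suggests the miner is limited by the core rather than by memory
    at those settings. A value near 0.0 means extra core clock buys nothing.
    """
    clock_gain = core_b / core_a - 1.0
    rate_gain = rate_b / rate_a - 1.0
    return rate_gain / clock_gain


if __name__ == "__main__":
    # Data points reported in the thread: core clock (MHz), hashrate (Sol/s).
    eff = scaling_efficiency(1100, 135, 1300, 155)
    print(f"{eff:.2f}")  # 0.81
```

An 18% core clock increase giving about 15% more Sol/s is consistent with the miner still being mostly core-bound at 1100/2000, even though that setting is already above the 0.5 ratio.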
uray
November 15, 2016, 10:44:46 PM
 #8

I previously posted that for modern AMD GPUs like Tonga and Polaris that have a 256-bit memory interface, the minimum core/memory clock ratio for ETH mining is ~0.56.  After digging through lots of GCN architecture documents for my analysis of ZEC mining, I have determined that the fundamental limit for fully utilizing the GDDR5 memory is 0.5.  This means that on an RX 480 with memory clocked at 2GHz, the core has to be at least 1GHz in order to fully utilize the GDDR5 memory bandwidth.  The reason is that the L2 cache can transfer at most 64 bytes per core clock.  GDDR5 transfers 4 bits per pin per clock cycle in a 2-cycle burst.  The GDDR5 memory interface is 32 bits wide, so each chip xfers 32 bytes per 2 clocks.  Tonga and Polaris use 8 GDDR5 chips, for 8 x 32 = 256 bits on the external memory bus.



Do you assume the memory clock is at stock timings, or is a tighter strap timing mod not relevant to the ratio?

My RX470-4G is running at 1150MHz/1975MHz with the 1500MHz memory timing strap, mining ZEC at 140 sol/s using claymore.
olcaytu2005
November 15, 2016, 10:49:12 PM
 #9

RX 480 1100/2000 = 135 Sol/s
RX 480 1300/2000 = 155 Sol/s

This is not due to any fundamental limit of the GPU architecture, just how well optimized the miner code is.


I don't know if you are able to inspect the optiminer code, but it seems to me it is better coded, since it gives the same hashrate as claymore without bottlenecking my CPU and without insane watt usage, and thus heat. Also, I was getting better results at low core/high mem. Is claymore pushing intensity to compete atm?
QuintLeo
November 16, 2016, 12:59:48 AM
 #10

Doesn't match up to reality on the R9 290.

Best results I get out of my R9 290s on ETH (running ANY of the ETH miners) have been with 1100 core / 1250 memory clocks, at least since I flashed them with one of TheStilt's BIOSes (before that they overheated long before I could get them to 1000).

I can't speak to ZEC on them, as ALL of my AMD miners are on LINUX and none of the LINUX miners for ZEC are even close to competitive on hashrate with the Windows ones.

I see the same thing on my R9 280X: best ETH results at 1100/1500.

In both cases, downclocking the core OR memory clock results in a decrease in hashrate.



 I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.



I'm no longer legendary just in my own mind!
Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin)
1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)
nerdralph (OP)
November 16, 2016, 03:32:54 AM
 #11

I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.
jstefanop
November 16, 2016, 05:16:59 AM
 #12

I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Project Apollo: A Pod Miner Designed for the Home https://bitcointalk.org/index.php?topic=4974036
FutureBit Moonlander 2 USB Scrypt Stick Miner: https://bitcointalk.org/index.php?topic=2125643.0
nerdralph (OP)
November 16, 2016, 11:39:24 AM
 #13

I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every mfr supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300MHz, faster page activation times than stock make no difference in the data xfer rate.

When the time for 16 burst xfers is less than the time to activate a page, only then will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750MHz, but not on an R9 290 at 1250.
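
The condition described above (time for 16 burst xfers versus time to activate a page) can be sketched as follows. The burst figures come from the thread; the page activation time is NOT given anywhere in the thread, so the 20 ns used in the example is purely a placeholder chosen for illustration, not a datasheet value, and the function names are hypothetical.

```python
CLOCKS_PER_BURST = 2    # from the thread: one GDDR5 burst spans 2 memory clocks
BURSTS_IN_WINDOW = 16   # from the post: 16 burst xfers


def burst_window_ns(mem_clock_mhz: float,
                    bursts: int = BURSTS_IN_WINDOW) -> float:
    """Time in nanoseconds needed to complete `bursts` back-to-back bursts."""
    clocks = bursts * CLOCKS_PER_BURST
    return clocks / mem_clock_mhz * 1000.0


def activation_limits_bandwidth(mem_clock_mhz: float,
                                activate_ns: float) -> bool:
    """True if the burst window is shorter than the page activation time,
    i.e. the case where page activate timing starts to cost bandwidth."""
    return burst_window_ns(mem_clock_mhz) < activate_ns


if __name__ == "__main__":
    HYPOTHETICAL_ACTIVATE_NS = 20.0  # placeholder only, not from any datasheet

    for clock in (1250, 1750):
        window = burst_window_ns(clock)
        limited = activation_limits_bandwidth(clock, HYPOTHETICAL_ACTIVATE_NS)
        print(f"{clock} MHz: window {window:.1f} ns, timing-limited: {limited}")
```

The point of the sketch is only the shape of the comparison: raising the memory clock shrinks the burst window (25.6 ns at 1250MHz, about 18.3 ns at 1750MHz), so a fixed activation time matters more at higher clocks.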
antantti
November 16, 2016, 11:57:14 AM
 #14


none of the LINUX miners for ZEC are even close to competitive on hashrate with the Windows ones.


What's wrong with optiminer?
AriesIV10
November 16, 2016, 05:20:27 PM
 #15

I previously posted that for modern AMD GPUs like Tonga and Polaris that have a 256-bit memory interface, the minimum core/memory clock ratio for ETH mining is ~0.56.  After digging through lots of GCN architecture documents for my analysis of ZEC mining, I have determined that the fundamental limit for fully utilizing the GDDR5 memory is 0.5.  This means that on an RX 480 with memory clocked at 2GHz, the core has to be at least 1GHz in order to fully utilize the GDDR5 memory bandwidth.  The reason is that the L2 cache can transfer at most 64 bytes per core clock.  GDDR5 transfers 4 bits per pin per clock cycle in a 2-cycle burst.  The GDDR5 memory interface is 32 bits wide, so each chip xfers 32 bytes per 2 clocks.  Tonga and Polaris use 8 GDDR5 chips, for 8 x 32 = 256 bits on the external memory bus.



Do you assume the memory clock is at stock timings, or is a tighter strap timing mod not relevant to the ratio?

My RX470-4G is running at 1150MHz/1975MHz with the 1500MHz memory timing strap, mining ZEC at 140 sol/s using claymore.

Based upon the ratio that you are suggesting, are you saying that 1200 core/1950 mem or 1270 core/1750 mem would be the better ratio for the MSI RX470 X GPU?  They both net the same results in H/s.

BTC Address (Donations):  3LepZAju88ZRuNVD4cS6Xv5hKyKrjvirkB     Website:  www.MintMining.com
Email: Mining@MintMining.com      Power Supplies: https://bit.ly/2TtvdOR
jstefanop
November 16, 2016, 05:30:59 PM
 #16

I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every mfr supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300MHz, faster page activation times than stock make no difference in the data xfer rate.

When the time for 16 burst xfers is less than the time to activate a page, only then will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750MHz, but not on an R9 290 at 1250.


I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)

AriesIV10
November 16, 2016, 05:48:22 PM
 #17

I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every mfr supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300MHz, faster page activation times than stock make no difference in the data xfer rate.

When the time for 16 burst xfers is less than the time to activate a page, only then will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750MHz, but not on an R9 290 at 1250.


I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)

Here is my 7 GPU RIG.

More details at: https://bitcointalk.org/index.php?topic=1676763.0



1015 H/s
7 MSI RX470s
968 Watts at the wall
68 ave Watts (MSI Afterburner)
57C ave GPU Temp (MSI Afterburner)
49C CPU Temp
1500 Strap
MSI Afterburner settings: 1950 Memory Clock, 1200 Core Clock

My main goal is profitability = W per H/s (less is better).
Previously I was at 0.97 W per H/s.
I am now at 0.95 W per H/s.

If there is a better setting that will increase efficiency, and hence profitability, I would love to know it. Likewise if I am doing something that will hurt my equipment in the long run.
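
The efficiency figure quoted above can be reproduced from the wall power and total hashrate listed for the rig. A minimal sketch, with function names invented for the example:

```python
def watts_per_hash(wall_watts: float, hashrate: float) -> float:
    """Wall power divided by hashrate, in W per H/s (lower is better)."""
    return wall_watts / hashrate


def per_gpu(total: float, gpu_count: int) -> float:
    """Split a rig-wide total evenly across the GPUs."""
    return total / gpu_count


if __name__ == "__main__":
    # Figures from the post: 968 W at the wall, 1015 H/s, 7 GPUs.
    print(round(watts_per_hash(968, 1015), 2))  # 0.95
    print(per_gpu(1015, 7))                     # 145.0 H/s per GPU
    print(round(per_gpu(968, 7), 1))            # 138.3 W per GPU at the wall
```

Note that the per-GPU wall figure includes the rest of the system (CPU, board, PSU losses), so it is expected to be higher than the per-GPU power that MSI Afterburner reports.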


QuintLeo
November 17, 2016, 01:31:43 AM
 #18

I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every mfr supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300MHz, faster page activation times than stock make no difference in the data xfer rate.

When the time for 16 burst xfers is less than the time to activate a page, only then will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750MHz, but not on an R9 290 at 1250.


Try the real world: TheStilt's BIOSes were a big deal mostly because they optimised memory timings and gave VERY NOTICEABLE improvements in actual memory throughput.
Even before I played with the core clocks, just flashing the BIOS gave me a noticeable hashrate improvement on my set of R9 290s on ETH, while DROPPING the memory clock from the previous optimal 1350 to 1250 after the BIOS change increased hashrate as well.
*THEN* I started playing with the core clock, and I found the optimal was to max it as high as it could go within the limits of overheating and stability.


 The point however is that your "theoretical optimal" ratio has ZERO basis in the real world for some reason, as dropping my R9 290s to 625/1250 or my R9 280x cards to 750/1500 would result in a HUGE hashrate drop from their currently optimal 1100/1250 and 1100/1500 respective settings, which indicates your theory is incomplete or incorrect.


 I can't speak to the RX series cards as I don't own any of those.

QuintLeo
November 17, 2016, 01:34:17 AM
 #19


I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)

Perhaps not on an RX series card, but a lot of us run the older R9 2xx and 3xx series cards (and a few of us run even older HD 78xx/79xx series cards) that can't DO 1750MHz memory and in some cases can't even do 1500.


nerdralph (OP)
November 17, 2016, 05:07:09 AM
 #20

The point however is that your "theoretical optimal" ratio has ZERO basis in the real world for some reason, as dropping my R9 290s to 625/1250 or my R9 280x cards to 750/1500 would result in a HUGE hashrate drop from their currently optimal 1100/1250 and 1100/1500 respective settings, which indicates your theory is incomplete or incorrect.

Congrats on knocking down the straw man.  I said the 0.5 clock ratio limit is a fundamental limit below which it would be impossible to max out the memory bandwidth.  In other words, even if you wrote the most efficient OpenCL (or GCN assembler) kernel possible, you couldn't drop the core clock lower than 1/2 the memory clock without impacting the memory bandwidth.
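
To make the "lower bound, not optimum" point concrete, the core/memory settings reported in this thread can be checked against the 0.5 floor. The sketch below is illustrative only; the list holds settings people posted above, and the helper names are invented.

```python
FLOOR = 0.5  # the limit claimed in the first post

# (card, core clock, memory clock) as reported by posters in this thread.
REPORTED = [
    ("RX 480", 1100, 2000),
    ("RX 480", 1300, 2000),
    ("RX 470", 1150, 1975),
    ("RX 470", 1200, 1950),
    ("RX 470", 1270, 1750),
    ("R9 290", 1100, 1250),
    ("R9 280X", 1100, 1500),
]


def core_to_mem_ratio(core_mhz: float, mem_mhz: float) -> float:
    """Core clock divided by memory clock."""
    return core_mhz / mem_mhz


def below_floor(settings, floor: float = FLOOR):
    """Return the settings whose core/memory ratio is under the floor."""
    return [s for s in settings if core_to_mem_ratio(s[1], s[2]) < floor]


if __name__ == "__main__":
    for card, core, mem in REPORTED:
        print(f"{card}: {core}/{mem} = {core_to_mem_ratio(core, mem):.2f}")
    print("below floor:", below_floor(REPORTED))
```

Every reported setting sits above 0.5, so none of them contradicts a lower bound. The claim says nothing about where a particular miner's best ratio lies above that floor.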
Brewins
January 18, 2017, 11:02:50 AM
 #21

I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every mfr supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300MHz, faster page activation times than stock make no difference in the data xfer rate.

When the time for 16 burst xfers is less than the time to activate a page, only then will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750MHz, but not on an R9 290 at 1250.


I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)

One of the few, except for people like me who have mapped out the ENTIRE BIOS with little tools like atomworks :P
