Author Topic: SILENTARMY v5: Zcash miner, 115 sol/s on R9 Nano, 70 sol/s on GTX 1070  (Read 209263 times)
eXtremal
Sr. Member
****
Offline Offline

Activity: 2106
Merit: 282




View Profile WWW
November 18, 2016, 01:35:14 PM
 #1241

Because it's not native and we are discussing an open-source project here. If you don't want to share your kernel, that's fine. At least give a little hint to mrb, nerdralph, and eXtremal for the new version.
Thank you.
I know what needs to be done, but I have a time problem. I'll make an update this week, along with instructions for other developers on how to reach the other miners' performance.

QuintLeo
Legendary
*
Offline Offline

Activity: 1498
Merit: 1030


View Profile
November 18, 2016, 01:51:26 PM
 #1242

The RX 480 has faster (8000 MHz effective) but narrower (256-bit) memory than the R9 290 and R9 390, which gives it overall slightly better memory bandwidth than the R9 290 (5000 MHz effective at 384-bit) but slightly worse than the R9 390 (6000 MHz effective at 384-bit).
 The RX 480 has 12.5% MORE compute cores (2304 vs. 2048, exactly a 9:8 ratio) at quite a bit HIGHER clock rate than the R9 390, and even more so than the R9 290.
 The RX 480 and R9 390 are both PCI-E 3.0 cards, the R9 290 is only PCI-E 2.0, but that has little or no measurable effect on most mining.

 The RX 480 is NOT "close or a bit less than an R9 290" but in fact is a superior card across the board, except ONLY for memory bus width.

 Might also want to pay attention to the R9 290X vs. the R9 290, as they have the same memory system but the 290X has the same 2304 cores that the RX 480 does.



He's gonna need some ice for that burn. Good job fact-checking.

 What burn? That GPU-Z image just proves my stated facts about it.
 If you're talking about the "listed" memory speed vs. my stated EFFECTIVE memory speed, keep in mind that GDDR5 performs four data transfers per command-clock cycle - on raw clocks the R9 290 and 290X run at 1250 MHz vs. 2000 MHz for the RX 480, so the same ratio as I stated.
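 To spell out that arithmetic (just a back-of-the-envelope illustration using the RX 480's commonly quoted stock numbers, nothing from GPU-Z):
Code:
/* GDDR5 effective rate and peak bandwidth, illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    double cmd_clock_mhz  = 2000.0;  /* RX 480 memory command clock */
    double bus_width_bits = 256.0;   /* RX 480 memory bus width */

    double effective_mtps = 4.0 * cmd_clock_mhz;                             /* quad data rate -> 8000 MT/s */
    double bandwidth_gbs  = effective_mtps * bus_width_bits / 8.0 / 1000.0;  /* bits -> bytes, MB/s -> GB/s */

    printf("%.0f MT/s effective, %.0f GB/s peak bandwidth\n", effective_mtps, bandwidth_gbs);
    return 0;
}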

 I'm not discussing overclocked efforts, or it would be even worse - the RX 480 has demonstrated a LOT more overclock headroom than any R9 2xx series managed.

 Too bad the other 2 links appear to be broken; it would be interesting to see what they were about.

 I don't need GPU-Z images for the R9 290 or R9 290X though, as I have several of the former and one of the latter, have worked with them quite a bit, and have the bloody specs on them memorised.


I'm no longer legendary just in my own mind!
Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin)
1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)
bensam1231
Legendary
*
Offline Offline

Activity: 1750
Merit: 1024


View Profile
November 18, 2016, 01:59:34 PM
 #1243


Do you also know that if you want to check whether an algo is memory limited, you can go into GPU-Z and check out the MCU (memory controller unit) and see the load on it?

I think this is wrong.  Although I primarily mine using Linux, I have a Windoze box that I use for testing cards.  GPU-z appears to show only external bus bandwidth use (to the GDDR), and not the utilization of the bandwidth between the controller and core.  In practical terms, a miner kernel may be using 200GB/s of memory bandwidth, but a significant percentage of it can be from the L2 cache.  The collision counter tables in SA5 would be an example of this.


Do you have a source for this hypothesis? In all memory-restricted algos, that correlates with MCU usage. Pretty sure it pertains to any sort of memory overload, bandwidth or bus width...

480 and 1070 have similar TDP.  Mining Zcash, their power usage would be similar.  1070 maybe slightly less if you could downclock it, but you can also undervolt the 480.  Even if the 1070 is slightly more efficient with optimized Zcash, it doesn't matter much. I make 9x more on ZCash than I spend in power.  So it isn't worth spending $400 on a card that has the same speed as a $200 card.

38% wasn't from me.  I was using a similar method of extrapolation.  I get 160 S/s on a 480, no overclocks. ~60% MCU on Claymore 6.0.

Their power usage would be similar if they were both being maxed out. Equihash is not a highly optimized algo yet, especially for Nvidia. That's the whole reason we're talking about this. You're trying to make a point of Nvidia not being that much more efficient than AMD with highly unoptimized code; not sure why you assume Nvidia, with almost no one working on it, is in the same shoes as AMD. Because MBK added Nvidia support, he put just as much effort into Nvidia as into his AMD endeavors?

What is a 'similar method'? I was literally talking about MCU usage. Also calling BS on 60% MCU usage. Give me a screenshot, which you didn't provide for Equihash either.

I like how you base assumptions on loose logic. The whole reason is that I'm not believing Equihash limits are based purely on memory bus width like Dagger (not bus bandwidth). That's what the whole BCT talk thread was about.

Screenshot, 55% average memory controller load, GPU-Z.  http://prnt.sc/d8phr4
Doing 160 S/s, Claymore 6.0. At the wall ~150 W, but I haven't tuned the voltage/core as much as I could. If you don't believe the screenshot, fire up CM 6.0; new version tomorrow.

I know Equihash miners aren't fully optimized yet.  But it is obvious it is memory limited, so even if it is fully optimized, the cards would perform similarly.  I was saying since the 1070 has higher compute (ignoring architecture differences that could favor either card), you may be able to underclock some to reduce power; similarly you could undervolt the 480, but it's not worth paying an extra $200 to save $2/month.

Card wasn't in the screenshot, but I'll believe you for shits and giggles since no one lies online, especially in an argument.  The card is at a reported 106 W, so even if the MCU is at 55% you'll hit TDP before ever maxing out the MCU, unless the code becomes more efficient, but that can be done for Nvidia as well.

This goes to show you even more so that this algo isn't completely memory bound. If it were, we wouldn't be hitting TDP before maxing out MCU usage. If TDP is the limiting factor, efficiency definitely becomes more important. Depending on how this algo stresses the cards when it's finally maxed out, based on what we're seeing right here, it's definitely not just memory bound. Lyra2v2 and NeoS also stress memory, but not enough for it to be the sole bottleneck.


And Wolf0 must google his name and BCT every day.

I buy private Nvidia miners. Send information and/or inquiries to my PM box.
bensam1231
Legendary
*
Offline Offline

Activity: 1750
Merit: 1024


View Profile
November 18, 2016, 02:04:15 PM
 #1244

The RX 480 has faster (8000 MHz effective) but narrower (256-bit) memory than the R9 290 and R9 390, which gives it overall slightly better memory bandwidth than the R9 290 (5000 MHz effective at 384-bit) but slightly worse than the R9 390 (6000 MHz effective at 384-bit).
 The RX 480 has 12.5% MORE compute cores (2304 vs. 2048, exactly a 9:8 ratio) at quite a bit HIGHER clock rate than the R9 390, and even more so than the R9 290.
 The RX 480 and R9 390 are both PCI-E 3.0 cards, the R9 290 is only PCI-E 2.0, but that has little or no measurable effect on most mining.

 The RX 480 is NOT "close or a bit less than an R9 290" but in fact is a superior card across the board, except ONLY for memory bus width.

 Might also want to pay attention to the R9 290X vs. the R9 290, as they have the same memory system but the 290X has the same 2304 cores that the RX 480 does.



He's gonna need some ice for that burn. Good job fact-checking.

 What burn? That GPU-Z image just proves my stated facts about it.
 If you're talking about the "listed" memory speed vs. my stated EFFECTIVE memory speed, keep in mind that GDDR5 performs four data transfers per command-clock cycle - on raw clocks the R9 290 and 290X run at 1250 MHz vs. 2000 MHz for the RX 480, so the same ratio as I stated.

 I'm not discussing overclocked efforts, or it would be even worse - the RX 480 has demonstrated a LOT more overclock headroom than any R9 2xx series managed.

 Too bad the other 2 links appear to be broken; it would be interesting to see what they were about.

 I don't need GPU-Z images for the R9 290 or R9 290X though, as I have several of the former and one of the latter, have worked with them quite a bit, and have the bloody specs on them memorised.



Basic google searches disprove the majority of what you're talking about, including performance (cores can't be compared across generations of cards or chip makers). I pointed out performance and the memory bus width as that was off the top of my head. Your memory is corrupt.

I buy private Nvidia miners. Send information and/or inquiries to my PM box.
QuintLeo
Legendary
*
Offline Offline

Activity: 1498
Merit: 1030


View Profile
November 18, 2016, 02:07:05 PM
 #1245



Scrypt GPU mining ended in the fall of '14 without private kernels. x11 started up shortly thereafter and became unprofitable at the beginning of winter. Gridseeds weren't ASICs either; the first ones weren't very profitable or good. You may have just remembered those little USB things coming out and thought 'well, those were ASICs'; they weren't. There were a lot of really bad ASICs. Gridseeds were never a good deal.

Unless you were running private kernels yourself, it wasn't happening.

What other algo are you looking at that's mature? Dagger doesn't count. That's a very niche scenario and it's bound almost exclusively by bus width. The GPUs never get a chance to even be close to being fully utilized.

The R9-290 has a 512-bit bus, as was already mentioned.

Who tests GPUs on sha-256? How about trying something remotely relevant to the discussion, like, say, NeoS, Lyra2v2, or even x11? People haven't made optimized miners for Sha in years. As mentioned before, if you're talking about 'theoretical usage' scenarios, video games are a very good example of that, as GPUs are made to run as fast as possible on them.

Memory usage doesn't need to be about bandwidth or bus width, it could just be the total memory usage as well. Not just that, it doesn't need to be restricted JUST to throughput, it can utilize memory and still do a lot of processing on GPUs. At this point though you're just making shit up and theorycrafting again.

You can blame latency all you want, but the Fury not only has a 4096-bit bus, but also gobs of memory bandwidth, and it's not eight times faster than the R9-290 or even twice as fast. It's not just all about memory speeds here or even latency.

 The Gridseed 3355 WAS in fact an ASIC - and on scrypt it was more efficient than anything GPU based at the time by quite a bit. A single side of an "80 blade" would pull 2.5 Mhash/sec at 40 watts, where the best GPUs of the time were pulling less than half that at a LOT more power (the 7990 was an exception with its pair of cores; it could actually manage a bit more than half the hashrate but pulled a TON more power to do so).

 Dagger (ETH) isn't "bus width limited", it's memory-access limited - NOT the same thing, or the RX 480 wouldn't even be close to matching the R9 290 on hashrate.

 For MOST usage, the Fury is a LOT faster than the R9 290 - but on ETH it's barely in the same ballpark despite the much higher "in theory" memory bandwidth. *SOMETHING* certainly keeps it uncompetitive with much older cards with lower rated memory bandwidth.


I'm no longer legendary just in my own mind!
Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin)
1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)
QuintLeo
Legendary
*
Offline Offline

Activity: 1498
Merit: 1030


View Profile
November 18, 2016, 02:10:55 PM
 #1246


 (cores can't be compared across generations of cards or chip makers).

 AMD cores in the GCN generations have been pretty consistent in their performance; if anything they've gotten a hair MORE efficient with generational changes.

 Comparing GCN to TeraScale cores or to NVidia cores (which I've NOT DONE AT ALL, strawman comment there) is a lot more problematic.

I'm no longer legendary just in my own mind!
Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin)
1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
November 18, 2016, 03:02:52 PM
 #1247


Do you also know that if you want to check whether an algo is memory limited, you can go into GPU-Z and check out the MCU (memory controller unit) and see the load on it?

I think this is wrong.  Although I primarily mine using Linux, I have a Windoze box that I use for testing cards.  GPU-z appears to show only external bus bandwidth use (to the GDDR), and not the utilization of the bandwidth between the controller and core.  In practical terms, a miner kernel may be using 200GB/s of memory bandwidth, but a significant percentage of it can be from the L2 cache.  The collision counter tables in SA5 would be an example of this.


Do you have a source for this hypothesis? In all memory-restricted algos, that correlates with MCU usage. Pretty sure it pertains to any sort of memory overload, bandwidth or bus width...

My knowledge of the AMD GCN architecture (and computer architecture in general), and my experience writing OpenCL.
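To illustrate the point with a toy kernel (not actual SA5 code, just a sketch of the idea): a counter table small enough to sit in GCN's L2 gets hammered with atomics, so the kernel consumes a lot of memory bandwidth internally while the external bus, which is what GPU-Z's MCU load reflects, stays comparatively quiet.
Code:
// Toy example only, not SA5 code. The 64K-entry counter table is 256 KB,
// which fits comfortably in a ~2 MB GCN L2, so most of these atomics are
// served from cache and barely register as external (MCU) traffic.
__kernel void count_collisions(__global const uint *rows,
                               __global uint *counters,  // 65536 entries, zeroed by the host
                               uint n_rows)
{
    uint gid = get_global_id(0);
    if (gid >= n_rows)
        return;
    uint bucket = rows[gid] & 0xFFFFu;   // bucket index from the low 16 bits of the row
    atomic_inc(&counters[bucket]);
}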
xeridea
Sr. Member
****
Offline Offline

Activity: 449
Merit: 251


View Profile WWW
November 18, 2016, 04:27:53 PM
 #1248


Do you also know that if you want to check whether an algo is memory limited, you can go into GPU-Z and check out the MCU (memory controller unit) and see the load on it?

I think this is wrong.  Although I primarily mine using Linux, I have a Windoze box that I use for testing cards.  GPU-z appears to show only external bus bandwidth use (to the GDDR), and not the utilization of the bandwidth between the controller and core.  In practical terms, a miner kernel may be using 200GB/s of memory bandwidth, but a significant percentage of it can be from the L2 cache.  The collision counter tables in SA5 would be an example of this.


Do you have a source for this hypothesis? In all memory-restricted algos, that correlates with MCU usage. Pretty sure it pertains to any sort of memory overload, bandwidth or bus width...

480 and 1070 have similar TDP.  Mining Zcash, their power usage would be similar.  1070 maybe slightly less if you could downclock it, but you can also undervolt the 480.  Even if the 1070 is slightly more efficient with optimized Zcash, it doesn't matter much. I make 9x more on ZCash than I spend in power.  So it isn't worth spending $400 on a card that has the same speed as a $200 card.

38% wasn't from me.  I was using a similar method of extrapolation.  I get 160 S/s on a 480, no overclocks. ~60% MCU on Claymore 6.0.

Their power usage would be similar if they were both being maxed out. Equihash is not a highly optimized algo yet, especially for Nvidia. That's the whole reason we're talking about this. You're trying to make a point of Nvidia not being that much more efficient than AMD with highly unoptimized code; not sure why you assume Nvidia, with almost no one working on it, is in the same shoes as AMD. Because MBK added Nvidia support, he put just as much effort into Nvidia as into his AMD endeavors?

What is a 'similar method'? I was literally talking about MCU usage. Also calling BS on 60% MCU usage. Give me a screenshot, which you didn't provide for Equihash either.

I like how you base assumptions on loose logic. The whole reason is that I'm not believing Equihash limits are based purely on memory bus width like Dagger (not bus bandwidth). That's what the whole BCT talk thread was about.

Screenshot, 55% average memory controller load, GPU-Z.  http://prnt.sc/d8phr4
Doing 160 S/s, Claymore 6.0. At the wall ~150 W, but I haven't tuned the voltage/core as much as I could. If you don't believe the screenshot, fire up CM 6.0; new version tomorrow.

I know Equihash miners aren't fully optimized yet.  But it is obvious it is memory limited, so even if it is fully optimized, the cards would perform similarly.  I was saying since the 1070 has higher compute (ignoring architecture differences that could favor either card), you may be able to underclock some to reduce power; similarly you could undervolt the 480, but it's not worth paying an extra $200 to save $2/month.

Card wasn't in the screenshot, but I'll believe you for shits and giggles since no one lies online, especially in an argument.  The card is at a reported 106 W, so even if the MCU is at 55% you'll hit TDP before ever maxing out the MCU, unless the code becomes more efficient, but that can be done for Nvidia as well.

This goes to show you even more so that this algo isn't completely memory bound. If it were, we wouldn't be hitting TDP before maxing out MCU usage. If TDP is the limiting factor, efficiency definitely becomes more important. Depending on how this algo stresses the cards when it's finally maxed out, based on what we're seeing right here, it's definitely not just memory bound. Lyra2v2 and NeoS also stress memory, but not enough for it to be the sole bottleneck.


And Wolf0 must google his name and BCT every day.
The first card.  The others are 4GB and different 480s.  While compute is more of a factor than for Eth, memory is still a major factor; I highly doubt it would ever be worth getting a 1070 over a 480.

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  bc1qr2xwjwfmjn43zhrlp6pn7vwdjrjnv5z0anhjhn LTC:  LXDm6sR4dkyqtEWfUbPumMnVEiUFQvxSbZ Eth:  0x44cCe2cf90C8FEE4C9e4338Ae7049913D4F6fC24
xeridea
Sr. Member
****
Offline Offline

Activity: 449
Merit: 251


View Profile WWW
November 18, 2016, 04:33:41 PM
 #1249



Scrypt GPU mining ended in the fall of '14 without private kernels. x11 started up shortly thereafter and became unprofitable at the beginning of winter. Gridseeds weren't ASICs either; the first ones weren't very profitable or good. You may have just remembered those little USB things coming out and thought 'well, those were ASICs'; they weren't. There were a lot of really bad ASICs. Gridseeds were never a good deal.

Unless you were running private kernels yourself, it wasn't happening.

What other algo are you looking at that's mature? Dagger doesn't count. That's a very niche scenario and it's bound almost exclusively by bus width. The GPUs never get a chance to even be close to being fully utilized.

The R9-290 has a 512-bit bus, as was already mentioned.

Who tests GPUs on sha-256? How about trying something remotely relevant to the discussion, like, say, NeoS, Lyra2v2, or even x11? People haven't made optimized miners for Sha in years. As mentioned before, if you're talking about 'theoretical usage' scenarios, video games are a very good example of that, as GPUs are made to run as fast as possible on them.

Memory usage doesn't need to be about bandwidth or bus width, it could just be the total memory usage as well. Not just that, it doesn't need to be restricted JUST to throughput, it can utilize memory and still do a lot of processing on GPUs. At this point though you're just making shit up and theorycrafting again.

You can blame latency all you want, but the Fury not only has a 4096-bit bus, but also gobs of memory bandwidth, and it's not eight times faster than the R9-290 or even twice as fast. It's not just all about memory speeds here or even latency.

 The Gridseed 3355 WAS in fact an ASIC - and on scrypt it was more efficient than anything GPU based at the time by quite a bit. A single side of an "80 blade" would pull 2.5 Mhash/sec at 40 watts, where the best GPUs of the time were pulling less than half that at a LOT more power (the 7990 was an exception with its pair of cores; it could actually manage a bit more than half the hashrate but pulled a TON more power to do so).

 Dagger (ETH) isn't "bus width limited", it's memory-access limited - NOT the same thing, or the RX 480 wouldn't even be close to matching the R9 290 on hashrate.

 For MOST usage, the Fury is a LOT faster than the R9 290 - but on ETH it's barely in the same ballpark despite the much higher "in theory" memory bandwidth. *SOMETHING* certainly keeps it uncompetitive with much older cards with lower rated memory bandwidth.



The Fury cards have HBM, which has a lot higher memory bandwidth, but higher latency.  Eth is sensitive to latency.  This is also why the 1080 sucks at Eth: the GDDR5X doesn't have much more bandwidth, but has higher latency.  Tightening memory timings on Eth or Zcash gives you a speed boost.  Games aren't really affected by latency as much, just raw bandwidth.  HBM2 will be better, though Vega 10 may or may not have that much more bandwidth.  There is a new way of accessing HBM2 that reduces latency somewhat, though, if the application is coded for it, so we will see how things go.

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  bc1qr2xwjwfmjn43zhrlp6pn7vwdjrjnv5z0anhjhn LTC:  LXDm6sR4dkyqtEWfUbPumMnVEiUFQvxSbZ Eth:  0x44cCe2cf90C8FEE4C9e4338Ae7049913D4F6fC24
adamvp
Hero Member
*****
Offline Offline

Activity: 1246
Merit: 708



View Profile
November 18, 2016, 07:53:41 PM
 #1250

Sorry if it was already answered, but...
I would like to ask about the new release of Optiminer.
Could someone give me an example of how to use its watchdog (on Ubuntu 14)?
Code:
--watchdog-timeout <seconds>
     Timeout after which the watchdog triggers if a GPU does not produce
     any solutions. It will execute the command specified by
     --watchdog-cmd. You can use this command to do an appropriate action
     (e.g. reset driver or reboot). 0 disables watchdog.
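I guess something like this? (Just my guess at the syntax; the binary name and the reboot command are assumptions on my part, not from the docs.)
Code:
# Reboot the rig if a GPU produces no solutions for 60 seconds.
./optiminer-zcash --watchdog-timeout 60 --watchdog-cmd "sudo reboot"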

Are you for real? Open a thread on Optiminer then, or use an existing Optiminer one to discuss it. This thread is for a different miner.
Sorry, I am a little tired now and can't see any Optiminer thread, sorry...
I'd prefer to use a FOSS miner, but while waiting for the new release of SM I have to use another one due to its significantly better sol rate :(

I am looking for signature campaign Wink pm me
zawawa
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
November 18, 2016, 08:06:46 PM
 #1251

Hey devs, I have been playing with eXtremal's latest kernel and trying to optimize kernel_sols() now as it seems to be one of the bottlenecks as far as I can tell with CodeXL with NUMVGPR being over 170 on RX 480. I was able to reduce it to 50 something, but I cannot get rid of scratchpad registers that are 512 bytes in size. They must have something to do with array indexing, but I was not able to pinpoint the exact portion of the code that is causing register spills. Do you guys have any ideas?

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
November 18, 2016, 08:15:21 PM
 #1252

By the way, I was able to create a multi-threaded version of sa-solver for Windows with a few percent speed gain.
I would like to see more performance improvements, though. I am pretty sure they are possible if we can get rid of these annoying register spills....
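The gist of the threading is roughly this (just a sketch of the host-side idea; solve_one_run() is a stand-in for the actual OpenCL enqueue/read-back calls, and the real build uses Windows threads rather than pthreads):
Code:
/* Host-side threading sketch only; the solver internals are not shown. */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 2            /* e.g. two worker threads feeding one GPU */

static void solve_one_run(int worker)
{
    /* Placeholder: the real code would enqueue the Equihash kernels on this
       worker's own command queue and read the solutions back. */
    printf("worker %d: one Equihash run\n", worker);
}

static void *worker_main(void *arg)
{
    int id = (int)(long)arg;
    for (int i = 0; i < 4; i++)  /* the real loop runs until the miner stops */
        solve_one_run(id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker_main, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}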

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
November 18, 2016, 08:20:13 PM
 #1253

Hey devs, I have been playing with eXtremal's latest kernel and trying to optimize kernel_sols() now as it seems to be one of the bottlenecks as far as I can tell with CodeXL with NUMVGPR being over 170 on RX 480. I was able to reduce it to 50 something, but I cannot get rid of scratchpad registers that are 512 bytes in size. They must have something to do with array indexing, but I was not able to pinpoint the exact portion of the code that is causing register spills. Do you guys have any ideas?

Never mind, I just found it. Let me see...

Code:
uint	values_tmp[(1 << PARAM_K)];
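One way to get rid of the spills would be to carve each work-item's slice out of __local instead of using a private array, something along these lines (just a sketch, not the actual kernel; with 512 uints per work-item, a local size of 8 keeps the group at 16 KB of LDS):
Code:
// Sketch only: replace the private values_tmp[] with a per-work-item slice of LDS
// so the compiler has nothing to spill to scratch.
#define PARAM_K 9
#define LOCAL_SIZE 8

__kernel __attribute__((reqd_work_group_size(LOCAL_SIZE, 1, 1)))
void kernel_sols_sketch(__global uint *out)
{
    __local uint values_tmp[LOCAL_SIZE * (1 << PARAM_K)];
    __local uint *my_vals = values_tmp + get_local_id(0) * (1 << PARAM_K);

    // ... fill my_vals[] exactly where the private values_tmp[] used to be filled ...
    my_vals[0] = (uint)get_global_id(0);   // placeholder work
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = my_vals[0];    // placeholder write-back
}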

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
reb0rn21
Legendary
*
Offline Offline

Activity: 1898
Merit: 1024


View Profile
November 18, 2016, 08:43:00 PM
 #1254

Claymore 7:
280X modded + OC: 180 sol/s, near 200 sol/s at 1200 MHz
390X: 240 sol/s
RX 4xx did not gain any speed

Looks like it's all about memory speed and some compute for the Blake algo, so the 280X needs 1200 MHz to make near 200 sol/s

bardacuda
Sr. Member
****
Offline Offline

Activity: 430
Merit: 254


View Profile
November 18, 2016, 08:45:50 PM
 #1255

Just when I was about to switch to ETC....

The future will rely on AI. SingularityNET lets anyone create, monetize, and use AI at scale. From the creators of Sophia the Robot.
zawawa
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
November 18, 2016, 09:07:45 PM
 #1256

I was able to remove the scratchpad registers from kernel_sols() by placing values_tmp[] in __local.
The speed gain was not as much as I had hoped for, though. I knew GCN's shared memory was rather slow compared to CUDA's...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
mrb (OP)
Legendary
*
Offline Offline

Activity: 1512
Merit: 1027


View Profile WWW
November 18, 2016, 10:06:01 PM
 #1257

Hey devs, I have been playing with eXtremal's latest kernel and trying to optimize kernel_sols() now as it seems to be one of the bottlenecks as far as I can tell with CodeXL with NUMVGPR being over 170 on RX 480. I was able to reduce it to 50 something, but I cannot get rid of scratchpad registers that are 512 bytes in size. They must have something to do with array indexing, but I was not able to pinpoint the exact portion of the code that is causing register spills. Do you guys have any ideas?

kernel_sols is not a bottleneck. It only takes 1.2-1.5 ms on the R9 Nano out of 17 ms of a full Equihash run.
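If you want to check the per-kernel split yourself, OpenCL event profiling is enough - something like this helper (a rough sketch; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the kernel's arguments are already set):
Code:
/* Sketch: time one enqueue of a kernel via event profiling. */
#include <CL/cl.h>
#include <stdio.h>

static double kernel_time_ms(cl_command_queue queue, cl_kernel kernel,
                             size_t global_size, size_t local_size)
{
    cl_event ev;
    cl_ulong t_start = 0, t_end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    clReleaseEvent(ev);

    return (double)(t_end - t_start) / 1.0e6;   /* nanoseconds -> milliseconds */
}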
laik2
Sr. Member
****
Offline Offline

Activity: 652
Merit: 266



View Profile WWW
November 18, 2016, 11:06:00 PM
 #1258

Quote
Well... erm... dunno how to say this, but not only is vdrop+.rom not edited in Heliox's style, it doesn't seem to be edited for voltage AT ALL - even in the ways that DON'T work. All the DPM states on vdrop+.rom point stock (into the voltage table, which has not been changed). The core clocks have been dropped pretty much across the board, though. Default memory clock was changed to 2080, but the rest of the memory states are untouched.

Heliox/Eliovp would (in a low-power ROM) have added a new VID for the initialization of the regulator in VoltageObjectInfo (changing the length of the table), as well as a value for it, which allows for global core undervolts that apply to every power state.
https://forum.ethereum.org/discussion/9650/sapphire-rx-480-nitro-oc-8gb-11260-01-20g-modded-bios-29-mh-downvolt
Here is where I got these BIOSes. I haven't touched a vBIOS in 8 years... the last time I modded a vBIOS was when I got a Radeon 9800 Pro for PC and flashed it with a Mac ROM to put it in my G4 :)
So basically my knowledge of this is pretty much minimal. Don't judge me too harshly... I just started mining (1/2 weeks ago) :)

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
zawawa
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
November 18, 2016, 11:11:20 PM
 #1259

Hey devs, I have been playing with eXtremal's latest kernel and trying to optimize kernel_sols() now as it seems to be one of the bottlenecks as far as I can tell with CodeXL with NUMVGPR being over 170 on RX 480. I was able to reduce it to 50 something, but I cannot get rid of scratchpad registers that are 512 bytes in size. They must have something to do with array indexing, but I was not able to pinpoint the exact portion of the code that is causing register spills. Do you guys have any ideas?

kernel_sols is not a bottleneck. It only takes 1.2-1.5 ms on the R9 Nano out of 17 ms of a full Equihash run.

You are absolutely right! I must tackle equihash_round()...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
laik2
Sr. Member
****
Offline Offline

Activity: 652
Merit: 266



View Profile WWW
November 18, 2016, 11:22:00 PM
 #1260

Quote
Well... erm... dunno how to say this, but not only is vdrop+.rom not edited in Heliox's style, it doesn't seem to be edited for voltage AT ALL - even in the ways that DON'T work. All the DPM states on vdrop+.rom point stock (into the voltage table, which has not been changed). The core clocks have been dropped pretty much across the board, though. Default memory clock was changed to 2080, but the rest of the memory states are untouched.

Heliox/Eliovp would (in a low-power ROM) have added a new VID for the initialization of the regulator in VoltageObjectInfo (changing the length of the table), as well as a value for it, which allows for global core undervolts that apply to every power state.
https://forum.ethereum.org/discussion/9650/sapphire-rx-480-nitro-oc-8gb-11260-01-20g-modded-bios-29-mh-downvolt
Here is where I got these BIOSes. I haven't touched a vBIOS in 8 years... the last time I modded a vBIOS was when I got a Radeon 9800 Pro for PC and flashed it with a Mac ROM to put it in my G4 :)
So basically my knowledge of this is pretty much minimal. Don't judge me too harshly... I just started mining (1/2 weeks ago) :)

Heh, don't worry about it, I won't. I'll just tell you what's there. Or not there. Looking it over now.
I really appreciate your opinion and help.
Thank you. PM your T address and I will put one VM core to mine on it 24/7.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/