Bitcoin Forum
April 20, 2024, 03:27:16 AM *
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 8 (15.1%)
It would be nice. - 12 (22.6%)
It's not worth it anymore. - 33 (62.3%)
Total Voters: 53

Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 214332 times)
Linit
Newbie
Activity: 13
Merit: 0
December 15, 2016, 03:06:13 PM
Last edit: December 15, 2016, 07:00:37 PM by Linit
 #21

Windows 10 64 Bit.
Driver 16.6.
Gigabyte R9 390 G1

160 S/s
xeridea
Sr. Member
Activity: 449
Merit: 251
December 15, 2016, 04:57:25 PM
 #22

You should probably update the bottom of the readme...

"Author

Marc Bevand -- http://zorinaq.com"

I probably won't be using it: most of my cards are back on Eth for now, and CM is faster and has a remote monitor. I like open source projects though. I am a developer too, but I can't contribute due to issues with my hands. I would like to tinker with OpenCL if I could. Good luck with the project!

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  bc1qr2xwjwfmjn43zhrlp6pn7vwdjrjnv5z0anhjhn LTC:  LXDm6sR4dkyqtEWfUbPumMnVEiUFQvxSbZ Eth:  0x44cCe2cf90C8FEE4C9e4338Ae7049913D4F6fC24
zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 15, 2016, 06:33:48 PM
 #23

You should probably update the bottom of the readme...

"Author

Marc Bevand -- http://zorinaq.com"

I probably won't be using it: most of my cards are back on Eth for now, and CM is faster and has a remote monitor. I like open source projects though. I am a developer too, but I can't contribute due to issues with my hands. I would like to tinker with OpenCL if I could. Good luck with the project!

Thanks! After I rest a little, I will optimize the miner further. My ultimate goal would be to create a GUI-based, feature-rich, multi-algorithm miner.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
Linit
Newbie
Activity: 13
Merit: 0
December 15, 2016, 06:59:56 PM
 #24

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.
zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 15, 2016, 07:04:10 PM
 #25

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.


Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
Linit
Newbie
Activity: 13
Merit: 0
December 15, 2016, 07:16:18 PM
 #26

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.


Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Excellent...
laik2
Sr. Member
Activity: 652
Merit: 266
December 15, 2016, 07:25:50 PM
 #27

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.


Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Without GCN asm, 390s should reach 300 S/s at most.
sgminer is a multi-algo miner, but its documentation is hell... by the time I find some useful values for a card, my beard looks like Santa Claus's.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
Vetal_inside
Member
Activity: 78
Merit: 10
December 15, 2016, 07:56:28 PM
 #28

R9 280X with a modded BIOS: 85 S/s with instances=1 and 90-95 S/s with instances=2 (not stable), the same as the original SA miner v5.
Win8.1, x64, drivers 15.12

Addendum: with CM it shows 210-220 S/s, depending on the memory clock.
zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 15, 2016, 09:14:51 PM
 #29

R9 280X with a modded BIOS: 85 S/s with instances=1 and 90-95 S/s with instances=2 (not stable), the same as the original SA miner v5.
Win8.1, x64, drivers 15.12

Addendum: with CM it shows 210-220 S/s, depending on the memory clock.

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
krnlx
Full Member
Activity: 243
Merit: 105
December 15, 2016, 09:43:52 PM
 #30

Quote
Total 1094.3 sol/s [dev0 177.9, dev1 176.8, dev2 182.7, dev3 180.9, dev4 185.4, dev5 185.1] 36 shares
Total 1093.9 sol/s [dev0 177.6, dev1 177.4, dev2 181.9, dev3 180.4, dev4 184.8, dev5 185.2] 36 shares
Total 1094.0 sol/s [dev0 177.6, dev1 177.4, dev2 182.0, dev3 180.7, dev4 185.5, dev5 185.4] 38 shares
Total 1093.3 sol/s [dev0 177.5, dev1 176.6, dev2 182.2, dev3 179.8, dev4 186.6, dev5 184.7] 38 shares
Total 1092.8 sol/s [dev0 178.5, dev1 176.9, dev2 181.7, dev3 180.7, dev4 185.8, dev5 184.8] 38 shares
Total 1093.1 sol/s [dev0 177.7, dev1 177.1, dev2 181.4, dev3 180.4, dev4 186.1, dev5 184.0] 40 shares
Total 1093.2 sol/s [dev0 177.1, dev1 177.8, dev2 182.2, dev3 179.9, dev4 186.3, dev5 182.7] 40 shares
Total 1093.5 sol/s [dev0 176.8, dev1 178.0, dev2 182.0, dev3 180.2, dev4 186.5, dev5 182.8] 40 shares

6x 1070 with a little tuning.
Vetal_inside
Member
Activity: 78
Merit: 10
December 15, 2016, 09:45:45 PM
 #31

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...
This is a memory timings patch. I am not sure it can be the reason for such a low sol rate.
But in the next few days I will try installing the latest Crimson drivers and reflashing the stock BIOS. We will see what changes.
krnlx
Full Member
Activity: 243
Merit: 105
December 15, 2016, 09:53:09 PM
 #32

@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14.
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG=12...


Code:
#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)
laik2
Sr. Member
Activity: 652
Merit: 266
December 15, 2016, 10:21:40 PM
 #33

@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14.
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG=12...


Code:
#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

A proper CUDA implementation is required for NV cards to exceed 300 S/s. The closed-source nicehash and EWBF CUDA miners already do ~300 S/s on a 1070. I am waiting on my 1070s to arrive so I can test some CUDA tweaks.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 15, 2016, 11:01:35 PM
 #34

@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14.
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG=12...


Code:
#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

What speed are you getting on which card? I'm very curious. You could lower NR_SLOTS by 10 or 20, I think. You can uncomment "#define ENABLE_DEBUG", rebuild the app, and run sa-solver.exe to see how many slots drop out at each round. Too many dropped slots would hurt performance. Adding NR_ROWS_LOG=12 itself is trivial, but there may not be enough shared memory.
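
For reference, here is a rough sketch of the numbers behind that trade-off. It only evaluates the macros quoted above. The helper names and the figure of about 2^21 entries per round (typical for Equihash with n=200, k=9) are assumptions made for illustration, not values taken from the miner source.

```c
#include <stdio.h>

/* Values copied from the settings quoted above. */
#define NR_SLOTS         240
#define LOCAL_WORK_SIZE  512
#define THREADS_PER_ROW  512
#define SLOT_CACHE_SIZE  (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE    (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

/* Assumption: roughly 2^21 entries survive each round (Equihash 200,9). */
#define ENTRIES_PER_ROUND_LOG 21

/* Hypothetical helper: number of hash-table rows for a given NR_ROWS_LOG. */
static unsigned nr_rows(unsigned rows_log)
{
    return 1u << rows_log;
}

/* Hypothetical helper: average number of entries that land in one row. */
static unsigned avg_slots_per_row(unsigned rows_log)
{
    return (1u << ENTRIES_PER_ROUND_LOG) >> rows_log;
}

/* Print the row count, the average row load and the two derived sizes. */
static void print_budget(unsigned rows_log)
{
    printf("NR_ROWS_LOG=%u: rows=%u, avg slots/row=%u, NR_SLOTS=%d\n",
           rows_log, nr_rows(rows_log), avg_slots_per_row(rows_log), NR_SLOTS);
    printf("SLOT_CACHE_SIZE=%d entries, LDS_COLL_SIZE=%d entries\n",
           SLOT_CACHE_SIZE, LDS_COLL_SIZE);
}
```

Under these assumptions, NR_ROWS_LOG=14 gives an average of 128 entries per row, so NR_SLOTS=240 leaves some headroom to trim. With NR_ROWS_LOG=12 the average rises to 512, so NR_SLOTS and both sizes derived from it would have to grow well beyond the values above, which is where shared memory could run out.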

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 15, 2016, 11:05:43 PM
 #35

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...
This is a memory timings patch. I am not sure it can be the reason for such a low sol rate.
But in the next few days I will try installing the latest Crimson drivers and reflashing the stock BIOS. We will see what changes.

Performance does suffer if memory timings are too tight.
In the meantime, I will test the miner with my trusty  7990's...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
Activity: 588
Merit: 251
December 15, 2016, 11:31:30 PM
 #36

Not bad zawawa.  You still have room to improve ht_store.
Code:
p = slot.ui8
Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.
Code:
p = slot.ui4[0]
Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then a write back.  This will waste a lot of GDDR cycles due to the bus turnaround delay.  The solution is to have an n-way operation where n threads write 32/n bytes each.  That will take just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips.  With NR_SLOTS even, the first write to a given row will always be to an even memory chip.  With more slots per row this becomes less significant because the rows don't fill up equally.  Using an odd number for NR_SLOTS may also reduce channel conflicts.
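
A minimal host-side sketch of both ideas, in plain C rather than OpenCL. The function names are hypothetical, the lanes are simulated with a loop, and the parity model (rows stored back to back from offset 0, with consecutive 32-byte blocks alternating between the two chips) is an assumption made only to illustrate the argument.

```c
#include <stdint.h>
#include <string.h>

#define SLOT_BYTES 32u  /* slot size used in the post above */

/* Simulate n lanes cooperatively storing one 32-byte slot.
 * Lane i copies bytes [i * 32/n, (i + 1) * 32/n); n must divide 32.
 * On a GPU the lanes would run in parallel; here they run in a loop. */
static void nway_store(uint8_t *dst, const uint8_t *src, unsigned n)
{
    unsigned chunk = SLOT_BYTES / n;
    for (unsigned lane = 0; lane < n; lane++)
        memcpy(dst + lane * chunk, src + lane * chunk, chunk);
}

/* Count the rows whose first slot falls on an odd 32-byte block.
 * Row r starts at block index r * nr_slots under the assumed layout. */
static unsigned rows_starting_odd(unsigned nr_rows, unsigned nr_slots)
{
    unsigned odd = 0;
    for (unsigned row = 0; row < nr_rows; row++)
        odd += (row * nr_slots) & 1u;
    return odd;
}
```

Under that model, rows_starting_odd(16384, 240) returns 0, so every row begins on an even block, while rows_starting_odd(16384, 239) returns 8192, an even split between the two parities.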


zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 15, 2016, 11:46:42 PM
 #37

Not bad zawawa.  You still have room to improve ht_store.
Code:
p = slot.ui8
Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.
Code:
p = slot.ui4[0]
Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then a write back.  This will waste a lot of GDDR cycles due to the bus turnaround delay.  The solution is to have an n-way operation where n threads write 32/n bytes each.  That will take just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips.  With NR_SLOTS even, the first write to a given row will always be to an even memory chip.  With more slots per row this becomes less significant because the rows don't fill up equally.  Using an odd number for NR_SLOTS may also reduce channel conflicts.




Very interesting suggestions. Let me see...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 16, 2016, 03:10:03 AM
 #38

Not bad zawawa.  You still have room to improve ht_store.
Code:
p = slot.ui8
Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.
Code:
p = slot.ui4[0]
Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then a write back.  This will waste a lot of GDDR cycles due to the bus turnaround delay.  The solution is to have an n-way operation where n threads write 32/n bytes each.  That will take just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips.  With NR_SLOTS even, the first write to a given row will always be to an even memory chip.  With more slots per row this becomes less significant because the rows don't fill up equally.  Using an odd number for NR_SLOTS may also reduce channel conflicts.




I tried 4-way writes with mixed results. The 4-way write version was actually slower than the single-thread-write version, but the former seems to speed up the last few rounds. It makes sense as these rounds are more memory-intensive. I will explore this approach further.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
bigchirv
Newbie
Activity: 19
Merit: 0
December 16, 2016, 03:31:09 AM
 #39

Thanks for publishing your repo! Appreciated.

I'm not a C programmer (or an OpenCL one, for that matter), but I'm a fan of DRY, so when I was reading input.cl I found the get_row() function, and I think we can make it a little DRYer by doing something like this:

Code:
uint get_row(uint round, uint xi0)
{
  uint           row;
  uint           swp;
  uint           num;
#if NR_ROWS_LOG == 14
  swp = 0;
#elif NR_ROWS_LOG == 15
  swp = 1;
#elif NR_ROWS_LOG == 16
  swp = 2;
#else
#error "unsupported NR_ROWS_LOG"
#endif
  num = (0x40 << swp) - 1;  /* 0x3f, 0x7f or 0xff, so each mask below has NR_ROWS_LOG bits */
  if (!(round % 2))
    row = xi0 & (num << 8 | 0xff);
  else
    row = ((xi0 & (num << 16 | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24);
  return row;
}

So, what do you think, @zawawa?

I don't know if this can be useful at all, but if you like it I can make a PR so you can merge the changes later.
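
As a quick sanity check of the masks, here is a hypothetical run-time variant of the proposal that takes the row count as a parameter, so all three settings can be exercised in one program. It is a sketch and not the code in input.cl. It assumes the seed constant is hexadecimal 0x40, since only that value yields a contiguous NR_ROWS_LOG-bit mask. With decimal 40, the even-round mask for NR_ROWS_LOG == 14 would come out as 0x27ff instead of 0x3fff.

```c
#include <stdint.h>

/* Hypothetical helper: the bit field that sits above the low byte.
 * rows_log must be 14, 15 or 16, as in the proposal. */
static uint32_t row_field(unsigned rows_log)
{
    return (0x40u << (rows_log - 14u)) - 1u;  /* 0x3f, 0x7f, 0xff */
}

/* Run-time version of the proposed get_row(); rows_log replaces the
 * compile-time NR_ROWS_LOG so that the three cases can be compared. */
static uint32_t get_row_rt(unsigned rows_log, uint32_t round, uint32_t xi0)
{
    uint32_t num = row_field(rows_log);
    if (!(round % 2))
        return xi0 & (num << 8 | 0xffu);
    return ((xi0 & (num << 16 | 0xf00u)) >> 8) | ((xi0 & 0xf0000000u) >> 24);
}
```

Feeding xi0 = 0xffffffff shows the widest row index each setting can produce: 0x3fff, 0x7fff and 0xffff for 14, 15 and 16, on both even and odd rounds.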
zawawa (OP)
Sr. Member
Activity: 728
Merit: 304
Miner Developer
December 16, 2016, 03:52:25 AM
 #40

Thanks for publishing your repo! Appreciated.

I'm not a C programmer (or an OpenCL one, for that matter), but I'm a fan of DRY, so when I was reading input.cl I found the get_row() function, and I think we can make it a little DRYer by doing something like this:

Code:
uint get_row(uint round, uint xi0)
{
  uint           row;
  uint           swp;
  uint           num;
#if NR_ROWS_LOG == 14
  swp = 0;
#elif NR_ROWS_LOG == 15
  swp = 1;
#elif NR_ROWS_LOG == 16
  swp = 2;
#else
#error "unsupported NR_ROWS_LOG"
#endif
  num = (0x40 << swp) - 1;  /* 0x3f, 0x7f or 0xff, so each mask below has NR_ROWS_LOG bits */
  if (!(round % 2))
    row = xi0 & (num << 8 | 0xff);
  else
    row = ((xi0 & (num << 16 | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24);
  return row;
}

So, what do you think, @zawawa?

I don't know if this can be useful at all, but if you like it I can make a PR so you can merge the changes later.

I appreciate your enthusiasm and willingness to help, but I will keep the current code. With GPGPU, and especially with AMD OpenCL drivers, repeats are often better because you can keep register usage low that way, which is crucially important. My general approach toward GPGPU is that I sacrifice everything for performance, including readability.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ