Bitcoin Forum
December 12, 2017, 08:58:34 PM *
Pages: « 1 [2] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 ... 86 »
Author Topic: Gateless Gate Sharp 1.1.4: zawawa's open-source dual ETH/XMR/PASC/LBC miner  (Read 163134 times)
Linit
Newbie
December 15, 2016, 03:06:13 PM
 #21

Windows 10 64 Bit.
Driver 16.6.
Gigabyte R9 390 G1

160 S/s
xeridea
Sr. Member
December 15, 2016, 04:57:25 PM
 #22

You should probably update the bottom of the readme...

"Author

Marc Bevand -- http://zorinaq.com"

I probably won't be using it: most of my cards are back on ETH for now, and CM is faster and has Remote Monitor. I like open-source projects, though. I am a developer too, but I can't contribute due to issues with my hands. I would like to tinker with OpenCL if I could. Good luck with the project!

Profitability over time charts for many GPUs - http://xeridea.us/charts

BTC:  16wzGLYLh1ximotu3Ln7htKnbUUcwWvQUv   LTC:  LdPvSJoAwgH87TXSMBuxDefBvp2bweXApY   Eth:  0xb508131ca5d983ebe72f8af61ecfb7d1b61f6d18
zawawa
Sr. Member
Miner Developer
December 15, 2016, 06:33:48 PM
 #23

Quote from: xeridea
You should probably update the bottom of the readme...

"Author

Marc Bevand -- http://zorinaq.com"

I probably won't be using it: most of my cards are back on ETH for now, and CM is faster and has Remote Monitor. I like open-source projects, though. I am a developer too, but I can't contribute due to issues with my hands. I would like to tinker with OpenCL if I could. Good luck with the project!

Thanks! After I rest a little, I will optimize the miner further. My ultimate goal would be to create a GUI-based, feature-rich, multi-algorithm miner.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
Linit
Newbie
December 15, 2016, 06:59:56 PM
 #24

Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.
zawawa
Sr. Member
Miner Developer
December 15, 2016, 07:04:10 PM
 #25

Quote from: Linit
Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.


Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Linit
Newbie
December 15, 2016, 07:16:18 PM
 #26

Quote from: Linit
Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.


Quote from: zawawa
Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Excellent...
laik2
Sr. Member
December 15, 2016, 07:25:50 PM
 #27

Quote from: Linit
Ubuntu 15.04 64 bit.
Driver fglrx 15.12.
Gigabyte R9 390 G1.

160 S/s.


Quote from: zawawa
Very nice! I would like to reach 200 sol/s without a GCN assembler.
We will see.

Without GCN asm, 390s should reach 300 S/s at most.
sgminer is a multi-algo miner, but its documentation is hell... by the time I find some useful values for a card, my beard looks like Santa Claus's.

ZEC: t1KbbHtXqzSS6qHBaPZDKyWnzxhRjr9oCtW
Vetal_inside
Member
December 15, 2016, 07:56:28 PM
 #28

R9 280X with a modded BIOS: 85 s/s with instances=1 and 90-95 s/s with instances=2 (not stable), the same as the original SA miner v5.
Win 8.1 x64, driver 15.12

Edit: with CM it shows 210-220 s/s, depending on memclock.
zawawa
Sr. Member
Miner Developer
December 15, 2016, 09:14:51 PM
 #29

Quote from: Vetal_inside
R9 280X with a modded BIOS: 85 s/s with instances=1 and 90-95 s/s with instances=2 (not stable), the same as the original SA miner v5.
Win 8.1 x64, driver 15.12

Edit: with CM it shows 210-220 s/s, depending on memclock.

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

krnlx
Full Member
December 15, 2016, 09:43:52 PM
 #30

Quote
Total 1094.3 sol/s [dev0 177.9, dev1 176.8, dev2 182.7, dev3 180.9, dev4 185.4, dev5 185.1] 36 shares
Total 1093.9 sol/s [dev0 177.6, dev1 177.4, dev2 181.9, dev3 180.4, dev4 184.8, dev5 185.2] 36 shares
Total 1094.0 sol/s [dev0 177.6, dev1 177.4, dev2 182.0, dev3 180.7, dev4 185.5, dev5 185.4] 38 shares
Total 1093.3 sol/s [dev0 177.5, dev1 176.6, dev2 182.2, dev3 179.8, dev4 186.6, dev5 184.7] 38 shares
Total 1092.8 sol/s [dev0 178.5, dev1 176.9, dev2 181.7, dev3 180.7, dev4 185.8, dev5 184.8] 38 shares
Total 1093.1 sol/s [dev0 177.7, dev1 177.1, dev2 181.4, dev3 180.4, dev4 186.1, dev5 184.0] 40 shares
Total 1093.2 sol/s [dev0 177.1, dev1 177.8, dev2 182.2, dev3 179.9, dev4 186.3, dev5 182.7] 40 shares
Total 1093.5 sol/s [dev0 176.8, dev1 178.0, dev2 182.0, dev3 180.2, dev4 186.5, dev5 182.8] 40 shares

6x GTX 1070 with a little tuning

BTC 1DGhgVtTzJqxFvM9yrL8kFBGZdf8Zq6bEr
Vetal_inside
Member
December 15, 2016, 09:45:45 PM
 #31

Quote from: zawawa
The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

It is a memory timings patch. I'm not sure it can be the reason for such a low sol rate.
But in the next few days I will try installing the latest Crimson drivers and reflashing the stock BIOS. We will see what changes.
krnlx
Full Member
December 15, 2016, 09:53:09 PM
 #32

@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14.
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG = 12...


Code:
#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

laik2
Sr. Member
December 15, 2016, 10:21:40 PM
 #33

Quote from: krnlx
@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14.
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG = 12...


Code:
#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

A proper CUDA implementation is required for NV cards to go over 300 S/s. There are already closed-source CUDA miners from nicehash and EWBF doing ~300 S/s on a 1070. I am waiting for my 1070s to arrive so I can test some CUDA tweaks.

zawawa
Sr. Member
Miner Developer
December 15, 2016, 11:01:35 PM
 #34

Quote from: krnlx
@zawawa

Nvidia cards run faster with NR_ROWS_LOG = 14.
Can you check my settings for NR_ROWS_LOG = 14? Is everything correct?

I think it will be faster with NR_ROWS_LOG = 12...


Code:
#define NR_ROWS_LOG            14
#define NR_SLOTS               240
#define LOCAL_WORK_SIZE        512
#define THREADS_PER_ROW        512
#define LOCAL_WORK_SIZE_SOLS   256
#define THREADS_PER_ROW_SOLS   256
#define GLOBAL_WORK_SIZE_RATIO 512
#define SLOT_CACHE_SIZE        (NR_SLOTS * (LOCAL_WORK_SIZE/THREADS_PER_ROW) * 75 / 100)
#define LDS_COLL_SIZE          (NR_SLOTS * (LOCAL_WORK_SIZE / THREADS_PER_ROW) * 240 / 100)

What speed are you getting on which card? I'm very curious. You could lower NR_SLOTS by 10 or 20, I think. You can uncomment "#define ENABLE_DEBUG", rebuild the app, and run sa-solver.exe to see how many slots drop out at each round. Too many dropped slots would hurt performance. Adding NR_ROWS_LOG=12 itself is trivial, but there may not be enough shared memory.
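
As a rough illustration of what lowering NR_SLOTS by 10 or 20 does to the two derived sizes, here is a small host-side C sketch. It only evaluates the SLOT_CACHE_SIZE and LDS_COLL_SIZE formulas from the settings quoted above; the function names are made up for this example, and it says nothing about how many bytes each entry occupies in shared memory or how many slots actually drop out at each round.

```c
#include <assert.h>

/* Hypothetical helper mirroring the SLOT_CACHE_SIZE macro from the quoted settings. */
static unsigned slot_cache_size(unsigned nr_slots, unsigned local_work_size,
                                unsigned threads_per_row)
{
    /* Same integer arithmetic as the macro, so the divisions truncate. */
    assert(threads_per_row != 0 && local_work_size % threads_per_row == 0);
    return nr_slots * (local_work_size / threads_per_row) * 75 / 100;
}

/* Hypothetical helper mirroring the LDS_COLL_SIZE macro from the quoted settings. */
static unsigned lds_coll_size(unsigned nr_slots, unsigned local_work_size,
                              unsigned threads_per_row)
{
    assert(threads_per_row != 0 && local_work_size % threads_per_row == 0);
    return nr_slots * (local_work_size / threads_per_row) * 240 / 100;
}
```

With LOCAL_WORK_SIZE = THREADS_PER_ROW = 512, calling these with 240, 230, and 220 slots shows how both sizes shrink in proportion to NR_SLOTS, which is the quantity to weigh against the dropped-slot counts reported by the ENABLE_DEBUG build.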

zawawa
Sr. Member
Miner Developer
December 15, 2016, 11:05:43 PM
 #35

Quote from: zawawa
The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

Quote from: Vetal_inside
It is a memory timings patch. I'm not sure it can be the reason for such a low sol rate.
But in the next few days I will try installing the latest Crimson drivers and reflashing the stock BIOS. We will see what changes.

Performance does suffer if memory timings are too tight.
In the meantime, I will test the miner with my trusty 7990s...

nerdralph
Sr. Member
December 15, 2016, 11:31:30 PM
 #36

Not bad, zawawa. You still have room to improve ht_store.
Code:
p = slot.ui8
Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.
Code:
p = slot.ui4[0]
Which you use after round 5, should take only one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then a write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have an n-way operation where n threads write 32/n bytes each. That will take just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
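
To make the n-way idea concrete, here is a plain C simulation of it, not an OpenCL kernel and not the miner's actual ht_store. The names nway_store_lane and nway_store are hypothetical. Each "lane" stands in for one GPU thread and copies its own 32/n-byte piece of a 32-byte slot; on real hardware the lanes would be adjacent work-items issuing their stores in the same instruction, which is what would let the 32 bytes travel together. The sketch only shows the addressing, not any performance effect.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_BYTES 32  /* slot size discussed in the post */

/* Lane `lane` of an n-lane group copies only its own 32/n-byte piece. */
static void nway_store_lane(uint8_t *dst, const uint8_t *src,
                            unsigned n, unsigned lane)
{
    assert(n != 0 && SLOT_BYTES % n == 0 && lane < n);
    unsigned chunk = SLOT_BYTES / n;
    memcpy(dst + lane * chunk, src + lane * chunk, chunk);
}

/* Host-side stand-in for n GPU threads performing the store side by side. */
static void nway_store(uint8_t *dst, const uint8_t *src, unsigned n)
{
    for (unsigned lane = 0; lane < n; lane++)
        nway_store_lane(dst, src, n, lane);
}
```

With n = 4, each lane handles 8 bytes, and the four pieces together cover the whole slot exactly once, so no read-modify-write of a partial 32-byte line is needed.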
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips.  With NR_SLOTS even, the first write to a given row will always be to an even memory chip.  With more slots per row this becomes less significant because the rows don't fill up equally.  Using an odd number for NR_SLOTS may also reduce channel conflicts.
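
The parity argument can be checked with a toy model in C. The assumptions here are not from the miner's source: rows are laid out back to back, every slot is 32 bytes, row r starts at slot index r * NR_SLOTS, and consecutive 32-byte slots alternate between the even and the odd memory chip. The function names are hypothetical.

```c
#include <assert.h>

/* Chip (0 = even, 1 = odd) that receives the first slot of `row` in the toy model. */
static unsigned first_slot_chip(unsigned row, unsigned nr_slots)
{
    assert(nr_slots > 0);
    unsigned long long first_slot_index = (unsigned long long)row * nr_slots;
    return (unsigned)(first_slot_index & 1ULL);
}

/* Number of rows in [0, nr_rows) whose first slot lands on the odd chip. */
static unsigned rows_starting_on_odd_chip(unsigned nr_rows, unsigned nr_slots)
{
    unsigned count = 0;
    for (unsigned row = 0; row < nr_rows; row++)
        count += first_slot_chip(row, nr_slots);
    return count;
}
```

Under these assumptions, an even NR_SLOTS puts the first write of every row on the even chip, while an odd NR_SLOTS splits the first writes evenly between the two chips. Whether that translates into a measurable speedup depends on the real memory layout and channel mapping.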


zawawa
Sr. Member
Miner Developer
December 15, 2016, 11:46:42 PM
 #37

Quote from: nerdralph
Not bad, zawawa. You still have room to improve ht_store.
Code:
p = slot.ui8
Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.
Code:
p = slot.ui4[0]
Which you use after round 5, should take only one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then a write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have an n-way operation where n threads write 32/n bytes each. That will take just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips.  With NR_SLOTS even, the first write to a given row will always be to an even memory chip.  With more slots per row this becomes less significant because the rows don't fill up equally.  Using an odd number for NR_SLOTS may also reduce channel conflicts.




Very interesting suggestions. Let me see...

zawawa
Sr. Member
Miner Developer
December 16, 2016, 03:10:03 AM
 #38

Quote from: nerdralph
Not bad, zawawa. You still have room to improve ht_store.
Code:
p = slot.ui8
Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.
Code:
p = slot.ui4[0]
Which you use after round 5, should take only one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then a write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have an n-way operation where n threads write 32/n bytes each. That will take just one core cycle to transfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips.  With NR_SLOTS even, the first write to a given row will always be to an even memory chip.  With more slots per row this becomes less significant because the rows don't fill up equally.  Using an odd number for NR_SLOTS may also reduce channel conflicts.




I tried 4-way writes with mixed results. The 4-way write version was actually slower overall than the single-thread-write version, but it seems to speed up the last few rounds. That makes sense, as these rounds are more memory-intensive. I will explore this approach further.

bigchirv
Newbie
December 16, 2016, 03:31:09 AM
 #39

Thanks for publishing your repo! Appreciated.

I'm not a C programmer (or an OpenCL one, for that matter), but I'm a fan of DRY. When I was reading input.cl I found the get_row() function, and I think we can make it a little bit DRYer by doing something like this:

Code:
uint get_row(uint round, uint xi0)
{
  uint           row;
  uint           swp;
  uint           num;
  // swp = NR_ROWS_LOG - 14: extra row-index bits beyond the 14-bit case
#if NR_ROWS_LOG == 14
  swp = 0;
#elif NR_ROWS_LOG == 15
  swp = 1;
#elif NR_ROWS_LOG == 16
  swp = 2;
#else
#error "unsupported NR_ROWS_LOG"
#endif
  // upper part of the row mask: 0x3f, 0x7f or 0xff (hex 0x40, not decimal 40)
  num = (0x40 << swp) - 1;
  if (!(round % 2))
    row = (xi0 & (num << 8 | 0xff));
  else
    row = ((xi0 & (num << 16 | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24);
  return row;
}

So, what do you think, @zawawa?

I don't know if this can be useful at all, but if you like it I can make a PR so you can merge the changes later.
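
A quick host-side sanity check for the factored version, written as a sketch in plain C. It assumes the base constant is read as hex 0x40 and passes NR_ROWS_LOG as a run-time argument purely so all three sizes can be tested from one build; get_row_generic is a hypothetical name, and the original per-size code from input.cl is not reproduced in this thread, so the check only confirms that the result fits in NR_ROWS_LOG bits.

```c
#include <assert.h>
#include <stdint.h>

/* Factored row computation, with NR_ROWS_LOG supplied at run time for testing. */
static uint32_t get_row_generic(uint32_t nr_rows_log, uint32_t round, uint32_t xi0)
{
    assert(nr_rows_log >= 14 && nr_rows_log <= 16);
    uint32_t swp = nr_rows_log - 14;        /* 0, 1 or 2 extra index bits */
    uint32_t num = (0x40u << swp) - 1;      /* 0x3f, 0x7f or 0xff */
    if (!(round % 2))
        return xi0 & ((num << 8) | 0xffu);
    return ((xi0 & ((num << 16) | 0xf00u)) >> 8) | ((xi0 & 0xf0000000u) >> 24);
}

/* The largest row index the function can produce for a given round parity. */
static uint32_t max_row(uint32_t nr_rows_log, uint32_t round)
{
    return get_row_generic(nr_rows_log, round, 0xffffffffu);
}
```

For each supported size, max_row should equal 2^NR_ROWS_LOG - 1 on both even and odd rounds, meaning every row of the table is reachable and none is out of range.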

Bitrated user: bigchirv.
zawawa
Sr. Member
Miner Developer
December 16, 2016, 03:52:25 AM
 #40

Quote from: bigchirv
Thanks for publishing your repo! Appreciated.

I'm not a C programmer (or an OpenCL one, for that matter), but I'm a fan of DRY. When I was reading input.cl I found the get_row() function, and I think we can make it a little bit DRYer by doing something like this:

Code:
uint get_row(uint round, uint xi0)
{
  uint           row;
  uint           swp;
  uint           num;
  // swp = NR_ROWS_LOG - 14: extra row-index bits beyond the 14-bit case
#if NR_ROWS_LOG == 14
  swp = 0;
#elif NR_ROWS_LOG == 15
  swp = 1;
#elif NR_ROWS_LOG == 16
  swp = 2;
#else
#error "unsupported NR_ROWS_LOG"
#endif
  // upper part of the row mask: 0x3f, 0x7f or 0xff (hex 0x40, not decimal 40)
  num = (0x40 << swp) - 1;
  if (!(round % 2))
    row = (xi0 & (num << 8 | 0xff));
  else
    row = ((xi0 & (num << 16 | 0xf00)) >> 8) | ((xi0 & 0xf0000000) >> 24);
  return row;
}

So, what do you think, @zawawa?

I don't know if this can be useful at all, but if you like it I can make a PR so you can merge the changes later.

I appreciate your enthusiasm and willingness to help, but I will keep the current code. With GPGPU, and especially with AMD OpenCL drivers, repeats are often better because you can keep register usage low that way, which is crucially important. My general approach toward GPGPU is that I sacrifice everything for performance, including readability.
