Bitcoin Forum
April 26, 2024, 12:36:49 PM *
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 8 (15.1%)
It would be nice. - 12 (22.6%)
It's not worth it anymore. - 33 (62.3%)
Total Voters: 53

Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 214337 times)
Balitorium
Newbie
*
Offline Offline

Activity: 23
Merit: 0


View Profile
February 18, 2017, 01:55:21 PM
 #661

I could never understand how to compile or use this on Ubuntu 16.04. I know it might be a stupid question, but could someone kindly write a simple, full step-by-step how-to for building on Ubuntu 16.04? It would be very much appreciated :)

I'm running my rigs on Ubuntu 16.04 and it's pretty straightforward with basic Linux know-how. I don't have the spare time to write up a full how-to now, but the basic steps, as far as I remember, are:

- download and install the latest AMD SDK
- download and install the latest AMDGPU-PRO driver
- download zawawa's miner from GitHub
- use "apt-get install" to install the sgminer dependencies (see readme)
- build from source according to the readme

If you run into trouble along the way, I'm pretty sure someone here will help you figure it out ;)

zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 19, 2017, 05:35:26 PM
 #662

So, while my wife is away on a ski trip for three days, I decided to give the GCN compiler one more shot and started debugging again. With a stripped-down version of SA's kernel, I was able to identify LDS access as the main cause of the incompatibility. Without LDS access, the compiler builds the kernel without a problem. Parallel programming is already hard, GPGPU is harder, and debugging a compiler for GPGPU is notoriously harder yet, but I'm getting closer...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
m1n1ngP4d4w4n
Full Member
***
Offline Offline

Activity: 224
Merit: 100

CryptoLearner


View Profile
February 19, 2017, 07:00:29 PM
 #663

So, while my wife is away on a ski trip for three days, I decided to give the GCN compiler one more shot and started debugging again. With a stripped-down version of SA's kernel, I was able to identify LDS access as the main cause of the incompatibility. Without LDS access, the compiler builds the kernel without a problem. Parallel programming is already hard, GPGPU is harder, and debugging a compiler for GPGPU is notoriously harder yet, but I'm getting closer...

Lol, I can see the wife coming home and finding a bearded zawawa in three-day-old clothes in front of the computer with code in his eyes @_@ * brain dead *.

Don't let it get to you, man, haha :D
cryptominer420
Sr. Member
****
Offline Offline

Activity: 450
Merit: 255


View Profile
February 19, 2017, 07:02:20 PM
 #664

@zawawa
Sent $1 to your BTC address. It's not much, but right now I'm living off my BTC.

megacrypto
Sr. Member
****
Offline Offline

Activity: 291
Merit: 250



View Profile
February 19, 2017, 08:09:38 PM
 #665

I could never understand how to compile or use this on Ubuntu 16.04. I know it might be a stupid question, but could someone kindly write a simple, full step-by-step how-to for building on Ubuntu 16.04? It would be very much appreciated :)

I'm running my rigs on Ubuntu 16.04 and it's pretty straightforward with basic Linux know-how. I don't have the spare time to write up a full how-to now, but the basic steps, as far as I remember, are:

- download and install the latest AMD SDK
- download and install the latest AMDGPU-PRO driver
- download zawawa's miner from GitHub
- use "apt-get install" to install the sgminer dependencies (see readme)
- build from source according to the readme

If you run into trouble along the way, I'm pretty sure someone here will help you figure it out ;)

I got the first 3 steps done fine (actually using sgminer-gm right now); it's the last 2 steps I can't seem to find my way around! It could be straightforward, but for some reason I just can't seem to see it! :)

zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 20, 2017, 03:02:00 AM
 #666

The new compiler can build SA's old kernel now.
Let's try GG's kernel again...

zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 20, 2017, 08:28:46 AM
 #667

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
February 20, 2017, 03:11:43 PM
 #668

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

You're deeper into LLVM than I've ever ventured.  Is the issue that LLVM is not promoting because it is too conservative about the number of available registers?  And it is conservative about registers because it is trying to generate code to support more waves than is optimal?
Since cache hit rates have a big impact on memory latency and therefore the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves.  However, a compiler hint might be a solution (e.g. -fnum-waves=X).

p.s. despite CLRX and a couple other sources claiming 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.
theflow4321
Newbie
*
Offline Offline

Activity: 1
Merit: 0


View Profile
February 20, 2017, 10:44:09 PM
 #669

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

The steps are related to the instruction/scalar processor, but a single ALU can address something like 12 registers per cycle, if I'm not mistaken, within that Single Instruction Multiple Data design.  Or something.



http://gpuopen.com/anatomy-total-war-engine-part-2/

"VGPRS
From the point of view of a single thread each of the vector registers hold a single value. 256 VGPRs can mean 64 float4, 128 float2, 256 float or any combination of these. For example, if we sample a texture, but only use its RGB components and not its alpha channel, it will take up 3 VGPRs. Let’s do some math: we want to blend 8 terrain layers. Each layer has a diffuse, a normal and a spec/gloss texture. We use all 4 channels of the diffuse texture, 2 channels of the normal texture and 2 channels of the spec/gloss texture. That’s 4+2+2 = 8 VGPRs per layer. Multiplied by 8 layers is 64 VGPRs. So we’ve already used up a quarter of all the available registers in the SIMD and we haven’t even started to talk about other parts of the code, blend maps, height map, etc. Some registers can be reused, but as we’ll see soon it’s not as trivial as it seems."

"The number of used registers is important, because modern hardware runs multiple wavefronts at the same time. You can think about this as processing multiple pixels at the same time. This means that one of the main limiting factors on the number of pixels we can have in flight is the number of registers the shaders require. If a single SIMD in the hardware has 256 VGPRs and a shader is using 200 of them for example, the GPU can work only on one wavefront at a time. After the first wavefront is launched on a SIMD it leaves 56 registers unused, which is not enough to accommodate another wavefront running the same shader. If it’s using 110 VGPRs, two wavefronts can run at the same time (112+112=224 and 32 registers remain unused). If the shader uses only 24 or less VGPRs, the hardware can run 10 wavefronts on a SIMD at the same time. 10 concurrent wavefronts is the current maximum for Fiji GPUs, such as the Radeon® Fury X. This limit is hard wired."
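The arithmetic in the quoted passage can be sketched in a few lines of Python. This is only a rough model built from the numbers in the quote (256 VGPRs per SIMD, at most 10 waves). The allocation granularity of 4 is an assumption inferred from the "112+112" example, and `waves_per_simd` is a made-up helper name, not anything from AMD's tools.

```python
def waves_per_simd(vgprs_per_thread, total_vgprs=256, granularity=4, max_waves=10):
    """Estimate how many wavefronts fit on one SIMD for a given VGPR count.

    total_vgprs and max_waves come from the quoted article; granularity=4
    is an assumption inferred from its "112+112" example.
    """
    if vgprs_per_thread <= 0:
        raise ValueError("vgprs_per_thread must be positive")
    # Round the request up to the next multiple of the allocation granularity.
    allocated = -(-vgprs_per_thread // granularity) * granularity
    # Register-limited wave count, capped by the hard-wired maximum.
    return min(max_waves, total_vgprs // allocated)


if __name__ == "__main__":
    for vgprs in (200, 110, 64, 24):
        print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "wave(s) per SIMD")
```

Under these assumptions the three cases from the quote (200, 110 and 24 VGPRs) come out as 1, 2 and 10 waves. LDS and SGPR usage can limit occupancy as well, and this sketch ignores both.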

nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
February 20, 2017, 11:17:57 PM
 #670

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

No, each SIMD unit only processes 16 work-items at a time.  After 4 clock cycles it is ready to process another 64 work-items (one wave).  On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5.  It is a bit complicated, and it seems many people don't take the time to fully understand it.  Because of that, I've written a short blog post to explain it.

http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html

edit: Here's a quote from AMD's GCN whitepaper:
Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations, with full speed denormals and all rounding modes. Each lane can natively execute a single precision fused or unfused multiply-add or a 24-bit integer operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle, but takes 4 cycles to execute operations for all 64 work items.
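The issue pattern described above can be played through with a toy Python model. It is a deliberately idealized sketch: it assumes each SIMD always has one wave with an independent vector instruction ready, and it ignores memory latency, instruction dependencies and scalar work. The constants come from the post and the whitepaper quote; the function names are made up for illustration.

```python
NUM_SIMDS = 4     # SIMD units per compute unit
LANES = 16        # vector lanes per SIMD
WAVE_SIZE = 64    # work-items per wavefront
CYCLES_PER_WAVE_OP = WAVE_SIZE // LANES  # 4 cycles per vector instruction


def issue_schedule(num_cycles):
    """SIMD index the CU issues to on each cycle (cycles are numbered from 1)."""
    return [(cycle - 1) % NUM_SIMDS for cycle in range(1, num_cycles + 1)]


def busy_cycles(simd, num_cycles):
    """Cycles in [1, num_cycles] during which `simd` is executing.

    Assumes a single wave that always has a vector instruction ready
    whenever the SIMD's issue slot comes around.
    """
    busy = 0
    busy_until = 0  # last cycle covered by the instruction in flight
    for cycle in range(1, num_cycles + 1):
        if (cycle - 1) % NUM_SIMDS == simd and cycle > busy_until:
            busy_until = cycle + CYCLES_PER_WAVE_OP - 1
        if cycle <= busy_until:
            busy += 1
    return busy


if __name__ == "__main__":
    print(issue_schedule(5))
    for simd in range(NUM_SIMDS):
        print("SIMD", simd, "busy for", busy_cycles(simd, 16), "of 16 cycles")
```

In this model a SIMD is busy on every cycle from its first issue slot onward with just one wave, because the next issue slot arrives exactly when the previous instruction finishes. Whether real kernels get there depends on the things the model leaves out.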
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 21, 2017, 05:12:51 AM
 #671

I tried splitting row counters between GDS and the L2 cache on a 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain from the split. Perhaps a better way is to alternate between GDS and the L2 cache for each round.

zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 21, 2017, 05:25:08 AM
 #672

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

You're deeper into LLVM than I've ever ventured.  Is the issue that LLVM is not promoting because it is too conservative about the number of available registers?  And it is conservative about registers because it is trying to generate code to support more waves than is optimal?
Since cache hit rates have a big impact on memory latency and therefore the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves.  However, a compiler hint might be a solution (e.g. -fnum-waves=X).

p.s. despite CLRX and a couple other sources claiming 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.


I had to turn off alloca promotion even to run SA v5's original kernel. I suspect there is a bug in the routine that promotes alloca to LDS. AMD's conservative approach is pretty lazy IMO, but LLVM/Clang seems to allow for compiler hints in the form of attributes.

zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 21, 2017, 05:28:10 AM
 #673

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

No, each SIMD unit only processes 16 work-items at a time.  After 4 clock cycles it is ready to process another 64 work-items (one wave).  On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5.  It is a bit complicated, and it seems many people don't take the time to fully understand it.  Because of that, I've written a short blog post to explain it.

http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html

edit: Here's a quote from AMD's GCN whitepaper:
Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations, with full speed denormals and all rounding modes. Each lane can natively execute a single precision fused or unfused multiply-add or a 24-bit integer operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle, but takes 4 cycles to execute operations for all 64 work items.


This is a really nice write-up. Thank you! We do need more documentation that is concise and accurate. Seriously.

zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 21, 2017, 10:01:40 AM
 #674

It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp.  This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:

Code:
  // We must read the size out of the dispatch pointer.
  assert(IsAMDGCN);

No wonder the compiler was not working reliably...

nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
February 21, 2017, 04:04:08 PM
 #675

I tried splitting row counters between GDS and the L2 cache on a 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain from the split. Perhaps a better way is to alternate between GDS and the L2 cache for each round.

Too bad, although I think performance on the newer cards is more important anyway.
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
February 21, 2017, 04:17:32 PM
 #676

It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp.  This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:

Code:
  // We must read the size out of the dispatch pointer.
  assert(IsAMDGCN);

No wonder the compiler was not working reliably...

If the Equihash algorithm were simpler, I'd say just write it from scratch in asm.  When you can write in asm, you'll rarely be pleased with compiler-generated code.  If assemblers did register allocation, I'd probably write most of my performance-sensitive code in asm.
Kompik
Sr. Member
****
Offline Offline

Activity: 463
Merit: 250


View Profile
February 21, 2017, 05:53:17 PM
 #677

sp_ has reportedly achieved 530 sol/s on an overclocked 1070 mining Zcash with his miner. That's 15+% more than EWBF (Claymore?), which is amazing.

Bitrated user: Kompik.
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 21, 2017, 06:04:08 PM
 #678

sp_ has reportedly achieved 530 sol/s on an overclocked 1070 mining Zcash with his miner. That's 15+% more than EWBF (Claymore?), which is amazing.

Makes me wonder if I should rewrite the kernel in GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he does.

Kompik
Sr. Member
****
Offline Offline

Activity: 463
Merit: 250


View Profile
February 21, 2017, 09:26:33 PM
 #679

sp_ has reportedly achieved 530 sol/s on an overclocked 1070 mining Zcash with his miner. That's 15+% more than EWBF (Claymore?), which is amazing.

Makes me wonder if I should rewrite the kernel in GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he does.

It seems that some other people are making fun of him and posting different miners, claiming they are SP_mod. I don't really get it all, but it seems that he has some miner that is faster than EWBF, though probably not 530 sol/s. Sorry for the confusion :)

zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
February 21, 2017, 09:28:41 PM
 #680

sp_ has reportedly achieved 530 sol/s on an overclocked 1070 mining Zcash with his miner. That's 15+% more than EWBF (Claymore?), which is amazing.

Makes me wonder if I should rewrite the kernel in GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he does.

It seems that some other people are making fun of him and posting different miners, claiming they are SP_mod. I don't really get it all, but it seems that he has some miner that is faster than EWBF, though probably not 530 sol/s. Sorry for the confusion :)

Bwahahaha!
