Bitcoin Forum
December 14, 2017, 07:01:38 AM *
Author Topic: Gateless Gate Sharp 1.1.4: zawawa's open-source dual ETH/XMR/PASC/LBC miner  (Read 163992 times)
woodaxe
Member
Activity: 104
February 18, 2017, 01:14:03 AM
 #661

Just FYI, guys: running any of the SGminer forks with CryptoNote on Win10 may trigger the 100% fan bug (the fans occasionally jump to 100%). However, it seems that one specific driver version, 16.10.4, fixes this problem.

Will that work with a modded BIOS?

Indeed it does. There was no BIOS signature check in versions up to 16.12 (from 16.9.1?).
(You can also bypass it: https://www.techpowerup.com/228536/amd-bios-signature-check-re-enabled-with-relive-locks-out-polaris-bios-modders)

Thanks. I cannot find a 16.10.4 on AMD's site.

Fuck, I'm an idiot, it's 16.11.4. Sorry about that.

No problem, I've found the right ones (16.11.4). I'll install them later today.

What a nightmare. I used DDU to clear out the old drivers and installed 16.11.4, but it didn't install AMD Settings or WattMan, and when I looked in the device drivers it showed only VGA drivers. I tried for well over an hour to sort it out and even went back to my 16.9.2, but each time there was no AMD Settings or WattMan. In the end I gave up and used the auto-updater, which installed 17.something. That is working OK, but it means I cannot use a modded BIOS. Does anyone have any idea where I've gone wrong?
Balitorium
Newbie
Activity: 23
February 18, 2017, 01:55:21 PM
 #662

I could never understand how to compile/use this on Ubuntu 16.04. I know it might be a stupid question, but could someone kindly write a simple, complete step-by-step how-to for building on Ubuntu 16.04? It would be very much appreciated :)

I'm running my rigs on Ubuntu 16.04 and it's pretty straightforward with basic Linux know-how. I don't have the spare time to write up a full how-to now, but the basic steps, as far as I remember, are:

- download and install the latest AMD SDK
- download and install the latest AMDGPU-PRO driver
- download zawawa's miner from GitHub
- use "apt-get install" to satisfy the sgminer dependencies (see the readme)
- build from source according to the readme

If you run into trouble along the way, I'm pretty sure someone here will help you figure it out ;)

zawawa
Sr. Member
Activity: 420
Miner Developer
February 19, 2017, 05:35:26 PM
 #663

So, while my wife is away on a ski trip for three days, I decided to give the GCN compiler one more shot and started debugging again. With a stripped-down version of SA's kernel, I was able to identify LDS access as the main cause of the incompatibility. Without LDS access, the compiler builds the kernel without a problem. Parallel programming is already hard, GPGPU is harder, and debugging a compiler for GPGPU is notoriously harder still, but I'm getting closer...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
m1n1ngP4d4w4n
Full Member
Activity: 154
CryptoLearner
February 19, 2017, 07:00:29 PM
 #664

So, while my wife is away on a ski trip for three days, I decided to give the GCN compiler one more shot and started debugging again. With a stripped-down version of SA's kernel, I was able to identify LDS access as the main cause of the incompatibility. Without LDS access, the compiler builds the kernel without a problem. Parallel programming is already hard, GPGPU is harder, and debugging a compiler for GPGPU is notoriously harder still, but I'm getting closer...

Lol, I can see the wife coming home and finding a bearded zawawa in three-day-old clothes in front of the computer, with code in his eyes @_@ *brain dead*.

Don't let it get to you, man, haha :D

BTC - 1B1RBYkzxiTmrbnFe2vj8EaNPSYftW8186 for tips Wink
cryptominer420
Full Member
Activity: 185
February 19, 2017, 07:02:20 PM
 #665

@zawawa
Sent $1 to your BTC address. It's not much, but right now I'm living off my BTC.

BTC: 1Eeb9SoBeY7AQjjFn7YMJZMY7Jtw5gxxHs  ETH: 0x68e4EA3b7e60C8D6fC9BA92775ccE27Ca542D114
megacrypto
Full Member
Activity: 239
February 19, 2017, 08:09:38 PM
 #666

I could never understand how to compile/use this on Ubuntu 16.04. I know it might be a stupid question, but could someone kindly write a simple, complete step-by-step how-to for building on Ubuntu 16.04? It would be very much appreciated :)

I'm running my rigs on Ubuntu 16.04 and it's pretty straightforward with basic Linux know-how. I don't have the spare time to write up a full how-to now, but the basic steps, as far as I remember, are:

- download and install the latest AMD SDK
- download and install the latest AMDGPU-PRO driver
- download zawawa's miner from GitHub
- use "apt-get install" to satisfy the sgminer dependencies (see the readme)
- build from source according to the readme

If you run into trouble along the way, I'm pretty sure someone here will help you figure it out ;)

I got through the first 3 steps fine (I'm actually using sgminer-gm right now). It's the last 2 steps I can't seem to find my way around! It could be perfectly straightforward, but for some reason I just can't see it! :)

BTC: 1Feqs22qa8hAUC13YJh9z3bBe4FSYsa5nn                    ZEC: t1eYeHJKV6Ku9VzadpS8p1LDBeYXqQtRjvw
ETH: 0x00BEa51b34482d76fC91BA6865Ab92A4A438Cf90       ETC: 0x565EC4035645C4d9AEa5AB58fdF25E0Ea43e0b86
zawawa
Sr. Member
Activity: 420
Miner Developer
February 20, 2017, 03:02:00 AM
 #667

The new compiler can build SA's old kernel now.
Let's try GG's kernel again...

zawawa
Sr. Member
Activity: 420
Miner Developer
February 20, 2017, 08:28:46 AM
 #668

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

nerdralph
Sr. Member
Activity: 406
February 20, 2017, 03:11:43 PM
 #669

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

You're deeper into LLVM than I've ever ventured.  Is the issue that LLVM is not promoting because it is too conservative about the number of available registers?  And it is conservative about registers because it is trying to generate code to support more waves than is optimal?
Since cache hit rates have a big impact on memory latency and therefore the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves.  However a compiler hint might be a solution (i.e. -fnum-waves=X).

p.s. despite CLRX and a couple other sources claiming 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.
theflow4321
Newbie
Activity: 1
February 20, 2017, 10:44:09 PM
 #670

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

The steps are related to the instruction/scalar processor, but a single ALU can address something like 12 registers per cycle, if I'm not mistaken, within that Single Instruction Multiple Data design. Or something.



http://gpuopen.com/anatomy-total-war-engine-part-2/

"VGPRS
From the point of view of a single thread each of the vector registers hold a single value. 256 VGPRs can mean 64 float4, 128 float2, 256 float or any combination of these. For example, if we sample a texture, but only use its RGB components and not its alpha channel, it will take up 3 VGPRs. Let’s do some math: we want to blend 8 terrain layers. Each layer has a diffuse, a normal and a spec/gloss texture. We use all 4 channels of the diffuse texture, 2 channels of the normal texture and 2 channels of the spec/gloss texture. That’s 4+2+2 = 8 VGPRs per layer. Multiplied by 8 layers is 64 VGPRs. So we’ve already used up a quarter of all the available registers in the SIMD and we haven’t even started to talk about other parts of the code, blend maps, height map, etc. Some registers can be reused, but as we’ll see soon it’s not as trivial as it seems."

"The number of used registers is important, because modern hardware runs multiple wavefronts at the same time. You can think about this as processing multiple pixels at the same time. This means that one of the main limiting factors on the number of pixels we can have in flight is the number of registers the shaders require. If a single SIMD in the hardware has 256 VGPRs and a shader is using 200 of them for example, the GPU can work only on one wavefront at a time. After the first wavefront is launched on a SIMD it leaves 56 registers unused, which is not enough to accommodate another wavefront running the same shader. If it’s using 110 VGPRs, two wavefronts can run at the same time (112+112=224 and 32 registers remain unused). If the shader uses only 24 or less VGPRs, the hardware can run 10 wavefronts on a SIMD at the same time. 10 concurrent wavefronts is the current maximum for Fiji GPUs, such as the Radeon® Fury X. This limit is hard wired."
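
The occupancy numbers in the quoted article can be checked with a few lines of Python. This is only a minimal sketch: the 256-VGPR budget and the cap of 10 waves come from the quote, while the allocation granularity of 4 registers is an assumption inferred from its 110 -> 112 example, and the names (`waves_per_simd` and the constants) are made up for illustration.

```python
# Constants taken from the quoted gpuopen article.
VGPRS_PER_SIMD = 256       # VGPR budget visible to a single thread
MAX_WAVES_PER_SIMD = 10    # hard-wired limit mentioned for Fiji

# Assumption: registers are allocated in blocks of 4, which is what the
# article's "110 VGPRs -> 112+112" example suggests.
ALLOC_GRANULARITY = 4


def waves_per_simd(vgprs_used: int) -> int:
    """Estimate how many wavefronts fit on one SIMD for a given VGPR count."""
    if vgprs_used <= 0:
        raise ValueError("vgprs_used must be positive")
    # Round the request up to the next multiple of the allocation granularity.
    allocated = -(-vgprs_used // ALLOC_GRANULARITY) * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)


if __name__ == "__main__":
    for used in (200, 110, 24):
        print(f"{used} VGPRs -> {waves_per_simd(used)} wave(s) per SIMD")
```

The three cases reproduce the figures in the quote: 200 VGPRs leaves room for one wavefront, 110 for two, and 24 for the full ten.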

nerdralph
Sr. Member
Activity: 406
February 20, 2017, 11:17:57 PM
 #671

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

No, each SIMD unit only processes 16 work-items at a time.  After 4 clock cycles it is ready to process another 64 work-items (one wave).  On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5.  It is a bit complicated, and it seems many people don't take the time to fully understand it.  Because of that, I've written a short blog post to explain it.

http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html

edit: Here's a quote from AMD's GCN whitepaper:
Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations,
with full speed denormals and all rounding modes. Each lane can natively executes a single precision fused or unfused multiply-add or a 24-bit integer
operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle,
but takes 4 cycles to execute operations for all 64 work items
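
The round-robin issue order described above can be sketched in Python. This is a toy model, not a cycle-accurate simulator: the 4 SIMDs, 16 lanes, and 64 work-items per wave come from the post and the whitepaper quote, while the function names and the idea that a wave always has an independent vector instruction ready (no memory stalls) are illustrative assumptions.

```python
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
WAVE_SIZE = 64
# A 64-item wave on a 16-lane SIMD needs 4 cycles per vector operation.
CYCLES_PER_WAVE_OP = WAVE_SIZE // LANES_PER_SIMD


def simd_for_cycle(cycle: int) -> int:
    """Index of the SIMD the CU issues to on a 1-based cycle number."""
    return (cycle - 1) % SIMDS_PER_CU


def issue_schedule(num_cycles: int) -> list:
    """List of (cycle, simd) pairs for the first num_cycles cycles."""
    return [(c, simd_for_cycle(c)) for c in range(1, num_cycles + 1)]


def busy_cycles(simd: int, num_cycles: int) -> set:
    """Cycles during which `simd` is executing.

    Assumption: every time its turn comes, the SIMD is handed one vector
    operation from a wave that is ready to run.
    """
    busy = set()
    for cycle, target in issue_schedule(num_cycles):
        if target == simd:
            busy.update(range(cycle, cycle + CYCLES_PER_WAVE_OP))
    return busy


if __name__ == "__main__":
    print(issue_schedule(5))
    print(sorted(busy_cycles(0, 8)))
```

In this idealised model SIMD0 is issued work on cycles 1 and 5, and each operation occupies it for 4 cycles, so it never sits idle. Real kernels stall on memory, which is where additional waves help.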
zawawa
Sr. Member
Activity: 420
Miner Developer
February 21, 2017, 05:12:51 AM
 #672

I tried splitting the row counters between GDS and the L2 cache on a 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain from the split. Perhaps a better way is to alternate between GDS and the L2 cache for each round.

zawawa
Sr. Member
Activity: 420
Miner Developer
February 21, 2017, 05:25:08 AM
 #673

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

You're deeper into LLVM than I've ever ventured.  Is the issue that LLVM is not promoting because it is too conservative about the number of available registers?  And it is conservative about registers because it is trying to generate code to support more waves than is optimal?
Since cache hit rates have a big impact on memory latency and therefore the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves.  However a compiler hint might be a solution (i.e. -fnum-waves=X).

p.s. despite CLRX and a couple other sources claiming 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.


I had to turn off alloca promotion even to run SA v5's original kernel. I suspect there is a bug in the routine that promotes alloca to LDS. AMD's conservative approach is pretty lazy IMO, but LLVM/Clang seems to allow for compiler hints in the form of attributes.

zawawa
Sr. Member
Activity: 420
Miner Developer
February 21, 2017, 05:28:10 AM
 #674

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

No, each SIMD unit only processes 16 work-items at a time.  After 4 clock cycles it is ready to process another 64 work-items (one wave).  On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5.  It is a bit complicated, and it seems many people don't take the time to fully understand it.  Because of that, I've written a short blog post to explain it.

http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html

edit: Here's a quote from AMD's GCN whitepaper:
Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations,
with full speed denormals and all rounding modes. Each lane can natively executes a single precision fused or unfused multiply-add or a 24-bit integer
operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle,
but takes 4 cycles to execute operations for all 64 work items


This is a really nice write-up. Thank you! We do need more documentation that is concise and accurate. Seriously.

zawawa
Sr. Member
Activity: 420
Miner Developer
February 21, 2017, 10:01:40 AM
 #675

It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp. This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:

Code:
  // We must read the size out of the dispatch pointer.
  assert(IsAMDGCN);

No wonder the compiler was not working reliably...

nerdralph
Sr. Member
Activity: 406
February 21, 2017, 04:04:08 PM
 #676

I tried splitting the row counters between GDS and the L2 cache on a 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain from the split. Perhaps a better way is to alternate between GDS and the L2 cache for each round.

Too bad, although I think performance on the newer cards is more important anyway.
nerdralph
Sr. Member
Activity: 406
February 21, 2017, 04:17:32 PM
 #677

It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp. This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:

Code:
  // We must read the size out of the dispatch pointer.
  assert(IsAMDGCN);

No wonder the compiler was not working reliably...

If the equihash algorithm were simpler, I'd say just write it from scratch in asm.  When you can write in asm, you'll rarely be pleased with compiler-generated code.  If assemblers did register allocation, I'd probably write most of my performance-sensitive code in asm.
Kompik
Sr. Member
Activity: 389
February 21, 2017, 05:53:17 PM
 #678

sp_ has reportedly achieved 530 Sol/s on an overclocked 1070 mining Zcash with his miner. That is 15+% more than EWBF (Claymore?), which is amazing.
zawawa
Sr. Member
Activity: 420
Miner Developer
February 21, 2017, 06:04:08 PM
 #679

sp_ has reportedly achieved 530 Sol/s on an overclocked 1070 mining Zcash with his miner. That is 15+% more than EWBF (Claymore?), which is amazing.

Makes me wonder if I should rewrite the kernel in the GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he is doing.

Kompik
Sr. Member
Activity: 389
February 21, 2017, 09:26:33 PM
 #680

sp_ has reportedly achieved 530 Sol/s on an overclocked 1070 mining Zcash with his miner. That is 15+% more than EWBF (Claymore?), which is amazing.

Makes me wonder if I should rewrite the kernel in the GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he is doing.

It seems that some other people are making fun of him and posting different miners, claiming that they are SP_mod. I don't really get it all, but it seems that he has a miner that is faster than EWBF, though probably not 530 Sol/s. Sorry for the confusion :)