Balitorium
Newbie
Offline
Activity: 23
Merit: 0
|
|
February 18, 2017, 01:55:21 PM |
|
Quote:
I could never understand how to compile / use this on Ubuntu 16.04. I know it might be a stupid question, but could someone kindly write a simple, full step-by-step how-to for building on Ubuntu 16.04? It would be very much appreciated.

I'm running my rigs on Ubuntu 16.04 and it's pretty straightforward with basic Linux know-how. I don't have the spare time to write up a full how-to now, but the basic steps, as far as I remember, are:
- download and install the latest AMD SDK
- download and install the latest AMDGPU-PRO driver
- download zawawa's miner from GitHub
- use "apt-get install" to satisfy the sgminer dependencies (see the readme)
- build from source according to the readme
If you run into trouble on the way, I'm pretty sure someone here will help you figure it out.
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
February 19, 2017, 05:35:26 PM |
|
So, while my wife is away on a ski trip for three days, I decided to give the GCN compiler one more shot and started debugging again. With a stripped-down version of SA's kernel, I was able to identify LDS access as the main cause of the incompatibility. Without LDS access, the compiler builds the kernel without a problem. Parallel programming is already hard, GPGPU is harder, and debugging a compiler for GPGPU is notoriously harder yet, but I'm getting closer...
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
m1n1ngP4d4w4n
Full Member
Offline
Activity: 224
Merit: 100
CryptoLearner
|
|
February 19, 2017, 07:00:29 PM |
|
Quote from zawawa:
So, while my wife is away on a ski trip for three days, I decided to give the GCN compiler one more shot and started debugging again. With a stripped-down version of SA's kernel, I was able to identify LDS access as the main cause of the incompatibility. Without LDS access, the compiler builds the kernel without a problem. Parallel programming is already hard, GPGPU is harder, and debugging a compiler for GPGPU is notoriously harder yet, but I'm getting closer...

Lol, I can see the wife coming home and finding a bearded zawawa in three-day-old clothes in front of the computer, with code in his eyes @_@ *brain dead*. Don't let it get to you, man, ahah.
|
|
|
|
cryptominer420
|
|
February 19, 2017, 07:02:20 PM |
|
@zawawa I sent $1 to your BTC address. It's not much, but right now I'm living off my BTC.
|
|
|
megacrypto
|
|
February 19, 2017, 08:09:38 PM |
|
Quote from Balitorium:
Quote:
I could never understand how to compile / use this on Ubuntu 16.04. I know it might be a stupid question, but could someone kindly write a simple, full step-by-step how-to for building on Ubuntu 16.04? It would be very much appreciated.
I'm running my rigs on Ubuntu 16.04 and it's pretty straightforward with basic Linux know-how. I don't have the spare time to write up a full how-to now, but the basic steps, as far as I remember, are:
- download and install the latest AMD SDK
- download and install the latest AMDGPU-PRO driver
- download zawawa's miner from GitHub
- use "apt-get install" to satisfy the sgminer dependencies (see the readme)
- build from source according to the readme
If you run into trouble on the way, I'm pretty sure someone here will help you figure it out.

I got through the first 3 steps fine (I'm actually using sgminer-gm right now). It's the last 2 steps that I can't seem to find my way around! It could be perfectly straightforward, but for some reason I just can't see it!
|
|
|
|
zawawa (OP)
|
|
February 20, 2017, 03:02:00 AM |
|
The new compiler can build SA's old kernel now. Let's try GG's kernel again...
|
|
|
zawawa (OP)
|
|
February 20, 2017, 08:28:46 AM |
|
Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.
|
|
|
nerdralph
|
|
February 20, 2017, 03:11:43 PM |
|
Quote from zawawa:
Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

You're deeper into LLVM than I've ever ventured. Is the issue that LLVM is not promoting because it is too conservative about the number of available registers? And is it conservative about registers because it is trying to generate code that supports more waves than is optimal? Since cache hit rates have a big impact on memory latency, and therefore on the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves. However, a compiler hint might be a solution (e.g. -fnum-waves=X).

P.S. Despite CLRX and a couple of other sources claiming that 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.
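To make the idea of a wave-count hint concrete, here is a rough Python sketch of the register budget such a hint would imply. It is only an illustration: the flag -fnum-waves=X above is a suggestion, not an existing option, and the constants below (256 VGPRs per SIMD lane, allocation in blocks of 4, at most 10 waves per SIMD) are assumptions based on commonly cited GCN figures rather than anything taken from the compiler.

```python
# Assumed GCN figures (not read from any compiler or driver):
VGPRS_PER_SIMD = 256      # vector registers available to each lane of a SIMD
VGPR_GRANULARITY = 4      # registers are assumed to be allocated in blocks of 4
MAX_WAVES_PER_SIMD = 10   # assumed hardware cap on resident waves per SIMD


def vgpr_budget(target_waves: int) -> int:
    """Largest VGPR count a kernel may use if `target_waves` waves must fit on one SIMD."""
    if not 1 <= target_waves <= MAX_WAVES_PER_SIMD:
        raise ValueError("target_waves must be between 1 and 10")
    share = VGPRS_PER_SIMD // target_waves
    # Round down to the allocation granularity.
    return share - share % VGPR_GRANULARITY


if __name__ == "__main__":
    for waves in range(1, MAX_WAVES_PER_SIMD + 1):
        print(f"{waves:2d} waves per SIMD -> at most {vgpr_budget(waves):3d} VGPRs")
```

Under these assumptions, asking for a single wave leaves the whole register file to the kernel, which is the case where promoting allocas to registers would be least constrained.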
|
|
|
|
theflow4321
Newbie
Offline
Activity: 1
Merit: 0
|
|
February 20, 2017, 10:44:09 PM |
|
But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design. The steps are related to the instruction/scalar processor, but a single ALU can address something like 12 registers per cycle within that Single Instruction Multiple Data design, if I'm not mistaken. Or something.

http://gpuopen.com/anatomy-total-war-engine-part-2/

"VGPRS
From the point of view of a single thread each of the vector registers hold a single value. 256 VGPRs can mean 64 float4, 128 float2, 256 float or any combination of these. For example, if we sample a texture, but only use its RGB components and not its alpha channel, it will take up 3 VGPRs. Let's do some math: we want to blend 8 terrain layers. Each layer has a diffuse, a normal and a spec/gloss texture. We use all 4 channels of the diffuse texture, 2 channels of the normal texture and 2 channels of the spec/gloss texture. That's 4+2+2 = 8 VGPRs per layer. Multiplied by 8 layers is 64 VGPRs. So we've already used up a quarter of all the available registers in the SIMD and we haven't even started to talk about other parts of the code, blend maps, height map, etc. Some registers can be reused, but as we'll see soon it's not as trivial as it seems."

"The number of used registers is important, because modern hardware runs multiple wavefronts at the same time. You can think about this as processing multiple pixels at the same time. This means that one of the main limiting factors on the number of pixels we can have in flight is the number of registers the shaders require. If a single SIMD in the hardware has 256 VGPRs and a shader is using 200 of them for example, the GPU can work only on one wavefront at a time. After the first wavefront is launched on a SIMD it leaves 56 registers unused, which is not enough to accommodate another wavefront running the same shader. If it's using 110 VGPRs, two wavefronts can run at the same time (112+112=224 and 32 registers remain unused). If the shader uses only 24 or less VGPRs, the hardware can run 10 wavefronts on a SIMD at the same time. 10 concurrent wavefronts is the current maximum for Fiji GPUs, such as the Radeon® Fury X. This limit is hard wired."
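The arithmetic in the quoted article can be reproduced with a few lines of Python. This is a sketch under stated assumptions: 256 VGPRs per SIMD and a cap of 10 wavefronts come from the quote, while the allocation granularity of 4 registers is inferred from its "112+112" example (110 rounded up to 112) and is not stated there explicitly. The function name is made up for illustration.

```python
def waves_per_simd(vgprs_used: int,
                   total_vgprs: int = 256,
                   granularity: int = 4,
                   max_waves: int = 10) -> int:
    """Estimate how many wavefronts of one shader fit on a SIMD, limited by VGPRs only."""
    if not 1 <= vgprs_used <= total_vgprs:
        raise ValueError("vgprs_used must be between 1 and total_vgprs")
    # Round the request up to the assumed allocation granularity (110 -> 112).
    allocated = -(-vgprs_used // granularity) * granularity
    return min(max_waves, total_vgprs // allocated)


if __name__ == "__main__":
    for used in (200, 110, 64, 24):
        print(f"{used:3d} VGPRs -> {waves_per_simd(used)} wavefront(s) per SIMD")
```

Other limits (SGPRs, LDS usage, work-group size) can reduce occupancy further; they are deliberately left out of this sketch.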
|
|
|
|
nerdralph
|
|
February 20, 2017, 11:17:57 PM |
|
Quote from theflow4321:
But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

No, each SIMD unit only processes 16 work-items at a time. After 4 clock cycles it is ready to process another 64 work-items (one wave). On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5. It is a bit complicated, and it seems many people don't take the time to fully understand it. Because of that, I've written a short blog post to explain it.
http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html

edit: Here's a quote from AMD's GCN whitepaper:
"Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations, with full speed denormals and all rounding modes. Each lane can natively execute a single precision fused or unfused multiply-add or a 24-bit integer operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle, but takes 4 cycles to execute operations for all 64 work items."
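The round-robin timing described above can be played through in a small Python model. It is a deliberately idealised sketch: it assumes each SIMD holds exactly one wave that always has an independent vector instruction ready, and it ignores memory latency, dependencies, and scalar work. The function names are invented for this illustration.

```python
LANES_PER_SIMD = 16   # work-items processed per cycle by one SIMD
WAVE_SIZE = 64        # work-items per wavefront
NUM_SIMDS = 4         # SIMD units per compute unit
CYCLES_PER_OP = WAVE_SIZE // LANES_PER_SIMD  # 4 cycles to run one instruction for a wave


def issue_target(cycle: int) -> int:
    """SIMD that the CU issues to on a given 1-based cycle (round-robin)."""
    return (cycle - 1) % NUM_SIMDS


def busy_cycles(simd: int, total_cycles: int) -> int:
    """Count cycles in 1..total_cycles during which `simd` is executing."""
    busy = set()
    for cycle in range(1, total_cycles + 1):
        if issue_target(cycle) == simd:
            # One issue keeps the 16 lanes busy for the next 4 cycles.
            for c in range(cycle, cycle + CYCLES_PER_OP):
                if c <= total_cycles:
                    busy.add(c)
    return len(busy)


if __name__ == "__main__":
    total = 16
    for simd in range(NUM_SIMDS):
        print(f"SIMD{simd}: busy {busy_cycles(simd, total)} of {total} cycles")
```

In this model every SIMD is busy on every cycle after its first issue slot, so one wave per SIMD is enough to keep the vector ALUs occupied as long as nothing stalls. Extra waves only matter once latency has to be hidden.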
|
|
|
|
zawawa (OP)
|
|
February 21, 2017, 05:12:51 AM |
|
I tried splitting row counters between GDS and the L2 cache with 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain of the splitting. Perhaps a better way is to alternate between GDS and the L2 cache for each round.
|
|
|
zawawa (OP)
|
|
February 21, 2017, 05:25:08 AM |
|
Quote from nerdralph:
Quote from zawawa:
Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.
You're deeper into LLVM than I've ever ventured. Is the issue that LLVM is not promoting because it is too conservative about the number of available registers? And is it conservative about registers because it is trying to generate code that supports more waves than is optimal? Since cache hit rates have a big impact on memory latency, and therefore on the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves. However, a compiler hint might be a solution (e.g. -fnum-waves=X). P.S. Despite CLRX and a couple of other sources claiming that 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.

I had to turn off alloca promotion even to run SA v5's original kernel. I suspect there is a bug in the routine that promotes alloca to LDS. AMD's conservative approach is pretty lazy IMO, but LLVM/Clang seems to allow for compiler hints in the form of attributes.
|
|
|
zawawa (OP)
|
|
February 21, 2017, 05:28:10 AM |
|
Quote from nerdralph:
Quote from theflow4321:
But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.
No, each SIMD unit only processes 16 work-items at a time. After 4 clock cycles it is ready to process another 64 work-items (one wave). On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5. It is a bit complicated, and it seems many people don't take the time to fully understand it. Because of that, I've written a short blog post to explain it.
http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html
edit: Here's a quote from AMD's GCN whitepaper: "Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations, with full speed denormals and all rounding modes. Each lane can natively execute a single precision fused or unfused multiply-add or a 24-bit integer operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle, but takes 4 cycles to execute operations for all 64 work items."

This is a really nice write-up. Thank you! We do need more documentation that is concise and accurate. Seriously.
|
|
|
zawawa (OP)
|
|
February 21, 2017, 10:01:40 AM |
|
It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp. This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:
    // We must read the size out of the dispatch pointer.
    assert(IsAMDGCN);
No wonder the compiler was not working reliably...
|
|
|
nerdralph
|
|
February 21, 2017, 04:04:08 PM |
|
Quote from zawawa:
I tried splitting row counters between GDS and the L2 cache with 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain of the splitting. Perhaps a better way is to alternate between GDS and the L2 cache for each round.

Too bad, although I think performance on the newer cards is more important anyway.
|
|
|
|
nerdralph
|
|
February 21, 2017, 04:17:32 PM |
|
Quote from zawawa:
It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp. This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:
    // We must read the size out of the dispatch pointer.
    assert(IsAMDGCN);
No wonder the compiler was not working reliably...

If the Equihash algorithm were simpler, I'd say just write it from scratch in asm. When you can write in asm, you'll rarely be pleased with compiler-generated code. If assemblers did register allocation, I'd probably write most of my performance-sensitive code in asm.
|
|
|
|
Kompik
|
|
February 21, 2017, 05:53:17 PM |
|
sp_ has reportedly achieved 530 sols on an overclocked 1070 mining Zcash with his miner. That is 15+% more than EWBF (Claymore?), which is amazing.
|
Bitrated user: Kompik.
|
|
|
zawawa (OP)
|
|
February 21, 2017, 06:04:08 PM |
|
Quote from Kompik:
sp_ has reportedly achieved 530 sols on an overclocked 1070 mining Zcash with his miner. That is 15+% more than EWBF (Claymore?), which is amazing.

Makes me wonder if I should rewrite the kernel in GCN assembly from scratch. I don't know about his personality, but he surely is pretty good at what he is doing.
|
|
|
Kompik
|
|
February 21, 2017, 09:26:33 PM |
|
Quote from zawawa:
Quote from Kompik:
sp_ has reportedly achieved 530 sols on an overclocked 1070 mining Zcash with his miner. That is 15+% more than EWBF (Claymore?), which is amazing.
Makes me wonder if I should rewrite the kernel in GCN assembly from scratch. I don't know about his personality, but he surely is pretty good at what he is doing.

It seems that some other people are making fun of him and posting different miners, claiming that they are SP_mod. I don't really get it all, but it seems that he has some miner that is faster than EWBF, though probably not 530 sols. Sorry for the mystification.
|
|
|
zawawa (OP)
|
|
February 21, 2017, 09:28:41 PM |
|
Quote from Kompik:
Quote from zawawa:
Quote from Kompik:
sp_ has reportedly achieved 530 sols on an overclocked 1070 mining Zcash with his miner. That is 15+% more than EWBF (Claymore?), which is amazing.
Makes me wonder if I should rewrite the kernel in GCN assembly from scratch. I don't know about his personality, but he surely is pretty good at what he is doing.
It seems that some other people are making fun of him and posting different miners, claiming that they are SP_mod. I don't really get it all, but it seems that he has some miner that is faster than EWBF, though probably not 530 sols. Sorry for the mystification.

Bwahahaha!
|
|
|
|