zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
December 18, 2016, 09:00:32 PM |
|
Quote:
I think I figured out how to coalesce global memory reads. (Memory writes cannot be coalesced because the destination of each slot is not predictable.) It would have been impossible with the original design of SA, but it should be possible with GG because it loads slots differently. If everything works out, there should be a massive speedup, hehehe...

Quote:
Great news! Does that include CUDA also?

Definitely for NVIDIA, and potentially for AMD. I already have some positive results, but I still need to reduce the overhead and bring NR_ROWS_LOG down to 13. More work, more work...
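For readers unfamiliar with the term, coalescing roughly means that the loads issued by the threads of one warp fall into as few memory segments as possible. Below is a rough sketch in plain C (not the miner's code) that counts how many 128-byte segments a 32-thread warp touches for a given stride. The warp size, the segment size, and the function name are assumptions picked for illustration, and real hardware has more rules than this.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE     32u   /* assumed number of threads per warp */
#define SEGMENT_BYTES 128u  /* assumed size of one memory transaction */

/* Hypothetical helper: each lane reads one 4-byte word at lane * stride_bytes
 * (base address 0). Returns how many distinct segments the warp touches. */
static unsigned segments_touched(uint32_t stride_bytes)
{
    assert(stride_bytes % 4u == 0u);  /* keeps every word inside one segment */

    unsigned count = 0;
    uint64_t prev = UINT64_MAX;       /* no segment seen yet */
    for (uint32_t lane = 0; lane < WARP_SIZE; lane++) {
        uint64_t segment = ((uint64_t)lane * stride_bytes) / SEGMENT_BYTES;
        if (segment != prev) {        /* addresses only grow, so this counts distinct segments */
            count++;
            prev = segment;
        }
    }
    return count;
}
```

Under these assumptions, segments_touched(4) is 1 (neighboring threads read neighboring words) and segments_touched(128) is 32 (every thread lands in its own segment).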
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
|
kilo17
Legendary
Offline
Activity: 980
Merit: 1001
aka "whocares"
|
|
December 19, 2016, 05:58:45 AM |
|
Not sure if you can help out on this or not. I am trying out the miner on Ubuntu 16.10 with the 4.9 kernel and open-source drivers, etc. I changed the OpenCL location in the makefile but had similar results to Eliovp:

echo 'const char *ocl_code = R"_mrb_(' >_kernel.h
cpp input.cl >>_kernel.h
echo ')_mrb_";' >>_kernel.h
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include" -c -o main.o main.c
main.c: In function ‘examine_ht’:
main.c:534:26: warning: unused parameter ‘round’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                          ^~~~~
main.c:534:50: warning: unused parameter ‘queue’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                  ^~~~~
main.c:534:65: warning: unused parameter ‘hash_table_buffers’ [-Wunused-parameter]
 void examine_ht(unsigned round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                                 ^~~~~~~~~~~~~~~~~~
main.c:534:92: warning: unused parameter ‘row_counters_buffer’ [-Wunused-parameter]
 d round, cl_command_queue queue, cl_mem *hash_table_buffers, cl_mem row_counters_buffer)
                                                                     ^~~~~~~~~~~~~~~~~~~
main.c: In function ‘store_encoded_sol’:
main.c:640:25: warning: left shift of negative value [-Wshift-negative-value]
 uint32_t mask = ~(-1 << (8 - x_bits_used));
                      ^~
main.c: In function ‘solve_equihash’:
main.c:958:57: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 uint32_t solve_equihash(cl_device_id dev_id, cl_context ctx, cl_command_queue queue,
                                                         ^~~
main.c: In function ‘mining_mode’:
main.c:1408:18: warning: unused variable ‘status’ [-Wunused-variable]
 cl_int status;
        ^~~~~~
main.c:1393:50: warning: unused parameter ‘program’ [-Wunused-parameter]
 void mining_mode(cl_device_id dev_id, cl_program program, cl_context ctx, cl_command_queue queue,
                                                  ^~~~~~~
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include" -c -o blake.o blake.c
blake.c:26:25: warning: ‘blake2b_block_len’ defined but not used [-Wunused-const-variable=]
 static const uint32_t blake2b_block_len = 128;
                       ^~~~~~~~~~~~~~~~~
gcc -O2 -std=gnu99 -pedantic -Wextra -Wall -Wno-deprecated-declarations -Wno-overlength-strings -I"/opt/AMDAPPSDK-3.0/include" -c -o sha256.o sha256.c
gcc -o sa-solver main.o blake.o sha256.o -rdynamic -L"/usr/lib/x86_64-linux-gnu" -lOpenCL

and then:

kilo17@kilo-GT7:~/gatelessgate-master$ ./gatelessgate.py -c stratum+tcp://us1-zcash.flypool.org:3333 -u t1cVviFvgJinQ4w3C2m2CfRxgP5DnHYaoFC
Gateless Gate, a Zcash miner
Copyright 2016 zawawa @ bitcointalk.org
Connecting to us1-zcash.flypool.org:3333
Stratum server sent us the first job
Mining on 1 device
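A side note on the build log above: most of the warnings are only about unused parameters and variables, but -Wshift-negative-value flags ~(-1 << (8 - x_bits_used)), which left-shifts a negative int, something ISO C leaves undefined. Here is a minimal sketch of the same kind of mask built from an unsigned value instead. The name low_bits_mask is made up for illustration and is not something in main.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns a mask with the low nbits bits set.
 * Shifting an unsigned value avoids left-shifting the negative int -1. */
static uint32_t low_bits_mask(unsigned nbits)
{
    assert(nbits < 32u);  /* a shift by 32 or more would itself be undefined */
    return ~(~UINT32_C(0) << nbits);
}
```

If something like this were used in store_encoded_sol, the call would presumably look like low_bits_mask(8 - x_bits_used).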
|
Bitcoin Will Only Succeed If The Community That Supports It Gets Support - Support Home Miners & Mining
|
|
|
zawawa (OP)
|
|
December 19, 2016, 05:34:37 PM |
|
Quote from: kilo17 on December 19, 2016, 05:58:45 AM
Not sure if you can help out on this or not. I am trying out the miner on Ubuntu 16.10 with 4.9 kernel and open source drivers and etc. I changed the opencl location in the make file but had similar results to Eliovp: [...]

You can run sa-solver to see what's actually going on.
|
|
|
|
zawawa (OP)
|
|
December 19, 2016, 06:06:29 PM |
|
I just noticed three things:
(1) With NR_ROWS_LOG=14, Rounds 1 through 8 are actually much faster.
(2) However, kernel_sols() becomes a bottleneck because there are just too many slots in one row. This problem can be easily solved by using a different sorting algorithm for kernel_sols().
(3) Currently, NR_ROWS_LOG<14 is not possible because there is not enough space in shared memory. This problem can be partially solved by making NR_ROWS_LOG variable across rounds. Since less space in shared memory is required for caching slots at later rounds, NR_ROWS_LOG can be decreased accordingly.
This is it. I'm catching up with Claymore's and Eqminer.
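To put rough numbers on the NR_ROWS_LOG trade-off, here is a small sketch. It assumes the Zcash Equihash parameters (n=200, k=9), where each round holds about 2^21 entries on average; that figure and the function name are assumptions for illustration and are not taken from the GG source. Real rows need some headroom above the average, so treat these as ballpark values.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES_LOG 21u  /* assumption: about 2^21 entries per round (Equihash 200,9) */

/* Hypothetical helper: average number of slots per row for a given NR_ROWS_LOG. */
static uint32_t avg_slots_per_row(unsigned nr_rows_log)
{
    assert(nr_rows_log <= ENTRIES_LOG);
    return UINT32_C(1) << (ENTRIES_LOG - nr_rows_log);
}
```

Under that assumption, NR_ROWS_LOG=14 gives about 128 slots per row, 13 gives about 256, and 12 gives about 512, so each step down doubles what has to be cached and sorted per row.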
|
|
|
|
chown.multi
Newbie
Offline
Activity: 28
Merit: 0
|
|
December 19, 2016, 06:11:18 PM |
|
Quote from: zawawa on December 19, 2016, 06:06:29 PM
I just noticed three things:
(1) With NR_ROWS_LOG=14, Rounds 1 through 8 are actually much faster.
(2) However, kernel_sols() becomes a bottleneck because there are just too many slots in one row. This problem can be easily solved by using a different sorting algorithm for kernel_sols().
(3) Currently, NR_ROWS_LOG<14 is not possible because there is not enough space in shared memory. This problem can be partially solved by making NR_ROWS_LOG variable across rounds. Since less space in shared memory is required for caching slots at later rounds, NR_ROWS_LOG can be decreased accordingly.
This is it. I'm catching up with Claymore's and Eqminer.

That is good news, keep up the good work.
|
|
|
|
laik2
|
|
December 19, 2016, 08:46:08 PM |
|
Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...
|
|
|
|
zawawa (OP)
|
|
December 20, 2016, 11:28:49 AM |
|
Quote from: laik2 on December 19, 2016, 08:46:08 PM
Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

I am preparing the next point release today. PM me when your Linux servers are ready. I will do my best to make GG compatible with fglrx.
|
|
|
|
zawawa (OP)
|
|
December 20, 2016, 11:40:31 AM |
|
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.) I am also thinking about rewriting GG in CUDA for better performance. The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.
|
|
|
|
qwep1
|
|
December 20, 2016, 02:12:55 PM |
|
Where can I download the miner for Windows?
|
|
|
|
krnlx
|
|
December 20, 2016, 02:56:35 PM |
|
Quote from: zawawa on December 20, 2016, 11:40:31 AM
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.) I am also thinking about rewriting GG in CUDA for better performance. The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

I made a CUDA port of SA5, and there is no difference in performance compared with OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig. In OpenCL you can inline NVIDIA PTX asm easily, just like in CUDA.
|
|
|
|
zawawa (OP)
|
|
December 20, 2016, 03:06:05 PM |
|
Quote from: zawawa on December 20, 2016, 11:40:31 AM
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.) I am also thinking about rewriting GG in CUDA for better performance. The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

Quote from: krnlx on December 20, 2016, 02:56:35 PM
I made a CUDA port of SA5, and there is no difference in performance compared with OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig. In OpenCL you can inline NVIDIA PTX asm easily, just like in CUDA.

Thank you so much for letting me know. What I specifically had in mind was "shfl." If that instruction can be exposed through inline PTX, I can save a considerable amount of time.
|
|
|
|
maztheman
Newbie
Offline
Activity: 9
Merit: 0
|
|
December 20, 2016, 03:10:38 PM |
|
Quote from: zawawa on December 20, 2016, 11:40:31 AM
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.) I am also thinking about rewriting GG in CUDA for better performance. The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

Quote from: krnlx on December 20, 2016, 02:56:35 PM
I made a CUDA port of SA5, and there is no difference in performance compared with OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig. In OpenCL you can inline NVIDIA PTX asm easily, just like in CUDA.

Quote from: zawawa on December 20, 2016, 03:06:05 PM
Thank you so much for letting me know. What I specifically had in mind was "shfl." If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

Yeah, you can, but check the compute version needed. I think it's compute 3.2+.
|
|
|
|
krnlx
|
|
December 20, 2016, 03:28:01 PM |
|
Quote from: zawawa on December 20, 2016, 11:40:31 AM
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.) I am also thinking about rewriting GG in CUDA for better performance. The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

Quote from: krnlx on December 20, 2016, 02:56:35 PM
I made a CUDA port of SA5, and there is no difference in performance compared with OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig. In OpenCL you can inline NVIDIA PTX asm easily, just like in CUDA.

Quote from: zawawa on December 20, 2016, 03:06:05 PM
Thank you so much for letting me know. What I specifically had in mind was "shfl." If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

From the CUDA include files:
int __shfl(int var, int srcLane, int width)
{
    int ret;
    int c = ((warpSize-width) << 8) | 0x1f;
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(srcLane), "r"(c));
    return ret;
}
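For anyone wondering what the third operand in that snippet does, here is a rough CPU-side model in plain C of which lane shfl.idx ends up reading from. It follows my reading of the PTX ISA description of shfl; the function names are made up, and this is only a sketch for checking lane arithmetic, not something to paste into a kernel.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32u

/* Packs the third operand the same way the quoted __shfl() does. */
static uint32_t shfl_idx_operand(uint32_t width)
{
    /* width is expected to be a power of two no larger than the warp size */
    assert(width >= 1u && width <= WARP_SIZE && (width & (width - 1u)) == 0u);
    return ((WARP_SIZE - width) << 8) | 0x1fu;
}

/* Hypothetical model: the lane that `lane` reads from for a given srcLane and width. */
static uint32_t shfl_idx_source_lane(uint32_t lane, uint32_t src_lane, uint32_t width)
{
    assert(lane < WARP_SIZE);
    uint32_t c        = shfl_idx_operand(width);
    uint32_t segmask  = (c >> 8) & 0x1fu;        /* lane bits that pick the segment */
    uint32_t clamp    = c & 0x1fu;               /* upper bound inside the segment */
    uint32_t min_lane = lane & segmask;          /* first lane of this segment */
    uint32_t max_lane = min_lane | (clamp & ~segmask);
    uint32_t j        = min_lane | (src_lane & 0x1fu & ~segmask);
    return (j <= max_lane) ? j : lane;           /* out of range: keep own value */
}
```

With width 8, lane 13 sits in the segment covering lanes 8 to 15, so asking for srcLane 2 reads from lane 10. With width 32 the source is simply srcLane modulo 32.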
|
|
|
|
zawawa (OP)
|
|
December 20, 2016, 04:44:15 PM |
|
Quote from: zawawa on December 20, 2016, 11:40:31 AM
By the way, it turned out that NR_ROWS_LOG=12 and 13 actually work. (There was a bug in the code.) I am also thinking about rewriting GG in CUDA for better performance. The miner is already running 40% faster on the GTX 1060, but I need an extra boost to catch up with Eqminer.

Quote from: krnlx on December 20, 2016, 02:56:35 PM
I made a CUDA port of SA5, and there is no difference in performance compared with OpenCL plus the NVIDIA CPU load fix. The only thing that cannot be implemented in OpenCL is cudaDeviceSetCacheConfig. In OpenCL you can inline NVIDIA PTX asm easily, just like in CUDA.

Quote from: zawawa on December 20, 2016, 03:06:05 PM
Thank you so much for letting me know. What I specifically had in mind was "shfl." If that instruction can be exposed through inline PTX, I can save a considerable amount of time.

Quote from: krnlx on December 20, 2016, 03:28:01 PM
From the CUDA include files:
int __shfl(int var, int srcLane, int width)
{
    int ret;
    int c = ((warpSize-width) << 8) | 0x1f;
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=r"(ret) : "r"(var), "r"(srcLane), "r"(c));
    return ret;
}

Awesome!
|
|
|
|
zawawa (OP)
|
|
December 20, 2016, 08:23:22 PM |
|
Alright peeps, bug fixes for the next version are almost done. I'm getting 164 sol/s with an RX 480 and 128 sol/s with a GTX 1060 3GB. That should be good enough for now. I will upload the new version tonight, US PST.
|
|
|
|
zawawa (OP)
|
|
December 20, 2016, 11:57:31 PM Last edit: December 21, 2016, 05:34:57 AM by zawawa |
|
Quote from: laik2 on December 19, 2016, 08:46:08 PM
Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

I just tested GG on your server. I think I already fixed the problem. (The next version should be much more stable overall.) I will push it to the repo in the next few hours, so you can check it yourself.
This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have the resources to support outdated drivers that are known to be notoriously buggy. I will continue to support the AMDGPU-PRO drivers, though.
|
|
|
|
zawawa (OP)
|
|
December 21, 2016, 03:00:52 AM |
|
I just pushed the new version to GitHub: https://github.com/zawawawa/gatelessgate
More speed enhancements are coming very soon. Enjoy!
|
|
|
|
laik2
|
|
December 21, 2016, 08:04:47 AM |
|
Quote from: laik2 on December 19, 2016, 08:46:08 PM
Updated to the latest Crimson fglrx drivers and... still no go. amdgpu-pro has a bug and is not working well on the R9 series...

Quote from: zawawa on December 20, 2016, 11:57:31 PM
I just tested GG on your server. I think I already fixed the problem. (The next version should be much more stable overall.) I will push it to the repo in the next few hours, so you can check it yourself.
This compatibility issue turned out to be much more complicated than I thought, and I am afraid I need to drop support for fglrx for now. I am practically a one-man development team, and I don't have the resources to support outdated drivers that are known to be notoriously buggy. I will continue to support the AMDGPU-PRO drivers, though.

RX cards should work well with amdgpu-pro, but R9 cards are currently poorly supported...
|
|
|
|
|