I didn't remove the +2; I added another +2.
The +2 is no longer needed on any platform. COLL_DATA_SIZE_PER_TH works best at *5 for AMD and *6 (or *7) for NVIDIA Pascal.
|
|
|
Can you share your changes?
All default, no changes.
It is shared memory, and NR_SLOTS=12, so NR_SLOTS+2=14 bytes, which is not aligned to the 4-byte memory bank size... I tried to fix it (by adding a +2 byte offset); it seems to give another 1-2%, but that is difficult to confirm.
I see no difference. Edit: no difference on NVIDIA, but the AMD RX480 is 1-1.5% faster when the "+2" is removed.
It gives +5% on the 1070 (confirmed).
You are right for the GTX1070: it went from 70 to 73 sol/s.
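The alignment point can be checked with a bit of arithmetic. This is only a sketch under two assumptions that the post does not state: rows are packed back to back in local memory, and each row occupies exactly NR_SLOTS plus the padding in bytes. The function name `misaligned_rows` is made up for illustration.

```python
BANK_WIDTH = 4  # bytes per local-memory bank, the figure quoted in the post
NR_SLOTS = 12   # value quoted in the post


def misaligned_rows(row_bytes, n_rows=64):
    """Count rows whose first byte does not fall on a bank boundary.

    Assumption: rows are packed back to back, so row r starts at r * row_bytes.
    """
    return sum(1 for r in range(n_rows) if (r * row_bytes) % BANK_WIDTH != 0)


# Compare no padding, the original +2, and +2 plus another +2.
for pad in (0, 2, 4):
    row_bytes = NR_SLOTS + pad
    print(f"row of {row_bytes} bytes: "
          f"{misaligned_rows(row_bytes)} of 64 rows start off a bank boundary")
```

Under these assumptions, both removing the +2 (12 bytes) and adding another +2 (16 bytes) put every row on a 4-byte boundary, while 14 bytes does not. Whether that turns into a measurable speedup depends on the hardware, as the mixed results in this thread suggest.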
|
|
|
I don't think SILENTARMY is optimized for the Kepler architecture.
Yes, it now uses atomic operations on local memory, which are supported only on GCN GPUs, Maxwell and Pascal.
|
|
|
The latest version gives me 92-93 sol/s from a modded RX480:
Total 361.5 sol/s [dev0 92.4, dev1 89.5, dev2 81.7, dev3 87.1] 307 shares
Total 361.8 sol/s [dev0 91.7, dev1 89.2, dev2 81.0, dev3 92.5] 308 shares
Total 358.8 sol/s [dev0 92.2, dev1 85.9, dev2 79.4, dev3 91.2] 310 shares
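For anyone comparing runs, the status lines can be averaged instead of eyeballed. This is a minimal sketch that assumes the line format shown above; the helper names `parse_line` and `mean_total` are made up for illustration and are not part of the miner.

```python
import re

# Matches e.g. "Total 361.5 sol/s [dev0 92.4, dev1 89.5] 307 shares"
LINE_RE = re.compile(r"Total\s+([\d.]+)\s+sol/s\s+\[([^\]]*)\]")
DEV_RE = re.compile(r"(dev\d+)\s+([\d.]+)")


def parse_line(line):
    """Return (total, {device: rate}) for a status line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    devices = {name: float(rate) for name, rate in DEV_RE.findall(m.group(2))}
    return float(m.group(1)), devices


def mean_total(lines):
    """Average the 'Total' figure over all lines that parse."""
    totals = [p[0] for p in map(parse_line, lines) if p is not None]
    if not totals:
        raise ValueError("no status lines found")
    return sum(totals) / len(totals)


LOG = [
    "Total 361.5 sol/s [dev0 92.4, dev1 89.5, dev2 81.7, dev3 87.1] 307 shares",
    "Total 361.8 sol/s [dev0 91.7, dev1 89.2, dev2 81.0, dev3 92.5] 308 shares",
    "Total 358.8 sol/s [dev0 92.2, dev1 85.9, dev2 79.4, dev3 91.2] 310 shares",
]
print(f"mean total: {mean_total(LOG):.1f} sol/s")
```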
Has anyone tried this miner: https://github.com/Optiminer/OptiminerZcash ?
|
|
|
----
A lot of Zcash miners know about this pool. It had problems at the start (due to using the z_sendmany API for payments), but since 1 Nov everything has worked fine. Now there is another problem: the hashrate is too low. If you like open-source software and want to support this unique pool (or do you know of other working open-source pools with a good hashrate?) and the silentarmy kernel, and waiting a long time for a block is not a problem for you, welcome!
With the current hashrate, the time to find a block is 5-6 days or more.
|
|
|
OK, I merged my patch with the latest mrb release; it gives the following speedup:
Total 329.4 sol/s [dev0 84.5, dev1 83.0, dev2 72.6, dev3 88.1] 36 shares
Total 329.8 sol/s [dev0 84.3, dev1 85.3, dev2 72.0, dev3 89.1] 37 shares
Total 330.9 sol/s [dev0 86.5, dev1 87.3, dev2 72.3, dev3 93.1] 37 shares
Total 331.5 sol/s [dev0 87.0, dev1 83.3, dev2 75.1, dev3 92.4] 38 shares
Total 331.9 sol/s [dev0 86.0, dev1 83.3, dev2 74.8, dev3 89.7] 38 shares
Total 331.6 sol/s [dev0 87.9, dev1 81.2, dev2 75.4, dev3 89.9] 38 shares
Modded RX480 up to 90 sol/s, stock RX470 ~70 sol/s.
http://coinsforall.io/distr/main.c.opt2
http://coinsforall.io/distr/input.cl.opt2
http://coinsforall.io/distr/param.h.opt2
The RX480 preset for 90 sol/s is 1160/2240 clocks with timings from 1750MHz (see Polaris BIOS Editor).
|
|
|
Is it possible to merge the PCI-E bandwidth fix? This one: https://github.com/mbevand/silentarmy/commit/146b8dc0b6618852e2f322fab51f3ed3739da07a
mrb can do it. I'll try to release the next optimization in 12-24 hours, with a +10-15% performance increase.
|
|
|
netaccess Fixed. You can remove this line.
|
|
|
And Radeon RX480:
Total 74.2 sol/s [dev0 78.5] 4 shares
Total 74.6 sol/s [dev0 78.6] 4 shares
Total 74.8 sol/s [dev0 78.0] 4 shares
Total 74.6 sol/s [dev0 78.6] 4 shares
Total 74.2 sol/s [dev0 77.0] 4 shares
Total 74.1 sol/s [dev0 75.8] 4 shares
It's only the first step; the second will follow soon. I'll send the sources to mrb and genoil in 30 minutes. Edit: using OPTIM_SIMPLIFY_ROUND gives an extra 5%, so 80 sol/s on a modded RX480 is possible:
Total 309.0 sol/s [dev0 81.2, dev1 78.3, dev2 68.5, dev3 80.8] 19 shares
Total 307.8 sol/s [dev0 78.1, dev1 77.7, dev2 69.7, dev3 80.1] 19 shares
Total 307.8 sol/s [dev0 78.3, dev1 79.1, dev2 70.6, dev3 81.4] 19 shares
Total 309.2 sol/s [dev0 80.1, dev1 80.5, dev2 70.6, dev3 81.4] 20 shares
|
|
|
The next optimization will be huge; I already got 60 sol/s on a GTX1070:
Total 60.2 sol/s [dev0 59.5] 1 share
Total 60.8 sol/s [dev0 60.6] 1 share
Total 60.9 sol/s [dev0 61.5] 1 share
Total 60.5 sol/s [dev0 59.5] 1 share
|
|
|
nerdralph, did you try using local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?
Each compute unit has 64KB of LDS, so an RX 470 with 32 CUs has 2MB of LDS. 1 million (2^20) 32-bit counters need 4MB. atomic_inc works only with ints, so even if the counters are packed into 8 bits each so that they all fit in LDS, there doesn't seem to be a way in OpenCL to atomically increment them. See PM.
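The capacity numbers in that reply can be reproduced in a few lines. This sketch simply follows the post in adding up LDS across all compute units as if it were one pool; in practice each CU can only address its own LDS, so the aggregate is an upper bound rather than a usable single table. All names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB

LDS_PER_CU = 64 * KIB   # per compute unit, from the post
COMPUTE_UNITS = 32      # RX 470, from the post
NUM_COUNTERS = 2 ** 20  # the "1 million" counters from the post


def table_bytes(bits_per_counter):
    """Size in bytes of a table with one counter per row."""
    return NUM_COUNTERS * bits_per_counter // 8


total_lds = LDS_PER_CU * COMPUTE_UNITS  # aggregate over all CUs
for bits in (32, 8):
    need = table_bytes(bits)
    print(f"{bits}-bit counters: {need // MIB} MiB needed, "
          f"{total_lds // MIB} MiB of LDS in total, fits: {need <= total_lds}")
```

So 32-bit counters overshoot the aggregate LDS by a factor of two, and 8-bit counters would fit on paper, which is the point at which the limitation on atomic_inc operand size becomes the obstacle.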
|
|
|
nerdralph, did you try using local memory for the atomic increment (store all data to global memory and walk through the data in a separate kernel)?
|
|
|
I have another idea for optimizing the equihash round kernel; results will come in the next 12-24h.
You must have o/c'd. I don't believe 16.30 is 37% faster than 16.40.
Only memory: 1100/2160 and a low DRAM timings preset (modded ROM).
Excellent! We believe in you!
Got only 2%, because another place needs optimizing: the function ht_store in the kernel. This one line of code (line 124):
cnt = atomic_inc((__global uint *)p);
takes half of the whole iteration time!
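To show what that atomic increment is doing, here is a host-side simulation of slot allocation with a per-row counter. It is a sketch, not the kernel's code: the assumption is that the counter hands each writer a unique slot index and that writes past NR_SLOTS are dropped. `Row` and `ht_store_sketch` are made-up names, and a lock stands in for the hardware atomic.

```python
import threading

NR_SLOTS = 12  # slots per row, the value quoted earlier in the thread


class Row:
    """One hash-table row: a slot counter plus NR_SLOTS slots."""

    def __init__(self):
        self.count = 0
        self.slots = [None] * NR_SLOTS
        self._lock = threading.Lock()  # stands in for the hardware atomic

    def atomic_inc(self):
        """Increment the counter and return its previous value, like atomic_inc."""
        with self._lock:
            old = self.count
            self.count += 1
            return old


def ht_store_sketch(row, value):
    """Claim a slot via the counter; return False if the row is already full."""
    cnt = row.atomic_inc()
    if cnt >= NR_SLOTS:
        return False  # assumption: overflowing entries are simply dropped
    row.slots[cnt] = value
    return True


demo_row = Row()
stored = sum(ht_store_sketch(demo_row, v) for v in range(15))
print(f"stored {stored} of 15 values, counter is now {demo_row.count}")
```

Every writer to the same row has to go through the same counter, so these increments serialize, which would be consistent with that one line dominating the iteration time.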
|
|
|
I have another idea for optimizing the equihash round kernel; results will come in the next 12-24h.
You must have o/c'd. I don't believe 16.30 is 37% faster than 16.40.
Only memory: 1100/2160 and a low DRAM timings preset (modded ROM).
|
|
|
RX480 with amdgpu-pro 16.30:
Total 55.8 sol/s [dev0 54.0] 18 shares
Total 55.3 sol/s [dev0 52.4] 18 shares
Total 55.6 sol/s [dev0 54.7] 18 shares
Total 55.9 sol/s [dev0 55.7] 18 shares
Total 55.0 sol/s [dev0 55.7] 18 shares
Total 55.5 sol/s [dev0 56.2] 18 shares
Total 55.2 sol/s [dev0 56.1] 19 shares
Total 54.6 sol/s [dev0 54.8] 19 shares
Total 54.9 sol/s [dev0 55.3] 19 shares
Total 55.1 sol/s [dev0 53.1] 19 shares
Total 54.4 sol/s [dev0 52.6] 19 shares
Kernel: http://coinsforall.io/distr/input.cl.coll1
NVIDIA also gets a speedup. I reduced the number of collisions to find from 5 to 1; 5 seems to be too many, but I need mrb's comments.
|
|
|
I think the private NVIDIA farms already have a miner that does 120 sol/s per card for Zcash.
But with 48 sol/s on a GTX1070, this miner has become the fastest public one.
|
|
|
|