MaxDZ8
|
|
October 23, 2015, 09:27:28 AM |
|
What is your experience with async block reads?
|
|
|
|
ghostlander (OP)
Legendary
Offline
Activity: 1241
Merit: 1020
No surrender, no retreat, no regret.
|
|
October 23, 2015, 12:55:35 PM |
|
Thanks to those who have donated. instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions. or, better, make an unrolled loop. that way it's much more compact and easier to debug.
Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.
What is your experience with async block reads?
Aren't they async by default in SGminer?
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
October 23, 2015, 01:17:41 PM |
|
Thanks to those who have donated. instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions. or, better, make an unrolled loop. that way it's much more compact and easier to debug.
Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.
do you see the repeated instructions? just change the "if" structure and you can remove them ;-) i.e. if worksize <= 128 you need to do some additional rotates compared to default (256), some additional others if worksize == 64.... although the alternative for loop is a much more elegant solution and the difference in speed is negligible.
|
|
|
|
ghostlander (OP)
Legendary
Offline
Activity: 1241
Merit: 1020
No surrender, no retreat, no regret.
|
|
October 23, 2015, 01:42:40 PM |
|
Thanks to those who have donated. instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions. or, better, make an unrolled loop. that way it's much more compact and easier to debug.
Umm, these are preprocessor directives. The decision is made at compile time. There is nothing left to unroll.
do you see the repeated instructions? just change the "if" structure and you can remove them ;-) i.e. if worksize <= 128 you need to do some additional rotates compared to default (256), some additional others if worksize == 64.... although the alternative for loop is a much more elegant solution and the difference in speed is negligible.
What you suggest results in less linear memory writes which isn't good usually. I prefer to avoid loops if possible.
T0_L[lclid] = T0[lclid];
T1_L[lclid] = rotate(T0[lclid], 8UL);
T2_L[lclid] = rotate(T0[lclid], 16UL);
T3_L[lclid] = rotate(T0[lclid], 24UL);
T4_L[lclid] = rotate(T0[lclid], 32UL);
T5_L[lclid] = rotate(T0[lclid], 40UL);
T6_L[lclid] = rotate(T0[lclid], 48UL);
T7_L[lclid] = rotate(T0[lclid], 56UL);
#if (WORKSIZE < 256)
T0_L[lclid + 128] = T0[lclid + 128];
T1_L[lclid + 128] = rotate(T0[lclid + 128], 8UL);
T2_L[lclid + 128] = rotate(T0[lclid + 128], 16UL);
T3_L[lclid + 128] = rotate(T0[lclid + 128], 24UL);
T4_L[lclid + 128] = rotate(T0[lclid + 128], 32UL);
T5_L[lclid + 128] = rotate(T0[lclid + 128], 40UL);
T6_L[lclid + 128] = rotate(T0[lclid + 128], 48UL);
T7_L[lclid + 128] = rotate(T0[lclid + 128], 56UL);
#endif
#if (WORKSIZE < 128)
T0_L[lclid + 64] = T0[lclid + 64];
T0_L[lclid + 192] = T0[lclid + 192];
T1_L[lclid + 64] = rotate(T0[lclid + 64], 8UL);
T1_L[lclid + 192] = rotate(T0[lclid + 192], 8UL);
T2_L[lclid + 64] = rotate(T0[lclid + 64], 16UL);
T2_L[lclid + 192] = rotate(T0[lclid + 192], 16UL);
T3_L[lclid + 64] = rotate(T0[lclid + 64], 24UL);
T3_L[lclid + 192] = rotate(T0[lclid + 192], 24UL);
T4_L[lclid + 64] = rotate(T0[lclid + 64], 32UL);
T4_L[lclid + 192] = rotate(T0[lclid + 192], 32UL);
T5_L[lclid + 64] = rotate(T0[lclid + 64], 40UL);
T5_L[lclid + 192] = rotate(T0[lclid + 192], 40UL);
T6_L[lclid + 64] = rotate(T0[lclid + 64], 48UL);
T6_L[lclid + 192] = rotate(T0[lclid + 192], 48UL);
T7_L[lclid + 64] = rotate(T0[lclid + 64], 56UL);
T7_L[lclid + 192] = rotate(T0[lclid + 192], 56UL);
#endif
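For anyone who wants to check the index arithmetic without a GPU, here is a rough host-side sketch in plain C that emulates the load above, one work-item at a time. It is only an illustration, not kernel code: rotl64, load_entry, emulate_load, TABLE_SIZE and NUM_TABLES are made-up names, the WORKSIZE preprocessor checks are turned into runtime ifs, and the 256-entry table size is inferred from the offsets in the post.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 256 /* entries per T table (assumed from the offsets used in the post) */
#define NUM_TABLES 8   /* T0_L .. T7_L */

/* 64-bit left rotate, standing in for OpenCL's rotate() on ulong. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63u;
    return n ? (x << n) | (x >> (64u - n)) : x;
}

/* Fill entry idx of all eight local tables and count the visit. */
static void load_entry(uint64_t t_l[NUM_TABLES][TABLE_SIZE], const uint64_t *t0,
                       unsigned idx, unsigned hits[TABLE_SIZE])
{
    for (unsigned k = 0; k < NUM_TABLES; ++k)
        t_l[k][idx] = rotl64(t0[idx], 8u * k);
    hits[idx]++;
}

/* Emulate a whole work-group of the given size.
 * Returns 1 if every table entry was written exactly once, 0 otherwise. */
int emulate_load(unsigned worksize, const uint64_t *t0,
                 uint64_t t_l[NUM_TABLES][TABLE_SIZE])
{
    unsigned hits[TABLE_SIZE] = {0};

    assert(worksize == 64 || worksize == 128 || worksize == 256);

    for (unsigned lclid = 0; lclid < worksize; ++lclid) {
        load_entry(t_l, t0, lclid, hits);           /* unconditional part */
        if (worksize < 256)                         /* #if (WORKSIZE < 256) */
            load_entry(t_l, t0, lclid + 128, hits);
        if (worksize < 128) {                       /* #if (WORKSIZE < 128) */
            load_entry(t_l, t0, lclid + 64, hits);
            load_entry(t_l, t0, lclid + 192, hits);
        }
    }

    for (unsigned j = 0; j < TABLE_SIZE; ++j)
        if (hits[j] != 1)
            return 0;
    return 1;
}
```

Calling emulate_load(64, t0, t_l), emulate_load(128, t0, t_l) and emulate_load(256, t0, t_l) on any 256-entry t0 should return 1 each time, which is the property the three preprocessor branches are meant to guarantee.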
|
|
|
|
MaxDZ8
|
|
October 23, 2015, 03:50:00 PM |
|
Aren't they async by default in SGminer?
I mean async_work_group_copy & friends. I have very mixed feelings about them: on my hardware they just don't perform well, not even in terms of bandwidth usage. Maybe on GCN1.1 they work better, IDK. BTW, I would suggest putting a barrier(CLK_LOCAL_MEM_FENCE) right after the T-tables load. Besides, just have it looped!
for(ulong i = 0; i < 8; ++i) {
    // ulong to match the 64-bit tables above
    local ulong *tdst = T_local + 256 * i; // not even really required if you alloc your local T tables in one block
    for(uint el = 0; el < 256; el += get_local_size(0)) {
        // each work-item copies entry el + its local id, rotated by 8 * i bits
        tdst[el + get_local_id(0)] = rotate(T_global[el + get_local_id(0)], i * 8);
    }
}
That's more or less what pallas suggests. It is a fully coherent read. LDS has full scatter/gather capability; as long as you end in a different bank you're safe (here, it happens by construction). Alternatively, the first block can be loaded by async_work_group_copy and you can derive T1_L from T0_L. LDS should provide you a massive latency reduction by not having to round-trip to the (hot) L1 cache. Loops such as this are fully unrolled in most cases.
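A quick way to convince yourself that the looped load and the "derive T1_L from T0_L" alternative fill the tables identically is to emulate both on the host. The sketch below is plain C, not OpenCL, and rests on a few assumptions: the eight tables sit back to back in one flat array, the tables are 64-bit, and the local size divides 256. looped_load, derived_load and rotl64 are hypothetical names, and the plain copy loop merely stands in for async_work_group_copy.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 256 /* entries per T table (assumed) */
#define NUM_TABLES 8   /* tables stored back to back in one flat array (assumed) */

/* 64-bit left rotate, standing in for OpenCL's rotate() on ulong. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63u;
    return n ? (x << n) | (x >> (64u - n)) : x;
}

/* Looped load: the innermost lid loop plays the role of the work-items,
 * which on the GPU would all run the two outer loops in parallel. */
void looped_load(unsigned local_size, const uint64_t *t_global, uint64_t *t_local)
{
    assert(local_size > 0 && TABLE_SIZE % local_size == 0);

    for (unsigned i = 0; i < NUM_TABLES; ++i) {
        uint64_t *tdst = t_local + TABLE_SIZE * i;
        for (unsigned el = 0; el < TABLE_SIZE; el += local_size)
            for (unsigned lid = 0; lid < local_size; ++lid)
                tdst[el + lid] = rotl64(t_global[el + lid], 8u * i);
    }
}

/* Alternative: copy table 0 once, then build tables 1..7 from the local copy
 * so the source table is read only one time. */
void derived_load(const uint64_t *t_global, uint64_t *t_local)
{
    for (unsigned j = 0; j < TABLE_SIZE; ++j)
        t_local[j] = t_global[j]; /* stand-in for async_work_group_copy */

    for (unsigned i = 1; i < NUM_TABLES; ++i)
        for (unsigned j = 0; j < TABLE_SIZE; ++j)
            t_local[TABLE_SIZE * i + j] = rotl64(t_local[j], 8u * i);
}
```

Filling two 8 * 256 buffers, one with looped_load(64, src, a) and one with derived_load(src, b), should give identical contents. Whether the derived variant is actually faster on a given card is something only a benchmark can tell.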
|
|
|
|
drr0ss
Member
Offline
Activity: 98
Merit: 10
|
|
October 23, 2015, 09:18:34 PM |
|
instead of using "elif", just use simple independent "if" statements and remove the dup-triplicate instructions. or, better, make an unrolled loop. that way it's much more compact and easier to debug.
Hi Pallas, can you share your .cl with me? I will send you some feeds.....
|
|
|
|
drr0ss
Member
Offline
Activity: 98
Merit: 10
|
|
October 23, 2015, 09:31:51 PM |
|
Aren't they async by default in SGminer?
I mean async_work_group_copy & friends. I have very mixed feelings about them: on my hardware they just don't perform well, not even in terms of bandwidth usage. Maybe on GCN1.1 they work better, IDK. BTW, I would suggest putting a barrier(CLK_LOCAL_MEM_FENCE) right after the T-tables load. Besides, just have it looped!
for(ulong i = 0; i < 8; ++i) {
    // ulong to match the 64-bit tables above
    local ulong *tdst = T_local + 256 * i; // not even really required if you alloc your local T tables in one block
    for(uint el = 0; el < 256; el += get_local_size(0)) {
        // each work-item copies entry el + its local id, rotated by 8 * i bits
        tdst[el + get_local_id(0)] = rotate(T_global[el + get_local_id(0)], i * 8);
    }
}
That's more or less what pallas suggests. It is a fully coherent read. LDS has full scatter/gather capability; as long as you end in a different bank you're safe (here, it happens by construction). Alternatively, the first block can be loaded by async_work_group_copy and you can derive T1_L from T0_L. LDS should provide you a massive latency reduction by not having to round-trip to the (hot) L1 cache. Loops such as this are fully unrolled in most cases.
Oh, forcing an old man like me to retype the code and think... never mind, and cheers.
|
|
|
|
sp_
Legendary
Offline
Activity: 2940
Merit: 1087
Team Black developer
|
|
October 23, 2015, 10:30:40 PM Last edit: November 17, 2015, 08:12:17 PM by sp_ |
|
Pallas is pretty good.
He bought an NVIDIA card and improved Neoscrypt by 10% in a couple of weeks.
CUDA, foreign language foreign technology...
respect
|
|
|
|
proctologic
|
|
November 17, 2015, 08:10:57 PM |
|
What coin do you use Myriad-Groestl to mine?
There are:
- Saffroncoin
- Digibyte
- Myriadcoin
- Trinitycoin
|
|
|
|
|
Koltan
Newbie
Offline
Activity: 57
Merit: 0
|
|
May 04, 2016, 08:14:02 PM |
|
A Radeon HD7790 at 1200/1600 mining MYR got 17.5 MH/s on this kernel. That's two times faster than the original.
|
|
|
|
Tmdz
|
|
May 04, 2016, 09:10:30 PM Last edit: May 05, 2016, 12:01:29 AM by Tmdz |
|
Nice work. The 7950 went from 7 MH/s to 27 MH/s, but I think mining Skein will still earn you more with DGB. On the technical side, that kind of efficiency improvement is simply amazing.
|
|
|
|
navydude
|
|
May 18, 2017, 08:30:52 PM |
|
Wondering if someone would compile this for Windows. Would be much appreciated!
|
|
|
|
prichina
Newbie
Offline
Activity: 82
Merit: 0
|
|
May 23, 2017, 07:56:44 PM |
|
The 7970 is doing fine with sgminer 5.1.1 at 35 MH/s, but my R9 290X is bonkers: only 40-42 MH/s, which is very low :/ Can someone help? Please send me a kernel or bin so I can make it work. I've tried ghostlander's myriad-groestl.cl, still 7970 - 35 MH/s and R9 290X - 40-42 MH/s :/ If someone needs work with video editing, I can make it happen, private message me. Kind regards, Ivo Icevski
|
|
|
|
ghostlander (OP)
Legendary
Offline
Activity: 1241
Merit: 1020
No surrender, no retreat, no regret.
|
|
May 24, 2017, 03:20:55 PM |
|
I don't work on this kernel any more. 500k MYR in donations was all that I received, which is hardly worth the effort. If anyone wants to continue, feel free.
|
|
|
|
JetstoBrazil
Newbie
Offline
Activity: 31
Merit: 0
|
|
June 04, 2017, 05:31:44 AM |
|
Hi, I have a very stupid question and can't find the answer...
How can I install this kernel? Should I save the code in a text editor? And then how do I tell sgminer to use this kernel? In the sgminer folder I have a subfolder with a lot of kernels (kernel files *.cl).
I can't find any instructions for installing this kernel.
Thank you for your answer
Open the miner folder, open the kernel folder in there and find the myriad-groestl kernel. Right-click it and select open with WordPad. Delete everything that is in there, then copy and paste the code from the first post into it. Save and close.
|
|
|
|
coinmania
Member
Offline
Activity: 97
Merit: 10
|
|
June 09, 2017, 09:12:36 AM |
|
Hey There,
Just a question: the DigiByte Core wallet has now been running for one day and is telling me that synchronizing with the network will take about 10 years? This is not normal, right?
|
"Amateurs sit and wait for inspiration, the rest of us get up and go to work" Stephen King
Just sayin: 158xW3o63zdGe6wCQH5edyrjm5RFzTqghn
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
June 09, 2017, 09:27:40 AM |
|
Hey There,
Just a question: the DigiByte Core wallet has now been running for one day and is telling me that synchronizing with the network will take about 10 years? This is not normal, right?
And it is not on topic either, you should ask in the digibyte thread.
|
|
|
|
coinmania
Member
Offline
Activity: 97
Merit: 10
|
|
June 09, 2017, 11:17:14 AM |
|
Hey There,
Just a question: the DigiByte Core wallet has now been running for one day and is telling me that synchronizing with the network will take about 10 years? This is not normal, right?
And it is not on topic either, you should ask in the digibyte thread.
Yeah, you are right. Never mind, I found the solution.
|
|
|
Harry5555
Member
Offline
Activity: 81
Merit: 10
|
|
July 21, 2017, 02:22:44 AM Last edit: July 21, 2017, 12:44:50 PM by Harry5555 |
|
Could anybody help me configure this miner? Each time I configure it, it prefers to mine on the Intel HD Graphics rather than my RX GPU...
Any suggestions?
Edit: I found a workaround by using --gpu-platform 1
|
|
|
|
|