joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 12:35:19 AM |
|
when is the last time you delivered 30% in less than an hour?
Today. Your quark kernel. Since skein is much faster than groestl, we only do skein and throw away 50% of the hashes:

    if (hash[0] & 0x8)
    {
        sph_groestl512_init(&ctx_groestl);
        sph_groestl512(&ctx_groestl, (const void*) hash, 64);
        sph_groestl512_close(&ctx_groestl, (void*) hash);
    }
    else
    {
        sph_skein512_init(&ctx_skein);
        sph_skein512(&ctx_skein, (const void*) hash, 64);
        sph_skein512_close(&ctx_skein, (void*) hash);
    }

There was an optimization made in cpuminer where, if it was determined that a second round of groestl was necessary, the existing hashes would be thrown away, on the belief that it would take longer to complete the second groestl than to start over. It didn't work. However, I might try ccminer's logic: cpuminer uses a state machine as the engine, ccminer just uses a simple if. I'm also going to look at the other contexts. Selectively reinitializing the necessary fields may be quicker than the current implementation of copying a saved initialized context. Both are quicker than what ccminer does.
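For what it's worth, a minimal sketch (plain C, using the sphlib skein calls from the snippet above) of the "copy a saved initialized context" idea; the wrapper name and the cached static context are illustrative only, not the actual cpuminer code:

    #include <string.h>
    #include "sph_skein.h"

    /* Sketch only: initialize the skein context once, then memcpy the saved
       state instead of calling sph_skein512_init for every hash. Selectively
       resetting just the fields that change between calls could be cheaper
       still. */
    static sph_skein512_context skein_saved_ctx;
    static int skein_ctx_ready = 0;

    static void skein512_hash_fast(void *hash, const void *data, size_t len)
    {
        sph_skein512_context ctx;

        if (!skein_ctx_ready) {
            sph_skein512_init(&skein_saved_ctx);
            skein_ctx_ready = 1;
        }
        memcpy(&ctx, &skein_saved_ctx, sizeof(ctx));  /* cheaper than init */
        sph_skein512(&ctx, data, len);
        sph_skein512_close(&ctx, hash);
    }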
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 12:51:04 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
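sp_'s actual GPU tables aren't shown here, but as a rough CPU-side illustration of the general idea (replacing the data-dependent if with a lookup into a precomputed dispatch table), something like the following; the wrapper functions are placeholders, not real miner code:

    typedef void (*quark_round_fn)(unsigned char hash[64]);

    /* Placeholder wrappers standing in for the real skein/groestl rounds. */
    static void round_skein512(unsigned char hash[64])   { /* skein here */ }
    static void round_groestl512(unsigned char hash[64]) { /* groestl here */ }

    /* Precomputed table: index 0 -> skein, 1 -> groestl, chosen by bit 3 of
       hash[0] as in the quoted quark code, with no conditional branch. */
    static const quark_round_fn quark_branch[2] = { round_skein512, round_groestl512 };

    static void quark_branched_round(unsigned char hash[64])
    {
        quark_branch[(hash[0] >> 3) & 1](hash);
    }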
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 12:54:04 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 12:55:22 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 12:58:24 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
My changes have nothing to do with avoiding branches, but with avoiding work.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 01:00:23 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I may not have realized I was looking at verification code at the time but I know what it is. Maybe my changes can be applied to the GPU code and you'll get your 30%
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 01:12:09 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I know why changing the verification code made things faster. I wa scpumining 8 threads at the time so it was slowing down the CPU.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 03:42:51 AM |
|
You are an assembly language guy, do you reorder instructions to maximize instruction throughput? It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock, how many can be executed per clock, how deep the memory buffer is, does it delay writes to prioritize reads, how big a cache line is, etc. I know none of this stuff, maybe you do and could use it to speed up the hot spots.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 06:32:06 AM Last edit: January 28, 2016, 06:57:30 AM by sp_ |
|
You are an assembly language guy, do you reorder instructions to maximize instruction throughput? It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock, how many can be executed per clock, how deep the memory buffer is, does it delay writes to prioritize reads, how big a cache line is, etc. I know none of this stuff, maybe you do and could use it to speed up the hot spots.
This is something the compiler is very good at. The CUDA core is a 3+ operand RISC processor with up to 256 registers; it is built for the compiler. Sometimes you need to move code around, manually unroll some loops, etc., and verify the result by disassembling (this is what DJM34 is calling random stuff). But don't let the code size grow too big, the instruction cache is small. ... While you were trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.
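As a toy illustration of the manual unrolling sp_ mentions (the kernel and its placeholder rotate round are made up; only the #pragma unroll hint and the idea of checking the generated SASS, e.g. with cuobjdump -sass, are the point):

    #include <stdint.h>

    __global__ void toy_rounds(uint64_t *state)
    {
        uint64_t x = state[threadIdx.x];

        /* Fully unrolled: the compiler emits eight straight-line rounds
           instead of a loop, trading instruction-cache space for fewer
           branches. */
        #pragma unroll 8
        for (int i = 0; i < 8; i++)
            x = (x << 13) | (x >> 51);   /* placeholder round */

        state[threadIdx.x] = x;
    }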
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 06:33:30 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I know why changing the verification code made things faster. I wa scpumining 8 threads at the time so it was slowing down the CPU. But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 07:14:28 AM |
|
You are an assembly language guy, do you reorder instructions to maximize instruction throughput? It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock, how many can be executed per clock, how deep the memory buffer is, does it delay writes to prioritize reads, how big a cache line is, etc. I know none of this stuff, maybe you do and could use it to speed up the hot spots.
This is something the compiler is very good at. The CUDA core is a 3+ operand RISC processor with up to 256 registers; it is built for the compiler. Sometimes you need to move code around, manually unroll some loops, etc., and verify the result by disassembling (this is what DJM34 is calling random stuff). But don't let the code size grow too big, the instruction cache is small. ... While you were trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner. I was talking more about performing loads as soon as possible to give memory time to respond before you need the data. It also fills the cache line for subsequent loads. If CUDA supports read priority you can even issue a store before a load and the load will have priority; you just have to watch for register conflicts. There is also issuing different types of instructions on the same clock to improve superscalar operation. These kinds of things are hard for a normal compiler to do because they are specific to each processor, but if anyone can do it it's CUDA, because they have one HW architecture, one run-time system and one compiler. And another thing, you trolled me first.
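A rough host-side sketch of the "issue loads early" idea (the function and its placeholder work are hypothetical; the point is starting the next load before the dependent work on the current value):

    #include <stddef.h>
    #include <stdint.h>

    uint64_t sum_rounds(const uint64_t *data, size_t n)
    {
        uint64_t cur = n ? data[0] : 0;
        uint64_t acc = 0;

        for (size_t i = 0; i < n; i++) {
            /* Start the next load as soon as possible so memory has time to
               respond while the ALU works on the current value. */
            uint64_t next = (i + 1 < n) ? data[i + 1] : 0;

            acc += (cur << 7) ^ (cur >> 3);   /* placeholder dependent work */

            cur = next;
        }
        return acc;
    }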
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 07:17:41 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I know why changing the verification code made things faster. I wa scpumining 8 threads at the time so it was slowing down the CPU. But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something. Tried that in cpuminer, didn't help. I only managed to get another 1% out of c11, not sure why, expected more, will take another look. No other algos benefit from the fast ctx reinit but you should try it in ccminer, the GPU kernel, that is.
|
|
|
|
a123
Member
Offline
Activity: 98
Merit: 10
|
|
January 28, 2016, 07:18:24 AM |
|
Used allanmac's code ( https://gist.github.com/allanmac/f91b67c112bcba98649d) from devtalk.nvidia.com to test TLB thrashing and compiled it with nvcc -m32. It didn't alleviate the TLB thrashing issue; 970 still dives like a stone past 2GB. Not sure if this is representative of memory bandwidth in Ethash though, just sharing to see if anyone can tweak and improve the TLB situation. Perhaps compiling the ether miner for 32 bits will help? Cached pointer sizes will go from 64 bit to 32 (and double the TLB limit?). You need to remove the cpu verification code because it uses 64-bit libraries I think..
Thought of that but it's going to be troublesome. You only have a 4GB address space, with Windows already sucking up ~half. Then you have to load the 1.3 GB DAG from disk, and allocate 1.3GB of GPU RAM (which, AFAIK, sits in the same space, although it isn't pinned to host). This doesn't fit. So then you would have to read the DAG from disk in small chunks and copy it over to GPU RAM. And when that's all done, you would have to pass all solutions on to a special light version of ethminer that does light verification, as it can't load a DAG into RAM for the same reasons. Or you simply don't verify and risk some boo's. Then, when that's all done, you're not even sure if it fixes the problem. You could try getting a 32-bit version of dagSimCL to work. I believe Epsylon3/tpruvot was trying to get a 32-bit version of ethminer to work a while back. Can't find the source anymore.
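A hedged sketch of that chunked-upload idea (file name, chunk size and error handling are illustrative only; the CUDA runtime calls are the standard cudaMalloc/cudaMemcpy): stream the DAG into device memory through a small staging buffer so a 32-bit process never has to map the whole 1.3 GB at once.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHUNK_BYTES (16u * 1024u * 1024u)   /* 16 MB staging buffer */

    /* Hypothetical helper: copy a large DAG file to the GPU in small pieces. */
    int upload_dag(const char *path, unsigned char **d_dag_out, size_t dag_bytes)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;

        unsigned char *staging = (unsigned char *)malloc(CHUNK_BYTES);
        unsigned char *d_dag = NULL;
        cudaMalloc((void **)&d_dag, dag_bytes);          /* device-side DAG */

        size_t off = 0;
        while (off < dag_bytes) {
            size_t want = dag_bytes - off;
            if (want > CHUNK_BYTES) want = CHUNK_BYTES;
            if (fread(staging, 1, want, f) != want) break;
            cudaMemcpy(d_dag + off, staging, want, cudaMemcpyHostToDevice);
            off += want;
        }

        free(staging);
        fclose(f);
        *d_dag_out = d_dag;
        return off == dag_bytes ? 0 : -1;
    }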
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 07:49:34 AM |
|
While you where trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.
Since I forked cpuminer I've increased performance by up to 92% (x13), 75% (x15), 36% (qubit) and 27% (quark). I can't take credit for all of it because it was just plugging in faster functions that already existed. But all the gains in quark are mine.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 07:53:45 AM |
|
Used allanmac's code ( https://gist.github.com/allanmac/f91b67c112bcba98649d) from devtalk.nvidia.com to test TLB thrashing and compiled it with nvcc -m32. It didn't alleviate the TLB thrashing issue; 970 still dives like a stone past 2GB. Not sure if this is representative of memory bandwidth in Ethash though, just sharing to see if anyone can tweak and improve the TLB situation. Did you try in a 32bit OS?
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 07:54:12 AM |
|
While you where trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.
Since I forked cpuminer I've increased performance by up to 92% (x13), 75% (x15), 36% (qubit) and 27% (quark). I can't take credit for all of it because it was just plugging in faster functions that already existed. But all the gains in quark are mine. This is good.
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 28, 2016, 10:08:10 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
|
|
|
|
myagui
Legendary
Offline
Activity: 1154
Merit: 1001
|
|
January 28, 2016, 10:12:11 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
Joblo's optimization impacts CPU validation of any found shares. This is usually insignificant, but since he's also mining with all CPU cores, it did have an impact for him: his CPU mining was slowing down ccminer. Joblo: You're invited for a beer over at #ccminer @freenode: there's friendlier dev talk there, some collaboration now and then, and certainly a lot less BS.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 10:16:45 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
My private quark is +30% up from release 74. The buyable private is +5%.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 10:31:43 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
My private quark is +30% up from release 74. The buyable private is +5%. If you didn't work on SIMD, I'm surprised and disappointed. If I did I wouldn't open source it, would I? What is the point? Why don't you open source yours..
|
|
|
|
|