joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 12:35:19 AM |
|
when is the last time you delivered 30% in less than an hour?
Today. Your quark kernel. Since skein is much faster than groestl, we only do skein and throw away 50% of the hashes:

    if (hash[0] & 0x8)
    {
        sph_groestl512_init(&ctx_groestl);
        sph_groestl512(&ctx_groestl, (const void*) hash, 64);
        sph_groestl512_close(&ctx_groestl, (void*) hash);
    }
    else
    {
        sph_skein512_init(&ctx_skein);
        sph_skein512(&ctx_skein, (const void*) hash, 64);
        sph_skein512_close(&ctx_skein, (void*) hash);
    }

There was an optimization made in cpuminer where, if it was determined that a second round of groestl was necessary, the existing hashes would be thrown away, on the belief that it would take longer to complete the second groestl than to start over. It didn't work. However, I might try ccminer's logic: cpuminer uses a state machine as the engine, ccminer just uses a simple if. I'm also going to look at the other contexts. Selectively reinitializing the necessary fields may be quicker than the current implementation of copying a saved initialized context. Both are quicker than what ccminer does.
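For what it's worth, a minimal sketch (plain C, using the sphlib skein calls from the snippet above) of the "copy a saved initialized context" idea; the wrapper name and the cached static context are illustrative only, not the actual cpuminer code:

    #include <string.h>
    #include "sph_skein.h"

    /* Sketch only: initialize the skein context once, then memcpy the saved
       state instead of calling sph_skein512_init for every hash. Selectively
       resetting just the fields that change between calls could be cheaper
       still. */
    static sph_skein512_context skein_saved_ctx;
    static int skein_ctx_ready = 0;

    static void skein512_hash_fast(void *hash, const void *data, size_t len)
    {
        sph_skein512_context ctx;

        if (!skein_ctx_ready) {
            sph_skein512_init(&skein_saved_ctx);
            skein_ctx_ready = 1;
        }
        memcpy(&ctx, &skein_saved_ctx, sizeof(ctx));  /* cheaper than init */
        sph_skein512(&ctx, data, len);
        sph_skein512_close(&ctx, hash);
    }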
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 12:51:04 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
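sp_'s actual GPU tables aren't shown here, but as a rough CPU-side illustration of the general idea (replacing the data-dependent if with a lookup into a precomputed dispatch table), something like the following; the wrapper functions are placeholders, not real miner code:

    typedef void (*quark_round_fn)(unsigned char hash[64]);

    /* Placeholder wrappers standing in for the real skein/groestl rounds. */
    static void round_skein512(unsigned char hash[64])   { /* skein here */ }
    static void round_groestl512(unsigned char hash[64]) { /* groestl here */ }

    /* Precomputed table: index 0 -> skein, 1 -> groestl, chosen by bit 3 of
       hash[0] as in the quoted quark code, with no conditional branch. */
    static const quark_round_fn quark_branch[2] = { round_skein512, round_groestl512 };

    static void quark_branched_round(unsigned char hash[64])
    {
        quark_branch[(hash[0] >> 3) & 1](hash);
    }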
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 12:54:04 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 12:55:22 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 12:58:24 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
My changes have nothing to do with avoiding branches, but with avoiding work.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 01:00:23 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I may not have realized I was looking at verification code at the time but I know what it is. Maybe my changes can be applied to the GPU code and you'll get your 30%
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 01:12:09 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I know why changing the verification code made things faster. I wa scpumining 8 threads at the time so it was slowing down the CPU.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 03:42:51 AM |
|
You are an assembly language guy, do you reorder instructions to maximize instruction throughput? It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock, how many can be executed per clock, how deep the memory buffer is, does it delay writes to prioritize reads, how big a cache line is, etc. I know none of this stuff, maybe you do and could use it to speed up the hot spots.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 06:32:06 AM Last edit: January 28, 2016, 06:57:30 AM by sp_ |
|
You are an assembly language guy, do you reorder instructions to maximize instruction throughput? It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock, how many can be executed per clock, how deep the memory buffer is, does it delay writes to prioritize reads, how big a cache line is, etc. I know none of this stuff, maybe you do and could use it to speed up the hot spots.
This is something the compiler is very good at. The CUDA core is a 3+ operand RISC processor with up to 256 registers; it is built for the compiler. Sometimes you need to move code around, manually unroll some loops, etc., and verify the result by disassembling (this is what DJM34 is calling random stuff). But don't let the code size grow too big, the instruction cache is small. ... While you were trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.
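As a toy illustration of the manual unrolling sp_ mentions (the kernel and its placeholder rotate round are made up; only the #pragma unroll hint and the idea of checking the generated SASS, e.g. with cuobjdump -sass, are the point):

    #include <stdint.h>

    __global__ void toy_rounds(uint64_t *state)
    {
        uint64_t x = state[threadIdx.x];

        /* Fully unrolled: the compiler emits eight straight-line rounds
           instead of a loop, trading instruction-cache space for fewer
           branches. */
        #pragma unroll 8
        for (int i = 0; i < 8; i++)
            x = (x << 13) | (x >> 51);   /* placeholder round */

        state[threadIdx.x] = x;
    }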
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 06:33:30 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I know why changing the verification code made things faster. I wa scpumining 8 threads at the time so it was slowing down the CPU. But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 07:14:28 AM |
|
You are an assembly language guy, do you reorder instructions to maximize instruction throughput? It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock, how many can be executed per clock, how deep the memory buffer is, does it delay writes to prioritize reads, how big a cache line is, etc. I know none of this stuff, maybe you do and could use it to speed up the hot spots.
This is something the compiler is very good at. The CUDA core is a 3+ operand RISC processor with up to 256 registers; it is built for the compiler. Sometimes you need to move code around, manually unroll some loops, etc., and verify the result by disassembling (this is what DJM34 is calling random stuff). But don't let the code size grow too big, the instruction cache is small. ... While you were trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner. I was talking more about performing loads as soon as possible to give memory time to respond before you need the data. It also fills the cache line for subsequent loads. If CUDA supports read priority you can even issue a store before a load and the load will have priority; you just have to watch for register conflicts. There is also issuing different types of instructions on the same clock to improve superscalar operation. These kinds of things are hard for a normal compiler to do because they are specific to each processor, but if anyone can do it it's CUDA, because they have one HW architecture, one run-time system and one compiler. And another thing, you trolled me first.
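A rough host-side sketch of the "issue loads early" idea (the function and its placeholder work are hypothetical; the point is starting the next load before the dependent work on the current value):

    #include <stddef.h>
    #include <stdint.h>

    uint64_t sum_rounds(const uint64_t *data, size_t n)
    {
        uint64_t cur = n ? data[0] : 0;
        uint64_t acc = 0;

        for (size_t i = 0; i < n; i++) {
            /* Start the next load as soon as possible so memory has time to
               respond while the ALU works on the current value. */
            uint64_t next = (i + 1 < n) ? data[i + 1] : 0;

            acc += (cur << 7) ^ (cur >> 3);   /* placeholder dependent work */

            cur = next;
        }
        return acc;
    }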
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 07:17:41 AM |
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster. The cpu verification is only done when the gpu find a solution. I know why changing the verification code made things faster. I wa scpumining 8 threads at the time so it was slowing down the CPU. But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something. Tried that in cpuminer, didn't help. I only managed to get another 1% out of c11, not sure why, expected more, will take another look. No other algos benefit from the fast ctx reinit but you should try it in ccminer, the GPU kernel, that is.
|
|
|
|
a123
Member
Offline
Activity: 98
Merit: 10
|
|
January 28, 2016, 07:18:24 AM |
|
Used allanmac's code ( https://gist.github.com/allanmac/f91b67c112bcba98649d) from devtalk.nvidia.com to test TLB thrashing and compiled it with nvcc -m32. It didn't alleviate the TLB thrashing issue; 970 still dives like a stone past 2GB. Not sure if this is representative of memory bandwidth in Ethash though, just sharing to see if anyone can tweak and improve the TLB situation. Perhaps compiling the ether miner for 32 bits will help? Cached pointer sizes will go from 64 bit to 32 (and double the TLB limit?). You need to remove the cpu verification code because it uses 64-bit libraries I think..
Thought of that but it's going to be troublesome. You only have a 4GB address space, with Windows already sucking up ~half. Then you have to load the 1.3 GB DAG from disk, and allocate 1.3GB of GPU RAM (which, AFAIK, sits in the same space, although it isn't pinned to host). This doesn't fit. So then you would have to read the DAG from disk in small chunks and copy it over to GPU RAM. And when that's all done, you would have to pass all solutions on to a special light version of ethminer that does light verification, as it can't load a DAG into RAM for the same reasons. Or you simply don't verify and risk some boo's. Then, when that's all done, you're not even sure if it fixes the problem. You could try getting a 32-bit version of dagSimCL to work. I believe Epsylon3/tpruvot was trying to get a 32-bit version of ethminer to work a while back. Can't find the source anymore.
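A hedged sketch of that chunked-upload idea (file name, chunk size and error handling are illustrative only; the CUDA runtime calls are the standard cudaMalloc/cudaMemcpy): stream the DAG into device memory through a small staging buffer so a 32-bit process never has to map the whole 1.3 GB at once.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHUNK_BYTES (16u * 1024u * 1024u)   /* 16 MB staging buffer */

    /* Hypothetical helper: copy a large DAG file to the GPU in small pieces. */
    int upload_dag(const char *path, unsigned char **d_dag_out, size_t dag_bytes)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;

        unsigned char *staging = (unsigned char *)malloc(CHUNK_BYTES);
        unsigned char *d_dag = NULL;
        cudaMalloc((void **)&d_dag, dag_bytes);          /* device-side DAG */

        size_t off = 0;
        while (off < dag_bytes) {
            size_t want = dag_bytes - off;
            if (want > CHUNK_BYTES) want = CHUNK_BYTES;
            if (fread(staging, 1, want, f) != want) break;
            cudaMemcpy(d_dag + off, staging, want, cudaMemcpyHostToDevice);
            off += want;
        }

        free(staging);
        fclose(f);
        *d_dag_out = d_dag;
        return off == dag_bytes ? 0 : -1;
    }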
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 07:49:34 AM |
|
While you where trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.
Since I forked cpuminer I've increased performance by up to 92% (x13), 75% (x15), 36% (qubit) and 27% (quark). I can't take credit for all of it because it was just plugging in faster functions that already existed. But all the gains in quark are mine.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 07:53:45 AM |
|
Used allanmac's code ( https://gist.github.com/allanmac/f91b67c112bcba98649d) from devtalk.nvidia.com to test TLB thrashing and compiled it with nvcc -m32. It didn't alleviate the TLB thrashing issue; 970 still dives like a stone past 2GB. Not sure if this is representative of memory bandwidth in Ethash though, just sharing to see if anyone can tweak and improve the TLB situation. Did you try in a 32bit OS?
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 07:54:12 AM |
|
While you where trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.
Since I forked cpuminer I've increased performance by up to 92% (x13), 75% (x15), 36% (qubit) and 27% (quark). I can't take credit for all of it because it was just plugging in faster functions that already existed. But all the gains in quark are mine. This is good.
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 28, 2016, 10:08:10 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
|
|
|
|
myagui
Legendary
Offline
Activity: 1154
Merit: 1001
|
|
January 28, 2016, 10:12:11 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
Joblo's optimization impacts CPU validation of any found shares. This is usually insignificant, but since he's also mining with all CPU cores, it did have an impact for him: his CPU mining was slowing down ccminer. Joblo: You're invited for a beer over at #ccminer @freenode: there's friendlier dev talk there, some collaboration now and then, and certainly a lot less BS.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 10:16:45 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
My private quark is +30% up from release 74. The buyable private is +5%.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
January 28, 2016, 10:31:43 AM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
My private quark is +30% up from release 74. The buyable private is +5%. If you didn't work on SIMD, I'm surprised and disappointed. If I did I wouldn't open source it, would I? What is the point? Why don't you open source yours..
|
|
|
|
|