Bitcoin Forum
Author Topic: CCminer(SP-MOD) Modded GPU kernels.  (Read 2347570 times)
joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 12:35:19 AM
 #9261

when is the last time you delivered 30% in less than an hour?

Today. Your quark kernel.

Since skein is much faster than groestl we only do skein and throw away 50% of the hashes.

    /* Bit 3 of hash[0] picks the next round: groestl if set, skein if clear. */
    if (hash[0] & 0x8)
    {
        sph_groestl512_init(&ctx_groestl);
        sph_groestl512(&ctx_groestl, (const void*) hash, 64);
        sph_groestl512_close(&ctx_groestl, (void*) hash);
    }
    else
    {
        sph_skein512_init(&ctx_skein);
        sph_skein512(&ctx_skein, (const void*) hash, 64);
        sph_skein512_close(&ctx_skein, (void*) hash);
    }


An optimization was tried in cpuminer: if it was determined that a second
round of groestl was necessary, the existing hashes were thrown away, on the belief
that completing the second groestl would take longer than starting over. It didn't work.

However, I might try ccminer's logic. cpuminer uses a state machine as
the engine. ccminer just uses a simple if.

I'm also going to look at other contexts. Selectively reinitializing the necessary fields may be
quicker than the current implementation of copying a saved initialized context.
Both are quicker than what ccminer does.
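
The two reinit strategies can be sketched like this, assuming a toy context layout. `toy_ctx` and its fields are hypothetical; real sph_* contexts differ, and the field-wise version is only valid if stale buffer bytes are never read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hash context, much smaller than a real one. */
typedef struct {
    uint64_t state[8];
    uint8_t  buf[64];
    uint32_t buf_len;
    uint64_t bit_count;
} toy_ctx;

/* Full initialisation, done once up front. */
static void toy_init(toy_ctx *c)
{
    for (int i = 0; i < 8; i++)
        c->state[i] = 0x0123456789abcdefULL ^ (uint64_t)i;
    memset(c->buf, 0, sizeof c->buf);
    c->buf_len = 0;
    c->bit_count = 0;
}

/* Strategy 1: copy a saved, already initialised context. */
static void toy_reinit_copy(toy_ctx *c, const toy_ctx *saved)
{
    memcpy(c, saved, sizeof *c);
}

/* Strategy 2: reset only the fields a hash run modifies. The 64-byte
 * buffer is left alone on the assumption that bytes past buf_len are
 * never read. */
static void toy_reinit_fields(toy_ctx *c, const toy_ctx *saved)
{
    memcpy(c->state, saved->state, sizeof c->state);
    c->buf_len = 0;
    c->bit_count = 0;
}
```

Whether the second form is actually faster depends on how much of the context is untouched buffer, so it needs measuring per algorithm.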

AKA JayDDee, cpuminer-opt developer. https://github.com/JayDDee/cpuminer-opt
https://bitcointalk.org/index.php?topic=5226770.msg53865575#msg53865575
BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT,
sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 12:51:04 AM
 #9262

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
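
One way to read "precalc tables of the states" is a table of precomputed initial states indexed by the selector bit, so the choice becomes a lookup instead of an if/else. A minimal C sketch under that assumption; the table contents and the names `init_state` and `select_state` are made up for illustration and are not taken from ccminer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical precomputed initial states, one row per branch. */
static const uint32_t init_state[2][4] = {
    { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u },  /* bit clear */
    { 0xaaaaaaaau, 0xbbbbbbbbu, 0xccccccccu, 0xddddddddu },  /* bit set   */
};

/* Pick the starting state from bit 3 of the hash word with an index
 * computation rather than a conditional branch. */
static const uint32_t *select_state(uint32_t hash_word)
{
    return init_state[(hash_word >> 3) & 1u];
}
```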

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW EVRPROGPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 12:54:04 AM
 #9263

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.

Whatever it is it's faster.

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 12:55:22 AM
 #9264

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.

The cpu verification is only done when the gpu finds a solution.
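
That flow can be sketched in C as follows, assuming a toy reference hash: the GPU reports candidate nonces and only those are re-hashed on the CPU. `reference_hash` and `count_verified` are hypothetical names, not ccminer functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU reference hash; a real miner would run the full chain. */
static uint32_t reference_hash(uint32_t nonce)
{
    uint32_t h = nonce ^ 0x5bd1e995u;
    h *= 0x9e3779b1u;
    return h ^ (h >> 15);
}

/* Re-hash only the nonces the GPU reported and count those that really
 * meet the target. The full nonce range is never hashed on the CPU. */
static size_t count_verified(const uint32_t *candidates, size_t n, uint32_t target)
{
    assert(candidates != NULL || n == 0);
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        if (reference_hash(candidates[i]) <= target)
            ok++;
    return ok;
}
```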

joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 12:58:24 AM
 #9265

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.

My changes have nothing to do with avoiding branches but avoiding work.

joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 01:00:23 AM
 #9266

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.

The cpu verification is only done when the gpu finds a solution.

I may not have realized I was looking at verification code at the time but I know what it is.
Maybe my changes can be applied to the GPU code and you'll get your 30%

joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 01:12:09 AM
 #9267

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.

The cpu verification is only done when the gpu finds a solution.

I know why changing the verification code made things faster. I was CPU mining with 8
threads at the time, so it was slowing down the CPU.

joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 03:42:51 AM
 #9268

You are an assembly language guy. Do you reorder instructions to maximize instruction throughput?
It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock,
how many can be executed per clock, how deep the memory buffer is, whether it delays writes to prioritize
reads, how big a cache line is, etc. I know none of this stuff; maybe you do and could use it to speed
up the hot spots.

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 06:32:06 AM
Last edit: January 28, 2016, 06:57:30 AM by sp_
 #9269

You are an assembly language guy. Do you reorder instructions to maximize instruction throughput?
It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock,
how many can be executed per clock, how deep the memory buffer is, whether it delays writes to prioritize
reads, how big a cache line is, etc. I know none of this stuff; maybe you do and could use it to speed
up the hot spots.

This is something the compiler is very good at. The CUDA core is a 3+ operation RISC processor with up to 256 registers.
It is built for the compiler.

Sometimes you need to move code around, manually unroll some loops, etc. Verify the result by disassembling (this is what DJM34 is calling random stuff).

But don't let the code size grow too big; the instruction cache is small.
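
As an illustration of manual unrolling (plain C rather than CUDA, with hypothetical function names), here is the same reduction written as a simple loop and unrolled by four. The unrolled version trades code size for fewer loop iterations and four independent dependency chains, which is exactly the code-size trade-off mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop. */
static uint32_t xor_fold(const uint32_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= v[i];
    return acc;
}

/* Same result, unrolled by four with independent accumulators. */
static uint32_t xor_fold_unrolled4(const uint32_t *v, size_t n)
{
    uint32_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a ^= v[i];
        b ^= v[i + 1];
        c ^= v[i + 2];
        d ^= v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        a ^= v[i];
    return a ^ b ^ c ^ d;
}
```

A compiler may already unroll the simple loop on its own, so checking the disassembly before and after is the only way to know whether the manual version changed anything.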
...

While you were trolling my thread I added another 0.4% in the decred algo.
I will try to do 5% and include it in my donation miner.

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 06:33:30 AM
 #9270

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.
The cpu verification is only done when the gpu finds a solution.
I know why changing the verification code made things faster. I was CPU mining with 8
threads at the time, so it was slowing down the CPU.

But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something.

joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 07:14:28 AM
 #9271

You are an assembly language guy. Do you reorder instructions to maximize instruction throughput?
It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock,
how many can be executed per clock, how deep the memory buffer is, whether it delays writes to prioritize
reads, how big a cache line is, etc. I know none of this stuff; maybe you do and could use it to speed
up the hot spots.

This is something the compiler is very good at. The CUDA core is a 3+ operation RISC processor with up to 256 registers.
It is built for the compiler.

Sometimes you need to move code around, manually unroll some loops, etc. Verify the result by disassembling (this is what DJM34 is calling random stuff).

But don't let the code size grow too big; the instruction cache is small.
...

While you were trolling my thread I added another 0.4% in the decred algo.
I will try to do 5% and include it in my donation miner.

I was talking more about performing loads as soon as possible to give time for mem to respond before
you need the data. It also fills the cache line for subsequent loads. If CUDA supports read priority you
can even issue a store before a load and the load will have priority. You just have to watch for register
conflicts.

There is also issuing different types of instructions on the same clock to improve superscalar
operation.

These kinds of things are hard for a normal compiler to do because they are specific to each processor,
but if anyone can do it it's CUDA, because they have one HW architecture, one runtime system and
one compiler.
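
A small C sketch of the "load early" idea, with hypothetical function names: the next element is loaded before the current one is processed, so the memory access can overlap with the arithmetic. An optimizing compiler or an out-of-order CPU may do this on its own, so treat it as an illustration of the scheduling, not a guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-element work. */
static uint32_t mix(uint32_t x)
{
    return (x ^ (x >> 7)) * 0x9e3779b1u;
}

/* Baseline: load each element, then use it immediately. */
static uint32_t sum_plain(const uint32_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += mix(v[i]);
    return acc;
}

/* Load element i+1 before working on element i. */
static uint32_t sum_early_load(const uint32_t *v, size_t n)
{
    if (n == 0)
        return 0;
    uint32_t acc = 0;
    uint32_t cur = v[0];
    for (size_t i = 0; i + 1 < n; i++) {
        uint32_t next = v[i + 1];  /* issue the next load first */
        acc += mix(cur);           /* work on data already in a register */
        cur = next;
    }
    return acc + mix(cur);         /* last element */
}
```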

And another thing, you trolled me first. :)

joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 07:17:41 AM
 #9272

this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.
The cpu verification is only done when the gpu finds a solution.
I know why changing the verification code made things faster. I was CPU mining with 8
threads at the time, so it was slowing down the CPU.

But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something.

Tried that in cpuminer, didn't help. I only managed to get another 1% out of c11, not sure why, expected more,
will take another look.

No other algos benefit from the fast ctx reinit, but you should try it in ccminer (the GPU kernel, that is).

a123
Member
**
Offline Offline

Activity: 98
Merit: 10


View Profile
January 28, 2016, 07:18:24 AM
 #9273

Used allanmac's code (https://gist.github.com/allanmac/f91b67c112bcba98649d) from devtalk.nvidia.com to test TLB thrashing and compiled it with nvcc -m32. It didn't alleviate the TLB thrashing issue; 970 still dives like a stone past 2GB. Not sure if this is representative of memory bandwidth in Ethash though, just sharing to see if anyone can tweak and improve the TLB situation.

Perhaps compiling the ether miner for 32 bits will help? Cached pointer sizes will go from 64-bit to 32-bit (and double the TLB limit?). You need to remove the CPU verification code because it uses 64-bit libraries, I think.

Thought of that but it's going to be troublesome. You only have a 4GB address space, with Windows already sucking up ~half. Then you have to load the 1.3 GB DAG from disk, and allocate 1.3GB of GPU RAM (which, AFAIK, sits in the same space, although it isn't pinned to host). This doesn't fit. So then you would have to read the DAG from disk in small chunks and copy it over to GPU RAM. And when that's all done, you will have to pass on all solutions to a special light version of ethminer that does light verification, as it can't load a DAG into RAM for the same reasons. Or you simply don't verify and risk some Boo's.
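
The "read in small chunks and copy over" step could look roughly like this in C. It is only a sketch: `upload_fn` is a hypothetical hook standing in for whatever host-to-GPU copy the miner uses, and `copy_to_buffer` writes to plain host memory so the example stays self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical upload hook: receives one chunk and its offset in the file. */
typedef int (*upload_fn)(const void *chunk, size_t len, size_t offset, void *user);

/* Example hook that copies into a host buffer instead of GPU RAM. */
static int copy_to_buffer(const void *chunk, size_t len, size_t offset, void *user)
{
    memcpy((unsigned char *)user + offset, chunk, len);
    return 0;
}

/* Stream `src` through one small staging buffer, so the whole file never
 * has to fit in the address space at once.
 * Returns the number of bytes handed to `upload`, or -1 on error. */
static long long stream_in_chunks(FILE *src, size_t chunk_size,
                                  upload_fn upload, void *user)
{
    assert(src != NULL && chunk_size > 0 && upload != NULL);
    unsigned char *staging = malloc(chunk_size);
    if (staging == NULL)
        return -1;

    long long total = 0;
    size_t got;
    while ((got = fread(staging, 1, chunk_size, src)) > 0) {
        if (upload(staging, got, (size_t)total, user) != 0) {
            free(staging);
            return -1;
        }
        total += (long long)got;
    }
    int failed = ferror(src);
    free(staging);
    return failed ? -1 : total;
}
```

With a staging buffer of a few megabytes, host memory use stays flat regardless of DAG size; the GPU allocation is a separate problem that this does not address.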

Then, when that's all done, you're not even sure if it fixes the problem. You could try getting a 32-bit version of dagSimCL to work.

I believe Epsylon3/tpruvot has been trying to get a 32-bit version of ethminer to work a while back. Can't find the source anymore.
joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
January 28, 2016, 07:49:34 AM
 #9274

While you were trolling my thread I added another 0.4% in the decred algo.
I will try to do 5% and include it in my donation miner.

Since I forked cpuminer I've increased performance by up to 92% (x13), 75% (x15), 36% (qubit)
and 27% (quark). I can't take credit for all of it because it was just plugging in faster
functions that already existed. But all the gains in quark are mine.

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 07:53:45 AM
 #9275

Used allanmac's code (https://gist.github.com/allanmac/f91b67c112bcba98649d) from devtalk.nvidia.com to test TLB thrashing and compiled it with nvcc -m32. It didn't alleviate the TLB thrashing issue; 970 still dives like a stone past 2GB. Not sure if this is representative of memory bandwidth in Ethash though, just sharing to see if anyone can tweak and improve the TLB situation.

Did you try in a 32bit OS?

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 07:54:12 AM
 #9276

While you were trolling my thread I added another 0.4% in the decred algo.
I will try to do 5% and include it in my donation miner.
Since I forked cpuminer I've increased performance by up to 92% (x13), 75% (x15), 36% (qubit)
and 27% (quark). I can't take credit for all of it because it was just plugging in faster
functions that already existed. But all the gains in quark are mine.

This is good.

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
January 28, 2016, 10:08:10 AM
 #9277

joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?

myagui
Legendary
*
Offline Offline

Activity: 1154
Merit: 1001



View Profile
January 28, 2016, 10:12:11 AM
 #9278

joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?

Joblo's optimization impacts CPU validation of any found shares. This is usually insignificant, but since he's also mining with all CPU cores, it did have an impact for him. It was that his CPU mining was slowing down ccminer.

Joblo: You're invited for a beer over at #ccminer @freenode: there's friendlier dev talk there, some collaboration now and then, and certainly a lot less BS ;)

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 10:16:45 AM
 #9279

joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?

My private quark is +30% up from release 74. The buyable private is +5%

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2954
Merit: 1087

Team Black developer


View Profile
January 28, 2016, 10:31:43 AM
 #9280

joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
My private quark is +30% up from release 74. The buyable private is +5%
If you didn't work on SIMD, I'm surprised and disappointed.

If I did, I wouldn't open-source it, would I? What is the point? Why don't you open-source yours?
