Bitcoin Forum
October 21, 2025, 08:13:29 AM *
Author Topic: CCminer(SP-MOD) Modded GPU kernels.  (Read 2348002 times)
pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
March 14, 2016, 09:54:55 AM
 #10021

On a side note, I've worked on the hodlcoin algo (which is similar to memorycoin and has a 1GB scratchpad of "random" data).
It is a bit different from the dag file because it depends on the block header (including the nonce), but it is still a similar "memory hard" algo.
As a test I tried generating the scratchpad slice I need on the fly, instead of doing it all in advance. That way you only need 8KB of data instead of 1GB.
Without specific optimisations, it was about half the speed (on CPU). On GPU, it would probably be faster than keeping the full buffer.
It is very interesting because generating on the fly means an order of magnitude more calculations (for the sha512 part), yet it is only 2 times slower because of the much better cache usage.
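
The trade-off described above can be sketched in a few lines of Python. This is a toy model only, not the real hodlcoin scratchpad: it assumes each 64-byte chunk is simply SHA-512 of the header plus a chunk index, and the function names are made up for illustration. It shows that a slice generated on demand matches the same bytes from a fully precomputed buffer, so only the slice ever needs to be held in memory.

```python
import hashlib
import struct

CHUNK = 64  # size of one SHA-512 digest in bytes


def chunk(header, index):
    """Toy scratchpad chunk: SHA-512 over the header plus a 32-bit chunk index."""
    return hashlib.sha512(header + struct.pack("<I", index)).digest()


def full_scratchpad(header, n_chunks):
    """Precompute the whole buffer up front (the 1GB approach, scaled down)."""
    return b"".join(chunk(header, i) for i in range(n_chunks))


def scratchpad_slice(header, offset, length):
    """Generate only the chunks that cover [offset, offset + length)."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    data = b"".join(chunk(header, i) for i in range(first, last + 1))
    start = offset - first * CHUNK
    return data[start:start + length]
```

The on-the-fly version redoes the hashing for every slice it is asked for, which is the extra calculation cost mentioned above, but its working set stays small enough to sit in cache.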

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2926
Merit: 1087

Team Black developer


View Profile
March 14, 2016, 09:55:20 AM
 #10022

Dead on. But I looked at some Blake-256 code, Kramble's. Probably different from the code you're thinking of, but anyways, it wasn't all that awesome. So you'd probably want to do your own design if you're gonna commit to a manufacturing run.
To reduce the cost there are co-ops between developers to print an asic with many kernels in one chip and split the cost. Since blake-256 will take little chip space you might get a good deal, i.e. pay 1% of the cost of the chip. But then you need to draw the circuit board, print it (expensive), mount it, and write code for it. If decred's market cap goes up to 20M USD it might be worth it...
I think FPGA is the way to go. Less investment, programmable, but also much slower than an asic.
Fuck that noise, if I'm making an ASIC for a coin, I may as well go whole hog. I want to fit as many Blake-256 hashing cores as I can on each chip. I have a Blake-256 Decred implementation I run on FPGA that uses a 56-stage pipeline in order to keep outputting one result per clock tick, yet have very little delay so it'll clock to the moon. I can fit two of them on my Cyclone V - imagine what you could have for hashrate if you could pack a shitton of them on a chip made with the latest fabs (14nm) and put multiple chips on a board...

Here are some numbers from Blakecoin (8-round blake-256):

1.6GH/s on a ZTEX USB-FPGA 1.15y Quad Spartan-6 LX150 Development Board
1.5GH/s on a Enterpoint Cairnsmore 1 Quad Spartan-6 LX150 Development Board
960MH/s on a Lancelot Dual Spartan-6 LX150 Development Board
360MH/s on a ZTEX USB-FPGA 1.15x Spartan-6 LX150 Development Board
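
To compare those boards on equal footing, the figures can be divided by the number of Spartan-6 LX150 chips on each board. A minimal sketch in Python; the board labels are shortened and the helper name is only for illustration, while the hashrates and chip counts are the ones listed above.

```python
# Hashrates quoted above in MH/s, with the number of Spartan-6 LX150 chips per board.
boards = {
    "ZTEX 1.15y (quad)": (1600, 4),
    "Cairnsmore 1 (quad)": (1500, 4),
    "Lancelot (dual)": (960, 2),
    "ZTEX 1.15x (single)": (360, 1),
}


def per_chip(total_mhs, chips):
    """Average hashrate per FPGA in MH/s."""
    return total_mhs / chips


per_chip_rates = {
    name: per_chip(total, chips) for name, (total, chips) in boards.items()
}

for name, rate in per_chip_rates.items():
    print(f"{name}: {rate:.0f} MH/s per chip")
```

By that measure every board lands between 360 and 480 MH/s per LX150.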

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW EVRPROGPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
chrysophylax
Legendary
*
Offline Offline

Activity: 3122
Merit: 1093


--- ChainWorks Industries ---


View Profile WWW
March 14, 2016, 09:55:54 AM
 #10023

Dead on. But I looked at some Blake-256 code, Kramble's. Probably different from the code you're thinking of, but anyways, it wasn't all that awesome. So you'd probably want to do your own design if you're gonna commit to a manufacturing run.

To reduce the cost there are co-ops between developers to print an asic with many kernels in one chip and split the cost. Since blake-256 will take little chip space you might get a good deal, i.e. pay 1% of the cost of the chip. But then you need to draw the circuit board, print it (expensive), mount it, and write code for it. If decred's market cap goes up to 20M USD it might be worth it...

I think FPGA is the way to go. Less investment, programmable, but also much slower than an asic.

Fuck that noise, if I'm making an ASIC for a coin, I may as well go whole hog. I want to fit as many Blake-256 hashing cores as I can on each chip. I have a Blake-256 Decred implementation I run on FPGA that uses a 56-stage pipeline in order to keep outputting one result per clock tick, yet have very little delay so it'll clock to the moon. I can fit two of them on my Cyclone V - imagine what you could have for hashrate if you could pack a shitton of them on a chip made with the latest fabs (14nm) and put multiple chips on a board...

which is THE reason im looking for an investor to take on the 'challenge' ...

ooops! ... did i say that out loud? ...

Wink ...

#crysx

Grim
Sr. Member
****
Offline Offline

Activity: 506
Merit: 252


View Profile
March 14, 2016, 10:01:29 AM
 #10024

@ Wolf

So what is your idea of an ASIC resistant algo?
The most extreme algo in that direction is probably Burst.

But since anything can be calculated on the fly ... this is a losing battle?



I'm sorry but fpgas and asics are VERY much against a decentralized distribution.
You guys think too much about how to milk a coin and forget, on the other hand, that nobody cares for a coin which gets milked. Like shooting yourself in the foot.

(Yes I know you can make a shitton of money that way, but it essentially is against EVERYTHING cryptocoins stand for)
Ayers
Legendary
*
Offline Offline

Activity: 2940
Merit: 1024


Make Your Own Fortune


View Profile
March 14, 2016, 10:11:53 AM
 #10025

also if ethereum goes PoS, another big coin will emerge, probably decred, so a pump there is not so unexpected in the near future
the money will always move one way or another and diff will follow

one HUGE flaw ...

Decred will have ASICs in a matter of months. Easy to implement compute-only algo. (kindergarten)


And ETH actually has a VERY hard memory algo which has pretty much the best ASIC resistance in existence.
Yet exactly that coin goes PoS ... (strange world we live in, ain't it)


If the algos were the other way around you would be right ...

that may be right, but decred can always be forked to a better algo. there is an evolution of the ethereum algo, the one used by HODL coin (i'm not sure); they could use that in the future if decred gets big
or maybe a new strong currency will emerge with that algo or a new one, you never know; just as decred emerged from nothing, another altcoin can do the same

Ayers
Legendary
*
Offline Offline

Activity: 2940
Merit: 1024


Make Your Own Fortune


View Profile
March 14, 2016, 10:12:50 AM
 #10026

@ Wolf

So what is your idea of an ASIC resistant algo?
The most extreme algo in that direction is probably Burst.

But since anything can be calculated on the fly ... this is a losing battle?



I'm sorry but fpgas and asics are VERY much against a decentralized distribution.
You guys think too much about how to milk a coin and forget, on the other hand, that nobody cares for a coin which gets milked. Like shooting yourself in the foot.

(Yes I know you can make a shitton of money that way, but it essentially is against EVERYTHING cryptocoins stand for)


monero is a good candidate: not even gpus are efficient there, so asics will not be efficient either. hodlcoin uses an evolution of the monero algo, so that is the way to go

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
March 14, 2016, 10:16:31 AM
 #10027

@ Wolf

So what is your idea of an ASIC resistant algo?
The most extreme algo in that direction is probably Burst.

But since anything can be calculated on the fly ... this is a losing battle?



I'm sorry but fpgas and asics are VERY much against a decentralized distribution.
You guys think too much about how to milk a coin and forget, on the other hand, that nobody cares for a coin which gets milked. Like shooting yourself in the foot.

(Yes I know you can make a shitton of money that way, but it essentially is against EVERYTHING cryptocoins stand for)


monero is a good candidate: not even gpus are efficient there, so asics will not be efficient either. hodlcoin uses an evolution of the monero algo, so that is the way to go

gpus are no more efficient than cpus on monero and hodl because cpus use the aes extension.
if the gpu had the same, they'd be much more efficient than cpus.
still the post by wolf0 is valid.
you don't need to be "memory hard", you need a "changing" algo so a fixed chip design is more difficult.
that's not the case of monero and hodl.

Grim
Sr. Member
****
Offline Offline

Activity: 506
Merit: 252


View Profile
March 14, 2016, 10:22:26 AM
 #10028

you don't need to be "memory hard", you need a "changing" algo so a fixed chip design is more difficult.
that's not the case of monero and hodl.

so how is that done? any example already out there?
pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
March 14, 2016, 10:26:53 AM
 #10029

you don't need to be "memory hard", you need a "changing" algo so a fixed chip design is more difficult.
that's not the case of monero and hodl.

so how is that done? any example already out there?

not that I know of.
but I don't understand all that "asic resistant" hype.
as people have been saying for years: if it's worth it, asics will come.
worth = high market cap: if you invested in the coin, you should be happy, not sad ;-)

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2926
Merit: 1087

Team Black developer


View Profile
March 14, 2016, 01:13:34 PM
 #10030

Ok. I have found a way to do the optimal decred kernel now.

http://stackoverflow.com/questions/15842507/passing-the-ptx-program-to-the-cuda-driver-directly

So I will generate the ptx assembly with the midstate data included in the instructions. Then every time the midstate changes, I recompile the kernel at runtime with the API calls described in the article.
To estimate the speed gain you can replace all the constant memory accesses with constants in the 1.7.4 code. Since the source code will be ptx assembly I can also support linux users. Since operations on constants can be precalculated, the compiler will reduce the number of instructions needed for you, so you end up with a kernel that uses fewer instructions than before.
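
The effect of baking the midstate into the kernel can be illustrated with a small Python sketch. It is not Blake-256 and not the ccminer kernel: `mix`, `generic_hash` and `specialize` are hypothetical stand-ins, assuming a hash whose first steps touch only header constants. Everything that depends only on those constants is computed once per midstate, which is the same work a compiler removes by constant folding when the values are literals in the source.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic in 32 bits


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def mix(a, b, m):
    """Toy add/xor/rotate step, loosely in the style of a Blake G function."""
    a = (a + b + m) & MASK
    b = rotr(b ^ a, 7)
    return a, b


def generic_hash(d_data, nonce):
    """Generic version: every step runs for every nonce."""
    a, b = d_data[0], d_data[1]
    a, b = mix(a, b, d_data[2])          # depends on header constants only
    a, b = mix(a, b, d_data[3])          # depends on header constants only
    a, b = mix(a, b, nonce)              # depends on the nonce
    a, b = mix(a, b, d_data[2] ^ nonce)  # depends on the nonce
    return a ^ b


def specialize(d_data):
    """Precompute the constant-only steps once and return a per-nonce function."""
    a, b = d_data[0], d_data[1]
    a, b = mix(a, b, d_data[2])
    a, b = mix(a, b, d_data[3])
    m2 = d_data[2]

    def specialized_hash(nonce):
        x, y = mix(a, b, nonce)
        x, y = mix(x, y, m2 ^ nonce)
        return x ^ y

    return specialized_hash
```

The specialized function does half the mixing steps per nonce in this toy, but it is only valid for the one set of constants it was built from, so it has to be rebuilt whenever they change.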



Release #4 will be near optimal..

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
March 14, 2016, 01:16:57 PM
 #10031

Ok. I have found a way to do the optimal decred kernel now.

http://stackoverflow.com/questions/15842507/passing-the-ptx-program-to-the-cuda-driver-directly

So I will generate the ptx assembly with the midstate data included in the instructions. Then every time the midstate changes, I recompile the kernel at runtime with the API calls described in the article.
To estimate the speed gain you can replace all the constant memory accesses with constants in the 1.7.4 code. Release #4 will be optimal.


Interesting technique.
But I doubt you'll gain even 1% from it, likely less.

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2926
Merit: 1087

Team Black developer


View Profile
March 14, 2016, 01:20:19 PM
Last edit: March 14, 2016, 01:34:28 PM by sp_
 #10032

Ok. I have found a way to do the optimal decred kernel now.
http://stackoverflow.com/questions/15842507/passing-the-ptx-program-to-the-cuda-driver-directly
So I will generate the ptx assembly with the midstate data included in the instructions. Then every time the midstate changes, I recompile the kernel at runtime with the API calls described in the article.
To estimate the speed gain you can replace all the constant memory accesses with constants in the 1.7.4 code. Release #4 will be optimal.
Interesting technique.
But I doubt you'll gain even 1% from it, likely less.

you will, because some of the first rounds will be gone (instructions are removed since they work on constant data). You can try it: replace d_data[0]...d_data[23] with constant data such as 0x01234567, making sure every constant is different from the others. Compile, read the ptx, and count the lines before and after.

Then you don't have a 14-round blake kernel, but a 12-round blake kernel that only works for one midstate, and that solves the 14-round blake problem for that one midstate.
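
Counting the lines before and after can be scripted. The helper below is a rough heuristic, assuming one PTX instruction per line: it counts lines that end in `;` and are not directives (which start with `.`). The two PTX snippets are hand-written for illustration and are not real compiler output from the decred kernel.

```python
def count_ptx_instructions(ptx_text):
    """Count instruction lines in PTX text, skipping directives, labels and comments."""
    count = 0
    for raw in ptx_text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line.endswith(";"):
            continue  # blank lines, braces, labels, entry headers
        if line.startswith("."):
            continue  # directives such as .reg or .version
        count += 1
    return count


# Hand-written example: the constant is fetched from memory at run time.
GENERIC = """
.visible .entry generic(.param .u64 p)
{
    .reg .b32 %r<4>;
    .reg .b64 %rd<2>;
    ld.param.u64 %rd1, [p];
    ld.global.u32 %r1, [%rd1];   // constant read at run time
    add.s32 %r2, %r1, 7;
    st.global.u32 [%rd1], %r2;
    ret;
}
"""

# Hand-written example: the load and add have been folded into one immediate.
SPECIALIZED = """
.visible .entry specialized(.param .u64 p)
{
    .reg .b32 %r<2>;
    .reg .b64 %rd<2>;
    ld.param.u64 %rd1, [p];
    mov.u32 %r1, 19088750;       // value folded at compile time
    st.global.u32 [%rd1], %r1;
    ret;
}
"""

print(count_ptx_instructions(GENERIC), count_ptx_instructions(SPECIALIZED))
```

Run against the real generic and constant-substituted ptx files, the difference between the two counts gives a first estimate of how many instructions the constants remove.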

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
March 14, 2016, 01:52:19 PM
 #10033

Ok. I have found a way to do the optimal decred kernel now.
http://stackoverflow.com/questions/15842507/passing-the-ptx-program-to-the-cuda-driver-directly
So I will generate the ptx assembly with the midstate data included in the instructions. Then every time the midstate changes, I recompile the kernel at runtime with the API calls described in the article.
To estimate the speed gain you can replace all the constant memory accesses with constants in the 1.7.4 code. Release #4 will be optimal.
Interesting technique.
But I doubt you'll gain even 1% from it, likely less.

you will, because some of the first rounds will be gone (instructions are removed since they work on constant data). You can try it: replace d_data[0]...d_data[23] with constant data such as 0x01234567, making sure every constant is different from the others. Compile, read the ptx, and count the lines before and after.

Then you don't have a 14-round blake kernel, but a 12-round blake kernel that only works for one midstate, and that solves the 14-round blake problem for that one midstate.

so, when in solo mode, every time you get a new transaction or block (and on a pool it's not much different), you will recompile the kernel? doesn't look optimal to me.

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2926
Merit: 1087

Team Black developer


View Profile
March 14, 2016, 01:55:49 PM
Last edit: March 14, 2016, 03:06:04 PM by sp_
 #10034

Ok. I have found a way to do the optimal decred kernel now.
http://stackoverflow.com/questions/15842507/passing-the-ptx-program-to-the-cuda-driver-directly
So I will generate the ptx assembly with the midstate data included in the instructions. Then every time the midstate changes, I recompile the kernel at runtime with the API calls described in the article.
To estimate the speed gain you can replace all the constant memory accesses with constants in the 1.7.4 code. Release #4 will be optimal.
Interesting technique.
But I doubt you'll gain even 1% from it, likely less.
you will, because some of the first rounds will be gone (instructions are removed since they work on constant data). You can try it: replace d_data[0]...d_data[23] with constant data such as 0x01234567, making sure every constant is different from the others. Compile, read the ptx, and count the lines before and after.
Then you don't have a 14-round blake kernel, but a 12-round blake kernel that only works for one midstate, and that solves the 14-round blake problem for that one midstate.
so, when in solo mode, every time you get a new transaction or block (and on a pool it's not much different), you will recompile the kernel? doesn't look optimal to me.

There is a faster way: poke the new constants directly into the gpu binary (self-modifying code). Once the binary has been built, only 24 (+) constants need to be changed (on a new transaction or block). Then the kernel needs to be reloaded to the gpu with a cache flush (cuda device reset), or perhaps there is an api call that can load/reload a .cubin file directly.

sp_ (OP)
Legendary
*
Offline Offline

Activity: 2926
Merit: 1087

Team Black developer


View Profile
March 14, 2016, 03:16:34 PM
 #10035

I kinda doubt there's a documented and stable, supported method of doing so...
You can do it safely:

1. Put the compiled cubin on a ramdisk (virtual memory drive).
2. Poke the constant values directly into the file with the cpu. (The locations can be found by disassembly, and the offsets might change from compiler to compiler, i.e. between cuda versions.)
3. Call the cuda api function cuModuleLoad.

https://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/group__CUDA__MODULE_g366093bd269dafd0af21f1c7d18115d3.html
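
The poke step can be sketched on the host side as plain byte patching. This is an illustration under assumptions, not the miner's code: the offsets and placeholder values are hypothetical (the real ones would come from disassembling the cubin), the helper name is made up, and the constants are assumed to be stored as 32-bit little-endian words. Checking the old value before overwriting guards against offsets that moved after a compiler change.

```python
import struct


def patch_constants(image, patches):
    """Return a copy of image with 32-bit little-endian words replaced.

    patches maps a byte offset to (expected_old_value, new_value).
    """
    buf = bytearray(image)
    for offset, (old, new) in patches.items():
        (current,) = struct.unpack_from("<I", buf, offset)
        if current != old:
            raise ValueError(
                f"offset {offset}: found {current:#010x}, expected {old:#010x}"
            )
        struct.pack_into("<I", buf, offset, new)
    return bytes(buf)


# Stand-in for a compiled binary with one placeholder constant at offset 4.
fake_image = b"\x00" * 4 + struct.pack("<I", 0x01234567) + b"\x00" * 8
patched = patch_constants(fake_image, {4: (0x01234567, 0xDEADBEEF)})
```

The patched bytes would then be written to the file that cuModuleLoad reads. The driver API also has cuModuleLoadData, which loads a module from an image in memory, so the ramdisk step may not be needed.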


joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
March 14, 2016, 04:59:52 PM
 #10036

I kinda doubt there's a documented and stable, supported method of doing so...
You can do it safely:

1. Put the compiled cubin on a ramdisk (virtual memory drive).
2. Poke the constant values directly into the file with the cpu. (The locations can be found by disassembly, and the offsets might change from compiler to compiler, i.e. between cuda versions.)
3. Call the cuda api function cuModuleLoad.

https://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/group__CUDA__MODULE_g366093bd269dafd0af21f1c7d18115d3.html



I stand corrected.

Nice hack. I've always had a soft spot for self modifying code. I once implemented a switch/case that way because there
wasn't enough memory for a jump table. I didn't think it was still possible with modern cpus and all their protections.

AKA JayDDee, cpuminer-opt developer. https://github.com/JayDDee/cpuminer-opt
https://bitcointalk.org/index.php?topic=5226770.msg53865575#msg53865575
BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT,
bensam1231
Legendary
*
Offline Offline

Activity: 1848
Merit: 1024


View Profile
March 14, 2016, 05:30:19 PM
 #10037

ethereum is much more profitable to mine so this is pointless, i can mine ethereum and buy more decred than mining decred

And BTC used to be profitable for GPUs to mine. Things change. We're in a huge profit bubble right now and that can pop at any moment and then all hell is going to break loose when all that Eth hash hits all the other GPU coins.

it does not work like that, they dump? i'm fine, diff will adjust = same profit as before

Oh yeah? I don't think it works the way you're thinking. Why do you think profitability will be the same if Eth loses market value? No other coin is nearly as profitable and Eth has far more hash than any other coin. If it starts to equalize, the other coins can't support the amount of hash.

As I mentioned before, GPU mining hash has grown about 30% in the last two weeks... Maybe closer to 50% as Eth has gained another 300Mh since then.

This is completely putting aside Eth can crash and it can go PoS, which means no more mining. They have talked about PoS already.

Decred has some pretty damned good profitability - it may not exceed Eth for all GPUs, but it comes fairly close.

Yeah, but quite fragile. Eth has a lot of hash and volume going for it. That isn't easily upset. If people from Eth all jumped on Decred it'd instantly bottom out.

also if ethereum goes PoS, another big coin will emerge, probably decred, so a pump there is not so unexpected in the near future
the money will always move one way or another and diff will follow

Investors don't just decide to invest in a new coin when one goes PoS in order to feed miners money. If Eth dies, either by bottoming out or PoS, miners are more than likely SoL. Decred and Vanilla are the next closest things.

Before Eth it was Dash and Dash has been private kernels/ASIC for quite some time... three months ago we were making $.50 profit on a 970, today it's $6... This is definitely a high point and it shouldn't be expected it'll stay this way.

I buy private Nvidia miners. Send information and/or inquiries to my PM box.
malekbaba
Legendary
*
Offline Offline

Activity: 1526
Merit: 1026


View Profile
March 14, 2016, 09:16:13 PM
 #10038

In my opinion the performance of a 970 is equal to 2.7x gtx 750ti. Which would be the cleverer choice to start with: 1 gtx 970 or 3x 750ti?
Some points:
1. If the gpu somehow dies, with the 970 you lose $350, but one 750ti only costs $120.
2. In both cases almost the same electricity bill will apply.
3. For eth, the 970 is the sole winner.

I am confused. Should I buy 1x 970 or 3x 750ti?

Also mention if you have another choice.
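
The purchase-price side of that question is simple arithmetic. A small sketch using only the figures in the post ($350 for a 970 at 2.7x a 750ti, $120 per 750ti); the function name is just for illustration.

```python
def cost_per_unit(price_usd, relative_speed):
    """Price paid per 750ti-equivalent of hashing performance."""
    return price_usd / relative_speed


one_970 = cost_per_unit(350, 2.7)          # one gtx 970 at 2.7x a 750ti
three_750ti = cost_per_unit(3 * 120, 3.0)  # three 750ti cards

print(f"970:      ${one_970:.2f} per 750ti-equivalent")
print(f"3x 750ti: ${three_750ti:.2f} per 750ti-equivalent")
```

On purchase price alone the three 750ti cards come out slightly cheaper per unit of performance. This leaves out power draw, the number of slots used, and the eth advantage of the 970 mentioned in point 3.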
djm34
Legendary
*
Offline Offline

Activity: 1400
Merit: 1050


View Profile WWW
March 14, 2016, 09:52:40 PM
 #10039

In my opinion the performance of a 970 is equal to 2.7x gtx 750ti. Which would be the cleverer choice to start with: 1 gtx 970 or 3x 750ti?
Some points:
1. If the gpu somehow dies, with the 970 you lose $350, but one 750ti only costs $120.
2. In both cases almost the same electricity bill will apply.
3. For eth, the 970 is the sole winner.

I am confused. Should I buy 1x 970 or 3x 750ti?

Also mention if you have another choice.
gpus don't die like that unless you really don't take care of them, and you can always RMA them.

djm34 facebook page
BTC: 1NENYmxwZGHsKFmyjTc5WferTn5VTFb7Ze
Pledge for neoscrypt ccminer to that address: 16UoC4DmTz2pvhFvcfTQrzkPTrXkWijzXw
malekbaba
Legendary
*
Offline Offline

Activity: 1526
Merit: 1026


View Profile
March 14, 2016, 10:07:21 PM
 #10040

I was sleeping while my 970 was mining. When I woke up, I found my pc in a comatose state: it was running, but there was no display and a burning smell was coming from my cpu. I found some grease-like substance on the back of my gpu, and it died that way.