Bitcoin Forum
August 03, 2025, 02:24:00 AM *
Author Topic: CCminer(SP-MOD) Modded GPU kernels.  (Read 2347899 times)
sp_ (OP)
Legendary
*
Offline Offline

Activity: 2926
Merit: 1087

Team Black developer


View Profile
September 06, 2019, 08:17:25 PM
Last edit: September 06, 2019, 09:00:52 PM by sp_
 #24461

Quote
X16rv2 is FPGA shit algo, you better start optimizing the FPGA bitstream to 5 GHz or so for a fee :-D

I'm just testing the new cards and algo. I've got an RTX 2070 and an RTX 2060 SUPER.
In x16rv2 I managed to remove the overhead of the new tiger192 completely from the SHA512 step, and partly from luffa and keccak. By merging tiger into the other kernels, the GPU can do the new multiplications and AES in tiger192 in parallel.

So the new x16rv2 will perform close to the speed of the old x16r on the GPU, and FPGAs will slow down, mostly because of the multiplications.
A better FPGA killer would be to generate PTX kernels at runtime for each block by permuting the assembly instructions. The instructions should include multiplications, logic, and scrambling. The GPU miner will need to compile the PTX on the fly for every block before launching the kernel. Then the FPGA implementation would need to have an ALU (CPU emulation), and this will slow it down a lot.
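
A rough sketch of the "new instruction stream for every block" idea, written in Python only to show the shape of it. The opcode names, register count, program length and the use of SHA-256 as the seed source are all assumptions for illustration; nothing here is an existing miner's format.

```python
import hashlib
import random

# Hypothetical mini instruction set. The post only asks for
# "multiplications, logic, scrambling"; these four names are made up.
OPS = ("mul", "xor", "add", "rotl")
NUM_REGS = 8  # assumed register count


def generate_program(block_hash: bytes, length: int = 16):
    """Derive a deterministic pseudo-random instruction list from a block hash.

    Every miner that sees the same block hash builds the same program,
    and a new block gives a different program.
    """
    seed = int.from_bytes(hashlib.sha256(block_hash).digest(), "big")
    rng = random.Random(seed)
    program = []
    for _ in range(length):
        op = rng.choice(OPS)
        dst = rng.randrange(NUM_REGS)
        src = rng.randrange(NUM_REGS)
        program.append((op, dst, src))
    return program


if __name__ == "__main__":
    for instruction in generate_program(b"example-block-hash"):
        print(instruction)
```

A real design would need a generator that is specified bit for bit across implementations; Python's `random` module is used here only to keep the sketch short.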

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW EVRPROGPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
sp_ (OP)
September 06, 2019, 08:32:57 PM
Last edit: September 06, 2019, 08:58:53 PM by sp_
 #24462

Quote
What about older cards?

Some RTX optimizations don't work on GTX cards. In the code I have split execution for some of them into an RTX-optimized kernel and a GTX-optimized kernel, but not for all of them. If I do the splitting everywhere, x17 will do around 24 MH/s on the GTX 1080 Ti. ccminer 1.0 alexis is around 20 MH/s.

For example, reverting cubehash-shavite to the old version gains a megahash on the 1080 Ti.

Getting the open source code up to date with the latest fee miners is more work. The open source SIMD is slow and needs to be rewritten. I have extracted the latest t-rex PTX code. PM me if you want to help reverse engineer it to CUDA and open source it.

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
September 07, 2019, 07:08:37 AM
 #24463

Sorry to say, but the FPGA hashrate of x16rv2 will be the same as for the old x16r.

sp_ (OP)
September 07, 2019, 09:26:47 AM
Last edit: September 07, 2019, 10:11:46 AM by sp_
 #24464

Let's see what the result will be after the fork. x16rv2 will remove the ASICs, and x16rv3 can remove the FPGAs.
Ravencoin could hard fork again in 2 months to a random-hash variant with permuted instructions in the hash (x16rv3), and then the FPGAs will have to mine something else. An optimized x16rv2 could do around 35 MH/s at 65 watts on the RTX 2060 SUPER.

joblo
Legendary
*
Offline Offline

Activity: 1470
Merit: 1114


View Profile
September 07, 2019, 11:49:09 AM
 #24465

The problem is x16rv2 isn't any more ASIC resistant than v1. All it really needs is development
of a Tiger kernel which shouldn't be too difficult. The only thing that would prevent it is lack
of market demand. Lyra2v3 has a similar problem.

It's not easy to make an algo GPU friendly and ASIC resistant. It would have to target a resource
that GPUs have in abundance that would be too expensive to implement on an ASIC. I'm not aware of any.

Permuted instructions can be worked around with a RAM code segment so it just increases RAM requirements.
A bigger dataset has the same effect. Ironically Lyra2REv2 used a smaller dataset than Lyra2RE to give GPUs
an advantage over CPUs. Lyra2REv3 did not change the size of the dataset.

In the end, coins that fork to new algos periodically as an anti-ASIC strategy do little more than create
a planned obsolescence environment driving demand for new ASICs.

AKA JayDDee, cpuminer-opt developer. https://github.com/JayDDee/cpuminer-opt
https://bitcointalk.org/index.php?topic=5226770.msg53865575#msg53865575
BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT,
sp_ (OP)
September 07, 2019, 01:10:00 PM
 #24466

Quote from: joblo
The problem is x16rv2 isn't any more ASIC resistant than v1. All it really needs is development
of a Tiger kernel which shouldn't be too difficult. The only thing that would prevent it is lack
of market demand. Lyra2v3 has a similar problem.

I expect a decline in the difficulty after the fork. Look what happened in the Beam II fork.

Quote
Permuted instructions can be worked around with a RAM code segment so it just increases RAM requirements.

You need to read the instruction from RAM, decode it, and execute it. It is difficult to make the FPGA run at full speed that way. You can create a superscalar version that executes more than one instruction per cycle, but it is still much slower than a static hash function. Or is there a faster way?
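
To make the fetch, decode, execute overhead concrete, here is a minimal interpreter sketch in Python. The three opcodes and the `(op, dst, src)` tuple format are assumptions made up for this example. The point is only that every instruction pays for a fetch and a dispatch before any useful work happens, which is what a soft CPU on an FPGA would have to do for a program stored in RAM.

```python
MASK = 0xFFFFFFFF  # keep registers at 32 bits


def run_interpreted(program, regs):
    """Execute a list of (op, dst, src) tuples one instruction at a time."""
    regs = list(regs)
    pc = 0
    while pc < len(program):
        op, dst, src = program[pc]          # fetch from the "RAM code segment"
        if op == "mul":                     # decode, then execute
            regs[dst] = (regs[dst] * regs[src]) & MASK
        elif op == "xor":
            regs[dst] ^= regs[src]
        elif op == "add":
            regs[dst] = (regs[dst] + regs[src]) & MASK
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs


if __name__ == "__main__":
    program = [("add", 0, 1), ("xor", 1, 0), ("mul", 0, 1)]
    print(run_interpreted(program, [1, 2]))  # [3, 1]
```

A static hash function has no such loop: the sequence of operations is fixed in the circuit or in the compiled kernel.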

bensam1231
Legendary
*
Offline Offline

Activity: 1834
Merit: 1024


View Profile
September 12, 2019, 01:37:08 AM
 #24467

Depending on the market, there isn't demand for new ASICs (due to the price of development and emission). FPGAs are the new ASICs and much more difficult to deal with. What further exacerbates the issue is that the bitstreams for them are generally relegated to smoky back rooms where most people don't have access. So even if you find and buy the really expensive hardware, you don't have access to the software to run on it.

This is an awful lot like the scrypt days, when people were making miners for algos and trading them in back rooms while ASICs were just starting to emerge. This entire last year or so has been dominated by this behavior, which, together with market decline and an overbloat of hash (along with dark hash), has led to a relatively stark outlook.

GPUs DO have things that FPGAs don't have: a much lower price tag, and memory. So then it comes down to hash per $. If FPGAs give little to no advantage for the price you're paying for them, there is no reason to use them. Memory is only on a couple of FPGAs, and depending on which one you're talking about, it's not the best in the world, which further limits performance.

A lot of the current predicament, putting aside the market decline and bloating of hashrate caused by the boom in '18, has to do with coin devs being lazy. There are a lot of algos that are much more ASIC resistant than others; instead they do stupid shit like what RVN is doing by slightly altering their algo to maintain a brand name and appear progressive, without actually tackling the problem. It's not even about having a silver bullet either, they can just swap to already available algos. MTP and ProgPoW, for instance, are very ASIC/FPGA resistant (for now).

I buy private Nvidia miners. Send information and/or inquiries to my PM box.
pallas
September 12, 2019, 06:56:05 AM
 #24468

There are HBM equipped FPGAs already.
Problem is, even with restricted bitstreams, their ROI is close to infinity. Just like with ASICs.

sp_ (OP)
September 12, 2019, 10:06:44 PM
Last edit: September 12, 2019, 10:19:19 PM by sp_
 #24469

Quote from: pallas
There are HBM equipped FPGAs already.
Problem is, even with restricted bitstreams, their ROI is close to infinity. Just like with ASICs.

So the next question is how many times you can access the HBM per cycle.
In my algo proposal you will have a random stream of instructions for every new block (15000 PTX instructions / 15 sec block time).
On the GPU you just run the PTX. (CUDA compiles and caches the code before execution, which takes a few milliseconds.) After the compilation is done, you have 14.xx seconds left to run the compiled kernel at full speed. On the FPGA you cannot generate the VHDL code, compile it, and flash it in 15 seconds, so you need to make a CPU emulator. This is because it would probably be difficult, slow, or impossible to generate VHDL from random instructions and run it without timing bugs.
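
The time budget in that argument can be written down as a small calculation. This is only a sketch: the 15 second block time comes from the proposal above, while the 5 ms compile time and the 30 minute FPGA build time are assumed example values, not measurements.

```python
BLOCK_TIME_S = 15.0  # block time from the proposal above


def hashing_fraction(setup_time_s, block_time_s=BLOCK_TIME_S):
    """Fraction of a block interval left for hashing after per-block setup."""
    if setup_time_s >= block_time_s:
        return 0.0  # setup does not finish before the next block arrives
    return (block_time_s - setup_time_s) / block_time_s


if __name__ == "__main__":
    # Assumed: "a few milliseconds" of PTX compilation taken as 5 ms.
    print(f"GPU JIT compile:  {hashing_fraction(0.005):.4%} of the block left")
    # Assumed: an FPGA synthesis and flash cycle taken as 30 minutes.
    print(f"FPGA rebuild:     {hashing_fraction(30 * 60):.4%} of the block left")
```

With these assumed numbers the GPU loses a negligible share of each block to compilation, while a per-block FPGA rebuild never finishes in time, which is why the post concludes the FPGA would need a CPU emulator instead.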

joblo
September 13, 2019, 02:24:32 AM
 #24470

Quote from: sp_
So the next question is how many times you can access the HBM per cycle.
In my algo proposal you will have a random stream of instructions for every new block (15000 PTX instructions / 15 sec block time).
On the GPU you just run the PTX. (CUDA compiles and caches the code before execution, which takes a few milliseconds.) After the compilation is done, you have 14.xx seconds left to run the compiled kernel at full speed. On the FPGA you cannot generate the VHDL code, compile it, and flash it in 15 seconds, so you need to make a CPU emulator. This is because it would probably be difficult, slow, or impossible to generate VHDL from random instructions and run it without timing bugs.

By using PTX you're essentially using a proprietary language to prevent anything but an Nvidia product
or an Nvidia licensed product from mining your algo. That's one way to make an algo ASIC/FPGA resistant.

bensam1231
September 13, 2019, 03:51:50 AM
 #24471

Quote from: pallas
There are HBM equipped FPGAs already.
Problem is, even with restricted bitstreams, their ROI is close to infinity. Just like with ASICs.

If you're talking about FKs, they aren't even being shipped, and they don't even say what algos they'll support. Either way, as I mentioned, FPGAs with memory (specifically fast memory) are in the extreme minority. They aren't everywhere.

Anti-FPGA effort is not a silver bullet. They're a lot more expensive to produce, so if you make something that makes it extremely expensive to produce, there has to be a huge reward on the other side or it's not worth it. Looking at a 2-3 year ROI on a lot of FPGAs, even if they produce a lot of hashrate, makes them very unpalatable. There is an opportunity cost associated with everything, and a lot of people don't consider that. FPGAs also become obsolete, and obsolescence is something that has to be considered. So even if you have an FPGA that will ROI in 3 years, there can and more than likely will be newer ones out that will obsolete it.

Chinese don't respect licenses or IP rights unless it's some megacorp and you have millions to throw at it with lawyers.

pallas
September 13, 2019, 06:19:24 AM
 #24472

Quote from: pallas
There are HBM equipped FPGAs already.
Problem is, even with restricted bitstreams, their ROI is close to infinity. Just like with ASICs.

Quote from: sp_
So the next question is how many times you can access the HBM per cycle.
In my algo proposal you will have a random stream of instructions for every new block (15000 PTX instructions / 15 sec block time).
On the GPU you just run the PTX. (CUDA compiles and caches the code before execution, which takes a few milliseconds.) After the compilation is done, you have 14.xx seconds left to run the compiled kernel at full speed. On the FPGA you cannot generate the VHDL code, compile it, and flash it in 15 seconds, so you need to make a CPU emulator. This is because it would probably be difficult, slow, or impossible to generate VHDL from random instructions and run it without timing bugs.

what will happen when cards compatible with the language are no longer produced?
maybe you are planning a pump and dump coin so you don't care :-D

sp_ (OP)
September 13, 2019, 07:00:57 AM
 #24473

Quote from: joblo
By using PTX you're essentially using a proprietary language to prevent anything but an Nvidia product
or an Nvidia licensed product from mining your algo. That's one way to make an algo ASIC/FPGA resistant.

Doesn't need to be PTX. You need a pseudo-assembly language that can easily be translated to PTX before execution.
The CPU miner would have to parse this language and create a proper native binary before execution (create instructions in memory, flush the caches, then execute). CPU verification is important for the pools, wallets, and exchanges.
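
As a loose analogy for "parse the pseudo-assembly and create native code before execution", the sketch below turns an instruction list into Python source, compiles it once with the built-in `compile`, and then calls the result like a normal function. The opcodes and the tuple format are hypothetical, and Python bytecode stands in for the native x86 binary a real CPU miner would emit.

```python
MASK = 0xFFFFFFFF  # keep registers at 32 bits

# One source-line template per hypothetical opcode.
TEMPLATES = {
    "mul": "r[{d}] = (r[{d}] * r[{s}]) & MASK",
    "xor": "r[{d}] = r[{d}] ^ r[{s}]",
    "add": "r[{d}] = (r[{d}] + r[{s}]) & MASK",
}


def compile_program(program):
    """Translate (op, dst, src) tuples into a callable, paying the parse cost once."""
    lines = ["def kernel(regs):", "    r = list(regs)"]
    for op, dst, src in program:
        lines.append("    " + TEMPLATES[op].format(d=dst, s=src))
    lines.append("    return r")
    namespace = {"MASK": MASK}
    exec(compile("\n".join(lines), "<generated>", "exec"), namespace)
    return namespace["kernel"]


if __name__ == "__main__":
    kernel = compile_program([("add", 0, 1), ("xor", 1, 0), ("mul", 0, 1)])
    print(kernel([1, 2]))  # [3, 1]
```

After `compile_program` returns, running the kernel involves no per-instruction parsing, which mirrors the "compile once per block, then hash at full speed" idea.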

sp_ (OP)
September 13, 2019, 07:03:54 AM
 #24474

Quote from: pallas
what will happen when cards compatible with the language are no longer produced?

The point with PTX is that it's a unified language for all NVIDIA GPU architectures. The PTX is compiled to the native GPU language by the NVIDIA driver before execution. If NVIDIA decides to replace PTX with SPTX, you simply need your miner software to convert the random hashing function into SPTX.

pallas
September 13, 2019, 07:13:02 AM
 #24475

Quote from: joblo
By using PTX you're essentially using a proprietary language to prevent anything but an Nvidia product
or an Nvidia licensed product from mining your algo. That's one way to make an algo ASIC/FPGA resistant.

Quote from: sp_
Doesn't need to be PTX. You need a pseudo-assembly language that can easily be translated to PTX before execution.
The CPU miner would have to parse this language and create a proper native binary before execution (create instructions in memory, flush the caches, then execute). CPU verification is important for the pools, wallets, and exchanges.

Yeah, no PTX, that's what I was saying.
==> RandomX

sp_ (OP)
September 13, 2019, 07:15:06 AM
Last edit: September 13, 2019, 07:45:47 AM by sp_
 #24476

Quote from: pallas
Yeah, no PTX, that's what I was saying.
==> RandomX

So to make a fast RandomX miner on NVIDIA you can convert the RandomX code to PTX before execution (create a new PTX kernel for each block).

Without optimizations the NVIDIA cards are losing to the CPU.

randomx benchmarks:

https://bitcointalk.org/index.php?topic=5176747.0

GPU                             | Cryptonight-R | RandomX
AMD
Vega 64                         | 2200 H/s      | 1225 H/s
RX 480/580                      | 960-1000 H/s  | 400-410 H/s
RX 560 4GB (1400/2200 MHz)      | 495 H/s       | 260 H/s
NVIDIA/EVGA
RTX 2080 Ti (1915/13600 MHz)    | 960-1000 H/s  | 400-410 H/s
GTX 1080 Ti (2037/11800 MHz)    | 927 H/s       | 1122 H/s
GTX 1070 Ti (1900/7600 MHz)     | 625 H/s       | 769 H/s

For CPUs:
CPU                                        | Cryptonight-R | RandomX
AMD 3900X (4.25GHZ ALL CORE, 3600MHZ RAM)  | 1335 H/s      | 13330 H/s
RYZEN 3700X                                | 1018 H/s      | 6853 H/s
RYZEN 5 3600                               | 803 H/s       | 6580 H/s
INTEL I9 9900K                             | 630 H/s       | 2102 H/s
2X XEON E5 2670 V2                         | 930 H/s       | 5815 H/s
INTEL I7 7700K                             | 350 H/s       | 2100 H/s


joblo
September 13, 2019, 04:33:06 PM
 #24477

Quote from: sp_
The point with PTX is that it's a unified language for all NVIDIA GPU architectures.

The point is that it's only Nvidia GPU architectures. No ASIC, no FPGA, no Radeon, no CPU.

sp_ (OP)
September 13, 2019, 08:24:57 PM
 #24478

Quote from: sp_
The point with PTX is that it's a unified language for all NVIDIA GPU architectures.

Quote from: joblo
The point is that it's only Nvidia GPU architectures. No ASIC, no FPGA, no Radeon, no CPU.

Doesn't need to be PTX. If you run on NVIDIA hardware you convert the random stream of instructions to PTX. RandomX could be very profitable on NVIDIA hardware with a proper implementation...

joblo
September 13, 2019, 10:37:05 PM
 #24479

Quote from: sp_
Doesn't need to be PTX. If you run on NVIDIA hardware you convert the random stream of instructions to PTX. RandomX could be very profitable on NVIDIA hardware with a proper implementation...

Precisely. You can build an Nvidia-only proof of concept, but a real product will need
its own pseudo language that can be compiled to PTX/CUDA, OCL, and x86 native instructions
producing identical functionality. The language would have to be complex enough (in the CISC sense)
that the FPGA can't decode it with a simple table lookup. That's a hell of a lot of work.
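
A toy sketch of the "one pseudo language, several backends" idea: the same instruction list is lowered to PTX-style lines and to C-style lines. The pseudo opcodes and the tuple format are invented for this example, and the output is bare instruction fragments only, not a complete kernel (register declarations, entry points and the OpenCL/x86 backends are left out). Note that this sketch is exactly the simple one-to-one table lookup the post says a real language would have to avoid.

```python
# Hypothetical pseudo-ISA to PTX-style text (32-bit integer forms).
PTX = {
    "add": "add.u32 %r{d}, %r{d}, %r{s};",
    "mul": "mul.lo.u32 %r{d}, %r{d}, %r{s};",
    "xor": "xor.b32 %r{d}, %r{d}, %r{s};",
}

# The same pseudo-ISA to C-style text, assuming `uint32_t r[]` so that
# add and mul wrap at 32 bits like the PTX forms above.
C = {
    "add": "r[{d}] += r[{s}];",
    "mul": "r[{d}] *= r[{s}];",
    "xor": "r[{d}] ^= r[{s}];",
}


def lower(program, table):
    """Translate (op, dst, src) tuples into backend text, one line each."""
    return [table[op].format(d=dst, s=src) for op, dst, src in program]


if __name__ == "__main__":
    program = [("mul", 0, 1), ("xor", 2, 0), ("add", 1, 2)]
    print("\n".join(lower(program, PTX)))
    print("\n".join(lower(program, C)))
```

The hard part described above is not the translation itself but guaranteeing that every backend produces bit-identical results for every possible program.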

sp_ (OP)
September 14, 2019, 05:05:48 AM
 #24480

Quote from: joblo
The language would have to be complex enough (in the CISC sense) that the FPGA can't decode it with a simple table lookup. That's a hell of a lot of work.

The FPGA has limits on memory access and multipliers. Let's say the FPGA can do 32 multiplications and 32 memory accesses per cycle; then you might be able to run 32 instructions per cycle at 500 MHz.

RandomX on the GPU doesn't need any memory access because the code is compiled, and you can run with 1024 threads at 2000 MHz.

So the GPU can do 1024 instructions per cycle at 2000 MHz.
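
Taking the figures in this post at face value (they are the post's own assumptions, not measurements), the comparison works out as follows:

```python
def instructions_per_second(per_cycle, clock_hz):
    """Peak instruction rate for a given width and clock."""
    return per_cycle * clock_hz


# Figures assumed in the post above.
FPGA_RATE = instructions_per_second(32, 500_000_000)      # 32 per cycle at 500 MHz
GPU_RATE = instructions_per_second(1024, 2_000_000_000)   # 1024 per cycle at 2000 MHz

if __name__ == "__main__":
    print(f"FPGA: {FPGA_RATE:.3e} instructions/s")
    print(f"GPU:  {GPU_RATE:.3e} instructions/s")
    print(f"GPU / FPGA ratio: {GPU_RATE // FPGA_RATE}x")  # 128x
```

That is 1.6e10 versus 2.048e12 instructions per second, a factor of 128, under the stated assumptions of 32-wide execution on the FPGA and 1024 concurrent threads on the GPU.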
