Bitcoin Forum
Author Topic: BitCrack - A tool for brute-forcing private keys  (Read 79927 times)
t0nyst4r
Newbie
*
Offline Offline

Activity: 18
Merit: 0


View Profile
January 16, 2021, 09:35:51 PM
 #621

clbitcrack has always had issues, and still does. Also, did you change the compute_cap in your makefile? Otherwise compiling cubitcrack won't succeed, and even if it compiles, it won't work with your hardware. Change it to match your hardware.

I didn't change the compute_cap value and was still able to compile cubitcrack on Windows, simply by updating the references from CUDA 10.1 to 11.2 and making sure the project resources were in the correct locations. I'm not able to actually run it because of the "misaligned address" error, so I am using clbitcrack instead until someone fixes cubitcrack so it runs again with CUDA 11.2+.

In the meantime, what issues should I expect clbitcrack to have running on Windows? I'm not working with any P2SH addresses. What other issues would cause clbitcrack to not find a private key, as @yoyodapro mentioned?

https://github.com/brichard19/BitCrack/issues/81
That was the main reason I said that; besides, you can easily test whether it works out of the box.

Thanks for the link; I read up on the known issue. I then ran the test with the provided list of 18 addresses using Win64 + clbitcrack + 3090 + CUDA 11.2, and it found all the keys in the list.
WanderingPhilospher
Sr. Member
****
Offline Offline

Activity: 1484
Merit: 285

Shooters Shoot...


View Profile
January 16, 2021, 10:59:28 PM
 #622

clbitcrack has always had issues, still does, and did you change the compute_cap in your makefile,
otherwise compiling the cubitcrack won't succeed. even if succeeded won't work with your hardware.
change it accordingly to your hardware.

I didn't change the compute_cap value and was still able to compile cubitcrack on windows simply by updating the references to CUDA 10.1 to 11.2 and making sure project resources were in the correct locations. I'm not able to actually run it because of the "misaligned address" error, so I am using clbitcrack instead until someone is able to fix cubitcrack and allow it to run again with CUDA 11.2+.

In the meantime, what issues should I expect clbitcrack to have running on Windows? I'm not working with any P2SH addresses. What other issues would cause clbitcrack to not find a private key, as @yoyodapro mentioned?

https://github.com/brichard19/BitCrack/issues/81
this was the main reason i said that, besides you can test it out easily if it works o.o.t.b.

Thanks for the link, I read up on the known issue. I then performed the test with the provided list of 18 addresses using Win64 + clbitcrack + 3090 + CUDA 11.2 and it found all keys in the list.
So what is your speed with the 3090? If it's not doubling a 2080 Ti, is it worth it? Meaning, it's great that it runs, but is it running as it should be, speed-wise? I've only found one program that truly utilizes the new 30xx cards on Windows, but its source code is not available.
NotATether
Legendary
*
Offline Offline

Activity: 2268
Merit: 9572


┻┻ ︵㇏(°□°㇏)


View Profile WWW
January 16, 2021, 11:05:01 PM
 #623

if I understood correctly you have P2SH addresses in the list
they start with "3"
BitCrack does not accept them

That's because "3" addresses are all P2SH addresses, which are RIPEMD160 hashes of a script (at least, the ones that don't encode a segwit script). Bitcrack uses a bloom filter that can quickly check whether the hash of the public key derived from a candidate private key matches any of the RIPEMD160 hashes of the input addresses (that's why it's more efficient to put many addresses in the input file at once).

A script is not generated from random bytes like a private key. But if we somehow obtained the redeem script for such an address, or guessed what kind of math problem someone turned into a redeem script, then only the solution to that problem (which is sometimes very easy) would have to be brute-forced to spend the input, and Bitcrack is completely incapable of that because it works in terms of private keys.

But it should be possible to brute-force bc1 addresses, since those also use private keys. If that's not implemented, it would make yet another good science fair project, or even a Google Summer of Code project  Grin
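Bitcrack's actual filter lives in its CUDA code and its parameters are its own; purely as an illustration of the idea (a probabilistic filter answers "maybe" very fast, and any hit is re-checked against the exact target set), here is a minimal Python sketch with made-up placeholder hashes:

```python
import hashlib

class TinyBloom:
    """Minimal bloom filter: k hash functions over an m-bit array.
    Membership tests can give false positives, never false negatives."""
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Hypothetical target hash160 values (placeholders, not real address hashes).
targets = {b"hash160-of-target-1", b"hash160-of-target-2"}
bloom = TinyBloom()
for t in targets:
    bloom.add(t)

candidate = b"hash160-of-target-1"
# Cheap probabilistic check first; exact set lookup only on a hit.
assert bloom.maybe_contains(candidate) and candidate in targets
assert not (bloom.maybe_contains(b"some-other-hash") and b"some-other-hash" in targets)
```

The point of the two-stage check is that the filter query is constant-time regardless of how many target addresses you load, which is why feeding Bitcrack a big input file costs almost nothing extra per candidate key.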

 
t0nyst4r
Newbie
*
Offline Offline

Activity: 18
Merit: 0


View Profile
January 16, 2021, 11:31:42 PM
 #624

clbitcrack has always had issues, still does, and did you change the compute_cap in your makefile,
otherwise compiling the cubitcrack won't succeed. even if succeeded won't work with your hardware.
change it accordingly to your hardware.

I didn't change the compute_cap value and was still able to compile cubitcrack on windows simply by updating the references to CUDA 10.1 to 11.2 and making sure project resources were in the correct locations. I'm not able to actually run it because of the "misaligned address" error, so I am using clbitcrack instead until someone is able to fix cubitcrack and allow it to run again with CUDA 11.2+.

In the meantime, what issues should I expect clbitcrack to have running on Windows? I'm not working with any P2SH addresses. What other issues would cause clbitcrack to not find a private key, as @yoyodapro mentioned?

https://github.com/brichard19/BitCrack/issues/81
this was the main reason i said that, besides you can test it out easily if it works o.o.t.b.

Thanks for the link, I read up on the known issue. I then performed the test with the provided list of 18 addresses using Win64 + clbitcrack + 3090 + CUDA 11.2 and it found all keys in the list.
So what is your speed with 3090? If it's not doubling a 2080Ti, is it worth it? Meaning, it's great that it runs, but is it running as it should be, MH/s wise? I've only found one program that truly utilizes the new 30xx cards, on windows; but source code is not available.

Best I got was 1050 MKey/s.
bitcoinforktech
Newbie
*
Offline Offline

Activity: 28
Merit: 4


View Profile WWW
January 17, 2021, 01:36:31 AM
Last edit: January 17, 2021, 05:38:49 AM by bitcoinforktech
 #625

Quote from: NotATether
But it should be possible to brute-force bc1 addresses since those also use private keys, if that's not implemented that'll make yet another good science fair project or even a Google Summer of Code project  Grin

Neat idea. I might give that a go and submit a pull request or fork BitCrack with that function. It should be possible.

Edit: my repo is at https://github.com/bitcoinforktech/BitCrack.git which will have some updates in the next few days.
renedx
Jr. Member
*
Offline Offline

Activity: 36
Merit: 3


View Profile
January 19, 2021, 03:09:28 PM
Last edit: January 19, 2021, 06:38:58 PM by renedx
 #626

I really wonder whether someone was able to run this against compute_75 and what speed bitcrack would hit. I've been running a modified VanitySearch doing 4.6 GKey/s on a single 3090. Sadly, due to the ~86k threads it tries to fill, it goes out of bounds now and then (GPU/GPUCompute.h:54). I just can't wrap my head around that one yet. But besides me trying to understand it (and learning a lot), CUDA should be doing something near that speed on bitcrack too  Tongue

Neat idea. I might give that a go and submit a pull request or fork BitCrack with that function. It should be possible.

Edit: my repo is at https://github.com/bitcoinforktech/BitCrack.git which will have some updates in the next few days.

Yeah, CUDA on bitcrack has this interesting problem on the new drivers. I'll try with line info later; this was just a quick run of your repo.

Code:
[2021-01-19.17:31:52] [Info] Error: misaligned address
========= Misaligned Shared or Local Address
=========     at 0x0000e610 in keyFinderKernelWithDouble(int, int)
=========     by thread (160,0,0) in block (0,0,0)

Edit:
The most fascinating thing about this issue is that the debug exe runs my full test keyspace (400M keys, though of course slow as hell), while the release build crashes with the error above.
WanderingPhilospher
Sr. Member
****
Offline Offline

Activity: 1484
Merit: 285

Shooters Shoot...


View Profile
January 19, 2021, 06:37:52 PM
 #627

Really wonder if someone was able to run this against compute_75 & what speed bitcrack would hit. I've been running a modified VanitySearch, doing 4.6GK/s on a single 3090. Sadly due to the 86k threads it trying to fill, it goes out of bounds now & then (GPU/GPUCompute.h:54). Just cannot wrap my head around that funny one yet. But besides of me trying to understand that & learning a lot, CUDA should be doing something near that speed on bitcrack too  Tongue

Neat idea. I might give that a go and submit a pull request or fork BitCrack with that function. It should be possible.

Edit: my repo is at https://github.com/bitcoinforktech/BitCrack.git which will have some updates in the next few days.

Yeah, cuda on bitcrack has this interesting problem on the new drivers. Will try with line info later, was just doing a quick run of your repo.

Code:
[2021-01-19.17:31:52] [Info] Error: misaligned address
========= Misaligned Shared or Local Address
=========     at 0x0000e610 in keyFinderKernelWithDouble(int, int)
=========     by thread (160,0,0) in block (0,0,0)
When you say modified VanitySearch, what do you mean? How is it modified? Is it still searching for vanity prefixes, or searching sequentially like BitCrack? VanitySearch, in general, is much faster than bitcrack.
renedx
Jr. Member
*
Offline Offline

Activity: 36
Merit: 3


View Profile
January 19, 2021, 06:40:38 PM
 #628

When you say modified VanitySearch, what do you mean? How is it modified? Still searching for vanity/prefixes or doing a search sequentially like bitcracK? Vanity in general, is much more faster than bitcrack.

I've modified it to do keyspace search with CUDA 11.2 on my RTX 30xx cards. It just keeps going out of bounds at random, so releasing it would only fill my issue tracker with "this doesn't work" Smiley
Using small grids you could keep it running for a while, but it still wouldn't be stable enough to put my name on it.
zahid888
Member
**
Offline Offline

Activity: 334
Merit: 24

the right steps towards the goal


View Profile
January 19, 2021, 07:16:58 PM
 #629

I've modified it to do keyspace search on CUDA 11.2 on my RTX 30XX cards. It just keeps going out of bound at random, so releasing it will just fill my issues with "this doesn't work" Smiley
Using small grids, you could keep it running for a bit, but still wouldn't be as stable to put my name on it.

May I try your modified VanitySearch with keyspace search?

1BGvwggxfCaHGykKrVXX7fk8GYaLQpeixA
renedx
Jr. Member
*
Offline Offline

Activity: 36
Merit: 3


View Profile
January 19, 2021, 07:38:54 PM
 #630


may I try your modified VanitySearch with keyspace search?

I would honestly not recommend using unstable software. If I get it working properly and understand the part that's going wrong, I'm happy to share. But at the moment it needs a watchdog wrapper to restart it. I would feel guilty and end up spending my time helping people out instead of fixing the real problem. I was hoping to get someone digging into the part of the code that's going wrong, rather than making people run something unstable  Undecided
dextronomous
Full Member
***
Offline Offline

Activity: 454
Merit: 105


View Profile
January 19, 2021, 11:29:34 PM
 #631


may I try your modified VanitySearch with keyspace search?

I would honestly not recommend using non-stable software. If I get it to work properly and understand the part going wrong, I'm happy to share. But at this moment, it needs front-running code to restart it. I would be feeling guilty and spending time helping people out, instead of fixing the real problem. I was hoping to trigger someone on the code part going wrong, rather then making people run unstable  Undecided

I have time, and I'm from Amsterdam; go ahead and PM me. I'd be able to do some testing in my spare time.
renedx
Jr. Member
*
Offline Offline

Activity: 36
Merit: 3


View Profile
January 20, 2021, 01:02:04 AM
Last edit: January 20, 2021, 04:07:13 AM by renedx
 #632

What in copyBigInt() is causing the error, and why? The same error appears in multiple programs written before the release of the RTX 30xx cards.

It's not copyBigInt() itself that's problematic (it's a simple element-wise assignment), but one of the arrays passed to it, which is not aligned. CUDA wants arrays aligned to 32-bit boundaries, and the offending array that eventually reaches copyBigInt() originates from the "xp" and "x" pointer arguments of beginBatchAdd(). These are passed to SubModP(), the result is stored in an 8-element int array, and that array is then passed to MulModP() and from there to copyBigInt().

At first it wasn't clear where this error was coming from, because the problem disappears in debug mode, so I couldn't use the debugger. That's right: if you pass the -g -G switches to NVCC, you get a working but extremely slow bitcrack binary.

I tried draconian measures in an attempt to fix this: unrolling the loop, changing the array assignment to memcpy(), qualifying it with the __restrict__ and __align__ keywords, even turning it into a #define statement, but the destination and source arrays just refuse to be accessed (and since these arrays cannot even be used in the parent function, the problem stems from somewhere deeper). More bafflingly, assigning a constant to an element of the dest array, or initializing a variable from an element of src, works, but that obviously breaks the elliptic curve math.

This is supposed to be performance-critical code, so I did not attempt to change the static array to malloc().



For the uninitiated, this is where the bug is: https://github.com/brichard19/BitCrack/blob/master/cudaMath/secp256k1.cuh

cudaMath/secp256k1.cuh: everything in here is inline functions.

We arrive here from CudaKeySearchDevice via beginBatchAdd() and beginBatchAddWithDouble(). Both of these call MulModP() for point multiplication, and methods like that need to copy to and from temporary arrays. Somehow the arrays being passed are not on an alignment boundary, and I'm honestly not sure what to do. (Of course, rewriting the whole secp256k1 module is also an option, but really...? That's like cracking a nut with a sledgehammer.)

I've been following your debugging by hand, since the debug build runs while the release crashes. I'm nowhere as close to the base function as you are, but it seems I'm hitting a different path. You're saying it starts from beginBatchAdd().

I know the following breaks the code, but just for isolating the issue: if you comment out this part
https://github.com/brichard19/BitCrack/blob/master/CudaKeySearchDevice/CudaKeySearchDevice.cu#L179-L190
the code runs for me (of course it's broken now).

The interesting part is "doBatchInverse": running the loop that follows it makes it crash, while the loop never hits "completeBatchAdd".

Maybe I'm hitting a different issue? Or did you mean "completeBatchAdd"?


Edit:

Never mind, I hadn't undone my function overwrites. It does indeed bubble up from SubModP().
https://github.com/brichard19/BitCrack/blob/master/cudaMath/secp256k1.cuh#L646

We're on the same track (I think), thank god Smiley *digging*

Edit 2:

Installed all the proper tools for debugging simultaneous threads. The breakpoint in question got hit.

That's it for now, time for sleep Wink

By the way: when running in legacy mode (compatible with old hardware), it ran fine under Nsight. I'm not sure yet what flag that corresponds to on regular CUDA builds; I just pressed the wrong button and was waiting for it to crash, and it totally didn't. I'll check tomorrow what speed that was running at; it could be interesting as a quick fix.


For staying in a certain range, (must be a small range); do you want it to end or push back into the range? Bitcrack ends, Kangaroo pushes back. Which route are you trying to go? To end, need last key function...


It just ends; it's not that complicated. The CUDA part is a little too far above my understanding at the moment. The Bitcrack parts are easier for me to understand, at least.
WanderingPhilospher
Sr. Member
****
Offline Offline

Activity: 1484
Merit: 285

Shooters Shoot...


View Profile
January 20, 2021, 01:53:26 AM
 #633


may I try your modified VanitySearch with keyspace search?

I would honestly not recommend using non-stable software. If I get it to work properly and understand the part going wrong, I'm happy to share. But at this moment, it needs front-running code to restart it. I would be feeling guilty and spending time helping people out, instead of fixing the real problem. I was hoping to trigger someone on the code part going wrong, rather then making people run unstable  Undecided
For staying within a certain range (it must be a small range): do you want it to end, or to push back into the range? Bitcrack ends, Kangaroo pushes back. Which route are you trying to take? To end, you need a last-key function...
NotATether
Legendary
*
Offline Offline

Activity: 2268
Merit: 9572


┻┻ ︵㇏(°□°㇏)


View Profile WWW
January 20, 2021, 05:21:02 AM
Merited by renedx (1)
 #634

Btw: when running in legacy mode (old hardware compatible), it was running fine using nsight. I’m not sure what flag that is on regular CUDA builds yet, just pressed the wrong button and was waiting for it to crash, totally didn’t. Will check tomorrow what speed that was on, could be interesting as fast-fix.

Make sure you track where the pointers passed to SubModP() were initialized. Specifically, if you increment an array pointer by 1 or 2 or something like that in host code and then hand it over to CUDA, it will crap itself. It's too bad that CUDA doesn't have a native 256-bit unsigned type yet; not only would that be faster, but we could also avoid all this trickery to fix it.

Maybe the minimum memory alignment (in bytes) increased on the newer GPUs?

Is there a flag in nvcc that activates this legacy mode you're talking about? It's kind of frustrating that the code pretends to be fine when built with debug flags.
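The "offset a pointer, then hand it over" failure mode is easy to demonstrate on the host side; a small Python sketch (ctypes only, nothing CUDA-specific) showing how a one-byte offset breaks alignment for every wider access:

```python
import ctypes

def is_aligned(addr: int, align: int) -> bool:
    # An address is aligned if it is an exact multiple of the access width.
    return addr % align == 0

buf = (ctypes.c_uint32 * 8)()      # 32 bytes of C-allocated storage
base = ctypes.addressof(buf)

assert is_aligned(base, 4)         # ctypes arrays honor element alignment
assert not is_aligned(base + 1, 4) # offset by one byte: every 4-byte load is misaligned
assert not is_aligned(base + 2, 4)
```

A device-side access through a pointer like `base + 1` trips the same check the GPU hardware enforces, which is exactly the "Misaligned Shared or Local Address" report in the logs above.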

 
WanderingPhilospher
Sr. Member
****
Offline Offline

Activity: 1484
Merit: 285

Shooters Shoot...


View Profile
January 20, 2021, 05:44:06 AM
 #635

Quote
Just ends, its not that complicated. The CUDA part is just a little to much above my understanding atm. The Bitcrack parts are easier to understand for me at least.

You are correct, it's not complicated at all. "Just ends"... that's already been done.
bitcoinforktech
Newbie
*
Offline Offline

Activity: 28
Merit: 4


View Profile WWW
January 20, 2021, 12:39:29 PM
 #636

Really wonder if someone was able to run this against compute_75 & what speed bitcrack would hit. I've been running a modified VanitySearch, doing 4.6GK/s on a single 3090. Sadly due to the 86k threads it trying to fill, it goes out of bounds now & then (GPU/GPUCompute.h:54). Just cannot wrap my head around that funny one yet. But besides of me trying to understand that & learning a lot, CUDA should be doing something near that speed on bitcrack too  Tongue

Neat idea. I might give that a go and submit a pull request or fork BitCrack with that function. It should be possible.

Edit: my repo is at https://github.com/bitcoinforktech/BitCrack.git which will have some updates in the next few days.

Yeah, cuda on bitcrack has this interesting problem on the new drivers. Will try with line info later, was just doing a quick run of your repo.

Code:
[2021-01-19.17:31:52] [Info] Error: misaligned address
========= Misaligned Shared or Local Address
=========     at 0x0000e610 in keyFinderKernelWithDouble(int, int)
=========     by thread (160,0,0) in block (0,0,0)

Edit:
Most fascinating thing about this issue, is that it runs my full test keyspace in debug exe (400M)[ofc slow af], the release crashes on the error above.

I have just installed my 3070 and am giving it a go; I've compiled the CUDA version a few times, but only for older cards.

I hear that I have to roll back my driver to get it working on 3070, 3080, or 3090 cards, but I'm not sure to which version. Right now I can't get it to start at all on the RTX 3070 using the driver that ships with CUDA development kit 11.2.

Aside: I think I know where to fix this, if I can just get it to run on my card so I can give it a whirl :/
WanderingPhilospher
Sr. Member
****
Offline Offline

Activity: 1484
Merit: 285

Shooters Shoot...


View Profile
January 20, 2021, 12:55:40 PM
 #637

Really wonder if someone was able to run this against compute_75 & what speed bitcrack would hit. I've been running a modified VanitySearch, doing 4.6GK/s on a single 3090. Sadly due to the 86k threads it trying to fill, it goes out of bounds now & then (GPU/GPUCompute.h:54). Just cannot wrap my head around that funny one yet. But besides of me trying to understand that & learning a lot, CUDA should be doing something near that speed on bitcrack too  Tongue

Neat idea. I might give that a go and submit a pull request or fork BitCrack with that function. It should be possible.

Edit: my repo is at https://github.com/bitcoinforktech/BitCrack.git which will have some updates in the next few days.

Yeah, cuda on bitcrack has this interesting problem on the new drivers. Will try with line info later, was just doing a quick run of your repo.

Code:
[2021-01-19.17:31:52] [Info] Error: misaligned address
========= Misaligned Shared or Local Address
=========     at 0x0000e610 in keyFinderKernelWithDouble(int, int)
=========     by thread (160,0,0) in block (0,0,0)

Edit:
Most fascinating thing about this issue, is that it runs my full test keyspace in debug exe (400M)[ofc slow af], the release crashes on the error above.

I have just installed my 3070 and giving it a go, I've compiled the CUDA version a few times but only for older cards.

I hear that I have to roll back my driver to get it working for 3070, 3080 or 3090 cards, but not sure which one.  I can't get it to start at all right now on the RTX 3070, using the driver that comes with CUDA development kit 11.2.

Aside, I think I know where to fix this, if I can just get it to work on my card so I can give it a whirl :/
I used either 452 or 456, but I have other cards attached as well.
renedx
Jr. Member
*
Offline Offline

Activity: 36
Merit: 3


View Profile
January 20, 2021, 02:45:18 PM
Last edit: January 20, 2021, 03:36:05 PM by renedx
 #638


I have just installed my 3070 and giving it a go, I've compiled the CUDA version a few times but only for older cards.

I hear that I have to roll back my driver to get it working for 3070, 3080 or 3090 cards, but not sure which one.  I can't get it to start at all right now on the RTX 3070, using the driver that comes with CUDA development kit 11.2.

Aside, I think I know where to fix this, if I can just get it to work on my card so I can give it a whirl :/

You shouldn't need to roll back your drivers, though? It should run using the latest CUDA and drivers (and of course crash with the error above), at least on Windows.
It does run in legacy compatibility mode using _75 (with a CUDA injector, at ~500 MKey/s). Yet nowhere is it documented how to compile against compatibility mode, or what its effect even is.
https://docs.nvidia.com/cuda/ampere-compatibility-guide/

The docs say I should compile against compute_86,sm_86, so I'll test what that does using legacy mode. Whatever that mode even does.

Edit:
compute_86,sm_86 makes the legacy build crash. So it has something to do with Turing vs. Ampere CUDA.

Edit2:
So, to view the memory you have to enable device debugging, but when it's enabled, it 'works' (slow as hell, of course). Great.
NotATether
Legendary
*
Offline Offline

Activity: 2268
Merit: 9572


┻┻ ︵㇏(°□°㇏)


View Profile WWW
January 20, 2021, 06:04:31 PM
 #639

Edit2:
So, to view the memory you gotta enable device debugging, but when enabled, it 'works' (slow af ofc). Great.

Yeah, that's the warning I was talking about  Sad. I'm glad you were at least able to narrow it down to the Turing-to-Ampere architecture changes, though.

Perhaps there is a way in VS to print the raw pointer address each variable sits at? If it's indeed a variable that's not aligned, accessing it that way should surely trigger a misaligned access error. Alternatively, we can just view the addresses in hexadecimal and see which ones are not divisible by 4, 8, etc. I know gdb can do both of these things, but I forget the commands.
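That divisibility check is quick to script once you have the addresses in hex; the values below are hypothetical placeholders, not real dumps from Bitcrack:

```python
# Triage raw pointer addresses (as dumped in hex) against common access widths.
addrs = ["0x0000e610", "0x0000e611", "0x0000e614"]  # hypothetical example values

def misaligned_widths(hex_addr: str):
    """Return the access widths (bytes) this address is NOT aligned for."""
    v = int(hex_addr, 16)
    return [w for w in (4, 8, 16) if v % w != 0]

for a in addrs:
    print(a, "->", misaligned_widths(a) or "aligned for 4/8/16")
```

Any address that shows up as misaligned for the width of the type being loaded (4 bytes for int, 8 for a two-word load, etc.) is a candidate for the crash.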

Docs say I should compile against c86,sm86, so will test what that does using legacy mode. Whatever that mode even does.

To my understanding, the difference between compute_* and sm_* is that the compute_* targets are for compiling the CUDA code into PTX, which is a kind of assembly language for GPUs. There are different versions of the assembly specification, and PTX is forward-compatible with newer versions (the so-called compute capabilities). In fact, that's the reason brichard19 was able to leave the makefile at compute_35 all this time without bitcrack going to hell for everyone running a newer GPU family.

sm_*, on the other hand, is the specification of the binary (what we'd call the "linking phase" in C/C++ land), and it absolutely has to match the compute capability of the specific GPU you're targeting. GPU binaries are not forward-compatible, which means you can't run a CUDA binary built for one GPU family on any other family (unless its compute capability has the same major version).

For example, if you compile a CUDA program for Kepler's PTX architecture, which is what bitcrack was doing all this time, it's going to work on Maxwell, Pascal, Volta, and every family after them too.

But the binary itself won't work unless your GPU family has the same major version as the sm_ the program was compiled against (e.g. Volta 7.0 and Turing 7.5, but not Ampere 8.6 or Pascal 6.1).

If you want to distribute a CUDA binary that works on all of those families, you have to pass a low compute_ version and then one -gencode pair per family you want it to run on, e.g.

Code:
-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=sm_86

to make it run on every desktop GPU starting with Kepler (this excludes the compute caps for embedded Tegra GPUs in game consoles, a few datacenter Tesla GPUs, and the Titan V Wink). Stuffing all those binaries into a single program also makes it very large, perhaps part of the reason video game installs run to several GB.

At any rate, the blasted thing is supposed to work without passing any compute cap flags at all, so maybe we're focusing on the wrong thing Smiley I'll study the docs some more, because attempting to debug without understanding CUDA won't get us anywhere.

 
renedx
Jr. Member
*
Offline Offline

Activity: 36
Merit: 3


View Profile
January 20, 2021, 06:59:54 PM
Last edit: January 21, 2021, 01:32:59 AM by renedx
 #640



Yeah, that's the warning I was talking about  Sad . I'm glad you were able to at least narrow it down to Turing-to-Ampere arch changes though.

Perhaps there is a way in VS to print the contents of a raw pointer address that each variable is at? That should surely make an misaligned access error if we try to do that if it's indeed the variable that's not aligned. Alternatively we can just view the addresses in hexadecimal and see which ones are not divisible by 4,8, etc. I know gdb can do both of these things but I forgot the commands for them.

Docs say I should compile against c86,sm86, so will test what that does using legacy mode. Whatever that mode even does.

To my understanding, the difference between compute_* and sm_* is that the compute_ targets are for compiling the CUDA code into PTX, which is some kind of assembly code for GPUs. Basically there are different versions of assembly code specifications, and they're forward compatible with newer PTX versions (the so-called compute caps). In fact that's the reason brichard19 was able to leave the makefile at compute_35 all this time without bitcrack going to hell for everyone running a newer GPU family.

While sm_* is the specification of the binary, what we'd call in C/C++ land the "linking phase", and it absolutely has to match the compute cap for the specific GPU you're targeting in order to run on it. The GPU binaries are not forward-compatible, which means you can't run a CUDA binary corresponding on one GPU family on any other family (unless it's compute cap has the same major version).

For example, if you compile a CUDA program for Kepler's PTX architecture, which is what bitcrack was doing all this time, it's gonna work on Maxwell,Pascal,Volta,...etc. All families after it too.

But the binary itself won't work unless your GPU family has the same major version as inside the sm_ the program was compiled against (e.g Volta 7.0 and Turing 7.5, but not Ampere 8.6 or Pascal 6.1.

If you wanted to distribute a CUDA binary that works on all of those families, you gotta pass a low compute_ version and hen the sm_ versions for every family you want it to run on, e.g.  

Code:
-gencode=arch=compute_35,code=sm_35,sm_50,sm_52,sm_61,sm_75,sm_86

to make it run on every desktop GPU starting with Kepler (This excludes compute caps for embedded Tegra GPUs in game consoles, a few datacenter Tesla GPUs, and the Titan V Wink). Stuffing all those binaries inside a single program also turns out to make it very huge, perhaps the reason why video game installs are several GBs large.

At any rate, the blasted thing is supposed to work without passing any compute cap flags at all so maybe we're focusing on the wrong thing Smiley I'll study the docs some more because attempting to debug without understanding CUDA won't get us anywhere.

You're totally right; I was just reading about these. Thanks for the addition; you're great at explaining things, by the way.
The reason for trying this was the legacy build thing. I noticed the docs state the following:

Code:
1.4.3. Independent Thread Scheduling Compatibility
NVIDIA GPUs since Volta architecture have Independent Thread Scheduling among threads in a warp. If the developer made assumptions about warp-synchronicity2, this feature can alter the set of threads participating in the executed code compared to previous architectures. Please see Compute Capability 7.0 in the Programming Guide for details and corrective actions. To aid migration to the NVIDIA Ampere GPU architecture, developers can opt-in to the Pascal scheduling model with the following combination of compiler options.

nvcc -gencode=arch=compute_60,code=sm_80 ...

And while debugging, I noticed my breakpoints change when using different specific versions.

My understanding of CUDA is minimal, but I'm utterly fascinated by it. It's a shame the person with CUDA knowledge hasn't shared any insights yet.

Edit:
Any attempt to log or gather more info slows the program down just enough that it *works*.  Roll Eyes
Probably the reason it works on older hardware: it's really the speed that's killing it.