Author Topic: BIP-39 Cracking  (Read 43 times)
ipsbruno3 (OP)
Newbie
Activity: 9
Merit: 1
Today at 04:49:29 AM
#1

I recently completed some high-performance projects focused on breaking BIP-39, specifically on recovering more than 4 missing words (something no Bitcoin wallet-recovery company is willing to attempt today because of the high cost involved).

I'll break it down into topics, show a proof of concept, and give my technical observations. All the code targets OpenCL for GPU parallelization, with Python as the main orchestrator.

Today, some brute-force tools, like Hashcat, use 4-bit wNAF for computing public keys (G × k) on the secp256k1 elliptic curve, which is widely used in Bitcoin and ECDSA.

We previously used a 4-bit windowed Non-Adjacent Form (wNAF) method for point multiplication in our software. Recently, however, we developed a new, significantly faster approach based on the traditional comb method (not wNAF), optimized for high-throughput batch computation of many random scalars, a workload that cannot be addressed with algorithms like Kangaroo or Pollard's rho.

Method 1 – Comb Method (16×16 window)
Technique: Classic fixed-base comb method with parameters COMB_W = 16, COMB_D = 16
Precomputed table size: 16 columns × 65,536 entries = 1,048,576 points (~64 MB as stored here: uncompressed affine x,y, four 64-bit limbs each)
Algorithm characteristics:
  • Exactly 16 doublings (fixed)
  • Up to 16 point additions (average ~8 for random scalars)
  • All additions are affine additions (z = 1) — much cheaper than generic Jacobian adds
  • Highly regular control flow → ideal for GPU warp execution and minimal divergence
  • Typical throughput on modern GPUs: Very high (often 5–10× faster than wNAF in large batches)
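To make the column indexing concrete before the kernel listing below, here is a quick CPU-side sanity check (illustrative Python, separate from the kernel): replacing the point G by the integer 1 turns point additions into ordinary additions, so the comb(16,16) decomposition can be verified exactly.

Code:
# comb_decomposition_check.py -- illustrative sketch, not the GPU code
import random

COMB_W, COMB_D = 16, 16        # 16 rows x 16 columns = 256 scalar bits

def comb_reconstruct(k: int) -> int:
    total = 0
    for col in range(COMB_D):
        # bit j of idx comes from scalar bit (col + 16*j), as in the kernel
        idx = 0
        for j in range(COMB_W):
            idx |= ((k >> (col + j * COMB_D)) & 1) << j
        # a table entry for idx stands for sum_j bit_j * 2^(16*j) * G;
        # with G -> 1 it is this integer, weighted by 2^col via the doublings
        entry = sum(((idx >> j) & 1) << (j * COMB_D) for j in range(COMB_W))
        total += entry << col
    return total

for _ in range(1000):
    k = random.getrandbits(256)
    assert comb_reconstruct(k) == k
print("comb(16,16) decomposition reconstructs k: OK")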

wNAF method
Technique: Windowed Non-Adjacent Form with 4-bit digits (odd digits in the range −7 to +7, no adjacent non-zeros)
Precomputed table size: Only 96 fixed multiples of the base point G (~3 KB)
Algorithm characteristics:
  • ~120–140 doublings per scalar (average)
  • ~40–55 point additions (average, depending on NAF Hamming weight)
  • All additions are generic Jacobian additions (z ≠ 1) — ~40–60% more expensive per operation
  • More variable control flow (digit-dependent) → higher branch divergence on GPU
  • Typical throughput on modern GPUs: Noticeably lower than Comb in batch/high-throughput scenarios
In short, this is why Method 1 (Comb 16×16) is significantly faster on GPU.
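For reference, this is what the 4-bit wNAF recoding described above looks like in a few lines of Python (a sketch of the standard recoding, not taken from any of the tools mentioned). It makes the digit set and the add/double counts easy to measure; the counts here are for plain 256-bit scalars (an endomorphism split, which some tools use, roughly halves the doubling count).

Code:
# wnaf_recode.py -- standard width-w NAF recoding, illustrative only
import random

N_ORDER = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def wnaf(k: int, w: int = 4):
    """Digits are 0 or odd with |d| <= 2**(w-1) - 1, no adjacent non-zeros."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)             # k mod 2^w, odd here
            if d >= (1 << (w - 1)):
                d -= (1 << w)            # map into the odd range -7..+7
            k -= d                       # clears the low w bits
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

adds, doubles = [], []
for _ in range(500):
    ds = wnaf(random.getrandbits(256) % N_ORDER)
    adds.append(sum(1 for d in ds if d))   # non-zero digits -> point adds
    doubles.append(len(ds))                # digit positions -> doublings
print(f"avg point adds : {sum(adds)/len(adds):.1f}")   # ~51 (density 1/(w+1))
print(f"avg doublings  : {sum(doubles)/len(doubles):.1f}")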


Here's a snippet of the core arithmetic I use in a high-performance Comb(16,16) kernel (validated against coincurve):

Code:
# test_secp256k1_opencl_ulong.py

import numpy as np
import pyopencl as cl
import time
import random

P = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F

N_ORDER = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
P_FIELD = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F

A = 0
B = 7

Gx = int("79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798", 16)
Gy = int("483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8", 16)

MASK64 = (1 << 64) - 1
COMB_W = 16
COMB_D = 16
print("COMB_W COMB_D", COMB_W, COMB_D)
OPENCL_SRC = f"\n\
#define PVT __private\n\
#define COMB_W {COMB_W}\n\
#define COMB_D {COMB_D}" + r"""
#define COMB_T (1u << COMB_W)
typedef ulong fe_t[4];
//#define DISABLE_ZERO_VERIFICATION 1
//#define DISABLE_INFINITE_VERIFICATION 1

typedef struct {
  fe_t x, y, z;
} jac_t;

static constant ulong SECP_P[4] = {
  (ulong)0xFFFFFFFEFFFFFC2FUL, (ulong)0xFFFFFFFFFFFFFFFFUL,
  (ulong)0xFFFFFFFFFFFFFFFFUL, (ulong)0xFFFFFFFFFFFFFFFFUL
};

static constant ulong SECP_P_MINUS_2[4] = {
  (ulong)0xFFFFFFFEFFFFFC2DUL, (ulong)0xFFFFFFFFFFFFFFFFUL,
  (ulong)0xFFFFFFFFFFFFFFFFUL, (ulong)0xFFFFFFFFFFFFFFFFUL
};

static constant ulong SECP_GX[4] = {
  (ulong)0x59F2815B16F81798UL, (ulong)0x029BFCDB2DCE28D9UL,
  (ulong)0x55A06295CE870B07UL, (ulong)0x79BE667EF9DCBBACUL
};

static constant ulong SECP_GY[4] = {
  (ulong)0x9C47D08FFB10D4B8UL, (ulong)0xFD17B448A6855419UL,
  (ulong)0x5DA4FBFC0E1108A8UL, (ulong)0x483ADA7726A3C465UL
};

inline ulong addc64(ulong x, ulong y, PVT ulong *c) {
  ulong s  = x + y;
  ulong c1 = (s < x);
  ulong s2 = s + *c;
  ulong c2 = (s2 < s);
  *c = (c1 | c2);
  return s2;
}

inline ulong subb64(ulong x, ulong y, PVT ulong *b) {
  ulong t  = x - y;
  ulong b1 = (x < y);
  ulong t2 = t - *b;
  ulong b2 = (t < *b);
  *b = (b1 | b2);
  return t2;
}

inline void fe_copy(PVT ulong r[4], const PVT ulong a[4]) { r[0]=a[0]; r[1]=a[1]; r[2]=a[2]; r[3]=a[3]; }
inline void fe_clear(PVT ulong r[4]) { r[0]=0; r[1]=0; r[2]=0; r[3]=0; }
inline void fe_set_one(PVT ulong r[4]) { r[0]=1; r[1]=0; r[2]=0; r[3]=0; }

inline int fe_is_zero(const PVT ulong a[4]) { return (a[0] | a[1] | a[2] | a[3]) == 0; }

inline void fe_cond_sub_p(PVT ulong a[4]) {
  ulong t[4];
  ulong br = 0;
  t[0] = subb64(a[0], SECP_P[0], &br);
  t[1] = subb64(a[1], SECP_P[1], &br);
  t[2] = subb64(a[2], SECP_P[2], &br);
  t[3] = subb64(a[3], SECP_P[3], &br);

  ulong cond = (br == 0);
  ulong m = (ulong)0 - cond;

  a[0] = (a[0] & ~m) | (t[0] & m);
  a[1] = (a[1] & ~m) | (t[1] & m);
  a[2] = (a[2] & ~m) | (t[2] & m);
  a[3] = (a[3] & ~m) | (t[3] & m);
}

inline void fe_add(PVT ulong r[4], const PVT ulong a[4], const PVT ulong b[4]) {
  ulong s[4];
  ulong c = 0;
  s[0] = addc64(a[0], b[0], &c);
  s[1] = addc64(a[1], b[1], &c);
  s[2] = addc64(a[2], b[2], &c);
  s[3] = addc64(a[3], b[3], &c);

  ulong t[4];
  ulong br = 0;
  t[0] = subb64(s[0], SECP_P[0], &br);
  t[1] = subb64(s[1], SECP_P[1], &br);
  t[2] = subb64(s[2], SECP_P[2], &br);
  t[3] = subb64(s[3], SECP_P[3], &br);

  ulong cond = c | (br == 0);
  ulong m = (ulong)0 - cond;

  r[0] = (s[0] & ~m) | (t[0] & m);
  r[1] = (s[1] & ~m) | (t[1] & m);
  r[2] = (s[2] & ~m) | (t[2] & m);
  r[3] = (s[3] & ~m) | (t[3] & m);
}

inline void fe_sub(PVT ulong r[4], const PVT ulong a[4], const PVT ulong b[4]) {
  ulong d[4];
  ulong br = 0;
  d[0] = subb64(a[0], b[0], &br);
  d[1] = subb64(a[1], b[1], &br);
  d[2] = subb64(a[2], b[2], &br);
  d[3] = subb64(a[3], b[3], &br);

  ulong ap[4];
  ulong c = 0;
  ap[0] = addc64(d[0], SECP_P[0], &c);
  ap[1] = addc64(d[1], SECP_P[1], &c);
  ap[2] = addc64(d[2], SECP_P[2], &c);
  ap[3] = addc64(d[3], SECP_P[3], &c);

  ulong m = (ulong)0 - br;
  r[0] = (d[0] & ~m) | (ap[0] & m);
  r[1] = (d[1] & ~m) | (ap[1] & m);
  r[2] = (d[2] & ~m) | (ap[2] & m);
  r[3] = (d[3] & ~m) | (ap[3] & m);
}

inline void acc_add128(PVT ulong *lo, PVT ulong *hi, PVT ulong *ex, ulong plo, ulong phi) {
  ulong nlo = *lo + plo;
  ulong c0  = (nlo < *lo);
  *lo = nlo;
  ulong t   = *hi + phi;
  ulong cA  = (t < *hi);
  ulong t2  = t + c0;
  ulong cB  = (t2 < t);
  *hi = t2;
  *ex += (cA + cB);
}

inline void mul256_raw(PVT ulong t[8], const PVT ulong a[4], const PVT ulong b[4])
{
  const ulong a0=a[0], a1=a[1], a2=a[2], a3=a[3];
  const ulong b0=b[0], b1=b[1], b2=b[2], b3=b[3];

  ulong carry_lo = 0, carry_hi = 0;
  ulong lo, hi, ex;

  #define MAC(x,y) do {                     \
      ulong plo = (x) * (y);               \
      ulong phi = mul_hi((x), (y));        \
      acc_add128(&lo, &hi, &ex, plo, phi); \
    } while (0)

  // k=0
  lo = carry_lo; hi = carry_hi; ex = 0;
  MAC(a0,b0);
  t[0] = lo; carry_lo = hi; carry_hi = ex;

  // k=1
  lo = carry_lo; hi = carry_hi; ex = 0;
  MAC(a0,b1); MAC(a1,b0);
  t[1] = lo; carry_lo = hi; carry_hi = ex;

  // k=2
  lo = carry_lo; hi = carry_hi; ex = 0;
  MAC(a0,b2); MAC(a1,b1); MAC(a2,b0);
  t[2] = lo; carry_lo = hi; carry_hi = ex;

  // k=3
  lo = carry_lo; hi = carry_hi; ex = 0;
  MAC(a0,b3); MAC(a1,b2); MAC(a2,b1); MAC(a3,b0);
  t[3] = lo; carry_lo = hi; carry_hi = ex;

  // k=4
  lo = carry_lo; hi = carry_hi; ex = 0;
  MAC(a1,b3); MAC(a2,b2); MAC(a3,b1);
  t[4] = lo; carry_lo = hi; carry_hi = ex;

  // k=5
  lo = carry_lo; hi = carry_hi; ex = 0;
  MAC(a2,b3); MAC(a3,b2);
  t[5] = lo; carry_lo = hi; carry_hi = ex;

  // k=6
  lo = carry_lo; hi = carry_hi; ex = 0;
  MAC(a3,b3);
  t[6] = lo; carry_lo = hi; carry_hi = ex;

  // k=7: only the carry remains (bits above 512 should be 0)
  t[7] = carry_lo;

  #undef MAC
}

inline ulong add64_cc(ulong s, ulong x, PVT uint *cc) {
  ulong t = s + x;
  *cc += (t < s);
  return t;
}

inline void fe_reduce512(PVT ulong r[4], const PVT ulong t[8]) {
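  // p = 2^256 - 2^32 - 977, so each high limb H folds back into the
  // low 320 bits as H*977 + (H << 32); the per-limb sums below do exactly that.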
  ulong H0 = t[4], H1 = t[5], H2 = t[6], H3 = t[7];
  ulong  m0_hi = mul_hi(H0, 977UL);
  ulong m1_hi = mul_hi(H1, 977UL);
  ulong  m2_hi = mul_hi(H2, 977UL);
  ulong  m3_hi = mul_hi(H3, 977UL);

  ulong s0_lo = (H0 << 32), s0_hi = (H0 >> 32);
  ulong s1_lo = (H1 << 32), s1_hi = (H1 >> 32);
  ulong s2_lo = (H2 << 32), s2_hi = (H2 >> 32);
  ulong s3_lo = (H3 << 32), s3_hi = (H3 >> 32);

  ulong r5[5];
  ulong carry = 0;

  // limb 0
  { uint cc=0; ulong s=t[0];
    s=add64_cc(s,H0 * 977UL,&cc); s=add64_cc(s,s0_lo,&cc); s=add64_cc(s,carry,&cc);
    r5[0]=s; carry=(ulong)cc; }
  // limb 1
  { uint cc=0; ulong s=t[1];
    s=add64_cc(s,H1 * 977UL,&cc); s=add64_cc(s,m0_hi,&cc);
    s=add64_cc(s,s1_lo,&cc); s=add64_cc(s,s0_hi,&cc);
    s=add64_cc(s,carry,&cc);
    r5[1]=s; carry=(ulong)cc; }
  // limb 2
  { uint cc=0; ulong s=t[2];
    s=add64_cc(s,H2 * 977UL,&cc); s=add64_cc(s,m1_hi,&cc);
    s=add64_cc(s,s2_lo,&cc); s=add64_cc(s,s1_hi,&cc);
    s=add64_cc(s,carry,&cc);
    r5[2]=s; carry=(ulong)cc; }
  // limb 3
  { uint cc=0; ulong s=t[3];
    s=add64_cc(s,H3 * 977UL,&cc); s=add64_cc(s,m2_hi,&cc);
    s=add64_cc(s,s3_lo,&cc); s=add64_cc(s,s2_hi,&cc);
    s=add64_cc(s,carry,&cc);
    r5[3]=s; carry=(ulong)cc; }
  // limb 4
  { uint cc=0; ulong s=0;
    s=add64_cc(s,m3_hi,&cc); s=add64_cc(s,s3_hi,&cc); s=add64_cc(s,carry,&cc);
    r5[4]=s; }

  // fold r5[4] using 2^256 ≡ 2^32 + 977 (mod p)
  #pragma unroll
  for (int it=0; it<3; ++it) {
    ulong k = r5[4];
    if (k==0) break;
    r5[4]=0;

    ulong k_lo = k*977UL, k_hi = mul_hi(k,977UL);
    ulong k_slo = (k<<32), k_shi = (k>>32);

    ulong c = 0;
    // limb 0
    { uint cc=0; ulong s=r5[0];
      s=add64_cc(s,k_lo,&cc); s=add64_cc(s,k_slo,&cc); s=add64_cc(s,c,&cc);
      r5[0]=s; c=(ulong)cc; }
    // limb 1
    { uint cc=0; ulong s=r5[1];
      s=add64_cc(s,k_hi,&cc); s=add64_cc(s,k_shi,&cc); s=add64_cc(s,c,&cc);
      r5[1]=s; c=(ulong)cc; }
    // limb 2..4 propagate carry-count
    for (int i=2;i<5;i++){
      uint cc=0; ulong s=r5[i];
      s=add64_cc(s,c,&cc);
      r5[i]=s; c=(ulong)cc;
    }
  }

  r[0]=r5[0]; r[1]=r5[1]; r[2]=r5[2]; r[3]=r5[3];
  fe_cond_sub_p(r);
  fe_cond_sub_p(r);
}

inline void fe_mul(PVT ulong r[4], const PVT ulong a[4], const PVT ulong b[4]) {
  ulong t[8];
  mul256_raw(t, a, b);
  fe_reduce512(r, t);
}

inline void fe_sqr(PVT ulong r[4], const PVT ulong a[4]) { fe_mul(r, a, a); }

inline uint exp_bit(int bit) {
  int w = bit >> 6;
  int o = bit & 63;
  return (uint)((SECP_P_MINUS_2[w] >> o) & 1UL);
}

inline void fe_inv(PVT ulong r[4], const PVT ulong a[4]) {
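  // Fermat inversion: r = a^(p-2) mod p. In hex, p-2 is 55 'F' nibbles,
  // then 'E', five 'F', 'C', '2', 'D'; the fixed chain below consumes one
  // nibble per step (4 squarings + one multiply by a precomputed a^nibble).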
  if (fe_is_zero(a)) { fe_clear(r); return; }
  ulong a2[4], a4[4], a8[4], a12[4], a13[4], a14[4], a15[4];
  fe_sqr(a2, a);     // a^2
  fe_sqr(a4, a2);    // a^4
  fe_sqr(a8, a4);    // a^8
  fe_mul(a12, a8, a4);   // a^12
  fe_mul(a13, a12, a);   // a^13
  fe_mul(a14, a12, a2);  // a^14
  fe_mul(a15, a14, a);   // a^15
  fe_copy(r, a15);
  #define SQR4_INPLACE(v) do { fe_sqr((v),(v)); fe_sqr((v),(v)); fe_sqr((v),(v)); fe_sqr((v),(v)); } while(0)
  for (int i=1; i<=54; i++) { SQR4_INPLACE(r); fe_mul(r, r, a15); } // F
  { SQR4_INPLACE(r); fe_mul(r, r, a14); }                          // E
  for (int i=56; i<=60; i++) { SQR4_INPLACE(r); fe_mul(r, r, a15); } // F
  { SQR4_INPLACE(r); fe_mul(r, r, a12); }                          // C
  { SQR4_INPLACE(r); fe_mul(r, r, a2);  }                          // 2
  { SQR4_INPLACE(r); fe_mul(r, r, a13); }                          // D

  #undef SQR4_INPLACE
}

// ---- point ops (Jacobian) ----
inline void point_set_inf(PVT jac_t *P) { fe_clear(P->x); fe_clear(P->y); fe_clear(P->z); }
inline int  point_is_inf(const PVT jac_t *P) { return fe_is_zero(P->z); }


inline void point_set_affine(PVT jac_t *P, const PVT ulong ax[4], const PVT ulong ay[4]) {
  P->x[0]=ax[0]; P->x[1]=ax[1]; P->x[2]=ax[2]; P->x[3]=ax[3];
  P->y[0]=ay[0]; P->y[1]=ay[1]; P->y[2]=ay[2]; P->y[3]=ay[3];
  fe_set_one(P->z);
}

inline void point_copy(PVT jac_t *R, const PVT jac_t *P) { fe_copy(R->x,P->x); fe_copy(R->y,P->y); fe_copy(R->z,P->z); }



inline void point_double(PVT jac_t *R, const PVT jac_t *P) {
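  // Standard Jacobian doubling specialized to a = 0 (secp256k1):
  // A=X^2, B=Y^2, C=B^2, D=2((X+B)^2-A-C), E=3A, F=E^2,
  // X3=F-2D, Y3=E(D-X3)-8C, Z3=2YZ.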
#ifndef DISABLE_INFINITE_VERIFICATION
  if (point_is_inf(P) || fe_is_zero(P->y)) { point_set_inf(R); return; }
#endif
  ulong A_[4], B_[4], C_[4], D_[4], E_[4], F_[4], tmp[4], tmp2[4];
  fe_sqr(A_, P->x);
  fe_sqr(B_, P->y);
  fe_sqr(C_, B_);
  fe_add(tmp, P->x, B_);
  fe_sqr(tmp, tmp);
  fe_sub(tmp, tmp, A_);
  fe_sub(tmp, tmp, C_);
  fe_add(D_, tmp, tmp);
  fe_add(tmp, A_, A_);
  fe_add(E_, tmp, A_);
  fe_sqr(F_, E_);
  fe_add(tmp, D_, D_);
  fe_sub(R->x, F_, tmp);
  fe_sub(tmp, D_, R->x);
  fe_mul(tmp, E_, tmp);
  fe_add(tmp2, C_, C_);
  fe_add(tmp2, tmp2, tmp2);
  fe_add(tmp2, tmp2, tmp2);
  fe_sub(R->y, tmp, tmp2);
  fe_mul(tmp, P->y, P->z);
  fe_add(R->z, tmp, tmp);
}

inline void point_add(PVT jac_t *R, const PVT jac_t *P, const PVT jac_t *Q)  {
#ifndef DISABLE_INFINITE_VERIFICATION
  if (point_is_inf(P)) { point_copy(R,Q); return; }
  if (point_is_inf(Q)) { point_copy(R,P); return; }
#endif
  ulong Z1Z1[4], Z2Z2[4], U1[4], U2[4], S1[4], S2[4];
  ulong H[4], RR[4], HH[4], HHH[4], V[4], tmp[4], tmp2[4];

  fe_sqr(Z1Z1, P->z);
  fe_sqr(Z2Z2, Q->z);
  fe_mul(U1, P->x, Z2Z2);
  fe_mul(U2, Q->x, Z1Z1);
  fe_mul(tmp, Q->z, Z2Z2);
  fe_mul(S1, P->y, tmp);
  fe_mul(tmp, P->z, Z1Z1);
  fe_mul(S2, Q->y, tmp);
  fe_sub(H, U2, U1);
  fe_sub(RR, S2, S1);
  
#ifndef DISABLE_ZERO_VERIFICATION
  if (fe_is_zero(H)) {
    if (fe_is_zero(RR)) point_double(R,P);
    else {point_set_inf(R);}
    return;
  }
  #endif

  fe_sqr(HH, H);
  fe_mul(HHH, H, HH);
  fe_mul(V, U1, HH);
  fe_sqr(tmp, RR);
  fe_add(tmp2, V, V);
  fe_sub(tmp, tmp, HHH);
  fe_sub(R->x, tmp, tmp2);
  fe_sub(tmp, V, R->x);
  fe_mul(tmp, RR, tmp);
  fe_mul(tmp2, S1, HHH);
  fe_sub(R->y, tmp, tmp2);
  fe_mul(tmp, P->z, Q->z);
  fe_mul(R->z, H, tmp);
}

inline void point_to_affine(PVT ulong ax[4], PVT ulong ay[4], const PVT jac_t *P) {
  #ifndef DISABLE_INFINITE_VERIFICATION
  if (point_is_inf(P)) { fe_clear(ax); fe_clear(ay); return; }
  #endif
  
  ulong zinv[4], z2[4], z3[4];
  fe_inv(zinv, P->z);
  fe_sqr(z2, zinv);
  fe_mul(z3, z2, zinv);
  fe_mul(ax, P->x, z2);
  fe_mul(ay, P->y, z3);
}
#define COPY_FOUR(a,b) \
    a[0] = b[0];\
    a[1] = b[1];\
    a[2] = b[2];\
    a[3] = b[3];
#define COPY_FOUR_OFFSET(a,b,c) \
    a[0] = b[0+c];\
    a[1] = b[1+c];\
    a[2] = b[2+c];\
    a[3] = b[3+c];
    
inline void point_set_affine_global(PVT jac_t *P,
                                    __global const ulong *ax_g,
                                    __global const ulong *ay_g)
{
  COPY_FOUR(P->x, ax_g);
  COPY_FOUR(P->y, ay_g);
  fe_set_one(P->z);
}


inline void point_add_affine(PVT jac_t *R,
                             const PVT jac_t *P,
                             const PVT ulong Qx[4],
                             const PVT ulong Qy[4])
{
#ifndef DISABLE_INFINITE_VERIFICATION
  if (point_is_inf(P)) {
    point_set_affine(R, Qx, Qy);
    return;
  }
  #endif
  ulong Z1Z1[4], U2[4], S2[4], H[4], RR[4], HH[4], HHH[4], V[4];
  ulong tmp[4], tmp2[4];
  fe_sqr(Z1Z1, P->z);
  fe_mul(U2, Qx, Z1Z1);
  fe_mul(tmp, P->z, Z1Z1);
  fe_mul(S2, Qy, tmp);
  fe_sub(H, U2, P->x);
  fe_sub(RR, S2, P->y);
#ifndef DISABLE_ZERO_VERIFICATION
  if (fe_is_zero(H)) {
    if (fe_is_zero(RR)) {
      point_double(R, P);
    } else {
      point_set_inf(R);
    }
    return;
  }
  #endif
  fe_sqr(HH, H);
  fe_mul(HHH, H, HH);
  fe_mul(V, P->x, HH);
  fe_sqr(tmp, RR);
  fe_add(tmp2, V, V);
  fe_sub(tmp, tmp, HHH);
  fe_sub(R->x, tmp, tmp2);
  fe_sub(tmp, V, R->x);
  fe_mul(tmp, RR, tmp);
  fe_mul(tmp2, P->y, HHH);
  fe_sub(R->y, tmp, tmp2);
  fe_mul(R->z, P->z, H);
}

inline void point_mulG_comb8(PVT jac_t *R,const PVT ulong k[4],__global const ulong *COMB)
{

  jac_t acc, T;
  point_set_inf(&acc);

  int started = 0;
  #pragma unroll  
  for (int col = COMB_D - 1; col >= 0; --col) {
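    // Gather bit j of idx from scalar bit (col + 16*j), one bit per comb row.
    // The per-column doubling below supplies the 2^col weight, so the
    // precomputed table entries omit it.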
    uint idx =
    (((k[ col      >> 6] >> ( col       & 63)) & 1u) <<  0) |
    (((k[(col+ 16) >> 6] >> ((col+ 16)  & 63)) & 1u) <<  1) |
    (((k[(col+ 32) >> 6] >> ((col+ 32)  & 63)) & 1u) <<  2) |
    (((k[(col+ 48) >> 6] >> ((col+ 48)  & 63)) & 1u) <<  3) |
    (((k[(col+ 64) >> 6] >> ((col+ 64)  & 63)) & 1u) <<  4) |
    (((k[(col+ 80) >> 6] >> ((col+ 80)  & 63)) & 1u) <<  5) |
    (((k[(col+ 96) >> 6] >> ((col+ 96)  & 63)) & 1u) <<  6) |
    (((k[(col+112) >> 6] >> ((col+112)  & 63)) & 1u) <<  7) |
    (((k[(col+128) >> 6] >> ((col+128)  & 63)) & 1u) <<  8) |
    (((k[(col+144) >> 6] >> ((col+144)  & 63)) & 1u) <<  9) |
    (((k[(col+160) >> 6] >> ((col+160)  & 63)) & 1u) << 10) |
    (((k[(col+176) >> 6] >> ((col+176)  & 63)) & 1u) << 11) |
    (((k[(col+192) >> 6] >> ((col+192)  & 63)) & 1u) << 12) |
    (((k[(col+208) >> 6] >> ((col+208)  & 63)) & 1u) << 13) |
    (((k[(col+224) >> 6] >> ((col+224)  & 63)) & 1u) << 14) |
    (((k[(col+240) >> 6] >> ((col+240)  & 63)) & 1u) << 15);
    
    if (!started) {
      if (idx == 0) continue;

      const __global ulong *p = COMB + ((col * COMB_T + idx) * 8);
      // p[0..3]=x, p[4..7]=y
      point_set_affine_global(&acc, p + 0, p + 4);
      started = 1;
      continue;
    }
    point_double(&T, &acc);
    point_copy(&acc, &T);
    if (idx) {
      const __global ulong *p = COMB + ((col * COMB_T + idx) * 8);
      ulong qx[4], qy[4];
      COPY_FOUR(qx,p);
      COPY_FOUR_OFFSET(qy,p,4);

      point_add_affine(&T, &acc, qx, qy);
      point_copy(&acc, &T);
    }
  }

  if (!started) point_set_inf(&acc);
  point_copy(R, &acc);
}

__kernel void k_point_mulG_comb8(__global const ulong *K,
                                 __global const ulong *COMB,
                                 __global ulong *Rx,
                                 __global ulong *Ry)
{
  int gid = (int)get_global_id(0);

  ulong k4[4], ax[4], ay[4];
  #pragma unroll
  for (int i=0;i<4;i++) k4[i] = K[gid*4+i];

  jac_t R;
  point_mulG_comb8(&R, k4, COMB);
  point_to_affine(ax, ay, &R);

  #pragma unroll
  for (int i=0;i<4;i++){
    Rx[gid*4+i] = ax[i];
    Ry[gid*4+i] = ay[i];
  }
}
"""

def make_comb_buffer(ctx, comb_u64):
    mf = cl.mem_flags
    return cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=comb_u64)
def run_kernel_point_mulG_comb8_profiled(prg, q, K_u64, COMB_buf):
    mf = cl.mem_flags
    ctx = q.context
    N = K_u64.shape[0] // 4

    # per-run buffers (K varies; Rx/Ry vary)
    K_buf  = cl.Buffer(ctx, mf.READ_ONLY,  K_u64.nbytes)
    Rx = np.empty_like(K_u64); Ry = np.empty_like(K_u64)
    Rx_buf = cl.Buffer(ctx, mf.WRITE_ONLY, Rx.nbytes)
    Ry_buf = cl.Buffer(ctx, mf.WRITE_ONLY, Ry.nbytes)

    ev_h2d = cl.enqueue_copy(q, dest=K_buf, src=K_u64)

    ev_k = prg.k_point_mulG_comb8(
        q, (N,), (64,),  # local=64
        K_buf, COMB_buf, Rx_buf, Ry_buf,
        wait_for=[ev_h2d]
    )

    ev_d2h_rx = cl.enqueue_copy(q, dest=Rx, src=Rx_buf, wait_for=[ev_k])
    ev_d2h_ry = cl.enqueue_copy(q, dest=Ry, src=Ry_buf, wait_for=[ev_k])
    ev_d2h_ry.wait()

    def ms(ev): return (ev.profile.end - ev.profile.start) * 1e-6

    prof = {
        "h2d_K_ms": ms(ev_h2d),
        "kernel_ms": ms(ev_k),
        "d2h_rx_ms": ms(ev_d2h_rx),
        "d2h_ry_ms": ms(ev_d2h_ry),
    }
    prof["d2h_ms"] = prof["d2h_rx_ms"] + prof["d2h_ry_ms"]
    prof["gpu_pipeline_ms"] = prof["h2d_K_ms"] + prof["kernel_ms"] + prof["d2h_ms"]
    return Rx, Ry, prof

def int_to_limbs_le64(x: int):
    return [(x >> (64*i)) & MASK64 for i in range(4)]

def limbs_le64_to_int(a4):
    return int(a4[0]) | (int(a4[1]) << 64) | (int(a4[2]) << 128) | (int(a4[3]) << 192)
def print_bench(M, prof):
    k_s = prof["kernel_ms"] / 1000.0
    p_s = prof["gpu_pipeline_ms"] / 1000.0
    print("\n[COMB8 mulG timings]")
    print(f"  H2D(K)     : {prof['h2d_K_ms']:.3f} ms")
    print(f"  Kernel     : {prof['kernel_ms']:.3f} ms")
    print(f"  D2H(Rx)    : {prof['d2h_rx_ms']:.3f} ms")
    print(f"  D2H(Ry)    : {prof['d2h_ry_ms']:.3f} ms")
    print(f"  D2H total  : {prof['d2h_ms']:.3f} ms")
    print(f"  Pipeline   : {prof['gpu_pipeline_ms']:.3f} ms")
    if k_s > 0: print(f"  Throughput kernel : {M/k_s/1e6:.3f} Mpoints/s")
    if p_s > 0: print(f"  Throughput pipe   : {M/p_s/1e6:.3f} Mpoints/s")

def pubkey_xy_from_scalar_lib(k: int):
    k_mod = k % N_ORDER
    if k_mod == 0:
        return None
    from coincurve import PrivateKey
    priv = PrivateKey(k_mod.to_bytes(32, "big"))
    pub65 = priv.public_key.format(compressed=False)
    x = int.from_bytes(pub65[1:33], "big")
    y = int.from_bytes(pub65[33:65], "big")
    return (x % P_FIELD, y % P_FIELD)

def pick_device():
    plats = cl.get_platforms()
    for p in plats:
        gpus = p.get_devices(device_type=cl.device_type.GPU)
        if gpus: return gpus[0]
    for p in plats:
        devs = p.get_devices()
        if devs: return devs[0]
    raise RuntimeError("No OpenCL device found.")

def build_ctx_queue():
    dev = pick_device()
    ctx = cl.Context([dev])
    q = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    print(f"[OpenCL] Platform: {dev.platform.name}")
    print(f"[OpenCL] Device  : {dev.name}")
    return ctx, q


def validate_with_coincurve_sample(scalars, Rx, Ry, sample=1, seed=1234):
    rng = random.Random(seed)
    M = len(scalars)

    Rx4 = Rx.reshape(-1, 4)
    Ry4 = Ry.reshape(-1, 4)

    sample = min(sample, M)
    idxs = [rng.randrange(M) for _ in range(sample)]

    t0 = time.perf_counter()
    for i in idxs:
        k = scalars[i]
        exp = pubkey_xy_from_scalar_lib(k)

        gotx = limbs_le64_to_int(Rx4[i]) % P_FIELD
        goty = limbs_le64_to_int(Ry4[i]) % P_FIELD

        if exp is None:
            if not (gotx == 0 and goty == 0):
                raise AssertionError(f"[mulG] idx={i} expected INF, got x={gotx:x} y={goty:x}")
        else:
            expx, expy = exp
            if gotx != expx or goty != expy:
                raise AssertionError(
                    f"[mulG] mismatch idx={i}\n"
                    f"  k=0x{k:064x}\n"
                    f"  got x=0x{gotx:064x}\n"
                    f"  exp x=0x{expx:064x}\n"
                    f"  got y=0x{goty:064x}\n"
                    f"  exp y=0x{expy:064x}\n"
                )
    t1 = time.perf_counter()

    dt = t1 - t0
    print("\n[Validação coincurve (amostra)]")
    print(f"  amostras     : {sample:,}")
    print(f"  tempo        : {dt:.3f} s")
    if dt > 0:
        print(f"  throughput   : {sample/dt:,.1f} points/s")
    print("[OK] validação coincurve ✅")
import coincurve
import struct


def bytes_to_u64_le(b32_be: bytes):
    return list(struct.unpack("<4Q", b32_be[::-1]))

def scalar_to_point_u64(scalar: int):
    scalar %= N_ORDER
    if scalar == 0:
        return [0]*4, [0]*4
    pk = coincurve.PrivateKey.from_int(scalar)
    pt = pk.public_key.format(compressed=False)  # 0x04 || X32 || Y32
    x_be, y_be = pt[1:33], pt[33:65]
    return bytes_to_u64_le(x_be), bytes_to_u64_le(y_be)

def build_comb_table_u64(COMB_W, COMB_D):
    TABLE_SIZE = 1 << COMB_W
    table = np.zeros((COMB_D, TABLE_SIZE, 8), dtype=np.uint64)
    for d in range(COMB_D):
        for i in range(TABLE_SIZE):
            scalar = 0
            for w in range(COMB_W):
                scalar |= ((i >> w) & 1) << (w * COMB_D)   # no '+ d' term: the kernel's per-column doubling supplies the 2^d weight
            x4, y4 = scalar_to_point_u64(scalar)
            table[d, i, 0:4] = x4
            table[d, i, 4:8] = y4
    return np.ascontiguousarray(table.reshape(-1))

def main():
    ctx, q = build_ctx_queue()
    prg = cl.Program(ctx, OPENCL_SRC).build(options=[])

    comb_u64 = build_comb_table_u64(COMB_W,COMB_D)
    COMB_buf = make_comb_buffer(ctx, comb_u64)
    print(f"COMB_u64 bytes: {comb_u64.nbytes/1024:.1f} KB")
    rng = random.Random(1234)
    M = 4_000_000
    M -= M % 64
    t0 = time.perf_counter()
    scalars = [(rng.getrandbits(256) % N_ORDER) for _ in range(M)]
    t1 = time.perf_counter()
    print(f"Scalars geradas: {M:,} em {(t1-t0):.3f}s")
    t2 = time.perf_counter()
    K_u64 = np.array([w for k in scalars for w in int_to_limbs_le64(k)], dtype=np.uint64)
    t3 = time.perf_counter()
    print(f"Packing K_u64: {(t3-t2):.3f}s  (bytes={K_u64.nbytes/1024/1024:.2f} MB)")
    Rx, Ry, prof = run_kernel_point_mulG_comb8_profiled(prg, q, K_u64, COMB_buf)
    print_bench(M/4, prof)
    validate_with_coincurve_sample(scalars, Rx, Ry, sample=20_000_000, seed=1234)

    print("\nDone.")

if __name__ == "__main__":
    main()

___

Benchmarks:

Quote
[COMB8 mulG timings]
Packing K_u64     : 2.585 s (bytes = 122.07 MB)
H2D(K)            : 16.546 ms
Kernel            : 111.190 ms
Throughput kernel : 22.484 Mpoints/s
Throughput kernel without point_to_affine : 43.365 Mpoints/s




In my opinion, the code isn't fully mature yet regarding memory optimizations (optimized array layouts, loop unrolling, coalesced global memory access, efficient shared/constant memory usage, etc.), but even in its current form we're already heading towards exceeding 2.5 million seeds processed per second per GPU in batch.

Theoretically, on an 8× RTX 5090 / PRO 6000 Blackwell rig, this would equate to roughly 500 million (0.5 billion) public keys generated per second.

Key point: this is achieved without any artificial inflation tricks like performing only sequential point additions (the common cheat in many brute-force tools that claim 'billions/s': fast in benchmarks, but useless for real ECDLP work because it doesn't solve the discrete log problem efficiently).

My implementation uses the comb method (with larger windows, e.g. 16×16, heavily GPU-optimized) instead of traditional wNAF-4. This enables much more efficient computation of distinct points on the curve, which is a huge advantage for exhaustive BIP-39 seed searches (or reduced-entropy/mnemonic ranges), where we need true high-throughput, independent pubkey generation for each seed → master key → derivation path, not just incremental additions or collisions.
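For context, the seed → master key step mentioned above is plain BIP-32: one HMAC-SHA512 keyed with the string "Bitcoin seed" over the 64-byte PBKDF2 output. A minimal standard-library sketch (illustrative; child-key derivation along the path is omitted):

Code:
import hashlib, hmac

def bip32_master_from_seed(seed: bytes):
    """BIP-32: master private key and chain code from a BIP-39 seed."""
    I = hmac.new(b"Bitcoin seed", seed, hashlib.sha512).digest()
    return I[:32], I[32:]          # (master key, chain code)

# the seed itself is PBKDF2-HMAC-SHA512(mnemonic, "mnemonic" + passphrase, 2048)
mnemonic = b"abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon about"
seed = hashlib.pbkdf2_hmac("sha512", mnemonic, b"mnemonic", 2048)
master_key, chain_code = bip32_master_from_seed(seed)
print(master_key.hex())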

Compared to generic Pollard rho / Kangaroo implementations (excellent for large-range collision searches, reducing the effective space to ~2¹²⁸ for secp256k1, but inherently sequential and slower per true iteration in full brute-force/exhaustive mode), this approach excels precisely in exhaustive, multi-target BIP-39 hunts, where parallelism across independent seeds is key.

Next steps: refine the comb precompute table to fit better in the massive 32–48 GB of GDDR7, add aggressive unrolling, minimize field inversions, and explore even larger window configurations. If we reach 3–4M seeds/GPU, the rig will be absolutely flying.

_______________


Mnemonic Construction (BIP-39 Words into 64-bit Packing)

Brilliant APPEND_AUTO macro
  • Copies up to 8 bytes at a time (or intelligently splits when crossing a ulong boundary). For BIP-39 words (max 8 letters), each APP_WORD costs only 1–3 instructions (shifts + ORs + uint arithmetic). Almost zero branching (only a simple if that the compiler optimizes well).
  • Result: mnemonic construction is extremely cheap, typically <20 instructions total for the 5 variable words.

Fixed prefix pre-XORed with IPAD/OPAD
  • The 5 ulongs of the known prefix (P0–P4) are XORed with IPAD/OPAD at compile time.
  • Saves 5 XORs per thread and, more importantly, reduces register pressure.

Inner/outer arrays initialized with fixed padding and bit length
  • Padding and length bits are hardcoded, avoiding runtime computation.

Fully unrolled tail XOR
  • #pragma unroll on a loop of at most 11 iterations → compiler inlines all XORs with zero loop overhead.

Direct bit-shift extraction of word indices (w0–w4)
  • No private arrays or complex logic. w4 cleverly incorporates the SHA-256 checksum, filtering invalid candidates early (see the sketch after this list).

Use of __constant memory and masks
  • KEEP_MSB_MASK stored in constant memory for fast access. Everything aligned for GPU execution (low divergence, good coalescing when WORDS_STRING/WORDS_LEN are in constant memory).
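The following sketch (a CPU-side model, not the kernel) shows the standard BIP-39 relationship the last bullet exploits: for 128-bit entropy, the final word index is 7 entropy bits concatenated with the top 4 bits of SHA-256(entropy), which is exactly the w4 = (hi7 << 4) | cs4 computation in the kernel below.

Code:
import hashlib

def bip39_word_indices(entropy: bytes):
    """Standard BIP-39: 128-bit entropy -> 12 x 11-bit word indices."""
    assert len(entropy) == 16
    cs4 = hashlib.sha256(entropy).digest()[0] >> 4       # top 4 checksum bits
    bits = (int.from_bytes(entropy, "big") << 4) | cs4   # 132 bits total
    return [(bits >> (11 * (11 - i))) & 0x7FF for i in range(12)]

idxs = bip39_word_indices(bytes(16))
print(idxs)   # all-zero entropy -> "abandon" x 11 + "about" (index 3)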

Complete uint64 wordlist: https://github.com/ipsbruno3/bip39_ulong/blob/main/kernel/bip39.cl

PoC
Code:
#define IPAD 0x3636363636363636UL
#define OPAD 0x5C5C5C5C5C5C5C5CUL

typedef struct {
  ulong master[10];
} test_t;

__constant ulong KEEP_MSB_MASK[9] = {
    0x0000000000000000UL, 0xFF00000000000000UL, 0xFFFF000000000000UL,
    0xFFFFFF0000000000UL, 0xFFFFFFFF00000000UL, 0xFFFFFFFFFF000000UL,
    0xFFFFFFFFFFFF0000UL, 0xFFFFFFFFFFFFFF00UL, 0xFFFFFFFFFFFFFFFFUL};



#define REPEAT_4LLX "%llx%llx%llx%llx"
#define REPEAT_8LLX REPEAT_4LLX REPEAT_4LLX
#define REPEAT_16LLX REPEAT_8LLX REPEAT_8LLX
#define ARG4(arr, off) arr[(off) + 0], arr[(off) + 1], arr[(off) + 2], arr[(off) + 3]
#define ARG8(arr, off) ARG4(arr, off), ARG4(arr, off + 4)
#define ARG16(arr, off) ARG8(arr, off), ARG8(arr, off + 8)

#define SPACE_BE 0x2000000000000000UL
#define REMAIN_BITS_LAST_BLOCK 64

#define APPEND_AUTO(dst, wi, cursor_bits, src_be, src_len_bytes)               \
  do {                                                                         \
    uint _lenB = (uint)(src_len_bytes);                                        \
    if (_lenB) {                                                               \
      ulong _src = (ulong)(src_be);                                            \
      if (_lenB < 8)                                                           \
        _src &= KEEP_MSB_MASK[_lenB];                                          \
      uint _bits = 8u * _lenB;                                                 \
      uint _c = (uint)(cursor_bits);                                           \
      uint _shift = 64u - _c;                                                  \
                                                                               \
      if (_bits <= _c) {                                                       \
        (dst)[(wi)] |= (_shift ? (_src >> _shift) : _src);                     \
        (cursor_bits) = _c - _bits;                                            \
        if ((cursor_bits) == 0u) {                                             \
          (wi)++;                                                              \
          (cursor_bits) = 64u;                                                 \
        }                                                                      \
      } else {                                                                 \
        (dst)[(wi)] |= (_shift ? (_src >> _shift) : _src);                     \
        uint _rem = _bits - _c;                                                \
        (wi)++;                                                                \
        (dst)[(wi)] |= (_src << _c);                                           \
        (cursor_bits) = 64u - _rem;                                            \
        if ((cursor_bits) == 0u) {                                             \
          (wi)++;                                                              \
          (cursor_bits) = 64u;                                                 \
        }                                                                      \
      }                                                                        \
    }                                                                          \
  } while (0)

#define APP_WORD(idx, is_last)                                                 \
  do {                                                                         \
    APPEND_AUTO(mnemLong, WI, CURSOR, WORDS_STRING[(idx)], WORDS_LEN[(idx)]);  \
    if (!(is_last))                                                            \
      APPEND_AUTO(mnemLong, WI, CURSOR, SPACE_BE, 1);                          \
  } while (0)

#define BLOCKS_KNOW_WORDS 5

__kernel void verify(__global test_t *output_hmac, const ulong offset) {
  const uint gid = get_global_id(0);
  const ulong low_offset = offset + (ulong)gid + OFFSET_LOW;
  const ulong lo = low_offset;

  const uint w0 = (uint)((lo >> 40) & 2047UL);
  const uint w1 = (uint)((lo >> 29) & 2047UL);
  const uint w2 = (uint)((lo >> 18) & 2047UL);
  const uint w3 = (uint)((lo >> 7) & 2047UL);

  const uint hi7 = (uint)(lo & 0x7FUL);
  const uint chk_byte = (uint)(sha256_from_byte(HIGH, lo) & 0xFFu);
  const uint cs4 = (chk_byte >> 4) & 0xFu;
  const uint w4 = (hi7 << 4) | cs4;

  ulong mnemLong[16] = {P0,  P1,  P2,  P3,  P4,  0UL, 0UL, 0UL,
                        0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL};

  uint WI = BLOCKS_KNOW_WORDS;
  uint CURSOR = REMAIN_BITS_LAST_BLOCK;

  APP_WORD(w0, 0);
  APP_WORD(w1, 0);
  APP_WORD(w2, 0);
  APP_WORD(w3, 0);
  APP_WORD(w4, 1);

  ulong inner[32] = {(P0 ^ IPAD), (P1 ^ IPAD), (P2 ^ IPAD), (P3 ^ IPAD), (P4 ^ IPAD),
                     0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
                     7885351518267664739UL,  // "mnemonic" salt (big-endian ASCII)
                     6442450944UL,           // INT(1) block index + 0x80 padding byte
                     0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
                     1120UL};                // message length: 140 bytes = 1120 bits

  ulong outer[32] = {(P0 ^ OPAD), (P1 ^ OPAD), (P2 ^ OPAD), (P3 ^ OPAD), (P4 ^ OPAD),
                     0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
                     0x5C5C5C5C5C5C5C5CUL,   // slots 16..23 receive the 64-byte inner digest
                     0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
                     0x8000000000000000UL,   // 0x80 padding after the digest
                     0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
                     1536UL};                // message length: 192 bytes = 1536 bits

#pragma unroll
  for (int i = BLOCKS_KNOW_WORDS; i < 16; i++) {
    inner[i] = mnemLong[i] ^ IPAD;
    outer[i] = mnemLong[i] ^ OPAD;
  }
ipsbruno3 (OP)
Newbie
Activity: 9
Merit: 1
Today at 04:55:57 AM
#2


PBKDF2-HMAC-SHA512

Without a doubt, PBKDF2 is the biggest bottleneck in breaking BIP-39, limiting high-end GPUs to 3–4 million derivations per second. Fortunately, we managed to assemble a highly parallelizable kernel with few branches to accelerate the hashing and make the code more efficient.

First HMAC computed outside the loop
  • Handles the longer initial message ("mnemonic" salt + passphrase) separately, requiring two sha512_process calls for the inner hash.

Tight main loop
  • For the remaining 2047 iterations, the message is only the previous U (64 bytes < block size), requiring just one sha512_process per inner HMAC + one for the outer. Total per iteration: 2 SHA-512 compressions + minimal copies/XORs.

Vectorized copies via COPY_EIGHT
  • The OpenCL compiler typically converts these to native vstore8/vload8 instructions.

Efficient register usage
  • Everything kept in __private ulong[8] arrays with no unnecessary spilling.
  • Zero branching in the hot loop (except the loop counter); ideal for GPU execution.

100% unrolled message expansion
  • All W₁₆ through W₇₉ are computed sequentially with no loops. Eliminates loop overhead and enables perfect instruction scheduling by the compiler. Huge throughput gain on GPUs.

Carry-chain optimization for the Maj function
  • Reuses A&B and B&C to reduce calculations within SHA compression. Saves one AND per round (80 ANDs total per compression) by passing bc_prev and chaining ab_current. A classic SHA-2 trick used in high-performance implementations (e.g., hashcat, ASICs); see the check after the code.

bitselect for the Ch function
  • Uses the fastest native instruction on AMD/NVIDIA GPUs for Ch(x,y,z) = (x&y) ^ (~x&z).

Optimized rotates and sigmas
  • Leverages OpenCL's built-in rotate() (hardware barrel shifter) and standard shift/rotate/XOR patterns for σ₀/σ₁.

Fully unrolled 80 rounds
  • No loop in the main compression → maximum ILP (instruction-level parallelism). Each round is minimal (~12–15 effective instructions).

Dedicated sha512_hash_two_blocks_message function
  • Perfectly suited for the initial PBKDF2 message (mnemonic + "mnemonic" > 112 bytes), avoiding conditional logic inside the core compression.

Relatively low register pressure
  • Temporary W values are computed just-in-time and many are eliminated by dead-code elimination.
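Before the kernel itself, here is a minimal CPU-side model of the loop structure described above (illustrative Python, checked against hashlib; BIP-39 uses c = 2048 and a single 64-byte output block). After the first iteration, every U hashes a 64-byte message, so each iteration costs one fresh inner compression plus one fresh outer compression once the ipad/opad block states are cached:

Code:
import hashlib, hmac

def pbkdf2_sha512_bip39(password: bytes, salt: bytes) -> bytes:
    """PBKDF2-HMAC-SHA512, c=2048, dkLen=64: T = U1 ^ U2 ^ ... ^ U2048."""
    prf = lambda msg: hmac.new(password, msg, hashlib.sha512).digest()
    u = prf(salt + b"\x00\x00\x00\x01")   # first HMAC: longer message
    t = int.from_bytes(u, "big")
    for _ in range(2047):                 # tight loop: 64-byte messages only
        u = prf(u)
        t ^= int.from_bytes(u, "big")
    return t.to_bytes(64, "big")

mnemonic = b"abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon abandon about"
assert pbkdf2_sha512_bip39(mnemonic, b"mnemonic") == \
       hashlib.pbkdf2_hmac("sha512", mnemonic, b"mnemonic", 2048)
print("PBKDF2 loop model matches hashlib")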
Code:
#define INIT_SHA512(a)\
  (a)[0] = 0x6a09e667f3bcc908UL;\
  (a)[1] = 0xbb67ae8584caa73bUL;\
  (a)[2] = 0x3c6ef372fe94f82bUL;\
  (a)[3] = 0xa54ff53a5f1d36f1UL;\
  (a)[4] = 0x510e527fade682d1UL;\
  (a)[5] = 0x9b05688c2b3e6c1fUL;\
  (a)[6] = 0x1f83d9abfb41bd6bUL;\
  (a)[7] = 0x5be0cd19137e2179UL;

#define rotr64(a, n) (rotate((a), (64UL - (n))))

inline ulong L0(ulong x) {
  return rotr64(x, 1UL) ^ rotr64(x, 8UL) ^ (x >> 7UL);
}

inline ulong L1(ulong x) {
  return rotr64(x, 19UL) ^ rotr64(x, 61UL) ^ (x >> 6UL);
}

#define SHA512_S0(x) (rotr64(x, 28UL) ^ rotr64(x, 34UL) ^ rotr64(x, 39UL))
#define SHA512_S1(x) (rotr64(x, 14UL) ^ rotr64(x, 18UL) ^ rotr64(x, 41UL))

#define F1(x, y, z) (bitselect(z, y, x))
#define F0(x, y, z) /* unused: Maj is folded into RoR via bc_prev */

static inline ulong RoR(ulong a, ulong b, ulong c, ulong *d,
  ulong e, ulong f, ulong g, ulong *h,
  ulong x, ulong K, ulong bc_prev) {
  ulong ab_current = a & b;
  ulong t1 = *h + SHA512_S1(e) + F1(e, f, g) + x + K;
  ulong maj = SHA512_S0(a) + (ab_current ^ (a & c) ^ bc_prev); // uses bc_prev instead of (b & c)
  *d += t1;
  *h = t1 + maj;
  return ab_current;
}

void sha512_process(ulong *message, ulong *H) {
  __private ulong A0 = H[0], A1 = H[1], A2 = H[2], A3 = H[3], A4 = H[4],
    A5 = H[5], A6 = H[6], A7 = H[7];

  __private ulong W16 =
    (message[0] + L0(message[1]) + message[9] + L1(message[14]));
  __private ulong W17 =
    (message[1] + L0(message[2]) + message[10] + L1(message[15]));
  __private ulong W18 = (message[2] + L0(message[3]) + message[11] + L1(W16));
  __private ulong W19 = (message[3] + L0(message[4]) + message[12] + L1(W17));
  __private ulong W20 = (message[4] + L0(message[5]) + message[13] + L1(W18));
  __private ulong W21 = (message[5] + L0(message[6]) + message[14] + L1(W19));
  __private ulong W22 = message[6] + L0(message[7]) + message[15] + L1(W20);
  __private ulong W23 = message[7] + L0(message[8]) + W16 + L1(W21);
  __private ulong W24 = message[8] + L0(message[9]) + W17 + L1(W22);
  __private ulong W25 = message[9] + L0(message[10]) + W18 + L1(W23);
  __private ulong W26 = message[10] + L0(message[11]) + W19 + L1(W24);
  __private ulong W27 = message[11] + L0(message[12]) + W20 + L1(W25);
  __private ulong W28 = message[12] + L0(message[13]) + W21 + L1(W26);
  __private ulong W29 = message[13] + L0(message[14]) + W22 + L1(W27);
  __private ulong W30 = message[14] + L0(message[15]) + W23 + L1(W28);
  __private ulong W31 = message[15] + L0(W16) + W24 + L1(W29);

  __private ulong W32 = W16 + L0(W17) + W25 + L1(W30);
  __private ulong W33 = W17 + L0(W18) + W26 + L1(W31);
  __private ulong W34 = W18 + L0(W19) + W27 + L1(W32);
  __private ulong W35 = W19 + L0(W20) + W28 + L1(W33);
  __private ulong W36 = W20 + L0(W21) + W29 + L1(W34);
  __private ulong W37 = W21 + L0(W22) + W30 + L1(W35);
  __private ulong W38 = W22 + L0(W23) + W31 + L1(W36);
  __private ulong W39 = W23 + L0(W24) + W32 + L1(W37);
  __private ulong W40 = W24 + L0(W25) + W33 + L1(W38);
  __private ulong W41 = W25 + L0(W26) + W34 + L1(W39);
  __private ulong W42 = W26 + L0(W27) + W35 + L1(W40);
  __private ulong W43 = W27 + L0(W28) + W36 + L1(W41);
  __private ulong W44 = W28 + L0(W29) + W37 + L1(W42);
  __private ulong W45 = W29 + L0(W30) + W38 + L1(W43);
  __private ulong W46 = W30 + L0(W31) + W39 + L1(W44);
  __private ulong W47 = W31 + L0(W32) + W40 + L1(W45);
  __private ulong W48 = W32 + L0(W33) + W41 + L1(W46);
  __private ulong W49 = W33 + L0(W34) + W42 + L1(W47);
  __private ulong W50 = W34 + L0(W35) + W43 + L1(W48);
  __private ulong W51 = W35 + L0(W36) + W44 + L1(W49);
  __private ulong W52 = W36 + L0(W37) + W45 + L1(W50);
  __private ulong W53 = W37 + L0(W38) + W46 + L1(W51);
  __private ulong W54 = W38 + L0(W39) + W47 + L1(W52);
  __private ulong W55 = W39 + L0(W40) + W48 + L1(W53);
  __private ulong W56 = W40 + L0(W41) + W49 + L1(W54);
  __private ulong W57 = W41 + L0(W42) + W50 + L1(W55);
  __private ulong W58 = W42 + L0(W43) + W51 + L1(W56);
  __private ulong W59 = W43 + L0(W44) + W52 + L1(W57);
  __private ulong W60 = W44 + L0(W45) + W53 + L1(W58);
  __private ulong W61 = W45 + L0(W46) + W54 + L1(W59);
  __private ulong W62 = W46 + L0(W47) + W55 + L1(W60);
  __private ulong W63 = W47 + L0(W48) + W56 + L1(W61);
  __private ulong W64 = W48 + L0(W49) + W57 + L1(W62);
  __private ulong W65 = W49 + L0(W50) + W58 + L1(W63);
  __private ulong W66 = W50 + L0(W51) + W59 + L1(W64);
  __private ulong W67 = W51 + L0(W52) + W60 + L1(W65);
  __private ulong W68 = W52 + L0(W53) + W61 + L1(W66);
  __private ulong W69 = W53 + L0(W54) + W62 + L1(W67);
  __private ulong W70 = W54 + L0(W55) + W63 + L1(W68);
  __private ulong W71 = W55 + L0(W56) + W64 + L1(W69);
  __private ulong W72 = W56 + L0(W57) + W65 + L1(W70);
  __private ulong W73 = W57 + L0(W58) + W66 + L1(W71);
  __private ulong W74 = W58 + L0(W59) + W67 + L1(W72);
  __private ulong W75 = W59 + L0(W60) + W68 + L1(W73);
  __private ulong W76 = W60 + L0(W61) + W69 + L1(W74);
  __private ulong W77 = W61 + L0(W62) + W70 + L1(W75);
  __private ulong W78 = W62 + L0(W63) + W71 + L1(W76);
  __private ulong W79 = W63 + L0(W64) + W72 + L1(W77);

  ulong bc = A1 & A2;
  ulong ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, message[0], 0x428a2f98d728ae22 UL, bc);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, message[1], 0x7137449123ef65cd UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, message[2], 0xb5c0fbcfec4d3b2f UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, message[3], 0xe9b5dba58189dbbc UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, message[4], 0x3956c25bf348b538 UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, message[5], 0x59f111f1b605d019 UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, message[6], 0x923f82a4af194f9b UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, message[7], 0xab1c5ed5da6d8118 UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, message[8], 0xd807aa98a3030242 UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, message[9], 0x12835b0145706fbe UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, message[10], 0x243185be4ee4b28c UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, message[11], 0x550c7dc3d5ffb4e2 UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, message[12], 0x72be5d74f27b896f UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, message[13], 0x80deb1fe3b1696b1 UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, message[14], 0x9bdc06a725c71235 UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, message[15], 0xc19bf174cf692694 UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W16, 0xe49b69c19ef14ad2 UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W17, 0xefbe4786384f25e3 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W18, 0x0fc19dc68b8cd5b5 UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W19, 0x240ca1cc77ac9c65 UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W20, 0x2de92c6f592b0275 UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W21, 0x4a7484aa6ea6e483 UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W22, 0x5cb0a9dcbd41fbd4 UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W23, 0x76f988da831153b5 UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W24, 0x983e5152ee66dfab UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W25, 0xa831c66d2db43210 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W26, 0xb00327c898fb213f UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W27, 0xbf597fc7beef0ee4 UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W28, 0xc6e00bf33da88fc2 UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W29, 0xd5a79147930aa725 UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W30, 0x06ca6351e003826f UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W31, 0x142929670a0e6e70 UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W32, 0x27b70a8546d22ffc UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W33, 0x2e1b21385c26c926 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W34, 0x4d2c6dfc5ac42aed UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W35, 0x53380d139d95b3df UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W36, 0x650a73548baf63de UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W37, 0x766a0abb3c77b2a8 UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W38, 0x81c2c92e47edaee6 UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W39, 0x92722c851482353b UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W40, 0xa2bfe8a14cf10364 UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W41, 0xa81a664bbc423001 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W42, 0xc24b8b70d0f89791 UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W43, 0xc76c51a30654be30 UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W44, 0xd192e819d6ef5218 UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W45, 0xd69906245565a910 UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W46, 0xf40e35855771202a UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W47, 0x106aa07032bbd1b8 UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W48, 0x19a4c116b8d2d0c8 UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W49, 0x1e376c085141ab53 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W50, 0x2748774cdf8eeb99 UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W51, 0x34b0bcb5e19b48a8 UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W52, 0x391c0cb3c5c95a63 UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W53, 0x4ed8aa4ae3418acb UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W54, 0x5b9cca4f7763e373 UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W55, 0x682e6ff3d6b2b8a3 UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W56, 0x748f82ee5defb2fc UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W57, 0x78a5636f43172f60 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W58, 0x84c87814a1f0ab72 UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W59, 0x8cc702081a6439ec UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W60, 0x90befffa23631e28 UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W61, 0xa4506cebde82bde9 UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W62, 0xbef9a3f7b2c67915 UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W63, 0xc67178f2e372532b UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W64, 0xca273eceea26619c UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W65, 0xd186b8c721c0c207 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W66, 0xeada7dd6cde0eb1e UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W67, 0xf57d4f7fee6ed178 UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W68, 0x06f067aa72176fba UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W69, 0x0a637dc5a2c898a6, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W70, 0x113f9804bef90dae UL, ab);
  ab = RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W71, 0x1b710b35131c471b UL, ab);
  ab = RoR(A0, A1, A2, & A3, A4, A5, A6, & A7, W72, 0x28db77f523047d84 UL, ab);
  ab = RoR(A7, A0, A1, & A2, A3, A4, A5, & A6, W73, 0x32caab7b40c72493 UL, ab);
  ab = RoR(A6, A7, A0, & A1, A2, A3, A4, & A5, W74, 0x3c9ebe0a15c9bebc UL, ab);
  ab = RoR(A5, A6, A7, & A0, A1, A2, A3, & A4, W75, 0x431d67c49c100d4c UL, ab);
  ab = RoR(A4, A5, A6, & A7, A0, A1, A2, & A3, W76, 0x4cc5d4becb3e42b6 UL, ab);
  ab = RoR(A3, A4, A5, & A6, A7, A0, A1, & A2, W77, 0x597f299cfc657e2a UL, ab);
  ab = RoR(A2, A3, A4, & A5, A6, A7, A0, & A1, W78, 0x5fcb6fab3ad6faec UL, ab);
  RoR(A1, A2, A3, & A4, A5, A6, A7, & A0, W79, 0x6c44198c4a475817 UL, ab);
  H[0] += A0;
  H[1] += A1;
  H[2] += A2;
  H[3] += A3;
  H[4] += A4;
  H[5] += A5;
  H[6] += A6;
  H[7] += A7;
}

void sha512_hash_two_blocks_message(ulong *message, ulong *H) {
  INIT_SHA512(H);
  sha512_process(message, H);
  sha512_process(message + 16, H);
}

#undef F0
#undef F1
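A quick way to sanity-check the bc_prev chaining used by RoR: the SHA-2 working registers shift each round, so this round's (b, c) is last round's (a, b), and last round's a & b can therefore be reused as this round's b & c. A small Python model of just that algebra (illustrative, independent of the kernel):

Code:
import random

def maj(a, b, c):                       # textbook Maj
    return (a & b) ^ (a & c) ^ (b & c)

regs = [random.getrandbits(64) for _ in range(8)]
bc_prev = regs[1] & regs[2]             # seeded as in the first RoR call
for _ in range(80):
    a, b, c = regs[0], regs[1], regs[2]
    ab = a & b
    assert (ab ^ (a & c) ^ bc_prev) == maj(a, b, c)
    bc_prev = ab                        # becomes next round's b & c
    regs = [random.getrandbits(64)] + regs[:7]   # register shift: new word enters at a
print("bc_prev chaining matches textbook Maj: OK")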


I hope you like it. I'm open to suggestions for improvement.

Thank you.
satscraper
Legendary
Activity: 1358
Merit: 2485
Today at 07:17:55 AM
#3

~



Well, before I get enthralled with your computations and possibly test them, could you share the results you obtained for recovering, let's say, 5 or 6 missing words from a 12-word BIP39-compliant phrase? Specifically, how long would it take using the methods you proposed?

ABCbits
Legendary
Activity: 3500
Merit: 9607
Today at 09:02:02 AM
#4

Quote
All the code focuses on OpenCL for GPU parallelization and Python as the main orchestrator.

It may be obvious to you, but how can we compile and run the code on our own device to attempt recovery/brute-force with our own BIP 39 words?

Cricktor
Legendary
Activity: 1386
Merit: 3505
Today at 12:51:55 PM
#5

I want to suggest adding to the topic's title that this toolset is intended for cracking a limited number of missing mnemonic recovery words.

I haven't looked thoroughly at the code on GitHub so far. If everything necessary is published openly on GitHub, kudos for the insight into your optimizations (if they work out).

I would love to see some valid comparison data, maybe not only for the beefiest GPUs available. I'm not sure the GPU code compiles properly for GPUs other than the heaviest gear available. Not everyone has a 5090...

By comparison data I mean: how long does it take to crack 3, 4 or 5 words (5 seems to be the upper limit for this tool for now)?

Can the missing words be at any position? At first look I couldn't see any details on this (I might have missed it).


...
Some thorough testing is indeed required. I would prefer a well-documented build toolchain over executables that I can't trust as long as they aren't available as reproducible builds.
