Print Page - Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)

Title: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: FrozenThroneGuy on June 19, 2025, 10:57:10 AM

Hello guys!
My implementation of Pollard-rho algo:
https://github.com/Dookoo2/Mark1
38 minutes for solving 80 bits puzzle with half billion of DP, 14 minutes for solving 70 bits with 200 millions of dp (after phase 1 for collecting and storing DP).
AVX2 bloom filter, compact DP table, Brent loop detection and others features.
May be useful for somebody.
Have a nice day:)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: jvaimamu on June 19, 2025, 12:08:26 PM

Quote from: FrozenThroneGuy on June 19, 2025, 10:57:10 AM

Great! Hope to get more speed

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 19, 2025, 08:01:25 PM

Nice coding concept. Added to my collection.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 20, 2025, 12:00:58 PM

The code is good, but it’s missing the part where the script loads the saved DP.bin with the --load-dp option.

Mark1.cpp

Code:

/***************************************************************************************************
 * Pollard–Kangaroo  (wrap-aware, user-configurable k, live counter, loop detector, restart counter)
 * By DooKoo2
 *
 *  g++ Mark1.cpp Int.cpp SECP256K1.cpp Point.cpp Random.cpp IntMod.cpp IntGroup.cpp Timer.cpp -O3 -march=native -funroll-loops -ftree-vectorize -fstrict-aliasing -fno-semantic-
 *    interposition -fvect-cost-model=unlimited -fno-trapping-math -fipa-ra -fipa-modref -flto -fassociative-math -fopenmp -mavx2 -mbmi2 -madx -std=c++17 -fopenmp -pthread -o Mark1
 *
 * Change log…
 ***************************************************************************************************/
#include <atomic>
#include <array>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <memory>
#include <mutex>
#include <random>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
#include <omp.h>

// --------------------------- project libs ------------------------------------
#include "Int.h"
#include "Point.h"
#include "SECP256K1.h"
#include "IntGroup.h"
#include "simd_block_bloom.h"

// ─── Scalar256 ────────────────────────────────────────────────────────────────
struct Scalar256 { uint64_t w[4]; };

static inline void intToScalar(const Int &src, Scalar256 &dst){
    dst.w[0]=src.bits64[0]; dst.w[1]=src.bits64[1];
    dst.w[2]=src.bits64[2]; dst.w[3]=src.bits64[3];
}
static inline void scalarToInt(const Scalar256 &s, Int &dst){
    dst.SetInt32(0);
    for(int i=3;i>=0;--i){
        dst.ShiftL(64);
        Int part(s.w[i]); dst.Add(&part);
    }
}

// ─── Int helpers ──────────────────────────────────────────────────────────────
static inline void intCopy(Int &d,const Int &s){ d.Set(const_cast<Int*>(&s)); }
static inline bool intGE(const Int &a,const Int &b){
    return const_cast<Int&>(a).IsGreaterOrEqual(const_cast<Int*>(&b));
}
static inline uint64_t IntLow64(const Int &n){ return n.bits64[0]; }
static inline int      bitlen (const Int &v){ return const_cast<Int&>(v).GetBitLength(); }

// ─── misc ─────────────────────────────────────────────────────────────────────
static inline uint64_t splitmix64(uint64_t x){
    x+=0x9E3779B97F4A7C15ULL;
    x=(x^(x>>30))*0xBF58476D1CE4E5B9ULL;
    x=(x^(x>>27))*0x94D049BB133111EBULL;
    return x^(x>>31);
}
static Int hexToInt(const std::string& h){ Int x; x.SetBase16((char*)h.c_str()); return x; }
static Int decToInt(const std::string& d){ Int x; x.SetBase10((char*)d.c_str()); return x; }
static std::string intHex(const Int &v,bool pad=false){
    Int t; intCopy(t,v); std::string s=t.GetBase16();
    if(pad && s.size()<64) s.insert(0,64-s.size(),'0');
    return s;
}
static std::string humanBytes(size_t bytes){
    constexpr const char* unit[]{"B","Kb","Mb","Gb","Tb"};
    double v=double(bytes); int u=0; while(v>=1024.0&&u<4){v/=1024.0;++u;}
    std::ostringstream o;
    if(v<10)       o<<std::fixed<<std::setprecision(2);
    else if(v<100) o<<std::fixed<<std::setprecision(1);
    else           o<<std::fixed<<std::setprecision(0);
    o<<v<<unit[u]; return o.str();
}

// ─── curve ────────────────────────────────────────────────────────────────────
static const char *P_HEX="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F";
static const char *N_HEX="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141";
static Int P_PRIME, ORDER_N; static Secp256K1 secp;

static inline Point addP(const Point &a,const Point &b){
    if(const_cast<Int&>(a.x).IsZero()&&const_cast<Int&>(a.y).IsZero()) return b;
    if(const_cast<Int&>(b.x).IsZero()&&const_cast<Int&>(b.y).IsZero()) return a;
    return secp.AddDirect(const_cast<Point&>(a),const_cast<Point&>(b));
}
static inline Point mulP(const Int &k){ Int t; intCopy(t,k); return secp.ComputePublicKey(&t); }

// ─── range split ──────────────────────────────────────────────────────────────
struct RangeSeg{ Int start,length; };

static std::vector<RangeSeg> splitRange(const Int &A,const Int &total,unsigned parts){
    std::vector<RangeSeg> seg(parts);
    Int chunk(total); Int div((uint64_t)parts); chunk.Div(&div,nullptr);
    Int lenLast(total);
    if(parts>1){ Int t(chunk); Int m((uint64_t)(parts-1)); t.Mult(&m); lenLast.Sub(&t); }
    for(unsigned i=0;i<parts;++i){
        seg[i].start=A;
        if(i){ Int off(chunk); Int k((uint64_t)i); off.Mult(&k); seg[i].start.Add(&off); }
        seg[i].length=(i==parts-1)?lenLast:chunk;
    }
    return seg;
}

// ─── wrap helper ──────────────────────────────────────────────────────────────
static inline void addWrapCnt(Int &v,const Int &d,const Int &len,uint64_t &wraps){
    v.Add(const_cast<Int*>(&d));
    if(intGE(v,len)){ v.Sub(const_cast<Int*>(&len)); ++wraps; }
}

// ─── compact DP ───────────────────────────────────────────────────────────────
using fp_t = uint64_t;
inline fp_t make_fp(const Point &P){
    return splitmix64(IntLow64(P.x) ^ uint64_t(!P.y.IsEven()));
}

static std::vector<fp_t>       fp_tbl;
static std::vector<Scalar256>  idx_tbl;
static std::unique_ptr<std::atomic<uint8_t>[]> used_tbl;
static uint32_t  dp_cap=0;
static std::atomic<uint64_t> dpDone{0};
static uint64_t  dpTarget=0;

inline bool sameScalar(const Scalar256 &a,const Scalar256 &b){
    return std::memcmp(&a,&b,sizeof(Scalar256))==0;
}

inline bool dp_insert_unique(fp_t fp,const Int &idx){
    Int modIdx; intCopy(modIdx, idx); modIdx.Mod(&ORDER_N);
    Scalar256 key; intToScalar(modIdx,key);
    uint32_t h = uint32_t(fp) % dp_cap;
    for(;;){
        uint8_t st = used_tbl[h].load(std::memory_order_acquire);
        if(st==2){
            if(fp_tbl[h]==fp && sameScalar(idx_tbl[h],key)) return false;
        }else if(st==0){
            uint8_t exp=0;
            if(used_tbl[h].compare_exchange_strong(exp,1,std::memory_order_acq_rel)){
                fp_tbl[h]=fp; idx_tbl[h]=key;
                used_tbl[h].store(2,std::memory_order_release);
                dpDone.fetch_add(1,std::memory_order_relaxed);
                return true;
            }
        }
        if(++h==dp_cap) h=0;
    }
}
inline bool dp_find(fp_t fp,Int &idx){
    uint32_t h=uint32_t(fp)%dp_cap;
    while(used_tbl[h].load(std::memory_order_acquire)==2){
        if(fp_tbl[h]==fp){ scalarToInt(idx_tbl[h],idx); return true; }
        if(++h==dp_cap) h=0;
    }
    return false;
}

// ─── globals ──────────────────────────────────────────────────────────────────
static simd_bloom::SimdBlockFilterFixed<> *bloom=nullptr;
static std::atomic<uint64_t> hops{0};
static std::atomic<uint64_t> restarts{0};
static std::atomic<bool>     solved{false};
static Int                   privFound;
static std::vector<Point>    jumps;
static std::atomic<unsigned> found_tid{0};
static std::atomic<uint64_t> winner_wraps{0};

// solving fix
static std::once_flag                record_flag;
static std::chrono::steady_clock::time_point t_end;

// ─── batch-EC-add ─────────────────────────────────────────────────────────────
template<unsigned N>
static inline void batchAdd(Point *base,Point *plus){
    std::array<Int,N> dX;
    for(unsigned i=0;i<N;++i) dX[i].ModSub(&plus[i].x,&base[i].x);
    static thread_local IntGroup grp(N); grp.Set(dX.data()); grp.ModInv();
    for(unsigned i=0;i<N;++i){
        Int dY; dY.ModSub(&plus[i].y,&base[i].y);
        Int k;  k.ModMulK1(&dY,&dX[i]);
        Int k2; k2.ModSquareK1(&k);
        Int xN(base[i].x); xN.ModNeg(); xN.ModAdd(&k2); xN.ModSub(&plus[i].x);
        Int dx(base[i].x); dx.ModSub(&xN); dx.ModMulK1(&k);
        base[i].x=xN; base[i].y.ModNeg(); base[i].y.ModAdd(&dx);
    }
}

// ─── jump-table ───────────────────────────────────────────────────────────────
static void buildJumpTable(unsigned k){
    jumps.resize(k);
#pragma omp parallel for schedule(static)
    for(unsigned i=0;i<k;++i){
        Int e((uint64_t)1); e.ShiftL(int(i+1)); jumps[i]=mulP(e);
    }
}

// ─── Binary DP File Format ────────────────────────────────────────────────────
#pragma pack(push, 1)
struct DpItem {
    fp_t fp;
    uint8_t priv[32];
};
#pragma pack(pop)

static void saveDPBinary(const std::string& filename) {
    std::ofstream file(filename, std::ios::binary | std::ios::trunc);
    if (!file) {
        std::cerr << "[ERROR] Cannot open " << filename << " for writing\n";
        return;
    }

    uint64_t count = 0;
    for (uint32_t h = 0; h < dp_cap; ++h) {
        if (used_tbl[h].load(std::memory_order_acquire) == 2) {
            DpItem item;
            item.fp = fp_tbl[h];
           
            Int priv;
            scalarToInt(idx_tbl[h], priv);
            priv.Get32Bytes(item.priv);
           
            file.write(reinterpret_cast<const char*>(&item), sizeof(DpItem));
            count++;
        }
    }

    std::cout << "Saved " << count << " DPs to " << filename
              << " (" << humanBytes(file.tellp()) << ")\n";
}

static bool loadDPBinary(const std::string& filename) {
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if (!file) {
        std::cerr << "[ERROR] Cannot open " << filename << " for reading\n";
        return false;
    }

    auto fileSize = file.tellg();
    file.seekg(0);
   
    if (fileSize % sizeof(DpItem) != 0) {
        std::cerr << "[ERROR] Invalid DP file size\n";
        return false;
    }

    const uint64_t count = fileSize / sizeof(DpItem);
    std::cout << "Loading " << count << " DPs from " << filename << "\n";

    DpItem item;
    uint64_t loaded = 0;
    while (file.read(reinterpret_cast<char*>(&item), sizeof(DpItem))) {
        Int priv;
        priv.Set32Bytes(item.priv);
       
        if (dp_insert_unique(item.fp, priv)) {
            bloom->Add(uint32_t(item.fp));
            loaded++;
        }

        if (loaded % 1000000 == 0) {
            std::cout << "\rLoaded " << loaded << "/" << count << " DPs" << std::flush;
        }
    }

    std::cout << "\rLoaded " << loaded << " DPs (done)\n";
    return true;
}

// ─── Traps (Phase-1) ─────────────────────────────────────────────────────────
static constexpr unsigned K_DP=512;
static void buildDP_segment(const RangeSeg &seg,uint64_t target,
                            unsigned k,unsigned dp_bits,uint64_t seed)
{
    const uint64_t mask=(1ULL<<dp_bits)-1;
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<uint64_t> rd;
    std::array<Int,   K_DP> dist;
    std::array<uint64_t,K_DP> wraps{};
    std::array<Point, K_DP> cur, stepPts;

    auto rndMod=[&](Int &o){
        o.SetInt32(0); int parts=(bitlen(seg.length)+63)/64;
        for(int p=0;p<parts;++p){
            Int t((uint64_t)rd(rng)); t.ShiftL(p*64); o.Add(&t);
        }
        o.Mod(const_cast<Int*>(&seg.length));
    };

    for(unsigned i=0;i<K_DP;++i){
        rndMod(dist[i]);
        Int a(seg.start); a.Add(&dist[i]);
        cur[i]=mulP(a);
    }

    uint64_t made=0;
    while(made<target){
        for(unsigned i=0;i<K_DP;++i){
            uint64_t h=splitmix64(IntLow64(cur[i].x))%k;
            Int step((uint64_t)1); step.ShiftL(int(h+1));

            if((IntLow64(cur[i].x)&mask)==0){
                fp_t fp=make_fp(cur[i]);

                Int scalar(seg.length);
                Int w((uint64_t)wraps[i]); scalar.Mult(&w);
                scalar.Add(const_cast<Int*>(&dist[i]));
                scalar.Add(const_cast<Int*>(&seg.start));
                scalar.Mod(&ORDER_N);

                if(dp_insert_unique(fp,scalar)){
                    bloom->Add(uint32_t(fp));
                    if(++made==target) break;
                }
            }
            stepPts[i]=jumps[h];
            addWrapCnt(dist[i],step,seg.length,wraps[i]);
        }
        batchAdd<K_DP>(cur.data(),stepPts.data());
    }
}

// ─── Phase-2: wild kangaroos ─────────────────────────────────────────────────
static constexpr unsigned K=512, BUF=512;
static void worker(uint32_t tid,const RangeSeg &seg,const Point &pub,
                   unsigned k,unsigned dp_bits)
{
    struct LoopDet{ uint64_t next,cnt,sig;
        inline void reset(uint64_t s) noexcept{ next=1024; cnt=0; sig=s; } };

    const uint64_t mask=(1ULL<<dp_bits)-1;
    std::mt19937_64 rng(splitmix64(0xDEADBEEF*tid));
    std::uniform_int_distribution<uint64_t> rd;

    std::array<Int,   K> dist;
    std::array<uint64_t,K> wraps{};
    std::array<Point, K> cur, stepPts;
    std::array<LoopDet,K> loop;

    auto rndMod=[&](Int &o){
        o.SetInt32(0); int parts=(bitlen(seg.length)+63)/64;
        for(int p=0;p<parts;++p){
            Int t((uint64_t)rd(rng)); t.ShiftL(p*64); o.Add(&t);
        }
        o.Mod(const_cast<Int*>(&seg.length));
    };
    for(unsigned i=0;i<K;++i){
        rndMod(dist[i]); cur[i]=addP(pub,mulP(dist[i]));
        uint64_t sig=splitmix64(IntLow64(cur[i].x)^uint64_t(!cur[i].y.IsEven()));
        loop[i].reset(sig);
    }

    const uint64_t FLUSH=1ULL<<18;
    uint64_t local=0; std::array<fp_t,BUF> fpB; std::array<unsigned,BUF> idB; unsigned cnt=0;

    while(!solved.load()){
        for(unsigned i=0;i<K;++i){
            if(solved.load()) return;

            uint64_t x64=IntLow64(cur[i].x);
            uint64_t h  =splitmix64(x64)%k;

            // Brent loop-detector ──────────────────────────────
            LoopDet &ld=loop[i];
            if(++ld.cnt==ld.next){
                uint64_t sig=splitmix64(x64^uint64_t(!cur[i].y.IsEven()));
                if(sig==ld.sig){
                    rndMod(dist[i]);
                    cur[i]=addP(pub,mulP(dist[i]));
                    wraps[i]=0;
                    ld.reset(sig);
                    restarts.fetch_add(1,std::memory_order_relaxed);
                    continue;
                }
                ld.sig=sig; ld.next<<=1;
            }

            stepPts[i]=jumps[h];
            Int step((uint64_t)1); step.ShiftL(int(h+1));
            addWrapCnt(dist[i],step,seg.length,wraps[i]);
            ++local;
        }
        batchAdd<K>(cur.data(),stepPts.data());
        if(local>=FLUSH){ hops.fetch_add(local); local=0; }

        if(solved.load()) return;

        for(unsigned i=0;i<K;++i){
            if(solved.load()) return;
            if((IntLow64(cur[i].x)&mask)!=0) continue;
            fp_t fp=make_fp(cur[i]);
            fpB[cnt]=fp; idB[cnt]=i;
            if(++cnt==BUF){
                for(unsigned j=0;j<BUF;++j){
                    if(!bloom->Find(uint32_t(fpB[j]))) continue;
                    Int trap; if(!dp_find(fpB[j],trap)) continue;

                    Int dw(seg.length);
                    Int w((uint64_t)wraps[idB[j]]); dw.Mult(&w);
                    dw.Add(const_cast<Int*>(&dist[idB[j]]));
                    dw.Mod(&ORDER_N);

                    Int priv; intCopy(priv,trap);
                    priv.Sub(&dw); priv.Mod(&ORDER_N);

                    Point tst=mulP(priv);
                    if(tst.x.IsEqual(&const_cast<Int&>(pub.x)) &&
                       tst.y.IsEqual(&const_cast<Int&>(pub.y))){
                        std::call_once(record_flag,[&]{ t_end=std::chrono::steady_clock::now(); });
                        intCopy(privFound,priv);
                        found_tid.store(tid,std::memory_order_relaxed);
                        winner_wraps.store(wraps[idB[j]],std::memory_order_relaxed);
                        solved.store(true); return;
                    }
                }
                cnt=0;
            }
        }
    }
    if(local) hops.fetch_add(local);
}

// ─── main ─────────────────────────────────────────────────────────────────────
int main(int argc,char** argv){
    P_PRIME=hexToInt(P_HEX); ORDER_N=hexToInt(N_HEX); secp.Init();

    Int A,B; uint64_t traps=0; unsigned dp_bits=12;
    const double bloomFactor=2.0, MAX_LOAD=0.75;
    Point pub; bool saveDP=false; size_t ramLimitGB=16;
    unsigned k_user=0;
    bool loadDP=false;
    std::string dpFile;

    for(int i=1;i<argc;++i){
        std::string a=argv[i];
        if(a=="--range"){
            std::string s=argv[++i]; auto p=s.find(':');
            A=decToInt(s.substr(0,p)); B=decToInt(s.substr(p+1));
        }else if(a=="--dp_point") traps=std::strtoull(argv[++i],nullptr,10);
        else if(a=="--dp_bits")   dp_bits=std::stoul(argv[++i]);
        else if(a=="--ram"||a=="--ram-limit") ramLimitGB=std::stoull(argv[++i]);
        else if(a=="--pubkey"){
            std::string h=argv[++i]; if(h.rfind("0x",0)==0) h.erase(0,2);
            char pc=h[1]; Int x=hexToInt(h.substr(2));
            pub.x=x; pub.y=secp.GetY(x,pc=='2');
        }else if(a=="-s"||a=="--save-dp") saveDP=true;
        else if(a=="--k") k_user=std::stoul(argv[++i]);
        else if(a=="--load-dp") {
            loadDP = true;
            dpFile = argv[++i];
        }
        else{ std::cerr<<"Unknown option "<<a<<'\n'; return 1; }
    }
    if(A.IsZero()||B.IsZero()){ std::cerr<<"range not set\n"; return 1; }

    Int range(B); range.Sub(&A);
    unsigned Lbits=bitlen(range);

    if(!traps && !loadDP){
        traps=(Lbits>=52)?(1ULL<<(Lbits/2)):
              uint64_t(std::ceil(range.ToDouble()/std::sqrt(range.ToDouble())));
        std::cout<<"[auto] dp_point = "<<traps<<'\n';
    }
    unsigned k = k_user ? k_user : std::max(1u, Lbits/2);
    buildJumpTable(k);

    // ---------- RAM -------------------------------------------------------
    uint32_t need  = uint32_t(std::ceil(double(traps)/MAX_LOAD));
    dp_cap         = need;
    double load    = double(traps)/dp_cap;

    const size_t slotBytes = sizeof(fp_t)+sizeof(Scalar256)+1;
    size_t dpBytes   = size_t(dp_cap)*slotBytes;
    size_t bloomBytes= size_t(traps*bloomFactor);
    size_t totalBytes= dpBytes+bloomBytes;

    std::cout<<"\n=========== Phase-0: RAM summary ===========\n";
    std::cout<<"DP table : "<<humanBytes(dpBytes)
             <<"  ( "<<traps<<" / "<<dp_cap<<" slots, load "
             <<std::fixed<<std::setprecision(2)<<load*100<<"% )\n";
    std::cout<<"Bloom    : "<<humanBytes(bloomBytes)<<"\n";
    std::cout<<"--------------------------------------------\n";
    std::cout<<"Total    : "<<humanBytes(totalBytes)<<'\n';

    if(totalBytes>ramLimitGB*(size_t(1)<<30)){
        std::cerr<<"Error: need "<<humanBytes(totalBytes)
                 <<" (> "<<ramLimitGB<<" GiB)\n"; return 1;
    }

    // ---------- allocate -----------------------------------------------------
    fp_tbl.assign(dp_cap,0); idx_tbl.assign(dp_cap,Scalar256{0,0,0,0});
    used_tbl.reset(new std::atomic<uint8_t>[dp_cap]);
    for(uint32_t h=0;h<dp_cap;++h) used_tbl[h].store(0);
    dpTarget=traps; bloom=new simd_bloom::SimdBlockFilterFixed<>(bloomBytes);

    unsigned th=std::max(1u,std::thread::hardware_concurrency());
    auto segments=splitRange(A,range,th);
    uint64_t per=(traps+th-1)/th;

    std::cout<<"\n========== Phase-1: Building/Loading traps =========\n";
    if (loadDP) {
        if (!loadDPBinary(dpFile)) {
            return 1;
        }
    } else {
        std::thread progress([&]{
            while(dpDone.load()<dpTarget){
                std::cout<<"\rUnique traps: "<<dpDone<<'/'<<dpTarget<<std::flush;
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
            std::cout<<"\rUnique traps: "<<dpTarget<<'/'<<dpTarget<<" (done)\n";
        });
#pragma omp parallel for schedule(static)
        for(unsigned t=0;t<th;++t)
            buildDP_segment(segments[t],per,k,dp_bits,splitmix64(0xABCDEF12345678ULL^t));
        progress.join();

        if (saveDP) {
            saveDPBinary("DP.bin");
        }
    }

    // ─── Phase-2: Kangaroos ─────────────────────────────────────────────────
    std::cout<<"\n=========== Phase-2: Kangaroos =============\n";
    auto t0 = std::chrono::steady_clock::now(); uint64_t last=0;
    std::thread pool([&]{
#pragma omp parallel for num_threads(th) schedule(static)
        for(unsigned id=0;id<th;++id)
            worker(id,segments[id],pub,k,dp_bits);
    });

    while(!solved.load()){
        std::this_thread::sleep_for(std::chrono::seconds(5));
        uint64_t now=hops.load(), d=now-last; last=now;
        double disp=d/5.0; const char* u=" H/s";
        if(disp>1e6){ disp/=1e6; u=" MH/s"; }
        else if(disp>1e3){ disp/=1e3; u=" kH/s"; }
        auto dt=std::chrono::steady_clock::now()-t0;
        uint64_t s=std::chrono::duration_cast<std::chrono::seconds>(dt).count();
        std::cout<<"\rSpeed: "<<std::fixed<<std::setprecision(2)<<disp<<u
                 <<" | Hops: "<<now
                 <<" | Restart wild: "<<restarts.load()
                 <<" | Time: "<<s/3600<<':'<<((s/60)%60)<<':'<<s%60
                 <<std::flush;
    }
    pool.join();

    // ─── Phase-3: results -------------------------------------------
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t0);
    uint64_t total_ms = elapsed.count();
    uint64_t hours=total_ms/3600000, rem=total_ms%3600000;
    uint64_t minutes=rem/60000; rem%=60000;
    uint64_t seconds=rem/1000, millis=rem%1000;

    std::cout<<"\n\n============= Phase-3: Result ==============\n";
    std::cout<<"Private key : 0x"<<intHex(privFound,true)<<'\n';
    std::cout<<"Found by thr: "<<found_tid.load()<<'\n';
    uint64_t wcnt = winner_wraps.load();
    std::cout<<"Wild wraps  : "<<wcnt<<(wcnt?"  [wrapped]\n":"  [no wrap]\n");
    std::cout<<"Wild restart: "<<restarts.load()<<'\n';
    std::cout<<"Total time  : "
             <<std::setfill('0')<<std::setw(2)<<hours<<':'
             <<std::setw(2)<<minutes<<':'<<std::setw(2)<<seconds<<'.'
             <<std::setw(3)<<millis<<'\n';

    {   std::ofstream fout("FOUND.txt",std::ios::trunc);
        if(!fout) std::cerr<<"[WARN] cannot open FOUND.txt\n";
        else{ fout<<"0x"<<intHex(privFound,true)<<'\n';
              std::cout<<"Private key : saved to FOUND.txt\n"; }
    }

    delete bloom;
    return 0;
}

Makefile

Code:

# Detect OS
UNAME_S := $(shell uname -s)

# Enable static linking by default (change to 'no' to use dynamic linking)
STATIC_LINKING = yes

# Compiler settings based on OS
ifeq ($(UNAME_S),Linux)
# Linux settings

# Compiler
CXX = g++

# Compiler flags
CXXFLAGS = -m64 -std=c++17 -Ofast -march=native -mtune=native \
           -Wall -Wextra -Wno-write-strings -Wno-unused-variable \
           -Wno-deprecated-copy -Wno-unused-parameter -Wno-sign-compare \
           -Wno-strict-aliasing -Wno-unused-but-set-variable \
           -funroll-loops -ftree-vectorize -fstrict-aliasing \
           -fno-semantic-interposition -fvect-cost-model=unlimited \
           -fno-trapping-math -fipa-ra -flto -fassociative-math \
           -fopenmp -mavx2 -mbmi2 -madx -fwrapv \
           -fomit-frame-pointer -fpredictive-commoning -fgcse-sm -fgcse-las \
           -fmodulo-sched -fmodulo-sched-allow-regmoves -funsafe-math-optimizations

# Source files
SRCS = Mark1.cpp Int.cpp SECP256K1.cpp Point.cpp Random.cpp IntMod.cpp IntGroup.cpp Timer.cpp

# Object files
OBJS = $(SRCS:.cpp=.o)

# Target executable
TARGET = Mark1

# Link the object files to create the executable and then delete .o files
$(TARGET): $(OBJS)
	$(CXX) $(CXXFLAGS) -o $(TARGET) $(OBJS)
	rm -f $(OBJS) && chmod +x $(TARGET)

# Compile each source file into an object file
%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

# Clean up build files
clean:
	@echo "Cleaning..."
	rm -f $(OBJS) $(TARGET)

# Phony targets
.PHONY: all clean 

else
# Windows settings (MinGW-w64)

# Compiler
CXX = g++

# Check if compiler is found
CHECK_COMPILER := $(shell which $(CXX))

# Add MSYS path if the compiler is not found
ifeq ($(CHECK_COMPILER),)
  $(info Compiler not found. Adding MSYS path to the environment...)
  SHELL := powershell
  PATH := C:\msys64\mingw64\bin;$(PATH)
endif

# Compiler flags (without LTO)
CXXFLAGS = -m64 -std=c++17 -Ofast -mssse3 -Wall -Wextra \
           -Wno-write-strings -Wno-unused-variable -Wno-deprecated-copy \
           -Wno-unused-parameter -Wno-sign-compare -Wno-strict-aliasing \
           -Wno-unused-but-set-variable -funroll-loops -ftree-vectorize \
           -fstrict-aliasing -fno-semantic-interposition -fvect-cost-model=unlimited \
           -fno-trapping-math -fipa-ra -fassociative-math -fopenmp \
           -mavx2 -mbmi2 -madx -fwrapv

# Add -static flag if STATIC_LINKING is enabled
ifeq ($(STATIC_LINKING), yes)
    CXXFLAGS += -static
else
    $(info Dynamic linking will be used. Ensure required DLLs are distributed)
endif

# Source files
SRCS = Mark1.cpp Int.cpp SECP256K1.cpp Point.cpp Random.cpp IntMod.cpp IntGroup.cpp Timer.cpp


# Object files
OBJS = $(SRCS:.cpp=.o)

# Target executable
TARGET = Mark1.exe

# Default target
all: $(TARGET)

# Link the object files to create the executable
$(TARGET): $(OBJS)
	$(CXX) $(CXXFLAGS) -o $(TARGET) $(OBJS)
	rm -f $(OBJS)

# Compile each source file into an object file
%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

# Clean up build files
clean:
	@echo Cleaning...
	rm -f $(OBJS) $(TARGET)

# Phony targets
.PHONY: all clean
endif

Added new command-line option --load-dp <filename> to specify a DP.bin file to load

Modified the Phase-1 section to either: Load from file if --load-dp is specified or generate new traps as before if not

Added proper handling of the DP target when loading from file. Maintained all original functionality including:

Saving DP with -s/--save-dp . All performance optimizations. All existing command-line options...

How to Use:

Generate new DP and save it:

Quote

../Mark1 --range 576460752303423487:1152921504606846975 --pubkey 0348e843dc5b1bd246e6309b4924b81543d02b16c8083df973a89ce2c7eb89a10d --dp_point 60000000 --dp_bits 8 --ram 32 --save-dp

Load saved DP and continue searching:

Quote

./Mark1 --range 576460752303423487:1152921504606846975 --pubkey 0348e843dc5b1bd246e6309b4924b81543d02b16c8083df973a89ce2c7eb89a10d --dp_point 60000000 --dp_bits 8 --ram 32 --load-dp DP.bin

Generate new DP without saving (original behavior):

Quote

./Mark1 --range start:end --pubkey 03...

Quote

./Mark1 --range 576460752303423487:1152921504606846975 --pubkey 0348e843dc5b1bd246e6309b4924b81543d02b16c8083df973a89ce2c7eb89a10d --dp_point 60000000 --dp_bits 8 --ram 32 --save-dp

=========== Phase-0: RAM summary ===========
DP table : 3.05Gb ( 60000000 / 80000000 slots, load 75.00% )
Bloom : 114Mb
--------------------------------------------
Total : 3.17Gb

========== Phase-1: Building/Loading traps =========
Unique traps: 60000000/60000000 (done)
Saved 60000000 DPs to DP.bin (2.24Gb)

=========== Phase-2: Kangaroos =============
Speed: 111.80 MH/s | Hops: 58982400 | Restart wild: 0 | Time: 0:0:5

============= Phase-3: Result ==============
Private key : 0x0000000000000000000000000000000000000000000000000FC07A1825367BBE
Found by thr: 5
Wild wraps : 0 [no wrap]
Wild restart: 0
Total time : 00:00:01.476
Private key : saved to FOUND.txt

Quote

./Mark1 --range 576460752303423487:1152921504606846975 --pubkey 0348e843dc5b1bd246e6309b4924b81543d02b16c8083df973a89ce2c7eb89a10d --dp_point 60000000 --dp_bits 8 --ram 32 --load-dp DP.bin

=========== Phase-0: RAM summary ===========
DP table : 3.05Gb ( 60000000 / 80000000 slots, load 75.00% )
Bloom : 114Mb
--------------------------------------------
Total : 3.17Gb

========== Phase-1: Building/Loading traps =========
Loading 60000000 DPs from DP.bin
Loaded 60000000 DPs (done)Ps

=========== Phase-2: Kangaroos =============
Speed: 112.32 MH/s | Hops: 61603840 | Restart wild: 0 | Time: 0:0:5

============= Phase-3: Result ==============
Private key : 0x0000000000000000000000000000000000000000000000000FC07A1825367BBE
Found by thr: 5
Wild wraps : 0 [no wrap]
Wild restart: 0
Total time : 00:00:01.471
Private key : saved to FOUND.txt

P.S.
I don't have Github. Someone can fork and upload this for me. If they want to...

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 20, 2025, 12:12:37 PM

Holy s*it. How fast is the file saved as DP.bin and loaded afterwards. Puzzle 70 solved in 10 seconds on my trash machine. ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: FrozenThroneGuy on June 20, 2025, 12:24:08 PM

NoMachine, plz return your GitHub back!
I like your coding skills (and readme files), and Mark1 is the deep refactoring and a few reingeneering of your Kangaroo-hops file.
I updated Mark1.cpp code

Code:

/***************************************************************************************************
 * Pollard–Kangaroo  (wrap-aware, user-configurable k, live counter, loop detector, restart counter)
 * Coded by DooKoo2
 * Load/Save DP tech by NoMachine
 *
 *  g++ Mark1.cpp Int.cpp SECP256K1.cpp Point.cpp Random.cpp IntMod.cpp IntGroup.cpp Timer.cpp -O3 -march=native -funroll-loops -ftree-vectorize -fstrict-aliasing -fno-semantic-
 *    interposition -fvect-cost-model=unlimited -fno-trapping-math -fipa-ra -fipa-modref -flto -fassociative-math -fopenmp -mavx2 -mbmi2 -madx -std=c++17 -fopenmp -pthread -o Mark1
 *
 ***************************************************************************************************/

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 20, 2025, 01:58:15 PM

Quote from: FrozenThroneGuy on June 20, 2025, 12:24:08 PM

NoMachine, plz return your GitHub back!
I like your coding skills (and readme files), and Mark1 is the deep refactoring and a few reingeneering of your Kangaroo-hops file.
I updated Mark1.cpp code

Code:

/***************************************************************************************************
 * Pollard–Kangaroo  (wrap-aware, user-configurable k, live counter, loop detector, restart counter)
 * Coded by DooKoo2
 * Load/Save DP tech by NoMachine
 *
 *  g++ Mark1.cpp Int.cpp SECP256K1.cpp Point.cpp Random.cpp IntMod.cpp IntGroup.cpp Timer.cpp -O3 -march=native -funroll-loops -ftree-vectorize -fstrict-aliasing -fno-semantic-
 *    interposition -fvect-cost-model=unlimited -fno-trapping-math -fipa-ra -fipa-modref -flto -fassociative-math -fopenmp -mavx2 -mbmi2 -madx -std=c++17 -fopenmp -pthread -o Mark1
 *
 ***************************************************************************************************/

He is definitely right. NoMachine there was no obvious reason to delete your account just because of some stupid dork writing shit. For example I code for my pleasure and share code in case someone might find it useful. And I do not care a bit about someone telling me where I am wrong or what to do. I have marked all you repos with postfix _NoMachine in my repo. Waiting for your comeback on github... ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 20, 2025, 02:12:37 PM

Maybe this is a stupid question, but is it possible to make the same Kangaroo have both odd and even DPs, like @AlexanderCurl does? Could we reach 90-bit *f* this way? :P

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 20, 2025, 03:09:59 PM

Quote from: Akito S. M. Hosana on June 20, 2025, 02:12:37 PM

Maybe this is a stupid question, but is it possible to make the same Kangaroo have both odd and even DPs, like @AlexanderCurl does? Could we reach 90-bit *f* this way? :P

That would be a cool Kangaroo Hopping!!! If BSGS is concerned one of possible ways is to divide the range according to the number of cpu cores and start the search from the middle of each interval using batch logic and calculating only X coordinate for the batch_size - 1 and the last one full (x,y). Make a database from the target point as large as your RAM can hold. Or since you know the size of each block do calculations to put the point closer to the middle. Maybe that way you can hit higher bits. But that depends on what time you have on your mind for that collision to happen.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 20, 2025, 06:27:32 PM

Quote from: AlexanderCurl on June 20, 2025, 03:09:59 PM

Quote from: Akito S. M. Hosana on June 20, 2025, 02:12:37 PM

Maybe this is a stupid question, but is it possible to make the same Kangaroo have both odd and even DPs, like @AlexanderCurl does? Could we reach 90-bit *f* this way? :P

That's only useful for brute-forcing, however brute-forcing is un-needed for DLP problems.

For BSGS this trick is also useless because it requires shifting the key by range/2 (to have the same-X mapping to 2 points), but the shift itself may move the point in the left side, which means the solution will never be found (unless you use two giant points instead of one, so zero benefit overall).

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 20, 2025, 06:36:48 PM

Quote from: kTimesG on June 20, 2025, 06:27:32 PM

Quote from: AlexanderCurl on June 20, 2025, 03:09:59 PM

Quote from: Akito S. M. Hosana on June 20, 2025, 02:12:37 PM

Maybe this is a stupid question, but is it possible to make the same Kangaroo have both odd and even DPs, like @AlexanderCurl does? Could we reach 90-bit *f* this way? :P

Someone shut the fuck up that bonehead. Try to implement this and see for yourself.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 20, 2025, 08:04:11 PM

Quote from: AlexanderCurl on June 20, 2025, 06:36:48 PM

Quote from: kTimesG on June 20, 2025, 06:27:32 PM

Quote from: AlexanderCurl on June 20, 2025, 03:09:59 PM

Quote from: Akito S. M. Hosana on June 20, 2025, 02:12:37 PM

Maybe this is a stupid question, but is it possible to make the same Kangaroo have both odd and even DPs, like @AlexanderCurl does? Could we reach 90-bit *f* this way? :P

Someone shut the fuck up that bonehead. Try to implement this and see for yourself.

Cool. But I did implement it, that's why I said it doesn't work. Otherwise, you might have explained things in a way that wasn't understood.

It's pointless to use the X coord and use batch block points the way you (tried to) explain. It simply doesn't work the way you attempted to explain. And for kangaroo, this is even more pointless.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 20, 2025, 09:21:19 PM

Quote from: kTimesG on June 20, 2025, 08:04:11 PM

Quote from: AlexanderCurl on June 20, 2025, 06:36:48 PM

Quote from: kTimesG on June 20, 2025, 06:27:32 PM

Quote from: AlexanderCurl on June 20, 2025, 03:09:59 PM

Quote from: Akito S. M. Hosana on June 20, 2025, 02:12:37 PM

Maybe this is a stupid question, but is it possible to make the same Kangaroo have both odd and even DPs, like @AlexanderCurl does? Could we reach 90-bit *f* this way? :P

Someone shut the fuck up that bonehead. Try to implement this and see for yourself.

It works fine the way I explained it on my PC and my implementation. I am not interested in Kangaroo and never refer to it.
Here is the TEMPLATE for you.

Code:

IntGroup modGroup(POINTS_BATCH_SIZE); // group of deltaX (x1 - x2) set for batch inversion
Int deltaX[POINTS_BATCH_SIZE]; // here we store (x1 - x2) batch that will be inverted for later multiplication
modGroup.Set(deltaX); // assign array deltaX to modGroup for batch inversion (JLP way set it once)
Int pointBatchX[POINTS_BATCH_SIZE]; // X coordinates of the batch
Int pointBatchY[POINTS_BATCH_SIZE]; // Y coordinates of the batch
Int deltaY; // values to store the results of points addition formula
Int slope[POINTS_BATCH_SIZE];

Point startPoint = starting_Point; // start point

while (true) {

    for (int i = 0; i < POINTS_BATCH_SIZE; i++) { // we compute (x1 - x2) for each entry of the entire batch
        deltaX[i].ModSub(&startPoint.x, &addPoints[i].x); // insert each entry into the deltaX array
    }

    modGroup.ModInv();    // doing batch inversion
    
    int i;
    for (i = 0; i < POINTS_BATCH_SIZE - 1; i++) { // follow points addition formula logic
        
        deltaY.ModSub(&startPoint.y, &addPoints[i].y);
        slope[i].ModMulK1(&deltaY, &deltaX[i]); // deltaX already inverted for each entry of the batch

        pointBatchX[i].ModSquareK1(&slope[i]); // computing just x coordinate for the (batch_size - 1)
        pointBatchX[i].ModSub(&pointBatchX[i], &startPoint.x);
        pointBatchX[i].ModSub(&pointBatchX[i], &addPoints[i].x);
        
    }
    
    deltaY.ModSub(&startPoint.y, &addPoints[i].y); // computing the last entry of the batch full (x,y) coordinates
    slope[i].ModMulK1(&deltaY, &deltaX[i]); // deltaX already inverted for each entry of the batch

    pointBatchX[i].ModSquareK1(&slope[i]);
    pointBatchX[i].ModSub(&pointBatchX[i], &startPoint.x);
    pointBatchX[i].ModSub(&pointBatchX[i], &addPoints[i].x);
        
    pointBatchY[i].ModSub(&startPoint.x, &pointBatchX[i]);
    pointBatchY[i].ModMulK1(&slope[i], &pointBatchY[i]);
    pointBatchY[i].ModSub(&pointBatchY[i], &startPoint.y);
    
startPoint.x.Set(&pointBatchX[POINTS_BATCH_SIZE - 1]); // setting the new startPoint for the next batch iteration
startPoint.y.Set(&pointBatchY[POINTS_BATCH_SIZE - 1]);
}

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 20, 2025, 10:04:14 PM

Quote from: AlexanderCurl on June 20, 2025, 09:21:19 PM

It works fine the way I explained it on my PC and my implementation. I am not interested in Kangaroo and never refer to it.
Here is the TEMPLATE for you.

Thx for the template, but I already use that in my PointsBuilder demo.

I think you misunderstood my point though. This batch addition that computes both P + Q and P - Q has no usefulness outside BSGS or simply building that specific array of results (e.g. for brute-forcing the H160).

It simply slows down the processing, since there is nothing useful for having both of those points computed. Maybe you should read again what Akito was asking, but more carefully. The answer to his question is that having a DP obtained via P - Q will almost never be hit from another route then the current one, since the pseudo-random walk ends (the point is not used as a new base point). It's basically equivalent to having a DP which is 1 bit lower, however the processing is slowed down. The probability of having it hit by another walk is so close to zero, that it's simply not worth computing it at all. RC does use it though, but not always, and it's for a different purpose anyway.

And for rho or Kangaroo, you always, always need the Y coordinate, because it's needed to actually do the next jump. So your "let's only compute X" is useless for this case.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 21, 2025, 03:30:09 AM

Quote from: kTimesG on June 20, 2025, 10:04:14 PM

Quote from: AlexanderCurl on June 20, 2025, 09:21:19 PM

It works fine the way I explained it on my PC and my implementation. I am not interested in Kangaroo and never refer to it.
Here is the TEMPLATE for you.

Exacly my answer is for Akito and not for you. I guess he understood what I meant. And moreover I heard that fairytale from you before that something "will not work or should not work and only you know how". Let alone that explanation of mine is pretty simple to implement and see that it works. I have much complex ones that work as well when I implement them but it is a waste of time explaining something to you.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 21, 2025, 07:17:57 AM

Quote from: AlexanderCurl on June 21, 2025, 03:30:09 AM

Eagerly waiting for someone to implement it just so they get bit in the ass. Maybe you should be the magician to do it, since you're the one claiming that it speeds up anything, using a concept that can only increase the number of operations (or even worse, that isn't even applicable to the issue).

If something doesn't work on a freaking PAPER, I'm not sure how it could ever work in practice.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 21, 2025, 09:44:25 AM

Quote from: kTimesG on June 21, 2025, 07:17:57 AM

Quote from: AlexanderCurl on June 21, 2025, 03:30:09 AM

The crappy bullshitter never stops!!! ;D Go fuck youself stupid dork!!! ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 21, 2025, 11:22:08 AM

Quote from: AlexanderCurl on June 21, 2025, 09:44:25 AM

Quote from: kTimesG on June 21, 2025, 07:17:57 AM

If something doesn't work on a freaking PAPER, I'm not sure how it could ever work in practice.

The crappy bullshitter never stops!!! ;D Go fuck youself stupid dork!!! ;D

I've sent you back all the merit points I had. I now owe you the rest of 11 points, I will compensate it back when I can afford it.

I don't need anything from people like you. Also, your idea is genius, so forgive me. That would be a cool Kangaroo Hopping!!! And your BSGS optimization is spot on. Let's reduce that bitch down to .5 sqrt, magical batch addition, yay.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 21, 2025, 11:48:09 AM

Quote from: kTimesG on June 21, 2025, 11:22:08 AM

Quote from: AlexanderCurl on June 21, 2025, 09:44:25 AM

Quote from: kTimesG on June 21, 2025, 07:17:57 AM

If something doesn't work on a freaking PAPER, I'm not sure how it could ever work in practice.

The crappy bullshitter never stops!!! ;D Go fuck youself stupid dork!!! ;D

Yeah do that!!! I have respect only for worthy people and not the kind like yourself. You been told that many times here. The one who is disrespectful to others - what could you expect? I will publish what I explained in plain c++ working code. For others to see. ;D

P.S Publish your code that implements what I explained and fails to work. Since I have a clear picture of what should be going on there I can make modifications to make it work. But I am sure you will not do that. Because you did not do anything and your goal is just "for your personal kicks" to prove that everyone is wrong but you.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 21, 2025, 02:32:35 PM

Quote from: AlexanderCurl on June 21, 2025, 11:48:09 AM

P.S Publish your code that implements what I explained and fails to work. Since I have a clear picture of what should be going on there I can make modifications to make it work. But I am sure you will not do that. Because you did not do anything and your goal is just "for your personal kicks" to prove that everyone is wrong but you.

Batch addition for static delta points and/or shared inverses to do P±Q doesn't work in Kangaroo nor rho (which is what this thread is about before you hijacked it with your evolutionary breakthrough). This was the whole point.

Otherwise, you're simply trying to do something which has long been done in KeyHunt, JLP, etc. However it feels like you are stating that you found a way that saves on storage or lookup. Maybe you should take a piece of paper, draw your idea on it, and maybe notice how it fails reducing any of the complexity or required storage whatsoever. The intersection / collision can only be optimal if sqrt(n) points are stored, no matter what clever trick geniuses try to come up with.

Here's what I got from you explanation. If it's wrong, maybe explain it better. You wanted to save / check the "middle points" or something like that, right? The red point is the real (and unique) BSGS collision, blue points are the saved points, green points are the checked giant points. Collision is missed since it would jump right through unsaved / unchecked points in both baby and giant ranges.

https://talkimg.com/images/2025/06/21/UutpvH.png

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AlexanderCurl on June 21, 2025, 03:09:18 PM

Quote from: kTimesG on June 21, 2025, 02:32:35 PM

Quote from: AlexanderCurl on June 21, 2025, 11:48:09 AM

Just as I thought. Nothing more than that usual crap that is constantly coming out of you. ;D What you describe here is far off from my explanation. If you do not like something about it. I am good with that. Everyone is free to go for what they like. You can just go... yourself. Pretty simple thing. What I explained works fine with my code.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: AbadomRSZ on June 22, 2025, 06:34:06 PM

Quote from: FrozenThroneGuy on June 19, 2025, 10:57:10 AM

rubbish waste of time

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: mcdouglasx on June 24, 2025, 09:40:41 PM

I think the next point to address regarding the Kangaroo algorithm is to avoid loading the DP into RAM. I think that is a strong limiting point for Kangaroo, especially when it comes to high ranges.

Has anyone thought of any interesting ideas to solve this?.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 24, 2025, 10:38:13 PM

Quote from: mcdouglasx on June 24, 2025, 09:40:41 PM

Where did you ever read that the Kangaroo algorithm requires storing the DPs in RAM?

Storing and querying DPs works just fine with a database. This allows collecting terabytes or more of DP data.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: mcdouglasx on June 24, 2025, 11:06:59 PM

Quote from: kTimesG on June 24, 2025, 10:38:13 PM

Quote from: mcdouglasx on June 24, 2025, 09:40:41 PM

Where did you ever read that the Kangaroo algorithm requires storing the DPs in RAM?

Storing and querying DPs works just fine with a database. This allows collecting terabytes or more of DP data.

I'm referring to the dp.bin of this thread

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 25, 2025, 02:48:52 AM

Quote from: mcdouglasx on June 24, 2025, 09:40:41 PM

It ain’t worth makin’, but you could totally build a Memory-Mapped DP Table System that mixes a Bloom filter (in RAM) with a memory-mapped DP table (on disk). Gotta reserve, like, 2-3x the expected DP table size to keep things runnin’ smooth. And for those massive tables over 1TB? Yeah, you’ll need to split ’em across multiple files.Just like I did in my BSGS. But here’s the catch: If someone’s crazy enough to try solvin’ Puzzle 135, they’re gonna need thousands upon thousands of terabytes of storage. ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: mcdouglasx on June 25, 2025, 03:23:40 AM

Quote from: nomachine on June 25, 2025, 02:48:52 AM

Quote from: mcdouglasx on June 24, 2025, 09:40:41 PM

I was once thinking about including my lightweight database in Kangaroo, but I never finished it; I lost interest.

But the two ideas that came to me back then were to create a file where the DP would be stored, sorted like any other database. For active searches, I would only store the parity (like the first lightweight databases). Instead of using the 64-point continuous parity, I used the DP parity. So, when the kangaroos collided, they had the same sequence of DP from that point.

Basically, I had a breadcrumb trail of their paths. Once the collision height was calculated, I would identify it in the file's database and retrieve the complete information to construct the private key.

and the other idea was the same with the bits but using deterministic seeds that were stored in a file, when the collision of the bits was detected the path was reconstructed with the seed used, but I didn't think about this much since if you had made many jumps, reconstructing the path could take too long, unless a different seed was used every certain number of jumps, or some type of checkpoint was used to speed up the reconstruction.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: FrozenThroneGuy on June 25, 2025, 05:40:53 AM

Quote from: nomachine on June 25, 2025, 02:48:52 AM

Quote from: mcdouglasx on June 24, 2025, 09:40:41 PM

I have done it already:) switch Dp_table from RAM to SSD:)
Bloom - RAM
Dp_table - SSD

Next step - I will add gpu computing secp256 with all logic from original version

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: iceland2k14 on June 25, 2025, 07:20:36 AM

Anyone already forked to have it in Visual Studio for Windows ?

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 25, 2025, 07:58:28 AM

Quote from: nomachine on June 25, 2025, 02:48:52 AM

WTF are you talking about?

DPs do not need instant lookup. That is only needed for when you want to do billions of lookups in a row. I'm sure you must know this, so are you OK? DPs are pretty rare, and more over, their first usage is to always be stored, which automatically involves a lookup itself. Win-win.

Any actual database engine works just fine, and you don't need to bother with splitting files and you definitely don't need any bloom filters. That's why DBMS exist.

Personally I'm using SQLite for this (it can handle TB of data without issues, and guess what: it's a single file), but for consistency there's absolutely nothing wrong with going with a full-blown dedicated database server just for storing DPs, with included backup and redundancy.

Did I mention none of these things slow down the DP search in any way? Hell, you could as well save the DPs to text files and merge them once in a while. Oh wait...

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 25, 2025, 10:09:56 AM

Quote from: kTimesG on June 25, 2025, 07:58:28 AM

WTF are you talking about?

I have no idea what you're talking about.

https://raw.githubusercontent.com/Dookoo2/Mark1/refs/heads/main/Mark1.cpp

This code use a Memory-Mapped DP Table , which allows for:

Persistent storage of DPs,
Efficient concurrent access,
Disk-based extension of capacity beyond available RAM,
Fast lookups via hashing and atomic operations,
Optional loading/saving of DP sets for reuse.

This structure is then stored in a memory-mapped file (dp_table.bin) via mmap, allowing the program to treat a file on disk as if it were directly in memory.

Thus, the statement "DPs do not need instant lookup" is incorrect in this context because:

This implementation requires fast lookups for real-time collision detection.
Using memory-mapped files achieves both performance and persistence.

Regardless of everything said - this script is still useless if someone wants to solve puzzle 135.
It is purely for experimentation and learning. No guarantees are given.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: FrozenThroneGuy on June 25, 2025, 10:21:33 AM

SSD storing DP

Code:

root@ubuntu:/home/ubuntu/Mark1# ./Mark1 --range 18014398509481983:36028797018963967  --pubkey 0385a30d8413af4f8f9e6312400f2d194fe14f02e719b24c3f83bf1fd233a8f963 --dp_point 500000 --dp_bits 10 --ram 8 --save-dp

=========== Phase-0: Data summary ==========
DP table (SSD): 40.7Mb  ( 500000 / 666667 slots, load 75.00% )
Bloom    (RAM): 977Kb

========== Phase-1: Building traps =========
Unique traps: 500000/500000 (done)
Saved 500000 traps to DP.bin (19.1Mb)

=========== Phase-2: Kangaroos =============
Speed: 7.13 MH/s | Hops: 631767040 | Restart wild: 0 | Time: 0:0:505

============= Phase-3: Result ==============
Private key : 0x000000000000000000000000000000000000000000000000006ABE1F9B67E114
Found by thr: 4
Wild wraps  : 0  [no wrap]
Wild restart: 0
Total time  : 00:00:50.013
Private key : saved to FOUND.txt

NoMachine, KtimesG, I know that in present time it has no any chances for solving 135, but next version will be with GPU integration + distributed DP gen + distributed worker node + oracle node. I know how to do it and I have enough c++ skills for doing it.
GPU code is the most difficult for me, because i have no experience with CUDA. And AI is useless for it (and for senior/staff level c++ coding also).

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 25, 2025, 10:32:07 AM

Quote from: nomachine on June 25, 2025, 10:09:56 AM

Quote from: kTimesG on June 25, 2025, 07:58:28 AM

WTF are you talking about?

I have no idea what you're talking about.

I have no idea what you're both talking about. But if the script does what it promises, I don't care about the explanation. ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 25, 2025, 10:44:46 AM

Quote from: Akito S. M. Hosana on June 25, 2025, 10:32:07 AM

I have no idea what you're both talking about. But if the script does what it promises, I don't care about the explanation. ;D

The proof is as always in the pudding.

Quote from: FrozenThroneGuy on June 25, 2025, 10:21:33 AM

If you have time to code, go for it. I know how much effort it takes to pull this off... But the real question is whether to sit around and do it without any compensation. Maybe it’s better to go fishing? ;)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: FrozenThroneGuy on June 25, 2025, 10:50:34 AM

New version is ready: https://github.com/Dookoo2/Mark1
And i believe that I can do my own challenge - 100 bit - 1 hour.
NoMachine, I stuck in managing IS projects, and this is my way not to fault my coding skills.
And fishing… I prefer enduro trials:)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 25, 2025, 11:01:14 AM

Quote from: FrozenThroneGuy on June 25, 2025, 10:50:34 AM

New version is ready: https://github.com/Dookoo2/Mark1
And i believe that I can do my own challenge - 100 bit - 1 hour.

I know you can. ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 25, 2025, 11:27:16 AM

Quote from: FrozenThroneGuy on June 25, 2025, 10:50:34 AM

New version is ready: https://github.com/Dookoo2/Mark1

I just solved Puzzle 66 in two-tenths of a second. 8)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: FrozenThroneGuy on June 25, 2025, 11:43:48 AM

Quote from: Akito S. M. Hosana on June 25, 2025, 11:27:16 AM

Quote from: FrozenThroneGuy on June 25, 2025, 10:50:34 AM

New version is ready: https://github.com/Dookoo2/Mark1

I just solved Puzzle 66 in two-tenths of a second. 8)

Fast… But how to solve 100 bits in less than one hour on 1 consumer level GPU like 4090 or 5090? This is a challenge.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 25, 2025, 12:11:35 PM

Quote from: nomachine on June 25, 2025, 10:09:56 AM

Quote from: kTimesG on June 25, 2025, 07:58:28 AM

WTF are you talking about?

I have no idea what you're talking about.

Thus, the statement "DPs do not need instant lookup" is incorrect in this context because:

This implementation requires fast lookups for real-time collision detection.
Using memory-mapped files achieves both performance and persistence.

DPs don't need fast lookup. Maybe it's better to take a better look at the general algorithm. In essence:

1. Generate DPs.
2. Detect collisions.

These two steps are independent. There's no need to store anything in RAM, not even the generated DPs. They may come from a worker node, for example, as a file, before being thrown into a high-capacity storage system (which can be in the order of many billions of entries). For example, even using 64 bits of a DP as an index key is a no-brainer for any database and can store up to 2**64 entries with very, very fast lookup, no matter the actual storage subsystem or its size. When an insert fails, handle the potential collision (or the 64-bit conflict).

That's why I don't see the point of accelerating the DP lookup via RAM or bloom filters. They don't really help much, maybe just for educational purposes on very low-end puzzles.

The only issue is that finding DPs will be slow for high puzzles, so the database grows pretty slow. But for example, with a DP of 32, puzzles up to 192 bits are manageable with at most 2**64 entries. For Puzzle 135 and a DP of 32, only around 2**35 entries of storage are needed, which is 32 billion entries. This can fit on a single SSD.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 25, 2025, 12:32:59 PM

Quote from: kTimesG on June 25, 2025, 12:11:35 PM

That's why I don't see the point of accelerating the DP lookup via RAM or bloom filters. They don't really help much, maybe just for educational purposes on very low-end puzzles.

The only issue is that finding DPs will be slow for high puzzles, so the database grows pretty slow. But for example, with a DP of 32, puzzles up to 192 bits are manageable with at most 2**64 entries. For Puzzle 135 and a DP of 32, only around 2**35 entries of storage are needed, which is 32 billion entries. This can fit on a single SSD.

You're talking about another hypothetical script that has nothing to do with this one. But okay. Maybe the author of this script will go in that direction. Then we'll talk about those details. ;)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 25, 2025, 12:48:39 PM

I don’t understand why people jump into other people’s script discussions and immediately favor their own solution as the best. Isn’t the point of having separate topics and scripts to allow everyone their own space? :P

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 25, 2025, 12:53:40 PM

Quote from: Akito S. M. Hosana on June 25, 2025, 12:48:39 PM

Because they’re bored and ain’t got nothin’ better to do with their life. Or maybe they keep their scripts private, so they just talk trash about others instead. ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: mcdouglasx on June 25, 2025, 03:33:16 PM

Quote from: kTimesG on June 25, 2025, 12:11:35 PM

DPs don't need fast lookup.

If the idea is to find quick solutions, you need to include more DPs in your searches, so I assume you need more frequent queries. And to keep both working without creating bottlenecks, you need fast access and compact databases, because not everyone has hundreds of terabytes of storage or supercomputers.

What happens is that you extrapolate everything to environments that go hand in hand with Google, Amazon, and Microsoft, and these scripts try to solve things faster with solutions within reach of the average user.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 25, 2025, 03:50:57 PM

Quote from: mcdouglasx on June 25, 2025, 03:33:16 PM

Quote from: kTimesG on June 25, 2025, 12:11:35 PM

DPs don't need fast lookup.

A separate thread works just fine. The only bottleneck would be having more DPs generated than the insertion/lookup can handle. However when you only have just a few DPs per second / minute, while the database can do several hundreds of thousands/s anyway, not sure where's the benefit of speeding that up another 10.000 times (e.g. having it in RAM). At some point, the RAM won't be enough anyway.

Don't need to complicate things so much.

PS: I'm not marketing cloud services, neither did I wrote SQLite or something. But why reinvent the wheel when there are already answers to solved problems? For example, mapping files to RAM and so on, this is not solving anything when you have huge files, it just decreases performance. And SQLite can very well just work off from RAM itself, besides having basically super-optimized strategies for page caching and low-level stuff to access the data.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 25, 2025, 08:00:58 PM

Quote from: FrozenThroneGuy on June 25, 2025, 11:43:48 AM

Fast…

I think there is now a discrepancy in how the actual speed is measured, depending on the size of dp_table.bin. For example, with a database larger than 50GB, the speed reaches petabits per second in hops similar to Keyhunt. This is especially noticeable when working on smaller puzzles, such as Puzzle 55, 60, or 66, if the database is "too large" for a specific range.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: FrozenThroneGuy on June 26, 2025, 11:09:46 AM

Quote from: Akito S. M. Hosana on June 25, 2025, 08:00:58 PM

Quote from: FrozenThroneGuy on June 25, 2025, 11:43:48 AM

Fast…

Hello, Akito! Keyhunt speed is not a “fair speed”, it’s include the size of each giant step. For example giant step size is 1b points and you have done 1 m steps per second. Fair speed - 1Mkeys/s, BSGS speed - 1Pkey/s.
In my Mark1, I show a “fair speed”, instead of Kangaroo speed.

And I have solved a huge problem with solving speed SSD dp storing. Right now all are OK. Solving speed is the same as RAM storing DP.
ChatGPT helped to do it, it take me a blocks of code with an errors (not errors, problems with dp table size and hashing).

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 28, 2025, 12:23:03 PM

Quote from: FrozenThroneGuy on June 26, 2025, 11:09:46 AM

And I have solved a huge problem with solving speed SSD dp storing. Right now all are OK. Solving speed is the same as RAM storing DP.
ChatGPT helped to do it, it take me a blocks of code with an errors (not errors, problems with dp table size and hashing).

You can go even further here. See here how the database packaging is done and hashing using XXH64_hash_t in bsgs.cpp

https://github.com/AlexanderKud/BSGS_NoMachine

;)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: coolmib on June 28, 2025, 12:38:59 PM

Quote from: nomachine on June 28, 2025, 12:23:03 PM

You can go even further here. See here how the database packaging is done and hashing using XXH64_hash_t in bsgs.cpp

https://github.com/AlexanderKud/BSGS_NoMachine

;)

The readme section made my day. ;D
Cheers

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 28, 2025, 12:48:20 PM

Quote from: nomachine on June 28, 2025, 12:23:03 PM

hashing using XXH64_hash_t

Extremely fast Hash algorithm? :P

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 28, 2025, 12:52:34 PM

Quote from: Akito S. M. Hosana on June 28, 2025, 12:48:20 PM

Quote from: nomachine on June 28, 2025, 12:23:03 PM

hashing using XXH64_hash_t

Extremely fast Hash algorithm? :P

Yeah, XXH64 gets like 10-20 GB/s. ;D

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 28, 2025, 01:33:06 PM

Quote from: nomachine on June 28, 2025, 12:52:34 PM

Quote from: Akito S. M. Hosana on June 28, 2025, 12:48:20 PM

Quote from: nomachine on June 28, 2025, 12:23:03 PM

hashing using XXH64_hash_t

Extremely fast Hash algorithm? :P

Yeah, XXH64 gets like 10-20 GB/s. ;D

Can you give us some code? I don't know what to ask the AI where to put XXHash's? :P

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 28, 2025, 01:53:00 PM

Quote from: Akito S. M. Hosana on June 28, 2025, 01:33:06 PM

Quote from: nomachine on June 28, 2025, 12:52:34 PM

Quote from: Akito S. M. Hosana on June 28, 2025, 12:48:20 PM

Quote from: nomachine on June 28, 2025, 12:23:03 PM

hashing using XXH64_hash_t

Extremely fast Hash algorithm? :P

Yeah, XXH64 gets like 10-20 GB/s. ;D

Can you give us some code? I don't know what to ask the AI where to put XXHash's? :P

Mark1.cpp

Code:

#include <atomic>
#include <array>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <memory>
#include <mutex>
#include <random>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
#include <algorithm>

#include <omp.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xxhash.h>

#include "Int.h"
#include "Point.h"
#include "SECP256K1.h"
#include "IntGroup.h"
#include "simd_block_bloom.h"

// ─── XXHash Integration ──────────────────────────────────────────────────────
static inline uint64_t xxhash64(const void* data, size_t len, uint64_t seed = 0) {
    return XXH64(data, len, seed);
}

static inline uint64_t IntLow64(const Int& n){ return n.bits64[0]; }

static inline uint64_t pointToFp(const Point& p) {
    uint64_t x = IntLow64(p.x);
    uint8_t parity = !p.y.IsEven();
    return xxhash64(&x, sizeof(x), parity);
}


// ─── Misc ─────────────────────────────────────────────────────────────────────
static std::string humanBytes(size_t bytes){
    constexpr const char* u[]{"B","Kb","Mb","Gb","Tb"};
    double v = double(bytes); int s = 0;
    while(v >= 1024.0 && s < 4){ v /= 1024.0; ++s; }
    std::ostringstream o;
    o << std::fixed << std::setprecision((v<10)?2:(v<100)?1:0) << v << u[s];
    return o.str();
}
static inline void intCopy(Int& d,const Int& s){ d.Set(&const_cast<Int&>(s)); }
static inline int bitlen(const Int& n){ return const_cast<Int&>(n).GetBitLength(); }
static inline bool intGE(const Int& a,const Int& b){
    return const_cast<Int&>(a).IsGreaterOrEqual(&const_cast<Int&>(b));
}

// ─── Scalar256 ────────────────────────────────────────────────────────────────
struct Scalar256{ uint64_t w[4]; };
static inline void intToScalar(const Int& n, Scalar256& s){
    s.w[0]=n.bits64[0]; s.w[1]=n.bits64[1];
    s.w[2]=n.bits64[2]; s.w[3]=n.bits64[3];
}
static inline void scalarToInt(const Scalar256& s, Int& n){
    n.SetInt32(0);
    for(int i=3;i>=0;--i){ n.ShiftL(64); Int tmp(s.w[i]); n.Add(&tmp); }
}

using fp_t = uint64_t;

// ─── SSD DP storage ────────────────────────────────────────────────────────────
#pragma pack(push,1)
struct DPSlot{ fp_t fp; Scalar256 key; };
#pragma pack(pop)
static_assert(sizeof(DPSlot)==40);

struct DPStorage{
    size_t                        cap=0, mapBytes=0;
    int                           fd=-1;
    DPSlot*                       slots=nullptr;
    std::unique_ptr<std::atomic<uint8_t>[]> st_used, st_lock;
    std::atomic<size_t>           size{0};           
    std::atomic<size_t>           dirty{0};
    std::atomic<size_t>           flush_counter{0};   
    std::atomic<bool>             enable_flush{true};

    void   init(const std::string& path,size_t c);
    void   flushIfNeeded(size_t slotIdx) noexcept;
    void   fullSync() noexcept;
    void   close();
};
static DPStorage dp;

static constexpr size_t FLUSH_STEP = 1ull<<24;   

// ─── DPStorage impl ───────────────────────────────────────────────────────────
void DPStorage::init(const std::string& path,size_t c){
    cap=c; mapBytes=cap*sizeof(DPSlot);

    int flags = O_RDWR | O_CREAT;
#ifdef O_DIRECT
    flags |= O_DIRECT;         
#endif
#ifdef O_SYNC
    flags |= O_SYNC;           
#endif
    fd = ::open(path.c_str(),flags,0644);
    if(fd<0){
        perror("open(dp)"); throw std::runtime_error("open(dp)");
    }
    if(posix_fallocate(fd,0,mapBytes)){
        perror("fallocate(dp)"); throw std::runtime_error("fallocate(dp)");
    }
    void* p = mmap(nullptr,mapBytes,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
    if(p==MAP_FAILED){
        perror("mmap(dp)"); throw std::runtime_error("mmap(dp)");
    }
    slots = static_cast<DPSlot*>(p);

    st_used = std::make_unique<std::atomic<uint8_t>[]>(cap);
    st_lock = std::make_unique<std::atomic<uint8_t>[]>(cap);
#pragma omp parallel for schedule(static)
    for(size_t i=0;i<cap;++i){
        st_used[i].store(0,std::memory_order_relaxed);
        st_lock[i].store(0,std::memory_order_relaxed);
    }
    madvise(slots,mapBytes,MADV_RANDOM);
}

void DPStorage::flushIfNeeded(size_t slotIdx) noexcept{
    if(!enable_flush.load(std::memory_order_relaxed)) return;
    if(flush_counter.fetch_add(1,std::memory_order_relaxed) % FLUSH_STEP == 0){
        size_t start = (slotIdx / FLUSH_STEP) * FLUSH_STEP;
        size_t end   = std::min(start + FLUSH_STEP, cap);
        size_t len   = (end - start) * sizeof(DPSlot);
        msync(reinterpret_cast<char*>(slots) + start*sizeof(DPSlot), len, MS_ASYNC);
    }
}
void DPStorage::fullSync() noexcept{
    msync(slots,mapBytes,MS_SYNC);
}
void DPStorage::close(){
    if(slots) munmap(slots,mapBytes);
    if(fd>=0) ::close(fd);
}

// ─── Curve ────────────────────────────────────────────────────────────────────
static const char *P_HEX="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F";
static const char *N_HEX="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141";
static Int P_PRIME, ORDER_N; static Secp256K1 secp;

static inline Point addP(const Point& a,const Point& b){
    if(const_cast<Int&>(a.x).IsZero() && const_cast<Int&>(a.y).IsZero()) return b;
    if(const_cast<Int&>(b.x).IsZero() && const_cast<Int&>(b.y).IsZero()) return a;
    return secp.AddDirect(const_cast<Point&>(a),const_cast<Point&>(b));
}
static inline Point mulP(const Int& k){ Int t(k); return secp.ComputePublicKey(&t); }

// ─── Bloom + DP API ───────────────────────────────────────────────────────────
static simd_bloom::SimdBlockFilterFixed<>* bloom=nullptr;
static std::atomic<uint64_t> dpDone{0};

inline bool sameScalar(const Scalar256& a,const Scalar256& b){
    return std::memcmp(&a,&b,sizeof(Scalar256))==0;
}

// ─── Dual-hash DP ────────────────────────────────────────────────────────────
bool dp_insert_unique(fp_t fp,const Int& idx){
    Int t(idx); t.Mod(&ORDER_N);
    Scalar256 key; intToScalar(t,key);

    size_t mask = dp.cap - 1;
    size_t h1   = fp & mask;
    size_t h2   = ((fp << 1) | 1) & mask;
    if(h2 == 0) h2 = 1;

    size_t h = h1;
    for(size_t i=0;i<dp.cap;++i){
        if(!dp.st_used[h].load(std::memory_order_acquire)){
            uint8_t exp=0;
            if(dp.st_lock[h].compare_exchange_strong(exp,1,std::memory_order_acq_rel)){
                if(!dp.st_used[h].load(std::memory_order_acquire)){
                    dp.slots[h].fp  = fp;
                    dp.slots[h].key = key;
                    dp.st_used[h].store(1,std::memory_order_release);
                    dp.st_lock[h].store(0,std::memory_order_release);

                    dp.size.fetch_add(1,std::memory_order_relaxed);
                    dp.flushIfNeeded(h);
                    dpDone.fetch_add(1,std::memory_order_relaxed);
                    return true;
                }
                dp.st_lock[h].store(0,std::memory_order_release);
            }
        }else if(dp.slots[h].fp==fp && sameScalar(dp.slots[h].key,key)){
            return false;
        }
        h = (h + h2) & mask;
    }
    return false;            
}
bool dp_find(fp_t fp,Int& out){
    size_t mask = dp.cap - 1;
    size_t h1   = fp & mask;
    size_t h2   = ((fp << 1) | 1) & mask;
    if(h2 == 0) h2 = 1;

    size_t h = h1;
    for(size_t i=0;i<dp.cap;++i){
        if(!dp.st_used[h].load(std::memory_order_acquire))
            return false;
        if(dp.slots[h].fp == fp){
            scalarToInt(dp.slots[h].key,out);
            return true;
        }
        h = (h + h2) & mask;
    }
    return false;
}

// ─── Binary DP I/O ────────────────────────────────────────────────────────────
#pragma pack(push,1)
struct DpItem{ fp_t fp; uint8_t priv[32]; };
#pragma pack(pop)

void saveDPBinary(const std::string& fn) {
    std::ofstream f(fn, std::ios::binary | std::ios::trunc);
    if (!f) { std::cerr << "[ERR] open " << fn << "\n"; return; }
    uint64_t cnt = 0;
    for (size_t h = 0; h < dp.cap; ++h) {
        if (!dp.st_used[h].load(std::memory_order_acquire)) continue;
        DpItem it;
        it.fp = dp.slots[h].fp;
        Int p; scalarToInt(dp.slots[h].key, p); p.Get32Bytes(it.priv);
        f.write(reinterpret_cast<char*>(&it), sizeof(it)); ++cnt;
    }
    std::cout << "Saved " << cnt << " traps to " << fn
              << " (" << humanBytes(f.tellp()) << ")\n";
}

bool loadDPBinary(const std::string& fn){
    std::ifstream f(fn,std::ios::binary|std::ios::ate);
    if(!f){ std::cerr<<"[ERR] open "<<fn<<"\n"; return false; }
    if(f.tellg()%sizeof(DpItem)){ std::cerr<<"[ERR] bad size\n"; return false; }
    f.seekg(0);
    DpItem it; uint64_t n = 0;
    while(f.read(reinterpret_cast<char*>(&it),sizeof(it))){
        Int p; p.Set32Bytes(it.priv);
        if(dp_insert_unique(it.fp,p)) bloom->Add(uint32_t(it.fp));
        if((++n & 0xFFFFF) == 0) std::cout<<"\rLoaded "<<n<<std::flush;
    }
    std::cout<<"\rLoaded "<<n<<" traps (done)\n";
    return true;
}

// ─── range split ──────────────────────────────────────────────────────────────
struct RangeSeg{ Int start,length; };
static std::vector<RangeSeg>
splitRange(const Int& A,const Int& total,unsigned parts){
    std::vector<RangeSeg> seg(parts);
    Int chunk(total); Int div((uint64_t)parts); chunk.Div(&div,nullptr);
    Int lenLast(total);
    if(parts>1){ Int t(chunk); Int m((uint64_t)(parts-1)); t.Mult(&m); lenLast.Sub(&t); }
    for(unsigned i=0;i<parts;++i){
        seg[i].start = A;
        if(i){
            Int off(chunk); Int k((uint64_t)i); off.Mult(&k); seg[i].start.Add(&off);
        }
        seg[i].length = (i==parts-1)?lenLast:chunk;
    }
    return seg;
}

// ─── batch-EC-add ─────────────────────────────────────────────────────────────
template<unsigned N>
static inline void batchAdd(Point* base,Point* plus){
    std::array<Int,N> dX;
    for(unsigned i=0;i<N;++i) dX[i].ModSub(&plus[i].x,&base[i].x);
    static thread_local IntGroup grp(N); grp.Set(dX.data()); grp.ModInv();

    for(unsigned i=0;i<N;++i){
        Int dY; dY.ModSub(&plus[i].y,&base[i].y);
        Int k ; k .ModMulK1(&dY,&dX[i]);
        Int k2; k2.ModSquareK1(&k);
        Int xn(base[i].x); xn.ModNeg(); xn.ModAdd(&k2); xn.ModSub(&plus[i].x);
        Int dx(base[i].x); dx.ModSub(&xn); dx.ModMulK1(&k);
        base[i].x = xn;
        base[i].y.ModNeg();
        base[i].y.ModAdd(&dx);
    }
}

// ─── jump-table ───────────────────────────────────────────────────────────────
static std::vector<Point> jumps;
static void buildJumpTable(unsigned k){
    jumps.resize(k);
#pragma omp parallel for schedule(static)
    for(unsigned i=0;i<k;++i){
        Int e((uint64_t)1); e.ShiftL(int(i+1));
        jumps[i] = mulP(e);
    }
}

// ─── globals ──────────────────────────────────────────────────────────────────
static std::atomic<uint64_t> hops{0}, restarts{0};
static std::atomic<bool>     solved{false};
static Int  privFound;
static std::atomic<unsigned> found_tid{0};
static std::atomic<uint64_t> winner_wraps{0};
static std::once_flag        record_flag;

// ─── wrap helper ──────────────────────────────────────────────────────────────
static inline void addWrapCnt(Int& v,const Int& d,const Int& len,uint64_t& wraps){
    v.Add(&const_cast<Int&>(d));
    if(intGE(v,len)){ v.Sub(&const_cast<Int&>(len)); ++wraps; }
}

// ─── Traps (Phase-1) ─────────────────────────────────────────────────────────
static constexpr unsigned K_DP = 512;
static void buildDP_segment(const RangeSeg& seg,uint64_t target,
                            unsigned k,unsigned bits,uint64_t seed){
    const uint64_t mask = (1ULL<<bits)-1;
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<uint64_t> rd;

    std::array<Int,   K_DP> dist;
    std::array<uint64_t,K_DP> wraps{};
    std::array<Point, K_DP> cur, stepPts;

    const size_t BATCH_SIZE = 256;
    std::vector<std::pair<fp_t,Int>> batch;
    batch.reserve(BATCH_SIZE);

    auto rndMod=[&](Int& o){
        o.SetInt32(0); int parts=(bitlen(seg.length)+63)/64;
        for(int p=0;p<parts;++p){
            Int t((uint64_t)rd(rng)); t.ShiftL(p*64); o.Add(&t);
        }
        o.Mod(&const_cast<Int&>(seg.length));
    };
    for(unsigned i=0;i<K_DP;++i){
        rndMod(dist[i]); Int a(seg.start); a.Add(&dist[i]); cur[i] = mulP(a);
    }

    uint64_t made = 0;
    while(made < target){
        for(unsigned i=0;i<K_DP;++i){
            uint64_t h = xxhash64(&cur[i].x.bits64[0], 8) % k;
            Int step((uint64_t)1); step.ShiftL(int(h+1));

            if((IntLow64(cur[i].x) & mask) == 0){
                fp_t fp = pointToFp(cur[i]);

                Int scalar(seg.length);
                Int w((uint64_t)wraps[i]); scalar.Mult(&w);
                scalar.Add(&const_cast<Int&>(dist[i]));
                scalar.Add(&const_cast<Int&>(seg.start));
                scalar.Mod(&ORDER_N);

                batch.emplace_back(fp,scalar);
                if(batch.size() >= BATCH_SIZE || made + batch.size() >= target){
#pragma omp critical(dp_insert)
                    {
                        for(auto& it: batch){
                            if(dp_insert_unique(it.first,it.second)){
                                bloom->Add(uint32_t(it.first));
                                ++made;
                                if(made==target) break;
                            }
                        }
                        batch.clear();
                    }
                }
            }

            stepPts[i] = jumps[h];
            addWrapCnt(dist[i],step,seg.length,wraps[i]);
        }
        batchAdd<K_DP>(cur.data(),stepPts.data());
    }
    if(!batch.empty()){
#pragma omp critical(dp_insert)
        {
            for(auto& it: batch){
                if(dp_insert_unique(it.first,it.second)){
                    bloom->Add(uint32_t(it.first));
                    ++made;
                }
            }
            batch.clear();
        }
    }
}

// ─── Phase-2: wild kangaroos ─────────────────────────────────────────────────
static constexpr unsigned K   = 512;
static constexpr unsigned CACHE_LIMIT = 1024;

struct PendingCheck{ fp_t fp; unsigned idx; };

static void worker(uint32_t tid,const RangeSeg& seg,const Point& pub,
                   unsigned k,unsigned bits){
    struct LoopDet{
        uint64_t next,cnt,sig;
        void reset(uint64_t s){ next=1024; cnt=0; sig=s; }
    };

    const uint64_t mask = (1ULL<<bits)-1;
    std::mt19937_64 rng(xxhash64(&tid, sizeof(tid), 0xDEADBEEF));
    std::uniform_int_distribution<uint64_t> rd;

    std::array<Int,   K> dist;
    std::array<uint64_t,K> wraps{};
    std::array<Point, K> cur, stepPts;
    std::array<LoopDet,K> loop;

    auto rndMod=[&](Int& o){
        o.SetInt32(0); int parts=(bitlen(seg.length)+63)/64;
        for(int p=0;p<parts;++p){
            Int t((uint64_t)rd(rng)); t.ShiftL(p*64); o.Add(&t);
        }
        o.Mod(&const_cast<Int&>(seg.length));
    };
    for(unsigned i=0;i<K;++i){
        rndMod(dist[i]);
        cur[i]  = addP(pub,mulP(dist[i]));
        loop[i].reset(pointToFp(cur[i]));
    }

    madvise(dp.slots,dp.mapBytes,MADV_SEQUENTIAL);

    uint64_t local=0; const uint64_t FLUSH = 1ULL<<18; 
    std::vector<PendingCheck> cache; cache.reserve(CACHE_LIMIT);

    while(!solved.load()){
        for(unsigned i=0;i<K;++i){
            if(solved.load()) return;

            uint64_t x64 = IntLow64(cur[i].x);
            uint64_t h   = xxhash64(&x64, sizeof(x64)) % k;

            auto& ld = loop[i];
            if(++ld.cnt == ld.next){
                uint64_t sig = pointToFp(cur[i]);
                if(sig == ld.sig){
                    rndMod(dist[i]);
                    cur[i]  = addP(pub,mulP(dist[i]));
                    wraps[i]=0; ld.reset(sig);
                    restarts.fetch_add(1,std::memory_order_relaxed);
                    continue;
                }
                ld.sig = sig; ld.next <<= 1;
            }

            stepPts[i] = jumps[h];
            Int step((uint64_t)1); step.ShiftL(int(h+1));
            addWrapCnt(dist[i],step,seg.length,wraps[i]);
            ++local;
        }
        batchAdd<K>(cur.data(),stepPts.data());
        if(local >= FLUSH){ hops.fetch_add(local); local = 0; }

        for(unsigned i=0;i<K;++i){
            if(solved.load()) return;
            if((IntLow64(cur[i].x)&mask)!=0) continue;

            fp_t fp = pointToFp(cur[i]);
            cache.push_back({fp,i});

            if(cache.size() >= CACHE_LIMIT){
#pragma omp critical(dp_query)
                {
                    for(auto& item: cache){
                        if(!bloom->Find(uint32_t(item.fp))) continue;
                        Int trap;
                        if(!dp_find(item.fp,trap)) continue;

                        Int dw(seg.length);
                        Int w((uint64_t)wraps[item.idx]); dw.Mult(&w);
                        dw.Add(&const_cast<Int&>(dist[item.idx]));
                        dw.Mod(&ORDER_N);

                        Int priv; intCopy(priv,trap); priv.Sub(&dw); priv.Mod(&ORDER_N);
                        Point tst = mulP(priv);
                        if(tst.x.IsEqual(&const_cast<Int&>(pub.x)) &&
                           tst.y.IsEqual(&const_cast<Int&>(pub.y)))
                        {
                            std::call_once(record_flag,[]{});
                            intCopy(privFound,priv);
                            found_tid.store(tid);
                            winner_wraps.store(wraps[item.idx]);
                            solved.store(true);
                        }
                    }
                }
                cache.clear();
                if(solved.load()) return;
            }
        }
    }
    if(local) hops.fetch_add(local);
}

// ─── main ─────────────────────────────────────────────────────────────────────
int main(int argc,char** argv)
{
    /* init curve */
    P_PRIME.SetBase16((char*)P_HEX); ORDER_N.SetBase16((char*)N_HEX); secp.Init();

    /* ── CLI ── */
    Int A,B; uint64_t traps=0; unsigned bits=12; size_t ramGB=8;
    Point pub; unsigned k_user=0; bool saveDP=false, loadDP=false;
    std::string dpFile;
    for(int i=1;i<argc;++i){
        std::string a=argv[i];
        if(a=="--range"){ std::string s=argv[++i]; auto p=s.find(':');
            A.SetBase10((char*)s.substr(0,p).c_str());
            B.SetBase10((char*)s.substr(p+1).c_str());
        }else if(a=="--dp_point") traps=std::strtoull(argv[++i],nullptr,10);
        else if(a=="--dp_bits")  bits=std::stoul(argv[++i]);
        else if(a=="--ram"||a=="--ram-limit") ramGB=std::stoull(argv[++i]);
        else if(a=="--k") k_user=std::stoul(argv[++i]);
        else if(a=="--pubkey"){
            std::string h=argv[++i]; if(h.rfind("0x",0)==0) h.erase(0,2);
            char pc=h[1]; Int x; x.SetBase16((char*)h.substr(2).c_str());
            pub.x=x; pub.y=secp.GetY(x,pc=='2');
        }else if(a=="--save-dp"||a=="-s") saveDP=true;
        else if(a=="--load-dp"){ loadDP=true; dpFile=argv[++i]; }
        else{ std::cerr<<"Unknown "<<a<<'\n'; return 1; }
    }
    if(A.IsZero()||B.IsZero()){ std::cerr<<"--range missing\n"; return 1; }

    /* ── params ── */
    Int range(B); range.Sub(&A);
    unsigned Lbits=range.GetBitLength();
    if(!traps){
        traps=(Lbits>=52)?(1ULL<<(Lbits/2))
             : uint64_t(std::ceil(range.ToDouble()/std::sqrt(range.ToDouble())));
    }
    unsigned k = k_user? k_user : std::max(1u,Lbits/2);

    /* нoвыe пapaмeтpы */
    constexpr double MAX_LOAD     = 0.50; 
    constexpr double bloomFactor  = 10.0;  

    size_t cap0 = size_t(std::ceil(double(traps) / MAX_LOAD));
    size_t cap  = 1;
    while(cap < cap0) cap <<= 1;

    dp.init("dp_table.bin",cap);

    size_t bloomBytes=size_t(traps*bloomFactor);
    std::cout<<"\n=========== Phase-0: Data summary ==========\n";
    std::cout<<"DP table (SSD): "<<humanBytes(cap*sizeof(DPSlot))
             <<"  ( "<<traps<<" / "<<cap<<" slots, load "
             <<std::fixed<<std::setprecision(2)
             <<double(traps)/cap*100<<"% )\n";
    std::cout<<"Bloom    (RAM): "<<humanBytes(bloomBytes)<<'\n';

    bloom=new simd_bloom::SimdBlockFilterFixed<>(bloomBytes);

    unsigned th=std::max(1u,std::thread::hardware_concurrency());
    auto segs=splitRange(A,range,th);
    uint64_t per=(traps+th-1)/th;
    buildJumpTable(k);

    // ─── Phase-1 ────────────────────────────────────────────────────────────
    dp.enable_flush.store(false);            
    std::cout<<"\n========== Phase-1: Building traps =========\n";
    if(loadDP){
        if(!loadDPBinary(dpFile)) return 1;
    }else{
        std::thread progress([&]{
            while(dpDone.load()<traps){
                std::cout<<"\rUnique traps: "<<dpDone<<'/'<<traps<<std::flush;
                std::this_thread::sleep_for(std::chrono::milliseconds(250));
            }
            std::cout<<"\rUnique traps: "<<traps<<"/"<<traps<<" (done)\n";
        });
#pragma omp parallel for schedule(static)
        for(unsigned t=0;t<th;++t)
            buildDP_segment(segs[t],per,k,bits,
                            xxhash64(&t, sizeof(t), 0xABCDEF12345678ULL));
        progress.join();
        if(saveDP) saveDPBinary("DP.bin");
    }
    dp.fullSync();                         
    dp.enable_flush.store(true);

    // ─── Phase-2 ────────────────────────────────────────────────────────────
    std::cout<<"\n=========== Phase-2: Kangaroos =============\n";
    auto t0=std::chrono::steady_clock::now();
    std::thread pool([&]{
#pragma omp parallel for num_threads(th) schedule(static)
        for(unsigned id=0;id<th;++id) worker(id,segs[id],pub,k,bits);
    });

    uint64_t lastHops=0;
    auto lastStat=t0;
    while(true){
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        if(solved.load()) break;
        auto nowTick=std::chrono::steady_clock::now();
        if(nowTick-lastStat<std::chrono::seconds(5)) continue;

        double dt=std::chrono::duration<double>(nowTick-lastStat).count();
        lastStat=nowTick;
        uint64_t now=hops.load(); uint64_t delta=now-lastHops; lastHops=now;
        double sp=delta/dt; const char* u=" H/s";
        if(sp>1e6){ sp/=1e6; u=" MH/s";}
        else if(sp>1e3){ sp/=1e3; u=" kH/s";}
        uint64_t sec=std::chrono::duration_cast<std::chrono::seconds>(nowTick-t0).count();

        std::cout<<"\rSpeed: "<<std::fixed<<std::setprecision(2)<<sp<<u
                 <<" | Hops: "<<now
                 <<" | Restart wild: "<<restarts.load()
                 <<" | Time: "<<sec/3600<<':'<<std::setw(2)<<std::setfill('0')
                 <<(sec/60)%60<<':'<<std::setw(2)<<sec%60<<std::flush;
    }
    pool.join();

    // ─── Phase-3 ────────────────────────────────────────────────────────────
    auto msTot=std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::steady_clock::now()-t0).count();
    uint64_t h=msTot/3'600'000,m=(msTot/60'000)%60,s=(msTot/1'000)%60,ms=msTot%1'000;

    std::cout<<"\n\n============= Phase-3: Result ==============\n";
    std::cout<<"Private key : 0x"<<std::setw(64)<<std::setfill('0')
             <<privFound.GetBase16()<<"\n";
    std::cout<<"Found by thr: "<<found_tid.load()<<"\n";
    std::cout<<"Wild wraps  : "<<winner_wraps.load()
             <<(winner_wraps.load()?"  [wrapped]\n":"  [no wrap]\n");
    std::cout<<"Wild restart: "<<restarts.load()<<"\n";
    std::cout<<"Total time  : "<<std::setw(2)<<h<<':'<<std::setw(2)<<m<<':'
             <<std::setw(2)<<s<<'.'<<std::setw(3)<<ms<<"\n";

    { std::ofstream("FOUND.txt")<<"0x"<<std::setw(64)<<std::setfill('0')
                                <<privFound.GetBase16()<<"\n"; }
    std::cout<<"Private key : saved to FOUND.txt\n";

    delete bloom; dp.close();
    return 0;
}

Makefile

Code:

# Compiler
CXX = g++

# Compiler flags
CXXFLAGS = -m64 -std=c++17 -Ofast -march=native -mtune=native \
           -Wall -Wextra -Wno-write-strings -Wno-unused-variable \
           -Wno-deprecated-copy -Wno-unused-parameter -Wno-sign-compare \
           -Wno-strict-aliasing -Wno-unused-but-set-variable \
           -funroll-loops -ftree-vectorize -fstrict-aliasing \
           -fno-semantic-interposition -fvect-cost-model=unlimited \
           -fno-trapping-math -fipa-ra -flto -fassociative-math \
           -fopenmp -mavx2 -mbmi2 -madx -fwrapv \
           -fomit-frame-pointer -fpredictive-commoning -fgcse-sm -fgcse-las \
           -fmodulo-sched -fmodulo-sched-allow-regmoves -funsafe-math-optimizations

# Linker flags
LDFLAGS = -lxxhash

# Source files
SRCS = Mark1.cpp Int.cpp SECP256K1.cpp Point.cpp Random.cpp IntMod.cpp IntGroup.cpp Timer.cpp

# Object files
OBJS = $(SRCS:.cpp=.o)

# Target executable
TARGET = Mark1

# Link the object files to create the executable and then delete .o files
$(TARGET): $(OBJS)
	$(CXX) $(CXXFLAGS) -o $(TARGET) $(OBJS) $(LDFLAGS)
	rm -f $(OBJS) && chmod +x $(TARGET)

# Compile each source file into an object file
%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

# Clean up build files
clean:
	@echo "Cleaning..."
	rm -f $(OBJS) $(TARGET)

# Phony targets
.PHONY: all clean

Code:

sudo apt-get install libxxhash-dev

P.S.
Next up is an even tighter package dp_table.bin. After that, it’s all GPU code, no extra fluff for max speed.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 28, 2025, 08:15:05 PM

Quote from: nomachine on June 28, 2025, 01:53:00 PM

Quote from: Akito S. M. Hosana on June 28, 2025, 01:33:06 PM

Quote from: nomachine on June 28, 2025, 12:52:34 PM

Quote from: Akito S. M. Hosana on June 28, 2025, 12:48:20 PM

Quote from: nomachine on June 28, 2025, 12:23:03 PM

hashing using XXH64_hash_t

Extremely fast Hash algorithm? :P

Yeah, XXH64 gets like 10-20 GB/s. ;D

Can you give us some code? I don't know what to ask the AI where to put XXHash's? :P

Mark1.cpp

Code:

#include <atomic>
#include <array>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <memory>
#include <mutex>
#include <random>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
#include <algorithm>

#include <omp.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xxhash.h>

#include "Int.h"
#include "Point.h"
#include "SECP256K1.h"
#include "IntGroup.h"
#include "simd_block_bloom.h"

// ─── XXHash Integration ──────────────────────────────────────────────────────
static inline uint64_t xxhash64(const void* data, size_t len, uint64_t seed = 0) {
    return XXH64(data, len, seed);
}

static inline uint64_t IntLow64(const Int& n){ return n.bits64[0]; }

static inline uint64_t pointToFp(const Point& p) {
    uint64_t x = IntLow64(p.x);
    uint8_t parity = !p.y.IsEven();
    return xxhash64(&x, sizeof(x), parity);
}


// ─── Misc ─────────────────────────────────────────────────────────────────────
static std::string humanBytes(size_t bytes){
    constexpr const char* u[]{"B","Kb","Mb","Gb","Tb"};
    double v = double(bytes); int s = 0;
    while(v >= 1024.0 && s < 4){ v /= 1024.0; ++s; }
    std::ostringstream o;
    o << std::fixed << std::setprecision((v<10)?2:(v<100)?1:0) << v << u[s];
    return o.str();
}
static inline void intCopy(Int& d,const Int& s){ d.Set(&const_cast<Int&>(s)); }
static inline int bitlen(const Int& n){ return const_cast<Int&>(n).GetBitLength(); }
static inline bool intGE(const Int& a,const Int& b){
    return const_cast<Int&>(a).IsGreaterOrEqual(&const_cast<Int&>(b));
}

// ─── Scalar256 ────────────────────────────────────────────────────────────────
struct Scalar256{ uint64_t w[4]; };
static inline void intToScalar(const Int& n, Scalar256& s){
    s.w[0]=n.bits64[0]; s.w[1]=n.bits64[1];
    s.w[2]=n.bits64[2]; s.w[3]=n.bits64[3];
}
static inline void scalarToInt(const Scalar256& s, Int& n){
    n.SetInt32(0);
    for(int i=3;i>=0;--i){ n.ShiftL(64); Int tmp(s.w[i]); n.Add(&tmp); }
}

using fp_t = uint64_t;

// ─── SSD DP storage ────────────────────────────────────────────────────────────
#pragma pack(push,1)
struct DPSlot{ fp_t fp; Scalar256 key; };
#pragma pack(pop)
static_assert(sizeof(DPSlot)==40);

struct DPStorage{
    size_t                        cap=0, mapBytes=0;
    int                           fd=-1;
    DPSlot*                       slots=nullptr;
    std::unique_ptr<std::atomic<uint8_t>[]> st_used, st_lock;
    std::atomic<size_t>           size{0};           
    std::atomic<size_t>           dirty{0};
    std::atomic<size_t>           flush_counter{0};   
    std::atomic<bool>             enable_flush{true};

    void   init(const std::string& path,size_t c);
    void   flushIfNeeded(size_t slotIdx) noexcept;
    void   fullSync() noexcept;
    void   close();
};
static DPStorage dp;

static constexpr size_t FLUSH_STEP = 1ull<<24;   

// ─── DPStorage impl ───────────────────────────────────────────────────────────
void DPStorage::init(const std::string& path,size_t c){
    cap=c; mapBytes=cap*sizeof(DPSlot);

    int flags = O_RDWR | O_CREAT;
#ifdef O_DIRECT
    flags |= O_DIRECT;         
#endif
#ifdef O_SYNC
    flags |= O_SYNC;           
#endif
    fd = ::open(path.c_str(),flags,0644);
    if(fd<0){
        perror("open(dp)"); throw std::runtime_error("open(dp)");
    }
    if(posix_fallocate(fd,0,mapBytes)){
        perror("fallocate(dp)"); throw std::runtime_error("fallocate(dp)");
    }
    void* p = mmap(nullptr,mapBytes,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
    if(p==MAP_FAILED){
        perror("mmap(dp)"); throw std::runtime_error("mmap(dp)");
    }
    slots = static_cast<DPSlot*>(p);

    st_used = std::make_unique<std::atomic<uint8_t>[]>(cap);
    st_lock = std::make_unique<std::atomic<uint8_t>[]>(cap);
#pragma omp parallel for schedule(static)
    for(size_t i=0;i<cap;++i){
        st_used[i].store(0,std::memory_order_relaxed);
        st_lock[i].store(0,std::memory_order_relaxed);
    }
    madvise(slots,mapBytes,MADV_RANDOM);
}

void DPStorage::flushIfNeeded(size_t slotIdx) noexcept{
    if(!enable_flush.load(std::memory_order_relaxed)) return;
    if(flush_counter.fetch_add(1,std::memory_order_relaxed) % FLUSH_STEP == 0){
        size_t start = (slotIdx / FLUSH_STEP) * FLUSH_STEP;
        size_t end   = std::min(start + FLUSH_STEP, cap);
        size_t len   = (end - start) * sizeof(DPSlot);
        msync(reinterpret_cast<char*>(slots) + start*sizeof(DPSlot), len, MS_ASYNC);
    }
}
void DPStorage::fullSync() noexcept{
    msync(slots,mapBytes,MS_SYNC);
}
void DPStorage::close(){
    if(slots) munmap(slots,mapBytes);
    if(fd>=0) ::close(fd);
}

// ─── Curve ────────────────────────────────────────────────────────────────────
static const char *P_HEX="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F";
static const char *N_HEX="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141";
static Int P_PRIME, ORDER_N; static Secp256K1 secp;

static inline Point addP(const Point& a,const Point& b){
    if(const_cast<Int&>(a.x).IsZero() && const_cast<Int&>(a.y).IsZero()) return b;
    if(const_cast<Int&>(b.x).IsZero() && const_cast<Int&>(b.y).IsZero()) return a;
    return secp.AddDirect(const_cast<Point&>(a),const_cast<Point&>(b));
}
static inline Point mulP(const Int& k){ Int t(k); return secp.ComputePublicKey(&t); }

// ─── Bloom + DP API ───────────────────────────────────────────────────────────
static simd_bloom::SimdBlockFilterFixed<>* bloom=nullptr;
static std::atomic<uint64_t> dpDone{0};

inline bool sameScalar(const Scalar256& a,const Scalar256& b){
    return std::memcmp(&a,&b,sizeof(Scalar256))==0;
}

// ─── Dual-hash DP ────────────────────────────────────────────────────────────
bool dp_insert_unique(fp_t fp,const Int& idx){
    Int t(idx); t.Mod(&ORDER_N);
    Scalar256 key; intToScalar(t,key);

    size_t mask = dp.cap - 1;
    size_t h1   = fp & mask;
    size_t h2   = ((fp << 1) | 1) & mask;
    if(h2 == 0) h2 = 1;

    size_t h = h1;
    for(size_t i=0;i<dp.cap;++i){
        if(!dp.st_used[h].load(std::memory_order_acquire)){
            uint8_t exp=0;
            if(dp.st_lock[h].compare_exchange_strong(exp,1,std::memory_order_acq_rel)){
                if(!dp.st_used[h].load(std::memory_order_acquire)){
                    dp.slots[h].fp  = fp;
                    dp.slots[h].key = key;
                    dp.st_used[h].store(1,std::memory_order_release);
                    dp.st_lock[h].store(0,std::memory_order_release);

                    dp.size.fetch_add(1,std::memory_order_relaxed);
                    dp.flushIfNeeded(h);
                    dpDone.fetch_add(1,std::memory_order_relaxed);
                    return true;
                }
                dp.st_lock[h].store(0,std::memory_order_release);
            }
        }else if(dp.slots[h].fp==fp && sameScalar(dp.slots[h].key,key)){
            return false;
        }
        h = (h + h2) & mask;
    }
    return false;            
}
bool dp_find(fp_t fp,Int& out){
    size_t mask = dp.cap - 1;
    size_t h1   = fp & mask;
    size_t h2   = ((fp << 1) | 1) & mask;
    if(h2 == 0) h2 = 1;

    size_t h = h1;
    for(size_t i=0;i<dp.cap;++i){
        if(!dp.st_used[h].load(std::memory_order_acquire))
            return false;
        if(dp.slots[h].fp == fp){
            scalarToInt(dp.slots[h].key,out);
            return true;
        }
        h = (h + h2) & mask;
    }
    return false;
}

// ─── Binary DP I/O ────────────────────────────────────────────────────────────
#pragma pack(push,1)
struct DpItem{ fp_t fp; uint8_t priv[32]; };
#pragma pack(pop)

void saveDPBinary(const std::string& fn) {
    std::ofstream f(fn, std::ios::binary | std::ios::trunc);
    if (!f) { std::cerr << "[ERR] open " << fn << "\n"; return; }
    uint64_t cnt = 0;
    for (size_t h = 0; h < dp.cap; ++h) {
        if (!dp.st_used[h].load(std::memory_order_acquire)) continue;
        DpItem it;
        it.fp = dp.slots[h].fp;
        Int p; scalarToInt(dp.slots[h].key, p); p.Get32Bytes(it.priv);
        f.write(reinterpret_cast<char*>(&it), sizeof(it)); ++cnt;
    }
    std::cout << "Saved " << cnt << " traps to " << fn
              << " (" << humanBytes(f.tellp()) << ")\n";
}

bool loadDPBinary(const std::string& fn){
    std::ifstream f(fn,std::ios::binary|std::ios::ate);
    if(!f){ std::cerr<<"[ERR] open "<<fn<<"\n"; return false; }
    if(f.tellg()%sizeof(DpItem)){ std::cerr<<"[ERR] bad size\n"; return false; }
    f.seekg(0);
    DpItem it; uint64_t n = 0;
    while(f.read(reinterpret_cast<char*>(&it),sizeof(it))){
        Int p; p.Set32Bytes(it.priv);
        if(dp_insert_unique(it.fp,p)) bloom->Add(uint32_t(it.fp));
        if((++n & 0xFFFFF) == 0) std::cout<<"\rLoaded "<<n<<std::flush;
    }
    std::cout<<"\rLoaded "<<n<<" traps (done)\n";
    return true;
}

// ─── range split ──────────────────────────────────────────────────────────────
struct RangeSeg{ Int start,length; };
static std::vector<RangeSeg>
splitRange(const Int& A,const Int& total,unsigned parts){
    std::vector<RangeSeg> seg(parts);
    Int chunk(total); Int div((uint64_t)parts); chunk.Div(&div,nullptr);
    Int lenLast(total);
    if(parts>1){ Int t(chunk); Int m((uint64_t)(parts-1)); t.Mult(&m); lenLast.Sub(&t); }
    for(unsigned i=0;i<parts;++i){
        seg[i].start = A;
        if(i){
            Int off(chunk); Int k((uint64_t)i); off.Mult(&k); seg[i].start.Add(&off);
        }
        seg[i].length = (i==parts-1)?lenLast:chunk;
    }
    return seg;
}

// ─── batch-EC-add ─────────────────────────────────────────────────────────────
template<unsigned N>
static inline void batchAdd(Point* base,Point* plus){
    std::array<Int,N> dX;
    for(unsigned i=0;i<N;++i) dX[i].ModSub(&plus[i].x,&base[i].x);
    static thread_local IntGroup grp(N); grp.Set(dX.data()); grp.ModInv();

    for(unsigned i=0;i<N;++i){
        Int dY; dY.ModSub(&plus[i].y,&base[i].y);
        Int k ; k .ModMulK1(&dY,&dX[i]);
        Int k2; k2.ModSquareK1(&k);
        Int xn(base[i].x); xn.ModNeg(); xn.ModAdd(&k2); xn.ModSub(&plus[i].x);
        Int dx(base[i].x); dx.ModSub(&xn); dx.ModMulK1(&k);
        base[i].x = xn;
        base[i].y.ModNeg();
        base[i].y.ModAdd(&dx);
    }
}

// ─── jump-table ───────────────────────────────────────────────────────────────
static std::vector<Point> jumps;
static void buildJumpTable(unsigned k){
    jumps.resize(k);
#pragma omp parallel for schedule(static)
    for(unsigned i=0;i<k;++i){
        Int e((uint64_t)1); e.ShiftL(int(i+1));
        jumps[i] = mulP(e);
    }
}

// ─── globals ──────────────────────────────────────────────────────────────────
static std::atomic<uint64_t> hops{0}, restarts{0};
static std::atomic<bool>     solved{false};
static Int  privFound;
static std::atomic<unsigned> found_tid{0};
static std::atomic<uint64_t> winner_wraps{0};
static std::once_flag        record_flag;

// ─── wrap helper ──────────────────────────────────────────────────────────────
static inline void addWrapCnt(Int& v,const Int& d,const Int& len,uint64_t& wraps){
    v.Add(&const_cast<Int&>(d));
    if(intGE(v,len)){ v.Sub(&const_cast<Int&>(len)); ++wraps; }
}

// ─── Traps (Phase-1) ─────────────────────────────────────────────────────────
static constexpr unsigned K_DP = 512;
static void buildDP_segment(const RangeSeg& seg,uint64_t target,
                            unsigned k,unsigned bits,uint64_t seed){
    const uint64_t mask = (1ULL<<bits)-1;
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<uint64_t> rd;

    std::array<Int,   K_DP> dist;
    std::array<uint64_t,K_DP> wraps{};
    std::array<Point, K_DP> cur, stepPts;

    const size_t BATCH_SIZE = 256;
    std::vector<std::pair<fp_t,Int>> batch;
    batch.reserve(BATCH_SIZE);

    auto rndMod=[&](Int& o){
        o.SetInt32(0); int parts=(bitlen(seg.length)+63)/64;
        for(int p=0;p<parts;++p){
            Int t((uint64_t)rd(rng)); t.ShiftL(p*64); o.Add(&t);
        }
        o.Mod(&const_cast<Int&>(seg.length));
    };
    for(unsigned i=0;i<K_DP;++i){
        rndMod(dist[i]); Int a(seg.start); a.Add(&dist[i]); cur[i] = mulP(a);
    }

    uint64_t made = 0;
    while(made < target){
        for(unsigned i=0;i<K_DP;++i){
            uint64_t h = xxhash64(&cur[i].x.bits64[0], 8) % k;
            Int step((uint64_t)1); step.ShiftL(int(h+1));

            if((IntLow64(cur[i].x) & mask) == 0){
                fp_t fp = pointToFp(cur[i]);

                Int scalar(seg.length);
                Int w((uint64_t)wraps[i]); scalar.Mult(&w);
                scalar.Add(&const_cast<Int&>(dist[i]));
                scalar.Add(&const_cast<Int&>(seg.start));
                scalar.Mod(&ORDER_N);

                batch.emplace_back(fp,scalar);
                if(batch.size() >= BATCH_SIZE || made + batch.size() >= target){
#pragma omp critical(dp_insert)
                    {
                        for(auto& it: batch){
                            if(dp_insert_unique(it.first,it.second)){
                                bloom->Add(uint32_t(it.first));
                                ++made;
                                if(made==target) break;
                            }
                        }
                        batch.clear();
                    }
                }
            }

            stepPts[i] = jumps[h];
            addWrapCnt(dist[i],step,seg.length,wraps[i]);
        }
        batchAdd<K_DP>(cur.data(),stepPts.data());
    }
    if(!batch.empty()){
#pragma omp critical(dp_insert)
        {
            for(auto& it: batch){
                if(dp_insert_unique(it.first,it.second)){
                    bloom->Add(uint32_t(it.first));
                    ++made;
                }
            }
            batch.clear();
        }
    }
}

// ─── Phase-2: wild kangaroos ─────────────────────────────────────────────────
static constexpr unsigned K   = 512;
static constexpr unsigned CACHE_LIMIT = 1024;

struct PendingCheck{ fp_t fp; unsigned idx; };

static void worker(uint32_t tid,const RangeSeg& seg,const Point& pub,
                   unsigned k,unsigned bits){
    struct LoopDet{
        uint64_t next,cnt,sig;
        void reset(uint64_t s){ next=1024; cnt=0; sig=s; }
    };

    const uint64_t mask = (1ULL<<bits)-1;
    std::mt19937_64 rng(xxhash64(&tid, sizeof(tid), 0xDEADBEEF));
    std::uniform_int_distribution<uint64_t> rd;

    std::array<Int,   K> dist;
    std::array<uint64_t,K> wraps{};
    std::array<Point, K> cur, stepPts;
    std::array<LoopDet,K> loop;

    auto rndMod=[&](Int& o){
        o.SetInt32(0); int parts=(bitlen(seg.length)+63)/64;
        for(int p=0;p<parts;++p){
            Int t((uint64_t)rd(rng)); t.ShiftL(p*64); o.Add(&t);
        }
        o.Mod(&const_cast<Int&>(seg.length));
    };
    for(unsigned i=0;i<K;++i){
        rndMod(dist[i]);
        cur[i]  = addP(pub,mulP(dist[i]));
        loop[i].reset(pointToFp(cur[i]));
    }

    madvise(dp.slots,dp.mapBytes,MADV_SEQUENTIAL);

    uint64_t local=0; const uint64_t FLUSH = 1ULL<<18; 
    std::vector<PendingCheck> cache; cache.reserve(CACHE_LIMIT);

    while(!solved.load()){
        for(unsigned i=0;i<K;++i){
            if(solved.load()) return;

            uint64_t x64 = IntLow64(cur[i].x);
            uint64_t h   = xxhash64(&x64, sizeof(x64)) % k;

            auto& ld = loop[i];
            if(++ld.cnt == ld.next){
                uint64_t sig = pointToFp(cur[i]);
                if(sig == ld.sig){
                    rndMod(dist[i]);
                    cur[i]  = addP(pub,mulP(dist[i]));
                    wraps[i]=0; ld.reset(sig);
                    restarts.fetch_add(1,std::memory_order_relaxed);
                    continue;
                }
                ld.sig = sig; ld.next <<= 1;
            }

            stepPts[i] = jumps[h];
            Int step((uint64_t)1); step.ShiftL(int(h+1));
            addWrapCnt(dist[i],step,seg.length,wraps[i]);
            ++local;
        }
        batchAdd<K>(cur.data(),stepPts.data());
        if(local >= FLUSH){ hops.fetch_add(local); local = 0; }

        for(unsigned i=0;i<K;++i){
            if(solved.load()) return;
            if((IntLow64(cur[i].x)&mask)!=0) continue;

            fp_t fp = pointToFp(cur[i]);
            cache.push_back({fp,i});

            if(cache.size() >= CACHE_LIMIT){
#pragma omp critical(dp_query)
                {
                    for(auto& item: cache){
                        if(!bloom->Find(uint32_t(item.fp))) continue;
                        Int trap;
                        if(!dp_find(item.fp,trap)) continue;

                        Int dw(seg.length);
                        Int w((uint64_t)wraps[item.idx]); dw.Mult(&w);
                        dw.Add(&const_cast<Int&>(dist[item.idx]));
                        dw.Mod(&ORDER_N);

                        Int priv; intCopy(priv,trap); priv.Sub(&dw); priv.Mod(&ORDER_N);
                        Point tst = mulP(priv);
                        if(tst.x.IsEqual(&const_cast<Int&>(pub.x)) &&
                           tst.y.IsEqual(&const_cast<Int&>(pub.y)))
                        {
                            std::call_once(record_flag,[]{});
                            intCopy(privFound,priv);
                            found_tid.store(tid);
                            winner_wraps.store(wraps[item.idx]);
                            solved.store(true);
                        }
                    }
                }
                cache.clear();
                if(solved.load()) return;
            }
        }
    }
    if(local) hops.fetch_add(local);
}

// ─── main ─────────────────────────────────────────────────────────────────────
int main(int argc,char** argv)
{
    /* init curve */
    P_PRIME.SetBase16((char*)P_HEX); ORDER_N.SetBase16((char*)N_HEX); secp.Init();

    /* ── CLI ── */
    Int A,B; uint64_t traps=0; unsigned bits=12; size_t ramGB=8;
    Point pub; unsigned k_user=0; bool saveDP=false, loadDP=false;
    std::string dpFile;
    for(int i=1;i<argc;++i){
        std::string a=argv[i];
        if(a=="--range"){ std::string s=argv[++i]; auto p=s.find(':');
            A.SetBase10((char*)s.substr(0,p).c_str());
            B.SetBase10((char*)s.substr(p+1).c_str());
        }else if(a=="--dp_point") traps=std::strtoull(argv[++i],nullptr,10);
        else if(a=="--dp_bits")  bits=std::stoul(argv[++i]);
        else if(a=="--ram"||a=="--ram-limit") ramGB=std::stoull(argv[++i]);
        else if(a=="--k") k_user=std::stoul(argv[++i]);
        else if(a=="--pubkey"){
            std::string h=argv[++i]; if(h.rfind("0x",0)==0) h.erase(0,2);
            char pc=h[1]; Int x; x.SetBase16((char*)h.substr(2).c_str());
            pub.x=x; pub.y=secp.GetY(x,pc=='2');
        }else if(a=="--save-dp"||a=="-s") saveDP=true;
        else if(a=="--load-dp"){ loadDP=true; dpFile=argv[++i]; }
        else{ std::cerr<<"Unknown "<<a<<'\n'; return 1; }
    }
    if(A.IsZero()||B.IsZero()){ std::cerr<<"--range missing\n"; return 1; }

    /* ── params ── */
    Int range(B); range.Sub(&A);
    unsigned Lbits=range.GetBitLength();
    if(!traps){
        traps=(Lbits>=52)?(1ULL<<(Lbits/2))
             : uint64_t(std::ceil(range.ToDouble()/std::sqrt(range.ToDouble())));
    }
    unsigned k = k_user? k_user : std::max(1u,Lbits/2);

    /* нoвыe пapaмeтpы */
    constexpr double MAX_LOAD     = 0.50; 
    constexpr double bloomFactor  = 10.0;  

    size_t cap0 = size_t(std::ceil(double(traps) / MAX_LOAD));
    size_t cap  = 1;
    while(cap < cap0) cap <<= 1;

    dp.init("dp_table.bin",cap);

    size_t bloomBytes=size_t(traps*bloomFactor);
    std::cout<<"\n=========== Phase-0: Data summary ==========\n";
    std::cout<<"DP table (SSD): "<<humanBytes(cap*sizeof(DPSlot))
             <<"  ( "<<traps<<" / "<<cap<<" slots, load "
             <<std::fixed<<std::setprecision(2)
             <<double(traps)/cap*100<<"% )\n";
    std::cout<<"Bloom    (RAM): "<<humanBytes(bloomBytes)<<'\n';

    bloom=new simd_bloom::SimdBlockFilterFixed<>(bloomBytes);

    unsigned th=std::max(1u,std::thread::hardware_concurrency());
    auto segs=splitRange(A,range,th);
    uint64_t per=(traps+th-1)/th;
    buildJumpTable(k);

    // ─── Phase-1 ────────────────────────────────────────────────────────────
    dp.enable_flush.store(false);            
    std::cout<<"\n========== Phase-1: Building traps =========\n";
    if(loadDP){
        if(!loadDPBinary(dpFile)) return 1;
    }else{
        std::thread progress([&]{
            while(dpDone.load()<traps){
                std::cout<<"\rUnique traps: "<<dpDone<<'/'<<traps<<std::flush;
                std::this_thread::sleep_for(std::chrono::milliseconds(250));
            }
            std::cout<<"\rUnique traps: "<<traps<<"/"<<traps<<" (done)\n";
        });
#pragma omp parallel for schedule(static)
        for(unsigned t=0;t<th;++t)
            buildDP_segment(segs[t],per,k,bits,
                            xxhash64(&t, sizeof(t), 0xABCDEF12345678ULL));
        progress.join();
        if(saveDP) saveDPBinary("DP.bin");
    }
    dp.fullSync();                         
    dp.enable_flush.store(true);

    // ─── Phase-2 ────────────────────────────────────────────────────────────
    std::cout<<"\n=========== Phase-2: Kangaroos =============\n";
    auto t0=std::chrono::steady_clock::now();
    std::thread pool([&]{
#pragma omp parallel for num_threads(th) schedule(static)
        for(unsigned id=0;id<th;++id) worker(id,segs[id],pub,k,bits);
    });

    uint64_t lastHops=0;
    auto lastStat=t0;
    while(true){
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        if(solved.load()) break;
        auto nowTick=std::chrono::steady_clock::now();
        if(nowTick-lastStat<std::chrono::seconds(5)) continue;

        double dt=std::chrono::duration<double>(nowTick-lastStat).count();
        lastStat=nowTick;
        uint64_t now=hops.load(); uint64_t delta=now-lastHops; lastHops=now;
        double sp=delta/dt; const char* u=" H/s";
        if(sp>1e6){ sp/=1e6; u=" MH/s";}
        else if(sp>1e3){ sp/=1e3; u=" kH/s";}
        uint64_t sec=std::chrono::duration_cast<std::chrono::seconds>(nowTick-t0).count();

        std::cout<<"\rSpeed: "<<std::fixed<<std::setprecision(2)<<sp<<u
                 <<" | Hops: "<<now
                 <<" | Restart wild: "<<restarts.load()
                 <<" | Time: "<<sec/3600<<':'<<std::setw(2)<<std::setfill('0')
                 <<(sec/60)%60<<':'<<std::setw(2)<<sec%60<<std::flush;
    }
    pool.join();

    // ─── Phase-3 ────────────────────────────────────────────────────────────
    auto msTot=std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::steady_clock::now()-t0).count();
    uint64_t h=msTot/3'600'000,m=(msTot/60'000)%60,s=(msTot/1'000)%60,ms=msTot%1'000;

    std::cout<<"\n\n============= Phase-3: Result ==============\n";
    std::cout<<"Private key : 0x"<<std::setw(64)<<std::setfill('0')
             <<privFound.GetBase16()<<"\n";
    std::cout<<"Found by thr: "<<found_tid.load()<<"\n";
    std::cout<<"Wild wraps  : "<<winner_wraps.load()
             <<(winner_wraps.load()?"  [wrapped]\n":"  [no wrap]\n");
    std::cout<<"Wild restart: "<<restarts.load()<<"\n";
    std::cout<<"Total time  : "<<std::setw(2)<<h<<':'<<std::setw(2)<<m<<':'
             <<std::setw(2)<<s<<'.'<<std::setw(3)<<ms<<"\n";

    { std::ofstream("FOUND.txt")<<"0x"<<std::setw(64)<<std::setfill('0')
                                <<privFound.GetBase16()<<"\n"; }
    std::cout<<"Private key : saved to FOUND.txt\n";

    delete bloom; dp.close();
    return 0;
}

Makefile

Code:

# Compiler
CXX = g++

# Compiler flags
CXXFLAGS = -m64 -std=c++17 -Ofast -march=native -mtune=native \
           -Wall -Wextra -Wno-write-strings -Wno-unused-variable \
           -Wno-deprecated-copy -Wno-unused-parameter -Wno-sign-compare \
           -Wno-strict-aliasing -Wno-unused-but-set-variable \
           -funroll-loops -ftree-vectorize -fstrict-aliasing \
           -fno-semantic-interposition -fvect-cost-model=unlimited \
           -fno-trapping-math -fipa-ra -flto -fassociative-math \
           -fopenmp -mavx2 -mbmi2 -madx -fwrapv \
           -fomit-frame-pointer -fpredictive-commoning -fgcse-sm -fgcse-las \
           -fmodulo-sched -fmodulo-sched-allow-regmoves -funsafe-math-optimizations

# Linker flags
LDFLAGS = -lxxhash

# Source files
SRCS = Mark1.cpp Int.cpp SECP256K1.cpp Point.cpp Random.cpp IntMod.cpp IntGroup.cpp Timer.cpp

# Object files
OBJS = $(SRCS:.cpp=.o)

# Target executable
TARGET = Mark1

# Link the object files to create the executable and then delete .o files
$(TARGET): $(OBJS)
	$(CXX) $(CXXFLAGS) -o $(TARGET) $(OBJS) $(LDFLAGS)
	rm -f $(OBJS) && chmod +x $(TARGET)

# Compile each source file into an object file
%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

# Clean up build files
clean:
	@echo "Cleaning..."
	rm -f $(OBJS) $(TARGET)

# Phony targets
.PHONY: all clean

Code:

sudo apt-get install libxxhash-dev

P.S.
Next up is an even tighter package dp_table.bin. After that, it’s all GPU code, no extra fluff for max speed.

Can this be further optimized in terms of speed? :P

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 28, 2025, 08:38:29 PM

Quote from: Akito S. M. Hosana on June 28, 2025, 08:15:05 PM

Can this be further optimized in terms of speed? :P

The current implementation is already highly optimized, so further gains would likely be in the 10-30% range rather than order-of-magnitude improvements. The most promising areas would be hash function optimization and fine-tuning batch sizes for specific hardware.

These batch sizes could be tuned based on:

CPU cache sizes (L1/L2/L3)

Available SIMD registers

Benchmark different sizes (256, 512, 1024)

Current DP Table Structure:

Code:

#pragma pack(push,1)
struct DPSlot{ fp_t fp; Scalar256 key; };
#pragma pack(pop)
static_assert(sizeof(DPSlot)==40);

8 bytes for the fingerprint (fp_t)

32 bytes for the scalar key (Scalar256)

Total: 40 bytes per slot

Could potentially reduce to 32-bit with:

More frequent collisions (manageable)

Secondary verification when matches occur

Savings: 4 bytes per slot (10% reduction)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: kTimesG on June 28, 2025, 09:19:28 PM

Quote from: nomachine on June 28, 2025, 08:38:29 PM

If this is IDLP you don't need to store full keys, just distance from base start. This alone saves up to 20 bytes. However this works only if the max distance is ensured to never be reached.

Also I'm willing to bet an arm and leg that mmap-ing the DP file will lose big time against an actual database file, when there are lots of DPs, since they'll be spread around the disk, I wonder what your OS will have to say about that, when seeking around and pages start to swap in and out (besides all running apps stalling due to low free memory)

But seems that there's some kind of competition on these forums on who can reinvent a rounder wheel.

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: nomachine on June 28, 2025, 11:37:45 PM

Quote from: kTimesG on June 28, 2025, 09:19:28 PM

You're right about the distance-only storage optimization for DLP.

This implementation isn’t mine. I’m just analyzing the code. I haven't even started it in my OS. This is all theoretical. If the original author wants to optimize further, they’d need to adopt delta encoding (storing walk distances, not full keys), replace mmap with a proper DB for large-scale DP storage, and tighten memory limits to avoid thrashing.

These changes require significant redesign. Until then, it’s stuck with these limitations. ;)

Title: Re: Mark1 - pollard rho implementation (38 minutes for 80 bits solving on CPU)
Post by: Akito S. M. Hosana on June 28, 2025, 11:43:59 PM

Quote from: nomachine on June 28, 2025, 11:37:45 PM

This implementation isn’t mine. I’m just analyzing the code. I haven't even started it in my OS. This is all theoretical.

I tested it. That thing’s fast as hell ‘til puzzle 80. After that? Mission f**in’ impossible :-\

Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: FrozenThroneGuy on June 19, 2025, 10:57:10 AM