Author Topic: UltrafastSecp256k1 (zero-dependency, high-performance secp256k1 for CPU, CUDA, RISC-V, OpenCL)
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 11, 2026, 05:27:59 PM
Last edit: February 14, 2026, 06:08:24 PM by shrec
#1

Hi everyone,

I’ve been working on a performance-oriented implementation of secp256k1 written from scratch in C++ and CUDA, with optional x86-64 (BMI2/ADX) and RISC-V optimizations.

This project is focused on architectural efficiency, benchmarking, and hardware-aware ECC implementation. It is not intended for breaking cryptography or private key recovery.

### Goals

* Implement secp256k1 without external big-integer libraries
* Maintain deterministic memory layout
* Avoid dynamic allocation in hot paths (see the sketch just after this list)
* Explore hardware-level performance limits
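
To make the fixed-layout, allocation-free goal concrete, here is a minimal sketch of the kind of field element this implies. The names (`FieldElement`, `add_raw`) are illustrative only, not the library's actual API:

```cpp
#include <array>
#include <cstdint>

// Illustrative only (not this library's actual API): a 256-bit field
// element as four 64-bit limbs in little-endian limb order, fixed size,
// no heap allocation anywhere.
struct FieldElement {
    std::array<uint64_t, 4> limbs;   // limbs[0] is least significant
};

// Raw addition with explicit carry propagation; a real implementation
// would fold the final carry back in modulo p.
inline FieldElement add_raw(const FieldElement& a, const FieldElement& b) {
    FieldElement r{};
    unsigned __int128 carry = 0;     // GCC/Clang extension
    for (int i = 0; i < 4; ++i) {
        unsigned __int128 t = (unsigned __int128)a.limbs[i] + b.limbs[i] + carry;
        r.limbs[i] = (uint64_t)t;
        carry = t >> 64;
    }
    return r;
}
```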

### Features

* Complete field arithmetic (mod p)
* Scalar arithmetic (mod n)
* Affine and Jacobian coordinates
* GLV optimization
* CPU optimizations (BMI2/ADX)
* RISC-V RV64GC support
* CUDA batch kernels
* Benchmark suite included

### Measured Performance

On RTX 5060:
~2.5 billion Jacobian mixed-add operations per second (measured)

CPU benchmarks also show 3–5× improvement over naive implementations when using BMI2/ADX paths.
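For readers unfamiliar with the term, a "Jacobian mixed add" is the addition of an affine point into a Jacobian point. Below is a toy sketch of the standard formula sequence (madd-2007-bl from the Explicit-Formulas Database); a small 64-bit prime stands in for the real 256-bit secp256k1 prime so the snippet is self-contained, and none of this is the library's own code:

```cpp
#include <cstdint>

// Toy illustration of the Jacobian + affine "mixed add" sequence
// (madd-2007-bl). Only the operation sequence is the point here; the
// field below is a 64-bit toy field, NOT the secp256k1 field.
using u64  = uint64_t;
using u128 = unsigned __int128;                      // GCC/Clang extension
static const u64 P = 0xFFFFFFFFFFFFFFC5ULL;          // toy prime

static u64 addm(u64 a, u64 b) { return (u64)(((u128)a + b) % P); }
static u64 subm(u64 a, u64 b) { return (u64)(((u128)a + P - b) % P); }
static u64 mulm(u64 a, u64 b) { return (u64)(((u128)a * b) % P); }
static u64 sqrm(u64 a)        { return mulm(a, a); }

// (X1, Y1, Z1) Jacobian += (x2, y2) affine; assumes the points are distinct.
static void mixed_add(u64& X1, u64& Y1, u64& Z1, u64 x2, u64 y2) {
    u64 Z1Z1 = sqrm(Z1);
    u64 U2   = mulm(x2, Z1Z1);
    u64 S2   = mulm(y2, mulm(Z1, Z1Z1));
    u64 H    = subm(U2, X1);
    u64 HH   = sqrm(H);
    u64 I    = addm(addm(HH, HH), addm(HH, HH));     // 4*HH
    u64 J    = mulm(H, I);
    u64 r    = addm(subm(S2, Y1), subm(S2, Y1));     // 2*(S2 - Y1)
    u64 V    = mulm(X1, I);
    u64 X3   = subm(subm(sqrm(r), J), addm(V, V));
    u64 Y3   = subm(mulm(r, subm(V, X3)), addm(mulm(Y1, J), mulm(Y1, J)));
    u64 Z3   = subm(subm(sqrm(addm(Z1, H)), Z1Z1), HH);
    X1 = X3; Y1 = Y3; Z1 = Z3;
}
```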

### Design Approach

The implementation treats elliptic curve math as a hardware interaction problem:

* Little-endian limb layout for computational efficiency
* Explicit carry handling
* Batch inversion via Montgomery’s trick (sketched at the end of this section)
* Minimal abstraction in hot execution paths

The idea is to reduce unnecessary movement and keep arithmetic predictable at the instruction level.
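
For reference, Montgomery's trick turns N field inversions into one inversion plus roughly 3N multiplications: multiply everything into a running prefix product, invert that single product, then walk backwards peeling off one inverse at a time. A generic sketch, with the field type and its multiply/invert passed in by the caller (so no particular library API is implied):

```cpp
#include <cstddef>
#include <vector>

// Montgomery's trick: n inversions for the price of one inversion plus
// about 3(n-1) multiplications. Fe, fe_mul and fe_inv are supplied by the
// caller; a hot-path version would use a fixed scratch buffer instead of
// std::vector to avoid dynamic allocation.
template <typename Fe, typename Mul, typename Inv>
void batch_invert(std::vector<Fe>& v, Mul fe_mul, Inv fe_inv) {
    const std::size_t n = v.size();
    if (n == 0) return;

    std::vector<Fe> prefix(n);
    prefix[0] = v[0];
    for (std::size_t i = 1; i < n; ++i)
        prefix[i] = fe_mul(prefix[i - 1], v[i]);   // running products

    Fe acc = fe_inv(prefix[n - 1]);                // the one real inversion
    for (std::size_t i = n; i-- > 1; ) {
        Fe inv_i = fe_mul(acc, prefix[i - 1]);     // = 1 / v[i]
        acc      = fe_mul(acc, v[i]);              // = 1 / (v[0] * ... * v[i-1])
        v[i]     = inv_i;
    }
    v[0] = acc;
}
```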

### Scope Disclaimer

This project does NOT claim:

* Any weakness in secp256k1
* Practical discrete log attacks
* Private key recovery

It is purely for performance research, benchmarking, and educational exploration of ECC implementations.

### Repository

[https://github.com/shrec/UltrafastSecp256k1](https://github.com/shrec/UltrafastSecp256k1)

I would appreciate feedback from anyone working on:

* ECC performance
* GLV implementation details
* GPU optimization strategies
* RISC-V vectorization approaches

Thanks.

If there is interest, I can post detailed benchmark comparisons and profiling results.
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 11, 2026, 07:16:29 PM
#2

Thank you for the valuable observation — I fully agree with your point regarding Fermat-based inversion and side-channel considerations.

At the current stage, the library is primarily focused on performance research and architectural exploration. Some inversion paths are optimized for raw throughput rather than strict constant-time guarantees, particularly in closed or controlled environments where side-channel exposure is not a concern.

However, side-channel resistance and deterministic execution are already planned as a next development stage.

The long-term design goal is to maintain two distinct operational modes:

1. A maximum-performance mode 
   Intended for closed systems or benchmarking scenarios where side-channel leakage is not a threat model and absolute throughput is prioritized.

2. A hardened mode 
   Intended for exposed or wallet-level architectures, where additional protections will be introduced, including:
   - Fermat-based inversion (p-2 exponentiation)
   - Montgomery ladder for scalar multiplication
   - Strict constant-time arithmetic paths
   - Reduced branch-dependent behavior

The idea is to allow developers integrating the library to explicitly choose the security/performance profile that matches their deployment environment.
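
For concreteness, the Fermat-based inversion listed for the hardened mode is just the identity a^(p-2) ≡ a^(-1) (mod p) for prime p, evaluated by exponentiation. A toy sketch over a machine-word modulus; the real field is 256-bit, and a production version would use a fixed, precomputed addition chain rather than this generic loop:

```cpp
#include <cstdint>

using u64  = uint64_t;
using u128 = unsigned __int128;   // GCC/Clang extension

static u64 mulmod(u64 a, u64 b, u64 p) { return (u64)(((u128)a * b) % p); }

// Fermat inversion: a^(p-2) mod p equals a^(-1) for prime p. The exponent
// p-2 is a public constant, so branching on its bits leaks nothing secret;
// a hardened build would still fix the operation schedule (e.g. an
// addition chain specialized for the real 256-bit prime).
static u64 fermat_inverse(u64 a, u64 p) {
    u64 result = 1, base = a % p, e = p - 2;
    while (e) {
        if (e & 1) result = mulmod(result, base, p);
        base = mulmod(base, base, p);
        e >>= 1;
    }
    return result;
}
```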

I believe separating performance research from hardened cryptographic deployment is important to avoid mixing tradeoffs implicitly.

Your comment reinforces the direction I intend to take in the hardened path — much appreciated.
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 14, 2026, 04:27:41 PM
#3

Added OpenCL and ESP32 support, and optimized a few more parts of the library. :)
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 15, 2026, 11:08:39 AM
#4

# 🔥 UltrafastSecp256k1 — Feature Overview

UltrafastSecp256k1 is a zero-dependency, multi-platform secp256k1 stack covering the entire modern Bitcoin/EVM ecosystem — from field arithmetic to advanced multi-party protocols.

## 🧠 Core Engine

* Field / Scalar / Point arithmetic
* GLV endomorphism acceleration (see the note just after this list)
* Precomputation tables
* Deterministic RFC6979 nonces
* Strict low-S normalization
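
For readers new to GLV: secp256k1 has a cheap endomorphism φ(x, y) = (β·x mod p, y) that acts on the group as multiplication by a fixed constant λ, where β³ ≡ 1 (mod p) and λ³ ≡ 1 (mod n). A full-width scalar multiplication can therefore be split into two roughly half-width ones:

$$
k \equiv k_1 + k_2\,\lambda \pmod{n}, \qquad
k\,P = k_1\,P + k_2\,\varphi(P), \qquad
|k_1|,\,|k_2| \approx \sqrt{n}.
$$

The two half-length multiplications share their doublings (via Strauss/Shamir, listed further down), which is where the speedup comes from.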

## ⚙ Assembly & SIMD

* x64 (MASM/GAS, BMI2/ADX)
* ARM64 inline assembly
* RISC-V + RVV
* Montgomery batch inversion
* AVX2 / AVX-512 batch ops

## 🔐 Constant-Time Layer

* Constant-time field/scalar/point operations
* Separate `secp256k1::ct` namespace
* Montgomery ladder scalar multiplication
* Side-channel resistant design (see the swap sketch after this list)
* No runtime flag switching (explicit API separation)
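
A small illustration of the kind of primitive a constant-time layer is built on: a conditional swap whose memory accesses and instruction stream are identical whether the secret bit is 0 or 1. This is a generic sketch, not the `secp256k1::ct` API:

```cpp
#include <cstddef>
#include <cstdint>

// Constant-time conditional swap of two limb arrays: the same loads,
// stores and arithmetic execute whether `bit` is 0 or 1, so timing and
// branch-predictor state do not depend on the secret bit.
static void ct_cswap(uint64_t* a, uint64_t* b, std::size_t n, uint64_t bit) {
    const uint64_t mask = (uint64_t)0 - (bit & 1);   // 0x00...0 or 0xFF...F
    for (std::size_t i = 0; i < n; ++i) {
        const uint64_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}
```

In a Montgomery ladder this swap runs once per scalar bit, so both ladder points are read and written identically on every iteration.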

## ✍ Digital Signatures

### ECDSA

* Sign / Verify
* DER / Compact
* Pubkey recovery (recid)
* Batch verification

### Schnorr (BIP-340)

* Sign / Verify
* Batch verification (see the batch equation after this list)
* X-only public keys
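
BIP-340 batch verification rests on a random linear combination: instead of checking each s·G = R + e·P separately, the verifier draws random coefficients a_i and checks a single multi-scalar multiplication,

$$
\Big(\sum_i a_i s_i\Big) G \;=\; \sum_i a_i R_i \;+\; \sum_i (a_i e_i)\,P_i ,
$$

where the random a_i prevent invalid signatures from being crafted so that their errors cancel across the batch.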

## 🤝 Multi-Party & Advanced Protocols

* MuSig2 (2-round aggregation)
* FROST (t-of-n threshold signatures)
* Adaptor signatures
* Pedersen commitments
* Multi-scalar multiplication (Strauss/Shamir)
* ECDH (raw, x-only, SHA-256)

## 🟢 Bitcoin Stack

* Taproot (BIP-341/342)
* BIP-32 HD derivation (xprv/xpub)
* BIP-44 coin-type derivation
* BIP-352 Silent Payments
* Address generation:

  * P2PKH
  * P2WPKH
  * P2TR
  * Base58Check
  * Bech32 / Bech32m

## 🌍 Multi-Chain Support (27+ Coins)

* BTC, LTC, DOGE, BCH, BSV
* ETH (EIP-55)
* DASH, ZEC, RVN, QTUM, etc.
* Auto-dispatch address generation
* Coin-aware HD derivation
* Built-in Keccak-256

## 🔬 Research & Customization

* Custom generator support
* Fully custom curve context
* Zero-overhead default path
* Deterministic vector self-tests
* Cross-backend layout validation

## 🚀 GPU Backends

* CUDA kernels
* OpenCL backend
* ROCm/HIP portability
* Occupancy tuning
* PGO support
* Inline PTX
* Branchless field ops

## 📦 Platforms

* x86-64
* ARM64
* RISC-V
* ESP32
* STM32
* WebAssembly (Emscripten)
* iOS (SPM, CocoaPods, XCFramework)
* Android (NDK)
* CUDA / OpenCL / ROCm

## 🧪 Testing & Quality

* 200+ tests
* Fuzz harnesses
* Cross-platform CI
* Known vector verification
* Deterministic reproducibility
kTimesG
Full Member
Offline

Activity: 742
Merit: 227

February 15, 2026, 01:20:21 PM
#5

Went through the repo.

Your code is the first (public) I've ever seen which does block batch inversion (e.g. a single inverse for the entire block of threads, via shared memory). So, congrats on this, it took me several weeks to write something which is almost identical in nature (and later discovering syncthreads bugs after two years, reading un-synced sibling values from bad warp scheduling, only on specific cards, it was a pain to debug!).

So this looks good but seems way too generic. Obviously, specific requirements can benefit from much larger optimizations.

NotATether
Legendary
Offline

Activity: 2240
Merit: 9486

February 15, 2026, 03:12:43 PM
#6

I am interested in knowing what hardware you used to test the RISC-V version.

Maybe even the ARM version as well, since Raspberry-Pis and Arduinos aren't exactly known for their compute power (to say nothing about their NVIDIA gpu interfacing support).

This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.

 
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 15, 2026, 04:12:27 PM
#7
Hello, I have a Milk-V Mars board, and I compiled and benchmarked on it. I don't have any other RISC-V device at the moment. I've also ordered the RISC-V version of the ESP32 and plan to port to it as well.
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 15, 2026, 04:18:46 PM
#8
Quote from: NotATether on February 15, 2026, 03:12:43 PM
I am interested in knowing what hardware you used to test the RISC-V version.

Maybe even the ARM version as well, since Raspberry-Pis and Arduinos aren't exactly known for their compute power (to say nothing about their NVIDIA gpu interfacing support).

This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.

I used a Milk-V Mars. The library is zero-dependency and should work anywhere you like: I've tested it on ESP32-S3, ESP32-PICO-D4, STM32, and an Android device with a Rockchip SoC, and it builds and runs everywhere without problems since I'm not using any external libraries. I don't have any Apple devices to test on, but it built fine in GitHub Actions. I don't have every kind of hardware, so I'd be glad to receive benchmark results from the community on different platforms and to fix any issues together with you. It should also work on a Raspberry Pi without problems; I don't have one at the moment to build and test on, but with Clang and LLVM you can build it on all platforms.
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 15, 2026, 04:26:55 PM
#9

Quote from: kTimesG on February 15, 2026, 01:20:21 PM
Went through the repo.

Your code is the first (public) I've ever seen which does block batch inversion (e.g. a single inverse for the entire block of threads, via shared memory). So, congrats on this, it took me several weeks to write something which is almost identical in nature (and later discovering syncthreads bugs after two years, reading un-synced sibling values from bad warp scheduling, only on specific cards, it was a pain to debug!).

So this looks good but seems way too generic. Obviously, specific requirements can benefit from much larger optimizations.

I'm open to any new ideas and optimization suggestions. All the code is open in the repo, so everyone can check every part of it. I'd be glad to receive suggestions and new benchmark results on different hardware.
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 15, 2026, 05:13:23 PM
#10

Quote from: NotATether on February 15, 2026, 03:12:43 PM
I am interested in knowing what hardware you used to test the RISC-V version.

Maybe even the ARM version as well, since Raspberry-Pis and Arduinos aren't exactly known for their compute power (to say nothing about their NVIDIA gpu interfacing support).

This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.

I've started work on Metal support but don't have a real device to test it on. If you have one, we could collaborate.
shrec (OP)
Newbie
Offline

Activity: 10
Merit: 0

February 16, 2026, 04:56:03 AM
#11

# What's New in v3.3.0 (Latest)

## Comprehensive Benchmarks (Metal + WASM)

* Metal GPU benchmark (`bench_metal.mm`): 9 operations (Field Mul/Add/Sub/Sqr/Inv, Point Add/Double, Scalar Mul (P×k), Generator Mul (G×k)). Matches the CUDA benchmark format with warmup, kernel-only timing, and throughput tables.
* 3 new Metal GPU kernels: `field_add_bench`, `field_sub_bench`, `field_inv_bench` added to `secp256k1_kernels.metal`
* WASM benchmark (`bench_wasm.mjs`): Node.js benchmark for all WASM-exported operations (Pubkey Create (G×k), Point Mul (P×k), Point Add (P+Q), ECDSA Sign/Verify, Schnorr Sign/Verify, SHA-256 (32B/1KB))

### WASM Benchmark Results (CI, Node.js v20, Linux x64)

| Operation | Time/Op | Throughput |
|---|---|---|
| SHA-256 (32B) | 649 ns | 1.54 M/s |
| SHA-256 (1KB) | 6.33 µs | 158 K/s |
| Point Add (P+Q) | 22 µs | 45 K/s |
| Pubkey Create (G×k) | 70 µs | 14 K/s |
| Schnorr Sign | 92 µs | 11 K/s |
| ECDSA Sign | 146 µs | 7 K/s |
| Point Mul (P×k) | 693 µs | 1.4 K/s |
| ECDSA Verify | 825 µs | 1.2 K/s |
| Schnorr Verify | 874 µs | 1.1 K/s |

## CI Hardening

* WASM benchmark now runs in CI (Node.js 20 setup + execution in the wasm job)
* Metal: skip generator_mul test on non-Apple7+ paravirtual devices (CI fix)
* Benchmark alert threshold raised from 120% → 150% (reduces false positives on shared CI runners)
* Fix WASM runtime crash: removed `--closure 1`, added `-fno-exceptions`, increased WASM memory (4 MB initial, 512 KB stack)

## Bug Fixes

* Fix Metal shader compilation errors (MSL address space mismatches, jacobian_to_affine ordering)
* Fix Keccak rotl64 undefined behavior (shift by 0)
* Fix macOS build flags for Clang compatibility
* Fix metal2.4 shader standard for newer Xcode toolchains
* Remove unused .cuh files and sorted_ecc_db

## Testing & Quality

* Unified test runner (12 test files consolidated)
* Selftest modes: smoke (fast), ci (full), stress (extended)
* Boundary KAT vectors, field limb boundary tests, batch inverse sweep
* Repro bundle support for deterministic test reproduction
* Sanitizer CI integration (ASan/UBSan)

## Security & Maturity

* SECURITY.md v3.2, THREAT_MODEL.md
* API stability guarantees documented
* Fuzz testing documentation
* Security contact: payysoon@gmail.com

## Documentation

* Batch inverse & mixed addition API reference with examples (full point, X-only, CUDA, division, scratch reuse, Montgomery trick)
* README cleanup: removed AI-generated text, translated Georgian → English
* Removed database/lookup/bloom references from public docs

## Metal Backend Improvements

* Apple Metal GPU backend with Comba-accelerated field arithmetic
* 4-bit windowed scalar multiplication on GPU (sketched below)
* Chunked Montgomery batch inverse
* Branchless bloom check with coalesced memory access
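
For context on the 4-bit windowed scalar multiplication item: the scalar is consumed four bits at a time against a 16-entry table of small multiples, trading a little precomputation for a quarter of the additions. A generic sketch with caller-supplied point operations; none of the names below are the library's or the Metal kernel's actual API:

```cpp
#include <cstdint>

// Generic 4-bit fixed-window scalar multiplication sketch. Point,
// point_add, point_double and the identity element are supplied by the
// caller. The scalar k is 256 bits, most significant byte first.
template <typename Point, typename Add, typename Dbl>
Point scalar_mul_w4(const Point& P, const uint8_t k[32],
                    Add point_add, Dbl point_double, Point identity) {
    // Precompute 0*P .. 15*P (one table entry per possible window digit).
    Point table[16];
    table[0] = identity;
    for (int i = 1; i < 16; ++i)
        table[i] = point_add(table[i - 1], P);

    Point acc = identity;
    for (int byte = 0; byte < 32; ++byte) {
        for (int half = 0; half < 2; ++half) {
            for (int d = 0; d < 4; ++d)            // shift the accumulator
                acc = point_double(acc);           // left by one 4-bit window
            const uint8_t nibble =
                (half == 0) ? (uint8_t)(k[byte] >> 4) : (uint8_t)(k[byte] & 0x0F);
            acc = point_add(acc, table[nibble]);   // a GPU version would make
        }                                          // this table lookup branchless
    }
    return acc;
}
```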