Hello, I have a Milk-V Mars device and I compiled and benchmarked on it. I don't have any other RISC-V device at the moment. I also ordered the RISC-V version of the ESP32 and plan to port to it too.
I am interested in knowing what hardware you used to test the RISC-V version.
Maybe even the ARM version as well, since Raspberry Pis and Arduinos aren't exactly known for their compute power (to say nothing of their NVIDIA GPU interfacing support).
This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.
I used the Milk-V Mars. My library has zero dependencies and will work anywhere you wish; I tested on the ESP32-S3, ESP32-PICO-D4, STM32, and an Android device with a Rockchip SoC. It works and builds everywhere without any problems, and I'm not using any external libraries. I don't have Apple devices to test on, but with GitHub Actions it built without problems. I don't have every kind of hardware, but I'd be glad to hear benchmark results from the community on different platforms.
If you see any issues, I'll fix them with you. It will work on a Raspberry Pi without problems; I don't have a Raspberry Pi at the moment to build and test on, but with Clang/LLVM you can build it on all platforms.
Went through the repo.
Your code is the first (public) I've ever seen which does block batch inversion (i.e. a single inverse for the entire block of threads, via shared memory). So congrats on this; it took me several weeks to write something almost identical in nature (and two years later I was still discovering __syncthreads bugs, with threads reading un-synced sibling values due to warp scheduling, only on specific cards, and it was a pain to debug!).
So this looks good but seems way too generic. Obviously, specific requirements can benefit from much larger optimizations.
I'm open to any new ideas and optimization suggestions. All the code is open in the repo, so everyone can check every part of it. I'd be glad to hear suggestions and new benchmark results on different hardware.
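For context, here is a minimal, self-contained CUDA sketch of the block batch inversion idea discussed above: one Montgomery-trick inversion per thread block, staged through shared memory. It works with 32-bit values modulo a small prime rather than the 256-bit field, and every name in it (mulmod, invmod, batch_inverse_block, BLOCK) is illustrative rather than taken from the library:

```cuda
// Sketch only: block-wide Montgomery batch inversion via shared memory.
// One modular inversion serves all BLOCK threads of the block.
// Uses 32-bit arithmetic mod a small prime so it is self-contained; a real
// secp256k1 kernel would use the 256-bit field type and its own field ops.
#include <cstdint>

#define P     4294967291u     // largest 32-bit prime, stand-in for the field modulus
#define BLOCK 256

__device__ __forceinline__ uint32_t mulmod(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b % P);
}

__device__ uint32_t invmod(uint32_t a) {          // Fermat: a^(P-2) mod P
    uint32_t r = 1, e = P - 2;
    while (e) {
        if (e & 1) r = mulmod(r, a);
        a = mulmod(a, a);
        e >>= 1;
    }
    return r;
}

// in/out hold BLOCK nonzero elements per block; out[i] = in[i]^-1 mod P.
__global__ void batch_inverse_block(const uint32_t *in, uint32_t *out) {
    __shared__ uint32_t prefix[BLOCK], suffix[BLOCK];
    __shared__ uint32_t inv_total;

    int t = threadIdx.x;
    uint32_t x = in[blockIdx.x * BLOCK + t];
    prefix[t] = x;
    suffix[t] = x;
    __syncthreads();

    // Prefix/suffix products, done serially by two threads for clarity
    // (a log-step parallel scan is the usual production choice).
    if (t == 0)
        for (int i = 1; i < BLOCK; ++i) prefix[i] = mulmod(prefix[i - 1], prefix[i]);
    if (t == 1)
        for (int i = BLOCK - 2; i >= 0; --i) suffix[i] = mulmod(suffix[i + 1], suffix[i]);
    __syncthreads();

    // The single expensive inversion for the whole block.
    if (t == 0) inv_total = invmod(prefix[BLOCK - 1]);
    __syncthreads();   // every thread must see inv_total (the bug class mentioned above)

    // inv(x_t) = (product of all elements except x_t) * inv(product of all elements)
    uint32_t left  = (t == 0)         ? 1u : prefix[t - 1];
    uint32_t right = (t == BLOCK - 1) ? 1u : suffix[t + 1];
    out[blockIdx.x * BLOCK + t] = mulmod(mulmod(left, right), inv_total);
}
```

The prefix/suffix layout lets each thread recover its own inverse with two extra multiplications after the one shared inversion, which is also why the __syncthreads placement around inv_total matters so much.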
This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.
I started Metal support development but don't have a real device to test it on; if you have one, we can collaborate.
v3.3.0
What's New in v3.3.0
Comprehensive Benchmarks (Metal + WASM)
Metal GPU benchmark (bench_metal.mm): 9 operations — Field Mul/Add/Sub/Sqr/Inv, Point Add/Double, Scalar Mul (P×k), Generator Mul (G×k). Matches the CUDA benchmark format with warmup, kernel-only timing, and throughput tables (see the timing sketch after this list).
3 new Metal GPU kernels: field_add_bench, field_sub_bench, field_inv_bench added to secp256k1_kernels.metal
WASM benchmark (bench_wasm.mjs): Node.js benchmark for all WASM-exported operations — Pubkey Create (G×k), Point Mul (P×k), Point Add (P+Q), ECDSA Sign/Verify, Schnorr Sign/Verify, SHA-256 (32B/1KB)
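As a generic illustration of the warmup plus kernel-only timing pattern referenced above, here is a minimal CUDA sketch. It is not the library's bench_metal.mm or CUDA benchmark code; the kernel, sizes, and iteration counts are placeholders:

```cuda
// Warmup + kernel-only timing with CUDA events: warm up first, then time only
// the kernel launches, excluding host overhead and data transfers.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.000001f + 1.0f;
}

int main() {
    const int n = 1 << 20, iters = 100;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Warmup launches: exclude JIT and clock-ramp effects from the measurement.
    for (int i = 0; i < 10; ++i) dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Kernel-only timing window bracketed by events on the launch stream.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double ns_per_op = ms * 1e6 / ((double)iters * n);   // time per element-op
    printf("%.2f ns/op, %.2f M ops/s\n", ns_per_op, 1e3 / ns_per_op);

    cudaEventDestroy(start); cudaEventDestroy(stop); cudaFree(d);
    return 0;
}
```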
WASM Benchmark Results (CI, Node.js v20, linux x64)
| Operation | Time/Op | Throughput |
|---|---|---|
| SHA-256 (32B) | 649 ns | 1.54 M/s |
| SHA-256 (1KB) | 6.33 µs | 158 K/s |
| Point Add (P+Q) | 22 µs | 45 K/s |
| Pubkey Create (G×k) | 70 µs | 14 K/s |
| Schnorr Sign | 92 µs | 11 K/s |
| ECDSA Sign | 146 µs | 7 K/s |
| Point Mul (P×k) | 693 µs | 1.4 K/s |
| ECDSA Verify | 825 µs | 1.2 K/s |
| Schnorr Verify | 874 µs | 1.1 K/s |
CI Hardening
WASM benchmark now runs in CI (Node.js 20 setup + execution in wasm job)
Metal: skip generator_mul test on non-Apple7+ paravirtual devices (CI fix)
Benchmark alert threshold raised from 120% → 150% (reduces false positives on shared CI runners)
Fix WASM runtime crash: removed --closure 1, added -fno-exceptions, increased WASM memory (4MB initial, 512KB stack)
Bug Fixes
Fix Metal shader compilation errors (MSL address space mismatches, jacobian_to_affine ordering)
Fix keccak rotl64 undefined behavior (shift by 0)
Fix macOS build flags for Clang compatibility
Fix metal2.4 shader standard for newer Xcode toolchains
Remove unused .cuh files and sorted_ecc_db
Testing & Quality
Unified test runner (12 test files consolidated)
Selftest modes: smoke (fast), ci (full), stress (extended)
Boundary KAT vectors, field limb boundary tests, batch inverse sweep
Repro bundle support for deterministic test reproduction
Sanitizer CI integration (ASan/UBSan)
Security & Maturity
SECURITY.md v3.2, THREAT_MODEL.md
API stability guarantees documented
Fuzz testing documentation
Security contact: payysoon@gmail.com
Documentation
Batch inverse & mixed addition API reference with examples (full point, X-only, CUDA, division, scratch reuse, Montgomery trick)
README cleanup: removed AI-generated text, translated Georgian → English
Removed database/lookup/bloom references from public docs
Metal Backend Improvements
Apple Metal GPU backend with Comba-accelerated field arithmetic
4-bit windowed scalar multiplication on GPU (see the sketch after this list)
Chunked Montgomery batch inverse
Branchless bloom check with coalesced memory access
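As an illustration of the 4-bit windowed approach listed above, here is a self-contained CUDA sketch. It uses modular exponentiation as a stand-in for EC scalar multiplication (squaring plays the role of point doubling, multiplication the role of point addition) and shows a branchless window-table select; none of this is the library's actual code, and the modulus and names are placeholders:

```cuda
// 4-bit fixed-window exponentiation with a branchless table select.
#include <cstdint>

#define P 4294967291u          // largest 32-bit prime, placeholder modulus

__device__ __forceinline__ uint32_t mulmod(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b % P);
}

__global__ void windowed_pow(const uint32_t *bases, const uint32_t *scalars,
                             uint32_t *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint32_t g = bases[i], k = scalars[i];

    // Precompute the 4-bit window table: T[w] = g^w (T[0] = identity).
    uint32_t T[16];
    T[0] = 1;
    for (int w = 1; w < 16; ++w) T[w] = mulmod(T[w - 1], g);

    uint32_t acc = 1;
    // Consume the scalar 4 bits at a time, most significant window first.
    for (int shift = 28; shift >= 0; shift -= 4) {
        // "Double" four times (here: square four times).
        for (int s = 0; s < 4; ++s) acc = mulmod(acc, acc);

        uint32_t w = (k >> shift) & 0xF;

        // Branchless table select: read every entry, keep only the match,
        // so there is no scalar-dependent branch or index.
        uint32_t sel = 0;
        for (uint32_t j = 0; j < 16; ++j) {
            uint32_t mask = (uint32_t)0 - (uint32_t)(j == w);   // all-ones iff j == w
            sel |= T[j] & mask;
        }
        acc = mulmod(acc, sel);
    }
    out[i] = acc;               // g^k mod P
}
```

On an elliptic curve the same loop shape applies: four doublings per window, then one addition of the selected table entry.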
This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.
Apple Metal (Apple M3 Pro) — Kernel-Only
| Operation | Time/Op | Throughput |
|---|---|---|
| Field Mul | 1.9 ns | 527 M/s |
| Field Add | 1.0 ns | 990 M/s |
| Field Sub | 1.1 ns | 892 M/s |
| Field Sqr | 1.1 ns | 872 M/s |
| Field Inv | 106.4 ns | 9.40 M/s |
| Point Add | 10.1 ns | 98.6 M/s |
| Point Double | 5.1 ns | 196 M/s |
| Scalar Mul (P×k) | 2.94 µs | 0.34 M/s |
| Generator Mul (G×k) | 3.00 µs | 0.33 M/s |

v3.4.0 — Apple Metal GPU Backend
What's New
Apple Metal GPU Compute Backend
First secp256k1 library with Apple Metal compute support (M1/M2/M3/M4).
8x32-bit Comba product scanning (MSL has no uint64_t); see the sketch after this list
4-bit windowed scalar multiplication with branchless table select
Zero-copy unified memory (MTLResourceStorageModeShared)
Comprehensive benchmark suite (CUDA-format output)
Runtime .metallib compilation fallback
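As an illustration of Comba (column/product-scanning) multiplication over 8x32-bit limbs under a no-64-bit-integer constraint like MSL's, here is a sketch written as CUDA so it compiles as-is (__umulhi corresponds to mulhi() in MSL). It computes only the 512-bit product; the secp256k1 modular reduction is omitted, and the function name is illustrative:

```cuda
// Comba product scanning: each 32x32 partial product is handled as a (lo, hi)
// pair and carries are propagated through a three-word column accumulator.
#include <cstdint>

// Full 256x256 -> 512-bit product: r[16] = a[8] * b[8], little-endian limbs.
__device__ void comba_mul_8x32(const uint32_t a[8], const uint32_t b[8], uint32_t r[16]) {
    uint32_t c0 = 0, c1 = 0, c2 = 0;             // 96-bit column accumulator

    for (int k = 0; k < 15; ++k) {               // result column index
        int lo_i = (k < 8) ? 0 : k - 7;
        int hi_i = (k < 8) ? k : 7;
        for (int i = lo_i; i <= hi_i; ++i) {
            int j = k - i;
            uint32_t plo = a[i] * b[j];          // low 32 bits of the partial product
            uint32_t phi = __umulhi(a[i], b[j]); // high 32 bits (mulhi() in MSL)

            c0 += plo;
            uint32_t carry = (c0 < plo);         // carry out of c0
            c1 += phi;
            c2 += (c1 < phi);                    // carry out of c1 from phi
            c1 += carry;
            c2 += (c1 < carry);                  // carry out of c1 from the c0 carry
        }
        r[k] = c0;                               // emit this column, shift accumulator
        c0 = c1; c1 = c2; c2 = 0;
    }
    r[15] = c0;                                  // top column is the remaining carry
}
```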
Benchmarks (Apple M3 Pro, 18 GPU cores)
| Operation | Time/Op | Throughput |
|---|---|---|
| Field Mul | 1.8 ns | 560 M/s |
| Field Add | 1.2 ns | 801 M/s |
| Field Sqr | 1.0 ns | 985 M/s |
| Field Inv | 106.4 ns | 9.4 M/s |
| Point Double | 5.0 ns | 198 M/s |
| Point Add | 11.1 ns | 90 M/s |
| Scalar Mul (P×k) | 2.95 µs | 340 K/s |
| Generator Mul (G×k) | 2.85 µs | 350 K/s |
Fixes
Apple9+ (M3) GPU family detection fix (enum vs macro guard)
All Supported Platforms
CPU: x86-64, ARM64, RISC-V, ESP32-S3, ESP32, STM32F103
GPU: CUDA, OpenCL, Metal (NEW)
Mobile: Android ARM64
Web: WebAssembly