Bitcoin Forum
February 15, 2026, 10:38:28 PM *
Topic: UltrafastSecp256k1: Zero-dependency high-performance secp256k1 (CPU, CUDA, RISC-V, CL)
shrec (OP)
Newbie
*
Offline Offline

Activity: 9
Merit: 0


View Profile
February 11, 2026, 05:27:59 PM
Last edit: February 14, 2026, 06:08:24 PM by shrec
 #1

Hi everyone,

I’ve been working on a performance-oriented implementation of secp256k1 written from scratch in C++ and CUDA, with optional x86-64 (BMI2/ADX) and RISC-V optimizations.

This project is focused on architectural efficiency, benchmarking, and hardware-aware ECC implementation. It is not intended for breaking cryptography or private key recovery.

### Goals

* Implement secp256k1 without external big-integer libraries
* Maintain deterministic memory layout
* Avoid dynamic allocation in hot paths
* Explore hardware-level performance limits

### Features

* Complete field arithmetic (mod p)
* Scalar arithmetic (mod n)
* Affine and Jacobian coordinates
* GLV optimization
* CPU optimizations (BMI2/ADX)
* RISC-V RV64GC support
* CUDA batch kernels
* Benchmark suite included

### Measured Performance

On RTX 5060:
~2.5 billion Jacobian mixed-add operations per second (measured)

CPU benchmarks also show 3–5× improvement over naive implementations when using BMI2/ADX paths.

### Design Approach

The implementation treats elliptic curve math as a hardware interaction problem:

* Little-endian limb layout for computational efficiency
* Explicit carry handling
* Batch inversion via Montgomery’s trick
* Minimal abstraction in hot execution paths

The idea is to reduce unnecessary movement and keep arithmetic predictable at the instruction level.
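
To make the limb layout and carry handling above concrete, here is a minimal sketch of a 256-bit addition over four little-endian 64-bit limbs. The `U256` alias and the `add256` function are hypothetical names chosen for illustration and are not taken from the repository; the actual code may use intrinsics or assembly (for example ADX carry chains) instead of plain C++.

```cpp
#include <array>
#include <cstdint>

// Hypothetical type for illustration: four 64-bit limbs, least-significant limb first.
using U256 = std::array<std::uint64_t, 4>;

// Computes out = a + b (mod 2^256) limb by limb and returns the carry-out (0 or 1).
// The carry is tracked explicitly instead of relying on a wider integer type.
inline std::uint64_t add256(const U256& a, const U256& b, U256& out) {
    std::uint64_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        const std::uint64_t s = a[i] + carry;   // wraps only if a[i] is all ones and carry is 1
        const std::uint64_t c1 = (s < carry);   // carry produced by adding the incoming carry
        const std::uint64_t t = s + b[i];
        const std::uint64_t c2 = (t < b[i]);    // carry produced by adding b[i]
        out[i] = t;
        carry = c1 | c2;                        // at most one of c1, c2 can be set
    }
    return carry;
}
```

With this layout, adding 1 to a value whose two lowest limbs are all ones clears both limbs and increments the third, which is the carry propagation the loop makes explicit.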

### Scope Disclaimer

This project does NOT claim:

* Any weakness in secp256k1
* Practical discrete log attacks
* Private key recovery

It is purely for performance research, benchmarking, and educational exploration of ECC implementations.

### Repository

[https://github.com/shrec/UltrafastSecp256k1](https://github.com/shrec/UltrafastSecp256k1)

I would appreciate feedback from anyone working on:

* ECC performance
* GLV implementation details
* GPU optimization strategies
* RISC-V vectorization approaches

Thanks.

If there is interest, I can post detailed benchmark comparisons and profiling results.
shrec (OP)
February 11, 2026, 07:16:29 PM
 #2

Thank you for the valuable observation — I fully agree with your point regarding Fermat-based inversion and side-channel considerations.

At the current stage, the library is primarily focused on performance research and architectural exploration. Some inversion paths are optimized for raw throughput rather than strict constant-time guarantees, particularly in closed or controlled environments where side-channel exposure is not a concern.

However, side-channel resistance and deterministic execution are already planned as a next development stage.

The long-term design goal is to maintain two distinct operational modes:

1. A maximum-performance mode 
   Intended for closed systems or benchmarking scenarios where side-channel leakage is not a threat model and absolute throughput is prioritized.

2. A hardened mode 
   Intended for exposed or wallet-level architectures, where additional protections will be introduced, including:
   - Fermat-based inversion (p-2 exponentiation)
   - Montgomery ladder for scalar multiplication
   - Strict constant-time arithmetic paths
   - Reduced branch-dependent behavior

The idea is to allow developers integrating the library to explicitly choose the security/performance profile that matches their deployment environment.

I believe separating performance research from hardened cryptographic deployment is important to avoid mixing tradeoffs implicitly.

Your comment reinforces the direction I intend to take in the hardened path — much appreciated.
shrec (OP)
February 14, 2026, 04:27:41 PM
 #3

Added OpenCL and ESP32 support, and optimized a few more parts of the library. :)
shrec (OP)
Today at 11:08:39 AM
 #4

# 🔥 UltrafastSecp256k1 — Feature Overview

UltrafastSecp256k1 is a zero-dependency, multi-platform secp256k1 stack covering the entire modern Bitcoin/EVM ecosystem — from field arithmetic to advanced multi-party protocols.

## 🧠 Core Engine

* Field / Scalar / Point arithmetic
* GLV endomorphism acceleration
* Precomputation tables
* Deterministic RFC6979 nonces
* Strict low-S normalization
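
For readers unfamiliar with low-S normalization: an ECDSA signature (r, s) and (r, n - s) are both valid, so Bitcoin requires the canonical form with s ≤ n/2. The sketch below shows that rule on four little-endian 64-bit limbs using the secp256k1 group order n. The names `U256`, `greater`, `sub` and `normalize_low_s` are hypothetical, and the comparison here is variable-time for readability, which may differ from how the library does it.

```cpp
#include <array>
#include <cstdint>

// Hypothetical type for illustration: little-endian 64-bit limbs.
using U256 = std::array<std::uint64_t, 4>;

// secp256k1 group order n and floor(n / 2), least-significant limb first.
constexpr U256 N = {0xBFD25E8CD0364141ULL, 0xBAAEDCE6AF48A03BULL,
                    0xFFFFFFFFFFFFFFFEULL, 0xFFFFFFFFFFFFFFFFULL};
constexpr U256 HALF_N = {0xDFE92F46681B20A0ULL, 0x5D576E7357A4501DULL,
                         0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL};

// Returns true if a > b, comparing from the most-significant limb down.
// Variable-time: fine for a public signature value, not for secrets.
inline bool greater(const U256& a, const U256& b) {
    for (int i = 3; i >= 0; --i) {
        if (a[i] != b[i]) return a[i] > b[i];
    }
    return false;
}

// Returns a - b, assuming a >= b, with an explicit borrow chain.
inline U256 sub(const U256& a, const U256& b) {
    U256 r{};
    std::uint64_t borrow = 0;
    for (int i = 0; i < 4; ++i) {
        const std::uint64_t d = a[i] - b[i];
        const std::uint64_t b1 = (a[i] < b[i]);
        const std::uint64_t e = d - borrow;
        const std::uint64_t b2 = (d < borrow);
        r[i] = e;
        borrow = b1 | b2;
    }
    return r;
}

// Maps s (assumed to satisfy 0 < s < n) to the canonical low-S form.
inline U256 normalize_low_s(const U256& s) {
    return greater(s, HALF_N) ? sub(N, s) : s;
}
```

Since n is odd, n = 2·⌊n/2⌋ + 1, so every s above the halfway point maps to a value at or below it, and values already in range are returned unchanged.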

## ⚙ Assembly & SIMD

* x64 (MASM/GAS, BMI2/ADX)
* ARM64 inline assembly
* RISC-V + RVV
* Montgomery batch inversion
* AVX2 / AVX-512 batch ops
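
Montgomery batch inversion (Montgomery's trick) replaces n field inversions with a single inversion plus about 3n multiplications: build running prefix products, invert the total once, then walk backwards peeling off one inverse at a time. The sketch below shows the idea over a toy 31-bit prime field; `P`, `mul_mod`, `pow_mod`, `inv_single` and `batch_invert` are hypothetical names, and the library's own version operates on 256-bit field elements (and, on GPU, across a thread block).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy modulus for illustration only: 2^31 - 1 (NOT the secp256k1 prime).
constexpr std::uint64_t P = 2147483647ULL;

inline std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) {
    return (a * b) % P;  // inputs assumed reduced below P
}

inline std::uint64_t pow_mod(std::uint64_t base, std::uint64_t e) {
    std::uint64_t r = 1;
    base %= P;
    while (e) {
        if (e & 1) r = mul_mod(r, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return r;
}

// Single inversion via Fermat's little theorem; this is the expensive step.
inline std::uint64_t inv_single(std::uint64_t a) {
    return pow_mod(a, P - 2);
}

// Inverts every element of v in place using exactly one call to inv_single.
// Precondition (assumed): every element is nonzero and reduced below P.
inline void batch_invert(std::vector<std::uint64_t>& v) {
    const std::size_t n = v.size();
    if (n == 0) return;
    std::vector<std::uint64_t> prefix(n);
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        prefix[i] = acc;            // product of v[0..i-1]
        acc = mul_mod(acc, v[i]);   // product of v[0..i]
    }
    std::uint64_t inv = inv_single(acc);  // inverse of the full product
    for (std::size_t i = n; i-- > 0;) {
        const std::uint64_t vi = v[i];
        v[i] = mul_mod(inv, prefix[i]);   // (v[0..i])^-1 * v[0..i-1] = v[i]^-1
        inv = mul_mod(inv, vi);           // now the inverse of v[0..i-1]
    }
}
```

The trade-off is that a single zero in the batch poisons every result, so callers typically filter or special-case zero inputs before batching.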

## 🔐 Constant-Time Layer

* Constant-time field/scalar/point operations
* Separate `secp256k1::ct` namespace
* Montgomery ladder scalar multiplication
* Side-channel resistant design
* No runtime flag switching (explicit API separation)
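
One building block behind "branchless" constant-time code is the masked conditional swap, which a Montgomery ladder uses at every step instead of an `if` on a secret bit. The sketch below is a generic illustration placed in a hypothetical `demo_ct` namespace; it is not the contents of `secp256k1::ct`, and the names `Limbs`, `cswap` and `select` are assumptions.

```cpp
#include <array>
#include <cstdint>

namespace demo_ct {  // hypothetical namespace, for illustration only

using Limbs = std::array<std::uint64_t, 4>;

// Swaps a and b when flag == 1 and leaves them untouched when flag == 0.
// flag must be exactly 0 or 1. No branch depends on flag: the same
// instructions run in both cases, and only the mask value differs.
inline void cswap(Limbs& a, Limbs& b, std::uint64_t flag) {
    const std::uint64_t mask = std::uint64_t{0} - flag;  // all zeros or all ones
    for (int i = 0; i < 4; ++i) {
        const std::uint64_t t = (a[i] ^ b[i]) & mask;
        a[i] ^= t;
        b[i] ^= t;
    }
}

// Returns b when flag == 1 and a when flag == 0, again without branching on flag.
inline Limbs select(const Limbs& a, const Limbs& b, std::uint64_t flag) {
    const std::uint64_t mask = std::uint64_t{0} - flag;
    Limbs r{};
    for (int i = 0; i < 4; ++i) {
        r[i] = a[i] ^ ((a[i] ^ b[i]) & mask);
    }
    return r;
}

}  // namespace demo_ct
```

Keeping such helpers in their own namespace mirrors the explicit API separation described above: the caller picks the constant-time path at compile time by naming it. Note that source-level masking is not a guarantee, since a compiler may still reintroduce branches, so hardened code usually also checks the generated assembly.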

## ✍ Digital Signatures

### ECDSA

* Sign / Verify
* DER / Compact
* Pubkey recovery (recid)
* Batch verification

### Schnorr (BIP-340)

* Sign / Verify
* Batch verification
* X-only public keys

## 🤝 Multi-Party & Advanced Protocols

* MuSig2 (2-round aggregation)
* FROST (t-of-n threshold signatures)
* Adaptor signatures
* Pedersen commitments
* Multi-scalar multiplication (Strauss/Shamir)
* ECDH (raw, x-only, SHA-256)
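
The Strauss/Shamir technique computes a·G + b·H with one shared chain of doublings instead of two separate scalar multiplications, adding a precomputed G, H or G+H depending on the current bit pair. The sketch below shows the same structure in a stand-in group, the multiplicative group modulo a toy prime, where squaring plays the role of point doubling and multiplication the role of point addition. All names (`P`, `mul_mod`, `pow_mod`, `double_exp`) are hypothetical and this is not the library's elliptic-curve code.

```cpp
#include <cstdint>

// Toy modulus for illustration only: 2^31 - 1 (NOT the secp256k1 prime).
constexpr std::uint64_t P = 2147483647ULL;

inline std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) {
    return (a * b) % P;  // inputs assumed reduced below P
}

// Plain square-and-multiply, used here only as a reference for comparison.
inline std::uint64_t pow_mod(std::uint64_t base, std::uint64_t e) {
    std::uint64_t r = 1;
    base %= P;
    while (e) {
        if (e & 1) r = mul_mod(r, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return r;
}

// Computes g^a * h^b mod P with a single shared squaring chain (Shamir's trick).
// table[idx] holds the product selected by the bit pair (bit of a, bit of b).
inline std::uint64_t double_exp(std::uint64_t g, std::uint64_t a,
                                std::uint64_t h, std::uint64_t b) {
    const std::uint64_t gr = g % P;
    const std::uint64_t hr = h % P;
    const std::uint64_t table[4] = {1, gr, hr, mul_mod(gr, hr)};
    std::uint64_t acc = 1;
    for (int i = 63; i >= 0; --i) {
        acc = mul_mod(acc, acc);  // one "doubling" shared by both exponents
        const unsigned idx = static_cast<unsigned>((a >> i) & 1) |
                             (static_cast<unsigned>((b >> i) & 1) << 1);
        if (idx != 0) acc = mul_mod(acc, table[idx]);
    }
    return acc;
}
```

For 64-bit exponents this costs 64 squarings and at most 64 multiplications, versus roughly twice the squarings when the two exponentiations are done separately. The branch on `idx` makes it variable-time, which is acceptable for verification, where the scalars are public.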

## 🟢 Bitcoin Stack

* Taproot (BIP-341/342)
* BIP-32 HD derivation (xprv/xpub)
* BIP-44 coin-type derivation
* BIP-352 Silent Payments
* Address generation:

  * P2PKH
  * P2WPKH
  * P2TR
  * Base58Check
  * Bech32 / Bech32m

## 🌍 Multi-Chain Support (27+ Coins)

* BTC, LTC, DOGE, BCH, BSV
* ETH (EIP-55)
* DASH, ZEC, RVN, QTUM, etc.
* Auto-dispatch address generation
* Coin-aware HD derivation
* Built-in Keccak-256

## 🔬 Research & Customization

* Custom generator support
* Fully custom curve context
* Zero-overhead default path
* Deterministic vector self-tests
* Cross-backend layout validation

## 🚀 GPU Backends

* CUDA kernels
* OpenCL backend
* ROCm/HIP portability
* Occupancy tuning
* PGO support
* Inline PTX
* Branchless field ops

## 📦 Platforms

* x86-64
* ARM64
* RISC-V
* ESP32
* STM32
* WebAssembly (Emscripten)
* iOS (SPM, CocoaPods, XCFramework)
* Android (NDK)
* CUDA / OpenCL / ROCm

## 🧪 Testing & Quality

* 200+ tests
* Fuzz harnesses
* Cross-platform CI
* Known vector verification
* Deterministic reproducibility
kTimesG
Full Member
***
Offline Offline

Activity: 742
Merit: 227


View Profile
Today at 01:20:21 PM
 #5

Went through the repo.

Your code is the first (public) I've ever seen which does block batch inversion (e.g. a single inverse for the entire block of threads, via shared memory). So, congrats on this, it took me several weeks to write something which is almost identical in nature (and later discovering syncthreads bugs after two years, reading un-synced sibling values from bad warp scheduling, only on specific cards, it was a pain to debug!).

So this looks good but seems way too generic. Obviously, specific requirements can benefit from much larger optimizations.

NotATether
Legendary
*
Offline Offline

Activity: 2240
Merit: 9485




View Profile WWW
Today at 03:12:43 PM
 #6

I am interested in knowing what hardware you used to test the RISC-V version.

Maybe even the ARM version as well, since Raspberry-Pis and Arduinos aren't exactly known for their compute power (to say nothing about their NVIDIA gpu interfacing support).

This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.

 
shrec (OP)
Today at 04:12:27 PM
 #7

Hello. I have a Milk-V Mars, and I compiled and benchmarked the library on it. I don't have any other RISC-V device at the moment. I have also ordered a RISC-V version of the ESP32 and plan to port the library to it too.
shrec (OP)
Today at 04:18:46 PM
 #8

Quote from: NotATether

> I am interested in knowing what hardware you used to test the RISC-V version.
>
> Maybe even the ARM version as well, since Raspberry-Pis and Arduinos aren't exactly known for their compute power (to say nothing about their NVIDIA gpu interfacing support).
>
> This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.

I used a Milk-V Mars. The library has zero dependencies, so it will work wherever you need it. I have tested it on the ESP32-S3, the ESP32-PICO-D4, STM32, and an Android device with a Rockchip SoC; it builds and works everywhere without any problems because I'm not using any external libraries. I don't have any Apple devices to test on, but it built without problems under GitHub Actions. I don't have every kind of hardware, so I would be glad to hear benchmark results from the community on different platforms, and to learn of any issues so that we can fix them together.

It should work on a Raspberry Pi without problems. I don't have one at the moment to build and test on, but with Clang/LLVM you can build it on all platforms.
shrec (OP)
Today at 04:26:55 PM
 #9

Quote from: kTimesG

> Went through the repo.
>
> Your code is the first (public) I've ever seen which does block batch inversion (e.g. a single inverse for the entire block of threads, via shared memory). So, congrats on this, it took me several weeks to write something which is almost identical in nature (and later discovering syncthreads bugs after two years, reading un-synced sibling values from bad warp scheduling, only on specific cards, it was a pain to debug!).
>
> So this looks good but seems way too generic. Obviously, specific requirements can benefit from much larger optimizations.

I'm open to any new ideas and optimization suggestions. All the code is open in the repo, so everyone can check every part of it. I would be glad to hear suggestions and new benchmark results on different hardware.
shrec (OP)
Today at 05:13:23 PM
 #10

Quote from: NotATether

> I am interested in knowing what hardware you used to test the RISC-V version.
>
> Maybe even the ARM version as well, since Raspberry-Pis and Arduinos aren't exactly known for their compute power (to say nothing about their NVIDIA gpu interfacing support).
>
> This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.

I have started developing Metal support, but I don't have a real device to test it on. If you have one, we can collaborate.