Hi everyone,
I’ve been working on a performance-oriented implementation of secp256k1 written from scratch in C++ and CUDA, with optional x86-64 (BMI2/ADX) and RISC-V optimizations.
This project is focused on architectural efficiency, benchmarking, and hardware-aware ECC implementation. It is not intended for breaking cryptography or private key recovery.
### Goals
* Implement secp256k1 without external big-integer libraries
* Maintain deterministic memory layout
* Avoid dynamic allocation in hot paths
* Explore hardware-level performance limits
### Features
* Complete field arithmetic (mod p)
* Scalar arithmetic (mod n)
* Affine and Jacobian coordinates
* GLV optimization
* CPU optimizations (BMI2/ADX)
* RISC-V RV64GC support
* CUDA batch kernels
* Benchmark suite included
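
To make the "field arithmetic (mod p)" item concrete, here is a minimal sketch of modular addition on four 64-bit limbs stored least-significant first, with no big-integer library and no allocation. The names `Fe`, `adc`, `sbb`, and `fe_add` are hypothetical and are not taken from the repository; the sketch assumes both inputs are already reduced below p. It is portable C++, so an ADX or CUDA path would replace the carry helpers with hardware carry instructions.

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

// Hypothetical field element: four 64-bit limbs, least-significant limb first.
struct Fe {
    std::array<std::uint64_t, 4> v;
};
static_assert(sizeof(Fe) == 32, "fixed 32-byte layout");
static_assert(std::is_trivially_copyable<Fe>::value, "plain data, no hidden allocation");

// secp256k1 field prime p = 2^256 - 2^32 - 977, in little-endian limb order.
constexpr Fe P = {{{0xFFFFFFFEFFFFFC2FULL, 0xFFFFFFFFFFFFFFFFULL,
                    0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL}}};

// Add with carry: returns a + b + carry_in (mod 2^64) and sets carry_out to 0 or 1.
inline std::uint64_t adc(std::uint64_t a, std::uint64_t b,
                         std::uint64_t carry_in, std::uint64_t& carry_out) {
    std::uint64_t s = a + b;
    std::uint64_t c1 = s < a;          // overflow of a + b
    std::uint64_t r = s + carry_in;
    std::uint64_t c2 = r < s;          // overflow of adding the incoming carry
    carry_out = c1 | c2;
    return r;
}

// Subtract with borrow: returns a - b - borrow_in (mod 2^64) and sets borrow_out.
inline std::uint64_t sbb(std::uint64_t a, std::uint64_t b,
                         std::uint64_t borrow_in, std::uint64_t& borrow_out) {
    std::uint64_t d = a - b;
    std::uint64_t b1 = a < b;          // underflow of a - b
    std::uint64_t r = d - borrow_in;
    std::uint64_t b2 = d < borrow_in;  // underflow of removing the incoming borrow
    borrow_out = b1 | b2;
    return r;
}

// (a + b) mod p, assuming a < p and b < p.
inline Fe fe_add(const Fe& a, const Fe& b) {
    Fe sum{}, diff{};
    std::uint64_t carry = 0, borrow = 0;
    for (int i = 0; i < 4; ++i) sum.v[i] = adc(a.v[i], b.v[i], carry, carry);
    for (int i = 0; i < 4; ++i) diff.v[i] = sbb(sum.v[i], P.v[i], borrow, borrow);

    // Take sum - p when the addition overflowed 2^256 or when sum >= p.
    std::uint64_t use_diff = carry | (borrow ^ 1ULL);
    std::uint64_t mask = 0ULL - use_diff;  // all ones or all zeros
    Fe r{};
    for (int i = 0; i < 4; ++i) r.v[i] = (diff.v[i] & mask) | (sum.v[i] & ~mask);
    return r;
}
```

The final selection uses a mask instead of a branch so that control flow does not depend on the operands; whether the compiled code is actually constant-time still depends on the compiler and target.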
### Measured Performance
On an RTX 5060: ~2.5 billion Jacobian mixed-add operations per second (measured).
On CPU, the BMI2/ADX paths benchmark 3–5× faster than naive implementations.
### Design Approach
The implementation treats elliptic curve math as a hardware interaction problem:
* Little-endian limb layout for computational efficiency
* Explicit carry handling
* Batch inversion via Montgomery’s trick
* Minimal abstraction in hot execution paths
The idea is to reduce unnecessary data movement and keep arithmetic predictable at the instruction level.
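
For readers unfamiliar with Montgomery's trick from the list above, the following is a minimal sketch of batch inversion. To stay short and self-contained it uses a toy 30-bit prime (`kP`) in place of the secp256k1 field, with inversion by Fermat's little theorem; the function names and the caller-supplied scratch buffer are illustrative assumptions, not the repository's API. The structure carries over to 256-bit field elements: one inversion plus roughly 3(n - 1) multiplications replaces n separate inversions.

```cpp
#include <cstddef>
#include <cstdint>

// Toy prime modulus standing in for the secp256k1 field prime (assumption for brevity).
constexpr std::uint64_t kP = 1000000007ULL;

// Operands are below kP < 2^30, so the product fits in 64 bits.
inline std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) {
    return (a * b) % kP;
}

// Square-and-multiply exponentiation modulo kP.
inline std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp) {
    std::uint64_t result = 1;
    base %= kP;
    while (exp > 0) {
        if (exp & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        exp >>= 1;
    }
    return result;
}

// Single inversion via Fermat's little theorem: a^(p-2) mod p, for a != 0.
inline std::uint64_t inv_mod(std::uint64_t a) {
    return pow_mod(a, kP - 2);
}

// Replaces xs[0..n) with their modular inverses, in place.
// Preconditions: every xs[i] is non-zero mod kP; scratch has room for n values.
// The caller owns both buffers, so nothing is allocated here.
inline void batch_invert(std::uint64_t* xs, std::uint64_t* scratch, std::size_t n) {
    if (n == 0) return;

    // Forward pass: scratch[i] = xs[0] * xs[1] * ... * xs[i].
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        acc = mul_mod(acc, xs[i]);
        scratch[i] = acc;
    }

    // The only inversion: inverse of the product of all inputs.
    std::uint64_t inv = inv_mod(acc);

    // Backward pass: peel off one factor at a time.
    for (std::size_t i = n; i-- > 1;) {
        std::uint64_t xi = xs[i];
        xs[i] = mul_mod(inv, scratch[i - 1]);  // inverse of xs[i]
        inv = mul_mod(inv, xi);                // inverse of xs[0] * ... * xs[i-1]
    }
    xs[0] = inv;
}
```

A typical use is converting a batch of Jacobian points to affine form, where every point needs the inverse of its Z coordinate. A single zero input would zero the whole product, so real code has to filter or handle such elements separately.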
### Scope Disclaimer
This project does NOT claim:
* Any weakness in secp256k1
* Practical discrete log attacks
* Private key recovery
It is purely for performance research, benchmarking, and educational exploration of ECC implementations.
### Repository
[https://github.com/shrec/UltrafastSecp256k1](https://github.com/shrec/UltrafastSecp256k1)
I would appreciate feedback from anyone working on:
* ECC performance
* GLV implementation details
* GPU optimization strategies
* RISC-V vectorization approaches
If there is interest, I can post detailed benchmark comparisons and profiling results.
Thanks.