Hello, I have a Milk-V Mars device and I compiled and benchmarked on it. I don't have any other RISC-V device at the moment. I also ordered the RISC-V version of the ESP32 and plan to port to it too.
I am interested in knowing what hardware you used to test the RISC-V version.
Maybe even the ARM version as well, since Raspberry Pis and Arduinos aren't exactly known for their compute power (to say nothing of their NVIDIA GPU interfacing support).
This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.
I used the Milk-V Mars. My library has zero dependencies and will work anywhere you wish; I tested on the ESP32-S3, ESP32-PICO-D4, STM32, and an Android device with a Rockchip SoC. It works and builds everywhere without any problems, and I'm not using any external libraries. I don't have Apple devices to test on, but with GitHub Actions it built without problems. I don't have every kind of hardware, but I'd be glad to hear benchmark results from the community on different platforms.
If you see any issues, I'll fix them with you. It will work on a Raspberry Pi without problems; I don't have a Raspberry Pi at the moment to build and test on, but with Clang/LLVM you can build it on all platforms.
Went through the repo.
Your code is the first (public) I've ever seen which does block batch inversion (i.e. a single inverse for the entire block of threads, via shared memory). So congrats on this; it took me several weeks to write something almost identical in nature (and two years later I was still discovering __syncthreads bugs, with threads reading un-synced sibling values due to warp scheduling, only on specific cards, and it was a pain to debug!).
So this looks good but seems way too generic. Obviously, specific requirements can benefit from much larger optimizations.
I'm open to any new ideas and optimization suggestions. All the code is open in the repo, so everyone can check every part of it. I'd be glad to hear suggestions and new benchmark results on different hardware.
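For context, here is a minimal, self-contained CUDA sketch of the block batch inversion idea discussed above: one Montgomery-trick inversion per thread block, staged through shared memory. It works with 32-bit values modulo a small prime rather than the 256-bit field, and every name in it (mulmod, invmod, batch_inverse_block, BLOCK) is illustrative rather than taken from the library:

```cuda
// Sketch only: block-wide Montgomery batch inversion via shared memory.
// One modular inversion serves all BLOCK threads of the block.
// Uses 32-bit arithmetic mod a small prime so it is self-contained; a real
// secp256k1 kernel would use the 256-bit field type and its own field ops.
#include <cstdint>

#define P     4294967291u     // largest 32-bit prime, stand-in for the field modulus
#define BLOCK 256

__device__ __forceinline__ uint32_t mulmod(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b % P);
}

__device__ uint32_t invmod(uint32_t a) {          // Fermat: a^(P-2) mod P
    uint32_t r = 1, e = P - 2;
    while (e) {
        if (e & 1) r = mulmod(r, a);
        a = mulmod(a, a);
        e >>= 1;
    }
    return r;
}

// in/out hold BLOCK nonzero elements per block; out[i] = in[i]^-1 mod P.
__global__ void batch_inverse_block(const uint32_t *in, uint32_t *out) {
    __shared__ uint32_t prefix[BLOCK], suffix[BLOCK];
    __shared__ uint32_t inv_total;

    int t = threadIdx.x;
    uint32_t x = in[blockIdx.x * BLOCK + t];
    prefix[t] = x;
    suffix[t] = x;
    __syncthreads();

    // Prefix/suffix products, done serially by two threads for clarity
    // (a log-step parallel scan is the usual production choice).
    if (t == 0)
        for (int i = 1; i < BLOCK; ++i) prefix[i] = mulmod(prefix[i - 1], prefix[i]);
    if (t == 1)
        for (int i = BLOCK - 2; i >= 0; --i) suffix[i] = mulmod(suffix[i + 1], suffix[i]);
    __syncthreads();

    // The single expensive inversion for the whole block.
    if (t == 0) inv_total = invmod(prefix[BLOCK - 1]);
    __syncthreads();   // every thread must see inv_total (the bug class mentioned above)

    // inv(x_t) = (product of all elements except x_t) * inv(product of all elements)
    uint32_t left  = (t == 0)         ? 1u : prefix[t - 1];
    uint32_t right = (t == BLOCK - 1) ? 1u : suffix[t + 1];
    out[blockIdx.x * BLOCK + t] = mulmod(mulmod(left, right), inv_total);
}
```

The prefix/suffix layout lets each thread recover its own inverse with two extra multiplications after the one shared inversion, which is also why the __syncthreads placement around inv_total matters so much.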
This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.
I started Metal support development but don't have a real device to test it on; if you have one, we can collaborate.
v3.3.0
What's New in v3.3.0
Comprehensive Benchmarks (Metal + WASM)
Metal GPU benchmark (bench_metal.mm): 9 operations — Field Mul/Add/Sub/Sqr/Inv, Point Add/Double, Scalar Mul (P×k), Generator Mul (G×k). Matches the CUDA benchmark format with warmup, kernel-only timing, and throughput tables (see the timing sketch after this list).
3 new Metal GPU kernels: field_add_bench, field_sub_bench, field_inv_bench added to secp256k1_kernels.metal
WASM benchmark (bench_wasm.mjs): Node.js benchmark for all WASM-exported operations — Pubkey Create (G×k), Point Mul (P×k), Point Add (P+Q), ECDSA Sign/Verify, Schnorr Sign/Verify, SHA-256 (32B/1KB)
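As a generic illustration of the warmup plus kernel-only timing pattern referenced above, here is a minimal CUDA sketch. It is not the library's bench_metal.mm or CUDA benchmark code; the kernel, sizes, and iteration counts are placeholders:

```cuda
// Warmup + kernel-only timing with CUDA events: warm up first, then time only
// the kernel launches, excluding host overhead and data transfers.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.000001f + 1.0f;
}

int main() {
    const int n = 1 << 20, iters = 100;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Warmup launches: exclude JIT and clock-ramp effects from the measurement.
    for (int i = 0; i < 10; ++i) dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Kernel-only timing window bracketed by events on the launch stream.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double ns_per_op = ms * 1e6 / ((double)iters * n);   // time per element-op
    printf("%.2f ns/op, %.2f M ops/s\n", ns_per_op, 1e3 / ns_per_op);

    cudaEventDestroy(start); cudaEventDestroy(stop); cudaFree(d);
    return 0;
}
```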
WASM Benchmark Results (CI, Node.js v20, linux x64)
| Operation | Time/Op | Throughput |
|---|---|---|
| SHA-256 (32B) | 649 ns | 1.54 M/s |
| SHA-256 (1KB) | 6.33 µs | 158 K/s |
| Point Add (P+Q) | 22 µs | 45 K/s |
| Pubkey Create (G×k) | 70 µs | 14 K/s |
| Schnorr Sign | 92 µs | 11 K/s |
| ECDSA Sign | 146 µs | 7 K/s |
| Point Mul (P×k) | 693 µs | 1.4 K/s |
| ECDSA Verify | 825 µs | 1.2 K/s |
| Schnorr Verify | 874 µs | 1.1 K/s |
CI Hardening
WASM benchmark now runs in CI (Node.js 20 setup + execution in wasm job)
Metal: skip generator_mul test on non-Apple7+ paravirtual devices (CI fix)
Benchmark alert threshold raised from 120% → 150% (reduces false positives on shared CI runners)
Fix WASM runtime crash: removed --closure 1, added -fno-exceptions, increased WASM memory (4MB initial, 512KB stack)
Bug Fixes
Fix Metal shader compilation errors (MSL address space mismatches, jacobian_to_affine ordering)
Fix keccak rotl64 undefined behavior (shift by 0)
Fix macOS build flags for Clang compatibility
Fix metal2.4 shader standard for newer Xcode toolchains
Remove unused .cuh files and sorted_ecc_db
Testing & Quality
Unified test runner (12 test files consolidated)
Selftest modes: smoke (fast), ci (full), stress (extended)
Boundary KAT vectors, field limb boundary tests, batch inverse sweep
Repro bundle support for deterministic test reproduction
Sanitizer CI integration (ASan/UBSan)
Security & Maturity
SECURITY.md v3.2, THREAT_MODEL.md
API stability guarantees documented
Fuzz testing documentation
Security contact: payysoon@gmail.com
Documentation
Batch inverse & mixed addition API reference with examples (full point, X-only, CUDA, division, scratch reuse, Montgomery trick)
README cleanup: removed AI-generated text, translated Georgian → English
Removed database/lookup/bloom references from public docs
Metal Backend Improvements
Apple Metal GPU backend with Comba-accelerated field arithmetic
4-bit windowed scalar multiplication on GPU (see the sketch after this list)
Chunked Montgomery batch inverse
Branchless bloom check with coalesced memory access
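As an illustration of the 4-bit windowed approach listed above, here is a self-contained CUDA sketch. It uses modular exponentiation as a stand-in for EC scalar multiplication (squaring plays the role of point doubling, multiplication the role of point addition) and shows a branchless window-table select; none of this is the library's actual code, and the modulus and names are placeholders:

```cuda
// 4-bit fixed-window exponentiation with a branchless table select.
#include <cstdint>

#define P 4294967291u          // largest 32-bit prime, placeholder modulus

__device__ __forceinline__ uint32_t mulmod(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * b % P);
}

__global__ void windowed_pow(const uint32_t *bases, const uint32_t *scalars,
                             uint32_t *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint32_t g = bases[i], k = scalars[i];

    // Precompute the 4-bit window table: T[w] = g^w (T[0] = identity).
    uint32_t T[16];
    T[0] = 1;
    for (int w = 1; w < 16; ++w) T[w] = mulmod(T[w - 1], g);

    uint32_t acc = 1;
    // Consume the scalar 4 bits at a time, most significant window first.
    for (int shift = 28; shift >= 0; shift -= 4) {
        // "Double" four times (here: square four times).
        for (int s = 0; s < 4; ++s) acc = mulmod(acc, acc);

        uint32_t w = (k >> shift) & 0xF;

        // Branchless table select: read every entry, keep only the match,
        // so there is no scalar-dependent branch or index.
        uint32_t sel = 0;
        for (uint32_t j = 0; j < 16; ++j) {
            uint32_t mask = (uint32_t)0 - (uint32_t)(j == w);   // all-ones iff j == w
            sel |= T[j] & mask;
        }
        acc = mulmod(acc, sel);
    }
    out[i] = acc;               // g^k mod P
}
```

On an elliptic curve the same loop shape applies: four doublings per window, then one addition of the selected table entry.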
This can run on Metal, correct? On Apple silicon. It would be the first secp256k1 library to do that if true.
Apple Metal (Apple M3 Pro) — Kernel-Only
| Operation | Time/Op | Throughput |
|---|---|---|
| Field Mul | 1.9 ns | 527 M/s |
| Field Add | 1.0 ns | 990 M/s |
| Field Sub | 1.1 ns | 892 M/s |
| Field Sqr | 1.1 ns | 872 M/s |
| Field Inv | 106.4 ns | 9.40 M/s |
| Point Add | 10.1 ns | 98.6 M/s |
| Point Double | 5.1 ns | 196 M/s |
| Scalar Mul (P×k) | 2.94 µs | 0.34 M/s |
| Generator Mul (G×k) | 3.00 µs | 0.33 M/s |

v3.4.0 — Apple Metal GPU Backend
What's New
Apple Metal GPU Compute Backend
First secp256k1 library with Apple Metal compute support (M1/M2/M3/M4).
8x32-bit Comba product scanning (MSL has no uint64_t); see the sketch after this list
4-bit windowed scalar multiplication with branchless table select
Zero-copy unified memory (MTLResourceStorageModeShared)
Comprehensive benchmark suite (CUDA-format output)
Runtime .metallib compilation fallback
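As an illustration of Comba (column/product-scanning) multiplication over 8x32-bit limbs under a no-64-bit-integer constraint like MSL's, here is a sketch written as CUDA so it compiles as-is (__umulhi corresponds to mulhi() in MSL). It computes only the 512-bit product; the secp256k1 modular reduction is omitted, and the function name is illustrative:

```cuda
// Comba product scanning: each 32x32 partial product is handled as a (lo, hi)
// pair and carries are propagated through a three-word column accumulator.
#include <cstdint>

// Full 256x256 -> 512-bit product: r[16] = a[8] * b[8], little-endian limbs.
__device__ void comba_mul_8x32(const uint32_t a[8], const uint32_t b[8], uint32_t r[16]) {
    uint32_t c0 = 0, c1 = 0, c2 = 0;             // 96-bit column accumulator

    for (int k = 0; k < 15; ++k) {               // result column index
        int lo_i = (k < 8) ? 0 : k - 7;
        int hi_i = (k < 8) ? k : 7;
        for (int i = lo_i; i <= hi_i; ++i) {
            int j = k - i;
            uint32_t plo = a[i] * b[j];          // low 32 bits of the partial product
            uint32_t phi = __umulhi(a[i], b[j]); // high 32 bits (mulhi() in MSL)

            c0 += plo;
            uint32_t carry = (c0 < plo);         // carry out of c0
            c1 += phi;
            c2 += (c1 < phi);                    // carry out of c1 from phi
            c1 += carry;
            c2 += (c1 < carry);                  // carry out of c1 from the c0 carry
        }
        r[k] = c0;                               // emit this column, shift accumulator
        c0 = c1; c1 = c2; c2 = 0;
    }
    r[15] = c0;                                  // top column is the remaining carry
}
```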
Benchmarks (Apple M3 Pro, 18 GPU cores)
| Operation | Time/Op | Throughput |
|---|---|---|
| Field Mul | 1.8 ns | 560 M/s |
| Field Add | 1.2 ns | 801 M/s |
| Field Sqr | 1.0 ns | 985 M/s |
| Field Inv | 106.4 ns | 9.4 M/s |
| Point Double | 5.0 ns | 198 M/s |
| Point Add | 11.1 ns | 90 M/s |
| Scalar Mul (P×k) | 2.95 µs | 340 K/s |
| Generator Mul (G×k) | 2.85 µs | 350 K/s |
Fixes
Apple9+ (M3) GPU family detection fix (enum vs macro guard)
All Supported Platforms
CPU: x86-64, ARM64, RISC-V, ESP32-S3, ESP32, STM32F103
GPU: CUDA, OpenCL, Metal (NEW)
Mobile: Android ARM64
Web: WebAssembly