Thanks! You're right - this is still very much an experimental project.
The backstory: I was doing deep research and experiments with cryptography and blockchain (especially Bitcoin), and at some point hit a wall. None of the existing Kangaroo tools were up-to-date or worked natively with my setup - I'm on Arch Linux with an AMD GPU, and most implementations are CUDA-only or abandoned. So I built this from scratch for my own needs and decided to share it.
It's also a learning project for me, and I really appreciate the feedback. Getting suggestions like yours helps me improve both the code and my understanding of the algorithm.
Regarding your specific points:
Jump point collision - already handled. When dx == 0, I substitute fe_one() to avoid poisoning the batch inversion, then skip the add. See kangaroo_affine.wgsl:160-164.
Batch inversion - implemented using Montgomery's trick with Blelloch parallel scan (no individual FLT inversions in the hot path).
Algorithm variant documentation - fair point, I should document that this is a basic 2-kangaroo implementation without symmetry. Will add that to the README.
Performance vs SOTA - I know there's a gap. No symmetry optimization yet, and probably other things I'm missing. Happy to hear specific suggestions if you have them.
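To make the dx == 0 guard and the batch inversion points concrete, here is a minimal Python sketch of the idea. It is not the WGSL shader code: the function name batch_inverse and the skip list are made up for illustration, and the prefix products are computed sequentially here instead of with a Blelloch scan. P is the secp256k1 field prime.

```python
# Illustrative sketch only; names are hypothetical and the prefix products
# are sequential (the GPU version uses a parallel scan instead).
P = 2**256 - 2**32 - 977  # secp256k1 field prime


def batch_inverse(dxs, p=P):
    """Invert every nonzero dx with one field inversion (Montgomery's trick).

    Returns (inverses, skip). skip[i] is True where dx was 0 mod p; those
    lanes were replaced by 1 so they cannot zero out the shared product,
    and the caller is expected to skip the point addition for them.
    """
    n = len(dxs)
    skip = [dx % p == 0 for dx in dxs]
    # Zero guard: substitute 1 so the running product never becomes 0.
    safe = [1 if s else dx % p for dx, s in zip(dxs, skip)]

    # prefix[i] = safe[0] * ... * safe[i-1] (mod p), with prefix[0] = 1.
    prefix = [1] * (n + 1)
    for i, v in enumerate(safe):
        prefix[i + 1] = prefix[i] * v % p

    # The only inversion for the whole batch (Fermat: a^(p-2) mod p).
    inv_running = pow(prefix[n], p - 2, p)

    # Walk backwards, peeling off one factor at a time.
    inverses = [0] * n
    for i in range(n - 1, -1, -1):
        inverses[i] = inv_running * prefix[i] % p
        inv_running = inv_running * safe[i] % p
    return inverses, skip


dxs = [3, 0, 7, 12345]
invs, skip = batch_inverse(dxs)
```

Without the substitution, a single zero dx would make the shared product zero and every inverse in the batch would come out wrong, which is why the guarded lane is flagged and skipped rather than inverted.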
---
Latest version: v0.4.0
Changes since first release:
v0.2.0 GPU auto-calibration, data provider system, wgpu v28
v0.3.0 Affine batch addition
v0.4.0 Parallel batch inversion (Blelloch scan)
Performance progression (AMD RX 6800S, 48-bit range):
Version | Rate     | Improvement
v0.2.0  | 3.70 M/s | baseline
v0.3.0  | 5.50 M/s | +49%
v0.4.0  | 8.84 M/s | +139% total (+61% vs v0.3.0)
---
Thanks again for taking the time to look at the code.