FixedPaul's version only runs at full power on 1 GPU. For example, with one 4090 I get 7.9 MK/s and with one 5090 I get 9.5 MK/s.
You also cannot use the CPU, and the speed does not double when you add cards. For example, with 8 4090 cards I get almost 90.000 MK/s. If you can get this speed with what you find on GitHub, good luck!
Everything on GitHub that uses the CPU is limited to 256 threads by design.
Also, what is there works only on the RTX 3000, 4000 and 5000 series.
My code also works on the RTX 6000 Ada and Blackwell, up to the H200, plus the CPU with up to 1024 threads.
On top of that, I can search not only by prefix but also for any text I want inside or at the end of the address.
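To make the three search modes concrete, here is a minimal Python sketch of the matching step only. The function name `matches` and the mode names are made up for illustration and are not taken from any of the tools discussed. It assumes the address has already been encoded as a string.

```python
def matches(address: str, pattern: str, mode: str = "prefix") -> bool:
    """Check an encoded address against a pattern (hypothetical helper)."""
    if mode == "prefix":
        # Classic vanity search: the address starts with the pattern.
        return address.startswith(pattern)
    if mode == "suffix":
        # The pattern sits at the very end of the address.
        return address.endswith(pattern)
    if mode == "anywhere":
        # The pattern may appear at any position in the address.
        return pattern in address
    raise ValueError(f"unknown mode: {mode}")


# Well-known example address (the Bitcoin genesis block address).
addr = "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"
print(matches(addr, "1A1z", "prefix"))     # starts with 1A1z
print(matches(addr, "fNa", "suffix"))      # ends with fNa
print(matches(addr, "QGefi", "anywhere"))  # contains QGefi
```

A suffix or inner match needs the fully encoded address string, so a real searcher would likely do more work per key in those modes than for a plain prefix. That is an assumption about typical implementations, not a statement about the code described above.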
I have a few questions to ask:
1. Are the speeds you claim with or without endomorphism?
2. Did you change or tweak JLP inversion, multiply, add, subtract, etc. functions?
3. How many registers do your kernels consume, and how many spills/loads do you have when compiling code for 8.9 and 12.0 compute capability?
4. What do you mean by "up to 1024 threads on CPU"?
5. Did you change or tweak the hashing (SHA-256, RIPEMD-160): how the rounds are handled and processed, the rotations, etc.?
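
For context on question 1: secp256k1 has a cheap endomorphism, so each computed point gives two more valid points for the cost of one field multiplication each. Whether those extra candidates are counted changes the reported key rate a lot. Below is a small Python sketch that checks the property. The helper names (`ec_add`, `ec_mul`, `cube_root_of_unity`) are made up for illustration. It uses plain affine arithmetic, is far too slow for real searching, and is not code from any of the tools discussed.

```python
# Standard secp256k1 domain parameters.
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (GX, GY)


def ec_add(a, b):
    """Add two affine points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    x1, y1 = a
    x2, y2 = b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a + (-a)
    if a == b:
        s = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P   # tangent slope
    else:
        s = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope
    x3 = (s * s - x1 - x2) % P
    y3 = (s * (x1 - x3) - y1) % P
    return (x3, y3)


def ec_mul(k, point):
    """Double-and-add scalar multiplication."""
    result, addend = None, point
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result


def cube_root_of_unity(modulus):
    """Find a nontrivial cube root of 1 modulo a prime that is 1 mod 3."""
    g = 2
    while True:
        r = pow(g, (modulus - 1) // 3, modulus)
        if r != 1:
            return r
        g += 1


beta = cube_root_of_unity(P)  # acts on x-coordinates
lam = cube_root_of_unity(N)   # acts on private keys
# Each modulus has two nontrivial roots; pick the lambda that pairs with beta.
if ec_mul(lam, G) != (beta * GX % P, GY):
    lam = lam * lam % N

k = 0x1234567890ABCDEF        # arbitrary example private key
x, y = ec_mul(k, G)
# Keys k, lam*k and lam^2*k give points with the same y and x scaled by beta.
assert ec_mul(lam * k % N, G) == (beta * x % P, y)
assert ec_mul(lam * lam * k % N, G) == (beta * beta * x % P, y)
print("endomorphism holds for the example key")
```

Combined with point negation (same x, negated y), one computed point can stand for six candidate keys, which is why the answer to question 1 matters when comparing speed figures.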