Thank you for your input. I might have used the incorrect term "optimized version"...
It's an in-house implementation of Scrypt. I believe that's what Litecoin uses for its block hashing; maybe you could look there for some breakthrough in implementation?
When reviewing KdfRomix::DeriveKey_OneIter(), as one example, it appears to me that removing the per-iteration memory allocations might be relevant when running lots of repeated iterations:
https://github.com/etotheipi/BitcoinArmory/blob/2a6fc5355bb0c6fe26e387ccba30a5baafe8cd98/cppForSwig/EncryptionUtils.cpp#L212
If the nature of Armory's algorithm means that threading doesn't scale past the number of CPUs, then the solution may not lie in encryption code "optimizations". On the other hand, if there's significant CPU idle time due to I/O operations, etc., might a high-memory system with a custom implementation offer some advantages? Sadly, I don't have the necessary knowledge to explore this deeper or implement the needed modifications.
So I guess this post is to figure out if I should look for outsourcing to analyze and improve the code, or if there's no point in going down that path.
The allocation is done via the default STL container under the hood (the underlying container is std::vector<uint8_t>). There is not much for you to do here; the lookup table is preallocated:
lookupTable_.resize(memoryReqtBytes_);
lookupTable_.fill(0);
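To illustrate the pattern in general terms, here is a minimal sketch of a table that is allocated once and then overwritten in place on every iteration. This is not Armory's KdfRomix: the class name ScratchKdf, the method names, and the toy mixing step are all hypothetical, and the mixing is not cryptographic.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: NOT Armory's KdfRomix, and the mixing
// step below is a toy, not a cryptographic function.
class ScratchKdf {
public:
    // Single allocation, done up front (at least one byte so the
    // modulo in mixOnce is always well defined).
    explicit ScratchKdf(std::size_t memoryBytes)
        : table_(memoryBytes == 0 ? 1 : memoryBytes, 0) {}

    // Runs `iterations` passes over the same table; nothing is
    // allocated inside the loop.
    std::uint8_t derive(std::uint8_t seed, unsigned iterations) {
        std::uint8_t state = seed;
        for (unsigned i = 0; i < iterations; ++i) {
            state = mixOnce(state);
        }
        return state;
    }

    // Exposed only so a caller can confirm the storage is never reallocated.
    const std::uint8_t* tableAddress() const { return table_.data(); }

private:
    std::uint8_t mixOnce(std::uint8_t state) {
        // Fill phase: overwrite the existing storage in place.
        for (std::size_t j = 0; j < table_.size(); ++j) {
            state = static_cast<std::uint8_t>(state * 31u + 7u);
            table_[j] = state;
        }
        // Lookup phase: data-dependent reads from the same storage.
        for (std::size_t j = 0; j < table_.size(); ++j) {
            state ^= table_[state % table_.size()];
        }
        return state;
    }

    std::vector<std::uint8_t> table_;
};
```

With this shape, the cost per iteration is the memory traffic of the fill and lookup phases, not the allocator, which is why removing allocations is unlikely to change throughput much.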
I created a Python script that generates all the passwords I wish to check into a file, but with 25 P/s, it will take years. I was hoping to reach 1000 P/s, which would make a great difference. Also, I'm willing to invest in more relevant hardware, once I understand what that entails.
You shouldn't do this from Python. It accesses the C++ code through the SWIG wrapper, which leads to multiple unnecessary allocations and copies. You're also taking the hit for the Python runtime, and Python has no effective multithreading, so you'd have to use multiple processes. That isn't all that good either: context switching for processes is significantly more expensive than for threads.
If you want to squeeze the most out of your hardware on this task, you'd need the following:
- Use an optimized, SSE/AVX enabled implementation of SHA-512. Armory 0.96.5 uses an old CryptoPP implementation from circa 2012; there could be faster stuff out there by now. Building it with a modern compiler could help too.
- Use the C++ code, do NOT go through Python, multithread it in C++
- Run it on some bare-bones Linux distro, not Windows. Do not mount a swap file either.
- Profile the CPU load to figure out the optimal thread count. Try with hyperthreading/SMT on & off. Try pinning threads to explicit cores.
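As a rough sketch of the "multithread it in C++" point, the function below spreads a candidate list over a configurable number of std::thread workers, so the thread count can be varied while profiling. The name findPassword and the check callback are hypothetical; check stands in for the real KDF-plus-decrypt test and is assumed to be safe to call from several threads at once.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Returns the index of a matching candidate, or -1 if none matches.
// `check` is a hypothetical stand-in for the real KDF-plus-decrypt test
// and must be thread-safe.
long findPassword(const std::vector<std::string>& candidates,
                  const std::function<bool(const std::string&)>& check,
                  unsigned threadCount) {
    threadCount = std::max(1u, threadCount);
    std::atomic<long> found{-1};       // index of the hit, -1 while unknown
    std::atomic<std::size_t> next{0};  // next unclaimed candidate

    auto worker = [&]() {
        // Stop early once any thread has recorded a hit.
        while (found.load(std::memory_order_relaxed) < 0) {
            // Claim the next candidate; each index is handed out once.
            std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= candidates.size()) break;
            if (check(candidates[i])) {
                found.store(static_cast<long>(i));
                break;
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threadCount; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return found.load();
}
```

A caller could start with std::thread::hardware_concurrency() as the thread count and measure from there. In a real run each worker would also need its own KDF scratch memory, so total RAM grows with the thread count. Pinning is not shown; on Linux it is typically done with pthread_setaffinity_np on the thread's native handle.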