I have now forked BitCrack and improved it a little:
https://github.com/Uzlopak/BitCrackOpenCL/
In this fork I concentrate on the OpenCL implementation, so no CUDA stuff.
I have already improved the performance by about 30%. I expect I can do better once I start vectorizing the operations.
I already asked on Stack Overflow for help to get at least the multiply256 function improved:
https://stackoverflow.com/questions/67667314/transform-native-c-matrix-multiplication-to-opencl-simd-matrix-multiplication
I have also posted in various chats and on Fiverr with a bounty of 25 € to get it done by somebody else... I will probably end up solving it myself. Some dude on Fiverr first promised to implement it, but I guess he was more of a dud.
I expect a performance improvement of 200-300% once this function uses SIMD operations.
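For anyone who wants to see what kind of function I am talking about, here is a rough sketch in plain C of a schoolbook 256-bit multiplication. This is not the actual multiply256 code from the fork: the name mul256_sketch, the layout of 8 little-endian 32-bit limbs, and the 16-limb output are all assumptions made for illustration. The 64 partial products a[i] * b[j] do not depend on each other, which is the part that should map onto SIMD lanes; the carry propagation is the serial part that makes it hard.

```c
#include <stdint.h>

/* Hypothetical illustration, NOT the fork's multiply256.
 * Assumed layout: 256-bit numbers as 8 little-endian 32-bit limbs,
 * full 512-bit product written to 16 limbs. */
static void mul256_sketch(const uint32_t a[8], const uint32_t b[8],
                          uint32_t out[16])
{
    for (int i = 0; i < 16; i++) {
        out[i] = 0;
    }

    for (int i = 0; i < 8; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 8; j++) {
            /* 32x32 -> 64 bit partial product plus the limb already
             * accumulated at this position plus the running carry.
             * The sum is at most 2^64 - 1, so it cannot overflow. */
            uint64_t t = (uint64_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint32_t)t;   /* low 32 bits stay in place */
            carry = t >> 32;            /* high 32 bits move up      */
        }
        /* out[i + 8] has not been written yet in this pass. */
        out[i + 8] = (uint32_t)carry;
    }
}
```

A vectorized version would probably compute several partial products per instruction and resolve the carries in a separate pass, but how well that works on a given GPU is something I still have to measure.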