I am currently exploring how to parallelize the SHA256 algorithm using SIMD, based on a paper I found. The paper parallelizes the "message scheduling" step, which according to the authors takes up 26% of the computation time.
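To make clear what I mean, here is a minimal scalar Python sketch. It is my own illustration, not code from the paper or from Bitcoin Core, and the function names (`message_schedule`, `compress`, `sha256_precomputed_schedules`) are hypothetical. It only shows the data dependency as I understand it: each block's 64-word schedule depends on that block alone, so the schedules of several blocks of one message could in principle be computed side by side in SIMD lanes, while the compression rounds stay serial because each one needs the previous state.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
# Initial hash values and round constants from FIPS 180-4.
H0 = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def pad(message):
    """Standard SHA-256 padding: 0x80, zeros, then the 64-bit bit length."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + struct.pack(">Q", len(message) * 8)


def message_schedule(block):
    """Expand one 64-byte block into 64 words. Uses no chaining state."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    return w


def compress(state, w):
    """Run the 64 rounds for one block; needs the state from the previous block."""
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
    return [(s + v) & MASK for s, v in zip(state, [a, b, c, d, e, f, g, h])]


def sha256_precomputed_schedules(message):
    padded = pad(message)
    blocks = [padded[i:i + 64] for i in range(0, len(padded), 64)]
    # Phase 1: schedules are independent per block (the part SIMD lanes could share).
    schedules = [message_schedule(block) for block in blocks]
    # Phase 2: compression is serial, each block needs the previous state.
    state = list(H0)
    for w in schedules:
        state = compress(state, w)
    return struct.pack(">8I", *state)


# Sanity check against the standard library on a 768-byte message.
msg = bytes(range(256)) * 3
assert sha256_precomputed_schedules(msg) == hashlib.sha256(msg).digest()
```

With 8 lanes of 32-bit words (as in AVX2), phase 1 would cover 8 blocks of 64 bytes each, which is where the 512-byte figure in my question comes from.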
If I understand Bitcoin Core's code correctly (e.g. AVX2), it doesn't support using SIMD to compute the SHA256 of a single large input (e.g. the SHA256 of one message that is 512+ bytes long). It only has code for computing the SHA256 of multiple messages in parallel (i.e. SHA256 of m1, m2, ..., m8) and returning multiple hashes (i.e. h1, h2, ..., h8).
If I am reading the code wrong, please explain how it does that.
And if I am right, is there any reason why this feature wasn't added? It seems useful for computing the message digest of a big transaction, especially legacy ones, which could easily be larger than 512 bytes.
P.S. If you know of any scientific papers on this topic that are newer than 2012, please let me know.