I've taken care of most of the redundant multiplications and register accesses that I could find.

Now I'm trying to comment out some of the redundant instructions in the CPU hash asm files and could use a hand:

LAB_NEXT_NONCE:

mov rcx, 256 ; 256 - rcx is # of SHA-2 rounds

; mov rax, 64 ; 64 - rax is where we expand to

LAB_SHA:

push rcx

lea rcx, qword [data+(1024)] ; + 1024

lea r11, qword [data+(256)] ; + 256

I want to get rid of that redundant rcx move: the register only holds a constant, yet it costs three instructions in total (the mov plus the push/pop pair). I know it's not much, but it's a start at weeding out redundant code.

Also, is it just me, or is rax set to 0 and then multiplied by 4 before being added to data? And then multiplied by 4 again for no apparent reason?

Edit: I figured out it's part of the macro I overlooked. Haven't slept yet; probably should.
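For anyone else puzzled by the same thing, here is a rough Python model of how I read the macro's `[data+rax*4]` operand together with its `add rax, 4`. It assumes four 32-bit hashes are packed per xmm register and that the macro body runs once per round; the names `byte_offset` and `rax_for_round` are made up for illustration and do not appear in the asm.

```python
LANES = 4        # assumption: four 32-bit lanes per xmm register
WORD_BYTES = 4   # size of one 32-bit SHA-256 word
ROUNDS = 64      # SHA-256 round count


def rax_for_round(j):
    """rax starts at 0 and grows by 4 (`add rax, 4`) once per round."""
    return j * LANES


def byte_offset(rax):
    """Displacement produced by the scaled index in [data + rax*4]."""
    return rax * WORD_BYTES


# One 16-byte vector per round: offsets 0, 16, 32, ...
offsets = [byte_offset(rax_for_round(j)) for j in range(ROUNDS)]
```

Under that reading, rax reaches 256 after 64 rounds, which lines up with `mov rcx, 256`, and the byte offsets 256 and 1024 line up with the `data+(256)` and `data+(1024)` bounds in the snippet above.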

%endrep

add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16

cmp r11, rcx

jb LAB_CALC

pop rcx

mov rax, 0

; Load the init values of the message into the hash.

movntdqa xmm7, [init]

pshufd xmm5, xmm7, 0x55 ; xmm5 == b

pshufd xmm4, xmm7, 0xAA ; xmm4 == c

pshufd xmm3, xmm7, 0xFF ; xmm3 == d

pshufd xmm7, xmm7, 0 ; xmm7 == a

movntdqa xmm0, [init+16]

pshufd xmm8, xmm0, 0x55 ; xmm8 == f

pshufd xmm9, xmm0, 0xAA ; xmm9 == g

pshufd xmm10, xmm0, 0xFF ; xmm10 == h

pshufd xmm0, xmm0, 0 ; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32<T>(g_sha256_k[j]) + w[j]

%macro lab_loop_blk 0

movntdqa xmm6, [data+rax*4]

paddd xmm6, g_4sha256_k[rax*4]

add rax, 4
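As a reference for checking the asm, here is a minimal scalar sketch in Python of the t1 formula from the comment above. It handles one lane only, whereas the asm works on four lanes at once; the helper names `rotr32` and `t1` are mine, and `k_j` and `w_j` stand for `g_sha256_k[j]` and `w[j]`.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def t1(e, f, g, h, k_j, w_j):
    """t1 = h + Sigma1(e) + Ch(e, f, g) + k[j] + w[j], modulo 2^32."""
    big_sigma1 = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25)
    ch = (e & f) ^ (~e & MASK32 & g)  # (e & f) ^ AndNot(e, g)
    return (h + big_sigma1 + ch + k_j + w_j) & MASK32


# Round 0 of SHA-256 on the standard initial state, first message word of "abc".
example = t1(0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19,
             0x428A2F98, 0x61626380)
```

Comparing one lane of the vector result against a scalar value like `example` is a cheap way to confirm that removing an instruction did not change the round output.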

As a tangent, I found this and wonder if we might be able to code something from it. "There are two meet-in-the-middle preimage attacks against SHA-2 with a reduced number of rounds. The first one attacks 41-round SHA-256 out of 64 rounds with time complexity of 2^253.5 and space complexity of 2^16, and 46-round SHA-512 out of 80 rounds with time 2^511.5 and space 2^3. The second one attacks 42-round SHA-256 with time complexity of 2^251.7 and space complexity of 2^12, and 42-round SHA-512 with time 2^502 and space 2^22."

So basically, if we store some of the computed hash in a lookup table in memory as we're computing, there's a good chance that we could speed up hashing significantly for the first 42 rounds. Is that what you've already taken advantage of, as you mentioned before?
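To put the quoted figures in perspective, here is a small back-of-the-envelope sketch in Python. It assumes the usual generic baseline of 2^256 work for a brute-force SHA-256 preimage and only restates the exponents from the quote; the function name is made up for illustration.

```python
BRUTE_FORCE_LOG2 = 256  # assumption: generic preimage search on a 256-bit digest

# rounds attacked -> log2(time complexity), taken from the quote above
QUOTED_TIME_LOG2 = {41: 253.5, 42: 251.7}


def speedup_over_brute_force(time_log2):
    """Factor by which an attack with cost 2^time_log2 beats 2^256."""
    return 2.0 ** (BRUTE_FORCE_LOG2 - time_log2)


speedups = {rounds: speedup_over_brute_force(t)
            for rounds, t in QUOTED_TIME_LOG2.items()}
```

By this measure the quoted attacks are roughly 6x and 20x cheaper than brute force, and only on versions of SHA-256 cut down to 41 or 42 rounds, so I'm not sure how much of that would carry over to the full 64 rounds we hash.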