It's not a new idea. It was used back in the GPU Bitcoin mining days to get better speeds out of AMD's VLIW cards.

It's easy to adapt the miner itself to process multiple nonces per thread; I'm not sure how much work is needed on the algos themselves. Maybe we could run a test with a simple algo like Blake. But I'm not the man for it, because I'm not proficient in those CPU instruction extensions.

Neither am I, but it's not that difficult.

Say, for example, you have a loop like:

for (i = 0; i < 100000000; i++) {
    b = sqrt(b);
    bb = sqrt(bb);
    bbb = sqrt(bbb);
    bbbb = sqrt(bbbb);
}

...gcc will make it something like:

40072e: 0f 84 9b 00 00 00    je 4007cf <main+0x12f>
400734: f2 0f 51 d6          **sqrtsd** %xmm6,%xmm2
400738: 66 0f 2e d2          ucomisd %xmm2,%xmm2
40073c: 0f 8a 63 02 00 00    jp 4009a5 <main+0x305>
400742: 66 0f 28 f2          movapd %xmm2,%xmm6
400746: f2 0f 51 cd          **sqrtsd** %xmm5,%xmm1
40074a: 66 0f 2e c9          ucomisd %xmm1,%xmm1
40074e: 0f 8a d9 01 00 00    jp 40092d <main+0x28d>
400754: 66 0f 28 e9          movapd %xmm1,%xmm5
400758: f2 0f 51 c7          **sqrtsd** %xmm7,%xmm0
40075c: 66 0f 2e c0          ucomisd %xmm0,%xmm0
400760: 0f 8a 47 01 00 00    jp 4008ad <main+0x20d>
400766: 66 0f 28 f8          movapd %xmm0,%xmm7
40076a: f2 0f 51 c3          **sqrtsd** %xmm3,%xmm0
40076e: 66 0f 2e c0          ucomisd %xmm0,%xmm0
400772: 0f 8a b5 00 00 00    jp 40082d <main+0x18d>

...which is sqrt-scalar-double.

4 instructions / 4 math operations.

What could be done differently (Intel syntax follows):

movlpd xmm1, b        // load the first variable "b" into the lower half of xmm1
movhpd xmm1, bb       // load the second variable "bb" into the upper half of xmm1
**SQRTPD** xmm1, xmm1 // batch-process both square roots with one SIMD instruction
movlpd xmm2, bbb      // load the third variable "bbb" into the lower half of xmm2
movhpd xmm2, bbbb     // load the fourth variable "bbbb" into the upper half of xmm2
**SQRTPD** xmm2, xmm2 // batch-process their square roots
movlpd b, xmm1        //
movhpd bb, xmm1       //  return all results from the registers back to memory
movlpd bbb, xmm2      //
movhpd bbbb, xmm2     //

SQRTPD - square root - P(acked)-Double.

So now four math instructions became two, and the runtime dropped by nearly half (I've actually benchmarked the above, and it comes close to half). But in order to pack instructions (math or logical) you need similar processing loads, similar operations. You can't have that in a scenario that goes like

sqrt

add

shift

xor

and the function is changing...

But if you loaded 4x hashes together, you'd be looking at

sqrt(of the first) sqrt (of the second) sqrt (third) sqrt (fourth) (<=pack them)

add add add add (<=pack them)

shift shift shift shift (<=pack them)

xor xor xor xor (<=pack them)

...etc

I wasn't even aware of the above until a couple of weeks ago, when I got down to the asm level to see what happens and why some Pascal output was slower than C output. Then I ran into

http://x86.renejeschke.de as a reference while trying to understand the instructions and what they do, and I rewrote some instructions myself - like the packed example above (I thought it was pretty easy, really). Then, more recently, I went over the code of the asm hash functions of altcoins and Bitcoin - and it was full of serial operations, despite claims of "SSE/AVX use" / "SSE/AVX enhanced". And I'm like, WHAT THE F***? This is all crippled.