After tinkering with the code over the past year, I now understand that I initially asked the wrong question. Running multiple verifications in parallel in a SIMD manner doesn't fit, because a single signature already involves too many calculations on its own.
Given that, the question should be rephrased: can SIMD be used on the many calculations within a single pass? As you rightly pointed out, quadword->octaword (64x64 -> 128 bit) multiplication is generally not supported in SIMD.
Now, I think I found a way to use vectorization that bypasses the quadword->octaword obstacle - since it doesn't look like that will be getting hardware support anytime soon.
The answer lies in breaking the multiplications up into 32x32-bit multiplies producing 64-bit results, as the 10x26 field representation already does, and then packing the sources for packed multiplications.
I put the multiplications of the 10x26 field into vectorized SIMD loops, #pragma simd'ed (ICC*), and it pulled it off (result on the left).
The key is the PMULUDQ instruction (http://www.felixcloutier.com/x86/PMULUDQ.html). It takes 4x 32-bit sources and produces 2x 64-bit outputs.
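To make the PMULUDQ behavior concrete, here is a minimal sketch using the SSE2 intrinsic _mm_mul_epu32 (the helper name and layout are mine, not from the library): the instruction multiplies the even 32-bit lanes of each source, yielding two full 64-bit products per issue.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Two 32x32 -> 64-bit multiplies with one PMULUDQ.
   _mm_mul_epu32 multiplies lanes 0 and 2 of each source,
   producing two 64-bit results. */
void mul2_32x32(const uint32_t a[2], const uint32_t b[2], uint64_t r[2])
{
    /* place the operands in the even 32-bit lanes (0 and 2) */
    __m128i va = _mm_set_epi32(0, a[1], 0, a[0]);
    __m128i vb = _mm_set_epi32(0, b[1], 0, b[0]);
    __m128i vr = _mm_mul_epu32(va, vb);   /* pmuludq */
    _mm_storeu_si128((__m128i *)r, vr);
}
```

In real field-multiplication code the operands would already sit packed in memory, so the _mm_set_epi32 shuffling above would be replaced by straight loads.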
On my Q8200 (no AVX2), in x64 mode (using the 32-bit field), this is ~5-10% slower than issuing multiple 64-bit IMULs. IMULs in x64 mode are very, very convenient for 32x32 bit => 64 bit, and that is what is normally generated in x64 mode.
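For reference, the 32x32 -> 64-bit widening multiply that x64 handles so conveniently is, in C, just a cast before the multiply (a minimal illustration, not taken from the actual code):

```c
#include <stdint.h>

/* In x64 mode the compiler turns this into a single widening
   imul/mul; in -m32 mode it falls back to a mul with the result
   split across eax:edx. */
uint64_t widening_mul(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```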
In an AVX2 scenario, packed multiplication goes up to 8x 32-bit sources / 4x 64-bit results.
In an AVX512 scenario (assuming a similar instruction exists), it should pack 16x 32-bit sources into eight 64-bit results. Plus, with the extra 16 registers, it should eliminate a lot of memory accesses. If verification speed becomes an issue in the future, we might be able to exploit this.
As an incidental "discovery": while compiling with -m32 to check the 32-bit output, I saw that the uint64s are produced with plain MULs in eax:edx (which is to be expected), even though many 32-bit machines DO have SSE2 support. In x86+SSE2 mode you can get either one or two 64-bit outputs (depending on how you word the instruction) without going through MULs/ADDs in eax:edx - which is slower.

*GCC doesn't like the d = d + dd part inside the loop and hesitates to vectorize, even with #pragma GCC ivdep. One needs to do the additions outside the loop, or write it manually in asm. ICC does it like a boss with PADDs on the results.

edit:
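A hedged sketch of the restructuring that coaxes GCC into vectorizing - the function and variable names are hypothetical, not from the actual field code, but the pattern is the same: store the independent partial products first, then do the d = d + dd accumulation in a separate pass.

```c
#include <stdint.h>

/* Hypothetical illustration: accumulating d += (uint64_t)a[i] * b[i]
   inside the multiply loop creates a reduction GCC may refuse to
   vectorize. Splitting it into an independent-stores loop plus a
   separate summation loop sidesteps that. */
uint64_t dot10(const uint32_t a[10], const uint32_t b[10])
{
    uint64_t dd[10];
    for (int i = 0; i < 10; i++)      /* vectorizable: no loop-carried dep */
        dd[i] = (uint64_t)a[i] * b[i];

    uint64_t d = 0;
    for (int i = 0; i < 10; i++)      /* reduction done after the fact */
        d += dd[i];
    return d;
}
```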
Something else I remembered regarding MULX + ADCX which I've mentioned previously. I tried building with
./configure --enable-benchmark --enable-endomorphism --with-asm=no
...to check the asm output of new compilers.
GCC 6.3 now outputs MULXs (no ADCXs though) in field and scalar multiplications.
Clang 3.9 now employs both MULXs and ADCXs:
40c319: 48 8b 4e 08 mov 0x8(%rsi),%rcx
40c31d: 48 89 4c 24 b8 mov %rcx,-0x48(%rsp)
40c322: c4 62 e3 f6 c8 mulx %rax,%rbx,%r9
40c327: 49 89 c3 mov %rax,%r11
40c32a: 4c 89 5c 24 d8 mov %r11,-0x28(%rsp)
40c32f: 49 8b 52 10 mov 0x10(%r10),%rdx
40c333: 48 89 54 24 a0        mov %rdx,-0x60(%rsp)
40c338: c4 62 fb f6 c1        mulx %rcx,%rax,%r8
40c33d: 48 01 d8              add %rbx,%rax
40c340: 66 4d 0f 38 f6 c1     adcx %r9,%r8
40c346: 48 8b 6e 10 mov 0x10(%rsi),%rbp
40c34a: 49 8b 1a mov (%r10),%rbx
...although how much faster it is compared to MUL/ADC C code or asm, I can't say.