I currently have an i5-8250U @ 3.4GHz laptop, and I tried rewriting the field 5x52 asm with MULX/ADCX/ADOX. I found ADCX/ADOX useless for speeding up short parallel chains: performance was the same as or worse than ADD/ADC.

MULX, on paper, is 4 cycles vs the classic MUL's 3 cycles. The benefit, though, is that you can keep issuing MULXs into other registers instead of waiting for the rax/rdx pair to be added before moving on.

So, results are in. Tested with gcc8 (I also tried gcc7 and clang; similar performance improvements) with default CFLAGS and parameters as follows:

./configure --enable-endomorphism --enable-openssl-tests=no

**Default run (normal asm / unmodified secp)**:

perf stat -d ./bench_verify

ecdsa_verify: min 46.8us / avg 46.8us / max 47.1us

9377.185057 task-clock (msec) # 1.000 CPUs utilized

27 context-switches # 0.003 K/sec

2 cpu-migrations # 0.000 K/sec

3,284 page-faults # 0.350 K/sec

**31,765,926,255 cycles** # 3.388 GHz (62.49%)

**85,725,991,961 instructions** # 2.70 insn per cycle (75.00%)

1,822,885,107 branches # 194.396 M/sec (75.00%)

64,368,756 branch-misses # 3.53% of all branches (75.00%)

13,413,927,724 L1-dcache-loads # 1430.486 M/sec (75.03%)

7,487,706 L1-dcache-load-misses # 0.06% of all L1-dcache hits (75.08%)

4,131,670 LLC-loads # 0.441 M/sec (49.96%)

93,400 LLC-load-misses # 2.26% of all LL-cache hits (49.92%)

**9.376614679 seconds time elapsed**

./bench_internal

scalar_add: min 0.00982us / avg 0.00996us / max 0.0103us

scalar_negate: min 0.00376us / avg 0.00381us / max 0.00411us

scalar_sqr: min 0.0389us / avg 0.0393us / max 0.0405us

scalar_mul: min 0.0395us / avg 0.0398us / max 0.0413us

scalar_split: min 0.175us / avg 0.178us / max 0.186us

scalar_inverse: min 11.3us / avg 11.5us / max 11.7us

scalar_inverse_var: min 2.65us / avg 2.70us / max 2.85us

field_normalize: min 0.00988us / avg 0.00995us / max 0.0102us

field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00411us

**field_sqr: min 0.0187us / avg 0.0189us / max 0.0194us**

field_mul: min 0.0233us / avg 0.0236us / max 0.0254us

field_inverse: min 5.10us / avg 5.11us / max 5.14us

field_inverse_var: min 2.61us / avg 2.62us / max 2.69us

field_sqrt: min 5.07us / avg 5.08us / max 5.13us

**group_double_var: min 0.149us / avg 0.150us / max 0.153us**

group_add_var: min 0.337us / avg 0.338us / max 0.341us

group_add_affine: min 0.288us / avg 0.289us / max 0.292us

group_add_affine_var: min 0.243us / avg 0.244us / max 0.246us

group_jacobi_var: min 0.212us / avg 0.219us / max 0.251us

wnaf_const: min 0.0799us / avg 0.0830us / max 0.104us

ecmult_wnaf: min 0.528us / avg 0.532us / max 0.552us

hash_sha256: min 0.324us / avg 0.328us / max 0.345us

hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.30us

hash_rfc6979_hmac_sha256: min 7.00us / avg 7.00us / max 7.03us

**context_verify: min 7007us / avg 7038us / max 7186us**

context_sign: min 33.4us / avg 34.1us / max 36.7us

num_jacobi: min 0.109us / avg 0.111us / max 0.126us

After MULXing:

**Custom field 5x52 asm with MULXs**:

perf stat -d ./bench_verify

ecdsa_verify: min 39.9us / avg 39.9us / max 40.2us

8003.494101 task-clock (msec) # 1.000 CPUs utilized

28 context-switches # 0.003 K/sec

0 cpu-migrations # 0.000 K/sec

3,278 page-faults # 0.410 K/sec

**27,113,041,440 cycles** # 3.388 GHz (62.46%)

**70,772,097,848 instructions** # 2.61 insn per cycle (75.01%)

1,872,709,155 branches # 233.986 M/sec (75.01%)

63,567,635 branch-misses # 3.39% of all branches (75.01%)

20,812,623,788 L1-dcache-loads # 2600.442 M/sec (75.01%)

7,187,062 L1-dcache-load-misses # 0.03% of all L1-dcache hits (75.01%)

4,098,304 LLC-loads # 0.512 M/sec (49.98%)

97,108 LLC-load-misses # 2.37% of all LL-cache hits (49.98%)

**8.003690210 seconds time elapsed**

./bench_internal

scalar_add: min 0.00982us / avg 0.00988us / max 0.0100us

scalar_negate: min 0.00376us / avg 0.00377us / max 0.00384us

scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0402us

scalar_mul: min 0.0395us / avg 0.0397us / max 0.0412us

scalar_split: min 0.176us / avg 0.178us / max 0.184us

scalar_inverse: min 11.3us / avg 11.4us / max 11.6us

scalar_inverse_var: min 2.66us / avg 2.71us / max 3.01us

field_normalize: min 0.00988us / avg 0.00998us / max 0.0104us

field_normalize_weak: min 0.00404us / avg 0.00406us / max 0.00415us

**field_sqr: min 0.0153us / avg 0.0155us / max 0.0164us**

field_mul: min 0.0172us / avg 0.0175us / max 0.0182us

field_inverse: min 4.17us / avg 4.18us / max 4.22us

field_inverse_var: min 2.62us / avg 2.63us / max 2.65us

field_sqrt: min 4.12us / avg 4.12us / max 4.13us

**group_double_var: min 0.127us / avg 0.128us / max 0.131us**

group_add_var: min 0.270us / avg 0.270us / max 0.272us

group_add_affine: min 0.250us / avg 0.250us / max 0.252us

group_add_affine_var: min 0.200us / avg 0.200us / max 0.201us

group_jacobi_var: min 0.212us / avg 0.214us / max 0.219us

wnaf_const: min 0.0799us / avg 0.0802us / max 0.0817us

ecmult_wnaf: min 0.528us / avg 0.535us / max 0.569us

hash_sha256: min 0.324us / avg 0.334us / max 0.355us

hash_hmac_sha256: min 1.27us / avg 1.27us / max 1.31us

hash_rfc6979_hmac_sha256: min 6.99us / avg 6.99us / max 7.03us

**context_verify: min 5934us / avg 5966us / max 6127us**

context_sign: min 30.9us / avg 31.3us / max 33.0us

num_jacobi: min 0.111us / avg 0.113us / max 0.120us

From 9.37s to 8.00s (0.85x). field_mul at 0.73x, field_sqr at 0.81x.

**After doing some rearranging of group_impl.h (C file):**

perf stat -d ./bench_verify

ecdsa_verify: min 38.2us / avg 38.3us / max 38.7us

7675.837387 task-clock (msec) # 1.000 CPUs utilized

37 context-switches # 0.005 K/sec

0 cpu-migrations # 0.000 K/sec

3,268 page-faults # 0.426 K/sec

**25,993,738,895 cycles** # 3.386 GHz (62.48%)

**70,649,153,999 instructions** # 2.72 insn per cycle (74.99%)

1,872,833,433 branches # 243.991 M/sec (74.99%)

64,040,465 branch-misses # 3.42% of all branches (74.99%)

20,969,673,428 L1-dcache-loads # 2731.907 M/sec (75.04%)

7,260,544 L1-dcache-load-misses # 0.03% of all L1-dcache hits (75.09%)

4,076,705 LLC-loads # 0.531 M/sec (49.97%)

110,695 LLC-load-misses # 2.72% of all LL-cache hits (49.92%)

**7.675396172 seconds time elapsed**

./bench_internal

scalar_add: min 0.00980us / avg 0.00984us / max 0.00987us

scalar_negate: min 0.00376us / avg 0.00377us / max 0.00382us

scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0396us

scalar_mul: min 0.0392us / avg 0.0396us / max 0.0403us

scalar_split: min 0.176us / avg 0.177us / max 0.181us

scalar_inverse: min 11.3us / avg 11.4us / max 11.7us

scalar_inverse_var: min 2.65us / avg 2.69us / max 2.93us

field_normalize: min 0.00991us / avg 0.00999us / max 0.0103us

field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00414us

**field_sqr: min 0.0153us / avg 0.0154us / max 0.0158us**

field_mul: min 0.0172us / avg 0.0173us / max 0.0175us

field_inverse: min 4.17us / avg 4.18us / max 4.20us

field_inverse_var: min 2.62us / avg 2.62us / max 2.65us

field_sqrt: min 4.12us / avg 4.13us / max 4.16us

**group_double_var: min 0.121us / avg 0.122us / max 0.123us**

group_add_var: min 0.267us / avg 0.268us / max 0.271us

group_add_affine: min 0.249us / avg 0.249us / max 0.252us

group_add_affine_var: min 0.192us / avg 0.193us / max 0.196us

group_jacobi_var: min 0.211us / avg 0.214us / max 0.224us

wnaf_const: min 0.0799us / avg 0.0802us / max 0.0818us

ecmult_wnaf: min 0.528us / avg 0.534us / max 0.574us

hash_sha256: min 0.324us / avg 0.327us / max 0.341us

hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.28us

hash_rfc6979_hmac_sha256: min 6.98us / avg 6.98us / max 7.01us

**context_verify: min 5885us / avg 5916us / max 6039us**

context_sign: min 30.9us / avg 31.5us / max 35.9us

num_jacobi: min 0.110us / avg 0.111us / max 0.122us

**Time spent on ./bench_verify = 0.81x, from 31.76bn -> 25.99bn cycles.**

Most of these gains are hardware-specific, though.

I have a few things I'm currently tampering with that get it down to 0.78x. One is a special sqr_inner2 function with 3 parameters (r = output, a = input, and a counter). The counter is the number of times one wants to square the same number: instead of re-calling the function, the function loops on its own. ~10% of the time spent in ./bench_verify is looped field squarings. It takes inversions/squarings from 4.2us down to 3.9us. The gain is much larger for less optimized sqr functions: the C __int128 5x52 sqr implementation goes 20%+ faster if looped internally with a counter.

[1] mulx-asm:

https://github.com/Alex-GR/secp256k1/blob/master/field_5x52_asm_impl.h

[2] reordered group_impl.h (most of the gains are in double_var, and this is probably not hardware-specific, since I'm seeing 2% gains even on a 10-year-old quad core):

https://github.com/Alex-GR/secp256k1/blob/master/group_impl.h

Code is free to use/reuse/take ideas from/implement, etc. - although I don't claim it's heavily tested, or even safe. It does pass the tests, though.