I had another idea.
First we take a look at your code, then my proposal.
***********************************************************************************************************
***********************************************************************************************************
FROM JACOBIAN TO AFFINE COORDINATES

Let's start from
static void hrdec_ge_set_all_gej_new(secp256k1_ge *r, const secp256k1_gej *a)

Your code:
....
for (i = 1; i < 4096; i++) {
secp256k1_fe_mul(&az[i], &az[i - 1], &a[i].z);
}
secp256k1_fe_inv_var(&u, &az[--i]); // sets i to 4095 here <---- You perform only 1 inverse
while (--i) {
secp256k1_fe_mul(&az[i + 1], &az[i], &u);
secp256k1_fe_mul(&u, &u, &a[i + 1].z);
}
To avoid computing 4096 field inversions, you are using the "simultaneous inversion" algorithm (Montgomery's trick):
this way you perform only 1 inversion + 3*4095M =
1I + 12285M. You trade each inversion for 3 multiplications.
My proposal: only 1.5 multiplications for each inversion --> 1 inversion + 3*2048M =
1I + 6144M (the details later)
Your code:
for (i = 0; i < 4096; i++) {
r[i].infinity = a[i].infinity;
secp256k1_ge_set_gej_zinv(&r[i], &a[i], &az[i]);
}
From secp256k1 (
https://github.com/bitcoin-core/secp256k1/blob/master/src/group_impl.h ):
static void secp256k1_ge_set_gej_zinv(secp256k1_ge *r, const secp256k1_gej *a, const secp256k1_fe *zi) {
    secp256k1_fe zi2;
    secp256k1_fe zi3;
    secp256k1_fe_sqr(&zi2, zi);           /* --> you could compute this only 2048 times instead of 4096 */
    secp256k1_fe_mul(&zi3, &zi2, zi);     /* --> you could compute this only 2048 times instead of 4096 */
    secp256k1_fe_mul(&r->x, &a->x, &zi2);
    secp256k1_fe_mul(&r->y, &a->y, &zi3);
    r->infinity = a->infinity;
}
So you perform additional operations: (3M + 1S) * 4096 =
12288M + 4096S to pass from Jacobian to affine coordinates
(X,Y,Z) --> (X/Z^2,Y/Z^3)
I think there is a way to perform only (3M + 1S)*2048 + 2M*2048 =
10240M + 2048S

Total: 1I + 24573M + 4096S <->
1I + 16384M + 2048S (about -33%)
***********************************************************************************************************
***********************************************************************************************************
Then let's look at the generation of uncompressed public keys:
Your code:
hrdec_mult_gen2(&batchpj[0], start);
for (i = 1; i < 4096; ++i) {
hrdec_gej_add_ge(&batchpj[i], &batchpj[i-1], &incr_a); // increment public key
}
.....
static void hrdec_mult_gen2(secp256k1_gej *r,
const uchar *seckey) {
secp256k1_gej o1;
secp256k1_gej s1, s2, s3, s4, s5, s6, s7, s8,
s9,s10,s11,s12,s13,s14,s15,s16;
secp256k1_gej_set_ge(&o1, &prec[ seckey[31]]);
secp256k1_gej_add_ge_var(&s1, &o1, &prec[ 256 + seckey[30]], NULL);
secp256k1_gej_set_ge(&o1, &prec[512 + seckey[29]]);
secp256k1_gej_add_ge_var(&s2, &o1, &prec[ 768 + seckey[28]], NULL);
secp256k1_gej_set_ge(&o1, &prec[1024 + seckey[27]]);
secp256k1_gej_add_ge_var(&s3, &o1, &prec[1280 + seckey[26]], NULL);
secp256k1_gej_set_ge(&o1, &prec[1536 + seckey[25]]);
secp256k1_gej_add_ge_var(&s4, &o1, &prec[1792 + seckey[24]], NULL);
secp256k1_gej_set_ge(&o1, &prec[2048 + seckey[23]]);
secp256k1_gej_add_ge_var(&s5, &o1, &prec[2304 + seckey[22]], NULL);
....
secp256k1_gej t1;
secp256k1_gej_add_var(&t1, &s1, &s2, NULL);
secp256k1_gej_add_var(&s1, &s3, &s4, NULL);
secp256k1_gej_add_var(&s2, &s5, &s6, NULL);
secp256k1_gej_add_var(&s3, &s7, &s8, NULL);
secp256k1_gej_add_var(&s4, &s9, &s10, NULL);
secp256k1_gej_add_var(&s5, &s11, &s12, NULL);
secp256k1_gej_add_var(&s6, &s13, &s14, NULL);
secp256k1_gej_add_var(&s7, &s15, &s16, NULL);
secp256k1_gej_add_var(&s8, &t1, &s1, NULL);
secp256k1_gej_add_var(&s1, &s2, &s3, NULL);
secp256k1_gej_add_var(&s2, &s4, &s5, NULL);
secp256k1_gej_add_var(&s3, &s6, &s7, NULL);
secp256k1_gej x1,x2;
secp256k1_gej_add_var(&x1, &s1, &s2, NULL);
secp256k1_gej_add_var(&x2, &s3, &s8, NULL);
secp256k1_gej_add_var(r, &x1, &x2, NULL);
}
From secp256k1 (
https://github.com/bitcoin-core/secp256k1/blob/master/src/group_impl.h )
static void secp256k1_gej_add_var(secp256k1_gej *r, const secp256k1_gej *a, const secp256k1_gej *b, secp256k1_fe *rzr) {
    /* Operations: 12 mul, 4 sqr, 2 normalize, 12 mul_int/add/negate */
    secp256k1_fe z22, z12, u1, u2, s1, s2, h, i, i2, h2, h3, t;
.....
So you are using the point addition (12M + 4S) (J + J --> J)
https://en.wikibooks.org/wiki/Cryptography/Prime_Curve/Jacobian_Coordinates#Point_Addition_.2812M_.2B_4S.29 instead of the more efficient mixed addition with an affine point (8M + 3S) (J + A --> J) (same link):
static void secp256k1_gej_add_ge_var(secp256k1_gej *r, const secp256k1_gej *a, const secp256k1_ge *b, secp256k1_fe *rzr) {
    /* 8 mul, 3 sqr, 4 normalize, 12 mul_int/add/negate */
    secp256k1_fe z12, u1, u2, s1, s2, h, i, i2, h2, h3, t;
Maybe there is a reason (if you compute kG, kG + G, (k+1)G + G, ..., why is G not in affine coordinates?); I must admit I don't understand this piece of code well.
Do you use the same function (secp256k1_gej_add_var) in hrdec_gej_add_ge too?
for (i = 1; i < 4096; ++i) {
hrdec_gej_add_ge(&batchpj[i], &batchpj[i-1], &incr_a); // increment public key
}
In any case, I understand that you perform at least (8M + 3S)*4096 operations (or not?).
And it seems to me that you don't exploit at all the fact that the keys are consecutive (I'm sorry in advance if I'm wrong, I really can't read code well in general, not only yours).
***********************************************************************************
***********************************************************************************
My proposal: instead of generating k*G, k*G + G, (k+1)*G + G, ....
you could
1) precompute G, 2*G, 3*G, ...., 2048*G
2) then generate as first point of the batch: (k+2048)*G = k'*G
(Jacobian coordinates --> (X1, Y1, Z1), or better (X1, Y1, 1))
3) then generate the following couples:
k'*G+1*G , k'*G-1*G
k'*G+2*G, k'*G-2*G
.......................................
k'*G+m*G, k'*G-m*G
.......................................
k'*G+2048*G, k'*G-2048*G
In this way:
a) k'*G is always equal to (X1, Y1, Z1) (Jacobian coordinates), or better (X1, Y1, 1)
b) m*G = (X2,Y2,1) is always in affine coordinates!
c) k'*G - m*G = k'*G + (-m*G), and +m*G / -m*G have the same X2 and opposite Y2,
so you can use the same X2 in 2 operations (and the same inverse!)
Mixed addition (with affine point) in your case --> (4M + 1.5S) per point:
U1 = X1 (Z2 = 1)
U2 = X2*Z1^2 --> 1/2M (you have to compute Z1^2 only once in the batch; X2 is the same in +mG and -mG)
S1 = Y1
S2 = Y2*Z1^3 --> 1/2M (you have to compute Z1^3 only once in the batch, and remember that mG --> Y2, -mG --> -Y2)
H = U2 - U1 = X2*Z1^2 - X1 --> 0M
R = S2 - S1 = Y2*Z1^3 - Y1 --> 0M
X3 = R^2 - H^3 - 2*X1*H^2 --> (1 + 1/2)S + 2*(1/2)M = 1.5S + 1M (H is the same each 2 additions)
Y3 = R*(X1*H^2 - X3) - Y1*H^3 --> 1M + 1/2M = 1.5M (Y1 is the same, H is the same each 2 additions)
Z3 = H*Z1 --> 1/2M (Z3 is the same each 2 additions, so the inverse is the same too!)
return (X3, Y3, Z3)
Total:
4M + 1.5S per point, half of the 8M + 3S of a plain mixed addition.

What do you think about it?