Imagine you want to generate a batch from 10000 to 14096 (the script actually generates batches of 4097 points)
First you generate the key k = 12048 (always we start with the middle point, to exploit the symmetry), this is the only point (a pivot point) of the batch that we get with the slower function mult
... k ... <-- one batch, only one key k
jkx,jky,jkz = mul(k,Gx,Gy,1)
invjkz = inv(jkz,p)
(kx,ky) = jac_to_aff(jkx, jky, jkz, invjkz)
k can be any number greater than 2048 (otherwise, if k=3 for example, kG+3G gives a error because you are trying to use the addition formula instead of the double...) The first batch you can create with this script goes from 1 to 4097, the start key in that case would be k=2049.
Then the script generates three batches, each batch has 1 point + 2048 couple of points:
first batch: this is the batch you are more interested of, because it has 4097 points in your range, including the point 12048G:
(12048),(12048+1,12048-1),(12048+2,12048-2),....,(12048+2048=14096,12048-2048=10000)
the script computes this batch with the function double_add_P_Q_inv
Element #0 of the list is always kG, element #1 is the couple kG+1G, kG-1G, #2 is the couple kG+2G, kG-2G, and so on ... --> #2048 is the couple kG+2048,kG-2048G
batch = batch + list(map(double_add_P_Q_inv,kxl[1:],kyl[1:],mGx[1:],mGy[1:],kminverse[1:]))
Batch 1 and 2: these keys are not in your range, here we use endomorphism:
batch1:
(12048*lambda), ((12048+1)*lambda,(12048-1)*lambda), ((12048+2)*lambda,(12048-2)*lambda), ...., (10000*lambda,14096*lambda)
batch2:
(12048*lambda^2), ((12048+1)*lambda^2, (12048-1)*lambda^2), ((12048+2)*lambda^2, (12048-2)*lambda^2), ...., (14096*lambda^2, 10000*lambda^2)
EDIT:
to make sure work still is distributable / parallelizable and the bookkeeping still being sane.
You don't worry about each key, in my opinion you have to store only a private key for 3 batches, you can think at the single key in the middle of the batch like a special seed. 99,9999% of the batches doesn't match any address with bitcoin, so when a match occurs only then you have to regenerate the entire 3 batches from this single seed to fetch the correct private key. Batch 1 and 2 are sequence of keys each different from each other, so you are sure that you are not wasting your computational efforts. I'm
almost sure about the last sentence, there can't be more than three points with the same y, it is not possible checking the same key twice. Note that the 3 batches are related, they must be computed together.
Imagine you know that the pool has searched so far from key 1 to 2^50, then you know that the pool has searched keys 1*lambda, 2*lambda, 3*lambda ... to 2^50*lambda (mod n) too, and keys 1*lambda^2, 2*lambda^2, 3*lambda^2,... to 2^50*lambda^2 (mod n).
05 runs nearly 22 seconds for 16M keys on my notebook. This is now only 3.5 times slower than what LBC optimized C version needs for 16M keys. I don't dare to estimate what optimized C code can make of this.
I dare: if you use complement too, you can generate 16M keys in less than half a second (with cpu, I don't know for GPU)
Considering that your current code performs 6M + 1S only for the transition from jacobian to affine coordinates for each point and that you are using J+J --> J to perform each addition (12M + 4S), your current cost should be 18M + 5S each point.
Let's say 1S = 0,8M, you have about 22M for point.
If you are now using instead J+A --> J to perform addition (8M + 3S), then you have about 17,2M for point.
My code uses 3,5M + 1S for each point of the first batch, and only 1M for each point of the other 2 batches.
So the average is: 5,5/3= 1,83M + 0,33S for point, let's say about
2,1M for point.
Now your speed is 16M/6s = 2,7 M/s for each cpu core.
If you could achieve a 8x - 10x improvement, let's say a 8x, so you could perform at least 21M/s. If you use (X,Y) --> (X,-Y) too, 42M/s. Let's say at least 40M k/s for each core,
15x respect of your actual speed.
With a 8-core cpu, you could generate more keys than your entire pool can handle at this moment.
Maybe tomorrow I'll add more comments on the code. Anyway read again this post, I edited it.
EDIT2:
this is a version with more comments:
https://www.dropbox.com/s/6o2az7n6x0luld4/ecc_for_collider06.zip?dl=0