DP count is most important. You are running at DP 25, so it will take 2^57 (whatever expected group ops is for 115) - 2^25 (your DP setting) so roughly you want your DP count up to 2^32 to be getting close to solving.
I have files for DP 30. So I need to get close to 2^27 to be getting close to solving. 2^57(expected ops) - 2^30(my DP setting) = 2^27 DP count .
To be precise, 2^57(expected ops)
/ 2^30(my DP setting) = 2^(57-30) = 2^27 DP count .
There are 3 phases in this algorithm:
phase 1)
to generate a collision, you need to form on average N couples of points (T, W) to get 1 couple with the same point, a collision (exploiting the birthday paradox)
that because in a space of N points there are N possible couples with the same point among N^2 possible couples, then the chance to get a collision each time you generate a couple (T, W) is N/N^2 = 1/N , thefor you need N tries to get a collision.
To form N couple, you need 2 lists of sqrt(N) points each (because sqrt(N) * sqrt(N) = N)
This is the heaviest part of the algorithm, 2*sqrt(N) steps; thefor the idea is to parallelize this task, and at this stage parallelization is ok, especially for the gpus.
phase 2)
you cannot compare each couple of points you generate, because there are N possible combinations before to get a collision, too much;
the idea is to choose each jump in such a way that the next point depends only by the current point, this trick makes an occasional collision between 2 kangaroos permanent and we don't need to check every point against all the others to know if a collision has happened.
To 'fix' a delay between the moment of the collision (after about 2*sqrt(N) steps) and the moment of the detection of the collision we store the distinguished points. On average a 30 bit distinguished point is met each 2^30 steps. Only then we can know if that kangaroo has collide with another one.
These phase is strange: if you are using a cpu, a delay of 2^30 steps is nothing, a single core of a cpu can perform 2^30 steps in a few minutes.
But for the slow gpu is not the same, like BitCrack noted:
That 2^27 figure assumes the average walk length is 2^30. The GPU works by doing many slow walks in parallel e.g. 60 million walks that do 20 iterations per second. At that rate, it will take 2^30 / 20 seconds = 1.7 years before any of the walks are 2^30. Your DP count is going to be a few powers of 2 higher than 27.
On the other hand, the choice of the distinguished points is not only about a delay, it is about the storage too, because we can't store 2*sqrt(N) points (if we could, we would use the BSGS algorithm, that is faster and that finds the key surely).
With a high DP value we realize a low frequent sampling, then we have few samples to store in the hash table, but we waste a lot of time with the gpu (overhead).
With a lower DP value we realize a high frequent sampling, but we have to deal with the limit of our RAM.
You can see at the DP value at this way too: how much you need to reduce the points (among the 2*sqrt(N) points generated) to put in the hash table ?
DP = 2^30 means you choose to reduce by a factor of 2^30 the storage of the 2^57 points needed the get a collision (2^57 / 2^30 = 2^27 DP in the hash table). But in this way you delay by 2^30 steps the detection of the collision (and you increase by a factor of 2^30 the waste of the computation power, 2^30 steps * number of kangaroos in parallel, all these steps are useless because they are realized after the collision has already happened)
In the Zielar's case, from what I understood, there were about 2^33 kangaroos in parallel and DP = 22, N = 2^109 then:
total steps: 2*sqrt(2^109) = 2^55.5
hash table: 2^55.5 / 2^22 = 2^33.5
number of steps generated by each computing unit in parallel: 2^55.5 / 2^33 = 2^22.5, then each computing unit has generated on average 1,4 kangaroo (1,4 walk from start to the end point, DP)
phase 3)
When we know which wild kangaroo collides, if we don't know the distance covered by that kangaroo we only need to redo the steps of that kangaroo until the DP, and with a cpu this work can be rapidily finished.
So the reason why the GPUs find solution faster, is by the sheer number of Kangaroos they bring to the hunt? Example, I can find solution 50 times faster with GPU versus CPU, but GPU has more than 50 times the amount of Kangaroos. Something like that?