Yes, 11111 is quite difficult (leading 0). Maybe there is also a bug with this particular case. I'll check that.
I restored the old group size. Could you try to update your git repo and try again? Many thanks for your help.
|
|
|
Ok, could you try to add __noinline__ to the 2 _ModMult in GPU/GPUEngine.cu:

```
Line 510: __device__ __noinline__ void _ModMult(uint64_t *r, uint64_t *a, uint64_t *b) {
Line 560: __device__ __noinline__ void _ModMult(uint64_t *r, uint64_t *a) {
```
It seems that I reached a limit with CUDA... I had a similar problem with the last release... And no warning at all!
|
|
|
Ok, could you try to edit GPUEngine.cu and change the stackSize to 49152 (48K) at line 1371? I doubled the group size and I missed this. Does it improve anything?

```cpp
size_t stackSize = 49152;
err = cudaDeviceSetLimit(cudaLimitStackSize, stackSize);
if (err != cudaSuccess) {
```
|
|
|
@arulbero
Is your git clone up to date? git pull
Did you clean before making ? make clean and make gpu=1 all
On my config the -check is ok. It looks like the problem I had last time, when the GPU code was wrongly generated.
mmm....
|
|
|
The increase for stivensons seems rather normal, but 350% seems strange! Maybe there is a problem... Does it work fine? If you search for the same prefix several times and note each time the percentage at which the prefix is found, the average should be around 50%. Is that the case?
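That 50% average can be sanity-checked with a quick Monte-Carlo sketch: at the moment a prefix of difficulty D is found after k keys, the displayed probability 1-(1-1/D)^k is close to uniform on (0,1), so its mean over many runs is about 0.5. The difficulty value and trial count below are arbitrary:

```python
import random

random.seed(1)
difficulty = 1000          # arbitrary prefix difficulty
trials = 20000
pcts = []
for _ in range(trials):
    k = 0
    while True:            # draw keys until the prefix is found
        k += 1
        if random.random() < 1 / difficulty:
            break
    # probability displayed at the moment of success
    pcts.append(1 - (1 - 1 / difficulty) ** k)
mean = sum(pcts) / len(pcts)
print(mean)  # close to 0.5
```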
|
|
|
Good job Jean_Luc, but this time I don't get a speed increase on my equipment, it's the same.
I haven't published the release as an executable download yet; if you want to test it, you have to clone the git repository and compile it yourself. The new release is coming, I'm currently working on the GPU code.
|
|
|
Hello, I finished the implementation of endomorphisms and their symmetries (CPU only). The code is committed to GitHub for those who want to test. On my hardware, I observe a ~20% speed increase (compressed addresses); the hash functions (SSE) now take 76% of the CPU time. GPU implementation is coming... Many thanks again to arulbero for these precious tips concerning symmetries, and to all of you for helping to make this software better.
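For reference, the secp256k1 endomorphism mentioned above maps a point (x, y) to (beta*x mod p, y), which corresponds to multiplying the private key by lambda mod n; both constants are non-trivial cube roots of unity. A small sketch checking this, using the standard published secp256k1 constants:

```python
# secp256k1 field prime and group order
p = 2**256 - 2**32 - 977
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

# Endomorphism constants: (x, y) -> (beta*x, y) corresponds to k -> lambda*k
beta = 0x7AE96A2B657C07106E64479EAC3434E99CF0497512F58995C1396C28719501EE
lam  = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72

# Both are non-trivial cube roots of unity in their respective fields
assert pow(beta, 3, p) == 1 and beta != 1
assert pow(lam, 3, n) == 1 and lam != 1
print("endomorphism constants check out")
```

This is why one scalar multiplication yields extra candidate points almost for free: beta*x is a single modular multiplication.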
|
|
|
It seems to me that ModInv should be much lower than 50% of ModMulK1. Are you sure you don't gain a significant advantage from using more than 256 elements per batch? Why not try with 1024 or 4096?
Maybe there is a confusion between IntGroup::ModInv (256*3 ModMulK1) and Int::ModInv (the true ModInv). Look only at the column on the right (Self CPU). ModInv is taking ~2% (using compressed addresses), so if I multiply the group size by 2 I can expect a ~1% speed increase for the CPU release. I did the test on 1 core and, as expected, the key rate goes from 3.4 MKey/s to 3.44 MKey/s. Of course, for other applications where you do not need to hash, you can expect a more significant speed increase.

I attach a new CPU profile with SSE disabled (-nosse option) and using compressed addresses. This profile should be close enough to the GPU profile, as there is no SIMD instruction on GPU to speed up the hash functions. Here the ModInv falls to 1%.

For VanitySearch, having a smaller group size is better (this is a reason why I worked a lot on this DRS62 ModInv implementation). I can double the size of the group (I will definitely do it), but not more. The GPU kernel performs one group per thread and sends the hash160 back to the CPU. If the group size is too large, memory transfer and allocation become a problem. Divide and rule.

It's amazing how much progress is being made on this software so quickly. Great work!
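The IntGroup::ModInv trick discussed above (one true inversion per group at a cost of about 3 multiplications per element) is Montgomery's batch-inversion technique. A hedged Python sketch of the idea, not the actual Int/IntGroup code:

```python
def batch_modinv(values, p):
    """Invert every element of `values` mod prime p using a single
    true modular inversion plus ~3 multiplications per element."""
    n = len(values)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):          # n multiplications
        prefix[i + 1] = prefix[i] * v % p
    inv = pow(prefix[n], p - 2, p)          # the one true ModInv
    out = [0] * n
    for i in range(n - 1, -1, -1):          # 2n multiplications
        out[i] = prefix[i] * inv % p        # inverse of values[i]
        inv = inv * values[i] % p           # strip values[i] from the product
    return out
```

With a group of 256 elements this costs one inversion plus roughly 256*3 modular multiplications, which matches the "256*3 ModMulK1" accounting above, and explains why shrinking the cost of the single true ModInv (DRS62) matters more than growing the group.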
Thanks
|
|
|
Yes, and there is also a CUDA intrinsic, __ffsll, that finds the position of the first set bit, which could be used to speed up the checking of the public key. CPU profiles of the last release: (using compressed addresses) (using uncompressed addresses) EDIT: FindKey includes, mainly, the lookup table and ECC arithmetic (ModAdd and ModSub). ModMulK1 is the SecpK1 modular mult; there are 2 ModMulK1 signatures for this method. Added the 2 profiles for compressed and uncompressed.
|
|
|
Thanks again for the typo; the automatic corrector doesn't work when editing the README.md. Yes, it can be fun: with GPU optimization we can gain a few base58 characters, but 2^80 is still too high to find a complete collision.
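To put a number on that 2^80: it is the birthday bound sqrt(2^160) for the 160-bit hash160 space, and even this "easier" any-two-collide search stays out of reach. A back-of-the-envelope sketch (the 1 GKey/s rate is an arbitrary assumption):

```python
HASH_BITS = 160
keyrate = 1e9                      # 1 GKey/s, an assumed rate
keys = 2 ** (HASH_BITS // 2)       # birthday bound: ~2^80 hashes
years = keys / keyrate / (365.25 * 24 * 3600)
print(f"~{years:.1e} years")       # tens of millions of years
```

And this ignores the memory needed to store and compare the 2^80 generated hashes, which is a hard limit on its own.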
|
|
|
Thanks. Hope I will be less disturbed by dreamers!
|
|
|
Thanks
|
|
|
OK, thanks arulbero, it is clear and rather simple. I will try to write something better tomorrow. See you.
|
|
|
For the output, a space between Difficulty and Search would be better. Lol, I see that, I'll correct it in the next release.
|
|
|
You compute the probability this way?
I compute the difficulty as vanitygen does (the numbers are not exactly equal because I use double-precision calculation, but you can see they are very close), then simply a Bernoulli trial, as vanitygen also does. Attacking 1 billion addresses (I know there are far fewer with funds) is like having a key rate 1 billion times faster. I just would like a simple text to explain that the birthday paradox cannot be used here, and that having a big input file does not improve the odds enough: finding a collision remains infeasible.
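The Bernoulli-trial computation described above can be sketched as follows; the difficulty and key-rate values are placeholders, not VanitySearch internals:

```python
import math

def time_to_probability(difficulty, keyrate, target=0.5):
    """Each key is an independent Bernoulli trial with success chance
    1/difficulty; return the seconds needed so that the cumulative
    probability 1 - (1 - 1/difficulty)**k reaches `target`."""
    k = math.log(1 - target) / math.log(1 - 1 / difficulty)
    return k / keyrate

# Example: a prefix of difficulty ~4.5e9 at 100 MKey/s
t = time_to_probability(4.5e9, 100e6)
print(f"{t:.1f} s")  # ~31 s to reach 50%
```

Attacking n addresses at once is then just `time_to_probability(difficulty, keyrate * n)`, which is the "divide the time by the number of entries" rule.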
|
|
|
Hello, I added a note in the readme about attacking full addresses. It may not be clear; I would like a simple and understandable text about this. Thanks for helping me make it clear.

Please don't use VanitySearch to attack a list of complete addresses. It is very unlikely that you will find a collision. The time displayed indicates the time needed to reach the displayed probability for the most probable prefix in the list. If you have n complete addresses in the input file, simply divide this time by the number of entries to get an approximate idea of the time needed to reach the displayed probability (in fact it is longer). Even with a file containing 1 billion addresses, using very competitive hardware, the time needed to reach a probability of 50% would be much longer than the age of the universe. Note that the birthday paradox cannot be applied here: we look for fixed addresses, and there is no trick (as with the Pollard rho method on point coordinates) to simulate random walks, because the addresses are hashed.
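The age-of-the-universe claim follows from simple arithmetic; a sketch assuming a (very generous) 1 TKey/s rate and 10^9 target addresses:

```python
import math

targets = 10**9        # complete addresses in the input file
keyrate = 1e12         # 1 TKey/s, far beyond current hardware
space = 2**160         # hash160 space

# Keys needed for a 50% chance of hitting any of the fixed targets
keys = math.log(2) * space / targets
seconds = keys / keyrate
age_of_universe = 4.35e17  # seconds
print(seconds / age_of_universe)  # billions of times the universe's age
```

Even with the billion-fold speedup from the target list, the required time exceeds the age of the universe by a factor of about 10^9.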
|
|
|
Next steps:
- Optimize CPU/GPU exchange
- Add missing ECC optimizations (some symmetries and the endomorphism)
- Add support for GPU funnel shift, which should speed up SHA (but I need to find a board with compute capability >= 3.5; mine is 3.0)
Did you already implement all of steps 1, 2, 3, or is there still room for further improvements?
- Support for funnel shift: not yet done.
- p-iG/p+iG: done.
- k.(x,y)/-k.(x,-y): done.
- Endomorphism: in progress.
- CPU/GPU exchange: done, but still needs improvement (it is difficult to find good compromises with multi-prefix search).
|
|
|
Thank you very much to all of you for testing this software and helping me to make it better
|
|
|
|