DaveF
Legendary
Offline
Activity: 3598
Merit: 6581
Crypto Swap Exchange
|
|
March 24, 2019, 02:00:24 AM |
|
Also, any chance of OpenCL, or is it going to be CUDA only? Thanks, Dave
|
|
|
|
Jean_Luc (OP)
|
|
March 24, 2019, 05:32:04 AM |
|
Quote: For the moment it is Linux only, but it seems to me that Jean_Luc is trying, or will try, to adapt it for Windows as well... I think that is more difficult than for Linux.
Yes. On Windows there is no way to set up the CUDA SDK 8.0 if a recent compiler (VC2017) is installed, even if the right one (VC2013) is also installed: the SDK setup fails. The only solution is to start from a fresh install without VC2017.
Quote from DaveF: Also, any chance of OpenCL, or is it going to be CUDA only? Thanks, Dave
The problem with OpenCL is that I don't know how to access the carry flag or how to perform a wide 64-bit multiplication (i64 x i64 => i128). For instance, here is the code oclvanitygen uses to perform an addition with carry:
#define bn_addc_word(r, a, b, t, c) do { \
    t = a + b + c; \
    c = (t < a) ? 1 : ((c & (t == a)) ? 1 : 0); \
    r = t; \
} while (0)
This can be reduced to a single adc instruction with CUDA (and also with Visual C++, gcc, etc.). Some OpenCL driver compilers are smart enough to recognize this code and reduce it to a single adc instruction, but not all. For the wide 64-bit multiplication (i64 x i64 => i128), CUDA offers the needed instructions (mul.lo.u64 and mul.hi.u64), but with OpenCL it seems that the only way is to split the operands into 32-bit integers and use 64-bit integers to hold the partial products (i32 x i32 => i64). If an OpenCL expert knows how to do this efficiently, it would be great.
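The 32-bit decomposition mentioned in this post can be sketched in portable C++ as below. This is only an illustration of the schoolbook approach, not code taken from VanitySearch or oclvanitygen; the names U128 and mul64wide are invented for the example. The same expressions could be written in OpenCL C with ulong operands.

```cpp
#include <cstdint>

// Hypothetical helper type: the two 64-bit halves of a 128-bit product.
struct U128 {
    uint64_t lo;
    uint64_t hi;
};

// Schoolbook 64x64 => 128 multiplication built from four 32x32 => 64 products.
inline U128 mul64wide(uint64_t a, uint64_t b) {
    const uint64_t mask = 0xFFFFFFFFULL;
    const uint64_t a0 = a & mask, a1 = a >> 32;
    const uint64_t b0 = b & mask, b1 = b >> 32;

    const uint64_t p00 = a0 * b0;
    const uint64_t p01 = a0 * b1;
    const uint64_t p10 = a1 * b0;
    const uint64_t p11 = a1 * b1;

    // Everything that lands in bits 32..63 of the result.
    // At most 3 * (2^32 - 1), so this sum cannot overflow 64 bits.
    const uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);

    U128 r;
    r.lo = (mid << 32) | (p00 & mask);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
```

OpenCL C also defines a mul_hi built-in for integer types; whether a given driver turns it into efficient instructions for 64-bit operands is a separate question that this sketch does not answer.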
|
|
|
|
Telariust
Jr. Member
Offline
Activity: 38
Merit: 18
|
|
March 24, 2019, 12:25:46 PM Last edit: August 12, 2019, 03:30:33 AM by Telariust |
|
Quote from Jean_Luc: Here is the code of oclvanitygen to perform an addition with carry:
#define bn_addc_word(r, a, b, t, c) do { \
    t = a + b + c; \
    c = (t < a) ? 1 : ((c & (t == a)) ? 1 : 0); \
    r = t; \
} while (0)
This code may have a problem; see the post, which was moved to https://bitcointalk.org/index.php?topic=25804.msg52110068#msg52110068
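Independently of the linked discussion, whose contents are not reproduced here, the arithmetic rule in the macro can be checked by brute force on a small word size. The sketch below rewrites the rule as a C++ function template (the names addc and countCarryMismatches are invented for the example) and compares it with an exact 9-bit sum for every 8-bit input. It only tests the carry rule itself, assuming the incoming carry is 0 or 1; it says nothing about how a particular OpenCL compiler treats the macro.

```cpp
#include <cstdint>

// The carry rule from the macro, as a function template so it can be
// instantiated on a small word type. carry must be 0 or 1 on entry.
template <typename Word>
Word addc(Word a, Word b, unsigned &carry) {
    const Word t = static_cast<Word>(a + b + carry);
    carry = (t < a) ? 1u : ((carry && t == a) ? 1u : 0u);
    return t;
}

// Compare against an exact wider sum for every 8-bit input and carry-in.
// Returns the number of mismatches found.
inline unsigned countCarryMismatches() {
    unsigned mismatches = 0;
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            for (unsigned cin = 0; cin < 2; ++cin) {
                unsigned carry = cin;
                const uint8_t sum = addc<uint8_t>(static_cast<uint8_t>(a),
                                                  static_cast<uint8_t>(b), carry);
                const unsigned ref = a + b + cin;  // exact 9-bit result
                if (sum != (ref & 0xFFu) || carry != (ref >> 8)) {
                    ++mismatches;
                }
            }
        }
    }
    return mismatches;
}
```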
|
|
|
|
asche
Legendary
Offline
Activity: 1484
Merit: 1491
I forgot more than you will ever know.
|
|
March 24, 2019, 12:55:12 PM |
|
Salut Jean-Luc. Do you plan to add support for P2SH addresses (segwit, starting with 3) to your tool anytime soon? That would be nice to have. For instance, this project by nullios implements both P2SH and bech32 addresses. Is this included in your roadmap? Nice work anyway!
|
|
|
|
RobertPaulig
Newbie
Offline
Activity: 7
Merit: 1
|
|
March 24, 2019, 02:36:56 PM |
|
Win10, Cuda 10, i7 3700k, 8 GB RAM
vanitysearch -stop -t 0 -gpu -gpuId 0 -i input_addres.txt -o output_file.txt
Search: 1Testtttt [Compressed]
Start Sun Mar 24 17:22:35 2019
Base Key:E50C09A69B313FCC6480B3390C47BBD55D6FFFEEBBC36D3881E011AE0330275
Number of CPU thread: 0
GPU: GPU #0 GeForce GTX 1080 Ti (28x128 cores) Grid(224x128)
967.926 MK/s (GPU 967.926 MK/s) (2^32.44) [P 0.00%][50.00% in 24.9d][0]0]
vanitysearch -stop -t 8 -gpu -gpuId 0 -i input_addres.txt -o output_file.txt
Difficulty: 2988734397852221
Search: 1Testtttt [Compressed]
Start Sun Mar 24 17:26:34 2019
Base Key:912441F08928FCEF7B5D6F9A1232221AF9FF3F6E653586F9146625C436060099
Number of CPU thread: 8
GPU: GPU #0 GeForce GTX 1080 Ti (28x128 cores) Grid(224x128)
914.418 MK/s (GPU 896.216 MK/s) (2^33.38) [P 0.00%][50.00% in 26.3d][0]0]
It is very strange that it runs slower with the CPU threads than without them. Jean_Luc, thank you for your hard work. What happens if I interrupt the execution? Does VanitySearch keep its results?
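The "50.00% in ..." figures in these logs can be roughly reproduced with a simple model. The sketch below assumes that every generated key is an independent trial with success probability 1/difficulty, so that the probability of at least one hit after n keys is 1 - (1 - 1/D)^n. That is an assumption made for illustration, not a statement about how VanitySearch computes its estimate, and the function name daysToProbability is invented.

```cpp
#include <cmath>

// Time, in days, needed to reach success probability p when each key is an
// independent trial with success chance 1/difficulty (an assumed model).
inline double daysToProbability(double difficulty, double keysPerSecond, double p) {
    // Solve 1 - (1 - 1/D)^n = p for n; log1p keeps precision for tiny 1/D.
    const double keysNeeded = std::log1p(-p) / std::log1p(-1.0 / difficulty);
    return keysNeeded / keysPerSecond / 86400.0;
}
```

With the difficulty 2988734397852221 printed in the second run and a rate of 967.926 MK/s, daysToProbability(2988734397852221.0, 967.926e6, 0.5) gives roughly 24.8 days, close to the 24.9d shown in the log; the small gap presumably comes from the key rate varying while the estimate is displayed.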
|
|
|
|
Jean_Luc (OP)
|
|
March 24, 2019, 04:57:42 PM |
|
Quote: Do you recognize this crash error?
No, I have never experienced this crash. Thanks for the info.
Quote from asche: Is this included in your roadmap?
Salut. I'm not yet familiar with P2SH addresses; I have to learn about them in detail. Maybe for 1-to-1 multisig P2SH.
Quote from asche: Nice work anyway!
Thanks.
Quote from RobertPaulig: It is very strange that it runs slower with the CPU threads than without them.
Yes. This is because with -t 8 your CPU becomes a bottleneck and cannot keep up with the GPU/CPU exchange. With a good GPU key rate, it is generally better to free 1 CPU core per GPU.
Quote from RobertPaulig: Jean_Luc, thank you for your hard work. What happens if I interrupt the execution? Does VanitySearch keep its results?
If you are using a passphrase and want to restart a search, you have to change your passphrase (1 character is enough); otherwise you will recompute exactly the same thing. If you're using the default random seed, the seed will change, so you won't recompute the same thing and there is no need to save anything. But I recommend using a passphrase in order to generate safe private keys.
|
|
|
|
OgNasty
Donator
Legendary
Offline
Activity: 4858
Merit: 4604
Leading Crypto Sports Betting & Casino Platform
|
|
March 24, 2019, 08:41:13 PM Last edit: March 25, 2019, 06:19:51 PM by OgNasty |
|
Quote from Jean_Luc: Yes. This is because with -t 8 your CPU becomes a bottleneck and cannot keep up with the GPU/CPU exchange. With a good GPU key rate, it is generally better to free 1 CPU core per GPU.
I think most users with newer GPUs would benefit from the power efficiency gains of running with -t 0. I would even argue that should be the default when a GPU is detected instead of the other way around where you have to enable GPUs. Edit: I'd also like to see the version number shown with the startup information.
|
|
|
|
Jean_Luc (OP)
|
|
March 25, 2019, 06:21:12 PM |
|
Yes, today the default is to free only one core when the GPU is enabled; I will change this to the number of GPUs.
|
|
|
|
arulbero
Legendary
Offline
Activity: 1915
Merit: 2074
|
|
March 26, 2019, 06:39:09 PM |
|
Hi, I've just downloaded the VanitySearch master, and it works perfectly if I add "volatile" in this piece of code:
void Int::ModSquareK1(Int *a) {
#ifndef WIN64
#if __GNUC__ <= 6
  #warning "GCC lass than 7 detected, upgrade gcc to get best perfromance"
  volatile unsigned char c;   <--
#else
  volatile unsigned char c;   <--
#endif
#else
  unsigned char c;
#endif
|
|
|
|
Jean_Luc (OP)
|
|
March 26, 2019, 06:58:15 PM |
|
Quote from arulbero: Hi, I've just downloaded the VanitySearch master, and it works perfectly if I add "volatile" in this piece of code:
OK, which release of gcc are you using to compile VanitySearch (not the CUDA code)?
|
|
|
|
arulbero
Legendary
Offline
Activity: 1915
Merit: 2074
|
|
March 26, 2019, 07:19:31 PM |
|
Quote from arulbero: Hi, I've just downloaded the VanitySearch master, and it works perfectly if I add "volatile" in this piece of code:
Quote from Jean_Luc: OK, which release of gcc are you using to compile VanitySearch (not the CUDA code)?
gcc version 7.0.1 20170407 (experimental) [trunk revision 246759] (Ubuntu 7-20170407-0ubuntu2)
|
|
|
|
Jean_Luc (OP)
|
|
March 26, 2019, 07:27:29 PM |
|
Quote from arulbero: gcc version 7.0.1 20170407 (experimental) [trunk revision 246759] (Ubuntu 7-20170407-0ubuntu2)
Ok. I observed the issue with gcc 6, but with my gcc 7.3.0 it worked. It seems that this optimization bug is still present in 7.0.1. Hmm... I will add a test on the minor version and keep the volatile for gcc < 7.3. I tried with gcc 8.2 and it also works. Thanks for the report.
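A version test of the kind described here could look like the sketch below. It is not the change that went into VanitySearch; the macro CARRY_QUALIFIER and the functions needsVolatileCarry and addWithCarryOut are hypothetical names, and the 7.3 threshold simply follows the versions reported in this thread. __GNUC__ and __GNUC_MINOR__ are the standard GCC version macros.

```cpp
#include <cstdint>

// Hypothetical helper: true when the given GCC major.minor is older than 7.3,
// the range in which this thread reports the optimization issue.
constexpr bool needsVolatileCarry(int major, int minor) {
    return major < 7 || (major == 7 && minor < 3);
}

#if defined(__GNUC__) && !defined(__clang__)
// Real GCC: decide from the compiler's own version macros.
#  if (__GNUC__ < 7) || (__GNUC__ == 7 && __GNUC_MINOR__ < 3)
#    define CARRY_QUALIFIER volatile
#  else
#    define CARRY_QUALIFIER
#  endif
#else
// Other compilers: no workaround assumed.
#  define CARRY_QUALIFIER
#endif

// Example use: the carry variable gets the qualifier only where it is needed.
inline unsigned addWithCarryOut(uint32_t a, uint32_t b, uint32_t &sum) {
    CARRY_QUALIFIER unsigned char c;
    sum = a + b;              // wraps modulo 2^32
    c = (sum < a) ? 1 : 0;    // carry out of the 32-bit addition
    return c;
}
```

Testing the minor version as well as the major one is what separates 7.0.1, where the problem was reported, from 7.3.0, where it was not.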
|
|
|
|
arulbero
Legendary
Offline
Activity: 1915
Merit: 2074
|
|
March 26, 2019, 07:50:27 PM |
|
Quote from arulbero: gcc version 7.0.1 20170407 (experimental) [trunk revision 246759] (Ubuntu 7-20170407-0ubuntu2)
Quote from Jean_Luc: Ok. I observed the issue with gcc 6, but with my gcc 7.3.0 it worked. It seems that this optimization bug is still present in 7.0.1. Hmm... I will add a test on the minor version and keep the volatile for gcc < 7.3. I tried with gcc 8.2 and it also works. Thanks for the report.
With -O0 in the makefile:
CXXFLAGS = -DWITHGPU -m64 -mssse3 -Wno-write-strings -O0 -I. -I$(CUDA)/include
it works without "volatile".
|
|
|
|
Jean_Luc (OP)
|
|
March 27, 2019, 09:39:59 AM |
|
Hello, I published a new release (1.10):
- Support for compressed private keys (tested with Electrum 3.3.4)
- Slight performance increase
Thanks for testing it. Have fun!
|
|
|
|
arulbero
Legendary
Offline
Activity: 1915
Merit: 2074
|
|
March 27, 2019, 09:54:55 AM |
|
Do you have a _ModSqrMontgomery function? I would try to compute the inverse this way:
__device__ void _ModInv(uint64_t* a) {
  uint64_t x2[4], x3[4], x6[4], x9[4], x11[4], x22[4], x44[4], x88[4], x176[4], x220[4], x223[4], t1[4];
  uint8_t j;
  /** The binary representation of (p - 2) has 5 blocks of 1s, with lengths in
   *  { 1, 2, 22, 223 }. Use an addition chain to calculate 2^n - 1 for each block:
   *  [1], [2], 3, 6, 9, 11, [22], 44, 88, 176, 220, [223]
   */
  _ModSqr(x2, a);
  _ModMult(x2, a);
  _ModSqr(x3, x2);
  _ModMult(x3, a);
  memcpy(x6, x3, 32);
  _ModSqr(x6); _ModSqr(x6); _ModSqr(x6);
  _ModMult(x6, x3);
  memcpy(x9, x6, 32);
  _ModSqr(x9); _ModSqr(x9); _ModSqr(x9);
  _ModMult(x9, x3);
  memcpy(x11, x9, 32);
  _ModSqr(x11); _ModSqr(x11);
  _ModMult(x11, x2);
  memcpy(x22, x11, 32);
  for (j=0; j<11; j++) { _ModSqr(x22); }
  _ModMult(x22, x11);
  memcpy(x44, x22, 32);
  for (j=0; j<22; j++) { _ModSqr(x44); }
  _ModMult(x44, x22);
  memcpy(x88, x44, 32);
  for (j=0; j<44; j++) { _ModSqr(x88); }
  _ModMult(x88, x44);
  memcpy(x176, x88, 32);
  for (j=0; j<88; j++) { _ModSqr(x176); }
  _ModMult(x176, x88);
  memcpy(x220, x176, 32);
  for (j=0; j<44; j++) { _ModSqr(x220); }
  _ModMult(x220, x44);
  memcpy(x223, x220, 32);
  _ModSqr(x223); _ModSqr(x223); _ModSqr(x223);
  _ModMult(x223, x3);
  /* The final result is then assembled using a sliding window over the blocks. */
  memcpy(t1, x223, 32);
  for (j=0; j<23; j++) { _ModSqr(t1); }
  _ModMult(t1, x22);
  _ModSqr(t1); _ModSqr(t1); _ModSqr(t1); _ModSqr(t1); _ModSqr(t1);
  _ModMult(t1, a);
  _ModSqr(t1); _ModSqr(t1); _ModSqr(t1);
  _ModMult(t1, t1, x2);
  _ModSqr(t1); _ModSqr(t1);
  _ModMult(a, t1);
}
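The routine above computes a^(p-2) mod p, which equals the modular inverse of a by Fermat's little theorem, using a fixed addition chain tailored to the secp256k1 prime. The same idea can be shown on a small scale with the sketch below. Note the substitutions: it uses plain square-and-multiply instead of an addition chain, and an illustrative 32-bit prime (2^32 - 5) instead of the 256-bit one, so that ordinary 64-bit arithmetic is enough. The names kPrime, mulMod and invMod are invented for the example.

```cpp
#include <cstdint>

// Illustrative 32-bit prime (2^32 - 5); the thread's code targets the
// 256-bit secp256k1 prime instead.
constexpr uint64_t kPrime = 4294967291ULL;

// Modular product; both operands must already be reduced below kPrime,
// so the intermediate product fits in 64 bits.
inline uint64_t mulMod(uint64_t a, uint64_t b) {
    return (a * b) % kPrime;
}

// Modular inverse by Fermat's little theorem: a^(p-2) mod p,
// valid for any a that is not a multiple of p.
inline uint64_t invMod(uint64_t a) {
    uint64_t result = 1;
    uint64_t base = a % kPrime;
    uint64_t exponent = kPrime - 2;
    while (exponent != 0) {
        if (exponent & 1) {
            result = mulMod(result, base);
        }
        base = mulMod(base, base);
        exponent >>= 1;
    }
    return result;
}
```

Square-and-multiply spends one multiplication per set bit of the exponent. For an exponent that is almost all 1s, as p - 2 is for secp256k1, an addition chain like the one in the post reuses the blocks of 1s and needs far fewer multiplications, while the number of squarings stays about the same.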
|
|
|
|
Jean_Luc (OP)
|
|
March 27, 2019, 10:36:36 AM Last edit: March 27, 2019, 12:02:52 PM by Jean_Luc |
|
Quote from arulbero: Do you have a _ModSqrMontgomery function?
No. On the CPU, the DRS62 ModInv costs about as much as 160 ModSquareK1() calls; however, DRS62 works for any odd prime. An optimization can also be made for the SecpK1 prime, as there are 2 multiplications by P.
DRS62: 362.696 KiloI/sec
ModSquareK1: 58.717 MegaS/sec
On the GPU, the 62-bit right shift can also be optimized with the funnel shift.
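To make the last remark concrete, here is a sketch of a multi-word right shift in plain C++. Each output limb is assembled from two neighbouring input limbs, which is the pattern a hardware funnel shift produces in a single instruction (CUDA exposes 32-bit intrinsics such as __funnelshift_r for it). The function shiftRight and its limb layout are assumptions made for illustration; VanitySearch's own representation may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Shift a little-endian array of 64-bit limbs right by n bits, 0 < n < 64.
// count must be at least 1. Zeros are shifted in at the top.
inline void shiftRight(uint64_t *limbs, std::size_t count, unsigned n) {
    for (std::size_t i = 0; i + 1 < count; ++i) {
        // Low part comes from this limb, high part from the next one.
        limbs[i] = (limbs[i] >> n) | (limbs[i + 1] << (64 - n));
    }
    limbs[count - 1] >>= n;
}
```

For the 62-bit step mentioned in the post, the call would be shiftRight(limbs, count, 62).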
|
|
|
|
OgNasty
Donator
Legendary
Offline
Activity: 4858
Merit: 4604
Leading Crypto Sports Betting & Casino Platform
|
|
March 27, 2019, 04:12:13 PM |
|
Quote from Jean_Luc: Hello, I published a new release (1.10): - Support for compressed private keys (tested with Electrum 3.3.4) - Slight performance increase. Thanks for testing it. Have fun!
Thanks for adding the version number!
|
|
|
|
Jean_Luc (OP)
|
|
March 27, 2019, 05:17:41 PM |
|
Quote from OgNasty: Thanks for adding the version number!
You're welcome. Anyway, I managed to get hold of a used GTX 1050 Ti, and I should be able to implement the funnel shift (for compute cap > 3.5), which should speed up hashing and the ModInv 62-bit shift (unless nvcc is smart enough to use the funnel shift on its own when it sees something like ((x>>(32-n))|(x<<n))).
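The expression quoted at the end of this post is the usual rotate-left idiom. A portable C++ version is sketched below; the name rotl32 is invented for the example. In CUDA, the intrinsic __funnelshift_l(x, x, n) yields the same rotation when both inputs are the same word, though whether nvcc emits the funnel shift for the plain expression is exactly the open question raised in the post.

```cpp
#include <cstdint>

// Rotate a 32-bit word left by n bits. The masking keeps both shift counts
// in the range 0..31, so n == 0 does not trigger an undefined 32-bit shift.
inline uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```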
|
|
|
|
Jean_Luc (OP)
|
|
March 28, 2019, 02:01:13 PM |
|
Hello, I set up the GTX 1050 Ti and implemented the funnel shift for the SHA and RIPE rotations (not yet for ModInv). I was expecting a more significant performance increase (I got only a little less than 3%). Better than nothing.
C:\C++\VanitySearch\x64\ReleaseSM30>VanitySearch.exe -t 0 -gpu 1Testtttt
VanitySearch v1.11
Difficulty: 2988734397852221
Search: 1Testtttt [Compressed]
Start Thu Mar 28 14:48:27 2019
Base Key:3ECA27E3A98E4267E3D308CAA7E66B8972C31C4C02A7D16616BA46C32C59AFAC
Number of CPU thread: 0
GPU: GPU #0 GeForce GTX 1050 Ti (6x128 cores) Grid(48x128)
220.180 MK/s (GPU 220.180 MK/s) (2^32.76) [P 0.00%][50.00% in 109.4d][0]
C:\C++\VanitySearch\x64\ReleaseSM30>VanitySearch.exe -t 0 -gpu 1Testtttt
VanitySearch v1.11
Difficulty: 2988734397852221
Search: 1Testtttt [Compressed]
Start Thu Mar 28 14:51:10 2019
Base Key:7B8EEDDA6E7E418C9639AB5BBF0C14D2487D676ADDE6FC494F2504D3A026EF3B
Number of CPU thread: 0
GPU: GPU #0 GeForce GTX 1050 Ti (6x128 cores) Grid(48x128)
226.483 MK/s (GPU 226.483 MK/s) (2^32.85) [P 0.00%][50.00% in 106.4d][0]
|
|
|
|
asche
Legendary
Offline
Activity: 1484
Merit: 1491
I forgot more than you will ever know.
|
|
March 28, 2019, 04:57:49 PM |
|
Quote from Jean_Luc: Better than nothing.
Is there a theoretical model that would make it possible to calculate the maximum performance for given hardware? This would give an idea of how much more optimization you can achieve.
|
|
|
|
|