|
samr7 (OP)
Full Member
Offline
Activity: 140
Merit: 430
Firstbits: 1samr7
|
|
July 12, 2011, 09:52:30 AM |
|
New version 0.10 is up.
This version is approx. 6X (!!) faster at prefix matching, thanks to an OpenSSL optimization for quickly computing batches of modular inverses. This optimization also makes the cost of regular expressions much more acute. The search rate for matching a single regular expression only improved by about 3X, and overall is approx. 1/3 the speed of a prefix match.
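For readers wondering how a batch of modular inverses can be cheaper than inverting one at a time: a standard technique is Montgomery's trick, which replaces n inversions with one inversion plus roughly 3n multiplications. The sketch below is illustrative only. It assumes this is the general idea behind the OpenSSL optimization mentioned above; the function name `batch_inverse` is hypothetical and nothing here is taken from vanitygen's source.

```python
def batch_inverse(values, p):
    """Invert every element of `values` modulo the prime `p`
    using a single modular inversion (Montgomery's trick)."""
    n = len(values)
    if n == 0:
        return []

    # prefix[i] holds values[0] * ... * values[i] (mod p)
    prefix = [0] * n
    acc = 1
    for i, v in enumerate(values):
        if v % p == 0:
            raise ZeroDivisionError("0 has no inverse modulo p")
        acc = acc * v % p
        prefix[i] = acc

    # The only expensive step: invert the product of all values.
    # pow(x, -1, p) requires Python 3.8 or newer.
    inv_acc = pow(acc, -1, p)

    # Walk backwards, peeling one factor off the running inverse each time.
    result = [0] * n
    for i in range(n - 1, 0, -1):
        result[i] = inv_acc * prefix[i - 1] % p
        inv_acc = inv_acc * values[i] % p
    result[0] = inv_acc
    return result


# Field prime of secp256k1, the curve used for Bitcoin keys.
SECP256K1_P = 2**256 - 2**32 - 977

if __name__ == "__main__":
    vals = [2, 3, 5, 123456789]
    for v, inv in zip(vals, batch_inverse(vals, SECP256K1_P)):
        print(v, (v * inv) % SECP256K1_P)  # second column is always 1
```

The larger the batch, the better the single inversion is amortized, which fits with prefix matching (pure key generation) gaining more than regex matching, where the matcher itself costs time.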
|
|
|
|
Shevek
|
|
July 12, 2011, 10:50:52 AM |
|
New version 0.10 is up.
This version is approx. 6X (!!) faster at prefix matching, thanks to an OpenSSL optimization for quickly computing batches of modular inverses. This optimization also makes the cost of regular expressions much more acute. The search rate for matching a single regular expression only improved by about 3X, and overall is approx. 1/3 the speed of a prefix match.
Congratz! But... any news about entropy import?
|
Proposals for improving bitcoin are like asses: everybody has one 1SheveKuPHpzpLqSvPSavik9wnC51voBa
|
|
|
pc
|
|
July 12, 2011, 11:38:23 AM |
|
I have a dual-quad-core Mac Pro with hyperthreading, and on previous versions if I ran at 8 threads I got optimal performance, but I noticed with the new version that at 8 threads I was still only using "400%" of a cpu, so I tried running at 4 threads instead and got up to 300000 K/s instead of around 200000 K/s. So, I don't know if others have a similar configuration, but it might be good to play around with the number of threads to try to hit the optimal rate for your platform.
Thank you very much for this.
|
|
|
|
dserrano5
Legendary
Offline
Activity: 1974
Merit: 1030
|
|
July 12, 2011, 01:33:22 PM |
|
I'm seeing a 4x increase. I don't mind not getting 6x; 4x is an amazing improvement in any case.
|
|
|
|
samr7 (OP)
|
|
July 12, 2011, 01:54:26 PM |
|
I have a dual-quad-core Mac Pro with hyperthreading, and on previous versions if I ran at 8 threads I got optimal performance, but I noticed with the new version that at 8 threads I was still only using "400%" of a cpu, so I tried running at 4 threads instead and got up to 300000 K/s instead of around 200000 K/s. So, I don't know if others have a similar configuration, but it might be good to play around with the number of threads to try to hit the optimal rate for your platform.
Great, negative scalability. Are you using regular expressions? How fast does it run with just one thread?
I'm seeing a 4x increase. I don't mind not getting 6x; 4x is an amazing improvement in any case.
an0therlr3, you might be noticing some scalability issues as well. If you have a sec, give some details. Are you using prefixes? How many cores, how fast, and how fast with a single thread?
|
|
|
|
dserrano5
|
|
July 12, 2011, 02:34:16 PM |
|
If you have a sec, give some details. Are you using prefixes? How many cores, how fast, and how fast with a single thread?
Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (8 cores):
$ ./vg-0.6 -it1 1Loaners & sleep 10; kill $!
[1] 30177
Difficulty: 28173812690
[28363 K/s][total 280000][Prob 0.0%][50% in 8.0d]
$ ./vg-0.6 -i 1Loaners & sleep 10; kill $!
[1] 30179
Difficulty: 28173812690
[174878 K/s][total 1520000][Prob 0.0%][50% in 1.3d]
$ ./vg-0.10 -it1 1Loaners & sleep 10; kill $!
[1] 30188
Difficulty: 28173812690
[164485 K/s][total 1605696][Prob 0.0%][50% in 1.4d]
$ ./vg-0.10 -i 1Loaners & sleep 10; kill $!
[1] 30190
Difficulty: 28173812690
[884067 K/s][total 8430080][Prob 0.0%][50% in 6.1h]
v0.6 single thread to v0.6 8 threads: 174878/28363 = 6.1657x (expect 8x)
v0.6 single thread to v0.10 single thread: 164485/28363 = 5.7992x (expect 6x as announced)
v0.6 8 threads to v0.10 8 threads: 884067/174878 = 5.0553x (expect 6x as announced)
Oops, my fault, it's not 4x but 5x. I stopped vanitygen v0.6 8 hours ago and started v0.10 some minutes ago. I judged the improvement not by the rate but by the time remaining, and I suspect I didn't take into account the fact that when I stopped v0.6 this morning, it had been running for some hours and the time remaining was, of course, less than at the start.
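The ratios above are easy to recompute. Here is a small sketch that does so from the four reported rates and also derives how well each version scales from 1 to 8 threads. It assumes the runs without `-t` used all 8 cores, as the labels above indicate; the names `rates` and `speedup` are just for illustration.

```python
# Key rates in K/s as reported above, keyed by (version, threads).
rates = {
    ("0.6", 1): 28363,
    ("0.6", 8): 174878,
    ("0.10", 1): 164485,
    ("0.10", 8): 884067,
}


def speedup(new_rate, old_rate):
    """How many times faster new_rate is than old_rate."""
    return new_rate / old_rate


if __name__ == "__main__":
    # Version-to-version gains at equal thread counts.
    for threads in (1, 8):
        gain = speedup(rates[("0.10", threads)], rates[("0.6", threads)])
        print(f"v0.6 -> v0.10 at {threads} thread(s): {gain:.2f}x")

    # Scaling within one version: ideal would be 8x, i.e. 100% efficiency.
    for version in ("0.6", "0.10"):
        scale = speedup(rates[(version, 8)], rates[(version, 1)])
        print(f"v{version} 1 -> 8 threads: {scale:.2f}x "
              f"({scale / 8:.0%} parallel efficiency)")
```

Looking at efficiency rather than raw speedup makes it easier to see whether a new version is losing ground to thread contention.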
|
|
|
|
cbuchner1
|
|
July 12, 2011, 03:09:09 PM |
|
New version 0.10 is up. This version is approx. 6X (!!) faster at prefix matching
Congratulations on this optimization! I profiled vanitygen 0.9 earlier and also noticed that the inversion was taking so much time. But you have found a solution already. Are you very familiar with OpenSSL internals? It certainly seems so.
If someone can port two important functions to the GPU, one being EC_POINT_add() and the other being EC_POINTs_make_affine(), this thing will fly. Even more so when the SHA256 and RIPEMD160 hashes are also done on the GPU.
Here is the relevant profiler output. The number in the second column is the seconds spent inside the function and its children. The total execution time was about 25 seconds in this test run.
-----------------------------------------------
[3]  99.9  0.01  24.92  1              vg_thread_loop(_vg_context_s*) [3]
           0.00  12.55  249406/250471  EC_POINT_add [7]
           0.00   7.91  932/941        EC_POINTs_make_affine [9]
           0.00   1.82  219546/219548  EC_POINT_point2oct [16]
           0.00   1.53  272770/272775  SHA256 [19]
           0.01   0.71  226973/226974  RIPEMD160 [26]
Anything else is peanuts in comparison, including the prefix matching.
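To put the profile in proportion, here is a quick sketch that turns the seconds listed above into shares of the thread loop's total time. The figures are copied from the profiler output (self plus children for each entry); the names `profile` and `shares` are hypothetical.

```python
# (self seconds, children seconds) per function, from the profile above.
profile = {
    "EC_POINT_add": (0.00, 12.55),
    "EC_POINTs_make_affine": (0.00, 7.91),
    "EC_POINT_point2oct": (0.00, 1.82),
    "SHA256": (0.00, 1.53),
    "RIPEMD160": (0.01, 0.71),
}

# vg_thread_loop itself: 0.01 s self + 24.92 s in children.
TOTAL_SECONDS = 0.01 + 24.92


def shares(entries, total):
    """Return {function: fraction of total time}, largest first."""
    fractions = {name: (s + c) / total for name, (s, c) in entries.items()}
    return dict(sorted(fractions.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    for name, fraction in shares(profile, TOTAL_SECONDS).items():
        print(f"{name:24s} {fraction:6.1%}")
```

On these numbers the two elliptic-curve calls together account for over 80% of the loop, which is why they are the obvious candidates for a GPU port.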
|
|
|
|
pc
|
|
July 12, 2011, 03:10:28 PM |
|
Great, negative scalability. Are you using regular expressions? How fast does it run with just one thread?
Well, I'm pretty sure it's still faster than it was on the old version, even with the old running at 8 threads, but I'd need to recompile the older version if I wanted to compare. Just running a case-insensitive prefix:
cebu:~% nice ./Applications/vanitygen -i -t 1 1abcdefg
Difficulty: 13628644118
[80020 K/s][total 501760][Prob 0.0%][50% in 1.4d]
cebu:~% nice ./Applications/vanitygen -i -t 2 1abcdefg
Difficulty: 13628644118
[162979 K/s][total 1505280][Prob 0.0%][50% in 16.1h]
cebu:~% nice ./Applications/vanitygen -i -t 3 1abcdefg
Difficulty: 13628644118
[237562 K/s][total 903168][Prob 0.0%][50% in 11.0h]
cebu:~% nice ./Applications/vanitygen -i -t 4 1abcdefg
Difficulty: 13628644118
[299808 K/s][total 2408448][Prob 0.0%][50% in 8.8h]
Up to this point, CPU usage in Activity Monitor is about what one would expect, being roughly 100% times the number of threads.
cebu:~% nice ./Applications/vanitygen -i -t 5 1abcdefg
Difficulty: 13628644118
[262992 K/s][total 4264960][Prob 0.0%][50% in 10.0h]
With 5 threads, CPU usage hovered between 420% and 440% and the keygen rate dropped, which makes me think that there's some kind of contention for something that's not CPU-bound.
cebu:~% nice ./Applications/vanitygen -i -t 6 1abcdefg
Difficulty: 13628644118
[261357 K/s][total 9182592][Prob 0.1%][50% in 10.0h]
cebu:~% nice ./Applications/vanitygen -i -t 7 1abcdefg
Difficulty: 13628644118
[245618 K/s][total 1705984][Prob 0.0%][50% in 10.7h]
Using 6 and 7 threads was roughly the same as 5, with CPU slightly higher, maybe between 425% and 445%.
cebu:~% nice ./Applications/vanitygen -i -t 8 1abcdefg
Difficulty: 13628644118
[200385 K/s][total 2358272][Prob 0.0%][50% in 13.1h]
Using 8 threads somehow brings even more contention, with CPU hovering just around 400%. I don't remember exactly what speeds I was getting before on v0.8, but when I ran 8 threads it was using about 800% CPU, and I'm pretty sure it was well south of 200000 K/s, probably more like 100000, but I really don't remember so I wouldn't rely on that number at all.
And just for completeness, here's my hardware configuration:
Model Name: Mac Pro
Model Identifier: MacPro4,1
Processor Name: Quad-Core Intel Xeon
Processor Speed: 2.26 GHz
Number Of Processors: 2
Total Number Of Cores: 8
L2 Cache (per core): 256 KB
L3 Cache (per processor): 8 MB
Memory: 32 GB
Processor Interconnect Speed: 5.86 GT/s
Boot ROM Version: MP41.0081.B07
SMC Version (system): 1.39f5
SMC Version (processor tray): 1.39f5
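The thread-count measurements above can be summarised as speedup and per-thread efficiency relative to the single-thread run. This is only a sketch over the eight rates reported in this post; the names `rates_by_threads` and `scaling_table` are made up for illustration.

```python
# Key rate in K/s for each -t value, as reported above.
rates_by_threads = {
    1: 80020,
    2: 162979,
    3: 237562,
    4: 299808,
    5: 262992,
    6: 261357,
    7: 245618,
    8: 200385,
}


def scaling_table(rates):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread rate."""
    base = rates[1]
    return {t: (r / base, r / base / t) for t, r in sorted(rates.items())}


def best_thread_count(rates):
    """Thread count with the highest measured key rate."""
    return max(rates, key=rates.get)


if __name__ == "__main__":
    for threads, (speedup, efficiency) in scaling_table(rates_by_threads).items():
        print(f"-t {threads}: {speedup:4.2f}x speedup, {efficiency:5.1%} efficiency")
    print("best:", best_thread_count(rates_by_threads), "threads")
```

Efficiency stays above 90% through 4 threads and then falls off sharply, which is the pattern one would expect from contention on a shared resource rather than from running out of cores.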
Thanks again!
|
|
|
|
cbuchner1
|
|
July 12, 2011, 04:13:59 PM |
|
Using 8 threads somehow brings even more contention, with CPU hovering just around 400%.
I believe the contention might be caused by the pooling of EC_POINT objects before calling that make_affine function. This might now spill the contents of your L1/L2 caches, so it may be more efficient not to run hyperthreaded in this version. There are some profiling tools from Intel that would let you figure this out. I haven't used any of them yet. You could also play with that pool size.
|
|
|
|
pc
|
|
July 12, 2011, 04:29:40 PM |
|
Using 8 threads somehow brings even more contention, with CPU hovering just around 400%.
I believe the contention might be caused by the pooling of EC_POINT objects before calling that make_affine function. This might now spill the contents of your L1/L2 caches, so it may be more efficient not to run hyperthreaded in this version. There are some profiling tools from Intel that would let you figure this out. I haven't used any of them yet. You could also play with that pool size.
I don't think I was clear before: I have 8 physical cores, and hyperthreading is on, so I see 16 logical CPUs in Activity Monitor. I wasn't surprised with the older version when it maxed out performance at 8 as opposed to 16, but maxing out at 4 seems a little weird.
It's so awesome to be churning through billions of addresses. Amusing how this is even less useful than mining is, and yet somehow is more fun.
|
|
|
|
cbuchner1
|
|
July 12, 2011, 04:35:54 PM |
|
I don't think I was clear before: I have 8 physical cores, and hyperthreading is on, so I see 16 logical CPUs in Activity Monitor. I wasn't surprised with the older version when it maxed out performance at 8 as opposed to 16, but maxing out at 4 seems a little weird.
Sorry, I tend to go into denial mode if someone has better hardware than I do.
|
|
|
|
samr7 (OP)
|
|
July 12, 2011, 08:29:01 PM |
|
Well, I'm pretty sure it's still faster than it was on the old version, even with the old running at 8 threads, but I'd need to recompile the older version if I wanted to compare. Just running a case-insensitive prefix:
cebu:~% nice ./Applications/vanitygen -i -t 1 1abcdefg
Difficulty: 13628644118
[80020 K/s][total 501760][Prob 0.0%][50% in 1.4d]
That's oddly slow; you should be getting about twice that key rate on that CPU.
cebu:~% nice ./Applications/vanitygen -i -t 4 1abcdefg
Difficulty: 13628644118
[299808 K/s][total 2408448][Prob 0.0%][50% in 8.8h]
Up to this point, CPU usage in Activity Monitor is about what one would expect, being roughly 100% times the number of threads.
cebu:~% nice ./Applications/vanitygen -i -t 5 1abcdefg
Difficulty: 13628644118
[262992 K/s][total 4264960][Prob 0.0%][50% in 10.0h]
With 5 threads, CPU usage hovered between 420% and 440% and the keygen rate dropped, which makes me think that there's some kind of contention for something that's not CPU-bound.
Indeed! Try running two instances at four threads each. If the OS X scheduler is smart, it will isolate each to a processor package to minimize the cost of contention.
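The two-instance experiment can be scripted rather than typed into two terminals. A minimal sketch, assuming the binary path and pattern from the runs quoted above and using only the `-i` and `-t` flags that appear in this thread; the helper names `build_commands` and `launch` are hypothetical. Whether the scheduler actually places each instance on its own processor package is up to the OS; the script does no pinning.

```python
import subprocess


def build_commands(binary, pattern, instances=2, threads=4):
    """Build one argv list per vanitygen instance."""
    return [[binary, "-i", "-t", str(threads), pattern]
            for _ in range(instances)]


def launch(commands):
    """Start every instance in parallel and return the Popen handles."""
    return [subprocess.Popen(cmd) for cmd in commands]


if __name__ == "__main__":
    # Path and pattern as used in the runs quoted above; adjust to taste.
    procs = launch(build_commands("./Applications/vanitygen", "1abcdefg"))
    for proc in procs:
        proc.wait()
```

Each instance prints its own rate, so the combined throughput is the sum of the two figures.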
|
|
|
|
pc
|
|
July 12, 2011, 09:42:10 PM |
|
Indeed! Try running two instances at four threads each. If the OS X scheduler is smart, it will isolate each to a processor package to minimize the cost of contention.
Fascinating. Running two instances at four threads each gives me each instance running about 260000–275000 K/s or so, and each taking up a bit under 400% (probably about as much as they can with the other programs I have running here).
|
|
|
|
Joric
Member
Offline
Activity: 67
Merit: 130
|
|
July 12, 2011, 10:23:43 PM |
|
Just built a script (pywallet.py 1.0) that allows exporting and importing private keys in shortened format (mostly as a lightweight alternative to showwallet for those who didn't manage to compile the branch). Requires only the openssl libs (for elliptic curve cryptography). URL: https://github.com/joric/pywallet
|
1JoricCBkW8C5m7QUZMwoRz9rBCM6ZSy96
|
|
|
bitlotto
|
|
July 12, 2011, 11:09:16 PM |
|
Just built a script (pywallet.py 1.0) that allows exporting and importing private keys in shortened format (mostly as a lightweight alternative to showwallet for those who didn't manage to compile the branch). Requires only the openssl libs (for elliptic curve cryptography). URL: https://github.com/joric/pywallet
COOL! So I'm assuming you run it with Bitcoin closed and force a rescan so it rescans the entire blockchain.
|
*Next Draw Feb 1* BitLotto: monthly raffle (0.25 BTC per ticket) Completely transparent and impossible to manipulate who wins. TOR TOR2WEB Donations to: 1JQdiQsjhV2uJ4Y8HFtdqteJsZhv835a8J are appreciated.
|
|
|
samr7 (OP)
|
|
July 12, 2011, 11:39:50 PM |
|
New version 0.11 posted. - Allow the RNG to be seeded from a file, suggested by Shevek
- Tweak the synchronization on the pattern list
Fascinating. Running two instances at four threads each gives me each instance running about 260000–275000 K/s or so, and each taking up a bit under 400% (probably about as much as they can with the other programs I have running here).
Try a single instance of the new version. It should make a lot fewer pthread synchronization calls, and hopefully scale better on your multi-processor machine. However, I'm still stumped on why each thread is getting about 1/2 the expected key rate. You should be able to do >1MK/s on that machine.
|
|
|
|
bmgjet
Member
Offline
Activity: 98
Merit: 10
|
|
July 13, 2011, 12:12:31 AM |
|
V0.10 is a big improvement for me. Went from finding 1 address per day to finding 4. Just using an LE Sempron, since running vanitygen takes the CPU from 18C idle to 29C at full load and power usage goes up by 10W. My desktop is way quicker and finds an address every 2-3 hours, but I don't like running it at full speed since the temp goes up to 58C and it uses 130W more lol.
Still would love to see what it's like on a GPU.
|
|
|
|
pc
|
|
July 13, 2011, 12:27:21 AM |
|
New version 0.11 posted.
Try a single instance of the new version. It should make a lot fewer pthread synchronization calls, and hopefully scale better on your multi-processor machine. However, I'm still stumped on why each thread is getting about 1/2 the expected key rate. You should be able to do >1MK/s on that machine.
Yes, this seems to be scaling much better. 550000–575000 K/s on 8 threads, 320000 or so on 4 threads, 82000 on 1 thread. Thank you very much. And for anyone else compiling on a Mac, I have to add "-I/Developer/SDKs/MacOSX10.5.sdk/usr/include/php/ext/pcre/pcrelib/" to the makefile flags for it to find <pcre.h>. Perhaps there's a better way to get it into the build, but it seems to work for me.
|
|
|
|
Shevek
|
|
July 13, 2011, 10:02:41 AM |
|
New version 0.11 posted. - Allow the RNG to be seeded from a file, suggested by Shevek
- Tweak the synchronization on the pattern list
Thanks for the seed option! I've tested the code. A "break;" statement is needed after "seedfile = optarg;". With that added, the program works perfectly!
|
Proposals for improving bitcoin are like asses: everybody has one 1SheveKuPHpzpLqSvPSavik9wnC51voBa
|
|
|
|