joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
August 03, 2016, 05:03:00 AM
Quote from: joblo
Things are looking up. I solved the alignment problem and made progress with performance. I've added 3% to Lyra2 so far and I have a few functions left to convert. So far I've only implemented AVX2; I still have to do AVX implementations of all functions. The improvements are to the lyra2 core so they should also help lyra2v2.

Quote from: johnsmithx
You are talking about Lyra2RE, right? So it will get sped up? Because so far tpruvot-cpuminer-multi's Lyra2RE is faster than yours by some 3.8% on my servers. I guess it might be because of -flto that I use with his but can't use with yours (doesn't compile).

I think you are doing something wrong. Lyra2RE in cpuminer-opt v3.3.7 was improved to be 7% faster than cpuminer-multi. In the next release it will be another 3% faster. If you have very old CPUs (e.g. Core 2) you won't get the benefits of my optimisations.

joblo (OP)
August 04, 2016, 02:40:46 AM
I've found more AVX2 optimizations, converted cubehash from SSE2 to AVX2, and improved lyra2v2 by 19% and the X algos by 3-5%. Cubehash will probably be the biggest single AVX2 optimization next to Hodl. I don't know how much more I can find. Optiminer's code was an inspiration from which I learned a lot.
A lot more work to do before release.

johnsmithx
August 04, 2016, 03:50:28 AM
Quote from: joblo
Things are looking up. I solved the alignment problem and made progress with performance. I've added 3% to Lyra2 so far and I have a few functions left to convert. So far I've only implemented AVX2; I still have to do AVX implementations of all functions. The improvements are to the lyra2 core so they should also help lyra2v2.

Quote from: johnsmithx
You are talking about Lyra2RE, right? So it will get sped up? Because so far tpruvot-cpuminer-multi's Lyra2RE is faster than yours by some 3.8% on my servers. I guess it might be because of -flto that I use with his but can't use with yours (doesn't compile).

Quote from: joblo
I think you are doing something wrong. Lyra2RE in cpuminer-opt v3.3.7 was improved to be 7% faster than cpuminer-multi. In the next release it will be another 3% faster. If you have very old CPUs (e.g. Core 2) you won't get the benefits of my optimisations.

I know you didn't mean to, but what you said was very amusing. The whole time I am talking about servers, the real servers in data centers, not some desktop PC in your home that you call "a server", and you come back at me with "core2", a 10-year-old super-obsolete desktop CPU. Hilarious!

But maybe you are right, maybe I am doing something wrong, but it's not the hardware. I am using the up-to-date Ubuntu 16.04; if I messed something up then it must be the flags in the build.sh. Or maybe you did something wrong, actually. Maybe you didn't compile tpruvot's cpuminer with -flto, so now you are competing with crippled software, because this flag does make a difference and it's not on by default: it's commented out in the build.sh, so you have to enable it.

Either way, instead of accusing each other let's try to make things better. In this spirit I made a little test for you. I picked the most powerful server AWS has to offer, the x1.32xlarge (https://aws.amazon.com/ec2/instance-types/x1/) with 128 cores and 1952 GB memory.
Here are the specs:

root@xxx:~/# grep -e name -e flags /proc/cpuinfo | head -n2
model name : Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida

root@xxx:~/# grep processor /proc/cpuinfo | wc -l
128

root@xxx:~/# free -h
        total   used   free   shared   buff/cache   available
Mem:    1.9T    3.3G   1.9T   9.1M     624M         1.9T
Swap:   0B      0B     0B
Now I recompiled tpruvot's miner from scratch and here is the result. (I noticed that on many-core machines it is sometimes actually faster to use just half the threads; here on x1.32xlarge the difference is marginal, on g2.8xlarge it's quite significant. The spike at the beginning of the full-core run I attribute to some throttling done by Amazon; it's just a VPS after all, not a dedicated server.)

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:44:55] Total: 8903 kH/s
[2016-08-04 02:44:59] Total: 8833 kH/s
[2016-08-04 02:45:04] Total: 8735 kH/s
[2016-08-04 02:45:09] Total: 8728 kH/s
[2016-08-04 02:45:14] Total: 8727 kH/s

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:45:21] Total: 11661 kH/s
[2016-08-04 02:45:25] Total: 11568 kH/s
[2016-08-04 02:45:30] Total: 8731 kH/s
[2016-08-04 02:45:35] Total: 8720 kH/s
[2016-08-04 02:45:40] Total: 8722 kH/s
[2016-08-04 02:45:45] Total: 8703 kH/s
Now let's look at you:

CPU: Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 4 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES
Start mining with AES-AVX optimizations...

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:47:57] Total: 4128.77 kH, 8660.05 kH/s
[2016-08-04 02:48:01] Total: 35.19 MH, 8668.56 kH/s
[2016-08-04 02:48:06] Total: 43.34 MH, 8590.78 kH/s
[2016-08-04 02:48:11] Total: 42.95 MH, 8600.93 kH/s
[2016-08-04 02:48:16] Total: 43.00 MH, 8621.90 kH/s

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:48:23] Total: 8323.07 kH, 11.96 MH/s
[2016-08-04 02:48:26] Total: 11.68 MH, 12.03 MH/s
[2016-08-04 02:48:31] Total: 32.93 MH, 8544.43 kH/s
[2016-08-04 02:48:36] Total: 39.36 MH, 8531.49 kH/s
[2016-08-04 02:48:41] Total: 42.52 MH, 8531.90 kH/s
[2016-08-04 02:48:46] Total: 42.33 MH, 8536.75 kH/s
[2016-08-04 02:48:51] Total: 41.88 MH, 8536.12 kH/s
[2016-08-04 02:48:56] Total: 42.41 MH, 8526.90 kH/s
Here tpruvot is faster by over 2%. So let's look at the build.sh. Tpruvot's:

root@xxx:~/tpruvot-cpuminer-multi# cat build.sh
#!/bin/bash

if [ "$OS" = "Windows_NT" ]; then
    ./mingw64.sh
    exit 0
fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -flto -fuse-linker-plugin -ftree-loop-if-convert-stores"

if [ ! "0" = `cat /proc/cpuinfo | grep -c avx` ]; then
    # march native doesn't always works, ex. some Pentium Gxxx (no avx)
    extracflags="$extracflags -march=native"
fi

./configure --with-crypto --with-curl CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg"

make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer
Yours ("customized" by me, but all I actually did was take most of the flags from tpruvot's as long as it still compiled; maybe I messed it up?):

root@xxx:~/cpuminer-opt-3.3.9# cat build.sh
#!/bin/bash

#if [ "$OS" = "Windows_NT" ]; then
#    ./mingw64.sh
#    exit 0
#fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -fuse-linker-plugin -ftree-loop-if-convert-stores"

CFLAGS="-O3 $extracflags -march=native -DUSE_ASM" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl

make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer
You don't have -flto and -pg (yours doesn't compile with either), and he doesn't have -std=gnu++11 (his doesn't compile with it either). Now let's see what all the -flto fuss is about. What happens if I compile tpruvot without it:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:56:24] Total: 356.42 kH/s
[2016-08-04 02:56:29] Total: 352.18 kH/s
[2016-08-04 02:56:34] Total: 352.00 kH/s
[2016-08-04 02:56:39] Total: 352.03 kH/s
[2016-08-04 02:56:44] Total: 352.09 kH/s

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:57:29] Total: 358.48 kH/s
[2016-08-04 02:57:34] Total: 357.76 kH/s
[2016-08-04 02:57:39] Total: 357.96 kH/s
It turned into a snail. As if it couldn't manage the multiplexing or something. So let's try just 1 thread:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:57:45] Total: 124.22 kH/s
[2016-08-04 02:57:50] Total: 124.66 kH/s
[2016-08-04 02:57:55] Total: 125.81 kH/s
[2016-08-04 02:58:00] Total: 126.19 kH/s
[2016-08-04 02:58:05] Total: 126.84 kH/s

And yours:

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:58:25] Total: 65.54 kH, 136.75 kH/s
[2016-08-04 02:58:30] Total: 683.77 kH, 138.37 kH/s
[2016-08-04 02:58:35] Total: 691.85 kH, 140.37 kH/s
[2016-08-04 02:58:40] Total: 701.86 kH, 140.28 kH/s
[2016-08-04 02:58:45] Total: 701.42 kH, 140.33 kH/s
[2016-08-04 02:58:50] Total: 701.68 kH, 140.74 kH/s

You are clearly faster if he doesn't use -flto. But if I turn -flto back on and recompile:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:59:54] Total: 138.24 kH/s
[2016-08-04 02:59:58] Total: 139.78 kH/s
[2016-08-04 03:00:03] Total: 140.41 kH/s
[2016-08-04 03:00:08] Total: 140.39 kH/s
[2016-08-04 03:00:13] Total: 141.38 kH/s
[2016-08-04 03:00:18] Total: 141.65 kH/s
[2016-08-04 03:00:23] Total: 141.64 kH/s
[2016-08-04 03:00:28] Total: 142.04 kH/s
[2016-08-04 03:00:33] Total: 141.98 kH/s

He is clearly faster after all. Now how do you do with 8 threads?

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:00:53] Total: 524.29 kH, 1128.89 kH/s
[2016-08-04 03:00:58] Total: 5644.46 kH, 1130.25 kH/s
[2016-08-04 03:01:03] Total: 5651.25 kH, 1130.72 kH/s
[2016-08-04 03:01:08] Total: 5653.59 kH, 1130.44 kH/s
[2016-08-04 03:01:13] Total: 5652.21 kH, 1130.53 kH/s
[2016-08-04 03:01:18] Total: 5652.64 kH, 1130.45 kH/s

And him with -flto?

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:01:29] Total: 1143 kH/s
[2016-08-04 03:01:34] Total: 1144 kH/s
[2016-08-04 03:01:39] Total: 1144 kH/s
[2016-08-04 03:01:44] Total: 1144 kH/s
[2016-08-04 03:01:49] Total: 1144 kH/s
[2016-08-04 03:01:54] Total: 1145 kH/s

And him without -flto:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:03:13] Total: 637.08 kH/s
[2016-08-04 03:03:17] Total: 611.28 kH/s
[2016-08-04 03:03:22] Total: 605.97 kH/s
[2016-08-04 03:03:27] Total: 605.81 kH/s
[2016-08-04 03:03:32] Total: 605.90 kH/s
I support very much your optimizing effort and whenever you need I will gladly do tests for you on various machines.
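
The comparisons in this post are read off the last few "Total:" lines by eye. Below is a minimal sketch of how the steady-state average and the percentage gap could be computed from saved benchmark output. The helper names `mean_khs` and `pct_faster` are hypothetical (they are not part of either miner), and the parsing assumes only the two log formats shown above.

```shell
#!/bin/sh
# Hypothetical helpers for comparing benchmark logs; not part of cpuminer-opt
# or cpuminer-multi. They assume lines shaped like the ones in this post:
#   [2016-08-04 02:45:14] Total: 8727 kH/s
#   [2016-08-04 02:48:16] Total: 43.00 MH, 8621.90 kH/s

# mean_khs [SKIP]: read log lines on stdin, ignore the first SKIP "Total:"
# samples (warm-up), and print the mean hashrate of the rest in kH/s.
mean_khs() {
    awk -v skip="${1:-0}" '
        /Total:/ {
            unit = $NF
            # Lines with an unexpected unit are ignored rather than guessed at.
            if (unit != "kH/s" && unit != "MH/s") next
            # The rate is the number just before the unit; MH/s becomes kH/s.
            rate = $(NF - 1) * (unit == "MH/s" ? 1000 : 1)
            if (++seen > skip) { sum += rate; n++ }
        }
        END { if (n) printf "%.2f\n", sum / n }'
}

# pct_faster A B: print by how many percent rate A exceeds rate B.
pct_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (a - b) / b * 100 }'
}
```

With the output of each run saved to a file (say `multi.log` and `opt.log`, names chosen here only for illustration), `mean_khs 2 < multi.log` would skip the first two samples, which is where the start-up spike shows up in the 128-thread runs, and `pct_faster` can then be applied to the two means.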

joblo (OP)
August 04, 2016, 04:54:55 AM
Quote from: johnsmithx
I support very much your optimizing effort and whenever you need I will gladly do tests for you on various machines.

I appreciate the detailed report and the humour. It will take some time to digest it all but I have a couple of comments.

You clearly have all the optimization features in your CPU and the miner was running the optimized code, so that is eliminated.

I tested on an i7-6700K with 8 threads and saw no difference between multi with or without -flto. Specifically I get:

multi: 920-922 kH/s
opt v3.3.9: 995
opt v3.3.8: 930
opt dev: 1025

Furthermore I can compile opt with both -flto and -pg, again with no performance difference.

I took the Lyra2RE code, and a lot more, directly from multi and I don't think he has made any changes to it. I've been tinkering with Lyra2RE for a while and only recently made any significant progress. C++11 is required to support an algo not included in multi.

I can only speculate that your compiler version may be the issue. I am using 4.8.4 and you 5.4.0. I seem to recall someone else, Wolf maybe, having compile problems using a more recent version of gcc. I'm too lazy to look back in the thread to find it. If you can find it you could compare notes.

I'll read through your report in more detail. If any ideas come to mind I'll post an update.

gimomars
Newbie
Offline
Activity: 34
Merit: 0
August 04, 2016, 05:55:27 AM
Hi experts, I'm trying to build on my openSUSE Linux, but I get an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

johnsmithx
August 04, 2016, 08:45:32 AM
Quote from: joblo
I tested on an i7-6700K with 8 threads and saw no difference between multi with or without -flto.

One thing to keep in mind is that you are using a desktop CPU and I am using a server CPU. Those Xeons are not multiple times more expensive for no reason (sure, partly it's just branding, more reliability etc., but there are also some functional differences). Or the fact that -flto doesn't do anything for you could simply be down to the compiler. Yours is 2 years old, mine 2 months.

But the crucial information is that you can actually compile with -flto. Could you please give me the exact flags that work for you? I will make the effort and find out what the problem is on my end; I just need a starting point that compiles.

I have no idea whether he improved the performance after you took the code from him. The very last commit is 3 days old, but which one is the last one that could have had any real impact on Lyra2RE speed, or the overall speed, I am not going to investigate. But if you can remember, at least roughly, when you took his code, I can revert his tree to that date and try that version.

Quote from: gimomars
Hi experts, I'm trying to build on my openSUSE Linux, but I get an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

This problem has nothing to do with cpuminer; your build environment is messed up. Glibc is the core Linux library; if make can't find the version it needs, then there is something wrong with one of them. But if glibc were messed up, the system would hardly even boot properly. Maybe try to update?

gimomars
August 04, 2016, 09:39:20 AM
Quote from: gimomars
Hi experts, I'm trying to build on my openSUSE Linux, but I get an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

Quote from: johnsmithx
This problem has nothing to do with cpuminer; your build environment is messed up. Glibc is the core Linux library; if make can't find the version it needs, then there is something wrong with one of them. But if glibc were messed up, the system would hardly even boot properly. Maybe try to update?

Thanks for the reply. I think my Linux has an old version of glibc.

Version information:
/bin/sh:
    libdl.so.2 (GLIBC_2.2.5) => /lib64/libdl.so.2
    libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.  => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
/lib64/libreadline.so.5:
    libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6

johnsmithx
August 04, 2016, 12:09:33 PM
Quote from: gimomars
Hi experts, I'm trying to build on my openSUSE Linux, but I get an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

Quote from: johnsmithx
This problem has nothing to do with cpuminer; your build environment is messed up. Glibc is the core Linux library; if make can't find the version it needs, then there is something wrong with one of them. But if glibc were messed up, the system would hardly even boot properly. Maybe try to update?

Quote from: gimomars
Thanks for the reply. I think my Linux has an old version of glibc.

Version information:
/bin/sh:
    libdl.so.2 (GLIBC_2.2.5) => /lib64/libdl.so.2
    libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.  => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
/lib64/libreadline.so.5:
    libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6

Now you are looking at which versions of glibc these two binaries (/bin/sh and /lib64/libreadline.so.5) require. To find out which version of glibc you actually have, just run the library: type /lib64/libc.so.6 and press Enter. Presumably it will be equal to or higher than what /bin/sh requires and lower than what 'make' requires. But that would mean that you didn't install 'make' the standard way via the package system but somehow sideways. What did you do to your SUSE?!?

joblo (OP)
August 04, 2016, 01:24:07 PM (last edit: August 04, 2016, 02:19:45 PM by joblo)
Quote from: joblo
I tested on an i7-6700K with 8 threads and saw no difference between multi with or without -flto.

Quote from: johnsmithx
One thing to keep in mind is that you are using a desktop CPU and I am using a server CPU. Those Xeons are not multiple times more expensive for no reason (sure, partly it's just branding, more reliability etc., but there are also some functional differences). Or the fact that -flto doesn't do anything for you could simply be down to the compiler. Yours is 2 years old, mine 2 months.

But the crucial information is that you can actually compile with -flto. Could you please give me the exact flags that work for you? I will make the effort and find out what the problem is on my end; I just need a starting point that compiles.

I have no idea whether he improved the performance after you took the code from him. The very last commit is 3 days old, but which one is the last one that could have had any real impact on Lyra2RE speed, or the overall speed, I am not going to investigate. But if you can remember, at least roughly, when you took his code, I can revert his tree to that date and try that version.

I considered things like a larger cache and a faster memory interface as likely advantages of a server-grade CPU, but I can't figure out how that would produce inconsistent results. I'm still leaning toward the compiler. My dusty memories seem to also have LTO in them; I may have to dig back in the thread to refresh them. Your compile errors might also jog some memories. I don't think the problem was in code imported from multi, more likely one of the ugly SSE2 optimized macros.

When I compiled I used the flags from build.sh + -flto -pg.

It is an issue worth pursuing; I can't stay on the old compiler forever. One way to determine whether it's a compiler issue or a CPU issue is with a VM. I don't know whether you would start up a VM or boot an older version of Ubuntu, but a direct comparison of different compilers on the same HW would help in understanding what's going on.

I'm a little busy right now testing the latest AVX2 optimisations to get them released, but after that I'll look into it more.

Edit: I found the post from Wolf0 about his experience compiling cpuminer-opt with gcc 6.1.1. Does it look similar to yours?
https://bitcointalk.org/index.php?topic=1326803.msg15140799#msg15140799

joblo (OP)
August 04, 2016, 06:14:54 PM
cpuminer-opt v3.4.0 is released.

A compile error on Windows was introduced in v3.3.9 and has been fixed. X11gost had also been broken since v3.3.7 and has also been fixed.

The big news is more AVX2 optimizations inspired by Optiminer's work on the Hodl algo. See OP for details. The entire Cubehash function was converted from SSE2 to AVX2, which improved all algos that use it. Some AVX2 optimizations were also done to the Lyra2 core, improving both Lyra2RE and Lyra2REv2. Those were the easy ones; I don't know how much more I can find. See OP for the list of improved algos.

Source: https://drive.google.com/file/d/0B0lVSGQYLJIZbFB1WThUZ09JbVk/view?usp=sharing

clipto
Member
Offline
Activity: 311
Merit: 10
August 04, 2016, 08:03:58 PM
Great stuff, will it be released for Windows too?

joblo (OP)
August 04, 2016, 08:06:41 PM
Quote from: clipto
Great stuff, will it be released for Windows too?

I'm hoping. CMB have been good; it seems they skipped v3.3.9 because it didn't compile. I hope they pick up v3.4.0.

clipto
August 04, 2016, 08:08:54 PM
3.3.7 gave me a better hashrate than 3.3.8 on Lyra2RE, so I'm still running that. I'm looking forward to the increased performance, but I'm bound to Windows.

joblo (OP)
August 04, 2016, 08:36:49 PM (last edit: August 04, 2016, 10:32:30 PM by joblo)
Quote from: clipto
3.3.7 gave me a better hashrate than 3.3.8 on Lyra2RE, so I'm still running that. I'm looking forward to the increased performance, but I'm bound to Windows.

There should be a slight increase between 3.3.7 and 3.3.8. I suspect data alignment issues. I've noticed different hashrates on different runs of the same version. It doesn't seem to be related to other CPU activity.

johnsmithx
August 04, 2016, 11:49:08 PM
joblo, that's an excellent improvement! Now you are definitely faster than tpruvot, by at least 4.8%. I made a better, repeatable benchmark of tpruvot's; the numbers are directly in the build.sh:

#!/bin/bash

if [ "$OS" = "Windows_NT" ]; then
    ./mingw64.sh
    exit 0
fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -w -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -fuse-linker-plugin -ftree-loop-if-convert-stores"

if [ ! "0" = `cat /proc/cpuinfo | grep -c avx` ]; then
    # march native doesn't always works, ex. some Pentium Gxxx (no avx)
    extracflags="$extracflags -march=native"
fi

# Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz 4 threads (d2.xlarge)

#309-310
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" ./configure --with-crypto --with-curl

#311-312
CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="-std=gnu++11" ./configure --with-crypto --with-curl

#281
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS" ./configure --with-crypto --with-curl

#280
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl

#269
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" ./configure --with-crypto --with-curl

#264
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="-std=gnu++11" ./configure --with-crypto --with-curl

#242
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS" ./configure --with-crypto --with-curl

#245
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl

make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer
So with him I get 312 at best on this particular machine, and that config of flags is basically the default if you uncomment everything, so I didn't make him any faster; I just proved he can be much slower if the wrong flags are used. Now with yours, without any change, the untouched cpuminer-opt-3.4.0.tar.gz, I get this:

root@xxx:~/cpuminer-opt# ./cpuminer -a lyra2re --benchmark

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 4 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-04 23:24:25] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-04 23:24:26] CPU #1: 65.54 kH, 82.16 kH/s
[2016-08-04 23:24:26] CPU #0: 65.54 kH, 81.67 kH/s
[2016-08-04 23:24:26] CPU #3: 65.54 kH, 81.84 kH/s
[2016-08-04 23:24:26] Total: 196.61 kH, 245.67 kH/s
[2016-08-04 23:24:26] CPU #2: 65.54 kH, 81.68 kH/s
[2016-08-04 23:24:30] CPU #0: 326.68 kH, 81.80 kH/s
[2016-08-04 23:24:30] CPU #3: 327.37 kH, 82.00 kH/s
[2016-08-04 23:24:30] Total: 785.13 kH, 327.64 kH/s
[2016-08-04 23:24:30] CPU #2: 326.73 kH, 81.75 kH/s
[2016-08-04 23:24:30] CPU #1: 328.64 kH, 82.01 kH/s
[2016-08-04 23:24:35] CPU #0: 409.02 kH, 81.78 kH/s
[2016-08-04 23:24:35] CPU #3: 409.99 kH, 81.93 kH/s
[2016-08-04 23:24:35] Total: 1474.38 kH, 327.46 kH/s
[2016-08-04 23:24:35] CPU #2: 408.76 kH, 81.73 kH/s
[2016-08-04 23:24:35] CPU #1: 410.04 kH, 81.94 kH/s
[2016-08-04 23:24:40] CPU #0: 408.89 kH, 81.78 kH/s
[2016-08-04 23:24:40] CPU #3: 409.63 kH, 81.94 kH/s
[2016-08-04 23:24:40] Total: 1637.32 kH, 327.39 kH/s
But when I add -flto I get the following error at the final link: g++ -O3 -march=native -w -flto -std=gnu++11 -Lyes/lib -Lyes/lib -o cpuminer cpuminer-cpu-miner.o cpuminer-util.o cpuminer-uint256.o cpuminer-api.o cpuminer-sysinfos.o cpuminer-algo-gate-api.o algo/groestl/cpuminer-sph_groestl.o algo/skein/cpuminer-sph_skein.o algo/bmw/cpuminer-sph_bmw.o algo/shavite/cpuminer-sph_shavite.o algo/shavite/cpuminer-shavite.o algo/echo/cpuminer-sph_echo.o algo/blake/cpuminer-sph_blake.o algo/heavy/cpuminer-sph_hefty1.o algo/blake/cpuminer-mod_blakecoin.o algo/luffa/cpuminer-sph_luffa.o algo/cubehash/cpuminer-sph_cubehash.o algo/simd/cpuminer-sph_simd.o algo/hamsi/cpuminer-sph_hamsi.o algo/fugue/cpuminer-sph_fugue.o algo/gost/cpuminer-sph_gost.o algo/jh/cpuminer-sph_jh.o algo/keccak/cpuminer-sph_keccak.o algo/keccak/cpuminer-keccak.o algo/sha3/cpuminer-sph_sha2.o algo/sha3/cpuminer-sph_sha2big.o algo/shabal/cpuminer-sph_shabal.o algo/whirlpool/cpuminer-sph_whirlpool.o crypto/cpuminer-blake2s.o crypto/cpuminer-oaes_lib.o crypto/cpuminer-c_keccak.o crypto/cpuminer-c_groestl.o crypto/cpuminer-c_blake256.o crypto/cpuminer-c_jh.o crypto/cpuminer-c_skein.o crypto/cpuminer-hash.o crypto/cpuminer-aesb.o crypto/cpuminer-magimath.o algo/argon2/cpuminer-argon2a.o algo/argon2/ar2/cpuminer-argon2.o algo/argon2/ar2/cpuminer-opt.o algo/argon2/ar2/cpuminer-cores.o algo/argon2/ar2/cpuminer-ar2-scrypt-jane.o algo/argon2/ar2/cpuminer-blake2b.o algo/cpuminer-axiom.o algo/blake/cpuminer-blake.o algo/blake/cpuminer-blake2.o algo/blake/cpuminer-blakecoin.o algo/blake/cpuminer-decred.o algo/blake/cpuminer-pentablake.o algo/bmw/cpuminer-bmw256.o algo/cubehash/sse2/cpuminer-cubehash_sse2.o algo/cryptonight/cpuminer-cryptolight.o algo/cryptonight/cpuminer-cryptonight-common.o algo/cryptonight/cpuminer-cryptonight-aesni.o algo/cryptonight/cpuminer-cryptonight.o algo/cpuminer-drop.o algo/echo/aes_ni/cpuminer-hash.o algo/cpuminer-fresh.o algo/groestl/cpuminer-groestl.o 
algo/groestl/cpuminer-myr-groestl.o algo/groestl/sse2/cpuminer-grso.o algo/groestl/sse2/cpuminer-grso-asm.o algo/groestl/aes_ni/cpuminer-hash-groestl.o algo/groestl/aes_ni/cpuminer-hash-groestl256.o algo/haval/cpuminer-haval.o algo/heavy/cpuminer-heavy.o algo/heavy/cpuminer-bastion.o algo/cpuminer-hmq1725.o algo/hodl/cpuminer-hodl.o algo/hodl/cpuminer-hodl-gate.o algo/hodl/cpuminer-hodl_arith_uint256.o algo/hodl/cpuminer-hodl_uint256.o algo/hodl/cpuminer-hash.o algo/hodl/cpuminer-hmac_sha512.o algo/hodl/cpuminer-sha256.o algo/hodl/cpuminer-sha512.o algo/hodl/cpuminer-utilstrencodings.o algo/hodl/cpuminer-hodl-wolf.o algo/hodl/cpuminer-aes.o algo/hodl/cpuminer-sha512_avx.o algo/hodl/cpuminer-sha512_avx2.o algo/cpuminer-lbry.o algo/luffa/cpuminer-luffa.o algo/luffa/sse2/cpuminer-luffa_for_sse2.o algo/lyra2/cpuminer-lyra2.o algo/lyra2/cpuminer-sponge.o algo/lyra2/cpuminer-lyra2rev2.o algo/lyra2/cpuminer-lyra2re.o algo/keccak/sse2/cpuminer-keccak.o algo/cpuminer-m7m.o algo/cpuminer-neoscrypt.o algo/cpuminer-nist5.o algo/cpuminer-pluck.o algo/quark/cpuminer-quark.o algo/qubit/cpuminer-qubit.o algo/ripemd/cpuminer-sph_ripemd.o algo/cpuminer-scrypt.o algo/scryptjane/cpuminer-scrypt-jane.o algo/sha2/cpuminer-sha2.o algo/simd/sse2/cpuminer-nist.o algo/simd/sse2/cpuminer-vector.o algo/skein/cpuminer-skein.o algo/skein/cpuminer-skein2.o algo/cpuminer-s3.o algo/tiger/cpuminer-sph_tiger.o algo/whirlpool/cpuminer-whirlpool.o algo/whirlpool/cpuminer-whirlpoolx.o algo/x11/cpuminer-x11.o algo/x11/cpuminer-x11evo.o algo/x11/cpuminer-x11gost.o algo/x11/cpuminer-c11.o algo/x13/cpuminer-x13.o algo/x14/cpuminer-x14.o algo/x15/cpuminer-x15.o algo/x17/cpuminer-x17.o algo/yescrypt/cpuminer-yescrypt.o algo/yescrypt/cpuminer-yescrypt-common.o algo/yescrypt/cpuminer-sha256_Y.o algo/yescrypt/cpuminer-yescrypt-simd.o algo/cpuminer-zr5.o asm/cpuminer-neoscrypt_asm.o asm/cpuminer-sha2-x64.o asm/cpuminer-scrypt-x64.o asm/cpuminer-aesb-x64.o -lcurl -lz -ljansson -lpthread -lssl -lcrypto -lgmp 
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx2':
<artificial>:(.text+0x9712): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9729): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9760): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9785): undefined reference to `scrypt_ChunkMix_avx2'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_xop':
<artificial>:(.text+0x99f2): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a09): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a40): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a65): undefined reference to `scrypt_ChunkMix_xop'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx':
<artificial>:(.text+0x9cd2): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9ce9): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9d20): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9d45): undefined reference to `scrypt_ChunkMix_avx'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_ssse3':
<artificial>:(.text+0x9fb2): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0x9fc9): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0xa000): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0xa025): undefined reference to `scrypt_ChunkMix_ssse3'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_sse2':
<artificial>:(.text+0xa292): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa2a9): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa2e0): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa305): undefined reference to `scrypt_ChunkMix_sse2'
collect2: error: ld returned 1 exit status
Makefile:1333: recipe for target 'cpuminer' failed
make[2]: *** [cpuminer] Error 1
make[2]: Leaving directory '/root/cpuminer-opt'
Makefile:3453: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/root/cpuminer-opt'
Makefile:670: recipe for target 'all' failed
make: *** [all] Error 2
If you are unsure how to fix it could you at least guide me how to disable the whole scrypt (optimization) because I am really anxious to see what -flto will do.
|
|
|
|
ReiMomo
Sr. Member
  
Offline
Activity: 2366
Merit: 305
Duelbits - $100k Bonus/week
|
 |
August 05, 2016, 12:08:11 AM |
|
So when is the Windows bin out? 
|
|
|
|
|
|
|
johnsmithx
|
 |
August 05, 2016, 01:30:49 AM Last edit: August 05, 2016, 02:08:52 AM by johnsmithx |
|
Success! I did this very ugly hack, joblo please don't get a heart attack:
--- scrypt-jane-romix-template.h.orig 2016-02-05 22:05:38.000000000 +0000
+++ scrypt-jane-romix-template.h 2016-08-05 00:37:48.949684265 +0000
@@ -86,9 +86,9 @@
 for (i = 0; i < /*N - 1*/511; i++, block += chunkWords) {
 /* 3: V_i = X */
 /* 4: X = H(X) */
- SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
+// SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
 }
- SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);
+// SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);

 /* 6: for i = 0 to N - 1 do */
 for (i = 0; i < /*N*/512; i += 2) {
@@ -96,13 +96,13 @@
 j = X[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;

 /* 8: X = H(Y ^ V_j) */
- SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);
+// SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);

 /* 7: j = Integerify(Y) % N */
 j = Y[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;

 /* 8: X = H(Y ^ V_j) */
- SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
+// SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
 }

 /* 10: B' = X */
And now it does compile with -flto and here is the result:
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2
[2016-08-05 00:58:03] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:58:04] CPU #0: 65.54 kH, 84.17 kH/s
[2016-08-05 00:58:04] CPU #1: 65.54 kH, 84.25 kH/s
[2016-08-05 00:58:04] CPU #3: 65.54 kH, 84.23 kH/s
[2016-08-05 00:58:04] Total: 196.61 kH, 252.64 kH/s
[2016-08-05 00:58:04] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 00:58:08] CPU #2: 335.45 kH, 84.02 kH/s
[2016-08-05 00:58:08] CPU #1: 336.99 kH, 84.25 kH/s
[2016-08-05 00:58:08] CPU #3: 336.92 kH, 84.24 kH/s
[2016-08-05 00:58:08] Total: 1074.89 kH, 336.68 kH/s
[2016-08-05 00:58:08] CPU #0: 336.67 kH, 84.04 kH/s
[2016-08-05 00:58:13] CPU #2: 420.12 kH, 84.16 kH/s
[2016-08-05 00:58:13] CPU #1: 421.26 kH, 84.35 kH/s
[2016-08-05 00:58:13] CPU #0: 420.18 kH, 84.19 kH/s
[2016-08-05 00:58:13] CPU #3: 421.18 kH, 84.34 kH/s
[2016-08-05 00:58:13] Total: 1682.74 kH, 337.04 kH/s
[2016-08-05 00:58:18] CPU #2: 420.78 kH, 84.16 kH/s
[2016-08-05 00:58:18] CPU #1: 421.77 kH, 84.31 kH/s
[2016-08-05 00:58:18] CPU #0: 420.97 kH, 84.19 kH/s
[2016-08-05 00:58:18] CPU #3: 421.69 kH, 84.26 kH/s
[2016-08-05 00:58:18] Total: 1685.21 kH, 336.92 kH/s
[2016-08-05 00:58:23] CPU #1: 421.54 kH, 84.37 kH/s
[2016-08-05 00:58:23] CPU #3: 421.31 kH, 84.32 kH/s
[2016-08-05 00:58:23] CPU #2: 420.81 kH, 83.99 kH/s
[2016-08-05 00:58:23] Total: 1684.63 kH, 336.87 kH/s
[2016-08-05 00:58:23] CPU #0: 420.93 kH, 84.01 kH/s
[2016-08-05 00:58:28] CPU #2: 419.96 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #0: 420.07 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #1: 421.87 kH, 84.17 kH/s
[2016-08-05 00:58:28] CPU #3: 421.58 kH, 84.09 kH/s
[2016-08-05 00:58:28] Total: 1683.49 kH, 336.46 kH/s
So using -flto gives another 2.75% speed increase. That's 7.7% speed increase in total over tpruvot.
Now this is with -flto and -fuse-linker-plugin:
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2
[2016-08-05 00:55:15] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:55:16] CPU #0: 65.54 kH, 84.75 kH/s
[2016-08-05 00:55:16] CPU #1: 65.54 kH, 84.78 kH/s
[2016-08-05 00:55:16] CPU #2: 65.54 kH, 84.56 kH/s
[2016-08-05 00:55:16] CPU #3: 65.54 kH, 84.44 kH/s
[2016-08-05 00:55:16] Total: 262.14 kH, 338.53 kH/s
[2016-08-05 00:55:20] CPU #3: 337.77 kH, 84.06 kH/s
[2016-08-05 00:55:20] Total: 534.38 kH, 338.15 kH/s
[2016-08-05 00:55:20] CPU #2: 338.22 kH, 84.01 kH/s
[2016-08-05 00:55:20] CPU #1: 339.13 kH, 84.09 kH/s
[2016-08-05 00:55:20] CPU #0: 338.98 kH, 84.02 kH/s
[2016-08-05 00:55:25] CPU #0: 420.11 kH, 84.71 kH/s
[2016-08-05 00:55:25] CPU #2: 420.03 kH, 84.49 kH/s
[2016-08-05 00:55:25] CPU #3: 420.31 kH, 84.05 kH/s
[2016-08-05 00:55:25] Total: 1599.59 kH, 337.33 kH/s
[2016-08-05 00:55:25] CPU #1: 420.43 kH, 84.07 kH/s
[2016-08-05 00:55:30] CPU #3: 420.25 kH, 83.97 kH/s
[2016-08-05 00:55:30] Total: 1680.82 kH, 337.24 kH/s
[2016-08-05 00:55:30] CPU #2: 422.44 kH, 83.97 kH/s
[2016-08-05 00:55:30] CPU #0: 423.54 kH, 83.98 kH/s
[2016-08-05 00:55:30] CPU #1: 420.36 kH, 83.97 kH/s
[2016-08-05 00:55:35] CPU #0: 419.88 kH, 84.64 kH/s
[2016-08-05 00:55:35] CPU #2: 419.84 kH, 84.39 kH/s
[2016-08-05 00:55:35] CPU #3: 419.85 kH, 84.00 kH/s
[2016-08-05 00:55:35] Total: 1679.93 kH, 337.00 kH/s
[2016-08-05 00:55:35] CPU #1: 419.85 kH, 84.02 kH/s
[2016-08-05 00:55:40] CPU #0: 423.20 kH, 84.42 kH/s
[2016-08-05 00:55:40] CPU #3: 420.02 kH, 84.32 kH/s
[2016-08-05 00:55:40] Total: 1682.91 kH, 337.15 kH/s
Basically the same speed.
Now what if I actually call tpruvot's build.sh, exactly the one I showed in my previous post:
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2
[2016-08-05 01:10:02] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 01:10:03] CPU #0: 65.54 kH, 84.11 kH/s
[2016-08-05 01:10:03] CPU #1: 65.54 kH, 83.93 kH/s
[2016-08-05 01:10:03] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 01:10:03] CPU #3: 65.54 kH, 83.96 kH/s
[2016-08-05 01:10:03] Total: 262.14 kH, 335.86 kH/s
[2016-08-05 01:10:07] CPU #1: 335.71 kH, 84.00 kH/s
[2016-08-05 01:10:07] CPU #2: 335.44 kH, 83.92 kH/s
[2016-08-05 01:10:07] CPU #3: 335.85 kH, 83.99 kH/s
[2016-08-05 01:10:07] Total: 1072.54 kH, 336.02 kH/s
[2016-08-05 01:10:07] CPU #0: 336.45 kH, 83.93 kH/s
[2016-08-05 01:10:12] CPU #1: 420.00 kH, 84.00 kH/s
[2016-08-05 01:10:12] CPU #2: 419.62 kH, 83.92 kH/s
[2016-08-05 01:10:12] CPU #3: 419.93 kH, 83.99 kH/s
[2016-08-05 01:10:12] Total: 1596.00 kH, 335.82 kH/s
[2016-08-05 01:10:12] CPU #0: 419.64 kH, 83.91 kH/s
[2016-08-05 01:10:17] CPU #1: 419.98 kH, 84.05 kH/s
[2016-08-05 01:10:17] CPU #2: 419.58 kH, 83.98 kH/s
[2016-08-05 01:10:17] CPU #3: 419.93 kH, 84.03 kH/s
[2016-08-05 01:10:17] Total: 1679.12 kH, 335.98 kH/s
[2016-08-05 01:10:17] CPU #0: 419.53 kH, 83.99 kH/s
[2016-08-05 01:10:22] CPU #2: 419.92 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #1: 420.25 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #0: 419.93 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #3: 420.18 kH, 84.02 kH/s
[2016-08-05 01:10:22] Total: 1680.28 kH, 336.14 kH/s
Still the same (maximum) speed. So I will be using joblo's cpuminer with tpruvot's (uncommented) build.sh, because that build.sh has all those other flags (including -falign-*) which may or may not matter, so just to be safe.
EDIT: when I took the AVX2 binary and tried to run it on an AVX-only CPU I got this:
CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
CPU features: SSE2 AES AVX
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX
Illegal instruction (core dumped)
But wasn't the whole idea that all the CPU features would be compiled in and the particular feature to use would be determined at runtime? It's not a big deal, I just recompiled it and I will have two versions (AVX and AVX2) and run the one that's appropriate to the CPU. I just thought I would report this.
|
|
|
|
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
|
 |
August 05, 2016, 02:25:33 PM |
|
So when is the Windows bin out? Cryptomining Blog has usually been good at producing binaries within a few hours of a release. I'm not sure why not this time. You could ask. I can't build distributable Windows binaries, but mingw works for compiling your own; instructions are in README.md.
|
|
|
|
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
|
 |
August 05, 2016, 02:51:44 PM |
|
Success! [snip] So I will be using joblo's cpuminer with tpruvot's (uncommented) build.sh because that build.sh has all those other flags (including -falign-*) which may or may not matter, so just to be safe.. EDIT: when I took the avx2 binary and tried to run it on a avx cpu I got this: CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz CPU features: SSE2 AES AVX SW built on Aug 5 2016 with GCC 5.4.0 SW features: SSE2 AES AVX AVX2 Algo features: SSE2 AES AVX AVX2 Start mining with SSE2 AES AVX
Illegal instruction (core dumped)
But wasn't the whole idea that all the CPU features would be compiled in and the particular feature to use would be determined at runtime? It's not a big deal, I just recompiled it and I will have two versions (AVX and AVX2) and run the one that's appropriate to the CPU. I just thought I would report this.
Excellent work. The easiest way to block the compile error is to comment out the source dir for argon2 and remove the registration call for argon2 in algo-gate-api.c:register_algo_gate. You can easily remove any algo this way. You have demonstrated that LTO improves performance with the new compiler but has some incompatibilities with the existing argon2 code. I will investigate argon2 to try to solve it.
CPU architecture selection is made at compile time. If you do a native compile on a CPU that supports AVX2 you cannot run it on a CPU with only AVX. If you want to cross compile you must specify the arch of the target CPU and produce separate executables for each desired architecture. My logic for AVX2 isn't fully implemented yet in the capabilities checks; had it been, it would have displayed a message warning of the impending crash, then crashed. This is what you should see when implemented:
CPU features: SSE2 AES AVX
SW built on Aug 5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Unsupported CPU or SW configuration, miner will likely crash!
Illegal instruction (core dumped)
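To illustrate the kind of capability check described here, below is a minimal sketch in C. It is not the cpuminer-opt code: the names FEAT_*, sw_features, cpu_features, missing_features and check_cpu_capability are made up for this example, and it assumes GCC or Clang on x86, where __builtin_cpu_init and __builtin_cpu_supports are available. It compares what the binary was compiled for (from the compiler's predefined macros) with what the running CPU reports, and prints the warning when the binary needs something the CPU lacks.

```c
#include <stdio.h>

/* Bit flags for a few of the instruction sets shown in the banner
   (hypothetical names, for illustration only). */
enum { FEAT_SSE2 = 1, FEAT_AVX = 2, FEAT_AVX2 = 4 };

/* Features the binary was built for, taken from compiler-defined macros. */
static unsigned sw_features(void)
{
    unsigned f = 0;
#ifdef __SSE2__
    f |= FEAT_SSE2;
#endif
#ifdef __AVX__
    f |= FEAT_AVX;
#endif
#ifdef __AVX2__
    f |= FEAT_AVX2;
#endif
    return f;
}

/* Features the running CPU reports (GCC/Clang builtins, x86 only). */
static unsigned cpu_features(void)
{
    unsigned f = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) f |= FEAT_SSE2;
    if (__builtin_cpu_supports("avx"))  f |= FEAT_AVX;
    if (__builtin_cpu_supports("avx2")) f |= FEAT_AVX2;
#endif
    return f;
}

/* Features compiled in but absent on the CPU; nonzero means a likely crash. */
static unsigned missing_features(unsigned sw, unsigned cpu)
{
    return sw & ~cpu;
}

/* Returns 1 if the build matches the CPU, 0 after printing the warning. */
static int check_cpu_capability(void)
{
    if (missing_features(sw_features(), cpu_features()) != 0) {
        fprintf(stderr,
                "Unsupported CPU or SW configuration, miner will likely crash!\n");
        return 0;
    }
    return 1;
}
```

Calling check_cpu_capability() at the top of main() would give the warning before any hashing starts. One caveat: in a binary built with -march=native the compiler is free to use the newer instructions anywhere, possibly even before the check runs, so a check like this can only warn, it cannot make an AVX2 build run on an AVX-only CPU.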
|
|
|
|
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
|
 |
August 05, 2016, 04:13:36 PM Last edit: August 05, 2016, 04:44:59 PM by joblo |
|
But when I add -flto I get the following error at the final link: g++ -O3 -march=native -w -flto -std=gnu++11 -Lyes/lib -Lyes/lib -o cpuminer cpuminer-cpu-miner.o cpuminer-util.o cpuminer-uint256.o cpuminer-api.o cpuminer-sysinfos.o cpuminer-algo-gate-api.o algo/groestl/cpuminer-sph_groestl.o algo/skein/cpuminer-sph_skein.o algo/bmw/cpuminer-sph_bmw.o algo/shavite/cpuminer-sph_shavite.o algo/shavite/cpuminer-shavite.o algo/echo/cpuminer-sph_echo.o algo/blake/cpuminer-sph_blake.o algo/heavy/cpuminer-sph_hefty1.o algo/blake/cpuminer-mod_blakecoin.o algo/luffa/cpuminer-sph_luffa.o algo/cubehash/cpuminer-sph_cubehash.o algo/simd/cpuminer-sph_simd.o algo/hamsi/cpuminer-sph_hamsi.o algo/fugue/cpuminer-sph_fugue.o algo/gost/cpuminer-sph_gost.o algo/jh/cpuminer-sph_jh.o algo/keccak/cpuminer-sph_keccak.o algo/keccak/cpuminer-keccak.o algo/sha3/cpuminer-sph_sha2.o algo/sha3/cpuminer-sph_sha2big.o algo/shabal/cpuminer-sph_shabal.o algo/whirlpool/cpuminer-sph_whirlpool.o crypto/cpuminer-blake2s.o crypto/cpuminer-oaes_lib.o crypto/cpuminer-c_keccak.o crypto/cpuminer-c_groestl.o crypto/cpuminer-c_blake256.o crypto/cpuminer-c_jh.o crypto/cpuminer-c_skein.o crypto/cpuminer-hash.o crypto/cpuminer-aesb.o crypto/cpuminer-magimath.o algo/argon2/cpuminer-argon2a.o algo/argon2/ar2/cpuminer-argon2.o algo/argon2/ar2/cpuminer-opt.o algo/argon2/ar2/cpuminer-cores.o algo/argon2/ar2/cpuminer-ar2-scrypt-jane.o algo/argon2/ar2/cpuminer-blake2b.o algo/cpuminer-axiom.o algo/blake/cpuminer-blake.o algo/blake/cpuminer-blake2.o algo/blake/cpuminer-blakecoin.o algo/blake/cpuminer-decred.o algo/blake/cpuminer-pentablake.o algo/bmw/cpuminer-bmw256.o algo/cubehash/sse2/cpuminer-cubehash_sse2.o algo/cryptonight/cpuminer-cryptolight.o algo/cryptonight/cpuminer-cryptonight-common.o algo/cryptonight/cpuminer-cryptonight-aesni.o algo/cryptonight/cpuminer-cryptonight.o algo/cpuminer-drop.o algo/echo/aes_ni/cpuminer-hash.o algo/cpuminer-fresh.o algo/groestl/cpuminer-groestl.o 
algo/groestl/cpuminer-myr-groestl.o algo/groestl/sse2/cpuminer-grso.o algo/groestl/sse2/cpuminer-grso-asm.o algo/groestl/aes_ni/cpuminer-hash-groestl.o algo/groestl/aes_ni/cpuminer-hash-groestl256.o algo/haval/cpuminer-haval.o algo/heavy/cpuminer-heavy.o algo/heavy/cpuminer-bastion.o algo/cpuminer-hmq1725.o algo/hodl/cpuminer-hodl.o algo/hodl/cpuminer-hodl-gate.o algo/hodl/cpuminer-hodl_arith_uint256.o algo/hodl/cpuminer-hodl_uint256.o algo/hodl/cpuminer-hash.o algo/hodl/cpuminer-hmac_sha512.o algo/hodl/cpuminer-sha256.o algo/hodl/cpuminer-sha512.o algo/hodl/cpuminer-utilstrencodings.o algo/hodl/cpuminer-hodl-wolf.o algo/hodl/cpuminer-aes.o algo/hodl/cpuminer-sha512_avx.o algo/hodl/cpuminer-sha512_avx2.o algo/cpuminer-lbry.o algo/luffa/cpuminer-luffa.o algo/luffa/sse2/cpuminer-luffa_for_sse2.o algo/lyra2/cpuminer-lyra2.o algo/lyra2/cpuminer-sponge.o algo/lyra2/cpuminer-lyra2rev2.o algo/lyra2/cpuminer-lyra2re.o algo/keccak/sse2/cpuminer-keccak.o algo/cpuminer-m7m.o algo/cpuminer-neoscrypt.o algo/cpuminer-nist5.o algo/cpuminer-pluck.o algo/quark/cpuminer-quark.o algo/qubit/cpuminer-qubit.o algo/ripemd/cpuminer-sph_ripemd.o algo/cpuminer-scrypt.o algo/scryptjane/cpuminer-scrypt-jane.o algo/sha2/cpuminer-sha2.o algo/simd/sse2/cpuminer-nist.o algo/simd/sse2/cpuminer-vector.o algo/skein/cpuminer-skein.o algo/skein/cpuminer-skein2.o algo/cpuminer-s3.o algo/tiger/cpuminer-sph_tiger.o algo/whirlpool/cpuminer-whirlpool.o algo/whirlpool/cpuminer-whirlpoolx.o algo/x11/cpuminer-x11.o algo/x11/cpuminer-x11evo.o algo/x11/cpuminer-x11gost.o algo/x11/cpuminer-c11.o algo/x13/cpuminer-x13.o algo/x14/cpuminer-x14.o algo/x15/cpuminer-x15.o algo/x17/cpuminer-x17.o algo/yescrypt/cpuminer-yescrypt.o algo/yescrypt/cpuminer-yescrypt-common.o algo/yescrypt/cpuminer-sha256_Y.o algo/yescrypt/cpuminer-yescrypt-simd.o algo/cpuminer-zr5.o asm/cpuminer-neoscrypt_asm.o asm/cpuminer-sha2-x64.o asm/cpuminer-scrypt-x64.o asm/cpuminer-aesb-x64.o -lcurl -lz -ljansson -lpthread -lssl -lcrypto -lgmp 
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx2': <artificial>:(.text+0x9712): undefined reference to `scrypt_ChunkMix_avx2' <artificial>:(.text+0x9729): undefined reference to `scrypt_ChunkMix_avx2' <artificial>:(.text+0x9760): undefined reference to `scrypt_ChunkMix_avx2' <artificial>:(.text+0x9785): undefined reference to `scrypt_ChunkMix_avx2' /tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_xop': <artificial>:(.text+0x99f2): undefined reference to `scrypt_ChunkMix_xop' <artificial>:(.text+0x9a09): undefined reference to `scrypt_ChunkMix_xop' <artificial>:(.text+0x9a40): undefined reference to `scrypt_ChunkMix_xop' <artificial>:(.text+0x9a65): undefined reference to `scrypt_ChunkMix_xop' /tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx': <artificial>:(.text+0x9cd2): undefined reference to `scrypt_ChunkMix_avx' <artificial>:(.text+0x9ce9): undefined reference to `scrypt_ChunkMix_avx' <artificial>:(.text+0x9d20): undefined reference to `scrypt_ChunkMix_avx' <artificial>:(.text+0x9d45): undefined reference to `scrypt_ChunkMix_avx' /tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_ssse3': <artificial>:(.text+0x9fb2): undefined reference to `scrypt_ChunkMix_ssse3' <artificial>:(.text+0x9fc9): undefined reference to `scrypt_ChunkMix_ssse3' <artificial>:(.text+0xa000): undefined reference to `scrypt_ChunkMix_ssse3' <artificial>:(.text+0xa025): undefined reference to `scrypt_ChunkMix_ssse3' /tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_sse2': <artificial>:(.text+0xa292): undefined reference to `scrypt_ChunkMix_sse2' <artificial>:(.text+0xa2a9): undefined reference to `scrypt_ChunkMix_sse2' <artificial>:(.text+0xa2e0): undefined reference to `scrypt_ChunkMix_sse2' <artificial>:(.text+0xa305): undefined reference to `scrypt_ChunkMix_sse2' collect2: error: ld returned 1 exit status Makefile:1333: recipe for target 'cpuminer' failed make[2]: *** [cpuminer] Error 1 make[2]: Leaving directory '/root/cpuminer-opt' 
Makefile:3453: recipe for target 'all-recursive' failed make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory '/root/cpuminer-opt' Makefile:670: recipe for target 'all' failed make: *** [all] Error 2
I just want to make sure I understand the problem definition:
- multi is faster with -flto
- multi without -flto is slower than identically compiled opt
- multi with -flto is faster than pre-avx2 opt compiled without -flto
- opt fails to compile with gcc 5.4.0 with -flto
- -flto compiles with gcc 4.8.4 with no effect on performance.
The significant points are:
- -flto is faster with gcc 5.4.0
- code that compiles with -flto using gcc 4.8.4 fails to compile using gcc 5.4.0.
The code that fails to compile is pretty ugly. It uses asm function pointers to select targets at compile time. I've never seen anything like this so it will take a while to understand what is going on. It looks like the code is self-contained and the error doesn't seem to be related to missing libraries.
As a workaround, if you disable argon2 you can get the best of my optimizations as well as LTO, unless some of my opts conflict with LTO. It wouldn't be the first time I step on the compiler when trying to optimize.
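The symbol names in the linker error (scrypt_ROMix_avx2 calling scrypt_ChunkMix_avx2, and so on for xop, avx, ssse3, sse2) look like one template stamped out once per instruction set, with SCRYPT_CHUNKMIX_FN expanding to a different name each time. Here is a toy sketch of that pattern in C, to show the shape of it. It is an assumption about the structure, not the actual scrypt-jane code: all toy_* names, the DEFINE_TOY_ROMIX macro and the mixing function are invented, and the real code uses a re-included header and asm rather than a single macro.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two-step paste so that macro arguments are expanded before ## is applied. */
#define PASTE_(a, b) a##b
#define PASTE(a, b)  PASTE_(a, b)

/* Stand-in for a template header included once per instruction set.
   Each expansion defines toy_ChunkMix_<suffix> and toy_ROMix_<suffix>;
   the ROMix function calls the ChunkMix function by its pasted name. */
#define DEFINE_TOY_ROMIX(SUFFIX)                                             \
    static void PASTE(toy_ChunkMix_, SUFFIX)(uint32_t *out,                  \
                                             const uint32_t *in, size_t n)   \
    {                                                                        \
        for (size_t i = 0; i < n; i++)                                       \
            out[i] = (in[i] << 7 | in[i] >> 25) ^ in[(i + 1) % n];           \
    }                                                                        \
    static void PASTE(toy_ROMix_, SUFFIX)(uint32_t *x, uint32_t *tmp,        \
                                          size_t n, unsigned rounds)         \
    {                                                                        \
        for (unsigned r = 0; r < rounds; r++) {                              \
            PASTE(toy_ChunkMix_, SUFFIX)(tmp, x, n);                         \
            memcpy(x, tmp, n * sizeof *x);                                   \
        }                                                                    \
    }

/* Two instantiations; in real code the bodies would differ per ISA. */
DEFINE_TOY_ROMIX(basic)
DEFINE_TOY_ROMIX(wide)

typedef void (*toy_romix_fn)(uint32_t *, uint32_t *, size_t, unsigned);

/* Pick an implementation by name, falling back to the basic one. */
static toy_romix_fn toy_select(const char *name)
{
    return strcmp(name, "wide") == 0 ? toy_ROMix_wide : toy_ROMix_basic;
}
```

If the structure is anything like this, it also shows why commenting out the SCRYPT_CHUNKMIX_FN calls gets past the link step: no references to the missing symbols remain. The price is that the romix loop no longer mixes anything, so presumably whatever depends on it would produce wrong hashes, which makes it a workaround for testing other algos rather than a fix.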
|
|
|
|
|