[LOCKED] cpuminer-opt v3.12.3, open source optimized multi-algo CPU miner

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 03, 2016, 05:03:00 AM

#941

Quote from: johnsmithx on August 03, 2016, 03:37:53 AM

Quote from: joblo on August 02, 2016, 09:52:19 PM

Things are looking up. I solved the alignment problem and made progress with performance. I've added 3%
to Lyra2 so far and I have a few functions left to convert. So far I've only implemented AVX2; I still have to do
AVX implementations of all functions. The improvements are to the lyra2 core so they should also help lyra2v2.

You are talking about Lyra2RE, right? So it will get sped up? Because so far tpruvot-cpuminer-multi's Lyra2RE is faster than yours by some 3.8% on my servers. I guess it might be because of -flto, which I use with his but can't use with yours (it doesn't compile).

I think you are doing something wrong; lyra2RE in cpuminer-opt v3.3.7 was 7% faster than cpuminer-multi.
In the next release it will be another 3% faster. If you have very old CPUs (e.g. Core 2) you won't get the benefits of my optimisations.

AKA JayDDee, cpuminer-opt developer. https://github.com/JayDDee/cpuminer-opt
https://bitcointalk.org/index.php?topic=5226770.msg53865575#msg53865575
BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT,

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 02:40:46 AM

#942

I've found more AVX2 optimizations, converted cubehash SSE2 to AVX2, improved lyra2v2 19% and X algos 3-5%.
Cubehash will probably be the biggest single AVX2 optimization next to Hodl. I don't know how much more I can find.
Optiminer's code was an inspiration from which I learned a lot.

A lot more work to do before release.


johnsmithx

Hero Member

Offline

Activity: 589
Merit: 507

I don't buy nor sell anything here and never will.

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 03:50:28 AM

#943

Quote from: joblo on August 03, 2016, 05:03:00 AM

Quote from: johnsmithx on August 03, 2016, 03:37:53 AM

Quote from: joblo on August 02, 2016, 09:52:19 PM

I know you didn't mean to, but what you said was very amusing. The whole time I am talking about servers, the real servers in data centers, not some desktop PC in your home that you call "a server", and you come back at me with "core2", a 10-year-old, super-obsolete desktop CPU. Hilarious!

But maybe you are right, maybe I am doing something wrong but it's not the hardware. I am using the up-to-date Ubuntu 16.04, if I messed something up then it must be the flags in the build.sh. Or maybe you did something wrong actually. Maybe you didn't compile tpruvot's cpuminer with -flto so now you are competing with a crippled sw because this flag does make the difference and it's not on by default, it's commented out in the build.sh so you have to enable it.

Either way, instead of accusing each other let's try to make things better. In this spirit I made a little test for you. I picked the most powerful server AWS has to offer, the x1.32xlarge (https://aws.amazon.com/ec2/instance-types/x1/) with 128 cores and 1952 GB memory. Here are the specs:

Code:

root@xxx:~/# grep -e name -e flags /proc/cpuinfo | head -n2
model name      : Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida

root@xxx:~/# grep processor /proc/cpuinfo | wc -l
128

root@xxx:~/# free -h
              total        used        free      shared  buff/cache   available
Mem:           1.9T        3.3G        1.9T        9.1M        624M        1.9T
Swap:            0B          0B          0B

Now I fresh recompiled tpruvot and here is the result:

(I noticed that on many-core machines it is sometimes actually faster to use just half the threads; here on x1.32xlarge the difference is marginal, on g2.8xlarge it's quite significant. The spike at the beginning of the full-core run I attribute to some throttling done by Amazon; it's just a VPS after all, not a dedicated server.)

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:44:55] Total: 8903 kH/s
[2016-08-04 02:44:59] Total: 8833 kH/s
[2016-08-04 02:45:04] Total: 8735 kH/s
[2016-08-04 02:45:09] Total: 8728 kH/s
[2016-08-04 02:45:14] Total: 8727 kH/s

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:45:21] Total: 11661 kH/s
[2016-08-04 02:45:25] Total: 11568 kH/s
[2016-08-04 02:45:30] Total: 8731 kH/s
[2016-08-04 02:45:35] Total: 8720 kH/s
[2016-08-04 02:45:40] Total: 8722 kH/s
[2016-08-04 02:45:45] Total: 8703 kH/s

Now let's look at you:

Code:

CPU: Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  4 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES
Start mining with AES-AVX optimizations...

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:47:57] Total: 4128.77 kH, 8660.05 kH/s
[2016-08-04 02:48:01] Total: 35.19 MH, 8668.56 kH/s
[2016-08-04 02:48:06] Total: 43.34 MH, 8590.78 kH/s
[2016-08-04 02:48:11] Total: 42.95 MH, 8600.93 kH/s
[2016-08-04 02:48:16] Total: 43.00 MH, 8621.90 kH/s

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:48:23] Total: 8323.07 kH, 11.96 MH/s
[2016-08-04 02:48:26] Total: 11.68 MH, 12.03 MH/s
[2016-08-04 02:48:31] Total: 32.93 MH, 8544.43 kH/s
[2016-08-04 02:48:36] Total: 39.36 MH, 8531.49 kH/s
[2016-08-04 02:48:41] Total: 42.52 MH, 8531.90 kH/s
[2016-08-04 02:48:46] Total: 42.33 MH, 8536.75 kH/s
[2016-08-04 02:48:51] Total: 41.88 MH, 8536.12 kH/s
[2016-08-04 02:48:56] Total: 42.41 MH, 8526.90 kH/s

Here tpruvot is faster by over 2%.
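
For anyone who wants to check that percentage, here is a minimal sketch of the arithmetic, using the last 128-thread samples quoted above (8703 kH/s for tpruvot, 8526.90 kH/s for cpuminer-opt). The function name pct_faster is made up for illustration and is not part of either miner.

```shell
#!/bin/bash
# pct_faster A B: print how much faster rate A is than rate B, in percent.
# Hypothetical helper; awk does the floating-point division.
pct_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (a - b) / b * 100 }'
}

# Last steady-state 128-thread samples from the runs above (kH/s).
pct_faster 8703 8526.90
```

Both arguments have to be in the same unit, so any MH/s figure needs converting to kH/s first.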

So let's look at the build.sh. Tpruvot's:

Code:

root@xxx:~/tpruvot-cpuminer-multi# cat build.sh
#!/bin/bash

if [ "$OS" = "Windows_NT" ]; then
    ./mingw64.sh
    exit 0
fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -flto -fuse-linker-plugin -ftree-loop-if-convert-stores"

if [ ! "0" = `cat /proc/cpuinfo | grep -c avx` ]; then
    # march native doesn't always works, ex. some Pentium Gxxx (no avx)
    extracflags="$extracflags -march=native"
fi

./configure --with-crypto --with-curl CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg"

make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer
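
A side note on the AVX check in that script: `grep -c avx` counts every line containing the substring, so it also matches avx2 and similar flags. A rough sketch of a whole-word check follows. The helper name has_cpu_flag is hypothetical, and it assumes a Linux-style /proc/cpuinfo with a "flags" line like the one shown earlier.

```shell
#!/bin/bash
# has_cpu_flag FLAG [CPUINFO]: succeed if FLAG appears as a whole word in the
# first "flags" line of CPUINFO (default: /proc/cpuinfo).
# Hypothetical helper, not taken from either miner's build script.
has_cpu_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -qw -- "$1"
}

if has_cpu_flag avx2; then
    echo "avx2 present"
else
    echo "avx2 not reported"
fi
```

Because -w treats digits and underscores as part of a word, asking for avx does not match avx2, and sse4 does not match sse4_1.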

Yours ("customized" by me, but all I actually did was taking most of the flags from tpruvot's as long as it was compilable; maybe I messed it up?):

Code:

root@xxx:~/cpuminer-opt-3.3.9# cat build.sh
#!/bin/bash

#if [ "$OS" = "Windows_NT" ]; then
#    ./mingw64.sh
#    exit 0
#fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -fuse-linker-plugin -ftree-loop-if-convert-stores"

CFLAGS="-O3 $extracflags -march=native -DUSE_ASM" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl

make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer

Yours doesn't have -flto and -pg (neither compiles for me); his doesn't have -std=gnu++11 (that doesn't compile either).

Now let's see what all the -flto fuss is about. What happens if I compile tpruvot without it:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=64 | grep Total
[2016-08-04 02:56:24] Total: 356.42 kH/s
[2016-08-04 02:56:29] Total: 352.18 kH/s
[2016-08-04 02:56:34] Total: 352.00 kH/s
[2016-08-04 02:56:39] Total: 352.03 kH/s
[2016-08-04 02:56:44] Total: 352.09 kH/s

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=128 | grep Total
[2016-08-04 02:57:29] Total: 358.48 kH/s
[2016-08-04 02:57:34] Total: 357.76 kH/s
[2016-08-04 02:57:39] Total: 357.96 kH/s

It turned into a snail. As if it couldn't manage the multiplexing or something. So let's try just 1 thread:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:57:45] Total: 124.22 kH/s
[2016-08-04 02:57:50] Total: 124.66 kH/s
[2016-08-04 02:57:55] Total: 125.81 kH/s
[2016-08-04 02:58:00] Total: 126.19 kH/s
[2016-08-04 02:58:05] Total: 126.84 kH/s

And yours:

Code:

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:58:25] Total: 65.54 kH, 136.75 kH/s
[2016-08-04 02:58:30] Total: 683.77 kH, 138.37 kH/s
[2016-08-04 02:58:35] Total: 691.85 kH, 140.37 kH/s
[2016-08-04 02:58:40] Total: 701.86 kH, 140.28 kH/s
[2016-08-04 02:58:45] Total: 701.42 kH, 140.33 kH/s
[2016-08-04 02:58:50] Total: 701.68 kH, 140.74 kH/s

You are clearly faster if he doesn't use -flto. But if I again turn -flto back on and recompile:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=1 | grep Total
[2016-08-04 02:59:54] Total: 138.24 kH/s
[2016-08-04 02:59:58] Total: 139.78 kH/s
[2016-08-04 03:00:03] Total: 140.41 kH/s
[2016-08-04 03:00:08] Total: 140.39 kH/s
[2016-08-04 03:00:13] Total: 141.38 kH/s
[2016-08-04 03:00:18] Total: 141.65 kH/s
[2016-08-04 03:00:23] Total: 141.64 kH/s
[2016-08-04 03:00:28] Total: 142.04 kH/s
[2016-08-04 03:00:33] Total: 141.98 kH/s

He is clearly faster after all.

Now how do you do with 8 threads?

Code:

root@xxx:~/cpuminer-opt-3.3.9# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:00:53] Total: 524.29 kH, 1128.89 kH/s
[2016-08-04 03:00:58] Total: 5644.46 kH, 1130.25 kH/s
[2016-08-04 03:01:03] Total: 5651.25 kH, 1130.72 kH/s
[2016-08-04 03:01:08] Total: 5653.59 kH, 1130.44 kH/s
[2016-08-04 03:01:13] Total: 5652.21 kH, 1130.53 kH/s
[2016-08-04 03:01:18] Total: 5652.64 kH, 1130.45 kH/s

And him with -flto?

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:01:29] Total: 1143 kH/s
[2016-08-04 03:01:34] Total: 1144 kH/s
[2016-08-04 03:01:39] Total: 1144 kH/s
[2016-08-04 03:01:44] Total: 1144 kH/s
[2016-08-04 03:01:49] Total: 1144 kH/s
[2016-08-04 03:01:54] Total: 1145 kH/s

And him without -flto:

Code:

root@xxx:~/tpruvot-cpuminer-multi# ./cpuminer -a lyra2re --benchmark --threads=8 | grep Total
[2016-08-04 03:03:13] Total: 637.08 kH/s
[2016-08-04 03:03:17] Total: 611.28 kH/s
[2016-08-04 03:03:22] Total: 605.97 kH/s
[2016-08-04 03:03:27] Total: 605.81 kH/s
[2016-08-04 03:03:32] Total: 605.90 kH/s

I support very much your optimizing effort and whenever you need I will gladly do tests for you on various machines.


joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 04:54:55 AM

#944

Quote from: johnsmithx on August 04, 2016, 03:50:28 AM

I support very much your optimizing effort and whenever you need I will gladly do tests for you on various machines.

I appreciate the detailed report and the humour. It will take some time to digest it all but I have a couple of comments.
You clearly have all the optimization features in your CPU and the miner was running the optimized code, so that is eliminated.

I tested on a i7-6700K with 8 threads and saw no difference between multi with or without -flto.
Specifically I get:

multi: 920-922 kH/s
opt v3.3.9: 995
opt v3.3.8: 930
opt dev: 1025

Furthermore I can compile opt with both -flto and -pg, again with no performance difference.

I took the Lyra2RE code, and a lot more, directly from multi and I don't think he has made any changes to it. I've been tinkering with
Lyra2RE for a while and only recently made any significant progress.

C++11 is required to support an algo not included in multi.

I can only speculate your compiler version may be the issue. I am using 4.8.4 and you 5.4.0. I seem to recall someone
else, Wolf maybe, having compile problems using a more recent version of gcc. I'm too lazy to look back in the thread to find it.
If you can find it you could compare notes.

I'll read through your report in more detail. If any ideas come to mind I'll post an update.


gimomars

Newbie

Offline

Activity: 34
Merit: 0

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 05:55:27 AM

#945

Hi experts, I'm trying to build on my openSUSE Linux, but I get an error:

./build.sh
make: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by make)
make: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by make)
strip: 'cpuminer': No such file

Regards

johnsmithx

Hero Member

Offline

Activity: 589
Merit: 507

I don't buy nor sell anything here and never will.

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 08:45:32 AM

#946

Quote from: joblo on August 04, 2016, 04:54:55 AM

I tested on a i7-6700K with 8 threads and saw no difference between multi with or without -flto.

One thing to keep in mind is that you are using a desktop cpu, I am using a server cpu. Those xeons are not multiple times more expensive for no reason (sure partly it's just branding, more reliability etc. etc., but there are also some functional differences).

Or it (the fact that -flto doesn't do anything to you) could simply be the compiler. Yours is 2 years old, mine 2 months. But the crucial information is that you can actually compile with -flto. Could you please give me the exact flags that work for you? I will take the effort and find out what's the problem on my end, I just need the starting (compilable) point.

I have no idea whether he improved the performance after you took the code from him. The very last commit is 3 days old, but which one is the last that could have had any real impact on Lyra2RE speed, or the overall speed, I am not going to investigate. But if you can remember, at least roughly, when you took his code, I can revert his tree to that date and try that version.

Quote from: gimomars on August 04, 2016, 05:55:27 AM

This problem has nothing to do with cpuminer; your build environment is messed up. Glibc is the very core Linux library, so if make can't find the version it wants then there is something wrong with one of them. But if glibc were messed up the system would hardly even boot properly. Maybe try to update?

gimomars

Newbie

Offline

Activity: 34
Merit: 0

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 09:39:20 AM

#947

Quote from: johnsmithx on August 04, 2016, 08:45:32 AM

Quote from: gimomars on August 04, 2016, 05:55:27 AM

Thanks for the reply. I think my Linux has an old version of GLIBC_x.x.

Version information:
/bin/sh:
libdl.so.2 (GLIBC_2.2.5) => /lib64/libdl.so.2
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.8) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
/lib64/libreadline.so.5:
libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.11) => /lib64/libc.so.6
libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6

johnsmithx

Hero Member

Offline

Activity: 589
Merit: 507

I don't buy nor sell anything here and never will.

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 12:09:33 PM

#948

Quote from: gimomars on August 04, 2016, 09:39:20 AM

Quote from: johnsmithx on August 04, 2016, 08:45:32 AM

Quote from: gimomars on August 04, 2016, 05:55:27 AM

Now you are looking at what versions of glibc these two binaries (/bin/sh and /lib64/libreadline.so.5) require. To find out what version of glibc you actually have, just run the library: type /lib64/libc.so.6 and press enter. Presumably it will be equal to or higher than what /bin/sh requires and lower than what 'make' requires. But that would mean that you didn't install 'make' the standard way via the package system but in some other way. What did you do to your suse?!?
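
As a rough sketch of that comparison, the snippet below checks the installed glibc against the two versions that 'make' complained about (GLIBC_2.14 and GLIBC_2.17). The helper name version_ge is made up here. It assumes a sort that supports -V (GNU coreutils does) and that `getconf GNU_LIBC_VERSION` is available, which is the case on glibc systems.

```shell
#!/bin/bash
# version_ge A B: succeed when version A >= version B.
# Hypothetical helper; relies on version sort (sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Installed glibc version, e.g. "glibc 2.11" -> "2.11" (empty if not glibc).
have=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')

for need in 2.14 2.17; do
    if version_ge "${have:-0}" "$need"; then
        echo "glibc $have satisfies GLIBC_$need"
    else
        echo "glibc ${have:-unknown} is older than GLIBC_$need"
    fi
done
```

A plain string or numeric comparison would get this wrong, since 2.3.4 is older than 2.11; version sort handles that case.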

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.3.9, Optimized Multialgo CPU miner

August 04, 2016, 01:24:07 PM
Last edit: August 04, 2016, 02:19:45 PM by joblo

#949

Quote from: johnsmithx on August 04, 2016, 08:45:32 AM

Quote from: joblo on August 04, 2016, 04:54:55 AM

I tested on a i7-6700K with 8 threads and saw no difference between multi with or without -flto.

I considered things like larger cache and faster memory interface as likely advantages of a server grade CPU but I can't
figure out how that would produce inconsistent results. I'm still leaning toward the compiler. My dusty memories seem to also
have LTO in them, I may have to dig back in the thread to refresh. Your compile errors might also jog some memories. I don't
think the problem was in code imported from multi, more likely one of the ugly SSE2 optimized macros. When I compiled I
used the flags from build.sh + -flto -pg.

It is an issue worth pursuing, I can't stay on the old compiler forever.

One way to determine if it's a compiler issue or CPU issue is with a VM. I don't know if you can start up a VM or boot an older
version of Ubuntu, but a direct comparison of different compilers on the same HW would help in understanding what's going on.

I'm a little busy right now testing the latest AVX2 optimisations to get them released, but after that I'll look into it more.

Edit: I found the post from Wolf0 about his experience compiling cpuminer-opt with gcc 6.1.1. Does it look similar to yours?

https://bitcointalk.org/index.php?topic=1326803.msg15140799#msg15140799


joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 04, 2016, 06:14:54 PM

#950

cpuminer-opt v3.4.0 is released.

A compile error was introduced in Windows v3.3.9 and has been fixed. X11gost was also broken since
v3.3.7 and has also been fixed.

The big news is more AVX2 optimizations inspired by Optiminer's work on the Hodl algo. See OP for details.
The entire Cubehash function was converted from SSE2 to AVX2, which improved all algos that use it. Some
AVX2 optimizations were also done to the Lyra2 core, improving both Lyra2RE and Lyra2REv2. Those were
the easy ones; I don't know how much more I can find. See OP for list of improved algos.

Source:
https://drive.google.com/file/d/0B0lVSGQYLJIZbFB1WThUZ09JbVk/view?usp=sharing


clipto

Member

Offline

Activity: 311
Merit: 10

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 04, 2016, 08:03:58 PM

#951

Great stuff, will it be released for Windows too?

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 04, 2016, 08:06:41 PM

#952

Quote from: clipto on August 04, 2016, 08:03:58 PM

Great stuff, will it be released for Windows too?

I'm hoping. CMB have been good; it seems they skipped v3.3.9 because it didn't compile. I hope they pick up v3.4.0.


clipto

Member

Offline

Activity: 311
Merit: 10

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 04, 2016, 08:08:54 PM

#953

3.3.7 gave me better hashrate than 3.3.8 on Lyra2RE, so I'm still running that.
I'm looking forward to the increased performance, but I'm bound to Windows.

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 04, 2016, 08:36:49 PM
Last edit: August 04, 2016, 10:32:30 PM by joblo

#954

Quote from: clipto on August 04, 2016, 08:08:54 PM

3.3.7 gave me better hashrate than 3.3.8 on Lyra2RE, so I'm still running that.
I'm looking forward to the increased performance, but I'm bound to Windows.

There should be a slight increase between 3.3.7 and 3.3.8.

I'm suspecting data alignment issues. I've noticed different hashrates on different runs of the same version.
It doesn't seem to be related to other CPU activity.
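
One way to put numbers on that run-to-run variation is to summarise the "Total:" lines of several benchmark runs. The sketch below is only an illustration: hashrate_stats is a made-up name, and it assumes the log format seen in this thread, where each Total line ends with a rate followed by kH/s or MH/s.

```shell
#!/bin/bash
# hashrate_stats: read miner benchmark output on stdin and summarise the
# rates on the "Total:" lines, normalised to kH/s.
# Hypothetical helper; per-CPU lines and unknown units are ignored.
hashrate_stats() {
    awk '
        /Total:/ {
            unit = $NF
            rate = $(NF - 1) + 0                  # force numeric
            if (unit == "MH/s") rate *= 1000      # normalise to kH/s
            else if (unit != "kH/s") next
            if (n == 0 || rate < min) min = rate
            if (n == 0 || rate > max) max = rate
            sum += rate
            n++
        }
        END {
            if (n) printf "n=%d mean=%.2f min=%.2f max=%.2f kH/s\n", n, sum / n, min, max
        }
    '
}

# Example (not run here): sample one minute of benchmark output.
#   timeout 60 ./cpuminer -a lyra2re --benchmark --threads=8 | hashrate_stats
```

Running it once per build and comparing the min/max spread against the difference between versions gives a rough idea of whether a change is real or within the noise.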


johnsmithx

Hero Member

Offline

Activity: 589
Merit: 507

I don't buy nor sell anything here and never will.

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 04, 2016, 11:49:08 PM

#955

joblo, that's an excellent improvement! Now you are definitely faster than tpruvot, at least by 4.8%.

I made a better repeatable benchmark of tpruvot's, the numbers are directly in the build.sh:

Code:

#!/bin/bash

if [ "$OS" = "Windows_NT" ]; then
    ./mingw64.sh
    exit 0
fi

# Linux build

make clean || echo clean

rm -f config.status
./autogen.sh || echo done

# Ubuntu 10.04 (gcc 4.4)
extracflags="-O3 -march=native -w -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"

# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
extracflags="$extracflags -Ofast -fuse-linker-plugin -ftree-loop-if-convert-stores"

if [ ! "0" = `cat /proc/cpuinfo | grep -c avx` ]; then
    # march native doesn't always works, ex. some Pentium Gxxx (no avx)
    extracflags="$extracflags -march=native"
fi


# Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz 4 threads (d2.xlarge)


#309-310
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" ./configure --with-crypto --with-curl

#311-312
CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="-std=gnu++11" ./configure --with-crypto --with-curl

#281
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS" ./configure --with-crypto --with-curl

#280
#CFLAGS="-O3 $extracflags -flto -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl


#269
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" ./configure --with-crypto --with-curl

#264
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="-std=gnu++11" ./configure --with-crypto --with-curl

#242
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS" ./configure --with-crypto --with-curl

#245
#CFLAGS="-O3 $extracflags -march=native -DUSE_ASM -pg" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-crypto --with-curl


make -j $(grep processor /proc/cpuinfo | wc -l)

strip -s cpuminer

So with his I get 312 at best on this particular machine. That flag configuration is basically the default if you uncomment everything, so I didn't make him any faster; I just proved he can be much slower if the wrong flags are used.

Now with yours, without any change, untouched cpuminer-opt-3.4.0.tar.gz, I get this:

Code:

root@xxx:~/cpuminer-opt# ./cpuminer -a lyra2re --benchmark

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  4 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-04 23:24:25] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-04 23:24:26] CPU #1: 65.54 kH, 82.16 kH/s
[2016-08-04 23:24:26] CPU #0: 65.54 kH, 81.67 kH/s
[2016-08-04 23:24:26] CPU #3: 65.54 kH, 81.84 kH/s
[2016-08-04 23:24:26] Total: 196.61 kH, 245.67 kH/s
[2016-08-04 23:24:26] CPU #2: 65.54 kH, 81.68 kH/s
[2016-08-04 23:24:30] CPU #0: 326.68 kH, 81.80 kH/s
[2016-08-04 23:24:30] CPU #3: 327.37 kH, 82.00 kH/s
[2016-08-04 23:24:30] Total: 785.13 kH, 327.64 kH/s
[2016-08-04 23:24:30] CPU #2: 326.73 kH, 81.75 kH/s
[2016-08-04 23:24:30] CPU #1: 328.64 kH, 82.01 kH/s
[2016-08-04 23:24:35] CPU #0: 409.02 kH, 81.78 kH/s
[2016-08-04 23:24:35] CPU #3: 409.99 kH, 81.93 kH/s
[2016-08-04 23:24:35] Total: 1474.38 kH, 327.46 kH/s
[2016-08-04 23:24:35] CPU #2: 408.76 kH, 81.73 kH/s
[2016-08-04 23:24:35] CPU #1: 410.04 kH, 81.94 kH/s
[2016-08-04 23:24:40] CPU #0: 408.89 kH, 81.78 kH/s
[2016-08-04 23:24:40] CPU #3: 409.63 kH, 81.94 kH/s
[2016-08-04 23:24:40] Total: 1637.32 kH, 327.39 kH/s

But when I add -flto I get the following error at the final link:

Code:

g++  -O3 -march=native -w -flto -std=gnu++11 -Lyes/lib  -Lyes/lib  -o cpuminer cpuminer-cpu-miner.o cpuminer-util.o cpuminer-uint256.o cpuminer-api.o cpuminer-sysinfos.o cpuminer-algo-gate-api.o algo/groestl/cpuminer-sph_groestl.o algo/skein/cpuminer-sph_skein.o algo/bmw/cpuminer-sph_bmw.o algo/shavite/cpuminer-sph_shavite.o algo/shavite/cpuminer-shavite.o algo/echo/cpuminer-sph_echo.o algo/blake/cpuminer-sph_blake.o algo/heavy/cpuminer-sph_hefty1.o algo/blake/cpuminer-mod_blakecoin.o algo/luffa/cpuminer-sph_luffa.o algo/cubehash/cpuminer-sph_cubehash.o algo/simd/cpuminer-sph_simd.o algo/hamsi/cpuminer-sph_hamsi.o algo/fugue/cpuminer-sph_fugue.o algo/gost/cpuminer-sph_gost.o algo/jh/cpuminer-sph_jh.o algo/keccak/cpuminer-sph_keccak.o algo/keccak/cpuminer-keccak.o algo/sha3/cpuminer-sph_sha2.o algo/sha3/cpuminer-sph_sha2big.o algo/shabal/cpuminer-sph_shabal.o algo/whirlpool/cpuminer-sph_whirlpool.o crypto/cpuminer-blake2s.o crypto/cpuminer-oaes_lib.o crypto/cpuminer-c_keccak.o crypto/cpuminer-c_groestl.o crypto/cpuminer-c_blake256.o crypto/cpuminer-c_jh.o crypto/cpuminer-c_skein.o crypto/cpuminer-hash.o crypto/cpuminer-aesb.o crypto/cpuminer-magimath.o algo/argon2/cpuminer-argon2a.o algo/argon2/ar2/cpuminer-argon2.o algo/argon2/ar2/cpuminer-opt.o algo/argon2/ar2/cpuminer-cores.o algo/argon2/ar2/cpuminer-ar2-scrypt-jane.o algo/argon2/ar2/cpuminer-blake2b.o algo/cpuminer-axiom.o algo/blake/cpuminer-blake.o algo/blake/cpuminer-blake2.o algo/blake/cpuminer-blakecoin.o algo/blake/cpuminer-decred.o algo/blake/cpuminer-pentablake.o algo/bmw/cpuminer-bmw256.o algo/cubehash/sse2/cpuminer-cubehash_sse2.o algo/cryptonight/cpuminer-cryptolight.o algo/cryptonight/cpuminer-cryptonight-common.o algo/cryptonight/cpuminer-cryptonight-aesni.o algo/cryptonight/cpuminer-cryptonight.o algo/cpuminer-drop.o algo/echo/aes_ni/cpuminer-hash.o algo/cpuminer-fresh.o algo/groestl/cpuminer-groestl.o algo/groestl/cpuminer-myr-groestl.o algo/groestl/sse2/cpuminer-grso.o 
algo/groestl/sse2/cpuminer-grso-asm.o algo/groestl/aes_ni/cpuminer-hash-groestl.o algo/groestl/aes_ni/cpuminer-hash-groestl256.o algo/haval/cpuminer-haval.o algo/heavy/cpuminer-heavy.o algo/heavy/cpuminer-bastion.o algo/cpuminer-hmq1725.o algo/hodl/cpuminer-hodl.o algo/hodl/cpuminer-hodl-gate.o algo/hodl/cpuminer-hodl_arith_uint256.o algo/hodl/cpuminer-hodl_uint256.o algo/hodl/cpuminer-hash.o algo/hodl/cpuminer-hmac_sha512.o algo/hodl/cpuminer-sha256.o algo/hodl/cpuminer-sha512.o algo/hodl/cpuminer-utilstrencodings.o algo/hodl/cpuminer-hodl-wolf.o algo/hodl/cpuminer-aes.o algo/hodl/cpuminer-sha512_avx.o algo/hodl/cpuminer-sha512_avx2.o algo/cpuminer-lbry.o algo/luffa/cpuminer-luffa.o algo/luffa/sse2/cpuminer-luffa_for_sse2.o algo/lyra2/cpuminer-lyra2.o algo/lyra2/cpuminer-sponge.o algo/lyra2/cpuminer-lyra2rev2.o algo/lyra2/cpuminer-lyra2re.o algo/keccak/sse2/cpuminer-keccak.o algo/cpuminer-m7m.o algo/cpuminer-neoscrypt.o algo/cpuminer-nist5.o algo/cpuminer-pluck.o algo/quark/cpuminer-quark.o algo/qubit/cpuminer-qubit.o algo/ripemd/cpuminer-sph_ripemd.o algo/cpuminer-scrypt.o algo/scryptjane/cpuminer-scrypt-jane.o algo/sha2/cpuminer-sha2.o algo/simd/sse2/cpuminer-nist.o algo/simd/sse2/cpuminer-vector.o algo/skein/cpuminer-skein.o algo/skein/cpuminer-skein2.o algo/cpuminer-s3.o algo/tiger/cpuminer-sph_tiger.o algo/whirlpool/cpuminer-whirlpool.o algo/whirlpool/cpuminer-whirlpoolx.o algo/x11/cpuminer-x11.o algo/x11/cpuminer-x11evo.o algo/x11/cpuminer-x11gost.o algo/x11/cpuminer-c11.o algo/x13/cpuminer-x13.o algo/x14/cpuminer-x14.o algo/x15/cpuminer-x15.o algo/x17/cpuminer-x17.o algo/yescrypt/cpuminer-yescrypt.o algo/yescrypt/cpuminer-yescrypt-common.o algo/yescrypt/cpuminer-sha256_Y.o algo/yescrypt/cpuminer-yescrypt-simd.o algo/cpuminer-zr5.o asm/cpuminer-neoscrypt_asm.o  asm/cpuminer-sha2-x64.o asm/cpuminer-scrypt-x64.o asm/cpuminer-aesb-x64.o   -lcurl -lz -ljansson -lpthread  -lssl -lcrypto -lgmp
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx2':
<artificial>:(.text+0x9712): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9729): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9760): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9785): undefined reference to `scrypt_ChunkMix_avx2'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_xop':
<artificial>:(.text+0x99f2): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a09): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a40): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a65): undefined reference to `scrypt_ChunkMix_xop'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx':
<artificial>:(.text+0x9cd2): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9ce9): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9d20): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9d45): undefined reference to `scrypt_ChunkMix_avx'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_ssse3':
<artificial>:(.text+0x9fb2): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0x9fc9): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0xa000): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0xa025): undefined reference to `scrypt_ChunkMix_ssse3'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_sse2':
<artificial>:(.text+0xa292): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa2a9): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa2e0): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa305): undefined reference to `scrypt_ChunkMix_sse2'
collect2: error: ld returned 1 exit status
Makefile:1333: recipe for target 'cpuminer' failed
make[2]: *** [cpuminer] Error 1
make[2]: Leaving directory '/root/cpuminer-opt'
Makefile:3453: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/root/cpuminer-opt'
Makefile:670: recipe for target 'all' failed
make: *** [all] Error 2

If you are unsure how to fix it, could you at least guide me on how to disable the whole scrypt (optimization)? I am really anxious to see what -flto will do.

ReiMomo

Sr. Member

Offline

Activity: 2366
Merit: 305

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 05, 2016, 12:08:11 AM

#956

So when is the Windows bin out?

johnsmithx

Hero Member

Offline

Activity: 589
Merit: 507

I don't buy nor sell anything here and never will.

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 05, 2016, 01:30:49 AM
Last edit: August 05, 2016, 02:08:52 AM by johnsmithx

#957

Success!

I did this very ugly hack, joblo please don't get a heart attack:

Code:

--- scrypt-jane-romix-template.h.orig   2016-02-05 22:05:38.000000000 +0000
+++ scrypt-jane-romix-template.h 2016-08-05 00:37:48.949684265 +0000
@@ -86,9 +86,9 @@
  for (i = 0; i < /*N - 1*/511; i++, block += chunkWords) {
        /* 3: V_i = X */
        /* 4: X = H(X) */
-       SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
+//         SCRYPT_CHUNKMIX_FN(block + chunkWords, block, NULL, /*r*/1);
  }
- SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);
+//     SCRYPT_CHUNKMIX_FN(X, block, NULL, 1);

  /* 6: for i = 0 to N - 1 do */
  for (i = 0; i < /*N*/512; i += 2) {
@@ -96,13 +96,13 @@
        j = X[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;

        /* 8: X = H(Y ^ V_j) */
-       SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);
+//         SCRYPT_CHUNKMIX_FN(Y, X, scrypt_item(V, j, chunkWords), 1);

        /* 7: j = Integerify(Y) % N */
        j = Y[chunkWords - SCRYPT_BLOCK_WORDS] & /*(N - 1)*/511;

        /* 8: X = H(Y ^ V_j) */
-       SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
+//         SCRYPT_CHUNKMIX_FN(X, Y, scrypt_item(V, j, chunkWords), 1);
  }

  /* 10: B' = X */

And now it does compile with -flto and here is the result:

Code:

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-05 00:58:03] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:58:04] CPU #0: 65.54 kH, 84.17 kH/s
[2016-08-05 00:58:04] CPU #1: 65.54 kH, 84.25 kH/s
[2016-08-05 00:58:04] CPU #3: 65.54 kH, 84.23 kH/s
[2016-08-05 00:58:04] Total: 196.61 kH, 252.64 kH/s
[2016-08-05 00:58:04] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 00:58:08] CPU #2: 335.45 kH, 84.02 kH/s
[2016-08-05 00:58:08] CPU #1: 336.99 kH, 84.25 kH/s
[2016-08-05 00:58:08] CPU #3: 336.92 kH, 84.24 kH/s
[2016-08-05 00:58:08] Total: 1074.89 kH, 336.68 kH/s
[2016-08-05 00:58:08] CPU #0: 336.67 kH, 84.04 kH/s
[2016-08-05 00:58:13] CPU #2: 420.12 kH, 84.16 kH/s
[2016-08-05 00:58:13] CPU #1: 421.26 kH, 84.35 kH/s
[2016-08-05 00:58:13] CPU #0: 420.18 kH, 84.19 kH/s
[2016-08-05 00:58:13] CPU #3: 421.18 kH, 84.34 kH/s
[2016-08-05 00:58:13] Total: 1682.74 kH, 337.04 kH/s
[2016-08-05 00:58:18] CPU #2: 420.78 kH, 84.16 kH/s
[2016-08-05 00:58:18] CPU #1: 421.77 kH, 84.31 kH/s
[2016-08-05 00:58:18] CPU #0: 420.97 kH, 84.19 kH/s
[2016-08-05 00:58:18] CPU #3: 421.69 kH, 84.26 kH/s
[2016-08-05 00:58:18] Total: 1685.21 kH, 336.92 kH/s
[2016-08-05 00:58:23] CPU #1: 421.54 kH, 84.37 kH/s
[2016-08-05 00:58:23] CPU #3: 421.31 kH, 84.32 kH/s
[2016-08-05 00:58:23] CPU #2: 420.81 kH, 83.99 kH/s
[2016-08-05 00:58:23] Total: 1684.63 kH, 336.87 kH/s
[2016-08-05 00:58:23] CPU #0: 420.93 kH, 84.01 kH/s
[2016-08-05 00:58:28] CPU #2: 419.96 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #0: 420.07 kH, 84.10 kH/s
[2016-08-05 00:58:28] CPU #1: 421.87 kH, 84.17 kH/s
[2016-08-05 00:58:28] CPU #3: 421.58 kH, 84.09 kH/s
[2016-08-05 00:58:28] Total: 1683.49 kH, 336.46 kH/s

So using -flto gives another 2.75% speed increase. That's a 7.7% speed increase in total over tpruvot.

Now this is with -flto and -fuse-linker-plugin:

Code:

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-05 00:55:15] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 00:55:16] CPU #0: 65.54 kH, 84.75 kH/s
[2016-08-05 00:55:16] CPU #1: 65.54 kH, 84.78 kH/s
[2016-08-05 00:55:16] CPU #2: 65.54 kH, 84.56 kH/s
[2016-08-05 00:55:16] CPU #3: 65.54 kH, 84.44 kH/s
[2016-08-05 00:55:16] Total: 262.14 kH, 338.53 kH/s
[2016-08-05 00:55:20] CPU #3: 337.77 kH, 84.06 kH/s
[2016-08-05 00:55:20] Total: 534.38 kH, 338.15 kH/s
[2016-08-05 00:55:20] CPU #2: 338.22 kH, 84.01 kH/s
[2016-08-05 00:55:20] CPU #1: 339.13 kH, 84.09 kH/s
[2016-08-05 00:55:20] CPU #0: 338.98 kH, 84.02 kH/s
[2016-08-05 00:55:25] CPU #0: 420.11 kH, 84.71 kH/s
[2016-08-05 00:55:25] CPU #2: 420.03 kH, 84.49 kH/s
[2016-08-05 00:55:25] CPU #3: 420.31 kH, 84.05 kH/s
[2016-08-05 00:55:25] Total: 1599.59 kH, 337.33 kH/s
[2016-08-05 00:55:25] CPU #1: 420.43 kH, 84.07 kH/s
[2016-08-05 00:55:30] CPU #3: 420.25 kH, 83.97 kH/s
[2016-08-05 00:55:30] Total: 1680.82 kH, 337.24 kH/s
[2016-08-05 00:55:30] CPU #2: 422.44 kH, 83.97 kH/s
[2016-08-05 00:55:30] CPU #0: 423.54 kH, 83.98 kH/s
[2016-08-05 00:55:30] CPU #1: 420.36 kH, 83.97 kH/s
[2016-08-05 00:55:35] CPU #0: 419.88 kH, 84.64 kH/s
[2016-08-05 00:55:35] CPU #2: 419.84 kH, 84.39 kH/s
[2016-08-05 00:55:35] CPU #3: 419.85 kH, 84.00 kH/s
[2016-08-05 00:55:35] Total: 1679.93 kH, 337.00 kH/s
[2016-08-05 00:55:35] CPU #1: 419.85 kH, 84.02 kH/s
[2016-08-05 00:55:40] CPU #0: 423.20 kH, 84.42 kH/s
[2016-08-05 00:55:40] CPU #3: 420.02 kH, 84.32 kH/s
[2016-08-05 00:55:40] Total: 1682.91 kH, 337.15 kH/s

Basically the same speed. Now what if I actually call tpruvot's build.sh, exactly the one I showed in my previous post:

Code:

CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU features: SSE2 AES AVX AVX2
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX AVX2

[2016-08-05 01:10:02] 4 miner threads started, using 'lyra2re' algorithm.
[2016-08-05 01:10:03] CPU #0: 65.54 kH, 84.11 kH/s
[2016-08-05 01:10:03] CPU #1: 65.54 kH, 83.93 kH/s
[2016-08-05 01:10:03] CPU #2: 65.54 kH, 83.86 kH/s
[2016-08-05 01:10:03] CPU #3: 65.54 kH, 83.96 kH/s
[2016-08-05 01:10:03] Total: 262.14 kH, 335.86 kH/s
[2016-08-05 01:10:07] CPU #1: 335.71 kH, 84.00 kH/s
[2016-08-05 01:10:07] CPU #2: 335.44 kH, 83.92 kH/s
[2016-08-05 01:10:07] CPU #3: 335.85 kH, 83.99 kH/s
[2016-08-05 01:10:07] Total: 1072.54 kH, 336.02 kH/s
[2016-08-05 01:10:07] CPU #0: 336.45 kH, 83.93 kH/s
[2016-08-05 01:10:12] CPU #1: 420.00 kH, 84.00 kH/s
[2016-08-05 01:10:12] CPU #2: 419.62 kH, 83.92 kH/s
[2016-08-05 01:10:12] CPU #3: 419.93 kH, 83.99 kH/s
[2016-08-05 01:10:12] Total: 1596.00 kH, 335.82 kH/s
[2016-08-05 01:10:12] CPU #0: 419.64 kH, 83.91 kH/s
[2016-08-05 01:10:17] CPU #1: 419.98 kH, 84.05 kH/s
[2016-08-05 01:10:17] CPU #2: 419.58 kH, 83.98 kH/s
[2016-08-05 01:10:17] CPU #3: 419.93 kH, 84.03 kH/s
[2016-08-05 01:10:17] Total: 1679.12 kH, 335.98 kH/s
[2016-08-05 01:10:17] CPU #0: 419.53 kH, 83.99 kH/s
[2016-08-05 01:10:22] CPU #2: 419.92 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #1: 420.25 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #0: 419.93 kH, 84.04 kH/s
[2016-08-05 01:10:22] CPU #3: 420.18 kH, 84.02 kH/s
[2016-08-05 01:10:22] Total: 1680.28 kH, 336.14 kH/s

Still the same (maximum) speed.

So I will be using joblo's cpuminer with tpruvot's (uncommented) build.sh because that build.sh has all those other flags (including -falign-*) which may or may not matter, so just to be safe..

EDIT: when I took the AVX2 binary and tried to run it on an AVX CPU I got this:

Code:

CPU:       Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
CPU features: SSE2 AES AVX
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX

Illegal instruction (core dumped)

But wasn't the whole idea that all the CPU features would be compiled in and the feature to use would be determined at runtime? It's not a big deal: I just recompiled it, and I will keep two versions (AVX and AVX2) and run whichever is appropriate for the CPU. I just thought I would report this.

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 05, 2016, 02:25:33 PM

#958

Quote from: ReiMomo on August 05, 2016, 12:08:11 AM

So when is the Windows bin out?

Cryptomining Blog has usually been good about producing binaries within a few hours of release.
I'm not sure why not this time. You could ask.

I can't build distributable Windows binaries, but mingw works to compile your own; instructions are in README.md.

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 05, 2016, 02:51:44 PM

#959

Quote from: johnsmithx on August 05, 2016, 01:30:49 AM

Success!

[snip]

So I will be using joblo's cpuminer with tpruvot's (uncommented) build.sh because that build.sh has all those other flags (including -falign-*) which may or may not matter, so just to be safe..

EDIT: when I took the AVX2 binary and tried to run it on an AVX CPU I got this:

Code:

CPU:       Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
CPU features: SSE2 AES AVX
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX

Illegal instruction (core dumped)

Excellent work. The easiest way to avoid the compile error is to comment out the source dir for argon2 and remove the registration
call for argon2 in algo-gate-api.c:register_algo_gate. You can easily remove any algo this way.
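
To illustrate the idea of removing a registration call, here is a minimal sketch. Apart from the function name register_algo_gate, which comes from the post, every type, enum value and field below is a hypothetical stand-in and not the real cpuminer-opt definition. It only shows how dropping one case from a registration switch makes an algo unavailable without touching the others.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the real cpuminer-opt types. */
typedef enum { ALGO_ARGON2, ALGO_LYRA2RE, ALGO_COUNT } algo_id;

typedef struct {
    int (*scanhash)(uint32_t nonce);   /* NULL means "algo not available" */
} algo_gate;

/* Dummy hash routine standing in for a real algo. */
static int scanhash_lyra2re(uint32_t nonce) { return (int)(nonce & 1u); }

/* Leave ENABLE_ARGON2 undefined to drop argon2 from the build entirely. */
#ifdef ENABLE_ARGON2
static int scanhash_argon2(uint32_t nonce) { return (int)(nonce & 1u); }
#endif

/* Fills the gate and returns true only if the algo is compiled in. */
static bool register_algo_gate(algo_id algo, algo_gate *gate)
{
    gate->scanhash = NULL;
    switch (algo) {
#ifdef ENABLE_ARGON2
    case ALGO_ARGON2:
        gate->scanhash = scanhash_argon2;
        break;
#endif
    case ALGO_LYRA2RE:
        gate->scanhash = scanhash_lyra2re;
        break;
    default:
        break;   /* unregistered algos fall through with a NULL gate */
    }
    return gate->scanhash != NULL;
}
```

With ENABLE_ARGON2 undefined, nothing references the argon2 code any more, so its objects can be left out of the link as well.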

You have demonstrated that LTO improves performance with the new compiler but has some incompatibilities with the existing
argon2 code. I will investigate argon2 to try to solve it.

CPU architecture selection is made at compile time. If you do a native compile on a CPU that supports AVX2 you cannot run it
on a CPU with only AVX. If you want to cross compile you must specify the arch of the target CPU, and produce separate executables
for each desired architecture.

My logic for AVX2 isn't fully implemented yet in the capabilities checks. Had it been, it would have
displayed a message warning of the impending crash, then crashed. This is what you should see when implemented:

Code:

CPU features: SSE2 AES AVX
SW built on Aug  5 2016 with GCC 5.4.0
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Unsupported CPU or SW configuration, miner will likely crash!
Illegal instruction (core dumped)
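
A check like the one described above could be sketched as follows. This is not the actual cpuminer-opt code: it assumes GCC or Clang on x86, uses the compiler's predefined `__AVX2__` macro for the compile-time side and the `__builtin_cpu_supports` builtin for the run-time side, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* True if this translation unit was compiled with AVX2 enabled
   (e.g. -mavx2, or -march=native on an AVX2 machine). */
static bool built_with_avx2(void)
{
#if defined(__AVX2__)
    return true;
#else
    return false;
#endif
}

/* True if the CPU we are running on reports AVX2 (GCC/Clang, x86 only). */
static bool cpu_has_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

/* Hypothetical helper: a build is safe if it needs nothing the CPU lacks. */
static bool config_is_safe(bool sw_needs_feature, bool cpu_has_feature)
{
    return !sw_needs_feature || cpu_has_feature;
}

/* Prints a warning and returns false when the binary needs AVX2
   but the CPU does not provide it. */
static bool check_cpu_capability(void)
{
    bool sw = built_with_avx2();
    bool hw = cpu_has_avx2();
    if (!config_is_safe(sw, hw)) {
        fprintf(stderr,
                "Unsupported CPU or SW configuration, miner will likely crash!\n");
        return false;
    }
    return true;
}
```

A check like this only helps if it runs before any AVX2 instruction executes, and it only warns: an AVX2 build still cannot run on an AVX-only CPU.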

joblo (OP)

Legendary

Offline

Activity: 1470
Merit: 1114

Re: [ANN]: cpuminer-opt v3.4.0, NEW AVX2 optimizations.

August 05, 2016, 04:13:36 PM
Last edit: August 05, 2016, 04:44:59 PM by joblo

#960

Quote from: johnsmithx on August 04, 2016, 11:49:08 PM

But when I add -flto I get the following error at the final link:

Code:

g++  -O3 -march=native -w -flto -std=gnu++11 -Lyes/lib  -Lyes/lib  -o cpuminer cpuminer-cpu-miner.o cpuminer-util.o cpuminer-uint256.o cpuminer-api.o cpuminer-sysinfos.o cpuminer-algo-gate-api.o algo/groestl/cpuminer-sph_groestl.o algo/skein/cpuminer-sph_skein.o algo/bmw/cpuminer-sph_bmw.o algo/shavite/cpuminer-sph_shavite.o algo/shavite/cpuminer-shavite.o algo/echo/cpuminer-sph_echo.o algo/blake/cpuminer-sph_blake.o algo/heavy/cpuminer-sph_hefty1.o algo/blake/cpuminer-mod_blakecoin.o algo/luffa/cpuminer-sph_luffa.o algo/cubehash/cpuminer-sph_cubehash.o algo/simd/cpuminer-sph_simd.o algo/hamsi/cpuminer-sph_hamsi.o algo/fugue/cpuminer-sph_fugue.o algo/gost/cpuminer-sph_gost.o algo/jh/cpuminer-sph_jh.o algo/keccak/cpuminer-sph_keccak.o algo/keccak/cpuminer-keccak.o algo/sha3/cpuminer-sph_sha2.o algo/sha3/cpuminer-sph_sha2big.o algo/shabal/cpuminer-sph_shabal.o algo/whirlpool/cpuminer-sph_whirlpool.o crypto/cpuminer-blake2s.o crypto/cpuminer-oaes_lib.o crypto/cpuminer-c_keccak.o crypto/cpuminer-c_groestl.o crypto/cpuminer-c_blake256.o crypto/cpuminer-c_jh.o crypto/cpuminer-c_skein.o crypto/cpuminer-hash.o crypto/cpuminer-aesb.o crypto/cpuminer-magimath.o algo/argon2/cpuminer-argon2a.o algo/argon2/ar2/cpuminer-argon2.o algo/argon2/ar2/cpuminer-opt.o algo/argon2/ar2/cpuminer-cores.o algo/argon2/ar2/cpuminer-ar2-scrypt-jane.o algo/argon2/ar2/cpuminer-blake2b.o algo/cpuminer-axiom.o algo/blake/cpuminer-blake.o algo/blake/cpuminer-blake2.o algo/blake/cpuminer-blakecoin.o algo/blake/cpuminer-decred.o algo/blake/cpuminer-pentablake.o algo/bmw/cpuminer-bmw256.o algo/cubehash/sse2/cpuminer-cubehash_sse2.o algo/cryptonight/cpuminer-cryptolight.o algo/cryptonight/cpuminer-cryptonight-common.o algo/cryptonight/cpuminer-cryptonight-aesni.o algo/cryptonight/cpuminer-cryptonight.o algo/cpuminer-drop.o algo/echo/aes_ni/cpuminer-hash.o algo/cpuminer-fresh.o algo/groestl/cpuminer-groestl.o algo/groestl/cpuminer-myr-groestl.o algo/groestl/sse2/cpuminer-grso.o 
algo/groestl/sse2/cpuminer-grso-asm.o algo/groestl/aes_ni/cpuminer-hash-groestl.o algo/groestl/aes_ni/cpuminer-hash-groestl256.o algo/haval/cpuminer-haval.o algo/heavy/cpuminer-heavy.o algo/heavy/cpuminer-bastion.o algo/cpuminer-hmq1725.o algo/hodl/cpuminer-hodl.o algo/hodl/cpuminer-hodl-gate.o algo/hodl/cpuminer-hodl_arith_uint256.o algo/hodl/cpuminer-hodl_uint256.o algo/hodl/cpuminer-hash.o algo/hodl/cpuminer-hmac_sha512.o algo/hodl/cpuminer-sha256.o algo/hodl/cpuminer-sha512.o algo/hodl/cpuminer-utilstrencodings.o algo/hodl/cpuminer-hodl-wolf.o algo/hodl/cpuminer-aes.o algo/hodl/cpuminer-sha512_avx.o algo/hodl/cpuminer-sha512_avx2.o algo/cpuminer-lbry.o algo/luffa/cpuminer-luffa.o algo/luffa/sse2/cpuminer-luffa_for_sse2.o algo/lyra2/cpuminer-lyra2.o algo/lyra2/cpuminer-sponge.o algo/lyra2/cpuminer-lyra2rev2.o algo/lyra2/cpuminer-lyra2re.o algo/keccak/sse2/cpuminer-keccak.o algo/cpuminer-m7m.o algo/cpuminer-neoscrypt.o algo/cpuminer-nist5.o algo/cpuminer-pluck.o algo/quark/cpuminer-quark.o algo/qubit/cpuminer-qubit.o algo/ripemd/cpuminer-sph_ripemd.o algo/cpuminer-scrypt.o algo/scryptjane/cpuminer-scrypt-jane.o algo/sha2/cpuminer-sha2.o algo/simd/sse2/cpuminer-nist.o algo/simd/sse2/cpuminer-vector.o algo/skein/cpuminer-skein.o algo/skein/cpuminer-skein2.o algo/cpuminer-s3.o algo/tiger/cpuminer-sph_tiger.o algo/whirlpool/cpuminer-whirlpool.o algo/whirlpool/cpuminer-whirlpoolx.o algo/x11/cpuminer-x11.o algo/x11/cpuminer-x11evo.o algo/x11/cpuminer-x11gost.o algo/x11/cpuminer-c11.o algo/x13/cpuminer-x13.o algo/x14/cpuminer-x14.o algo/x15/cpuminer-x15.o algo/x17/cpuminer-x17.o algo/yescrypt/cpuminer-yescrypt.o algo/yescrypt/cpuminer-yescrypt-common.o algo/yescrypt/cpuminer-sha256_Y.o algo/yescrypt/cpuminer-yescrypt-simd.o algo/cpuminer-zr5.o asm/cpuminer-neoscrypt_asm.o  asm/cpuminer-sha2-x64.o asm/cpuminer-scrypt-x64.o asm/cpuminer-aesb-x64.o   -lcurl -lz -ljansson -lpthread  -lssl -lcrypto -lgmp
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx2':
<artificial>:(.text+0x9712): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9729): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9760): undefined reference to `scrypt_ChunkMix_avx2'
<artificial>:(.text+0x9785): undefined reference to `scrypt_ChunkMix_avx2'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_xop':
<artificial>:(.text+0x99f2): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a09): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a40): undefined reference to `scrypt_ChunkMix_xop'
<artificial>:(.text+0x9a65): undefined reference to `scrypt_ChunkMix_xop'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_avx':
<artificial>:(.text+0x9cd2): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9ce9): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9d20): undefined reference to `scrypt_ChunkMix_avx'
<artificial>:(.text+0x9d45): undefined reference to `scrypt_ChunkMix_avx'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_ssse3':
<artificial>:(.text+0x9fb2): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0x9fc9): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0xa000): undefined reference to `scrypt_ChunkMix_ssse3'
<artificial>:(.text+0xa025): undefined reference to `scrypt_ChunkMix_ssse3'
/tmp/ccVXbbn8.ltrans6.ltrans.o: In function `scrypt_ROMix_sse2':
<artificial>:(.text+0xa292): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa2a9): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa2e0): undefined reference to `scrypt_ChunkMix_sse2'
<artificial>:(.text+0xa305): undefined reference to `scrypt_ChunkMix_sse2'
collect2: error: ld returned 1 exit status
Makefile:1333: recipe for target 'cpuminer' failed
make[2]: *** [cpuminer] Error 1
make[2]: Leaving directory '/root/cpuminer-opt'
Makefile:3453: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/root/cpuminer-opt'
Makefile:670: recipe for target 'all' failed
make: *** [all] Error 2

I just want to make sure I understand the problem definition:

- multi is faster with -flto
- multi without -flto is slower than identically compiled opt
- multi with -flto is faster than pre-avx2 opt compiled without -flto
- opt fails to compile with gcc 5.4.0 with -flto
- with gcc 4.8.4, -flto compiles but has no effect on performance.

The significant points are:

- -flto is faster with gcc 5.4.0
- code that compiles with -flto using gcc 4.8.4 fails to compile using gcc 5.4.0.

The code that fails to compile is pretty ugly. It uses asm function pointers to select targets at compile time.
I've never seen anything like this so it will take a while to understand what is going on. It looks like the code is
self-contained and the error doesn't seem to be related to missing libraries.
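
For readers unfamiliar with the pattern, here is a rough sketch of compile-time target selection through macros, which is what names like scrypt_ROMix_avx2 and scrypt_ChunkMix_avx2 in the link errors suggest. It is plain C with hypothetical names and a toy mixing function; the real code appears to involve assembly, which this sketch does not attempt to reproduce, and it makes no claim about why LTO loses the symbols.

```c
#include <stdint.h>

/* Hypothetical names; the real scrypt-jane macros differ in detail. */
#define CONCAT_(a, b) a##b
#define CONCAT(a, b)  CONCAT_(a, b)

/* "Template": defines chunkmix_<suffix> and romix_<suffix> for one target.
   romix_<suffix> references chunkmix_<suffix> by its pasted name, so each
   target needs its own matching pair of symbols at link time. */
#define DEFINE_ROMIX(suffix, rounds)                                    \
    uint32_t CONCAT(chunkmix_, suffix)(uint32_t x)                      \
    {                                                                   \
        for (int i = 0; i < (rounds); i++)                              \
            x = (x << 7 | x >> 25) ^ 0x9e3779b9u; /* toy mixing step */ \
        return x;                                                       \
    }                                                                   \
    uint32_t CONCAT(romix_, suffix)(uint32_t x)                         \
    {                                                                   \
        return CONCAT(chunkmix_, suffix)(CONCAT(chunkmix_, suffix)(x)); \
    }

/* One instantiation per target; a real build would use different bodies. */
DEFINE_ROMIX(sse2, 2)
DEFINE_ROMIX(avx2, 2)

/* Compile-time selection of the variant the rest of the program calls. */
#if defined(__AVX2__)
#define ROMIX_FN romix_avx2
#else
#define ROMIX_FN romix_sse2
#endif

uint32_t romix(uint32_t x) { return ROMIX_FN(x); }
```

If the definition of one of the pasted names is missing or invisible at link time, the linker reports an undefined reference from the matching romix variant, which is the shape of the errors quoted above.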

As a workaround, if you disable argon2 you can get the best of my optimizations as well as LTO, unless some of my opts
conflict with LTO. It wouldn't be the first time I've stepped on the compiler when trying to optimize.
