joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
December 03, 2019, 04:25:34 PM Last edit: December 10, 2019, 10:05:27 PM by joblo
Some notes about peculiarities of GCC 9 that affect cpuminer-opt and may be of interest to developers.
1. It produces more warnings about array bounds. These exposed some violations in cpuminer-opt that will be fixed in the next release.
2. It no longer includes AES in "-march=core-avx2"; AES needs to be added manually: "-march=core-avx2 -maes".
3. It doesn't rebuild Makefile.in after a source file is removed from Makefile.am. The compiler still looked for the deleted file, and it was necessary to edit Makefile.in manually to remove all references to it. Will follow up.
Edit: I was missing automake. I didn't need it until I changed Makefile.am.
For the time being I will continue to use GCC 7 for development and production of the Windows binaries.
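Point 2 can be checked on any installed compiler, since GCC defines the `__AES__` macro whenever AES instructions are enabled for the target. The following is a minimal Python sketch, assuming `gcc` is on the PATH; the helper names `predefined_macros` and `has_macro` are made up for illustration and are not part of cpuminer-opt.

```python
import subprocess


def predefined_macros(flags, compiler="gcc"):
    """Return the compiler's predefined macros for the given flags as text."""
    # -dM -E dumps the predefined macros; "-x c -" reads an empty C file from stdin.
    cmd = [compiler, *flags, "-dM", "-E", "-x", "c", "-"]
    result = subprocess.run(cmd, input="", capture_output=True, text=True, check=True)
    return result.stdout


def has_macro(macro_dump, name):
    """True if the macro dump contains a '#define <name> ...' line."""
    for line in macro_dump.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define" and parts[1] == name:
            return True
    return False


if __name__ == "__main__":
    for flags in (["-march=core-avx2"], ["-march=core-avx2", "-maes"]):
        try:
            dump = predefined_macros(flags)
        except (OSError, subprocess.CalledProcessError) as err:
            print("could not run gcc:", err)
            break
        state = "AES enabled" if has_macro(dump, "__AES__") else "AES not enabled"
        print(" ".join(flags), "->", state)
```

Running it under GCC 7 and GCC 9 should show whether the two compilers treat "-march=core-avx2" differently on a given system.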

joblo (OP)
December 03, 2019, 05:48:16 PM
AVX-512 was supposed to be generally available 3 years ago with Intel Cannon Lake but suffered many delays. cpuminer-opt now supports AVX-512. It is currently available on Intel Skylake-X and the newly released Cascadelake-X CPUs. It is also available on Icelake, but only for mobile CPUs. It looks like AVX-512 will finally be released for mainstream desktops in 2020. I'm not aware of plans to add AVX-512 to AMD Ryzen CPUs.
Algos will be optimized gradually over the next few releases. First up are argon2d, blake2s, keccak, keccakc, skein and skein2.
https://github.com/JayDDee/cpuminer-opt/releases/tag/v3.10.0

joblo (OP)
December 04, 2019, 06:36:00 PM
I previously asked if someone would be kind enough to do a test on a Ryzen 3xxx to compare AVX2 vs AVX performance. With Ryzen 1xxx, AVX2 was often slower than AVX. The results will help me decide how to deliver Windows binaries for Ryzen and whether AVX2 should override SHA.
Currently only Ryzen has SHA so it's simple: use it if it's there because AVX2 is slow. It gets more complicated when Intel releases Icelake with SHA for the desktop. AVX2 is faster than SHA on Intel CPUs. Which is faster on Ryzen 3xxx, and does the new znver2 compile arch make a difference?
Requirements:
Any Ryzen or TR CPU from the 3xxx series.
A recent Linux distro.
Goal:
Compare AVX2 vs AVX performance on Ryzen 3000 series CPUs using the blake2s algo.
Compare AVX2 vs SHA performance on Ryzen 3000 series CPUs using the sha256t algo.
Determine if the new znver2 compile arch has an effect on the results.
Determine if Intel and Ryzen need to prioritize features differently.
Procedure:
1. Compile separate builds for znver1, znver2, avx2 and avx:
./autogen.sh
CFLAGS="-O3 -march=znver1 -Wall" ./configure --with-curl
make -j 4
mv cpuminer cpuminer-znver1
make clean
CFLAGS="-O3 -march=znver2 -Wall" ./configure --with-curl
make -j 4
mv cpuminer cpuminer-znver2
make clean
CFLAGS="-O3 -march=core-avx2 -maes -Wall" ./configure --with-curl
make -j 4
mv cpuminer cpuminer-avx2
make clean
CFLAGS="-O3 -march=corei7-avx -maes -Wall" ./configure --with-curl
make -j 4
mv cpuminer cpuminer-avx
2. Do a blake2s benchmark on each build. 5 minutes each should be enough to produce a stable hash rate.
./cpuminer-znver1 -a blake2s --benchmark --hash-meter
./cpuminer-znver2 -a blake2s --benchmark --hash-meter
./cpuminer-avx2 -a blake2s --benchmark --hash-meter
./cpuminer-avx -a blake2s --benchmark --hash-meter
3. Repeat the tests with sha256t.
4. Post your results, including CPU model, GCC version and the stable total hash rate for each test.
Thanks in advance, the results will help ensure optimum performance on Ryzen CPUs.
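Since the procedure repeats the same steps for each target, it can be generated instead of typed out. The sketch below only builds the command lines as strings and prints them; it does not run anything. The function names are invented for illustration. The AVX target is written as "corei7-avx", which is GCC's name for the Sandy Bridge (AVX) architecture.

```python
# Build targets from the procedure: (binary suffix, architecture flags).
TARGETS = [
    ("znver1", "-march=znver1"),
    ("znver2", "-march=znver2"),
    ("avx2", "-march=core-avx2 -maes"),
    ("avx", "-march=corei7-avx -maes"),  # GCC's name for the AVX (Sandy Bridge) target
]


def build_commands(suffix, march_flags, jobs=4):
    """Commands that configure, build and rename one binary."""
    cflags = f"-O3 {march_flags} -Wall"
    return [
        f'CFLAGS="{cflags}" ./configure --with-curl',
        f"make -j {jobs}",
        f"mv cpuminer cpuminer-{suffix}",
    ]


def benchmark_command(suffix, algo):
    """Benchmark invocation for one binary and one algo."""
    return f"./cpuminer-{suffix} -a {algo} --benchmark --hash-meter"


def full_procedure(algos=("blake2s", "sha256t")):
    """All commands in order: builds first, then the benchmarks per algo."""
    lines = ["./autogen.sh"]
    for index, (suffix, flags) in enumerate(TARGETS):
        if index > 0:
            lines.append("make clean")  # clean between builds, as in the procedure
        lines.extend(build_commands(suffix, flags))
    for algo in algos:
        for suffix, _ in TARGETS:
            lines.append(benchmark_command(suffix, algo))
    return lines


if __name__ == "__main__":
    print("\n".join(full_procedure()))
```

Each benchmark still has to be stopped by hand after about 5 minutes, once the reported hash rate is stable.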

A-Bolt
Legendary
Offline
Activity: 2334
Merit: 2374
December 05, 2019, 12:35:04 PM Last edit: December 05, 2019, 01:37:16 PM by A-Bolt
Quote from joblo: "I previously asked if someone would be kind enough to do a test on a Ryzen 3xxx"
Ryzen 5 3600 @ 4.2GHz (CPU Core Ratio - 42x, PBO is disabled)
GCC 9.2.1
blake2s:
znver1 231.46 MH/s
znver2 238.08 MH/s
avx2 236.11 MH/s
avx 236.09 MH/s
sha256t:
znver1 61.44 MH/s
znver2 61.69 MH/s
avx2 46.25 MH/s
avx 46.26 MH/s

joblo (OP)
December 05, 2019, 02:18:08 PM
Many thanks. These aren't quite the results I expected; I was hoping AVX2 would be better. SHA is clearly the winner over AVX2. That was expected given the AVX2 results. I see no need for separate znver1 and znver2 packages, as there is only a slight improvement for AVX and AVX2. I also see no need to override SHA until Intel CPUs with SHA become mainstream with Icelake.
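To put numbers on those conclusions, the relative differences can be worked out from the posted hash rates. This is a small Python sketch using only the figures from A-Bolt's post. It reads the znver builds as the SHA-enabled ones for sha256t, which is an interpretation of the posts rather than something measured here.

```python
# Hash rates in MH/s as posted for the Ryzen 5 3600 (GCC 9.2.1).
RATES = {
    "blake2s": {"znver1": 231.46, "znver2": 238.08, "avx2": 236.11, "avx": 236.09},
    "sha256t": {"znver1": 61.44, "znver2": 61.69, "avx2": 46.25, "avx": 46.26},
}


def gain_percent(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def summarize(rates):
    """The three comparisons the test was meant to answer, for one algo."""
    return {
        "znver2 vs znver1": gain_percent(rates["znver2"], rates["znver1"]),
        "avx2 vs avx": gain_percent(rates["avx2"], rates["avx"]),
        "znver2 vs avx2": gain_percent(rates["znver2"], rates["avx2"]),
    }


if __name__ == "__main__":
    for algo, rates in RATES.items():
        for label, pct in summarize(rates).items():
            print(f"{algo:8s} {label:17s} {pct:+6.2f}%")
```

On sha256t the znver2 build comes out roughly a third faster than the avx2 build, while AVX2 and AVX are essentially tied on both algos.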

joblo (OP)
December 06, 2019, 12:21:35 AM Last edit: December 10, 2019, 12:07:22 AM by joblo
cpuminer-opt-3.10.1 was just released. It fixes some bugs that can cause generally poor performance without reporting any errors. All users should upgrade.
https://github.com/JayDDee/cpuminer-opt/releases
AVX512 for blake2b, nist5, quark, tribus. More broken lane fixes, fixed buffer overflow in skein AVX512, fixed quark invalid shares AVX2. Only the highest ranking feature in a class is listed at startup; lower ranking features are available but no longer listed.
Edit: v3.10.3 is out with more AVX512.

joblo (OP)
December 15, 2019, 09:57:10 PM Last edit: December 22, 2019, 06:45:09 PM by joblo
It looks interesting but I have lots of questions about it. I'm deep into AVX512 right now so I'll follow up later. It might be specific to RandomX (and probably cryptonight) because they were both designed with specific cache usage in mind. I assume the technique is to disable next-line prefetching, which assumes sequential access. RandomX won't need the next line due to its randomness, so it's a waste to prefetch it.
Edit: It appears this optimization is specific to certain algorithms and could negatively impact others. Implementing it would require using it only on selected algos. The algos currently benefiting are not supported by cpuminer-opt, and it would be a lot of work to analyze which supported algos might be helped.
I'm also concerned about the system impact. This kind of optimization may be appropriate for a dedicated mining system but not for a multi-purpose desktop. Changing the prefetch configuration has a system-wide effect and will affect other applications positively or negatively, even when not mining. There is no graceful way to undo the changes. Miners don't usually exit gracefully: Ctrl-C is the standard exit, or sometimes a crash. This would leave the system prefetch configuration modified and would require manually restoring it. I think I'll pass.

joblo (OP)
December 22, 2019, 07:14:41 PM Last edit: December 23, 2019, 03:04:25 AM by joblo
The previous optimization request got me thinking. It raised concerns similar to another request I resisted and raises an interesting question.
How far should a miner go for optimizing performance?
Should it modify system configuration?
Should the miner be required to run with root/admin privileges?
Two cases illustrate the issue. The first is the one immediately above: the miner makes a system configuration change that affects all applications, and it can't restore the original config itself.
The other case is huge pages. Huge pages require system configuration changes as well, but only to enable the feature. They do not affect applications that don't explicitly use them. But they require the miner to be run by an administrator on Windows.
My opinion is these features may be appropriate on a dedicated mining system but maybe not for a typical desktop PC.
The ideal would be to handle both environments transparently, but that takes a lot of work.
Automated config changes that affect everything and aren't automatically reversed are completely unacceptable, IMO. If manual intervention is required to "undo", it should also be required to "do".
My only concern is with the automation of the change and lack of automated reversal. That has a simple solution. Don't do it in the miner.
HW prefetch changes should be done manually by the user before starting to mine, and then undone when no longer mining. It's completely up to the user which algos to use it with and requires no complex logic in the miner.
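One way to keep this outside the miner is a small wrapper that the user runs instead of the miner itself. The Python sketch below is only an illustration under stated assumptions: `apply_setting` and `restore_setting` are hypothetical callables supplied by the user (nothing here touches the real prefetch configuration), and the demo command just stands in for a miner.

```python
import subprocess
import sys


def run_with_temporary_setting(apply_setting, restore_setting, command):
    """Apply a setting, run `command`, and restore the setting afterwards.

    apply_setting and restore_setting are hypothetical user-supplied callables.
    """
    apply_setting()
    try:
        return subprocess.call(command)
    except KeyboardInterrupt:
        # Ctrl-C is the usual way to stop a miner; treat it as a normal exit.
        return 130
    finally:
        # Runs on normal exit, on Ctrl-C, and if the command fails to start.
        restore_setting()


if __name__ == "__main__":
    log = []
    code = run_with_temporary_setting(
        lambda: log.append("setting applied"),
        lambda: log.append("setting restored"),
        [sys.executable, "-c", "print('pretend this is the miner')"],
    )
    print(log, code)
```

This covers a normal exit and Ctrl-C, but not the wrapper itself being killed or the machine losing power, so the changed setting can still be left behind in those cases.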
Huge pages is not so risky but does have the issue of requiring the miner to be run by admin. My other concern is the lack of transparency.
Huge pages should be completely transparent. The system should be smart enough to allocate huge pages for large datasets. I don't see why any application changes should be required, it should all happen behind the scenes in malloc. And it shouldn't require root/admin.
My stubbornness on this point may be part of the issue.
Both of these optimizations could help some algos and hurt others, so they have to be set for each algo individually. With nearly 100 algos that's a huge task.
So aside from the technical concerns I don't know if it's worth the work.
Comments are welcome.

alucard20724
December 23, 2019, 03:31:23 AM
Here are my results so far with a 7820X... i've only benched for the pools shown. ps.. i'm mining ethash on two VII also at the same time and here are the programs currently benched

joblo (OP)
December 23, 2019, 05:29:23 AM
Quote from alucard20724: "Here are my results so far with a 7820X... i've only benched for the pools shown."
Thanks for posting. It would be nice to compare with AVX2. I'm generally seeing around a 30% increase in most X algos, as they are a mix of optimized and unoptimized hash functions. Algos like lyra2v3, which are 100% optimized, are getting nearly double. It's too bad CPUs don't have a chance with those algos anymore.

alucard20724
December 23, 2019, 06:33:17 AM
Quote from joblo: "Thanks for posting. It would be nice to compare with AVX2."
which version of AVX2 would you like to see?.. i think i have twenty of your previous versions benched up to version 3.10.2 for avx2 on this cpu

joblo (OP)
December 23, 2019, 06:41:28 AM
Quote from alucard20724: "which version of AVX2 would you like to see?.. i think i have twenty of your previous versions benched up to version 3.10.2 for avx2 on this cpu"
Just use the latest release compiled for avx2. That will provide the most direct comparison. If you have Windows it's already compiled for you. With Linux just compile with "-march=skylake" instead of "-march=native". You can confirm it at startup: the SW features should list only AVX2 while the CPU still lists AVX512.

joblo (OP)
December 23, 2019, 05:27:43 PM
Scam warning
A user is posting fake links to cpuminer-opt. Don't download. The only real cpuminer-opt is here and only here: https://github.com/JayDDee/cpuminer-opt

thefix
Legendary
Offline
Activity: 1049
Merit: 1001
December 24, 2019, 01:12:37 AM
Thanks for the heads up. I am sure the link they are posting has a download filled with all kinds of holiday goodies intended to make his/her holidays more festive. It's a good reminder to always double check things before you click them, because even the best of us get caught slipping sometimes.

joblo (OP)
December 24, 2019, 04:10:30 AM
Quote from thefix: "Thanks for the heads up"
The POS tried to copy my ANN but couldn't even do that right. A real winner. I reported it to a mod and it seems to have been deleted.

alucard20724
December 24, 2019, 07:44:38 PM
Quote from joblo: "Just use the latest release compiled for avx2. That will provide the most direct comparison."
@joblo Here's my results for AVX512 vs AVX2 on version 3.10.5. i'm running windows pro 10 x64, 8gigs ram

joblo (OP)
December 28, 2019, 09:13:51 PM
Quote from alucard20724: "@joblo Here's my results for AVX512 vs AVX2 on version 3.10.5"
Thanks, those results are in line with mine. The 100% AVX512 algos are pretty close to double the hash rate, which indicates no significant scaling issues with AVX512 unless memory accesses are bottlenecked. The long X chains are showing the effects of diminishing returns: further optimization of previously optimized code has less effect as it represents a diminishing proportion of the complete algo.
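The diminishing returns described here follow the usual Amdahl's law arithmetic. The Python sketch below is illustrative only: the fractions are hypothetical shares of run time spent in AVX512-optimized hash functions, not measurements from cpuminer-opt.

```python
def overall_speedup(optimized_fraction, local_speedup):
    """Amdahl's law: whole-algo speedup when only part of the run time gets faster.

    optimized_fraction: share of the original run time spent in optimized code (0..1)
    local_speedup: how much faster that optimized part became (2.0 means doubled)
    """
    unchanged = 1.0 - optimized_fraction
    return 1.0 / (unchanged + optimized_fraction / local_speedup)


if __name__ == "__main__":
    # Hypothetical shares of run time that benefit from a 2x vector width.
    for fraction in (1.0, 0.75, 0.5, 0.25):
        print(f"{fraction:4.0%} optimized -> {overall_speedup(fraction, 2.0):.2f}x overall")
```

An algo that is fully optimized doubles, while a chain that spends only half its time in optimized functions gains about a third, which is the same shape as the difference between the 100% AVX512 algos and the long X chains.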

joblo (OP)
January 03, 2020, 05:21:28 AM
cpuminer-opt-3.11.0 introduces full support for Intel's Icelake CPUs.
The Icelake architecture includes AVX512, SHA, and VAES. AVX512 and SHA are already supported on Intel Skylake-X and AMD Ryzen, respectively. VAES is new with Icelake and is an extension of AES_NI and AVX512 that provides 4-way parallel AES encryption and decryption in a 512-bit vector.
Icelake is only available for mobile at this time, desktop availability is unknown.
VAES support is only available as source code and requires GCC 8.
See the OP for more details about v3.11.0
This release marks the end of the rapid development of the past several weeks. Things will slow down considerably with mostly bug fixes and minor tweaks.
I am also planning a cleanup to remove some troublesome and useless code, namely the macros for blake, bmw, etc. used by algos like x11, as well as the scrypt-jane algo. The macros don't provide any noticeable performance difference from the reference code, and scrypt-jane hasn't been used for several years. There are other dead algos but they don't cause problems so there is no need to remove them. This will also reduce the bloat. If anyone has concerns with this plan, please speak up.

alucard20724
January 03, 2020, 08:03:18 AM
is m7m supported with AVX512 now? i didn't see it and i haven't noticed any speed increase based on the prior versions.... haven't tested v3.11.0 yet ... working on it.

joblo (OP)
January 03, 2020, 01:57:38 PM
Quote from alucard20724: "is m7m supported with AVX512 now? i didn't see it and i haven't noticed any speed increase based on the prior versions.... haven't tested v3.11.0 yet ... working on it."
Unfortunately AVX512 only improves algos that have already been taken over by GPUs and ASICs, and they are improving faster than CPUs can. That's because GPUs are real vector processors while CPU SIMD just emulates vector processing with strict restrictions on data organization. A GPU can run thousands of threads while the biggest CPUs with AVX512 can barely crack 100. The secret is in the algorithm: those that can be vectorized can be vectorized better on a GPU. The only way to speed up M7M is more CPU cores and faster clocks. VAES has some potential as a few CPU algos can use it. But VAES will only help with linear vectorizing (loop unrolling) rather than enabling parallel operation.