joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
April 04, 2017, 05:59:52 PM |
|
On the bright side I can check off another test case for SHA. Ryzen CPU and features are correctly detected.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
CPU features: SSE2 AES AVX AVX2 SHA
|
|
|
|
joblo (OP)
|
|
April 04, 2017, 06:11:11 PM |
|
Doing the same tests, this time with the "sha256t" algo, I got these results.
AVX-only version:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 SHA
Start mining with SSE2
[2017-04-04 16:35:52] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:35:53] CPU #1: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #2: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #3: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] Total: 786.43 kH, 963.75 kH/s
[2017-04-04 16:35:53] CPU #0: 262.14 kH, 293.18 kH/s
[2017-04-04 16:35:57] CPU #0: 1172.72 kH, 297.42 kH/s
[2017-04-04 16:35:57] CPU #2: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] CPU #3: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] Total: 4004.85 kH, 1250.43 kH/s
[2017-04-04 16:35:57] CPU #1: 1284.99 kH, 313.47 kH/s
[2017-04-04 16:36:02] CPU #3: 1579.41 kH, 316.94 kH/s
[2017-04-04 16:36:02] Total: 5322.12 kH, 1243.72 kH/s
[2017-04-04 16:36:02] CPU #0: 1487.11 kH, 294.72 kH/s
[2017-04-04 16:36:02] CPU #2: 1579.41 kH, 312.05 kH/s
[2017-04-04 16:36:02] CPU #1: 1567.37 kH, 305.89 kH/s
[2017-04-04 16:36:06] CPU #0: 1473.59 kH, 303.84 kH/s
[2017-04-04 16:36:07] CPU #2: 1560.23 kH, 314.61 kH/s
[2017-04-04 16:36:07] CPU #3: 1584.69 kH, 313.62 kH/s
[2017-04-04 16:36:07] Total: 6185.88 kH, 1237.96 kH/s
[2017-04-04 16:36:07] CPU #1: 1529.45 kH, 301.75 kH/s
[2017-04-04 16:36:07] CTRL_C_EVENT received, exiting
And now the AVX2-optimized version:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 SHA
Start mining with SSE2
[2017-04-04 16:36:14] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:36:15] CPU #2: 262.14 kH, 430.18 kH/s
[2017-04-04 16:36:15] CPU #1: 262.14 kH, 419.43 kH/s
[2017-04-04 16:36:15] CPU #0: 262.14 kH, 409.20 kH/s
[2017-04-04 16:36:15] CPU #3: 262.14 kH, 399.46 kH/s
[2017-04-04 16:36:15] Total: 1048.58 kH, 1658.27 kH/s
[2017-04-04 16:36:19] CPU #3: 1597.83 kH, 422.56 kH/s
[2017-04-04 16:36:19] Total: 2384.26 kH, 1681.37 kH/s
[2017-04-04 16:36:19] CPU #1: 1677.72 kH, 416.17 kH/s
[2017-04-04 16:36:19] CPU #0: 1636.80 kH, 404.45 kH/s
[2017-04-04 16:36:19] CPU #2: 1720.74 kH, 417.14 kH/s
[2017-04-04 16:36:24] CPU #3: 2112.80 kH, 420.91 kH/s
[2017-04-04 16:36:24] Total: 7148.05 kH, 1658.68 kH/s
[2017-04-04 16:36:24] CPU #0: 2022.27 kH, 409.25 kH/s
[2017-04-04 16:36:24] CPU #2: 2085.72 kH, 419.44 kH/s
[2017-04-04 16:36:24] CPU #1: 2080.86 kH, 409.45 kH/s
[2017-04-04 16:36:29] CPU #3: 2104.57 kH, 422.75 kH/s
[2017-04-04 16:36:29] Total: 8293.42 kH, 1660.89 kH/s
[2017-04-04 16:36:29] CPU #0: 2046.24 kH, 414.94 kH/s
[2017-04-04 16:36:29] CPU #2: 2097.18 kH, 417.33 kH/s
[2017-04-04 16:36:29] CPU #1: 2047.27 kH, 406.14 kH/s
[2017-04-04 16:36:29] CTRL_C_EVENT received, exiting
This time I went from 1245 kH/s to 1660 kH/s, a surprising 33.33% increase in speed! With this algorithm, I would really like to see the performance with native HW SHA acceleration.
I can't explain this. There is no AES- or AVX-optimized code in cpuminer for sha256t.
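For anyone who wants to check the quoted gain, here is a minimal sketch that recomputes it from the "Total" lines in the two logs above. Skipping the first reading of each run as warm-up is an assumption of this sketch, and the helper names are made up for illustration.

```python
# Steady-state "Total" readings copied from the two logs above (kH/s).
# ASSUMPTION: the first Total of each run is treated as warm-up and skipped.
avx_totals = [1250.43, 1243.72, 1237.96]
avx2_totals = [1681.37, 1658.68, 1660.89]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def percent_increase(before, after):
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


avx_mean = mean(avx_totals)
avx2_mean = mean(avx2_totals)

print(f"AVX mean:  {avx_mean:.2f} kH/s")
print(f"AVX2 mean: {avx2_mean:.2f} kH/s")
print(f"Gain from log means:       {percent_increase(avx_mean, avx2_mean):.2f}%")
print(f"Gain from rounded figures: {percent_increase(1245, 1660):.2f}%")
```

The rounded figures in the post (1245 and 1660) give exactly one third, which is where the 33.33% comes from; the log means land within about a point of that.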
|
|
|
|
onedeveloper
|
|
April 04, 2017, 06:42:58 PM |
|
Maybe it's just the compiler doing its optimization work. I had the same workload on my computer on every test, so I was as surprised as you. I also checked the source code and it's true there's no AVX2 code there.
|
|
|
|
joblo (OP)
|
|
April 04, 2017, 06:57:33 PM |
|
Maybe it's just the compiler doing its optimization work. I had the same workload on my computer on every test, so I was as surprised as you. I also checked the source code and it's true there's no AVX2 code there.
It's also faster than the openssl version (without HW SHA) which also surprised me. I would have expected openssl to have AVX and AVX2 optimizations but it's slower than the SPH implementation included in cpuminer.
|
|
|
|
giagge
Legendary
Offline
Activity: 1134
Merit: 1001
|
|
April 04, 2017, 07:38:02 PM |
|
I love hexxcoin with my Ryzen, but I'm stuck at version 3.6.0. Any news of a boost in a new update? I use Windows 10 x64.
|
|
|
|
joblo (OP)
|
|
April 04, 2017, 08:01:54 PM |
|
I love hexxcoin with my Ryzen, but I'm stuck at version 3.6.0. Any news of a boost in a new update? I use Windows 10 x64.
Hexxcoin is pure Lyra2, it's pretty much maxed out. Post your results.
|
|
|
|
coinbutter
Newbie
Offline
Activity: 25
Merit: 0
|
|
April 04, 2017, 08:22:17 PM Last edit: April 05, 2017, 12:03:23 AM by coinbutter |
|
I retested v3.5.9.1 using the Windows AVX2 binary on Win 8.1 at suprnova groestlcoin and it works. I can't explain your rejects.
I tried running it on my X5687 with -a dmd-gr (and -a groestl) and get the same issues with 3.6.1 and 3.5.9.1 (aes-sse42). Perhaps it's something dmd-gr related. I know it's the same algo but there has to be something that causes the (reject reason: low difficulty share of 1.3270640174699655e-7) error. I'll have some sha256t and deepcoin results on my R7 in a bit.
edit: I'm solo mining dmd-gr and I'll let it run for a few days on the X5687.
edit2: It works fine pool mining GRS. Definitely something DMD related.
edit3: Ryzen R7 1800X @ 3.8 2993 DDR4
- Deep algo pool: aes-sse42 ~533 kH/s/core, aes-avx ~543 kH/s/core, aes-avx2 ~492 kH/s/core
- sha256t algo pool: aes-sse42 ~1333 kH/s/core, aes-avx ~1353 kH/s/core, aes-avx2 ~1639 kH/s/core (all rejected due to low difficulty)
- groestl algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
- groestl algo pool 3.5.9.1: aes-sse42 ~656 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all accepted)
- dmd-gr algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
- dmd-gr algo pool 3.5.9.1: aes-sse42 ~655 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all rejected due to low difficulty)
edit4: It looks like thread binding goes to the processor mask instead of to a physical core. The Windows scheduler is probably moving the threads:
[2017-04-04 18:27:47] Binding process to cpu mask 54
[2017-04-04 18:27:47] Starting Stratum on stratum+tcp://xmr-usa.dwarfpool.com:8005
[2017-04-04 18:27:47] 3 miner threads started, using 'cryptonight' algorithm.
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 1 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 0 to cpu mask 54
Running an instance on one physical core (one thread) ~60 H/s/thread
Running an instance on three physical cores (same CCX) (three threads) ~52 H/s/thread
I'm going to try to run an instance across both CCXs and see whether performance gets worse.
edit5:
Running an instance on four physical cores (2Cx2 CCX) (four threads) ~53 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (four threads) ~60 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (six threads) ~55 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (six threads) ~57 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (eight threads) ~49 H/s/thread
Additionally, I could see the scheduler moving the threads when more cores than threads were assigned. Unsurprisingly, the scheduler kept a thread off of core 0 even when it was allowed in the processor mask.
I think the cross-core cache bandwidth gets filled up at some point. Also, once I hit 4 threads/CCX the L3 cache was probably full and spilling data to RAM. The hashrate got as low as 38.84 H/s on one thread at one point, possibly indicating that the cache size was exceeded or that the data was on the other CCX due to the design of Ryzen's victim cache. Perhaps assigning the threads on the cryptonight algorithm to a physical processor would help alleviate some of the architectural limitations. I'm done with cryptonight for tonight.
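The per-thread figures above are easier to compare as aggregate hashrates. A rough sketch, using only the approximate numbers from this post (so the totals are just as approximate); the variable and function names are illustrative:

```python
# (physical cores allowed, miner threads, approx. H/s per thread), as reported above.
runs = [
    (1, 1, 60),
    (3, 3, 52),
    (4, 4, 53),
    (6, 4, 60),
    (6, 6, 55),
    (8, 6, 57),
    (8, 8, 49),
]


def total_rate(threads, per_thread):
    """Aggregate hashrate if every thread runs at the reported per-thread rate."""
    return threads * per_thread


for cores, threads, per_thread in runs:
    print(f"{cores} cores, {threads} threads: ~{total_rate(threads, per_thread)} H/s total")

# The configuration with the highest aggregate rate among the reported runs.
best = max(runs, key=lambda run: total_rate(run[1], run[2]))
print("best:", best)
```

On these numbers the per-thread rate drops as threads are added, but the aggregate still rises up to eight threads, so the per-thread slowdown alone does not argue for running fewer threads.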
|
|
|
|
joblo (OP)
|
|
April 05, 2017, 02:06:47 AM Last edit: April 05, 2017, 04:10:58 AM by joblo |
|
Thanks for the testing, lots of info to digest.
deep: slower with avx2, may be an issue with cpu affinity (see below), maybe retest.
sha256t: all rejects, it doesn't use the code that broke groestl so needs more investigation. Edit: please try sha256t with v3.5.9.1
CPU affinity:
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54
Here's a description of the Windows function SetThreadAffinityMask: https://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx
The default with no affinity arg is to set one bit in the mask to match the thread with the cpu#: cpu 0 = mask 1, cpu 1 = mask 2, cpu 2 = mask 4, cpu 3 = mask 8. Each thread is assigned to a different cpu. On an Intel i7 running 4 threads this works with one thread on each core and no core with 2 threads.
When multiple bits are set in the mask the thread can be assigned to any of the cpus represented in the mask. If multiple threads are assigned using the same multibit mask, I don't know what happens. The code as written doesn't seem to allow a different mask for each thread. Maybe it relies on the OS to sort it out. I am speculating that if multiple cpus are allowed the thread may be moved to another permitted cpu.
A mask of 0x54 doesn't seem to make sense, it's not symmetric. If I understand correctly a mask of...
0xffff: assign to any cpu
0x1111: assign 4 threads to either cpu 0, 4, 8 or 12
0x5555 (edit: corrected from 0x3333): assign 8 threads to 0, 2, 4, 6, 8, 10, 12 or 14
0x000f: assign to 0, 1, 2, or 3
On Ryzen you must consider the CCX as well as SMT (HT). Depending on how the CPUs are mapped, CPUs 0 & 1 may be:
- two HT threads on the first core on the first CCX,
- the first thread on 2 different cores on the first CCX,
- the first thread on the first core of 2 different CCXs,
- something else.
It's going to take a lot of playing around to figure it all out. Once the mapping is understood it can be determined if a multibit mask allows the system to move threads to another CPU. Previous observations of better performance with multiple miner instances suggest it can.
Any suggestions on how to specify a custom single bit mask for each thread? One thought is to use the affinity arg as spacing between cpus. For instance an arg of:
0 = consecutive CPUs, i.e. 0, 1, 2, 3, 4...
1 = alternate cpus, 0, 2, 4, 6, 8...
2 = every third cpu, 0, 3, 6, 9... useful for 6/12 core Ryzen maybe
3 = every 4th cpu
Edit: Another thought is to have a binary option for affinity, consecutive or distributed. With distributed the spacing would be calculated automatically based on the cpu and thread counts.
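The spacing idea above could look roughly like this. It is only a sketch of the proposal, not cpuminer code: `single_bit_masks` and `distributed_spacing` are hypothetical names, and the automatic spacing rule (CPU count divided by thread count) is one possible reading of "distributed".

```python
def single_bit_masks(num_threads, spacing, num_cpus):
    """Return one single-bit affinity mask per thread.

    `spacing` follows the proposal: 0 = consecutive CPUs,
    1 = every other CPU, 2 = every third CPU, and so on.
    """
    stride = spacing + 1
    masks = []
    for thread in range(num_threads):
        cpu = thread * stride
        if cpu >= num_cpus:
            raise ValueError(
                f"thread {thread} would need CPU {cpu}, only {num_cpus} CPUs available"
            )
        masks.append(1 << cpu)  # exactly one bit set: this thread's CPU
    return masks


def distributed_spacing(num_threads, num_cpus):
    """ASSUMPTION: 'distributed' mode spreads threads as widely as the CPU count allows."""
    return max(num_cpus // num_threads, 1) - 1


# 8 threads on a 16-logical-CPU Ryzen in "distributed" mode.
spacing = distributed_spacing(8, 16)
masks = single_bit_masks(8, spacing, 16)
combined = 0
for mask in masks:
    combined |= mask
print(f"spacing={spacing}, per-thread masks={[hex(m) for m in masks]}")
print(f"union of all masks: {combined:#06x}")
```

With 8 threads on 16 logical CPUs the union of the per-thread masks comes out as 0x5555, matching the "every other cpu" mask discussed in this post. Whether that puts one thread on each physical core still depends on how the logical CPUs are mapped.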
|
|
|
|
coinbutter
|
|
April 05, 2017, 03:07:11 AM |
|
A mask of 0x54 doesn't seem to make sense, it's not symmetric.
That mask is logical processors 2,4 and 6. 00101010
I don't load the first core for testing until I reach four threads, and since this is cryptonight I was increasing it one thread at a time to find out where the L3 cache saturated. I think the cryptographic functions in Ryzen are shared within each core, so even though Simultaneous MultiThreading would allow two threads to execute on a core they get bottlenecked waiting for a thread to finish its hash. I think it's most efficient to only put one thread on each core instead of loading every logical processor.
I'd be very interested in having the thread count match the number of processors in the mask and having each thread automatically assigned to its own processor. Call it Ryzen mode!
|
|
|
|
joblo (OP)
|
|
April 05, 2017, 04:30:23 AM |
|
A mask of 0x54 doesn't seem to make sense, it's not symmetric.
That mask is logical processors 2,4 and 6. 00101010
I don't load the first core for testing until I reach four threads, and since this is cryptonight I was increasing it one thread at a time to find out where the L3 cache saturated. I think the cryptographic functions in Ryzen are shared within each core, so even though Simultaneous MultiThreading would allow two threads to execute on a core they get bottlenecked waiting for a thread to finish its hash. I think it's most efficient to only put one thread on each core instead of loading every logical processor.
I'd be very interested in having the thread count match the number of processors in the mask and having each thread automatically assigned to its own processor. Call it Ryzen mode!
00101010 is not 0x54, it's 0x2a, and it's logical cpu 1, 3 & 5 the way I read it. But how to achieve one thread on every physical core still depends on how AMD maps logical CPUs to physical cores and CCXs. If the default isn't optimum, then AMD maps differently than Intel and cpuminer requires a way to specify the optimum mapping for AMD.
I also made a mistake in my examples: every other cpu is 0x5555.
It also needs to be confirmed whether multiple bits set in the mask means that the thread can be mapped to any of the associated logical cpus as well as be moved among them.
The optimum configuration for cryptonight on 8/16 core Ryzen is to have 8 threads, one nailed to each physical core, which also means 4 per CCX. The question is how to do that. Need to understand AMD's mapping first.
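Since the hex and binary forms keep getting mixed up, a small decoder helps. This is a generic bit-walk with bit 0 standing for logical CPU 0, as in the mask description earlier in the thread; `cpus_in_mask` is an illustrative name, not something from cpuminer.

```python
def cpus_in_mask(mask):
    """List the logical CPU numbers whose bits are set (bit 0 = CPU 0)."""
    cpus = []
    cpu = 0
    while mask:
        if mask & 1:
            cpus.append(cpu)
        mask >>= 1
        cpu += 1
    return cpus


# Masks mentioned in this thread.
for mask in (0x54, 0x2A, 0x5555, 0x000F):
    print(f"0x{mask:04x} = {mask:016b} -> CPUs {cpus_in_mask(mask)}")
```

Read this way, 0x54 covers logical CPUs 2, 4 and 6, while the bit pattern 00101010 is 0x2a and covers 1, 3 and 5.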
|
|
|
|
andy75
|
|
April 05, 2017, 07:35:21 AM |
|
Has anyone tried changing the JSON parser? Do you think it would help performance? There are faster and more efficient parsers out there.
|
|
|
|
onedeveloper
|
|
April 05, 2017, 07:47:06 AM |
|
I must "enter the fray" because I see coinbutter is making a mistake. Take these notes into account.
- The CPU affinity treats the bits as flags for each logical CPU. In Windows >= 8 one can look in the Task Manager to find how many CPUs it recognizes.
- Each bit represents a CPU, as joblo said, bit 0 being the zeroth CPU, bit 1 the 1st, and so on.
- In Ryzen's case, as you know, the logical CPUs are organized in two blocks of 8, one block per CCX, each sharing a common 8 MB cache. Cryptonight needs about 2 MB per thread, so this limits the number of threads per CCX to 8 MB / 2 MB = 4 threads.
- It doesn't matter which logical CPU gets the threads on each CCX. The maximum must be 4.
- As each logical CPU is a bit, Ryzen masks have 16 bits, i.e. 4 hexadecimal digits.
- Hexadecimal numbers are written high bit first, so the leftmost bit corresponds to the 16th CPU in Ryzen's case.
- If you want to use the maximum 8 threads on Ryzen 7, you must use 4 threads in the 1st CCX and another 4 in the 2nd.
- A mask like 0xF0F0 is enough. This means that 8 threads will be assigned to logical CPUs 15, 14, 13, 12 (second CCX) and 7, 6, 5, 4 (first CCX). The rest of the CPUs will be left free and usable for any other task.
How to mine:
cpuminer-aes-avx2 -a cryptonight -o <your_pool_here> -u <your_user> -p <your_password> -t 8 --cpu-affinity 0xF0F0
You can also use 0x0F0F or other combinations, provided there are only 4 CPUs (bits) active in the first two digits and 4 in the second two. I hope all is clear now
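The rule in the notes above (at most 4 cryptonight threads per CCX) can be checked for any 16-bit mask with a short sketch. It assumes, as this post does, that logical CPUs 0-7 sit on the first CCX and 8-15 on the second; that layout, the 2 MB per-thread figure and the function names are assumptions of the sketch, not something confirmed for Windows.

```python
CPUS_PER_CCX = 8    # ASSUMPTION: logical CPUs 0-7 on CCX 0, 8-15 on CCX 1
L3_PER_CCX_MB = 8   # shared L3 per CCX, as stated in the post
SCRATCHPAD_MB = 2   # per-thread cryptonight working set used in the post


def threads_per_ccx(mask, num_ccx=2):
    """Count how many logical CPUs the mask enables on each CCX."""
    counts = []
    for ccx in range(num_ccx):
        ccx_bits = (mask >> (ccx * CPUS_PER_CCX)) & ((1 << CPUS_PER_CCX) - 1)
        counts.append(bin(ccx_bits).count("1"))
    return counts


def fits_l3(mask):
    """True if no CCX gets more threads than its L3 can hold scratchpads for."""
    limit = L3_PER_CCX_MB // SCRATCHPAD_MB
    return all(count <= limit for count in threads_per_ccx(mask))


for mask in (0xF0F0, 0x0F0F, 0x5555, 0xFFFF):
    print(f"{mask:#06x}: per-CCX counts {threads_per_ccx(mask)}, fits L3: {fits_l3(mask)}")
```

Under that layout 0xF0F0, 0x0F0F and 0x5555 all stay within 4 per CCX, and only 0xFFFF exceeds it. The check says nothing about how those CPUs are spread across physical cores.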
|
|
|
|
joblo (OP)
|
|
April 05, 2017, 12:06:15 PM |
|
You can also use 0x0F0F or other combinations, provided there are only 4 CPUs (bits) active in the first two digits and 4 in the second two. I hope all is clear now
Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?
|
|
|
|
ZenFr
Legendary
Offline
Activity: 1260
Merit: 1046
|
|
April 05, 2017, 12:19:31 PM |
|
You can also use 0x0F0F or other combinations, provided there are only 4 CPUs (bits) active in the first two digits and 4 in the second two. I hope all is clear now
Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?
What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?
|
|
|
|
coinbutter
|
|
April 05, 2017, 01:06:37 PM Last edit: April 05, 2017, 01:29:26 PM by coinbutter |
|
You can also use 0x0F0F or other combinations, provided there are only 4 CPUs (bits) active in the first two digits and 4 in the second two. I hope all is clear now
Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?
What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?
ZenFr,
2C/4T: 0xA or 0x5
4C/8T: 0xAA or 0x55
edit: onedeveloper, I was misremembering the L3 cache structure on the Ryzen; it's a unified 8 MB victim cache per CCX, not a 2 MB L3 victim cache per core. Still, there is a cache bandwidth issue, and the nature of the victim cache means that if 8 MB per CCX is reached then it starts moving data to the other CCX or, if that is full as well, to system RAM.
If you mask 0x5 on Ryzen you mask the process to logical processors 0 and 2, cores 1 and 2 of CCX 1.
If you mask 0xA on Ryzen you mask the process to logical processors 1 and 3, cores 1 and 2 of CCX 1.
|
|
|
|
onedeveloper
|
|
April 05, 2017, 01:34:15 PM |
|
You can also use 0x0F0F or other combinations, provided there are only 4 CPUs (bits) active in the first two digits and 4 in the second two. I hope all is clear now
Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?
Yes. 4 idle cores per CCX. You don't get better hash power on cryptonight once you fill your L3 cache, and each CCX has only 8 MB of cache, so that's 4 threads per CCX plus 4 threads free for playing DOTA2 and doing some browsing at the same time
|
|
|
|
onedeveloper
|
|
April 05, 2017, 01:40:48 PM |
|
You can also use 0x0F0F or other combinations, provided there are only 4 CPUs (bits) active in the first two digits and 4 in the second two. I hope all is clear now
Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread per physical core?
What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?
ZenFr,
2C/4T: 0xA or 0x5
4C/8T: 0xAA or 0x55
edit: onedeveloper, I was misremembering the L3 cache structure on the Ryzen; it's a unified 8 MB victim cache per CCX, not a 2 MB L3 victim cache per core. Still, there is a cache bandwidth issue, and the nature of the victim cache means that if 8 MB per CCX is reached then it starts moving data to the other CCX or, if that is full as well, to system RAM.
If you mask 0x5 on Ryzen you mask the process to logical processors 0 and 2, cores 1 and 2 of CCX 1.
If you mask 0xA on Ryzen you mask the process to logical processors 1 and 3, cores 1 and 2 of CCX 1.
I read this reddit thread -> https://www.reddit.com/r/Amd/comments/5ybrxn/ryzen_7_is_actually_behaving_like_a_dual_4c8t/
They linked an image there. Still, they don't explain how Windows manages CPU affinities, but the secret is finding the physical CCX for each CPU thread (logical CPU) and adjusting the mask accordingly. User "giaggio" used the mask I mentioned above and had a good result: 640 hashes at only 35 Watts (due to the idle threads/cores on each CCX).
|
|
|
|
ZenFr
|
|
April 05, 2017, 01:59:16 PM |
|
What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) nd 4 cores/8 thread CPUs (Core i7)) ?
ZenFr,
2C/4T: 0xA or 0x5
4C/8T: 0xAA or 0x55
Thank you :-).
|
|
|
|
coinbutter
|
|
April 05, 2017, 02:50:50 PM |
|
onedeveloper, giaggio was running hexxcoin, a lyra2z330 algo. I don't know what the performance is or how it uses the cache for that algo. I was planning on testing lyra2z330 tonight.
Windows is supposed to manage threads by CCX and then by core. It definitely manages them by CCX; I'm not sure about core because I've always set an affinity mask. I've been trying to tell you how it manages affinities by CCX and core but you've been too busy telling me I'm getting it wrong.
ZenFr, happy to help! Also, the tool at https://www.paulhempshall.com/io/cpuminer-affinity-setter/ works great and outputs in decimal format.
|
|
|
|
joblo (OP)
|
|
April 05, 2017, 03:05:17 PM |
|
Windows is supposed to manage threads by CCX and then by core.
This is the point I have been trying to make: there are 2 levels of hierarchy. It's not just about the fragmented L3 cache but also SMT. The miner threads need to be evenly distributed across the CCXs as well as across the physical cores within each CCX. Loading 2 threads on one physical core will reduce performance while another physical core remains idle. Each physical core should run only one thread when mining cryptonight. It also helps spread the thermal load. 0x0f0f vs 0x5555, which achieves that?
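One way to reason about the 0x0f0f vs 0x5555 question is to count allowed logical CPUs per physical core. The sketch below ASSUMES that logical CPUs 2k and 2k+1 are SMT siblings on core k, which matches coinbutter's description of 0x5 and 0xA earlier in the thread but has not been confirmed here; `load_per_core` is an illustrative name.

```python
from collections import Counter


def load_per_core(mask, siblings_per_core=2):
    """Count allowed logical CPUs per physical core.

    ASSUMPTION: logical CPUs 2k and 2k+1 are SMT siblings on core k.
    If the OS enumerates CPUs differently, the result changes.
    """
    load = Counter()
    cpu = 0
    while mask:
        if mask & 1:
            load[cpu // siblings_per_core] += 1
        mask >>= 1
        cpu += 1
    return dict(load)


for mask in (0x0F0F, 0x5555):
    per_core = load_per_core(mask)
    doubled = [core for core, count in per_core.items() if count > 1]
    print(f"{mask:#06x}: {len(per_core)} cores used, cores with 2 threads: {doubled}")
```

Under that sibling assumption, 0x5555 touches all eight cores once each, while 0x0f0f uses only four cores (two per CCX) with both SMT siblings enabled on each. If the mapping turns out to be different, the same function can be rerun with the corrected core lookup.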
|
|
|
|
|