Bitcoin Forum
Author Topic: [LOCKED] cpuminer-opt v3.12.3, open source optimized multi-algo CPU miner  (Read 443960 times)
joblo (OP)
April 04, 2017, 05:59:52 PM
 #2381

On the bright side I can check off another test case for SHA. Ryzen CPU and features are correctly detected.

CPU: AMD Ryzen 7 1800X Eight-Core Processor
CPU features: SSE2 AES AVX AVX2 SHA


 

AKA JayDDee, cpuminer-opt developer. https://github.com/JayDDee/cpuminer-opt
https://bitcointalk.org/index.php?topic=5226770.msg53865575#msg53865575
BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT,
"With e-currency based on cryptographic proof, without the need to trust a third party middleman, money can be secure and transactions effortless." -- Satoshi
joblo (OP)
April 04, 2017, 06:11:11 PM
 #2382


Quote
Doing the same tests, this time with the "sha256t" algo, I got these results. AVX-only version:

Code:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:35:52] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:35:53] CPU #1: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #2: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #3: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] Total: 786.43 kH, 963.75 kH/s
[2017-04-04 16:35:53] CPU #0: 262.14 kH, 293.18 kH/s
[2017-04-04 16:35:57] CPU #0: 1172.72 kH, 297.42 kH/s
[2017-04-04 16:35:57] CPU #2: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] CPU #3: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] Total: 4004.85 kH, 1250.43 kH/s
[2017-04-04 16:35:57] CPU #1: 1284.99 kH, 313.47 kH/s
[2017-04-04 16:36:02] CPU #3: 1579.41 kH, 316.94 kH/s
[2017-04-04 16:36:02] Total: 5322.12 kH, 1243.72 kH/s
[2017-04-04 16:36:02] CPU #0: 1487.11 kH, 294.72 kH/s
[2017-04-04 16:36:02] CPU #2: 1579.41 kH, 312.05 kH/s
[2017-04-04 16:36:02] CPU #1: 1567.37 kH, 305.89 kH/s
[2017-04-04 16:36:06] CPU #0: 1473.59 kH, 303.84 kH/s
[2017-04-04 16:36:07] CPU #2: 1560.23 kH, 314.61 kH/s
[2017-04-04 16:36:07] CPU #3: 1584.69 kH, 313.62 kH/s
[2017-04-04 16:36:07] Total: 6185.88 kH, 1237.96 kH/s
[2017-04-04 16:36:07] CPU #1: 1529.45 kH, 301.75 kH/s
[2017-04-04 16:36:07] CTRL_C_EVENT received, exiting

And now the AVX2-optimized version:

Code:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:36:14] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:36:15] CPU #2: 262.14 kH, 430.18 kH/s
[2017-04-04 16:36:15] CPU #1: 262.14 kH, 419.43 kH/s
[2017-04-04 16:36:15] CPU #0: 262.14 kH, 409.20 kH/s
[2017-04-04 16:36:15] CPU #3: 262.14 kH, 399.46 kH/s
[2017-04-04 16:36:15] Total: 1048.58 kH, 1658.27 kH/s
[2017-04-04 16:36:19] CPU #3: 1597.83 kH, 422.56 kH/s
[2017-04-04 16:36:19] Total: 2384.26 kH, 1681.37 kH/s
[2017-04-04 16:36:19] CPU #1: 1677.72 kH, 416.17 kH/s
[2017-04-04 16:36:19] CPU #0: 1636.80 kH, 404.45 kH/s
[2017-04-04 16:36:19] CPU #2: 1720.74 kH, 417.14 kH/s
[2017-04-04 16:36:24] CPU #3: 2112.80 kH, 420.91 kH/s
[2017-04-04 16:36:24] Total: 7148.05 kH, 1658.68 kH/s
[2017-04-04 16:36:24] CPU #0: 2022.27 kH, 409.25 kH/s
[2017-04-04 16:36:24] CPU #2: 2085.72 kH, 419.44 kH/s
[2017-04-04 16:36:24] CPU #1: 2080.86 kH, 409.45 kH/s
[2017-04-04 16:36:29] CPU #3: 2104.57 kH, 422.75 kH/s
[2017-04-04 16:36:29] Total: 8293.42 kH, 1660.89 kH/s
[2017-04-04 16:36:29] CPU #0: 2046.24 kH, 414.94 kH/s
[2017-04-04 16:36:29] CPU #2: 2097.18 kH, 417.33 kH/s
[2017-04-04 16:36:29] CPU #1: 2047.27 kH, 406.14 kH/s
[2017-04-04 16:36:29] CTRL_C_EVENT received, exiting

This time I went from 1245 kH/s to 1660 kH/s, a surprising 33.33% increase in speed!

With this algorithm, I would really like to see the performance with native HW SHA acceleration :)

I can't explain this. There is no AES or AVX optimized code in cpuminer for sha256t.
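
For reference, the quoted speedup can be checked with a few lines of Python. This is only a sketch: the helper name `percent_increase` is made up for illustration, and skipping the first "Total" sample of each run as warm-up is an assumption of this example, not something the poster did.

```python
from statistics import mean

def percent_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# "Total" samples (kH/s) copied from the two logs above. The first sample of
# each run is left out as warm-up; that cut-off is an assumption of this sketch.
avx_totals = [1250.43, 1243.72, 1237.96]
avx2_totals = [1681.37, 1658.68, 1660.89]

# Rounded figures as quoted in the post.
rounded_gain = percent_increase(1245.0, 1660.0)
# Same comparison using the mean of the logged totals.
logged_gain = percent_increase(mean(avx_totals), mean(avx2_totals))

print(f"quoted figures: {rounded_gain:.2f}%")
print(f"mean of logged totals: {logged_gain:.2f}%")
```

Both ways of reading the logs land at roughly a one-third gain, so the number in the post is not a rounding artifact.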

onedeveloper
April 04, 2017, 06:42:58 PM
 #2383

Maybe it's just the compiler doing its optimization work. I had the same workload on my computer on every test, so I was as surprised as you. I also checked the source code and it's true there's no AVX2 code there.
joblo (OP)
April 04, 2017, 06:57:33 PM
 #2384

It's also faster than the openssl version (without HW SHA) which also surprised me. I would have expected
openssl to have AVX and AVX2 optimizations but it's slower than the SPH implementation included in cpuminer.

giagge
April 04, 2017, 07:38:02 PM
 #2385

I love hexxcoin with my Ryzen, but I'm stuck at version 3.6.0. Any news on a boost in a new update? I use Windows 10 x64.
joblo (OP)
April 04, 2017, 08:01:54 PM
 #2386

Hexxcoin is pure Lyra2, it's pretty much maxed out. Post your results.

coinbutter
April 04, 2017, 08:22:17 PM
Last edit: April 05, 2017, 12:03:23 AM by coinbutter
 #2387

Quote
I retested v3.5.9.1 using the Windows AVX2 binary on Win 8.1 at suprnova groestlcoin and it works.
I can't explain your rejects.

I tried running it on my X5687 with -a dmd-gr (and -a groestl) and get the same issues with 3.6.1 and 3.5.9.1 (aes-sse42). Perhaps it's something dmd-gr related. I know it's the same algo but there has to be something that causes the (reject reason: low difficulty share of 1.3270640174699655e-7) error.

I'll have some sha256t and deepcoin results on my R7 in a bit.

edit: I'm solo mining dmd-gr and I'll let it run for a few days on the X5687.
edit2: It works fine pool mining GRS. Definitely something DMD related.
edit3: Ryzen R7 1800X @ 3.8 2993 DDR4
  • Deep algo pool: aes-sse42 ~533 kH/s/core, aes-avx ~543 kH/s/core, aes-avx2 ~492 kH/s/core
  • sha256t algo pool:  aes-sse42 ~1333 kH/s/core, aes-avx ~1353 kH/s/core, aes-avx2 ~1639 kH/s/core (all rejected due to low difficulty)
  • groestl algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
  • groestl algo pool 3.5.9.1: aes-sse42 ~656 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all accepted)
  • dmd-gr algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
  • dmd-gr algo pool 3.5.9.1: aes-sse42 ~655 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all rejected due to low difficulty)
edit4: It looks like thread binding goes to the processor mask instead of to a physical core. The Windows scheduler is probably moving the threads:
Code:
[2017-04-04 18:27:47] Binding process to cpu mask 54
[2017-04-04 18:27:47] Starting Stratum on stratum+tcp://xmr-usa.dwarfpool.com:8005
[2017-04-04 18:27:47] 3 miner threads started, using 'cryptonight' algorithm.
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 1 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 0 to cpu mask 54
Running an instance on one physical core (one thread) ~60 H/s/thread
Running an instance on three physical cores (same CCX) (three threads) ~52 H/s/thread
I'm going to try to run an instance across both CCXs and see whether performance gets worse.
edit5: Running an instance on four physical cores (2Cx2 CCX) (four threads) ~53 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (four threads) ~60 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (six threads) ~55 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (six threads) ~57 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (eight threads) ~49 H/s/thread
Additionally, I could see the scheduler moving the threads when more cores than threads were assigned. Unsurprisingly, the scheduler kept a thread off of core 0 even when it was allowed in the processor mask. I think the cross-core cache bandwidth gets filled up at some point. Also, once I hit 4 threads/CCX the L3 cache was probably stuffed and sending data to RAM. The hashrate got as low as 38.84 H/s on one thread at one point, possibly indicating that the cache size was exceeded or that the data was on the other CCX due to the design of Ryzen's victim cache.
Perhaps assigning each cryptonight thread to a physical processor would help alleviate some of the architectural limitations. I'm done with cryptonight for tonight.
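
To put the cryptonight runs above side by side, here is a rough Python sketch that multiplies threads by the approximate per-thread rate from the post. The per-thread figures are the rounded "~" values reported above, so the totals are only estimates.

```python
# (physical cores allowed, miner threads, approx. H/s per thread), as reported above
runs = [
    (1, 1, 60),
    (3, 3, 52),
    (4, 4, 53),
    (6, 4, 60),
    (6, 6, 55),
    (8, 6, 57),
    (8, 8, 49),
]

def total_hashrate(threads, per_thread):
    """Estimated total H/s for one miner instance."""
    return threads * per_thread

# One (cores, threads, estimated total) tuple per run.
totals = [(cores, threads, total_hashrate(threads, rate))
          for cores, threads, rate in runs]

for cores, threads, total in totals:
    print(f"{cores} cores / {threads} threads: ~{total} H/s total")

# Configuration with the highest estimated total.
best = max(totals, key=lambda row: row[2])
```

Per-thread speed drops as threads are added, but by this estimate the total still rises up to eight threads.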
joblo (OP)
April 05, 2017, 02:06:47 AM
Last edit: April 05, 2017, 04:10:58 AM by joblo
 #2388

Thanks for the testing, lots of info to digest.

deep:  slower with avx2, may be an issue with cpu affinity (see below), maybe retest.

sha256t:  all rejects, it doesn't use the code that broke groestl so it needs more investigation.

Edit: please try sha256t with v3.5.9.1

CPU affinity:

Code:
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54

Here's a description of the Windows function SetThreadAffinityMask

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx

The default with no affinity arg is to set one bit in the mask to match the thread with the cpu#:
cpu 0 = mask 1, cpu 1 = mask 2, cpu 2 = mask 4, cpu 3 = mask 8. Each thread is assigned to a different cpu.
On an Intel i7 running 4 threads this works, with one thread on each core and no core with 2 threads.

When multiple bits are set in the mask the thread can be assigned to any of the cpus represented in the mask.
I don't know what happens if multiple threads are assigned using the same multibit mask.

The code as written doesn't seem to allow a different mask for each thread. Maybe it relies on the OS to sort
it out. I am speculating that if multiple cpus are allowed the thread may be moved to another permitted cpu.

A mask of 0x54 doesn't seem to make sense, it's not symmetric.

If I understand correctly, a mask of...

0xffff: assign to any cpu
0x1111: assign 4 threads to either cpu 0, 4, 8 or 12
Edit: correction
0x5555 (was 0x3333): assign 8 threads to 0, 2, 4, 6, 8, 10, 12, or 14
0x000f: assign to 0, 1, 2, or 3
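
The masks above can be decoded mechanically. A small Python sketch, assuming the usual convention that bit n of the mask stands for logical CPU n; `single_cpu_mask` and `cpus_in_mask` are names invented for this example and are not cpuminer functions.

```python
def single_cpu_mask(cpu):
    """Mask with only the bit for logical CPU `cpu` set (bit n <-> CPU n)."""
    return 1 << cpu

def cpus_in_mask(mask):
    """List the logical CPU numbers whose bits are set in `mask`."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

# One single-bit mask per CPU: CPU 0..3 map to 1, 2, 4, 8.
print([single_cpu_mask(cpu) for cpu in range(4)])

# Decode the multi-bit examples from the post.
for mask in (0xFFFF, 0x1111, 0x5555, 0x000F):
    print(f"0x{mask:04x}: {cpus_in_mask(mask)}")
```

Whether the OS then keeps a thread on one of the permitted CPUs or moves it among them is a separate question this sketch does not answer.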

On Ryzen you must consider the CCX as well as SMT (HT). Depending on how the CPUs are mapped,
CPUs 0 & 1 may be:
- two HT threads on the first core on the first CCX,
- the first thread on 2 different cores on the first CCX
- the first thread on the first core of 2 different CCXs
- something else

It's going to take a lot of playing around to figure it all out. Once the mapping is understood it can be determined
if a multibit mask allows the system to move threads to another CPU. Previous observations of better performance
with multiple miner instances suggest it can.

Any suggestions on how to specify a custom single bit mask for each thread? One thought is to use the affinity arg as
spacing between cpus. For instance an arg of:

0 = consecutive CPUs, ie 0,1,2,3,4...
1 = alternate cpus, 0, 2, 4, 6, 8...
2 = every third cpu, 0, 3, 6, 9... useful for 6/12 core Ryzen maybe
3 = every 4th cpu

Edit:

Another thought is to have a binary option for affinity, consecutive or distributed. With distributed the spacing
would be calculated automatically based on the cpu and thread counts.
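
A rough Python sketch of the two ideas, just to make them concrete. Nothing here is cpuminer code: the function names, the wrap-around when the thread count exceeds the available slots, and the `num_cpus // threads` stride for the distributed mode are all assumptions of this example.

```python
def spaced_masks(threads, spacing, num_cpus):
    """Single-bit mask per thread; spacing 0 = consecutive, 1 = alternate, ...

    Wraps around with modulo when the CPUs run out (assumption of this sketch),
    so two threads can end up on the same CPU if the stride is too large.
    """
    stride = spacing + 1
    return [1 << ((t * stride) % num_cpus) for t in range(threads)]

def distributed_masks(threads, num_cpus):
    """Single-bit mask per thread with the stride derived from the counts."""
    stride = max(1, num_cpus // threads)
    return [1 << ((t * stride) % num_cpus) for t in range(threads)]

# 4 threads on alternate CPUs of an 8-CPU machine: CPUs 0, 2, 4, 6.
print([hex(m) for m in spaced_masks(4, 1, 8)])

# 8 threads spread over 16 logical CPUs: combined, the masks cover 0x5555.
print(hex(sum(distributed_masks(8, 16))))
```

Whether every other logical CPU really means one thread per physical core still depends on how the logical CPUs are mapped, as discussed above.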
coinbutter
April 05, 2017, 03:07:11 AM
 #2389

Quote
A mask of 0x54 doesn't seem to make sense, it's not symmetric.

That mask is logical processors 2, 4 and 6. 00101010

I don't load the first core for testing until I reach four threads and since this is cryptonight I was increasing it one thread at a time to find out where the L3 cache saturated.

I think the cryptographic functions in Ryzen are shared in each core, so even though Simultaneous MultiThreading would allow two threads to execute on a core they get bottlenecked waiting for a thread to finish its hash. I think it's most efficient to only put one thread on each core instead of loading every logical processor.

I'd be very interested in having the thread count match the number of processors in the mask and having each thread automatically assigned to its own processor. Call it Ryzen mode!
joblo (OP)
April 05, 2017, 04:30:23 AM
 #2390

00101010 is not 0x54, it's 0x2a, and it's logical cpu 1, 3 & 5 the way I read it. But how to achieve one thread on every
physical core still depends on how AMD maps logical CPUs to physical cores and CCXs. If the default isn't optimal,
AMD maps differently than Intel and cpuminer needs a way to specify the optimal mapping for AMD.

I also made a mistake in my examples, every other cpu is 0x5555.

It also needs to be confirmed whether multiple bits set in the mask means that the thread can be mapped to any
of the associated logical cpus as well as be moved among them.

The optimum configuration for cryptonight on 8/16 core Ryzen is to have 8 threads, one nailed to each physical core which also means 4 per CCX.
The question is how to do that. Need to understand AMD's mapping first.

andy75
April 05, 2017, 07:35:21 AM
 #2391

Has anyone tried changing the JSON parser? Do you think it would help performance?
There are faster and more efficient parsers out there.
onedeveloper
April 05, 2017, 07:47:06 AM
 #2392

I must "enter the fray" because I see coinbutter is making a mistake. Take these notes into account.

  • The CPU affinity treats the bits as flags for each logical CPU. In Windows >= 8 one can look in the task manager to find how many CPUs it recognizes.
  • Each bit represents a CPU, as joblo said: bit 0 is the zeroth CPU, bit 1 the 1st, and so on.
  • In Ryzen's case, as you know, the CPUs are organized in two blocks of 8 logical CPUs, one block per CCX, each sharing a common 8MB cache. This limits the number of threads per CCX for cryptonight to 8MB/2MB = 4 threads.
  • It doesn't matter which logical CPU gets the threads on each CCX. The maximum must be 4.
  • As each logical CPU is a bit, Ryzen masks have 16 bits, i.e. 4 hexadecimal digits.
  • The hexadecimal numbers are written high-bit to low-bit, so the leftmost bit is the 16th CPU in Ryzen's case.
  • If you want to use the maximum 8 threads on Ryzen 7, you must use 4 threads on the 1st CCX and the other 4 on the 2nd.
  • A mask like 0xF0F0 is enough. This means that 8 threads will be assigned to the logical CPUs 15, 14, 13, 12 (second CCX) and 7, 6, 5, 4 (first CCX). The rest of the CPUs will be left free and usable for any other task.

How to mine:

cpuminer-aes-avx2 -a cryptonight -o <your_pool_here> -u <your_user> -p <your_password> -t 8 --cpu-affinity 0xF0F0

You can also use 0x0F0F or other combinations, provided only 4 CPUs (bits) are active in the first two digits and 4 in the last two.
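
The rule in the notes above can be written as a quick check in Python. This sketch takes the layout described in this post at face value: logical CPUs 0-7 on the first CCX, 8-15 on the second, 8MB of L3 per CCX and 2MB per cryptonight thread. The function names are invented for the example.

```python
L3_PER_CCX_MB = 8   # per-CCX L3 size as stated in the notes above
SCRATCHPAD_MB = 2   # cryptonight memory per thread as stated in the notes above
CPUS_PER_CCX = 8    # assumed layout: logical CPUs 0-7 on CCX 0, 8-15 on CCX 1

def threads_per_ccx(mask, num_ccx=2):
    """Count the mask bits that fall on each CCX under the assumed layout."""
    counts = []
    for ccx in range(num_ccx):
        ccx_bits = (mask >> (ccx * CPUS_PER_CCX)) & ((1 << CPUS_PER_CCX) - 1)
        counts.append(bin(ccx_bits).count("1"))
    return counts

def fits_in_l3(mask):
    """True if no CCX gets more threads than its L3 can hold scratchpads for."""
    limit = L3_PER_CCX_MB // SCRATCHPAD_MB
    return all(count <= limit for count in threads_per_ccx(mask))

for mask in (0xF0F0, 0x0F0F, 0x5555, 0x00FF, 0xFFFF):
    print(f"0x{mask:04X}: per CCX {threads_per_ccx(mask)}, fits: {fits_in_l3(mask)}")
```

This only counts bits per CCX; it says nothing about whether two of those bits share a physical core.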

I hope all is clear now.
joblo (OP)
April 05, 2017, 12:06:15 PM
 #2393

Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread
per physical core?

ZenFr
April 05, 2017, 12:19:31 PM
 #2394

What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?
coinbutter
April 05, 2017, 01:06:37 PM
Last edit: April 05, 2017, 01:29:26 PM by coinbutter
 #2395

ZenFr,
2C/4T 0xA or 0x5
4C/8T 0xAA or 0x55

edit: onedeveloper, I was mis-remembering the L3 cache structure on the Ryzen: it's a unified 8MB victim cache per CCX, not a 2MB L3 victim cache per core. Still, there is a cache bandwidth issue, and the nature of the victim cache means that once 8MB per CCX is reached it starts moving data to the other CCX or, if that is full as well, to system RAM. If you mask 0x5 on Ryzen you mask the process to logical processors 0 and 2, cores 1 and 2 of CCX 1. If you mask 0xA on Ryzen you mask the process to logical processors 1 and 3, cores 1 and 2 of CCX 1.
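
The masks above follow a simple pattern that can be generated in Python. This is a sketch under one assumption: the two SMT siblings of physical core k are logical CPUs 2k and 2k+1, which matches the 0x5/0xA description in this post but should be verified on the actual machine. The function name is made up.

```python
def one_thread_per_core_mask(cores, use_second_sibling=False):
    """Mask selecting one logical CPU per physical core.

    Assumes the siblings of core k are logical CPUs 2k and 2k+1.
    """
    offset = 1 if use_second_sibling else 0
    mask = 0
    for core in range(cores):
        mask |= 1 << (2 * core + offset)
    return mask

for cores in (2, 4, 8):
    first = one_thread_per_core_mask(cores)
    second = one_thread_per_core_mask(cores, use_second_sibling=True)
    print(f"{cores}C/{2 * cores}T: {hex(first)} or {hex(second)}")
```

If a system numbers its logical CPUs differently (for example all first siblings, then all second siblings), the pattern would be different.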
onedeveloper
April 05, 2017, 01:34:15 PM
 #2396

Quote
Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each? Don't you want one thread
per physical core?

Yes. 4 idle cores per CCX. You don't gain hash power on Cryptonight once you fill your L3 cache, and each CCX has only 8MB of cache, so that's 4 threads per CCX plus 4 threads free for playing DOTA2 and doing some browsing at the same time :)
onedeveloper
April 05, 2017, 01:40:48 PM
 #2397

Quote
edit: onedeveloper, I was mis-remembering the L3 cache structure on the Ryzen: it's a unified 8MB victim cache per CCX, not a 2MB L3 victim cache per core. Still, there is a cache bandwidth issue, and the nature of the victim cache means that once 8MB per CCX is reached it starts moving data to the other CCX or, if that is full as well, to system RAM. If you mask 0x5 on Ryzen you mask the process to logical processors 0 and 2, cores 1 and 2 of CCX 1. If you mask 0xA on Ryzen you mask the process to logical processors 1 and 3, cores 1 and 2 of CCX 1.

I read this reddit thread -> https://www.reddit.com/r/Amd/comments/5ybrxn/ryzen_7_is_actually_behaving_like_a_dual_4c8t/

They linked this image:



Still, they don't explain how Windows manages CPU affinities, but the secret is finding the physical CCX for each CPU thread (logical CPU) and adjusting the mask accordingly. User "giaggio" used the mask I gave above and had a good result: 640 hashes at only 35 Watts (due to the idle threads/cores on each CCX).
ZenFr
April 05, 2017, 01:59:16 PM
 #2398

Quote
What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?
ZenFr,
2C/4T 0xA or 0x5
4C/8T 0xAA or 0x55
Thank you :-).
coinbutter
April 05, 2017, 02:50:50 PM
 #2399

onedeveloper,

giaggio was running hexxcoin, a lyra2z330 algo. I don't know what the performance is or how it uses the cache for that algo. I was planning on testing lyra2z330 tonight.

Windows is supposed to manage threads by CCX and then by core. It definitely manages them by CCX, and I'm not sure about core because I've always set an affinity mask. I've been trying to tell you how it manages affinities by CCX and core but you've been too busy telling me I'm getting it wrong.

ZenFr,

Happy to help! Also, the tool at https://www.paulhempshall.com/io/cpuminer-affinity-setter/ works great and outputs in decimal format.
joblo (OP)
April 05, 2017, 03:05:17 PM
 #2400


Quote
Windows is supposed to manage threads by CCX and then by core.

This is the point I have been trying to make: there are 2 levels of hierarchy. It's not just about the
fragmented L3 cache but also SMT. The miner threads need to be evenly distributed
across the CCXs as well as across the physical cores within the CCX. Loading 2 threads on one
physical core will reduce performance while another physical core remains idle.

Each physical core should run only one thread when mining cryptonight. It also helps spread the
thermal load.

0x0f0f vs 0x5555, which achieves that?
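
One way to compare the two masks is to count how many mask bits land on each physical core. The Python sketch below does that under an unverified assumption: logical CPUs 2k and 2k+1 are the SMT siblings of core k. If Ryzen on Windows enumerates them differently, the result changes. `load_per_core` is a name invented for this example.

```python
from collections import Counter

def load_per_core(mask, threads_per_core=2):
    """Map physical core -> number of mask bits on it.

    Assumes logical CPUs are numbered so that siblings are adjacent
    (core k owns logical CPUs 2k and 2k+1).
    """
    cores = Counter()
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            cores[bit // threads_per_core] += 1
        bit += 1
    return dict(cores)

for mask in (0x0F0F, 0x5555):
    print(f"0x{mask:04x}: {load_per_core(mask)}")
```

Under that assumption 0x0f0f doubles up on four cores and leaves four idle, while 0x5555 puts one bit on each of the eight cores. The real mapping still has to be confirmed on hardware.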
