Bitcoin Forum
February 18, 2019, 11:49:28 PM *
Author Topic: [ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner  (Read 418598 times)
joblo
April 04, 2017, 03:07:16 PM
 #2381

onedeveloper, your posts about Ryzen are what initially brought me to this forum :D cpuminer-multi didn't seem very optimized for my hardware and I'm looking for something better. If someone wants me to try out an algorithm or something, I'm happy to give it a try.

Regarding your groestl/dmd-gr problem, the error you see in v3.5.9.1 is the same as the bug I introduced in a later release,
suggesting the Windows binaries were built incorrectly. I will retest that.

Other algos with Ryzen I am curious about:

deep: compute bound, good test of AVX2

sha256t: compute bound without any AVX2 code*

lyra2z330: I/O bound due to large data array

* sha256t would benefit from HW SHA acceleration available with Ryzen but requires a supported Linux compile environment.
   See release announcement for v3.6.1 for details.

https://bitcointalk.org/index.php?topic=1326803.msg18406368#msg18406368

cpuminer-opt developer, https://bitcointalk.org/index.php?topic=1326803.0
BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT,
ETH: 0x72122edabcae9d3f57eab0729305a425f6fef6d0
onedeveloper
April 04, 2017, 03:42:22 PM
 #2382

Your selection of algos piqued my interest, so I decided to try them on my Windows 8.1 machine. This is the result for the AVX version of the "deep" algo:

Code:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX

[2017-04-04 16:31:20] 4 miner threads started, using 'deep' algorithm.
[2017-04-04 16:31:31] CPU #3: 2097.15 kH, 182.74 kH/s
[2017-04-04 16:31:31] Total: 2097.15 kH, 182.74 kH/s
[2017-04-04 16:31:31] CPU #2: 2097.15 kH, 182.49 kH/s
[2017-04-04 16:31:31] CPU #0: 2097.15 kH, 177.66 kH/s
[2017-04-04 16:31:32] CPU #1: 2097.15 kH, 175.57 kH/s
[2017-04-04 16:31:36] CPU #1: 702.27 kH, 171.87 kH/s
[2017-04-04 16:31:36] CPU #2: 912.45 kH, 183.92 kH/s
[2017-04-04 16:31:36] CPU #3: 913.69 kH, 182.45 kH/s
[2017-04-04 16:31:36] Total: 4625.55 kH, 715.90 kH/s
[2017-04-04 16:31:36] CPU #0: 888.29 kH, 180.76 kH/s
[2017-04-04 16:31:41] CPU #1: 859.36 kH, 172.99 kH/s
[2017-04-04 16:31:41] CPU #2: 919.62 kH, 183.39 kH/s
[2017-04-04 16:31:41] CPU #3: 912.25 kH, 183.06 kH/s
[2017-04-04 16:31:41] Total: 3579.51 kH, 720.20 kH/s
[2017-04-04 16:31:41] CPU #0: 903.81 kH, 179.12 kH/s
[2017-04-04 16:31:44] CTRL_C_EVENT received, exiting

And this is the same test using AVX2:

Code:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES AVX AVX2
Start mining with SSE2 AES AVX2

[2017-04-04 16:30:52] 4 miner threads started, using 'deep' algorithm.
[2017-04-04 16:31:01] CPU #3: 2097.15 kH, 230.87 kH/s
[2017-04-04 16:31:01] Total: 2097.15 kH, 230.87 kH/s
[2017-04-04 16:31:01] CPU #2: 2097.15 kH, 228.12 kH/s
[2017-04-04 16:31:01] CPU #1: 2097.15 kH, 219.54 kH/s
[2017-04-04 16:31:02] CPU #0: 2097.15 kH, 206.06 kH/s
[2017-04-04 16:31:06] CPU #0: 824.23 kH, 225.80 kH/s
[2017-04-04 16:31:06] CPU #3: 1154.34 kH, 229.71 kH/s
[2017-04-04 16:31:06] Total: 6172.87 kH, 903.17 kH/s
[2017-04-04 16:31:06] CPU #2: 1140.60 kH, 226.97 kH/s
[2017-04-04 16:31:06] CPU #1: 1097.69 kH, 226.17 kH/s
[2017-04-04 16:31:11] CPU #0: 1129.01 kH, 221.69 kH/s
[2017-04-04 16:31:11] CPU #3: 1148.54 kH, 231.20 kH/s
[2017-04-04 16:31:11] Total: 4515.84 kH, 906.03 kH/s
[2017-04-04 16:31:11] CPU #2: 1134.87 kH, 229.17 kH/s
[2017-04-04 16:31:11] CPU #1: 1130.86 kH, 220.70 kH/s
[2017-04-04 16:31:13] CTRL_C_EVENT received, exiting

This shows that the AVX2-optimized version is 25.83% faster than the AVX-only version on the same architecture. Ryzen is said to have only AVX2 emulation, not real 256 bits, so it will be interesting to see the results there.
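
For anyone who wants to check a figure like this, here is a minimal Python sketch of the arithmetic. The function name `speedup_percent` is made up for illustration, and the inputs are the last "Total" rates from the two 'deep' logs above; picking different samples from the logs will shift the result slightly.

```python
def speedup_percent(baseline_khs: float, optimized_khs: float) -> float:
    """Relative speed increase of the optimized run over the baseline, in percent."""
    if baseline_khs <= 0:
        raise ValueError("baseline hashrate must be positive")
    return (optimized_khs / baseline_khs - 1.0) * 100.0


# Last "Total" lines of the two 'deep' runs above, in kH/s.
avx_total = 720.20
avx2_total = 906.03

print(f"AVX2 vs AVX: {speedup_percent(avx_total, avx2_total):.1f}% faster")
```

A short run like this only gives a rough number, since the per-thread rates move by a few kH/s between samples.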

Doing the same tests, this time with the "sha256t" algo, I got these results. AVX-only version:

Code:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:35:52] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:35:53] CPU #1: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #2: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] CPU #3: 262.14 kH, 321.25 kH/s
[2017-04-04 16:35:53] Total: 786.43 kH, 963.75 kH/s
[2017-04-04 16:35:53] CPU #0: 262.14 kH, 293.18 kH/s
[2017-04-04 16:35:57] CPU #0: 1172.72 kH, 297.42 kH/s
[2017-04-04 16:35:57] CPU #2: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] CPU #3: 1284.99 kH, 315.88 kH/s
[2017-04-04 16:35:57] Total: 4004.85 kH, 1250.43 kH/s
[2017-04-04 16:35:57] CPU #1: 1284.99 kH, 313.47 kH/s
[2017-04-04 16:36:02] CPU #3: 1579.41 kH, 316.94 kH/s
[2017-04-04 16:36:02] Total: 5322.12 kH, 1243.72 kH/s
[2017-04-04 16:36:02] CPU #0: 1487.11 kH, 294.72 kH/s
[2017-04-04 16:36:02] CPU #2: 1579.41 kH, 312.05 kH/s
[2017-04-04 16:36:02] CPU #1: 1567.37 kH, 305.89 kH/s
[2017-04-04 16:36:06] CPU #0: 1473.59 kH, 303.84 kH/s
[2017-04-04 16:36:07] CPU #2: 1560.23 kH, 314.61 kH/s
[2017-04-04 16:36:07] CPU #3: 1584.69 kH, 313.62 kH/s
[2017-04-04 16:36:07] Total: 6185.88 kH, 1237.96 kH/s
[2017-04-04 16:36:07] CPU #1: 1529.45 kH, 301.75 kH/s
[2017-04-04 16:36:07] CTRL_C_EVENT received, exiting

And now the AVX2-optimized version:

Code:
CPU: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
CPU features: SSE2 AES AVX AVX2
SW built on Mar 31 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 SHA
Start mining with SSE2

[2017-04-04 16:36:14] 4 miner threads started, using 'sha256t' algorithm.
[2017-04-04 16:36:15] CPU #2: 262.14 kH, 430.18 kH/s
[2017-04-04 16:36:15] CPU #1: 262.14 kH, 419.43 kH/s
[2017-04-04 16:36:15] CPU #0: 262.14 kH, 409.20 kH/s
[2017-04-04 16:36:15] CPU #3: 262.14 kH, 399.46 kH/s
[2017-04-04 16:36:15] Total: 1048.58 kH, 1658.27 kH/s
[2017-04-04 16:36:19] CPU #3: 1597.83 kH, 422.56 kH/s
[2017-04-04 16:36:19] Total: 2384.26 kH, 1681.37 kH/s
[2017-04-04 16:36:19] CPU #1: 1677.72 kH, 416.17 kH/s
[2017-04-04 16:36:19] CPU #0: 1636.80 kH, 404.45 kH/s
[2017-04-04 16:36:19] CPU #2: 1720.74 kH, 417.14 kH/s
[2017-04-04 16:36:24] CPU #3: 2112.80 kH, 420.91 kH/s
[2017-04-04 16:36:24] Total: 7148.05 kH, 1658.68 kH/s
[2017-04-04 16:36:24] CPU #0: 2022.27 kH, 409.25 kH/s
[2017-04-04 16:36:24] CPU #2: 2085.72 kH, 419.44 kH/s
[2017-04-04 16:36:24] CPU #1: 2080.86 kH, 409.45 kH/s
[2017-04-04 16:36:29] CPU #3: 2104.57 kH, 422.75 kH/s
[2017-04-04 16:36:29] Total: 8293.42 kH, 1660.89 kH/s
[2017-04-04 16:36:29] CPU #0: 2046.24 kH, 414.94 kH/s
[2017-04-04 16:36:29] CPU #2: 2097.18 kH, 417.33 kH/s
[2017-04-04 16:36:29] CPU #1: 2047.27 kH, 406.14 kH/s
[2017-04-04 16:36:29] CTRL_C_EVENT received, exiting

This time I went from 1245 kH/s to 1660 kH/s, a surprising 33.33% increase in speed!

With this algorithm, I would really like to see the performance with native HW SHA acceleration :)
joblo
April 04, 2017, 05:53:22 PM
 #2383


cpuminer-aes-avx2 -a dmd-gr -o stratum+tcp://us.miningfield.com:3377 -u x -p x --cpu-affinity 43690 --cpu-priority 0 --threads=8 --api-bind 127.0.0.1:4050

         **********  cpuminer-opt 3.5.9.1  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI and AVX extensions.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0, Jeff Garzik and Optiminer.

CPU: AMD Ryzen 7 1800X Eight-Core Processor
CPU features: SSE2 AES AVX AVX2
SW built on Mar  4 2017 with GCC 4.8.3
SW features: SSE2 AES AVX AVX2
Algo features: SSE2 AES
Start mining with SSE2 AES

[2017-04-03 22:14:15] Binding process to cpu mask aaaa
[2017-04-03 22:14:15] Starting Stratum on stratum+tcp://us.miningfield.com:3377
[2017-04-03 22:14:15] 8 miner threads started, using 'groestl' algorithm.
[2017-04-03 22:14:15] Stratum difficulty set to 2
[2017-04-03 22:14:18] groestl block 2095712, diff 25.841
[2017-04-03 22:14:20] CPU #4: 262.14 kH, 232.48 kH/s
-
[2017-04-03 22:14:20] Rejected 1/1 (100.0%), 1871.29 kH, 1841.03 kH/s
[2017-04-03 22:14:20] reject reason: low difficulty share of 1.030424417006305e-7
[2017-04-03 22:14:20] factor reduced to : 0.67

edit: Also tried -a groestl

I retested v3.5.9.1 using the Windows AVX2 binary on Win 8.1 at suprnova groestlcoin and it works.
I can't explain your rejects.

Can you (or maybe someone else) try with another CPU and/or another pool like suprnova?
I want to get to the bottom of this before releasing the next version.

Use the AES builds in the legacy release here: https://drive.google.com/file/d/0B0lVSGQYLJIZT0tlY3o4ZjEycXM/view?usp=sharing
v3.6.1 is broken. The non-AES builds of 3.5.9.1 are likely broken as well.

joblo
April 04, 2017, 05:59:52 PM
 #2384

On the bright side I can check off another test case for SHA. Ryzen CPU and features are correctly detected.

CPU: AMD Ryzen 7 1800X Eight-Core Processor
CPU features: SSE2 AES AVX AVX2 SHA


 

joblo
April 04, 2017, 06:11:11 PM
 #2385


This time I went from 1245 kH/s to 1660 kH/s, a surprising 33.33% increase in speed!

With this algorithm, I would really like to see the performance with native HW SHA acceleration :)

I can't explain this. There is no AES or AVX optimized code in cpuminer for sha256t.

onedeveloper
April 04, 2017, 06:42:58 PM
 #2386

Maybe it's just the compiler doing its optimization work. I had the same workload on my computer on every test, so I was as surprised as you. I also checked the source code and it's true there's no AVX2 code there.
joblo
April 04, 2017, 06:57:33 PM
 #2387

Maybe it's just the compiler doing its optimization work. I had the same workload on my computer on every test, so I was as surprised as you. I also checked the source code and it's true there's no AVX2 code there.

It's also faster than the openssl version (without HW SHA) which also surprised me. I would have expected
openssl to have AVX and AVX2 optimizations but it's slower than the SPH implementation included in cpuminer.

giagge
April 04, 2017, 07:38:02 PM
 #2388

I love hexxcoin with my Ryzen, but I'm stuck at version 3.6.0. Any news on a boost in the new update? I use Windows 10 x64.
joblo
April 04, 2017, 08:01:54 PM
 #2389

I love hexxcoin with my Ryzen, but I'm stuck at version 3.6.0. Any news on a boost in the new update? I use Windows 10 x64.

Hexxcoin is pure Lyra2, it's pretty much maxed out. Post your results.

coinbutter
April 04, 2017, 08:22:17 PM
Last edit: April 05, 2017, 12:03:23 AM by coinbutter
 #2390

I retested v3.5.9.1 using the Windows AVX2 binary on Win 8.1 at suprnova groestlcoin and it works.
I can't explain your rejects.

I tried running it on my X5687 with -a dmd-gr (and -a groestl) and get the same issues with 3.6.1 and 3.5.9.1 (aes-sse42). Perhaps it's something dmd-gr related. I know it's the same algo but there has to be something that causes the (reject reason: low difficulty share of 1.3270640174699655e-7) error.

I'll have some sha256t and deepcoin results on my R7 in a bit.

edit: I'm solo mining dmd-gr and I'll let it run for a few days on the X5687.
edit2: It works fine pool mining GRS. Definitely something DMD related.
edit3: Ryzen R7 1800X @ 3.8 2993 DDR4
  • Deep algo pool: aes-sse42 ~533 kH/s/core, aes-avx ~543 kH/s/core, aes-avx2 ~492 kH/s/core
  • sha256t algo pool:  aes-sse42 ~1333 kH/s/core, aes-avx ~1353 kH/s/core, aes-avx2 ~1639 kH/s/core (all rejected due to low difficulty)
  • groestl algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
  • groestl algo pool 3.5.9.1: aes-sse42 ~656 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all accepted)
  • dmd-gr algo pool 3.6.1: aes-sse42 ~751 kH/s/core, aes-avx ~848 kH/s/core, aes-avx2 ~851 kH/s/core (all rejected due to low difficulty)
  • dmd-gr algo pool 3.5.9.1: aes-sse42 ~655 kH/s/core, aes-avx ~724 kH/s/core, aes-avx2 ~731 kH/s/core (all rejected due to low difficulty)
edit4: It looks like thread binding goes to the processor mask instead of to a physical core. The Windows scheduler is probably moving the threads:
Code:
[2017-04-04 18:27:47] Binding process to cpu mask 54
[2017-04-04 18:27:47] Starting Stratum on stratum+tcp://xmr-usa.dwarfpool.com:8005
[2017-04-04 18:27:47] 3 miner threads started, using 'cryptonight' algorithm.
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 1 to cpu mask 54
[2017-04-04 18:27:47] Binding thread 0 to cpu mask 54
Running an instance on one physical core (one thread) ~60 H/s/thread
Running an instance on three physical cores (same CCX) (three threads) ~52 H/s/thread
I'm going to try to run an instance across both CCXs and see whether performance gets worse.
edit5: Running an instance on four physical cores (2Cx2 CCX) (four threads) ~53 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (four threads) ~60 H/s/thread
Running an instance on six physical cores (3Cx2 CCX) (six threads) ~55 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (six threads) ~57 H/s/thread
Running an instance on eight physical cores (4Cx2 CCX) (eight threads) ~49 H/s/thread
Additionally, I could see the scheduler moving the threads when more cores than threads were assigned. Unsurprisingly, the scheduler kept a thread off of core 0 even when it was allowed in the processor mask. I think the cross-core cache bandwidth gets filled up at some point. Also, once I hit 4 threads/CCX the L3 cache was probably stuffed and sending data to RAM. The hashrate got as low as 38.84 H/s on one thread at one point, possibly indicating that the cache size was exceeded or that the data was on the other CCX due to the design of Ryzen's victim cache.
Perhaps assigning the threads on the cryptonight algorithm to a physical processor would help alleviate some of the architectural limitations. I'm done with cryptonight for tonight.
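
The per-thread numbers above are easier to compare once they are multiplied out to a total per run. A minimal Python sketch follows; the figures are the approximate ("~") values from the runs listed above, and the names `RUNS` and `total_hashrate` are only illustrative.

```python
# (physical cores in mask, miner threads, approx. H/s per thread), from the runs above
RUNS = [
    (1, 1, 60),
    (3, 3, 52),
    (4, 4, 53),
    (6, 4, 60),
    (6, 6, 55),
    (8, 6, 57),
    (8, 8, 49),
]


def total_hashrate(threads: int, per_thread_hs: float) -> float:
    """Approximate total hashrate of one miner instance."""
    return threads * per_thread_hs


for cores, threads, per_thread in RUNS:
    total = total_hashrate(threads, per_thread)
    print(f"{cores} cores / {threads} threads: ~{per_thread} H/s/thread, ~{total} H/s total")
```

On these rough figures the eight-thread run still has the highest total (about 392 H/s), even though each individual thread is slower than in the smaller runs.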
joblo
April 05, 2017, 02:06:47 AM
Last edit: April 05, 2017, 04:10:58 AM by joblo
 #2391

Thanks for the testing, lots of info to digest.

deep:  slower with avx2, may be an issue with cpu affinity (see below), maybe retest.

sha256t:  all rejects, it doesn't use the code that broke groestl so needs more investigation.

Edit: please try sha256t with v3.5.9.1

CPU affinity:

Code:
[2017-04-04 18:27:47] Binding thread 2 to cpu mask 54

Here's a description of the Windows function SetThreadAffinityMask

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx

The default with no affinity arg is to set one bit in the mask to match the thread with the cpu#:
cpu 0 = mask 1, cpu 1 = mask 2, cpu 2 = mask 4, cpu 3 = mask 8. Each thread is assigned to a different cpu.
On Intel i7 running 4 threads this works with one thread on each core and no core with 2 threads.

When multiple bits are set in the mask the thread can be assigned to any of the cpus represented in the mask.
I don't know what happens if multiple threads are assigned using the same multibit mask.

The code as written doesn't seem to allow a different mask for each thread. Maybe it relies on the OS to sort
it out. I am speculating that if multiple cpus are allowed the thread may be moved to another permitted cpu.

A mask of 0x54 doesn't seem to make sense, it's not symmetric.

If I understand correctly a mask of...

0xffff: assign to any cpu
0x1111: assign 4 threads to either cpu 0, 4, 8 or 12
0x5555 (corrected from 0x3333): assign 8 threads to 0, 2, 4, 6, 8, 10, 12 or 14
0x000f: assign to 0, 1, 2, or 3
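
To make the bit arithmetic concrete, here is a small Python sketch that lists the logical CPUs selected by a mask, assuming bit n stands for logical CPU n as in the SetThreadAffinityMask description. The helper name `cpus_in_mask` is made up for illustration and is not part of cpuminer.

```python
from typing import List


def cpus_in_mask(mask: int) -> List[int]:
    """Return the logical CPU numbers whose bits are set in an affinity mask."""
    if mask < 0:
        raise ValueError("mask must be non-negative")
    cpus = []
    cpu = 0
    while mask:
        if mask & 1:          # lowest bit set -> this CPU is allowed
            cpus.append(cpu)
        mask >>= 1
        cpu += 1
    return cpus


for mask in (0x000F, 0x1111, 0x5555, 0x54):
    print(f"{mask:#06x}: {cpus_in_mask(mask)}")
```

This only decodes the mask. Which physical core or CCX each logical CPU lands on is a separate question.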

On Ryzen you must consider the CCX as well as SMT (HT). Depending on how the CPUs are mapped
CPUs 0 & 1 may be:
- two HT threads on the first core on the first CCX,
- the first thread on 2 different cores on the first CCX
- the first thread on the first core of 2 different CCXs
- something else

It's going to take a lot of playing around to figure it all out. Once the mapping is understood it can be determined
if a multibit mask allows the system to move threads to another CPU. Previous observations of better performance
with multiple miner instances suggest it can.

Any suggestions on how to specify a custom single bit mask for each thread? One thought is to use the affinity arg as
spacing between cpus. For instance an arg of:

0 = consecutive CPUs, ie 0,1,2,3,4...
1 = alternate cpus, 0, 2, 4, 6, 8...
2 = every third cpu, 0, 3, 6, 9... useful for 6/12 core Ryzen maybe
3 = every 4th cpu

Edit:

Another thought is to have a binary option for affinity, consecutive or distributed. With distributed the spacing
would be calculated automatically based on the cpu and thread counts.
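
A rough Python sketch of both ideas, to show what the resulting single-bit masks would look like. This is only an illustration of the proposal, not existing cpuminer code: the function names and the wrap-around behaviour are assumptions.

```python
from typing import List


def spaced_masks(n_threads: int, n_cpus: int, spacing: int) -> List[int]:
    """One single-bit mask per thread.

    spacing=0 -> consecutive CPUs, 1 -> every other CPU, 2 -> every third, ...
    CPU numbers wrap around at n_cpus, so threads can double up on a CPU
    if n_threads * (spacing + 1) exceeds n_cpus.
    """
    if n_threads < 1 or n_cpus < 1 or spacing < 0:
        raise ValueError("invalid thread count, cpu count or spacing")
    step = spacing + 1
    return [1 << ((t * step) % n_cpus) for t in range(n_threads)]


def distributed_masks(n_threads: int, n_cpus: int) -> List[int]:
    """'Distributed' mode: the step is derived from the CPU and thread counts."""
    if n_threads < 1 or n_cpus < 1:
        raise ValueError("invalid thread count or cpu count")
    step = max(1, n_cpus // n_threads)
    return [1 << ((t * step) % n_cpus) for t in range(n_threads)]


# 8 threads on a 16-CPU machine, spread out automatically
print([hex(m) for m in distributed_masks(8, 16)])
```

With 8 threads on 16 logical CPUs the distributed masks OR together to 0x5555, i.e. every other CPU. Whether that gives one thread per physical core still depends on how the logical CPUs are mapped.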









coinbutter
April 05, 2017, 03:07:11 AM
 #2392

A mask of 0x54 doesn't seem to make sense, it's not symmetric.
That mask is logical processors 2,4 and 6. 00101010

I don't load the first core for testing until I reach four threads and since this is cryptonight I was increasing it one thread at a time to find out where the L3 cache saturated.

I think the cryptographic functions in Ryzen are shared in each core, so even though Simultaneous MultiThreading would allow two threads to execute on a core, they get bottlenecked waiting for a thread to finish its hash. I think it's most efficient to only put one thread on each core instead of loading every logical processor.

I'd be very interested in having the thread count match the number of processors in the mask and automatically assigning each thread to its own processor. Call it Ryzen mode!
joblo
April 05, 2017, 04:30:23 AM
 #2393


00101010 is not 0x54, it's 0x2a and it's logical cpu 1, 3 & 5 the way I read it. But how to achieve one thread on every physical core
still depends on how AMD maps logical CPUs to physical cores and CCXs. If the default isn't optimum, then
AMD maps differently than Intel and cpuminer needs a way to specify the optimum mapping for AMD.

I also made a mistake in my examples, every other cpu is 0x5555.

It also needs to be confirmed whether multiple bits set in the mask means that the thread can be mapped to any
of the associated logical cpus as well as be moved among them.

The optimum configuration for cryptonight on 8/16 core Ryzen is to have 8 threads, one nailed to each physical core which also means 4 per CCX.
The question is how to do that. Need to understand AMD's mapping first.

andy75
April 05, 2017, 07:35:21 AM
 #2394

Has anyone tried changing the JSON parser? Do you think it would help performance?
There are faster and more efficient parsers out there.
onedeveloper
April 05, 2017, 07:47:06 AM
 #2395

I must "enter the fray" because I see coinbutter is making a mistake. Take these notes into account.

  • The CPU affinity treats the bits as flags for each logical CPU. In Windows >= 8 one can look in the task manager to see how many CPUs it recognizes.
  • Each bit represents a CPU, as joblo said, with bit 0 being the zeroth CPU, bit 1 the 1st, and so on.
  • In Ryzen's case, as you know, the CPUs are organized in two blocks of 8, one block per CCX, each sharing a common 8MB cache. This limits the number of threads per CCX for cryptonight to 8MB/2MB = 4 threads.
  • It doesn't matter which logical CPU gets the threads on each CCX. The maximum must be 4.
  • As each thread is a bit, Ryzen masks have 16 bits, i.e. 4 hexadecimal digits.
  • The hexadecimal numbers are represented high-bit to low-bit, so the first 1 is the 16th CPU in Ryzen's case.
  • If you want to use the maximum 8 threads on Ryzen 7, you must use 4 threads on the 1st CCX and the other 4 on the 2nd.
  • A mask like 0xF0F0 is enough. This means that 8 threads will be assigned to the logical CPUs 15, 14, 13, 12 (second CCX) and 7, 6, 5, 4 (first CCX). The rest of the CPUs will be left free and usable for any other task.

How to mine:

cpuminer-aes-avx2 -a cryptonight -o <your_pool_here> -u <your_user> -p <your_password> -t 8 --cpu-affinity 0xF0F0

You can also use 0x0F0F or other combinations, provided there are only 4 CPUs (bits) active in the first two digits and 4 in the last two.

I hope all is clear now
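
The rule in the notes above can be sketched in a few lines of Python. Everything here rests on the assumptions stated in this post: logical CPUs 0-7 sit on the first CCX and 8-15 on the second, each CCX has 8MB of L3, and each cryptonight thread needs 2MB. The constant and function names are made up for illustration.

```python
from typing import List

L3_PER_CCX_MB = 8      # assumed L3 size per CCX
SCRATCHPAD_MB = 2      # assumed cryptonight working set per thread
CPUS_PER_CCX = 8       # assumed: logical CPUs 0-7 on CCX 0, 8-15 on CCX 1


def max_threads_per_ccx() -> int:
    """Threads whose working set fits in one CCX's L3 (8MB / 2MB = 4)."""
    return L3_PER_CCX_MB // SCRATCHPAD_MB


def threads_per_ccx(mask: int, n_ccx: int = 2) -> List[int]:
    """Count the bits of an affinity mask that fall on each CCX."""
    counts = []
    for ccx in range(n_ccx):
        ccx_bits = (mask >> (ccx * CPUS_PER_CCX)) & ((1 << CPUS_PER_CCX) - 1)
        counts.append(bin(ccx_bits).count("1"))
    return counts


def fits_in_l3(mask: int) -> bool:
    """True if no CCX gets more threads than its L3 can hold."""
    return all(c <= max_threads_per_ccx() for c in threads_per_ccx(mask))


for mask in (0xF0F0, 0x0F0F, 0x00FF):
    print(f"{mask:#06x}: per CCX {threads_per_ccx(mask)}, fits: {fits_in_l3(mask)}")
```

Under these assumptions 0xF0F0 and 0x0F0F both put 4 bits on each CCX, while 0x00FF puts all 8 on the first one.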
joblo
April 05, 2017, 12:06:15 PM
 #2396

You can also use 0x0F0F or other combinations, providing there are only 4 CPUs (bits) active on first two digits and 4 on second.

I hope all is clear now  Roll Eyes


Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each. Don't you want one thread
per physical core?

ZenFr
April 05, 2017, 12:19:31 PM
 #2397

What are the masks to have only one thread per physical core (2 cores/4 threads CPUs (Core i3/i5) and 4 cores/8 threads CPUs (Core i7))?
coinbutter
April 05, 2017, 01:06:37 PM
Last edit: April 05, 2017, 01:29:26 PM by coinbutter
 #2398


ZenFr,
2C/4T 0xA or 0x5
4C/8T 0xAA or 0x55
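
These masks follow a simple pattern, sketched below in Python. The sketch assumes that the two SMT siblings of a core are numbered next to each other (logical CPUs 2k and 2k+1 belong to core k), which is what the 0x5/0x55 masks imply; that numbering is not confirmed for every CPU or OS. The function name is hypothetical.

```python
def one_thread_per_core_mask(n_cores: int, smt: int = 2, sibling: int = 0) -> int:
    """Mask with exactly one logical CPU selected per physical core.

    Assumes logical CPUs core*smt .. core*smt + smt - 1 belong to the same core.
    sibling picks which of the core's logical CPUs to use (0 or 1 with 2-way SMT).
    """
    if n_cores < 1 or smt < 1 or not 0 <= sibling < smt:
        raise ValueError("invalid core count, SMT width or sibling index")
    mask = 0
    for core in range(n_cores):
        mask |= 1 << (core * smt + sibling)
    return mask


print(hex(one_thread_per_core_mask(2)), hex(one_thread_per_core_mask(2, sibling=1)))
print(hex(one_thread_per_core_mask(4)), hex(one_thread_per_core_mask(4, sibling=1)))
```

The same pattern extended to 8 cores gives 0x5555 or 0xAAAA.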

edit: onedeveloper, I was misremembering the L3 cache structure on the Ryzen: it's a unified 8MB victim cache per CCX, not a 2MB L3 victim cache per core. Still, there is a cache bandwidth issue, and the nature of the victim cache means that if 8MB per CCX is reached then it starts moving data to the other CCX, or if that is full as well, to system RAM. If you mask 0x5 on Ryzen you mask the process to logical processors 0 and 2, cores 1 and 2 of CCX 1. If you mask 0xA on Ryzen you mask the process to logical processors 1 and 3, cores 1 and 2 of CCX 1.
onedeveloper
April 05, 2017, 01:34:15 PM
 #2399

Almost. Won't a mask of 0xf0f0 result in 4 idle cores and 4 cores with 2 threads each. Don't you want one thread
per physical core?

Yes, 4 idle cores per CCX. You don't gain hash power on Cryptonight once your L3 cache is fully used, and each CCX has only 8MB of cache, so that's 4 threads per CCX, plus 4 threads free for playing DOTA2 and doing some browsing at the same time :)
onedeveloper
April 05, 2017, 01:40:48 PM
 #2400


I read this reddit thread -> https://www.reddit.com/r/Amd/comments/5ybrxn/ryzen_7_is_actually_behaving_like_a_dual_4c8t/

They linked an image (not preserved here).

Still, they don't explain how Windows manages CPU affinities, but the secret is finding the physical CCX for each CPU thread (logical CPU) and adjusting the mask accordingly. User "giaggio" used the mask I gave above and had a good result: 640 hashes at only 35 Watts (due to the idle threads/cores on each CCX).