joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 11, 2016, 01:32:15 PM
X17 algo is now supported by cpuminer-opt. It has been tested at zpool with v3.1.18.
The message warning that the algo has not been tested can be ignored. It will be removed from the next release.
hmage
Member
Offline
Activity: 83
Merit: 10
May 11, 2016, 05:23:24 PM (last edit: May 11, 2016, 05:34:12 PM by hmage)
Edit3: I ran the test several times with both algos and always produced the correct result. Could your script be misinterpreting it? Without further information on how to reproduce, I consider this issue closed.
I was running it without the script and was getting the wrong output. I'm talking about the last line: 463916. This is the line I'm parsing in the script. Maybe it's because I have hyperthreading enabled on my CPU? It could be a thread synchronization issue. It doesn't always happen for me either. I've been trying to reproduce the problem for you with v3.2.1 and asciinema, and so far no luck. I was experiencing the problem on 3.1.17. Maybe you fixed it in v3.2.1? I see you've changed things that could be relevant since then.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 11, 2016, 05:44:33 PM
Edit3: I ran the test several times with both algos and always produced the correct result. Could your script be misinterpreting it? Without further information on how to reproduce, I consider this issue closed.
I was running it without the script and was getting the wrong output. I'm talking about the last line: 463916. This is the line I'm parsing in the script. Maybe it's because I have hyperthreading enabled on my CPU? It could be a thread synchronization issue. It doesn't always happen for me either. I've been trying to reproduce the problem for you with v3.2.1 and asciinema, and so far no luck. I was experiencing the problem on 3.1.17. Maybe you fixed it in v3.2.1? I see you've changed things that could be relevant since then.
I also use hyperthreading. I haven't intentionally touched either algo in several releases, and 3.2 was a restructuring release with no intended change in functionality. If you see it again, let me know and I'll take another look. But it's pretty clear from the code that the last TOTAL rate displays the same data as the last line. Looking at the code, the time_limit handling seems to be in an odd place; the end of the loop would be more appropriate, with the rest of the display code. I may move it on speculation if the problem returns.
hmage
Member
Offline
Activity: 83
Merit: 10
May 11, 2016, 06:18:08 PM
I also use hyperthreading. I haven't intentionally touched either algo in several releases, and 3.2 was a restructuring release with no intended change in functionality. If you see it again, let me know and I'll take another look. But it's pretty clear from the code that the last TOTAL rate displays the same data as the last line.
Yes, the output is the same as the total, just in a format that's easier to parse; no problem with that. The problem was that right before the end, it would spit out thousands of lines in a second with an ever-increasing hashrate that inflated the total result. Do you keep an archive of older versions of cpuminer-opt? I'd like to check an older version, and I foolishly deleted my local copy.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 11, 2016, 07:08:28 PM
I also use hyperthreading. I haven't intentionally touched either algo in several releases, and 3.2 was a restructuring release with no intended change in functionality. If you see it again, let me know and I'll take another look. But it's pretty clear from the code that the last TOTAL rate displays the same data as the last line.
Yes, the output is the same as the total, just in a format that's easier to parse; no problem with that. The problem was that right before the end, it would spit out thousands of lines in a second with an ever-increasing hashrate that inflated the total result. Do you keep an archive of older versions of cpuminer-opt? I'd like to check an older version, and I foolishly deleted my local copy.
A few of the DL links are still active. If you want a specific release, let me know and I'll reactivate it. Keep in mind there have been some problem releases along the way which you probably want to avoid. The post for each release is still in the thread and should help you find the most stable ones. Personally, I don't think it's worth the effort to go back. If the problem reoccurs with the current release you can collect more data and we can pursue it from there.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 12:14:21 AM (last edit: May 12, 2016, 12:45:43 AM by hmage)
I've noticed another performance regression compared to cpuminer-multi. Algos with a very high number of calls per second tend to be slower on cpuminer-opt because of the algo-gate callback functions. When calling through a function pointer, the pointer must be dereferenced before jumping; when the function in question is fast enough, the dereferencing could be reducing performance versus a direct call. One way to fix that is to move the dereference outside the loop.

Pseudocode before:

    func = &hash_sha256;
    while (true) { func(); }

Pseudocode after:

    funcloop = &hashloop_sha256;
    funcloop();

    hashloop_sha256() { while (true) { hash_sha256(); } }

This moves the dereference so it is done only once, at the start of the loop.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 12:34:27 AM
I've noticed another performance regression compared to cpuminer-multi. Algos with a very high number of calls per second tend to be slower on cpuminer-opt because of the algo-gate callback functions. When calling through a function pointer, the pointer must be dereferenced before jumping; when the function in question is fast enough, the dereferencing could be reducing performance versus a direct call. One way to fix that is to move the dereference outside the loop.

Pseudocode before:

    func = &hash_sha256;
    while (true) { func(); }

Pseudocode after:

    funcloop = &hashloop_sha256;
    funcloop();

    hashloop_sha256() { while (true) { hash_sha256(); } }

This moves the dereference so it is done only once, at the start of the loop.
Have you measured a regression? My measurements between 3.0.7 (pre algo-gate) and 3.1 showed a modest improvement in performance across the board. Your suggestion would add the overhead of a function call and return on every iteration to save a pointer deref. Looks like a bad trade to me.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 12:43:38 AM
Your suggestion would add the overhead of a function call and return on every iteration to save a pointer deref. Looks like a bad trade to me.
I meant to move the dereference outside of the iteration completely: have the iteration loop coded in each algo so it doesn't go through the dereference at all. Note: all of this is speculation; I still haven't measured exactly where the slowdown is or why it's slower. I'm just reporting that, for some reason, the non-AES versions of algos are slower in cpuminer-opt than in cpuminer-multi. This needs further investigation. On the same CPU, these algos are slower on cpuminer-opt compared to cpuminer-multi:

    "groestl"    =>  1109819 / 1000, // cpuminer-opt
    "groestl"    =>  1125917 / 1000, // cpuminer-nicehash
    "keccak"     =>  6964234 / 1000, // cpuminer-opt
    "keccak"     =>  8332952 / 1000, // cpuminer-nicehash
    "luffa"      =>  2728931 / 1000, // cpuminer-opt
    "luffa"      =>  3177996 / 1000, // cpuminer-nicehash
    "lyra2"      =>   716945 / 1000, // cpuminer-opt
    "lyra2"      =>   921109 / 1000, // cpuminer-nicehash
    "neoscrypt"  =>    27583 / 1000, // cpuminer-opt
    "neoscrypt"  =>    28891 / 1000, // cpuminer-nicehash
    "pentablake" =>  3479320 / 1000, // cpuminer-opt
    "pentablake" =>  3609862 / 1000, // cpuminer-nicehash
    "pluck"      =>     1722 / 1000, // cpuminer-opt
    "pluck"      =>     1818 / 1000, // cpuminer-nicehash
    "s3"         =>  1086149 / 1000, // cpuminer-opt
    "s3"         =>  1201897 / 1000, // cpuminer-nicehash
    "scrypt"     =>    91557 / 1000, // cpuminer-opt
    "scrypt"     =>    99702 / 1000, // cpuminer-nicehash
    "sha256d"    => 53122339 / 1000, // cpuminer-opt
    "sha256d"    => 54669375 / 1000, // cpuminer-nicehash
    "shavite3"   =>  2232258 / 1000, // cpuminer-opt
    "shavite3"   =>  2343704 / 1000, // cpuminer-nicehash
    "skein"      =>  6405675 / 1000, // cpuminer-opt
    "skein"      =>  6586806 / 1000, // cpuminer-nicehash
    "skein2"     =>  7985012 / 1000, // cpuminer-opt
    "skein2"     =>  8167405 / 1000, // cpuminer-nicehash

I'm using this version of cpuminer-multi: https://github.com/nicehash/cpuminer-multi
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 01:46:06 AM
Your suggestion would add the overhead of a function call and return on every iteration to save a pointer deref. Looks like a bad trade to me.
I meant to move the dereference outside of the iteration completely: have the iteration loop coded in each algo so it doesn't go through the dereference at all. Note: all of this is speculation; I still haven't measured exactly where the slowdown is or why it's slower. I'm just reporting that, for some reason, the non-AES versions of algos are slower in cpuminer-opt than in cpuminer-multi. This needs further investigation. On the same CPU, these algos are slower on cpuminer-opt compared to cpuminer-multi:

    "groestl"    =>  1109819 / 1000, // cpuminer-opt
    "groestl"    =>  1125917 / 1000, // cpuminer-nicehash
    "keccak"     =>  6964234 / 1000, // cpuminer-opt
    "keccak"     =>  8332952 / 1000, // cpuminer-nicehash
    "luffa"      =>  2728931 / 1000, // cpuminer-opt
    "luffa"      =>  3177996 / 1000, // cpuminer-nicehash
    "lyra2"      =>   716945 / 1000, // cpuminer-opt
    "lyra2"      =>   921109 / 1000, // cpuminer-nicehash
    "neoscrypt"  =>    27583 / 1000, // cpuminer-opt
    "neoscrypt"  =>    28891 / 1000, // cpuminer-nicehash
    "pentablake" =>  3479320 / 1000, // cpuminer-opt
    "pentablake" =>  3609862 / 1000, // cpuminer-nicehash
    "pluck"      =>     1722 / 1000, // cpuminer-opt
    "pluck"      =>     1818 / 1000, // cpuminer-nicehash
    "s3"         =>  1086149 / 1000, // cpuminer-opt
    "s3"         =>  1201897 / 1000, // cpuminer-nicehash
    "scrypt"     =>    91557 / 1000, // cpuminer-opt
    "scrypt"     =>    99702 / 1000, // cpuminer-nicehash
    "sha256d"    => 53122339 / 1000, // cpuminer-opt
    "sha256d"    => 54669375 / 1000, // cpuminer-nicehash
    "shavite3"   =>  2232258 / 1000, // cpuminer-opt
    "shavite3"   =>  2343704 / 1000, // cpuminer-nicehash
    "skein"      =>  6405675 / 1000, // cpuminer-opt
    "skein"      =>  6586806 / 1000, // cpuminer-nicehash
    "skein2"     =>  7985012 / 1000, // cpuminer-opt
    "skein2"     =>  8167405 / 1000, // cpuminer-nicehash

I'm using this version of cpuminer-multi: https://github.com/nicehash/cpuminer-multi
Well, your pseudocode had the call/ret inside the loop. Most of the algos in your list are of little interest, except neoscrypt. That is one algo I'd like to improve; in relative terms it underperforms the GPU version by a lot. Another thing to consider is that local hashrate reporting by the miner isn't very reliable, and your data is well within a 2% margin of error. I was seeing greater variation just from different sessions of the same code. I thought I was making incremental improvements with some changes and regressions with others, when all along it was just noise. I like intellectual challenges, but you need to do a better job. You don't provide the full picture initially and only give more info after I poke holes in your initial presentation. This seems to be a pattern with your "suggestions". You obviously have some knowledge, maybe not as much as me, but knowledge in areas where I am weak: C++, for example. I'm also weak in GUI apps and web programming, but I'm strong in OS fundamentals and CPU architecture, though not specifically Linux and x86. One of my biggest challenges has been applying my knowledge and experience to an unfamiliar environment; I tend to make a lot of mistakes as a result. I have given you the benefit of the doubt and tried to probe you for more info in areas where I didn't have the confidence to call you out, but so far it's come up empty. When you challenge me on one of my strengths you'd better be well prepared.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 11:57:06 AM (last edit: May 12, 2016, 12:40:40 PM by hmage)
I have given you the benefit of the doubt and tried to probe you for more info in areas where I didn't have the confidence to call you out. But so far it's come up empty. When you challenge me on one of my strengths you'd better be well prepared.
I don't care if I challenge you or not; I'm not here for your entertainment. 10 runs of cpuminer-opt give results that are consistently lower than 10 runs of cpuminer-multi on the algos listed above. Simple as that. You're free to ignore this fact, of course, but I thought it would be nice if you knew it.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 04:04:48 PM
I have given you the benefit of the doubt and tried to probe you for more info in areas where I didn't have the confidence to call you out. But so far it's come up empty. When you challenge me on one of my strengths you'd better be well prepared.
I don't care if I challenge you or not; I'm not here for your entertainment. 10 runs of cpuminer-opt give results that are consistently lower than 10 runs of cpuminer-multi on the algos listed above. Simple as that. You're free to ignore this fact, of course, but I thought it would be nice if you knew it.
When I give you constructive feedback you seem to get angry, which is counterproductive. I thank you for your work, but it was not enough to draw any conclusions. A 2% difference is statistically insignificant. But let's assume the difference is real. You suggested it was caused by algo-gate's use of function pointers. I countered that my measurements when algo-gate was implemented showed an improvement. That disproves your theory, one that was not supported by any evidence, BTW. So if the difference is real it must be caused by something else, and there are a lot of possibilities. Differences in CPU architecture (I don't mean capabilities) can cause measurable differences between algos: cache size and organization, execution environment, memory interface, etc. can all make different algos perform differently on different CPUs. If you look at HOdl, it performs well on an i7 but poorly on an i5 due to the smaller cache; as it turns out, it was specifically optimized for the size of the i7 cache. You need to do your research, get your facts straight, and present a coherent case if you want to get any attention, especially when you are criticizing someone's work. I have a thick skin, thicker than yours apparently, so I can take it and give it back. Put yourself in my position: how would you react to someone taking pot shots at what you're doing wrong and how you should do things? Oh, I already know: you get angry.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 05:15:37 PM (last edit: May 12, 2016, 05:34:57 PM by hmage)
Okay then, explain this: https://gist.github.com/hmage/2a1fdbd7bdad252cd08c9b4166c5727a

On a Core i5-4570S:

    hmage@dhmd:~/test$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz
    hmage@dhmd:~/test$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.002082 microseconds per call, 480308.777k per second
    workloopfunc(): 0.001774 microseconds per call, 563746.643k per second

On a Core i7-4770:

    hmage@vhmd:~$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    hmage@vhmd:~$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.001776 microseconds per call, 562932.922k per second
    workloopfunc(): 0.001506 microseconds per call, 664150.879k per second

Dereferencing on every call _is_ a big performance hit, unless you have another explanation. Latency numbers every programmer should know: https://gist.github.com/hellerbarde/2843375

Oh, I already know: you get angry.
It looks to me that it was you who got angry. I apologise for my blunt approach.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 05:46:02 PM
Okay then, explain this: https://gist.github.com/hmage/2a1fdbd7bdad252cd08c9b4166c5727a

On a Core i5-4570S:

    hmage@dhmd:~/test$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz
    hmage@dhmd:~/test$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.002082 microseconds per call, 480308.777k per second
    workloopfunc(): 0.001774 microseconds per call, 563746.643k per second

On a Core i7-4770:

    hmage@vhmd:~$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    hmage@vhmd:~$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.001776 microseconds per call, 562932.922k per second
    workloopfunc(): 0.001506 microseconds per call, 664150.879k per second

Dereferencing on every call _is_ a big performance hit, unless you have another explanation.
Oh, I already know: you get angry.
It looks to me that it was you who got angry. I apologise for my blunt approach.
A little impatient maybe, but not really angry. I try to stick to the issues. Yes, dereferencing a pointer to call a function adds overhead, but it has to be taken in context: how often does that occur in the big picture? Take scanhash, for example, the lowest-level function that is gated. Each scan takes seconds to run, so the overhead of one extra pointer deref every few seconds is immeasurable. Even if you go up a level to the miner_thread loop, there are maybe 20 gated function calls every loop; 20 extra derefs every few seconds is still immeasurable. Any change of program flow has overhead; that's why function inlining and loop unrolling exist. But if the code size of an unrolled loop overflows the cache, you may end up losing more performance from cache misses than you gained from inlining. This might answer your question: https://bitcointalk.org/index.php?topic=1326803.msg13770966#msg13770966

I clearly stated I did not predict a performance gain from algo-gate, and if you dig deeper you may find where I did acknowledge the overhead of the deref but was at a loss to explain why I observed a performance gain. Maybe my observations were just noise; maybe some other change is responsible for the increase in performance in spite of the gate. I just don't know. There are too many variables that can't be controlled, so I dismiss such observations without a solid case to back them up. Finally, what it comes down to, like any decision, is a balance. Algo-gate was never about performance; it was about a better architecture that makes it easier for developers to add new algos to the miner with minimal disruption to the existing code. I judged the performance cost to be negligible.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 08:06:01 PM
I did acknowledge the overhead of the deref but was at a loss to explain why I observed a performance gain.
You didn't provide numbers, unfortunately, and you didn't provide a way to recreate the benchmarks to verify your claims either, since there's no archive of older versions of cpuminer-opt to build against. If it were on github, for example, that would have been easier to test.
Each scan takes seconds to run, so the overhead of one extra pointer deref every few seconds is immeasurable. Even if you go up a level to the miner_thread loop, there are maybe 20 gated function calls every loop; 20 extra derefs every few seconds is still immeasurable.
That was the info I was looking for, thank you. This whole debate went on too long either because I didn't communicate clearly enough that I assumed the dereference happens on every hash call, or because you didn't recognize that when reading; the pseudocode should have been a big hint. Either way, the debate is pointless: 20 derefs every few seconds isn't something to worry about. The observed slowdown must be caused by other factors.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 09:22:55 PM
I did acknowledge the overhead of the deref but was at a loss to explain why I observed a performance gain.
You didn't provide numbers, unfortunately, and you didn't provide a way to recreate the benchmarks to verify your claims either, since there's no archive of older versions of cpuminer-opt to build against. If it were on github, for example, that would have been easier to test.
Each scan takes seconds to run, so the overhead of one extra pointer deref every few seconds is immeasurable. Even if you go up a level to the miner_thread loop, there are maybe 20 gated function calls every loop; 20 extra derefs every few seconds is still immeasurable.
That was the info I was looking for, thank you. This whole debate went on too long either because I didn't communicate clearly enough that I assumed the dereference happens on every hash call, or because you didn't recognize that when reading; the pseudocode should have been a big hint. Either way, the debate is pointless: 20 derefs every few seconds isn't something to worry about. The observed slowdown must be caused by other factors.
I think you hit the nail on the head when you said you made an assumption. That was, IMO, your biggest mistake, and it's why I kept repeating that you need to do your homework before bringing things to my attention. Had you done that, you would have realized yourself that the deref overhead was trivial and any observed performance difference was due to something else. It was my assumption that you would have already done that. We both made assumptions, not a good idea. I didn't have numbers because there was no way to run a controlled test with the necessary level of precision and accuracy, and that's also why I suggested it wasn't worth your effort to go back and retest previous releases.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 10:34:38 PM
It was my assumption that you would have already done that. We both made assumptions, not a good idea.
Yeap. I only glanced briefly at the source code. Anyway, I should apologise for my behaviour; it was unprofessional, and that led to less productive results. You weren't perfect either (everyone has faults, since everyone is human): every suggestion or problem report felt like a court trial, given how much work needed to be done on my end compared to what I saw being done on your end regarding the issue or suggestion. You always asked for more research or more data without seemingly doing any research of your own before passing judgement. I really like your work so far and appreciate it very much, though, and don't want to distract you from it more than I already have.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 11:31:02 PM
It was my assumption that you would have already done that. We both made assumptions, not a good idea.
Yeap. I only glanced briefly at the source code. Anyway, I should apologise for my behaviour; it was unprofessional, and that led to less productive results. You weren't perfect either (everyone has faults, since everyone is human): every suggestion or problem report felt like a court trial, given how much work needed to be done on my end compared to what I saw being done on your end regarding the issue or suggestion. You always asked for more research or more data without seemingly doing any research of your own before passing judgement. I really like your work so far and appreciate it very much, though, and don't want to distract you from it more than I already have.
Your perception of a court trial is pretty accurate. I was thinking something similar: a lawyer gets one crack at presenting a case, and if the lawyer comes to court unprepared the case gets tossed and he doesn't get another chance. Although I'm an atheist, a Bible passage comes to mind: let he who is without sin throw the first stone. The implication being that no one is without sin; I simply picked up the stones and threw them back. An apology is not required; coming to an understanding and learning from it is more important, and that applies to both of us. Nevertheless, you offered one and I accept it. For my part, I'm not one to apologize for my actions; too stubborn, I guess. But in hindsight I think the timing was bad: I had just released v3.2 and had broken zr5, which was embarrassing, and I was trying to focus on that issue. In fact, I am not pleased with the overall quality of my releases; too many bad ones. I expect better of myself. Am I losing my edge, or did I forget what it was like to be on a steep learning curve after so long as a subject matter expert? Yeah, I'm arrogant too. No hard feelings. Cheers.
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
May 13, 2016, 07:38:16 AM
I agree: what counts is moving forward on the path of knowledge, and everybody does it in his own way. Some just stand still, but that's not the kind of people who usually post here.