joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 11, 2016, 01:32:15 PM
X17 algo is now supported by cpuminer-opt. It has been tested at zpool with v3.1.18.
The message warning that the algo has not been tested can be ignored. It will be removed from the next release.
hmage
Member
Offline
Activity: 83
Merit: 10
May 11, 2016, 05:23:24 PM (last edit: May 11, 2016, 05:34:12 PM by hmage)
Edit3: I ran the test several times with both algos and always produced the correct result. Could your script be misinterpreting it? Without further information on how to reproduce, I consider this issue closed.
I was running it without the script and was getting the wrong output. I'm talking about the last line: 463916. This is the line I'm parsing in the script. Maybe it's because I have hyperthreading enabled on my CPU? It could be a thread synchronization issue. It doesn't always happen for me either. I've been trying to reproduce the problem for you with v3.2.1 and asciinema, and so far no luck. I was experiencing the problem on 3.1.17. Maybe you fixed it in v3.2.1? I see you've changed things that could be relevant since then.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 11, 2016, 05:44:33 PM
Edit3: I ran the test several times with both algos and always produced the correct result. Could your script be misinterpreting it? Without further information on how to reproduce, I consider this issue closed.
I was running it without the script and was getting the wrong output. I'm talking about the last line: 463916. This is the line I'm parsing in the script. Maybe it's because I have hyperthreading enabled on my CPU? It could be a thread synchronization issue. It doesn't always happen for me either. I've been trying to reproduce the problem for you with v3.2.1 and asciinema, and so far no luck. I was experiencing the problem on 3.1.17. Maybe you fixed it in v3.2.1? I see you've changed things that could be relevant since then.
I also use hyperthreading. I haven't intentionally touched either algo in several releases, and 3.2 was a restructuring release with no intended change in functionality. If you see it again, let me know and I'll take another look. But it's pretty clear from the code that the last TOTAL rate displays the same data as the last line. Looking at the code, the time_limit handling seems to be in an odd place; the end of the loop would be more appropriate, with the rest of the display code. I may move it on speculation if the problem returns.
hmage
Member
Offline
Activity: 83
Merit: 10
May 11, 2016, 06:18:08 PM
I also use hyperthreading. I haven't intentionally touched either algo in several releases, and 3.2 was a restructuring release with no intended change in functionality. If you see it again, let me know and I'll take another look. But it's pretty clear from the code that the last TOTAL rate displays the same data as the last line.
Yes, the output is the same as the total, just in a format that's easier to parse; no problem with that. The problem was that right before the end, it would spit out thousands of lines in a second with an ever-increasing hashrate that inflated the total result. Do you keep an archive of older versions of cpuminer-opt? I'd like to check an older version, and I foolishly deleted my local copy.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 11, 2016, 07:08:28 PM
I also use hyperthreading. I haven't intentionally touched either algo in several releases, and 3.2 was a restructuring release with no intended change in functionality. If you see it again, let me know and I'll take another look. But it's pretty clear from the code that the last TOTAL rate displays the same data as the last line.
Yes, the output is the same as the total, just in a format that's easier to parse; no problem with that. The problem was that right before the end, it would spit out thousands of lines in a second with an ever-increasing hashrate that inflated the total result. Do you keep an archive of older versions of cpuminer-opt? I'd like to check an older version, and I foolishly deleted my local copy.
A few of the DL links are still active. If you want a specific release, let me know and I'll reactivate it. Keep in mind there have been some problem releases along the way which you probably want to avoid. The post for each release is still in the thread and should help you find the most stable ones. Personally, I don't think it's worth the effort to go back. If the problem reoccurs with the current release you can collect more data and we can pursue it from there.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 12:14:21 AM (last edit: May 12, 2016, 12:45:43 AM by hmage)
I've noticed another performance regression compared to cpuminer-multi. Algos with a very high number of calls per second tend to be slower on cpuminer-opt because of the algo-gate callback functions. When calling through a function pointer, the pointer must be dereferenced before jumping; when the function in question is fast enough, the dereferencing could be reducing performance versus a direct call. One way to fix that is to move the dereference outside the loop.

Pseudocode before:

    func = &hash_sha256;
    while (true) { func(); }

Pseudocode after:

    funcloop = &hashloop_sha256;
    funcloop();

    hashloop_sha256() { while (true) { hash_sha256(); } }

This moves the dereference so it is done only once, at the start of the loop.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 12:34:27 AM
I've noticed another performance regression compared to cpuminer-multi. Algos with a very high number of calls per second tend to be slower on cpuminer-opt because of the algo-gate callback functions. When calling through a function pointer, the pointer must be dereferenced before jumping; when the function in question is fast enough, the dereferencing could be reducing performance versus a direct call. One way to fix that is to move the dereference outside the loop.

Pseudocode before:

    func = &hash_sha256;
    while (true) { func(); }

Pseudocode after:

    funcloop = &hashloop_sha256;
    funcloop();

    hashloop_sha256() { while (true) { hash_sha256(); } }

This moves the dereference so it is done only once, at the start of the loop.
Have you measured a regression? My measurements between 3.0.7 (pre algo-gate) and 3.1 showed a modest improvement in performance across the board. Your suggestion would add the overhead of a function call and return on every iteration to save a pointer deref. Looks like a bad trade to me.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 12:43:38 AM
Your suggestion would add the overhead of a function call and return on every iteration to save a pointer deref. Looks like a bad trade to me.
I meant to move the dereference outside of the iteration completely: have the iteration loop coded in each algo so it doesn't go through the dereference at all. Note: all of this is speculation; I still haven't measured exactly where the slowdown is or why it's slower. I'm just reporting that, for some reason, the non-AES versions of algos are slower in cpuminer-opt than in cpuminer-multi. This needs further investigation. On the same CPU, these algos are slower on cpuminer-opt compared to cpuminer-multi:

    "groestl"    =>  1109819 / 1000, // cpuminer-opt
    "groestl"    =>  1125917 / 1000, // cpuminer-nicehash
    "keccak"     =>  6964234 / 1000, // cpuminer-opt
    "keccak"     =>  8332952 / 1000, // cpuminer-nicehash
    "luffa"      =>  2728931 / 1000, // cpuminer-opt
    "luffa"      =>  3177996 / 1000, // cpuminer-nicehash
    "lyra2"      =>   716945 / 1000, // cpuminer-opt
    "lyra2"      =>   921109 / 1000, // cpuminer-nicehash
    "neoscrypt"  =>    27583 / 1000, // cpuminer-opt
    "neoscrypt"  =>    28891 / 1000, // cpuminer-nicehash
    "pentablake" =>  3479320 / 1000, // cpuminer-opt
    "pentablake" =>  3609862 / 1000, // cpuminer-nicehash
    "pluck"      =>     1722 / 1000, // cpuminer-opt
    "pluck"      =>     1818 / 1000, // cpuminer-nicehash
    "s3"         =>  1086149 / 1000, // cpuminer-opt
    "s3"         =>  1201897 / 1000, // cpuminer-nicehash
    "scrypt"     =>    91557 / 1000, // cpuminer-opt
    "scrypt"     =>    99702 / 1000, // cpuminer-nicehash
    "sha256d"    => 53122339 / 1000, // cpuminer-opt
    "sha256d"    => 54669375 / 1000, // cpuminer-nicehash
    "shavite3"   =>  2232258 / 1000, // cpuminer-opt
    "shavite3"   =>  2343704 / 1000, // cpuminer-nicehash
    "skein"      =>  6405675 / 1000, // cpuminer-opt
    "skein"      =>  6586806 / 1000, // cpuminer-nicehash
    "skein2"     =>  7985012 / 1000, // cpuminer-opt
    "skein2"     =>  8167405 / 1000, // cpuminer-nicehash

I'm using this version of cpuminer-multi: https://github.com/nicehash/cpuminer-multi
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 01:46:06 AM
Your suggestion would add the overhead of a function call and return on every iteration to save a pointer deref. Looks like a bad trade to me.
I meant to move the dereference outside of the iteration completely: have the iteration loop coded in each algo so it doesn't go through the dereference at all. Note: all of this is speculation; I still haven't measured exactly where the slowdown is or why it's slower. I'm just reporting that, for some reason, the non-AES versions of algos are slower in cpuminer-opt than in cpuminer-multi. This needs further investigation. On the same CPU, these algos are slower on cpuminer-opt compared to cpuminer-multi:

    "groestl"    =>  1109819 / 1000, // cpuminer-opt
    "groestl"    =>  1125917 / 1000, // cpuminer-nicehash
    "keccak"     =>  6964234 / 1000, // cpuminer-opt
    "keccak"     =>  8332952 / 1000, // cpuminer-nicehash
    "luffa"      =>  2728931 / 1000, // cpuminer-opt
    "luffa"      =>  3177996 / 1000, // cpuminer-nicehash
    "lyra2"      =>   716945 / 1000, // cpuminer-opt
    "lyra2"      =>   921109 / 1000, // cpuminer-nicehash
    "neoscrypt"  =>    27583 / 1000, // cpuminer-opt
    "neoscrypt"  =>    28891 / 1000, // cpuminer-nicehash
    "pentablake" =>  3479320 / 1000, // cpuminer-opt
    "pentablake" =>  3609862 / 1000, // cpuminer-nicehash
    "pluck"      =>     1722 / 1000, // cpuminer-opt
    "pluck"      =>     1818 / 1000, // cpuminer-nicehash
    "s3"         =>  1086149 / 1000, // cpuminer-opt
    "s3"         =>  1201897 / 1000, // cpuminer-nicehash
    "scrypt"     =>    91557 / 1000, // cpuminer-opt
    "scrypt"     =>    99702 / 1000, // cpuminer-nicehash
    "sha256d"    => 53122339 / 1000, // cpuminer-opt
    "sha256d"    => 54669375 / 1000, // cpuminer-nicehash
    "shavite3"   =>  2232258 / 1000, // cpuminer-opt
    "shavite3"   =>  2343704 / 1000, // cpuminer-nicehash
    "skein"      =>  6405675 / 1000, // cpuminer-opt
    "skein"      =>  6586806 / 1000, // cpuminer-nicehash
    "skein2"     =>  7985012 / 1000, // cpuminer-opt
    "skein2"     =>  8167405 / 1000, // cpuminer-nicehash

I'm using this version of cpuminer-multi: https://github.com/nicehash/cpuminer-multi
Well, your pseudocode had the call/ret inside the loop. Most of the algos in your list are of little interest, except neoscrypt. That is one algo I'd like to improve; in relative terms it underperforms the GPU version by a lot. Another thing to consider is that local hashrate reporting by the miner isn't very reliable, and your data is well within a 2% margin of error. I was seeing greater variation just from different sessions of the same code. I thought I was making incremental improvements with some changes and regressions with others, when all along it was just noise. I like intellectual challenges, but you need to do a better job. You don't provide the full picture initially and only give more info after I poke holes in your initial presentation. This seems to be a pattern with your "suggestions". You obviously have some knowledge, maybe not as much as me, but knowledge in areas where I am weak: C++, for example. I'm also weak in GUI apps and web programming, but I'm strong in OS fundamentals and CPU architecture, though not specifically Linux and x86. One of my biggest challenges has been applying my knowledge and experience to an unfamiliar environment; I tend to make a lot of mistakes as a result. I have given you the benefit of the doubt and tried to probe you for more info in areas where I didn't have the confidence to call you out, but so far it's come up empty. When you challenge me on one of my strengths you'd better be well prepared.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 11:57:06 AM (last edit: May 12, 2016, 12:40:40 PM by hmage)
I have given you the benefit of the doubt and tried to probe you for more info in areas where I didn't have the confidence to call you out. But so far it's come up empty. When you challenge me on one of my strengths you'd better be well prepared.
I don't care if I challenge you or not; I'm not here for your entertainment. 10 runs of cpuminer-opt give results that are consistently lower than 10 runs of cpuminer-multi on the algos listed above. Simple as that. You're free to ignore this fact, of course, but I thought it would be nice if you knew it.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 04:04:48 PM
I have given you the benefit of the doubt and tried to probe you for more info in areas where I didn't have the confidence to call you out. But so far it's come up empty. When you challenge me on one of my strengths you'd better be well prepared.
I don't care if I challenge you or not; I'm not here for your entertainment. 10 runs of cpuminer-opt give results that are consistently lower than 10 runs of cpuminer-multi on the algos listed above. Simple as that. You're free to ignore this fact, of course, but I thought it would be nice if you knew it.
When I give you constructive feedback you seem to get angry, which is counterproductive. I thank you for your work, but it was not enough to draw any conclusions. A 2% difference is statistically insignificant. But let's assume the difference is real. You suggested it was caused by algo-gate's use of function pointers. I countered that my measurements when algo-gate was implemented showed an improvement. That disproves your theory, one that was not supported by any evidence, BTW. So if the difference is real it must be caused by something else, and there are a lot of possibilities. Differences in CPU architecture (I don't mean capabilities) can cause measurable differences between algos: cache size and organization, execution environment, memory interface, etc. can all make different algos perform differently on different CPUs. If you look at HOdl, it performs well on an i7 but poorly on an i5 due to the smaller cache; as it turns out, it was specifically optimized for the size of the i7 cache. You need to do your research, get your facts straight, and present a coherent case if you want to get any attention, especially when you are criticizing someone's work. I have a thick skin, thicker than yours apparently, so I can take it and give it back. Put yourself in my position: how would you react to someone taking pot shots at what you're doing wrong and how you should do things? Oh, I already know: you get angry.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 05:15:37 PM (last edit: May 12, 2016, 05:34:57 PM by hmage)
Okay then, explain this: https://gist.github.com/hmage/2a1fdbd7bdad252cd08c9b4166c5727a

On a Core i5-4570S:

    hmage@dhmd:~/test$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz
    hmage@dhmd:~/test$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.002082 microseconds per call, 480308.777k per second
    workloopfunc(): 0.001774 microseconds per call, 563746.643k per second

On a Core i7-4770:

    hmage@vhmd:~$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    hmage@vhmd:~$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.001776 microseconds per call, 562932.922k per second
    workloopfunc(): 0.001506 microseconds per call, 664150.879k per second

Dereferencing on every call _is_ a big performance hit, unless you have another explanation. Latency numbers every programmer should know: https://gist.github.com/hellerbarde/2843375

Oh, I already know: you get angry.
It looks to me that it was you who got angry. I apologise for my blunt approach.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 05:46:02 PM
Okay then, explain this: https://gist.github.com/hmage/2a1fdbd7bdad252cd08c9b4166c5727a

On a Core i5-4570S:

    hmage@dhmd:~/test$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz
    hmage@dhmd:~/test$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.002082 microseconds per call, 480308.777k per second
    workloopfunc(): 0.001774 microseconds per call, 563746.643k per second

On a Core i7-4770:

    hmage@vhmd:~$ cat /proc/cpuinfo | fgrep name | head -1
    model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
    hmage@vhmd:~$ gcc dereference_bench.c -O2 -o dereference_bench && ./dereference_bench
    workfunc(): 0.001776 microseconds per call, 562932.922k per second
    workloopfunc(): 0.001506 microseconds per call, 664150.879k per second

Dereferencing on every call _is_ a big performance hit, unless you have another explanation.
Oh, I already know: you get angry.
It looks to me that it was you who got angry. I apologise for my blunt approach.
A little impatient maybe, but not really angry. I try to stick to the issues. Yes, dereferencing a pointer to call a function adds overhead, but it has to be taken in context: how often does that occur in the big picture? Take scanhash, for example, the lowest-level function that is gated. Each scan takes seconds to run, so the overhead of one extra pointer deref every few seconds is immeasurable. Even if you go up a level to the miner_thread loop, there are maybe 20 gated function calls every loop; 20 extra derefs every few seconds is still immeasurable. Any change of program flow has overhead; that's why function inlining and loop unrolling exist. But if the code size of an unrolled loop overflows the cache, you may end up losing more performance from cache misses than you gained from inlining. This might answer your question: https://bitcointalk.org/index.php?topic=1326803.msg13770966#msg13770966

I clearly stated I did not predict a performance gain from algo-gate, and if you dig deeper you may find where I did acknowledge the overhead of the deref but was at a loss to explain why I observed a performance gain. Maybe my observations were just noise; maybe some other change is responsible for the increase in performance in spite of the gate. I just don't know. There are too many variables that can't be controlled, so I dismiss such observations without a solid case to back them up. Finally, what it comes down to, like any decision, is a balance. Algo-gate was never about performance; it was about a better architecture that makes it easier for developers to add new algos to the miner with minimal disruption to the existing code. I judged the performance cost to be negligible.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 08:06:01 PM
I did acknowledge the overhead of the deref but was at a loss to explain why I observed a performance gain.
You didn't provide numbers, unfortunately, and you didn't provide a way to recreate the benchmarks to verify your claims either, since there's no archive of older versions of cpuminer-opt to build against. If it were on github, for example, that would have been easier to test.
Each scan takes seconds to run, so the overhead of one extra pointer deref every few seconds is immeasurable. Even if you go up a level to the miner_thread loop, there are maybe 20 gated function calls every loop; 20 extra derefs every few seconds is still immeasurable.
That was the info I was looking for, thank you. This whole debate went on too long either because I didn't communicate clearly enough that I assumed the dereference happens on every hash call, or because you didn't recognize that when reading; the pseudocode should have been a big hint. Either way, the debate is pointless: 20 derefs every few seconds isn't something to worry about. The observed slowdown must be caused by other factors.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 09:22:55 PM
I did acknowledge the overhead of the deref but was at a loss to explain why I observed a performance gain.
You didn't provide numbers, unfortunately, and you didn't provide a way to recreate the benchmarks to verify your claims either, since there's no archive of older versions of cpuminer-opt to build against. If it were on github, for example, that would have been easier to test.
Each scan takes seconds to run, so the overhead of one extra pointer deref every few seconds is immeasurable. Even if you go up a level to the miner_thread loop, there are maybe 20 gated function calls every loop; 20 extra derefs every few seconds is still immeasurable.
That was the info I was looking for, thank you. This whole debate went on too long either because I didn't communicate clearly enough that I assumed the dereference happens on every hash call, or because you didn't recognize that when reading; the pseudocode should have been a big hint. Either way, the debate is pointless: 20 derefs every few seconds isn't something to worry about. The observed slowdown must be caused by other factors.
I think you hit the nail on the head when you said you made an assumption. That was, IMO, your biggest mistake, and it's why I kept repeating that you need to do your homework before bringing things to my attention. Had you done that, you would have realized yourself that the deref overhead was trivial and any observed performance difference was due to something else. It was my assumption that you would have already done that. We both made assumptions, not a good idea. I didn't have numbers because there was no way to run a controlled test with the necessary level of precision and accuracy, and that's also why I suggested it wasn't worth your effort to go back and retest previous releases.
hmage
Member
Offline
Activity: 83
Merit: 10
May 12, 2016, 10:34:38 PM
It was my assumption that you would have already done that. We both made assumptions, not a good idea.
Yeap. I only glanced briefly at the source code. Anyway, I should apologise for my behaviour; it was unprofessional, and that led to less productive results. You weren't perfect either (everyone has faults, since everyone is human): every suggestion or problem report felt like a court trial, given how much work needed to be done on my end compared to what I saw being done on your end regarding the issue or suggestion. You always asked for more research or more data without seemingly doing any research of your own before passing judgement. I really like your work so far and appreciate it very much, though, and don't want to distract you from it more than I already have.
joblo (OP)
Legendary
Offline
Activity: 1470
Merit: 1114
May 12, 2016, 11:31:02 PM
It was my assumption that you would have already done that. We both made assumptions, not a good idea.
Yeap. I only glanced briefly at the source code. Anyway, I should apologise for my behaviour; it was unprofessional, and that led to less productive results. You weren't perfect either (everyone has faults, since everyone is human): every suggestion or problem report felt like a court trial, given how much work needed to be done on my end compared to what I saw being done on your end regarding the issue or suggestion. You always asked for more research or more data without seemingly doing any research of your own before passing judgement. I really like your work so far and appreciate it very much, though, and don't want to distract you from it more than I already have.
Your perception of a court trial is pretty accurate. I was thinking something similar: a lawyer gets one crack at presenting a case, and if the lawyer comes to court unprepared the case gets tossed and he doesn't get another chance. Although I'm an atheist, a Bible passage comes to mind: let he who is without sin throw the first stone. The implication being that no one is without sin; I simply picked up the stones and threw them back. An apology is not required; coming to an understanding and learning from it is more important, and that applies to both of us. Nevertheless, you offered one and I accept it. For my part, I'm not one to apologize for my actions; too stubborn, I guess. But in hindsight I think the timing was bad: I had just released v3.2 and had broken zr5, which was embarrassing, and I was trying to focus on that issue. In fact, I am not pleased with the overall quality of my releases; too many bad ones. I expect better of myself. Am I losing my edge, or did I forget what it was like to be on a steep learning curve after so long as a subject matter expert? Yeah, I'm arrogant too. No hard feelings. Cheers.
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
May 13, 2016, 07:38:16 AM
I agree: what counts is moving forward on the path of knowledge, and everybody does it in his own way. Some just stand still, but that's not the kind of people who usually post here.