efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 28, 2018, 09:40:56 PM |
|
Folk, I'd like to share a lesson learned that may be useful to others. I have a Z9mini here that I just checked and it has a hash rate of 0. Immediately I went to the logs to see what was going on and found the following:
Nov 28 21:33:52 (none) local0.err cgminer[23558]: bm1740_verify_nonce_integrality CRC error. cal-crc=374c, chip-crc=60bf
Nov 28 21:33:52 (none) local0.warn cgminer[23558]: receive a error nonce. total = 8908
Nov 28 21:33:52 (none) local0.err cgminer[23546]: bm1740_verify_nonce_integrality CRC error. cal-crc=ac2d, chip-crc=3f77
Nov 28 21:33:52 (none) local0.warn cgminer[23546]: receive a error nonce. total = 8705
The key here is that these are happening constantly, every second; the count is up to 8000+. (If these are infrequent, every few minutes to hours, they can be ignored.) What's going on? To figure that out, let's take a look at the process list ("System" -> "Monitor")... and what do I find?
23546 23545 root S < 225m 98% 50% /usr/bin/cgminer --version-file=/usr/bin/compile_time --config=/config/cgminer.conf -T --syslog
23558 23557 root S < 257m 111% 40% /usr/bin/cgminer --version-file=/usr/bin/compile_time --config=/config/cgminer.conf -T --syslog
Two copies of cgminer running! Note the two different PIDs (23546 and 23558) in the log lines above as well. How could that happen? The answer is in this little program right here:
1012 1 root S 2152 1% 0% {monitorcg} /bin/sh /sbin/monitorcg
This is a factory process that tries to be a "watchdog" for cgminer and restart it if it is not running. From the factory it ran every 20 seconds, but I modified it to sleep for 60 seconds to try to limit the possibility of this race condition.
What happens is this: if you change frequency or pool configuration, cgminer is stopped and restarted. While that stop/start is occurring, monitorcg has a chance to see that cgminer is not running and start one itself. End result: two cgminers stepping on each other.
I may end up removing /sbin/monitorcg from the firmware, as I've attempted to fix this particular race a myriad of ways... but when two separate processes (web interface actions and monitorcg) are both touching the same resource ("cgminer"), there is not any good way to prevent them from stepping on each other unless they are talking to each other constantly to achieve what is called "quorum".
What's the lesson here? Many times the errors that you may see are a function of this particular race condition... and if you have two cgminer processes running, the fix is to kill/restart them. The simplest way to do that is just to go to the frequency page and click submit. That will terminate both cgminers and hopefully restart one before monitorcg tries to help. A guaranteed way to fix it is to reboot, but I am not a fan of unnecessary reboots.
Hopefully this bit of information will be useful to someone. I've been meaning to write posts like this explaining various scenarios for a while.
Thank you,
Jason
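For anyone who would rather check for the duplicate from a shell than from the web page, here is a minimal sketch. It assumes the process listing shows the full /usr/bin/cgminer command line as in the output above; the function names are made up for illustration and are not part of the firmware.

```shell
#!/bin/sh
# Count cgminer instances in `ps` output read on stdin.
# The [c] bracket keeps grep from matching its own command line.
count_cgminer() {
    grep -c '/usr/bin/[c]gminer' || true
}

# Print a warning and return 1 when more than one instance is found.
check_duplicates() {
    n=$(count_cgminer)
    if [ "$n" -gt 1 ]; then
        echo "WARNING: $n cgminer processes running"
        return 1
    fi
    echo "OK: $n cgminer process(es) running"
}
```

Run it as `ps | check_duplicates`. It only reports the condition; clearing it is still the frequency-page submit or a reboot described above.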
|
|
|
|
chipless
Jr. Member
Offline
Activity: 559
Merit: 4
|
|
November 28, 2018, 10:37:23 PM |
|
...snip...
This should fix the problem the majority of the time, if not completely. The double check gives cgminer 120 seconds and a second look before a new instance of the miner is started.
Place this in the monitorcg file:
#!/bin/sh
#set -x
check_inter="60s"

# Second check: wait one more interval, restart only if cgminer is still missing.
chk_again() {
    sleep $check_inter
    a="$(ps | grep cgminer | grep -v 'grep cgminer')"
    if [ -z "$a" ] ; then
        /etc/init.d/cgminer.sh restart
    fi
}

# First check: every interval, look for a running cgminer.
while true; do
    sleep $check_inter
    #date
    a="$(ps | grep cgminer | grep -v 'grep cgminer')"
    if [ -z "$a" ] ; then
        chk_again
    fi
done
There also needs to be some checking added for the ASIC status. If x number of ASICs report an x for status, then cgminer restarts or the system reboots. This can help when overclocking and a board fails out of the blue. The miner will at least restart rather than stay dropped out, losing speed.
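The ASIC status check could look something like the sketch below. It assumes the chain status is available as a string in which 'o' marks a healthy chip and 'x' a failed one; how that string is obtained from the miner is left out, and the function names and threshold are hypothetical.

```shell
#!/bin/sh
# Count failed chips ('x') in a chain status string such as "oooo xxoo".
count_failed() {
    printf '%s' "$1" | tr -cd 'x' | wc -c | tr -d ' '
}

# Succeed (exit 0) when the failed-chip count reaches the threshold.
# usage: needs_restart STATUS_STRING THRESHOLD
needs_restart() {
    [ "$(count_failed "$1")" -ge "$2" ]
}
```

A watchdog could then call something like `needs_restart "$status" 3 && /etc/init.d/cgminer.sh restart`, with the threshold tuned to taste.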
|
Share your results with others on my Discord channel https://discord.gg/6t62apJ
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 28, 2018, 10:53:28 PM |
|
...snip... This should fix the problem the majority of the time if not completely. Doing a double check gives cgminer 120 seconds and a double check before it starts a new instance of the miner.
Please clean up your quotes instead of re-quoting everyone else before you each time.
That will not fix the race. You can add "chk_again" as many times as you want; it narrows the window but does not remove the race.
Jason
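For reference, the usual way to let two processes coordinate over one resource is a shared lock. This is a minimal sketch, assuming both the web interface's restart path and monitorcg could be changed to go through it. The lock path and function names are made up for illustration and are not part of the firmware.

```shell
#!/bin/sh
# Hypothetical lock directory shared by the restart path and the watchdog.
LOCK_DIR="${LOCK_DIR:-/tmp/cgminer-restart.lock}"

# mkdir is atomic: only one caller can create the directory.
acquire_lock() { mkdir "$LOCK_DIR" 2>/dev/null; }
release_lock() { rmdir "$LOCK_DIR" 2>/dev/null; }

# Run a command while holding the lock; skip it if someone else holds it.
with_lock() {
    if acquire_lock; then
        "$@"
        status=$?
        release_lock
        return $status
    fi
    echo "restart already in progress, skipping"
    return 1
}
```

The watchdog would then run `with_lock /etc/init.d/cgminer.sh restart` instead of calling the init script directly. This only helps if every code path that stops or starts cgminer takes the same lock, and a lock left behind after a crash would need separate handling.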
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 28, 2018, 10:58:21 PM |
|
Thank you for those details. I think I might know what is going on -- can you check your PM and email me at the email address I provided? Once we can get confirmation, I should be able to get this fixed reasonably quickly.
....snip...
Ok, this is going to take a little longer. This is Roskomnadzor. I am looking for a workaround.
xkosx - by the time you wake up the issue should be resolved. I have migrated primary services to something not blocked by Roskomnadzor. In fact, I can already see Russian installations coming online.
Thank you,
Jason
|
|
|
|
badbart
Member
Offline
Activity: 449
Merit: 24
|
|
November 28, 2018, 11:56:37 PM |
|
I installed 2.1 and I don't have an option to upload a license file.
The system page says: Efudd's Z9 Series Firmware v2.1 No dev-fee until 12/01/2018!
But under upgrade there is no option to upload a license file.
P.S. My Z9 is running faster now than with your old firmware, with no clock changes.
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 12:18:07 AM |
|
I installed 2.1 and I don't have an option to upload a license file.
The system page says: Efudd's Z9 Series Firmware v2.1 No dev-fee until 12/01/2018!
But under upgrade there is no option to upload a license file.
P.S. My Z9 is running faster now than with your old firmware, with no clock changes.
Refresh your browser cache on the upgrade page: Shift-F5, or Cmd-F5 if you are on a Mac. Once the license is uploaded, the page will tell you whether it was applied. The summary page will update with your license status on the next poll or restart. That page may be cached as well; the same refresh applies.
In the next release (currently being tested), I have fixed the page cache issues.
-Jason
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 01:33:28 AM Last edit: November 29, 2018, 02:24:39 AM by efudd |
|
Folk,
For the month of December, I will be running a contest for users of the Z9 and Z9 Mini version 2.1 or later firmware. There will be one automatic entry per day per miner. On 12/24, a random machine will be selected. The Summary page on the version 2.1 firmware will be automatically updated to let you know who the winner is. Details will be posted in the thread here and on the Equihash discord.
I've created a thread to discuss this at https://bitcointalk.org/index.php?topic=5077347.0
Users will find the Summary page on their miner updating with this information automatically over the next 24 hours. I have put details in the original post, but will copy them here also.
Thank you,
Jason
|
|
|
|
waterman
Full Member
Offline
Activity: 192
Merit: 119
★Bitvest.io★ Play Plinko or Invest!
|
|
November 29, 2018, 02:29:23 AM |
|
Excellent good job dude! I want that PS4
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 02:35:42 AM |
|
Excellent good job dude! I want that PS4
<best phone operator voice> Install now! Supplies are Limited! </phone operator voice>
I was gonna return it and then thought... wait a second! Best of luck to you. It'll go to someone! (I can't help but type this while reading it in a "Saul Goodman" voice).
Jason
|
|
|
|
Marchcat2008
Newbie
Offline
Activity: 17
Merit: 0
|
|
November 29, 2018, 06:19:09 AM |
|
@efudd Please read PM. Thanks.
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 04:26:50 PM |
|
@efudd Please read PM. Thanks.
@Marchcat2008 - Responded. At this point in time, I am not selling new licenses. The developer supported version will remain available. If this changes in the future, I will update the original post in this thread.
Thank you,
Jason
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 05:26:24 PM |
|
Folk,
I wanted to get some feedback on dev-fees: Once per day, or split up throughout the day? I've had feedback from both, but am leaning towards once-per-day.
I personally think that the once-per-day has the least impact since it greatly reduces the swapping/moving things around.
Can you please share your viewpoint on this and the reasoning behind it?
Thank you,
Jason
|
|
|
|
Pizzi_h
Newbie
Offline
Activity: 9
Merit: 0
|
|
November 29, 2018, 08:09:16 PM |
|
Thoughts.
Z9 mini. I have my miners hosted outside, or on direct outside air. We have been having around -15c and the miner worked great; I got it stable at 681mhz since release. 2 fans, front 1800rpm, rear 1640rpm. Chip temps around 28-30c, hash avg 14.9ksols.
Today the weather drastically changed to +2c and I got all 3 boards xxxx.
It seems that it was the bm1740_verify_nonce_integrality CRC error. I rebooted, but it only took like 7min and then I got the same error again. After that I've only been able to maintain 656mhz.
As soon as I go above that I lose one board.
Can it be possible that the colder the chips can be kept, the higher mhz we can maintain? I never tried above 681mhz.
|
|
|
|
j.weber
Newbie
Offline
Activity: 5
Merit: 0
|
|
November 29, 2018, 08:22:55 PM |
|
Definitely once a day. Just a quick question, is there a way I can set the frequency for the different hashboards via PuTTY / the JSON?
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 08:27:35 PM |
|
Thoughts.
Z9 mini. I have my miners hosted outside, or on direct outside air. We have been having around -15c and the miner worked great; I got it stable at 681mhz since release. 2 fans, front 1800rpm, rear 1640rpm. Chip temps around 28-30c, hash avg 14.9ksols.
Today the weather drastically changed to +2c and I got all 3 boards xxxx.
It seems that it was the bm1740_verify_nonce_integrality CRC error. I rebooted, but it only took like 7min and then I got the same error again. After that I've only been able to maintain 656mhz.
As soon as I go above that I lose one board.
Can it be possible that the colder the chips can be kept, the higher mhz we can maintain? I never tried above 681mhz.
This is a very good question. First, on the CRC error -- that is going to happen some and is only a problem if it is constant. It happens on even the stock firmwares depending on machines, temps, frequencies, and phase of the moon.
Temperatures will play into how far you can push these, but there is not a clear formula for that. What's really interesting is I have a customer with a large install (1000+ machines) who has observed that there is a point where the machines get too cold and slow down! I'm unsure of the exact details on the temperatures, just the observation that was shared with me.
So yes, temperature plays a part both when going up and when going down.
The summers here are very hot -- I had to constantly tune my miners, even through the day, to get the maximum out of them; they always ran best at night.
I hope this helps some.
Jason
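To tell "constant" from "occasional", it can help to count the errors per minute. A minimal sketch, assuming syslog lines in the format shown earlier in this thread (timestamp in the third field); the function name is made up for illustration.

```shell
#!/bin/sh
# Print "HH:MM count" for every minute that logged at least one CRC error.
# Reads syslog lines on stdin; field 3 is the HH:MM:SS timestamp.
crc_errors_per_minute() {
    awk '/CRC error/ { n[substr($3, 1, 5)]++ }
         END { for (m in n) print m, n[m] }' | sort
}
```

Feed it the miner's log, for example `logread | crc_errors_per_minute` if the firmware provides BusyBox `logread` (an assumption). Dozens of errors in every minute points to a real problem; a handful spread over hours is the ignorable kind.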
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 08:36:27 PM |
|
Definitely once a day. Just a quick question, is there a way I can set the frequency for the different hashboards via PuTTY / the JSON?
Yessir, bitmain-freq1, bitmain-freq2, bitmain-freq3 are the 3 variables for that. The only caveat is if you set the frequencies via that method the web interface will not get updated to reflect it until you go into the web interface and "save frequencies". Jason
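A minimal sketch of doing that from a shell session, assuming the config is the /config/cgminer.conf seen in the process list and that the keys appear there as quoted strings such as "bitmain-freq1" : "650". That layout is an assumption, so check your own file first; the function name is made up for illustration.

```shell
#!/bin/sh
# Set one per-board frequency key in a cgminer-style JSON config file.
# usage: set_board_freq FILE BOARD(1-3) MHZ
set_board_freq() {
    file=$1 board=$2 mhz=$3
    tmp="$file.tmp.$$"
    # Rewrite only the quoted numeric value that follows the matching key.
    sed "s/\(\"bitmain-freq$board\"[[:space:]]*:[[:space:]]*\"\)[0-9]*\"/\1$mhz\"/" \
        "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For example, `set_board_freq /config/cgminer.conf 2 675` would change only the second board. cgminer presumably has to be restarted to pick up the new value, and the web interface caveat above still applies.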
|
|
|
|
Pizzi_h
Newbie
Offline
Activity: 9
Merit: 0
|
|
November 29, 2018, 08:37:24 PM |
|
This is a very good question. First on the CRC error -- that is going to happen some and is only a problem if it is constant. It happens on even the stock firmwares depending on machines, temps, frequencies, and phase of the moon.
Temperatures will play into how far you can push these, but there is not a clear formula for that. What's really interesting is I have a customer with a large install (1000+ machines) who has observed that there is a point where the machines get too cold and slow down! I'm unsure of the exact details on the temperatures, just the observation that was shared with me.
So yes, temperature plays a part both when going up and when going down.
The summers here are very hot -- my miners I had to constantly tune even through the day to get maximum out of them; they always ran best at night.
I hope this helps some.
Jason
I reflashed the firmware and now I can push it higher than 656mhz. Temps now are 49-50c, fans 2000 rpm.
Okay. Well, if it starts to drop when it is too cold outside, I will have to split the intake air a bit. Thanks for that info.
Another question.. I also tried the Biggie firmware earlier. With the same mhz it could spike to +17ksols; the avg was around 14.5. I think I got higher spikes with the "biggie" firmware than with the mini. The avg is better on the mini FW though.
Good job! Will buy the license when you start bringing in new ones.
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 08:42:17 PM |
|
...snip...
I reflashed the firmware and now I can push it higher than 656mhz. Temps now are 49-50c, fans 2000 rpm.
Okay. Well, if it starts to drop when it is too cold outside, I will have to split the intake air a bit. Thanks for that info.
Another question.. I also tried the Biggie firmware earlier. With the same mhz it could spike to +17ksols; the avg was around 14.5. I think I got higher spikes with the "biggie" firmware than with the mini. The avg is better on the mini FW though.
Good job! Will buy the license when you start bringing in new ones.
The spikes are gonna be completely random, for what it is worth. Your miner could get really lucky on calculations for a few seconds and jump to 2x what you would otherwise expect, but the average is where the truth really sits.
I honestly am not sure I am going to sell new licenses, and may instead stick with the dev supported model. It actually is cheaper for users that way, to be honest... it'll take 3-6 months of runtime or more for me to make up what the license fee was at 3%. It's just a lot easier on me to not manage individual licenses.
Jason
|
|
|
|
efudd (OP)
Member
Offline
Activity: 504
Merit: 51
|
|
November 29, 2018, 09:31:20 PM |
|
Folk,
Due to a seriously idiotic oversight on my part, the frequency list for the Mini maxes out at 700. I'll be releasing a 'b' variant this evening that corrects the issue. If you are already running the 2.1 version, you will receive a notice on your Overview page with details on the 'b' release, as well as a direct link to download it.
Thank you,
Jason
|
|
|
|
chipless
Jr. Member
Offline
Activity: 559
Merit: 4
|
|
November 29, 2018, 09:34:09 PM |
|
Thoughts.
Z9 mini. I have my miners hosted outside, or on direct outside air. We have been having around -15c and the miner worked great; I got it stable at 681mhz since release. 2 fans, front 1800rpm, rear 1640rpm. Chip temps around 28-30c, hash avg 14.9ksols.
Today the weather drastically changed to +2c and I got all 3 boards xxxx.
It seems that it was the bm1740_verify_nonce_integrality CRC error. I rebooted, but it only took like 7min and then I got the same error again. After that I've only been able to maintain 656mhz.
As soon as I go above that I lose one board.
Can it be possible that the colder the chips can be kept, the higher mhz we can maintain? I never tried above 681mhz.
The recommended min temp is around 40c. Colder than that you may lose speed, and too hot you will lose speed. The optimal temp here seems to be about 55c. They put out enough heat to keep my whole house warm. I adjust the temp by opening or closing windows.
|
|
|
|
|