btcsql
June 25, 2013, 11:41:26 PM

What is the source of HW errors? --avalon-auto on the new firmware is producing a 1:1 ratio of accepted shares to HW errors... any ideas?

-ck
June 25, 2013, 11:58:04 PM

A few notes about the auto-clocking approach.
First and foremost, you can fry your hardware, because you are running your Avalon out of specification. This is especially true if you try it on a batch 1 device with its lower-power, lower-quality PSU.
As is virtually always the case, manually fine-tuning the final result will be better than an automated process that guesses. In time I want to remove the requirement for fixed intervals and let the user specify any arbitrary frequency, though getting the interface to cope with that is a bit of an issue at the moment.
Ironically, some people are finding the frequency a little too high and others a little too low. I suspect everyone has a different idea of what the ideal frequency is. The targets I've set are based on hardware errors as a percentage, with hysteresis of +/- 0.25%. The reason is that a 0.5% increase in hardware errors works out to the amount the hashrate would rise with each 2 MHz increment. In other words, if your hardware error rate is going up as fast as the hashrate should rise, you are wasting energy. Ideally you would use a regression plot: measure the hashrate rise and the HW error percentage rise with each increment, and see when one grows faster than the other. But those are absurd stats to go chasing, especially when the values fluctuate wildly even under normal circumstances. By default with avalon-auto you will get hardware errors of 1-1.5%. When looking at the hardware error count, make sure you compare it with the diff1 shares and not the accepted shares, since you will almost certainly be mining at a higher difficulty. Hardware errors are harmless in their own right, but they show how hard you're pushing the chips for their available voltage and cooling. It sounds like these chips are capable of much more with more voltage, but no one has done that mod yet.
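Here is a minimal sketch of the hysteresis rule described above. It is not cgminer's actual code. It assumes the band is centred on 1.25%, so that +/- 0.25% gives the quoted 1-1.5% range. It uses the 2 MHz step from the post. The clamp limits `MIN_MHZ`/`MAX_MHZ` and the function name `next_frequency` are hypothetical and exist only for illustration.

```python
# Hypothetical constants for illustration; not taken from cgminer.
TARGET_HW_PERCENT = 1.25   # assumed band centre, giving the 1.0-1.5% range
HYSTERESIS = 0.25          # +/- 0.25% dead band from the post
STEP_MHZ = 2               # frequency increment from the post
MIN_MHZ = 256              # hypothetical lower clamp
MAX_MHZ = 400              # hypothetical upper clamp


def next_frequency(current_mhz: int, hw_percent: float) -> int:
    """Return the next clock, changing it only when HW% leaves the dead band."""
    if hw_percent > TARGET_HW_PERCENT + HYSTERESIS:
        # Errors are rising faster than the hashrate gain: back off one step.
        return max(MIN_MHZ, current_mhz - STEP_MHZ)
    if hw_percent < TARGET_HW_PERCENT - HYSTERESIS:
        # Headroom left: try one step higher.
        return min(MAX_MHZ, current_mhz + STEP_MHZ)
    return current_mhz  # inside the band: leave the clock alone


if __name__ == "__main__":
    freq = 340
    for hw in (0.6, 0.8, 1.2, 1.6, 1.4):
        freq = next_frequency(freq, hw)
        print(f"HW {hw:.1f}% -> {freq} MHz")
```

The dead band keeps the clock from hunting up and down on every small fluctuation in the error rate. The clock only moves when the HW percentage clearly leaves the 1-1.5% window.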
The way to calculate hardware error percentage is: HW * 100 / (diff1 + HW)
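Below is a minimal Python version of that formula. The counter values in the example are invented. It also shows why comparing HW errors against accepted shares is misleading at a higher share difficulty: each accepted share at difficulty D stands for roughly D diff1 shares. The sketch approximates diff1 as accepted × D, which ignores rejects and variance.

```python
def hw_error_percent(hw: int, diff1: int) -> float:
    """Hardware error percentage: HW * 100 / (diff1 + HW)."""
    total = diff1 + hw
    if total == 0:
        return 0.0  # nothing counted yet
    return hw * 100 / total


def naive_accepted_ratio(hw: int, accepted: int) -> float:
    """HW errors per accepted share (the misleading comparison)."""
    return hw / accepted if accepted else float("inf")


if __name__ == "__main__":
    # Hypothetical counters after mining at share difficulty 64 for a while.
    share_diff = 64
    accepted = 200
    diff1 = accepted * share_diff  # rough approximation: ignores rejects/variance
    hw = 200                       # same count as accepted, so it "looks" 1:1
    print(f"HW per accepted share: {naive_accepted_ratio(hw, accepted):.2f}")
    print(f"HW error percentage:   {hw_error_percent(hw, diff1):.2f}%")
```

With these numbers, the ratio of HW errors to accepted shares is 1:1. The real HW error percentage is only about 1.5%, which is inside the range auto aims for.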
It's also worth mentioning that, to simplify the calculation of different frequencies, this latest firmware passes slightly lower values to the Avalon for the "regular values" (i.e. 300 and below) than it used to. This should make only a negligible difference to hashrate, lost in its normal variance. The "timeout" value passed is also smaller now, which means you may hit the limit at lower speeds than you used to. However, the old timeouts were too high. Even if you apparently had a higher hashrate, if you go back and check your stats you may find you were getting more rejects. The higher timeouts were causing duplicate shares to be generated, so they were purely a disadvantage.
A sure-fire sign that you're overdoing it is any of the following: cgminer being repeatedly restarted by the Avalon watchdog, periods of hashrate dropping, or smoke coming out of your PSU.

Developer/maintainer for cgminer, ckpool/ckproxy, and the -ck kernel 2% Fee Solo mining at solo.ckpool.org -ck

Elokane
June 26, 2013, 12:07:34 AM

Thank you for taking the time to write this informative description and update.

-ck
June 26, 2013, 12:16:10 AM

Fans are cycling heavily between minimum and around 3400 rpm when I set target-temp to 48 degrees. They affect power draw: the values above were taken at minimum fan speed, and high speed draws about +15 W.
Will try again with the default temp ;)
Donated 1 BTC to ckolivas: 7a398e9723d533dfc13d99ec44e040645704f939e037851a84cddc430dab0d00-000
The current behaviour doesn't seem like the optimal strategy. RPM starts at minimum, rises to over 3000 rpm when target-temp is hit, and then slows back down to minimum step by step. I think it would be better to raise the fans slowly before the target is hit.
@ckolivas, and if I may express a wish: a 40% minimum fanspeed would also be fine for me ;) or make it configurable, as with GPUs.
Appreciate the donation, thanks. In actual fact, the fans are told to increase slowly before the target is hit. The thing is, the fans don't really support such small increments in PWM settings and ignore them until certain thresholds are reached. These fans don't support fine control like a GPU fan and really only have about 6 different speeds. Writing a true PID controller with all the mathematics involved would be overkill for this purpose, and the coarse granularity of fanspeed control would make it a futile exercise. The small overshoot followed by a huge fan boost that you describe should only happen in two cases: during the first few minutes after starting your Avalon, or if you set your target temp very close to the minimum temp your hardware will run at (something like 35?). I'll look at further config options in the future, time permitting.
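This hypothetical sketch illustrates why small PWM increments do nothing on these fans. It pairs a controller that ramps the requested duty cycle as temperature approaches the target with a fan that only honours about six discrete speeds. The ramp window, the duty range and the `FAN_LEVELS` thresholds are all invented for the example. The real cgminer logic and the real fan behaviour may differ.

```python
# Hypothetical fan model: the fan snaps the requested PWM duty down to one of
# about six real speeds (thresholds invented for illustration).
FAN_LEVELS = [10, 25, 40, 55, 70, 100]  # duty % the fan actually runs at


def requested_duty(temp_c: float, target_c: float, ramp_c: float = 5.0,
                   min_duty: int = 10, max_duty: int = 100) -> int:
    """Ramp duty linearly over the last ramp_c degrees below the target."""
    if temp_c <= target_c - ramp_c:
        return min_duty
    if temp_c >= target_c:
        return max_duty
    frac = (temp_c - (target_c - ramp_c)) / ramp_c
    return round(min_duty + frac * (max_duty - min_duty))


def effective_duty(duty: int) -> int:
    """The fan ignores changes until the next speed threshold is reached."""
    reached = [level for level in FAN_LEVELS if level <= duty]
    return reached[-1] if reached else FAN_LEVELS[0]


if __name__ == "__main__":
    for temp in (42, 44, 45, 46, 47, 48):
        duty = requested_duty(temp, target_c=48)
        print(f"{temp} C: asked {duty}%, fan runs at {effective_duty(duty)}%")
```

The requested duty rises smoothly, but the fan only moves in coarse jumps. A finer controller, such as a PID loop, would therefore buy little on this hardware.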

btcsql
June 26, 2013, 12:20:06 AM

Hi ckolivas, thanks for everything. What I have noticed is that STROMBOM's firmware at 325 MHz gives about 1-2% HW errors. On the other hand, the latest one you put out, with --auto and temp targeting, is throwing 15-20% HW errors at the same clock. Is there any way to combine the best of both worlds and get strombom's level of HW errors, but with the ability to control the temp? Would be greatly appreciated.

-ck
June 26, 2013, 12:23:48 AM

No idea why that would be the case. Auto tries to keep HW errors below 1.5%. Are you sure you're not mining at a higher diff?

btcsql
June 26, 2013, 12:29:00 AM

Right? No idea here either. I literally have the exact same settings saved and switched between the two firmwares. Auto was setting the clock to 327. Even with Auto off and the clock manually set to 325, the HW error rate was still 15-20%, compared to 2% on strombom's. So weird!

-ck
June 26, 2013, 12:31:02 AM

Try restarting it a few times from the interface, perhaps? I find it a bit less reliable when it starts up normally. But yeah, I don't know why that would be the case...

btcsql
June 26, 2013, 12:41:29 AM

Tried restarting multiple times from the interface; still seeing 15-20% HW errors. Soo weird.

-ck
June 26, 2013, 12:44:48 AM

Hmm... Auto won't start changing clocks unless the actual nonces returned are within 10% of expected. Perhaps try enabling auto and starting at a lower clock, such as 300.
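Here is a rough sketch of the gating check described above. It is an interpretation, not cgminer's code. It assumes "within 10% of expected" means at least 90% of the expected count. It derives the expected diff1 nonce count from the nominal hashrate, using the standard fact that a difficulty-1 share takes 2^32 hashes on average. The function names and the 80 GH/s figure are hypothetical.

```python
HASHES_PER_DIFF1 = 2 ** 32  # average hashes needed per difficulty-1 share


def expected_diff1_nonces(hashrate_hs: float, seconds: float) -> float:
    """Average number of diff1 nonces a device at hashrate_hs should return."""
    return hashrate_hs * seconds / HASHES_PER_DIFF1


def auto_may_adjust(actual_nonces: int, hashrate_hs: float, seconds: float,
                    tolerance: float = 0.10) -> bool:
    """Allow clock changes only when returned nonces are near expectation.

    Assumption: only a shortfall is checked (>= 90% of expected).
    """
    expected = expected_diff1_nonces(hashrate_hs, seconds)
    return actual_nonces >= (1.0 - tolerance) * expected


if __name__ == "__main__":
    rate = 80e9  # hypothetical nominal 80 GH/s
    print(f"expected in 60 s: {expected_diff1_nonces(rate, 60):.0f}")
    print(auto_may_adjust(1050, rate, 60))  # about 94% of expected
    print(auto_may_adjust(900, rate, 60))   # about 81% of expected
```

Nonce counts are noisy over short windows, so a real check would need a long enough sampling period to avoid tripping on ordinary variance.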

dogie
June 26, 2013, 01:56:35 AM

How does auto balance maximising clock speed against minimising fan speed? What is the hierarchy?

johnyj
June 26, 2013, 02:00:26 AM

Thanks for the detailed info! I noticed that the overclocked Avalon was stable for the first 10 or so hours, but then it became more and more unstable. Even though the outside temperature dropped significantly during the night, cgminer restarted repeatedly. I feel the instability might come from the FPGA. What could be the cause? Have you observed the same accumulating instability over time? P.S. I also sent 1 BTC to you, cgminer still rules.

-ck
June 26, 2013, 02:08:19 AM

How does auto balance maximising clock speed against minimising fan speed? What is the hierarchy?
Unlike the GPU code, they're totally independent. Clock speed is determined solely by hardware errors, whereas fanspeed is determined by temperature. On this sort of hardware, HW errors tend to go hand in hand with rising temperature. GPUs, by contrast, are designed to be deterministic right up to failure, so HW errors should almost never happen there.

-ck
June 26, 2013, 02:11:07 AM

Thanks for the detailed info! I noticed that the overclocked Avalon was stable for the first 10 or so hours, but then it became more and more unstable. Even though the outside temperature dropped significantly during the night, cgminer restarted repeatedly. I feel the instability might come from the FPGA. What could be the cause? Have you observed the same accumulating instability over time? P.S. I also sent 1 BTC to you, cgminer still rules.
And thank you. I'm sure instability can manifest in any number of ways. It's probably either resetting the device regularly because the chips are failing, or idling frequently because the PSU can't keep up, or something along those lines.

-ck
June 26, 2013, 04:10:03 AM

I am seeing little or no improvement by cooling with a portable A/C.
Unit with A/C 1h 37m 58s 83896.29 temp3 43 freq(auto) 354
Unit without A/C 6h 54m 22s 83111.32 temp3 53 freq(auto) 353
I guessed this might be the case, since the temperatures really aren't getting into the error range even with regular air cooling. It's 3 degrees at my home overnight and the hashrate doesn't go up. I suspect the hashrate will only get higher with more voltage given to the chips.

shmadz
June 26, 2013, 04:34:56 AM

Just curious, what might the "error range" be? It's getting rather hot in here and I think I might have to buy another AC unit... summer is right around the corner...

"You have no moral right to rule us, nor do you possess any methods of enforcement that we have reason to fear." - John Perry Barlow, 1996

-ck
June 26, 2013, 04:39:41 AM

Very much dependent on the chips, so this can only be a wild guess, but... 80+ degrees?

shmadz
June 26, 2013, 04:56:21 AM

Thank you, CKolivas! 80C will cook me and my tiny apartment. I think I need to figure out a way to vent the heat directly outside without letting the rain and snow in.

-ck
June 26, 2013, 04:58:36 AM

Hah, well, don't take my word for it; as I said, it's pure speculation.

fhh
June 26, 2013, 06:12:21 AM

Auto tries to keep HW errors below 1.5%. Are you sure you're not mining at a higher diff?
So the HW errors shown are a multiple of the difficulty I'm mining at? I'm seeing a higher percentage. cgminer restarted during the night and the clock is again at 341 MHz. I'm getting a high rate of rejects from the pool. cgminer shows me nearly 79 GH/s, but on the pool (BitParking) it's only around 71 GH/s, about what it was at 300 MHz? Watching this.
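When comparing the two figures, remember that pools commonly estimate a miner's hashrate from the difficulty of the shares they accept. Rejected shares therefore never count toward the pool-side number. Below is a minimal sketch of that estimate. It is not BitParking's actual method, which isn't described here. The share window in the example is hypothetical.

```python
HASHES_PER_DIFF1 = 2 ** 32  # average hashes per difficulty-1 share


def pool_side_hashrate(accepted_share_diffs, seconds):
    """Estimate H/s from the difficulties of accepted shares over a window."""
    return sum(accepted_share_diffs) * HASHES_PER_DIFF1 / seconds


def shortfall_percent(local_ghs, pool_ghs):
    """How far the pool-side figure sits below the locally reported one."""
    return (local_ghs - pool_ghs) * 100 / local_ghs


if __name__ == "__main__":
    print(f"{shortfall_percent(79, 71):.1f}% below the local figure")
    # Hypothetical window: 155 accepted difficulty-64 shares in 600 s.
    shares = [64] * 155
    print(f"{pool_side_hashrate(shares, 600) / 1e9:.1f} GH/s")
```

With the numbers above, the gap is about 10%, and 155 accepted difficulty-64 shares in 10 minutes works out to roughly 71 GH/s. Pool estimates also carry share-luck variance, so a short window can swing either way.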