GandalfG
|
|
July 04, 2013, 07:45:54 AM |
|
"handling an interrupt, so interrupts can be interrupted." BKK "Don Rumsfeld" Coins
“There are interrupted interrupts; there are things we interrupt that we interrupt. There are interrupts uninterrupted; that is to say, there are things that we now interrupt we don't interrupt. But there are also uninterrupted uninterrupts – there are things we do not interrupt we don't interrupt.” - Klondike Secretary of Design, BKK "Donald Rumsfeld" Coins
Oh lol, I laughed to tears.
|
|
|
|
BkkCoins (OP)
|
|
July 04, 2013, 04:23:23 PM |
|
Today's Update.
I spent all day testing and trying to find what is causing HW errors. I also did some comparison/companion testing with the Erupter that a very generous turtle83 sent me. The Klondike and Erupter ran fine together, and the cgminer menu items seem to be fine now too, after updating to 3.3.1.
I spent a lot of time analysing share.logs and running the data through my kslog util to generate work data for ktest. What I found out was that almost all the HW errors are non-repeatable. If I take accepted data and feed it back in manually, I get the same nonce out. When I feed in similar data that resulted in error nonces, I usually get NO nonce out at all. This seems to indicate some problem with midstate/precalc/data not getting into the ASIC correctly, rather than errors caused by bad result capture. I've checked my code several times, trying to find anywhere the data gets corrupted before being pushed to the ASIC, and can't see it.
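For anyone wanting to repeat this kind of replay check on the host side, here is a minimal Python sketch of how a returned nonce can be tested against difficulty 1. It is not the kslog/ktest code (neither is shown in the thread); `nonce_is_valid` and its arguments are made-up names, and it assumes you have the first 76 bytes of the block header in the byte order that actually gets hashed.

```python
import hashlib
import struct


def nonce_is_valid(header76: bytes, nonce: int) -> bool:
    """Return True if `nonce` yields a difficulty-1 share for this header.

    header76 is assumed to be the 80-byte block header minus the trailing
    4-byte nonce, already in hashing byte order (hypothetical input format).
    """
    if len(header76) != 76:
        raise ValueError("expected the first 76 bytes of the block header")
    # The nonce is the last 4 bytes of the header, little-endian.
    header = header76 + struct.pack("<I", nonce)
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    # Read as a little-endian number, the top 32 bits of the hash are the
    # last 4 bytes of the digest. A difficulty-1 share needs them to be zero.
    return digest[28:] == b"\x00\x00\x00\x00"
```

A nonce that fails this test is what shows up as a hardware error on the host. If the same work item gives a good nonce on one run and nothing on another, that points at the data path into the chip rather than at the check itself.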
As the day progressed I found the error rate dropping off as well. After a run of 1.5 hours alongside the Erupter, I found that the Klondike had about a 3% error rate and the Erupter about 1.5%. But I'd been getting a lot of Rejected shares, and I wondered if that was due to the slow speed and delays in submitting shares or what. So this evening I switched from 50btc (getwork) to BTCGuild (stratum) and saw that Rejects dropped a lot, and so far HW errors have gone completely to 0 (knock on wood).
So it could even be that some problem with generating work with GetWork is sending bad data to the Klondike (?? weird), as with stratum (local block generation) I have not been getting HW errors. I'm trying to understand how that can be. I never see USB disconnects at all now. And if HW errors drop right off with stratum, then I'll probably add another ASIC and start checking the chaining next. Right now it's running at a 150 MHz clock with no heat sink, and it's a bit hottish, but touchable with fingers for about 5 seconds.
Or maybe error rates actually get lower as the clock rate rises because going from 128 to 150 seems to have lowered the HW errors. Hmmm. Figure that out.
Plan for tomorrow: solder down more chips.
|
|
|
|
kano
|
|
July 04, 2013, 04:36:08 PM |
|
A few things:
1) HW: is reported in 1diff, but 3.3.1 (and earlier) report A: and R: in shares (which can be any diff, depending on what you are talking to). Current git reports them in 1diff (i.e. the next cgminer version will be 1diff only for all of HW, A and R); we changed that a few days ago in git. In API devs I report both.
2) In current git I have also implemented what we call cps, on Icarus and ModMinerQuad. AMU (asic miner USB) is Icarus. In my mining, the AMU at 335MH/s gets around 1% errors (certainly less than 1.5%). Without cps you would expect more errors.
3) In my API stats I've added 2 new fields: "USB Pipe" and "USB Delay". If "USB Pipe" is non-zero then there are USB problems happening that could also be causing errors. "USB Delay" shows if there are timing 'issues' occurring in the code (cps fixes these and reports them in "USB Delay").
|
|
|
|
cardcomm
|
|
July 04, 2013, 04:46:31 PM |
|
Sweet!!! Now things are getting really interesting. Thanks again for the hard work and determination!
|
|
|
|
Bicknellski
|
|
July 04, 2013, 04:58:49 PM |
|
That will do BKK. That will do.
|
|
|
|
alfabitcoin
|
|
July 04, 2013, 05:23:22 PM |
|
So the pool protocol causes high HW errors?
|
|
|
|
BkkCoins (OP)
|
|
July 04, 2013, 05:37:36 PM |
|
So the pool protocol causes high HW errors?
Makes no sense, I know. And I'm not saying it does, but when I switched to stratum the rates dropped right down. Still scratching my head. I'm just letting both Erupter and Klondike run now. Klondike currently has A:99 R:0 HW:2 - which is the best it's been yet, though not as good as the Erupter at A:293 R:0 HW:3.
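To put those counters side by side, here is a small Python sketch of the arithmetic. The formula (HW errors divided by all results returned) is an assumption about how the percentages are being reckoned, not something stated in the thread, and `hw_error_rate` is a made-up helper. It also assumes A, R and HW are all counted in the same units, which per kano's note only holds on 3.3.1 when the pool difficulty is 1.

```python
def hw_error_rate(accepted: int, rejected: int, hw_errors: int) -> float:
    """Fraction of returned results that were hardware errors.

    Assumes all three counters use the same unit (e.g. 1diff shares).
    """
    total = accepted + rejected + hw_errors
    if total == 0:
        return 0.0  # nothing returned yet, so no meaningful rate
    return hw_errors / total


# Counters quoted in the post above.
for name, a, r, hw in [("Klondike", 99, 0, 2), ("Erupter", 293, 0, 3)]:
    print(f"{name}: {hw_error_rate(a, r, hw):.2%}")
```

On these numbers the Klondike comes out at roughly 2% and the Erupter at roughly 1%. With only 2 and 3 errors counted, the sample is too small to read much into the difference.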
|
|
|
|
k9quaint
|
|
July 04, 2013, 05:46:15 PM |
|
Awesome sauce. The getwork vs stratum is puzzling.
|
|
|
|
alfabitcoin
|
|
July 04, 2013, 05:52:35 PM |
|
So the pool protocol causes high HW errors?
Makes no sense, I know. And I'm not saying it does, but when I switched to stratum the rates dropped right down. Still scratching my head. I'm just letting both Erupter and Klondike run now. Klondike currently has A:99 R:0 HW:2 - which is the best it's been yet, though not as good as the Erupter at A:293 R:0 HW:3.
Well, it makes sense and it doesn't. You designed the K16 from scratch, you don't have the ASIC comm protocol source, and you don't know the Erupter comm protocol either. So something there is causing the problem. Maybe Avalon will release the comm protocol soon so we can be sure.
|
|
|
|
fasmax
|
|
July 04, 2013, 05:52:59 PM |
|
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
|
|
|
|
cardcomm
|
|
July 04, 2013, 06:34:52 PM |
|
Isn't the GetWork protocol deprecated anyway? Not that it shouldn't work, but I thought stratum was the preferred protocol.
|
|
|
|
BkkCoins (OP)
|
|
July 04, 2013, 06:38:38 PM |
|
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
I hadn't thought about that, but it's possible, and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.
Isn't the GetWork protocol deprecated anyway? Not that it shouldn't work, but I thought stratum was the preferred protocol.
I haven't been following that but I'm sure stratum is preferred. And if it works that much better, for whatever reasons, then I'm not going to worry much about getwork.
****
I pushed new updates to github earlier with some small tweaks. The firmware now takes clock cfg values from 256 up to 900. They are double the MHz rate, so that's 128 - 450 MHz (not that you can run at 450, but the PLL on the ASIC accepts values that high). The code now detects when the value is <500 and sets the half-clock bit. It also excludes 451-499 (i.e. 225-249 MHz) by forcing those to 450, since the PLL doesn't support that range.
|
|
|
|
cp1
|
|
July 04, 2013, 06:43:01 PM |
|
What's the input clock for the avalon running at?
|
|
|
|
BkkCoins (OP)
|
|
July 04, 2013, 06:46:31 PM |
|
What's the input clock for the avalon running at?
32 MHz.
There are 2 PLL control values, R and N. By setting R=32 you get N = 2x the MHz rate, which is what I expose as the clk cfg value. The documented range is 500 - 900, but a "half rate" bit allows dividing that by 2. So for N < 500 I set that bit and use 2N for the control value. I don't allow a cfg value below 256 even though the PLL allows down to 250.
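The cfg-to-PLL mapping described here can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the thread, not the actual PIC firmware. `pll_setting`, `PllSetting` and the constant names are made up, and clamping out-of-range values (rather than rejecting them) is an assumption.

```python
from typing import NamedTuple

CFG_MIN = 256            # lowest cfg value the firmware is said to accept
CFG_MAX = 900            # top of the documented PLL range
HALF_RATE_BELOW = 500    # below this the half-rate bit is needed
HALF_RATE_MAX_CFG = 450  # 2 * 450 = 900 is the largest usable doubled value


class PllSetting(NamedTuple):
    n: int            # value written to the PLL's N control
    half_rate: bool   # whether the half-rate bit is set
    mhz: float        # resulting core clock, cfg / 2


def pll_setting(cfg: int) -> PllSetting:
    """Map a clock cfg value (2 x MHz, with R fixed at 32) to PLL settings."""
    # Assumption: out-of-range requests are clamped, not refused.
    cfg = max(CFG_MIN, min(CFG_MAX, cfg))
    if cfg < HALF_RATE_BELOW:
        # 451-499 would need N above 900, so force those down to 450.
        cfg = min(cfg, HALF_RATE_MAX_CFG)
        return PllSetting(n=2 * cfg, half_rate=True, mhz=cfg / 2)
    return PllSetting(n=cfg, half_rate=False, mhz=cfg / 2)
```

For example, a cfg of 300 (150 MHz) maps to N = 600 with the half-rate bit set, while a cfg of 600 (300 MHz) maps to N = 600 with the bit clear.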
|
|
|
|
siran
|
|
July 04, 2013, 06:50:56 PM |
|
I hadn't thought about that, but it's possible, and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.
About that heatsink: isn't it true that Avalon chips must be cooled from below? I mean, you can't put a heatsink on top of the chip; it goes below the PCB with a silicone thermal pad, just like the Block Erupter is cooled.
|
|
|
|
BkkCoins (OP)
|
|
July 04, 2013, 06:52:35 PM |
|
About that heatsink: isn't it true that Avalon chips must be cooled from below? I mean, you can't put a heatsink on top of the chip; it goes below the PCB with a silicone thermal pad, just like the Block Erupter is cooled.
Yes, that's right. The heat sink is mounted under the board. There are 1cm x 1cm exposed pads with thermal vias to help dissipation to the heat sink. Or, rather, the chips are mounted on the bottom and the heat sink on top, so the board is upside down...
|
|
|
|
Bicknellski
|
|
July 04, 2013, 07:27:56 PM |
|
Liquid cooling... I wanna see 450.
|
|
|
|
Igor_Rast
|
|
July 04, 2013, 08:57:21 PM |
|
Liquid cooling... I wanna see 450.
Dunk it in mineral oil.
|
|
|
|
babcoccl
|
|
July 04, 2013, 09:51:20 PM |
|
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
In my RL job, I previously worked on a project where a PLL was throwing our whole system out of whack. The problem was that it would lock only about 50 percent of the time, so we would get intermittent valid data with occasional garbage. After thoroughly tracing out various components, we observed that there was an unusual amount of noise getting into the PLL, causing it to lose its lock occasionally. This was compounded by there being varying degrees of noise at various frequencies. Once we filtered these out we were able to maintain a continuous lock and produce clean data. The PLL might be a good place to start looking. Just make sure your PLL maintains a good lock.
|
|
|
|
Taugeran
|
|
July 05, 2013, 12:57:09 AM |
|
I don't remember if anyone has asked this before. I've been silently watching in the background...
Anyway, for the PIC firmware, I remember you stating that it subdivides the nonce range by n chips and pushes those ranges to the chips.
How difficult/possible would it be to rework the FW to do 1 job per chip?
This is just out of curiosity, since I put in an order for 5 chips in a group buy + board (once finalized [TY, T13Hydra]).
-Taugeran
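For readers unfamiliar with the idea, splitting the nonce range by n chips can be sketched as below in Python. This is only one plausible way to do it. The firmware's real method isn't shown in the thread, and `split_nonce_range` is a made-up name.

```python
NONCE_SPACE = 1 << 32  # a nonce is a 32-bit value


def split_nonce_range(n_chips: int) -> list[tuple[int, int]]:
    """Split the 32-bit nonce space into n_chips contiguous inclusive ranges."""
    if n_chips < 1:
        raise ValueError("need at least one chip")
    ranges = []
    for i in range(n_chips):
        # Integer arithmetic keeps the ranges contiguous, with no gaps or
        # overlaps even when n_chips does not divide 2**32 evenly.
        start = i * NONCE_SPACE // n_chips
        end = (i + 1) * NONCE_SPACE // n_chips - 1
        ranges.append((start, end))
    return ranges
```

With this scheme every chip works on the same job but a different slice of nonces, so one work item is exhausted n times faster. The alternative asked about here, one job per chip, would instead give each chip the full range on its own work item.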
|
|
|
|
|