Turbor
Legendary
Offline
Activity: 1022
Merit: 1000
BitMinter
|
|
June 11, 2012, 06:42:45 PM |
|
Turbor, what you see in your first log is called the "downclock of death" and people, including me, have seen this for a long time. What you see in your second log is just normal behavior.
Yeah, I know about the DCOD. What do you mean by normal behavior? This DCOD will lead to a disabled FPGA. If you set -oh to 0.05 it will disable the chip at 188 MHz, and with -oh 0.7 at 60 MHz. The funny thing is that from time to time I'm able to mine at 120 MHz until I restart the board.
|
|
|
|
Inspector 2211
|
|
June 11, 2012, 09:47:33 PM |
|
Turbor, what you see in your first log is called the "downclock of death" and people, including me, have seen this for a long time. What you see in your second log is just normal behavior.
What do you mean by normal behavior?
Oh, I thought that the last two lines (212 MHz and 208 MHz) meant that everything was fine and dandy again. I didn't see the "FPGA is shut down to prevent damage" line. If it's shut down, why does it then have normal frequencies like 212 MHz, 216 MHz and 208 MHz again? Anyway, this sounds like a classic RMA situation to me.
|
|
|
|
BR0KK
|
|
June 11, 2012, 10:18:37 PM |
|
bus-0-0: ztex_ufm1_15y1-0001-02-07-3: Set frequency to 68.00MHz
bus-0-0: ztex_ufm1_15d4-0001-02-05-1: f=208.00MHz, errorRate=0.50%, maxErrorRate=1.57%, hashRate=207.0MH/s, submitted 13 new nonces, luckFactor=1.14
bus-0-0: ztex_ufm1_15d4-0001-02-06-1: f=208.00MHz, errorRate=0.00%, maxErrorRate=0.66%, hashRate=208.0MH/s, submitted 8 new nonces, luckFactor=0.82
bus-0-0: ztex_ufm1_15y1-0001-02-07-1: f=212.00MHz, errorRate=0.75%, maxErrorRate=0.87%, hashRate=210.4MH/s, submitted 12 new nonces, luckFactor=0.99
bus-0-0: ztex_ufm1_15y1-0001-02-07-2: f=208.00MHz, errorRate=0.01%, maxErrorRate=0.59%, hashRate=208.0MH/s, submitted 16 new nonces, luckFactor=0.97
bus-0-0: ztex_ufm1_15y1-0001-02-07-3: f=68.00MHz, submitted 2 new nonces, luckFactor=0.96
The third FPGA on his quad is failing. All the others behave normally.
|
|
|
|
ztex (OP)
Donator
Sr. Member
Offline
Activity: 367
Merit: 250
ZTEX FPGA Boards
|
|
June 12, 2012, 07:53:26 AM |
|
The third FPGA on his quad is failing. All the others behave normally.
I asked you to return the board a few weeks ago. So please don't complain about a non-working board without giving me the chance to repair or replace it.
|
|
|
|
vv01f
|
|
June 12, 2012, 10:36:24 AM |
|
@ztex: I don't think he is "complaining"; he is only trying to explain things. I guess he has bad luck with country borders (I don't know where he lives); I can't imagine any other reason to keep running it that way all the time.
Another question, because of possible failures when running the boards for a long time (OK, here it was <24 h with -oh 0.08): Is it possible to make BTCMiner exit on errors and return an error number, or to parse the logs, in order to deal with the problem? Until now BTCMiner does not exit but tries to keep running (waiting for input q or s) and prints error messages, so I had to restart it manually.
I will add another log in the evening where such a problem with a bad string stopped BTCMiner from working. It could be due to the pool (but neither of the two backup pools was used) or my internet connection (the error message does not say so), or perhaps my failing core is involved (I will first check the reconnect time of the router and get the log). But perhaps you already have some idea ...
One simple solution that comes to mind, though I'm not sure it is sufficient: running the startup script $0 again after BTCMiner exits (until now I stop it from doing that until the user agrees with a key press, using read -s).
Or simpler: What is the best way to run clusters automatically (e.g. headless, without monitoring) and handle errors? Or: How can I monitor in some way other than checking the logs?
|
|
|
|
Turbor
|
|
June 12, 2012, 10:40:31 AM |
|
The third FPGA on his quad is failing. All the others behave normally.
I asked you to return the board a few weeks ago. So please don't complain about a non-working board without giving me the chance to repair or replace it.
I never complained about the bad FPGA. Things like that happen and I can live with that. That was and is our agreement. I just tried to give the other guy some advice. If an FPGA shuts down all the time, -oh is the only way to find out where the problem is, IMO, because if you don't set it, the FPGA shuts down after 3 to 4 drops. That can be too early for a badly mounted heatsink.
|
|
|
|
rupy
|
|
June 12, 2012, 01:44:22 PM |
|
I think the whole -oh ratio variable is hard to grasp... could it be "how many frequency jumps per second" or something we can understand professor?
0.7 equals a 70% hashrate drop, 0.6 equals 60%, and so on. The base frequency seems to be 200 MHz.
70% hashrate drop per what second?
|
|
|
|
vv01f
|
|
June 12, 2012, 02:13:14 PM |
|
70% hashrate drop per what second?
As I understand it, yes, but that's not really important. It's 70% of the number in MHz, and that number translates (taking luck and errors into account) roughly 1:1 into MH/s. So, since the hash rate is counted in MH/s, the drop is roughly 70% of that. If you counted in MH/h, it would still be relative to your figure in MH/h. The only term I do not really understand with regard to the calculation is "submitted hash rate". Is it the total number of nonces translated into MH/s?
|
|
|
|
gr0bi42
|
|
June 12, 2012, 02:50:26 PM |
|
@ztex: after trying to get one quad board (labeled 1.15x) running on Linux (Ubuntu 12.04), I have growing problems with one of the FPGAs. That chip was already the slowest in my setup, leveling in at 208 MHz (all others do at least 216 MHz) previously on Windows 7. After realizing that I had to use sudo, the Linux client also ran right away, but it shut down FPGA #2 due to an increasing hash rate drop (4-10%). I already had that with -oh 0 on Windows 7, but with -oh 0.04 it was OK at 208 MHz. Now, after reconnecting to Windows for double-checking, the hash rate drop also occurred there, and #2 is disconnected because of it. I have already remounted the cooler twice; the MX4 coverage of the chip looked OK, with nothing beside the chip. Any ideas what to check or try next?
*edit, replaced "error" with the correct term "hash rate drop": while writing, #1 was also disconnected for a hash rate drop of 4% and the others are leveling down to 212 MHz .. I think it is better to power that one off.
Hi, I have the same problem with one of my quads. One FPGA's frequency toggles right after starting BTCMiner:
001-1: ztex_ufm1_15y1-04A36E054A-1: Change from frequency 200.00MHz to 196.00MHz
001-1: ztex_ufm1_15y1-04A36E054A-1: Change from frequency 196.00MHz to 200.00MHz
001-1: ztex_ufm1_15y1-04A36E054A-1: Change from frequency 200.00MHz to 196.00MHz
001-1: ztex_ufm1_15y1-04A36E054A-1: Change from frequency 196.00MHz to 192.00MHz
001-1: ztex_ufm1_15y1-04A36E054A-1: Change from frequency 192.00MHz to 196.00MHz
I got a tip from ztex to reset the error and performance counters (c command) after a while. After the c command the FPGA goes up to 220 MHz and runs stable.
I had two other boards with permanent hashrate drop errors. Sent them in for repair... quick return... now they are running fine.
But one problem remains: every once in a while one or two FPGAs of my cluster were shut down by BTCMiner due to this hashrate drop error. This does not happen with cgminer. I'm sure it's not a cooling problem. Maybe it's a power problem (loose contact). I don't know.
BUT, I've started to modify BTCMiner for my needs. One change I've made is to auto-reset the error and performance counters every hour. This way the cluster always hashes at max. speed, and especially at night, when temps come down, I gain some extra MHashes. As an interesting side effect, I have not had a hashrate drop error shutdown since then. The cluster has been running super stable for 2 days now, without any issue.
|
Donations are welcome: 1Btf3BqUegfe5iFdWsgfBf1Ew3YsAvsrLT
|
|
|
gr0bi42
|
|
June 12, 2012, 03:27:53 PM |
|
|
|
|
|
vv01f
|
|
June 12, 2012, 04:25:47 PM Last edit: June 12, 2012, 05:12:11 PM by vv01f |
|
About the early-morning problem: my log says the DSL connection was (re)established on 12.06.2012 at 04:31, so I don't think that was the problem.
The miner was started (apart from the login data) with this command:
java -cp /home/wolf/btcminer/ZtexBTCMiner-120417.jar BTCMiner -f ztex_ufm1_15y1.ihx -host http://mmpool.bitparking.com:15098 -u vv01f -p testpw -b http://de.btcguild.com:8332 vv01f_fpga testpwd -b http://eu.ozco.in:8332 accnumber testpw -l /home/wolf/btcminer/fpga.log -bl /home/wolf/btcminer/fpga-submitted.log -m c -oh 0.08
It finally resulted in a full stop with the error message
2012-06-12T05:03:48: Stopped thread for bus 001-0
after some problem with pool data / a string (not sure how to read that properly):
2012-06-12T04:59:28: 001-0: ztex_ufm1_15y1-04A36DE0AA-3: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T04:59:32: 001-0: ztex_ufm1_15y1-04A36DE0AA-1: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:10: 001-0: ztex_ufm1_15y1-04A36E0711-4: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:16: 001-0: ztex_ufm1_15y1-04A36E0711-2: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:30: 001-0: ztex_ufm1_15y1-04A36DE0AA-4: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:57: 001-0: ztex_ufm1_15y1-04A36E0711-1: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:01:23: 001-0: ztex_ufm1_15y1-04A36E0711-3: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:01:40: 001-0: ztex_ufm1_15y1-04A36DE0AA-1: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:01:56: 001-0: ztex_ufm1_15y1-04A36E0711-4: Error: Invalid length of string: Disabling device
2012-06-12T05:01:56: 001-0: ztex_ufm1_15y1-04A36E0711-2: Error: Invalid length of string: Disabling device
2012-06-12T05:01:57: 001-0: ztex_ufm1_15y1-04A36DE0AA-3: Error: Invalid length of string: Disabling device
2012-06-12T05:01:57: 001-0: ztex_ufm1_15y1-04A36DE0AA-4: Error: Invalid length of string: Disabling device
2012-06-12T05:02:11: 001-0: ztex_ufm1_15y1-04A36E0711-1: Error: Invalid length of string: Disabling device
2012-06-12T05:02:41: 001-0: ztex_ufm1_15y1-04A36E0711-3: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:02:46: 001-0: ztex_ufm1_15y1-04A36DE0AA-1: Error: Invalid length of string: Disabling device
2012-06-12T05:02:59: Warning: Invalid length of string
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36DE0AA-1: Error: Invalid length of string: Device disabled since 2012-06-12T05:02:46
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36DE0AA-2: Error: Hash rate drop of 10,0% detect. This may be caused by overheating. FPGA is shut down to prevent damage. 50.0: Device disabled since 2012-06-11T06:43:14
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36DE0AA-3: Error: Invalid length of string: Device disabled since 2012-06-12T05:01:57
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36DE0AA-4: Error: Invalid length of string: Device disabled since 2012-06-12T05:01:57
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36E0711-1: Error: Invalid length of string: Device disabled since 2012-06-12T05:02:11
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36E0711-2: Error: Invalid length of string: Device disabled since 2012-06-12T05:01:56
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36E0711-3: f=216,00MHz, errorRate=0,11%, maxErrorRate=1,23%, hashRate=215,8MH/s, submitted 10 new nonces, luckFactor=0,91
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36E0711-4: Error: Invalid length of string: Device disabled since 2012-06-12T05:01:56
2012-06-12T05:03:29: 001-0: poll loop time: 236ms (USB: 11ms network: 225ms) getwork time: 362ms submit time: 407ms
2012-06-12T05:03:29: 001-0: Warning: 3 overflows occured. This is usually caused by a slow network connection.
2012-06-12T05:03:29: Total hash rate: 215,8 MH/s
2012-06-12T05:03:29: Total submitted hash rate: 195,7 MH/s
2012-06-12T05:03:29: --------
2012-06-12T05:03:48: 001-0: ztex_ufm1_15y1-04A36E0711-3: Error: Invalid length of string: Disabling device
Full log for that run: Log 4 BTCMiner failing on some string error - it will expire in ~30d.
Thanks for the hint about the c command at runtime; it seems I didn't read the help properly. But for now I see no lasting effect. Which FPGA on the problematic board is the second one to slow down just changes randomly: right now #4 is slowed down to 200 MHz, and on the first try I also saw #3 disabled until the manual q command. For some time the hash rate will go up, but I fear it has a bad impact if done too often.
No offence, but I will not change the setup to your firmware until ztex says it's OK. For the same reason I will not set -oh above 0.1: I just want to prevent further damage that he has nothing to do with. Instead I will try to keep at least 3 cores of that board running stable until Friday.
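For anyone who wants to post-process status lines like the ones in the log above, here is a minimal Python sketch. It assumes the status line layout shown in this thread and accepts both decimal separators that appear here (216,00MHz as well as 208.00MHz). The names STATUS_LINE, to_float and parse_status are made up for this example and are not part of BTCMiner.

```python
import re

# A decimal number written with either "." or "," as separator.
DECIMAL = r"\d+[.,]\d+"

# Layout assumed from the status lines posted in this thread.
STATUS_LINE = re.compile(
    rf"(?P<device>ztex_\S+): f=(?P<freq>{DECIMAL})MHz, "
    rf"errorRate=(?P<error>{DECIMAL})%, maxErrorRate=(?P<max_error>{DECIMAL})%, "
    rf"hashRate=(?P<rate>{DECIMAL})MH/s, "
    rf"submitted (?P<nonces>\d+) new nonces, luckFactor=(?P<luck>{DECIMAL})"
)


def to_float(text):
    """Convert '215,8' or '215.8' to a float."""
    return float(text.replace(",", "."))


def parse_status(line):
    """Return a dict for a full status line, or None for any other line.

    Shortened lines such as 'f=68.00MHz, submitted 2 new nonces, ...'
    (no error rate fields) are deliberately not matched.
    """
    match = STATUS_LINE.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "freq_mhz": to_float(match.group("freq")),
        "error_rate_pct": to_float(match.group("error")),
        "max_error_rate_pct": to_float(match.group("max_error")),
        "hash_rate_mhs": to_float(match.group("rate")),
        "nonces": int(match.group("nonces")),
        "luck_factor": to_float(match.group("luck")),
    }


line_comma = (
    "2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36E0711-3: f=216,00MHz, "
    "errorRate=0,11%, maxErrorRate=1,23%, hashRate=215,8MH/s, "
    "submitted 10 new nonces, luckFactor=0,91"
)
line_dot = (
    "001-0: ztex_ufm1_15y1-04A36DF26A-1: f=232.00MHz, errorRate=0.23%, "
    "maxErrorRate=1.83%, hashRate=231.5MH/s, submitted 11 new nonces, luckFactor=0.90"
)

status_comma = parse_status(line_comma)
status_dot = parse_status(line_dot)
print(status_comma["device"], status_comma["hash_rate_mhs"])
print(status_dot["device"], status_dot["hash_rate_mhs"])
```

Summing hash_rate_mhs over all parsed lines of one report should roughly reproduce the "Total hash rate" line, as long as only full status lines are counted.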
|
|
|
|
BR0KK
|
|
June 12, 2012, 08:00:13 PM |
|
The third FPGA on his quad is failing. All the others behave normally.
I asked you to return the board a few weeks ago. So please don't complain about a non-working board without giving me the chance to repair or replace it.
Nooooo, mine is working perfectly, so everything is OK. I was just pointing out that the third FPGA of the user was failing.
|
|
|
|
ztex (OP)
|
|
June 13, 2012, 11:40:19 AM |
|
70% hashrate drop per what second?
Time plays no role.
It's 70% of the number in MHz, and that number translates (taking luck and errors into account) roughly 1:1 into MH/s.
It's relative to the hash rate (i.e. frequency minus error rate) as stated in the logs. E.g. if the maximum frequency is 212 MHz and the error rate at this frequency is 1%, a frequency drop to 200 MHz is equal to a hash rate drop of (at least) 4.707%.
The only term I do not really understand with regard to the calculation is "submitted hash rate". Is it the total number of nonces translated into MH/s?
"Submitted hash rate" is calculated based on the number of shares successfully submitted to the pool(s).
|
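ztex's 212 MHz example can be checked with a few lines of Python. This is only a sketch of the arithmetic described in the post, taking "frequency minus error rate" to mean frequency multiplied by (1 - error rate); the function names are made up here, and how BTCMiner computes the value internally is not shown in this thread.

```python
def hash_rate(frequency_mhz, error_rate):
    """Effective hash rate in MH/s: the frequency reduced by the error rate."""
    return frequency_mhz * (1.0 - error_rate)


def hash_rate_drop(max_freq_mhz, error_rate_at_max, new_freq_mhz, error_rate_at_new=0.0):
    """Relative hash rate drop between the best operating point and the new one.

    With error_rate_at_new=0.0 this is the smallest possible drop,
    which is why the post says "(at least)".
    """
    before = hash_rate(max_freq_mhz, error_rate_at_max)
    after = hash_rate(new_freq_mhz, error_rate_at_new)
    return (before - after) / before


# Numbers from the post: 212 MHz at 1% errors, then a drop to 200 MHz.
drop = hash_rate_drop(212.0, 0.01, 200.0)
print(f"{drop:.3%}")  # 4.707%
```

Any non-zero error rate at 200 MHz makes the drop larger, so a frequency step of 12 MHz alone already comes close to a threshold such as -oh 0.05.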
|
|
|
ztex (OP)
|
|
June 13, 2012, 11:45:29 AM |
|
Another question due to possible failing on long (ok, here it was <24h with -oh 0.08) running the boards: Is it possible to exit BTCMiner on errors and read some error-number or parse the logs for dealing with the problem? Until now BTCMiner wont exit but trying to keep running (waiting for input q or s) and throwing error msg .. so I had to rerun manually ... Or simpler: How to run clusters automatically (e.g. headless without monitoring) and handling errors best? Or: How to monitor another way than checking the logs?
I think the easiest way is to add a secondary input method for entering commands. Then everyone can write their own scripts to deal with errors by sending commands through a named pipe.
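Until such an input method exists, a script can at least watch the log file written with -l and react when a device gets disabled. The following Python sketch assumes the log line format from vv01f's excerpt (timestamp, bus, device, "Error: ..."); the regular expression and the function name disabled_devices are made up for this example and are not part of BTCMiner.

```python
import re

# Assumed layout: "<timestamp>: <bus>: <device>: Error: <message>"
ERROR_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}): "
    r"(?P<bus>\S+): (?P<device>\S+): Error: (?P<message>.*)$"
)


def disabled_devices(lines):
    """Map each disabled device to (timestamp, message) of its first report.

    Pool timeouts ("Disabling URL ... for 60s") are ignored because they
    disable a pool URL, not a device.
    """
    disabled = {}
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match is None:
            continue
        message = match.group("message")
        if "Disabling device" in message or "Device disabled" in message:
            disabled.setdefault(
                match.group("device"), (match.group("time"), message)
            )
    return disabled


SAMPLE_LOG = """
2012-06-12T04:59:28: 001-0: ztex_ufm1_15y1-04A36DE0AA-3: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:01:56: 001-0: ztex_ufm1_15y1-04A36E0711-4: Error: Invalid length of string: Disabling device
2012-06-12T05:02:59: Warning: Invalid length of string
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36DE0AA-2: Error: Hash rate drop of 10,0% detect. This may be caused by overheating. FPGA is shut down to prevent damage. 50.0: Device disabled since 2012-06-11T06:43:14
2012-06-12T05:03:29: 001-0: ztex_ufm1_15y1-04A36E0711-4: Error: Invalid length of string: Device disabled since 2012-06-12T05:01:56
"""

found = disabled_devices(SAMPLE_LOG.strip().splitlines())
for device, (time, message) in found.items():
    print(f"{device} disabled, first reported at {time}: {message}")
```

A wrapper script could run such a check periodically and then notify the operator or restart the miner; whether a restart is a good idea for a device that was shut down because of a hash rate drop is a separate question.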
|
|
|
|
ztex (OP)
|
|
June 13, 2012, 11:54:39 AM |
|
The third FPGA on his quad is failing. All the others behave normally.
I asked you to return the board a few weeks ago. So please don't complain about a non-working board without giving me the chance to repair or replace it.
Nooooo, mine is working perfectly, so everything is OK. I was just pointing out that the third FPGA of the user was failing.
Sorry, this was not addressed to you.
|
|
|
|
ztex (OP)
|
|
June 13, 2012, 12:00:42 PM |
|
2012-06-12T04:59:28: 001-0: ztex_ufm1_15y1-04A36DE0AA-3: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T04:59:32: 001-0: ztex_ufm1_15y1-04A36DE0AA-1: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:10: 001-0: ztex_ufm1_15y1-04A36E0711-4: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:16: 001-0: ztex_ufm1_15y1-04A36E0711-2: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:30: 001-0: ztex_ufm1_15y1-04A36DE0AA-4: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:00:57: 001-0: ztex_ufm1_15y1-04A36E0711-1: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:01:23: 001-0: ztex_ufm1_15y1-04A36E0711-3: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:01:40: 001-0: ztex_ufm1_15y1-04A36DE0AA-1: Error: Read timed out: Disabling URL http://mmpool.bitparking.com:15098 for 60s
2012-06-12T05:01:56: 001-0: ztex_ufm1_15y1-04A36E0711-4: Error: Invalid length of string: Disabling device
2012-06-12T05:01:56: 001-0: ztex_ufm1_15y1-04A36E0711-2: Error: Invalid length of string: Disabling device
2012-06-12T05:01:57: 001-0: ztex_ufm1_15y1-04A36DE0AA-3: Error: Invalid length of string: Disabling device
2012-06-12T05:01:57: 001-0: ztex_ufm1_15y1-04A36DE0AA-4: Error: Invalid length of string: Disabling device
2012-06-12T05:02:11: 001-0: ztex_ufm1_15y1-04A36E0711-1: Error: Invalid length of string: Disabling device
...
These errors probably occur if the pool returns erroneous responses. BTCMiner does not handle these exceptions properly. (The correct behavior would be to disable the pool for 60 s; at the moment I'm not sure where these exceptions occur.)
Instead I will try to keep at least 3 cores of that board running stable until Friday.
Return it if you can't find the error. (You have my email address because you have already written me emails.)
|
|
|
|
SamHa1n
Member
Offline
Activity: 60
Merit: 10
|
|
June 14, 2012, 02:57:23 AM |
|
001-0: ztex_ufm1_15y1-04A36DF26A-1: f=232.00MHz, errorRate=0.23%, maxErrorRate=1.83%, hashRate=231.5MH/s, submitted 11 new nonces, luckFactor=0.90
001-0: ztex_ufm1_15y1-04A36DF26A-2: f=232.00MHz, errorRate=0.00%, maxErrorRate=0.00%, hashRate=232.0MH/s, submitted 15 new nonces, luckFactor=1.15
001-0: ztex_ufm1_15y1-04A36DF26A-3: f=232.00MHz, errorRate=0.21%, maxErrorRate=1.87%, hashRate=231.5MH/s, submitted 18 new nonces, luckFactor=0.94
001-0: ztex_ufm1_15y1-04A36DF26A-4: f=228.00MHz, errorRate=0.00%, maxErrorRate=0.59%, hashRate=228.0MH/s, submitted 17 new nonces, luckFactor=0.94
Sweet!
|
|
|
|
Inspector 2211
|
|
June 14, 2012, 03:40:32 AM |
|
001-0: ztex_ufm1_15y1-04A36DF26A-1: f=232.00MHz, errorRate=0.23%, maxErrorRate=1.83%, hashRate=231.5MH/s, submitted 11 new nonces, luckFactor=0.90
001-0: ztex_ufm1_15y1-04A36DF26A-2: f=232.00MHz, errorRate=0.00%, maxErrorRate=0.00%, hashRate=232.0MH/s, submitted 15 new nonces, luckFactor=1.15
001-0: ztex_ufm1_15y1-04A36DF26A-3: f=232.00MHz, errorRate=0.21%, maxErrorRate=1.87%, hashRate=231.5MH/s, submitted 18 new nonces, luckFactor=0.94
001-0: ztex_ufm1_15y1-04A36DF26A-4: f=228.00MHz, errorRate=0.00%, maxErrorRate=0.59%, hashRate=228.0MH/s, submitted 17 new nonces, luckFactor=0.94
Sweet!
Is this already Eldentyrell's bitstream???
|
|
|
|
SamHa1n
|
|
June 14, 2012, 04:47:19 AM |
|
Is this already Eldentyrell's bitstream???
No, it is running the ztex_ufm1_15y1 bitstream from ZtexBTCMiner-120417.
|
|
|
|
Inspector 2211
|
|
June 14, 2012, 04:57:45 AM |
|
Is this already Eldentyrell's bitstream???
No, it is running the ztex_ufm1_15y1 bitstream from ZtexBTCMiner-120417.
I was asking because, in a different thread, EldenTyrell wrote: "the first Bitstream I'll post will be a 230 MH/s design". So, if the standard ZTEX bitstream already achieves 230 MH/s, what's the f***cking point of bothering with ET's clever scheme of only accepting encrypted start vectors, then decrypting them in his bitstream, and only generating encrypted golden nonces? <confused>
|
|
|
|
|