I am trying to figure out the idle bug / help figure it out. I have phoenix running from source.
I added some log commands:
What does that tell me? idleFixer is being called after the problem occurs but does not help. idleFix is not being called.
Any suggestions for debugging? Is there any RPC status information that I could have it spit out?
The code you modified was my last ditch effort to create a workaround for the idle problem. As you can see, it didn't work. I have given up on the current RPCProtocol design because it doesn't work, and without the original developer present I can't seem to fix it. Personally I think the issue lies somewhere in the modification to the Twisted.web library to enable persistent connections. I am working on a complete rewrite for RPCProtocol, but I'm not exactly an expert on netcode so it's taking a bit of time. Expect a release sometime in the next 2 weeks.
|
|
|
This has been added to the SVN, and it will be released with a new version shortly. In the meantime, it is possible to edit kernel.cl and apply this change even with the compiled version.
|
|
|
[16/06/2011 20:21:48] Server gave new work; passing to WorkQueue
[16/06/2011 20:22:01] Result 000000002f4255ac... accepted
[16/06/2011 20:22:06] Result 00000000956163ae... accepted
[16/06/2011 20:22:08] Result 000000006244c69e... accepted
[16/06/2011 20:22:15] Server gave new work; passing to WorkQueue
[16/06/2011 20:22:42] Server gave new work; passing to WorkQueue
[16/06/2011 20:23:08] Result 0000000068e4507d... rejected
[16/06/2011 20:23:08] Result 00000000e2b80025... rejected
[16/06/2011 20:23:09] Server gave new work; passing to WorkQueue
[16/06/2011 20:23:10] New block (WorkQueue)
[16/06/2011 20:23:13] Result 000000003717b83f... accepted
[16/06/2011 20:23:25] Result 00000000706aa557... accepted
[16/06/2011 20:23:28] Result 00000000edf564dc... accepted
[16/06/2011 20:23:37] Warning: work queue empty, miner is idle
[0 Khash/sec] [859 Accepted] [23 Rejected] [RPC (+LP)]
v1.50 did not restart for two hours... Seven other miner instances worked fine, one of them on the same GPU. Could it be something with the socket connection? Anything we can do to help debug?
I am completely out of ideas for fixing this. I have looked through the RPC protocol code countless times and I can't find the cause. At this point I'm going to have to leave it up to other developers to fix. However, if someone can identify the cause of the problem I should be able to fix it.
|
|
|
Awesome! 1.50 totally eliminated the "miner is idle" lines I was seeing every so often!
Edit: I take that back. I'm seeing the "miner is idle" lines again, although less frequently than before.
The problem that was fixed in 1.50 was the miner getting stuck idle due to a bug in the RPC protocol. You will still get "miner is idle" from time to time, but it shouldn't get stuck idle unless it can't connect to the server.
|
|
|
Far fewer stales after the update! Thanks!
What is so unique in hashkill that makes it faster than Phoenix+Phatk?
My guess is that hashkill has less overhead from host <--> device transfers:
Another thing is (don't know if that's possible with pyopencl): don't use clEnqueueReadBuffer() (or whatever its equivalent is). Use clEnqueueMapBuffer() instead. It's noticeably faster. Hm, I really started wondering about modifying some Python miner to incorporate that kernel there; it looks like a quick way to make it portable to Windows. Besides, there are obvious problems with the non-OCL part which are due to code immaturity.
I have looked at the hashkill OpenCL kernel in AMD's KernelAnalyzer and theoretically it should be the same speed as phatk on SDK 2.4
|
|
|
Sometimes I still get stale shares after a LP notification. Is the queue bug still unfixed?
This can happen if the share was already being sent when the LP request returned a new block. Shares are checked against the current block before being sent, but after that there are no further checks.
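To make the race concrete, here is a minimal sketch of the situation described above. This is not Phoenix's actual code; `ShareSender`, `on_new_block` and `submit` are hypothetical names made up for illustration. The only point is that a share which passed the check before the long poll returned can still be in flight afterwards.

```python
class ShareSender:
    """Hypothetical stand-in for the miner's share submission path."""

    def __init__(self):
        self.current_block = None
        self.in_flight = []  # shares handed to the network, no reply yet

    def on_new_block(self, block_id):
        # Called when a long poll response announces a new block.
        self.current_block = block_id

    def submit(self, share_block):
        # The only staleness check happens here, before sending.
        if share_block != self.current_block:
            return False
        self.in_flight.append(share_block)
        return True


sender = ShareSender()
sender.on_new_block("block-A")
sent = sender.submit("block-A")   # passes the check, sending begins
sender.on_new_block("block-B")    # LP returns while the share is in flight

# Shares that were valid when checked but belong to the old block now.
stale_in_flight = [b for b in sender.in_flight if b != sender.current_block]
print(sent, stale_in_flight)
```

A share submitted after the new block is known would be dropped by the check, so only shares already on the wire can end up stale this way.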
|
|
|
I tried it...
2011-06-15 02:29:22: Running: E:\bitcoin\guiminer\guiminer\phoenix.exe -u http://xxxx:xxxx@btcguild.com:8332 PLATFORM=0 DEVICE=0 VECTORS AGGRESSION=5 -v FASTLOOP=true BFI_INT -k phatk
2011-06-15 02:29:22: Listener for "btc" started
2011-06-15 02:29:26: Listener for "btc": [15/06/2011 02:29:26] Finding inner ELF...
From that moment onwards nothing happened for 5 minutes (yet the GUI (guiminer 06-09) was still showing MHASH). Also, the MH was at about 180 compared to 200 with the 1.48 release...
System specs: Intel Core 2 Duo 2160, 1x ATI 6850, 4GB RAM, Windows 7 x64
Guess I will stick with the previous version, and I hope this post was useful.
That's odd, the BFI_INT patcher hasn't been changed since 1.0, so I don't see how updating could break it. Does it work if you don't use the BFI_INT flag?
|
|
|
Version 1.50 has been released. This should fix the miner getting stuck idle. I added a workaround for now since I am not 100% confident that I fixed the underlying problem.
Changes:
1. Fixed long poll crashing when the server disconnects the miner with a message
2. Fixed QueueReader error when stopping the kernel
3. Several RPC protocol changes to reduce occurrence of idle miner problem
4. When idle the miner will now request more work every 15 seconds (this should eliminate idling in cases where the connection isn't lost)
5. LP now works in cases where the URL uses a query string (thanks to error for the patch, see page 30 for details)
@SchizophrenicX You need to add the port after the server: http://username.workername:workerpassword@api.bitcoin.cz:8332
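For anyone unsure which part of the URL is which, the following sketch splits the example URL with Python's standard `urllib.parse`. How Phoenix itself parses the `-u` argument is not shown in this thread, so this only illustrates the URL structure; `describe` is a made-up helper name.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (username, password, hostname, port) for a miner-style URL."""
    parts = urlsplit(url)
    return parts.username, parts.password, parts.hostname, parts.port


with_port = describe(
    "http://username.workername:workerpassword@api.bitcoin.cz:8332")
without_port = describe(
    "http://username.workername:workerpassword@api.bitcoin.cz")

print(with_port)
print(without_port)  # the port field is None when it is left out of the URL
```

When the port is omitted the URL simply carries no port information, so whatever default the client picks may not be the one the pool listens on.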
|
|
|
AGGRESSION is used to set the number of nonces to run per kernel execution. Values above 16 won't do anything because nonces to run = 2^(16 + AGGRESSION), and the nonce field is only 32 bits wide. Setting a higher value for AGGRESSION will simply round down to the highest valid value (16).
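As a quick illustration of the formula above, here is a small sketch that computes the nonce count and applies the cap at 16. The function and constant names are hypothetical and not taken from the Phoenix source; the 32-bit nonce width is the standard Bitcoin block header nonce size.

```python
NONCE_BITS = 32       # Bitcoin block headers carry a 32-bit nonce
BASE_EXPONENT = 16    # from the formula: nonces = 2^(16 + AGGRESSION)
MAX_AGGRESSION = NONCE_BITS - BASE_EXPONENT  # 16


def nonces_per_execution(aggression: int) -> int:
    """Hypothetical helper: nonces covered by one kernel execution."""
    clamped = min(aggression, MAX_AGGRESSION)  # higher values round down
    return 2 ** (BASE_EXPONENT + clamped)


for a in (5, 9, 16, 20):
    print(a, nonces_per_execution(a))
```

With AGGRESSION=16 a single execution already covers the whole 2^32 nonce range, which is why anything higher changes nothing.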
The -a parameter is used for the hashrate display. With higher aggression settings kernels can take several seconds to run, so this makes taking 16 samples rather pointless. The reason multiple samples are used is that otherwise the displayed hashrate would fluctuate greatly at low aggression.
The -f option in poclbm is somewhat similar to aggression because it also controls kernel execution sizes. The main difference is that aggression sets the number of nonces to run per execution while -f sets the target time for each execution. (-f adjusts the number of nonces constantly to maintain the target execution time)
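To show the difference in approach, here is a rough sketch of the "target time" idea: after each execution, scale the nonce count so the next run takes about the target duration. This is not poclbm's actual code, just an assumption about how such an adjustment could look; `next_batch_size` is a made-up name.

```python
MAX_NONCES = 2 ** 32  # the whole 32-bit nonce range


def next_batch_size(current_nonces: int, elapsed_s: float,
                    target_s: float) -> int:
    """Scale the batch so the next run lands near target_s (hypothetical)."""
    if elapsed_s <= 0:
        return current_nonces  # no usable measurement, keep the size
    scaled = int(current_nonces * target_s / elapsed_s)
    return max(1, min(scaled, MAX_NONCES))


# A run of 1,000,000 nonces took 0.5 s but the target is 1.0 s: double it.
print(next_batch_size(1_000_000, 0.5, 1.0))
# A run took twice the target time: halve it.
print(next_batch_size(1000, 2.0, 1.0))
```

With AGGRESSION the batch size is fixed and the execution time varies with the hardware; with a target time it is the other way around.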
As for updates to Phoenix, I think I have a solid workaround for the idle problem now. I still don't know the exact reason for it, but the workaround should be sufficient to recover automatically. I will release this in 1.50 shortly if I don't find any issues in testing. These changes are included in the latest SVN revision (r101) if anyone wants to try it.
|
|
|
The immediate solution is to use Phoenix's undocumented lpaskrate command line parameter, which is the Long Poll equivalent of the askrate parameter. A more structural fix would be for the Phoenix client to change lpaskrate default value from 0 to something sensible in the range 10-119. Of course the expiry policy may also be changed by the pool operator. But the default Pushpool work expiry policy is not overly strict in my opinion; work that's over 2 minutes old misses out on a lot of recent transactions and therefore on the transaction fees these include.
An update on this bug: Upon further research into the Phoenix code it seems that specifying the lpaskrate argument does not actually solve anything, so please ignore the quoted solution. I originally assumed the queuesize was intended as both preferred and maximum size, so that the oldest work would be abandoned when overfilling the queue. Instead, current work will only be abandoned when a new block has been detected; any forced getwork responses will just be added to the workqueue, only making the problem worse. A better avenue for improvement, although more involved, might be to add an expiration mechanism to the Phoenix code. Until then I'd recommend never mining at less than 36MH/s to avoid rejected shares caused by this bug.
This is not accurate concerning the queue. The queue itself uses the deque class, which has a maximum size set to the size the user specifies. Adding additional work to the queue will automatically purge the oldest work if the queue is full. See the documentation for deque here:
If maxlen is not specified or is None, deques may grow to an arbitrary length. Otherwise, the deque is bounded to the specified maximum length. Once a bounded length deque is full, when new items are added, a corresponding number of items are discarded from the opposite end. Bounded length deques provide functionality similar to the tail filter in Unix. They are also useful for tracking transactions and other pools of data where only the most recent activity is of interest.
http://docs.python.org/library/collections.html#collections.deque
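The bounded deque behaviour quoted above is easy to check in a few lines. This is only a demonstration of `collections.deque` with `maxlen`; the queue size of 3 and the work item names are made up and are not Phoenix's actual values.

```python
from collections import deque

# A queue bounded to 3 items, standing in for a user-specified queue size.
queue = deque(maxlen=3)

for unit in ["work-1", "work-2", "work-3", "work-4"]:
    queue.append(unit)  # appending to a full deque drops the oldest item

print(list(queue))
```

After the fourth append the oldest entry ("work-1") is gone, so a full queue never grows past the size it was created with.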
|
|
|
I think the reason nobody has developed a pure CUDA kernel for Phoenix (or a CUDA port of poclbm) is that Nvidia cards are very poor miners compared to similarly priced ATI cards. As a result of this the vast majority of Nvidia cards used for mining were likely intended for gaming first.
We were planning on making a CUDA kernel for Phoenix, but we didn't get very far before shifting focus to other areas. (BFI_INT implementation)
|
|
|
I'm working on an extensive debug build that will dump out a highly detailed logfile every time the miner goes idle. Hopefully this will enable me to find the cause of this bug.
|
|
|
I am running an Nvidia G102m and a Core 2 Duo on Windows 7 64-bit. Every time I try to run it this error comes up, and if I run it with these particular settings my graphics card driver crashes with this error message. I am running the current graphics driver, 270.61. If I use VECTORS BFI_INT then my driver doesn't crash, but the same error message comes up.
C:\Program Files (x86)\Bitcoin\phoenix-1.48>phoenix -u http://username:password@bitcoinpool.com:8334/ -k poclbm FASTLOOP=false AGGRESSION=9
[23/05/2011 16:39:49] Phoenix 1.48 starting...
[23/05/2011 16:39:50] Connected to server
[0 Khash/sec] [0 Accepted] [0 Rejected] [RPC]
Unhandled Error
Traceback (most recent call last):
  File "threading.pyc", line 504, in __bootstrap
  File "threading.pyc", line 532, in __bootstrap_inner
  File "threading.pyc", line 484, in run
--- <exception caught here> ---
  File "twisted\python\threadpool.pyc", line 207, in _worker
  File "twisted\python\context.pyc", line 118, in callWithContext
  File "twisted\python\context.pyc", line 81, in callWithContext
  File "kernels\poclbm\__init__.py", line 392, in mineThread
pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
If I use these settings then this error message comes up and it just keeps on saying failed to connect.
C:\Program Files (x86)\Bitcoin\phoenix-1.48>phoenix -u http://username:password@bitcoinpool.com:8334/-k poclbm -v VECTORS BFI_INT AGGRESSION=7
[23/05/2011 16:59:23] Phoenix 1.48 starting...
[23/05/2011 16:59:23] Failed to connect, retrying...
[23/05/2011 16:59:42] Failed to connect, retrying...
[0 Khash/sec] [0 Accepted] [0 Rejected] [RPC]
First of all, Nvidia GPUs don't support BFI_INT, so adding that flag won't work. Secondly, low-end Nvidia cards are very slow miners, which at the current difficulty are pointless to run. I'm not sure how you are getting "clEnqueueReadBuffer failed: invalid command queue" though.
|
|
|
A temporary fix for anyone not using the compiled binaries is to just use the minerutil folder from 1.4. This doesn't support persistent connections, so it won't work correctly on Slush's pool, but it should work fine everywhere else. Thanks to everyone who tested the 1.48 debug build, it appears the problem is elsewhere. You can find the 1.4 release files here: http://svn3.xp-dev.com/svn/phoenix-miner/tags/release-1.4/
|
|
|
My main system runs Windows 7, because I play a lot of games. The only system I have running Linux at the moment is my mining box, but that has no monitor/keyboard and the only means of managing it is SSH.
You can try http://www.virtualbox.org/wiki/Downloads to run Linux in your Windows 7. Not sure how tc will work on a virtual adapter though, but I think it should.
That's practically useless since I won't be able to mine fast enough to cause the issue without using a GPU (remember that you don't get direct access to GPUs in a VM). At best I can get about 20 Mhash/sec from this CPU, which will need more work and find a share only about once every 3.5 minutes.
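The 3.5 minute figure can be reproduced with a short calculation. The sketch below assumes pool shares are difficulty 1, which takes 2^32 hashes on average to find; that assumption is not stated in the post, and the function name is made up.

```python
HASHES_PER_SHARE = 2 ** 32  # average hashes per difficulty-1 share (assumed)


def expected_seconds_per_share(hashrate_hps: float) -> float:
    """Average time between shares at the given hashrate (hashes/sec)."""
    return HASHES_PER_SHARE / hashrate_hps


seconds = expected_seconds_per_share(20e6)  # 20 Mhash/sec
minutes = seconds / 60
print(round(seconds, 1), round(minutes, 2))
```

At 20 Mhash/sec this comes to roughly 215 seconds, a little over three and a half minutes per share on average.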
|
|
|
My main system runs Windows 7, because I play a lot of games. The only system I have running Linux at the moment is my mining box, but that has no monitor/keyboard and the only means of managing it is SSH. I'll try this out sometime tomorrow, since it's somewhat annoying to modify code on the server (can't just edit -> save -> run, have to upload it over SFTP).
I have already tried simulating server downtime, but in the case of this issue (as I understand it) the connection isn't "busy", it's lost. I tried this already by blocking the IP in my router firewall (so it just times out, as if the server were down). The miner was able to reconnect every time, and I tested it with 3 different pools (deepbit, slush, bitcoinpool), MultiMiner (using RPC), and a bitcoind instance running on a remote machine. However, simply delaying packets by some amount should make this easy to reproduce.
|
|
|
I got the profiler working... that was a lot easier than I thought it would be. I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently. Anyway, I'm running the first batch of samples now.
Are you going to be modifying the kernel much? I'm curious as to how phatk reduced the operation count by that amount...
I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs Phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.
The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer, particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4. It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.
|
|
|
I seem to be running into a strange problem with the latest build. It seems to be doing the idle thing, but instead of locking up, it's just not processing anything. There's no error generated or any apparent problem communicating with RPC (at least it's not reporting one), but it's acting like there is and bringing the hash rate down.
Can you post the log? Unless you get "Work queue empty, miner is idle" you probably have a different issue. That message is logged by WorkQueue so it's independent of any protocol changes. Basically it means the kernel requested work and there is nothing in the queue to give it. If this doesn't appear it means that either the queue isn't empty or the kernel is not requesting work (this could be caused by a driver error, hardware issues, etc.).
How do I do that, or do you mean just what's on the screen? I stopped using Phoenix again because it locked up like the previous version(s) and did not recover. Didn't look like there was much change for me on rev 99. I can try to help you out if you tell me what you need specifically, but it's costing me $$ every time I have 6 GH idle for hours.
I just mean the output in the console window. Do you get the message "Work queue empty, miner is idle" or not? If you don't get this you have a completely different problem.
|
|
|
Anybody using the "phatk" kernel on HD 5970 ??
Did a quick test on one of my cores and didn't see any noticeable difference over poclbm (less if anything... SDK 2.1, fglrx 11.2, clocks 800/300)
It's only beneficial on SDK 2.4. For 2.1 use poclbm.
|
|
|