MacCompiler
Newbie
Offline
Activity: 53
Merit: 0
|
|
May 19, 2011, 09:51:51 PM |
|
I’ve packaged DiabloMiner with some helper scripts that will make it easier for new users to start mining on Mac OS. DiabloMiner will work like a normal application that you double-click on to open. Have a look at this thread for details and downloads.
|
|
|
|
MysteryMiner
Legendary
Offline
Activity: 1512
Merit: 1049
Death to enemies!
|
|
May 19, 2011, 10:29:03 PM |
|
I really like Diablo Miner but there are a few problems with it:
1. When I run the new Diablo miner versions with .exe instead of .bat in it, it does not work. I get a black console screen for maybe 25ms and it exits. Not enough time to even hit the Pause button to see what's wrong. I need to use the .bat file instead.
2. No speed improvements on the new version with BFI_INT. I even get a speed decrease. I get 260 Mh/s with the 2011-04-23 version and with 2011-05-19 I get 250 Mh/s
I use ATI HD5850 with Catalyst 10.11 and ATI SDK 2.1
|
bc1q59y5jp2rrwgxuekc8kjk6s8k2es73uawprre4j
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 02:26:05 AM |
|
I really like Diablo Miner but there are a few problems with it:
1. When I run the new Diablo miner versions with .exe instead of .bat in it, it does not work. I get a black console screen for maybe 25ms and it exits. Not enough time to even hit the Pause button to see what's wrong. I need to use the .bat file instead.
2. No speed improvements on the new version with BFI_INT. I even get a speed decrease. I get 260 Mh/s with the 2011-04-23 version and with 2011-05-19 I get 250 Mh/s
I use ATI HD5850 with Catalyst 10.11 and ATI SDK 2.1
You run the exe the same way as the bat. The exe does not magically read your mind on what arguments you want to use. The bat is probably running the old jar, which means, no, you're not running a new version of DiabloMiner.
|
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 05:38:11 AM |
|
is there any way I can use the miner without installing java? can you put it in a wrapper and compile the whole thing including the java?
Java does not work that way.
I call bullshit: http://gcc.gnu.org/java/
I'm aware of gcj, and I do not consider something that cannot run quite a few apps, and is a shitload slower at it, an actually valid Java implementation. Oh, and last time I noticed, they didn't do JNI yet, so you can't run my miner with it.
|
|
|
|
ryepdx
|
|
May 20, 2011, 05:42:05 AM |
|
Oh, and last time I noticed, they didn't do JNI yet, so you can't run my miner with it.
Ah, okay. Got it.
|
|
|
|
Jaime Frontero
|
|
May 20, 2011, 06:07:39 AM |
|
two 5870s, CC 11.5, SDK 2.1, on Debian testing.
i don't know yet how much faster it is than your pre-BFI_INT release.
but a lot.
i'm putting in some extra fans and a rheostatic fan speed controller - it's so damn fast that i have to clock it down right now to keep temps under 85.
so going from the old version, max volted at 300 MemClock and 900 GPUClock, to the new version down-volted by almost 0.2, MemClock at 315 and GPUClock at 850; i picked up a bit over 100 Mh/s.
i'll have the new fans and controller in tomorrow. i have another box that i've experimented with fans on - just a single 5870, but i've learned a bit. i'm hoping for a maxed-out setup on the dual box, running at well under 75 degrees. we'll see.
|
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 07:02:19 AM |
|
two 5870s, CC 11.5, SDK 2.1, on Debian testing.
i don't know yet how much faster it is than your pre-BFI_INT release.
but a lot.
i'm putting in some extra fans and a rheostatic fan speed controller - it's so damn fast that i have to clock it down right now to keep temps under 85.
so going from the old version, max volted at 300 MemClock and 900 GPUClock, to the new version down-volted by almost 0.2, MemClock at 315 and GPUClock at 850; i picked up a bit over 100 Mh/s.
i'll have the new fans and controller in tomorrow. i have another box that i've experimented with fans on - just a single 5870, but i've learned a bit. i'm hoping for a maxed-out setup on the dual box, running at well under 75 degrees. we'll see.
At stock 850, 2 5870 should be in the neighborhood of 740 using -v 2 -w 128 on SDK 2.1. BFI_INT adds around 10%.
|
|
|
|
DustinEwan
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 20, 2011, 07:05:41 AM |
|
I got the profiler working... that was a lot easier than I thought it would be. I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently. Anyway, I'm running the first batch of samples now. Are you going to be modifying the kernel much? I'm curious as to how phatk reduced the operation count by that amount...
|
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 07:10:24 AM |
|
I got the profiler working... that was a lot easier than I thought it would be. I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently. Anyway, I'm running the first batch of samples now. Are you going to be modifying the kernel much? I'm curious as to how phatk reduced the operation count by that amount...
I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.
|
|
|
|
DustinEwan
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 20, 2011, 07:12:59 AM |
|
I completely agree with you... I've looked at both code and it's almost line for line exactly the same...
I tried looking for other SHA256 algorithms, just in case anybody had come up with something clever besides the norm, but there's nothing out there really... in the cpu world Crypto++ is king and that's pretty much it..
|
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 07:40:28 AM |
|
Update: Added all of Dustin's suggestions, and also added a timeout for non-LP connections.
|
|
|
|
Jaime Frontero
|
|
May 20, 2011, 07:54:18 AM |
|
two 5870s, CC 11.5, SDK 2.1, on Debian testing.
i don't know yet how much faster it is than your pre-BFI_INT release.
but a lot.
i'm putting in some extra fans and a rheostatic fan speed controller - it's so damn fast that i have to clock it down right now to keep temps under 85.
so going from the old version, max volted at 300 MemClock and 900 GPUClock, to the new version down-volted by almost 0.2, MemClock at 315 and GPUClock at 850; i picked up a bit over 100 Mh/s.
i'll have the new fans and controller in tomorrow. i have another box that i've experimented with fans on - just a single 5870, but i've learned a bit. i'm hoping for a maxed-out setup on the dual box, running at well under 75 degrees. we'll see.
At stock 850, 2 5870 should be in the neighborhood of 740 using -v 2 -w 128 on SDK 2.1. BFI_INT adds around 10%.
pretty much. i'm getting 746-748. i'm hoping that once i get the voltage back up, and the GPUClock at 900 again, i'll be somewhere considerably closer to 800Mh/s.
by the way, Diablo - do you agree with the formula (picked up somewhere on this forum...) that the sweet spot for MemClocks is very close to: GPUClock/3 + 14 ?
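As a quick check of that rule of thumb, here is a minimal Java sketch that just evaluates GPUClock/3 + 14 for the two core clocks mentioned in this post. The class and method names are made up for illustration, and the formula itself is only the forum rule quoted above, not a verified tuning result.

```java
// Evaluates the rule of thumb quoted above: MemClock ~ GPUClock / 3 + 14 (MHz).
// Names are hypothetical; the formula is taken from the post as-is.
final class MemClockRule {
    static double sweetSpotMHz(double gpuClockMHz) {
        return gpuClockMHz / 3.0 + 14.0;
    }

    public static void main(String[] args) {
        for (int gpuClock : new int[] {850, 900}) {
            System.out.printf("GPUClock %d MHz: suggested MemClock about %.1f MHz%n",
                    gpuClock, sweetSpotMHz(gpuClock));
        }
    }
}
```

By this rule an 850 MHz core would pair with roughly 297 MHz memory and a 900 MHz core with 314 MHz, which is in the same range as the 300 and 315 MemClock settings reported above.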
|
|
|
|
jedi95
|
|
May 20, 2011, 08:07:09 AM |
|
I got the profiler working... that was a lot easier than I thought it would be. I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently. Anyway, I'm running the first batch of samples now. Are you going to be modifying the kernel much? I'm curious as to how phatk reduced the operation count by that amount...
I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.
The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer. Particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.
It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.
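For anyone following the ALU-op discussion and wondering where BFI_INT comes in: as it is commonly described for GPU miners, the instruction performs a per-bit select, which has the same shape as SHA-256's Ch function. The sketch below is plain Java, not kernel code. The class and method names are hypothetical, and the claim that one select instruction replaces the multi-op form is an assumption stated here, not something measured in this thread. It only checks that the select form matches the textbook definition.

```java
import java.util.Random;

// Illustrative only: compares two equivalent ways to write SHA-256's Ch function.
final class ChSelectSketch {
    // Ch as defined in FIPS 180-4: (x AND y) XOR (NOT x AND z).
    static int chReference(int x, int y, int z) {
        return (x & y) ^ (~x & z);
    }

    // Bit-select form: each result bit is y's bit where x is 1, otherwise z's bit.
    // Assumption: a hardware select instruction such as BFI_INT does this in one op.
    static int chSelect(int x, int y, int z) {
        return z ^ (x & (y ^ z));
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rng.nextInt(), y = rng.nextInt(), z = rng.nextInt();
            if (chReference(x, y, z) != chSelect(x, y, z)) {
                throw new AssertionError("forms differ");
            }
        }
        System.out.println("both forms agree");
    }
}
```

Fewer operations per Ch evaluation is one plausible reason a kernel using the select form would show a lower ALU op count in KernelAnalyzer, though the actual numbers depend on the compiler and SDK version.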
|
Phoenix Miner developer Donations appreciated at: 1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 11:28:11 AM |
|
I got the profiler working... that was a lot easier than I thought it would be. I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently. Anyway, I'm running the first batch of samples now. Are you going to be modifying the kernel much? I'm curious as to how phatk reduced the operation count by that amount...
I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.
The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer. Particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.
It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.
Well the big problem is on 2.4 phoenix-poclbm and phatk give near identical results... and both are still slower than real poclbm on both 2.1 and 2.4. And -v 18 and 19 give interesting results on 58xx on 2.4 which beats phatk's lackluster speed.
So... ymm so fucking v.
|
|
|
|
DustinEwan
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 20, 2011, 11:34:19 AM |
|
From my first run of profiling the miner, I saw that you were spending about 2% of CPU time just building strings (mainly StringBuilder copying char arrays internally). Using the + operator is inlined to StringBuilder, which can be pretty slow. I ran into this in my game engine here at work and had come across this post at StackOverflow from a guy that implements his own (albeit primitive) class for string concatenation.
I forgot to save the profile for that one (the profiler automatically overwrites the output file every time and I'm lazy), but I reduced the CPU time spent on String building from 2% to <= .01%.
It's not much, but hey, it was easy and I knew how to do it.
Anyway, here is the latest trace I ran (a lot is left out; this is just the top 90% of CPU time):
CPU TIME (ms) BEGIN (total = 2239712) Fri May 20 19:40:42 2011
rank   self  accum    count  trace method
   1 17.47% 17.47%       21 306858 java.lang.Object.wait
   2 17.46% 34.94%      828 306869 java.lang.ref.ReferenceQueue.remove
   3 16.52% 51.46%       16 319564 sun.net.www.http.KeepAliveCache.run
   4 15.74% 67.20%  7513448 319281 java.nio.DirectByteBuffer.getInt
   5  4.05% 71.25%      210 318093 java.net.SocketInputStream.read
   6  2.81% 74.05%    29347 319369 org.lwjgl.opencl.CL10.clEnqueueReadBuffer
   7  2.70% 76.75%  7513448 319278 java.nio.Buffer.checkIndex
   8  2.69% 79.44%  7513448 319279 java.nio.DirectByteBuffer.ix
   9  2.64% 82.08%  7513448 319280 java.nio.DirectByteBuffer.getInt
  10  1.80% 83.88%   675014 319312 org.lwjgl.opencl.CL10.clSetKernelArg
  11  1.36% 85.24%   675014 319313 org.lwjgl.opencl.InfoUtilFactory$CLKernelUtil.setArg
  12  1.01% 86.25%   675015 319298 java.lang.ThreadLocal.get
  13  1.00% 87.26%   675016 311203 java.lang.ThreadLocal$ThreadLocalMap.getEntry
  14  0.98% 88.24%   675015 319302 java.nio.DirectIntBufferU.put
  15  0.68% 88.92%    29348 319351 org.lwjgl.opencl.CL10.clEnqueueNDRangeKernel
  16  0.63% 89.55%   675015 319307 org.lwjgl.PointerWrapperAbstract.getPointer
  17  0.63% 90.18%   675012 319315 java.lang.ThreadLocal$ThreadLocalMap.access$000
  18  0.62% 90.80%   675015 319305 org.lwjgl.BufferChecks.checkBufferSize
Now I've started looking at some of the bigger stuff. The first 2 lines are from the garbage collector, so you can see that ~35% of the CPU time was spent on just garbage collecting, 17% of which was spent just blocking all the execution threads in order to do so. So I'm trying to figure out ways to improve that.
I don't really think that the netcode can be much faster, but another ~20% of CPU time is spent on that. So if the netcode can be improved, that will get us back into the kernel faster. The third line there is the thread that is used for keeping the HTTP 1.1 session alive. I don't know much about that, but maybe it's a lead.
Anyway, I'm done for now. Here is the new DiabloMiner.java with the new string builder.
Also:
So... ymm so fucking v.
I totally agree with that, but I love your code and bitcoin is fascinating. So digging through this code is a great joy for me! Great work so far man, and in Java too!
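To make the string-building point concrete, here is a small Java sketch contrasting `+` in a loop with one reused StringBuilder. It is not the code from the StackOverflow post or from DiabloMiner.java; the class and method names are hypothetical. It also assumes a compiler of that era, where `+` on strings compiles to a fresh StringBuilder per expression (newer JDKs may compile concatenation differently).

```java
// Illustrative comparison only; names are made up and this is not DiabloMiner code.
final class ConcatSketch {
    // '+' in a loop: every iteration allocates a new String
    // (and, on older compilers, a temporary StringBuilder as well).
    static String withPlus(String[] parts) {
        String s = "";
        for (String p : parts) {
            s = s + p;
        }
        return s;
    }

    // One builder reused across calls; setLength(0) clears it without reallocating.
    // Not thread-safe: each thread would need its own builder.
    private static final StringBuilder REUSED = new StringBuilder(256);

    static String withReusedBuilder(String[] parts) {
        REUSED.setLength(0);
        for (String p : parts) {
            REUSED.append(p);
        }
        return REUSED.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"mhash: ", "282", " Mh/s"};
        System.out.println(withPlus(parts).equals(withReusedBuilder(parts)));
    }
}
```

Both methods produce the same string; the reused builder simply avoids the repeated intermediate allocations, which is the kind of cost that shows up as char-array copying in a profile.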
|
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 11:41:33 AM |
|
I totally agree with that, but I love your code and bitcoin is fascinating. So digging through this code is a great joy for me! Great work so far man, and in Java too!
You need to get in the habit of using Github to push merge requests.
Also, the thread pool is very important because it can keep HTTP connections open between getwork/sendworks (cutting down on network round trips). Further, I spawn 3 threads per GPU to cut down on the actual hit of blocking due to HTTP, and then on top of that, LP cuts down on needing to keep fetching work every 5 seconds (with LP it only fetches as it returns asynchronously, or when nonce saturation occurs).
As for blocking on garbage collection, switching to Java 7 altogether would do a lot to improve that.
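The "3 threads per GPU" idea can be sketched roughly as below. This is not DiabloMiner's actual networking code: `fetchWork` is a hypothetical stand-in that sleeps instead of doing a real getwork HTTP request, and the sketch does not model keep-alive connections or long polling. It only shows how several blocking fetches per GPU can overlap in a fixed thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: names and the simulated delay are assumptions, not DiabloMiner internals.
final class WorkFetchSketch {
    static final int THREADS_PER_GPU = 3; // figure taken from the post above

    // Hypothetical stand-in for a blocking getwork round trip.
    static String fetchWork(int gpu, int slot) throws InterruptedException {
        Thread.sleep(50); // pretend network latency
        return "work-for-gpu" + gpu + "-slot" + slot;
    }

    // Runs THREADS_PER_GPU blocking fetches per GPU concurrently and collects the results in order.
    static List<String> fetchAll(int gpuCount) {
        ExecutorService pool = Executors.newFixedThreadPool(gpuCount * THREADS_PER_GPU);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int g = 0; g < gpuCount; g++) {
                for (int s = 0; s < THREADS_PER_GPU; s++) {
                    final int gpu = g, slot = s;
                    Callable<String> task = () -> fetchWork(gpu, slot);
                    pending.add(pool.submit(task));
                }
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that fetch completes
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(2));
    }
}
```

Because the fetches overlap, the wall-clock cost for a GPU is closer to one round trip than three, so a GPU is less likely to sit idle waiting on a single slow request.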
|
|
|
|
DiabloD3 (OP)
Legendary
Offline
Activity: 1162
Merit: 1000
DiabloMiner author
|
|
May 20, 2011, 11:57:46 AM |
|
BTW, I am not going to accept a patch containing a custom concat setup. This is not C.
|
|
|
|
MysteryMiner
Legendary
Offline
Activity: 1512
Merit: 1049
Death to enemies!
|
|
May 20, 2011, 12:25:04 PM |
|
I got the new version working! What I did:
1. I ran DiabloMiner-Windows.exe from the command prompt with all the arguments needed, such as -u and -p.
2. I need to manually specify the -v 2 argument to use vectors. Without vectors I get 248 Mh/s; with -v 2 I finally got 282 Mh/s instead of the former 260 Mh/s. BFI_INT is a huge improvement.
3. I created a .BAT file myself to run DiabloMiner-Windows.exe with all the necessary arguments.
The bat is probably running the old jar, which means, no, you're not running a new version of DiabloMiner.
No, I'm not so stupid. I have known how to use and edit bat files since MS-DOS 5.0 times. I check their contents before I run them.
And Thank You DiabloD3! If I ever find coins with Your miner, I will send you some of them!
|
bc1q59y5jp2rrwgxuekc8kjk6s8k2es73uawprre4j
|
|
|
OtaconEmmerich
|
|
May 20, 2011, 06:22:51 PM |
|
I got the profiler working... that was a lot easier than I thought it would be. I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently. Anyway, I'm running the first batch of samples now. Are you going to be modifying the kernel much? I'm curious as to how phatk reduced the operation count by that amount...
I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.
The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer. Particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.
It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.
Well the big problem is on 2.4 phoenix-poclbm and phatk give near identical results... and both are still slower than real poclbm on both 2.1 and 2.4. And -v 18 and 19 give interesting results on 58xx on 2.4 which beats phatk's lackluster speed.
So... ymm so fucking v.
I've yet to replicate the same results on my system. In fact, with 2.4, phatk has beaten your miner every time. Every time I've tried anything other than -v 2 I get slower speeds. This is on a Sapphire Extreme 5850 on Windows 7 x64.
|
|
|
|
toasty
Member
Offline
Activity: 90
Merit: 12
|
|
May 20, 2011, 06:25:03 PM |
|
If this is just totally unsupported, feel free to smack me. I'm running on a MacPro with both a 5870 and a 5770 in it, which seems perfectly okay doing normal OS things, including games.
If I try running DiabloMiner without any special flags, I get:
[5/20/11 1:17:33 PM] Added ATI Radeon HD 5870 (#1) (10 CU, local work size of 256)
[5/20/11 1:17:34 PM] Added ATI Radeon HD 5870 (#2) (20 CU, local work size of 256)
which doesn't seem right. I'm guessing the 5770 is #1.
With no special flags at all, I'm getting roughly 125M/sec. If I use -D 1 to make it only attach to the first card, it only drops to roughly 100M/sec which leads me to believe something very inefficient is going on.
I've tried various combos of -f, -v and -w and don't seem to be able to do anything but make it worse.
Is this configuration just not going to work at all? Is there any way I can force it to only use the 5870 instead?
|
|
|
|
|