Updated tree:
Binaries built from the generated kernel are now saved under unique filenames, and on the next kernel instantiation the cached binary is loaded if it matches. This speeds up start-up dramatically, and each kernel configuration gets its own unique binary.
|
|
|
Thanks very much to those donating; even tiny donations are most appreciated. About the libcurl static thing: I had read this elsewhere and have no real experience in the area, so perhaps only some aspects of networking cannot be compiled statically; what those are, I don't know. I'll check out the options parsing eventually, as I do see the problem you're reporting. There also appears to be a bug where it starts rejecting all blocks after an extended period, somewhere in the order of ~2500 accepted blocks. I'm trying to investigate why, but it's rather hard to reproduce on my slower equipment. More improvements to come...
|
|
|
Updated tree:
I put some effort into minimising the risk of rejects and CL errors, and into not missing shares that arrive close to each other. I did this by creating an array for the buffer variables passed to and from the GPU, making it extremely unlikely for a race to occur over the same slot in the array. When a match is flagged, I scan the entire array, but in a separate thread so that further work being passed to the GPU isn't delayed. This change should allow you to use higher intensity values without increasing the reject or error rate.
In the interim I discovered a nice bug: there was a chance the struct holding the thread id had its memory freed before an attempt was made to detach the thread with pthread_detach, which would lead to a segfault.
|
|
|
I tried with openssl-dev, nss-dev and also with curl from http://curl.haxx.se/download.html. Everything works when I build minerd dynamically, but the static build won't be so polite.

You can't build a truly static build of something that sends or receives network packets, sorry.
|
|
|
clGetDeviceInfo() tends to return BS sometimes. Anyway, I don't think using the maximum allowed worksize is optimal, as you are resource-constrained anyway. The bad thing is that it is hard to determine the optimum without experimenting with the workgroup size (starting from 32 on nvidia, all the way to 512, in multiples of 32).

No, of course not. I use max work size / vectors. That works surprisingly well as a default starting setting when none is chosen.

So is anyone actually finding this client useful? It's getting quite mature now, but apart from Burp's feedback I don't really get a sense that anyone's finding it useful. I find a huge improvement in throughput from it at intensity levels that don't affect my desktop.
|
|
|
Updated tree:
I've modified the log to only show the summary, and not the testing information, unless in debug mode. Counters are now stored to say which GPU or CPU found each share, and hardware (HW) errors are stored as well. The added information can be used to decide whether to turn down intensity or to overclock less.
The output looks like this now:
[2011-06-29 10:46:19] GPU: 0 Accepted: 100 Rejected: 4 HW errors: 0
[2011-06-29 10:46:24] [230.23 | 218.86 Mhash/s] [105 Accepted] [4 Rejected] [0 HW errors]
[2011-06-29 10:46:29] [227.39 | 218.88 Mhash/s] [105 Accepted] [4 Rejected] [0 HW errors]
[2011-06-29 10:46:34] [218.19 | 218.88 Mhash/s] [105 Accepted] [4 Rejected] [0 HW errors]
[2011-06-29 10:46:40] [239.39 | 218.94 Mhash/s] [105 Accepted] [4 Rejected] [0 HW errors]
[2011-06-29 10:46:45] [230.92 | 218.97 Mhash/s] [105 Accepted] [4 Rejected] [0 HW errors]
[2011-06-29 10:46:45] GPU: 0 Accepted: 101 Rejected: 4 HW errors: 0
Also I've updated the code to not allow automatically setting work sizes greater than 512 as a simple way of preventing the nvidia bug mentioned earlier.
EDIT: I've also made the first rate reported (the log-interval one) a decaying average so it doesn't jump around as much.
|
|
|
Thanks and thanks. I wondered why they returned 1024. Looks like more phayl from nvidia with OpenCL.
|
|
|
I thought you quit kernel hacking. I compiled some of your kernels a while ago on my desktop. Had no idea you were into bitcoin stuff and OpenCL. Nice.

Actually I'm very new to OpenCL and bitcoin. I just started a week ago and had to learn all about OpenCL; I've put in over a hundred hours on this code already to get up to speed. To see what I'm doing with the linux kernel, check out http://ck-hack.blogspot.com
|
|
|
Sorry for the rude OT question, but was that you that maintained the -ck tree?

Not rude at all. Yes, it is me, and I still do.
|
|
|
Updated tree:
I've imported the phatk kernel into minerd. The maximum possible throughput is slightly faster on machines that support amd media ops, which is nice. Even nicer, at sane intensity levels (including the default of 4) the throughput is now significantly faster as well. Unfortunately the phatk kernel doesn't work at all on hardware without amd media ops (radeon 4x cards and nvidia), so on those it defaults back to the poclbm kernel for now.
I've also updated the cpu mining component. It now tries to keep its work sizes within the log update interval instead of the scan interval, so the hash rate doesn't fluctuate all over the place. It is also possible now to set the number of gpu threads to 0 to run minerd as just a cpu miner again.
TODO:
- I want to find ways of allowing even larger settings for intensity that would only be suitable for headless boxes. Currently the code ends up racing too much (with all the parallel processing) and generates far too many rejected blocks when the intensity is set to >10. Making the cl code synchronous would avoid that, but it also slows it down, thereby making it pointless to push it further.
- Store binary versions of the kernels that could be loaded faster when restarting the app.
- Any bugfixing remaining.
- Profit.
|
|
|
Updated tree:
I did incorporate that change into my kernel. It turns out that even though my hardware reports 4 as the preferred vector width, it's faster with 2, and I assume many people have experienced the same. So I've made the default 2 whenever the hardware reports a preferred vector width larger than 1.
I found a little buglet that would also repeat some blocks, thereby artificially raising the hash rate, so the overall rate has dropped slightly (by about the same amount the other code increased it!).
As for the daily builds, I assume the requester meant windows builds? Most people who have linux will likely be able to build it themselves. It doesn't build on windows yet, but I hope it will in the near future. If you really do want linux binaries, just say the word.
The problem with repeated blocks was my pool not sending me longpoll information reliably.
|
|
|
@ckolivas, could you share daily builds of this miner?
Linux only at this stage, but sure, I could do that.
|
|
|
Maybe it's just my pool. They're having a funky time so that would explain it.
|
|
|
I don't doubt it, and no one else is reporting this issue. On the other machine I've tried it on, it does give a speed-up (with minerd), but this one 6770 I'm using reliably spits out tons of rejects when I make this change. It's not a heating issue; the card is at 64 degrees.
|
|
|
With 4 vectors, this change actually slows down the hash rate. With 2 vectors it speeds it up, but then I get runs of rejected shares. Not sure why but this is consistent now so I'm reluctant to include it at this stage.
Are you sure? I can keep trying it on and off to see, but every time so far it has happened. It could well be my pool, as they're experiencing technical difficulties, but the rejects have always coincided with enabling it.

[2011-06-27 20:22:46] [173.08 | 191.67 Mhash/s] [81 Accepted] [40 Rejected]

Look at that reject rate. Normally it's <5%.
|
|
|
Thanks for your code. It's nice to see people get rewarded for their efforts (though it doesn't always happen). Well, I've tried it in my own miner (minerd) and have some very strange findings: https://forum.bitcoin.org/index.php?topic=21275.0

First of all, minerd supports up to 4 vectors, and when I add this change to my kernel, it actually _slows down_ the 4 vector version. But when I override it to set 2 vectors, it speeds it up. However, once it's sped up, I then get runs of rejected shares. I tried it multiple times with and without, and it does appear to be just this change that causes it, so I'm not sure what's going on.
|
|
|
For some reason that's greatly increased my reject rate.

Actually that was sheer coincidence. I'll test this change some more, thanks!
|
|
|
For some reason that's greatly increased my reject rate.
|
|
|