For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset + global_worksize

You're doubling the global ID, not the worksize. The global ID without a global offset runs from 0 to (global worksize - 1); the worksize itself is queried in the kernel via get_global_size(0). So for 2-component vectors we need global worksize / 2 (as the argument in the enqueue kernel call), because 2 nonces are processed in each work-item. The global ID with a global offset runs from global offset to (global worksize + global offset - 1).

Dia

I still think it should be something like: u nonce = ((uint)get_global_id(0) + get_global_size(0) * (0,1,2,3)); for vectors4 etc. Anyway, I ended up trying it both ways, with your nonce code or mine, and neither led to any improvement (actually a detriment, if anything).
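The two indexing schemes being argued about can be checked with a quick host-side sketch in plain Python (standing in for the kernel arithmetic, with a made-up worksize). Both the interleaved form, nonce = (id << 1) + (0, 1), and the strided form, nonce = id + get_global_size(0) * (0, 1), cover the same range exactly once, provided the host halves the enqueued worksize:

```python
# Simulate nonce assignment for 2-component vectors.
# get_global_id(0) runs 0..worksize-1; the host already halved the worksize.
worksize = 8  # hypothetical enqueued global worksize for -v 2

# Scheme A (interleaved): nonce = (id << 1) + (0, 1)
scheme_a = sorted(n for i in range(worksize) for n in ((i << 1), (i << 1) + 1))

# Scheme B (strided): nonce = id + get_global_size(0) * (0, 1)
scheme_b = sorted(n for i in range(worksize) for n in (i, i + worksize))

# Both cover the same 2*worksize nonces exactly once, just in a different order.
print(scheme_a == scheme_b == list(range(2 * worksize)))  # → True
```

So with a zero offset the schemes only differ in memory-access pattern, which is consistent with neither being measurably faster.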
|
|
|
I thought the NDRangeKernel went like this ... Edit: Oh and Dia, in your CommandQueue try adding this property: "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710

clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu], CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);
cgminer has used this for a very long time.

As I wrote, I think OoE mode is not supported on AMD GPUs ... is there a debug or verbose message if that mode was successfully activated?

It is successfully activated on Windows and Linux, but OSX fails. It does not improve throughput with current GPUs, but it is harmless to enable for if/when they do.

I saw a significant increase in average nonces being found and a 3 MHash/s higher throughput. Modified from Dia's code, I used the following:

self.commandQueue = cl.CommandQueue(self.context, self.device, cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

Then I saw that the read buffer was set to this:

cl.enqueue_read_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=True)

Since OoE mode will NOT work if is_blocking=True, I set them all to False and re-enabled self.commandQueue.finish(). Similarly I changed the write buffer:

cl.enqueue_write_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=False)

On the output_buf I changed the mem flags to:

cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR | cl.mem_flags.ALLOC_HOST_PTR

For async to work, the 11.11 AMD drivers tell you to add an environment variable to your system:

GPU_ASYNC_MEM_COPY=2

Again, this might only be a 69xx feature, but for my 6970 I turned off BFI_INT and GOFFSET, increased my memory speed, and VECTORS8 was running at over 446 MHash/s. Now it'll find between 5-14 nonces per minute without choking up or freezing the system. Before, it was struggling to find 5 nonces per minute, if at all.
Next, I want to add the async functions:

event_t async_work_group_copy (__local T *dst, const __global T *src, size_t num_gentypes, event_t event)
event_t async_work_group_copy (__global T *dst, const __local T *src, size_t num_gentypes, event_t event)

One copies from global to local memory and the other from local back to global. Then create a prefetch for the global cache:

void prefetch (const __global T *p, size_t num_gentypes)

Again, you're not remotely talking about cgminer:

status = clEnqueueReadBuffer(clState->commandQueue, clState->outputBuffer, CL_FALSE, 0, BUFFERSIZE, thrdata->res, 0, NULL, NULL);
Please... seriously... I could take your advice if you were talking about how it relates here, but the stuff you're saying is not set, is not cgminer... edit: That's not to say you have nothing useful to add, but the signal-to-noise ratio gets low when you're talking about other code first and foremost.
|
|
|
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset + global_worksize

You're doubling the global ID, not the worksize. The global ID without a global offset runs from 0 to (global worksize - 1); the worksize itself is queried in the kernel via get_global_size(0). So for 2-component vectors we need global worksize / 2 (as the argument in the enqueue kernel call), because 2 nonces are processed in each work-item. The global ID with a global offset runs from global offset to (global worksize + global offset - 1).

Dia

I still think it should be something like: u nonce = ((uint)get_global_id(0) + get_global_size(0) * (0,1,2,3)); for vectors4 etc.
|
|
|
Having said all of that, it may just be the value I'm passing, since it expects an array and I'm passing a single value...
|
|
|
For -v 2 it should halve the global worksize (global Threads) number, so doubling this in the kernel should be no problem at all, right?
global_id == global_work_offset + global_worksize

You're doubling the global ID, not the worksize.
|
|
|
Hey Con,
The global_work_offset value you pass to clEnqueueNDRangeKernel is used as the starting value for get_global_id(0). If global_work_offset were 10 and we had 5 work-items, the nonces generated would be 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29, so we lose nonces 10 to 19 and use ones that should not yet be in range ... would that create duplicate work? You pass -D GOFFSET to the kernel, right?
I suspect the problem is to do with overflow on 32-bit unsigned integers. Imagine a value close to 2^32. Since the values are doubled, and what happens on overflowing 32 bits is undefined, you may well get repeated ranges of nonces checked.

That would prove true for a base around (2^31 - global worksize), because doubled it reaches 2^32. How big are the nonce bases, and what would be the global worksize for -I == 14 (as this is the maximum)?

Dia

global worksize = 2^(15 + intensity), so it's 2^29 max, and it's double that for 2 vectors and so on ... cgminer always tries to test the entire range of nonces up to 2^32, so it will *always* get to a value above 2^31 where it will wrap with a doubling of the global ID, regardless of what intensity it's at.
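The wraparound can be sketched with a host-side simulation (plain Python standing in for the kernel arithmetic; the mask models 32-bit unsigned overflow, and the base/worksize values are made up):

```python
MASK32 = 0xFFFFFFFF  # model 32-bit unsigned wraparound


def nonces_for_pass(base, worksize):
    """Nonces produced by nonce = (global_id << 1) + (0, 1),
    where global_id = base + i for i in 0..worksize-1 (2-vector kernel)."""
    out = set()
    for i in range(worksize):
        gid = (base + i) & MASK32
        out.add((gid << 1) & MASK32)
        out.add(((gid << 1) + 1) & MASK32)
    return out


worksize = 1 << 10
low_pass = nonces_for_pass(0, worksize)                # base well below 2^31: fine
high_pass = nonces_for_pass((1 << 31) + 5, worksize)   # base above 2^31: wraps

# Once the base passes 2^31, the doubled IDs wrap past 2^32 and
# revisit nonces that an earlier pass already checked.
print(bool(low_pass & high_pass))  # → True
```

This matches the observation above: since cgminer sweeps bases all the way to 2^32, any scheme that doubles the offset-inclusive global ID must eventually repeat ranges.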
|
|
|
Edit: Oh and I have to warn you: yesterday I tested a new AMD driver, which has a new OpenCL runtime, and I was unable to use CGMINER on Windows. So perhaps AMD "fixed" the binary generation, but this breaks your solution from a few days ago, because the old message that no .bin could be generated was back. I switched back to a former runtime :-(. Seems to suck!
I don't think this is the old bug, but a simple failure to produce a binary because the kernel failed to compile. Try running with -T --verbose and it will spew out any kernel compilation errors.
|
|
|
I thought the NDRangeKernel went like this ... Edit: Oh and Dia, in your CommandQueue try adding this property: "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710

clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu], CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);
cgminer has used this for a very long time.

As I wrote, I think OoE mode is not supported on AMD GPUs ... is there a debug or verbose message if that mode was successfully activated?

It is successfully activated on Windows and Linux, but OSX fails. It does not improve throughput with current GPUs, but it is harmless to enable for if/when they do.
|
|
|
Hey Con,
The global_work_offset value you pass to clEnqueueNDRangeKernel is used as the starting value for get_global_id(0). If global_work_offset were 10 and we had 5 work-items, the nonces generated would be 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29, so we lose nonces 10 to 19 and use ones that should not yet be in range ... would that create duplicate work? You pass -D GOFFSET to the kernel, right?
I suspect the problem is to do with overflow on 32-bit unsigned integers. Imagine a value close to 2^32. Since the values are doubled, and what happens on overflowing 32 bits is undefined, you may well get repeated ranges of nonces checked.
|
|
|
I thought the NDRangeKernel went like this ... Edit: Oh and Dia, in your CommandQueue try adding this property: "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710

clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu], CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);
cgminer has used this for a very long time.
|
|
|
Nonce of that code makes much sense, but I think Dia is suggesting that he wants to use the natively built-in global_work_offset parameter of clEnqueueNDRangeKernel instead of the one you reference.
Umm... I actually do understand the code; I was explaining what was wrong with it.
|
|
|
Con, maybe we can talk about integrating global offset parameter support into CGMINER? Take a short look at http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html and the global_work_offset parameter. All that has to be taken into consideration from the kernel-side is in DiaKGCN. OpenCL 1.1 detection is in your code, too, which is needed, but I can't do the other required changes without a compiler. In short, if GOFFSET is enabled, the nonce-base is not supplied via the base parameter but instead via the global_work_offset parameter, and it is used via the global work-item ID in the kernel. This saves a few instructions and can give us a small boost.

Dia

I tried writing code to send the nonce base as the global offset parameter, and your code returned duplicate work on 2 or more vectors. Looking at your code:

u nonce = ((uint)get_global_id(0) << 1) + (u)(0, 1);
This won't be right, as the global ID will now be the global thread ID plus the global offset parameter, so doubling it will give random results.
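Why doubling an offset-inclusive global ID misbehaves can be shown with a small Python sketch (host-side simulation with hypothetical numbers): if the host advances the offset by the 2*worksize nonces each pass consumes, the doubled offset makes the kernel skip half the nonce space entirely:

```python
def pass_nonces(offset, worksize):
    # kernel: nonce = (get_global_id(0) << 1) + (0, 1);
    # with GOFFSET the global ID already includes the offset
    return {(gid << 1) + v
            for gid in range(offset, offset + worksize)
            for v in (0, 1)}


worksize = 8
covered = set()
offset = 0
for _ in range(3):           # host advances by the 2*worksize nonces consumed
    covered |= pass_nonces(offset, worksize)
    offset += 2 * worksize

# 3 passes consume 48 nonces, but only half of the range 0..95 gets tested,
# because the offset is doubled along with the per-item ID.
missing = set(range(96)) - covered
print(len(covered), len(missing))  # → 48 48
```

Add the 32-bit wraparound on top of this and previously tested ranges get revisited as well, which is consistent with the duplicate work observed on 2 or more vectors.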
|
|
|
cgminer sets clocks all back to default on exit... if it exits cleanly, and of course on windows it's a miracle when it does.
Sadly this is not the case: Windows machine, pair of 6950s, set to 850/1300 for normal operation; in cgminer they are set to 700-880/300 ... when cgminer exits it leaves the cards at 880/300, even after updating to 12.1 drivers, with both the 2.4 and 2.6 SDK. Instead windows decides to blow your balls off and feed them back to you by collecting up the splatter, mincing it, putting it into a glass and forcing you to drink it through a straw via your left nostril. I'll try to code a workaround for this Windows fail next time, by stopping mining, resetting device values, and pausing for a bit before letting Windows crash cgminer when it tries to exit.
|
|
|
cgminer sets clocks all back to default on exit... if it exits cleanly, and of course on windows it's a miracle when it does.
Sadly this is not the case: Windows machine, pair of 6950s, set to 850/1300 for normal operation; in cgminer they are set to 700-880/300 ... when cgminer exits it leaves the cards at 880/300, even after updating to 12.1 drivers, with both the 2.4 and 2.6 SDK. Instead windows decides to blow your balls off and feed them back to you by collecting up the splatter, mincing it, putting it into a glass and forcing you to drink it through a straw via your left nostril.
|
|
|
Okay, so I tested the fastest DiabloMiner has to offer on the 7970, and current cgminer is 1.5 MHash faster with defaults, so I'm pleased. I guess I should keep working on my kernel.
|
|
|
Con, maybe we can talk about integrating global offset parameter support into CGMINER? Take a short look at http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html and the global_work_offset parameter. All that has to be taken into consideration from the kernel-side is in DiaKGCN. OpenCL 1.1 detection is in your code, too, which is needed, but I can't do the other required changes without a compiler. In short, if GOFFSET is enabled, the nonce-base is not supplied via the base parameter but instead via the global_work_offset parameter, and it is used via the global work-item ID in the kernel. This saves a few instructions and can give us a small boost.

Yes, I like that idea. But 8 and 16 vectors perform shithouse followed by appalling, so it's not worth pursuing those.
|
|
|
You are but 1 MHash off my current poclbm kernel with that: DiaKGCN 716.5 versus ck-poclbm 717.5.

Sounds not too bad. I will try -w 128 and compare results on my machine ... will report back which one is faster for me at default clocks and on Win7 x64. I'm happy now!

There is something unusual about it: running diakgcn, the hashrate appears to be more unstable, rising and falling more, so it takes a while to get a reasonable grasp of what the hashrate really is. Since you're running 2 vectors, it effectively makes the hashrate update half as often as running my poclbm kernel, since I use no vectors. Either way, the hashrates are really close.
|
|
|
With Diapolo's help we finally got the diakgcn kernel working on cgminer. I've just committed code to the git tree which makes it work. Alas, at the same engine and clock speeds on the 7970, diakgcn gives me 699 MHash while my customised kernel gives me 717 MHash. But now that it's working, he may be able to tweak it further...
Did you use -v 2 with DiaKGCN for your test? I'm not sure why, but it has always been faster on Phoenix to use it with -v 2.

Dia

Tried it, much slower.

Did you use the code from the latest commits? Here on Windows it's definitely faster with -v 2:

-I 9 -k diakgcn -d 0 -v 1 -w 256: ~524 MH/s
-I 9 -k diakgcn -d 0 -v 2 -w 256: ~539 MH/s

Indeed it is faster now, and with -v 2 it is fastest when I allow cgminer to choose the worksize, which ends up being 128 - it queries the "preferred worksize" and divides that by the number of vectors. You are but 1 MHash off my current poclbm kernel with that: DiaKGCN 716.5 versus ck-poclbm 717.5.
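The worksize heuristic described above amounts to the following sketch (hypothetical values; on the device side this would come from querying the kernel's preferred work-group size multiple via clGetKernelWorkGroupInfo):

```python
def choose_worksize(preferred_multiple, vectors):
    # cgminer-style heuristic: divide the device's preferred work-group
    # size multiple by the number of vectors each work-item processes
    return max(1, preferred_multiple // vectors)

print(choose_worksize(256, 2))  # → 128, matching the -v 2 result above
```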
|
|
|
Thanks for your help with integrating DiaKGCN Con!
And thank you for the code. Now, what advantage is there to reordering the variables passed to the kernel? In commit d86a38d1e75090e1ffb9df9e68aa13b1c8dcf9ec you shuffled arguments in a way which appears to be mostly cosmetic. Would that be right?

That's true for the most part; reordering kernel-arguments to be in line with their usage in the code is cosmetic, but I think it's part of clean code, don't you think so? So you created a new DiaKGCN branch, will this last, so I can switch my local repo to that one?

Dia

Yes, please do, and if things go well, we can have a new default kernel for GCN.
|
|
|
Thanks for your help with integrating DiaKGCN Con!
And thank you for the code. Now, what advantage is there to reordering the variables passed to the kernel? In commit d86a38d1e75090e1ffb9df9e68aa13b1c8dcf9ec you shuffled arguments in a way which appears to be mostly cosmetic. Would that be right?
|
|
|