May I chime in?
The way I see it, X(n) algos are much less suitable for GPU mining than Scrypt(n), because the composition of kernels produces large code which the GPU cannot execute efficiently, mostly due to limited cache size and cache algorithms.
At the risk of being called pedantic, I have to point out there is specific terminology for those: instruction cache (often "I$") and data cache.
Then the "blake" hash is executed over that data in all threads. I noticed you use an xIntensity of 64, so it would execute 64*2048 "blake" global threads on, say, an R9 280X card.
The correct terminology is "work item". Nothing in any GPU architecture has ever looked like a CPU thread; that's just an oversimplification for marketing. Also note that "core" isn't the same thing either.
The results of all these threads are stored into global memory. Since all instructions are executed more or less in lock-step (lock-step within a compute unit and possibly out of sync between compute units)
This is incorrect. Execution proceeds in lock-step fashion within a wavefront (which is the GPU equivalent of an N-way SIMD thread; for GCN, that's sort of like AVX-2048 with no shuffling). Different wavefronts are scheduled independently.
Then the GPU pauses a bit and waits for sgminer to enqueue another kernel, which in our case is "bmw". Global memory contains the "blake"-hashed block data.
Run a program called CodeXL. You will see most NDRange calls are fully dispatched even before the kernel starts executing (at least, that's what happens for me).
2. you can schedule all of them for execution in parallel (well, as you currently do in the "opencl_scanhash" function, except that clCreateCommandQueue should specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)
3. kernels should be modified to wait for an event to trigger them. This event would be a signal from the previous kernel that it has finished the work. I'm not sure if this can be done via cl_events, as they seem to work on the entire queue rather than on an individual thread.
Please, not this out-of-order queue nonsense again! The algo is sequential, so it will need a sequential queue, and you have also missed the whole point, so I guess I'll make this clear for all the people out there who believe GPUs exist for hashing: out-of-order queues for sequential algos are useless and possibly make things worse! In particular, it does not make any sense to ask for an out-of-order queue and then:
- Flush/Finish (current legacy miner approach)
- produce events to force sequentiality.
There are 20 years of studies on GPU architectures available, though I suggest dropping the legacy material and starting from D3D10, which is the first API that took the thing seriously... even if the results weren't great.
Anyway, I realize I need to spend some time with CodeXL for a while to gain some insight... what beats me is that all the hash algos in X11 are designed to have efficient implementations in hardware. So they should be small in code and consume little memory. This kind of thing should be possible to implement directly in thread registers, or in CL terms, "private memory".
They are designed to be efficient in ASIC hardware, or FPGAs at most. The problems here are:
- massive I$ overload, because the AMD compiler is too dumb not to unroll stuff (as a side note: HLSL/GLSL compilers usually unroll much more smartly; I currently suspect the HLSL compiler might be building a whole tree of possibilities).
- registers must be shuffled across work items, so most values cannot really live in private memory, which brings us to the magic world of LDS layout.
- register pressure: how soon you need the result. To my own surprise, it seems GCN 1.0 and 1.1 still cannot dispatch dependent instructions back to back.