I'm not much of an ATI/OpenCL programmer, but I've done quite a bit of CUDA programming (see my CUDA fractal viewer), and there are a lot of similarities in the way these chips store and access memory.
I'm not too familiar with the L1-L3 caches on GPUs -- they tend to operate with some kind of global memory (the number you see on the graphics card box), plus various banks of shared memory and graphics caches. On most CUDA hardware there is 16-48 kB of shared memory split between roughly 256 threads. It's fast as hell, but that's not a lot of memory. Since many programs need more shared memory than that, they have to fall back on global memory anyway -- and they still do just fine, getting 5-100x speedups over CPUs. On the other hand, forcing the GPU to keep its data in host RAM would be disastrous: it takes something like 1000x longer to access host RAM than its own global memory (don't quote me on that number).
Therefore, in order to disarm GPUs of their dominance, I believe you really need to force them to exceed their global memory capacity. Here's what you'll find in a high-end machine with a good CPU and a solid GPU (take the Radeon 6970, for example):
CPU: 4 cores, 8 GB RAM (2 GB/core)
GPU: 1600 cores, 2 GB RAM (1-2 MB/core)
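
To put the per-core disparity in concrete numbers, here's a quick back-of-the-envelope sketch in Python (the core counts and RAM sizes are just the ballpark figures above, not exact specs for any particular card):

```python
# Ballpark figures from the comparison above, not exact specs for any card.
cpu_cores, cpu_ram_mb = 4, 8 * 1024
gpu_cores, gpu_ram_mb = 1600, 2 * 1024

print(f"CPU: {cpu_ram_mb / cpu_cores:.0f} MB per core")  # ~2048 MB/core
print(f"GPU: {gpu_ram_mb / gpu_cores:.2f} MB per core")  # ~1.28 MB/core
```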
The issue with hashing is that it requires only a couple of kB of RAM, so there's nothing stopping the GPU from hitting full occupancy (maximum parallelization). However, if the hashing were replaced by an operation that required 100 MB of RAM per thread, the GPU would only be able to use 20 of its 1600 cores, and it'd probably be no better than the CPU. There are ways to do this: you could replace sha256(sha256(a)) with sequentially hashing the string 3,000,000 times, concatenating all the hashes into a single string (requiring roughly 100 MB to store), then executing some operation on that string that requires the thread to hold all of it in memory at once (hashing the concatenation can be done in pieces, but there are other operations that would require the whole thing). A minimal sketch of what I mean follows.
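
Here's that sketch in Python. The iterated hashing and the ~100 MB concatenation follow the description above; the data-dependent final pass is just one illustrative way to force the whole buffer to stay resident (it's not the only option, and the constants are arbitrary):

```python
import hashlib

def memory_hard_hash(a: bytes, n: int = 3_000_000) -> bytes:
    """Illustrative memory-hard replacement for sha256(sha256(a)).

    Hash sequentially n times, keeping every 32-byte intermediate digest
    (n * 32 bytes, roughly 100 MB for n = 3,000,000), then walk the
    concatenation in a data-dependent order so the whole buffer has to
    stay in memory at once.  The final walk is just one way to do it;
    plain hashing of the concatenation could be streamed in pieces.
    """
    digests = []
    h = a
    for _ in range(n):
        h = hashlib.sha256(h).digest()
        digests.append(h)            # keep every 32-byte digest

    buf = b"".join(digests)          # ~100 MB that must be resident

    # Data-dependent accesses: each index depends on the running hash,
    # so the reads can't be predicted or processed a piece at a time.
    acc = hashlib.sha256(buf[:32])
    for _ in range(n // 100):
        i = int.from_bytes(acc.digest()[:8], "big") % n
        acc.update(buf[32 * i : 32 * (i + 1)])
    return acc.digest()
```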
Each GPU thread would have to use global memory, and only 20 of them would fit. You might as well use your CPU, whose four cores are probably faster than the GPU's 20. 100 MB is a good target, because a lot of computers can spare 400 MB of RAM without being crippled.
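
And the occupancy math, using the same ballpark figures as before:

```python
# How many 100 MB threads fit on each device?
mem_per_thread_mb = 100

gpu_threads = (2 * 1024) // mem_per_thread_mb        # 20 threads in 2 GB
cpu_threads = 4                                      # one per core
cpu_ram_needed_mb = cpu_threads * mem_per_thread_mb  # 400 MB out of 8 GB

print(gpu_threads, cpu_ram_needed_mb)                # 20 400
```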