Okay so I've had an extensive discussion with mtrlt about the code and I've done a lot of debugging and I've learnt a lot.
First of all, the values you can safely plug into Linux are NOT compatible with what you can plug into Windows. The limits on allocatable memory differ depending on the driver/OS combination. Therefore you cannot compare results from the two.
Second, there IS a MEANINGFUL upper limit to aggression or, in this case, intensity. It is the highest value where 2 to the power of the intensity does not exceed the concurrent threads.
Eg concurrent threads of 8192 has an upper limit of 13 intensity because 2^13 is 8192. You CAN go over this value, but you are absolutely guaranteed to start producing invalid results. How many invalid results you get for the potential rise in hashrate is highly hardware dependent.
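To make that rule concrete, here is a rough sketch (my own helper for illustration, not anything in cgminer) that works out the highest intensity whose power of 2 still fits within a given thread concurrency:

```python
def max_intensity(thread_concurrency: int) -> int:
    """Largest intensity i such that 2**i <= thread_concurrency.

    Hypothetical helper for illustration only; not part of cgminer.
    """
    if thread_concurrency < 1:
        raise ValueError("thread concurrency must be positive")
    # bit_length() - 1 is floor(log2(n)) for positive integers.
    return thread_concurrency.bit_length() - 1


print(max_intensity(8192))  # 13, because 2**13 == 8192
```

Anything above the returned value is the territory where invalid results start appearing.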
The previous release code did no boundary checking or any testing of the device. I have now updated the git tree to test just how much memory it can allocate, and it will now AUTOMATICALLY TUNE to the maximum values that are likely to work. I suggest you start it in debug mode with -D to see what it reports as the thread concurrency, and then find the largest multiple of the number of shaders in the device that does not exceed it. Eg a 6950 has 134217728 bytes max memory, which works out to a concurrency of 2048, but it only has 1408 shaders, so setting thread_concurrency to 1408 will likely make it faster.
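The "largest multiple of shaders" step is just integer division. A quick sketch, with a function name I made up for illustration:

```python
def largest_shader_multiple(shaders: int, reported_concurrency: int) -> int:
    """Largest multiple of `shaders` that does not exceed the concurrency
    cgminer reported. Returns 0 if even one multiple does not fit.

    Hypothetical helper for illustration only; not part of cgminer.
    """
    if shaders < 1:
        raise ValueError("shader count must be positive")
    return (reported_concurrency // shaders) * shaders


print(largest_shader_multiple(1408, 2048))  # 1408 (the 6950 example)
print(largest_shader_multiple(800, 2048))   # 1600 (a card with 800 shaders)
```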
Changing lookup_gap has 2 effects. The larger it is, the higher you can go with thread_concurrency. However, speed also depends on architecture design, and virtually all GPUs are fastest at a gap of 2. If you choose a custom gap without choosing a thread concurrency, cgminer will choose the concurrency for you. If you don't choose a gap, it will select 2 for you.
About GPU threads: You should run as many as you can start without cgminer crashing or failing. The number does NOT correlate with shaders, compute units, RAM or anything else as a meaningful multiple.
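Here is a back-of-envelope sketch of why a larger gap allows a higher concurrency. It ASSUMES each thread needs the standard scrypt scratchpad of 128 KiB (N=1024, r=1) divided by the lookup gap. That assumption reproduces the 134217728 -> 2048 figure at a gap of 2, but the exact arithmetic cgminer uses internally may differ, so treat it as an estimate only:

```python
# Standard scrypt scratchpad for N=1024, r=1 is 128 * N * r bytes = 128 KiB.
SCRATCHPAD_BYTES = 128 * 1024


def estimate_concurrency(max_alloc_bytes: int, lookup_gap: int = 2) -> int:
    """Estimate how many threads fit in the allocatable memory.

    Assumption: per-thread memory is the scratchpad divided by the gap.
    Hypothetical helper for illustration only; not cgminer's actual code.
    """
    if lookup_gap < 1:
        raise ValueError("lookup gap must be at least 1")
    per_thread = SCRATCHPAD_BYTES // lookup_gap
    return max_alloc_bytes // per_thread


print(estimate_concurrency(134217728))     # 2048 at the default gap of 2
print(estimate_concurrency(134217728, 4))  # 4096 at a gap of 4
```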
Now finally, and you can believe me or not on this, but reaper sends work to the GPU WITHOUT CHECKING if it was accepted, and gets the return buffer WITHOUT CHECKING if it actually did any work, and then adds the number of hashes it would have expected the GPU to do with the work sent to it. This means that when you start with lots of threads, some of them may not even be doing anything. Or if you've set some borderline invalid values, it will appear to be working fine and report back a big hashrate, but generate fewer valid shares. So I implore you to check the share generation rate and pretty much ignore the reported hashrate when comparing notes. Remember that cgminer AND reaper use virtually identical kernels, so they should hash at virtually identical rates.
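If you want a number to compare notes with, something as simple as accepted shares per minute will do, provided both runs are against the same pool at the same share difficulty and run long enough to even out luck. A minimal sketch with made-up figures (the function name and the sample numbers are mine, purely for illustration):

```python
def accepted_shares_per_minute(accepted: int, elapsed_seconds: float) -> float:
    """Accepted (valid) shares per minute over a run.

    Hypothetical helper for illustration only; not part of any miner.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return accepted * 60.0 / elapsed_seconds


# Made-up example figures: two 10 minute runs with different settings.
run_a = accepted_shares_per_minute(120, 600)
run_b = accepted_shares_per_minute(90, 600)
print(run_a, run_b)  # 12.0 9.0
```

Whichever settings give the higher figure here are the better ones, regardless of what hashrate either miner displays.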
Summary: Start cgminer without setting worksize, vectors, lookup gap or thread concurrency, but in debug mode with -D -T (I made this example output up, not sure what it really looks like):
[2012-07-23 21:07:18] GPU 0: selecting lookup gap of 2
[2012-07-23 21:07:18] GPU 0: selecting thread concurrency of 2048
Then if you're on a 5770, you can google that it has 800 processing elements, so pick the highest multiple of that which stays under the thread concurrency, so 1600. The nearest power of 2 is 2048, which is 2^11, so an intensity of 11.
Give that a go and let's see what happens. I expect different results on Windows and Linux. Use this table as a guide for which multiples to use for thread concurrency.
GPU Processing Elements