That'd be depressing if all I can get is another 2%.
Yeah, it kinda takes the fun out of optimizing if all you can get is 2%.
Any idea if there's a GCN feature we can exploit for more performance, like fixed-function hardware or new instructions/AMD-specific extensions? I initially thought that using size-16 vectors would help, thinking the GCN SIMD units were akin to x64 SIMD registers, but it seems the card exploits its SIMDs by running the same scalar instruction on 16 threads at once.
I'm trying to get the damned Kernel Analyzer to work in either Windows in a VM or in Wine. Both are a no-go, since it has to have a real running copy of the drivers, and the native Linux version locks up as soon as DM calls a CL function.
Don't even bother until they release a new version, because Kernel Analyzer won't even correctly list the kernel stats for Tahiti. Running the profiler via the command line does give some useful output, however, and I have .il, .cl, .isa and a comma-separated value file of a profile run with -v 1 from my 7970 if you're interested.