Dual GPU mining problems (very slow) [SOLVED]

This had me pulling my hair out for the better part of six hours. Hopefully I can help someone else avoid repeating my experience.

The Actors
* Gentoo Linux
* Dual Radeon 5870s (OpenCL devices 0 and 1)
* ATI Catalyst 11.2
* No Crossfire (no cable AND disabled in aticonfig)
* poclbm miner (latest from git)

The Scene
A very frustrated admin was trying to figure out why launching one instance of poclbm on device 0 would generate 280 Mh/s, but launching a second instance of poclbm on device 1 would instantly cause output on device 0 to drop to between 100 Mh/s and 200 Mh/s. Furthermore, device 1 itself would never run any faster than 100 Mh/s to 200 Mh/s. Changing a multitude of different parameters (getwork rate, work group size, frame rate, display resolution, unplugging monitors, plugging monitors back in, enabling Xinerama, disabling Xinerama) failed to get both cards running at full speed at the same time. Each card ran just fine on its own, but not in parallel with the other.

Now, I know what you're all thinking, so I'll state one more time: YES, the admin was 100% certain that poclbm was being launched on DISTINCT devices and that Crossfire was fully disabled. Repeated polling of "aticonfig --adapter=all --od-getclocks" showed that both GPUs were being utilized, but that utilization tended to be around 60% on one device and 40% on the other. Under certain configurations (by launching up to 10 poclbm processes on EACH device), utilization of one device could reach as high as 99%, while utilization of the second device fluctuated anywhere between 35% and 70%.
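For anyone who wants to watch the same numbers without eyeballing the terminal, here is a rough Python sketch that pulls the per-adapter load out of the aticonfig output. The sample text and the "GPU load" line format are written from memory and should be treated as an assumption, since the layout may differ between Catalyst versions; parse_gpu_load and poll_once are just names made up for this example.

```python
import re
import subprocess

# ASSUMPTION: approximate layout of `aticonfig --adapter=all --od-getclocks`
# output. The numbers are made up to mirror the 60% / 40% split described above.
SAMPLE_OUTPUT = """
Adapter 0 - ATI Radeon HD 5800 Series
                            Core (MHz)    Memory (MHz)
           Current Clocks :    850           1200
             Current Peak :    850           1200
  Configurable Peak Range : [600-900]     [900-1300]
                 GPU load :    60%

Adapter 1 - ATI Radeon HD 5800 Series
                            Core (MHz)    Memory (MHz)
           Current Clocks :    850           1200
             Current Peak :    850           1200
  Configurable Peak Range : [600-900]     [900-1300]
                 GPU load :    40%
"""


def parse_gpu_load(text):
    """Return {adapter_index: load_percent} parsed from aticonfig-style text."""
    loads = {}
    adapter = None
    for line in text.splitlines():
        # A line such as "Adapter 0 - ..." starts a new adapter section.
        header = re.match(r"\s*Adapter\s+(\d+)\b", line)
        if header:
            adapter = int(header.group(1))
            continue
        # A line such as "GPU load :    60%" belongs to the current adapter.
        load = re.search(r"GPU load\s*:\s*(\d+)\s*%", line)
        if load and adapter is not None:
            loads[adapter] = int(load.group(1))
    return loads


def poll_once():
    """Run aticonfig once and return the parsed loads (needs the Catalyst driver)."""
    result = subprocess.run(
        ["aticonfig", "--adapter=all", "--od-getclocks"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_load(result.stdout)


if __name__ == "__main__":
    # Parse the canned sample so the sketch runs even without a GPU present.
    print(parse_gpu_load(SAMPLE_OUTPUT))
```

Calling poll_once() in a loop with a short sleep gives the same "repeated polling" view. If the two loads keep adding up to roughly 100%, that is a hint the cards are taking turns rather than running in parallel.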

The Solution
There appears to be a bug in ATI Catalyst 11.2 on Gentoo (possibly on other platforms) that causes execution of an OpenCL kernel to block while another kernel is running, EVEN IF THAT OTHER KERNEL IS RUNNING ON A DIFFERENT DEVICE. Execution would proceed on device 0 for a while, during which time device 1 sat blocked. Then the kernel on device 1 would launch and device 0 would block. And back and forth. Since only one card was doing work at any given moment, this explains why the combined hash rate of both devices tended to add up to about 280 Mh/s, the rate of a single card.
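The arithmetic behind that can be sketched in a few lines of Python. This is an idealized model, not a measurement: it assumes each card would do 280 Mh/s on its own (the figure seen on device 0), that exactly one card runs at any instant, and that the time split matches the roughly 60% / 40% utilization reported by aticonfig. The function name alternating_rates is made up for this example.

```python
def alternating_rates(full_rate, share_dev0):
    """Per-device rates when two devices take turns instead of running in parallel.

    full_rate  -- rate of one card running alone, in Mh/s (assumed equal for both)
    share_dev0 -- fraction of wall-clock time device 0 gets to run (0.0 to 1.0)
    """
    if not 0.0 <= share_dev0 <= 1.0:
        raise ValueError("share_dev0 must be between 0.0 and 1.0")
    rate_dev0 = full_rate * share_dev0
    # Device 1 only runs while device 0 is blocked, so it gets the remainder.
    rate_dev1 = full_rate * (1.0 - share_dev0)
    return rate_dev0, rate_dev1


if __name__ == "__main__":
    dev0, dev1 = alternating_rates(280.0, 0.6)
    print("device 0: %.0f Mh/s" % dev0)
    print("device 1: %.0f Mh/s" % dev1)
    print("combined: %.0f Mh/s" % (dev0 + dev1))
```

With a 60/40 split the model gives about 168 Mh/s and 112 Mh/s, both inside the 100 to 200 Mh/s range seen on each card, and the sum stays pinned at 280 Mh/s no matter how the time is divided.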

Upgrading to Catalyst 11.3 fixed the problem.

Closing thoughts
While the admin is 100% certain he installed ATI Catalyst 11.2 via Gentoo's Portage package manager, ATI Control Center reported its installed version as 10.12. Perhaps somebody f*ed up the Gentoo installer. Or perhaps someone at ATI simply forgot to increment the version number in ATI Control Center when 11.2 was released.