I guess after reading more about the subject what I want to do is avoid compiling a .cl file at runtime and write the kernel code manually. Otherwise, what assurances do I have that the .cl file has been compiled *and optimized* by the compiler?
Yes, if you're up for the challenge, that is likely to gain you far more performance than writing the supporting code in C or ASM. Be aware that you need a very intimate relationship with the GPU, inside and out, to extract every bit of performance. You will have to check your local laws to determine whether such relationships are even legal in your district.
Note that your optimizations will be specific to a certain GPU (1), so if you write for a 5850, your code won't get the same efficiency on a 6950.
(1) Unless polygamy is allowed in your country.