Is using the CGBN number library the right way to build a new secp256k1 library on CUDA?
No. We should not base a CUDA secp256k1 implementation on an arbitrary-precision library like GMP or the one you just mentioned, because secp256k1 works with fixed 256-bit values and we don't need the extra precision or the variable bit length.
Dealing with arbitrary-precision numbers deprives us of the opportunity to optimize the code for a specific bit length. Instead of four statements operating on 64-bit words, you now need a loop over a variable number of words, which doesn't optimize well on CUDA, especially when each iteration depends on the result of the previous one, as with carry propagation (the loop cannot be fully unrolled when its length is not known at compile time).
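To make the "four statements versus a loop" point concrete, here is a minimal sketch of a 256-bit addition written both ways. The names `u256`, `addc`, `u256_add` and `bigint_add` are made up for illustration and do not come from any existing library. The `HD` macro only lets the same source compile as plain C++ or as CUDA. A tuned GPU version would probably use the PTX carry instructions (`add.cc` / `addc`) instead of the comparison trick shown here.

```cpp
#include <cstdint>

// Compile as CUDA (nvcc) or as plain C++ for testing on the host.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical fixed-width type: four 64-bit limbs, least significant first.
struct u256 { uint64_t w[4]; };

// Add two 64-bit words plus an incoming carry (0 or 1); updates carry.
HD inline uint64_t addc(uint64_t a, uint64_t b, uint64_t &carry) {
    uint64_t s  = a + b;
    uint64_t c1 = s < a;        // overflow from a + b
    uint64_t t  = s + carry;
    uint64_t c2 = t < s;        // overflow from adding the carry
    carry = c1 | c2;            // at most one of the two can be set
    return t;
}

// Fixed 256-bit add: four straight-line statements, no loop.
// Returns the carry out of the top limb.
HD inline uint64_t u256_add(u256 &r, const u256 &a, const u256 &b) {
    uint64_t carry = 0;
    r.w[0] = addc(a.w[0], b.w[0], carry);
    r.w[1] = addc(a.w[1], b.w[1], carry);
    r.w[2] = addc(a.w[2], b.w[2], carry);
    r.w[3] = addc(a.w[3], b.w[3], carry);
    return carry;
}

// Arbitrary-length add for contrast: n is a runtime value, so the
// compiler cannot fully unroll the loop, and each iteration waits on
// the carry produced by the previous one.
HD inline uint64_t bigint_add(uint64_t *r, const uint64_t *a,
                              const uint64_t *b, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; ++i)
        r[i] = addc(a[i], b[i], carry);
    return carry;
}
```

Both versions compute the same result for n = 4. The difference is that the fixed-width one gives the compiler a known, branch-free instruction sequence that can stay entirely in registers.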
I strongly believe that for maximum performance the ideal library would be written from scratch. CUDA doesn't play well with C++ classes, so the library would just be a set of free (non-member) functions plus a plain type such as secp256k1_t or something.
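As a rough idea of what "a plain type plus free functions" could look like, here is a small sketch. Only the type name `secp256k1_t` comes from the text above. The limb layout and the function names (`secp256k1_p`, `secp256k1_is_zero`, `secp256k1_cmp`, `secp256k1_is_field_element`) are assumptions made for illustration. The prime used is the secp256k1 field prime p = 2^256 - 2^32 - 977.

```cpp
#include <cstdint>

// Compile as CUDA (nvcc) or as plain C++ for testing on the host.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Plain aggregate, no constructors or virtual functions.
// Assumed layout: four 64-bit limbs, least significant first.
struct secp256k1_t { uint64_t w[4]; };

// The field prime p = 2^256 - 2^32 - 977, returned by value so the
// same code works on host and device.
HD inline secp256k1_t secp256k1_p() {
    secp256k1_t p = {{0xFFFFFFFEFFFFFC2FULL, 0xFFFFFFFFFFFFFFFFULL,
                      0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL}};
    return p;
}

// True if all four limbs are zero.
HD inline bool secp256k1_is_zero(const secp256k1_t &a) {
    return (a.w[0] | a.w[1] | a.w[2] | a.w[3]) == 0;
}

// Three-way compare, most significant limb first: -1, 0 or 1.
// Note: this early-exit version is not constant time.
HD inline int secp256k1_cmp(const secp256k1_t &a, const secp256k1_t &b) {
    if (a.w[3] != b.w[3]) return a.w[3] < b.w[3] ? -1 : 1;
    if (a.w[2] != b.w[2]) return a.w[2] < b.w[2] ? -1 : 1;
    if (a.w[1] != b.w[1]) return a.w[1] < b.w[1] ? -1 : 1;
    if (a.w[0] != b.w[0]) return a.w[0] < b.w[0] ? -1 : 1;
    return 0;
}

// True if a is a reduced field element, i.e. a < p.
HD inline bool secp256k1_is_field_element(const secp256k1_t &a) {
    return secp256k1_cmp(a, secp256k1_p()) < 0;
}
```

Everything here is a free function taking the struct by reference, so a kernel can call these directly on values held in registers or read from a global array.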