Use the GMP library, which provides arbitrary-precision integers. It's much easier than trying to compute a modulus across multiple uint64_t parts by hand, although you lose out on performance if your end goal is to run it on CUDA, since GMP runs on the CPU.
You can even import these numbers directly from hexadecimal strings, or from any other base between 2 and 62.
I'm going to assume your numbers are big-endian (most significant digit first), because manually flipping the bytes is too tedious for me.
// Requires #include <gmp.h>; link with -lgmp.
const char* a = "59f2815b16f81798029bfcdb2dce28d955a06295ce870b0779be667ef9dcbbac";
const char* b = "483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8";
char result[256];
mpz_t ma, mb, mresult;
mpz_init(ma);
mpz_init(mb);
mpz_init(mresult);
mpz_set_str(ma, a, 16);
mpz_set_str(mb, b, 16);
mpz_fdiv_r(mresult, ma, mb); // <--- modulus: mresult = ma mod mb (mb must be nonzero)
mpz_get_str(result, 16, mresult); // writes a lowercase hex string without a prefix into result
// ...
mpz_clear(ma);
mpz_clear(mb);
mpz_clear(mresult);
EDIT: added the missing base argument to the mpz_set_str calls.