Alright.
There's one other thing to address: in the secp256k1_fe_mul (or something like that) function, the all but the last leg are multiplied by the constant R. This causes a result different from when I calculated an example in Python. So inside the fe_mul function, I need to modify it to avoid multiplying the values in the result (stack) by R, and send that multiplication to a temporary instead.
That makes no sense; fe_mul just multiplies two field elements modulo p. That R constant is an implementation detail, that even differs between 32-bit and 64-bit platforms. It's not actually multiplying the result by this value.
Note that field elements are internally stored in a denormalized representation where the limbs can overflow. If you want to convert it to a portable format, use fe_get_b32.