Print Page - Need help understanding this modular inverse implementation

Title: Need help understanding this modular inverse implementation
Post by: NotATether on June 04, 2024, 01:28:24 PM

Taken straight from VanitySearch

https://github.com/XopMC/CudaBrainSecp/blob/main/GPU/GPUMath.h#L548-L659 (different project but identical source file)

_ModInv is the modular inverse funciton that implements the extended euclidean algorithm, which is described at Wikipedia here:

https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm

It's using signed 320-bit integers, probably to catch overflow and underflow, which it deals with later. That's not the issue here.

There is the Euclidean (proper) division, which I already understand. And then there's this talk about the Bezout coefficients which I also understand. My issue here is understanding how this particular implementation implements this this using 63-bit bit shifts.

U and V (the two inputs) are loaded with the values of P and R respectively. (because R * (the return value) === 1 mod P.

And then my understanding goes downhill from there :)

Can someone help me figure this out, or at least point me to a website where I can find people who might understand this sort of function?

Title: Re: Need help understanding this modular inverse implementation
Post by: j2002ba2 on June 11, 2024, 08:11:32 AM

This is Extended Binary GCD

Binary GCD:
https://xlinux.nist.gov/dads/HTML/binaryGCD.html
http://www.cut-the-knot.org/blue/binary.shtml

Extended Binary GCD explanation with lots of words:
https://github.com/DavidNorman/gcd

Title: Re: Need help understanding this modular inverse implementation
Post by: NotATether on June 11, 2024, 08:43:03 AM

Quote from: j2002ba2 on June 11, 2024, 08:11:32 AM

Perfect. Now I'm going to try to figure out how the binary extended GCD operates with many words - this has been a pain point for me. I'll get back to you later.

Title: Re: Need help understanding this modular inverse implementation
Post by: j2002ba2 on June 11, 2024, 09:12:09 AM

I mean lots of words explaining it.

Here is the code I made back in 2019, this is slower than the 5x52 version (it is 8x32), but was faster than x^p-2.

Code:

__device__ static void invModP_(unsigned int value[8]) {
  //INPUT : Prime p and a in [1, p−1].
  //OUTPUT : a^−1 mod p.
  //1. u = a, v = p.
  unsigned int u[8], v[8];
  copyBigInt(value, u);
  copyBigInt(_P, v);
  //2. x1 = 1, x2 = 0.
  unsigned int x1[8] = { 0,0,0,0, 0,0,0,1 };
  unsigned int x2[8] = { 0,0,0,0, 0,0,0,0 };
  //3. While (u != 1 and v != 1) do
  for (;;) {
    if ( ((u[7] == 1) && ((u[0] | u[1] | u[2] | u[3] | u[4] | u[5] | u[6]) == 0)) ||
         ((v[7] == 1) && ((v[0] | v[1] | v[2] | v[3] | v[4] | v[5] | v[6]) == 0)) )
      break;
    //3.1 While u is even do
    while ((u[7] & 1) == 0) {
      //u = u/2.
      shift_right(u);
      unsigned int carry = 0;
      //If x1 is even then x1 = x1/2; else x1 = (x1 + p)/2.
      if ((x1[7] & 1) != 0) {
        carry = add(x1, _P, x1);
      }
      shift_right_c(x1, carry);
    }
    //3.2 While v is even do
    while ((v[7] & 1) == 0) {
      //v = v/2.
      shift_right(v);
      unsigned int carry = 0;
      //If x2 is even then x2 = x2/2; else x2 = (x2 + p)/2.
      if ((x2[7] & 1) != 0) {
        carry = add(x2, _P, x2);
      }
      shift_right_c(x2, carry);
    }
    //3.3 If u >= v then: u = u − v, x1 = x1 − x2 ;
    unsigned int t[8];
    if (sub(u, v, t)) {
      //Else: v = v − u, x2 = x2 − x1.
      sub(v, u, v);
      subModP(x2, x1, x2);
    } else {
      copyBigInt(t, u);
      subModP(x1, x2, x1);
    }
  }
  //4. If u = 1 then return(x1 mod p); else return(x2 mod p).
  if ((u[7] == 1) && ((u[0] | u[1] | u[2] | u[3] | u[4] | u[5] | u[6]) == 0)) {
    copyBigInt(x1, value);
  } else {
    copyBigInt(x2, value);
  }
}
/*
INPUT : Prime p and a in [1, p−1].
OUTPUT : a^−1 mod p.
1. u = a, v = p.
2. x1 = 1, x2 = 0.
3. While (u != 1 and v != 1) do
   3.1 While u is even do
       u = u/2.
       If x1 is even then x1 = x1/2; else x1 = (x1 + p)/2.
   3.2 While v is even do
       v = v/2.
       If x2 is even then x2 = x2/2; else x2 = (x2 + p)/2.
   3.3 If u >= v then: u = u − v, x1 = x1 − x2 ;
       Else: v = v − u, x2 = x2 − x1.
4. If u = 1 then return(x1 mod p); else return(x2 mod p).
*/

Title: Re: Need help understanding this modular inverse implementation
Post by: NotATether on June 20, 2024, 05:45:37 AM

OK, so now, finally I got the hang of this!

The key to understanding this implementation which is not a conventional binary egcd, is that this implementation is DRS62 (delayed right shift 62 bits) - actually it says so in the comments even.

Basically, the reason why it's called such is because it implements the optimizations in this algorithm: https://github.com/pornin/bingcd/blob/main/doc/bingcd.pdf

This paper describes how to optimize EGCD when you are working in a modular field with fixed-size registers, like 64-bit registers and the like.

Although in the conventional binary gcd, we are left shifting (dividing) stuff by 2, in this implementation we are actually multiplying some of the words by 2 and then we "fix" it by dividing (right shift) by a bunch of twos at once - 62 times, to be exact.

Those lines that have MM64 and MSK62 on them followed by multiplication, addition and a right shift - that is Montgomery Reduction (https://en.wikipedia.org/wiki/Montgomery_modular_multiplication#The_REDC_algorithm) and that basically calculates the modulus by P. It uses some fancy modulo 2^62 stuff, but I will have more to say when I write a separate post about this.

Much thanks @j2002ba2!

Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: NotATether on June 04, 2024, 01:28:24 PM