1

I'm confused about how one might supposedly bypass the final subtraction of the modulus in radix-2 montgomery modular multiplication, when used in a modular exponentiation algorithm. The following two papers put forward the conditions for bypassing the subtraction.

I don't understand exactly what is required in terms of the "preprocessing and postprocessing" to eliminate the need for the repetitive subtraction of the modulus at the end of the montgomery multiplication.

My understanding after reading the above papers is that, to eliminate the final subtractions, you must:

  1. Zero-extend each input operand to the modular exponentiation by two

    e.g. new[2049 downto 0]  = (0, 0, old[2047 downto 0]) 
    
  2. Increase the loop bound inside the Montgomery multiplications by two, such that two more iterations of the loop are executed

I've made these modifications to a working algorithm, however the results are not as I expect and I do not understand why. Therefore, I assume I am misinterpreting something in these papers, or leaving out a critical step.

Let us refer to my (working) radix-2 montgomery modular exponentiation function in (C-like pseudocode). Note that I have extended the operand width by two most-significant zero digits (just to make sure I'm not overflowing). They used to only be 2048 bits.

let NUM_BITS = 2048
let rsaSize_t be a 2050-bit vector type

// Montgomery multiplication: outData = XYr^(-1) modulo M,     
// where the radix r=2^n    (n=NUM_BITS) 
function montMult( rsaSize_t X,       // Multiplier
                   rsaSize_t Y,       // Multiplicand
                   rsaSize_t M,       // Modulus
                   rsaSize_t outData) // Result
{
    rsaSize_t S = 0;  // Running sum

    for (i=0; i<NUM_BITS; i++)
    {
        if (X.bit(i)==1) // Check ith bit of X
            S += Y;

        if (S.bit(0)==1) // check LSB of S
            S += M;

        S = S >> 1;   // Rightshift 1 bit
    }

    // HERE IS THE FINAL SUBTRACTION I WANT (NEED) TO AVOID
    if (S >= M)
    {
        S -= M;
    }

    outData = S.range(NUM_BITS-1,0);
}


//  montgomery modular exponentiation using square and multiply algorithm
//  computes  M^e modulo n, where we precompute the transformation of the 
//  base and running-partial sum into the montgomery domain 
function rsaModExp( rsaSize_t e,     // exponent 
                    rsaSize_t n,     // modulus
                    rsaSize_t Mbar,  // precomputed: montgomery residue of the base w.r.t. the radix--> (2^2048)*base mod n 
                    rsaSize_t xbar,  // precomputed: montgomery residue of 1  w.r.t. the radix--> 2^2048 mod n                 
                    rsaSize_t *out)  // result
{
    for (i=NUM_BITS-1; i>=0; i--)
    {
        montMult(xbar, xbar, n, xbar); // square
        if (e.bit(i)==1) 
            montMult(Mbar, xbar, n, xbar); // multiply
    }

    // undo montgomery transform
    montMult(xbar, 1, n, out);
}

Am I missing something in the papers? I do not believe this is an implementation error, as my code matches exactly what is put forth in the papers. I believe that I might be a conceptual error. Any and all help appreciated.

Thanks!

asmvolatile
  • 161
  • 4

1 Answers1

1

The problem was that there was one additional thing I left out....if I'm increasing the loop iteration count and the number of bits from 2048 to 2050, then I had to choose a new R to satisfy the preconditions of Walter's optimization.

$R=2^{bitwidth} $ so I needed to have $R=2^{2050}$ instead of $R=2^{2048}$.

I'd now like to modify this to use multiple word scanning, as suggested by @Ruggero, and have a few follow up questions so will open another thread.

asmvolatile
  • 161
  • 4