RNS Modular Computations for Cryptographic Applications

HAL Id: hal-01141347

https://hal.inria.fr/hal-01141347

Submitted on 11 Apr 2015

Karim Bigou, Arnaud Tisserand

To cite this version:

Karim Bigou, Arnaud Tisserand. RNS Modular Computations for Cryptographic Applications. RAIM: 7ème Rencontre Arithmétique de l’Informatique Mathématique, Apr 2015, Rennes, France. 2015. ⟨hal-01141347⟩

1. Elliptic Curve Cryptography (ECC)

Elliptic curve over $\mathbb{F}_P$: $y^2 = x^3 + ax + b$, with $P$ an $\ell$-bit prime

Example: $y^2 = x^3 + 4x + 20$ over $\mathbb{F}_{1009}$

Security levels: $\ell \in \{160, \ldots, 600\}$ bits

Curve-level operations:

- point addition (ADD): $Q + Q'$

- point doubling (DBL): $Q + Q$

- scalar multiplication: $[k]Q = \underbrace{Q + Q + \cdots + Q}_{k \text{ times}}$

Security (ECDLP): knowing $Q$ and $[k]Q$, recovering $k$ is computationally infeasible

ECDLP: Elliptic Curve Discrete Logarithm Problem
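The curve-level operations above can be illustrated on the toy curve $y^2 = x^3 + 4x + 20$ over $\mathbb{F}_{1009}$. The following is a minimal sketch in affine coordinates using plain Python integers (no RNS). The function names and the base point `Q = (1, 5)` are assumptions chosen for illustration, and a 10-bit field offers no security.

```python
# Toy parameters taken from the example curve: y^2 = x^3 + 4x + 20 over F_1009.
P, A, B = 1009, 4, 20
INF = None  # point at infinity (neutral element of the group)

def on_curve(pt):
    """Check that pt satisfies the curve equation."""
    if pt is INF:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P == 0

def add(q1, q2):
    """ADD of two points; falls back to DBL when q1 == q2."""
    if q1 is INF:
        return q2
    if q2 is INF:
        return q1
    x1, y1 = q1
    x2, y2 = q2
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF  # q2 is the opposite of q1
    if q1 == q2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P  # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P          # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)

def scalar_mul(k, q):
    """[k]Q by left-to-right double-and-add."""
    r = INF
    for bit in bin(k)[2:]:
        r = add(r, r)          # DBL
        if bit == "1":
            r = add(r, q)      # ADD
    return r

Q = (1, 5)  # hypothetical base point: 5^2 = 25 = 1 + 4 + 20
```

Each ADD or DBL needs one inversion in $\mathbb{F}_P$ in affine coordinates (the `pow(..., -1, P)` calls, which require Python 3.8 or later).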

3. RNS Computation Flow in ECC Applications

RNS allows some field-level operations to be performed in parallel

[Figure: computation flow over time. A scalar multiplication $[k]Q$ is a sequence of ADD and DBL operations; each of them is a sequence of $+$, $-$, $\times$ and inversion operations in $\mathbb{F}_P$; each field operation is mapped to $\pm\times$ operations over one RNS vector (i.e. $n$ channels, $\bmod\ m_1, \ldots, \bmod\ m_n$, with $\pm\times$ over one channel as the elementary step), and reduction modulo $P$ in RNS relies on base extensions.]

5. New RNS Modular Inversion (MI) (CHES 2013)

State-of-the-art RNS MI methods:

- based on Fermat's Little Theorem (FLT-MI): $X^{-1} = X^{P-2} \bmod P$, i.e. a large exponentiation with many modular reductions, which costs $O(\log_2 P \times n^2)$ EMMs

- very limited parallelization due to internal data dependencies

Proposed method PM-MI:

- extended binary Euclidean algorithm (binary-ternary version)

- uses the plus-minus trick: if $X$ and $Y$ are odd, then $X + Y \equiv 0 \bmod 4$ or $X - Y \equiv 0 \bmod 4$

- PM-MI works without BE and costs $O(\log_2 P \times n)$ EMMs
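To make the two approaches concrete, here is a minimal sketch on ordinary Python integers. It is not the RNS algorithm of the poster and says nothing about EMM counts: `inv_flt` is the Fermat exponentiation, and `inv_plus_minus` is a simplified extended binary Euclidean loop built on the plus-minus trick. The function names and the loop organisation are assumptions made for illustration, and $P$ is assumed to be an odd prime.

```python
def inv_flt(x, p):
    """FLT-based inversion: X^(P-2) mod P, for prime P."""
    return pow(x, p - 2, p)

def half_mod(c, p):
    """Return c / 2 mod p for odd p and 0 <= c < p."""
    return c // 2 if c % 2 == 0 else (c + p) // 2

def inv_plus_minus(x, p):
    """Inverse of x modulo an odd prime p with a plus-minus binary Euclid loop.

    Invariants: a * x = u (mod p) and b * x = v (mod p), with u and v odd.
    """
    u, a = p, 0
    v, b = x % p, 1
    if v == 0:
        raise ZeroDivisionError("x is not invertible modulo p")
    while v % 2 == 0:                      # make v odd
        v, b = v // 2, half_mod(b, p)
    while u != v:
        # Plus-minus trick: exactly one of u + v, u - v is a multiple of 4.
        if (u + v) % 4 == 0:
            t, c = (u + v) // 4, (a + b) % p
        else:
            t, c = (u - v) // 4, (a - b) % p
        c = half_mod(half_mod(c, p), p)    # divide the cofactor by 4 mod p
        if t < 0:
            t, c = -t, (-c) % p
        while t % 2 == 0:                  # strip remaining factors of 2
            t, c = t // 2, half_mod(c, p)
        if u > v:                          # replace the larger value
            u, a = t, c
        else:
            v, b = t, c
    if u != 1:
        raise ValueError("x is not invertible modulo p")
    return a
```

The loop only needs additions, subtractions, parity tests and divisions by 2 or 4, with no full modular reduction of a product. The new value is at most half of the larger of `u` and `v`, so the loop terminates.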

[Figure: hardware architecture with a shared CTRL, local registers ({@, en, r/w}), an arithmetic unit with 6 pipeline stages ({rst, mode, ...}), a comparator, and precomputation memories for multiplication ($\approx 2n \times w$), for the $r_i$, and for addition ($17 \times w$); data paths are $w$ bits wide.]

Example: number of EMMs for $\ell = 192$ bits

n × w      FLT-MI    PM-MI    Gain factor
12 × 17    103140    5474     18
9 × 22     61884     4106     15
7 × 29     40110     3193     12

[Figure: inversion time [µs] and speed-up versus n for FLT-MI and PM-MI, at 192, 256, 384 and 521 bits.]

[Figure: area in slices and number of blocks (DSP / BRAM) versus n for FLT-MI and PM-MI, at 192, 256, 384 and 521 bits.]

2. Residue Number System (RNS)

A large $\ell$-bit integer $X$ is represented by: $\overrightarrow{X} = (x_1, \ldots, x_n) = (X \bmod m_1, \ldots, X \bmod m_n)$

[Figure: $n$ independent $w$-bit channels; channel $i$ combines $x_i$ and $y_i$ with $\pm\times \bmod\ m_i$ to produce $z_i$, so that $X$ and $Y$ yield $Z$.]

RNS base $B = (m_1, \ldots, m_n)$: $n$ pairwise coprime $w$-bit moduli with $n \times w > \ell$

The Chinese remainder theorem (CRT) is the basis of RNS

EMM: elementary modular multiplication ($w$ bits)

Pros:

- carry-free between channels

- fast parallel $+$, $-$, $\times$ and some exact divisions

- non-positional number system, randomization against SCAs

- flexibility for hardware implementations

Cons:

- comparison, modular reduction and division are much harder
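The representation and its channel-wise arithmetic can be sketched as follows. The 8-bit base `BASE` is an assumption chosen for readability (real designs use $w$-bit moduli with $n \times w > \ell$), and the helper names are hypothetical.

```python
from math import prod

# Hypothetical toy base: four pairwise coprime 8-bit moduli.
BASE = (251, 253, 255, 256)
M = prod(BASE)  # dynamic range of the base

def to_rns(x, base=BASE):
    """Residues of x, one per channel."""
    return [x % m for m in base]

def rns_add(xs, ys, base=BASE):
    """Channel-wise addition: no carry crosses channels."""
    return [(x + y) % m for x, y, m in zip(xs, ys, base)]

def rns_mul(xs, ys, base=BASE):
    """Channel-wise multiplication: one EMM per channel."""
    return [(x * y) % m for x, y, m in zip(xs, ys, base)]

def from_rns(xs, base=BASE):
    """Reconstruct the integer in [0, M) with the CRT."""
    big_m = prod(base)
    total = 0
    for x, m in zip(xs, base):
        mi = big_m // m
        total += x * mi * pow(mi, -1, m)  # pow(..., -1, m) needs Python 3.8+
    return total % big_m

X, Y = 12345, 6789
Z = from_rns(rns_mul(to_rns(X), to_rns(Y)))
```

The result is only correct modulo `M`, which is why the product of the moduli must exceed the largest value to be represented. Comparing two RNS vectors or reducing modulo $P$ has no such channel-wise shortcut.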

4. State-of-the-Art Algorithms and Architectures

RNS Montgomery Reduction

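Input: $\overrightarrow{X}$, $\overrightarrow{X'}$ (the residues of $X$ in bases $B$ and $B'$)

Output: $(\overrightarrow{\omega}, \overrightarrow{\omega'})$ with $\omega \equiv X \times M^{-1} \bmod P$

$\overrightarrow{Q} \leftarrow \overrightarrow{X} \times (\overrightarrow{-P^{-1}})$ (in base $B$)

$\overrightarrow{Q'} \leftarrow \mathrm{BE}(\overrightarrow{Q}, B, B')$

$\overrightarrow{S'} \leftarrow \overrightarrow{X'} + \overrightarrow{Q'} \times \overrightarrow{P'}$ (in base $B'$)

$\overrightarrow{\omega'} \leftarrow \overrightarrow{S'} \times \overrightarrow{M^{-1}}$ (in base $B'$)

$\overrightarrow{\omega} \leftarrow \mathrm{BE}(\overrightarrow{\omega'}, B', B)$

BE: base extension; $M = \prod_{i=1}^{n} m_i$

[Figure: data flow of the reduction between bases $B$ and $B'$, with two base extensions.]

[Figure: Cox-Rower architecture with $n$ channels, one $w$-bit rower per channel, a cox, a shared CTRL, and $n \times w$-bit input and output.]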
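The five steps of the reduction can be traced with the sketch below. It makes two simplifying assumptions that are not in the poster: the base extension is done exactly by a full CRT reconstruction (hardware implementations use cheaper, often approximate, BE algorithms), and the bases `B`, `B2` and the prime `P` are toy values. All names are hypothetical.

```python
from math import prod

B = (251, 253, 255, 256)    # toy base B (pairwise coprime)
B2 = (247, 241, 239, 233)   # toy base B', coprime with every modulus of B
P = 1009                    # toy prime, coprime with all moduli
M = prod(B)

def to_rns(x, base):
    return [x % m for m in base]

def from_rns(xs, base):
    """CRT reconstruction in [0, prod(base))."""
    big_m = prod(base)
    return sum(x * (big_m // m) * pow(big_m // m, -1, m)
               for x, m in zip(xs, base)) % big_m

def base_extend(xs, src, dst):
    """Exact BE via CRT: a stand-in for the hardware BE algorithms."""
    return to_rns(from_rns(xs, src), dst)

def rns_montgomery(x_b, x_b2):
    """Return (omega in B, omega in B') with omega = X * M^-1 mod P, omega < 2P."""
    # Q <- X * (-P^-1) in base B
    q_b = [(x * -pow(P, -1, m)) % m for x, m in zip(x_b, B)]
    # Q' <- BE(Q, B, B')
    q_b2 = base_extend(q_b, B, B2)
    # S' <- X' + Q' * P' in base B'
    s_b2 = [(x + q * (P % m)) % m for x, q, m in zip(x_b2, q_b2, B2)]
    # omega' <- S' * M^-1 in base B' (exact division of X + Q*P by M)
    w_b2 = [(s * pow(M, -1, m)) % m for s, m in zip(s_b2, B2)]
    # omega <- BE(omega', B', B)
    w_b = base_extend(w_b2, B2, B)
    return w_b, w_b2

X = 987 * 654               # a product of two field elements, X < M * P
w_b, w_b2 = rns_montgomery(to_rns(X, B), to_rns(X, B2))
omega = from_rns(w_b2, B2)
```

With $X < M \times P$, the value $\omega$ is below $2P$, so it fits in base $B'$ as long as the product of the moduli of $B'$ exceeds $2P$, which holds for these toy values.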

6. Fast Patterns for RNS Computations (ASAP 2014)

Cost of standard and modular multiplications in RNS:

- standard: $n$ EMMs, fully parallel

- modular: $2n^2 + O(n)$ EMMs (1 multiplication and 1 reduction)

Proposed method:

- splits operands into 2 parts: $\overrightarrow{X} = \overrightarrow{K_x} \times \overrightarrow{M_a} + \overrightarrow{R_x}$, which allows replacing $2n$ moduli by only $\frac{3}{2}n$

- reuses the split result in various computation patterns

- requires a hypothesis on $P$: satisfied for ECC/DH, but not for RSA
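The split $X = K_x \times M_a + R_x$ can be illustrated with an exact division in RNS. In this sketch, $R_x = X \bmod M_a$ is given directly by the residues of $X$ in a base $B_a$, and $K_x$ is obtained in a second base $B_b$ by multiplying $X - R_x$ by $M_a^{-1}$. The poster does not detail the procedure, so this is an assumption. The bases, the CRT-based extension and the function names are hypothetical as well.

```python
from math import prod

BASE_A = (251, 253)   # toy base B_a, M_a is the product of its moduli
BASE_B = (255, 256)   # toy base B_b, coprime with M_a
M_A = prod(BASE_A)

def crt(xs, base):
    """CRT reconstruction in [0, prod(base))."""
    big_m = prod(base)
    return sum(x * (big_m // m) * pow(big_m // m, -1, m)
               for x, m in zip(xs, base)) % big_m

def split(x_a, x_b):
    """Return (R_x in B_b, K_x in B_b) for X given by residues in B_a and B_b."""
    r = crt(x_a, BASE_A)                  # R_x = X mod M_a (exact BE stand-in)
    r_b = [r % m for m in BASE_B]
    # K_x = (X - R_x) / M_a is an exact division, done per channel in B_b.
    k_b = [((x - rr) * pow(M_A, -1, m)) % m
           for x, rr, m in zip(x_b, r_b, BASE_B)]
    return r_b, k_b

X = 123456789                             # must satisfy X < M_a * M_b
x_a = [X % m for m in BASE_A]
x_b = [X % m for m in BASE_B]
r_b, k_b = split(x_a, x_b)
```

Both parts are roughly half the size of $X$, so each fits in a base of about half the width. This is consistent with the reduction in the number of moduli stated above.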

Cost for some patterns (number of EMMs):

Operation                       State of the art    Ours
$AB \bmod P$                    $2n^2 + 4n$         $2.5n^2 + 12.5n$
$A^2 \bmod P$                   $2n^2 + 4n$         $1.75n^2 + 10.5n$
$\mathrm{Cst} \times A \bmod P$     $2n^2 + 4n$         $1.75n^2 + 7n$
$\mathrm{Cst} \times A^2 \bmod P$   $4n^2 + 8n$         $2.75n^2 + 16.5n$
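To see where the patterns pay off, the cost formulas can be evaluated for a given $n$. This sketch copies the formulas from the table as transcribed. The dictionary keys and the function name are assumptions, and the ratio `ours / ref` mirrors the "Our / Ref" axis of the plots.

```python
# EMM cost formulas from the table: (state of the art, proposed), as functions of n.
COSTS = {
    "AB mod P":      (lambda n: 2 * n**2 + 4 * n, lambda n: 2.5 * n**2 + 12.5 * n),
    "A^2 mod P":     (lambda n: 2 * n**2 + 4 * n, lambda n: 1.75 * n**2 + 10.5 * n),
    "Cst*A mod P":   (lambda n: 2 * n**2 + 4 * n, lambda n: 1.75 * n**2 + 7 * n),
    "Cst*A^2 mod P": (lambda n: 4 * n**2 + 8 * n, lambda n: 2.75 * n**2 + 16.5 * n),
}

def emm_costs(n):
    """Return {operation: (ref, ours, ours / ref)} for n moduli."""
    out = {}
    for op, (ref, ours) in COSTS.items():
        r, o = ref(n), ours(n)
        out[op] = (r, o, o / r)
    return out

if __name__ == "__main__":
    for n in (12, 24, 48):
        for op, (r, o, ratio) in emm_costs(n).items():
            print(f"n={n:2d}  {op:14s} ref={r:6.0f}  ours={o:8.1f}  ratio={ratio:.3f}")
```

A ratio below 1 means the pattern needs fewer EMMs than the reference. Because the linear terms are larger in the proposed method, the ratio improves as $n$ grows.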

Usage for Diffie-Hellman or ElGamal:

[Figure: ratio Our / Ref of the number of EMMs versus n for two exponentiation algorithms, LSBF and Montgomery.]

[Figure: SPLIT, PR and MR steps over the bases $B_a$, $B_b$ and $B_c$, with $R_x = X_a$ and $R_y = Y_a$; base extension (BE) computations are done in 1 base.]

Funding from DGA-INRIA PhD grant and project PAVOIS ANR 12 BS02 002 01