Hardware Accelerators for ECC and HECC

HAL Id: hal-01207422

https://hal.inria.fr/hal-01207422

Submitted on 30 Sep 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Arnaud Tisserand

To cite this version:

Arnaud Tisserand. Hardware Accelerators for ECC and HECC. ECC: 19th Workshop on Elliptic Curve Cryptography, Sep 2015, Bordeaux, France. ⟨hal-01207422⟩

CNRS, IRISA laboratory, CAIRN research team

ECC Bordeaux Sep. 29–30, 2015

Summary

• Introduction

• Accelerator architecture and units

• Accelerator programming

• Implementation results: comparison ECC vs HECC on FPGA

Current Projects on (H)ECC Accelerators

PAVOIS project 2012–2016

Arithmetic Protections Against Physical Attacks for Elliptic Curve based Cryptography

• IRISA (Lannion)
• LIRMM (Perpignan, Montpellier & Toulon)

http://pavois.irisa.fr/

ANR 12 BS02 002

HAH project 2014–2017

Hardware and Arithmetic for Hyperelliptic Curves Cryptography

• IRISA (Lannion)
• IRMAR (Rennes)

http://h-a-h.inria.fr/

Labex and

Introduction

Abstraction levels:

• protocol level: encryption, signature, etc.
• curve level: [k]P, ADD(P, Q), DBL(P) = P + P
• field level: x ± y, x × y, ...

Example: $E : y^2 = x^3 + 4x + 20$ over GF(1009)

• points: P, Q = (x, y) or (x, y, z) or ...
• coordinates: x, y, z ∈ GF(·), i.e. $\mathbb{F}_p$ or $\mathbb{F}_{2^m}$, with t: 80–600 bits
• scalar: $k = (k_{t-1} k_{t-2} \ldots k_1 k_0)_2 \in \mathbb{N}$

Scalar multiplication operation

for i from 0 to t − 1 do
    if $k_i$ = 1 then Q = ADD(P, Q)
    P = DBL(P)

Point addition/doubling operations: sequence of finite field operations

• DBL: $v_1 = z_1^2$, $v_2 = x_1 - v_1$, ...
• ADD: $w_1 = z_1^2$, $w_2 = z_1 \times w_1$, ...

$\mathbb{F}_p$ or $\mathbb{F}_{2^m}$ operations

operation modulo large prime ($\mathbb{F}_p$)
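The scalar multiplication loop can be tried on the example curve of the slide. The following is only a minimal sketch, assuming affine coordinates and the textbook chord-and-tangent formulas (the accelerator itself works with other representations); the names `add`, `dbl`, `scalar_mult` and the base point (1, 5) are choices made for this illustration. The point (1, 5) lies on the curve because 1 + 4 + 20 = 25 = 5².

```python
# Toy affine arithmetic on E: y^2 = x^3 + 4x + 20 over GF(1009).
# Assumption: textbook affine formulas, not the accelerator's actual formulas.
P_MOD, A, B = 1009, 4, 20
INF = None  # point at infinity (neutral element)

def on_curve(p):
    if p is INF:
        return True
    x, y = p
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def add(p, q):
    """ADD(P, Q); also handles P = Q and the point at infinity."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF  # P + (-P)
    if p == q:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def dbl(p):
    """DBL(P) = P + P."""
    return add(p, p)

def scalar_mult(k, p):
    """[k]P with the loop of the slide: scan k_0, k_1, ..., k_{t-1}."""
    q = INF
    for i in range(k.bit_length()):
        if (k >> i) & 1:      # if k_i = 1 then Q = ADD(P, Q)
            q = add(p, q)
        p = dbl(p)            # P = DBL(P)
    return q

BASE = (1, 5)  # illustrative point: 5^2 = 25 = 1 + 4 + 20
```

`pow(x, -1, m)` needs Python 3.8 or later. The branch on `k_i` in `scalar_mult` is exactly what makes this loop a target for the side channel attacks discussed next.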

Side Channel Attacks

Abstraction levels:

• protocol level: encryption, signature, etc.
• curve level: [k]P, ADD(P, Q), DBL(P)
• field level: x ± y, x × y, ...

Scalar multiplication operation

for i from 0 to t − 1 do
    if $k_i$ = 1 then Q = ADD(P, Q)
    P = DBL(P)

[Figure: trace of a scalar multiplication in which the operation sequence DBL DBL DBL ADD DBL ADD DBL DBL is visible; the key bits read from it are 0 0 0 1 1 0]

• simple power analysis (& variants)
• differential power analysis (& variants)
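The reading of key bits from the operation sequence can be sketched as follows. This is an illustration of the principle only, assuming an attacker who can already tell ADD from DBL in a trace; the function names are made up for this example.

```python
def ops_trace(bits):
    """Operation sequence produced by the loop for key bits k_0, k_1, ... (LSB first)."""
    ops = []
    for b in bits:
        if b == 1:
            ops.append("ADD")
        ops.append("DBL")
    return ops

def recover_bits(ops):
    """Read the key bits back from the operation sequence alone."""
    bits, pending_add = [], False
    for op in ops:
        if op == "ADD":
            pending_add = True
        else:  # each DBL closes one loop iteration
            bits.append(1 if pending_add else 0)
            pending_add = False
    return bits

observed = "DBL DBL DBL ADD DBL ADD DBL DBL".split()
print(recover_bits(observed))  # [0, 0, 0, 1, 1, 0]
```

An iteration with an ADD before its DBL reveals a 1 and a lone DBL reveals a 0, which is why regular algorithms such as the Montgomery ladder or unified formulas appear later in the talk as protected variants.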

Objectives of Our Research Group

• Study and implementation of efficient hardware supports:
  ▸ Cryptography over (hyper)-elliptic curves (H)ECC
  ▸ Operations over finite fields $\mathbb{F}_p$ & $\mathbb{F}_{2^m}$ and curve points
  ▸ Hardware targets: FPGAs and ASICs
  ▸ Flexibility: programmable in software

• Study and implementation of protections against physical attacks:
  ▸ Passive attacks: measure of power consumption, electromagnetic radiations, timings
  ▸ Active attacks: fault injection (in progress)

• Levels: algorithm, representation, operator, architecture, circuit

• Trade-offs between: performance, cost (area/energy), security

• Study, development and distribution of an open source (H)ECC accelerator and its programming tools

Accelerator Specifications

Abstraction levels:

• protocol level: encryption, signature, etc.
• curve level: [k]P, ADD(P, Q), DBL(P) = P + P
• field level: x ± y, x × y, ...

• Performance ⇒ hardware (HW)
  ▸ dedicated functional units
  ▸ internal parallelism

• Limited cost (embedded systems)
  ▸ reduced silicon area
  ▸ low energy (& power consumption)
  ▸ large area used at each clock cycle

• Flexibility ⇒ software (SW)
  ▸ curves, algorithms, representations (points/elements), k recoding, ...
  ▸ at design time / at run time

• Security against SCAs ⇒ HW
  ▸ secure units ($\mathbb{F}_{2^m}$, $\mathbb{F}_p$)
  ▸ secure key storage/management

Accelerator Architecture

[Figure: block diagram of the accelerator, built up step by step: external interface, interconnect, CTRL, code memory, key management, register file, and functional units FU1, FU2, FU3]

Functional Units for Field Level Operations

[Figure: functional unit FUα with data ports x[i], y[i], r[i] (w bits) and control signals (few bits)]

Notation: x[i] is the i-th w-bit word of $x \in \mathbb{F}_q$

Units:

• $\mathbb{F}_p$: addition/subtraction, multiplication (2-step, Montgomery, variants), inversion
• $\mathbb{F}_{2^m}$ (polynomial basis, normal basis & variants): addition/subtraction, multiplication (Montgomery, Mastrovito, 2-step), square, inversion

Internal parameters: number of sub-blocks, radix, pipelining scheme, ...
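To make the word notation concrete, here is a hedged sketch of how a field element can be split into w-bit words x[i] and consumed word by word in a Montgomery multiplication over $\mathbb{F}_p$. It is a generic textbook radix-$2^w$ scheme, not the design of the units presented here; `to_words`, `from_words` and `mont_mul` are names chosen for the illustration.

```python
def to_words(x, w, n):
    """Split x into n words of w bits, x[0] being the least significant."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(n)]

def from_words(words, w):
    """Inverse of to_words."""
    return sum(word << (w * i) for i, word in enumerate(words))

def mont_mul(x, y, p, w, n):
    """Word-serial Montgomery product x*y*R^-1 mod p, with R = 2^(w*n).

    Assumes p odd, p < R and 0 <= x, y < p (generic scheme, for illustration).
    """
    base = 1 << w
    p_inv = (-pow(p, -1, base)) % base    # -p^-1 mod 2^w
    acc = 0
    for xi in to_words(x, w, n):          # one word x[i] per iteration
        acc += xi * y
        q = (acc * p_inv) % base          # makes acc + q*p divisible by 2^w
        acc = (acc + q * p) >> w
    return acc - p if acc >= p else acc   # final conditional subtraction
```

With p = 1009, w = 4 and n = 3, for instance, R = 2¹² and `mont_mul(x, y, 1009, 4, 3)` equals `x * y * pow(4096, -1, 1009) % 1009`. The word size w and the number of words processed in parallel are the kind of internal parameters listed above.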

Register File (≈ Dual Port Memory)

[Figure: dual port memory with word ports x[i], y[i], r[i]; field elements (size ≥ m bits) are stored as words of w bits]

Control signals: addresses (port A, port B), read/write, write enable

Specific addressing model for $\mathbb{F}_q$ elements (through an intermediate address table with hardware loop)

• linear addresses, SW: LOAD @x ⇒ HW: loop x[0], x[1], ..., x[ℓ − 1]
• randomized addresses
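The addressing model can be pictured with a small software model. This is only a sketch: the table maps a register identifier to an (offset, number of words) pair and a loop walks the words. How the randomized mode works is not detailed on the slide, so the version below simply assumes that the same words are visited in a random order; the class and method names are hypothetical.

```python
import random

class RegisterFile:
    """Word memory plus an intermediate address table: @Rid -> (offset, n_words)."""

    def __init__(self, size):
        self.mem = [0] * size
        self.iat = {}

    def set_entry(self, rid, offset, n_words):
        self.iat[rid] = (offset, n_words)

    def addresses(self, rid, rng=None):
        offset, n_words = self.iat[rid]
        addrs = [offset + i for i in range(n_words)]  # linear: x[0], x[1], ...
        if rng is not None:
            rng.shuffle(addrs)  # assumed randomization: same words, random order
        return addrs

    def store(self, rid, words):
        for addr, word in zip(self.addresses(rid), words):
            self.mem[addr] = word

    def load(self, rid, rng=None):
        """Return (word index, word) pairs in the order the loop visits them."""
        offset, _ = self.iat[rid]
        return [(addr - offset, self.mem[addr]) for addr in self.addresses(rid, rng)]

rf = RegisterFile(64)
rf.set_entry(3, offset=8, n_words=4)
rf.store(3, [10, 11, 12, 13])
linear = rf.load(3)                       # one software LOAD, four word accesses
shuffled = rf.load(3, random.Random(1))   # same words, visiting order randomized
```

The point of the model is that software issues a single LOAD on an element while the word-level loop, and any randomization of it, stays inside the hardware.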

Key Management Unit

[Figure: key management unit: the key k goes through key recoding, which delivers the digits $k_i$ to CTRL]

• On-the-fly recoding of k: binary, λ-NAF (λ ∈ {2, 3, 4, 5}), variants (fixed/sliding), double-base [1] and multiple-base [2] number systems (with/without randomization), addition chains [12], others?
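As an illustration of on-the-fly recoding, the sketch below produces the digits of the plain non-adjacent form (NAF) one at a time, least significant first, so that each digit can be consumed as soon as it is available. It is the standard software algorithm, shown here as an assumption about what the simplest recoding looks like, not the hardware recoder of the accelerator.

```python
def naf_digits(k):
    """Yield the NAF digits of k >= 0 in {-1, 0, 1}, least significant first."""
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d
        else:
            d = 0
        yield d
        k >>= 1

print(list(naf_digits(7)))  # [-1, 0, 0, 1], i.e. 7 = 8 - 1
```

Each nonzero digit is followed by at least one zero, which reduces the number of point additions compared with the binary representation (a subtraction of P replaces an addition when the digit is −1).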

External Interface(s)

Under development:

• Basic (neither clock rate nor width adaptation)
• ARM Cortex cores in Zynq 7 FPGAs (through AXI bus)
• MicroBlaze softcore processor for Xilinx FPGAs
  ▸ AXI bus (V6+)
  ▸ PLB bus (V2 – V5)
• Specific for a “small” ASIC pad ring

Future development:

• NIOS softcore processor for Altera FPGAs

Protected $\mathbb{F}_{2^m}$ Multipliers

[Figure: activity traces (#transitions vs. cycles) of the Mastrovito 233 multiplier, unprotected and protected versions]

Overhead: area/time < 10 %

References: PhD D. Pamula [8]; articles: [11], [10], [9]

Protected (Old) Accelerator for $\mathbb{F}_{2^m}$

[Figure: activity traces (#transitions vs. cycles) and current measures (current [mA]) for the DBL operation with the unprotected and the protected Mastrovito multiplier, and activity trace for the ADD operation with the protected Mastrovito multiplier]

Circuit-Level Protections for Arithmetic Operators

Units Impact on Side Channel Information (1/2)

Activity traces measured with CABA (cycle-accurate, bit-accurate) simulations for three configurations of the multiplier (1, 2, 4 sub-blocks of 32 bits) and a very small accelerator

[Figure: six plots of activity [#transitions] vs. time [clock cycles], for the ADD and DBL operations with 1, 2 and 4 sub-blocks]

Units Impact on Side Channel Information (2/2)

[Figure: two zoomed plots of activity [#transitions] vs. time [clock cycles], annotated with the instruction sequence READ, LAUNCH addition, WAIT addition, WRITE (three times), then READ, LAUNCH multiplication, WAIT multiplication, WRITE]

Developed Programming Tools

[Figure: tool flow along a time axis with versions V0 (now), V1 and V2. V0: accelerator modules and configurations, CAD tools, user selection, crypto. lib., assembler, binary, implementation. V1 adds: compiler, Python, API/TLS-SSL. V2 adds: Sage]

Instruction Set

READ FUid @Rid @Rid B/U

WRITE FUid @Rid

LAUNCH FUid MODE

WAIT FUid

SETADDRO @Rid OFFSET

SETADDRN @Rid #WORD

WRITEK #WORD

CALL @DEST

RET

BZ @DEST

BNZ @DEST

JMP @DEST

CMPD DIGIT

SET FLAGid

TST FLAGid

Address Model in the Register File

RF requirements:

• 5–16 registers of m-bit $\mathbb{F}_q$ elements
• worst case: w small (16 bits) and m large (600 bits) ⇒ 550+ words and 10-bit physical addresses

$x \in \mathbb{F}_q$ is addressed by one entry (notation @Rid) of the intermediate address table (IAT) with 2 values:

• offset of the first word (e.g. x[0])
• number of w-bit words

[Figure: CTRL, address table and register file]
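The worst-case sizing can be checked with a few lines of integer arithmetic. The sketch below assumes the largest register count of the stated range (16 registers); the function name `rf_size` is made up for this example.

```python
def rf_size(n_regs, m_bits, w_bits):
    """Words per element, total words, and physical address width of the register file."""
    words_per_element = -(-m_bits // w_bits)         # ceil(m / w)
    total_words = n_regs * words_per_element
    addr_bits = max(1, (total_words - 1).bit_length())
    return words_per_element, total_words, addr_bits

print(rf_size(16, 600, 16))  # (38, 608, 10)
```

A 600-bit element needs 38 words of 16 bits, so 16 registers occupy 608 words. That is more than 512, hence 10-bit physical addresses, in line with the order of magnitude given on the slide.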

Code Memory

Behavior:

• Specific private path in the interconnect for code download (no leaks in RF or FUs)
• Code input can be disabled (ROM mode with code in the FPGA bitstream)
• Instruction CALL: push PC then jump to @DEST
• Instruction RET: jump to (pop) + 1
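The CALL/RET behavior can be mimicked in a few lines. This is a toy control-flow model only: instructions are tuples, every instruction other than CALL, RET and JMP just falls through, and `HALT` is a hypothetical marker added to stop the model (it is not part of the instruction set listed earlier).

```python
def run(code):
    """Follow CALL/RET/JMP control flow and return the list of visited addresses."""
    pc, stack, visited = 0, [], []
    while pc < len(code):
        visited.append(pc)
        instr, *args = code[pc]
        if instr == "CALL":      # push PC then jump to @DEST
            stack.append(pc)
            pc = args[0]
        elif instr == "RET":     # jump to (pop) + 1
            pc = stack.pop() + 1
        elif instr == "JMP":
            pc = args[0]
        elif instr == "HALT":    # hypothetical stop marker for this model
            break
        else:
            pc += 1
    return visited

code = [
    ("CALL", 3),              # 0: call the routine at address 3
    ("WAIT", "fu_mul_0"),     # 1: executed after the routine returns
    ("HALT",),                # 2
    ("LAUNCH", "fu_mul_0"),   # 3: routine body
    ("RET",),                 # 4
]
print(run(code))  # [0, 3, 4, 1, 2]
```

Because CALL pushes its own address rather than the address of the next instruction, RET has to add 1 after popping, as stated on the slide.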

Internal Parallelism Model

Non-blocking instruction decoding (i.e. always do PC ← PC + 1 or PC ← cst) except for the WAIT instruction

Example of an operation sequence, its dependency graph and assembly code for 2 multipliers: r = ((a × b) + c) + (d × e)

[Figure: dependency graph with two multiplications (M) and two additions (A); a, b, c, d, e are in registers 0 to 4, intermediate values go to registers 5 and 6, and r ends up in register 5]

    read   fu_mul_0, 0, 1         read a & b
    launch fu_mul_0               start ab
    read   fu_mul_1, 3, 4         read d & e
    launch fu_mul_1               start de
    wait   fu_mul_0               wait for ab
    write  fu_mul_0, 5            write ab
    set    OPMODE, 0              addition mode (+)
    read   fu_add_sub_0, 5, 2     read ab & c
    launch fu_add_sub_0           start (ab) + c
    wait   fu_mul_1               wait for de
    write  fu_mul_1, 6            write de
    wait   fu_add_sub_0           wait for (ab) + c
    write  fu_add_sub_0, 5        write (ab) + c
    read   fu_add_sub_0, 5, 6     read (ab) + c & de
    launch fu_add_sub_0           start ((ab) + c) + (de)
    wait   fu_add_sub_0           wait for ((ab) + c) + (de)
    write  fu_add_sub_0, 5        write ((ab) + c) + (de)
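The execution model of this example can be simulated with a small sketch. Everything numeric here is an assumption made for the illustration: the toy modulus 1009, the unit latencies (20 cycles for a multiplication, 2 for an addition), one cycle per decoded instruction, and the convention that any OPMODE other than 0 means subtraction. The functional result is computed at launch time for simplicity; only the `ready_at` time models the latency.

```python
P = 1009               # toy prime modulus (assumed for the example)
state = {"OPMODE": 0}  # 0 = addition; other values assumed to mean subtraction

class FU:
    def __init__(self, op, latency):
        self.op, self.latency = op, latency
        self.x = self.y = self.result = 0
        self.ready_at = 0

def run(program, regs, fus):
    """Run the program; return the cycle count (1 cycle per instruction, WAIT may stall)."""
    clock = 0
    for instr, *args in program:
        clock += 1                                 # decoding never blocks...
        if instr == "read":
            fu, ra, rb = args
            fus[fu].x, fus[fu].y = regs[ra], regs[rb]
        elif instr == "launch":
            fu = fus[args[0]]
            fu.result = fu.op(fu.x, fu.y)          # value computed now, visible later
            fu.ready_at = clock + fu.latency
        elif instr == "wait":                      # ...except on WAIT
            clock = max(clock, fus[args[0]].ready_at)
        elif instr == "write":
            fu, rd = args
            regs[rd] = fus[fu].result
        elif instr == "set":
            state[args[0]] = args[1]
    return clock

fus = {
    "fu_mul_0": FU(lambda x, y: x * y % P, latency=20),
    "fu_mul_1": FU(lambda x, y: x * y % P, latency=20),
    "fu_add_sub_0": FU(lambda x, y: (x + y if state["OPMODE"] == 0 else x - y) % P, latency=2),
}
program = [
    ("read", "fu_mul_0", 0, 1), ("launch", "fu_mul_0"),
    ("read", "fu_mul_1", 3, 4), ("launch", "fu_mul_1"),
    ("wait", "fu_mul_0"), ("write", "fu_mul_0", 5),
    ("set", "OPMODE", 0),
    ("read", "fu_add_sub_0", 5, 2), ("launch", "fu_add_sub_0"),
    ("wait", "fu_mul_1"), ("write", "fu_mul_1", 6),
    ("wait", "fu_add_sub_0"), ("write", "fu_add_sub_0", 5),
    ("read", "fu_add_sub_0", 5, 6), ("launch", "fu_add_sub_0"),
    ("wait", "fu_add_sub_0"), ("write", "fu_add_sub_0", 5),
]
a, b, c, d, e = 12, 34, 56, 78, 90
regs = [a, b, c, d, e, 0, 0]   # registers 0..4 hold a..e, 5 and 6 are temporaries
cycles = run(program, regs, fus)
```

After the run, register 5 holds ((a × b) + c) + (d × e) modulo P. The second multiplication overlaps with the first one and with the first addition, which is the benefit of keeping decoding non-blocking and stalling only on WAIT.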

ECC Accelerator with Additions Chains

First full hardware implementation of recoding using addition chains

FPGA implementation: Spartan-6 XC6SLX9, 192-bit $\mathbb{F}_p$, very small configuration

[Figure: architecture of the recoding unit: Euclid computation of C, computation of k/φ, CTRL, SIPO, and a BRAM memory holding 1/φ, k, k/φ, a, b, C, C0, with address control signals, scalar digits and w-bit data words]

recoding  BRAM  optim.  area               freq.  dura.  SCA
method          target  slices (FF/LUT)    MHz    ms     prot.
EAC       3     area    534 (1813/1508)    132    35.8   Y
EAC       3     speed   556 (1872/1523)    137    34.5   Y
DA        2     area    429 (1243/1134)    191    30     N
DA        2     speed   399 (1302/1222)    177    32.5   N
ML        2     area    429 (1243/1134)    191    42.5   Y
ML        2     speed   399 (1302/1222)    177    45.8   Y
UF        2     area    429 (1243/1134)    191    50.4   Y
UF        2     speed   399 (1302/1222)    177    54.4   Y
NAF-3     2     area    422 (1280/1157)    181    25.2   N
NAF-3     2     speed   423 (1321/1242)    175    26.1   N
NAF-4     2     area    420 (1277/1161)    158    27.3   N
NAF-4     2     speed   425 (1233/1246)    177    24.4   N

EAC: Euclidean addition chains, DA: double-and-add, ML: Montgomery ladder, UF: unified formula

Comparison ECC 256 vs HECC 128 (1/7)

[Figure: dataflow graphs of the ADD and DBL formulas for ECC and HECC; the costs below count field multiplications (M) and squares (S)]

field $\mathbb{F}_p$    ADD         DBL
ECC (ℓ bits)            12M + 2S    6M + 5S
HECC (2ℓ bits)          47M + 4S    38M + 6S

Configurations on a XC6SLX75 FPGA (details in [5]):

• w = 32 bits internal words
• 1 adder/subtracter, 1 inversion unit
• nM multipliers (Montgomery) with nB w-bit sub-blocks
• No DSP blocks

Comparison ECC 256 vs HECC 128 (2/7)

• Compared recoding techniques:
  ▸ BIN: standard binary from left to right
  ▸ NAF: non-adjacent form
  ▸ λ-NAF: window methods with λ ∈ {3, 4}

• Implementation results for a full ECC accelerator (nM = 1, nB = 1):

Recoding               BIN               NAF               3-NAF             4-NAF
area slices (FF/LUT)   565 (1321/1461)   570 (1340/1479)   571 (1344/1495)   503 (1348/1489)
freq. (MHz)            225               228               237               217

Comparison ECC 256 vs HECC 128 (3/7)

Impact of the number/size of multipliers on the area and frequency:

            nB = 1                     nB = 2                     nB = 4
      nM  BRAM  area slices (FF/LUT)  freq.  area slices (FF/LUT)  freq.  area slices (FF/LUT)  freq.
                                      MHz                          MHz                          MHz
ECC   1   3     547 (1374/1460)       231    573 (1476/1625)       233    673 (1674/1875)       233
ECC   2   3     722 (1776/1903)       220    811 (1979/2210)       227    942 (2377/2701)       220
ECC   3   3     810 (2174/2236)       221    915 (2480/2698)       215    1130 (3077/3430)      214
ECC   4   3     952 (2569/2656)       215    1100 (2977/3282)      217    1512 (3771/4293)      216
ECC   5   3     1064 (2982/3136)      210    1405 (3492/3902)      206    1722 (4487/5122)      209
HECC  1   4     514 (1336/1374)       235    549 (1434/1513)       234
HECC  2   4     646 (1716/1783)       220    737 (1912/2055)       234
HECC  3   4     732 (2092/2075)       224    826 (2386/2485)       225
HECC  4   4     870 (2476/2424)       218    1022 (2868/2987)      214
HECC  5   4     976 (2865/2773)       219    1115 (3355/3465)      210
HECC  6   4     1089 (3233/3092)      203    1240 (3821/3908)      208
HECC  7   4     1145 (3601/3426)      213    1372 (4287/4365)      205
HECC  8   4     1281 (3981/3809)      191    1552 (4765/4890)      183
HECC  9   4     1379 (4363/4051)      202    1691 (5245/5277)      199
HECC  10  4     1543 (4739/4435)      196    1856 (5719/5801)      198
HECC  11  4     1547 (5114/4750)      189    1936 (6192/6240)      198
HECC  12  4     1738 (5499/5128)      191    2100 (6675/6771)      188

Comparison ECC 256 vs HECC 128 (4/7)

Impact of the number/size of multipliers on the average time (ms):

Rows: number of sub-blocks nB; columns: number of multipliers nM.

      nB   nM=1   2     3     4     5     6    7    8    9    10   11   12
HECC  1    15.6   8.6   5.7   4.7   3.9   3.7  3.3  3.6  3.4  3.5  3.6  3.6
HECC  2    11.9   6.2   4.5   3.6   3.2   2.8  2.8  3.0  2.7  2.7  2.8  2.9
ECC   1    28.1   15.3  12.4  12.4  12.7
ECC   2    17.7   9.6   8.3   8.0   8.4
ECC   4    11.1   6.2   5.4   5.1   5.3

Standard deviation for 1000 [k]P:

configuration (nM, nB)   ECC (1,1)   ECC (3,4)   HECC (1,1)   HECC (6,2)
average time [ms]        28.1        5.4         15.6         2.8
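The four highlighted configurations can be compared directly. The short sketch below only divides the average times reported in the table; the pairing of configurations (smallest with smallest, and the two larger ones together) is a choice made for this illustration, not a comparison made on the slide.

```python
# Average [k]P times in ms, copied from the table (configuration = (curve, nM, nB)).
avg_time_ms = {
    ("ECC", 1, 1): 28.1,
    ("ECC", 3, 4): 5.4,
    ("HECC", 1, 1): 15.6,
    ("HECC", 6, 2): 2.8,
}

def time_ratio(slow, fast):
    """How many times longer `slow` takes than `fast`."""
    return avg_time_ms[slow] / avg_time_ms[fast]

print(round(time_ratio(("ECC", 1, 1), ("HECC", 1, 1)), 2))  # 1.8
print(round(time_ratio(("ECC", 3, 4), ("HECC", 6, 2)), 2))  # 1.93
```

These ratios ignore area: the configurations paired here do not use the same number of slices, which is why the following slides plot time against area.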

Comparison ECC 256 vs HECC 128 (5/7)

[Figure: area [slices] vs. time [ms] for the ECC and HECC configurations, each point labeled with its (nM, nB) pair]

Comparison ECC 256 vs HECC 128 (6/7)

[Figure: % usage, × area and speedup for selected ECC and HECC configurations, each labeled with its (nM, nB) pair]

Comparison ECC 256 vs HECC 128 (7/7)

Source    FPGA        area                    freq.  duration [k]P
                      slices / DSP blocks     MHz    ms
ECC 1,2   Spartan 6   573 / 0                 233    17.7
ECC 1,4   Spartan 6   673 / 0                 233    11.1
ECC 2,4   Spartan 6   942 / 0                 220    6.2
ECC 3,4   Spartan 6   1 130 / 0               214    5.4
[7]       Virtex-5    1 725 / 37              291    0.38
[7]       Virtex-4    4 655 / 37              250    0.44
[6]       Virtex-4    13 661 / 0              43     9.2
[6]       Virtex-4    20 123 / 0              43     7.7

Conclusion & Current/Future Works

• HECC is efficient in hardware (40 % speedup vs ECC)
• Flexible architecture and tools for research activities
• Advanced recoding schemes are efficient in hardware

Current/future works:

• Hardware implementation of halving based method(s)
• Protections against fault injection
• HECC extensions of the accelerator (and tools)
• ASIC (CMOS 65nm) implementation of the accelerator
• Side channel evaluation of (some) proposed protections
• HW/SW code distribution under free license
• More advanced architecture/circuit level protections
• Collaboration with other research groups

Our Long Term Objectives

Study the links between:

• curves
• arithmetic algorithms
• $\mathbb{F}_q$, points representations
• architecture & units
• circuit styles

to ensure

• high security against
  ▸ theoretical attacks
  ▸ physical attacks
• low design cost
• low silicon cost
• low energy (/power)
• high performances
• high flexibility

[Figure: trade-off chart: area, delay and energy go from 1 to 1 + a, 1 + t and 1 + e, with a, t, e ∈ {0%, 5%, 10%, ..., 100%}, while security goes from 1 to ×10, ×100]

References

[1] T. Chabrier, D. Pamula, and A. Tisserand. Hardware implementation of DBNS recoding for ECC processor. In Proc. 44th Asilomar Conference on Signals, Systems and Computers, pages 1129–1133, Pacific Grove, California, U.S.A., November 2010. IEEE.

[2] T. Chabrier and A. Tisserand. On-the-fly multi-base recoding for ECC scalar multiplication without pre-computations. In A. Nannarelli, P.-M. Seidel, and P. T. P. Tang, editors, Proc. 21st Symposium on Computer Arithmetic (ARITH), pages 219–228, Austin, TX, U.S.A., April 2013. IEEE Computer Society.

[3] J. Chen, A. Tisserand, E. Popovici, and S. Cotofana. Asynchronous charge sharing power consistent Montgomery multiplier. In J. Sparso and E. Yahya, editors, Proc. 21st IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 132–138, Mountain View, California, USA, May 2015.

[4] J. Chen, A. Tisserand, E. M. Popovici, and S. Cotofana. Robust sub-powered asynchronous logic. In J. Becker and M. R. Adrover, editors, Proc. 24th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pages 1–7, Palma de Mallorca, Spain, September 2014. IEEE.

[5] G. Gallin, A. Tisserand, and N. Veyrat-Charvillon. Comparaison expérimentale d’architectures de crypto-processeurs pour courbes elliptiques et hyper-elliptiques. In Actes Conférence d’informatique en Parallélisme, Architecture et Système (ComPAS), Lille, France, June 2015. Best paper award, architecture track.

[6] S. Ghosh, M. Alam, D. Roychowdhury, and I.S. Gupta. Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side channel attacks.

[7] Y. Ma, Z. Liu, W. Pan, and J. Jing. A high-speed elliptic curve cryptographic processor for generic curves over GF(p). In Proc. 20th International Workshop on Selected Areas in Cryptography (SAC), volume 8282 of LNCS, pages 421–437, Burnaby, BC, Canada, August 2013. Springer.

[8] D. Pamula. Arithmetic Operators on GF($2^m$) for Cryptographic Applications: Performance - Power Consumption - Security Tradeoffs. PhD thesis, University of Rennes 1 and Silesian University of Technology, December 2012.

[9] D. Pamula, E. Hrynkiewicz, and A. Tisserand. Analysis of GF($2^{233}$) multipliers regarding elliptic curve cryptosystem applications. In 11th IFAC/IEEE International Conference on Programmable Devices and Embedded Systems (PDeS), pages 252–257, Brno, Czech Republic, May 2012.

[10] D. Pamula and A. Tisserand. GF($2^m$) finite-field multipliers with reduced activity variations. In 4th International Workshop on the Arithmetic of Finite Fields, volume 7369 of LNCS, pages 152–167, Bochum, Germany, July 2012. Springer.

[11] D. Pamula and A. Tisserand. Fast and secure finite field multipliers. In Proc. Euromicro Conference on Digital System Design (DSD), pages 1–8, Funchal, Portugal, August 2015.

[12] J. Proy, N. Veyrat-Charvillon, A. Tisserand, and N. Meloni. Full hardware implementation of short addition chains recoding for ECC scalar multiplication.

The end, questions?

Contact:

• mailto:arnaud.tisserand@irisa.fr
• http://people.irisa.fr/Arnaud.Tisserand/
• CAIRN Group http://www.irisa.fr/cairn/
• IRISA Laboratory, CNRS–INRIA–Univ. Rennes 1, 6 rue Kerampont, CS 80518, F-22305 Lannion cedex, France
