1.7 Architectures for Public-Key Cryptography Lejla Batina, Kazuo Sakiyama, and Ingrid Verbauwhede

1.7.1 Introduction

The importance of security keeps growing because of our ever-increasing dependence on information.

Numerous examples of security applications are present in everyday life, such as purchasing goods over the Internet, secure e-mail exchange, online banking, mobile phone communication, medical applications, etc. More recent applications envision security protocols running even on RFID tags and sensor nodes.

For all these applications, there exists a range of algorithms that can provide basic cryptographic services: confidentiality, data integrity, authentication, and nonrepudiation [MOV97]. Cryptographic algorithms include stream and block ciphers, hash functions, digital signatures, public-key algorithms, etc. These algorithms are usually divided into secret-key and public-key algorithms. Stream and block ciphers are examples of the former and they allow for a fast encryption of a large amount of data.

However, effective information protection against eavesdropping and modification in open systems, as well as advanced cryptographic services, e.g., digital signatures and key exchange, can only be achieved using public-key algorithms.

The foundations of cryptography originate from Claude Shannon [Sha48] and the basic model is as follows. Alice and Bob (or any two parties) want to exchange messages over an insecure channel in such a way that an adversary Eve is not able to learn the contents of their communication (Fig. 1.30). For that purpose they use a secret key that was exchanged a priori. In modern cryptography, Kerckhoffs’s principle is assumed, which states that only the secret key k is not known to an adversary. This rule was established already in the nineteenth century by A. Kerckhoffs.

In this system, a user Alice wants to send a message m to Bob, which is called the plaintext. It can be any element of a finite set of messages and here we assume that it was converted to a bit string. Alice uses the secret key k to encrypt the message by an injective mapping E_k to a string c, which is called the ciphertext. Therefore, one can write E_k(m) = c, and this mapping E_k is called the encryption operation.

Since E_k is an injection, its inverse exists, i.e., the mapping D_k, which is called the decryption operation.

The same key k will be used by Bob for decryption of c, i.e., one can write D_k(c) = D_k(E_k(m)) = m.
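The relation D_k(E_k(m)) = m can be illustrated with a deliberately simple toy cipher. The sketch below assumes a key as long as the message and uses byte-wise XOR as the injective mapping; the function names `encrypt` and `decrypt` are illustrative and do not come from the text.

```python
import secrets


def encrypt(key: bytes, message: bytes) -> bytes:
    """E_k(m): XOR every message byte with the key byte at the same position."""
    if len(key) != len(message):
        raise ValueError("this toy cipher needs a key as long as the message")
    return bytes(m ^ k for m, k in zip(message, key))


def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """D_k(c): XOR is its own inverse, so decryption applies the same mapping."""
    return encrypt(key, ciphertext)


message = b"attack at dawn"
key = secrets.token_bytes(len(message))  # shared secret, exchanged a priori
ciphertext = encrypt(key, message)
recovered = decrypt(key, ciphertext)
```

For a fixed key the mapping is injective, which is exactly what makes the inverse D_k well defined.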

This system is the model for symmetric-key (or secret-key) cryptography. This scheme for two parties, who want to communicate securely, is based on a shared secret key. Although symmetric cryptosystems allow for large amounts of data to be transferred efficiently, key management and key distribution problems do not scale well in the case of a large number of users.

1.7.1.1 History of Public-Key Cryptography

Diffie and Hellman introduced the idea of public-key cryptography [DH76] in the mid-1970s. They showed that one can eliminate the need for prior agreement on a key, which is an evident limiting factor in the setting of private-key cryptography. The system is shown in Fig. 1.31. There exists a pair of keys (E_e, D_d) for each user instead of the unique key that all of them should own.

FIGURE 1.30 Basic model for a cryptosystem. (Alice encrypts m with the key k, E_k(m) = c; Bob decrypts, D_k(c) = m; Eve observes the channel.)

Here E and D are, respectively, the encryption and decryption mappings, and e and d are called the public and the secret key, respectively. The pair (E_e, D_d) should be easy to generate. In order to achieve secret communication, the condition D_d(c) = D_d(E_e(m)) = m is required, and it must be hard to derive d from e. The setting of a public-key cryptosystem also allows for digital signatures; they were introduced by Diffie and Hellman to uniquely bind a message to the sender. To date, numerous public-key cryptosystems have been proposed. Most of the schemes used today base their security on a small number of mathematical problems. The best-known and most commonly used public-key cryptosystems are based on factoring (RSA) and on the discrete logarithm problem in a large prime field (Diffie-Hellman, ElGamal, Schnorr, DSA) [MOV97]. The RSA public-key cryptosystem is named after its inventors Rivest, Shamir, and Adleman [RSA78]. Elliptic curve cryptography (ECC), which was proposed in the mid-1980s by Miller [Mil85] and Koblitz [Kob87], is based on a different algebraic structure, i.e., on an abelian group of points on an elliptic curve.

1.7.1.2 Applications from High-End to Extremely Constrained Devices (RFIDs and Sensor Nodes)

Public-key cryptosystems (PKC) are present today in almost all spheres of digital communication, e.g., for financial, governmental, and medical applications; they form an essential building block for network security protocols (e.g., SSL/TLS, IPsec, SSH). These are often implemented on general-purpose computers or high-end custom chips where throughput is the function to optimize. However, more recent embedded security applications such as mobile phones, PDAs, consumer electronics, automotive, and wireless applications imply much more challenging design tasks for PKC implementations. In all these cases, there are firm constraints on area, power, energy, and so on. Therefore, pervasive security is posing difficult demands on cryptography engineering. In particular, the extreme constraints imposed by RFID technology and sensor networks present open problems for cryptographic protocols as well as implementations.

1.7.1.3 Various Architectural Options for PKC

Algorithms for RSA and ECC are based on arithmetic in finite fields. The RSA algorithm requires only arithmetic in an integer ring in which all operations are modulo N, where N is an integer (a product of two large prime numbers) at least 1024 bits long for security applications nowadays. On the other hand, ECC is defined over a prime or a binary field whose elements are at least 160 bits long.

More precisely, recommended key lengths for RSA are currently at least 1024 bits. This estimate depends on the difficulty of factoring integers and on the current progress of factorization efforts, e.g., NFSNET (see www.nfsnet.org). The security of 1024-bit RSA is usually compared to that of 160-bit ECC and of an 80-bit key of a symmetric-key algorithm such as AES. However, Lenstra and Verheul estimated that, with respect to computationally equivalent security, 1024- and 1375-bit RSA are comparable to 139- and 160-bit ECC, respectively [LV00]. On the other hand, for cost-equivalent security they suggested slightly different corresponding bit-lengths. Within the ECRYPT project (European Network of Excellence for Cryptology) [ECRYPT-AZT], the numbers in Table 1.1 were suggested. Here, the security level n means that Θ(2^n) operations are needed by the best-known algorithms to break the system.

FIGURE 1.31 A public-key encryption algorithm. (Alice encrypts m with Bob’s public key e, E_e(m) = c; Bob decrypts, D_d(c) = m; Eve observes the channel.)

As both RSA and ECC rely on arithmetic modulo a large number, the crucial operation for implementations is modular multiplication. In the case of RSA and ECC over a prime field, the algorithm of Montgomery [Mon85] appears to be the best solution.

An architecture based on Montgomery’s algorithm is probably the best-studied architecture in hardware. Differences appeared because of various approaches for avoiding long carry chains.

The most common ways to do so are systolic arrays and redundant representations, e.g., residue number systems [PP98]. We discuss the former in more detail in the remainder of this chapter.

1.7.2 RSA Algorithm

The private key of a user consists of two large primes p and q and an exponent d. The public key consists of a pair (N, e), where N = pq is the modulus (at least 1024 bits) and the exponent e is such that e = d^(−1) mod λ(N). Here, we denote λ(N) = lcm(p − 1, q − 1), where lcm(a, b) is the least common multiple of a and b. The corresponding p, q, and d are kept secret. To encrypt a message M, the user Alice computes

   C = M^e mod N

and decryption is described by

   M = C^d mod N ≡ M^(1 + kφ(N)) ≡ M mod N

The previous equality follows by Fermat’s theorem [Kob94] and the fact that λ(N) is a divisor of φ(N) = (p − 1)(q − 1). The RSA function is the modular exponentiation with the public exponent e, and the private exponent d is referred to as the trapdoor to invert the function.

Hence, modular exponentiation and also modular multiplication are the most important operations, which have to be considered in detail.
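To make the key relations concrete, the following is a minimal sketch of RSA key setup, encryption, and decryption. The tiny primes p = 61 and q = 53, the exponent e = 17, and the message M = 65 are illustrative assumptions only; real moduli are at least 1024 bits long. The modular inverse uses Python’s built-in three-argument `pow` with exponent −1 (Python 3.8 or later).

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


# Toy parameters (assumptions for illustration, far too small to be secure).
p, q = 61, 53
N = p * q                    # public modulus
lam = lcm(p - 1, q - 1)      # lambda(N) = lcm(p - 1, q - 1)

e = 17                       # public exponent, must satisfy gcd(e, lambda(N)) = 1
assert gcd(e, lam) == 1
d = pow(e, -1, lam)          # private exponent: e * d = 1 mod lambda(N)

M = 65                       # message, 0 <= M < N
C = pow(M, e, N)             # encryption: C = M^e mod N
M_recovered = pow(C, d, N)   # decryption: M = C^d mod N
```

Because e·d ≡ 1 mod λ(N), raising C to the power d undoes the public exponentiation.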

1.7.2.1 The RSA Problem and Integer Factoring Problem

The RSA problem: Consider a positive integer N (that is a product of two distinct primes p and q), a positive integer e such that gcd(e, λ(N)) = 1, and an integer C; find M such that M^e ≡ C mod N. So, the RSA problem is the problem of finding e-th roots modulo a composite number N. It is related to the integer factoring problem, in this case, the problem of factoring a composite number N, which is the product of two large primes, p and q. It can be shown that if the factors of N are known, the RSA problem can be easily solved [MOV97]. The security of the RSA cryptosystem is based on the difficulty of the RSA problem. It is still the most popular cryptosystem, especially for high-end devices that are typically used in e-commerce and virtual private network (VPN) servers.

1.7.2.2 Chinese Remainder Theorem (CRT)

By means of the Chinese remainder theorem (CRT), the speed of RSA decryption can be increased up to four times (Koblitz [Kob94]). This possibility is very attractive in practical applications, especially for hardware implementations. Use of the CRT for RSA was proposed in 1982 by Quisquater and Couvreur [QC82].

TABLE 1.1 Comparison of the Key-Lengths for RSA and ECC

Security level n (symmetric-key algorithms, e.g., AES)    RSA     ECC
80                                                        1248    160
112                                                       2432    224
128                                                       3248    256

If the factors of N, i.e., p and q, are known to Bob, he can compute modular operations with moduli p and q instead of N. He computes M_p ≡ C_1^d mod p and M_q ≡ C_2^d mod q, where C_1 ≡ C mod p and C_2 ≡ C mod q. All these calculations are performed modulo the integers p and q, which are typically half the length of N. The original message M is recovered as a linear combination of M_p and M_q. The methods to reconstruct the message M are known in the literature as the algorithms of Gauss and Garner [MOV97]. These computations can be performed in Θ((lg n)^2) bit operations. (Here, lg denotes the base-2 logarithm.) Altogether, this way of decryption can reduce the workload by a factor of four if cubic complexity of exponentiation is assumed. For hardware implementations, this results in a substantial speed-up when two multiplication units are available.

More precisely, an increase of the area by a factor of two can result in a speed-up by a factor of four.
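The CRT decryption described above can be sketched as follows. This is an illustrative version under stated assumptions, not a reference implementation: the function name `crt_decrypt` is hypothetical, the recombination follows Garner’s method, and the exponents are reduced modulo p − 1 and q − 1, a common optimization that the text above does not spell out.

```python
def crt_decrypt(C: int, p: int, q: int, d: int) -> int:
    """Recover M = C^d mod (p*q) from two half-size exponentiations."""
    C1, C2 = C % p, C % q
    # Assumption: exponents are reduced mod p-1 and q-1 to shorten them.
    Mp = pow(C1, d % (p - 1), p)   # M_p = C_1^d mod p
    Mq = pow(C2, d % (q - 1), q)   # M_q = C_2^d mod q
    # Garner's recombination: M = M_q + q * ((M_p - M_q) * q^(-1) mod p)
    q_inv = pow(q, -1, p)
    h = ((Mp - Mq) * q_inv) % p
    return Mq + q * h


# Toy parameters (assumptions): same small key as a textbook-sized example.
p, q, e, d = 61, 53, 17, 413
N = p * q
C = pow(65, e, N)
M = crt_decrypt(C, p, q, d)
```

The two exponentiations are independent of each other, which is why two multiplication units can process them in parallel.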

1.7.2.3 RSA Operations

In Fig. 1.32, the structure of operations required for any RSA protocol is depicted. The basic building block consists of modular exponentiation that is based on a number of modular multiplications and squarings. On the bottom level are modular addition, subtraction, and inversion. We remind the reader that all calculations are performed either modulo the composite RSA modulus N, or modulo some prime (p or q, in the case of CRT). We explain the operations and their realizations in hardware in more detail.

1.7.2.3.1 Modular Exponentiation

The dominant cost operation in the RSA cryptosystem is modular exponentiation, namely computing M^e mod N. The basic technique for exponentiation is based on repeated squarings and multiplications (see Knuth [Knu98], p. 461). In [MOV97], this method is called left-to-right binary exponentiation (Algorithm 1). The exponent e is given here in radix-2 representation, most significant bit (MSB) first. A similar algorithm is also used for point/divisor multiplication in ECC. In this case the analogous scheme is called double-and-add or the binary method (Algorithm 2) [BSS99].

FIGURE 1.32 Hierarchy of RSA operations. (Top: RSA; middle: modular exponentiation, M^e mod N; bottom: modular arithmetic, i.e., multiplication, squaring, addition, subtraction, and inversion.)

Algorithm 1: Modular exponentiation

Input: 0 ≤ M < N and 0 < e < N, e = (e_(t−1), ..., e_1, e_0), e_i ∈ {0,1}, e_(t−1) = 1, and N
Output: R = M^e mod N

1. R ← M
2. for i from t − 2 down to 0 do
3.    R ← R · R mod N
4.    if e_i = 1 then R ← R · M mod N
5. end for
6. return R
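Algorithm 1 translates almost line by line into code. The following sketch assumes Python integers and the hypothetical function name `mod_exp`; it scans the exponent bits from the most significant one downwards, exactly as the loop from t − 2 down to 0 does.

```python
def mod_exp(M: int, e: int, N: int) -> int:
    """Left-to-right binary exponentiation: returns M^e mod N for e > 0."""
    if e <= 0:
        raise ValueError("the exponent must be positive")
    bits = bin(e)[2:]          # binary digits of e, MSB first; bits[0] is '1'
    R = M % N                  # step 1: R <- M (covers the leading 1 bit)
    for bit in bits[1:]:       # step 2: remaining bits e_(t-2) ... e_0
        R = (R * R) % N        # step 3: always square
        if bit == "1":         # step 4: multiply only when the bit is set
            R = (R * M) % N
    return R


result = mod_exp(65, 17, 3233)
```

The conditional multiplication in step 4 is the key-dependent operation that the discussion of side-channel attacks refers to.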

Algorithm 2: Point multiplication (binary method)

Input: A point P, a t-bit integer k = (k_(t−1), ..., k_1, k_0), k_i ∈ {0,1}
Output: Q = kP

1. Q ← 1 (the identity element of the group)
2. for i from t − 1 down to 0 do
3.    Q ← 2Q
4.    if k_i = 1 then Q ← Q + P
5. end for
6. return Q
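The binary method of Algorithm 2 can be tried out on a small curve. The sketch below assumes a toy curve y^2 = x^3 + 2x + 2 over the prime field F_17 with base point (5, 1); these parameters, the affine addition formulas, and all function names are illustrative choices and not part of the text. The point at infinity, which plays the role of the identity element, is represented by `None`.

```python
P_MOD, A = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17 (assumption)
INF = None                # point at infinity (group identity)


def point_add(P, Q):
    """Add two points in affine coordinates on the toy curve."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                  # P + (-P) = identity
    if P == Q:                                      # doubling: tangent slope
        s = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                           # addition: chord slope
        s = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    y3 = (s * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k: int, P):
    """Double-and-add (binary method): returns kP for k >= 0."""
    Q = INF                                         # step 1: identity element
    for bit in bin(k)[2:]:                          # bits of k, MSB first
        Q = point_add(Q, Q)                         # step 3: Q <- 2Q
        if bit == "1":                              # step 4: Q <- Q + P
            Q = point_add(Q, P)
    return Q


G = (5, 1)
doubled = scalar_mult(2, G)
```

Structurally this is the same loop as the modular exponentiation, with squaring replaced by point doubling and multiplication replaced by point addition.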

Numerous methods for speeding up exponentiation and scalar multiplication have been proposed in the literature; for a survey, see Gordon [Gor98]. Recently, side-channel security has also come to be considered an important factor in the choice of a suitable exponentiation algorithm. As this became an important research area in the last decade that is closely related to implementations, we discuss side-channel attacks in more detail.

In general, attacks on cryptography can be divided into two groups: mathematical attacks (the more traditional type of attacks, which are usually purely theoretical) and implementation attacks (a more practical type that poses a growing threat today). Implementation attacks exploit weaknesses in specific implementations of a cryptographic algorithm. Sensitive information, such as secret keys or a plaintext, can be obtained by observing the time consumed, the power consumption, the electromagnetic radiation, etc. This class of attacks is called side-channel attacks. In 1996, Paul Kocher introduced the concept of timing attacks by showing that secret information can be extracted through measurements of the execution time of cryptographic algorithms [Koc96]. Timing attacks are applicable to all implementations that have a nonconstant execution time, which depends on the bits of the secret key. Two years later, Kocher et al. performed successful attacks by measuring the power consumption while the cryptographic circuit is executing the implemented algorithm [KJJ99]. For example, conditional operations that are key-dependent (such as step 4 in Algorithms 1 and 2) can leak bits of the secret key by merely observing power consumption graphs of algorithms being performed. It is evident that constant-time implementations should remove all vulnerabilities of cryptographic applications with respect to timing attacks, and algorithms should always perform the same sequence of operations to counteract simple side-channel attacks.
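One well-known way to obtain a regular sequence of operations is the Montgomery powering ladder, a technique that is not described in the text above and is shown here only as an illustration. Every exponent bit triggers exactly one multiplication and one squaring. The function name `ladder_exp` is hypothetical, and the sketch makes no claim of being constant-time: Python’s big-integer arithmetic and the remaining branch are not side-channel safe, so the code only demonstrates the uniform operation pattern.

```python
def ladder_exp(M: int, e: int, N: int) -> int:
    """Montgomery powering ladder: returns M^e mod N for e >= 0."""
    R0, R1 = 1 % N, M % N            # invariant: R1 = R0 * M mod N
    for bit in bin(e)[2:]:           # exponent bits, MSB first
        if bit == "1":
            R0 = (R0 * R1) % N       # one multiplication ...
            R1 = (R1 * R1) % N       # ... and one squaring
        else:
            R1 = (R0 * R1) % N       # one multiplication ...
            R0 = (R0 * R0) % N       # ... and one squaring
    return R0


value = ladder_exp(65, 17, 3233)
```

Compared with Algorithm 1, the number and type of operations per iteration no longer depend on the value of the exponent bit; only the operands do.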

1.7.2.3.2 Montgomery’s Arithmetic

Modular multiplication forms the basis of modular exponentiation, which is the core operation of the RSA cryptosystem. It is also present in many other cryptographic algorithms, including those based on ECC. The most popular algorithm for modular multiplication is Montgomery’s method [Mon85]. The approach of Montgomery avoids the time-consuming trial division, the common bottleneck of other algorithms. Montgomery’s algorithm is especially suitable for hardware implementations because the division by a large number (the modulus or some prime) is replaced by reduction by a power of 2.

We give here all details of Montgomery’s arithmetic as commonly used for RSA implementations. Let N be a modulus. For a word base b = 2^r, the Montgomery radix (or parameter) R is typically chosen such that R = (2^r)^n > N. Let x be an odd integer represented by its radix-b representation x = Σ_(i=0)^(n−1) x_i b^i. There is a one-to-one correspondence between each x and its representation X = xR mod N. This representation is usually referred to as the Montgomery representation. The sum or difference of two elements in Montgomery representation is again an element in Montgomery representation. For efficient implementation of modular multiplication, the crucial operation is modular reduction, which is replaced by reduction by a number that is a power of 2, as previously mentioned.
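The properties of the Montgomery representation can be checked numerically. The sketch below assumes the toy modulus N = 3233 with r = 4 and n = 3, so that R = (2^r)^n = 4096 > N; the helper names `to_mont` and `from_mont` are illustrative.

```python
N = 3233                 # odd toy modulus (assumption)
r, n = 4, 3              # word size in bits and number of words (assumptions)
R = (2 ** r) ** n        # Montgomery radix, R = 4096 > N
R_inv = pow(R, -1, N)    # exists because gcd(R, N) = 1 for odd N


def to_mont(x: int) -> int:
    """Map x to its Montgomery representation X = x * R mod N."""
    return (x * R) % N


def from_mont(X: int) -> int:
    """Map a Montgomery representation back: x = X * R^(-1) mod N."""
    return (X * R_inv) % N


a, b = 1234, 2345
sum_in_mont = (to_mont(a) + to_mont(b)) % N   # addition stays in Montgomery form
```

Since the map is multiplication by a unit modulo N, it is a bijection on the residues 0, ..., N − 1 and it commutes with addition and subtraction.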

In the original algorithm of Montgomery, the requirements are given on the parameters R and N′ such that R > N, and R^(−1) and N′ satisfy 0 < R^(−1) < N, 0 < N′ < R, and R R^(−1) − N N′ = 1. For the computation of the Montgomery product T = XYR^(−1) mod N, Algorithm 3 was proposed by Montgomery [MOV97].

Algorithm 3: Montgomery’s modular multiplication

Input: N, N′ = −N^(−1) mod 2^r, X = (x_(n−1) ... x_1 x_0)_(2^r), Y = (y_(n−1) ... y_1 y_0)_(2^r) with 0 < X, Y < N, R = 2^(rn), gcd(N, 2) = 1
Output: T = XYR^(−1) mod N

1. T ← 0
2. for i = 0 up to n − 1 do
3.    m_i ← (t_0 + x_i y_0) N′ mod 2^r
4.    T ← (T + x_i Y + m_i N) / 2^r
5. end for
6. if (T ≥ N) then T ← T − N
7. return T
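A word-serial version of Algorithm 3 can be sketched as follows. The function name `montgomery_multiply` and the toy parameters are assumptions; N′ is taken as −N^(−1) mod 2^r, and the final comparison uses T ≥ N so that the result is always fully reduced.

```python
def montgomery_multiply(X: int, Y: int, N: int, r: int, n: int) -> int:
    """Return X * Y * R^(-1) mod N with R = 2^(r*n), for odd N and X, Y < N."""
    b = 1 << r                               # word base b = 2^r
    N_prime = (-pow(N, -1, b)) % b           # N' = -N^(-1) mod 2^r
    y0 = Y % b                               # least significant word of Y
    T = 0
    for i in range(n):
        x_i = (X >> (r * i)) & (b - 1)       # i-th radix-b digit of X
        t0 = T & (b - 1)                     # least significant word of T
        m_i = ((t0 + x_i * y0) * N_prime) % b
        T = (T + x_i * Y + m_i * N) >> r     # sum is divisible by 2^r by choice of m_i
    if T >= N:                               # final conditional subtraction
        T -= N
    return T


N, r, n = 3233, 4, 3                         # toy values (assumptions), R = 4096 > N
R = 1 << (r * n)
product = montgomery_multiply(1234, 2345, N, r, n)
```

The choice of m_i makes the low word of T + x_i·Y + m_i·N vanish, so the division by 2^r is an exact right shift and no trial division by N is needed.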

In the original algorithm of Montgomery, a modular reduction is needed in step 6. The reason is that the inputs are bounded by N, i.e., X, Y < N, whereas the output T is only bounded by 2N, i.e., T < 2N. Hence, if T ≥ N, N must be subtracted so that the output can be used as input to the next multiplication. This extra reduction slows down modular exponentiation and it also introduces a vulnerability to side-channel attacks. To avoid this subtraction, a bound for R is given by Walter [Wal02] such that for inputs X, Y < 2N the output is also bounded: T < 2N.

One possible way to compute Montgomery’s modular multiplication (MMM) is to use a digit-serial multiplier. The corresponding idea is given by Algorithm 4 for the bit-serial case, i.e., for a digit size of one bit. It computes the MMM with only additions and right-shift operations, without the final subtraction. As shown in Fig. 1.33, during the multiplication of X and Y, the modulus N is added to the intermediate product so that the LSB becomes 0, which allows for division by 2 (i.e., the right-shift operation).

Algorithm 4: Bit-serial Montgomery’s modular multiplication

Input: A k-bit integer N, X = (x_k ... x_1 x_0)_2, Y = (y_k ... y_1 y_0)_2 with 0 < X, Y < 2N − 1, R = 2^(k+2), gcd(N, 2) = 1
Output: T = XYR^(−1) mod N

1. T ← 0
2. for i = 0 up to k + 2 do
3.    m_i ← (t_0 + x_i y_0) mod 2
4.    T ← (T + x_i Y + m_i N) / 2
5. end for
6. return T
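The bit-serial multiplication without final subtraction can be sketched as below. Two points are assumptions of this sketch rather than statements of the text: the loop runs for exactly k + 2 iterations (i = 0, ..., k + 1), so that the accumulated division matches R = 2^(k+2), and m_i is reduced modulo 2 so that it is a single bit. The function name `bit_serial_mmm` is hypothetical.

```python
def bit_serial_mmm(X: int, Y: int, N: int) -> int:
    """Return T with T = X * Y * 2^-(k+2) mod N and T < 2N, for odd k-bit N."""
    k = N.bit_length()
    y0 = Y & 1
    T = 0
    for i in range(k + 2):                 # assumption: k + 2 iterations in total
        x_i = (X >> i) & 1                 # i-th bit of X (0 beyond its length)
        m_i = ((T & 1) + x_i * y0) & 1     # m_i = (t_0 + x_i * y_0) mod 2
        T = (T + x_i * Y + m_i * N) >> 1   # sum is even, so the shift is exact
    return T                               # no final subtraction


N = 3233                                   # toy odd modulus (assumption), k = 12
k = N.bit_length()
T = bit_serial_mmm(5000, 6000, N)          # inputs may be as large as 2N - 1
```

Because R = 2^(k+2) exceeds 4N, inputs below 2N yield an output below 2N, so the result can be fed directly into the next multiplication.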

Step 4 is the most critical computation in Algorithm 4, and it can be implemented with adders and 1-bit right-shift logic (Fig. 1.34). The adder can be implemented with a carry-save adder (CSA) to avoid long carry propagation (Fig. 1.35). We explain more about ways to implement modular addition below.

For further speedup, one can use a higher radix instead of radix 2. In that case an r × r-bit multiplier is used, where r is an arbitrary power of 2.

FIGURE 1.33 Computation flow of the bit-serial Montgomery’s modular multiplication.

FIGURE 1.34 Schematic representation of the hardware block that performs MMM. (Blocks shown: inputs N, Y, and X_i; a 4-to-2 carry-save adder; a 1-bit right shift, >>1; the register T; and a controller providing shift control and data control.)

FIGURE 1.35 Four-to-two CSA-based MMM corresponding to Algorithm 4. The intermediate result T is represented in carry-save form with VS and VC (T = VS + VC).

1.7.2.3.3 Modular Addition and Subtraction

Modular addition and subtraction are usually performed as in Algorithm 5 and Algorithm 6, respectively [Koc95].

Algorithm 5: Modular addition

Input: Integers A and B and modulus N
Output: C = A + B mod N

Algorithm 6: Modular subtraction

Input: Integers A and B and modulus N
Output: C = A − B mod N

1. S′ ← A − B
2. S″ ← S′ + N
3. if S′ < 0, then C ← S″
4. else C ← S′
5. return C
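Modular subtraction as in Algorithm 6, together with a matching modular addition, can be sketched as follows. The steps of `mod_add` are an assumption that mirrors Algorithm 6 (compute both candidates, then select by sign), since the steps of Algorithm 5 are not listed above. Both functions assume 0 ≤ A, B < N.

```python
def mod_add(A: int, B: int, N: int) -> int:
    """A + B mod N for 0 <= A, B < N (assumed steps, mirroring Algorithm 6)."""
    S1 = A + B               # plain sum
    S2 = S1 - N              # sum reduced once by the modulus
    return S1 if S2 < 0 else S2


def mod_sub(A: int, B: int, N: int) -> int:
    """A - B mod N for 0 <= A, B < N, following Algorithm 6."""
    S1 = A - B               # step 1: S' = A - B
    S2 = S1 + N              # step 2: S'' = S' + N
    return S2 if S1 < 0 else S1   # steps 3-4: select by the sign of S'


total = mod_add(3000, 1000, 3233)
difference = mod_sub(1000, 3000, 3233)
```

Computing both candidates and selecting one by a sign bit maps naturally onto a single adder/subtractor circuit followed by a multiplexer.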

The numbers are represented in two’s complement representation and in this way both the addition and subtraction operation can be combined into one circuit as explained by Mano and Kime [MK01].

One can use a serial adder/subtractor that consists of one full adder, two shift registers (one for A and C and the other for B), one flip-flop (for a carry bit), a counter, and a controller (Fig. 1.36). In this figure, a digit-serial addition is shown that calculates a 32-bit addition by means of a ripple carry adder (RCA) [MK01].

1.7.2.4 Hardware Architectures for RSA

Soon after its invention, the first proposals for RSA hardware implementations appeared and different architectures were proposed in the past two decades. The systolic array architecture still appears to be the best solution for modular multiplication with very long integers. This architecture has been studied intensively, both from a theoretical and a practical viewpoint.

1.7.2.5 Systolic Array Architectures

A systolic array is typically defined as a grid-like structure of special processing elements (PEs) that processes data like an n-dimensional pipeline (see Johnson et al. [JHS93]). Each line indicates a communication path and each intersection represents a cell or a systolic element.

FIGURE 1.36 Schematic representation of the hardware block that performs modular addition and subtraction. (Blocks shown: a 32-bit ripple carry adder with inputs A, B, and carry in and outputs sum and carry out; 32-bit shifts, >>32; and a controller providing shift control, data control, and carry control.)

The main advantage of this architecture is that it can easily be scaled. Scalability is one of the most important requirements of cryptographic applications nowadays. This results in increased flexibility, especially when implemented on FPGA platforms. According to Tenca and Koç [TK99], an arithmetic unit is called scalable if the unit can be reused or replicated in order to generate long-precision results independently of the data path precision for which the unit was originally designed. More precisely, the longest path should be "short" and independent of the operands’ length, and designed such that it fits even in restricted hardware regions [GTK02]. This means that the arithmetic unit can handle arbitrary bit-lengths with the exception of memory limitations. The number of clock cycles per operation depends only on the actual size of the operands. A typical scalable architecture based on a systolic array implementing Montgomery multiplication is shown in Fig. 1.37 [BM02].

The design shows a large number arithmetic unit (LNAU), which is designed as a systolic array. If two such units are available, CRT computation can be performed fully in parallel. This array is one-dimensional and consists of a fixed number of PEs. A FIFO memory is added to the design to achieve scalability. A PE contains some adders and multipliers that can process a bits of X and b bits of Y in one clock cycle. So, in one clock cycle a number of additions and multiplications can be performed, e.g., to
