
Parallel Implementation of SM2 Elliptic Curve Cryptography on Intel Processors with AVX2


Junhao Huang¹, Zhe Liu¹,², Zhi Hu³, and Johann Großschädl⁴

¹ College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China (jhhuang_nuaa@126.com)
² State Key Laboratory of Cryptology, P.O. Box 5159, Beijing, 100878, China (sduliuzhe@gmail.com)
³ Central South University, Hunan, China (huzhi_math@csu.edu.cn)
⁴ University of Luxembourg, Esch-sur-Alzette, Luxembourg (johann.groszschaedl@uni.lu)

Abstract. This paper presents an efficient and secure implementation of SM2, the Chinese elliptic curve cryptography standard that has been adopted by the International Organization for Standardization (ISO) as ISO/IEC 14888-3:2018. Our SM2 implementation uses Intel's Advanced Vector Extensions version 2.0 (AVX2), a family of three-operand SIMD instructions operating on vectors of 8, 16, 32, or 64-bit data elements in 256-bit registers, and is resistant against timing attacks. To exploit the parallel processing capabilities of AVX2, we studied the execution flows of Co-Z Jacobian point arithmetic operations and introduce a parallel 2-way Co-Z addition, Co-Z conjugate addition, and Co-Z ladder algorithm, which allow for fast Co-Z scalar multiplication. Furthermore, we developed an efficient 2-way prime-field arithmetic library using AVX2 to support our Co-Z Jacobian point operations. Both the field and the point operations utilize branch-free (i.e. constant-time) implementation techniques, which increase their ability to resist Simple Power Analysis (SPA) and timing attacks. Our software for scalar multiplication on the SM2 curve is, to our knowledge, the first constant-time implementation of the Co-Z based ladder that leverages the parallelism of AVX2.

1  Introduction

Roughly 35 years ago, Koblitz and Miller proposed to use the group of points on an elliptic curve defined over a finite field for the implementation of discrete logarithm cryptosystems [17, 19]. Today, Elliptic Curve Cryptography (ECC) is enjoying wide acceptance in the embedded/mobile domain due to the benefits of smaller key size, faster computation time, and reduced memory requirements compared to classic public-key cryptosystems [30]. Furthermore, ECC becomes increasingly popular in application domains where high data transmission rates (i.e. high throughput) are important, e.g. networking, web services, and cloud computing. The 64-bit Intel architecture plays a major role in the latter two domains, which makes a good case to optimize ECC software with respect to the computing capabilities of modern Intel processors, especially their parallel processing capabilities. In 2011, Intel presented a new set of SIMD instructions called Advanced Vector Extensions 2 (AVX2) that was first integrated into the Haswell microarchitecture. AVX2 instructions support integer operations with 256-bit vectors, which allows one to do calculations on e.g. four 64-bit integers in parallel, and have a "non-destructive" three-operand format, i.e. two source registers and one destination register. Even though AVX2 was mainly designed to accelerate graphics and video processing, it can also be leveraged to speed up cryptographic workloads like ECC computations.

The security of elliptic curve cryptosystems relies on the (presumed) hard-ness of the Elliptic Curve Discrete Logarithm Problem (ECDLP), which asks to find the scalar k given two points P and Q = kP on an elliptic curve [12]. An operation of the form kP , called scalar multiplication, is an integral part of all ECC schemes and, in general, their most computation-intensive component. In addition, the scalar multiplication can be vulnerable to side-channel attacks, in particular timing analysis or Simple Power Analysis (SPA), when implemented carelessly [4, 28]. Therefore, an efficient and secure (in the sense of side-channel resistant) implementation of the scalar multiplication is crucial for any elliptic curve cryptosystem. The Montgomery ladder algorithm, originally proposed in [20], has a very regular execution profile, which lends itself to implementations with constant execution time, provided the underlying field arithmetic satisfies certain requirements. This algorithm is not only suitable for Montgomery curves like Curve25519 [1], but has also been generalized to Weierstrass elliptic curves [4, 16]. The present paper focusses on a parallel constant-time implementation of the Montgomery ladder for Weierstrass curves (like SM2) using AVX2. An AVX2 implementation of Curve25519 can be found in e.g. [7].
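To make the regular execution pattern concrete, the following minimal C sketch shows the generic ladder structure (one addition and one doubling per scalar bit, with branch-free swaps). The point_t type and the point_* functions are hypothetical placeholders, not part of the SM2 software described in this paper.

    /* Minimal sketch (not the SM2 code from this paper) of the Montgomery ladder
     * structure: one addition and one doubling per scalar bit, with branch-free
     * conditional swaps so the operation sequence is independent of the key.
     * point_t and the point_* functions are hypothetical placeholders. */
    #include <stdint.h>

    typedef struct { int dummy; } point_t;            /* group-specific contents */

    extern void point_set_identity(point_t *r);       /* r = O                   */
    extern void point_add(point_t *r, const point_t *p, const point_t *q);
    extern void point_double(point_t *r, const point_t *p);
    extern void point_cswap(point_t *a, point_t *b, uint64_t swap);  /* constant time */

    /* r = k*P for an n-bit scalar given as bits k[n-1..0] (k[n-1] = MSB). */
    void montgomery_ladder(point_t *r, const uint8_t *k, int n, const point_t *P)
    {
      point_t r0, r1;
      point_set_identity(&r0);
      r1 = *P;
      for (int i = n - 1; i >= 0; i--) {
        uint64_t b = k[i] & 1;
        point_cswap(&r0, &r1, b);        /* work on (r0, r1) or on (r1, r0) */
        point_add(&r1, &r0, &r1);        /* r1 = r0 + r1                    */
        point_double(&r0, &r0);          /* r0 = 2*r0                       */
        point_cswap(&r0, &r1, b);        /* undo the swap                   */
      }
      *r = r0;
    }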

SM2 was introduced by the State Cryptography Administration of China in 2010 [26] and is supposed to replace RSA and other public-key cryptographic algorithms for electronic authentication systems, key management systems, and application systems. In addition, SM2 was standardized by ISO/IEC in 2018 as ISO/IEC 14888-3:2018 [15]. Hence, in the next couple of years, SM2 will have excellent application prospects in both Chinese and international commercial electronic products. For all these reasons, it makes sense to investigate how the prime-field arithmetic, elliptic curve operations, and protocols using SM2 can be implemented efficiently, e.g. by utilizing the parallel computing capabilities of AVX2. It should be noted that many of the optimizations we present in this paper can also be applied to NIST P-256 or other Weierstrass curves by simply providing a parallel 2-way implementation of the field arithmetic.

1.1 Related Work

Most existing implementations of SM2 are based on the field arithmetic and elliptic curve operations provided by the open-source software OpenSSL [21]; typical examples are GmSSL (see http://gmssl.org, accessed on 2020-05-24) and TASSL. However, implementation details of the prime-field arithmetic and the point (i.e. group) operations of SM2 are, to our knowledge, not documented in any form, which makes it difficult to reason about their efficiency.

To improve the execution time of public-key cryptographic algorithms like RSA and ECC on Intel processors, the SIMD-level parallelism of AVX2 can be exploited. Vector implementations of Montgomery modular multiplication and efficient modular exponentiation for RSA were introduced in [10, 29]. Gueron and Krasnov presented in [11] a highly-optimized AVX2 software for fixed-base scalar multiplication on NIST's P-256 curve that executes four point additions in parallel. Taking advantage of AVX2 instructions, Faz-Hernández and López [7] developed an optimized Montgomery ladder for Curve25519, which performs two field-operations (e.g. two field-multiplications) simultaneously. In order to further reduce the latency, each field-multiplication (resp. squaring) multiplies two pairs of 25 or 26-bit limbs in parallel, whereby two limbs belonging to one operand are stored in a 128-bit lane of an AVX2 register. In a recent follow-up work, Faz-Hernández et al. [8] presented fast 2-way and 4-way implementations of the field arithmetic and point operations using both the Montgomery model and the Edwards model of Curve25519. There are various other studies exploring the optimization of ECC for different vector instruction sets, such as Intel SSE2, Intel AVX-512, and ARM NEON, see e.g. [2, 24].

Parallel implementations of the Montgomery ladder for GPUs and FPGAs have also been reported, some of which use Meloni's Co-Z Jacobian arithmetic from [18]. Bos [3] introduced a low-latency 7-way GPU implementation of an (X, Z)-only Co-Z ladder for the NIST curve P-224. Peng et al. [22] presented an optimized multi-core FPGA implementation of the X-only Co-Z ladder from [13] for a set of Weierstrass curves, whereby they combined a number of Montgomery modular multipliers to work in parallel. They concluded that a 3-core implementation achieves the best throughput-resource ratio.

1.2 Our Contributions

The contribution of this paper is twofold and can be summarized as follows:

1. We present novel parallel 2-way Co-Z Jacobian point arithmetic algorithms that utilize the parallel processing capabilities of Intel's AVX2 instruction set. Our parallel Co-Z addition, Co-Z conjugate addition, and combination thereof (i.e. the Co-Z ladder step) outperform their sequential counterparts by factors of about 1.26, 1.60, and 1.33, respectively. By pre-computing two values, we managed to resolve data dependencies in the parallel execution of the Co-Z ladder algorithm and minimize its execution time. Thanks to these parallel Co-Z point operations, our Co-Z based Montgomery ladder is 1.31 times faster than a sequential Co-Z Montgomery ladder.


2. To speed up the field arithmetic, we developed a fast 2-way implementation of modular reduction and carry propagation for the SM2 prime using the AVX2 instruction set. Both are integrated into our modular multiplication and modular squaring functions, which employ a radix-2^26 representation of the operands. We aimed for resistance against timing/SPA attacks and avoided conditional statements like branch instructions to ensure the field arithmetic (and also the point arithmetic) has operand-independent execution time. To achieve this, we utilized constant-time techniques such as operand masking, Fermat-based inversion, and a highly regular ladder algorithm.

The rest of this paper is structured as follows. In Sect. 2, we first provide a brief introduction to AVX2, the representation of operands, and the notation used throughout this paper. Section 3 presents the new parallel Co-Z Jacobian arithmetic and the Co-Z based Montgomery ladder algorithm. Thereafter, in Sect. 4, we introduce our implementation of the 2-way field arithmetic for SM2 using AVX2 instructions. The results of our implementation are summarized in Sect. 5 and compared with the results from some previous papers. Finally, we give concluding remarks in Sect. 6.

2  Preliminaries

Overview of AVX2. Starting with the Haswell microarchitecture (released in 2013), modern 64-bit Intel processors support AVX2, which is, in essence, an extension of AVX to include 256-bit integer operations (classical AVX provides 256-bit floating-point instructions, but only 128-bit integer instructions). There are various AVX2 integer instructions that can be used to speed up prime-field arithmetic; the most important is VPMULUDQ (in the following abbreviated as MUL), which executes four (32 × 32)-bit multiplications in parallel and places the four 64-bit products in a 256-bit AVX2 register. Similarly, AVX2 contains instructions for parallel addition and subtraction of four packed 64-bit integers (abbreviated as ADD and SUB) [14]. Other members of the AVX2 instruction set with relevance for ECC include instructions to combine data elements from two AVX2 registers into a single one (BLEND), to shuffle data elements within a register (SHUF), to permute elements (PERM), to left/right shift elements by the same or different distances (SHL, SHR, SHLV, SHRV), to concatenate 128-bit lanes from two registers (ALIGN), and to carry out bit-wise operations (e.g. AND, XOR). We refer to [14] for a detailed description of these instructions and to [9] for information about their latency and throughput.
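For illustration, the short C snippet below shows how some of these instructions are reached from C via compiler intrinsics (VPMULUDQ as _mm256_mul_epu32, VPADDQ as _mm256_add_epi64, and so on); the operand values are arbitrary and the snippet is not taken from the SM2 implementation.

    /* Mapping of the AVX2 instructions mentioned above to C intrinsics
     * (compile with -mavx2). The operand values are arbitrary examples. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
      __m256i a = _mm256_set_epi64x(4, 3, 2, 1);
      __m256i b = _mm256_set_epi64x(8, 7, 6, 5);
      /* VPMULUDQ: four (32 x 32)->64-bit products of the low 32-bit halves */
      __m256i prod = _mm256_mul_epu32(a, b);
      /* VPADDQ / VPSUBQ: four 64-bit additions / subtractions */
      __m256i sum  = _mm256_add_epi64(a, b);
      __m256i diff = _mm256_sub_epi64(b, a);
      /* VPSLLQ: shift all four 64-bit elements by the same distance */
      __m256i sh   = _mm256_slli_epi64(a, 26);
      /* VPBLENDD: combine 32-bit elements of two registers under a mask */
      __m256i bl   = _mm256_blend_epi32(a, b, 0x33);
      long long out[4];
      _mm256_storeu_si256((__m256i *)out, prod);
      printf("products: %lld %lld %lld %lld\n", out[0], out[1], out[2], out[3]);
      (void)sum; (void)diff; (void)sh; (void)bl;
      return 0;
    }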

Representation of Field Elements. It is common practice to represent the elements of a prime field F_p as integers in the range [0, p − 1], which means they are normally stored as arrays of n-bit words that are processed with the native instructions of the target processor, e.g. (n × n)-bit multiply, n-bit add, n-bit add-with-carry, etc. While such a canonical radix-2^n representation of integers has the advantage that the total number of words k = ⌈m/n⌉ is minimal for the target platform, it entails a lot of carry propagation and, as a consequence, sub-optimal performance on modern 64-bit processors [1, 7]. Fortunately, it is possible to avoid most of the carry propagations by using a reduced-radix representation (also referred to as redundant representation [8]), which means the number of bits per limb n' is slightly less than the bitlength n of the processor's registers, e.g. n' = 51 when implementing Curve25519 for a 64-bit processor. In this way, several limbs can be added up in a 64-bit register without causing overflow and the result of a field-addition (and other arithmetic operations) does not necessarily need to be fully reduced, i.e. can be larger than p. Only at the very end of a cryptographic operation (e.g. scalar multiplication), a full reduction to the least non-negative residue and a conversion to canonical form has to be carried out.

Formally, when using a reduced radix of 2^n' (i.e. n' < n bits per limb), an m-bit integer A is represented via a sequence of limbs (a_{k'−1}, a_{k'−2}, ..., a_0) so that A = Σ_{i=0}^{k'−1} a_i · 2^{i·n'}, whereby a limb a_i does not necessarily need to be less than 2^n' but can (temporarily) become as big as 2^n − 1. Although a reduced-radix representation may increase the number of limbs k' = ⌈m/n'⌉ versus the full-radix setting (i.e. k' > k), there is typically still a net gain in performance when taking advantage of "lazy carrying" and "lazy reduction" [8]. We will use uppercase letters to denote field elements and indexed lowercase letters for the individual limbs they consist of. As is usual practice, we analyze and compare the efficiency of point operations (i.e. addition and doubling) by counting the number of multiplications (M), squarings (S), additions/subtractions (A), and inversions (I) in the underlying finite field.
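As a concrete illustration of such a reduced-radix representation, the following sketch (our own example, not the authors' code) splits a 256-bit integer given as four 64-bit words into ten limbs of at most 26 bits, i.e. the radix-2^26 representation adopted later in Sect. 4.

    /* Sketch: split a 256-bit integer, given as four 64-bit words (little-
     * endian), into ten limbs of at most 26 bits (radix-2^26). Illustration
     * only; not the conversion code of the described library. */
    #include <stdint.h>

    #define NLIMBS    10
    #define LIMB_BITS 26
    #define LIMB_MASK ((1ULL << LIMB_BITS) - 1)

    void to_radix26(uint64_t limbs[NLIMBS], const uint64_t words[4])
    {
      unsigned bitpos = 0;
      for (int i = 0; i < NLIMBS; i++) {
        unsigned w = bitpos >> 6, off = bitpos & 63;
        uint64_t v = words[w] >> off;
        if (off > 64 - LIMB_BITS && w + 1 < 4)  /* limb straddles a word boundary */
          v |= words[w + 1] << (64 - off);
        limbs[i] = v & LIMB_MASK;
        bitpos += LIMB_BITS;
      }
    }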

SM2 Elliptic Curve. The specific elliptic curve used for the implementation described in the following sections is SM2 [27], which is defined by a simplified Weierstrass equation E : y^2 = x^3 + ax + b over a prime field F_p. This field is given by the generalized-Mersenne prime p = 2^256 − 2^224 − 2^96 + 2^64 − 1 and allows for a special modular reduction method [25]. The curve parameter a is fixed to −3 to reduce the cost of the point arithmetic when using Jacobian projective coordinates [12]. A Jacobian projective point (X : Y : Z) with Z ≠ 0 corresponds to the affine point (x, y) = (X/Z^2, Y/Z^3). The projective form of the Weierstrass equation is

E : Y^2 = X^3 − 3XZ^4 + bZ^6.    (1)

Like other standardized Weierstrass curves, the cardinality #E(F_p) of the SM2 curve is prime, i.e. it has a co-factor of h = 1. The full specification of the SM2 curve can be found in [27].
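For completeness, a small sketch of the Jacobian-to-affine conversion (x, y) = (X/Z^2, Y/Z^3) is given below; the fe_t type and the fp_* functions are hypothetical field-arithmetic helpers, not the API of the software described in this paper.

    /* Sketch: convert a Jacobian point (X : Y : Z), Z != 0, into affine
     * coordinates (x, y) = (X/Z^2, Y/Z^3) at the cost of one inversion.
     * fe_t, fp_inv, fp_sqr and fp_mul are hypothetical field helpers. */
    #include <stdint.h>

    typedef struct { uint64_t limb[10]; } fe_t;    /* e.g. radix-2^26 limbs */
    extern void fp_inv(fe_t *r, const fe_t *a);                 /* r = a^(-1) mod p */
    extern void fp_sqr(fe_t *r, const fe_t *a);                 /* r = a^2 mod p    */
    extern void fp_mul(fe_t *r, const fe_t *a, const fe_t *b);  /* r = a*b mod p    */

    void jacobian_to_affine(fe_t *x, fe_t *y,
                            const fe_t *X, const fe_t *Y, const fe_t *Z)
    {
      fe_t zi, zi2, zi3;
      fp_inv(&zi, Z);            /* zi  = 1/Z   */
      fp_sqr(&zi2, &zi);         /* zi2 = 1/Z^2 */
      fp_mul(&zi3, &zi2, &zi);   /* zi3 = 1/Z^3 */
      fp_mul(x, X, &zi2);        /* x = X/Z^2   */
      fp_mul(y, Y, &zi3);        /* y = Y/Z^3   */
    }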

Meloni [18] showed that two Jacobian points can be added very efficiently when they share the same Z-coordinate. As specified by Equ. (2), the sum R = P + Q of the points P = (X_P, Y_P, Z) and Q = (X_Q, Y_Q, Z) can be computed at an overall cost of five multiplications (5M), two squarings (2S) and seven additions/subtractions (7A) in F_p, which is significantly less than the cost of a conventional point addition using Jacobian projective coordinates (12M+4S+7A, see [6, Sect. 2.3]) and lies even below the 8M+3S+7A for a "mixed" Jacobian-affine addition [6, 12]. This Co-Z point-addition technique was applied by Rivain [23] to develop a fast and regular Montgomery ladder algorithm that is suitable for scalar multiplication on Weierstrass curves and does not require the order to be divisible by 4.

A = (X_Q − X_P)^2,  B = X_P·A,  C = X_Q·A,  D = (Y_Q − Y_P)^2,  E = Y_P·(C − B)
X_R = D − (B + C),  Y_R = (Y_Q − Y_P)·(B − X_R) − E,  Z_R = Z·(X_Q − X_P)        (2)

Note that the Co-Z addition formula also yields a new representation of the point P = (X_P, Y_P, Z) because B = X_P·(X_Q − X_P)^2, E = Y_P·(X_Q − X_P)^3, and Z_R = Z·(X_Q − X_P) [23]. Consequently, (X_P, Y_P, Z) ~ (B, E, Z_R), which means this new representation of the point P and the sum R = P + Q have the same Z-coordinate. According to Equ. (3), the difference R' = P − Q = (X_R', Y_R', Z_R) can be computed with very little extra cost and has the same Z-coordinate as R = P + Q. In total, 6M+3S+11A are needed to obtain P + Q and P − Q.

A, B, C, D, and E as in Equ. (2),  F = (Y_P + Y_Q)^2
X_R' = F − (B + C),  Y_R' = (Y_P + Y_Q)·(X_R' − B) − E                           (3)

Venelli and Dassance [28] presented a further optimization of Co-Z arithmetic by eliminating the computation of the Z-coordinate from the formulae for the Co-Z addition and Co-Z conjugate addition. Concretely, they proposed a novel Co-Z Montgomery ladder algorithm based on addition formulae that compute only the X- and Y-coordinates of the intermediate points (we refer to this kind of operation as "(X, Y)-only addition"). The Z-coordinate can be recovered at the end of the ladder at little extra cost. Omitting the Z-coordinates reduces the computational cost of the Co-Z addition and the Co-Z conjugate addition by 1M to 4M+2S+7A and 5M+3S+11A, respectively. The implementation we present in this paper is based on (X, Y)-only Co-Z operations.
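The following C sketch implements the sequential (X, Y)-only Co-Z addition derived from Equ. (2), i.e. 4M+2S+7A with the Z-coordinate omitted. It is only meant to make the operation count visible; the fe_t type and fp_* helpers are hypothetical and assumed to tolerate aliasing of inputs and outputs.

    /* Sequential (X, Y)-only Co-Z addition following Equ. (2), Z omitted:
     * overwrites (xq, yq) with R = P + Q and (xp, yp) with the co-Z update
     * (B, E) of P. Cost: 4M + 2S + 7A. The hypothetical fp_* helpers are
     * assumed to allow aliasing of inputs and outputs. */
    #include <stdint.h>

    typedef struct { uint64_t limb[10]; } fe_t;
    extern void fp_mul(fe_t *r, const fe_t *a, const fe_t *b);
    extern void fp_sqr(fe_t *r, const fe_t *a);
    extern void fp_add(fe_t *r, const fe_t *a, const fe_t *b);
    extern void fp_sub(fe_t *r, const fe_t *a, const fe_t *b);

    void xycz_add(fe_t *xp, fe_t *yp, fe_t *xq, fe_t *yq)
    {
      fe_t a, b, c, d, e, t;
      fp_sub(&t, xq, xp);    /* t = Xq - Xp                 */
      fp_sqr(&a, &t);        /* A = (Xq - Xp)^2             */
      fp_mul(&b, xp, &a);    /* B = Xp * A                  */
      fp_mul(&c, xq, &a);    /* C = Xq * A                  */
      fp_sub(&t, yq, yp);    /* t = Yq - Yp                 */
      fp_sqr(&d, &t);        /* D = (Yq - Yp)^2             */
      fp_sub(&e, &c, &b);    /* e = C - B                   */
      fp_mul(&e, yp, &e);    /* E = Yp * (C - B)            */
      fp_add(xq, &b, &c);    /* Xr = D - (B + C)            */
      fp_sub(xq, &d, xq);
      fp_sub(yq, &b, xq);    /* Yr = (Yq - Yp)*(B - Xr) - E */
      fp_mul(yq, &t, yq);
      fp_sub(yq, yq, &e);
      *xp = b;               /* P gets the co-Z representation (B, E) */
      *yp = e;
    }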

3  Parallel Co-Z Jacobian Arithmetic for SM2

In this section we first demonstrate that most of the field-arithmetic operations of Co-Z addition and Co-Z conjugate addition can be executed in parallel and then we present a ladder that exploits the processing capabilities of AVX2.

3.1 Parallel Co-Z Jacobian Point Addition


Algorithm 1. SIMD_XYCZ_ADD: Parallel (X, Y)-only Co-Z addition
Input: P = (X_P, Y_P), Q = (X_Q, Y_Q).
Output: (R, P') = (P + Q, P), where P' ~ P has the same Z-coordinate as P + Q.
 1: T1 = X_P − X_Q      T2 = Y_Q − Y_P      {sub}
 2: A = T1^2            D = T2^2            {sqr}
 3: B = X_P · A         C = X_Q · A         {mul}
 4: T1 = B + C                              {add}
 5: X_R = D − T1        T3 = C − B          {sub}
 6: T1 = B − X_R                            {sub}
 7: T1 = T1 · T2        E = Y_P · T3        {mul}
 8: Y_R = T1 − E                            {sub}
 9: return ((X_R, Y_R), (B, E))

Algorithm 2. SIMD_XYCZ_ADDC: Parallel (X, Y)-only Co-Z conjugate addition
Input: P = (X_P, Y_P), Q = (X_Q, Y_Q), A' = (X_Q − X_P)^2, T' = (X_Q − X_P) · A' = C' − B'.
Output: (R, R') = (P + Q, P − Q).
 1: C = X_Q · A'        E = Y_P · T'        {mul}
 2: B = C − T'          T1 = Y_Q − Y_P      {sub}
 3: T2 = B + C          T3 = Y_P + Y_Q      {add}
 4: D = T1^2            F = T3^2            {sqr}
 5: X_R = D − T2        X_R' = F − T2       {sub}
 6: T2 = B − X_R        T4 = X_R' − B       {sub}
 7: T2 = T1 · T2        T3 = T3 · T4        {mul}
 8: Y_R = T2 − E        Y_R' = T3 − E       {sub}
 9: return ((X_R, Y_R), (X_R', Y_R'))

Many of the field operations have no sequential dependency and can, therefore, be executed in parallel. This applies, for example, to the temporary values B and C of the formulae for the Co-Z addition given in Equ. (2), which means it is possible to obtain them simultaneously with a 2-way parallel field-multiplication. Also the computation of A and D can be "paired" and performed simultaneously if the used field-arithmetic library supports 2-way parallel squaring. Algorithm 1 and Algorithm 2 are optimized implementations of the (X, Y)-only Co-Z addition and Co-Z conjugate addition, respectively, whereby the prime-field operations are scheduled to facilitate a 2-way parallel execution. Each line performs the same operation twice in parallel, using two sets of operands (the operation being carried out is commented on the right). Unfortunately, some operations of Algorithm 1 could not be paired (lines 4, 6, and 8), but those operations are relatively cheap additions and subtractions. On the other hand, all operations of Algorithm 2 are performed pair-wise, but this became possible only because of the pre-computation of A' and T' (we will discuss further details of this pre-computation below). Without the pre-computation of A' and T', the latency of the (X, Y)-only Co-Z conjugate addition would be significantly worse.


Algorithm 3. SIMD_XYCZ_ADDC_ADD: Parallel Co-Z ladder step
Input: P = (X_P, Y_P) = R_a, Q = (X_Q, Y_Q) = R_{1−a}, A' = (X_Q − X_P)^2, T' = (X_Q − X_P) · A' = C' − B', where a ∈ {0, 1} and R_a, R_{1−a} are two Co-Z Jacobian points that are intermediate results of the Montgomery ladder algorithm.
Output: (R_a, R_{1−a}) = (2R_a, R_a + R_{1−a}) and update of A' = (X_{R_a} − X_{R_{1−a}})^2 and T' = (X_{R_{1−a}} − X_{R_a}) · A'.
 1: C' = X_Q · A'       E' = Y_P · T'       {mul}
 2: B' = C' − T'        T1 = Y_Q − Y_P      {sub}
 3: T2 = B' + C'        T3 = Y_P + Y_Q      {add}
 4: D' = T1^2           F' = T3^2           {sqr}
 5: X_R = D' − T2       X_R' = F' − T2      {sub}
 6: T2 = B' − X_R       T4 = X_R' − B'      {sub}
 7: T2 = T1 · T2        T4 = T3 · T4        {mul}
 8: Y_R = T2 − E'       Y_R' = T4 − E'      {sub}
 9: T1 = X_R' − X_R     T2 = Y_R' − Y_R     {sub}
10: A = T1^2            D = T2^2            {sqr}
11: X_P = B = X_R · A   C = X_R' · A        {mul}
12: T3 = T2 + B         T4 = B + C          {add}
13: X_Q = D − T4        T1 = C − B          {sub}
14: T4 = X_Q − X_P      T3 = T3 − X_Q       {sub}
15: A' = T4^2           T3 = T3^2           {sqr}
16: T' = T4 · A'        Y_P = E = Y_R · T1  {mul}
17: T1 = D + A'         T2 = E + E          {add}
18: T3 = T3 − T1                            {sub}
19: Y_Q = (T3 − T2)/2                       {sub}
20: return ((X_Q, Y_Q), (X_P, Y_P))

More concretely, the latency of the (X, Y)-only Co-Z addition decreases from 4M+2S+7A to 2M̈+1S̈+5Ä, i.e. the delay due to multiplications and squarings is reduced by 50% (assuming that 2-way parallel field-operations have the same delay as single field-operations). We abbreviate a 2-way parallel multiplication, squaring, and addition (resp. subtraction) in F_p by M̈, S̈, and Ä, respectively, to distinguish them from the corresponding simple 1-way field operations. The 2-way parallel scheduling of the field arithmetic decreases the latency of the (X, Y)-only Co-Z conjugate addition from 5M+3S+11A for the sequential variant given by Equ. (3) to 3M̈+1S̈+6Ä (this latency includes the pre-computation of A' and T', which will be discussed below).


Algorithm 4. Co-Z based Montgomery ladder algorithm
Input: A point P ≠ O, a scalar k ∈ F_p satisfying k_{n−1} = 1.
Output: The result of the scalar multiplication R = k · P.
 1: (R_1, R_0) = XYCZ_IDBL(P)
 2: a = k_{n−2}
 3: A' = (X_{R_{1−a}} − X_{R_a})^2, T' = (X_{R_{1−a}} − X_{R_a}) · A'
 4: for i from n − 2 by 1 down to 1 do
 5:   a = (k_i + k_{i+1}) mod 2
 6:   (R_a, R_{1−a}, A', T') = SIMD_XYCZ_ADDC_ADD(R_a, R_{1−a}, A', T')
 7: end for
 8: a = (k_0 + k_1) mod 2
 9: (R_{1−a}, R_a) = SIMD_XYCZ_ADDC(R_a, R_{1−a}, A', T')
10: λ/Z = FinalInvZ(R_{1−a}, R_a, P, a)
11: (R_0, R_1) = SIMD_XYCZ_ADD(R_0, R_1)
12: return ((λ/Z)^2 · X_{R_0}, (λ/Z)^3 · Y_{R_0})

by one squaring and four additions) makes it possible to reduce the latency to 8M+6S+22A. This indicates that, in theory, a parallel implementation of the ladder step using 2-way parallel field operations could have a latency as low as 4M̈+3S̈+11Ä. However, the latency of a parallel Co-Z addition (Algorithm 1) together with the parallel Co-Z conjugate addition (Algorithm 2) amounts to 5M̈+2S̈+11Ä and does not reach this (theoretical) lower bound. We then tried to reschedule the field operations of the sequential 8M+6S+22A ladder step in order to optimize it for 2-way parallel execution, but some data dependencies did not allow us to reach the best possible latency of 4M̈+3S̈+11Ä.

In order to obtain the minimal latency, we propose to pre-compute the two terms A' = (X_Q − X_P)^2 and T' = (X_Q − X_P) · A' before entering the main loop of the ladder algorithm and to update A' and T' in each iteration (as part of the ladder step, see Algorithm 3). In this way, we managed to perfectly resolve the data dependencies and achieve a latency of 4M̈+3S̈+13Ä, which is close to the minimum (all field operations except two subtractions at the very end could be properly paired, which makes Algorithm 3 very well suited for a 2-way parallel execution of field operations). Compared to the combination of Co-Z addition and Co-Z conjugate addition, the proposed ladder step trades 1M̈ for 1S̈ and 2Ä, which reduces the latency in our case (see Sect. 5).

3.2 Parallel Co-Z based Montgomery Ladder

The Montgomery ladder can not only be used for Montgomery curves, but also for general Weierstrass curves [4, 16], which includes the SM2 curve. Venelli and Dassance [28] proposed (X, Y )-only Co-Z arithmetic and further optimized the Co-Z based Montgomery ladder algorithm by avoiding the computation of the Z-coordinate during the main loop of the scalar multiplication.


Algorithm 4 first computes the initial points (R_1, R_0) = (2P, P) for the ladder using a doubling operation with Co-Z update (called XYCZ_IDBL, see [23, Sect. C] for details). Thereafter, the two values A' and T' are pre-computed, which is necessary to minimize the latency of the parallel Co-Z ladder step as discussed before. During the execution of the main loop, our parallel Co-Z ladder algorithm with minimum latency maintains the following relationship between the ladder points:

(R_{1−a}, R_a) = (R_a + R_{1−a}, R_a − R_{1−a}),    R_a = R_{1−a} + R_a        (4)

where a = (k_i + k_{i+1}) mod 2. As Algorithm 4 shows, our parallel Co-Z based Montgomery ladder executes a parallel Co-Z ladder step in each iteration, and has therefore a regular execution profile and constant execution time. The two constants A', T' get updated with each call of the ladder-step function. At the end of the ladder, a conversion from Co-Z Jacobian coordinates to affine coordinates needs to be carried out. We perform this conversion with the function FinalInvZ, which computes Z = X_P · Y_{R_a} · (X_{R_0} − X_{R_1}) and λ = y_P · X_{R_a}, and outputs λ/Z at a cost of 1I+3M+1A, i.e. this conversion requires an inversion in F_p.

Due to the parallel (X, Y )-only Co-Z Jacobian arithmetic, our Co-Z based ladder outperforms the sequential Co-Z ladder by a factor of roughly 1.31. To the best of our knowledge, the parallel Co-Z Montgomery ladder we presented in this section is the first attempt of minimizing the latency of a variable-base scalar multiplication by combining (X, Y )-only Co-Z Jacobian point arithmetic with a 2-way parallel implementation of the prime-field arithmetic.

4  2-Way Parallel Prime-Field Arithmetic for SM2

The Co-Z based Montgomery ladder presented in the previous section requires a 2-way parallel implementation of the arithmetic operations in the underlying prime field F_p. As explained in Sect. 2, the prime field used by SM2 is defined by the 256-bit generalized-Mersenne prime p = 2^256 − 2^224 − 2^96 + 2^64 − 1. The special form of p allows one to speed up the modular reduction [25].

We explained in Sect. 2 that, in order to reduce carry propagation, it makes sense to use a reduced-radix representation on modern 64-bit processors. This is also the case when implementing multi-precision integer arithmetic for SIMD engines like AVX2 since they do not offer an add-with-carry instruction. The implementation we introduce in this section adopts a radix-2^26 representation for the field elements, i.e. a 256-bit integer consists of k' = ⌈256/26⌉ = 10 limbs and each limb is n' = 26 bits long (but can temporarily become longer). When putting four limbs into an AVX2 register, it is possible to perform four limb-multiplications in parallel, each producing a 52-bit result. However, each of the 10 limbs can become as long as 29 bits without causing an overflow during the multiplication of field elements since the sum of 10 limb-products still fits into 64 bits: 10 × 2^29 × 2^29 < 2^62. In addition, since 10 is a multiple of two, we can process the limbs in pairs, which fits the register layout described below.


The AVX2 implementation of the F_p-arithmetic we describe below performs an arithmetic operation in a 2-way parallel fashion, which means the same operation is executed twice, but with different operands. Each of the two operations uses pairs of limbs instead of a single limb as the "smallest unit" of processing. We put a limb-pair of operand A into the higher 128-bit lane of a 256-bit AVX2 register and a limb-pair of operand B into the lower 128-bit lane, i.e. there are four limbs altogether in an AVX2 register. Consequently, we need five registers to store all limbs of A and B. Similar to [7], we use a set of interleaved tuples ⟨A, B⟩_i for i ∈ [0, 5) to denote these five AVX2 registers, whereby the i-th tuple ⟨A, B⟩_i contains the four limbs [a_{2i+1}, a_{2i}, b_{2i+1}, b_{2i}].
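As an illustration of this layout (not code from the paper), the sketch below packs two field elements A and B, given as ten 64-bit words holding 26-bit limbs, into five AVX2 registers in the interleaved-tuple order described above.

    /* Sketch: pack two field elements A and B (ten 26-bit limbs each, held in
     * 64-bit words) into five AVX2 registers in the interleaved-tuple layout
     * <A,B>_i = [a_{2i+1}, a_{2i}, b_{2i+1}, b_{2i}]. */
    #include <immintrin.h>
    #include <stdint.h>

    void pack_tuples(__m256i t[5], const uint64_t a[10], const uint64_t b[10])
    {
      for (int i = 0; i < 5; i++)
        t[i] = _mm256_set_epi64x((long long)a[2*i + 1], (long long)a[2*i],   /* high lane: A */
                                 (long long)b[2*i + 1], (long long)b[2*i]);  /* low lane:  B */
    }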

4.1 Addition and Subtraction

Due to the redundant representation, the 2-way addition/subtraction over two sets of interleaved tuples, ⟨A, B⟩_i ± ⟨C, D⟩_i = ⟨A ± C, B ± D⟩_i, can be performed by executing five AVX2 add (ADD)/subtract (SUB) instructions that operate on 64-bit data elements in parallel. To avoid overflow during the addition, we assert that the length of each limb of the operands must not exceed 63 bits. On the other hand, to avoid underflow during subtraction, we add an appropriate multiple of the SM2 prime p to ⟨A, B⟩_i and then perform the subtraction. We do not reduce the result modulo p unless the next operation would overflow.
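A minimal sketch of this 2-way addition is shown below, assuming the register layout from the previous sketch; it amounts to five VPADDQ instructions and nothing else.

    /* 2-way parallel addition <A,B>_i + <C,D>_i = <A+C, B+D>_i: five VPADDQ
     * instructions, no carry propagation thanks to the redundant representation. */
    #include <immintrin.h>

    void fe2_add(__m256i r[5], const __m256i x[5], const __m256i y[5])
    {
      for (int i = 0; i < 5; i++)
        r[i] = _mm256_add_epi64(x[i], y[i]);   /* four lane-wise 64-bit additions */
    }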

4.2 Modular Multiplication and Squaring

Multiplication/Squaring. Our implementation of the 2-way parallel multiplication using AVX2 instructions was inspired by the work of Faz-Hernández and López for Curve25519 [7] (with some modifications for the SM2 prime). As shown in Algorithm 5, the outer loop (starting at line 5) traverses through the set of interleaved tuples ⟨A, B⟩_i. Since there are two limbs of each A and B in an interleaved tuple, we have two inner loops and use the SHUF instruction to separate the two limbs of A and B. Each of the inner loops traverses through the set ⟨C, D⟩_j, multiplies the tuple ⟨C, D⟩ (or ⟨C', D'⟩) by the shuffled tuple ⟨A, B⟩, and adds the obtained partial product to Z_{i+j} (or Z_{i+j+1}). Due to the radix-2^26 representation, we can assure that this multiply-accumulate process does not overflow the 64-bit data element in which the sum is kept. When the multiplication is finished, we call the function FastRed (Algorithm 6) to get the final result ⟨E, F⟩_i = ⟨A · C mod p, B · D mod p⟩_i. Squaring is quite similar to the multiplication, except that a number of MUL instructions can be replaced by left-shift (i.e. SHL) instructions, see e.g. [7, Algorithm 4].
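For reference, the following scalar (non-vectorized) sketch shows the plain limb multiplication that Algorithm 5 computes twice in parallel; it also makes the overflow argument visible, since every 64-bit accumulator holds at most ten products of 29-bit limbs.

    /* Scalar reference sketch of the limb multiplication that Algorithm 5
     * vectorizes: operands have ten limbs of at most 29 bits, and the product
     * limbs are accumulated in 64-bit words without any carry propagation
     * (10 * 2^29 * 2^29 < 2^62, so no accumulator can overflow). */
    #include <stdint.h>

    void mul_limbs(uint64_t z[20], const uint64_t a[10], const uint64_t c[10])
    {
      for (int i = 0; i < 20; i++)
        z[i] = 0;
      for (int i = 0; i < 10; i++)
        for (int j = 0; j < 10; j++)
          z[i + j] += a[i] * c[j];   /* each product is at most 58 bits */
      /* z[19] stays zero here; it only becomes non-zero once carries are
       * propagated into it during the subsequent reduction/carry steps. */
    }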


Algorithm 5. 2-way parallel multiplication using AVX2 instructions
Input: Two sets of interleaved tuples ⟨A, B⟩_i, ⟨C, D⟩_i with A, B, C, D ∈ F_p.
Output: Modular product ⟨E, F⟩_i = ⟨AC mod p, BD mod p⟩_i.
 1: Z_i = 0 for i ∈ [0, 10)
 2: for i from 0 by 1 up to 4 do
 3:   ⟨C', D'⟩_i = ALIGN(⟨C, D⟩_{i+1 mod 5}, ⟨C, D⟩_i)    {[c_{2i+2}, c_{2i+1}, d_{2i+2}, d_{2i+1}]}
 4: end for
 5: for i from 0 by 1 up to 4 do
 6:   U = SHUF(⟨A, B⟩_i, ⟨A, B⟩_i, 0x44)                  {[a_{2i}, a_{2i}, b_{2i}, b_{2i}]}
 7:   for j from 0 by 1 up to 4 do
 8:     Z_{i+j} = ADD(Z_{i+j}, MUL(U, ⟨C, D⟩_j))          {[a_{2i}c_{2j+1}, a_{2i}c_{2j}, b_{2i}d_{2j+1}, b_{2i}d_{2j}]}
 9:   end for
10:   V = SHUF(⟨A, B⟩_i, ⟨A, B⟩_i, 0xEE)                  {[a_{2i+1}, a_{2i+1}, b_{2i+1}, b_{2i+1}]}
11:   for j from 0 by 1 up to 3 do
12:     Z_{i+j+1} = ADD(Z_{i+j+1}, MUL(V, ⟨C', D'⟩_j))    {[a_{2i+1}c_{2j+2}, a_{2i+1}c_{2j+1}, b_{2i+1}d_{2j+2}, b_{2i+1}d_{2j+1}]}
13:   end for
14:   W = MUL(V, ⟨C, D⟩_4)                                {[a_{2i+1}c_0, a_{2i+1}c_9, b_{2i+1}d_0, b_{2i+1}d_9]}
15:   Z_i = ADD(Z_i, BLEND(W, [0,0,0,0], 0x33))           {[a_{2i+1}c_0, 0, b_{2i+1}d_0, 0]}
16:   Z_{i+5} = ADD(Z_{i+5}, BLEND(W, [0,0,0,0], 0xCC))   {[0, a_{2i+1}c_9, 0, b_{2i+1}d_9]}
17: end for
18: ⟨E, F⟩_i = FastRed(Z)                                 {Algorithm 6}
19: return ⟨E, F⟩_i

To speed up the modular reduction, we adapted the fast reduction method for the SM2 prime [25] to make it compatible with our radix-2^26 representation, see Algorithm 6. We use the congruence relations specified by Equ. (5) to "fold" the upper half of the 20-limb product Z, i.e. the 10 limbs z_10 to z_19 (which have a weight of between 2^260 and 2^494), into the lower half of Z. Our fast modular reduction technique replaces the large powers of two on the left side of Equ. (5), which all exceed 2^260, by sums of smaller powers of two (i.e. less than 2^260) based on the special form of p = 2^256 − 2^224 − 2^96 + 2^64 − 1. Consequently, the modular reduction boils down to basic shifts, additions, and subtractions of limbs. However, note that Equ. (5) assumes the limbs of the product Z to be 26 bits long.


Algorithm 6. 2-way parallel modular reduction using AVX2 instructions
Input: A set of interleaved tuples Z consisting of 20 limbs.
Output: Modular-reduced set of interleaved tuples ⟨E, F⟩_i consisting of 10 limbs.
 1: for i from 4 by 1 up to 8 do
 2:   L_i = AND(Z_i, [2^n' − 1, 2^n' − 1, 2^n' − 1, 2^n' − 1])
 3:   M_i = SHR(Z_i, [n', n', n', n'])
 4:   M_i = AND(M_i, [2^n' − 1, 2^n' − 1, 2^n' − 1, 2^n' − 1])
 5:   H_i = SHR(Z_i, [2n', 2n', 2n', 2n'])
 6: end for
 7: L_9 = AND(Z_9, [2^n' − 1, 2^n' − 1, 2^n' − 1, 2^n' − 1])
 8: M_9 = SHR(Z_9, [n', n', n', n'])
 9: for i from 5 by 1 up to 9 do
10:   M'_i = ALIGN(M_i, M_{i−1})
11:   Z_i = ADD(ADD(L_i, M'_i), H_{i−1})
12: end for
13: Z_9 = ALIGN(Z_9, M_9)
14: ⟨E, F⟩_i = SimpleRed(Z)
15: return ⟨E, F⟩_i

The modular reduction function specified in Algorithm 6 first converts the limbs in the upper half of the AVX2 registers into a radix-2^28 form, i.e. 28 bits per limb. Since the reduction is carried out immediately after a multiplication or squaring, the maximum limb-length can be 60 bits. In order to reduce the length of the upper limbs to 28 bits, we first split them into three parts of up to n' = 26 bits: a lower part containing the 26 least significant bits, a middle part consisting of the next 26 bits, and a higher part with the rest. Each of the parts has a certain weight, and parts of the same weight (of which there can be up to three) are added together, yielding limbs of a length of at most 28 bits. This conversion is performed by the two loops of Algorithm 6 (i.e. lines 1–12). At the end of these loops, we have an intermediate result Z of which the lower limbs are less than 60 bits long, while the upper limbs can have a length of up to 28 bits (with the exception of the limbs in Z_9, which can be up to 30 bits long). The actual modular reduction based on the congruence relations of Equ. (5) is then carried out in line 14 by the SimpleRed operation (explained in Sect. A). Note that SimpleRed produces a result consisting of 10 limbs, whereby each limb is less than 2^4 · 2^28 · 2^24 + 2^60 < 2^61 and easily fits in a 64-bit data element (this remains correct when the maximum limb-length is 30 instead of 28 bits).

4.3 Carry Propagation (Conversion to 28-bit Limbs)

The result of the SimpleRed operation consists of 10 limbs (which are stored in five AVX2 registers), whereby each limb is smaller than 2^61, i.e. no more than 61 bits long.


Algorithm 7. 2-way parallel carry propagation using AVX2 instructions
Input: A set of interleaved tuples ⟨A, B⟩_i.
Output: A set of interleaved tuples ⟨A, B⟩_i with |a_i|, |b_i| ≤ n' + 1 for i ∈ [0, 10).
 1: H_4 = SHRV(⟨A, B⟩_4, [n', 2n', n', 2n'])
 2: ⟨A, B⟩_4 = AND(⟨A, B⟩_4, [2^n' − 1, 2^2n' − 1, 2^n' − 1, 2^2n' − 1])
 3: Q = ADD(H_4, SHUF(H_4, H_4, 0x4E))
 4: Q' = SUB([0,0,0,0], Q)
 5: ⟨A, B⟩_0 = ADD(⟨A, B⟩_0, SHLV(BLEND(Q, [0,0,0,0], 0xCC), [0,4,0,4]))
 6: ⟨A, B⟩_1 = ADD(⟨A, B⟩_1, SHLV(BLEND(Q, Q', 0x33), [22,16,22,16]))
 7: ⟨A, B⟩_4 = ADD(⟨A, B⟩_4, SHLV(BLEND(Q, [0,0,0,0], 0xCC), [0,20,0,20]))
 8: for i from 0 by 1 up to 4 do
 9:   L_i = AND(⟨A, B⟩_i, [2^n' − 1, 2^n' − 1, 2^n' − 1, 2^n' − 1])
10:   M_i = SHR(⟨A, B⟩_i, [n', n', n', n'])
11:   M_i = AND(M_i, [2^n' − 1, 2^n' − 1, 2^n' − 1, 2^n' − 1])
12:   H_i = SHR(⟨A, B⟩_i, [2n', 2n', 2n', 2n'])
13: end for
14: for i from 0 by 1 up to 4 do
15:   M'_i = ALIGN(M_i, M_{i−1 mod 5})
16:   ⟨A, B⟩_i = ADD(ADD(L_i, M'_i), H_{i−1 mod 5})
17: end for
18: return ⟨A, B⟩

These limbs have to be brought back into a range suitable for subsequent operations (without computing the full residue modulo the prime p). To achieve this, each 60-bit limb has to be split into three parts as follows: a_i = h_i || m_i || l_i, where |l_i| = |m_i| = n' and |h_i| = 60 − 2n' (similar to the previous subsection). Algorithm 7 specifies the conversion. In lines 1–7 we estimate the excess bits of the highest limb, which means we estimate a value q of weight 2^260 by computing q = (h_{k'−1} || m_{k'−1} + h_{k'−2}). Then, we reduce the value q via the congruence q · 2^260 ≡ q · (2^228 + 2^100 − 2^68 + 2^4) mod p, i.e. we add q to (or subtract it from) the limbs with the corresponding weights. The code in lines 8–17 reduces the bit-length of the remaining limbs based on the equations a'_0 = l_0, a'_1 = l_1 + m_0, and a'_i = l_i + m_{i−1} + h_{i−2} for i ∈ [2, 10). This algorithm ensures |a'_i| ≤ 28 for i ∈ [0, k'−1], which means the limbs are within a safe range so that they can serve as operands in any of our field operations.
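A scalar sketch of the limb-shortening equations above is given below (our own illustration); it omits the modular folding of the excess value q of weight 2^260, which Algorithm 7 handles in lines 1–7.

    /* Scalar sketch of lines 8-17 of Algorithm 7: each limb (up to ~60 bits)
     * is split as h_i || m_i || l_i with 26-bit l_i and m_i, and recombined
     * as a'_0 = l_0, a'_1 = l_1 + m_0, a'_i = l_i + m_{i-1} + h_{i-2}.
     * The excess of the highest limbs (weight >= 2^260) is ignored here;
     * Algorithm 7 folds it back via q*2^260 = q*(2^228+2^100-2^68+2^4) mod p. */
    #include <stdint.h>

    #define NP 26
    #define NP_MASK ((1ULL << NP) - 1)

    void carry_prop_sketch(uint64_t a[10])
    {
      uint64_t l[10], m[10], h[10];
      for (int i = 0; i < 10; i++) {
        l[i] = a[i] & NP_MASK;
        m[i] = (a[i] >> NP) & NP_MASK;
        h[i] = a[i] >> (2 * NP);
      }
      a[0] = l[0];
      a[1] = l[1] + m[0];
      for (int i = 2; i < 10; i++)
        a[i] = l[i] + m[i - 1] + h[i - 2];   /* at most 28 bits per limb */
    }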

4.4 Modular Inversion

Modular inversion is the most costly among the prime-field operations needed in ECC. Using Jacobian projective coordinates, we only need one inversion to convert the result from projective to affine coordinates. The Binary Extended Euclidean Algorithm (BEEA) is a well-known algorithm for inversion, but it has an irregular execution flow and operand-dependent execution time, which can enable timing and SPA attacks [5]. Therefore, we chose Fermat's little theorem and perform the inversion through an exponentiation: a^{−1} ≡ a^{p−2} mod p.
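A minimal sketch of such a Fermat-based inversion is shown below, using a plain left-to-right square-and-multiply over the fixed public exponent p − 2; the actual software may use a shorter addition chain, and fe_t/fp_mul/fp_sqr are the hypothetical helpers from the earlier sketches.

    /* Sketch of a Fermat-based inversion a^(-1) = a^(p-2) mod p via a simple
     * left-to-right square-and-multiply over the fixed, public exponent p - 2
     * (bits given MSB first). The branch depends only on public data, so the
     * execution time is independent of the secret operand a. fe_t, fp_mul and
     * fp_sqr are hypothetical field-arithmetic helpers. */
    #include <stdint.h>

    typedef struct { uint64_t limb[10]; } fe_t;
    extern void fp_mul(fe_t *r, const fe_t *a, const fe_t *b);
    extern void fp_sqr(fe_t *r, const fe_t *a);

    void fp_inv_fermat(fe_t *r, const fe_t *a, const uint8_t exp_bits[256])
    {
      fe_t acc = *a;                     /* the top bit of p - 2 is set */
      for (int i = 1; i < 256; i++) {
        fp_sqr(&acc, &acc);
        if (exp_bits[i])                 /* public, fixed exponent bit */
          fp_mul(&acc, &acc, a);
      }
      *r = acc;
    }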


5  Performance Evaluation

We benchmarked the described implementation on a 64-bit Intel Cascade Lake processor clocked at 2.5 GHz. The execution times we present in this section were obtained by measuring the cycles for 10^6 iterations of the field operations and 10^4 iterations of the point operations, on a single core and single thread.
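A minimal example of how such cycle counts can be collected with the time-stamp counter is sketched below; it is not necessarily the measurement harness used for the numbers in this section, and bench_target() is a placeholder for the operation under test.

    /* Minimal cycle-measurement sketch using the time-stamp counter (GCC/Clang,
     * x86-64). bench_target() is a hypothetical placeholder for the operation
     * under test; this is not necessarily the harness used by the authors. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    extern void bench_target(void);   /* hypothetical function to benchmark */

    int main(void)
    {
      const uint64_t iters = 1000000;           /* 10^6 iterations */
      uint64_t start = __rdtsc();
      for (uint64_t i = 0; i < iters; i++)
        bench_target();
      uint64_t cycles = __rdtsc() - start;
      printf("%.1f cycles per call\n", (double)cycles / (double)iters);
      return 0;
    }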

Table 1. Comparison of the execution time of the 1-way and 2-way implementation of the prime-field operations (in clock cycles)

Implementation    add/sub   mul    sqr    inv      carry prop.
1-way (1 op)      5         75     65     18,459   23
2-way (2 ops)     8         99     81     –        30
Speed-up ratio    1.25      1.52   1.60   –        1.53

Table 1 provides the timings of a standard (i.e. 1-way) and a 2-way parallel implementation of the prime-field operations. The results show that, when one and the same operation has to be performed on two sets of field-elements, the 2-way parallel implementation is much faster than two subsequent executions of the basic 1-way version; the speed-up factors range from 1.25 to 1.60. Since our 2-way field arithmetic is based on techniques from [7], the execution times of the operations are similar. However, the reduction modulo the SM2 prime is more complicated (and thus slower) than the reduction for Curve25519.

Table 2. Execution time (in clock cycles) of Co-Z addition, Co-Z conjugate addition, Co-Z ladder step, and Co-Z based Montgomery ladder algorithm.

Implementation    Co-Z ADD   Co-Z ADDC   Co-Z L-Step   Co-Z Ladder
Sequential        555        786         1,334         359,868
2-way parallel    439        489         1,001         274,908
Speed-up ratio    1.26       1.60        1.33          1.31

Table 2 shows the execution times of the sequential and the 2-way parallel version of the Co-Z addition, Co-Z conjugate addition, Co-Z ladder step, and the full Co-Z based Montgomery ladder algorithm. Similarly as above, the parallel versions clearly outperform their sequential counterparts. The 2-way parallel Co-Z Montgomery ladder has an execution time of 275k clock cycles, which is 1.31 times faster than the sequential ladder. To the best of our knowledge, this paper is the first to present a Co-Z based Montgomery ladder utilizing AVX2 instructions and to demonstrate the ability of a 2-way parallel implementation of the field arithmetic to speed up Co-Z based Jacobian point operations.


Table 3. Comparison of the computational cost and execution time (in clock cycles on a Cascade Lake or Haswell processor) of our Co-Z based Montgomery ladder and other AVX2 implementations of variable-base scalar multiplication.

Implementation     Cost per bit        Additional cost   Execution time
SM2 (this work)    4M̈+3S̈+13Ä           1I+8M+7S+12A      274,908 (CL)
Curve25519 [7]     3M̈+2S̈+1C̈+4Ä         1I+1M             156,500 (H)
Curve25519 [8]     3M̈+2S̈+1C̈+4Ä         1I+1M             121,000 (H)
NIST P-256 [11]    n/a                 n/a               312,000 (H)

Table 3 compares our implementation with AVX2 implementations of Curve25519 and the NIST curve P-256. Since Curve25519 is a Montgomery curve [20], it supports an efficient "X-coordinate-only" algorithm for variable-base scalar multiplication that costs only 5M+4S+1C+8A per scalar-bit ("C" stands for a multiplication of a field-element by a curve constant, which is normally much faster than an ordinary field-multiplication). Furthermore, as already mentioned, a reduction modulo the 255-bit pseudo-Mersenne prime used by Curve25519 can be carried out more efficiently than a reduction modulo the SM2 prime. Faz-Hernández and López reported in [7] an execution time of roughly 156,500 Haswell cycles for their AVX2 implementation of Curve25519, which is significantly better than our 274,908 clock cycles for SM2 on a more recent Cascade Lake CPU. There are three main reasons for this difference in execution time. First, as shown in Table 3, the parallel version of the Co-Z based ladder step for the SM2 curve is 1M̈+1S̈+9Ä more costly than the parallel ladder step for Montgomery curves (i.e. [7, Algorithm 1]). The second reason is the higher additional cost outside the ladder loop for tasks like the initial point doubling, the computation of the values A' and T', and the recovery of the Z-coordinate at the end of the ladder. Finally, the reduction modulo the SM2 prime is more complicated, and therefore slower, than the reduction modulo p = 2^255 − 19.

Table 4. Comparison of the execution time of ECDH key exchange and ECDSA signature generation/verification using the SM2 curve on a 2.5 GHz Cascade Lake processor and the NIST curve P-256 on a 3.4 GHz Haswell processor.

Implementation    Processor      ECDH key ex.   ECDSA sign   ECDSA verify
SM2 (this work)   Cascade Lake   148 μs         24 μs        98 μs
NIST P-256 [11]   Haswell        93 μs          41 μs        122 μs


The fixed-base scalar multiplication of [11] uses a window size of 7 and performs four point additions in parallel in the AVX2 engine. On the other hand, the variable-base scalar multiplication has smaller windows of size 5 and executes the point operations sequentially (the prime-field arithmetic is written in x86-64 assembly and does not exploit the parallelism of AVX2). The execution time in the variable-base case is 312,000 Haswell cycles (about 92 μs when the clock frequency is 3.4 GHz). Gueron and Krasnov also benchmarked ECDH key exchange (computation of the shared key only), ECDSA signature generation (mainly fixed-base scalar multiplication) and the verification of an ECDSA signature; the corresponding timings are listed in Table 4.

Also given in Table 4 are the execution times of our AVX2 implementation of SM2-based ECDH key exchange, and the generation and verification of an ECDSA signature, which we measured on a 2.5 GHz Cascade Lake CPU. We implemented the fixed-base scalar multiplication (for signature generation) in the same way as [11], i.e. by means of a windowing method with a window size of 7. The execution time of our fixed-base scalar multiplication is about 64,000 Cascade Lake cycles. Note that the SM2 key exchange protocol authenticates the involved parties and, therefore, the computation of the shared key requires two scalar multiplications. This explains why SM2 key exchange is slower than a basic static ECDH key exchange. Unfortunately, a comparison of the results of our implementation with that of [11] is difficult since the micro-architectural properties and features of Haswell and Cascade Lake differ significantly.

6  Conclusions

We introduced a 2-way parallel implementation of SM2 prime-field arithmetic and Co-Z Jacobian point operations that leverage the processing capabilities of AVX2. Due to a careful rescheduling of the field arithmetic along with the pre-computation of two values outside the main loop, we managed to minimize the execution time of the parallel Co-Z ladder algorithm. More concretely, the 2-way parallel field arithmetic and Co-Z Jacobian point operations reduce the execution time of the ladder algorithm for variable-base scalar multiplication by a factor of 1.31 compared to sequential execution. Furthermore, our parallel ladder has a highly regular execution profile, which helps to achieve resistance against timing and SPA attacks. The main take-away message of this paper is that SIMD-level parallelism helps to narrow the performance gap between the classical Montgomery ladder on Montgomery curves and the Co-Z ladder on Weierstrass curves. When executed sequentially, the difference between these two scalar multiplication methods is 3M+2S (i.e. 5M+4S vs. 8M+6S), but this difference shrinks to 1M̈+1S̈ (3M̈+2S̈ vs. 4M̈+3S̈) in the case of 2-way parallel execution. Finally, we remark that all optimization techniques proposed in this paper can also be applied to the NIST curves.


of Jiangsu Province (grant no. BK20180421), the National Cryptography Development Fund (grant no. MM-JJ20180105) and the Fundamental Research Funds for the Central Universities (grant no. NE2018106). Zhi Hu is supported by the Natural Science Foundation of China (grants no. 61972420, 61602526) and the Hunan Provincial Natural Science Foundation of China (grants no. 2019JJ50827 and 2020JJ3050).

References

1. Bernstein, D.J.: Curve25519: New Diffie-Hellman speed records. In: Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) Public Key Cryptography — PKC 2006. Lecture Notes in Computer Science, vol. 3958, pp. 207–228. Springer Verlag (2006)
2. Bernstein, D.J., Schwabe, P.: NEON crypto. In: Prouff, E., Schaumont, P. (eds.) Cryptographic Hardware and Embedded Systems — CHES 2012. Lecture Notes in Computer Science, vol. 7428, pp. 320–339. Springer Verlag (2012)
3. Bos, J.W.: Low-latency elliptic curve scalar multiplication. International Journal of Parallel Programming 40(5), 532–550 (Oct 2012)
4. Brier, E., Joye, M.: Weierstraß elliptic curves and side-channel attacks. In: Naccache, D., Paillier, P. (eds.) Public Key Cryptography — PKC 2002. Lecture Notes in Computer Science, vol. 2274, pp. 335–345. Springer Verlag (2002)
5. Cabrera Aldaya, A., Cabrera Sarmiento, A.J., Sánchez-Solano, S.: SPA vulnerabilities of the binary extended Euclidean algorithm. Journal of Cryptographic Engineering 7(4), 273–285 (Nov 2017)
6. Cohen, H., Miyaji, A., Ono, T.: Efficient elliptic curve exponentiation using mixed coordinates. In: Ohta, K., Pei, D. (eds.) Advances in Cryptology — ASIACRYPT '98. Lecture Notes in Computer Science, vol. 1514, pp. 51–65. Springer Verlag (1998)
7. Faz-Hernández, A., López, J.: Fast implementation of Curve25519 using AVX2. In: Lauter, K.E., Rodríguez-Henríquez, F. (eds.) Progress in Cryptology — LATINCRYPT 2015. Lecture Notes in Computer Science, vol. 9230, pp. 329–345. Springer Verlag (2015)
8. Faz-Hernández, A., López, J., Dahab, R.: High-performance implementation of elliptic curve cryptography using vector instructions. ACM Transactions on Mathematical Software 45(3), ??–?? (Jul 2019)
9. Fog, A.: Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. Manual, available for download at http://www.agner.org/optimize/instruction_tables.pdf (2019)
10. Gueron, S., Krasnov, V.: Software implementation of modular exponentiation, using advanced vector instructions architectures. In: Özbudak, F., Rodríguez-Henríquez, F. (eds.) Arithmetic of Finite Fields — WAIFI 2012. Lecture Notes in Computer Science, vol. 7369, pp. 119–135. Springer Verlag (2012)
11. Gueron, S., Krasnov, V.: Fast prime field elliptic-curve cryptography with 256-bit primes. Journal of Cryptographic Engineering 5(2), 141–151 (Jun 2015)
12. Hankerson, D.R., Menezes, A.J., Vanstone, S.A.: Guide to Elliptic Curve Cryptography. Springer Verlag (2004)


14. Intel Corporation: Intel instruction set architecture extensions. Documentation, available online at http://software.intel.com/en-us/isa-extensions (2013)
15. International Organization for Standardization: ISO/IEC 14888-3:2018 – IT security techniques – Digital signatures with appendix – Part 3: Discrete logarithm based mechanisms (2018)
16. Izu, T., Takagi, T.: A fast parallel elliptic curve multiplication resistant against side channel attacks. In: Naccache, D., Paillier, P. (eds.) Public Key Cryptography — PKC 2002. Lecture Notes in Computer Science, vol. 2274, pp. 280–296. Springer Verlag (2002)
17. Koblitz, N.I.: Elliptic curve cryptosystems. Mathematics of Computation 48(177), 203–209 (Jan 1987)
18. Meloni, N.: New point addition formulae for ECC applications. In: Carlet, C., Sunar, B. (eds.) Arithmetic of Finite Fields — WAIFI 2007. Lecture Notes in Computer Science, vol. 4547, pp. 189–201. Springer Verlag (2011)
19. Miller, V.S.: Use of elliptic curves in cryptography. In: Williams, H.C. (ed.) Advances in Cryptology — CRYPTO '85. Lecture Notes in Computer Science, vol. 218, pp. 417–426. Springer Verlag (1986)
20. Montgomery, P.L.: Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation 48(177), 243–264 (Jan 1987)
21. OpenSSL Software Foundation: OpenSSL. Software, available for download at http://www.openssl.org (2019)
22. Peng, B.Y., Hsu, Y.C., Chen, Y.J., Chueh, D.C., Cheng, C.M., Yang, B.Y.: Multi-core FPGA implementation of ECC with homogeneous co-Z coordinate representation. In: Foresti, S., Persiano, G. (eds.) Cryptology and Network Security — CANS 2016. Lecture Notes in Computer Science, vol. 10052, pp. 637–647 (2016)
23. Rivain, M.: Fast and regular algorithms for scalar multiplication over elliptic curves. Cryptology ePrint Archive, Report 2011/338 (2011)
24. Seo, H., Liu, Z., Großschädl, J., Choi, J., Kim, H.: Montgomery modular multiplication on ARM-NEON revisited. In: Lee, J., Kim, J. (eds.) Information Security and Cryptology — ICISC 2014. Lecture Notes in Computer Science, vol. 8949, pp. 328–342. Springer Verlag (2014)
25. Solinas, J.A.: Generalized Mersenne numbers. Tech. Rep. CORR-99-39, University of Waterloo, Waterloo, Canada (1999)
26. State Cryptography Administration of China: Public key cryptographic algorithm SM2 based on elliptic curves. Specification, available for download at http://www.sca.gov.cn/sca/xwdt/2010-12/17/content_1002386.shtml (2010)
27. State Cryptography Administration of China: Recommended curve parameters of public key cryptographic algorithm SM2 based on elliptic curves. Specification, available for download at http://www.sca.gov.cn/sca/xwdt/2010-12/17/content_1002386.shtml (2010)
28. Venelli, A., Dassance, F.: Faster side-channel resistant elliptic curve scalar multiplication. In: Kohel, D., Rolland, R. (eds.) Contemporary Mathematics (Volume 512), pp. 29–40. American Mathematical Society (2010)
29. Zhao, Y., Pan, W., Lin, J., Liu, P., Xue, C., Zheng, F.: PhiRSA: Exploiting the computing power of vector instructions on Intel Xeon Phi for RSA. In: Avanzi, R.M., Heys, H.M. (eds.) Selected Areas in Cryptography — SAC 2016. Lecture Notes in Computer Science, vol. 10532, pp. 482–500. Springer Verlag (2016)
30. Zhou, L., Su, C., Hu, Z., Lee, S., Seo, H.: Lightweight implementations of NIST


A  SimpleRed Operation

Based on the congruence relations in Equ. (5), we add or subtract each of the upper limbs z_i with i ∈ [10, 20) to/from the corresponding lower limbs in Z to obtain the residue ⟨E, F⟩_i from the intermediate result Z. For example, all the terms with weight 2^0 ~ 2^26 and 2^26 ~ 2^52 will be added to or subtracted from Z_0 to obtain ⟨E, F⟩_0. Similarly to Z_0, the terms with other weights will be added to or subtracted from the corresponding terms of the intermediate result Z. The details are fully specified in Algorithm 8, which executes only simple addition (resp. subtraction), shift, and permutation instructions.

Algorithm 8. 2-way parallel SimpleRed operation using AVX2 instructions
Input: An intermediate result Z consisting of 20 limbs.
Output: A modular residue ⟨E, F⟩_i consisting of ten 28-bit limbs.
 1: ⟨E, F⟩_0 = ADD(Z_0, SHL(Z_5, 4))
 2: ⟨E, F⟩_0 = ADD(⟨E, F⟩_0, SHL(Z_6, 24))
 3: ⟨E, F⟩_0 = ADD(⟨E, F⟩_0, SHL(SHUF(Z_6, Z_7, 0x5), 18))
 4: ⟨E, F⟩_0 = ADD(⟨E, F⟩_0, SHL(Z_7, 12))
 5: ⟨E, F⟩_0 = ADD(⟨E, F⟩_0, SHL(SHUF(Z_7, Z_8, 0x5), 6))
 6: ⟨E, F⟩_0 = ADD(⟨E, F⟩_0, SHL(Z_9, 21))
 7: ⟨E, F⟩_0 = ADD(⟨E, F⟩_0, SHLV(BLEND(PERM64(Z_9, 0xB1), Z_8, 0xCC), [1,15,1,15]))
 8: Z'_8 = SUB([0,0,0,0], Z_8),  Z'_9 = SUB([0,0,0,0], Z_9)
 9: ⟨E, F⟩_1 = SUB(Z_1, SHL(Z_5, 16))
10: ⟨E, F⟩_1 = ADD(⟨E, F⟩_1, SHLV(BLEND(Z_6, PERM64(Z_5, 0xB1), 0xCC), [22,4,22,4]))
11: ⟨E, F⟩_1 = SUB(⟨E, F⟩_1, SHLV(SHUF([0,0,0,0], Z_6, 0x5), [10,0,10,0]))
12: ⟨E, F⟩_1 = ADD(⟨E, F⟩_1, SHL(BLEND(Z_8, Z'_8, 0xCC), 12))
13: ⟨E, F⟩_1 = ADD(⟨E, F⟩_1, SHL(SHUF(Z_8, Z'_9, 0x5), 6))
14: ⟨E, F⟩_1 = ADD(⟨E, F⟩_1, SHLV(BLEND(Z_9, Z_7, 0xCC), [24,1,24,1]))
15: ⟨E, F⟩_2 = ADD(Z_2, SHL(SHUF(Z_5, Z_6, 0x5), 22))
16: ⟨E, F⟩_2 = ADD(⟨E, F⟩_2, SHL(Z_7, 4))
17: ⟨E, F⟩_2 = ADD(⟨E, F⟩_2, SHL(Z_8, 24))
18: ⟨E, F⟩_2 = ADD(⟨E, F⟩_2, SHL(SHUF(Z_8, Z_9, 0x5), 19))
19: ⟨E, F⟩_2 = ADD(⟨E, F⟩_2, SHL(Z_9, 12))
20: ⟨E, F⟩_2 = ADD(⟨E, F⟩_2, SHL(SHUF(Z_9, [0,0,0,0], 0x5), 6))
21: ⟨E, F⟩_3 = ADD(Z_3, SHL(SHUF(Z_6, Z_7, 0x5), 22))
22: ⟨E, F⟩_3 = ADD(⟨E, F⟩_3, SHL(Z_8, 4))
23: ⟨E, F⟩_3 = ADD(⟨E, F⟩_3, SHL(Z_9, 24))
24: ⟨E, F⟩_3 = ADD(⟨E, F⟩_3, SHL(SHUF(Z_9, [0,0,0,0], 0x5), 19))
25: ⟨E, F⟩_4 = ADD(Z_4, SHL(Z_5, 20))
26: ⟨E, F⟩_4 = ADD(⟨E, F⟩_4, SHLV(SHUF(Z_7, Z_7, 0x5), [2,23,2,23]))
27: ⟨E, F⟩_4 = ADD(⟨E, F⟩_4, SHL(Z_9, 4))
28: T = ADD(SHLV(Z_6, [8,14,8,14]), SHLV(Z_8, [17,23,17,23]))
29: T = ADD(T, SHLV(Z_9, [5,11,5,11]))

Références

Documents relatifs

The modelDescrip- tion.xml file of the Matryoshka FMU is later generated based on the information specified in this interface class, and especially the list and characteristics of

We determine the structure of the injective limit of the cohomology groups by analysing the relative cohomology groups using Theorem 2.16.. So in what follows, we assume that N is

We discuss a parallel împlementation of the domain décomposition method based on the macro-hybrid formulation of a second order elhptic équation and on an approximation by the

[r]

Astr´ ee being focused on synchronous, statically scheduled reactive programs, we used that peculiar form of the program to be analyzed in order to design a very sim- ple,

The rejectOne operation is a hybrid of selectOne and select in that it does not need to evaluate the predicate on the entire source collection, but it returns a collection instead of

Лучшие результаты получены при использовании 30 процессов на Xeon Phi только для простых четырехмерных задач, все остальные задачи лучше всего

groupe des rotations est distingué (par exemple en l'exhibant comme noyau d'un morphisme de groupes) mais que le sous-groupe engendré par une réex- ion ne l'est pas.. 0.3 Soit n