
Improved Three-Way Split Approach for Binary Polynomial Multiplication Based on Optimized Reconstruction


HAL Id: hal-00788646

https://hal.archives-ouvertes.fr/hal-00788646

Submitted on 14 Feb 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Christophe Negre

To cite this version:

Christophe Negre. Improved Three-Way Split Approach for Binary Polynomial Multiplication Based on Optimized Reconstruction. [Research Report] RR-1300x, Lirmm. 2013. ⟨hal-00788646⟩


Abstract

At Crypto 2009 [1], Bernstein initiated an optimization of the Karatsuba formula for binary polynomial multiplication by reorganizing the computations in the reconstruction part of two recursions of the formula. This approach was generalized in [10] to an arbitrary number of recursions, resulting in the best known bit-parallel multiplier based on the Karatsuba formula.

In this paper we extend this approach to the three-way split formula: we first reorganize two recursions and then extend this reorganization to an arbitrary number of recursions. We obtain a parallel multiplier with a space complexity of $4.68\,n^{\log_3(6)} + O(n)$ XOR gates and $n^{\log_3(6)}$ AND gates and a delay of $3\log_3(n) D_\oplus + D_\otimes$. This improves the previous best known space complexity of [2] and reaches the same time complexity as the best known approach [4].

Keywords. Binary polynomial multiplication, three-way split formula, optimized recursive reconstruction.

1 INTRODUCTION

Finite field arithmetic is part of many cryptographic applications such as elliptic curve cryptography (ECC) [6], [9], pairing cryptography [5], and some block-encryption modes such as GCM [11] and CWC [7].

Typically, the computations involved in such cryptographic protocols consist of several hundred or thousand finite field additions and multiplications. It is thus crucial to have efficient implementations of finite field operations. The two most used operations are addition and multiplication, but since addition is quite simple to implement, researchers have focused their efforts on efficiently implementing finite field multiplication.

In this paper, we focus on efficient hardware implementation of multiplication in binary extension fields, where the field size is assumed to be in the range $[2^{160}, 2^{600}]$, as used in cryptographic protocols.

A multiplication in this kind of field consists in multiplying two binary polynomials and then reducing the product modulo the irreducible polynomial defining the field. Two approaches are used to perform this modular multiplication: the first performs the multiplication and the reduction separately; the second, initiated by Mastrovito [8], performs the multiplication and the reduction at the same time by re-expressing the modular multiplication as a matrix-vector product. This latter approach was improved by Fan and Hasan in [3]: they modified the matrix in order to obtain a Toeplitz matrix, which enabled them to use a subquadratic approach for the Toeplitz matrix-vector product (TMVP). They obtained the best known binary field multiplier in terms of space and time complexities: in [3] the authors report a two-way split multiplier with $5.5\,n^{\log_2(3)} + O(n)$ XOR gates, $n^{\log_2(3)}$ AND gates and a delay of $2\log_2(n) D_\oplus + D_\otimes$. They also report a three-way split multiplier with $4.8\,n^{\log_3(6)} + O(n)$ XOR gates, $n^{\log_3(6)}$ AND gates and a delay of $3\log_3(n) D_\oplus + D_\otimes$, where $D_\oplus$ (resp. $D_\otimes$) is the delay of an XOR (resp. AND) gate.

Until recently, the approach which performs multiplication and reduction separately was not competitive with the TMVP-based approach of [3], since the two-way and three-way subquadratic methods for polynomial multiplication were not competitive with their TMVP counterparts. Recently, some progress has been made on subquadratic polynomial approaches. First, Bernstein [1] proposed an optimized four-way split approach which reduces the space requirement down to $5.46\,n^{\log_2(3)} + O(n)$ XOR gates and $n^{\log_2(3)}$ AND gates, beating the space requirement of the two-way TMVP multiplier, but the delay remained too large, i.e., $2.5\log_2(n) D_\oplus + D_\otimes$. The approach of Bernstein [1] was then extended in [10], leading to a multiplier with $5.25\,n^{\log_2(3)} + O(n)$ XOR gates, $n^{\log_2(3)}$ AND gates and a delay of $2\log_2(n) D_\oplus + D_\otimes$, which is better than the two-way split TMVP multiplier of [3].

Contributions. In this paper we extend the optimization presented in [10] to the case of the three-way split formula. We first review the best known three-way split formula, and then present the proposed optimization in the case of two recursions of this formula. We then generalize the proposed optimization to $s$ recursions: we provide an algorithm and thoroughly prove its validity. We also provide a detailed evaluation of the space and time complexities. The best three-way split multiplier we obtain has a space complexity of $4.68\,n^{\log_3(6)} + O(n)$ XOR gates and $n^{\log_3(6)}$ AND gates and a delay of $3\log_3(n) D_\oplus + D_\otimes$.

Organization of the paper. In Section 2, we review the best known three-way split polynomial formula. In Section 3, we present our optimization on two recursions of the considered three-way split formula. In Section 4, we extend this approach to $s$ recursions and establish the complexity results of the resulting multiplier. Finally, in Section 5, we compare the complexity results with the best known approaches and give some concluding remarks.

2 REVIEW OF THREE-WAY SPLIT FORMULA

Three-way split formulas for polynomial multiplication are derived from the multi-evaluation/interpolation approach or, more generally, from the Chinese remainder theorem. Let us briefly describe how the three-way formulas are obtained. For a more detailed explanation the reader may refer to [12], [13]. Let $A$ and $B$ be two binary polynomials of size $n = 3^\ell$. We split $A$ and $B$ in three parts, $A = A_L + A_M X^{n/3} + A_H X^{2n/3}$ and $B = B_L + B_M X^{n/3} + B_H X^{2n/3}$, and then replace $X^{n/3}$ by $Y$. The two polynomials $A(Y) = A_L + A_M Y + A_H Y^2$ and $B(Y) = B_L + B_M Y + B_H Y^2$ are considered as degree-two polynomials in $Y$. The product $C(Y) = A(Y) \times B(Y)$ has degree 4, so if it is computed modulo $M(Y) = Y(Y+1)(Y+\infty)(Y^2+Y+1)$ the result remains equal to $A(Y) \times B(Y)$: no reduction is performed since $M(Y)$ has a degree larger than $\deg_Y A(Y) \times B(Y)$. The Chinese remainder theorem can then be used to split up the multiplication $A(Y) \times B(Y)$ modulo $M(Y)$ into four independent modular multiplications: modulo $Y$, modulo $Y+1$, modulo $Y+\infty$ and modulo $Y^2+Y+1$.

$$
\begin{aligned}
C(0) &= A(Y)B(Y) \bmod Y = A(0)B(0) = A_L B_L, \\
C(1) &= A(Y)B(Y) \bmod (Y+1) = A(1)B(1) = (A_L + A_M + A_H)(B_L + B_M + B_H), \\
C(\infty) &= A(Y)B(Y) \bmod (Y+\infty) = A(\infty)B(\infty) = A_H B_H, \\
C(Y) \bmod (Y^2+Y+1) &= \bigl(A_L + A_H + (A_M + A_H)Y\bigr)\bigl(B_L + B_H + (B_M + B_H)Y\bigr) \bmod (Y^2+Y+1).
\end{aligned}
$$

In the above operations, the multiplication modulo $(Y^2+Y+1)$ consists of a product of two degree-one polynomials in $Y$, which can be performed with the Karatsuba formula. The Chinese remainder theorem also provides a method to reconstruct $C$ from the four terms $C(0)$, $C(1)$, $C(\infty)$ and $C(Y) \bmod (Y^2+Y+1)$. After a number of simplifications and optimizations we obtain the three-way split formula with six recursive multiplications of [2]:
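The reduced operands in the last residue follow from $Y^2 \equiv Y + 1 \pmod{Y^2+Y+1}$ over the binary field. As a short worked step (spelled out here for convenience, using only the definitions above):

```latex
\begin{align*}
A(Y) &= A_L + A_M Y + A_H Y^2 \\
     &\equiv A_L + A_M Y + A_H (Y + 1) \pmod{Y^2 + Y + 1} \\
     &= (A_L + A_H) + (A_M + A_H)\,Y ,
\end{align*}
```

and the same computation applied to $B(Y)$ gives $(B_L + B_H) + (B_M + B_H)\,Y$.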

Component polynomial formation (CPF). The component polynomial formation applied to $A$ consists in splitting $A$ in three parts $A = A_L + A_M X^{n/3} + A_H X^{2n/3}$, with $A_L$, $A_M$ and $A_H$ of degree $n/3 - 1$, and then in computing

$$
A'_0 = A_L,\quad A'_1 = A_M,\quad A'_2 = A_H,\quad A'_3 = A_L + A_M,\quad A'_4 = A_L + A_H,\quad A'_5 = A_M + A_H.
$$

The above operations require $n$ bit additions. The same is done on $B$, which results in six polynomials $B'_0, \ldots, B'_5$ of degree $n/3 - 1$.

Recursive products. This consists in computing the six pairwise products $C'_i$, $i = 0, 1, \ldots, 5$, as follows:

$$
\begin{aligned}
C'_0 &= A'_0 B'_0 = A_L B_L, & C'_1 &= A'_1 B'_1 = A_M B_M, \\
C'_2 &= A'_2 B'_2 = A_H B_H, & C'_3 &= A'_3 B'_3 = (A_L + A_M)(B_L + B_M), \\
C'_4 &= A'_4 B'_4 = (A_L + A_H)(B_L + B_H), & C'_5 &= A'_5 B'_5 = (A_M + A_H)(B_M + B_H).
\end{aligned}
\tag{1}
$$
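The component formation and the six products can be illustrated with a minimal Python sketch. It assumes that a binary polynomial is stored as a Python integer (bit $i$ is the coefficient of $X^i$), so that addition is XOR. The names `clmul`, `cpf` and `six_products` are hypothetical helpers introduced for illustration only, and the six products are computed here with a schoolbook routine rather than recursively.

```python
def clmul(a, b):
    """Schoolbook carry-less product of two binary polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def cpf(a, n):
    """Component polynomial formation for a polynomial of size n (n divisible by 3).

    Returns [A_L, A_M, A_H, A_L + A_M, A_L + A_H, A_M + A_H].
    """
    k = n // 3
    mask = (1 << k) - 1
    lo, mid, hi = a & mask, (a >> k) & mask, a >> (2 * k)
    return [lo, mid, hi, lo ^ mid, lo ^ hi, mid ^ hi]


def six_products(a, b, n):
    """The six pairwise products C'_0, ..., C'_5 of equation (1)."""
    return [clmul(x, y) for x, y in zip(cpf(a, n), cpf(b, n))]
```

For example, `cpf(0b101110011, 9)` splits the 9-bit input into the blocks `0b011`, `0b110`, `0b101` and appends their three pairwise sums.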

Reconstruction. The reconstruction of $C$ is performed as:

$$
\begin{align}
C &= C'_0 (1 + X^{n/3} + X^{2n/3}) + C'_1 X^{n/3} (1 + X^{n/3} + X^{2n/3}) + C'_2 X^{2n/3} (1 + X^{n/3} + X^{2n/3}) \notag \\
  &\quad + C'_3 X^{n/3} + C'_4 X^{2n/3} + C'_5 X^{3n/3} \tag{2} \\
  &= \underbrace{(C'_0 + C'_1 X^{n/3} + C'_2 X^{2n/3})}_{R} \times (1 + X^{n/3} + X^{2n/3}) + C'_3 X^{n/3} + C'_4 X^{2n/3} + C'_5 X^{3n/3}. \tag{3}
\end{align}
$$

This is done in three steps as stated in Algorithm 1: the first step initializes $R$, the second multiplies $R$ by $(1 + X^{n/3} + X^{2n/3})$ and the last step adds the three remaining terms $C'_3 X^{n/3}$, $C'_4 X^{2n/3}$ and $C'_5 X^{3n/3}$. The only tricky part is the multiplication by $(1 + X^{n/3} + X^{2n/3})$. The authors in [2] provide the following diagram:

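[Diagram: the three copies $R$, $R X^{n/3}$ and $R X^{2n/3}$ of $R = R_0 + R_1 X^{n/3} + R_2 X^{2n/3} + R_3 X^{3n/3}$, laid out over bit positions $0$ to $6n/3 - 1$ in blocks of size $n/3$, with the additions that occur twice marked as identical operations.]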

The identical operations mentioned in the diagram are performed only once during the computation of $R \times (1 + X^{n/3} + X^{2n/3})$; this leads to the reconstruction formula described in Algorithm 1.

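Algorithm 1 GBR1

Require: Six degree $2n/3 - 2$ polynomials $C'_i$, $i = 0, \ldots, 5$, defined in (1)
Ensure: $R$ satisfying $R = A \times B$.

// Step 1: Initialization of $R$
$R \leftarrow C'_0 + C'_1 X^{n/3} + C'_2 X^{2n/3}$ // costs $2n/3 - 2$ bit additions
// Step 2: Computation of the product $R \times (1 + X^{n/3} + X^{2n/3})$
$R = R_0 + R_1 X^{n/3} + R_2 X^{2n/3} + R_3 X^{3n/3}$ // Split, no cost
$R'_1 \leftarrow R_0 + R_1$ // costs $n/3$ bit additions
$R'_4 \leftarrow R_2 + R_3$ // costs $n/3 - 1$ bit additions
$R'_2 \leftarrow R_2 + R'_1$ // costs $n/3$ bit additions
$R'_3 \leftarrow R_1 + R'_4$ // costs $n/3$ bit additions
$R \leftarrow R_0 + R'_1 X^{n/3} + R'_2 X^{2n/3} + R'_3 X^{3n/3} + R'_4 X^{4n/3} + R_3 X^{5n/3}$ // no cost
// Step 3: Final additions
$R \leftarrow R + C'_3 X^{n/3} + C'_4 X^{2n/3} + C'_5 X^{3n/3}$ // costs $3(2n/3 - 1)$ bit additions
return($R$)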
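The three steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative software model under stated assumptions, not a description of the hardware multiplier: polynomials are stored as Python integers (bit $i$ is the coefficient of $X^i$), $n$ is a power of 3, the recursion stops at $n = 1$ with a single AND, and the names `clmul`, `gbr1`, `split3` and `mul3` are hypothetical. The schoolbook routine `clmul` is only used as a reference to check the result.

```python
def clmul(a, b):
    """Schoolbook carry-less product of two binary polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gbr1(c, n):
    """Rebuild A*B from the six products c[0..5], following the three steps of Algorithm 1."""
    k = n // 3
    mask = (1 << k) - 1
    # Step 1: R <- C'_0 + C'_1 X^{n/3} + C'_2 X^{2n/3}
    r = c[0] ^ (c[1] << k) ^ (c[2] << (2 * k))
    # Step 2: R <- R * (1 + X^{n/3} + X^{2n/3}), sharing two block sums
    r0, r1, r2, r3 = [(r >> (i * k)) & mask for i in range(4)]
    s1 = r0 ^ r1  # R'_1
    s4 = r2 ^ r3  # R'_4
    s2 = r2 ^ s1  # R'_2 = R_0 + R_1 + R_2
    s3 = r1 ^ s4  # R'_3 = R_1 + R_2 + R_3
    r = 0
    for i, block in enumerate([r0, s1, s2, s3, s4, r3]):
        r |= block << (i * k)
    # Step 3: add the three remaining products
    return r ^ (c[3] << k) ^ (c[4] << (2 * k)) ^ (c[5] << (3 * k))


def split3(p, k):
    """Split p into three blocks of k bits (low, middle, high)."""
    mask = (1 << k) - 1
    return p & mask, (p >> k) & mask, p >> (2 * k)


def mul3(a, b, n):
    """Three-way split product of two size-n polynomials, n a power of 3."""
    if n == 1:
        return a & b
    k = n // 3
    al, am, ah = split3(a, k)
    bl, bm, bh = split3(b, k)
    comps_a = [al, am, ah, al ^ am, al ^ ah, am ^ ah]
    comps_b = [bl, bm, bh, bl ^ bm, bl ^ bh, bm ^ bh]
    products = [mul3(x, y, k) for x, y in zip(comps_a, comps_b)]
    return gbr1(products, n)
```

Comparing `mul3(a, b, 27)` with `clmul(a, b)` on a few 27-bit inputs is a quick sanity check of the block bookkeeping in Step 2.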

We derive the overall space complexity of the three-way split formula, expressed in terms of the total number of bit additions $S_\oplus(n)$ and bit multiplications $S_\otimes(n)$. We add the cost of the different steps of the formula (component formation and reconstruction) and the cost of the six recursive products. The delay is obtained by finding the critical path delay of the formula. This results in the following recursive (left) and non-recursive (right) forms of the complexity:

$$
\begin{cases}
S_\oplus(n) = 6\,S_\oplus(n/3) + 6n - 6, \\
S_\otimes(n) = 6\,S_\otimes(n/3), \\
D(n) = D(n/3) + 4 D_\oplus,
\end{cases}
\qquad
\begin{cases}
S_\oplus(n) = \frac{24}{5}\, n^{\log_3(6)} - 6n + \frac{6}{5}, \\
S_\otimes(n) = n^{\log_3(6)}, \\
D(n) = 4 \log_3(n) D_\oplus + D_\otimes.
\end{cases}
$$
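The agreement between the recursive and non-recursive forms can be checked numerically. The sketch below assumes $n = 3^\ell$ (so that $n^{\log_3(6)} = 6^\ell$) and the base cases $S_\oplus(1) = 0$, $S_\otimes(1) = 1$ and $D(1) = D_\otimes$, which are the values implied by the closed forms; the function names are hypothetical.

```python
from fractions import Fraction


def xor_count(n):
    """S_xor(n) from the recurrence, assuming S_xor(1) = 0."""
    return 0 if n == 1 else 6 * xor_count(n // 3) + 6 * n - 6


def and_count(n):
    """S_and(n) from the recurrence, assuming S_and(1) = 1."""
    return 1 if n == 1 else 6 * and_count(n // 3)


def delay(n):
    """Critical path as a pair (number of XOR delays, number of AND delays)."""
    if n == 1:
        return (0, 1)
    xors, ands = delay(n // 3)
    return (xors + 4, ands)


def closed_form_xor(levels):
    """(24/5) n^{log_3 6} - 6 n + 6/5 evaluated exactly for n = 3**levels."""
    n = 3 ** levels
    return Fraction(24, 5) * 6 ** levels - 6 * n + Fraction(6, 5)
```

For instance, with $n = 3$ both forms give 12 bit additions, 6 bit multiplications and a delay of $4 D_\oplus + D_\otimes$.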

Fig. 1. Reconstruction tree of two recursions of the three-way split formula

[Tree diagram: the root $C_0^{(0)}$ is linked to six children $C_0^{(1)}, \ldots, C_5^{(1)}$, each of which is linked to six of the leaves $C_0^{(2)}, \ldots, C_{35}^{(2)}$; the link labels are those described in Section 3.]

3 OPTIMIZATION OF TWO RECURSIONS OF THREE-WAY SPLIT FORMULA

In this section, we present an optimization of the reconstruction of two recursions of the three-way split formula presented in Section 2. Two unrolled recursions of the three-way split formula consist of the following steps:

1) We first apply two recursions of the component formation to $A$ and $B$. This produces 36 terms $A_0^{(2)}, A_1^{(2)}, \ldots, A_{35}^{(2)}$ of degree $n/9 - 1$ for $A$ and 36 terms $B_0^{(2)}, B_1^{(2)}, \ldots, B_{35}^{(2)}$ for $B$. These terms are then multiplied pairwise as follows:

$C_i^{(2)} = A_i^{(2)} \times B_i^{(2)}$ for $i = 0, 1, \ldots, 35$.

2) We then perform the reconstruction by applying the three-way split reconstruction formula (3) to each group of six consecutive products $C_{6i+j}^{(2)}$, $j = 0, \ldots, 5$. We obtain six degree $2n/3 - 2$ polynomials $C_i^{(1)}$ for $i = 0, 1, \ldots, 5$. Again, we apply the reconstruction formula (3) to $C_0^{(1)}, C_1^{(1)}, \ldots, C_5^{(1)}$ to get $C = A \times B$.
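The two unrolled recursions can be modelled with a short Python sketch. It assumes polynomials stored as Python integers (bit $i$ is the coefficient of $X^i$) and $n$ divisible by 9, uses the index convention $6i + j$ for the $j$-th component of the $i$-th first-level component, and applies formula (3) directly, without the shared additions of Algorithm 1. The helper names are hypothetical.

```python
def clmul(a, b):
    """Schoolbook carry-less product of two binary polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def cpf(a, size):
    """Component polynomial formation of a polynomial of the given size."""
    k = size // 3
    mask = (1 << k) - 1
    lo, mid, hi = a & mask, (a >> k) & mask, a >> (2 * k)
    return [lo, mid, hi, lo ^ mid, lo ^ hi, mid ^ hi]


def reconstruct(c, size):
    """Formula (3) applied directly to six products of operands of size size/3."""
    k = size // 3
    r = c[0] ^ (c[1] << k) ^ (c[2] << (2 * k))
    r = r ^ (r << k) ^ (r << (2 * k))  # multiply R by (1 + X^k + X^{2k})
    return r ^ (c[3] << k) ^ (c[4] << (2 * k)) ^ (c[5] << (3 * k))


def two_recursions(a, b, n):
    """Product of two size-n polynomials via two unrolled three-way recursions."""
    # Step 1: 36 components A_i^{(2)}, B_i^{(2)} and their pairwise products
    comps_a = [x for first in cpf(a, n) for x in cpf(first, n // 3)]
    comps_b = [y for first in cpf(b, n) for y in cpf(first, n // 3)]
    c2 = [clmul(x, y) for x, y in zip(comps_a, comps_b)]
    # Step 2: rebuild the six C_i^{(1)}, then C = A * B
    c1 = [reconstruct(c2[6 * i:6 * i + 6], n // 3) for i in range(6)]
    return reconstruct(c1, n)
```

With $n = 9$ the 36 components are single bits, so the 36 products reduce to AND operations.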

In the remainder of this section, we focus only on the reconstruction part of the two recursions. We first arrange the two recursions of the reconstruction formula (3) in a reconstruction tree of depth two:

$C = C_0^{(0)}$ is the root of the tree and it is linked to six children $C_0^{(1)}, C_1^{(1)}, C_2^{(1)}, C_3^{(1)}, C_4^{(1)}$ and $C_5^{(1)}$. Each link is labeled with the factor of $C_i^{(1)}$ which appears in (3). Specifically, the link from $C_0^{(0)}$ to $C_0^{(1)}$ is labeled by $(1 + X^{n/3} + X^{2n/3})$, the link from $C_0^{(0)}$ to $C_1^{(1)}$ is labeled by $X^{n/3}(1 + X^{n/3} + X^{2n/3})$ and the link from $C_0^{(0)}$ to $C_2^{(1)}$ is labeled by $X^{2n/3}(1 + X^{n/3} + X^{2n/3})$. We also label the links from $C_0^{(0)}$ to $C_3^{(1)}$, $C_4^{(1)}$ and $C_5^{(1)}$ by $X^{n/3}$, $X^{2n/3}$ and $X^{3n/3}$, respectively.

Each child $C_i^{(1)}$, $i = 0, 1, \ldots, 5$, is also linked to six children $C_{6i+j}^{(2)}$, $j = 0, \ldots, 5$. The links to $C_{6i+j}^{(2)}$ for $j = 0, 1, 2$ are labeled by $X^{jn/9}(1 + X^{n/9} + X^{2n/9})$ and the links to $C_{6i+3}^{(2)}$, $C_{6i+4}^{(2)}$ and $C_{6i+5}^{(2)}$ are labeled by $X^{n/9}$, $X^{2n/9}$ and $X^{3n/9}$, respectively.

The resulting reconstruction tree is depicted in Fig. 1. Our goal is now to modify this reconstruction tree in order to save some computations when performing the multiplications by $(1 + X^{n/3} + X^{2n/3})$ and by $(1 + X^{n/9} + X^{2n/9})$, i.e., by factoring out these multiplications. Following the idea used in [10], we modify the reconstruction tree as follows:

The term $C_3^{(1)}$ is directly computed from the six inputs $C_{18+j}^{(2)}$, $j = 0, \ldots, 5$, using the GBR1 algorithm. We place a block GBR1 below $C_3^{(1)}$, which represents a reconstruction using Algorithm 1. The inputs of this new block GBR1 are the six terms $C_{18+j}^{(2)}$, $j = 0, \ldots, 5$. We apply the same modification to $C_4^{(1)}$ and $C_5^{(1)}$.

We place a block GBR0 on each link between the leaves $C_i^{(2)}$ for $i \in \{3, 4, 5, 9, 10, 11, 15, 16, 17\}$ and their corresponding parent. The block GBR0 represents a function which satisfies GBR0$(U) = U$ for any $U$. These modifications are not necessary to derive the optimized reconstruction formula, but they help to visualize the repetitive pattern in the modified reconstruction tree.

We finally move down the factor $X^{n/3}$ which appears on the link joining $C_0^{(0)}$ and $C_1^{(1)}$ to the six links joining $C_1^{(1)}$ to its six children. Similarly, we move down the factor $X^{2n/3}$ on the link joining $C_0^{(0)}$ and $C_2^{(1)}$ to the six links from $C_2^{(1)}$ to its six children.
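The effect of moving a factor down one level can be written out explicitly. The identities below are a worked restatement (not quoted from the paper) of how the labels below $C_1^{(1)}$ and $C_2^{(1)}$ change, where $m \in \{1, 2\}$ denotes the moved factor $X^{mn/3} = X^{3mn/9}$:

```latex
\begin{align*}
X^{mn/3} \cdot X^{jn/9}\,(1 + X^{n/9} + X^{2n/9}) &= X^{(3m+j)n/9}\,(1 + X^{n/9} + X^{2n/9}), & j &= 0, 1, 2, \\
X^{mn/3} \cdot X^{jn/9} &= X^{(3m+j)n/9}, & j &= 1, 2, 3.
\end{align*}
```

For $m = 1$ this gives the shifts $X^{3n/9}$, $X^{4n/9}$, $X^{5n/9}$ on the three links carrying the common factor and $X^{4n/9}$, $X^{5n/9}$, $X^{6n/9}$ on the other three links; after the move, the links from the root to $C_0^{(1)}$, $C_1^{(1)}$ and $C_2^{(1)}$ all carry the same label $(1 + X^{n/3} + X^{2n/3})$.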

The resulting modified reconstruction tree is shown in Fig. 2. We now derive from this modified reconstruction tree the generalized Bernstein reconstruction of depth 2.

Initialization. We consider the terms $C_i^{(2)}$ to which neither GBR0 nor GBR1 is applied. We accumulate these terms $C_i^{(2)}$, each multiplied by the factor of the form $X^{\alpha_i n/9}$ appearing on the link joining $C_i^{(2)}$ to its parent. This leads to the following expression:

$$
\begin{aligned}
R_0 ={}& C_0^{(2)} + C_1^{(2)} X^{n/9} + C_2^{(2)} X^{2n/9} + C_6^{(2)} X^{3n/9} + C_7^{(2)} X^{4n/9} \\
      &+ C_8^{(2)} X^{5n/9} + C_{12}^{(2)} X^{6n/9} + C_{13}^{(2)} X^{7n/9} + C_{14}^{(2)} X^{8n/9}.
\end{aligned}
$$

The following diagram shows the overlaps involved in the expression of $R_0$:

Fig. 2. Modified reconstruction tree of two recursions of three-way formula

[Tree diagram: the tree of Fig. 1 with a GBR0 block on the links of the leaves $C_i^{(2)}$, $i \in \{3, 4, 5, 9, 10, 11, 15, 16, 17\}$, a GBR1 block below each of $C_3^{(1)}$, $C_4^{(1)}$ and $C_5^{(1)}$, and the factors $X^{n/3}$ and $X^{2n/3}$ moved down to the links below $C_1^{(1)}$ and $C_2^{(1)}$.]

[Diagram: the nine terms $C_0^{(2)}, C_1^{(2)}, C_2^{(2)}, C_6^{(2)}, C_7^{(2)}, C_8^{(2)}, C_{12}^{(2)}, C_{13}^{(2)}, C_{14}^{(2)}$ placed at offsets $0, n/9, 2n/9, \ldots, 8n/9$, each overlapping the next one.]

We deduce from this diagram that the cost of this step is equal to $8n/9 - 8$ bit additions and that the resulting polynomial $R_0$ has degree $n + n/9 - 2$.
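The count of bit additions and the degree of $R_0$ can be reproduced with a small Python sketch. It assumes polynomials stored as Python integers, products $C_i^{(2)}$ of degree $2n/9 - 2$ (that is, $2n/9 - 1$ coefficients), and counts one bit addition for every position where a newly added term overlaps the part already accumulated. The function name `init_r0` is hypothetical.

```python
def init_r0(c2, n):
    """Accumulate R_0 from the 36 products c2 and count the overlapping bit additions."""
    k = n // 9
    term_len = 2 * k - 1                  # number of coefficients of each C_i^{(2)}
    indices = [0, 1, 2, 6, 7, 8, 12, 13, 14]
    r0, used_len, additions = 0, 0, 0
    for t, i in enumerate(indices):
        offset = t * k                    # term t is shifted by X^{t n/9}
        additions += max(0, used_len - offset)  # positions shared with earlier terms
        r0 ^= c2[i] << offset
        used_len = offset + term_len
    return r0, additions
```

With $n = 27$ each consecutive pair of terms overlaps on $n/9 - 1 = 2$ positions, which gives $8 \times 2 = 16 = 8n/9 - 8$ bit additions.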

Multiplication of depth 2. We multiply by the common factor $(1 + X^{n/9} + X^{2n/9})$ appearing in all the labels of the terms $C_i^{(2)}$, $i \in \{0, 1, 2, 6, 7, 8, 12, 13, 14\}$, accumulated in the initialization step:

$$
R_1 = R_0 (1 + X^{n/9} + X^{2n/9}).
$$

We apply the same optimization as in the GBR1 algorithm: we split up $R_1$ into a number of blocks of size $n/9$ and identify a set of identical additions in $R_0 (1 + X^{n/9} + X^{2n/9})$. In the following diagram, we show the splitting of $R_0 (1 + X^{n/9} + X^{2n/9})$ along with the identical additions, which are filled with the same pattern.
