Faster polynomial multiplication over finite fields

(1)

HAL Id: hal-01022757

https://hal.archives-ouvertes.fr/hal-01022757v2

Preprint submitted on 11 Feb 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

David Harvey, Joris van der Hoeven, Grégoire Lecerf

To cite this version:

David Harvey, Joris van der Hoeven, Grégoire Lecerf. Faster polynomial multiplication over finite

fields. 2014. �hal-01022757v2�

(2)

over finite fields

David Harvey

School of Mathematics and Statistics University of New South Wales

Sydney NSW 2052 Australia

Email: [email protected]

Joris van der Hoeven ^a , Grégoire Lecerf ^b Laboratoire d’informatique

UMR 7161 CNRS École polytechnique 91128 Palaiseau Cedex

France

a. Email: [email protected] b. Email: [email protected] February 11, 2015

Let p be a prime, and let M p (n) denote the bit complexity of multiplying two polynomials in F p [X ] of degree less than n. For n large compared to p, we establish the bound M p (n) = O(n log n 8 ^log

^∗

ⁿ log p), where log ^∗ is the iterated logarithm. This is the first known Fürer- type complexity bound for F p [X], and improves on the previously best known bound M p (n) = O(n log n log log nlog p).

Keywords: Polynomial multiplication, finite field, algorithm, complexity bound, FFT A.C.M. subject classification: G.1.0 Computer-arithmetic, F.2.1 Computations in finite fields

A.M.S. subject classification: 68W30, 68Q17, 68W40

1. Introduction

Given a ring R, a basic problem in complexity theory is to find an upper bound for the cost of multiplying two polynomials in R[X ] of degree less than n. Several complexity models may be considered. In algebraic complexity models, such as the straight-line program model [9, Chapter 4], we count the number of ring operations in R, denoted M _R (n). In this model, the best currently known bound M _R (n) = O(n log n log log n) was obtained by Cantor and Kaltofen [10]. More precisely, their algorithm performs O(n log n log log n) additions and subtractions and O(n log n) multiplications in R. Their bound may be viewed as an algebraic analogue of the Schönhage–

Strassen result for integer multiplication [32] and generalises previous work by Schönhage [31].

These algorithms all rely on suitable incarnations of the fast Fourier transform (FFT) [12]. For details we refer the reader to [17, Chapter 8].

In this paper we are mainly interested in the case that R is the finite field F _p = Z / p Z for some prime p. The standard bit complexity model based on deterministic multitape Turing machines [26]

is more realistic in this setting, as it takes into account the dependence on p. We write M _p (n) for the bit complexity of multiplying two polynomials in F _p [X ] of degree less than n. Two basic approaches are known for obtaining good asymptotic bounds for M _p (n).

If n is not too large compared to p, then one may use Kronecker substitution, which converts the problem to integer multiplication by packing the coefficients into large integers. Let I (n) denote the bit complexity of n-bit integer multiplication. Throughout the paper, we make the customary assumption that I (n)/n is increasing, and for convenience we define I (x) := I ( ⌈ x ⌉ ) for x ∈ R , x > 0.

According to our recent sharpening [18] of Fürer’s bound [15], we may take

I (n) = O(n log n 8 ^log

^∗

ⁿ ), (1.1)

where log ^∗ x denotes the iterated logarithm for x ∈ R , i.e.,

log ^∗ x := min { k ∈ N : log ^◦k x 6 1 } , (1.2) log ^◦k := log ◦ ···

k× ◦ log.

1

(3)

If log n = O(log p), then Kronecker substitution yields M _p (n) = O( I (n log p)) (see section 2.6), so we have

M _p (n) = O n log p log (n log p) 8 ^log

^∗

⁽ⁿ ^log ^p) .

However, Kronecker substitution leads to inferior bounds when n is large compared to p, due to coefficient growth in the lifted product in Z [X ]. In this situation the best existing method is the algebraic Schönhage–Strassen algorithm (i.e., the Cantor–Kaltofen algorithm). In the Turing model this yields the bound

M _p (n) = O(n log n log log n log p + n log n I (log p)).

The first term dominates for large enough n, say for log log p 8 ^log

^∗

^p = O(log log n), and we get simply

M _p (n) = O(n log n log log n log p). (1.3)

Since Fürer’s breakthrough result, there has been a gap between what is known for integer multiplication and for multiplication in F _p [X ]. Namely, it has remained an open question whether the factor log log n appearing in (1.3) can be improved to K ^log

^∗

ⁿ for some K > 1. It seems that Fürer’s construction does not work in F _p [X].

1.1. Our contributions

The main result of this paper is a new algorithm that completely closes the gap mentioned above.

Theorem 1.1 . The bound

M _p (n) = O n log p log(n log p) 8 ^log

^∗

^(nlog ^p) holds uniformly for all n > 2 and all primes p.

The bound follows immediately via Kronecker substitution if log n = O(log p). The bulk of the paper concerns the reverse case where n is large compared to p. For example, if log log p =O(log n), the bound becomes simply

M _p (n) = O(n log n 8 ^log

^∗

ⁿ log p).

As promised, this replaces the factor log log n in (1.3) by 8 ^log

^∗

ⁿ .

The basic idea of the new algorithm is as follows. We first construct a special extension F _p

κ

of F _p , where κ is exponentially smaller than n, and for which p ^κ − 1 has many small divisors. These divisors are themselves exponentially smaller than n, yet they are so numerous that their product is comparable to n. We now convert the given multiplication problem in F _p [X] to multiplication in F _p

κ

[Y ], by cutting up the polynomials into small pieces, and we then multiply in F _p

κ

[Y ] by using FFTs over F _p

κ

. Applying the Cooley–Tukey method to the small divisors of p ^κ − 1, we decompose the FFTs into transforms of exponentially shorter lengths (not necessarily powers of two). We use Bluestein’s method to convert each short transform to a convolution over F _p

κ

, and then use Kronecker substitution (from bivariate to univariate polynomials) to convert this to multiplication in F _p [X ]. These latter multiplications are then handled recursively. We continue the recursion until n is comparable to p, at which point we switch to using Kronecker substitution (from polynomials to integers).

In many respects this approach is very similar to the algorithm for integer multiplication introduced in [18], but there are various additional technical issues to address. We recommend [18]

as a gentle introduction to the main ideas.

We also prove the following conditional result.

Theorem 1.2 . Assume Conjectures 8.1, 8.3 and 8.5. Then the bound

M _p (n) = O n log p log(n log p) 4 ^log

^∗

^(nlog ^p)

holds uniformly for all n > 2 and all primes p.

(4)

Of these conjectures, Conjecture 8.3 seems to be in reach for specialists in analytic number theory; Conjecture 8.5 is at least as hard as Artin’s conjecture on primitive roots, but may be tractable under the generalised Riemann hypothesis (GRH); and Conjecture 8.1 is almost as strong as the Lenstra–Pomerance–Wagstaff conjecture on the distribution of Mersenne primes, and seems to be inaccessible at present.

1.2. Outline of the paper

The paper is structured as follows. In section 2, we start with a survey of relevant basic techniques:

discrete Fourier transforms (DFTs) and FFTs, Bluestein’s chirp transform, and Kronecker sub- stitution. Most of this material is repeated from [18, Section 2] for the convenience of the reader;

however, sections 2.5 and 2.6 differ substantially.

In section 3, we recall basic complexity results for arithmetic in finite fields. In particular, we consider the construction of irreducible polynomials in F _p [X], algorithms for finding roots of unity in F _p

κ

, and the cost of arithmetic in F _p

κ

and F _p

κ

[Y ].

In section 4, we show how to construct special extensions of F _p whose multiplicative group has a large subgroup of highly smooth order, i.e., is divisible by many small integers. The main tool is [2, Theorem 3].

Section 5 gives complexity bounds for functions that satisfy recurrence inequalities involving postcompositions with “logarithmically slow” functions. The prototype of such an inequality is T (n) 6 K T (log n) + L, where K and L are constants. The definitions and theorems are duplicated from [18, Section 5]; for the proofs, see [18].

To minimise the constant K in the bound M _p (n) = O(n log n K ^log

^∗

ⁿ log p) (for n large relative to p), we need one more tool: in section 6, we present a polynomial analogue of the Crandall–Fagin convolution algorithm [13]. This allows us to convert a cyclic convolution over F _p of length n, where n is arbitrary, into a cyclic convolution over F _p

κ

of somewhat smaller length N, where N is prescribed and where κ ≈ 2 n/ N . (We can still obtain K = 16 without using this Crandall–Fagin analogue, but we do not know how to reach K = 8 without it.)

Section 7 is devoted to the proof of Theorem 1.1. Section 8 gives the proof of Theorem 1.2, and discusses the three number-theoretic hypotheses of that theorem.

The last section offers some final notes and suggested directions for generalisation. We first quickly dispense with the bit complexity of multiplication in F _q [X] for prime powers q, and in ( Z / m Z )[X ] for arbitrary integers m > 1. We then sketch some algebraic complexity bounds for polynomial multiplication over F _p -algebras and ( Z / m Z )-algebras, especially in the straight- line program model. Our techniques also give rise to new strategies for polynomial evaluation- interpolation over F _q . This may for instance be applied to the efficient multiplication of polynomial matrices over F _q . Although we have not implemented any of our algorithms yet, we conclude by a few remarks on their expected efficiency in practice.

Notations. We use Hardy’s notations f ≺ g for f = o(g), and f ≍ g for f = O(g) and g = O(f ).

The “soft-Oh” notation f (n) = O ˜(g(n)) means that f (n) = g(n) (log(3 + g(n))) ^O(1) (see [17, Chapter 25, Section 7] for details). The symbol R ^> denotes the set of non-negative real numbers, and N denotes { 0, 1, 2, ... } . For a ring R and n ∈ N , we write R[X ] n := { P ∈ R[X]: deg P < n } .

We will write lg n := ⌈ log n/ log 2 ⌉ . In expressions like log log p or lg lg p, we tacitly assume that the function is adjusted so as to take positive values for small primes such as p = 2.

Acknowledgments. We would like to thank Karim Belabas for directing us to the paper [2], and Igor Shparlinski for answering some analytic number theory questions.

2. Survey of classical tools

This section recalls basic facts on Fourier transforms and related techniques used in subsequent sections. For more details and historical references we refer the reader to standard books on the subject such as [3, 9, 17, 30].

2.1. Arrays and sorting

In the Turing model, we have available a fixed number of linear tapes. An n 1 × ··· × n _d array

M _i

1

,...,i

d

of b-bit elements is stored as a linear array of n 1 ··· n _d b bits. We generally assume that the

elements are ordered lexicographically by (i 1 , ..., i _d ), though this is just an implementation detail.

(5)

What is significant from a complexity point of view is that occasionally we must switch repre- sentations, to access an array (say 2-dimensional) by “rows” or by “columns”. In the Turing model, we may transpose an n 1 × n 2 matrix of b-bit elements in time O(b n 1 n 2 lg min (n 1 , n 2 )), using the algorithm of [7, Appendix]. Briefly, the idea is to split the matrix into two halves along the “short”

dimension, and transpose each half recursively.

We will also require more complex rearrangements of data, for which we resort to sorting.

Suppose that X is a totally ordered set, whose elements are represented by bit strings of length b, and suppose that we can compare elements of X in time O(b). Then an array of n elements of X may be sorted in time O(b n lg n) using merge sort [22], which can be implemented efficiently on a Turing machine.

2.2. Discrete Fourier transforms

Let R be a commutative ring with identity and let n > 1. An element ω ∈ R is said to be a principal n-th root of unity if ω ⁿ = 1 and

X

k=0 n −1

(ω ⁱ ) ^k = 0 (2.1)

for all i ∈ { 1, ..., n − 1 }. In this case, we define the discrete Fourier transform (or DFT) of an n-tuple a = (a 0 , ..., a n−1 ) ∈ R ⁿ with respect to ω to be DFT ω (a) = a ˆ = (a ˆ 0 , ..., a ˆ n−1 ) ∈ R ⁿ where

a ˆ i := a 0 + a 1 ω ⁱ + ··· + a n−1 ω ⁽ⁿ⁻¹⁾ⁱ .

That is, a ˆ i is the evaluation of the polynomial A(X) := a 0 + a 1 X + ··· + a n −1 X ⁿ ⁻¹ at ω ⁱ . If ω is a principal n-th root of unity, then so is its inverse ω ⁻¹ = ω ⁿ⁻¹ , and we have

DFT _ω

⁻¹

(DFT ω (a)) = n a.

Indeed, writing b := DFT ω

⁻¹

(DFT ω (a)), the relation (2.1) implies that b i = X

j=0 n−1

a ˆ j ω ^−ji = X

j=0 n−1

X

k=0 n−1

a k ω ^j(k−i) = X

k=0 n−1

a k

X

j=0 n−1

ω ^j(k ⁻ⁱ⁾ = X

k=0 n −1

a k (n δ i,k ) = n a i ,

where δ _i,k = 1 if i = k and δ _i,k = 0 otherwise.

Remark 2.1. In all of the new algorithms introduced in this paper, we actually work over a field, whose characteristic does not divide n. In this setting, the concept of principal root of unity coincides with the more familiar primitive root of unity.

2.3. The Cooley–Tukey FFT

Let ω be a principal n-th root of unity and let n = n 1 n 2 where 1 < n 1 < n. Then ω ⁿ

¹

is a principal n 2 -th root of unity and ω ⁿ

²

is a principal n 1 -th root of unity. Moreover, for any i 1 ∈ { 0, ..., n 1 − 1 } and i 2 ∈ { 0, ..., n 2 − 1 }, we have

a ˆ i

1

n

2

+i

2

= X

k

1

=0 n

1

−1

X

k

2

=0 n

2

−1

a k

2

n

1

+k

1

ω ^(k

²

ⁿ

¹

^+k

¹

⁾⁽ⁱ

¹

ⁿ

²

⁺ⁱ

²

⁾

= X

k

1

=0 n

1

−1

ω ^k

¹

ⁱ

²

X

k

2

=0 n

2

−1

a _k

₂

_n

₁

+k

1

(ω ⁿ

¹

) ^k

²

ⁱ

²

!

(ω ⁿ

²

) ^k

¹

ⁱ

¹

. (2.2) If A 1 and A 2 are algorithms for computing DFTs of length n 1 and n 2 , we may use (2.2) to construct an algorithm A 1 ⊙ A 2 for computing DFTs of length n as follows.

For each k 1 ∈ { 0, ..., n 1 − 1 } , the sum inside the brackets corresponds to the i 2 -th coefficient

of a DFT of the n 2 -tuple (a 0n

1

+k

1

, ..., a (n

2

−1)n

1

+k

1

) ∈ R ⁿ

²

with respect to ω ⁿ

¹

. Evaluating these

inner DFTs requires n 1 calls to A ² . Next, we multiply by the twiddle factors ω ^k

¹

ⁱ

²

, at a cost of n

operations in R. (Actually, fewer than n multiplications are required, as some of the twiddle factors

are equal to 1. This optimisation, while important in practice, has no asymptotic effect on the

algorithms discussed in this paper.) Finally, for each i 2 ∈ { 0, ..., n 2 − 1 } , the outer sum corresponds

to the i 1 -th coefficient of a DFT of an n 1 -tuple in R ⁿ

¹

with respect to ω ⁿ

²

. These outer DFTs

require n 2 calls to A ¹ .

(6)

Denoting by F _R (n) the number of ring operations needed to compute a DFT of length n, and assuming that we have available a precomputed table of twiddle factors, we obtain

F _R (n 1 n 2 ) 6 n 1 F _R (n 2 ) + n 2 F _R (n 1 ) + n.

For a factorisation n = n 1 ··· n _d , this yields recursively F _R (n) 6 X

i=1

d n

n _i F _R (n i ) + (d − 1) n. (2.3)

The corresponding algorithm is denoted A ¹ ⊙ ··· ⊙ A d . The ⊙ operation is neither commutative nor associative; the above expression will always be taken to mean ( ··· (( A ¹ ⊙ A ² ) ⊙ A ³ ) ⊙ ··· ) ⊙ A d . The above discussion requires several modifications in the Turing model. Assume that elements of R are represented by b bits.

First, for A 1 ⊙ A 2 , we must add a rearrangement cost of O(b n lg min (n 1 , n 2 )) to efficiently access the rows and columns for the recursive subtransforms (see section 2.1). For the general case A 1 ⊙ ··· ⊙ A d , the total rearrangement cost is bounded by O( P

i b n lg n i ) = O(b n lg n).

Second, we will sometimes use non-algebraic algorithms to compute the subtransforms, so it may not make sense to express their cost in terms of F _R . The relation (2.3) therefore becomes

F (n) 6 X

i=1

d n

n _i F (n i ) + (d − 1) n m _R + O(b n lg n), (2.4) where F (n) is the (Turing) cost of a transform of length n over R, and where m _R is the cost of a single multiplication in R.

Finally, we point out that A ¹ ⊙ A ² requires access to a table of twiddle factors ω ⁱ

¹

ⁱ

²

, ordered lexicographically by (i 1 , i 2 ), for 0 6 i 1 < n 1 , 0 6 i 2 < n 2 . Assuming that we are given as input a precomputed table of the form 1, ω, ..., ω ⁿ⁻¹ , we must show how to extract the required twiddle factor table in the correct order. We first construct a list of triples (i 1 , i 2 , i 1 i 2 ), ordered by (i 1 , i 2 ), in time O(n lg n); then sort by i 1 i 2 in time O(n lg ² n) (see section 2.1); then merge with the given root table to obtain a table (i 1 , i 2 , ω ⁱ

¹

ⁱ

²

), ordered by i 1 i 2 , in time O(n (b + lg n)); and finally sort by (i 1 , i 2 ) in time O(n lg n (b + lg n)). The total cost of the extraction is thus O(n lg n (b + lg n)).

The corresponding cost for A ¹ ⊙ ··· ⊙ A d is determined as follows. Assuming that the table 1, ω, ..., ω ⁿ⁻¹ is given as input, we first extract the subtables of (n 1 ··· n _i )-th roots of unity for i = d − 1, ..., 2 in time O((n 1 ··· n _d + ··· + n 1 n 2 ) (b + lg n)) = O(n (b + lg n)). Extracting the twiddle factor table for the decomposition (n 1 ··· n i−1 ) × n _i then costs O(n 1 ··· n _i lg n (b + lg n)); the total over all i is again O(n lg n (b + lg n)).

Remark 2.2. An alternative approach is to compute the twiddle factors directly in the correct order. When working over C , as in [18, Section 3], this requires a slight increase in the working precision. Similar comments apply to the root tables used in Bluestein’s algorithm in section 2.5.

2.4. Fast Fourier multiplication

Let ω be a principal n-th root of unity in R and assume that n is invertible in R. Consider two polynomials A = a 0 + ··· + a n−1 X ⁿ⁻¹ and B = b 0 + ··· + b n −1 X ⁿ ⁻¹ in R[X ]. Let C = c 0 + ··· + c n−1 X ⁿ⁻¹ be the polynomial defined by

c := _n ¹ DFT ω

⁻¹

(DFT ω (a) DFT ω (b)),

where the product of the DFTs is taken pointwise. By construction, we have c ˆ = a ˆ b ˆ , which means that C(ω ⁱ ) = A(ω ⁱ ) B(ω ⁱ ) for all i ∈ { 0, ..., n − 1 }. The product S = s 0 + ··· + s n−1 X ⁿ⁻¹ of A and B modulo X ⁿ − 1 also satisfies S(ω ⁱ ) = A(ω ⁱ ) B(ω ⁱ ) for all i. Consequently, s ˆ = a ˆ b ˆ , s = DFT ω

⁻¹

(s ˆ)/ n = c, whence C = S.

For polynomials A, B ∈ R[X ] with deg A < n and deg B < n, we thus obtain an algorithm for

the computation of A B modulo X ⁿ − 1 using at most 3 F _R (n) + O(n) operations in R. Modular

products of this type are also called cyclic convolutions . If deg (A B) < n, then we may recover

the product A B from its reduction modulo X ⁿ − 1. This multiplication method is called FFT

multiplication .

(7)

If one of the arguments (say B) is fixed and we want to compute many products A B (or cyclic convolutions) for different A, then we may precompute DFT ω (b), after which each new product A B can be computed using only 2 F _R (n) + O(n) operations in R.

2.5. Bluestein’s chirp transform

We have shown above how to multiply polynomials using DFTs. Inversely, it is possible to reduce the computation of DFTs — of arbitrary length, not necessarily a power of two — to polynomial multiplication [5], as follows.

Let ω be a principal n-th root of unity and consider an n-tuple a ∈ R ⁿ . First consider the case that n is odd. Let

f i := ω ⁽ⁱ

²

^−i)/2 , f _i ^′ := ω ⁽ⁱ

²

^+i)/2 , g i := ω ⁽⁻ⁱ

²

^−i)/2 . Note that (i ² − i)/2 and (i ² + i)/2 are integers, and that

g i+n = ω ^(−(i+n)

²

^−(i+n))/2 = ω ⁽⁻ⁱ

²

^−i)/2 ω ⁿ⁽⁻ⁱ ^−(n+1)/2) = g i .

Let F := f 0 a 0 + ··· + f n−1 a n−1 X ⁿ⁻¹ , G:= g 0 + ··· + g n−1 X ⁿ⁻¹ and C := c 0 + ··· +c n −1 X ⁿ ⁻¹ ≡ F G modulo X ⁿ − 1. Then

f _i ^′ c _i = X

j =0 n −1

f _i ^′ (f j a _j ) g i−j = X

j=0 n−1

ω ⁽ⁱ

²

^+i)/2 ω ^(j

²

^−j)/2 ω ⁽⁻⁽ⁱ⁻ ^j)

²

⁻⁽ⁱ⁻ ^j))/2 a _j = X

j =0 n−1

ω ^ij a _j = a ˆ i . In other words, the computation of a DFT of odd length n reduces to a cyclic convolution product of the same length n, together with O(n) additional operations in R. Notice that the polynomial G is fixed and independent of a in this product.

Now suppose that n is even. In this case we require that 2 be invertible in R, and that ω ^n/2 = − 1.

Let σ := ( − 1) ^n/2 , and put

f _i := ω ⁱ

²

, f _i ^′ := ω ⁱ

²

⁺ⁱ , g _i := ω ⁻ⁱ

²

+ ω ⁻ⁱ

²

⁻ⁱ . Then

g _i+

ⁿ

2

= σ ω ⁻ⁱ

²

− ω ⁻ⁱ

²

⁻ⁱ

, g i+n = g _i , and

1 2

g i + σ g _i+

ⁿ

2

= ω ⁻ⁱ

²

, 1

2 g i − σ g _i+

ⁿ

2

= ω ⁻ⁱ

²

⁻ⁱ .

As above, let F := f 0 a 0 + ··· + f n−1 a n−1 X ⁿ ⁻¹ , G := g 0 + ··· + g n−1 X ⁿ ⁻¹ and C :=

c ₀ + ··· + c _n−1 X ⁿ⁻¹ ≡ F G modulo X ⁿ − 1. Then for i ∈

0, ..., ⁿ ₂ − 1 we have 1

2 f i

c i + σ c _i+

ⁿ

2

= 1 2 f i

X

j=0 n−1

f j a j

g i−j + σ g _i−j+

ⁿ

2

= X

j =0 n−1

ω ⁱ

²

ω ^j

²

ω ^−(i−j)

²

a j = X

j=0 n−1

ω ^2ij a j = a ˆ 2i ,

and similarly 1 2 f _i ^′

c i − σ c _i+

ⁿ

2

= X

j=0 n −1

ω ⁱ

²

⁺ⁱ ω ^j

²

ω ⁻⁽ⁱ ^−j)

²

⁻⁽ⁱ⁻ ^j) a j = X

j=0 n−1

ω ^(2i+1)j a j = a ˆ 2i+1 .

Again, the DFT reduces to a convolution of length n, together with O(n) additional operations in R.

The only complication in the Turing model is the cost of extracting the f i , f _i ^′ and g i in the correct order. For example, consider the f i in the case that n is odd. Given as input a precomputed table 1, ω, ω ² , ..., ω ⁿ⁻¹ , we may extract the f i in time O(n lg n (b + lg n)) by applying the strategy from section 2.3 to the pairs (i, ¹ ₂ (i ² − i) mod n) for 0 6 i < n. The other sequences are handled similarly. For the g _i in the case that n is even, we also need to perform O(n) additions in R;

assuming that additions in R have cost O(b), this is already covered by the above bound.

(8)

Remark 2.3. Bluestein’s original formulation used the weights f _i := ω ⁱ

²

^/2 , g _i := ω ⁻ⁱ

²

^/2 . This has two drawbacks in our setting. First, it requires the presence of 2 n-th roots of unity, which may not exist in R. Second, if n is odd, it leads to negacyclic convolutions, rather than cyclic convolutions.

The variants used here avoid both of these problems.

Remark 2.4. It is also possible to give variants of the new multiplication algorithms in which Bluestein’s transform is replaced by a different method for converting DFTs to convolutions, such as Rader’s algorithm [29].

2.6. Kronecker substitution

Multiplication in F _p [X] may be reduced to multiplication in Z using the classical technique of Kronecker substitution [17, Corollary 8.27]. More precisely, let n >0 and suppose that we are given two polynomials A, B ∈ F _p [X ] of degree less than n. Let A ˜ , B ˜ ∈ Z [X ] be lifts of these polynomials, with coefficients A i and B i satisfying 0 6 A i , B i < p. Then for the product C ˜ = A ˜ B ˜ we have 0 6 C i < n p ² , and the coefficients of C ˜ may be read off the integer product C ˜(2 ^N ) = A ˜(2 ^N ) B ˜(2 ^N ) where N := 2 lg p + lg n. We deduce the coefficients of C := A B by dividing each C i by p. This shows that M _p (n) = O( I (n (lg p+ lg n)) + n I (lg p)). Using the assumption that I (n) / n is increasing, we obtain

M _p (n) = O( I (n (lg p + lg n))).

If we also assume that lg n = O(lg p), i.e., that the degree is not too large, then this becomes simply

M _p (n) = O( I (n lg p)). (2.5)

A second type of Kronecker substitution reduces bivariate polynomial multiplication to the univariate case. Indeed, let n > 1 and κ > 1, and suppose that A, B ∈ F _p [X , Z], where deg X A, deg X B < n and deg Z A, deg Z B < κ. We may then recover C := A B from the univariate product C(Y ^2κ , Y ) = A(Y ^2κ , Y ) B(Y ^2κ , Y ) in F _p [Y ]. Note that A(Y ^2κ , Y ) and B(Y ^2κ , Y ) have degree less than 2 n κ, so the cost of the bivariate product is M _p (2 n κ) + O(n κ lg p).

The same method works for computing cyclic convolutions: to multiply A, B ∈ F _p [X , Z ] / (X ⁿ − 1), the same substitution leads to a product in F _p [Y ] / (Y ^2nκ − 1). The cost is thus M _p ^′ (2 n κ) + O(n κ lg p), where M _p ^′ (d) denotes the cost of a multiplication in F _p [X]/(X ^d − 1).

3. Arithmetic in finite fields

Let p be a prime and let κ > 1. In this section we review basic results concerning arithmetic in F _p

κ

and F _p

κ

[Y ].

We assume throughout that F _p

κ

is represented as F _p [Z]/ P for some irreducible monic polyno- mial P ∈ F _p [Z] of degree κ. Thus an element of F _p

κ

is represented uniquely by its residue modulo P , i.e., by a polynomial F ∈ F _p [Z] of degree less than κ.

Lemma 3.1 . Let p be a prime and let κ > 1. We may compute a monic irreducible polynomial P ∈ F _p [Z] of degree κ in time O ˜(κ ⁴ p ^1/2 ).

Proof. See [33, Theorem 3.2].

The above complexity bound is very pessimistic in practice. Better complexity bounds are known if we allow randomised algorithms, or unproved hypotheses. For instance, assuming GRH, the bound reduces to (κ log p) ^O(1) [1].

We now consider the cost of arithmetic in F _p

κ

, assuming that P is given. Addition and sub- traction in F _p

κ

have cost O(κ lg p). For multiplication we will always use the Schönhage–Strassen algorithm. Denote by S _p (κ) the cost of multiplying polynomials in F _p [Z ] κ by this method, i.e., using the polynomial variant [31] for the polynomials themselves, followed by the integer version [32] to handle the coefficient multiplications. Then we have

S _p (κ) = O(κ lg κ lg lg κ lg p lg lg p lg lg lg p).

(9)

Of course, we could use the new multiplication algorithm recursively for these products, but it turns out that Schönhage–Strassen is fast enough, and leads to a simpler complexity analysis in section 7.

Let D _p (κ) denote the cost of dividing a polynomial in F _p [Z] 2κ by P, returning the quotient and remainder. Using Newton’s method [17, Chapter 9] we have D _p (κ) = O( S _p (κ)). Thus elements of F _p

κ

may be multiplied in time O( S _p (κ)).

Let N > 1 be a divisor of p ^κ − 1. To compute a DFT over F _p

κ

of length N, we must first have access to a primitive N-th root of unity in F _p

κ

. In general it is very difficult to find a primitive root of F _p

κ

, as it requires knowledge of the factorisation of p ^κ − 1. However, to find a primitive N -th root, Lemma 3.3 below shows that it is enough to know the factorisation of N . The construction relies on the following existence result.

Lemma 3.2. Let ℓ be a positive integer such that p ^ℓ > c 1 r ⁴ (log r + 1) ⁴ κ ² , where r is the number of distinct prime divisors of p ^κ − 1, and where c 1 > 0 is a certain absolute constant. Then there exists a monic irreducible polynomial Θ ∈ F _p [Z ] of degree ℓ such that Θ modulo P is a primitive root of unity for F _p

κ

= F _p [Z]/P.

Proof. This is [34, Theorem 1.1].

Lemma 3.3 . Assume that N divides p ^κ − 1 and that the factorisation of N is given. Then we may compute a primitive N-th root of unity in F _p [Z]/P in time O ˜(κ ⁹ p).

Proof. Testing whether a given α ∈ F _p

κ

is a primitive N -th root of unity reduces to checking that α ^N = 1 and α ^N ^/s = / 1 for every prime divisor s of N. According to Lemma 3.2, it suffices to apply this test to α := (Θ mod P) ^(p

^κ

^−1)/N for Θ ∈ F _p [Z] running over all monic polynomials of degree ℓ := ⌈ log(c 1 r ⁴ (log r + 1) ⁴ κ ² )/log p ⌉ , where r = O(κ lg p) is as in the lemma. The number of candidates is at most p ^ℓ 6 c r ⁴ (log r + 1) ⁴ κ ² p = O ˜(κ ⁶ p), and each candidate can be tested in time O ˜(r lg N (κ lg p)) = O ˜(κ ³ lg ³ p) using binary exponentiation.

Of course, we can do much better if randomised algorithms are allowed, since ζ ^(p

^κ

^−1)/N is reasonably likely to be a primitive N -th root for randomly selected ζ.

Finally, we consider polynomial multiplication over F _p

κ

. Let n > 1, and let A, B ∈ F _p

κ

[X ] n . Let A ˜ , B ˜ ∈ F _p [X , Z] be their natural lifts, i.e., of degree less than κ with respect to Z. The bivariate product C ˜ := A ˜ B ˜ may be computed using Kronecker substitution (section 2.6) in time M _p (2 n κ) + O(n κ lg p). Writing C ˜ = P

i=0 2n −2

C ˜ _i (Z) X ⁱ , to recover A B we must divide each C ˜ _i by P . Denoting by M _p

κ

(n) the complexity of the original multiplication problem, we obtain

M _p

κ

(n) = M _p (2 n κ) + O(n D _p (κ))

= M _p (2 n κ) + O(n S _p (κ)).

As in section 2.6, the same method may be used for cyclic convolutions. If M _p ′

κ

(n) denotes the cost of multiplication in F _p

κ

[X]/(X ⁿ − 1), then we get

M _p ′

κ

(n) = M _p ^′ (2 n κ) + O(n S _p (κ)).

Remark 3.4. In the other direction, we can also reduce multiplication in F _p [X ] to multiplication in F _p

κ

[Y ] by splitting the inputs into chunks of size ⌊ κ / 2 ⌋ . This leads to the bound M _p (n) 6 M _p

κ

( ⌈ n / ⌊ κ / 2 ⌋⌉ ) + O(n lg p) for n > κ. A variant of this procedure is developed in detail in section 6.

4. Preparing for DFTs of large smooth orders

The aim of this section is to prove the following theorem, which allows us to construct “small”

extensions F _p

λ

of F _p containing “many” roots of unity of “low” order. Recall that a positive integer is said to be y-smooth if all of its prime divisors are less than or equal to y.

Theorem 4.1 . There exist computable absolute constants c 3 > c 2 > 0 and n 0 ∈ N with the following properties. Let p be a prime and let n > n 0 . Then there exists an integer λ in the interval

(lg n) ^c

²

^{lg lg lg} ⁿ < λ < (lg n) ^c

³

^{lg lg lg} ⁿ ,

(10)

and a (λ + 1)-smooth integer M > n, such that M | p ^λ − 1. Moreover, given p and n, we may compute λ and the prime factorisation of M in time O((lg n) ^{lg lg} ⁿ ).

The proof depends on a series of preparatory lemmas. For λ > 2, define H _λ := Y

q prime, q−1|λ

q.

For example, H ₃₆ = 1919190 = 2 · 3 · 5 · 7 · 13 · 19 · 37 and H ₃₇ = 2. Note that H _λ is always squarefree and (λ + 1)-smooth. We are interested in finding λ for which H _λ is unusually large; that is, for which λ has many divisors d such that d + 1 happens to be prime. Most of the heavy lifting is done by the following remarkable result.

Lemma 4.2. Define λ 0 (k) := min

λ ∈ N : H _λ > √ k

. There exist computable absolute constants c 5 > c 4 > 0 such that for any integer k > 100 we have

(log k) ^c

⁴

log log log k < λ 0 (k) < (log k) ^c

⁵

log log log k . Proof. This is part of [2, Theorem 3]. (The threshold √ k

could of course be replaced by any fixed power of k. It is stated this way in [2] because that paper is concerned with primality testing.)

The link between H _λ and F _p

λ

is as follows.

Lemma 4.3 . Let p be a prime and let λ > 2. Then there exists a (λ + 1)-smooth integer M >

H _λ /(λ + 1) such that M | p ^λ − 1.

Proof. We take M := H λ / p if p divides H λ , and otherwise M := H λ . In the former case we must have p 6 λ + 1, so in both cases M > H λ / (λ + 1). To see that M | p ^λ − 1, consider any prime divisor q = / p of H λ . Then q − 1 | λ, so q | p ^λ − 1 by Fermat’s little theorem.

Remark 4.4. The integer M constructed in Lemma 4.3 only takes into account the structure of H _λ , and ignores p ^λ − 1 itself. In practice, p ^λ − 1 will often have small factors other than those in H _λ , possibly including repeated factors (which are never detected by H _λ ). For example, H 6 = 2 · 3 · 7, but 19 ⁶ − 1 = 2 ³ · 3 ³ · 5 · 7 ³ · 127. In an implementation, one should always incorporate these highly valuable “accidental” factors into M . We will ignore them in our theoretical discussion.

Next we give a simple sieving algorithm that tabulates approximations of log H λ for all λ up to a prescribed bound.

Lemma 4.5 . Let m > 2. In time O(m ² ), we may compute integers ℓ 1 , ..., ℓ m with

| ℓ _λ − log H _λ | 6 1, (1 6 λ 6 m). (4.1)

Proof. Initialise a table of ℓ _λ with ℓ _λ := 0 for λ = 1, ..., m. Since log H _λ 6 P

q 6λ+1 log q = O(λ), it suffices to set aside O(lg m) bits for each ℓ _λ .

For each integer q = 2, ..., m, perform the following steps. First test whether q is prime, discarding it if not. Using trial division, the cost is q ^1/2+o(1) , so m ^3/2+o(1) = O(m ² ) overall. Now assume that q is found to be prime. Using a fast algorithm for computing logarithms [8], compute an integer L q such that | L q − 2 ^r log q | 6 1, where r := 1 + lg m, in time O (lg m) ^1+o(1)

. Scan through the table, replacing ℓ λ by ℓ λ + L q for those λ divisible by q − 1, i.e., every (q − 1)-th entry, in time O(m lg m). The total cost for each prime q is O(m lg m), and there are O(m/ lg m) primes, so the overall cost is O(m ² ). At the end, we divide each ℓ λ by 2 ^r , yielding the required approximations

satisfying (4.1).

Proof of Theorem 4.1. We are given p and n as input. Applying Lemma 4.2 with k := n ³ , we find that for large enough n there exists some λ with

(log (n ³ )) ^c

⁴

log log log (n

³

) < λ < (log (n ³ )) ^c

⁵

log log log (n

³

)

(11)

and such that H λ > n ^3/2 . Choose any positive c 2 < c 4 log 2 and any c 3 > c 5 log 2. Since log log(n ³ ) log log log(n ³ ) ∼ (log 2) log lg n lg lg lg n,

we have

(lg n) ^c

²

^{lg lg lg} ⁿ < (log(n ³ )) ^c

⁴

log log log(n

³

) < λ < (log(n ³ )) ^c

⁵

log log log(n

³

) < (lg n) ^c

³

^{lg lg lg} ⁿ for large n. Using Lemma 4.5, we may find one such λ in time O(m ² ), where m := (lg n) ^c

³

^{lg lg lg} ⁿ .

Let M be as in Lemma 4.3. Then M divides p ^λ − 1, and M > H λ

λ + 1 > n ^3/2 m + 1 > n

for large n. We may compute the prime factorisation of M by simply enumerating the primes q 6 λ + 1, q = / p, and checking whether q − 1 divides λ for each q. This can be done in time

O(λ ² ) = O(m ² ).

The main multiplication algorithm depends on a reduction of a “long” DFT to many “short”

DFTs. It is essential to have some control over the long and short transform lengths. The following result packages together the prime divisors of the abovementioned M , to obtain a long transform length N near a given target L, and short transform lengths N _i near a given target S.

Theorem 4.6 . Let p, n, λ, M be as in Theorem 4.1. Let L and S be positive integers such that λ < S < L < M. Then there exist (λ + 1)-smooth integers N 1 , ..., N d with the following properties:

a) N := N 1 ··· N _d divides M (and hence divides p ^λ − 1).

b) L 6 N 6 (λ + 1) L.

c) S 6 N i 6 S ³ for all i.

Given λ, S, L, and the prime factorisation of M, we may compute such N 1 , ..., N d (and their factorisations) in time O ˜(λ ³ ).

Proof. Let M = N 1 ··· N s be the prime decomposition of M . Taking d minimal with N 1 ··· N d > L, we ensure that (a) and (b) are satisfied. At this stage we have N i 6 λ + 1 6 S for all i. As long as the tuple (N 1 , ..., N d ) contains an entry N i with N i < S, we pick the two smallest entries N i and N j and replace them by a single entry N i N j . Clearly, this does not alter the product N of all entries, so (a) and (b) continue to hold. Furthermore, as long as there exist two entries N i and N j

with N i < S and N j < S, new entries N i N j will always be smaller than S ² . Only at the very last step of the loop, the second smallest entry N j might be larger than S. In that case, the product of the entry N i with N i < S with N j is still bounded by S · S ² = S ³ . This shows that condition (c) is satisfied at the end of the loop.

Determining d requires at most s = O(λ) multiplications of integers less than M . There are at most s iterations of the replacement loop. Each iteration must scan through at most s integers of bit size O(log M ) = O(λ) and perform one multiplication on such integers.

5. Logarithmically slow recurrence inequalities

Let Φ: (x 0 , ∞ ) → R be a smooth increasing function, for some x 0 ∈ R . We say that Φ ^∗ : (x 0 , ∞ ) → R ^>

is an iterator of Φ if Φ ^∗ is increasing and if

Φ ^∗ (x) = Φ ^∗ (Φ(x)) + 1 (5.1)

for all sufficiently large x.

For instance, the standard iterated logarithm log ^∗ defined in (1.2) is an iterator of log. An analogous iterator may be defined for any smooth increasing function Φ: (x 0 , ∞ ) → R for which there exists some σ > x 0 such that Φ(x) 6 x − 1 for all x > σ. Indeed, in that case,

Φ ^∗ (x) := min { k ∈ N : Φ ^◦k (x) 6 σ }

(12)

is well-defined and satisfies (5.1) for all x > σ. It will sometimes be convenient to increase x 0 so that Φ(x) 6 x − 1 is satisfied on the whole domain of Φ.

We say that Φ is logarithmically slow if there exists an ℓ ∈ N such that

(log ^◦ℓ ◦ Φ ◦ exp ^◦ℓ )(x) = log x + O(1) (5.2) for x →∞ . For example, the functions log (2 x), 2 log x, (log x) ² and (log x) ^{log log} ^x are logarithmically slow, with ℓ = 0, 1, 2, 3 respectively.

Lemma 5.1 . Let Φ: (x 0 , ∞ ) → R be a logarithmically slow function. Then there exists σ > x 0 such that Φ(x) 6 x − 1 for all x > σ. Consequently all logarithmically slow functions admit iterators.

In this paper, the main role played by logarithmically slow functions is to measure size reduction in multiplication algorithms. In other words, multiplication of objects of size n will be reduced to multiplication of objects of size n ^′ , where n ^′ 6 Φ(n) for some logarithmically slow function Φ(x).

The following result asserts that, from the point of view of iterators, such functions are more or less interchangeable with log x.

Lemma 5.2. For any iterator Φ ^∗ of a logarithmically slow function Φ, we have Φ ^∗ (x) = log ^∗ x + O(1).

The next result is our main tool for converting recurrence inequalities into actual asymptotic bounds for solutions.

Proposition 5.3 . Let K > 1, B > 0 and ℓ ∈ N . Let x 0 > exp ^◦ℓ (1), and let Φ: (x 0 , ∞ ) → R be a logarithmically slow function such that Φ(x) 6 x − 1 for all x > x 0 . Then there exists a positive constant C (depending on x 0 , Φ, K, B and ℓ) with the following property.

Let σ > x 0 and L > 0. Let S ⊆ R , and let T: S → R ^> be any function satisfying the following recurrence. First, T (y) 6 L for all y ∈ S , y 6 σ. Second, for all y ∈ S , y > σ, there exist y 1 , ..., y d ∈ S with y i 6 Φ(y), and weights γ 1 , ..., γ d > 0 with P

i γ i = 1, such that T (y) 6 K

1 + B log ^◦ℓ y

X

i=1 d

γ _i T (y i ) + L.

Then we have T (y) 6 C L K ^log

^∗

^y ^−log

^∗

^σ for all y ∈ S , y > σ.

6. The Crandall–Fagin algorithm

Consider the problem of computing the product of two n-bit integers modulo m := 2 ⁿ − 1. The naive approach is to compute their ordinary 2 n-bit product and then reduce modulo m. The reduction cost is negligible because of the special form of m. If n is divisible by a high power of two, one can save a factor of two by using the fact that FFTs naturally compute cyclic convolutions.

An ingenious algorithm of Crandall and Fagin [13] allows for the gain of this precious factor of two for arbitrary n. Their algorithm is routinely used in the extreme case where n is prime, in the large-scale GIMPS search for Mersenne primes (see http://www.mersenne.org/).

A variant of the Crandall–Fagin algorithm was a key ingredient of our algorithm for integer multiplication that conjecturally achieves the bound I (n) = O(n log n 4 ^log

^∗

ⁿ ) [18, Section 9]. In this section we present yet another variant, for computing products in F _p [X]/(X ⁿ − 1).

6.1. The Crandall–Fagin reduction for polynomials

Let n, N and κ be positive integers with 1 6 N 6 n and κ > 2 ⌈ n / N ⌉. Our aim is to reduce

multiplication in F _p [X] / (X ⁿ − 1) to multiplication in F _p

κ

[Y ] / (Y ^N − 1). In the applications in

subsequent sections, N will be a divisor of p ^κ − 1 with many small factors, so that we can multiply

efficiently in F _p

κ

[Y ]/(Y ^N − 1) using FFTs over F _p

κ

.

(13)

Suppose that F _p

κ

is represented as F _p [Z]/P , for some irreducible P ∈ F _p [Z] of degree κ. The reduction relies on the existence of an element θ ∈ F _p

κ

such that θ ^N = Z. (This is the analogue of the real N -th root of 2 appearing in the usual Crandall–Fagin algorithm, which was originally formulated over C .) It is easy to see that such θ may not always exist; for example, there is no cube root of Z in F ₁₆ = F ₂ [Z ]/(Z ⁴ + Z + 1). Nevertheless, the next result shows that if κ is large enough, then we can always find some modulus P for which a suitable θ exists.

Proposition 6.1 . Suppose that p ^κ/2 > N. Then we may compute an irreducible polynomial P ∈ F _p [Z] of degree κ, and an element θ ∈ F _p

κ

= F _p [Z]/ P such that θ ^N = Z, in time O ˜(κ ⁹ p).

Proof. We first observe that if ζ ∈ F _p

κ

is a primitive root, then ζ ^N cannot lie in a proper subfield of F _p

κ

. (This property is independent of P.) Indeed, if ζ ^N ∈ F _p

_ℓ

for some proper divisor ℓ of κ, then every N -th power in F _p

κ

lies in F _p

ℓ

. This contradicts the fact that the number of N-th powers in F _p

κ

is at least 1 + (p ^κ − 1)/N > p ^κ / N > p ^κ/2 > p ^ℓ .

Now we give an algorithm for computing P and θ. We start by using Lemma 3.1 to compute an irreducible P ˜ ∈ F _p [U ] of degree κ in time O ˜ κ ⁴ p ^1/2

, and we temporarily represent F _p

κ

as F _p [U]/ P ˜ .

Using Lemma 3.2, we may construct a list of O ˜(κ ⁶ p) candidates for primitive roots of F _p

κ

(see the proof of Lemma 3.3). For each candidate ζ, we compute ζ ^N and its powers 1, ζ ^N , ..., ζ ^(κ−1)N . If these are linearly dependent over F _p then ζ ^N belongs to a proper subfield (and so, as shown above, ζ is not actually a primitive root). Thus we must eventually encounter some ζ for which they are linearly independent. We take P to be the minimal polynomial of this ζ ^N , which may be computed by finding a linear relation among 1, ζ ^N , ..., ζ ^κN . Let ϕ: F _p [Z ]/P → F _p [U ]/P ˜ be the isomorphism that sends Z to ζ ^N . The matrix of ϕ with respect to the standard bases 1, Z , ..., Z ^κ−1 and 1, U , ..., U ^κ−1 is given by the coefficients of 1, ζ ^N , ..., ζ ^(κ−1)N . The inverse of this matrix yields the matrix of ϕ ⁻¹ : F _p [U ] / P ˜ → F _p [Z] / P . We then set θ = ϕ ⁻¹ (ζ), so that θ ^N = ϕ ⁻¹ (ζ ^N ) = Z.

For each candidate ζ, the cost of computing the necessary powers of ζ is O ˜((κ+ lg N ) κ lg p) = O ˜(κ ² lg ² p), and the various linear algebra steps require time O ˜(κ ³ lg p) using classical matrix

arithmetic.

In the remainder of this section, we fix some P and θ as in Proposition 6.1, and assume that F _p

κ

is represented as F _p [Z ] / P. Suppose that we wish to compute the product of u, v ∈ F _p [X] / (X ⁿ − 1).

The presentation here closely follows that of [18, Section 9.2]. We decompose u and v as u = X

i=0 N −1

u _i X ^e

ⁱ

, v = X

i=0 N −1

v _i X ^e

ⁱ

, (6.1)

where

e i := ⌈ n i/ N ⌉ , u i , v i ∈ F _p [X] e

i+1

−e

i

.

Notice that e i+1 − e i takes only two possible values: ⌊ n/ N ⌋ or ⌈ n/ N ⌉ . For 0 6 i < N , let

c _i := N e _i − n i, (6.2)

so that 0 6 c i < N . For any 0 6 i 1 , i 2 < N , define δ i

1

,i

2

∈ Z as follows. Choose σ ∈ { 0, 1 } so that i := i 1 + i 2 − σ N lies in the interval 0 6 i < N, and put

δ i

1

,i

2

:= e i

1

+ e i

2

− e i − σ n.

From (6.2), we have

c _i

₁

+ c _i

₂

− c _i = N (e i

1

+ e _i

₂

− e _i ) − n (i 1 + i 2 − i) = N δ _i

₁

_,i

₂

.

Since the left hand side lies in the interval ( − N , 2 N ), this shows that δ _i

₁

_,i

₂

∈ { 0, 1 } . Now, since e _i

₁

+ e _i

₂

= e _i + δ _i

₁

_,i

₂

(mod n), we have

u v = X

i

1

=0 N −1

X

i

2

=0 N −1

u i

1

v i

2

X ^e

ⁱ¹

^+e

ⁱ²

= X

i=0 N −1

w i X ^e

ⁱ

(mod X ⁿ − 1),

(14)

where

w i := X

i

1

+i

2

=i (mod N)

X ^δ

^i1,i2

u i

1

v i

2

.

Since u i

1

∈ F _p [X ] ⌈n/N ⌉ and similarly for v i

2

, we have w i ∈ F _p [X ] 2⌈n/N ⌉ . Note that we may recover u v from w 0 , ..., w N −1 in time O(n lg p), by a standard overlap-add procedure (provided that N = O(n / lg n)).

Now, regarding u i and v i as elements of F _p

κ

by sending X to Z, define polynomials U , V ∈ F _p

κ

[Y ] / (Y ^N − 1) by U _i := θ ^c

ⁱ

u _i and V _i := θ ^c

ⁱ

v _i for 0 6i < N , and let W = W 0 + ··· +W N −1 Y ^N ⁻¹ :=

U V be their (cyclic) product. Then w ˜ i := θ ^−c

ⁱ

W _i = X

i

1

+i

2

=i (mod N)

θ ^−c

ⁱ

U _i

1

V _i

2

= X

θ ^c

ⁱ¹

^+c

ⁱ²

^−c

ⁱ

u _i

1

v _i

2

= X

Z ^δ

ⁱ¹^,i²

u _i

1

v _i

2

coincides with the reinterpretation of w i as an element of F _p

κ

. Moreover, we may recover w i

unambiguously from w ˜ i , as κ > 2 ⌈ n / N ⌉ and w i ∈ F _p [X] 2⌈n/N ⌉ . Altogether, this shows how to reduce multiplication in F _p [X ]/(X ⁿ − 1) to multiplication in F _p

κ

[Y ]/(Y ^N − 1).

Remark 6.2. The pair (e i+1 , c i+1 ) can be computed from (e i , c _i ) in O(lg n) bit operations, so we may compute the sequences e 0 , ..., e _N −1 and c 0 , ..., c _N −1 in time O(N lg n). Moreover, since c i+1 − c _i takes on only two possible values, we may compute the sequence θ ^c

⁰

, ..., θ ^c

^N⁻¹

using O(N) multiplications in F _p

κ

.

7. The main algorithm

Consider the problem of computing t > 1 products u 1 v , ..., u _t v with u 1 , ..., u _t , v ∈ F _p [X ]/(X ⁿ − 1), i.e., t products with one fixed operand. Denote the cost of this operation by C _p,t (n). Our algorithm for this problem will perform t + 1 forward DFTs and t inverse DFTs, so it is convenient to introduce the normalisation

C _p (n) := sup

t>1

C _p,t (n) 2 t + 1 .

This is well-defined since clearly C _p,t (n) 6 t C _p,1 (n). Roughly speaking, C _p (n) may be thought of as the notional cost of a single DFT.

The problem of multiplying two polynomials in F _p [X] of degree less than k may be reduced to the above problem by using zero-padding, i.e., taking n := 2 k and t := 1. Since C _p,1 (n) 6 3 C _p (n), we obtain M _p (k) 6 3 C (2 k) + O(k lg p). Thus it suffices to obtain a good bound for C _p (n).

The next theorem gives the core of the new algorithm for the case that n is large relative to p.

Theorem 7.1 . There exist x 0 > 2 and a logarithmically slow function Φ: (x 0 , ∞ ) → R with the following property. For all integers n > x 0 , there exist integers n 1 , ..., n d 6 Φ(n), and weights γ 1 , ..., γ d > 0 with P

i γ i = 1, such that C _p (n)

n lg p lg (n lg p) 6

8 + O 1

lg lg n

X

i=1 d

γ _i C _p (n i )

n _i lg p lg(n i lg p) + O(1), (7.1) uniformly for n > max (x 0 , p ² ).

Proof. We wish to compute t > 1 products u 1 v, ..., u t v with u 1 , ..., u t , v ∈ F _p [X ] / (X ⁿ − 1), for some sufficiently large n. We assume throughout that p ² < n.

Choose parameters. Using Theorem 4.1, we obtain integers λ and M with (lg n) ^c

²

^{lg lg lg} ⁿ < λ < (lg n) ^c

³

^{lg lg lg} ⁿ ,

and so that M > n, M | p ^λ − 1, and M is (λ + 1)-smooth. We choose long and short target transform

lengths L := ⌈ n / λ ³ ⌉ and S := λ ^{(lg lg} ⁿ⁾

²

. For large enough n we then have λ < S < L < n 6 M , so we

may apply Theorem 4.6. This yields (λ+ 1)-smooth integers N 1 , ..., N d , with known factorisations,

such that N := N 1 ··· N _d divides p ^λ − 1 and lies in the range L 6 N 6 (λ + 1) L, and such that

S 6 N _i 6 S ³ for all i. Finally we set κ := ⌈ 2 ⌈ n/ N ⌉ / λ ⌉ λ, so that N also divides p ^κ − 1, and we put

n _i := 2 N _i κ. All of these parameters may be computed in time O((lg n) ^{lg lg} ⁿ ) = O(n).

(15)

For large n, we observe that the following estimates hold. First, we have n

λ ³ 6 N 6 2 n λ ² . Since log λ ≍ lg lg n lg lg lg n, it follows that

log 2 N =

1 + O 1

lg lg n

lg n,

and also that log κ ≍ lg lg n lg lg lg n. We have N κ = 2 n + O(N λ) = 2 n + O(n/ λ), so N κ =

2 + O

1 lg n

n.

To estimate d, note that N > S ^d and log S ≍ (lg lg n) ³ lg lg lg n, so d = O

lg n (lg lg n) ³ lg lg lg n

.

Since p 6 n we have

lg (n lg p) =

1 + O 1

lg lg n

lg n, (7.2)

and as n i > N i > S > (lg n) ^{lg lg} ⁿ , we similarly obtain lg (n i lg p) =

1 + O

1 lg lg n

lg n i . (7.3)

Crandall–Fagin reduction. Let us check that the hypotheses of section 6.1 are satisfied, to enable the reduction to multiplication in F _p

κ

[Y ] / (Y ^N − 1). We certainly have 1 6N 6n and κ> 2 ⌈ n / N ⌉ . For Proposition 6.1, observe that κ > λ ² ≻ lg n ∼ lg N , so p ^κ/2 > 2 ^κ/2 > N for large n. Thus we obtain an irreducible P ∈ F _p [Z] of degree κ, and an element θ ∈ F _p

κ

= F _p [Z]/P such that θ ^N = Z, in time O ˜(κ ⁹ p) =O ˜(λ ²⁷ n ^1/2 ) = O(n). Each multiplication in F _p

κ

has cost O( S _p (κ)) (see section 3).

Computing the sequences e _i and c _i costs O(N lg n) = O(n lg n), and computing the sequence θ ^c

ⁱ

costs O(N S _p (κ)). The initial splitting and final overlap-add phases require time O(t n lg p), and the multiplications by θ ^c

ⁱ

and θ ^−c

ⁱ

cost O(t N S _p (κ)).

Long transforms. The factorisation of N is known, and N divides p ^κ − 1, so by Lemma 3.3 we may compute a primitive N-th root of unity ω ∈ F _p

κ

in time O ˜(κ ⁹ p) = O(n).

We will multiply in F _p

κ

[Y ]/ (Y ^N − 1) by using DFTs with respect to ω. The table of roots 1, ω, ..., ω ^N ⁻¹ may be computed in time O(N S _p (κ)). In a moment, we will describe an algorithm A i

for computing a “short” DFT of length N i with respect to ω i := ω ^N ^/N

ⁱ

; we then use the algorithm A := A ¹ ⊙ ··· ⊙ A ^d for the main transform of length N (see section 2.3). The corresponding twiddle factor tables may be extracted in time O(N lg N (κ lg p + lg N)) = O(n lg n lg p).

Let D denote the complexity of A , and for t ^′ > 1 let D _i,t

_′

denote the cost of performing t ^′ independent DFTs of length N i using A ⁱ . Then by (2.4) we have

D 6 X

i=1 d

D _i,N _/N

i

+ O(d N S _p (κ)) + O((κ lg p) N lg N).

The last term is simply O(n lg n lg p).

Bluestein conversion. We now begin constructing A i , assuming that t ^′ > 1 independent trans- forms are sought. We first use Bluestein’s algorithm (section 2.5) to convert each DFT of length N i

to a multiplication in F _p

κ

[X]/(X ^N

ⁱ

− 1). We must check that 2 is invertible in F _p

κ

if N i is even;

indeed, if N _i is even, then so is p ^κ − 1, so p= / 2. The Bluestein conversion contributes O(t ^′ N _i S _p (κ)) to the cost of A i .

We must also compute a suitable table of roots, once at the top level. We first extract the

table 1, ω i , ..., ω _i ^N

ⁱ

⁻¹ from the top level table in time O(N κ lg p) = O(n lg p), and then sort them

into the correct order (and perform any necessary additions) in time O(N i lg N _i (κ lg p + lg N)) =

O(S ³ lg N _i κ lg p) = O(lg N _i (n lg p)). Over all i the cost is O(lg N (n lg p)) = O(n lg n lg p).

(16)

Kronecker substitution. We finally convert each multiplication in F _p

κ

[X ]/(X ^N

ⁱ

− 1) to a mul- tiplication in F _p [X] / (X ⁿ

ⁱ

− 1) using Kronecker substitution (see section 2.6). The latter multiplica- tions have cost C _p,t

_′

(n i ), since one argument is fixed. After the multiplications, we must also perform t ^′ N i divisions by P to recover the results in F _p

κ

[X ]/(X ^N

ⁱ

− 1), at cost O(t ^′ N i S _p (κ)).

Consolidating the estimates for the Bluestein conversion and Kronecker substitution, we have D _i,t

_′

6 C _p,t

_′

(n i ) + O(t ^′ N _i S _p (κ)).

6 (2 t ^′ + 1) C _p (n i ) + O(t ^′ N _i S _p (κ)).

For t ^′ = N /N i > N / S ³ ≻ lg lg n, this becomes D _i,N _/N

i

6 (2 + O(1/ lg lg n)) (N / N _i ) C _p (n i ) + O(N S _p (κ)).

Conclusion. Summing over i yields D 6

2 + O

1 lg lg n

X

i=1

d N

N _i C _p (n i ) + O(d N S _p (κ)) + O(n lg n lg p).

Since

d N S _p (κ) = O

lg n

(lg lg n) ³ lg lg lg n N κ lg κ lg lg κ lg p lg lg p lg lg lg p

= O

n lg n lg p (lg lg lg n) ² lg lg n

= O(n lg n lg p) and

N

N _i = 2 N κ 2 N _i κ =

4 + O

1 lg n

n n _i , this becomes

D 6

8 + O 1

lg lg n

X

i=1

d n

n _i C _p (n i ) + O(n lg n lg p).

Let γ _i := log N _i /log N , so that P

i γ _i = 1. Since lg n i = log 2 N i + O(log κ) =

1 + O

log λ log S

log 2 N i =

1 + O 1

lg lg n

log 2 N i

and lg n = (1 + O(1/ lg lg n)) log 2 N , we get D 6

8 + O

1 lg lg n

X

i=1 d

γ _i n lg n

n _i lg n _i C _p (n i ) + O(n lg n lg p).

To compute the desired t products, we must execute t + 1 forward transforms, and t inverse transforms. Each product also requires O(N ) pointwise multiplications in F _p

κ

and O(N ) multipli- cations by 1/ N. These have cost O(N S _p (κ)), which is absorbed by the O(n lg n lg p) term. Thus we obtain

C _p,t (n) 6 (2 t + 1) D + O(t n lg n lg p).

Dividing by (2 t + 1) n lg n lg p, taking suprema over t > 1, and using (7.2) and (7.3), yields the bound (7.1).

Finally, since

n i = O(S ³ λ ³ ) = O (lg n) ^3c

³

^{lg lg lg} ^{n((lg lg} ⁿ⁾

²

⁺¹⁾

= O (lg n) ^{(lg lg} ⁿ⁾

³

,

we have log n _i = O((log log n) ⁴ ) and hence log log log n _i 6 log log log log n + C for some constant C and all large n. We may then take Φ(x) := exp ^◦3 (log ^◦4 x + C).

Faster polynomial multiplication over finite fields

HAL Id: hal-01022757

https://hal.archives-ouvertes.fr/hal-01022757v2

Preprint submitted on 11 Feb 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

David Harvey, Joris van der Hoeven, Grégoire Lecerf

To cite this version:

David Harvey, Joris van der Hoeven, Grégoire Lecerf. Faster polynomial multiplication over finite

fields. 2014. �hal-01022757v2�

over finite fields

David Harvey

School of Mathematics and Statistics University of New South Wales

Sydney NSW 2052 Australia

Email: [email protected]

Joris van der Hoeven a , Grégoire Lecerf b Laboratoire d’informatique

UMR 7161 CNRS École polytechnique 91128 Palaiseau Cedex

France

a. Email: [email protected] b. Email: [email protected] February 11, 2015

Let p be a prime, and let M p (n) denote the bit complexity of multiplying two polynomials in F p [X ] of degree less than n. For n large compared to p, we establish the bound M p (n) = O(n log n 8 log

n log p), where log ∗ is the iterated logarithm. This is the first known Fürer- type complexity bound for F p [X], and improves on the previously best known bound M p (n) = O(n log n log log nlog p).

Keywords: Polynomial multiplication, finite field, algorithm, complexity bound, FFT A.C.M. subject classification: G.1.0 Computer-arithmetic, F.2.1 Computations in finite fields

A.M.S. subject classification: 68W30, 68Q17, 68W40

1. Introduction

Strassen result for integer multiplication [32] and generalises previous work by Schönhage [31].

These algorithms all rely on suitable incarnations of the fast Fourier transform (FFT) [12]. For details we refer the reader to [17, Chapter 8].

In this paper we are mainly interested in the case that R is the finite field F p = Z / p Z for some prime p. The standard bit complexity model based on deterministic multitape Turing machines [26]

is more realistic in this setting, as it takes into account the dependence on p. We write M p (n) for the bit complexity of multiplying two polynomials in F p [X ] of degree less than n. Two basic approaches are known for obtaining good asymptotic bounds for M p (n).

According to our recent sharpening [18] of Fürer’s bound [15], we may take

I (n) = O(n log n 8 log

n ), (1.1)

where log ∗ x denotes the iterated logarithm for x ∈ R , i.e.,

log ∗ x := min { k ∈ N : log ◦k x 6 1 } , (1.2) log ◦k := log ◦ ···

k× ◦ log.

1

If log n = O(log p), then Kronecker substitution yields M p (n) = O( I (n log p)) (see section 2.6), so we have

M p (n) = O n log p log (n log p) 8 log

(n log p) .

M p (n) = O(n log n log log n log p + n log n I (log p)).

The first term dominates for large enough n, say for log log p 8 log

p = O(log log n), and we get simply

M p (n) = O(n log n log log n log p). (1.3)

Since Fürer’s breakthrough result, there has been a gap between what is known for integer multiplication and for multiplication in F p [X ]. Namely, it has remained an open question whether the factor log log n appearing in (1.3) can be improved to K log

n for some K > 1. It seems that Fürer’s construction does not work in F p [X].

1.1. Our contributions

The main result of this paper is a new algorithm that completely closes the gap mentioned above.

Theorem 1.1 . The bound

M p (n) = O n log p log(n log p) 8 log

(nlog p) holds uniformly for all n > 2 and all primes p.

The bound follows immediately via Kronecker substitution if log n = O(log p). The bulk of the paper concerns the reverse case where n is large compared to p. For example, if log log p =O(log n), the bound becomes simply

M p (n) = O(n log n 8 log

n log p).

As promised, this replaces the factor log log n in (1.3) by 8 log

n .

The basic idea of the new algorithm is as follows. We first construct a special extension F p

[Y ], by cutting up the polynomials into small pieces, and we then multiply in F p

[Y ] by using FFTs over F p

. Applying the Cooley–Tukey method to the small divisors of p κ − 1, we decompose the FFTs into transforms of exponentially shorter lengths (not necessarily powers of two). We use Bluestein’s method to convert each short transform to a convolution over F p

In many respects this approach is very similar to the algorithm for integer multiplication introduced in [18], but there are various additional technical issues to address. We recommend [18]

as a gentle introduction to the main ideas.

We also prove the following conditional result.

Theorem 1.2 . Assume Conjectures 8.1, 8.3 and 8.5. Then the bound

M p (n) = O n log p log(n log p) 4 log

(nlog p)

holds uniformly for all n > 2 and all primes p.

1.2. Outline of the paper

The paper is structured as follows. In section 2, we start with a survey of relevant basic techniques:

discrete Fourier transforms (DFTs) and FFTs, Bluestein’s chirp transform, and Kronecker sub- stitution. Most of this material is repeated from [18, Section 2] for the convenience of the reader;

however, sections 2.5 and 2.6 differ substantially.

In section 3, we recall basic complexity results for arithmetic in finite fields. In particular, we consider the construction of irreducible polynomials in F p [X], algorithms for finding roots of unity in F p

, and the cost of arithmetic in F p

and F p

[Y ].

In section 4, we show how to construct special extensions of F p whose multiplicative group has a large subgroup of highly smooth order, i.e., is divisible by many small integers. The main tool is [2, Theorem 3].

To minimise the constant K in the bound M p (n) = O(n log n K log

n log p) (for n large relative to p), we need one more tool: in section 6, we present a polynomial analogue of the Crandall–Fagin convolution algorithm [13]. This allows us to convert a cyclic convolution over F p of length n, where n is arbitrary, into a cyclic convolution over F p

of somewhat smaller length N, where N is prescribed and where κ ≈ 2 n/ N . (We can still obtain K = 16 without using this Crandall–Fagin analogue, but we do not know how to reach K = 8 without it.)

Section 7 is devoted to the proof of Theorem 1.1. Section 8 gives the proof of Theorem 1.2, and discusses the three number-theoretic hypotheses of that theorem.

Notations. We use Hardy’s notations f ≺ g for f = o(g), and f ≍ g for f = O(g) and g = O(f ).

We will write lg n := ⌈ log n/ log 2 ⌉ . In expressions like log log p or lg lg p, we tacitly assume that the function is adjusted so as to take positive values for small primes such as p = 2.

Joris van der Hoeven ^a , Grégoire Lecerf ^b Laboratoire d’informatique

Let p be a prime, and let M p (n) denote the bit complexity of multiplying two polynomials in F p [X ] of degree less than n. For n large compared to p, we establish the bound M p (n) = O(n log n 8 ^log

ⁿ log p), where log ^∗ is the iterated logarithm. This is the first known Fürer- type complexity bound for F p [X], and improves on the previously best known bound M p (n) = O(n log n log log nlog p).

In this paper we are mainly interested in the case that R is the finite field F _p = Z / p Z for some prime p. The standard bit complexity model based on deterministic multitape Turing machines [26]

is more realistic in this setting, as it takes into account the dependence on p. We write M _p (n) for the bit complexity of multiplying two polynomials in F _p [X ] of degree less than n. Two basic approaches are known for obtaining good asymptotic bounds for M _p (n).

I (n) = O(n log n 8 ^log

ⁿ ), (1.1)

where log ^∗ x denotes the iterated logarithm for x ∈ R , i.e.,

log ^∗ x := min { k ∈ N : log ^◦k x 6 1 } , (1.2) log ^◦k := log ◦ ···

If log n = O(log p), then Kronecker substitution yields M _p (n) = O( I (n log p)) (see section 2.6), so we have

M _p (n) = O n log p log (n log p) 8 ^log

⁽ⁿ ^log ^p) .

M _p (n) = O(n log n log log n log p + n log n I (log p)).

The first term dominates for large enough n, say for log log p 8 ^log

^p = O(log log n), and we get simply

M _p (n) = O(n log n log log n log p). (1.3)

Since Fürer’s breakthrough result, there has been a gap between what is known for integer multiplication and for multiplication in F _p [X ]. Namely, it has remained an open question whether the factor log log n appearing in (1.3) can be improved to K ^log

ⁿ for some K > 1. It seems that Fürer’s construction does not work in F _p [X].

M _p (n) = O n log p log(n log p) 8 ^log

^(nlog ^p) holds uniformly for all n > 2 and all primes p.

M _p (n) = O(n log n 8 ^log

ⁿ log p).

As promised, this replaces the factor log log n in (1.3) by 8 ^log

ⁿ .

The basic idea of the new algorithm is as follows. We first construct a special extension F _p

[Y ], by cutting up the polynomials into small pieces, and we then multiply in F _p

[Y ] by using FFTs over F _p

. Applying the Cooley–Tukey method to the small divisors of p ^κ − 1, we decompose the FFTs into transforms of exponentially shorter lengths (not necessarily powers of two). We use Bluestein’s method to convert each short transform to a convolution over F _p

M _p (n) = O n log p log(n log p) 4 ^log

^(nlog ^p)

In section 3, we recall basic complexity results for arithmetic in finite fields. In particular, we consider the construction of irreducible polynomials in F _p [X], algorithms for finding roots of unity in F _p

, and the cost of arithmetic in F _p

and F _p

In section 4, we show how to construct special extensions of F _p whose multiplicative group has a large subgroup of highly smooth order, i.e., is divisible by many small integers. The main tool is [2, Theorem 3].

To minimise the constant K in the bound M _p (n) = O(n log n K ^log

ⁿ log p) (for n large relative to p), we need one more tool: in section 6, we present a polynomial analogue of the Crandall–Fagin convolution algorithm [13]. This allows us to convert a cyclic convolution over F _p of length n, where n is arbitrary, into a cyclic convolution over F _p

In the Turing model, we have available a fixed number of linear tapes. An n 1 × ··· × n _d array

M _i

of b-bit elements is stored as a linear array of n 1 ··· n _d b bits. We generally assume that the

elements are ordered lexicographically by (i 1 , ..., i _d ), though this is just an implementation detail.

Let R be a commutative ring with identity and let n > 1. An element ω ∈ R is said to be a principal n-th root of unity if ω ⁿ = 1 and

(ω ⁱ ) ^k = 0 (2.1)

for all i ∈ { 1, ..., n − 1 }. In this case, we define the discrete Fourier transform (or DFT) of an n-tuple a = (a 0 , ..., a n−1 ) ∈ R ⁿ with respect to ω to be DFT ω (a) = a ˆ = (a ˆ 0 , ..., a ˆ n−1 ) ∈ R ⁿ where

a ˆ i := a 0 + a 1 ω ⁱ + ··· + a n−1 ω ⁽ⁿ⁻¹⁾ⁱ .

That is, a ˆ i is the evaluation of the polynomial A(X) := a 0 + a 1 X + ··· + a n −1 X ⁿ ⁻¹ at ω ⁱ . If ω is a principal n-th root of unity, then so is its inverse ω ⁻¹ = ω ⁿ⁻¹ , and we have

DFT _ω

a ˆ j ω ^−ji = X

a k ω ^j(k−i) = X

ω ^j(k ⁻ⁱ⁾ = X

where δ _i,k = 1 if i = k and δ _i,k = 0 otherwise.

Let ω be a principal n-th root of unity and let n = n 1 n 2 where 1 < n 1 < n. Then ω ⁿ

is a principal n 2 -th root of unity and ω ⁿ

ω ^(k

ⁿ

^+k