Private Information Retrieval from Transversal Designs


HAL Id: hal-01901014

https://hal.archives-ouvertes.fr/hal-01901014

Submitted on 22 Oct 2018


Julien Lavauzelle

To cite this version:

Julien Lavauzelle. Private Information Retrieval from Transversal Designs. IEEE Transactions on

Information Theory, Institute of Electrical and Electronics Engineers, 2019, 65 (2), pp.1189-1205.

⟨10.1109/TIT.2018.2861747⟩. ⟨hal-01901014⟩


Private Information Retrieval from Transversal Designs

Julien Lavauzelle

Laboratoire LIX, École Polytechnique, Inria & CNRS UMR 7161

Université Paris-Saclay

Abstract: Private information retrieval (PIR) protocols allow a user to retrieve entries of a database without revealing the index of the desired item. Information-theoretical privacy can be achieved by the use of several servers and specific retrieval algorithms. Most known PIR protocols focus on decreasing the number of bits exchanged between the client and the server(s) during the retrieval process. On the other hand, Fazeli et al. introduced so-called PIR codes in order to reduce the storage overhead on the servers. However, few works address the computational complexity of the servers.

In this paper, we show that a specific encoding of the database yields PIR protocols with reasonable communication complexity, low storage overhead and optimal computational complexity for the servers. This encoding is based on incidence matrices of transversal designs, from which a natural and efficient recovering algorithm is derived. We also present several instances of our construction, which make use of finite geometries and orthogonal arrays. We finally give a generalisation of our main construction in order to resist collusions of servers.

I. INTRODUCTION

A. Private Information Retrieval

A private information retrieval (PIR) protocol aims at ensuring that a user can retrieve some part D_i of a remote database D without revealing the index i to the server(s) holding the database. For example, such protocols can be applied in medical data storage, where physicians would be able to access parts of the genome while hiding the specific gene they analyse. The PIR paradigm was originally introduced by Chor, Goldreich, Kushilevitz and Sudan [6, 7].

A naive solution to the problem consists in downloading the entire database each time the user wants a single entry. But the communication complexity would then be overwhelming, so we look for PIR protocols exchanging fewer bits. However, Chor et al. proved that, when the k-bit database is stored on a single server, a PIR protocol which leaks no information on the index i (such a protocol being called information-theoretically secure) must use Ω(k) bits of communication [7]. Two alternatives were then considered: restricting the protocol to computational security (initiated by Chor and Gilboa [5]), or allowing several servers to store the database. Our work focuses on the latter.

This paper appears in: IEEE Transactions on Information Theory, on pages: 1-17, DOI: 10.1109/TIT.2018.2861747.

It was presented in part at the Tenth International Workshop on Coding and Cryptography 2017, September 18-22, 2017, Saint-Petersburg, Russia.

This work is partially funded by French ANR-15-CE39-0013-01 “Manta”.

In many such PIR protocols the database is replicated on ℓ servers, ℓ > 1. Informally, each server is asked to compute some partial information related to a random-like query sent by the user. The user then collects all the servers' answers and retrieves the desired symbol with an appropriate algorithm. For instance, Chor et al. [7] considered a smart arrangement of the database entries in a log(ℓ)-dimensional array, and used XOR properties to mask the index of the desired item and to retrieve the associated symbol. Their protocol features decreasing communication as a function of the number of servers: with ℓ servers, the communication is O(ℓ log(ℓ) k^{1/log ℓ}) bits. For constant ℓ, the authors also proposed a PIR protocol with communication O(k^{1/ℓ}). A few years later, Katz and Trevisan [13] showed that any smooth locally decodable code C ⊆ Σ^n of locality ℓ gives rise to a PIR protocol with ℓ servers whose communication complexity is O(ℓ log(n|Σ|)); see [20] for a good survey on locally decodable codes (LDCs) and their applications in PIR protocols. Building on this idea, many PIR schemes (notably [3, 19, 10, 9]) successively decreased the communication complexity, achieving O(k^{√(log log k / log k)}) with only ℓ = 2 servers. However, only a few of them tried to lighten the computational and storage cost on the server side.

By preprocessing the database, Beimel, Ishai and Malkin [4] were the first to address the minimization of the server storage/computation in PIR protocols. Then, initiated by Fazeli, Vardy and Yaakobi [11], recent works used the concept of PIR codes to address the storage issue. The idea is to turn an ℓ-server replication-based PIR protocol into a more-than-ℓ-server distributed PIR protocol with a smaller overall storage overhead. For this purpose, the user encodes the database and distributes pieces of the associated codeword among the servers, such that servers hold distinct parts of the database (plus some redundancy). Through this transformation, both communication complexity and computational cost keep the same order of magnitude, but the storage overhead becomes that of the PIR code, which can be brought arbitrarily close to 1 when sufficiently many servers are used. Several recent works also address the PIR issue on previously coded databases [18], and/or aim at reaching the so-called capacity of the model [17]. However, while the storage drawback seems to be solved, huge computational costs still represent a barrier to the practicality of such PIR protocols.


B. Motivations and results

As pointed out by Yekhanin [20], “the overwhelming computational complexity of PIR schemes (...) currently presents the main bottleneck to their practical deployment”. Consider a public database which is frequently queried, e.g. a database storing stock exchange prices where private queries could be very relevant. Fast retrieval is crucial in this context. Hence, one cannot afford each run of the PIR protocol to be computationally inefficient, for instance Ω(k) if k is the size of the database. Therefore, a relevant goal is to build PIR protocols whose computational complexity is sublinear in the length of the database stored by each server.

Naively, the computational complexity of a PIR protocol could be drastically reduced if all possible answers to its queries were precomputed. Of course, storing all these answers dramatically increases the needed storage, so let us focus on a construction due to Augot, Levy-dit-Vehel and Shikfa [2] (anterior to the PIR codes breakthrough [11]) that addresses this issue.

The construction of Augot et al. [2] uses a specific family of high-rate locally decodable codes called multiplicity codes, introduced by Kopparty, Saraf and Yekhanin [14]. But instead of replicating the database on ℓ servers (ℓ > 1 being the locality of the codes), the authors split an encoded version c of the database D into parts c^(1), ..., c^(ℓ), and share these parts on the servers. The main difference with PIR codes [11] is that Augot et al.'s construction does not aim to emulate a lighter PIR protocol from an existing one. It uses specific properties of the encoding as a way to split the database on several servers. In short, the multiplicity codes they use provide both the privacy of the PIR protocol and the storage reduction for the servers. We refer to Section VII for more details on the construction.

In this work, we reconsider this “codeword support splitting” idea, and we propose a new generic framework for the construction of PIR protocols which takes into account the computational complexity issue. More precisely, the protocols we give are computationally optimal with respect to the communication complexity of the protocol, in the sense that each server needs to read only one entry in the part of the database it holds.

Our construction is based on combinatorial structures called transversal designs, from which we naturally derive a linear code, a partition of its support and a local reconstruction algorithm. In practice, we give several instances of transversal designs that lead to codes with large rate, hence to PIR protocols with low storage overhead. The first two families come from incidences between points and lines in the affine (resp. projective) space. They are closely related to the classical geometric designs of 1-flats. A third family of instances makes use of a classical transformation of so-called orthogonal arrays of strength 2 into transversal designs. We then proceed to a thorough study of the dimension of codes coming from MDS-like orthogonal arrays of strength 2. A fourth and last family of practical instances appears when showing that orthogonal arrays built from divisible codes lead to PIR protocols with storage expansion less than 2. We finally prove that orthogonal arrays with strength t > 2 allow the construction of PIR protocols resisting collusions of up to t − 1 servers. We exhibit and analyse instances of some orthogonal arrays with large strength to conclude this work.

C. Organization

We start by giving two formal definitions of PIR protocols in Section II, depending on whether the database is replicated or distributed on the servers. We also present the standard construction of replication-based PIR protocols from smooth locally decodable codes. In Section III, we recall definitions of combinatorial structures and their associated codes. The 1-private PIR protocols based on transversal designs are introduced in Section IV. Section V is devoted to four families of instances of the PIR construction having practical parameters. Finally, a generalisation of our construction is given in Section VI in order to resist collusions of servers, and a comparison with the PIR protocols coming from multiplicity codes is presented in Section VII.

II. DEFINITIONS AND RELATED CONSTRUCTIONS

We first recall that we are only concerned with information-theoretically secure PIR protocols. In this paper, we denote by U the user (or client) of the PIR protocol. User U owns a database denoted by D = (D_i)_{1≤i≤k} ∈ F_q^k, where F_q represents the finite field with q elements. Database D hence contains |D| = k log q bits. We also denote by S_1, ..., S_ℓ the ℓ servers involved in the PIR protocol.

Given two sets A and B, with |B| = n < ∞, we denote by A^B the set of n-tuples a = (a_b)_{b∈B} of A-elements indexed by B, which can also be seen as functions from B to A. For T ⊂ B, we also write a|_T := (a_t)_{t∈T} for the restriction of the tuple a to the coordinates of T.

A. Two definitions for PIR protocols

A vast majority of existing PIR schemes start by simply cloning the database D on all the servers S_1, ..., S_ℓ. Then, the role of each server S_j is to compute some combination of symbols from D, related to the query sent by U. This computation has a non-trivial cost, so in a certain sense, the computational cost of privacy in the PIR scheme is mainly borne by the servers.

More formally, one can define replication-based PIR protocols as follows:

Definition II.1 (standard, or replication-based PIR protocol). Assume that every server S_j, 1 ≤ j ≤ ℓ, stores a copy of the database D. An ℓ-server replication-based PIR protocol is a set of three algorithms (Q, A, R) running the following steps on input i ∈ [1, k]:

1) Query generation: the randomized algorithm Q generates ℓ queries (q_1, ..., q_ℓ) := Q(i). Query q_j is sent to server S_j.

2) Servers' answer: each server S_j computes an answer a_j = A(q_j, D) and sends it back to the user¹.

1_algorithm_{A := A}

3) Reconstruction: denote by a = (a_1, ..., a_ℓ) and q = (q_1, ..., q_ℓ). User U computes and outputs r = R(i, a, q).

The PIR protocol is said to be:

• correct if r = D_i when the servers follow the protocol;

• t-private if, for every (i, i′) ∈ [1, k]^2 and every T ⊆ [1, ℓ] such that |T| ≤ t, the distributions Q(i)|_T and Q(i′)|_T are the same. We also say that the PIR protocol resists t collusions of servers.

We call communication complexity the number of bits sent between the user and the servers, and server (resp. user) computational complexity the maximal number of F_q-operations made by a server in order to compute an answer a_j (resp. made by R to reconstruct the desired item).
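To make Definition II.1 concrete, here is a minimal sketch that instantiates (Q, A, R) with the textbook 2-server XOR scheme on a bit database. It is an illustration only and not a protocol of this paper: the function names `query`, `answer` and `reconstruct` are hypothetical, indices are 0-based, and queries are k bits long, so the communication is linear in k.

```python
import secrets

def query(i, k):
    """Q(i): two k-bit selection masks that differ exactly at position i."""
    q1 = [secrets.randbelow(2) for _ in range(k)]
    q2 = q1.copy()
    q2[i] ^= 1
    return q1, q2

def answer(q, db):
    """A(q, D): XOR of the database bits selected by the mask q."""
    a = 0
    for bit, selected in zip(db, q):
        a ^= bit & selected
    return a

def reconstruct(a1, a2):
    """R: the two answers differ exactly by the contribution of D_i."""
    return a1 ^ a2

db = [1, 0, 1, 1, 0, 0, 1, 0]          # toy database with k = 8 bits
i = 3                                   # index the user wants to hide
q1, q2 = query(i, len(db))
recovered = reconstruct(answer(q1, db), answer(q2, db))
print(recovered == db[i])
```

Each mask taken alone is uniformly distributed, so this sketch is 1-private in the sense above. Each server, however, scans up to k bits per query, which is the kind of server-side cost the distributed protocols of this paper aim to avoid.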

According to this definition, one sees that the servers must jointly carry the ℓ copies of the database, so the storage overhead of the scheme is (ℓ − 1)|D| bits. Moreover, since D is a raw database without specific structure, the algorithm A has no reason to be trivial and can incur superlinear computations for the servers, which is the case for most current replication-based PIR protocols.

A way to reduce the computational cost of PIR protocols is to preprocess the database. Therefore we need to model PIR protocols for which the database can be encoded and distributed over the servers. From now on, let c = (c_i)_{i∈I} denote an encoding of the database D, i.e. the image of D by an injective map F_q^k → F_q^I, with |I| = n ≥ k. Besides, for convenience we assume that I = [1, s] × [1, ℓ], and for readability we write c_{(i_1,i_2)} = c^{(i_2)}_{i_1} and c^{(j)} = (c^{(j)}_r)_{r∈[1,s]}.

Definition II.2 (distributed PIR protocol). Assume that for 1 ≤ j ≤ ℓ, server S_j holds the part c^(j) of the encoded database. An ℓ-server distributed PIR protocol is a set of three algorithms (Q, A, R) running the following steps on input i ∈ I:

1) Query generation: the randomized algorithm Q generates ℓ queries (q_1, ..., q_ℓ) := Q(i). Query q_j is sent to server S_j.

2) Servers' answer: each server S_j computes an answer a_j = A(q_j, c^(j)) and sends it back to the user.

3) Reconstruction: denote by a = (a_1, ..., a_ℓ) and q = (q_1, ..., q_ℓ). User U computes and outputs r = R(i, a, q).

Correctness and privacy properties are identical to those of replication-based PIR protocols. Similarly, one can also define communication and computational complexities, and since the database D has been encoded, we finally define the storage overhead as the number of redundancy bits stored by the servers, that is, (sℓ − k) log q.

In this paper, we focus on distributed PIR protocols with low computational complexity on the server side. More precisely, we build PIR protocols where the answering algorithm A consists only in reading some symbols of the database. Thus, our PIR protocols are computationally optimal on the server side, in the sense that, compared to non-private retrieval, they incur no extra computational burden for each server taken individually.

B. PIR protocols from locally decodable codes

As pointed out in the introduction, Augot et al. [2] used a family of locally decodable codes (LDCs) to design a distributed PIR scheme. LDCs have long been known to give rise to PIR protocols [13], but we emphasize that the main idea from [2] is to benefit from the fact that the encoded database can be smartly partitioned with respect to the queries of the local decoder.

Based on the seminal work of Katz and Trevisan [13], we briefly recall how to design a PIR protocol based on a perfectly smooth locally decodable code. First, let us define (linear) locally decodable codes.

Definition II.3 (locally decodable code). Let Σ be a finite set, 2 ≤ ℓ ≤ k ≤ n be integers, and δ, ε ∈ [0, 1]. A code C : Σ^k → F_q^n is (ℓ, δ, ε)-locally decodable if and only if there exists a randomized algorithm D such that, for every input i ∈ [1, k], we have:

• for all m ∈ Σ^k and all y ∈ F_q^n, if |{j ∈ [1, n], y_j ≠ C(m)_j}| ≤ δn, then

P(D^(y)(i) = m_i) ≥ 1 − ε ,

where the probability is taken over the internal randomness of D;

• D reads at most ℓ symbols y_{q_1}, ..., y_{q_ℓ} of y.

Notation D^(y) refers to the fact that D has oracle access to single symbols y_{q_j} of the word y. The parameter ℓ is called the locality of the code. Moreover, the code C is said to be perfectly smooth if, on an arbitrary input i, each individual query of the decoder D is uniformly distributed over the coordinates of the word y.

Now let us say a user wants to use a PIR protocol on a database D ∈ Σ^k, and assume there exists a perfectly smooth locally decodable code C ⊂ F_q^n of dimension k and locality ℓ. Figure 1 presents a distributed PIR protocol based on C.

1) Initialization step. User U encodes D into a codeword c′ ∈ C. Each server S_1, ..., S_ℓ holds a copy of c′. In the formalism of Definition II.2, it means that c^(j) := c′ for j = 1, ..., ℓ.

2) Retrieving step for symbol D_i. Denote by D a local decoding algorithm for C.

   1) Queries generation: user U calls D to generate at random a query (q_1, ..., q_ℓ) for decoding the symbol D_i. Query q_j is sent to server S_j.

   2) Servers' answer: each server S_j reads the encoded symbol a_j := c′_{q_j}. Then S_j sends a_j to U.

   3) Reconstruction: user U collects the ℓ codeword symbols (c′_{q_j})_{j∈[1,ℓ]} and feeds the local decoding algorithm D in order to retrieve D_i.

Fig. 1: A distributed PIR protocol based on a locally decodable code C.
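As an illustration of Figure 1, the sketch below uses the binary Hadamard code, a classical perfectly smooth code with locality ℓ = 2. The choice of this code is an assumption made for brevity (it is not used in this paper), and the names `hadamard_encode` and `queries` are hypothetical; indices are 0-based.

```python
import secrets
from itertools import product

def hadamard_encode(m):
    """Codeword indexed by all x in F_2^k, with c_x = <m, x> mod 2."""
    k = len(m)
    return {x: sum(a * b for a, b in zip(m, x)) % 2
            for x in product((0, 1), repeat=k)}

def queries(i, k):
    """Two coordinates x and x + e_i; each one alone is uniform over F_2^k."""
    x = tuple(secrets.randbelow(2) for _ in range(k))
    y = tuple(b ^ 1 if t == i else b for t, b in enumerate(x))
    return x, y

D = (1, 0, 1)                     # toy database, k = 3
c = hadamard_encode(D)            # initialization: both servers hold a copy of c
i = 2
q1, q2 = queries(i, len(D))
a1, a2 = c[q1], c[q2]             # each server reads a single symbol
print((a1 + a2) % 2 == D[i])      # local decoding: c_x + c_{x+e_i} = m_i
```

Here each server only reads one symbol, but the codeword has length 2^k, which gives a concrete feeling for the storage drawback discussed next.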

The main drawback of these LDC-based PIR protocols is their storage overhead, since the ℓ servers must store ℓn/k = ℓ/R times more data than the raw database (R := k/n represents the information rate, or rate, of the code). This issue becomes especially crucial as building LDCs with small locality and high rate is highly non-trivial.

The idea of Augot, Levy-dit-Vehel and Shikfa [2] for reducing the storage overhead is to benefit from a natural partition of the support of multiplicity codes [14]. Assume that each codeword c ∈ C can be split into ℓ disjoint parts c^(1), ..., c^(ℓ), such that each coordinate q_j of any possible query (q_1, ..., q_ℓ) of the PIR protocol corresponds to reading some symbols on c^(j). By sending the part c^(j) to server S_j, the PIR protocol of Figure 1 can be improved in order to save storage. We devote Section VII to more explanation on this construction, as well as to a comparison with our schemes.

Finally, one can notice that the communication complexity of LDC-based PIR protocols depends on the locality of the code, while the smoothness of the code serves their privacy. We also point out two important remarks.

1) Assuming a noiseless transmission and honest-but-curious servers (i.e. they want to discover the index of the desired symbol but never give wrong answers), one does not need a powerful local decoding algorithm. Indeed, it should be possible to reconstruct the desired symbol D_i by locally decoding a single erasure on the codeword. For instance, computing a single low-weight parity-check sum should be enough.

2) Smoothness is sufficient for 1-privacy, but we need more structure for preventing collusions of servers.

Coupled with the fact that we want to split the database over several servers, these remarks lead us to design other kinds of encoding, which meet the needs of private information retrieval protocols as closely as possible. Our construction relies on combinatorial structures, namely transversal designs, that we recall in the upcoming section.

III. TRANSVERSAL DESIGNS AND CODES

Let us give here the definition of transversal designs and how to build linear codes upon them. We refer to [1], [16] and [8] for complementary details.

Definition III.1 (block design). A block design is a pair D = (X, B) where X is a finite set of so-called points, and B is a set of non-empty subsets of X called the blocks.

Definition III.2 (incidence matrix). Let D = (X, B) be a block design. An incidence matrix M_D of D is a matrix of size |B| × |X| whose (i, j)-entry, for i ∈ B and j ∈ X, is 1 if the block i contains the point j, and 0 otherwise.

The q-rank of M_D is the rank of M_D over the field F_q.

For B ⊂ X, the incidence vector 1_B ∈ {0, 1}^X is the row vector whose x-th coordinate is 1 if and only if x ∈ B. Let us notice that, given a design D = (X, B), one can build M_D by stacking incidence vectors of blocks B ∈ B.

Of course, any design admits many incidence matrices, depending on the way points and blocks are ordered. However, all these incidence matrices are equal up to some permutation of their rows and columns, and, in particular, they all have the same rank. Hence, we call rank of a design the q-rank of any of its incidence matrices. Moreover, from now on we consider incidence matrices of designs up to an ordering of points and blocks, and we abusively refer to the incidence matrix M_D of a design D.

Example III.3. Let A^2(F_3) be the affine plane over the finite field F_3, and X be the set consisting of its 9 points:

X = { (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2) } .

We define the block set B as the set of the 12 affine lines of A^2(F_3):

B = { {(0, 0), (0, 1), (0, 2)}, {(1, 0), (1, 1), (1, 2)}, {(2, 0), (2, 1), (2, 2)},
      {(0, 0), (1, 1), (2, 2)}, {(1, 0), (2, 1), (0, 2)}, {(2, 0), (0, 1), (1, 2)},
      {(0, 0), (2, 1), (1, 2)}, {(1, 0), (0, 1), (2, 2)}, {(2, 0), (1, 1), (0, 2)},
      {(0, 0), (1, 0), (2, 0)}, {(0, 1), (1, 1), (2, 1)}, {(0, 2), (1, 2), (2, 2)} } .

The pair D = (X, B) is then a block design, and its associated (12 × 9)-incidence matrix is

\[
M_D = \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}.
\]

A computation shows that over the field F_2, the matrix M_D has full rank, while over F_3, it only has rank 6.
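The rank computation of Example III.3 can be reproduced with the following sketch. The helpers `affine_lines` and `rank_mod_p` are hypothetical names introduced here; the rank is obtained by plain Gaussian elimination over a prime field, and the rows of the matrix are ordered differently from the display above, which does not affect the rank.

```python
from itertools import product

def affine_lines(p):
    """Points and the p^2 + p lines of the affine plane over F_p (p prime)."""
    points = list(product(range(p), repeat=2))
    lines = [[(x, (a * x + b) % p) for x in range(p)]        # lines y = a*x + b
             for a, b in product(range(p), repeat=2)]
    lines += [[(c, y) for y in range(p)] for c in range(p)]  # lines x = c
    return points, lines

def rank_mod_p(rows, p):
    """Rank of an integer matrix over F_p (p prime), by Gaussian elimination."""
    m = [[v % p for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)          # modular inverse (Python 3.8+)
        m[rank] = [(v * inv) % p for v in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                f = m[r][col]
                m[r] = [(u - f * v) % p for u, v in zip(m[r], m[rank])]
        rank += 1
    return rank

points, lines = affine_lines(3)
M = [[1 if pt in line else 0 for pt in points] for line in lines]   # 12 x 9 matrix
print(rank_mod_p(M, 2), rank_mod_p(M, 3))
```

The two printed values are the 2-rank and the 3-rank of the design, to be compared with the ranks stated in the example.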

Definition III.4 (transversal design). Let s, ℓ ≥ 2 and λ ≥ 1 be integers. A transversal design, denoted TD_λ(ℓ, s), is a block design (X, B) equipped with a partition G = {G_1, ..., G_ℓ} of X called the set of groups, such that:

• |X| = ℓs;

• any group in G has size s and any block in B has size ℓ;

• any unordered pair of elements from X is contained either in one group and no block, or in no group and λ blocks.

If λ = 1, we use the simpler notation TD(ℓ, s).

Remark III.5. A block cannot meet a group in more than one point, otherwise the third condition of the definition would be violated. Moreover, since the block size equals the number of groups, any block must meet any group. Hence the following holds:

∀(B, G) ∈ B × G, |B ∩ G| = 1 .

The definition also implies that B contains exactly λs^2 blocks.

Example III.6. Let D = (X, B) be the block design defined in Example III.3. Define G to be any set of 3 parallel lines from B which partitions the point set X. For instance, one can consider

G = { {(0, 0), (0, 1), (0, 2)}, {(1, 0), (1, 1), (1, 2)}, {(2, 0), (2, 1), (2, 2)} } .

Then, T = (X, B \ G, G) is a transversal design TD(3, 3). Indeed, T is composed of ℓs = 9 points, ℓ = 3 groups of size s = 3 and s^2 = 9 blocks of size ℓ = 3 each. Moreover, in the affine plane every unordered pair of points belongs to a unique line, which is represented in T either by a group or by a block. More generally, for any prime power q, a transversal design TD(q, q) can be built with the affine plane A^2(F_q). A generalisation of this construction will be given in Subsection V-A.
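The axioms of Definition III.4 can be checked mechanically on this example. The sketch below is a brute-force verification, assuming the groups are the lines x = c and the blocks are the lines y = ax + b of the affine plane over F_3; the function name `is_transversal_design` is hypothetical.

```python
from itertools import combinations, product

p = 3
points = list(product(range(p), repeat=2))
groups = [frozenset((c, y) for y in range(p)) for c in range(p)]     # lines x = c
blocks = [frozenset((x, (a * x + b) % p) for x in range(p))          # lines y = a*x + b
          for a, b in product(range(p), repeat=2)]

def is_transversal_design(points, blocks, groups, lam=1):
    """Brute-force check of the three conditions of a TD_lam(ell, s)."""
    ell, s = len(groups), len(groups[0])
    sizes_ok = (len(points) == ell * s
                and all(len(G) == s for G in groups)
                and all(len(B) == ell for B in blocks))
    partition_ok = sorted(x for G in groups for x in G) == sorted(points)
    # Each pair lies in (one group, no block) or in (no group, lam blocks).
    pairs_ok = all(
        (sum(x in G and y in G for G in groups),
         sum(x in B and y in B for B in blocks)) in {(1, 0), (0, lam)}
        for x, y in combinations(points, 2))
    return sizes_ok and partition_ok and pairs_ok

print(is_transversal_design(points, blocks, groups))
print(all(len(B & G) == 1 for B in blocks for G in groups))   # Remark III.5
```

Keeping the three group lines inside the block set would break the third condition, which is why the blocks of T are taken in B \ G.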

A simple way to build linear codes from block designs is to associate a parity-check equation of the code to each incidence vector of a block of the design. We recall that the dual code C^⊥ of a code C ⊆ F_q^n is the linear vector space consisting of the vectors h ∈ F_q^n such that, for all c ∈ C, Σ_{i=1}^{n} c_i h_i = 0.

Definition III.7 (code of a design). Let F_q be a finite field, D = (X, B) be a block design and M_D be its incidence matrix. The code Code_q(D) is the F_q-linear code of length |X| admitting M_D as a parity-check matrix.

Remark III.8. The code Code_q(D) is uniquely defined up to a chosen order of the points X. For different orders, the arising codes remain permutation-equivalent. Also notice that the way blocks are ordered does not affect the code.

For any design D, the dimension over F_q of Code_q(D) equals |X| − rank_q(M_D). Since M_D has coefficients in {0, 1}, one must notice that rank_q(M_D) = rank_p(M_D), where p is the characteristic of the field F_q.

Remark III.9. Standard literature (e.g. [1]) sometimes defines Code_q(D) (and not Code_q(D)^⊥) to be the vector space generated by the incidence matrix of the design. We favor the convention of Definition III.7 because Code_q(D) will serve to encode the database in our PIR scheme.

Example III.10. The design D from Example III.3 gives rise to C = Code_3(D), a linear code over F_3, of length 9 and dimension 3. A full-rank generator matrix of C is given by:

\[
G = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2
\end{pmatrix}.
\]

One may notice that this code is the generalized Reed-Muller code of degree 1 and order 2 over F_3, that is, the evaluation code of bivariate polynomials of total degree at most 1 over the whole affine plane F_3^2.
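As a sanity check of Example III.10, the sketch below rebuilds the rows of G as the evaluations of the polynomials 1, x and y on the points of the plane, and verifies that each row satisfies the 12 parity checks given by the lines. The helper name `parity_checks_hold` is hypothetical, and the points are ordered as in Example III.3.

```python
from itertools import product

p = 3
points = list(product(range(p), repeat=2))      # same ordering as X in Example III.3
lines = [[(x, (a * x + b) % p) for x in range(p)]
         for a, b in product(range(p), repeat=2)]
lines += [[(c, y) for y in range(p)] for c in range(p)]

# Rows of G: evaluations of the polynomials 1, x and y at every point.
G = [[1 for _ in points], [x for x, _ in points], [y for _, y in points]]

def parity_checks_hold(word):
    """True if the word sums to 0 mod p along every line of the plane."""
    return all(sum(word[points.index(pt)] for pt in line) % p == 0
               for line in lines)

print(G)
print(all(parity_checks_hold(row) for row in G))
```

Since the parity checks are linear, every F_3-combination of the three rows satisfies them as well.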

Definition III.11 (systematic encoding). Let C ⊆ F_q^n be a linear code of dimension k ≤ n. A systematic encoding for C is a one-to-one map φ : F_q^k → C, such that there exists an injective map σ : [1, k] → [1, n] satisfying:

∀m ∈ F_q^k, ∀i ∈ [1, k], m_i = φ(m)_{σ(i)} .

The set σ([1, k]) ⊆ [1, n] is called an information set of C.

In other words, a systematic encoding allows one to view the message m as a subword of its associated codeword φ(m) ∈ C. For instance, it is useful for retrieving m from c efficiently, when the codeword c has not been corrupted. A systematic encoding exists for any code C, is not necessarily unique, and can be computed through a Gaussian elimination over any generator matrix of the code. Also notice that this computation can be tedious for large codes.
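The Gaussian elimination mentioned above can be sketched as follows for a prime field. The helper names `rref_mod_p` and `encode_systematic` are hypothetical, the generator matrix is the one of Example III.10, and positions are 0-based, so `info_set[i]` plays the role of σ(i).

```python
def rref_mod_p(rows, p):
    """Reduced row echelon form over F_p (p prime) and its pivot columns."""
    m = [[v % p for v in row] for row in rows]
    pivots, r = [], 0
    for col in range(len(m[0])):
        if r == len(m):
            break
        piv = next((t for t in range(r, len(m)) if m[t][col]), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        inv = pow(m[r][col], -1, p)             # modular inverse (Python 3.8+)
        m[r] = [(v * inv) % p for v in m[r]]
        for t in range(len(m)):
            if t != r and m[t][col]:
                f = m[t][col]
                m[t] = [(u - f * v) % p for u, v in zip(m[t], m[r])]
        pivots.append(col)
        r += 1
    return m[:r], pivots

p = 3
G = [[1, 1, 1, 1, 1, 1, 1, 1, 1],
     [0, 0, 0, 1, 1, 1, 2, 2, 2],
     [0, 1, 2, 0, 1, 2, 0, 1, 2]]
G_sys, info_set = rref_mod_p(G, p)      # pivot columns form an information set

def encode_systematic(m):
    """phi(m) = m * G_sys over F_p; m can be read off at the positions info_set."""
    return [sum(mi * row[j] for mi, row in zip(m, G_sys)) % p
            for j in range(len(G_sys[0]))]

c = encode_systematic([2, 0, 1])
print(info_set, [c[j] for j in info_set])
```

The reduced matrix has an identity block on the pivot columns, which is exactly what makes the message appear verbatim inside the codeword.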

IV. 1-PRIVATE PIR PROTOCOLS BASED ON TRANSVERSAL DESIGNS

In this section we present our construction of PIR protocols relying on transversal designs. The idea is that the knowledge of one point of a block of a transversal design gives (almost) no information on the other points lying on this block. The code associated to such a design then transfers this property to the coordinates of codewords. Hence, we obtain a PIR protocol which can be proven 1-private, that is, which ensures perfect privacy for non-communicating servers. Though this protocol cannot resist collusions, we will see in Section VI that a natural generalisation leads to t-private PIR protocols with t > 1.

Notice that both Fazeli et al.’s work [11] and ours make use of codes in order to save storage in PIR protocols. Nevertheless, we emphasize that the constructions are very different, since Fazeli et al. emulate a PIR protocol from an existing one while we build our PIR protocols from scratch.

A. The transversal-design-based distributed PIR protocol

Let T be a transversal design TD(ℓ, s) and n = |X| = ℓs. Denote by C = Code_q(T) ⊆ F_q^n the associated F_q-linear code, and let k = dim_{F_q} C. Our PIR protocol is defined in Figure 2. We then summarize the steps of the construction in Figure 3.

B. Analysis

We analyse our PIR scheme by proving the following:

Theorem IV.1. Let D be a database with k entries over F_q, and T = TD(ℓ, s) be a transversal design, whose incidence matrix has rank ℓs − k over F_q. Then, there exists a distributed ℓ-server 1-private PIR protocol with:

• only one F_q-symbol to read for each server,

• ℓ − 1 field operations over F_q for the user,

• ℓ log(sq) bits of communication (ℓ log s are uploaded, ℓ log q are downloaded),

• a (total) storage overhead of (ℓs − k) log q bits on the servers.

Proof. Recall that the PIR protocol we are dealing with is defined in Figure 2.

Correctness. By definition of the code C = Code_q(T), the incidence vector 1_B of any block B ∈ B belongs to the dual code C^⊥. Hence, for c ∈ C, the inner product 1_B · c vanishes, or said differently, Σ_{x∈B} c_x = 0. We recall that j* represents the index of the group which contains i. Since the servers S_j, j ≠ j*, receive queries corresponding to the points of a block B which contains i, we have

\[
c_i = -\sum_{x \in B \setminus \{i\}} c_x = -\sum_{j \neq j^*} c_{q_j},
\]

and our PIR protocol is correct as long as the servers follow the protocol.

Parameters: T = (X, B, G) is a TD_λ(ℓ, s); C = Code_q(T) has length n = ℓs and dimension k.

1) Initialization step.

   1) Encoding. User U computes a systematic encoding of the database D ∈ F_q^k, resulting in the codeword c ∈ C.

   2) Distribution. Denote by c^(j) = c|_{G_j} the symbols of c whose support is the group G_j ∈ G. Each server S_j receives c^(j), for 1 ≤ j ≤ ℓ.

2) Retrieving step for symbol c_i for i ∈ X. Denote by j* ∈ [1, ℓ] the index of the unique group G_{j*} which contains i (that is, c_i = c^{(j*)}_r for some r ∈ [1, s]). Also denote by B* the subset of blocks containing i. The three steps of the distributed PIR protocol are:

   1) Queries generation. U picks uniformly at random a block B ∈ B*. For j ≠ j*, the user sends the unique index q_j ∈ B ∩ G_j to server S_j. Server S_{j*} receives a random query q_{j*} uniformly picked in G_{j*}. To sum up (←$ stands for “picked uniformly at random in”):

\[
\begin{cases}
Q(i)_{j^*} \xleftarrow{\$} G_{j^*}, & \text{for } j^* \text{ s.t. } i \in G_{j^*}, \\
B \xleftarrow{\$} \mathcal{B}^*, \quad Q(i)_j \leftarrow B \cap G_j, & \text{for } j \neq j^*.
\end{cases}
\]

   2) Servers' answer. Each server S_j (including S_{j*}) reads a_j := c_{q_j} and sends it back to the user. That is, A(q_j, c^(j)) = c_{q_j}.

   3) Reconstruction. Denote by a = (a_1, ..., a_ℓ) and q = (q_1, ..., q_ℓ). User U computes

\[
r = R(i, a, q) := -\sum_{j \neq j^*} a_j = -\sum_{j \neq j^*} c_{q_j}
\]

   and outputs r.

Fig. 2: A 1-private distributed PIR protocol based on the F_q-linear code defined by a transversal design.

Transversal design TD(ℓ, s) → (incidence matrix) → TD-based linear code Code_q(TD(ℓ, s)) ⊆ F_q^{ℓs} → (database encoding) → Distributed PIR scheme

Fig. 3: Summary of the steps leading to the construction of a transversal-design-based PIR scheme.
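The protocol of Figure 2 can be simulated end to end on the TD(3, 3) of Example III.6. The sketch below makes a few simplifying assumptions that are not part of the figure: the encoding is the non-systematic one given by the generator matrix of Example III.10 (so the user retrieves a codeword symbol c_i), servers and groups are indexed from 0, and all function names are hypothetical.

```python
import secrets
from itertools import product

p = 3                                                     # field F_3, ell = s = 3
points = list(product(range(p), repeat=2))
groups = [[(g, y) for y in range(p)] for g in range(p)]   # G_j: lines x = j
blocks = [[(x, (a * x + b) % p) for x in range(p)]        # lines y = a*x + b
          for a, b in product(range(p), repeat=2)]

def encode(m):
    """Codeword of Code_3(T): evaluations of m0 + m1*x + m2*y (not systematic)."""
    return {(x, y): (m[0] + m[1] * x + m[2] * y) % p for x, y in points}

def distribute(c):
    """Server j stores only the symbols of c supported by the group G_j."""
    return [{pt: c[pt] for pt in G} for G in groups]

def query(i):
    """Q(i): one point per group; the point sent to server j* is a random decoy."""
    j_star = i[0]                                  # group of i = its first coordinate
    B = secrets.choice([blk for blk in blocks if i in blk])
    q = [next(pt for pt in B if pt in G) for G in groups]
    q[j_star] = secrets.choice(groups[j_star])
    return q

def answer(q_j, share):
    """A(q_j, c^(j)): a single read, no arithmetic."""
    return share[q_j]

def reconstruct(i, a):
    """R: minus the sum of the answers of all servers except j*."""
    j_star = i[0]
    return -sum(a_j for j, a_j in enumerate(a) if j != j_star) % p

c = encode((2, 1, 0))
shares = distribute(c)
i = (1, 2)
q = query(i)
a = [answer(q[j], shares[j]) for j in range(p)]
print(reconstruct(i, a) == c[i])
```

The answer of server j* is received but never used in the reconstruction; it only serves to make the query pattern independent of the group containing i.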

Security (1-privacy). We need to prove that for all j ∈ [1, ℓ], it holds that P(i | q_j) = P(i), where probabilities are taken over the randomness of B ← B*. The law of total probability implies

\[
\begin{aligned}
P(i \mid q_j) &= P(i \mid q_j \text{ and } i \in G_j)\, P(i \in G_j) + P(i \mid q_j \text{ and } i \notin G_j)\, P(i \notin G_j) \\
&= P(i \mid i \in G_j)\, P(i \in G_j) + P(i \mid i \notin G_j)\, P(i \notin G_j) \\
&= P(i),
\end{aligned}
\]

and the reasons why we eliminated the random variable q_j in the conditional probabilities are:

• in the case i ∈ G_j (that is, j = j*), by definition of our PIR protocol we know that q_j is uniformly random, so q_j and i are independent;

• in the case i ∉ G_j, by definition of a transversal design, there are as many blocks containing both q_j and i as there are blocks containing q_j and any i′ in X \ G_j (the number of such blocks is always λ). So once again, the value of the random variable q_j is not related to i.
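On a small design, the argument above can be checked by exhaustive enumeration. The sketch below computes, for the TD(3, 3) of Example III.6, the exact law of the query q_j received by each server for every desired index i; the function name `query_law` is hypothetical and servers are indexed from 0.

```python
from collections import Counter
from itertools import product

p = 3
points = list(product(range(p), repeat=2))
groups = [[(g, y) for y in range(p)] for g in range(p)]   # G_j: lines x = j
blocks = [[(x, (a * x + b) % p) for x in range(p)]        # lines y = a*x + b
          for a, b in product(range(p), repeat=2)]

def query_law(i, j):
    """Counts of q_j over all equiprobable choices (block B, decoy for server j*)."""
    j_star = i[0]
    law = Counter()
    for B in blocks:
        if i not in B:
            continue
        for decoy in groups[j_star]:
            q_j = decoy if j == j_star else next(pt for pt in B if pt in groups[j])
            law[q_j] += 1
    return law

# For each server j, the law of q_j should not depend on the desired index i.
is_private = all(query_law(i, j) == query_law(points[0], j)
                 for i in points for j in range(p))
print(is_private)
```

In this toy case every server sees a query that is uniform over its own group, whatever the index i, which is the 1-privacy property of Theorem IV.1.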

Communication complexity. Exactly one index in [1, s] and one symbol in F_q are exchanged between each server and the user. So the overall communication complexity is ℓ × (log(s) + log(q)) = ℓ log(sq) bits.

Storage overhead. The number of bits stored on a server is s log q, hence ℓs log q bits in total. Since the raw database occupies k log q bits, where k = dim C, the total storage overhead is (ℓs − k) log q bits.

Computational complexity. Each server S_j only needs to read the symbol defined by query q_j, hence our protocol incurs no extra computational cost.
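The quantities of Theorem IV.1 are easy to evaluate numerically. The following sketch does so for the toy parameters ℓ = s = q = 3 and k = 3 of Examples III.6 and III.10; the function name `td_pir_parameters` is hypothetical and logarithms are assumed to be in base 2.

```python
import math

def td_pir_parameters(ell, s, k, q):
    """Communication, storage overhead and rate of the TD-based PIR protocol."""
    upload = ell * math.log2(s)            # one index in [1, s] per server
    download = ell * math.log2(q)          # one F_q-symbol per server
    return {
        "upload_bits": upload,
        "download_bits": download,
        "communication_bits": upload + download,            # ell * log(s*q)
        "storage_overhead_bits": (ell * s - k) * math.log2(q),
        "rate": k / (ell * s),
    }

params = td_pir_parameters(ell=3, s=3, k=3, q=3)
for name, value in params.items():
    print(f"{name}: {value:.3f}")
```

With such a tiny database the protocol is of course not competitive; the point is only to show how ℓ, s and k enter each quantity.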

Theorem IV.1 shows that, if we want to optimize the practical parameters of our PIR scheme, we basically need to look for small values of ℓ, the number of groups. However, one observes that the dimension k of Code_q(T) strongly depends on ℓ and n, and tiny values of ℓ can lead to trivial or very small codes. This issue should be carefully taken into account, since instances with k < ℓ represent PIR protocols which are more expensive in communication than the trivial one, which simply retrieves the whole database. Hence, it is very natural to raise the main issue of our construction:

Problem IV.2. Find codes $C = \mathrm{Code}_q(T)$ arising from transversal designs $T = \mathrm{TD}(\ell, s)$ with few groups (small $\ell$) and large dimension $k = \dim_{\mathbb{F}_q} C$ compared to their length $n = \ell s$.

We first give a negative result, stating that the characteristic of the field $\mathbb{F}_q$ should be chosen very carefully in order to obtain non-trivial codes.

Proposition IV.3. Let $T = (X, \mathcal{B}, \mathcal{G})$ be a $\mathrm{TD}_\lambda(\ell, s)$. Let $q = p^e$, $p$ prime. If $p \nmid \lambda s$, then

$$\mathrm{Code}_q(T) \subseteq \{c \in \mathbb{F}_q^{s\ell},\ \forall G \in \mathcal{G},\ c_{|G} \in \mathrm{Rep}(s)\}\,,$$

where $\mathrm{Rep}(s)$ represents the repetition code of length $s$. In particular, if $p \nmid \lambda s$, then $\mathrm{Code}_q(T)$ has dimension at most $\ell$.

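Proof. For $x \in X$, recall that $\mathcal{B}_x = \{B \in \mathcal{B},\ x \in B\}$, and denote by $a^{(x)} = \sum_{B \in \mathcal{B}_x} 1_B$. We know that $a^{(x)} \in \mathrm{Code}_q(T)^\perp$, since $\mathrm{Code}_q(T)^\perp$ is generated by $\{1_B,\ B \in \mathcal{B}\}$. Denote by $G_x \in \mathcal{G}$ the only group that contains $x$. We see that:

$$\begin{cases}
a^{(x)}_x = \lambda s \\
a^{(x)}_i = 0 & \text{for all } i \in G_x \setminus \{x\} \\
a^{(x)}_j = \lambda & \text{for all } j \in X \setminus G_x.
\end{cases}$$

Therefore $a^{(x)} - a^{(y)} = \lambda s (1_{\{x\}} - 1_{\{y\}})$ if $x$ and $y$ lie in the same group $G$. If $p \nmid \lambda s$, then we get $1_{\{x\}} - 1_{\{y\}} \in \mathrm{Code}_q(T)^\perp$. Let now

$$C = \mathrm{Span}_{\mathbb{F}_q}\{1_{\{x\}} - 1_{\{y\}},\ \forall x, y \in X \text{ s.t. } \{x, y\} \subset G \in \mathcal{G}\}\,.$$

We see that $C^\perp = \{c \in \mathbb{F}_q^{s\ell},\ \forall G \in \mathcal{G},\ c_{|G} \in \mathrm{Rep}(s)\}$. Therefore we obtain the expected result.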
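A small numerical check may help to see the effect of the characteristic. The Python sketch below is an illustration under assumptions: the matrix `M` is the incidence matrix of a hypothetical toy TD(3, 2) (so $\lambda s = 2$), and `rank_mod_p` and `code_dimension` are hypothetical helper names. Over $\mathbb{F}_3$ (where $3 \nmid 2$) the kernel is forced to be constant on each group, while over $\mathbb{F}_2$ it is not; in such a tiny example the dimension stays small in both cases.

```python
def rank_mod_p(rows, p):
    """Rank of an integer matrix over the prime field F_p (Gaussian elimination)."""
    rows = [[x % p for x in r] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse (Python 3.8+)
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(x - f * y) % p for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank

# Incidence matrix of a toy TD(3, 2); columns are ordered group by group,
# two columns per group.
M = [[1, 0, 0, 1, 0, 1],
     [0, 1, 0, 1, 1, 0],
     [0, 1, 1, 0, 0, 1],
     [1, 0, 1, 0, 1, 0]]

def code_dimension(p):
    """Dimension over F_p of the kernel of M, i.e. of the code built from the design."""
    return len(M[0]) - rank_mod_p(M, p)

if __name__ == "__main__":
    for p in (2, 3, 5):
        print(p, code_dimension(p))
```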

In the perspective of Problem IV.2, the following section is devoted to the construction of transversal designs with high rate.

V. EXPLICIT CONSTRUCTIONS OF 1-PRIVATE TD-BASED PIR PROTOCOLS

From now on, we denote by $\ell(k)$ the number of servers involved in a given PIR protocol running on a database with $k$ entries, and by $n(k)$ the actual number of symbols stored by all the servers. As proved in Theorem IV.1, these two parameters are crucial for the practicality of our PIR schemes, and they respectively correspond to the block size and the number of points of the transversal design used in the construction. In practice, we look for small values of $\ell$ and $n$ as explained in Problem IV.2.

In this section, we first give two classical instances of transversal designs derived from finite geometries (Subsections V-A and V-B), leading to good PIR parameters. We then show how orthogonal arrays produce transversal designs, and we study in more depth a family of such arrays leading to high-rate codes. Subsection V-D is finally devoted to another family of orthogonal arrays whose divisibility properties yield an upper bound on the storage overhead of the related PIR protocols.

A. Transversal designs from affine geometries

Transversal designs can be built with incidence properties between subspaces of an affine space.

Construction V.1 (Affine transversal design). Let $\mathbb{A}^m(\mathbb{F}_q)$ be the affine space of dimension $m$ over $\mathbb{F}_q$, and $\mathcal{H} = \{H_1, \dots, H_q\}$ be $q$ hyperplanes that partition $\mathbb{A}^m(\mathbb{F}_q)$. We define a transversal design $T_A(m, q)$ as follows:

• the point set $X$ consists of all the points in $\mathbb{A}^m(\mathbb{F}_q)$;

• the groups in $\mathcal{G}$ are the $q$ hyperplanes from $\mathcal{H}$;

• the blocks in $\mathcal{B}$ are all the 1-dimensional affine subspaces (lines) which do not entirely lie in one of the $H_j$, $j \in [1, q]$. We also say that such lines are secant to the hyperplanes in $\mathcal{H}$.

The design thus defined is a $\mathrm{TD}(q, q^{m-1})$, since an affine line is either contained in one of the $H_j$, or is 1-secant (i.e. has intersection of size 1) to each of them. To complete the study of the parameters of the induced PIR protocol, it remains to compute the dimension of $\mathrm{Code}(T_A(m, q))$.

Proposition IV.3 first proves that if $p$ does not divide $\lambda s = q^{m-1}$ (equivalently, if $p$ does not divide $q$), then the code $\mathrm{Code}_p(T_A(m, q))$ has poor dimension. Since our goal is to obtain the largest possible codes, we choose $p$ to be, for instance, the characteristic of the field $\mathbb{F}_q$.

Now notice that all blocks of $T_A(m, q)$ belong to the block set of the affine geometry design $\mathrm{AG}_1(m, q)$, which is defined as the incidence structure of all points and affine lines in $\mathbb{A}^m(\mathbb{F}_q)$. Thus, the incidence matrix $M_{T_A(m,q)}$ is a submatrix of $M_{\mathrm{AG}_1(m,q)}$, which implies that $\mathrm{Code}_p(\mathrm{AG}_1(m, q)) \subseteq \mathrm{Code}_p(T_A(m, q))$ for any field $\mathbb{F}_p$. In fact, equality holds, as shown by the following result.

Proposition V.2. For every $q = p^e$ and $m \ge 2$, we have

$$\mathrm{Code}_p(\mathrm{AG}_1(m, q)) = \mathrm{Code}_p(T_A(m, q))\,.$$

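Proof. Denote by $\mathcal{B}^{(AG)}$ the blocks of $\mathrm{AG}_1(m, q)$, and by $\mathcal{B}^{(T)}$ and $\mathcal{G}^{(T)}$ the blocks and groups of $T_A(m, q)$. Thanks to the previous discussion, we only need to show that for every block $B \in \mathcal{B}^{(AG)}$ contained in a group $G \in \mathcal{G}^{(T)}$, it holds that $1_B \in \mathrm{Code}_p(T_A(m, q))^\perp$. For this sake, first notice that $\mathrm{Code}_p(T_A(m, q))^\perp = \mathrm{Span}\{1_{B'},\ B' \in \mathcal{B}^{(T)}\}$.

Let now $G \in \mathcal{G}^{(T)}$ and $B \in \mathcal{B}^{(AG)}$ such that $B \subseteq G$. Recall that $G$ is a hyperplane of $\mathbb{A}^m(\mathbb{F}_q)$, and let $P$ be a 2-dimensional affine plane of $\mathbb{A}^m(\mathbb{F}_q)$ such that $P \cap G = B$. We claim that $1_P \in \mathrm{Span}\{1_{B'},\ B' \in \mathcal{B}^{(T)}\}$. Indeed, $P$ admits a partition into affine lines which are secant to every hyperplane in $\mathcal{G}^{(T)}$. Thus $1_P$ can be written as a sum of the characteristic vectors of these lines.

Now let $x \in B$, and $\mathcal{B}^{(T)}_{x,P} := \{B' \in \mathcal{B}^{(T)},\ x \in B' \subset P\} \subseteq \mathcal{B}^{(T)}$. Define $b^{(x)} = \sum_{B' \in \mathcal{B}^{(T)}_{x,P}} 1_{B'}$. It is clear that $b^{(x)} \in \mathrm{Span}\{1_{B'},\ B' \in \mathcal{B}^{(T)}\}$, and we can notice that

$$\begin{cases}
b^{(x)}_x = q = 0, \\
b^{(x)}_i = 0 & \text{for all } i \in B \setminus \{x\}, \\
b^{(x)}_j = 1 & \text{for all } j \in P \setminus B.
\end{cases}$$

In other words, $b^{(x)} = 1_P - 1_B$, therefore $1_B \in \mathrm{Span}\{1_{B'},\ B' \in \mathcal{B}^{(T)}\}$.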

The benefit of considering $\mathrm{AG}_1(m, q)$ is that the $p$-rank of its incidence matrix has been well studied. For instance, Hamada [12] gives a generic formula to compute the $p$-rank of a design coming from projective geometry. Yet, as presented in Appendix A, asymptotics are hard to derive from his formula for a generic value of $m$.

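However, if $m = 2$, we know that $\mathrm{rank}_p(\mathrm{AG}_1(2, p^e)) = \binom{p+1}{2}^e$, which implies that $\dim(\mathrm{Code}_p(T_A(2, p^e))) = p^{2e} - \binom{p+1}{2}^e$. Hence we obtain the following family of PIR protocols.

Proposition V.3. Let $D$ be a database with $k = p^{2e} - \binom{p+1}{2}^e$ entries, $p$ a prime, $e \ge 1$. There exists a distributed 1-private PIR protocol for $D$ with:

$$\ell(k) = p^e \quad \text{and} \quad n(k) = p^{2e}\,.$$

For fixed $p$ and $k \to \infty$, we have $\ell(k) = \sqrt{k} + \Theta(k^{\frac{1}{2} + c_p})$ and

$$n(k)/k = \frac{1}{1 - \left(\frac{1 + 1/p}{2}\right)^e} = 1 + \Theta(k^{c_p}) \to 1\,, \qquad (1)$$

where $c_p = \frac{1}{2} \log_p\left(\frac{1 + 1/p}{2}\right) < 0$.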

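Proof. The existence of the PIR protocol is a consequence of the previous discussion, using the family of codes $\mathrm{Code}_p(T_A(2, p^e))$. Let us state the asymptotics of the parameters. Recall we fix the prime $p$ and we let $e \to \infty$. First we have:

$$n(k)/k = \frac{p^{2e}}{p^{2e} - \binom{p+1}{2}^e} = \frac{1}{1 - \left(\frac{1+1/p}{2}\right)^e} = 1 + \left(\frac{1+1/p}{2}\right)^e + O\left(\left(\frac{1+1/p}{2}\right)^{2e}\right). \qquad (2)$$

Notice that

$$\log_p k = 2e + \log_p\left(1 - \left(\frac{1+1/p}{2}\right)^e\right) = 2e + O\left(\left(\frac{1+1/p}{2}\right)^e\right).$$

Hence,

$$\left(\frac{1+1/p}{2}\right)^e = \left(\frac{1+1/p}{2}\right)^{\frac{1}{2}\log_p k + O\left(\left(\frac{1+1/p}{2}\right)^e\right)} = k^{\frac{1}{2}\log_p\left(\frac{1+1/p}{2}\right)} \times \left(\frac{1+1/p}{2}\right)^{O\left(\left(\frac{1+1/p}{2}\right)^e\right)} = \Theta(k^{c_p})\,,$$

since $\left(\frac{1+1/p}{2}\right)^{O\left(\left(\frac{1+1/p}{2}\right)^e\right)} \to 1$. Using (2) we obtain the asymptotics we claimed on $n(k)/k$.

For $\ell(k)$, we see that $n(k) = \ell(k)^2$. Therefore, we get

$$\ell(k) = \sqrt{k}\sqrt{n(k)/k} = \sqrt{k}\sqrt{1 + \Theta(k^{c_p})} = \sqrt{k} + \Theta(k^{\frac{1}{2} + c_p})\,.$$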
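The asymptotic estimate $n(k)/k = 1 + \Theta(k^{c_p})$ can be observed numerically. The following Python sketch is only an illustration: the function name `overhead_and_prediction` is hypothetical, and the comparison of the relative overhead $n(k)/k - 1$ with $k^{c_p}$ is a sanity check on a few values of $e$, not a proof.

```python
from math import comb, log

def overhead_and_prediction(p, e):
    """Return (n/k - 1, k**c_p) for the code built from T_A(2, p^e)."""
    n = p ** (2 * e)                      # length n(k) = p^(2e)
    k = n - comb(p + 1, 2) ** e           # dimension k = p^(2e) - binom(p+1, 2)^e
    c_p = 0.5 * log((1 + 1 / p) / 2, p)   # exponent c_p < 0 of the asymptotics
    return n / k - 1, k ** c_p

if __name__ == "__main__":
    for e in (8, 12, 16, 24):
        overhead, prediction = overhead_and_prediction(2, e)
        # The ratio is expected to approach 1 as e grows.
        print(e, overhead, prediction, overhead / prediction)
```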

We give in Table I the dimension of some codes arising from affine transversal designs. Notice that $m$ is not restricted to 2, but we focus on codes with large rate, since they are meant to be applied in PIR protocols.

Finally, for a better understanding of the parameters we can point out two PIR instances:

• choosing $m = 2$ and $\ell = 4096$, there exists a PIR protocol on a ≈ 2.0 MB file with 6 kB of communication and only 3.2% storage overhead;

• for a ≈ 46 GB database ($m = 3$, $\ell = 8192$), we obtain a PIR protocol with 39 kB of communication and 27% storage overhead.
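For $m = 2$, the parameters behind such instances follow from the closed formula of Proposition V.3 and from the communication cost $\ell \log(sq)$. The Python sketch below is a hedged illustration: the function names are hypothetical, the field used by the servers is assumed to be $\mathbb{F}_2$ (one bit per symbol), and the case $m = 3$ is not covered since it relies on Hamada's general formula.

```python
from math import comb, log2

def affine_plane_parameters(p, e):
    """Parameters (l, n, k) of the PIR scheme built from T_A(2, p^e)."""
    q = p ** e
    n = q * q                      # number of stored symbols, n = q^2
    k = n - comb(p + 1, 2) ** e    # database size, k = dim Code_p(T_A(2, q))
    return q, n, k

def communication_bits(l, s, field_size):
    """Total communication l * (log2(s) + log2(field_size)) in bits."""
    return l * (log2(s) + log2(field_size))

if __name__ == "__main__":
    l, n, k = affine_plane_parameters(2, 12)
    print("servers:", l, "stored symbols:", n, "database symbols:", k)
    print("database size (MB):", k / 8 / 1e6)
    print("communication (bytes):", communication_bits(l, n // l, 2) / 8)
    print("storage overhead:", n / k - 1)
```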

B. Transversal designs from projective geometries

The projective space $\mathbb{P}^m(\mathbb{F}_q)$ is defined as $(\mathbb{A}^{m+1}(\mathbb{F}_q) \setminus \{0\})/\sim$, where for $(P, Q) \in (\mathbb{A}^{m+1}(\mathbb{F}_q) \setminus \{0\})^2$, we have $P \sim Q$ if and only if there exists $\lambda \in \mathbb{F}_q$ such that $P = \lambda Q$. A projective subspace can be defined as the zero set of a collection of linear forms over $\mathbb{F}_q^{m+1}$. In particular, a projective hyperplane is the zero set of one non-zero linear form over $\mathbb{F}_q^{m+1}$.

m    ℓ = q     n = sℓ = q^m        k = dim C           R = k/n
2    8         64                  37                  0.578
2    16        256                 175                 0.684
2    32        1024                781                 0.763
2    64        4096                3367                0.822
2    1024      1 048 576           989 527             0.944
2    4096      16 777 216          16 245 775          0.968
2    16 384    268 435 456         263 652 487         0.982
2    65 536    4 294 967 296       4 251 920 575       0.990
3    8         512                 139                 0.271
3    16        4096                1377                0.336
3    64        262 144             118 873             0.453
3    256       16 777 216          9 263 777           0.552
3    1024      1 073 741 824       680 200 873         0.633
3    8192      549 755 813 888     400 637 408 211     0.729
4    8         4096                406                 0.099
4    64        16 777 216          2 717 766           0.162
4    256       4 294 967 296       890 445 921         0.207
5    8         32 768              994                 0.030
5    64        1 073 741 824       44 281 594          0.041

TABLE I: Dimension and rate of binary codes $C$ arising from $T_A(m, q)$. Recall that the rate $R$ of the code is related to the server storage overhead of the PIR protocol, and that $q = \ell$ is essentially the communication complexity and the number of servers.

Projective geometries are closely related to affine geometries, but contrary to them, there is no partition of the projective space into hyperplanes, since every pair of distinct projective hyperplanes intersects in a projective space of co-dimension 2. To tackle this problem, an idea is to consider the hyperplanes $H_i$ which intersect on a fixed subspace of co-dimension 2 (call it $\Pi_\infty$). Then, all the sets $H_i \setminus \Pi_\infty$ are disjoint, and their union gives exactly $\mathbb{P}^m(\mathbb{F}_q) \setminus \Pi_\infty$, where $\mathbb{P}^m(\mathbb{F}_q)$ denotes the projective space of dimension $m$ over $\mathbb{F}_q$. Besides, any projective line disjoint from $\Pi_\infty$ is either contained in one of the $H_i$, or is 1-secant to all of them. This leads to the following construction:

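Construction V.4 (Projective transversal design). Let $\mathbb{P}^m(\mathbb{F}_q)$ and $\Pi_\infty$ be defined as above. Let us define

• a point set $X = \mathbb{P}^m(\mathbb{F}_q) \setminus \Pi_\infty$;

• a group set $\mathcal{G} = \{\text{projective hyperplanes } H \subset \mathbb{P}^m(\mathbb{F}_q),\ \Pi_\infty \subset H\}$;

• a block set $\mathcal{B} = \{\text{projective lines } L \subset \mathbb{P}^m(\mathbb{F}_q),\ L \cap \Pi_\infty = \emptyset \text{ and } \forall H \in \mathcal{G},\ L \not\subset H\}$.

Finally, denote by $T_P(m, q) := (X, \mathcal{B}, \mathcal{G})$.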

The design $T_P(m, q)$ is a $\mathrm{TD}(q + 1, q^{m-1})$ and, as in the affine setting, its $p$-rank is related to that of $\mathrm{PG}_1(m, q)$, the classical design of point-line incidences in the projective space $\mathbb{P}^m(\mathbb{F}_q)$. Indeed, the incidence matrix $M$ of $T_P(m, q)$ is a submatrix of $M_{\mathrm{PG}_1(m,q)}$ from which we removed:

• the columns corresponding to the points in $\Pi_\infty$,

• the rows corresponding to the lines not in $\mathcal{B}$.

Said differently, the code associated to $T_P(m, q)$ contains (as a subcode) the $\Pi_\infty$-shortening of the code $\mathrm{Code}(\mathrm{PG}_1(m, q))$, so its dimension is at least $\dim \mathrm{Code}(\mathrm{PG}_1(m, q)) - |\Pi_\infty|$. Contrary to Proposition V.2, we could not prove equality, but this is of little consequence: up to using a subcode of $\mathrm{Code}(T_P(m, q))$ we can consider PIR protocols on databases with $k$ entries, where $k = \dim \mathrm{Code}(\mathrm{PG}_1(m, q)) - |\Pi_\infty|$.

Once again, for projective geometries Hamada's formula gets simpler for $m = 2$, and leads to the following proposition.

Proposition V.5. Let $D$ be a database with $k = p^{2e} + p^e - \binom{p+1}{2}^e - 1$ entries, $p$ a prime and $e \ge 1$. There exists a distributed 1-private PIR protocol for $D$ with:

$$\ell(k) = p^e + 1 \quad \text{and} \quad n(k) = p^{2e} + p^e\,.$$

Asymptotics are the same as in Equation (1).

In order to emphasize that the two previous constructions are asymptotically the same, we plot in Figure 4 the rates of the codes involved in these two kinds of PIR schemes.

[Plot: rate $R = k/n$ (from 0 to 1) against length $n = q^m$ (logarithmic scale, from $10^2$ to $10^{12}$), one curve for each of $m = 2$, $m = 3$, $m = 4$, $m = 5$.]

Fig. 4: Rate of binary codes coming from $T_A(m, q)$ (in red) and $T_P(m, q)$ (in blue). For every fixed $m$, we let $q$ grow.

C. Orthogonal arrays and the incidence code construction

In this subsection, we first recall a way to produce plenty of transversal designs from other combinatorial constructions called orthogonal arrays.

Definition V.6 (orthogonal array). Let $\lambda, s \ge 1$ and $\ell \ge t \ge 1$, and let $A$ be an array with $\ell$ columns and $\lambda s^t$ rows, whose entries are elements of a set $S$ of size $s$. We say that $A$ is an orthogonal array $\mathrm{OA}_\lambda(t, \ell, s)$ if, in any subarray $A'$ of $A$ formed by $t$ columns and all its rows, every row vector from $S^t$ appears exactly $\lambda$ times in the rows of $A'$. We call $\lambda$ the index of the orthogonal array, $t$ its strength and $\ell$ its degree. If $t$ (resp. $\lambda$) is omitted, it is understood to be 2 (resp. 1). If both these parameters are omitted we write $A = \mathrm{OA}(\ell, s)$.

From now on, for convenience we restrict Definition V.6 to orthogonal arrays with no repeated column and no repeated row. The next paragraph introduces a link between orthogonal arrays and transversal designs.

1) Construction of transversal designs from orthogonal arrays: We can build a transversal design $\mathrm{TD}(\ell, s)$ from an orthogonal array $\mathrm{OA}(\ell, s)$ with the following construction, given as a remark in [8, ch.II.2].

Construction V.7 (Transversal designs from orthogonal arrays). Let $A$ be an $\mathrm{OA}(\ell, s)$ of strength $t = 2$ and index $\lambda = 1$ with symbol set $S$, $|S| = s$, and denote by $\mathrm{Rows}(A)$ the $s^2$ rows of $A$. We define the point set $X = S \times [1, \ell]$. To each row $c \in \mathrm{Rows}(A)$ we associate a block

$$B_c := \{(c_i, i),\ i \in [1, \ell]\}\,,$$

so that the block set is defined as

$$\mathcal{B} := \{B_c,\ c \in \mathrm{Rows}(A)\}\,.$$

Finally, let $\mathcal{G} := \{S \times \{i\},\ i \in [1, \ell]\}$. Then $(X, \mathcal{B}, \mathcal{G})$ is a transversal design $\mathrm{TD}(\ell, s)$.

Example V.8. A very simple example of this construction is given in Figure 5, where for clarity we use letters for elements of the symbol set $\{a, b\}$, while the columns are indexed by integers. On the left-hand side, $A$ is an $\mathrm{OA}_1(2, 3, 2)$ with symbol set $\{a, b\}$. On the right-hand side, the associated transversal design $\mathrm{TD}(3, 2)$ is represented as a hypergraph: the nodes are the points of the design, the "columns" of the graph form the groups, and a block consists of all nodes linked by a path of a fixed color. One can check that every pair of nodes either belongs to the same group or is linked by exactly one path.

$$A = \begin{pmatrix} a & b & b \\ b & b & a \\ b & a & b \\ a & a & a \end{pmatrix} \quad \Longrightarrow \quad \begin{array}{ccc} (a, 1) & (a, 2) & (a, 3) \\ (b, 1) & (b, 2) & (b, 3) \end{array}$$

Fig. 5: A representation of the construction of a transversal design from an orthogonal array.
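Construction V.7 can be checked mechanically on this small array. The following Python sketch is an illustration under assumptions: the function names `transversal_design` and `is_transversal_design` are hypothetical, and columns are indexed from 0 rather than from 1.

```python
from itertools import combinations

def transversal_design(array):
    """Groups and blocks obtained from the rows of an OA(l, s)."""
    l = len(array[0])
    symbols = sorted({x for row in array for x in row})
    groups = [{(a, i) for a in symbols} for i in range(l)]
    blocks = [{(row[i], i) for i in range(l)} for row in array]
    return groups, blocks

def is_transversal_design(groups, blocks):
    """Two points of distinct groups must share exactly one block,
    and two points of the same group must share none."""
    points = [pt for g in groups for pt in g]
    for x, y in combinations(points, 2):
        common = sum(1 for b in blocks if x in b and y in b)
        if common != (0 if x[1] == y[1] else 1):
            return False
    return True

# The orthogonal array of Example V.8, one string per row.
A = ["abb", "bba", "bab", "aaa"]

if __name__ == "__main__":
    groups, blocks = transversal_design(A)
    print(is_transversal_design(groups, blocks))
```

Replacing one row of `A` by a row that repeats a pair already covered (for instance `"aab"` instead of `"aaa"`) makes the check fail, since the array is then no longer orthogonal.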

Remark V.9. Listed in rows, all the codewords of a (generic) code $C_0$ give rise to an orthogonal array, whose strength $t$ is derived from the dual distance $d'$ of $C_0$ by $t = d' - 1$. Notice that for linear codes, the dual distance is simply the minimum distance of the dual code, but it can also be defined for non-linear codes (see [15, Ch.5.§5.]). More details about the link between orthogonal arrays and codes can also be found in [8]. For example, the orthogonal array of Figure 5 comes from the binary parity-check code of length 3 (by replacing $a$ by 0 and $b$ by 1). One can check that its dual distance is 3 and its associated transversal design has strength 2.

Given a code $C_0$, we denote by $A_{C_0}$ the orthogonal array it defines (see Remark V.9) and by $T_{C_0}$ the transversal design built from $A_{C_0}$ thanks to Construction V.7.

Example V.10. Let $x = (x_1, \dots, x_\ell)$ be an $\ell$-tuple of pairwise distinct elements of $\mathbb{F}_q$ and denote by $\mathrm{RS}_2(x)$ the Reed-Solomon code of length $\ell$ and dimension 2 over $\mathbb{F}_q$ with evaluation points $x$:

$$\mathrm{RS}_2(x) := \{(f(x_1), \dots, f(x_\ell)),\ f \in \mathbb{F}_q[X],\ \deg f < 2\}\,.$$

Then, $\mathrm{RS}_2(x)$ has dual distance 3, so its codewords form an orthogonal array $A_{\mathrm{RS}_2(x)} = \mathrm{OA}(\ell, q)$ of strength 2. Now, one can use Construction V.7 to obtain a transversal design $T_{\mathrm{RS}_2(x)} = \mathrm{TD}(\ell, q)$. The point set is $X = \mathbb{F}_q \times [1, \ell]$, and the blocks are "labeled Reed-Solomon codewords", that is, sets of the form $\{(c_i, i),\ i \in [1, \ell]\}$ with $c \in \mathrm{RS}_2(x)$. The $\ell$ groups correspond to the $\ell$ coordinates of the code: $G_i = \mathbb{F}_q \times \{i\}$, $1 \le i \le \ell$.

We can finally sum up our construction by introducing the code $\mathrm{Code}_q(T_{C_0})$ arising from the transversal design defined by $C_0$. To the best of our knowledge, the construction $C_0 \mapsto \mathrm{Code}_q(T_{C_0})$ is new. We name $\mathrm{Code}_q(T_{C_0})$ the incidence code of $C_0$, since its parity-check matrix $M_{T_{C_0}}$ essentially stores incidence relations between all the codewords in $C_0$.

Definition V.11 (incidence code). Let $C_0$ be a (generic) code of length $\ell$ over an alphabet $S$ of size $s$. The incidence code of $C_0$ over $\mathbb{F}_q$, denoted $\mathrm{IC}_q(C_0)$, is the $\mathbb{F}_q$-linear code of length $n = s\ell$ built from the transversal design $T_{C_0}$, that is:

$$\mathrm{IC}_q(C_0) := \mathrm{Code}_q(T_{C_0})\,.$$

Notice that the field $\mathbb{F}_q$ does not need to be the alphabet $S$ of the code $C_0$.

Incidence codes are introduced in order to design PIR protocols, as summarized in Figure 6. We can show that, if $C_0$ has dual distance at least 3, then the induced PIR protocol is 1-private. A generalisation is formally proved in Corollary VI.8.

Base code $C_0$ ↔ [equivalence (Rem. V.9)] ↔ Orthogonal array → [Construction V.7, [8, ch.II.2]] → Transversal design → [incidence matrix] → Incidence code of $C_0$ → [database encoding] → Distributed PIR scheme

Fig. 6: A distributed PIR scheme using the incidence code construction.

Example V.12. Here we provide a full example of the construction of an incidence code. Let $C_0$ be the full-length Reed-Solomon code of dimension 2 over the field $\mathbb{F}_4 = \{0, 1, \alpha, \alpha^2 = \alpha + 1\}$. The orthogonal array associated to $C_0$ is composed of the following list of codewords:

$$A = \begin{pmatrix}
0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 \\
\alpha & \alpha & \alpha & \alpha \\
\alpha^2 & \alpha^2 & \alpha^2 & \alpha^2 \\
0 & 1 & \alpha & \alpha^2 \\
0 & \alpha & \alpha^2 & 1 \\
0 & \alpha^2 & 1 & \alpha \\
1 & 0 & \alpha^2 & \alpha \\
1 & \alpha^2 & \alpha & 0 \\
1 & \alpha & 0 & \alpha^2 \\
\alpha & \alpha^2 & 0 & 1 \\
\alpha & 0 & 1 & \alpha^2 \\
\alpha & 1 & \alpha^2 & 0 \\
\alpha^2 & \alpha & 1 & 0 \\
\alpha^2 & 1 & 0 & \alpha \\
\alpha^2 & 0 & \alpha & 1
\end{pmatrix}$$

Using Construction V.7, we get a transversal design $T_{C_0} = (X, \mathcal{B}, \mathcal{G})$ with 16 points (4 groups made of 4 points) and 16 blocks. Let us recall how we map a row of $A$ to a word in $\{0, 1\}^{16}$. For instance, consider the fifth row:

$$a := A_5 = (0, 1, \alpha, \alpha^2)\,.$$

We turn $a$ into a block $B_a := \{(0, 1), (1, 2), (\alpha, 3), (\alpha^2, 4)\} \in \mathcal{B}$, and we build the incidence vector $1_{B_a}$ of the block $B_a$ over the point set $X = \{(\beta, i),\ i \in [1, 4],\ \beta \in \mathbb{F}_4\}$. Of course, in order to see $1_{B_a}$ as a word in $\{0, 1\}^{16}$, we need to order elements in $X$, for instance:

$(0, 1), (1, 1), (\alpha, 1), (\alpha^2, 1),$
$(0, 2), (1, 2), (\alpha, 2), (\alpha^2, 2),$
$(0, 3), (1, 3), (\alpha, 3), (\alpha^2, 3),$
$(0, 4), (1, 4), (\alpha, 4), (\alpha^2, 4).$

Using this ordering, we get:

$$1_{B_a} = (1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1) \in \{0, 1\}^{16}\,.$$

By computing all the $1_{B_a}$ for $a \in \mathrm{Rows}(A)$, we obtain the incidence matrix $M$ of the transversal design $T_{C_0}$:

$$M = \left(\begin{array}{c|c|c|c}
1000 & 1000 & 1000 & 1000 \\
0100 & 0100 & 0100 & 0100 \\
0010 & 0010 & 0010 & 0010 \\
0001 & 0001 & 0001 & 0001 \\
1000 & 0100 & 0010 & 0001 \\
1000 & 0010 & 0001 & 0100 \\
1000 & 0001 & 0100 & 0010 \\
0100 & 1000 & 0001 & 0010 \\
0100 & 0001 & 0010 & 1000 \\
0100 & 0010 & 1000 & 0001 \\
0010 & 0001 & 1000 & 0100 \\
0010 & 1000 & 0100 & 0001 \\
0010 & 0100 & 0001 & 1000 \\
0001 & 0010 & 0100 & 1000 \\
0001 & 0100 & 1000 & 0010 \\
0001 & 1000 & 0010 & 0100
\end{array}\right),$$

Notice that this matrix can be quickly obtained by respectively replacing entries $0$, $1$, $\alpha$ and $\alpha^2$ in the array $A$ by the binary 4-tuples (1000), (0100), (0010) and (0001) in the matrix $M$ (of course this map depends on the ordering of $X$ we have chosen, but another choice would lead to a column-permutation-equivalent matrix, hence a column-permutation-equivalent code). Notice that in matrix $M$, coordinates lying in the same group of the transversal design have been distinguished by dashed vertical lines.

Matrix $M$ then defines, over any extension $\mathbb{F}_{2^e}$ of the prime field $\mathbb{F}_2$, the dual code of the so-called incidence code $\mathrm{IC}_{2^e}(C_0)$. For all values of $e$, the incidence codes $\mathrm{IC}_{2^e}(C_0)$ have the same generator matrix of 2-rank 7, being:

$$G = \left(\begin{array}{c|c|c|c}
1001 & 0000 & 0011 & 1010 \\
0101 & 0000 & 0110 & 0011 \\
0011 & 0000 & 0101 & 0110 \\
0000 & 1001 & 0101 & 1100 \\
0000 & 0101 & 0011 & 0110 \\
0000 & 0011 & 0110 & 0101 \\
0000 & 0000 & 1111 & 1111
\end{array}\right).$$
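The dimension stated in this example can be recomputed from scratch. The Python sketch below is an illustration under assumptions: $\mathbb{F}_4$ is encoded as the integers 0, 1, 2, 3 (standing for $0, 1, \alpha, \alpha^2$) with a hard-coded multiplication table, coordinates are indexed from 0, and the names `rs2_codewords`, `incidence_vector` and `rank_gf2` are hypothetical.

```python
from itertools import product

# F_4 = {0, 1, alpha, alpha^2} encoded as 0, 1, 2, 3; addition is XOR.
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
POINTS = [0, 1, 2, 3]  # evaluation points: all the elements of F_4

def rs2_codewords():
    """Evaluations of a*X + b at every element of F_4, for all (a, b)."""
    return [tuple(MUL[a][x] ^ b for x in POINTS)
            for a, b in product(range(4), repeat=2)]

def incidence_vector(c):
    """0/1 vector of the block {(c_i, i)}: coordinate 4*i + c_i is set."""
    v = [0] * 16
    for i, ci in enumerate(c):
        v[4 * i + ci] = 1
    return v

def rank_gf2(vectors):
    """Rank over F_2 of a list of 0/1 vectors, using integer bitmasks."""
    rows = [int("".join(map(str, v)), 2) for v in vectors]
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot  # lowest set bit, used as pivot position
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

if __name__ == "__main__":
    M = [incidence_vector(c) for c in rs2_codewords()]
    # Dimension of the incidence code = length minus the 2-rank of M.
    print(16 - rank_gf2(M))
```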

2) A deeper analysis of incidence codes coming from linear MDS codes of dimension 2: Incidence codes lead to a very large family of PIR protocols (as many as there exist codes $C_0$), but most of them are not practical for PIR protocols (essentially because the kernel of the incidence matrix is too small). To simplify their study, one can first remark that intuitively, the more blocks a transversal design has, the larger its incidence matrix, and consequently, the lower the dimension of its associated code. But the number of blocks of $T_{C_0}$ is the cardinality of $C_0$. Hence, informally, the smaller the code $C_0$, the larger $\mathrm{IC}(C_0)$.

We recall that a [n, k, d] linear code is said to be maximum distance separable (MDS) if it reaches the Singleton bound n + 1 = k + d. Besides, the dual code of an MDS code is also MDS, hence its dual distance is k + 1. In this paragraph we analyse the incidence codes constructed with MDS codes of dimension 2. Their interest lies in being the smallest codes with dual distance 3, which is the minimal setting for defining 1-private PIR protocols.

Generalized Reed-Solomon codes are the best-known family of MDS codes.

Definition V.13 (generalized Reed-Solomon code). Let $\ell \ge k \ge 1$. Let also $x \in \mathbb{F}_q^\ell$ be a tuple of pairwise distinct so-called evaluation points, and $y \in (\mathbb{F}_q^\times)^\ell$ be the column multipliers. We associate to $x$ and $y$ the generalized Reed-Solomon (GRS) code:

$$\mathrm{GRS}_k(x, y) := \{(y_1 f(x_1), \dots, y_\ell f(x_\ell)),\ f \in \mathbb{F}_q[X],\ \deg f < k\}\,.$$

Generalized Reed-Solomon codes $\mathrm{GRS}_k(x, y)$ are linear MDS codes of dimension $k$ over $\mathbb{F}_q$, and they give usual Reed-Solomon codes when $y = (1, \dots, 1)$. Moreover, GRS codes are essentially the only MDS codes of dimension 2, as stated in the following lemma whose proof can be found in the Appendix.

Lemma V.14. All $[\ell, 2, \ell - 1]$ MDS codes over $\mathbb{F}_q$ with $2 \le \ell \le q$ are generalized Reed-Solomon codes.

Let us study the consequences of Lemma V.14 in terms of transversal designs. We say a map $\varphi : X \to X'$ is an isomorphism between transversal designs $(X, \mathcal{B}, \mathcal{G})$ and $(X', \mathcal{B}', \mathcal{G}')$ if it is one-to-one and if it preserves the incidence relations, or in other words, if $\varphi$ is invertible on the points, blocks and groups:

$$\varphi(X) = X', \quad \varphi(\mathcal{B}) = \mathcal{B}', \quad \varphi(\mathcal{G}) = \mathcal{G}'\,.$$

Lemma V.15. Let $C, C'$ be two codes such that $C' = y * C$ for some $y \in (\mathbb{F}_q^\times)^\ell$, where $*$ is the coordinate-wise product of $\ell$-tuples. Recall $T_C$, $T_{C'}$ are the transversal designs they respectively define. Then, $T_C$ and $T_{C'}$ are isomorphic.

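Proof. Write $T_C = (X, \mathcal{B}, \mathcal{G})$ and $T_{C'} = (X', \mathcal{B}', \mathcal{G}')$. From the definition it is clear that $X = X' = \mathbb{F}_q \times [1, \ell]$ and $\mathcal{G} = \mathcal{G}' = \{\mathbb{F}_q \times \{i\},\ 1 \le i \le \ell\}$. Now consider the block sets. We see that $\mathcal{B} = \{\{(c_i, i),\ 1 \le i \le \ell\},\ c \in C\}$ and $\mathcal{B}' = \{\{(y_i c_i, i),\ 1 \le i \le \ell\},\ c \in C\}$. Let:

$$\varphi_y : \mathbb{F}_q \times [1, \ell] \to \mathbb{F}_q \times [1, \ell], \quad (x, i) \mapsto (y_i x, i)\,.$$

The vector $y$ is $*$-invertible, hence $\varphi_y$ is one-to-one on the point set $X$. It remains to notice that $\varphi_y$ maps $\mathcal{G}$ to itself since it only acts on the first coordinate, and that $\varphi_y(\mathcal{B})$ is exactly $\mathcal{B}'$ by definition of $C$ and $C'$.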

Proposition V.16. Let $2 \le \ell \le q$ and $C_0$ be an $[\ell, 2, \ell - 1]_q$ linear (MDS) code. Let also $\mathbb{F}_p$ be any finite field. The incidence code $\mathrm{IC}_p(C_0)$ is permutation-equivalent to $\mathrm{IC}_p(\mathrm{RS}_2(x))$, with $x \in \mathbb{F}_q^\ell$, $x_i \ne x_j$.

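Proof. Lemma V.14 shows that all $[\ell, 2, \ell - 1]_q$ linear codes $C_0$ can be written as $y * \mathrm{RS}_2(x)$ for some $x \in \mathbb{F}_q^\ell$. Moreover, with the previous notation $\varphi_y(T_{\mathrm{RS}_2(x)}) = T_{y * \mathrm{RS}_2(x)}$, so we have $u \in \mathrm{IC}_p(y * \mathrm{RS}_2(x))$ if and only if $u \in \mathrm{Code}_p(\varphi_y(T_{\mathrm{RS}_2(x)}))$. Now, let:

$$\tilde{\varphi}_y : \mathbb{F}_p^X \to \mathbb{F}_p^X, \quad u = (u_x)_{x \in X} \mapsto (u_{\varphi_y(x)})_{x \in X}\,.$$

Clearly $\tilde{\varphi}_y(\mathrm{IC}_p(\mathrm{RS}_2(x))) = \mathrm{Code}_p(\varphi_y(T_{\mathrm{RS}_2(x)}))$ and $\tilde{\varphi}_y$ is a permutation of coordinates. So $\mathrm{IC}_p(C_0)$ is permutation-equivalent to $\mathrm{IC}_p(\mathrm{RS}_2(x))$, which proves the result.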

In our study of incidence codes of 2-dimensional MDS codes $C_0$, the previous proposition allows us to restrict our study to Reed-Solomon codes $C_0 = \mathrm{RS}_2(x)$ with $x$ an $\ell$-tuple of pairwise distinct elements of $\mathbb{F}_q$.

A first result proves that if $x$ contains all the elements in $\mathbb{F}_q$, then $\mathrm{IC}_q(\mathrm{RS}_2(x))$ is the code which has been previously studied in Subsection V-A. More precisely,

Proposition V.17. The following two codes are equal up to permutation:

1) $C_1 = \mathrm{IC}_q(\mathrm{RS}_2(\mathbb{F}_q))$, the incidence code over $\mathbb{F}_q$ of the full-length Reed-Solomon code of dimension 2 over $\mathbb{F}_q$;

2) $C_2$, the code over $\mathbb{F}_q$ based on the transversal design $T_A(2, q)$.

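Proof. It is sufficient to show that the transversal design defined by $C_0 = \mathrm{RS}_2(\mathbb{F}_q)$ is isomorphic to $T_A(2, q)$. Let us enumerate $\mathbb{F}_q = \{x_1, \dots, x_q\}$. We recall that $T_{C_0} = (X, \mathcal{B}, \mathcal{G})$ where:

$$\begin{aligned}
X &= \mathbb{F}_q \times [1, q], \\
\mathcal{B} &= \{\{(c_i, i),\ i \in [1, q]\},\ c \in C_0\}, \\
\mathcal{G} &= \{\{(\alpha, i),\ \alpha \in \mathbb{F}_q\},\ i \in [1, q]\}\,,
\end{aligned}$$

and that $T_A(2, q) = (X', \mathcal{B}', \mathcal{G}')$ with:

$$\begin{aligned}
X' &= \mathbb{F}_q \times \mathbb{F}_q, \\
\mathcal{B}' &= \{\{(a x_i + b, x_i),\ i \in [1, q]\},\ (a, b) \in \mathbb{F}_q^2\}, \\
\mathcal{G}' &= \{\{(\alpha, x_i),\ \alpha \in \mathbb{F}_q\},\ i \in [1, q]\}\,.
\end{aligned}$$

In the light of the above, one defines $\varphi : X \to X'$, $(\alpha, i) \mapsto (\alpha, x_i)$, which is clearly one-to-one and satisfies $\varphi(\mathcal{G}) = \mathcal{G}'$. Moreover, a codeword $c \in C_0$ is the evaluation of a polynomial of degree $\le 1$ over $\mathbb{F}_q$. Hence for some $(a, b) \in \mathbb{F}_q^2$, we have $c_i = a x_i + b$, $\forall i$. This proves that $\varphi$ extends to a one-to-one map $\mathcal{B} \to \mathcal{B}'$, giving the desired isomorphism.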

It remains to study the case of tuples $x$ of length $\ell < q$. First, one may notice that $\mathrm{IC}_q(\mathrm{RS}_2(x))$ is a shortening of $\mathrm{IC}_q(\mathrm{RS}_2(\mathbb{F}_q))$. Indeed, we have the following property:

Lemma V.18. Let $C_0$ be a linear code of length $\ell$ over $\mathbb{F}_q$, and $\overline{C_0}$ be a puncturing of $C_0$ on $s$ positions. Then for all prime powers $q'$, $\mathrm{IC}_{q'}(\overline{C_0})$ is a shortening of $\mathrm{IC}_{q'}(C_0)$ on the coordinates corresponding to $s$ groups of the transversal design $T_{C_0}$.

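Proof. Without loss of generality, we can assume that $C_0$ is punctured on its $s$ last coordinates in order to give $\overline{C_0}$. Let us analyse the link between $T_{\overline{C_0}} = (\overline{X}, \overline{\mathcal{B}}, \overline{\mathcal{G}})$ and $T_{C_0} = (X, \mathcal{B}, \mathcal{G})$. We have:

$$\begin{aligned}
\overline{X} &= \mathbb{F}_q \times [1, \ell - s] \subset X, \\
\overline{\mathcal{G}} &= \{\mathbb{F}_q \times \{i\},\ i \in [1, \ell - s]\} \subset \mathcal{G}, \\
\overline{\mathcal{B}} &= \{B \cap \overline{X},\ B \in \mathcal{B}\}\,.
\end{aligned}$$

Let $\overline{C} = \mathrm{IC}_{q'}(\overline{C_0})$ and $C = \mathrm{IC}_{q'}(C_0)$. For clarity, we index words in $C$ (resp. $\overline{C}$) by $X$ (resp. $\overline{X}$). For $\overline{c} \in \mathbb{F}_{q'}^{\overline{X}}$, we define $\mathrm{ext}(\overline{c}) := c \in \mathbb{F}_{q'}^{X}$, such that $c_{|\overline{X}} = \overline{c}$ and $c_{|X \setminus \overline{X}} = 0$. By definition of puncturing and shortening of codes, all we need to prove is:

$$\overline{C} = \{\overline{c} \in \mathbb{F}_{q'}^{\overline{X}},\ \mathrm{ext}(\overline{c}) \in C\}\,.$$

Recall that $C$ is defined as the set of $c \in \mathbb{F}_{q'}^{X}$ satisfying $\sum_{b \in B} c_b = 0$ for every $B \in \mathcal{B}$. Hence we have:

$$\begin{aligned}
\overline{c} \in \overline{C} &\iff \sum_{b \in \overline{B}} \overline{c}_b = 0, \ \forall \overline{B} \in \overline{\mathcal{B}} \\
&\iff \sum_{b \in B \cap \overline{X}} \overline{c}_b = 0, \ \forall B \in \mathcal{B} \\
&\iff \sum_{b \in B \cap \overline{X}} \mathrm{ext}(\overline{c})_b + \sum_{b \in B \cap (X \setminus \overline{X})} \mathrm{ext}(\overline{c})_b = 0, \ \forall B \in \mathcal{B} \\
&\iff \sum_{b \in B} \mathrm{ext}(\overline{c})_b = 0, \ \forall B \in \mathcal{B} \\
&\iff \mathrm{ext}(\overline{c}) \in C\,.
\end{aligned}$$

We conclude the proof by pointing out that $X \setminus \overline{X}$ is a union of $s$ distinct groups from $\mathcal{G}$.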

Despite this result, incidence codes of Reed-Solomon codes $\mathrm{RS}_2(x)$ remain hard to classify for $|x| = \ell < q$. Indeed, for a given length $\ell < q$, some $\mathrm{IC}(\mathrm{RS}(x))$ appear to be non-equivalent. Their dimension can even be different, as shown by an exhaustive search on $\mathrm{IC}_{16}(\mathrm{RS}(x))$ with pairwise distinct $x \in \mathbb{F}_q^\ell$, $q = 16$ and $\ell = 5$: we observe that 48 of these codes have dimension 24 while the 4320 others have dimension 22. An interesting direction for further research would then be to understand the values of $x$ leading to the largest codes, for a fixed length $|x| = \ell$.

D. High-rate incidence codes from divisible codes

In this subsection, we prove that linear codes $C_0$ satisfying a divisibility condition yield incidence codes whose rate is roughly greater than 1/2. Let us first define divisible codes.

Definition V.19 (divisibility of a code). Let $p \ge 2$. A linear code is $p$-divisible if $p$ divides the Hamming weight of all its codewords.

A study of the incidence matrix which defines an incidence code leads to the following property.

Lemma V.20. Let $C_0$ be a code of length $\ell$ over a set $S$, and let $T$ be the transversal design associated to $C_0$. We denote by $M$ the incidence matrix of $T$, where rows of $M$ are indexed by codewords from $C_0$. Then we have:

$$(M M^T)_{c,c'} = \ell - d(c, c') \quad \forall c, c' \in C_0\,,$$

where $d(\cdot, \cdot)$ denotes the Hamming distance.

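Proof. For clarity we adopt the notation $M[c, (\alpha, i)]$ for the entry of $M$ which is indexed by the codeword $c \in C_0$ (for the row), and $(\alpha, i) \in S \times [1, \ell]$ (for the column). We also denote by $1_{U(c,i,\alpha)} \in \{0, 1\}$ the boolean value of the property $U$, that is, $1_{U(c,i,\alpha)} = 1$ if and only if $U(c, i, \alpha)$ is satisfied. Now, let $c, c' \in C_0$.

$$\begin{aligned}
(M M^T)_{c,c'} &= \sum_{\alpha \in S,\ i \in [1, \ell]} M[c, (\alpha, i)]\, M[c', (\alpha, i)] \\
&= \sum_{\alpha \in S,\ i \in [1, \ell]} 1_{c_i = \alpha}\, 1_{c'_i = \alpha} \\
&= \sum_{i=1}^{\ell} \sum_{\alpha \in S} 1_{c_i = c'_i = \alpha} \\
&= \sum_{i=1}^{\ell} 1_{c_i = c'_i} \\
&= \ell - d(c, c')\,.
\end{aligned}$$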
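The identity of Lemma V.20 is easy to test on small codes. The Python sketch below is only an illustration: the helper names are hypothetical, codewords are given as strings, and the two codes used in the usage example (the binary even-weight code of length 3 and an arbitrary non-linear code) are assumptions chosen for the check.

```python
def incidence_matrix(code, symbols):
    """Incidence matrix of the design built from `code`: the row of a
    codeword c has a 1 in column s*i + index(c_i) for every position i."""
    index = {a: j for j, a in enumerate(symbols)}
    s = len(symbols)
    rows = []
    for c in code:
        row = [0] * (s * len(c))
        for i, ci in enumerate(c):
            row[s * i + index[ci]] = 1
        rows.append(row)
    return rows

def hamming_distance(c1, c2):
    return sum(x != y for x, y in zip(c1, c2))

def gram_matches_lemma(code, symbols):
    """Check (M M^T)[c, c'] == l - d(c, c') for every pair of codewords."""
    M = incidence_matrix(code, symbols)
    l = len(code[0])
    return all(
        sum(x * y for x, y in zip(M[u], M[v]))
        == l - hamming_distance(code[u], code[v])
        for u in range(len(code)) for v in range(len(code)))

if __name__ == "__main__":
    print(gram_matches_lemma(["000", "011", "101", "110"], "01"))
    print(gram_matches_lemma(["ab", "ba", "aa"], "ab"))
```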

Hence, if some prime $p$ divides $\ell$ as well as the weight of all the codewords in $C_0$, then the product $M M^T$ vanishes over any extension of $\mathbb{F}_p$, and $M$ is a parity-check matrix of a code containing its dual. A more general setting is analyzed in the following proposition.


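Proposition V.21. Let $C_0$ be a linear code of length $\ell$ over $S$, $|S| = s$. Let also $C = \mathrm{IC}_q(C_0)$ with $\mathrm{char}(\mathbb{F}_q) = p$. Denote the length of $C$ by $n = \ell s$. If $C_0$ is $p$-divisible, then

$$C^\perp \cap C_{\mathrm{par}} \subseteq C\,,$$

where $C_{\mathrm{par}}$ denotes the parity-check code of length $n$ over $\mathbb{F}_q$. In particular, we get $\dim C \ge \frac{n-1}{2}$.

Moreover, if $p \mid \ell$, then $C^\perp \subseteq C$ and $\dim C \ge \frac{n}{2}$.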

Proof. Let $M$ be the incidence matrix of the transversal design $T_{C_0}$. Also denote by $J$ and $J'$ the all-ones matrices of respective size $|C_0| \times n$ and $|C_0| \times |C_0|$. If we assume that $C_0$ is $p$-divisible, then Lemma V.20 translates into

$$M M^T = \ell J' \mod p \qquad (3)$$

while an easy computation shows that $M J^T = \ell J'$. Hence, over $\mathbb{F}_q$ we obtain

$$M (M - J)^T = 0 \qquad (4)$$

which brings us to consider the code $A$ of length $n$ generated over $\mathbb{F}_q$ by the matrix $M - J$. Equation (4) indicates that $A \subseteq C$. Let $C_{\mathrm{par}} := \{c \in \mathbb{F}_q^n,\ \sum_i c_i = 0\}$ be the parity-check code of length $n$ over $\mathbb{F}_q$. Notice that $c \in C_{\mathrm{par}} \iff c J^T = 0$ and $uJ = 0 \iff uJ' = 0$. If $p \nmid \ell$, this leads to:

$$\begin{aligned}
C^\perp \cap C_{\mathrm{par}} &= \{c = uM \in \mathbb{F}_q^n,\ c J^T = 0\} \\
&= \{c = uM \in \mathbb{F}_q^n,\ \ell u J' = 0\} \\
&= \{c = uM \in \mathbb{F}_q^n,\ uJ = 0\} \\
&= \{u(M - J) \in \mathbb{F}_q^n,\ uJ = 0\} \\
&\subseteq A \subseteq C\,.
\end{aligned}$$

On the other hand, if $p \mid \ell$, then equation (3) turns into $M M^T = 0$, meaning that $C^\perp \subseteq C$.

Finally, the first bound on the dimension comes from

$$\dim C \ge \dim(C^\perp \cap C_{\mathrm{par}}) \ge \dim C^\perp - 1 = n - \dim C - 1\,,$$

while the second one is straightforward.

In terms of PIR protocols, the previous result translates into the following corollary.

Corollary V.22. Let $p$ be a prime, and assume there exists a $p$-divisible linear code of length $\ell_0$ over $\mathbb{F}_q$. Then, there exists $k \ge (\ell_0 q - 1)/2$ such that we can build a distributed PIR protocol for a $k$-entries database over $\mathbb{F}_q$, and whose parameters are $\ell(k) = \ell_0$ and $n(k) = \ell_0 q \le 2k + 1$.

Divisible codes over small fields have been well-studied, and contain for instance the extended Golay codes [15, ch.II.6], or the famous MDS codes of dimension 3 and length q + 2 over Fq [15, ch.XI.6].

Example V.23. The extended binary Golay code is a self-dual $[24, 12, 8]_2$ linear code. It produces a transversal design with 24 groups, each storing 2 points. Its associated incidence code $\mathrm{Code}_2(\mathrm{Golay})$ has length $n = 24 \times 2 = 48$ and dimension $\ge 24$, and by computation we can show that this bound is tight.

Remark V.24. In our application to PIR protocols, we would like to find divisible codes $C_0$ defined over large alphabets (compared to the code length), but these two constraints seem to be inconsistent. For instance, the binary Golay code presented in Example V.23 leads to a PIR protocol with a too expensive communication cost (24 bits of communication for an original file of size... 24 bits: that is exactly the communication cost of the trivial PIR protocol where the whole database is downloaded). Nevertheless, Example V.23 represents the worst possible case for our construction, in the sense that the rate of $\mathrm{IC}_2(\mathrm{Golay}_2)$ is exactly 1/2 (it attains the lower bound), and that each server stores 2 bits (which is the smallest possible). Codes with better rate and/or with larger server storage capability would then give PIR protocols with relevant communication complexity. For instance, the extended ternary Golay code gives better parameters (see Example VI.9).

Divisible codes over large fields seem not to have been thoroughly studied (to the best of our knowledge), since coding theorists tend to consider codes over small alphabets as more practical. We hope that our construction of PIR protocols based on divisible codes may encourage research in this direction.

VI. PIR PROTOCOLS WITH BETTER PRIVACY

When servers are colluding, the PIR protocol based on a simple transversal design does not ensure a sufficient privacy, because the knowledge of two points on a block gives some information on it. To solve this issue, we propose to use orthogonal arrays with higher strength t.

A. Generic construction and analysis

In the previous section, classical ($t = 2$) orthogonal arrays were used to build transversal designs. Considering higher values of $t$, we naturally generalize the latter as follows:

Definition VI.1 ($t$-transversal designs). Let $\ell \ge t \ge 1$. A $t$-transversal design is a block design $D = (X, \mathcal{B})$ equipped with a group set $\mathcal{G} = \{G_1, \dots, G_\ell\}$ partitioning $X$ such that:

• $|X| = s\ell$;

• any group has size $s$ and any block has size $\ell$;

• for any $T \subseteq [1, \ell]$ with $|T| = t$ and for any $(x_1, \dots, x_t) \in G_{T_1} \times \dots \times G_{T_t}$, there exist exactly $\lambda$ blocks $B \in \mathcal{B}$ such that $\{x_1, \dots, x_t\} \subset B$.

A $t$-transversal design with parameters $s, \ell, t, \lambda$ is denoted $t$-$\mathrm{TD}_\lambda(\ell, s)$, or $t$-$\mathrm{TD}(\ell, s)$ if $\lambda = 1$.

Given a $t$-transversal design $T$, we can build a $(t-1)$-private PIR protocol with exactly the same steps as in Section IV. First, we define the code $C = \mathrm{Code}_q(T)$ associated to the design according to Definition III.7, and then we follow the algorithm given in Figure 2. Since a $t$-transversal design is also a 2-transversal design for $t \ge 2$, the analysis is identical for every PIR feature, except for the security where it remains very similar.

Security ((t − 1)-privacy). Let T be a collusion of servers of size |T | ≤ t − 1. For varying i ∈ I, the distributions Q(i)|T