
( (β − f_id)/δ , ( ∏_{i=1}^{k} e(g^{Z_i}, Γ_{b_i}) )^δ ).

Since β ≠ f_id with probability 1 − 1/p, we can safely conclude that if there is an adversary A_2 that breaks the soundness of our scheme with a non-negligible advantage ε_A, then there is an adversary B_2 that breaks D-SBDH with a non-negligible advantage ε_B ≥ ε_A · (1 − 1/p).

Security of updates. All the critical update operations presented in Section 6.5.4 are performed on the server side, which mainly updates the Merkle trees T_W and T_F. Besides, the server generates the proof Π_upd, which attests to the correctness of the tree update; it consists of the updated accumulators and their authentication paths. Therefore, since we assume H is a collision-resistant hash function, the server cannot make the verifier accept incorrect updates.
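
To make the role of the collision-resistant hash concrete, the following is a minimal sketch of the Merkle authentication-path check that underlies this argument (Python, purely illustrative; this is not the thesis's C/PBC prototype, and the leaf encoding and function names are ours):

```python
import hashlib

def h(data: bytes) -> bytes:
    # H is instantiated with SHA-256, the collision-resistant hash assumed above.
    return hashlib.sha256(data).digest()

def verify_merkle_path(leaf: bytes, path, root: bytes) -> bool:
    """Recompute the root from a leaf and its authentication path.
    `path` lists (sibling_hash, sibling_is_left) pairs from the leaf level up to the root."""
    node = h(leaf)
    for sibling, sibling_is_left in path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

Making the verifier accept an incorrect update would require exhibiting a different leaf/path pair that hashes to the same root, i.e. a collision for H.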

6.7 Performance Evaluation

This section discusses the efficiency of our VCKS protocol. We also present experimental results based on a prototype implementation of our solution.

6.7.1 Discussion on Efficiency

Hereafter, we evaluate the costs of our scheme with respect to storage, computation and communication complexities. Table 6.2 gives a summary of these various costs.

6.7.1.1 Storage

With respect to storage complexity, the data owner stores and publishes the public key PK_F = ({g^{α^i}}_{0≤i≤D}, σ_W, σ_F), of size O(D). On the other hand, the cloud server S keeps the search key LK_F = (I, T_W, T_F, F, HT), resulting in a storage complexity of O(N + n), where N is the total number of keywords and n is the number of files. Indeed, the sizes of the index I, the hash table HT and the Merkle trees T_W and T_F are linear in N.
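
For a rough sense of scale, the sketch below plugs illustrative element sizes into the concrete bounds that Table 6.2 gives for these quantities; the sizes |G| = 1024 bits and |F_p| = 256 bits, as well as the value of D, are assumptions made for the example and not the thesis's parameters.

```python
# Illustrative storage estimate from the concrete bounds given later in Table 6.2.
# Element sizes are assumptions for the example: |G| = 1024 bits, |F_p| = 256 bits,
# 256-bit (SHA-256) tree nodes; the value of D is assumed as well.
G_BITS, FP_BITS, HASH_BITS = 1024, 256, 256

def owner_storage_bits(D):
    # PK_F = ({g^(alpha^i)}_{0<=i<=D}, sigma_W, sigma_F): D group elements plus two roots.
    return D * G_BITS + 2 * HASH_BITS

def server_index_bits(m, d, n):
    # Cuckoo index of m buckets with d slots, plus the nodes of the two Merkle trees
    # (a full binary tree with x leaves has 2x - 1 nodes); files F and table HT excluded.
    return m * d * FP_BITS + (2 * m - 1) * HASH_BITS + (2 * n - 1) * HASH_BITS

# Example with an assumed D = 100000 and the index parameters used later (m = 8192, d = 8).
print(owner_storage_bits(100_000) / 8 / 2**20, "MiB at the data owner")
print(server_index_bits(8192, 8, 25_293) / 8 / 2**20, "MiB of index material at the server")
```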

6.7.1.2 Computation

In light of the performance of the various building blocks (Cuckoo hashing, polynomial-based accumulators and Merkle trees), we analyze below the computational costs of our solution.

1. Setup: The Setup phase of our protocol is a one-time pre-processing operation that is amortized over an unlimited number of fast verifications. The computational cost of this phase is dominated by:

• The public parameter generation, which amounts to D exponentiations in G;

• N calls to CuckooInsert where, as shown in [74], each insertion is expected to terminate in (1/ε)^{O(log d)} time (ε > 0);

• The computation of the m accumulators A_W, which requires m exponentiations in G and md multiplications in F_p (in the worst case);

• The computation of the N accumulators A_F, which involves N exponentiations in G and Nn multiplications in F_p (in the worst case);

• The generation of the Merkle tree T_W (respectively T_F), which consists of 2m hashes (respectively 2N).

2. QueryGen: This algorithm does not require any heavy computation: it only constructs the query for the k keywords together with the corresponding VK_Q.

3. Search: Although this algorithm seems expensive, we highlight the fact that it is executed by the cloud server, which has more computational resources than the data owner and the users. Search runs k instances of CuckooLookup, which amount to 2k hashes and 2kd comparisons to search for all the k queried keywords (in the worst case). Following this operation, the complexity of this algorithm depends on whether all the keywords have been found:

• out = F_W: The complexity of Search is governed by:

– The computation of the k file accumulators A_F. Without the knowledge of the trapdoor α, and using FFT interpolation as specified in [52], this operation performs kn log n multiplications in F_p and k exponentiations in G (a toy sketch of this accumulator computation, with and without the trapdoor, is given at the end of this subsection);

– The generation of the authentication paths in tree T_F for these accumulators, which amounts to k log N hashes;

– The generation of the proof of intersection, which takes O((kn) log²(kn) log log(kn)) multiplications¹⁰² in F_p to compute the gcd of the characteristic polynomials of the sets involved in the query result.

• out = ∅: The computational costs of this phase consist of:

– The generation of the proof of non-membership for the missing keyword by calling GenerateWitness twice. This operation requires 2(d + d log d) multiplications in F_p and 2d exponentiations in G;

– The computation of the 2 bucket accumulators A_W, which amounts to 2d log d multiplications in F_p and 2d exponentiations in G;

– The generation of the 2 authentication paths for these 2 buckets by running GenerateMTProof on tree T_W, which performs 2 log m hashes.

4. Verify: We also analyze the complexity of this algorithm according to whether all the keywords have been found:

• out = F_W: Verify runs k instances of VerifyMTProof on tree T_F, which requires k log N hashes. It computes the accumulator of the intersection, which, in the worst case, amounts to n multiplications in F_p and 1 exponentiation in G. Then, it executes VerifyIntersection, which computes 3k pairings and k multiplications in G_T.

• out = ∅: Verify runs VerifyMTProof twice on tree T_W, which computes 2 log m hashes, and it invokes VerifyMembership twice, which evaluates 3 × 2 = 6 pairings.

In summary, to verify the search results, a verifier V performs very light computations compared to the computations undertaken by the server when answering keyword search queries and generating the corresponding proofs. Besides, the verification cost depends on k only in the case where all the keywords have been found, and is independent of k otherwise. Furthermore, we believe that for large values of k, the probability that the search returns a set of files containing all the k keywords is low; in that case the verification cost is constant and small (6 pairings and 2 log m hashes). On the other hand, for smaller values of k, the verification cost remains efficient.
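
To make the cost asymmetry between the data owner and the server concrete, the following toy sketch (Python, illustrative only; a prime-order subgroup of Z_q* with tiny parameters stands in for the pairing-friendly group G of the actual scheme, and all values are ours) shows the two ways of computing a polynomial-based accumulator Acc(S) = g^{∏_{s∈S}(s+α)}: with the trapdoor α, |S| multiplications in F_p and one exponentiation suffice, whereas without α the characteristic polynomial must be expanded and combined with the public parameters {g^{α^i}}.

```python
# Toy sketch of the polynomial-based accumulator Acc(S) = g^(prod_{s in S}(s + alpha)).
# Illustrative only: a prime-order subgroup of Z_q^* with tiny parameters stands in
# for the pairing-friendly group G of the actual scheme.

q = 607                    # toy modulus; 607 is prime and 607 - 1 = 6 * 101
p = 101                    # prime order of the subgroup, i.e. the exponent field F_p
g = pow(3, 6, q)           # g = 3^6 mod 607 generates the subgroup of order p
alpha = 57                 # trapdoor, known to the data owner only

def acc_with_trapdoor(S):
    """Data owner (Setup): |S| multiplications in F_p, then one exponentiation in G."""
    e = 1
    for s in S:
        e = (e * (s + alpha)) % p
    return pow(g, e, q)

def char_poly(S):
    """Coefficients of prod_{s in S}(X + s) over F_p; naive O(|S|^2) expansion
    (the server would use FFT-based techniques to reach O(|S| log |S|))."""
    coeffs = [1]
    for s in S:
        new = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            new[i + 1] = (new[i + 1] + c) % p       # c * X^(i+1)
            new[i] = (new[i] + c * s) % p           # c * s * X^i
        coeffs = new
    return coeffs                                   # coeffs[i] multiplies X^i

def acc_without_trapdoor(S, pub):
    """Server (Search): combine the public parameters pub[i] = g^(alpha^i),
    one exponentiation per coefficient of the characteristic polynomial."""
    acc = 1
    for i, c in enumerate(char_poly(S)):
        acc = (acc * pow(pub[i], c, q)) % q
    return acc

pub = [pow(g, pow(alpha, i, p), q) for i in range(8)]   # {g^(alpha^i)}, 0 <= i <= D
S = [11, 42, 99]
assert acc_with_trapdoor(S) == acc_without_trapdoor(S, pub)
```

The assertion checks that both computations yield the same group element; the naive polynomial expansion here is quadratic, which is precisely the cost that the FFT-based interpolation mentioned above avoids on the server side.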

¹⁰² More details on this complexity computation can be found in [140, 52].

Table 6.2: Complexities of our protocol for verifiable keyword search

Notations:
D: parameter of our system; n ≤ D: number of files; N: number of keywords;
m: number of buckets in the index; d: size of a bucket; k: number of keywords in a query.
|G| refers to the size (in bits) of elements in set G. We assume H is SHA-256.

Storage:
  Data owner   O(D)        D·|G| + 2×256 bits
  Server       O(N + n)    md·|F_p| + (2m − 1)×256 + (2n − 1)×256 bits

Communication:
  Outbound              O(k)        k·|{0,1}| bits
  Inbound, out = F_W    O(n + k)    n·|{0,1}| + 3k·|G| + k·log n × 256 bits
  Inbound, out = ∅      O(1)        2·|G| + 2·(|G| + |F_p|) + 2·log m × 256 bits

Operations:
                            Setup          QueryGen   Search          Search          Verify         Verify
                                                      (out = F_W)     (out = ∅)       (out = F_W)    (out = ∅)
  Hashes in {0,1}           2(m + N + 1)   -          k·(3 + log N)   2 log m + 3k    k log N        2 log m
  Multiplications in F_p    md + Nn        -          kn·log n        2d + 4d log d   n              -
  Multiplications in G_T    -              -          -               -               k              -
  Exponentiations in G      D + m + N      -          k               4d              1              -
  Pairings                  -              -          -               -               3k             6
  CuckooInsert              N              -          -               -               -              -
  ProveIntersection         -              -          1               -               -              -
  Light computation         -              k          -               -               -              -

6.7.1.3 Communication

In terms of communication complexity, our protocol for VCKS is relatively lightweight. In particular, sending a search query to server S amounts to O(k) space, where k is the number of keywords in that query.

The size of the server's answer to the query depends on the outcome of the Search algorithm, as shown in Figure 6.8:

If all the keywords are found: the server's result is of size O(n + k), where n corresponds to the worst-case scenario in which all the outsourced files contain the conjunction of keywords, and k results from the underlying proof of intersection of the k sets of files F_ωi.

If one keyword is not found: in this case, server S sends a search result of size O(1) that contains the keyword that was not found and its corresponding proof of non-membership.
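
As a back-of-the-envelope illustration of these two bounds, the sketch below instantiates the inbound-size formulas of Table 6.2 with assumed element sizes; the values |G| = 1024 bits, |F_p| = 256 bits, 32-bit file identifiers and the example parameters are ours, not the thesis's.

```python
import math

# Back-of-the-envelope response sizes from the inbound formulas of Table 6.2.
# The element sizes are illustrative assumptions, not the thesis's parameters.
G_BITS, FP_BITS, HASH_BITS, ID_BITS = 1024, 256, 256, 32

def inbound_bits_found(n_matching, k, n):
    """out = F_W: matching file identifiers, proof of intersection, k Merkle paths in T_F."""
    return (n_matching * ID_BITS
            + 3 * k * G_BITS
            + k * math.ceil(math.log2(n)) * HASH_BITS)

def inbound_bits_missing(m):
    """out = {}: two bucket accumulators, two witnesses, two Merkle paths in T_W."""
    return 2 * G_BITS + 2 * (G_BITS + FP_BITS) + 2 * math.ceil(math.log2(m)) * HASH_BITS

# Example: k = 2 keywords over n = 25293 files (13 of which match), m = 8192 buckets.
print(inbound_bits_found(13, 2, 25293) // 8, "bytes when all keywords are found")
print(inbound_bits_missing(8192) // 8, "bytes when one keyword is missing")
```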

6.7.1.4 Impact of D on performance

This performance analysis assumes n ≤ D, where n is the number of files. The value of D solely depends on the security parameter κ and, as such, defines an upper bound on the size of the sets for which we can compute a polynomial-based accumulator. It follows that in our protocol, the number of files that a data owner can outsource at once is bounded by D. However, it is still possible to accommodate file sets that exceed the bound D. The idea is to divide the set of size n into n′ = ⌈n/D⌉ smaller sets of size at most D. Using the same public parameters, Setup accordingly creates for each set of D files an index and the corresponding Merkle trees. This increases the complexity of Setup by a factor of n′: the data owner is required to build n′ Cuckoo indexes and 2n′ Merkle trees. However, since D is now constant, the computation of each file accumulator Acc(F_ω) will take O(D log D) = O(1) time.

Besides, the server has to run ProveIntersection n′ times, where each run takes O((kD) log²(kD) log log(kD)) = O(k log²(k) log log(k)) time (since D is a constant).
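
A minimal sketch of this partitioning step follows (illustrative Python; D = 10000 is an example value, and the actual Setup would additionally build one Cuckoo index and two Merkle trees per chunk):

```python
import math

def partition_files(file_ids, D):
    """Split a collection of n files into n' = ceil(n / D) chunks of at most D files,
    so that each chunk can be indexed with the same public parameters {g^(alpha^i)}_{i<=D}."""
    n = len(file_ids)
    n_prime = math.ceil(n / D)
    return [file_ids[i * D:(i + 1) * D] for i in range(n_prime)]

# Example: n = 25293 files with an assumed D = 10000 yields n' = 3 chunks.
chunks = partition_files(list(range(25293)), D=10000)
assert [len(c) for c in chunks] == [10000, 10000, 5293]
```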

6.7.2 Experimental Analysis

This section evaluates the performance of our proposed PVCKS scheme with two datasets:

• For the purpose of experimental analysis, we created a collection of files which gathers the complete plays of Molière¹⁰³. We will refer to this dataset as “Molière”. This dataset consists of n = 25 293 files, is of size 1.34 MB, and is populated by N = 16 720 distinct keywords.

• The Enron email dataset¹⁰⁴ is used to test our protocol in a real-world scenario. This collection includes emails generated by 150 former employees of Enron, an American energy company that went bankrupt in the early 2000s. The dataset was acquired during the investigation that followed Enron's collapse and was made publicly available for study and research purposes. This investigation corresponds to a scenario where our PVCKS scheme would be an interesting tool: verifiers from financial auditing companies or legal authorities might be required to search the collection of emails for keywords related to their investigations. This search might be delegated to the cloud, which is then compelled to return the search results together with a proof of correct search. The Enron email dataset contains 2.54 GB of data, n = 517 401 files and N = 2 141 711 unique keywords.

Experimental Setup. Our prototype is written in C and uses the Pairing-Based Cryptography (PBC) library¹⁰⁵. We ran the experiments on two different machines. The Molière dataset was tested on a laptop with an Intel Core i5-4258U CPU @ 2.4 GHz, 4 GB of RAM and a 64-bit system.

We tested the Enron email dataset on a desktop machine with an Intel Core i5-2500 CPU @ 3.80 GHz, 16 GB of RAM and a 64-bit system.

6.7.2.1 Benchmarks on the Molière dataset

Evaluation of the Setup phase. To evaluate the performance of the Setup phase, we run our prototype on three different scenarios based on the Molière dataset. We arrange the collection of files so that it contains n = 1, n = 253 or n = 25293 files; each of these settings involves the same set of N = 16720 unique keywords. The benchmark times reported here correspond to the average over 5 trials of our prototype against the dataset. Table 6.3 shows the experimental results of the different steps of the Setup phase when it is run on the collection of n = 25293 files. In all cases, the prototype builds a Cuckoo index that can store up to m × d = 65536 keywords; hence, 26% of the index is populated.
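
The occupancy figure follows directly from the fixed keyword count and the index capacity: N / (m × d) = 16720 / 65536 ≈ 0.255, i.e. roughly 26%.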

As we can see in Table 6.3, the most computationally expensive operation is the computation of the accumulators A_F. This is in accordance with our performance evaluation: algorithm Setup must compute N accumulators A_F, one for each keyword, and the underlying polynomial of each of these accumulators is of degree at most n. Nevertheless, notice that this computational burden is acceptable in the amortized model our protocol adopts: Setup is executed once for multiple search queries and verifications. We discuss amortization later on.

Figure 6.15 compares the average total time of the Setup phase for different values of n and m (and thus different values of d, since N is fixed and m, d and N are linked by the relation m·d = (1 + ε)N).

¹⁰³ “Molière, œuvres complètes”, http://tiny.cc/17hu8x [Accessed: February 4, 2016].

¹⁰⁴ The Enron Email Dataset, http://tiny.cc/d9hu8x [Accessed: February 4, 2016].

¹⁰⁵ The Pairing-Based Cryptography Library: https://crypto.stanford.edu/pbc/ [Accessed: February 4, 2016].

Table 6.3: Time of the Setup phase (n = 25293 files)

  m      d    Index I (s)   A_W (s)   {F_ωi} (s)   A_F (s)   T_W (s)   T_F (s)   Total (s)
  8192   8    3.489         7.252     11.484       26.591    0.147     2.284     51.248
  4096   16   1.819         4.240     10.568       28.843    0.039     2.341     47.851
  2048   32   2.037         2.038     10.505       29.760    0.012     2.357     46.708
  1024   64   1.940         1.074     10.998       27.889    0.003     2.270     44.175

Figure 6.15: Average total time of the Setup phase

In particular, according to the performance analysis conducted in Table 6.2, for N and m fixed, the Setup phase depends linearly on n (in the worst case) only because of the computation of the accumulators A_F. However, in our setting, we do not reach this worst case, since no single keyword is present in all n files of the tested data collection. Empirical analysis shows that, on average, a keyword is present in 6 files in the case of n = 253 files, and in 13 files in the case of n = 25293 files. We also highlight the fact that the overall efficiency of the Setup phase is acceptable for practical scenarios.

Evaluation of the Search phase. To assess the performance of the algorithms Search and Verify, we use the same dataset as before. We want to evaluate the dependence of these algorithms on n (the number of files) and k (the number of searched keywords), while N is fixed (N = 16720 unique keywords in the entire dataset). We also look at the two cases where the search outputs an empty or a non-empty result. All the times reported in this paragraph are computed as the average over 25 executions of our prototype.

We first run our program for a fixed number of searched keywords: k = 2. In this setting, we evaluate the impact of n on the efficiency of the algorithms when all the keywords of the search query are found. Hence, our prototype launches the search operations for several values of n: n = 10, 100, 1000, 10000 files (selected from our test dataset). Figure 6.16 and Table 6.4 present the results of this benchmark. As we can see, increasing n slows down the search operation. This is due to the computation of each accumulator A_F which, in the worst case, involves n files. ProveIntersection is also a computationally demanding operation, since it has to compute the gcd of k sets, where each of the sets may include, in the worst case, the identifiers of n files. Therefore, we consider ProveIntersection the most expensive operation performed in our solution. Besides, we observe that the time required by Verify remains almost constant as n grows, since the intersection accumulator it computes involves only the few files that actually match the query; Figure 6.16 and Table 6.4 illustrate this behaviour.
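
For intuition on what ProveIntersection computes, here is a toy, schoolbook-arithmetic illustration (Python, with a tiny prime standing in for F_p) of the underlying fact that the gcd of the characteristic polynomials ∏(X + f) of two sets of file identifiers is the characteristic polynomial of their intersection; the actual scheme works over a large field with the FFT-based algorithms of [140, 52] and additionally produces witnesses that make the result verifiable, which this sketch omits.

```python
P = 101  # toy prime standing in for the field F_p

def poly_trim(a):
    """Drop trailing zero coefficients (keep at least one coefficient)."""
    while len(a) > 1 and a[-1] == 0:
        a = a[:-1]
    return a

def poly_mul(a, b):
    res = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            res[i + j] = (res[i + j] + x * y) % P
    return res

def char_poly(S):
    """Characteristic polynomial prod_{s in S}(X + s); index i holds the X^i coefficient."""
    out = [1]
    for s in sorted(S):
        out = poly_mul(out, [s % P, 1])
    return out

def poly_mod(a, b):
    """Remainder of a modulo a monic polynomial b over F_P."""
    a = poly_trim(list(a))
    while len(a) >= len(b) and a != [0]:
        factor, shift = a[-1], len(a) - len(b)
        for i, c in enumerate(b):
            a[shift + i] = (a[shift + i] - factor * c) % P
        a = poly_trim(a[:-1])
    return a

def monic(a):
    """Scale a nonzero polynomial so that its leading coefficient is 1."""
    a = poly_trim(a)
    inv = pow(a[-1], -1, P)
    return [(c * inv) % P for c in a]

def poly_gcd(a, b):
    """Monic gcd of two nonzero polynomials over F_P (Euclid's algorithm)."""
    a, b = monic(a), monic(b)
    while len(b) > 1:              # stop once b is a constant
        r = poly_mod(a, b)
        if r == [0]:
            return b               # b divides a exactly
        a, b = b, monic(r)
    return [1]                     # nonzero constant remainder: the sets are disjoint

F1, F2 = {3, 7, 19, 42}, {7, 42, 55}
assert poly_gcd(char_poly(F1), char_poly(F2)) == char_poly(F1 & F2)  # = (X + 7)(X + 42)
```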

Table 6.4: Time of the Search phase (fixed number of searched keywords)

  n       ProveIntersection (s)   Search (s)   Verify (s)
  10      0.002                   0.005        0.009
  100     0.005                   0.008        0.009
  1000    0.075                   0.114        0.009
  10000   0.703                   1.063        0.011

Figure 6.16: Total time of the Search phase (fixed number of searched keywords)

Thereafter, we conduct a second experiment where n is fixed to n = 25293 files and the number of searched keywords k varies over the series k = 2, 5, 10. This scenario assesses the effect of k on the efficiency of Search and Verify. Table 6.5 and Figure 6.17 display the times to compute the proof of intersection, the search and the verification in the case where all the keywords of the search query are found.

Table 6.5: Time of the Search phase - all keywords are found

  k    ProveIntersection (s)   Search (s)   Verify (s)
  2    13.121                  21.369       0.009
  5    120.867                 172.121      0.017
  10   305.062                 368.500      0.036

First, let us consider the time required by algorithm Verify. As expected from the theoretical performance evaluation, it increases linearly with k, the number of keywords to be searched for, and is negligible compared to the running time of algorithm Search. In addition, Figure 6.17 shows that the step that generates the intersection of sets and its proof determines the time of Search.

Figure 6.17: Total time of the Search phase - all keywords are found

Besides, while N and n are fixed in this experiment, we observe that the times for ProveIntersection and Search are not strictly linear in k, as expected from the performance evaluation conducted in Table 6.2. We explain this divergence by the fact that our theoretical evaluation does not take into account the different sizes of the sets accumulated in A_F. Indeed, Table 6.2 presents worst-case complexities, which assume that each of the N keywords appears in all n files. In practice, in the case of n = 25293 files, a keyword is present in only 13 files on average. Hence, between two distinct values of k, the computation of the accumulators A_F and of the proof of intersection of their respective sets affects the search time differently.

Finally, we study the performance of our protocol when the search does not find one of the queried keywords. In the following, we evaluate the evolution of the time required by the algorithms Search and Verify as a function of k, the number of searched keywords in a query. These results are presented in Table 6.6 and depicted in Figure 6.18. First, we notice that, in accordance with the theoretical performance evaluation presented in Table 6.2, the running time of Verify is independent of k. Secondly, we highlight the fact that in this experiment, the search takes much less time than in the case where all the keywords in the search query are found.

Furthermore, while we were expecting the time of Search to be linear in k, Figure 6.18 shows a different evolution. We can explain this discrepancy by the fact that our theoretical performance was studied in the worst case, where each bucket in the Cuckoo index I is populated by d keywords (the maximum capacity of a bucket). In reality, not all the buckets in the index are full. Hence, the computation of the accumulators A_W for the missing keyword may take less time than in the worst-case scenario. Therefore, in our experiment, we believe that in the case where k = 5, the buckets corresponding to the missing keyword were empty, while in the case where k = 2, these buckets were probably filled with up to d keywords (in this experiment, d = 8).

Table 6.6: Time of the Search phase - one keyword is not found

  k    Search (s)   Verify (s)
  2    0.039        0.008
  5    0.008        0.008
  10   0.049        0.008

Figure 6.18: Total time of the Search phase - one keyword is not found

Discussion on Amortization and Outsourceability. We now discuss the amortization and the outsourceability of the conjunctive keyword search operation. In particular, we define, in terms of computation, storage and bandwidth consumption, the criteria that make the use of our verifiable keyword search protocol a better strategy for a data owner than performing the search locally.

In terms of computational performance, we compare our protocol executed over the Molière dataset with a local search performed at the client side (without resorting to cloud technology). This local search requires a pre-processing step which includes the construction of the Cuckoo index I as well as the indexing {F_ωi} of the files according to the keywords they contain. Then, the data owner can perform the search for a conjunction of keywords by executing algorithm CuckooLookup over index I and computing the intersection of the sets {F_ωi} for all the keywords targeted by the search. Table 6.7 depicts the time to run the pre-processing step and the search for 2 keywords for different sizes of the Molière dataset.
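
For reference, the local baseline amounts to the following sketch (illustrative Python; a plain dictionary stands in for the Cuckoo index I of the C prototype, and the function names are ours):

```python
from functools import reduce

def preprocess(files):
    """Pre-processing: build an inverted index mapping each keyword w to the set F_w
    of identifiers of files that contain it."""
    index = {}
    for file_id, words in files.items():
        for w in words:
            index.setdefault(w, set()).add(file_id)
    return index

def local_search(index, keywords):
    """Conjunctive search: look up each keyword and intersect the sets {F_w}."""
    sets = [index.get(w, set()) for w in keywords]
    return reduce(set.intersection, sets) if sets else set()

# Usage example on a toy collection.
files = {1: {"tartuffe", "imposteur"}, 2: {"tartuffe", "avare"}, 3: {"avare"}}
assert local_search(preprocess(files), ["tartuffe", "avare"]) == {2}
```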

Table 6.7: Local search computational performance

  n       Pre-Processing (s)   Search (s)
  10      0.002                2.524 × 10^−5
  100     0.017                2.968 × 10^−4
  1000    0.203                3.326 × 10^−3
  10000   2.618                0.021

First, we can see from Table 6.7 that while, for a large number of files, our protocol satisfies the efficiency requirement we gave in Section 3.4.2 (namely, the verification time shown in Table 6.5 is less than the local search time for n = 10000), it is not outsourceable under the computational definition of the amortized model. To better understand this concept, we introduce the following definition for a verifiable keyword search solution.

Definition 20 (Outsourceability - Computation). The criterion of outsourceability in computation for a verifiable keyword search scheme is determined by a parameter x ≥ 0 such that:

t_Setup + x · (t_ProbGen + t_Verify) ≤ t_PreprocessingLocal + x · t_SearchLocal

where t_algo is the time required to execute algorithm algo. Hence, x is defined by the relation:

x=

tSetup−tPreprocessingLocal

tSearchLocal −(tProbGen+tVerify)
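
The relation can be read as a break-even point: x is the number of delegated queries after which the one-time Setup cost is repaid. A small sketch follows (illustrative Python; the timings in the example are assumptions in the spirit of Tables 6.3, 6.4 and 6.7, and t_ProbGen in particular is not reported in the thesis, so its value here is made up):

```python
def outsourceability_threshold(t_setup, t_preproc_local, t_probgen, t_verify, t_search_local):
    """Break-even number of queries x from Definition 20, or None when no x >= 0 exists
    (i.e. when a local search is at least as fast as preparing and verifying a query)."""
    denom = t_search_local - (t_probgen + t_verify)
    if denom <= 0:
        return None
    return max(0.0, (t_setup - t_preproc_local) / denom)

# Assumed timings in seconds (t_probgen = 0.02 is a made-up value for illustration).
print(outsourceability_threshold(t_setup=44.175, t_preproc_local=2.618,
                                 t_probgen=0.02, t_verify=0.011, t_search_local=0.021))
# -> None: under these timings, local search beats query generation plus verification.
```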

Regarding the tests we performed on the Molière dataset, we cannot find an x ≥ 0 that meets the outsourceability definition. Nonetheless, our protocol is still arguably outsourceable. Indeed, the above definition of outsourceability only takes into account the computational performance of a verifiable keyword search solution. However, in real-world scenarios, storage and bandwidth consumption are also to be considered in the outsourceability criteria:

Storage: Our protocol targets large databases that cannot be stored locally at the data owner's side due to their heavy memory consumption. If we consider the case where users may use laptops, smartphones or other devices with low memory resources, outsourcing the storage, and thus the search operation, to the cloud becomes necessary. Therefore, the outsourceability definition must take this storage criterion into account: for data owners with limited memory resources, outsourcing their large databases to the cloud while delegating the search operation to the cloud server is the best strategy.

Bandwidth: In the context of a publicly delegatable and verifiable keyword search solution, bandwidth consumption is a criterion of paramount importance in the notion of outsourceability. Indeed, without our protocol, if a data owner delegates to third-party queriers the capability to search her large dataset, she has to transmit for each querier
