• Aucun résultat trouvé

Distributed SPARQL Throughput Increase: On the effectiveness of Workload-driven RDF partitioning

N/A
N/A
Protected

Academic year: 2022

Partager "Distributed SPARQL Throughput Increase: On the effectiveness of Workload-driven RDF partitioning"

Copied!
4
0
0

Texte intégral

(1)

Distributed SPARQL Throughput Increase: On the effectiveness of Workload-driven RDF

partitioning

Cosmin Basca and Abraham Bernstein

DDIS, Department of Informatics, University of Zurich, Zurich, Switzerland {lastname}@ifi.uzh.ch

1 Introduction

The current size and expansion of the Web of Data or WoD, as shown by the stag- gering growth of the Linked Open Data (LOD) project1, which reached to over 31 billion triples towards the end of 2011, leaves federated and distributed Semantic DBMS’ or SDBMS’ facing the open challenge of scalable SPARQL query pro- cessing. Traditionally, SDBMS’ push the burden of efficiency at runtime on the query optimizer. This is in many cases too late (i.e., queries with many and/or non-trivial joins). Extensive research in the general field ofDatabases has iden- tified partitioning, in particular horizontal partitioning, as a primary means to achieve scalability. Similarly to [2] we adopt the assumption thatminimizing the number of distributed-joins as a result of reorganizing the data over participating nodes will lead to increased throughput in distributed SDBMS’. Consequently, the benefit of reducing the number of distributed joins in this context is twofold:

A) Query optimization becomes simpler. Generally regarded as a hard prob- lem in a distributed setup, query optimization benefits, at all execution levels, from fewer distributed joins. Duringsource selection the optimizer can use spe- cialized indexes like in [5], while during query planning better query plans can be devised quicker, since much of the optimization burden and complexity is shifted away from the distributed optimizer to local optimizers.

B) Query execution becomes faster. Not having to pay for the overhead of shipping partial results around, naturally reduces the time spent waiting for usually higher latency network transfers. Furthermore, federated SDBMS’ incur higher costs as they have to additionally serialize and deserialize data.

The main contributions of this poster are: i) the presentation of a novel and na¨ıve workload-based RDF partitioning method2andii)an evaluation and study using a large real-world query log and dataset. Specifically, we investigate the impact of various method-specific parameters and query log sizes, comparing the performance of our method with traditional partitioning approaches.

2 Method Overview

Traditional approaches like Schism construct a graph representation where ver- texes are tuples that participate in workload transactions. The graph is extended

1 http://linkeddata.org/

2 This work was partially supported by the Swiss National Science Foundation under contract number 200021-118000.

(2)

to include all other tuples using a partition-trained classifier. Following this idea, triples would be considered vertexes, while edges are created when any two triples participate in the same query. This is however not feasible for RDF data.

Query Partitions

View Query

Triples Index Federator, or SPARQL

Broker endpoint 1) SPARQL logging

Triple Partitions

View Query Graph Partitioning

Triple Propagation & Replication

Partition C Partition B Partition A SPARQL

Query

2) Query Space: Partitioning

3) Triple Space: Replication & Propagation 4) Triple Space: Placement

Triples Distribution Index

Fig. 1.A simple generalized view of the partitioning process.

Following this graph representation in our early attempts led to very dense graphs, which proved to be too large for state of the art graph parti- tioning software like Metis [4].3 Next, we detail all steps seen in Figure 1 except the simpler 1st phase where queries and their results are logged.

Data Representation & Graph Partitioning. Since mapping triples to ver- texes does not scale well for RDF data, we pursued an intermediate representa- tion: theQueries graph.

Q2

Q5 Q3

Q4 Q1 100

20

10

150 50

Partition 2 Partition 1 Dashed edges are replicated

i.e.: (Q1,Q4) and (Q1,Q3)

Size = 3000

Size = 1500 Size = 5000

Size = 10000

Size = 200

Fig. 2.Basic query log-driven data representation.

Each query in the workload be- comes a vertex, while edges be- tween queries are formed when some triples participate in more than one query, with the num- ber of common triples as edge weights (Figure 2). Finally, we apply Metis on the newly formed

queries graph, forcing balanced partitions as a result of thegraph-cut operation.

Replication. After performing thegraph-cut, there will still be distributed joins even on the workload queries (i.e., queryQ1 will require a distributed join while queryQ2 not). A straight-forward solution is to replicate the triples that reside on the border between partitions. We proceed with identifying the minimum set of triples that needs to be replicated, copying the extra triples from the smaller sized query (i.e., copy extra triples from query Q1 over to Partition 2).

Propagation. While the process outlined so far guarantees that each query in the workload can be executed without a single distributed join

?

?

O

S P

?

?

? ?

?

?

<?, ?, S > <S, ?, ? > <O, ?, ? >

?

?

?

?

?

?

?

?

Fig. 3.Visual depiction of the propagation patterns.

there are no guarantees for fu- ture unknown queries. A method of expanding the set of all triples which participate in all workload queries is needed. For this we rely on the principle of (Spatial) Lo- cality of Reference [3] adapted to the logical graph representation.

In effect wepropagate along the edges in the original RDF data graph to iden- tify new triples “related” to existing triples which participated in all workload

3 The resulting input edge file amounted to approx. 150GB on disk, crashing Metis.

(3)

queries. Hence, we perform an n-hop4 edge propagation matching the follow- ing triple patterns given a triple <s,p,o>(also depicted Figure 3): a) siblings:

< s,?,?>,b) outgoing edges:<o,?,?>and c) incoming edges: <?,?,s>. The re- maining dataset triples which have not been considered so far, are randomly distributed to theKselected partitions, or by hashing by subject.

3 Results & Conclusions

We make use of the USEWOD Data challenge [1] log file to extract 400k valid and well formed SPARQL SELECT queries that produce at least 1 result, all other log entries are discarded. We use a local instance of theVirtuoso RDF- store to resolve them against DBpedia 3.5.1. Furthermore, we assume aperfect distributed query optimizer, able to find the best possible query decomposition.

Measurements were conducted on a node with 72GB of RAM, 8 Cores @2.93GHz.

We compare our method againstrandom partitioning, expert (manual) par- titioning5 andhash partitioning. For the latter we hash on all possible combina- tions of a triple:S,P,O,SP,SO,POandSPO.6 Given the small to average size of the DBpedia dataset (ca. 43.6 million triples), we fixed the number of partitions toK= 8, simulating a small to medium sized cluster. Furthermore, we randomly sampled the workload, with sizes consisting of 1k, 5k, 10k, 25k, 50k and 100k queries from the total of 400k logged. The number of propagationhopswas set to 0, 1 and 2 respectively whilereplicationwas enabled in all cases.

1.55   2.76   5.72   8.87  

9.78   26.01  

34.29  

59.08  

72.63  

12.13   28.64  

45.08   39.47  

65.44  

76.26  

0.00   10.00   20.00   30.00   40.00   50.00   60.00   70.00   80.00   90.00  

0   10000   20000   30000   40000   50000   60000   70000   80000   90000   100000   Percentage  of  Triples  Assigned  to   Par33ons  from  Total  (Par33oned  +   Replicated  +  Propagated)  

Training  Workload  Sample  Size  (#  queries)  

Me/s  random  0   Me/s  random  1   Me/s  random  2  

Fig. 4.The % of triples assigned to partitions from total triples, for each training query set.

As we can visually observe in Figure 4 that although the increase in num- ber of triples reached through the partitioning process (excluding the triples not connected at all) is significant from 50k to 100k queries, there are diminishing re- turns as the expansion process starts to slow down. Indeed doubling the number of queries to log, yields approximatively a 10% increase at this point. Therefore, we observe that a training log size of 50k queries represents an optimal point.

Performance Impact of Graph Partitioning. Even-though the general problem of graph partitioning is known to be NP-hard, the approximating algo- rithm implemented in Metis performs very well, finalizing the queries graph cut in 0.17 seconds for 1k queries and 0.71 seconds for 100k queries.

4 Multiple hops only enabled for in & out edges to avoid an expensive avalanche effect.

5 Each large dump (if>1M triples) to own partition, remainder grouped together.

6 We use of thecityhashfamily of hash functions, due to low-collision rate and speed.

(4)

3.66  3.82  3.96  

4.40  

8.45   8.66  

1.34  1.70  

2.35   2.15  

5.02  

6.14  

3.34  

0.8   1.8   2.8   3.8   4.8   5.8   6.8   7.8   8.8   9.8  

0   10000   20000   30000   40000   50000   60000   70000   80000   90000   100000   Performance  Improovement  Normalized  to   Slowest  (#  Distributed  Joins)  

Training  Workload  Sample  Size  (#  queries)  

Me/s  hash  S  2   Me/s  hash  S  1   Me/s  random  2   Me/s  random  1   Hash  S   Expert  (Manual)   Hash  P   Me/s  random  0   Hash  SP   Hash  O   Hash  SO   Hash  PO   Random   Hash  SPO  

Fig. 5.The performance improvement relative to the lowest performing method (hash SPO).

Number of Hops Impact when Propagating. Figure 5 plots the relative performance improvement over the lowest performing method. Non-workload driven partitioning methods appear as horizontal lines. The worst performing ones are the Hash SPO method with Random & Hash PO/SO exposing simi- lar performance levels. Hashing by subjectHash S is performing best, followed closely by the Expert (manual) distribution method. This could suggest that the majority of the workload queries are dominated by star-shaped basic graph patterns and contain few joins.7 When using random distribution of remain- ing triples our method performs unsatisfactory for smaller workload sizes, but becomes substantially better by 50k queries. At 100k queries it exposes a 6.14 performance factor being 2.81 and 3.1 times better thanHash S andExpert re- spectively. When remaining triples are distributed based on subject hashes, the method outperforms all other methods at all workload sizes. At best (Metis hash S 2) our method is between 3.6 and 8.66 times better than the lowest performing and up to 5.32 times better than hashing by subject. In essence the best case partitioning would produce on average of 0.10 distributed joins per query.

References

1. B. Berendt, L. Hollink, V. Hollink, M. Luczak-Rsch, K. H. Mller, and D. Vallet.

Usewod2011 - 1st international workshop on usage analysis and the web of data. in 20th international world wide web conference (www2011), hyderabad, india, 2011.

2. C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3:48–57, Sept. 2010.

3. P. J. Denning. The locality principle. Communications of the ACM, 48, July 2005.

4. G. Karypis and V. Kumar. MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0. http://www.cs.umn.edu/metis, 2009.

5. Y. Yan, C. Wang, A. Zhou, W. Qian, L. Ma, and Y. Pan. IEEE Xplore - Efficient Indices Using Graph Partitioning in RDF Triple Stores. InICDE2009: IEEE 25th International Conference on Data Engineering, 2009., pages 1263 – 1266, 2009.

7 A fact we intend to investigate in depth in the near future

Références

Documents relatifs

RIQ’s novel design includes: (a) a vector representation of RDF graphs for efficient indexing, (b) a filtering index for efficiently organizing similar RDF graphs, and (c)

One of the obstacles for showing hardness (which is required for the cases (3) and (5)) is that the set of all triple patterns occurring in the query has to be acyclic and of

So, instead of providing just the affecting change operation and the ontology version, our idea is to present the history of the evolution of the specific parts of the

We also consider arbitrary SPARQL queries and show how the concept and role hierarchies can be used to prune the search space of possible answers based on the polarity of

That is, for sets of source and target vertices that become bound to the variables of a reachability predicate ?x ?y, the DRJ operator performs a distributed

• We propose a graph-stream pattern matching approach to transform a stream of vertices and edges G into a sequence of motifs, where each motif represents a sub- graph in G likely to

We use SPARQL conjunctive queries as views that describe sources contents. As in the Bucket Algorithm [20], relevant sources are selected per query subgoal, i.e., triple pattern.

This section describes the translation of SPARQL update operations into XQuery ex- pressions using the XQuery Update Facility. We present how similar methods and al- gorithms