pdf

(1)

Databases and Information Systems

BalticDB&IS‘2012

(2)

(3)

Lithuanian Academy of Sciences Vilnius University Lithuanian Computer Society

Databases and Information Systems

Tenth International Baltic Conference on Databases and Information Systems Local Proceedings, Materials of Doctoral Consortium

Vilnius, Lithuania July 8-11, 2012

Edited by Albertas Čaplinskas Vilnius University Lithuania Gintautas Dzemyda Vilnius University Lithuania Audronė Lupeikienė Vilnius University Lithuania Olegas Vasilecas Vilnius Gediminas Technical University Lithuania

Vilnius 2012

(4)

UDK 004.65(474.5)(06) Da372

Databases and Information Systems. Tenth International Baltic Confe- rence on Databases and Information Systems. Local Proceedings, Materials of Doctoral Consortium. A. Čaplinskas, G. Dzemyda, A. Lupei- kienė, O. Vasilecas (Eds.). Vilnius: Žara, 2012, 264 p.

This book consists of three independent parts. The first part contains the short communications and accepted full research papers presented at the 10th Jubilee International Baltic Conference on Databases and Information Systems (BalticDB&IS’2012) but not included in the post-proceedings published by IOS Press Frontiers in Artificial Intelligence and Applications series. The second part contains the abstracts of the papers published in post-proceedings.

Finally, the third part contains the materials of Doctoral Consortium. Short communications and full research papers discuss the wide spectrum of research topics including databases, data mining and optimisation in information systems, business modelling, cloud computing and e-service, software evaluation and testing, decision support systems, norms and reasoning, information system engineering tools and techniques, and advanced e-learning environments and technologies. In the materials of Doctoral Consortium doctoral students present the current state of their research.

Cover design: A. Gedgaudas

Typesetting: manuscripts were formatted by conference technical staff

ISBN 978-9986-34-274-8 © The authors named in the author index, 2012

(5)

ORGANISED BY

Lithuanian Academy of Sciences

Vilnius University,

Institute of Mathematics and Informatics Lithuanian Computer Society

Research Council of Lithuania Algoritmų Sistemos, Ltd.

Information Technologies Institute

(6)

(7)

vii

Conference Committee

Steering Committee

Juris BORZOVS, University of Latvia, Latvia Janis BUBENKO, Stockholm University, Sweden Albertas Č^APLINSKAS, Vilnius University, Lithuania Jānis Grundspeņķis, Riga Technical University, Latvia Hele-Mai HAAV, University of Technology, Estonia Ahto KALJA, Tallinn University of Technology, Estonia Mārīte K^IRIKOVA, Riga Technical University, Latvia AudronėLUPEIKIENĖ, Vilnius University, Lithuania

Arne SØLVBERG, Norwegian University of Science and Technology, Norway Olegas VASILECAS, Vilnius Gediminas Technical University, Lithuania

General Chair

Gintautas DZEMYDA, Vilnius University, Lithuania

Program Committee Co-Chairs

Albertas ČAPLINSKAS, Vilnius University, Lithuania

Olegas V^ASILECAS, Vilnius Gediminas Technical University, Lithuania

Publishing Co-Chair

Audronė L^UPEIKIENĖ, Vilnius University, Lithuania

Programme Committee

Irina ASTROVA, Tallinn University of Technology, Estonia Marko B^AJEC, University of Ljubljana, Slovenia

Romas BARONAS, Vilnius University, Lithuania Guntis BĀRZDIŅŠ, University of Latvia, Latvia András BENCZÚR, Eötvös Loránd University, Hungary Jānis B^IČEVSKIS, University of Latvia, Latvia

Mária BIELIKOVÁ, Slovak University of Technology in Bratislava, Slovakia Juris BORZOVS, University of Latvia, Latvia

Boštjan BRUMEN, University of Maribor, Slovenia Dumitru Dan B^URDESCU, University of Craiova, Romania Rimantas BUTLERIS, Kaunas University of Technology, Lithuania Vytautas ČYRAS, Vilnius University, Lithuania

Paweł CZARNUl, Gdansk University of Technology, Poland Valentina D^AGIENĖ, Vilnius University, Lithuania

Marlon DUMAS, University of Tartu, Estonia

Dalė DZEMYDIENĖ, Mykolas Romeris University, Lithuania

(8)

viii

Jutta ECKSTEIN, IT communication, Germany Johann E^DER, University of Vienna, Austria

Jānis GRUNDSPEŅĶIS, Riga Technical University, Latvia Saulius GUDAS, Kaunas University of Technology, Lithuania Remigijus GUSTAS, Karlstad University, Sweden

Hele-Mai H^AAV, University of Technology, Estonia

Zbigniew HUZAR, Wroclaw University of Technology, Poland Sylvia ILIEVA, Sofia University St. Kliment Ohridski, Bulgaria Mirjana IVANOVIĆ, University of Novi Sad, Serbia

Leonid KALINICHENKO, Institute of Informatics Problems of the Russian Academy of Science, Russia

Ahto KALJA, Tallinn University of Technology, Estonia Audris KALNIŅŠ, University of Latvia, Latvia

Mārīte K^IRIKOVA, Riga Technical University, Latvia Peep K^ÜNGAS, University of Tartu, Estonia

Olga KURASOVA, Vilnius University, Lithuania

Sergei KUZNETSOV, Institute for System Programming of the Russian Academy of Sciences, Russia

Marion LEPMETS, Public Research Centre Henri Tudor, Luxembourg Audronė LUPEIKIENĖ, Vilnius University, Lithuania

Rainer MANTHEY, University of Bonn, Germany Dorina M^ARGHESCU, Åbo Akademi University, Finland Saulius M^ASKELIŪNAS, Vilnius University, Lithuania

Mihhail MATSKIN, KTH Royal Institute of Technology, Sweden Raimundas MATULEVIČIUS, University of Tartu, Estonia Jürgen MÜNCH, University of Helsinki, Finland

Lina N^EMURAITĖ, Kaunas University of Technology, Lithuania Boris NOVIKOV, St-Petersburg University, Russia

Algirdas PAKŠTAS, London Metropolitan University, UK Jaan P^ENJAM, Tallinn University of Technology, Estonia Dana PETCU, West University of Timisoara, Romania

Ivan I. PILETSKI, Belarusian State University of Informatics and Radioelectronics, Belarus

Jaroslav P^OKORNÝ, Charles University in Prague, Czech Republic Henrikas P^RANEVIČIUS, Kaunas University of Technology, Lithuania Boris RACHEV, Technical University of Varna, Bulgaria

Jolita RALYTĖ, University of Geneva, Switzerland Rok RUPNIK, University of Ljubljana, Slovenia

Gunter S^AAKE, Otto von Guericke University of Magdeburg, Germany Simonas ŠALTENIS, Aalborg University, Denmark

Kurt SANDKUHL, University of Rostock, Germany

Michał Ś^MIAŁEK, Warsaw University of Technology, Poland

Darijus S^TRAŠUNSKAS, Norwegian University of Science and Technology, Norway Uldis SUKOVSKIS, Riga Technical University, Latvia

Rimantas VAICEKAUSKAS, Vilnius University, Lithuania Damjan VAVPOTIČ, University of Ljubljana, Slovenia

Irina VIRBITSKAITE, A.P. Ershov Institute of Informatics Systems, Russia Robert WREMBEL, Poznań University of Technology, Poland

Janusz ZALEWSKI, Florida Gulf Coast University, USA

(9)

ix

Additional Reviewers

Natalia GARANINA, Russia Tarmo ROBAL, Estonia Nada BAJNAID, Saudi Arabia Riina MAIGRE, Estonia Arne KOSCHEL, Germany Ateeq KHAN, Germany Kristina S^MILGYTĖ, Lithuania Norbert S^IEGMUND, Germany Lovro ŠUBELJ, Slovenia Azeem LODHI, Germany Milos RADOVANOVIC, Serbia Iwona DUBIELEWICZ, Poland Tomaž H^OVELJA, Slovenia Lech TUZINKIEWICZ, Poland Simon V^RHOVEC, Slovenia Bogumiła H^NATKOWSKA, Poland

Doctoral Consortium Advising Committee

Coordinator

Prof. Dalė D^ZEMYDIENĖ, Mykolas Romeris University, Vilnius University Institute of Mathematics and Informatics, Lithuania

Advisors:

Prof. Jānis GRUNDSPEŅĶIS, Riga Technical University, Latvia Prof. Hele-Mai HAAV, University of Technology, Estonia Prof. Mārīte K^IRIKOVA, Riga Technical University, Latvia

Prof. Henrikas PRANEVIČIUS, Kaunas University of Technology, Lithuania

Coordinators of Special Sections

Olga KURASOVA, Vilnius University (Section: Data Mining and Optimisation in IS) Valentina DAGIENĖ, Vilnius University (Section: Advanced e-Learning

Environments and Technologies)

Local Organising Committee Chair

Saulius MASKELIŪNAS, Lithuanian Computer Society

Local Organising Committee

Bronius JASKELEVIČIUS, Litnuanian Academy of Sciences Viktor MEDVEDEV, Vilnius University

Jolanta MILIAUSKAITĖ, Vilnius University Laima PALIULIONIENĖ, Vilnius University Sandra SVANIDZAITĖ, Vilnius University Marius ŠAUČIŪNAS, Vilnius University

Dalia Š^UKVIETIENĖ, Lithuanian Computer Society Aidas ŽANDARIS, Lithuanian Computer Society

(10)

(11)

xi

Preface

The series of Baltic DB&IS biennial conferences was started in 1994. The first conference was held in Trakai, Lithuania. It was initiated by Prof. Janis A. Bubenko Jr.

(Royal Institute of Technology, Stockholm, Sweden) and Prof. Arne Sølvberg (Norwegian University of Science and Technology, Norway) and organised by Institute of Mathematics and Informatics (Lithuania), Vilnius University (Lithuania), Vilnius Technical University (Lithuania). This time was a hard time for Baltic research community. After the crash of Soviet Union the cooperation with the research centres in Moscow, Leningrad, Kiev, and Novosibirsk was broken. The horizontal cooperation between research centres in Lithuania, Latvia and Estonia was weak developed. We had almost no touch with the research centres in West countries and even in East European countries. The idea of Janis A. Bubenko jr. and Arne Sølvberg to organise Baltic Workshops on databases and information systems was indeed a great idea. It helped to consolidate research communities in Baltic countries, to establish contacts with research centres in West and East European countries, and even to renew contacts with the researchers in Russia. These workshops have become a real incentive for development of regional research and boost of international cooperation.

Janis A. Bubenko jr. and Arne Sølvberg helped also to receive founding and even trained us in West European conference organization technologies. Despite the fact that Tallinn, Riga and Vilnius had big experience in organizing large soviet conferences, this experience was almost not applicable in organizing conferences according to West standards. Besides, the time was very hard. It is enough to mention that during the conference in Trakai the conference venue and diner room were protected by armed safeguard.

In year 2000, the international workshop was transformed into international conference. Two years earlier, the selected papers of the conference started to be published by international publishers. The detail statistic data are presented in the table below.

The analysis of this table shows not only positive but also some negative trends.

Starting from year 2008, the geography of participants contracted significantly. It seems that it happened not only for reasons of world-wide financial crisis, but also for changes in state scientific policies. In our opinion, the trend to ignore international conferences and to accent publications only in high-quality international scientific journals is not perspective and even malign.

Another negative trend is the permanently enlarged gap between research communities and industry. If first conferences had significant industrial tracks and intensive round table discussions, in three recent conferences the participation of industry partners is less than minimal. In our opinion, this is the evidence that, at least in Baltic countries, the state policy oriented to encourage the cooperation among research centres and industry is not effective. Despite these negative trends, we remain optimistic and believe that Baltic DB&IS biennial conferences reborn for its new life.

In the final, we express our warmest thanks to all authors who contributed to the 10th Jubilee Conference, to the members of international Programme Committee, additional reviewers, and to the organizing team. We also express our very special thanks and deep gratitude to all our sponsors. Last, but not the least, we also thank all the participants of the conference.

(12)

xii

PC Submissions Year/

Venue/

Status

Proceedings

Number of members Number of countries Number of submissions Number of countries Accepted

1994 Trakai Workshop

Local 16 5 11 43 (Australia-1, Belgium - 1, Estonia-+ , Finland-2, France- 3, Latvia-5, Lithuania-7, Norway-3, Sweden-10, UK-3, USA-2)

1996 Tallinn Workshop

Local 25 14 55 19 36 (Australia-3.5, Estonia-5, Finland-3.5, France-5,

Germany-2.75, Israel-0.25, Latvia-7, Lithuania-1, Malaysia-1, Netherlands – 1, Russia-1, Sweden-3,

Switzerland-1, UK-3, USA-1) 1998

Riga Workshop

Local 31 16 46 17 39 (Australia-0.5, Brasil-1, Denmark-1, Estonia-1, Finlandia-4.5, France-1, Germany-2, Hungary-1, Korea-1, Latvia-10, Lithuania- 5.5, Malaysia-1, Netherlands- 1, Poland-3, Sweden-1, UK- 1.5, USA-1)

2000 Vilnius Workshop

Local + Kluwer Academic Publishers (selected papers)

37 18 60 39 (Belarus-1, Brasil-2, Estonia-5, Finland-4, France- 1, Germany-9.33, Japan-1, Latvia-5.8, Lithuania-14, Norway-1, Poland-5, Russia-3, Spain-1, Sweden-0.33, UK- 2.03, Ukraine-1, USA-2.5, Yugoslavia-1)

2002/

Tallinn Conference

Local + Kluwer Academic Publishers (selected papers)

33 18 60 23 41 (Australia-0.5, Czech Republic-1, China-1, Denmark-2, Estonia-11, Germany-7, Italy-1, Japan-1, Latvia-11, Lithuania-3, New Zealand-1, Norway-1.5, Poland-3, Portugal-1, Russia- 3, Spain-1, Sweden-1, UK-1, USA-1)

2004 Riga Conference

Local + IOS Press (selected papers)

35 14 59 17 35 (Czech Republic-3, Estonia-3, Finland-1, France- 1, Germany-5.5, Latvia-4, Lithuania-4, New Zealand-1,

(13)

xiii

PC Submissions Year/

Venue/

Status

Proceedings

Number of members Number of countries Number of submissions Number of countries Accepted

Norway-1, Poland-3, Russia- 3.5, Spain-2, Sweden-1, UK-1, USA-1)

2006 Vilnius Conference

Local + IOS Press (selected papers) + IEEE (selected papers)

87 35 84 21 48 (Australia-1, Austria-2, Belgium-0.3, Brazil-0.5, China-1, Denmark-0.5, Estonia-2, France-5,

Germany-2, Italy-1.5, Latvia- 17, Lithuania-11.8, Mexico- 0.5, Netherlands-0.7, New Zealand-2, Portugal-1, Russia- 1, Spain-0.5, UK-1.7, USA-1) 2008

Tallinn Conference

43 22 43 12 29 (Austria-1, Estonia-8, Finland-1, Germany-2, Latvia- 13, Lithuania-5, Poland-1, Russia-1, UK-1)

2010 Riga Conference

51 15 36 (Austria-0.5, Estonia-6.5, Germany-2.5, Latvia-17, Lithuania-2, Poland-1, Russia- 0.5, Sweden-1)

2012 Vilnius Conference

64 23 69 14 43 (Algeria-1, Canada-1, Estonia-6, France-3, Germany-3, Latvia-17, Lithuania-9.5, Russia-3, Slovenia-2, Switzerland-0.5)

July 2012 Albertas Čaplinskas

Olegas Vasilecas

(14)

(15)

xv

Part 1. Research Papers

and

Short Communications

(18)

(19)

3

Workload-based Heuristics for Evaluation of Physical Database Architectures

Andreas LÜBCKEâ, Martin SCHÄLERâ, Veit KÖPPENâand Gunter SAAKEâ

aSchool of Computer Science,

Otto-von-Guericke-University Magdeburg, Germany, {andreas.luebcke,martin.schaeler,veit.koeppen,gunter.saake}@ovgu.de

Abstract. Database systems are widely used in different application domains.

Therefore, it is difficult to decide which database management system meets the requirements of a certain application at most. This observation is also true for scientific and statistical data management, due to new application and research fields.

New requirements are often implied to data management while discovering un- known research and applications areas. That is, heuristics and tools do not exist to select an optimal database management system. In previous work, we proposed a decision framework based on application workload analyses. Our framework supports application performance analyses by mapping and merging workload information to patterns. In this paper, we present heuristics for performance estimation to select an optimal database management system for a given application. We show that these heuristics improve our decision framework by complexity reduction with- out loss of accuracy.

Keywords.Heuristics, storage architecture, design, performance, query processing

Introduction

Database systems (DBS) are pervasively for almost each branch of business activity.

Therefore, DBS have to manage different requirements for heterogeneous application domains. New data management approaches are developed (e.g., NoSQL-DBMSs [9,14], MapReduce [12,13], Cloud Computing [3,16,7], etc.) to make the growing amount of data¹ manageable for special application domains. We argue, these approaches are developed for special applications and need a high degree of expert knowledge for usage, administration, and optimization. However, we focus our observations to relational database management systems (DBMSs) in this paper. Relational DBMSs are commonly used DBS for highly diverse applications and besides relational DBMS are well-know to many IT-affine people.

Relational DBMSs²are developed to manage data of daily business and reduce paper trails of companies (e.g., finance institutions) [2]. This approach dominates more and more the way of data management that we today know as online transaction processing (OLTP). Nowadays, faster and more accurate forecasts for revenues and expenses are not enough anymore. A new application domain evolves that focuses on analysis of data to support business decisions. Codd et al. [8] defines this type of data analysis as

1Consider the data explosion problem [22,26].

2In the following, we use the term DBMS synonymously for relational DBMS.

(20)

4 A. Lübcke et al. / Heuristics for Evaluation of Physical DB-Architectures

online analytical processing (OLAP). Consequently, two disjunctive application domains for relational data management exist with different scopes, impacts, and limitations (cf.

Section 1).

In recent years, business application have a high demand for solutions that support tasks from both OLTP and OLAP [15,21,28,31,33,34], thus coarse heuristics for typi- cal OLTP and/or OLAP applications have become obsolete (e.g., data warehouses with- out updates always perform best on column-oriented DBMSs). Nevertheless, new approaches, which we mention above, also show impacts and limitations (e.g., in-Memory- DBMS only, focus on real-time or dimension updates, etc.), such that we argue there is no DBMS that fits for OLTP and OLAP in all application domains. Heuristics and current approaches for physical design and query optimization only consider a certain architecture³(e.g., design advisor [36] and self-tuning [10] for row-oriented DBMSs or equivalent for column-oriented DBMSs [19,30]). That is, the decision for a certain architecture has to be done beforehand. Consequently, there is no approach that neither advices physical design spanning different architectures for OLTP, OLAP, and mixed OLTP/OLAP workloads nor that estimates which architecture is optimal to process a query/database operation.

However, we refine obsolete heuristics for physical design of DBS (e.g., heuristics from classical OLTP domain). We consider OLTP, OLAP, and mixed application domains for physical design. We present heuristics that propose the usage of row-oriented DBMSs (row stores) or column-oriented DBMSs (column stores) under certain circumstances. Furthermore, we present heuristics for query execution or rather for processing (relational) database operations on column and row stores. Our heuristics show which query type and/or database operation performs better on a particular architecture⁴ and how single database operations affect performance of a query or a workload. We derive our heuristics from experiences in workload analyses with the help of our decision model, presented in [23].

In the following sections, we consider the main differences between row- and column store as well as the advantages and disadvantages for column stores. Section 2 ad- dresses our heuristics for physical design of DBMSs concerning different storage architectures (i.e., row or column store). In Section 3, we present heuristics for query processing on row and column stores. Section 4 gives an overview of related research. Finally, we summarize our discussions and give an outlook in Section 5.

1. Column or Row Store: Assets and Drawbacks

In previous work, we already discussed differences between column and row stores according to different subjects [25,24,23]. We can summarize our major observations in Table 1. Of course, column stores architecture also has disadvantages. We just name them because they are mostly contradictory to row store architecture advantages. Column stores perform worse on update operations and concurrent non-read-only data access due to partitioned data, thus on frequent updates and consistency checks tuple reconstructions cause notable cost.

Finally, row stores can outperform column stores in their traditional application domain nor vice versa. Other researchers confirm our consideration that one architecture

3We use the term architecture synonymously for storage architecture.

4The term architecture refers to row- and column-oriented database architecture.

(21)

A. Lübcke et al. / Heuristics for Evaluation of Physical DB-Architectures 5

cannot sustain the other architecture in their native domain (e.g., by simulating architecture through partitioned schema [4]).

Table 1. Advantages of Column Stores

Requirement CS Property Comment

Disc space Reduced Aggressive compression

(e.g., optimal compression per data type)

Data transfer Less More data fits in main memory

Data processing Compressed and decompressed Does not work for each compression nor for all operations

OLAP I/O No Neither for aggregations nor column operations

Parallelization For inter- and intra-query Not for ACID-transaction with write operations

Vector operations Fast Easily adaptable

2. Heuristics on Physical Design

Physical design of DBSs is important as long as DBMSs exist. We do not only restrict our considerations to a certain architecture. Further, we consider a set of heuristics that can be used to forecast which architecture is more suitable for a given application.

Some existing rules still have their validity. First, pure OLTP applications perform best on row stores. Second, classic OLAP application with an ETL (extract, transform, and load) process or (very) rare updates are satisfied with column stores⁵. In the following, we consider a more exciting question. In which situation one architecture outperforms the other one and in which case they perform nearly equivalent.

OLTP For OLTP workloads, we just recommend to use row stores as we do it for decades. A column store does not achieve competitive performance except column-store architecture will significantly change.

OLAP In this domain one might suspect a similar situation as for OLTP workloads.

However, this is not true in general. We are aware that column stores outperform row stores for many applications and/or queries in this domain; that is, for aggregates and access as well as processing of a few columns. In most cases, column stores are most suitable for applications in this domain. Nevertheless, there exist complex OLAP queries where column stores lose their advantages (cf. Section 1). For these complex queries, row stores can achieve competitive results, even if they consume more memory. These queries have to be considered for architecture selection because they critically influence the physical design estimation even more if there is a significant amount of these queries in workload.

OLTP/OLAP In these scenarios, your physical design strongly depends on the ratio between updates, point queries, and analytical queries. Our experience is that column stores perform about 100-times slower on OLTP-transactions (updates, inserts, etc.) than row stores. This fact is even worse because we do not even consider concurrency (e.g., ACID); that is, we make this observation in single-user execution on transactions. As- suming transaction and analytical queries in average take the same time, we state that one transaction only occurs every 100 queries (OLAP). The fact that analytical queries

5Note: Not all column stores support updates just ETL.

(22)

last longer than a single iteration leads us to a smaller ratio. Our experience shows that 10 executions of analytical queries on a column store are of greater advantage than the loss by a single transaction. If you have a smaller ration than 10:1 (analyses/Tx) then we cannot give a clear statement. We recommend using a row store in this situation or you know beforehand that the ratio will change to more analytical queries. If the ratio falls under this ratio for a column store and is not a temporary change then a system change is appropriate. So, in mixed workloads it is all about the ratio of analytical queries to transactions. Note that the ratio 100:1 and 10:1 can change considering OLAP-query type.

We address this issue in Section 3.

Our heuristics can be used as a guideline for architecture decisions for certain applications. That is, we select the most suitable architecture for an application and afterwards use existing approaches (like IBM’s advisor [36]) to tune physical design of a certain architecture. If workload and DBSs are available for analysis, we emphasize to use our decision model [23] to calculate an optimal architecture for a defined workload. The presented heuristics for physical design extent our decision model to reduce calculation costs (i.e., solution room is pruned). Additionally, heuristics make our decision model available for scenarios where only restricted information is available.

3. Heuristics on Query Execution

In the following, we present heuristics for query execution on complex or hybrid DBS.

We assume a DBS that supports column- and row-store functionality or a complex environment with at least two DBMSs containing data redundant whereby at least one DBMS is setup for each OLTP and OLAP. In the following, we only discuss query processing heuristics for OLAP and OLTP/OLAP workloads due to our assumptions above and the fact that there is no competitive alternative to row stores for OLTP.

OLAP In this domain, we face a huge amount of data that generally is not frequently updated. Column stores are able to significantly reduce the amount of data due to aggressive compression. That is, more data can be loaded into main memory and I/O between storage and main memory is reduced. We state that this I/O reduction is the major benefit of column stores. We made the experience that row stores perform worse on many OLAP queries because row stores drop performance due to fact that CPUs often are idle while waiting for I/O from storage. Moreover, row stores read data that is not required due to the physical design of data. In our example from TPC-H benchmark [32], one can see that only a few columns of the lineitem relation have to be accessed (cf. Listing 1). In contrast to column stores, row stores have to access the complete relation to answer this query. We state, most OLAP query fit into this pattern. We recommend using column store functionality to answer this type of OLAP queries as long as they only access a minority of columns from relation for aggregation and predicate selection.

1 f r o m lineitem

2 w h e r e l_shipdate >= d a t e ’1994-01-01’

3 and l_shipdate < d a t e ’1994-01-01’ + i n t e r v a l ’1’ y e a r 4 and l_discount b e t w e e n .06 - 0.01 and .06 + 0.01 5 and l_quantity < 24;

Listing 1.TPC-H query Q6

(23)

Complex Queries OLAP is often used for complex analyses, thus queries become more complex too. These queries describe complex issues and/or produce large business re- ports. Our example (Listing 2) from the TPC-H benchmark could be part of a report (or a more complex query). Complex queries access and/or aggregate many tuples, that is, nearly the complete relation has to be read. This implies a number of tuple reconstructions that significantly reduce the performance of column stores. Hence, row stores can achieve competitive performance because nearly the complete relations have to be accessed. Our example shows another reason for a number of tuple reconstructions: group operations. Tuples have to be reconstructed before aggregating groups. Other reasons for significant performance reduction by tuple reconstructions can be a large number of predicate selections on different columns as well as complex joins. We argue, this complex OLAP query type can be executed on both architectures. This fact can be used to load balance queries in complex environments and hybrid DBMSs.

6 s e l e c t c_custkey, c o u n t(o_orderkey) f r o m

7 customer l e f t o u t e r j o i n orders on c_custkey = o_custkey 8 and o_comment n o t l i k e ’%special%request%’

9 g r o u p by c_custkey) a s c_orders (c_custkey, c_count) 10 g r o u p by c_count

11 o r d e r by custdist d e s c, c_count d e s c;

Listing 2.TPC-H query Q13

Mixed Workloads In mixed workload environments, our first recommendation is to split the workload into two parts OLTP and OLAP. Both parts can be allocated to the corresponding DBS with row- or column-store functionality. As we mentioned above, this split methodology can also be used for load balancing if one DBS is too busy. With this approach, we achieve competitive performance for both OLAP and OLTP because, according to our assumptions, complex systems have to have at least one DBMS of each architecture. Further, we state that a future hybrid system has to satisfy both architecture, too. Processing mixed workloads with our split behavior has two additional advantages.

First, we can reduce issues with complex OLAP queries by correct allocation or dividing over both DBS. Second, we can consider time-bound parameters for queries.

As mentioned above, two integration methods for the query-processing heuristics are available. First, we propose the integration into a hybrid DBMSto decide where to optimally execute a query. That is, heuristics enable rule-based query optimization for hybrid DBMS as we know from row-store optimizer [17]. Second, we propose a global manager on top of complex environments to optimally distribute queries (as known from distributed DBMSs [27]). Our decision model analyzes queries and decides where to distribute queries to. Afterwards, queries are locally optimized by DBMS itself. Note, time-bound requirements change over time and system design determines how up-to- date data is in OLAP system (e.g., real-time load [31] is available or not). We state that such time-bound requirements can be passed as parameters to our decision model. That is, even OLAP queries can be distributed to OLTP part of the complex environment if data in OLAP part is insufficiently updated or vice versa we have to ensure analysis on most up-to-date data. Additionally, such parameters can be used to alternatively allocate high-priority queries to another DBS part if the query has to wait otherwise.

(24)

4. Related Work

Several approaches are developed to analyze and classify workloads (e.g., [18,29]).

These approaches recommend tuning and design of DBS, try to merge similar tasks, etc.

to improve performance of DBSs. To the present, workload-analysis approaches are limited to classification of queries to execution pattern or design estimations for a certain architecture; that is, the solution space is beforehand pruned analysis and performance estimations are done. This results in information loss due to an inappropriate reduction.

With our decision model and heuristics, we propose an approach that is independent from architectural issues.

Due to success in the analytical domain, researchers devote more attention on column stores whereby focused on analysis performance [1,19] or to overcome update- problems with separate storages [15,1]. However, a hybrid system in an architectural manner does not exist. In-memory-DBMSs are developed in recent years (e.g., Hy- Per [21]) that satisfy requirements for mixed OLTP/OLAP workloads. Nevertheless, we state that even today not all DBSs can run in-memory (due to monetary or environmental constraints), thus we propose a more general approach.

Recommending indexes [5,10], materialized views [10], configurations [20], or physical design in general [6,37,36] are in focus of researchers. However, all approaches are limited to certain DBMSs or at least one architecture to the best of our knowledge.

Our approach is situated on top of these design and tuning approaches. That is, we support a first coarse granular design and tuning for a global view on hybrid systems and utilize existing approaches to optimize locally.

In the literature, researchers compare the performance of column and row stores considering certain scenarios. Cornell and Yu [11] focus on disc-access minimization for transaction to improve query execution time. Abadi et al. [4] compare different approaches for column-oriented storage concerning analytical queries. We state, current applications demand for a combined observations of OLTP and OLAP scenarios.

In line with Zukowski et al. [35], we observe benefits to convert from NSM to DSM and vice versa during query processing. In contrast to Zukowski et al., we do not only focus on CPU trade-offs caused by tuple reconstructions. Our approach is used for both scenarios and considers overall benefit of global and local tuning.

5. Conclusion

In recent years, many new approaches as well as new requirements encourage performance of DBMSs in many application domains. Nevertheless, new approaches and requirements also increase complexity of DBMS selection, DBS design, and tuning for a certain application. We focus our considerations on OLTP, OLAP, and OLTP/OLAP workloads in general. That is, we considered which DBMS is most suitable.

Therefore, we present heuristics on design estimations and query execution for both relational storage architectures row and column stores. We use our decision model [23]

to observe the performance of several applications. Consequently, we derive heuristics from our experiences while evaluating our decision model. We present these heuristics for DBS design to a priori select the most suitable DBMS and to tune this system afterwards. Our approach avoids misleading tuning if the architecture selection is wrong. Fur- thermore, we present heuristics on query execution for both storage architectures. On the one hand, we want to emphasize our heuristics for physical design. On the other hand,

(25)

we propose these heuristics for integration in hybrid DBS whether it is a real hybrid DBMS or it is a complex system that consists of different DBMSs (at least one row and one column store). Summarizing, there is no alternative in OLTP environments to row stores, for OLAP applications we recommend to use column stores taking into account that a number of complex OLAP queries can change this recommendation, and finally in mixed OLTP/OLAP environments the focus is on the ratio of OLAP queries to OLTP transactions (again taking very complex OLAP queries into account). Our heuristics are a first step to rule-based query optimization in hybrid systems/architectures.

In future work, we will evaluate our heuristics considering standard benchmark to achieve meaningful results. After evaluation, we will implement the heuristics in our decision model, thus we achieve a design advisor for both relational storage architectures.

Finally, we plan to implement a hybrid DBMSs using our heuristics for ruled-based query optimization.

Acknowledgements

This paper has been funded in part by the German Federal Ministry of Education and Science (BMBF) through the Research Program under Contract FKZ: 13N10817.

References

[1] Daniel J. Abadi. Query execution in column-oriented database systems. PhD thesis, Cambridge, MA, USA, 2008. Adviser: Madden, Samuel.

[2] Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim Gray, Patricia P.

Griffiths, W. Frank King III, Raymond A. Lorie, Paul R. McJones, James W. Mehl, Gianfranco R.

Putzolu, Irving L. Traiger, Bradford W. Wade, and Vera Watson. System R: Relational Approach to Database Management.ACM Trans. Database Syst.1(2) (1976), 97–137.

[3] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwin- ski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Above the Clouds:

A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, Feb 2009.

[4] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-stores vs. row-stores: How different are they really? InSIGMOD’08(2008), 967–980.

[5] Nicolas Bruno and Surajit Chaudhuri. To tune or not to tune? A lightweight physical design alerter. In VLDB’06, VLDB Endowment (2006), 499–510.

[6] Nicolas Bruno and Surajit Chaudhuri. An online approach to physical design tuning. InICDE’07, IEEE (2007), 826–835.

[7] Rajkumar Buyya, Chee Shin Yeo, and Srikumar Venugopal. Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities. InHPCC(2008), 5–13.

[8] Edgar F. Codd, Sally B. Codd, and Clynch T. Salley. Providing OLAP to User-Analysts: An IT Mandate.

Ann ArborMichigan(1993), 24.

[9] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A Distributed Storage System for Struc- tured Data. InOSDI(2006), 205–218.

[10] Surajit Chaudhuri and Vivek Narasayya. Self-tuning database systems: A decade of progress. In VLDB’07(2007), 3–14.

[11] Douglas W. Cornell and Philip S. Yu. An effective approach to vertical partitioning for physical design of relational databases.Trans. Softw. Eng.16(2) (1990), 248–258.

[12] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI(2004), 137–150.

[13] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters.Commun.

ACM51(1) (2008), 107–113.

[14] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. InSOSP(2007), 205–220.

(26)

[15] Clark D. French. Teaching an OLTP database kernel advanced datawarehousing techniques. InICDE’97 (1997), 194–198.

[16] Ian T. Foster, Yong Zhao, Ioan Raicu, and Shiyong Lu. Cloud Computing and Grid Computing 360-Deg- ree Compared.CoRR, abs/0901.0131, 2009.

[17] Goetz Graefe and David J. DeWitt. The EXODUS Optimizer Generator. InSIGMOD’87(1987), 160–

172.

[18] Marc Holze, Claas Gaidies, and Norbert Ritter. Consistent on-line classification of DBS workload events. InCIKM’09(2009), 1641–1644.

[19] Stratos Idreos. Database Cracking: Torwards Auto-tuning Database Kernels. PhD thesis, 2010.

[20] Eva Kwan, Sam Lightstone, K. Bernhard Schiefer, Adam J. Storm, and Leanne Wu. Automatic database configuration for DB2 Universal Database: Compressing years of performance expertise into seconds of execution. InBTW’03, GI (2003), 620–629.

[21] Alfons Kemper and Thomas Neumann. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. InICDE’11(2011), 195–206.

[22] Henry F. Korth and Abraham Silberschatz. Database Research Faces the Information Explosion.Com- mun. ACM40(2) (1997), 139–142.

[23] Andreas Lübcke, Veit Köppen, and Gunter Saake. A Decision Model to Select the Optimal Storage Architecture for Relational Databases. InProceedings of the Fifth IEEE International Conference on Research Challenges in Information Science, RCIS(2011), 74–84.

[24] Andreas Lübcke and Gunter Saake. A Framework for Optimal Selection of a Storage Architecture in RDBMS. InDB&IS(2010), 65–76.

[25] Andreas Lübcke. Challenges in Workload Analyses for Column and Row Storess. InGrundlagen von Datenbanken(2010).

[26] Ina Naydenova and Kalinka Kaloyanova. Sparsity Handling and Data Explosion in OLAP Systems. In MCIS(2010), 62–70.

[27] M. Tamer Özsu and Patrick Valdurie.Principles of Distributed Database Systems. Springer, 3rd edition, 2011.

[28] Hasso Plattner. A common database approach for OLTP and OLAP using an in-memory column database. InSIGMOD’09, ACM (2009), 1–2.

[29] Kimmo E. E. Raatikainen. Cluster Analysis and Workload Classification. SIGMETRICS Performance Evaluation Review20(4) (1993), 24–30.

[30] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. C-Store: A column-oriented DBMS. InVLDB’05, VLDB Endowment (2005), 553–564.

[31] Ricardo Jorge Santos and Jorge Bernardino. Real-time data warehouse loading methodology. In IDEAS’08(2008), 49–58.

[32] Transaction Processing Performance Council. TPC BENCHMARK^{T M} H. White Paper, April 2010.

Decision Support Standard Specification, Revision 2.11.0.

[33] Alejandro A. Vaisman, Alberto O. Mendelzon, Walter Ruaro, and Sergio G. Cymerman. Supporting dimension updates in an OLAP server.Information Systems29(2) (2004), 165–185.

[34] Youchan Zhu, Lei An, and Shuangxi Liu. Data Updating and Query in Real-Time Data Warehouse System. InCSSE’08(2008), 1295–1297.

[35] Marcin Zukowski, Niels Nes, and Peter Boncz. DSM vs. NSM: CPU performance tradeoffs in block- oriented query processing. InProceedings of the 4th international workshop on Data management on new hardware, DaMoN’08, ACM (2008), 47–54.

[36] Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy M. Lohman, Adam J. Storm, Christian Garcia-Arellano, and Scott Fadden. DB2 Design Advisor: Integrated automatic physical database design. InVLDB’04, VLDB Endowment (2004), 1087–1097.

[37] Daniel C. Zilio, Calisto Zuzarte, Sam Lightstone, Wenbin Ma, Guy M. Lohman, Roberta Cochrane, Hamid Pirahesh, Latha S. Colby, Jarek Gryz, Eric Alton, Dongming Liang, and Gary Valentin. Recom- mending materialized views and indexes with IBM DB2 Design Advisor. InICAC’04(2004), 180–188.

(27)

11

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Wendy OSBORN^a,¹ and Saad ZAAMOUT^a

aDepartment of Mathematics and Computer Science, University of Lethbridge, Lethbridge, Alberta T1K 3M4, Canada

Abstract. In this paper, we present our strategy for distributed spatial query optimization that involves multiple sites. Previous work in the area of distributed spatial query processing and optimization focuses only on strategies for performing spatial joins and spatial semijoins, and distributed spatial queries that only involve two sites. We propose a strategy for optimizing a distributed spatial query using spatial semijoins that can involve any number of sites in a distributed spatial database. In this initial work we focus on minimizing the data transmission cost of a distributed spatial query by identifying and initiating semijoins from the smaller relations in order to reduce the larger relations and minimize the cost of data transmission. We compare the performance of our strategy against the naïve approach of shipping entire relations to the query site. We find that our strategy minimizes the data transmission cost in all cases, and significantly in specific situations.

Keywords. Spatial data, distributed spatial queries, optimization, performance, data transmission cost

Introduction

A distributed spatial database system [12] consists of several spatial database sites that are dispersed geographically. Each site manages its own collection of spatial data, but work collectively for processing inter-site data requirements. An important requirement of a distributed spatial database is the ability to efficiently process a query that requires spatial data from multiple sites. Historically, research in distributed relational databases focused on generating query execution plans that minimized the cost of data transmission over the network [3, 4, 11]. However, spatial data is more complex than alphanumeric data. This increases the complexity of joining spatial relations.

Therefore, CPU and I/O costs should also be considered when processing a distributed spatial query [12].

Most existing strategies that process a distributed spatial query only work for two sites. The one exception to this does not handle spatial joins. Therefore, this preliminary work begins to address this shortcoming by propose a strategy for processing and optimizing a distributed spatial query that handles more than two sites.

Our strategy focuses on minimizing the data transmission cost of a query by applying spatial semijoins in a cost-effective manner. An evaluation of our strategy shows

1Corresponding Author

(28)

W. Osborn, S. Zaamout / Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins 12

reduction in the data transmission cost over the naïve approach (i.e. the approach that involves shipping entire relations directly to the query site). In specific situations, this reduction in cost is significant.

1. Background and Related Work

Most research in distributed spatial query processing focuses on spatial join algorithms, spatial semijoins algorithms, and the use of Bloom filters for processing distributed spatial queries. A spatial join [12] takes two relations R and S, each with a spatial attribute, and relates pairs of tuples between R and S based on a spatial predicate that is applied to the spatial attribute values. Examples of spatial predicates include the overlap, containment, and adjacency of two spatial attributes. A spatial semijoin [2] is performed by projecting the spatial attribute from one relation, transmitting it to the site that contains the other spatial relation, and performing a spatial join of the spatial projection and relation. Then, the qualifying tuples from second site are shipped back to the first site and joined with the spatial relation on that site. A Bloom filter [1] is a hashed bit array that provides a compact but imprecise representation of the values of a joining attribute. A ’1’ bit represents the possible existence of a joining attribute value, while a ’0’ bit represents the absence of the value.

With the exception of [6], all proposed strategies for processing distributed spatial queries work for a two-site database only. We summarize these works below.

1.1. Spatial Join

A significant majority of spatial join algorithms are designed for a centralized system [7]. This section focuses on spatial join strategies for distributed spatial queries.

Kang et al. [8] propose a parallel spatial join strategy that is adapted to a distributed spatial database environment. Their strategy has two phases: data redistribution, and filter and refinement. In the data redistribution phase, on each site, the space that contains objects is partitioned into regions (i. e. buckets). A subset of regions is transmitted between servers so that each server has the same regions for both data sets. This subset is chosen by estimating which will result in the lowest overall response time (although it is unclear if I/O costs are considered). Then, the filter and refinement phase is carried out on both sites by performing a spatial join. An experimental evaluation shows that the parallel spatial join technique has a significantly faster response time – up to a 33% improvement over a semijoin-based strategy.

1.2. Spatial Semijoin

Abel et al. [2, 13] propose a spatial semijoin operator that combines the conventional semijoin operator with the filter stage of spatial query processing in order to reduce the data transmission, I/O and CPU costs. Their work explores two adaptations of the spatial semijoin. In the first, a “projection” of a set of MBRs from one spatial relation is transmitted to the second site and applied to the other relation using a spatial join. In the second, the “projection” is a single-dimensional mapping that represents the objects in each relation. A performance evaluation between these approaches shows that 1) for datasets with very large spatial descriptions, both strategies perform the same, 2) for

(29)

datasets with smaller spatial descriptions, a semijoin that uses single-dimensional mapping works best, 3) using the R-tree for retrieving MBRs incurs significant CPU costs, and 4) single-dimensional mapping causes more false drops than MBRs.

Karam and Petry [10] propose a spatial semijoin, which differs from [2, 13] in that MBRs from different levels of the R-tree are chosen for the spatial semijoin, instead of requiring that all come from the same level. A performance evaluation shows that their spatial semijoin outperforms the naïve spatial join (i. e. the whole relation is shipped to the other site for joining) when applied to real world data, but not when applied to randomly distributed rectangle sets. Limitation of their work are: 1) no comparison versus other strategies, 2) no consideration of CPU time.

1.3. Bloom Filters

Karam [9] propose a 2-dimensional bit-matrix approach for performing a semijoin of two relations that focuses on minimizing the data transmission, I/O and CPU costs. A 2-dimensional space is partitioned into equal-sized regions, with each region mapping to a bit in a 2-dimensional array. If a region contains objects, the corresponding bit is set to 1. This bit-matrix is transmitted to the site containing the other relation, and is applied by testing each region containing objects for the existence of a ’1’ bit in the bit- matrix. Any qualifying objects are sent to the first site. A performance evaluation shows that this approach shows the best improvement when applied to real world data.

Limitation of this work are: 1) an evaluation against the spatial join, and not versus a spatial semijoin, which in functionality is a closer match to the bit-matrix approach, and 2) no compression of the bit-matrix – a bit-matrix that contains many zeros is still transmitted in its entirety.

Hua et al. [6] propose the BR-tree, which is an R-tree that is augmented with Bloom filters to support exact-match queries. Each node entry contains 1) a minimum bounding rectangle (MBR) that approximates an object or a subset of objects, and 2) a Bloom filter that also represents one or more objects. In a leaf node, a Bloom filter is created by taking each object and producing k bits in the filter using different hash functions. In a non-leaf node, a Bloom filter is created by intersecting the Bloom filters in its child node. Although the BR-tree supports exact-match queries by using Bloom filters, it still requires the MBRs for region and point queries. A strategy for processing distributed region, point, and exact-match queries is proposed. The algorithm duplicates the root of every BR-tree across every site in the database. Any objects that pass the test against a root node is shipped to the site containing the original BR-tree.

This strategy works for any number of sites. A significant limitation is a lack of support for spatial joins.

2. Distributed Spatial Query Processing Strategy

In this section we present our algorithm for processing a distributed spatial query. The focus is to reduce the cost of data transmission over the network by using spatial semijoins. In the future we will also consider I/O and CPU costs.

(30)

2.1. Preliminaries

We apply spatial semijoins by shipping the smaller spatial attributes to other sites and applying them to the larger relations in order to eliminate a significant amount of data that will not participate in the final result. We utilize a modified version of the approximation based spatial semijoin that is proposed in [13].

In our implementation of the spatial semijoin, we use the following approach. We have two sites R and S that each contain a spatial relation. First, we obtain the projection of the spatial attribute from R. Then, the spatial relation is transferred to S and joined with the spatial relation on S. Our semijoins differs after this point. Instead of sending the qualifying tuples from S back to R for the final join, we send back to R the identifiers from the spatial projection, which are used to select tuples from the relation on R to ship to the final query site. In addition, the tuples on site S that qualified in the semijoin are also shipped to the query site. Using this semijoin strategy allows us to incorporate more than two sites when processing a distributed query.

We make the following assumptions in our work:

1. Every spatial object is represented using its minimum bounding rectangle (MBR).

2. Every site that participates in the distributed spatial query has one spatial relation. If a site contains other relations that are required for the query, it is assumed that all local processing has taken place and one spatial relations remains.

3. All spatial relations have one spatial attribute.

4. All objects (and corresponding MBRs) in all spatial attributes are drawn from the same “spatial domain” (i.e. same region of space).

5. The cardinality of each spatial attribute is equal to the number MBRs in the relation. That is, we assume that all MBRs in a spatial attribute are distinct.

6. The number of sites participating in the distributed spatial query is a multiple of two. The reason for this will be made clear when the algorithm is presented.

7. The spatial attribute for every spatial relation is already indexed by an R-tree (or a similar index that places the minimum bounding rectangles for all objects in its leaf level).

2.2. The Algorithm

Given n sites that will be participating in processing a distributed query, where each site has one spatial relation, our strategy has four main steps:

1. sorting and grouping by spatial attribute cardinality, 2. transmission of spatial attributes,

3. semijoin execution,

4. transmission of qualifying tuples to query site for the final join and processing.

Each step is described next. First, the sites will be ordered by increasing spatial attribute cardinality. After ordering, the first n/2 sites of the ordered list are placed in a set P, while the remaining n/2 sites are placed in a set Q.

Then, the spatial attribute from the relation on each site in P are transmitted to a site in Q in the following manner:

(31)

• the attribute from the site with the smallest spatial cardinality in P is sent to the site with the smallest spatial attribute in Q,

• the attribute from the site with the next smallest spatial cardinality in P is sent to the site with the next smallest spatial attribute in Q,

• and so on... until,

• the attribute from the site with the largest spatial cardinality in P is sent to the site with the largest spatial attribute in Q.

Next, on each site in Q, a spatial semijoin is performed between the existing spatial relation and the spatial attribute sent from the corresponding site in P. The results of of the semijoin are: 1) the set of tuples on the site that qualify in the semijoin, and 2) a set of identifiers from the spatial attribute whose MBRs also qualify in the semijoin. The set of identifiers is sent back to the corresponding site in P.

Finally, for all sites in P, the tuples whose identifiers match the ones obtained from Q are shipped to the query site. In addition, for each site in Q the set of tuples that qualified in the semijoin are sent to the query site. At the query site, the final join is performed.

2.2.1. Example

Suppose we have a distributed spatial database with six sites. Each site contains a spatial relation with 100, 200, 400, 600, 800, and 1000 tuples respectively. Our strategy for processing a query that involves these sites proceeds as follows. First, our sites are ordered by increasing spatial attribute cardinality. Then, the list is divided into the two sets. The set P will contain the sites with the 100-, 200- and 400-tuple relations, while the set Q will contain the sites with the 600-, 800-, and 1000-tuple relations.

Next, the spatial attributes from the sites in set P are sent to sites in Q in the following manner. First, the spatial attribute from the site containing 100 tuples is sent to the site that contains 600 tuples. Similarly, the spatial attribute from the 200-tuple sites is set to the 800-tuple site, and the spatial attribute from the 400-tuple site is sent to the 1000-tuple site. Then, on the 600-, 800-, and 1000-tuple sites, a semijoin is performed between the local spatial relation and the spatial attribute that was shipped to it. During this process, the identifiers that correspond to the MBRs in the spatial attribute that qualify for the semijoin are sent back to originating site. For example, on the 600-tuple site, the identifiers for the qualifying MBRs are sent back to the 100-tuple site, and are used to select the corresponding tuples. Finally, all qualifying tuples from all sites are shipped to the query site.

3. Experimental Evaluation

Here, we present our empirical evaluation of our distributed query processing algorithm. We compared our strategy for optimizing a distributed spatial query with the naïve approach which transfers all unreduced relations to the query site. First, we present the data sets and cost formulas used in our evaluation. Then, we present the results and discussion of our tests.

We simulated a six-site distributed spatial database, where each site contains one spatial relation. Each spatial relation has one spatial attribute, which consists of four values (lx, ly, hx, hy) that represent the extents of an MBR. In addition, each spatial

pdf

Databases and Information Systems

BalticDB&IS‘2012

Databases and Information Systems

Tenth International Baltic Conference on Databases and Information Systems Local Proceedings, Materials of Doctoral Consortium

ORGANISED BY

Lithuanian Academy of Sciences

Vilnius University,

Institute of Mathematics and Informatics Lithuanian Computer Society

SPONSORS

Research Council of Lithuania Algoritmų Sistemos, Ltd.

Information Technologies Institute

Conference Committee

Steering Committee

General Chair

Program Committee Co-Chairs

Publishing Co-Chair

Programme Committee

Additional Reviewers

Doctoral Consortium Advising Committee

Coordinators of Special Sections

Local Organising Committee Chair

Local Organising Committee

Preface

Contents

Part 1. Research Papers

and

Short Communications

Workload-based Heuristics for Evaluation of Physical Database Architectures

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins