
HAL Id: tel-02457147

https://tel.archives-ouvertes.fr/tel-02457147

Submitted on 27 Jan 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Sampling, qualification and analysis of data streams

Rayane El Sibai

To cite this version:

Rayane El Sibai. Sampling, qualification and analysis of data streams. Data Structures and Algorithms [cs.DS]. Sorbonne Université; Université Libanaise, 2018. English. NNT: 2018SORUS170. tel-02457147.


DOCTORAL THESIS

Prepared at

Laboratoire LISITE - Institut supérieur d'électronique de Paris (ISEP)
École doctorale Informatique, télécommunications et électronique (Paris)

by

Rayane EL SIBAI

for the degree of

Doctor of the Université Pierre et Marie Curie (Paris) - Sorbonne Université

Specialty: Computer Science

Publicly presented and defended on July 4, 2018

Sampling, qualification and analysis of data streams

Thesis directors

Raja CHIKY, HDR, ISEP
Kablan BARBAR, HDR, Université Libanaise

Thesis advisors

Yousra CHABCHOUB, MCF, ISEP
Jacques DEMERJIAN, MCF, Université Libanaise

Reviewers

Mustapha LEBBAH, HDR, Université Paris 13
Vincent LEMAIRE, HDR, Orange Labs

Examiners

Bernd AMANN, HDR, UPMC - Sorbonne Université
Karine ZEITOUNI, HDR, University of Versailles St Quentin (UVSQ)


"If you can dream it, you can do it."

Walt Disney

"Là où la volonté est grande, les difficultés diminuent. "

Nicolas Machiavel


Acknowledgments

"Let us be grateful to the people who make us happy; they are the charming gardeners who make our souls blossom." [Marcel Proust]

First of all, I would like to express my everlasting gratitude to my thesis co-advisor Mr. Jacques Demerjian, who gave me the opportunity to do this thesis. He was always available to support me, advise me, and answer my questions throughout this thesis.

Words are not enough to thank my thesis co-advisor Mrs. Yousra Chabchoub, to whom I owe my eternal gratitude for her precious advice, our fruitful discussions, and her kindness. She always found the time to supervise and guide me during the thesis.

I deeply thank my thesis directors Mrs. Raja Chiky and Mr. Kablan Barbar for the confidence they showed in me by agreeing to direct my thesis.

I sincerely thank Mr. Vincent Lemaire and Mr. Mustapha Lebbah for agreeing to review this manuscript. I also thank Mrs. Karine Zeitouni, Mr. Bernd Amann, and Mr. Vincent Guigue for agreeing to take part in the jury of this thesis.

I would like to take this opportunity to thank Mr. Eric Gressier Soudan for the suggestions and advice he gave me during the mid-term defense.

It was a great pleasure to collaborate with Mrs. Zakia Kazi-Aoul, Mrs. Christine Fricker, and Mr. Jacques Bou Abdo, whom I warmly thank for their time and helpful comments.

My warmest thanks go to my colleagues at ISEP, in particular Denis, Loic, Manuel, Sathiya, and Xiang Nan. Thank you for the great atmosphere and conviviality. A very special thank you to Amadou, with whom I shared this doctoral journey.

Finally, a big thank you to my very dear parents for their trust, affection, and sacrifices. This experience would not have been possible without them.


This thesis is dedicated to my parents,

for their support, their encouragement,

and their unconditional love.


Résumé

An environmental monitoring system continuously collects and analyzes the data streams generated by environmental sensors. The goal of the monitoring process is to filter out useful and reliable information and to infer new knowledge that helps the operator to quickly make the right decisions. This whole process, from data collection to data analysis, raises two major problems: data volume and data quality.

On the one hand, the throughput of the generated data streams has kept increasing over recent years, producing a large volume of data continuously sent to the monitoring system. The data arrival rate is very high compared to the available processing and storage capacities of the monitoring system. Thus, permanent and exhaustive storage of the data is very expensive, sometimes even impossible. On the other hand, in a real-world setting such as sensor environments, the data are often of poor quality: they contain noisy, erroneous, and missing values, which can lead to faulty and erroneous results.

In this thesis, we propose a solution called native filtering to deal with the data quality and data volume problems. Upon receipt of the stream data, the data quality is evaluated and improved in real time based on a data quality management model that we also propose in this thesis. Once qualified, the data are summarized using sampling algorithms. In particular, we focus on the analysis of the Chain-sample algorithm, which we compare against other reference algorithms such as probabilistic sampling, deterministic sampling, and weighted sampling. We also propose two new versions of the Chain-sample algorithm that significantly improve its execution time.

Data stream analysis is also addressed in this thesis. We are particularly interested in anomaly detection. Two algorithms are studied: the Moran scatterplot for the detection of spatial anomalies and CUSUM for the detection of temporal anomalies. We designed a method that improves the estimation of the start and end times of the anomaly detected by CUSUM.

Our work was validated by simulations and by experiments on two different real datasets: the data issued from sensors in the drinking water distribution network, provided as part of the Waves project, and the data relative to the bike sharing system (Velib).

Keywords: Data streams, Sampling algorithms, Data quality, Data analysis, Cloud computing.


Abstract

An environmental monitoring system continuously collects and analyzes the data streams generated by environmental sensors. The goal of the monitoring process is to filter out useful and reliable information and to infer new knowledge that helps the network operator to quickly make the right decisions. This whole process, from the data collection to the data analysis, raises two key problems: data volume and data quality.

On the one hand, the throughput of the generated data streams has kept increasing over recent years, producing a large volume of data continuously sent to the monitoring system. The data arrival rate is very high compared to the available processing and storage capacities of the monitoring system. Thus, permanent and exhaustive storage of the data is very expensive, and sometimes impossible. On the other hand, in a real-world setting such as sensor environments, the data are often dirty: they contain noisy, erroneous, and missing values, which can lead to faulty and defective results.

In this thesis, we propose a solution called native filtering to deal with the problems of data quality and data volume. Upon receipt of the data streams, the quality of the data is evaluated and improved in real time based on a data quality management model that we also propose in this thesis. Once qualified, the data are summarized using sampling algorithms. In particular, we focus on the analysis of the Chain-sample algorithm, which we compare against other reference algorithms such as probabilistic sampling, deterministic sampling, and weighted sampling. We also propose two new versions of the Chain-sample algorithm that significantly improve its execution time.

Data stream analysis is also discussed in this thesis. We are particularly interested in anomaly detection. Two algorithms are studied: the Moran scatterplot for the detection of spatial anomalies and CUSUM for the detection of temporal anomalies. We have designed a method that improves the estimation of the start and end times of the anomaly detected by CUSUM.

Our work was validated by simulations and also by experiments on two different real datasets: the data issued from sensors in the water distribution network, provided as part of the Waves project, and the data relative to the bike sharing system (Velib).

Keywords: Data streams, Sampling algorithms, Data quality, Data analysis, Cloud computing.


Contents

Introduction
    Context and motivation
    Contributions and organization of the manuscript
    Application domains
    List of publications

I Data streams summarization

1 Data streams sampling algorithms
    1.1 Introduction
    1.2 Data streams basic concepts
        1.2.1 Definition
        1.2.2 Data streams structure
        1.2.3 Time modeling and windowing models
    1.3 Data streams application domains
        1.3.1 Sensor networks
        1.3.2 Financial analysis
        1.3.3 Network traffic analysis
    1.4 Data streams management
        1.4.1 Data streams management system characteristics
        1.4.2 Data streams management systems
    1.5 Sampling algorithms
    1.6 Discussion

2 Chain-sample algorithm for data streams sampling over a sliding window
    2.1 Introduction
    2.2 The traditional Chain-sample algorithm
        2.2.1 Motivation
        2.2.2 Algorithm description
        2.2.3 Collision problem
    2.3 Chain+: A redundancy-free Chain-sample algorithm
        2.3.1 Construction of the samples
        2.3.2 Memory usage for a single chain-sample
        2.3.3 Trade-off between the execution time, sampling rate and window size
        2.3.4 Comparison of the Chain+ sampling against the Simple Random Sampling algorithm
    2.4 Enhancing the Chain+ sampling algorithm
        2.4.1 Inverting the selection for high sampling rates strategy
        2.4.2 Divide-to-Conquer strategy
        2.4.3 Experimentations
    2.5 Conclusion

3 On the impact of data sampling on data streams statistics inference
    3.1 Introduction
    3.2 Sampling impact on the queries estimation accuracy
        3.2.1 Chain+ sampling algorithm
        3.2.2 A bounded-space SRS algorithm without replacement over a purely sliding window
        3.2.3 Deterministic sampling algorithm
        3.2.4 Experimentations
    3.3 Sampling impact on the anomalies detection
        3.3.1 Problem definition
        3.3.2 EWMA control chart algorithm
        3.3.3 A bounded-space SRS algorithm without replacement over a jumping sliding window
        3.3.4 A bounded-space WRS algorithm without replacement over a jumping sliding window
        3.3.5 Experimentations
    3.4 Conclusion

II Managing data quality in streaming sensor networks

4 Modeling data quality in sensor networks
    4.1 Introduction
    4.2 Data quality basic concepts
    4.3 Data quality management in sensor networks
        4.3.1 Data quality dimensions in sensor networks
        4.3.2 Sensor data quality management, a new approach
    4.4 Data quality in streaming sensor networks, related works
    4.5 Conclusion

III Data streams anomalies detection

5 An in-depth analysis of CUSUM algorithm for change detection in time series
    5.1 Introduction
    5.2 Anomalies detection algorithms
    5.3 An analysis of CUSUM algorithm
        5.3.1 Algorithm description
        5.3.2 Choice of the parameters
        5.3.3 Variability of the Run Length (RL)
    5.4 Enhancing the reactivity of CUSUM algorithm
        5.4.1 Detecting the anomaly start time
        5.4.2 Detecting the anomaly end time
    5.5 Detecting mean change
        5.5.1 Efficiency metrics
        5.5.2 Experimentations
    5.6 Application to stuck-at error: Detecting variation change
    5.7 Conclusion

6 Spatial outliers detection with Moran Scatterplot
    6.1 Introduction
    6.2 Motivation
    6.3 Dataset description and problem definition
    6.4 Spatial outliers detection with an improved Moran scatterplot
        6.4.1 Moran scatterplot
        6.4.2 Improvement of Moran scatterplot using Gower's coefficient
    6.5 Enhancing resources distribution in Velib system
    6.6 Conclusion

IV Data streams native filtering

7 Implementation of the data streams native filters solution
    7.1 Introduction
    7.2 Implementing data streams sampling and qualification modules in WAVES platform
        7.2.1 WAVES FUI project
        7.2.2 Native filters module
    7.3 Information technology infrastructure for data streams native filtering
        7.3.1 Resources consumption in native filters
        7.3.2 Moving to the Cloud computing
    7.4 Conclusion

Conclusion and Perspectives

References


List of Figures

0.1 Native filtering of data streams.
0.2 Links between the chapters of the thesis.
0.3 Volume of the water consumed by the sector, over time.
1.1 Windowing models of data streams.
2.1 Sampling k = 1 item over a sliding window of size n = 4, using the Chain-sample algorithm.
2.2 Collision rate.
2.3 Impact of the window size on the chain-sample length.
2.4 Variation of the chain-sample length over time.
2.5 Impact of the window size on the execution time of the Chain+ sampling algorithm, for different sampling rates k/n.
2.6 Impact of the sampling rate on the execution time of the Chain+ sampling algorithm, for different window sizes n.
2.7 Execution time of the Simple Random Sampling and Chain+ sampling algorithms, for different sampling rates k/n.
2.8 Distribution of the sample size of the Simple Random Sampling algorithm without replacement.
2.9 Sampling k = 6 items over a sliding window of size n = 10 with the Chain+ sampling algorithm using the "Inverting the selection for high sampling rates" strategy.
2.10 Sampling k = 5 items over a sliding window of size n = 10 with the Chain+ sampling algorithm using the "Divide-to-Conquer" strategy.
2.11 Impact of the "Inverting the selection for high sampling rates" strategy on the execution time of the Chain+ sampling algorithm.
3.1 Execution time of the Deterministic, Simple Random and Chain+ sampling algorithms over a purely sliding window, for different sampling rates k/n.
3.2 Impact of the sampling rate and window size on the mean estimation error of the Chain+ sampling algorithm, for different sampling rates k/n.
3.3 Impact of the collision problem on the mean estimation error of the Chain-sample algorithm, for different sampling rates k/n.
3.4 Mean estimation error of the Deterministic, Simple Random and Chain+ sampling algorithms for a sampling rate of 1 observation per 12 hours, for different window sizes.
3.5 Experiments' strategy.
3.6 Sum of the sampled flowmeters of a sector when sampling k = 5 items over a jumping sliding window of size n = 10 using the WRS algorithm.
3.7 Volume of the water consumed by the sector, over time, with and without anomalies.
3.8 EWMA control chart.
3.9 Impact of the sampling rate on the anomaly detection performance according to the sampling algorithm, for the first scenario.
3.10 Comparison of the sampling impact on the anomaly detection performance when using the Simple Random and Weighted Random Sampling algorithms for both scenarios S1 and S2.
4.1 "Journal and conference proceedings from ISI Web of Knowledge searched by a query title and business economics domain using the key words information quality or data quality, data quality and metadata, and data management" (from [Moges, 2014]).
4.2 TDQM methodology.
4.3 Data quality dimensions, empirical approach [Strong et al., 1997].
4.4 Types of abnormal data.
4.5 Examples of outlier faults in the raw humidity readings in the NIMS deployment [Kaiser et al., 2005].
4.6 Stuck-at faults in the chlorophyll concentrations from two buoys in the NAMOS deployment at Lake Fulmor monitoring the marine environment [Dhariwal et al., 2006].
4.7 Sensors data quality management approach based on the TDQM methodology.
5.1 ARL_0(k, h): Impact of k and the control limit h on ARL_0.
5.2 ARL_1(k, h): Impact of k and the control limit h on ARL_1.
5.3 Impact of the control limit h on ARL_δ.
5.4 Impact of the shift δ on ARL, h ∈ [3, 5].
5.5 Impact of small shifts δ on ARL, h ∈ [3, 5].
5.6 Variation of C_t over time.
5.7 Variation of the ST counter over time.
5.8 Variation of the ET counter over time.
5.9 Injection of variation change errors.
5.10 Variation v_t of s_t after the injection of errors.
5.11 Variation of C_t over time.
6.1 Distribution of the number of neighbors.
6.2 Distribution of stations capacity.
6.3 Improved Moran scatterplot based on occupancy data of the Velib system on Thursday 10/31/2013, 10:00 am.
6.4 Number of trips over time.
6.5 Number of problematic stations.
6.6 Number of problematic stations in the day.
6.7 Detected spatial outliers.
6.8 Average number of problematic stations in the day after the users' collaboration.
6.9 Mean duration of stations invalidity.
6.10 Mean cumulative duration of stations invalidity.
7.1 WAVES platform architecture (from FUI17 WAVES Annexe technique).
7.2 Native filters architecture.
7.3 Execution time of the sampling process according to the number of streams, using the local server.
7.4 Execution time of the sampling process according to the number of streams, using the Cloud server.
7.5 Execution time of the cleaning process according to the number of streams.
7.6 Number of servers needed to ensure high availability, using the Cloud server.
7.7 Relational model of the summary.
7.8 Representation of a row in an HBase table as a multidimensional map.


List of Tables

1.1 Structured data stream tuples generated by a water operator.
1.2 Windowing models of data streams.
1.3 Weaknesses of data streams sampling algorithms.
2.1 Execution time reduction of the Chain+ sampling algorithm using the "Divide-to-Conquer" strategy, for a sampling rate k/n = 0.5, a splitting factor c = k, and for different window sizes n.
4.1 Taxonomy of sensor data faults from a data-centric view [Ni et al., 2009].
5.1 Simulated Run Lengths (RL), for different shift sizes.
5.2 Obtained results for mean change detection.
5.3 Performance metrics of CUSUM.
5.4 Obtained results for stuck-at errors detection.
5.5 Performance metrics of CUSUM.
6.1 Number of detected outlier stations with the improved Moran scatterplot.
7.1 Number of observations that can be treated in 15 minutes, using the Cloud server.
7.2 Storage requirements of data streams summaries according to the sampling algorithm.
7.3 Decision table based on the Cloud computing criteria.


Introduction

Context and motivation

An environmental monitoring process consists of a continuous collection, analysis and reporting of observations or measurements of environmental characteristics. Different environmental components (soil, air, water...) can be described and qualified using different types of sensors. These sensors perform regular measurements that are sent to a central system to be analyzed using specific diagnostic tools. The final objective is to discover and infer new knowledge about the environment, in order to help the administrator make good decisions. A main purpose of the monitoring system is to detect anomalies, also called "events". Different data mining techniques are applied to the collected data in order to infer, in real time, aggregated statistics useful for anomaly detection and forecasting purposes. This process helps the administrator supervise the observed system and quickly take the right decisions. The whole process, from data collection to data analysis, leads to two major problems: the management of the data volume and the quality of these data.

On the one hand, a sensor generates its data in the form of a stream, a large volume of data sent continuously to the monitoring system. The arrival rate of the data is very high compared to the available processing and storage capacities. The monitoring system is thus faced with a large amount of data for which permanent and exhaustive storage is very expensive and sometimes impossible. That is why the data stream must be processed in one pass, without being stored. However, for a particular stream, it is not always possible to predict in advance all the processing to be performed. On the other hand, in a real-world setting such as a sensor environment, the data are often dirty: they contain noisy, erroneous, duplicate, and/or missing values. This is due to many factors: local interference, malicious nodes, network congestion, limited sensor accuracy, harsh environment, sensor failure or malfunction, calibration error, and insufficient sensor battery. As in any data analysis process, the conclusions and decisions based on these data may be faulty or erroneous if the data are of poor quality.

Our goal in this thesis is to address the whole chain, from the collection of the data streams generated by sensors to the detection of anomalies (events).

As a first step, we propose the native filtering of data streams as a solution to overcome the two problems related to data collection: the huge volume of generated data and their poor quality. This solution consists of filtering the data qualitatively (evaluating and improving the quality of the received data), and then quantitatively (summarizing the data), as shown in Figure 0.1.

Figure 0.1 – Native filtering of data streams.

Qualitative filter

One solution to overcome the problem of poor data quality is to use high-precision sensors so that potential errors can be neglected. Another alternative is to deploy redundant sensors to cope with sensor failures. However, these approaches are very expensive. In this thesis, we propose an approach that consists of using methods and algorithms to first evaluate the quality of the data and then improve it, in order to obtain reliable and effective results.

Several research studies have focused on the management of data quality in sensor networks. [Jeffery et al., 2006] introduced the so-called Extensible Sensor Stream Processing (ESP) system to clean sensor data. The system detects erroneous data, replaces missing data, and deletes duplicate data. [Lim et al., 2009] proposed to evaluate the accuracy of the data by calculating the difference between the experimental distribution and the theoretical distribution of the data. A data cleaning system based on machine learning algorithms was proposed by [Ramirez et al., 2011]. It calculates the difference between the received data and the predicted data to evaluate the accuracy of the data. In [Klein et al., 2007], the authors proposed a model for evaluating and storing information about the accuracy and completeness of the data. We present in Chapter 4 a state of the art of the existing approaches for the qualification of sensor data. Then, we introduce the model we propose to evaluate and improve the quality of the data in the context of sensor networks.
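To make the prediction-based idea concrete, here is a minimal sketch in Python in the spirit of [Ramirez et al., 2011]: compare each received value against a model prediction and flag readings whose deviation is too large. The function name, the tolerance parameter, and the flat threshold rule are our illustrative assumptions, not the cited system's design.

```python
def flag_inaccurate_readings(measured, predicted, tolerance):
    """Flag readings whose deviation from a model prediction exceeds a
    tolerance (illustrative threshold rule, not the cited system's)."""
    flagged = []
    for t, (x, x_hat) in enumerate(zip(measured, predicted)):
        if abs(x - x_hat) > tolerance:
            flagged.append(t)  # index of a candidate erroneous reading
    return flagged

# Example: a stuck sensor keeps reporting 20.0 while the model expects a rise.
print(flag_inaccurate_readings([20.0, 20.0, 20.0], [20.1, 22.5, 24.8], 1.0))  # [1, 2]
```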

Quantitative filter

Data streams are volatile: once expired, they are no longer available for analysis. This makes it impossible to evaluate a query that was not defined before the arrival of the data, while new requirements may appear after the arrival of the stream. In this case, the data stream management system cannot answer the new queries. One solution to overcome this problem is to store an extract of the stream in a compact structure, called a summary.

The challenge is to decide which data to store in order to maintain a representative summary of the entire stream. One of the structures used to preserve the stream history is the general summary [Midas et al., 2010]. It is a data structure updated whenever new data arrive. Its particularity lies in its ability to carry out analysis tasks on the data of the stream and to give an approximate answer to any query, for any investigated period of time. These characteristics distinguish the general summary from other data stream summaries, called "synopses", such as sketches. A sketch is a very compact data stream summary used to answer specific queries about the data stream. We can mention the Count-Min sketch [Cormode and Muthukrishnan, 2005], used to estimate the frequency of an element, and the Flajolet-Martin sketch [Flajolet and Martin, 1985], which estimates the number of distinct elements in a data stream.
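As an illustration of how compact such a synopsis is, the following is a minimal Count-Min-style frequency estimator in Python. The width/depth values and the salted-MD5 hashing are our simplifying assumptions; [Cormode and Muthukrishnan, 2005] derive the (width, depth) settings from target error bounds and use pairwise-independent hash functions.

```python
import hashlib

class CountMinSketch:
    """d x w grid of counters: memory is fixed regardless of the stream length."""

    def __init__(self, width=1024, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # One hash per row, derived by salting the item with the row index.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += 1

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over the rows
        # is an upper-biased estimate of the true frequency.
        return min(counters[self._bucket(item, row)]
                   for row, counters in enumerate(self.table))

cms = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.add(ip)
print(cms.estimate("10.0.0.1"))  # 2 (never an underestimate)
```

A frequency query over the stream then reduces to add() on each arrival and estimate() on demand.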

The effectiveness of a general summary is measured in terms of the accuracy of the provided response, the memory space to store it, and the time to update it [Midas et al., 2010]. The challenge is to decide what to store in this summary and how to ensure that the summary can meet the requirements of the application while respecting the available resources of the system.

Sampling methods can be used to construct a general summary of data streams. Two categories of these techniques are provided in the literature: probabilistic methods and deterministic methods. Probabilistic methods, also called stochastic methods, are characterized by the fact that each element has a probability of inclusion in the sample; the composition of the obtained sample is thus random. Simple Random Sampling (SRS) and Stratified sampling are two examples of random sampling. For deterministic methods, there is no randomness in the composition of the sample: for example, selecting all the elements having even indexes. The choice of the appropriate sampling method depends, of course, on the application and the purpose of the sampling. We first present in Chapter 1 the state of the art of sampling methods. Then, we focus in Chapter 2 on the study of the Chain-sample, a probabilistic sampling algorithm well adapted to the context of data streams.

Anomalies detection

Native filtering (qualitative and quantitative), presented above, is a pre-processing step that prepares the data to be analyzed and exploited. In the data analysis phase, we are particularly interested in anomalies detection in data streams. This problem is addressed in several application domains such as credit card fraud detection, network intrusion detection, image processing, etc. In sensor networks, anomalies detection can be used for many tasks such as fault diagnosis, intrusion detection, and application monitoring [Chandola et al., 2009].

In sensor networks, there are two different types of anomalies: temporal and spatial. Indeed, sensor data have two characteristics: temporal and spatial correlation. Temporal correlation is due to the continuity of the observed measure. It implies that, for a single data stream, the data value at a given moment is often related to the values measured at close moments. Spatial correlation consists in a strong relation between the values measured at the same time by nearby sensors. These two types of anomalies are studied in this thesis in Chapters 5 and 6, respectively.

Anomalies detection techniques in a temporal context can be classified into two categories: parametric and non-parametric methods. Parametric methods assume that the data follow a known probability distribution; anomalies are defined as data having a low probability of belonging to this distribution. On the contrary, non-parametric techniques do not make any assumption about the data distribution: no a priori knowledge about the data is needed. In sensor networks, non-parametric methods are frequently used. Indeed, in such environments, the data distribution can often change due to sensor resource constraints. The main non-parametric approaches used to detect anomalies in sensor networks are rule-based approaches, control chart methods (i.e., CUmulative SUM (CUSUM) and Exponential Weighted Moving Average (EWMA)), clustering, and support vector machine approaches. In this thesis, we study the CUSUM algorithm in detail in Chapter 5.
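As a preview of Chapter 5, here is a minimal sketch of the standard upward one-sided tabular CUSUM in Python; mu0 is the in-control mean, k the allowance (slack), and h the control limit, following the usual control-chart notation (the choice of k and h is analyzed in Chapter 5).

```python
def cusum_first_alarm(stream, mu0, k, h):
    """Upward one-sided CUSUM: accumulate deviations above mu0 + k and raise
    an alarm when the cumulative statistic C_t exceeds the control limit h."""
    c = 0.0
    for t, x in enumerate(stream):
        c = max(0.0, c + (x - mu0 - k))  # resets to 0 while the process is in control
        if c > h:
            return t                     # index of the first detected change
    return None                          # no change detected

# The mean shifts from ~10 to ~13 at index 3; with k = 0.5 and h = 4 the alarm
# fires at index 4, one observation after the actual change.
print(cusum_first_alarm([10.1, 9.8, 10.0, 13.2, 13.1, 12.9, 13.3], 10.0, 0.5, 4.0))
```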

Several algorithms have been developed to detect anomalies in a spatial context. Among these algorithms, we mention the quantitative algorithms and the graphical algorithms. Quantitative methods perform statistical tests to distinguish the anomalies from the rest of the data, while the graphical algorithms are based on visualization: they present, for each spatial point, the distribution of its neighbors and identify the anomalies as isolated points in specific regions. We are interested in Chapter 6 in the Moran scatterplot, a data visualization method that exploits the spatial correlation.

Contributions and organization of the manuscript

As shown in Figure 0.2, this thesis deals with two main issues, native filtering and data stream analysis, and consists of four parts. We discuss the quantitative and qualitative filtering of data streams in the first two parts, entitled "Data streams summarization" and "Managing data quality in streaming sensor networks" respectively. The implementation of the native filters solution is presented in the fourth part, entitled "Data streams native filtering". The detection of anomalies in data streams is discussed in the third part, entitled "Data streams anomalies detection".

We present in Chapter 1 the state of the art of the different sampling algorithms used in data stream environments. We propose to classify these algorithms according to the following metrics: the number of passes over the data, the memory consumption, the skewing ability, and the resources consumption of the algorithm.

In Chapter 2, we study the Chain-sample algorithm in detail. We identify a particular weakness of this algorithm caused by the problem of collisions and the redundancy of the items in the sample when the sampling rate is high. In order to overcome this problem, we modify the Chain-sample algorithm to improve the quality of the sample by eliminating the redundancy. We also propose two techniques to significantly reduce the execution time of the algorithm, even for a high sampling rate.

Figure 0.2 – Links between the chapters of the thesis: native filtering (quantitative filter: Chapters 1, 2 and 3; qualitative filter: Chapter 4; native filters module: Chapter 7) and data analysis (Chapters 5 and 6).

We address in Chapter 3 the impact of data sampling on event detection. Several sampling algorithms are applied to data streams: Deterministic sampling, Chain-sample, Simple Random Sampling (SRS) and Weighted Random Sampling (WRS). First, we adapt the SRS algorithm to the streaming context by extending it to the sliding window model. Then, we compare the performance of these algorithms in terms of execution time and accuracy of the query answers. Thereafter, we study the impact of the sampling process on anomalies detection. In this context, the comparison of the algorithms is based on their response time in case of anomaly and on the relevance of the detected anomalies.

We discuss in Chapter 4 the data quality aspects in dynamic environments, especially in sensor networks. At first, we present the general definitions of data quality dimensions. Then, we detail the different dimensions of sensor data quality, and we provide our definitions for the accuracy and confidence dimensions. We also propose a new model for managing data quality in sensor networks. Compared to existing approaches, our model takes into account the errors caused by sensor faults.

In Chapter 5, we study in depth the CUSUM algorithm, which detects temporal anomalies in a data stream issued from a single source. In particular, we analyze the choice of the parameters of the algorithm in order to achieve two objectives: (1) minimize the false positives (when the process is under control, the CUSUM algorithm should not detect any change), and (2) quickly detect any deviation of the process. We also propose a method which determines, with good precision, the start and end times of the deviation of the process parameters.

In Chapter 6, we are interested in spatial anomalies detection methods, especially the Moran scatterplot, a graphical algorithm based on visualization that exploits the similarity between spatial neighbors in order to detect spatial anomalies. At first, we propose an improved version of the Moran scatterplot, in which we enhance the definition of the weight matrix involved in the calculation of the distance between an observed value and its neighboring observations. We propose to calculate the weights based on several parameters related to the characteristics of the spatial points, which qualify the correlation between them.

We present in Chapter 7 the native filters module, which provides two features: real-time data streams qualification and sampling. Upon receipt of data streams from multiple sensors, the qualitative filter evaluates and improves the data quality based on the architecture presented in Chapter 4. Once the data are qualified, the quantitative filter proceeds to summarize the data using sampling algorithms. Several simultaneous data streams can be processed by the native filters module, which adapts to the characteristics of each stream. The integration of the native filters module into the WAVES project platform is also discussed in this chapter.

We also evaluate in Chapter 7 the native filters solution in terms of the required computing resources. We present a benchmark of the Information Technology (IT) resource requirements while examining two network architectures for data processing, local computing and Cloud computing, and two infrastructures for storing data streams summaries, a database and Hadoop. The considered computing resources are the execution time of the data streams qualification and sampling processes, and the memory required for storing the data streams summaries. We finally discuss in this chapter the benefit of migrating the native filters solution to the Cloud computing environment.

Application domains

All the propositions presented in this thesis are tested and validated against real datasets issued from the following application domains:

A. WAVES dataset

This thesis is part of the FUI 17 WAVES project, which aims to design and to develop a monitoring platform for the supervision of water distribution networks.

The increase of water stress in many parts of the world and the awareness of the value of fresh water as a scarce resource require a reduction in the water losses along the production chain, from the natural water resources to the consumers.

According to the Cador report¹ on the state of the water distribution assets in France, losses and leakage represent 30% of the total volume of water flowing into the water distribution network.

¹ http://www.economie.eaufrance.fr/IMG/pdf/Patrimoine_des_canalisations_d_AEP_France.pdf

In order to supervise and to manage the water distribution network, many flowmeters have been deployed by the water operators. They periodically measure and send to the central monitoring system the instantaneous values of different water-related observables such as flow, pressure, and chlorine. A large geographical area is divided into several sectors, with several flowmeters deployed on the periphery of each sector. All this information is aggregated and analyzed by the monitoring system, which infers, in particular, the water consumption of the considered sector.

The water consumption of each sector is indeed a key parameter for the detection of leaks. The volume of the water consumed by a given sector is calculated in real time as an algebraic sum of the flows sent by its associated flowmeters. Each of these deployed flowmeters has two categories of attributes: spatial attributes and non-spatial attributes. The spatial attributes depict the geographical location of the flowmeter (latitude and longitude), while the non-spatial attributes include the name, the ID, the recorded observations, and the diameter of the flowmeter.
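For illustration, a minimal sketch in Python of this algebraic sum at one timestamp; the meter names (apart from "Q 400G", taken from Table 1.1) and the +1/-1 orientation convention (water entering vs. leaving the sector) are our assumptions, not the WAVES schema.

```python
def sector_consumption(flows, orientation):
    """Water consumed by a sector at one timestamp (m3): the algebraic sum of
    the flows measured by the flowmeters deployed on its periphery."""
    return sum(orientation[meter] * volume for meter, volume in flows.items())

# Two inflow meters and one outflow meter on the sector boundary.
flows = {"Q 400G": 196.19, "Q 512A": 54.02, "Q 318C": 31.50}
orientation = {"Q 400G": +1, "Q 512A": +1, "Q 318C": -1}
print(sector_consumption(flows, orientation))  # 218.71 (up to float rounding)
```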

The first dataset we explore in this thesis is issued from the deployed flowmeters. The flow measurements for a given flowmeter are highly variable. Indeed, they depend on the other flowmeters supplying the associated sector. For example, a given source associated with a particular flowmeter may suddenly stop supplying water to a sector; the sector will then be supplied by its other associated peripheral sources.

The data recorded by the flowmeters are structured data streams and have both spatial and temporal characteristics. Each recorded observation is composed of two fields: the timestamp designating the recording date of the measure, and the value of the measure. These data are regularly generated by the sensors with a frequency of one observation every 15 minutes. Figure 0.3 shows the volume of the water consumed by a specific sector during five working days in January 2014. It is inferred in real time as an algebraic sum of the flows delivered by its associated flowmeters, in m3. One can notice a periodicity in the water consumption, related to human activity. As these are working days, we can notice two main consumption peaks: an important one in the morning and a second, less important one around 7 pm.

Figure 0.3 – Volume of the water consumed by the sector, over time.


B. Velib dataset

The second dataset we explored in this thesis is relative to the Parisian bike sharing system (Velib). These data are of two types: static data and dynamic data.

The static data describe the Velib stations and consist of two categories of attributes: spatial attributes and non-spatial attributes. The spatial attributes depict the geographical location of the station (latitude and longitude), while the non-spatial attributes include the ID of the station and its capacity (total number of docks). The dynamic data are of two kinds: occupancy data and trip data. Occupancy data are provided in real time. They represent the states of the stations in terms of the number of bikes present in each station at each timestamp t. This parameter varies during the day and is closely dependent on the users' activity. Trip data correspond to the trips of Velib users. A trip is characterized by a departure and arrival timestamp, and a departure and arrival station. Trip data can be divided into two main categories: working days and weekends. Indeed, two days of the same category are very similar. In this thesis, we focus on working days and choose to analyze 24 hours of trips: the trips that took place on Thursday, October 31, 2013. This period includes 121,709 trips, involving 1,226 Velib stations.

List of publications

This thesis resulted in 8 publications in international conferences, listed below.

1. Rayane El Sibai, Yousra Chabchoub and Christine Fricker. "Using spatial outliers detection to assess balancing mechanisms in bike sharing systems". In Proceedings of the 32nd IEEE International Conference on Advanced Information Networking and Applications (AINA), IEEE, May 2018.

2. Rayane El Sibai, Yousra Chabchoub, Raja Chiky, Jacques Demerjian and Kablan Barbar. "An in-depth analysis of CUSUM algorithm for the detection of mean and variability deviation in time series". In Proceedings of the 16th International Symposium on Web and Wireless Geographical Information Systems (W2GIS), Springer, May 2018.

3. Rayane El Sibai, Yousra Chabchoub, Raja Chiky, Jacques Demerjian and Kablan Barbar. "Information Technology Infrastructure for Data Streams Native Filtering". In Proceedings of the IEEE Middle East & North Africa COMMunications Conference, IEEE, April 2018.

4. Rayane El Sibai, Yousra Chabchoub, Raja Chiky, Jacques Demerjian and Kablan Barbar. "A performance evaluation of data streams sampling algorithms over a sliding window". In Proceedings of the IEEE Middle East & North Africa COMMunications Conference, IEEE, April 2018.

5. Rayane El Sibai, Yousra Chabchoub, Raja Chiky, Jacques Demerjian and Kablan Barbar. "Assessing and Improving Sensors Data Quality in Streaming Context". In Proceedings of the 9th International Conference on Computational Collective Intelligence Technologies and Applications (ICCCI), pages 590-599, Springer, September 2017.

6. Rayane El Sibai, Yousra Chabchoub, Jacques Demerjian, Zakia Kazi-Aoul and Kablan Barbar. "Sampling algorithms in data stream environments". In the International Conference on Digital Economy (ICDEc), pages 29-36, IEEE, April 2016.

7. Rayane El Sibai, Yousra Chabchoub, Jacques Demerjian, Zakia Kazi-Aoul and Kablan Barbar. "A performance study of the Chain sampling algorithm". In Proceedings of the 7th International Conference on Intelligent Computing and Information Systems (ICICIS), pages 487-494, IEEE, December 2015.

8. Yousra Chabchoub, Zakia Kazi-Aoul, Amadou Fall Dia and Rayane El Sibai. "On the dependencies of queries execution time and memory consumption in C-SPARQL". In Proceedings of the 12th IADIS International Conference on Applied Computing (AC), pages 29-36, October 2015.


Part I

Data streams summarization


Chapter 1

Data streams sampling algorithms

Contents

1.1 Introduction
1.2 Data streams basic concepts
    1.2.1 Definition
    1.2.2 Data streams structure
    1.2.3 Time modeling and windowing models
1.3 Data streams application domains
    1.3.1 Sensor networks
    1.3.2 Financial analysis
    1.3.3 Network traffic analysis
1.4 Data streams management
    1.4.1 Data streams management system characteristics
    1.4.2 Data streams management systems
1.5 Sampling algorithms
1.6 Discussion

The scientific contribution presented in this chapter has been published in our paper: Rayane El Sibai, Yousra Chabchoub, Jacques Demerjian, Zakia Kazi-Aoul and Kablan Barbar. "Sampling algorithms in data stream environments". In the International Conference on Digital Economy (ICDEc), pages 29-36, IEEE, April 2016.


1.1 Introduction

Data streams are large sets of data generated continuously and at a rapid rate compared to the available processing and storage capacities of the system that receives them. Thus, these streams cannot be fully stored, which is why we have to process them in one pass without storing them exhaustively. However, for a particular stream, it is not always possible to predict in advance all the processing to be performed. It is, therefore, necessary to save some of these data for future processing. These stored data constitute "summaries". Several techniques can be used for the construction of data streams summaries, among them the sampling algorithms.

In this chapter, we present a study of these algorithms. First, we introduce the basic concepts of data streams and windowing models, as well as the application domains of data streams. Next, we detail the different sampling algorithms used in streaming environments, and we propose to qualify them according to the following metrics: the number of passes, memory consumption, skewing ability, and complexity.

This chapter is organized as follows. We present in Section 1.2 the basic concepts of data streams. We discuss several application domains of data streams in Section 1.3. Section 1.4 is dedicated to data streams management systems. In Section 1.5, we present a detailed study of the sampling algorithms used in streaming environments. We end the chapter with a discussion.

1.2 Data streams basic concepts

1.2.1 Definition

A data stream is an infinite sequence of tuples generated continuously and rapidly with respect to the available processing and storage capacities. Golab et al. [Golab and Özsu, 2003b] define a data stream as follows:

"A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety."

Several other definitions of data streams have been presented in the literature. Their common characteristic is that they all rely on the main features of these streams, namely [Gabsi, 2011]:

Continuous. The tuples arrive continuously and sequentially.

Fast. The data arrive at a high speed compared to the processing and storage capacities available in the system that receives them.

Ordered. The order of the data is often defined by a timestamp, which can be either implicit (the data arrival time) or explicit (a timestamp contained in the data).


Unlimited volume. The size of the data stream is potentially unbounded and can be very large. The exhaustive storage of all the received data is not possible. For instance, one gigabyte of records per hour is generated by AT&T (the largest provider of local and long-distance voice and xDSL services in the United States) [Chakravarthy and Jiang, 2009].

Push-type. The sources of the data are not controllable; they are programmed to send regular measurements. The input rate can vary widely from one data stream to another. Also, some data streams have irregular input rates, while others are highly bursty, such as HTTP traffic streams [Crovella and Bestavros, 1997] and local Ethernet traffic streams [Leland et al., 1993].

Volatile. Once the data are processed, they are discarded, and there is no possibility to process them again unless they have been stored in memory. The available memory is very small compared to the data stream size. Therefore, an immediate treatment of the data is required, and it has to be fast enough to meet the response time requirement.

Uncertainty of the data. Some data of the stream may be missing, duplicated, or prone to errors. This is due to external factors such as network congestion, hardware problems in the measurement instruments, etc. (cf. Chapter 4).

1.2.2 Data streams structure

The form and type of the data belonging to a stream depend on the application that created the data. Two types of data can be distinguished: quantitative data and qualitative data. Quantitative data are data represented in numerical form; they usually come from measurements. Qualitative data are data represented by specific values from a discrete set of possible values. For example, the weight of a human is qualitative data when represented by the labels low, medium, and high. Notice that binary data are also considered as two-mode qualitative data (0/1, ON/OFF). Depending on the form of the data, a data stream can be of three types. In structured data streams, the tuples arrive as records that respect a specific relational schema including the field names of the tuples and the associated values. These tuples arrive in an ordered manner, often determined by the timestamp of the tuple. An example of structured data stream tuples generated by a flowmeter is given in Table 1.1. The data of a semi-structured stream are heterogeneous sets of weakly structured data; they arrive in the form of XML tags or RDF. An RDF data stream is an ordered sequence of pairs, where each pair consists of an RDF triple and a timestamp. In unstructured data streams, the data have different structures. Currently, more than 85% of all business information is unstructured data [Blumberg and Atre, 2003]. These data include e-mails, surveys, Web pages, PowerPoint presentations, chats, etc. WebCQ [Liu et al., 2000] is a data stream management system for managing unstructured data streams. Its purpose is to monitor pages on the Web in order to detect and report interesting changes to the users.

Table 1.1 – Structured data stream tuples generated by a water operator.

    Timestamp           Sensor ID    Consumption (m3)
    ...                 ...          ...
    2014/12/31 00:00    Q 400G       196.19
    2014/12/31 01:00    Q 400G       187.91
    2014/12/31 02:00    Q 400G       188.24
    2014/12/31 03:00    Q 400G       188.60
    ...                 ...          ...

Given that the types of data that can be handled by data mining algorithms are restricted, a data pre-processing step is often necessary before proceeding to the data analysis phase. Data pre-processing aims at converting the raw data to a standard format adapted to the data mining algorithms. This step is also an opportunity to clean up the data by replacing missing values and regenerating aberrant ones. Finally, the standardization of the data is also necessary to bring all the data to the same definition domain, so that their values can be compared independently of their original units.
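A minimal sketch of this standardization step in Python, assuming simple min-max scaling (one common choice among several):

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Bring readings to a common definition domain [low, high] so that values
    from different sensors can be compared independently of their units."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low] * len(values)   # constant stream: nothing to spread out
    return [low + (high - low) * (v - lo) / (hi - lo) for v in values]

print(min_max_scale([196.19, 187.91, 188.24, 188.60]))  # [1.0, 0.0, ~0.04, ~0.083]
```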

1.2.3 Time modeling and windowing models

Data streams are infinite. They must be processed in an online manner, and the Data Stream Management System (DSMS) must provide fast responses to continuous queries while respecting the data stream arrival rate. Thus, windowing models were introduced in the formulation of continuous queries in DSMSs. Data windowing models are based on the principle of cutting the stream into successive portions, and they are used to limit the amount of data to be processed. With the use of windows, at any time, a finite set of tuples of the stream can be defined and used to answer the query and produce the corresponding results. The windowing models can be classified according to the way time is modeled. The temporal aspect of a data stream can be modeled in two manners: physical time, also called temporal time, and logical time, also called sequential time. Physical time is expressed in terms of dates, while logical time is expressed in terms of the number of elements. One can notice that, with the logical time model, it is possible to know in advance the number of elements in the window; this number is unknown for the physical window model when the stream rate is variable. Alternatively, each of these two types of windows can be defined by its two boundaries. According to the start and end dates of the window, we can distinguish:

Fixed window. When using the fixed window model, the stream is partitioned into non-overlapping windows and the data are preserved only for that part of the stream within the current window. The boundaries of this type of window are accurate and absolute.

Sliding window. With the sliding window model, the boundaries of the window change over time: each time an item is added to the window, the oldest element leaves it. The queries are periodically performed on the data included in the last window. There are three variants of the sliding window: the purely sliding window, where the offset between successive windows is less than the window size; the jumping window, where the offset between successive windows is equal to the window size; and the hopping window, with an offset greater than the window size. A sliding window can be tuple-based (the most recent n elements) or time-based (elements received within the last δ minutes); a minimal sketch of the tuple-based variant is given after this list.

Landmark window. The start date of this window is fixed, while the end date is relative. The size of the window increases gradually as new elements of the stream arrive. For instance, a window between a specific past date and the current date is of the landmark type.
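Here is a minimal sketch of the tuple-based purely sliding window in Python (time-based variants would evict by timestamp rather than by count):

```python
from collections import deque

class TupleSlidingWindow:
    """Holds the n most recent items; each arrival beyond n evicts the oldest."""

    def __init__(self, n):
        self.items = deque(maxlen=n)   # deque discards the oldest automatically

    def insert(self, item):
        self.items.append(item)

    def answer(self, query):
        # Continuous queries are evaluated on the current window content only.
        return query(self.items)

window = TupleSlidingWindow(n=4)
for reading in [196.19, 187.91, 188.24, 188.60, 190.02]:
    window.insert(reading)
print(window.answer(lambda xs: sum(xs) / len(xs)))  # mean of the last 4 readings
```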

Table 1.2 – Windowing models of data streams.

    Windowing model      Time model    Example
    Physical fixed       Sequential    From the 19th element to the 56th element
    Logical fixed        Temporal      From 01/01/2018 to 30/01/2018
    Physical sliding     Sequential    The last 10 elements
    Logical sliding      Temporal      The last 10 days
    Physical landmark    Sequential    From the 30th element to the last received element
    Logical landmark     Temporal      From 12/03/2018 to present

Figure 1.1 – Windowing models of data streams: (a) purely sliding window of size n = 3; (b) jumping sliding window of size n = 3; (c) hopping sliding window of size n = 3 and offset o = 4; (d) landmark window of initial size n = 3.


1.3 Data streams application domains

The field of data streams is a subject of growing interest in the industrial community. This interest is reflected by the growing number of applications and industrial systems that continuously generate data streams [Golab and Özsu, 2003a, Golab and Özsu, 2003b]. These applications are heterogeneous and quite diverse, although their main purpose is the supervision and the control of the data. A typical application of data streams is to study the impact of the weather on traffic networks. Such a study is useful for analyzing and predicting the traffic density as a function of the weather conditions [Gietl and Klemm, 2009]. The analysis of weather data is also used to predict the weather conditions. Indeed, several weather indicators, such as the temperature, humidity, air pressure, wind speed, etc., are indicative of the weather. Through the classification and learning of these data, several models can be derived and used to predict future weather conditions [Bartok et al., 2010].

Social networks also provide more and more data streams that can be exploited in many areas. For instance, TwitterMonitor is a real-time system that allows the detection and the analysis of emerging topics on Twitter. The results are provided to the users, who in turn interact with the system to rank the detected trends according to different criteria [Mathioudakis and Koudas, 2010]. Data streams can be found in other applications as well, such as website logs [Gilbert et al., 2001] and medical remote monitoring [Sachpazidis, 2002, Brettlecker and Schuldt, 2007]. We discuss in the following various other applications:

1.3.1 Sensor networks

Wireless Sensor Networks (WSN) are a special type of ad hoc networks. They use compact and autonomous devices, called sensor nodes. These nodes collect and transmit their observations autonomously to other nodes or directly to the central server. Sensor networks are used in many application fields to monitor and supervise the environment [Liu et al., 2003, Gürgen, 2007]. They are also used in the domain of electrical energy monitoring [Abdessalem et al., 2007]. Currently, several electrical energy providers are using smart sensors. The latter continuously send their observations about the users' electricity consumption to the information systems of the electricity suppliers to which they are linked. The recorded data are in the form of streams. The analysis of these data streams makes it possible to detect several anomalies, such as the over-consumption of energy or the failure of a household appliance. PQStream is a data stream management system designed to process and manage the stream data generated by the Turkish Electricity Transmission System [Küçük et al., 2015]. The system includes a module for continuous data processing, where several data mining methods such as classification and clustering are applied to the data; a database to store the obtained data analysis results; and a Graphical User Interface (GUI).


1.3.2 Financial analysis

Financial analysis is one of the major applications of data streams. Previously, the analysis of financial data was mainly intended to assess the probability of a financial crisis in a company. Nowadays, the analysis of these data involves a wide variety of users, including commercial providers, banks, investors, credit agencies, and the stock market, among others. In this context, several data mining operations can be applied to financial data, such as fuzzy logic techniques, machine learning, neural networks, and genetic algorithms [Kovalerchuk and Vityaev, 2000]. The purpose of these operations is to study the impact of one market on another, to monitor the conformity and consistency of trading operations and improve their performance, and, last but not least, to trigger warning signs of changing trends [Kovalerchuk and Vityaev, 2000]. For instance, Tradebot [TRA, 1999] is a search engine that allows for quantitative analysis of financial data and trading performance, the design of strategies and implementation of scientific experiments to evolve the theory of commerce, and, ultimately, work with traders to improve the trading system.

1.3.3 Network traffic analysis

Several real-time systems for the analysis of network traffic have been designed. They aim to infer statistics from the network traffic and to detect critical conditions such as congestion and denial-of-service attacks. GigaScope [Cranor et al., 2002] allows monitoring Internet traffic through an SQL interface. Tribeca [Sullivan and Heybey, 1998] is a stream-oriented DBMS designed to monitor and analyze network traffic performance. Tribeca provides a query language that users can write and compile to process the data streams coming from the network traffic. Gilbert et al. [Gilbert et al., 2001] proposed QuickSAND to summarize network traffic data using sketches. In this context, one can search, for instance, for the clients who consumed the most bandwidth of the network. The analysis of Web traffic also has several interesting applications and serves several purposes [Csernel, 2008]:

– Rank the n most accessed pages during a specific period of time in order to optimize the loading time of these pages (see the sketch after this list).

– Analyze the behavior of the visitors of a particular page or website and identify the distinct users among them.

– Analyze the traffic generated by social networks.
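
As an illustration of the first purpose, the following sketch computes an exact top-n ranking of requested pages from a web-server log. It assumes Common Log Format lines, where the requested URL is the seventh whitespace-separated field, and a hypothetical file name access.log; in a true streaming setting, an approximate summary structure would typically replace the exact counter.

from collections import Counter

def top_n_pages(log_lines, n=10):
    # Count each requested URL and return the n most accessed pages.
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) > 6:          # skip malformed lines
            counts[fields[6]] += 1   # 7th field: requested URL
    return counts.most_common(n)

# Usage, on a hypothetical log file:
# with open("access.log") as f:
#     print(top_n_pages(f, n=5))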

1.4 Data streams management

Traditional DataBase Management Systems (DBMSs) allow permanent storage and efficient management of the data by exploiting their structure. A query language is used to query the data and retrieve information. However, with the emergence of data streams, new challenges related to data processing have appeared. These issues are mainly due to the infinite volume of the stream and its very high arrival rate. These new constraints make the use of DBMSs inadequate. As a result, traditional data storage and analysis systems need to be revised to allow the processing of data streams; hence the emergence of data stream management systems. We detail in the following the main constraints related to the processing of data streams.

1.4.1 Data streams management system characteristics

A DSMS is supposed to meet the constraints of data streams and the needs of the applications that generate these data by having characteristics related to both functionality and performance [Gabsi, 2011].

Processing continuous queries. In database applications, the queries are evaluated in a finite environment on persistent data. In such applications, the data do not change as long as the current query is not answered. On the contrary, in streaming applications, the data keep growing and the environment evolves continuously. As a result, the queries are persistent: they must be executed continuously on volatile data. It is also important that the DSMS has a highly optimized engine to handle the volume of data.

Data availability. The DSMS must ensure the availability of the data at all times. It is also supposed to deal with system failures and must take into account possible delays in the arrival of the data. Because of such delays, some operations may block. To avoid this situation, a maximum waiting time (time-out) can be specified; thus, potentially blocking queries are processed in a timely manner even if the data are not complete.

Infinity of the data. The DSMS must be able to handle the huge volume of the data stream. Load shedding techniques [Tatbul et al., 2003] and summary structures are possible solutions to reduce the load on the system (a minimal load shedding sketch is given after this list).

Resistance to stream imperfections. The system must handle the imperfections of the data. In the real world, the data often contain noisy, erroneous, duplicate, and missing values; the DSMS has to deal with these issues.
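
As an example of the load reduction mentioned above, here is a minimal sketch of random load shedding in the spirit of [Tatbul et al., 2003]. The queue capacity and drop probability are illustrative assumptions, and the downstream consumption of the queue is elided.

import random

def admit(queue, item, capacity=1000, drop_probability=0.5, rng=random):
    # Randomly shed incoming tuples once the processing queue is
    # saturated, so the system degrades gracefully instead of blocking.
    if len(queue) >= capacity and rng.random() < drop_probability:
        return False          # tuple shed
    queue.append(item)        # tuple admitted for downstream processing
    return True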

1.4.2 Data streams management systems

Several DSMSs have been developed in recent years. These systems are distinguished by their query formulation languages, the procedures used to represent the streams, and the type of application they are designed for. Thus, some DSMSs have a generalist vocation, such as Aurora [Abadi et al., 2003], TruViso [TRU, 2004], Medusa [Cetintemel, 2003], Borealis [Abadi et al., 2005], TelegraphCQ [Chandrasekaran et al., 2003], and StreamBase [Tatbul et al., 2003], while others are intended for a particular type of application, such as GigaScope [Cranor et al., 2002], NiagaraCQ [Chen et al., 2000], OpenCQ [Liu et al., 1999], StatStream [Zhu and Shasha, 2002] and Tradebot [TRA, 1999].



One of the well-known DSMSs is STREAM (STanford stREam datA Manager) [Arasu et al., 2016]. Akhtar et al. [Akhtar, 2011] evaluated the performance of the STREAM system and noted several of its advantages, namely that it is user-friendly, since it allows the users to interact with the system through a graphical interface, and that it is very suitable for applications requiring high precision, since it gives accurate results for aggregation queries. We present the STREAM system in the following.

STREAM: STanford stREam datA Manager

STREAM [Arasu et al., 2016] is a general-purpose DSMS developed in C++ at Stanford University. Through a Web-based GUI or direct HTTP, the users can register their queries and receive the corresponding results as a streaming HTTP response in XML format. This DSMS is based on the Continuous Query Language (CQL), a declarative language derived from SQL, used to formulate continuous queries over relations (static data) and data streams (dynamic data). The semantics of continuous queries over relations and data streams rely on abstract relational semantics, based on two types of data, streams and relations, defined as follows [Arasu et al., 2016]:

– A stream S is an unbounded set of pairs (s, τ), where s is a tuple and τ is the logical arrival time of tuple s on stream S.

– A relation R is a time-varying set of tuples, where R(τ) is a relation representing the set of tuples at time τ.

These semantics rely on three classes of operators:

– A relational query language, which we can see as a set of relation-to-relation operators.

– A window specification language. It can be seen as a set of stream-to-relation operators to convert the streams into relations. These operators are based on the sliding window model and are expressed using a window specification language derived from SQL-99. The sliding window can be of three types: a tuple-based sliding window, a time-based sliding window, and a partitioned sliding window.

– A set of relation-to-stream operators: Istream, applied to a relation R whenever a tuple s is inserted into the relation at time τ; Dstream, applied to a relation R whenever a tuple s is deleted from the relation at time τ; and Rstream, applied to a relation R to emit all the tuples s belonging to R at time τ.

An example of a continuous CQL query over a data stream flow_meter is as follows:

SELECT Istream(timestamp, consumption)
FROM flow_meter [ROWS 96]
WHERE flow_meter.consumption > 500



This query answers the following question: among the last 96 observations, what are the instants at which the quantity of water consumed, as measured by the flow_meter, exceeded 500 m³?
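
For intuition, the following Python sketch mimics the behavior of this query: it maintains a tuple-based window of the last 96 rows and, in the spirit of Istream, emits each newly inserted tuple that satisfies the predicate. This is a simplification of CQL's formal semantics, given for illustration only.

from collections import deque

def istream_high_consumption(stream, rows=96, threshold=500):
    # stream yields (timestamp, consumption) pairs, as in the CQL example.
    window = deque(maxlen=rows)           # [ROWS 96] tuple-based window
    for timestamp, consumption in stream:
        window.append((timestamp, consumption))
        if consumption > threshold:       # WHERE consumption > 500
            yield timestamp, consumption  # Istream: newly inserted tuples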

1.5 Sampling algorithms

In streaming environments, the data arrive continuously, often at a high rate, and the system that receives the data may not have sufficient memory to store them exhaustively. Thus, processing a data stream implies reducing the size of the data by maintaining and storing a summary of the data in memory. Sampling algorithms are used to construct such a data stream summary. An effective summary of a data stream must be able to answer, in an approximate manner, any query, whatever the investigated time period. The purpose of sampling algorithms is to provide information about a large dataset from a representative sample extracted from it. Data stream sampling algorithms are based on traditional sampling techniques. These techniques require access to all the data in order to construct the sample, also called the summary. However, in a streaming context, this condition is not guaranteed because of the infinite size of the stream. Thus, the sampling algorithms have to be adapted to the streaming context using the windowing models presented in Section 1.2.3.

In the following, we present in detail the different sampling algorithms proposed in the literature to construct a data stream summary.

a. Simple Random Sampling (SRS). The SRS algorithm [Cochran, 1977] is the most widely used sampling algorithm. It is simple and gives a random sample. It consists of sampling the data in a random manner, where each element of the data has the same probability p of being selected. The sampling can be done with or without replacement. With SRS with replacement, the sample may contain duplicates, since each element may be selected twice or more, whereas with SRS without replacement each element can be selected at most once, which makes this type of sampling more accurate and convenient.
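
A minimal sketch of both variants, in the classical setting where the whole dataset is available in memory (the streaming adaptations rely on the windowing models discussed above):

import random

def srs_with_replacement(data, n, seed=None):
    # Each draw is independent, so an element may appear several times.
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def srs_without_replacement(data, n, seed=None):
    # Each element is selected at most once.
    rng = random.Random(seed)
    return rng.sample(data, n)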

b. Systematic Sampling. Let n be the sample size and N the size of the data. The systematic sampling algorithm divides the data into n groups, each of size k = N/n. Then, it chooses a random number j ∈ [1, k] and adds the following elements to the sample: j, j + k, j + 2k, j + 3k, ... [Cochran, 1977]. Systematic sampling has several advantages: the sample is easy to build, and it is faster and more accurate than Simple Random Sampling since the sampled elements are spread over the entire data [Cochran, 1977]. One drawback of this algorithm is its lack of randomness: the sampled elements are selected periodically, which can impact the quality of the sample. If the original dataset presents a periodicity close to the value k, the sample cannot represent the original dataset and will be biased.
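
A sketch under the same in-memory assumption, using 0-based indexing (the description above is 1-based):

import random

def systematic_sample(data, n, seed=None):
    # Divide the N elements into n groups of size k = N // n, pick a
    # random offset j in [0, k-1], then take every k-th element.
    N = len(data)
    k = N // n
    j = random.Random(seed).randrange(k)
    return data[j::k][:n]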

c. Stratified Sampling. The stratified sampling algorithm [Cochran, 1977] is a Simple Random Sampling in which the original data is first divided into homogeneous subgroups, called strata, and a random sample is then drawn from each stratum.
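
A sketch of stratified sampling, where the stratification key and the sampling fraction are supplied by the caller; the grouping by a sensor_id field in the usage comment is a hypothetical illustration:

import random
from collections import defaultdict

def stratified_sample(data, key, fraction, seed=None):
    # Partition the data into homogeneous strata, then draw a simple
    # random sample of the given fraction inside each stratum.
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for stratum in strata.values():
        size = max(1, round(fraction * len(stratum)))
        sample.extend(rng.sample(stratum, size))
    return sample

# e.g., keep about 10% of the readings of each sensor:
# sample = stratified_sample(readings, key=lambda r: r["sensor_id"], fraction=0.1)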
