Academic year: 2021

Contents

Abstract vi

Resum viii

Résumé x

Acknowledgements xii

List of Figures . . . xvi

List of Tables . . . xix

Abbreviations xxi

Thesis Details xxii

1 Introduction 1

1 Motivation . . . . 1

2 Background and State-of-the-art . . . . 2

2.1 Data Lakes and Tabular Datasets . . . . 2

2.2 Data Lake Governance . . . . 3

3 Techniques and Challenges . . . 10

3.1 Schema Matching . . . 11

3.2 Dataset Similarity Computation . . . 14

3.3 Similarity Models Learning . . . 15

4 Thesis Objectives and Research Questions . . . 18

5 Thesis Overview . . . 20

5.1 Proximity Mining Framework . . . 20

5.2 DL Categorization . . . 23

5.3 Metadata Query Interface . . . 25

6 Thesis Contributions . . . 25

7 Structure of the Thesis . . . 27

7.1 Chapter 2: Instance-level value-based schema matching for mining proximity between datasets . . . 28

7.2 Chapter 3: Dataset-level content metadata based proximity mining . . . 28

7.3 Chapter 4: Attribute-level content metadata based proximity mining for pre-filtering schema matching . . . 28

7.4 Chapter 5: Automatic categorization of datasets using proximity mining . . . 29

7.5 Chapter 6: Prox-mine tool for browsing DLs using proximity mining . . . 29

2 Instance-level value-based schema matching for computing dataset similarity 31

1 Introduction . . . 33

2 Related Work . . . 35

3 Motivational Case-Study . . . 36

4 A Framework for Content Metadata Management . . . 37

5 The CM4DL Prototype . . . 39

5.1 Prototype Architecture . . . 40

5.2 Ontology Alignment Component . . . 41

5.3 Dataset Comparison Algorithm . . . 43

6 Experiments and Results . . . 45

7 Discussion . . . 48

8 Conclusion and Future Work . . . 49

3 Dataset-level content metadata based proximity mining for computing dataset similarity 51

1 Introduction . . . 53

2 Problem Statement . . . 54

3 Related Work . . . 55

4 The DS-Prox Approach . . . 56

4.1 The Meta-Features Distance Measures . . . 57

4.2 The Approach . . . 58

5 Experimental Evaluation . . . 61

5.1 Datasets . . . 61

5.2 Experimental Setup . . . 62

5.3 Results . . . 63

5.4 Discussion . . . 64

6 Conclusion and Future Work . . . 66

4 Attribute-level content metadata based proximity mining for pre-filtering schema matching 68

1 Introduction . . . 70

2 Related Work . . . 72

3 Preliminaries . . . 75

4 Approach: Metadata-based Proximity Mining for Pre-filtering Schema Matching . . . 79

4.1 Proximity Metrics: Meta-features Distances . . . 79

4.2 Supervised Proximity Mining . . . 81

4.3 Pre-filtering Dataset Pairs for Schema Matching . . . 88

5 Experimental Evaluation . . . 88

5.1 Datasets . . . 89

5.2 Evaluation Metrics . . . 92

5.3 Experiment 1: Attribute-level Models . . . 94

5.4 Experiment 2: Dataset-level Models . . . 95

5.5 Experiment 3: Computational Performance Evaluation . . . 103

5.6 Generalisability . . . 105

6 Conclusion . . . 105

5 Automatic categorization of datasets using proximity mining 107

1 Introduction . . . 109

2 Preliminaries . . . 111

2.1 Proximity Mining: Meta-features Metrics and Models . . . 113

3 DS-kNN: a Proximity Mining Based k-Nearest-Neighbour Algorithm for Categorizing Datasets . . . 115

4 Experimental Evaluation . . . 117

4.1 Dataset: OpenML DL Ground-truth . . . 117

4.2 Experimental Setup . . . 118

4.3 Results . . . 121

4.4 Validation Experiment . . . 131

5 Related Work . . . 133

6 Conclusion . . . 134

6 Prox-mine tool for browsing DLs using proximity mining 135

1 Introduction . . . 136

2 Data Lake Index . . . 137

3 Similarity Search . . . 137

4 Dataset Categorization . . . 138

5 Dataset Matching . . . 140

5.1 New Dataset Matching . . . 141

6 Proximity Graph . . . 143

7 Conclusions and Future Directions 148

1 Conclusions . . . 149

2 Future Directions . . . 150

Bibliography 152

References . . . 152

List of Figures

1.1 A flat structured dataset consisting of tabular data organised as attributes and instances . . . 3

1.2 DL governance classification and tasks . . . 4

1.3 An example of schema matching and dataset similarity computation . . . 11

1.4 Classification of schema matching techniques . . . 13

1.5 Example of a decision tree model for classification of related dataset pairs . . . 17

1.6 An ensemble of decision trees for classification . . . 18

1.7 Overview of the proposed proximity mining framework . . . 21

1.8 The proximity mining metadata management process . . . 22

1.9 A proximity graph showing topic-wise groupings of interlinked datasets in the DL with their similarity scores . . . 24

1.10 DL content metadata management and analysis process . . . 25

2.1 The Metadata Management BPMN Process Model . . . 37

2.2 The EXP01 Metadata Exploitation Sub-Process Model . . . 38

2.3 CM4DL System Architecture . . . 40

2.4 Performance analysis of CM4DL in the OpenML experiments . . . 47

3.1 Similarity relationships between two pairs of datasets . . . 55

3.2 DS-Prox: supervised machine learning . . . 58

3.3 DS-Prox cut-off thresholds tuning . . . 60

3.4 Recall-efficiency plots (left column) and recall-precision plots (right column) for experiments 1, 2, 3 and 4 in each row . . . 67

4.1 The stratified holistic schema matching approach at different levels of granularity . . . 71

4.2 The dependencies of components in the metadata-based proximity mining approach for pre-filtering schema matching . . . 76

4.3 Final output of our approach consisting of similarity relationships between two pairs of datasets . . . 78

4.4 An overview of the process to build the supervised ensemble models in our proposed datasets proximity mining approach using previously manually annotated dataset pairs . . . 82

4.5 Proximity Mining: supervised machine learning for predicting related data objects . . . 83

4.6 Different normal distributions for assigning weights to ranked attribute linkages . . . 86

4.7 An overview of the process to apply the learnt supervised models in our approach for pre-filtering previously unseen dataset pairs independent of the build process . . . 87

4.8 A 10-fold cross-validation experimental setup consisting of alternating folds in training and test roles. Image adapted from the book: Raschka S (2015) Python Machine Learning. 1st Edition. Packt Publishing, Birmingham, UK . . . 97

4.9 Classification accuracy from 10-fold cross-validation of dataset pairs pre-filtering models . . . 98

4.10 Kappa statistic from 10-fold cross-validation of dataset pairs pre-filtering models . . . 98

4.11 ROC statistic from 10-fold cross-validation of dataset pairs pre-filtering models . . . 98

4.12 Recall against efficiency gain for the different supervised models . . . 100

4.13 Recall against precision for the different supervised models . . . 100

4.14 Recall against efficiency gain for the different metric types . . . 100

4.15 Recall against precision for the different metric types . . . 100

5.1 A visualisation of the output from DS-kNN data lake (DL) categorization. A proximity graph shows the datasets as nodes and the proximity scores as edges between nodes. Fig. (a) complete DL and Fig. (b) a zoomed-in view highlighted by the red box in (a) . . . 110

5.2 The data lake categorization scenario using k-NN proximity mining . . . 113

5.3 Performance of DS-kNN using k=1, different models, different ground-truths, and different category sizes . . . 122

5.4 Performance of DS-kNN using k=3, different models, different ground-truths, and different category sizes . . . 123

5.5 Performance of DS-kNN using k=5, different models, different ground-truths, and different category sizes . . . 124

5.6 Performance of DS-kNN using k=7, different models, different ground-truths, and different category sizes . . . 125

6.1 The input screen for the similarity search component of Prox-mine . . . 138

6.2 The output screen for the similarity search component of Prox-mine . . . 138

6.3 The input screen for the dataset categorization component of Prox-mine . . . 139

6.4 The output screens for the dataset categorization component of Prox-mine . . . 140

6.5 The input screen for the dataset matching component of Prox-mine . . . 141

6.6 The output screens for the dataset matching component of Prox-mine . . . 142

6.7 The input screen for the new dataset matching component of Prox-mine . . . 143

6.8 The output screen for the new dataset matching component of Prox-mine . . . 144

6.9 The input screen for the proximity graph component of Prox-mine . . . 145

6.10 An overview of the output proximity graph component of Prox-mine, where (a) gives a zoomed-out view and (b) gives a zoomed-in view . . . 145

6.11 The search and filtration panel of the output proximity graph component of Prox-mine, where (a) gives a view of the category selector in the left-panel and (b) gives the result of applying the filtration step . . . 146

6.12 The selection of a specific dataset node in the proximity graph and the relationships information panel shown on the right side . . . 147

List of Tables

1.1 Current types of metadata tools for data lakes . . . . 7

2.1 Description of OpenML datasets . . . 36

2.2 Example Cross-dataset Relationships . . . 43

2.3 Results of Manual Annotation . . . 47

3.1 DS-Prox meta-features . . . 57

3.2 A description of the OpenML samples collected . . . 62

3.3 An example of pairs of datasets from the all-topics sample from OpenML . . . 62

3.4 A description of the experiments conducted . . . 63

4.1 Schema matching techniques state-of-the-art comparison . . . . 75

4.2 Schema matching pre-filtering functions . . . 76

4.3 Attribute level content meta-features . . . 80

4.4 Description of the OML01 datasets . . . 90

4.5 Example Cross-dataset Attribute Relationships from OML01 . . 91

4.6 Description of the OML02 datasets . . . 92

4.7 An example of pairs of datasets from the OML02 sample from OpenML . . . 92

4.8 The significance of the Kappa statistic . . . 93

4.9 The significance of the ROC statistic . . . 93

4.10 Performance evaluation of attribute pairs proximity models . . 95

4.11 Spearman rank correlation for the different meta-features. We aggregate minimum (Min.), average (Avg.), maximum (Max.), & standard deviation (Std. Dev.) for different meta-feature types . . . 99

4.12 The standard deviation of each evaluation measure for 10-fold cross-validation of each dataset pairs pre-filtering model, where c_d = 0.5 . . . 101

4.13 The computational performance of our approach vs. the PARIS implementation in terms of time and storage space . . . 104

5.1 A description of the OpenML categorized datasets collected. Datasets are categorized by subject and by entity for the 203 ds sample, or by entity for the 118 ds sample. . . . 119

5.2 The evaluation of DS-kNN for the minimum category size of 1+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores . . . 127

5.3 The evaluation of DS-kNN for the minimum category size of 3+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores . . . 128

5.4 The evaluation of DS-kNN for the minimum category size of 5+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores . . . 129

5.5 The evaluation of DS-kNN for the minimum category size of 8+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores . . . 130

5.6 The evaluation of top performant DS-kNN settings for the minimum category sizes of 1+, 3+ and 8+ with the 118 ds validation sample . . . 132

5.7 The evaluation of specific DS-kNN settings which met specific criteria with the 203 ds sample. We re-validate them with the 118 ds sample . . . 132
