Contents
Abstract vi
Resum viii
Résumé x
Acknowledgements xii
List of Figures . . . xvi
List of Tables . . . xix
Abbreviations xxi
Thesis Details xxii
1 Introduction 1
1 Motivation . . . . 1
2 Background and State-of-the-art . . . . 2
2.1 Data Lakes and Tabular Datasets . . . . 2
2.2 Data Lake Governance . . . . 3
3 Techniques and Challenges . . . 10
3.1 Schema Matching . . . 11
3.2 Dataset Similarity Computation . . . 14
3.3 Similarity Models Learning . . . 15
4 Thesis Objectives and Research Questions . . . 18
5 Thesis Overview . . . 20
5.1 Proximity Mining Framework . . . 20
5.2 DL Categorization . . . 23
5.3 Metadata Query Interface . . . 25
6 Thesis Contributions . . . 25
7 Structure of the Thesis . . . 27
7.1 Chapter 2: Instance-level value-based schema matching for mining proximity between datasets . . . 28
7.2 Chapter 3: Dataset-level content metadata based proximity mining . . . 28
7.3 Chapter 4: Attribute-level content metadata based proximity mining for pre-filtering schema matching . . . 28
7.4 Chapter 5: Automatic categorization of datasets using proximity mining . . . 29
7.5 Chapter 6: Prox-mine tool for browsing DLs using proximity mining . . . 29
2 Instance-level value-based schema matching for computing dataset similarity 31
1 Introduction . . . 33
2 Related Work . . . 35
3 Motivational Case-Study . . . 36
4 A Framework for Content Metadata Management . . . 37
5 The CM4DL Prototype . . . 39
5.1 Prototype Architecture . . . 40
5.2 Ontology Alignment Component . . . 41
5.3 Dataset Comparison Algorithm . . . 43
6 Experiments and Results . . . 45
7 Discussion . . . 48
8 Conclusion and Future Work . . . 49
3 Dataset-level content metadata based proximity mining for computing dataset similarity 51
1 Introduction . . . 53
2 Problem Statement . . . 54
3 Related Work . . . 55
4 The DS-Prox Approach . . . 56
4.1 The Meta-Features Distance Measures . . . 57
4.2 The Approach . . . 58
5 Experimental Evaluation . . . 61
5.1 Datasets . . . 61
5.2 Experimental Setup . . . 62
5.3 Results . . . 63
5.4 Discussion . . . 64
6 Conclusion and Future Work . . . 66
4 Attribute-level content metadata based proximity mining for pre-filtering schema matching 68
1 Introduction . . . 70
2 Related Work . . . 72
3 Preliminaries . . . 75
4 Approach: Metadata-based Proximity Mining for Pre-filtering Schema Matching . . . 79
4.1 Proximity Metrics: Meta-features Distances . . . 79
4.2 Supervised Proximity Mining . . . 81
4.3 Pre-filtering Dataset Pairs for Schema Matching . . . 88
5 Experimental Evaluation . . . 88
5.1 Datasets . . . 89
5.2 Evaluation Metrics . . . 92
5.3 Experiment 1: Attribute-level Models . . . 94
5.4 Experiment 2: Dataset-level Models . . . 95
5.5 Experiment 3: Computational Performance Evaluation . . . 103
5.6 Generalisability . . . 105
6 Conclusion . . . 105
5 Automatic categorization of datasets using proximity mining 107
1 Introduction . . . 109
2 Preliminaries . . . 111
2.1 Proximity Mining: Meta-features Metrics and Models . . . 113
3 DS-kNN: a Proximity Mining Based k-Nearest-Neighbour Algorithm for Categorizing Datasets . . . 115
4 Experimental Evaluation . . . 117
4.1 Dataset: OpenML DL Ground-truth . . . 117
4.2 Experimental Setup . . . 118
4.3 Results . . . 121
4.4 Validation Experiment . . . 131
5 Related Work . . . 133
6 Conclusion . . . 134
6 Prox-mine tool for browsing DLs using proximity mining 135
1 Introduction . . . 136
2 Data Lake Index . . . 137
3 Similarity Search . . . 137
4 Dataset Categorization . . . 138
5 Dataset Matching . . . 140
5.1 New Dataset Matching . . . 141
6 Proximity Graph . . . 143
7 Conclusions and Future Directions 148
1 Conclusions . . . 149
2 Future Directions . . . 150
Bibliography 152
References . . . 152
List of Figures
1.1 A flat structured dataset consisting of tabular data organised as attributes and instances . . . 3
1.2 DL governance classification and tasks . . . 4
1.3 An example of schema matching and dataset similarity computation . . . 11
1.4 Classification of schema matching techniques . . . 13
1.5 Example of a decision tree model for classification of related dataset pairs . . . 17
1.6 An ensemble of decision trees for classification . . . 18
1.7 Overview of the proposed proximity mining framework . . . 21
1.8 The proximity mining metadata management process . . . 22
1.9 A proximity graph showing topic-wise groupings of interlinked datasets in the DL with their similarity scores . . . 24
1.10 DL content metadata management and analysis process . . . 25
2.1 The Metadata Management BPMN Process Model . . . 37
2.2 The EXP01 Metadata Exploitation Sub-Process Model . . . 38
2.3 CM4DL System Architecture . . . 40
2.4 Performance analysis of CM4DL in the OpenML experiments . . . 47
3.1 Similarity relationships between two pairs of datasets . . . 55
3.2 DS-Prox: supervised machine learning . . . 58
3.3 DS-Prox cut-off thresholds tuning . . . 60
3.4 Recall-efficiency plots (left column) and recall-precision plots (right column) for experiments 1, 2, 3 and 4 in each row . . . 67
4.1 The stratified holistic schema matching approach at different levels of granularity. . . . 71
4.2 The dependencies of components in the metadata-based proximity mining approach for pre-filtering schema matching. . . . 76
4.3 Final output of our approach consisting of similarity relationships between two pairs of datasets. . . . 78
4.4 An overview of the process to build the supervised ensemble models in our proposed datasets proximity mining approach using previously manually annotated dataset pairs. . . . 82
4.5 Proximity Mining: supervised machine learning for predicting related data objects. . . . 83
4.6 Different normal distributions for assigning weights to ranked attribute linkages. . . . 86
4.7 An overview of the process to apply the learnt supervised models in our approach for pre-filtering previously unseen dataset pairs independent of the build process. . . . 87
4.8 A 10-fold cross-validation experimental setup consisting of alternating folds in training and test roles. Image adapted from the book: Raschka S (2015) Python Machine Learning. 1st Edition. Packt Publishing, Birmingham, UK. . . . 97
4.9 Classification accuracy from 10-fold cross-validation of dataset pairs pre-filtering models. . . . 98
4.10 Kappa statistic from 10-fold cross-validation of dataset pairs pre-filtering models. . . . 98
4.11 ROC statistic from 10-fold cross-validation of dataset pairs pre-filtering models. . . . 98
4.12 Recall against efficiency gain for the different supervised models. . . . 100
4.13 Recall against precision for the different supervised models. . . . 100
4.14 Recall against efficiency gain for the different metric types. . . . 100
4.15 Recall against precision for the different metric types. . . . 100
5.1 A visualisation of the output from DS-kNN data lake (DL) categorization. A proximity graph shows the datasets as nodes and the proximity scores as edges between nodes. Fig. (a) complete DL and Fig. (b) a zoomed-in view highlighted by the red box in (a) . . . 110
5.2 The data lake categorization scenario using k-NN proximity mining . . . 113
5.3 Performance of DS-kNN using k=1, different models, different ground-truths, and different category sizes . . . 122
5.4 Performance of DS-kNN using k=3, different models, different ground-truths, and different category sizes . . . 123
5.5 Performance of DS-kNN using k=5, different models, different ground-truths, and different category sizes . . . 124
5.6 Performance of DS-kNN using k=7, different models, different ground-truths, and different category sizes . . . 125
6.1 The input screen for the similarity search component of Prox-mine . . . 138
6.2 The output screen for the similarity search component of Prox-mine . . . 138
6.3 The input screen for the dataset categorization component of Prox-mine . . . 139
6.4 The output screens for the dataset categorization component of Prox-mine. . . . 140
6.5 The input screen for the dataset matching component of Prox-mine . . . 141
6.6 The output screens for the dataset matching component of Prox-mine. . . . 142
6.7 The input screen for the new dataset matching component of Prox-mine . . . 143
6.8 The output screen for the new dataset matching component of Prox-mine . . . 144
6.9 The input screen for the proximity graph component of Prox-mine . . . 145
6.10 An overview of the output proximity graph component of Prox-mine, where (a) gives a zoomed-out view and (b) gives a zoomed-in view . . . 145
6.11 The search and filtration panel of the output proximity graph component of Prox-mine, where (a) gives a view of the category selector in the left-panel and (b) gives the result of applying the filtration step . . . 146
6.12 The selection of a specific dataset node in the proximity graph and the relationships information panel shown on the right side . . . 147
List of Tables
1.1 Current types of metadata tools for data lakes . . . . 7
2.1 Description of OpenML datasets . . . 36
2.2 Example Cross-dataset Relationships . . . 43
2.3 Results of Manual Annotation . . . 47
3.1 DS-Prox meta-features . . . 57
3.2 A description of the OpenML samples collected . . . 62
3.3 An example of pairs of datasets from the all-topics sample from OpenML . . . 62
3.4 A description of the experiments conducted . . . 63
4.1 Schema matching techniques state-of-the-art comparison . . . . 75
4.2 Schema matching pre-filtering functions . . . 76
4.3 Attribute level content meta-features . . . 80
4.4 Description of the OML01 datasets . . . 90
4.5 Example Cross-dataset Attribute Relationships from OML01 . . 91
4.6 Description of the OML02 datasets . . . 92
4.7 An example of pairs of datasets from the OML02 sample from OpenML . . . 92
4.8 The significance of the Kappa statistic . . . 93
4.9 The significance of the ROC statistic . . . 93
4.10 Performance evaluation of attribute pairs proximity models . . 95
4.11 Spearman rank correlation for the different meta-features. We aggregate minimum (Min.), average (Avg.), maximum (Max.), & standard deviation (Std. Dev.) for different meta-feature types. . . . 99
4.12 The standard deviation of each evaluation measure for 10-fold cross-validation of each dataset pairs pre-filtering model, where c_d = 0.5 . . . 101
4.13 The computational performance of our approach vs. the PARIS implementation in terms of time and storage space . . . 104
5.1 A description of the OpenML categorized datasets collected. Datasets are categorized by subject and by entity for the 203 ds sample, or by entity for the 118 ds sample. . . . 119
5.2 The evaluation of DS-kNN for the minimum category size of 1+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores. . . . 127