Abstract vi


datasets in the DL with their similarity scores . . . 24

1.10 DL content metadata management and analysis process . . . 25

2.1 The Metadata Management BPMN Process Model . . . 37

2.2 The EXP01 Metadata Exploitation Sub-Process Model . . . 38

2.3 CM4DL System Architecture . . . 40

2.4 Performance analysis of CM4DL in the OpenML experiments . . . 47

3.1 Similarity relationships between two pairs of datasets . . . 55

3.2 DS-Prox: supervised machine learning . . . 58

3.3 DS-Prox cut-off thresholds tuning . . . 60

3.4 Recall-efficiency plots (left column) and recall-precision plots (right column) for experiments 1, 2, 3 and 4 in each row . . . 67

4.1 The stratified holistic schema matching approach at different levels of granularity. . . . 71

4.2 The dependencies of components in the metadata-based proximity mining approach for pre-filtering schema matching. . . . 76

4.3 Final output of our approach consisting of similarity relationships between two pairs of datasets. . . . 78

4.4 An overview of the process to build the supervised ensemble models in our proposed datasets proximity mining approach using previously manually annotated dataset pairs. . . . 82

4.5 Proximity Mining: supervised machine learning for predicting related data objects. . . . 83

4.6 Different normal distributions for assigning weights to ranked attribute linkages. . . . 86


List of Figures

4.7 An overview of the process to apply the learnt supervised models in our approach for pre-filtering previously unseen dataset pairs independent of the build process. . . . 87

4.8 A 10-fold cross-validation experimental setup consisting of alternating folds in training and test roles. Image adapted from the book: Raschka S (2015) Python Machine Learning. 1st Edition. Packt Publishing, Birmingham, UK. . . . 97

4.9 Classification accuracy from 10-fold cross-validation of dataset pairs pre-filtering models. . . . 98

4.10 Kappa statistic from 10-fold cross-validation of dataset pairs pre-filtering models. . . . 98

4.11 ROC statistic from 10-fold cross-validation of dataset pairs pre-filtering models. . . . 98

4.12 Recall against efficiency gain for the different supervised models. . . . 100

4.13 Recall against precision for the different supervised models. . . . 100

4.14 Recall against efficiency gain for the different metric types. . . . 100

4.15 Recall against precision for the different metric types. . . . 100

5.1 A visualisation of the output from DS-kNN data lake (DL) categorization. A proximity graph shows the datasets as nodes and the proximity scores as edges between nodes. Fig. (a) complete DL and Fig. (b) a zoomed-in view highlighted by the red box in (a) . . . 110

5.2 The data lake categorization scenario using k-NN proximity mining . . . 113

5.3 Performance of DS-kNN using k=1, different models, different ground-truths, and different category sizes . . . 122

5.4 Performance of DS-kNN using k=3, different models, different ground-truths, and different category sizes . . . 123

5.5 Performance of DS-kNN using k=5, different models, different ground-truths, and different category sizes . . . 124

5.6 Performance of DS-kNN using k=7, different models, different ground-truths, and different category sizes . . . 125

6.1 The input screen for the similarity search component of Prox-mine . . . 138

6.2 The output screen for the similarity search component of Prox-mine . . . 138

6.3 The input screen for the dataset categorization component of Prox-mine . . . 139

6.4 The output screens for the dataset categorization component of Prox-mine. . . . 140

6.5 The input screen for the dataset matching component of Prox-mine . . . 141


6.6 The output screens for the dataset matching component of Prox-mine. . . . 142

6.7 The input screen for the new dataset matching component of Prox-mine . . . 143

6.8 The output screen for the new dataset matching component of Prox-mine . . . 144

6.9 The input screen for the proximity graph component of Prox-mine . . . 145

6.10 An overview of the output proximity graph component of Prox-mine, where (a) gives a zoomed-out view and (b) gives a zoomed-in view . . . 145

6.11 The search and filtration panel of the output proximity graph component of Prox-mine, where (a) gives a view of the category selector in the left-panel and (b) gives the result of applying the filtration step . . . 146

6.12 The selection of a specific dataset node in the proximity graph and the relationships information panel shown on the right side . . . 147


List of Tables

1.1 Current types of metadata tools for data lakes . . . . 7

2.1 Description of OpenML datasets . . . 36

2.2 Example Cross-dataset Relationships . . . 43

2.3 Results of Manual Annotation . . . 47

3.1 DS-Prox meta-features . . . 57

3.2 A description of the OpenML samples collected . . . 62

3.3 An example of pairs of datasets from the all-topics sample from OpenML . . . 62

3.4 A description of the experiments conducted . . . 63

4.1 Schema matching techniques state-of-the-art comparison . . . . 75

4.2 Schema matching pre-filtering functions . . . 76

4.3 Attribute level content meta-features . . . 80

4.4 Description of the OML01 datasets . . . 90

4.5 Example Cross-dataset Attribute Relationships from OML01 . . 91

4.6 Description of the OML02 datasets . . . 92

4.7 An example of pairs of datasets from the OML02 sample from OpenML . . . 92

4.8 The significance of the Kappa statistic . . . 93

4.9 The significance of the ROC statistic . . . 93

4.10 Performance evaluation of attribute pairs proximity models . . 95

4.11 Spearman rank correlation for the different meta-features. We aggregate minimum (Min.), average (Avg.), maximum (Max.), & standard deviation (Std. Dev.) for different meta-feature types. . . . 99

4.12 The standard deviation of each evaluation measure for 10-fold cross-validation of each dataset pairs pre-filtering model, where c_d = 0.5 . . . 101

4.13 The computational performance of our approach vs. the PARIS implementation in terms of time and storage space . . . 104

5.1 A description of the OpenML categorized datasets collected. Datasets are categorized by subject and by entity for the 203 ds sample, or by entity for the 118 ds sample. . . . 119

5.2 The evaluation of DS-kNN for the minimum category size of 1+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores. . . . 127


5.3 The evaluation of DS-kNN for the minimum category size of 3+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores. . . . 128

5.4 The evaluation of DS-kNN for the minimum category size of 5+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores. . . . 129

5.5 The evaluation of DS-kNN for the minimum category size of 8+ with the different model types and ground-truth types. For each setting, we only show here the best performing parameters based on F1-scores. . . . 130

5.6 The evaluation of top performant DS-kNN settings for the minimum category sizes of 1+, 3+ and 8+ with the 118 ds validation sample. . . . 132

5.7 The evaluation of specific DS-kNN settings which met specific criteria with the 203 ds sample. We re-validate them with the 118 ds sample. . . . 132


Contents

Abstract vi

Resum viii

Résumé x

Acknowledgements xii

List of Figures . . . xvi

List of Tables . . . xix

Abbreviations xxi

Thesis Details xxii

1 Introduction 1

1 Motivation . . . . 1

2 Background and State-of-the-art . . . . 2

2.1 Data Lakes and Tabular Datasets . . . . 2

2.2 Data Lake Governance . . . . 3

3 Techniques and Challenges . . . 10

3.1 Schema Matching . . . 11

3.2 Dataset Similarity Computation . . . 14

3.3 Similarity Models Learning . . . 15

4 Thesis Objectives and Research Questions . . . 18

5 Thesis Overview . . . 20

5.1 Proximity Mining Framework . . . 20

5.2 DL Categorization . . . 23

5.3 Metadata Query Interface . . . 25

6 Thesis Contributions . . . 25

7 Structure of the Thesis . . . 27

7.1 Chapter 2: Instance-level value-based schema matching for mining proximity between datasets . . . 28

7.2 Chapter 3: Dataset-level content metadata based proximity mining . . . 28

7.3 Chapter 4: Attribute-level content metadata based proximity mining for pre-filtering schema matching . . . 28

7.4 Chapter 5: Automatic categorization of datasets using proximity mining . . . 29

7.5 Chapter 6: Prox-mine tool for browsing DLs using proximity mining . . . 29

2 Instance-level value-based schema matching for computing dataset similarity 31

1 Introduction . . . 33

2 Related Work . . . 35

3 Motivational Case-Study . . . 36

4 A Framework for Content Metadata Management . . . 37

5 The CM4DL Prototype . . . 39

5.1 Prototype Architecture . . . 40

5.2 Ontology Alignment Component . . . 41

5.3 Dataset Comparison Algorithm . . . 43

6 Experiments and Results . . . 45

7 Discussion . . . 48

8 Conclusion and Future Work . . . 49

3 Dataset-level content metadata based proximity mining for computing dataset similarity 51

1 Introduction . . . 53

2 Problem Statement . . . 54

3 Related Work . . . 55

4 The DS-Prox Approach . . . 56

4.1 The Meta-Features Distance Measures . . . 57

4.2 The Approach . . . 58

5 Experimental Evaluation . . . 61

5.1 Datasets . . . 61

5.2 Experimental Setup . . . 62

5.3 Results . . . 63

5.4 Discussion . . . 64

6 Conclusion and Future Work . . . 66

4 Attribute-level content metadata based proximity mining for pre-filtering schema matching 68

1 Introduction . . . 70

2 Related Work . . . 72

3 Preliminaries . . . 75

4 Approach: Metadata-based Proximity Mining for Pre-filtering Schema Matching . . . 79

4.1 Proximity Metrics: Meta-features Distances . . . 79

4.2 Supervised Proximity Mining . . . 81

4.3 Pre-filtering Dataset Pairs for Schema Matching . . . 88

5 Experimental Evaluation . . . 88

5.1 Datasets . . . 89

5.2 Evaluation Metrics . . . 92

5.3 Experiment 1: Attribute-level Models . . . 94

5.4 Experiment 2: Dataset-level Models . . . 95

5.5 Experiment 3: Computational Performance Evaluation . . . 103

5.6 Generalisability . . . 105

6 Conclusion . . . 105

5 Automatic categorization of datasets using proximity mining 107

1 Introduction . . . 109

2 Preliminaries . . . 111

2.1 Proximity Mining: Meta-features Metrics and Models . . . 113

3 DS-kNN: a Proximity Mining Based k-Nearest-Neighbour Algorithm for Categorizing Datasets . . . 115

4 Experimental Evaluation . . . 117

4.1 Dataset: OpenML DL Ground-truth . . . 117

4.2 Experimental Setup . . . 118

4.3 Results . . . 121

4.4 Validation Experiment . . . 131

5 Related Work . . . 133

6 Conclusion . . . 134

6 Prox-mine tool for browsing DLs using proximity mining 135

1 Introduction . . . 136

2 Data Lake Index . . . 137