Contents
Curriculum Vitae vii
Abstract ix
Thesis Details xix
1 Introduction 1
1 Background and Motivation . . . . 1
2 The Lifecycle of Data-intensive Flows . . . . 3
2.1 Research Problems and Challenges . . . . 4
3 Structure of the Thesis . . . . 7
4 Thesis Overview . . . . 8
4.1 Chapter 2: A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey (The State of the Art) . . . . 10
4.2 Chapter 3: Incremental Consolidation of Data-Intensive Multi-flows (Data Flow Integrator) . . . . 11
4.3 Chapter 4: A Requirement-Driven Approach to the Design and Evolution of Data Warehouses (Target Schema Integrator) . . . . 12
4.4 Chapter 5: Engine Independence for Logical Analytic Flows (Data Flow Deployer) . . . . 14
4.5 Chapter 6: Supporting Job Scheduling with Workload-driven Data Redistribution (Data Flow Scheduler) . . . 15
5 Contributions . . . . 16
2 A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey 19
1 Introduction . . . . 20
2 Example Scenario . . . . 23
3 Methodology . . . . 25
3.1 Selection process . . . . 26
3.2 Phase I (Outlining the study setting). . . . 26
3.3 Phase II (Analyzing the characteristics of data-intensive flows). . . . 29
3.4 Phase III (Classification of the reviewed literature). . . 30
4 Defining dimensions for studying data-intensive flows . . . . 31
4.1 Data Extraction . . . . 31
4.2 Data Transformation . . . . 33
4.3 Data Delivery . . . . 34
4.4 Optimization of data-intensive flows . . . . 34
5 Data Extraction . . . . 35
5.1 Structuredness . . . . 35
5.2 Coupledness . . . . 36
5.3 Accessibility . . . . 38
5.4 Discussion . . . . 38
6 Data Transformation . . . . 40
6.1 Malleability . . . . 40
6.2 Constraintness . . . . 42
6.3 Automation . . . . 43
6.4 Discussion . . . . 45
7 Data Delivery . . . . 46
7.1 Interactivity . . . . 46
7.2 Openness . . . . 48
7.3 Discussion . . . . 50
8 Optimization of data-intensive flows . . . . 50
8.1 Optimization input . . . . 50
8.2 Dynamicity . . . . 52
8.3 Discussion . . . . 52
9 Overall Discussion . . . . 53
9.1 Architecture for managing the lifecycle of data-intensive flows in next generation BI systems . . . . 54
10 Conclusions . . . . 58
3 Incremental Consolidation of Data-Intensive Multi-flows 59
1 Introduction . . . . 60
2 Overview . . . . 62
2.1 Running Example . . . . 62
2.2 Preliminaries and Notation . . . . 64
2.3 Problem Statement . . . . 68
3 Data Flow Consolidation Challenges . . . . 70
3.1 Operation reordering . . . . 71
3.2 Operations comparison . . . . 74
4 Consolidation Algorithm . . . . 75
4.1 Computational complexity . . . . 81
5 Evaluation . . . . 83
5.1 Prototype . . . . 83
5.2 Experimental setup . . . . 83
5.3 Scrutinizing CoAl . . . . 84
6 Related Work . . . . 87
7 Conclusions and Future Work . . . . 89
8 Acknowledgments . . . . 89
4 A Requirement-Driven Approach to the Design and Evolution of Data Warehouses 91
1 Introduction . . . . 92
2 Overview of our Approach . . . . 95
2.1 Running example . . . . 95
2.2 Formalizing Information Requirements . . . . 96
2.3 Formalizing the Problem . . . 100
2.4 ORE in a Nutshell . . . 102
3 Traceability Metadata . . . 106
4 The ORE Approach . . . 109
4.1 Matching facts . . . 111
4.2 Matching dimensions . . . 113
4.3 Complementing the MD design . . . 115
4.4 Integration . . . 116
5 Theoretical Validation . . . 118
5.1 Soundness and Completeness . . . 118
5.2 Commutativity and Associativity . . . 122
5.3 Computational complexity . . . 122
6 Evaluation . . . 124
6.1 Prototype . . . 125
6.2 Output validation . . . 126
6.3 Experimental setup . . . 127
6.4 Scrutinizing ORE . . . 128
6.5 The LEARN-SQL Case Study . . . 133
7 Related Work . . . 140
8 Conclusions and Future Work . . . 142
9 Acknowledgements . . . 143
5 Engine Independence for Logical Analytic Flows 145
1 Introduction . . . 146
2 Problem Formalization . . . 147
2.1 Preliminaries . . . 147
2.2 Logical and physical flows . . . 148
2.3 Normalized flow . . . 148
2.4 Dictionary . . . 149
2.5 Conversion process . . . 149
2.6 Problem statements . . . 150
3 Architecture . . . 151
3.1 System overview . . . 151
3.2 Example . . . 152
3.3 Flow encoding . . . 152
3.4 Dictionary . . . 154
3.5 Error handling . . . 155
4 Physical to Logical . . . 156
4.1 Single flow . . . 156
4.2 Multi-flow import . . . 158
5 Flow Processor . . . 161
6 Logical to Physical . . . 162
6.1 Creating an engine specific flow . . . 162
6.2 Code generation . . . 164
7 Evaluation . . . 167
7.1 Preliminaries . . . 167
7.2 Experiments . . . 168
8 Related Work . . . 172
9 Conclusions . . . 173
6 H-WorD: Supporting Job Scheduling in Hadoop with Workload-driven Data Redistribution 175
1 Introduction . . . 176
2 Running Example . . . 178
3 The Problem of Skewed Data Distribution . . . 179
4 Workload-driven Redistribution of Data . . . 181
4.1 Resource requirement framework . . . 181
4.2 Execution modes of map tasks . . . 182
4.3 Workload estimation . . . 184
4.4 The H-WorD algorithm . . . 185
5 Evaluation . . . 186
6 Related Work . . . 189
7 Conclusions and Future Work . . . 190
8 Acknowledgements . . . 191
7 Conclusions and Future Directions 193
1 Conclusions . . . 193
2 Future Directions . . . 197
Bibliography 199
References . . . 199
Appendices 215
A Quarry: Digging Up the Gems of Your Data Treasury 217
1 Introduction . . . 218
2 Demonstrable Features . . . 219
2.1 Requirements Elicitor . . . 221
2.2 Requirements Interpreter . . . 221
2.3 Design Integrator . . . 221
2.4 Design Deployer . . . 223
2.5 Communication & Metadata Layer . . . 223
2.6 Implementation details . . . 224
3 Demonstration . . . 224
4 Acknowledgements . . . 225