Contents

Curriculum Vitae vii

Abstract ix

Thesis Details xix

1 Introduction 1

1 Background and Motivation . . . . 1

2 The Lifecycle of Data-intensive Flows . . . . 3

2.1 Research Problems and Challenges . . . . 4

3 Structure of the Thesis . . . . 7

4 Thesis Overview . . . . 8

4.1 Chapter 2: A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey (The State of the Art) . . . . 10

4.2 Chapter 3: Incremental Consolidation of Data-Intensive Multi-flows (Data Flow Integrator) . . . . 11

4.3 Chapter 4: A Requirement-Driven Approach to the Design and Evolution of Data Warehouses (Target Schema Integrator) . . . . 12

4.4 Chapter 5: Engine Independence for Logical Analytic Flows (Data Flow Deployer) . . . . 14

4.5 Chapter 6: Supporting Job Scheduling with Workload-driven Data Redistribution (Data Flow Scheduler) . . . 15

5 Contributions . . . . 16

2 A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey 19

1 Introduction . . . . 20

2 Example Scenario . . . . 23

3 Methodology . . . . 25

3.1 Selection process . . . . 26

3.2 Phase I (Outlining the study setting). . . . 26

3.3 Phase II (Analyzing the characteristics of data-intensive flows). . . . 29

3.4 Phase III (Classification of the reviewed literature). . . 30

4 Defining dimensions for studying data-intensive flows . . . . 31

4.1 Data Extraction . . . . 31

4.2 Data Transformation . . . . 33

4.3 Data Delivery . . . . 34

4.4 Optimization of data-intensive flows . . . . 34

5 Data Extraction . . . . 35

5.1 Structuredness . . . . 35

5.2 Coupledness . . . . 36

5.3 Accessibility . . . . 38

5.4 Discussion . . . . 38

6 Data Transformation . . . . 40

6.1 Malleability . . . . 40

6.2 Constraintness . . . . 42

6.3 Automation . . . . 43

7 Data Delivery . . . . 46

7.1 Interactivity . . . . 46

7.2 Openness . . . . 48

8 Optimization of data-intensive flows . . . . 50

8.1 Optimization input . . . . 50

8.2 Dynamicity . . . . 52

9 Overall Discussion . . . . 53

9.1 Architecture for managing the lifecycle of data-intensive flows in next generation BI systems . . . . 54

10 Conclusions . . . . 58

3 Incremental Consolidation of Data-Intensive Multi-flows 59

1 Introduction . . . . 60

2 Overview . . . . 62

2.1 Running Example . . . . 62

2.2 Preliminaries and Notation . . . . 64

2.3 Problem Statement . . . . 68

3 Data Flow Consolidation Challenges . . . . 70

3.1 Operation reordering . . . . 71

3.2 Operations comparison . . . . 74

4 Consolidation Algorithm . . . . 75

4.1 Computational complexity . . . . 81

5 Evaluation . . . . 83

5.1 Prototype . . . . 83

5.2 Experimental setup . . . . 83

5.3 Scrutinizing CoAl . . . . 84

6 Related Work . . . . 87

7 Conclusions and Future Work . . . . 89

8 Acknowledgments . . . . 89

4 A Requirement-Driven Approach to the Design and Evolution of Data Warehouses 91

1 Introduction . . . . 92

2 Overview of our Approach . . . . 95

2.1 Running example . . . . 95

2.2 Formalizing Information Requirements . . . . 96

2.3 Formalizing the Problem . . . 100

2.4 ORE in a Nutshell . . . 102

3 Traceability Metadata . . . 106

4 The ORE Approach . . . 109

4.1 Matching facts . . . 111

4.2 Matching dimensions . . . 113

4.3 Complementing the MD design . . . 115

4.4 Integration . . . 116

5 Theoretical Validation . . . 118

5.1 Soundness and Completeness . . . 118

5.2 Commutativity and Associativity . . . 122

5.3 Computational complexity . . . 122

6 Evaluation . . . 124

6.1 Prototype . . . 125

6.2 Output validation . . . 126

6.3 Experimental setup . . . 127

6.4 Scrutinizing ORE . . . 128

6.5 The LEARN-SQL Case Study . . . 133

7 Related Work . . . 140

8 Conclusions and Future Work . . . 142

9 Acknowledgements . . . 143

5 Engine Independence for Logical Analytic Flows 145

1 Introduction . . . 146

2 Problem Formalization . . . 147

2.1 Preliminaries . . . 147

2.2 Logical and physical flows . . . 148

2.3 Normalized flow . . . 148

2.4 Dictionary . . . 149

2.5 Conversion process . . . 149

2.6 Problem statements . . . 150

3 Architecture . . . 151

3.1 System overview . . . 151

3.2 Example . . . 152

3.3 Flow encoding . . . 152

3.4 Dictionary . . . 154

3.5 Error handling . . . 155

4 Physical to Logical . . . 156

4.1 Single flow . . . 156

4.2 Multi-flow import . . . 158

5 Flow Processor . . . 161

6 Logical to Physical . . . 162

6.1 Creating an engine specific flow . . . 162

6.2 Code generation . . . 164

7.1 Preliminaries . . . 167

7.2 Experiments . . . 168

9 Conclusions . . . 173

6 H-WorD: Supporting Job Scheduling in Hadoop with Workload-driven Data Redistribution 175

1 Introduction . . . 176

2 Running Example . . . 178

3 The Problem of Skewed Data Distribution . . . 179

4 Workload-driven Redistribution of Data . . . 181

4.1 Resource requirement framework . . . 181

4.2 Execution modes of map tasks . . . 182

4.3 Workload estimation . . . 184

4.4 The H-WorD algorithm . . . 185

7 Conclusions and Future Work . . . 190

7 Conclusions and Future Directions 193

1 Conclusions . . . 193

2 Future Directions . . . 197

Bibliography 199

References . . . 199

Appendices 215

A Quarry: Digging Up the Gems of Your Data Treasury 217

1 Introduction . . . 218

2 Demonstrable Features . . . 219

2.1 Requirements Elicitor . . . 221

2.2 Requirements Interpreter . . . 221

2.3 Design Integrator . . . 221

2.4 Design Deployer . . . 223

2.5 Communication & Metadata Layer . . . 223

2.6 Implementation details . . . 224

3 Demonstration . . . 224