Abstract iii

Resumé v

Résumé vii

I Thesis Summary 1

1 Introduction . . . 3

1.1 Thesis Structure . . . 5

2 Scalability of Parallel Secondo . . . 6

2.1 Background . . . 6

2.2 Datasets . . . 8

2.3 Experimental Setup . . . 8

2.4 Results . . . 8

2.5 Discussion . . . 9

3 The Convoy Pattern . . . 11

4 Partitioning Strategies . . . 12

4.1 Object-based Partitioning . . . 13

4.2 Spatial Partitioning . . . 13

4.3 Temporal Partitioning . . . 15

5 Distributed Convoy Mining . . . 16

5.1 Local Convoy Mining DCM_part . . . 17

5.2 Global Merge with DCM_merge . . . 18

6 k/2-Hop Algorithm for Convoy Pattern Mining . . . 20

6.1 Benchmark Points and Clusters . . . 20

6.2 Candidate Clusters . . . 22

6.3 Mining Convoys in a Hop-Window . . . 23

6.4 Finding Maximal Spanning Convoys . . . 24

6.5 Extending the Maximal Spanning Convoys . . . 24

6.6 Data Storage Back-end for k/2-hop Algorithm . . . 24

7 Streaming k/2-hop Algorithm for Convoy Mining . . . 25

7.1 Background . . . 25

7.2 Algorithm . . . 26

8 Summary of Contributions . . . 30

9 Future Work Directions . . . 32

References . . . 32

II Papers 35

A Using T-Drive and BerlinMOD in Parallel Secondo for Performance Evaluation of Geospatial Big Data Processing 37

1 Introduction . . . 39

2 Related Work . . . 40

3 Methodology . . . 41

3.1 Datasets . . . 42

3.2 Parallel Secondo . . . 43

3.3 Hardware Computer System . . . 43

4 Results and Discussions . . . 45

4.1 T-drive Performance . . . 45

4.2 BerlinMOD Performance . . . 47

5 Conclusions and Future Work . . . 53

References . . . 53

B Towards Distributed Convoy Pattern Mining 57

1 Introduction . . . 59

3 The Convoy Pattern . . . 60

4 Problem Overview . . . 61

4.1 Object Based Partitioning . . . 62

4.2 Spatial Partitioning . . . 62

4.3 Temporal Partitioning . . . 66

5 Conclusion . . . 67

References . . . 67

C Distributed Convoy Pattern Mining 69

1 Introduction . . . 71

5 Local Convoy Mining DCM_part . . . 78

6 Global Merge with DCM_merge . . . 83

7 Cost Complexity . . . 86

8 The Hadoop Implementation (DCM_MR) . . . 89

9 Experimental Evaluation . . . 91

9.1 Data Preparation and Parameter Setting . . . 91

9.2 Results . . . 92

10 Conclusion and Future Work . . . 94

References . . . 95

D Distributed Mining of Convoys in Large Scale Datasets 97

1 Introduction . . . 99

3 Convoy Mining . . . 105

4 Partitioning Strategies . . . 108

5 Local Convoy Mining DCM_part . . . 109

5.1 Closed Convoys: V_C . . . 110

5.2 Left-Open Convoys: V_L . . . 110

5.3 Right-Open Convoys: V_R . . . 111

5.4 Left-Right-Open Convoys: V_LR . . . 112

5.5 Convoy Types Disjointness . . . 112

5.6 The Algorithm DCM_part . . . 112

6 Global Merge with DCM_merge . . . 114

7 The Hadoop MapReduce Implementation (DCM_MR) . . . 117

8 Theoretical Analysis of DCM_MR . . . 118

8.1 Cost Complexity . . . 119

8.2 Effect of Partition Size . . . 120

8.3 Cluster Utilization . . . 123

8.4 Discussion on the Choice of Framework . . . 124

9 Experimental Evaluation . . . 125

9.1 Infrastructure . . . 125

9.2 Data Preparation and Parameter Setting . . . 126

9.3 Results: Single Machine Experiments (Hardware Setup A) . . . 127

9.4 Results: Scale-up Experiments (Hardware Setup B) . . . 129

9.5 Results: Scale-out Experiments (Hardware Setup C) . . . 130

9.6 Results: Experiments DCM_MR Analysis . . . 130

11 Appendix . . . 138

References . . . 141

E k/2-hop: Fast Mining of Convoy Patterns With Effective Pruning 147

1 Introduction . . . 149

3 Convoy Mining Problem . . . 153

3.1 Density-based Clustering . . . 153

3.2 Convoys . . . 154

4 k/2-Hop Algorithm . . . 158

4.3 Hop-Window Mining Tree (HWMT) . . . 160

4.4 Finding Maximal Spanning Convoys . . . 164

4.5 Extending Maximal Spanning Convoys . . . 166

4.6 Mining Fully Connected Convoys . . . 168

4.7 Proof of Correctness . . . 169

5 Persistent Storage Structure . . . 171

5.1 Relational Data Storage . . . 172

5.2 Log-Structured Merge-Tree . . . 172

6 Experiments . . . 177

6.1 Setups . . . 177

6.2 Data Sets . . . 178

6.3 Results . . . 179

8 Appendix . . . 188

8.1 Proof of Lemma 27 . . . 188

8.2 Proof of Lemma 28 . . . 189

8.3 Proof of Lemma 29 . . . 189

8.4 Proof of Lemma 30 . . . 189

8.5 Proof of Lemma 11 . . . 190

8.6 Proof of Correctness . . . 190

F Streaming k/2-hop: Fast and Real-time Mining of Convoy Patterns With Effective Pruning 195

1 Introduction . . . 197

3 Convoy Mining Problem . . . 201

3.1 Density-based Clustering . . . 201

3.2 Convoys . . . 202

4 k/2-Hop Algorithm . . . 204

5 Streaming k/2-hop Algorithm . . . 208

6 Experiments . . . 214

6.1 Setup for Sequential Algorithms . . . 214

6.2 Data Sets . . . 216

6.3 Results . . . 217