Abstract iii
Resumé v
Résumé vii
I Thesis Summary 1
1 Introduction . . . 3
1.1 Thesis Structure . . . 5
2 Scalability of Parallel Secondo . . . 6
2.1 Background . . . 6
2.2 Datasets . . . 8
2.3 Experimental Setup . . . 8
2.4 Results . . . 8
2.5 Discussion . . . 9
3 The Convoy Pattern . . . 11
4 Partitioning Strategies . . . 12
4.1 Object-based Partitioning . . . 13
4.2 Spatial Partitioning . . . 13
4.3 Temporal Partitioning . . . 15
4.4 Discussion . . . 16
5 Distributed Convoy Mining . . . 16
5.1 Local Convoy Mining DCMpart . . . 17
5.2 Global Merge with DCMmerge . . . 18
5.3 Discussion . . . 19
6 k/2-Hop Algorithm for Convoy Pattern Mining . . . 20
6.1 Benchmark Points and Clusters . . . 20
6.2 Candidate Clusters . . . 22
6.3 Mining Convoys in a Hop-Window . . . 23
6.4 Finding Maximal Spanning Convoys . . . 24
6.5 Extending the Maximal Spanning Convoys . . . 24
6.6 Data Storage Back-end for k/2-hop Algorithm . . . 24
6.7 Discussion . . . 25
7 Streaming k/2-hop Algorithm for Convoy Mining . . . 25
7.1 Background . . . 25
7.2 Algorithm . . . 26
7.3 Discussion . . . 28
8 Summary of Contributions . . . 30
9 Future Work Directions . . . 32
References . . . 32
II Papers 35
A Using T-Drive and BerlinMOD in Parallel Secondo for Performance Evaluation Of Geospatial Big Data Processing 37
1 Introduction . . . 39
2 Related Work . . . 40
3 Methodology . . . 41
3.1 Datasets . . . 42
3.2 Parallel Secondo . . . 43
3.3 Hardware Computer System . . . 43
4 Results and Discussions . . . 45
4.1 T-Drive Performance . . . 45
4.2 BerlinMOD Performance . . . 47
5 Conclusions and Future Work . . . 53
References . . . 53
B Towards Distributed Convoy Pattern Mining 57
1 Introduction . . . 59
2 Related Work . . . 59
3 The Convoy Pattern . . . 60
4 Problem Overview . . . 61
4.1 Object Based Partitioning . . . 62
4.2 Spatial Partitioning . . . 62
4.3 Temporal Partitioning . . . 66
5 Conclusion . . . 67
References . . . 67
C Distributed Convoy Pattern Mining 69
1 Introduction . . . 71
2 Related Work . . . 72
5 Local Convoy Mining DCMpart . . . 78
6 Global Merge with DCMmerge . . . 83
7 Cost Complexity . . . 86
8 The Hadoop Implementation (DCMMR) . . . 89
9 Experimental Evaluation . . . 91
9.1 Data Preparation and Parameter Setting . . . 91
9.2 Results . . . 92
10 Conclusion and Future Work . . . 94
References . . . 95
D Distributed Mining of Convoys in Large Scale Datasets 97
1 Introduction . . . 99
2 Related Work . . . 102
3 Convoy Mining . . . 105
4 Partitioning Strategies . . . 108
5 Local Convoy Mining DCMpart . . . 109
5.1 Closed Convoys: VC . . . 110
5.2 Left-Open Convoys: VL . . . 110
5.3 Right-Open Convoys: VR . . . 111
5.4 Left-Right-Open Convoys: VLR . . . 112
5.5 Convoy Types Disjointness . . . 112
5.6 The Algorithm DCMpart . . . 112
6 Global Merge with DCMmerge . . . 114
7 The Hadoop MapReduce Implementation (DCMMR) . . . 117
8 Theoretical Analysis of DCMMR . . . 118
8.1 Cost Complexity . . . 119
8.2 Effect of Partition Size . . . 120
8.3 Cluster Utilization . . . 123
8.4 Discussion on the Choice of Framework . . . 124
9 Experimental Evaluation . . . 125
9.1 Infrastructure . . . 125
9.2 Data Preparation and Parameter Setting . . . 126
9.3 Results: Single Machine Experiments (Hardware Setup A) . . . 127
9.4 Results: Scale-up Experiments (Hardware Setup B) . . . 129
9.5 Results: Scale-out Experiments (Hardware Setup C) . . . 130
9.6 Results: Experiments DCMMR Analysis . . . 130
10 Conclusion and Future Work . . . 137
11 Appendix . . . 138
References . . . 141
E k/2-hop: Fast Mining of Convoy Patterns With Effective Pruning 147
1 Introduction . . . 149
2 Related Work . . . 150
3 Convoy Mining Problem . . . 153
3.1 Density-based Clustering . . . 153
3.2 Convoys . . . 154
4 k/2-Hop Algorithm . . . 158
4.1 Benchmark Points and Clusters . . . 158
4.2 Candidate Clusters . . . 160
4.3 Hop-Window Mining Tree (HWMT) . . . 160
4.4 Finding Maximal Spanning Convoys . . . 164
4.5 Extending Maximal Spanning Convoys . . . 166
4.6 Mining Fully Connected Convoys . . . 168
4.7 Proof of Correctness . . . 169
5 Persistent Storage Structure . . . 171
5.1 Relational Data Storage . . . 172
5.2 Log-Structured Merge-Tree . . . 172
6 Experiments . . . 177
6.1 Setups . . . 177
6.2 Data Sets . . . 178
6.3 Results . . . 179
7 Conclusion and Future Work . . . 188
8 Appendix . . . 188
8.1 Proof of Lemma 27 . . . 188
8.2 Proof of Lemma 28 . . . 189
8.3 Proof of Lemma 29 . . . 189
8.4 Proof of Lemma 30 . . . 189
8.5 Proof of Lemma 11 . . . 190
8.6 Proof of Correctness . . . 190
References . . . 192
F Streaming k/2-hop: Fast and Real-time Mining of Convoy Patterns With Effective Pruning 195
1 Introduction . . . 197
2 Related Work . . . 199
3 Convoy Mining Problem . . . 201
3.1 Density-based Clustering . . . 201
3.2 Convoys . . . 202
4 k/2-Hop Algorithm . . . 204
4.1 Benchmark Points and Clusters . . . 205
4.2 Candidate Clusters . . . 206
5 Streaming k/2-hop Algorithm . . . 208
6 Experiments . . . 214
6.1 Setup for Sequential Algorithms . . . 214
6.2 Data Sets . . . 216
6.3 Results . . . 217
7 Conclusion and Future Work . . . 222
References . . . 222