Trajectory Query Processing and Optimization

From the document Mobility, Data Mining and Privacy (pages 167-173)

Trajectory Database Systems

6.4 Trajectory Query Processing and Optimization

Since spatiotemporal query types are guided by existing work in the domain of spatial querying, it is expected that the majority of the proposed algorithms for trajectory query processing are also extensions of algorithms already employed in the context of spatial databases. For example, the spatiotemporal range query algorithm involving both spatial and temporal components, in the R-tree-like structures storing historical trajectory information, is a straightforward generalization of the original FindLeaf algorithm presented in [27] to the three-dimensional space. For a more detailed discussion on the definitions of spatiotemporal query types, the interested reader may refer to the previous chapter, which also contains comprehensive examples. Following also from the previous chapter, the query types we will deal with in the context of query processing are range, nearest neighbor, join, similarity, and trajectory-based queries.

6.4.1 Range Search

160 E. Frentzos et al.

The majority of the aforementioned spatiotemporal indexes provide a range search algorithm exploiting both the spatial and temporal dimensions. As already mentioned, since most of them are based on the well-known R-tree, the respective range search algorithm follows the one presented in [27]. Following the example illustrated in Fig. 6.3 for spatial objects, consider a range query Q executed against the two-dimensional R-tree. The algorithm starts by visiting the tree root, checking whether the MBBs of the root entries overlap Q. If a node entry MBB overlaps Q, the algorithm follows the pointer to the corresponding child node (in our case entries A and B), where it recursively repeats the same task. If the algorithm reaches a leaf node, leaf entries are examined against Q and, if their MBBs overlap it, the algorithm reports their ids (objects F and G when the algorithm visits leaf node A, and object H when in node B). The extension of the above algorithm to the spatiotemporal domain is a straightforward task, where each two-dimensional MBB is simply replaced by the respective three-dimensional MBB of actual objects, nodes, or queries.
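The recursive descent described above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of [27] verbatim: the `Node` class, the `overlaps` helper, and the toy tree (chosen only so that it mirrors the F, G, H outcome of the example) are assumptions made here. Boxes are three-dimensional (x, y, t), so the same code covers the spatiotemporal case.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A box is (low_corner, high_corner), each corner being (x, y, t).
Box = Tuple[Tuple[float, ...], Tuple[float, ...]]

def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes intersect along every dimension."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]))

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries are (mbb, object_id); internal entries are (mbb, child_node).
    entries: List[tuple] = field(default_factory=list)

def range_search(node: Node, query: Box) -> List[str]:
    """Collect the ids of all leaf entries whose MBB overlaps the query box."""
    result = []
    for mbb, payload in node.entries:
        if not overlaps(mbb, query):
            continue  # prune this entry (and its whole subtree, if any)
        if node.is_leaf:
            result.append(payload)
        else:
            result.extend(range_search(payload, query))
    return result

# Hypothetical toy tree: root entries A and B point to two leaf nodes.
leaf_a = Node(True, [(((0, 0, 0), (2, 2, 5)), "F"), (((3, 3, 0), (4, 4, 5)), "G")])
leaf_b = Node(True, [(((6, 0, 0), (7, 1, 5)), "H"), (((9, 9, 0), (10, 10, 5)), "I")])
root = Node(False, [(((0, 0, 0), (4, 4, 5)), leaf_a),
                    (((6, 0, 0), (10, 10, 5)), leaf_b)])

query = ((1, 0, 1), (6.5, 3.5, 2))
print(range_search(root, query))  # ['F', 'G', 'H']
```

Because only MBBs are tested, the reported ids are candidates in the general case; an exact answer requires checking the actual geometry of each reported object.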

Regarding two-stage structures, such as SETI, FNR, and MON-tree, range search is generally a three-step task. It consists of a spatial filtering process, followed by a temporal filtering and a subsequent refinement step joining the results of the spatial and temporal filtering. Spatial and temporal filtering are performed through the respective spatial and temporal components; if such a component is a one- or two-dimensional R-tree, the algorithm is essentially the same as the one previously presented for simple R-trees. The refinement step is necessary because objects retrieved by the spatial filtering are approximated by MBBs; therefore, we cannot determine whether the spatial object is actually inside the query before it has been retrieved (something that happens only after the temporal filtering step). To make this more concrete, consider a range query over SETI partially overlapping an index cell; then, line segments inside the cell may or may not be actually inside the query, something that can be determined only after the first two steps that retrieve the actual trajectory components (i.e., line segments). Generally speaking, such an approach is much more efficient since these indexes exploit the fact that the spatial domain remains unchanged, while the time domain evolves monotonically; as a result, all these approaches outperform the R-tree by several orders of magnitude [2, 11, 20].
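The three steps can be illustrated with a deliberately simplified, SETI-like grid. Everything in the sketch below is an assumption made for illustration: the cell size `CELL`, the segment layout `(id, (x0, y0, t0), (x1, y1, t1))`, and the use of a plain list per cell in place of a real temporal index. It does not reproduce the actual SETI, FNR, or MON-tree structures.

```python
import math
from collections import defaultdict

CELL = 10.0  # hypothetical side length of a square grid cell

def cells_for_box(xmin, ymin, xmax, ymax):
    """Yield the grid cells overlapped by an axis-aligned rectangle."""
    for cx in range(math.floor(xmin / CELL), math.floor(xmax / CELL) + 1):
        for cy in range(math.floor(ymin / CELL), math.floor(ymax / CELL) + 1):
            yield (cx, cy)

def build_index(segments):
    """Register every segment in each cell touched by its bounding rectangle."""
    index = defaultdict(list)
    for seg in segments:
        _, (x0, y0, _), (x1, y1, _) = seg
        for cell in cells_for_box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)):
            index[cell].append(seg)
    return index

def segment_intersects_box(p, q, lo, hi):
    """Slab test: does the segment p->q pass through the box [lo, hi]?"""
    t_enter, t_exit = 0.0, 1.0
    for a, b, l, h in zip(p, q, lo, hi):
        d = b - a
        if d == 0:
            if a < l or a > h:
                return False
        else:
            t_a, t_b = (l - a) / d, (h - a) / d
            t_enter = max(t_enter, min(t_a, t_b))
            t_exit = min(t_exit, max(t_a, t_b))
            if t_enter > t_exit:
                return False
    return True

def range_query(index, rect, t_from, t_to):
    xmin, ymin, xmax, ymax = rect
    hits = set()
    for cell in cells_for_box(xmin, ymin, xmax, ymax):       # 1. spatial filtering
        for oid, p, q in index.get(cell, []):
            if q[2] < t_from or p[2] > t_to:                  # 2. temporal filtering
                continue
            if segment_intersects_box(p, q, (xmin, ymin, t_from),
                                      (xmax, ymax, t_to)):    # 3. refinement
                hits.add(oid)
    return sorted(hits)

segments = [("a", (1, 1, 0), (4, 4, 10)),      # crosses the query window
            ("b", (9, 1, 0), (9.5, 1.5, 10)),  # same cell, but outside the window
            ("c", (3, 3, 20), (5, 5, 30))]     # inside the window, but later in time
index = build_index(segments)
print(range_query(index, (2, 2, 8, 8), 0, 10))  # ['a']
```

Segment `b` shows why refinement is needed: it lives in a cell that the query partially overlaps and it passes the temporal filter, yet its actual geometry never enters the query window.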

Moreover, the spatiotemporal domain includes several approaches trying to optimize the range search procedure based on properties of real spatiotemporal applications. For example, the work presented in [47] uses the restrictions placed on the movement of objects by the existing infrastructure to improve the performance of spatiotemporal queries executed against a spatiotemporal index. The strategy followed does not affect the structure of the index itself. Instead, the authors adopt an additional preprocessing step before the execution of each query. In particular, provided that the infrastructure is rarely updated, it can be indexed by a conventional spatial index such as the R-tree. On the other hand, a general-purpose spatiotemporal index, such as the TB-tree [49] or the three-dimensional R-tree [66], can be used to index trajectories of moving objects. Then, a preprocessing step of the query divides the initial query window into a number of smaller windows, from which the regions covered by the infrastructure have been excluded (see Fig. 6.7). Each one of the smaller queries is executed against the (general-purpose spatiotemporal) index, returning a set of candidate objects, which are finally refined with respect to the initial query window.

Fig. 6.7 The initial query window Q (a) is decomposed into a number of smaller query windows Q1, Q2, ... (b) with respect to infrastructure elements (drawn in black)
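A minimal sketch of such a decomposition is given below, assuming that both the query window and the infrastructure elements are axis-aligned rectangles `(xmin, ymin, xmax, ymax)`. The function names `subtract` and `decompose` are hypothetical, and the particular way of cutting the window into strips is only one possible choice; the decomposition used in [47] may differ.

```python
def subtract(rect, hole):
    """Return the parts of rect not covered by hole, as up to four rectangles."""
    xmin, ymin, xmax, ymax = rect
    # Clip the hole to the rectangle.
    hx0, hy0 = max(hole[0], xmin), max(hole[1], ymin)
    hx1, hy1 = min(hole[2], xmax), min(hole[3], ymax)
    if hx0 >= hx1 or hy0 >= hy1:
        return [rect]  # no overlap: the window is kept unchanged
    pieces = []
    if xmin < hx0:
        pieces.append((xmin, ymin, hx0, ymax))  # strip to the left of the hole
    if hx1 < xmax:
        pieces.append((hx1, ymin, xmax, ymax))  # strip to the right of the hole
    if ymin < hy0:
        pieces.append((hx0, ymin, hx1, hy0))    # strip below the hole
    if hy1 < ymax:
        pieces.append((hx0, hy1, hx1, ymax))    # strip above the hole
    return pieces

def decompose(query, obstacles):
    """Split the query window into smaller windows that avoid all obstacles."""
    windows = [query]
    for hole in obstacles:
        windows = [piece for w in windows for piece in subtract(w, hole)]
    return windows

for window in decompose((0, 0, 10, 10), [(4, 4, 6, 6)]):
    print(window)
```

Each resulting window would then be issued as a separate range query over the same time interval, and the union of the candidates refined against the initial window Q.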

In the evaluation presented in [47], the performance of two spatiotemporal indexes (the TB-tree and the three-dimensional R-tree) was compared, with and without the described query preprocessing step (i.e., dividing the initial window into smaller windows), and it was shown that the query performance improved for both indexes when this step was used.

Recently, work has also been done on how to optimally split trajectories for the purpose of improving range query performance [28, 30, 53]. Hadjieleftheriou et al. [28] use a partially persistent structure, the PPR-tree, trying to confront the problem of the dead space generated by MBB approximations of moving object trajectories. Dead space is the amount of space in an MBB approximation that does not actually cover any object contained inside it. They introduce "artificial object updates" partitioning the trajectories into smaller elements, thus reducing the dead space; they use nonlinear functions to describe the moving objects' trajectories, which are initially indexed by the PPR-tree. This work is extended in [30], where a multiversion R-tree, such as the one proposed in [62], is used instead of the PPR-tree, leading to an indexing scheme with improved performance. Moreover, the proposed algorithms for handling the problem of the dead space introduced by MBBs can be used in combination with any spatiotemporal data structure such as the R-tree and its variants.
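The effect of splitting on dead space can be illustrated with a small calculation. The sketch below uses total MBB volume as a rough proxy for dead space and tries every single cut point of a sampled trajectory; both choices, as well as the function names, are assumptions made here and are not the splitting algorithms of [28] or [30].

```python
def mbb_volume(points):
    """Volume of the minimum bounding box of a list of (x, y, t) points."""
    xs, ys, ts = zip(*points)
    return (max(xs) - min(xs)) * (max(ys) - min(ys)) * (max(ts) - min(ts))

def split_volume(points, cut):
    """Total MBB volume when the trajectory is cut at index `cut`.

    The two pieces share the point at the cut, so the movement stays connected.
    """
    return mbb_volume(points[:cut + 1]) + mbb_volume(points[cut:])

def best_single_split(points):
    """Index of the interior point whose cut yields the smallest total volume."""
    return min(range(1, len(points) - 1), key=lambda c: split_volume(points, c))

# Hypothetical trajectory: moves mostly east, then turns sharply north.
trajectory = [(0, 0, 0), (5, 0.5, 5), (10, 1, 10), (11, 10, 20)]

cut = best_single_split(trajectory)
print(mbb_volume(trajectory))          # one box around the whole trajectory
print(cut, split_volume(trajectory, cut))  # best cut and the two-box volume
```

For this toy input a single box has volume 2200, while cutting at the turning point leaves two boxes with a combined volume of 190, which shows how one "artificial update" can remove most of the empty space around an L-shaped movement.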

6.4.2 Nearest-Neighbor Search

Nearest-neighbor (NN) search has been at the core of spatial and spatiotemporal database research during the last decade. The literature on NN query processing algorithms mainly deals with either stationary [14, 31, 54] or moving query points over static data sets [57, 60] or data sets consisting of current or future (predicted) locations [7, 33, 37, 58, 75, 77]. Apparently, these types of queries do not cover NN search on historical trajectories, which is the subject of this work; the only related proposal is presented in [21], which investigates mechanisms to perform NN search on R-tree-like structures storing historical information about moving object trajectories. The depth-first and best-first algorithms proposed in [21] vary with respect to the type of the query object (stationary or moving point) as well as the type of the query result (historical continuous or not), thus resulting in four types of NN queries, which are thoroughly discussed in the previous chapter. The proposed algorithms were implemented on two members of the R-tree family for trajectory data (the TB-tree and the three-dimensional R-tree), demonstrating their scalability and efficiency through an extensive experimental study using synthetic and real data sets.
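To make the simplest of these query types concrete (stationary query point, non-continuous result), the following brute-force baseline scans every trajectory segment, restricts it to the query time interval, and keeps the closest one. It is not the depth-first or best-first algorithm of [21], which prune R-tree nodes instead of scanning all segments; the data layout and function names are assumptions made for this sketch.

```python
import math

def clip_segment(seg, t_from, t_to):
    """Restrict ((x0, y0, t0), (x1, y1, t1)) to [t_from, t_to]; None if disjoint."""
    (x0, y0, t0), (x1, y1, t1) = seg
    lo, hi = max(t0, t_from), min(t1, t_to)
    if lo > hi:
        return None

    def position_at(t):
        # Linear interpolation of the object's position at time t.
        r = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
        return (x0 + r * (x1 - x0), y0 + r * (y1 - y0))

    return position_at(lo), position_at(hi)

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the 2D line segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    len2 = dx * dx + dy * dy
    r = 0.0 if len2 == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len2
    r = max(0.0, min(1.0, r))
    return math.hypot(p[0] - (a[0] + r * dx), p[1] - (a[1] + r * dy))

def nearest_trajectory(point, trajectories, t_from, t_to):
    """Return (distance, id) of the trajectory passing closest to point in time."""
    best = (math.inf, None)
    for oid, segs in trajectories.items():
        for seg in segs:
            clipped = clip_segment(seg, t_from, t_to)
            if clipped is None:
                continue
            d = point_segment_distance(point, *clipped)
            if d < best[0]:
                best = (d, oid)
    return best

trajectories = {
    "o1": [((0, 0, 0), (10, 0, 10))],
    "o2": [((0, 5, 0), (10, 5, 10))],
}
print(nearest_trajectory((5, 1), trajectories, 0, 10))  # (1.0, 'o1')
```

An index-based algorithm would return the same answer while visiting only those nodes whose MBB could still contain a closer segment than the best one found so far.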

6.4.3 Trajectory Joins

Distance join has not been considered extensively in the domain of spatiotemporal databases. The limited existing work on this subject considers joins of moving object trajectories utilizing dedicated index structures [5, 6] or general-purpose indexes [4].

Bakalov et al. [5] consider the problem of evaluating all pairs of similar trajectories between two data sets. According to [5], two trajectories are considered similar during a given time interval, when, given a distance function, all distances between timely corresponding trajectory positions are within the given threshold. Then an approximation technique is used to reduce trajectories to symbolic representations (strings) so as to lower the dimensionality of the original (three-dimensional) problem to one. Using the constructed strings, a special lower-bounding metric supports a pruning heuristic used to reduce the number of candidate pairs to be examined. The overall scheme is subsequently indexed by a structure based on the B-tree, requiring also minimal storage space. The same work is extended in [6] to support time-relaxed spatiotemporal trajectory joins.
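The similarity predicate itself is easy to state in code. The sketch below is a brute-force nested-loop join under the assumptions that trajectories are sampled at common integer timestamps and that the distance function is Euclidean; it omits the symbolic approximation, the lower-bounding metric, and the B-tree-based index of [5], and all names are hypothetical.

```python
import math

def similar(traj_a, traj_b, t_from, t_to, eps):
    """True if both trajectories stay within eps of each other at every
    timestamp in [t_from, t_to]. A trajectory maps timestamp -> (x, y)."""
    for t in range(t_from, t_to + 1):
        if t not in traj_a or t not in traj_b:
            return False  # no timely corresponding position
        if math.dist(traj_a[t], traj_b[t]) > eps:
            return False
    return True

def trajectory_join(set_r, set_s, t_from, t_to, eps):
    """All (r_id, s_id) pairs whose trajectories are similar in the interval."""
    return [(rid, sid)
            for rid, r in set_r.items()
            for sid, s in set_s.items()
            if similar(r, s, t_from, t_to, eps)]

R = {"r1": {0: (0, 0), 1: (1, 0), 2: (2, 0)}}
S = {"s1": {0: (0, 1), 1: (1, 1), 2: (2, 1)},   # always at distance 1 from r1
     "s2": {0: (0, 1), 1: (1, 5), 2: (2, 1)}}   # drifts away at t = 1

print(trajectory_join(R, S, 0, 2, eps=1.5))  # [('r1', 's1')]
```

The nested loop examines every pair, which is exactly the cost that a lower-bounding metric over the string representations is meant to avoid.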

Another variation on the subject of joining trajectories is the closest-point-of-approach join recently introduced in [4]. Closest-point-of-approach requires finding all pairs of line segments between two trajectories such that their distance is less than a predefined threshold. The work presented in [4] proposes three approaches: the first utilizes packed R-trees, treating trajectory segments as simple line segments in the (d+1)-dimensional space, and then employs the well-known R-tree join algorithm [32], which requires a carefully controlled synchronized traversal of the two R-trees; the second is based on a plane-sweep algorithm along the temporal dimension; and the third is an adaptive algorithm, which naturally alters the way in which it computes the join in response to the characteristics of the underlying data.
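The per-pair test underlying such a join can be sketched as follows, assuming that both objects move linearly and that the two segments have already been restricted to a common time interval [t0, t1]. The function name `cpa_distance` is hypothetical, and the sketch covers only the distance computation, not any of the three join strategies of [4].

```python
import math

def cpa_distance(a0, a1, b0, b1, t0, t1):
    """Minimum distance between two points moving linearly from a0 to a1 and
    from b0 to b1 during the common time interval [t0, t1] (with t1 > t0)."""
    dt = t1 - t0
    # Relative position at t0 and relative velocity, per coordinate.
    w = [pa - pb for pa, pb in zip(a0, b0)]
    dv = [((pa1 - pa0) - (pb1 - pb0)) / dt
          for pa0, pa1, pb0, pb1 in zip(a0, a1, b0, b1)]
    dv2 = sum(c * c for c in dv)
    # Time offset (from t0) minimizing |w + tau * dv|, clamped to the interval.
    tau = 0.0 if dv2 == 0 else -sum(wc * dc for wc, dc in zip(w, dv)) / dv2
    tau = max(0.0, min(dt, tau))
    return math.sqrt(sum((wc + tau * dc) ** 2 for wc, dc in zip(w, dv)))

# Two objects passing each other on parallel lanes one unit apart.
print(cpa_distance((0, 0), (10, 0), (10, 1), (0, 1), 0, 10))  # 1.0
```

A segment pair qualifies for the join result when this distance is below the predefined threshold.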

6.4.4 Similarity Search

Similarity search has been well studied in the time series analysis domain; consequently, techniques addressed there are usually extended in the spatiotemporal domain, in which trajectories such as T and Q presented in Fig. 6.8 are considered.

Fig. 6.8 Two similar trajectories T and Q

Historically, similarity search has been based on the Euclidean distance between time series, which nevertheless has several disadvantages that the following proposals try to confront. In particular, in order to compare sequences with different lengths, Berndt and Clifford [8] used the dynamic time warping (DTW) technique, which allows sequences to be stretched along the time axis so as to minimize the distance between them. Although DTW incurred a heavy computation cost, it was more robust against noise. The longest common subsequence (LCSS) measure [70] matches two sequences by allowing them to stretch, without rearranging the sequence of the elements, but allowing some elements to be unmatched (which is the main advantage of the LCSS measure compared with Euclidean distance and DTW). Therefore, LCSS can efficiently handle outliers and different scaling factors. The authors introduce two similarity measures, namely S1 and S2, allowing time stretching and translations, respectively, which were proved to be very robust to the presence of noise and provided an intuitive notion of similarity between trajectories by giving more weight to the similar portions of the trajectories. In [12], a distance function called edit distance on real sequences (EDR) was introduced. The EDR distance function is based on the edit distance, which is the number of insert, delete, or replace operations that are needed to convert trajectory T into Q. In the respective experimental study presented in [12], EDR was shown to be more robust than DTW and LCSS over trajectories with noise.
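The three measures share the same dynamic-programming skeleton, which the following sketch makes explicit. These are textbook formulations under simplifying assumptions: points are (x, y) tuples, two points "match" when every coordinate differs by at most a threshold `eps`, and the time-window constraint and normalization used in [70] and [12] are left out.

```python
import math

def matches(p, q, eps):
    """Two points match if they differ by at most eps in every coordinate."""
    return all(abs(u - v) <= eps for u, v in zip(p, q))

def dtw(a, b):
    """Dynamic time warping cost with Euclidean point distance."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = math.dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def lcss(a, b, eps):
    """Length of the longest common subsequence under the eps match test."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if matches(a[i - 1], b[j - 1], eps):
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def edr(a, b, eps):
    """Number of insert, delete, or replace operations converting a into b."""
    n, m = len(a), len(b)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i
    for j in range(m + 1):
        E[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if matches(a[i - 1], b[j - 1], eps) else 1
            E[i][j] = min(E[i - 1][j - 1] + cost, E[i - 1][j] + 1, E[i][j - 1] + 1)
    return E[n][m]

T = [(0, 0), (1, 0), (2, 0), (3, 0)]
Q = [(0, 0), (1, 0), (50, 50), (2, 0), (3, 0)]  # same path plus one outlier

print(dtw(T, Q), lcss(T, Q, 0.5), edr(T, Q, 0.5))
```

On this toy pair the single outlier in Q dominates the DTW cost, because DTW must align every point, whereas LCSS still matches all four points of T and EDR charges exactly one edit operation. This is the kind of robustness to noise that the text attributes to the two latter measures.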

To speed up the similarity search between trajectories, both [70] and [12] rely on dedicated index structures, thus achieving pruning of over 90% of the total number of indexed trajectories.

6.4.5 Trajectory-Based Querying

Trajectory-based querying is mainly discussed in [49] and [78], where dedicated index structures (TB-tree and OP-tree, respectively) are proposed to efficiently support this type of query. Regarding the aforementioned structures, trajectory-based querying is a rather straightforward task to perform: having located one leaf node containing entries of a specific trajectory, one may recursively follow the pointers to the previous and the successive node containing entries of the same trajectory (recall Fig. 6.4 for the TB-tree case), until the spatial or temporal query criterion has been verified or the entire moving object trajectory has been retrieved.
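The pointer-following idea can be sketched with leaf nodes organized as a doubly linked list per trajectory. The `Leaf` class, the reduction of segments to `(t_start, t_end)` pairs, and the purely temporal stopping criterion are all simplifying assumptions of this sketch; they are not the actual TB-tree or OP-tree node layout.

```python
class Leaf:
    """A leaf node holding consecutive segments of a single trajectory."""

    def __init__(self, segments):
        self.segments = segments  # list of (t_start, t_end), in temporal order
        self.prev = None          # leaf with the preceding part of the trajectory
        self.next = None          # leaf with the following part of the trajectory

def link(leaves):
    """Chain the leaves of one trajectory through prev/next pointers."""
    for earlier, later in zip(leaves, leaves[1:]):
        earlier.next, later.prev = later, earlier

def retrieve(start, t_from, t_to):
    """Starting from any leaf of the trajectory, collect all of its segments
    that overlap the time interval [t_from, t_to]."""
    node = start
    # Follow 'prev' pointers while the preceding leaf may still overlap.
    while node.prev is not None and node.prev.segments[-1][1] >= t_from:
        node = node.prev
    result = []
    # Follow 'next' pointers until the temporal criterion is exceeded.
    while node is not None and node.segments[0][0] <= t_to:
        result.extend(s for s in node.segments if s[1] >= t_from and s[0] <= t_to)
        node = node.next
    return result

leaf1, leaf2, leaf3 = Leaf([(0, 1), (1, 2)]), Leaf([(2, 3), (3, 4)]), Leaf([(4, 5), (5, 6)])
link([leaf1, leaf2, leaf3])

print(retrieve(leaf2, 1.5, 4.5))  # [(1, 2), (2, 3), (3, 4), (4, 5)]
```

No tree traversal is needed after the first leaf has been located, which is why trajectory preservation makes this query type inexpensive.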

Regarding the rest of the index structures, which do not consider trajectory preservation, the processing of trajectory-based queries can be performed by employing the algorithm proposed in [49] regarding the three-dimensional R-tree and the STR-tree. As such, having retrieved an initial segment belonging to the trajectory under consideration, the algorithm tries to find its connecting segment, first, in the same leaf node, and, second, in other leaf nodes. Searching in other leaf nodes is conducted as a range search, with the endpoint of the segment in question as a predicate. Arriving at the leaf level, the algorithm checks whether a segment is connected to the segment in question in the specified way (backward or forward connected). Using this recursive approach, successive segments of the trajectory are retrieved, until the spatial or temporal query criterion has been verified, or the entire moving object trajectory has been retrieved. However, this simple algorithm incurs a heavy computation cost even in the presence of a buffer, since the worst-case scenario corresponds to a case where every trajectory segment is stored in a different disk page.

6.4.6 Spatiotemporal Query Optimization

The determination of the best execution plan for a query requires estimating the number of data items that it retrieves, as well as its cost, in terms of I/O and CPU effort. As in traditional databases, spatial query optimization tools include cost-based models, exploiting analytical formulas for the selectivity and cost of a query, and histogram-based techniques. On the other hand, although the domain of spatiotemporal databases has been at the center of research interest for several years, producing many novel indexing techniques, most of them based on the R-tree, the work conducted on estimating the selectivity of trajectories as well as on developing cost models for such indexing schemes is very limited. Specifically, on the subject of selectivity estimation in spatiotemporal databases, research includes [15, 29, 61], all of them estimating the selectivity of several spatiotemporal predictive queries. Apparently, none of them covers the domain of historical trajectory databases; therefore, the interested reader is referred to the cited papers.

Although models for the prediction of R-tree performance have been extensively examined during the last decade, they cannot be straightforwardly applied in the spatiotemporal domain. For example, the traditional analysis on R-tree cost models, such as [65, 67], relies on the assumption that the extent of the data inserted in the tree is equally distributed along each dimension, i.e., resulting in square node rectangles. Though this is a reasonable assumption concerning spatial objects, in the spatiotemporal domain, the temporal dimension behaves differently from the two spatial ones. For example, in the widely used three-dimensional R-tree, when an object updates its position rarely, its trajectory's line segments will tend to be elongated in the time dimension, resulting in elongated (in the temporal dimension) leaf nodes.

To resolve this problem, Tao and Papadias [59] examine the R*-tree split algorithm and propose an extent regression function (ERF), which computes the node extents as a function of the number of node splits. In particular, using each level's and axis' length distribution function (which at the leaf level derives from the actual data), they calculate the extent regression function $ERF_i(t)$ for each tree level along the $i$th dimension, having as parameter $t$ the total number of splits performed along the $i$th dimension in this tree level. The average extent $s_{i,j}$ of a level-$j$ node along the $i$th dimension is calculated using the computed ERFs, adopting also a technique that estimates the number of splits performed along the $i$th dimension at the $j$th tree level by minimizing an objective function under constraints. Finally, having estimated off-line, and without accessing the tree, the average values of $s_{i,j}$ at each tree level, they provide the following generalized formula for the expected number of node accesses $CW(R,q)$ for a query window $q$:

$$CW(R,q) = \sum_{j=1}^{1+\log_f(N/f)} \left[ \frac{N}{f^j} \prod_{i=1}^{d} \left( s_{i,j} + q_i \right) \right], \tag{6.1}$$

where $N$ is the data set cardinality, $f$ is the fanout of tree nodes, $d$ is the number of dimensions, $j$ is the respective level of the R-tree, and $q_i$ is the extent of query $q$ along the $i$th dimension (a formula that originates in the spatial database domain [65]). The experimental evaluation presented in [59] shows that the proposed model provides accurate estimates for the expected number of node accesses in all settings, while other tested cost models (such as [65]) completely fail. Although the model is not developed only for spatiotemporal data, it is capable of predicting the performance of a three-dimensional R*-tree since it supports tree nodes being elongated in the temporal dimension. However, it cannot be applied to other R-tree variants, since the calculation of the ERF is based on the R*-tree splitting algorithm.
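Once the average node extents are available, evaluating (6.1) is a short computation. In the sketch below the extents are hypothetical input values in a unit workspace; in the model of [59] they would be produced by the ERFs. The helper `tree_levels`, which uses integer arithmetic to obtain the upper limit of the sum, assumes that N is such that the limit is an integer (otherwise it rounds up).

```python
import math

def tree_levels(N, f):
    """Smallest h with f**h >= N, i.e. 1 + log_f(N/f) when that is an integer."""
    h = 1
    while f ** h < N:
        h += 1
    return h

def expected_node_accesses(N, f, q, extents):
    """Evaluate formula (6.1).

    q[i]          : extent of the query window along dimension i
    extents[j-1]  : tuple of average node extents s_{i,j} at tree level j
    """
    total = 0.0
    for j in range(1, tree_levels(N, f) + 1):
        nodes_at_level = N / f ** j
        # Product over dimensions of (s_{i,j} + q_i).
        overlap = math.prod(s + qi for s, qi in zip(extents[j - 1], q))
        total += nodes_at_level * overlap
    return total

# Hypothetical three-dimensional (x, y, t) setting with temporally elongated nodes.
N, f = 1000, 10
q = (0.1, 0.1, 0.1)
extents = [(0.05, 0.05, 0.4),   # level 1 (leaf level)
           (0.2, 0.2, 0.7),     # level 2
           (1.0, 1.0, 1.0)]     # level 3 (root)

print(expected_node_accesses(N, f, q, extents))
```

With these inputs the three levels contribute approximately 1.125, 0.72, and 1.331 expected accesses, respectively, so the temporally elongated leaf extents account for a sizeable part of the estimate.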
