Joint detection and tracking of moving objects using spatio-temporal marked point processes

(1)

HAL Id: hal-01104981

https://hal.inria.fr/hal-01104981

Submitted on 19 Jan 2015

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Joint detection and tracking of moving objects using

spatio-temporal marked point processes

Paula Craciun, Mathias Ortner, Josiane Zerubia

To cite this version:

Paula Craciun, Mathias Ortner, Josiane Zerubia. Joint detection and tracking of moving objects

using spatio-temporal marked point processes. IEEE Winter Conference on Applications of Computer

Vision, Jan 2015, Hawaii, United States. �hal-01104981�

(2)

Joint detection and tracking of moving objects using spatio-temporal marked

point processes

Paula Cr˘aciun

INRIA

paula.craciun@inria.fr

Mathias Ortner

Airbus D&S

mathias.ortner@eads.net.fr

Josiane Zerubia

INRIA

josiane.zerubia@inria.fr

Abstract

In this paper, we present a novel approach based on spatio-temporal marked point processes to detect and track moving objects in a batch of high resolution images. Batch processing techniques are applicable to and desirable for a large class of applications such as offline scene and video analysis, and provide better overall detection and data as-sociation accuracy than sequential methods. We develop a new, intuitive energy based model consisting of several terms that take into account both the image evidence and physical constraints such as target dynamics, track persis-tence and mutual exclusion. We construct a suitable op-timization scheme that allows us to find strong local min-ima of the proposed highly non-convex energy. We test our model on three batches of 25 synthetic biological images with different levels of noise. Our main application however consists of two batches of 14 remotely sensed high resolu-tion optical images of boats which are particularly hard to analyze due to the different angles at which the images were taken and the low temporal frequency.

1. Introduction

In a simplistic view, tracking can be defined as the problem of estimating the trajectories of objects in the image plane, as they move around the scene. Hence, a tracker assigns consistent labels to the objects in different frames of a se-quence of images. Additionally, it can provide information about the orientation, shape or size of the objects.

Multi-target tracking has been historically achieved using sequential techniques, classical examples of which are the Joint Probabilistic Data Association Filter (JPDAF) or the Multi-Hypothesis Tracker (MHT) [1]. These methods are preferred when real-time object tracking is needed. The ma-jor drawback of such methods however is that they cannot modify past results in the light of new data. Nevertheless, real-time tracking is not always a necessary constraint. Ap-plications such as offline video processing or information

retrieval do not impose such a constraint. Batch processing methods are preferred in this case since they do not suffer from the limitations of sequential methods. The increased performance of modern hardware allows for new batch pro-cessing techniques which could not be explored in the past. In this regard, MCMC Data Association has been proposed in [15] to partition a discrete set of detections into tracks. To retrieve the partitions, a one-to-one target to detection mapping assumption was made. This work was later ex-tended by Yu et al. [21] to overcome the one-to-one map-ping assumption. A modified version based on the data as-sociation method developed by Yu et al. [21] was applied in video-microscopy [17]. Nevertheless, batch processing techniques remain poorly explored and highly underused. Marked point processes (MPPs) [18] have been applied with success to object detection problems in high resolution re-motely sensed optical images. Applications range from de-tecting flamingos, buildings or boats [7] but also to people detection in crowds in street view images [10]. The use of specific terms in the energy to be optimized makes these processes application dependent, but highly efficient and ac-curate. In spite of their good theoretical properties, to our best knowledge marked point processes have never been ap-plied to tracking problems in image sequences up to now. According to Cressie and Wikle [3], a spatio-temporal marked point process can be viewed as an extension of the spatial marked point process to the temporal domain. In their view, one can think of a spatio-temporal point process either as a point process in Rd+1_{which they call descriptive,}

or as a temporal process of spatial point processes, which they call dynamic. Diggle et al. [8] gives an extensive re-view of the growing literature in spatio-temporal models. Nevertheless, we emphasize a change in paradigm between the spatio-temporal marked point process models found in literature and our approach.

The models, as presented by Cressie and Wikle, are used to obtain the posterior distribution of all unknowns, given the spatio-temporal information. Thus, the aim is to iden-tify and understand the forces that drive the evolution of a certain event. This knowledge can further be used to

(3)

ex-plain the evolution of similar events. Applications include stochastic models of biological growth [13], the spread of infectious diseases [9] or the evolution of forest fires [16]. Our line of reason is exactly the opposite. The forces that drive the dynamics of the considered event are known a pri-ori and integrated into an internal energy term. The aim in this case is to isolate and group the data that is best ex-plained by our model from a large set of information. We propose a new spatio-temporal marked point process model specifically adapted to the problem of multiple tar-get tracking. We use ellipses to model the objects, i.e. bio-logical particles or boats, and add an additional mark to fa-cilitate the association between objects in different frames. We develop a new, intuitive energy and show the high de-tection and good association properties. We use reversible jump Markov Chain Monte Carlo (RJMCMC) [12] to find the most likely configuration of objects. We show results on three synthetic biological image sequences of 25 frames each with varying levels of noise, as well as on two difficult sequences of 14 gray-scale high resolution frames taken by an optical satellite at different angles and low temporal fre-quency.

This paper is organized as follows: We describe our ap-proach to multi-target tracking in section 2. The optimiza-tion technique is presented in secoptimiza-tion 3. Secoptimiza-tion 4 shows experimental results. Finally, conclusions are drawn in sec-tion 5.

2. Multiple target tracking

2.1. Preliminaries and notation

To facilitate understanding, we first introduce the notations used throughout this paper. We model the 3D image cube as a continuous bounded set K = [0, Ihmax] × [0, Iwmax] ×

[0, T ] and denote x = (ch, cw, t) a point of K, where

(ch, cw) denote the location of the point within the image

and t is the frame number. A configuration of points x is an unordered set of points in K: x = {x1, . . . , xn(x)}, xi∈ K,

where n(x) = card(x) denotes the number of points in the configuration. A point process X is a collection of random configurations on the same probability space (Ω, A, P) [18]. To describe configurations of objects, random marks are added to each point. In our case of ellipses, we consider the mark space M = [am, aM] × [bm, bM]×] −π₂,π₂] × [0, L],

where am, aM and bm, bM are the minimum and

maxi-mum length of the semi-major and semi-minor axis respec-tively, ω ∈] − π₂,π₂] is the orientation of the ellipse and l ∈ [0, L] is its label. Thus, an ellipse u can be defined as u = (ch, cw, t, a, b, ω, l) and a marked point process of

el-lipses X is a point process on W = K × M. While the semi-axes a and b and the orientation ω describe the phys-ical properties of an ellipse, the label l is used as an identi-fier. Objects with the same label across the image sequence

form a track. We model the likelihood that an object exists at any given location in K and the interaction between ob-jects in a configuration. Individual trajectories are extracted by grouping objects according to their label. Finally, we de-note with C the set of finite configurations of ellipses. An attractive property of point processes is the possibility of defining a point process distribution by its probability density function where the Poisson point process plays the analogue role of the Lebesgue measure on Rd_{, where d is}

the dimension of the object space. We use the Gibbs family of processes to define the probability density as follows:

fθ(X = X|Y) = 1 c(θ|Y)exp −Uθ(X,Y) ₍₁₎ where: • X = {x1∪x2∪· · ·∪xt∪· · ·∪xT} is the configuration

of ellipses, with xt being the configuration of ellipses

at time t;

• Y represents the 3D image cube; • θ is the parameter vector; • c(θ|Y) = R

Ωexp

−Uθ(X,Y)_{µ(dX) is the normalizing}

constant, with Ω being the configuration space and µ(·) being the intensity measure of the reference Pois-son process;

• Uθ(X, Y) is the energy term.

The most likely configuration of objects corresponds to the global minimum of the energy:

X ∈ arg max

X∈Ωfθ(X = X|Y) = arg minX∈Ω[Uθ(X, Y)].

(2) The energy function is divided in two parts: an external energy term, U_θext_ext(X, Y) which determines how good the configuration fits the input sequence, and an internal energy term, U_θint_int(X), which incorporates knowledge about the interaction between objects in a single frame and across the entire batch considered. The total energy can be written as the sum of these two terms:

Uθ(X, Y) = Uθextext(X, Y) + U

int

θint(X). (3)

The parameter vectors of the external and internal energy terms are θext and θintrespectively and θ = {θext, θint}.

In the following subsections we will describe each of these two energy parts in detail.

2.2. External energy term

The external energy term is composed of the local external energies of each object u in the configuration. In order to enhance the image evidence, two local terms are computed for each object: an object evidence and a contrast distance measure.

(4)

2.2.1 Object evidence.

We search for likely locations of moving objects by frame differencing. At each pixel location, we compute the mean over time of the radiometric values and denote it pm. Next,

for each frame f , for all pixels belonging to frame f , we compute the difference between their radiometric value pf

and the mean value pmat that location. Finally, we retain

as foreground only those pixels for which this difference is higher than a predefined threshold: ∀f ∈ [0, T ], ∀pf ∈

f : pf ∈ f oreground ⇐⇒ |pf − pm| ≥ threshold.

Morphological erosion and closing operations are used to enhance the filter response and smooth the boundaries of the foreground regions. We can define the class of a pixel p as ν(p) = {f oreground, background}. We compute the evidence of object u in the following way:

E(u|Y) = − 1 |u|

X

p∈u

1{ν(p) = foreground|Y} (4)

where |u| marks the cardinality of object u (e.g. the number of pixels that belong to u) and 1{·} marks the indicator function (1{true} = 1, 1{false} = 0). The object evidence E (u|Y) is used to favor the detection of smaller objects.

2.2.2 Contrast distance measure.

The aim of this term is to further refine the detection and extract information such as the orientation and the size of the objects. The objects of interest (e.g. boats) appear as bright structures on a dark background. Hence, a contrast distance measure is computed between the interior of the ellipse and its border. The contrast distance measure was first introduced in [11] and is defined as:

dB(u, Fρ(u)) = (µu− µF)2 4pσ2 u+ σF2 −1 2log 2pσ2 uσ2F σ2 u+ σ2F (5)

where (µu, σu2) and (µF, σ2F) represent empirical means

and variances of the object u and its ρ-wide border Fρ(u). A threshold d0(Y) is manually determined based on the

im-age. High threshold values are used when the objects are easily distinguishable from the background. Lower thresh-old values are used otherwise.

A quality function Q : R+→ [−1, 1] is used to compensate for errors close to the threshold value:

Q(x) = 1 − x1/3 _{if x < 1} exp(−x−1 3 ) − 1 if x ≥ 1 (6)

The quality function attributes a negative value to well placed ellipses (e.g. objects u for which dB(u, Fρ(u)) is

higher than the threshold d0(Y)) and a positive value

oth-erwise.

The two terms computed in eq. 4 and eq. 6 are further com-bined into a local external energy for an object u:

U_localext (u|Y) = γevE(u|Y)+γcntQ

d_B(u, Fρ(u)) d0(Y)

. (7)

Finally, the external energy term for the configuration X is: U_θext

ext(X, Y) =

X

u∈X

U_localext (u|Y). (8)

The parameter vector θext = {γev, γcnt} of the external

term consists of the weight γevof the evidence term and the

weight γcntof the quality of the contrast distance.

2.3. Internal energy term

The internal energy term consists of a set of constraints meant for a correct detection of objects and to facilitate tracking. These constraints are inspired by the physical con-straints objects obey in real life.

2.3.1 The dynamic model.

A defining property of tracking (as opposed to individual detections per frame) is that in most cases object trajectories are smooth. This allows to favor configurations where ob-jects exhibit a motion described by a dynamic model. This motion model, denoted by dyn, depends on the application. Nevertheless, we can create an energy term that encourages objects to follow a given motion model s.t. for an object u that exists at time t it can be written as:

U_dynint(u) =

dyn0− dyn if ∃ dyn s.t. dyn ≤ dyn0

0 otherwise

(9) where dyn0is a threshold that describes how much objects

can deviate from the motion model and still be awarded. The energy term that awards configurations which follow the dynamic model is the sum over all objects in the config-uration:

U_dynint(X) = γdyn

X

u∈X

U_dynint(u). (10)

This term favors the creation of objects where the data evi-dence is reduced but the dynamic model motivates the exis-tence of an object.

2.3.2 Label persistence.

In order to distinguish between distinct trajectories we have introduced a label in the mark of each object. A label can be viewed as a trajectory identifier. Different labels mean dif-ferent trajectories. Thus, the number of labels has to be kept closely related to the number of trajectories in the data set. Ideally, the large number of objects u scattered across the

(5)

image sequence should be assigned to a rather small num-ber of labels. In this regard, we construct the set of labels present in a configuration X by labels(X) = S

u∈Xl(u),

where l(u) is the label of object u. We favor configurations where the number of distinct labels is small:

U_labelint (X) = −γlabel

1 |labels(X)|

(11) where |labels(X)| represents the cardinality of the set. The labels are assigned based on the motion model. Given object u centered at location pos(u) = (ch(u), cw(u)), we

search for the objects in the adjacent frames that satisfy the motion model. We compute the distance between u and these objects and compare it to a threshold. If the distance is smaller than the threshold, we set the label of object u to the label of the object in the previous frame. Otherwise, a new random label from [0, L] \ labels(xt) is assigned to u.

Configurations X that contain two or more objects with the same label at any time instance t are not permitted, meaning that an infinite energy is assigned to such configurations.

2.3.3 Mutual exclusion.

Handling object collision or overlapping at a given frame is a crucial aspect when detecting and tracking objects. In our model, we attribute an infinite penalty to any configuration that contains objects that overlap more than a given extent s. Thus, the probability of selecting such a configuration is zero. We denote by

A(u, v) = Area(u ∩ v)

min(Area(u), Area(v)) (12) the area of intersection between the objects u and v. The energy term describing the penalty for overlapping is

U_overlapint (X) =    P u,v∈X,u6=vA(u, v) if ∀t ∈ [0, T ]∀u, v ∈ xt: A(u, v) ≤ s +∞ otherwise. (13) The reason for which we choose to impose this hard con-straint is the fact that our main data set is composed of re-motely sensed images. Since our interest lies in detecting and tracking real-life objects, we can fairly conclude that two distinct objects cannot simultaneously occupy the same image region at any time instance.

2.4. Total energy term

The total energy term can be written as a sum of all the energy terms defined in section 2.2 and section 2.3:

Uθ(X, Y) = γdynUdynint(X) + γlabelUlabelint (X)+

γoUoverlapint (X) + γevE(u|Y)+

γcntPu∈X QdB(u,Fρ(u)) d0(Y) . (14)

Figure 1. The effects of different components of the energy terms. The upper row shows a configuration with a higher energy value for each individual term. The bottom row shows a configuration with a lower energy value for each individual term. The dark spots denote target locations at different time frames. Different colors on the targets represent different labels assigned to each.

An intuition of how each energy term influences the output result is presented in Figure 1.

2.5. Parameters

The weights of the energy terms present in eq. 14 do not have an intuitive meaning and thus are harder to set by hand. Therefore, an automatic method to determine these param-eters is necessary. In our experiments we have used a linear programming approach to estimate these weights. Given a configuration X, the log posterior density is a linear com-bination of the parameters which can be considered to be independent from each other. Although we do not have ac-cess to the precise value of the log posterior density due to the normalizing constant, we can however compute the ra-tio, r, between two log posterior densities. If we know that one configuration is better than another, we can establish a set of constraints r ≤ 1 (or r ≥ 1). These constraints can be transformed into a set of linear inequalities of the param-eters. Once this set of inequalities is large enough, we can apply linear programming to obtain a feasible solution for the parameters. The interested reader can refer to [21] for further details. Alternative parameter estimation techniques are discussed in [7].

3. Optimization

The energy described in eq. 14 is clearly not convex. It is easy to construct examples that have two virtually equal minima, separated by a wall of high energy values. The de-pendence caused by the high-order physical constraints is the main reason that drives the energy to be non-convex. The target distribution is the posterior distribution of X, i.e. π(X) = f (X|Y), defined on a union of subspaces of dif-ferent dimensions. The most widely known optimization method for non-convex energy functions and an unknown number of objects is the reversible jump Markov Chain Monte Carlo (RJMCMC) sampler developed by Green [12]. RJMCMC uses a mixture of perturbation kernels Q(·, ·) = P

mpmQm(·, ·),

P

mpm= 1 andR Qm(X, X

(6)

Figure 2. The space-time volume an object can physically influ-ence from its current position at time t can be depicted as a double-cone extending both in frame t − 1 and t + 1.

1, to create tunnels through the walls of high energy. We use simulated annealing to find a minimizer of the en-ergy function. The density function in eq. 1 can is rewritten as:

fθ,i(X = X|Y) =

1 cT empi(θ|Y)

exp−Uθ (X,Y)T empi ₍₁₅₎

where T empiis a temperature parameter that tends to zero

when i tends to infinity. If T empidecreases in logarithmic

rate, then Xitends to a global optimizer of fθ,i. In practice

however, a logarithmic law is not computationally feasible and hence, a geometric law is used instead. Therefore, a proper design of the perturbation kernels is needed to en-sure a good exploration of the state space.

The efficiency of this iterative algorithm depends on the variety of the perturbation kernels. We have used the fol-lowing perturbation kernels in our experiments:

• Birth and death according to a birth map: Two types of maps are created in a pre-processing step:

1. Birth maps: since the objects are supposed to have higher radiometric values than the back-ground, we use a simple threshold technique to identify probable locations of objects in each frame and attribute higher probabilities to these locations for the birth proposition kernel; 2. Water mask: in the case of boat tracking we first

detect the water area and limit the search to such areas. Cr˘aciun et al. [2] present a simple and effective method to extract the water area based on three features that characterize the water area: low radiometric values, small variance across the area and a relative large size. A single water mask is computed for the whole image sequence; The birth and death according to a birth map kernel first chooses with probability pb and pd = 1 − pb

whether an object u should be added to (birth) or deleted from (death) the configuration. If a birth is

chosen, the kernel generates a new object u accord-ing to the birth map and proposes X0 = X ∪ u. If a death is chosen, the kernel selects one object u in X according to the birth map and proposes X0= X \ u; • Birth and death in a neighborhood: this kernel is used

to propose the addition or removal of an interacting pair of objects. To define the neighborhood of an ob-ject we introduce the notion of event cones. This no-tion was previously introduced by Leibe et al. [14] to search for plausible trajectories in the space-time vol-ume by linking up event cones. Following the idea of Leibe et al. we define the event cone of an object u = (ch, cw, t, a, b, ω, l) to be the space-time volume

it can physically influence from its current position as depicted in Figure 2;

• Non-jumping transformations: non-jumping transfor-mations are transfortransfor-mations that randomly select an object u in the current configuration and then propose to replace it by a perturbed version of the object v: X0 = (X \ u) ∪ v. Translation, rotation and scale are examples of such transformations.

A mapping Rm(·, ·) : C × C → (0, ∞), called the Green

ratio, is associated to each of these perturbation kernels. At iteration i, the proposition Xi = X0is accepted with

prob-ability αm = min(1, Rm(X, X0)). Otherwise Xi = X.

Although embedding the RJMCMC sampler into a simu-lated annealing scheme yields better results, in practice it is computationally very expensive.

4. Results and discussions

We first test our approach on three synthetic biological im-age sequences. The imim-age sequences have been generated with Icy [6], an online available toolbox for biological im-age sequences developed by the Pasteur Institute in Paris. The sequences consist of 25 images, each 256 × 256 pix-els, with approximately 20 objects per frame. The three sequences exhibit different levels of Gaussian noise. The tracking results are displayed in Figure 3. The motion model of the objects is considered to be a Brownian mo-tion.

We compare our approach with the built-in particle tracker that comes as a plug-in to the Icy software. The results are shown in Table 1. Nevertheless, the output of the MHT tracker should not be considered as ground truth. The sim-ilarities between the paired detections and tracks are com-puted based on a maximum distance of 5 pixels between the two methods. This means that if the same object is detected using MHT at position c1 and with the proposed approach at position c2 and if |c1 − c2| ≥ 5 then the two detections are not matched.

(7)

Figure 3. Detection and tracking results on synthetic biological image sequences created using Icy software [6]. Left: Tracking results on the first image sequence (no noise). Middle: Tracking results on the first image sequence (Gaussian noise, µ = 25, σ = 2.5). Right: Tracking results on the first image sequence (Gaussian noise, µ = 50, σ = 5.0).

Data set No. reference tracks (MHT)

No. candidate tracks (Proposed algorithm) Similarity between tracks Similarity between detections Seq. 1 45 53 0.3803 0.4230 Seq. 2 49 38 0.4263 0.3285 Seq. 3 49 40 0.3485 0.2674

Table 1. Comparison between the results obtained using the built-in MHT tracker within Icy [6] and the proposed method for the three synthetic image sequences. Note however, that the output of the MHT tracker should not be taken as ground truth information.

on a 2.30GHz Intel Xeon processor. Although this perfor-mance is reduced compared to other state-of-the-art meth-ods, we believe that the performance of our algorithm can be greatly increased using a parallel computing scheme. The implementations developed by Verdie et al. [19] and Cr˘aciun et al. [2] have brought significant speed-up in the case of object detection in a single image. A similar scheme can be envisioned for the dynamic case of object tracking over a sequence of images.

Our main objective however, is to apply the proposed al-gorithm on real data. The acquisition rate of satellite im-ages has experienced a significant increase in the last years. Therefore, object tracking using high resolution satellite im-ages can be regarded as a new application in remote sensing, complementary to object detection and land-cover classifi-cation. Therefore, we test our approach on two challenging image sequences of boats. Each sequence consists of 14 frames taken at a low temporal frequency. Targets exhibit strong variations in appearance due to the changing angle at which the images were taken. We compare our results to two classical trackers: Kalman filter [20] and Histogram-Based Tracker [5]. We consider a constant velocity motion model in this case.

Metrics. Conducting an objective comparison between

dif-ferent tracking algorithms is a challenging task for various reasons. First, the importance of individual tracking failures is application dependent. Second, classifying tracker out-puts as correct or incorrect may as well be very ambiguous and usually requires additional parameters (e.g. thresholds) to assess the correctness and precision of the trackers. To evaluate the multi-object tracking accuracy, we compute three types of errors: false positives (FP), false negatives (FN) and identity switches (ID). The three types of errors are weighted equally. We also state the number of true positives (TP) and we provide the total number of mov-ing objects (TO). The total number of movmov-ing objects (TO) is the sum over all frames of the objects that change their position in two consecutive frames. Additionally, mostly tracked (MT) and mostly lost (ML) scores are computed on the entire number of distinct trajectories (TT) to measure how many ground truth trajectories are tracked successfully (tracked for at least 80 percent) or lost (tracked for less than 20 percent). Finally, we state the precision (TP / (TP + FP)) and recall (TP / (TP + FN)) of each algorithm.

Quantitative evaluation. Table 2 shows the quantitative results for both image sequences individually. We show the results of three trackers: our full model including dy-namic birth maps and the water mask used for

(8)

optimiza-Figure 4. Detection and tracking results on two sequences of satellite images taken at different angles. Left: Tracking results on the first image sequence up to frame 10. Right: Tracking results up to frame 13 of the second image sequence.

Data set Method FP FN TP TO ID MT ML TT Precision Recall Seq. 1 ST-MPP + BM 1 6 85 91 0 7 1 8 0.988 0.934 KF 3 34 57 91 0 4 2 8 0.950 0.626 HBT 5 14 77 91 3 6 2 8 0.939 0.846 MPP 7 5 84 91 − − − − 0.923 0.944 Seq. 2 ST-MPP + BM 1 1 24 25 0 4 0 4 0.962 0.962 KF 2 4 21 25 0 2 1 4 0.913 0.84 HBT 2 3 22 25 1 2 0 4 0.916 0.88 MPP 3 1 24 25 − − − − 0.889 0.961

Table 2. Quantitative results for the two sequences of satellite images.

tion, denoted (ST-MPP + BM), the Kalman filter (KF) and the histogram-based tracker (HBT). Figure 4 shows the re-sults of our method on the two image sequences considered. For comparison, we list in Table 2 also the detection results of the spatial marked point processes model developed by Cr˘aciun et al. [4] and denoted (MPP) which we applied in-dependently on each frame to extract boats.

The first image sequence has 14 frames of 1840 × 820 pix-els each and contains a constant number of 8 moving ob-jects throughout the entire sequence. The use of the object evidence term leads to better estimates of the locations of the objects, while the contrast distance measure is used to obtain accurate values for the size and orientation of the objects. Moreover, the labels are preserved throughout the image sequence. The computation lasted in average 20 min-utes on a 2.30GHz Intel Xeon processor. The second image sequence has 14 frames of 830 × 730 pixels each. The ev-idence term plays a decisive role in distinguishing dynamic objects from static ones. Our model yields a higher tracking performance compared to the classical trackers, due to the better detection results. The labels of the objects are gener-ally preserved throughout the sequence. The average com-putation lasted in average 7 minutes on the same 2.30GHz Intel Xeon processor.

Our method (ST-MPP + BM) outperforms both the Kalman filter (KF) and the histogram-based tracker (HBT). The lower performance of the Kalman filter is described by the lower performance of the detector used and the time it needs for initialization. The performance of the histogram-based tracker is highly influenced by the change in illumination due to the different angles of acquisition of the images. The appearance of the objects changes throughout the sequence and thus, the precision of the tracker is affected. In terms of appearance, our tracker however only relies on the contrast between the objects and their border. Therefore, its perfor-mance is not affected by appearance changes.

5. Conclusions and future work

In this paper we have presented a novel spatio-temporal marked point process of ellipses to detect and track moving boats in high resolution remotely sensed images sequences. First, we have emphasized the ideological difference be-tween current spatio-temporal marked point processes de-signed for statistical understanding of natural events and our model designed for detecting and tracking moving ob-jects in image sequences. Then, we have defined a new and intuitive energy to detect the objects across the sequence and group them into trajectories. We have used an adapted

(9)

version the widely known RJMCMC sampler for optimiza-tion. We have computed birth maps for each frame and pro-posed the use of the birth and death in the neighborhood permutation kernel to speed up the optimization process. We have then tested our algorithm on three synthetic bio-logical sequences with varying levels of noise. Finally, we have shown promising results on two remotely sensed high resolution optical images sequences.

Future work can be envisioned along three complementary directions: first, an in-depth analysis of the robustness of the model with respect to noise and outliers has to be per-formed. Moreover, the model could be further extended to include the detection of static objects. Second, a parallel im-plementation of the RJMCMC sampler for spatio-temporal marked point process models needs to be devised. Such an implementation would significantly decrease the computa-tional time of the sampler. A data-parallel implementation based on the conditional independence of targets that are far apart within a frame can be envisioned. Such an approach would minimize the communication cost between clusters. Finally, the model could be applied in other fields, such as videomicroscopy.

Acknowledgments

The authors would like to thank Airbus D&S (http://airbusdefenceandspace.com) for providing the satel-lite data and for the partial funding of this research. We would like to thank Pasteur Institute (www.pasteur.fr) for the free access to the Icy software for biological image pro-cessing (http://icy.bioimageanalysis.org).

References

[1] Y. Bar-Shalom and T. Fortmann. Tracking and Data Association. Academic Press, San Diego, 1988. [2] P. Craciun and J. Zerubia. Towards efficient

simula-tion of marked point process models for boat extrac-tion from high resoluextrac-tion optical remotely sensed im-ages. Proc. IGARSS, 2014.

[3] N. Cressie and C. Wikle. Statistics for spatio-temporal data. Wiley series in probabilities and statistics, 2011. [4] P. Cr˘aciun and J. Zerubia. Unsupervised marked point process model for boat extraction in harbors from high resolution optical remotely sensed image. Proc. ICIP, pages 4122–4125, 2013.

[5] N. Dalai and B. Triggs. Histograms of oriented gra-dients for human detection. Proc. CVPR, 1:886–893, 2005.

[6] F. de Chaumont et al. Icy: an open bioimage informat-ics platform for extended reproducible research. Na-ture methods, 9:690–696, 2012.

[7] X. Descombes, F. Chatelain, F. Lafarge, C. Lantue-joul, C. Mallet, M. Minlos, M. Schmitt, M. Sigelle, R. Stoica, and E. Zhizhina. Stochastic Geometry for Image Analysis. John Wiley and Sons, 2011.

[8] P. Diggle. Spatio-temporal point processes: meth-ods and applications. Statistical methmeth-ods for spatio-temporal systems, pages 1–45, 2007.

[9] P. Diggle, B. Rowlingson, and T.-L. Su. Point pro-cess methodology for on-line spatio-temporal disease surveillance. Environmetrics, 16:423–434, 2005. [10] W. Ge and R. T. Collins. Marked point processes

for crowd counting. Proc. CVPR, pages 2913–2920, 2009.

[11] F. Goudail, P. R´efr´egier, and G. Delyon. Bhat-tacharyya distance as a contrast parameter for statisti-cal processing of noisy optistatisti-cal images. Journal of Op-tical Science of America A, 21(7):1231–1240, 2004. [12] P. Green. Reversible jump Markov Chain Monte

Carlo computation and Bayesian model determina-tion. Biometrika, 82(4):711–732, 1995.

[13] E. Jensen, K. J´onsd´ottir, J. Schmiegel, and O. Barndoff-Nielsen. Spatio-temporal modeling - with a view to biological growth. Statistical methods for spatio-temporal systems, pages 47–75, 2007. [14] B. Leibe, K. Schindler, N. Cornelis, and L. van

Gool. Coupled object detection and tracking from static cameras and moving vehicles. IEEE TPAMI, 30(10):1683–1698, 2008.

[15] S. Oh, S. Russell, and S. Sastry. Markov Chain Monte Carlo Data Association for general multiple-target tracking problems. Proc. CDC, 1:735–742, 2004.

[16] R. Peng, F. Schoenberg, and J. Woods. A space-tile conditional intensity model for evaluating a wildfire hazard index. J. Amer. Statist. Assoc., 100:26–35, 2005.

[17] K. Smith, A. Carleton, and V. Lepetit. General con-straints for batch multiple-target tracking applied to large-scale videomicroscopy. Proc. CVPR, pages 1– 8, 2008.

[18] M. van Lieshout. Markov Point Processes and Their Applications. Imperial College Press, 2000.

[19] Y. Verdi´e and F. Lafarge. Efficient Monte Carlo sam-pler for detecting parametric objects in large scenes. Proc. ECCV, 7574:539–552, 2012.

[20] G. Welch and G. Bishop. An introduction to the Kalman filter. Proc. SIGGRAPH, pages 19–24, 2001. [21] Q. Yu and G. Medioni. Multiple-target tracking by spatio-temporal Monte Carlo Markov chain data asso-ciation. IEEE TPAMI, 31(12):2196–2210, 2009.