

Movement decomposition for camera motion estimation

C. Jonchery a,∗, F. Dibos a, G. Koepfler b

a CEREMADE, Université Paris IX Dauphine, Place du Maréchal de Lattre de Tassigny, 75775 Paris Cedex 16, France, tel: +33 (0)1 44-05-41-95, fax: +33 (0)1 44-05-45-99.

b MAP5, Université Paris V, 45 rue des Saints-Pères, 75270 Paris Cedex 06, France.

Abstract

In this paper, we propose a new global method for estimating the motion parameters of a camera filming a static scene. Our approach uses both the images and the optical flow, requires no prior knowledge about the type of motion, and deals with adjacent frames of a sequence. Thanks to an original model of camera motion, our method separates the deformation between two adjacent frames into two components: a similarity and a “purely” projective application. The similarity is estimated first from the two successive images using a parametric motion model. The projective deformation is then evaluated from the directions of the optical flow vectors.

Key words: image sequences, egomotion, optical flow.

1 Introduction

The estimation of camera motion plays a crucial role in many domains of computer vision, such as the recovery of scene structure, medical imaging, augmented reality and so on. This is a difficult task, since the motion of a pixel between two images depends not only on the six parameters of the camera motion between the two successive image captures, but also on the depth of the corresponding point in the static scene. The literature on camera motion estimation from image sequences comprises many different models and estimation techniques. Existing methods can be classified into optical flow methods and feature-correspondence-based approaches: the first are global and the second local. One could also classify the methods using other criteria: whether or not the structure of the scene is estimated simultaneously with the camera motion, the use (or not) of optimization techniques, the order of estimation of the translation and the rotation, and so on.

∗ Corresponding author.
Email addresses: jonchery@ceremade.dauphine.fr (C. Jonchery), dibos@ceremade.dauphine.fr (F. Dibos).

Among the proposed methods using feature correspondences, one can mention recursive techniques based on extended Kalman filters, as described by Azarbayejani and Pentland in [1], which track the camera motion and estimate the structure of the scene. The essential matrix, first defined by Longuet-Higgins in [2], is often estimated, since only a few correspondences between two images are sufficient; the number of required correspondences is discussed by Faugeras et al. in [3–5]. In the case of an uncalibrated camera, the analogous approach is described in [6] with the fundamental matrix.

The use of optical flow avoids the choice of “good” features. Many authors use the basic bilinear constraint linking optical flow, camera velocities and the depths of the projected points. Bruss and Horn, in [7], apply an algebraic computation to remove depth from the bilinear constraint and use numerical optimization techniques; Zucchelli et al., in [8], minimize combined bilinear constraints for given optical flow vectors over the motion parameters and the unknown depths. Heeger and Jepson, in [9], decouple the translational velocity from the rotational velocity and use linear subspace methods. Ma et al. in [10] and Brooks et al. in [11] use a different approach with the epipolar differential constraint: a differential essential matrix is determined from the optical flow, leading to a unique camera velocity estimate. Another well-known approach is based on motion parallax, notably developed by Tomasi and Shi in [12] and by Irani, Rousso and Peleg in [13]. Tian, Tomasi and Heeger propose in [14] a comparison of algorithms that only use optical flow for estimating camera motion. However, these techniques work best with well separated views, when the displacement (especially the translation, the so-called baseline) between frames is sufficiently large.

Our method uses both the optical flow and the adjacent images, in order to compensate for the noise present in the optical flow. It deals with adjacent frames of a sequence, hence with narrow baselines and restricted camera rotations. In this particular context, we assume that the deformation between two consecutive images is planar. Thanks to a particular writing of the camera rotation, we decompose the image deformation into two parts: a similarity and a “purely” projective application. This motion decomposition leads to a quadratic optical flow approximation, on which our algorithm is based. The outline of the paper is as follows. In Section 2, we briefly recall the general background and describe the


context of narrow baselines, in which we work. In Section 3, we introduce the registration group, used for modeling the camera displacement. In Section 4, we propose a new camera motion decomposition, used in Section 5 to define an algorithm for motion estimation. Section 6 presents the results of experiments performed using real video sequences. Concluding remarks are given in Section 7.

2 Background

2.1 Pinhole camera model

A camera projects a point in 3D space onto a 2D image. This transformation can be described using the well-known pinhole camera model [6] presented in figure 1. The camera is located at C, the optical center, and oriented along k, the optical axis. (X, Y, Z) are the coordinates of a point M in the 3D camera coordinate system (C, i, j, k). The camera projects M on the plane R:{Z = f}. The plane R is called the retinal plane and f the focal length. The projection m of M is then the intersection of the optical ray (CM) with R.

Let c be the intersection of the optical axis with R. If (x, y) are the coordinates of the point m in the orthogonal basis (c, i, j), the relationship between (x, y) and (X, Y, Z) is the following

\[
x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z}.
\]

As f just acts as a scaling factor on the image, we can choose, without loss of generality, to set the focal length to one.
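To make the projection concrete, here is a minimal sketch in Python with NumPy (the language used for all code sketches below; the function name is ours, not from the paper):

```python
import numpy as np

def project_pinhole(points_3d, f=1.0):
    """Project 3D points (X, Y, Z), given in the camera frame (C, i, j, k),
    onto the retinal plane R = {Z = f}:  x = f X / Z,  y = f Y / Z."""
    points_3d = np.asarray(points_3d, dtype=float)
    X, Y, Z = points_3d[..., 0], points_3d[..., 1], points_3d[..., 2]
    if np.any(Z == 0):
        raise ValueError("cannot project points with Z = 0")
    return np.stack([f * X / Z, f * Y / Z], axis=-1)

# With a unit focal length, the point M = (2, 1, 4) projects onto m = (0.5, 0.25).
print(project_pinhole([2.0, 1.0, 4.0]))
```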

2.2 Framework

We consider two adjacent images f and g in a video sequence; a high frame rate implies a small relative camera displacement between the acquisition of two successive images, even if the velocity of the camera is not small. Consequently, f and g are very close and we assume that the parallax effect can be neglected. This assumption allows us to consider g as a deformation of the image f; the problem of camera motion estimation is therefore simplified: we search for planar transformations in order to determine a camera motion.



Fig. 1. Pinhole camera model.

Remark that the farther the projected scene is from the camera location, the more realistic our hypothesis is.

3 The registration group

3.1 Projective transformation produced by camera motion

Let f and g be two adjacent images in a sequence. With our assumptions, we can consider that the camera films the planar image f living in the affine plane R:{Z = 1}. Let E be the affine space and (X, Y, Z) the coordinates of a point M ∈ E in the orthonormal basis (C, i, j, k). We may consider that when we observe f , we observe in fact the restriction to the plane R of the 3D image

\[
F(X, Y, Z) = f\left(\frac{X}{Z}, \frac{Y}{Z}\right) \quad \text{if } Z \neq 0.
\]

This representation may be considered as a simplification of the classical pinhole camera model.

Let D be a displacement of the camera or, in an equivalent way, a displacement of the plane R. The movement D may be written in a unique way as D = (T, R), where R is a rotation whose axis contains C and T a translation with vector t. The set of displacements D = (T, R) forms the Lie group of rigid transformations SE(3), the special Euclidean group. The displacement D = (T, R) transforms a point X belonging to R³ into X′ = RX + t. Thus, the camera frame is given before the displacement by (C, i, j, k) and after the displacement by (C′, R(i), R(j), R(k)), with $\overrightarrow{CC'} = t$.



Fig. 2. Change of viewpoint.

Let S be the plane D(R), with orthonormal coordinate system (c′, R(i), R(j)) such that $\overrightarrow{C'c'} = R(k)$; the image g lives in S (see figure 2). If M belongs to S and m is the intersection of the line (CM) with the plane R, we have g(M) = f(m). Therefore, if we denote

\[
\overrightarrow{CM} = X i + Y j + Z k, \qquad
\overrightarrow{c'M} = x\,R(i) + y\,R(j), \qquad
R = \begin{pmatrix} a & b & c \\ a' & b' & c' \\ a'' & b'' & c'' \end{pmatrix}, \qquad
t = \alpha i + \beta j + \gamma k,
\]

then

\[
\overrightarrow{Cc'} = t + R(k), \qquad
\overrightarrow{CM} = \overrightarrow{Cc'} + \overrightarrow{c'M} = x\,R(i) + y\,R(j) + R(k) + t
\]


and so

\[
\begin{cases}
X = a x + b y + c + \alpha \\
Y = a' x + b' y + c' + \beta \\
Z = a'' x + b'' y + c'' + \gamma .
\end{cases}
\]

As g(M) = f(m) with m = (X/Z, Y/Z), we have

\[
g(x, y) = f\!\left( \frac{a x + b y + c + \alpha}{a'' x + b'' y + c'' + \gamma},\; \frac{a' x + b' y + c' + \beta}{a'' x + b'' y + c'' + \gamma} \right),
\]

which we write

\[
g(x, y) = (\varphi f)(x, y). \qquad (1)
\]

So, the 3 × 3 matrix associated to the projective application ϕ may be written

\[
M_\varphi = \begin{pmatrix} a & b & c + \alpha \\ a' & b' & c' + \beta \\ a'' & b'' & c'' + \gamma \end{pmatrix}
= R \begin{pmatrix} 1 & 0 & \langle t, R(i) \rangle \\ 0 & 1 & \langle t, R(j) \rangle \\ 0 & 0 & 1 + \langle t, R(k) \rangle \end{pmatrix}
= R H \qquad (2)
\]

and this decomposition is shown to be unique.
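As a small sketch of this construction (the function name is ours), the matrix Mϕ = RH can be assembled directly from a rotation R ∈ SO(3) and a translation vector t expressed in the basis (i, j, k):

```python
import numpy as np

def projective_matrix(R, t):
    """3x3 matrix M_phi = R H of the projective application induced by the
    displacement D = (T, R), following formula (2)."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    H = np.eye(3)
    H[0, 2] = R[:, 0] @ t        # <t, R(i)>
    H[1, 2] = R[:, 1] @ t        # <t, R(j)>
    H[2, 2] = 1.0 + R[:, 2] @ t  # 1 + <t, R(k)>
    return R @ H

# Sanity check: with no motion, M_phi is the identity.
print(np.allclose(projective_matrix(np.eye(3), np.zeros(3)), np.eye(3)))
```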

Conversely, we have by analogy,

\[
f(x, y) = g\!\left( \frac{a x + a' y + a'' - \langle t, R(i) \rangle}{c x + c' y + c'' - \langle t, R(k) \rangle},\; \frac{b x + b' y + b'' - \langle t, R(j) \rangle}{c x + c' y + c'' - \langle t, R(k) \rangle} \right),
\]

which we write

\[
f(x, y) = (\psi g)(x, y). \qquad (3)
\]

The applications ϕ and ψ are projective applications, each defined by only six parameters: three for the rotation and three for the translation.

3.2 Projective group

Projective applications are classically represented in the projective group defined in R² by

\[
G_2 = \left\{ \varphi : \mathbb{R}^2 \to \mathbb{R}^2 \;\middle|\; \forall (x, y) \in \mathbb{R}^2,\;
\varphi(x, y) = \left( \frac{a x + b y + c}{a'' x + b'' y + c''},\; \frac{a' x + b' y + c'}{a'' x + b'' y + c''} \right),\;
\det A = \det \begin{pmatrix} a & b & c \\ a' & b' & c' \\ a'' & b'' & c'' \end{pmatrix} \neq 0 \right\}.
\]

This group is isomorphic to the special linear group SL(R³) of invertible matrices, because the projective application generated by a matrix A is the same as the one generated by λA (λ ∈ R, λ ≠ 0). Consequently, G₂ is an eight-parameter group.

But our aim is to estimate the camera motion through the image deformation, both defined by six parameters, as shown above. Thus, we are going to model the projective transformation in another group: the registration group.

3.3 Registration group

Let A be the subset of projective applications ϕ : R2 → R2 such that

\[
\forall (x, y) \in \mathbb{R}^2,\quad
\varphi(x, y) = \left( \frac{a x + b y + c + \alpha}{a'' x + b'' y + c'' + \gamma},\; \frac{a' x + b' y + c' + \beta}{a'' x + b'' y + c'' + \gamma} \right),
\]

where the matrix

\[
R = \begin{pmatrix} a & b & c \\ a' & b' & c' \\ a'' & b'' & c'' \end{pmatrix}
\]

belongs to SO(3).

The registration group is the group (A, ⋆), where the composition law ⋆ is deduced from the composition law ∘ of the displacement group SE(3) through the isomorphism

\[
I : A \longrightarrow SE(3), \qquad \forall \varphi \in A,\; I(\varphi) = (T, R),
\]

where (T, R) is used for the unique spatial displacement D = (T, R) in which T is the translation with vector t = (α, β, γ) and R the rotation defined above, whose axis contains the origin.

More precisely, let ϕ₁ and ϕ₂ belong to A; they correspond to the displacements D₁ = (T₁, R₁) and D₂ = (T₂, R₂), respectively. Then ϕ₂ ⋆ ϕ₁ = ϕ, where ϕ is the projective application associated to the displacement D = D₂ ∘ D₁.

The group SE(3) is the semi-direct product of SO(3), the rotations about the origin, with the translations of R³:

\[
SE(3) = SO(3) \ltimes \mathbb{R}^3 .
\]

Hence D = D₂ ∘ D₁ = (T, R), where R is the rotation R = R₂R₁ and T is the translation with vector t = R₂t₁ + t₂.

Remark that if ϕ belongs to A and g(x, y) = f(ϕ(x, y)), then we have f(x, y) = g(ψ(x, y)) with ψ = ϕ⁻¹ in the registration group. This means that if ϕ is associated to the displacement D = (T, R), then ϕ⁻¹ is associated to D⁻¹ = (−R⁻¹(t), R⁻¹).

By modeling the camera displacement in the registration group, we reduce the problem to the determination of the six parameters of a planar application, as R and T are each defined by three parameters.
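As a short numerical sketch of the corresponding operations on displacements (the composition underlying the law ⋆ and the inversion used above; function names and the (R, t) tuple representation are ours):

```python
import numpy as np

def compose(D2, D1):
    """Composition D = D2 o D1 of two displacements, each given as a pair (R, t):
    x -> R2 (R1 x + t1) + t2, i.e. R = R2 R1 and t = R2 t1 + t2."""
    R1, t1 = D1
    R2, t2 = D2
    return R2 @ R1, R2 @ t1 + t2

def inverse(D):
    """Inverse displacement D^{-1} = (-R^{-1} t, R^{-1}); for a rotation R^{-1} = R^T."""
    R, t = D
    return R.T, -R.T @ t

# Composing a displacement with its inverse gives the identity displacement.
D = (np.eye(3), np.array([0.1, -0.2, 0.05]))
Rc, tc = compose(D, inverse(D))
print(np.allclose(Rc, np.eye(3)), np.allclose(tc, 0.0))
```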

4 Decomposition of camera motion

In this section, we decompose a camera motion in order to separate the image deformation into two components: a similarity part and a “purely” projective part. A camera motion can be decomposed into three basic types of motion:

• a translation, which produces a homothety-translation on the image f belonging to the plane R,

• a rotation with axis k, which produces a planar rotation on f,

• a rotation with axis in the plane (C, i, j), which slightly moves the observation f in the plane R.

4.1 Decomposition of a rotation

Let us consider a camera rotation R with axis containing C. We decompose R into two particular rotations, R₂R₁. The first one, R₁, with axis ∆ belonging to the plane (C, i, j), transforms the direction of the optical axis k into R(k); this rotation induces a projective deformation of the image f. The second one, R₂, is a rotation with axis R(k): R₂ induces a planar rotation of the image R₁(f). Any camera rotation can be written in such a way.

Let us express the rotation R₁ with two parameters: θ for the location of ∆ in the plane (C, i, j) and α for the angle of the rotation. If we denote $R^l_a$ the rotation matrix with axis l and angle a, the expression of R₁ in the basis (i, j, k) is

\[
R_1 = R^k_\theta \, R^i_\alpha \, R^k_{-\theta},
\]

which we denote in the following Rθ,α.

Now, let β be the angle of the rotation R₂ around the new optical axis R(k). We can then write the rotation R₂ in the basis (i, j, k) as

\[
R_2 = R^k_\theta \, R^i_\alpha \, R^k_\beta \, R^i_{-\alpha} \, R^k_{-\theta}.
\]

Finally, the expression of the global rotation R is

\[
R = R_2 R_1 = R^k_\theta \, R^i_\alpha \, R^k_\beta \, R^k_{-\theta}.
\]

This decomposition is interesting because of the induced deformations of the image: R₁ produces a “purely” projective deformation of the image f, whereas R₂ creates a planar rotation of the image R₁(f).
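The decomposition can be checked numerically; the sketch below (helper names ours) builds R₁, R₂ and R from elementary rotations about k and i:

```python
import numpy as np

def rot_k(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_i(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def decompose_rotation(theta, alpha, beta):
    """R1 = R^k_theta R^i_alpha R^k_{-theta} (axis Delta in the plane (C, i, j)),
    R2 = R^k_theta R^i_alpha R^k_beta R^i_{-alpha} R^k_{-theta} (axis R(k)),
    and the global rotation R = R2 R1 = R^k_theta R^i_alpha R^k_beta R^k_{-theta}."""
    R1 = rot_k(theta) @ rot_i(alpha) @ rot_k(-theta)
    R2 = rot_k(theta) @ rot_i(alpha) @ rot_k(beta) @ rot_i(-alpha) @ rot_k(-theta)
    R = rot_k(theta) @ rot_i(alpha) @ rot_k(beta) @ rot_k(-theta)
    return R1, R2, R

R1, R2, R = decompose_rotation(theta=3 * np.pi / 4, alpha=5e-4, beta=0.051)
print(np.allclose(R2 @ R1, R))  # True: R = R2 R1
```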


Fig. 3. Decomposition of a camera rotation R in two rotations R2R1.

4.2 Decomposition of a general motion

A complete camera motion D = (T, R) induces a projective deformation ϕ of the image f. The matrix associated to ϕ is RH (formula (2)), which can now be written as

\[
RH = R^k_\theta \, R^i_\alpha \, R^k_\beta \, R^k_{-\theta} \, H.
\]

We can swap $R^k_\beta$ and $R^k_{-\theta}$ in the previous formula because the two rotations have the same rotation axis, so

\[
RH = R^k_\theta \, R^i_\alpha \, R^k_{-\theta} \, R^k_\beta \, H.
\]

If we denote $r_{\theta,\alpha}$ the “purely” projective deformation associated to the rotation Rθ,α and s the similarity deformation associated to $R^k_\beta H$, then we have

\[
g(x, y) = f(\varphi(x, y)) = f(r_{\theta,\alpha} \circ s\,(x, y)) = (r_{\theta,\alpha} \circ s)(f)(x, y).
\]

We therefore obtain six parameters defining the camera motion: two for the rotation Rθ,α and four for the translation T and the rotation $R^k_\beta$. We now express the camera motion with the parameters (θ, α, β, A, B, λ), where (−A, −B, λ − 1) are the coordinates of t in the basis (R(i), R(j), R(k)). These new notations allow an easier writing of the projective application ψ (the inverse of ϕ in the registration group):

\[
\psi(x, y) = \left( \frac{a x + a' y + a'' + A}{c x + c' y + c'' + 1 - \lambda},\; \frac{b x + b' y + b'' + B}{c x + c' y + c'' + 1 - \lambda} \right)
\quad \text{where} \quad
R^{-1} = \begin{pmatrix} a & a' & a'' \\ b & b' & b'' \\ c & c' & c'' \end{pmatrix}.
\]

Remark that the six parameters (θ, α, β, A, B, λ) give explicit access to the camera displacement D = (T, R). Indeed,

\[
\begin{cases}
t = -A\,R(i) - B\,R(j) - (1 - \lambda)\,R(k) \\
R = R_{\theta,\alpha} R^k_\beta .
\end{cases}
\]
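A small sketch (function names ours) recovering the displacement D = (T, R) from the six parameters (θ, α, β, A, B, λ):

```python
import numpy as np

def rot_k(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_i(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def displacement_from_parameters(theta, alpha, beta, A, B, lam):
    """Camera displacement (R, t) from (theta, alpha, beta, A, B, lambda):
    R = R_{theta,alpha} R^k_beta and t = -A R(i) - B R(j) - (1 - lambda) R(k)."""
    R_theta_alpha = rot_k(theta) @ rot_i(alpha) @ rot_k(-theta)
    R = R_theta_alpha @ rot_k(beta)
    # The columns of R are the coordinates of R(i), R(j), R(k) in the basis (i, j, k).
    t = -A * R[:, 0] - B * R[:, 1] - (1.0 - lam) * R[:, 2]
    return R, t

R, t = displacement_from_parameters(theta=3 * np.pi / 4, alpha=5e-4,
                                    beta=0.051, A=3.0, B=2.0, lam=1.02)
print(t)
```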

As we consider two successive images of a video sequence with a high frame rate, the camera motion between the two images is very small and the parameter values are restricted, except for the angle θ, which belongs to ]−π, π]. Figure 4 gives the orders of magnitude of the parameter values that we have obtained experimentally, when we take a unit focal length. These experiments consist in taking images a few hundred pixels in size and applying the six-parameter projective application. As the images must not be deformed too much, we deduce the orders of magnitude of the parameters.

4.3 Decomposition and approximation of an optical flow

Let us consider a camera motion D = (T, R) described by the six parameters (θ, α, β, A, B, λ). The rotation matrix can then be expressed as $R = R_{\theta,\alpha} R^k_\beta$,


parameter              magnitude
α (radian)             10⁻⁴
β (radian)             10⁻²
A, B (unit of pixel)   1
1 − λ (unit of pixel)  10⁻²

Fig. 4. Orders of magnitude for camera motion parameters.

with

\[
R_{\theta,\alpha} = \begin{pmatrix}
\cos^2\theta + \sin^2\theta \cos\alpha & \cos\theta \sin\theta (1 - \cos\alpha) & \sin\theta \sin\alpha \\
\cos\theta \sin\theta (1 - \cos\alpha) & \sin^2\theta + \cos^2\theta \cos\alpha & -\cos\theta \sin\alpha \\
-\sin\theta \sin\alpha & \cos\theta \sin\alpha & \cos\alpha
\end{pmatrix}
\quad \text{and} \quad
R^k_\beta = \begin{pmatrix}
\cos\beta & -\sin\beta & 0 \\
\sin\beta & \cos\beta & 0 \\
0 & 0 & 1
\end{pmatrix}.
\]

As α is very small, we can approximate the rotation matrix Rθ,α by a first-order Taylor expansion

\[
R_{\theta,\alpha} \simeq \begin{pmatrix}
1 & 0 & \alpha \sin\theta \\
0 & 1 & -\alpha \cos\theta \\
-\alpha \sin\theta & \alpha \cos\theta & 1
\end{pmatrix}. \qquad (4)
\]

As β² has the same order of magnitude as α, we have to approximate $R^k_\beta$ by a second-order Taylor expansion in order to keep the approximations consistent:

\[
R^k_\beta \simeq \begin{pmatrix}
1 - \beta^2/2 & -\beta & 0 \\
\beta & 1 - \beta^2/2 & 0 \\
0 & 0 & 1
\end{pmatrix}.
\]

The global rotation R is therefore approximated by the following matrix, with coefficients computed to an accuracy of 10⁻⁴:

\[
R \simeq \begin{pmatrix}
1 - \beta^2/2 & -\beta & \alpha \sin\theta \\
\beta & 1 - \beta^2/2 & -\alpha \cos\theta \\
-\alpha \sin\theta & \alpha \cos\theta & 1
\end{pmatrix}.
\]


Let M(x, y) be a point of the image f and M′(x′, y′) the corresponding point of g, such that f(x, y) = g(x′, y′). According to the definition of the projective application ψ, we have the exact correspondence between (x, y) and (x′, y′) through M′ = ψ(M). The visual displacement, also called optical flow, at the point M(x, y) is defined by $\overrightarrow{MM'} = (x' - x,\, y' - y)$. Using the previous matrix, and since the rotation associated to ψ is R⁻¹, the optical flow is approximated by

\[
\begin{cases}
x' - x \simeq \dfrac{(1 - \beta^2/2)\,x + \beta y - \alpha \sin\theta + A}{\alpha x \sin\theta - \alpha y \cos\theta + 1 + (1 - \lambda)} - x \\[2mm]
y' - y \simeq \dfrac{-\beta x + (1 - \beta^2/2)\,y + \alpha \cos\theta + B}{\alpha x \sin\theta - \alpha y \cos\theta + 1 + (1 - \lambda)} - y .
\end{cases}
\]

As α and (1 − λ)² have an order of magnitude of 10⁻⁴, we can perform a Taylor expansion of order 10⁻⁴ on the set of parameters:

\[
\begin{cases}
x' - x \simeq \big(\lambda - 1 + (1 - \lambda)^2 - \beta^2/2 - A\alpha \sin\theta\big)x
+ \big(\lambda\beta + A\alpha \cos\theta\big)y
+ A\big(\lambda + (1 - \lambda)^2\big) - \alpha \sin\theta
+ \alpha x(-x \sin\theta + y \cos\theta) \\[1mm]
y' - y \simeq \big(-\lambda\beta - B\alpha \sin\theta\big)x
+ \big(\lambda - 1 + (1 - \lambda)^2 - \beta^2/2 + B\alpha \cos\theta\big)y
+ B\big(\lambda + (1 - \lambda)^2\big) + \alpha \cos\theta
+ \alpha y(-x \sin\theta + y \cos\theta) .
\end{cases}
\]

We want to obtain an approximation of the two components of the optical flow with an accuracy of 10⁻¹. Therefore, we consider the magnitude of the values of x and y. As the image size is supposed to be of order 10², the values of x and y are of order 1, 10 or 100. Using this and the magnitudes of the parameter values, we obtain the following quadratic approximation:

\[
\begin{cases}
x' - x \simeq \underbrace{(\lambda - 1)x + A}_{(1)} + \underbrace{\beta y}_{(2)} + \underbrace{\alpha x (y \cos\theta - x \sin\theta)}_{(3)} \\[2mm]
y' - y \simeq \underbrace{(\lambda - 1)y + B}_{(1)} - \underbrace{\beta x}_{(2)} + \underbrace{\alpha y (y \cos\theta - x \sin\theta)}_{(3)}
\end{cases}
\qquad (5)
\]

The optical flow has been decomposed into three independent terms: the first one (1) is due to the translation of the camera, the second one (2) to the rotation $R^k_\beta$ and the third one (3) to the rotation Rθ,α. Only this last term will be used in the following to recover the optical flow generated by the rotation Rθ,α from the flow generated by a complete camera motion defined by (θ, α, β, A, B, λ), given the values of (β, A, B, λ).
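The quadratic approximation (5) is straightforward to evaluate; the sketch below (function name ours) returns the three terms separately, which is exactly what the algorithm of Section 5 exploits:

```python
import numpy as np

def quadratic_flow(x, y, theta, alpha, beta, A, B, lam):
    """Quadratic optical flow approximation (5) at image points (x, y).
    Returns the three independent terms of (x' - x, y' - y):
    (1) translation/zoom, (2) rotation R^k_beta, (3) rotation R_{theta,alpha}."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    term1 = np.stack([(lam - 1.0) * x + A, (lam - 1.0) * y + B], axis=-1)
    term2 = np.stack([beta * y, -beta * x], axis=-1)
    q = y * np.cos(theta) - x * np.sin(theta)
    term3 = np.stack([alpha * x * q, alpha * y * q], axis=-1)
    return term1, term2, term3

# Flow at a few points for the parameters used in Figure 5.
xs = np.array([0.0, 50.0, -100.0])
ys = np.array([0.0, -50.0, 100.0])
t1, t2, t3 = quadratic_flow(xs, ys, theta=3 * np.pi / 4, alpha=5e-4,
                            beta=0.051, A=3.0, B=2.0, lam=1.02)
print(t1 + t2 + t3)  # total flow (x' - x, y' - y)
```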

Let us remark that this approximation of the optical flow is correct only for two consecutive images of a sequence. When the images are distant in time, we have to take into account the depths of the projected points. This approximation is also related to the bilinear constraint linking optical flow, linear and angular velocities, and the depth of the projected point [7,9]. Actually, by computing the velocity information from the camera displacement (θ, α, β, A, B, λ), we obtain approximately the same formula. This computation is performed using the logarithm map and the theory of Lie groups and algebras [15].

5 Estimation of camera motion parameters

Let f and g be two adjacent images in a video sequence and D = (T, R) be the camera displacement defined by the six parameters (θ, α, β, A, B, λ). In this section, we describe our method for estimating camera motion parameters.

5.1 Algorithm

If we use the previous optical flow approximation (5) then, for x and y values of order 1 and 10, we have, with an accuracy of 10⁻¹,

\[
\begin{cases}
x' \simeq \lambda x + A + \beta y \\
y' \simeq \lambda y + B - \beta x .
\end{cases}
\]

Therefore, the center of the image f (which corresponds to small values of x and y) is mainly deformed by the similarity. Figure 5 shows an example of a theoretical flow and its two components, generated by the associated similarity and by the associated rotation Rθ,α. This decomposition illustrates that the similarity acts on the whole image, whereas the purely projective deformation mainly appears towards the borders of the image.

Fig. 5. At the top, the optical flow generated by a camera motion parameterized by (θ = 3π/4, α = 5·10⁻⁴, β = 0.051, A = 3, B = 2, λ = 1.02); at the bottom, the two components of the flow, generated on the left by the similarity with parameters (β = 0.051, A = 3, B = 2, λ = 1.02) and on the right by the purely projective deformation with parameters (θ = 3π/4, α = 5·10⁻⁴).

Remark that we could obtain this result without this computation. Indeed, in order to obtain a meaningful image g, the vector R(k) must be very close to the vector k, which is linked to the small value of α. This implies that, if we denote

\[
R = \begin{pmatrix} a & b & c \\ a' & b' & c' \\ a'' & b'' & c'' \end{pmatrix}
\]

the rotation matrix, the coefficients c and c′ must be close to zero, and consequently the coefficients a″ and b″ too. Then, for x and y small enough, we have the following approximation of formula (3):

\[
f(x, y) \approx g\!\left( \frac{a x + a' y - \langle t, R(i) \rangle}{c'' - \langle t, R(k) \rangle},\; \frac{b x + b' y - \langle t, R(j) \rangle}{c'' - \langle t, R(k) \rangle} \right)
\approx g\!\left( \frac{a x + a' y + A}{c'' + 1 - \lambda},\; \frac{b x + b' y + B}{c'' + 1 - \lambda} \right). \qquad (6)
\]

This shows that affine deformations are good approximations of significant projective deformations, which may explain the large number of papers on this subject ([16], [17], [18], [19], [20]). But the only affine approximations compatible with a camera motion are similarities. Indeed, if a″ = 0 and b″ = 0 then c″² = 1 because R is orthogonal, and c″² = 1 implies c″ = 1 because the vector R(k) is close to k and not to −k; thus, R is a rotation with axis k. We now describe our algorithm for camera motion estimation.


• Estimation of the four parameters (β, A, B, λ) of the similarity from the extracted center parts of f and g, with the algorithm of 2D parametric motion models described in [21] and presented in Section 5.2.

• Computation of the optical flow V between f and g with the algorithm of Weickert and Schnörr [22].

• Computation of the optical flow Vp due to the projective transformation Rθ,α, by removing the optical flow due to the estimated similarity from V (this is justified by approximation (5)).

• Estimation of the two projective parameters θ and α, using the directions Dp of the optical flow Vp and the images f and g.

In the next subsections, we describe the similarity estimation and then the estimation of θ and α.

5.2 Similarity estimation

In order to estimate the four parameters of a similarity linking two images I1 and I2, we use an algorithm proposed by Odobez and Bouthémy in [21].

Their method consists in estimating 2D parametric motion models between two images. These models can be constant, affine or quadratic. As our aim is to compute a similarity, we use an affine motion model with four parameters (c₁, c₂, a₁, a₂), able to represent a 2D translation, a 2D rotation and a zoom. The displacement vector u(x, y) estimated between two given images is then defined at the image point (x, y) by

\[
u(x, y) = \begin{pmatrix} x' - x \\ y' - y \end{pmatrix}
= \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}
+ \begin{pmatrix} a_1 & a_2 \\ -a_2 & a_1 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}.
\]

The parameters (c₁, c₂, a₁, a₂) give access to our four parameters (β, A, B, λ) because, with formula (3) and the definition of $R^k_\beta$, A, B and λ, we have

\[
\begin{cases}
x' = \dfrac{x \cos\beta + y \sin\beta + A}{2 - \lambda} \\[2mm]
y' = \dfrac{-x \sin\beta + y \cos\beta + B}{2 - \lambda}
\end{cases}
\;\Longleftrightarrow\;
\begin{pmatrix} x' - x \\ y' - y \end{pmatrix}
= \frac{1}{2 - \lambda} \begin{pmatrix} A \\ B \end{pmatrix}
+ \begin{pmatrix} \frac{\cos\beta}{2-\lambda} - 1 & \frac{\sin\beta}{2-\lambda} \\[1mm] -\frac{\sin\beta}{2-\lambda} & \frac{\cos\beta}{2-\lambda} - 1 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}.
\]


So, when we have (c₁, c₂, a₁, a₂), we can compute (β, A, B, λ) from

\[
\begin{cases}
\lambda = 2 - \dfrac{1}{\sqrt{(a_1+1)^2 + a_2^2}} \\[2mm]
A = \dfrac{c_1}{\sqrt{(a_1+1)^2 + a_2^2}} \\[2mm]
B = \dfrac{c_2}{\sqrt{(a_1+1)^2 + a_2^2}} \\[2mm]
\beta = \arcsin\!\left( \dfrac{a_2}{\sqrt{(a_1+1)^2 + a_2^2}} \right) \qquad \left(\beta \in \left]-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right]\right).
\end{cases}
\]
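A direct transcription of this conversion (a sketch; the function name is ours):

```python
import numpy as np

def similarity_parameters(c1, c2, a1, a2):
    """Convert the affine motion parameters (c1, c2, a1, a2) estimated by the
    2D parametric model into the similarity parameters (beta, A, B, lambda)."""
    r = np.sqrt((a1 + 1.0) ** 2 + a2 ** 2)   # r = 1 / (2 - lambda)
    lam = 2.0 - 1.0 / r
    A = c1 / r
    B = c2 / r
    beta = np.arcsin(a2 / r)                 # beta in ]-pi/2, pi/2]
    return beta, A, B, lam

# Round trip with the similarity used in Section 5.4 (beta = -0.051, A = 2, B = 1, lambda = 0.99).
beta, A, B, lam = -0.051, 2.0, 1.0, 0.99
c1, c2 = A / (2 - lam), B / (2 - lam)
a1, a2 = np.cos(beta) / (2 - lam) - 1.0, np.sin(beta) / (2 - lam)
print(similarity_parameters(c1, c2, a1, a2))  # ~(-0.051, 2.0, 1.0, 0.99)
```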

Let us now briefly describe the algorithm of Odobez and Bouthémy. The authors define the displaced frame difference (DFD) associated to a parametric motion model at the point (x, y) by

\[
\mathrm{DFD}_{(c_1,c_2,a_1,a_2,\xi)}(x, y) = I_2\big((x, y) + u(x, y)\big) - I_1(x, y) + \xi,
\]

where ξ is a global intensity shift accounting for a global illumination change. The five parameters are then estimated by minimizing the function

\[
\sum_{(x,y) \in I_1} \rho\big(\mathrm{DFD}_{(c_1,c_2,a_1,a_2,\xi)}(x, y),\, C\big),
\]

where the function ρ is called an M-estimator, since its minimization corresponds to a maximum-likelihood estimation. The authors choose a function that is bounded for high values in order to eliminate the contribution of outliers. They use Tukey's biweight function, defined as

\[
\rho(t, C) =
\begin{cases}
\dfrac{t^2}{2}\left( C^4 - C^2 t^2 + \dfrac{t^4}{3} \right) & \text{if } |t| < C, \\[2mm]
\dfrac{C^6}{6} & \text{otherwise.}
\end{cases}
\]
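For reference, a minimal sketch of this robust function (the function name is ours):

```python
import numpy as np

def tukey_biweight(t, C):
    """Tukey's biweight rho-function used as M-estimator:
    rho(t, C) = (t^2/2)(C^4 - C^2 t^2 + t^4/3) if |t| < C, and C^6/6 otherwise."""
    t = np.asarray(t, dtype=float)
    rho_in = 0.5 * t ** 2 * (C ** 4 - C ** 2 * t ** 2 + t ** 4 / 3.0)
    return np.where(np.abs(t) < C, rho_in, C ** 6 / 6.0)

# The function saturates at C^6/6, which bounds the influence of outliers.
print(tukey_biweight(np.array([0.0, 0.5, 2.0]), C=1.0))
```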

The minimization of ρ is performed using an incremental and multiresolution scheme described in [21]. This method is accurate and has a low computational cost; it provides us with an estimate of the similarity linking the centers of two successive images. The software that we have used, developed by Odobez and Bouthémy, is available at http://www.irisa.fr/Vista/Motion2D. Let us recall that the camera translation can only be recovered up to a scale factor. With our assumptions, the estimated values of A, B and 1 − λ actually correspond to points at the average depth of the scene, projected on the image f.


5.3 Estimation of θ and α

5.3.1 Estimation of θ

At this stage of the algorithm, we suppose that we work with the optical flow Vp due to the projective rotation Rθ,α. The flow Vp is obtained by removing the flow Vs due to the estimated similarity from the flow V computed by Weickert and Schnörr's algorithm, thanks to the approximation (5): Vp = V − Vs.

According to the approximation (4) of the matrix Rθ,α in the previous section, the visual displacement (or optical flow) Vp(x, y) at the point (x, y) is approximated by

\[
\begin{cases}
x' - x \simeq \alpha\,(-\sin\theta\, x^2 + \cos\theta\, xy - \sin\theta) \\
y' - y \simeq \alpha\,(\cos\theta\, y^2 - \sin\theta\, xy + \cos\theta).
\end{cases}
\]

This implies that the direction Dp(x, y) of the optical flow vector Vp(x, y), given by

\[
D_p(x, y) = \arctan\!\left( \frac{y' - y}{x' - x} \right)
\simeq \arctan\!\left( \frac{\cos\theta\, y^2 - \sin\theta\, xy + \cos\theta}{-\sin\theta\, x^2 + \cos\theta\, xy - \sin\theta} \right),
\]

is independent of the angle α and thus only depends on θ. Therefore, the angle θ is estimated as follows: the directions of the optical flow computed between f and g are compared with the directions of optical flow models associated to rotations Rθ,α, for different values of θ and a constant value of α. Figure 6 depicts eight directional models, which are ideal, noise-free and obtained from our numerical formulas.

If we denote Dθk(x, y) the direction of the optical flow vector of the model generated by Rθk,α, the estimated value of θ is

\[
\hat\theta = \operatorname*{argmin}_{\theta_k \in \Theta} \sum_{(x,y) \in f} \big| D_p(x, y) - D_{\theta_k}(x, y) \big| \pmod{2\pi},
\]

where Θ is a set of discrete angular values in ]−π, π]. In the experiments presented in the next section, we have used Θ = {−3π/4, −π/2, −π/4, 0, π/4, π/2, 3π/4, π}.
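A sketch of this selection (function names ours; atan2 is used to obtain full-circle directions and the angular difference is wrapped to ]−π, π] before taking absolute values):

```python
import numpy as np

def model_direction(x, y, theta):
    """Direction (in radians) of the model flow generated by R_{theta,alpha};
    it does not depend on alpha."""
    u = -np.sin(theta) * x ** 2 + np.cos(theta) * x * y - np.sin(theta)
    v = np.cos(theta) * y ** 2 - np.sin(theta) * x * y + np.cos(theta)
    return np.arctan2(v, u)

def estimate_theta(x, y, Dp, thetas=None):
    """Estimate theta by comparing the directions Dp of the residual flow Vp
    with the eight ideal directional models of Figure 6."""
    if thetas is None:
        thetas = np.array([-3, -2, -1, 0, 1, 2, 3, 4]) * np.pi / 4
    costs = []
    for tk in thetas:
        diff = np.angle(np.exp(1j * (Dp - model_direction(x, y, tk))))  # wrap to ]-pi, pi]
        costs.append(np.sum(np.abs(diff)))
    return thetas[int(np.argmin(costs))]

# Synthetic check: directions generated with theta = pi/4 are recovered.
xs, ys = np.meshgrid(np.arange(-50.0, 51.0, 10.0), np.arange(-50.0, 51.0, 10.0))
Dp = model_direction(xs, ys, np.pi / 4)
print(estimate_theta(xs, ys, Dp))  # ~0.785 = pi/4
```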

5.3.2 Estimation of α

Fig. 6. Models of optical flow generated by rotations Rθk,α, with, from left to right and top to bottom, θ = −3π/4, −π/2, −π/4, 0, π/4, π/2, 3π/4, π, and α = 0.0005. The image size is 256 × 256 and the length of the velocity vectors is multiplied by 5 for display.

While the directions of the optical flow vectors are characteristic of the angle θ, the norm of the optical flow vectors depends on the angle α. Unfortunately, the algorithms for computing optical flow give a fairly good estimate of the vector directions but a poor estimate of their norms. Therefore, we come back to the two adjacent images to estimate the angle α. Let (β̂, Â, B̂, λ̂) be the previously estimated values of the similarity and θ̂ the value of θ estimated by the above method. We construct the images s ∘ r_{θ̂,αi}(f), where s ∘ r_{θ̂,αi} is the projective deformation generated by the camera motion (θ̂, αi, β̂, Â, B̂, λ̂), with αi belonging to a discrete set I of values in [0, 10⁻³[. The value of αi that minimizes the image difference in L² norm is then selected:

\[
\hat\alpha = \operatorname*{argmin}_{\alpha_i \in I} \; \big\| s \circ r_{\hat\theta,\alpha_i}(f) - g \big\|_2 .
\]

In the experiments presented, we have used I = {0, 0.0001, 0.0002, …, 0.0009}.
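A sketch of this search. The nearest-neighbour warp below is a simple stand-in for the construction of s ∘ r_{θ̂,αi}(f) (any homography warping routine can be used instead), and matrix_of is a hypothetical callable returning the 3 × 3 matrix of the candidate deformation for a given αi:

```python
import numpy as np

def warp_projective(img, M):
    """Nearest-neighbour warp of a grayscale image by the projective application
    of matrix M, with (x, y) coordinates taken relative to the image centre."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = xs - w / 2.0, ys - h / 2.0
    den = M[2, 0] * x + M[2, 1] * y + M[2, 2]
    xs_src = (M[0, 0] * x + M[0, 1] * y + M[0, 2]) / den + w / 2.0
    ys_src = (M[1, 0] * x + M[1, 1] * y + M[1, 2]) / den + h / 2.0
    xs_src = np.clip(np.round(xs_src).astype(int), 0, w - 1)
    ys_src = np.clip(np.round(ys_src).astype(int), 0, h - 1)
    return img[ys_src, xs_src]

def estimate_alpha(f, g, matrix_of, alphas=None):
    """Select the alpha_i minimizing || s o r_{theta_hat, alpha_i}(f) - g ||_2."""
    if alphas is None:
        alphas = np.arange(0.0, 1e-3, 1e-4)   # I = {0, 0.0001, ..., 0.0009}
    costs = [np.linalg.norm(warp_projective(f, matrix_of(a)).astype(float)
                            - g.astype(float)) for a in alphas]
    return alphas[int(np.argmin(costs))]
```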


5.4 Example

In this part, we illustrate our method with a 2D example. In figure 7, we deform the image of Lena, which we call f, into an image g, with a camera motion parameterized by the following six parameters: θ = 0.8, α = 0.0005 for the purely projective rotation and β = −0.051, A = 2, B = 1, λ = 0.99 for the similarity.

Fig. 7. From left to right, the famous image of Lena (f), the deformation g and the average (f + g)/2.

Figure 8 shows the extraction of the center parts of f and g (of size 100 × 100) used to estimate the similarity with the method of Odobez and Bouthémy; the results are β̂ = −0.05113, Â = 2.812, B̂ = 1.075, λ̂ = 0.98821. This estimate allows us to compute a new flow, obtained by removing from the optical flow between f and g the component due to the estimated similarity (figure 9).


Fig. 9. On the left, the optical flow computed between f and g; on the right, the new flow obtained by removing from the first flow the component due to the similarity computed with the method of Odobez and Bouthémy (the length of the velocity vectors is multiplied by 5).

This new optical flow is compared to the eight directional models (figure 6), leading to an estimate of the parameter θ: θ̂ = π/4. The angle α is estimated by constructing ten images s ∘ r_{θ̂,α}(f) for ten different values of α and comparing them to the image g. The results of the comparison are given in table 10. The value α̂ = 0.0004 provides the best registration.

β = −0.05113   A = 2.812   B = 1.075   λ = 0.98821   θ = π/4

α       Diff
0.0000  30.9027
0.0001  28.4744
0.0002  24.748
0.0003  20.2643
0.0004  16.1502
0.0005  17.0852
0.0006  22.6945
0.0007  27.0005
0.0008  30.3602
0.0009  32.883

Fig. 10. Estimation of α: ten images are constructed by distorting “Lena” with ten different sets of six parameters (the estimated β, A, B, λ, θ above, combined with each value of α). The differences in the Diff column are the average L² norm of the image differences between g and the constructed images.

Finally, the estimated and initial parameters are summarized in table 11. Figure 12 shows the initial average of f and g compared to the final average of g and (r_{θ̂,α̂} ∘ s)(f), the transformation of f with the estimated parameters.


                        β          A       B       λ         θ      α
Estimated parameters    −0.05113   2.812   1.075   0.98821   π/4    0.0004
Initial parameters      −0.051     2       1       0.99      0.8    0.0005

Fig. 11. Initial and estimated parameters.

Fig. 12. Left, the image (f + g)/2; right, (s ∘ r_{θ̂,α̂}(f) + g)/2.

6 Experimental results

6.1 Results with video sequences: augmented reality

Augmented reality consists in adding an object to a sequence in such a way that it appears to be present within the scene. Three successive problems then have to be solved: camera motion estimation, scene depth extraction and augmentation. Many researchers have proposed methods and obtained good results [23,24].

We have applied our camera motion estimation method to a real video sequence of our office, taken with an uncalibrated camera. The optical flow, required for the estimation of the angle θ, was computed with the algorithm of Weickert and Schnörr ([22]). In order to evaluate the camera motion estimation, we have inserted in the sequence a shape that we deform with the projective application associated to the estimated camera motion. We have not taken into account the depth of the inserted feature, nor the scene depth, since the feature has been inserted on the main planar region of the scene, roughly parallel to the retinal plane. Example frames from the augmented sequence are presented in figure 13. This experiment shows that the camera motion is accurately estimated; indeed, the inserted feature moves with the same motion as the background of the scene. More precisely, we observe that the feature's orientation follows the orientation of the background (camera rotations are correctly estimated). The quality of this reality augmentation is particularly encouraging.

Fig. 13. At the top: insertion of a feature in the first frame; at the bottom: frames 10, 20, 30, 40, 60 and 70 of the augmented sequence.

6.2 Application to mosaicing

As we suppose that two successive images are linked by a planar transformation, the knowledge of the camera motion between these two images allows us to register one image onto the other.

The idea of mosaicing is simple: we choose two images of a video sequence, In and Ip, with n < p. With the estimation of the camera motion over the whole sequence, we can compute, by combining the p − n displacement estimates, the global camera motion between times n and p and the projective application linking In and Ip. By applying this application to In, we transform In into a new image that we overlay on Ip. We obtain a bigger image that we could observe from the point of view of Ip, but with a larger field of view.
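Concretely, the composition of the p − n estimated displacements amounts to a product of the associated 3 × 3 matrices; a sketch (the function name and the stated convention are ours):

```python
import numpy as np

def global_projective_matrix(frame_matrices):
    """Given the list [M_n, M_{n+1}, ..., M_{p-1}] of 3x3 matrices of the projective
    applications estimated between consecutive frames (with the convention that
    M_k maps frame k+1 coordinates to frame k coordinates), return the matrix of
    the projective application linking I_n and I_p."""
    M = np.eye(3)
    for Mk in frame_matrices:
        M = M @ Mk
    return M
```

With this convention, warping In by the resulting matrix (as in the earlier warping sketch) produces the image to overlay on Ip.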


After applying our camera motion estimation method on the office sequence, we use the recovered parameters to reconstruct several mosaic views. Figures 14, 15 and 16 show three panoramas, each consisting of two images warped onto a third one. Remark that mosaicing is theoretically possible if the point of view does not change (when there is no translation) [6] or when the camera films a planar scene. Our movie does not exactly verify the hypothesis of pure rotation: although the camera translation is very small between two adjacent frames, it may be significant between two images distant in time, and obviously the scene is not planar. But as the scene is rather far from the camera location, registrations are correct when the overlaid borders of the image correspond to the average scene depth. Some gaps appear when the image borders correspond to an extremal depth. Nevertheless, we obtain some interesting panoramic views.

Fig. 14. At the top, scenes 20, 35 and 50 of the office sequence; at the bottom, the reconstructed panoramic view from viewpoint I35.

7 Conclusion

In this paper, we have proposed a new global approach to the problem of egomotion estimation, well adapted to adjacent frames produced by a camera filming a static scene. It is based on the registration group and on a decomposition of the projective transformation into a similarity and a “purely” projective deformation.


Fig. 15. At the top, scenes 70, 80 and 97 of the office sequence; at the bottom, the reconstructed panoramic view from viewpoint I80.

Our algorithm's weak point is the optical flow determination, which is quite time-consuming as it uses all the frames of the movie. But once the optical flow has been computed, all steps are performed very quickly. It is quite difficult to compare our results to those obtained by other techniques because, for each pair of adjacent frames, we use not only the optical flow but also the content of the two images. However, our results on augmented reality and panoramic views are encouraging: the estimation of small camera motions between consecutive images allows us to obtain a correct estimate of the camera motion between images distant in time. The use of both image content and optical flow suggests that the method is robust.

In our method, we have not considered the different depths of the points projected on the retinal plane, because we estimate the camera motion between two successive frames. Our results for the motion between two adjacent frames are very good and allow us to compute a very good estimate of the camera motion between two frames distant in time. But when we register two images distant in time, depth differences appear in the result. Our goal now will be to estimate both the camera motion and a map of scene disparities.


Fig. 16. At the top, scenes 60, 50 and 99 of the office sequence; at the bottom, the reconstructed panoramic view from viewpoint I50.

References

[1] A. Azarbayejani, A. Pentland, Recursive estimation of motion, structure and focal length, IEEE Trans. Pattern Anal. Mach. Intell. 17 (6) (1995) 562–575.

[2] H. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections, Nature 293(10) (1981) 133–135.

[3] O. Faugeras, Three-dimensional computer vision: a geometric viewpoint, MIT Press, 1993.

[4] O. Faugeras, S. Maybank, Motion from point matches: multiplicity of solutions, IJCV 4 (3) (1990) 225–246.

[5] T. Huang, O. Faugeras, Some properties of the E matrix in two-view motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. 11 (12) (1989) 1310–1312.

[6] O. Faugeras, Q. Luong, T. Papadopoulo, The Geometry of Multiple Images, MIT Press, 2000.

[7] A. Bruss, B. Horn, Passive navigation, Computer Graphics and Image Processing 21 (1983) 3–20.

[8] M. Zucchelli, J. Santos-Victor, H. Christensen, Constrained structure and motion estimation from optical flow, ICPR, Quebec, Canada.


[9] D. Heeger, A. Jepson, Subspace methods for recovering rigid motion i: algorithm and implementation, IJCV 7 (2) (1992) 95–117.

[10] Y. Ma, J. Kosecká, S. Sastry, Linear differential algorithm for motion recovery: A geometric approach, IJCV 36 (1) (2000) 71–89.

[11] M. J. Brooks, W. Chojnacki, L. Baumela, Determining the ego-motion of an uncalibrated camera from instantaneous optical flow, J. Opt. Soc. Am A 14(10) (1997) 2670–2677.

[12] C. Tomasi, J. Shi, Direction of heading from image deformations, CVPR93 (1993) 422–427.

[13] M. Irani, B. Rousso, S. Peleg, Recovery of ego-motion using region alignment, IEEE Trans. Pattern Anal. Mach. Intell. 19 (3) (1997) 268–272.

[14] Y. Tian, C. Tomasi, D. Heeger, Comparison of approaches to egomotion computation, IEEE Computer Society, Conference on Computer Vision and Pattern Recognition (1996) 315–320.

[15] D. Bump, Lie groups, Springer, 2004.

[16] L. Alvarez, F. Guichard, P.-L. Lions, J.-M. Morel, Axioms and fundamental equations of image processing: Multiscale analysis and p.d.e., Archive for Rational Mechanics and Analysis 16 (9) (1993) 200–257.

[17] T. Cohignac, C. Lopez, J.-M. Morel, Integral and local affine invariant parameter and application to shape recognition, in: Proc. of Int. Conf. on Pattern Recognition, 1994, pp. 164–168.

[18] G. Koepfler, L. Moisan, Geometric multiscale representation of numerical images, in: SCALE-SPACE ’99: Proceedings of the Second International Conference on Scale-Space Theories in Computer Vision, Springer-Verlag, 1999, pp. 339–350.

[19] L. Moisan, Affine plane curve evolution: A fully consistent scheme, in: IEEE Trans. on Image Processing, 1998, pp. 411–420.

[20] G. Sapiro, A. Tannenbaum, Affine invariant scale-space, Int. J. Comput. Vision 11 (1) (1993) 25–44.

[21] J. Odobez, P. Bouthémy, Robust multiresolution estimation of parametric motion models, Journal of Visual Communication and Image Representation 6 (4) (1995) 348–365.

[22] J. Weickert, C. Schnörr, Variational optic flow computation with a spatio-temporal smoothness constraint, Technical Report, Computer Science Series.

[23] A. Yao, A. Calway, Robust estimation of 3-d camera motion for uncalibrated augmented reality, Technical Report CSTR-02-001, Dept of Computer Science, University of Bristol.


[24] K. Cornelis, M. Pollefeys, M. Vergauwen, L. J. V. Gool, Augmented reality using uncalibrated video sequences, in: SMILE ’00: Revised Papers from Second European Workshop on 3D Structure from Multiple Images of Large-Scale Environments, Springer-Verlag, 2001, pp. 144–160.

