Model-free augmented reality by virtual visual servoing


HAL Id: inria-00352035

https://hal.inria.fr/inria-00352035

Submitted on 12 Jan 2009


Muriel Pressigout, E. Marchand

To cite this version:

Muriel Pressigout, E. Marchand. Model-free augmented reality by virtual visual servoing. IAPR Int. Conf. on Pattern Recognition, ICPR’04, 2004, Cambridge, UK, United Kingdom. pp.887-891.

⟨inria-00352035⟩


IAPR Int. Conf. on Pattern Recognition, ICPR’04, Cambridge, UK, August 2004.


Muriel Pressigout, Éric Marchand

IRISA - INRIA Rennes

Campus de Beaulieu, 35042 Rennes Cedex, France E-mail: [email protected]

Abstract

This paper presents a method based on the virtual visual servoing approach [10] for markerless augmented reality applications. The aim is to achieve this task using as little prior 3D information as possible. Virtual visual servoing techniques, which lead to a non-linear minimization approach, allow one to estimate the 2D transformation between two images of a video sequence, which in turn makes it possible to augment this sequence. Thanks to the work that has already been carried out in this domain, the presented method is efficient and robust with respect to noise and occlusions. It produces very realistic augmented videos with minimal knowledge about the real environment.

1. Introduction

Augmented reality (AR) [1] aims to insert virtual objects in a real environment captured by a moving camera, in such a manner that these objects seem to be part of the viewed 3D scene. This work addresses the AR problem in the case of a single camera. The most important issue is to overcome the registration problem, i.e. how to align the real and the virtual world properly to give the impression that they are just one world. In a vision-based system, this is usually a pose computation issue.

Most approaches consider the pose computation as a registration problem that consists of determining the relationship between 3D coordinates of features (points, lines, ...) and their 2D projections onto the image plane [2, 3, 9]. These approaches require a 3D scene model obtained by fiducial markers or by exploiting its structure. Since such 3D knowledge is not easily available, it is necessary to carry out the pose computation with less constraining knowledge of the viewed scene. This can be done by using planar structures of the scene [8, 13, 12]. Whatever the method chosen, it must be robust to the noise and occlusion phenomena that may occur, since the content of the video is unknown.

This work copes with the 3D knowledge issue by using, as far as possible, only the 2D information extracted from the images and the geometrical constraints inherent to a moving vision system [5]. We have chosen to estimate the camera displacement between the capture of two images instead of the camera pose. This can be accurately achieved by minimizing a distance in the image defined using the strong constraints linking two images of the same scene. The novelty of this article is that the camera displacement estimation by a non-linear minimization is treated as a problem of 2D virtual visual servoing (VVS) [10]. It is therefore closer to the underlying geometrical constraints than similar classical approaches as described in, e.g., [5].

This article first describes how the displacement estimation can be handled as a problem of 2D VVS and then how it can be made robust. The following sections set out the different displacement cases we dealt with and how to use the displacement estimate for AR with minimum prior 3D knowledge. Finally, several experimental results on real videos are presented.

2. Computing Displacement

As already stated, the fundamental principle of the proposed approach is to define a non-linear minimization approach as the dual problem of 2D visual servoing [7]. This formulation has already been applied to the pose computation problem [2, 10]. In visual servoing, the goal is to move a camera in order to observe an object at a given position in the image. This is achieved by moving the camera in order to minimize the error between a desired state of the image features $\mathbf{s}^*$ and the current state $\mathbf{s}$. The displacement computation problem is very similar.

To illustrate the principle, consider the case of a scene with various 2D features $\mathbf{s}$ (for example, points, distances, ...). For camera motion estimation the classical idea is to minimize the distance between the position of the observed features in image 2 ($\mathbf{s}_2$) and their position ${}^2tr_1(\mathbf{s}_1)$ transferred into image 2 by a given transformation (represented by the fundamental or essential matrix, a homography, etc.) whose parameters rely on the camera displacement ${}^2\mathbf{T}_1$ to be estimated:

$$\widehat{{}^{c_2}\mathbf{M}_{c_1}} = \arg\min_{{}^{c_2}\mathbf{M}_{c_1}} \Delta \quad \text{with} \quad \Delta = \sum_{i=1}^{N} d\left(\mathbf{s}_{2_i},\, {}^2tr_1(\mathbf{s}_{1_i})\right)$$

In this formulation of the problem, a virtual camera is moved (the initial displacement is null) using a visual servoing control law in order to minimize this error $\Delta$. At convergence, the virtual camera reaches the position ${}^2\mathbf{M}_1^*$ which minimizes this error (${}^2\mathbf{M}_1^*$ will be the real camera displacement). It is assumed in this paper that the intrinsic parameters are available.
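As a rough illustration of the iterative scheme described above, the following Python sketch applies a gain-weighted update to a much simpler problem than the one treated in the paper: recovering a pure 2D translation between two point sets, starting from a null displacement. The function name, the gain value and the translation-only model are assumptions made for this example. For this model the Jacobian is a stack of identity blocks, so applying its pseudo-inverse to the error amounts to averaging the errors.

```python
def estimate_translation(points1, points2, gain=0.5, iterations=50):
    """Toy analogue of the servoing loop: find t so that points1 + t matches points2."""
    tx, ty = 0.0, 0.0  # the initial displacement is null
    n = len(points1)
    for _ in range(iterations):
        # Error between the transferred features and the observed ones.
        ex = [(p1[0] + tx) - p2[0] for p1, p2 in zip(points1, points2)]
        ey = [(p1[1] + ty) - p2[1] for p1, p2 in zip(points1, points2)]
        # Pseudo-inverse of a stack of 2x2 identity blocks = mean of the errors.
        vx = -gain * sum(ex) / n
        vy = -gain * sum(ey) / n
        # Move the virtual "camera" (here, just update the translation).
        tx += vx
        ty += vy
    return tx, ty
```

With a gain between 0 and 1 the error shrinks geometrically at each iteration, which mirrors the exponential decrease expected from this kind of control law.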

In the more realistic case where image measurement errors occur in both images, it is better to minimize the errors in both images and not only in one. We then have to consider the forward (${}^2tr_1$) and backward (${}^1tr_2$) transformations. The distance to be minimized is then:

$$\sum_{i=1}^{N} \left[ d\left(\mathbf{s}_{2_i},\, {}^2tr_1(\mathbf{s}_{1_i})\right) + d\left(\mathbf{s}_{1_i},\, {}^1tr_2(\mathbf{s}_{2_i})\right) \right] \tag{1}$$

where $N$ is the number of considered features and $d(\mathbf{s}_{2_i}, {}^2tr_1(\mathbf{s}_{1_i})) = {}^2d_{1_i}$ is the signed distance between the 2D features $\mathbf{s}_{2_i}$ and ${}^2tr_1(\mathbf{s}_{1_i})$. Minimizing this distance is equivalent to minimizing the error vector:

$$\mathbf{e} = \left(\ldots,\, {}^2d_{1_i},\, {}^1d_{2_i},\, \ldots\right)^T$$

by the following control law:

$${}^2\mathbf{v} = -\lambda\,\widehat{\mathbf{L}}^{+}\,\mathbf{e} \tag{2}$$

where ${}^2\mathbf{v}$ is the velocity of the virtual camera (expressed in the camera 2 frame) and where $\mathbf{L}$ is the interaction matrix related to the error vector, such that:

$$\widehat{\mathbf{L}} = \left(\cdots,\, \mathbf{L}(\widehat{{}^2d_{1_i}}),\, -\mathbf{L}(\widehat{{}^1d_{2_i}})\,{}^1\widehat{\mathbf{V}}_2,\, \cdots\right)^T \tag{3}$$

$\mathbf{L}({}^2d_{1_i})$ is the Jacobian matrix that links the variation of the distance ${}^2d_{1_i}$ to the virtual camera velocity, such that ${}^2\dot{d}_{1_i} = \mathbf{L}({}^2d_{1_i})\,{}^2\mathbf{v}$. We will see how to define this matrix in Section 2.1. ${}^1\mathbf{V}_2$ is the velocity transformation matrix from camera 1 frame to camera 2 frame, given by the following $6 \times 6$ matrix:

$${}^1\mathbf{V}_2 = \begin{bmatrix} {}^1\mathbf{R}_2 & [{}^1\mathbf{t}_2]_\times\,{}^1\mathbf{R}_2 \\ \mathbf{0}_{3\times3} & {}^1\mathbf{R}_2 \end{bmatrix}$$

where $[\mathbf{t}]_\times$ is the skew matrix related to the vector $\mathbf{t}$.
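The block structure of the velocity transformation matrix can be made concrete with a short Python sketch. It builds the skew matrix of a translation vector and assembles the 6 × 6 matrix with blocks R, [t]× R, 0 and R, as in the text. The helper names (`skew`, `matmul`, `velocity_twist`) and the nested-list matrix representation are choices made for this example only.

```python
def skew(t):
    """Skew-symmetric matrix [t]x such that [t]x u equals the cross product t x u."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def velocity_twist(R, t):
    """Assemble the 6x6 matrix [[R, [t]x R], [0, R]] from a rotation R and a translation t."""
    tR = matmul(skew(t), R)
    top = [list(R[i]) + tR[i] for i in range(3)]          # [ R | [t]x R ]
    bottom = [[0.0, 0.0, 0.0] + list(R[i]) for i in range(3)]  # [ 0 | R ]
    return top + bottom
```

For an identity rotation the upper-right block reduces to the skew matrix of the translation itself, which is a convenient sanity check.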

As shown in [2], if data are corrupted with noise, the widely accepted statistical techniques of robust M-estimation [6] can be introduced within the minimization process. This is done directly in the virtual visual servoing control law by weighting the confidence in each feature:

$${}^2\mathbf{v} = -\lambda\,(\widehat{\mathbf{D}}\,\widehat{\mathbf{L}})^{+}\,\widehat{\mathbf{D}}\,\mathbf{e} \tag{4}$$

where $\mathbf{D}$ is a diagonal weighting matrix given by $\mathbf{D} = \mathrm{diag}(\ldots, w_i, \ldots)$. The weights $w_i$ reflect the confidence in each feature. Their computation requires an influence function. Tukey's hard re-descending function is used since it completely rejects outliers by giving them a zero weight (see [2, 6] for further information on weight computation and influence functions). This is of interest in this sort of application, so that a detected outlier has no effect on the virtual camera motion.
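To make the weighting step more tangible, here is a minimal Python sketch of Tukey biweight weights computed from a list of residuals. It is not taken from the paper: the scale estimate (median absolute deviation multiplied by 1.4826), the tuning constant 4.6851 and the fallback for a zero scale are common textbook choices that are assumed here; see [2, 6] for the formulation actually used by the authors.

```python
import statistics


def tukey_weights(residuals, c=4.6851):
    """Tukey biweight weights: smooth down-weighting, exactly zero beyond c * sigma."""
    med = statistics.median(residuals)
    # Robust scale from the median absolute deviation (assumed choice).
    mad = statistics.median([abs(r - med) for r in residuals])
    sigma = 1.4826 * mad
    if sigma == 0.0:
        # Degenerate case (assumption): no spread, so trust every residual equally.
        return [1.0] * len(residuals)
    weights = []
    for r in residuals:
        u = (r - med) / (c * sigma)
        # Inside the cut-off the weight is (1 - u^2)^2; outliers get zero.
        weights.append((1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0)
    return weights
```

The returned values would form the diagonal of the weighting matrix, so a residual flagged as an outlier contributes nothing to the computed velocity.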

2.1. General camera motion

This subsection describes the 2D transformation to be estimated for the most general case: a non-planar scene viewed by a camera which rotates and translates. In the remainder of the paper we use the following notation: $\mathbf{p}_1$ for the points extracted from the camera 1 image and $\mathbf{p}_2$ for the corresponding points in the camera 2 image. In that case the constraints derived from the epipolar geometry give [5]:

$$\mathbf{p}_1^T\,{}^1\mathbf{E}_2\,\mathbf{p}_2 = 0 \quad \text{and symmetrically} \quad \mathbf{p}_2^T\,{}^2\mathbf{E}_1\,\mathbf{p}_1 = 0 \tag{5}$$

The $3 \times 3$ matrix ${}^1\mathbf{E}_2 = [{}^1\mathbf{t}_2]_\times\,{}^1\mathbf{R}_2$ is called the essential matrix, and ${}^2\mathbf{E}_1 = {}^1\mathbf{E}_2^T$. It is only related to the camera displacement and is the same for all the considered 3D points. In this case computing the camera motion is equivalent to computing this essential matrix. With the virtual visual servoing approach, the idea is to minimize the distance between the position of the observed points in image 2 ($\mathbf{p}_2$) and the position of the corresponding features ${}^2tr_1(\mathbf{p}_1)$ transferred into image 2 by the essential matrix ${}^2\mathbf{E}_1$, i.e. to minimize the signed distance between the points $\mathbf{p}_2$ and their associated epipolar lines $\mathbf{l}_1$ in image 2. Hence, the terms of the global error $\mathbf{e}$ of (2) to be minimized in both images 1 and 2 are obtained by:

$${}^2d_{1_i} = \mathbf{p}_{2_i}^T\,\mathbf{l}_{1_i} \quad \text{and} \quad {}^1d_{2_i} = \mathbf{p}_{1_i}^T\,\mathbf{l}_{2_i} \tag{6}$$

Equation (6) means that a point $\mathbf{p}_1$ must lie on the epipolar line $\mathbf{l}_2$ related to its corresponding point $\mathbf{p}_2$, where $\mathbf{l}_2$ is defined by ${}^1\mathbf{E}_2\,\mathbf{p}_2$. Likewise, the epipolar line $\mathbf{l}_1$ related to $\mathbf{p}_1$ is the projection in image 2 of the line $C_1P$ (where $C_1$ is the optical center of camera 1 and $P$ is the 3D point that projects onto $\mathbf{p}_1$ and $\mathbf{p}_2$).
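The epipolar residuals can be checked numerically with a small Python sketch. It assumes the convention in which a 3D point expressed in the two camera frames satisfies X1 = R X2 + t, builds the essential matrix as [t]× R, and evaluates the algebraic residual obtained by multiplying a point with the epipolar line induced by its match. All function names are hypothetical, and points are given in normalized homogeneous coordinates (x, y, 1).

```python
def skew(t):
    """Skew-symmetric matrix [t]x."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(R, t):
    """E = [t]x R, so that p1^T E p2 = 0 when X1 = R X2 + t (assumed convention)."""
    return matmul(skew(t), R)


def epipolar_residual(p1, E, p2):
    """Algebraic residual p1^T (E p2): zero when p1 lies on the epipolar line of p2."""
    line = matvec(E, p2)  # epipolar line in image 1 induced by p2
    return sum(a * b for a, b in zip(p1, line))


# Usage: a 3D point seen by two cameras separated by a translation along x.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)
E = essential_matrix(R, t)
p2 = [0.25, 0.1, 1.0]   # projection of X2 = (0.5, 0.2, 2.0)
p1 = [0.75, 0.1, 1.0]   # projection of X1 = X2 + t = (1.5, 0.2, 2.0)
residual = epipolar_residual(p1, E, p2)
```

Note that this residual is only proportional to a geometric distance; dividing by the norm of the first two line coefficients would give the point-to-line distance in the image plane.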

Figure 1. (a) Distance of a point to a line. (b) Plane Π used in the computation of the interaction matrix.

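The distance between a point $\mathbf{p}$ and a line $l(\mathbf{r})$ can be characterized by the distance $d_\perp$ perpendicular to the line. Thus the distance feature from a line is given by:

$$d_l = d_\perp(\mathbf{x}, l(\mathbf{r})) = \rho(l(\mathbf{r})) - \rho_p \tag{7}$$

where $\rho_p = x\cos\theta + y\sin\theta$, with $x$ and $y$ being the coordinates of the tracked point. Thus,

$$\dot{d}_l = \dot{\rho} - \dot{\rho}_p = \dot{\rho} + \alpha\,\dot{\theta} \tag{8}$$

where $\alpha = x\sin\theta - y\cos\theta$. It follows from (8) that $\mathbf{L}_{d_l} = \mathbf{L}_\rho + \alpha\,\mathbf{L}_\theta$. The interaction matrix related to $d_l$ can thus be derived from the interaction matrix related to a straight line, given by (see [4] for its complete derivation):

$$\begin{aligned} \mathbf{L}_\theta &= \begin{pmatrix} \lambda_\theta\cos\theta & \lambda_\theta\sin\theta & -\lambda_\theta\rho & \rho\cos\theta & -\rho\sin\theta & -1 \end{pmatrix} \\ \mathbf{L}_\rho &= \begin{pmatrix} \lambda_\rho\cos\theta & \lambda_\rho\sin\theta & -\lambda_\rho\rho & (1+\rho^2)\sin\theta & -(1+\rho^2)\cos\theta & 0 \end{pmatrix} \end{aligned} \tag{9}$$

where $\lambda_\theta = (A_2\sin\theta - B_2\cos\theta)/D_2$, $\lambda_\rho = (A_2\rho\cos\theta + B_2\rho\sin\theta + C_2)/D_2$, and $A_2X + B_2Y + C_2Z + D_2 = 0$ is the equation of a 3D plane Π to which the line belongs (see Figure 1b).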
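The row of the interaction matrix for the point-to-line distance can be assembled directly from the expressions above. The Python sketch below is only a transcription of those formulas under the stated definitions of λθ, λρ and α; the function names and the tuple layout of the plane parameters (A, B, C, D) are assumptions made for this example.

```python
import math


def line_interaction_rows(rho, theta, plane):
    """Rows L_rho and L_theta of a straight line (rho, theta) lying in plane AX+BY+CZ+D=0."""
    A, B, C, D = plane
    c, s = math.cos(theta), math.sin(theta)
    lam_theta = (A * s - B * c) / D
    lam_rho = (A * rho * c + B * rho * s + C) / D
    L_theta = [lam_theta * c, lam_theta * s, -lam_theta * rho,
               rho * c, -rho * s, -1.0]
    L_rho = [lam_rho * c, lam_rho * s, -lam_rho * rho,
             (1.0 + rho ** 2) * s, -(1.0 + rho ** 2) * c, 0.0]
    return L_rho, L_theta


def distance_interaction_row(x, y, rho, theta, plane):
    """Row L_dl = L_rho + alpha * L_theta for the distance of point (x, y) to the line."""
    L_rho, L_theta = line_interaction_rows(rho, theta, plane)
    alpha = x * math.sin(theta) - y * math.cos(theta)
    return [lr + alpha * lt for lr, lt in zip(L_rho, L_theta)]
```

Each call returns one 1 × 6 row; stacking one row per tracked point (and per image) gives the matrix used in the control law.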

The translation ${}^1\mathbf{t}_2$ is estimated up to scale. Indeed, if a displacement between image 1 and image 2 with translation ${}^1\mathbf{t}_2$ and rotation ${}^1\mathbf{R}_2$ satisfies (5), so does any displacement with ${}^1\mathbf{t}_2' = k\,{}^1\mathbf{t}_2$ and ${}^1\mathbf{R}_2' = {}^1\mathbf{R}_2$. In order to find the exact translation, 3D information is needed. It can be a distance between two points of the scene: there is only one scalar $k$ that preserves this 3D distance, such that the real translation and rotation are respectively $k\,{}^1\mathbf{t}_2$ and ${}^1\mathbf{R}_2$.
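One simple way to picture the scale recovery is sketched below in Python. It assumes that two scene points have been reconstructed with the up-to-scale translation, so that the whole reconstruction differs from the real scene by the single factor k; comparing the reconstructed distance with the known one then gives k. This is an illustrative procedure with hypothetical function names, not necessarily the computation used by the authors.

```python
import math


def recover_scale(point_a, point_b, known_distance):
    """Scale factor k mapping the up-to-scale reconstruction to metric units."""
    reconstructed = math.dist(point_a, point_b)  # distance in the arbitrary scale
    return known_distance / reconstructed


def scale_translation(t, k):
    """Apply the recovered scale to the up-to-scale translation."""
    return [k * component for component in t]
```

For instance, if two points reconstructed 0.5 units apart are known to be 1.0 m apart in the scene, the translation estimate has to be multiplied by 2.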

2.2. Homography estimation

Some particular cases of camera displacement (planar scene, pure rotation of the camera) lead the 2D transformation between two images of the video to be a homography. In that case:

$$\mathbf{p}_2 = {}^2\mathbf{H}_1\,\mathbf{p}_1 = \left({}^i\mathbf{R}_j + \frac{{}^i\mathbf{t}_j}{{}^jd}\,{}^j\mathbf{n}^T\right)\mathbf{p}_1 \tag{10}$$

where ${}^2\mathbf{H}_1$ is a homography that defines the transformation between the images acquired by camera 1 and camera 2 (here $i = 2$ and $j = 1$). In this case computing the camera motion is equivalent to computing this homography. With the virtual visual servoing approach, the idea is to minimize the distance between the position of the observed points in image 2 ($\mathbf{p}_2$) and the position of the corresponding points $\mathbf{p}_1$ transferred into image 2 by the homography ${}^2\mathbf{H}_1$. The goal is then to minimize the error $\mathbf{e}$ of (2) in both images 1 and 2, whose terms are given by:

$${}^2d_{1_i} = {}^2\widehat{\mathbf{H}}_1\,\mathbf{p}_{1_i} - \mathbf{p}_{2_i} \quad \text{and} \quad {}^1d_{2_i} = {}^2\widehat{\mathbf{H}}_1^{-1}\,\mathbf{p}_{2_i} - \mathbf{p}_{1_i}$$

The terms $\mathbf{L}({}^jd_{k_i})$ of the related interaction matrix $\mathbf{L}$ are thus the classical interaction matrices that link the variation of the position of a point $\mathbf{x}_i$ to the camera motion (see e.g. [7]).
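The forward and backward transfer errors for a homography can be sketched in a few lines of Python. The code below applies a 3 × 3 homography to 2D points, inverts it with the adjugate formula, and returns both error terms for each correspondence. The function names are illustrative assumptions, and the points are handled in inhomogeneous coordinates after division by the third component.

```python
def apply_homography(H, p):
    """Transfer the 2D point p = (x, y) by H and normalize the result."""
    x, y = p
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def invert_3x3(M):
    """Inverse of a 3x3 matrix via the adjugate (assumes a non-zero determinant)."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def transfer_errors(H, points1, points2):
    """Per-point errors (H p1 - p2) and (H^-1 p2 - p1), each as an (x, y) pair."""
    H_inv = invert_3x3(H)
    errors = []
    for p1, p2 in zip(points1, points2):
        fwd = apply_homography(H, p1)
        bwd = apply_homography(H_inv, p2)
        errors.append(((fwd[0] - p2[0], fwd[1] - p2[1]),
                       (bwd[0] - p1[0], bwd[1] - p1[1])))
    return errors
```

Both error terms vanish for a perfect match, and a mismatch in image 2 shows up in the forward and backward terms with opposite signs.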

3. Application to augmented reality

For augmented reality applications, the pose between the camera and the world coordinate system is required. If an initial pose ${}^{c_1}\mathbf{M}_W$ is known, computing the current pose from the estimated displacement is straightforward:

$${}^{c_n}\mathbf{M}_W = {}^{c_n}\mathbf{M}_{c_1}\,{}^{c_1}\mathbf{M}_W \tag{11}$$

since the displacement between the first and the current image is computed, using the displacement estimated for the previous image as the initial estimate. However, computing ${}^{c_1}\mathbf{M}_W$ requires the introduction of 3D information. Therefore the first pose is estimated from the image of a rectangle in the first image, following the approach presented in [13]. The only 3D information required is a rectangle in the first image and the length of one of its sides.
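The pose update amounts to a product of homogeneous transformation matrices. The following Python sketch assumes that poses and displacements are stored as 4 × 4 nested lists and that the displacement maps coordinates from the first camera frame to the current one; the function names are hypothetical.

```python
def compose(M_ab, M_bc):
    """Product of two 4x4 homogeneous transformations (maps frame c to frame a)."""
    return [[sum(M_ab[i][k] * M_bc[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def current_pose(displacement_first_to_current, initial_pose):
    """Current camera pose from the estimated displacement and the initial pose."""
    return compose(displacement_first_to_current, initial_pose)
```

With this convention the initial pose only has to be computed once; every later pose follows from the displacement estimated between the first and the current image.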

4. Experimental results

For the outdoor experiments, tracking is achieved along the image sequence using the Shi-Tomasi-Kanade points tracker [11].

4.1. General case: estimating the essential matrix

In this first experiment, the camera undergoes a translation and a rotation. There are some markers in the viewed scene that allow fast tracking and provide a reliable set of points in each image of the video sequence without any matching problem. The 3D information used is a rectangle extracted from the markers in the initial image, to compute the initial pose, and the length of one of its sides during the sequence, to estimate the correct translation scale. In Figure 2, three augmented images of this sequence are shown. One can see that the added horse remains at the same location along the sequence.

Figure 2. AR from general camera motion

4.2. Planar scene: estimating the homography

In this experiment (see Figure 3), an outdoor scene is considered. The wall is the planar scene from which points are extracted to estimate the homography between two images. The rectangle used to estimate the initial pose is the one formed by the different posters. It is not very accurate but it provides a good enough result. The pose computation resulting from this initial pose estimation and the displacement estimations provides a realistic augmented video sequence, as can be seen in Figure 3. The objects remain stable in the scene.

Figure 3. AR with robust homography estimation with planar structure

Two comparisons have been made on the remaining error between the image points and the projection of the corresponding points in the other image for the estimated displacement (see Figure 4). The presented method is first compared with and without the robust kernel. After a while, the use of the M-estimator gives noticeably better displacement estimates. It is then compared with the linear approach, i.e. the DLT algorithm using the data normalisation recommended by [5]. The presented method, even without its robust kernel, yields far smaller residuals. However, other non-linear minimization approaches give similar results.

Figure 4. Planar structure. Left: residues with the VVS approach without M-estimators (red) vs. residues with the robust VVS approach (green). Right: residues with the DLT algorithm (red) vs. residues with the VVS approach without M-estimators (green).

4.3. Pure rotation camera motion

Pure rotation is interesting since in this case the homography ${}^i\mathbf{H}_j$ is only related to the rotation. Thus the points are not required to belong to a plane. This property can be exploited in many image sequences where the camera translations are very small. The equations presented for homography estimation have been simplified by removing the terms related to the translation. In this experiment (see Figure 5), an outdoor scene is considered with very noisy images. Figure 5 shows that even after 800 images, the error in pose computation (and thus in displacement computation) is very small. It should be pointed out that the complete change of background during the sequence does not disturb the results.

Figure 5. AR with pure rotation camera motion

5. Conclusion

This paper shows that exploiting the virtual visual servoing approach to achieve displacement estimation based on 2D information is efficient and, furthermore, intuitive, since it is closer to the underlying geometrical constraints than the other non-linear minimization approaches. Robust estimation is obtained by introducing M-estimators in the control law which updates the displacement estimation. Its application to AR provides very realistic videos with very few constraints.

Acknowledgment

This work was carried out in the context of the French RIAM Sora project in the Lagadic team at IRISA/INRIA Rennes.

Videos are available on the Lagadic website:

http://www.irisa.fr/lagadic

References

[1] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, B. MacIntyre. Recent advances in augmented reality. IEEE CG&A, 21(6):34–47, 2001.

[2] A. Comport, E. Marchand, F. Chaumette. A real-time tracker for markerless augmented reality. In IEEE/ACM ISMAR, pp. 36–45, 2003.

[3] D. Dementhon, L. Davis. Model-based object pose in 25 lines of code. IJCV, 15:123–141, 1995.

[4] B. Espiau, F. Chaumette, P. Rives. A new approach to visual servoing in robotics. IEEE TRA, 8(3):313–326, 1992.

[5] R. Hartley, A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge Univ. Press, 2001.

[6] P.-J. Huber. Robust Statistics. Wiley, New York, 1981.

[7] S. Hutchinson, G. Hager, P. Corke. A tutorial on visual servo control. IEEE TRA, 12(5):651–670, 1996.

[8] K. Kutulakos, J. Vallino. Calibration-free augmented reality. IEEE TVCG, 4(1):1–20, 1998.

[9] D. Lowe. Fitting parameterized three-dimensional models to images. IEEE PAMI, 13(5):441–450, 1991.

[10] E. Marchand, F. Chaumette. Virtual visual servoing: a framework for real-time augmented reality. In EUROGRAPHICS, volume 21(3), pp. 289–298, 2002.

[11] J. Shi, C. Tomasi. Good features to track. In IEEE CVPR, pp. 593–600, 1994.

[12] G. Simon, M.-O. Berger. Pose estimation for planar structures. IEEE CG&A, 22(6):46–53, 2002.

[13] G. Simon, A. Fitzgibbon, A. Zisserman. Markerless tracking using planar structures in the scene. In IEEE/ACM ISAR, pp. 120–128, 2002.
