HAL Id: inria-00352035
https://hal.inria.fr/inria-00352035
Submitted on 12 Jan 2009
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Model-free augmented reality by virtual visual servoing
Muriel Pressigout, E. Marchand
To cite this version:
Muriel Pressigout, E. Marchand. Model-free augmented reality by virtual visual servoing. IAPR Int. Conf. on Pattern Recognition, ICPR’04, 2004, Cambridge, UK, United Kingdom. pp.887-891.
IAPR Int. Conf. on Pattern Recognition, ICPR’04, Cambridge, UK, August 2004.
Model-free augmented reality by virtual visual servoing
Muriel Pressigout, Éric Marchand
IRISA INRIA Rennes
Campus de Beaulieu, 35042 Rennes Cedex, France E-mail: [email protected]
Abstract
This paper presents a method based on the virtual visual servoing approach [10] to achieve markerless augmented reality applications. This work aims to realize this task using as little prior 3D information as possible. Virtual visual servoing techniques, which lead to a non-linear minimization approach, allow one to estimate the 2D transformation between two images of a video sequence, which makes it possible to achieve augmented reality on this sequence. Thanks to the work that has already been carried out in this domain, the presented method is efficient and robust with respect to noise and occlusions. It allows very realistic augmented videos with minimum knowledge about the real environment.
1. Introduction
Augmented reality (AR) [1] aims to insert virtual objects in a real environment captured by a moving camera, in such a manner that these objects seem to be part of the viewed 3D scene. This work addresses the AR problem in the case of a single camera. The most important issue is to overcome the registration problem, i.e. how to align the real and the virtual world properly to give the impression that they are just one world. In a vision-based system, this is usually a pose computation issue.
Most of the approaches consider the pose computation as a registration problem that consists of determining the relationship between 3D coordinates of features (points, lines, ...) and their 2D projections onto the image plane [2, 3, 9]. These approaches require a 3D scene model obtained by fiducial markers or by exploiting its structure.
Since such 3D knowledge is not easily available, it is necessary to address the pose computation with less restrictive knowledge of the viewed scene. This can be done by using planar structures of the scene [8, 13, 12].
Whatever the method chosen, it must be robust to the noise and occlusion phenomena that the video may contain, since its content is unknown.
This work copes with the 3D knowledge issue by making the most of the 2D information extracted from the images and the geometrical constraints inherent to a moving vision system [5]. It has been chosen to estimate the camera displacement between the capture of two images instead of the camera pose. This can be accurately achieved by minimizing a distance in the image defined using the strong constraints linking two images of the same scene. The novelty of this article is that the camera displacement estimation by a non-linear minimization is considered as a problem of 2D virtual visual servoing (VVS) [10]. It is therefore closer to the underlying geometrical constraints than similar classical approaches as described in, e.g., [5].
This article first describes how the displacement estimation can be handled as a problem of 2D VVS and then how it can be made robust. The following sections set out the different displacement cases we dealt with and how to use the displacement estimate for AR with minimum prior 3D knowledge. Finally, several experimental results on real videos are presented.
2. Computing Displacement
As already stated, the fundamental principle of the proposed approach is to define a non-linear minimization approach as the dual problem of 2D visual servoing [7]. This formulation has already been applied to the pose computation problem [2, 10]. In visual servoing, the goal is to move a camera in order to observe an object at a given position in the image. This is achieved by moving the camera in order to minimize the error between a desired state of the image features $s^*$ and the current state $s$. The displacement computation problem is very similar.
To illustrate the principle, consider the case of a scene with various 2D features $s$ (for example, points, distances, ...). For camera motion estimation the classical idea is to minimize the distance between the position of the observed features in image 2 ($s_2$) and their position ${}^2tr_1(s_1)$ transferred into image 2 by a given transformation (represented by the fundamental or essential matrix, a homography, etc.) whose parameters rely on the camera displacement ${}^2T_1$ to be estimated:

$$\widehat{{}^2M_1} = \arg\min_{{}^2M_1} \Delta \quad \text{with} \quad \Delta = \sum_{i=1}^{N} d\left(s_{2_i},\, {}^2tr_1(s_{1_i})\right)$$

In this formulation of the problem, a virtual camera is moved (initial displacement is null) using a visual servoing control law in order to minimize this error $\Delta$. At convergence, the virtual camera reaches the position ${}^2M_1^*$ which minimizes this error (${}^2M_1^*$ will be the real camera displacement). It is supposed in this paper that intrinsic parameters are available.
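The idea of starting from a null displacement and moving a virtual camera until the error stops decreasing can be sketched on a deliberately simplified problem. The sketch below is only a toy: it assumes the transfer between the two images is a pure 2D translation and uses the squared Euclidean distance for $d$, neither of which is the model of the paper. The function name `estimate_translation` and the parameters `gain`, `tol` and `max_iter` are hypothetical.

```python
def estimate_translation(points_1, points_2, gain=0.5, tol=1e-10, max_iter=200):
    """Toy virtual-servoing loop: the 'displacement' is a 2D offset (tx, ty).

    ASSUMPTION: the transfer from image 1 to image 2 is a pure translation,
    chosen only to keep the example short. Returns (tx, ty, history), where
    history holds the error Delta (sum of squared residuals) per iteration.
    """
    tx, ty = 0.0, 0.0  # initial displacement is null
    n = len(points_1)
    history = []
    for _ in range(max_iter):
        # Residuals between observed features in image 2 and transferred ones.
        ex = [x2 - (x1 + tx) for (x1, _), (x2, _) in zip(points_1, points_2)]
        ey = [y2 - (y1 + ty) for (_, y1), (_, y2) in zip(points_1, points_2)]
        history.append(sum(a * a + b * b for a, b in zip(ex, ey)))
        # For a pure translation, the least-squares step is the mean residual;
        # the gain turns it into a damped (proportional) update.
        vx = gain * sum(ex) / n
        vy = gain * sum(ey) / n
        if vx * vx + vy * vy < tol:
            break
        tx += vx
        ty += vy
    return tx, ty, history


pts_1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pts_2 = [(x + 0.3, y - 0.2) for x, y in pts_1]
tx, ty, history = estimate_translation(pts_1, pts_2)
```

With a gain of 0.5 the remaining offset is halved at every iteration, so the recorded error decreases monotonically until the stopping test is met.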
In the more realistic case where image measurement errors occur in both images, it is better to minimize the errors in both images and not only in one. We then have to consider the forward (${}^2tr_1$) and backward (${}^1tr_2$) transformations. The distance to be minimized is then:

$$\sum_{i=1}^{N} \left[ d\left(s_{2_i},\, {}^2tr_1(s_{1_i})\right) + d\left(s_{1_i},\, {}^1tr_2(s_{2_i})\right) \right] \qquad (1)$$
where $N$ is the number of considered features and $d(s_{2_i}, {}^2tr_1(s_{1_i})) = {}^2d_{1_i}$ is the signed distance between the 2D features $s_{2_i}$ and ${}^2tr_1(s_{1_i})$. Minimizing this distance is equivalent to minimizing the error vector:

$$e = \left( \ldots,\ {}^2d_{1_i},\ {}^1d_{2_i},\ \ldots \right)^T$$

by the following control law:

$${}^2v = -\lambda\, \widehat{L}^{+} e \qquad (2)$$

where ${}^2v$ is the velocity of the virtual camera (expressed in the camera 2 frame) and where $\widehat{L}$ is the interaction matrix related to the error vector such that:

$$\widehat{L} = \left( \cdots,\ \widehat{L}({}^2d_{1_i}),\ -\widehat{L}({}^1d_{2_i})\, {}^1\widehat{V}_2,\ \cdots \right)^T \qquad (3)$$

$L({}^2d_{1_i})$ is the Jacobian matrix that links the variation of the distance ${}^2d_{1_i}$ to the virtual camera velocity such that ${}^2\dot{d}_{1_i} = L({}^2d_{1_i})\, {}^2v$. We will see how to define this matrix in section 2.1. ${}^1V_2$ is the velocity transformation matrix from camera 1 frame to camera 2 frame, given by the following $6 \times 6$ matrix:

$${}^1V_2 = \begin{bmatrix} {}^1R_2 & [{}^1t_2]_\times\, {}^1R_2 \\ 0_{3\times3} & {}^1R_2 \end{bmatrix}$$

where $[t]_\times$ is the skew-symmetric matrix related to the vector $t$.
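As a small illustration of the block structure of the $6 \times 6$ velocity transformation matrix, the following sketch assembles it from a rotation matrix and a translation vector using plain Python lists. The helper names `skew`, `matmul3` and `velocity_twist` are hypothetical and are not taken from the paper or from any library.

```python
def skew(t):
    """Skew-symmetric matrix [t]x such that [t]x u equals the cross product t x u."""
    x, y, z = t
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def velocity_twist(R, t):
    """Assemble the 6x6 matrix [[R, [t]x R], [0, R]] from R (3x3) and t (3-vector)."""
    tR = matmul3(skew(t), R)
    top = [list(R[i]) + tR[i] for i in range(3)]          # [ R | [t]x R ]
    bottom = [[0.0] * 3 + list(R[i]) for i in range(3)]   # [ 0 |   R    ]
    return top + bottom


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
V = velocity_twist(identity, (1.0, 2.0, 3.0))
```

With an identity rotation, the upper-right block reduces to $[t]_\times$ itself, which makes the layout easy to check by eye.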
As shown in [2], if data are corrupted with noise, the widely accepted statistical techniques of robust M-estimation [6] can be introduced within the minimization process. This is introduced directly in the virtual visual servoing control law by weighting the confidence on each feature:

$${}^2v = -\lambda\, (\widehat{D}\widehat{L})^{+}\, \widehat{D}\, e \qquad (4)$$

where $D$ is a diagonal weighting matrix given by $D = \mathrm{diag}(\ldots, w_i, \ldots)$. The weights $w_i$ reflect the confidence of each feature. Their computation needs an influence function. Tukey's hard re-descending function is considered since it completely rejects outliers and gives them a zero weight (see [2, 6] for further information on weights computation and influence functions). This is of interest in this sort of application so that a detected outlier has no effect on the virtual camera motion.
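To make the role of the weights more concrete, here is a minimal sketch of how Tukey weights could be computed from a list of residuals. The paper refers to [2, 6] for the actual computation, so the details below are assumptions: the residuals are centred on their median, the scale is estimated from the median absolute deviation (MAD), and the tuning constant 4.6851 is the value commonly used with Tukey's function. The name `tukey_weights` is hypothetical.

```python
from statistics import median

TUKEY_C = 4.6851       # common tuning constant (assumption, not from the paper)
MAD_TO_SIGMA = 1.4826  # makes the MAD consistent with a Gaussian standard deviation


def tukey_weights(residuals, c=TUKEY_C):
    """Return one weight in [0, 1] per residual using Tukey's biweight function.

    Residuals farther than c scaled units from the median get a weight of
    exactly zero, so they no longer influence the estimation.
    """
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    sigma = max(MAD_TO_SIGMA * mad, 1e-12)  # guard against a zero scale
    weights = []
    for r in residuals:
        u = (r - center) / sigma
        if abs(u) <= c:
            weights.append((1.0 - (u / c) ** 2) ** 2)
        else:
            weights.append(0.0)  # hard rejection of the outlier
    return weights


weights = tukey_weights([0.1, -0.2, 0.05, 0.0, 10.0])
```

In this example the last residual is far from the others and receives a zero weight, while the residual equal to the median receives a weight of one. These values would form the diagonal of $D$.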
2.1. General camera motion
This subsection describes the 2D transformation to be estimated for the most general case: a non-planar scene viewed by a camera which rotates and translates. In the remainder of the paper we use the following notation: $p_1$ for the points extracted from the camera 1 image and $p_2$ for the corresponding points in the camera 2 image. In that case the constraints derived from the epipolar geometry give [5]:
$$p_1^T\, {}^1E_2\, p_2 = 0 \quad \text{and symmetrically} \quad p_2^T\, {}^2E_1\, p_1 = 0 \qquad (5)$$

The $3 \times 3$ matrix ${}^2E_1 = [{}^1t_2]_\times\, {}^1R_2$ is called the essential matrix. ${}^2E_1$ is only related to the camera displacement and is the same for all the considered 3D points. In this case computing the camera motion is equivalent to computing this essential matrix. Considering the virtual visual servoing approach, the idea is to minimize the distance between the position of the observed points in image 2 ($p_2$) and the position of the corresponding features ${}^2tr_1(p_1)$ transferred into image 2 by the essential matrix ${}^2E_1$, i.e. to minimize the signed distance between the points $p_2$ and their associated epipolar lines $l_2$ in image 2. Hence, the terms of the global error $e$ of (2), to be minimized in both images 1 and 2, are obtained by:

$${}^2d_{1_i} = p_{2_i}^T\, l_{2_i} \quad \text{and} \quad {}^1d_{2_i} = p_{1_i}^T\, l_{1_i} \qquad (6)$$
(6) means that a point $p_1$ must lie on the epipolar line $l_1$ related to its corresponding point $p_2$, where $l_1$ is defined by ${}^1E_2\, p_2$. The epipolar line $l_2$ related to $p_1$ is the projection of the line $C_1P$ (where $C_1$ is the optical center of camera 1 and $P$ is the 3D point that projects to $p_1$ and $p_2$).
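The epipolar constraint can be checked numerically on a synthetic correspondence. The sketch below assumes the convention $X_1 = R\,X_2 + t$ between the coordinates of a 3D point in the two camera frames, under which $p_1^T [t]_\times R\, p_2 = 0$ for normalized image points. All function and variable names are hypothetical, and the residual computed here is the algebraic quantity $p_1^T l_1$. Obtaining a geometric point-to-line distance would additionally require normalizing the line by the norm of its first two coefficients.

```python
import math


def skew(t):
    """Skew-symmetric matrix [t]x of a 3-vector."""
    x, y, z = t
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(m, v):
    """Product of a 3x3 matrix with a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def epipolar_residual(E, p1, p2):
    """Algebraic residual p1^T l1, where l1 = E p2 is the epipolar line in image 1."""
    l1 = matvec3(E, p2)
    return sum(a * b for a, b in zip(p1, l1))


def normalize(X):
    """Perspective projection of a 3D point to normalized image coordinates."""
    return [X[0] / X[2], X[1] / X[2], 1.0]


# Synthetic displacement (ASSUMED convention: X1 = R X2 + t).
angle = 0.1
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [0.5, 0.0, 0.1]
E = matmul3(skew(t), R)

X2 = [0.2, -0.1, 2.0]                              # 3D point in camera 2 frame
X1 = [a + b for a, b in zip(matvec3(R, X2), t)]    # same point in camera 1 frame
p1, p2 = normalize(X1), normalize(X2)

exact = epipolar_residual(E, p1, p2)
perturbed = epipolar_residual(E, p1, [p2[0], p2[1] + 0.05, 1.0])
```

For the consistent pair the residual vanishes up to rounding error, whereas shifting $p_2$ off its true position gives a clearly non-zero residual. This is the quantity the minimization drives towards zero.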