
HAL Id: hal-01635669

https://hal.archives-ouvertes.fr/hal-01635669

Submitted on 15 Nov 2017


To cite this version: Maxime Lhuillier. Toward Automatic 3D Modeling of Scenes using a Generic Camera Model. IEEE Conference on Computer Vision and Pattern Recognition, Jun 2008, Anchorage, United States. hal-01635669


Toward Automatic 3D Modeling of Scenes using a Generic Camera Model

Maxime Lhuillier

LASMEA-UMR 6602 UBP/CNRS, 63177 Aubière Cedex, France.

maxime.lhuillier.free.fr

Abstract

The automatic reconstruction of 3D models from image sequences is still a very active field of research. All existing methods are designed for a given camera model, and a new (and ambitious) challenge is 3D modeling with a method which is exploitable for any kind of camera. A similar approach was recently suggested for structure-from-motion thanks to the use of generic camera models. In this paper, we first introduce geometric tools designed for 3D scene modeling with a generic camera model. Then, these tools are used to solve many issues: matching errors, wide range of point depths, depth discontinuities, and view-point selection for reconstruction. Experiments are provided for perspective and catadioptric cameras.

1. Introduction

The automatic reconstruction of photo-realistic 3D models of scenes from image sequences taken by a moving camera is still a very active field of research. Once the camera parameters of the image sequence are recovered by structure-from-motion, dense stereo and stereo merging into a single 3D model are successively applied. Currently, many 3D modeling systems exist for perspective cameras [12, 16, 10], catadioptric cameras [3, 9] and multi-camera rigs [2] (among others), sometimes with the help of additional information such as odometry. Even if the intrinsic parameters of the camera are unknown, the involved methods are dependent on a given camera model.

Recently, it was suggested that the first sub-problem (structure-from-motion) be solved using a generic camera model and generic tools which are exploitable for any kind of camera [5, 15], and the same challenge arises naturally for the complete 3D modeling process. The practical advantages would be obvious: a high ability to change one camera model for another or to mix different cameras (e.g. a catadioptric camera for a wide field of view and a perspective camera for a few parts of the scene where higher reconstruction accuracy is needed). Many generic tools are already available for structure-from-motion: estimation of the generalized essential matrix [13, 15], pose calculation [14], bundle adjustment [17, 11] and generic camera calibration [5, 8]. We have no additional contributions for this sub-problem and assume that the camera parameters are known.

Dense stereo is the second sub-problem. It is recognized that this step is very difficult in practice for uncontrolled environments. This difficulty increases in the generic context since the use of the image projection function is prohibited (this function is specific to the kind of camera). The epipolar constraint is also unavailable since the camera may be non-central. Optical flow methods [7] remain applicable since they do not use 3D. In this second step, a hypothesis is used to obtain better results for 3D modeling: we assume that epipolar constraints are locally available in the generic images, such that the standard pair-wise stereo methods [18] may be applied after local rectifications.

The third sub-problem is the following: once cameras and matches between image pairs are known, how can a 3D model of the scene be recovered using generic tools? This model is a list of textured triangles in 3D which approximates the visible part of the scene where the camera has moved. We have to reconstruct 3D points, approximate them by a mesh, and deal with matching errors (false negatives and false positives), depth discontinuities, and a wide range of accuracies for reconstructed points (due to close foreground and far background, or view-point selection).

1.1. Contributions and Paper Overview

Our generic camera model is slightly different from previous ones. Previous authors [5, 17] model arbitrary imaging systems by a set of virtual sensing elements called raxels: a raxel is a central or perspective camera with a small part of the complete view field. In our case, a raxel is reduced to a single ray (a point origin and a direction in 3D) such that the raxel center is the ray origin. We know the calibration function, which maps image pixels to rays. This function also defines the choice of all ray origins ("the ray surface choice" [5]). A central camera is a special case, where all ray origins are the same point: the camera center.

Once the generic camera model is presented, Section 2 introduces a generic method to reconstruct points from image matches by ray intersection, a generalization for a generic camera of virtual uncertainty [9], and a generalization for n views of the 2-view-angle reliability [4]. Both virtual uncertainty and reliability were introduced for catadioptric cameras to select reconstructed points which are retained in the final 3D model. Such selections are important if parts of the scene are to be reconstructed at very different accuracies depending on the view-point selected for reconstruction. This is also true for any other camera with a wide field of view. In this paper, both virtual uncertainty and reliability have closely related and coherent definitions, which is not the case in previous works [4, 9].

Section 3 describes how to obtain a (local) 3D model for a few generic images given their corresponding camera poses, the calibration function and point correspondences. First, a reference image is chosen and segmented by a 2D mesh using gradient edges and color information. Second, points are reconstructed by our generic ray intersection. Third, 2D triangles are back-projected in 3D to fit the reconstructed points as best possible by taking into account a wide range of point depths. Virtual uncertainty is useful here to weight the minimized scores, to define the connections between triangles in 3D, and to fill holes. Finally, triangles with the worst reliability are rejected.

Section 4 provides many experiments for central cameras. Global models are obtained by combining the local models with a simple view-point selection method [9] using our (generic) virtual uncertainty. Last, Section 5 concludes and explains what should be added for non-central cameras.

1.2. Assumptions

The proposed method involves many assumptions. First, the scene surface should be smooth enough to be approximated by a list of triangles in 3D. Second, the majority of occluding contours (and the tangent discontinuities of surfaces) should occur at gradient edges or color discontinuities in images. Third, the generic camera should not be too exotic, so that connected image points (e.g. 2D triangles) back-project to connected points in 3D (e.g. planar scene parts): we assume that the calibration function which maps pixels to rays in 3D is piecewise $C^0$ continuous with known, smooth and polygonizable discontinuities. These discontinuities occur in practice for a multi-camera rig (e.g. the line between two composite images in the generic image of the stereo-rig).

2. Geometric Tools for a Generic Camera

This Section presents a method to reconstruct points (Section 2.1), virtual uncertainty (Section 2.2), reliability (Section 2.3) and geometric tests (Section 2.4).

The calibration function of the camera is known and maps pixels of a generic image to optical rays. An optical ray is an oriented line defined by its origin and direction. Thanks to the knowledge of the camera pose in the world coordinate system, the origin $o$ and direction $d$ ($\|d\| = 1$) of this ray in the world coordinate system are also known and used throughout the paper.

2.1. Point Reconstruction by Ray Intersection

Once point correspondences in images, calibration and successive poses of the camera are given, 3D points of the scene should be reconstructed. The standard method to reconstruct a point is the minimization of a sum of squared reprojection errors in pixels using the Levenberg-Marquardt method [6] (LM). However, these errors cannot be used in the generic context since they require the image projection function, which is specific to the kind of camera. The rays $(o_i, d_i)$ corresponding to observations in the $i$-th image of the 3D point $P$ to reconstruct should be used instead.

One solution is the reconstruction of $P$ by minimizing the sum of squares of the angles $\alpha_i$ between the vectors $d_i$ and $P - o_i$. In practice, the definition $\alpha_i(P) = \arccos\left(d_i^\top \frac{P - o_i}{\|P - o_i\|}\right)$ leads to poor LM convergence. This is not surprising: the $C^2$ continuity of $\alpha_i$ is recommended for a good (quadratic and final) convergence of LM, and it can be shown [1] that $\alpha_i$ is never $C^1$ continuous at a point $\tilde{P}$ such that $\alpha_i(\tilde{P}) = 0$.

Let $R_i$ be a rotation such that $R_i d_i = [0\ 0\ 1]^\top$ and $\pi$ the function $\pi([x\ y\ z]^\top) = [x/z\ y/z]^\top$. Once the rays $(o_i, d_i)$ are given, $i \in \{1, 2, \cdots, I\}$, we estimate $P$ as the minimizer of

$$E(\tilde{P}) = \sum_{i=1}^{I} \|\alpha_i(\tilde{P})\|^2 \quad \text{with} \quad \alpha_i(\tilde{P}) = \pi(R_i(\tilde{P} - o_i)). \quad (1)$$

Now $\alpha_i$ is $C^2$ continuous and the LM convergence is good in practice. Furthermore, $\|\alpha_i(P)\|$ is the tangent of the angle between $d_i$ and $P - o_i$. The tangent is a good angle approximation near the expected solution, where the angles are small. $P$ is retained if it is in front of the cameras (i.e. $d_i^\top (P - o_i) > 0$) and if $E(P)/I$ is less than a threshold.
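To make this concrete, here is a minimal Python sketch of the generic ray intersection (assuming NumPy and SciPy; the helper names rotation_to_z, residuals and reconstruct_point are ours, not from the paper). A real implementation would also supply a sensible initial guess P0 and apply the front-of-camera and E(P)/I checks described above.

```python
import numpy as np
from scipy.optimize import least_squares

def rotation_to_z(d):
    """Rotation R with R @ d = [0, 0, 1] (Rodrigues formula)."""
    d = d / np.linalg.norm(d)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(d, z)
    c = float(d @ z)
    if np.linalg.norm(v) < 1e-12:               # d already (anti-)parallel to z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]],
                   [v[2], 0, -v[0]],
                   [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def residuals(P, origins, rotations):
    """Stacked residuals alpha_i(P) = pi(R_i (P - o_i)) of Eq. (1)."""
    res = []
    for o, R in zip(origins, rotations):
        q = R @ (P - o)
        res.extend([q[0] / q[2], q[1] / q[2]])
    return np.asarray(res)

def reconstruct_point(origins, directions, P0):
    """Estimate P as the minimizer of E(P) = sum_i ||alpha_i(P)||^2 by LM."""
    rotations = [rotation_to_z(d) for d in directions]
    sol = least_squares(residuals, P0, args=(origins, rotations), method="lm")
    return sol.x
```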

2.2. Virtual Covariance and Uncertainty

Assume that the angle errors $\alpha_i$ defined in Eq. 1 follow independent and identical Gaussian noise $\mathcal{N}(0_{2\times 1}, \sigma_\alpha^2 I_{2\times 2})$. Let $J$ be the Jacobian of the function $\tilde{P} \mapsto [\alpha_1^\top \cdots \alpha_I^\top]^\top$. This noise propagates to a Gaussian noise for the estimated parameter $P$ with standard covariance matrix [6]

$$C(P) = \sigma_\alpha^2 (J(P)^\top J(P))^{-1}. \quad (2)$$

An estimate of $\sigma_\alpha^2$ is obtained from the residuals $E(P)$ of all 3D reconstructed points $P$.
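Eq. 2 can be transcribed directly, reusing residuals and rotation_to_z from the previous sketch; the finite-difference Jacobian below is our simplification (an analytic Jacobian is straightforward but longer).

```python
def point_covariance(P, origins, directions, sigma_alpha):
    """C(P) = sigma_alpha^2 (J^T J)^{-1} of Eq. (2); J is approximated here
    by central finite differences of the stacked residuals of Eq. (1)."""
    rotations = [rotation_to_z(d) for d in directions]
    J = np.empty((2 * len(origins), 3))
    h = 1e-6 * max(1.0, np.linalg.norm(P))
    for k in range(3):
        dP = np.zeros(3); dP[k] = h
        J[:, k] = (residuals(P + dP, origins, rotations)
                   - residuals(P - dP, origins, rotations)) / (2 * h)
    return sigma_alpha ** 2 * np.linalg.inv(J.T @ J)
```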

Now, assume that a point $P' \in \mathbb{R}^3$ and many ray origins $o_i' \in \mathbb{R}^3$, $i \in \{1, 2, \cdots, I\}$ are given. Let $d_i'$ be the direction $d_i' = \frac{P' - o_i'}{\|P' - o_i'\|}$. We can solve the minimization of $E$ defined by the rays $(o_i', d_i')$ instead of the rays $(o_i, d_i)$, and obtain the minimizer $P$ of $E$ with its standard covariance $C(P)$ in Eq. 2. However,

$$\alpha_i(P') = \pi(R_i(P' - o_i')) = \pi(\|P' - o_i'\| R_i d_i') = 0$$

and we conclude that $E(P') = 0$. Since the $E$ minimizer is unique (if $P'$ and the $o_i'$ are not collinear points), we obtain $P' = P$. Thus, the "virtual covariance matrix" of $P'$ for ray origins $o_i'$ is defined by $C(P')$ in Eq. 2.

At this point, the expression "virtual covariance matrix" is clearer: $C(P')$ is the covariance matrix obtained by reconstructing $P'$ from "virtual" rays $(o_i', d_i')$ using LM, i.e. rays which are not observation rays. In the special case where $P'$ was reconstructed before by LM from other rays $(o_i, d_i)$ corresponding to real observations in images, the corresponding standard covariance is similar to the virtual covariance if $o_i \approx o_i'$ and $d_i \approx d_i'$.

Finally, the virtual uncertainty $U(P')$ is defined by the length of the major semi-axis of the uncertainty ellipsoid defined by $C(P')$ and a probability $p$. This ellipsoid is

$$\Delta x^\top C^- \Delta x \le \chi^2_3(p), \quad C^- = \frac{1}{\sigma_\alpha^2} \sum_{i=1}^{I} \frac{I_{3\times 3} - d_i' d_i'^\top}{\|P' - o_i'\|^2} \quad (3)$$

with $\chi^2_3(p)$ the quantile function of the $\chi^2$ distribution with 3 d.o.f., $p$ a probability, and $C^-$ the inverse [1] of $C(P')$. Using the notation $e$ for the smallest eigenvalue of $C^-$, we have

$$U(P') = \sqrt{\frac{\chi^2_3(p)}{e}}. \quad (4)$$
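A possible transcription of Eqs. 3 and 4, using the closed form of $C^-$ so that no Jacobian is needed (the function name and default values are ours; the defaults follow the typical values quoted in Figure 1):

```python
from scipy.stats import chi2

def virtual_uncertainty(P, origins, sigma_alpha=0.001, p=0.9):
    """U(P') of Eq. (4) from the closed-form inverse covariance C^- of Eq. (3)."""
    C_inv = np.zeros((3, 3))
    for o in origins:
        v = P - o
        r2 = float(v @ v)
        d = v / np.sqrt(r2)                     # virtual ray direction d'_i
        C_inv += (np.eye(3) - np.outer(d, d)) / r2
    C_inv /= sigma_alpha ** 2
    e = np.linalg.eigvalsh(C_inv)[0]            # smallest eigenvalue of C^-
    return np.sqrt(chi2.ppf(p, df=3) / e)
```

Note that chi2.ppf(0.9, df=3) is approximately 6.25, the value used in the experiments.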

2.3. Reliability for 3D Modeling of a Scene

A point $P$ reconstructed from observation rays $(o_i, d_i)$, $i \in \{1, 2, \cdots, I\}$ may be so inaccurate that the 3D model should not contain it. At first glance, we can decide that $P$ is inaccurate for 3D modeling if $U(P)$ is larger than a given threshold $U_0$. In this Section, we introduce and justify a reliability definition $R(P)$ which is more adequate than $U(P)$ for thresholding.

If the generic camera is a central camera, the reconstruction is defined up to a global 3D scale, and a scale change of the whole reconstruction (3D points and camera centers) implies the same scale change of the uncertainties. For a central camera, the threshold $U_0$ must therefore be proportional to the scene scale to obtain a decision which is independent of the scale. This is a first reason to define

$$R(P) = \frac{U(P)}{\min_i \|P - o_i\|} \quad (5)$$

and decide that $P$ is inaccurate for 3D modeling if $R(P)$ is larger than a given threshold $R_0$.

We see that the maximal uncertainty permitted by the condition $R(P) < R_0$ is proportional to the distance between point $P$ and the ray origins $o_i$. More precisely, this inequality allows points with good accuracy (for 3D modeling of the scene) to have greater uncertainties if they are a long distance from the ray origins $o_i$, and smaller uncertainties if they are close. As a consequence, we can expect to model both the close foreground and the far background of the scene. This is the second reason for this definition of $R(P)$. Furthermore, through this inequality it is possible to moderate the ellipsoid size (uncertainty $U(P)$) in comparison with the distance between the ellipsoid center (the reconstructed point $P$) and the ray origins $o_i$. It is not difficult to prove [1] that $R(P)$ is arbitrarily large in two cases: (1) nearly parallel $d_i$ and (2) large values of $\|P - o_i\|$. Case (1) occurs if all $o_i$ are collinear points and $P$ goes toward the line of the $o_i$. Case (2) occurs for a distant point $P$. These cases should be avoided for 3D modeling. This is a third reason for this definition of $R(P)$.
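Eq. 5 then reduces to a few lines, reusing virtual_uncertainty from the previous sketch:

```python
def reliability(P, origins, **unc_kw):
    """R(P) of Eq. (5); a point is kept for 3D modeling if R(P) <= R0."""
    closest = min(np.linalg.norm(P - o) for o in origins)
    return virtual_uncertainty(P, origins, **unc_kw) / closest
```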

2.4. Geometric Tests

Let $\Pi$ be the plane $n^\top X + d = 0$. Once the virtual covariance matrix $C$ is defined, the Mahalanobis point-to-point and point-to-plane [1] squared distances are respectively

$$d^2(P_1, P_2) = (P_1 - P_2)^\top C^{-1}(P_1)(P_1 - P_2), \quad d^2(P_1, \Pi) = \min_{P_2 \in \Pi} d^2(P_1, P_2) = \frac{(n^\top P_1 + d)^2}{n^\top C(P_1) n}. \quad (6)$$

Here we introduce several tests which are systematically used by the mesh operations for 3D modeling. The point-to-point neighborhood test $T(P_1, P_2)$ is true if $d^2(P_1, P_2) \le \chi^2_3(p)$ and $d^2(P_2, P_1) \le \chi^2_3(p)$. The point-to-plane neighborhood test $T(P_1, \Pi)$ is true if $d^2(P_1, \Pi) \le \chi^2_3(p)$. The planarity test $T(\{P_i\})$ is true if there is a plane $\Pi$ such that all $T(P_i, \Pi)$ are true. In practice, $\Pi$ is estimated by random samples of 3 points in the list $\{P_i\}$.
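The tests of Eq. 6 might be transcribed as follows (function names are ours; C1 and C2 are the virtual covariances of P1 and P2):

```python
def test_point_point(P1, P2, C1, C2, p=0.9):
    """Symmetric point-to-point neighborhood test T(P1, P2)."""
    t = chi2.ppf(p, df=3)
    d12 = (P1 - P2) @ np.linalg.solve(C1, P1 - P2)   # d^2(P1, P2) of Eq. (6)
    d21 = (P2 - P1) @ np.linalg.solve(C2, P2 - P1)   # d^2(P2, P1)
    return d12 <= t and d21 <= t

def test_point_plane(P1, C1, n, d, p=0.9):
    """Point-to-plane neighborhood test T(P1, Pi) for the plane n^T X + d = 0."""
    return (n @ P1 + d) ** 2 / (n @ C1 @ n) <= chi2.ppf(p, df=3)
```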

These tests implicitly require, for each 3D point $P_i$, the corresponding ray origins, due to the virtual covariance definition in Section 2.2. If the generic camera is central or if the points are reconstructed by LM, the ray origins are known. They are unknown in other cases, unless we apply the projection functions to $P_i$ (but this is not a generic method). More investigations are needed to estimate ray origins efficiently in the non-central case.

3. 3D Model from Generic Images

This Section describes how to obtain a 3D model for a few generic images given their corresponding camera poses, the calibration function and image point correspondences.

First, a reference image is chosen and segmented by a 2D mesh using gradient edges and color information (Section 3.1). Second, points are reconstructed by intersection of observation rays as described in Section 2.1. Third, 2D triangles are back-projected in 3D to fit the reconstructed points by taking into account 3D point uncertainties and depth discontinuities (Section 3.2). Last, the most unreliable parts of the resulting 2.5D mesh (Section 3.3) are rejected. Assumptions are given in Section 1.

3.1. 2D Mesh

The 2D mesh in the generic reference image should satisfy many contradictory constraints: gradient edges at mesh edges, small enough mesh edges for a good approximation of gradient edges, large enough mesh triangles for stable estimation of triangles in 3D and efficient rendering, uniform sampling of the field of view, and a good aspect ratio for triangles. A compromise is obtained as follows.

Mesh Initialization. First, a Delaunay triangulation is initialized such that the solid angles of all triangles are roughly the same. In practice, simple checkerboards with two triangles for each rectangular cell are good enough for standard cameras like perspective, catadioptric, or stereo-rig. The $C^0$ discontinuities of the calibration function define the borders of independent 2D meshes in the reference image. Borders enforce constrained edges on the Delaunay triangulation and enforce the global shape of the checkerboards (in the catadioptric case, cell rows are concentric rings and cell columns are radial sections). The mesh resolution is defined by a mean length of cell edges equal to 8 pixels.

Gradient Edge Integration. Second, the gradient edges are integrated in the mesh by moving mesh vertices slightly and forcing mesh edges to be constrained. We do not take into account all gradient edges since the mesh resolution has been previously fixed, so they are integrated in best-first order. A contour is a list of connected pixels which have maximum local image gradient. Its score is equal to the sum of the gradient modulus over all its pixels. We pick the contour with the highest score, and find the list of closest vertices to its pixels such that the vertices have not been used before for any other contour. Then two consecutive vertices are moved slightly to approximate the contour if the part of the contour between the vertex ends is a segment. Once all contours have been considered by decreasing score, a completion step is used to try to constrain new mesh edges if they approximate a contour in their immediate neighborhood.

Mesh Refinement. Third, the 2D mesh is refined by alternating continuous improvements (move vertices to minimize a global cost combining color variance in triangles and mesh smoothness) and discrete improvements (flip edges and merge vertices to improve the aspect ratio of triangles).

The continuous mesh improvement is useful for many reasons. First, a few (parts of) gradient edges may be missed by the previous step, and minimizing the sum of color variances for each triangle is another way to increase the probability that the gradient edges are on the mesh edges. Second, the gradient edge integration deformed the initial mesh only locally, such that the constraint of the same solid angle for all triangles is highly violated. Minimizing the mesh smoothness (sum of squared moduli of an umbrella operator) is a way to incite incident triangles to have similar solid angles. Minimizing the mesh smoothness is also useful to improve the triangle aspect ratio and regularize the minimization of color variance.

The cost function is defined by

$$e_{2d}(\{p_v\}) = \sum_{p \in t \in T} \left\| c_p - \sum_{p' \in t} \frac{c_{p'}}{|t|} \right\|^2 + \lambda \sum_{v \in V} \Big\| \sum_{v' \in N_v} p_v - p_{v'} \Big\|^2$$

with $T$ the list of mesh triangles, $|t|$ the area of triangle $t$, $V$ the list of mesh vertices, and $N_v$ the list of vertices which are connected to $v$ by a mesh edge. The color $c_p$ at pixel $p$ is RGB, $p_v$ is the image location of vertex $v$, and $\lambda$ is equal to 1000. The cost function is minimized using a simple descent method with the vertex locations $\{p_v\}$ as parametrization. All mesh vertices are allowed to move in 2D, except vertices which are incident to a constrained edge (vertices at gradient edges). The latter are only allowed to move in 1D along the detected gradient edges. This gives priority to detected gradient edges over the minimization of color variance, which may sometimes be contradictory.
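As an illustration, a sketch of the evaluation of $e_{2d}$; the rasterization step (the colors_in callback returning the pixel colors inside a triangle) and the neighbors adjacency map are assumptions, not shown.

```python
def e_2d(vertex_pos, triangles, colors_in, neighbors, lam=1000.0):
    """Evaluate the 2D mesh cost e_2d: per-triangle color variance plus
    umbrella smoothness. colors_in(tri) must return the RGB colors of the
    pixels inside triangle tri as a (num_pixels, 3) array (assumed)."""
    cost = 0.0
    for tri in triangles:
        cols = colors_in(tri)
        cost += ((cols - cols.mean(axis=0)) ** 2).sum()
    for v, nbrs in neighbors.items():            # umbrella operator per vertex
        u = sum(vertex_pos[v] - vertex_pos[w] for w in nbrs)
        cost += lam * float(u @ u)
    return cost
```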

3.2. 2.5D Mesh

Assume that we have $I \ge 2$ camera poses and a dense list of 3D points $P$ reconstructed by intersection of $I$ observation rays (one for each pose) as described in Section 2.1. A 2D mesh in a reference image is also given.

First, the 2.5D mesh is initialized as a list of fully disconnected triangles in 3D. Then, this mesh is refined by alternating the discrete improvements "Triangle Connection", "Hole Filling", "Triangle Removal", "Triangle Damping" and the continuous improvement "Mesh Refinement". These mesh improvements are defined below thanks to the virtual covariance for the $I$ poses (Sections 2.2 and 2.4).

At any step, the 2.5D mesh in 3D is a back-projection of the 2D mesh in the reference image. In other words, each triangle $t^{2d}$ of the 2D mesh corresponds to (at most) one triangle $t^{3d}$ of the 2.5D mesh with vertices $v_i \in \mathbb{R}^3$. The $t^{3d}$ vertices are parameterized by depths $z_i > 0$ such that $v_i = o_i + z_i d_i$ with $(o_i, d_i)$ the observation rays of the $t^{2d}$ vertices. Vertices in the 2D mesh may have many depths depending on the current connections between triangles in 3D.

Mesh Initialization. In this step, each triangle $t^{2d}$ of the 2D mesh is individually back-projected to fit the 3D points as best possible with a RANSAC procedure.

First, all 3D points reconstructed from matched pixels inside $t^{2d}$ are collected in a list $L_{t^{2d}}$. Second, candidate planes are estimated from random samples of points in $L_{t^{2d}}$. Let $\Pi$ be the plane minimizing

$$E^2_{t^{2d}} = \sum_{P \in L_{t^{2d}}} \min\{\chi^2_3(p),\ d^2(P, \Pi)\} \quad (7)$$

with $\chi^2_3(p)$ and $d(P, \Pi)$ introduced in Eqs. 3 and 6. Then, we estimate the depths $z_i$ at the 3 vertices of $t^{2d}$ such that $o_i + z_i d_i \in \Pi$, with $(o_i, d_i)$ the observation rays of these vertices. The triangle in 3D with vertices $o_i + z_i d_i$ is added to the 2.5D mesh if $z_i > 0$.
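A sketch of this robust plane fit, scoring candidate planes from random 3-point samples with the truncated Mahalanobis cost of Eq. 7 (function name and iteration budget are ours):

```python
def fit_triangle_plane(points, covs, p=0.9, n_iters=200, seed=0):
    """RANSAC-style search for the plane (n, d) minimizing Eq. (7).
    points: list of 3D points of L_t2d; covs: their virtual covariances."""
    rng = np.random.default_rng(seed)
    t = chi2.ppf(p, df=3)
    best_score, best_plane = np.inf, None
    for _ in range(n_iters):
        i, j, k = rng.choice(len(points), 3, replace=False)
        n = np.cross(points[j] - points[i], points[k] - points[i])
        if np.linalg.norm(n) < 1e-12:
            continue                              # degenerate sample
        n /= np.linalg.norm(n)
        d = float(-n @ points[i])
        score = sum(min(t, (n @ P + d) ** 2 / (n @ C @ n))   # Eq. (7) terms
                    for P, C in zip(points, covs))
        if score < best_score:
            best_score, best_plane = score, (n, d)
    return best_plane
```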

Pair-Wise Triangle Connection. Triangles in 3D should be interconnected to obtain a more realistic 3D model. Let $t^{3d}_a$ and $t^{3d}_b$ be two 3D triangles such that the associated triangles $t^{2d}_a$ and $t^{2d}_b$ in the 2D mesh have a common edge ($t^{2d}_a$ and $t^{2d}_b$ are "weakly" connected). This edge has two vertices 0 and 1 in 2D, which correspond to triangle vertices $\{v^a_0, v^b_0\}$ and $\{v^a_1, v^b_1\}$ in 3D. The connection between $t^{3d}_a$ and $t^{3d}_b$ is effective if the point-to-point neighborhood tests $T(v^a_0, v^b_0)$ and $T(v^a_1, v^b_1)$ defined in Section 2.4 are true.

The connection between $t^{3d}_a$ and $t^{3d}_b$ is defined as follows. Let $z^a_i$ and $z^b_i$ be the depths such that $v^a_i = o_i + z^a_i d_i$ and $v^b_i = o_i + z^b_i d_i$, with $(o_i, d_i)$ the observation rays of the 2D vertices $i \in \{0, 1\}$. New values of $z^a_i$ and $z^b_i$ are set to the former value of $\frac{1}{2}(z^a_i + z^b_i)$. Henceforth, the 2.5D mesh parameters $z^a_i$ and $z^b_i$ are linked by the constraints $z^a_i = z^b_i$ for further processing.

Group-Wise Triangle Connection. The "Pair-Wise Triangle Connection" above connects any triangle pair in 3D if they satisfy neighborhood conditions. Here we introduce the "Group-Wise Triangle Connection", which connects any k-group of triangles in 3D if they satisfy a planarity condition (typically $k \in \{2, 3, 4\}$).

A k-group of triangles in 3D is a list of $k$ triangles $t^{3d}_j$ such that the corresponding triangles $t^{2d}_j$ are "strongly" connected in the 2D mesh. Two triangles are strongly connected if they have a common edge which is not constrained in the 2D mesh. We avoid constrained edges since they are potential surface discontinuities in 3D. Section 2.4 defines the planarity condition by $T(\{v_i\})$, with $\{v_i\}$ the list of all triangle vertices of the k-group.

Any triangle pair $\{t^{3d}_a, t^{3d}_b\}$ in 3D is connected as in the pair-wise case if it is included in a k-group satisfying the planarity condition and if the corresponding $t^{2d}_a, t^{2d}_b$ in 2D have a common edge.

Triangle Removal. A smooth surface is expected to be approximated by a list of connected triangles in 3D. If a triangle is not connected to (at least) one of its neighbors after the triangle connection trials, we have some doubt as to its quality and may decide to remove it from the 2.5D mesh. There are many reasons for fully disconnected and bad triangles in 3D: false positive matches in images (e.g. in the neighborhood of occluding contours), triangle estimations using 3D points in both the close foreground and the far background, and too few points for a reliable estimation.

Triangle Damping. The main drawback of "Triangle Removal" is the lack of triangles in scene parts which are not smooth, such as tree foliage. If a triangle $t^{3d}$ without connection is not removed, it may produce a major degradation of visual quality if it is very stretched in 3D in the direction $d_i$ of the rays which go across the $t^{3d}$ vertices. In this case, the angle $\theta$ between the $t^{3d}$ normal $n$ and $d_i$ is greater than a threshold $\theta_0$.

Thus, "Triangle Damping" reduces such degradations as follows: if $\theta_0 < \theta$, the $t^{3d}$ depths $z_i$ are disturbed such that (1) the $t^{3d}$ center is fixed and (2) $n$ is replaced by $\cos(\theta_0) d_i + \sin(\theta_0) \frac{\tilde{d}_i}{\|\tilde{d}_i\|}$ with $\tilde{d}_i = n - (n^\top d_i) d_i$. "Triangle Damping" may be preferred to "Triangle Removal" to obtain more triangles in the 3D model.
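The normal replacement of step (2) is a one-liner (sketch; n and d are assumed to be unit vectors):

```python
def damped_normal(n, d, theta0):
    """Replace normal n by one making the angle theta0 with ray direction d,
    as in "Triangle Damping"."""
    dt = n - (n @ d) * d                 # component of n orthogonal to d
    return np.cos(theta0) * d + np.sin(theta0) * dt / np.linalg.norm(dt)
```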

Hole Filling. In our context, a hole is a connected component of triangles $t^{2d}_j$ in the 2D mesh without corresponding triangles $t^{3d}_j$ in the 2.5D mesh. "Hole Filling" is the definition of the lacking $t^{3d}_j$ by interpolation of the depths available at the hole border. Holes are mainly due to false negative matches in low-textured areas, and they degrade the visual quality of the 3D model rendering if they are not properly filled.

The main risk is depth interpolation between foreground and background, which also degrades the rendering quality, especially if foreground and background have different colors. We have the choice between strong connectivity (used in "Group-Wise Triangle Connection") and weak connectivity (used in "Pair-Wise Triangle Connection") between two triangles in the 2D mesh to define a hole as a connected component. The former is preferred to the latter, which too easily includes potential surface discontinuities at constrained edges in the hole. As a consequence, the hole border is a list of edges in the 2D mesh such that (1) edges are constrained or (2) edges are not constrained and have depths at their two vertices. All 3D points corresponding to these vertices with depths are collected in a list $\{v_i\}$. We also define $r$ as the ratio between the sum of the 2D lengths of the edges of type (2) and the sum of the 2D lengths of all border edges.

To obtain a well-defined interpolation and reduce the risk of depth interpolation between foreground and background, we require that the hole border be planar, thanks to the planarity condition $T(\{v_i\})$ defined in Section 2.4. We also request enough 3D information at the hole border by thresholding $r$ ($0.5 < r$). If $T(\{v_i\})$ is true, there is a plane $\Pi$ which approximates the $v_i$, and "Hole Filling" is defined as follows. Each vertex in the hole (including the border) has a corresponding observation ray $(o_i, d_i)$ and a depth $z_i$ defined by $o_i + z_i d_i \in \Pi$. Any hole triangle $t^{2d}$ with positive $z_i$ at its vertices defines a new triangle $t^{3d}$ in the 2.5D mesh. Depth constraints are set for further processing such that these vertices have only one depth.

Mesh Refinement. The parameters of the 2.5D mesh are the list of depths $z_i$ for each triangle vertex in 3D, with many constraints (equalities) between the $z_i$. The improvements "Hole Filling" and "Pair/Group-Wise Triangle Connection" are useful to increase the rendering quality of the 3D model, but they reduce the number of independent $z_i$ and disturb the initial values of the $z_i$ obtained from the 3D point cloud. The consequence is an increasing discrepancy between the 3D points and the 2.5D mesh. This problem is reduced by minimizing a global cost function including a discrepancy term and a smoothness term. The smoothness term is useful to reduce noise and enforce a prior knowledge of a smooth surface on the 2.5D mesh.

The cost function to minimize is defined by

$$e_{3d}(\{z_i\}) = \sum_{t \in T} E^2_t + \lambda \sum_{\{t_1, t_2\} \in E} \frac{1}{2}(|t_1| + |t_2|)(n_{t_1} - n_{t_2})^2$$

with $T$ the list of 2D mesh triangles which have a triangle in 3D, $\{t, t_1, t_2\} \subset T$, $\{t_1, t_2\}$ the edge between triangles $t_1$ and $t_2$, $|t|$ the surface (in pixels) of $t$, $E$ the list of unconstrained edges in the 2D mesh, and $n_t$ the normal of the 3D triangle corresponding to the 2D triangle $t$. The weight $\lambda$ is equal to 1 and $E^2_t$ is defined in Eq. 7. The cost function is minimized by a descent method with the depths $\{z_i\}$ as parametrization. The depths have a wide range due to the close foreground and far background, and this should be taken into account to reduce the cost efficiently. Given a depth value $z^n_i$ at iteration $n$ of the descent method, we choose the value $z^{n+1}_i \in \{z^n_i - \delta_i(z^n_i),\ z^n_i,\ z^n_i + \delta_i(z^n_i)\}$ which minimizes the partial function $z_i \mapsto e_{3d}(z_i)$. Virtual uncertainty is used to scale the increment $\delta_i$ by $\delta_i(z) = \epsilon\, U(o_i + z d_i)$ with $\epsilon = 0.02$.
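A sketch of this descent over the depths; the global cost callback e3d and the per-depth ray bookkeeping are assumptions, and the depth equality constraints from triangle connections are omitted for brevity.

```python
def refine_depths(z, e3d, origins, directions, n_sweeps=10, eps=0.02, **unc_kw):
    """One-dimensional descent on each depth z_i with uncertainty-scaled
    increments delta_i(z) = eps * U(o_i + z d_i). e3d(z) evaluates the
    global cost of the 2.5D mesh (its implementation is assumed)."""
    z = list(z)
    for _ in range(n_sweeps):
        for i, (o, d) in enumerate(zip(origins, directions)):
            delta = eps * virtual_uncertainty(o + z[i] * d, origins, **unc_kw)
            candidates = (z[i] - delta, z[i], z[i] + delta)
            costs = []
            for c in candidates:
                z[i] = c                      # try each candidate depth
                costs.append(e3d(z))
            z[i] = candidates[int(np.argmin(costs))]
    return z
```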

Algorithm Summary. Many combinations of the mesh operations above are possible and have been the subject of experiments. Our favorite strategy currently is:

1. Mesh Initialization;
2. apply Group-Wise Triangle Connection (k = 4), Hole Filling and Mesh Refinement alternately;
3. Triangle Removal or Triangle Damping ($\theta_0 = \frac{7\pi}{20}$);
4. apply Pair-Wise Triangle Connection, Hole Filling and Mesh Refinement alternately.

Step 2 merges triangles with strong conditions before step 3. Once step 3 has removed improbable and unconnected triangles, step 4 connects triangles with weaker conditions.

3.3. Unreliable Parts

Once the 2.5D mesh is obtained, the reliability for 3D modeling (Eq. 5) allows the detection of unreliable vertices $v_i$ by thresholding, i.e. $R_0 < R(v_i)$. Any triangle which has an unreliable vertex is removed.

Figure 1. Virtual uncertainty U and reliability R in a plane for a generic (central) camera in three cases: two camera poses on the left, three collinear poses in the middle, and three non-collinear poses on the right. Camera poses are black points in this plane. Black edges are the main axes of the uncertainty ellipsoids centered at some points P (their length is 2U(P)). Every point P has a gray level depending on the interval where R(P) lies: [0, 1/40[, [1/40, 2/40[, [2/40, 3/40[, [3/40, 4/40[, [4/40, +∞] (darkest gray levels for largest reliabilities). On the left, we check that the curves R(P) = R0 with R0 ∈ {1/40, 2/40, 3/40, 4/40} are very similar to circles (in black). Typical values are obtained for U and R with $\sigma_\alpha = 0.001$ radian and $\chi^2_3(0.9) = 6.25$.

4. Experiments

Once camera poses and matching between image pairs are given, all experiments are done for the generic camera model restricted to central cameras (ray origins at the centre). Actually, specific pose methods [10, 9] are preferred to the generic method [11] since they also estimate calibration and have more successful automatic matching. Furthermore, the dense matching method between image pairs involves local rectifications which are specific to the camera model [9].

4.1. Properties of U(P) and R(P)

In the first case (on the left of Figure 1), U(P) and R(P) are shown for two camera locations or ray origins A and B. U(P) and R(P) are defined everywhere (except on the line defined by A and B). Due to the symmetry of the problem, U and R are the same for any plane in 3D containing A and B. As expected, they increase in two cases: (1) if P goes toward the line defined by A and B or (2) if P goes far away from A and B. We also see at the bottom that our reliability is very similar to the reliability given in [4]: the curves implicitly defined by R(P) = constant are very similar to the circles defined by angle(A, P, B) = constant.

In the second case (in the middle), a camera pose C is added in the middle of A and B. The result is unexpected: there is no improvement (i.e. decrease of U or R) from adding the third camera pose. In fact, the results are nearly the same. In the third case (on the right), C is moved toward the bottom. As expected, the improvement is noticeable in the neighborhood of the line defined by A and B. In these last two cases, our R definition is naturally derived from U for any number of views. This was not the case for the R definition of [4], which was only defined for two views.


Figure 2. Image projections of circular cones (with apex at the camera center) for perspective (left) and catadioptric (middle) cameras. The former is a 35mm camera with a 30mm lens. The latter is equiangular and has a field of view of ±50° above and below the plane orthogonal to the symmetry axis. Cone apertures are π/25 radian. An image taken by this catadioptric camera is also shown.

4.2. Comparing Specific and Generic Cameras

It is desirable that the virtual covariance obtained from the generic error (angle) be the same as the virtual covariance obtained from the specific error (image reprojection). We prove a simple condition [1] in image space to check this.

Assume that a point $P$, a specific camera model and $I$ camera poses are given. We also consider any point $X$ such that the image projection $p_i(X)$ is in the immediate neighborhood of $p_i(P)$ in the $i$-th image. The condition is

$$\|\alpha_i(X)\| = \frac{\sigma_\alpha}{\sigma_p} \|p_i(X) - p_i(P)\| + o(\|p_i(X) - p_i(P)\|) \quad (8)$$

with $\alpha_i$ defined from the camera center $o_i$ and the direction $d_i = \frac{P - o_i}{\|P - o_i\|}$ as described in Eq. 1. In other words, the projections of all circular cones (with apex at the camera center) of aperture $2\epsilon$ radians should be circles of radius $\frac{\sigma_p}{\sigma_\alpha}\epsilon$ pixels.

Figure 2 draws some of these projections for perspective and catadioptric cameras. In both cases, we note that the main differences between the specific and generic virtual covariances occur at the borders of the view fields, where the circles have the largest distortions.

4.3. 3D Model from Catadioptric Images

The field of view and an image taken by the catadioptric camera are shown in Figure 2. The definition of the calibration is completed by the radii of the large and small circles: 563 and 116 pixels respectively. The image sequence has 208 images (a closed turn around a church). Once the camera parameters and the dense matching between image pairs are estimated, each local model is reconstructed from 3 consecutive views with $\sigma_\alpha = 0.0011$ radian and $\chi^2_3(0.9) = 6.25$ as described in Section 3.

Figure 3 shows 3D models obtained with this sequence. The local model in the first row is obtained with the reference image given in Figure 2. The most unreliable parts drawn on the left are discarded in the middle with $R_0 = 0.08$. A part of the ground (circular hole) and the upper part of the facade are in the blind cones defined by the small and large circles of the catadioptric images and cannot be reconstructed for this reason. The second row shows views of another local model. In both examples, we see that a successful gradient edge integration in the meshes allows sharp modeling of $C^0$ and $C^1$ depth discontinuities (in spite of the low resolution of the catadioptric images).

The third row of Figure 3 shows a top view and a height map of the global model. The global model is obtained from the 208 local models around the church as follows. Let $U_l(P)$ be the virtual uncertainty defined at point $P$ with ray origins defined by the camera centers of a local model $l$. A triangle of local model $l'$ is retained in the global model if $U_{l'}(P) \le \beta \min_l U_l(P)$ with $\beta = 1.1$ and $P$ a vertex of the triangle. In other words, the triangle is retained if $l'$ provides one of the best (smallest) virtual uncertainties available from all local models [9]. This condition ignores the visibility of $P$ in the images, since $U_l(P)$ is well-defined everywhere in the generic context. In practice, the result is improved by taking the visibility into account as follows: we reset $U_l(P) = +\infty$ if $P$ is not in the view fields of $l$. The global model contains 567757 triangles.
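A sketch of this selection rule; the local-model interface (.triangles, .ray_origins, .sees) is hypothetical, and we apply the test at every vertex of a triangle, which is one possible reading of "P a vertex of the triangle".

```python
def select_global_triangles(local_models, beta=1.1):
    """Keep a triangle of local model l' if, for its vertices P,
    U_{l'}(P) <= beta * min_l U_l(P), with U_l(P) reset to +inf
    when P is outside the view field of l (visibility handling)."""
    kept = []
    for lp in local_models:
        for tri in lp.triangles:              # tri: tuple of 3 vertices in 3D
            if all(virtual_uncertainty(P, lp.ray_origins) <=
                   beta * min((virtual_uncertainty(P, l.ray_origins)
                               if l.sees(P) else np.inf)
                              for l in local_models)
                   for P in tri):
                kept.append(tri)
    return kept
```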

The video shows a walkthrough in the scene. The reconstruction is difficult in several parts, including trees and low-textured areas (e.g. street parts) which are not filled. Furthermore, the simple triangle selection above has two weaknesses referenced in [9]: the model redundancy increases with depth, and the self-occlusions of the surface are ignored. We have noted that the triangle selection confines case (1) of bad reliability (Section 2.3) to the ends of image sequences, and case (1) does not occur for a closed sequence like this one. Here we choose not to reject the most unreliable areas, and include the far background (case (2) of bad reliability).

4.4. 3D Model from Perspective Images

Figure 4 shows the results of the method applied to 28 (816×1088) images taken by a perspective camera with a lateral motion. The estimated focal length and radial distortion are $f = 1234$ and $\rho = -0.073$. Local and global models are reconstructed as in the catadioptric case with $\sigma_\alpha = 0.00059$ radian, $\beta = 1.02$ and $R_0 = 0.05$. The global model has 129635 triangles. Table 1 provides an estimate of the 3D noise for a few planes of the global models. The perspective camera has smaller noise than the catadioptric camera.

5. Conclusions

This paper presents geometric tools and results for 3D scene modeling using a generic camera model. First, virtual uncertainty and reliability are extended to generic cameras and compared with those of previous works. Second, these tools are systematically applied in the 3D model generation: fitting and connecting triangles in 3D, filling the holes due to matching errors, setting the depth resolution for 2.5D mesh optimization, rejecting the most unreliable triangles, and selecting view-points to obtain a global model. Finally, 3D models of a scene are obtained for both perspective and catadioptric cameras.

Figure 3. Several views of 3D models obtained with the catadioptric camera. Top and middle: views of two local models (3 consecutive poses) with rejection of unreliable triangles ($R_0 = 0.08$), except at the top left corner. Triangle orientations are also drawn using gray levels. Bottom: top view and height map of the global model (208 poses) with rejection ($R_0 = 0.05$).

Figure 4. Left: two images (among 28) taken by the perspective camera. Right: the global 3D model.

A problem should be solved to apply the method naturally on non-central cameras: the ray origin calculation for a 3D point expressed in the camera coordinate system. Future work also includes efficient matching methods in the generic context.

Camera     | per. | per. | per. | per. | cat. | cat. | cat. | cat.
Vertices   | 208  | 74   | 153  | 139  | 78   | 71   | 405  | 157
Depth (cm) | 188  | 246  | 358  | 524  | 212  | 283  | 319  | 358
RMS (cm)   | 0.23 | 0.37 | 0.55 | 0.74 | 0.92 | 0.98 | 1.77 | 1.21

Table 1. 3D noise for planar parts of the global models. Each column provides information about one part: camera (perspective or catadioptric), number of vertices, mean distance between a vertex and the closest camera (cm), and the RMS of the distances between the vertices and the estimated plane (cm). Vertices are selected for a part if they project into an ellipse of an image (white ellipses in Figs. 2 and 4). The distance between two consecutive camera poses is about 30 cm.

References

[1] File proofs.pdf in the supplementary material.

[2] A. Akbarzadeh, J. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, H. Towles, D. Nister, and M. Pollefeys. Towards urban 3d reconstruction from video. In 3DPVT'06.

[3] R. Bunschoten and B. Krose. Robust scene reconstruction from an omnidirectional vision system. IEEE Transactions on Robotics and Automation, pages 351–357, 2003.

[4] P. Doubek and T. Svoboda. Reliable 3d reconstruction from a few catadioptric images. In OMNIVIS'02.

[5] M. Grossberg and S. Nayar. A general imaging model and a method for finding its parameters. In ICCV'01.

[6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[7] J. Barron, D. Fleet, and S. Beauchemin. Performance of optical flow techniques. IJCV, 12(1), 1992.

[8] J. Kannala and S. Brandt. A generic camera model and calibration method for conventional, wide-angle and fish-eye lenses. IEEE PAMI, 28(8), 2006.

[9] M. Lhuillier. Toward flexible 3d modeling using a catadioptric camera. In CVPR'07.

[10] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE PAMI, 27(3), 2005.

[11] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Generic and real time structure from motion. In BMVC'07.

[12] D. Nister. Automatic Dense Reconstruction from Uncalibrated Video Sequence. PhD thesis, Royal Institute of Technology KTH, Stockholm, Sweden, 2001.

[13] D. Nister. An efficient solution for the five-point relative pose problem. IEEE PAMI, 26(6), 2004.

[14] D. Nister and H. Stewenius. A minimal solution to the generalized 3-point pose problem. JMIV, 27(1), 2007.

[15] R. Pless. Using many cameras as one. In CVPR'03.

[16] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. IJCV, 59(3), 2004.

[17] S. Ramalingam, S. Lodha, and P. Sturm. A generic structure-from-motion framework. CVIU, 103(3), 2006.

[18] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47(2), 2002.
