

7.2.2 Correspondence Search with Three Views

As with the binocular systems described in the previous sections, the common metrics for dense trinocular stereo are SAD, SSD, and NCC. The key difference lies in how these scores are combined to exploit information from the third image. It is at the level of correspondence search that the third (or additional) camera actually makes its contribution.

Many early trinocular systems were edge-based (Ayache 1991; Dhond and Aggarwal 1991). Match criteria included similarity in edge gradient, orientation, segment length and zero-crossing direction. Generally, a set of candidate edges or edge pixels is selected in the reference pair, and these are transferred to and verified in the third image.

Fua (1993) presented an early dense correlation-based trinocular stereo algorithm. As previously described, a reference image is correlated with two or more images of the scene at several resolutions in parallel, with a fixed window size. A left–right consistency check is used to validate the matches for each pair, and for each pair the disparity with the highest score over all levels of detail is chosen. The pairwise disparity maps are all relative to a single reference frame, so they in turn are merged by choosing the valid disparity with the highest score for each pixel. Presumably the fact that the disparities come from different pairs is accounted for during reconstruction. The transfer of matches to new views is not explicitly computed; only the relative match score for a reference pixel over the multiple disparity maps is used for match selection. Faugeras et al. (1993) present a real-time version of this system. They exploit the epipolar geometry further by using a triple of cameras mounted in an L-shape, which allows the columns of the up–down pair and the rows of the left–right pair to be aligned to optimize the search. On special-purpose hardware of the era, the system achieves four frames per second at 256×256 pixel resolution and 32 disparities.
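To make the merging step concrete, the following is a minimal sketch (not Fua's actual implementation), assuming the pairwise disparity maps, their correlation scores and the left–right validity masks are already available as NumPy arrays registered to the reference image. The index of the winning pair is returned so that a later reconstruction step can apply the correct pair geometry.

```python
import numpy as np

def merge_pairwise_disparities(disp_maps, score_maps, valid_masks):
    """Merge pairwise disparity maps (all referred to the same reference image)
    by keeping, at each pixel, the left-right-validated disparity whose
    correlation score is highest, in the spirit of Fua (1993)."""
    disp_maps = np.stack(disp_maps).astype(np.float64)  # (P, H, W) disparities per pair
    score_maps = np.stack(score_maps)                   # (P, H, W) correlation scores per pair
    valid_masks = np.stack(valid_masks)                 # (P, H, W) booleans from LR checks

    # Invalidate scores where the left-right check failed.
    scores = np.where(valid_masks, score_maps, -np.inf)
    best_pair = np.argmax(scores, axis=0)               # winning pair per pixel
    rows, cols = np.indices(best_pair.shape)
    merged = disp_maps[best_pair, rows, cols]
    merged[~np.any(valid_masks, axis=0)] = np.nan       # no valid match in any pair
    return merged, best_pair
```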

Another early dense multiview stereo system was the multiple-baseline stereo of Okutomi and Kanade (1993). This was the first system to combine SSD scores from multiple pairs of images (with a variety of baselines) using a sum-of-SSD (SSSD) metric expressed as a function of inverse distance 1/z, called the SSSD-in-inverse-distance metric. The authors point out that correlation scores from independent pairs cannot be combined on the basis of raw disparities, since disparity scales with each pair's baseline. Given the SSSD-in-inverse-distance metric, matching proceeds as usual. The advantage over methods which select candidate matches from one or more stereo pairs and cross-validate them is that the combined metric includes all of the available information, so false intermediate matches do not influence the result. For example, we can see in Figure 7.10(a) that neither pairwise SSD score has a clear minimum at the point indicated by the SSSD. This basic method has been implemented in special-purpose hardware as a video-rate stereo machine (Kanade et al. 1996).
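The combination step can be illustrated with a short sketch. It assumes grayscale NumPy images, rectified pairs that share the reference camera, known baselines and focal length, and a common grid of candidate inverse distances; boundary handling, sub-pixel interpolation and the disparity sign convention are glossed over.

```python
import numpy as np

def sssd_in_inverse_distance(ref, others, baselines, focal, x, y,
                             inv_depths, win=5):
    """Sum pairwise SSD scores over a common inverse-distance axis (zeta = 1/z),
    following the SSSD-in-inverse-distance idea of Okutomi and Kanade (1993).
    For a rectified pair with baseline B and focal length f, disparity = B * f * zeta,
    so every pair is sampled at the same candidate inverse distances."""
    h = win // 2
    ref_patch = ref[y - h:y + h + 1, x - h:x + h + 1].astype(np.float64)
    sssd = np.zeros(len(inv_depths))
    for img, B in zip(others, baselines):
        for k, zeta in enumerate(inv_depths):
            d = int(round(B * focal * zeta))        # pair-specific disparity for this zeta
            patch = img[y - h:y + h + 1,
                        x - d - h:x - d + h + 1].astype(np.float64)
            sssd[k] += np.sum((ref_patch - patch) ** 2)
    best = inv_depths[np.argmin(sssd)]              # minimum of the combined SSSD curve
    return best, sssd
```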

The National Tele-Immersion Initiative (Lanier 2001; Mulligan et al. 2004) spawned a sequence of high-speed stereo systems, including a number of trinocular stereo approaches.

The main trinocular system (Mulligan et al. 2001) uses the modified NCC (MNCC) metric from Equation (7.7) to generate five depth images from a set of seven cameras used as overlapped triples. Each triple is rectified as two separate pairs, the (right) reference pair C2–C3 and the left pair C2–C1. Transfer to C1 is precomputed for a range of right-pair disparities for each pixel in the reference image C2. The resulting left disparities are used to estimate a linear model for the disparity d_L in the left pair, parameterized by the pixel location (u_2, v_2) and the disparity d_R in the right pair. As the search proceeds in the reference image, for each (u_2, v_2, d_R) the MNCC score is computed, then the left disparity is estimated as d_L = M(u_2, v_2) d_R + b(u_2, v_2). The summed MNCC profile is computed as S_MNCC(u_2, v_2, d_R) = MNCC(u_2, v_2, d_R) + MNCC(u_2, v_2, d_L). As in Okutomi and Kanade (1993), the selected disparity is determined by the extremum of S_MNCC. This system computes disparity frames for one camera triple at 15–20 frames per second at 320×240 pixel resolution for 64 disparities on current desktop technology.
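A minimal sketch of this selection loop is given below. The functions mncc_right and mncc_left are hypothetical stand-ins for the pairwise MNCC evaluation, and M and b are assumed to hold the precomputed linear transfer model for each reference pixel.

```python
import numpy as np

def trinocular_disparity(u, v, d_range, mncc_right, mncc_left, M, b):
    """Select the disparity maximizing the summed MNCC score, roughly following
    Mulligan et al. (2001).  `mncc_right(u, v, dR)` and `mncc_left(u, v, dL)` are
    hypothetical callables returning the MNCC score for the reference-right and
    reference-left rectified pairs; M[v, u] and b[v, u] hold the precomputed
    linear transfer model d_L = M(u, v) * d_R + b(u, v)."""
    best_d, best_score = None, -np.inf
    for dR in d_range:
        dL = M[v, u] * dR + b[v, u]                      # transfer to the left pair
        s = mncc_right(u, v, dR) + mncc_left(u, v, dL)   # S_MNCC(u, v, dR)
        if s > best_score:
            best_score, best_d = s, dR
    return best_d, best_score
```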

A group of multiview stereo techniques which are typically not real-time are those which exploit iterative optimization methods. Level set methods refine an initial world surface by iterating partial differential equations derived from the constraints of the problem (Zhao et al. 2001). They automatically accommodate changes in topology as surfaces evolve, and allow authors to relax some of the assumptions inherent in correlation-based approaches.

The key for multiview stereo is to formulate equations which capture the imaging constraints and deform the estimated world surface to match its appearance in a set of calibrated views (correspondence). Faugeras and Keriven (1998) incorporate the normalized cross-correlation metric, but account for the fact that surface patches are unlikely to be frontoparallel to all cameras. Jin et al. (2003) describe their approach as correlating images to models rather than images to images. They attempt to eliminate the Lambertian assumption inherent in correlation by adopting a diffuse plus specular model of scene radiance. Under this model a matrix R, composed of intensities from the projection of a tessellated surface patch into each view, should have rank two. The differential equations used for iteration are based on the difference between the observed R and the current specular and diffuse components for the estimated surface.
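As an illustration of the rank-two constraint, the following sketch (an assumption-laden simplification, not the authors' algorithm) measures how far an observed radiance matrix is from rank two using its singular values.

```python
import numpy as np

def rank2_residual(R):
    """Measure how far the radiance matrix R (rows: views, columns: samples on a
    tessellated surface patch) is from the rank-two model implied by a
    diffuse-plus-specular radiance, using the trailing singular values."""
    s = np.linalg.svd(R, compute_uv=False)
    # Energy beyond the first two singular values: zero for an exact rank-2 matrix.
    return float(np.sum(s[2:] ** 2) / np.sum(s ** 2))
```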

The topic of volumetric reconstruction from multiple views will be addressed in Chapter 8. We will only mention here that space-carving approaches (Kutulakos and Seitz 2000; Seitz and Dyer 1999) exploit constraints on multi-camera geometry similar to those employed in multiview stereo. Effectively, they search along the ray from a reference image pixel, projecting world points represented by voxels into a set of image views and keeping those voxels which produce the same appearance in all views (correspondences). The similarity or consistency metric is computed on the set of pixels from all views around the current projected point. Measures such as the standard deviation or a likelihood ratio test are used to determine consistency.
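The consistency test can be sketched as follows, assuming grayscale NumPy images and a hypothetical project() helper that returns the (row, col) projection of a voxel in each view; visibility reasoning is ignored, and the standard deviation threshold stands in for the consistency measures mentioned above.

```python
import numpy as np

def photo_consistent(samples, max_std=10.0):
    """Photo-consistency test in the style of space carving: `samples` are the
    pixel values gathered from every view in which the candidate voxel is
    (assumed) visible.  The voxel is kept only if the views agree, here
    measured by the standard deviation of the samples against a threshold."""
    samples = np.asarray(samples, dtype=np.float64)
    return samples.std() <= max_std

def carve(voxels, project, images, max_std=10.0):
    """Keep only voxels whose projections look the same in all views.
    `project(voxel, i)` is a hypothetical helper returning the (row, col) pixel
    of `voxel` in view i (occlusion handling omitted for brevity)."""
    kept = []
    for voxel in voxels:
        samples = [images[i][project(voxel, i)] for i in range(len(images))]
        if photo_consistent(samples, max_std):
            kept.append(voxel)
    return kept
```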

7.2.3 Post-processing

Many of the post-processing techniques described in Section 7.1.3 are used in trinocular systems as well. One special claim for systems which use more than two views is that occlusion problems are reduced, because regions occluded in one view may still be visible in several others. Whether this claim holds depends on the approach the system takes to correspondence. For example, if matching depends on hypotheses from a reference pair, then any regions semioccluded in that pair are unlikely to be matched correctly in spite of being visible in the third image.

Kang et al. (2001) address this question of pixels visible in some, but not all, images in a multiview stereo system. They propose using shiftable windows, similar to the approach of Bobick and Intille (1999), combined with temporal selection. Temporal selection means computing a similarity metric such as SSD for all views, then choosing the subset (50%) of views with the lowest scores and summing those scores. Presumably a pixel's score will be poor for pairs in which it is semioccluded, and better for pairs where it is fully visible. Effectively this is an SSSD approach which discards scores that are too large.
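A minimal sketch of temporal selection, assuming the per-view SSD scores have already been computed for every pixel and candidate disparity:

```python
import numpy as np

def temporal_selection_cost(ssd_scores, keep_fraction=0.5):
    """Combine per-view SSD scores by temporal selection in the spirit of
    Kang et al. (2001): at each pixel and disparity, keep only the best
    `keep_fraction` of views (lowest SSD) and sum them, so views in which the
    pixel is semioccluded contribute nothing.  `ssd_scores` has shape
    (num_views, H, W, num_disparities)."""
    k = max(1, int(round(keep_fraction * ssd_scores.shape[0])))
    ordered = np.sort(ssd_scores, axis=0)   # ascending: best (lowest) scores first
    return ordered[:k].sum(axis=0)          # an SSSD that discards large scores
```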

7.3 CONCLUSION

Much of the 3D information which can be derived from multiview systems depends on identifying the projections of world points in two or more views. The critical question is how to measure similarity between image features or regions and choose the best match. Determining these corresponding image points is a challenging and active research problem.

This chapter has reviewed the basic definitions and approaches for stereo analysis for two or more cameras, as well as describing many of the seminal systems in the area.

Computing robust disparity maps in real time requires exploiting constraints such as restricting the correspondence search to 1D along the epipolar line, enforcing unique matches, and assuming the scene contains smooth surfaces. Area-based techniques have proven effective in generating dense disparity maps at high frame rates, even on common desktop computers. Unfortunately, they are unreliable in image regions with low or repeated texture, and at occlusion boundaries and half-occluded regions. Bad matches can be reduced by post-processing using techniques such as right–left checking, region entropy and peak sharpness in the similarity metric.

For systems with three or more cameras, similarity metrics must combine evidence from all views. For calibrated systems we can use the trinocular epipolar constraint to transfer a match in one image pair to a third image. Knowledge of the camera geometry allows us to refer the metric to a reference image or to world scene points. We can think of these systems either as searching the space of possible surfaces whose projections match the images (scene-based), or as performing a straightforward comparison among images (image-based). Post-processing particular to multi-camera systems consists of discarding evidence from images in which the considered scene point is occluded.

REFERENCES

Anandan P 1989 A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision 2, 283–310.

Atzpadin N, Kauff P and Schreer O 2004 Stereo analysis by hybrid recursive matching for real-time immersive video conferencing. IEEE Transactions on Circuits and Systems for Video Technology 14(3), 321–334.

Ayache N 1991 Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception. MIT Press, Cambridge, MA.

Banks J, Bennamoun M, Kubik K and Corke P 1998 Evaluation of new and existing confidence measures for stereo matching. Proceedings of the Image and Vision Computing NZ Conference (IVCNZ'98), Auckland, NZ, pp. 252–261.

Barron J, Fleet DJ and Beauchemin S 1994 Performance of optical flow techniques. International Journal of Computer Vision 12(1), 43–77.

Bhat DN and Nayar SK 1995 Stereo in the presence of specular reflection. Proceedings of the 5th International Conference on Computer Vision, pp. 1086–1092.

Birchfield S 1999 Depth and Motion Discontinuities. Department of Electrical Engineering, Stanford University.

Bobick AF and Intille SS 1999 Large occlusion stereo. International Journal of Computer Vision 33(3), 181–200.

Böröczky L 1991 Pel-recursive Motion Estimation. Department of Electrical Engineering, Delft University of Technology.

Cochran S and Medioni G 1992 3-D surface description from binocular stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(10), 981–994.

Dhond UR and Aggarwal JK 1991 A cost-benefit analysis of a third camera for stereo correspondence. International Journal of Computer Vision 6(1), 39–58.

Egnal G, Mintz M and Wildes R 2002 A stereo confidence metric using single view imagery. Proceedings of Vision Interface, pp. 162–170.

Faugeras O and Keriven R 1998 Variational principles, surface evolution, PDE's, level set methods, and the stereo problem. IEEE Transactions on Image Processing 7(3), 336–344.

Faugeras O, Vieville T, Theron E, Vuillemin J, Hotz B, Zhang Z, Moll L, Bertin P, Mathieu H, Fua P, Berry G and Proy C 1993 Real-time correlation-based stereo: algorithm, implementations and applications. Technical Report RR-2013, INRIA Sophia-Antipolis.

Fua P 1993 A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications 6, 35–49.

Hartley R and Zisserman A 2003 Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge, UK.

Horn BKP and Schunck BG 1981 Determining optical flow. Artificial Intelligence 17, 185–204.

Jin H, Soatto S and Yezzi A 2003 Multi-view stereo beyond Lambert. Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. I-171–I-178.

Kanade T and Okutomi M 1994 A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 16(9), 920–932.

Kanade T, Yoshida A, Oda K, Kano H and Tanaka M 1996 A stereo machine for video-rate dense depth mapping and its new applications. Proceedings of the 15th Computer Vision and Pattern Recognition Conference (CVPR '96), pp. 196–202.

Kang SB, Szeliski R and Chai J 2001 Handling occlusions in dense multi-view stereo. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'01), Kauai, HI, vol. I, pp. 103–110.

Konolige K 1997 Small vision systems: hardware and implementation. Proceedings of the Eighth International Symposium on Robotics Research (Robotics Research 8), Hayama, Japan, pp. 203–212.

Kutulakos KN and Seitz SM 2000 A theory of shape by space carving. International Journal of Computer Vision 38(3), 199–218.

Lanier J 2001 Virtually there. Scientific American, April, pp. 66–75.

Leclerc YG 1989 Constructing simple stable descriptions for image partitioning. International Journal of Computer Vision 3(1), 73–102.

Levine MD, O'Handley DA and Yagi GM 1973 Computer determination of depth maps. Computer Graphics and Image Processing 2, 131–150.

Marr D and Poggio T 1979 A theory of human stereo vision. Proceedings of the Royal Society of London B 204, 301–328.

Moravec H 1980/81 Robot rover visual navigation. Computer Science: Artificial Intelligence, No. 3, pp. 105–108.

Mulligan J and Daniilidis K 2000 Trinocular stereo for non-parallel configurations. Proceedings of the 15th International Conference on Pattern Recognition, vol. 1, Barcelona, Spain, pp. 567–570.

Mulligan J, Isler V and Daniilidis K 2001 Performance evaluation of stereo for tele-presence. Proceedings of the 8th IEEE International Conference on Computer Vision (ICCV'01), vol. 2, Vancouver, BC, Canada, pp. 558–565.

Mulligan J, Zampoulis X, Kelshikar N and Daniilidis K 2004 Stereo-based environment scanning for immersive telepresence. IEEE Transactions on Circuits and Systems for Video Technology 14(3), 304–320.

Nagel HH and Enkelmann W 1986 An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 8, 565–593.

Ohm J, Grüneberg K, Izquierdo E, Hendriks MKE, Redert A, Kalivas D and Papadimatos D 1997 A realtime hardware system for stereoscopic videoconferencing with viewpoint adaptation. Proceedings of the International Workshop on Synthetic–Natural Hybrid Coding and Three Dimensional Imaging, Rhodes, pp. 147–150.

Okutomi M and Kanade T 1993 A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 15(4), 353–363.

Park JI and Inoue S 1997 Hierarchical depth mapping from multiple cameras. Proceedings of the International Conference on Image Analysis and Processing (ICIAP), pp. 685–692.

Seitz SM and Dyer CR 1999 Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision 35(2), 151–173.

Stefano LD and Mattoccia S 2000 Real-time stereo for a mobility aid dedicated to the visually impaired. Proceedings of the 6th International Conference on Control, Automation, Robotics and Computer Vision (ICARCV 2000), Singapore.

van der Wal G, Hansen M and Piacentino M 2000 The Acadia vision processor. IEEE Proceedings of the International Workshop on Computer Architecture for Machine Perception, Padua, Italy, pp. 31–40.

Woodfill J and Herzen BV 1997 Real-time stereo vision on the PARTS reconfigurable computer. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, pp. 201–210.

Zabih R and Woodfill J 1994 Non-parametric local transforms for computing visual correspondence. Proceedings of the 3rd European Conference on Computer Vision, Stockholm, pp. 151–158.

Zhao HK, Osher S and Fedkiw R 2001 Fast surface reconstruction and deformation using the level set method. Proceedings of the 1st Workshop on Variational and Level Set Methods in Computer Vision (VLSM'01), pp. 194–201.
