
We would like to extract visual features that are general enough (not specific to particular object types) and at the same time sufficiently informative to perform AV integration.

In this section we present the technique used to extract such features, called “interest points”, and to reconstruct them in the scene.



Figure 2.2: Binocular geometry. (a) Basic pinhole camera model. C is the camera centre, (x_cam, y_cam, z_cam) is the camera frame, s is a point in 3D and p is its projection on the image plane. (b) Point correspondence. The two cameras are indicated by their centres C and C′ and their image planes. An image point p back-projects to a ray in 3D space defined by C and p. This ray is imaged as a line l′ in the second view.

The visual data is gathered using a pair of stereoscopic cameras, i.e. binocular vision.

We assume the basic pinhole camera model [Hartley 2003] that establishes a projective mapping

s = (x, y, z) \mapsto p = \left( \frac{\mathbf{p}_1 s}{\mathbf{p}_3 s},\; \frac{\mathbf{p}_2 s}{\mathbf{p}_3 s} \right) \qquad (2.1)

of a point s in 3D onto the image plane. We denote by p_i the i-th row of the camera matrix P = AR(I | −C), applied here to s written in homogeneous coordinates, where A is the matrix of camera intrinsic parameters, R and C are respectively the rotation and the translation of the camera frame with respect to some reference frame (extrinsic parameters), and I is the 3×3 identity matrix. For the exact meaning of the entries of A we refer to [Hartley 2003]. The extrinsic and intrinsic parameters of a camera are obtained through camera calibration, as mentioned before in Section 2.1. A schematic representation of the basic pinhole camera model is given in Figure 2.2a.
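As an illustration of this projection, the following sketch (assuming NumPy and made-up calibration values rather than those of our actual cameras) assembles the camera matrix P = AR(I | −C) and applies Equation (2.1) to a scene point.

```python
import numpy as np

def project(A, R, C, s):
    """Project a 3D point s onto the image plane of a pinhole camera.

    A: 3x3 intrinsic matrix, R: 3x3 rotation, C: camera centre, so that
    P = A R [I | -C]; the image point is p = (p1.s / p3.s, p2.s / p3.s),
    with s taken in homogeneous coordinates (Equation 2.1).
    """
    P = A @ R @ np.hstack([np.eye(3), -C.reshape(3, 1)])  # 3x4 camera matrix
    p = P @ np.append(s, 1.0)                             # homogeneous image point
    return p[:2] / p[2]                                   # perspective division

# Illustrative calibration: 500 px focal length, principal point (320, 240),
# camera located at the origin of the reference frame and aligned with it.
A = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)
C = np.zeros(3)

print(project(A, R, C, np.array([0.2, -0.1, 2.0])))  # image coordinates (u, v)
```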

Under the pinhole camera model, an image point corresponds to a ray of light that passes through the camera centre and intersects the image plane. Given a pair of cameras C and C′ and a point p in camera C, the location p′ of the same point in the other camera is constrained to an epipolar line l′, as shown in Figure 2.2b. Thus for every scene point s one can introduce the notion of epipolar disparity d as the displacement of an image point along the corresponding epipolar line [Hansard 2008]. For a rectified camera pair [Hartley 2003] an invertible function F : R^3 → R^3 can be defined that maps a scene point s = (x, y, z) onto a cyclopean image point f = (u, v, d) corresponding to a 2D image location (u, v) and an associated binocular disparity d:

F(s) = \left( \frac{x}{z},\; \frac{y}{z},\; \frac{B}{z} \right) \qquad (2.2)

where B is the baseline length (the distance between the camera centres C and C′), measured in focal lengths of the camera. Without loss of generality we further scale the disparity component and let B = 1, which gives the following feature space mapping

F(s) = \left( \frac{x}{z},\; \frac{y}{z},\; \frac{1}{z} \right) \quad \text{and} \quad F^{-1}(f) = \left( \frac{u}{d},\; \frac{v}{d},\; \frac{1}{d} \right). \qquad (2.3)

Figure 2.3: Visual observations on the left and right camera images. White circles depict the “interest points”; coloured squares mark those that are matched to a point in the other image. The epipolar lines correspond to the point marked by a star in the opposite image.

This model can be easily generalized from a rectified camera pair configuration to more complex binocular geometries [Hansard 2007, Hansard 2008]. We use a sensor-centered coordinate system to represent the object locations.
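A minimal sketch of the mapping of Equation (2.3) and of its inverse, assuming a rectified pair with unit baseline and NumPy arrays of points, could look as follows; the values in the round-trip check are purely illustrative.

```python
import numpy as np

def F(s):
    """Map scene points s = (x, y, z) to cyclopean observations f = (u, v, d),
    assuming a rectified camera pair with unit baseline (Equation 2.3).
    For a baseline B != 1 the disparity component would be B / z instead."""
    x, y, z = s[..., 0], s[..., 1], s[..., 2]
    return np.stack([x / z, y / z, 1.0 / z], axis=-1)

def F_inv(f):
    """Reconstruct scene points from cyclopean observations: s = (u/d, v/d, 1/d)."""
    u, v, d = f[..., 0], f[..., 1], f[..., 2]
    return np.stack([u / d, v / d, 1.0 / d], axis=-1)

# Round-trip check on an arbitrary scene point (illustrative values).
s = np.array([0.5, -0.2, 3.0])
assert np.allclose(F_inv(F(s)), s)
```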

Visual observations f = {f_1, ..., f_M} in our experiments are obtained as follows.

First we detect points of interest (POI) in both the left and right images. Second we perform stereo matching such that a disparity value is associated with each matched point.

Figure 2.4: Visual observations f in the Cyclopean image space (on the right) and their reconstructed correspondences in the scene space (on the left), obtained through applying F^{-1}. Point colour represents the d or z coordinate in the Cyclopean image space or the scene space respectively.

In practice we used the POI detector described in [Harris 1988]. This detector is known to have high repeatability in the presence of texture and to be photometrically invariant. We analyse each image point detected this way and select those points associated with a significant motion pattern. Motion patterns are obtained in a straightforward manner: a temporal intensity variance σ_t is estimated at each POI. Assuming stable lighting conditions, a POI belongs to a static scene object if its temporal intensity variance is low and non-zero only because of camera noise. For image points belonging to a dynamic scene object, the local variance is higher and depends on the texture of the moving object and on the motion speed. In our experiments, we estimated the local temporal intensity variance σ_t at each POI from a collection of 5 consecutive frames. A point is labelled “motion” if σ_t > 5 (for 8-bit gray-scale images); otherwise it is labelled “static”. The motion-labelled points are then matched and the associated disparities are estimated using standard stereo methods. The features we use are obtained with the method described in [Hansard 2007]. Examples are shown in Figure 2.3. Alternatively, we could have used the spatio-temporal point detector described in [Laptev 2005]. This method is designed to detect points in a video stream that have a large local variance in both the spatial and temporal domains, thus representing abrupt events in the stream. However, such points are quite rare in the data flows we work with.
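Purely as an illustration of this motion-labelling step, the sketch below detects Harris-type corners with OpenCV and applies the σ_t > 5 threshold over a buffer of 5 consecutive 8-bit gray-scale frames; the detector parameters are illustrative defaults, not the settings of the implementation actually used.

```python
import numpy as np
import cv2

N_FRAMES = 5             # number of consecutive frames used for the variance estimate
MOTION_THRESHOLD = 5.0   # threshold on the temporal intensity variance (8-bit gray-scale)

def label_interest_points(frames):
    """Detect corner-like POI in the latest frame and label each one as
    'motion' or 'static' from its temporal intensity variance.

    `frames` is a list of N_FRAMES consecutive 8-bit gray-scale images of
    identical size; the Harris-based detector settings below are illustrative.
    """
    stack = np.stack(frames).astype(np.float32)   # (N_FRAMES, H, W)
    variance = stack.var(axis=0)                  # per-pixel temporal intensity variance

    corners = cv2.goodFeaturesToTrack(frames[-1], maxCorners=500, qualityLevel=0.01,
                                      minDistance=5, useHarrisDetector=True)
    if corners is None:
        return []

    labelled = []
    for u, v in corners.reshape(-1, 2):
        sigma_t = variance[int(round(v)), int(round(u))]
        label = "motion" if sigma_t > MOTION_THRESHOLD else "static"
        labelled.append(((int(round(u)), int(round(v))), label))
    return labelled
```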

An example of a visual observation set for a visual scene containing three persons is given in Figure 2.4. The points f in the Cyclopean image space (on the right) are obtained through stereo matching of POI in the left and right images. Their reconstructions s in the scene space (on the left) are found by applying the inverse mapping F^{-1}. The point colours are computed from the d or z coordinates in the Cyclopean image space or the scene space respectively.
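The disparities in our experiments come from the method of [Hansard 2007]; the naive sketch below only illustrates how, for a rectified pair, a disparity could be associated with an interest point by a normalised cross-correlation search along the corresponding image row (window size and search range are arbitrary choices, and the function is hypothetical rather than part of the actual pipeline).

```python
import numpy as np

def match_disparity(left, right, u, v, window=7, max_disparity=64):
    """Naive NCC search along the epipolar line (row v) of a rectified image pair.

    Returns the integer disparity d maximising the normalised cross-correlation
    between a window around (u, v) in the left image and candidate windows in
    the right image, or None if the point is too close to the image border.
    """
    h = window // 2
    if not (h <= v < left.shape[0] - h and h <= u < left.shape[1] - h):
        return None

    def patch(img, cu):
        p = img[v - h:v + h + 1, cu - h:cu + h + 1].astype(np.float32)
        return (p - p.mean()) / (p.std() + 1e-6)

    ref = patch(left, u)
    best_d, best_score = None, -np.inf
    for d in range(0, min(max_disparity, u - h) + 1):   # candidate shifts along the row
        score = float((ref * patch(right, u - d)).mean())
        if score > best_score:
            best_d, best_score = d, score
    return best_d
```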

The implementation of the visual feature detection algorithm was kindly provided by Miles Hansard, a member of the PERCEPTION team at the INRIA research institute, France.