To cite this version: Jianxiong Xiao, Tian Fang, Peng Zhao, Maxime Lhuillier, Long Quan. Image-based Street-side City Modeling. ACM Transactions on Graphics 28, 5 (2009), article 114. DOI: 10.1145/1618452.1618460. HAL: hal-01635654 (https://hal.archives-ouvertes.fr/hal-01635654).

Image-based Street-side City Modeling

Jianxiong Xiao Tian Fang Peng Zhao Maxime Lhuillier Long Quan

The Hong Kong University of Science and Technology    LASMEA - Université Blaise Pascal

Figure 1: Two close-ups of parts 1 and 2 of a modeled city area, shown in the first two rows. All models are automatically generated from input images, exemplified by the bottom row. The close-up of part 3 is shown in Figure 15.

Abstract

We propose an automatic approach to generate street-side 3D photo-realistic models from images captured along the streets at ground level. We first develop a multi-view semantic segmentation method that recognizes and segments each image at the pixel level into semantically meaningful areas, each labeled with a specific object class, such as building, sky, ground, vegetation and car. A partition scheme is then introduced to separate buildings into independent blocks using the major line structures of the scene. Finally, for each block, we propose an inverse patch-based orthographic composition and structure analysis method for façade modeling that efficiently regularizes the noisy and missing reconstructed 3D data. Our system has the distinct advantage of producing visually compelling results by imposing strong priors of building regularity. We demonstrate the fully automatic system on a typical city example to validate our methodology.

Keywords: Image-based modeling, street view, street-side, building modeling, façade modeling, city modeling, 3D reconstruction.

1 Introduction

Current models of cities are often obtained from aerial images, as demonstrated by the Google Earth and Microsoft Virtual Earth 3D platforms. However, methods using aerial images cannot produce photo-realistic models at ground level. As a transitional solution, Google Street View, Microsoft Live Street-Side and the like display the captured 2D panorama-like images from fixed viewpoints. Obviously, this is insufficient for applications that require true 3D photo-realistic models to enable user interaction with a 3D environment. Researchers have proposed many methods to generate 3D models from images. Unfortunately, the interactive methods [Debevec et al. 1996; Müller et al. 2007; Xiao et al. 2008; Sinha et al. 2008] typically require significant user interaction, which cannot easily be deployed in large-scale modeling tasks; the automatic methods [Pollefeys et al. 2008; Cornelis et al. 2008; Werner and Zisserman 2002] focus on the early stages of the modeling pipeline and have not yet been able to produce regular meshes for buildings.

1.1 Related work

There is a large literature on image-based city modeling. We review several studies according to the input images and the required user interaction, without being exhaustive.

Single-view methods. Oh et al. [2001] presented an interactive system to create models from a single image by manually assigning depth based on a painting metaphor.


Figure 2: Overview of our automatic street-side modeling approach: input images; 3D points and lines from SFM; recognition and segmentation; block partition; composition; regularization; 3D model; city models.

Müller et al. [2007] relied on repetitive patterns to discover façade structures, and obtained depth from manual input. Generally, these methods need intensive user interaction to produce visually pleasing results. Hoiem et al. [2005] proposed surface layout estimation for modeling. Saxena et al. [2009] learned the mapping between image features and depth directly. Barinova et al. [2008] made use of the Manhattan structure of man-made buildings to divide the model fitting problem into chain graph inference. However, these approaches can only produce a rough shape for the modeled objects, without much detail.

Interactive multi-view methods. Façade, developed by Debevec et al. [1996], is a seminal work in this category. They used line segment features in images and polyhedral blocks as 3D primitives to interactively register images and to reconstruct blocks with view-dependent texture mapping. However, the required manual selection of features and correspondences in different views is tedious, which makes it difficult to scale up as the number of images grows. Van den Hengel et al. [2007] used a sketching approach in one or more images to model a general object, but it is difficult to use this approach for detailed modeling even with intensive interaction. Xiao et al. [2008] proposed to approximate the orthographic image by a fronto-parallel reference image for each façade during automatic detail reconstruction and interactive user refinement. Their approach therefore requires an accurate initialization and boundary for each façade as input, possibly specified manually by the user. Sinha et al. [2008] used registered multiple views and extracted the major directions by vanishing points. The significant user interaction required by these two methods for good results makes them difficult to adopt in large-scale city modeling applications.

Automatic multi-view methods. Dick et al. [2004] developed an architectural modeling method for short image sequences. The user is required to provide intensive architectural rules for the Bayesian inference. Many researchers realized the importance of line features in man-made scenes. Werner and Zisserman [2002] used line segments for building reconstruction from images registered by sparse points. Schindler et al. [2006] proposed the use of line features for both structure from motion and modeling. However, line features tend to be sparse and geometrically less stable than points. In our work, reconstructed 3D lines are used to partition a reconstructed sequence into independent building blocks, which are then individually modeled, and to align building blocks regularly. A more systematic approach to modeling urban environments using video cameras has been investigated by several teams [Pollefeys et al. 2008; Cornelis et al. 2008]. They have been very successful in developing real-time video registration and focused on the global reconstruction of dense stereo results from the registered images. Our approach does not try to obtain a dense stereo reconstruction, but focuses on the identification and modeling of buildings from the semi-dense reconstructions. Finally, the work by Zebedin et al. [2006] is representative of city modeling from aerial images, which is complementary to our approach from street-level images.

1.2 Overview

We propose in this paper an automatic approach to reconstruct 3D models of buildings and façades from street-side images. The image sequence is reconstructed using a structure from motion algorithm to produce a set of semi-dense points and camera poses.

Approach From the reconstructed sequence of input images, there are three major stages of city modeling. First, in Section 3, each input image is segmented per pixel by a supervised learning method into semantically meaningful regions labeled as building, sky, ground, vegetation and car. The classified pixels are then optimized across multiple registered views to produce a coherent semantic segmentation. Then, in Section 4, the whole sequence is partitioned into building blocks that can be modeled independently. The coordinate frame is further aligned with the major orthogonal directions of each block. Finally, in Section 5, we propose an inverse orthographic composition and shape-based analysis method that efficiently regularizes the missing and noisy 3D data with strong architectural priors.

Interdependency Each step provides helpful information for later steps. The semantic segmentation in Section 3 helps remove line segments that lie outside recognized building regions in Section 4, and it identifies the occluding regions of the façade by filtering out non-building 3D points for the inverse orthographic depth and texture composition in Section 5.1. After the semantic segmentation results are mapped from the input image space to the orthographic space in Section 5.2, they are the most important cues for boundary regularization, especially the upper boundary optimization in Section 5.4. When the model is produced, the texture re-fetching from input images in Section 6.2 can also use the segmented regions to filter out occluding objects. On the other hand, the block partition in Section 4 provides a way to divide the data to the block level for further processing, and also gives accurate boundaries for the façades.

Assumption For the city and input images, our approach only assumes that building façades have two major directions, vertical and horizontal, which is true for most buildings except some special landmarks. We use generic features from filter banks to train a recognizer from examples, optimized across multiple views, to recognize sky regions without a blue-sky assumption. The final boundary between the sky and the buildings is robustly regularized by optimization. Buildings may be attached together or separated by a long distance, as the semantic segmentation can indicate the presence and absence of buildings. It is unnecessary to assume that buildings are perpendicular to the ground plane in our approach, as buildings are automatically aligned to the reconstructed vertical line direction.

Contribution Our approach contributes to the state of the art in the following ways.

Figure 3: Preprocessing. Reconstructed 3D points and vertical lines (red).

The first is a supervised multi-view semantic segmentation that recognizes and segments each input street-side image into areas according to different object classes of interest. The second is a systematic partition scheme to separate buildings into independent blocks using the major man-made line structures of the scene. The third is a façade structure analysis and modeling method to produce visually pleasing building models automatically. The last is a robust automatic system assembled from all these reliable components.

2 Preprocessing

The street-side images are captured by a camera mounted on a vehicle moving along the street and facing the building façades. The vehicle is equipped with a GPS/INS (Inertial Navigation System) that is calibrated with the camera system.

Points The structure from motion for a sequence of images is now standard [Hartley and Zisserman 2004]. We use a semi-dense structure from motion [Lhuillier and Quan 2005] in our current implementation to automatically compute semi-dense point clouds and camera positions. The advantage of the quasi-dense approach is that it provides a sufficient density of points that are globally and optimally triangulated in a bundle-like approach. The availability of pose data from GPS/INS per view further improves the robustness of structure from motion and facilitates the large-scale modeling task. In the remainder of the paper, we assume that a reconstructed sequence is a set of semi-dense reconstructed 3D points and a set of input images with registered camera poses.

Lines Canny edge detection [Canny 1986] is performed on each image, and connected edge points are linked together to form line segments. We then identify two groups of line segments: vertical ones and horizontal ones. The grouping [Sinha et al. 2008] is carried out by checking whether they go through a common vanishing point using a RANSAC method. Since we have semi-dense point matching information between each pair of images from the previous SFM computation, the matching of the detected line segments can be obtained. The pair-wise matching of line segments is then extended to the whole sequence. As the camera is moving laterally on the ground, it is difficult to reconstruct the horizontal lines in 3D space due to the lack of horizontal parallax. Therefore, we only reconstruct vertical lines that can be tracked over more than three views. Finally, we keep the 3D vertical lines whose directions are consistent with each other inside a RANSAC framework, and remove the remaining outlier vertical lines.
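As an illustration of the grouping step, the following minimal Python sketch estimates a vanishing point from 2D line segments with RANSAC and collects the segments consistent with it. It is not the authors' code; the segment representation (endpoint pairs) and the 2-degree consistency threshold are assumptions.

# Minimal sketch (not the authors' code) of grouping line segments by a common
# vanishing point with RANSAC. Endpoint pairs and the 2-degree threshold are assumed.
import numpy as np

def to_homogeneous_line(p, q):
    # Line through image points p, q as a homogeneous 3-vector (cross product).
    return np.cross(np.append(p, 1.0), np.append(q, 1.0))

def consistent_with_vp(p, q, vp, thresh_deg=2.0):
    # A segment supports the vanishing point if its direction is close to the
    # direction from its midpoint towards the vanishing point.
    mid = 0.5 * (p + q)
    d_seg = (q - p) / (np.linalg.norm(q - p) + 1e-9)
    d_vp = (vp - mid) / (np.linalg.norm(vp - mid) + 1e-9)
    ang = np.degrees(np.arccos(np.clip(abs(d_seg @ d_vp), -1.0, 1.0)))
    return ang < thresh_deg

def ransac_vanishing_point(segments, iters=500, rng=np.random.default_rng(0)):
    # segments: list of (p, q) endpoint pairs, each a length-2 numpy array.
    best_vp, best_inliers = None, []
    for _ in range(iters):
        i, j = rng.choice(len(segments), size=2, replace=False)
        l1 = to_homogeneous_line(*segments[i])
        l2 = to_homogeneous_line(*segments[j])
        vp_h = np.cross(l1, l2)
        if abs(vp_h[2]) < 1e-9:          # vanishing point at infinity; skip
            continue
        vp = vp_h[:2] / vp_h[2]
        inliers = [k for k, (p, q) in enumerate(segments)
                   if consistent_with_vp(p, q, vp)]
        if len(inliers) > len(best_inliers):
            best_vp, best_inliers = vp, inliers
    return best_vp, best_inliers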

3 Building segmentation

For a reconstructed sequence of images, we are interested in recognizing and segmenting the building regions in all images. First, in Section 3.1, we train discriminative classifiers to learn the mapping from features to object classes. Then, in Section 3.2, multiple-view information is used to improve the segmentation accuracy and consistency.

3.1 Supervised class recognition

We first train a pixel-level classifier from a labeled image database to recognize and distinguish five object classes, including building, sky, ground, vegetation and car.

Feature To characterize the image feature, we use textons, which have proved effective in categorizing materials and general object classes [Winn et al. 2005]. A 17-dimensional filter bank, including 3 Gaussians, 4 Laplacians of Gaussian (LoG) and 4 first-order derivatives of Gaussians, is used to compute the response on both training and testing images at the pixel level. The textons are then obtained as the centroids of K-means clustering on the responses of the filter bank. Since nearby images in the testing sequence are very similar, to save computation time and memory space we do not run the texton clustering over the whole sequence; we currently pick only one out of every six images for obtaining the clustered textons. After the textons are identified, the texture-layout descriptor [Shotton et al. 2009] is adopted to extract the features for classifier training, because it has proved successful in recognizing and segmenting images of general classes. Each dimension of the descriptor corresponds to a pair [r, t] of an image region r and a texton t. The region r, relative to a given pixel location, is a rectangle chosen at random within a rectangular window of ±100 pixels. The response v_{[r,t]}(i) at pixel location i is the proportion of pixels under the offset region r + i that have the texton t, i.e. v_{[r,t]}(i) = \sum_{j \in (r+i)} [T_j = t] / size(r).

Classifier We employ the Joint Boost algorithm [Torralba et al. 2007; Shotton et al. 2009], which iteratively selects discriminative texture-layout filters as weak learners and combines them into a strong classifier of the form H(l, i) = \sum_m h_i^m(l). Each weak learner h_i(l) is a decision stump based on the response v_{[r,t]}(i), of the form

h_i(l) = a [v_{[r,t]}(i) > \theta] + b if l \in C, and h_i(l) = k_l if l \notin C.

For those classes that share the feature (l \in C), the weak learner gives h_i(l) \in \{a + b, b\} depending on the comparison of the feature response to a threshold \theta. For classes not sharing the feature (l \notin C), the constant k_l ensures that unequal numbers of training examples of each class do not adversely affect the learning procedure. We use sub-sampling and random feature selection techniques for the iterative boosting [Shotton et al. 2009]. The estimated confidence value can be reinterpreted as a probability distribution using the softmax transformation:

P_g(l, i) = \exp(H(l, i)) / \sum_k \exp(H(k, i)).
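To make the classifier concrete, here is a minimal Python sketch of a shared decision stump and the softmax normalization above. The data layout (per-pixel responses, stump parameters as dictionaries) and the toy numbers are assumptions for illustration, not the authors' implementation.

# Sketch of a Joint-Boost-style shared decision stump and softmax, under assumed data layout.
import numpy as np

def weak_learner(v, theta, a, b, k, shared):
    # v: feature response v_[r,t](i); returns per-class scores h_i(l).
    # 'shared' is the set C of classes sharing this feature; 'k' maps the
    # remaining classes to their constants k_l.
    return {l: (a * float(v > theta) + b) if l in shared else k[l]
            for l in set(shared) | set(k)}

def strong_classifier(responses, stumps):
    # Sum the weak learners H(l, i) = sum_m h_i^m(l) for one pixel.
    H = {}
    for v, stump in zip(responses, stumps):
        for l, h in weak_learner(v, **stump).items():
            H[l] = H.get(l, 0.0) + h
    return H

def softmax_distribution(H):
    # P_g(l, i) = exp(H(l, i)) / sum_k exp(H(k, i))
    labels = list(H)
    z = np.exp(np.array([H[l] for l in labels]))
    return dict(zip(labels, z / z.sum()))

# Toy usage with two stumps over the classes {building, sky, ground}.
stumps = [dict(theta=0.3, a=1.5, b=-0.2, k={"ground": -0.1}, shared={"building", "sky"}),
          dict(theta=0.6, a=0.8, b=0.1, k={"sky": -0.3}, shared={"building", "ground"})]
print(softmax_distribution(strong_classifier([0.45, 0.7], stumps)))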

For performance and speed, the classifier is not trained on the full labeled data set, which may be huge. We choose a subset of labeled images that are closest to the given testing sequence to train the classifier, in order to guarantee the learning of reliable and transferable knowledge. We use the gist descriptor [Oliva and Torralba 2006] to characterize the distance between an input image and a labeled image, because this descriptor has been shown to work well for retrieving images of similar scenes in semantics, structure, and geo-location. We create a gist descriptor for each image with a 4 by 4 spatial resolution, where each bin contains the average response to steerable filters at 4 scales with 8, 8, 4 and 4 orientations respectively in that image region.

Figure 4: Recognition and segmentation. (a) One input image. (b) The over-segmented patches. (c) The recognition per pixel. (d) The segmentation.

After the distances between the labeled images and the input images of the testing sequence are computed, we choose the 20 closest labeled images from the database as the training data for the sequence by nearest-neighbor classification.

Location prior A camera is usually kept approximately upright during capture. It is therefore possible to learn approximate location priors for each object class. In a street-side image, for example, the sky always appears in the upper part of the image, the ground in the lower part, and the buildings in between. Thus, we can use the labeled data to compute the accumulated frequencies of the different object classes, P_l(l, i). Moreover, the camera moves laterally along the street when capturing street-side images, so pixels at the same height in the image space should have the same chance of belonging to the same class. With this observation, we only need to accumulate the frequencies along the vertical direction of the image space.

3.2 Multi-view semantic segmentation

The per-pixel recognition produces a semantic segmentation of each input image, but this segmentation is noisy and needs to be optimized in a coherent manner for the entire reconstructed sequence. Since the testing sequence has been reconstructed by SFM, we utilize the point matching information among multiple views to impose segmentation consistency.

Graph topology Each image I_i is first over-segmented using the method of [Felzenszwalb and Huttenlocher 2004]. Then we build a graph G_i = ⟨V_i, E_i⟩ on the over-segmentation patches of each image. Each vertex v ∈ V_i in the graph is an image patch, or super-pixel, of the over-segmentation, while the edges E_i denote the neighboring relationships between super-pixels. Then, the graphs {G_i} from multiple images of the same sequence are merged into a large graph G by adding edges between two super-pixels that are in correspondence but come from different images. The super-pixels p_i and p_j in images I_i and I_j are in correspondence if and only if there is at least one feature track t = ⟨(x_u, y_u, i), (x_v, y_v, j), ...⟩ with projection (x_u, y_u) lying inside the super-pixel p_i in image I_i and (x_v, y_v) inside p_j in I_j. To limit the graph size, there is at most one edge e_{ij} between any super-pixels p_i and p_j in the final graph G = ⟨V, E⟩, which is shown in Figure 5.
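The following rough Python sketch builds such a merged graph from per-image super-pixel adjacencies and feature tracks; the data layout (lookup tables for super-pixel membership) is an assumption for illustration, not the paper's code.

# Sketch of merging per-image super-pixel graphs into one multi-view graph
# using feature tracks; data layout is hypothetical.
from collections import defaultdict

def build_multiview_graph(intra_edges, superpixel_of, tracks):
    # intra_edges: set of ((img, sp), (img, sp)) neighbor pairs inside each image.
    # superpixel_of[img][(x, y)]: super-pixel id containing a projection.
    # tracks: list of tracks, each a list of (x, y, img) observations.
    edges = set(intra_edges)
    track_count = defaultdict(int)              # |T| per cross-image super-pixel pair
    for track in tracks:
        nodes = {(img, superpixel_of[img][(x, y)]) for (x, y, img) in track
                 if (x, y) in superpixel_of[img]}
        nodes = sorted(nodes)
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                if nodes[a][0] != nodes[b][0]:   # only across different images
                    track_count[(nodes[a], nodes[b])] += 1
    # At most one edge per super-pixel pair; keep the track multiplicity as a weight.
    edges |= set(track_count)
    return edges, dict(track_count)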

Adaptive features For object segmentation with fine boundaries, we prefer to use color cues to characterize the local appearance. In our model, the color distribution of all pixels in the image is approximated by a mixture of m Gaussians in color space with means u_k and covariances Σ_k. At the beginning, all pixel colors in all images of the same sequence are taken as input data points, and K-means is used to initialize a mixture of 512 Gaussians in RGB space. Let γ_{kl} denote the probability that the k-th Gaussian belongs to class l. The probability of vertex p_i having label l is

P_a(l, i) = \sum_{k=1}^{m} γ_{kl} N(c_i | u_k, Σ_k).

To compute γ, the probability P_g(l, i) is first used alone in a greedy way to obtain an initial segmentation {l_i}, as shown in Figure 4(c). This initial segmentation {l_i} is then used to train a maximum-likelihood estimate of γ from

γ_{kl} ∝ \sum_{p_i \in V} [l_i = l] \, p(c_i | u_k, Σ_k)

under the constraint \sum_k γ_{kl} = 1. Now, combining the costs from both the local adaptive feature and the global feature, we define the data cost as

ψ_i(l_i) = − log P_a(l, i) − λ_l log P_l(l, i) − λ_g log P_g(l, i).

Figure 5: Graph topology for multi-view semantic segmentation.

Smoothing terms For an edge e_{ij} ∈ E_k within the same image I_k, the smoothness cost is

ψ_{ij}(l_i, l_j) = [l_i ≠ l_j] · g(i, j)

with g(i, j) = 1 / (ζ ||c_i − c_j||_2 + 1), where ||c_i − c_j||_2 is the L2-norm of the RGB color difference of the two super-pixels p_i and p_j. Note that the indicator [l_i ≠ l_j] captures the gradient information only along the segmentation boundary; in other words, ψ_{ij} penalizes the assignment of different labels to adjacent nodes. For an edge e_{ij} ∈ E across two images, the smoothness cost is

ψ_{ij}(l_i, l_j) = [l_i ≠ l_j] · λ |T| g(i, j),

where T = {t = ⟨(x_u, y_u, i), (x_v, y_v, j), ...⟩} is the set of all feature tracks with projection (x_u, y_u) inside the super-pixel p_i in image I_i and (x_v, y_v) inside p_j in I_j. This definition favors two super-pixels with more matching tracks having the same label, as the cost of assigning them different labels is higher when |T| is larger.

Optimization With the constructed graph G = ⟨V, E⟩, the labeling problem is to assign a unique label l_i to each node p_i ∈ V. The solution L = {l_i} can be obtained by minimizing the Gibbs energy [Geman and Geman 1984]:

E(L) = \sum_{p_i \in V} ψ_i(l_i) + ρ \sum_{e_{ij} \in E} ψ_{ij}(l_i, l_j).

Since the cost terms we define satisfy the metric requirement, graph-cut alpha expansion [Boykov et al. 2001] is used to obtain a locally optimized label configuration L within a constant factor of the global minimum.
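As a hedged illustration only, the Python sketch below evaluates this Gibbs energy for a candidate labeling (an alpha-expansion solver would then minimize it); the probability tables, λ weights and ρ are placeholders, not values from the paper.

# Sketch: evaluate the Gibbs energy E(L) for a given labeling.
import math

def data_cost(l, i, Pa, Pl, Pg, lam_l=1.0, lam_g=1.0, eps=1e-12):
    # psi_i(l_i) = -log Pa - lambda_l * log Pl - lambda_g * log Pg
    # Pa, Pl, Pg are probability tables indexed as P[label][node] (assumed layout).
    return (-math.log(Pa[l][i] + eps)
            - lam_l * math.log(Pl[l][i] + eps)
            - lam_g * math.log(Pg[l][i] + eps))

def smooth_cost(li, lj, g_ij, n_tracks=0, lam=1.0):
    # Zero when labels agree; contrast-weighted penalty otherwise.
    # Cross-image edges scale with the number of matching tracks |T|.
    if li == lj:
        return 0.0
    return g_ij * (lam * n_tracks if n_tracks > 0 else 1.0)

def gibbs_energy(labels, nodes, edges, Pa, Pl, Pg, rho=1.0):
    # labels: dict node -> label; edges: dict (i, j) -> (g_ij, n_tracks)
    E = sum(data_cost(labels[i], i, Pa, Pl, Pg) for i in nodes)
    E += rho * sum(smooth_cost(labels[i], labels[j], g, t)
                   for (i, j), (g, t) in edges.items())
    return E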

Figure 6: Building block partition. Different blocks are shown in different colors.

4 Block partition

The reconstructed sequence needs to be partitioned into independent building blocks so that each block can be modeled individually. However, the definition of a building block is not unique, in that a block may contain a fraction of, or any number of, physical buildings as long as they share a common dominant plane. As an urban scene is characterized by plenty of man-made structures with vertical and horizontal lines, we use the vertical lines to partition the sequence into blocks because they are stable and distinct separators for our purpose. Moreover, a local alignment process is used to place building blocks regularly. Such local alignment makes the analysis and implementation of the model reconstruction algorithm more straightforward.

4.1 Global vertical alignment

We first remove the line segments that project outside the building regions segmented in the previous section. From all the remaining vertical line segments, we compute the global vertical direction of gravity by taking the median direction of all reconstructed 3D vertical lines found during the preprocessing stage in Section 2. Then, we align the y-axis of the coordinate system of the reconstructed sequence with the estimated vertical direction.

4.2 Block separator

To separate the entire scene into natural building blocks, we find that vertical lines are an important cue for block separation, but there are too many of them, which tends to yield an over-partition. Therefore, we need to choose a subset of the vertical lines as block separators using the following heuristics.

Intersection A vertical line segment is a block separator if its extended line does not meet any horizontal line segments within the same façade. This heuristic is only true if the façade is flat. We compute a score for each vertical line segment L by accumulating the number of intersections with all horizontal line segments in each image, N(L) = (1/m) \sum_{k=1}^{m} Γ_k, where Γ_k is the number of intersections in the k-th image and m is the number of correspondences of the line L in the sequence.

Height A vertical line is a potential block separator if its left block and right block have different heights. We calculate the height for its left and right blocks as follows: (1) In every image where the vertical line is visible, we fit two line segments to the upper and lower boundary of the corresponding building region respectively. (2) A height is estimated as the Euclidean distance of the mid-points of these two best-fit line segments. (3) The height of a block is taken as the median of all its estimated heights in corresponding images.

Texture Two buildings often have different textures, so a good block separator is one that gives very different texture distributions for its left and right blocks. For each block, we build an average color histogram h_0 from multiple views. To make use of the spatial information, each block is further downsampled t − 1 times to compute several normalized color histograms h_1, ..., h_{t−1}. Thus, each block corresponds to a vector of multi-resolution histograms. The dissimilarity of the two histogram vectors h^left and h^right of neighboring blocks is defined as D_t(h^left, h^right) = (1/t) \sum_{k=0}^{t−1} d(h^left_k, h^right_k), where d(·, ·) is the Kullback-Leibler divergence.

With these heuristics, we first sort all vertical lines by increasing number of intersections N and retain the first half of the vertical lines. Then, we select all vertical lines that give more than a 35% height difference between their left and right blocks. After that, we sort the remaining vertical lines again, now by decreasing texture difference D_t, and we select only the first half of them. This selection procedure is repeated until each block is within a pre-defined range (from 6 to 30 meters in the current implementation). A sketch of this loop is given after this section.

Algorithm 1 Inverse Orthographic Patching (used in Section 5.1)
1: for each image I_k visible to the façade do
2:   for each super-pixel p_i ∈ I_k do
3:     if the normal direction of p_i is parallel to the z-axis then
4:       for each pixel (x, y) in the bounding box do
5:         X ← (x, y, z_i)^T, where z_i is the depth of p_i
6:         compute the projection (u, v) of X into the camera of I_k
7:         if the super-pixel index of (u, v) in I_k equals i then
8:           accumulate depth z_i, color and segmentation label at (x, y)
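The text above leaves the exact iteration loose, so the following Python sketch is only one possible reading of the separator-selection loop; the candidate record fields, the reference height for the 35% test, and the stopping check are assumptions.

# Schematic sketch of the separator-selection heuristics; data layout is hypothetical.
def select_separators(candidates, min_len=6.0, max_len=30.0):
    # candidates: list of dicts with keys:
    #   'x'         horizontal position of the vertical line (meters)
    #   'n_inter'   N(L), mean number of horizontal-line intersections
    #   'h_left', 'h_right'  estimated block heights on each side
    #   'd_texture' D_t, multi-resolution histogram dissimilarity (KL based)
    chosen = []
    pool = list(candidates)
    while pool:
        pool.sort(key=lambda c: c['n_inter'])            # fewest intersections first
        pool = pool[:max(1, len(pool) // 2)]
        pool = [c for c in pool
                if abs(c['h_left'] - c['h_right'])
                   > 0.35 * max(c['h_left'], c['h_right'])]
        pool.sort(key=lambda c: -c['d_texture'])         # most dissimilar textures first
        pool = pool[:max(1, len(pool) // 2)]
        if not pool:
            break
        chosen.append(pool.pop(0))
        # Stop once every resulting block length falls inside [min_len, max_len].
        xs = sorted(c['x'] for c in chosen)
        gaps = [b - a for a, b in zip(xs, xs[1:])]
        if gaps and all(min_len <= g <= max_len for g in gaps):
            break
    return sorted(chosen, key=lambda c: c['x'])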

4.3 Local horizontal alignment

After the global vertical alignment of the y-axis, the desired façade plane of a block is vertical, but it may not be parallel to the xy-plane of the coordinate frame. We automatically compute the vanishing point of the horizontal lines in the most fronto-parallel image of the block sequence to obtain a rotation around the y-axis that aligns the x-axis with the horizontal direction. Note that this is done locally for each block if there are sufficient horizontal lines in the chosen image. After these operations, each independent façade faces the negative z-axis, with the x-axis as the horizontal direction from left to right and the y-axis as the vertical direction from top to bottom in its local coordinate system.

5 Façade modeling

Since the semantic segmentation has identified the regions of interest, and the block partition has separated the data to the façade level, the remaining task is to model each façade. The reconstructed 3D points are often noisy or missing due to varying texture as well as matching and reconstruction errors. Therefore, we introduce a building regularization method in the orthographic view of the façade for structure analysis and modeling. We first filter out the irrelevant 3D points using the semantic segmentation and the block separators. An orthographic depth map and texture image are composed from multiple views in Section 5.1, and provide the working image space for later stages. In Section 5.2, the structure elements on each façade are identified and modeled. When the identification of structure elements is not perfect, a backup solution is introduced in Section 5.3 to rediscover these elements if they appear repetitively. Finally, after the details inside each façade have been modeled, the boundaries of the façade are regularized in Section 5.4 to produce the final model.

Figure 7: Inverse orthographic composition. (a) Depth map in input image space. (b) Partial orthographic depth map from one view. (c) Partial orthographic texture from one view. (d) Composed orthographic depth map (unreliably estimated pixels are in yellow). (e) Composed orthographic texture. (f) Composed orthographic building region.

5.1 Inverse orthographic composition

Each input image of the building block is over-segmented into patches using [Felzenszwalb and Huttenlocher 2004]. The patch size is a trade-off between accuracy and robustness. We choose 700 pixels as the minimum patch size for our images at a resolution of 640 × 905 pixels, favoring relatively large patches since the 3D points reconstructed from the images are noisy.

Patch reconstruction The normal vector and center position of each patch p_i are estimated from the set of 3D points P_i = {(x_k, y_k, z_k)} whose projections lie inside p_i. As the local coordinate frame of the block is aligned with the three major orthogonal directions of the building, the computation is straightforward. Let σ_x^i, σ_y^i and σ_z^i be the standard deviations of all 3D points in P_i along the three directions. We first compute the normalized standard deviations \hat{σ}_x^i = σ_x^i \bar{s}_x / s_x^i and \hat{σ}_y^i = σ_y^i \bar{s}_y / s_y^i, where s_x^i and s_y^i are the horizontal and vertical sizes of the bounding box of the patch in the input images, and \bar{s}_x = median_i s_x^i and \bar{s}_y = median_i s_y^i are their medians across all patches. The normalization avoids a bias toward small patches. The patch p_i is regarded as parallel to the façade base plane if σ_z^i is smaller than \hat{σ}_x^i and \hat{σ}_y^i, and all such parallel patches with small σ_z^i contribute to the composition of an orthographic view of the façade. The orientation of such a patch p_i is aligned with the z-axis, and its position is set at the depth z_i = median_{(x_j, y_j, z_j) \in P_i} z_j. One example is shown in Figure 7(a).

Orthographic composition To simplify the representation of the irregular shapes of the patches, we deploy a discrete 2D orthographic space on the xy-plane to create an orthographic view O of the façade. The size and position of O on the xy-plane are determined by the bounding box of the 3D points of the block, and the resolution of O is a parameter that is set not to exceed 1024 × 1024. Each patch is mapped from its original image space onto this orthographic space, as illustrated in Figure 7 from (a) to (b). We use the inverse orthographic mapping algorithm shown in Algorithm 1 to avoid gaps. In theory, the warped textures of all patches create a true orthoimage O, as each used patch has a known depth and is parallel to the base plane.

For each pixel v_i of the orthoimage O, we accumulate a set of depth values {z_j}, a corresponding set of color values {c_j} and a set of segmentation labels {l_j}. The depth of this pixel is set to the median of {z_j}, whose index is κ = arg median_j z_j. Since the depth determines the texture color and segmentation label, we take c_κ and l_κ as the estimated color and label for the pixel. In practice, we accept a small set of estimated points around z_κ and take their mean as the color value in the texture composition. As the contents of the images overlap heavily, if a pixel is observed only once from one image, it very likely comes from an incorrect reconstruction, and it is therefore rejected in the depth fusion process. Moreover, all pixels {v_i} with multiple observations {z_j}_i are sorted in non-decreasing order of the standard deviation ς_i = sd({z_j}) of their depth sets. After that, we define ς(η) to be the η|{v_i}|-th element of the sorted {ς_i}, and declare the pixel v_i unreliable if ς_i > ς(η). The value of η comes from the estimated confidence of the depth measurements; we currently scale it by the ratio of the number of 3D points to the total number of pixels in O.
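A simplified Python sketch of this per-pixel fusion is given below; it keeps the median-depth selection, the rejection of single observations and the η-quantile reliability test, but the data layout and the fixed η value are assumptions rather than the paper's implementation.

# Sketch of per-pixel inverse orthographic fusion (median depth, reliability test).
import numpy as np

def fuse_orthographic_pixels(samples, eta=0.9):
    # samples: dict pixel -> list of (depth, color, label) gathered by Algorithm 1.
    fused, spread = {}, {}
    for px, obs in samples.items():
        if len(obs) < 2:                       # observed only once: likely an outlier
            continue
        depths = np.array([d for d, _, _ in obs])
        order = np.argsort(depths)
        kappa = order[len(order) // 2]         # index of the median depth
        fused[px] = obs[kappa]                 # (z_kappa, c_kappa, l_kappa)
        spread[px] = float(np.std(depths))
    if not spread:
        return fused, set()
    cutoff = np.quantile(np.array(list(spread.values())), eta)
    unreliable = {px for px, s in spread.items() if s > cutoff}
    return fused, unreliable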

Note that when we reconstruct the patches, we do not use the semantic segmentation results in the input image space, for two reasons. First, the patches used in reconstruction are much larger than those used for semantic segmentation, which may lead to inconsistent labeling; though it is possible to estimate a unique label per patch, doing so may degrade the semantic segmentation accuracy. Second, possible errors in the semantic segmentation may over-reject patches, which compromises the quality of the depth estimation. Therefore, we reconstruct the depth first and transfer the segmentation results from the input image space to the orthographic view with pixel-level accuracy, as shown in Figure 7(f). After that, we remove the non-building pixels from the orthoimage according to the segmentation labels. Our composition algorithm for the orthographic depth map is functionally close to depth map fusion techniques such as [Curless and Levoy 1996], but our technique is robust because we use the architectural prior of orthogonality, which preserves structural discontinuities without over-smoothing.

5.2 Structure analysis and regularization

From the composed orthographic depth map and texture image of each façade, we want to identify the structural elements at different depths of the façade to enrich the façade geometry. To cope with the irregular, noisy and missing depth estimates on the façade, strong regularization from architectural priors is required. Most buildings are governed by vertical and horizontal lines and naturally form rectangular shapes. We therefore restrict the prior shape of each distinct structure element to be a rectangle, such as the typical extruding signboard in Figure 7.

Joint segmentation We use a bottom-up, graph-based segmentation framework [Felzenszwalb and Huttenlocher 2004] to jointly segment the orthographic texture and depth maps into regions, where each region is considered a distinct element within the façade. The proposed shape-based segmentation method jointly utilizes texture and depth information and enables fully automatic façade structure analysis. Xiao et al. [2008] also proposed a functionally equivalent top-down recursive subdivision method; however, as shown in [Xiao et al. 2008], it cannot produce satisfactory results without user interaction.

Figure 8: Structure analysis and regularization for modeling. (a) The façade segmentation. (b) The data cost of boundary regularization. The cost is color-coded from high at red to low at blue, via green as the middle. (c) The regularized depth map. (d) The texture-mapped façade. (e) The texture-mapped block. (f) The block geometry.

A graph G = ⟨V, E⟩ is defined on the orthoimage O, with all pixels as vertices V and edges E connecting neighboring pixels. To encourage horizontal and vertical cuts, we use a 4-neighborhood system to construct E. The weight function for an edge connecting two pixels with reliable depth estimates is based on both the color distance and the normalized depth difference,

w((v_i, v_j)) = ||c_i − c_j||_2 · ((z_i − z_j) / ς(η))^2,

where ||c_i − c_j||_2 is the L2-norm of the RGB color difference of the two pixels v_i and v_j. We slightly pre-filter the texture image with a Gaussian of small variance before computing the edge weights. The weight of an edge connecting two pixels without reliable depth estimates is set to 0 to force them to have the same label. We do not construct an edge between a pixel with a reliable depth and a pixel without one, as the weight cannot be defined. We first sort E by non-decreasing edge weight w. Starting from an initial segmentation in which each vertex v_i is in its own component, the algorithm repeats the following process for each edge e_q = (v_i, v_j) in order: if v_i and v_j are in disjoint components C_i ≠ C_j, and w(e_q) is small compared with the internal difference of both components, w(e_q) ≤ MInt(C_i, C_j), then the two components are merged. The minimum internal difference is defined as

MInt(C_1, C_2) = min(Int(C_1) + τ(C_1), Int(C_2) + τ(C_2)),

where the internal difference Int(C) of a component C is the largest weight in the minimum spanning tree of the component,

Int(C) = max_{e \in MST(C, E)} w(e).

The non-negative threshold function τ(C) is defined on each component C: the difference between two components must exceed their internal differences by at least this threshold for there to be evidence of a boundary between them. Since we favor a rectangular shape for each region, the threshold function τ(C) is defined using the divergence ϑ(C) between the component C and a rectangle, which is the ratio of the bounding box B_C to the component C, ϑ(C) = |B_C| / |C|. For small components, Int(C) is not a good estimate of the local characteristics of the data. Therefore, we let the threshold function adapt to the component size,

τ(C) = (ε / |C|) ϑ(C),

where ε is a constant, set to 3.2 in our prototype. τ is large for components that do not fit a rectangle, and two components with large τ are more likely to be merged. A larger ε favors larger components, as we require stronger evidence of a boundary for smaller components.
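The merge test can be sketched in a few lines of Python; note that the exact form of τ(C) is reconstructed from the surrounding text, so the sketch below should be read as an assumption-laden illustration rather than the authors' exact formula.

# Sketch of the rectangle-biased merge test; tau(C) form is a reconstruction/assumption.
def bounding_box_area(pixels):
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)

def rect_divergence(pixels):
    # theta(C) = |B_C| / |C| : 1.0 for a perfectly filled rectangle, larger otherwise.
    return bounding_box_area(pixels) / len(pixels)

def tau(pixels, eps=3.2):
    # Adaptive threshold: larger for small or non-rectangular components.
    return (eps / len(pixels)) * rect_divergence(pixels)

def should_merge(internal_1, pixels_1, internal_2, pixels_2, edge_weight):
    # Merge when the connecting edge is weaker than both components' internal
    # difference plus their adaptive thresholds (the MInt test above).
    m_int = min(internal_1 + tau(pixels_1), internal_2 + tau(pixels_2))
    return edge_weight <= m_int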

Once the segmentation is accomplished, the depth values of all pixels in each reliable component C_i are set to their median, and the depth of the largest region is regarded as the depth of the base plane of the façade. Moreover, an unreliable component C_i smaller than a particular size, i.e. smaller than 4% of the current façade area, is merged into its only reliable neighboring component if such a neighboring component exists.

Shape regularization Except for the base plane of the façade, we fit a rectangle to each element on the façade. For an element C = {v_i = (x_i, y_i)}, we first obtain the median position (x_med, y_med) by x_med = median_i x_i and y_med = median_i y_i. We then remove outlier points with |x_i − x_med| > 2.8 σ_x or |y_i − y_med| > 2.8 σ_y, where σ_x = \sum_i |x_i − x_med| / |C| and σ_y = \sum_i |y_i − y_med| / |C|. Furthermore, we reject the points lying in the outermost 1% regions on the left, right, top and bottom according to their ranking in x and y coordinates within the remaining point set. In this way, we obtain a reliable subset C_sub of C. We define the bounding box B_{C_sub} of C_sub as the fitting rectangle of C. The fitting confidence is then defined as

f_C = |B_{C_sub} ∩ C| / |B_{C_sub} ∪ C|.

In the end, we only retain a rectangle as a distinct façade element if its confidence f_C > 0.72 and the rectangle is not too small. The rectangular elements are automatically snapped to the nearest vertical and horizontal mode positions of the accumulated Sobel responses on the composed texture image if their distances are less than 2% of the width and height of the current façade. The detected rectangles can be nested within each other. When producing the final 3D model, we first pop up the larger element from the base plane and then the smaller element within the larger element. If two rectangles overlap but do not contain each other, we first pop up the one that is closest to the base plane.
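The rectangle fitting above can be sketched as follows in Python; the 2.8 deviation factor, 1% trimming and 0.72 confidence threshold follow the text, while the pixel-set data layout and the grid approximation of the confidence are assumptions.

# Sketch of rectangle fitting with outlier removal and an IoU-style confidence.
import numpy as np

def fit_rectangle(points, k=2.8, trim=0.01, min_conf=0.72):
    pts = np.asarray(points, dtype=float)            # element pixels (x, y)
    x_med, y_med = np.median(pts[:, 0]), np.median(pts[:, 1])
    sig_x = np.mean(np.abs(pts[:, 0] - x_med))
    sig_y = np.mean(np.abs(pts[:, 1] - y_med))
    keep = (np.abs(pts[:, 0] - x_med) <= k * sig_x) & \
           (np.abs(pts[:, 1] - y_med) <= k * sig_y)
    sub = pts[keep]
    lo_x, hi_x = np.quantile(sub[:, 0], [trim, 1 - trim])
    lo_y, hi_y = np.quantile(sub[:, 1], [trim, 1 - trim])
    sub = sub[(sub[:, 0] >= lo_x) & (sub[:, 0] <= hi_x) &
              (sub[:, 1] >= lo_y) & (sub[:, 1] <= hi_y)]
    rect = (sub[:, 0].min(), sub[:, 1].min(), sub[:, 0].max(), sub[:, 1].max())
    # Confidence: |B ∩ C| / |B ∪ C|, approximated on the integer pixel grid.
    pix = set(map(tuple, pts.astype(int)))
    in_rect = {(x, y) for (x, y) in pix
               if rect[0] <= x <= rect[2] and rect[1] <= y <= rect[3]}
    area_rect = (rect[2] - rect[0] + 1) * (rect[3] - rect[1] + 1)
    conf = len(in_rect) / (area_rect + len(pix) - len(in_rect))
    return (rect, conf) if conf > min_conf else (None, conf)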

5.3 Repetitive pattern rediscovery

Structure elements are automatically reconstructed in the previous section. However, when the depth composition quality is poor, due to poor image matching, reflective materials or low image quality, only a few of them can be successfully recovered. For repetitive elements of the façade, we can then systematically launch a rediscovery process that uses the discovered elements as templates in the orthographic texture image domain. The idea of taking advantage of the repetitive nature of the elements has been explored in [Müller et al. 2007; Xiao et al. 2008].

We use the Sum of Squared Differences (SSD) on the RGB channels for template matching. Unlike [Xiao et al. 2008], which operates in a 2D search space, we use a two-step method that searches twice in 1D, as shown in Figure 9. We first search in the horizontal direction for a template B_i and obtain a set of matches by extracting the local minima under a threshold. Then, we use B_i and its matches together as the template to search for local minima along the vertical direction. This leads to more efficient and robust matching, and to automatic alignment of the elements. A rediscovered element found by template matching inherits the depth of the template.
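For illustration, a Python sketch of the two-step 1D SSD search is given below; the image and template are assumed to be floating-point RGB arrays, and the local-minimum extraction and thresholds are simplified stand-ins for the paper's exact procedure.

# Sketch of two-step 1D SSD template matching (horizontal, then vertical).
import numpy as np

def ssd_along_row(image, template, y):
    # SSD of the template against every horizontal position at row offset y.
    h, w = template.shape[:2]
    scores = []
    for x in range(image.shape[1] - w + 1):
        diff = image[y:y + h, x:x + w] - template
        scores.append(float(np.sum(diff * diff)))
    return np.array(scores)

def local_minima(scores, threshold):
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < threshold and scores[i] <= scores[i - 1]
            and scores[i] <= scores[i + 1]]

def rediscover(image, template, y0, threshold):
    # Step 1: search horizontally at the template's own row y0.
    xs = local_minima(ssd_along_row(image, template, y0), threshold)
    # Step 2: use the template and its horizontal matches jointly and slide vertically.
    h, w = template.shape[:2]
    row_scores = []
    for y in range(image.shape[0] - h + 1):
        s = sum(float(np.sum((image[y:y + h, x:x + w] - template) ** 2)) for x in xs)
        row_scores.append(s)
    ys = local_minima(np.array(row_scores), threshold * max(1, len(xs)))
    return [(x, y) for y in ys for x in xs]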

When more than one structure element discovered by the joint segmentation represents the same kind of structure, we also need to cluster the rediscovered elements using a bottom-up hierarchical merging mechanism. Two templates B_i and B_j obtained by joint segmentation, with sets of matching candidates M_i and M_j, are merged into the same class if one template is sufficiently similar to any element among the candidates of the other template. Here, the similarity between two elements is defined as the ratio of the intersection area to the union area of the two elements. The merging process consists of averaging element sizes between M_i ∪ {B_i} and M_j ∪ {B_j}, as well as computing the average positions of overlapping elements in M_i ∪ {B_i} and M_j ∪ {B_j}.

5.4 Boundary regularization

The boundaries of the façade of a block are further regularized to favor sharp changes and penalize serration. We use the same method as for the shape regularization of structure elements to compute the bounding box [x_min, x_max] × [y_min, y_max] of the façade. Finally, we further optimize the upper boundary of the façade, as we cannot guarantee during block partition that a building block is indeed a single building of uniform height.

As illustrated in Figure 10, we lay out a 1D Markov random field along the horizontal direction of the orthoimage. Each x_i ∈ [x_min, x_max] defines a vertex, and an edge is added between two neighboring vertices. The label l_i of x_i corresponds to the position of the boundary, with l_i ∈ [y_min, y_max] for all x_i. Therefore, one label configuration of the MRF corresponds to one façade boundary. We utilize all the texture, depth and segmentation information to define the costs. The data cost is defined according to the horizontal Sobel responses,

φ_i(l_j) = 1 − HorizontalSobel(i, j) / (2 max_{x,y} HorizontalSobel(x, y)).

Furthermore, if l_j is close to the top boundary r_i of the reliable depth map, |l_j − r_i| < β, where β is empirically set to 0.05 (y_max − y_min + 1), we update the cost by multiplying it by (|l_j − r_i| + δ) / (β + δ), with a small constant δ. Similarly, if l_j is close to the top boundary s_i of the segmentation, |l_j − s_i| < β, we update the cost by multiplying it by (|l_j − s_i| + δ) / (β + δ). For façades whose boundaries are not in the viewing field of any input image, we snap the façade boundary to the top boundary of the bounding box and empirically update φ_i(y_min) by multiplying it by 0.8. Figure 8(b) shows an example of the defined data cost.

The height of the façade upper boundary usually changes in regions with strong vertical edge responses.

Figure 9: Repetitive pattern rediscovery. (a) The façade segmentation. (b) Matching results using the violet template in (a).

Figure 10: An example of the MRF used to optimize the façade upper boundary.

Figure 11: Texture optimization. (a) The original orthographic texture image. (b) The optimized texture image. (c) A direct texture composition. The optimized texture image in (b) is clearer than the original orthographic texture image in (a), and has no texture from occluding objects, such as the one contained in (c).

We thus accumulate the vertical Sobel responses at each x_i into V_i = \sum_{y \in [y_min, y_max]} VerSobel(i, y), and define the smoothness term to be

φ_{i,i+1}(l_i, l_{i+1}) = μ |l_i − l_{i+1}| (1 − (V_i + V_{i+1}) / (2 max_j V_j)),

where μ is a controllable parameter.

The boundary is optimized by minimizing a Gibbs energy [Geman and Geman 1984]

E(L) = \sum_{x_i \in [x_min, x_max]} φ_i(l_i) + \sum_{x_i \in [x_min, x_max − 1]} φ_{i,i+1}(l_i, l_{i+1}),

where φ_i is the data cost and φ_{i,i+1} is the smoothing cost. Exact inference with a global optimum can be obtained by methods such as belief propagation [Pearl 1982].
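Because the boundary MRF is a chain, the global optimum can also be found by straightforward dynamic programming (the min-sum counterpart of belief propagation), as in the Python sketch below; the cost arrays are assumed to be precomputed as described above, and the O(nL^2) loop is not optimized.

# Sketch: exact minimization of the 1-D chain energy by dynamic programming.
import numpy as np

def optimize_boundary(data_cost, pair_cost):
    # data_cost:  (n_columns, n_labels) array of phi_i(l)
    # pair_cost:  function (l, l_next, i) -> phi_{i,i+1}(l, l_next)
    n, L = data_cost.shape
    cum = data_cost[0].copy()                       # best cost of a path ending at column 0
    back = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        trans = np.array([[cum[a] + pair_cost(a, b, i - 1) for a in range(L)]
                          for b in range(L)])       # trans[b, a]
        back[i] = trans.argmin(axis=1)
        cum = trans.min(axis=1) + data_cost[i]
    labels = np.empty(n, dtype=int)
    labels[-1] = int(cum.argmin())
    for i in range(n - 1, 0, -1):                   # trace the optimal boundary back
        labels[i - 1] = back[i, labels[i]]
    return labels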

6 Post-processing

After the model for each façade is computed, the mesh is produced, and the texture is optimized.

6.1 Model production

Each façade is the front side of a building block. We can extend a façade in the z-direction into a box with a constant depth (the default is set to 18 meters in the current implementation) to represent the geometry of the building block, as illustrated in Figure 12(f).

All the blocks of a sequence are then assembled into the street-side model. The texture mapping is done with visibility checking using z-buffer ordering. The side face of each block can be automatically textured, as illustrated in Figure 12, if it is not blocked by neighboring buildings.


6.2 Texture optimization

The orthographic texture for each front façade produced by Algorithm 1 is a true orthographic texture map. But as a texture image it suffers from artifacts of color discontinuity, blur and gaps, since each pixel has been independently computed as the color at the median depth over all visible views. However, it does provide very robust and reliable information about the true texture, and contains almost no outliers from occluding objects. Therefore, we re-compute an optimized texture image for each front façade, using the original orthographic texture image as a good reference.

Suppose that each façade has N visible views. Each visible view is used to compute a partial texture image for all visible points of the façade, giving N partial texture images for the façade. Next, we define a difference measurement as the squared sum of differences between each pixel of the partial texture images and the original orthographic texture image at the same coordinate. This is the data term of a Markov random field on the orthographic texture image grid. The smoothing term is defined as the reciprocal of the color difference between each neighboring pair of pixels of the original orthographic texture image. The desired orthographic texture image is computed using graph-cut alpha expansion [Boykov et al. 2001]. If seam artifacts are serious, Poisson blending [Pérez et al. 2003] can be used as a post-process. Figure 11 shows the comparative results of this process. Figure 11(c) also shows a direct texture warping from the most fronto-parallel image as in [Xiao et al. 2008], which fails to remove the occluding objects, i.e. the telegraph pole in this case.

7 Experiment and Discussion

We have implemented our system and tested it on street-side images of downtown Pittsburgh from Google. These images have been used in Google Street View to create seamless panoramic views. Therefore, the same kind of images is currently available for a huge number of cities around the world, captured without online human control and with noise and glare. The image resolution is 640 × 905. The entire sequence of 10,498 images of Pittsburgh is broken down into shorter sequences of roughly 100 consecutive images each. Each sequence is then reconstructed using the structure from motion algorithm to produce a set of semi-dense points and camera poses. The cameras are geo-registered back to the GPS coordinate frame, and the result corresponding to each sequence is geo-registered back to the global earth coordinates using the available GPS data. Since each sequence is not very long, there is no obvious drift effect, and we do not explicitly handle loop closing.

7.1 Implementation details

The implementation is in unoptimized C++ code, and the parameters were manually tuned on a set of 5 façades. The whole system consists of three major components, SFM, segmentation and modeling, and is completely modular. We use the code from [Oliva and Torralba 2006] for gist feature extraction, the code from [Shotton et al. 2009] for Joint Boost classification, the code from [Boykov et al. 2001] for graph-cut alpha expansion in the MRF optimizations, and the code from [Felzenszwalb and Huttenlocher 2004] for the joint graph-based segmentation in the structure analysis and for over-segmentation. Note that after block partition, the façade analysis and modeling component works in a rectified orthographic space, which involves only simple 2D array operations in the implementation. The resulting model is represented by pushed-and-popped rectangles. As shown in Figure 8(f) and Figure 12(e), the parallelepiped-like division is the quadrilateral tessellation of the base plane mesh by a "rectangle map", which is inspired by the trapezoid map algorithm [de Berg et al. 2008]. A rectangle is then properly extruded according to the depth map. The results may seem to contain many "parallelepipeds", while some of them may have the same depth as their neighbors, depending on their reconstructed depth values.

For a portion of Pittsburgh, we reconstructed 202 building blocks from 10,498 images. On a small cluster composed of 15 normal desktop PCs, the results are produced automatically in 23 hours, including approximately 2 hours for SFM, 19 hours for segmentation, and 2 hours for partition and modeling. Figure 12 shows different examples of blocks and the intermediate results. Figure 15 shows a few close-up views of the final model. All the results presented in the paper and in the accompanying video are "as is", without any manual touch-up. For rendering, each building block is represented with two levels of detail: the first level has only the façade base plane, and the second level contains the augmented elements of the façade. For the semantic segmentation, we hand-labeled 173 images, uniformly sampled from our data set, to create the initial database of labeled street-side images. Some example labeled data is shown in the accompanying video. Each sequence is recognized and segmented independently. For testing, we do not use any labeled images that come from the same sequence, in order to fairly demonstrate the real performance on unseen sequences.

Our method is remarkably robust for modeling, as minor errors or failure cases do not create visually disturbing artifacts. The distinct elements such as windows and doors within a façade may not always be reconstructed due to a lack of reliable 3D points; they are often smoothed onto the façade base plane with satisfactory textures, as the depth variation is small. Most of the artifacts come from the texture. Many of the trees and people are not removed from the textures on the first floor of the buildings seen in Figure 15. These could be corrected if interactive segmentation and inpainting were used. There are some artifacts on the façade boundaries when background buildings are not separated from foreground buildings, as shown in the middle of Figure 15. Some other modeling examples are shown in Figure 12. Note that there are places where the tops of the buildings are chopped off, because they are clipped in the input images.

7.2 Comparative studies

Comparison with semi-automatic methods There are several semi-automatic image-based modeling methods [Debevec et al. 1996; Xiao et al. 2008]. The work in [Xiao et al. 2008] is the most recent representative approach targeting single-façade modeling with interactive initialization and editing. Although our method is fully automatic for the entire street-side city modeling pipeline, we can compare the functionally equivalent components: the automatic structure analysis of a given façade in Section 5.2 of this paper and the automatic part of [Xiao et al. 2008] in their Sections 6 and 7. The results are compared in Figure 14. Note that for the comparison, we manually specified the initial planes with accurate boundaries, as shown in Figure 14(g). Since they make the very strong assumption that the depth variation is small for each façade, we had to specify two initial planes due to the large depth difference in this example. The input images were captured by a hand-held DSLR camera at high resolution. For this data set, after a simple normalization of the image space to [−1, +1], we can use exactly the same set of parameters as for the Pittsburgh data set. The results clearly indicate that our façade analysis is more robust and produces superior results before any manual retouching. For repetitive patterns such as windows, they rely on user specification or a trained model to identify the template (the first region) for each façade and then match around to find more occurrences of the template.

Figure 12: Modeling examples of various blocks. (a) The orthographic texture. (b) The orthographic color-coded depth map (yellow pixels are unreliable). (c) The façade segmentation. (d) The regularized depth map. (e) The geometry. (f) The textured model.

Figure 13: A challenging non-rectangular case. (a) Orthographic texture. (b) 3D model. The sloped roof is approximated by a step-shaped structure.

In our approach, since the shape-based joint segmentation can automatically identify several templates, it is unnecessary to specify the template manually or to train a window recognition model. This reduces the manual effort and system complexity, and performs more robustly, since the correct recognition of windows is non-trivial.

Comparison with sensor-rich methods Using 3D scanners for city modeling is definitely an alternative for street-side city modeling [Stamos and Allen 2002; Frueh and Zakhor 2003]. The work in [Frueh and Zakhor 2003] is one of the most recent representative works that generate 3D street-side models from 3D scans captured along the streets at ground level. Additionally, they also capture aerial views for the building tops, which we do not model. The scanned data clearly has higher density and geometric accuracy than the data reconstructed by SFM. An approach integrating images and scans could be envisaged to leverage both the image analysis developed in our method and the high quality of the 3D point clouds from the scans. Our image analysis accepts any available 3D points as long as the 3D data is registered with the images. The key to the success of such an integrated approach is therefore the registration of the scanned data and the images; the potential challenge is the inconsistency of the registered data.

7.3 Limitations

There are a few limitations in the current implementation of the approach.

Rectilinear structure assumption As for any problem, a more flexible model with many degrees of freedom is difficult to solve in practice. Therefore, imposing reasonable priors of building regularity is the trade-off that we make for robustness and automation. The rectilinear structure assumption, or equivalently the Manhattan-world assumption [Coughlan and Yuille 1999], is universal for usual man-made buildings. For more complex buildings such as landmarks, the rectangular assumption can still serve as a first-level approximation of arbitrary surfaces. Moreover, at the scale of street-side city reconstruction targeted in this paper, the assumption is sufficient, as the final results demonstrate. For instance, some non-rectangular windows in the center of Figure 1 are well approximated by rectangles without obvious artifacts at the scale of a street view. Figure 13 shows another example in which the roof shape directly conflicts with our assumption; with our 1D Markov random field regularization, the roof is approximated by a step-shaped structure.

Camera viewing field The upper parts of large buildings are not modeled due to the limited viewing field of a ground-based camera. We could envisage integrating aerial images, as suggested in [Früh and Zakhor 2003], or deploying a multiple-camera system with one camera pointing upward.

Potential interactive editing Our approach is fully automatic for all the presented results. However, the method in [Xiao et al. 2008] does provide a very convenient user interface for manual operations. Since our rectangular representation is just a special case of the DAG representation used in their method, our method can be seamlessly combined with the user interface they provide for later manual processing if necessary.

8 Conclusion

We have proposed a completely automatic image-based modeling approach that takes a sequence of overlapping images captured along the street and produces complete photo-realistic 3D models. The main contributions are: a multiple-view semantic segmentation method to identify the object classes of interest, a systematic partition of buildings into independent blocks using the man-made vertical and horizontal lines, and a robust façade modeling with pushed and pulled rectangular shapes.

(12)

(a) (b) (c) (d) (e) (f) (g) (h) (i) (j)

Figure 14: Comparison with [Xiao et al. 2008] . Our results are presented from (a) to (f) with the same legend as in Figure 12. The

automatic results provided by [Xiao et al. 2008] are presented from (g) to (j). (g) is the manually specified initial fac¸ades. (h) is the automatic subdivision result. (i) is the automatic geometry of the fac¸ades before interactive retouching. (j) is the textured fac¸ades.

pushed and pulled rectangular shapes. More importantly, the com-ponents are assembled into a robust and fully automatic system. The approach has been successfully demonstrated on large amount of data.

There are a few limitations in the current implementation of the system, but they can be addressed within the same framework. For example, we could incorporate the 3D information into the semantic segmentation. Furthermore, using grammar rules extracted from the reconstructed models to procedurally synthesize missing parts is another interesting future direction.

Acknowledgements

The work is supported by Hong Kong RGC Grants 618908 and 619107. We thank Google for the data and a Research Gift, and Honghui Zhang for help with the segmentation.

References

BARINOVA, O., KONUSHIN, V., YAKUBENKO, A., LIM, H., AND KONUSHIN, A. 2008. Fast automatic single-view 3-d reconstruction of urban scenes. In Proceedings of the European Conference on Computer Vision, 100–113.
BOYKOV, Y., VEKSLER, O., AND ZABIH, R. 2001. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 11, 1222–1239.
CANNY, J. F. 1986. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679–714.
CORNELIS, N., LEIBE, B., CORNELIS, K., AND GOOL, L. V. 2008. 3D urban scene modeling integrating recognition and reconstruction. International Journal of Computer Vision 78, 2, 121–141.
COUGHLAN, J., AND YUILLE, A. 1999. Manhattan world: compass direction from a single image by bayesian inference. In Proceedings of IEEE International Conference on Computer Vision, vol. 2, 941–947.
CURLESS, B., AND LEVOY, M. 1996. A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH 96, ACM Press / ACM SIGGRAPH, H. Rushmeier, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM, 303–312.
DE BERG, M., CHEONG, O., VAN KREVELD, M., AND OVERMARS, M. 2008. Computational Geometry: Algorithms and Applications, 3rd ed. Springer, Berlin.
DEBEVEC, P. E., TAYLOR, C. J., AND MALIK, J. 1996. Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In Proceedings of SIGGRAPH 96, ACM Press / ACM SIGGRAPH, H. Rushmeier, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM, 11–20.
DICK, A., TORR, P., AND CIPOLLA, R. 2004. Modelling and interpretation of architecture from several images. International Journal of Computer Vision 60, 2, 111–134.
FELZENSZWALB, P., AND HUTTENLOCHER, D. 2004. Efficient graph-based image segmentation. International Journal of Computer Vision 59, 2, 167–181.
FRUEH, C., AND ZAKHOR, A. 2003. Automated reconstruction of building facades for virtual walk-thrus. In SIGGRAPH '03: ACM SIGGRAPH 2003 Sketches & Applications, ACM, New York, NY, USA, 1–1.
FRÜH, C., AND ZAKHOR, A. 2003. Constructing 3d city models by merging ground-based and airborne views. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 562–569.
GEMAN, S., AND GEMAN, D. 1984. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 6, 721–741.
HARTLEY, R. I., AND ZISSERMAN, A. 2004. Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press.
HOIEM, D., EFROS, A. A., AND HEBERT, M. 2005. Automatic photo pop-up. ACM Transactions on Graphics 24, 3 (Aug.), 577–584.
LHUILLIER, M., AND QUAN, L. 2005. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 418–433.
MÜLLER, P., ZENG, G., WONKA, P., AND GOOL, L. V. 2007. Image-based procedural modeling of façades. ACM Transactions on Graphics 26, 3 (Aug.), 85:1–85:10.
OH, B. M., CHEN, M., DORSEY, J., AND DURAND, F. 2001. Image-based modeling and photo editing. In Proceedings of SIGGRAPH 2001, ACM Press / ACM SIGGRAPH, E. Fiume, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM, 433–442.
OLIVA, A., AND TORRALBA, A. 2006. Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research 155, Part 2, 23–36.
PEARL, J. 1982. Reverend bayes on inference engines: a distributed hierarchical approach. In Proceedings of AAAI National Conference on AI, 133–136.
PÉREZ, P., GANGNET, M., AND BLAKE, A. 2003. Poisson image editing. ACM Transactions on Graphics 22, 3 (Aug.), 313–318.
POLLEFEYS, M., NISTÉR, D., FRAHM, J., AKBARZADEH, A., MORDOHAI, P., CLIPP, B., ENGELS, C., GALLUP, D., KIM, S., MERRELL, P., SALMI, C., SINHA, S., TALTON, B., WANG, L., YANG, Q., STEWÉNIUS, H., YANG, R., WELCH, G., AND TOWLES, H. 2008. Detailed real-time urban 3D reconstruction from video. International Journal of Computer Vision 78, 2, 143–167.

Figure 15: Two close-up street-side views of the city models automatically generated from the images shown on the bottom.

SAXENA, A., SUN, M., AND NG, A. Y. 2009. Make3d: learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 5, 824–840.
SCHINDLER, G., KRISHNAMURTHY, P., AND DELLAERT, F. 2006. Line-based structure from motion for urban environments. In Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, 846–853.
SHOTTON, J., WINN, J., ROTHER, C., AND CRIMINISI, A. 2009. TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision 81, 1, 2–23.
SINHA, S. N., STEEDLY, D., SZELISKI, R., AGRAWALA, M., AND POLLEFEYS, M. 2008. Interactive 3D architectural modeling from unordered photo collections. ACM Transactions on Graphics 27, 5 (Dec.), 159:1–159:10.
STAMOS, I., AND ALLEN, P. K. 2002. Geometry and texture recovery of scenes of large scale. Computer Vision and Image Understanding 88, 2, 94–118.
TORRALBA, A., MURPHY, K., AND FREEMAN, W. 2007. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 5, 854–869.
VAN DEN HENGEL, A., DICK, A., THORMÄHLEN, T., WARD, B., AND TORR, P. H. S. 2007. VideoTrace: rapid interactive scene modelling from video. ACM Transactions on Graphics 26, 3 (Aug.), 86:1–86:5.
WERNER, T., AND ZISSERMAN, A. 2002. Model selection for automated architectural reconstruction from multiple views. In Proceedings of the British Machine Vision Conference, 53–62.
WINN, J., CRIMINISI, A., AND MINKA, T. 2005. Object categorization by learned universal visual dictionary. In Proceedings of IEEE International Conference on Computer Vision, vol. 2, 1800–1807.
XIAO, J., FANG, T., TAN, P., ZHAO, P., OFEK, E., AND QUAN, L. 2008. Image-based façade modeling. ACM Transactions on Graphics 27, 5 (Dec.), 161:1–161:10.
ZEBEDIN, L., KLAUS, A., GRUBER-GEYMAYER, B., AND KARNER, K. 2006. Towards 3D map generation from digital aerial images. ISPRS Journal of Photogrammetry and Remote Sensing 60, 6 (Sep.), 413–427.
