Dynamic Super Resolution of Depth Sequences with Non-Rigid Motions

(1)

DYNAMIC SUPER RESOLUTION OF DEPTH SEQUENCES WITH NON-RIGID MOTIONS

Kassem Al Ismaeil

?

Djamila Aouada

?

Bruno Mirbach

†

Bj¨orn Ottersten

?

_{SnT - University of Luxembourg}

†

_{Advanced Engineering - IEE S.A.}

{kassem.alismaeil, djamila.aouada, bjorn.ottersten}@uni.lu

[email protected]

ABSTRACT

We enhance the resolution of depth videos acquired with low resolution time-of-flight cameras. To that end, we propose a new dedicated dynamic super-resolution that is capable to accurately super-resolve a depth sequence containing one or multiple moving objects without strong constraints on their shape or motion, thus clearly outperforming any existing super-resolution techniques that perform poorly on depth data and are either restricted to global motions or not precise because of an implicit estimation of motion. The proposed approach is based on a new data model that leads to a robust registration of all depth frames after a dense upsampling. The textureless nature of depth images allows to robustly handle sequences with multiple moving objects as confirmed by our experiments.

Index Terms— Depth sequence, dynamic super-resolution, motion estimation, upsampling, ToF data, moving object.

1. INTRODUCTION

Super resolution (SR) is the process of recovering a high res-olution (HR) image from a set of captured low resres-olution (LR) frames. SR has originally been defined for static scenes, i.e., scenes where the motion between the observed images is global as opposed to dynamic scenes containing a moving ob-ject. The past two decades have witnessed tremendous work on SR for static scenes. As presented in [3], these algorithms, commonly referred to as classical SR, are numerically limited to small global motions even for an increased number of LR frames. Moreover, they cannot handle scenes with moving objects and consider the corresponding frames as outliers. As a solution to these major limitations, example-based SR algo-rithms have been proposed [4], as well as their combinations with classical multi-frame SR [5]. However, such algorithms depend on a heavy training phase and the quality of the super-resolved image is dependent on the suitability of the training data.

Relatively little attention has been given to the SR of dy-namic scenes. Farsiu et al. have proposed in [1] a dydy-namic shift and add model (dynamic S&A) as a mere extension of the original S&A that was first defined for the static case in [2], hence suffering from the same restrictions. Other

meth-ods [6, 7, 8, 9] have been proposed to tackle the problem of dynamic SR by segmenting the moving object first before su-per resolving it. Such methods do not handle pixels on the boundary of the object causing major artifacts. In 2010, van Eekeren et al. [9] proposed an algorithm to solve the prob-lem of boundary pixels; however, this algorithm is computa-tionally heavy and based upon strong assumptions. In 2009, dynamic SR models were proposed with an implicit motion estimation, e.g., steering kernels for SR (SKSR) [21]. While the idea is theoretically attractive, it is very impractical as it relies on heavy computations and on many empirical param-eters. Moreover, these methods are dedicated for 2D inten-sity sequences, strongly failing when it comes to depth data because of its abrupt value changes around edges and tex-tureless nature. Such data, usually captured with a time-of-flight (ToF) camera, requires a resolution enhancement. Fu-sion based methods have been proposed as a solution for dy-namic depth scenes [10, 11, 12, 13, 14] where a HR 2D cam-era is coupled with a depth LR camcam-era. These methods often suffer from texture copying problems and require a perfect alignment and synchronization of 2D and depth sequences.

In [16], we have proposed to release the limitations on scale and motion of the static S&A algorithm [1, 2] in the case of depth data with global rigid motions. In this work, we target the SR of dynamic depth scenes containing one or multiple moving objects without prior assumptions on their shape or motion, and without engaging in an additional learn-ing stage. To the best of our knowledge, we are the first to explore the multi-frame SR framework for dynamic depth scenes. The proposed algorithm takes advantage of the tex-tureless nature of depth data, leading to a robust median esti-mation without fusing with 2D data; hence, avoiding blurring and texture copying artifacts. This algorithm is based on a new data model that starts by densely upsamling the LR mea-surements for an accurate registration using a new cumulative motion compensation.

(2)

2. PROBLEM FORMULATION

The aim of dynamic SR algorithms is to estimate a sequence of N HR images {xt}Nt=1of size (m × n) from an observed

LR sequence {yt}Nt=1, where each LR image yt is of size

(m0× n0_{) pixels, with n = r · n}0_{and m = r · m}0_{, such that}

r is the SR factor. Every image ytmay be viewed as an LR

noisy and deformed realization of xt0at the acquisition time t,

with t0≤ t. Rearranging all images xtand yt, t = 1, · · · , N ,

in lexicographic order, i.e., column vectors of lengths mn and m0n0, respectively, we consider the following data model:

yt= DHLt

0

txt0+ n_t, t0 ≤ t and t, t0∈ [1, N ] ⊂ N∗, (1)

where D is a matrix of dimension (m0n0× mn) that repre-sents the downsampling operator, and which we assume to be known and constant over time. The system blur is repre-sented by the time and space invariant matrix H. The vector ntis an additive Laplacian noise at time t as justified in [2].

The matrices Ltt0 are (mn × mn) matrices corresponding to

the geometric motion between the considered HR image xt0

and the observed LR image ytprior to its downsampling.

The dynamic SR problem is simplified by reconstructing one HR image at a time using the full observed sequence. From now on, we fix the reference time to t0, and focus on the

re-construction of xt0 from {yt}

N

t=t0. The operation may be

repeated for t0 = 1, · · · , N . Based on the data model in

(1), and using an L1norm between the observations and the

model, the Maximum Likelihood (ML) estimate of xt0 is

ob-tained as follows: ˆ xt0 = arg min_x t0 N X t=t0 kDHLt0 t xt0− ytk1. (2)

Using the same approach as in [2, 18], we consider that H and Lt0

t are block circulant matrices. Therefore:

HLt0

t = L

t0

t H. (3)

The minimization in (2) can therefore be decomposed into two steps; estimation of a blurred HR image zt0 = Hxt0,

followed by a deblurring step. In what follows, we assume that ytis simply the noisy and decimated version of zt

with-out any geometric warp. We may thus write Lt

t = I, ∀t, I

being the identity matrix, hence, Lt0

t zt0 = zt = Hxt. This

operation can be assimilated to registering zt0to zt. We draw

attention to the fact that in the case of static multi-frame SR, instead of a sequence, a set of observed LR images is consid-ered, i.e., there is no order between frames. Such order be-comes crucial in dynamic SR because the estimation of mo-tion, based on the optical flow paradigm, happens between consecutive frames only. An accurate dynamic SR estimation is consequently highly dependent on the accuracy of estimat-ing the registration matrices Lt−1t , as well as L

t0

t . In the case

of one moving object with a very small translational motion

through few frames, a subpixel motion estimation would be sufficient to guarantee a good HR image. This assumption is not valid anymore if the object moves fast or the scene has multiple objects moving with different motions. In this case, the SR process becomes more challenging and a robust reg-istration method is required using a dense optical flow. Most SR algorithms are directly related to a registration based on a too coarse pixel correspondence as compared to the scale of details in the scene. It is therefore necessary to call upon a very accurate subpixel correspondence. In what follows, we argue that this accuracy is highly increased after an up-sampling of the observed sequence as presented in Section 3. We accordingly propose a new data formulation for dynamic depth SR and give its corresponding algorithm in Section 4.

3. MOTION ESTIMATION AND REGISTRATION It has been shown in [17] that higher image resolutions help increase the accuracy of motion estimation which justifies ap-plying an upsampling framework to get higher scale images. Moreover, performing the registration process on upsampled images guarantees a better result with a higher accuracy than registering the LR images ytfollowed by upsampling them.

This is due to the fact that registration parameters are approx-imated by rounding the motion vectors with an expected error of ±1₂ pixel. The effect of this error is related to the size of the registered images, whereas the upsampling process re-duces this effect from ±_2m1 in the LR case to ±_2rm1 . Hence, we propose to upsample the observed LR images even be-fore registering them. Due to the specifications of depth data, classical interpolation based methods (e.g. bicubic) cannot be used and lead to jagged values and blurring effects especially for boundary pixels. Thus, we propose to densely upsam-ple yt, t = 1, ..., N , up to the super-resolved image of size

(m × n). We define the resulting image as:

yt↑= U · yt, (4)

where U is a dense upsampling matrix of size (mn × m0n0_),

which we choose to be the transpose of D, s.t., UD = A, where A is a block circulant matrix that defines a new blur-ring matrix B = AH. Therefore, we redefine ztas:

zt= Bxt. (5)

Since the optical flow approach works under the assumption of small motions, the frames which are further from the refer-ence frame yt0 ↑ would introduce a higher registration error

than the ones that are closer to yt0 ↑. They will thus be

(3)

registration method based on a cumulative motion compensa-tion.

Considering two consecutive upsampled frames yt−1 ↑

and yt↑, the optimal registration solution is:

ˆ

Mt_t−1= arg min

M Ψ (yt−1↑, yt↑, M) , (6)

where Ψ is a dense optical flow-related cost function and yt↑= Mtt−1yt−1↑ +vt. (7)

The vector vtcontains the innovation that we assume

negligi-ble in this framework. In addition, similarly to [21], for ana-lytical convenience, we assume that all pixels in yt↑ originate

from pixels in yt−1 ↑ in a one to one mapping. Therefore,

each row in Mtt−1contains 1 for each position corresponding

to the address of the source pixel in yt−1 ↑. This bijective

property implies that the matrix ˆMt

t−1 is an invertible

per-mutation, s.t., [ ˆMt

t−1]−1 = ˆM t−1

t . Furthermore, its estimate

leads to the following registration to yt−1:

y_t↑= ˆMt−1_t yt↑ . (8)

We then need to define yt0

t ↑, the registered version of yt ↑

to the reference yt0 ↑. To that end, we use all the

regis-tered upsampled images yt ↑, as defined in (8), for t > t0.

We propose, similarly to our work in [17], a cumulative mo-tion compensamo-tionapproach with an additional improvement where we further reduce the cumulated motion error by re-computing ˆMt

t−1as follows:

ˆ

Mt_t−1= arg min

M Ψ (yt−1↑, yt↑, M) , (9)

We prove by induction the following registration relationship for non-consecutive frames:

yt0 t ↑= ˆM t0 t yt↑= ˆMtt00+1· · · ˆM t−1 t | {z } (t − t0) times ·yt↑ . (10)

Considering the bijection simplification, we further write: ˆ Mt0 t ≈ L t t0 = [L t0 t ]−1. (11) 4. PROPOSED ALGORITHM

The subpixel accuracy in motion estimation induced by the combined upsampling and cumulative motions proposed in Section 3, make it feasible to handle a depth sequence with a moving object without using any prior information on its shape, rigidity, or motion. These advantages are extended to the much more complex case of multiple moving objects. In-deed, the textureless nature of depth images categorizes them as images containing gross information only, i.e., with no tex-ture information, as per Mallikarajuna et al.’s composite im-age model [22]. This property, combined with the SR impulse

noise nt, suggests that a temporal median estimator is a robust

equivalent to the ML formulation of (2).

We reformulate the data model in (1) to introduce the upsam-pling strategy of Section 3. Combining (1), (3), (4), (5), and (11), we find1:

yt0

t ↑= zt0+ wt, t0≤ t and t, t0∈ [1, N ] ⊂ N

∗_, ₍₁₂₎

where wt = ˆMtt0U · ntis an additive Laplacian noise at t.

The estimation in (2) becomes: ˆ zt0 = arg min z_t0 N X t=t0 kzt0− y t0 t ↑ k1, (13)

which corresponds to the pixel-wise temporal median estima-tor, i.e., ˆzt0 = medt{y

t0

t ↑}Nt=t0. Then follows a simple

im-age deblurring to recover ˆxt0 from ˆzt0. We hence propose

a new dynamic SR algorithm corresponding to the presented new SR estimation, that we refer to as Upsampling for Precise Super-Resolution(UP-SR) as summarized below:

UP-SR algorithm for t0,

1. Choose the reference frame yt0.

for t, s.t., t0< t < N ,

do

2. Compute yt↑ using (4).

3. Estimate the registration matrices ˆMt0

t using (10).

4. Compute yt0

t ↑ using (10).

end do end for

5. Find ˆzt0by applying a median estimator (13).

6. Deduce ˆxt0by deblurring.

end for

5. EXPERIMENTS

We test the performance of the proposed UP-SR algorithm on depth data acquired with a ToF camera. Using the SR estimation on such data is suitable as it suffers from a very low resolution. We start with a simple case of one moving object (hand) with a translational motion. We compare the performance of UP-SR for both cases, registered measured LR depth images yt0

t and registered densely upsampled depth

images yt0

t ↑, t0 < t < N . Results show that in the latter

case (Fig. 1(b)), the registration is more accurate and leads to sharper edges. Directly relying on LR images, however, leads to blurred edges (Fig. 1(a)), necessitating a special treatment or a segmentation step to reduce the artifacts caused by the boundary pixels. This experimentally confirms the benefit of our upsampling strategy. Next, we tested UP-SR on a real

(4)

(a) (b)

Fig. 1. UP-SR results using motion estimated (a) from an LR sequence; (b) from densely upsampled sequence (r = 5). quence of LR depth images containing multiple moving ob-jects. We mounted an LR ToF camera at a 2.5 meter hight looking down at the ground with two persons sitting on chairs sliding in two different directions. A sequence of 9 LR depth images, of size (56×61) pixels, were super-resolved with a factor r = 5 using UP-SR, 2D/depth fusion [13], SKSR [18] and dynamic S&A [1]. Visual results for one frame are given in Fig. 2(c), (d), (e) and (f), clearly showing that SKSR and dynamic S&A fail badly on depth data mainly on boundary pixels while 2D/depth fusion, although computationally ef-ficient, often suffers from strong 2D texture copying on the final super-resolved depth frame. Fig. 2(f) shows the result of UP-SR where we obtained clear sharp edges in addition to an efficient removal of noisy pixel values. This is mostly due to the proposed subpixel motion estimation combined with an accurate registration leading to a successful temporal fu-sion of the sequence. Finally, in order to provide a quan-titative evaluation, we generated an LR depth sequence by downsampling an available HR depth sequence with a factor r = 4, and further degrading it by additive white Gaussian noise (AWGN) with signal to noise ratio (SNR) of 15, 25, 35, and 45 dB. We quantitatively compare our proposed algo-rithm with SKRS and dynamic S&A. We tested these methods using the corresponding softwares provided in [19] and [20]. Since we have a known ground truth {xt}Nt=1, we measure the

quality of an estimated HR depth frame ˆxt0 using peak SNR

(PSNR), which is defined as: PSNR = 10 log10kx_t0m×n−ˆx_t0k2.

Obtained results show the superiority of the UP-SR algorithm where it provides the best results among discussed state-of-the-art SR methods across all noise levels. As illustrated in Fig. 3, it is not surprising to see that even for a very high noise level (SNR = 15 dB) results are good. This is due to the key components of UP-SR, namely, its subpixel motion esti-mation and accurate multi-frame registration combined with a robust median filtering that matches the textureless prop-erty of depth data. Therefore, our algorithm results with good quality depth images without having to call upon an additional regularization step. The same algorithm applied on 2D im-ages gives blurry results with lost details due to the fact that generally they do not fall under the model proposed in (12).

(a) 2D image (b) Low resolution frame

(c) 2D/depth fusion [13] (d) SKSR [18]

(e) Dynamic S&A [1] (f) Proposed UP-SR Fig. 2. UP-SR example of a dynamic depth scene (r = 5).

Fig. 3. PSNR on the moving hand sequence with r = 4. 6. CONCLUSION

(5)

7. REFERENCES

[1] Farsiu, S.; Robinson, M.D.; Elad, M.; Milanfar, P., “Ad-vances and challenges in super-resolution” , International Journal of Imaging Systems and Technology, vol. 14, pp. 47-57, 2004

[2] Farsiu, S.; Robinson, M.D.; Elad, M.; Milanfar, P., “Fast and robust multiframe super resolution,” Image Process-ing, IEEE Transactions on, vol.13, no.10, pp.1327,1344, Oct. 2004

[3] Zhouchen Lin; Heung-Yeung Shum, “Fundamental lim-its of reconstruction-based superresolution algorithms un-der local translation,” Pattern Analysis and Machine Intel-ligence, IEEE Transactions on, vol.26, no.1, pp.83,97, Jan. 2004

[4] Aodha, Oisin Mac; Campbell, Neill D. F.; Nair, Arun; Brostow, Gabriel J., “Patch based synthesis for single depth image super-resolution,” Proceedings of the 12th European Conference on Computer Vision, vol. Part III, ECCV’12, pp.71-84, ISBN: 978-3-642-33711-6

[5] Glasner, D.; Bagon, S.; Irani, M., “Super-resolution from a single image,” Computer Vision, 2009 IEEE 12th Inter-national Conference on, vol., no., pp.349,356, Sept. 29 2009-Oct. 2 2009

[6] Hardie, R.C.; Tuinstra, T.R.; Bognar, J.; Barnard, K.J.; Armstrong, E., “High resolution image reconstruction from digital video with global and non-global scene mo-tion,” Image Processing, 1997. Proceedings., International Conference on, vol.1, no., pp.153,156 vol.1, 26-29 Oct 1997

[7] Farsiu, S.; Elad, M.; Milanfar, P., “Video-to-video dy-namic super-resolution for grayscale and color sequences,” EURASIP Journal on Applied Signal Processing, vol. 2006, pp.232-232, ID 61859

[8] van Eekeren, A.; Schutte, K.; Dijk, J.; de Lange, D.J.J.; van Vliet, L.J., “Super-Resolution on Moving Objects and Background,” Image Processing, 2006 IEEE International Conferenceon , vol., no., pp.2709,2712, 8-11 Oct. 2006 [9] Van Eekeren, A. W M; Schutte, K.; van Vliet, L.J.,

“Mul-tiframe Super-Resolution Reconstruction of Small Moving Objects,” Image Processing, IEEE Transactions on, vol.19, no.11, pp.2901,2912, Nov. 2010

[10] J. Diebel and S. Thrun, “An application of Markov ran-dom fields to range sensing,” Proceedings of Conference on Neural Information Processing Systems (NIPS), pp. 291-298, 2005.

[11] Johannes Kopf and Michael F. Cohen and Dani Lischin-ski and Matt Uyttendaele, “Joint bilateral upsampling,”

ACM Transactions on Graphics (Proceedings of SIG-GRAPH 2007), vol.26, no.3, 2007

[12] Qingxiong Yang; Ruigang Yang; Davis, J.; Nister, D., “Spatial-depth super resolution for range images,” Com-puter Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, vol., no., pp.1,8, 17-22 June 2007 [13] Garcia, F.; Aouada, D.; Mirbach, B.; Solignac, T.;

Ot-tersten, B., “Real-time hybrid ToF multi-camera rig fu-sion system for depth map enhancement,” Computer Vifu-sion and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on , vol., no., pp.1,8, 20-25 June 2011

[14] Xueqin Xiang; Guangxia Li; Jing Tong; Zhigeng Pan, “Fast and simple super resolution for range data,” Cyber-worlds (CW), 2010 International Conference on, vol., no., pp.319,324, 20-22 Oct. 2010

[15] Garcia, F.; Aouada, D.; Mirbach, B.; Ottersten, B., “Spatio-temporal ToF data enhancement by fusion,” Image Processing (ICIP), 2012 19th IEEE International Confer-ence on, vol., no., pp.981,984, Sept. 30 2012-Oct. 3 2012 [16] Al Ismaeil, K.; Aouada, D.; Mirbach, B.; Ottersten, B.,

“Depth super-resolution by enhanced shift & add, ” Pro-ceedings of the 15th International Conference on Com-puter Analysis of Images and Patterns, CAIP’13, York, UK, 27-29 Aug. 2013.

[17] L. Xu, J. Jia, S. B. Kang, “Improving sub-pixel cor-respondence through upsampling”, J. of Computer Vision and Image Understanding (CVIU), vol 116, Issue 2, Febru-ary 2012, pp. 250-261, ISSN 1077-3142.

[18] Takeda, H.; Milanfar, P.; Protter, M.; Elad, M., “Super-resolution without explicit subpixel motion estimation,” Image Processing, IEEE Transactions on, vol.18, no.9, pp.1958,1975, Sept. 2009

[19] http://users.soe.ucsc.edu/htakeda/SpaceTimeSKR.htm [20]

http://users.soe.ucsc.edu/milanfar/software/superresol-ution.html

[21] Elad, M.; Feuer, A., “Super-resolution reconstruction of continuous image sequences,” Image Processing, 1999. ICIP 99. Proceedings. 1999 International Conference on, vol.3, no., pp.459,463 vol.3, 1999