Sparse multi-view 3D computer vision : application to embedded assistive technologies


Thesis

Reference

Sparse multi-view 3D computer vision : application to embedded assistive technologies

CLOIX, Séverine

Abstract

In the framework of 3D computer vision dedicated to assistive technologies, the research reported in this thesis aims to design new computer-vision-based approaches for embedded, real-time applications with limited resources. This thesis proposes novel strategies for rapid object detection and recognition under practical constraints. These constraints are, for example, the number of sensors and their resolution, algorithm complexity and mobile battery life. We narrowed the research scope to the detection of specific objects and obstacles with off-the-shelf stereo and plenoptic cameras. The research work thus investigates two areas of computer vision from multi-view imaging, namely exploiting (i) sparse 3D keypoint clouds from stereo vision and (ii) light field imaging for low-complexity and efficient algorithms.

CLOIX, Séverine. Sparse multi-view 3D computer vision : application to embedded assistive technologies. Thèse de doctorat : Univ. Genève, 2017, no. Sc. 5090

URN : urn:nbn:ch:unige-956778

DOI : 10.13097/archive-ouverte/unige:95677

Available at:

http://archive-ouverte.unige.ch/unige:95677


CSEM SA, NEUCHÂTEL

Vision Embedded Systems Dr. David Hasler

Sparse Multi-View 3D Computer Vision

Application to Embedded Assistive Technologies

THESIS

presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialization in Computer Science

by

Séverine CLOIX

from

Paris XIVe (FRANCE)

Thesis No. 5090

GENEVA

Repro-Mail - Université de Genève 2017


To my loving and beloved family...


“One is always wiser afterwards”


These five years of doctoral work were an unforgettable experience that I could not have completed without being surrounded by extremely kind and intelligent people, true gems of humanity.

My first thanks go to my supervisors, Prof. Thierry Pun and Dr. David Hasler. Not only did they trust me with my background of 8 years in industry, but they also gave me advice and freedom, with kindness, throughout my research. I was enriched by each of our discussions and acquired valuable skills.

I also thank Dr. Guido Bologna for having been a guide and for his help in writing papers, as well as Edo Franzi for welcoming me to Section 111 of CSEM and for his ever-open door.

The EyeWalker project would be nothing without the SmartWorld program of the Hasler Foundation and the partners: IMAD, “Institution genevoise de Maintien à Domicile”, and EMS-Charmilles.

I am also very grateful to the members of my thesis committee, Dr. Christian Perwass, Dr. Loïc Baboulaz, Prof. Andres Perez-Uribe, Prof. Sviatoslav Voloshynovskiy and Dr. Guido Bologna, for agreeing to evaluate my research work.

Many thanks to Section 111 of CSEM, namely, in no particular order, Eric Grenet, François Kaess, Sigolène Pangaud, Amina Chebira, Pierre-Alain Beuchat, Engin Turetken, Pierre-François Ruedi, Patrick Vollet, Virginie Moser, Laurent Von Allmen and Pascal Nussbaum, for the daily discussions, the laughs and the coffee/tea breaks in a warm and friendly atmosphere. Special thanks to Daniel Sigg who, with his IT expertise, rescued me from tricky situations on several occasions.

I thank the members of the CVML lab for their kindness and availability, in no particular order: Guillaume Chanel, Prof. Stéphane Marchand-Maillet, Mohammad Soleymani, Theodoros Kostoulas, Viviana Weiss, Sunny Avry, Michal Muszynski, Chen Wang, Phil Lopes, Prof. Alexandros Kalousis, Maurits Diephuis, Edgar Roman-Rangel (Paco), Taras Holotyak, Sohrab Ferdowsi, Dimce Kostadinov, Ke Sun.


I cannot forget to thank my parents for giving me the best possible conditions to pursue my studies in France and in England. In particular my mother, Josiane, for her love, her generosity, her support and the moral values that I learned at her side and that made me who I am.

I would like to express my gratitude to my parents-in-law, Elisabeth and Olivier, for their unconditional help, including looking after my daughters during school holidays and conferences.

Finally, saving the best for last, my wonderful little family: my loving and beloved princesses, who always had comforting words when I went through periods of strong emotions or difficult work. Emma could explain supervised learning at the age of 8 while Agnès tapped away on her toy laptop! And the love of my life, Damien, for his moral support, his infinite love and his encouragement in all the roles to which I am committed.


These five years of PhD were an unforgettable experience that I could not have accomplished without outstandingly smart and good people, gems of humanity.

My first thanks go to my supervisors Prof. Thierry Pun and Dr. David Hasler. Not only did they trust me to start a PhD after my 8 years in industry, but they also gave me valuable advice and freedom with goodwill throughout my doctoral research. I grew richer from each of our discussions and acquired invaluable skills. I also thank Dr. Guido Bologna for his support in writing papers and his guidance, as well as Edo Franzi for welcoming me to CSEM Sec. 111 and for his ever-open door.

The EyeWalker project would not exist without the support of the Swiss Hasler Foundation SmartWorld Program and the end-user partners: IMAD, “Institution genevoise de Maintien à Domicile”, and EMS-Charmilles.

I am also grateful to my thesis committee members, Dr. Christian Perwass, Dr. Loïc Baboulaz, Prof. Andres Perez-Uribe, Prof. Sviatoslav Voloshynovskiy and Dr. Guido Bologna, for agreeing to evaluate my research.

Many thanks to the CSEM Sec. 111 team, Eric Grenet, François Kaess, Sigolène Pangaud, Amina Chebira, Pierre-Alain Beuchat, Engin Turetken, Pierre-François Ruedi, Patrick Vollet, Virginie Moser, Laurent Von Allmen and Pascal Nussbaum, for the everyday discussions, tea/coffee breaks and laughs in a warm and friendly atmosphere. Special thanks to Daniel Sigg, whose IT expertise saved me from serious trouble more than once.

I thank all CVML lab members for their friendliness and helpfulness, in random order: Guillaume Chanel, Prof. Stéphane Marchand-Maillet, Mohammad Soleymani, Theodoros Kostoulas, Viviana Weiss, Sunny Avry, Michal Muszynski, Chen Wang, Phil Lopes, Prof. Alexandros Kalousis, Maurits Diephuis, Edgar Roman-Rangel (Paco), Taras Holotyak, Sohrab Ferdowsi, Dimce Kostadinov, Ke Sun.


I cannot fail to thank my parents for providing me with the best possible conditions for my education in France and in England. I especially thank my mother, Josiane, for her love, her generosity, her support and the moral values I learnt from her, which make me who I am.

I would like to express my gratitude to my parents-in-law, Elisabeth and Olivier, for their unconditional help, especially in looking after my daughters during school holidays and conferences.

The best for last, my wonderful little family: my loving and beloved princesses, who always had kind words to cheer me up through all the emotions and hard work I went through. Emma could explain supervised learning at 8 years old while Agnès was typing on her toy laptop! And the love of my life, Damien, for his moral support, endless love and encouragement in all the roles I was and am committed to.


Advances in computing power and the affordable prices of today's image sensors allow the growing development of complex and highly efficient algorithms. The field of 3D computer vision has benefited from this growth, giving rise to a wide range of applications, including those dedicated to assistive technologies.

The latter are of great importance, in particular for the elderly. Indeed, the United Nations publishes a report every year on trends in population aging. It shows the need to develop solutions that compensate for the various types of age-related disability, including those related to mobility.

However, current mobility-aid devices are bulky and their battery life is limited to a few hours. This is how the EyeWalker project and the research studies gathered in this thesis began; their common objective is to develop new computer vision approaches for real-time applications embedded on resource-limited platforms.

The EyeWalker project aims to develop a low-cost and very light computer vision device for users of wheeled walkers who have difficulty moving around. This thesis therefore proposes new methods for fast object detection and recognition under practical constraints. These limitations are, for example, the number of image sensors and their resolution, algorithmic complexity and the battery life of the portable device. We defined use cases from information gathered from occupational therapists. We thus narrowed the scope of our research to the detection of obstacles and specific objects using commercial stereo and plenoptic cameras. This research explores two aspects of computer vision from multi-view imaging, namely the exploitation of (i) sparse 3D keypoint clouds from stereo images and (ii) light fields, for algorithms that are both efficient and of low complexity.

First, we address the problems of detecting obstacles and specific objects in the context of navigation assistance. We propose two approaches using sparse 3D visual information of objects captured in stereo within a boosting classification framework. The first approach relies on pose estimates of the 3D object to reduce the ambiguity of its appearance and thus improve its detection in a 2D image. The second allows the extraction of 2D and 3D features for obstacle detection. We also present a generic method for detecting descending stairs. We studied the influence of two parameters on the performance of the algorithm: the image resolution and the type of imaging system, namely a passive or an active stereo camera under varying illumination conditions. We analyzed the power consumption as a function of resolution and with respect to hardware and software considerations under embedded real-time constraints. This study demonstrates the robustness and the low computation time of our algorithm.

Regarding point (ii), recent advances in optics and light field vision have led to the commercialization of plenoptic cameras embedding the equivalent of nearly 7900 viewpoints. We present a scale-invariant object recognition method using a Raytrix® camera, evaluated on a novel and versatile set of light field images. A final study presents a method for characterizing 2D points using the surrounding light rays captured by a light field. Applied to a single pixel, it can accurately estimate its distance to the camera. It can also predict whether a keypoint lies in a region of depth discontinuities. The low complexity of the algorithm makes it an ideal candidate for real-time applications intended for embedded platforms.

To conclude, this thesis stands in opposition to the current “big data” trend. It highlights the ability to exploit sparse information from multiple views for efficient and fast vision applications. While we experienced the limits of sparse 3D point clouds obtained from low-resolution sensors for the 3D pose estimation of planar objects, we demonstrated the practical performance of such a set of points for detecting descending stairs in real time on an embedded vision platform. By considerably increasing the number of viewpoints with a plenoptic camera, our approaches to object recognition and 3D keypoint detection position our research among the state of the art in light field vision. In addition to the continuous improvement in computing power for portable devices, the low complexity of all the proposed methods makes it possible to meet the restrictive technical specifications imposed to develop useful and affordable assistive devices. These methods meet not only the needs of navigation and surveillance systems but also those of the machine industry, with the increasing pace of production lines that integrate vision technologies for purposes such as quality inspection along the whole chain.


Advances in computing power and the affordability of today’s imaging sensors have given rise to the development of complex and highly efficient algorithms. The domain of 3D computer vision has benefited from this growth for a broad range of applications, including assistive technologies. The latter field is of high concern, especially for the elderly. The United Nations reports yearly on the trends in population aging, and these reports reveal the necessity to develop solutions that compensate for impairments due to old age, including those related to mobility. Today’s assistive devices, however, are bulky and require large batteries to run for just a couple of hours. This is how the EyeWalker project came about, along with the research studies reported in this thesis, whose common objective is to design new computer-vision-based approaches dedicated to embedded and real-time applications with limited resources.

The EyeWalker project aims at developing a low-cost, ultra-light computer-vision-based device for rollator users with mobility problems. Accordingly, this thesis proposes novel strategies for rapid object detection and recognition under practical constraints. These constraints are, for example, the number of sensors and their resolution, algorithm complexity and mobile battery life. We defined use cases based on data collected from occupational therapists. We could therefore narrow the research scope to the detection of specific objects and obstacles with off-the-shelf stereo and plenoptic cameras. The research work thus investigates two areas of computer vision from multi-view imaging, namely exploiting (i) sparse 3D keypoint clouds from stereo vision and (ii) light field imaging for low-complexity and efficient algorithms.

We address the problems of detecting obstacles and specific objects for aids to navigation. We propose two approaches using sparse 3D object cues from stereo vision in a boosting classification framework. The first estimates poses of a 3D object to reduce the ambiguity of its appearance for its detection in a 2D image. The second allows specific 2D and 3D feature extraction for obstacle detection. We also present a generic method to detect descending stairs. We studied the influence of two parameters on the algorithm performance: the reduction of the image resolution and the type of imaging system, i.e. passive and active stereo cameras in various illumination conditions. We analyzed the power consumption versus the resolution with regard to hardware considerations and embedded real-time programming. This study demonstrates the robustness and the low computation time of our algorithm.

Regarding (ii), the recent advances in optics and light field vision have given rise to off-the-shelf plenoptic cameras that embed the equivalent of about 7900 viewpoints. We present a state-of-the-art scale-invariant object recognition method using a Raytrix® camera, evaluated on a novel and versatile light field dataset. A final study presents a method to characterize 2D points using the surrounding rays captured by a light field. Applied to a single pixel, it independently gives an accurate depth estimate and predicts whether the pixel is a keypoint lying in a region of depth discontinuities. The low complexity of the algorithm makes it an ideal candidate for real-time applications intended for embedded platforms.

In conclusion, this thesis goes against the recent trend toward big data. It highlights the capabilities of exploiting sparse cues from multiple views to efficiently perform computer vision tasks. While we faced the limits of sparse 3D point clouds from low-resolution sensors to estimate poses of planar objects in 3D space, we have demonstrated the practical performance of such sets of points to detect descending stairs in real time on a low-power embedded vision system. By drastically increasing the number of views with an off-the-shelf plenoptic camera, our approaches to object recognition and 3D keypoint detection position our research among the state-of-the-art work on light field for computer vision. In addition to the continuous improvement in mobile computing power, the low complexity of all the proposed methods complies with restrictive technical requirements, making it possible to design not only affordable and useful assistive devices for the elderly but also navigation and surveillance systems, as well as retrofits of existing manufacturing lines for quality inspection, to name a few.


2D two-dimensional
3D three-dimensional
VOC Visual Object Classes
RANSAC Random Sample Consensus
LoG Laplacian of Gaussian
DoG Difference of Gaussian
SIFT Scale-Invariant Feature Transform
BRIEF Binary Robust Independent Elementary Features
FAST Features from Accelerated Segment Test
BoW bag-of-words
BoF bag-of-features
ROC Receiver Operating Characteristic
PR Precision-Recall
FLANN Fast Library for Approximate Nearest Neighbors
DSP Digital Signal Processing or Processor
DMA Direct Memory Access
RAM Random Access Memory
HDR High Dynamic Range
ETA Electronic Travel Aid
TOF Time Of Flight
DEM Digital Elevation Map
fps frames per second
UGV Unmanned Ground Vehicles
IR Infra-Red
TPR True Positive Rate
FPR False Positive Rate
TNR True Negative Rate
FNR False Negative Rate
ACC accuracy
PPV Positive Predictive Value
SAD Sum of Absolute Differences
RGB-D Red Green Blue Depth


M Normalized model
Λ List of elements
λ Element of the list Λ
F Family of features
f Feature of family F
p_i 2-dimensional point with index i
x, y Coordinates of a 2-dimensional point
P_i 3-dimensional point with index i
X, Y, Z Coordinates of a 3-dimensional point


Symbols
Table of Contents
List of Figures

1 Introduction
1.1 The EyeWalker project
1.1.1 Context
1.1.2 Definitions
1.1.2.1 What is an obstacle?
1.1.2.2 What is an object?
1.1.3 Use cases
1.1.4 A vision system with requirements and hardware constraints
1.2 Research questions
1.3 Contextual restrictions
1.4 Contributions
1.5 Thesis structure

2 Technical background
2.1 2D keypoints
2.1.1 Definition
2.1.2 Corner detectors
2.1.2.1 Harris corner detector
2.1.2.2 FAST corner detector
2.1.3 Blob detectors
2.1.4 Descriptors
2.2 3D Computer vision
2.2.1 Stereo vision
2.2.1.1 Stereo matching
2.2.1.2 Stereo cameras
2.2.2 Multi-view vision or light field
2.2.2.1 Light field definition
2.2.2.2 Capturing a light field
2.2.3 Raytrix: a plenoptic camera
2.2.4 Depth estimation and occlusions
2.3 Machine learning for computer vision
2.3.1 What is machine learning?
2.3.2 Cascade of boosted classifiers
2.3.3 Bag-of-visual-words
2.3.4 Binary classifier performance assessment
2.4 Conclusion

3 Exploiting sparse 3D point clouds
3.1 A deformable object detector
3.1.1 State of the art
3.1.2 A pose estimation-based approach
3.1.3 Hardware set-up and dataset
3.1.4 Experimental results
3.1.5 Conclusion
3.2 A boosting obstacle detection approach
3.2.1 State of the art
3.2.2 A multi-view feature-based classifier
3.2.3 Dataset
3.2.4 Experimental results
3.2.5 Conclusion
3.3 Depth-based descending stairs detection
3.3.1 State of the art
3.3.2 Hardware and depth map acquisition requirements
3.3.3 A non-geometric method
3.3.4 Collected data
3.3.5 Results
3.3.5.1 Illumination conditions
3.3.5.2 Resolution study
3.3.5.3 RGB-D camera versus stereo camera
3.3.5.4 Porting on embedded platforms
3.3.5.5 Discussion
3.3.6 Conclusion
3.4 Discussion
3.5 Conclusion

4 Light field for embedded computer vision
4.1 Scale-invariant object recognition
4.1.1 State of the art
4.1.2 CSEM-25 dataset
4.1.2.1 Existing Light Field Datasets
4.1.2.2 Proposed Dataset
4.1.2.3 Possible Usage of The Dataset
4.1.3 A redundancy-based method
4.1.3.1 Codebook learning
4.1.3.2 Histogram extraction and classification
4.1.4 Experimental results
4.1.5 Conclusion
4.2 2D point characterization for 3D keypoints detection
4.2.1 Related work
4.2.1.1 Depth Estimation
4.2.1.2 Keypoint Detection
4.2.2 2D Point Characterization
4.2.2.1 Depth Estimation
4.2.2.2 Keypoint Detection
4.2.3 Experimental results
4.2.3.1 Depth Maps
4.2.3.2 Keypoint Selection
4.2.3.3 Complexity and Computation Time
4.2.3.4 Limitations
4.2.4 Conclusions
4.3 Discussion
4.4 Conclusion

5 Conclusions

A Comparative study of existing intelligent assistive devices
B EyeWalker - Use cases
C Icycam specifications
D Commercially available imaging systems for 3D scene information acquisition
E Plenoptic camera Raytrix R5

Bibliography


1.1 An old lady using a rollator.
1.2 Wheeled walkers.
1.3 Existing intelligent walkers.
1.4 Thesis structure.

2.1 Detection of a FAST corner in an image patch.
2.2 Scale space theory: Example of the Difference of Gaussian (DoG) detector.
2.3 BRIEF best test sampling locations (from [1]).
2.4 Epipolar geometry of stereo vision.
2.5 Illustration of the plenoptic function.
2.6 Two-plane parametrization of the 4D Light field as explained in [2].
2.7 The difference between Plenoptic 1.0 and Plenoptic 2.0 cameras.
2.8 Raytrix technology.
2.9 Illustration of self-occlusion in stereo vision.
2.10 AdaBoost flow.
2.11 Haar-like features for face detection.

3.1 General approach to a deformable door detector.
3.2 Correspondence of door modeling.
3.3 Haar-like features for the detection of a door.
3.4 Hardware setup.
3.5 Samples of the data set of cabinet doors.
3.6 Door detection: Data annotation.
3.7 Performance of the boosting door detector built with the Haar-like features.
3.8 Examples of positive and negative triplets with common corners.
3.9 Left captures of a door.
3.10 Visualisation of the obstacle detector features.
3.11 Obstacle positive samples.
3.12 Obstacle negative samples.
3.13 Performance of the baseline obstacle detector.
3.14 Average performance of the boosting obstacle detector.
3.15 Stair detection: Rectification of the depth values.
3.16 Geometrical views of a rollator facing a descending stair.
3.17 Stairs in the warning zone detected as dangerous from the side.
3.18 Dangerous stairs not detected from the side.
3.19 Flowchart of our descending stairs detection approach.
3.20 Our experimental setup mounted with the Bumblebee2 stereo camera.
3.21 Sample captures of the scenes of stairs and curbs.
3.22 Scenes captured with the Kinect and the Bumblebee2.
3.23 Illuminance of each scene in lux.
3.24 Average of the proportion of pixels with known depth.
3.25 Resulting depth map according to the resolution for a dangerous outdoor stair scene correctly predicted.
3.26 Performance on outdoor stairs with the stereo camera as a function of resolution.
3.27 Performance on indoor stairs with the stereo camera as a function of resolution.
3.28 Performance on curbs with the stereo camera as a function of resolution.
3.29 Captures of dangerous oblique stairs correctly predicted.
3.30 Captures of dangerous indoor stairs predicted as safe.
3.31 Indoor stair capture from the Asus Xtion and the corresponding depth map.
3.32 Outdoor stair captures from an RGB-D camera and the corresponding depth map.
3.33 Indoor stairs detection performance.
3.34 Outdoor stairs detection performance.
3.35 Stair detection processing time.

4.1 Scale-invariant object detection: acquisition setup.
4.2 Scale-invariant object detector: Dataset samples.
4.3 Light field captures of a figurine at two distances.
4.4 Block diagram of the test scheme.
4.5 Object detection: Comparative results.
4.6 Two-plane parametrization of the 4D Light field.
4.7 Mona dataset and the vertical and horizontal EPIs.
4.8 Order of rays under occlusion in an EPI image.
4.9 2D mismatch image.
4.10 Flow-chart of the 2D point characterization algorithm.
4.11 Evaluation of depth estimation on the HCI light fields.
4.12 Precision-Recall curves in point filtering.
4.13 Boundary detection on the HCI light fields.


Introduction

Better living conditions, easier daily tasks and more convenient lives: are these not the wishes of each of us, for as long as possible in a lifetime? Meanwhile, the United Nations publishes regular reports on world population aging. The key findings of its latest publication [3] concern the levels and trends in population aging, the demographic drivers and sustainable development. The number of people over 60 is expected to grow by 56% by 2030, and even to double by 2050. The population of the over 80s should triple by 2050. Changes in fertility and mortality, medical technologies and the improvement of living conditions are the primary drivers for older people to live longer and healthier than before. The increase in life expectancy, however, leads to a growth in unhealthy life-years, mainly due to disability, including motor disability.

This is the context of the EyeWalker project and of the research work presented in this thesis, whose major goal is to develop new computer-vision-based approaches dedicated to embedded, real-time applications with limited resources.

Figure 1.1 – An old lady using a rollator.


Figure 1.2 – Wheeled walkers.

This chapter describes the EyeWalker project (Section 1.1), dedicated to providing a low-cost and ultra-lightweight embedded vision-based device to support elderly people who use a walker. This clip-on electronic system shall warn rollator users of dangers to prevent potential falls. The use cases, determined with the expertise of occupational therapists (Section 1.1.3), and the prototype requirements in terms of hardware (Section 1.1.4) have delimited the framework of my research and my contributions, detailed in Section 1.4. The chapter ends with the outline of this manuscript.

1.1 The EyeWalker project

1.1.1 Context

In industrialized countries, the number of mobility-impaired people is increasing, in particular among the elderly. To deal with the growth of the population over 65, governments are asked to develop policies providing the range of services and hardware support that seniors require. Such policies would help to postpone their move to a long-term nursing care facility. They include support for individuals remaining at home, which starts with access to assistive technologies such as the rollator, a walker equipped with wheels (Figure 1.2¹), widely used among the elderly (Figure 1.1²). These tools can, however, lead to falls, especially in urban zones and within buildings, the places where users spend the majority of their time. Falls occur when users misjudge the nature or the extent of obstacles in familiar or unknown environments. To help them keep using their walker confidently and safely, we want to benefit from off-the-shelf imaging and embedded-system technologies as well as from computer vision techniques to provide a low-cost assistive device that warns against dangers that could lead to falls, e.g. obstacles, stairs or curbs.

1. http://www.hadnet.org.uk/_shop/shop/mobility/rollators/

2. http://i265.photobucket.com/albums/ii235/SCOOPY01/5487141442_02d26a5d38.jpg


Figure 1.3 – Existing intelligent walkers. (a) PAMM, (b) MARC smart walker, (c) Guido, (d) iWalkActive, (e) simbiosis walker.

Technological advances in the fields of embedded systems, optics, sensors and mobile batteries have made it possible to prototype various “intelligent” walkers. These assistive devices aim at answering a number of issues faced by their target users, and several projects have developed such devices, including “smart” walkers. A comparative study including results of [4, 5] is summarized in Appendix A and illustrated in Figure 1.3. It highlights key characteristics such as the design, the target users and the assistive functionalities. All the presented devices are cumbersome and expensive. They are usually motorized and programmed to plan routes and to detect obstacles with active or passive sensors. Moreover, they are mostly intended to be used indoors for rehabilitation. The existing devices demonstrate the trend of developing complex systems full of active and passive sensors supported by large batteries.

The EyeWalker project, however, goes against this trend. It aims at developing a low-cost, ultra-light computer-vision-based device for users with mobility problems. It is meant to be an independent accessory that can be easily fixed on a standard rollator and that offers day-long battery autonomy. Our device will warn users of potentially hazardous situations or help to locate a few particular objects in diverse environments and under widely varying illumination conditions. The users we initially target are elderly persons who still live relatively independently.

This project has been a collaborative work between the University of Geneva and CSEM SA, the Centre Suisse d’Electronique et de Microtechnique, in Neuchâtel, Switzerland. It was supported by the Swiss Hasler Foundation SmartWorld Program, grant Nr. 11083 (from 2012 to 2016).

1.1.2 Definitions

The World Health Organization gives a concise description of what assistive technology is [6]:

“Assistive devices and technologies are those whose primary purpose is to maintain or improve an individual’s functioning and independence to facilitate participation and to enhance overall well-being.”

Assistive technology concerns any disability (learning, hearing, seeing, etc.). In the case of mobility impairments, the walker supports its user in keeping his/her balance while moving forward. To make a step towards “intelligent” walkers, i.e. walkers augmented with embedded computers and sensor technologies, we define what the electronic system shall identify. In the scope of the EyeWalker project, obstacles and specific objects are the primary targets to detect with computer vision algorithms.

1.1.2.1 What is an obstacle?

An obstacle is defined as something that prevents one from moving forward. This definition can be made more precise according to the domain the device is developed for. For driving assistance, an obstacle will be any object standing on a dominant ground surface [7]. In the field of health care rehabilitation, it would rather be a static or moving object on the walking path at any height from the ground to head-level [8]. As an example, a cardboard box about the size of a tissue box will be considered an obstacle for rollator users but not for autonomous cars. As far as rollator users are concerned, and in addition to the general definition, an obstacle is something that might lead to a fall. Anything of any type and shape can thus be an obstacle (e.g. a hole in the ground).

1.1.2.2 What is an object?

One defines an object as a thing that one can see and touch. In the computer vision community, Ballard and Brown gave an explicit definition focused on the identification of the object [9], following the traditional formulations of the vision problem in the 1980s. A decade later, James V. Stone broadened these formulations by focusing on what human vision does, i.e. more than identifying an object [10]. He thus proposed to take account of spatial and spatiotemporal characteristic views to solve computer vision problems. This approach is interesting as it implies the use of 3D information to infer invariance among these views. More recently, Alexe et al. proposed another definition. They defined an object as a standalone thing with at least one of the following characteristics [11]: “(a) well-defined closed boundary in space, (b) a different appearance from their surroundings and (c) sometimes it is unique within the image and stands out as salient”.

These two definitions of an obstacle and an object form the basis for the specification of the EyeWalker use cases. They also already give some hints about implementation strategies, especially regarding the importance to give to 3D information.

1.1.3 Use cases

The definition of use cases is a necessary phase in the development of commercial products. This step makes it possible to focus on what is useful to the target users.

In the EyeWalker project, the final goal is to provide a plug-in device that makes standard wheeled walkers “smart” to prevent the elderly from falling. A walker is a metal frame that the user places in front of him/her and leans on to help him/her move forward. While this aid is dedicated to any person with a motor disability, it is widespread among the elderly for their daily travels indoors and outdoors. We thus considered situations that could put the walker user at risk.

People using a walker could also have a cognitive disability that prevents them from planning a route from a starting point to an ending point. We, however, focused only on people whose cognitive impairment is negligible. Our target users are thus people who are fully aware but may have limited peripheral vision or difficulties in estimating frontal hazards. The ground, for instance, could present pitfalls that the user has seen but has not necessarily considered as a risk while moving.

The members of the project met two occupational therapists who are in daily contact with old people using wheeled walkers. The outcomes of these meetings were the following use cases split into three categories (cf. Appendix B for more details):

• Obstacles located at three different heights: high (head-level), middle (frontal) and bottom (ground-level);

• Ground-level hazards like a hole, a carpet that could lead the walker to stop suddenly, or a doorstep;

• Other situations that are not part of either the first or the second category, such as the detection of descending stairs.

This preliminary work on the user requirements sets the framework of this thesis (Section 1.3) and of the addressed research questions presented in Section 1.2.

1.1.4 A vision system with requirements and hardware constraints

Developing a device for mobility aids raises several issues, more specifically in the context of the EyeWalker project. Firstly, the range of scenes in which the device must operate is broad. Even though it will mostly run in places familiar to the users, unknown environments are also possible, such as outdoor areas including man-made architecture but also natural scenes. Secondly, it requires robustness to any lighting conditions, as the device is intended to run at any time of the day and of the year, both indoors and outdoors. These requirements lead us to think over the choice of the sensors that will equip the assistive device. The size and light-weight requirements also have an impact on the hardware choices and thus imply embedded and real-time capabilities. Last but not least, the battery shall provide a full day of autonomy.

The study of the use cases sets the basis for my investigations within two main domains essential to the development of mobility aid devices with computer vision: (i) obstacle avoidance for safe navigation in the three-dimensional (3D) space and (ii) object localization and recognition. In the context of this project, 3D embedded computer vision is the ability to generate a 3D representation of a natural scene from a mobile camera with unconstrained motion and to detect objects or hazardous situations.

1.2 Research questions

This thesis aims at addressing several research questions related to mobile computer vision combined with practical constraints of the EyeWalker project.

The ultimate goal is to propose novel strategies dedicated to object detection or recognition tasks in the 3D space with limited resources.

As mentioned in Section 1.1.4, the project is driven by limitations on power consumption and execution time. Besides, the acquisition of 3D data, compulsory for any navigation system, has an important part to play. Integrating 3D computer vision within a low-power embedded vision system calls for reducing the number of processing cycles. As a result, the resolution of the cameras, the density of the acquired 3D point cloud, the number of viewpoints/cameras and the complexity of the proposed algorithms were the main concerns.

These considerations lead to the following research questions:

• Sparse 3D points for obstacle detection

1. How can a set of sparse 3D points help detect and localize obstacles in the scene with few resources?

2. Can sparse 3D cues contribute to reducing the ambiguity of an object's appearance for its detection in a two-dimensional (2D) image?

3. How low can the resolution of cameras be reduced without affecting the performance of a detection task?

• Light field for embedded computer vision

4. How can light field imaging improve object recognition tasks?

5. To what extent can the computational requirements of approaches using light field images be reduced to reach an acceptable level for real-time applications?

To answer research questions 1 to 3, I developed object and obstacle detection methods from 2D images and sparse 3D point clouds extracted from passive stereoscopic vision, which is expected to consume less battery power than recent 3D sensors. The point clouds allow pose estimation in three-dimensional space in order to deform features extracted from a 2D image prior to classification. Light field, which is thoroughly defined in Chapter 2, is a means of acquiring a frontal scene with multiple views. The work on light field thus addresses the last two questions, using either industrial plenoptic cameras or conventional arrays of cameras.

1.3 Contextual restrictions

Within the first part of this thesis, we limited the handled use cases to:

• specific obstacles located at head-level, i.e. cabinet doors;

• generic frontal obstacles located on the path of the user;

• descending stairs and curbs that belong to ground-level.

These cases were considered appropriate to set the basis for the research on obstacle avoidance and object detection in various environmental conditions using sets of sparse 3D points. To address the research questions related to these use cases, we inherently had to deal with the question of which cameras are appropriate while keeping in mind the EyeWalker final product (its weight, its power consumption and its operability in various lighting conditions). The icyCam (Appendix C), produced by the project partner CSEM SA, answered many of the requirements. Its low power consumption is an advantage for the choice of the battery capacity and consequently the weight. Its high dynamic range allows coping with the large variability of the lighting conditions in which the device shall perform.

The boundaries were broadened in Chapter 4. Despite their high power consumption, the recent off-the-shelf plenoptic cameras have indeed opened doors to new computer vision research work we found worth studying.

1.4 Contributions

In its broad spectrum, the field of computer vision includes applied research work on object detection and recognition as well as 3D vision for depth estimation. With today’s low-priced sensors (stereo cameras) and the recent advances in camera technology (RGB-D, plenoptic), the acquisition of spatial 3D information, from sparse to dense, becomes more accessible. I thus dedicated my doctoral research to:

• 3D computer vision, which leads to research on object detection and recognition in three-dimensional space either with sparse 3D data, i.e. the depth is not defined for every pixel of the associated 2D image, or with redundant information (light field);

• research on 2D keypoint detection in conventional 2D images, a dynamic research domain since it is one of the primary stages mostly used in object recognition.

The contributions of my doctoral project are partially acknowledged by the computer vision community [12, 13, 14, 15, 16] and are:

• a boost-learning-based detector of 3D planar objects in a sparse 3D point cloud, in which the pose estimates allow deforming the features extracted from the 2D images;

• a detector trained to model the presence of a generic obstacle within a frontal sparse 3D point cloud;

• a descending stairs/curb detector dedicated to low-power and low-resolution cameras;

• a novel method of scale-invariant object recognition without any explicit depth estimation using light field information;

• a new strategy to characterize 2D points with light field imaging in order to independently evaluate the depth map as accurately as state-of-the-art approaches and discriminate points lying on depth discontinuities without any explicit depth estimation.

1.5 Thesis structure

Chapter 2 introduces the technical background needed to understand the following chapters. We present the 2D keypoints employed for 3D computer vision, namely their detectors and, where applicable, their descriptors. An in-depth review of 3D computer vision from stereo vision and light field follows. The chapter ends with the description of the machine learning algorithms employed during the research.

Chapter 3 addresses the issues of exploiting sparse 3D point clouds for obstacle, specific object and stairs detection in the context of developing an assistive device for rollator users.

Chapter 4 describes novel approaches to perform computer vision tasks with light field imaging with low complexity and fast processing times.

Chapter 5 summarizes the contributions of this thesis along with their limits before concluding on future perspectives.

Figure 1.4 – Thesis structure

Technical background

This thesis develops approaches designed for real-time embedded systems with a long battery life. They are based either on existing results validated for their advantages regarding computation time, such as corner detectors and descriptors, or on off-the-shelf vision sensors ready to provide 3D information along with the conventional images (cf. Appendix D).

This chapter gathers the knowledge necessary to understand Chapters 3 and 4. Section 2.1 presents a review of 2D keypoint detectors, especially corners and blobs, along with the BRIEF descriptor. These algorithms were employed to develop our approaches to detection tasks from sparse 3D point clouds. Section 2.2 gives a basic introduction to 3D computer vision algorithms for depth estimation dedicated to embedded systems. Finally, Section 2.3 presents the machine learning algorithms adopted in this thesis.

2.1 2D keypoints

2.1.1 Definition

With his theory of recognition-by-components for human image understanding [17], Biederman opened the doors to research on the detection of features, e.g. pieces of abstraction within a 2D image. Since then, Herault et al. have defined a keypoint as:

“a salient image point that visually stands out and is likely to remain stable under any possible image transformation such as illumination change, noise, or affine transformation to name a few” [18].

These considerations and this definition make keypoint detection one of the most widely employed algorithmic components in computer vision applications. It is still an active research topic both in 2D [19, 20, 21, 22, 23] and 3D [24, 25]. For our research, we were interested in 2D keypoints suitable for real-time applications on embedded systems. The focus was naturally placed on low-complexity corner detectors. For the sake of comparison, we also briefly present the blob detectors we tested in the evaluation of our keypoint selection approach (Section 4.2).

2.1.2 Corner detectors

2.1.2.1 Harris corner detector

One of the most well-known algorithms is the Harris corner detector [19]. Building on Moravec’s work [26], its main idea is to look for intensity variations in both the vertical and horizontal directions. Let us consider an image patch W centered on (x, y). Harris et al. examined three situations:

• The patch is uniform in intensity: the variations in both directions are small;

• The patch presents an edge: the intensity variation will be large along the axis perpendicular to the edge;

• The patch depicts a corner: the image intensity will greatly vary along any direction.

The intensity variation is thus represented by

$$E(u, v) = \sum_{x,y} W(x, y)\,[I(x+u, y+v) - I(x, y)]^2, \tag{2.1}$$

where W(x, y) is the window at location (x, y), I(x, y) is the pixel intensity at (x, y), and u and v represent the displacements along the x and y axes. The Taylor expansion allows approximating the first gradients by

$$X = I \otimes (-1, 0, 1) \approx \partial I / \partial u, \tag{2.2}$$

$$Y = I \otimes (-1, 0, 1)^T \approx \partial I / \partial v, \tag{2.3}$$

where ⊗ is the convolution operator.

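Equation 2.1 becomes

$$E(u, v) = Au^2 + 2Cuv + Bv^2 = (u, v)\, M\, (u, v)^T \tag{2.4}$$

with

$$A = X^2 \otimes W, \tag{2.5}$$

$$B = Y^2 \otimes W, \tag{2.6}$$

$$C = (XY) \otimes W, \tag{2.7}$$

$$M = \begin{pmatrix} A & C \\ C & B \end{pmatrix}. \tag{2.8}$$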

Ultimately, the measure of “cornerness” depends on the eigenvalues α and β of the matrix M. The interesting trick of Harris et al. is to avoid computing α and β explicitly and to consider instead the following corner response:

$$R = D - kT^2, \tag{2.9}$$

with k a free parameter and

$$T = \mathrm{Tr}(M) = \alpha + \beta = A + B, \tag{2.10}$$

$$D = \mathrm{Det}(M) = \alpha\beta = AB - C^2, \tag{2.11}$$

where T and D are respectively the trace and the determinant of M.
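To make Equations 2.2 to 2.11 concrete, the following is a minimal sketch of the corner response computed with NumPy. It is an illustration under stated assumptions, not the implementation used in this thesis: the window W is assumed to be a uniform (box) window, the names `box_filter` and `harris_response` are hypothetical, and k = 0.04 is only a commonly used value.

```python
import numpy as np

def box_filter(img, radius=1):
    """Sum over a (2r+1)x(2r+1) neighbourhood: a uniform window W (assumption)."""
    height, width = img.shape
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros((height, width), dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + height, dx:dx + width]
    return out

def harris_response(img, k=0.04, radius=1):
    """Corner response R = D - k*T^2 (Equation 2.9) for every pixel."""
    img = np.asarray(img, dtype=float)
    # First gradients with the (-1, 0, 1) mask (Equations 2.2 and 2.3).
    # The sign convention is irrelevant: only X^2, Y^2 and XY are used.
    X = np.zeros_like(img)
    Y = np.zeros_like(img)
    X[:, 1:-1] = img[:, 2:] - img[:, :-2]
    Y[1:-1, :] = img[2:, :] - img[:-2, :]
    A = box_filter(X * X, radius)   # Equation 2.5
    B = box_filter(Y * Y, radius)   # Equation 2.6
    C = box_filter(X * Y, radius)   # Equation 2.7
    T = A + B                       # trace of M (Equation 2.10)
    D = A * B - C * C               # determinant of M (Equation 2.11)
    return D - k * T * T

# Synthetic image: a bright square whose corner lies at row 6, column 6.
image = np.zeros((12, 12))
image[6:, 6:] = 1.0
R = harris_response(image)
```

On this synthetic image the response is positive at the corner, negative along the straight edge and zero in the uniform area, which matches the three situations listed above.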

2.1.2.2 FAST corner detector

Another corner detector is presented in [21]. Features from Accelerated Segment Test (FAST) are detected by comparing a central pixel with the 16 neighbouring pixels located on a circle of radius 3 pixels around it. A pixel is a corner if at least 12 contiguous neighbours have an intensity larger than that of the central pixel plus a threshold value, or smaller than that of the central pixel minus the threshold. To rapidly discard non-corners, four comparative tests are run first, i.e. the central pixel is compared in the following order to: (1) pixel 1, (2) pixel 9, (3) pixel 5 and (4) pixel 13 (Figure 2.1¹).

In [23] the detector was improved with a machine-learning procedure. Called FAST-ER, it is presented as an optimization of the original FAST corner detector [21] with regard to speed.
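The segment test can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes the 16-pixel circle of radius 3 of the original FAST description, numbered clockwise from the top pixel (pixel 1), and the names `CIRCLE` and `is_fast_corner` are hypothetical.

```python
# Offsets (dx, dy) of the 16 circle pixels, clockwise, starting at the top.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]
N_CONTIGUOUS = 12

def is_fast_corner(img, x, y, threshold):
    """Segment test on pixel (x, y) of a 2D list of intensities."""
    p = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]

    # Quick rejection on pixels 1, 9, 5 and 13 (indices 0, 8, 4, 12):
    # a run of 12 contiguous pixels always contains at least 3 of them.
    quick = [ring[i] for i in (0, 8, 4, 12)]
    brighter = sum(v > p + threshold for v in quick)
    darker = sum(v < p - threshold for v in quick)
    if brighter < 3 and darker < 3:
        return False

    # Full test: longest circular run of brighter (sign=1) or darker (sign=-1) pixels.
    for sign in (1, -1):
        flags = [sign * (v - p) > threshold for v in ring]
        run = best = 0
        for flag in flags + flags:  # list doubled to handle wrap-around
            run = run + 1 if flag else 0
            best = max(best, run)
        if best >= N_CONTIGUOUS:
            return True
    return False
```

The quick rejection step only reads four pixels for most non-corner locations, which is where the speed of the detector comes from.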

1. http://www.clarkbuildersgroup.com/images/sus-building.jpg

Figure 2.1 – Detection of a FAST corner in an image patch. The numbered pixels are compared to the central pixel p.

2.1.3 Blob detectors

From the theory of the focus-of-attention, Lindeberg derived the notion of blobs [27] and, more importantly, the theory of scale space for their detection [28]. The principle consists in building a scale space from either the Laplacian of Gaussian (LoG) or the Difference of Gaussians (DoG), as depicted in Figure 2.2²:

• the first scale is a set of images resulting from the iterative Gaussian smoothing of the original image;

• the following scales are built iteratively by down-sampling the set of images of the previous scale;

• within each scale, consecutive pairs of images are subtracted to get the difference of Gaussians.

Through each layer of each scale, we look for local extrema that define the scale-invariant interest points.
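The three construction steps listed above can be sketched with NumPy as follows. This is a simplified illustration, not a full SIFT-style detector: the smoothing parameter `sigma`, the number of layers and the function names are assumptions, down-sampling simply keeps every second pixel, and the search for local extrema is left out.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def blur(img, sigma):
    """Separable Gaussian smoothing with replicated borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)

def dog_pyramid(img, n_scales=3, n_layers=3, sigma=1.0):
    """Return one list of difference-of-Gaussian layers per scale."""
    # First scale: iterative Gaussian smoothing of the original image.
    levels = [np.asarray(img, dtype=float)]
    for _ in range(n_layers):
        levels.append(blur(levels[-1], sigma))
    pyramid = []
    for _ in range(n_scales):
        # Within a scale: subtract consecutive pairs of smoothed images.
        pyramid.append([b - a for a, b in zip(levels, levels[1:])])
        # Next scale: down-sample the whole set of images.
        levels = [level[::2, ::2] for level in levels]
    return pyramid
```

A uniform image yields DoG layers equal to zero, whereas an isolated bright spot produces a non-zero response whose extremum across layers and scales indicates the blob location and size.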

Figure 2.2 – Scale space theory: Example of the DoG detector.

2. http://slideplayer.com/slide/6277126/

In the early 2000s, Lowe [20] and Bay et al. [22] respectively proposed the detection of SIFT and SURF keypoints. While SIFT detection is based on DoG, SURF features are extrema detected in a LoG-based scale space built from the Hessian matrix defined by:

$$H(p, \sigma) = \begin{pmatrix} L_{xx}(p, \sigma) & L_{xy}(p, \sigma) \\ L_{xy}(p, \sigma) & L_{yy}(p, \sigma) \end{pmatrix}, \tag{2.12}$$

where L_xx(p, σ), L_xy(p, σ) and L_yy(p, σ) are the results of the convolution of the Gaussian second derivatives ∂²g(σ)/∂x², ∂²g(σ)/∂x∂y and ∂²g(σ)/∂y² with the image at point p(x, y).

2.1.4 Descriptors

The detection of interest points is a necessary stage for matching, which is required in many applications such as image retrieval, recognition and, as far as we are concerned, stereo vision. The simplest way of matching two image patches centered on keypoints is the computation of a correlation score, e.g. the Sum of Absolute Differences (SAD). Depending on the application, such methods are ineffective because they are not robust to scale, rotation and affine transforms. Researchers have been developing approaches to address these limitations. They aim at describing a keypoint in a compact way from its surrounding visual information.

The leading descriptors are SIFT [20] and SURF [22]. They, however, suffer from at least two of the following limitations [1]:

• slow to compute,

• slow to match,

• the actual description requires floating-point values,

• a dimensional reduction can degrade their performance.

These drawbacks make SIFT and SURF unsatisfactory candidates for embedded real-time applications. Calonder et al. proposed Binary Robust Independent Elementary Features (BRIEF), a binary descriptor built by comparing the intensities of pixels p(a) and p(b) within the surroundings of the keypoint:

$$\sum_{i=0}^{N-1} \tau(p; a, b)\, 2^i, \tag{2.13}$$

where N is the dimension of the bit string and τ(p; a, b) is the result of the comparison between pixels p(a) and p(b):

$$\tau(p; a, b) = \begin{cases} 1 & \text{if } p(a) < p(b), \\ 0 & \text{otherwise.} \end{cases} \tag{2.14}$$

Several approaches to selecting the number N of pixel pairs used to build the descriptor were evaluated in terms of recognition rate. The best sampling pattern, depicted in Figure 2.3, was found by randomly picking 128 test locations from an isotropic Gaussian distribution. Matching is done with the Hamming distance, which further reduces the computation time.
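As an illustration of Equations 2.13 and 2.14, the sketch below builds a BRIEF-like bit string and compares two descriptors with the Hamming distance. It is not the reference implementation of [1]: the patch radius, the standard deviation of the Gaussian sampling, the absence of pre-smoothing and the function names are all assumptions made for the example.

```python
import random

def make_pairs(n_bits=128, patch_radius=4, seed=0):
    """Draw n_bits pixel pairs (a, b) from an isotropic Gaussian (hypothetical sigma)."""
    rng = random.Random(seed)
    sigma = patch_radius / 2.0  # assumption, not a value taken from the thesis

    def coordinate():
        value = int(round(rng.gauss(0.0, sigma)))
        return max(-patch_radius, min(patch_radius, value))  # stay inside the patch

    return [((coordinate(), coordinate()), (coordinate(), coordinate()))
            for _ in range(n_bits)]

def brief(img, x, y, pairs):
    """Bit string of Equation 2.13, stored as a Python integer."""
    descriptor = 0
    for i, ((ax, ay), (bx, by)) in enumerate(pairs):
        # tau(p; a, b) = 1 if p(a) < p(b), 0 otherwise (Equation 2.14)
        if img[y + ay][x + ax] < img[y + by][x + bx]:
            descriptor |= 1 << i
    return descriptor

def hamming(d1, d2):
    """Number of differing bits between two descriptors."""
    return bin(d1 ^ d2).count("1")
```

Because each bit only encodes an intensity ordering, adding a constant to all the pixels of the patch leaves the descriptor unchanged, and the comparison of two descriptors reduces to an XOR followed by a bit count.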

Figure 2.3 – BRIEF best test sampling locations (from [1]).

2.2 3D Computer vision

3D computer vision started at the intersection of stereo/multi-view vision and computer vision [29]. While standard computer vision covers the acquisition, processing and analysis of a 2D picture for scene understanding, a geometrical approach adds value to it. Well-known topics are 3D reconstruction [30] and depth estimation [31].

This section covers the computer vision from two to multiple viewpoints.

2.2.1 Stereo vision

The first case of stereo vision that each of us has experienced is human vision. With two eyes (and a brain), a human being can infer the depth of a frontal scene from the pair of images. Likewise, computer stereo vision combines epipolar geometry and the pinhole camera model assumption to derive relations between the frontal 3D scene and the two 2D projections [32]. Figure 2.4 and Equation 2.15 illustrate the geometry of stereo vision in the case of parallel cameras separated by a distance B (called the baseline). For a known 3D point P, the locations of its projections p_L and p_R on the two images are thus determined.

$$Z = \frac{f \times B}{B - D}. \tag{2.15}$$

Figure 2.4 – Epipolar geometry of stereo vision.

In computer vision, however, the depth extraction of a 3D point is possible only if we locate its projections on the pair of images. Since the surroundings of the two projections are very similar, computer vision aims at finding the correspondence between the two images. The following section describes the approaches.

2.2.1.1 Stereo matching

The stereo matching approaches can be categorized into two groups: sparse or dense [33]. The first approach is also known as feature-based matching and results in a sparse output. The correspondence process is applied to features such as corners, edges or keypoints [22]. To compare the different keypoints, we need to measure their similarity. This similarity can result either from comparing their surroundings via patches or from comparing attributes commonly called descriptors [34]. Each descriptor of the left image points is compared to the list of descriptors of the right image points and matched to the most similar one. Feature descriptors tend to be robust against orientation and intensity variation while keypoints are robust to perspective changes. Thus this method can be applied to real-time applications that require a very sparse depth map [12], for example in image registration applications.
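The feature-based correspondence step described above can be sketched as a nearest-neighbour search. The example assumes binary descriptors stored as Python integers and compared with the Hamming distance (as for BRIEF in Section 2.1.4); the function names and the `max_distance` rejection threshold are hypothetical and not taken from the thesis.

```python
def hamming(d1, d2):
    """Number of differing bits between two binary descriptors."""
    return bin(d1 ^ d2).count("1")

def match_left_to_right(left_descriptors, right_descriptors, max_distance):
    """For each left descriptor, keep the most similar right descriptor.

    Returns a list of (left_index, right_index, distance) tuples; matches
    whose distance exceeds max_distance are discarded.
    """
    matches = []
    if not right_descriptors:
        return matches
    for i, left in enumerate(left_descriptors):
        j = min(range(len(right_descriptors)),
                key=lambda idx: hamming(left, right_descriptors[idx]))
        distance = hamming(left, right_descriptors[j])
        if distance <= max_distance:
            matches.append((i, j, distance))
    return matches
```

This brute-force search compares every left descriptor with every right descriptor; with rectified cameras, the candidates could additionally be restricted to keypoints lying on the same image row, which is a design choice left out of this sketch.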

The second stereo correspondence approach relies on comparing patches of images to minimize a cost function. This cost function can be local or global. In the case of local methods, the aim is to reduce the difference between the patches located on the epipolar lines in order to finally get the disparity for every pixel of the reference image. But stereo matching algorithms can be time, memory and power consuming. Konolige proposed one based on the sum of absolute differences (SAD) and implemented it on an FPGA to run in real time [35].

From the matched points, we can extract a disparity map. The disparity, d, is the difference between the x-coordinates of the detected point in the two pictures (p_L and p_R),

$$d = x_L - x_R, \tag{2.16}$$

where x_L and x_R are the x-coordinates of the 3D point projected on the left and right images. Provided the matching is correct, the depth map is built from the disparity map by triangulation [32]. Assuming the pinhole camera model and two cameras with the same focal length f, separated by a baseline B, the distance of a detected point is

$$Z = \frac{f \times B}{d}, \tag{2.17}$$

where Z and B are expressed in meters and f and d in pixels. Equation 2.17 is equivalent to Equation 2.15.
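As a small worked example of the disparity-to-depth relation above, the sketch below converts one matched pair of x-coordinates into a distance. The function name and the numerical values (focal length of 500 pixels, baseline of 0.1 m) are hypothetical and only serve the illustration.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Distance Z (meters) of a matched point from its disparity (pixels).

    Returns None when the disparity is not strictly positive, i.e. when the
    match is inconsistent with parallel cameras or the point is at infinity.
    """
    disparity = x_left - x_right            # Equation 2.16
    if disparity <= 0:
        return None
    return focal_px * baseline_m / disparity  # Equation 2.17

# Hypothetical setup: f = 500 pixels, baseline = 0.1 m.
z_near = depth_from_disparity(320.0, 300.0, 500.0, 0.1)  # disparity of 20 pixels
z_far = depth_from_disparity(320.0, 310.0, 500.0, 0.1)   # disparity of 10 pixels
```

With these values, a disparity of 20 pixels corresponds to 2.5 m and a disparity of 10 pixels to 5 m: halving the disparity doubles the distance, which is why the depth resolution of a stereo system degrades for far objects.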

2.2.1.2 Stereo cameras

Stereo correspondence is a challenging field of research in terms of software and hardware implementation [33]. It has to respond to the high demand for real-time execution and high frame rates in many domains like machine vision and navigation. Passive stereo vision also suffers from matching failures on low-textured regions and repetitive patterns [36].

Projecting a texture on the scene improves the stereo matching drastically. Projector-based systems became serious competitors to passive stereo cameras. However, the main drawbacks of such IR-projector-based sensors are their inability to work outdoors and their power consumption. The authors of [37] also showed the degradation of the 3D reconstruction at different times of the day: the stronger the illuminance, the poorer the quality of the resulting 3D map. Thus passive stereo cameras are still employed for outdoor applications related to navigation [38], whereas active ones lead the indoor applications. Examples of commercially available active stereo cameras, i.e. structured-light based systems developed to cope with textureless scenes, are the Microsoft Kinect, the Asus Xtion and, more recently, the Structure Sensor (more details in Appendix D).

The way to capture a scene from two viewpoints can be extended to multiview and generalized to light field.

2.2.2 Multi-view vision or light field

2.2.2.1 Light field definition

With more than two views, we can call the whole capture a subset of the “light field”. The definition of light field comes from the plenoptic function [39]. For each point in the 3D scene, the intensity distribution is

$$P(\theta, \phi, \lambda, t, V_x, V_y, V_z), \tag{2.18}$$

where θ and φ are the spherical coordinates of the direction of the light ray, λ the wavelength and t the time dimension. V_x, V_y and V_z define the viewpoint (Figure 2.5).

Figure 2.5 – Illustration of the plenoptic function: two viewpoints with light coming from any direction (from [39]).

2.2.2.2 Capturing a light field

In practice, a conventional camera is capable of recording a 2D slice of the scene irradiance. With the multi-view strategy, we are able to add three other dimensions describing the location of the viewpoint. Light fields are thus captured with an array of cameras, a gantry [40] or the use of a turntable and a robot [41]. Another way to augment the conventional image capture with two dimensions is the use of a microlens array. This 4D parameterization (x, y, u, v) is done by two planes, the viewpoint plane, (u, v), and the sensor plane, (x, y), and allows the measurement of the directional distribution of the light [2] (Figure 2.6). A camera equipped with such a microlens array is called a 4D plenoptic camera, with commercial versions like Lytro³ and Raytrix⁴.

3. https://www.lytro.com/

4. http://www.raytrix.de/

Figure 2.6 – Two-plane parametrization of the 4D Light field as explained in [2].
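To illustrate the two-plane parameterization (x, y, u, v), the sketch below stores a small synthetic light field as nested Python lists and extracts two kinds of slices from it. The storage order `lf[v][u][y][x]`, the synthetic content and the function names are assumptions chosen for the example; they do not describe the file format of any plenoptic camera.

```python
def make_light_field(n_v, n_u, height, width):
    """Synthetic 4D light field indexed as lf[v][u][y][x] (assumed layout).

    Each sample simply stores its own coordinates (u, v, x, y), so that the
    slicing functions below are easy to check.
    """
    return [[[[(u, v, x, y) for x in range(width)]
              for y in range(height)]
             for u in range(n_u)]
            for v in range(n_v)]

def sub_aperture_view(lf, u, v):
    """2D image seen from viewpoint (u, v): a conventional picture."""
    return lf[v][u]

def angular_samples(lf, x, y):
    """All the rays reaching sensor position (x, y), one per viewpoint."""
    return [[view[y][x] for view in row] for row in lf]

lf = make_light_field(n_v=2, n_u=3, height=4, width=5)
view = sub_aperture_view(lf, u=2, v=1)
rays = angular_samples(lf, x=4, y=3)
```

Fixing (u, v) gives back a conventional 2D image, while fixing (x, y) gathers the directional distribution of the light at one sensor position, which is the extra information a light field provides over a single view.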

Figure 2.7 – The difference between a Plenoptic 1.0 camera (left) and a Plenoptic 2.0 camera (right) lies in the location of the microlens array relative to the main lens. In Plenoptic 1.0 the microlens array is on the image plane of the main lens (1/a + 1/b = 1/f, where f is the focal length of the microlenses) [42].

2.2.3 Raytrix: a plenoptic camera

In a standard 2D camera, the image is formed by the main lens that projects the image of the scene onto the sensor. In a light field camera, there is an additional array of microlenses. The captures are lens-grid-based representations of the light field. As of today, there are two types of cameras: the Plenoptic 1.0 cameras and the Plenoptic 2.0 cameras [42] (Figure 2.7). In the Plenoptic 1.0 camera, the main lens projects the image onto the array of microlenses, which then forms a set of micro-images on the sensor. A very simple relationship between the coordinates on the sensor and the light field (x, y, u, v) coordinates characterizes the 1.0 approach. The resulting reconstructed image has a number of pixels equal to the number of microlenses in the microlens array.

An example of a commercial Plenoptic 1.0 camera is the Lytro camera. In a Plenoptic 2.0 camera, the image formed by the main lens is either in front of or behind the microlens array. This approach allows for better resolution, but the price to pay is a complex relationship between the light field (x, y, u, v) and the sensor coordinates. The Raytrix R5 camera (Appendix E) is a Plenoptic 2.0 camera [43] composed of an array of around 7900 microlenses, and the image formed by the main lens (called here the virtual image) falls behind the microlens array.

It has an additional extended depth-of-field property obtained by incorporating three types of microlenses with three different focal lengths [43]. The microlenses lie on a hexagonal grid that optimizes the sensor coverage (Figure 2.8). A raw light field image is composed of a bubble-like pattern; each bubble is the projection of the virtual image by a single microlens. In the rest of the manuscript, a bubble region is called a micro-image, referring to a microlens.

Figure 2.8 – Raytrix technology : Three types of microlenses on a hexagonal grid.

2.2.4 Depth estimation and occlusions

As seen in Section 2.2.1, depth can be estimated from a pair of images by triangulation. The main issue, however, is occlusions. Let us consider a 3D scene. An occlusion is a part of the scene that is visible from only one of the two viewpoints in a stereo system (Figure 2.9). Such occlusions can occur on a single object (self-occlusions) or result from an object in the foreground hiding a part of the background of the scene. As a consequence, the depth of these regions cannot be estimated.

Figure 2.9 – Illustration of self-occlusion in stereo vision. Aerial view of a 3D object captured with two cameras. Point A is occluded and not visible in the right image, similarly for point D in the left image.

To address the occlusion issues, multi-view stereo [44] and, more recently, light field approaches [45, 46, 47, 48, 49] produce accurate dense depth maps with occlusion handling.

As far as computation is concerned, depth estimation from a 4D light field is time-consuming, and research is still in progress. In 2012, Wanner et al. reported 15 minutes to compute the depth map from the Stanford Lego Truck dataset⁵, while the approach of Kim et al. required 64 seconds. From 4D light fields captured with a plenoptic camera, Jeon et al. proposed an accurate approach that needs 6 minutes to output the depth map with a Matlab implementation.

5. http://lightfield.stanford.edu/lfs.html

In the following chapter, the depth is estimated at specific and sparse locations to reduce the computation time to meet real-time requirements of the EyeWalker project, i.e. about 5 frames per second (fps).

2.3 Machine learning for computer vision

The following recalls the main machine learning algorithms used in my contributions.

2.3.1 What is machine learning?

In [50], Mitchell defines machine learning from the research questions the community sought to answer:

“How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” (page 2 of [50]).

He then emphasized three points necessary for a machine to learn:

• the task T the machine is expected to be capable of doing after the learning process;

• the performance metric P that shall improve during the learning phase;

• the experience E, or the examples the system learns from.

Machine learning is applied to many fields involving lots of data (e.g. finance, visual and audio signals). It is mainly employed to make predictions from the model built during the learning phase. I will present the machine learning algorithms that I used in this thesis.

2.3.2 Cascade of boosted classifiers

This section presents the approach to the automatic selection of weak features for the object and obstacle detection tasks presented in Chapters 3 and 4.

The main idea behind boosting is to combine “weak” learning algorithms into a “strong” one, where “weak” means only slightly better than a random decision. In [51], Schapire presents a concise introduction to boosting classification and the details of the AdaBoost algorithm. The AdaBoost algorithm [52] builds a strong classifier H from a set of weak binary classifiers h_t, i.e. H is the sign of a linear combination of weak classifiers:

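$$H(x) = \operatorname{sign}\left(\sum_{t} \alpha_t \, h_t(x)\right) \tag{2.19}$$

$$\alpha_t = \frac{1}{2} \ln\frac{1-\epsilon_t}{\epsilon_t}, \tag{2.20}$$

ε_t being the error of the weak classifier h_t on the training data (x_i, y_i), i ∈ [1, m], where x_i ∈ X and y_i ∈ {−1, 1}.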

Schapire describes the full algorithm in [51]; it is summarized in Figure 2.10.

Figure 2.10 – AdaBoost flow. D_t is the distribution over the training samples, which the algorithm maintains over the training loops.
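As a minimal sketch of Equations (2.19) and (2.20), the following Python code trains AdaBoost on one-dimensional data with threshold decision stumps as weak classifiers. The stump family, the function names (`make_stump`, `train_adaboost`, `strong_classify`) and the toy data are assumptions made for illustration; they are not taken from [51] or [52]. The weight update is the standard AdaBoost rule, in which D_{t+1}(i) is proportional to D_t(i)·exp(−α_t·y_i·h_t(x_i)).

```python
import math

def make_stump(threshold, polarity):
    """Weak classifier: returns polarity if x > threshold, else -polarity."""
    return lambda x: polarity if x > threshold else -polarity

def weighted_error(h, xs, ys, weights):
    """Error of h under the current distribution D_t (sum of misclassified weights)."""
    return sum(w for w, x, y in zip(weights, xs, ys) if h(x) != y)

def train_adaboost(xs, ys, rounds):
    """Return a list of (alpha_t, h_t) pairs. Labels ys must be in {-1, +1}."""
    m = len(xs)
    weights = [1.0 / m] * m  # D_1: uniform distribution over the samples
    # Candidate stumps (illustrative choice): one threshold below all points,
    # then one midway between each pair of neighbouring points.
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    candidates = [make_stump(th, p) for th in thresholds for p in (1, -1)]
    ensemble = []
    for _ in range(rounds):
        h_t = min(candidates, key=lambda h: weighted_error(h, xs, ys, weights))
        eps_t = weighted_error(h_t, xs, ys, weights)
        if eps_t >= 0.5:  # no better than a random decision: stop
            break
        eps_t = max(eps_t, 1e-10)  # guard against division by zero
        alpha_t = 0.5 * math.log((1.0 - eps_t) / eps_t)  # Equation (2.20)
        ensemble.append((alpha_t, h_t))
        # Misclassified samples gain weight, correct ones lose weight; then normalise.
        weights = [w * math.exp(-alpha_t * y * h_t(x)) for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def strong_classify(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x)), Equation (2.19)."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump separates: positives on both sides of the negatives.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = train_adaboost(xs, ys, rounds=3)
print([strong_classify(ensemble, x) for x in xs])
```

On this toy set, no single stump classifies all six samples correctly, whereas the combination of three stumps does, which illustrates how weak classifiers are combined into a strong one.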

One of the most popular applications of AdaBoost is the Viola-Jones face detector [53], in which AdaBoost is employed to select a small number of Haar-like features, each represented by a weak classifier.

Figure 2.11 – Haar-like features for face detection employed by Viola and Jones.

The main advantage of cascading boosted classifiers is speed. The first weak classifiers tested reject the “strong” negative samples, so the subsequent weak classifiers are not evaluated on them, which saves execution time. This makes the algorithm suitable for embedded applications. However, its performance can be unsatisfactory depending on the application: the face detector, for example, has a false positive rate of 0.4 % for a correct detection rate of 99 %.
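The early-rejection mechanism can be sketched as follows. Each stage is assumed to be a boosted classifier with its own threshold; a sample is rejected as soon as one stage score falls below that threshold, so the later stages are never evaluated for it. The stage layout, the thresholds and the toy weak classifiers are hypothetical and only illustrate the control flow, not the detector of [53].

```python
def cascade_classify(stages, x):
    """Evaluate a cascade on sample x.

    stages: list of (weak_classifiers, threshold) pairs, where weak_classifiers
    is a list of (alpha, h) pairs and h(x) returns -1 or +1.
    Returns (label, number of weak classifiers evaluated).
    """
    evaluated = 0
    for weak_classifiers, threshold in stages:
        score = 0.0
        for alpha, h in weak_classifiers:
            score += alpha * h(x)
            evaluated += 1
        if score < threshold:
            return -1, evaluated  # early rejection: remaining stages are skipped
    return 1, evaluated  # accepted by every stage

# Toy cascade on scalar inputs (hypothetical stages).
stages = [
    # Cheap first stage: a single weak classifier.
    ([(1.0, lambda x: 1 if x > 0 else -1)], 0.0),
    # Costlier second stage: three weak classifiers.
    ([(0.6, lambda x: 1 if x > 5 else -1),
      (0.4, lambda x: 1 if x < 9 else -1),
      (0.5, lambda x: 1 if x % 2 == 0 else -1)], 0.0),
]

print(cascade_classify(stages, -3))  # rejected by the first stage
print(cascade_classify(stages, 6))   # accepted after both stages
```

The second value returned counts the weak classifiers actually evaluated, which makes the saving visible: an input rejected by the first stage costs one evaluation instead of four.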


2.3.3 Bag-of-visual-words

This section gives the required basis for the scale-invariant object recognition approach detailed in Chapter 4.

The principle of the bag-of-words (BoW) model comes from the text retrieval community. In computer vision, the method treats an image as a text: the image is represented by a sparse vector, each element of which corresponds to a visual word. Sivic and Zisserman [54] proposed the first application to video retrieval. In [55], Csurka et al. used this visual vocabulary strategy for object categorization with bag-of-features (BoF). More recently, the strategy was employed for unsupervised feature learning [56]: experiments on patches extracted from YouTube⁶ thumbnails demonstrate the unsupervised construction of features dedicated to human face detection.

The BoW approach consists in the following steps:

• the extraction of selected patches from the training data. A common selection procedure is based on feature detection, such as scale-space extrema detection [20];

• the representation of these patches. In [55], for instance, the feature descriptor employed was the scale-invariant feature transform (SIFT) [20];

• the creation of a dictionary of visual words, also called a codebook. A codeword shall represent a set of similar patches. A straightforward method is to run a clustering algorithm, “k-means” being often used [55, 56].

Each training image is then represented by a histogram of codeword occurrences, which serves as the reference to compare with the histogram of the test image.
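The codebook and histogram steps can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: descriptors are toy 2-D points rather than SIFT vectors, k-means is initialised with the first k descriptors, and the L1 distance used to compare histograms is an arbitrary choice. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(codebook, descriptor):
    """Index of the codeword closest to the descriptor."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], descriptor))

def kmeans(descriptors, k, iterations=20):
    """Plain k-means; centroids start at the first k descriptors (a simplification)."""
    centroids = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(centroids, d)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def bow_histogram(codebook, descriptors):
    """Normalised histogram of codeword occurrences for one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest(codebook, d)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def histogram_distance(h1, h2):
    """L1 distance between two histograms (assumed comparison metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Toy descriptors forming two groups, standing in for patch descriptors.
training_descriptors = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 9), (9, 10)]
codebook = kmeans(training_descriptors, k=2)

reference = bow_histogram(codebook, [(0, 0), (1, 1), (9, 9), (10, 10)])
test_image = bow_histogram(codebook, [(0.5, 0.5), (9.5, 9.5), (10, 10), (9, 9)])
print(test_image, histogram_distance(reference, test_image))
```

Each codeword is the centroid of a cluster of similar descriptors, and an image is summarised by how often each codeword is the nearest one to its descriptors.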

2.3.4 Binary classifier performance assessment

Any classification approach requires a set of measures to assess its performance. They are computed on a test dataset. The outcome of a binary classifier on this test set is a 2×2 confusion matrix (Table 2.1) that gathers the following four groups, assuming the labels of the two classes are ‘positive’ and ‘negative’:

• samples that are correctly predicted:

1. True positives (TP)

2. True negatives (TN)

• samples that are incorrectly predicted:

3. False positives (FP)

4. False negatives (FN)
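As an illustration, the four groups can be counted from the true and predicted labels of a test set. The function name `confusion_counts` and the toy labels below are hypothetical; the labels follow the {−1, 1} convention used for AdaBoost above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classifier."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Toy test set of eight samples.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # prints "3 3 1 1"
```

The four counts always sum to the number of test samples, which is a convenient sanity check when filling in the confusion matrix.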