Discussion

Over the years, research has focused primarily on static images. The approaches proposed so far can be grouped into two major categories: generative and discriminative. The fundamental differences between these two categories include:

• model fitting based on coefficient prediction or mapping between the image and the shape;

• explicit shape model or implicitly embedded shape constraints;

• independent landmark prediction or joint prediction;

• holistic or part-based representation;

• one step prediction or coarse-to-fine strategy;

• the need for a shape initialization.

Many solutions have also been developed to address specific challenges and can be integrated into any type of approach. They rely on techniques including 3D models [287], multiview models [50], multitask learning [280], GANs [24], and explicit occlusion modeling [269]. However, these solutions can make dataset annotation and model training much more complex.

Discriminative approaches perform landmark prediction through a mapping between the image and the shape. This makes them faster and more accurate than the approaches based on generative models [105, 178]. Landmarks are jointly predicted, which implicitly integrates the shape constraints. The mapping is generally implemented using a cascade of regressors, following a coarse-to-fine strategy, since direct mapping is hard to achieve.
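To make the cascade-of-regressors idea concrete, the following minimal sketch refines an initial shape stage by stage, each stage adding an increment computed from the image and the current estimate. It is an illustration under stated assumptions, not the method of any cited work: the names `cascaded_regression` and `make_toy_stage` are hypothetical, and the toy stages simply move a fixed fraction toward a known target instead of being learned from image features.

```python
from typing import Callable, List, Sequence

Shape = List[float]  # flattened landmark coordinates [x1, y1, x2, y2, ...]
# A stage maps (image, current shape) to a shape increment.
Regressor = Callable[[object, Shape], Shape]


def cascaded_regression(image: object,
                        initial_shape: Sequence[float],
                        stages: Sequence[Regressor]) -> Shape:
    """Refine a shape estimate stage by stage: S_t = S_{t-1} + R_t(image, S_{t-1})."""
    shape = list(initial_shape)
    for regress in stages:
        increment = regress(image, shape)
        shape = [s + d for s, d in zip(shape, increment)]
    return shape


def make_toy_stage(target: Sequence[float], step: float) -> Regressor:
    """Hypothetical stand-in for a learned regressor.

    It ignores the image and moves a fraction `step` of the way toward `target`.
    A real stage would predict the increment from shape-indexed image features.
    """
    def stage(_image: object, shape: Shape) -> Shape:
        return [step * (t - s) for t, s in zip(target, shape)]
    return stage


# Toy usage: three stages, each halving the remaining error (coarse to fine).
target = [10.0, 20.0, 30.0, 40.0]
initial = [0.0, 0.0, 0.0, 0.0]  # plays the role of the shape initialization
stages = [make_toy_stage(target, 0.5) for _ in range(3)]
refined = cascaded_regression(None, initial, stages)
```

Each stage corrects a smaller residual than the previous one, which is the coarse-to-fine behaviour referred to above. The dependence on `initial` also illustrates why such cascades need a shape initialization.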

Among the discriminative approaches, DL-based ones have gained increasing popularity [23]. They allow discriminative and problem-specific feature learning, as opposed to the hand-crafted features used in traditional approaches. The latter are generic and likely to be suboptimal for facial landmark detection [122]. With DNNs, feature extraction and regression are trained jointly, so the relationship between images and shapes is modeled more effectively. DL has notably been successful with direct mapping, showing that DR remains a relevant solution [146, 148].

Current approaches still encounter difficulties under uncontrolled conditions [184]. Although most of the challenges have been addressed, this is often achieved with solutions specific to one or a few challenges. There is a lack of approaches that can address all the difficulties at once. With regard to the literature, temporal approaches appear to be the most promising direction. Today, due to the increase in video data and their potential benefits, the interest in temporal approaches is growing [193]. Currently, most landmark detection approaches are unable to take advantage of temporal information. They are often applied in a tracking-by-detection manner. More advanced strategies have recently been proposed [32]. They mainly rely on tracking (e.g., box, landmark, pose) along with incremental learning and, to a lesser extent, on temporal smoothing [56, 207]. However, these approaches are subject to drift and so far do not fully exploit facial dynamics. The most advanced ones, based on RNNs, appear to be limited to the global movements of the head [133, 89, 79]. The CNNs currently used as a backbone cannot model any motion.
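As an illustration of temporal smoothing applied on top of tracking-by-detection, the sketch below filters per-frame detections with an exponential moving average. This is an assumed, generic smoothing scheme chosen for brevity; the text does not specify which filter the cited works use, and the function name `smooth_landmarks` and the parameter `alpha` are hypothetical.

```python
from typing import List, Optional, Sequence

Landmarks = List[float]  # flattened coordinates for one frame


def smooth_landmarks(frames: Sequence[Sequence[float]],
                     alpha: float = 0.5) -> List[Landmarks]:
    """Exponentially smooth independent per-frame detections.

    alpha = 1.0 keeps the raw detections (pure tracking-by-detection);
    smaller values give more weight to the previous smoothed estimate.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[Landmarks] = []
    previous: Optional[Landmarks] = None
    for detection in frames:
        if previous is None:
            current = list(detection)  # first frame: nothing to smooth with
        else:
            current = [alpha * d + (1.0 - alpha) * p
                       for d, p in zip(detection, previous)]
        smoothed.append(current)
        previous = current
    return smoothed


# Toy usage: a landmark that jumps from (0, 0) to (4, 8) between frames.
detections = [[0.0, 0.0], [4.0, 8.0], [4.0, 8.0]]
stable = smooth_landmarks(detections, alpha=0.5)
```

The smoothed output lags behind the true motion (the jump is only partly followed in the second frame). Such post hoc filtering stabilizes predictions but does not model facial dynamics.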

Moreover, given the overall progress accomplished, more attention should be paid to evaluation protocols. There is a lack of comprehensive analysis to quantify the performance of current approaches according to the different difficulties [271]. Regarding temporal approaches, the contribution of temporal information according to the difficulties, and compared to static approaches, is not well quantified. Such an analysis would help to better identify the remaining efforts to be made, and could also be useful for anyone who uses landmarks as a pre-processing step in an application (e.g., expression recognition).
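A per-difficulty analysis of this kind can be sketched as follows: a point-to-point error normalized by a reference distance is computed for each sample and then averaged per condition label. This is a minimal sketch under assumptions: the normalizing distance (for instance an inter-ocular distance), the condition labels, and the function names are illustrative and are not taken from the protocol defined later in this chapter.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, float]
Sample = Tuple[str, Sequence[Point], Sequence[Point], float]


def normalized_mean_error(pred: Sequence[Point],
                          truth: Sequence[Point],
                          norm: float) -> float:
    """Mean Euclidean distance between predicted and true points, divided by `norm`."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and of equal length")
    if norm <= 0:
        raise ValueError("norm must be positive")
    total = sum(math.hypot(px - tx, py - ty)
                for (px, py), (tx, ty) in zip(pred, truth))
    return total / (len(pred) * norm)


def error_by_condition(samples: Iterable[Sample]) -> Dict[str, float]:
    """Average the normalized error separately for each condition label.

    Each sample is (condition, predicted points, ground-truth points, norm).
    """
    errors: Dict[str, List[float]] = defaultdict(list)
    for condition, pred, truth, norm in samples:
        errors[condition].append(normalized_mean_error(pred, truth, norm))
    return {c: sum(v) / len(v) for c, v in errors.items()}


# Toy usage with two hypothetical condition labels.
truth = [(0.0, 0.0), (10.0, 0.0)]
samples = [
    ("pose", [(3.0, 4.0), (10.0, 0.0)], truth, 10.0),  # one point is 5 px off
    ("expression", list(truth), truth, 10.0),          # perfect prediction
]
report = error_by_condition(samples)
```

Reporting the error per condition rather than as a single average shows which difficulty dominates the overall figure.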

Consequently, the issues addressed in this work are related to the evaluation protocols and temporal approaches. The first goal is to quantify the impact of major challenges on landmark detection on the one hand, and on the other, to quantify the contribution of temporal approaches. The recent availability of datasets such as SNaP-2DFe [2] presents an opportunity to address these issues. The second goal is to better exploit the dynamic nature of the face in order to obtain more stable predictions over time and more robustness to the full set of variations that considerably impact facial appearance.

One hypothesis is that early temporal connectivity could help to model motion more finely as well as complement current approaches. Among the open questions of interest is how to effectively combine facial motion information with facial appearance. To this end, CNNs appear to be an effective and versatile baseline tool.

Effectiveness of Facial Landmark Detection

3.1 Overview
3.2 Datasets and Evaluation Metrics
    3.2.1 Image and Video Datasets
    3.2.2 Face Preprocessing and Data Augmentation
    3.2.3 Evaluation Metrics
    3.2.4 Summary
3.3 Image and Video Benchmarks
    3.3.1 Compiled Results on 300W
    3.3.2 Compiled Results on 300VW
3.4 Cross-Dataset Benchmark
    3.4.1 Evaluation Protocol
    3.4.2 Comparison of Selected Approaches
3.5 Discussion

Landmark detection is a common and often crucial pre-processing step in the context of facial analysis. Although its overall performance continues to improve, its impact on subsequent tasks must be considered. Let us consider, for example, a typical expression recognition process (see Figure 3.1). It relies on landmarks to normalize the face and extract appearance, geometry or motion features. Poor landmark detections may lead to confusion between expressions, decreasing the accuracy of the recognition process. Hence, it is necessary to ensure the robustness and stability of facial landmark detection under uncontrolled conditions and the suitability of the detected landmarks to the subsequent task (here, expression recognition).

Figure 3.1: Typical facial expression recognition process. It relies on landmarks to normalize the face and extract appearance, geometry or motion features.
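To illustrate how landmarks can drive face normalization, the sketch below applies a similarity transform that centers the face on the midpoint between the eyes, makes the eye line horizontal, and sets the inter-ocular distance to one. This is only one common normalization scheme, assumed here for illustration; the function name `normalize_by_eyes` and the choice of eye centers as reference points are hypothetical and not taken from the process shown in Figure 3.1.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalize_by_eyes(landmarks: Sequence[Point],
                      left_eye: Point,
                      right_eye: Point) -> List[Point]:
    """Map landmarks to a frame where the eye midpoint is the origin,
    the eye line is horizontal, and the inter-ocular distance is 1.

    `left_eye` is the eye that should end up on the negative x side.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        raise ValueError("eye centers must be distinct")
    cos_a, sin_a = dx / dist, dy / dist  # direction of the eye line
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    normalized: List[Point] = []
    for x, y in landmarks:
        tx, ty = x - cx, y - cy                 # translate to the eye midpoint
        rx = (tx * cos_a + ty * sin_a) / dist   # rotate by minus the eye-line angle,
        ry = (-tx * sin_a + ty * cos_a) / dist  # then scale by 1 / inter-ocular distance
        normalized.append((rx, ry))
    return normalized


# Toy usage: a face rotated by 90 degrees, eyes on a vertical line.
left, right = (0.0, 0.0), (0.0, 2.0)
aligned = normalize_by_eyes([left, right], left, right)
```

Because the whole transform is derived from two detected points, an error on either eye landmark rotates, shifts and rescales every feature extracted afterwards. This is one way in which poor detections propagate to the subsequent task.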

In light of the review in Chapter 2, we can observe a large number and variety of approaches for facial landmark detection. Moreover, these approaches are no longer limited to images but also include video-based approaches. It may therefore be difficult to clearly understand the current state of the problem. Hence, there is a need for benchmarking to better identify the benefits and limits of the various approaches proposed so far, especially deep learning ones. Beyond overall performance, there is also a need to quantify the accuracy of these approaches according to the different challenges mentioned in Section 1.3. This would help to address the problem more effectively.

An overview of each of the benchmarks conducted is provided in Section 3.1. In Section 3.2, the datasets and evaluation metrics used in the literature on facial landmark detection are first covered. In Section 3.3, the overall performance of image-based and video-based approaches is then reviewed. A comprehensive analysis of a selection of approaches is also proposed in Section 3.4 through a cross-dataset evaluation, with a focus on two critical challenges for facial analysis, namely head pose and facial expressions. The results are finally discussed in Section 3.5.

3.1 Overview

To gain useful insights into the current state of the problem, results available in the literature have been collected. We focus on two widely used datasets, 300W for still images and 300VW for videos. In this way, the results of a large collection of approaches can be gathered. It also makes it possible to link these results to the conclusions of the 300W [184] and 300VW [193] competitions. State-of-the-art approaches that perform fairly well on these datasets have then been selected, and a third benchmark has been conducted on SNaP-2DFe. This video dataset emphasizes two challenges, head pose and facial expressions, with annotations that allow an in-depth analysis of the performance in the presence of these challenges, independently and simultaneously. These measurements are especially relevant as landmark detection is critical for many facial analysis tasks, which are now shifting to uncontrolled conditions. Overall, the results obtained through each of these benchmarks provide an opportunity to discuss:

• the respective value of the different types of approaches;

• the difference in performance between controlled and uncontrolled conditions;

• the most critical difficulties;

• the most impacted facial regions.

In order to properly interpret these benchmarks, a review of the datasets and evaluation metrics used in facial landmark detection is first provided in the following section.
