Image and Video Benchmarks - The DART-Europe E-theses Portal

In the following, the two benchmarks obtained by compiling the results from the litera-ture are introduced. The first benchmark covers image-based algorithms and can notably be used to confront traditional and deep learning approaches along with fine and chal-lenging conditions. In the second benchmark, video-based algorithms are included to provide an additional comparison between static and dynamic approaches. Full details on the datasets and approaches used can be found in Section 2.1, 2.2, and 3.2.

3.3.1 Compiled Results on 300W

Figure 3.9: Samples of ESR, SDM, LBF and TCDCN on 300W [280]. Landmarks of non-rigid facial structures, especially on the outline of the face, are found to be the most difficult to detect.

The performances of leading methods for image-based facial landmark detection are reported in Table 3.2. These methods were evaluated on the 300W dataset [184], which is the most widely used dataset in the literature. As a reminder, it composed of 2 categories of variable difficulty. Category A corresponds to images that do not in-clude severe difficulties. Category B contains more complex facial images with large variations in pose, illumination, and expressions, as well as occlusions. First, note that CSR based on various DNN architectures is the predominant approach (see for example, CFAN, MDM, DAN, LAB). Secondly, for category B, the average error is more than twice the one obtained on category A. Despite the quantity of methods proposed in the literature and the recent major advances, these results show that the problems encoun-tered under uncontrolled conditions are still far from being solved.

Approach Method Year Category A Category B Full set

4.10/-/-Table 3.2: Performances of recent image-based methods on 300W. NRMSE is reported for categories A and B. NRMSE/AUC/FR are reported for the full set. Numbers are taken from the literature. CSR based on various DNN architectures is the predominant approach. The average error on category B is more than twice the one obtained on category A. The problems encountered under uncontrolled conditions are still far from being solved.

By looking at qualitative results such as the ones in Figure 3.9, interesting obser-vations can also be made. Because of their significant influence on facial appearance, variations in pose and expression along with occlusions seem to be among the most difficult challenges. Landmarks of non-rigid facial structures (e.g., those around the mouth) are found to be the most difficult to detect as they can be strongly impacted by deformations related to expressions. Landmarks on the outline of the face are even more challenging, probably due to the lack of distinctive features. Deep learning ap-proaches are now very popular and state-of-the-art. They demonstrate striking results compared to traditional approaches, especially in category B. This is most likely related to their ability to represent complex non-linear functions and to learn discriminative task-specific features while benefiting from data augmentation. Nevertheless, they have difficulty running in real-time, even on GPU. Traditional CSR approaches are much faster, reaching up to 1000 to 3000 FPS [105, 178]. It should be noted that the training

procedure (pre-training, training data, augmentation) is not always clearly provided and may vary from one method to another, which can impact the results and make them difficult to compare. However, similar observations with additional details are reported in competition reports, which provide original evaluations with identical datasets and protocols [184, 271].

3.3.2 Compiled Results on 300VW

Approach Method Year Category 1 Category 2 Category 3 Full set

Static

SDM [248] 2013 7.41 6.18 13.04 7.25

ESR [27] 2014 - - - 7.09

CFSS [285] 2015 7.68 6.42 13.67 6.13

CFAN [274] 2014 - - - 6.64

TCDCN [279] 2014 7.66 6.77 14.98 7.59

Dynamic RED-NET [162] 2016 - - - 5.15

TSTN [133] 2018 5.36 4.51 12.84

-Table 3.3: Performances of recent image-based and video-based methods on 300VW.

NRMSE is reported for categories 1, 2, and 3 and the full set. Numbers are taken from the literature. DNNs based on RNN are the predominant approach. The results demonstrate better robustness of temporal approaches under uncontrolled conditions.

Temporal constraints appear to be just as important as spatial constraints.

In the same way as image-based approaches, the performance of leading video-based facial landmark detection methods are referenced in Table 3.3. These methods were evaluated on the 300VW dataset [193] which, as a reminder, is composed of three categories of increasing difficulty. It is currently the only video dataset available. Some image-based methods (“Static” row in Table 3.3) have been included for comparison [248, 27, 285, 274, 279]. Since the interest in video-based approaches is more recent, possibly due to data availability, much fewer results are available.

DNNs are also the predominant approach. They tend to provide more stable pre-dictions without large errors since they take advantage of the temporal coherence (see Figure 3.10). Temporal constraints appear to be just as important as spatial constraints, as proved by the results that demonstrate better robustness under uncontrolled condi-tions. RNN-based approaches are currently very popular [162, 79], especially two-stream architectures [133]. The lack of data being more important when considering videos (more complex model, redundancy in data), one of the advantages of such archi-tectures is that fine-tuning can be performed for the spatial stream from models trained on static datasets. Although the results demonstrate the usefulness of temporal

infor-mation, the trend to use RNNs turns out to be limited. Combining CNNs and RNNs provides late temporal connectivity, most certainly capable of detecting only global mo-tion from high level feature maps, e.g., head movements, since the CNNs used cannot detect any motion.

Figure 3.10: Mean error of RED-NET with and without recurrent learning on 300VW [162]. Recurrent learning tends to provide more stable predictions without large errors.

Dans le document The DART-Europe E-theses Portal (Page 93-96)