
3.2 Datasets and Evaluation Metrics

3.2.1 Image and Video Datasets

In recent years, many datasets for facial landmark detection have been made available to the scientific community (see Table 3.1). The images included in these datasets are collected from social networks or image search services such as Google, Flickr, or Facebook, bringing more realism to the data. The annotation is performed manually, which is time-consuming, but this can be alleviated with the help of the Amazon Mechanical Turk platform¹ or by using semi-automatic methods [193]. The quality of the annotations, however, may vary [184, 23]. The annotation scheme used, i.e., the positions and the number of landmarks, may also differ from one dataset to another [184]. Currently, the scheme composed of 68 landmarks [78, 184], illustrated in Figure 3.2, is the most widely used. Since this scheme does not fit extreme poses (typically profile poses), where self-occlusions cause landmarks to stack up, a 39-landmark scheme has also been proposed for profile faces [78, 271]. Other schemes with either fewer landmarks (e.g., only the rigid ones of each facial component) or with over a hundred landmarks (e.g., including the contours of the face and of each component) can also be useful depending on the application (e.g., human-computer interaction, motion capture). Nonetheless, the 68-landmark scheme might be suitable for many applications, offering a reasonable trade-off between annotation time and information capture.

Figure 3.2: 68-landmark (left) and 39-landmark (right) schemes [271]. These schemes might be suitable for many applications with a reasonable trade-off between annotation time and information capture.
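For readers working with such annotations, the 68-landmark scheme is conventionally indexed by facial component. The sketch below records the commonly used iBUG index ranges (0-based); the grouping is the standard convention, but the dictionary and helper names are ours, for illustration only.

```python
# Commonly used (0-based) index ranges of the 68-landmark iBUG scheme.
# The ranges follow the standard convention; the dict itself is an
# illustrative helper, not part of any dataset's official tooling.
LANDMARK_GROUPS_68 = {
    "jaw":           range(0, 17),   # face contour
    "right_eyebrow": range(17, 22),
    "left_eyebrow":  range(22, 27),
    "nose":          range(27, 36),  # bridge + nostrils
    "right_eye":     range(36, 42),
    "left_eye":      range(42, 48),
    "outer_mouth":   range(48, 60),
    "inner_mouth":   range(60, 68),
}

def component_points(landmarks, component):
    """Select the (x, y) points of one facial component
    from a list of 68 (x, y) tuples."""
    return [landmarks[i] for i in LANDMARK_GROUPS_68[component]]
```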

300W [184] is one of the most widely used datasets in the literature for training and evaluating facial landmark detection algorithms under uncontrolled conditions. It is the product of the competition of the same name, which federated several existing datasets, i.e., LFPW [11], HELEN [121], and AFW [286]. All of these datasets have been re-annotated using the 68-landmark scheme, and new images have been added for training and evaluation, for a total of 4350 images. Because some challenges, such as large pose variations, are under-represented, an extension called 300W-LP [288] has been proposed in order to provide more images with large pose variations. However, these new data, synthesized using the 3DMM [15], contain artifacts that might affect the accuracy of the landmarks.

¹ Crowdsourcing marketplace allowing individuals and companies to have tasks performed, in exchange for remuneration, that computers are currently unable to perform.
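In practice, the annotations of 300W (and of the datasets it federates) are distributed as one .pts file per image. A minimal parser is sketched below; it assumes the usual iBUG .pts layout (a version line, an n_points line, and the coordinates between braces), and the function name is ours.

```python
def read_pts(path):
    """Parse an iBUG-style .pts annotation file into a list of (x, y) tuples.

    Expected layout (as distributed with 300W):
        version: 1
        n_points: 68
        {
        x_1 y_1
        ...
        x_68 y_68
        }
    """
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    n_points = int(lines[1].split(":")[1])                 # "n_points: 68"
    coords = lines[lines.index("{") + 1 : lines.index("}")]
    points = [tuple(map(float, ln.split())) for ln in coords]
    assert len(points) == n_points, "annotation file is inconsistent"
    return points
```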

Table 3.1 (columns: Type, Dataset, Year, #Images, #Landmarks): Datasets captured under unconstrained conditions. The 68-landmark scheme is the most widely used. There is only limited video data available.

More specific datasets have also been proposed. COFW [25] gives more emphasis to occlusions. Initially annotated with 29 landmarks, it has recently been updated to the 68-landmark scheme. MTFL [279] and MAFL [280] focus on multitask learning and facial attributes, but with only 5 annotated landmarks. In an effort to increase the amount of challenging data, especially for DL approaches, MENPO [271] has been developed in conjunction with the competition of the same name. It contains 10,993 images of semi-frontal faces and 3852 of profile faces, obtained from the large-scale datasets AFLW [109] and FDDB [96], and annotated with the 68-landmark and 39-landmark schemes, respectively. Samples can be seen in Figure 3.3. However, these datasets make it difficult to determine the sources of error. Recently, a new dataset, WFLW [235], based on WIDER Face [260], has been published with rich annotations: 98 landmarks, occlusions, pose, make-up, illumination, blur, and expression. It aims to enable a comprehensive analysis of existing algorithms.
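The WFLW annotations are commonly distributed as one text line per face, concatenating the 98 landmark coordinates, a face bounding box, six binary attribute flags (pose, expression, illumination, make-up, occlusion, blur), and the image path. The sketch below assumes that layout; the exact field order should be checked against the official release.

```python
def parse_wflw_line(line):
    """Split one WFLW annotation line into landmarks, box, attributes, path.

    Assumed field layout (to be checked against the official release):
      196 floats -> 98 (x, y) landmark coordinates
        4 floats -> bounding box (x_min, y_min, x_max, y_max)
        6 ints   -> pose, expression, illumination, make-up, occlusion, blur
        1 string -> relative image path
    """
    fields = line.split()
    coords = list(map(float, fields[:196]))
    landmarks = list(zip(coords[0::2], coords[1::2]))   # 98 (x, y) pairs
    box = tuple(map(float, fields[196:200]))
    attr_names = ["pose", "expression", "illumination",
                  "make_up", "occlusion", "blur"]
    attributes = dict(zip(attr_names, map(int, fields[200:206])))
    image_path = fields[206]
    return landmarks, box, attributes, image_path
```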

In practical applications, facial landmark detection algorithms are commonly applied to videos, so it appears crucial to have data that is representative of this problem in order to address it. Static datasets do not cover all the difficulties encountered by applications; in particular, those related to the movements of persons or cameras are not currently considered. 300VW [193] is the only published dataset developed for video-based landmark detection under uncontrolled conditions. It contains 114 videos of about 1 minute each, each featuring one person annotated with 68 landmarks. 50 videos are intended for training and 64 for testing. The test set is divided into 3 categories of increasing difficulty: category 1 presents videos recorded in well-lit conditions with various head poses, category 2 contains additional lighting variations, and category 3 includes severe difficulties such as lighting, occlusions, expressions, and head poses (see Figure 3.4). However, the challenging data are not as diverse and extreme as those found in static datasets.
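When reporting results on 300VW, errors are typically aggregated per test category. A minimal sketch of such an aggregation is given below; the video-to-category mapping and the per-video errors are hypothetical placeholders to be filled from the official split and the chosen metric.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_category(per_video_errors, video_to_category):
    """Average per-video landmark errors within each 300VW test category.

    per_video_errors:  dict mapping video id -> mean error of that video,
                       computed with the chosen metric (e.g., a normalized
                       point-to-point error).
    video_to_category: dict mapping video id -> 1, 2, or 3, taken from the
                       official 300VW split (hypothetical values below).
    """
    buckets = defaultdict(list)
    for vid, err in per_video_errors.items():
        buckets[video_to_category[vid]].append(err)
    return {cat: mean(errs) for cat, errs in sorted(buckets.items())}

# Hypothetical usage: ids and errors are illustrative, not real results.
print(summarize_by_category(
    {"001": 0.042, "002": 0.051, "003": 0.075},
    {"001": 1, "002": 2, "003": 3},
))
```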

Figure 3.3: Illustration of the images contained in the MENPO dataset (a = pose; b = occlusion; c = expression; d = illumination) [49].

Given the limited video data available, other video datasets have been reviewed, and one of them has shown valuable specifications. SNaP-2DFe [2] is a video dataset recently developed to quantify the impact of head movements on expression recognition performance. As it contains landmark annotations, it also provides rich annotations, movement and expression, for studying video-based facial landmark detection. It consists of 6 movements composed of a horizontal translation and/or a rotation (roll, pitch, yaw), each associated with 7 acted expressions (neutral, happiness, fear, anger, disgust, sadness, surprise) corresponding to the universal expressions suggested by [58]. The temporal patterns of expression activation (i.e., neutral-onset-apex-offset-neutral) are also provided, where apex refers to the highest intensity of an expression. Data from 15 participants have been collected by two synchronized cameras (i.e., a helmet camera and a static camera), for a total of 1260 videos. Samples with pitch movement and surprise expression are illustrated in Figure 3.5.
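The neutral-onset-apex-offset-neutral pattern can be made concrete with a toy segmentation: given a per-frame expression intensity curve, the onset starts when the intensity first exceeds a threshold, the apex is its maximum, and the offset ends when it falls back below the threshold. The function below is purely illustrative; SNaP-2DFe ships the phase annotations directly, so nothing like this is required to use the dataset.

```python
def segment_expression(intensity, threshold=0.1):
    """Toy segmentation of a neutral-onset-apex-offset-neutral pattern.

    intensity: per-frame expression intensity in [0, 1] (illustrative input;
               SNaP-2DFe provides the phase annotations directly).
    Returns (onset_start, apex_frame, offset_end) frame indices,
    or None if the expression is never activated.
    """
    active = [i for i, v in enumerate(intensity) if v > threshold]
    if not active:
        return None
    onset_start = active[0]                               # first activated frame
    offset_end = active[-1]                               # last activated frame
    apex_frame = max(active, key=lambda i: intensity[i])  # highest intensity
    return onset_start, apex_frame, offset_end

# Example: a short ramp up to the apex and back down to neutral.
curve = [0.0, 0.05, 0.3, 0.7, 1.0, 0.6, 0.2, 0.05, 0.0]
print(segment_expression(curve))  # (2, 4, 6)
```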

Figure 3.4: Illustration of each category of the 300VW dataset: (a) well-lit conditions, (b) lighting variations, and (c) severe difficulties [32]. These difficulties are not as extreme as in still image datasets.

Figure 3.5: Sample images of facial expressions recorded under pitch movements from the SNaP-2DFe dataset (row 1: helmet camera, only expression movement; row 2: static camera, expression and head movement). The bottom plot shows the different patterns of a facial expression: neutral (green), onset (orange), apex (red), and offset (orange).
