DEEP SALIENCY MODELS: THE QUEST FOR THE LOSS FUNCTION

Alexandre Bruckert 1, Hamed R. Tavakoli 2, Zhi Liu 3, Marc Christie 1, Olivier Le Meur 1

1 Univ Rennes, IRISA, CNRS, France
2 Aalto University, Finland
3 Shanghai University, China
ABSTRACT
Recent advances in deep learning have pushed the performance of visual saliency models further than ever before. Numerous models in the literature present new ways to design neural networks, to arrange gaze pattern data, or to extract as many high- and low-level image features as possible in order to create the best saliency representation. However, one key part of a typical deep learning model is often neglected: the choice of the loss function.
In this work, we explore some of the most popular loss functions that are used in deep saliency models. We demonstrate that, on a fixed network architecture, modifying the loss function can significantly improve (or degrade) the results, hence emphasizing the importance of the choice of the loss function when designing a model. We also introduce loss functions that, to our knowledge, have never been used for saliency prediction. Finally, we show that a linear combination of several well-chosen loss functions leads to significant performance improvements on different datasets as well as on a different network architecture, hence demonstrating the robustness of a combined metric.
1. INTRODUCTION
Despite decades of research, the visual attention mechanisms of humans remain complex to understand and even more complex to model. With the availability of large databases of eye-tracking and mouse movements recorded on images [1, 2], there is now a far better understanding of the underlying perceptual mechanisms. Significant progress has been made in predicting visual saliency, i.e. computing the topographic representation of visual stimulus strengths across an image. Deep saliency models have strongly contributed to this progress.
However, as recently pointed out by Borji [3], a neglected challenge in the design of a deep saliency model is the choice of an appropriate loss function. In [4], a probabilistic end-to-end framework was proposed and five relevant loss functions were studied. Yet, to the best of our knowledge, none of the papers concerning the challenges of designing deep saliency models have investigated this aspect properly, despite its influence on the quality of the results. Important questions therefore arise: how do different loss functions affect the performance of deep saliency networks? Which loss functions perform better than others, and on which metrics? Are there actually substantial benefits in combining loss functions? And how does a combination of loss functions perform with respect to individual loss functions? In this work, we seek answers to such questions by conducting a series of extensive experiments with both well-known and newly designed loss functions.
For this purpose, we first categorize loss functions per type of metric: (i) pixel-based comparisons, e.g. Mean Squared Error, Absolute Error, Exponential Absolute Difference; (ii) distribution-based metrics, e.g. Kullback-Leibler divergence, Bhattacharyya loss, binary cross-entropy; (iii) saliency-inspired metrics such as Normalized Scanpath Saliency or Pearson's Correlation Coefficient; and (iv) perceptual-based metrics, gathering two novel metrics we propose in this paper, which are inspired by image style transfer and measure the aggregation of distances, computed at each convolutional layer, between the convolved reference map and the generated saliency map.
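To make the perceptual family concrete before its formal definition in Section 3, the sketch below shows one plausible form of such a loss in PyTorch (the paper does not specify an implementation): the prediction and the reference are passed through a frozen VGG-16, and the feature distances at several convolutional layers are aggregated. The layer selection, the per-layer L2 distance, and the omission of ImageNet normalization are our assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

class PerceptualLoss(torch.nn.Module):
    """Minimal sketch of a style-transfer-inspired perceptual loss:
    aggregate distances between the feature maps of the reference and
    the predicted saliency map at several VGG-16 layers. The layer
    indices and the per-layer L2 distance are assumptions."""
    def __init__(self, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features.eval()
        for p in vgg.parameters():  # frozen feature extractor
            p.requires_grad = False
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, pred, ref):
        # Saliency maps are single-channel (B, 1, H, W); repeat to
        # 3 channels to feed the VGG (input normalization omitted).
        x, y = pred.repeat(1, 3, 1, 1), ref.repeat(1, 3, 1, 1)
        loss = 0.0
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + F.mse_loss(x, y)
        return loss
```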
We then design a novel deep saliency model to provide a fixed network architecture as a reference on which all the loss functions are evaluated. Our evaluation strategy first measures the impact of each loss function taken individually, on our fixed network with a fixed image dataset (MIT). Then, building on the common agreement that different metrics favor different perceptual characteristics of the image [3], we further explore how the combination of loss functions, typically aggregating pixel-based, distribution-based, saliency-based and perceptual-based functions, can significantly influence the quality of the training.
To demonstrate the generalization capacity of our combined metric, we measure its impact on different datasets (CAT2000 and FiWi) and also with a different network architecture (SAM-VGG).
The contributions of this paper are therefore (i) to demonstrate how the choice of the loss function can strongly improve (or degrade) the quality of a deep saliency model, and (ii) to show how an aggregation of carefully selected loss functions can lead to significant improvements, both on the fixed network architecture we propose and on other architectures and datasets.
The paper is organized as follows. Section 2 presents the related works. The loss functions for training a deep architecture aiming to predict saliency maps are described in Section 3. Section 4 presents a comprehensive analysis of loss functions and their combinations. Conclusions are drawn in the last section.
2. RELATED WORKS
Computational models of saliency prediction, a long-standing problem in computer vision, have been studied from so many perspectives that going through all of them is beyond the scope of this manuscript. We thus provide a brief account of relevant works in this section, and refer the readers to [5, 6] for an overview.
To date, from a computer vision perspective, we can divide the research on computational models of saliency prediction into two eras: (1) pre-deep learning, and (2) deep learning.
During the pre-deep learning period, a significant number of saliency models were introduced, e.g. [7, 8, 9, 10, 11], and numerous survey papers looked into these models and their properties, e.g. [5, 12]. During this period, the community converged on adopting eye tracking as a medium for obtaining ground truth, and dealt with challenges regarding the evaluation and the models, e.g. [13, 14]. This era was then superseded by saliency models based on deep learning techniques [3], which are the main focus of this paper.
We therefore outline the recent research developments of the deep saliency model era from two perspectives: (1) the challenges of deep models and the works that addressed them, and (2) the deep saliency models themselves. We then stress the importance of task-specific loss functions in computer vision.
Challenges of deep saliency models.
The use of deep learning introduced new challenges to the community. The characteristics of most models shifted towards data-intensive models based on deep convolutional neural networks (CNNs). Training such a model requires a huge amount of data, motivating the search for alternatives to eye tracking databases, such as mouse tracking [2] or pooling all the existing eye tracking databases into one [15].
To improve the training, Bruce et al. [15] investigated the factors that must be taken into account when relying on deep models, e.g., pre-processing steps, tricks for pooling all the eye tracking databases together, and other nuances of training a deep model. The authors, however, considered only one loss function in their study.
Tavakoli et al. [16] looked into the correlation between mouse tracking and eye tracking at finer detail, showing that the data from the two modalities are not exactly the same. They demonstrated that, while mouse tracking is useful for training a deep model, it is less reliable for model selection and evaluation, in particular when the evaluation standards are based on eye tracking.
Given the sudden boost in overall performance by saliency models using deep learning techniques, Bylinskii et al. [17] reevaluated the existing benchmarks and looked into the factors influencing the performance of models in finer detail. They quantified the remaining gap between models and humans, and argued that pushing performance further will require high-level image understanding.
Recently, Sen et al. [18] investigated the effect of model training on neuron representations inside a deep saliency model. They demonstrated that (1) some visual regions are more salient than others, and (2) the change in inner representations is due to the task that the original model was trained on prior to being fine-tuned for saliency.
Deep saliency models.
Deep saliency models fall into two categories: (1) those using CNNs as fixed feature extractors and learning a regression from feature space into saliency space using a non-neural technique, and (2) those that train a deep saliency model end-to-end. The number of models belonging to the first category is limited. They are not comparable within the context of this research because the regression is often carried out such that the error cannot be back-propagated, e.g., [19] employs support vector machines and [20] uses extreme learning machines. Our focus is, however, the second group.
Within end-to-end deep learning techniques, the main research effort has been on architecture design. Many of the models borrow the pre-trained weights of an image recognition network and experiment with combining different layers in various ways. In other words, they engineer an encoder-decoder network that combines a selected set of features from different layers of a recognition network. In the following, we discuss some of the most well-known models.
Huang et al. [21] proposed a multi-scale encoder based on VGG networks that learns a linear combination of the responses of two scales (fine and coarse). Kümmerer et al. [22] used a single-scale model with features from multiple layers of AlexNet. Similarly, Kümmerer et al. [23] and Cornia et al. [24] employed single-scale models with features from multiple layers of a VGG architecture.
There has also been a wave of models incorporating recurrent neural architectures. Han and Liu [25] proposed a multi-scale architecture using convolutional long short-term memory (ConvLSTM). It was followed by [26], which uses a slightly modified architecture with multiple layers in the encoder and a different loss function. Recurrent models of saliency prediction are more complex than feed-forward models and more difficult to train. Moreover, their performance is not yet significantly better than that of some recent feed-forward networks such as EML-NET [27].
In the literature on deep saliency models, a loss function or a combination of several is chosen based on intuition, the expertise of the authors, or sometimes the mathematical formulation of a model. Kümmerer et al. [28] introduced the idea that information theory can be a good source of inspiration for saliency metrics. They use the information gain to explain how well a model performs compared to a gold-standard baseline. Consequently, they use the log-likelihood as a loss function in [29], achieving state-of-the-art results in saliency prediction. Jetley et al. [4] are among the very few who specifically focused on the design of loss functions for saliency models. They proposed the use of the Bhattacharyya distance and compared it to four other probability distances. In this paper, in contrast to [4], we (1) adopt a principled approach to compare existing loss functions and their combinations, and (2) investigate their convergence properties over different datasets and network architectures.
Deep Learning, loss functions and computer vision.
With the application of deep learning techniques to the computer vision domain, the choice of an appropriate loss function for a task has become a critical aspect of model training. The computer vision community has been successful in developing task-tailored loss functions to improve a model, e.g., encoding various geometric properties for pose estimation [30], curating loss functions enforcing perceptual properties of vision for various generative deep models [31], or exploiting the sparsity within the structure of a problem, e.g., the class imbalance between background and foreground in detection problems, to reshape standard loss functions into new, effective ones [32]. Our effort follows the same path, identifying the effectiveness of a range of loss functions for saliency prediction.
3. LOSS FUNCTIONS FOR DEEP SALIENCY NETWORK
Before delving into the description of loss functions, we present the architecture of the convolutional neural network that will be used throughout this paper. After this presentation, we elaborate on the tested loss functions.
3.1. Proposed baseline architecture
Figure 1 presents the overall architecture of the proposed model. The purpose of designing a new architecture is only to perform a comparison with existing architectures. Our architecture is based on the deep gaze network of [29] and on the multi-level deep network of [24]. The pre-trained VGG-16 network [33] is used for extracting deep features of an input image (400 × 300) from layers conv3_pool, conv4_pool and conv5_conv3. The feature maps of layers conv4_pool and conv5_conv3 are rescaled to obtain feature maps with a similar spatial resolution.
The resulting feature map with 1280 channels is then fed into a shallow network composed of the following layers: a first convolutional layer reduces the number of channels by a factor of ten; the channels are then processed by an ASPP (atrous spatial pyramid pooling [34]) with 4 levels. Each level has a 3 × 3 convolution kernel, a stride of 1 and a depth of 32. The dilation rates are 1, 3, 6, and 12. The benefit of the ASPP is to capture information in a coarse-to-fine manner while keeping the resolution of the input feature maps. The outputs of the four pyramid levels are then merged together, leading to 4 × 32 maps. A last 1 × 1 convolutional layer reduces the data dimensionality to a single feature map. This map is then smoothed by a 5 × 5 Gaussian filter with a standard deviation of 1. The activation function of these layers is ReLU.
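As an illustration, a minimal PyTorch sketch of this shallow decoder could look as follows; the framework, the class and variable names, and the exact reduction width (1280 to 128 channels, i.e. a factor of ten) are our reading of the text, and the final Gaussian smoothing is omitted for brevity.

```python
import torch
import torch.nn as nn

class ASPPDecoder(nn.Module):
    """Sketch of the shallow network: a 1x1 convolution reducing the
    1280 concatenated VGG channels by a factor of ten, a 4-level ASPP
    (3x3 kernels, stride 1, depth 32, dilation rates 1, 3, 6, 12),
    and a final 1x1 readout to a single saliency map."""
    def __init__(self, in_channels=1280):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=1),
            nn.ReLU(inplace=True))
        self.aspp = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(128, 32, kernel_size=3, stride=1,
                          padding=r, dilation=r),
                nn.ReLU(inplace=True))
            for r in (1, 3, 6, 12)])
        # Merging the four pyramid levels yields 4 x 32 = 128 maps.
        self.readout = nn.Conv2d(4 * 32, 1, kernel_size=1)

    def forward(self, feats):
        x = self.reduce(feats)
        x = torch.cat([branch(x) for branch in self.aspp], dim=1)
        # The 5x5 Gaussian smoothing (sigma = 1) would follow here.
        return torch.relu(self.readout(x))
```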
The network was trained on the MIT dataset, composed of more than 1000 images [35]. We split this dataset into 500 images for training, 200 images for validation, and the rest for testing. We use a batch size of 60 and stochastic gradient descent. To prevent over-fitting, a dropout layer with a rate of 0.25 is added on top of the network. The learning rate is set to 0.001. During the training, the network was evaluated against the validation set to monitor convergence and to prevent over-fitting. The number of trainable parameters is approximately 1.62 million. In the following section, we present the different loss functions tested during the training phase.
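This protocol maps onto a standard training loop; a sketch under the stated settings (plain SGD, learning rate 0.001, batch size 60 fixed in the data loader), where the number of epochs, the loss function placeholder, and the absence of momentum are assumptions:

```python
import torch

def train(model, train_loader, val_loader, loss_fn, epochs=20):
    """Sketch of the training protocol: SGD with lr = 0.001;
    the validation split is used to monitor convergence."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    for epoch in range(epochs):
        model.train()
        for images, gt_maps in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), gt_maps)
            loss.backward()
            optimizer.step()
        # Validate to monitor convergence and detect over-fitting.
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)
        print(f"epoch {epoch}: validation loss = {val_loss:.4f}")
```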
Fig. 1: Architecture of the proposed deep network.
3.2. Loss functions
Let $I : \Omega \subset \mathbb{R}^2 \mapsto \mathbb{R}^3$ be an input image of resolution $N \times M$. We denote by $S$ and $\hat{S}$ the vectorized human and predicted saliency maps, i.e. $S$ and $\hat{S}$ are in $\mathbb{R}^{N \times M}$. Let also $S^{fix}$ be the human eye fixation map, i.e. an $N \times M$ image with pixels equal to 1 or 0. In the following, we present the thirteen loss functions $\mathcal{L}$ tested in this study. They are classified into four categories according to their characteristics: pixel-based, distribution-based, saliency-inspired and perceptual-based.
3.2.1. Pixel-based loss functions
For pixel-based loss functions $\mathcal{L}$, we assume that $S$ and $\hat{S}$ are in $[0, 1]$. We evaluate the following loss functions (a code sketch of all four is given after the list):
• Mean Squared Error (MSE) measures the averaged squared error between prediction and ground truth:
$$\mathcal{L}(S, \hat{S}) = \frac{1}{N \times M} \sum_{i=1}^{N \times M} \left( S_i - \hat{S}_i \right)^2 \qquad (1)$$
• Exponential Absolute Difference (EAD):
$$\mathcal{L}(S, \hat{S}) = \frac{1}{N \times M} \sum_{i=1}^{N \times M} \left( \exp\left( \left| S_i - \hat{S}_i \right| \right) - 1 \right) \qquad (2)$$
• Absolute Error (AE):
$$\mathcal{L}(S, \hat{S}) = \frac{1}{N \times M} \sum_{i=1}^{N \times M} \left| S_i - \hat{S}_i \right| \qquad (3)$$
• Weighted MSE Loss (W-MSE):
$$\mathcal{L}(S, \hat{S}) = \frac{1}{N \times M} \sum_{i=1}^{N \times M} w_i \left( S_i - \hat{S}_i \right)^2 \qquad (4)$$
The weight $w_i$ allows us to put more emphasis on errors occurring on salient pixels of the ground truth $S$. Two weighting functions are tested in this paper. In [24], the authors defined the loss function (MLNET) with the weight $w_i = \frac{1}{\alpha - S_i}$.
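As referenced above, the four pixel-based losses translate directly into code under the stated assumption that $S$ and $\hat{S}$ lie in $[0, 1]$; a minimal PyTorch sketch, where the function names and the value of $\alpha$ are ours:

```python
import torch

def mse(s, s_hat):
    """Eq. (1): mean squared error over the N x M pixels."""
    return torch.mean((s - s_hat) ** 2)

def ead(s, s_hat):
    """Eq. (2): exponential absolute difference."""
    return torch.mean(torch.exp(torch.abs(s - s_hat)) - 1)

def ae(s, s_hat):
    """Eq. (3): absolute error."""
    return torch.mean(torch.abs(s - s_hat))

def w_mse(s, s_hat, alpha=1.1):
    """Eq. (4) with the MLNET weight w_i = 1 / (alpha - S_i) from [24].
    alpha = 1.1 is an assumption; it must exceed max(S) = 1 so that
    the weight stays finite and positive."""
    w = 1.0 / (alpha - s)
    return torch.mean(w * (s - s_hat) ** 2)
```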