Gradient regularisation increases deep learning performance under stress on remote sensing applications.

HAL Id: hal-01773170

https://hal.archives-ouvertes.fr/hal-01773170v4

Preprint submitted on 14 Sep 2019

Adrien Chan-Hon-Tong

To cite this version:

Adrien Chan-Hon-Tong. Gradient regularisation increases deep learning performance under stress on remote sensing applications. 2019. ⟨hal-01773170v4⟩

Adrien CHAN-HON-TONG September 14, 2019

Abstract

The deep learning community currently puts a very large effort into resistance to adversarial attacks, and major improvements have been achieved in increasing local smoothness. In contrast, global smoothness is intrinsically linked to the Lipschitz value, but the link between these two notions is too little explored.

In this paper, experiments of object detection and image segmentation on public remote sensing datasets show that adding a Lipschitz-related penalty consistently increases performance under stress (including adversarial attacks).

1 Introduction

Deep learning (see [14] for a review) is currently the state of the art in many artificial intelligence applications. Yet, deep learning performances are plagued by adversarial examples (see [7, 17, 23] as samples of the adversarial example literature). At test time, it is possible to design a specific invisible/marginal perturbation such that a targeted network eventually predicts different outputs on the original and the disturbed input. This threat is worsened as producing adversarial examples does not require access to the internal structure of the network [16] and can have a physical implementation [12].

In addition, another issue is the lack of global smoothness of deep networks, typically linked to the Lipschitz constant of the network, and not just to local smoothness, which is strengthened by adversarial defences. Thus, quite orthogonally to the very large effort of the community on adversarial defence (e.g. [26, 25, 22]), this paper focuses on Lipschitz-related penalties like [3, 1].

Now, most Lipschitz-related methods only measure their effects with accuracy, i.e. global smoothness is measured as a way to reduce overfitting. But there is no trivial link between smoothness (model dependent) and overfitting (model and data dependent). Instead, this paper offers a framework to measure the effect of a Lipschitz penalty by accuracy under stress (including adversarial attacks, but not only), i.e. global smoothness is measured as a way to increase local smoothness (even if adversarial defences are obviously a more straightforward way to do so for small perturbations).

Accordingly, this paper presents remote sensing experiments (object detection and image segmentation on remote sensing datasets) which consistently show that the offered Lipschitz-related penalty increases performance under stress while being related to the global smoothness of the network (and not just local smoothness).

In this paper, the focus is given to remote sensing applications. On one hand, remote sensing images cannot be hacked like social network images (by modifying encoding) or autonomous driving images (by hacking traffic signs). But, on the other hand, remote sensing systems should deal with large illumination changes, dead pixels, atmospheric blur, and camouflage, which is entirely an adversarial attack. Also, many remote sensing applications like car counting or image-based smart farming are currently waiting for a certification framework to be used in real life, while autonomous driving may require more time. And such a framework will probably include robustness evaluation.

So, the contributions of this paper are:

• it offers a new Lipschitz-related penalty

• which is evaluated as a way to increase performances under stress i.e. to increase local smoothness and not to reduce overfitting

• using a toolbox (made public) for stress evaluation including both adversarial attack (typically Fast gradient sign method (FGSM) [7]), and, common perturbation (a subset of [8])

• the consistent trend of our remote sensing experiments is that this new penalty term increases performance under stress

The new penalty term is presented in section 3 after related works of section 2. Experiments are presented in section 4 before conclusion.

2 Related works

2.1 Stress test and adversarial defence

The definition of dependable evaluation is not that clear. Two theoretical dead ends forbid a formal dependable evaluation: without a formal definition of the task, formal proof of correctness is impossible (typically [10] can prove some properties but it can prove nothing about accuracy on unknown samples), and all classifiers are equally bad when averaged over all classification problems [24].

So dependable evaluation should have something to do with very large testing datasets. But very large datasets are expensive to collect. Thus, more realistically, dependable evaluation will be related to medium-sized testing datasets augmented to be stress tests (and simulation, plus formal properties, which are out of the scope of this paper).

Typically, [8] offers a toolbox to augment testing data with agnostic perturbations, and shows that classical networks perform poorly on these augmented data. This toolbox could obviously be combined with adversarial attacks like FGSM. Now, high performance under stress can be straightforwardly achieved by local smoothness, typically obtained by considering the state of the art of adversarial defence, but also by global smoothness, typically related to the Lipschitz constant.

The most classical adversarial defence is adversarial training [22], i.e. training on data plus adversarial data. An emerging way is [25], where the network output is optimized to be absolutely stable around each training point, allowing a mathematical guarantee. A soft version of [25] is [26], where the guarantee on robustness is relaxed to allow a training closer to classical training with a simpler trade-off between robustness and (training) accuracy.

However, none of these methods is designed to increase the global smoothness of the network. This global smoothness is more related to the Lipschitz constant.

2.2 Lipschitz methods

A function $f$ from $\mathbb{R}^I$ to $\mathbb{R}^J$ is said to be $K$-Lipschitz for the norms $\|\cdot\|_I$ and $\|\cdot\|_J$ if $\forall x, y \in \mathbb{R}^I$, $\|f(x) - f(y)\|_J \le K \|x - y\|_I$. As continuous, piecewise infinitely differentiable functions, ReLU-based deep networks are $K$-Lipschitz functions. So, naively, deep networks could be expected to be smooth functions.

However, both $K$ and $I$ can be very large: $K$ is estimated to be higher than 50000 for the first AlexNet layers in $L_2$ norm [23], while $I = 227 \times 227 \times 3$.

So, modifying each pixel value of an input image by just 1 can lead to a gap of 7729350000 (that is, $K \times I$) from the original likelihood produced by AlexNet on this image. Thus, to be really smooth, deep networks should have a much lower Lipschitz coefficient (LC).
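The order of magnitude quoted above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are ours, and the second figure (the bound obtained when the same perturbation is measured in $L_2$ norm instead of by summing the per-pixel changes) is an added point of comparison, not a number taken from the text.

```python
import math

K = 50_000            # lower estimate of the Lipschitz coefficient quoted above
I = 227 * 227 * 3     # number of input values of an AlexNet image

# Changing every input value by 1 gives a perturbation of size I when the
# per-pixel changes are summed, and sqrt(I) when it is measured in L2 norm.
bound_sum = K * I
bound_l2 = K * math.sqrt(I)   # assumption: added for comparison only

print(I)          # 154587
print(bound_sum)  # 7729350000
print(bound_l2)   # on the order of 2e7
```

Either way, the bound is many orders of magnitude larger than the range of a likelihood, which is the point made in the text.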

Different penalties have been offered in the literature to tackle this issue, including the $L_2$ penalty, the spectral penalty [3] or the gradient penalty [1]. These penalties are usually added to the classical binary cross entropy loss. More formally, let $x$ be the input image, $y$ the corresponding ground truth, $f$ the network with $\theta$ the current weights, and hence $f(x, \theta)$ the probability produced by the network. Then, the total loss is $l(x, y, \theta) = BCE(f(x, \theta), y) + \lambda R(f, x, \theta)$, with $BCE$ the binary cross entropy and $R$ the regularity function (e.g. $\|\theta\|$ or other). Weights $\theta$ of the network are then updated using a stochastic gradient descent approach, i.e. following the gradient $\nabla_\theta l(x, y, \theta)$.
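As an illustration of this total loss, the sketch below uses a hypothetical one-layer "network" (a sigmoid of a dot product) and takes $R = \|\theta\|^2$ as regularity function. Both choices, like the function names and the learning rate, are assumptions made to keep the example self-contained; they are not the networks or settings used in the experiments.

```python
import math

def bce(p, y, eps=1e-12):
    """Binary cross entropy between a probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def predict(x, theta):
    """Hypothetical one-layer network: sigmoid of the dot product <x, theta>."""
    z = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))

def total_loss(x, y, theta, lam):
    """l(x, y, theta) = BCE(f(x, theta), y) + lam * R, with R = ||theta||^2 here."""
    return bce(predict(x, theta), y) + lam * sum(t * t for t in theta)

def sgd_step(x, y, theta, lam, lr=0.1):
    """One descent step using the analytic gradient of total_loss w.r.t. theta."""
    p = predict(x, theta)
    grad = [(p - y) * xi + 2.0 * lam * ti for xi, ti in zip(x, theta)]
    return [ti - lr * gi for ti, gi in zip(theta, grad)]

x, y = [1.0, -2.0, 0.5], 1
theta = [0.3, 0.1, -0.2]
loss_before = total_loss(x, y, theta, lam=0.01)
theta = sgd_step(x, y, theta, lam=0.01)
loss_after = total_loss(x, y, theta, lam=0.01)
print(loss_before, loss_after)  # the loss decreases on this sample
```

The weight $\lambda$ (`lam`) sets the balance between fitting the sample and keeping the weights small.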

As measuring the global smoothness of the network is not straightforward, most evaluations of these methods were focused on accuracy, i.e. these regularisation losses have been used as a way to control overfitting. In this paper, the evaluation is instead focused on accuracy under stress. Indeed, as global smoothness should imply a relative local smoothness, evaluating accuracy under stress seems more relevant than overfitting reduction, which is not directly related to robustness.

2.3 Applications

Soon after deep learning became the state of the art in image classification with AlexNet [11], deep learning object detectors appeared with [6] and immediately outperformed the previous state of the art in object detection (like [5]).

Since [6], a large number of deep learning detectors have been designed. The most salient deep detectors are [20], which improves [6] by using deep learning box proposals instead of static ones, and [19], which directly predicts boxes from a dense feature map, later extended by [15], which offers a better way to encode box scores in the feature map. Alternatively, detections can also be extracted from a semantic segmentation map when objects do not overlap. This segment-before-detect paradigm is widely used in remote sensing [2].

Segmentation is independently another task where networks have outperformed the previous state of the art since [13]. A typical structure for image segmentation is UNet [21].

Both these tasks can find tremendous applications in remote sensing, which are usually non-critical applications. This way, such remote sensing applications may be the first real-life industrial applications of deep learning (excluding multimedia applications). This is why this paper focuses on remote sensing applications (see datasets).

3 Batch centred Lipschitz regularization

This section describes the offered new Lipschitz regularization and its difference with previous Lipschitz regularizations.

The $L_2$ weight penalty is the easiest way to lower LC, but controlling the regularisation is hard as all weights of all layers are considered equally. The spectral penalty takes advantage of the layered structure of the network. However, as pointed out by [9], the product of the spectral norms of the layers is a very loose bound of LC. Finally, penalizing the gradient norm (with respect to inputs, not weights, i.e. $\|\nabla_x f(x)\|$) has the advantage of estimating the real LC, but only locally around each point on which the gradient is computed (and with the disadvantage of requiring second order derivatives).

Also, penalizing the gradient norm is usually done on the loss: training tries to make the loss more Lipschitz. But this is problematic as the loss is computed after the softmax.

Instead, in this paper, the objective is to make only the features more Lipschitz, neither the classifier (e.g. the three last layers) nor the loss. So, network $f$ is decomposed into an encoder $h$ and a classifier $g$, with weights $\theta$ being decomposed into $\phi, \psi$. The likelihood on input $x$ is then $g(h(x, \phi), \psi)$ and the loss on this sample (associated with ground truth $y$) is $l(x, y, \phi, \psi) = BCE(g(h(x, \phi), \psi), y) + \lambda \|\nabla_x h(x, \phi)\|$, where $\|\nabla_x h(x, \phi)\|$ denotes the sum of $\|\nabla_x h_i(x, \phi)\|$ over all components $i$.

In practice, frameworks (like PyTorch or TensorFlow) allow differentiating a scalar function such as the loss, but do not allow sharing the derivative graph between components, making it intractable to compute the sum over $i$ of the individual gradients.

So, this paper offers to compute a batch centred Lipschitz regularization instead: $\|\nabla_x \|h(x) - h_{ref}\|\|$ is computed, with $h_{ref}$ being the average of $h(x)$ on the batch.
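A minimal PyTorch sketch of this batch centred penalty is given below. It is one possible reading of the formula, not the released implementation: the function name, the use of $L_2$ norms, the averaging over the batch, and the choice of treating $h_{ref}$ as a constant (detached from the graph) are all assumptions. Because the summed distance to $h_{ref}$ is a single scalar, one backward pass yields the input gradient of every sample of the batch.

```python
import torch
from torch import nn

def batch_centred_penalty(encoder, x):
    """Mean over the batch of || grad_x ||h(x) - h_ref|| ||, h_ref = batch mean of h."""
    x = x.detach().clone().requires_grad_(True)
    h = encoder(x)                                   # features, shape (B, ...)
    # Assumption: the batch average is treated as a constant reference.
    h_ref = h.mean(dim=0, keepdim=True).detach()
    dist = (h - h_ref).flatten(1).norm(dim=1)        # one scalar per sample
    # create_graph=True keeps the graph so the penalty itself can be
    # back-propagated to the encoder weights (second order derivatives).
    (grad,) = torch.autograd.grad(dist.sum(), x, create_graph=True)
    return grad.flatten(1).norm(dim=1).mean()

# Toy usage with a hypothetical two-layer encoder and random images.
torch.manual_seed(0)
encoder = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())
images = torch.randn(5, 3, 8, 8)
penalty = batch_centred_penalty(encoder, images)
penalty.backward()   # gradients now flow to the encoder parameters
```

In training, this term would be added to the classification loss with a weight $\lambda$, i.e. `bce + lam * batch_centred_penalty(encoder, images)`.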

4 Experiments

4.1 Setting

Experiments have been conducted on public remote sensing datasets for object detection and segmentation.

To evaluate performance under stress, a toolbox similar to [8] has been developed for the detection context. This toolbox also contains adversarial noise (implemented by FGSM).
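Two of the agnostic perturbations used later in the results (Gaussian noise and dead pixels) can be sketched with NumPy as follows. This is not the released toolbox: the function names, the 8-bit pixel range, and the reading of dead pixels as pixels forced to pure black or pure white are assumptions.

```python
import numpy as np

def gaussian_noise(image, variance, rng):
    """Add zero-mean Gaussian noise of the given variance, then clip to [0, 255]."""
    noise = rng.normal(0.0, np.sqrt(variance), image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

def dead_pixels(image, fraction, rng):
    """Force a random fraction of pixel positions to pure black or pure white."""
    out = image.copy()
    height, width = image.shape[:2]
    mask = rng.random((height, width)) < fraction
    values = rng.choice(np.array([0, 255], dtype=image.dtype), size=(height, width))
    if image.ndim == 3:
        out[mask] = values[mask][:, None]   # same value on every channel
    else:
        out[mask] = values[mask]
    return out

rng = np.random.default_rng(0)
image = np.full((16, 16, 3), 128, dtype=np.uint8)    # hypothetical grey test image
noisy = gaussian_noise(image, variance=512, rng=rng)
damaged = dead_pixels(image, fraction=0.25, rng=rng)
```

Applying such functions to the test images only, with an increasing `variance` or `fraction`, gives one stress curve per perturbation.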

For detection, following [4], the task considered is center object detection. A distance of 1 meter is set to allow a predicted center to be matched with a ground truth center. Then, predictions and ground truths are matched like in classical detection. Performance is summarized by the g-score (product of recall and precision). Three datasets are used for detection: ISPRS POTSDAM¹, VEDAI [18], and the GRSS Data Fusion Contest (DFC) 2015 [13].
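The g-score described above can be sketched as follows. The greedy, closest-first one-to-one matching is an assumption (the text only says that matching is done like in classical detection), as are the function name and the convention of returning 0 when there is no prediction or no ground truth.

```python
import math

def g_score(predicted, ground_truth, max_dist=1.0):
    """Product of precision and recall after one-to-one center matching.

    predicted and ground_truth are lists of (x, y) centers in meters; a pair
    can be matched only if its distance is at most max_dist.
    """
    pairs = sorted(
        (math.dist(p, g), i, j)
        for i, p in enumerate(predicted)
        for j, g in enumerate(ground_truth)
    )
    used_pred, used_gt = set(), set()
    for dist, i, j in pairs:          # closest pairs first
        if dist > max_dist:
            break
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)
    matches = len(used_pred)
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(ground_truth) if ground_truth else 0.0
    return precision * recall

predictions = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.5)]
truths = [(0.2, 0.0), (10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
score = g_score(predictions, truths)   # 2 matches: precision 2/3, recall 2/4
```

Since the score is a product, removing false alarms can raise it even when a few true detections are lost.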

Datasets are resized in order to provide a large range of resolutions: VEDAI is used with little resizing such that a car is contained in a 32x32 box, DFC2015 is resized such that a car is contained in a 48x48 box, and, finally, POTSDAM is slightly resized such that a car is contained in a 64x64 box. Obviously, performances are expected to increase with the resolution (this will be the case).

For segmentation, performance is measured by the straightforward overall accuracy on the same datasets except VEDAI where no labelling is available.

Detectors are based on SSD [15] but adapted to center regression. The cores of the detectors are both VGG and ResNet like in [15]. For segmentation, UNet is straightforwardly considered.

4.2 Results

All regularisation losses are evaluated with the same pipeline. However, on these datasets, the $L_2$ weight penalty has no impact on the performance except for large $\lambda$, for which it makes all weights collapse to 0 (leading to a very poor result). Spectral regularity exhibits a very strong numerical instability. Also, the standard gradient penalty [1] tends to degrade the performances (with and without stress). The fact that these methods have such a negative impact is quite surprising. For [1], which is very close to our method, it may be due to the softmax ([1] penalizes after the softmax, whereas here the penalty is applied before it, independently of the label). Either these methods are not adapted to this remote sensing context, or this paper may have implemented them too naively (despite $\lambda$, which balances the BCE loss and the regularity loss, being optimized on a grid). So, no conclusion will be stated for them.

Now, an interesting result is the comparison of performance with and without our regularity. Indeed, the main result of this paper is that the offered gradient regularisation mitigates the drop of performance caused by stress consistently across all datasets, tasks and networks. Raw performances are given as a reference to measure this drop (see table 1).

¹ www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html

Network       VEDAI 32x32   DFC 48x48   POTSDAM 64x64
VGG                53           66            78
VGG reg            53           66            81
RESNET             39           60            80
RESNET reg         43           49            82
UNET                -           84            86
UNET reg            -           79            82

Table 1: Performances on raw data (without noise).

On raw data, performances are quite similar with or without regularisation (a little lower when adding regularisation), and consistent with the literature.

Performances are measured by g-score in detection for SSD (based on VGG and RESNET) and by accuracy in segmentation for UNET (not available for VEDAI).

Network       VEDAI 32x32   DFC 48x48   POTSDAM 64x64
VGG                30           51            66
VGG reg            36           57            74
RESNET             22           24            50
RESNET reg         17           29            69
UNET                -           60            38
UNET reg            -           68            42

Table 2: Performances after 2 steps of FGSM.

Adding regularization greatly moderates the performance drop caused by two steps of FGSM attack compared to raw training.

Performances are measured by g-score in detection for SSD (based on VGG and RESNET) and by accuracy in segmentation for UNET (not available for VEDAI).

The most striking result is in table 2: native models are very sensitive to FGSM, and performances under two FGSM steps are very low compared to performances on raw data. Yet, adding the offered regularization moderates this drop. Regularized algorithms eventually outperform their raw versions by 5% on average (g-score for detection, accuracy for segmentation).

Of course, adversarial training may be even more efficient, but here the regularisation is linked to the global smoothness of the network, not just the local one.
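To make explicit what "two steps of FGSM" means, the sketch below applies the sign-of-gradient update twice. A logistic model stands in for the network so that the input gradient is analytic; this stand-in, the function names, and the per-step amplitude `epsilon` are assumptions, and no clipping to a valid pixel range is performed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, epsilon, steps=2):
    """Iterated FGSM on the logistic model p = sigmoid(<w, x> + b).

    Each step moves x by epsilon along the sign of the gradient of the
    binary cross entropy with respect to the input.
    """
    x_adv = x.astype(np.float64).copy()
    for _ in range(steps):
        p = sigmoid(x_adv @ w + b)
        grad_x = (p - y) * w                  # analytic d BCE / d x
        x_adv = x_adv + epsilon * np.sign(grad_x)
    return x_adv

w, b = np.array([1.0, -2.0, 0.5]), 0.1        # hypothetical trained weights
x, y = np.array([0.2, -0.1, 0.4]), 1
x_adv = fgsm(x, y, w, b, epsilon=0.1, steps=2)
p_clean = sigmoid(x @ w + b)
p_attacked = sigmoid(x_adv @ w + b)           # lower confidence in the true label
```

With a deep network, the only change is that the input gradient comes from automatic differentiation instead of the closed form used here.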

Another striking result is in table 3: images are corrupted by a strong Gaussian noise (variance is 512). Again, performances drop, but this drop is mitigated by regularisation. Let us stress that, as performance is measured by g-score in detection, it can even increase with noise by increasing precision and/or the ratio between precision and recall. Typically, on VEDAI, performances often increase with agnostic noise (never observed for adversarial noise). This phenomenon is not seen in segmentation as accuracy is a much smoother metric. On this last experiment, it is not clear at all whether common adversarial training and/or polytope-based certified defences would have been effective against such large-amplitude Gaussian noise, while global smoothness offers some protection.

Network       VEDAI 32x32   DFC 48x48   POTSDAM 64x64
VGG                54           29            52
VGG reg            52           54            56
RESNET             38           17            56
RESNET reg         45           45            76
UNET                -           58            39
UNET reg            -           65            44

Table 3: Performances after adding a 512 variance Gaussian noise.

Adding regularization greatly moderates the performance drop caused by a 512 variance Gaussian noise compared to raw training.

Performances are measured by g-score in detection for SSD (based on VGG and RESNET) and by accuracy in segmentation for UNET (not available for VEDAI).

Figure 1: Performances vs steps of FGSM (panels: VEDAI, DFC2015, POTSDAM).

x axis is the number of steps of FGSM. y axis is the g-score.

More detailed results are given in the following figures (all raw results, including variance estimation over several trainings, are available on the github), which plot performance for detection algorithms on the 3 datasets for Gaussian, black-white and adversarial noise (see figures 1, 2, 3). The x axis is the level of noise: steps of FGSM, percentage of dead pixels or Gaussian variance level (variance is proportional to the square of the level). The y axis is the performance.

Figure 2: Performances vs percentage of dead pixels (panels: VEDAI, DFC2015, POTSDAM).

x axis is the percentage of dead pixels. y axis is the g-score.

Figure 3: Performances vs level of Gaussian noise (panels: VEDAI, DFC2015, POTSDAM).

x axis is the level of Gaussian noise. y axis is the g-score.

5 Conclusion

This article focuses on object detection performance under stress, and benchmarks different kinds of regularization theoretically expected to make the detector more Lipschitz.

On three public remote sensing datasets, performances under stress on detection/segmentation are increased by using a batch-based gradient norm penalty. Such kinds of regularisation are a way to advance on the safety issues raised by deep learning detectors.

However, as a perspective, our result is far from closing this issue. First, such experiments should be reproduced on larger datasets, and, mainly, performance under stress still heavily decreases with stress level (even if this drop is much slower with regularisation than without). Another serious point is that, in our experiments, the more efficient an algorithm is on raw data, the more it tends to be sensitive to noise (typically, performances without noise are higher on POTSDAM than on VEDAI, but the drop of performance with noise is also higher on POTSDAM than on VEDAI).

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[2] Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre. Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images. Remote Sensing, 9(4):368, Apr 2017.

[3] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.

[4] A. Chan-Hon-Tong and N. Audebert. Object detection in remote sensing images with center only. In IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pages 7054–7057, July 2018.

[5] Pedro F Felzenszwalb, David A McAllester, Deva Ramanan, et al. A discriminatively trained, multiscale, deformable part model. In CVPR, volume 2, page 7, 2008.

[6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.

[7] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv, 2014.

[8] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.

[9] Todd Huster, Cho-Yu Jason Chiang, and Ritu Chadha. Limitations of the Lipschitz constant as a defense against adversarial examples. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 16–29. Springer, 2018.

[10] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97–117. Springer, 2017.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[12] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations (ICLR), 2017.

[13] Adrien Lagrange, Bertrand Le Saux, Anne Beaupere, Alexandre Boulch, Adrien Chan-Hon-Tong, Stéphane Herbin, Hicham Randrianarivo, and Marin Ferecatu. Benchmarking classification of earth-observation data: From learning explicit features to convolutional networks. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 4173–4176. IEEE, 2015.

[14] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[15] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[16] N. Narodytska and S. Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1318, July 2017.

[17] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372–387. IEEE, 2016.

[18] Sebastien Razakarivony and Frederic Jurie. Vehicle detection in aerial imagery: A small target detection benchmark. Journal of Visual Communication and Image Representation, 34:187–203, 2016.

[19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, June 2016.

[20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.

[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[22] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 307:195–204, 2018.

[23] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. Technical report arXiv:1312.6199, 2013.

[24] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

[25] Eric Wong and J Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.

[26] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.