• Aucun résultat trouvé

Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks

N/A
N/A
Protected

Academic year: 2021

Partager "Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks"

Copied!
2
0
0

Texte intégral

(1)Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks Gaël Letarte, Pascal Germain, Benjamin Guedj, François Laviolette. To cite this version: Gaël Letarte, Pascal Germain, Benjamin Guedj, François Laviolette. Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. ML with guarantees – NeurIPS 2019 workshop, Dec 2019, Vancouver, Canada. �hal-02482354�. HAL Id: hal-02482354 https://hal.inria.fr/hal-02482354 Submitted on 18 Feb 2020. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés..

(2) Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks 1. 2. 2,3. 1. Gaël Letarte , Pascal Germain , Benjamin Guedj , François Laviolette 1. 3. Département d’informatique et de génie logiciel, Université Laval, Québec, Canada 2 Équipe-projet Modal, Inria Lille - Nord Europe, Villeneuve d’Ascq, France UCL Centre for Artificial Intelligence, University College London, London, England. Introduction. Visualization The proposed method can be interpreted as a majority vote of hidden layer representations.. We present a comprehensive study of multilayer neural networks with binary activation, relying on the PAC-Bayesian theory.. s = (−1, −1, −1). s = (−1, −1, 1). s = (−1, 1, −1). s = (−1, 1, 1). Deterministic Network Fθ. 1.00. Contributions I An end-to-end framework to train a binary activated deep neural network (DNN). I Nonvacuous. 0.75 0.50. PAC-Bayesian generalization bounds for binary activated DNNs.. 0.25. s = (1, −1, −1). Binary Activated Neural Networks. BAM Network fθ. s = (1, 1, 1). 0.00 −0.25. fully connected layers. sgn −0.50. denotes the number of neurons of the k th layer. sgn. Prediction  fθ (x) = sgn wLsgn WL−1sgn . . . sgn W1x. sgn x1. sgn. · · · xd. Stochastic Approximation. PAC-Bayesian Theory. PAC-Bayesian Theorem For any prior P on F, with probability 1−δ on the choice of S∼Dn, we have for all C > 0, and any posterior distribution Q on F :    √ 1 1 E LD (f ) ≤ 1−e−C 1 − exp −C E LbS (f ) − [KL(QkP ) + ln 2 δ n ] f ∼Q f ∼Q n. Linear Classifier fw(x) := sgn(w · x), with w ∈ Rd. sgn. x1. x2. · · · xd. PAC-Bayesian analysis of all linear classifiers Fd := {fv|v ∈ Rd}. I Gaussian. prior Pu := N (u, Id) over Fd. posterior Qw := N (w, Id) over Fd   I Predictor Fw (x) := Ev∼Q fv (x) = erf √w·x w dkxk I Gaussian. I Linear. loss `(fv(x), y) := 21 − 21 yfv(x). s∈{−1,1}d1. 1.0. Monte Carlo sampling We generate T random binary vectors {st}Tt=1 according to Pr(s|x, W1). ! T T X k t X 1 · x x w 1 s ∂ t 1 0 t k Fθ (x) ≈ Fw2 (s ) √ F (s ) F (x) ≈ erf w2 θ 3 k k t T t=1 ∂w1 2 kxk T t=1 Pr(sk |x, w1 ) 2 2 kxk This turns out to be a variant of the REINFORCE algorithm (Williams, 1992).. Deep Learning (PBGNet) To enable a layer-by-layer computation of the prediction function, we want the neurons of a given layer to be independent of each other. This is achieved with the tree architecture mapping function ζ(θ) applied on a multilayer network.. 0.5. Recursive definition. denotes the output of the j th neuron of the k hidden layer :  j  (j) w1·x √ ←− first hidden layer F1 (x) = erf 2kxk  dk   j Y X 1 1 wk+1·s (j) (i) erf √2d Fk+1(x) = + si × Fk (x) k 2 2 d i=1 s∈{−1,1}. erf(x) tanh(x) sgn(x). 0.0. −0.5 −1.0. −3. −2. k. −1. 0. 1. 2. 0.203 0.16. for. d†k. :=. QL. i=k di .. Bound Value Test Error. Model MLP PBGnet` PBGnet`-bnd PBGnet PBGnetpre. 0.12. sgn. sgn. 0.10. Pr(s|x,W1). PAC-Bayesian bound ingredients 1 1  P n 1 bS (Fθ ) = Eθ0∼Q LbS (fθ0 ) = I Empirical loss : L θ i=1 2 − 2 yiFθ (xi) n term : KL(Qθ kPµ) = 12 kθ − µk2, with µ ∈ RD   √  2 n 1 1 b I Generalization bound : 1 − exp −C L (F ) − [KL(Q kP ) + ln S θ θ µ 1−e−C n δ ]. 0.08. 0.06. 0.04. I Complexity. 0.02. 0.00. 1.000 0.096 0.040 0.009. · · · xd. 0.056. x1. !#. sgn. Error. Rd1×d0. Contact : [email protected] Code : github.com/gletarte/dichotomize-and-generalize. 2 F. sgn. Y d1 w2 · s 1 si w1i · x √ + erf √ 2d1 i=1 2 2 2 kxk {z }| {z }. Fw2 (s). !. Train Error. 0.763. s∈{−1,1}d1. erf |. x1 x2 x1 x2 x1 x2 x1 x2 x1 x2 x1 x2 x1 x2 x1 x2 x1 x2. 0.14. 1.000. =. X. . x2. We perform model selection using a validation set for a MLP with tanh activations, and using the PAC-Bayes bound for our PBGnet and PBGnetpre algorithms. PBGnet` and PBGnet`-bnd are intermediate variants.. Posterior Qθ = N (θ, ID ), over the family of all networks FD = {fθ̃ | θ̃ ∈ RD }, where  fθ (x) = sgn w2 · sgn(W1x). ". x1. ŷ. Experiment. Shallow Learning. s∈{−1,1}d1. ŷ. 3. i. Fθ (x) = E fθ̃ (x) θ̃∼Qθ Z Z = Q1(V1) Q2(v2)sgn(v2 · sgn(V1x))dv2dV1 Rd1×d0 RZd1   X 2 ·s = erf √w2d 1[s = sgn(V1x)]Q1(V1) dV1. ζ(θ). Kullback-Leibler regularization L−1   1 X kwL − uLk2 + d†k+1 Wi − Ui KL Qζ(θ) Pζ(µ) = 2 i=1. Bound minimization   P n 1 1 2 i b + C n LS (Fw) + KL(QwkPu) = C 2 i=1 erf −yi √w·x kw − uk 2 dkx k. 1. s∈{−1,1}. (j) Fk th. PAC-Bayesian Learning of Linear Classifiers (Germain et al., 2009). I Space. Fw2 (s) Pr(s|x, W1). ads. adult. mnist17. Dataset. mnist49. mnist56. 0.033. i=1. ←− empirical loss. Fθ (x) =. 0.310 0.160.   ` f (xi), yi. X. Hidden layer partial derivatives, with ! erf (x) :=   k X ∂ x w1 · x Pr(s|x, W1) 0 Fθ (x) = 3 erf √ sk Fw2 (s) k k) ∂w1 Pr(s |x, w 2 kxk 2 2 kxk k 1 d1. 1.000. 1 b LS (f ) = n. 2 2 −x √ e . π. 0. Prediction.. Given a data distribution D, a training set S = {(x1, y1), . . . , (xn, yn)} ∼ Dn, with xi ∈ Rd0 and yi ∈ {−1, 1}, a loss ` : [−1, 1]2 → [0, 1], a predictor f ∈ F :   LD (f ) = E ` f (x), y ←− generalization loss (x,y)∼D n X. Figure 1: Illustration of the proposed method for a one hidden layer network of size d1=3, interpreted as a majority vote over 8 binary representations s ∈ {−1, 1}3. For each s, a plot shows the values of Fw2 (s) Pr(s|x, W1).. 0.017. ∈R. 0.171 0.089. , θ = vec. −1.00. D. 1.000. matrices : Wk ∈ R. {Wk }Lk=1. X.XXX. I Weights. dk ×dk−1. . −0.75. sgn. 0.027. = 1 if a > 0 and sgn(a) = −1 otherwise. sgn. 0.311 0.139. I sgn(a). sgn. 1.000. I dk. s = (1, 1, −1). 0.606 0.281 0.214 0.162. IL. s = (1, −1, 1). mnistLH. Figure 2: Experiment results for the considered models on the binary classification datasets. PAC-Bayesian bounds hold with probability 0.95..

(3)

Références

Documents relatifs

This new class, termed I -optimality, is analogous to Kiefer’s L W -criterion but is based on predicted variance, whereas Kiefer’s class is based on the k eigenvalues of the

If we attempt to build a power amplifier utilizing a particular amplifying device and arbitrary passive coupling it is of interest to determine how the device

They showed in particular that using only iterates from a very basic fixed-step gradient descent, these extrapolation algorithms produced solutions reaching the optimal convergence

In this subsection, we show that if one only considers the approximation theoretic properties of the resulting function classes, then—under extremely mild assumptions on the

Measuring a network’s complexity by its number of connections, or its number of neurons, we consider the class of functions which error of best approximation with networks of a

The robot captures a single image at the initial pose, the network is trained again and then our CNN-based direct visual servoing is performed.. While the robot is servoing the

The prospects for large reserves of coal in the Cumberland basin depend on two main factors: (1) the continuance of the thick seams of the Springhill district

Inspired by Random Neural Networks, we introduce Random Neural Layer (RNL), which is compatible with other kinds of neural layers in Deep Neural Networks (DNN). The