Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks

Full text

To cite this version: Gaël Letarte, Pascal Germain, Benjamin Guedj, François Laviolette. Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. ML with guarantees – NeurIPS 2019 workshop, Dec 2019, Vancouver, Canada. hal-02482354.

HAL Id: hal-02482354. https://hal.inria.fr/hal-02482354. Submitted on 18 Feb 2020.

Gaël Letarte (1), Pascal Germain (2), Benjamin Guedj (2,3), François Laviolette (1)

(1) Département d’informatique et de génie logiciel, Université Laval, Québec, Canada
(2) Équipe-projet Modal, Inria Lille - Nord Europe, Villeneuve d’Ascq, France
(3) UCL Centre for Artificial Intelligence, University College London, London, England

Introduction

We present a comprehensive study of multilayer neural networks with binary activation, relying on the PAC-Bayesian theory.

Contributions

- An end-to-end framework to train a binary activated deep neural network (DNN).
- Nonvacuous PAC-Bayesian generalization bounds for binary activated DNNs.

Binary Activated Neural Networks

- $L$ fully connected layers.
- $d_k$ denotes the number of neurons of the $k$-th layer.
- $\mathrm{sgn}(a) = 1$ if $a > 0$ and $\mathrm{sgn}(a) = -1$ otherwise.
- Weight matrices: $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, $\theta = \mathrm{vec}\big(\{W_k\}_{k=1}^{L}\big) \in \mathbb{R}^{D}$.

Prediction:

$$f_\theta(x) = \mathrm{sgn}\Big(w_L\, \mathrm{sgn}\big(W_{L-1}\, \mathrm{sgn}(\ldots \mathrm{sgn}(W_1 x))\big)\Big)$$

PAC-Bayesian Theory

Given a data distribution $D$, a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \sim D^n$ with $x_i \in \mathbb{R}^{d_0}$ and $y_i \in \{-1, 1\}$, a loss $\ell : [-1, 1]^2 \to [0, 1]$, and a predictor $f \in \mathcal{F}$:

$$L_D(f) = \mathbb{E}_{(x,y) \sim D}\, \ell\big(f(x), y\big) \quad \text{(generalization loss)}$$

$$\widehat{L}_S(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) \quad \text{(empirical loss)}$$

PAC-Bayesian Theorem. For any prior $P$ on $\mathcal{F}$, with probability $1 - \delta$ on the choice of $S \sim D^n$, we have for all $C > 0$ and any posterior distribution $Q$ on $\mathcal{F}$:

$$\mathbb{E}_{f \sim Q} L_D(f) \le \frac{1}{1 - e^{-C}} \left( 1 - \exp\left( -C\, \mathbb{E}_{f \sim Q} \widehat{L}_S(f) - \frac{1}{n} \left[ \mathrm{KL}(Q \| P) + \ln \frac{2\sqrt{n}}{\delta} \right] \right) \right)$$

PAC-Bayesian Learning of Linear Classifiers (Germain et al., 2009)

Linear classifier: $f_w(x) := \mathrm{sgn}(w \cdot x)$, with $w \in \mathbb{R}^d$.

PAC-Bayesian analysis:

- Space of all linear classifiers $\mathcal{F}_d := \{f_v \mid v \in \mathbb{R}^d\}$.
- Gaussian prior $P_u := \mathcal{N}(u, I_d)$ over $\mathcal{F}_d$.
- Gaussian posterior $Q_w := \mathcal{N}(w, I_d)$ over $\mathcal{F}_d$.
- Predictor $F_w(x) := \mathbb{E}_{v \sim Q_w} f_v(x) = \mathrm{erf}\left( \frac{w \cdot x}{\sqrt{2}\, \|x\|} \right)$.
- Linear loss $\ell(f_v(x), y) := \frac{1}{2} - \frac{1}{2}\, y f_v(x)$.

Bound minimization:

$$C n\, \widehat{L}_S(F_w) + \mathrm{KL}(Q_w \| P_u) = \frac{C}{2} \sum_{i=1}^{n} \left[ 1 + \mathrm{erf}\left( -y_i \frac{w \cdot x_i}{\sqrt{2}\, \|x_i\|} \right) \right] + \frac{1}{2} \|w - u\|^2$$

[Plot comparing erf(x), tanh(x) and sgn(x).]

Shallow Learning

Posterior $Q_\theta = \mathcal{N}(\theta, I_D)$ over the family of all networks $\mathcal{F}_D = \{f_{\tilde\theta} \mid \tilde\theta \in \mathbb{R}^D\}$, where $f_\theta(x) = \mathrm{sgn}\big(w_2 \cdot \mathrm{sgn}(W_1 x)\big)$.

$$\begin{aligned}
F_\theta(x) &= \mathbb{E}_{\tilde\theta \sim Q_\theta} f_{\tilde\theta}(x) \\
&= \int_{\mathbb{R}^{d_1 \times d_0}} Q_1(V_1) \int_{\mathbb{R}^{d_1}} Q_2(v_2)\, \mathrm{sgn}\big(v_2 \cdot \mathrm{sgn}(V_1 x)\big)\, dv_2\, dV_1 \\
&= \sum_{s \in \{-1,1\}^{d_1}} \mathrm{erf}\left( \frac{w_2 \cdot s}{\sqrt{2 d_1}} \right) \int_{\mathbb{R}^{d_1 \times d_0}} \mathbb{1}[s = \mathrm{sgn}(V_1 x)]\, Q_1(V_1)\, dV_1 \\
&= \sum_{s \in \{-1,1\}^{d_1}} \underbrace{\mathrm{erf}\left( \frac{w_2 \cdot s}{\sqrt{2 d_1}} \right)}_{F_{w_2}(s)}\; \underbrace{\prod_{i=1}^{d_1} \left[ \frac{1}{2} + \frac{s_i}{2}\, \mathrm{erf}\left( \frac{w_1^i \cdot x}{\sqrt{2}\, \|x\|} \right) \right]}_{\Pr(s \mid x, W_1)}
\end{aligned}$$

That is, $F_\theta(x) = \sum_{s \in \{-1,1\}^{d_1}} F_{w_2}(s)\, \Pr(s \mid x, W_1)$.

Stochastic Approximation

Hidden layer partial derivatives, with $\mathrm{erf}'(x) := \frac{2}{\sqrt{\pi}} e^{-x^2}$:

$$\frac{\partial}{\partial w_1^k} F_\theta(x) = \frac{x}{2^{3/2}\, \|x\|}\, \mathrm{erf}'\left( \frac{w_1^k \cdot x}{\sqrt{2}\, \|x\|} \right) \sum_{s \in \{-1,1\}^{d_1}} s_k\, F_{w_2}(s)\, \frac{\Pr(s \mid x, W_1)}{\Pr(s_k \mid x, w_1^k)}$$

Monte Carlo sampling. We generate $T$ random binary vectors $\{s^t\}_{t=1}^{T}$ according to $\Pr(s \mid x, W_1)$.

Prediction:

$$F_\theta(x) \approx \frac{1}{T} \sum_{t=1}^{T} F_{w_2}(s^t)$$

Derivatives:

$$\frac{\partial}{\partial w_1^k} F_\theta(x) \approx \frac{x}{2^{3/2}\, \|x\|}\, \mathrm{erf}'\left( \frac{w_1^k \cdot x}{\sqrt{2}\, \|x\|} \right) \frac{1}{T} \sum_{t=1}^{T} \frac{s_k^t}{\Pr(s_k^t \mid x, w_1^k)}\, F_{w_2}(s^t)$$

This turns out to be a variant of the REINFORCE algorithm (Williams, 1992).

Visualization

The proposed method can be interpreted as a majority vote of hidden layer representations.

[Figure 1 panels, with axes $x_1$ and $x_2$: s = (−1, −1, −1), s = (−1, −1, 1), s = (−1, 1, −1), s = (−1, 1, 1), s = (1, −1, −1), s = (1, −1, 1), s = (1, 1, −1), s = (1, 1, 1), BAM Network $f_\theta$, Deterministic Network $F_\theta$.]

Figure 1: Illustration of the proposed method for a one hidden layer network of size $d_1 = 3$, interpreted as a majority vote over 8 binary representations $s \in \{-1, 1\}^3$. For each $s$, a plot shows the values of $F_{w_2}(s) \Pr(s \mid x, W_1)$.

Deep Learning (PBGNet)

To enable a layer-by-layer computation of the prediction function, we want the neurons of a given layer to be independent of each other. This is achieved with the tree architecture mapping function $\zeta(\theta)$ applied on a multilayer network.

Recursive definition. $F_k^{(j)}$ denotes the output of the $j$-th neuron of the $k$-th hidden layer:

$$F_1^{(j)}(x) = \mathrm{erf}\left( \frac{w_1^j \cdot x}{\sqrt{2}\, \|x\|} \right) \quad \text{(first hidden layer)}$$

$$F_{k+1}^{(j)}(x) = \sum_{s \in \{-1,1\}^{d_k}} \mathrm{erf}\left( \frac{w_{k+1}^j \cdot s}{\sqrt{2 d_k}} \right) \prod_{i=1}^{d_k} \left( \frac{1}{2} + \frac{1}{2}\, s_i \times F_k^{(i)}(x) \right)$$

Kullback-Leibler regularization:

$$\mathrm{KL}\big(Q_{\zeta(\theta)} \,\|\, P_{\zeta(\mu)}\big) = \frac{1}{2} \left( \|w_L - u_L\|^2 + \sum_{k=1}^{L-1} d^{\dagger}_{k+1}\, \|W_k - U_k\|_F^2 \right), \quad \text{for } d^{\dagger}_k := \prod_{i=k}^{L} d_i.$$

PAC-Bayesian bound ingredients

- Empirical loss: $\widehat{L}_S(F_\theta) = \mathbb{E}_{\theta' \sim Q_\theta} \widehat{L}_S(f_{\theta'}) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{2} - \frac{1}{2}\, y_i F_\theta(x_i) \right]$
- Complexity term: $\mathrm{KL}(Q_\theta \| P_\mu) = \frac{1}{2} \|\theta - \mu\|^2$, with $\mu \in \mathbb{R}^D$
- Generalization bound: $\frac{1}{1 - e^{-C}} \left( 1 - \exp\left( -C\, \widehat{L}_S(F_\theta) - \frac{1}{n} \left[ \mathrm{KL}(Q_\theta \| P_\mu) + \ln \frac{2\sqrt{n}}{\delta} \right] \right) \right)$

Experiment

We perform model selection using a validation set for an MLP with tanh activations, and using the PAC-Bayes bound for our PBGnet and PBGnet_pre algorithms. PBGnet_ℓ and PBGnet_ℓ-bnd are intermediate variants.

[Figure 2 bar charts: train error, test error and bound value for the models MLP, PBGnet_ℓ, PBGnet_ℓ-bnd, PBGnet and PBGnet_pre on the datasets ads, adult, mnist17, mnist49, mnist56 and mnistLH.]

Figure 2: Experiment results for the considered models on the binary classification datasets. PAC-Bayesian bounds hold with probability 0.95.

Contact: [email protected]
Code: github.com/gletarte/dichotomize-and-generalize
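The shallow-network prediction, a sum of $F_{w_2}(s) \Pr(s \mid x, W_1)$ over all sign vectors $s$, and its Monte Carlo approximation can be compared numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the toy weights `W1` and `w2`, the input `x`, and the sample count are all assumptions made for illustration.

```python
import itertools
import math
import random


def hidden_probs(W1, x):
    """Pr(s_i = +1 | x, w_1^i) = 1/2 + 1/2 * erf(w_1^i . x / (sqrt(2) * ||x||))."""
    norm_x = math.sqrt(sum(v * v for v in x))
    return [
        0.5 + 0.5 * math.erf(sum(w * v for w, v in zip(row, x)) / (math.sqrt(2) * norm_x))
        for row in W1
    ]


def output_vote(w2, s):
    """F_{w2}(s) = erf(w2 . s / sqrt(2 * d1)), where d1 = len(s)."""
    return math.erf(sum(w * si for w, si in zip(w2, s)) / math.sqrt(2 * len(s)))


def exact_prediction(W1, w2, x):
    """Sum F_{w2}(s) * Pr(s | x, W1) over all 2^d1 sign vectors s."""
    p_plus = hidden_probs(W1, x)
    total = 0.0
    for s in itertools.product((-1, 1), repeat=len(W1)):
        prob = 1.0
        for si, p in zip(s, p_plus):
            prob *= p if si == 1 else 1.0 - p
        total += output_vote(w2, s) * prob
    return total


def monte_carlo_prediction(W1, w2, x, T, rng):
    """Average F_{w2}(s^t) over T samples s^t drawn from Pr(s | x, W1)."""
    p_plus = hidden_probs(W1, x)
    total = 0.0
    for _ in range(T):
        s = [1 if rng.random() < p else -1 for p in p_plus]
        total += output_vote(w2, s)
    return total / T


# Toy weights and input, chosen only for illustration (d0 = 2, d1 = 3).
W1 = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]
w2 = [0.7, -1.1, 0.5]
x = [0.6, -0.2]

exact = exact_prediction(W1, w2, x)
approx = monte_carlo_prediction(W1, w2, x, T=20000, rng=random.Random(0))
print(exact, approx)
```

The exact sum costs $2^{d_1}$ terms, which is why sampling becomes attractive as the hidden layer grows; the sampled estimate is unbiased and its error shrinks roughly as $1/\sqrt{T}$.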

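The generalization bound listed among the PAC-Bayesian bound ingredients is a closed-form function of the empirical loss, the KL term, $n$, $\delta$ and $C$. The sketch below evaluates it in plain Python. The function names, the parameter vectors and the numbers plugged in are hypothetical and are not results from the poster. Because the theorem is stated for all $C > 0$ simultaneously, scanning a grid of $C$ values and keeping the smallest bound is assumed to be legitimate here.

```python
import math


def pac_bayes_bound(empirical_loss, kl, n, delta, C):
    """(1 / (1 - e^{-C})) * (1 - exp(-C * L_hat - (KL + ln(2 sqrt(n) / delta)) / n))."""
    complexity = (kl + math.log(2.0 * math.sqrt(n) / delta)) / n
    return (1.0 - math.exp(-C * empirical_loss - complexity)) / (1.0 - math.exp(-C))


def gaussian_kl(theta, mu):
    """KL(Q_theta || P_mu) = 1/2 * ||theta - mu||^2 for identity-covariance Gaussians."""
    return 0.5 * sum((t - m) ** 2 for t, m in zip(theta, mu))


# Hypothetical numbers, for illustration only (not results from the poster).
theta = [0.5, -1.0, 2.0]
mu = [0.0, 0.0, 0.0]
kl = gaussian_kl(theta, mu)

candidates = [0.1 * k for k in range(1, 101)]  # grid over C in (0, 10]
best_C = min(candidates, key=lambda C: pac_bayes_bound(0.05, kl, 10000, 0.05, C))
print(kl, best_C, pac_bayes_bound(0.05, kl, 10000, 0.05, best_C))
```

With $\delta = 0.05$ the bound holds with probability 0.95, matching the confidence level quoted for the experiments.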