Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks
Gaël Letarte (1), Pascal Germain (2), Benjamin Guedj (2,3), François Laviolette (1)

(1) Département d’informatique et de génie logiciel, Université Laval, Québec, Canada
(2) Équipe-projet Modal, Inria Lille - Nord Europe, Villeneuve d’Ascq, France
(3) UCL Centre for Artificial Intelligence, University College London, London, England

Contact: [email protected]
Code: github.com/gletarte/dichotomize-and-generalize

Introduction

We present a comprehensive study of multilayer neural networks with binary activation, relying on the PAC-Bayesian theory.

Contributions
- An end-to-end framework to train a binary activated deep neural network (DNN).
- Nonvacuous PAC-Bayesian generalization bounds for binary activated DNNs.

PAC-Bayesian Theory

Given a data distribution $D$, a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \sim D^n$ with $x_i \in \mathbb{R}^{d_0}$ and $y_i \in \{-1, 1\}$, a loss $\ell : [-1, 1]^2 \to [0, 1]$ and a predictor $f \in \mathcal{F}$, we define the generalization loss and the empirical loss:

$$L_D(f) = \mathbb{E}_{(x,y)\sim D}\, \ell\big(f(x), y\big), \qquad \widehat{L}_S(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big).$$

PAC-Bayesian Theorem. For any prior $P$ on $\mathcal{F}$, with probability $1-\delta$ on the choice of $S \sim D^n$, we have for all $C > 0$ and any posterior distribution $Q$ on $\mathcal{F}$:

$$\mathbb{E}_{f\sim Q} L_D(f) \le \frac{1}{1-e^{-C}}\left[1 - \exp\left(-C\, \mathbb{E}_{f\sim Q} \widehat{L}_S(f) - \frac{1}{n}\left[\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\right]\right)\right].$$

Binary Activated Neural Networks

- $L$ fully connected layers.
- $d_k$ denotes the number of neurons of the $k$-th layer.
- $\mathrm{sgn}(a) = 1$ if $a > 0$ and $\mathrm{sgn}(a) = -1$ otherwise.
- Weight matrices: $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, $\theta = \mathrm{vec}\big(\{W_k\}_{k=1}^{L}\big) \in \mathbb{R}^{D}$.

Prediction:

$$f_\theta(x) = \mathrm{sgn}\big(w_L\, \mathrm{sgn}\big(W_{L-1}\, \mathrm{sgn}\big(\ldots \mathrm{sgn}(W_1 x)\big)\big)\big).$$

PAC-Bayesian Learning of Linear Classifiers (Germain et al., 2009)

Linear classifier: $f_w(x) := \mathrm{sgn}(w \cdot x)$, with $w \in \mathbb{R}^d$.

PAC-Bayesian analysis:
- Space of all linear classifiers $\mathcal{F}_d := \{f_v \mid v \in \mathbb{R}^d\}$.
- Gaussian prior $P_u := \mathcal{N}(u, I_d)$ over $\mathcal{F}_d$.
- Gaussian posterior $Q_w := \mathcal{N}(w, I_d)$ over $\mathcal{F}_d$.
- Predictor $F_w(x) := \mathbb{E}_{v\sim Q_w} f_v(x) = \mathrm{erf}\left(\frac{w \cdot x}{\sqrt{2}\,\|x\|}\right)$.
- Linear loss $\ell(f_v(x), y) := \frac{1}{2} - \frac{1}{2}\, y f_v(x)$.

[Plot: erf(x), tanh(x) and sgn(x) drawn on the same axes.]

Bound minimization. Up to an additive constant that does not depend on $w$, minimizing the bound amounts to minimizing

$$C\, n\, \widehat{L}_S(F_w) + \mathrm{KL}(Q_w\|P_u) = \frac{C}{2}\sum_{i=1}^{n} \mathrm{erf}\left(-y_i \frac{w \cdot x_i}{\sqrt{2}\,\|x_i\|}\right) + \frac{1}{2}\|w - u\|^2 + \text{const}.$$

Shallow Learning

Posterior $Q_\theta = \mathcal{N}(\theta, I_D)$ over the family of all networks $\mathcal{F}_D = \{f_{\tilde\theta} \mid \tilde\theta \in \mathbb{R}^D\}$, where $f_\theta(x) = \mathrm{sgn}\big(w_2 \cdot \mathrm{sgn}(W_1 x)\big)$.

$$\begin{aligned}
F_\theta(x) &= \mathbb{E}_{\tilde\theta\sim Q_\theta} f_{\tilde\theta}(x) \\
&= \int_{\mathbb{R}^{d_1\times d_0}} Q_1(V_1) \int_{\mathbb{R}^{d_1}} Q_2(v_2)\, \mathrm{sgn}\big(v_2 \cdot \mathrm{sgn}(V_1 x)\big)\, dv_2\, dV_1 \\
&= \sum_{s\in\{-1,1\}^{d_1}} \mathrm{erf}\left(\frac{w_2 \cdot s}{\sqrt{2 d_1}}\right) \int_{\mathbb{R}^{d_1\times d_0}} \mathbb{1}[s = \mathrm{sgn}(V_1 x)]\, Q_1(V_1)\, dV_1 \\
&= \sum_{s\in\{-1,1\}^{d_1}} \underbrace{\mathrm{erf}\left(\frac{w_2 \cdot s}{\sqrt{2 d_1}}\right)}_{F_{w_2}(s)}\; \underbrace{\prod_{i=1}^{d_1}\left[\frac{1}{2} + \frac{s_i}{2}\, \mathrm{erf}\left(\frac{w_1^i \cdot x}{\sqrt{2}\,\|x\|}\right)\right]}_{\Pr(s \mid x, W_1)}.
\end{aligned}$$

PAC-Bayesian bound ingredients:
- Empirical loss: $\widehat{L}_S(F_\theta) = \mathbb{E}_{\theta'\sim Q_\theta} \widehat{L}_S(f_{\theta'}) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{2} - \frac{1}{2}\, y_i F_\theta(x_i)\right]$.
- Complexity term: $\mathrm{KL}(Q_\theta\|P_\mu) = \frac{1}{2}\|\theta - \mu\|^2$, with $\mu \in \mathbb{R}^D$.
- Generalization bound: $\frac{1}{1-e^{-C}}\left[1 - \exp\left(-C\, \widehat{L}_S(F_\theta) - \frac{1}{n}\left[\mathrm{KL}(Q_\theta\|P_\mu) + \ln\frac{2\sqrt{n}}{\delta}\right]\right)\right]$.

Visualization

The proposed method can be interpreted as a majority vote of hidden layer representations.

[Figure 1 panels: BAM Network $f_\theta$; Deterministic Network $F_\theta$; one panel for each of the eight representations s = (−1, −1, −1), (−1, −1, 1), (−1, 1, −1), (−1, 1, 1), (1, −1, −1), (1, −1, 1), (1, 1, −1), (1, 1, 1); axes $x_1$ and $x_2$.]

Figure 1: Illustration of the proposed method for a one hidden layer network of size $d_1 = 3$, interpreted as a majority vote over 8 binary representations $s \in \{-1, 1\}^3$. For each $s$, a plot shows the values of $F_{w_2}(s) \Pr(s \mid x, W_1)$.

Stochastic Approximation

Prediction. Monte Carlo sampling: we generate $T$ random binary vectors $\{s^t\}_{t=1}^{T}$ according to $\Pr(s \mid x, W_1)$, so that

$$F_\theta(x) \approx \frac{1}{T}\sum_{t=1}^{T} F_{w_2}(s^t).$$

Derivatives. Hidden layer partial derivatives, with $\mathrm{erf}'(x) := \frac{2}{\sqrt{\pi}} e^{-x^2}$:

$$\begin{aligned}
\frac{\partial}{\partial w_1^k} F_\theta(x) &= \frac{x}{2^{3/2}\|x\|}\, \mathrm{erf}'\left(\frac{w_1^k \cdot x}{\sqrt{2}\,\|x\|}\right) \sum_{s\in\{-1,1\}^{d_1}} s_k\, F_{w_2}(s)\, \frac{\Pr(s \mid x, W_1)}{\Pr(s_k \mid x, w_1^k)} \\
&\approx \frac{x}{2^{3/2}\|x\|}\, \mathrm{erf}'\left(\frac{w_1^k \cdot x}{\sqrt{2}\,\|x\|}\right) \frac{1}{T}\sum_{t=1}^{T} \frac{s_k^t\, F_{w_2}(s^t)}{\Pr(s_k^t \mid x, w_1^k)}.
\end{aligned}$$

This turns out to be a variant of the REINFORCE algorithm (Williams, 1992).

Deep Learning (PBGNet)

To enable a layer-by-layer computation of the prediction function, we want the neurons of a given layer to be independent of each other. This is achieved with the tree architecture mapping function $\zeta(\theta)$ applied on a multilayer network.

Recursive definition. $F_k^{(j)}$ denotes the output of the $j$-th neuron of the $k$-th hidden layer:

$$F_1^{(j)}(x) = \mathrm{erf}\left(\frac{w_1^j \cdot x}{\sqrt{2}\,\|x\|}\right) \quad \text{(first hidden layer)},$$

$$F_{k+1}^{(j)}(x) = \sum_{s\in\{-1,1\}^{d_k}} \mathrm{erf}\left(\frac{w_{k+1}^j \cdot s}{\sqrt{2 d_k}}\right) \prod_{i=1}^{d_k}\left(\frac{1}{2} + \frac{1}{2}\, s_i \times F_k^{(i)}(x)\right).$$

Kullback-Leibler regularization:

$$\mathrm{KL}\big(Q_{\zeta(\theta)}\,\|\,P_{\zeta(\mu)}\big) = \frac{1}{2}\left(\|w_L - u_L\|^2 + \sum_{k=1}^{L-1} d^{\dagger}_{k+1}\, \|W_k - U_k\|_F^2\right), \quad \text{for } d^{\dagger}_k := \prod_{i=k}^{L} d_i.$$

Experiment

We perform model selection using a validation set for an MLP with tanh activations, and using the PAC-Bayes bound for our PBGnet and PBGnetpre algorithms. PBGnet_ℓ and PBGnet_ℓ-bnd are intermediate variants.

[Figure 2: bar chart of the error (bound value, test error, train error) of the models MLP, PBGnet_ℓ, PBGnet_ℓ-bnd, PBGnet and PBGnetpre on the datasets ads, adult, mnist17, mnist49, mnist56 and mnistLH.]

Figure 2: Experiment results for the considered models on the binary classification datasets. PAC-Bayesian bounds hold with probability 0.95.
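As an illustration of the shallow (one hidden layer) predictor, the following minimal sketch computes F_θ(x) exactly by summing F_{w2}(s) Pr(s|x, W1) over all 2^{d1} binary representations s, and compares it with the Monte Carlo estimate obtained from T sampled vectors. It assumes NumPy is available; the function names (`hidden_sign_probs`, `output_vote`, `predict_exact`, `predict_monte_carlo`) and the random toy weights are hypothetical choices made for this example and are not taken from the authors' code.

```python
import itertools
import math

import numpy as np


def hidden_sign_probs(x, W1):
    """Pr(s_i = +1 | x, w_1^i) = 1/2 + 1/2 * erf(w_1^i . x / (sqrt(2) * ||x||))."""
    a = W1 @ x / (math.sqrt(2.0) * np.linalg.norm(x))
    return 0.5 + 0.5 * np.array([math.erf(v) for v in a])


def output_vote(s, w2):
    """F_{w2}(s) = erf(w2 . s / sqrt(2 * d1)), where d1 = len(s)."""
    return math.erf(float(w2 @ s) / math.sqrt(2.0 * len(s)))


def predict_exact(x, W1, w2):
    """Exact F_theta(x): sum over every s in {-1, 1}^d1 (cost grows as 2^d1)."""
    p_plus = hidden_sign_probs(x, W1)
    total = 0.0
    for s in itertools.product((-1.0, 1.0), repeat=W1.shape[0]):
        s = np.array(s)
        # Pr(s | x, W1) factorizes over the hidden neurons.
        prob = float(np.prod(np.where(s > 0, p_plus, 1.0 - p_plus)))
        total += output_vote(s, w2) * prob
    return total


def predict_monte_carlo(x, W1, w2, T, rng):
    """Estimate F_theta(x) by averaging F_{w2}(s^t) over T samples s^t ~ Pr(s | x, W1)."""
    p_plus = hidden_sign_probs(x, W1)
    samples = np.where(rng.random((T, W1.shape[0])) < p_plus, 1.0, -1.0)
    return float(np.mean([output_vote(s, w2) for s in samples]))


# Toy usage with arbitrary weights (d0 = 4 inputs, d1 = 3 hidden neurons).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(3, 4))
w2 = rng.normal(size=3)
print("exact      :", predict_exact(x, W1, w2))
print("Monte Carlo:", predict_monte_carlo(x, W1, w2, T=20000, rng=rng))
```

The exact sum is only practical for small d1, which is the motivation for the sampling estimate; the two values should agree up to the Monte Carlo error, which shrinks as 1/sqrt(T).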