
Application to Logistic Regression and Bayesian Deep Learning


3.4.1 Binary logistic regression with missing values

This application follows Example 1 described in Section 3.2. We consider a binary regression setup, ((y_i, z_i), i ∈ JnK), where y_i ∈ {0,1} is a binary response and z_i = (z_{i,j} ∈ R, j ∈ JpK) is a covariate vector. The vector of covariates z_i = [z_{i,mis}, z_{i,obs}] is not fully observed; we denote by z_{i,mis} the missing values and by z_{i,obs} the observed covariates.

It is assumed that (z_i, i ∈ JnK) are i.i.d. and marginally distributed according to N(β, Ω), where β ∈ R^p and Ω is a positive definite p×p matrix.

We define the conditional distribution of the observations y_i given z_i = (z_{i,mis}, z_{i,obs}) as:
\[
p_i(y_i \mid z_i) = S(\delta^\top \bar{z}_i)^{y_i}\,\big(1 - S(\delta^\top \bar{z}_i)\big)^{1 - y_i} \tag{3.4.1}
\]
where, for u ∈ R, S(u) = 1/(1 + e^{-u}), δ = (δ_0, ..., δ_p) are the logistic parameters and z̄_i = (1, z_i). We are interested in estimating δ and finding the latent structure of the covariates z_i. Here, θ = (δ, β, Ω) is the parameter to estimate. For i ∈ JnK, the complete data log-likelihood is expressed as:
\[
\log f_i(z_{i,\mathrm{mis}}, \theta) \propto y_i\,\delta^\top \bar{z}_i - \log\big(1 + \exp(\delta^\top \bar{z}_i)\big) - \frac{1}{2}\log(|\Omega|) - \frac{1}{2}\operatorname{Tr}\big(\Omega^{-1}(z_i - \beta)(z_i - \beta)^\top\big)\,.
\]
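To make this concrete, here is a minimal NumPy sketch, written by us, that evaluates this unnormalized complete-data log-likelihood for a single completed observation; the function name and interface are illustrative assumptions, not code from the thesis.

```python
import numpy as np

def complete_data_loglik(y_i, z_i, delta, beta, Omega):
    """Unnormalized complete-data log-likelihood of one observation,
    assuming the missing entries of z_i have already been imputed."""
    z_bar = np.concatenate(([1.0], z_i))          # \bar{z}_i = (1, z_i)
    logit = delta @ z_bar                         # \delta^T \bar{z}_i
    # Logistic term: y_i * logit - log(1 + exp(logit)), computed stably.
    log_lik_y = y_i * logit - np.logaddexp(0.0, logit)
    # Gaussian term: -(1/2) log|Omega| - (1/2) (z_i - beta)^T Omega^{-1} (z_i - beta).
    diff = z_i - beta
    _, logdet = np.linalg.slogdet(Omega)
    log_lik_z = -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Omega, diff)
    return log_lik_y + log_lik_z
```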

Choice of surrogate function for MISO: We recall the MISO deterministic surrogate defined in (3.2.8):
\[
\widehat{L}_i(\theta; \bar{\theta}) = \int_{\mathsf{Z}} \log\frac{p_i(z_{i,\mathrm{mis}}, \bar{\theta})}{f_i(z_{i,\mathrm{mis}}, \theta)}\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}}) \tag{3.4.2}
\]
where θ = (δ, β, Ω) and θ̄ = (δ̄, β̄, Ω̄). We adapt it to our missing covariates problem and decompose the term depending on θ, θ̄ being fixed, into the two following parts:
\[
\begin{aligned}
\widehat{L}_i(\theta; \bar{\theta}) &\approx - \int_{\mathsf{Z}} \log f_i(z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \theta)\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}}) \\
&= - \int_{\mathsf{Z}} \log\big[p_i(y_i \mid z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\, p_i(z_{i,\mathrm{mis}}, \beta, \Omega)\big]\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}}) \\
&= \underbrace{- \int_{\mathsf{Z}} \log p_i(y_i \mid z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}})}_{=\,\widehat{L}^{(1)}_i(\delta,\, \bar{\theta})}
\;\underbrace{-\, \int_{\mathsf{Z}} \log p_i(z_{i,\mathrm{mis}}, \beta, \Omega)\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}})}_{=\,\widehat{L}^{(2)}_i(\beta,\, \Omega,\, \bar{\theta})}
\end{aligned} \tag{3.4.3}
\]
The mean β and the covariance Ω of the latent structure can be estimated in closed form by minimizing the sum over i ∈ JnK of the MISSO surrogates L̃^{(2)}_i(β, Ω, θ̄, {z_m}_{m=1}^M), defined as the Monte Carlo approximations of L̂^{(2)}_i(β, Ω, θ̄).
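As an illustration of this closed-form minimization, the following sketch (our own, with an assumed data layout) computes β and Ω from Monte Carlo batches of completed covariates; it is simply a Gaussian maximum-likelihood step on the imputed samples.

```python
import numpy as np

def update_beta_omega(z_completed):
    """Closed-form minimizer of the summed Gaussian surrogates.

    z_completed: list of length n; entry i is an (M_i, p) array whose rows are
    Monte Carlo draws of z_i with the missing coordinates imputed."""
    # beta is the average over individuals of the per-individual posterior means.
    means = np.stack([samples.mean(axis=0) for samples in z_completed])
    beta = means.mean(axis=0)
    # Omega is the average over individuals of E[(z_i - beta)(z_i - beta)^T].
    p = beta.shape[0]
    Omega = np.zeros((p, p))
    for samples in z_completed:
        centered = samples - beta
        Omega += centered.T @ centered / samples.shape[0]
    Omega /= len(z_completed)
    return beta, Omega
```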

We thus keep the surrogate L̂^{(2)}_i(β, Ω, θ̄) and consider the following quadratic approximation of L̂^{(1)}_i(δ, θ̄) to estimate the vector of logistic parameters δ:
\[
\begin{aligned}
\widehat{L}^{(1)}_i(\delta, \bar{\theta}) \approx \widehat{L}^{(1)}_i(\bar{\delta}, \bar{\theta})
&- (\delta - \bar{\delta})^\top \int_{\mathsf{Z}} \nabla \log p_i(y_i \mid z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\big|_{\delta=\bar{\delta}}\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}}) \\
&- \frac{1}{2}\,(\delta - \bar{\delta})^\top \int_{\mathsf{Z}} \nabla^2 \log p_i(y_i \mid z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\big|_{\delta=\bar{\delta}}\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}})\; (\delta - \bar{\delta})
\end{aligned} \tag{3.4.4}
\]
Recall that:
\[
\nabla \log p_i(y_i \mid z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta) = z_i\big(y_i - S(\delta^\top z_i)\big), \qquad
\nabla^2 \log p_i(y_i \mid z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta) = - z_i z_i^\top\, \dot{S}(\delta^\top z_i) \tag{3.4.5}
\]
where Ṡ(u) is the derivative of S(u). Note that Ṡ(u) ≤ 1/4 and that, for all i ∈ JnK, the p×p matrix z_i z_i^⊤ is positive semi-definite, so we can assume:

L1 For all i ∈ JnK and ε > 0, there exists, for all z_i ∈ Z, a positive definite matrix H_i(z_i) := (1/4)(z_i z_i^⊤ + ε I_d) such that, for all δ ∈ R^p, −z_i z_i^⊤ Ṡ(δ^⊤ z_i) ⪯ H_i(z_i).

We thus use, for all i ∈ JnK, the following surrogate function to estimate δ:
\[
\bar{L}^{(1)}_i(\delta, \bar{\theta}) = \widehat{L}^{(1)}_i(\bar{\delta}, \bar{\theta}) - D_i^\top (\delta - \bar{\delta}) + \frac{1}{2}\,(\delta - \bar{\delta})^\top H_i\, (\delta - \bar{\delta}) \tag{3.4.6}
\]
where:
\[
D_i = \int_{\mathsf{Z}} \nabla \log p_i(y_i \mid z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\big|_{\delta=\bar{\delta}}\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}}), \qquad
H_i = \int_{\mathsf{Z}} H_i(z_{i,\mathrm{mis}})\; p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(\mathrm{d}z_{i,\mathrm{mis}}) \tag{3.4.7}
\]
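In the algorithm, the intractable integrals D_i and H_i in (3.4.7) are replaced by Monte Carlo averages over imputed covariates. A minimal sketch of this approximation follows; the helper names are ours and the value of ε is an arbitrary illustrative choice.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mc_D_H(y_i, z_samples, delta_bar, eps=1e-3):
    """Monte Carlo approximation of D_i and H_i from (3.4.7).

    z_samples: (M, p) array of completed covariate draws for individual i;
    eps > 0 is the loading constant of assumption L1 (arbitrary value here)."""
    M, p = z_samples.shape
    D_i = np.zeros(p)
    H_i = np.zeros((p, p))
    for z in z_samples:
        D_i += z * (y_i - sigmoid(delta_bar @ z)) / M          # gradient term of (3.4.5)
        H_i += 0.25 * (np.outer(z, z) + eps * np.eye(p)) / M   # majorizer H_i(z) of L1
    return D_i, H_i
```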

Finally, at iteration k, the total surrogate is the average of the Monte Carlo approximations of (3.4.6), each evaluated at the parameter θ^{(τ_i^k)} at which the i-th surrogate was last updated:
\[
\widetilde{L}^{(k)}(\delta) = \frac{1}{n} \sum_{i=1}^{n} \widetilde{L}^{(1)}_i\big(\delta, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M(\tau_i^k)}\big)\,. \tag{3.4.8}
\]
Minimizing the total surrogate (3.4.8) boils down to performing a quasi-Newton step. It is perhaps sensible to apply some diagonal loading, which is perfectly compatible with the surrogate interpretation we just gave.

MISSO update: At the k-th iteration, and after the initialization, for all i ∈ JnK, of the latent variables (z_i^{(0)}), the MISSO algorithm consists in picking an index i_k uniformly on JnK, completing the observations by sampling a Monte Carlo batch {z^{(k)}_{i_k,mis,m}}_{m=1}^{M(k)} of missing values from the conditional distribution p(z_{i_k,mis} | z_{i_k,obs}, y_{i_k}; θ^{(k−1)}) using an MCMC sampler, and computing the estimated parameters as follows. The mean and covariance of the latent structure are obtained in closed form as
\[
(\beta^{(k)}, \Omega^{(k)}) = \operatorname*{arg\,min}_{\beta,\,\Omega}\; \frac{1}{n} \sum_{i=1}^{n} \widetilde{L}^{(2)}_i\big(\beta, \Omega, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M(\tau_i^k)}\big)\,,
\]
and the logistic parameters are estimated as
\[
\delta^{(k)} = \operatorname*{arg\,min}_{\delta}\; \frac{1}{n} \sum_{i=1}^{n} \widetilde{L}^{(1)}_i\big(\delta, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M(\tau_i^k)}\big)\,,
\]
where L̃^{(1)}_i(δ, θ^{(τ_i^k)}, {z_{i,m}}_{m=1}^{M(τ_i^k)}) is the MC approximation of the MISO surrogate defined in (3.4.6), which leads to the following quasi-Newton step:
\[
\delta^{(k)} = \frac{1}{n} \sum_{i=1}^{n} \delta^{(\tau_i^k)} - \big(\widetilde{H}^{(k)}\big)^{-1} \widetilde{D}^{(k)} \tag{3.4.12}
\]
with D̃^{(k)} = (1/n) Σ_{i=1}^{n} D̃_i^{(τ_i^k)} and H̃^{(k)} = (1/n) Σ_{i=1}^{n} H̃_i^{(τ_i^k)}.
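For illustration, the sketch below (our own names and storage layout) computes the exact minimizer of the average of the stored quadratic surrogates (3.4.6), with the optional diagonal loading mentioned above; the quasi-Newton update (3.4.12) is a simplification of this step in which the past iterates and the curvature terms are averaged separately.

```python
import numpy as np

def minimize_averaged_surrogate(delta_memory, D_memory, H_memory, loading=0.0):
    """Exact minimizer of the average of the stored quadratic surrogates (3.4.6).

    delta_memory: (n, p) array of the past iterates delta^{(tau_i^k)}
    D_memory:     (n, p) array of the stored Monte Carlo terms D_i
    H_memory:     (n, p, p) array of the stored Monte Carlo terms H_i"""
    n, p = delta_memory.shape
    H_sum = H_memory.sum(axis=0) + loading * np.eye(p)   # optional diagonal loading
    # Setting the gradient of sum_i [ -D_i^T (delta - delta_i)
    #   + (1/2) (delta - delta_i)^T H_i (delta - delta_i) ] to zero gives
    # (sum_i H_i) delta = sum_i (H_i delta_i + D_i).
    rhs = D_memory.sum(axis=0) + np.einsum('ipq,iq->p', H_memory, delta_memory)
    return np.linalg.solve(H_sum, rhs)
```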

Fitting a logistic regression model on the TraumaBase dataset

We apply the MISSO method to fit a logistic regression model on the TraumaBase (http://traumabase.eu) dataset, which consists of data collected from 15 trauma centers in France, covering measurements on patients from the initial to the last stage of trauma.

Similar to [Jiang et al., 2018], we select p = 16 influential quantitative measurements, described in Appendix 3.9.1, on n = 6384 patients, and we adopt the logistic regression model with missing covariates in (3.4.1) to predict the risk of severe hemorrhage, which is one of the main causes of death after a major trauma. Note that, since the dataset is heterogeneous, coming from multiple sources with frequently missing entries, we apply the latent data model described above. For the Monte Carlo sampling of z_{i,mis}, we run a Metropolis-Hastings algorithm with target distribution p(·| z_{i,obs}, y_i; θ^{(k)}); the procedure is detailed in Appendix 3.9.1.
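As a hedged illustration of this sampling step, here is a minimal random-walk Metropolis-Hastings sketch for one individual; the proposal, its scale, the initialization and all function names are assumptions of ours, not the tuning used for the TraumaBase experiments.

```python
import numpy as np

def bernoulli_loglik(y, logit):
    """log p(y | logit) for a Bernoulli-logistic observation, computed stably."""
    return y * logit - np.logaddexp(0.0, logit)

def mh_sample_missing(y_i, z_obs, obs_idx, mis_idx, delta, beta, Omega,
                      n_steps=100, prop_scale=0.5, rng=None):
    """Random-walk Metropolis-Hastings targeting p(z_mis | z_obs, y_i; theta),
    known up to a constant through p(y_i | z; delta) * N(z; beta, Omega)."""
    rng = rng or np.random.default_rng()
    p = len(obs_idx) + len(mis_idx)
    Omega_inv = np.linalg.inv(Omega)

    def log_target(z_mis):
        z = np.empty(p)
        z[obs_idx], z[mis_idx] = z_obs, z_mis
        logit = delta @ np.concatenate(([1.0], z))    # intercept as in (3.4.1)
        diff = z - beta
        return bernoulli_loglik(y_i, logit) - 0.5 * diff @ Omega_inv @ diff

    z_mis = beta[mis_idx].copy()                      # start from the prior mean
    samples = []
    for _ in range(n_steps):
        proposal = z_mis + prop_scale * rng.standard_normal(len(mis_idx))
        if np.log(rng.uniform()) < log_target(proposal) - log_target(z_mis):
            z_mis = proposal                          # accept the move
        samples.append(z_mis.copy())
    return np.array(samples)
```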

Figure 3.1 – Convergence of the first component of the vectors of parameters δ and β for the SAEM, the MCEM and the MISSO methods. The convergence is plotted against the number of passes over the data.

We compare in Figure 3.1 the convergence behavior of the estimated parameters β using SAEM [Delyon et al., 1999a] (with stepsize γ_k = 1/k), MCEM [Wei and Tanner, 1990a] and the proposed MISSO method. For the MISSO method, we set the Monte Carlo batch size to M(k) = 10 + k² and we examine the effect of selecting a different number of functions in Line 5 of the method: the default setting with 1 function (MISSO), and 10% (MISSO10) and 50% (MISSO50) of the functions per iteration. From Figure 3.1, the MISSO method converges to a stationary value in fewer epochs than the MCEM and SAEM methods. It is worth noting that the difference among the MISSO runs for different numbers of selected functions illustrates a variance-versus-cost tradeoff.

3.4.2 Fitting Bayesian LeNet-5 on MNIST

This application follows Example 2 described in Section 3.2. We apply the MISSO method to fit a Bayesian variant of LeNet-5 [LeCun et al., 1998] (see Appendix 3.9.2). We train this network on the MNIST dataset [LeCun, 1998]. The training set is composed of N = 55 000 handwritten digits, as 28×28 images. Each image is labelled with its corresponding number (from zero to nine). Under the prior distribution π, see (3.2.9), the weights are assumed independent and identically distributed according to N(0,1). We also assume that q(·;θ) ≡ N(µ, σ²I). The variational posterior parameters are thus θ = (µ, σ), where µ = (µ_ℓ, ℓ ∈ JdK), with d the number of weights in the neural network. We use the re-parametrization w = t(θ, z) = µ + σz with z ∼ N(0, I).
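To make the variational setup concrete, here is a small sketch, written by us, of the re-parametrized sampling and of a Monte Carlo estimate of the negated ELBO; the network forward pass is abstracted behind log_lik_fn, and the KL term between the mean-field Gaussian posterior and the N(0,1) prior is written in closed form for clarity, whereas the experiments below estimate the whole loss by Monte Carlo integration.

```python
import numpy as np

def sample_weights(mu, sigma, rng):
    """Re-parametrization w = t(theta, z) = mu + sigma * z with z ~ N(0, I)."""
    z = rng.standard_normal(mu.shape)
    return mu + sigma * z

def neg_elbo_estimate(mu, sigma, log_lik_fn, n_mc=10, rng=None):
    """Monte Carlo estimate of the negated ELBO for a N(0, 1) i.i.d. prior on
    the weights and a mean-field Gaussian posterior N(mu, sigma^2 I).

    log_lik_fn(w) stands in for the Bayesian LeNet-5 log-likelihood of the data
    (scaled to the full training set when evaluated on a mini-batch)."""
    rng = rng or np.random.default_rng()
    # Closed-form KL between N(mu, sigma^2) and N(0, 1), summed over all weights.
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)
    # Expected log-likelihood approximated with n_mc re-parametrized samples.
    exp_loglik = np.mean([log_lik_fn(sample_weights(mu, sigma, rng))
                          for _ in range(n_mc)])
    return kl - exp_loglik
```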

At iteration k, minimizing the sum of stochastic surrogates defined as in (3.2.4) and (3.2.14) yields the following MISSO update: step (i) pick a function index i_k uniformly on JnK; step (ii) sample a Monte Carlo batch {z_m^{(k)}}_{m=1}^{M(k)} from N(0, I); and step (iii) update the variational parameters θ^{(k)} = (µ^{(k)}, σ^{(k)}) by minimizing the resulting sum of surrogates.
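A schematic of this incremental loop, with one stored gradient estimate per function index and only the drawn index refreshed at each iteration, might look as follows; the gradient oracle, the plain gradient step and the storage layout are simplifying assumptions of ours and not the exact surrogate minimization used in the thesis.

```python
import numpy as np

def incremental_vi_loop(theta0, grad_fn, n, d, n_iters, step=1e-3,
                        M_fn=lambda k: k + 1, rng=None):
    """Schematic incremental loop for the variational parameters theta = (mu, sigma).

    grad_fn(i, theta, z_batch) returns a Monte Carlo estimate of the gradient of
    the i-th stochastic surrogate at theta, from z_batch ~ N(0, I) of shape (M, d)."""
    rng = rng or np.random.default_rng()
    theta = theta0.copy()
    grad_memory = np.zeros((n,) + theta.shape)          # one stored estimate per index
    for k in range(n_iters):
        i_k = rng.integers(n)                            # step (i): draw a function index
        z_batch = rng.standard_normal((M_fn(k), d))      # step (ii): Monte Carlo batch
        grad_memory[i_k] = grad_fn(i_k, theta, z_batch)  # refresh only index i_k
        theta = theta - step * grad_memory.mean(axis=0)  # step (iii): update parameters
    return theta
```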

We compare the convergence of the Monte Carlo variants of the following state-of-the-art optimization algorithms: ADAM [Kingma and Ba, 2015], Momentum [Sutskever et al., 2013] and SAG [Schmidt et al., 2017], against Bayes by Backprop (BBB) [Blundell et al., 2015] and our proposed MISSO method. For all these methods, the loss function (3.2.11) and its gradients were computed by Monte Carlo integration using the TensorFlow Probability library [Dillon et al., 2017], based on the re-parametrization described above. The update rules for each algorithm are performed using their vanilla implementations in TensorFlow [Abadi et al., 2015], as detailed in Appendix 3.9.2. We use the following hyperparameters for all runs: the learning rate is 10⁻³, we run 100 epochs with a mini-batch size of 128, and we use a Monte Carlo batch size of M(k) = k.

Figure 3.2 – (Incremental Variational Inference) Negated ELBO versus epochs elapsed for fitting the Bayesian LeNet-5 on MNIST using different algorithms. The solid curve is obtained from averaging over 5 independent runs of the methods, and the shaded area represents the standard deviation.

Figure 3.2 shows the convergence of the negated evidence lower bound against the number of passes over the data (one pass represents an epoch). As observed, the proposed MISSO method outperforms Bayes by Backprop and Momentum, while similar convergence rates are observed for the MISSO, ADAM and SAG methods.
