Supplement material for “Walsh-Hadamard Variational Inference for Bayesian Deep Learning”

Simone Rossi
Data Science Department
EURECOM (FR)
simone.rossi@eurecom.fr

Sébastien Marmin*
Data Science Department
EURECOM (FR)
sebastien.marmin@eurecom.fr

Maurizio Filippone
Data Science Department
EURECOM (FR)
maurizio.filippone@eurecom.fr

A Matrix-variate Posterior Distribution Induced by WHVI

We derive the parameters of the matrix-variate distribution $q(W) = \mathcal{MN}(M, U, V)$ of the weight matrix $\widetilde{W} \in \mathbb{R}^{D \times D}$ given by WHVI,
$$\widetilde{W} = S_1 H \,\mathrm{diag}(\tilde{g})\, H S_2 \quad \text{with} \quad \tilde{g} \sim \mathcal{N}(\mu, \Sigma). \tag{1}$$
The mean $M = S_1 H \,\mathrm{diag}(\mu)\, H S_2$ derives from the linearity of the expectation. The covariance matrices $U$ and $V$ are non-identifiable: for any scale factor $s > 0$, $\mathcal{MN}(M, U, V)$ equals $\mathcal{MN}(M, sU, \tfrac{1}{s}V)$. Therefore, we constrain the parameters such that $\mathrm{Tr}(V) = 1$. The covariance matrices verify (see e.g. Section 1 in the supplement of [1])
$$U = \mathbb{E}\big[(W - M)(W - M)^\top\big], \qquad V = \frac{1}{\mathrm{Tr}(U)}\, \mathbb{E}\big[(W - M)^\top (W - M)\big].$$

The Walsh-Hadamard matrix $H$ is symmetric. Denoting by $\Sigma^{1/2}$ a root of $\Sigma$ and considering $\epsilon \sim \mathcal{N}(0, I)$, we have
$$U = \mathbb{E}\Big[S_1 H \,\mathrm{diag}(\Sigma^{1/2}\epsilon)\, H S_2^2 H \,\mathrm{diag}(\Sigma^{1/2}\epsilon)\, H S_1\Big]. \tag{2}$$

Define the matrix $T_2 \in \mathbb{R}^{D \times D^2}$ whose $i$th row is the column-wise vectorization of the matrix $\big(\Sigma^{1/2}_{i,j} (HS_2)_{i,j'}\big)_{j,j' \le D}$. We have
$$\begin{aligned}
(T_2 T_2^\top)_{i,i'} &= \sum_{j,j'=1}^{D} \Sigma^{1/2}_{i,j} \Sigma^{1/2}_{i',j} (HS_2)_{i,j'} (HS_2)_{i',j'} \\
&= \sum_{j,j',j''=1}^{D} \Sigma^{1/2}_{i,j} (HS_2)_{i,j'}\, \mathbb{E}[\epsilon_j \epsilon_{j''}]\, \Sigma^{1/2}_{i',j''} (HS_2)_{i',j'} \\
&= \sum_{j'=1}^{D} \mathbb{E}\Bigg[\bigg(\sum_{j=1}^{D} \epsilon_j \Sigma^{1/2}_{i,j} (HS_2)_{i,j'}\bigg)\bigg(\sum_{j''=1}^{D} \epsilon_{j''} \Sigma^{1/2}_{i',j''} (HS_2)_{i',j'}\bigg)\Bigg] \\
&= \mathbb{E}\Big[\mathrm{diag}(\Sigma^{1/2}\epsilon)\, H S_2^2 H \,\mathrm{diag}(\Sigma^{1/2}\epsilon)\Big]_{i,i'}.
\end{aligned}$$



Using (2), a root of $U = U^{1/2} U^{1/2\top}$ can be found:
$$U^{1/2} = S_1 H T_2. \tag{3}$$
Similarly for $V$, we have
$$V^{1/2} = \frac{1}{\sqrt{\mathrm{Tr}(U)}}\, S_2 H T_1, \tag{4}$$
with $T_1 \in \mathbb{R}^{D \times D^2}$ defined analogously to $T_2$, i.e. the $i$th row of $T_1$ is the column-wise vectorization of $\big(\Sigma^{1/2}_{i,j} (HS_1)_{i,j'}\big)_{j,j' \le D}$.
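As a concrete illustration of these identities (not the authors' code), the following NumPy sketch draws Monte Carlo samples of $\widetilde{W}$ and checks that the empirical covariance $\mathbb{E}[(W-M)(W-M)^\top]$ matches $(S_1 H T_2)(S_1 H T_2)^\top$; it assumes a symmetric root $\Sigma^{1/2}$ and the normalized Walsh-Hadamard matrix, and all variable names are illustrative.

```python
# Minimal NumPy sketch (not the authors' code): Monte Carlo check of the mean M and of
# the covariance root U^{1/2} = S1 H T2, assuming a symmetric root Sigma^{1/2} and the
# normalized Walsh-Hadamard matrix H (so that H @ H = I).
import numpy as np
from scipy.linalg import hadamard, sqrtm

rng = np.random.default_rng(0)
D = 8                                     # must be a power of 2
H = hadamard(D) / np.sqrt(D)              # normalized Walsh-Hadamard matrix (symmetric)
S1, S2 = np.diag(rng.standard_normal(D)), np.diag(rng.standard_normal(D))
mu = rng.standard_normal(D)
B = rng.standard_normal((D, D))
Sigma = B @ B.T + np.eye(D)               # positive-definite covariance of g
Sigma_half = np.real(sqrtm(Sigma))        # symmetric root

M = S1 @ H @ np.diag(mu) @ H @ S2         # mean of q(W)

# T2: i-th row is the column-wise vectorization of (Sigma_half[i, j] * (H S2)[i, j'])
HS2 = H @ S2
T2 = np.stack([np.outer(Sigma_half[i], HS2[i]).flatten(order="F") for i in range(D)])
U_analytic = (S1 @ H @ T2) @ (S1 @ H @ T2).T

# Monte Carlo estimate of E[(W - M)(W - M)^T]
n_samples = 100_000
U_mc = np.zeros((D, D))
for _ in range(n_samples):
    g = mu + Sigma_half @ rng.standard_normal(D)
    W = S1 @ H @ np.diag(g) @ H @ S2
    U_mc += (W - M) @ (W - M).T / n_samples

print(np.abs(U_mc - U_analytic).max())    # small, up to Monte Carlo error
```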

B Geometric Interpretation of WHVI

The matrix $A$ in Section 2.2 expresses the linear relationship between the weights $W = S_1 H G H S_2$ and the variational random vector $g$, i.e. $\mathrm{vect}(W) = Ag$. Recall the definition of
$$A = \begin{bmatrix} S_1 H \,\mathrm{diag}(v_1) \\ \vdots \\ S_1 H \,\mathrm{diag}(v_D) \end{bmatrix}, \quad \text{with } v_i = (S_2)_{i,i}\, H_{:,i}. \tag{5}$$
We show that an LQ-decomposition of $A$ can be explicitly formulated.

Proposition. Let $A$ be a $D^2 \times D$ matrix such that $\mathrm{vect}(W) = Ag$, where $W$ is given by $W = S_1 H \,\mathrm{diag}(g)\, H S_2$. Then an LQ-decomposition of $A$ can be formulated as
$$\mathrm{vect}(W) = \big[s^{(2)}_i S_1 H \,\mathrm{diag}(h_i)\big]_{i=1,\ldots,D}\, g = LQg, \tag{6}$$
where $h_i$ is the $i$th column of $H$, $L = \mathrm{diag}\big((s^{(2)}_i s^{(1)})_{i=1,\ldots,D}\big)$, $\mathrm{diag}(s^{(1)}) = S_1$, $\mathrm{diag}(s^{(2)}) = S_2$, and $Q = [H \,\mathrm{diag}(h_i)]_{i=1,\ldots,D}$.

Proof. Equation (6) derives directly from block matrix and vector operations. As $L$ is clearly lower triangular (even diagonal), let us prove that $Q$ has orthogonal columns. Defining the $D \times D$ matrix $Q^{(i)} = H \,\mathrm{diag}(h_i)$, we have:
$$Q^\top Q = \sum_{i=1}^{D} Q^{(i)\top} Q^{(i)} = \sum_{i=1}^{D} \mathrm{diag}(h_i) H^\top H \,\mathrm{diag}(h_i) = \sum_{i=1}^{D} \mathrm{diag}(h_i^2) = \sum_{i=1}^{D} \frac{1}{D} I = I.$$

Figure 1: Diagrammatic representation of WHVI. The cube represents the high-dimensional parameter space. The variational posterior (mean in orange) evolves during optimization in the (blue) subspace whose orientation (red) is controlled by $S_1$ and $S_2$.

This decomposition gives direct insight into the role of the Walsh-Hadamard transforms: with complexity $D\log(D)$, they perform fast rotations $Q$ of vectors living in a space of dimension $D$ (the plane in Fig. 1) into a space of dimension $D^2$ (the cube in Fig. 1). Treated as parameters gathered in $L$, $S_1$ and $S_2$ control the orientation of the subspace by distortion of the canonical axes.
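The orthogonality of $Q$ and the factorization $\mathrm{vect}(W) = LQg$ can be checked numerically; below is a small NumPy sketch (not code from the paper), using the normalized Walsh-Hadamard matrix from scipy.linalg.hadamard.

```python
# Minimal NumPy sketch (not the authors' code): check that Q = [H diag(h_i)]_{i=1..D}
# has orthogonal columns (Q^T Q = I) and that vect(W) = L Q g, for normalized H.
import numpy as np
from scipy.linalg import hadamard

D = 16
H = hadamard(D) / np.sqrt(D)                                # normalized, so H @ H = I
Q = np.vstack([H @ np.diag(H[:, i]) for i in range(D)])     # shape (D**2, D)
print(np.allclose(Q.T @ Q, np.eye(D)))                      # True

s1, s2, g = np.random.default_rng(1).standard_normal((3, D))
W = np.diag(s1) @ H @ np.diag(g) @ H @ np.diag(s2)
L = np.diag(np.concatenate([s2[i] * s1 for i in range(D)])) # block-diagonal scaling
print(np.allclose(W.flatten(order="F"), L @ Q @ g))         # True: vect(W) = L Q g
```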

We empirically evaluate the minimum RMSE, as a proxy for some measure of average distance, between $W$ and any given point $\Gamma$. More precisely, we compute, for $\Gamma \in \mathbb{R}^{D \times D}$,
$$\min_{s_1, s_2, g \in \mathbb{R}^D} \frac{1}{D}\, \big\|\Gamma - \mathrm{diag}(s_1)\, H \,\mathrm{diag}(g)\, H \,\mathrm{diag}(s_2)\big\|_{\mathrm{Frob}}. \tag{7}$$


Fig. 2 shows this quantity evaluated for $\Gamma$ sampled with i.i.d. $\mathcal{U}(-1,1)$ entries, for increasing values of $D$. The bounded behavior suggests that WHVI can approximate any given matrix with an error that does not grow with the dimension.
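As a rough illustration of how such an estimate can be obtained (this is not the authors' optimization setup), one can minimize (7) directly over $(s_1, g, s_2)$ with a generic optimizer and a few random restarts:

```python
# Minimal sketch (assumptions: normalized H, L-BFGS-B with random restarts; not the
# paper's optimizer): numerically minimize (7) for a target Gamma with i.i.d. U(-1, 1) entries.
import numpy as np
from scipy.linalg import hadamard
from scipy.optimize import minimize

rng = np.random.default_rng(0)
D = 32
H = hadamard(D) / np.sqrt(D)
Gamma = rng.uniform(-1.0, 1.0, size=(D, D))

def objective(x):
    s1, g, s2 = np.split(x, 3)
    W = np.diag(s1) @ H @ np.diag(g) @ H @ np.diag(s2)
    return np.linalg.norm(Gamma - W, "fro") / D             # RMSE of Eq. (7)

best = min(
    (minimize(objective, rng.standard_normal(3 * D), method="L-BFGS-B") for _ in range(5)),
    key=lambda res: res.fun,
)
print(best.fun)   # empirical minimum RMSE, cf. Figure 2
```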


Figure 2: Distribution of the minimum RMSE between $S_1 H G H S_2$ and a sample matrix with i.i.d. $\mathcal{U}(-1,1)$ entries, for $D = 4, 16, 32, 64, 128$. For each dimension, the orange dots represent 20 repetitions; the median distance is displayed in black. A few outliers (with distance greater than 3.0) appeared, possibly due to imperfect numerical optimization; they were kept for the calculation of the median but are not displayed.

C Additional Details on Normalizing Flows

In the general setting, given a probabilistic model with observations $x$, latent variables $z$ and model parameters $\theta$, introducing an approximate posterior distribution $q_\phi(z)$ with parameters $\phi$ yields the variational lower bound to the log-marginal likelihood. Indeed,
$$\mathrm{KL}\{q_\phi(z)\,\|\,p(z|x)\} = \mathbb{E}_{q_\phi(z)}\big[\log q_\phi(z) - \log p(z|x)\big] = \mathbb{E}_{q_\phi(z)}\big[\log q_\phi(z) - \log p_\theta(x, z)\big] + \log p(x) \ge 0,$$
so that
$$\log p(x) \ge \mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x|z) - \log q_\phi(z) + \log p(z)\big], \tag{8}$$
where $p_\theta(x|z)$ is the likelihood function with model parameters $\theta$ and $p(z)$ is the prior on the latents.

The objective is then to minimize the negative variational bound (NELBO):

$$\mathcal{L}(\theta, \phi) = -\mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x|z)\big] + \mathrm{KL}\{q_\phi(z)\,\|\,p(z)\}. \tag{9}$$
Consider an invertible, continuous and differentiable function $f : \mathbb{R}^D \to \mathbb{R}^D$. Given $\tilde{z}_0 \sim q(z_0)$, then $\tilde{z}_1 = f(\tilde{z}_0)$ follows $q(z_1)$ defined as
$$q(z_1) = q(z_0) \left|\det \frac{\partial f}{\partial z_0}\right|^{-1}. \tag{10}$$

As a consequence, after $K$ transformations the log-density of the final distribution is
$$\log q(z_K) = \log q(z_0) - \sum_{k=1}^{K} \log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|. \tag{11}$$

We shall define $f_k(z_{k-1}; \lambda_k)$ as the $k$th transformation, which takes as input the output of the previous flow $z_{k-1}$ and has parameters $\lambda_k$. The final variational objective is

$$\begin{aligned}
\mathcal{L}(\theta, \phi) &= -\mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x|z)\big] + \mathrm{KL}\{q_\phi(z)\,\|\,p(z)\} \\
&= \mathbb{E}_{q_\phi(z)}\big[-\log p_\theta(x|z) - \log p(z) + \log q_\phi(z)\big] \\
&= \mathbb{E}_{q_0(z_0)}\big[-\log p_\theta(x|z_K) - \log p(z_K) + \log q_K(z_K)\big] \\
&= \mathbb{E}_{q_0(z_0)}\Bigg[-\log p_\theta(x|z_K) - \log p(z_K) + \log q_0(z_0) - \sum_{k=1}^{K} \log\left|\det \frac{\partial f_k(z_{k-1}; \lambda_k)}{\partial z_{k-1}}\right|\Bigg] \\
&= -\mathbb{E}_{q_0(z_0)}\big[\log p_\theta(x|z_K)\big] + \mathrm{KL}\{q_0(z_0)\,\|\,p(z_K)\} - \mathbb{E}_{q_0(z_0)}\sum_{k=1}^{K} \log\left|\det \frac{\partial f_k(z_{k-1}; \lambda_k)}{\partial z_{k-1}}\right|. \tag{12}
\end{aligned}$$

Setting the initial distribution $q_0$ to a fully factorized Gaussian $\mathcal{N}(z_0 | \mu, \sigma I)$ and assuming a Gaussian prior on the generated $z_K$, the KL term is analytically tractable. A possible family of transformations is the planar flow [10]. For the planar flow, $f$ is defined as

$$f(z) = z + u\, h(w^\top z + b), \tag{13}$$

where $\lambda = [u \in \mathbb{R}^D,\ w \in \mathbb{R}^D,\ b \in \mathbb{R}]$ and $h(\cdot) = \tanh(\cdot)$. This is equivalent to a residual layer with a single-neuron MLP, as argued by Kingma et al. [5]. The log-determinant of the Jacobian of $f$ is

$$\log\left|\det \frac{\partial f}{\partial z}\right| = \log\left|\det\big(I + u\, [h'(w^\top z + b)\, w]^\top\big)\right| = \log\left|1 + u^\top w\, h'(w^\top z + b)\right|. \tag{14}$$
Although this is a simple flow parameterization, a planar flow requires only $\mathcal{O}(D)$ parameters and thus it does not increase the time/space complexity of WHVI. Alternatives can be found in [10, 13, 5, 8].
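To make the parameterization concrete, here is a minimal PyTorch sketch of a planar flow implementing Eqs. (13)-(14) and accumulating the log-det terms of Eq. (11); it is a simplified stand-alone module, not the implementation used in the paper, and it omits the usual constraint on $u$ that guarantees invertibility.

```python
# Minimal PyTorch sketch of a planar flow (Eq. 13) and its log-det-Jacobian (Eq. 14);
# simplified and stand-alone, not the paper's implementation (the invertibility
# constraint on u from Rezende & Mohamed is omitted for brevity).
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                                   # z: (batch, dim)
        a = z @ self.w + self.b                             # w^T z + b
        f_z = z + self.u * torch.tanh(a).unsqueeze(-1)      # Eq. (13)
        h_prime = 1.0 - torch.tanh(a) ** 2                  # h'(w^T z + b)
        log_det = torch.log(torch.abs(1.0 + (self.u @ self.w) * h_prime))  # Eq. (14)
        return f_z, log_det

# Composing K flows and accumulating the sum in Eq. (11):
z = torch.randn(128, 16)                                    # samples from q0
flows = [PlanarFlow(16) for _ in range(4)]
sum_log_det = torch.zeros(128)
for flow in flows:
    z, log_det = flow(z)
    sum_log_det = sum_log_det + log_det                     # log q(z_K) = log q(z_0) - sum_log_det
```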

D Additional Results

D.1 Experimental Setup for Bayesian DNN

The experiments on Bayesian DNNs are run with the following setup. For WHVI, we used a zero-mean prior over $g$ with fully factorized covariance $\lambda I$; $\lambda = 10^{-5}$ was chosen to obtain sensible variances in the output layer. It is possible to design a prior over $g$ such that the prior on $W$ has constant marginal variance and low correlations, although empirical evaluations showed that this does not yield a significant improvement compared to the previous (simpler) choice. In the final implementation of WHVI that we used in all experiments, $S_1$ and $S_2$ are optimized. The dropout rate of MCD is set to 0.005. We used a classic Gaussian likelihood with optimized noise variance for regression and a softmax likelihood for classification.

Table 1: List of datasets used in the experiments.

NAME TASK N. D-IN D-OUT

EEG CLASS. 14980 14 2

MAGIC CLASS. 19020 10 2

MINIBOO CLASS. 130064 50 2

LETTER CLASS. 20000 16 26

DRIVE CLASS. 58509 48 11

MOCAP CLASS. 78095 37 5

CIFAR10 CLASS. 60000 3×28×28 10

BOSTON REGR. 506 13 1

CONCRETE REGR. 1030 8 1

ENERGY REGR. 768 8 2

KIN8NM REGR. 8192 8 1

NAVAL REGR. 11934 16 2

POWERPLANT REGR. 9568 4 1

PROTEIN REGR. 45730 9 1

YACHT REGR. 308 6 1

BOREHOLE REGR. 200000 8 1

HARTMAN6 REGR. 30000 6 1

RASTRIGIN5 REGR. 10000 5 1

ROBOT REGR. 150000 8 1

OTLCIRCUIT REGR. 20000 6 1

Training is performed for 500 steps with fixed noise variance and for another 50000 steps with optimized noise variance. Batch size is fixed to 64 and, for the estimation of the expected log-likelihood, we used 1 Monte Carlo sample at train-time and 64 Monte Carlo samples at test-time. We chose the Adam optimizer [4] with exponential learning rate decay $\lambda_{t+1} = \lambda_0 (1 + \gamma t)^{-p}$, with $\lambda_0 = 0.001$, $p = 0.3$, $\gamma = 0.0005$ and $t$ being the current iteration.
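For concreteness, this schedule can be reproduced, for instance, with a PyTorch LambdaLR scheduler; the sketch below uses the stated hyperparameters, and the model is a placeholder rather than one of the architectures of the paper.

```python
# Minimal PyTorch sketch of the decay lambda_{t+1} = lambda_0 * (1 + gamma*t)^(-p);
# `model` is a placeholder nn.Module, not one of the networks used in the paper.
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # lambda_0 = 0.001
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: (1.0 + 0.0005 * t) ** (-0.3)    # gamma = 0.0005, p = 0.3
)

for t in range(50_000):
    # ... forward pass, NELBO estimation and backward pass omitted ...
    optimizer.step()
    scheduler.step()        # rescales the base learning rate by (1 + gamma*t)^(-p)
```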

A similar setup was also used for the Bayesian CNN experiment. The only differences are the batch size, increased to 256, and the optimizer, which is run without learning rate decay.


D.2 Regression Experiments on Shallow Models

For a complete experimental evaluation of WHVI, we also use the experimental setup proposed by Hernandez-Lobato and Adams [3] and adopted in several other works [2, 7, 14]. In this configuration, we use one hidden layer with 50 hidden units for all datasets, with the exception of PROTEIN where the number of units is increased to 100. Results are reported in Table 2.

Table 2: Test RMSE and test MNLL for regression datasets following the setup in [3].

TEST ERROR (RMSE)
DATASET      MCD          MFG          NNG          WHVI
BOSTON       3.40(0.66)   3.04(0.64)   2.74(0.12)   2.56(0.15)
CONCRETE     4.60(0.53)   5.24(0.53)   5.02(0.12)   5.01(0.25)
ENERGY       1.18(0.03)   1.52(0.09)   0.48(0.02)   1.20(0.07)
KIN8NM       0.09(0.00)   0.10(0.00)   0.08(0.00)   0.12(0.01)
NAVAL        0.00(0.00)   0.01(0.00)   0.00(0.00)   0.01(0.00)
POWERPLANT   4.20(0.12)   4.23(0.13)   3.89(0.04)   4.11(0.12)
PROTEIN      4.35(0.04)   4.74(0.05)   4.10(0.00)   4.64(0.07)
YACHT        1.72(0.32)   1.78(0.45)   0.98(0.08)   0.96(0.20)

TEST MNLL
DATASET      MCD          MFG          NNG          WHVI
BOSTON       5.04(1.76)   3.19(0.89)   2.45(0.03)   2.55(0.15)
CONCRETE     2.96(0.23)   3.03(0.15)   3.04(0.02)   2.95(0.06)
ENERGY       3.00(0.07)   3.49(0.11)   1.42(0.00)   3.01(0.12)
KIN8NM       -1.09(0.04)  -1.01(0.04)  -1.15(0.00)  -0.78(0.10)
NAVAL        -9.93(0.01)  -6.48(0.02)  -7.08(0.03)  -6.25(0.01)
POWERPLANT   2.76(0.03)   2.77(0.03)   2.78(0.01)   2.74(0.03)
PROTEIN      2.80(0.01)   2.89(0.01)   2.84(0.00)   2.86(0.01)
YACHT        2.73(0.74)   2.02(0.46)   2.32(0.00)   1.28(0.22)

D.3 ConvNet Architectures


Figure 3: Architecture layout of RESNET18.

For the experiments on Bayesian convolutional neural networks, we used architectures adapted to CIFAR10 (see Tables 3, 4 and 5).

Table 3: ALEXNET

LAYER     DIMENSIONS
CONV      64×3×3×3
MAXPOOL
CONV      192×64×3×3
MAXPOOL
CONV      384×192×3×3
CONV      256×384×3×3
CONV      256×256×3×3
MAXPOOL
LINEAR    4096×4096
LINEAR    4096×4096
LINEAR    10×4096

Table 4: VGG16

LAYER     DIMENSIONS
CONV      32×3×3×3
CONV      32×32×3×3
MAXPOOL
CONV      64×32×3×3
CONV      64×64×3×3
MAXPOOL
CONV      128×64×3×3
CONV      128×128×3×3
CONV      128×128×3×3
MAXPOOL
CONV      256×128×3×3
CONV      256×256×3×3
CONV      256×256×3×3
MAXPOOL
CONV      256×256×3×3
CONV      256×256×3×3
CONV      256×256×3×3
MAXPOOL
LINEAR    10×256

Table 5: RESNET18

LAYER           DIMENSIONS
RESNET BLOCK    [3×3, 64; 3×3, 64] ×2
RESNET BLOCK    [3×3, 128; 3×3, 128] ×2
RESNET BLOCK    [3×3, 256; 3×3, 256] ×2
RESNET BLOCK    [3×3, 512; 3×3, 512] ×2
AVGPOOL
LINEAR          10×512


Table 6: Complexity table for GPs with random feature and inducing point approximations. In the case of random features, we include both the complexity of computing random features and the complexity of treating the linear combination of the weights variationally (using VI and WHVI).

                   SPACE COMPLEXITY                        TIME COMPLEXITY
MEAN FIELD-RF      O(D_IN N_RF) + O(N_RF D_OUT)            O(D_IN N_RF) + O(N_RF D_OUT)
WHVI-RF            O(D_IN N_RF) + O(√(N_RF D_OUT))         O(D_IN N_RF) + O(D_OUT log N_RF)
INDUCING POINTS    O(M)                                    O(M³)

Note: M is the number of pseudo-data/inducing points and N_RF is the number of random features used in the kernel approximation.
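To make the space entries concrete, the following is an illustrative parameter count (an assumption-laden sketch, not from the paper) for treating a single $D \times D$ weight matrix variationally with a fully factorized Gaussian versus WHVI with a fully factorized $q(g)$:

```python
# Illustrative parameter count (assumption: WHVI stores the mean and diagonal variance of g
# plus S1 and S2, each of size D, versus 2*D^2 for a fully factorized Gaussian posterior
# over every entry of the D x D matrix). Not code from the paper.
def mean_field_params(D: int) -> int:
    return 2 * D * D      # one mean and one variance per weight

def whvi_params(D: int) -> int:
    return 4 * D          # mean(g), var(g), S1, S2

for D in (64, 128, 256, 512):
    print(f"D={D}: mean field {mean_field_params(D):>7d}  vs  WHVI {whvi_params(D):>5d}")
```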

D.4 Results - Gaussian Processes with Random Feature Expansion

We test WHVI for scalable GP inference by focusing on GPs with random feature expansions [6]. In GP models, latent variables $f$ are given a prior $p(f) = \mathcal{N}(0, K)$; the assumption of zero mean can be easily relaxed. Given a random feature expansion of the kernel matrix, say $K \approx \Phi\Phi^\top$, the latent variables can be rewritten as
$$f = \Phi w \tag{15}$$
with $w \sim \mathcal{N}(0, I)$. The random features $\Phi$ are constructed by randomly projecting the input matrix $X$ using a Gaussian random matrix $\Omega$ and applying a nonlinear transformation, which depends on the choice of the kernel function. The resulting model is now linear and, considering regression problems such that $y = f + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, solving GPs for regression becomes equivalent to solving standard linear regression problems. For a given set of random features, we treat the weights of the resulting linear layer variationally and evaluate the performance of WHVI.
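A minimal sketch of this construction for the RBF kernel (random Fourier features and the induced linear model of Eq. (15)) is shown below; it is illustrative only and does not reproduce the exact feature scaling or the variational treatment used in the experiments.

```python
# Minimal NumPy sketch of a random Fourier feature expansion of the RBF kernel and the
# induced linear model f = Phi @ w (Eq. 15); illustrative only, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
N, D_in, N_rf = 200, 5, 256
X = rng.uniform(0.0, 1.0, size=(N, D_in))     # inputs rescaled to [0, 1]
lengthscale = np.sqrt(D_in) / 2.0             # example value (the experiments initialize lengthscales to sqrt(D)/2)

# Random Fourier features: K ~= Phi @ Phi.T for the RBF kernel
Omega = rng.standard_normal((D_in, N_rf)) / lengthscale
b = rng.uniform(0.0, 2.0 * np.pi, size=N_rf)
Phi = np.sqrt(2.0 / N_rf) * np.cos(X @ Omega + b)

# With w ~ N(0, I) and y = Phi @ w + eps, the model is a standard Bayesian linear model;
# the paper treats w variationally (mean field or WHVI) instead of using the exact posterior.
w = rng.standard_normal(N_rf)
f = Phi @ w                                   # one draw from the approximate GP prior
K_approx = Phi @ Phi.T                        # approximation of the RBF kernel matrix
```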

By reshaping the vector of parameters $w$ of the linear model into a $D \times D$ matrix, WHVI allows the linearized GP model to reduce the number of parameters to optimize (see Table 6). We compare WHVI with two alternatives: one is VI of the Fourier feature GP expansion that uses fewer random features to match the number of parameters used in WHVI, and the other is the sparse Gaussian process implementation of GPFLOW [9] with a number of inducing points (rounded up) chosen to match the number of parameters used in WHVI.

We report the results on five datasets ($10000 \le N \le 200000$, $5 \le D \le 8$, see Table 1). The datasets are generated from space-filling evaluations of well-known functions in the analysis of computer experiments (see e.g. [12]).

Figure 4: Comparison of test error w.r.t. the number of model parameters (top: mean field, bottom: full covariance). [Panels: Borehole, Robot, Hartman6, Rastrigin5, Otlcircuit; x-axis: number of parameters; y-axis: test error; curves: GP RFF - Mean field / Full covariance, GP RFF - WHVI, Sparse GP.]


Dataset splitting in training and testing points is random uniform, 20% versus 80%. The input variables are rescaled between 0 and 1. The output values are standardized for training. All GPs have the same prior (centered GP with RBF covariance), initialized with equal hyperparameter values: each of the $D$ lengthscales to $\sqrt{D}/2$, the GP variance to 1, the Gaussian likelihood standard deviation to 0.02 (prior observation noise). The training is performed with 12000 steps of the Adam optimizer. The observation noise is fixed for the first 10000 steps. The learning rate is $6 \times 10^{-4}$, except for the dataset HARTMAN6, which uses a learning rate of $5 \times 10^{-3}$. Sparse GPs are run with a whitened representation of the inducing points.

The results are shown in Fig. 4, with diagonal covariance for the three variational posteriors (top) and with full covariance (bottom). In both mean field and full covariance settings, this variant of WHVI, which reshapes the column of weights $w$ into a matrix, largely outperforms the direct VI of Fourier features. However, it appears that this improvement of the random feature inference for GPs is not enough to reach the performance of VI using inducing points. Inducing point approximations are based on the Nyström approximation of kernel matrices, which is known to lead to lower approximation error on the elements of the kernel matrix compared to random feature approximations. This is the reason to which we attribute the lower performance of WHVI compared to inducing point approximations in this experiment.

D.5 Extended Results - DNNs

Being able to increase the width and depth of a model without drastically increasing the number of variational parameters is one of the competitive advantages of WHVI. Fig. 5 shows the behavior of WHVI for different network configurations. At test time, increasing the number of hidden layers and the number of hidden features allows the model to avoid overfitting while delivering better performance. This evidence is also supported by the analysis of the test MNLL during optimization of the ELBO, as shown in Fig. 6.

Thanks to the WHVI structure of the weight matrices, expanding and deepening the model is beneficial not only at convergence but during the entire learning procedure as well. Furthermore, the derived NELBO is still the negative of a valid lower bound on the true marginal likelihood and, therefore, a suitable objective function for model selection. Differently from the issue addressed in [11], during our experiments we did not experience problems regarding the initialization of variational parameters. We claim that this is possible thanks to both the reduced number of parameters and the effect of the Walsh-Hadamard transform.

Timing profiling of the Fast Walsh-Hadamard transform. Key to the log-linear time complexity is the Fast Walsh-Hadamard transform (FWHT), which makes it possible to perform the operation $Hx$ in $\mathcal{O}(D\log D)$ time without generating and storing $H$. For our experimental evaluation, we implemented a FWHT operation in PYTORCH (v.0.4.1) in C++ and CUDA to leverage the full computational capabilities of modern GPUs. Fig. 8 presents a timing profiling of our implementation versus the naive matmul (batch size of 512 samples and profiling repeated 1000 times). The breakeven point for the CPU implementation is in the neighborhood of 512/1024 features, while on GPU we see that FWHT is consistently faster.
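For reference, the butterfly structure behind the $\mathcal{O}(D\log D)$ cost can be illustrated with a short (unnormalized) FWHT in plain NumPy; the paper's actual implementation is a C++/CUDA PyTorch extension, so this sketch only serves to show the algorithm.

```python
# Minimal NumPy sketch of the (unnormalized) fast Walsh-Hadamard transform, O(D log D);
# the paper's implementation is a C++/CUDA PyTorch extension, not this Python version.
import numpy as np
from scipy.linalg import hadamard

def fwht(x):
    """FWHT of a vector whose length is a power of 2 (unnormalized, Sylvester ordering)."""
    x = np.array(x, dtype=float, copy=True)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b   # butterfly step
        h *= 2
    return x

# Sanity check against the explicit (unnormalized) Hadamard matrix product
D = 256
v = np.random.default_rng(0).standard_normal(D)
print(np.allclose(fwht(v), hadamard(D) @ v))   # True
```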

Figure 5: Analysis of model capacity for different numbers of features and hidden layers. [Panels: Powerplant, Protein, Mocap, Letter; x-axis: number of features (64, 128, 256, 512); y-axes: test error (top row) and test MNLL (bottom row); curves: 2 and 3 hidden layers.]


Figure 6: Comparison of test performance (test MNLL versus optimization step) on Letter, Miniboo, Mocap and Powerplant, for 64, 128, 256 and 512 features and 2 or 3 hidden layers. Being able to increase the number of features and hidden layers without worrying about overfitting or over-parameterizing the model is advantageous not only at convergence but during the entire learning procedure.

Figure 7: Inference time on the test set with batch size 128 and 64 Monte Carlo samples, as a function of the number of hidden units (64, 128, 256, 512), for WHVI (this work), MCD and NNG. [Panels: Boston, Powerplant, Yacht, Concrete, Kin8NM, Energy, Protein, Naval; y-axis: inference time in ms, log scale.] Experiment repeated 100 times. Additional datasets available in the Supplement.

Figure 8: On the left, time performance versus the number of features ($D = 2^6$ to $2^{11}$) with batch size fixed to 512, comparing FWHT and matmul on CPU and CUDA. On the right, distribution of execution time versus batch size (64 to 1024, with $D = 512$) with MATMUL and FWHT on GPU.


Table 7: Test error of Bayesian DNN with 2 hidden layers on regression datasets. NF: number of hidden features.

TEST ERROR
MODEL  NF   BOSTON      CONCRETE     ENERGY      KIN8NM     NAVAL      POWERPLANT  PROTEIN    YACHT
MCD    64   3.80±0.88   5.43±0.69    2.13±0.12   0.17±0.22  0.07±0.00  4.36±0.12   --         2.02±0.51
MCD    128  3.91±0.86   5.12±0.79    2.07±0.11   0.09±0.00  0.30±0.30  3.97±0.14   4.23±0.10  1.90±0.54
MCD    256  3.62±1.01   5.03±0.74    2.04±0.11   0.10±0.00  0.07±0.00  3.91±0.11   4.09±0.11  2.09±0.66
MCD    512  3.56±0.85   4.81±0.79    2.03±0.12   0.09±0.00  0.07±0.00  3.90±0.10   3.87±0.11  2.09±0.55
MFG    64   4.06±0.72   6.87±0.54    2.42±0.12   0.11±0.00  0.01±0.00  4.38±0.12   4.85±0.12  4.31±0.62
MFG    128  4.47±0.85   8.01±0.41    3.10±0.14   0.12±0.00  0.01±0.00  4.52±0.13   4.93±0.11  7.01±1.22
MFG    256  5.27±0.98   9.41±0.54    4.03±0.10   0.13±0.00  0.01±0.00  4.79±0.12   5.07±0.12  8.71±1.31
MFG    512  6.04±0.90   10.84±0.46   4.90±0.11   0.16±0.00  0.01±0.00  5.53±0.16   5.26±0.10  10.34±1.45
NNG    64   3.20±0.26   6.90±0.59    1.54±0.18   0.07±0.00  0.00±0.00  3.94±0.05   3.90±0.02  3.57±0.70
NNG    128  3.56±0.43   8.21±0.55    1.96±0.28   0.07±0.00  0.00±0.00  4.23±0.09   4.57±0.47  5.16±1.48
NNG    256  4.87±0.94   8.18±0.57    3.41±0.55   0.07±0.00  0.00±0.00  4.07±0.00   4.88±0.00  5.60±0.65
NNG    512  5.19±0.62   11.67±2.06   5.12±0.37   0.10±0.00  0.00±0.00  --          4.97±0.00  5.91±0.80
WHVI   64   3.33±0.82   5.24±0.77    0.73±0.11   0.08±0.00  0.01±0.00  4.07±0.11   4.49±0.12  0.82±0.18
WHVI   128  3.14±0.71   4.70±0.72    0.58±0.07   0.08±0.00  0.01±0.00  4.00±0.12   4.36±0.11  0.69±0.16
WHVI   256  2.99±0.85   4.63±0.78    0.52±0.07   0.08±0.00  0.01±0.00  3.95±0.12   4.24±0.11  0.76±0.13
WHVI   512  2.99±0.69   4.51±0.80    0.51±0.04   0.07±0.00  0.01±0.00  3.96±0.12   4.14±0.09  0.71±0.16

Table 8: Test MNLL of Bayesian DNN with 2 hidden layers on regression datasets. NF: number of hidden features.

TEST MNLL
MODEL  NF   BOSTON      CONCRETE    ENERGY      KIN8NM      NAVAL       POWERPLANT  PROTEIN    YACHT
MCD    64   5.67±2.35   3.19±0.28   4.19±0.15   0.78±0.69   2.68±0.00   2.79±0.01   --         2.85±1.02
MCD    128  6.90±2.93   3.20±0.36   4.15±0.15   0.87±0.02   1.00±2.27   2.74±0.05   2.76±0.02  2.95±1.27
MCD    256  6.60±3.59   3.31±0.45   4.13±0.15   0.70±0.05   2.70±0.00   2.75±0.04   2.72±0.01  3.79±1.88
MCD    512  7.28±3.31   3.45±0.59   4.13±0.17   0.76±0.03   2.71±0.00   2.77±0.04   2.68±0.02  3.76±1.65
MFG    64   2.83±0.33   3.26±0.08   4.42±0.10   0.92±0.02   6.24±0.01   2.80±0.03   2.90±0.01  2.85±0.24
MFG    128  2.99±0.41   3.41±0.05   4.91±0.09   0.83±0.02   6.23±0.01   2.83±0.03   2.92±0.01  3.38±0.29
MFG    256  3.33±0.53   3.57±0.07   5.44±0.05   0.69±0.01   6.22±0.01   2.89±0.02   2.95±0.01  3.65±0.32
MFG    512  3.69±0.54   3.73±0.05   5.83±0.05   0.49±0.01   6.19±0.01   3.04±0.03   2.98±0.01  3.86±0.31
NNG    64   2.69±0.06   3.40±0.15   1.95±0.08   1.14±0.05   5.83±1.49   2.80±0.01   2.78±0.01  2.71±0.17
NNG    128  2.72±0.09   3.56±0.08   2.11±0.12   1.19±0.04   6.52±0.09   2.86±0.02   2.95±0.12  3.06±0.27
NNG    256  3.04±0.22   3.52±0.07   2.64±0.17   1.19±0.03   5.73±0.21   2.84±0.00   3.02±0.01  3.15±0.13
NNG    512  3.13±0.14   3.91±0.20   3.07±0.07   0.80±0.00   5.30±0.05   --          3.51±0.00  3.21±0.14
WHVI   64   3.68±1.40   3.19±0.34   2.18±0.37   1.13±0.02   6.25±0.01   2.73±0.03   2.82±0.01  2.56±1.33
WHVI   128  4.33±1.80   3.17±0.37   2.00±0.60   1.19±0.04   6.25±0.01   2.71±0.03   2.79±0.01  1.80±1.01
WHVI   256  4.99±2.65   3.35±0.59   2.06±0.72   1.23±0.04   6.25±0.01   2.70±0.03   2.77±0.01  1.53±0.53
WHVI   512  5.41±2.30   3.33±0.56   2.05±0.46   1.22±0.04   6.25±0.01   2.70±0.03   2.74±0.01  1.37±0.57
