• Aucun résultat trouvé

Least squares problems How to state and solve them, then evaluate their solutions Stéphane Mottelet

N/A
N/A
Protected

Academic year: 2022

Partager "Least squares problems How to state and solve them, then evaluate their solutions Stéphane Mottelet"

Copied!
63
0
0

Texte intégral

(1)

How to state and solve them, then evaluate their solutions

Stéphane Mottelet

Université de Technologie de Compiègne

April 28, 2020

(2)

1 Motivation and statistical framework

2 Maths reminder (survival kit)

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

6 Model selection

(3)

1 Motivation and statistical framework

2 Maths reminder (survival kit)

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

6 Model selection

(4)

Regression problem

Data: (xi,yi)i=1..n, Model:y =fθ(x)

I x∈R:independentvariable

I y∈R:dependentvariable (value found by observation)

I θRp: parameters

Regressionproblem

Findθsuch that the modelbestexplains the data, i.e.yi isclosetofθ(xi),i = 1. . .n.

(5)

Regression problem, example

Simplelinearregression : (xi,yi)∈R2

y

−→findθ1, θ2such that thedatafits the modely =θ1+θ2x

How does one measure the fit/misfit ?

(6)

Least squares method

Theleastsquaresmethod measures the fit with the Sum of Squared Residuals (SSR) S(θ) =

n

X

i=1

(yi−fθ(xi))2,

and aims to find ˆθsuch that

∀θ∈Rp, S( ˆθ)≤S(θ), or equivalently

θˆ= arg min

θRp S(θ).

Important issues

statistical interpretation

existence, uniqueness and practical determination of ˆθ(algorithms)

(7)

Hypothesis

1 (xi)i=1...nare given

2 (yi)i=1...nare samples of random variables

yi =fθ(xi) +εi,i= 1. . .n,

whereεi,i = 1. . .nareindependent and identically distributed(i.i.d.) and E[εi] = 0,E[ε2i] =σ2,densityε→g(ε) The probability density ofyi is given by

φiθ:R−→R

y −→φiθ(y) =g(y−fθ(xi)) henceE[yi|θ] =fθ(xi).

(8)

Example

Ifεis normally distributed, i.e.g(ε) = (σ√

2π)−1exp(−12ε2), we have φiθ(y) = (σ√

2π)−1exp

− 1

2(y −fθ(xi))2

(9)

Joint probability density and Likelihood function

Joint density

Whenθis given, as the (yi) are independent, the density of the vectory= (y1, . . . ,yn) is φθ(y) =

n

Y

i=1

φiθ(yi) =φ1θ(y12θ(y2). . . φnθ(yn).

Interpretation : forD⊂Rn

Prob(y∈D|θ) = Z

D

φθ(y)dy1. . .dyn

Likelihood function

When a sample ofyis given, thenLy(θ)def=φθ(y) is called Likelihood of the parametersθ

(10)

Maximum Likelihood Estimation

TheMaximum Likelihood Estimateofθis the vector ˆθdefined by θˆ= arg max

θ∈RpLy(θ).

Under the Gaussian hypothesis, then Ly(θ)=

n

Y

i=1

(σ√

2π)−1exp

− 1

2(yi−fθ(xi))2

,

= (σ

2π)−nexp − 1 2σ2

n

X

i=1

(yi−fθ(xi))2

! ,

hence, we recover the least squares solution, i.e.

arg max

θ∈RpLy(θ)= arg min

θ∈RpS(θ).

(11)

Alternatives : Least Absolute Deviation Regression

LeastAbsolute DeviationRegression : the misfit is measured by S1(θ) =

n

X

i=1

|yi−fθ(xi)|.

Is ˆθ= arg minθ∈RpS1(θ) is a maximum likelihood estimate ?

Yes, ifεi has aLaplace distribution g(ε) = (σ√

2)−1exp −

√ 2 σ |ε|

!

First issue : S1is not differentiable

(12)

Alternatives : Least Absolute Deviation Regression

Densities of Gaussian vs. Laplacian random variables (with zero mean and unit variance) :

Second issue :the two statistical hypothesis are very different !

(13)

Take home message

Take home message #1 :

Doing Least Squares Regression means that you assume that the model error is Gaussian.

However, if you have no idea about the model error :

1 thenice theoretical and computational frameworkyou will get is worth doing this assumption. . .

2 a posteriori goodness of fit testscan be used to assess the normality of errors.

(14)

1 Motivation and statistical framework

2 Maths reminder

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

6 Model selection

(15)

Matrix algebra

Notation :A∈ Mn,m(R),x ∈Rn,

A=

a11 . . . a1m ... . .. ... an1 . . . anm

, x =

 x1

... xn

Product : forB∈ Mm,p(R),C=AB∈ Mn,p(R), cij=

m

X

k=1

aikbkj

Identity matrix

I=

 1

. ..

1

(16)

Matrix algebra

Transposition, Inner product and norm :

A>∈ Mm,n(R) , A>

ij =aji

Forx ∈Rn,y ∈Rn,

hx,yi=x>y =

n

X

i=1

xiyi, kxk2=x>x

(17)

Matrix algebra

Linear dependance / independence :

a set{x1, . . . ,xm}of vectors inRnis dependent if a vectorxj can be written as xj =

m

X

k=1,k6=i

αkxk

I a set of vectors which is not dependent is called independent

I a set ofm>nvectors is necessarily dependent

I a set ofnindependent vectors inRnis called a basis

The rank of aA∈ Mnmis the number of its linearly independent columns rank(A) =m⇐⇒ {Ax = 0⇒x = 0}

(18)

Linear system of equations

WhenAis square

rank(A) =n⇐⇒there existsA−1s.t.A−1A=AA−1=I

When the above property holds :

For ally ∈Rn, the system of equations

Ax =y, has a unique solutionx =A−1y.

Computation : Gauss elimination algorithm (no computation ofA−1) in Scilab/Matlab :x =A\y

(19)

Differentiability

Definition : letf :Rn−→Rm,

f(x) =

 f1(x)

... fm(x)

, fi :Rn−→R, f is differentiable ata∈Rnif

f(a+h) =f(a) +f0(a)h+khkε(h), lim

h→0ε(h) = 0

Jacobian matrix,partial derivatives:

f0(a)

ij= ∂fi

∂xj

(a)

Gradient: iff :Rn−→R, is differentiable ata,

f(a+h) =f(a) +∇f(a)>h+khkε(h), lim

h→0ε(h) = 0

(20)

Nonlinear system of equations

Whenf :Rn−→Rn, a solution ˆx to the system of equations f( ˆx) = 0

can be found (or not) by the Newton’s method : givenx0, for eachk

1 consider the affine approximation off atxk

T(x) =f(xk) +f0(xk)(x−xk)

2 takexk+1such thatT(xk+1) = 0,

xk+1=xk −f0(xk)−1f(xk)

Newton’s method can be very fast. . . ifx0is not too far from ˆx !

(21)

Find a local minimum - gradient algorithm

Whenf :Rn−→Ris differentiable, a vector ˆx satisfying∇f( ˆx) = 0 and

∀x ∈Rn,f( ˆx)≤f(x) can be found by the descent algorithm : givenx0, for eachk :

1 select a directiondk such that∇f(xk)>dk <0

2 select a stepρk, such that

xk+1=xk +ρkdk, satisfies (among other conditions)

f(xk+1)<f(xk)

The choicedk =−∇f(xk) leads to thegradient algorithm

(22)

Find a local minimum - gradient algorithm

x =x −ρ ∇f(x ),

(23)

1 Motivation and statistical framework

2 Maths reminder

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

(24)

Linear models

The modely =fθ(x) is linear w.r.t.θ, i.e.

y =

p

X

j=1

θjφj(x), φk :RR

Examples

I y=Pp j=1θjxj−1

I y=Pp

j=1θjcos(j−1)xT , whereT =xn−x1 I . . .

(25)

The residual for simple linear regression

Simple linear regression

S(θ) =

n

X

i=1

1+θ2xi−yi)2=kr(θ)k2, Residual vectorr(θ)

ri(θ) = [1,xi] θ1

θ2

−yi

For the whole residual vector

r(θ) =Aθ−y, y =

 y1

... yn

, A=

 1 x1

... ... 1 xn

(26)

The residual for a general linear model

General linear modelfθ(x) =Pp

j=1θjφj(x)

S(θ) =

n

X

i=1

(fθ(xi)−yi)2=kr(θ)k2,

=

n

X

i=1

Pp

j=1θjφj(xi)−yi2

=kr(θ)k2,

Residual vectorr(θ)

ri(θ) = [φ1(xi), . . . , φp(xi)]

θ1

... θ2

−yi

For the whole residual vectorr(θ) =Aθ−y whereAhas sizen×pand aij =φj(xi).

(27)

Optimality conditions

Linear Least Squares problem : find ˆθ θˆ= arg min

θRp S( ˆθ) =kAθ−yk2 Necessary optimality condition

∇S( ˆθ) = 0 Compute the gradient by expandingS(θ)

(28)

Optimality conditions

S(θ+h) =kA(θ+h)−yk2=kAθ−y +Ahk2

= (Aθ−y+Ah)>(Aθ−y +Ah)

= (Aθ−y)>(Aθ−y) + (Aθ−y)>Ah+ (Ah)>(Aθ−y) + (Ah)>Ah

=kAθ−yk2+ 2(Aθ−y)>Ah+kAhk2

=S(θ) +∇S(θ)>h+kAhk2

∇S(θ) = 2A>(Aθ−y), hence∇S( ˆθ) = 0 implies

A>Aθˆ=A>y.

(29)

Optimality conditions

Theorem :a solution of the LLS problem is given by ˆθ, solution of the “normal equations”

A>Aθˆ=A>y, moreover, if rankA=pthen ˆθis unique.

Proof :

S(θ) =S( ˆθ+θθ) =ˆ S( ˆθ) +∇S( ˆθ)>(θ−θ) +ˆ kA(θ−θ)kˆ 2,

=S( ˆθ) +kA(θ−θ)kˆ 2,

≥S( ˆθ) Uniqueness :

S( ˆθ) =S(θ)⇐⇒ kA(θ−θ)kˆ 2= 0,

⇐⇒A(θ−θ) = 0ˆ

⇐⇒θ= ˆθ,

(30)

Simple linear regression

rankA= 2 if there existsi 6=j such thatxi 6=xj Computations :

Sx =

n

X

i=1

xi,Sy =

n

X

i=1

yi,Sxy =

n

X

i=1

xiyi,Sxx =

n

X

i=1

xi2

A>A=

n Sx

Sx Sxx

, A>y = Sy

Sxy

θ1= SySxx−SxSxy

nSxx −Sx2 , θ2=nSxy −SxSy

nSxx−Sx2

(31)

Practical resolution with Scilab

WhenAis square and invertible, the Scilab command x=A\y computesx, the unique solution ofA*x=y.

WhenAis not square and has full (column) rank, then the command x=A\y

computesx, the unique least squares solution. i.e. such thatnorm(A*x-y)is minimal.

I Although mathematically equivalent to

x=(A’*A)\(A’*y)

the commandx=A\yisnumerically more stable, precise and efficient

(32)

Practical resolution with Scilab

Fit (xi,yi)i=1...nwith a polynomial of degree 2 with Scilab

(33)

An interesting example

Find a circle wich best fits (xi,yi)i=1...nin the plane

Minimize thealgebraic distance d(a,b,R) =

n

X

i=1

(xi −a)2+ (yi −b)2−R22

=krk2

(34)

An interesting example

Algebraic distance

d(a,b,R) =

n

X

i=1

(xi −a)2+ (yi −b)2−R22

=krk2 The residual vector is non-linear w.r.t. (a,b,R) but we have

ri =R2−a2−b2+ 2axi+ 2byi−(xi2+yi2),

= [2xi,2yi,1]

a b R2−a2−b2

−(xi2+yi2) hence residual is linear w.r.t.θ= (a,b,R2−a2−b2).

(35)

An interesting example

Standard form, the unknown isθ= (a,b,R2−a2−b2)

A=

2x1 2y1 1 ... ... ... 2xn 2yn 1

, z =

x12+y12 ... xn2+yn2

, d(a,b,R) =kAθ−zk2 In Scilab

A=[2*x,2*y,ones(x)]

z=x.^2+y.^2 theta=A\z a=theta(1) b=theta(2)

R=sqrt(theta(3)+a^2+b^2) t=linspace(0,2*%pi,100)

plot(x,y,"o",a+R*cos(t),b+R*sin(t))

(36)

Take home message

Take home message #2 :

Solving linear least squares problem is just a matter of linear algebra

(37)

1 Motivation and statistical framework

2 Maths reminder

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

6 Model selection

(38)

Example

Consider data (xi,yi) to be fitted by the non linear model y =fθ(x) = exp(θ1+θ2x), The“log trick”leads some people to minimize

Slog(θ) =

n

X

i=1

(logyi −(θ1+θ2xi))2,

i.e. do simple linear regression of (logyi) against (xi), but this violates a fundamental hypothesis because

ifyi−fθ(xi) is normally distributed then logyi −logfθ(xi) is not !

(39)

Possibles angles of attack

Remember that

S(θ) =kr(θ)k2, ri(θ) =fθ(xi)−yi.

A local minimum ofScan be found by different methods :

Find a solution of the non linear systems of equations

∇S(θ) = 2r0(θ)>r(θ) = 0, with the Newton’s method :

I needs to compute the Jacobian of the gradient itself (do you really want to compute second derivatives ?),

I does not guarantee convergence towards a minimum.

(40)

Possibles angles of attack

Use the spirit of Newton’s method as follows : start withθ0and for eachk

consider the Taylor development of the residual vector atθk

r(θ) =r(θk) +r0k)(θ−θk)+kθ−θkkε(θ−θk) and takeθk+1such that the squared norm of theaffine approximation

kr(θk) +r0k)(θk+1θk)k2 is minimal.

findingθk+1θk is a LLS problem !

(41)

Gauss-Newton method

Original formulation of the Gauss-Newton method

θk+1=θk−[r0k)>r0k)]−1r0k)>r(θk),

Equivalent Scilab implementation using backslash\operator θk+1=θk−r0k)\r(θk)

Problem: what can you do whenr0k) has not full column rank ?

(42)

Levenberg-Marquardt method

Modify the Gauss-Newton iteration: pick up aλ >0and takeθk+1such that Sλk+1θk) =kr(θk) +r0k)(θk+1θk)k2+λk(θk+1θk)k2 is minimal.

After rewritingSλk+1θk) using block matrix notation as Sλk+1θk) =

r0k) λ12I

k+1θk) +

r(θk) 0

2

findingθk+1θk is a LLS problem and for anyλ >0a unique solution exists !

(43)

Levenberg-Marquardt method

Since the residual vector reads r0k)

λ12I

k+1θk) +

r(θk) 0

the normal equations of the LLS are given by r0k)

λ12I >

r0k) λ12I

k+1θk) =−

r0k) λ12I

>

r(θk) 0

⇐⇒

r0k)>12I r0k)

λ12I

k+1θk) =−

r0k)>12I r(θk)

0

⇐⇒ r0k)>r0k) +λI

k+1θk) =−r0k)>r(θk)

(44)

Levenberg-Marquardt method

Hence, the mathematical formulation of Levenberg-Marquardt method is θk+1=θk−[r0k)>r0k) +λI]−1r0k)>r(θk)

but practical Scilab implementation should use the backslash \ operator θk+1=θk

r0k) λ12I

\

r(θk) 0

(45)

Levenberg-Marquardt method

Where is the insight in Levenberg-Marquardt method ?

Remember that∇S(θ) = 2r0(θ)>r(θ), hence LM iteration reads θk+1=θk12 r0k)>r0k) +λI−1

∇S(θk),

=θk1 λ1r0k)>r0k) +I−1

∇S(θk)

I Whenλis small, LM methods behaves more like the Gauss-Newton method.

I Whenλis large, LM methods behaves more like the gradient method.

λallows to balance between speed (λ= 0) and robustness (λ→ ∞)

(46)

Example 1

Consider data (xi,yi) to be fitted by the non linear modelfθ(x) = exp(θ1+θ2x) :

The Jacobian ofr(θ) is given by

r0(θ) =

exp(θ1+θ2x1) x1exp(θ1+θ2x1)

... ...

exp(θ1+θ2xn) xnexp(θ1+θ2x1)

(47)

Example 1

In Scilab, use thelsqrsolveorleastsqfunction:

θˆ= (0.981, -2.905)

function r=resid(theta,n) r=exp(theta(1)+theta(2)*x)-y;

endfunction

function j=jac(theta,n) e=exp(theta(1)+theta(2)*x);

j=[e x.*e];

endfunction load data_exp.dat theta0=[0;0];

theta=lsqrsolve(theta0,resid,length(x),jac);

plot(x,y,"ob", x,exp(theta(1)+theta(2)*x),"r")

(48)

Example 2

Enzymatic kinetics

s0(t) =θ2 s(t)

s(t) +θ3,t >0, s(0) =θ1,

yi = measurement ofsat timeti

S(θ) =kr(θ)k2, ri(θ) = yi−s(ti) σi

Individual weightsσi allow to take into account different standard deviations of measurements

(49)

Example 2

In Scilab, use thelsqrsolveorleastsqfunction

θˆ= (887.9, 37.6, 97.7)

function dsdt=michaelis(t,s,theta) dsdt=theta(2)*s/(s+theta(3)) endfunction

function r=resid(theta,n) s=ode(theta(1),0,t,michaelis) r=(s-y)./sigma

endfunction

load michaelis_data.dat theta0=[y(1);20;80];

theta=lsqrsolve(theta0,resid,n)

If not provided, the Jacobianr0(θ) is approximated by finite differences (but true Jacobian always speed up convergence).

(50)

Take home message

Take home message #3 :

Solving non linear least squares problems is not that difficult with adequate software and good starting values

(51)

1 Motivation and statistical framework

2 Maths reminder

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

6 Model selection

(52)

Motivation

Since the data (yi)i=1...nis a sample of random variables, then ˆθtoo !

Confidence intervals for ˆθcan be easily obtained by at least two methods

I Monte-Carlo method : allows to estimate the distribution of ˆθbut needs thousands of resamplings

I Linearized statistics : very fast, but can be very approximate for high level of measurement error

(53)

Monte Carlo method

The Monte Carlo method is aresamplingmethod, i.e. works by generating new samples of synthetic measurement and redoing the estimation of ˆθ. Here model is

y =θ1+θ2x+θ3x2, and data is corrupted by noise withσ= 12

(54)

Monte Carlo method

θ1 θ2 θ3

At confidence level=95%,

θˆ1∈[0.99,1.29], θˆ2∈[−1.20,−0.85], θˆ1∈[−2.57,−1.91].

(55)

Linearized Statistics

Define the weighted residualr(θ) by

ri(θ) = yi−fθ(xi) σi , whereσi is the standard deviation ofyi.

The covariance matrix of ˆθcan be approximated by V( ˆθ) =F( ˆθ)−1 whereF( ˆθ) is the Fisher Information Matrix, given by

F(θ) =r0(θ)>r0(θ) For example, whenσi =σfor alli, in LLS problems

V( ˆθ) =σ2A>A

(56)

Linearized Statistics

θˆ= (887.9, 37.6, 97.7)

d=derivative(resid,theta) V=inv(d’*d)

sigma_theta=sqrt(diag(V)) // 0.975 fractile Student dist.

t_alpha=cdft("T",m-3,0.975,0.025);

thetamin=theta-t_alpha*sigma_theta thetamax=theta+t_alpha*sigma_theta

At 95% confidence level

θˆ1∈[856.68,919.24], θˆ2∈[34.13,41.21], θˆ3∈[93.37,102.10].

(57)

1 Motivation and statistical framework

2 Maths reminder

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

6 Model selection

(58)

Motivation : which model is the best ?

(59)

Motivation : which model is the best ?

On the previous slide data has been fitted with the model y =

p

X

k=0

θkxk, p= 0. . .8,

ConsiderS( ˆθ) as a function of model orderpdoes not help much

p S( ˆθ) 0 3470.32 1 651.44 2 608.53 3 478.23 4 477.78 5 469.20 6 461.00 7 457.52 8 448.10

(60)

Validation

Validation is the key of model selection :

1 Define two sets of data

I T ⊂ {1, . . .n}for model training

I V={1, . . .n} \Tfor validation

2 For each value of model orderp

I Compute the optimal parameters with the training data θˆp= arg min

θ∈Rp

X

i∈T

(yi−fθ(xi))2

I Compute the validation residual

SV(θˆp) =X

i∈V

(yi−fθˆp(xi))2

(61)

Training + Validation

(62)

Training + Validation

Validation helps a lot: here the best model order is clearlyp= 3 !

p SV( ˆθp) 0 11567.21 1 2533.41 2 2288.52 3 259.27 4 326.09 5 2077.03 6 6867.74 7 26595.40 8 195203.35

(63)

Take home message

Take home message #4 :

Always evaluate your models by either computing confidence intervals for the parameters or by using validation.

Références

Documents relatifs

In this paper we proposed m-power regularized least squares regression (RLSR), a supervised regression algorithm based on a regularization raised to the power of m, where m is with

Simple metal Solid state Glass membrane Enzyme Liquid membrane Gas sensing We’ll review representative examples of each.. You can assume that the proper reference

While the discriminative framework is based on convex relaxations and has led to interesting developments and applications [Zhang et al., 2009, Li et al., 2009, Joulin et al.,

For any realization of this event for which the polynomial described in Theorem 1.4 does not have two distinct positive roots, the statement of Theorem 1.4 is void, and

For the ridge estimator and the ordinary least squares estimator, and their variants, we provide new risk bounds of order d/n without logarithmic factor unlike some stan- dard

Note that the columns with the predicted RPS scores ( pred rps ) and the logarithmically transformed RPS scores ( pred log_rps ) are generated later in this example.

In fact, our aim is to discuss the problems which occur when the weighted algorithm is applied to the particular case of LL fuzzy numbers, as in Section 5.. Till then it is necessary

We make use of two theories: (1) Gaussian Random Function Theory (that studies precise properties of objects like Brownian bridges, Strassen balls, etc.) that allows for