Least squares problems How to state and solve them, then evaluate their solutions Stéphane Mottelet

(1)

How to state and solve them, then evaluate their solutions

Stéphane Mottelet

Université de Technologie de Compiègne

April 28, 2020

(2)

1 Motivation and statistical framework

2 Maths reminder (survival kit)

3 Linear Least Squares (LLS)

4 Non Linear Least Squares (NLLS)

5 Statistical evaluation of solutions

6 Model selection

(3)

1 Motivation and statistical framework

2 Maths reminder (survival kit)

6 Model selection

(4)

Regression problem

Data: (xi,yi)i=1..n, Model:y =fθ(x)

I x∈R:independentvariable

I y∈R:dependentvariable (value found by observation)

I θ∈R^p: parameters

Regressionproblem

Findθsuch that the modelbestexplains the data, i.e.yi isclosetofθ(xi),i = 1. . .n.

(5)

Regression problem, example

Simplelinearregression : (x_i,y_i)∈R²

y

−→findθ₁, θ₂such that thedatafits the modely =θ₁+θ₂x

How does one measure the fit/misfit ?

(6)

Least squares method

Theleastsquaresmethod measures the fit with the Sum of Squared Residuals (SSR) S(θ) =

n

X

i=1

(y_i−fθ(x_i))²,

and aims to find ˆθsuch that

∀θ∈R^p, S( ˆθ)≤S(θ), or equivalently

θˆ= arg min

θR^p S(θ).

Important issues

statistical interpretation

existence, uniqueness and practical determination of ˆθ(algorithms)

(7)

Hypothesis

1 (xi)i=1...nare given

2 (y_i)_i=1...nare samples of random variables

y_i =fθ(x_i) +ε_i,i= 1. . .n,

whereε_i,i = 1. . .nareindependent and identically distributed(i.i.d.) and E[εi] = 0,E[ε²_i] =σ²,densityε→g(ε) The probability density ofyi is given by

φⁱ_θ:R−→R

y −→φⁱ_θ(y) =g(y−fθ(xi)) henceE[yi|θ] =fθ(xi).

(8)

Example

Ifεis normally distributed, i.e.g(ε) = (σ√

2π)⁻¹exp(−_2σ¹₂ε²), we have φⁱ_θ(y) = (σ√

2π)⁻¹exp

− 1

2σ²(y −fθ(x_i))²

(9)

Joint probability density and Likelihood function

Joint density

Whenθis given, as the (y_i) are independent, the density of the vectory= (y₁, . . . ,y_n) is φ_θ(y) =

n

Y

i=1

φⁱ_θ(yi) =φ¹_θ(y1)φ²_θ(y2). . . φⁿ_θ(yn).

Interpretation : forD⊂Rⁿ

Prob(y∈D|θ) = Z

D

φ_θ(y)dy₁. . .dyn

Likelihood function

When a sample ofyis given, thenLy(θ)^def=φθ(y) is called Likelihood of the parametersθ

(10)

Maximum Likelihood Estimation

TheMaximum Likelihood Estimateofθis the vector ˆθdefined by θˆ= arg max

θ∈R^pLy(θ).

Under the Gaussian hypothesis, then Ly(θ)=

n

Y

i=1

(σ√

2π)⁻¹exp

− 1

2σ²(yi−fθ(xi))²

,

= (σ

√

2π)⁻ⁿexp − 1 2σ²

n

X

i=1

(yi−fθ(xi))²

! ,

hence, we recover the least squares solution, i.e.

arg max

θ∈R^pLy(θ)= arg min

θ∈R^pS(θ).

(11)

Alternatives : Least Absolute Deviation Regression

LeastAbsolute DeviationRegression : the misfit is measured by S1(θ) =

n

X

i=1

|y_i−fθ(xi)|.

Is ˆθ= arg minθ∈R^pS₁(θ) is a maximum likelihood estimate ?

Yes, ifε_i has aLaplace distribution g(ε) = (σ√

2)⁻¹exp −

√ 2 σ |ε|

!

First issue : S₁is not differentiable

(12)

Alternatives : Least Absolute Deviation Regression

Densities of Gaussian vs. Laplacian random variables (with zero mean and unit variance) :

Second issue :the two statistical hypothesis are very different !

(13)

Take home message

Take home message #1 :

Doing Least Squares Regression means that you assume that the model error is Gaussian.

However, if you have no idea about the model error :

1 thenice theoretical and computational frameworkyou will get is worth doing this assumption. . .

2 a posteriori goodness of fit testscan be used to assess the normality of errors.

(14)

2 Maths reminder

6 Model selection

(15)

Matrix algebra

Notation :A∈ M_n,m(R),x ∈Rⁿ,

A=







a₁₁ . . . a_1m ... . .. ... a_n1 . . . a_nm





, x =





 x₁

... x_n







Product : forB∈ M_m,p(R),C=AB∈ M_n,p(R), cij=

m

X

k=1

aikbkj

Identity matrix

I=





 1

. ..

1







(16)

Matrix algebra

Transposition, Inner product and norm :

A^>∈ M_m,n(R) , A^>

ij =aji

Forx ∈Rⁿ,y ∈Rⁿ,

hx,yi=x^>y =

n

X

i=1

xiyi, kxk²=x^>x

(17)

Matrix algebra

Linear dependance / independence :

a set{x₁, . . . ,x_m}of vectors inRⁿis dependent if a vectorx_j can be written as x_j =

m

X

k=1,k6=i

α_kx_k

I a set of vectors which is not dependent is called independent

I a set ofm>nvectors is necessarily dependent

I a set ofnindependent vectors inRⁿis called a basis

The rank of aA∈ M_nmis the number of its linearly independent columns rank(A) =m⇐⇒ {Ax = 0⇒x = 0}

(18)

Linear system of equations

WhenAis square

rank(A) =n⇐⇒there existsA⁻¹s.t.A⁻¹A=AA⁻¹=I

When the above property holds :

For ally ∈Rⁿ, the system of equations

Ax =y, has a unique solutionx =A⁻¹y.

Computation : Gauss elimination algorithm (no computation ofA⁻¹) in Scilab/Matlab :x =A\y

(19)

Differentiability

Definition : letf :Rⁿ−→R^m,

f(x) =





 f₁(x)

... f_m(x)





, fi :Rⁿ−→R, f is differentiable ata∈Rⁿif

f(a+h) =f(a) +f⁰(a)h+khkε(h), lim

h→0ε(h) = 0

Jacobian matrix,partial derivatives:

f⁰(a)

ij= ∂f_i

∂xj

(a)

Gradient: iff :Rⁿ−→R, is differentiable ata,

f(a+h) =f(a) +∇f(a)^>h+khkε(h), lim

h→0ε(h) = 0

(20)

Nonlinear system of equations

Whenf :Rⁿ−→Rⁿ, a solution ˆx to the system of equations f( ˆx) = 0

can be found (or not) by the Newton’s method : givenx0, for eachk

1 consider the affine approximation off atxk

T(x) =f(x_k) +f⁰(x_k)(x−x_k)

2 takex_k₊₁such thatT(x_k+1) = 0,

xk+1=xk −f⁰(xk)⁻¹f(xk)

Newton’s method can be very fast. . . ifx₀is not too far from ˆx !

(21)

Find a local minimum - gradient algorithm

Whenf :Rⁿ−→Ris differentiable, a vector ˆx satisfying∇f( ˆx) = 0 and

∀x ∈Rⁿ,f( ˆx)≤f(x) can be found by the descent algorithm : givenx0, for eachk :

1 select a directiond_k such that∇f(x_k)^>d_k <0

2 select a stepρ_k, such that

x_k+1=x_k +ρ_kd_k, satisfies (among other conditions)

f(x_k+1)<f(x_k)

The choiced_k =−∇f(x_k) leads to thegradient algorithm

(22)

Find a local minimum - gradient algorithm

x =x −ρ ∇f(x ),

(23)

2 Maths reminder

3 Linear Least Squares (LLS)

(24)

Linear models

The modely =fθ(x) is linear w.r.t.θ, i.e.

y =

p

X

j=1

θ_jφ_j(x), φ_k :R→R

Examples

I y=Pp j=1θjx^j−1

I y=Pp

j=1θjcos^(j−1)x_T , whereT =xn−x1 I . . .

(25)

The residual for simple linear regression

Simple linear regression

S(θ) =

n

X

i=1

(θ1+θ₂xi−yi)²=kr(θ)k², Residual vectorr(θ)

ri(θ) = [1,xi] θ₁

θ₂

−yi

For the whole residual vector

r(θ) =Aθ−y, y =





 y₁

... y_n





, A=





 1 x₁

... ... 1 x_n







(26)

The residual for a general linear model

General linear modelfθ(x) =Pp

j=1θ_jφ_j(x)

S(θ) =

n

X

i=1

(fθ(x_i)−y_i)²=kr(θ)k²,

=

n

X

i=1

Pp

j=1θ_jφ_j(x_i)−y_i2

=kr(θ)k²,

Residual vectorr(θ)

ri(θ) = [φ1(xi), . . . , φp(xi)]





 θ₁

... θ₂





−yi

For the whole residual vectorr(θ) =Aθ−y whereAhas sizen×pand aij =φ_j(xi).

(27)

Optimality conditions

Linear Least Squares problem : find ˆθ θˆ= arg min

θR^p S( ˆθ) =kAθ−yk² Necessary optimality condition

∇S( ˆθ) = 0 Compute the gradient by expandingS(θ)

(28)

S(θ+h) =kA(θ+h)−yk²=kAθ−y +Ahk²

= (Aθ−y+Ah)^>(Aθ−y +Ah)

= (Aθ−y)^>(Aθ−y) + (Aθ−y)^>Ah+ (Ah)^>(Aθ−y) + (Ah)^>Ah

=kAθ−yk²+ 2(Aθ−y)^>Ah+kAhk²

=S(θ) +∇S(θ)^>h+kAhk²

∇S(θ) = 2A^>(Aθ−y), hence∇S( ˆθ) = 0 implies

A^>Aθˆ=A^>y.

(29)

Theorem :a solution of the LLS problem is given by ˆθ, solution of the “normal equations”

A^>Aθˆ=A^>y, moreover, if rankA=pthen ˆθis unique.

Proof :

S(θ) =S( ˆθ+θ−θ) =ˆ S( ˆθ) +∇S( ˆθ)^>(θ−θ) +ˆ kA(θ−θ)kˆ ²,

=S( ˆθ) +kA(θ−θ)kˆ ²,

≥S( ˆθ) Uniqueness :

S( ˆθ) =S(θ)⇐⇒ kA(θ−θ)kˆ ²= 0,

⇐⇒A(θ−θ) = 0ˆ

⇐⇒θ= ˆθ,

(30)

Simple linear regression

rankA= 2 if there existsi 6=j such thatx_i 6=x_j Computations :

S_x =

n

X

i=1

x_i,S_y =

n

X

i=1

y_i,S_xy =

n

X

i=1

x_iy_i,S_xx =

n

X

i=1

x_i²

A^>A=

n Sx

Sx Sxx

, A^>y = Sy

Sxy

θ₁= SySxx−SxSxy

nS_xx −S_x² , θ₂=nSxy −SxSy

nS_xx−S_x²

(31)

Practical resolution with Scilab

WhenAis square and invertible, the Scilab command x=A\y computesx, the unique solution ofA*x=y.

WhenAis not square and has full (column) rank, then the command x=A\y

computesx, the unique least squares solution. i.e. such thatnorm(A*x-y)is minimal.

I Although mathematically equivalent to

x=(A’*A)\(A’*y)

the commandx=A\yisnumerically more stable, precise and efficient

(32)

Practical resolution with Scilab

Fit (xi,yi)i=1...nwith a polynomial of degree 2 with Scilab

(33)

An interesting example

Find a circle wich best fits (x_i,y_i)_i=1...nin the plane

Minimize thealgebraic distance d(a,b,R) =

n

X

i=1

(xi −a)²+ (yi −b)²−R²2

=krk²

(34)

Algebraic distance

d(a,b,R) =

n

X

i=1

(x_i −a)²+ (y_i −b)²−R²2

=krk² The residual vector is non-linear w.r.t. (a,b,R) but we have

r_i =R²−a²−b²+ 2ax_i+ 2by_i−(x_i²+y_i²),

= [2x_i,2y_i,1]





a b R²−a²−b²



−(x_i²+y_i²) hence residual is linear w.r.t.θ= (a,b,R²−a²−b²).

(35)

Standard form, the unknown isθ= (a,b,R²−a²−b²)

A=







2x1 2y1 1 ... ... ... 2xn 2yn 1





, z =







x₁²+y₁² ... x_n²+y_n²





, d(a,b,R) =kAθ−zk² In Scilab

A=[2*x,2*y,ones(x)]

z=x.^2+y.^2 theta=A\z a=theta(1) b=theta(2)

R=sqrt(theta(3)+a^2+b^2) t=linspace(0,2*%pi,100)

plot(x,y,"o",a+R*cos(t),b+R*sin(t))

(36)

Take home message

Solving linear least squares problem is just a matter of linear algebra

(37)

2 Maths reminder

4 Non Linear Least Squares (NLLS)

6 Model selection

(38)

Example

Consider data (x_i,y_i) to be fitted by the non linear model y =fθ(x) = exp(θ₁+θ₂x), The“log trick”leads some people to minimize

Slog(θ) =

n

X

i=1

(logyi −(θ1+θ₂xi))²,

i.e. do simple linear regression of (logyi) against (xi), but this violates a fundamental hypothesis because

ifyi−fθ(xi) is normally distributed then logyi −logfθ(xi) is not !

(39)

Possibles angles of attack

Remember that

S(θ) =kr(θ)k², ri(θ) =fθ(xi)−yi.

A local minimum ofScan be found by different methods :

Find a solution of the non linear systems of equations

∇S(θ) = 2r⁰(θ)^>r(θ) = 0, with the Newton’s method :

I needs to compute the Jacobian of the gradient itself (do you really want to compute second derivatives ?),

I does not guarantee convergence towards a minimum.

(40)

Possibles angles of attack

Use the spirit of Newton’s method as follows : start withθ₀and for eachk

consider the Taylor development of the residual vector atθ_k

r(θ) =r(θ_k) +r⁰(θ_k)(θ−θ_k)+kθ−θ_kkε(θ−θ_k) and takeθ_k+1such that the squared norm of theaffine approximation

kr(θ_k) +r⁰(θ_k)(θ_k+1−θ_k)k² is minimal.

findingθ_k+1−θ_k is a LLS problem !

(41)

Gauss-Newton method

Original formulation of the Gauss-Newton method

θ_k+1=θ_k−[r⁰(θk)^>r⁰(θk)]⁻¹r⁰(θk)^>r(θk),

Equivalent Scilab implementation using backslash\operator θ_k+1=θ_k−r⁰(θ_k)\r(θ_k)

Problem: what can you do whenr⁰(θk) has not full column rank ?

(42)

Levenberg-Marquardt method

Modify the Gauss-Newton iteration: pick up aλ >0and takeθ_k+1such that Sλ(θk+1−θ_k) =kr(θ_k) +r⁰(θk)(θk+1−θ_k)k²+λk(θ_k₊₁−θ_k)k² is minimal.

After rewritingSλ(θ_k+1−θ_k) using block matrix notation as Sλ(θ_k+1−θ_k) =

r⁰(θ_k) λ¹²I

(θ_k+1−θ_k) +

r(θ_k) 0

2

findingθ_k+1−θ_k is a LLS problem and for anyλ >0a unique solution exists !

(43)

Since the residual vector reads r⁰(θk)

λ¹²I

(θ_k+1−θ_k) +

r(θk) 0

the normal equations of the LLS are given by r⁰(θk)

λ¹²I ^>

r⁰(θk) λ¹²I

(θk+1−θ_k) =−

r⁰(θk) λ¹²I

^>

r(θk) 0

⇐⇒

r⁰(θk)^>,λ¹²I r⁰(θk)

λ¹²I

(θk+1−θ_k) =−

r⁰(θk)^>,λ¹²I r(θk)

0

⇐⇒ r⁰(θk)^>r⁰(θk) +λI

(θk+1−θ_k) =−r⁰(θk)^>r(θk)

(44)

Hence, the mathematical formulation of Levenberg-Marquardt method is θ_k+1=θ_k−[r⁰(θ_k)^>r⁰(θ_k) +λI]⁻¹r⁰(θ_k)^>r(θ_k)

but practical Scilab implementation should use the backslash \ operator θ_k+1=θ_k−

r⁰(θ_k) λ¹²I

\

r(θ_k) 0

(45)

Where is the insight in Levenberg-Marquardt method ?

Remember that∇S(θ) = 2r⁰(θ)^>r(θ), hence LM iteration reads θ_k₊₁=θ_k −¹₂ r⁰(θk)^>r⁰(θk) +λI−1

∇S(θ_k),

=θ_k −_2λ¹ _λ¹r⁰(θ_k)^>r⁰(θ_k) +I−1

∇S(θ_k)

I Whenλis small, LM methods behaves more like the Gauss-Newton method.

I Whenλis large, LM methods behaves more like the gradient method.

λallows to balance between speed (λ= 0) and robustness (λ→ ∞)

(46)

Example 1

Consider data (x_i,y_i) to be fitted by the non linear modelfθ(x) = exp(θ₁+θ₂x) :

The Jacobian ofr(θ) is given by

r⁰(θ) =







exp(θ₁+θ₂x₁) x₁exp(θ₁+θ₂x₁)

... ...

exp(θ₁+θ₂x_n) x_nexp(θ₁+θ₂x₁)







(47)

Example 1

In Scilab, use thelsqrsolveorleastsqfunction:

θˆ= (0.981, -2.905)

function r=resid(theta,n) r=exp(theta(1)+theta(2)*x)-y;

endfunction

function j=jac(theta,n) e=exp(theta(1)+theta(2)*x);

j=[e x.*e];

endfunction load data_exp.dat theta0=[0;0];

theta=lsqrsolve(theta0,resid,length(x),jac);

plot(x,y,"ob", x,exp(theta(1)+theta(2)*x),"r")

(48)

Example 2

Enzymatic kinetics

s⁰(t) =θ₂ s(t)

s(t) +θ₃,t >0, s(0) =θ₁,

yi = measurement ofsat timeti

S(θ) =kr(θ)k², r_i(θ) = y_i−s(t_i) σ_i

Individual weightsσ_i allow to take into account different standard deviations of measurements

(49)

Example 2

In Scilab, use thelsqrsolveorleastsqfunction

θˆ= (887.9, 37.6, 97.7)

function dsdt=michaelis(t,s,theta) dsdt=theta(2)*s/(s+theta(3)) endfunction

function r=resid(theta,n) s=ode(theta(1),0,t,michaelis) r=(s-y)./sigma

endfunction

load michaelis_data.dat theta0=[y(1);20;80];

theta=lsqrsolve(theta0,resid,n)

If not provided, the Jacobianr⁰(θ) is approximated by finite differences (but true Jacobian always speed up convergence).

(50)

Take home message

Solving non linear least squares problems is not that difficult with adequate software and good starting values

(51)

2 Maths reminder

5 Statistical evaluation of solutions

6 Model selection

(52)

Motivation

Since the data (yi)i=1...nis a sample of random variables, then ˆθtoo !

Confidence intervals for ˆθcan be easily obtained by at least two methods

I Monte-Carlo method : allows to estimate the distribution of ˆθbut needs thousands of resamplings

I Linearized statistics : very fast, but can be very approximate for high level of measurement error

(53)

Monte Carlo method

The Monte Carlo method is aresamplingmethod, i.e. works by generating new samples of synthetic measurement and redoing the estimation of ˆθ. Here model is

y =θ₁+θ₂x+θ₃x², and data is corrupted by noise withσ= ¹₂

(54)

Monte Carlo method

θ₁ θ₂ θ₃

At confidence level=95%,

θˆ₁∈[0.99,1.29], θˆ₂∈[−1.20,−0.85], θˆ₁∈[−2.57,−1.91].

(55)

Linearized Statistics

Define the weighted residualr(θ) by

ri(θ) = yi−fθ(xi) σ_i , whereσ_i is the standard deviation ofyi.

The covariance matrix of ˆθcan be approximated by V( ˆθ) =F( ˆθ)⁻¹ whereF( ˆθ) is the Fisher Information Matrix, given by

F(θ) =r⁰(θ)^>r⁰(θ) For example, whenσ_i =σfor alli, in LLS problems

V( ˆθ) =σ²A^>A

(56)

Linearized Statistics

θˆ= (887.9, 37.6, 97.7)

d=derivative(resid,theta) V=inv(d’*d)

sigma_theta=sqrt(diag(V)) // 0.975 fractile Student dist.

t_alpha=cdft("T",m-3,0.975,0.025);

thetamin=theta-t_alpha*sigma_theta thetamax=theta+t_alpha*sigma_theta

At 95% confidence level

θˆ₁∈[856.68,919.24], θˆ₂∈[34.13,41.21], θˆ₃∈[93.37,102.10].

(57)

2 Maths reminder

6 Model selection

(58)

Motivation : which model is the best ?

(59)

Motivation : which model is the best ?

On the previous slide data has been fitted with the model y =

p

X

k=0

θ_kx^k, p= 0. . .8,

ConsiderS( ˆθ) as a function of model orderpdoes not help much

p S( ˆθ) 0 3470.32 1 651.44 2 608.53 3 478.23 4 477.78 5 469.20 6 461.00 7 457.52 8 448.10

(60)

Validation

Validation is the key of model selection :

1 Define two sets of data

I T ⊂ {1, . . .n}for model training

I V={1, . . .n} \Tfor validation

2 For each value of model orderp

I Compute the optimal parameters with the training data θˆp= arg min

θ∈R^p

X

i∈T

(yi−fθ(xi))²

I Compute the validation residual

SV(θˆp) =X

i∈V

(yi−fθˆp(xi))²

(61)

Training + Validation

(62)

Training + Validation

Validation helps a lot: here the best model order is clearlyp= 3 !

p S_V( ˆθ_p) 0 11567.21 1 2533.41 2 2288.52 3 259.27 4 326.09 5 2077.03 6 6867.74 7 26595.40 8 195203.35

(63)

Take home message

Always evaluate your models by either computing confidence intervals for the parameters or by using validation.