HAL Id: hal-02793398
https://hal.inrae.fr/hal-02793398
Submitted on 5 Jun 2020
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Practical cases and issues related to model fitting
Laurent Saint-André, Gael Sola, Matieu Henry, Nicolas Picard
To cite this version:
Laurent Saint-André, Gael Sola, Matieu Henry, Nicolas Picard. Practical cases and issues related to model fitting. Training Workshop on Tree Allometric Equations, May 2014, Colombo, Sri Lanka. pp.41 slides. ⟨hal-02793398⟩
Practical cases and issues related to model fitting

Dr. Laurent Saint-André, Gael Sola, Matieu Henry, Nicolas Picard
Step by Step

• Exploratory stage: getting a model for each compartment and each stratum (local model)
- What variable is to be used as input data? Or what combination of variables is to be used as input data?
- What is the form of the relationship with each of these variables?
• Aggregation stage: getting a model for each compartment, all strata pooled together (global model)
- What are the relationships between the parameters of the local models and the strata characteristics?
- What is the form of this relationship for each parameter?
• Fitting of the complete model: one system of equations for all compartments and all strata
Linear Models
Linear regression: Principle
The model is written as follows:

$Y_i = a + b \cdot X_i + \varepsilon_i$

where $\varepsilon_i$ is the residual variation not explained by the model, and a and b are the parameters of the model.

Fitting this equation consists in estimating the parameters a and b. Usually we use the least squares method, which consists in finding the values of a and b that minimize the sum of squared errors:

$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( Y_i - a - b \cdot X_i \right)^2$
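The least squares fit above has a closed-form solution; a minimal sketch in Python with numpy, on synthetic illustration data (not the workshop data set):

```python
import numpy as np

# Least-squares fit of Y = a + b*X, as described above.
# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = np.linspace(1.0, 10.0, 30)
Y = 2.0 + 0.7 * X + rng.normal(0.0, 0.3, X.size)

Xbar, Ybar = X.mean(), Y.mean()
b = np.sum((Y - Ybar) * (X - Xbar)) / np.sum((X - Xbar) ** 2)
a = Ybar - b * Xbar

sse = np.sum((Y - a - b * X) ** 2)  # minimized sum of squared errors
```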
Linear Models
Linear regression: Principle

a and b are two random variables. The covariance between a and b is not null (meaning that the parameters of a given equation are not independent). An unbiased estimation of this covariance is given by:

$\widehat{\mathrm{cov}}(\hat{a}, \hat{b}) = \frac{-\bar{X}\, s^2}{\sum_i (X_i - \bar{X})^2}$

The least squares estimators are:

$\hat{b} = \frac{\sum_i (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_i (X_i - \bar{X})^2}, \qquad \hat{a} = \bar{Y} - \hat{b}\,\bar{X}$

with the residual variance $s^2 = \frac{1}{n-2} \sum_i \left( Y_i - \hat{a} - \hat{b} X_i \right)^2$.

The standard deviations of a and b are given by:

$ect(\hat{b}) = \sqrt{\frac{s^2}{\sum_i (X_i - \bar{X})^2}}, \qquad ect(\hat{a}) = \sqrt{s^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2} \right)}$

and their confidence intervals by:

$\hat{b} \pm t(n-2,\, p/2)\, ect(\hat{b}), \qquad \hat{a} \pm t(n-2,\, p/2)\, ect(\hat{a})$

Usually p = 0.05, to get the parameter values at the 95% confidence level.
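The standard errors and confidence intervals can be computed the same way; a sketch with numpy and scipy.stats, again on synthetic data:

```python
import numpy as np
from scipy import stats

# Standard errors and 95% confidence intervals for a and b,
# using the closed-form least squares fit. Synthetic data.
rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 30)
Y = 2.0 + 0.7 * X + rng.normal(0.0, 0.3, X.size)
n = X.size

Sxx = np.sum((X - X.mean()) ** 2)
b = np.sum((Y - Y.mean()) * (X - X.mean())) / Sxx
a = Y.mean() - b * X.mean()

s2 = np.sum((Y - a - b * X) ** 2) / (n - 2)       # residual variance
se_b = np.sqrt(s2 / Sxx)
se_a = np.sqrt(s2 * (1.0 / n + X.mean() ** 2 / Sxx))

t = stats.t.ppf(1 - 0.025, n - 2)                 # t(n-2, p/2), p = 0.05
ci_b = (b - t * se_b, b + t * se_b)
ci_a = (a - t * se_a, a + t * se_a)
```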
Linear Models

Dep Var: Y   N: 13   Multiple R: 0.828991   Squared multiple R: 0.687225
Adjusted squared multiple R: 0.658791   Standard error of estimate: 0.351724

Effect     Coefficient   Std Error   Std Coef   Tolerance   t         P(2 Tail)
CONSTANT   0.063555      0.378512    0.000000   .           0.16791   0.86970
X          0.704030      0.143206    0.828991   1.000000    4.91621   0.00046

Effect     Coefficient   Lower <95%>   Upper
CONSTANT   0.063555      -0.769544     0.896655
X          0.704030      0.388836      1.019223

Analysis of Variance
Source   Sum-of-Squares   df   Mean-Square   F-ratio   P

Reading the output:
- Coefficient of determination R² (the adjusted R² is the better indicator)
- Values of the parameters: a = intercept, b = the coefficient of X
- Standard deviations of the parameters and their confidence intervals
Linear regression: Analysis of variance
Linear Models
Can we do a linear regression?

[Figures: scatter plots of Y against X, followed by the same data after transformation, asking in each case whether a linear regression can be applied:]
- $Y' = \ln Y$, $X' = \ln X$ (plot of ln Y against ln X)
- $Y' = \ln Y$, $X' = X$ (plot of ln Y against X)
- $Y' = \ln(Y/(1-Y))$, $X' = X$ (plot of ln(Y/(1−Y)) against X)

Linear Models
Why transform the data?

• Power equation: $Y = a\, X^{b}\, e^{\varepsilon}$, i.e. $\ln Y = a' + b \ln X + \varepsilon$ (with $a' = \ln a$)
• Exponential model: $Y = a\, e^{bX + \varepsilon}$, i.e. $\ln Y = a' + b X + \varepsilon$

It is always worthwhile to obtain a linear relationship, because the solution is then explicit. Sometimes the transformation also stabilizes the variance. But this is not always possible, and it may not suit the data set... so try and see!
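As a sketch of the power-equation case, the model can be fitted by ordinary least squares on the log-transformed variables (synthetic data; Python chosen for illustration):

```python
import numpy as np

# Linearizing a power model Y = a * X**b * exp(eps):
# ln Y = ln a + b * ln X + eps, then ordinary least squares on the logs.
# Synthetic data for illustration.
rng = np.random.default_rng(2)
X = np.linspace(1.0, 20.0, 50)
Y = 1.5 * X ** 0.8 * np.exp(rng.normal(0.0, 0.05, X.size))

lnX, lnY = np.log(X), np.log(Y)
b, ln_a = np.polyfit(lnX, lnY, 1)   # slope = b, intercept = ln a
a = np.exp(ln_a)
```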
Linear Models
Are the following equations linear, or can they be transformed to get a linear equation?

[Quiz slide listing seven candidate model forms, among them $Y = bX\varepsilon$ and $Y = b\,e^{X}\varepsilon$, with worked answers such as:]

$\ln Y = \ln(bX\varepsilon) = \ln b + \ln X + \ln\varepsilon$ (yes, linear after a log transformation)

$\ln Y = \ln b + X + \ln\varepsilon$ (yes, linear after a log transformation)
Non-Linear Models
Non-linear regression: Principle

For linear models, the solution is explicit because the derivative of the model with respect to each parameter does not depend on the parameters of the equation. For non-linear models this is not the case: the derivatives depend on the parameters, and solving the system directly is too difficult. It is then necessary to use alternative methods.

Example with the exponential model $Y = a\,e^{bX} + \varepsilon$:

$SS_{res} = \sum_{i=1}^{n} \left( Y_i - a\,e^{b X_i} \right)^2$

Setting the derivatives with respect to a and b to zero gives:

$\sum_{i=1}^{n} \left( Y_i - a\,e^{b X_i} \right) e^{b X_i} = 0$

$\sum_{i=1}^{n} \left( Y_i - a\,e^{b X_i} \right) a\, X_i\, e^{b X_i} = 0$

Both normal equations contain the parameters inside the exponential, so there is no closed-form solution.

Non-Linear Models
Non-linear regression: Principle

To fit a non-linear model, it is necessary to proceed by iterations, which means that initial values must be given to the parameters. When the least squares method is used, the sum of squared errors (SSE) is calculated at each step (i.e. for each new set of parameter estimates). If the procedure is efficient, this SSE decreases at each step; at the end of the process, when the decrease becomes negligible, the model is said to have converged.

The most widely used iterative procedure is the Gauss-Newton one, but many other procedures are available. When there are problems in fitting a model, it is recommended to test several methods (e.g. fractional iteration, Marquardt).
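A sketch of such an iterative fit, using scipy.optimize.curve_fit (a Levenberg-Marquardt-type least squares solver) on a synthetic exponential data set; the p0 argument supplies the initial parameter values discussed above:

```python
import numpy as np
from scipy.optimize import curve_fit

# Iterative least squares fit of the exponential model Y = a * exp(b*X),
# with explicit initial values (p0). Synthetic data for illustration.
rng = np.random.default_rng(3)
X = np.linspace(0.0, 5.0, 40)
Y = 2.0 * np.exp(0.5 * X) + rng.normal(0.0, 0.5, X.size)

def model(x, a, b):
    return a * np.exp(b * x)

# Initial values matter: a bad guess can stop the search in a local minimum.
params, cov = curve_fit(model, X, Y, p0=[1.0, 0.1])
a_hat, b_hat = params
```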
Non-Linear Models
Non-linear regression: Principle

[Figure: iso-value curves (contours at 40, 60, 80, 100) of the sum of squared errors in the (b1, b2) parameter plane, with the successive iterations leading from the initial value of b1 to its final value; a graphical view of model convergence with two parameters.]
Non-Linear Models
Non-linear regression: Principle

[Figure: the same SSE iso-value curves with two different sets of starting points. With the first initial values, the algorithm falls into a local minimum of the SSE (between 20 and 40); with the second, it reaches the absolute minimum (below 20).]

Importance of the initial values given to the parameters.
Dependent variable is Y

Source           Sum-of-Squares   df   Mean-Square
Regression       1.79138E+04      3    5971.269443
Residual         6.711670         20   0.335583
Total            1.79205E+04      23
Mean corrected   3284.344348      22

Raw R-square (1-Residual/Total) = 0.999625
Mean corrected R-square (1-Residual/Corrected) = 0.997956
R(observed vs predicted) square = 0.997965

                                           Wald Confidence Interval
Parameter   Estimate    A.S.E.     Param/ASE   Lower <95%>   Upper
B1          40.269815   0.584758   68.865757   39.050031     41.489600
B2          0.029815    0.001760   16.941467   0.026144      0.033486
B3          1.454754    0.078017   18.646595   1.292013      1.617495

Asymptotic Correlation Matrix of Parameters
      B1          B2         B3
B1    1.000000
B2   -0.910171    1.000000
B3   -0.756906    0.939698   1.000000

Reading the output: the mean-corrected R² should be used; the values of the parameters and their confidence intervals are given in the Wald table.

Non-linear regression: Analysis of variance
Goodness of fit
How to assess the goodness of fit (for linear and non-linear models)?

• R² and the plot of Y against predicted Y
R² is an index of fit, to be used cautiously (see below); maximum value = 1, minimum value = 0.
• Values of the parameters and their confidence intervals
Useful for identifying convergence problems; usually the standard error should not exceed 10% of the parameter value.
• Correlations between parameters
If the correlations are too high, transform the variables or change the model equation.
• The RMSE (root mean square error, or residual standard error)
Gives the error dispersion, to be compared to the average measured values. Usually the model is satisfactory when the RMSE is less than 10% of the measured values.
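The indices above can be computed directly; a small sketch with numpy, where y_obs and y_pred are illustrative placeholders rather than real measurements:

```python
import numpy as np

# Goodness-of-fit indices for a fitted model: R2, RMSE, and RMSE as a
# fraction of the mean observed value. Placeholder data for illustration.
y_obs = np.array([10.0, 12.5, 15.2, 18.1, 21.0, 24.3])
y_pred = np.array([10.4, 12.1, 15.6, 17.8, 21.5, 23.9])

resid = y_obs - y_pred
rmse = np.sqrt(np.mean(resid ** 2))
r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
rel_rmse = rmse / y_obs.mean()   # model judged satisfactory if < 10%
```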
Goodness of fit
How to assess the goodness of fit (for linear and non-linear models)?

Example of normally distributed errors, to be verified with statistical tests (e.g. D'Agostino et al., 1990) and quantile plots.

[Figure: example of errors with ...]

[Figure: error of a linear model when the appropriate model is in fact ...]

Do not listen to the siren song of R²!
Heteroscedasticity
How to deal with heteroscedasticity?

[Figure: scatter plot of Y against X with the error dispersion increasing with X.]

Transformation of the variables, to get a linear model with homogeneous variance. Starting from

$Y = aX + b + \varepsilon$, with $\varepsilon \sim N(0, \sigma X)$,

set $Y' = Y/X$, $X' = 1/X$, $\varepsilon' = \varepsilon/X$, which gives

$Y' = a + b X' + \varepsilon'$, with $\varepsilon' \sim N(0, \sigma)$.

This is equivalent to performing a weighted linear regression.
Heteroscedasticity
How to deal with heteroscedasticity?

More generally, the weighted regression consists in minimizing:

$\sum_{i=1}^{n} w_i \left( Y_i - Y_{pred,i} \right)^2$

where $w_i$ is the weight of observation i, usually $w_i = 1/\sigma_i^2$ with $\sigma_i^2 \propto X_i^z$. The whole challenge consists in finding the appropriate value of z.

[Figure: residual plots for $z = z_{optimum}$, $z < z_{optimum}$ and $z > z_{optimum}$.]
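A weighted fit of this kind can be sketched by solving the weighted normal equations; synthetic heteroscedastic data, with z assumed equal to 2 (standard deviation proportional to X):

```python
import numpy as np

# Weighted linear regression minimizing sum_i w_i * (Y_i - a - b*X_i)**2,
# with w_i = 1 / X_i**z. Synthetic data where sd(eps) = 0.2*X, i.e.
# variance proportional to X**2, so z is taken as 2.
rng = np.random.default_rng(4)
X = np.linspace(1.0, 10.0, 100)
Y = 1.0 + 2.0 * X + rng.normal(0.0, 0.2 * X)

z = 2.0
w = 1.0 / X ** z

# Weighted normal equations for Y = a + b*X.
A = np.column_stack([np.ones_like(X), X])
beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * Y))
a_w, b_w = beta
```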
Heteroscedasticity
How to deal with heteroscedasticity?

First option: a rough and simple method that can be used if there are enough data, with $w_i \propto 1/X_i^z$
• Step 1 = split the variable X into k classes centred on $X_k$
• Step 2 = calculate the variance $\sigma_k^2$ of Y within each class
• Step 3 = linear regression of $\log \sigma_k^2$ against $\log X_k$
The slope of this regression is z, which is often rounded to 1 or 2.
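The binning method can be sketched as follows, on synthetic data where the variance is proportional to X² (so z should come out near 2). The preliminary detrending step is an assumption added here, so that the within-class variances reflect the errors rather than the trend:

```python
import numpy as np

# Binning estimate of z: split X into classes, compute the residual
# variance within each class, regress log(variance) on log(class centre).
# Synthetic data with variance proportional to X**2 (true z = 2).
rng = np.random.default_rng(5)
X = rng.uniform(1.0, 20.0, 5000)
Y = 1.0 + 2.0 * X + rng.normal(0.0, 0.5 * X)

# Preliminary OLS fit to remove the trend before computing variances
# (an added assumption, not stated on the slide).
b, a = np.polyfit(X, Y, 1)
resid = Y - (a + b * X)

edges = np.linspace(1.0, 20.0, 11)            # 10 classes
centres = 0.5 * (edges[:-1] + edges[1:])
var_k = np.array([resid[(X >= lo) & (X < hi)].var()
                  for lo, hi in zip(edges[:-1], edges[1:])])

z_hat, _ = np.polyfit(np.log(centres), np.log(var_k), 1)  # slope gives z
```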
Heteroscedasticity
How to deal with heteroscedasticity?

Second option: fitting z iteratively, with $w_i \propto 1/X_i^z$
• Step 1 = fit the weighted model with z fixed at a given value (often 0 at the beginning)
• Step 2 = calculate the Furnival index: $FI = \mathrm{antilog}\!\left( \frac{1}{n} \sum_{k=1}^{n} \log X_k \right) \cdot RMSE$
• Step 3 = go back to step 1, increasing z

Heteroscedasticity
How to deal with heteroscedasticity?

Third option: fitting z together with the other parameters of the model, with $w_i \propto 1/X_i^z$
• Fit by maximum likelihood instead of the least squares method, with a model for the mean ($\mu_i$) and a model for the variance ($\sigma^2 X_i^z$), maximizing the log-likelihood:

$ML = -\frac{1}{2} \sum_i \left[ \log\!\left( 2\pi\, \sigma^2 X_i^z \right) + \frac{(Y_i - \mu_i)^2}{\sigma^2 X_i^z} \right]$
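A sketch of this maximum-likelihood option, minimizing the negative of the log-likelihood above with scipy.optimize.minimize on synthetic data (mean model a + bX, variance model σ²·X^z, true z = 2); starting z at 0 echoes the iterative option:

```python
import numpy as np
from scipy.optimize import minimize

# Joint ML fit of the mean model (a + b*X) and the variance model
# (sigma**2 * X**z). Synthetic data with a=1, b=2, sd = 0.3*X (z = 2).
rng = np.random.default_rng(6)
X = rng.uniform(1.0, 20.0, 2000)
Y = 1.0 + 2.0 * X + rng.normal(0.0, 0.3 * X)

def nll(theta):
    a, b, log_s2, z = theta
    var = np.exp(log_s2) * X ** z
    return 0.5 * np.sum(np.log(2.0 * np.pi * var)
                        + (Y - a - b * X) ** 2 / var)

# Start from an unweighted OLS fit, with z = 0 at the beginning.
b0, a0 = np.polyfit(X, Y, 1)
s2_0 = np.mean((Y - a0 - b0 * X) ** 2)
fit = minimize(nll, x0=[a0, b0, np.log(s2_0), 0.0],
               method="Nelder-Mead", options={"maxiter": 5000})
a_ml, b_ml, log_s2_ml, z_ml = fit.x
```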
Model Choice
How to choose between models?

• If the number of parameters is the same in model 1 and model 2, use the sum of squared errors (SSE): the lowest SSE indicates the best model.
• If the number of parameters differs, check whether the two models are nested or not. For nested models, use an F test based on the SSE of the two models. With $p_1 > p_2$:

$F_{obs} = \frac{(SSE_2 - SSE_1)/(p_1 - p_2)}{SSE_1/(n - p_1)}$

If $F_{obs} > F_{tab}(p_1 - p_2,\, n - p_1)$, then model 1 (the one with more parameters) is the more appropriate.

Examples:
- $Y = a + bD + \varepsilon$ is nested in $Y = a + bD + cD^2H + \varepsilon$
- $Y = a + bD^2H + \varepsilon$ is nested in $Y = a + bD + cD^2H + \varepsilon$
- $Y = a + bD + \varepsilon$ is not nested in $Y = a + cD^2H + \varepsilon$
Advice: decision tree for model selection

- Same dependent variable?
  - No: Furnival index
  - Yes: same number of parameters?
    - Yes: sum of squared errors
    - No: are the models nested?
      - Yes: F test
      - No: Akaike information criterion

Model Choice
How to choose between models?

Don't use the R², because it increases automatically as the number of parameters grows: compare, for example,

$Y = a + bD + cD^2$ and $Y = a + bD + cD^2 + \ldots + kD^5$
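The point is easy to verify: refitting polynomials of increasing degree to the same data never decreases R². A sketch with numpy on synthetic data:

```python
import numpy as np

# R2 increases mechanically with the number of parameters: polynomials
# of increasing degree fitted to the same data give a non-decreasing R2,
# even when the extra terms explain nothing real.
rng = np.random.default_rng(7)
X = np.linspace(0.0, 10.0, 25)
Y = 1.0 + 2.0 * X + rng.normal(0.0, 1.0, X.size)

def r_squared(degree):
    coeffs = np.polyfit(X, Y, degree)
    resid = Y - np.polyval(coeffs, X)
    return 1.0 - np.sum(resid ** 2) / np.sum((Y - Y.mean()) ** 2)

r2_by_degree = [r_squared(d) for d in range(1, 6)]
```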
Step by Step

• Exploratory stage: getting a model for each compartment and each stratum (local model)
- What variable is to be used as input data? Or what combination of variables is to be used as input data?
- What is the form of the relationship with each of these variables?
• Aggregation stage: getting a model for each compartment, all strata pooled together (global model)
- What are the relationships between the parameters of the local models and the strata characteristics?
- What is the form of this relationship for each parameter?
• Fitting of the complete model: one system of equations for all compartments and all strata
Aggregation
Example: Eucalyptus in Congo (Saint-André et al. 2005)
[Figure: leaf biomass (kg DM tree⁻¹) plotted against D²H (m³) for the stand groups GP1, 2 (11-30 months), GP3A, 3B (50-75 months) and GP3C, 3D (135 months), with measured (DMLeaves) and estimated (DMLeavesEst) values; one model parameter shows no variation with stand age, another decreases exponentially with stand age.]

Fitted age by age, then analysis of the parameter variations with stand age:

$LeafBiomass = a + b\, D^2 H + \varepsilon$
Aggregation
[Table: for each compartment (F1 total, F2 aboveground, F3 belowground, F4 leaves, F5 dead branches, F6 living branches, F7 bark, F8 stem, F9 stump, F10 coarse roots, F11 medium roots, F12 fine roots), the stands used for calibration (G1 to G3D, V1 to V9) and the fitted models for the expectation and for the variance, all power functions of $r_{1.3}^2 h$, with age-dependent coefficients for several aboveground compartments.]
The age effect was significant for most of the compartments; we thus obtain a set of equations that can be used whatever the stand age (within the range of the calibration data set, 11 to 135 months).

Example: Eucalyptus in Congo (Saint-André et al. 2005)
[Figure: parameter b (dimensionless) plotted against age (years) for Eucalyptus in Congo, beech in France, and Eucalyptus in Brazil.]

Not only do eucalyptus and fagus show the same pattern, they also follow the same line! (especially for stem wood and branches)

Example: Fagus in France (Genet et al. 2011)
Step by Step

• Exploratory stage: getting a model for each compartment and each stratum (local model)
- What variable is to be used as input data? Or what combination of variables is to be used as input data?
- What is the form of the relationship with each of these variables?
• Aggregation stage: getting a model for each compartment, all strata pooled together (global model)
- What are the relationships between the parameters of the local models and the strata characteristics?
- What is the form of this relationship for each parameter?
• Fitting of the complete model: one system of equations for all compartments and all strata
Taking all compartments into account

The equations were fitted simultaneously, to take cross-compartment correlations into account. This step is important when one wants to simulate biomass estimates with confidence intervals.

• The outputs of SUR (seemingly unrelated regressions) are:
1. Values of the parameters and their confidence intervals
2. Correlation matrix of the parameters (within and between compartments)
3. Residual errors for each compartment
4. Correlation matrix of the errors (between compartments)
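A minimal two-step SUR sketch (equation-by-equation OLS, estimation of the cross-equation residual covariance, then feasible generalized least squares on the stacked system), on synthetic data for two hypothetical compartments; this illustrates the mechanics, not the original fitting code:

```python
import numpy as np
from scipy.linalg import block_diag

# Two-step SUR sketch for two hypothetical compartments whose errors
# are correlated across equations. Synthetic data only.
rng = np.random.default_rng(8)
n = 200
x = rng.uniform(1.0, 10.0, n)
e = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=n)
y1 = 1.0 + 2.0 * x + e[:, 0]          # compartment 1: linear in x
y2 = 0.5 + 0.3 * x ** 2 + e[:, 1]     # compartment 2: linear in x**2

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([np.ones(n), x ** 2])

# Step 1: equation-by-equation OLS, then cross-equation error covariance.
b1 = np.linalg.lstsq(X1, y1, rcond=None)[0]
b2 = np.linalg.lstsq(X2, y2, rcond=None)[0]
R = np.column_stack([y1 - X1 @ b1, y2 - X2 @ b2])
Sigma = R.T @ R / n

# Step 2: feasible GLS on the stacked system, covariance kron(Sigma, I).
Xs = block_diag(X1, X2)
ys = np.concatenate([y1, y2])
Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))
beta = np.linalg.solve(Xs.T @ Omega_inv @ Xs, Xs.T @ Omega_inv @ ys)
# beta stacks [a1, b1, a2, b2], with cross-equation information pooled
```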