HAL Id: hal-02803674
https://hal.inrae.fr/hal-02803674
Submitted on 5 Jun 2020
Practical cases and issues related to model fitting
Nicolas Picard, Laurent Saint-André, Matieu Henry
To cite this version:
Nicolas Picard, Laurent Saint-André, Matieu Henry. Practical cases and issues related to model fitting. National technical staff training on allometric equations (ae) under the REDD Programme, UN-Reducing Emissions from Deforestation and forest Degradation (UN-REDD). Genève, CHE.; Food and Agriculture Organization (FAO). ITA.; Programme des Nations Unies pour l’Environnement (UNEP). FRA.; Institut National de la Recherche Agronomique (INRA). FRA.; Centre de Coopération Internationale en Recherche Agronomique pour le Développement (CIRAD). FRA., Jun 2012, Hanoi, Vietnam. 35 p. hal-02803674
Pham Cuong, Inoguchi Akiko
Hanoi, June 18-22, 2012
Authors: N. Picard (CIRAD), L. Saint-André (CIRAD - INRA), and M. Henry (FAO)
Step by Step
Exploratory stage: getting a model for each compartment and each stratum (local model)
What variable, or what combination of variables, is to be used as input data?
What is the form of the relationship with each of the variables?
Aggregation stage: getting a model for each compartment, all strata pooled together (global model)
What are the relationships between the parameters of the local models and the strata characteristics?
What is the form of this relationship for each parameter?
Fitting of the complete model: one system of equations for all compartments and all strata
Linear Models
Linear regression: Principle
The model is written as follows:
$$Y_i = a + b\,X_i + \varepsilon_i$$
where $a$ and $b$ are the parameters of the model and $\varepsilon_i$ is the residual variation not explained by the model.
Fitting this equation consists in estimating the parameters $a$ and $b$.
Usually, we use the least squares method, which consists in finding the parameters $a$ and $b$ that minimize the sum of squared errors:
$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - a - b\,X_i)^2$$
The least squares estimates are:
$$b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \qquad a = \bar{Y} - b\,\bar{X}$$
The estimates $a$ and $b$ are two random variables.
The covariance between $a$ and $b$ is not null (meaning that the parameters of a given equation are not independent).
An unbiased estimate of the residual variance is given by:
$$s^2 = \frac{\sum_i (Y_i - a - b\,X_i)^2}{n - 2}$$
The standard deviations (sd) of $a$ and $b$ are given by:
$$\mathrm{sd}(b) = \sqrt{\frac{s^2}{\sum_i (X_i - \bar{X})^2}}, \qquad \mathrm{sd}(a) = \sqrt{s^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum_i (X_i - \bar{X})^2} \right)}$$
And their confidence intervals by:
$$b \pm t(n-2,\, p/2)\,\mathrm{sd}(b), \qquad a \pm t(n-2,\, p/2)\,\mathrm{sd}(a)$$
Usually p = 0.05, which gives the 95% confidence interval of each parameter.
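As an illustration of the formulas above, here is a minimal Python sketch of the explicit least squares solution. The function name `fit_line`, the data and the way the Student quantile is passed in (`t_crit`, read from a table by the user) are assumptions made for this example, not part of the original training material.

```python
import math

def fit_line(x, y, t_crit):
    """Least squares fit of Y = a + b*X, with standard deviations and
    confidence intervals of a and b.

    t_crit is the Student quantile t(n-2, p/2), supplied by the caller
    (for example 3.182 for n = 5 and p = 0.05).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Unbiased estimate of the residual variance (n - 2 degrees of freedom)
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    sd_b = math.sqrt(s2 / sxx)
    sd_a = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return {
        "a": a, "b": b, "s2": s2, "sd_a": sd_a, "sd_b": sd_b,
        "ci_a": (a - t_crit * sd_a, a + t_crit * sd_a),
        "ci_b": (b - t_crit * sd_b, b + t_crit * sd_b),
    }

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
fit = fit_line(x, y, t_crit=3.182)  # t(3, 0.025)
```

Passing the quantile as an argument keeps the sketch free of external libraries; a statistical package would normally compute it.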
Linear regression: Analysis of variance
Example of output:

Dep Var: Y   N: 13   Multiple R: 0.828991   Squared multiple R: 0.687225
Adjusted squared multiple R: 0.658791   Standard error of estimate: 0.351724

Effect      Coefficient   Std Error   Std Coef   Tolerance   t         P(2 Tail)
CONSTANT    0.063555      0.378512    0.000000   .           0.16791   0.86970
X           0.704030      0.143206    0.828991   1.000000    4.91621   0.00046

Effect      Coefficient   Lower < 95%>   Upper
CONSTANT    0.063555      -0.769544      0.896655
X           0.704030      0.388836       1.019223

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio     P
Regression   2.989959         1    2.989959      24.169106   0.000460
Residual     1.360810         11   0.123710

Reading the output:
R2 (squared multiple R) is the coefficient of determination; the adjusted one is better.
Values of the parameters: CONSTANT is a (the intercept), X is b (the parameter linked to X).
Std Error gives the standard deviations of the parameters; Lower and Upper give their confidence intervals.
In the analysis of variance, the regression sum of squares is $\sum_i (\hat{Y}_i - \bar{Y})^2$ and the residual sum of squares is $\sum_{i=1}^{n} \varepsilon_i^2$.
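The summary statistics of such an output can be recomputed from the two sums of squares of the analysis of variance. The sketch below does so for the values shown above; the function name `anova_summary` is an assumption made for the illustration.

```python
import math

def anova_summary(ss_regression, ss_residual, n, p=2):
    """R2, adjusted R2, F ratio and standard error of estimate computed
    from the regression and residual sums of squares.

    n is the number of observations, p the number of fitted parameters
    (2 for a straight line: intercept and slope).
    """
    ss_total = ss_regression + ss_residual
    r2 = ss_regression / ss_total
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    ms_regression = ss_regression / (p - 1)
    ms_residual = ss_residual / (n - p)
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "f_ratio": ms_regression / ms_residual,
        "see": math.sqrt(ms_residual),  # standard error of estimate
    }

# Sums of squares taken from the output above (N = 13)
out = anova_summary(2.989959, 1.360810, n=13)
```

The adjusted R2 penalizes the number of parameters, which is why it is preferred to the raw R2.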
Can we do a linear regression?
[Figure: scatter plot of Y against X.]
[Figure: Y against X, then ln Y against ln X after the transformation Y' = ln Y, X' = ln X (two examples).]
[Figure: Y against X, then ln Y against X after the transformation Y' = ln Y, X' = X.]
[Figure: Y against X, then ln(Y/(1-Y)) against X after the transformation Y' = ln(Y/(1-Y)), X' = X.]
Why transform the data?
Power equation:
$$Y = a\,X^{b}\exp(\varepsilon) \quad\Longrightarrow\quad \ln(Y) = a' + b\,\ln(X) + \varepsilon, \qquad a' = \ln(a)$$
Exponential model:
$$Y = a\,\exp(b\,X + \varepsilon) \quad\Longrightarrow\quad \ln(Y) = a' + b\,X + \varepsilon, \qquad a' = \ln(a)$$
It is always useful to get a linear relationship, because the solution is explicit.
Sometimes the transformation also stabilizes the variance.
But this is not always possible, and the transformed model may not suit the data set. So try and see!
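A minimal sketch of the log transformation for the power equation: the model is fitted as a straight line on ln X and ln Y, then the intercept is transformed back. The function name `fit_power_model` and the noise-free data are hypothetical, chosen so that the result is easy to check.

```python
import math

def fit_power_model(x, y):
    """Fit Y = a * X**b * exp(eps) by linear regression of ln(Y) on ln(X).

    Returns (a, b), where a = exp(a') and a' is the intercept of the
    linear model ln(Y) = a' + b * ln(X) + eps.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a_prime = my - b * mx
    return math.exp(a_prime), b

# Hypothetical noise-free data generated with a = 2 and b = 1.5
x = [1.0, 2.0, 4.0, 8.0]
y = [2.0 * v ** 1.5 for v in x]
a, b = fit_power_model(x, y)
```

The same helper works for the exponential model if ln Y is regressed on X instead of ln X.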
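Are the following equations linear, or can they be transformed to get a linear equation?
$Y = b\,X + \varepsilon$
$Y = b\,X\,\varepsilon$, for which $\ln Y = \ln(b\,X\,\varepsilon) = \ln b + \ln X + \ln(\varepsilon)$
$Y = b\,X\exp(\varepsilon)$
$Y = b\,X^{2} + \varepsilon$
$Y = X^{b}\exp(\varepsilon)$
$Y = X^{b} + \varepsilon$
$Y = b\,X + c\,X^{2} + \varepsilon$: yes, but with two highly correlated variables
Non-Linear Models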
Non-Linear regression: Principle
For linear models, the solution is explicit because the derivative of the model with respect to each parameter does not depend on the parameters of the equation.
For non-linear models, this is not the case: the derivatives depend on the parameters. Solving the system explicitly is too difficult, so it is necessary to use alternative methods.
Example:
$$Y = \alpha\,e^{\beta X} + \varepsilon$$
$$SS_{res} = \sum_{i=1}^{n} \left( Y_i - \alpha\,e^{\beta X_i} \right)^2$$
Setting the derivatives of $SS_{res}$ with respect to each parameter to zero gives, for the estimates $a$ and $b$ of $\alpha$ and $\beta$:
$$\sum_{i=1}^{n} \left( Y_i - a\,e^{b X_i} \right)\left( -e^{b X_i} \right) = 0$$
$$\sum_{i=1}^{n} \left( Y_i - a\,e^{b X_i} \right)\left( -a\,X_i\,e^{b X_i} \right) = 0$$
To fit a non-linear model, it is necessary to proceed by iterations, meaning that we have to give initial values to the parameters.
When the least squares method is used, the sum of squared errors (SSE) is calculated at each step (i.e. for each new set of parameter estimates). If the procedure is efficient, the SSE decreases at each step. When this decrease becomes negligible, the model is said to have converged.
The most widely used iterative procedure is the Gauss-Newton one, but many other procedures are available. When there are problems in fitting a model, it is recommended to test several methods (e.g. fractional iteration, Marquardt).
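The iterative principle can be sketched with a bare Gauss-Newton loop for the exponential model Y = a.exp(b.X). This is a simplified illustration written for this example (no step control, no safeguard against divergence), not the algorithm of any particular statistical package; the function name, the data and the initial values are assumptions.

```python
import math

def gauss_newton_exp(x, y, a, b, max_iter=50, tol=1e-12):
    """Fit Y = a * exp(b * X) by Gauss-Newton iterations.

    a and b are the initial values. Returns the final estimates and the
    sum of squared errors (SSE) recorded at each step.
    """
    def sse(a_, b_):
        return sum((yi - a_ * math.exp(b_ * xi)) ** 2 for xi, yi in zip(x, y))

    history = [sse(a, b)]
    for _ in range(max_iter):
        # Accumulate J'J (2 x 2) and J'r (2 x 1), J being the matrix of
        # derivatives of the model with respect to a and b
        s11 = s12 = s22 = g1 = g2 = 0.0
        for xi, yi in zip(x, y):
            e = math.exp(b * xi)
            r = yi - a * e       # residual
            ja = e               # derivative with respect to a
            jb = a * xi * e      # derivative with respect to b
            s11 += ja * ja
            s12 += ja * jb
            s22 += jb * jb
            g1 += ja * r
            g2 += jb * r
        det = s11 * s22 - s12 * s12
        a += (s22 * g1 - s12 * g2) / det
        b += (s11 * g2 - s12 * g1) / det
        history.append(sse(a, b))
        # Convergence: the decrease of the SSE has become negligible
        if abs(history[-2] - history[-1]) < tol:
            break
    return a, b, history

# Hypothetical noise-free data generated with a = 3 and b = 0.5
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [3.0 * math.exp(0.5 * xi) for xi in x]
a_hat, b_hat, sse_history = gauss_newton_exp(x, y, a=2.5, b=0.45)
```

The list `sse_history` makes it possible to check that the SSE decreases from one step to the next.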
Graphical view of the model convergence with two parameters
[Figure: isovalue curves of the sum of squared errors (100, 80, 60, 40) in the (b1, b2) plane, with the successive iterations leading from the initial values of b1 and b2 to their final values.]
Importance of the initial values given to the parameters
[Figure: isovalue curves of the SSE in the (b1, b2) plane. With one set of initial values, we fall into a local minimum of the SSE (> 20 and < 40); with another set, we fall into the absolute minimum of the SSE (< 20).]
It is strongly recommended to test several sets of initial values.
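The effect of the initial values can be illustrated with a toy example. The one-parameter surface below is hypothetical: it only stands in for an SSE surface that has a local and an absolute minimum, and a plain gradient descent stands in for the iterative fitting procedure.

```python
def toy_sse(b):
    """Hypothetical surface with a local minimum (b close to 0.96) and an
    absolute minimum (b close to -1.04)."""
    return (b * b - 1.0) ** 2 + 0.3 * b

def toy_gradient(b):
    """Derivative of toy_sse with respect to b."""
    return 4.0 * b * (b * b - 1.0) + 0.3

def descend(b, step=0.01, n_iter=2000):
    """Plain gradient descent starting from the initial value b."""
    for _ in range(n_iter):
        b -= step * toy_gradient(b)
    return b

def multi_start(initial_values):
    """Run the descent from each initial value, keep the lowest final SSE."""
    finals = [descend(b0) for b0 in initial_values]
    return min(finals, key=toy_sse)

local_only = descend(2.0)        # a single start ends in the local minimum
best = multi_start([2.0, -2.0])  # several starts reach the absolute minimum
```

Comparing the final SSE obtained from several sets of initial values is what reveals that the first solution was only a local minimum.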
Non-linear regression
Example of output:

Dependent variable is Y
Source           Sum-of-Squares   df   Mean-Square
Regression       1.79138E+04      3    5971.269443
Residual         6.711670         20   0.335583
Total            1.79205E+04      23
Mean corrected   3284.344348      22

Raw R-square (1-Residual/Total) = 0.999625
Mean corrected R-square (1-Residual/Corrected) = 0.997956
R(observed vs predicted) square = 0.997965

                                               Wald Confidence Interval
Parameter   Estimate    A.S.E.     Param/ASE   Lower < 95%>   Upper
B1          40.269815   0.584758   68.865757   39.050031      41.489600
B2          0.029815    0.001760   16.941467   0.026144       0.033486
B3          1.454754    0.078017   18.646595   1.292013       1.617495

Asymptotic Correlation Matrix of Parameters
      B1          B2         B3
B1    1.000000
B2   -0.910171    1.000000
B3   -0.756906    0.939698   1.000000

Reading the output:
R2: the mean corrected one should be used.
Estimate and the Wald confidence interval give the values of the parameters and their confidence intervals.
The asymptotic correlation matrix gives the correlations between parameters.
Goodness of fit
How to assess the goodness of fit (for linear and non-linear models)
R2 and graph of Y against predicted Y
R2 is an index of fit, to be used cautiously (see below). Maximum value = 1; minimum value = 0.
Values of the parameters and their confidence intervals
They help identify convergence problems; usually the standard error should not exceed 10% of the parameter value.
Correlations between parameters
If the correlations are too high, transform the variables or change the model equation.
The RMSE (root mean square error, or residual standard error)
It gives the error dispersion and is to be compared with the average of the measured values. Usually, the model is satisfactory when the RMSE is less than 10% of the average measured value.
Error distribution and relationship with the input variables
Errors should be normally distributed, with no heteroscedasticity and no autocorrelation; errors should be uncorrelated with the input variables.
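Some of these checks are simple enough to be sketched in a few lines. The helper names below are assumptions made for the illustration; the observed and predicted values are hypothetical, while the parameter estimates and standard errors are those of the non-linear regression output shown earlier (B1, B2, B3).

```python
import math

def fit_diagnostics(observed, predicted, n_params):
    """R2, RMSE and ratio of the RMSE to the mean of the observed values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    sse = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    sst = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(sse / (n - n_params))
    return {"r2": 1.0 - sse / sst, "rmse": rmse, "rmse_ratio": rmse / mean_obs}

def relative_standard_errors(estimates, std_errors):
    """Standard error of each parameter as a fraction of its estimate."""
    return [abs(se / est) for est, se in zip(estimates, std_errors)]

# Hypothetical observed and predicted values
diag = fit_diagnostics([10.0, 12.0, 14.0, 16.0], [10.5, 11.5, 14.5, 15.5], n_params=2)

# Estimates and asymptotic standard errors of B1, B2, B3
ratios = relative_standard_errors(
    [40.269815, 0.029815, 1.454754],
    [0.584758, 0.001760, 0.078017],
)
all_below_10_percent = all(r < 0.10 for r in ratios)
```

The distribution of the residuals still has to be examined graphically (quantile plot, residuals against the input variables), which this sketch does not cover.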
Example of normally distributed errors, to be verified with statistical tests (e.g. D’Agostino et al., 1990) and quantile plots
Example errors with
Error of a linear model when the appropriate model is in fact
Do not listen to the siren’s song of the R2 !
Heteroscedasticity
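How to deal with heteroscedasticity?
[Figure: Y and ε plotted against X.]
$$Y = a + b\,X + \varepsilon \quad\text{with}\quad \varepsilon \sim N(0,\, \sigma X)$$
Transformation of the variables:
$$Y' = Y/X, \qquad X' = 1/X, \qquad \varepsilon' = \varepsilon/X$$
to get the following linear model:
$$Y' = a\,X' + b + \varepsilon' \quad\text{with}\quad \varepsilon' \sim N(0,\, \sigma)$$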
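A minimal sketch of this transformation, assuming an error standard deviation proportional to X: the transformed variables are fitted by ordinary least squares, and the roles of the two coefficients are swapped (a becomes the slope on 1/X, b becomes the intercept). The function names and the noise-free data are hypothetical.

```python
def ols(x, y):
    """Ordinary least squares for a straight line; returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(x, y))
             / sum((u - mx) ** 2 for u in x))
    return my - slope * mx, slope

def fit_with_sd_proportional_to_x(x, y):
    """Fit Y = a + b*X when sd(eps) is proportional to X, through the
    transformed model Y' = a*X' + b + eps' with Y' = Y/X and X' = 1/X."""
    x_t = [1.0 / xi for xi in x]
    y_t = [yi / xi for xi, yi in zip(x, y)]
    intercept, slope = ols(x_t, y_t)
    # In the transformed model, a is the slope and b is the intercept
    return slope, intercept

# Hypothetical noise-free data generated with a = 2 and b = 3
x = [1.0, 2.0, 4.0, 5.0]
y = [2.0 + 3.0 * xi for xi in x]
a, b = fit_with_sd_proportional_to_x(x, y)
```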
More generally, weighted regression consists in minimizing:
$$\sum_{i=1}^{n} w_i\,(Y_i - Y_{i,\mathrm{pred}})^2 \quad\text{with}\quad w_i = 1/\sigma_i^2$$
where $w_i$ is the weight of observation $i$. Usually, we use $\sigma_i \propto X_i^{z}$.
The whole challenge consists in finding the appropriate value of z.
[Figure: three panels comparing z = Zoptimum, z < Zoptimum and z > Zoptimum.]
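For a straight line, the weighted criterion still has an explicit solution. The sketch below assumes the model Y = a + b.X and weights proportional to 1/X^(2z); the function name `weighted_line_fit` and the data are hypothetical.

```python
def weighted_line_fit(x, y, z):
    """Weighted least squares for Y = a + b*X with w_i = 1 / X_i**(2*z).

    Returns (a, b, weighted sum of squared errors). With z = 0 all the
    weights equal 1 and the fit is the ordinary least squares fit.
    """
    w = [1.0 / xi ** (2.0 * z) for xi in x]
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of X
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of Y
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = yw - b * xw
    wsse = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    return a, b, wsse

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
a0, b0, wsse0 = weighted_line_fit(x, y, z=0.0)   # unweighted fit
a1, b1, wsse1 = weighted_line_fit(x, y, z=1.0)   # sd proportional to X
```

Changing z changes the relative importance given to small and large values of X, and therefore the estimates.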
First option: a rough and simple method that can be used if there are enough data
$$w_i \propto \frac{1}{X_i^{2z}}$$
Step 1 = split the variable X into k classes centered on $X_k$
Step 2 = calculate the variance $\sigma_k^2$ of Y within each of the k classes
Step 3 = linear regression of $\log \sigma_k$ on $\log X_k$; since $\sigma \propto X^{z}$, the slope of this regression estimates z
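The three steps can be sketched as follows. How the classes are built is not specified above; this sketch assumes classes with equal numbers of observations, each centered on the mean of its X values, and the function name and data are hypothetical.

```python
import math

def estimate_z_by_classes(x, y, n_classes):
    """Rough estimate of z in sd(Y) proportional to X**z."""
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_classes
    log_x, log_sd = [], []
    for k in range(n_classes):
        # Step 1: class k (equal-count classes; the last takes the remainder)
        start = k * size
        end = len(pairs) if k == n_classes - 1 else start + size
        xs = [p[0] for p in pairs[start:end]]
        ys = [p[1] for p in pairs[start:end]]
        x_k = sum(xs) / len(xs)
        # Step 2: variance of Y within the class
        y_mean = sum(ys) / len(ys)
        var_k = sum((v - y_mean) ** 2 for v in ys) / (len(ys) - 1)
        log_x.append(math.log(x_k))
        log_sd.append(0.5 * math.log(var_k))
    # Step 3: slope of the regression of log(sd_k) on log(X_k)
    mx = sum(log_x) / n_classes
    ms = sum(log_sd) / n_classes
    return (sum((u - mx) * (v - ms) for u, v in zip(log_x, log_sd))
            / sum((u - mx) ** 2 for u in log_x))

# Hypothetical data: three replicates per X value, spread equal to 0.5 * X**1.5
x, y = [], []
for xv in [1.0, 2.0, 4.0, 8.0]:
    d = 0.5 * xv ** 1.5
    for dev in (-d, 0.0, d):
        x.append(xv)
        y.append(10.0 * xv + dev)
z_hat = estimate_z_by_classes(x, y, n_classes=4)
```

With real data the classes contain different X values, so the within-class variance also includes part of the trend; this is why the method is only a rough one.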
$$w_i \propto \frac{1}{X_i^{2z}}$$
Step 1 = fit the weighted model with z fixed to a given value (often 0 at the beginning)
Step 2 = calculate the Furnival index (FI):
$$FI = \operatorname{antilog}\left( \frac{1}{n} \sum_{i=1}^{n} \log X_i^{z} \right) \cdot RMSE$$
Step 3 = go back to step 1 with an increased value of z
The optimum value for z corresponds to the minimum of the Furnival index.
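The search for z with the Furnival index can be sketched as a scan over a grid of candidate values. This sketch assumes a straight-line model, natural logarithms (so that the antilog is the exponential), and an RMSE computed from the weighted residuals with n - 2 degrees of freedom; the function names, the data and the grid are hypothetical.

```python
import math

def weighted_sse(x, y, z):
    """Weighted SSE of the fit of Y = a + b*X with w_i = 1 / X_i**(2*z)."""
    w = [1.0 / xi ** (2.0 * z) for xi in x]
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
    a = yw - b * xw
    return sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))

def furnival_index(x, y, z, n_params=2):
    """FI = exp(mean of log(X_i**z)) * RMSE of the weighted fit."""
    n = len(x)
    rmse = math.sqrt(weighted_sse(x, y, z) / (n - n_params))
    mean_log = sum(math.log(xi ** z) for xi in x) / n
    return math.exp(mean_log) * rmse

def best_z(x, y, z_grid):
    """Steps 1 to 3: fit for each z of the grid, keep the z with minimum FI."""
    return min(z_grid, key=lambda z: furnival_index(x, y, z))

# Hypothetical data and grid, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
z_grid = [0.0, 0.5, 1.0, 1.5, 2.0]
z_opt = best_z(x, y, z_grid)
```

With z = 0 the index reduces to the RMSE of the unweighted fit, which gives a reference value for the comparison.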
Fitting z together with the other parameters of the model
$$w_i \propto \frac{1}{X_i^{2z}}$$
Fit by maximum likelihood instead of least squares:
$$ML = \sum_{i} \left[ -\frac{(Y_i - \mu_i)^2}{2\,\sigma^2 X_i^{2z}} - \log\left(\sigma X_i^{z}\right) - \frac{1}{2}\log(2\pi) \right]$$
where $\mu_i$ is the model for the mean.
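The log-likelihood of a Gaussian model whose standard deviation is sigma.X^z can be written as a short function, which any numerical optimizer could then maximize over the parameters of the mean, sigma and z. The function name and the argument layout are assumptions made for this sketch; the predicted means are passed as a list so that any model for the mean can be used.

```python
import math

def log_likelihood(y, mu, x, sigma, z):
    """Gaussian log-likelihood with sd_i = sigma * X_i**z.

    y: observed values; mu: values predicted by the model for the mean;
    x: variable that drives the variance.
    """
    total = 0.0
    for yi, mi, xi in zip(y, mu, x):
        sd = sigma * xi ** z
        total += (-(yi - mi) ** 2 / (2.0 * sd ** 2)
                  - math.log(sd)
                  - 0.5 * math.log(2.0 * math.pi))
    return total

# Hypothetical single observation: X = 2, Y = 5, predicted mean = 3
ll = log_likelihood([5.0], [3.0], [2.0], sigma=1.0, z=1.0)
```

Because z is estimated together with the other parameters, no separate search over z is needed with this approach.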
Model Choice
How to choose between models?
If the number of parameters is the same in model 1 and model 2, use the sum of squared errors (SSE): the model with the lowest SSE is the best.
If the number of parameters differs between model 1 and model 2, check whether the two models are nested or not:
$Y = a + b\,D + \varepsilon$ is nested in $Y = a + b\,D + c\,D^{2}H + \varepsilon$
$Y = a + b\,D^{2}H + \varepsilon$ is nested in $Y = a + b\,D + c\,D^{2}H + \varepsilon$
$Y = a + b\,D + \varepsilon$ is not nested in $Y = a + c\,D^{2}H + \varepsilon$
For nested models: F test using the sums of squared errors (SSE) of the two models, model 1 having $p_1$ parameters and model 2 having $p_2$ parameters, with $p_1 > p_2$:
$$F_{obs} = \frac{(SSE_2 - SSE_1)/(p_1 - p_2)}{SSE_1/(n - p_1)}$$
If $F_{obs} > F_{tab}(p_1 - p_2,\, n - p_1)$, then model 1 is more suitable than model 2.
For non-nested models: AIC and BIC, using the maximum likelihood estimates:
$$AIC = -2\,ML + 2\,p, \qquad BIC = -2\,ML + p\,\log(n)$$
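These criteria are simple to compute once the models have been fitted. In the sketch below the function names and the numerical values are hypothetical; the tabulated value of F still has to be read from a table or obtained from a statistical package.

```python
import math

def f_statistic(sse_small, p_small, sse_large, p_large, n):
    """Observed F for two nested models.

    The large model (p_large parameters) contains the small one
    (p_small parameters); n is the number of observations.
    """
    numerator = (sse_small - sse_large) / (p_large - p_small)
    denominator = sse_large / (n - p_large)
    return numerator / denominator

def aic(log_lik, p):
    """Akaike information criterion from the maximized log-likelihood."""
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, n):
    """Bayesian information criterion from the maximized log-likelihood."""
    return -2.0 * log_lik + p * math.log(n)

# Hypothetical values, for illustration only
f_obs = f_statistic(sse_small=12.0, p_small=2, sse_large=8.0, p_large=3, n=23)
aic_value = aic(log_lik=-10.0, p=3)
bic_value = bic(log_lik=-10.0, p=3, n=20)
```

For AIC and BIC, the model with the lowest value is preferred.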
Don’t use the R2, because it increases automatically as the number of parameters increases:
$Y = a + b\,D + c\,D^{2}$
$Y = a + b\,D + c\,D^{2} + \dots + f\,D^{10}$
$Y = a + b\,D + c\,D^{2} + \dots + k\,D^{5}$
Aggregation
Example: Eucalyptus in Congo (Saint-André et al. 2005)
$$LeafBiomass = a + b\,D^{2}H + \varepsilon$$
Fitted age by age, then analysis of the parameter variations with stand age.
[Figure: leaf biomass (kg DM tree-1), measured and estimated, against D2H (m3). Legend: GP3A, 3B; GP3C, 3D; GP1, 2; 11-30 months, 50-75 months, 135 months.]
[Figure: the two fitted parameters against stand age (months): no variation with stand age for one, exponential decrease with stand age for the other.]
Compartment | Stands used for calibration | Model for the mean | Model for the variance
F1 Total | G1, G3A, G3B, G3D | $\mu = 5.53 + 939.11\,(r_{1.3}^{2} h)$ | $\varepsilon = 24.19\,(r_{1.3}^{2} h)^{0.483}$
F2 Aboveground | G1, G2, G3A, G3B, G3C, G3D, V1 to V9 | $\mu = 2.18 + (488.8 + 2.2\,age)\,(r_{1.3}^{2} h)^{(0.87 + 0.0012\,age)}$ | $\varepsilon = 21.74\,(r_{1.3}^{2} h)^{0.613}$
F3 Belowground | G1, G3A, G3B, G3D | $\mu = 92.14\,(r_{1.3}^{2} h)^{0.630}$ | $\varepsilon = 5.52\,(r_{1.3}^{2} h)^{0.385}$
F4 Leaves | G1, G2, G3A, G3B, G3C, G3D, V1 to V9 | $\mu = 0.64 + (20.39 - 0.09\,age + 2344.6\,e^{-0.15\,age})\,r_{1.3}^{2} h$ | $\varepsilon = 0.68\,(r_{1.3}^{2} h)^{0.232}$
F5 Dead branches | G1, G2, G3A, G3B, G3C, G3D, V1 to V9 | $\mu = (6.12 + 158.9\,e^{-0.03\,age})\,r_{1.3}^{2} h$ | $\varepsilon = 3.09\,(r_{1.3}^{2} h)^{0.353}$
F6 Living branches | G1, G2, G3A, G3B, G3C, G3D, V1 to V9 | $\mu = (31.12 + 4496.7\,e^{-0.18\,age})\,r_{1.3}^{2} h$ | $\varepsilon = 5.20\,(r_{1.3}^{2} h)^{0.573}$
F7 Bark | G1, G2, G3A, G3B, G3C, G3D, V1 to V9 | $\mu = (25.95 + 19.83\,e^{-0.05\,age})\,(r_{1.3}^{2} h)^{0.761}$ | $\varepsilon = 1.03\,(r_{1.3}^{2} h)^{0.402}$
F8 Trunk | G1, G2, G3A, G3B, G3C, G3D, V1 to V9 | $\mu = 0.29 + (510.7 + 1.29\,age)\,r_{1.3}^{2} h$ | $\varepsilon = 36.02\,(r_{1.3}^{2} h)^{0.887}$
F9 Stump | G1, G3A, G3B, G3D | $\mu = 37.72\,(r_{1.3}^{2} h)^{0.718}$ | $\varepsilon = 4.27\,(r_{1.3}^{2} h)^{0.508}$
F10 Coarse roots | G1, G3A, G3B, G3D | $\mu = 54.84\,(r_{1.3}^{2} h)^{0.790}$ | $\varepsilon = 5.05\,(r_{1.3}^{2} h)^{0.564}$
F11 Medium roots | G1, G3A, G3B, G3D | $\mu = 2.71\,(r_{1.3}^{2} h)^{0.470}$ | $\varepsilon = 0.47\,(r_{1.3}^{2} h)^{0.276}$
F12 Fine roots | G1, G3A, G3B, G3D | $\mu = 7.61\,(r_{1.3}^{2} h)^{0.297}$ | $\varepsilon = 0.67\,(r_{1.3}^{2} h)$
The age effect was significant for most of the compartments; we thus get a set of equations that can be used whatever the stand age (within the range of the calibration data set, 11 to 135 months).
NB: the z values obtained (heteroscedasticity) are very different from 1 or 2 (the values classically used in weighted regressions).
Example: Fagus in France (Genet et al. 2011)
[Figure: parameter b (dimensionless) plotted against stand age (years) for Eucalyptus in Congo, beech in France and Eucalyptus in Brazil.]
Not only do Eucalyptus and Fagus show the same pattern, they also follow the same line (especially for stem wood and branches).
Taking all compartments into account
The equations were fitted simultaneously, to take cross-compartment correlations into account. This step is important when one wants to simulate biomass estimates with confidence intervals.
The outputs of SUR (seemingly unrelated regressions) are:
1 - Values of the parameters and their confidence intervals
2 - Correlation matrix of the parameters (within compartment and between compartments)
3 - Residual errors for each compartment
4 - Correlation matrix of the errors (between compartments)