
Fibrosis is the formation of excess fibrous connective tissue in an organ, due to a reactive process.

Fibrosis can arise in many tissues of the body, such as the lungs, heart, skin or intestine. In the liver, cirrhosis is the result of advanced fibrosis and leads to a loss of liver function. It is most commonly caused by alcoholism, hepatitis (B and C) or other possible causes. The HIFIH laboratory develops accurate and non-invasive blood tests for identifying the stage of fibrosis, for instance in non-alcoholic fatty liver disease (NAFLD) (see Calès, Boursier, Chaigneau, Lainé, Sandrini, Michalak, Hubert, Dib, Oberti, Bertrais, Hunault, Cavaro-Ménard, Gallois, Deugnier, and Rousselet [2010]) or in chronic hepatitis C (Calès, Boursier, Ducancelle, Oberti, Hubert, Hunault, Lédinghen, Zarski, Salmon, and F. Lunel [2014]). In this section, we use aggregation methods to predict the fibrosis stage from simple biomarkers, in order to propose an automatic method based on blood tests.

Dataset description

Each of the 1012 patients included in the derivation population has a fibrosis variable $F$, ranging from 0 (absence of fibrosis) to 4 (cirrhosis), together with 6 blood-test variables corresponding to different quantities, levels or rates: $X_1$ (G/l, platelets or thrombocytes), $X_2$ (IU/l, aspartate aminotransferase), $X_3$ (mmol/l, blood urea level), $X_4$ (%, prothrombin rate), $X_5$ (mg/dl, alpha-2-macroglobulin) and $X_6$ (IU/l, gamma-glutamyl transpeptidase). Finally, we also use $X_7$ (age of the patient) and $X_8$ (sex: male/female).

Multinomial Logistic Regression (MLR)

The value of $F$ has been measured with an invasive method. In order to develop non-invasive methods based on blood tests, we propose to use a multinomial logistic regression model on the input variables $X_1, \dots, X_8$ described above. The logistic regression model consists in modelling the posterior probabilities $\eta_k(x) := P(F = k \mid X = x)$, $k = 0, \dots, K-1$, via a linear function in $x$. In the sequel, we use the following logit transformations:

\[
\forall k = 0, \dots, K-1, \qquad \log \frac{\eta_k(x)}{\eta_K(x)} = \beta_k^\top (1, x),
\]

where $(1, x) = (1, x_1, \dots, x_8)$ for simplicity. Then, a simple calculation shows that:

\[
P(F = k \mid X = x) = \eta_k(x \mid \beta) = \frac{e^{\beta_k^\top (1, x)}}{1 + \sum_{j=0}^{K-1} e^{\beta_j^\top (1, x)}}.
\]
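As a quick numerical illustration of this formula, here is a minimal Python sketch (the coefficient values below are random placeholders, not the fitted ones):

```python
import numpy as np

def posterior_probas(x, betas):
    """Posterior probabilities eta_k(x | beta) for k = 0, ..., K-1, plus the
    reference class K, following the formula above.

    x:     feature vector of length d
    betas: array of shape (K, d + 1), one row beta_k per non-reference class
    """
    z = np.concatenate(([1.0], x))             # the vector (1, x)
    scores = betas @ z                         # beta_k^T (1, x), k = 0, ..., K-1
    denom = 1.0 + np.exp(scores).sum()
    etas = np.exp(scores) / denom              # eta_0, ..., eta_{K-1}
    return np.append(etas, 1.0 / denom)        # last entry: reference class K

# toy example with K = 4 non-reference classes and d = 8 features
rng = np.random.default_rng(0)
print(posterior_probas(rng.normal(size=8), rng.normal(size=(4, 9))))  # sums to 1
```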

The logistic regression model is widely used in biostatistics for $K = 1$ (binary classification), in which case there is only a single linear function. In our problem, the parameters $(\beta_k)_{k=0}^{K-1}$ are usually fitted by maximum likelihood. The associated first-order conditions read, in matrix notation:

\[
X^\top (y - p) = 0,
\]

where $X$ is the data matrix with $n = 1012$ rows and $d = 8 + 1$ columns, $y$ is the vector of fibrosis stages and $p$ is the vector of fitted probabilities given by $(\eta_1(x_i \mid \beta), \dots, \eta_K(x_i \mid \beta))^\top$. A Newton-Raphson algorithm can then be performed to solve these equations.
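For concreteness, in the binary case ($K = 1$) this Newton-Raphson scheme reduces to the classical iteratively reweighted least squares update (a standard fact recalled here, not spelled out in the text):

\[
\beta^{\mathrm{new}} = \beta^{\mathrm{old}} + \big(X^\top W X\big)^{-1} X^\top (y - p),
\qquad W = \operatorname{diag}\big(p_i (1 - p_i)\big),
\]

where $p_i$ is the fitted probability of the modelled class for patient $i$, evaluated at $\beta^{\mathrm{old}}$. The multinomial case follows the same pattern with a block-structured weight matrix.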

In agreement with the health care professionals, 9 logistic regressions were computed from the dataset. The first one is the multinomial logistic regression with $d = 9$, obtained by considering the entire set of feature variables; it is called MLRtot. Then, we construct 8 other logistic regressions by removing one variable at a time from the dataset. This gives MLR1, ..., MLR8, where MLRk is the regression fitted without $X_k$. The associated classifiers are denoted $f_{\mathrm{tot}}$ and $f_j$, $j = 1, \dots, 8$, and are given by the following formula:

\[
f_j(x) = \arg\max_{k = 0, \dots, K-1} \eta_k(x \mid \hat\beta_j),
\]

where $\hat\beta_j$ is the solution of the Newton-Raphson algorithm associated with MLRj, computed with the R package VGAM.
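In the text these fits are obtained with the R package VGAM; purely as an illustration of how the family could be constructed, here is a sketch in Python using scikit-learn's multinomial logistic regression as a stand-in (the toy data and the column names F, X1, ..., X8 are assumptions, since the cohort itself is not shown):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = [f"X{k}" for k in range(1, 9)]

# toy stand-in for the derivation cohort (random values, real data not public)
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(506, 8)), columns=FEATURES)
train["F"] = rng.integers(0, 5, size=506)      # fibrosis stage 0, ..., 4

def fit_mlr(data, dropped=None):
    """Multinomial logistic regression on all features except `dropped`."""
    cols = [c for c in FEATURES if c != dropped]
    # with the default lbfgs solver, the multiclass problem is fitted as a
    # single multinomial model, i.e. the eta_k(. | beta) described above
    return LogisticRegression(max_iter=5000).fit(data[cols], data["F"])

mlr_tot = fit_mlr(train)                                          # MLRtot
mlr_minus = {j: fit_mlr(train, f"X{j}") for j in range(1, 9)}     # MLR1, ..., MLR8
# f_j(x) = argmax_k eta_k(x | beta_hat_j) is simply mlr_minus[j].predict(x)
```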

Aggregation with Mirror Averaging (MA)

Aggregation methods are very popular in machine learning. The principle is to construct a combination of a finite number $M \geq 1$ of base learners $\{f_1, \dots, f_M\}$ in order to obtain an accurate prediction strategy. This is an alternative to the well-known empirical risk minimization principle, which selects a particular classifier in a given family. The main motivation is as follows: very often, a single classifier cannot perform well on every instance of a test set, so using a combination of classifiers instead of a single method can lead to better results. Most of the time, the sample is divided into two parts: the first part is used to construct the family of base learners, whereas the second part is used to construct the associated weights².

Equipped with a family of preliminary functions, denoted $\Phi = \{f_1, \dots, f_M\}$, we construct our final decision sequentially. At each trial $t = 1, \dots, n$, for $j = 1, \dots, M$, we compute the empirical risk $r_{t,j}$ of classifier $f_j \in \Phi$ at time $t$ and the associated weights $\hat w_{t,j}$ as follows:

\[
\hat w_{t,j} = \frac{e^{-\lambda r_{t,j}}}{W_t}, \qquad \text{where } r_{t,j} = \sum_{i=1}^{t} \mathbf{1}_{\{Y_i \neq f_j(X_i)\}}, \tag{5.3}
\]

where $W_t > 0$ is such that $\sum_{j=1}^{M} \hat w_{t,j} = 1$ and $\lambda > 0$ is a temperature parameter. Eventually, we proceed to the final step, called "mirror averaging", and construct the final weights:

\[
\hat w_j = \frac{1}{n} \sum_{t=1}^{n} \hat w_{t,j}. \tag{5.4}
\]

We hence obtain an aggregate, called the mirror averaging (MA) aggregate, defined as $\hat f_{\mathrm{MA}}(\cdot) = \sum_{j=1}^{M} \hat w_j f_j(\cdot)$.
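A minimal numpy sketch of this weighting scheme, assuming an $n \times M$ matrix of 0/1 losses with entry $(t, j)$ equal to $\mathbf{1}_{\{Y_t \neq f_j(X_t)\}}$ (toy values below), and reading $\hat f_{\mathrm{MA}} = \sum_j \hat w_j f_j$ as a weighted vote over the predicted labels, which is an interpretation rather than something spelled out in the text:

```python
import numpy as np

def mirror_averaging_weights(losses, lam):
    """Final weights w_hat_j of (5.4) from an (n, M) matrix of 0/1 losses."""
    r = np.cumsum(losses, axis=0)                          # r_{t,j} of (5.3)
    # subtracting the row-wise minimum leaves the normalised weights unchanged
    # and avoids numerical underflow for large lambda
    w = np.exp(-lam * (r - r.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)                      # sum_j w_hat_{t,j} = 1
    return w.mean(axis=0)                                  # w_hat_j of (5.4)

def f_MA(preds_at_x, w, n_classes=5):
    """Weighted vote over the base classifiers' predictions at a single point."""
    votes = np.bincount(preds_at_x, weights=w, minlength=n_classes)
    return int(np.argmax(votes))

# toy example: n = 506 trials, M = 9 base classifiers
rng = np.random.default_rng(0)
toy_losses = rng.integers(0, 2, size=(506, 9))
print(mirror_averaging_weights(toy_losses, lam=0.1))
```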

2. In Barron and Leung [2006], it is proved that, in the context of linear regression, the least-squares projections and the associated aggregate can be computed from the same sample.

Results of the experiment

Following the aggregation scheme of Section 5.2, we divide the sample into two parts. The first part of the sample ($n_1 = 506$ patients chosen at random) is used to construct the family of classifiers $\Phi = \{f_{\mathrm{tot}}, f_1, \dots, f_8\}$, where the multinomial logistic regressions are fitted on this first set of patients. Then, we use the second subsample of $n_2 = 506$ patients to construct the mirror averaging aggregate $\hat f_{\mathrm{MA}}(\cdot)$.
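Continuing the illustrative Python sketch above (toy data, scikit-learn in place of VGAM), the split and the loss matrix behind Figure 5.3 and the weights could be obtained along these lines:

```python
from sklearn.model_selection import train_test_split

# toy full cohort of 1012 patients (reusing np, pd, rng, FEATURES, fit_mlr from above)
df = pd.DataFrame(rng.normal(size=(1012, 8)), columns=FEATURES)
df["F"] = rng.integers(0, 5, size=1012)

first_half, second_half = train_test_split(df, train_size=506, random_state=0)

# family Phi = {f_tot, f_1, ..., f_8} fitted on the first half only
classifiers = [fit_mlr(first_half)] + [fit_mlr(first_half, f"X{j}") for j in range(1, 9)]

# predictions and 0/1 losses of each classifier on the second half; their
# cumulative sums over t are the empirical risks r_{t,j} plotted in Figure 5.3
y2 = second_half["F"].to_numpy()
preds = np.column_stack([clf.predict(second_half[list(clf.feature_names_in_)])
                         for clf in classifiers])
losses = (preds != y2[:, None]).astype(int)
```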

The evolution of the performance of each MLR is shown in Figure 5.3 below.

Figure 5.3: Evolution of the empirical risk $r_{t,j}$ of each $f_j \in \Phi$ over the $n_2 = 506$ patients.

We can note that $f_{\mathrm{tot}}$, the multinomial logistic regression based on the whole set of variables $X_1, \dots, X_8$, achieves a good accuracy (3.16% misclassification), whereas the other regressions give intermediate results, ranging from 8.49% for $f_3$ to 24.7% for $f_5$.

We can then proceed to the sequential construction of the weights. Figure 5.4 below shows the evolution of the weights for the small temperature parameters $\lambda = 0.001$ and $\lambda = 0.01$.

Figure 5.4: Evolution of the weights $\hat w_{t,j}$ defined in (5.3) with $n_2 = 506$ and small temperature parameters: (a) $\lambda = 0.001$, (b) $\lambda = 0.01$.

The influence of $\lambda$ can be seen on the vertical axis. The sequence of weights can also be computed with larger temperature parameters, $\lambda = 0.1$ and $\lambda = 1$.

Figure 5.5: Evolution of the weights $\hat w_{t,j}$ defined in (5.3) with $n_2 = 506$ and large temperature parameters: (a) $\lambda = 0.1$, (b) $\lambda = 1$.

Here, the weight of $f_{\mathrm{tot}}$ is significantly higher, which is due to the larger values of the temperature parameter. This illustrates that increasing the temperature parameter in (5.3) drives the aggregate towards an empirical risk minimization (ERM) strategy, in which the second sample is used as a test set.

Finally, we use a leave-one-out cross-validation procedure to estimate the accuracy of the aggregate $\hat f_{\mathrm{MA}}$ for various temperature parameters. This study shows that $\lambda = 0.38$ is the best compromise: it gives a misclassification error of about 2.77%, detailed for each fibrosis stage in the following table:
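One possible reading of this selection step, reusing `mirror_averaging_weights`, `f_MA`, `preds`, `losses` and `y2` from the sketches above (the exact cross-validation protocol is not detailed in the text, so this is only an interpretation):

```python
import numpy as np

def loo_error(losses, preds, y, lam):
    """Leave-one-out misclassification rate of the MA aggregate for a given lambda."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i                          # leave patient i out
        w = mirror_averaging_weights(losses[keep], lam)   # weights built on the others
        errors += int(f_MA(preds[i], w) != y[i])
    return errors / n

# scan a grid of temperature parameters and keep the best one
lambdas = np.linspace(0.01, 1.0, 100)
best_lam = min(lambdas, key=lambda lam: loo_error(losses, preds, y2, lam))
```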

Conclusion

This section illustrates the power of aggregation for fibrosis staging based on blood tests. It can be seen as a first step towards the development of non-invasive diagnostic methods based on statistical learning.