

HAL Id: hal-01831865

https://hal.archives-ouvertes.fr/hal-01831865

Preprint submitted on 6 Jul 2018


Bridging data-exploration and modeling in event-history analysis: the Supervised-Component Cox Regression method

Xavier Bry, Théo Simac, Salah El Ghachi, Philippe Antoine

To cite this version:

Xavier Bry, Théo Simac, Salah El Ghachi, Philippe Antoine. Bridging data-exploration and modeling in event-history analysis: the Supervised-Component Cox Regression method. 2018. ⟨hal-01831865⟩


Bridging data-exploration and modeling in event-history analysis: the Supervised-Component Cox Regression method.

Xavier Bry^a, Théo Simac^b, Salah Eddine El Ghachi^b, Philippe Antoine^c

a IMAG, Université de Montpellier; b Université de Montpellier; c IRD

Abstract

In event-history analysis with numerous and collinear regressors, Cox's proportional hazard model, as all generalized linear models, encounters crippling instability problems. Dimension-reduction and regularization are therefore needed. Penalty-based methods such as ridge and least absolute shrinkage and selection operator (LASSO) provide a regularized linear predictor, but do not enable exploratory analysis of predictive structures. A new and flexible component-based technique is proposed here as an alternative: Supervised-Component Cox Regression (SCCoxR). Its principle is to calculate components which both capture the strong correlation structures of the regressors and optimize the goodness-of-fit of the model. The flexibility of the method comes from three tuning-parameters. The first one tunes the balance between component-strength and goodness-of-fit, thus bridging classical Cox regression with Cox regression on principal components. The second one tunes the focus on more or less local explanatory variable-bundles. The third one tunes the regularization of the model coefficients, hence the robustness of the estimated hazard formula. Supervised-component Cox regression is demonstrated on simulated data, with intent to give the user some hints on the tuning process, and then used to explore and model the entrance of men into polygamy in Dakar.

Keywords: Supervised components, Cox regression, PLS Cox regression, regularization, SCGLR, survival analysis.

1 Introduction

Statistical models based on stochastic counting processes are one of the most useful mathematical tools for population study (Andersen et al., 1993). When they include the effects of covariates in a generalized linear formulation of their intensity, they provide a strong and handy explanatory way to link response-variables to what supposedly influences them, and to quantify this statistical link. Among such explanatory models, Cox's proportional hazard regression model is most popular for having long been successfully used in epidemiology for survival analysis (Cox, 1972; Cox & Oakes, 1984; Collett, 2003) and in demography for event-history analysis (Courgeau & Lelièvre, 1989). These generalized linear counting-process models, as powerful as they are, have the same Achilles' heel as all classical regression models: they cannot handle high-dimensional data, i.e. data where explanatory variables outnumber observations. Nor can they handle highly correlated explanatory variables. High dimension and high correlation make such models unidentified if one does not add some constraints or penalty to the goodness-of-fit criterion they currently maximize. Theory-based crisp identification constraints, when they do exist, are usually too rare to make the model identified. Moreover, in practice, they do not exist at all, for the excess of variables over observations and the presence of highly redundant variables arise from an "excess of measurement": as the "true" explanatory variables can usually not be observed directly, they are replaced with a number of proxies supposedly revolving "about" some of them, which is why they are so correlated. This lack of constraints makes penalty-based regularized models, such as ridge- or least absolute shrinkage and selection operator-penalised regression models (Tibshirani, 1997; Perperoglou, 2006), an all-terrain and very much appreciated way of breaking down the curse of dimensionality. But these regularized models still miss one point: if they do provide a linear predictor of the response, this linear predictor is very hard to interpret, for the variables in it are too many and redundant. What the analyst needs is a small set of explanatory, i.e. predictive and interpretable, dimensions which capture the statistical link between the response and the set of its regressors. So, true explanatory modeling in a high-dimensional framework requires, on top of regularization, analytical dimension reduction.

On the other hand, classical exploratory analysis methods such as principal component analysis (PCA) and correspondence analysis are justly reputed to be powerful dimension-reduction techniques, but have no explanatory power of their own.

Now, in-between regression and principal component analysis lies partial least squares (PLS) regression. The basic idea of partial least squares regression (Wold et al., 1984) is to bridge principal component analysis and ordinary least squares (OLS) regression by maximizing a trade-off criterion between ordinary least squares regression's goodness-of-fit criterion (squared correlation of the dependent variable with the component) and the component's "strength", as measured by its variance under a unit-norm constraint on the coefficient-vector. But partial least squares regression is not at all straightforward to extend to generalized linear models. Bry et al. (2004), Bastien et al. (2005) and Bry (2006) have proposed different ways to do it. In our view, all had to be improved in terms of generality, flexibility, and mathematical consistency. More recently, Bry et al. (2015) have proposed to extend the measure of a component's strength beyond its variance, by introducing a flexible measure of structural relevance. This measure can act as a bonus given to components close to certain variable-structures, e.g. bundles of highly correlated variables. This is an important asset in "explanatory exploration", for it produces components that are all the easier to interpret as they are close to enough explanatory variables. Below, we develop in detail what was briefly introduced by Bry & Simac (2016). We propose to flexibly combine the structural relevance with the partial likelihood of the proportional hazard model into a new criterion. Maximizing this criterion under a unit-norm constraint on the coefficient vector allows one to calculate a first structurally relevant explanatory component. A rank-h component can then be calculated by maximizing the same criterion under the additional constraint that it be orthogonal to all lower-rank components. As the calculated components are supervised ones, they are not exogenous, which entails that standard significance-tests are invalid. As a consequence, once a given (and presumably too large) number of components is obtained, a cross-validation procedure is required to keep only the really predictive ones. Eventually, one gets a regularized linear predictor of the hazard function, decomposed on a small number of dimensions spanning an interpretable explanatory subspace. By varying the parameter tuning the combination of the structural relevance with the partial likelihood, one can make the method go smoothly from standard non-regularized Cox regression to Cox regression on such exogenous (non-supervised) "strong" components as principal components. These two extremities are usually not very interesting: on the one hand, in a high-dimensional setting, standard Cox regression yields highly unstable and hardly interpretable coefficients, if any, for want of prior dimension reduction; on the other hand, exogenous components are not optimized to predict the survival time, so they do not usually provide the best explanatory view of it. In practice, the most interesting components will be given by some intermediate values of the tuning parameter.

The paper is organized as follows. In section 2, after a short reminder on Cox's proportional hazard model, we present the corresponding component-model and its estimation technique in a formal way. In section 3, a simulation study is performed with intent to show how the tuning-parameters condition the estimation results, and how to interpret them. Finally, in section 4, the hints given in section 3 are used to analyse life-history data of men in Dakar, in order to find which variables are useful to model their change from monogamy to polygamy.

2 Model and technique

We consider a survival time Y , depending on a set of covariates X , plus a set of extra-covariates Z. X and Z can be time-dependent. There may be non-informative right-censoring on Y . Variables in Z are few and exhibit low or no correlation. By contrast, variables in X are many and possibly redundant, so that the proportional hazard model demands regularization with respect to X .

2.1 Notations

• y_i is the survival-time or censoring-time of unit i ∈ {1, ..., n}.

• x_{i,t} and z_{i,t} are the values of the vectors of covariates x and z respectively for unit i at time t.

• The event-indicator δ is defined through: ∀i, δ_i = 1 if the event occurs for unit i at time y_i, and δ_i = 0 otherwise.

• R(t) denotes the set of all individuals at risk at time t.

• Π_A, where A is a matrix, denotes the orthogonal projector on the space spanned by the column-vectors of A, with respect to a given metric.

• A being a matrix, A′ denotes the transpose of A.

2.2 The proportional hazard model

Cox’s model is based on the following formulation of the hazard function of unit i at time t:

h(t; x_{i,t}, z_{i,t}) = h_0(t) e^{β′x_{i,t} + γ′z_{i,t}},   (1)

where h_0(t) is the baseline hazard function. The survival function having hazard function h_0(t) is the baseline survival function.

The partial likelihood defined by Cox (1972, 1975) is, when there are no simultaneous events:

L_p(β, γ; X, Z) = ∏_{i=1}^{n} p_i,   where   p_i = [ e^{β′x_{i,y_i} + γ′z_{i,y_i}} / ∑_{j∈R(y_i)} e^{β′x_{j,y_i} + γ′z_{j,y_i}} ]^{δ_i}   (2)

The partial likelihood can, under some assumptions, be interpreted as a marginal likelihood of the events' ranks (Cox, 1975). The partial likelihood is rid of h_0, involving only (β, γ). When the model is identified (i.e. matrix [X, Z] is full column rank), maximizing the partial likelihood with respect to (β, γ) through a Newton-Raphson algorithm yields estimates (β̂, γ̂), based on which Kalbfleisch et al. (1973) and Breslow (1974), among others, proposed an estimation of the baseline survival function.
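As a concrete illustration of this estimation step, here is a minimal numpy sketch of the log of the partial likelihood of Eq. (2), under the simplifying assumptions of time-independent covariates, no tied event times and no extra-covariates Z; all names are illustrative.

```python
import numpy as np

def cox_log_partial_likelihood(beta, X, time, event):
    """Log of Cox's partial likelihood, Eq. (2), assuming time-independent
    covariates, no tied event times and no extra-covariates Z."""
    eta = X @ beta                         # linear predictor beta'x_i
    log_lik = 0.0
    for i in np.flatnonzero(event):        # only uncensored units contribute
        at_risk = time >= time[i]          # risk set R(y_i)
        log_lik += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return log_lik
```

Maximizing this function over beta with any generic Newton-type routine reproduces the classical Cox fit sketched above.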

The problem with our data is the large amount of collinearity within X, which causes the X-part of the linear predictor to be unstable, if identified at all. We shall make it identified and stable by prompting it to lean on components that have some structural relevance. The idea is to replace regressor-block X with a block F = XU of H orthogonal components in the model. The regressors being time-dependent, so will the components be. Let X be the matrix whose columns are the X-regressors and whose N rows are the individuals-at-risk at time-points (i, t). Orthogonality of the f's will be taken with respect to the matrix W = N^{-1}I_N. The hazard function of unit i at time t will thus be:

h(t; x_{i,t}, z_{i,t}) = h_0(t) e^{α′f_{i,t} + γ′z_{i,t}} = h_0(t) e^{α′U′x_{i,t} + γ′z_{i,t}}   (3)

Components will be calculated hierarchically, starting with one, each extra component being constrained to be orthogonal to the former ones. Now, what we want is that the components be structurally relevant, in that they should be as close as possible to directions of a pre-defined type, such as explanatory variables, or other relevant subspaces. Here is how we propose to measure that.

2.3 Structural relevance of a component

To X we associate a p × p symmetric positive definite (s.p.d.) matrix A, such that principal component analysis of X with metric matrix A and weight-matrix W is relevant.

Let τ ∈ [0, 1], and M = (τA^{-1} + (1 − τ)X′WX)^{-1}. The purpose of coefficient τ is to tune the regularization of the model as follows. Matrix M is s.p.d., and component f = Xu will be constrained by ‖u‖²_{M^{-1}} = 1, so that:

• τ = 0 ⇒ M = (X′WX)^{-1}, so that ‖u‖²_{M^{-1}} = 1 ⇔ ‖f‖²_W = 1. All directions in ⟨X⟩ are equivalent, whether they are close to subsets of variables or not: X is then seen as a pure vector-space.

• τ = 1 ⇒ M = A. Recall that the program of the principal component analysis of (X, A, W) is:

max_{w′Aw=1} ‖XAw‖²_W ⇔ max_{u′A^{-1}u=1} ‖Xu‖²_W,   with u = Aw   (4)

So, subject to ‖u‖²_{A^{-1}} = 1, ‖f‖²_W is the inertia of the observations (rows) along w = A^{-1}u ∈ R^p. Components with a larger inertia will hence be favoured.

Structural relevance was introduced by Bry et al. (2015) as a possible extension of the component's variance to measure the ability of a component to capture information in its variable-block. Given a set of "reference" s.p.d. matrices N = {N_j; j = 1, ..., J} encoding types of structures of interest (target-spaces, e.g. variables in X), a weight system Ω = {ω_j; j = 1, ..., J}, and a scalar l ≥ 1, the associated structural relevance measure is defined as a generalized average of quadratic forms of u:

φ_{N,Ω,l}(u) := ( ∑_{j=1}^{J} ω_j (u′N_j u)^l )^{1/l}   (5)

In Eq. (5), the value of l tunes the locality of the bundles of structures coded in N. The larger the value of l, the more local the bundle. An example is given below.
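As an illustration, a minimal numpy sketch of the structural relevance measure of Eq. (5) could read as follows; the reference matrices N_j are assumed to be supplied as a list, and the function name is purely illustrative.

```python
import numpy as np

def structural_relevance(u, N_list, omega, l):
    """Structural relevance of Eq. (5): generalized (power-l) average of the
    quadratic forms u'N_j u, weighted by the omega_j."""
    q = np.array([u @ Nj @ u for Nj in N_list])   # quadratic forms u'N_j u
    return float((omega @ q**l) ** (1.0 / l))
```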

2.3.1 Particular instances of structural relevance measures

• Component variance:

φ(u) = V(Xu) = ‖Xu‖²_W = u′(X′WX)u

Under the constraint u′M^{-1}u = 1, this is the inertia of units along direction ⟨u⟩, and is maximized by the first (direct) eigenvector of the principal component analysis of X.

In practice, explanatory variables are often a mixture of numeric and nominal variables. Assume that X = [x¹, ..., x^K, X¹, ..., X^L], where x¹, ..., x^K are column-vectors coding the numeric regressors, and X¹, ..., X^L are blocks of centred indicator variables, each block coding a nominal regressor (X^l has q_l − 1 columns if the corresponding variable has q_l levels, the removed level being taken as "reference level"). We should then consider the following A, which bridges ordinary principal component analysis of numeric variables with multiple correspondence analysis of nominal variables:

A := diag{ (x¹′Wx¹)^{-1}, ..., (x^K′Wx^K)^{-1}, (X¹′WX¹)^{-1}, ..., (X^L′WX^L)^{-1} }   (6)

• Variable powered inertia:

For a block X consisting of p standardised numeric variables x^j, the variable powered inertia is defined as:

φ(u) = ( ∑_{j=1}^{p} ω_j ⟨Xu|x^j⟩_W^{2l} )^{1/l} = ( ∑_{j=1}^{p} ω_j (u′X′Wx^j x^j′WXu)^l )^{1/l}   (7)

⇔ φ(u) = ( ∑_{j=1}^{p} ω_j ρ^{2l}(f, x^j) )^{1/l} ‖f‖²_W   (8)

[Figure 1: Polar representation of the variable powered inertia according to the value of l (curves φ¹(u), φ²(u), φ⁴(u)).]

φ(u) = ( ∑_{j=1}^{p} ω_j ρ^{2l}(f, x^j) )^{1/l}   (9)

In the elementary case of 4 coplanar variables x^j with ∀j, ω_j = 1, fig. 1 graphs φ^l_X(u) in polar coordinates (using the complex notation, where vector u is identified with the complex number e^{iθ}: z(θ) = φ^l_X(e^{iθ})e^{iθ}, θ ∈ [0, 2π)) for various values of l. Note that φ^l_X(u) was graphed instead of φ_X(u) so that the curves would not overlap. One can see how the value of l tunes the locality of the bundles considered: the greater the l, the more local the bundle.

For a block X consisting of nominal variables, each one coded through the set of its centred indicator variables less one, the variable powered inertia is:

φ(u) = ( ∑_{j=1}^{p} ω_j cos^{2l}(Xu, ⟨X^j⟩) )^{1/l} = ( ∑_{j=1}^{p} ω_j ⟨Xu|Π_{X^j}Xu⟩_W^l )^{1/l}   (10)

where Π_{X^j} = X^j(X^j′WX^j)^{-1}X^j′W.

2.4 Combining the partial likelihood with the structural relevance

We propose to combine the partial likelihood with the structural relevance using a geometric average:

c(u, α, γ; Z; s) = [L_p(u, α, γ)]^{1−s} [φ(u)]^s,   with 0 ≤ s ≤ 1   (11)

The scales of L_p and φ are not comparable, and the geometric average has the obvious advantage of making the compound criterion insensitive to these scales, in that, at the optimum, the relative variations of L_p and φ compensate at a fixed rate:

dc = 0 ⇔ d ln c = 0 ⇔ (1 − s) d ln L_p + s d ln φ = 0 ⇔ dL_p / L_p = − (s / (1 − s)) dφ / φ   (12)
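In practice it is the logarithm of c that is handled. A minimal sketch of the compound log-criterion, assuming the log partial likelihood and the log structural relevance are available as callables, is:

```python
def log_criterion(u, alpha, gamma, s, log_partial_lik, log_struct_rel):
    """Logarithm of the trade-off criterion of Eq. (11):
    ln c = (1 - s) ln L_p + s ln phi(u)."""
    return (1.0 - s) * log_partial_lik(u, alpha, gamma) + s * log_struct_rel(u)
```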


2.5 The Supervised-Component Cox Regression algorithm

Maximizing the criterion c in Eq. (11) subject to the constraint u′M^{-1}u = 1 yields vector u¹, hence the first component f¹ = Xu¹.

When looking for the rank-h component f^h = Xu^h, we take the former components F^{h−1} = [f¹, ..., f^{h−1}] as known, and impose the additional orthogonality constraint:

F^{h−1}′ W f^h = 0 ⇔ D_h′ u^h = 0,   with D_h = X′W F^{h−1}   (13)

Besides, F^{h−1} is taken as a block of extra-covariates, thus appended to Z. Let Z^h = [Z, F^{h−1}]. Vector u^h is obtained as the solution of the following program:

P:   max_{u, α, γ} c(u, α, γ; Z^h; s)   s.t. u′M^{-1}u = 1; D_h′u = 0   (14)

Program P is solved by iteratively maximizing the criterion with respect to u and to (α, γ), in turn:

• Maximization with respect to u: given (α, γ), the criterion is maximized through the projected iterated normed gradient algorithm given in appendix B. The formula for calculating the gradient ∇c may be found in appendix A.

• Maximization with respect to (α, γ): given u, the criterion is maximized on (α, γ) exactly as in the classical Cox regression on covariates {Xu, Z^h}, with Z^h = Z ∪ F^{h−1}.

Overall, we get the following algorithm:

- A number H of components to be calculated is chosen. Let D_1 = null-matrix, F^0 = ∅ and Z^1 = Z. Then:

For h = 1 to H :

- Calculate u^h as the solution of P with Z^h as extra-covariates.
- Set f^h = Xu^h, F^h = [F^{h−1}; f^h], D_{h+1} = X′W F^h and Z^{h+1} = [Z; F^h];

End for.

- Perform Cox regression on regressors [F^H, Z]. This yields the linear predictor:

η = F_R α + Zγ = X U_R α + Zγ = Xβ + Zγ,   where β = U_R α   (15)

If X̃ denotes the original uncentered-variable matrix, then:

η = β_0 + X̃β + Zγ,   with β = U_R α and β_0 = −1_N′ W X̃ β.   (17)
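The component loop of this section can be sketched as follows; solve_program_P and cox_fit are hypothetical helpers standing, respectively, for the PING-based solution of program P and for a standard Cox fit on given covariates, and the data layout (one row per individual-at-risk-at-time-point, weighted by W) follows section 2.2.

```python
import numpy as np

def sccoxr_components(X, Z, W, H, s, l, tau, solve_program_P, cox_fit):
    """Sketch of the SCCoxR component loop (section 2.5)."""
    n, p = X.shape
    F = np.empty((n, 0))                       # F^0: empty block of components
    U = np.empty((p, 0))
    for h in range(1, H + 1):
        D_h = X.T @ W @ F                      # orthogonality constraints, D_h = X'W F^{h-1}
        Z_h = np.hstack([Z, F])                # extra-covariates Z^h = [Z, F^{h-1}]
        u_h = solve_program_P(X, Z_h, W, D_h, s, l, tau)
        f_h = X @ u_h                          # f^h = X u^h
        F = np.hstack([F, f_h[:, None]])
        U = np.hstack([U, u_h[:, None]])
    alpha, gamma = cox_fit(np.hstack([F, Z]))  # final Cox regression on [F^H, Z]
    beta = U @ alpha                           # back to variable coefficients, Eq. (15)
    return F, U, beta, gamma
```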

2.6 Model assessment

As said earlier, when s < 1, the components are supervised, hence non-exogenous, which requires that their assessment be made through cross-validation.

2.6.1 Cross-validation quality coefficient

To each triplet (s, l, H) we associate a corresponding model M. For a given model M, the cross-validation quality coefficient (CV) is calculated according to the technique proposed for the proportional hazard model by van Houwelingen et al. (2006): the sample is split into K parts, and for each part k, we calculate:

CV_k(M) = l(θ_{−k}(M)) − l_{−k}(θ_{−k}(M))   (18)

where θ = (β_0, β, γ), l_{−k} is the log-partial likelihood excluding part k of the data, and θ_{−k}(M) is the θ(M) estimated on the non-left-out data. The CV coefficient of model M is then obtained by summing the CV_k(M)'s over the K parts.
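A sketch of this cross-validated quality coefficient, assuming hypothetical helpers fit_model (estimating theta on a subset of subjects) and log_partial_lik (evaluating the log partial likelihood of a given theta on a subset), could be:

```python
import numpy as np

def cv_quality(folds, fit_model, log_partial_lik, data):
    """CV coefficient of section 2.6.1 (van Houwelingen et al., 2006):
    sum over folds of l(theta_{-k}) - l_{-k}(theta_{-k}), Eq. (18)."""
    all_idx = np.arange(len(data))
    cv = 0.0
    for left_out in folds:                     # each fold: indices of left-out subjects
        kept = np.setdiff1d(all_idx, left_out)
        theta_mk = fit_model(data, kept)       # theta_{-k}
        cv += log_partial_lik(theta_mk, data, all_idx) - log_partial_lik(theta_mk, data, kept)
    return cv
```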


2.6.2 Obtaining good values for the tuning-parameters

We shall, from now on, use the variable powered inertia to measure structural relevance. The tuning parameters are many (s, l, τ, H), so that using cross-validation to compare all their combinations on a cross-product grid is out of the question in practice. Therefore, heuristics are needed. It is important to note that, even if these parameters have different purposes, which can to some extent be served sequentially, they are not completely independent. We propose to momentarily separate the exploratory goal from the predictive one, these two goals being served in this very order. Indeed, contrary to prediction, exploration does not require regularization, which simplifies tuning, and should lead to interpretable explanatory dimensions which are then used for reliable prediction.

Exploratory phase: The primary key-parameter to be tuned in this stage is s (importance of structural relevance with respect to partial likelihood), which entails the approximate number H of components to be retained ultimately. Choosing s = 0 (pure goodness-of-fit maximization) should lead to a single component: the predictor of the classical Cox regression. At the other end, s = 1 leads to exogenous components that do not take the goodness of fit into account, which means that many such components may be necessary to correctly fit the model. The higher s is, the higher H should be. In practice, we propose the following heuristic way to identify the explanatory dimensions. First, take a low value of τ, strictly positive to ensure identification but producing only minimal regularization, e.g. τ = 10^{-2}. Then:

(a) Perform principal component analysis on X (i.e. with s = 1 and l = 1), and choose H so that principal components with rank > H may be considered noise. Perform Cox's regression on the H principal components. These being exogenous, the standard significance tests are valid. Downsize H to the highest rank of a significant component, which is its maximum possibly interesting value.

(b) Perform supervised-component Cox regression with a decreasing sequence of s values starting slightly below s = 1, e.g. 0.95, using a reasonable step (e.g. 0.05). This gradually lets the goodness-of-fit into the criterion. For each value of s, use the p-values of Cox's regression on the components as a mere descriptive indicator of their roles. These p-values are not taken here as rigorous inference values, which they are not because components are not exogenous when s < 1, but as "descriptive indicators" of the potential modeling value of the components. Keep a value of s giving interpretable components with reasonably low p-values.

The secondary key-parameter to be tuned in this stage is l (locality of variable-bundles components should align on).

(c) For the retained value of s, start raising l gently (integer values are sufficient): l ∈ {1, 2, 3, 4, ..., 10, ...}. Rather soon, one notices on the scatterplots that, if the interpretation of components was enhanced by an early raise of l, raising it further only changes them negligibly. For the few retained values of l, examine the p-values of the components and check that an easier interpretation is not paid for by a non-negligible rise in p-values (it may even decrease them).

Predictive phase: We now have a good idea of the values to give to s, l, H. These may still change slightly with the value of τ (tuning regularization), but not dramatically.

(d) Start raising τ gently (e.g.: τ ∈ {.1, .2, .3, .4, ..., 1} ). For each value, calculate the CV coefficient for models ranging from h = 1 to H components and select the number of components H*(τ) giving the best CV, then denoted CV*(τ).

(e) Select the value of τ leading to the best CV*(τ) , and try to fine-tune τ so as to yet improve it.

(f) In the process, check that the interpretation of components is not altered, which may be the case with large values of τ. Should this happen, either decrease τ, or re-interpret the components.
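Steps (d) and (e) amount to a small grid search over τ and the number of components. A sketch, assuming a hypothetical helper fit_and_cv(tau, h) that returns the CV coefficient of an SCCoxR model with h components and regularization tau, is:

```python
import numpy as np

def tune_tau_and_H(taus, H_max, fit_and_cv):
    """Grid search of steps (d)-(e): best (tau, H*) according to the CV coefficient."""
    best = (None, None, -np.inf)                      # (tau, H*, CV*)
    for tau in taus:
        cvs = [fit_and_cv(tau, h) for h in range(1, H_max + 1)]
        h_star = int(np.argmax(cvs)) + 1              # H*(tau)
        if cvs[h_star - 1] > best[2]:
            best = (tau, h_star, cvs[h_star - 1])
    return best
```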

3 Simulation study

The simulation scheme is intended to show how the tuning of parameters s, l and τ influences the estimation results, which gives some insight into their roles and how to use them in practice.

3.1 Simulation scheme

We defined the time-span as [0, 30], divided into 30 unit-length elementary intervals. Over [0, 30], a baseline hazard function was simulated as:

h_0(t) = a + b(t − t_m)²,   with t_m = 12, a = .2, b = 10^{-3}   (19)

Time-dependent covariates describing n = 75 subjects were then simulated with bundle-structures, as follows:

Firstly, three independent latent variables taking time-independent values at subject-level were simulated: ψ^j_i ∼ N(0; 1), j ∈ {1, 2, 3}, i ∈ {1, ..., 75}. Secondly, three independent latent variables were simulated on the 75 × 30 (subject, elementary interval) pairs: φ^j_{it} ∼ N(0; 1), j ∈ {1, 2, 3}, i ∈ {1, ..., 75}, t ∈ {1, ..., 30}. Thirdly, these variables were combined into three latent variables directing the covariate-bundles:

∀(i, t, j): ξ^j_{it} = ψ^j_i + φ^j_{it}   (20)

Revolving around the latent variables ξ, "observed" covariates were simulated as follows: ∀j ∈ {1, 2, 3}: bundle B_j = {x^{jk}, k = 1, ..., K_j} with x^{jk}_{it} = ξ^j_{it} + ε^k_{it}, ε^k_{it} ∼ N(0; σ²), with σ = .3. We took K_1 = 4, K_2 = 6, K_3 = 10.

Finally, 20 independent pure noise covariates were simulated as N(0;1). Covariates were renumbered as:

B_1 = {x^1, ..., x^4};  B_2 = {x^5, ..., x^10};  B_3 = {x^11, ..., x^20};  Noise = {x^21, ..., x^40}

The linear predictor was calculated as the following function of the latent variables ξ^j:

∀(i, t): η_{it} = .25 + ξ^1_{it} − .5 ξ^2_{it}   (21)

Hence the hazard function of each subject i: h_i(t) = h_0(t) e^{η_{it}}.
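For reference, this simulation scheme can be reproduced with a few lines of numpy; the random seed and the use of interval mid-points for the baseline hazard are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 75, 30                                  # subjects, unit-length elementary intervals
K = [4, 6, 10]                                 # bundle sizes K_1, K_2, K_3
sigma = 0.3

psi = rng.normal(size=(n, 1, 3))               # subject-level latents psi_i^j
phi = rng.normal(size=(n, T, 3))               # (subject, interval)-level latents phi_it^j
xi = psi + phi                                 # latent bundle directions, Eq. (20)

# observed covariates: K_j noisy copies of xi^j per bundle, plus 20 pure-noise variables
bundles = [xi[:, :, j:j + 1] + sigma * rng.normal(size=(n, T, K[j])) for j in range(3)]
noise = rng.normal(size=(n, T, 20))
X = np.concatenate(bundles + [noise], axis=2)  # shape (75, 30, 40)

eta = 0.25 + xi[:, :, 0] - 0.5 * xi[:, :, 1]   # linear predictor, Eq. (21)

tm, a, b = 12, 0.2, 1e-3
t = np.arange(T) + 0.5                         # mid-points of the elementary intervals
h0 = a + b * (t - tm) ** 2                     # baseline hazard, Eq. (19)
hazard = h0 * np.exp(eta)                      # h_i(t) = h_0(t) exp(eta_it)
```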

Note that: 1) Within every variable-bundle B_j, correlation is very high (0.96), which precludes standard Cox regression. 2) The bundle playing the major role is B_1, which is only the third in size, and the latent variable structuring it, ξ^1, has a positive coefficient. 3) The bundle playing the secondary role is B_2, which is the second in size, and the associated latent variable, ξ^2, has a negative coefficient. 4) ξ^3 does not play any part in the hazard. The associated bundle B_3 is thus a decoy-structure, and it is the first in size, weighing the same as B_1 and B_2 together. 5) Pure noise outnumbers every bundle, weighing the same as all bundles together.

The challenge is to identify B_1 as the primary explanatory structure, with positive effect, and B_2 as the second, with negative effect.

Independent right-censoring was simulated according to an exponential distribution (constant hazard rate) so that 21 subjects were censored before undergoing the event, and 54 underwent the event.

3.2 Estimation results

The parameter-varying scheme we follow is the one proposed in section 2.6:

• We start with s = 1 (taking no account of the goodness-of-fit), and see how gradually taking the goodness-of-fit into account by decreasing s changes the results.

• Parameter l is regarded as a secondary tuning-parameter which may improve the interpretation of components when raised, once a reasonably good value of s has been found.

• Parameter τ is kept at the low value τ = 10^{-2}, merely to ensure identification in the presence of high collinearities; we vary τ only in the prediction stage.

Let us now see the evolution of the results when we follow this parameter-varying scheme.

Figures 2 to 7 show the correlation-scatterplots of variables with components for a few combinations of (s, l) and τ = 10^{-2}, together with the coefficients and p-values given by Cox's regression on the components.

s = 1, l = 1: as shown in figs. 2 and 3, supervised-component Cox regression gives back X's principal component analysis, with component 1 aligning B_3, component 2 aligning B_2 and component 3 aligning B_1.

Decreasing s to .95 has a dramatic effect: f¹ captures important parts of B_1 and B_2 with opposite effects (fig. 4). But B_3 still has a non-negligible projection in the plane (f¹, f²). Raising l to 4 allows the plane to focus on B_1 and B_2 exclusively (fig. 6). Now, fig. 7 shows that f³ aligns B_3 perfectly, and that f⁴ aligns the noise-variable x^39, which is obvious over-fitting, yet given away by a high p-value.

Figure 2: Correlation scatterplot a: s = 1, l = 1. f1: coefft = -.03, p = .83; f2: coefft = -.42, p = .004


Figure 3: Correlation scatterplot b : s = 1, l = 1


Figure 4: Correlation scatterplot c : s = .95, l = 1


Figure 5: Correlation scatterplot d : s = .95, l = 1

Figure 6: Correlation scatterplot e: s = .95, l = 4. f1: coefft = -1.92, p < 2.e-16; f2: coefft = -.27, p = .068

The cross-validation performance graph associated with figure 6 is given in figure 8. It shows that the model with two components should be preferred.

Figure 7: Correlation scatterplot f : s = .95, l = 4


Figure 8: CV coefficient against number of components, for s = 0.95, l = 4, τ = 1e − 2, and 20 repetitions of a 3-block partition of data.


Figures 9 and 10 give correlation scatterplots of variables with components, according to the values of s and l, with low s and τ = 10^{-2}. We can see that, at the other end of the continuum, s = 0 yields a first component equal to the maximum likelihood estimation of the linear predictor (fig. 9). We witness on it the opposite roles of B_1 and B_2, but the component is not so close to the direction of either, and is likely to contain over-fitting. Component f² is not really determined. Starting to take some structural relevance into account by raising s to .1 brings back B_3 along component f², but does not change f¹ much, which remains essentially determined by the maximum likelihood estimation.

3.2.1 The regularization step

Once we retain a suitable (s, l) pair, e.g. here (.95, 4), we can tune τ so as to regularize the coefficient-vector, with intent to lessen fitting on noise, and so distribute the coefficients more evenly among highly correlated variables. Using the estimated coefficients of the first two components calculated each time, we get the estimated linear predictor η̂, and calculate its correlation with the actual simulated η. The results are given in table 1 and show regularization at work.

Figure 9: Correlation scatterplot g : s = 0


Figure 10: Correlation scatterplot h : s = .1, l = 1


Table 1: Coefficients of variables in components f1, f2, and correlation of estimated predictor η̂ with the actual one η, for increasing τ (s = .95, l = 4).

            τ = 0        τ = .1       τ = .3       τ = .5       τ = .7
            f1    f2     f1    f2     f1    f2     f1    f2     f1    f2
B1  x1    -.19   .04   -.22   .09   -.26   .08   -.30   .08   -.35   .09
    x2    -.26   .00   -.25   .05   -.27   .06   -.31   .07   -.35   .08
    x3    -.38   .00   -.31   .05   -.30   .05   -.32   .07   -.36   .08
    x4    -.13   .03   -.22   .04   -.26   .06   -.30   .07   -.35   .09
B2  x5     .18   .02    .11   .19    .07   .20    .07   .02    .07   .02
    x6     .20  -.02    .10   .17    .07   .19    .07   .02    .07   .02
    x7     .43   .05    .19   .21    .11   .21    .08   .02    .08   .03
    x8    -.12  -.02   -.03   .15    .01   .18    .03   .02    .05   .02
    x9    -.31  -.02   -.09   .14   -.01   .17    .02   .02    .04   .02
    x10   -.16   .00   -.03   .16    .02   .18    .03   .02    .05   .02
B3  x11    .20  -.13    .13   .02    .06   .02    .04   .01    .03   .01
    x12    .24  -.10   -.17  -.02   -.07  -.01    .04  -.01   -.02  -.01
    x13   -.42  -.14   -.04   .00   -.02   .00   -.01   .00    .00   .00
    x14   -.10  -.13   -.06   .01   -.03   .00   -.02   .00   -.01   .00
    x15   -.15  -.13   -.05  -.01   -.02   .00   -.01   .00    .00   .00
    x16   -.15  -.14    .10   .00    .04   .01    .03   .00    .01   .00
    x17    .19  -.13    .11   .01    .05   .01    .03   .00    .02   .00
    x18    .23  -.11   -.06   .02   -.06   .01   -.06   .01   -.06   .01
    x19   -.06   .00   -.03   .25   -.03   .01   -.03   .00   -.02  -.01
    x20   -.03   .00   -.03  -.02   -.03   .01   -.03  -.01   -.02  -.01
ρ(η, η̂)     .9480         .9653        .9723        .9772        .9817

4 Application to real data

The supervised-component Cox regression method will now be used to study the advent of polygamy among men living in Dakar.

Poverty in Senegal is high. Dakar gathers about one fourth of the Senegalese population and attracts substantial migration. The society in Dakar is Muslim, and women are dominated by men to a notable extent. The 2002 census revealed that, among men, polygamy is on the increase mostly from the age of 45. Among married men living in Dakar aged 45 to 49, the percentage of polygamous men is over 20%. Among those aged 60 and over, this percentage is rather stable, ranging between 30% and 40%.

The mechanisms making polygamy possible have been studied and are well-documented (Pilon, 1991). The existence of polygamy rests upon an important age-gap at marriage between men and women: women enter marriage much younger than men. The frequent and quick remarrying of divorced women and widows also favors this practice. Among other things, polygamy allows the man to maximize his posterity (Chojnacka, 2000). Having many descendants means benefitting from substantial manpower. It also means hoping for better assistance in old age: in case of health problems, it raises the chances of being looked after by at least one wife and her children (Møller and Welch, 1990; Gning and Antoine, 2015). In that, polygamy may serve as a safety net. Finally, polygamy is, for a man, a way to reconcile the preferences of the group (social choice), when his parents impose a spouse on him, with his own (individual choice).

We are going to confront these elements with the results of a first and somewhat naive application of supervised-component Cox regression. Our intention here is not to complete a thorough analysis of the data, but to provide a first illustration of how the supervised-component Cox regression method works, how it should be used, and the kind of help it may give to the data analyst.

Supervised-component Cox regression is applied to the life-histories of 222 married men born before 1967 and living in Dakar. The data comes from the 2001 retrospective survey conducted by Philippe Antoine (IRD) and Abdou Salam Fall (IFAN-UCAD), "Crise, passage à l'âge adulte et devenir de la famille dans les classes moyennes et pauvres à Dakar" (Antoine and Fall, 2002), partially funded by the CODESRIA. Our aim is to find out the main explanatory dimensions that influence the risk of becoming polygamous by marrying a second wife. The survival time is the duration of the first union in the monogamous state. The terminal event is marrying a second wife. Death of the first wife and divorce were considered independent right-censoring, just as was the date of the survey. Of course, divorce is likely not to be independent of the event under study, so it should be treated as a competing event in future work. The subjects are described with 107 time-varying numerical variables, most of which are indicators of nominal variable values. All are put in matrix X, matrix Z being null here. The total number of rows of the data matrix (number of (subject, date) pairs) is 2293, and 55 events were observed. The high number of variables and their collinearities make direct unregularized Cox regression impossible. Selection of variables by means of mere intuition is tricky and may be opposed on the grounds that the many confusion effects change the results according to the variables selected; so, at least, variable-selection should lean on some solid criterion. Cross-validated supervised components can be used as such a criterion.

To illustrate the gain provided by supervised-component Cox regression, we first performed Cox regression on principal components, which is one end of the continuum spanned by SCCoxR, associated with s = 1. Then, at the other end of the continuum (s arbitrarily close to 0), we performed Cox regression on the whole explanatory subspace, in order to see whether the linear predictor is close to an interpretable dimension. In view of the findings provided by these two attempts, we finally tune the parameters of SCCoxR so as to track down interpretable explanatory dimensions.

4.1 Cox regression on the first 8 principal components

Cox regression on principal components is obtained by setting s = 1 and l = 1. The only principal component which has a really low p-value (p = 0.002) is the 5th, all other components having a p-value above 0.05. Component 5 has a shrinking effect on the risk (exp(β) = 0.78). The correlation scatterplot with components 4 and 5 (fig. 11) shows that component 5 is positively, but only fairly, correlated with the variable age-gap (= age of ego − age of first wife). Cox regression on principal components does not help us to track interpretable directions that are predictive of the risk, except perhaps the age-gap's.

4.2 Cox regression on the full explanatory space

At the other end of the continuum is the value s = 0. Setting s = 0 is forbidden here because of the exact linear dependence of some variables, e.g.: number of sons + number of daughters = number of children. But by taking s very small in (0; 1], e.g. s = 1e−3, we get a supervised component 1 almost aligning the linear predictor. Fig. 12 shows that the latter is close to no observed variable in particular. If s were equal to 0, we would not get a second component, since the goodness-of-fit would be maximized at once by the first one.


Figure 12: Correlation scatterplot with supervised components (1,2) for s = 1e − 3, l = 1, τ = 1

Since we took a small s, component 2 is identified, but almost random, and plays no role in the prediction. By taking τ = 1, we ensured regularisation of the linear predictor, i.e. that confusion be minimized between correlated variables. So, the signs of the coefficients better reflect the effects of the associated variables, provided their absolute value is not too low. These coefficients are given in line β of table 3.

They should a priori not be considered completely reliable, insofar as they are not based on strong structural dimensions of the explanatory variables.

Neither of these extreme choices has provided substantial insight as to the main dimensions influencing the risk. Let us track these by fine-tuning the parameters, as in the simulations: first s, then l, and finally τ.

4.3 Fine-tuning supervised-component Cox regression

As in the simulations, we kept l = 1 and started decreasing s from s = 1 to s = 0.95, then 0.9, 0.85 and 0.8. The best results were obtained for s = 0.9, with p-values (viewed as descriptive indicators) of components 2 to 5 lower than 1e−5. Then, in order to improve the interpretability of the supervised components by dragging them towards close variable-structures, we raised l from 1 to 8.

Cox regression on the first 10 supervised components obtained shows that components 2 to 5 have a p-value lower than 1e-4, and component 8 has a p-value lower than 1e-2. In order to assess the real predictive power of the components, we perform cross-validation on models built with a number of supervised components ranging from 1 to 10.

Figure 13: CV coefficient against number of components for s = 0.9, l = 8, τ = 1

The associated CV coefficient curve can be seen in figure 13. It indicates that 5 components give the optimum prediction, every extra component contributing more to overfitting than to fitting. Nevertheless, the loss between 5 and 6 components being still small, we will use 6 components for exploration, because the 6th component might still help reveal variables playing some role.

Figure 14: Correlation scatterplot with supervised components (1,2) for s = 0.9, l = 8, τ = 1

The correlation scatterplots of the supervised components with the variables are examined next. These plots are given in figures 14 to 16.

Component 1 is highly negatively correlated with the size of offspring, and positively with "No son", "No daughter" and "No child".

Figure 15: Correlation scatterplot with supervised components (3,4) for s = 0.9, l = 8, τ = 1


Figure 16: Correlation scatterplot with supervised components (5,6) for s = 0.9, l = 8, τ = 1

Component 2 sets out secondary education, to which it is highly positively correlated. It is negatively, but poorly, correlated with low education levels. Figure 15 shows an interesting plane, spanned by components 3 and 4. Its interpretation is wholly connected to the places of birth and infancy, with a triangle opposing the three factor-levels: Dakar ("Dak"), other city ("urb"), rural zone ("rur"). As in most situations where the variable-vectors illustrating the components sum up to zero (this is the case of the three birth-place levels, and of the three infancy-place ones), the variables do not align the components really well, but the plane they span is clearly related to the risk. Finally, component 5 aligns the age-gap, revealing its role in polygamy, and component 6 is loosely related to the kinship of the first spouse to ego.

At this stage, we may use the components to try and select promising subsets of explanatory variables for a standard Cox regression, bearing in mind that, in doing so, we may lose valuable predictive information, since predictive components combine many more variables to be efficient. The estimated effects of some variable-subsets, with corresponding p-values, are given in table 2.

Supervised components do seem to help variable-selection. Yet, one can see that some confusion remains: the p-values of effects sometimes change dramatically when a variable (e.g. "Infancy in Dakar") is replaced with a highly correlated one ("Birth place Dakar"). Variable-selection prior to Cox regression can therefore be misleading.

Table 2: Results of Cox regression on three variable-subsets selected through correlation with supervised components

Subset 1:
Variable     β        exp(β)   p
Son 0        0.6590   1.9329   0.0553
Edu Sec     -0.6567   0.5186   0.0315
Inf Dak     -0.5452   0.5797   0.0697
AgeGap      -0.0970   0.9075   0.0012
ParSp1 No   -0.9242   0.3968   0.0016

Subset 2:
Variable     β        exp(β)   p
Son 0        0.7193   2.0531   0.0362
Edu Sec     -0.6468   0.5237   0.0343
BP Dak      -0.3060   0.7364   0.3058
AgeGap      -0.0975   0.9071   0.0014
ParSp1 No   -0.9188   0.3990   0.0016

Subset 3:
Variable     β        exp(β)   p
NumChild    -0.1309   0.8773   0.0884
Edu Sec     -0.6868   0.5032   0.0250
Inf Dak     -0.6182   0.5389   0.0371
AgeGap      -0.0964   0.9081   0.0012
ParSp1 No   -0.9459   0.3883   0.0012

Components are combinations of variables, and the proper way to use them is as follows: select the best number of supervised components through cross-validation; then, as the components are regularized combinations of variables, use them to calculate the corresponding β, and interpret the signs and magnitudes of the largest effects. In table 3, line β5 displays the coefficients calculated on the basis of the first 5 components. These are the ones to be used for the interpretation of variable-effects, because the corresponding linear predictor is based on the best predictive component-model. The linear predictor has been regularised so as to minimise confusion between the effects of correlated variables. One consequence of this is that the signs of the betas can overall be trusted, at least if the absolute value is not negligible.

Bootstrapping can of course provide confidence intervals associated with the coefficients, but it is computationally very costly.

From the coefficients in table 3, it seems possible to calculate relative risk ratios. For instance, the risk-ratio of a man having spent his infancy in the country relative to a man having the same characteristics except having spent his infancy in Dakar would be:

r(Inf rur / Inf Dak) = exp(β5_{Inf rur} − β5_{Inf Dak}) = exp(.059 + .12) = 1.20   (22)

But one should never forget that this "having the same other characteristics" is sheer fiction: in the situation being dealt with, variables may be linked up to perfection. Picture a variable denoted a, linked to no other, and having coefficient β_a in the estimated model. Now, picture duplicating a into a1 and a2 in a new model, with all other variables unchanged. Then, regularization will entail that the former effect of a be distributed evenly over a1 and a2, so that their coefficients in the estimated new model will both be equal to β_a/2. So, one must keep in mind the statistical links between explanatory variables before trying to interpret the magnitude of these relative risk ratios.

For a binary variable without its complement, or for a quantitative variable, the rule would be the same as in Cox regression. For instance, the risk-ratio of a man having given his consent for his first marriage, relative to a man not having given it but sharing all other characteristics, would be:

r(Consent / no Consent) = exp(β5_{Consent}) = 0.93   (23)

Likewise, the effect of one extra year of age-gap, with all other characteristics kept equal, would be to multiply the risk by:

r(AgeGap) = exp(β5_{AgeGap}) = 0.66   (24)

We shall now interpret the signs and sometimes the magnitudes of the coefficients of the variables linked to the first dimensions found by supervised-component Cox regression. We chose to include component 6 because it is illustrated by a kinship variable, and even if on average it tends to produce overfitting, there are many cross-validation samples on which it contributes to improving the CV coefficient. As a general rule, we advise considering a few extra components after the optimum, provided the fall in CV stays moderate and such components are reasonably correlated to some variables.

Linked to component 1 are the offspring variables. On the whole, the effect of the number of children on the hazard-rate is negative. Having no son appears to have a special effect, increasing the risk with respect to all other numbers of sons. This is not the case with daughters.

Linked to component 2 are the ego-education variables, and more specifically secondary education. Indeed, the coefficients show that its effect relative to all other levels is to decrease the risk.

Table 3: Coefficients of variables in the regularized linear predictor obtained for: s = 1e−3, l = 1, τ = 1, 1 component (β), and s = .9, l = 8, τ = 1, 5 components (β5).

       Nat Sene   Nat GuiB   Nat GuiC   Nat Mali   Nat Beni   FDcsd      MDcsd      FMDivor    MarRank    Consent
β                  6.25e-2    2.23e-2   -4.44e-2   -5.04e-2   -2.03e-2    1.28e-1   -5.59e-2   -2.97e-17  -1.12e-1
β5      6.34e-3    8.68e-2   -1.40e-2   -8.92e-2   -2.29e-2   -3.34e-2    1.50e-1   -7.18e-2    8.44e-16  -7.46e-2

       AgeGap     Edu Non    Edu Cor    Edu Prim   Edu Sec    FEdu Non   FEdu Cor   FEdu Prim  FEdu Sec   FEdu NA
β                  3.66e-2    5.41e-2    3.33e-2   -9.89e-2   -8.90e-2    2.00e-1   -6.04e-2   -4.70e-2   -1.15e-1
β5     -4.14e-1    6.35e-2    5.60e-2    6.12e-2   -1.43e-1   -1.03e-1    1.54e-1   -2.44e-2   -2.46e-2   -7.72e-2

       MEdu Non   MEdu Cor   MEdu Prim  MEdu Sec   MEdu NA    Eth Wolof  Eth Pular  Eth Serer  Eth Diola  Eth oth
β      -1.27e-1    6.89e-2    5.08e-2    6.15e-2    6.46e-2    7.83e-2   -4.33e-2    1.40e-2   -7.12e-2   -1.73e-2
β5     -1.09e-1   -1.38e-2    1.01e-1    9.45e-2    1.13e-1    9.33e-2   -8.39e-2    2.92e-2   -5.29e-2   -2.20e-2

       Rel Tidj   Rel Muri   Rel othMus Rel Chri   AgMar1 16 24  AgMar1 25 29  AgMar1 30 34  AgMar1 35 46  ChMar1 Man  ChMar1 Mut
β      -6.97e-2    6.75e-2    1.21e-1   -1.33e-1    1.76e-1       1.02e-1      -2.01e-1      -1.54e-1      -2.04e-2    -4.81e-2
β5     -9.08e-2    4.29e-2    1.41e-1   -8.54e-2    2.21e-1       1.34e-1      -1.76e-1      -3.00e-1      -3.79e-3    -3.69e-2

       ChMar1 Par ParSp1 Pat ParSp1 Mat ParSp1 No  AgSpo NA   AgSpo 13 16  AgSpo 17 19  AgSpo 20 24  AgSpo 25 37  BP Dak
β       8.68e-2    7.99e-2    1.55e-1   -2.01e-1   -8.62e-2    1.40e-1      1.01e-2     -6.69e-2     -5.35e-2     -8.73e-2
β5      5.16e-2    1.36e-1    1.96e-1   -2.83e-1   -1.07e-1    8.90e-2     -3.05e-2     -4.63e-2      4.02e-2     -1.06e-2

       BP rur     BP urb     Inf Dak    Inf rur    Inf urb    Spo Bach   Spo Mar    WSp1 HousW WSp1 Stud  WSp1 Sal
β       1.39e-1   -5.27e-2   -1.60e-1    1.32e-1    4.30e-2    2.66e-2   -2.66e-2    2.45e-2   -9.17e-2   -6.52e-2
β5      6.25e-2   -5.63e-2   -1.23e-1    5.89e-2    7.81e-2    2.07e-2   -2.07e-2    1.25e-2   -9.29e-2   -4.98e-2

       WSp1 Craf  WSp1 Trade WSp1 Agri  WSp1 NA    Occ Inf    Occ Sal    Occ App    Occ ind    Occ Stu    Occ Ret
β       7.14e-2    5.79e-2    2.50e-1   -6.26e-2   -4.34e-3    1.33e-1   -8.83e-2   -5.14e-2   -3.88e-2   -9.14e-2
β5      6.62e-2    8.13e-2    1.88e-1   -5.31e-2   -9.67e-3    1.59e-1   -7.08e-2   -1.05e-1   -6.25e-2   -4.62e-2

       Occ Unem   Occ othInac Occ othNoInc Res Prop Res Lodg  Res Fami   Res HusPa  Res othPa  Res oth    NumSon
β       2.97e-3   -7.06e-2    -9.66e-2     2.14e-2  -8.62e-2   1.42e-2    4.05e-2    1.14e-1   -8.93e-2   -5.52e-2
β5      2.18e-2   -4.22e-2    -7.80e-2     2.81e-2  -7.58e-2   6.02e-2    6.22e-2    7.61e-2   -1.33e-1   -4.02e-2

       NumDau     Son 0      Son 1      Son 2      Son 3      Son 4      Son 5o     Dau 0      Dau 1      Dau 2
β      -4.02e-2    1.04e-1   -5.42e-2   -5.89e-2   -2.27e-2   -3.92e-2    5.09e-2    1.54e-2   -1.21e-1    1.64e-1
β5     -3.86e-2    6.04e-2   -6.19e-2   -2.54e-2    3.06e-2   -2.17e-2    1.39e-2   -3.00e-3   -7.63e-2    1.41e-1

       Dau 3      Dau 4      Dau 5o     NumChild   Chil 0     Chil 1     Chil 2     Chil 3     Chil 4     Chil 5o
β       5.14e-2   -8.42e-2   -8.53e-2   -5.81e-2    4.90e-2    1.16e-2   -4.37e-2    9.76e-2   -1.44e-1    3.41e-3
β5      3.66e-2   -8.44e-2   -7.17e-2   -4.81e-2   -8.52e-3    1.43e-2   -2.26e-2    1.29e-1   -1.35e-1    7.13e-3

       ChilOU 0   ChilOU 1o  AgGap 0 3  AgGap 4 7  AgGap 8 12 AgGap 13 24 MarCert
β      -1.74e-2    1.74e-2    1.21e-1   -5.27e-2    1.47e-1   -2.21e-1    -1.38e-1
β5     -3.47e-2    3.47e-2    1.96e-1    2.50e-2    1.37e-1   -3.81e-1    -1.55e-1

For example, the relative risk ratio of a man having had a secondary education, relative to a man with no education and all other characteristics identical, would be:

r(Edu Sec / Edu Non) = exp(β5_{Edu Sec} − β5_{Edu Non}) = exp(−0.143 − 0.0635) = 0.81   (25)

Men having had a secondary education or above tend to enter marriage later and with a partner they chose. Their conception of marriage is less traditional. This may account for a lesser risk of becoming polygamous.

Linked to components 3 and 4 are the places of birth and infancy. Of course, these variables being strongly linked, their effect on the risk will tend to be shared and distributed evenly across the very correlated ones. As a consequence, one cannot for instance speak of the effect of "infancy in Dakar" with respect to "infancy in a rural zone" with all other characteristics identical, because infancy is much linked to place of birth, so that their effect is common to a large extent. Where standard Cox regression, like any non-regularized regression, tries to separate the effects of explanatory variables (thus having severe problems when these are not separable), supervised-component Cox regression does so all the less as variables are correlated. The effects of uncorrelated explanatory variables will indeed be their "proper" effect, just as in plain Cox regression, but the effect estimated for each of a set of correlated explanatory variables will contain a share of their "common" effect. How do we make use of that here? By considering the bundle {Inf rur, BP rur} as a whole and opposing it to {Inf Dak, BP Dak}, for instance. The relative risk ratio of a man born and raised in a rural zone with respect to one born and raised in Dakar, with all other characteristics identical, would then be:

r(Rural / Dakar) = exp(β5_{BP rur} + β5_{Inf rur} − (β5_{BP Dak} + β5_{Inf Dak})) = exp(0.052 + 0.058 + 0.011 + 0.12) = 1.29   (26)

Likewise, the relative risk ratio of a man born and raised in a city other than Dakar with respect to one born and raised in Dakar, with all other characteristics identical, would be 1.17.

Of course, if the variables are not strictly linked through a logical constraint, it is always conceivable to vary the situation. For instance, the relative risk ratio of a man born and raised in a rural zone with respect to one born in a rural zone but raised in Dakar, with all other characteristics identical, would then be:

r(Inf rural / Inf Dakar) = exp(β5_{Inf rur} − β5_{Inf Dak}) = 1.20   (27)

Component 5 is very strongly correlated to the age-gap (age of ego − age of first wife). The corresponding relative risk has been given above: every extra year in age-gap, all other characteristics kept equal, would roughly cut down the risk by one third. This may be linked to the "safety net" approach to polygamy, namely the man's wish to increase his offspring when his first wife, being close to him in age, is getting old or unwilling to bear more children, or his wish to found a younger family to support him in his old age.

Component 6 is related to the kinship of ego to his first wife. The relative risk ratio of a man having a first wife unrelated to him, with respect to a man having one related through his mother, is exp(β5_{ParSp1 No} − β5_{ParSp1 Mat}) = 0.62, and the ratio is 0.66 when the first spouse is related through ego's father. A kinship between spouses is very often the sign of a prescribed marriage, which may easily lead to polygamy because, if it is difficult to part from a spouse more or less imposed by the family, ego can make up for it by marrying a second wife more freely chosen, most often unrelated.

It must be emphasized that we have given these ratio calculations for the mere sake of illustration. The reader will naturally be aware that such calculations should not be taken at face value. They merely indicate tendencies obtained here on a rather small sample, and always under the costly hypothesis of "all other characteristics kept identical".

Supervised components thus help combine variable-selection with dimension reduction, but we only demonstrated the first step of the analytical process. In a second step, the selected variables should be taken as extra-covariates Z, and SCCoxR relaunched so as to track down further complementary predictive dimensions, if any. And so on until the predictive dimensionality is exhausted.

5 Conclusion

The supervised-component methodology is a flexible method bridging data exploration with model estimation. The relative importance of the structural relevance of components with respect to the likelihood of the supervising model can be continuously tuned. So can the locality of the variable-bundles the model should lean on. The third parameter tunes the regularization of the model coefficients, hence the robustness of the linear predictor. We have adapted this methodology to Cox's proportional hazard model, but it can deal with the likelihood of any counting process with an intensity involving a linear combination of covariates. The simulation study has shown how the method behaves and how to tune its parameters. Useful extensions could include random effects, at the cost of a more complex estimation algorithm. This would in particular enable multi-level modelling. Good results have already been obtained in the Generalized Linear Model framework (Chauvet et al., 2016), and application to counting processes is currently considered.

References:

Andersen P.K., Borgan O., Gill R.D., Keiding N. (1993): Statistical Models Based on Counting Processes, Springer Series in Statistics -Springer Verlag.

Antoine P., Fall A-S. (dir.) (2002). Crise, passage à l'âge adulte et devenir de la famille dans les classes moyennes et pauvres à Dakar. Intermediate report to the CODESRIA, Dakar, IRD-IFAN.

Bastien P., Esposito Vinzi V., and Tenenhaus M. (2005). PLS generalised linear regression. Computational Statistics & Data Analysis, 48: 17-46.

Breslow N. E., Crowley J. (1974). A large-sample study of the life table and product limit estimates under random censorship. Annals of Statistics 2: 437-454.

Bry X., Antoine P. (2004). Explorer l'explicatif ; application à l'analyse biographique. Population-F 59(6) : 909-945 / Exploring explanatory models ; an event history application. Population-E 59(6).

54(3).

Bry X., Verron T. (2015). THEME: THEmatic Model Exploration through Multiple Co-Structure maximization. Journal of Chemometrics, 29 (12): 637-647.

Bry X., Simac T., Verron T. (2016). Supervised-Component based Cox Regression, COMPSTAT 2016, Proceedings, Springer.

Chauvet J., Bry X., Trottier C., Mortier F. (2016). Extension to mixed models of the Supervised Component-based Generalised Linear Re-gression, COMPSTAT 2016, Proceedings, Springer.

Chojnacka H. (2000). Early marriage and polygyny : feature character-istics of nuptiality in Africa, Genus, LVI, 3-4: 179-208.

Collett D. (2003). Modelling Survival Data in Medical Research (2nd ed.). Boca Raton: CRC. ISBN 1584883251.

Courgeau D., Lelièvre E. (1989). Analyse démographique des biographies, INED, Paris.

Cox D.R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society, Series B., 34(2): 187-220.

Cox D.R. (1975): Partial Likelihood, Biometrika, 62: 269-276.

Cox, D. R., Oakes, D. (1984). Analysis of Survival Data. Chapman & Hall, New York. ISBN 041224490X.

Mondes en développement (171): 31-50.

van Houwelingen H.C., Bruinsma T., Hart A.A.M., van't Veer L.J., Wessels L.F. (2006). Cross-Validated Cox Regression on Microarray Gene Expression Data. Statistics in Medicine, 25: 3201-3216.

Kalbfleisch J. D. , Prentice R. L. (1973). Marginal likelihoods based on Cox’s regression and life model. Biometrika, 60: 267-278.

Møller V., Welch J. (1990). Polygamy, economic security and well-being of retired Zulu migrant workers, Journal of Cross-Cultural Gerontology (5-3): 205-216.

Nygard S., Borgan O., Lingjaerde O., Storvold H. (2008). Partial least squares Cox regression for genome-wide data. Lifetime Data Analysis. 14(2):179-95. doi: 10.1007/s10985-007-9076-7.

Perperoglou A., le Cessie S., van Houwelingen H. (2006). Reduced-rank hazard regression for modelling non-proportional hazards. Statistics in Medicine, 25: 2831-2845

Pilon M. (1991). Contribution à l'analyse de la polygamie, Étude de la Population Africaine (5): 1-17, Dakar, UEPA.

Tibshirani R. (1997). The Lasso method for variable selection in the Cox model. Statistics in Medicine, 16 (4): 385-395.

Wold S., Ruhe A., Wold H., Dunn W.J. (1984). The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3): 735-743.

Appendix A: The gradient formulas

On every time-point t_i when an event occurs for some subject i (δ_i = 1), we have the following partial likelihood term:

p_i = e^{αx′_{it_i}u + z′_{it_i}γ} / ∑_{j∈R_{t_i}} e^{αx′_{jt_i}u + z′_{jt_i}γ}

⇒ ln p_i = αx′_{it_i}u + z′_{it_i}γ − ln( ∑_{j∈R_{t_i}} e^{αx′_{jt_i}u + z′_{jt_i}γ} )   (28)

The overall partial likelihood is:

l_p(u, α, γ) = ∏_{i: δ_i=1} p_i   ⇒   ln l_p(u, α, γ) = ∑_{i: δ_i=1} ln p_i   (29)

⇒ ln c(u, α, γ) = (1 − s) ln l_p(u, α, γ) + s ln φ(u)   (30)

⇒ ∇_u ln c(u, α, γ) = (1 − s) ∇_u ln l_p(u, α, γ) + s ∇_u ln φ(u)   (31)

with:

1) ∇_u ln l_p(u, α, γ) = ∑_{i: δ_i=1} ∇_u ln p(t_i), where

∇_u ln p(t_i) = ∇_u(αx′_{it_i}u + z′_{it_i}γ) − ∇_u ln( ∑_{j∈R_{t_i}} e^{αx′_{jt_i}u + z′_{jt_i}γ} )   (32)

= αx_{it_i} − [ ∇_u( ∑_{j∈R_{t_i}} e^{αx′_{jt_i}u + z′_{jt_i}γ} ) ] / [ ∑_{j∈R_{t_i}} e^{αx′_{jt_i}u + z′_{jt_i}γ} ] ;   (33)

∇_u( ∑_{j∈R_{t_i}} e^{αx′_{jt_i}u + z′_{jt_i}γ} ) = ∑_{j∈R_{t_i}} ∇_u e^{αx′_{jt_i}u + z′_{jt_i}γ} = ∑_{j∈R_{t_i}} e^{αx′_{jt_i}u + z′_{jt_i}γ} αx_{jt_i} ;   (34)

Putting, ∀t_i, ∀j ∈ R_{t_i}: ω_{j,t_i} := e^{αx′_{jt_i}u + z′_{jt_i}γ} and Ω_{t_i} := ∑_{j∈R_{t_i}} ω_{j,t_i}, we get:

∇_u ln p(t_i) = α ( x_{it_i} − (1/Ω_{t_i}) ∑_{j∈R_{t_i}} ω_{j,t_i} x_{jt_i} )   (35)

And finally:

∇_u ln l_p(u, α, γ) = α ∑_{i: δ_i=1} ( x_{it_i} − (1/Ω_{t_i}) ∑_{j∈R_{t_i}} ω_{j,t_i} x_{jt_i} )   (36)

2) ∇_u ln φ(u):

• Component variance: φ(u) = u′X′WXu

⇒ ∇_u φ(u) = 2X′WXu ⇒ ∇_u ln φ(u) = ( 2 / (u′X′WXu) ) X′WXu   (37)

• Variable powered inertia: φ(u) = ( (1/p) ∑_{j=1}^{p} ⟨Xu|x^j⟩_W^{2l} )^{1/l}

⇒ ∇_u φ(u) = (1/l) ( (1/p) ∑_{j=1}^{p} ⟨Xu|x^j⟩_W^{2l} )^{1/l − 1} (1/p) ∑_{j=1}^{p} 2l ⟨Xu|x^j⟩_W^{2(l−1)} X′Wx^j x^j′WXu   (38)

⇒ ∇_u ln φ(u) = 2 ( ∑_{j=1}^{p} ⟨Xu|x^j⟩_W^{2l} )^{-1} X′W ( ∑_{j=1}^{p} ⟨Xu|x^j⟩_W^{2(l−1)} x^j x^j′ ) WXu   (39)

= ( 2 / ∑_{j=1}^{p} ϖ_j^l ) X′W ( ∑_{j=1}^{p} ϖ_j^{l−1} x^j x^j′ ) WXu,   with ϖ_j := ⟨Xu|x^j⟩²_W   (40)

Letting Ω(u) = diag(ϖ_j^{l−1}), we have:

∇_u ln φ(u) = ( 2 / ∑_{j=1}^{p} ϖ_j^l ) (X′WX) Ω(u) (X′WX) u   (41)
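As an illustration, the gradient of Eq. (41) for the variable powered inertia with equal weights ω_j can be coded directly in numpy; the function name is illustrative and the columns of X are assumed to be the (standardised) x^j.

```python
import numpy as np

def grad_log_vpi(u, X, W, l):
    """Gradient of ln(phi(u)) for the variable powered inertia, Eq. (41),
    with equal weights omega_j."""
    XtWX = X.T @ W @ X
    cov = XtWX @ u                         # entries <Xu|x^j>_W = x^j' W X u
    varpi = cov ** 2                       # varpi_j = <Xu|x^j>_W^2
    Omega = np.diag(varpi ** (l - 1))      # Omega(u) = diag(varpi_j^{l-1})
    return (2.0 / np.sum(varpi ** l)) * XtWX @ Omega @ XtWX @ u
```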

Appendix B: The Projected Iterated Normed Gradient (PING) algorithm

Notation: the current value of any quantity a on iteration t is denoted a^[t]. Consider the program:

max_{u ∈ R^p, u′M^{-1}u = 1, D′u = 0} h(u)   (42)

Putting v = M^{-1/2}u, g(x) = h(M^{1/2}x) and C = M^{1/2}D, this is strictly equivalent to:

RC:   max_{v ∈ R^p, v′v = 1, C′v = 0} g(v)   (43)

The corresponding Lagrangian is:

L(v, λ, µ) = g(v) − λ(v′v − 1) − µ′C′v   (44)

∇_{λ,µ} L(v, λ, µ) = 0 ⇔ v′v = 1 (1) and C′v = 0 (2)   (45)

∇_v L(v, λ, µ) = 0 ⇔ Γ(v) − 2λv − Cµ = 0 (3), with Γ(v) := ∇_v g(v)   (46)

(3) ⇔ v = (1/2λ)(Γ(v) − Cµ) (4)   (47)

C′(3) with (2) ⇒ C′Γ(v) = C′Cµ ⇔ µ = (C′C)^{-1}C′Γ(v)   (48)

Put back into (4), this yields:

v = (1/2λ) Π_{C⊥}Γ(v) (5), where Π_{C⊥} := I − C(C′C)^{-1}C′   (49)

In the particular case where C = 0, we shall take Π_{C⊥} = I.

Finally, (5) and (1) imply:

v = Π_{C⊥}Γ(v) / ‖Π_{C⊥}Γ(v)‖ (6)   (50)

This gives the basic iteration of the PING algorithm:

v^[t+1] = Π_{C⊥}Γ(v^[t]) / ‖Π_{C⊥}Γ(v^[t])‖   (51)

Let us show that this iteration follows a direction of ascent. Because, by construction, ∀s: v^[s] ⊥ C, we have:

∀s: v^[s] = Π_{C⊥}v^[s] ⇒ ⟨v^[t+1] − v^[t] | Γ(v^[t])⟩ = ⟨Π_{C⊥}(v^[t+1] − v^[t]) | Γ(v^[t])⟩   (52)

= ⟨v^[t+1] − v^[t] | Π_{C⊥}Γ(v^[t])⟩,   (53)

which has the sign of:

⟨v^[t+1] − v^[t] | v^[t+1]⟩ = 1 − ⟨v^[t] | v^[t+1]⟩ = 1 − cos(v^[t], v^[t+1]) ≥ 0   (54)

Picking a point on a direction of ascent does not guarantee that g actually increases, for we may "go too far" in this direction. Let γ^[t] := Π_{C⊥}Γ(v^[t]) / ‖Π_{C⊥}Γ(v^[t])‖. Staying "close enough" to the current starting point on the arc (v^[t], γ^[t]) guarantees that g increases. Indeed, let ϖ be the plane tangent to the sphere at v^[t], and let w denote the vector tangent to the arc (v^[t], γ^[t]) at v^[t] (cf. fig. 17). Then:

∃τ > 0, w = τΠ_ϖ γ^[t] ⇒ ⟨w | γ^[t]⟩ = τ⟨Π_ϖ γ^[t] | γ^[t]⟩ = τ cos²(γ^[t], ϖ) > 0   (55)

[Figure 17: v^[t], γ^[t] and the tangent vector w on the unit sphere.]

Yet, staying “too close” to the current starting point on the arc (v[t], γ[t]) may make the algorithm too slow to reach the maximum. To avoid that, we propose to use a Gauss-Newton unidimensional maximization to find the maximum of g(v) on the arc (v[t], γ[t]), and take it as v[t+1].

The fixed point of the resulting algorithm is a critical point of the program, hence a local maximum of g s.t. C′v = 0.
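A minimal numpy sketch of the PING iteration follows. It assumes M is s.p.d. and D has full column rank (or no column at all), and it replaces the Gauss-Newton search along the arc by the plain normed projected-gradient step of Eq. (51); all names are illustrative.

```python
import numpy as np

def ping(grad_h, M, D, u0, n_iter=100, tol=1e-8):
    """Projected Iterated Normed Gradient (sketch): maximize h(u)
    s.t. u'M^{-1}u = 1 and D'u = 0, given the gradient grad_h of h."""
    L = np.linalg.cholesky(M)                # M = L L', so u = L v turns u'M^{-1}u = 1 into v'v = 1
    C = L.T @ D                              # constraint D'u = 0 becomes C'v = 0
    if C.shape[1] > 0:                       # projector on the orthogonal of <C>
        P = np.eye(len(u0)) - C @ np.linalg.solve(C.T @ C, C.T)
    else:
        P = np.eye(len(u0))                  # no orthogonality constraint (first component)
    v = np.linalg.solve(L, u0)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        gamma = P @ (L.T @ grad_h(L @ v))    # projected gradient of g(v) = h(Lv)
        v_new = gamma / np.linalg.norm(gamma)  # basic normed iteration, Eq. (51)
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    return L @ v                             # back to u
```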

