Kullback-Leibler Approach to CUSUM Quickest Detection Rule for Markovian Time Series


HAL Id: hal-02334914

https://hal.archives-ouvertes.fr/hal-02334914

Submitted on 29 Oct 2019



Valerie Girardin, Victor Konev, Serguei Pergamenchtchikov

To cite this version:

Valerie Girardin, Victor Konev, Serguei Pergamenchtchikov. Kullback-Leibler Approach to CUSUM Quickest Detection Rule for Markovian Time Series. Sequential Analysis, Taylor & Francis, 2018, 37, pp. 322-341. 10.1080/07474946.2018.1548846. hal-02334914


Valérie Girardin

Laboratoire de Mathématiques N. Oresme, UMR6139 CNRS Université de Caen Normandie, France

Victor Konev

Department of Applied Mathematics and Cybernetics, Tomsk State University, Russian Federation

Serguei Pergamenchtchikov

Laboratoire de Mathématiques Raphaël Salem, CNRS, UMR 6085, Université de Rouen, France and International Laboratory of Statistics of Stochastic Processes and Quantitative Finance of National Research Tomsk State University, Russian Federation

Abstract: Optimality properties of decision procedures are studied for the quickest detection of a change-point of parameters in autoregressive and other Markov type sequences. The limit of the normalized conditional log-likelihood ratios is shown to exist for Markov chains satisfying the ergodic theorem of information theory. Then closed-form expressions for this limit are derived by making use of the time average rate of Kullback-Leibler divergence. The good properties of the detection procedures based on a sequential analysis approach are proven to hold thanks to geometric ergodicity properties of the observation processes. In particular, the window-limited CUSUM rule is shown to be optimal for detecting the disruption point in autoregressive models. Sparre Andersen models are specifically studied.

Keywords: Change-point detection, CUSUM detection procedure, Ergodic Theorem of information theory, Kullback-Leibler divergence, Relative entropy, Sequential detection, Sparre Andersen models.

Subject Classifications: 62L10, 62L15, 60G40, 60J10, 62B10, 94A17.

1. INTRODUCTION

The problem of quick detection of an abrupt change in time series arises in many different areas related to automatic control, segmentation of signals, biomedical signal processing, quality control engineering, finance, link failure detection in communication networks, intrusion detection in computer systems, and target detection in surveillance systems; see Page (1954), Basseville and Nikiforov (1993), Kent (2000), Tartakovsky et al. (2015) for details and further references. A challenging application area is intrusion detection in distributed computer networks; see Kent (2000) and Tartakovsky et al. (2004). Large scale attacks, such as denial-of-service attacks, occur at unknown time points and need to be detected in the early stages by observing abrupt changes in the network traffic.

Address correspondence to V. Girardin, LMNO, Université de Caen Normandie, BP5186, 14032 Caen, France.

In change point analysis, a large variety of observation models is used. They include independent identically distributed (i.i.d.) sequences of random variables whose distributions change at the disruption time and also different kinds of processes with dependent variables; see Tartakovsky et al. (2015), Pergamenchtchikov and Tartakovsky (2016) for autoregressive processes, and Yakir (1994) for Markov chains of order 1 with finite state spaces, extended in Lai (1998) to general state spaces. Two aspects are essential: the decision rule must both provide a low false-alarm rate and ensure quick detection of the disruption after it occurs.

However, in contrast to the analysis of detection procedures for i.i.d. models, most procedures for dependent models reduce to only one characteristic, the false alarm rate, while the delay time characteristic is studied only by means of numerical simulation for some specific models.

The asymptotic minimax theory for the i.i.d. case is developed in Lorden (1971), which both formulates the optimization problem as the minimization of the worst-case delay time subject to a lower bound on the mean time between false alarms, and proves the optimality of Page’s CUSUM algorithm. In recent years, efforts have been made to extend the theory of optimal detection beyond the i.i.d. setting. A general asymptotic minimax theory of change-point detection in stochastic systems is developed in Lai (1998), under an alternative constraint on the false-alarm rate: the probability of a false alarm within a period of given length must be uniformly small, whenever the period starts. In particular, the asymptotic optimality of the procedures is proven under general conditions on the conditional log-likelihood ratio statistics (CLLRS).

Among possible conditions is the convergence of the normalized CLLRS to some finite positive quantity, say $I$. In the i.i.d. case, this quantity is shown in Lorden (1971), by standard arguments on error probabilities of sequential log-likelihood ratio tests, to amount to the Kullback-Leibler divergence (KLD) between the pre-change and post-change distributions. A closed-form expression of $I$ is provided in Lai (1998) for Markov chains of order 1, in terms of the stationary distribution after the disruption and of the two transition densities involved. An expression in terms of the Fisher matrix of the model is also provided in Pergamenchtchikov and Tartakovsky (2016) for autoregressive models of order 1. None of them relates these expressions to the concept of entropy rate, classical in information theory.

The analysis of the properties of the detection procedures will here be essentially based on the Kullback-Leibler divergence (KLD) rates (see Girardin (2005) and Girardin and Limnios (2006)) and on the geometric ergodicity properties of stable autoregressive processes (see Galtchouk and Pergamenchtchikov (2014)).

Precisely, the present paper has two linked objectives. First, the problem of convergence of the normalized CLLRS is addressed. The existence of the limit $I$ is obtained through the ergodic theorem of information theory, also called asymptotic equipartition property (AEP); see Girardin (2005). Thus, $I$ is shown to be a KLD rate, with a closed-form expression for Markov chains of any order. Specific expressions are also derived for Gaussian autoregressive models. Second, in order to apply an optimal change-point detection rule to a specific model, some general conditions providing optimality have to be satisfied, all of them involving convergence of the normalized CLLRS to the KLD rate. For many models of interest in applications, the verification of these conditions becomes a stumbling block. Here we prove the optimality of the CUSUM and generalized CUSUM procedures for Markov chains and autoregressive models under simple conditions.

To this end, a method is developed for checking their sufficiency in a general optimal detection framework, by using the closed-form expression obtained for the KLD rate.

This paper is organized as follows. In Section 2 we consider the problem of quick detection of a change-point in general models with dependent data. As in both Lai (1998) and Pergamenchtchikov and Tartakovsky (2016), we impose upper bounds on the uniform probability of the false alarms. We recall some general conditions on the CLLRS, required for the optimality of the CUSUM rule for the models treated thereafter.

In Section 3, through an ergodic theory approach, we determine the limit $I$ of the normalized CLLRS under the form of a KLD rate, for Markov chains and for autoregressive models; we also state conditions of convergence in terms of AEP. In Section 4, we treat the problem of change-point detection for the Sparre Andersen model, widely used in insurance. In Section 5, we study the problem of the quickest detection in autoregressive models with Gaussian noise. Note that, for the sake of clarity and ease of reading, long proofs are postponed to the appendix.

2. OPTIMALITY PROPERTIES IN DETECTION PROCEDURES

In this section, we present the quickest change-point detection problems in the pointwise and minimax settings. The presentation follows the lines of the modified CUSUM procedures developed for general stochastic models in Lai (1998).

Let the time series $(X_k)_{k\ge 0}$ be specified by the conditional density function of $X_k$ given $X_0^{k-1} = (X_0, \dots, X_{k-1})$ for $k \ge 1$. The model with a disruption at time $\nu$ is given by the parametric family of conditional densities

$$f^{(\nu)}(x_k \mid x_0^{k-1}) = f_0(x_k \mid x_0^{k-1})\,\mathbf{1}_{\{k \le \nu\}} + f_1(x_k \mid x_0^{k-1})\,\mathbf{1}_{\{k > \nu\}}, \tag{2.1}$$

where $x_0^{k-1} = (x_0, \dots, x_{k-1})$, together with the distribution of $X_0$. The disruption time $\nu$ will be assumed to be a nonrandom unknown integer. We will denote by $\mathbb{P}^{(\nu)}$ the distribution of $(X_k)_{k\ge 0}$ defined by the two families of conditional densities $f_0$ and $f_1$ and the disruption time $\nu$, and $\mathbb{E}^{(\nu)}$ will denote the mean under $\mathbb{P}^{(\nu)}$. The issue of sequential detection is to detect the change-point as soon as possible after it occurs. The considered procedures involve the system of CLLRS

$$Z_j = \log\frac{f_1(X_j \mid X_0^{j-1})}{f_0(X_j \mid X_0^{j-1})}, \quad j \ge 1. \tag{2.2}$$

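In Lai (1998), the classical CUSUM procedure is changed into the so-called window-limited CUSUM rule.

For each $0 < \alpha < 1$, let $\mathcal{M}_{\alpha,m}$ be the class of admissible stopping times $\tau$ such that

$$\sup_{k\ge 1} \mathbb{P}^{(0)}(k \le \tau < k+m) \le \alpha.$$

Note that $\mathbb{P}^{(0)}$ stands for the distribution with density $f_0$, that $\alpha$ is a preassigned upper bound for the false alarm probability, and that $m = m_\alpha$ is a suitably chosen window size, function of $\alpha$, such that

$$\lim_{\alpha\to 0} \frac{m_\alpha}{|\log\alpha|} = +\infty \quad\text{and}\quad \lim_{\alpha\to 0} \frac{\log m_\alpha}{|\log\alpha|} = 0.$$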
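As a quick numerical illustration of the two growth conditions on the window size, the sketch below takes the hypothetical choice $m_\alpha = \lceil |\log\alpha|^2 \rceil$. This choice is an assumption made for the example only and is not prescribed in the text. The function names are likewise illustrative. The first ratio should grow and the second should shrink as $\alpha$ decreases.

```python
import math

def window_size(alpha):
    """Hypothetical window size m_alpha = ceil(|log alpha|^2) (illustrative assumption)."""
    return math.ceil(math.log(alpha) ** 2)

def window_ratios(alpha):
    """Return (m_alpha / |log alpha|, log(m_alpha) / |log alpha|)."""
    m = window_size(alpha)
    abs_log = abs(math.log(alpha))
    return m / abs_log, math.log(m) / abs_log

if __name__ == "__main__":
    for alpha in (1e-2, 1e-4, 1e-8, 1e-16):
        growth, decay = window_ratios(alpha)
        print(f"alpha={alpha:g}  m/|log alpha|={growth:.2f}  log(m)/|log alpha|={decay:.3f}")
```

Any other window with polynomial growth in $|\log\alpha|$ of degree larger than one would pass the same check; the quadratic choice is only the simplest one to write down.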

The problem of optimal change-point detection thus becomes the determination of a procedure which minimizes in the class $\mathcal{M}_{\alpha,m}$ the positive part detection delay risk $R_\nu(\tau) = \mathbb{E}^{(\nu)}(\tau-\nu)_+$, that is, the mean time between the true disruption time $\nu$ and the detection time $\tau$; note that $(x)_+ = \max(0, x)$.

The window-limited CUSUM procedure is defined by the stopping time

$$T_\alpha = \inf\Big\{ n \ge 1 : \max_{(n-m_\alpha)_+ \le k \le n} \sum_{j=k}^{n} Z_j \ge c_\alpha \Big\}, \quad\text{where } c_\alpha = \log\frac{2m_\alpha}{\alpha}. \tag{2.3}$$
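The stopping rule (2.3) can be sketched as follows, assuming the increments $Z_1, Z_2, \dots$ have already been computed and stored in a list. The function names are illustrative. Since $Z_j$ is only defined for $j \ge 1$, the sketch restricts the inner maximum to $k \ge 1$; this is an assumption about how the boundary case $(n-m_\alpha)_+ = 0$ is to be read.

```python
import math

def threshold(alpha, m):
    """c_alpha = log(2 m / alpha), as in (2.3)."""
    return math.log(2.0 * m / alpha)

def window_limited_cusum(z, m, c):
    """First n >= 1 with max over (n-m)_+ <= k <= n of sum_{j=k}^{n} Z_j >= c, else None.

    z[0] holds Z_1, z[1] holds Z_2, and so on.
    """
    for n in range(1, len(z) + 1):
        k_min = max(n - m, 0)
        partial = 0.0
        best = -math.inf
        # Accumulate sum_{j=k}^{n} Z_j for k = n, n-1, ..., max(k_min, 1).
        for k in range(n, max(k_min, 1) - 1, -1):
            partial += z[k - 1]
            best = max(best, partial)
        if best >= c:
            return n
    return None

if __name__ == "__main__":
    increments = [-1.0, -1.0, -1.0, 2.0, 2.0, 2.0]
    print(window_limited_cusum(increments, m=3, c=5.0))
```

The double loop costs of the order of $m_\alpha$ operations per observation; it is kept deliberately naive so that it mirrors the formula term by term.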

The optimization problem is studied under the following conditions on the CLLRS defined in (2.2).

Condition 1. Some $I \in \mathbb{R}_+^*$ exists such that, for any $\nu \ge 0$, the following almost sure (a.s.) convergence holds:

$$\lim_{n\to\infty} \frac{1}{n} \sum_{j=\nu+1}^{\nu+n} Z_j = I, \quad \mathbb{P}^{(\nu)}\text{-a.s.} \tag{2.4}$$

As established in Pergamenchtchikov and Tartakovsky (2016, Theorem 5), Condition 1 implies for the delay risk the sharp lower bound

$$\liminf_{\alpha\to 0} \frac{1}{|\ln\alpha|} \inf_{\tau\in\mathcal{M}_\alpha} R_\nu(\tau) \ge \frac{1}{I}. \tag{2.5}$$

In order to derive the upper bound for the detection procedure (2.3), a stronger condition is required.

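Condition 2. Some $I \in \mathbb{R}_+^*$ exists such that, for any $\delta > 0$,

$$\lim_{n\to\infty} \sup_{k\ge\nu} \operatorname{ess\,sup}\ \mathbb{P}^{(\nu)}\Big( \sum_{j=k+1}^{k+n} Z_j < (I-\delta)n \ \Big|\ \mathcal{F}_k \Big) = 0, \tag{2.6}$$

where $\mathcal{F}_k = \sigma\{X_1, \dots, X_k\}$.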

As proven in Lai (1998, Theorem 4 (ii)), Condition 2 provides the following sharp upper bound for the average time between the true disruption time $\nu$ and the detection time $T_\alpha$,

$$\limsup_{\alpha\to 0} \frac{1}{|\ln\alpha|} \sup_{\nu\ge 0} R_\nu(T_\alpha) \le \frac{1}{I}. \tag{2.7}$$

3. KLD APPROACH TO THE LIMIT OF THE NORMALIZED CLLRS

Questions naturally arise here about the assumptions to be fulfilled by a time series to ensure the validity of Conditions 1 and 2, and about how the limit $I$ can be evaluated from observations. This section aims at providing conditions of existence and closed-form formulae for the case of Markov chains, and more particularly for autoregressive models. We will mainly use the geometric ergodicity properties of Markov chains together with concepts of entropy and information theory; see Gray (2011) and Girardin (2005). Even though autoregressive models constitute a subclass of Markov chains, specific methods and tools have to be developed in this case; see Meyn and Tweedie (1993) and Galtchouk and Pergamenchtchikov (2014).

Section 3.1 will develop an approach based on information theory for investigating the asymptotic convergence of the normalized CLLRS. Condition 1 will be shown to amount to a particular case of the classical AEP, also called Shannon-McMillan-Breiman theorem, or ergodic theorem of information theory; see Cover and Thomas (1991). Thus, for Markov chains of any fixed order, the limit $I$ will appear as the KLD rate between two random sequences with the conditional distributions before and after the disruption. In Section 3.2, the limit will be shown to take the form of an explicit function of the model parameters. In Section 3.3, for autoregressive models, assumptions on the existence of the limit $I$ will be specifically checked, yielding a closed-form formula for $I$ too. The change-point detection procedure will then be studied in full detail, for Sparre Andersen models in Section 4 and autoregressive models in Section 5.

3.1. The Ergodic Theorem of Information Theory

Let us first recall some basics concerning entropy and divergence for random sequences. Entropy measures randomness or uncertainty in a random phenomenon. Its introduction into probability theory in Shannon (1948) gave rise to information theory, by measuring the information content of sources. The Shannon (or differential) entropy of a random vector $X_0^{n-1}$ is

$$S(X_0^{n-1}) = S(f) = -\int_{\mathbb{R}^n} f(x)\log f(x)\,dx,$$

if its distribution is absolutely continuous with respect to the Lebesgue measure $\Lambda$, with density $f$. An extension of the definition leads to the relative entropy

$$S_g(f) = -\int_{\mathbb{R}^n} \log\frac{f(x)}{g(x)}\, f(x)\,dx,$$

if the support of $f$ is included in the support of $g$, otherwise set as $+\infty$. This is the entropy of a measure with density $f$ relative to a measure with density $g$, both absolutely continuous with respect to $\Lambda$. When both $f$ and $g$ are densities of probability measures on $\mathbb{R}^n$, the Kullback-Leibler divergence (KLD) of $f$ with respect to $g$, also called neg-entropy and introduced in Kullback and Leibler (1951), is the opposite of the entropy of $f$ relative to $g$, precisely

$$K(f|g) = \int_{\mathbb{R}^n} \log\frac{f(x)}{g(x)}\, f(x)\,dx = -S_g(f).$$

Thus, the entropy of a random vector $X_0^{n-1}$ with density $f$ relative to a measure with density $g$ with respect to $\Lambda$ on $\mathbb{R}^n$ is defined as $S_g(X_0^{n-1}) = S_g(f)$.

Similarly, the KLD of a random vector $X_0^{n-1}$ with density $f$ with respect to a random vector $Y_0^{n-1}$ with density $g$ on $\mathbb{R}^n$ is defined as $K(X_0^{n-1}|Y_0^{n-1}) = K(f|g)$. The KLD measures dissimilarity between two random phenomena. Although it is not a true distance, because it is not symmetric and does not satisfy the triangle inequality (see Cover and Thomas (1991)), it constitutes a good tool for discriminating two distributions because

$$K(f|g) \ge 0, \quad\text{with } K(f|g) = 0 \text{ if and only if } f = g. \tag{3.1}$$

Let $f(x,y)$ and $g(x,y)$ denote densities (with respect to $\Lambda$) on $\mathbb{R}^{m+n}$, with conditional densities denoted by $f(y|x)$ and $g(y|x)$. The conditional KLD of $f$ with respect to $g$ is usually defined as the real number

$$\begin{aligned} K(f(y|x)\,|\,g(y|x)) &= \int_{\mathbb{R}^{m+n}} f(x,y) \log\frac{f(y|x)}{g(y|x)}\,dx\,dy \\ &= \int_{\mathbb{R}^m} f(x) \int_{\mathbb{R}^n} f(y|x) \log\frac{f(y|x)}{g(y|x)}\,dy\,dx = \mathbb{E}_f[K(f,g;x)], \end{aligned} \tag{3.2}$$

where the function of $x$

$$K(f,g;x) = \int_{\mathbb{R}^n} f(y|x) \log\frac{f(y|x)}{g(y|x)}\,dy \tag{3.3}$$

is called local divergence (in $x$).
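To make the local divergence (3.3) concrete, the following sketch evaluates it by a midpoint rule for two Gaussian conditional densities, $f(y|x) = N(\theta x, 1)$ and $g(y|x) = N(ax, 1)$. This pair of densities, the function names and the integration parameters are assumptions chosen for illustration. For two unit-variance Gaussians, the divergence is half the squared difference of the means, here $(\theta-a)^2x^2/2$, which gives a reference value.

```python
import math

def normal_pdf(y, mean):
    """Density of N(mean, 1) at y."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def local_divergence(x, theta, a, half_width=12.0, steps=4000):
    """Midpoint-rule approximation of (3.3) for f(y|x) = N(theta x, 1), g(y|x) = N(a x, 1)."""
    centre = theta * x
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        y = centre - half_width + (i + 0.5) * h
        # log f(y|x) - log g(y|x), written analytically to avoid log(0) in the tails
        log_ratio = -0.5 * (y - theta * x) ** 2 + 0.5 * (y - a * x) ** 2
        total += normal_pdf(y, theta * x) * log_ratio * h
    return total

if __name__ == "__main__":
    x, theta, a = 1.5, 0.5, -0.2
    print(local_divergence(x, theta, a), 0.5 * ((theta - a) * x) ** 2)
```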

The marginal Shannon entropy of order $n$ of a sequence $X = (X_n)_{n\ge 0}$ taking values in $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is defined as the entropy of its $n$-dimensional marginal distribution, namely

$$S(X_0^{n-1}) = -\mathbb{E}_f[\log f(X_0^{n-1})] = -\int_{\mathbb{R}^n} f(x_0^{n-1}) \log f(x_0^{n-1})\,d\mu(x_0^{n-1}),$$

where $f$ is the density of the random vector $X_0^{n-1}$ with respect to the $n$-dimensional marginal of a reference measure $\mu$ on the infinite product space $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^{\mathbb{N}}))$. Further, the Shannon entropy rate of a process $X$ is defined by $H(X) = \lim_{n\to\infty} \frac{1}{n} S(X_0^{n-1})$ when the limit exists; see Cover and Thomas (1991, Chapter 9) and Gray (2011, Chapter 8). The convergence in probability of $-\frac{1}{n}\log f(X_0^{n-1})$ to $H(X)$ constitutes the weak AEP, while the almost sure convergence is called the strong AEP, or ergodic theorem of information theory, or Shannon-McMillan-Breiman theorem, after the authors who originally proved it.

For any ergodic random sequence $X$ whose distribution $P_X$ is absolutely continuous with respect to $\Lambda$ on $\mathbb{R}^{\mathbb{N}}$, with density $f$, the extension of the AEP to relative entropy says that $-\frac{1}{n}\log\big[\frac{f}{g}(X_0^{n-1})\big]$ converges to $H_g(X)$, where convergence holds in probability, in mean, or a.s. with respect to $P_X$, for any reference measure absolutely continuous with respect to $\Lambda$ on $\mathbb{R}^{\mathbb{N}}$, with density $g$; see Barron (1985) and Girardin (2005) and the references therein for details. Note that, for the sake of simplification, $f$ and $g$ here denote both the densities of measures on $\mathbb{R}^{\mathbb{N}}$ and the densities of all their marginals on $\mathbb{R}^n$ for $n \in \mathbb{N}^*$.

Similarly, let $X = (X_n)_{n\ge 0}$ and $Y = (Y_n)_{n\ge 0}$ be two ergodic random sequences whose distributions are absolutely continuous with respect to $\Lambda$ on $\mathbb{R}^{\mathbb{N}}$, with respective densities $f$ and $g$. Averaging per time unit leads to the definition of the KLD rate

$$D(X|Y) = \lim_{n\to\infty} \frac{1}{n} K(X_0^{n-1}|Y_0^{n-1}) = -H_g(X),$$

when the limit exists. Extended forms of AEP have been proven to hold for large classes of ergodic processes; see Girardin (2005) for a detailed review. In particular, the strong AEP was extended to Borel state spaces independently by Barron (1985) and Orey (1985). For full details on AEP and KLD rates in the special case of Markovian sequences, see Algoet and Cover (1988, Theorem 4) and Gray (2011, Corollary 8.4.1); in short, if both $X$ and $Y$ are ergodic Markov chains of finite order, with respective transition densities $f$ and $g$, then their KLD rate is well-defined and the following a.s. convergence holds with respect to the distribution of $X$:

$$\frac{1}{n}\log\frac{f}{g}(X_0^{n-1}) \longrightarrow D(X|Y). \tag{3.4}$$
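The convergence (3.4) can be observed on simulated data. The sketch below is an illustration under stated assumptions, not part of the original argument: $X$ is taken to be a stationary Gaussian autoregressive chain of order 1 with coefficient $\theta$ and unit noise variance, $Y$ is a chain of the same type with coefficient $a$, and the contribution of the initial densities, which is of order $1/n$, is ignored. For this pair the KLD rate equals $(\theta-a)^2/(2(1-\theta^2))$, obtained by averaging the local divergence $(\theta-a)^2x^2/2$ over the stationary law $N(0, 1/(1-\theta^2))$.

```python
import math
import random

def normalized_log_likelihood_ratio(theta, a, n, seed=0):
    """Simulate X_k = theta X_{k-1} + eps_k and average log[f(X_k|X_{k-1}) / g(X_k|X_{k-1})].

    f and g are the N(theta x, 1) and N(a x, 1) transition densities.
    """
    rng = random.Random(seed)
    # Start from the stationary law N(0, 1/(1 - theta^2)) of the simulated chain.
    x = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - theta * theta))
    total = 0.0
    for _ in range(n):
        y = theta * x + rng.gauss(0.0, 1.0)
        total += 0.5 * (y - a * x) ** 2 - 0.5 * (y - theta * x) ** 2
        x = y
    return total / n

def kld_rate(theta, a):
    """Closed-form KLD rate between the two Gaussian AR(1) chains (unit noise variance)."""
    return (theta - a) ** 2 / (2.0 * (1.0 - theta * theta))

if __name__ == "__main__":
    print(normalized_log_likelihood_ratio(0.5, 0.0, 200000, seed=1), kld_rate(0.5, 0.0))
```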

3.2. Application to the CLLRS for Markov Chains

In Section 2, the quantity $I$ of interest in both Conditions 1 and 2 is the limit of the normalized CLLRS $\frac{1}{n}\sum_{j=\nu+1}^{\nu+n} Z_j$, with $Z_j$ defined in (2.2). As remarked in Lai (1998), for the trivial case of i.i.d. observations, $I$ is the KLD $K(f_1|f_0)$ between the post-change and pre-change distributions $f_1(x)$ and $f_0(x)$ of $X_n$. For Markovian dependent observations, the next result states that $I$ is the KLD rate between two well-identified sequences.

Lemma 3.1. Let the time series $X = (X_n)_{n\ge 0}$ be a model with a disruption at time $\nu$, given by the parametric family of conditional densities (2.1). Let $\widehat{X}$ be a time series with distribution given by the conditional densities $f_1(x_k|x_0^{k-1})$. Let $\widetilde{X}$ be another time series, with conditional densities $f_0(x_k|x_0^{k-1})$.

Suppose that both $\widehat{X}$ and $\widetilde{X}$ are ergodic homogeneous Markov chains of order $d$.

Then Condition 1 is fulfilled, with $I = D(\widehat{X}|\widetilde{X})$.

Proof. We can write

$$\sum_{j=\nu+1}^{\nu+n} Z_j = \log\Bigg[\prod_{j=\nu+1}^{\nu+n} \frac{f_1(X_j|X_0^{j-1})}{f_0(X_j|X_0^{j-1})}\Bigg] = \log\Bigg[\frac{\prod_{j=1}^{n} f_1(X_{\nu+j}|X_0^{\nu+j-1})}{\prod_{j=1}^{n} f_0(X_{\nu+j}|X_0^{\nu+j-1})}\Bigg].$$

For any density $f$ defining a Markov chain of order $d$, we compute

$$\prod_{j=1}^{n} f(x_{\nu+j}|x_0^{\nu+j-1}) = \prod_{j=1}^{n} f(x_{\nu+j}|x_{\nu+j-d}^{\nu+j-1}) = \prod_{j=d+1}^{n} f(x_{\nu+j}|x_{\nu+1}^{\nu+j-1}) \prod_{j=1}^{d} f(x_{\nu+j}|x_{\nu+j-d}^{\nu+j-1}).$$

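For any density $f$ and time $\nu$, we can also write

$$\begin{aligned} f(x_\nu^{\nu+n}) &= f(x_{\nu+n}|x_\nu^{\nu+n-1})\, f(x_\nu^{\nu+n-1}) = f(x_{\nu+n}|x_\nu^{\nu+n-1})\, f(x_{\nu+n-1}|x_\nu^{\nu+n-2})\, f(x_\nu^{\nu+n-2}) \\ &= \cdots = \prod_{j=1}^{n} f(x_{\nu+j}|x_\nu^{\nu+j-1}) \cdot f(x_\nu). \end{aligned}$$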

Therefore, for a Markov chain of order $d$, we get $\prod_{j=d+1}^{n} f(x_{\nu+j}|x_{\nu+1}^{\nu+j-1}) = f(x_{\nu+1}^{\nu+n})/f(x_{\nu+1}^{\nu+d})$. Specifically,

$$\sum_{j=\nu+1}^{\nu+n} Z_j = \log\Bigg[\frac{f_1(X_{\nu+1}^{\nu+n})}{f_0(X_{\nu+1}^{\nu+n})}\Bigg] + h(\nu,d),$$

where $h(\nu,d)$ is a quantity depending on $\nu$ and $d$ but not on $n$, and hence $h(\nu,d)/n$ converges to 0. Therefore, $\frac{1}{n}\sum_{j=\nu+1}^{\nu+n} Z_j$ converges to the same limit as $\frac{1}{n}\log\big[f_1(X_{\nu+1}^{\nu+n})/f_0(X_{\nu+1}^{\nu+n})\big]$.

By the strong AEP (see Gray (2011, Corollary 8.4.1) and (3.4)), the latter converges to $I = D(\widehat{X}|\widetilde{X})$, which proves the result. □

The first explicit formula for entropy rates was already derived in Shannon (1948) for ergodic homogeneous Markov chains of order 1 with finite state spaces. Further, let $X$ be a Markov chain taking values in $\mathbb{R}$, with transition density $f(x|y)$ and stationary distribution with density $\pi(x)$. Then $-\frac{1}{n}\log f(X_0^{n-1})$ converges almost surely to the entropy rate $H(X) = -\int_{\mathbb{R}} \pi(y) \int_{\mathbb{R}} f(x|y)\log f(x|y)\,dx\,dy$; see for instance Girardin and Limnios (2006), where the extension to KLD is also proven to hold for semi-Markov processes. Considering the straightforward extension of the above formula to Markov chains of order $d$ together with Lemma 3.1 yields the following result for a Markov chain with a disruption.

Theorem 3.1. Let $X$ be an ergodic real Markov chain of order $d$ with disruption time $\nu$, with transition density

$$f^{(\nu)}(x_n|x_0^{n-1}) = f_0(x_n|x_{n-d}^{n-1})\,\mathbf{1}_{\{n\le\nu\}} + f_1(x_n|x_{n-d}^{n-1})\,\mathbf{1}_{\{n>\nu\}}, \quad n \ge d.$$

Let $\pi_1(x)$ denote the density of the stationary distribution for the Markov chain $\widehat{X}$ of order $d$, with transition density $f_1(x|y_0^{d-1})$. Let similarly $\widetilde{X}$ be an ergodic homogeneous Markov chain of order $d$, with transition density $f_0(x|y_0^{d-1})$. Then Condition 1 holds true, with

$$I = D(\widehat{X}|\widetilde{X}) = \int_{\mathbb{R}^d} \pi_1(y_0^{d-1}) \Bigg[\int_{\mathbb{R}} f_1(x|y_0^{d-1}) \log\frac{f_1(x|y_0^{d-1})}{f_0(x|y_0^{d-1})}\,dx\Bigg] dy_0 \dots dy_{d-1}. \tag{3.5}$$

Note that in Fuh (2003) a KLD number between Markov transition kernels is defined for proving optimality properties of CUSUM procedures in hidden Markov models of order 1. In particular, a formula similar to (3.5) is given for a Markov chain of order 1, but no link to the AEP and KLD rates is highlighted.
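For intuition on (3.5), it may help to look at its finite-state counterpart for $d = 1$, where the integrals become sums: $\sum_i \pi_i \sum_j P_{ij}\log(P_{ij}/Q_{ij})$, with $P$ the post-change transition matrix, $Q$ the pre-change one and $\pi$ the stationary law of $P$. The sketch below treats two-state chains only. The discrete setting, the function names and the example matrices are assumptions made for illustration; the text itself works with densities on $\mathbb{R}$.

```python
import math

def stationary_two_state(P):
    """Stationary law of a two-state chain with transition matrix P (requires P[0][1] + P[1][0] > 0)."""
    p01, p10 = P[0][1], P[1][0]
    return (p10 / (p01 + p10), p01 / (p01 + p10))

def kld_rate_two_state(P, Q):
    """sum_i pi_i sum_j P[i][j] log(P[i][j] / Q[i][j]), with pi stationary for P.

    Assumes Q[i][j] > 0 wherever P[i][j] > 0.
    """
    pi = stationary_two_state(P)
    rate = 0.0
    for i in range(2):
        for j in range(2):
            if P[i][j] > 0.0:
                rate += pi[i] * P[i][j] * math.log(P[i][j] / Q[i][j])
    return rate

if __name__ == "__main__":
    P = [[0.5, 0.5], [0.5, 0.5]]
    Q = [[0.9, 0.1], [0.9, 0.1]]
    print(kld_rate_two_state(P, Q))
```

With these two matrices both chains are i.i.d., so the rate reduces to the ordinary KLD between the two one-step laws, as in the i.i.d. remark above.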

3.3. Autoregressive Models

For a scalar autoregressive sequence of order $p$ with a disruption, we will provide an explicit expression for $I$ in terms of the Fisher matrix of the model. To this end, we will consider the associated vector-valued autoregressive sequence of dimension $p$, which is a Markov chain of order 1.

Precisely, let $X = (X_k)_{k\ge 1}$ be a random sequence governed by the autoregressive recursive equations of order $p$

$$X_k = \alpha_{1,k}X_{k-1} + \dots + \alpha_{p,k}X_{k-p} + \varepsilon_k, \quad k \ge 1, \tag{3.6}$$

where $\alpha_{i,k} = a_i$ before the change-point time $\nu$ (for $k \le \nu$), and $\alpha_{i,k} = \theta_i$ after $\nu$, that is to say for $k > \nu$. Here $(\varepsilon_k)_{k\ge 1}$ is Gaussian white noise, that is, an i.i.d. standard Gaussian sequence, and the initial vector $(X_{1-p}, \dots, X_0)$ is independent of $(\varepsilon_k)_{k\ge 1}$.

Setting $\Phi_k = (X_k, \dots, X_{k-p+1})'$, $a = (a_1, \dots, a_p)'$ and $\theta = (\theta_1, \dots, \theta_p)'$, the model (3.6) becomes

$$X_k = \Phi'_{k-1}\big(a\,\mathbf{1}_{\{k\le\nu\}} + \theta\,\mathbf{1}_{\{k>\nu\}}\big) + \varepsilon_k, \quad k \ge 1. \tag{3.7}$$

In this case the conditional densities (2.1) take the form

$$f_0(x_k|x_1^{k-1}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x_k-\Phi'_{k-1}a)^2}{2}} \quad\text{and}\quad f_1(x_k|x_1^{k-1}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x_k-\Phi'_{k-1}\theta)^2}{2}}.$$
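The log-ratio of these two Gaussian conditional densities is the increment $Z_k$ of (2.2) for the autoregressive model. The sketch below computes it in two ways, directly from the log-densities and through the expanded form $X_k\Phi'_{k-1}(\theta-a) - \frac{1}{2}[(\Phi'_{k-1}\theta)^2 - (\Phi'_{k-1}a)^2]$, so that the two can be compared. The function names and the numerical values are illustrative assumptions.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def gaussian_log_density(x_k, phi_prev, coeffs):
    """log of the N(phi_prev' coeffs, 1) density at x_k."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x_k - dot(phi_prev, coeffs)) ** 2

def llr_increment(x_k, phi_prev, theta, a):
    """Expanded form of Z_k; phi_prev = (X_{k-1}, ..., X_{k-p})."""
    diff = [t - s for t, s in zip(theta, a)]
    return x_k * dot(phi_prev, diff) - 0.5 * (dot(phi_prev, theta) ** 2 - dot(phi_prev, a) ** 2)

if __name__ == "__main__":
    x_k, phi_prev = 0.3, [1.0, -0.5]
    theta, a = [0.4, 0.2], [0.1, -0.3]
    direct = gaussian_log_density(x_k, phi_prev, theta) - gaussian_log_density(x_k, phi_prev, a)
    print(direct, llr_increment(x_k, phi_prev, theta, a))
```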

Since the CLLRS (2.2) can be written

$$Z_k = X_k\Phi'_{k-1}(\theta-a) - \frac{1}{2}\Big[(\Phi'_{k-1}\theta)^2 - (\Phi'_{k-1}a)^2\Big], \tag{3.8}$$

the ergodic properties of $\Phi = (\Phi_k)_{k>\nu}$ will induce a closed-form formula for $I$, involving the Fisher matrix of the model. Indeed, the vector-valued process $\Phi$ is a Markov chain of order 1, and is governed after the change-point by the multivariate autoregressive equation of order 1,

$$\Phi_k = A\Phi_{k-1} + \xi_k, \quad k \ge \nu+1, \tag{3.9}$$

where $\xi_k = (\varepsilon_k, 0, \dots, 0)' \in \mathbb{R}^p$ and

$$A = A(\theta) = \begin{pmatrix} \theta_1 & \cdots & \cdots & \theta_p \\ 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}.$$

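Applying (3.9) repeatedly yields $\Phi_k = A^{k-\nu}\Phi_\nu + \sum_{j=\nu+1}^{k} A^{k-j}\xi_j$ for $k \ge \nu+1$, and hence, for any $v \in \mathbb{R}^p$,

$$\mathbb{E}\big[\Phi_k\Phi'_k \,\big|\, \Phi_\nu = v\big] = A^{k-\nu}\, v\, v'\, (A')^{k-\nu} + \sum_{l=0}^{k-\nu-1} A^l B (A')^l, \tag{3.10}$$

where the prime denotes transpose, and $B = (b_{ij})_{1\le i,j\le p}$ is the matrix with $b_{ij} = 1$ if $(i,j) = (1,1)$ and 0 otherwise.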

In the sequel we will denote by $\Theta_{stb}$ the set of all $\theta \in \mathbb{R}^p$ for which all eigenvalues of $A$ are less than one in modulus, and say accordingly that $\Phi$ (or $X$) is a stable sequence. If $\theta \in \Theta_{stb}$, then $\Phi$ is ergodic, the Fisher matrix $F(\theta) = F$ of the model (3.9) is well-defined, and, for any $v \in \mathbb{R}^p$,

$$\lim_{k\to\infty} \mathbb{E}\big[\Phi_k\Phi'_k \,\big|\, \Phi_\nu = v\big] = \mathbb{E}\,\Phi_\infty\Phi'_\infty = \sum_{l\ge 0} A^l B (A')^l = F, \tag{3.11}$$

where $\Phi_\infty = \sum_{l\ge 1} A^{l-1}\xi_l \sim N_p(0, F(\theta))$ has the ergodic distribution of $\Phi_k$; see for example Basseville and Nikiforov (1993). Moreover the following strong law of large numbers, proven in the appendix, holds true.

Theorem 3.2. If $\theta \in \Theta_{stb}$ for the model (3.6), then

$$\lim_{n\to\infty} \frac{1}{n} \sum_{j=\nu+1}^{\nu+n} \Phi_j\Phi'_j = F, \quad \mathbb{P}^{(\nu)}\text{-a.s.}$$

Condition 1 follows, with explicit limit.
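The limit matrix $F$ in Theorem 3.2 is the series $\sum_{l\ge 0} A^lB(A')^l$ of (3.11), which can be approximated by truncation when $\theta$ is stable. The sketch below does this with plain nested lists. The function names and the truncation length are illustrative assumptions, and no stability check is performed. For $p = 2$ the result can be compared with the standard Yule-Walker autocovariances of a stationary autoregression with unit noise variance.

```python
def companion(theta):
    """Companion matrix A(theta) of (3.9), as a list of rows."""
    p = len(theta)
    rows = [list(theta)]
    for i in range(1, p):
        rows.append([1.0 if j == i - 1 else 0.0 for j in range(p)])
    return rows

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def fisher_matrix(theta, terms=500):
    """Truncated series sum_{l < terms} A^l B (A')^l, with B = e_1 e_1'."""
    p = len(theta)
    a = companion(theta)
    at = transpose(a)
    term = [[1.0 if (i, j) == (0, 0) else 0.0 for j in range(p)] for i in range(p)]  # B
    total = [[0.0] * p for _ in range(p)]
    for _ in range(terms):
        total = [[total[i][j] + term[i][j] for j in range(p)] for i in range(p)]
        term = matmul(matmul(a, term), at)  # next term A^{l+1} B (A')^{l+1}
    return total

if __name__ == "__main__":
    print(fisher_matrix([0.5, 0.3]))
```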

Theorem 3.3. Let $X$ be governed by the model (3.6). Then Condition 1 holds true for all $\theta \in \Theta_{stb}$ and any $\nu \ge 0$. Moreover, the limit in (2.4) is

$$I = K(f_\theta(x|v)\,|\,f_a(x|v)) = \frac{1}{2}(\theta-a)'F(\theta)(\theta-a), \tag{3.12}$$

with the KLD defined in (3.2) for the conditional densities defined in (3.8).
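As a worked instance of (3.12), the sketch below evaluates the quadratic form for $p = 2$. It assumes that $\theta$ is stable and uses the standard Yule-Walker expressions for the stationary autocovariances of an autoregression of order 2 with unit noise variance to fill in $F(\theta)$; these closed forms and the function names are assumptions brought in for the example, not formulas taken from the text. Setting the second coefficients to zero recovers the order 1 value $(\theta_1-a_1)^2/(2(1-\theta_1^2))$.

```python
def ar2_fisher(theta1, theta2):
    """Stationary covariance of (X_k, X_{k-1}) for a stable AR(2) with unit noise variance."""
    g0 = (1.0 - theta2) / ((1.0 + theta2) * ((1.0 - theta2) ** 2 - theta1 ** 2))
    g1 = theta1 * g0 / (1.0 - theta2)
    return [[g0, g1], [g1, g0]]

def kld_limit(theta, a):
    """I = (theta - a)' F(theta) (theta - a) / 2 for p = 2, as in (3.12)."""
    f = ar2_fisher(theta[0], theta[1])
    d = [theta[0] - a[0], theta[1] - a[1]]
    quad = sum(d[i] * f[i][j] * d[j] for i in range(2) for j in range(2))
    return 0.5 * quad

if __name__ == "__main__":
    print(kld_limit((0.5, 0.3), (0.1, -0.2)))
```

Through the lower bound (2.5) and the upper bound (2.7), a larger value of $I$ corresponds to a shorter asymptotic detection delay, of order $|\ln\alpha|/I$.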

Proof. Equation (3.9) implies that

$$\sum_{j=\nu+1}^{\nu+n} Z_j(\theta) = (\theta-a)'M_n + \frac{1}{2}(\theta-a)' \sum_{j=\nu}^{\nu+n-1} \Phi_j\Phi'_j\, (\theta-a), \tag{3.13}$$

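where $M_n = \sum_{j=\nu+1}^{\nu+n} \Phi_{j-1}\varepsilon_j$. The Chebyshev inequality easily implies that, for any $\delta > 0$,

$$\sum_{n\ge 1} \mathbb{P}(|M_n| > \delta n) \le \delta^{-4} \sup_{k\ge 1} \frac{\mathbb{E}\,M_k^4}{k^2} \sum_{n\ge 1} \frac{1}{n^2}.$$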

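Since the process $(M_n)_{n\ge 1}$ is a martingale, using successively the Burkholder-Gundy inequality and the Cauchy-Schwarz inequality proves that some constant $C^* > 0$ exists such that

$$\mathbb{E}\,M_n^4 \le C^*\, \mathbb{E}\Bigg(\sum_{j=\nu+1}^{\nu+n} |\Phi_{j-1}|^2\varepsilon_j^2\Bigg)^2 \le C^*\, n \sum_{j=\nu+1}^{\nu+n} \mathbb{E}\big(|\Phi_{j-1}|^4\varepsilon_j^4\big) = C^*\, \mathbb{E}(\varepsilon_1^4)\, n \sum_{j=\nu+1}^{\nu+n} \mathbb{E}\,|\Phi_{j-1}|^4.$$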

From (3.11), it is easy to deduce that $\sup_{k\ge\nu+1} \mathbb{E}|\Phi_k|^4$ is finite, where $|x|$ denotes the Euclidean norm of the vector $x \in \mathbb{R}^p$. Therefore,

$$\lim_{n\to\infty} \frac{|\Phi_n|^2}{n} = 0 \quad\text{and}\quad \lim_{n\to\infty} \frac{1}{n}M_n = 0, \quad \mathbb{P}^{(\nu)}\text{-a.s.}, \tag{3.14}$$

where the second limit follows from $\sup_{n\ge 1} (\mathbb{E}\,M_n^4)/n^2 < \infty$, thanks to the Borel-Cantelli lemma.

Equation (2.4) follows directly from Theorem 3.2, and the convergence is proven. Finally, (3.12) is given in Example 6.5 of Pergamenchtchikov and Tartakovsky (2016), with the link to KLD detailed for an autoregressive model of order 1 in Example 6.2. □

4. SPARRE ANDERSEN MODELS

In this section we study the quickest detection problem for a specific case of the discrete time Sparre Andersen models that is widely used in insurance; see for example Asmussen and Albrecher (2010, p. 427) and Back et al. (2003). We will show that, after the disruption, this observation process is a geometrically ergodic homogeneous Markov chain under some technical assumptions. We will explicitly compute its stationary distribution, and finally obtain a closed-form expression for the limit of the normalized CLLRS through specific means.

Let $(J_k)_{k\ge 0}$ be an autoregressive process of order 1 with disruption time $\nu$, governed by the recursive equation

$$J_k = \big(a_0\mathbf{1}_{\{k\le\nu\}} + a_1\mathbf{1}_{\{k>\nu\}}\big)J_{k-1} + \varepsilon_k, \quad k \ge 1, \tag{4.1}$$

where the real parameters $a_0 \ne a_1$ are known and the initial variable $J_0$ is independent of the Gaussian white noise $(\varepsilon_k)_{k\ge 1}$. Let $N_k = \sum_{m\ge 1} \mathbf{1}(S_m \le k)$, where $N_0 = 0$ and the $S_m$ are sums of $m$ i.i.d. exponential random variables with parameter $\lambda > 0$. The renewal process $(N_k)_{k\ge 0}$ and the sequence $(J_k)_{k\ge 0}$ are independent. The process is assumed to be observed only through $\widetilde{J}_k = J_{N_k}$ for $k \ge 0$.

The problem is to construct a detection rule for $\nu$ on the basis of the observations

$$X_k = \widetilde{J}_{\sigma_k}, \quad k \ge 0, \tag{4.2}$$

where $\sigma_0 = 0$ and $\sigma_k = \inf\{l \ge \sigma_{k-1}+1 : N_l - N_{\sigma_{k-1}} > 0\}$ for $k \ge 1$. Note that $(\sigma_k)_{k\ge 0}$ is an increasing sequence of stopping times with respect to the filtration $(\sigma(N_k, k \le n))_{n\ge 0}$. The shift from the sequence $(\widetilde{J}_k)$ to the subsequence $(\widetilde{J}_{\sigma_k})$ is necessary because, in the case of equal values of $N_k$, the corresponding values of $\widetilde{J}_k$ coincide, and hence carry no information on the disruption; therefore such $\widetilde{J}_k$ must be excluded from the detection rule.
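The construction of the observation times can be sketched by simulation. Under the definition above, $\sigma_1, \sigma_2, \dots$ are the integer times $l$ such that at least one renewal falls in $(l-1, l]$, so each renewal time is rounded up and duplicates are dropped. The function names, the horizon and the seed are illustrative assumptions. With exponential inter-arrival times of parameter $\lambda$, the gaps $\sigma_k - \sigma_{k-1}$ should have mean $1/(1-e^{-\lambda})$, which the demo compares with the empirical mean.

```python
import math
import random

def observation_times(lam, horizon, rng):
    """Integer times l in 1..horizon with at least one renewal in (l-1, l] (sigma_1, sigma_2, ...)."""
    times = []
    t = rng.expovariate(lam)          # first renewal time S_1
    while t <= horizon:
        l = math.ceil(t)              # first integer time at which this renewal is counted
        if not times or times[-1] != l:
            times.append(l)
        t += rng.expovariate(lam)     # next renewal time
    return times

def gaps(times):
    """s_k = sigma_k - sigma_{k-1}, with sigma_0 = 0."""
    previous = 0
    out = []
    for s in times:
        out.append(s - previous)
        previous = s
    return out

if __name__ == "__main__":
    rng = random.Random(3)
    g = gaps(observation_times(0.7, 200000, rng))
    print(sum(g) / len(g), 1.0 / (1.0 - math.exp(-0.7)))
```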

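For this construction, we need to study the observation process (4.2) after $\nu$, that is $(X_k)_{k>\nu}$. Applying (4.1) repeatedly for $k > \nu$, we get $J_{m+k} = a_1^k J_m + \sum_{j=1}^{k} a_1^{k-j}\varepsilon_{j+m}$, and hence $(X_k)_{k>\nu}$ is governed by the recursive equation

$$X_k = \theta_k X_{k-1} + \eta_k, \quad k \ge \nu+1, \tag{4.3}$$

where $\eta_k = \sum_{j=1}^{s_k} a_1^{s_k-j}\varepsilon_{j+\sigma_{k-1}}$, $s_k = \sigma_k - \sigma_{k-1}$ and $\theta_k = a_1^{s_k}$. The associated conditional densities are

$$f(x,y;a_i) = f_i(y|x) = \sum_{k\ge 1} \gamma_k\,\varphi_{i,k}(y,x), \quad i = 0,1, \tag{4.4}$$

where $\gamma_k = (1-e^{-\lambda})e^{-\lambda(k-1)}$ and

$$\varphi_{i,k}(y,x) = \frac{1}{\varsigma_{i,k}\sqrt{2\pi}} \exp\Big(-\frac{(y-a_i^k x)^2}{2\varsigma_{i,k}^2}\Big) \quad\text{with}\quad \varsigma_{i,k}^2 = \frac{1-a_i^{2k}}{1-a_i^2}. \tag{4.5}$$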

Now let us compute the transition probability function for the random sequence governed by (4.3).

Theorem 4.1. The sequence $(X_k)_{k>\nu}$ is a homogeneous Markov chain of order 1, with transition function given for any Borel set $B \subset \mathbb{R}$ by

$$P(x,B) = \int_B f_1(y|x)\,dy, \tag{4.6}$$

where $f_1$ is given by (4.4) for $i = 1$.

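Proof. First, let us check that $(s_k)$ is an i.i.d. sequence. We compute

$$\begin{aligned} \mathbb{P}(s_1 = m_1, \dots, s_j = m_j) &= \mathbb{E}\big[\mathbb{P}(s_1 = m_1, \dots, s_j = m_j \,|\, \mathcal{F}_{\sigma_{j-1}})\big] \\ &= \mathbb{E}\big[\mathbf{1}(s_1 = m_1) \cdots \mathbf{1}(s_{j-1} = m_{j-1})\big]\, \mathbb{P}(s_j = m_j \,|\, \mathcal{F}_{\sigma_{j-1}}), \end{aligned}$$

where $\mathcal{F}_0 = \{\emptyset, \Omega\}$, and $\mathcal{F}_k = \sigma\{X_1, \dots, X_k\}$ for $k \ge 1$. It follows from the definition of $\sigma_j$ that

$$\begin{aligned} \mathbb{P}(s_j = m_j \,|\, \mathcal{F}_{\sigma_{j-1}}) &= \mathbb{P}\big(N_{\sigma_{j-1}} = \cdots = N_{\sigma_{j-1}+m_j-1},\ N_{\sigma_{j-1}+m_j} > N_{\sigma_{j-1}} \,\big|\, \mathcal{F}_{\sigma_{j-1}}\big) \\ &= (1-e^{-\lambda})e^{-\lambda(m_j-1)}. \end{aligned}$$

Thus, $X$ is a homogeneous Markov chain with transition probability function given by

$$\begin{aligned} P(x,B) &= \mathbb{P}\Big(a_1^{s_1}x + \sum_{j=1}^{s_1} a_1^{s_1-j}\varepsilon_{j+1} \in B\Big) \\ &= \sum_{k\ge 1} \mathbb{P}\Big(a_1^{k}x + \sum_{j=1}^{k} a_1^{k-j}\varepsilon_{j+1} \in B\Big)\, \mathbb{P}(s_1 = k) = \sum_{k\ge 1} \gamma_k \int_B \varphi_{1,k}(y,x)\,dy, \end{aligned}$$

and the desired result follows. □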

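Lemma 4.1. The Markov chain $X$ is geometrically ergodic if and only if $|a_1| < 1$.

Proof. Let us check the assumptions of Theorem A.1 from the appendix. Let $V$ be defined by

$$V(x) = c^*(1+x^2), \tag{4.7}$$

for some parameter $c^* \ge 1$. Then

$$\mathbb{E}_x[V(X_1)] = c^*\big(1 + \mathbb{E}_x X_1^2\big) = c^*\big(1 + \mathbb{E}_x(\theta_1 x + \eta_1)^2\big) = c^*\big(1 + x^2\,\mathbb{E}(\theta_1^2) + \mathbb{E}(\eta_1^2)\big).$$

It is easy to check that $\mathbb{E}(\eta_1^2) = (1-a_1^2e^{-\lambda})^{-1}$ and

$$\mathbb{E}(\theta_1^2) = \sum_{m\ge 1} a_1^{2m}\, \mathbb{P}(s_1 = m) = \frac{(1-e^{-\lambda})a_1^2}{1-a_1^2e^{-\lambda}} := \widetilde{a}_1 < 1.$$

Therefore, for any $\widetilde{a}_1 < \rho < 1$,

$$\mathbb{E}_x[V(X_1)] = c^*\Big(\frac{a_1^2}{e^{\lambda}-a_1^2} + \widetilde{a}_1 x^2\Big) \le \widetilde{c}^* + \widetilde{a}_1 V(x) \le \rho V(x), \quad x \in C,$$

where $\widetilde{c}^* = c^*(e^{\lambda}-a_1^2)^{-1}a_1^2$ and

$$C = \Bigg\{ x \in \mathbb{R} : |x| > \sqrt{\frac{\widetilde{c}^*}{c^*(\rho-\widetilde{a}_1)} - 1} \Bigg\}.$$

Consequently, both Conditions D1 and D2 of Theorem A.1 hold for the set $C$. Applying this theorem yields that $X$ is geometrically ergodic. □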

Lemma 4.2. If $|a_1| < 1$, then the stationary distribution of the Markov chain $(X_k)_{k>\nu}$ is centered Gaussian with variance $(1-a_1^2)^{-1}$.

Proof. Let $\mathcal{S} = \sigma\{N_t, t \ge 0\}$ be the $\sigma$-algebra generated by the renewal counting process $(N_t)_{t\ge 0}$. We get by induction from (4.3) that

$$X_n = \Bigg(\prod_{j=\nu+1}^{n} \theta_j\Bigg) X_\nu + \sum_{k=\nu+1}^{n} \eta_k \prod_{j=k+1}^{n} \theta_j, \quad n > \nu, \tag{4.8}$$

where an empty product is set to 1 by convention, and $\eta_k$ is $\mathcal{S}$-conditionally centered Gaussian distributed with variance

$$\Delta_k = \sum_{j=1}^{s_k} a_1^{2(s_k-j)} = \frac{1-a_1^{2s_k}}{1-a_1^2}.$$

Since $(\eta_n)_{n>\nu}$ is a sequence of $\mathcal{S}$-conditionally independent random variables, the distribution of $X_n$ for $n > \nu$ is $\mathcal{S}$-conditionally Gaussian, with mean

$$m_n = \mathbb{E}(X_n|\mathcal{S}) = \prod_{j=\nu+1}^{n} \theta_j\, \mathbb{E}(X_\nu|\mathcal{S}) = a_1^{\sigma_n-\sigma_\nu}\, \mathbb{E}(X_\nu|\mathcal{S}),$$

and variance

$$d_n = \mathbb{E}\Big[\big(X_n - \mathbb{E}(X_n|\mathcal{S})\big)^2 \,\Big|\, \mathcal{S}\Big] = \sum_{k=\nu+1}^{n} \Delta_k \prod_{j=k+1}^{n} \theta_j^2 = \frac{1-a_1^{2(\sigma_n-\sigma_\nu)}}{1-a_1^2}. \tag{4.9}$$

Moreover $m_n$ converges a.s. to 0 and $d_n$ to $(1-a_1^2)^{-1}$ when $n$ tends to infinity. Thus, both the conditional distribution of $X_n$ and its unconditional one are asymptotically centered Gaussian with variance $(1-a_1^2)^{-1}$. Therefore the candidate for the stationary distribution of the process is the probability measure defined by

$$\pi(B) = \sqrt{\frac{1-a_1^2}{2\pi}} \int_B e^{-(1-a_1^2)u^2/2}\,du, \quad B \in \mathcal{B}(\mathbb{R}). \tag{4.10}$$

Finally, it is straightforward to show that $\pi(B) = \int_{\mathbb{R}} P(x,B)\,\pi(dx)$ for any Borel set $B \in \mathcal{B}(\mathbb{R})$, where $P(x,B)$ is given in (4.6). □
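Lemma 4.2 lends itself to a simple Monte Carlo check. The sketch below simulates the post-change recursion (4.3) by drawing a geometric gap $s_k$ with $P(s_k = m) = (1-e^{-\lambda})e^{-\lambda(m-1)}$, then a centered Gaussian $\eta_k$ with variance $(1-a_1^{2s_k})/(1-a_1^2)$, and compares the empirical second moment with $(1-a_1^2)^{-1}$. The function name, the starting point $X_0 = 0$, the burn-in and the parameter values are illustrative assumptions.

```python
import math
import random

def simulate_post_change_chain(a1, lam, n, seed=0):
    """Draw X_1..X_n from X_k = a1^{s_k} X_{k-1} + eta_k, started at X_0 = 0."""
    rng = random.Random(seed)
    p_obs = 1.0 - math.exp(-lam)            # P(s_k = 1)
    x = 0.0
    path = []
    for _ in range(n):
        s = 1
        while rng.random() >= p_obs:        # geometric gap: P(s = m) = p_obs (1 - p_obs)^(m-1)
            s += 1
        var_eta = (1.0 - a1 ** (2 * s)) / (1.0 - a1 * a1)
        x = a1 ** s * x + rng.gauss(0.0, math.sqrt(var_eta))
        path.append(x)
    return path

if __name__ == "__main__":
    a1 = 0.6
    path = simulate_post_change_chain(a1, 1.0, 200000, seed=2)[1000:]   # drop a burn-in
    second_moment = sum(v * v for v in path) / len(path)
    print(second_moment, 1.0 / (1.0 - a1 * a1))
```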

Thanks to (4.6), the conditional densities in (2.1) take the form (4.4), or $f_i(x_j|x_0^{j-1}) = f(x_{j-1}, x_j; a_i)$, for $i = 0,1$, and the CLLRS in (2.2) takes the form

$$Z_j = g(X_{j-1}, X_j), \quad\text{where } g(x,y) = \log\frac{f(x,y;a_1)}{f(x,y;a_0)}. \tag{4.11}$$

The following properties of the conditional KLD are necessary for obtaining a closed-form formula for $I$. The proof is postponed to the appendix.

Lemma 4.3. Let $f_1(y|x)$ and $f_0(y|x)$ be given by (4.4).

1. If $a_1 \ne a_0$, then the local divergence $K(f_1, f_0; x)$ defined by (3.3) is positive for all $x \in \mathbb{R}$.
2. Let $V$ be defined by (4.7). If $c^* \ge 6(1-a_1^2)^{-1} + 2\log\sqrt{2\pi} - 2\log(1-e^{-\lambda})$, then $K(f_1, f_0; x) \le V(x)$ for all $x \in \mathbb{R}$.

It follows from Lemma 4.3 that the local divergence satisfies

$$K(f_1, f_0; x) \le 6(1-a_1^2)^{-1} + 2\log\sqrt{2\pi} - 2\log(1-e^{-\lambda}).$$

Condition 1 can then be deduced from Theorem 3.1, and a closed-form expression for the limit derives from (3.5). Still, the proof given below is specific to the case under study.

Theorem 4.2. Let $X$ be a Sparre Andersen model governed by (4.3). Then for any $\nu \ge 0$, Condition 1 is satisfied, with

$$I = \int_{\mathbb{R}} \pi(x)\, K(f_1(y|x)\,|\,f_0(y|x))\,dx = \sqrt{\frac{1-a_1^2}{2\pi}} \int_{\mathbb{R}} K(f_1, f_0; x)\, e^{-(1-a_1^2)x^2/2}\,dx, \quad \mathbb{P}^{(\nu)}\text{-a.s.},$$

where the conditional densities are given in (4.4).
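The limit $I$ of Theorem 4.2 has no elementary closed form, but it can be approximated numerically. The sketch below truncates the mixture (4.4) after a fixed number of terms and applies a midpoint rule twice: once in $y$ for the local divergence, once in $x$ against the Gaussian stationary density. The truncation level, integration ranges, step counts and function names are all assumptions chosen for illustration, with $|a_0| < 1$ and $|a_1| < 1$ assumed; they would need tuning for small $\lambda$ or for $|a_i|$ close to 1.

```python
import math

def transition_density(y, x, a, lam, kmax=40):
    """Truncated mixture (4.4): sum over k <= kmax of gamma_k times the N(a^k x, varsigma_k^2) density at y."""
    p_obs = 1.0 - math.exp(-lam)
    total = 0.0
    for k in range(1, kmax + 1):
        gamma_k = p_obs * math.exp(-lam * (k - 1))
        var = (1.0 - a ** (2 * k)) / (1.0 - a * a)
        total += gamma_k * math.exp(-0.5 * (y - a ** k * x) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return total

def local_divergence(x, a0, a1, lam, half_width=12.0, steps=300):
    """Midpoint rule for K(f1, f0; x), the integral of f1(y|x) log(f1(y|x)/f0(y|x)) over y."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        y = -half_width + (i + 0.5) * h
        f1 = transition_density(y, x, a1, lam)
        f0 = transition_density(y, x, a0, lam)
        if f1 > 0.0 and f0 > 0.0:             # guard against underflow in the far tails
            total += f1 * math.log(f1 / f0) * h
    return total

def kld_limit(a0, a1, lam, n_sd=5.0, steps=40):
    """Midpoint rule for I, the integral of pi(x) K(f1, f0; x) with pi = N(0, 1/(1 - a1^2))."""
    sd = 1.0 / math.sqrt(1.0 - a1 * a1)
    h = 2.0 * n_sd * sd / steps
    total = 0.0
    for i in range(steps):
        x = -n_sd * sd + (i + 0.5) * h
        pi_x = math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += pi_x * local_divergence(x, a0, a1, lam) * h
    return total

if __name__ == "__main__":
    print(kld_limit(0.2, 0.7, 1.0))
```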

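Proof. We can write

$$\sum_{j=\nu+1}^{\nu+n} Z_j = \sum_{j=\nu+1}^{\nu+n} U_j + \sum_{j=\nu+1}^{\nu+n} \widetilde{g}(X_{j-1}) = A_n + B_n,$$

where $U_j = g(X_{j-1}, X_j) - \widetilde{g}(X_{j-1})$ and $\widetilde{g}(x) = \mathbf{E}_x[g(x, X_1)] = \int_{\mathbb{R}} g(x,y) f_1(y \mid x)\,dy$. Let us show that, $\mathbf{P}^{(\nu)}$-a.s., $B_n/n$ converges to $K(f_1 \mid f_0)$ and $A_n/n$ converges to $0$ as $n$ tends to infinity.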

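First, applying the strong law of large numbers (see Meyn and Tweedie, 1993, Theorem 17.1.7) to the series $B_n = \sum_{j=\nu+1}^{\nu+n} \widetilde{g}(X_{j-1})$, for $n \ge 1$, yields

$$\lim_{n\to\infty} \frac{1}{n} B_n = \int_{\mathbb{R}} \widetilde{g}(y)\,d\pi(y) = K(f_1 \mid f_0) \qquad \mathbf{P}^{(\nu)}\text{-a.s.}$$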
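As a rough numerical illustration of this kind of limit, the sketch below simulates a plain Gaussian AR(1) chain instead of the Sparre Andersen model, which is an assumption made only to keep the example short. In that toy case $f(x,y;a)$ is the $N(ax,1)$ density at $y$, the local divergence is $K(f_1,f_0;x) = (a_1-a_0)^2x^2/2$, and the limit is $(a_1-a_0)^2/(2(1-a_1^2))$. The code averages the full sum $A_n + B_n$ of log-likelihood ratio increments; the names `average_llr` and `limit_rate` are hypothetical.

```python
import random

def average_llr(a0, a1, n, seed=0):
    """Return (1/n) * sum of Z_j for a Gaussian AR(1) chain run under a1.

    Assumptions (illustrative, not the model of the text):
    X_j = a1 * X_{j-1} + eps_j with eps_j ~ N(0, 1), started at X_0 = 0.
    """
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    for _ in range(n):
        y = a1 * x + rng.gauss(0.0, 1.0)
        # g(x, y) = log f(x, y; a1) - log f(x, y; a0) for N(a*x, 1) densities
        total += 0.5 * ((y - a0 * x) ** 2 - (y - a1 * x) ** 2)
        x = y
    return total / n

def limit_rate(a0, a1):
    """Integral of K(f1, f0; x) against pi = N(0, 1/(1 - a1^2)) in the toy case."""
    return (a1 - a0) ** 2 / (2.0 * (1.0 - a1 ** 2))

empirical = average_llr(a0=0.0, a1=0.5, n=200_000)
theoretical = limit_rate(0.0, 0.5)
print(empirical, theoretical)
```

With these hypothetical parameter values the theoretical limit equals $1/6$, and the empirical average should be close to it for large $n$.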

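Second, $(U_j)_{j>\nu}$ is a martingale difference sequence with respect to the natural filtration $\mathcal{F}_j = \sigma(X_0^j)$ for $j > \nu$. Therefore $A_n/n$ will converge to $0$ if

$$\sum_{j>\nu} \frac{1}{j^2}\, \mathbf{E}(U_j^2 \mid \mathcal{F}_{j-1}) < \infty \qquad \mathbf{P}^{(\nu)}\text{-a.s.} \tag{4.12}$$

Thanks to the bound (A.5), we get

$$\mathbf{E}(U_j^2 \mid \mathcal{F}_{j-1}) \le \mathbf{E}[g^2(X_{j-1}, X_j) \mid \mathcal{F}_{j-1}] \le \mathbf{E}[(u^* + X_{j-1}^2 + X_j^2)^2 \mid \mathcal{F}_{j-1}] \le 3\left( (u^*)^2 + X_{j-1}^4 + \mathbf{E}(X_j^4 \mid \mathcal{F}_{j-1}) \right).$$

Thanks to the latter inequality, it is sufficient for proving (4.12) to show that

$$\sup_{n>\nu} \mathbf{E}(X_n^4 \mid \mathcal{F}_\nu) < \infty. \tag{4.13}$$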

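First, (4.8) gives $X_n^4 \le 8X_\nu^4 \prod_{j=\nu+1}^{n} \theta_j^4 + 8\widetilde{X}_n^4 \le 8X_\nu^4 + 8Y_n^4$, where $Y_n = \sum_{k=\nu+1}^{n} \eta_k \prod_{j=k+1}^{n} \theta_j$. So, proving (4.13) amounts to proving that $\sup_{n>\nu} \mathbf{E}(Y_n^4 \mid \mathcal{F}_\nu) < \infty$.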


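For $n > \nu$, applying the Hölder inequality yields

$$Y_n^4 \le \left( \sum_{k=\nu+1}^{n} \left( |\eta_k| \prod_{j=k+1}^{n} |\theta_j|^{1/2} \right) \prod_{j=k+1}^{n} |\theta_j|^{1/2} \right)^4 \le c_n \sum_{k=\nu+1}^{n} \eta_k^4 \prod_{j=k+1}^{n} \theta_j^2,$$

where $c_n = \left( \sum_{k=\nu+1}^{n} \prod_{j=k+1}^{n} |\theta_j|^{2/3} \right)^3$. Therefore, thanks to (4.9),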

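$$\mathbf{E}(Y_n^4 \mid S) \le 3c_n \sum_{k=\nu+1}^{n} \Delta_k^2 \prod_{j=k+1}^{n} \theta_j^2 \le \frac{3c_n}{1-a_1^2} \sum_{k=1}^{n} \Delta_k \prod_{j=k+1}^{n} \theta_j^2 \le \frac{3c_n}{(1-a_1^2)^2}.$$

Since $\sigma_j - \sigma_{j-1} \ge 1$ in (4.2), we get $c_n \le \left( \sum_{k\ge1} |a_1|^{2k/3} \right)^3 < \infty$, and (4.13) is proven. Finally, the series (4.12) is convergent, and hence $A_n/n$ converges $\mathbf{P}^{(\nu)}$-a.s. to $0$. $\Box$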
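The Hölder step of the proof above, with exponents $4$ and $4/3$, states that $\left(\sum_k |\eta_k| p_k\right)^4 \le \left(\sum_k p_k^{2/3}\right)^3 \sum_k \eta_k^4 p_k^2$ for nonnegative weights $p_k$. The following sketch checks this inequality on randomly drawn inputs; the weights `p` are arbitrary stand-ins for the products $\prod_{j>k} |\theta_j|$, and the function name `holder_sides` is hypothetical.

```python
import random

def holder_sides(eta, p):
    """Return both sides of (sum |eta_k| p_k)^4 <= c * sum eta_k^4 p_k^2.

    Here c = (sum p_k^(2/3))^3 and the weights p_k are nonnegative.
    """
    lhs = sum(abs(e) * w for e, w in zip(eta, p)) ** 4
    c = sum(w ** (2.0 / 3.0) for w in p) ** 3
    rhs = c * sum(e ** 4 * w ** 2 for e, w in zip(eta, p))
    return lhs, rhs

rng = random.Random(1)
checks = []
for _ in range(1000):
    n = rng.randint(1, 30)
    eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    p = [rng.random() for _ in range(n)]  # stand-ins for prod |theta_j|
    lhs, rhs = holder_sides(eta, p)
    # small relative slack, since equality holds when n = 1
    checks.append(lhs <= rhs * (1.0 + 1e-12))
print(all(checks))
```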

5. OPTIMALITY PROPERTIES FOR AUTOREGRESSIVE MODELS

We will here prove the optimality of the window-limited CUSUM procedure (2.3) for the autoregressive model (3.6) for which both pre-change and post-change parameters are known. We will apply the detection procedures proposed in Lai (1998) and briefly described in Section 2. To this end, we will develop analytical tools leading to the conditions that provide the lower and upper bounds for the detection delay.

First, the detection delay has an asymptotic upper bound, because Condition 2 is satisfied, as stated in the following theorem, whose proof is given in the appendix.

Theorem 5.1. If the autoregressive model (3.6) is stable, that is, if $\theta \in \Theta_{stb}$, then Condition 2 is satisfied.

In Lai (1998, Theorem 4 (ii)), the upper bound (2.7) is proven to hold true provided that the system of statistics (3.8) satisfies (2.6). Therefore, the lower bound (2.5) together with the upper bound (2.7) implies the minimax property of the CUSUM procedure, as stated below.

Theorem 5.2. If the autoregressive model (3.6) is stable, then the detection delay risk of the window-limited procedure (2.3) based on the statistics (3.8) is upper bounded as in (2.7), with $I$ given by (3.12), and the procedure is asymptotically minimax with

$$\lim_{\alpha\to0} \frac{\inf_{\tau\in M_\alpha} \sup_{\nu\ge0} R_\nu(\tau)}{\sup_{\nu\ge0} R_\nu(T_\alpha)} = 1.$$

Moreover, this procedure is pointwise optimal, that is, for any $\nu \ge 0$,

$$\lim_{\alpha\to0} \frac{\inf_{\tau\in M_\alpha} R_\nu(\tau)}{R_\nu(T_\alpha)} = 1.$$
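To make the kind of rule covered by these theorems concrete, here is a minimal sketch of a window-limited CUSUM stopping rule in the spirit of Lai (1998), written for a first-order Gaussian autoregression with unit noise variance. This is an assumption made for brevity: the exact procedure is the one defined in (2.3) with the statistics (3.8), and the function names, threshold, window length and parameter values below are all hypothetical.

```python
import random

def llr(x_prev, x, a0, a1):
    """Log-likelihood ratio increment for a Gaussian AR(1) with unit noise."""
    return 0.5 * ((x - a0 * x_prev) ** 2 - (x - a1 * x_prev) ** 2)

def window_limited_cusum(xs, a0, a1, threshold, window):
    """First index n at which the maximum, over the last `window` start
    points k, of sum_{j=k..n} Z_j reaches `threshold`; None otherwise."""
    z = [0.0] + [llr(xs[j - 1], xs[j], a0, a1) for j in range(1, len(xs))]
    for n in range(1, len(xs)):
        partial, best = 0.0, float("-inf")
        for k in range(n, max(n - window, 0), -1):
            partial += z[k]
            best = max(best, partial)
        if best >= threshold:
            return n
    return None

def simulate(a0, a1, change_point, length, seed=0):
    """AR(1) path whose coefficient switches from a0 to a1 after change_point."""
    rng = random.Random(seed)
    xs = [0.0]
    for j in range(1, length + 1):
        a = a0 if j <= change_point else a1
        xs.append(a * xs[-1] + rng.gauss(0.0, 1.0))
    return xs

xs = simulate(a0=0.0, a1=0.8, change_point=300, length=1000, seed=3)
alarm = window_limited_cusum(xs, 0.0, 0.8, threshold=8.0, window=100)
print(alarm)
```

Raising the threshold lowers the false alarm rate at the price of a longer detection delay, which is the trade-off quantified asymptotically by the bounds (2.5) and (2.7).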

ACKNOWLEDGEMENTS

This work is partly supported by the French Federation Normandie Mathématiques, by the RSF grant 14-49-00079 (National Russian Research University "MPEI", 14 Krasnokazarmennaya, 111250 Moscow, Russia), and by the RFBR Grant 16-01-00121 A.


REFERENCES

Algoet, P. and Cover, T. (1988). A Sandwich Proof of the Shannon-McMillan-Breiman Theorem, Annals of Probability 16: 899–909.

Asmussen, S. and Albrecher, H. (2010). Ruin Probabilities, 2nd edition, Singapore: World Scientific.

Back, K., Bielecki, T.R., Hipp, C., Peng, Sh. and Schachermayer, W. (2003). Stochastic Methods in Finance, C.I.M.E.-E.M.S. Summer School, July 6–12, Bressanone/Brixen, Italy.

Barron, A. (1985). The Strong Ergodic Theorem for Densities: Generalized Shannon-McMillan-Breiman Theorem, Annals of Probability 13: 1292–1303.

Basseville, M. and Nikiforov, I.V. (1993). Detection of Abrupt Changes: Theory and Applications, Englewood Cliffs: Prentice Hall.

Cover, T. and Thomas, J. (1991). Elements of Information Theory, New York: Wiley Series in Telecommunications.

Feigin, P.D. and Tweedie, R. (1985). Random Coefficient Autoregressive Processes: a Markov Chain Analysis of Stationarity and Finiteness of Moments, Journal of Time Series Analysis 6: 1–14.

Fuh, C.-D. (2003). SPRT and CUSUM in Hidden Markov Models, Annals of Statistics 31: 942–977.

Galtchouk, L. and Pergamenchtchikov, S. (2014). Geometric Ergodicity for Classes of Homogeneous Markov Chains, Stochastic Processes and their Applications 124: 3362–3391.

Girardin, V. (2005). On the Different Extensions of the Ergodic Theorem of Information Theory, Recent Advances in Applied Probability, R. Baeza-Yates, J. Glaz, H. Gzyl, J. Hüsler and J.L. Palacios, eds., pp. 163–179, San Francisco: Springer-Verlag.

Girardin, V. and Limnios, N. (2006). Entropy for Semi-Markov Processes with Borel State Spaces: Asymptotic Equipartition Properties and Invariance Principles, Bernoulli 12: 1–19.

Gray, R. M. (2011). Entropy and Information Theory, 2nd edition, New York: Springer.

Kent, S. (2000). On the Trail of Intrusions into Information Systems, IEEE Spectrum 37: 52–56.

Konev, V. V. and Pergamenchtchikov, S. M. (1981). The Sequential Plans of Parameters Identification in Dynamic Systems, Automation and Remote Control 7: 84–92.

Konev, V. V. and Pergamenchtchikov, S. M. (1990). Truncated Sequential Estimation of the Parameters in a Random Regression, Sequential Analysis 9: 19–41.

Kullback, S. and Leibler, R. A. (1951). On Information and Sufficiency, Annals of Mathematical Statistics 22: 79–86.

Lai, T.L. (1998). Information Bounds and Quick Detection of Parameter Changes in Stochastic Systems, IEEE Transactions on Information Theory 44: 2917–2929.

Lorden, G. (1971). Procedures for Reacting to a Change in Distribution, Annals of Mathematical Statistics 42: 1897–1908.

Meyn, S. and Tweedie, R. (1993). Markov Chains and Stochastic Stability, New York: Springer Verlag.

Orey, S. (1985). On the Shannon-Perez-Moy Theorem, Contemporary Mathematics 41: 319–327.

Page, E.S. (1954). Continuous Inspection Schemes, Biometrika 41: 100–115.

Pergamenchtchikov, S. M. and Tartakovsky, A.G. (2016). Asymptotically Optimal Pointwise and Minimax Quickest Change-Point Detection for Dependent Data, Statistical Inference for Stochastic Processes, DOI: 10.1007/s11203-016-9149-x.

Shannon, C. (1948). A Mathematical Theory of Communication, Bell System Technical Journal 27: 379–423, 623–656.

Tartakovsky, A.G. and Veeravalli, V.V. (2004). Change-Point Detection in Multichannel and Distributed Systems with Applications, Applications of Sequential Methodologies, N. Mukhopadhyay, S. Datta and S. Chattopadhyay, eds., pp. 331–363, New York: Marcel Dekker, Inc.


Tartakovsky, A., Nikiforov, I. and Basseville, M. (2015). Sequential Analysis: Hypothesis Testing and Changepoint Detection, Boca Raton: Chapman and Hall/CRC.

Yakir, B. (1994). Optimal Detection of a Change in Distribution when Observations Form a Markov Chain with a Finite State Space, Change-Point Problems, E. Carlstein, H. Muller and D. Siegmund, eds., pp. 346–358, Hayward: Institute of Mathematical Statistics.

APPENDIX

A.1. Geometric Ergodicity for Homogeneous Markov Processes

We will follow the so-called Meyn–Tweedie approach; some definitions from Meyn and Tweedie (1993) and Galtchouk and Pergamenchtchikov (2014) are necessary.

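For a homogeneous Markov chain $(X_n)_{n\ge0}$ with measurable state space $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$, the transition probability of the chain is defined by $P(x,A) = \mathbf{P}(X_1 \in A \mid X_0 = x)$ for all $x \in \mathcal{X}$ and $A \in \mathcal{B}(\mathcal{X})$. The $n$-step transition probability is $P^n(x,A) = \mathbf{P}(X_n \in A \mid X_0 = x)$. A measure $\pi$ on $\mathcal{B}(\mathcal{X})$ is said to be invariant (or stationary, or ergodic) for the chain if $\pi(A) = \int_{\mathcal{X}} P(x,A)\,\pi(dx)$ for all $A \in \mathcal{B}(\mathcal{X})$. If an invariant positive measure $\pi$ with $\pi(\mathcal{X}) = 1$ exists, then the chain is said to be positive.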

The following conditions are minorization and drift conditions:

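$D_1$. There exist $\delta > 0$, a set $C \in \mathcal{B}(\mathcal{X})$, and a probability measure $Q$ on $\mathcal{B}(\mathcal{X})$ with $Q(C) = 1$, such that, for any $A \in \mathcal{B}(\mathcal{X})$ for which $Q(A) > 0$, $\inf_{x\in C} P(x,A) > \delta Q(A)$.

$D_2$. There exist a function $V: \mathcal{X} \to [1,\infty)$, constants $0 < \rho < 1$ and $d \ge 1$, and a set $C \in \mathcal{B}(\mathcal{X})$ such that $W = \sup_{x\in C} |V(x)| < \infty$ and, for all $x \in \mathcal{X}$, $\mathbf{E}_x[V(X_1)] \le (1-\rho)V(x) + d\,\mathbf{1}_C(x)$.

If $D_2$ is satisfied, $V$ is called the Lyapunov function of the chain. Note that, obviously, Condition $D_1$ implies that $\eta = \inf_{x\in C} P(x,C) - \delta > 0$.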
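As an illustration of the drift condition $D_2$, consider the toy chain $X_1 = ax + \varepsilon$ with $\varepsilon \sim N(0,1)$ and $|a| < 1$, together with the function $V(x) = 1 + x^2$. Both are assumptions chosen for simplicity and are not the function $V$ of (4.7). Then $\mathbf{E}_x[V(X_1)] = 2 + a^2x^2$, and one admissible choice is $\rho = (1-a^2)/2$, $d = (3-a^2)/2$ and $C = \{x : x^2 \le (3-a^2)/(1-a^2)\}$. The sketch below checks the inequality on a grid; the function name `drift_gap` is hypothetical.

```python
import math

def drift_gap(x, a, rho, d, radius):
    """(1 - rho) V(x) + d 1_C(x) - E_x[V(X_1)], with V(x) = 1 + x^2.

    Assumptions (illustrative): C = [-radius, radius] and
    X_1 = a*x + eps with eps ~ N(0, 1), so E_x[V(X_1)] = 2 + a^2 x^2.
    """
    v = 1.0 + x * x
    expected_v = 2.0 + a * a * x * x
    indicator = 1.0 if abs(x) <= radius else 0.0
    return (1.0 - rho) * v + d * indicator - expected_v

a = 0.7
rho = (1.0 - a * a) / 2.0
d = (3.0 - a * a) / 2.0
radius = math.sqrt((3.0 - a * a) / (1.0 - a * a))
grid = [i / 100.0 for i in range(-2000, 2001)]
worst = min(drift_gap(x, a, rho, d, radius) for x in grid)
print(worst)  # nonnegative up to rounding when the drift inequality holds
```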

The following theorem is used in the proof of Lemma 4.1 in Section 4; see Feigin and Tweedie (1985), and also Galtchouk and Pergamenchtchikov (2014).

Theorem A.1. Let $(X_n)_{n\ge0}$ be a homogeneous Markov chain satisfying conditions $D_1$ and $D_2$ with the same set $C \in \mathcal{B}(\mathcal{X})$. Then $(X_n)_{n\ge0}$ is a positive geometric ergodic process, i.e.,

$$\sup_{n\ge0} e^{\kappa^* n} \sup_{x\in\mathcal{X}} \sup_{0\le g\le V} \frac{1}{V(x)} \left| \mathbf{E}_x g(X_n) - \pi(\widetilde{g}) \right| \le R^*.$$

Note that the positive constants $\kappa^*$ and $R^*$ are determined in Galtchouk and Pergamenchtchikov (2014).
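The geometric rate can be seen explicitly on a toy example. The sketch below assumes the chain $X_k = aX_{k-1} + \varepsilon_k$ with $\varepsilon_k \sim N(0,1)$, the function $V(x) = 1 + x^2$ and the single test function $g(x) = x^2$. It therefore illustrates the bound for one $g$ only and does not compute the constants $\kappa^*$ and $R^*$ of the theorem. In this case $\mathbf{E}_x X_n^2 - (1-a^2)^{-1} = a^{2n}\left(x^2 - (1-a^2)^{-1}\right)$, so the rate $-2\log|a|$ and the constant $(1-a^2)^{-1}$ work; all names are hypothetical.

```python
import math

def second_moment(x, a, n):
    """E_x[X_n^2] for X_k = a*X_{k-1} + eps_k, eps_k ~ N(0, 1), X_0 = x."""
    m = x * x
    for _ in range(n):
        m = a * a * m + 1.0
    return m

a = 0.7
kappa = -2.0 * math.log(abs(a))   # candidate rate for this toy chain
stationary = 1.0 / (1.0 - a * a)  # integral of g(x) = x^2 against pi
bound = stationary                # candidate constant, for this g only

worst = 0.0
for n in range(25):
    for i in range(-50, 51):
        x = i / 5.0
        gap = abs(second_moment(x, a, n) - stationary) / (1.0 + x * x)
        worst = max(worst, math.exp(kappa * n) * gap)
print(worst, bound)
```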

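A.2. Proof of Theorem 3.2

Setting $\Psi_n = \sum_{j=\nu+1}^{\nu+n} \Phi_j \Phi_j'$, we obtain from (3.9) that

$$\Psi_n = A\Psi_n A' + A\left( \Phi_\nu\Phi_\nu' - \Phi_{\nu+n}\Phi_{\nu+n}' \right)A' + A\widetilde{M}_n + \widetilde{M}_n' A' + \sum_{j=\nu+1}^{\nu+n} \xi_j\xi_j',$$

where $\widetilde{M}_n = \sum_{j=\nu+1}^{\nu+n} \Phi_{j-1}\xi_j'$ is the $p\times p$ matrix whose first column is the vector $M_n$ defined in (3.13) and all other columns are null. So, if we set

$$D_n = A\left( \Phi_\nu\Phi_\nu' - \Phi_{\nu+n}\Phi_{\nu+n}' \right)A' + A\widetilde{M}_n + \widetilde{M}_n' A' + \sum_{j=\nu+1}^{\nu+n} \xi_j\xi_j',$$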