Tutorial Learning Metrics For Temporal Data

(1)


Ahlame Douzal, AMA-LIG, Université Joseph Fourier

(EPAT’2014)

(2)

Outline

1. Motivation
2. Temporal alignments
3. Values and behavior based metrics
4. Complex temporal data
   - Temporal kernels
   - Learning temporal matching

(3)

Motivation

(4)

Temporal Data

Definition

- A kind of sequence data:
  - an ordered set of elements
  - order criterion: time

Temporal data are ubiquitous

- User behaviour analysis
- Evolving social networks
- Load curve prediction
- Learning from sensor networks

(5)

Temporal data structures

(6)

Temporal alignments

(7)

Temporal alignments

Let $x_i = (x_{i1}, \dots, x_{iT})$ and $x_{i'} = (x_{i'1}, \dots, x_{i'T})$ be two time series of length $T$.

Definition

An alignment $\pi \in \mathcal{A}$ of length $|\pi| = m$ between two time series $x_i$ and $x_{i'}$ is defined as a sequence of $m$ couples of aligned elements:

$$\pi = ((\pi_1(1), \pi_2(1)), (\pi_1(2), \pi_2(2)), \dots, (\pi_1(m), \pi_2(m)))$$

- $\pi$ defines a warping function that realizes a mapping from the time axis of $x_i$ onto the time axis of $x_{i'}$

(8)

Temporal alignments: conditions

1. No a priori knowledge about which sub-periods contain important information.

2. Continuity and monotonicity conditions: $\pi_1$ and $\pi_2$ define mappings from $\{1, \dots, m\}$ to $\{1, \dots, T\}$ that satisfy, for all $j \in \{1, \dots, m-1\}$:

   $$\pi_1(j+1) \le \pi_1(j) + 1 \quad\text{and}\quad \pi_2(j+1) \le \pi_2(j) + 1,$$
   $$(\pi_1(j+1) - \pi_1(j)) + (\pi_2(j+1) - \pi_2(j)) \ge 1.$$

3. Boundary conditions:

   $$1 = \pi_1(1) \le \pi_1(2) \le \dots \le \pi_1(m) = T$$
   $$1 = \pi_2(1) \le \pi_2(2) \le \dots \le \pi_2(m) = T$$

4. Adjustment window condition:

   $$|\pi_1(j) - \pi_2(j)| \le r, \quad r = 0, \dots, T \text{ the window length}$$

5. Slope constraint condition:

   - the slope intensity is controlled by $p = \frac{r}{c} = 0, 1, 2, \dots$: a point that moves forward in the direction of one dimension $c$ consecutive times must then step at least $r$ times in the diagonal direction.
   - for $p = 0$ there is no restriction on the slope; for $p = \infty$ the warping function $\pi$ is restricted to the diagonal.
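The conditions above can be checked mechanically for a candidate alignment. The following is a minimal sketch, assuming an alignment is stored as a list of 1-based couples `(t, t')`; the function name `is_valid_alignment` is hypothetical, and the slope constraint (condition 5) is deliberately not checked.

```python
def is_valid_alignment(pi, T, r=None):
    """Check boundary, continuity/monotonicity and (optional) window conditions.

    pi : list of 1-based couples (t, t') -- hypothetical representation
    T  : common length of the two series
    r  : window length for the adjustment window condition (None = no window)
    The slope constraint is NOT checked in this sketch.
    """
    # Boundary conditions: start at (1, 1), end at (T, T).
    if not pi or pi[0] != (1, 1) or pi[-1] != (T, T):
        return False
    for (a, b), (a_next, b_next) in zip(pi, pi[1:]):
        da, db = a_next - a, b_next - b
        # Monotonic (no step back), continuous (step of at most 1),
        # and at least one of the two indices must advance.
        if not (0 <= da <= 1 and 0 <= db <= 1 and da + db >= 1):
            return False
    # Adjustment window condition: |pi_1(j) - pi_2(j)| <= r.
    if r is not None and any(abs(a - b) > r for a, b in pi):
        return False
    return True
```

For instance, the diagonal `[(1, 1), (2, 2), (3, 3)]` is valid for `T = 3` with any window, while `[(1, 1), (1, 2), (2, 3), (3, 3)]` is valid only when `r >= 1`.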

(9)

Values and behavior based metrics

(10)

Metrics for temporal data

Euclidean alignment

The Euclidean alignment $\pi$ between $x_i$ and $x_{i'}$ aligns elements observed at the same time:

$$\pi = ((\pi_1(1), \pi_2(1)), (\pi_1(2), \pi_2(2)), \dots, (\pi_1(T), \pi_2(T)))$$

$$\forall k = 1, \dots, T, \quad \pi_1(k) = \pi_2(k) = k, \quad |\pi| = T$$

Euclidean Distance for Time Series

The Euclidean distance ($D_E$) between the time series $x_i$ and $x_{i'}$ is given by:

$$D_E(x_i, x_{i'}) \overset{\text{def}}{=} \frac{1}{|\pi|} \sum_{k=1}^{|\pi|} \varphi(x_{i\pi_1(k)}, x_{i'\pi_2(k)}) = \frac{1}{T} \sum_{t=1}^{T} \varphi(x_{it}, x_{i't})$$

(11)

Unconstrained temporal alignments

Unconstrained Dynamic Time Warping ([SK83], [KL83])

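The Dynamic Time Warping (DTW) dissimilarity measure between the time series $x_i$ and $x_{i'}$ is given by:

$$DTW(x_i, x_{i'}) \overset{\text{def}}{=} \min_{\pi \in \mathcal{A}} C(\pi)$$

$$C(\pi) \overset{\text{def}}{=} \frac{1}{|\pi|} \sum_{k=1}^{|\pi|} \varphi(x_{i\pi_1(k)}, x_{i'\pi_2(k)}) = \frac{1}{|\pi|} \sum_{(t,t') \in \pi} \varphi(x_{it}, x_{i't'})$$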
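The minimum over alignments is usually obtained by dynamic programming. The sketch below is illustrative only and makes two assumptions that are not in the slides: it minimizes the unnormalized sum of local costs (the $\frac{1}{|\pi|}$ factor is dropped, since the classical recursion does not minimize an average), and it takes $\varphi$ as the absolute difference by default. The function name `dtw` is hypothetical.

```python
import numpy as np

def dtw(x, y, phi=lambda a, b: abs(a - b)):
    """Unnormalised DTW cost between two 1-D sequences (assumption: no 1/|pi| factor)."""
    n, m = len(x), len(y)
    # D[t, u] = minimal cumulative cost of aligning x[:t] with y[:u].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, n + 1):
        for u in range(1, m + 1):
            cost = phi(x[t - 1], y[u - 1])
            # Allowed predecessors: vertical, horizontal and diagonal steps.
            D[t, u] = cost + min(D[t - 1, u], D[t, u - 1], D[t - 1, u - 1])
    return float(D[n, m])
```

With this sketch, two series that differ only by a delay, such as `[0, 1, 1, 2]` and `[0, 1, 2, 2]`, get a cost of zero, whereas the time-by-time (Euclidean) alignment would not.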

(12)

Temporal alignments under global/local constraints

- Sakoe-Chiba band [SC78]: global window
- Itakura [Ita75]: slope constraints
- Rabiner [Rab89]: local constraints

(13)

Metrics for temporal data: Sakoe-Chiba constraint

Sakoe-Chiba Dynamic Time Warping [SC78]

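The Sakoe-Chiba band Dynamic Time Warping ($DTW_{SC}$) dissimilarity measure between the time series $x_i$ and $x_{i'}$ is given by:

$$DTW_{SC}(x_i, x_{i'}) \overset{\text{def}}{=} \min_{\pi \in \mathcal{A}} C(\pi)$$

$$C(\pi) \overset{\text{def}}{=} \frac{1}{|\pi|} \sum_{k=1}^{|\pi|} w_{\pi_1(k),\pi_2(k)}\, \varphi(x_{i\pi_1(k)}, x_{i'\pi_2(k)}) = \frac{1}{|\pi|} \sum_{(t,t') \in \pi} w_{t,t'}\, \varphi(x_{it}, x_{i't'})$$

$$w_{t,t'} = \begin{cases} 1 & \text{if } |t - t'| < c \\ \infty & \text{if } |t - t'| \ge c \end{cases}$$

- $\varphi$ taken as the Euclidean norm,
- $w_{t,t'}$: weights that constrain $\mathcal{A}$ to a subset of alignments,
- $c$ being the Sakoe-Chiba band width.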
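In a dynamic programming implementation, the infinite weights simply exclude the cells with $|t - t'| \ge c$. The following is a minimal sketch under the same assumptions as a plain DTW recursion (unnormalized cumulative cost, absolute difference as local cost for scalar observations); the name `dtw_sakoe_chiba` is hypothetical.

```python
import math

def dtw_sakoe_chiba(x, y, c):
    """DTW restricted to cells with |t - t'| < c (w = 1 inside the band, infinity outside)."""
    T = len(x)
    if len(y) != T:
        raise ValueError("both series must have length T")
    D = [[math.inf] * (T + 1) for _ in range(T + 1)]
    D[0][0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, T + 1):
            if abs(t - u) >= c:
                continue  # w = infinity: this cell belongs to no admissible alignment
            cost = abs(x[t - 1] - y[u - 1])
            D[t][u] = cost + min(D[t - 1][u], D[t][u - 1], D[t - 1][u - 1])
    return D[T][T]
```

With `c = 1` only the diagonal survives, so the result coincides with the time-by-time alignment; larger `c` lets the alignment absorb delays of up to `c - 1` time steps.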

(14)

Temporal alignment

Characteristics

- Dynamic programming alignments deal with delays or time differences
- Pairwise alignments
- Comparison involves the whole observations (no a priori knowledge about informative sub-periods)
- Values-based metrics
- Usage in classification/clustering: assumption of similar dynamics within classes

Lacking:

- Behavior-based metrics
- Comparison involving sub-period importances
- Multiple temporal alignments
- Addressing time series of complex dynamics

(15)

Behavior-based metrics

(16)

Behavior-based metrics

Definition (Similar / Opposite behavior)

- Two time series are said to be similar if, for each period $[t_i, t_{i+1}]$, they increase or decrease simultaneously with the same growth rate

- Two time series are said to be opposite if, for each period $[t_i, t_{i+1}]$, when one time series increases the other decreases (and vice versa) with the same growth rate (in absolute value)

- Two time series are said to have different behaviors if they are neither similar nor opposite (linearly and stochastically independent)

Some contributions

- Derivative-based, for slope comparison [KP01], [MLKCW03], [XW10]

- Correlation coefficient-based

- Kendall coefficient, qualitative distance [cTCK02], [SB08]

- Spearman coefficient, elements rank comparison [AT10], [CVMW07], [RBK08]

- Autocorrelation-based temporal kernel [GHS11]

- Temporal correlation [DCN07], [DCDG09], [DCA12]

(17)

Behavior-based metrics: Slope comparison

Derivative Dynamic Time Warping [KP01]

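$$DDTW(x_i, x_{i'}) \overset{\text{def}}{=} \min_{\pi \in \mathcal{A}} C(\pi)$$

$$C(\pi) \overset{\text{def}}{=} \frac{1}{|\pi|} \sum_{k=1}^{|\pi|} \varphi(\Delta_{i\pi_1(k)}, \Delta_{i'\pi_2(k)}) = \frac{1}{|\pi|} \sum_{(t,t') \in \pi} \varphi(\Delta_{it}, \Delta_{i't'})$$

$$\Delta_{it} = \frac{(x_{it} - x_{i\,t-1}) + (x_{i\,t+1} - x_{i\,t-1})/2}{2}$$

- ignores the sign of the slope (e.g. $\Delta_{it} = +1$, $\Delta_{jt} = +3$, $\Delta_{kt} = -1$, and $\varphi(\Delta_{it}, \Delta_{jt}) = \varphi(\Delta_{it}, \Delta_{kt}) = 2$)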
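The slope estimate $\Delta_{it}$ is easy to compute in vectorized form. The sketch below is only an illustration: the name `kp_derivative` is hypothetical, and dropping the two endpoints (where the formula is undefined) is an assumption of this example rather than something stated in the slides.

```python
import numpy as np

def kp_derivative(x):
    """Slope estimate Delta_t for the interior points of a 1-D series.

    Delta_t = ((x_t - x_{t-1}) + (x_{t+1} - x_{t-1}) / 2) / 2
    Assumption: the first and last points are dropped, so the output has length T - 2.
    """
    x = np.asarray(x, dtype=float)
    return ((x[1:-1] - x[:-2]) + (x[2:] - x[:-2]) / 2.0) / 2.0
```

DDTW can then be obtained by running any DTW routine on `kp_derivative(x)` and `kp_derivative(y)` instead of the raw values. The limitation noted above is visible directly: with the absolute difference as local cost, slopes `+1` and `+3` are as far apart as slopes `+1` and `-1`.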

(18)

Behavior-based metrics: Pearson correlation coefficient

Pearson correlation coefficient

$x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$

$$Cor(x, y) = \frac{\sum_{i,i'} (x_i - x_{i'})(y_i - y_{i'})}{\sqrt{\sum_{i,i'} (x_i - x_{i'})^2}\, \sqrt{\sum_{i,i'} (y_i - y_{i'})^2}}$$

+ / -

+ Similar, opposite, different $\Rightarrow$ $Cor = 1$, $-1$ and $0$
- Higher $Cor$ $\nRightarrow$ similar dynamics
- Involves all the couples $i, i'$ (ignores the temporal dependency)
- Overestimates the similarity (tendency effects, drifts, ...)

(19)

Behavior-based metrics: autocorrelation

Difference between Auto-Correlation Operators [GHS11]

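$x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$, $\tilde{x} = (\rho_1(x), \dots, \rho_K(x))$, $\tilde{y} = (\rho_1(y), \dots, \rho_K(y))$

$$\rho_\tau(x) = \frac{\sum_{i=1}^{T-\tau} (x_i - \bar{x})(x_{i+\tau} - \bar{x})}{\sum_{i=1}^{T} (x_i - \bar{x})^2}, \qquad d_{DACO}(x, y) = \|\tilde{x} - \tilde{y}\|^2$$

+ Divergence measure between correlograms (useful for model selection)

- Close autocorrelations $\rho_\tau$ (lower $d_{DACO}$) $\nRightarrow$ similar behaviors!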
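As an illustration of the last point, the following sketch computes $\rho_\tau$ and $d_{DACO}$ with NumPy and can be used to see that a series and its opposite have identical correlograms. The function names `autocorr` and `d_daco` are hypothetical, and the code assumes $K < T$ and non-constant series.

```python
import numpy as np

def autocorr(x, K):
    """Autocorrelations (rho_1, ..., rho_K) of a 1-D series (assumes K < len(x))."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)  # zero for a constant series: not handled in this sketch
    return np.array([np.sum(xc[:-tau] * xc[tau:]) / denom for tau in range(1, K + 1)])

def d_daco(x, y, K):
    """Squared Euclidean distance between the two autocorrelation vectors."""
    diff = autocorr(x, K) - autocorr(y, K)
    return float(diff @ diff)
```

For `x = [1, 2, 3, 4]` and `y = [-1, -2, -3, -4]`, which behave in opposite ways, `d_daco(x, y, 2)` is zero: a small $d_{DACO}$ says nothing about whether the two series move together.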

(20)

Behavior-based metrics: temporal correlation

Temporal correlation coefficient $Cort(x, y)$ of order $r$ [DCN07], [DCDG09], [DCA12]

$$Cort(x, y) = \frac{\sum_{i,i'} m_{ii'} (x_i - x_{i'})(y_i - y_{i'})}{\sqrt{\sum_{i,i'} m_{ii'} (x_i - x_{i'})^2}\, \sqrt{\sum_{i,i'} m_{ii'} (y_i - y_{i'})^2}}$$

$m_{ii'} = 1$ if $|i' - i| \le r$, 0 otherwise (temporal dependency within $r$)

+ / -

+ Similar, opposite, different $\Leftrightarrow$ $Cort = 1$, $-1$ and $0$
+ Not sensitive to tendency and drifts (lower $r$ advised)
- Sensitive to noise (higher $r$ advised)
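The definition translates almost literally into NumPy. The sketch below is a direct, quadratic-memory transcription meant for short series; the function name `cort` and the default `r = 1` are choices of this example. Taking `r = n - 1` keeps every couple $(i, i')$, in which case the formula reduces to the Pearson correlation coefficient.

```python
import numpy as np

def cort(x, y, r=1):
    """Temporal correlation of order r between two 1-D series of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(x))
    # m[i, i'] = 1 when |i' - i| <= r, 0 otherwise.
    m = np.abs(idx[:, None] - idx[None, :]) <= r
    dx = x[:, None] - x[None, :]  # all differences x_i - x_i'
    dy = y[:, None] - y[None, :]
    num = np.sum(m * dx * dy)
    den = np.sqrt(np.sum(m * dx ** 2)) * np.sqrt(np.sum(m * dy ** 2))
    return float(num / den)
```

For example, `cort(x, 2 * x + 3)` is 1 and `cort(x, -x)` is -1 for any non-constant `x`, matching the similar and opposite cases above.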

(21)

Illustration (1)

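15 synthetic time series

3 classes: $F_1 = \{1, \dots, 5\}$, $F_2 = \{6, \dots, 10\}$ and $F_3 = \{11, \dots, 15\}$

$$F_1 = \{f_1(t) \mid f_1(t) = g(t) + 2t + 3 + \epsilon\}$$
$$F_2 = \{f_2(t) \mid f_2(t) = \mu - g(t) + 2t + 3 + \epsilon\}$$
$$F_3 = \{f_3(t) \mid f_3(t) = 4g(t) - 3 + \epsilon\}$$

- $g(t)$: a random discrete function,
- $\mu = E(g(t))$,
- $\epsilon \sim N(0, 1)$,
- $2t + 3$: a linear trend effect.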

[Figure: the synthetic series F1(x), F2(x) and F3(x); vertical axis F(x), from 0 to 50]

(22)

Illustration (2)

- Both the Euclidean distance and dynamic time warping give $S_i$ closer to $S_k$ than to $S_j$:
- $d_E(S_i, S_k) = 4.24 < d_E(S_i, S_j) = 15.13 < d_E(S_j, S_k) = 16.15$
- $d_{dtw}(S_i, S_k) = 6 < d_{dtw}(S_i, S_j) = 29 < d_{dtw}(S_j, S_k) = 29$

(23)

Clustering time series

Hierarchical clustering

(24)

Illustration (3) : cor vs cort

[Figure: the series F1(x), F2(x) and F3(x) plotted against Time (vertical axis F(x), from 0 to 50), with six panels (a) to (f): (a) cort(F₁, F₂), (b) cort(F₁, F₃), (c) cort(F₂, F₃)]

(25)


(26)

Complex temporal data !

* Temporal kernels

- Learning temporal matching

(27)

Temporal kernels: under Euclidean alignment

Temporal Correlation Kernel [DCA12]

$$k_{cort}(x_i, x_{i'}) \overset{\text{def}}{=} Cort(x_i, x_{i'})$$

$Cort$ is a linear kernel (p.d.)

Autocorrelation Kernel [GHS11]

$$k_{DACO}(x_i, x_{i'}) \overset{\text{def}}{=} e^{-\frac{1}{\sigma^2} d_{DACO}(x_i, x_{i'})}$$

$k_{DACO}$ is a Gaussian kernel (p.d.)

(28)

Temporal kernels: under dynamic warping alignment

Dynamic Time Warping-based Kernels [BHB02]

$$k_{DTW}(x_i, x_{i'}) \overset{\text{def}}{=} e^{-\frac{1}{t} DTW(x_i, x_{i'})}$$

- not a p.d. kernel; $t$ is a normalization parameter

Sakoe-Chiba Dynamic Time Warping Kernel

$$k_{SC}(x_i, x_{i'}) \overset{\text{def}}{=} e^{-\frac{1}{t} DTW_{SC}(x_i, x_{i'})}$$

- not a p.d. kernel; $t$ is a normalization parameter

(29)

Temporal kernels: under dynamic warping alignment

Dynamic Temporal Alignment Kernel [SNNS01]

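$$DTAK(x_i, x_{i'}) \overset{\text{def}}{=} \max_{\pi \in \mathcal{A}} C(\pi)$$

$$C(\pi) \overset{\text{def}}{=} \frac{1}{|\pi|} \sum_{j=1}^{|\pi|} \varphi(x_{i\pi_1(j)}, x_{i'\pi_2(j)}) = \frac{1}{|\pi|} \sum_{(t,t') \in \pi} \varphi(x_{it}, x_{i't'})$$

$$\varphi(x_{it}, x_{i't'}) = k_\sigma(x_{it}, x_{i't'}) = e^{-\frac{1}{\sigma^2} \|x_{it} - x_{i't'}\|^2}$$

- not a p.d. kernel, but gives positive semidefinite matrices (sufficient in an experimental context)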

(30)

Temporal kernels: under dynamic warping alignment

Global Alignment Kernel [CVBM07]

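$$K_{softmax}(x_i, x_{i'}) \overset{\text{def}}{=} \sum_{\pi \in \mathcal{A}} \prod_{j=1}^{|\pi|} k(x_{i\pi_1(j)}, x_{i'\pi_2(j)})$$

$$k(x, y) \overset{\text{def}}{=} \frac{\frac{1}{2} e^{-\frac{1}{\sigma^2} \|x - y\|^2}}{1 - \frac{1}{2} e^{-\frac{1}{\sigma^2} \|x - y\|^2}}$$

+ not p.d. in general, but the property on $\frac{k}{1+k}$ yields positive semidefinite matrices

- Diagonally dominant Gram matrix (cause of the non-p.d. property, may be rescaled)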

Global Alignment Kernel [Cut11]

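$$K_{GA}(x_i, x_{i'}) \overset{\text{def}}{=} \sum_{\pi \in \mathcal{A}} \prod_{j=1}^{|\pi|} k(x_{i\pi_1(j)}, x_{i'\pi_2(j)})$$

$$k(x, y) \overset{\text{def}}{=} e^{-\phi_\sigma(x, y)}, \qquad \phi_\sigma(x, y) \overset{\text{def}}{=} \frac{1}{2\sigma^2} \|x - y\|^2 + \log\left(2 - e^{-\frac{1}{2\sigma^2} \|x - y\|^2}\right)$$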
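Although the sum ranges over all alignments, it can be accumulated cell by cell: every alignment reaching $(t, t')$ arrives from one of the three neighbouring cells allowed by the continuity conditions. The sketch below illustrates this for scalar observations; the function names are hypothetical, no band constraint is applied, and for long series the values can underflow or overflow (a log-domain version would be needed, which is outside this sketch).

```python
import numpy as np

def local_kernel(a, b, sigma=1.0):
    """k(x, y) = exp(-phi_sigma(x, y)) for scalar observations."""
    d = (a - b) ** 2 / (2.0 * sigma ** 2)
    phi = d + np.log(2.0 - np.exp(-d))
    return np.exp(-phi)

def global_alignment_kernel(x, y, sigma=1.0):
    """Sum over all alignments of the product of local kernels, by dynamic programming."""
    n, m = len(x), len(y)
    # M[t, u] = sum over alignments of x[:t] and y[:u] of the product of local kernels.
    M = np.zeros((n + 1, m + 1))
    M[0, 0] = 1.0
    for t in range(1, n + 1):
        for u in range(1, m + 1):
            k = local_kernel(x[t - 1], y[u - 1], sigma)
            M[t, u] = k * (M[t - 1, u] + M[t, u - 1] + M[t - 1, u - 1])
    return float(M[n, m])
```

As a sanity check, for two identical constant series of length 2 every local kernel equals 1, so the result is the number of admissible alignments, which is 3.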

(31)

Temporal kernels: under dynamic warping alignment

Triangle Global Alignment Kernel [Cut11]

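$$K_{TGA}(x_i, x_{i'}) \overset{\text{def}}{=} \sum_{\pi \in \mathcal{A}} \prod_{j=1}^{|\pi|} k(x_{i\pi_1(j)}, x_{i'\pi_2(j)})$$

$$k(x_{i\pi_1(j)}, x_{i'\pi_2(j)}) \overset{\text{def}}{=} \frac{w_{\pi_1(j),\pi_2(j)}\, k_\sigma(x_{i\pi_1(j)}, x_{i'\pi_2(j)})}{2 - w_{\pi_1(j),\pi_2(j)}\, k_\sigma(x_{i\pi_1(j)}, x_{i'\pi_2(j)})}$$

$w$ is a radial basis kernel on $\mathbb{N}$ (a triangular kernel for integers):

$$w(j, j') = \left(1 - \frac{|j - j'|}{c}\right)_+$$

$c$ being the Sakoe-Chiba band width.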

(32)

Complex temporal data !

- Temporal kernels

* Learning temporal matching

(33)

Time Series: complex data !

Real Data: UCI ML Household Electrical load consumption

Data characteristics:

- Each time series gives a daily load consumption

- In Low (vs. High) class the average consumption between 6-8 pm is lower (resp. higher) than the annual average consumption

(34)

Objective and challenges

Objective

- Early classification (before 6 pm) of a load consumption curve to predict consumer demand between 6 and 8 pm

Standard approaches

- Based on a standard time series metric (DTW)
- Assign a time series to the class of similar consumption profiles

Challenge

- Load consumption curves exhibit different global behaviors within classes, or nearly similar ones between classes

(35)

Learning temporal matching for time series classification

Objective

- Complex time series classification: different dynamics within classes, slight differences between classes

For this,

- Enlarge time series alignments to a less constrained temporal matching
- The learning process involves the whole dynamics within and between classes
- Match time series on their shared features within classes and distinctive ones between classes
- Derive a metric based on the highlighted discriminative features, to be used for the time series classification.

(36)

Proposal’s key

[FDCG13], [FDCG⁺14]

Given a set of linked time series (alignment, temporal matching,...)

[Figure: time series S₁, S₂ of class C₁ and S₃, S₄ of class C₂, each on a time axis from 1 to T, with links between instants i, i' and i'' of the linked series]

Idea

- Each link induces a variability corresponding to the divergence between the connected values

- To reveal shared features within a class, we minimize the within variance by removing links between non shared features

- To reveal differential features between classes, we maximize the between variance by removing links between shared features

(37)

Proposal’s key

[FDCG13], [FDCG⁺14]

How?

- A new formalization of the classical variance/covariance for a set of time series, as well as for a partition of time series

- Strengthen or weaken links according to their contribution to the variances within and between classes

[Figure: time series S₁, S₂ of class C₁ and S₃, S₄ of class C₂, each on a time axis from 1 to T, with links between instants i, i' and i'' of the linked series]

(38)

Variance/Covariance formalization for time series data

- $S_1, \dots, S_n$: multivariate time series of length $T$ describing $p$ variables
- $X$: description of $S_1, \dots, S_n$ by the $p$ variables
- Assume the time series are linked through DTW alignment, temporal matching, ...
- We define $M_{(n,n)} = (M^{ll'})$ as an adjacency block matrix
- A block $M^{ll'}$ specifies the linkage between $S_l$ and $S_{l'}$
- A term of $M^{ll'}$: $m^{ll'}_{ii'} = 1$ if the instants $i$ and $i'$ of $S_l$ and $S_{l'}$ are aligned, 0 otherwise.

(39)

Variance induced by a set of time series

- Variance/covariance induced by a set of time series

$$V_M(X) = X^t (I - M')^t P (I - M') X$$

$M'$: row-normalized matrix of $M$

$(I - M')$: Laplacian matrix of the graph defined by the connected observations. Each observation is centered relative to the average of its neighborhood.

Remark: $V_M$ leads to the total variance/covariance

- for a complete linkage defined by a unit matrix $M = 1$
- if each time series shrinks to one point
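The remark on complete linkage can be checked numerically. The sketch below is illustrative only: the slides do not define $P$, so it is assumed here to be the uniform weight matrix $\frac{1}{N} I$ over the $N$ stacked observations, and the function name `induced_variance` is hypothetical.

```python
import numpy as np

def induced_variance(X, M, P=None):
    """V_M(X) = X^t (I - M')^t P (I - M') X, with M' the row-normalised M.

    X : (N, p) array, one row per observation (all series stacked)
    M : (N, N) 0/1 adjacency matrix of the links (every row needs at least one link)
    P : (N, N) weight matrix; ASSUMPTION: uniform weights I / N when omitted
    """
    X = np.asarray(X, dtype=float)
    M = np.asarray(M, dtype=float)
    N = X.shape[0]
    if P is None:
        P = np.eye(N) / N
    M_norm = M / M.sum(axis=1, keepdims=True)
    # Each observation is centred on the average of its neighbourhood.
    R = (np.eye(N) - M_norm) @ X
    return R.T @ P @ R
```

With `M = np.ones((N, N))` every observation is centred on the global mean, so under the uniform-weight assumption the result equals the usual (biased) covariance matrix of `X`.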

(40)

Variance induced by a partition of time series

Variance/covariance within and between classes of time series

$$V_{M_W}(X) = X^t (I - M'_W)^t P (I - M'_W) X$$
$$V_{M_B}(X) = X^t (I - M'_B)^t P (I - M'_B) X$$

Intra-class matching $M_W$:

$m^{ll'}_{ii'} = 1$ if the linked time series belong to the same class, 0 otherwise.

Inter-class matching $M_B$:

$m^{ll'}_{ii'} = 1$ if the linked time series belong to different classes, 0 otherwise.

Remark: $V_{M_W}$ and $V_{M_B}$ lead to the within and between variance/covariance

- for a complete linkage defined by unit matrices $M_W = 1$, $M_B = 1$
- if each time series shrinks to one point

(41)

Learning a discriminative temporal matching

Two consecutive phases algorithm

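- Start: unconstrained linkage $M_W^0$ between $S_l$ and $S_{l'}$ over the instants $1, \dots, T$
- First phase: learn the intra-class matching, $\min(V_{M_W})$, giving $M_W^*$ (shared features)
- Second phase: learn the inter-class matching, $\max(V_{M_B})$ (shared + differential features)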

(42)

Learning the intra-class temporal matching

$S_l = (x_1^l, \dots, x_T^l)$, $S_{l'} = (x_1^{l'}, \dots, x_T^{l'})$ belonging to $C_k$ ($|C_k| = n_k$)

$M \setminus (i, i', l, l')$: $M$ after the removal of the link $(i, i')$ between $S_l$ and $S_{l'}$ ($m^{ll'}_{ii'} = 0$)

Outline of the algorithm

1. Initialise $M_W$ as a complete linkage:

   $\forall i, i' \in \{1, \dots, T\}$ and $S_l, S_{l'}$ of the same class, $m^{ll'}_{ii'} = 1$

2. Calculate the contribution $C^{ll'}_{ii'}$ to the variance $V_{M_W}$ of each link $(i, i')$ between $S_l$ and $S_{l'}$:

   $$C^{ll'}_{ii'} = V_{M_W} - V_{M_W \setminus (i, i', l, l')}$$

3. Delete the links $(i, i')$ (set $m^{ll'}_{ii'} = 0$) of positive contribution $C^{ll'}_{ii'} > 0$

4. Iterate steps 2 and 3 until $V_{M_W}$ stabilizes

(43)

Non degenerate and convergence conditions

$\forall k \in \{1, \dots, K\}$, $\forall (l, l') \in C_k$, $\forall (i, i') \in [1, T]^2$

Variance definition

1. $m^{ll}_{ii} > 0$

2. $M_W$ row-normalized: $\sum_{l'=1}^{n_k} \sum_{i'=1}^{T} m^{ll'}_{ii'} = 1$

Non-degenerate variance

3. Each observation of $S_l$ should be linked to at least one observation of $S_{l'}$: $\sum_{i'=1}^{T} m^{ll'}_{ii'} > 0$

Convergence of the variance minimization process

4. The deletion of $(i, i')$ impacts the neighborhoods of $i$ and $i'$ (rows $i$ and $i'$): at each iteration, delete the link of maximal positive contribution per row

(44)

Derive discriminative metric

$M_*$: the learned discriminative matching

- Let $M_*^{l.}$ be the average matching to $S_l$:

$$M_*^{l.} = \frac{1}{(n - n_k)\, T} \sum_{l'} M_*^{ll'} \quad \text{with } y_{l'} \ne y_l = k$$

- The discriminative dissimilarity between $S_{New}$ and $S_l$:

$$D_l(S_l, S_{New}) = \min_{r \in \{0, \dots, T-1\}} \left( \sum_{|i - i'| \le r;\ (i, i') \in [1, T]^2} \frac{m^{l.}_{ii'}}{\sum_{|i - i'| \le r} m^{l.}_{ii'}}\, (x_i^l - x_{i'}^{New})^2 \right)$$

where $r$ corresponds to the Sakoe-Chiba band width.
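Once the average matching $M_*^{l.}$ is available, the dissimilarity is a weighted squared difference minimized over the band width. The sketch below assumes univariate series and a $T \times T$ array `m` holding the terms $m^{l.}_{ii'}$; the function name is hypothetical, and skipping band widths that contain no link is a choice of this example (the slides do not say how an empty band is handled).

```python
import numpy as np

def discriminative_dissimilarity(m, x_l, x_new):
    """min over r of the m-weighted mean of (x_i^l - x_i'^New)^2 within |i - i'| <= r."""
    m = np.asarray(m, dtype=float)
    x_l = np.asarray(x_l, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    T = len(x_l)
    idx = np.arange(T)
    lag = np.abs(idx[:, None] - idx[None, :])
    sq = (x_l[:, None] - x_new[None, :]) ** 2  # sq[i, i'] = (x_i^l - x_i'^New)^2
    best = np.inf
    for r in range(T):
        w = np.where(lag <= r, m, 0.0)  # keep only links inside the band
        total = w.sum()
        if total > 0:  # ASSUMPTION: bands without any link are skipped
            best = min(best, float((w * sq).sum() / total))
    return best
```

A new series can then be assigned, for example, to the class of the training series `S_l` with the smallest returned value; that nearest-neighbour rule is an illustration, not a claim about the procedure used in the slides.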

(45)

Classification of the household electric power consumption

(46)

Classification of the household electric power consumption

Learned discriminant matching (CONSLEVEL)

M_W^∗ (Low) M_B^∗(Low vs. High)

(47)

Preliminary results

- Classes compactness/separability by MDS

(48)

References I

Z. Abraham and P. Tan, An integrated framework for simultaneous classification and regression of time-series data, SIAM International Conference on Data Mining, 2010, pp. 653–664.

Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt, Online handwriting recognition with support vector machines: a kernel approach, Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, IEEE, 2002, pp. 49–54.

Ljupčo Todorovski, Bojan Cestnik, and Mihael Kline, Qualitative clustering of short time-series: A case study of firms reputation data, IDDM-2002 (2002), 141.

Marco Cuturi, Fast global alignment kernels, Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 929–936.

M. Cuturi, J.-P. Vert, Øystein Birkenes, and T. Matsui, A kernel for time series based on global alignments, International Conference on Acoustics, Speech and Signal Processing, vol. 11, 2007, pp. 413–416.

F. Cabestaing, T.M. Vaughan, D.J. McFarland, and J.R. Wolpaw, Classification of evoked potentials by Pearson's correlation in a brain-computer interface, Modelling C Automatic Control (theory and applications) 67 (2007), 156–166.

A. Douzal-Chouakria and C. Amblard, Classification trees for time series, Pattern Recognition 45 (2012), no. 3, 1076–1091.

A. Douzal-Chouakria, A. Diallo, and F. Giroud, Adaptive clustering for time series: application for identifying cell cycle expressed genes, Computational Statistics and Data Analysis 53 (2009), no. 4, 1414–1426.

A. Douzal-Chouakria and P.N. Nagabhushan, Adaptive dissimilarity index for measuring time series proximity, Advances in Data Analysis and Classification Journal 1 (2007), no. 1, 5–21.

Cédric Frambourg, Ahlame Douzal-Chouakria, and Éric Gaussier, Learning multiple temporal matching for time series classification, Advances in

(49)

References II

Cédric Frambourg, Ahlame Douzal-Chouakria, Éric Gaussier, et al., Learning temporal matchings for time series discrimination.

A. Gaidon, Z. Harchaoui, and C. Schmid, A time series kernel for action recognition, British Machine Vision Conference, 2011.

F. Itakura, Minimum prediction residual principle applied to speech recognition, IEEE Transactions on Acoustics, Speech and Signal Processing 23 (1975), no. 1, 67–72.

J.B. Kruskal and M. Liberman, The symmetric time warping algorithm: From continuous to discrete, in Time Warps, String Edits and Macromolecules, Addison-Wesley, 1983.

E. Keogh and M.J. Pazzani, Derivative dynamic time warping, First International Conference on Data Mining, 2001.

C.S. Moller-Levet, F. Klawonn, K.H. Cho, and O. Wolkenhauer, Fuzzy clustering of short time series and unevenly distributed sampling points, 5th International Symposium on Intelligent Data Analysis (Berlin), Aug. 2003.

L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (1989), no. 2, 257–286.

J. Rydell, M. Borga, and H. Knutsson, Robust correlation analysis with an application to functional MRI, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 453–456, 2008.

Young Sook Son and Jangsun Baek, A modified correlation coefficient based similarity measure for clustering time-course gene expression data, Pattern Recognition Letters 29 (2008), no. 3, 232–242.

(50)

References III

Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, and Shigeki Sagayama, Dynamic time-alignment kernel in support vector machine, NIPS, vol. 14, 2001, pp. 921–928.

Y. Xie and B. Wiltgen, Adaptive feature based dynamic time warping, International Journal of Computer Science and Network Security 10 (2010), no. 1.