• Aucun résultat trouvé

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

N/A
N/A
Protected

Academic year: 2022

Partager "Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics"

Copied!
1
0
0

Texte intégral

(1)

4/2/2019 Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

file:///home/nathalie/Private/Travail/Recherche/HiC/pierre/communications/2019-01_SMPGD/ambroise_etal_SMPGD2019_poster.html 1/1

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

Christophe Ambroise , Alia Dehman , Pierre Neuvial , Guillem Rigaill and Nathalie Vialaneix

LaMME, Evry • Hyphen-stat, Toulouse Institut de Mathématiques de Toulouse/CNRS • IPS2, CNRS/INRA INRA MIAT •

Motivation: Regionally-structured genomic data

Genome-Wide Association Studies (GWAS)

• loci: SNP

• similarity: linkage disequilibrium

• regions: LD/haplotype blocks

Chromosome contact maps (Hi-C)

• loci: binned genome positions

• similarity: contact intensity

• regions: TAD; A/B compartments

Goal: Segmentation by constrained HAC

Hierarchical Agglomerative Clustering (HAC)

• Input: objects, similarity

• Repeat times: merge the most similar clusters

• Output: A dendrogram describing the sequence of merges

Adjacency-constrained HAC: only merge adjacent clusters

• Improved time complexity: quadratic

• Space complexity : can be improved in specific applications Still too high for Hi-C, GWAS: for each chromosome.

Contribution: a quasi-linear algorithm

Extra assumption: band diagonal similarity

Key 1: Ward’s linkage in constant time

Distance between clusters: Ward’s linkage

Key 2: Storing candidate fusions in a min-heap

Min heap

A partially ordered binary tree

• nodes = candidate merges

• ordering given by the linkage

next candidate fusion is the root of the heap

Complexity

• in space

• in time

Implementation

R package adjclust

• plots of similarity, dendrogram and clustering

• wrappers for SNP or Hi-C data analyses

• model selection by broken stick or slope heuristic

GWAS: inferring linkage disequilibrium blocks

Band approximation

Quality index: proportion of of approximation vs

Scalability

Data from [ ]

Hi-C: inferring Topologically Associated Domains

Influence of bandwidth

Data from [ ] and [ ]

DI around clusters

Directionality Index (DI, [ ])values are expected to show a sharp variation at TADs boundaries

References

A. Dehman, C. Ambroise, and P. Neuvial, BMC Bioinformatics 16, 148 (2015).

C. Ambroise, A. Dehman, P. Neuvial, G. Rigaill, and N. Vialaneix, (2019).

C. Ambroise and others, Adjclust: Adjacency-Constrained Clustering of a Block-Diagonal Similarity (2018).

K.D. Bennett, New Phytologist 132, 155 (1996).

S. Arlot, V. Brault, J.-P. Baudry, and others, Capushe: CAlibrating Penalities Using Slope Heuristics (2016).

C. Dalmasso, W. Carpentier, L. Meyer, and others, PLoS ONE 3, (2008).

J. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. Liu, and B. Ren, Nature 485, 376 (2012).

Y. Shen, F. Yu, D.F. McCleary, Z. Ye, L. Edsall, and others, Nature 488, 116 (2012).

1 2 3 4 5

1 2 3 4 5

p S

p − 1

(O(p

2

))

(O(p

2

))

1

p ∼ 10

4

− 10

5

2

δ(C, C

) = + − , S(C) = ∑

(i,j)∈C2

s

ij

S(C)

|C|

S(C

)

|C

|

S(C ∪ C

)

|C ∪ C

|

δ

O(ph)

O(p(h + log(p))

3

4 5

h

6

7 8 7

1 2 3 4 5 6 7 8

Références

Documents relatifs

Scoring similarity between key phrases and unstructured texts is an issue which is important in both information retrieval and text analysis.. Researchers from the two fields

If we define the maximality in the tree without taking into account the merging condition (in this case, a group is maximal meaningful if it is more meaningful than all the other

Scores obtained demonstrate that models trained at the harder levels of difficulty are more ro- bust to noise and achieve better results.. As before, also in this case MLP is,

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

A memory control mecha- nism is applied to the link hierarchy during the modified refine/gather/push-pull: based on user defined memory limits, links are established at higher levels

It allows a better scalability compared to the conventional AHC al- gorithm both in memory and processing time, and its results are deterministic unlike some aforementioned

Apart from verifying cluster hypothesis, other tests are carried out to compare retrieval effectiveness using different clustering techniques, among which, four hie-

In this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both