• Aucun résultat trouvé

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

N/A
N/A
Protected

Academic year: 2022

Partager "Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics"

Copied!
1
0
0

Texte intégral

(1)

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

Christophe AMBROISE1, Alia DEHMAN2, Michel KOSKAS3, Pierre NEUVIAL4, Guillem RIGAILL,1,5 and Nathalie VIALANEIX6

1 Laboratoire de Mathématiques et Modélisation d'Évry, UMR CNRS 8071, Université d'Évry Val d'Essonne, 23 boulevard de France, 91037 Évry, France

[email protected]

2 Hyphen-stat, 195 Route d'Espagne, 31036 Toulouse, France [email protected]

3 UMR518 AgroParisTech/INRA, , 16 rue Claude Bernard, 75231 Paris Cedex 05, France [email protected]

4 Institut de Mathématiques de Toulouse, UMR5219 CNRS, Université de Toulouse, UPS IMT, F-31062 Toulouse Cedex 9, France

[email protected]

5 Institute of Plant Sciences Paris Saclay IPS2, CNRS, INRA, Gif sur Yvette, France.

[email protected]

6 MIAT, Université de Toulouse, INRA, Castanet-Tolosan, France [email protected]

Genetic information is coded in long strings of DNA organized in chromosomes. High-throughput sequencing now allow to study biological phenomena along the whole genome at a very high resolution (for example using RNAseq, DNAseq, ChipSeq, HiC…). In most cases, we expect neighboring positions to behave similarly and using this a priori information is a way to tackle the complexity of genome-wide analyses. It is common practice to partition each chromosome into relevant regions, because the obtained regions hopefully correspond to biological relevant or interpretable units (genes, binding site, TADs, ...) and also because the obtained partition can be used as a dimensionality reduction method and allows statistical methods to be used more easily at the region level than at the position level. In most cases, regions of interest are however unknown and should be discovered using the data. A popular way to do so is to aggregate neighboring and similar positions based on a measure of similarity between pairs of positions, in a hierarchical way that is close to the biological organization of the genome.

This article focuses on a modification of the classical hierarchical agglomerative clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) can be merged.

Adjacency-constrained HAC is implemented in the R package rioja. Our main contribution with respect to existing works is an efficient implementation of adjacency-constrained HAC in the case where the similarity between genomically distant objects can be considered as negligible. We propose an algorithm that is almost linear in time and space with respect to the number of objects to be clustered. It uses a sparse band strategy based on pre-computations of certain cumulative sums of similarities, combined with a min-heap approach to efficiently store and maintain a list of candidate merges. This algorithm is implemented in the R package adjclust, which is available at https://CRAN.R-project.org/package=adjclust. We provide applications to SNP and Hi-C datasets.

Références

Documents relatifs

Scores obtained demonstrate that models trained at the harder levels of difficulty are more ro- bust to noise and achieve better results.. As before, also in this case MLP is,

Therefore, comparison of binding pockets is a more appropriate approach in order to predict if two proteins bind similar ligands [2], and many ligand predic- tion methods rely on

In Section 3, we give a new similarity measure that characterize the pairs of attributes which can increase the size of the concept lattice after generalizing.. Section 4 exposes

With respect to previous studies, the proposed approach consists of 2 steps : first, the definition of similarity between journals and second, these similarities are used

If we define the maximality in the tree without taking into account the merging condition (in this case, a group is maximal meaningful if it is more meaningful than all the other

Scoring similarity between key phrases and unstructured texts is an issue which is important in both information retrieval and text analysis.. Researchers from the two fields

We also found that as the number of ratings of a movie increases its rating similarity approaches to its Wikipedia content similarity whereas its rating similarity diverges away

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des