Computational Challenges of scRNA-seq Data Analysis

1.3 Single-cell RNA Sequencing Applied to Developmental Biology

1.3.3 Computational Challenges of scRNA-seq Data Analysis

The first goal of scRNA-seq is to identify the cell population/heterogeneity present in the sample based on transcriptomic profiles. For that purpose, classic clustering algorithms can be applied, as well as modified methods tailored for scRNA-seq. Once the cell populations have been identified, specific transcriptomic signatures for each cell type can be identified with differential expression analysis. In the context of developmental biology, scRNA-seq is applied to characterize dynamic processes such as cell differentiation. For this purpose, tools have been developed to predict the transitional states of the cells based on the progressive overlapping gene expression from one state to another. In this way, by ordering the cells along a pseudotime and identifying the genes that are dynamically expressed, the chronology of the genetic programme driving cell differentiation can be revealed.

More than a hundred scRNA-seq tools and new methods are published on a monthly basis. A collaborative page attempts to list all available scRNA-seq tools and is regularly updated: https://github.com/seandavi/awesome-single-cell.

Dealing with Sparse and High Dimensional Data

Single-cell RNA sequencing is not only an experimental challenge; it also requires adapted computational methods to interpret the data correctly. Single-cell RNA sequencing data present certain particularities compared with classic bulk RNA-seq data. The first particularity is that the data contain a large number of zero values (zero-inflated data), referred to as "drop-outs" (figure 1.33a). The zero values are explained by two factors: the gene expression process itself and the sensitivity of mRNA capture. Gene expression is a stochastic biochemical process that leads to random transcription, which induces high cell-to-cell variability even within a homogeneous cell population [Elowitz et al., 2002, Raj and van Oudenaarden, 2008]. In addition to this transcriptional noise, scRNA-seq has a limited sensitivity for capturing and amplifying transcripts (about 10% of the overall RNA content [Liu and Trapnell, 2016]). Together, these two factors result in data with a very high variance (figure 1.33b).

(a) Distribution of the raw read count of the Sox9 gene from 963 XX and XY gonadal somatic cells (full-length scRNA-seq, personal data).

(b) Squared coefficient of variation as a function of the mean expression of genes. Lowly expressed genes accumulate high variance due to drop-out events.

Figure 1.33 – ScRNA-seq data are noisy and zero-inflated (personal data, full-length scRNA-seq of 963 XX and XY gonadal somatic cells).
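The link between limited capture, drop-outs, and variance can be illustrated with a small simulation. The sketch below is a deliberately simplified toy model, not the analysis behind figure 1.33: it assumes that every cell contains the same number of molecules of a gene and that each molecule is captured independently with a probability of 10%. The function names and the molecule numbers are illustrative assumptions.

```python
import random
import statistics

def simulate_capture(n_molecules, n_cells, capture_rate, rng):
    """Observed count per cell when each molecule is captured independently."""
    return [sum(rng.random() < capture_rate for _ in range(n_molecules))
            for _ in range(n_cells)]

def zero_fraction(counts):
    """Fraction of cells in which the gene is not detected (drop-out rate)."""
    return sum(c == 0 for c in counts) / len(counts)

def cv_squared(counts):
    """Squared coefficient of variation: variance divided by squared mean."""
    mean = statistics.fmean(counts)
    return statistics.pvariance(counts) / mean ** 2 if mean > 0 else float("nan")

rng = random.Random(0)
summary = {}
for n_molecules in (5, 50, 500):  # hypothetical lowly to highly expressed genes
    counts = simulate_capture(n_molecules, n_cells=1000, capture_rate=0.1, rng=rng)
    summary[n_molecules] = (zero_fraction(counts), cv_squared(counts))
    print(n_molecules, summary[n_molecules])
```

Even without any biological variability, the lowly expressed gene shows the most zeros and the highest CV² in this model; transcriptional noise would add to this technical component.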

The second particularity of scRNA-seq data is the high dimensionality of the data. Single-cell RNA sequencing simultaneously measures the expression of tens of thousands of genes from hundreds to thousands of cells. High dimensionality makes the use of clustering methods difficult, as distances between the observations become more uniform as dimensionality grows, especially with sparse data such as gene expression.
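The loss of contrast between distances can be checked numerically. The following sketch assumes uniformly distributed random points, which is not a model of gene expression, and measures how much the farthest point differs from the nearest one relative to the nearest distance. The function name `relative_contrast` is an illustrative choice.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min of the distances from one reference point to the others."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    ref = points[0]
    dists = [math.dist(ref, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(1)
contrast = {d: relative_contrast(200, d, rng) for d in (2, 100, 5000)}
for n_dims, value in contrast.items():
    print(n_dims, round(value, 3))
```

As the number of dimensions grows, the contrast shrinks: the nearest and the farthest neighbours become almost equally distant, which leaves distance-based clustering with little signal to work with.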

Clustering the Cells

Clustering algorithms are multivariate analyses that explore the natural structure of datasets to classify the most similar observations into groups or clusters. Cluster analysis is a complex problem, as there is no "true" number of clusters, and the same dataset can produce results that are quite different depending on the algorithms [Freytag et al., 2017] (figure 1.34).

Moreover, clustering is particularly sensitive to the curse of dimensionality, thus applying the algorithms to the entire scRNA-seq dataset can produce misleading clusters, where cells are considered as being similar even though they belong to different clusters. Therefore, most scRNA-seq analysis pipelines propose using different protocols to deal with this problem, including both feature selection and dimension reduction methods prior to clustering cells.

(a) Centroid-based clustering (K-means with k=2).

(b) Density-based clustering (DBSCAN).

Figure 1.34 – Two clustering methods used with the same dataset produce different results (images from Wikimedia).

Feature selection Feature selection aims to remove uninformative genes from a dataset to reduce the intrinsic noise. The most widespread feature selection method is that of highly variable genes (HVG). It postulates that biologically relevant genes show more variability than expected from technical noise alone. This method was first applied to scRNA-seq data using spike-in internal controls to model the technical noise [Brennecke et al., 2013], and was then generalized to experiments without spike-ins by estimating the technical noise from the entire dataset. RNA sequencing data present a strong relationship between variance and mean gene expression. This relationship becomes visible when the squared coefficient of variation is plotted against the mean (figure 1.33b). A quadratic curve is fitted to this relationship and the genes that lie significantly above the curve are selected (figure 1.35). The HVG method tends to select a high number of lowly expressed genes, and variations of this method now include a minimum mean expression threshold [Macosko et al., 2015, Qiu et al., 2017a, Qiu et al., 2017b]. Although this method has been successfully applied in many studies, the biological relevance of HVGs is still not completely understood.

Figure 1.35 – Squared coefficient of variation (CV²) as a function of mean gene expression. The pink dots are the selected HVGs, the brown dots are the other genes, the blue dots are the ERCC spike-ins, the solid red line is the technical noise fit, and the dashed line is the limit where genes present 50% biological variance (taken from [Brennecke et al., 2013]).
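The selection of highly variable genes can be sketched as follows. The trend is assumed here to have the form CV² = a1/mean + a0, which corresponds to a quadratic relationship between variance and mean. This sketch replaces the statistical machinery of the published methods with an ordinary least-squares fit and a fixed fold cutoff, so the function names, the `fold` and `min_mean` parameters, and the toy values are illustrative assumptions only.

```python
def fit_noise_trend(means, cv2s):
    """Least-squares fit of cv2 = a1 / mean + a0 (linear in x = 1 / mean)."""
    xs = [1.0 / m for m in means]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(cv2s) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, cv2s))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a1, a0

def select_hvg(genes, means, cv2s, min_mean=1.0, fold=2.0):
    """Keep genes above the mean threshold whose CV² exceeds `fold` times the trend."""
    a1, a0 = fit_noise_trend(means, cv2s)
    return [g for g, m, c in zip(genes, means, cv2s)
            if m >= min_mean and c > fold * (a1 / m + a0)]

# Toy data: eight genes on the trend cv2 = 1 / mean + 0.1, plus one outlier.
means = [0.5, 1, 2, 5, 10, 20, 50, 100, 10]
cv2s = [2.1, 1.1, 0.6, 0.3, 0.2, 0.15, 0.12, 0.11, 2.0]
genes = [f"gene{i}" for i in range(8)] + ["geneX"]
selected = select_hvg(genes, means, cv2s)
print(selected)
```

Because the fit uses all genes, the outlier itself pulls the trend upwards. Fitting on spike-ins, or with a robust or weighted regression, would be a natural refinement.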

Dimensionality reduction Dimensionality reduction methods aim to project the data into a lower-dimensional space that concentrates the most important sources of variability, in order to improve the interpretability of the data. The most frequently used technique is principal component analysis (PCA), which decomposes the data into orthogonal components ordered such that the first component captures the most variance. In scRNA-seq analysis, PCA is used both for selecting the most informative genes from a dataset and for visualizing the data structure. Another widely used dimensionality reduction method is t-distributed stochastic neighbour embedding (t-SNE, pronounced "tee-snee") [van der Maaten et al., 2008]. t-SNE calculates the probability that two cells are similar and brings the cells that are most probably similar together, while pushing dissimilar cells apart. Unlike PCA, t-SNE is a non-linear method that preserves the local structure of the data but not the global structure. In other words, t-SNE considers the nearest neighbours of each cell; therefore, within a group, the cells are arranged by similarity (local structure). However, the distances between groups of cells do not reflect their degree of dissimilarity (global structure). t-SNE was first used in scRNA-seq analysis to cluster cells with density-based algorithms such as DBSCAN [Ester et al., 1996, Macosko et al., 2015], but the use of t-SNE for analysis rather than only for visualization is increasingly criticized, in particular because the perplexity parameter, which sets the effective number of nearest neighbours considered for each cell, has to be chosen empirically and can drastically change the results.
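To make the PCA step concrete, the sketch below extracts the first principal component of a small matrix by power iteration on the covariance matrix. This is a teaching simplification rather than the algorithm used by standard PCA implementations, the data are only centred (not scaled), and the three-gene toy dataset is hypothetical.

```python
import math
import random

def first_principal_component(rows, n_iter=200):
    """Return (unit vector, explained variance ratio) of the first PC."""
    n, d = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - col_means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between genes (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):  # power iteration converges to the top eigenvector
        new = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return vec, eigenvalue / total_variance

# Toy data: two co-varying genes driven by a shared factor, one noise gene.
rng = random.Random(2)
cells = []
for _ in range(200):
    factor = rng.gauss(0, 3)
    cells.append([factor + rng.gauss(0, 0.3), factor + rng.gauss(0, 0.3),
                  rng.gauss(0, 0.3)])
pc1, ratio = first_principal_component(cells)
print([round(v, 2) for v in pc1], round(ratio, 3))
```

The first component loads almost equally on the two co-varying genes and carries most of the variance, which is why a few components can summarise many correlated genes.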

(a) First two PCs from a PCA, with the percentage of variance in the PCs (prcomp package, input: log(RPKM+1), centred, scaled).

(b) t-SNE representation (Rtsne package, input: 9 PCs from prcomp, 5000 iterations, default perplexity (30)).

Figure 1.36 – PCA and t-SNE on the 822 highly variable genes from 563 cells from developing ovaries (personal data).
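The role of the perplexity parameter can be illustrated with the similarity step of t-SNE alone. In the sketch below, the probability that cell i picks cell j as a neighbour follows a Gaussian kernel whose width sigma is tuned, for each cell, until the perplexity (2 raised to the entropy of the neighbour distribution) matches the requested value. Only this calibration step is shown, not the embedding itself, and the function names and search bounds are illustrative assumptions.

```python
import math

def conditional_probabilities(sq_dists, sigma):
    """p(j|i) for one cell i, given its squared distances to all other cells."""
    d_min = min(sq_dists)  # subtracting the minimum avoids numerical underflow
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** entropy: the effective number of neighbours of the cell."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy

def calibrate_sigma(sq_dists, target, lo=1e-3, hi=1e3, n_iter=60):
    """Bisection on a log scale; perplexity increases with sigma."""
    for _ in range(n_iter):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_probabilities(sq_dists, mid)) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hypothetical cell with 30 other cells at increasing distances 1, 2, ..., 30.
sq_dists = [float(k * k) for k in range(1, 31)]
sigma_5 = calibrate_sigma(sq_dists, target=5)
sigma_15 = calibrate_sigma(sq_dists, target=15)
print(round(sigma_5, 3), round(sigma_15, 3))
```

A higher perplexity requires a wider kernel, so each cell takes more distant cells into account. This is why changing the parameter can reshape the resulting map.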

Defining groups of cells Once the data have been filtered to retain the most significant sources of variance, using feature selection and/or dimensionality reduction methods, one can proceed to cluster the cells to identify the different cell populations. A myriad of methods exists, including classic algorithms such as hierarchical clustering, K-means clustering, and K-nearest neighbour, and scRNA-seq-tailored methods such as SC3 [Kiselev et al., 2017], pcaReduce [Žurauskienė and Yau, 2016], and SINCERA [Guo et al., 2015], to cite a few. All these methods have both advantages and drawbacks, and a systematic comparison is hampered by the high frequency with which new tools are published. Despite the robustness of clustering analysis, the results may vary from one method to another [Freytag et al., 2017] (figure 1.37). The biological relevance of the clusters can be assessed by looking at the expression of known marker genes and by performing differential expression analysis.
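As an example of one of the classic algorithms, the sketch below implements K-means (Lloyd's algorithm) on points in a reduced two-dimensional space. It is a minimal illustration on hypothetical, well-separated toy data; it is not the procedure of any of the tools cited above, and the number of clusters k must be supplied by the user.

```python
import math
import random

def kmeans(points, k, rng, n_iter=50):
    """Lloyd's algorithm: alternate cell assignment and centroid update."""
    centroids = rng.sample(points, k)  # initialise on k randomly chosen cells
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every cell to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of the cells assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Toy data: two hypothetical populations of 50 cells each.
rng = random.Random(3)
points = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
          + [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)])
labels, centroids = kmeans(points, k=2, rng=rng)
print(labels[:5], labels[-5:])
```

Running the same function with k=3 would split one of the two populations without any warning, which illustrates why the biological relevance of the clusters has to be checked afterwards.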

Identifying Transcriptional Signatures

In bulk RNA-seq, the classic analysis consists of identifying the genes that are differentially expressed between two conditions. In scRNA-seq, the goal is to identify the genes that differ between two or more cell clusters. The existing methods for bulk RNA-seq, e.g. DESeq2 [Love et al., 2014] and edgeR [McCarthy et al., 2012], perform well with scRNA-seq data [Jaakkola et al., 2016]. Specific methods for scRNA-seq data were developed to compensate for the dispersion of the data due to drop-outs using, for example, Bayesian models (SCDE [Kharchenko et al., 2014], MAST [Finak et al., 2015], and M3Drop [Andrews and Hemberg, 2016]).

Figure 1.37 – t-SNE visualization of the same Drop-seq dataset, coloured by clusters using four different clustering methods (taken from [Freytag et al., 2017]).

In scRNA-seq, it is not two conditions but several cell clusters that have to be compared. To avoid tedious pairwise comparisons between all combinations of cell clusters, the detection of differentially expressed genes is based on the distribution of the expression values in all the clusters.

These methods assume that gene expression follows a particular distribution, e.g. a negative binomial distribution (as in Monocle [Qiu et al., 2017a, Qiu et al., 2017b]) or a zero-inflated negative binomial distribution (as in SCDE, MAST, and M3Drop). Genes are considered differentially expressed if their distribution varies significantly among the clusters.
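The two distributions, and the idea of testing whether a gene's distribution differs among clusters, can be sketched as follows. This is a generic illustration and not the implementation of any of the tools named above. It assumes a known, fixed dispersion parameter `size` (so that the variance is mu + mu²/size) and compares one shared mean against one mean per cluster with a likelihood-ratio statistic. All function names and toy counts are hypothetical.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial probability of count k with mean mu and dispersion size."""
    if mu == 0:
        return 1.0 if k == 0 else 0.0
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated NB: an extra probability mass pi_zero is placed on zero."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base

def nb_log_likelihood(counts, size):
    """Log-likelihood at the sample mean (the ML estimate of mu for fixed size)."""
    mu = sum(counts) / len(counts)
    return sum(math.log(nb_pmf(c, mu, size)) for c in counts)

def lr_statistic(clusters, size):
    """2 * (log-likelihood with one mean per cluster - log-likelihood with one mean)."""
    pooled = [c for cluster in clusters for c in cluster]
    separate = sum(nb_log_likelihood(cluster, size) for cluster in clusters)
    return 2.0 * (separate - nb_log_likelihood(pooled, size))

# Hypothetical counts of one gene in two clusters of five cells.
different = lr_statistic([[0, 1, 0, 2, 1], [8, 12, 9, 15, 10]], size=2.0)
similar = lr_statistic([[0, 1, 0, 2, 1], [1, 0, 2, 0, 1]], size=2.0)
print(round(different, 2), round(similar, 2))
```

Under the usual asymptotic argument, such a statistic would be compared with a chi-squared distribution with (number of clusters − 1) degrees of freedom. In practice the dispersion also has to be estimated, which this sketch leaves out.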

Prediction of the Cell Lineages and Pseudotime Ordering

Lineage reconstruction and pseudotime ordering aim to predict the most probable path along which the cells differentiate. They are inferred directly from a dimensionality reduction analysis using graph exploration algorithms. The first tool designed to order cells according to their differentiation or transitional state was Monocle [Trapnell et al., 2014]. It uses independent component analysis (ICA) and calculates the possible lineages and the order of the cells using a minimum spanning tree. The new version now integrates DDRTree dimensionality reduction [Qiu et al., 2017a]. Many other methods have been developed; they differ mainly in the dimensionality reduction method used, which includes PCA, ICA, or diffusion maps. Once the lineages and the pseudotime are defined, it is possible to perform differential expression analysis along the pseudotime in each lineage to decipher the expression dynamics underlying cell differentiation.
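The principle of ordering cells with a minimum spanning tree can be sketched as follows. The example assumes that the cells have already been projected into a low-dimensional space and that a root cell is known. It builds the tree with Prim's algorithm and defines pseudotime as the distance travelled along the tree from the root. This is a simplified illustration rather than Monocle's actual procedure, and the toy trajectory is hypothetical.

```python
import math
import random

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns an adjacency dict."""
    n = len(points)
    adjacency = {i: [] for i in range(n)}
    # Shortest known link from the growing tree to each cell not yet in it.
    best_dist = {i: math.dist(points[0], points[i]) for i in range(1, n)}
    best_from = {i: 0 for i in range(1, n)}
    while best_dist:
        nxt = min(best_dist, key=best_dist.get)
        d = best_dist.pop(nxt)
        src = best_from.pop(nxt)
        adjacency[src].append((nxt, d))
        adjacency[nxt].append((src, d))
        for i in best_dist:
            d_new = math.dist(points[nxt], points[i])
            if d_new < best_dist[i]:
                best_dist[i], best_from[i] = d_new, nxt
    return adjacency

def pseudotime(adjacency, root):
    """Distance along the tree from the root cell to every other cell."""
    time = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbour, d in adjacency[node]:
            if neighbour not in time:
                time[neighbour] = time[node] + d
                stack.append(neighbour)
    return time

# Toy trajectory: 20 cells sampled along a curve, in shuffled order.
true_times = [0.3 * k for k in range(20)]
random.Random(4).shuffle(true_times)
points = [(t, math.sin(t)) for t in true_times]
root = true_times.index(0.0)  # assume the earliest cell is known
time = pseudotime(minimum_spanning_tree(points), root)
order = sorted(range(len(points)), key=time.get)
print([round(true_times[i], 1) for i in order])
```

On this toy curve the tree is a simple chain and the recovered order matches the true progression. With real data the tree can branch, and each branch is then interpreted as a candidate lineage.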

The present PhD project aimed to identify the different gonadal somatic cell populations present prior to, during, and after sex determination; to reconstruct their lineage; and to characterise the genetic program underlying their differentiation. To this end, we isolated the somatic cells from XX and XY developing gonads at key stages of development, performed single-cell RNA sequencing, and analysed the data using clustering algorithms and cell lineage reconstruction methods. By doing so, we aimed to answer specific questions regarding gonadogenesis and sex determination.

Cell heterogeneity of the bipotential gonad The classical model of sex determination states that two distinct cell lineages derived from the coelomic epithelium, the supporting and steroidogenic cell lineages, exist prior to sex determination. Recent studies have challenged that model in favour of a single progenitor cell population, on the basis of indirect evidence [Wen et al., 2014, Zhang et al., 2015, Chen et al., 2017]. With scRNA-seq, we can evaluate the heterogeneity of the somatic cell population composing the bipotential gonad and refute one of the two existing hypotheses.

Specification and sex-dependent differentiation of the supporting cell lineage Existing transcriptomic studies based on microarrays highlighted the dynamics of gene expression during sex determination in XX and XY gonads, but the chronology of events is unclear because they describe pools of asynchronous cells [Nef et al., 2005, Jameson et al., 2012, Munger et al., 2013]. By analysing the transcriptomes of individual cells during the process of sex determination, we can bypass the problem of asynchronous cell differentiation and reconstruct the genetic program by ordering the cells along a pseudotime that reflects their transition from supporting cell precursors to either Sertoli or pre-granulosa cells.

Specification and sex-dependent differentiation of the steroidogenic cell lineage The specification of the steroidogenic cell precursors and how they differ between XX and XY is poorly understood. Using scRNA-seq, we should be able to identify the steroidogenic cell population in both sexes and to characterise their specification program as XX- and XY-specific cell types (i.e. Leydig and theca cells).

3.1 Deciphering cell lineage specification during male sex determination with single-cell RNA sequencing
