MyoMiner: explore gene co-expression in normal and pathological muscle

(1)

HAL Id: hal-02746400

https://hal.sorbonne-universite.fr/hal-02746400

Submitted on 3 Jun 2020

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Apostolos Malatras, Ioannis Michalopoulos, Stéphanie Duguez, Gillian

Butler-Browne, Simone Spuler, William Duddy

To cite this version:

Apostolos Malatras, Ioannis Michalopoulos, Stéphanie Duguez, Gillian Butler-Browne, Simone Spuler,

et al.. MyoMiner: explore gene co-expression in normal and pathological muscle. BMC Medical

Genomics, BioMed Central, 2020, 13 (1), pp.67. �10.1186/s12920-020-0712-3�. �hal-02746400�

(2)

D A T A B A S E

Open Access

MyoMiner: explore gene co-expression in

normal and pathological muscle

Apostolos Malatras

1

, Ioannis Michalopoulos

2

, Stéphanie Duguez

1,3

, Gillian Butler-Browne

1

, Simone Spuler

4

and

William J. Duddy

1,3*

Abstract

Background: High-throughput transcriptomics measures mRNA levels for thousands of genes in a biological sample. Most gene expression studies aim to identify genes that are differentially expressed between different biological conditions, such as between healthy and diseased states. However, these data can also be used to identify genes that are co-expressed within a biological condition. Gene co-expression is used in a guilt-by-association approach to prioritize candidate genes that could be involved in disease, and to gain insights into the functions of genes, protein relations, and signaling pathways. Most existing gene co-expression databases are generic, amalgamating data for a given organism regardless of tissue-type.

Methods: To study muscle-specific gene co-expression in both normal and pathological states, publicly available gene expression data were acquired for 2376 mouse and 2228 human striated muscle samples, and separated into 142 categories based on species (human or mouse), tissue origin, age, gender, anatomic part, and experimental condition. Co-expression values were calculated for each category to create the MyoMiner database.

Results: Within each category, users can select a gene of interest, and the MyoMiner web interface will return all correlated genes. For each co-expressed gene pair, adjusted p-value and confidence intervals are provided as measures of expression correlation strength. A standardized expression-level scatterplot is available for every gene pair r-value. MyoMiner has two extra functions: (a) a network interface for creating a 2-shell correlation network, based either on the most highly correlated genes or from a list of genes provided by the user with the option to include linked genes from the database and (b) a comparison tool from which the users can test whether any two correlation coefficients from different conditions are significantly different.

Conclusions: These co-expression analyses will help investigators to delineate the tissue-, cell-, and pathology-specific elements of muscle protein interactions, cell signaling and gene regulation. Changes in co-expression between pathologic and healthy tissue may suggest new disease mechanisms and help define novel therapeutic targets. Thus, MyoMiner is a powerful muscle-specific database for the discovery of genes that are associated with related functions based on their co-expression.

MyoMiner is freely available athttps://www.sys-myo.com/myominer

Keywords: Transcriptomics, Correlation, Gene co-expression, Gene co-expression networks, Differential correlation, Functional genomics

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visithttp://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

* Correspondence:[email protected]

1_{Sorbonne Université, Inserm, Institut de Myologie, U974, Center for Research}

in Myology, 47 Boulevard de l’hôpital, 75013 Paris, France

3_{Northern Ireland Centre for Stratified Medicine, Biomedical Sciences}

Research Institute, C-TRIC, Altnagelvin Hospital Campus, Glenshane Road, Ulster University, Derry/Londonderry BT47 6SB, UK

(3)

Background

High-throughput data are crucial for modern biology. cDNA microarrays have provided an efficient way to measure the expression of thousands of genes simultan-eously [1, 2], thus helping the study of fundamental bio-logical processes such as gene regulation, signaling pathways and even complex disease traits. The main use of microarrays is differential gene expression analysis where two or more sets of samples are compared (e.g. treated or diseased vs normal) and the up- or down-regulated genes are identified. The accumulation of large amounts of data over the years in public high-throughput data repositories such as ArrayExpress [3] and Gene Expression Omnibus [4], allows us to identify relations between genes through correlation analysis. However, it is difficult for experimental researchers to combine and extract the information they seek if they have limited bioinformatics expertise.

Measures of gene co-expression obtained by the ana-lysis of data stored in high-throughput repositories such as ArrayExpress [3] and Gene Expression Omnibus [4] are now widely used to study gene function, protein rela-tions and biological networks such as signaling pathways [5,6], and several tools and resources exist that facilitate the exploration and analysis of gene co-expression across tissues, common technological platforms, or conditions [7–9]. Furthermore, pathology-specific gene co-expression can be used as a biomarker discovery tool [10] or for pa-tient prognosis [11,12].

An important purpose of gene co-expression analysis is in discovering the mechanistic links between genes. Gene co-expression is a form of functional association, alongside other types of functional associations such as protein-protein interactions determined by immunopre-cipitation experiments, or protein cellular co-localization as determined by immunostaining. Since functional as-sociation data can be regarded as a graph structure, gene co-expression can be used for network biology or net-work medicine types of analysis [13, 14]. As such, the study of gene co-expression can be used to understand better the mechanisms of molecular interaction within a cell. This may be the whole cell, in which case it can be called interactomics, or it can be focused on a specific gene, function, or pathway. Gene co-expression analysis therefore can enhance the study of changes to molecular interaction networks, and can be applied both at the whole cell level and for specific cellular functions, and to compare between different pathologies and conditions. Recent work suggests that direct causal relationships be-tween genes may be inferable from gene co-expression [15]. The purpose of gene co-expression analysis is dis-tinct from that of differential expression analysis, the purpose of which is to identify differences in individual gene transcript levels between conditions. Differential

expression analysis may be combined with functional en-richment testing to detect changes across gene sets (e.g. representing functions, pathways, and cellular compo-nents), but it cannot tell us about mechanistic relation-ships between genes within a given gene set.

Several organism-specific co-expression databases already exist such as the Arabidopsis Co-expression Tool (ACT) [16,17] and ATTED-II [18] for Arabidopsis thali-ana, and CoXPRESdb [19], STARNET [20], Genevesti-gator [21] and Human Gene Correlation Analysis (HGCA) [22] for mammals. They collect gene expression data and a Pearson correlation coefficient [23] is calcu-lated for each pair of genes or probes, which can be used as a measure of expression correlation and for network construction from the highly-correlated genes. However, these databases are not tissue- or cell-specific, because their expression matrices are derived from a mix of tis-sue types and in some cases from mixed conditions (e.g. treated and untreated cells). Since gene expression dif-fers between types of tissues and cells [24], it is expected that gene co-expression will also vary. Experimentalists seeking to identify correlation patterns for a chosen gene of interest, usually focus on a specific tissue or cell model and thus the relevance of co-expression values is greatly enhanced by the specificity of the data used [25]. ImmuCo [26] and Immuno-Navigator [27] gene co-expression databases are among the first to address im-mune cell-specific correlation, and the latter also cor-rects the expression matrices for batch effects. Many conditions, such as reagents, equipment, software and personnel could vary during the course of an experiment and may introduce batch effects, which is a common and strong source of variation on high-throughput data [28,29]. Batch effects are unrelated to biological or sci-entific variables, are not corrected by normalization [29] and must be removed before any further analysis. By combining studies, one extra layer of batch effects is in-troduced: experiments from different laboratories [30]. If left uncorrected, this technical variation will introduce error into the results of correlation analysis. Another issue is that current co-expression databases include gene co-expression from healthy samples only or from a mix of healthy and diseased conditions. Studying the changes in co-expression between healthy and patho-logical states could lead to biomarker discovery and to improved understanding of disease mechanisms [31].

Here, we introduce MyoMiner (https://www.sys-myo. com/myominer), the first striated muscle cell- and tissue-specific database that provides co-expression ana-lyses in both normal and pathological tissues, addressing both issues of overall correlation and batch effects. Myo-Miner includes 2376 mouse and 2228 human microarray samples separated in 142 human, mouse and cell cat-egories based on age, sex, anatomic part and condition.

(4)

We built a simple and easy-to-use web interface to search for transcriptional correlation of any expressed gene pair in muscle cells/tissues and the various pathological condi-tions. Users can select a category and a gene of interest, and MyoMiner will return all the expressed correlated genes for that category. Correlation strength is measured by the provided FDR adjusted p-value (q-value) and confi-dence intervals are given for each correlation.

Construction and content Microarray data collection

To collect muscle-specific microarray data and discard low quality samples, we followed a pipeline similar to that used for our Muscle Gene Sets resource [32], as de-scribed below. Even though ArrayExpress partially mir-rors Gene Expression Omnibus, we searched both repositories for striated muscle (skeletal and cardiac), cells and cell line experiments. In this initial screening, we found that the most popular platforms used for muscle-related experiments were Affymetrix Human Genome U133 Plus 2.0 GeneChip (GEO platform GPL570 or ArrayExpress ID A-AFFY-44) for human and Affymetrix Mouse Genome 430 2.0 GeneChip (GEO platform GPL1261 or ArrayExpress ID A-AFFY-45) for murine samples. Since correlation analysis requires homogenous data, we limited our more refined subse-quent searches to these two platforms, which represent about half of all muscle data on both repositories.

We searched ArrayExpress using the following strings: (muscle(s) OR myoblast(s) OR myotube(s) OR myofiber(s) OR cardiomyocyte(s) OR myocyte(s) OR heart(s) OR HSMM) AND A-AFFY-44for human samples, and (mus-cle(s) OR myoblast(s) OR myotube(s) OR myofiber(s) OR cardiomyocyte(s) OR myocyte(s) OR heart(s) OR C2C12 OR HL1 OR G8 OR SOL8) AND A-AFFY-45 for murine samples. GEO and ArrayExpress assign a different ID to each alternative platform. An alternative platform is the same microarray chip as the original, but the data are pre-processed with a different probe-to-gene mapping file called Chip Description File (CDF). It is quite popu-lar for researchers to use a different CDF than the ori-ginal for better probe-to-probeset and probeset-to-gene targeting accuracy (see “Probes to gene mapping” sec-tion). GEO provides a list of the alternative platforms into the original platform information sheet, but many were missing. An additional way to identify the alterna-tive platforms is to search on ArrayExpress (which is manually curated) for alternative IDs. In the array browser of ArrayExpress ( https://www.ebi.ac.uk/arrayex-press/arrays/browse.html), we searched for U133 Plus 2.0, MG 430 2.0 and retrieved all of the alternative GEO platforms and IDs to A-AFFY-44 [GEO: GPL570] for human and to A-AFFY-45 [GEO: GPL1261] for mouse (Additional file1: Table S1).

Next, we developed a script to parse automatically their MIAME [33] metadata and confirm them manu-ally, selecting only those pertinent to muscle research. We excluded all series that did not include the raw CEL files (Affymetrix fluorescence light intensity files) be-cause we pre-processed them using a robust data ana-lysis pipeline, described in detail below, so as to homogenize the data as much as possible.

Particular microarray samples may have been used for several experiments, or analyzed with different normalization algorithms, or even grouped with other samples in larger meta-analyses, the results of which have been re-submitted to the repositories. The reused microarrays get a different ID (GSM number in GEO) and it is crucial to identify and remove them from co-expression analysis, as duplicates will erroneously in-crease correlation scores and introduce biases. Using the conversion tool (apt-cel-convert.exe) of Affymetrix Power Tools [34], we transformed the binary CEL files (version 4) to ASCII text format (version 3) in order to

parse them. Their light intensity values were

concatenated into a string and used as input to three hash algorithms: MD5 [35], SHA-1 [36] and CRC32 [37]. The combined hash acts as a unique key for each sample and the duplicate arrays were then easily identi-fied and removed.

Quality control of Affymetrix microarrays

The quality control pipeline was identical to that used previously for our Muscle Gene Sets resource [32]. Ar-rays that had extreme values or were above our set thresholds on the combined quality controls, were ex-cluded from any further analysis. In total, we removed 160 human and 122 mouse samples (Additional file 1: Table S2, S3). We identified the poor quality arrays based primarily on the output of percent present, RLE and NUSE, as they are known to perform well [38], and secondarily on GAPDH andβ-actin ratios.

Data normalization

Pre-processing algorithms, usually termed normalization algorithms, are three-step processes: background correc-tion, normalization and probe summarization. An add-itional optional step is log2 transformation. The arrays

that passed quality controls were pre-processed with the Single Channel Array Normalization (SCAN) algorithm [39] with default parameters except for the CDFs, which were downloaded from BrainArray Ensembl ENSG ver-sion 20.0.0 [40]. SCAN normalizes each array independ-ently from its series, corrects GC bias and reduces probe and array variation from each individual sample while in-creasing signal-to-noise ratio. Single array normalization is preferred when combining microarray samples from dif-ferent series or laboratories, because other pre-processing

(5)

algorithms such as RMA [41] or GC-RMA [42] use infor-mation across samples for both normalization and summarization steps, and can thus introduce correlation artifacts [43,44].

Probes-to-genes mapping

At the time of chip design, Affymetrix selection of probes relied on early genome and transcriptome anno-tation which is significantly different from our current knowledge. The genes on the microarray chips are usu-ally represented by multiple probesets and, conversely, in many cases, a single probeset could target multiple genes or even no gene. Multiple probesets targeting the same gene could exhibit wildly different expression levels making downstream analysis challenging. This limitation had been observed [40], and BrainArray portal was created to reorganize probesets with up-to-date gen-omic, cDNA and single nucleotide polymorphism (SNP) information in order to create a more accurate and pre-cise CDF. This approach has become very popular amongst researchers [45]. BrainArray CDFs are annually updated and many microarray algorithms and tools now use them by default. The SCAN normalization algorithm has in-built parameters to download and use BrainArray CDFs. For MyoMiner we used Ensembl genome [46] (ENSG) version 20.0.0. We set the SCAN CDF specified parameter probeSummaryPackage to InstallBrainArray-Package(“human_sample_name.CEL”, “20.0.0”, “hs”, “ensg”) and InstallBrainArrayPackage(“mouse_sample_name.CEL”, “20.0.0”, “mm”, “ensg”) for human and mouse organisms respectively.

Filtering and mapping of expressed genes to gene symbols

In order to distinguish between expressed and unex-pressed genes (such as genes with expression levels close to or lower than the background noise), we used the Universal exPression Code (UPC) algorithm [47] separ-ately for each category. We did that because different tissues, cells or pathological conditions have distinct genetic profiles. UPC is a 2-step algorithm that corrects for background noise using linear statistical models and estimates the percentage of gene expression by calculat-ing the active and inactive gene population. An assump-tion is made that genes with identical molecular characteristics should share the same background ex-pression levels. To identify expressed genes for each cat-egory, we calculated UPC’s percentage expression 3rd quartile for each gene and categorized it as being expressed if its value was higher than 50%.

To map Ensembl gene IDs to HGNC gene symbols [48], Entrez IDs [49] and Uniprot accession numbers [50], we used Ensembl BioMart [51]. We extracted the

required information from GRCh38.p5 assembly for hu-man and GRCm38.p4 assembly for mouse.

Gender prediction

On approximately half of the MIAME metadata entries for both organisms, the gender information was missing [52]. To predict the missing gender entries we used hgfocus.db [53] and mouse4302.db [54] from Bioconduc-tor to map genes to chromosomes and then we calculated the median expression of Y chromosome genes. Males should have higher expression values than females, which was visible on the Y chromosome gene expression histo-gram with two clearly separated gender related peaks.

Combining datasets

To define categories of similar samples based on organ-ism, gender, age, anatomic part and condition criteria, we extracted all the available MIAME metadata for each organism (Additional file2: Table S7 and Additional file

3: Table S8). Then, we filtered all possible metadata combinations into categories that had at least n = 12 samples. With this approach we created a large number of categories while maintaining a high level of power. For each category, we created a single expression matrix that includes all the samples from that category. Further analyses such as batch correction and gene co-expression were based on each category’s co-expression matrix.

Batch effects evaluation

For batch effect reduction we used the ComBat algo-rithm [55] from the “SVA” Bioconductor package [56]. ComBat is a robust empirical Bayes method that adjusts for known batch covariates. By default, we considered each data series (i.e. study) to be a different batch for every category (gender, age, etc). However, it is also known that processing date/time can be a strong batch surrogate [29]. From the text converted CEL files we re-trieved the scan dates and also used these as batch sur-rogates for each series, assuming that microarray experiments performed on the same day belonged to the same experimental batch, thus subdividing the afore-mentioned default series batches to date and series batches. Using principal component analysis (PCA) 3D plots, by the“rgl” R package [57], for each category, we identified if the samples correlate with batch surrogates and proceeded with batch correction if necessary (Table S4). We did not use the category differences as input for the ComBat algorithm (modcombat = model.matrix(~ 1, numbatches)), because (a) all samples were from the same category and (b) samples that are assigned to a batch are usually unevenly distributed which can induce incorrect differences [58]. In some cases, when a batch was represented by a single sample, after assessing the

(6)

PCA 3D plot we assigned the sample to the closest batch cluster if possible, otherwise we used the mean.only = TRUEparameter in ComBat that corrects only the mean of the batch effect not adjusting for scale. There were no significant changes (t-test < 0.05, multiple testing con-trolled with FDR) in gene expression of any gene in any category before and after applying batch correction.

Gene expression correlation

Spearman’s rank correlation [59] is a non-parametric rank statistic that measures the strength of a monotonic, linear or non-linear, relationship between two sets of data. Monotonic is a function that increases when its in-dependent variable increases, having a positive correl-ation. If the independent variable decreases while the function increases, the correlation will be negative. Spearman’s correlation is simply the application of Pear-son’s correlation [60] on rank converted data. A faster method to calculate Spearman’s ρ is to rank the values of xi and yi, and calculate their difference di. The rank

correlation can then be computed as follows: ρ ¼ 1−6

Pn i¼1d2i

n n_ð 2₋₁_Þ ð1Þ

where n is the number of samples and di = rank(xi) -rank(yi). Spearman’s correlation range values between − 1 and + 1, where − 1 describes a perfect monotonically negative correlation and + 1 a perfect monotonically positive correlation. If the data are monotonically inde-pendent, Spearman’s ρ is equal to 0. However, this does not necessarily mean that the data are independent in other ways.

Since Spearman’s correlation can be asymptotically ap-proximated by a t-distribution with n-2 degrees of free-dom under the null hypothesis of no correlation, we used Student’s t-test to examine whether a correlation was significantly different from the null hypothesis:

t¼ ρ ffiffiffiffiffiffiffiffi n−2 p ffiffiffiffiffiffiffiffiffiffi 1−ρ2 p ð2Þ

To adjust for multiple testing we used the Benja-mini – Hochberg (BH) method [61] to control the false discovery rate (FDR). Spearman correlation

ρ-and adjusted p-values were computed with the

“psych” R package [62].

Because the correlation coefficient is not distributed normally and its variance is dependent on both sample size and the correlation coefficient from the entire popu-lation, we cannot compute confidence intervals directly for the ρ-values [63]. First we have to convert ρ-values into additive quantities with ρ to Z Fisher transform-ation [64] which is the inverse hyperbolic tangent func-tion (arctanh) (Addifunc-tional file1: Table S6, Eq. S1, S2).

Second, we compute the confidence intervals at 95% confidence level Ztable= 1.96 (Additional file1: Table S6,

Eq. S3). The final step is to convert Z scores back to ρ-values using the hyperbolic tangent function (tanh) (Additional file1: Table S6, Eq. S4).

Thus, in any sample correlation coefficientρ, there is a 95% probability that the true population correlation co-efficient value will be in the range of CIlowerand CIupper.

Differential co-expression

For comparing whether any two correlation coefficientsρ1

andρ2, for different categories (various samples and sample

sizes n1and n2), are significantly different, we make the null

hypothesis (H0) that the correlation coefficients are not

statistically different. Then we transform theρ-values to Z scores (Additional file 1: Table S6, Eq. S1), calculate the difference between them and calculate an absolute Z score by dividing the difference with the pooled standard error:

Zc¼ Z1−Z2 SEzp ;where SEzp¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 n1−3þ 1 n2−3 r ð3Þ If Zc< Ztablewhere Ztable= 1.96(at 95% confidence level)

or more commonly, if the p-value which is the probability P (Zc< 1.96) > 0.05, we cannot reject H0. The difference

be-tweenρ1andρ2is not significant at 95% confidence level.

Data extraction for validation

To validate the findings of MyoMiner we compared it to the existing databases MEM, SEEK and the GTEx RNA-Seq collection. Since the databases include generic co-expression data and not specific categories, we limited their studies to muscle relevant as follows: for MEM we selected the GPL570 (U133 Plus 2.0) platform, the Pear-son distance method, betaMEM as the ranking method and the dataset filter Stdev = 0 with “Skeletal Muscle” as the text field search. For SEEK we used the refined search option to Muscle (Non-cancer) datasets and the Pearson distance method. For GTEx v8 we downloaded the anno-tation data and extracted the gene TPMs for muscle rele-vant samples using the options SMTSD =” Muscle – Skeletal” and SMAFRZE =” RNASEQ”. We then calculated Spearmanρ the same way as for MyoMiner.

Database construction and website implementation

MyoMiner was constructed in several steps using vari-ous tools and processes (Fig. 1). We developed an HTML5 website that allows querying and visualizing for the requested gene correlations. The interface was devel-oped using the Bootstrap responsive framework. Scatter-plots and correlation networks are visualized with the Nvd3 and D3 [65] JavaScript libraries respectively. All Spearman’s ρ and p-value pairwise matrices, and meta-data are stored on a relational MySQL meta-database which

(7)

runs under an Apache web server. Dynamic content is processed by the PHP programming language: data re-trieval, ρ to Z transformations and CI calculations. The front and back-end is powered by Okeanos [66] cloud services. Complete listings of data series IDs and sample numbers are provided in Additional file2: Table S7 and Additional file3: Table S8.

Utility and discussion Data statistics

Following filtering and programmatic retrieval of 81 hu-man (2541 samples) and 198 mouse (2642 samples) muscle series from the ArrayExpress repository, we manually parsed the MIAMI compliant SDRF (sample and data relationship format) metadata file of each series while crosschecking them, if applicable, with the corre-sponding SOFT (simple omnibus format in text) file from GEO. If there were missing data or differences

between ArrayExpress and GEO, we tracked the publica-tion that described the series to correct the missing in-formation. If we still could not extract the missing data, we contacted the corresponding authors in case they could provide us with the correct data. Working in close co-operation with ArrayExpress and GEO personnel, we corrected several series metafiles, although the most common mismatches were copying errors.

We identified and removed 169 human and 144 mouse samples as duplicates. A further 160 human and 122 mouse samples did not pass quality controls and were discarded, leaving us with 2228 human samples (from 74 series) and 2376 mouse samples (from 189 series). The samples were then classified to different categories of 12 or more samples each. In total, 1810 human samples were assigned to 69 categories and 1155 mouse samples were assigned to 73 categories (Table S4).

Fig. 1 Workflow of data pre-processing method used for MyoMiner. We identified studies that are pertinent to muscle research from GEO and ArrayExpress. Only the studies that provided the raw CEL files proceeded to quality controls. Samples that passed QC were pre-processed with the SCAN algorithm. We thoroughly curated the metadata files and separated them into categories. We used PCA to detect and remove batch effects using the ComBat algorithm. Users have access to muscle tissue and cells gene-pair co-expression, differential co-expression of every category and co-expression networks. All data are available on the MyoMiner web portal.

(8)

Categories were created based on gender, age, muscle tissue, condition and strain. A total of 8 skeletal and car-diac muscle tissues are included on MyoMiner together with the combination of those. Human age was classified in years as follows: 0 to 14 as child, 15 to 24 as young, 25 to 59 as adult and 60+ as old. For mouse the classifi-cation is in weeks: E (embryonic days) as embryo, 0 to 11 weeks as young, 12 to 24 as adult and 25+ as old. We also included 4 separate strains for mouse: C57BL/6 J, CD1, C3H/HeJ and FVB but also the combinations of them and several other strains (Table 1). The Cells cat-egory was derived from mouse microarrays: skeletal muscle precursor cells, cardiomyocytes and immortal-ized C2C12 mouse cell lines at different stages of differ-entiation: myoblasts, myotubes 1–2, 3–4 and 5+ days after differentiation. MyoMiner covers 53 distinct condi-tions including normal and pathological ones. In detail, several exercise categories: aerobic, resistance, endur-ance, trained or sedentary; different types of diet: high fat or calorie restricted diet; type 2 diabetes (DM2): Pre-DM2, DM2 relatives, etc.; muscle regeneration: cardio-toxin and glycerol injections; several cardiomyopathies: Idiopathic, Dilated, Ischemic and Arrhythmogenic; mus-cular dystrophies: Duchenne musmus-cular dystrophy, Mdx, Myotonic dystrophy type 2, and many other categories (Additional file1: Table S4, MyoMiner web portal).

To measure the accuracy of the gender prediction method we first tried it on the samples with known gen-der. For human only 1135 out of 2228 samples had their gender reported. The method classified 98% of the sam-ples to their reported gender. 23 samsam-ples (~ 2%) did not match and we investigated further into their original pub-lications. 5 samples out of the 23 were correctly predicted but were reported in the repositories or publications as opposite-sex. This increased the initial match to 98.4%. Regarding the mouse data, the gender was reported in 1390 out of 2376 samples. Again, testing this method on the known gender samples resulted in about a 98% match,

with 56 samples being predicted as opposite sex from the ones reported. We identified and corrected 16 falsely re-ported cases, increasing the prediction match to 98.3% (Additional file1: Table S5). All gender mismatches that we corrected occurred from copying errors.

Query results and features

The MyoMiner interface was designed to enable users to search and quickly retrieve the transcriptional co-expression of any expressed gene pair in muscle tissue and cells. All categories are presented as buttons on the main page (Fig. 2a). When selecting a category, the op-tions that are not relevant to it are deactivated, thereby constraining the search to (and indicating to the user) only those options which remain available. MyoMiner supports queries using HGNC gene symbols (e.g. DYSF), Ensembl IDs (e.g. ENSG00000135636), Entrez gene IDs (e.g. 8291) and Uniprot accession numbers (e.g. O75923). The table output retrieves the correlation values for all expressed gene pairs in the selected cat-egory (Fig.2b) sorted byρ-value. The first column com-prises the paired gene symbols which can also be clicked to search for its list of correlated genes. The second col-umn is a description of the paired gene, also serving as a link to the associated gene on GeneCards [67]. The third column shows the Spearman’s correlation coefficient but also if clicked the scatterplot of this pair. The fourth and fifth columns report two statistic summaries for the user to judge the significance of the correlation: the FDR ad-justed p-value, and the CI at 95% confidence level, that include information about the estimated effect size and the uncertainty associated with this estimate. CI trans-lates to 95% probability that the population correlation coefficient true ρ-value is between CIlowerand CIupper. A

search bar is provided on the top right corner of the table format for easy gene pair filtering and the columns can be sorted by clicking on their headers (e.g. sort by positive or negative correlation). The results can be downloaded, in various formats, using the appropriate buttons at the bottom left corner of the table.

Scatterplots are important as supplementary informa-tion to help interpret the correlainforma-tion coefficient. In Myo-Miner, interactive expression scatterplots for any gene pair can be accessed by clicking theρ-value. A modal win-dow will appear showing the normalized expression values obtained by SCAN for the selected gene pair (Fig.2c). The series that were used for the selected category are dis-played at the top of the scatterplot. By clicking or double-clicking the series ID, one can either remove the selected series or retain that series only, respectively. Removing series on the scatterplot window will not affect theρ-value as it is pre-computed for all series shown on the scatterplot.

Table 1 Gender, age, tissue and strain classification for each organism. Eight distinct muscle tissues, 4 different age stages (years for human and weeks for mouse) and 4 separated mouse strains with their combinations

Organism Human Mouse

Gender Both, Male, Female Both, Male, Female

Age All ages, Child, Young,

Adult, Old

All ages, Embryo, Young, Adult, Old Anatomic part Combined heart, Left

ventricle, Both ventricles, Myocardium, Combined skeletal muscle, Quadriceps, Rectus abdominis, Biceps brachii

Combined heart, Left ventricle, Both ventricles, Combined skeletal muscle, Quadriceps, Gastrocnemius, Tibialis anterior, Soleus Strain NA Combined, C57BL/6 J, CD1, C3H/HeJ, FVB

(9)

Correlation networks can be accessed by selecting the network tab and pressing the submit button without the need to re-select the category (Fig. 2d). A signed un-weighted 2-shell network will be con-structed. It works either with the number of

co-expressed genes in each shell (default: 15 and 5 genes for 1st and 2nd shell respectively) or by set-ting a correlation threshold through the advanced options. A combination of these two methods is also possible.

Fig. 2 How to browse MyoMiner. a Select a category of interest. All categories are visible at the beginning, so that the user can find with ease what is available on MyoMiner. By clicking a category only the options that are related with this category will remain visible. This way the user is guided to the available MyoMiner categories. b Table output. Search by gene symbol, Ensembl or Entrez gene ID. All transcriptional co-expressions of any expressed gene-pair displayed when hitting submit. The first column is the paired gene symbol, the second is the annotation of the paired gene, the third is the Sprearman’s correlation of that pair, the fourth and fifth are the BH FDR adjusted p-value and the confidence intervals. The table can be downloaded in CSV format or copied directly to the clipboard c Gene pair scatterplot. The expression values of every sample of the selected category for that gene pair are plotted by clicking on the r value. Each series is shown at the top and can be toggled to display the expression values for any series independently. d Correlation network. The network is constructed based on gene correlation. Users can change the number of relations or set a correlation threshold from the advanced options. e Differential co-expression analysis. Select two or more categories and compare the first to the rest. A gene may be a regulator if its co-expression is significantly altered (p-value) between pathological conditions. MyoMiner can be accessed athttps://www.sys-myo.com/myominer

(10)

Another feature is the gene list network, available through the advanced options, where the user can input a list of genes to create the correlation network. In this case, default 1st and 2nd shell values are set to 0 in order to firstly identify if the genes on the list are related. These values can be changed to add co-expressed genes outside from the gene list. The search form “Locate genes in the network” will hide for a short time all the genes in the net-work except for the searched gene, making it easy to pin-point the location of genes inside the network. The link threshold bar can be used to remove edges below a certain correlation value, creating sub networks in the process. The blue colored node is used to point the queried gene, the light blue depicts the 1st shell connected nodes and orange the 2nd shell nodes. Users can pan and zoom by click-dragging on an empty space of the interactive network area and using the mouse wheel, respectively. The nodes are interactive and can be moved to any space of the network area. Users can also double-click a node to highlight its im-mediate connected nodes.

Since correlation networks can grow quite large, includ-ing thousands of nodes and many more edges, it could take several minutes to retrieve the values for large networks from the database. For this reason, we decided that network construction will be a client side task, using the D3 Java-Script library. For large networks, we recommend using the Chrome browser as it could take some time to render big networks, especially on low end machines. We also recom-mend having the graphics card enabled for the browser in order to avoid long rendering time for the network.

Differential co-expression analysis is emerging as a method to complement traditional differential expres-sion analysis [13,68]. It can detect biologically important differentially co-expressed gene pairs that would other-wise not be detected via co-expression or differential ex-pression [69]. Differentially co-expressed genes between different conditions are likely to be regulators, thus explaining differences between phenotypes [70]. MyoMi-ner provides differential co-expression analysis for any gene pair from any category combination. In the “Com-pare gene co-expression” form, users can set the cat-egories for comparison (Fig. 2e). The first category is compared to the rest after the gene in question is selected. The output includes the gene symbol and its description, theρ1value from the first category, theρ2value from the

second category and the p-value of the comparison. If the p-value is higher than 0.05 the difference ofρ1 andρ2is

not significant at 95% confidence level. MyoMiner sup-ports multiple simultaneous comparisons.

Improved combined data quality after the correction of batch effects

By combining data from different data sets and labora-tories from around the world we introduce unwanted

technical variation which needs to be corrected. Differ-ent processing days between samples in a series were also observed, through PCA plots, as another source of strong non-biological variation [29]. To improve the quality of the co-expression values obtained from tens to hundreds of samples, we check each category for the presence of batch effects by different series and/or pro-cessing dates. To acquire the scan dates from the micro-array CEL files, we parsed them in text format. We then used PCA to visualize the samples from each category, colored by series or processing dates, on a 3D plane (Fig. 3b), in order to identify underlying batch effects. When we observed non-biological variation we corrected it using the ComBat algorithm [55], as described in the “Batch effects evaluation” section.

Below, we present two examples where batch effect treatment drastically altered the correlation coefficient between the gene pairs (Fig. 3). Dysferlin is a type II transmembrane protein that is enriched in skeletal and cardiac muscle and involved in membrane repair [71]. Mutations or loss of DYSF gene lead to muscular dystro-phies called dysferlinopathies. Synaptopodin 2-like (SYN-PO2L) protein is an important paralog of Synaptopodin-2 (SYNPOSynaptopodin-2) that is involved in active binding and bund-ling and associated with Duchene muscular dystrophy and myofibrilar myopathy 2. We selected the adult hu-man resistance exercise category to illustrate how batch correction removes bias introduced when combining data. Before correction, no strong correlation is observed between DYSF and SYNPO2L: ρ = − 0.05 (Table 2, also shown with Pearson’s correlation coefficients).

Clustering and PCA plots show that the samples are grouped by series, which may indicate bias (Fig.3a, b left). The DYSF and SYNPO2L gene expression scatterplot reveal the extent of the batch effect: even though individ-ual series (different colors) have clear positive correlation the overall correlation is canceled out when combined (Fig. 3c). In detail, the selected category is comprised of three series. Individual series Spearman correlation is GSE47881 ρ = 0.6, GSE48278 ρ = 0.3 and GSE28422 ρ = 0.67. We can also average the correlation values using ρ-to-Z Fisher’s transformation (Additional file 1: Table S6, Eq. S1) to convert the non additive ρ-values to Z scores, then average the Z scores and finally convert the mean Z back to ρ-value (Additional file 1: Table S6, Eq. S4). DYSF-SYNPO2L averageρ-value for the category is 0.54. After we treated the samples with ComBat which reduced the aforementioned bias (Fig.3a, b, c right) the correlation value increased to 0.62 which could indicate a possible functional association between DYSF and SYNPO2L [72].

In another example between DYSF and Synaptopodin (SYNPO), which may be modulating actin-based shape and mobility of dendritic spines, we find that batch effect correction reduces the bias-inflated correlation ρ = 0.62.

(11)

Fig. 3 Example of batch effects treatment. The adult human quadriceps resistance exercise category is constructed from three series: GSE47881 (olive green), GSE28422 (pink) and GSE48278 (turquoise) that include 45 samples in total. GSE9103 (magenta) series, from sedentary individuals, is used as a visual control. On the left, one can see the untreated samples and on the right the batch-treated samples, using each series as a surrogate. a Hierarchical clustering of both resistance exercise and sedentary samples shows a clear separation. Note that resistance exercise samples are clustered by their corresponding series even after pre-processing (normalization). After treating the samples with ComBat, the resistance exercise samples are now mixed, reducing the batch effect. b Principal component analysis plots of the same samples. In the untreated plot, samples are clustered very well by their series (olive green, pink and turquoise). However, the resistance exercise series are as far from each other as the sedentary (visual control in this case) series. After the batch correction (right) all resistance exercise samples are clustered together and are clearly separated from the sedentary samples cluster. c The expression values of DYSF and SYNPO2L are grouped by series resulting in a correlation value r =− 0.05. After batch correction the samples are mixed with ρ = 0.62. d Inversely, in the example of DYSF and SYNPO where the ρ value is artificially high, before the treatment (ρ = 0.62), the correction reduces it to ρ = 0.36

(12)

Individual series correlation is as follows: GSE47881 ρ = 0.31, GSE48278 ρ = − 0.4 and GSE28422 ρ = 0.64. The scatterplot also reveals that the series have mixed correla-tions (Fig. 3d left) and the overall ρ is biased when we combined the series. The average correlation of the three series is 0.21. After removing the bias (Fig.3d right) the correlation is reduced from 0.62 to 0.36. Gene pairs that had reduced correlation after batch treatment were more common, suggesting that batch correction reduced the number of false positive correlations.

Validation of correlation values

We sought to validate the correlation values generated for MyoMiner, by comparison to two existing databases of gene co-expression (MEM and SEEK), and to the GTEx RNA-Seq data compendium. We did this to the extent possible given the limited muscle-specificity and lack of muscle condition sub-categorization of those databases. For this comparison, a panel of 20 muscle-relevant genes (Additional file 4: Table S9) were chosen based on muscle-relevant annotations (Entrez Gene, GeneCards), and on their frequency of representation in consensus lists of the Muscle Gene Sets collection [32]. All 190 pair-wise Pearson correlation values were obtained from MyoMiner for these 20 muscle-relevant genes from MyoMiner’s healthy human whole muscle category (Human|Both gen-ders|All Ages|Skeletal muscle|Normal).

The MyoMiner values were compared against similar correlation values for the same pairs of genes given by MEM and SEEK for the closest relevant datasets that we could identify in the MEM and SEEK databases. Pearson correlation could be obtained directly from SEEK, whereas MEM returns a p-value for the strength of cor-relation that is not directly comparable to Pearson, so for MEM we ranked the 190 pair-wise correlations and

compared rankings between the two tools. A text search for “Skeletal Muscle” in the MEM tool enabled extrac-tion of correlaextrac-tion values for a dataset that combined 25 muscle-relevant studies on the Affymetrix HG U133 Plus 2.0 array. For SEEK, the ‘Muscle (Non-cancer)’ dataset was chosen, which combines 87 mostly muscle-related data series from several gene expression platforms. We observed strong agreement in correlation values for MyoMiner with SEEK (Pearson r = 0.87; Fig. 4) and with MEM (Spearmanρ = 0.74; Additional file1: Fig. S1).

Correlation values for healthy whole muscle were cal-culated from GTEx RNA-Seq data in a similar way to those calculated for MyoMiner’s healthy human whole muscle category. A reasonable level of agreement was observed in correlation values between MyoMiner and GTEx (Pearson r = 0.66; Additional file1: Fig. S1).

Conclusions

In this work we retrieved and analyzed striated muscle pertinent microarray samples and combined them effect-ively for the construction of a muscle-tissue-specific co-expression database. MyoMiner provides a simple, ef-fective and easy way to identify co-expressed gene pairs under a vast number of experimental conditions. This was not available in any existing co-expression database. Thus, MyoMiner represents a powerful tool for muscle researchers, helping them to delineate gene function and key regulators.

For MyoMiner we chose to use the Spearman ation coefficient, despite the fact that Pearson correl-ation seems to be more popular in other correlcorrel-ation databases. We did not use the Pearson correlation be-cause it is sensitive to outliers and bebe-cause of the as-sumptions that need to be met, in order to calculate adjusted p-values: every gene would have to be normally

Table 2 Examples of gene pairs correlation changes after batch correction. We illustrate two correlation examples (i) between DYSF and SYNPO2L, where the correlation increases significantly and (ii) between DYSF and SYNPO, where the correlation decreases. Both Spearman and Pearson’s correlations are available to indicate that batch effects are prevalent in both parametric and non-parametric statistics. We see considerable changes on their combined correlation coefficients, which is due to the correction of the variation between studies having been done in different labs by different people. In the case of DYSF - SYNPO2L, originally there seems to be no correlation on the combined samples, despite that a strong positive correlation is observed in each individual series. This bias is removed after batch correction with ComBat, resulting in a positive correlation. The example of the DYSF– SYNPO pair shows an initial strong positive correlation before batch correction, while the individual series have mixed positive and negative correlations. Following batch correction this value is reduced

DYSF - SYNPO2L DYSF - SYNPO

Spearmanρ Pearson r Spearmanρ Pearson r

Uncorrected −0.05 0.02 0.62 0.53

Batch corrected 0.62 0.65 0.36 0.42

GSE47881 0.6 0.67 0.31 0.39

GSE48278 0.3 0.31 −0.4 −0.08

GSE28422 0.67 0.79 0.64 0.71

(13)

distributed, while gene pairs have to be bivariately nor-mally distributed. On the other hand, Spearman correl-ation is robust to outliers and does not require assumptions of linearity. To determine the strength of the correlation we have provided the adjusted p-value and the confidence intervals.

It is noteworthy that the most correlated partners for a driver gene may vary significantly between co-expression databases. This could be attributed to differ-ent transcriptomic platforms, although most of these da-tabases use GPL570 and GPL1261 platforms as we did. Moreover, different pre-processing methods, batch effect correction methods or the lack thereof, tissue- and cell-specific expression, variable cell states, different correl-ation coefficients, and other factors, add to the differ-ences found between co-expression databases. An investigation of the inconsistencies between co-expression databases could identify common gene char-acteristics or the key factors that contribute to those dif-ferences. Our comparison of MyoMiner to the MEM and SEEK databases suggests that there is reasonable agreement of these resources in terms of co-expressed

gene pairs in healthy muscle, validating the approach used in MyoMiner, and supporting the trustworthiness of MyoMiner’s correlation values for muscle diseases and other muscle conditions.

One caveat of gene expression correlation is that it can be driven by other factors. For example, a tran-scription factor (TF), when upregulated, drives the ex-pression of gene X and Y. In this scenario, TF with X and TF with Y will be highly correlated. However, X and Y will be highly correlated as well, since both are upregulated from the same TF. This could be benefi-cial as X and Y could be involved in the same pro-cesses, but if we are interested specifically in the relation of X with Y, their correlation would be zero if TF was not upregulated. In order to extract the correlation between X and Y without TF interfering, we can calculate the partial correlation [73]. Partial correlation could theoretically be used to remove all the gene effects from a pair of genes, but it would re-quire more microarray experiments than the number of genes. It has been used successfully to create rela-tively small networks [74].

Fig. 4 Validation of MyoMiner by comparison to the SEEK co-expression search engine. Pearson correlation values were extracted from both MyoMiner and SEEK for each pair of a panel of 20 muscle-relevant genes (190 pairwise combinations). For MyoMiner, these were obtained for the healthy human whole muscle category (Human|Both genders|All Ages|Skeletal muscle|Normal). For SEEK, the‘Muscle (Non-cancer)’ dataset was chosen, which combines 87 mostly muscle-related data series from several gene expression platforms. The correlation value from SEEK for a given gene pair was plotted against that from MyoMiner for the same gene pair. The 190 values from MyoMiner correlated with those from SEEK with Pearson r of 0.87

(14)

To derive statistical confidence across large numbers of experiments, and because we wished MyoMiner to examine co-expression differences between experimental conditions (some of which have low quantities of pub-lished expression data), we have focused our analysis on the most-used microarray platforms for human and murine muscle studies. Microarrays have been extremely useful in a wide area of biological applications, but they also have a number of limitations. Importantly, a micro-array can only detect RNA sequences that the designed probes can detect. Simply put, if the RNA contains se-quences that have no corresponding oligos in the array, the sequences will not be measured. In gene expression analysis, a gene that was not described before will not be present in the array. Also, non-coding RNA sequences are typically not present on arrays. This problem is more pronounced in older arrays where only a set number of probes could be printed on the array; thus a portion of the genes could eventually be measured. Newer com-mercial arrays have tried to compensate for this by in-cluding probes that do not match to any known genes at the time they are designed - predicted transcripts which can then be assigned to newly discovered genes if their sequences match. Also, as time progresses more searchers are using the now popular BrainArray CDF re-pository which is updated annually. Another difficulty in terms of probe design, is to generate probes of which the RNA sequences do not overlap. If sequences are homologous, then a probe could detect multiple genes at once, which is particularly problematic for genes with many splice variants or for genes that belong to the same family. Dai et al. [40], address this issue by select-ing probes that detect specific and unique parts of the gene (whenever this is possible). It should be noted that specific arrays can detect splice variants by having probes which detect specific exons or exon junctions [75–77]. Moreover, microarrays measure, by design, relative concentration indirectly. The intensity measured in a probe is proportional to the concentration of a se-quence that can hybridize to this probe. However, ex-perimental spike-in studies [78] showed that the probe intensity is nonlinearly proportional to the target con-centration [79–81]. The array will become saturated at high target concentrations, while at low concentrations there will be no binding. The intensities are linear within a very limited range of RNA concentration. Another limitation is that co-expression analyses based on a sin-gle microarray platform may have technical biases asso-ciated with that platform: for example, the properties of cross-hybridization and the dynamic range of probes dif-fer among microarray platforms, as does the signal-to-noise ratio [18]. These limitations of microarray technol-ogy may to some extent account for the stronger agree-ment of MyoMiner’s correlation values with those of

tools such SEEK (r = 0.88) and MEM (ρ = 0.74), which use mainly microarray data, compared with the moder-ate agreement with correlation values obtained from the GTEx RNA-Seq dataset (r = 0.66). For these reasons, it may be useful in future work to cross-reference common co-expression patterns found in MyoMiner against those found in RNA-Seq data in order to identify whether any consistent disparities are present that may be resulting from technical biases in the microarray chip – such instances could then be filtered out of MyoMiner findings for specific experimental categories. Published RNA-Seq datasets are becoming more numerous in the neuromuscular field but remain limited in number especially when considering spe-cific experimental or pathological conditions.

A major reason for creating MyoMiner is to be able to compare gene co-expression networks between condi-tions – i.e. to identify cases were the co-expression of a given gene pair is lost, gained, or inverted, as a result of, for example, a pathological genetic mutation or an envir-onmental change. For individual gene pairs, this is cur-rently facilitated by the MyoMiner web interface, and the MyoMiner database itself opens up the possibility of a systematic analysis in the future. Clearly, such analysis is not possible without measuring co-expression separ-ately in the conditions that are to be compared. How-ever, a caveat of this approach is that a highly specific phenotype might lack biological variation to an extent that gene co-expression becomes overly noisy. Our com-parisons to healthy skeletal muscle sub-sets of the SEEK and MEM databases suggest that there is consistency of identified co-expressed gene pairs for this biological con-dition, indicating that noise does not dominate, and bod-ing well for other conditions. However, care should be taken and high confidence p-values should be sought when comparing between very precise biological condi-tions in MyoMiner, especially for murine samples, in which inter-individual variation could be relatively minor.

Finally, it could be interesting in future iterations of MyoMiner to examine the relationship of gene co-expression with other types of gene and protein func-tional association (e.g. [82–84]) especially for the identi-fication of functional modules [85].

The MyoMiner database is a powerful tool for muscle researchers to investigate gene function, based on tissue specific co-expression, and new disease mechanics, based on changes in co-expression between normal and patho-logical tissues.

Supplementary information

Supplementary information accompanies this paper athttps://doi.org/10. 1186/s12920-020-0712-3.

Additional file 1. Figure S1. Pair-wise correlations of 20 selected muscle genes in MyoMiner compared to the same pairwise correlations

(15)

in GTEx and MEM, for healthy human whole muscle tissue. Table S1. Alternative IDs to the originals A-AFFY-44 for human UG-U133 Plus 2.0 and A-AFFY-45 for mouse MG 430 2 arrays. Table S2 - Table S3. Failed quality samples and series from the human and mouse microarray data collection. Table S4. Number of samples, series and expressed genes for each of 69 and 73 categories in human and mouse respectively. Table S5. Samples with opposite gender prediction. Table S6. Supple-mentary equations forρ to Ζ and Z to ρ transformations, and confidence intervals calculation.

Additional file 2. Table S7. Complete listing of data series IDs for human experiments.

Additional file 3. Table S8. Complete listing of data series IDs for murine experiments.

Additional file 4. Table S9. The panel of 20 muscle-relevant genes used for validation of MyoMiner co-expression values. This panel is based on muscle-relevant annotations, and frequency of representation in con-sensus lists of the Muscle Gene Sets collection.

Abbreviations

GEO:Gene Expression Omnibus; FDR: False discovery rate; CDF: Chip description file; MIAME: Minimum Information About a Microarray Experiment; CEL: The file format of the Affymetrix raw (light intensity) file; RLE: Relative Log Expression; NUSE: Normalized Unscaled Standard Error; SCAN: Single Channel Array Normalization; RMA: Robust Multi-array Average; GC-RMA: GeneChip Robust Multi-array Average; UPC: Universal exPression Code; HGNC: HUGO Gene Nomenclature Committee; DM2: Diabetes Mellitus type 2; PCA: Principal Component Analysis; TF: Transcription Factor Acknowledgements

AM was supported by the MyoGrad International Graduate School for Myology.

Authors_{’ contributions}

AM constructed the database and web interface, and carried out the bioinformatics analyses. WD conceived and managed the study. AM wrote the manuscript, with important contributions from WD. IM provided support for mathematical/technical theory. SD contributed to design of disease categories. GBB and SS contributed to project conception. All authors read and approved the final manuscript.

Funding

This work was supported by the MyoGrad International Graduate School for Myology, and the Association Française contre les Myopathies (AFM). The funding bodies had no role in: the design of the study; collection, analysis, and interpretation of data; or in the writing of the manuscript.

Availability of data and materials

The transcriptomic data that support the findings of this study are available from ArrayExpress,https://www.ebi.ac.uk/arrayexpress/, and the Gene Expression Omnibus,https://www.ncbi.nlm.nih.gov/geo/. Complete listings of data series IDs and sample numbers are provided in Additional file2: Table S7 and Additional file3: Table S8. All the data generated for MyoMiner are available athttps://sys-myo.com/myominer/.

Ethics approval and consent to participate Not applicable.

Consent for publication Not applicable. Competing interests

The authors declare that they have no competing interests. Author details

1_{Sorbonne Université, Inserm, Institut de Myologie, U974, Center for Research}

in Myology, 47 Boulevard de l’hôpital, 75013 Paris, France.2Centre of Systems Biology, Biomedical Research Foundation, Academy of Athens, 4 Soranou Ephessiou St., 11527 Athens, Greece.3_{Northern Ireland Centre for Stratified}

Medicine, Biomedical Sciences Research Institute, C-TRIC, Altnagelvin Hospital

Campus, Glenshane Road, Ulster University, Derry/Londonderry BT47 6SB, UK.

4_{Muscle Research Unit, Experimental and Clinical Research Center}_{– a joint}

cooperation of the Charité Medical Faculty and the Max Delbrück Center for Molecular Medicine, Lindenberger Weg 80, 13125 Berlin, Germany.

Received: 18 October 2019 Accepted: 13 April 2020

References

1. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996; 14(13):1675–80.

2. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995; 270(5235):467–70.

3. Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E, Dylag M, Kurbatova N, Brandizi M, Burdett T, et al. ArrayExpress update--simplifying data submissions. Nucleic Acids Res. 2015; 43(Database issue):D1113–6.

4. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013;41(Database issue):D991_–5.

5. Marbach D, Costello JC, Kuffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Kellis M, Collins JJ, Stolovitzky G. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796–804.

6. De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol. 2010;8(10):717_–29.

7. Zhu Q, Wong AK, Krishnan A, Aure MR, Tadych A, Zhang R, Corney DC, Greene CS, Bongo LA, Kristensen VN, et al. Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods. 2015;12(3):211–4 213 p following 214.

8. Kolde R, Laur S, Adler P, Vilo J. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics. 2012;28(4):573–80. 9. Consortium GT. Human genomics. The genotype-tissue expression (GTEx)

pilot analysis: multitissue gene regulation in humans. Science. 2015; 348(6235):648–60.

10. Sun Y, Zhang W, Chen D, Lv Y, Zheng J, Lilljebjorn H, Ran L, Bao Z, Soneson C, Sjogren HO, et al. A glioma classification scheme based on coexpression modules of EGFR and PDGFRA. Proc Natl Acad Sci U S A. 2014;111(9):3538–43. 11. Ma RL, Shen LY, Chen KN. Coexpression of ANXA2, SOD2 and HOXA13

predicts poor prognosis of esophageal squamous cell carcinoma. Oncol Rep. 2014;31(5):2157_–64.

12. Futamura N, Nishida Y, Urakawa H, Kozawa E, Ikuta K, Hamada S, Ishiguro N. EMMPRIN co-expressed with matrix metalloproteinases predicts poor prognosis in patients with osteosarcoma. Tumour Biol. 2014;35(6):5159_–65. 13. de la Fuente A. From 'differential expression' to 'differential networking'

-identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010;26(7):326–33.

14. Liu BH. Differential Coexpression network analysis for gene expression data. Methods Mol Biol. 1754;2018:155_–65.

15. Bhuva DD, Cursons J, Smyth GK, Davis MJ. Differential co-expression-based detection of conditional relationships in transcriptional data: comparative analysis and application to breast cancer. Genome Biol. 2019;20(1):236. 16. Jen CH, Manfield IW, Michalopoulos I, Pinney JW, Willats WG, Gilmartin PM,

Westhead DR. The Arabidopsis co-expression tool (ACT): a WWW-based tool and database for microarray-based gene expression analysis. Plant J. 2006; 46(2):336–48.

17. Manfield IW, Jen CH, Pinney JW, Michalopoulos I, Bradford JR, Gilmartin PM, Westhead DR. Arabidopsis Co-expression Tool (ACT): web server tools for microarray-based gene expression analysis. Nucleic Acids Res. 2006;34(Web Server issue):W504–9.

18. Aoki Y, Okamura Y, Tadaka S, Kinoshita K, Obayashi T. ATTED-II in 2016: a plant Coexpression database towards lineage-specific Coexpression. Plant Cell Physiol. 2016;57(1):e5.

19. Okamura Y, Aoki Y, Obayashi T, Tadaka S, Ito S, Narise T, Kinoshita K. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res. 2015;43(Database issue):D82–6.

(16)

20. Jupiter D, Chen H, VanBuren V. STARNET 2: a web-based tool for accelerating discovery of gene regulatory networks using microarray co-expression data. BMC Bioinformatics. 2009;10:332.

21. Hruz T, Laule O, Szabo G, Wessendorp F, Bleuler S, Oertle L, Widmayer P, Gruissem W, Zimmermann P. Genevestigator v3: a reference expression database for the meta-analysis of transcriptomes. Adv Bioinforma. 2008; 2008:420747.

22. Michalopoulos I, Pavlopoulos GA, Malatras A, Karelas A, Kostadima MA, Schneider R, Kossida S. Human gene correlation analysis (HGCA): a tool for the identification of transcriptionally coexpressed genes. BMC Res Notes. 2012;5(1):265.

23. Pearson K. Note on regression and inheritance in the case of two parents. Proc R Soc Lond. 1895;58:240–2.

24. Piro RM, Ala U, Molineris I, Grassi E, Bracco C, Perego GP, Provero P, Di Cunto F. An atlas of tissue-specific conserved coexpression for functional annotation and disease gene prediction. Eur J Hum Genet. 2011;19(11):1173–80. 25. Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS,

Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015;47(6):569_–76.

26. Wang P, Qi H, Song S, Li S, Huang N, Han W, Ma D. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res. 2015;43(Database issue):D1133_–9.

27. Vandenbon A, Dinh VH, Mikami N, Kitagawa Y, Teraguchi S, Ohkura N, Sakaguchi S. Immuno-navigator, a batch-corrected coexpression database, reveals cell type-specific gene networks in the immune system. Proc Natl Acad Sci U S A. 2016;113(17):E2393–402.

28. Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 2014;42:21.

29. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11(10):733–9. 30. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E,

Garcia JG, Geoghegan J, Germino G, et al. Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005;2(5):345–50.

31. Malatras A. Bioinformatics tools for the systems biology of dysferlin deficiency. PhD Thesis: Université Pierre et Marie Curie - Paris VI. Freie: Universität Berlin; 2017. 32. Malatras A, Duguez S, Duddy W. Muscle gene sets: a versatile

methodological aid to functional genomics in the neuromuscular field. Skelet Muscle. 2019;9(1):10.

33. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001;29(4):365–71.

34. Affymetrix Power Tools [ https://www.thermofisher.com/de/en/home/life-science/microarray-analysis/microarray-analysis-partners-programs/ affymetrix-developers-network/affymetrix-power-tools.html].

35. Turner S, Chen L. Updated security considerations for the MD5 message-digest and the HMAC-MD5 algorithms; 2011.

36. Eastlake D. Secure hash algorithm 1 (SHA1); 2001.

37. Brayer K, Hammond JL Jr. Evaluation of error detection polynomial performance on the AUTOVON channel. In: IEEE National

Telecommunications Conference. vol. 1. New Orleans, LA: Institute of Electrical and Electronics Engineers; 1975. p. 8–21. to 28–25.

38. McCall MN, Murakami PN, Lukk M, Huber W, Irizarry RA. Assessing affymetrix GeneChip microarray quality. BMC Bioinformatics. 2011;12:137.

39. Piccolo SR, Sun Y, Campbell JD, Lenburg ME, Bild AH, Johnson WE. A single-sample microarray normalization method to facilitate personalized-medicine workflows. Genomics. 2012;100(6):337_–44.

40. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005;33(20):e175. 41. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U,

Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249_–64. 42. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A

model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc. 2004;99(468):909–17.

43. Lim WK, Wang K, Lefebvre C, Califano A. Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics. 2007;23(13):i282–8.

44. Usadel B, Obayashi T, Mutwil M, Giorgi FM, Bassel GW, Tanimoto M, Chow A, Steinhauser D, Persson S, Provart NJ. Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant Cell Environ. 2009;32(12):1633–51.

45. Sandberg R, Larsson O. Improved precision and accuracy for microarrays using updated probe set definitions. BMC Bioinformatics. 2007;8:48. 46. Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez Banet J,

Billis K, Garcia Giron C, Hourlier T, et al. The Ensembl gene annotation system. Database (Oxford). 2016;baw093.

47. Piccolo SR, Withers MR, Francis OE, Bild AH, Johnson WE. Multiplatform single-sample estimates of transcriptional activation. Proc Natl Acad Sci U S A. 2013;110(44):17778_–83.

48. Braschi B, Denny P, Gray K, Jones T, Seal R, Tweedie S, Yates B, Bruford E. Genenames.org: the HGNC and VGNC resources in. Nucleic Acids Res. 2019; 47(D1):D786_–92.

49. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez gene: gene-centered information at NCBI. Nucleic Acids Res. 2011;39(Database issue):D52–7. 50. The UniProt Consortium. UniProt: the universal protein knowledgebase.

Nucleic Acids Res. 2017;45(D1):D158–69.

51. Kinsella RJ, Kahari A, Haider S, Zamora J, Proctor G, Spudich G, Almeida-King J, Staines D, Derwent P, Kerhornou A, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford). 2011; 2011:bar030.

52. Florez-Vargas O, Brass A, Karystianis G, Bramhall M, Stevens R, Cruickshank S, Nenadic G. Bias in the reporting of sex and age in biomedical research on mouse models. Elife. 2016;5.

53. Carlson M. hgfocus.db: Affymetrix Human Genome Focus Array annotation data (chip hgfocus). R package version 323; 2016.

54. Carlson M. mouse4302.db: Affymetrix Mouse Genome 430 2.0 Array annotation data (chip mouse4302). R package version 323; 2016. 55. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray

expression data using empirical Bayes methods. Biostatistics. 2007;8(1): 118–27.

56. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28(6):882–3.

57. Adler D, D M, et al: rgl: 3D Visualization Using OpenGL. R package version 0951441 2016.

58. Nygaard V, Rodland EA, Hovig E. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics. 2016;17(1):29–39.

59. Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15(1):72_–101.

60. Pearson K. Notes on the history of correlation. Biometrika. 1920;13:25–45. 61. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical

and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 1995;57:289–300.

62. Revelle W. psych: Procedures for Personality and Psychological Research. 1.7. 5 ed. Evanston, Illinois: Northwestern University; 2017.

63. Lu Z, Shen D: Computation of Correlation Coefficient and Its Confidence Interval in SAS.

64. Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika. 1915;10(4):507–21.

65. Bostock M, Ogievetsky V, Heer J. D3_{Data-Driven Documents. IEEE Trans Vis}

Comput Graph. 2011;17(12):2301–9.

66. Koukis V, Venetsanopoulos C. Koziris N: ~okeanos: building a cloud, Cluster by Cluster. IEEE Internet Computing. 2013;17(3):67_–71.

67. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, et al. GeneCards Version 3: the human gene integrator. Database (Oxford). 2010;2010:baq020.

68. Kostka D, Spang R. Finding disease specific alterations in the co-expression of genes. Bioinformatics. 2004;20(Suppl 1):i194_–9.

69. Hudson NJ, Reverter A, Dalrymple BP. A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Comput Biol. 2009;5(5):e1000382.

70. Li KC. Genome-wide coexpression dynamics: theory and application. Proc Natl Acad Sci U S A. 2002;99(26):16875–80.

71. Han R, Campbell KP. Dysferlin and muscle membrane repair. Curr Opin Cell Biol. 2007;19(4):409–16.