This is a post-peer-review, pre-copyedit version of an article published in BBA - GENE REGULATORY MECHANISMS. The final authenticated version is available online at: https://doi.org/10.1016/j.bbagrm.2019.194447
Inference of plant gene regulatory networks using data-driven methods: a practical overview
Kulkarni S.R. and Vandepoele K.
Transcriptional regulation is a complex and dynamic process that plays a vital role in plant growth and development. Key components in the regulation of genes are transcription factors (TFs), which coordinate the transcriptional control of gene activity. A gene regulatory network (GRN) is a collection of regulatory interactions between TFs and their target genes. The accurate delineation of GRNs contributes significantly to our understanding of how plant cells are organized and function, and of how individual genes are regulated in various conditions, organs or cell types. During the past decade, important progress has been made in the identification of GRNs using experimental and computational approaches. However, a detailed overview of available platforms supporting the analysis of GRNs in plants is missing. Here, we review current databases, platforms and tools that perform data-driven analyses of gene regulation in Arabidopsis. The platforms are categorized into two sections: 1) promoter motif analysis tools that use motif mapping approaches to find TF motifs in the regulatory sequences of genes of interest and 2) network analysis tools that identify potential regulators for a set of input genes using a range of data types in order to generate GRNs. We discuss the diverse datasets integrated and highlight the strengths and caveats of different platforms. Finally, we shed light on the limitations of the above approaches and discuss future perspectives, including the need for integrative approaches to unravel complex GRNs in plants.
Highlights
The basic methodologies for generating TF binding site profiles in plants are described
The features of different available online tools to identify cis-regulatory elements are reported
The data types, analysis and visualization options for tools to explore data-driven plant gene regulatory networks are presented
Limitations and remaining challenges for plant gene regulatory network analysis are discussed
Keywords
plant gene regulatory networks
promoter analysis
network analysis
systems biology
Introduction
Plants, being sessile organisms, need to rapidly adapt to changes in environmental conditions. These adaptation mechanisms are largely controlled by shifts in gene expression that occur in these conditions. Transcriptional regulation is a complex and dynamic process that plays a vital role in achieving proper gene expression during the different stages of plant development and various biotic and abiotic stress responses. Key factors controlling this process are TFs which regulate their target genes by recognizing short sequences on the DNA called TF binding sites (TFBSs). The full set of regulatory interactions between a TF and its target genes forms a gene regulatory network (GRN).
Efficiently identifying regulatory links between TFs and target genes in order to delineate GRNs is pivotal to understand how different biological processes like growth, development or stress responses are transcriptionally controlled in plants.
Numerous studies have used high-throughput assays to systematically study the GRNs underlying specific conditions in the model plant Arabidopsis thaliana (Doroshkov et al., 2019, Varala et al., 2018, Morohashi et al., 2012, Wilkins et al., 2016, Yang et al., 2017). In vivo chromatin immunoprecipitation (ChIP) methods define the genome-wide map of potential targets of a TF of interest. There has been a rapid increase in the availability of ChIP studies in Arabidopsis over the last few years (Franco-Zorrilla & Solano, 2017). Chen and co-workers used ChIP followed by high-throughput DNA sequencing (ChIP-Seq), combined with expression data, to unravel the regulatory code of floral organ development (Chen et al., 2018). Genome-wide targets of 21 TFs were identified using ChIP-Seq to construct a wide-ranging map of environmental stress responses in Arabidopsis (Song et al., 2016). Open chromatin regions (OCRs) containing the motifs of TFs help in understanding where TFs bind on the genome and which target genes they regulate in a specific condition or cell type. The high-throughput DNase-Seq assay has been used in rice to identify callus- and seedling-specific OCRs (Zhang et al., 2012a). The same methodology was used in Arabidopsis to compare OCRs profiled in leaf versus flower tissue and under light and heat treatments (Sullivan et al., 2014, Zhang et al., 2012b). The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-Seq), combined with cell type-specific nuclear purification, is gaining popularity due to its speed, sensitivity and compatibility with mapping cell type-specific OCRs (Sijacic et al., 2018, Maher et al., 2018, Lu et al., 2017). The methodology was used to study regulatory mechanisms in shoot apical meristem (SAM) stem cells and mesophyll cells in Arabidopsis (Sijacic et al., 2018) and to perform a comparative analysis of OCRs in root hair and non-hair cells across four species (Maher et al., 2018).
Apart from these commonly used methods, the in vitro DNA affinity purification sequencing (DAP-Seq) assay elucidated genome-wide binding of 529 TFs in Arabidopsis (O'Malley et al., 2016). Other studies used protein binding microarrays (PBMs) to identify TFBSs for a large collection of TFs (Franco-Zorrilla et al., 2014, Weirauch et al., 2014). Franco-Zorrilla and co-workers identified motifs of 63 pre-selected TFs in Arabidopsis, and Weirauch and co-workers used PBMs to determine the DNA sequence preferences for >1000 TFs from 131 species. The yeast one-hybrid (Y1H) assay was used to study nitrogen-associated metabolism and growth, and discovered 1,660 interactions between 431 genes and 345 TFs (Gaudinier et al., 2018). Y1H was also used to identify GRNs underlying root stele, ground tissue and secondary cell wall synthesis, and unravelled transcriptional cascades essential in cellular functions (Brady et al., 2011, Sparks et al., 2016, Taylor-Teeples et al., 2015).
Although these assays offer a considerable increase in information about regulatory sequences and interactions, each method has its limitations. ChIP-Seq studies one TF at a time, making it difficult to study large numbers of TFs efficiently. Similarly, identification of OCRs using DNase-Seq or ATAC-Seq gives a genome-wide view of where TFs can bind on DNA in a specific condition or cell type. However, determining which TFs bind to these regions requires downstream analysis, which depends on the availability of profiled TF motifs (Sullivan et al., 2014, Maher et al., 2018, Sijacic et al., 2018). Furthermore, experimental methods that measure the binding of cooperative TFs in one assay are also limited (Smaczniak et al., 2017). Thus, GRNs derived from ATAC-Seq datasets potentially suffer from incompleteness because binding site information is missing for many TFs (Kulkarni et al., 2019). Given the limitations of the individual experimental assays, relying on any one of them alone to construct and study a genome-wide regulatory map of Arabidopsis is far from optimal.
Some of the limitations of experimental methods can be overcome by computational and integrative approaches that make use of publicly available regulatory information. This review summarizes the available databases and platforms that use data-driven approaches to study gene regulation based on TF motifs and to infer GRNs in plants. These resources leverage diverse experimental data types, including expression data, chromatin data, TF motifs, genomic variation data, protein-DNA and protein-protein interactions, to identify regulatory interactions. This is in contrast to classical computational reverse engineering methods, most of which only utilize gene expression information to link TFs to target genes (Banf & Rhee, 2017, de Luis Balaguer et al., 2017, Haque et al., 2019). First, we discuss the online platforms available for promoter analysis that identify cis-regulatory elements (CREs) for input genes or gene sets using a motif mapping approach. Next, we present a set of tools that build on this type of promoter analysis by identifying potential regulators and offering visualizations of inferred GRNs, including the integration of various experimental data sources. Lastly, we discuss remaining challenges and future perspectives to efficiently map GRNs in plants, making full use of the different regulatory -omics data types currently available.
Results
Availability of transcription factor motifs
Due to combined efforts in studying the binding preferences of TFs in Arabidopsis, the number of available TF motifs has increased drastically during the past decade. TRANSFAC, PlantCARE and PLACE were the very first databases that collected consensus sequence information from the literature (Higo et al., 1999, Rombauts et al., 1999, Wingender et al., 2000). However, PlantCARE and PLACE were limited to 435 and 469 motifs, respectively, and lacked information linking each motif to the TFs potentially binding it (Higo et al., 1999, Rombauts et al., 1999, Wingender et al., 2000). TRANSFAC is a commercial database for which users require a license. In 2003, the AGRIS database was introduced, which systematically collected motifs from the above sources, annotated them to the TFs and grouped them into TF families (Davuluri et al., 2003). Experimentally known TFBSs were computationally mapped to Arabidopsis promoter regions to identify putative targets of TFs, making AGRIS the largest information resource at that time for constructing regulatory networks in Arabidopsis. AthaMap was a similar database which integrated publicly available TF motif information from TRANSFAC and other literature sources to cover 126 TFs belonging to 29 TF families (Steffens et al., 2004). The database also maps the motifs on Arabidopsis gene promoters and allows easy visualization of TFBSs for input genes of interest. The JASPAR database, introduced in the same year, systematically collects, curates and annotates TF motifs for different eukaryotes, and is considered an alternative to the commercial TRANSFAC database. The latest JASPAR collection covers 489 TF motifs from various sources including Electrophoretic Mobility Shift Assay (EMSA), Systematic Evolution of Ligands by Exponential Enrichment followed by sequencing (SELEX-Seq), PBM and ChIP-Seq. It is widely used in the research community and is updated on a tri-yearly basis (Khan et al., 2018, Sandelin et al., 2004).
The introduction of PBMs and their use by the Arabidopsis community led to a vast increase in the number of TF motifs available. The technique was first used on a large scale by Franco-Zorrilla and co-workers to identify motifs of 63 TFs in Arabidopsis (Franco-Zorrilla et al., 2014). Later, Weirauch and co-workers extended its use to identify TFBSs in 131 species, covering more than 1000 TFs, and presented this catalogue in the CIS-BP database (Weirauch et al., 2014).
DAP-Seq is the most recent and novel technology for profiling TFBSs, and was used to identify TFBS information for 529 Arabidopsis TFs (O'Malley et al., 2016). DAP-Seq allows rapid profiling of genome-wide TFBSs by pairing affinity-purified TFs with next-generation sequencing. This high-throughput in vitro assay created a comprehensive map of the Arabidopsis cistrome, comprising 2.6 million genome-wide TFBSs for 529 TFs. All the above technological advances are helping to increase the number of profiled TFBSs and the understanding of transcriptional regulation for given biological processes (Ezer et al., 2017, Ming et al., 2015, Santuari et al., 2016, Sullivan et al., 2014, Zhang et al., 2015).
Tools for promoter motif analysis of input genes
By the end of 2000, a complete genome sequence of the model plant Arabidopsis was available (Arabidopsis Genome, 2000). After identifying the structure of genes within the genome, the next natural step was to unravel the function of each gene and study how these genes are regulated. To tackle the latter question, different methods were used to identify CREs in the promoter of a gene of interest. One of the first databases collecting plant CREs was the PlantCARE database (Rombauts et al., 1999). The database maintained 417 CREs (159 from monocots and 263 from dicots) which were collected from various literature sources. A query on this database outputs a list of matching entries together with their metadata, including TFBS information, organism, position and matrix similarity. Besides maintaining the CREs, this database integrates other utilities to identify CREs in input promoters. The output of the ‘Search for CARE’ utility allows users to easily visualize the CREs on the promoter of the input gene on an HTML page.
As research on identifying TF motifs continued, new databases such as TRANSFAC, PLACE, AGRIS, PlnTFDB, JASPAR and AthaMap were developed to systematically maintain motif information (Davuluri et al., 2003, Higo et al., 1999, Riano-Pachon et al., 2007, Rombauts et al., 1999, Sandelin et al., 2004, Steffens et al., 2004). PlantPAN was the first tool that integrated not just TF motifs but also information about CpG islands and tandem repeats in the promoter sequences of genes from multiple species (Chang et al., 2008). The “Cross-species” option of this tool allows the conservation of TFBSs in promoters of homologous genes to be studied effectively. Thus, this tool became an effective platform for a diverse set of analyses beyond simple promoter analysis. There have been multiple updates to this tool since its original release, with the latest version incorporating a new database called “PCBase”, which focuses entirely on integrating protein-DNA interactions from ChIP-Seq experiments (Chow et al., 2019, Chow et al., 2016). This includes 1,233,999 protein-DNA interactions for 99 regulatory factors from 7 species (Table 1). The set of regulatory factors contains not only TFs but also histones and other DNA binding proteins. This database includes up-to-date ChIP-Seq experiments and contains multiple sources of TFBSs along with 2,449 de novo motifs from ChIP-Seq processing (Chow et al., 2019).
The PlantTFDB database, which initially started as a catalogue of plant TFs and their classification into TF families, has evolved significantly over the past few years (Guo et al., 2008). The first version of the database consisted of a set of TFs from various species and included other information such as protein sequences, annotation at the family level and orthologous TFs (Guo et al., 2008). The next two updates of this database brought an increased number of TFs and more advanced methods of TF family prediction (Zhang et al., 2011, Jin et al., 2014). In the latest update, PlantTFDB4.0, this database stepped towards becoming a resource for regulatory interactions in plants and introduced the PlantRegMap portal, providing various data sets and tools to study gene regulation (Jin et al., 2017). High-quality TF motifs from various sources were collected and manually curated to obtain 674 non-redundant motifs. Aside from TF motifs, over 4 million TF footprints derived from DNase-Seq experiments were integrated to identify genome-wide TFBSs of TFs (Table 1). The “Binding Site Prediction” tool from the PlantRegMap platform uses the high-quality manually curated TF motifs to scan the input promoter sequences using the MATCH program (Kel et al., 2003). The output page of this tool displays detailed information on the TFBSs, including the TF name and family, the matching sequence, the position of this sequence on the promoter and the significance level of the motif match. The tool allows this results page to be downloaded in a standardised file format for further analysis.
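The motif scanning step described above can be illustrated with a minimal sketch. This is not the MATCH algorithm used by PlantRegMap; it is a generic log-odds scan of a position weight matrix over one strand of a sequence. The count matrix, the uniform background, the score threshold and the example promoter are all made-up values chosen for illustration.

```python
import math

# Hypothetical position frequency matrix for a 6-bp G-box-like motif (CACGTG).
# One dict per motif position; counts are illustrative, not taken from any database.
PFM = [
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 1, "C": 1, "G": 17, "T": 1},
]


def to_log_odds(pfm, background=0.25):
    """Convert counts to log2 odds against a uniform background (an assumption)."""
    pwm = []
    for column in pfm:
        total = sum(column.values())
        pwm.append({base: math.log2((count / total) / background)
                    for base, count in column.items()})
    return pwm


def scan(sequence, pwm, min_score):
    """Return (position, site, score) for every forward-strand window scoring >= min_score."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= min_score:
            hits.append((start, window, round(score, 2)))
    return hits


pwm = to_log_odds(PFM)
promoter = "TTGACACGTGGCATTACACGTTA"  # made-up sequence
hits = scan(promoter, pwm, min_score=8.0)
```

With this threshold only the exact CACGTG site is reported; the near-match CACGTT further downstream falls below it. A real analysis would also scan the reverse strand and would use a background model estimated from the genome.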
CressInt, a web resource that integrates ChIP-Seq, TFBS information, DNase I hypersensitive sites (DHS) maps and expression information was developed by Chen and co-workers in 2015 (Chen et al., 2015). This platform overcomes the limitation on the number of input genes and identifies the TFBSs located in the promoters of thousands of genes at a time. The “Intersect” mode of this platform performs an intersection of input gene coordinates with the regulatory datasets stored in the database in order to identify potential regulatory links between TFs and input genes.
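The idea behind such a coordinate intersection can be sketched as follows. This is not CressInt's implementation; the function names, the half-open coordinate convention and all identifiers and coordinates are assumptions made for the example.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect(promoters, peaks):
    """Link each TF to the genes whose promoter overlaps one of its peaks.

    promoters: list of (gene, chrom, start, end)
    peaks:     list of (tf, chrom, start, end)
    """
    peaks_by_chrom = defaultdict(list)
    for tf, chrom, start, end in peaks:
        peaks_by_chrom[chrom].append((tf, start, end))

    links = set()
    for gene, chrom, p_start, p_end in promoters:
        for tf, start, end in peaks_by_chrom[chrom]:
            if overlaps(p_start, p_end, start, end):
                links.add((tf, gene))
    return sorted(links)


# Made-up coordinates and identifiers, for illustration only.
promoters = [("geneA", "Chr1", 1000, 2000), ("geneB", "Chr1", 5000, 6000),
             ("geneC", "Chr2", 1000, 2000)]
peaks = [("TF1", "Chr1", 1900, 2100), ("TF1", "Chr2", 3000, 3200),
         ("TF2", "Chr1", 5500, 5600), ("TF2", "Chr1", 2000, 2300)]
links = intersect(promoters, peaks)
```

The nested loop is adequate for a few thousand promoters; for genome-scale data an interval tree or a sorted sweep would be the usual choice.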
Zhang and co-workers made use of DHS maps in the Arabidopsis, Brachypodium distachyon (Brachypodium) and Oryza sativa (rice) genomes in order to generate the PlantDHS database, allowing different aspects of gene regulation to be investigated (Zhang et al., 2016). Aside from DHS maps, various regulatory datasets such as histone modifications, nucleosome occupancy and TFBSs were integrated. Moreover, RNA-Seq expression datasets were incorporated in order to easily visualize the gene expression levels for different tissues and developmental stages. This database is currently limited to the three species listed above. In addition, the database only accepts a single input gene at a time, and does not have the option to download the output from the analysis for further processing. However, the bulk download option enables the user to download the BigWig files corresponding to each condition or tissue.
Databases like Expresso and ChIPBASE2.0 were developed in order to predict regulators of input genes based on ChIP-Seq data (Zhou et al., 2017, Aghamirzaie et al., 2017). The users can upload their own gene expression data to the Expresso database and can perform a correlation analysis for the predicted TF-target gene pair. However, this database is limited to only 20 TFs in Arabidopsis and most of the recent ChIP-Seq experiments have not been incorporated into the database (Aghamirzaie et al., 2017). The “Regulator” module of ChIPBASE2.0 aids in the prediction of TFBSs and histone modifications for 10 species but could not be run for Arabidopsis (Zhou et al., 2017).
The Regulatory Sequence Analysis Tools (RSAT) suite has offered a mirror dedicated to the regulatory sequence analysis of plants since 2015 (Castro-Mondragon et al., 2016). Some of the utilities present in this suite are promoter-based and genome-scale pattern matching tools that look for the matching positions of regulatory motifs in a promoter region. Performing the same analysis at the genome scale provides information on the putative targets of the motif considered. TF motifs are pre-included in the database and come from various sources such as JASPAR, TRANSFAC and CIS-BP. A drawback of this tool is that the pattern matching needs to be done in a series of consecutive steps, starting from the promoter sequence extraction followed by the mapping of motifs. The TF motifs need to be selected from one source at a time and there is no possibility to map motifs from different sources collectively. The output of the tool is non-interactive and consists of a static feature map.
Austin and co-workers presented BAR tools, a set of tools to explore and analyze Arabidopsis gene expression profiles and the presence of TFBSs in their promoter regions (Austin et al., 2016). Cistome is one of the tools that helps users to identify TFBSs in input promoter sequences. The tool allows the user to choose the motif database of preference. The motif databases included in this tool are CIS-BP, JASPAR and PLACE, out of which CIS-BP and JASPAR link the motifs to TFs, leading to an efficient assignment of TF-target interactions in the output. Cistome also offers users the option to filter for significant motifs after executing a motif enrichment analysis. Motif enrichment is useful in limiting the number of false positive predictions, and is determined by comparing the number of occurrences of TFBSs in the input promoters against the occurrences in randomly sampled promoters. Cistome allows users to select the size of the promoter from three options and also has the flexibility to choose this distance upstream from transcription or translation start sites in order to include or exclude the untranslated regions in the analysis. The output of this tool is completely web-based and offers the location of enriched TFBSs on input promoter regions. Hovering over the TFBSs reveals information about the enriched motifs, including the sequence logo of the motif and its enrichment score (Figure 1). Users, however, cannot download the output files for their post-processing into GRNs.
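The effect of anchoring the promoter at the transcription or the translation start site can be made concrete with a small, strand-aware coordinate calculation. This sketch does not reproduce Cistome's code; the function name, the 1-based inclusive coordinate convention and the example positions are assumptions.

```python
def promoter_interval(strand, anchor, upstream, chrom_length):
    """Return the 1-based inclusive (start, end) of the region upstream of an anchor.

    anchor is the coordinate of the transcription or translation start site.
    Returns None when no upstream sequence is left after clipping.
    """
    if strand == "+":
        start, end = anchor - upstream, anchor - 1
    elif strand == "-":
        start, end = anchor + 1, anchor + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip to the chromosome so genes near the ends do not yield invalid coordinates.
    start, end = max(1, start), min(chrom_length, end)
    if start > end:
        return None
    return start, end


# Hypothetical plus-strand gene: TSS at 5,000 and start codon at 5,150.
from_tss = promoter_interval("+", 5000, 1000, 30_000_000)  # excludes the 5' UTR
from_atg = promoter_interval("+", 5150, 1000, 30_000_000)  # includes the 5' UTR
minus_strand = promoter_interval("-", 8000, 1000, 30_000_000)
```

Anchoring at the start codon shifts the window downstream, so the 5' untranslated region is scanned at the cost of the most distal 150 bp in this example.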
| Tool (year)* | Species | Type of data integrated | #PWMs and #TFs | Input | Output | Link |
|---|---|---|---|---|---|---|
| CressInt (2015) | Arabidopsis | ChIP-Seq, RNA-Seq, TFBS models, DNase-Seq, eQTLs, GWAS | CIS-BP and 16 ChIP-Seq for TFs and histones | Genomic coordinates or gene names | HTML | https://cressint.cchmc.org/cressint/ |
| Cistome (2016) | Arabidopsis | TFBS | JASPAR or CIS-BP (incl. Gene Slider conservation information) or your own set of motifs | Multiple genes | HTML | http://bar.utoronto.ca/cistome/cgi-bin/BAR_Cistome.cgi |
| PlantDHS (2016) | Arabidopsis, Rice, Brachypodium | Histone modification, RNA sequencing, nucleosome occupancy, TFBS, genomic sequences | - | Single gene | Genome browser | http://plantdhs.org/ |
| RSAT Plants (2016) | 70 species | ChIP-Seq | JASPAR, TRANSFAC and CIS-BP | Multiple genes | HTML | http://rsat.eead.csic.es/plants/index.php |
| Expresso (2017) | Arabidopsis | ChIP-Seq | Includes 20 processed TF ChIP-Seq | Multiple genes | HTML | https://bioinformatics.cs.vt.edu/expresso/ |
| PlantRegMap (Binding site prediction) (2017) | 132 species | 345,920 peaks for 14 TFs based on 26 ChIP-Seq datasets and 4,794,773 TF footprints based on 13 DNase-Seq experiments + histone modifications | Manually curated 674 non-redundant high quality motifs from CIS-BP, JASPAR, TRANSFAC, UniPROBE, MEME-ChIP from ChIP datasets, Franco-Zorrilla | Multiple genes | HTML | http://plantregmap.gao-lab.org/binding_site_prediction.php |
| PlantPAN3.0 (2019) | Arabidopsis, Brachypodium, Chlamydomonas reinhardtii, Glycine max, Malus domestica, Oryza sativa, Populus trichocarpa, Sorghum bicolor, Volvox carteri, Zea mays, Physcomitrella patens | ChIP-Seq for TFs, histones and other DNA binding proteins, protein-DNA complex tertiary structures and protein sequence-based annotation received from PDB | PLACE, JASPAR, CIS-BP, UNIPROT, Franco-Zorrilla, DAP-Seq, and 2,449 PWMs derived from ChIP-Seq | Single gene | HTML | http://plantpan.itps.ncku.edu.tw/ |
Table 1. Overview of plant promoter motif analysis tools. * indicates the year of the latest publication or update.
Tools for network analysis
Most of the tools listed in the previous section identify the presence of a certain TFBS in promoter sequences of genes of interest. A common problem with this approach is that naïve motif mapping leads to the generation of a high number of false positive predictions. Moreover, none of the above tools extends the presence of a TFBS in a promoter to the delineation of GRNs. In order to thoroughly understand the mechanism of gene regulation for a given organ, tissue or biological process, there is a need to study the GRN underlying that state. Tools that perform network analysis are able to infer such GRNs.
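A back-of-the-envelope calculation shows why naïve mapping of short motifs yields many chance matches. The sketch below counts the expected exact occurrences of one fixed, non-palindromic k-mer in random sequence; the function name is hypothetical, and the assumption of independent, equally frequent bases is a simplification that real promoters do not satisfy.

```python
def expected_chance_hits(motif_length, promoter_length, n_promoters, both_strands=True):
    """Expected exact matches of one fixed k-mer in random, uniform-base sequence."""
    windows = promoter_length - motif_length + 1  # start positions per promoter
    per_window = 0.25 ** motif_length             # chance that one window matches
    strands = 2 if both_strands else 1
    return windows * per_window * strands * n_promoters


# A 6-bp site versus a 10-bp site in 100 promoters of 1 kb each.
six_mer = expected_chance_hits(6, 1000, 100)   # roughly 49 matches expected by chance
ten_mer = expected_chance_hits(10, 1000, 100)  # roughly 0.2 matches expected by chance
```

Under these assumptions a 6-bp site is expected in about every second 1-kb promoter without any regulatory function, whereas degenerate positions in a real motif raise the chance rate further.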
AtRegNet was the first database in Arabidopsis that linked genes to the TFs controlling them (Palaniswamy et al., 2006). AtRegNet was presented as a utility of the Arabidopsis Gene Regulatory Information Server (AGRIS) that collected information on TFs and annotated promoters of approximately 33,000 Arabidopsis genes with a mixture of experimentally confirmed and predicted TF motifs (Palaniswamy et al., 2006). The AtRegNet database consists of 1,638,778 direct interactions between TFs and target genes and allows easy visualization of regulatory networks. Users can also download the regulatory interactions of interest for further analysis. The database was a significant contribution to the Arabidopsis community, allowing GRNs to be efficiently analyzed in this plant. However, the interactions incorporated into this database were limited to those collected from literature and lacked recent ChIP-Seq studies (Heyndrickx et al., 2014). Thus, a continuous update is required to add newly discovered TF-target gene interactions.
An alternative to the database approach that stores pre-computed TF-target gene interactions is to find the presence of enriched TFBSs in input promoter sequences on-the-fly using motif enrichment. A motif enrichment approach differs from simple motif mapping in that it aims to reduce the false positive rate, caused by the short and degenerate nature of TFBSs, by considering a background model. More precisely, it identifies over-represented motifs in the input data by estimating the significance of the observed motif counts against the expected counts derived from the background model, which is built either from all promoters or from a randomly sampled subset of promoters. Typically, Fisher’s Exact test or the hypergeometric distribution is applied to estimate significance levels. Furthermore, as many motifs are typically tested for enrichment, statistical methods to correct for multiple hypothesis testing are required to control the false discovery rate.
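The enrichment test and the multiple testing correction can be sketched in a few lines. The upper tail of the hypergeometric distribution used here is equivalent to a one-sided Fisher's Exact test, and the Benjamini-Hochberg procedure is one common way to control the false discovery rate; none of the reviewed tools is claimed to use exactly this code. Function names and all counts are illustrative assumptions.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n promoters from N, of which K contain the motif."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up example: 1,000 background promoters, 50 input promoters.
# (input promoters with motif, background promoters with motif) per motif
counts = {"motifA": (20, 100), "motifB": (16, 300), "motifC": (2, 50)}
names = list(counts)
p_values = [hypergeom_sf(k, 1000, K, 50) for k, K in counts.values()]
q_values = dict(zip(names, benjamini_hochberg(p_values)))
enriched = [name for name in names if q_values[name] < 0.05]
```

Only the first motif, found in 20 of 50 input promoters against an expectation of 5, remains significant after correction. Exact integer arithmetic keeps the sketch free of dependencies; for large gene sets a statistics library would be the practical choice.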
The latest update of PlantTFDB introduces PlantRegMap, a comprehensive platform to study GRNs in plants (Jin et al., 2017). Apart from the “Binding Site Prediction” tool (described in the previous section), a “Regulation Prediction” tool (TF enrichment) was developed that uses pre-generated TF-target interactions from various regulatory data sources such as ChIP-Seq, TF footprints and literature (Table 2). These interactions are used to calculate enriched TFBSs for the input set of gene promoters using Fisher's Exact test, starting from a set of 674 non-redundant and curated TF motifs. The output of this tool is displayed on an HTML page, with files containing GRN interactions and statistical parameters for the enrichment test available for download. There is an immediate linking of TFBSs to TFs, making it possible to efficiently delineate GRNs.
The SeqEnrich tool developed by Becker and co-workers aims to predict GRNs by taking as input a list of co-expressed genes from Arabidopsis or Brassica napus (Brassica) (Becker et al., 2017). The authors collected motifs from motif databases like JASPAR, PBM sources like CIS-BP and Franco-Zorrilla et al., 2014, and DAP-Seq, and manually curated them to remove duplicate and non-informative motifs. For the TFs that did not contain any specific motif information, putative motifs were assigned based on phylogenetic information of closely-related TFs. The phylogenetic analysis incorporated into the tool drastically increased the number of TFs with at least one associated motif, and the SeqEnrich database now contains 240 degenerate and high-quality motifs for 2,263 Arabidopsis TFs as a result (Table 2). As the information on motifs for Brassica TFs is very sparse, the TF motifs from Arabidopsis homologs were used to make predictions, making SeqEnrich the largest database for Brassica TFs. A potential limitation of the degenerate motifs inferred using phylogenetic analysis is that diverged binding sites for a TF of a specific family are not well modelled. The tool is freely available as a JAVA application and takes a list of Arabidopsis gene identifiers as input. For every analysis, an output directory is generated, which stores the information on motif enrichment and GO enrichment for input gene lists.
The Arabidopsis Interactions Viewer 2.0 (AIV2) is a tool for visualizing predicted and experimentally validated regulatory interactions, including 2.8 million protein-DNA interactions together with 70,944 predicted and 91,175 confirmed protein-protein interactions (Dong et al., 2019). Note that, unlike other tools listed in this section, AIV2 does not perform any enrichment statistics to identify, for example, potential regulators for a user-defined input gene set. Similarly, the Networks section at Plant Regulome (plantregulome.org) offers an interactive interface to browse gene regulatory networks in different cellular conditions (control, light and heat treatments) based on nucleotide-resolution maps of accessible regulatory DNA profiled using DNase-Seq (Sullivan et al., 2014).
Kulkarni and co-workers combined various TFBSs for 916 TFs and developed the TF2Network tool to unravel GRNs underlying a set of co-expressed genes in Arabidopsis using TFBS information (Kulkarni et al., 2018). Both PlantRegMap and TF2Network use the motif enrichment approach to identify enriched TFBSs in input gene sets. The TF2Network performance was evaluated using benchmarks based on ChIP-Seq and lists of differentially expressed genes following TF perturbation. This analysis revealed that TF2Network performs better than Cistome and PlantRegMap (Kulkarni et al., 2018). Moreover, the web interface of TF2Network not only displays the GRN for enriched TFs but also integrates various experimental regulatory datasets such as protein-DNA interactions (PDI), protein-protein interactions (PPI) and co-expression information (Table 2). This allows the user to easily integrate multiple evidence sources. The motif enrichment output is available for download and contains information on statistical measures, lists of target genes for every TF or motif, and network files that can be further explored using Cytoscape. An example of a network inferred using TF2Network is shown in Figure 2, illustrating the different data types that were integrated.
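How several evidence sources can be combined into a network file readable by Cytoscape is sketched below. The sketch writes the simple interaction format (SIF), in which each line holds a source node, an interaction type and a target node. It does not reproduce the TF2Network output files; the function names, evidence labels and gene identifiers are hypothetical.

```python
from collections import defaultdict


def merge_evidence(*sources):
    """Combine (tf, target) edge lists, recording which sources support each edge.

    Each source is a (label, edges) pair.
    """
    support = defaultdict(set)
    for label, edges in sources:
        for tf, target in edges:
            support[(tf, target)].add(label)
    return support


def to_sif(support, min_sources=1):
    """Render edges as tab-separated SIF lines: source, interaction type, target."""
    lines = []
    for (tf, target), labels in sorted(support.items()):
        if len(labels) >= min_sources:
            interaction = "+".join(sorted(labels))
            lines.append(f"{tf}\t{interaction}\t{target}")
    return "\n".join(lines)


# Made-up edges from three evidence types.
motif = [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3")]
pdi = [("TF1", "g1"), ("TF2", "g3")]
coexp = [("TF1", "g1"), ("TF1", "g2")]

support = merge_evidence(("motif", motif), ("pdi", pdi), ("coexp", coexp))
all_edges = to_sif(support)
strict_edges = to_sif(support, min_sources=3)
```

Raising `min_sources` trades network size for confidence: in this toy example only the TF1 to g1 edge is backed by all three evidence types.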
| Tool (year)* | Species | Type of data integrated | #PWMs and #TFs | Input | Output | Multiple hypothesis correction | Link |
|---|---|---|---|---|---|---|---|
| Plantregulome.org (2014) | Arabidopsis | DNase-Seq different conditions | - | TF selection | HTML | - | http://plantregulome.org/ |
| SeqEnrich (2017) | Arabidopsis and Brassica | Gene ontology | 2,263 Arabidopsis TFs and 240 motifs | Gene set | Folder with all output files | No | - |
| PlantRegMap (TF enrichment) (2017) | 132 species | 4,794,773 TF footprints based on 13 DNase-Seq experiments + histone modifications | CIS-BP, JASPAR, TRANSFAC, UniPROBE, MEME-ChIP from ChIP datasets, Franco-Zorrilla et al., plus manual curation; 674 non-redundant high quality motifs from 45 families | Gene set | HTML | Yes | http://plantregmap.gao-lab.org/tf_enrichment.php |
| AtRegNet (2019) | Arabidopsis, Rice and Maize | 1,638,778 direct interactions between TFs and target genes | 1,638,778 direct interactions between TFs and target genes | TF selection | HTML | - | https://agris-knowledgebase.org/ |
| TF2Network (2018) | Arabidopsis | PPI, PDI, co-expression, Gene ontology | 1793 TFBSs for 916 TFs collected from CIS-BP, Franco-Zorrilla et al., Plant Cistrome Database (DAP-Seq), JASPAR, UNIPROBE, AGRIS and AthaMap | Gene set | HTML | Yes | http://bioinformatics.psb.ugent.be/webtools/TF2Network/index.php |
| AIV2 (2019) | Arabidopsis | PPI and PDI | 2.8 million protein-DNA interactions | Gene set | HTML | - | http://bar.utoronto.ca/interactions2/ |
Table 2. Overview of plant network analysis tools. * indicates the year of the latest publication or update.
Conclusions and Perspectives
The platforms and tools discussed in this review identify cis-regulatory elements in promoters with the aim of linking TFs to target genes and facilitating GRN construction. The biggest disadvantage of these methods is that if a motif for a specific TF is not available, that regulator will not be incorporated into the inferred GRN, resulting in incomplete networks. As described, the motifs for nearly half of the Arabidopsis TFs are still uncharacterized, indicating that a considerable research effort is still required. Although motifs for 529 of the 1,812 profiled Arabidopsis TFs were recovered using DAP-Seq, it remains to be seen if other methods such as profiling interacting TFs (Smaczniak et al., 2017), or more advanced data processing methods combining in vitro and in vivo data, can fill this gap in our knowledge (Brooks et al., 2019). As transcriptional regulation depends on the environmental conditions, cell types and developmental stages under investigation, the tools discussed in this review can be used to study the dynamics of GRNs by using condition- or cell-specific datasets as input.
As TF motif data is heavily biased towards Arabidopsis, performing regulatory sequence analysis in other plant species remains challenging. Whereas mapping Arabidopsis motifs to genomes of closely related species offers a practical means to translate regulatory information to crops, especially those in the Brassicaceae family (Becker et al., 2017), it is less clear how well TF motifs are conserved when focusing on other clades within the flowering plants (Van de Velde et al., 2016). Through the application of ChIP-Seq, DAP-Seq and PBMs, the volume of TF motif information is slowly increasing for other plants (Brown et al., 2019, Galli et al., 2018, Khan et al., 2018). Nevertheless, how to define promoters in order to identify functional TFBSs in larger or more complex plant genomes is not yet fully resolved, although the integration of chromatin state information offers a practical means to reduce the DNA search space (Galli et al., 2018, Rodgers-Melnick et al., 2016, Vera et al., 2014, Van de Velde et al., 2014).
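As a minimal sketch of how chromatin state information can reduce the search space, the example below keeps only those motif matches that fall inside a set of candidate regulatory regions. The coordinates, region sets and function names are hypothetical, and the choice of assay used to define the regions is left open.

```python
# Hypothetical motif matches: (chromosome, start, end, TF),
# using 0-based, half-open coordinates.
MATCHES = [
    ("chr1", 100, 108, "TF_A"),
    ("chr1", 5000, 5006, "TF_B"),
    ("chr2", 250, 258, "TF_A"),
]

# Hypothetical candidate regulatory regions derived from chromatin state data,
# stored per chromosome as (start, end) intervals.
REGIONS = {
    "chr1": [(50, 300)],
    "chr2": [(0, 200), (240, 400)],
}


def in_region(chrom, start, end, regions):
    """True if the match lies entirely within one region on its chromosome."""
    return any(
        r_start <= start and end <= r_end
        for r_start, r_end in regions.get(chrom, [])
    )


def filter_matches(matches, regions):
    """Keep only the motif matches located inside the candidate regions."""
    return [m for m in matches if in_region(m[0], m[1], m[2], regions)]


kept = filter_matches(MATCHES, REGIONS)
print(kept)
```

Requiring full containment is one possible design choice; a partial-overlap criterion would retain more matches at region boundaries.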
An alternative approach is translating a GRN identified in Arabidopsis to other plant or crop species by considering gene orthology information. The assumption is that orthologous genes with conserved regulatory programs across different species will also have a conserved set of targets or regulators. Although gene-based orthology is a way to transfer biological information between model species and crops, polyploidy, species-specific genes, and gene family evolution are all important factors that need to be considered to obtain high-quality and complete functional GRNs (Glover et al., 2016, Van Bel et al., 2012). Whereas expression datasets can be used to determine functional orthologs in the case of polyploidy, identifying species-specific regulatory interactions is less trivial using a comparative approach (Jones et al., 2018, Movahedi et al., 2012, Chang et al., 2019).
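The orthology-based transfer described above can be sketched as follows. The gene identifiers and ortholog table are hypothetical; in practice the ortholog assignments would come from a comparative genomics resource, and the transferred edges would remain predictions that require further support.

```python
# Hypothetical Arabidopsis GRN edges (TF, target).
ARABIDOPSIS_EDGES = [("AtTF1", "AtG1"), ("AtTF1", "AtG2"), ("AtTF2", "AtG3")]

# Hypothetical ortholog table: Arabidopsis gene -> list of crop orthologs.
# One-to-many entries mimic polyploidy or lineage-specific duplication;
# an empty list or a missing key means no ortholog was detected.
ORTHOLOGS = {
    "AtTF1": ["CrTF1a", "CrTF1b"],
    "AtG1": ["CrG1"],
    "AtG2": [],
    "AtTF2": ["CrTF2"],
}


def transfer_edges(edges, orthologs):
    """Project edges onto the crop; return (transferred edges, lost edges)."""
    transferred, lost = set(), []
    for tf, target in edges:
        tf_orths = orthologs.get(tf, [])
        target_orths = orthologs.get(target, [])
        if not tf_orths or not target_orths:
            # The edge cannot be transferred without orthologs for both genes.
            lost.append((tf, target))
            continue
        # One-to-many orthology yields one candidate edge per ortholog pair.
        for crop_tf in tf_orths:
            for crop_target in target_orths:
                transferred.add((crop_tf, crop_target))
    return sorted(transferred), lost


crop_edges, lost_edges = transfer_edges(ARABIDOPSIS_EDGES, ORTHOLOGS)
print(crop_edges)
print(lost_edges)
```

Here the duplicated regulator produces two candidate edges, which expression data could help to discriminate, while edges involving genes without orthologs are lost and species-specific interactions in the crop cannot be recovered at all.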
To conclude, an expanding set of experimental methods is generating complementary, genome-wide read-outs of regulatory DNA. The proper processing and integration of these datasets, together with the development of efficient tools and workflows to access this information for different plant species, will pave the way for the identification of more complete and accurate gene regulatory interactions.
Funding
This work has been funded by Research Foundation–Flanders [G001015N].
Acknowledgements
We thank D. Marc Jones for proofreading the manuscript and the reviewers for valuable comments.
Figures
Figure 1
Figure 1. Screenshot of the Cistome user interface. The “Map View” shows the output of Cistome for the 50 top-scoring direct targets of FLM (AGL27), selected as input, using the motif collection from CIS-BP. Locations of enriched motifs are shown for two target genes. Below the Map View, enriched motifs are listed with their z-score of enrichment. Hovering over the blocks of motifs shows the CIS-BP ID for the motif, the sequence and logo, the functional depth score (a normalized score for motif mapping), the names of other TFs from the same TF family to which the motif is linked, and the distribution of the distance of the motif from the transcription start site.
Figure 2
Figure 2. Screenshot of the TF2Network user interface. The panels show the output of TF2Network for the 500 top-scoring direct targets of WRKY33, which were selected as input. The regulator panel on the left lists predicted regulators with the TF name, the z-score of co-expression (CO), the percentage of known experimental protein-DNA interactions (PD), the rank, the q-value of the enrichment and the number of targets identified. The top-right panel shows the Cytoscape network for the selected regulators, with known experimental protein-DNA interactions in black and protein-protein interactions in blue. The bottom-right panel shows the GO enrichment for the input gene list, while other tabs allow downloading the different networks.