
1.3. BIOINFORMATICS OF GENETIC DIVERSITY

The deciphering of an explosive number of nucleotide sequences in a large number of genes and species, together with the availability of complete genome sequences of model organisms, has changed the way genetic diversity is studied. Population genetics has evolved from a data-limited empirical science into an interdisciplinary activity that brings together theoretical models and statistical methods of interpretation, advanced molecular techniques for massive sequence production, and large-scale bioinformatic tools for data mining and management. As a result, population genetics is today an information-driven science in which hypotheses can be tested directly on the data sets stored in online databases, and bioinformatics has emerged as a new cutting-edge approach to doing science.

1.3.1. MANAGEMENT AND INTEGRATION OF MASSIVE HETEROGENEOUS DATA: THE NEED FOR RESOURCES

Bioinformatics is the computerized analysis of biological data, in which biology, computer science and information technology merge into a single discipline. The ultimate goal of bioinformatics is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

The evolution of bioinformatics has been marked by three main stages (VALENCIA 2002; KANEHISA and BORK 2003). At the beginning of the genomic revolution, the major concern of bioinformatics was the creation and maintenance of databases to store raw biological data (or primary databases), such as nucleotide and amino acid sequences, together with effective interfaces and new computational methods to facilitate large-scale access to and analysis of the data. Later, bioinformatics became a powerful technology focused on creating databases of biological knowledge (or secondary databases) from previously unprocessed data (GALPERIN 2007). The major difficulty arises from the fact that there are almost as many formats for storing and representing the data as there are databases (STEIN 2002). In this sense, bioinformatics has become essential to manage and integrate the torrent of raw data and transform it into biological knowledge (SEARLS 2000; JACKSON et al. 2003). The ultimate goal of bioinformatics, however, is to combine all this information and create a comprehensive picture of complex biological systems (DI VENTURA et al. 2006). The actual process of analyzing and interpreting data is often referred to as computational biology. Even so, theory, modeling and data processing will continue to grow in importance as scientists working on model systems cease to be limited by data availability (STEIN 2003).
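To give a concrete flavor of the format-heterogeneity problem discussed above, the following minimal sketch (in Python, assuming the Biopython package is installed; the file names are placeholders, not actual datasets) reads sequence records stored in two different formats into one uniform in-memory representation:

```python
from Bio import SeqIO

# Two hypothetical input files in different formats holding comparable data.
sources = [("sequences.gb", "genbank"), ("sequences.fasta", "fasta")]

records = []
for filename, fmt in sources:
    # SeqIO hides the format differences behind a single record interface.
    records.extend(SeqIO.parse(filename, fmt))

# Downstream analyses can now treat all records identically.
for rec in records:
    print(rec.id, len(rec.seq))
```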

The first absolute requirement arising from the deluge of data in the genomic era is the establishment of computerized databases to store, organize and index massive and complex datasets, together with specialized tools to view and analyze the data. A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query and retrieve the components of the data stored within the system. A simple database might be a single flat file containing many records, each of which includes the same data fields. This strategy is still extensively used because of the standardization of formats and the existence of the PERL (Practical Extraction and Report Language) programming language (STEIN 2001), which is very powerful for scanning, searching and manipulating textual data. However, relational databases (RDB) (CODD 1970) offer the best performance for complex and highly structured data, as is the case for biological data. In an RDB, information is distributed into tables of rows and columns, and tables are related through one or several common fields. This system is especially well suited to performing queries using the standard SQL (Structured Query Language).
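As an illustration of the relational approach, the following minimal sketch uses Python's standard sqlite3 module with a small hypothetical schema (the gene and snp tables are illustrative, not those of any actual resource) to show how related tables are queried with SQL:

```python
import sqlite3

# In-memory toy database; a real resource would use a server-based RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two related tables sharing the common field 'gene_id' (hypothetical schema).
cur.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, species TEXT)")
cur.execute("CREATE TABLE snp (snp_id INTEGER PRIMARY KEY, gene_id INTEGER, "
            "position INTEGER, heterozygosity REAL, "
            "FOREIGN KEY(gene_id) REFERENCES gene(gene_id))")

cur.execute("INSERT INTO gene VALUES (1, 'Adh', 'Drosophila melanogaster')")
cur.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)",
                [(1, 1, 1068, 0.48), (2, 1, 1443, 0.05)])

# A typical query: all SNPs above a heterozygosity threshold, joined to gene data.
for row in cur.execute(
        "SELECT g.name, g.species, s.position, s.heterozygosity "
        "FROM snp s JOIN gene g ON s.gene_id = g.gene_id "
        "WHERE s.heterozygosity > 0.1"):
    print(row)
```

The JOIN clause is what exploits the common field relating the two tables, precisely the kind of structured relationship that a flat file cannot express directly.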

Two more requirements are necessary for researchers to benefit from the data stored in a database: (i) easy access to the information, and (ii) a method for extracting only the information needed to answer a specific biological question. The web has played a very important role in genetics research by allowing universal and free exchange of biological data (GUTTMACHER 2001). As a result of large projects such as the sequencing of the human genome (LANDER et al. 2001; VENTER et al. 2001), powerful portals have been created to store and distribute a wide variety of data, which also include sophisticated web tools for their analysis (WHEELER et al. 2007). Indeed, the much-heralded milestone of completing the human genome sequence would not have been possible without the arrival of the Internet and the development of information and communication technologies (ICTs).

1.3.2. DATABASES OF NUCLEOTIDE VARIATION

Nowadays, four dominant resources allow free access to nucleotide variation data (Table 7). The largest and primary public-domain archive for simple genetic variation data is the Entrez dbSNP section of NCBI (WHEELER et al. 2007). It contains single nucleotide polymorphisms (SNPs), small-scale multi-base deletions or insertions (also called deletion-insertion polymorphisms or DIPs), and STRs (microsatellites) associated to genome sequencing projects of 43 different species, including human (>11.8 million SNPs, of which >5.6 million validated), mouse (>10.8 million SNPs, of which >6.4 million validated), dog (>3.3 million SNPs, of which 217,525 validated), chicken (>3.2 million SNPs, of which >3.2 million validated), rice (>3.8 million SNPs, of which 22,057 validated) and chimpanzee (>1.5 million SNPs, of which 112,654 validated). These SNPs can be browsed according to different criteria, such as heterozygosity or functional class.
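Such browsing can also be done programmatically through the NCBI Entrez E-utilities. The following minimal sketch, assuming the Biopython package and network access (the e-mail address is a placeholder and the query term is only an example), counts the human records in dbSNP:

```python
from Bio import Entrez

# NCBI requires a contact address for E-utilities requests (placeholder here).
Entrez.email = "your.name@example.org"

# Search dbSNP for human SNP records; the term follows standard Entrez syntax.
handle = Entrez.esearch(db="snp", term="Homo sapiens[Organism]", retmax=5)
record = Entrez.read(handle)
handle.close()

print("Total records found:", record["Count"])
print("First identifiers:", record["IdList"])
```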

However, most maps have been developed under a medical or applied focus, and thus their application to evolutionary studies is limited. The non-random sampling of SNPs and/or individuals, the analysis of very few individuals (only those needed to position the SNP in the genome), or the inability to obtain haplotypic phases, together with the fact that only sequenced genomes have such a resource, make Entrez dbSNP an inappropriate source of data on which to carry out multi-species population evolutionary studies.

The HAPMAP project (THE INTERNATIONAL HAPMAP CONSORTIUM 2003, 2004; THORISSON et al. 2005; FRAZER et al. 2007) is an international effort to catalog common genetic variants and their haplotype structure in different human populations, with the goal of identifying genes affecting health, disease and responses to drugs and environmental factors. It is undoubtedly the most comprehensive description of nucleotide variation in any species (THE INTERNATIONAL HAPMAP CONSORTIUM 2005; HINDS et al. 2005), but its medical focus limits its application to population genetics studies. For example, only common SNPs (minor allele frequency >5%) were selected for HAPMAP Phase I, and the sampling methodology changed during the course of the project, which severely hinders any evolutionary interpretation of the patterns found. The lack of a complete data collection and the biases underlying SNP sampling make it virtually impossible to specify an evolutionary model of human genetic variation from the HAPMAP data (MCVEAN et al. 2005).

Table 7 Data sources of nucleotide variation

Resource | Description | Amount of data (ζ) | Reference
Entrez dbSNP (NCBI) | Mapped SNPs associated with genome sequencing projects of eukaryotic species | >51.3 million SNPs (of which >22.2 million validated) from 43 species | WHEELER et al. (2007)
HAPMAP | Haplotype map of the human genome | >3.7 million genotyped SNPs in 270 independent samples from 4 human populations (CEU, CHB, JPT, YRI) | FRAZER et al. (2007)
Entrez POPSET (NCBI) | Haplotypic sequences from population studies of polymorphism or divergence | >52,000 eukaryotic entries | WHEELER et al. (2007)
GENBANK (Entrez NUCLEOTIDE) | Public collection of all available non-redundant DNA sequences | >76.1 million sequences (EST: >46.1 million; GSS: >18.2 million) | BENSON et al. (2007); WHEELER et al. (2007)

ζ HAPMAP Public Release #22 (March 2007); Entrez dbSNP Build 127 (March 2007); Entrez POPSET and GENBANK – Entrez NUCLEOTIDE as consulted on Sep 21st, 2007 (excluding Whole Genome Shotgun (WGS) sequences and constructed (CON-division) sequences).

The Entrez POPSET database (WHEELER et al. 2007) is a collection of haplotypic sequences gathered in studies carried out from a population perspective, with sequences coming either from different members of the same species or from organisms of different species. Even though it contains >52,000 eukaryotic entries, POPSET is a mere repository of DNA sequences, some of which have been aligned by the authors, and it provides no descriptive or comparative information on genetic diversity for any polymorphic set.
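POPSET records can likewise be retrieved programmatically through the E-utilities. A minimal sketch, under the same assumptions as above (Biopython, network access, placeholder e-mail address; the query term is only an example):

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder contact address

# Find population sets for a species of interest (example query term).
handle = Entrez.esearch(db="popset", term="Drosophila melanogaster[Organism]", retmax=1)
ids = Entrez.read(handle)["IdList"]
handle.close()

# Fetch the sequences of the first matching population set in FASTA format.
if ids:
    handle = Entrez.efetch(db="popset", id=ids[0], rettype="fasta", retmode="text")
    print(handle.read()[:500])  # show the beginning of the record
    handle.close()
```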

Finally, another potential resource for the study of nucleotide variation is the >76.1 million non-redundant sequences (>71.6 million from eukaryotes) of any region and/or species that are stored in major public DNA databases, such as Entrez NUCLEOTIDE (GENBANK) (BENSON et al. 2007; WHEELER et al. 2007) (see Figure 7). This dataset contains all the sequences from the Entrez POPSET database together with a large number of other sequences that are heterogeneous with respect to their origin and the motivation for their sequencing. In principle, all these sequences could be used to estimate genetic diversity in a large number of genes and species and to carry out a large-scale description of nucleotide variation patterns in any taxon (PANDEY and LEWITTER 1999). In such an approach, the reliability of the estimates depends on developing proper data-mining and analysis tools that include accurate filtering criteria for the source data, as well as efficient checking procedures and quality parameters associated with each estimate.
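As a simple illustration of such an estimate and its associated filtering step, the following sketch computes nucleotide diversity (π, the average number of pairwise differences per site) from a set of aligned haplotypic sequences, discarding alignment columns with gaps or ambiguous bases as a crude stand-in for the filtering criteria discussed above; the sequences are illustrative only:

```python
from itertools import combinations

def nucleotide_diversity(alignment):
    """Average number of pairwise differences per site (pi).

    Columns containing gaps or ambiguous bases are discarded first,
    a crude filtering criterion for illustration purposes.
    """
    valid = set("ACGT")
    # Keep only alignment columns where every sequence has an unambiguous base.
    columns = [col for col in zip(*alignment) if set(col) <= valid]
    if not columns:
        raise ValueError("no analyzable sites left after filtering")
    n_sites = len(columns)
    # Rebuild the filtered sequences from the retained columns.
    seqs = ["".join(col[i] for col in columns) for i in range(len(alignment))]
    # Per-site divergence for every pair of sequences, then the average.
    pair_diffs = [
        sum(a != b for a, b in zip(s1, s2)) / n_sites
        for s1, s2 in combinations(seqs, 2)
    ]
    return sum(pair_diffs) / len(pair_diffs)

# Toy aligned haplotypes (illustrative data, not from any database).
haplotypes = [
    "ATGC-TACGT",
    "ATGCATACGA",
    "ATGCATACGA",
    "ACGCATACGT",
]
print(f"pi = {nucleotide_diversity(haplotypes):.4f}")
```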