• Aucun résultat trouvé

Bioinformatics and the discovery of gene function

N/A
N/A
Protected

Academic year: 2021

Partager "Bioinformatics and the discovery of gene function"

Copied!
2
0
0

Texte intégral

(1)

C O M M E N T

Scientific history was m a d e in com- pleting the yeast genuine sequence, yet its 13 Mb are a m e r e starting point. Two challenges loom large: to deci- pher the function of all genes and to describe the workings of the eukary- otic cell in full molecular detail. A combination o f experimental and theoretical approaches will be brought to bear on these challenges. What will be next in yeast g e n o m e analysis from the point of view o f bioinformatics? C u r r e n t i n f o r m a t i o n status

Functional knowledge about yeast genes is already more advanced than one might have expected at the out- set of the sequencing effort. For an amazing 65% o f the approximately 6000 protein-encoding genes, w e already h a v e s o m e functional infor- mation. For about 30°/o of the total, occording to B. Dujon (this issue), functional knowledge is the result of direct experiment, but for a large frac- tion, about 350,6 of the total, functional information was derived by hom- ology transfer. Homology transfer of information exploits the evolutionary

Bi0informatics and the discovery of gene function

GEORG CASARI, ANTOINE DE DARUVAR, CHRIS SPaNDER* AND REhNHARD SCHN~EIDER

casari@embl-heidelberg.de daruvar@embl-heiddberg.de

sander~ernbl-ebi.ac.uk schneiderOernbl-heidelberg.de

EUROPEAN MOLECULAR BIOLOGY LABORATORY, EMBL-HD, D-69012, H~ola.mmG, GLR.~t~X~. *EUROPEAN BIOtr~TOP~tATICS INStal, EMBL-EBI, HINXTO~ HALL, C£~mRmGE, UK CB10 IRQ.

continuity in protein function and structure over very long time spans, apparent in our w o d d as the presence of similar genes in viruses, bacteria and eukaryotes. Technically, h o m - ology transfer b e c o m e s possible be- cause of the two pillars nf current bio- informatics: databases that capture the experimental knowledge about g e n e function in different organisms and search algorithms that permit the de- tailed comparison o f a n e w gene with all available database sequences. Whenever sequence similarity is de- tected at a le'.'el that clearly indicates functional e r stluctuml homology, information can be transferred from a gene of known function in one species to one c f unknown function ~n another species (or in the same species'). In some cases, the transferred information exquisitely describes die detailed bio- chemical and/or cellular function of a n e w gene, for example, that o f the yeast open reading frame (ORF) YCR14c on chromosome III (SWISS- PROT Accession No. P25615) as the functional cousin of the mammalian DNA polymerase 13 (Ref. 1). In other

l

Sequence Probable orphans, spudnus function ORFS un} Homol .o~, but lunetton unlcdlowrl

$.

cases, the p o w e r o f prediction is very limited, because of strong functional divergence or because the homology is limited to a sequence fragment; an example is the prediction of nucleic acid binding properties based o n the presence of a zinc finger motif.

In m a n y cases, the prediction of g e n e function by homology transfer can b e easily achieved using stand- ard database search tools. However, trained experts are n e e d e d to achieve a high level, in terms of quantity and quality, of derived functional infor- mation. The unexpectedly high cumu- lative value of 650,6 of functionally annotated protein sequences (Fig. 1 ) is the result o f a p p l ) ~ g an advanced soft- ware system, called GeneQuiz (Ref. 2), that encapsulates expert knowl- edge, combines a variety of analysis tools and relies on daily updates of the latest database information.

The approach to 100%

H o w m u c h is left to discover about gene function in yeast? "i~_e good news is that some information about gene function is available for an FmLr~ 1. How much information is available to date about yeast proteins? For as many as 650/0 of protein genes more-or-less detailed 'functional information' is available, either from direct experiment and/or from information transfer via "clear' evolution- ary relationships (homology). For more than 7% (contained in the 65%) a three-dimensional model ('3D model') is available, most built by homology modeling. Another 7% have 'tentative homology" to proteins with functional information (of v..hich we expect about 2-3% to survive scrutiny). 'Homology without functional information' is deafly detected for 14% of all pro- teins (these are called orphan pairs or families by B. Dulon, this issue), and only 12°£ are 'sequence orpham of unknown function' (called single orphans by B. Dujon), that is. have no sequence relative in the databases, of which 6-7% can be spurious (i.e. are not expressed as messenger nor translated into proteins). The numbers v,-ere derived in a GeneQuiz analysis of all yeast protein genes available in March 1996, about 85% of the total number of yeast proteins. More than 6060 genes (partially redundant, depending on the state of the March 1996 publidy available data. with 100% identities removed) were compared with 185688 protein sequences and 43B305 ESTs on a 64 processor computer configumtinn. Differences in the amount of derived functional information and the number of clear homologies relative to the accompanying article by B. Dujon are probably due to (1) dlffemnces in the yeast and search databases used, and (2) differences in processing the results of sequence analysis methods. The main furore bioinformatics challenges are to increase the quality and completeness of the deduced information, and to develop methods to discover more evolutionary relationships. The derived functional information and 3D models are available (htrp://www.sander.embl-heidelberg.de/genequiz).

"FIG JULY 1996 VOL. 12 No. 7 Ca,pyriFtjat ~1996 El~wierS~ience Ltd. All tlghts m ~ ' e d O168-9525/96;515.00

2 4 4

(2)

C o M I ~ , I E N ' F

estimated 70% o f all yeast protein gem-~ ( d e d u c e d frow. 65°/., as above, corrected for 6 -7% questionable ORFs (B. Dujon, this issue), n o n e o f which are expected to have clear homology to a database protein). The b a d n e w s is that this information is incomplete, to varying degrees, a n d ,so a sizeable fraction o f these "/00.6 deserve further study, experimentally a n d wifl~ bio- infonnatics tools, as d o die remaining 3&,~ o f genes of completely u n k n o w n function. O n s e c o n d thoughts, the news is actually not so b a d for those planning major efforts in functional analysis (S. Oliver, this issue)!

Interestingly, the n u r u b e r o f sequence o r p h a n s (those without a s e q u e n c e family o f homologs) is already very small - a b o u t 10% o f the total ( n u m b e r from Fig. 1, corrected for questionable ORFs a n d tentative homologies) - a n d will s o o n reach zero, as a result o f a large influx o f s e q u e n c e data in the near future, especially expressed s e q u e n c e tags (ESTs), as well as w o r m a n d h u m a n genomic sequences. This is excellent news for bioinformatics, as similarity searches using multiple s e q u e n c e alignments 3 c a n reach significantly further into the twilight zone o f homology, with a n increased prob- ability o f linking otherwise unavail- able functional information.

Whatever one's point o f view, there is a long w a y to g o before w e have a fairly complete description of the func- tion o f all yeast genes. Bioinformatics is faced with a n u m b e r o f technical challenges in the process: to improve the m e t h o d s for h o m o l o g y modeling in three dimensions [currently, models have b e e n built for 7% of yeast genes with h o m o l o g y to proteins o f k n o w n three-dimensional sUucture (G. Vriend a n d R. Schneider, unpublished)]; to

improve the quality o f honlology transfer b y detecting errors in data- bases, I~y developing quantitative methods for assessing functiotxal diver- gence based o n sequence divergence, a n d b y m o r e accurately extracting functional information from sequence families; to g o b e y o n d the current level o f h o m o l o g y detection, by improved m e t h o d s o f databas,~_ searches (e.g. those lyased o n profites derived from s e q u e n c e families or k n o w n three- dimensional structures5); a n d to de- velop futtber the direct prediction o f function from s e q u e n c e information, or at least the prediction o f functional class ( e . g . A . Goffeau's data a b o u t transmembrane segments in yeast proteins: see poster in this issue).

In meeting these challenges, bioinforntatics will help reduce the experimental effort n e e d e d to dis- cover g e n e function. Each bit of infor- mation gained, e a c h extra g e n e func- lion suggested b y h o m o l o g y will either avoid r e d u n d a n t experimental effort or, ideally, focus the experi- mental strategy o n to a smaller set o f possibilities. Conversely, each n e w experimental characterization o f g e n e function will, t h r o u g h the databases, s p r e a d informatinn to a n increasing n u m b e r of homologs. In this way, w e c a n look forward to considerable synergy between bioinform2tic,~ a n d experimental functional analysis. Far b e y o n d 1 0 0 %

Suppose the proiects for func- tional analysis o f yeast proteins were already completed in July 1996 a n d that w e h a d a n accurate functional assignment for each protein. Yes, w e would marvel at the detailed part list of the complicated machinery o f the eukaryotic cell. But w e would also ask m a ~ y n e w q u e s t i o n s Can w e

have a complete tist of all specific protein-pair interactions a n d higher complexes? What is the complete set o f cycles of interaction in the meta- bolic, regulatory and developmental p a t h w a y g H o w c a n functional con- cepts of molecular genetics a n d cell biology best be r n a p p e d o n to the properties of paglcipating molecules? \'(/hat are the stability properties o f the cell w h e n parts are d a m a g e d o r added? Which molecules facilitate evolutionary adaptation as external conditions change? And m a n y more. In the long ran, bioinformatics will contribute to the emergence of a n e w quantitative a n d predictive mol- ecular biology of the cell. The chal- lenge will b e to combine the knowl- e d g e o f all parts a n d interactions with n e w ideas, a n d to develop m e t h o d s to simulate, o n a computer, the detailed behavior o f cells. O n e w a y o r the other, tim genome sequence w ~ revolutionize cell biology.

Acknowledgements

The GeneQuiz software system for large sc~le sequence analysis was devel- oped in collaboration wifll M.A. Andrade, P. Bork, C. Ouzounis, M. Scharf, J. Tamames and A. Valencia. The order of authoL~ is alphabetical. We are grate- ful for tile use of SRS by T. Etzold, PHD by B. Rost, and WHATIF by G. Vriend. The March 1996 analysis of yeast ORFs was performed at Silicon Graphics, Supercompming Technology Centre, Cortaillod, SwitZerland, with the sup- port of P. Bremer, M. Schlenkrich and colleagues.

R e f e r e n c e s

I Bork, P. et al. (1992) Ps~tein ~ci.

1, 1677-1690

2 C.asari, G. et aL (1995) Nature 376, 647-648

3 Holm, L. and Sander, C. (1995)

TrendL~ Blot:hem. Sci. 20, 345~3.i7

rl~ ytott

T h e Yeast G e n o m e S e q u e n c e

A poster by Bernard Dujon and Andr~ Goffeau is included in

Trends in G e n e t i c s t h i s

month, which highlights some aspects

of the yeast g e n o m e project.

The poster includes information on the centres involved in

sequencing, gene density, the number of open reading

frames predicted to encode transmembrane domains, yeast

genes exhibiting sequence similarity to human disease genes

and more!

I

Références

Documents relatifs

We performed whole exome sequencing in patients with early-onset primary aldosteronism and identifed a de novo heterozygous c.71G>A/p.Gly24Asp mutation in the CLCN2 gene,

A new k-nearest neighbor based entropy estimator, that is efficient in high dimensions and in the presence of large non-uniformity, is proposed. The proposed idea relies on

Une nouvelle technique innovante, la radiofréquence (RF) guidée sous échoendoscopie, représente potentiellement une alternative nettement moins invasive pour le traitement radical

In 2000 and 2001, Perou and Sorlie defined five breast cancer molecular subtypes based on gene-expression profile homologies of an intrinsic gene list that included 427 unique

occurrence of thermal SS↔WS transitions only at low temperatures, resulting in ground WS state for most of the compounds at room temperature, which excludes possibility of SS→WS

Lessons learned from the study of glycans, and particularly the HA example presented here, highlight: (1) the need to employ multiple biophysical, bioanalytical, and

10 investigated silica shocked states using the adptative Erpenbeck procedure 11 where a succession of simulations in the isothermal-isobaric en- semble (NPT) is used to converge