The Challenges of Using Computers

From the document Data Mining in Proteomics (pages 82-92)

Joel B. Hagen

4. The Challenges of Using Computers in Molecular Evolution

early scientific reputation by writing a computer program to reconstruct the phylogeny of various plants, animals, and fungi using amino acid sequences of cytochrome c (42). Although published after Dayhoff’s similar attempt to reconstruct phylogeny using cytochrome c, the article became a citation classic (43).

Publishing the work in the high-profile journal Science ensured that Fitch’s computational approach would reach a very broad audience. The study quickly became a textbook example of how to deduce evolutionary history using amino acid sequences, and it propelled Fitch on a career path that relied heavily on computational methods to solve evolutionary problems (43, 44).

Fitch had earned his Ph.D. in comparative biochemistry and was an assistant professor in physiological chemistry at the University of Wisconsin when he began collaborating with Emanuel Margoliash on the cytochrome c project. Margoliash was one of the pioneers of protein sequencing and had elucidated the first cytochrome c sequence from horses. By the time he met Fitch in 1966, Margoliash had an informal collection of twenty cytochrome c variants available for the phylogenetic analysis – including ten unpublished sequences (43). Fitch described this new data set as a “windfall” for his plan to use a computer to generate phylogenetic trees.

Margoliash had very broad biochemical interests, and he was pursuing an ambitious research program on the structure, function, and evolution of cytochrome c. Also collaborating with Margoliash was Richard Dickerson, who had just completed a low-resolution X-ray diffraction analysis of cytochrome c. Margoliash, Fitch, and Dickerson (45) were confident that their combined evolutionary perspectives could unify traditional biochemistry and the new information-based approach championed by Zuckerkandl and Pauling: “Molecular evolution is thus likely both to bridge the gulf between the informational and structural areas of knowledge and to provide a fascinating frontier.” Yet, when the three scientists attempted to use sequence data to understand three-dimensional structure and function, they encountered a paradox. Cytochrome c from different species varied considerably in primary structure, yet molecules from fungi, plants, and invertebrates reacted identically with mammalian cytochrome oxidase in vitro. All of the various sequences apparently folded into precisely the same three-dimensional conformation. Although comparisons of cytochrome sequences from different species provided important hints about the structure and function of the molecule, Dickerson needed models of both the oxidized and reduced forms of cytochrome c to understand how this “molecular machine” worked (46). The collaboration between Dickerson and Fitch was short-lived, and by 1970 Fitch moved decisively toward computational studies of evolutionary mechanisms that largely ignored questions of structure and function.

71 The Origin and Early Reception of Sequence Databases

The abortive partnership of Margoliash, Fitch, and Dickerson highlights the difficulties of bridging the differences between a well-established experimental biochemistry and a new molecular evolution based heavily on the idea of molecular information. By the end of the 1960s, sequencing proteins was relatively easy and sequence databases were growing rapidly, but X-ray diffraction studies remained arduous and time-consuming. Thus, although scientists like Dickerson were confident that there was a direct causal chain linking primary structure with the three-dimensional shape and function of a protein, technical limitations prevented this from becoming a central focus of research using sequence databases. At the same time, molecular evolution was coalescing around a core of problems that were readily amenable to study using the growing body of amino acid sequences and digital computers. For molecular evolutionists, “molecular information” increasingly meant information about how proteins evolved, not how they worked within the cell. Thus, establishing a new disciplinary identity for molecular evolution meant emphasizing the differences between informational and structural approaches.

Given the success of his first attempt to use a computer to reconstruct phylogenies based on amino acid sequences of cytochrome c, it is perhaps not surprising that Fitch continued this fruitful line of research. However, moving from biochemistry to molecular evolution and systematics had important consequences for his career. Traditional evolutionary biologists had their own approaches to reconstructing phylogeny. Indeed, the 1960s was a decade of extreme ferment in systematic biology pitting rival schools of evolutionary taxonomists, numerical taxonomists, and cladists (24, 47). Because his research forced him to interact closely with these competing groups, Fitch had to worry about evolutionary concepts and methods that were of little concern to biochemists like Margoliash or Dickerson. For Margoliash (38), the differences between different algorithms for constructing phylogenetic trees were not crucial because they all produced the same general results, but Fitch soon learned that there were important philosophical implications of different computational approaches. For example, although many biochemists used homology as a synonym for similarity of amino acid sequences, evolutionary biologists (particularly cladists) defined homology as descent from a common ancestor. Because he began publishing articles in Systematic Zoology and other journals read by evolutionary biologists and systematists, Fitch had to take this distinction seriously. He significantly modified his tree-building algorithms to reconstruct ancestral sequences at the nodes of the tree (41). Detecting molecular homologies became not only an interesting computational problem for Fitch, but also one with important conceptual and philosophical implications that he could not ignore if his work was to be taken seriously by other systematists (43, 44, 48).

Russell Doolittle was another biochemist whose reputation became closely linked with the use of computers and sequence databases. However, his experiences during the 1960s were different from Fitch’s in a number of important ways. Doolittle became interested in evolution while he was completing his Ph.D. in biochemistry at Harvard University (22, 49). His dissertation research involved comparisons of the blood-clotting mechanism in various vertebrates (50). Using thrombin from lampreys, Doolittle compared its effect on clotting rates when combined with fibrinogen from cows and lampreys. Thrombin removes a piece of the fibrinogen molecule to produce the active clotting protein, fibrin. Doolittle used paper electrophoresis to separate the small fibrinopeptides that were removed from fibrinogen during this process. Although he did not determine the amino acid sequences of the fibrinopeptides, Doolittle was able to estimate the amino acid composition of the molecules.

While on a postdoctoral fellowship in the laboratory of Birger Blombäck at the Karolinska Institute in Stockholm, Doolittle learned how to use the Edman degradation reaction to sequence fibrinopeptides. Doolittle and Blombäck quickly built up an informal database of fibrinopeptide sequences from a variety of mammals. Sequences from different species varied in length, but Doolittle and Blombäck used invariant amino acids as “alignment markers” to identify regions of the molecules resulting from insertions or deletions (17). Comparing the aligned peptides, Doolittle and Blombäck proposed a stepwise evolutionary process leading to a branching phylogenetic tree. Constructing the tree intuitively and without the use of a computer, Doolittle and Blombäck followed an informal method of tree building used earlier by Vernon Ingram (18) and Zuckerkandl and Pauling (19) to reconstruct the evolution of the various globin polypeptides. Doolittle and Blombäck used their sequences to hypothesize the evolutionary relationships among several cloven-hoofed mammals (artiodactyls) from which they sampled the fibrinopeptides. The results, some of which contradicted well-established phylogenetic relationships, were controversial. For example, the simplest phylogenetic tree based on fibrinopeptides suggested that goats and sheep were more closely related to reindeer than to cows.

George Gaylord Simpson, a leading evolutionary biologist and the foremost expert on mammalian paleontology and taxonomy, was highly critical of this evolutionary claim. In their article, Doolittle and Blombäck (17) acknowledged that their simplest fibrinopeptide tree was contradicted by “a very large body of biological evidence,” and they cited personal communication with Simpson. Simpson’s letter to Doolittle provides a detailed critique of the biochemists’ evolutionary and taxonomic claims and of the use of fibrinopeptides for phylogenetic reconstruction more generally.

The disagreement between Simpson and Doolittle involved an empirical question open to testing and refutation. However, in the context of the 1960s, the hypothesis testing was embedded in a broader debate among evolutionary biologists about the validity of new molecular techniques (21, 27). Simpson actively engaged molecular evolutionists at meetings and in publications in a critique that he characterized as a “clarifying confrontation.” This confrontation involved philosophical commitments as well as purely scientific issues. Molecular evolutionists who viewed proteins as “documents of evolutionary history” often argued that protein sequences had a privileged status that set them apart from other biological characteristics (25, 31, 32). Because fibrinopeptides accumulated mutations quite rapidly, Doolittle and Blombäck were confident that their method could accurately reconstruct the phylogenetic history of a group of very closely related mammals. Conversely, Simpson and other organismal evolutionists argued that molecular data should carry no more weight than paleontological, morphological, and other forms of evidence. Because he believed that the bulk of the evidence contradicted some of Doolittle and Blombäck’s hypotheses, Simpson rejected their claims and called into question the usefulness of fibrinopeptides for phylogenetic studies of artiodactyls. Thus, although the phylogenetic relationships among artiodactyls were an empirical question, resolving discrepancies partly depended upon competing philosophies of science. As a biochemist, Doolittle had little knowledge of the rich fossil record of artiodactyls. Therefore, he was arguing with experts in another discipline who not only disagreed with his specific claims, but who were also highly skeptical about the methodology that he employed. This had important practical implications because when Doolittle submitted two grant proposals to the National Science Foundation, Simpson was one of the reviewers. The critical reviews (held in the Simpson archives) reflect Simpson’s deep skepticism toward molecular evolution and the use of fibrinopeptide sequences as a method for understanding mammalian phylogeny.

After learning of the computational methods that Fitch had developed for reconstructing phylogenies using cytochrome c, Doolittle began using mainframe computers in his research during the late 1960s (22). Although he continued to use fibrinopeptides for phylogenetic analysis of the artiodactyls, Doolittle did not interact with systematists and organismal biologists to the extent that Fitch did. Doolittle did not publish his later phylogenetic articles in Systematic Zoology, but in biomedical journals that systematists or mammalogists would not have routinely read.

Important as it was, Doolittle’s phylogenetic research was only a small part of his broader program to understand the molecular basis of blood clotting in mammals. To this end, he continued to see himself as a traditional experimental biochemist – albeit one with a strong interest in using computers (22). This disciplinary identification contrasts with Fitch, who moved further away from his biochemical roots as he became increasingly involved with molecular evolution and systematics. The difference is highlighted by the way that the two scientists approached a common interest in molecular homology. Both Fitch and Doolittle used computers and sequence databases extensively to study homology, and Doolittle was later lionized for discovering unsuspected evolutionary relationships among seemingly unrelated proteins (11, 22, 51). However, because he was not interacting closely with organismal evolutionary biologists, Doolittle was less concerned than Fitch with precisely defining the concept and exploring the philosophical implications of homology.

The experiences of Dayhoff, Fitch, and Doolittle during the 1960s illustrate both the opportunities and the challenges that scientists confronted with the advent of protein sequence databases and mainframe computers. None of these scientists was trained in traditional evolutionary biology, but the availability of protein sequences propelled them on career trajectories that were strongly influenced by evolution. The variation in sequences raised compelling evolutionary questions and provided a means for investigating them. Although aligning sequences, searching for homologies, dating evolutionary events, and constructing phylogenetic trees had been done to a limited extent without computers, Dayhoff, Fitch, and Doolittle were at the forefront of efforts to develop a new computational biology. They did this without the benefit of formal training in computer programming, being largely self-taught. Noting the difficulty of getting computer scientists interested in his evolutionary research, Doolittle later described his early computer programming efforts as a “hobby” (22). Without the interactivity provided by the internet and personal computers, Dayhoff, Fitch, and Doolittle used mainframe computers located in centralized computing centers to lay important groundwork for what would become bioinformatics. Yet, even late in his career, Doolittle denied that he was ever a “bioinformatician” (22).

5. Conclusions

Disciplinary identity in an emerging area of research posed significant challenges for early computational biologists, as the careers of Dayhoff, Fitch, and Doolittle illustrate. Using novel techniques to study evolutionary questions at the fringes of well-established fields opened them to criticism for doing second-rate science or for applying inappropriate methods to study evolutionary questions. Eventually, the computational methods that Dayhoff, Fitch, and Doolittle pioneered became mainstream tools in molecular evolution, but in the 1960s molecular evolution was just beginning to take form. Computers only gradually became recognized as credible scientific instruments by biologists. Even in the late 1970s, computational biologists sometimes faced condescension from laboratory scientists who considered them “failed researchers” playing with computers (52). Today, online databases and powerful computers are such an integral part of modern biology that it is difficult to imagine doing research without the internet. However, this cutting-edge research rests on a foundation that extends back to the very beginning of the computer age.

6. Notes

1. The Atlas was published at irregular intervals between 1965 and 1978. It gave rise to the online database, The Protein Information Resource (PIR), established by the National Biomedical Research Foundation in 1984 at Georgetown University (8, 14, 15).

2. Dayhoff’s research interests spanned a very wide range of evolutionary questions, including the evolution and classification of proteins, the origins of life, and the thermodynamics and evolution of atmospheres on other planets (14, 15).

3. Dayhoff was acutely aware of the challenges facing women in science. After her death in 1983, the Biophysical Society (of which she was the first female president) established an award in her name to be given annually to an outstanding woman at the beginning of her research career (8, 14, 15).

4. Correspondence between Simpson and Doolittle, as well as Simpson’s reviews of grant proposals written by Doolittle, are part of the George Gaylord Simpson Papers at the American Philosophical Society library.

5. Both scientists and historians have emphasized the controversies and conflicts between traditional evolutionary biologists and molecular evolutionists. Real as these controversies were, it is equally important to note that many molecular evolutionists deeply respected the expertise of Simpson, Ernst Mayr, and other organismal biologists. An extensive correspondence with numerous molecular evolutionists can be found in the Simpson papers (21).

References

1. Wolfe KH, Li WH (2003) Molecular evolution meets the genomic revolution. Nat Genet Suppl 33:255–265

2. Kanehisa M, Bork P (2003) Bioinformatics in the post-sequence era. Nat Genet Suppl 33:305–310

3. Patterson SD, Aebersold RH (2003) Proteomics: the first decade and beyond. Nat Genet Suppl 33:311–323

4. de Chadarevian S (1996) Sequences, conformation, information: biochemists and molecular biologists in the 1950s. J Hist Biol 29:361–386

5. de Chadarevian S (1999) Protein sequencing and the making of molecular genetics. Trends Biochem Sci 24:203–206

6. Sanger F (1959) The chemistry of insulin. Science 129:1340–1344

7. Sanger F (1988) Sequences, sequences, sequences. Ann Rev Biochem 57:1–28

8. Strasser BJ (in press) Collecting, comparing, and computing sequences: the making of Margaret O. Dayhoff’s atlas of protein sequence and structure. J Hist Biol.

9. Strasser BJ (2006) Collecting and experimenting: the moral economies of biological research, 1960s–1980s. Preprints Max-Planck Inst Hist Sci 310:105–123

10. Strasser BJ (2008) GenBank – natural history in the 21st century? Science 322:537–538

11. Smith TF (1990) The history of the genetic sequence databases. Genomics 6:701–707

12. Schachman HK (1979) Summary remarks: a retrospect on proteins. In: Srinivasan PR, Fruton JS, Edsall JT (eds) The origins of modern biochemistry: a retrospect on proteins, vol 325. Annals of the New York Academy of Sciences, New York, pp 363–373

13. Eck RV, Dayhoff MO (1966) The atlas of protein sequence and structure 1966. National Biomedical Research Foundation, Silver Spring, MD

small protein become so popular?: a succinct account of the development of our understanding of cytochrome c. In: Scott RA, Mauk AG (eds) Cytochrome c: a multidisciplinary approach. University Science Books, Sausalito, CA

17. Doolittle RF, Blombäck B (1964) Amino acid sequence investigations of fibrinopeptides from various mammals: evolutionary implications. Nature 202:147–152

18. Ingram VM (1961) Gene evolution and the haemoglobins. Nature 189:704–708

21. Hagen JB (1999) Naturalists, molecular biologists, and the challenges of molecular evolution. J Hist Biol 32:321–341

22. Doolittle RF (2000) On the trail of protein sequences. Bioinformatics 16:24–33

23. Moody G (2004) Digital code of life: how bioinformatics is revolutionizing science, medicine, and business. Wiley, Hoboken, NJ

24. Hagen JB (2000) The origins of bioinformatics. Nat Rev Genet 1:231–236

25. Crick FHC (1958) On protein synthesis. Symp Soc Exp Biol 12:138–163

26. Aronson J (2002) Molecules and monkeys: George Gaylord Simpson and the challenge of molecular evolution. Hist Philos Life Sci 24:441–465

27. Dietrich MR (1998) Paradox and persuasion: negotiating the place of molecular evolution within evolutionary biology. J Hist Biol 31:85–111

28. Morgan GJ (1998) Emile Zuckerkandl, Linus Pauling and the molecular evolutionary clock, 1959–1965. J Hist Biol 31:155–178

29. Sommer M (2008) History in the gene: negotiations between molecular and organismal anthropology. J Hist Biol 41:473–528

30. Hagen JB (in press) Waiting for sequences: Morris Goodman, immunodiffusion experiments, and the origins of molecular anthropology. J Hist Biol.

31. Zuckerkandl E, Pauling L (1965) Evolutionary divergence and convergence in proteins. In: Bryson V, Vogel HJ (eds) Evolving genes and proteins. Academic Press, New York, pp 97–166

32. Zuckerkandl E, Pauling L (1965) Molecules as documents of evolutionary history. J Theor Biol 8:357–366

33. Strasser BJ (1999) Sickle cell anemia, a molecular disease. Science 286:1488–1490

34. Dietrich MR (1994) The origins of the neutral theory of molecular evolution. J Hist Biol 27:21–59

35. Kumar S (2005) Molecular clocks: four decades of evolution. Nat Rev Genet 6:654–662

36. Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, Cambridge

37. Suárez E, Barahona A (1996) The experimental roots of the neutral theory of molecular evolution. Hist Philos Life Sci 18:55–81

38. Margoliash E (1972) The molecular variation of cytochrome c as a function of the evolution of species. Harvey Lect 66:177–247

39. Hagen JB (2001) The introduction of computers into systematic research in the United States during the 1960s. Stud Hist Philos Biol Biomed Sci 32:291–314

40. Hagen JB (2003) The statistical frame of mind in systematic biology from quantitative zoology to biometry. J Hist Biol 36:353–384

41. Felsenstein J (2004) Inferring phylogenies. Sinauer, Sunderland, MA

42. Fitch WM, Margoliash E (1967) Construction of phylogenetic trees. Science 155:279–284

43. Fitch WM (1988) This week’s citation classic. Curr Contents 19(27):16

44. Fitch WM (1987) This week’s citation classic. Curr Contents 18(27):14

45. Margoliash E, Fitch WM, Dickerson RE (1968) Molecular expression of evolutionary phenomena in the primary and tertiary structures of cytochrome c. Structure, function, and evolution in proteins. Brookhaven Symp Biol 21(2):259–305

46. Dickerson RE, Geis I (1969) The structure and action of proteins. Harper & Row, New York

47. Hull DL (1988) Science as a process: an evolutionary account of the social and conceptual development of science. University of Chicago Press, Chicago

48. Fitch WM (2000) Homology: a personal view on some of the problems. Trends Genet 16(5):227–231

49. Doolittle RF (1997) A delicate balance. Boston Rev (February–March).

50. Doolittle RF, Oncley JL, Surgenor DM (1962) Species differences in the interaction of thrombin and fibrinogen. J Biol Chem 237:3123–3127

51. Doolittle RF (1997) Some reflections on the early days of sequence searching. J Mol Med
