The First Molecular - Joel B. Hagen - Data Mining in Proteomics

Joel B. Hagen

2. The First Molecular

Databases

63 The Origin and Early Reception of Sequence Databases

turned his attention to devising methods for sequencing nucleic acids (for which he won a second Nobel Prize). Nonetheless, Sanger’s comparative approach marked the beginning of molecu-lar databases as other protein biochemists began informally col-lecting amino acid sequences as a means for studying molecular structure and function (8). Although sequences were important, Sanger probably would not have considered collecting them to be a serious scientific activity (9). For him, science meant experimen-tation at the bench. Ironically, although his research was the point of departure for larger, more comprehensive sequence databases, Sanger’s philosophy of science minimized the scientific status of collecting sequences and managing databases (8–10).

During the late 1950s and early 1960s, several laboratories worked to determine the amino acid sequences of other proteins, notably glucagon, ribonuclease, cytochrome c, and hemoglobin.

Looking back, Emil Smith (57) described the two decades fol-lowing World War II as the “heroic period” of protein chemistry.

Sanger’s work epitomized this image of a master chemist deci-phering the primary structure of a complex molecule. However, the Edman degradation reaction quickly replaced Sanger’s less elegant methods, and the entire sequencing process became fully automated by 1970. At the same time, research on DNA was eclipsing protein studies – both in prestige and funding (12). As Smith ruefully acknowledged, sequencing proteins, which a decade earlier was worthy of a Nobel Prize, had become a routine task suitable for a competent laboratory technician. By today’s standards, the number of known protein sequences was miniscule in 1970. However, keeping track of slightly more than 1,000 sequences was a serious challenge that posed both technical and vexing social problems for scientists who tried to create and man-age the first comprehensive molecular databases.

Together with her colleagues at the National Biomedical Research Foundation (NBRF), Margaret Dayhoff gathered all of the known protein sequences in a book published in 1965: The Atlas of Protein Sequence and Structure (53). The collection, which can be viewed as the first attempt to create a comprehensive molecular database, included the amino acid sequences of about 70 proteins from various species; mostly variants of hemoglobin, cytochrome c, and fibrinopeptides. Dayhoff’s small team laboriously collected these sequences from the published literature, although Dayhoff also encouraged biochemists to submit unpublished sequences. Finding sequences was not easy. Indeed, one of the justifications for creating the Atlas was the troublesome and time-consuming literature searches that individual scientists had to undertake to find sequences for their research. Although Dayhoff’s team used the newly computerized Medical Literature Analysis and Retrieval System (MEDLARS) established in 1964 at the National Institutes of Health, they complained about the difficulty of locating all of the relevant literature (13). For example, the

64 Hagen

term “sequence” was not widely used as a keyword during the mid-1960s. Managing the collection was also a time-consuming task. Dayhoff’s team used computers to store the molecular data, but they had to manually type and proofread each sequence.

There was no efficient way of sharing data, although later editions of the Atlas were available on magnetic tape. Thus, from the very beginning, database managers faced familiar problems: how to organize, catalog, and distribute overwhelmingly large amounts of data (8, 9). When the second edition of the Atlas doubled in size in 1966, the editors described the influx of new sequences as an “information explosion” (13). Subsequent editions continued to grow rapidly. Still, by the end of the decade the collection con-tained only about 1,000 sequences.¹

The NBRF, which published the Atlas, was on the cutting edge of using computers in biology and medicine. The director, Robert Ledley, was a leading advocate for computational biology and was writing a general survey, Use of Computers in Biology and Medicine (56). Dayhoff’s career spanned the transition from an older tradition of electrical and mechanical calculators to the early mainframe computers. During the late 1940s, before digital com-puters were available, she had used IBM punched card business machines to calculate the resonance energies of polycyclic organic molecules for her Ph.D. dissertation in quantum chemistry at Columbia University (14, 15). After fellowships at Rockefeller University and the University of Maryland, Dayhoff was hired by Ledley to write FORTRAN programs for mainframe computers to aid in determining the amino acid sequences of protein mole-cules. This work led directly to collecting sequences and using them for evolutionary research.² Thus, although the NBRF was a biomedical center and Dayhoff used the potential medical bene-fits of her research as a justification for publishing the Atlas, most of her research with protein sequences dealt with evolutionary questions that had no immediate medical applications.

Despite its small size, the Atlas proved to be more than a col-lection of data. Dayhoff and her colleagues at the NBRF also used the book as a vehicle for reporting their diverse research using computers to study proteins. For Dayhoff, computer program-ming went hand-in-hand with building databases. Her early work with the Atlas marked two important transitions in computational biology. First, the growing number of sequences allowed a more comprehensive comparative analysis of proteins than had been possible for Sanger and other biochemists with the handful of sequences available just a few years earlier. For example, the small electron transport protein cytochrome c turned out to be rela-tively easy to sequence (16) and the second edition of the Atlas listed the sequences from eighteen species, including bacteria, fungi, plants, and a wide variety of animals. This growing collec-tion of data allowed Dayhoff to explore methods of aligning

65 The Origin and Early Reception of Sequence Databases

sequences, developing statistical models for amino acid substitutions, and building phylogenetic trees showing evolutionary relation-ships among the proteins and species that contained them. The second important transition was from hand calculations to computer-assisted analysis. Even before complete sequences were available, biochemists had used similarities and differences in amino acid sequences to align homologous proteins and infer phylogenetic relationships (17–19). These early phylogenetic trees were intui-tively constructed on the basis of hypothetical stepwise substitu-tions of one amino acid for another. Usually, no attempt was made to weight substitutions based on the number of mutations or frequency with which they occurred. Using the data from the Atlas and the calculating power of a mainframe computer, Dayhoff and her colleagues were able to develop probabilistic models of amino acid substitutions. This approach eventually led Dayhoff to develop what she referred to as the Point Accepted Mutation (PAM) matrix, an innovation that formed an early foundation for later approaches for aligning sequences and searching for homol-ogies among proteins. But even in 1966, Dayhoff and her col-league Richard Eck used their earliest substitution models to contruct phylogenetic trees based on cytochrome c (13, 20).

Their computational technique also allowed inferences about ancestral sequences at the nodes of the tree and estimations of the times of divergence.

Although biochemists recognized that Dayhoff’s Atlas of Protein Sequence and Structure was an important innovation, its scientific status remained ambiguous. From the perspective of experimental biochemists, collecting sequences that other scien-tists had discovered was not fundamental research, but was more akin to natural history collecting or editorial work (8–10). The fact that Dayhoff and most of her colleagues were women undoubtedly contributed to the perception that building sequence databases was not science.³ Using computers to generate phylo-genetic trees based on amino acid sequences was a novelty that fell outside mainstream biochemistry and traditional evolutionary biology (21). Although Dayhoff predicted that this new type of research would have practical benefits for biomedical sciences, she could provide few concrete examples of what such an evolution-ary medicine might contribute to society. The novelty and ambi-guity of this new computational research also led some biochemists to question the importance of Dayhoff’s work (8, 9). The prob-lem of establishing a scientific niche for databases also had impor-tant consequences for funding the Atlas. Collecting the sequences was originally supported as part of a grant to Dayhoff from NIH to develop computational methods for sequence analysis. Other agencies also indirectly supported the database through grants for Dayhoff’s studies on molecular evolution. However, as the num-ber of new sequences rapidly increased, the cost of collecting,

66 Hagen

proofreading, and entering the sequences to the database escalated.

NIH became increasingly unwilling to support this part of Dayhoff’s work and repeatedly threatened to terminate the project.

This funding problem was compounded by Dayhoff’s expec-tation that biochemists would submit unpublished sequences to the Atlas. At first, she used a reward system where contributors received free copies of the Atlas in exchange for submitting sequences. However, because of lack of financial support from funding agencies Dayhoff had to sell later editions of the book.

Dayhoff’s attitude toward sequences was reminiscent of the natu-ral history tradition of collecting specimens, but it violated well-entrenched attitudes toward privacy, priority, and intellectual property typical of the experimental sciences (9, 10). For bio-chemists, a sequence was private property until it was published and priority was assigned to the discoverer. Critics charged that Dayhoff was making unfair use of unpublished sequence data for her own research, while at the same time selling the Atlas as a commercial venture (8, 9, 22).

David Lipman, Director of the National Center for Biotechnology Information, has famously described Dayhoff as both “the mother and father of bioinformatics” (23). However, in the 1960s bioin-formatics did not yet exist, and using computational methods to investigate evolutionary questions lay at the fringes of other well-established disciplines. Indeed, the use of computers and data-bases for evolutionary research was not even discussed in general surveys of computational biology, such as Ledley’s Use of Computers in Biology and Medicine – this despite the fact that Ledley was the director of the NBRF where Dayhoff worked.

Combining perspectives from evolutionary biology, protein bio-chemistry, and the biomedical sciences would eventually grow into successful lines of research that we now identify as bioinfor-matics and molecular evolution, but both historians and partici-pants have documented the difficulty of forging these disciplinary identities (22, 24). In order to do so, scientists needed not only large databases and powerful computers, but also a compelling intellectual rationale that would set the new computational biol-ogy apart from well-established disciplines, such as comparative biochemistry and traditional evolutionary biology.

Shortly after publishing the structure of DNA, Francis Crick provided an early justification for studying the evolutionary impli-cations of protein sequences. Reflecting on Sanger’s successful sequencing of insulin, Crick wrote: “Biologists should realize that

3. The Problems

Dans le document Data Mining in Proteomics (Page 75-79)