If you are a biomedical researcher, have you ever used protein databases like UniProt to get information about proteins that you are interested in? Do you know how that database got there? I don’t mean today, I mean decades ago: how did a resource like this come to exist at all? When researchers search a protein database or align amino acid sequences, they’ll frequently come across a name that helped start it all years ago. Margaret Dayhoff was one of the people who pioneered this crucial functionality, a true founder of the field of bioinformatics. But in some histories and timelines of bioinformatics she barely gets a mention. To celebrate Ada Lovelace Day, I’m going to introduce you to Dr. Dayhoff, and I hope to raise awareness of her fundamental contributions to the field.
Because we can access all the protein information we can stand with a few keystrokes today, it is easy to forget that this data 1) didn’t always exist, and 2) when it did exist, it wasn’t easy to find and work with. In the 1960s, only a handful of protein sequences were known. But it was clear that more of this data would be incredibly useful in a number of ways, and was certainly going to be generated at an ever faster pace. And soon it would overwhelm any one person’s ability to analyze and retain it. DNA sequences…don’t even go there yet….
But there were some prepared minds ready to begin thinking about these data and the associated opportunities around them. They were also aware that computers might help with these problems. Robert Ledley was one of them. Ledley had trained as a dentist, but obtained a degree in physics and became increasingly interested in the possibilities of applying computational resources to biomedical problems. A report authored by Ledley is one of the earliest studies of biomedical computation, and can be viewed on Google Books today.
Working with Ledley at the National Biomedical Research Foundation was a woman named Margaret Dayhoff. With an undergraduate degree in mathematics and graduate studies in chemistry, Dayhoff had pioneered work with punch cards and data processing machines to evaluate molecular resonance energies of organic molecules. She obtained a Watson Computing Laboratory Fellowship to pursue the work to complete her PhD, which is described by a biographer as:
The process was iterative and required manually carrying cards from one type of machine to another (4 types), as no single machine could do the whole iteration. Convergence was slow and several months could be required for a result.
I imagine she was using machines similar to the antiques we can see in an article contemporary with Dayhoff’s fellowship, wherein Miss Eleanor Krawitz, Tabulating Supervisor, offers a tour of the punch cards and the processes in the Columbia Engineering Quarterly in 1949. (That article also notes that Miss Krawitz was “the first feminine author to contribute to the COLUMBIA ENGINEERING QUARTERLY.”)
So Dayhoff was someone who had understood and actually used “automatic computing methods and equipment” to generate data (Krawitz). Paired with Ledley, she had the opportunity to move the work to protein analysis. In 1962, Dayhoff and Ledley wrote:
In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.
The IBM 7090 can be viewed on the web in a number of places. It looks like something out of a sci-fi movie: a monstrous collection of metal cabinets with spinning tape reels. But at least it had transistors instead of vacuum tubes at this point. And it worked.
The program that Dayhoff and Ledley described was called COMPROTEIN. It was actually a “programming system” comprising six individual programs: MAXLAP, MERGE, PEPT, SEARCH, QLIST, and LOGRED. The paper offers the theoretical framework for assembling protein chain data from peptide digests, and provides typewritten flow diagrams to explain each one of the individual programs. It is almost excruciating to read at this point because it all seems so basic. And to know that it would take so long to actually generate and run them makes my head hurt….
The idea was conceived by us in 1958, but actual programming was not initiated until late 1960.
And the paper was published in 1962. Egads, I could teach myself enough Perl to do this in a weekend now.
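To give a rough flavor of the problem COMPROTEIN tackled, here is a minimal Python sketch of piecing a chain back together from overlapping peptide fragments. It is not a reconstruction of the 1962 programs: the greedy strategy, the function names `overlap` and `assemble`, and the toy fragments are all assumptions made for illustration.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(fragments: list[str]) -> str:
    """Greedily merge the pair of fragments with the largest overlap.

    Hypothetical illustration only; real digest data is ambiguous and
    noisy, and a greedy merge can pick the wrong join.
    """
    unique = sorted(set(fragments))
    # Drop fragments that are fully contained in a longer fragment.
    frags = [f for f in unique
             if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    while len(frags) > 1:
        best_n, best_i, best_j = 0, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        # With no overlap left (best_n == 0) this simply concatenates.
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags)
                 if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


# Two made-up "digests" of the same made-up 16-residue chain.
digest_a = ["ACDEFG", "HIKLMN", "PQRS"]
digest_b = ["ACD", "EFGHIK", "LMNPQRS"]
print(assemble(digest_a + digest_b))  # ACDEFGHIKLMNPQRS
```

The toy chain has no repeated residues, so every overlap is unambiguous. Real peptides repeat residues constantly, which is a large part of why the original problem was hard.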
But I know, it wasn’t easy, and I don’t mean to suggest that. And it was HUGELY important work. It formed the major foundation for everything I do every day now in bioinformatics. The end of this COMPROTEIN paper says:
Just as the proteins are composed of chains of the same types of molecules, the genetic substances desoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are composed of chains of only 4 different types of molecules called the nucleotide bases. It is possible that the order of the molecules in these substances can also be determined by the aid of this computer program and some computer experiments in this direction have been made. However, application of these techniques to DNA and RNA still awaits further development in the chemical experimental methods.
I know Margaret would have loved next-gen sequencing, the high-throughput, high-volume, huge data generating capacity we have today.
But this was only the beginning of her work in bioinformatics. You may be familiar with her one-letter code for amino acids that required less punch card punching. Dayhoff used computers to develop algorithms and analyze the protein sequences she had available and made huge strides in understanding evolutionary relationships. She created scoring methods and matrices that are still foundational in this field, and if you do sequence alignments you may see her name in the output! She was enormously respected for this work, and was supported by the granting agencies for it.
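For readers who have never seen these ideas spelled out, a small Python sketch may help. The three-letter to one-letter table below is the standard amino acid code. The scoring table is a different matter: `TOY_SCORES` and the default mismatch penalty are made-up numbers for illustration, not values from any of Dayhoff's matrices, and the function names are hypothetical.

```python
# Standard one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter: str) -> str:
    """Convert a hyphen-separated three-letter sequence to one-letter code."""
    return "".join(THREE_TO_ONE[res] for res in three_letter.split("-"))


# Toy substitution scores (assumed values, NOT a real PAM matrix).
# Keys are alphabetically sorted residue pairs, so each pair appears once.
TOY_SCORES = {
    ("I", "I"): 5, ("K", "K"): 5, ("L", "L"): 6,
    ("I", "L"): 2, ("I", "K"): -2, ("K", "L"): -3,
}


def pair_score(a: str, b: str, mismatch: int = -4) -> int:
    """Look up a symmetric pair score, falling back to a flat penalty."""
    return TOY_SCORES.get(tuple(sorted((a, b))), mismatch)


def alignment_score(s: str, t: str) -> int:
    """Sum pair scores over an ungapped alignment of equal-length strings."""
    if len(s) != len(t):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(s, t))


print(to_one_letter("Gly-Ile-Val-Glu-Gln"))  # GIVEQ
print(alignment_score("LIK", "LLK"))         # 13
```

The point of a substitution matrix is visible even in the toy version: swapping isoleucine for leucine still scores positively, while swapping either for lysine is penalized, so similar sequences score higher than unrelated ones.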
There was a separate aspect of her work, though, that was less well supported by funding groups. She began to collect and publish regularly the Atlas of Protein Sequence and Structure books. The first edition contained 65 sequences. It seems that funding agencies were not keen on funding work that some perceived as “stamp collecting” rather than experimentation. The atlas morphed into a database that Dayhoff made available by subscription in order to support this work. However, this subscription aspect created tension among biomedical researchers who thought that since the protein sequences were freely available, charging for a database was unwarranted.
Bruno Strasser’s study of this period is a fascinating look (pdf) at the history, attitudes, and framework in which this all occurred. At a talk for the Anniversary of GenBank, Strasser explored the visionary work of Dayhoff, the database she established, other parallel database development projects in molecular biology, and the tension around the value of database curation.
(http://videocast.nih.gov/Summary.asp?File=14412 Strasser’s talk begins at approximately 1:09 and ends around 1:45. You can drag the progress bar to get to the right place and start to watch.)
From the paper and the talk, we hear Dayhoff speak to the importance of the work she was doing:
As she explained to a colleague: “There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.”
(Dayhoff 1967, from Strasser pg. 111)
I would encourage you to watch the video where Strasser explains this, and read the companion paper—it is a fascinating look at a time that established the world of bioinformatics as we know it today. It presages and informs much of the battle around open source software and data as we know it. After I learned these details about this period, my understanding of the framework and discussions of the open source world in which we find ourselves today became much deeper. I discovered in an obituary that the “stable, adequate, long-term funding” for PIR (the direct descendant of the Atlas) came through finally a few months after she died.
If you use PIR (the Protein Information Resource) today, or UniProt, or any of a number of other databases and analysis tools for sequence comparisons, or if you rely on biomedical research for your health and well-being, you should appreciate the life of Margaret Oakley Dayhoff as well.
I’ll let Margaret Dayhoff close with this, and I wish I could tell her how important her link in the chain was to me:
We sift over our fingers the first grains of this great outpouring of information and say to ourselves that the world be helped by it. The Atlas is one small link in the chain from biochemistry and mathematics to sociology and medicine.
(Dayhoff 1968, from Strasser pg. 112)
• Dayhoff, M. O. and G. E. Kimball. Punched Card Calculation of Resonance Energies. J. Chem. Phys. 17, 706-717. Ph.D. Thesis, Columbia University, Graduate School of Chemistry, 1949. DOI:10.1063/1.1747374
• Dayhoff, M. O. and R. S. Ledley. Comprotein: A Computer Program to Aid Primary Protein Structure Determination. In Proceedings of the Fall Joint Computer Conference, 1962, 262-274. Santa Monica, CA: American Federation of Information Processing Societies, 1962. http://portal.acm.org/citation.cfm?id=1461546
• Dayhoff, M. O. 1965. Computer aids to protein sequence determination. J. Theor. Biol. 8: 97-112. doi:10.1016/0022-5193(65)90096-2
• Krawitz, E. The Watson Scientific Computing Laboratory: A Center for Scientific Research Using Calculating Machines. Columbia Engineering Quarterly, November 1949. http://www.columbia.edu/acis/history/krawitz/index.html
• Ledley, R.S. Report on the Use of Computers in Biology and Medicine. National Research Council (U.S.). Advisory Committee on Electronic Computers in Biology and Medicine, National Research Council (U.S.). Division of Medical Sciences. Published by National Academy of Sciences – National Research Council, 1960. http://books.google.com/books?id=J5grAAAAYAAJ&output=html
• Strasser, B.J. “Collecting and Experimenting: The moral economies of biological research, 1960s-1980s.”, Preprints of the Max-Planck Institute for the History of Science, 310, 105-23. 2006.
http://www.yale.edu/history/faculty/materials/strasser-mpi-2006.pdf EDIT: new location of this PDF: http://biologie.unige.ch/assets/brunostrasser/Strasser_MPI_2006.pdf
http://www.dayhoff.cc/ Dr. Margaret Oakley Dayhoff — Pioneer in Bioinformatics; has more extensive bibliographies and biographical information. And family photos.
http://www.springerlink.com/content/9w1118639vl11603/ Margaret Oakley Dayhoff 1925-1983
http://books.google.com/books?id=J5grAAAAYAAJ&pg=PP1&output=html Use of computers in biology and medicine report.
http://en.wikipedia.org/wiki/Punch_cards Punch card image
http://www.columbia.edu/acis/history/krawitz/index.html punch card machines
http://www.yale.edu/history/faculty/strasser.html Bruno Strasser homepage
http://videocast.nih.gov/Summary.asp?File=14412 GenBank Anniversary talks
http://www.biology.arizona.edu/biochemistry/problem_sets/aa/Dayhoff.html One letter code by Dayhoff
http://www.molecularevolution.org/mbl/resources/models/aamodels.php More on matrices and substitutions
http://www.inf.ethz.ch/personal/gonnet/DarwinManual/node146.html More on matrices and substitutions
To see the Mash Up of Ada Lovelace Day posts by location, topic, or as a list go here: http://ada.pint.org.uk/