Liveblogging the GenBank 25th Anniversary

I’m preparing to liveblog this event, internets permitting:

GenBank: Celebrating 25 years of Service at NCBI: official announcement.

The agenda is here:

Not being a married person, I didn’t know which one this was. I had to look it up. This is Silver. I can’t think of a decent gift, so I’m not bringing one. Maybe they are registered somewhere??

There is a link to a videocast of the event from the Celebration link, supposedly:

View event: You will be able to view the event at when the event is live.
Air date: Monday, April 07, 2008, 9:00:00 AM

Will try to update as often as I can, if I have decent wireless and power.

Welcome remarks

Michael Gottesman: GenBank one of the major accomplishments of the NIH. Major reasons for success: 1. timely, visionary idea. Already a protein seq database (Dayhoff), need for nucleotides as well. 2. International cooperation from the beginning. Support from other US organisations as well. Stable foundation at NIH has been important. 3. Contributions of researchers providing the data have been a third key. 4. Technology improvements in sequencing and comparison algorithms. 5. Move from contract basis to NCBI/NLM provided stable support.

Don Lindberg: Personal salute and remembrance: A glorious achievement. Field took a major turn when Dulbecco letter to Science appeared 1986. Became part of NLM long range plan after Lindberg arrives at NLM. Maxam describes how important the sequence data is to the planning committee–they realize this is a library problem. Recommendation to create a public mandate as a library problem, needs to include international effort. “Talking One Genetic Language” document was key to getting senate support for creating NCBI (Pepper in House, Kennedy in Senate introduce bill. Passes). NLM goes to Los Alamos 1986-1988. IRX method to search PIR, OMIM, sequences. Report: Unlocking the Mystery of Disease by Pepper was very helpful. Reagan signed the bill. Authorized NCBI, $8M and issued 12 FTEs. Took a while to get data really in and accessible. They used to keyboard enter the sequence letter by letter from papers, nucleotide and amino acid translation….Really the beginning of the public depositing of these sequences for everyone to use. A real turning point in science.

Francis Ouellette: got started reviewing these new electronic submission records that were coming in; was told it would be 20-30min a day….turned out to be much more :)

Rich Roberts: got involved after his work on Rebase (enzyme db). 1979: Rockefeller meeting important to get the right minds focused on mobilizing NIH. Was asked to be on committee to write RFP for what GenBank should be. Still thinks a major role of GenBank is as an archive. Later people can change interpretation–but the archive should not be changed. GenBank should be a true archive of what is originally published. Got on many committees through the years in advisory capacity. Importance of sharing data, sharing information. Repositories key for making new connections. Describes work on Rebase–first list of known restriction enzymes. Got many requests for this list, sent out paper copies. Was classified by KGB at one point :) Was typed, retyped, late 70s memory typewriters help…eventually gets in real db forms. But print was still important–Handbook of Molec Bio. Vol II, 532. 1984: NAR publishes sequence collection that is important–Sequence Supplement. NAR has been incredibly important in getting this data out and shared. 1984-92 Sequence supplements are published. Pushed for requiring accession number from GenBank for publication as a publication norm. Important principle of data sharing. 1993: Annual DB issue begins, Roberts initiative: 1993 = 24 dbs. First article was a description of GenBank. 1000+ now. 2005: NAR becomes first journal to move to Open Access model from subscription. Fought long and hard, threatened to resign as editor, but Oxford agreed eventually. Journal still profitable. Author costs nearly pay for production at this point. Open Access to literature is important and inevitable. 2001: PNAS article GenBank of the Published Literature. Rounded up laureates letter for Open Access. You can choose where you are going to publish, you need to think about where to submit–encourages open access journals. He works for a small company, can’t pay for access to all the journals he wants to read. Kids doing science projects need this. Library and science budgets disappearing. Small colleges lack access. 3rd world inaccessible. Everybody should have access. Gives background on restriction modification types and genes. Much known, but still requires much biochemistry. Sequence is great–but need to invest in biochemistry still (hint to funding agencies). Clues available based on sequencing data from shotgun sequencing. Needed to go back to look at original data from sequencing projects to look for missing sequences that might be restriction/modification enzymes that weren’t clonable in E. coli. There are gaps where there are active restriction enzyme genes, predicted and found. McaT1 methyl transferase nearby, looking next to them is a good place to hunt. Find enzyme with behaviors same as BssHII, but new enzyme has no seq similarity to Bss. New use for shotgun sequence data. Final thoughts: open access to data and literature is a great equalizer, lets us know what we know and what we don’t know; open access enables new discoveries. 12 month delay (new NIH standard) too long. We must empower young investigators–it is becoming very hard for adventurous young scientists to get grants today. System must be changed. Biggest grants should go to young people–smaller for old scientists. Nobel-type discoveries often by scientists under 40.

Bruno Strasser: (historian from Yale) “Property, Privacy and Priority: A History of GenBank 1965-1982.” Humans love collecting stuff. June 30, 1982 birthday of the contract. Focus on the historical aspect of collecting in late 20th century biomedical research. GenBank as a _way_ of doing science. Early tensions were property, privacy, priority. These prefigure debates on open access issues today. Trying to capture the historical significance of GenBank in this talk. Collections have a long tradition in natural history. Exploration 17th century was creating lots of new knowledge, led to information overload. Needed organizing. 19th century rise of experimentalism. Molecular biology follows similar pattern, actually, despite showing contempt for “traditional biology” = natural history. Called them stamp collectors. Today, though, biological collection in databases is at the forefront. Agenda: needed to balance individual and collective interests. 1980 EMBL announces their own db, lights a fire at NIH. At the time, though, there was doubt about the scientific value of seq collections. Was it worth collecting vs doing science? Ledley–huge for the framework of Use of Computers in Biology and Medicine (book title), hired Dayhoff to assemble protein sequences. Atlas of protein seq/structure published 1965. Pages with sequence in text form. She surveyed literature manually, collected the protein data. 1965: 100 references. 1972: 1000 references. Unsuccessful gift economy to get people to submit seqs here before publication (get a free book). Some submit, some don’t. NIH hesitant to fund this because it was collection vs bench research. She copyrighted/sold atlas to fund this effort. Creates some resentment in scientific community. Unease with charging for sequences she got for free. Turn to Walter Goad and Los Alamos research. Some interest in molec sequence, unique computer facilities at the time. Interesting parallels between natural history collections vs molecular sequence data issue presented. Goad moves to make authorship, priority and credit part of the sequence submission process. NIH reviewers favor this model. NBRF was private–would it seek profits (Dayhoff’s location). Los Alamos was military and national–suspicion of secrecy and security issues. Data distribution is a major issue–Dayhoff = paper, magnetic tapes, Goad = computer networks. Map of Arpanet 1980 shown. Data ownership issues: GenBank represented a resolution of some of the tensions of this data. Dayhoff model more on natural history model, Goad model more on experimental tradition. Note added after talk: this is a really nice session to watch if you want to hear about the historical framework for these developments.

Graham Cameron: EMBL, GenBank, DDBJ: The early days. 1980 May 8 Nature editorial, 100kb of data flooding us :) . EMBL 1982: office, telephone, computer terminal, many requests from people for data. Nature vol 296 15 April: Europe Leads in Sequences article. First entry they published was wrong, published erratum :). Soon it was clear cooperation was the right strategy on this. Predicted as many as 300 users. Stated right from the beginning that it was freely available. International committee 1987 forms, still exists today. Journals are approached for direct submission issues, key theme reappears. Nature was the most resistant journal at this time, considered it a publication barrier. 28 Sept 1989 v 341. “Eventually even Nature came around.” Mid 80s move to real database technology. Used to throw the data at people, with no help! 1987 began pressing for proposals for what became the EBI. 1993 got agreement and some staff to create it. Mission at first: Research, Service, Industry. Then adds a training component. Shows a lovely training room on the slides.

Takashi Gojobori: DDBJ congratulates GenBank NCBI on 25 years. EMBL 1980, GenBank 1982, DDBJ 1986. They are located 1 hour west of Tokyo by bullet train, you are welcome to visit :). Growth of DDBJ: 1987 = 66 entries. Interesting graph of sequence submissions: peaked in 2005, slowing lately, but still substantial. New ARSA algorithm released recently. Major contributions on the rice genome project highlighted. Also on the FANTOM analysis annotation of mouse. H-invitational cDNA annotation “annotation jamboree”. H-Angel (anatomical gene expression library) contains 19k loci, 60 tissues/cell lines. Interesting discoveries on appearance of tissue specific genes in evolution. Targeted proteins research program. Next generation sequencing challenges today. PNAS paper 105: 1176: 4 minutes human genome sequence, this rate is a major issue/paradigm shift in sequence revolution–other types of data (expression, binding, two-hybrid, epi-genomics) issues. Important biological phenomena that we need to capture. Genome Network Platform project underway. TSS distribution from CAGE tag data shown.

Christian Burks: Communiter Bona Profundere Deum Est. Genome Canada many centers now funded supporting genomics and proteomics projects. Starts chronology w/ Ben Franklin and the Library Company. First lending library in N. America (was private). Title is the motto for that, company still exists today. 1965 First tRNA seq published. 1977 first “complete” genome published (phiX174). 1979 Rockefeller workshop. 1979 Kanehisa @ LANL works with Walter Goad. Los Alamos workshop, begins collecting data, has workshop. 1980: LANL proposal to NIH to build resource. 1980 NIGMS workshop to discuss sequencing db and analysis facilities. 1982: LASL Los Alamos seq library report–sort of the end of the pre-GenBank days. GenBank contract 1, 1982 Christian joins T-10 group at LANL. BBN and IntelliGenetics contracts proposals ensue. BBN wins, works with LANL on this. Significant aspect 1986: Gerry Myers, project for HIV sequence/evolution curation db spins off. Strong branch off the initial effort. 1987 BBN, IntelliGenetics & DNAstar three separate collaborative proposals. This time IntelliGenetics gets contract award. 1991 Leadership turned over to Gilna and Cinkosky. DB moves over to NCBI. Spends some time on the interesting Infrastructure Changes. Server hardware, operating systems, data management. RDBMS 1990 Sybase and C. Community Interface transitions: Data Access = hardcopy, magnetic tape, floppies, internet server. Data collection = hardcopy, floppies, tapes, authorin, patentin tools, internet submissions. Fun slide about predictions made for how much data was going to go in (1982, 86, 87). Begin to wonder what is the upper bound–where are we going? Interesting thoughts on how to calculate this. DNA barcoding: 650 bases of mt genome from each species, 10 specimens per species. Walter Goad’s papers now at American Philosophical Society–a Franklin creation. Nice way to come around.

David Landsman chairs this session. Wearing a tie of DNA sequence. Also has a book that is “what used to be GenBank” in text form. Volume 6 of 7. “Collector’s item”.

Charles DeLisi: talks to NAR 1984 12:417 paper published from lab for theoretical biology. Schematic on overall organization of our db and analysis system Fig 1. Speaks about struggling with the culture of making assumptions about how much seq data capacity they were going to need. Shows some predictions/guesses from the period. Moving from first Santa Fe workshop 1986 to sequencing genome in actual budget was a major issue. “Genbank is the resource that has made the genome projects transformative.” Futures of Computational Biology slide with estimates of what is to come, rate/cost estimates for sequencing. Recognize that all this information is great, but need to get students educated in this domain–1998 created program, over 40 PhD graduated, 70 Ms. to date from the BU/Cambridge area program they have created. Genome project having a transformative effect on the culture of science. Moves on to talk about their current work, including TFBS. Kidney cancer and regulation. Holloway, Kon, DeLisi paper WT1 diagram predicting possible targets of WT1. Moving data with predictive value to bench work in cancer biology. Would be great to have a resource that makes these predictions/hypotheses and then can be tested. Kirca et al paper on expression in tumor cells–hunt for genes always overexpressed. Cancer Informatics 2007: 2 1-28. Cluster genes and look for patterns. 158 genes are candidates. Can cluster by processes (mentions good Weinberg paper on the clustering by process). Moved to look for genes hypermethylated, 6 identified. Now to look at these predictions in populations like the Framingham study. Example of going from bioinformatics predictions to useful markers in cancer.

Jim Ostell: GenBank at NCBI. 1992 Represented the merger btw GenBank and NCBI, causing some nice harmony and odd situations. Colon cancer gene example of homology to E. coli and yeast as telling us about the function of the colon cancer gene because in those species they are DNA repair. 1989 flat file record shows no protein. Corresponding protein shows no DNA. 1990 NCBI model diagram shown in series of slides. How do you say this all in a file format for computers to read vs humans to read? Developed ASN.1 standard for full representational complexity, but lets you create all sorts of other tools to use it. Not well received at the time. By 2000 XML comes along, same idea, maybe people would like it? Nah. 2004 international seq consortium new standard. Not well adopted. Keep returning to the flat file. Now let’s match it to Medline and PIR. Had to re-write references to coordinate stuff = Entrez is born 1991. Medline became PubMed. GenBank now becomes connected with other layers of information–protein/literature/etc. Flat files get expanded for source, feature table improvements. Nucleotide IDs, Protein IDs coming along to help. Can version them. Explains why gi numbers don’t exist in EMBL/DDBJ. Links to PubMed now smooth, taxonomy links very cool now. Slide of taxa growth, huge. 1995 First microbial genome, seems small now, a big deal at the time. However, created a problem for avg desktop computers. So they limited GenBank records to 350kb, so had to chunk it. No way to show scaffold/contig reassembled. Created Entrez Genome for this. A simple suggestion: “GenBank select” for best exemplars. Started w/ HIV as good place to start. But there wasn’t a full seq in GenBank despite reseq many times in different labs. Worked with Retrovirus book publishers to get these submitted. Yeast genome comes along. Needed to create a system to work out data ownership/updates model. Created RefSeq–derived from GenBank and other sources, but is not GenBank. More like a review article. And points back to original source. People still confused by what is in GenBank vs RefSeq today. Updates to RefSeq sequences don’t go back to GenBank, but updates to GenBank do propagate to RefSeq. Hilarious graph of growth in bp up to 92 vs 2008. Neat graph of the way sample types have changed over the years–environmental samples just beginning to make a real appearance, but more expected.

David Botstein: Genomics, Computation, and the Nature of Biological Understanding. Congratulates the GenBank team for creating a gov’t business that really serves well! How do we put biological understanding into the seq information? This talk pre-supposes that GenBank exists, and asks where do we go from there. Origins of Genomics slides–mechanics, stored/transmitted, expressed –> central dogma. Origins of Genomics II: emphasis shifts beginning in 1960s. What do genes do? 1st genome paper he could find: Epstein RH et al, T4D CSH Symp Quant Biol 1963. Origins of Genomics III: model organisms start to be looked at for this, cooperate on information gathering, and their genomes got sequenced first eventually. Presents great diagram that came out of Charles Darwin’s notebook after he got off the Beagle, shows amazing insight. Simple inheritance vs complex inheritance and the diseases associated with various types. Sequence vs function: a single flat sequence not particularly informative–but comparison is informative. At the end of every trail is an experiment–gotta do benchwork to learn. Connecting sequences with what is known–annotations and literature–was the key thing that GenBank did. Next, you have to address that proteins interact with other proteins–they work in gangs. This moves biology to an information science–relationships are important to store, retrieve, display. Combinatorial complexity is huge. We have what he likes to call a NASA problem. We have more information than we understand. Shows Genome Informatics Timeline. 1982 Genbank–>GO 1998. Early key decisions: run by biologists, full time staff required, must serve needs of community. SGD today chart. Gets us to Gene Ontology: Ashburner says “Biologists would rather share a toothbrush than share a gene name”. GO develops to share structure, controlled vocabulary to describe gene products. 3 aspects of GO: biological process, molecular function, cellular component. GO is a huge help for understanding what is known. Intellectual Impact of the Genomic View slide: unification. Despite diversity, fundamental mechanisms related. New frontier opened–at the interaction level, systems level. “How quantitative is human perception?” Brains are good at 6, but not bigger numbers. How do we understand? Presenting and making high-volume data mean something is a major challenge. Showing various clustergrams. When you can see these representations and have some knowledge, you can form hypotheses. These kinds of results can drive you back to genetic testing of patients with different tumor types, for example. Last post-genome anecdote gets to systems biology. Testing results from mass-spec w/ metabolites can get us to similar cluster diagrams that can be informative and can lead to testable hypotheses. Functional relationships from Biopixie, diagram shown that brings us back to bioinformatics. We need to do experiments and find out what these things do, and we need to be better at “serving it up”. We need to spend efforts getting us to the next level above sequence data now.

J. Craig Venter: Genomics: From ESTs to Genomes to the Environment and Back. Congrats to NCBI and the team. Will talk about work at 4 institutions–NCBI, TIGR, Celera, JCVI. Refers to Massive parallelism paper in Nature Genetics. ESTs in Genbank 1991 = 337. 50 million today. Prokaryotic genomes completed slide, also slide of genomes by Center (JGI, JCVI, Broad, world). Another slide breaks it down by strategy. 2000 Fly genome done in 4 months, same time period as Haemophilus 5 years earlier, despite major size difference. Presents the paper of his own diploid human genome PLoS Biology 2007: 5:e254. Addresses variation btw people, and variation w chimps. Increasing surprises from the variation data. He’s arguing for many more diploid genomes to understand variation. Of course, seq lacks phenotype information–working with groups to generate that now. You can download diploid genome interactive poster from PLoS Biology–can also see in new browser released from JCVI just last month. Moves on to microbial sequence–including Sargasso and human microbiome project. Metabolomics (Metabolon company) slide looking at compounds in blood plasma. 60% human, 30% diet/xeno, 10% bacterial compounds. Role in physiology looked at the first time. Global ocean sampling slides. Change in composition of samples is clear, and they cluster by ocean region. Now have a huge meta data set including GPS coordinates for sequence data. Proteorhodopsins vary by region slide. Another surprise is that species are not cleanly separated. GOS data set nearly doubles the number of known proteins. 4000 new gene families not seen in Genbank in early samples–even more now. Won’t find new genes sequencing more mammalian genomes–will provide variant analysis, but not new genes. Metagenomics catching on slide. Big Questions slide: what is life? can we pare it down to basic components? can we digitize it? can we regenerate life or generate new life out of the digital world? There are technical questions–if we can make a genome chemically, can we boot it up? Did a phage. Try M. genitalium, tells story of design and issues around this. Wanted to avoid problem of YAC issues in human genome sequencing (yeast re-shuffling DNA), checked at every step to see if M. g. assembly was working. D. radiodurans: ultimate DNA assembly machine. Yeast TAR system builds synthetic chromosomes. Got it assembled, published in Science. Now trying to boot up the synthetic chromosome. Describes the role of restriction enzymes in destroying host genome as a strategy. Cholera has E. coli-like components, appears to have absorbed this at some point. D. radiodurans has 4 genetic elements. Where do we go from here? Synthetic Genomics™. Synthetic Organism Designer 1.0 software being developed. Calling new field Combinatorial Genomics. Cassette shuffling–can shuffle for viability, chemical production, etc. Designer fuels. New species for this. Slide shows potentials of Synthetic Genomics. Reverse vaccinology as a strategy–from genome sequence to vaccines. Engineering bacteria to treat cancer (Vogelstein work Sci 314: 1308.) Can use this strategy to design new organisms that lack infectivity. Future uses of synthetic and engineered species slide. Slide on Ethical considerations (Sci 286: 2087.) Showed slide of future phylogenetic diagram w/ synthetic species….Synthetic genome part of the databases?

2 thoughts on “Liveblogging the GenBank 25th Anniversary”

  1. kay

    “GenBank: Celebrating 25 years of Service at NCBI”

    How can this be? I vaguely remember NCBI’s inauguration, but this was much less than 25 years ago. Genbank itself may be 25 years old, but this was before my time (in bioinformatics)
