Liveblogging the GenBank 25th Anniversary II

I’m preparing to liveblog this event again today, internets permitting:

GenBank: Celebrating 25 years of Service at NCBI: http://www.tech-res.com/GenBank25/ official announcement.

The agenda is here: http://www.tech-res.com/GenBank25/agenda.html

There is a link to a videocast of the event from the Celebration link, supposedly:
View event:

You will be able to view the event at http://videocast.nih.gov when the event is live.

Will try to update as often as I can, if I have decent wireless and power.

Session Chair: Steven Salzberg.


Sidney Brenner: Gives a talk with no slides!! “One good phrase is worth a thousand powerpoints”. There is an information crisis in biology now. Publication, power to collect data, and this is a problem that needs to be solved. People confuse data with knowledge. Data is not enough. You get credit for collecting data, you get credit for distributing data, but you don’t get credit for organizing data. We have to not lose this tremendous capital. Information before digitizing publications is often forgotten. How would you organize the information? How do you get hold of this information in a useful way? Maybe the next 25 years of GenBank is to turn it from a bank to a way for people to withdraw the information “with interest”. GenBank is a wonderful and free repository….Goes on to talk in broader terms of hidden assumptions in biological interpretation, genome equilibrium, rates of evolution, dynamics of genomes. We need to communicate to our students better–they need to understand the genome dynamics underlying the software they pick up and run. Systems biology is a foolhardy measure and he predicts it will end up with zero. We are trying to solve the “inverse question” when we are looking at these data points, which is very hard. We have been good at solving the “forward questions” in biology. The idea is that we will dissect the complexity by looking at all these measurements. Everyone is hoping for a magic computer program that will solve this; it is a vain hope. I have to tell you: computers are stupid. Must combine human intelligence with computers, rather than the other way around that systems biology is doing. What we have are the white pages of the phone book–we can look at it, we can compare it with other cities. We also have the yellow pages–the annotated genome. We can find that there are 52 plumbers in the city, we can deduce there are pipes in the city. 
We can’t make many assumptions about the functions of the city based on just the movements around the city–but the locus of interaction in the city is crucial. We are not going to find out what’s going on by collecting the data alone, it needs to be looked at carefully with benchwork. “Don’t think you are only adding signal to the database–you could be adding a lot of noise.” There are 3 values in biology: good, bad, and indifferent. There’s nothing like the human brain–most of biology today is low brain input.

David Relman: Why you are never alone at night: human-microbial symbiosis. It is nice to stop and appreciate resources that we often take for granted. Human indigenous microbiota: benefits slide = vitamin production, food degradation, colonization resistance, terminal differentiation of mucosa…more. We are currently at the point where we are barely understanding the parts list in these interactions. We can’t knock out like you can in yeast. We can look at diversity in human microbiota. Diagram of types cultivated, from the Handelsman paper; Eckburg et al. Science 2005. Much of the microbiota in the human body has not been seen yet. Dethlefsen Nature 2007 more studies. A correlation has been described between body weight loss and gut microbial ecology (Ley, Nature). Spatial patterns of distribution of life in the human body: biogeography. How does this affect human health? Examined individual tooth locations, in addition to a whole mouth sample. Not all tooth locations have the same composition of flora. Each site appears to be distinct, in a non-random way. Are the principles of human mouth biogeography like a tide pool and island biogeography? Sources of variability have really not been looked at yet. Microbial community diversity and disease. Chronic periodontitis looks like it may be an altered community structure. “Community as pathogens” has escaped many of us as clinicians. List of diseases that may be at issue: Crohn’s, IBD, etc… How can we study and better understand these communities? Antibiotics are a pretty new phenomenon in the biological time scale. Maybe possible to design studies following people after antibiotics. Next gen sequencing can approach this. Challenges for this work: strain diversity: seq quality, computational needs function; specimen size–what is the relevant scale?, clinical metadata–are we capturing important information here?, unevenness–rare community members, uncultivated members. 
Will we get to “single cell genomics” to understand the features of the overall system? Human Microbiome Project as part of NIH Roadmap.

Elizabeth Nabel: Advances in Genetics and Genomics: an NHLBI Perspective. Comes here as a “fan of NCBI”. Will cover 3 topics: NHLBI in genetics/genomics, GWAS, and NIH trans-GWAS policy. NHLBI has many longitudinal studies adding great value to science. Would have called it “epidemiological data” but would now call it “phenotypes”, which can be of huge value to genomics research. Expect broad use of the data that will go into dbGaP for everyone to use. Translating GWAS to Genomic Medicine. Framingham heart study, now 3 generations. Framingham has community pride in their participation over generations now. SHARe: SNP Health Association REsource. Genotyping/phenotyping data into dbGaP now. 5.5 billion genotypes in there. MESA and asthma now proceeding: MESA = multi-ethnic study of atherosclerosis. Candidate Gene Association Resource (CARe): contract w/ Broad, to merge genotype and phenotype data to a common bioinformatics platform, hope to see this by end of this year. Warfarin study to show genotype-guided dosing of warfarin therapy. A gene-based clinical trial, just about ready to embark on this. Rare variants in LDL in CAD: PCSK9 gene. Cohen et al. NEJM 354:12. McPherson Science 316: June 07, 9p21 associated w/ CHD. Also reported by deCODE (Helgadottir paper Sci 316). Wellcome Trust Case Control Consortium (WTCCC) study around the same time, same finding of a chr9 locus. Showed UCSC genome browser slide w LD plots in this 9p21 region. Same region deCODE says has more effects now on aneurysm. Chakravarti data on long QT interval on NOS1AP. Atrial fibrillation associated w 4q25 coming from deCODE group. ORMDL3 in asthma, Moffatt et al Nature 448 2007. Finished by bringing us up to date on data sharing for GWAS studies on NIH campus. (pronounces them gee-wass, btw). Shows slides of dbGaP, thanks Ostell and team for helping build that. Some individuals in studies restrict the use of their data by for-profit companies. 
dbGaP: 2 sites = public, see what studies are available, who is working on them, pre-computed or published studies. Controlled access site: security measures. login/pw required, terms of use. Has a “period of exclusivity” for primary investigators to publish on the data they obtain before other users publish on that. Nice diagram of GWAS information flow. Happy Birthday GenBank! Questions ensue about how to use dbGaP web site and what you can find.

Francis Collins: Notes from the front lines of the genomics revolution. Nice words about the value of GenBank in the past, and the need for them in the future with the current pace of work. How sequence data has played out in his own personal research. Shows his own publication on beta globin cluster he assembled, required a fancy “laser printer” thing :) . Had entered much of the data by hand, daughter proofread. Daunting experience. 5 years later, new sequence for CFTR gene shown. Great timeline of scientific discoveries in genetics. A critical thing for sequence availability was the Bermuda rules. Shows hand-written policy generated from that meeting. A “radical” agreement. Where are we now (6 rather arbitrary points because so much is going on.): 1. Comparative genomics. Learning a lot about evolution’s lab notebook. Mentions proposal to “wikify” GenBank. Doesn’t seem to be a good idea. Encourage corrections, of course. Now a Genome Reference Consortium is fixing some things. 2. DNA sequencing undergoing revolutionary technical advances. 454, Korlach paper PSB, Harris paper last week on faster sequencing. We may get to the $1k genome faster than we think if these work out. Cancer: a disease of the genome. Cancer Genome Atlas pilot effort in the last year brings these genome tools together. http://cancergenome.nih.gov GISTIC analysis for copy number issues in glioblastoma. Showing new and interesting genes coming up. 3. Experimental approaches to determining genome function are moving forward rapidly. ENCODE project bringing great intersection of data types to 30 Mb of genome in the pilot study. Now scaled up to decorate the whole genome with functional information. modENCODE project also underway. Knockout Mouse consortium for all protein coding genes in mouse. (KOMP in US, also others). Can nominate your gene to get locus targeted. MGC full length cDNAs, now also with Xenopus and zfish (XGC and ZGC). 
You can get these cDNAs for only the cost of sending out the clone. (resource for all to use). NIH Roadmap small molecule initiative. PubChem has opened up data that used to be behind the firewalls of companies or by subscription. 4. Genetic factors in many common diseases are rapidly being revealed. Glorious 18 months or so for this type of data. HapMap enabled understanding of variation, and can test fewer samples. Also drop in genotyping costs has helped. Manolio, Brooks, Collins review on the discoveries coming out in J Clin Invest next month. Amazing map of the discoveries over time. But this is just a fraction of the heritability that we need to understand. Gives great credit to the dbGaP and the access it provides. 1000 Genomes project highlighted for more information on more frequent variations. Info will be in an open database. Next 1.5 years…looks for SNPs and CNV in a systematic way. Gene expression and genotype correlation needs to be easier. “Following up a GWAS with expression analysis slide”. Are there cis-effects on any of these genes? Cookson lab study on asthma was a nice example of how to do this. We need a human tissue db, maybe 1000 samples and genotyping + expression. 5. Clinical applications of genomics are expanding. Diagnostics and prevention, pharmacogenomics, therapeutic developments. Addresses misconceptions of gene discovery and validation of drug targets. Shows example in diabetes that this is valuable. 6. Attention to ethical, legal, and social issues required. Shows personalized genomics companies slide. Great potential for confusion, health choices, and discrimination. Still waiting for GINA (shows slide). Truly a frustrating experience, harmful to studies and our progress. Gives major kudos to GenBank team for helping us get to where we are today.

Session chair: Kent Smith.

Sharon Terry: Genetic Alliance BioBank: Enabling Biomedical Research. Shows photo of her kids who got her into this area–PXE. Frustrated by cost and access to scientific journals for literature that they needed. Includes the story of discovery/patenting gene for the condition. Describes maturing of patient advocacy. Shift in business models from industrial age (old) to the information age (new). Genetic Alliance, now transforming health through genetics. Environments of openness, novel partnerships in advocacy, revolutionize access to information. BioBank for aggregation of clinical samples. Describes features and functions of the BioBank. GeneLogic TRIMS interface. Will be connected to CETT structure as well. SNOMED as key vocabulary now, looking at others. Provides examples of successes of the BioBank, PXE and CFC. PXE as a good example of keeping an open mind in science because the gene wasn’t what most people were expecting. ABCC6 gene. Phenodex tool developed to categorize the phenotype. Displays “systems approach to disease research”. Pushing an envelope: “Ownership: get over it”. Need consent issues resolved, immediate public access, and data sharing. Scientific and community commons to democratize science. Geneticfairness.org. Wants to celebrate with us the “disruptive innovation of GenBank.” Makes some comments on the advent of personal genomics and social networking.

David Lipman: GenBank’s Greatest Hits: A Sampling of Discoveries. Why is GenBank at the library? Examples of greatest hits. Classification and genomic perspective on sequence families. What does the genomics perspective tell us about biological systems? In the early days of sequencing the genes were well characterized already, low sequence-to-papers ratio. Most of the richest information about biological function is in the literature, as many speakers stressed. Slide of History of Sequence –>Literature. 1986 = 7939 papers referenced in GenBank. Growth massive from there, important changes/features highlighted.

Greatest hits: 1982 Walker motif EMBO J Vol. 1 pp945. Reflects how sequence data was done at the time. 1982 Barker and Dayhoff PNAS 79:2836. Evolutionary relationship between a virus protein and a protein of a normal cell. Probably too subtle a relationship for most experimental biologists. 1983: archetype discovery made by computer searching: Waterfield Nature 304, Doolittle Science 221, both in July. Described in NYT as a “serendipitous computer search”. Funny story about the race on this result. BRCA1 Jensen RA Nature 1996. Leads Koonin, Altschul and Bork to look at sequence differently to find motifs: BRCT domain (Nature Genetics letter 13:266). Second set of slides/images from CDD database. CDART database for domain architectures slide.

After 1982 more and more important sequence matches start coming along. These provided some of the justification for the human genome project.

Green et al paper, Science Mar 1992 (?). Ancient Conserved Regions in New Gene Sequences and the Protein Databases. Surprised to find few new protein families. ESTs that appear multiple times had a higher chance of having an ACR (ancient conserved region). Seemed to be a relationship between conservation and expression. Many of the new genes didn’t match anything.

Earlier estimates of the number of protein families: Zuckerkandl E 1974 (Extremely interesting and forward-looking paper, relevant even today). Dayhoff M 1974.

1997 paper leads to COGs (Tatusov, Koonin, Lipman). Sci 278:631. Clusters of Orthologous Groups. Slide w update of COGs (eggNOG) NAR 2008 36:D250. Slide of COGs in glycolysis.

Larger scale picture of evolution becomes possible: 1997 Koonin et al. Mol Microbiol 25:619. Soon after, gene exchange in bacteria/archaea, Trends Genet: 1998 Aravind et al.

Illustrates work on NO receptor, Iyer et al 2003 BMC Genomics. Brochier et al paper on horizontal transfer Trends Genet 2000. Car analogy for swapping parts. (What is it with men and car analogies??).

Examines some of the real challenges in systems biology because of the noisiness of biology and the inability to swap parts as cleanly as we would hope.

What is coming? Sequencing all sorts of stuff we didn’t expect to sequence before, just like we saw computers become used for stuff we wouldn’t have predicted.
