Tag Archives: dbGaP

What’s the Answer? databases of disease SNPs

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s question is searching for databases of SNPs ‘causal’ for diseases. As the answers point out, the word ‘causal’ should be used with hesitation when talking about SNPs. That said, the answers gave some good suggestions to this perennially asked question. Here’s the first one (others are as useful, so check them out):

Human Gene Mutation Database


Paid subscription for up-to-date information. Otherwise, a less up-to-date public version of the database is freely available, but only to registered users from academic institutions/non-profit organisations.

Busting an Embargo

Not me–and not one of the press embargoes.  I’m talking about a data embargo.   While on the way to a workshop this week I was reading my paper issue of Science on the flight.  And I was intrigued by the story of what happened when a data embargo was broken.  The story is: Paper Retracted Following Genome Data Breach, and it is the story of data from dbGaP being published before the authors were permitted to publish on it.

The scientist who helped to develop our dbGaP tutorial had alerted me to this story (hat tip to Cyndy :) ), because she knew how the dbGaP data access system worked. In fact, let me quote part of our tutorial that explains it very clearly on slide 12:

Next is the linked study title, followed by the Embargo Release date for each study. Investigators contributing data to dbGaP may retain the exclusive right to publish analyses of their datasets for a defined period of time. Prior to the Embargo Release date, other investigators may be granted access to download and analyze data, but they may not seek publication of their results until after this time.

There’s a great and risky feature of these large-scale data projects.  Investigators are asked by the NIH data sharing rules to submit data to the appropriate repository even before they’ve had a chance to publish on it.  The risk is people will scoop the submitters.  And that’s apparently what happened in this case.

We’ve also spoken to data embargo issues in the context of the ENCODE project. In fact, one segment of our tutorial on ENCODE covers that issue. As more and more “big data” projects roll out in this manner, there are likely to be more of these issues cropping up. I think PNAS had a good idea–adding an item to their author checklist that specifies whether data is under embargo rules. (Oh, and they retracted the paper and you can see the stub here.) But I think it’s also up to the projects and databases to explain the data embargoes clearly. The people associated with the big data projects understand the rules, but I don’t know that they have percolated fully through the scientific end-user community. We’re trying to help get the word out with ENCODE and dbGaP in our training materials, but I know the process varies by project. I think this episode offers a nice “teachable moment” for this. I’ll be referring to it in future workshops, for sure.

So keep an eye out for this as you use “big data” resources.  But use them–don’t let this dissuade you. Just keep an eye on the calendar.

dbGaP: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gap

UCSC Genome Browser (with ENCODE data): http://genome.ucsc.edu/ and http://genome.ucsc.edu/ENCODE/

New and Updated Online Tutorials for dbGaP, GAD, and DGV

Comprehensive tutorials on the publicly available dbGaP, GAD, and DGV databases enable researchers to quickly and effectively use these invaluable resources.

Seattle, WA: July 16, 2009 – OpenHelix today announced the availability of new tutorial suites on dbGaP, the Genetic Association Database (GAD), and the Database of Genomic Variants (DGV). The dbGaP resource is a database of genotypes and phenotypes with extensive variation data and clinical details; GAD is an annotated resource connecting human genes and polymorphisms to diseases and traits; and DGV catalogs and displays structural variation in the human genome. These three new tutorials, in conjunction with additional OpenHelix tutorials on dbSNP, VISTA, HapMap, GeneSNPs, SeattleSNPs, Genome Variation Server and many others, give researchers an excellent set of training resources to assist in their genetic association and variation research.

The tutorial suites, available for single purchase or through a low-priced yearly subscription to all OpenHelix tutorials, contain an online, narrated, multi-media tutorial, which runs in just about any browser connected to the web, along with slides with full script, handouts and exercises. These tutorials will teach users:


dbGaP

  • to perform basic and advanced searches and navigate the dbGaP site
  • to understand the displays for the main open access data types: studies, variables, documents, and analyses
  • to use the analysis browser to identify candidate genomic regions for genotype-phenotype associations and to manipulate and customize the browser displays

GAD

  • to view GAD tables from different perspectives
  • to read detailed reports for unique genetic associations
  • to perform basic searches for genes, diseases, polymorphisms, environmental factors, and references
  • to perform advanced queries
  • to do a batch query for a large gene list
  • to add a new genetic association or edit an existing one


DGV

  • to browse and search through DGV’s structural variant data
  • to find, understand and link to more genomic variation details
  • to navigate and customize your data using the genome browser
  • to perform a BLAT sequence search

With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. The scripts, handouts and other materials can also be used as a reference or for training others. To find out more about these and over 70 other tutorial suites visit the OpenHelix Catalog and OpenHelix. Or visit the OpenHelix Blog for up-to-date information on genomics and genomics resources.

About OpenHelix:
OpenHelix, LLC, (www.openhelix.com) provides the genomics knowledge you need when you need it. OpenHelix provides online self-run tutorials and on-site training for institutions and companies on the most powerful and popular free, web-based, publicly accessible bioinformatics resources. In addition, OpenHelix is contracted by resource providers to provide comprehensive, long-term training and outreach programs.

List of GWAS studies

They are still working on the recorded version of the NHGRI GWAS seminar that we attended last week, but I wanted to point you to a useful web page they mentioned. It is a collection of GWAS studies with the top 5 SNPs from each listed, as long as they made a certain threshold.

“As of 11/24/08, this table includes 202 publications and 435 SNPs,” according to the Catalog of Published Genome-Wide Association Studies.

So if you are interested in GWAS data this is a nice collection of that literature. It also comes as an Excel doc you can download.

The traits they cover are quite a range–from freckles to diabetes to bipolar disorder and many more. I think I would like to take some of these data over to the UCSC Genome Browser’s Genome Graphs feature, where you can visualize the data on a handy genome graphic. To get this figure, here’s what I did:

1. Took the GWAS Excel file.

2. Pulled out the rs IDs for the SNPs. Some cells had to be fixed because they contained a series of comma-delimited SNPs. Moved each SNP to its own cell.

3. Cleaned up any entries that were not rs IDs. I ended up with 480 SNPs. I left the duplicates for now.

4. Created a plain text file of these SNPs. I gave each one a value of 1 just for the purposes of the Genome Graphs software. I just wanted to see all these SNPs on the genome in one graphic. The Genome Graphs tool tells me:

Loaded 12351941 elements from snp126 table for mapping.
Mapped 479 of 480 (99.8%) of markers
These data are now available in the drop-down menus on the main page for graphing
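I did the cleanup above by hand in the spreadsheet, but steps 2 through 4 could also be scripted. Here is a minimal Python sketch of that idea, not what I actually ran. The cell contents, the function names, and the output file name are all made up for illustration, and it assumes the upload wants one marker and one value per line, separated by whitespace.

```python
import re

# Matches a dbSNP reference SNP identifier such as rs12345.
RS_ID = re.compile(r"rs\d+")


def extract_rs_ids(cells):
    """Split comma-delimited cells and keep only well-formed rs IDs.

    Duplicates are kept on purpose, as in step 3 above.
    """
    ids = []
    for cell in cells:
        for token in cell.split(","):
            token = token.strip()
            if RS_ID.fullmatch(token):
                ids.append(token)
    return ids


def to_marker_lines(rs_ids, value=1):
    """Format one 'marker value' pair per line (assumed upload format)."""
    return [f"{rs_id} {value}" for rs_id in rs_ids]


def write_marker_file(path, rs_ids):
    """Write the marker lines to a plain text file."""
    with open(path, "w") as handle:
        handle.write("\n".join(to_marker_lines(rs_ids)) + "\n")


# Hypothetical cell contents standing in for the SNP column of the spreadsheet.
example_cells = ["rs1234", "rs5678, rs9012", "NR", "rs1234", ""]
lines = to_marker_lines(extract_rs_ids(example_cells))
print("\n".join(lines))
```

Calling `write_marker_file("gwas_snps.txt", extract_rs_ids(example_cells))` would then produce the plain text file to upload; the file name is only a placeholder.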

Off we go… Here are my SNPs on the genome graph–the SNPs are teeny blue dots. OK, I don’t know what it means either. I just wanted a sense of what was coming out of all the GWAS studies and where they actually were on the genome. I would like to take another look at the data; this was just a quick pass. I’m intrigued by the SNPs that come up in multiple studies and I’m curious about what those genes do. Hmmm…
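Since the duplicates were left in the list, one quick way to pull out the SNPs that show up more than once would be a simple tally. This is only a sketch with made-up rs IDs and a hypothetical function name, and it assumes a repeated rs ID in the list really does correspond to separate studies rather than repeated rows from one study.

```python
from collections import Counter


def recurring_snps(rs_ids, min_count=2):
    """Return the rs IDs that appear at least min_count times, with their counts."""
    counts = Counter(rs_ids)
    return {rs_id: n for rs_id, n in counts.items() if n >= min_count}


# Hypothetical list standing in for the 480 rs IDs pulled from the spreadsheet.
example_ids = ["rs1234", "rs5678", "rs1234", "rs9012", "rs5678", "rs1234"]
recurring = recurring_snps(example_ids)
print(recurring)
```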