Tag Archives: dbSNP

(one) Video Tip of the Week (to hold them all): Variation and Disease Databases

After again reading Daniel MacArthur’s good rundown about the state of databases of human disease-causing variation from last year (One database to hold them all), I thought it might be nice to do a tip comparing several of them. I couldn’t get it under our self-imposed 5 minute limit for our tips (and the technical limit of the software I’m using, but that’s about to change). But as I perused our tips and other sites, I found that we and others have quite a list of how-to tips for using these databases. So in today’s tip I’ve gathered video tips for 3 of the databases listed in the linked post. Below those tips I’ll link to other how-to videos for additional human variation and disease resources.

The databases mentioned are OMIM, Human Gene Mutation Database (HGMD), MutaDATABASE and The Human Variome Project. There are video tips for the first three.


Last year OMIM moved to http://www.omim.org and got an entirely new interface. Mary was on top of it and did a tip on the new OMIM interface, with lots of information on the move and on OMIM in the post:

Our full tutorial on the new OMIM is coming soon.

HGMD has a public site and a by-subscription site. The latter includes access to the most current data and some added features. The publicly accessible site is out-of-date by three years. Because of HGMD restrictions, we aren’t able to do a tutorial or a tip on HGMD, but they do have an introduction video to their database:


Additionally, there is a good background page for more information.


Mary did a tip on MutaDATABASE last summer:


Another excellent resource is Gen2Phen. The Gen2Phen project “aims to unify human and model organism genetic variation databases towards increasingly holistic views into Genotype-To-Phenotype (G2P) data, and to link this system into other biomedical knowledge sources via genome browser functionality.”  In that vein, they have quite an extensive list of Locus-specific databases and additional resources.

There are several other resources available for human disease variation, including CGAP, dbGaP, GAD, PhenomicDB and more. We have tutorials on all of those if you wish to check them out.

Of course there’s dbSNP :D of which we have a tutorial and tip about searching human variation.

You can find an extensive list of other resources at Human Genome Variation Society (HGVS).

And an oft-asked question on Biostar is what kind of resources are there for this kind of data. You can find answers here, here and here.

Video tip of the week: VarSifter for identifying key sequence variations

Recently many of the bioinformatics tweeps I follow were excited about the tool called VarSifter. Here’s the notice that I saw:

RT @yokofakun: http://www.youtube.com/watch?v=I7azpqTWFuM Jamie Teer describes VarSifter, an interactive GUI tool for handing/quering/filtering VCFs #ngs

I just had a chance to watch the video, and now I can see why they were impressed! Over the years in the workshops we do, people have asked questions in various theme groups. For a while it was lists of genes and microarrays. Then it was known SNP variations. Then it became transcription factor binding sites. Lately it’s been: I have a giant set of sequence data that I need to process to find new variants that might impact genes. How do I do that? This video tip-of-the-week will help you to understand how to do that.

This video was part of a day of lectures at the NHGRI about how to deal with exome sequencing data: Next-Gen 101: Video Tutorial on Conducting Whole-Exome Sequencing Research. There is a whole series of video and slide material available from NHGRI’s page, and the one I’m highlighting here is number 3 on that list. Be sure to download the slides if you want to take notes and access the references and URLs that are key to the material.

Jamie Teer gives a terrific talk about dealing with the exome sequence data output that next-gen projects are yielding. It starts with just managing and viewing the reads, and he highlights a couple of different ways to do this. These include SAMtools, and he also shows how the reads look in both the UCSC Genome Browser and the Broad’s Integrative Genomics Viewer, IGV. It’s nice to see a comparison of these to illustrate what you might expect to see. We can help you to understand how to load this kind of data as custom tracks in the UCSC Genome Browser with our advanced tutorial, and you’ll find some nice guidance on what to expect from IGV in the paper listed below in the references area.

The video also describes annotation software that helps you to identify where the variations and consequences are in the data. Many of these tools we have talked about either in our tutorials or our other tips-of-the-week.

He also describes how people generate pipelines to flow the data through a series of steps to do the analysis. Sometimes these are home-made programs used by a local group. But he also mentioned how Galaxy can help to accomplish this now.  We’ve been fans of Galaxy for a long time, and we know people are using it in exactly this manner.

You still should have a basic understanding of all the tools individually if you want to use them all, or tools that incorporate them all into workflows/processes, though. It will help you to create better workflows/pipelines. And it also matters that you know what you aren’t seeing/using.

Teer closes by introducing the VarSifter software that he’s been involved with creating. This software is freely available for you to download at the VarSifter site. Usually we prefer to highlight web-based interfaces, but there isn’t one for VarSifter. But if you see the utility in it you can also try to get a local copy set up for yourself. VarSifter will help you to view, sort, and filter variants in a lot of ways.

So have a look at this video if you are interested in understanding how these analyses are done, and if you are interested in knowing more about the tools that can be used. It’s worth the 40 minutes–really.

Quick links:

YouTube page: http://www.youtube.com/watch?v=I7azpqTWFuM

VarSifter home page: http://research.nhgri.nih.gov/software/VarSifter/

Exome analysis Talks at NHGRI: http://www.genome.gov/27545880


IGV: Robinson, J., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E., Getz, G., & Mesirov, J. (2011). Integrative genomics viewer Nature Biotechnology, 29 (1), 24-26 DOI: 10.1038/nbt.1754

UCSC new paper: Dreszer, T., Karolchik, D., Zweig, A., Hinrichs, A., Raney, B., Kuhn, R., Meyer, L., Wong, M., Sloan, C., Rosenbloom, K., Roe, G., Rhead, B., Pohl, A., Malladi, V., Li, C., Learned, K., Kirkup, V., Hsu, F., Harte, R., Guruvadoo, L., Goldman, M., Giardine, B., Fujita, P., Diekhans, M., Cline, M., Clawson, H., Barber, G., Haussler, D., & James Kent, W. (2011). The UCSC Genome Browser database: extensions and updates 2011 Nucleic Acids Research DOI: 10.1093/nar/gkr1055

SAMtools: Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & , . (2009). The Sequence Alignment/Map format and SAMtools Bioinformatics, 25 (16), 2078-2079 DOI: 10.1093/bioinformatics/btp352

World tour of workshops, recent stop: Morocco, Africa

Trainers & organizers

Last year I had the opportunity to give a workshop in Ifrane, Morocco (UCSC Genome and Table browsers, Galaxy) at Al Akhawayn University. This year, Mary and I returned for a longer 3-day workshop at University Hassan II in Mohammadia. OpenHelix was a co-sponsor of the workshop (donating our time, materials and expertise). The workshop covered a plethora of topics, from a world tour of resources (tutorial-free) and introductory UCSC Genome Browser (tutorial-free) and ENCODE (tutorial-free) to genome variation analysis in dbSNP (tutorial-subscription) and analysis using Galaxy (tutorial-subscription). You can see the full schedule of topics in the Mohammadia Workshop Schedule here (pdf).

As last year, we were impressed with the students (there were 117 total, about 50/50 gender ratio). English is their 3rd or 4th language in most cases, with Moroccan Arabic, French or various African languages being their languages of choice. Yet they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.

The workshop students

We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.

Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some answers from our readers:

*One student was looking for wheat genome resources for designing primers. The wheat genome is as yet incomplete, but there are some resources to get started:
Wheat Genome Sequencing Consortium
Gramene’s wheat resources
Wheat Genetic and Genomic Resource Center @ Kansas State
Perhaps also COGE for conserved sequences
edited to add:
CerealsDB and
James’ post on the wheat draft sequence might give some insight into that huge genome.
*Another student asked about dotplot tools:
Galaxy offers a large collection of EMBOSS tools including dotplot analysis, as does EBI Emboss tool

* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space for a dynamic programming solution: beyond a handful of sequences it is too computationally intensive. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProps and this list at Wikipedia.
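To make the cost argument concrete, here is a minimal sketch of pairwise dynamic programming (Needleman-Wunsch global alignment scoring) in Python. The scoring values and the helper `dp_cells` are assumptions chosen for illustration, not taken from the slide set or from any particular tool. The pairwise table has roughly L × L cells; an exact alignment of N sequences needs a table with about L^N cells, which is why optimal multiple alignment gets out of hand so quickly.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences (Needleman-Wunsch).

    The scoring values are illustrative defaults, not a recommendation.
    """
    # prev holds the previous row of the DP table: prev[j] is the best
    # score for aligning the prefix of `a` handled so far with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dp_cells(n_sequences, length):
    """Cells in the full DP table for n sequences of the same length."""
    return (length + 1) ** n_sequences


if __name__ == "__main__":
    print(nw_score("GATTACA", "GATTACA"))
    for n in (2, 3, 5, 10):
        print(n, "sequences of length 100:", dp_cells(n, 100), "cells")
```

Running the loop at the bottom shows the table size growing by a factor of about 100 for every sequence added, which is the practical reason most tools fall back on heuristics.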

Do our readers have any other guidance on this?

Teaching moment

* Another student asked  if we know how to find DC-area internships in biological sciences. Another student (mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?

If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!


And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech, and later my family and I toured for 8 days, visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.

Gates and doors of Fes are beautiful

camel excursion to the Sahara





What’s the answer? (duplicate dbSNP IDs)

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question is….

Repeated rsIDs in dbSNP?

Is it possible that rsID is repeated in dbSNP? I recently downloaded dbsNP130 from UCSC came across cases such as

chr10 50325 50326 0 + G G G/T genomic single by-cluster 0 0 unknown exact 3

chr18 4739 4740 + G G G/T genomic single by-cluster 0 0 unknown exac

Is this expected? And what’d be the explanation? Or, have I made an error in downloading parsing file?


I highlighted this one because it seems to come along fairly frequently (as evidenced by Jorge Amigo’s answer). And we find it surprises people who have just noticed that the UCSC Genome Browser is now separating out a set of SNPs from dbSNP into what they call the Mult. SNPs(132) track, which you can see on their browser. I think it’s good to be aware of these SNPs.

Check out the answers in full here.
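If you want to check your own download for this, a short Python sketch along these lines would list the IDs that map to more than one position. The input layout (chrom, start, rs ID) and the rs numbers below are hypothetical; the positions are borrowed from the question above, and you would need to adapt the column handling to the file you actually have.

```python
from collections import defaultdict


def multi_mapped_ids(rows):
    """Return {rs_id: sorted [(chrom, start), ...]} for IDs seen at more than one position."""
    positions = defaultdict(set)
    for chrom, start, rs_id in rows:
        positions[rs_id].add((chrom, int(start)))
    return {rs: sorted(locs) for rs, locs in positions.items() if len(locs) > 1}


if __name__ == "__main__":
    # Hypothetical rows: the rs numbers are made up for illustration.
    example = [
        ("chr10", 50325, "rs0000001"),
        ("chr18", 4739, "rs0000001"),
        ("chr1", 1000, "rs0000002"),
    ]
    for rs_id, locs in multi_mapped_ids(example).items():
        print(rs_id, locs)
```

Using a set per ID means an exact repeat of the same row is not counted as a second location; only genuinely different positions are reported.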

What’s the Answer? disease causing SNPs

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question is….

..which is the best database choice from where i can extract a data set of causative variants and a data set of benign variants (OMIM ,GWAS)…

A perennially favorite question. The accepted answer gives a good rundown of how to go about choosing a database. Another answer points to an earlier discussion with a wealth of databases.

Video Tip of the Week: VnD Resource for Genetic Variation and Drug Information

In today’s tip I am going to feature a resource that I found recently. I’ve been updating our dbSNP tutorial, which Mary & Trey will be presenting at workshops in Morocco, and also our free PDB tutorial, which is sponsored by the RCSB PDB team. I have therefore been thinking about protein structures and small sequence variations a lot lately. As I explored the latest Database issue of NAR looking for resources to do a tip on, I found an article describing the VnD (genetic Variation and Drug) resource, which can also be accessed at the URL www.vandd.org, according to the NAR article. The article is “VnD: a structure-centric database of disease-related SNPs and drugs”, and figure one shows a veritable Who’s Who of protein, variation and disease resources, so I had to investigate.

What I found at VnD made me sure that this was a resource that I wanted to feature in a tip. VnD is from the Korean Bioinformation Center, or KOBIC, which has a list of databases and tools that it provides. I’ll save the rest of the KOBIC resources for another post & concentrate on VnD here. Compiling data from resources such as RefSeq, OMIM, UniProt, PDB, DrugBank, dbSNP, GAD and more might have been cool enough, depending on how it was done, but VnD also does its own structure modeling analysis of how the variation affects the protein structure and drug/ligand binding.

This tip movie isn’t long enough to really show you the breadth of what is available from VnD, but I hope it will be enough to encourage you to read the NAR article (listed below) and to check out VnD. One thing to note: don’t expect to find every dbSNP rs# over there – one that I’ve been using in our tutorial isn’t there. They are specifically interested in variations within genes that might affect drug binding. But hey, you can’t query DrugBank with rs#s, and I’ve never seen structure modeling done the way VnD does it, so it is a worthy resource that you may want to investigate if you are interested in how genetic variations connect with disease and drug therapies.

Quick links:

VnD: Variations and Drugs resource -  http://vnd.kobic.re.kr:8080/VnD/index.jsp

Korean Bioinformation Center (KOBIC) – http://www.kobic.re.kr/

RCSB PDB – http://www.pdb.org

OpenHelix Tutorial on the RCSB PDB – http://www.openhelix.com/pdb

dbSNP: Short Genetic Variations, from NCBI -  http://www.ncbi.nlm.nih.gov/projects/SNP/

OpenHelix Tutorial on NCBI’s dbSNP – http://www.openhelix.com/cgi/tutorialInfo.cgi?id=39

For links to other resources and OpenHelix tutorials mentioned in this post, please see our catalog of resources – http://www.openhelix.com/cgi/tutorials.cgi

Yang, J., Oh, S., Ko, G., Park, S., Kim, W., Lee, B., & Lee, S. (2010). VnD: a structure-centric database of disease-related SNPs and drugs Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq957

dbSNP: no longer single….?

I think this is very interesting: dbSNP has a new logo. dbSNP is no longer “single”. It is keeping dbSNP as a professional name, but it also has a new name for social situations: “Short Genetic Variations”.

I was just checking my twitter feed, and found out something fascinating in the new release.  Here was the item that prompted me to look:

RT @yokofakun: #dbsnp134 has been released: http://www.ncbi.nlm.nih.gov/projects/SNP/docs/build134.txt

Pierre forwarded that notice, and I decided to check out the release notes. Hidden in there is a small piece of information that I think makes a big mental leap for a lot of people….

1) dbSNP logo change (http://www.ncbi.nlm.nih.gov/projects/SNP/)

As there has been confusion about the types of variations dbSNP actually contains, the dbSNP logo text was changed from “Single Nucleotide Polymorphism” to “Short Genetic Variations”. We hope that this change will reflect the wide range of dbSNP’s variation content, and thereby prevent any future misunderstandings.

In spite of its name, dbSNP is not limited to single nucleotide polymorphisms (SNPs), but stores information about multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants. dbSNP also stores common and rare variations along with their genotypes and allele frequencies.

Most importantly, dbSNP includes clinically significant variations, and should NOT be assumed to hold only benign polymorphisms.

Some of that stuff will be obvious to a lot of our readers. But you’d be surprised at what we find in the training rooms. Many people are really shocked to see that dbSNP contains a lot more than just single nucleotide polymorphisms. And we make a point of mentioning that the UCSC Genome Browser calls their SNP track “simple nucleotide polymorphisms” to reflect that idea. For many people in our workshops that’s the first time they have processed that knowledge.

In case you are curious, here’s what an old header looked like at dbSNP (I have taken this from our training materials):

I think this is a great move. Subtle, but great. And they must have thought it was important based on that release note piece. dbSNP is no longer single. I feel like I should send a gift….

dbSNP 132 now at UCSC Genome Browser: important changes

While we were traveling for workshops the other day, there was an announcement from the UCSC Genome Browser team that a lot of people have been waiting for: dbSNP132 can be explored on the browser now. It is available on the hg19 assembly–which is the February 2009 one that you can select in the human genome gateway options.

People were eagerly awaiting this for a couple of reasons–first, a new dbSNP release is always offering new SNPs people might want to explore. But this particular release also has SNPs that people wanted to access from the 1000 Genomes project. Here’s the release announcement from dbSNP that describes it:

  1. Build 132:Human include data from 1000 Genomes project pilot 1, 2, and 3 studies.   All 1000 genomes submissions to dbSNP can be searched by batch or using Entrez search filters….

Note: the dbSNP announcement also offers help on filtering for just those SNPs at NCBI if you want them. This led me to try to filter the 1000Genomes submitter name in the UCSC Table Browser as well. It worked–but I haven’t checked all of it yet, so caveat lector on that right now…But some people might want to do that. You could create a 1000Genomes custom track with that sort of query I think.

Table Browser, Filter for dbSNP submitter field:

And this yielded this sort of output–where the submitter field contains 1000Genomes, but it may also include other submitters:
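For anyone who would rather do this kind of filtering outside the browser, here is a rough Python sketch of the same idea applied to rows exported from the table. It assumes each row is a dict with `chrom`, `chromStart`, `chromEnd`, `name` and a comma-separated `submitters` field, and that the 1000 Genomes handle is written `1000GENOMES`; please check both against your own export before relying on it. It matches whole handles, so rows that also list other submitters still pass, as in the browser output described above.

```python
def has_submitter(submitters_field, handle="1000GENOMES"):
    """True if the comma-separated submitters field lists the given handle."""
    # Splitting (rather than substring matching) avoids partial-name hits
    # and tolerates a trailing comma in the field.
    handles = [h.strip() for h in submitters_field.split(",") if h.strip()]
    return handle in handles


def to_bed_lines(rows, handle="1000GENOMES"):
    """Turn matching rows into minimal BED lines: chrom, start, end, name."""
    return [
        f"{r['chrom']}\t{r['chromStart']}\t{r['chromEnd']}\t{r['name']}"
        for r in rows
        if has_submitter(r["submitters"], handle)
    ]


if __name__ == "__main__":
    # Hypothetical rows; the rs numbers and handles are for illustration only.
    rows = [
        {"chrom": "chr1", "chromStart": 100, "chromEnd": 101,
         "name": "rs0000001", "submitters": "1000GENOMES,OTHERLAB,"},
        {"chrom": "chr1", "chromStart": 200, "chromEnd": 201,
         "name": "rs0000002", "submitters": "OTHERLAB,"},
    ]
    print("\n".join(to_bed_lines(rows)))
```

The resulting lines are in the four-column BED layout, which is one of the formats the browser accepts for custom tracks.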

But another really important aspect of the 132 build in the context of the UCSC Genome Browser is that they have changed the way they are offering the SNPs to you. In the past the SNPs have always been in one big bucket. But now they have separated them out into 4 options: Common SNPs, Flagged SNPs, Multiple location mapped SNPs (Mult. SNPs), and All SNPs. So the menus on the browser look like this now:

Key point: the Common SNPs are on by default. If you want All SNPs (or any of the others) you will have to specifically make that choice. Also remember in your table browser queries to make the appropriate selection.

This is a nice option that people have been asking for. But it does represent a change from the way they have been offered before, so be sure you know which SNP subset you want to explore and make the right choice.

PS: I was going to make a custom track of 1000Genomes to load up for anyone as a public service, but I crashed the browser. I may try again later. I think it would be a handy track to have to load up. If someone else gets to it first, let me know and give me your session link and I’ll add it.

PPS: If you don’t know how to navigate around UCSC, change the menu options, or do Table Browser queries, check out the tutorials that we have that are sponsored by UCSC and are freely available: http://openhelix.com/ucsc

UniSNP database

There have been a bunch of tweets lately around the UniSNP database–so I thought I’d do a quick post to raise awareness of that. The mission of UniSNP stated on their homepage at NHGRI is:

UniSNP is a database of uniquely mapped SNPs from dbSNP (build 129) and HapMap (release 27), where differences in SNP positions and names have been resolved, insofar as possible. In addition, SNPs are annotated with various functional characteristics, based on overlap with tracks from the UCSC browser. For details, see [PUB CITATION].

Well, I went looking for a [PUB CITATION] in PubMed for this. I entered the text UniSNP. I got a bunch of results. But that’s because….

Your search for unisnp retrieved no results. However, a search for unison retrieved the following items.

Unison? Um. Ok.

Anyway: the bioinformatics folks seem interested in this resource. So maybe others will be as well. It does offer you the opportunity to look for unique SNPs, using the UCSC assembly hg18/NCBI36. You can search by regions, or by starting with a list of SNPs. It gives you a dozen ways to filter the SNPs for things that might be of interest to you (RefSeq transcript characteristics, HapMap-ness, VISTA enhancer regions, etc).

I would probably accomplish this with a UCSC Table Browser query myself. But if you haven’t had a chance to get familiar with how to use that yet, this form would be a quick way to get similar answers.

Quick links

UniSNP: http://research.nhgri.nih.gov/tools/unisnp/

UCSC Table Browser tutorial: http://openhelix.com//cgi/tutorialInfo.cgi?id=28

The Table Browser tutorial is freely available to everyone as UCSC sponsors that. It’s the same material that we use in our live workshops, with the slides, handouts, and exercises available for anyone to use.

Here’s the tweet that’s going around if you’d like to re-tweet; hat tip to Khader:

@kshameer: UniSNP: uniquely mapped SNPs from dbSNP (build 129) and HapMap (release 27) http://1.usa.gov/gE3Ou0 #genomics #bioinformatics

Tip of the Week: Genome Variation Tour III

Today’s tip is the continuation of researching a single SNP in an individual genome. Trey will use a dbSNP rs ID to find linkage disequilibrium information between a SNP of interest and SNPs in the region easily and quickly. He uses GVS, the Genome Variation Server at the University of Washington, to analyze a dbSNP rs ID of your choice. This 3 minute screencast will show you how to use the GVS tool to quickly get this information for a wide range of populations.
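If linkage disequilibrium is new to you, this small Python sketch shows the textbook calculation of D and r² for two biallelic SNPs from haplotype and allele frequencies. It is only meant to illustrate what the numbers mean; the function name is made up here, and GVS works from genotype data and may compute its values differently.

```python
def ld_d_and_r2(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    d = p_ab - p_a * p_b  # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic (0 < frequency < 1)")
    return d, d * d / denom


if __name__ == "__main__":
    # Alleles always travel together: complete LD.
    print(ld_d_and_r2(p_ab=0.5, p_a=0.5, p_b=0.5))
    # Alleles are independent: no LD.
    print(ld_d_and_r2(p_ab=0.25, p_a=0.5, p_b=0.5))
```

An r² near 1 means one SNP is a good proxy for the other, while an r² near 0 means knowing one tells you little about the other.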