After again reading Daniel MacArthur’s good rundown about the state of databases of human disease-causing variation from last year (One database to hold them all), I thought it might be nice to do a tip comparing several of them. I couldn’t get it under our self-imposed 5 minute limit for our tips (and the technical limit of the software I’m using, but that’s about to change). But as I perused our tips and other sites, I found we and others have quite a list of how-to tips to use these databases. So in today’s tip I’ve gathered video tips for 3 of the databases listed in the linked post. Below those tips I’ll link to other how-to videos for additional human variation and disease databases.
HGMD has a public site and a by-subscription site. The latter includes access to the most current data and some added features. The publicly accessible site is out-of-date by three years. Because of HGMD restrictions, we aren’t able to do a tutorial or a tip on HGMD, but they do have an introduction video to their database:
Another excellent resource is Gen2Phen. The Gen2Phen project “aims to unify human and model organism genetic variation databases towards increasingly holistic views into Genotype-To-Phenotype (G2P) data, and to link this system into other biomedical knowledge sources via genome browser functionality.” In that vein, they have quite an extensive list of Locus-specific databases and additional resources.
I just had a chance to watch the video, and now I can see why they were impressed! Over the years in the workshops we do, people have asked questions in various theme groups. For a while it was lists of genes and microarrays. Then it was known SNP variations. Then it became transcription factor binding sites. Lately it’s been: I have a giant set of sequence data that I need to process to find new variants that might impact genes. How do I do that? This video tip-of-the-week will help you to understand how to do that.
This video was part of a day of lectures at the NHGRI about how to deal with exome sequencing data: Next-Gen 101: Video Tutorial on Conducting Whole-Exome Sequencing Research. There is a whole series of video and slide material available from NHGRI’s page, and the one I’m highlighting here is number 3 on that list. Be sure to download the slides if you want to take notes and access the references and URLs that are key to the material.
Jamie Teer gives a terrific talk about dealing with the exome sequence data output that next-gen projects are yielding. It starts with just managing and viewing the reads, and he highlights a couple of different ways to do this. These include SAMtools, and he also shows how the reads look in both the UCSC Genome Browser and the Broad’s Integrative Genomics Viewer, IGV. It’s nice to see a comparison of these to illustrate what you might expect to see. We could help you to understand how to load this kind of data as custom tracks in the UCSC Genome Browser with our advanced tutorial, and you’ll find some nice guidance on what to expect from IGV from the paper listed below in the references area.
The video also describes annotation software that helps you to identify where the variations and consequences are in the data. Many of these tools we have talked about either in our tutorials or our other tips-of-the-week.
He also describes how people generate pipelines to flow the data through a series of steps to do the analysis. Sometimes these are home-made programs used by a local group. But he also mentioned how Galaxy can help to accomplish this now. We’ve been fans of Galaxy for a long time, and we know people are using it in exactly this manner.
You still should have a basic understanding of all the tools individually if you want to use them all, or tools that incorporate them all into workflows/processes, though. It will help you to create better workflows/pipelines. And it also matters that you know what you aren’t seeing/using.
Teer closes by introducing the VarSifter software that he’s been involved with creating. This software is freely available for you to download at the VarSifter site. Usually we prefer to highlight web-based interfaces, but there isn’t one for VarSifter. But if you see the utility in it you can also try to get a local copy set up for yourself. VarSifter will help you to view, sort, and filter variants in a lot of ways.
So have a look at this video if you are interested in understanding how these analyses are done, and if you are interested in knowing more about the tools that can be used. It’s worth the 40 minutes–really.
IGV: Robinson, J., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E., Getz, G., & Mesirov, J. (2011). Integrative genomics viewer. Nature Biotechnology, 29 (1), 24-26 DOI: 10.1038/nbt.1754
UCSC new paper: Dreszer, T., Karolchik, D., Zweig, A., Hinrichs, A., Raney, B., Kuhn, R., Meyer, L., Wong, M., Sloan, C., Rosenbloom, K., Roe, G., Rhead, B., Pohl, A., Malladi, V., Li, C., Learned, K., Kirkup, V., Hsu, F., Harte, R., Guruvadoo, L., Goldman, M., Giardine, B., Fujita, P., Diekhans, M., Cline, M., Clawson, H., Barber, G., Haussler, D., & Kent, W. J. (2011). The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Research DOI: 10.1093/nar/gkr1055
SAMtools: Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25 (16), 2078-2079 DOI: 10.1093/bioinformatics/btp352
As last year, we were impressed with the students (there were 117 total, about 50/50 gender ratio). English is their 3rd or 4th language in most cases, Moroccan Arabic, French or various African languages being their language of choice. Yet, they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.
[Photo: The workshop students]
We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.
Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some answers from our readers:
* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space: an exact dynamic programming solution becomes too computationally intensive as more sequences are added. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProps and this list at Wikipedia.
Do our readers have any other guidance on this?
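To make the cost argument a little more concrete, here is a minimal sketch of the dynamic programming recurrence for just two sequences (Needleman-Wunsch global alignment). The scoring values (match +1, mismatch -1, gap -1) and the function names are my own assumptions for illustration, not something taken from the slide set or from any particular tool. The second function simply counts table cells under the assumption that every sequence has the same length: two sequences need a 2-dimensional table, and each additional sequence adds a dimension.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (Needleman-Wunsch).

    Only the score is returned, so two rows of the table are enough.
    The scoring values are illustrative assumptions.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def dp_cells(seq_len, n_seqs):
    """Number of cells in a full DP table for n_seqs sequences of equal length."""
    return (seq_len + 1) ** n_seqs


if __name__ == "__main__":
    print(nw_score("GATTACA", "GATCACA"))
    for k in (2, 3, 5, 10):
        print(k, "sequences of length 100:", dp_cells(100, k), "cells")
```

The cell count grows exponentially with the number of sequences, which is the practical reason most multiple alignment tools fall back on heuristics.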
* Another student asked if we know how to find DC-area internships in biological sciences. Another student (mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?
If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!
And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech and later my family and I toured for 8 days visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.
BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.
I highlighted this one because it seems to come along fairly frequently (as evidenced by Jorge Amigo’s answer). And we find it surprises people who have just noticed that the UCSC Genome Browser is now separating out a set of SNPs from dbSNP into what they call the Mult. SNPs(132) track, which you can see on their browser. I think it’s good to be aware of these SNPs.
What I found at VnD made me sure that this was a resource that I wanted to feature in a tip. VnD is from the Korean Bioinformation Center, or KOBIC, which has a list of databases and tools that they provide. I’ll save the rest of the KOBIC resources for another post & concentrate on VnD here. Compiling data from resources such as RefSeq, OMIM, UniProt, PDB, DrugBank, dbSNP, GAD and more might have been cool enough, depending on how it was done, but VnD also does its own structure modeling analysis on how the variation affects the protein structure and drug/ligand binding.
This tip movie isn’t long enough to really show you the breadth of what is available from VnD, but I hope it will be enough to encourage you to read the NAR article (listed below), and to check out VnD. One thing to note: don’t expect to find every dbSNP rs# there – one that I’ve been using in our tutorial is missing. They are specifically interested in variations within genes that might affect drug binding. But hey, you can’t query DrugBank with rs#s, and I’ve never seen the structure modeling done like VnD, so it is a worthy resource that you may want to investigate if you are interested in how genetic variations connect with disease and drug therapies.
Reference: Yang, J., Oh, S., Ko, G., Park, S., Kim, W., Lee, B., & Lee, S. (2010). VnD: a structure-centric database of disease-related SNPs and drugs Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq957
As there has been confusion about the types of variations dbSNP actually contains, the dbSNP logo text was changed from “Single Nucleotide Polymorphism” to “Short Genetic Variations”. We hope that this change will reflect the wide range of dbSNP’s variation content, and thereby prevent any future misunderstandings.
In spite of its name, dbSNP is not limited to single nucleotide polymorphisms (SNPs), but stores information about multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants. dbSNP also stores common and rare variations along with their genotypes and allele frequencies.
Most importantly, dbSNP includes clinically significant variations, and should NOT be assumed to hold only benign polymorphisms.
Some of that stuff will be obvious to a lot of our readers. But you’d be surprised at what we find in the training rooms. Many people are really shocked to see that dbSNP contains a lot more than just single nucleotide polymorphisms. And we make a point of mentioning that the UCSC Genome Browser calls their SNP track “simple nucleotide polymorphisms” to reflect that idea. For many people in our workshops that’s the first time they have processed that knowledge.
In case you are curious, here’s what an old header looked like at dbSNP (I have taken this from our training materials):
I think this is a great move. Subtle, but great. And they must have thought it was important based on that release note piece. dbSNP is no longer single. I feel like I should send a gift….
While we were traveling for workshops the other day, there was an announcement from the UCSC Genome Browser team that a lot of people have been waiting for: dbSNP132 can be explored on the browser now. It is available on the hg19 assembly–which is the February 2009 one that you can select in the human genome gateway options.
People were eagerly awaiting this for a couple of reasons–first, a new dbSNP release is always offering new SNPs people might want to explore. But this particular release also has SNPs that people wanted to access from the 1000 Genomes project. Here’s the release announcement from dbSNP that describes it:
Build 132:Human include data from 1000 Genomes project pilot 1, 2, and 3 studies. All 1000 genomes submissions to dbSNP can be searched by batch or using Entrez search filters….
Note: the dbSNP announcement also offers help on filtering for just those SNPs at NCBI if you want them. This led me to try to filter on the 1000Genomes submitter name in the UCSC Table Browser as well. It worked, but I haven’t checked all of it yet, so caveat lector on that right now. But some people might want to do that. You could create a 1000Genomes custom track with that sort of query, I think.
Table Browser, Filter for dbSNP submitter field:
And this yielded this sort of output–where the submitter field contains 1000Genomes, but it may also include other submitters:
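If you would rather post-process a downloaded table than rely on the filter form, the same idea can be sketched in a few lines of Python. Everything specific here is an assumption on my part: the column names (chrom, chromStart, chromEnd, name, submitters), the comma-separated layout of the submitters field, the submitter handle, and the sample rows, which are made up. Check the header of your own Table Browser output before using anything like this.

```python
import csv

# Made-up rows in an assumed tab-delimited layout; the rs IDs are not real data.
SAMPLE = (
    "chrom\tchromStart\tchromEnd\tname\tsubmitters\n"
    "chr1\t100\t101\trs0000001\t1000GENOMES,OTHER_LAB,\n"
    "chr1\t250\t251\trs0000002\tOTHER_LAB,\n"
    "chr2\t900\t901\trs0000003\t1000genomes,\n"
)


def rows_with_submitter(lines, wanted="1000GENOMES"):
    """Yield rows whose comma-separated 'submitters' field lists `wanted`.

    Matching is case-insensitive and on whole handles, so a row can still
    carry other submitters alongside the one requested.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        handles = [s.strip().upper() for s in row["submitters"].split(",") if s.strip()]
        if wanted.upper() in handles:
            yield row


def to_bed(row):
    """Format a row as a 4-column BED line (chrom, start, end, name)."""
    return "\t".join([row["chrom"], row["chromStart"], row["chromEnd"], row["name"]])


if __name__ == "__main__":
    print('track name="1000Genomes_subset" description="Filtered by submitter"')
    for row in rows_with_submitter(SAMPLE.splitlines()):
        print(to_bed(row))
```

The printed lines have the shape of a simple BED custom track, which is one way to carry the filtered subset back to the browser.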
But another really important aspect of the 132 build in the context of the UCSC Genome Browser is that they have changed the way they are offering the SNPs to you. In the past the SNPs have always been in one big bucket. But now they have separated them out into 4 options: Common SNPs, Flagged SNPs, Multiple location mapped SNPs (Mult. SNPs), and All SNPs. So the menus on the browser look like this now:
Key point: the Common SNPs are on by default. If you want All SNPs (or any of the others) you will have to specifically make that choice. Also remember in your table browser queries to make the appropriate selection.
This is a nice option that people have been asking for. But it does represent a change from the way they have been offered before, so be sure you know which SNP subset you want to explore and make the right choice.
PS: I was going to make a custom track of 1000Genomes to load up for anyone as a public service, but I crashed the browser. I may try again later. I think it would be a handy track to have to load up. If someone else gets to it first, let me know and give me your session link and I’ll add it.
PPS: If you don’t know how to navigate around UCSC, change the menu options, or do Table Browser queries, check out the tutorials that we have that are sponsored by UCSC and are freely available: http://openhelix.com/ucsc
There have been a bunch of tweets lately around the UniSNP database–so I thought I’d do a quick post to raise awareness of that. The mission of UniSNP stated on their homepage at NHGRI is:
UniSNP is a database of uniquely mapped SNPs from dbSNP (build 129) and HapMap (release 27), where differences in SNP positions and names have been resolved, insofar as possible. In addition, SNPs are annotated with various functional characteristics, based on overlap with tracks from the UCSC browser. For details, see [PUB CITATION].
Well, I went looking for a [PUB CITATION] in PubMed for this. I entered the text UniSNP. I got a bunch of results. But that’s because….
Your search for unisnp retrieved no results. However, a search for unison retrieved the following items.
Unison? Um. Ok.
Anyway: the bioinformatics folks seem interested in this resource. So maybe others will be as well. It does offer you the opportunity to look for unique SNPs, using the UCSC assembly hg18/NCBI36. You can search by regions, or by starting with a list of SNPs. It gives you a dozen ways to filter the SNPs for things that might be of interest to you (RefSeq transcript characteristics, HapMap-ness, VISTA enhancer regions, etc).
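For a feel of what "uniquely mapped" means in practice, here is a small sketch that splits a list of SNP records into those seen at exactly one genomic location and those seen at several. This is only an illustration under my own assumptions (the record layout, the function name, and the made-up rs IDs); it is not how UniSNP builds its database, which also reconciles names and positions between dbSNP and HapMap.

```python
from collections import defaultdict


def split_by_mapping(records):
    """Split SNPs into uniquely mapped and multiply mapped.

    records: iterable of (rs_id, chrom, position) tuples.
    Returns (unique, multi), where unique maps rs_id -> (chrom, position)
    and multi maps rs_id -> sorted list of (chrom, position).
    """
    locations = defaultdict(set)
    for rs_id, chrom, pos in records:
        locations[rs_id].add((chrom, pos))  # repeated identical rows collapse

    unique = {rs: next(iter(locs)) for rs, locs in locations.items() if len(locs) == 1}
    multi = {rs: sorted(locs) for rs, locs in locations.items() if len(locs) > 1}
    return unique, multi


if __name__ == "__main__":
    # Made-up example records; the rs IDs and coordinates are not real data.
    example = [
        ("rs0000001", "chr1", 1000),
        ("rs0000002", "chr2", 5000),
        ("rs0000002", "chr7", 4200),
        ("rs0000001", "chr1", 1000),
    ]
    unique, multi = split_by_mapping(example)
    print("unique:", unique)
    print("multi:", multi)
```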
I would probably accomplish this with a UCSC Table Browser query myself. But if you haven’t had a chance to get familiar with how to use that yet, this form would be a quick way to get similar answers.
The Table Browser tutorial is freely available to everyone as UCSC sponsors that. It’s the same material that we use in our live workshops, with the slides, handouts, and exercises available for anyone to use.
Here’s the tweet that’s going around if you’d like to re-tweet; hat tip to Khader:
Today’s tip is the continuation of researching a single SNP in an individual genome. Trey will use a dbSNP rs ID to find linkage disequilibrium information between a SNP of interest and SNPs in the region easily and quickly. He uses GVS, the Genome Variation Server at the University of Washington, to analyze a dbSNP rs ID of your choice. This 3 minute screencast will show you how to use the GVS tool to quickly get this information for a wide range of populations.
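If linkage disequilibrium is new to you, the numbers behind it are simple to compute for a pair of biallelic SNPs once you have phased haplotypes. The sketch below uses the standard definitions D = p(AB) - p(A)p(B) and r² = D² / (p(A)(1 - p(A)) p(B)(1 - p(B))). It is a teaching example with made-up haplotypes and an assumed input layout, not a description of how GVS calculates its values.

```python
def ld_stats(haplotypes):
    """Return (D, r_squared) for two biallelic sites.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs, one pair per
    phased chromosome. The alleles on the first haplotype are used as the
    reference alleles; this choice changes the sign of D but not r_squared.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    a_ref, b_ref = haplotypes[0]

    p_a = sum(1 for a, _ in haplotypes if a == a_ref) / n
    p_b = sum(1 for _, b in haplotypes if b == b_ref) / n
    p_ab = sum(1 for a, b in haplotypes if a == a_ref and b == b_ref) / n

    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d, d * d / denom


if __name__ == "__main__":
    # Made-up haplotypes: the two sites always travel together here.
    linked = [("A", "G")] * 5 + [("C", "T")] * 5
    print(ld_stats(linked))
```

An r² near 1 means the two SNPs carry almost the same information, while an r² near 0 means knowing one tells you little about the other.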