This week’s SNPpets include real-time visualization of Ebola spread, precision medicine informatics, big capacity for whole genomes, “genetobollocks” for a new description of media coverage of genomics papers, Neanderthal pathogenic variants, and re-examining old problems on a couple of matters.
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
This week’s tip of the week is on GEMINI, an acronym for “GEnome MINIng.” Unlike most of the tips we give every week, this one is a software package. But it does use and integrate with many internet databases such as dbSNP, ENCODE, UCSC, ClinVar and KEGG. It’s also a freely available, open source tool, and a useful one: it gives the researcher the ability to create quite complex queries based on genotypes, inheritance patterns, etc. The above 12 minute clip is a conference talk that gives an introduction to the science behind the tool.
Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI’s utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics.
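To give a feel for what “complex queries based on sample genotypes, inheritance patterns, and annotations” can mean once variants and annotations live in one database, here is a minimal sketch in Python using the standard-library sqlite3 module. The table names, column names, sample names and data are all made up for illustration; this is not GEMINI’s actual schema or command line, just the general idea of combining an annotation filter with a recessive inheritance pattern in a single query.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
# These tables and columns are assumptions, not GEMINI's real layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (
    variant_id INTEGER PRIMARY KEY,
    chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
    gene TEXT, impact TEXT, in_dbsnp INTEGER
);
CREATE TABLE genotypes (
    variant_id INTEGER, sample TEXT, alt_count INTEGER
);
""")

# Made-up variants with made-up annotations.
conn.executemany("INSERT INTO variants VALUES (?,?,?,?,?,?,?,?)", [
    (1, "chr1", 1000, "A", "G", "GENE1", "missense", 0),
    (2, "chr1", 2000, "C", "T", "GENE1", "synonymous", 1),
    (3, "chr2", 3000, "G", "A", "GENE2", "missense", 0),
])

# alt_count: 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
conn.executemany("INSERT INTO genotypes VALUES (?,?,?)", [
    (1, "child", 2), (1, "mother", 1), (1, "father", 1),
    (2, "child", 2), (2, "mother", 1), (2, "father", 1),
    (3, "child", 1), (3, "mother", 0), (3, "father", 1),
])

# Recessive pattern: child homozygous alternate, both parents heterozygous,
# restricted to missense variants that are not already in dbSNP.
RECESSIVE_QUERY = """
SELECT v.variant_id, v.gene
FROM variants v
JOIN genotypes c ON c.variant_id = v.variant_id
                AND c.sample = :child AND c.alt_count = 2
JOIN genotypes m ON m.variant_id = v.variant_id
                AND m.sample = :mother AND m.alt_count = 1
JOIN genotypes f ON f.variant_id = v.variant_id
                AND f.sample = :father AND f.alt_count = 1
WHERE v.impact = 'missense' AND v.in_dbsnp = 0
ORDER BY v.variant_id
"""

candidates = conn.execute(
    RECESSIVE_QUERY,
    {"child": "child", "mother": "mother", "father": "father"},
).fetchall()

for variant_id, gene in candidates:
    print(variant_id, gene)
```

With this toy data only the first variant survives: the second is synonymous and already in dbSNP, and the third is heterozygous in the child.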
While I’m at it, and totally unrelated except it’s human genomics, there is this slideshare presentation of the ‘current’ state of personal genomics. Current is in quotes because the slideshare is actually from 3 years ago, but there is a lot of good information in there. Anyone know of a more up-to-date slide set or extensive intro to the current state of personal genomics science similar to this?
In case you aren’t on the UCSC announcement mailing list, and you don’t go to the site via their homepage with the posted news–you should know about this new tool at the UCSC Genome Browser. It will take variations that you are exploring and make a prediction about whether the variant is associated with a function, and potentially if it is damaging to a protein. It’s under active development, so try it out. And if there are features you could use, suggest them. See the VAI page for more.
Here are the details via their email, but you can also sign up for the “announce” mailing list to get news like this in your inbox:
In order to assist researchers in annotating and prioritizing thousands
of variant calls from sequencing projects, we have developed the Variant
Annotation Integrator (VAI). Given a set of variants uploaded as a
custom track (in either pgSnp or VCF format), the VAI will return the
predicted functional effect (e.g., synonymous, missense, frameshift,
intronic) for each variant. The VAI can optionally add several other
types of relevant information, including: the dbSNP identifier if the
variant is found in dbSNP, protein damage scores for missense variants
from the Database of Non-synonymous Functional Predictions (dbNSFP), and
conservation scores computed from multi-species alignments. The VAI also
offers filters to help narrow down results to the most interesting variants.
Future releases of the VAI will include more input/upload options,
output formats, and annotation options, and a way to add information
from any track in the Genome Browser, including custom tracks.
There are two ways to navigate to the VAI: (1) From the “Tools” menu,
follow the “Variant Annotation Integrator” link. (2) After uploading a
custom track, hit the “go to variant annotation integrator” button. The
user’s guide is at the bottom of the page, under “Using the Variant
Before I discuss NCBI’s 1000 Genomes Dataset Browser, I’d like to spend a bit of time on the 1000 Genomes project, in order to distinguish what is from NCBI and what is from the project itself. From the 1000 Genomes Pilot paper:
“The aim of the 1000 Genomes Project is to discover, genotype and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations. Specifically, the goal is to characterize over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have allele frequency of 1% or higher (the classical definition of polymorphism) in each of five major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa and the Americas).”
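The 1% cutoff in that quote is easy to make concrete. Here is a small sketch, assuming diploid genotypes coded as the number of alternate alleles per individual (0, 1 or 2); the population labels and genotype counts are invented for the example and are not 1000 Genomes data.

```python
def alt_allele_frequency(genotypes):
    """Return the alternate-allele frequency for diploid genotypes.

    Each genotype is the count of alternate alleles in one individual
    (0, 1 or 2), so the total number of alleles is twice the sample size.
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    return sum(genotypes) / (2 * len(genotypes))


def is_polymorphism(genotypes, threshold=0.01):
    """Apply the classical definition: minor allele frequency >= 1%."""
    freq = alt_allele_frequency(genotypes)
    minor = min(freq, 1 - freq)
    return minor >= threshold


# Invented example data: 100 individuals per (hypothetical) population.
population_genotypes = {
    "population_A": [0] * 97 + [1] * 3,   # 3 alt alleles out of 200
    "population_B": [0] * 99 + [1] * 1,   # 1 alt allele out of 200
}

passing = {
    name: is_polymorphism(genotypes)
    for name, genotypes in population_genotypes.items()
}

for name, ok in passing.items():
    print(name, "meets the 1% threshold" if ok else "is below the 1% threshold")
```

The same site can pass in one population and fail in another, which is why the frequency is evaluated within each population group.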
You can access the full paper from the link below. The project has now moved past the pilot phase and is releasing new data all the time. You can see announcements and project details, or access that data, through the official 1000 Genomes project site, or through the official 1000 Genomes version of the Ensembl Browser. As you might imagine for a “big data” project such as this, data has been added to a variety of NCBI databases, including dbSNP, the Sequence Read Archive (SRA) and BioSample. Although you could search for this data through the universal Entrez search system, previously to view the data you would have to view individual results at each separate database. The 1000 Genomes Browser at NCBI has been created as a powerful interface for comprehensively searching for, and viewing, 1000 Genomes data contained in NCBI resources on a single page.
In the video tip I will familiarize you with the various areas of the page - the browser is built from a series of widgets, each with its own function. I will not be able to cover all of the features, or demonstrate how users can upload their own variation data to the browser – I’ll leave you the fun of exploring those on your own. Because the tool is so young, bugs and suggestions/comments are still being actively requested – if you find something, check out the FAQs (which discuss bugs at various stages of being fixed) and then email the team.
There’s a meeting going on today that people might be interested in following if you are interested in analysis of human variation. Here’s how Chris Gunter described it on G+ last night:
For serious genetics geeks: the meeting organized by +Daniel MacArthur and myself (with lots of help from colleagues!), Implicating Sequence Variants in Human Disease, is streaming live here. On til 9 tonight and most of tomorrow during working day.
They have been talking about how they do analysis at their sites, the needs for new databases to support what they are finding, better ways to report variation in the literature, and more.
One of the most frequent questions we hear when we do workshops is: how do I find out if this SNP has an effect on my favorite protein? Well, that’s assuming it is a coding SNP. Of course, promoter SNPs and splicing SNPs and other features would be great to assess as well. Right now, though, the most mature tools are those that look at the effects of variation on the coding of the amino acids in proteins.
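As a rough illustration of the very first step in that kind of analysis, here is a sketch that asks whether a single-base change in a coding sequence alters the encoded amino acid. It uses the standard genetic code and a made-up coding sequence; the function name is hypothetical, and it is nowhere near what the prediction tools actually do (they go on to estimate whether a change is damaging), but it shows where “synonymous” and “missense” come from.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(cds, pos, alt):
    """Classify a single-base substitution in a coding sequence.

    cds: uppercase coding sequence that starts at the first base of a codon
    pos: 0-based position of the changed base within cds
    alt: the new base at that position
    """
    codon_start = pos - pos % 3
    offset = pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"


# Made-up coding sequence: ATG GAA GTT TAA (Met, Glu, Val, stop).
cds = "ATGGAAGTTTAA"
effects = [
    classify_coding_snv(cds, 5, "G"),  # GAA -> GAG, still Glu
    classify_coding_snv(cds, 4, "T"),  # GAA -> GTA, Glu -> Val
    classify_coding_snv(cds, 3, "T"),  # GAA -> TAA, Glu -> stop
]
print(effects)
```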
We’ve talked before about some tools for this, including PolyPhen2 and SIFT. Each of them will offer different algorithms and options that might help you to explore your SNPs. But another tool is available that you should check out as well: SNPeffect 4.0.
SNPeffect isn’t new–this team has been developing it for a while. But their recent paper that describes new features in the 4.0 version spurred me to have a new look at it. There are some foundational things that are important to know about the data collection in their database. It’s not just a re-hash of dbSNP–it actually relies on another source of variation data. They use the UniProt collection of human proteins as the starting point. If you haven’t used UniProt much, you might not be aware of how much protein variation they catalog and store (we cover this in our tutorial*). The SNPeffect team takes those variations and evaluates the impact they have on a protein with a variety of algorithms. Some of the variations will correspond to dbSNP entries–but not all of them do. You may find things here that you won’t find in dbSNP. So I would say it’s worth exploring your proteins of interest here as well.
The algorithms they use provide information on a number of features of the protein. TANGO and WALTZ assess protein aggregation and amyloid formation. LIMBO evaluates chaperone binding. Structural stability is predicted by FoldX (if a suitable structure is available). They also use SMART* and Pfam* to see if the variation occurs within domains of the protein. There are some other tools with more protein features examined as well. Check out the paper for more details.
You can also submit proteins of interest to their analysis suite from the “Submit a new SNPeffect job” links.
A new feature highlighted in their paper is the opportunity to do a Meta-analysis on groups of variations. You can explore the features of sets of variants in this way, using the different algorithms they offer.
This short video examines the pipeline, the basic interface, and a couple of sample pages. But you’ll want to go over and try a lot more to learn about your favorite proteins. There’s a lot of information that can come out of this that you might not have known before. Check it out.
*OpenHelix tutorials for these resources available for individual purchase or through a subscription
As you may know, we’ve been doing these video tips-of-the-week for FOUR years now. We have completed around 200 little tidbit introductions to various resources from last year, 2011 (yep, it’s 2012 now). At the end of the year we’ve established a sort of holiday tradition: we are doing a summary post to collect them all. If you have missed any of them it’s a great way to have a quick look at what might be useful to your work.
Trey introduced me to this “decent collection of video tutorials” from Ensembl, but he and Mary are currently in Morocco teaching a 3-day bioinformatics workshop & then attending the conference (yes, I am envious!). I am therefore creating this week’s tip based on the tutorials that Trey pointed me to. In today’s tip I am going to parallel a tutorial available from Ensembl on SNP information in order to both: 1) show you how you can access variation information from Ensembl and 2) compare doing these steps using Ensembl 64 (here in this video) and using Ensembl 54 (archived) (in the Ensembl video).
Bioscience resources are often continuously being developed and improved & it can be difficult to keep videos and documentation up-to-date. That’s why here at OpenHelix we work continuously to keep our materials up-to-date, with weekly tips on new features and updated tutorials as updated sites become stable.
The Ensembl video (SNPs and other Variations – 1 of 2) is quite nice & provides more detail about the actual Ensembl data than I can in my short movie, but it was done a few years ago on an older version of Ensembl. Since then the resource has been updated, and gone through several new versions of the data. I’m going to follow the same steps that are done in part one of the Ensembl SNP tutorial so that you can see examples of what’s changed & what is pretty much the same. I’d suggest you watch both videos back-to-back to get a good idea of what’s changed, and what types of variation information are available from Ensembl. From that basis I’m sure you’ll be able to watch Ensembl’s second SNP video & apply it to using the current version of Ensembl without much trouble. For more details you can refer to the most recent Ensembl paper in the NAR database issue, which describes not just variation information but Ensembl as a whole.
What I found at VnD made me sure that this was a resource that I wanted to feature in a tip. VnD is from the Korean Bioinformation Center, or KOBIC, which provides a whole list of databases and tools. I’ll save the rest of the KOBIC resources for another post & concentrate on VnD here. Compiling data from resources such as RefSeq, OMIM, UniProt, PDB, DrugBank, dbSNP, GAD and more might have been cool enough, depending on how it was done, but VnD also does its own structure modeling analysis of how the variation affects the protein structure and drug/ligand binding.
This tip movie isn’t long enough to really show you the breadth of what is available from the VnD, but I hope it will be enough to encourage you to read the NAR article (listed below), and to check out VnD. One thing to note: don’t expect to find every dbSNP rs# over there – one that I’ve been using in our tutorial isn’t there. They are specifically interested in variations within genes that might affect drug binding. But hey, you can’t query DrugBank with rs#s, and I’ve never seen the structure modeling done like VnD, so it is a worthy resource that you may want to investigate if you are interested in how genetic variations connect with disease and drug therapies.
Reference: Yang, J., Oh, S., Ko, G., Park, S., Kim, W., Lee, B., & Lee, S. (2010). VnD: a structure-centric database of disease-related SNPs and drugs. Nucleic Acids Research, 39 (Database). DOI: 10.1093/nar/gkq957
As there has been confusion about the types of variations dbSNP actually contains, the dbSNP logo text was changed from “Single Nucleotide Polymorphism” to “Short Genetic Variations”. We hope that this change will reflect the wide range of dbSNP’s variation content, and thereby prevent any future misunderstandings.
In spite of its name, dbSNP is not limited to single nucleotide polymorphisms (SNPs), but stores information about multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants. dbSNP also stores common and rare variations along with their genotypes and allele frequencies.
Most importantly, dbSNP includes clinically significant variations, and should NOT be assumed to hold only benign polymorphisms.
Some of that stuff will be obvious to a lot of our readers. But you’d be surprised at what we find in the training rooms. Many people are really shocked to see that dbSNP contains a lot more than just single nucleotide polymorphisms. And we make a point of mentioning that the UCSC Genome Browser calls their SNP track “simple nucleotide polymorphisms” to reflect that idea. For many people in our workshops that’s the first time they have processed that knowledge.
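If you want to see for yourself how much of a variant set is not a single-base change, a quick tally does the job. Here is a small sketch, assuming VCF-style reference and alternate allele strings (where indels carry a shared leading base); the class labels are my own shorthand for illustration, not dbSNP’s official class vocabulary, and the example alleles are made up.

```python
from collections import Counter


def variation_class(ref, alt):
    """Assign a coarse class to a variant from its ref and alt alleles.

    Alleles are assumed to be VCF-style strings, so an insertion or
    deletion includes the anchoring base shared by both alleles.
    """
    if not ref or not alt:
        raise ValueError("ref and alt must be non-empty")
    if ref == alt:
        raise ValueError("ref and alt are identical; not a variant")
    if len(ref) == 1 and len(alt) == 1:
        return "single nucleotide variation"
    if len(ref) == len(alt):
        return "multi-nucleotide variation"
    if len(ref) < len(alt):
        return "insertion"
    return "deletion"


# Made-up alleles for illustration.
example_variants = [
    ("A", "G"),
    ("C", "T"),
    ("A", "AT"),
    ("ATG", "A"),
    ("AC", "GT"),
]

tally = Counter(variation_class(ref, alt) for ref, alt in example_variants)

for label, count in tally.most_common():
    print(label, count)
```

Note that a microsatellite would show up here as an insertion or deletion; telling those apart needs the surrounding sequence, which this sketch does not look at.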
In case you are curious, here’s what an old header looked like at dbSNP (I have taken this from our training materials):
I think this is a great move. Subtle, but great. And they must have thought it was important based on that release note piece. dbSNP is no longer single. I feel like I should send a gift….