Tag Archives: variation

dbSNP: no longer single….?

I think this is very interesting–dbSNP has a new logo. dbSNP is no longer “single”. It is keeping dbSNP as its professional name, but it also has a new name for social situations: “Short Genetic Variations”.

I was just checking my twitter feed, and found out something fascinating in the new release.  Here was the item that prompted me to look:

RT @yokofakun: #dbsnp134 has been released: http://www.ncbi.nlm.nih.gov/projects/SNP/docs/build134.txt

Pierre forwarded that notice, and I decided to check out the release notes. Hidden in there is a small piece of information that I think makes a big mental leap for a lot of people….

1) dbSNP logo change (http://www.ncbi.nlm.nih.gov/projects/SNP/)

As there has been confusion about the types of variations dbSNP actually contains, the dbSNP logo text was changed from “Single Nucleotide Polymorphism” to “Short Genetic Variations”. We hope that this change will reflect the wide range of dbSNP’s variation content, and thereby prevent any future misunderstandings.

In spite of its name, dbSNP is not limited to single nucleotide polymorphisms (SNPs), but stores information about multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants. dbSNP also stores common and rare variations along with their genotypes and allele frequencies.

Most importantly, dbSNP includes clinically significant variations, and should NOT be assumed to hold only benign polymorphisms.

Some of that stuff will be obvious to a lot of our readers. But you’d be surprised at what we find in the training rooms. Many people are really shocked to see that dbSNP contains a lot more than just single nucleotide polymorphisms. And we make a point of mentioning that the UCSC Genome Browser calls their SNP track “simple nucleotide polymorphisms” to reflect that idea. For many people in our workshops that’s the first time they have processed that knowledge.

In case you are curious, here’s what an old header looked like at dbSNP (I have taken this from our training materials):

I think this is a great move. Subtle, but great. And they must have thought it was important based on that release note piece. dbSNP is no longer single. I feel like I should send a gift….

SNPTips update (1.1)

I did a tip of the week on SNPTips a few months ago (more information there). It’s a great add-on for viewing your genomic data while browsing databases and web sites. They’ve moved to version 1.1. There are two nice new features and some bug fixes. The changes are:
* You can now use your deCODEme data, in addition to the 23andMe support they started with.
* You can use SNPTips even without raw data to view SNPs on a page.
* It’s been updated for Firefox 4.x.

You can check out our previous tip here (which still applies :).

SNPTips landing page at 5am Solutions.

Tip of the Week: MutaDATABASE, a centralized and standardized DNA variation database

We all know and love dbSNP, and DGV, and 1000 Genomes, and HapMap, and OMIM, and the couple dozen other variation databases I can think of off the top of my head. But–even though there’s a lot of stuff out there–you never know what you aren’t seeing. What *isn’t* yet stored in those resources? One new consortium suggests that there’s a lot you aren’t seeing. And they aim to make it easier to collect variation data, curate it, visualize it, and have it all in one place. The resource they are constructing is called MutaDATABASE.

MutaDATABASE is a new effort to bring together a lot of variation information that is just not getting into existing databases as it should be. The group is described as “a large consortium of diagnostic testing laboratories in Europe, the United States, Australia, and Asia.” In their Nature Biotechnology correspondence they describe many of the barriers facing deposition of new variants in databases. Among them are lack of incentive (or lack of pressure by publishers and other organizations), challenging/difficult software interfaces for submissions, privacy concerns for medical testing situations, and some desire to withhold novel variations as intellectual property. Not all of these issues can be overcome with some software, but they aim to try.

The structural organization of the consortium and contributor community that they wish to develop is described in this slide, which is like Figure 1 in the publication:

So there is a group of MutaAdministrators who oversee the project as a whole (this name makes me giggle a little bit–like a sci-fi government might be called…). There are MutaCurators who assemble and review data on a given gene (is it really just genes? what about non-genic regions and large deletions and such–this isn’t entirely clear to me). Clinicians can give input into the curation, and MutaCircles are groups of labs that do diagnostic testing for a gene and can also discuss, submit, and evaluate data. The MutaCurator acts as the gatekeeper and is accountable for what finally appears in the record.

The gene-specific collections will be freely available online in their database, and link to disease/phenotype information associated with those variations as well. In the tip-of-the-week movie I’ll show you an example of how you might expect a gene record to look when it’s been filled out to some extent.

MutaReviews is a separate aspect that they describe this way on the web site:

MutaREVIEWS is a new “Gene review journal ” published only online, which is freely available to all users. It consists of a compilation of gene review studies that describe the most common human disease genes in a standardised way and lists all observed gene variants. The variants include monogenic variants with high penetrance, rare variants with reduced penetrance and polymorphisms without clinical significance. Each gene review is edited by a specific MutaCURATOR for that gene. These gene reviews are updated every 6 months. There are 12 issues per year.

It’s certainly in the early stages of this project. A lot of the genes I checked just haven’t been curated yet, and I understand that. I hope it works out: I do like the organization and structure, and a one-stop-shop would be handy. But the “build a platform they will come and curate” system has had mixed success elsewhere around biology. And some of the things that need to happen for this to take off are philosophical or possibly legal barriers that are going to vary quite a bit around the research and genetic testing world.

One thing I’d like to see them do is permit and encourage citizen science curation by people who are adopters of personal genomics and looking at data, and by disease community groups who have a specific interest in these genes, but have even more barriers to contribution than the researchers often do.  I’ve found stuff from my genome scan that I don’t really have any place to take, and there’s no way to supplement records at that provider’s site as far as I know. But maybe that’s another variation project somewhere….

Anyway, have a look at MutaDATABASE and see what you think. Or if you participate in this project and I’ve not got some part of this right, drop a note in the comments. I know it’s early in the project and I may not have all the finer points in hand from my looking around and reading.

MutaDATABASE: http://www.mutadatabase.org/ (freely available online database with the variation content)

The sample gene that’s well filled out: http://www.mutareporter.org/mutareporter/Mutadatabase.html?showgene=L1CAM#

MutaReporter: http://www.mutabase.com/index.php?option=com_content&view=article&id=48&Itemid=54 (requires a license and user subscriptions; but supposedly MutaDATABASE will have a submission function that does not require this software specifically, if I understood that correctly)

MutaBASE: http://www.mutabase.com A company associated with the MutaReporter software. (We have no relationship with that company)


Bale, S., Devisscher, M., Criekinge, W., Rehm, H., Decouttere, F., Nussbaum, R., Dunnen, J., & Willems, P. (2011). MutaDATABASE: a centralized and standardized DNA variation database. Nature Biotechnology, 29 (2), 117-118. DOI: 10.1038/nbt.1772

Sage Bioinformatics Advice, But…

Bioinformatics analysis is a powerful technique applicable to a wide variety of fields, and the subject of many a blog post here at OpenHelix. I’ve had two particular bioinformatics articles on my desk for a couple of months now, waiting for me to be able to articulate my thoughts on them. They both offer great information about their particular area of interest – predicting either SNV impacts or protein identities – and sage bioinformatics advice.

The first article, “Using bioinformatics to predict the functional impact of SNVs”, is a great review of bioinformatics techniques for picking out functionally important single nucleotide variants (SNVs, also variously referred to as SNPs or Small, Simple or Single Nucleotide Polymorphisms) from the millions of candidate variants being identified every day. In the introduction the authors do a great job of explaining the many ways in which SNVs can have an impact, as well as how these basic philosophies of impact can be used for bioinformatics analyses. The paper then goes on to describe both classic and bioinformatics techniques for predicting the impact of such variations. It is a phenomenal read for the list of resources alone, with many valuable and important algorithms and resources mentioned. We’ve got tutorials (ENCODE, OMIM, the UCSC Genome Browser, UniProtKB, Blosum and PAM, HGMD, JASPAR, Principal Components Analysis, relative entropy, SIFT score, TRANSFAC) and blog posts (the Catalog of Published Genome-Wide Association Studies) describing many of the same resources. In fact this paper inspired at least one of our weekly posted tips (Tip of the Week: SKIPPY predicting variants w/ splicing affects). The paper then goes on to a “BUYER BEWARE” section that offers some sage advice – know the weaknesses and assumptions of the resources you use for your predictions.

The second article is an open access article from BioTechniques entitled “Mistaken identities in proteomics”. It offers a romp through the history of mass spectrometry (MS) technology and rising standards for documenting techniques used for protein identification in journals. The article also concludes with sage bioinformatics advice, including this quote:

Proteomic researchers should be able to answer key questions, according to Giddings. “What are you actually getting out of a search engine?” she says. “When can you believe it? When do you need to validate?”

Both papers suggest that researchers who wish to use bioinformatics resources in their research should investigate the theoretical underpinnings and assumptions of each tool before deciding on a tool to use, and then should go at every analysis with a level of disbelief in the tool results. That just sounds like common sense, and makes good theoretical advice.

HOWEVER, the level of investigation required to truly know each tool and algorithm is prohibitively huge. My “practical” suggestion for researchers is a bit of a “filtering shortcut”. Before diving into all the publications on all possible tools, just spend a few minutes with some documentation – the resource’s FAQ, or an intro tutorial (we’ve got a few we can offer you :) – to get an idea of what the tool is about & what you might be able to get from it. Once you’ve got a general idea of how to approach the resource, begin “banging” on it lightly. An initial kick-the-tires test of an algorithm, database, or other resource can be as easy as keeping a “test set” on hand at all times & running it through any new tool you want to use. Make sure that the set includes a partial list of some very well known proteins/pathways/SNPs/etc. (whatever you work on & will be interested in analyzing) and that it has some of your field’s ‘flukes’. Think about what you expect to get back from your set. Then run your test set through any new tool you are considering using in your research, and look at your results – are they what you know they should be? Can the tool handle the flukes, or does it break?

As an example, when I approach a new protein interaction resource, I’ll use a partial parts list for some aspect of the yeast cell cycle, and include one or two of the hyphenated gene names. If the tool is good, I get a completed list with no bogging on the “weird” names. If it bogs, I know the resource may not be 100% worked out for yeast & may have issues with other species as well. If the full list of interactors comes back with a bunch of space-junk proteins, I begin investigating what data is included in the resource and whether settings can be tweaked to get better answers. Then, if things still look promising with the tool, I am likely to dig deep into the literature, etc. for the tool – just to be sure – because the authors of these articles are absolutely right: chasing false leads is expensive, frustrating & time consuming. It is amazing how many lemons & jalopies you can weed out with a 5 minute bioinformatics tire kick! :)
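If you like to script this sort of thing, here is a minimal sketch of the tire kick in Python. Everything in it is my own illustration: `tire_kick`, `naive_tool`, and the identifiers in the test set are hypothetical stand-ins (the hyphenated name and the decoy are placeholders, not real gene names), and a real resource would need its own small wrapper that takes a list of names and returns what it recognized.

```python
from typing import Callable, Dict, Iterable, List


def tire_kick(tool: Callable[[List[str]], Iterable[str]],
              test_set: Dict[str, bool]) -> Dict[str, List[str]]:
    """Run a small, well-understood test set through a tool and report surprises.

    `tool` is any function that takes a list of identifiers and returns the
    identifiers it recognized. `test_set` maps each identifier to True when
    we expect the tool to handle it and False when it is a deliberate decoy.
    """
    returned = set(tool(list(test_set)))
    return {
        # Known-good entries the tool lost (e.g. it bogged on a hyphenated name).
        "missing": sorted(name for name, expected in test_set.items()
                          if expected and name not in returned),
        # Decoys, or anything else we did not expect to see come back.
        "unexpected": sorted(name for name in returned
                             if not test_set.get(name, False)),
    }


def naive_tool(names: List[str]) -> List[str]:
    """Hypothetical stand-in for a real resource: it drops hyphenated names."""
    return [name for name in names if "-" not in name]


# Illustrative identifiers only; swap in names you know well from your field.
# "MY-GENE1" stands in for a real hyphenated name, "NOTAGENE1" is a decoy.
my_test_set = {"CDC28": True, "CLN2": True, "CLB2": True,
               "MY-GENE1": True, "NOTAGENE1": False}

report = tire_kick(naive_tool, my_test_set)
print(report)
```

An empty report means the tool passed the five-minute check; anything listed under "missing" or "unexpected" is a cue to go read the documentation before trusting the results.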

I also don’t think the responsibility should solely be on the back of each end user – the resource developer does have some responsibility for making their tool rigorous and for accurately representing its capabilities in publications and documentation. Calls for open source code can help improve some bioinformatics tools, so can education & outreach – but that discussion will have to wait for another day…


UniSNP database

There have been a bunch of tweets lately around the UniSNP database–so I thought I’d do a quick post to raise awareness of that. The mission of UniSNP stated on their homepage at NHGRI is:

UniSNP is a database of uniquely mapped SNPs from dbSNP (build 129) and HapMap (release 27), where differences in SNP positions and names have been resolved, insofar as possible. In addition, SNPs are annotated with various functional characteristics, based on overlap with tracks from the UCSC browser. For details, see [PUB CITATION].

Well, I went looking for a [PUB CITATION] in PubMed for this. I entered the text UniSNP. I got a bunch of results. But that’s because….

Your search for unisnp retrieved no results. However, a search for unison retrieved the following items.

Unison? Um. Ok.

Anyway: the bioinformatics folks seem interested in this resource. So maybe others will be as well. It does offer you the opportunity to look for unique SNPs, using the UCSC assembly hg18/NCBI36. You can search by regions, or by starting with a list of SNPs. It gives you a dozen ways to filter the SNPs for things that might be of interest to you (RefSeq transcript characteristics, HapMap-ness, VISTA enhancer regions, etc).

I would probably accomplish this with a UCSC Table Browser query myself. But if you haven’t had a chance to get familiar with how to use that yet, this form would be a quick way to get similar answers.
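For anyone wondering what “uniquely mapped” boils down to, here is a minimal Python sketch of the idea: keep only the rs IDs that land on exactly one genomic position. This is my own illustration, not UniSNP’s actual procedure (which I haven’t seen described), and the function name and the rows are made up rather than real dbSNP records.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Mapping = Tuple[str, str, int]  # (rs ID, chromosome, position)


def uniquely_mapped(rows: Iterable[Mapping]) -> Dict[str, Tuple[str, int]]:
    """Keep only rs IDs that map to exactly one genomic position."""
    positions: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for rs_id, chrom, pos in rows:
        positions[rs_id].add((chrom, pos))
    return {rs_id: next(iter(locs))
            for rs_id, locs in positions.items() if len(locs) == 1}


# Made-up rows for illustration; these are not real dbSNP records.
rows = [
    ("rs0000001", "chr1", 1000),
    ("rs0000002", "chr2", 2500),
    ("rs0000002", "chr7", 9100),   # second placement, so not unique
    ("rs0000003", "chrX", 42),
    ("rs0000003", "chrX", 42),     # exact duplicate row, so still unique
]

unique = uniquely_mapped(rows)
print(unique)
```

Collecting positions into a set means a row that is simply repeated does not count against a SNP; only a genuinely different placement does.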

Quick links

UniSNP: http://research.nhgri.nih.gov/tools/unisnp/

UCSC Table Browser tutorial: http://openhelix.com//cgi/tutorialInfo.cgi?id=28

The Table Browser tutorial is freely available to everyone as UCSC sponsors that. It’s the same material that we use in our live workshops, with the slides, handouts, and exercises available for anyone to use.

Here’s the tweet that’s going around if you’d like to re-tweet; hat tip to Khader:

@kshameer: UniSNP: uniquely mapped SNPs from dbSNP (build 129) and HapMap (release 27) http://1.usa.gov/gE3Ou0 #genomics #bioinformatics

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • RT @32nm: ISCB Public Policy Statement on Open Access to Scientific and Technical Research Literature — Bioinformatics http://ff.im/x9hQ2 Hat tip to Mitsuteru Nakao [Mary]
  • The beginning of population level structural variation information on the human genome, from the 1000 Genomes Project. Hat tip to GenomeWeb [Jennifer]
  • Registration for the March 2011 GMOD Meeting is now open. See the wiki page at March 2011 Community Meeting. [Mary]
  • Need to assemble and analyze large datasets for multigene phylogenetic analysis? Might want to try out the new iPhy. Paper here. [Trey]
  • Ok, I had to go look at this–not what I expected…. : Top 200 Genomic Females list for Feb. 2011 is NOW ONLINE at http://bit.ly/i0Nj3O via @holstein_world. Yes, the name did clue me in, but I still had to look. [Mary]
  • Scientific Reports: interesting new venture by Nature Publishing Group. [Jennifer]
  • I love when rare diseases can shine lights into the darkness–very neat story on arterial calcification. Fascinating: RT @NatureNews: Solution to medical mystery offers treatment hope http://ff.im/-xiZYE [Mary]

Tip of the Week: Varietas. A plaid database.

For this week’s Tip of the Week I’ll introduce Varietas, a resource that integrates human variation information such as SNP and CNV data, and offers a handy tabular output with links to additional databases that will enable researchers to quickly explore other sources of information about the variations or regions of interest.

I think this is the first resource I’ve used from Finland. And it’s definitely the first resource I have used that is plaid. But it struck me that plaid is a pretty good conceptualization of the variations that we see in the genomes. Some are a single thread, some are larger sections, and the overlaps between the variations we observe in the genome are important to our understanding of them as well. And the history of computation leads back to textile manufacturing, in fact. So I thought it was a pretty good concept.

But let’s explore the threads of Varietas. You can read the paper, which is linked below, but here I’ll just summarize some of the main features. First let me say the focus of this database appears to be human variation, although the site doesn’t make that very clear. As far as I could tell there wasn’t any other species data. But if you want human variation data, you’ll find a variety of threads available to you. If you check out the About page, you’ll see the source data available includes Ensembl, the NHGRI GWAS catalog, SNPedia, and GAD. These sources also provide OMIM data, HGNC nomenclature, phenotypes, and MeSH terms. And the threads out include dbSNP, PubMed, SNPedia, and WikiGenes as well. This is also summarized nicely in Figure 1 of their paper.

It’s a very straightforward interface. There is a basic search with a text box for quick searching, and you select the type of data you are starting with: SNPs, genes, keywords, or locations. And the output will be a table with the results that correspond to your query.

If you have larger sets of features that you want to interrogate you can use the advanced forms to enter more data.

The tabular output can be viewed on the web with all the handy links. Or you can download the data as a text file to be used in other ways.

I’ll demonstrate the sample search in the movie, but you won’t see the full range of data that’s available there. I wish they had samples for each type of search. But I found one search that will also show CNV results: choose the Location radio button and enter this location range to see some CNV samples: 6:1234-123400
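That location string is in the common chromosome:start-end form. If you end up working with a downloaded text file and want to check which rows fall in a range like that, a minimal Python sketch might look like the one below. The `Region`, `parse_location`, and `overlaps` names are my own, the CNV coordinates are invented, and I am assuming inclusive coordinates. Tolerating a "chr" prefix or commas in the numbers is just a convenience in this sketch, not something I have confirmed Varietas accepts.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_location(text: str) -> Region:
    """Parse 'chrom:start-end' (e.g. '6:1234-123400') into a Region."""
    match = re.fullmatch(r"\s*(?:chr)?(\w+):([\d,]+)-([\d,]+)\s*", text)
    if match is None:
        raise ValueError(f"not a chrom:start-end location: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return Region(chrom, start, end)


def overlaps(a: Region, b: Region) -> bool:
    """True when two regions on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


query = parse_location("6:1234-123400")
made_up_cnv = Region("6", 100000, 200000)  # invented coordinates, not real data
print(query, overlaps(query, made_up_cnv))
```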

Varietas home page: http://kokki.uku.fi/bioinformatics/varietas/

PubMed record for the paper: http://www.ncbi.nlm.nih.gov/pubmed/20671203


Paananen, J., Ciszek, R., & Wong, G. (2010). Varietas: a functional variation database portal. Database, 2010. DOI: 10.1093/database/baq016

Tip of the Week: RGenetics at Galaxy

About 6-7 months ago, Mary mentioned that R-Genetics analysis was coming to Galaxy. Well, it has now arrived and is available at the public Galaxy site. The old Rgenetics site links to the new one and to the information about using Galaxy as a wrap-around interface for the Rgenetics project tools. Today’s tip just points you to the tool and gives you a quick overview of what is there. You’ll need to do some exploring to learn to use it! Of course, we have our publicly available Galaxy tutorial to get you started.

(oh, and I point you to this tutorial on analyzing Desmond Tutu’s SNPs using Galaxy that I thought was interesting)

Tip of the Week: Genome Variation Tour III

Today’s tip continues our tour of researching a single SNP in an individual genome. Trey will use a dbSNP rs ID to quickly and easily find linkage disequilibrium information between a SNP of interest and other SNPs in the region, using GVS, the Genome Variation Server at the University of Washington. This 3 minute screencast will show you how to use the GVS tool to get this information for a wide range of populations, starting from a dbSNP rs ID of your choice.
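If linkage disequilibrium is new to you, it may help to see the arithmetic behind the usual pairwise measures. The Python sketch below uses the standard textbook definitions of D, |D'| and r² for two biallelic sites. The function name and the example frequencies are my own, and I am not claiming this is how GVS computes or reports its values.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> dict:
    """Compute D, |D'| and r^2 for two biallelic sites.

    p_ab: frequency of the haplotype carrying allele A at site 1 and B at site 2
    p_a:  frequency of allele A at site 1
    p_b:  frequency of allele B at site 2
    """
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    # D is how far the observed haplotype frequency sits from independence.
    d = p_ab - p_a * p_b
    # The largest |D| possible given these allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return {"D": d, "D_prime": abs(d) / d_max, "r2": d * d / denom}


# Invented frequencies: allele A always travels with allele B.
perfect = linkage_disequilibrium(p_ab=0.5, p_a=0.5, p_b=0.5)
# Invented frequencies: the two sites are independent.
independent = linkage_disequilibrium(p_ab=0.25, p_a=0.5, p_b=0.5)
print(perfect, independent)
```

An r² near 1 means one SNP is a near-perfect proxy for the other; a value near 0 means knowing one tells you little about the other.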