Andrew Evans – Sr. Information Architect and blogger at 5AM Solutions, Inc. This blog post is also published at http://blog.5amsolutions.com/
When I became a customer of the 23andMe personal genomics service back in 2008, I remember the sense of awe I had when I opened up the raw data file for the first time and looked at the actual A’s, G’s, C’s and T’s of my genetic makeup. In reality it’s only a tiny portion of it since 23andMe assays 600,000-odd “single nucleotide polymorphisms,” or SNPs (now close to a million with their v3 chip). SNPs are locations in the genome where individuals differ from one another. SNPs (pronounced “snips”) are interesting because they might help explain in part why my personal traits and health have developed the way they have – how I might respond to certain drug treatments, why my eyes are brown, or whether I am a carrier of known Mendelian disorders, for instance. More and more new research arrives every day, attempting to link SNPs to every conceivable health condition and trait.
23andMe does a good job of providing interesting and carefully-researched reports for their customers, available on their website, and tailored to your personal genotype. They sift through the flood of research data and determine what is really worth your time to consider. They also maintain an informative blog that tracks breaking developments that might eventually make it into their official reports.
Those of us who are more adventurous, however, and perhaps better-schooled in the science behind personal genomics, can outgrow 23andMe’s filter. We wander to the scientific literature and other technical online resources where there is a lot more data, much of it preliminary, to explore. Such information can tempt one to jump to erroneous conclusions. But I understand this environment and I’d rather decide for myself.
This is the situation I found myself in not long after becoming a 23andMe customer. I discovered Mike Cariaso’s SNPedia.com, for instance, and ran my data through his Promethease tool – and I realized there was a lot more information on SNPs in the literature than 23andMe was showing us. Also, I became a more technical reader of the 23andMe reports, and now routinely follow up their references and go back to the original journal articles to understand why they draw the conclusions they do.
But there is one problem with all this Web wanderlust – once you leave 23andMe’s website, you must look up your genotype at each SNP on your own if you want to see what impact a particular finding might have on you. This can get quite tedious – especially for journal articles that mention batches of SNPs at a time. To the eye (and my memory), it’s not clear which SNPs are on the 23andMe chip at all, since they assay a subset of the known frequently-occurring SNPs.
This led me to a rather straightforward idea – why not build personal genomics smarts right into the web browser? That way, the browser could enhance web pages that mention SNPs and show you relevant information without you actually having to break your reading flow and go look up genotype information. A browser extension can do this, and so SNPTips was born.
SNPTips is a Firefox browser extension that links to your 23andMe raw data and pre-processes web pages as they are loaded, adding color-coding and tooltips to SNP IDs that are mentioned in web content. If you simply hover your mouse cursor over the SNP RS number, your personal genotype at that SNP is displayed. SNPTips also adds a little icon next to the ID – clicking this icon brings up a balloon with smart links to other web resources, including SNPedia, Google Scholar, and NCBI’s dbSNP – so you can delve deeper with a single click.
Web page content without SNPTips
Same page, SNPTips-enhanced
(green SNPs are on your chip, gray ones are not)
Looking up each of these SNPs one at a time is a major pain, and a true impediment to getting value out of 23andMe raw data – switching windows, logging in, searching, etc. With SNPTips, a simple mouse gesture is all that’s required, without even losing your place as you read!
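To make the idea concrete, here is a minimal Python sketch of the lookup described above. It is not the extension's actual code (SNPTips runs inside Firefox). It assumes the raw data is a tab-separated file with rsid, chromosome, position and genotype columns and `#` comment lines, and the function names, SNP IDs and genotypes are placeholders invented for illustration.

```python
import re

# SNP IDs are written as "rs" followed by digits.
RS_ID = re.compile(r"\brs\d+\b")

def load_genotypes(raw_text):
    """Parse tab-separated raw data: rsid, chromosome, position, genotype."""
    genotypes = {}
    for line in raw_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, _chrom, _pos, genotype = line.split("\t")[:4]
        genotypes[rsid] = genotype
    return genotypes

def annotate(text, genotypes):
    """Append the reader's genotype to each SNP ID found in the text."""
    def tag(match):
        rsid = match.group(0)
        call = genotypes.get(rsid)
        return f"{rsid} [{call}]" if call else f"{rsid} [not on chip]"
    return RS_ID.sub(tag, text)

# Hypothetical one-SNP raw data file; the ID and genotype are made up.
RAW = "# rsid\tchromosome\tposition\tgenotype\nrs123\t1\t100\tAG\n"
print(annotate("Variants rs123 and rs456 were tested.", load_genotypes(RAW)))
```

In the browser the same two steps apply: find the IDs in the page text, then decorate each one with the genotype from the reader's own file, or mark it as absent from the chip.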
SNPTips is currently in public beta. To get SNPTips, simply visit snptips.com with your Firefox 3.6+ browser, and click the Install Now button, and follow the directions on the website to configure. After you’ve tried it out, we’d love to hear from you and hear your questions or suggestions for improvements. Just send e-mail to firstname.lastname@example.org
I’m excited about our future plans for the tool as well (additional browsers, other personal genomics services, for instance) – stay tuned here (and follow @SNPTips on Twitter) for updates.
This next post in our continuing semi-regular Guest Post series is from Andrei Turinsky, one of the developers of iRefWeb. If you are a provider of a free, publicly available genomics tool, database or resource and would like to convey something to users on our guest post feature, please feel free to contact us at wlathe AT openhelix DOT com or the contact form (write ‘guest post’ as subject heading). We welcome introductions to your resource, information on updates, highlights of little known gems or opinion pieces on the state of genomic research and databases.
What is iRefWeb?
Protein-protein interactions (PPI) have become an important tool in biomedical research. Yet the PPI data for a specific organism tend to be distributed over a number of different databases. Comparison and integration of PPI information across databases remains a challenging task.
iRefWeb (Turner et al. (2010) Database, Vol. 2010, Article ID baq023.) is a web interface to a broad integrated landscape of protein-protein interactions (PPIs). For a given gene or protein, you can access all PPI records and protein complexes, consolidated non-redundantly from ten major public databases: BIND, BioGRID, CORUM, DIP, IntAct, HPRD, MINT, MPact, MPPI and OPHID. iRefWeb also presents various supporting evidence, helping you to gauge the reliability of an interaction. Versatile search filters allow you to retrieve the PPIs with a given level of support. Other features facilitate the analysis of possible inconsistencies across PPI data and the examination of PPI statistics. The data consolidation procedure effectively combines redundant records using the iRefIndex process (Razick et al (2008) BMC Bioinformatics 9, 405.).
Figure 1: The iRefIndex process aggregated 916,059 original PPI records from source databases, 75% of which were redundant. The consolidation merged the redundant PPIs, reducing their number four-fold (orange). Only 232,612 PPIs were non-redundant (blue).
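The effect of consolidation can be illustrated with a small Python sketch. This is not the iRefIndex procedure itself, which is described in Razick et al.; it only assumes that two records are redundant when they name the same unordered pair of proteins. The record layout and protein names are placeholders.

```python
from collections import defaultdict

def consolidate(records):
    """Group PPI records that describe the same unordered protein pair."""
    merged = defaultdict(list)
    for source_db, protein_a, protein_b in records:
        key = tuple(sorted((protein_a, protein_b)))  # A-B and B-A are the same pair
        merged[key].append(source_db)
    return dict(merged)

# Toy records; the protein names are placeholders.
records = [
    ("BIND", "P1", "P2"),
    ("IntAct", "P2", "P1"),
    ("MINT", "P1", "P2"),
    ("DIP", "P3", "P4"),
]
unique = consolidate(records)
print(len(records), "records ->", len(unique), "non-redundant interactions")

# Arithmetic check on the totals quoted in Figure 1.
total, non_redundant = 916_059, 232_612
print(f"fold reduction: {total / non_redundant:.2f}")
print(f"redundant share: {1 - non_redundant / total:.1%}")
```

The last two lines work out to a fold reduction of about 3.9 and a redundant share of roughly 75%, in line with the figures quoted in the caption.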
This next post in our continuing semi-regular Guest Post series is from Andrew Johnson, one of the developers and the concept designer of SNAP, SNP Annotation and Proxy Search which is hosted at the Broad Institute.
SNAP (http://www.broadinstitute.org/mpg/snap/, Johnson et al. (2008) Bioinformatics 24(24): 2938), “SNP Annotation and Proxy search”, is a flexible, web-based tool that allows anyone in the world to quickly accomplish a range of SNP-related genetics and bioinformatics tasks. This post highlights some common questions and features of SNAP, some more obscure uses, and recent and planned developments.
How did SNAP come about?
The idea for SNAP was originally sparked by GWAS analysts within a large collaborative group (the Framingham Heart Study SHARe project). This was in the pre-imputation era, when GWAS investigators from different groups using different SNP arrays often wanted to find the best proxy SNPs based on HapMap for comparison when they didn’t have common genotyped SNPs across groups. We initially implemented local programs to look up HapMap LD and also consider the presence of query and proxy SNPs on different commercial genotyping arrays. We quickly realized this was a community-wide problem as we received requests from outside collaborators, so we decided it was worth developing a public tool and approached investigators at the Broad Institute. Through collaboration with Paul de Bakker, Bob Handsaker and others at the Broad Institute we were able to add more features like plotting and build a nice, quick and accessible interface. Many people have contributed ideas, testing and improvements to SNAP, and Bob Handsaker and Pei Lin in particular continue to maintain and update SNAP.
What do you use SNAP for the most?
The two most widely used features of SNAP are 1) SNP LD queries, and 2) plotting of LD and association data. There are a number of flexible options for these functions. Beyond these, as a SNP bioinformatics specialist, I often use SNAP to rapidly retrieve information about a list of SNPs for other uses (see specialized queries below).
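As a rough illustration of what a proxy query involves, the Python sketch below picks the SNP in strongest LD with a query SNP among those present on a given array. It is not SNAP's code. The function name, the pairwise LD table layout, the r² threshold of 0.8 and all SNP IDs and values are assumptions made up for the example.

```python
def best_proxy(query, ld_pairs, array_snps, min_r2=0.8):
    """Return (proxy, r2) for the highest-LD SNP that is on the target array, or None."""
    if query in array_snps:
        return query, 1.0  # the query itself is genotyped; no proxy needed
    candidates = [
        (r2, proxy)
        for q, proxy, r2 in ld_pairs
        if q == query and proxy in array_snps and r2 >= min_r2
    ]
    if not candidates:
        return None
    r2, proxy = max(candidates)  # highest r2 wins
    return proxy, r2

# Placeholder SNP IDs and r2 values, for illustration only.
ld_pairs = [("rsA", "rsB", 0.95), ("rsA", "rsC", 0.85), ("rsA", "rsD", 0.40)]
array_snps = {"rsC", "rsD"}
print(best_proxy("rsA", ld_pairs, array_snps))
```

Here rsB has the highest r² but is not on the array, and rsD is on the array but falls below the threshold, so rsC is returned.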
What are some commonly asked questions from users of SNAP?
This next post in our continuing semi-regular Guest Post series is from Pedro Lopez, developer of WAVe at the University of Aveiro Bioinformatics Group in Aveiro, Portugal.
I would like to start by thanking Trey Lathe for the opportunity to promote WAVe on this great blog. After his short tip of the week post, I’ll now try to give a more detailed overview of this new application.
What is WAVe?
WAVe stands for Web Analysis of the Variome and is a simple application focused on centralizing access to distributed and heterogeneous locus-specific databases (LSDBs). LSDBs are an emerging type of bioinformatics application that aims to provide gene-centric information about discovered genomic variants. In WAVe, we offer access both to LSDBs and to their variants. Moreover, we also provide access to a comprehensive list of carefully selected external resources. With this, users have, in a single lightweight and easy-to-use web application, access to gene and variation information enriched with a multitude of gene-related resources.
What are WAVe’s key features?
At this early stage, WAVe’s publicly available features are related to data access. Users can easily browse through available genes, search for genes, view gene info and access each gene’s RSS feed. On WAVe’s entry page, users simply need to start typing a gene’s HGNC-approved symbol and several suggestions will appear: accepting one of them leads directly to the gene view page. Following the view all link, users can browse all available genes or check, for each gene, how many LSDBs and variants are available.
To access the application data, users just need to navigate the gene tree. Each tree node represents a distinct data type and the various leaves provide access to external applications: when a leaf is clicked, the destination page is loaded in the main content area. Repeating this process, users can navigate through the dozens of links listed for each gene.
WAVe also offers its core data to other developers. To obtain the gene tree and its links, users just need to add the rss tag to the end of the gene address. This outputs an RSS 2.0 feed that can be easily parsed by any application or added to a feed reader.
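As an illustration of how little code a consumer of such a feed needs, the Python sketch below reads (title, link) pairs from an RSS 2.0 document using only the standard library. The sample feed is hypothetical: its titles and example.org links are placeholders, and the items WAVe actually serves may be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; the real items served by WAVe may differ.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>COL3A1</title>
    <item><title>LSDB</title><link>http://example.org/lsdb/COL3A1</link></item>
    <item><title>Protein</title><link>http://example.org/protein/P02461</link></item>
  </channel>
</rss>"""

def feed_links(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]

for title, link in feed_links(SAMPLE_FEED):
    print(title, link)
```

In practice the feed text would be downloaded from the gene address first; that step is left out so the sketch does not depend on a particular URL.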
How was WAVe born?
The European GEN2PHEN project is an initiative to link, as deeply as possible, data from genotype features to its phenotype counterparts. The first step consisted of an attempt to improve various genomic variation resource scenarios. This implied normalizing LSDBs (the “LSDB-in-a-box” approach, LOVD) and defining novel data models and formats for data exchanges from and to LSDBs.
In the long term, applying the GEN2PHEN-approved data models will enhance the creation of new services and applications to integrate and interact with the exponentially growing dataset of genomic variation data.
With WAVe we tried a different approach based on three questions: why wait for everyone to adopt these new formats? What will happen to legacy LSDBs that won’t adopt the new formats? How can we have an immediate solution? We have created a lightweight integration architecture, based on links to applications, and adopted a simple (yet familiar) tree-based navigation interaction to deploy a new application that can be used right now and will easily scale to integrate the foreseen data exchange formats. Technical details aside, based on a manually curated LSDB list, we can connect and integrate any kind of LSDB application, whether it is a modern LOVD application or a simple text-based legacy LSDB.
How is it relevant?
To demonstrate WAVe’s efficiency, let’s try a simple search in our lab: Are there any LSDBs for the COL3A1 gene in the human species? And known variants? And what are the associated proteins and pathways?
In a WAVe-free scenario, to find COL3A1 LSDBs (if any), researchers need to google it (the main COL3A1 LSDB does not appear in the first result page) or, if they are used to it, go to the HGVS site, go to the “Databases & Tools” section, select “Locus-specific Mutation Databases” and then search for the gene in the search box. For the variants, researchers just need to browse the page they have just reached. How many clicks (and how much time!) does it take?
For protein information, researchers go to UniProt and search for COL3A1, which gives about 29 results. Adding a filter for the human species leaves 5 results, good enough to go directly to P02461 (SwissProt reviewed), though that opens a new window/tab. For pathway information, a KEGG quick search for COL3A1 lists 14 results. In the end, researchers have about 3 windows/tabs open and have made some 20 mouse clicks to obtain the desired information.
Using WAVe, researchers simply need to access WAVe, start typing the gene’s HGNC symbol, select COL3A1 from the suggestions and open the COL3A1 page. Once on the page, it’s as easy as browsing the tree… Variations? Check the variation node; they’re even grouped according to the change type. UniProt information? Check the protein node, where you have direct access to SwissProt, TrEMBL, PDB, Expasy and InterPro. And I guess you get the picture. In the end, one window/tab and about 6 or 7 mouse clicks.
Other UA.PT Bioinformatics tools
At the University of Aveiro’s Bioinformatics research group we are mainly young and enthusiastic computer science experts, simply trying to make biology easier (at least in terms of computer applications!). Our most relevant web-based tools include MIND (a microarray analysis tool), GeneBrowser (a gene expression tool, useful to process data gathered from systems like MIND) and QuExT (a comprehensive MEDLINE mining application).
This next post in our continuing semi-regular Guest Post series is from Allan Peter Davis, of the Comparative Toxicogenomics Database (CTD) at Mount Desert Island Biological Laboratory (MDIBL).
The Comparative Toxicogenomics Database (CTD) is a free, public resource that promotes understanding about the effects of environmental chemicals on human health. Since Trey’s original Tip of the Week about CTD, we’ve added many new features we’d like to highlight.
* The redesigned CTD homepage makes navigation easier and more intuitive. Check out the keyword quick search box on every page, and try the “All” setting to see the scope of information available at CTD.
* A new Data Status page uses tag clouds to display the updated content for that month.
* We are particularly pleased to announce new statistical analyses of CTD data. Chemical pages now feature enriched Gene Ontology (GO) terms, garnered from the genes that interact with a chemical. In this release, CTD connects over 5,000 enriched GO terms to more than 4,500 chemicals. In addition, our inferred chemical-disease relationships are now statistically scored and ranked. Both new features will help users explore and generate testable hypotheses about the biological effects of chemicals.
* GeneComps and ChemComps discover genes or chemicals with a similar toxicogenomic profile to your molecule of interest. Learn more about this feature in our recent publication.
* Reactome data are now also included with KEGG, for a more comprehensive view of pathways affected by chemicals.
* VennViewer and MyGeneVenn are new tools that compare datasets for chemicals, diseases, or genes (including your own gene list) using Venn diagrams to discover shared and unique information. These two visualization tools are a nice accompaniment to our original Batch Query tool for meta-analysis.
* The FAQ section under the “Help” menu provides examples of how to maximize your experience with CTD.
* Download our Resource Guide (pdf link) to keep as a handy reference card for CTD.
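For readers unfamiliar with what an “enriched” GO term means, the Python sketch below shows one common way such a score can be computed: an upper-tail hypergeometric test. The post does not say which statistic CTD uses, so the choice of test is an assumption here, and the function name and gene counts are made up for illustration.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability of seeing at least `hits` annotated genes."""
    upper = min(sample, annotated)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)

# Made-up counts: 20,000 genes overall, 200 carry the GO term,
# 50 genes interact with the chemical, 5 of those carry the term.
p = enrichment_p_value(population=20_000, annotated=200, sample=50, hits=5)
print(f"p = {p:.3g}")
```

A small p-value says that the chemical's interacting genes carry the GO term more often than a random gene set of the same size would. A real analysis would also correct for testing many terms at once.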
From the homepage, you can also subscribe to our monthly email newsletter to keep current with CTD’s growing content and features. You can always contact us to request curation of your favorite chemical or paper. And with our new “Author Alert” email program, we’ll even contact you to let you know when we’ve curated data from one of your publications in CTD.
We strive to be the best possible resource of chemical-gene-disease networks for the biological community, so feedback and input from users are of great importance to us.
- Allan Peter Davis
This next post in our continuing semi-regular Guest Post series is from Eric Lyons, of CoGe at the University of California, Berkeley.
Thanks both for the prior CoGe post (editor’s note: a tip of the week on CoGe) and the invitation to write a bit about CoGe. Since most people are probably not familiar with CoGe, let me begin with how it is designed:
CoGe’s architecture and philosophy: Solve a problem once
CoGe is a web-based platform for comparative genomics and consists of many interconnected web-based tools. The entire system is hooked up to a database that can store any version of any genome in any state of assembly from any organism (currently ~9000 genomes from ~8000 organisms). Each of CoGe’s tools is designed to do one task (e.g. search and display information about a genome, compare two genomes and generate syntenic dotplots, search any number of genomes for similar sequence, manage a list of genes, etc.), and the tools are linked to one another. This means that there is no predefined analysis workflow. Instead, people can begin exploring a genome of interest, compare it to what they want, find something interesting, explore that, find something else, explore that, and so on. People anywhere in the world can perform computationally intense analyses by clicking a few buttons on a web page and letting our servers crunch away on whatever genomes we have currently loaded in our system. Since each tool is web-based, links are used to move from tool to tool, which creates an easy way to save an analysis for future work or to send it to a colleague. This also has the benefit that as we develop new tools to solve a specific problem, we can generalize the solution, plug it into CoGe’s database and connect it to the pre-existing tool set. Overall, this gives us an easy way to expand CoGe’s functionality.
This next post in our continuing semi-regular Guest Post series is from Xiaowu Gai, the Bioinformatics Core Director at CHOP.
Thanks to Mary for running a Tip of the Week – “CHOP CNV database” a couple of months back. CHOP CNV database is a high-resolution genome-wide survey of copy number variations of a large number (2,026) of apparently healthy individuals. It is publicly accessible and has been widely used by a large number of research groups world-wide. I am now pleased to announce the public release of our software system behind it: CNV Workshop. CNV Workshop is a suite of software tools that we have developed over the last few years. It provides a comprehensive workflow for analyzing, managing, and visualizing genome copy number variation (CNV) data.
It can be used for almost any CNV research or clinical project, and offers the following capabilities for both individual samples and cohort studies:
Implements a modified circular binary segmentation algorithm that reduces false positives
Fully configurable parameters for sensitivity/specificity management
Individual locus-specific annotations such as position, type of variation, call metrics, and overlap with CNVs of other data sets, including the Database of Genomic Variants.
Functional gene annotations such as genes affected and known disease associations
Accepts user-provided annotations
GBrowse-enabled visuals for querying, browsing, interpreting, and reporting CNVs
Export of results into Excel, XML, CSV, and BED files
Direct links to public resources such as the UCSC Genome Browser, NCBI Entrez, Entrez Gene, and FABLE
Project and Account Management
Authentication and permission scheme that is especially useful for clinical diagnostic settings
Analysis result sharing within and between projects
Simple Web-based administrative interface
Remote access and administration enabled
CNV Workshop currently accepts genotyping array data from Illumina’s 550k, 610- and 660-Quad, and Omni arrays, along with Affymetrix’s 5.0 and 6.0 arrays, and can be easily configured to accept data from other platforms. The package comes preloaded with publicly available reference data from more than 2,000 healthy control subjects (the CHOP CNV Database). CNV Workshop also allows the user to upload already processed CNV calls for annotation and presentation.
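Two of the capabilities listed above, overlap with CNVs from other data sets and export to BED files, can be illustrated with a short Python sketch. It is not taken from CNV Workshop. The function names, the 1-based inclusive coordinate convention for the input and the coordinates themselves are assumptions chosen for the example.

```python
def overlap_fraction(call, reference):
    """Fraction of `call` covered by `reference`; both are (chrom, start, end), 1-based inclusive."""
    c_chrom, c_start, c_end = call
    r_chrom, r_start, r_end = reference
    if c_chrom != r_chrom:
        return 0.0
    shared = min(c_end, r_end) - max(c_start, r_start) + 1
    return max(shared, 0) / (c_end - c_start + 1)

def to_bed(call, name):
    """Format a 1-based inclusive interval as a BED line (0-based start, exclusive end)."""
    chrom, start, end = call
    return f"{chrom}\t{start - 1}\t{end}\t{name}"

# Placeholder coordinates, not taken from any real data set.
call = ("chr1", 1001, 2000)
reference_cnvs = [("chr1", 1501, 3000), ("chr2", 1001, 2000)]
print([overlap_fraction(call, ref) for ref in reference_cnvs])
print(to_bed(call, "loss_1"))
```

The first reference CNV covers half of the call and the second lies on another chromosome, so the fractions are 0.5 and 0.0. The BED line shifts only the start coordinate, because BED intervals are 0-based and half-open.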
Our first guest post in our new semi-regular Guest Post series is from Inna Dubchak, principal investigator at the LBNL/JGI group, developers of the VISTA comparative genomics resource (who sponsors a tutorial, free to the users).
I would like to give you a heads up on some new VISTA updates and ongoing development!
Updates: As you probably know from this blog, a new, still free VISTA tutorial is available now. We have introduced a lot of updates to these tools - built new programs, improved the existing ones, and entirely changed the design of the site to make it more up-to-date and convenient.
The main addition to the site, VISTA Point, combines the capabilities of the three tools currently available at the site (VISTA Gateway, VISTA Browser, and Text Browser), which are usually used step by step. VISTA Point makes analyzing multiple and pairwise genome alignments and extracting relevant numerical data much more straightforward, and it is easy to update, expand and extend with new programs.
Soon: We are actively working on visualizing synteny at scales ranging from whole-genome alignment to the conservation of individual genes, with seamless navigation across different levels of resolution. In our upcoming VISTA-Dot tool we use the concept of two-dimensional “dot-plots”, historically employed in the analysis of local alignment, and an interactive Google-map-like interface to visualize whole-genome alignments. You will be able to display and analyze large-scale duplication in plants in one click! It can also be useful in genome assembly and finishing. Another addition coming in the near future, VISTA Synteny Viewer, presents a novel interface as three cross-navigable panels representing different scales of the alignment.
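For readers who have not met dot-plots before, the Python sketch below shows the basic idea on two tiny sequences: every point (i, j) marks a short word shared by position i of one sequence and position j of the other, so conserved stretches show up as diagonal runs. It is a generic illustration, not VISTA-Dot's method. The function name, the word length of 3 and the sequences are arbitrary choices.

```python
def dot_plot(seq_a, seq_b, k=3):
    """Return sorted (i, j) positions where the k-mers at seq_a[i] and seq_b[j] match."""
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)  # k-mer -> positions in seq_b
    return sorted(
        (i, j)
        for i in range(len(seq_a) - k + 1)
        for j in index.get(seq_a[i:i + k], [])
    )

# Arbitrary toy sequences.
print(dot_plot("ACGTACGT", "TACGTT"))
# -> [(0, 1), (1, 2), (3, 0), (4, 1), (5, 2)]
```

The points (3, 0), (4, 1) and (5, 2) form one diagonal, which corresponds to the shared stretch TACGT. Whole-genome tools apply the same picture to alignments rather than exact short words.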
Attention: do not forget to use our whole-genome capabilities – Whole-genome VISTA to align sequence of any quality, from draft to finished, up to 10MB long, and Whole Genome rVISTA to evaluate which transcription factor binding sites (TFBS) are over-represented in upstream regions in a group of genes.
Greetings! OpenHelix Blog is instituting a new semi-weekly feature. Every Wednesday we have our “Tip of the Week,” on Thursdays we have our “What’s Your Problem,” and now on occasional Tuesdays we are going to have our “Provider Guest Post.” These will be posts from providers of genomics tools and databases and will be opinions, updates and upcoming features of the resource, whatever the provider of the resource would like to convey to users. We have several lined up for the coming weeks, so keep checking back.
Additionally, if you are a developer or provider of a free, publicly available genomics or biological resource, database or analysis tool and would like to post in our guest feature, be it an introduction to your tool, updates or upcoming features or even an opinion about the current state of genomics research and data, please write us at wlathe AT openhelix DOT com. We would love to put you in the queue for the next guest post.
Our first guest post next Tuesday will be from Inna Dubchak, principal investigator at the LBNL/JGI group, developers of the VISTA comparative genomics resource (who sponsors a tutorial, free to the users). She’ll discuss some new tools at VISTA and give you a quick preview of some new upcoming features.