Sage Bioinformatics Advice, But…

Bioinformatics analysis is a powerful technique applicable to a wide variety of fields, and the subject of many a blog post here at OpenHelix. I’ve had two particular bioinformatics articles on my desk for a couple of months now, waiting for me to be able to articulate my thoughts on them. They both offer great information about their particular area of interest – predicting either SNV impacts or protein identities – and sage bioinformatics advice.

The first article, “Using bioinformatics to predict the functional impact of SNVs”, is a great review of bioinformatics techniques for picking out functionally important single nucleotide variants (SNVs, also variously referred to as SNPs, or Small, Simple or Single Nucleotide Polymorphisms) from the millions of candidate variants being identified every day. In the introduction the authors do a great job of explaining the many ways in which SNVs can have an impact, as well as how these basic philosophies of impact can be used for bioinformatics analyses. The paper then goes on to describe both classic and bioinformatics techniques for predicting the impact of such variations. It is a phenomenal read for the list of resources alone, with many valuable and important algorithms and resources mentioned. We’ve got tutorials (ENCODE, OMIM, the UCSC Genome Browser, UniProtKB, BLOSUM and PAM, HGMD, JASPAR, Principal Components Analysis, relative entropy, SIFT score, TRANSFAC) and blog posts (the Catalog of Published Genome-Wide Association Studies) describing many of the same resources. In fact, this paper inspired at least one of our weekly posted tips (Tip of the Week: SKIPPY predicting variants w/ splicing affects). The paper closes with a “BUYER BEWARE” section that offers some sage advice – know the weaknesses and assumptions of the resources you use for your predictions.

The second article is an open access article from BioTechniques entitled “Mistaken identities in proteomics”. It offers a romp through the history of mass spectrometry (MS) technology and the rising standards journals set for documenting the techniques used for protein identification. The article also concludes with sage bioinformatics advice, including this quote:

Proteomic researchers should be able to answer key questions, according to Giddings. “What are you actually getting out of a search engine?” she says. “When can you believe it? When do you need to validate?”

Both papers suggest that researchers who wish to use bioinformatics resources in their research should investigate the theoretical underpinnings and assumptions of each tool before deciding which one to use, and should then approach every analysis with a level of disbelief in the tool’s results. That sounds like common sense, and it makes good theoretical advice.

HOWEVER, the level of investigation that is required to truly know each tool and algorithm is prohibitively huge. As for me, my “practical” suggestion for researchers is a bit of a “filtering shortcut”. Before diving into all the publications on all possible tools, just spend a few minutes with some documentation – the resource’s FAQ, or an intro tutorial – we’ve got a few we can offer you :) – to get an idea of what the tool is about & what you might be able to get from it. Once you’ve got a general idea of how to approach the resource, begin “banging” on it lightly. An initial kick-the-tires test of an algorithm, database, or other resource can be as easy as keeping a “test set” on hand at all times & running it through any new tool you want to use. Make sure that the set includes a partial list of some very well known proteins/pathways/SNPs/etc. (whatever you work on & will be interested in analyzing) and that it has some of your field’s ‘flukes’. Think about what you expect to get back from your set. Then run your tester set through any new tool you are considering using in your research, and look at your results – are they what you know they should be? Can the tool handle the flukes, or does it break?

As an example, when I approach a new protein interaction resource, I’ll use a partial parts list for some aspect of the yeast cell cycle, and include one or two of the hyphenated gene names. If the tool is good, I get a completed list with no bogging on the “weird” names. If it bogs, I know the resource may not be 100% worked out for yeast & may have issues with other species as well. If the full list of interactors comes back with a bunch of space-junk proteins, I begin investigating what data is included in the resource and whether settings can be tweaked to get better answers. Then, if things still look promising with the tool, I am likely to dig deep into the literature, etc. for the tool – just to be sure – because the authors of these articles are absolutely right: chasing false leads is expensive, frustrating & time consuming. It is amazing how many lemons & jalopies you can weed out with a 5 minute bioinformatics tire kick! :)
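If you like to script this sort of thing, the tire kick can be sketched in a few lines of Python. This is only a rough illustration under some loud assumptions: `tire_kick` and `toy_tool` are names I made up, `toy_tool` is a pretend resource rather than any real database or API, and the identifiers and "expected" partners are placeholders that you would swap for a test set you know well from your own field.

```python
from typing import Callable, Dict, Iterable, Set


def tire_kick(
    tool: Callable[[str], Iterable[str]],
    expected: Dict[str, Set[str]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Run every test identifier through `tool` and compare with expectations.

    Only identifiers that did NOT come back clean appear in the report.
    """
    report: Dict[str, Dict[str, Set[str]]] = {}
    for identifier, known in expected.items():
        try:
            returned = set(tool(identifier))
        except Exception as err:
            # The tool "bogged" on this identifier (e.g. a hyphenated name).
            report[identifier] = {
                "error": {repr(err)},
                "missing": set(known),
                "unexpected": set(),
            }
            continue
        missing = known - returned      # things we know should be there
        unexpected = returned - known   # possible "space junk" to look into
        if missing or unexpected:
            report[identifier] = {
                "error": set(),
                "missing": missing,
                "unexpected": unexpected,
            }
    return report


def toy_tool(name: str) -> Iterable[str]:
    """Hypothetical stand-in for a real resource; it chokes on hyphens."""
    if "-" in name:
        raise ValueError(f"unrecognized identifier: {name}")
    pretend_database = {"CDC28": ["CLN2", "CLB2", "JUNK1"]}  # made-up content
    return pretend_database.get(name, [])


# Placeholder test set: one well-known entry plus one hyphenated "fluke".
expected = {
    "CDC28": {"CLN2", "CLB2"},
    "YBR111W-A": {"PARTNER1"},  # placeholder partner, not real data
}

report = tire_kick(toy_tool, expected)
for identifier, problems in report.items():
    print(identifier, problems)
```

An empty report means the tool returned exactly what you expected for every entry. Anything listed under "unexpected" is not necessarily wrong; it is simply the cue to go look at what data the resource includes and which settings you can tweak.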

I also don’t think the responsibility should rest solely on the back of each end user – the resource developer does have some responsibility for making their tool rigorous and for accurately representing its capabilities in publications and documentation. Calls for open source code can help improve some bioinformatics tools, and so can education & outreach – but that discussion will have to wait for another day…