
On a Mission for Protein Information

It’s probably just the human brain’s ability to connect dots & find patterns, but it can be interesting how many “unrelated” events and information bits accumulate in my head & eventually get mulled into an idea or theory. Take, for example, a recent biotech mixer, bits from an education leadership series & a past Nature article – each “event” has been meandering in my mind and now they are finding their way out as this blog post.

OK, now the explanation: At a recent local biotech event I heard about a company (KeraNetics) purifying keratin proteins & using them to develop therapeutic and research applications. The company & their research sounded very interesting & because a lot of it is aimed at aiding wounded soldiers, it also sounded directly beneficial. The talk was short, only about 20 minutes, so there wasn’t a lot of time for details or questions. I decided I’d venture forth through many of the bioscience databases and resources that I know and love, in order to learn more about keratin.

My quest was both fun and frustrating because of the nature of the beast – keratin is “well known” (i.e. it comes up in high school academic challenge competitions ‘a lot’, according to someone in the know), but it is hard to work with (keratins are tough, insoluble, fibrous structural proteins) and it is hard to find much general information on it in your average protein database (because it is made of many different gene products, all referred to as “keratin”). I decided to begin my adventure at two of my favorite protein resources, PDB & SBKB, but I found no solved structures for keratin. Because of the way model organism databases are curated and organized, I often begin a protein search there, just to get some basic background, gene names, sequence information, etc. I (of course) found nothing other than a couple of GO terms in the Saccharomyces Genome Database (SGD), but I found hundreds of results in both Mouse Genome Informatics (MGI) (660 genomic features) and Rat Genome Database (RGD) (162 rat genes, 342 human genes). I also found gene names (Krt*), sequences and many summary annotations with references to diseases with links to OMIM. When I queried for “keratin”, OMIM gave me 180 hits, including 61 “clinical synopses”; UniProt returned 505 reviewed entries and 2,435 unreviewed entries; Entrez Protein returned 10,611 results; and PubMed returned 26,430 articles, including 1,707 reviews. I sated my curiosity about KeraNetics’ research with a PubMed advanced search for keratin in the title or abstract & the PI’s name as author (search = “(keratin[Title/Abstract]) AND Van Dyke[Author]”).
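For anyone who would rather script that last PubMed search than click through the advanced search form, here is a minimal sketch. It only builds the search term and a request URL for NCBI’s E-utilities ESearch service; it does not fetch anything. The helper names (build_pubmed_query, build_esearch_url) and the retmax default are my own inventions for illustration, not part of any NCBI library.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (check NCBI's usage guidelines before heavy use).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(topic: str, author: str) -> str:
    """Build a PubMed term: topic in the title/abstract AND a given author."""
    return f"({topic}[Title/Abstract]) AND {author}[Author]"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for PubMed; fetching it is left to the caller."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical helper names; the term itself is the one from the post.
    term = build_pubmed_query("keratin", "Van Dyke")
    print(term)
    print(build_esearch_url(term))
```

Pasting the printed URL into a browser should return the matching PubMed IDs as JSON, which makes it easy to rerun the same search later and see what is new.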

I ended up with a lot of information leads that I could have hunted through, but it was a fun process in which I learned a lot about keratin. This is where the education stuff comes in. I’ve been seeing a lot of studies go by talking about reforming education to be more investigation driven, and I can totally see how that can work. “Learning” through memorization & regurgitation is dry for everyone & rough for the “memory challenged”, like me. When I have a reason or curiosity to explore, with a new nugget of data or understanding lurking around each corner, the information just seems to get in better & stay longer. (OT, but thought I’d mention a related site that I found today w/ some neat stuff: Mind/Shift-How we will learn.)

And I could have done the advanced PubMed search in the beginning, but what fun would that have been? Plus there is a lot that I learned about keratin from what I didn’t find, like that there wasn’t a plethora of PDB structures for keratin proteins. That brings me to the final dot in my mullings – an article that I came across today as I worked on my reading backlog: “Too many roads not taken”. If you have a subscription to Nature you can read it, but the main point is that researchers are still largely focusing on the same set of proteins that they have been for a long time, because these are the proteins for which there are research tools (antibodies, chemical inhibitors, etc.). This same sort of philosophy is fueling the Protein Structure Initiative (PSI) efforts, as described here. Anyway, I found the article interesting & agree with the authors’ general suggestions. I would, however, extend it beyond these physical research tools & say that going forward researchers need more data analysis tools, and training on how to use them – but I would, wouldn’t I? :)


  • Sierpinski, P., Garrett, J., Ma, J., Apel, P., Klorig, D., Smith, T., Koman, L. A., Atala, A., & Van Dyke, M. (2008). The use of keratin biomaterials derived from human hair for the promotion of rapid regeneration of peripheral nerves. Biomaterials, 29 (1), 118-128. PMID: 17919720
  • Edwards, A., Isserlin, R., Bader, G., Frye, S., Willson, T., & Yu, F. (2011). Too many roads not taken. Nature, 470 (7333), 163-165. DOI: 10.1038/470163a

Sage Bioinformatics Advice, But…

Bioinformatics analysis is a powerful technique applicable to a wide variety of fields, and the subject of many a blog post here at OpenHelix. I’ve had two particular bioinformatics articles on my desk for a couple of months now, waiting for me to be able to articulate my thoughts on them. They both offer great information about their particular area of interest – predicting either SNV impacts or protein identities – and sage bioinformatics advice.

The first article, “Using bioinformatics to predict the functional impact of SNVs”, is a great review of bioinformatics techniques for picking out functionally important single nucleotide variants (SNVs, also sometimes variously referred to as SNPs or Small, Simple or Single Nucleotide Polymorphisms) from the millions of candidate variants being identified every day. In the introduction the authors do a great job of explaining the many ways in which SNVs can have an impact, as well as how these basic philosophies of impact can be used for bioinformatics analyses. The paper then goes on to describe both classic and bioinformatics techniques for predicting the impact of such variations. It is a phenomenal read for the list of resources alone, with many valuable and important algorithms and resources mentioned. We’ve got tutorials (ENCODE, OMIM, the UCSC Genome Browser, UniProtKB, BLOSUM and PAM, HGMD, JASPAR, Principal Components Analysis, relative entropy, SIFT score, TRANSFAC) and blog posts (the Catalog of Published Genome-Wide Association Studies) describing many of the same resources. In fact this paper inspired at least one of our weekly posted tips (Tip of the Week: SKIPPY predicting variants w/ splicing affects). The paper then goes on to a “BUYER BEWARE” section that offers some sage advice – know the weaknesses and assumptions of the resources you use for your predictions.

The second article is an open access article from BioTechniques entitled “Mistaken identities in proteomics”. It offers a romp through the history of mass spectrometry (MS) technology and rising standards for documenting techniques used for protein identification in journals. The article also concludes with sage bioinformatics advice, including this quote:

Proteomic researchers should be able to answer key questions, according to Giddings. “What are you actually getting out of a search engine?” she says. “When can you believe it? When do you need to validate?”

Both papers suggest that researchers who wish to use bioinformatics resources in their research should investigate the theoretical underpinnings and assumptions of each tool before deciding on a tool to use, and then should go at every analysis with a level of disbelief in the tool results. That just sounds like common sense, and makes good theoretical advice.

HOWEVER, the level of investigation that is required to truly know each tool and algorithm is prohibitively huge. As for me, my “practical” suggestion for researchers is a bit of a “filtering shortcut”. Before diving into all the publications on all possible tools, just spend a few minutes with some documentation – the resource’s FAQ, or an intro tutorial – we’ve got a few we can offer you :) – to get an idea of what the tool is about & what you might be able to get from it. Once you’ve got a general idea of how to approach the resource, begin “banging” on it lightly. An initial kick-the-tires test of an algorithm, database, or other resource can be as easy as keeping a “test set” on hand at all times & running it through any new tool you want to use. Make sure that the set includes a partial list of some very well known proteins/pathways/SNPs/etc. (whatever you work on & will be interested in analyzing) and that it has some of your field’s ‘flukes’. Think about what you expect to get back from your set. Then run your test set through any new tool you are considering using in your research, and look at your results – are they what you know they should be? Can the tool handle the flukes, or does it break?

As an example, when I approach a new protein interaction resource, I’ll use a partial parts list for some aspect of the yeast cell cycle, and include one or two of the hyphenated gene names. If the tool is good, I get a completed list with no bogging on the “weird” names. If it bogs, I know the resource may not be 100% worked out for yeast & may have issues with other species as well. If the full list of interactors comes back with a bunch of space-junk proteins, I begin investigating what data is included in the resource and whether settings can be tweaked to get better answers. Then, if things still look promising with the tool, I am likely to dig deep into the literature, etc. for the tool – just to be sure – because the authors of these articles are absolutely right: chasing false leads is expensive, frustrating & time consuming. It is amazing how many lemons & jalopies you can weed out with a 5 minute bioinformatics tire kick! :)
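If you like to script things, the tire kick can be kept as a tiny harness. The sketch below is just one way to do it, under some loud assumptions: the “tool” is any function that takes a query string and returns a list of results, tire_kick and toy_interaction_tool are names I made up, the toy lookup table is illustrative rather than curated data, and “XYZ1-A” is a placeholder for a hyphenated gene name, not a real gene.

```python
from typing import Callable, Dict, Iterable, List, Set


def tire_kick(
    tool: Callable[[str], Iterable[str]],
    expected: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Run each test-set query through `tool` and collect any problems.

    `expected` maps a query (e.g. a gene name) to the results a trustworthy
    tool should return at minimum. An empty report means the tool passed.
    """
    report: Dict[str, List[str]] = {}
    for query, must_have in expected.items():
        try:
            got = set(tool(query))
        except Exception as err:  # the tool "bogged" on this input
            report[query] = [f"tool failed: {err!r}"]
            continue
        missing = sorted(must_have - got)
        if missing:
            report[query] = [f"missing expected result: {m}" for m in missing]
    return report


def toy_interaction_tool(gene: str) -> List[str]:
    """Hypothetical stand-in for a real resource; it chokes on hyphens."""
    if "-" in gene:
        raise ValueError(f"unrecognized gene name: {gene}")
    toy_db = {"CDC28": ["CLN2", "CLB2"]}  # illustrative values only
    return toy_db.get(gene, [])


if __name__ == "__main__":
    test_set = {
        "CDC28": {"CLN2", "CLB2"},  # well-known case
        "XYZ1-A": {"CDC28"},        # placeholder "fluke" with a hyphen
    }
    for query, problems in tire_kick(toy_interaction_tool, test_set).items():
        print(query, "->", problems)
```

In real use you would swap toy_interaction_tool for a thin wrapper around the resource you are evaluating, and keep the test set in a file so the same set gets reused for every new tool.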

I also don’t think the responsibility should solely be on the back of each end user – the resource developer does have some responsibility for making their tool rigorous and for accurately representing its capabilities in publications and documentation. Calls for open source code can help improve some bioinformatics tools, so can education & outreach – but that discussion will have to wait for another day…