How well do we know our genes?

Do you have some favorite genes? Well, of course you do–you are probably a researcher who has worked on some specific genes in the past, or you are interested in groups of genes or genomic regions. Or maybe classes of genes. There is a new resource that gives you a score for how well a given protein-coding gene is annotated, and therefore possibly how well it is understood: the GCI, or Gene Characterization Index.

I love the idea of this project. The team wanted to look at the gene space and understand how well we know the human genes. They looked at the growth of our knowledge over time, too–which provides an interesting view of our progress–as shown in this figure from their web site. And they wanted to identify the darkness–where don’t we know enough? Which genes are worth examining because they could teach us something really new?

That’s the kind of project I wanted to do when I was still in academia. I thought you could build a whole lab and crank out students who get assigned an unknown gene, and it is their job over the next few years to analyze and understand the gene. It would be unbiased by a disease area vision, or by the lab director’s preconceptions of what the gene might do. They could try all sorts of techniques to get there. It is probably also entirely unfundable by grant agencies. Alas.

Anyway–that’s what the GCI holds–buckets of genes that have varying levels of characterization. You can look at genes and find out what is known. You can grab genes that aren’t well characterized and go after them.

I’m not entirely sure I agree with the scores, based on a gene I was intrigued by once. I’m afraid I can’t tell you exactly which one it was because I was working for a pharmaceutical company at the time. It is a gene with a possibly druggable structure (a type of transporter with a cell surface location, which pharma loves). Its protein identity: human/mouse/rat/cow/pig/chicken = 100/99/98/97/96/90%. Even 89% identity in Xenopus! This is a gene that hasn’t been messed with much. It has to do something important. It gets a score of just under 6 on the 10-point scale. But there really isn’t that much known about this gene at all.

The score is based on 6 main criteria that they selected: GenBank sequences, InterPro domains, KEGG pathways, Medline references, OMIM entries, and Swiss-Prot data. But in the case of the gene I’m interested in, one of the Medline papers led to the really brief and rather uninformative OMIM entry. Two of the papers were giant sequence analysis and cloning efforts that each cover over 10,000 genes (PMID: 12477932 and PMID: 15489334). So some of this gene’s score results from the longstanding issue of transitive information passage among the databases–info is carried from one source to another without much value added. And sometimes this can lead to transitive errors (but not in this case). So I think there is some danger in relying on the specific score.
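To make the concern about bulk papers concrete, here is a minimal Python sketch of how you might discount them when counting a gene's literature support. This is not how the GCI computes its score. The function name and the example gene record are hypothetical, and the third PMID is a made-up placeholder. Only the two bulk PMIDs come from the discussion above.

```python
# PMIDs mentioned above: large-scale cloning/sequencing papers covering >10,000 genes.
BULK_PMIDS = {"12477932", "15489334"}


def independent_pmids(pmids, bulk=BULK_PMIDS):
    """Return the references that are not bulk papers, preserving input order.

    Hypothetical helper for illustration only; not part of the GCI.
    """
    return [pmid for pmid in pmids if pmid not in bulk]


# Hypothetical gene record: two bulk papers plus one placeholder PMID.
gene_pmids = ["12477932", "15489334", "00000001"]

remaining = independent_pmids(gene_pmids)
print(f"{len(gene_pmids)} references listed, {len(remaining)} left after removing bulk papers")
```

A gene whose reference list shrinks sharply under a filter like this is probably less well characterized than its raw reference count suggests.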

But still–if you wanted to look in the shadows of what we know for some great projects, you may find them here. I would probably set a pretty high score cut-off when looking for the less well-known genes. But this appears to be a nice way to pull together some useful resources for genes of interest. May you use it to illuminate the darkness!

Kemmer, D., Podowski, R.M., Yusuf, D., Brumm, J., Cheung, W., Wahlestedt, C., Lenhard, B., Wasserman, W.W. (2008). Gene Characterization Index: Assessing the Depth of Gene Annotation. PLoS ONE, 3(1), e1440. DOI: 10.1371/journal.pone.0001440