Perusing my copy of Nature Genetics last week, I was flipping through the pages and noticed an unusual graphic. I looked at it a little closer and was convinced it was one of the Spirographs that I used to make as a kid. (Remember those? I always liked that….) I looked a little bit closer and realized it was somewhat more informative than the Spirographs I used to draw. This represented the relationships between genes, based on the literature. Hmmm….how did they do this, exactly?
The paper I was reading was Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk by Raychaudhuri et al, which was interesting enough. I like to read the GWAS papers not only for the specific genes themselves, but also to see what the current techniques and strategies are. And this paper reported the strategy that they used to prioritize their SNPs, and that they used GRAIL to generate the data for this graphic of gene relationships. Check out Figure 1 for the strategy.
When I saw the name GRAIL I thought–huh….GRAIL is back with a new use? I thought that was…ah…retired…at this point. But this isn’t that GRAIL (http://compbio.ornl.gov/Grail-1.3/, Gene Recognition and Assembly Internet Link). This is a different GRAIL–the new one is Gene Relationships Among Implicated Loci. So I had to go and read that paper, which is Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions by Raychaudhuri et al.
This new GRAIL is all about text mining. It is a tool that relies on statistical text mining of the literature for genes in a region and examines the relationships among those genes in the text. The focus in their case is disease regions, but there’s no reason that you couldn’t use it for a variety of other topics. As the authors state:
Given only a collection of disease regions, GRAIL uses our text-based definition of relatedness (or alternative metrics of relatedness) to identify a subset of genes, more highly related than by chance; it also assigns a select set of keywords that suggest putative biological pathways.
So you pull a set of genes out of the literature based on SNPs or locations of interest, and you can begin to assess what’s interesting in the set. Now, the tool makes a lot of assumptions that you should be aware of if you are going to use it. It assumes each region contains a single pathogenic gene. I’m not sure that’s always going to be the case, but as long as you know that going in, it’s a fair assumption for this tool. They suggest this helps to keep multigenic regions from dominating the analysis. Fair enough, but…what if that is the interesting aspect? Still–that’s ok as long as you know.
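To make the "one gene per region" idea concrete, here is a toy Python sketch. It is not GRAIL's code or its actual statistic: the regions, gene names, and text snippets are invented, and I am assuming a plain word-count cosine similarity as a stand-in for their text-based relatedness. It also skips the "more highly related than by chance" significance step entirely. For each region it simply keeps the one gene whose text looks most like the text of some gene in a different region.

```python
import math
from collections import Counter

# Hypothetical toy data: region -> {gene: words seen in text about that gene}.
# Gene names and text are invented for illustration only.
REGIONS = {
    "region_1": {
        "GENE_A": "cholesterol biosynthesis enzyme liver",
        "GENE_B": "olfactory receptor neuron",
    },
    "region_2": {
        "GENE_C": "cholesterol transport lipoprotein liver",
        "GENE_D": "keratin hair follicle",
    },
    "region_3": {
        "GENE_E": "lipoprotein cholesterol receptor",
    },
}


def word_vector(text):
    """Turn a text snippet into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_gene_per_region(regions):
    """Pick one gene per region: the gene most similar to any gene elsewhere."""
    vectors = {
        gene: word_vector(text)
        for genes in regions.values()
        for gene, text in genes.items()
    }
    picks = {}
    for region, genes in regions.items():
        # Only compare against genes in OTHER regions, so a gene-dense
        # region cannot score well just by resembling its own neighbors.
        others = [g for r, gs in regions.items() if r != region for g in gs]
        scores = {
            gene: max((cosine(vectors[gene], vectors[o]) for o in others), default=0.0)
            for gene in genes
        }
        picks[region] = max(scores, key=scores.get)
    return picks


picks = best_gene_per_region(REGIONS)
print(picks)
```

In this toy set the lipid-flavored genes win in each region, and the olfactory and keratin genes drop out. That also shows the catch I mentioned: if a region really holds two relevant genes, only one of them survives this kind of selection.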
In the paper they use validated SNPs from 4 different research areas:
- SNPs associated with serum lipid levels: GRAIL finds genes in the cholesterol biosynthesis pathway.
- SNPs associated with height; they identify pathways they consider plausible.
- Crohn’s disease; they confirm associations that have been seen.
- Schizophrenia–and here they used rare deletions as the items of interest; they find related genes, many highly enriched in the CNS. So this suggests the strategy may be useful not only for SNPs but for CNVs as well.
Their Figure 1 nicely summarizes the strategy:
One curious tweak of the data analysis was that they used the literature prior to December 2006, because right after that there was an onslaught of GWAS papers that would list a whole bunch of genes associated with regions, where the associations might still be tenuous. I understand this in theory, but I imagine it also eliminates more current research on genes of interest from other methods too. I saw in the tool you could choose either the pre-December 2006 set or a more up-to-date literature set. If you use GRAIL, it would be useful to try both and keep that difference in mind.
Another point to keep in mind: some genes are just not found in the abstracts, and they mention that is an issue. So the set you can examine is limited to those genes that were in the abstracts and were identified properly with nomenclature, spelling, etc. Text mining is cool, but it has a lot of limitations around those aspects, and around the use of synonyms in general. It’s not just an issue for GRAIL, but for all text mining tools at this point.
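Both caveats, the literature cutoff and the nomenclature problem, are easy to see in a toy example. The Python sketch below is purely illustrative and has nothing to do with how GRAIL is implemented: the abstracts, dates, gene symbols, and alias table are all invented, and the matching is a naive whole-word search that I am assuming just for the demonstration.

```python
import re
from datetime import date

# Hypothetical toy abstracts; the text, dates, and gene names are invented.
ABSTRACTS = [
    {"published": date(2005, 3, 1), "text": "GENE1 is expressed in the epidermis."},
    {"published": date(2008, 7, 15), "text": "Loss of G1P alters keratinocyte differentiation."},
    {"published": date(2009, 1, 10), "text": "A GWAS locus near GENE2 was reported."},
    {"published": date(2004, 5, 20), "text": "The protein Gene-2 binds lipids."},
]

# Hypothetical alias table: official symbol -> names to search for.
ALIASES = {
    "GENE1": ["GENE1", "G1P"],
    "GENE2": ["GENE2"],
}


def genes_mentioned(abstracts, aliases, cutoff=None):
    """Return the symbols found in abstracts published before `cutoff`.

    With cutoff=None every abstract is searched.
    """
    found = set()
    for abstract in abstracts:
        if cutoff is not None and abstract["published"] >= cutoff:
            continue  # skip anything on or after the cutoff date
        text = abstract["text"].lower()
        for symbol, names in aliases.items():
            # Naive whole-word match on each known name for the gene.
            if any(re.search(r"\b" + re.escape(n.lower()) + r"\b", text) for n in names):
                found.add(symbol)
    return found


pre_cutoff = genes_mentioned(ABSTRACTS, ALIASES, cutoff=date(2006, 12, 1))
all_years = genes_mentioned(ABSTRACTS, ALIASES)
symbol_only = genes_mentioned(ABSTRACTS[1:2], {"GENE1": ["GENE1"]})

print(sorted(pre_cutoff))   # genes visible in the older literature set
print(sorted(all_years))    # genes visible when everything is searched
print(sorted(symbol_only))  # what a symbol-only search finds in the G1P abstract
```

Here GENE2 only shows up in the newer set, because the one older abstract about it spells the name "Gene-2", which the alias table does not cover. And an abstract that only uses the synonym G1P is invisible to a search on the official symbol alone.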
They don’t show any spirographs in their figures in this first GRAIL paper. The one that drew me in was Figure 2 in the arthritis paper. So I went over to the software to try to generate these myself. The output at this point is a web page with text and links to the UCSC Genome Browser and Entrez Gene (from the individual genes and from the keyword list–keywords collect multiple Entrez Genes). I was a little surprised that the keyword link wasn’t to PubMed as well. Currently it doesn’t provide the graphic, but maybe that will come along over time. If it does I’ll be sure to mention it on the blog.
One final note on the paper: in the supplemental section they compare GRAIL to other tools in this arena. If you are interested in tools like we are here you may find some of them interesting as well. The tools are listed with URLs in Table S5, and the comparison outcome is in Text S1:
Prioritizer, Gene2Disease (G2D), Commonality of Functional Annotation (CFA), and Prospectr. There were five supervised tools: Endeavour, GeneSeeker, SUSPECTS, TOM, and CANDID
So check out GRAIL and see if you find gene relationships. But don’t forget those caveats about the genes not listed in the abstracts, or the literature coverage dates. The software can be found here: http://www.broad.mit.edu/mpg/grail/
I know it’s a beta. But I think it has a lot of potential to help people sift through the results they are getting from a variety of techniques. Check it out.
NOTE: you may find periods when you can’t run GRAIL because it puts a burden on the servers. If you are having problems getting it to run, try again during off hours. This happened to me during my testing of it last week.
The list of GWAS data I used to test GRAIL came from the NHGRI catalog, which we discussed here: List of GWAS studies. I tried the straight hair SNP list, and got a pretty interesting set of results that certainly included “epidermis” and “skin” as keywords, among other things.
++++++++++++ Citations ++++++++++++
Raychaudhuri, S., Plenge, R., Rossin, E., Ng, A., International Schizophrenia Consortium, Purcell, S., Sklar, P., Scolnick, E., Xavier, R., Altshuler, D., & Daly, M. (2009). Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions PLoS Genetics, 5 (6) DOI: 10.1371/journal.pgen.1000534
Raychaudhuri, S., Thomson, B., Remmers, E., Eyre, S., Hinks, A., Guiducci, C., Catanese, J., Xie, G., Stahl, E., Chen, R., Alfredsson, L., Amos, C., Ardlie, K., Barton, A., Bowes, J., Burtt, N., Chang, M., Coblyn, J., Costenbader, K., Criswell, L., Crusius, J., Cui, J., De Jager, P., Ding, B., Emery, P., Flynn, E., Harrison, P., Hocking, L., Huizinga, T., Kastner, D., Ke, X., Kurreeman, F., Lee, A., Liu, X., Li, Y., Martin, P., Morgan, A., Padyukov, L., Reid, D., Seielstad, M., Seldin, M., Shadick, N., Steer, S., Tak, P., Thomson, W., van der Helm-van Mil, A., van der Horst-Bruinsma, I., Weinblatt, M., Wilson, A., Wolbink, G., Wordsworth, P., Altshuler, D., Karlson, E., Toes, R., de Vries, N., Begovich, A., Siminovitch, K., Worthington, J., Klareskog, L., Gregersen, P., Daly, M., & Plenge, R. (2009). Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk Nature Genetics, 41 (12), 1313-1318 DOI: 10.1038/ng.479
Medland, S., Nyholt, D., Painter, J., McEvoy, B., McRae, A., Zhu, G., Gordon, S., Ferreira, M., Wright, M., & Henders, A. (2009). Common Variants in the Trichohyalin Gene Are Associated with Straight Hair in Europeans The American Journal of Human Genetics, 85 (5), 750-755 DOI: 10.1016/j.ajhg.2009.10.009