Tag Archives: InterPro

Video Tip of the Week: TargetMine, Data Warehouse for Drug Discovery

Browsing around genomic regions, layering on lots of associated data, and beginning to explore new data types I might come across are things that really fire up my brain. For me, visualization is key to forming new ideas about the relationships between genomic features and patterns of data. But frequently I want to take this to the next step–asking where else these patterns appear, how many other instances of this situation there are in a data set, and maybe adding complexity to the problem and refining the quest. This is not always easy to do with primarily visual software tools. This is when I turn to tools like the UCSC Table Browser, BioMart, and InterMine to handle some list of genes, or regions, or features.

We’ve touched on all of these before–sometimes with full tutorial suites (UCSC, BioMart), and sometimes as a Tip of the Week (InterMine, and InterMine for complex queries). Learning about the foundations of these tools will let you use various versions or flavors of them at other sites. I love to see tools that are re-used for different topics when that’s possible, rather than building a whole new system. There are modENCODE, rat, and yeast mines, and more. This week’s tip is about one of those others–TargetMine is built on the InterMine foundation, with a specific focus on prioritizing candidate genes for pharmaceutical interventions. From their site overview, I’ll add this description they use:

TargetMine is an integrated data warehouse system which has been primarily developed for the purpose of target prioritisation and early stage drug discovery.

For more details about their framework and philosophy, you should see their papers (linked below). The earlier one sets out the rationale, the data types, and the data sources they are incorporating. They also establish their place in the ecosystem of other databases in this arena, which helps you to understand their role. But you should see the next paper for a really good grasp of how their candidate prioritization works with the “Integrated Pathway Clusters” concept they’ve added. They combined data from KEGG, Reactome, and NCI’s PID collections to enhance the features of their data warehouse system.

This week’s Video Tip of the Week highlights one of the tutorial movies that the TargetMine team provides. There’s no spoken audio with it, but the captions that help you to understand what’s going on are in English. I followed along on a browser with their example–they have a sample list to simply click on, and you can see various enrichments of the sets–pathways, Gene Ontology, Disease Ontology, InterPro, CATH, and compounds. They call these the “biological themes” and I find them really useful. You can create new lists from these theme collections. They also illustrate the “template” option–pre-defined queries with typical features people may wish to search. The example shows how to go from the list of genes you had to pathways–but there are other templates as well.
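If you are curious what an enrichment score for one of these themes might look like under the hood, here is a minimal sketch. It assumes a one-sided hypergeometric test, which is a common choice for gene-set enrichment; it is not taken from the TargetMine papers or site, and the function name and all of the numbers are made up for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric tail: P(X >= hits).

    population: genes in the background set
    annotated:  background genes carrying the theme (e.g. one pathway)
    sample:     genes in your list
    hits:       genes in your list carrying the theme
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of seeing `hits` or more annotated genes by chance.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: a 40-gene list with 6 genes in a 150-gene pathway,
# against a background of 20,000 genes.
p = enrichment_p_value(population=20000, annotated=150, sample=40, hits=6)
print(f"p = {p:.3g}")
```

A real tool would typically also correct these p-values for the many themes tested at once; that step is left out here.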

Another section of the video has an example of a custom query with the Query Builder. They ask for structural information for proteins targeted by acetaminophen. It’s a nice example of how to go from a compound to protein structure–a question I’ve seen come up before in discussion threads.
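InterMine-based sites can generally express a Query Builder query as a small XML document (a "path query"), which is handy for saving or sharing a search. The sketch below only builds such a document as a string; it does not contact any server. The model name and every class and field path in it are hypothetical placeholders of mine, not TargetMine's actual data model, so check the Query Builder's model browser for the real names.

```python
import xml.etree.ElementTree as ET


def build_path_query(model, view, constraints):
    """Build an InterMine-style path query XML string.

    view:        list of output column paths
    constraints: list of (path, operator, value) tuples
    """
    query = ET.Element("query", {"model": model, "view": " ".join(view)})
    for path, op, value in constraints:
        ET.SubElement(
            query, "constraint", {"path": path, "op": op, "value": value}
        )
    return ET.tostring(query, encoding="unicode")


# Hypothetical paths: go from a compound to its target proteins and their
# structures. None of these names are confirmed against TargetMine.
xml_query = build_path_query(
    model="genomic",
    view=[
        "Compound.name",
        "Compound.targets.primaryAccession",
        "Compound.targets.structures.pdbId",
    ],
    constraints=[("Compound.name", "=", "acetaminophen")],
)
print(xml_query)
```

The point of the shape is that the output columns and the constraints are kept separate, which mirrors how the Query Builder lets you pick what to show and what to filter on.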

In their more recent paper (also below), they have some case studies that illustrate the concepts of prioritizing targets for different disease situations with their system. They also expand on the functions with additional software to explore the pathways: http://targetmine.mizuguchilab.org/pathclust/

So have a look at the features of TargetMine for prioritization of candidate genes. I think the numerous “themes” are a really useful way to assess lists of genes (or whatever you are starting with).

Quick Links:

TargetMine: http://targetmine.mizuguchilab.org/ [note: their domain name has changed since the publications, this is the one that will persist.]

InterMine: http://intermine.github.io/intermine.org/


Chen, Y., Tripathi, L., & Mizuguchi, K. (2011). TargetMine, an Integrated Data Warehouse for Candidate Gene Prioritisation and Target Discovery. PLoS ONE, 6 (3). DOI: 10.1371/journal.pone.0017844

Chen, Y., Tripathi, L., Dessailly, B., Nyström-Persson, J., Ahmad, S., & Mizuguchi, K. (2014). Integrated Pathway Clusters with Coherent Biological Themes for Target Prioritisation. PLoS ONE, 9 (6). DOI: 10.1371/journal.pone.0099030

Kalderimis, A., Lyne, R., Butano, D., Contrino, S., Lyne, M., Heimbach, J., Hu, F., Smith, R., Stěpán, R., Sullivan, J., & Micklem, G. (2014). InterMine: extensive web services for modern biology. Nucleic Acids Research, 42 (W1), W468-W472. DOI: 10.1093/nar/gku301

NAR database issue (always a treasure trove)

The advance access release of most of the NAR database issue articles is out. As usual, this database issue includes a wealth of new and updated data repositories and analysis tools. We’ll be writing more extensive blog posts on it and doing some tips of the week over the next couple of months, but I thought I’d highlight the issue and some of the reports:

There are a lot of updates to many of the databases we know and love (links go to the full-text articles): UCSC Genome Browser, Ensembl, UniProt, MINT, SMART, WormBase, Gene Ontology, ENCODE, KEGG, UCSC Archaeal Browser, IMG/M, DBTSS, InterPro and others (we have tutorials on all those listed here).

And, as an indication of the explosion of data available (itself a subject of a database issue article, SRA), there are a lot of new(ish) databases on specific datatypes such as MINAS, a database of metal ions in nucleic acids (nice name :D); doRiNA, a database of RNA interactions in post-transcriptional regulation; BitterDB, a database of bitter compounds; and well over 100 more.

And I’ll give a special shout-out to my former PI at EMBL, because I can: Peer Bork’s group has 4 databases listed in the issue: eggNOG, SMART, STITCH and OGEE (and he and a couple of members are on the InterPro paper also).

This is going to be a wealth of information to wade through!

UCSC Genome Browser: http://genome.ucsc.edu
Ensembl: http://www.ensembl.org/
UniProt: http://www.uniprot.org/
MINT: http://mint.bio.uniroma2.it/mint/
SMART: http://smart.embl.de/
WormBase: http://www.wormbase.org/
Gene Ontology: http://www.geneontology.org/
ENCODE: http://genome.ucsc.edu/ENCODE/
KEGG: http://www.kegg.jp
UCSC Archaeal Browser: http://archaea.ucsc.edu/
IMG: http://img.jgi.doe.gov/cgi-bin/w/main.cgi
DBTSS: http://dbtss.hgc.jp/
InterPro: http://www.ebi.ac.uk/interpro




New and Updated Online Tutorials for PROSITE, InterPro, IntAct and UniProt

Comprehensive tutorials on the publicly available PROSITE, InterPro, IntAct and UniProt databases enable researchers to quickly and effectively use these invaluable resources.

Seattle, January 14, 2009 - OpenHelix today announced the availability of new tutorial suites on PROSITE, InterPro and IntAct, in addition to a newly updated tutorial on UniProt. PROSITE is a database that can be used to browse and search for information on protein domains, functional sites and families. InterPro is a database that integrates protein signature data from the major protein databases into a single comprehensive resource. IntAct is a protein interaction database with valuable tools that can be used to search for, analyze and graphically display protein interaction data from a wide variety of species. UniProt is a detailed curated knowledgebase about known proteins, with predictions and computational assignments for both characterized and uncharacterized proteins. These three new tutorials and the updated UniProt tutorial, in conjunction with the additional OpenHelix tutorials on MINT, PDB, Pfam, STRING, SMART, Entrez Protein, MMDB and many others, give researchers an excellent set of training resources to assist in their protein research.

The tutorial suites, available for single purchase or through a low-priced yearly subscription to all OpenHelix tutorials, contain a narrated, self-run, online tutorial, slides with full script, handouts and exercises. With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. These tutorials will teach users:


PROSITE
  • how to access information on domains, functional sites and protein families in PROSITE
  • to perform a quick and an advanced protein sequence scan
  • to find patterns in protein sequences using PRATT
  • to use MyDomains to create custom domain graphics

InterPro
  • to use both the basic and advanced search tools to find detailed information on entries in InterPro
  • how to understand and customize the display of your results
  • to use InterProScan to query novel protein sequences for information on domains and families

IntAct
  • how to perform basic and advanced searches to find protein interaction data
  • to effectively navigate and understand the various data views
  • to graphically display and manipulate a protein interaction network

UniProt
  • to perform text searches for relevant protein information
  • to search with sequences as a starting point
  • to understand the different types of UniProt records

To find out more about these and other tutorial suites, visit the OpenHelix Tutorial Catalog and OpenHelix, or visit the OpenHelix Blog for up-to-date information on genomics.

About OpenHelix
OpenHelix, LLC, provides the genomics knowledge you need when you need it. OpenHelix currently provides online self-run tutorials and on-site training for institutions and companies on the most powerful and popular free, web-based, publicly accessible bioinformatics resources. In addition, OpenHelix is contracted by resource providers to provide comprehensive, long-term training and outreach programs.

How well do we know our genes?

Do you have some favorite genes? Well, of course you do–you are probably a researcher who has in the past worked on some specific genes, or you are interested in groups of genes or genomic regions. Or maybe classes of genes. There is a new resource that provides you with a score of how well a given protein coding gene is annotated, and possibly therefore understood. The GCI, or Gene Characterization Index, can tell you. http://cisreg.ca/gci/

I love the idea of this project. The team wanted to look at the gene space and understand how well we knew the human genes. They looked at the growth of our knowledge over time, too–which provides an interesting view of our progress–as shown in this figure from their web site. And they wanted to identify the darkness–where don’t we know enough? Where are some great genes to examine, ones that could teach us some really new things?
That’s the kind of project I wanted to do when I was still in academia. I thought you could build a whole lab and crank out students who get assigned an unknown gene, and it is their job over the next few years to analyze and understand the gene. It would be unbiased by a disease area vision, or by the lab director’s preconceptions of what the gene might do. They could try all sorts of techniques to get there. It is probably also entirely unfundable by grant agencies. Alas.
