Tag Archives: IntAct

Tip of the Week: iRefWeb + protein interaction curation

For this week’s tip of the week I’m going to introduce iRefWeb, a resource that provides thousands of data points on protein-protein interactions.  If you follow this blog regularly, you may remember that we had a guest post from the iRefWeb team not too long ago. It was a nice overview of many of the important aspects of this tool, and I won’t go into those again here–you should check that out. Andrei knows those details quite well!

And at the time we also mentioned their webinar was coming up. We were unable to attend that, though, because we were doing workshops at The Stowers Institute. I was delighted to find that their webcast is now available to watch in full. It’s about 40 minutes long and covers much more than my 5-minute appetizer could do.  It details many practical aspects of how to use iRefWeb effectively.

Because they’ve done all the prep work for me, I don’t need to spend much time on the structural and functional features here. What I would like to do is draw your attention to a different aspect of their work. Their project draws together protein interaction data from a variety of source databases–including some of our favorites such as MINT and IntAct (for which we have training suites available for purchase).  They then used the iRefWeb processes and projects to evaluate and consider the issues around curation of protein-protein interaction data, and recently published those results. That’s what I’ll be focusing on in the post.

Every so often a database flame-war erupts in the bioinformatics community. Generally it involves someone writing a review of databases and/or their content. These evaluations are sometimes critical, sometimes not–but often what happens is that the database providers feel that their site is either mis-represented, or unfairly chastised, or at a minimum incompletely detailed on their mission and methods. I remember one  flambé developed not too long ago around a paper by our old friend from our Proteome days–Mike Cusick–and his colleagues (and we talked about that here). As the OpenHelix team has been involved in plenty of software and curation teams, we know how these play out. And we have sympathy for both the authors and the database providers in these situations.

So when the iRefWeb site pointed me to their new paper I thought: oh-oh…shall I wear my asbestos pantsuit for this one???  The title is Literature curation of protein interactions: measuring agreement across major public databases.  Heh–how’s that working out for ya?

Anyway–it turns out not to need protective gear, in my opinion. Because their project brings data from several interaction database sources, they are well-positioned to collect information about the data to compare the data sets. They clearly explain their stringent criteria, and then look at the data from different papers as it is collected across different databases.

A key point is this:

On average, two databases curating the same publication agree on 42% of their interactions. The discrepancies between the sets of proteins annotated from the same publication are typically less pronounced, with the average agreement of 62%, but the overall trend is similar.

So although there is overlap, different database have different data stored. This won’t be a surprise to most of us in bioinformatics. But I think it is something that end users need to understand. The iRefWeb team acknowledges that there are many sources of difference among data curation teams. Some curate only certain species. Some include all data from high-throughput studies, others take only high-confidence subsets of that data. And it’s fine for different teams to slice the data how they want. Users just need to be aware of this.

It seems that in general there’s more agreement between curators on non-vertebrate model organism data sets than there is for vertebrates. Isoform complexity is a major problem among the hairy organisms, it turns out–and this affects how the iRefWeb team scored the data sets. And as always when curation is evaluated–the authors of papers are sometimes found to be at fault for providing some vagueness to their data sets.

The iRefWeb tools offer you a way to assess what’s available from a given paper in a straightforward manner. In their webinar, you can hear them describe that ~30 minutes in. If you use protein-protein interaction data, you should check that out.

Caveat emptor for protein-protein interaction data (well, and all data in databases, really). But iRefWeb provides an indication of what is available and what the sources are–all of it traceable to the original papers.

The paper is a nice awareness of the issues, not specific criticism of any of the sources. They note the importance of the curation standards encouraged by the Proteomics Standards Initiative–Molecular Interaction (PSI-MI) ontologies and efforts. And they use their paper to raise awareness of where there may be dragons. It seems that dragons are quite an issue for multi-protein complex data.

Your mileage may vary. If you are a data provider, you may want to have protective gear for this paper. But as someone not connected directly to any of the projects, I thought it was reasonable. And something to keep in mind as a user of data–especially as more “big data” proteomics projects start rolling out more and more data.

Quick links and References:

iRefWeb http://wodaklab.org/iRefWeb/

Their Webinar: http://www.g-sin.com/home/events/Learn_about_iRefWeb

Turinsky, A., Razick, S., Turner, B., Donaldson, I., & Wodak, S. (2010). Literature curation of protein interactions: measuring agreement across major public databases Database, 2010 DOI: 10.1093/database/baq026

Cusick, M., Yu, H., Smolyar, A., Venkatesan, K., Carvunis, A., Simonis, N., Rual, J., Borick, H., Braun, P., Dreze, M., Vandenhaute, J., Galli, M., Yazaki, J., Hill, D., Ecker, J., Roth, F., & Vidal, M. (2009). Literature-curated protein interaction datasets Nature Methods, 6 (1), 39-46 DOI: 10.1038/nmeth.1284

Tip of the Week: VirusMINT

virusMINTThe MINT or Molecular Interaction database for examination of protein interaction networks has long been a favorite tool of mine.   The regular “flavor” of MINT includes over 100,000 interactions with a focus on experimentally verified protein interaction data.  But recently I became aware of the VirusMINT data that is now available as well.

The VirusMINT paper describes the initial emphasis on medically relevant viruses for their curation efforts, and how the work differs from efforts like this PLoS Pathogens paper and the individual virus sites like NCBI’s HIV Interactions collection and the PIG (Pathogen Interaction Gateway) site.

Manual curation of data is labor-intensive, but I really appreciate the quality of that data.  Some of the data they curated themselves, and some was downloaded from existing curated sites.  Once at the site for VirusMINT, it is really simple to load up a virus network by simply clicking on a virus button, and then the proteins load and generate a network interaction group.  The proteins are clickable and you can find out more about the proteins and their sources, and domain information if that is available.  You can also click on the numbers between the interactions to find out which paper provided the interaction data and link quickly to PubMed from there.  And not only can you interact with the data using the MINT software framework, but you can download the data and use it in other tools as well.

This brief Tip-of-the-Week introduces a few of the basic features of VirusMINT.  We have additional details about how to interact with the software in our full MINT tutorial.

Chatr-aryamontri, A., Ceol, A., Peluso, D., Nardozza, A., Panni, S., Sacco, F., Tinti, M., Smolyar, A., Castagnoli, L., Vidal, M., Cusick, M., & Cesareni, G. (2009). VirusMINT: a viral protein interaction database Nucleic Acids Research, 37 (Database) DOI: 10.1093/nar/gkn739

VirusMINT site directly: http://mint.bio.uniroma2.it/virusmint/Welcome.do

MINT main site directly: http://mint.bio.uniroma2.it/mint/Welcome.do

Paper compares interaction databases

venn_interactions.jpgI wish I had more time to go into this paper in more detail–but I wanted to let you know that the paper is out there now.  It came in my recent Nature Methods in paper version, and if I wasn’t crazy busy on a very cool project that we hope to launch this week I’d go deeper….

The paper is:  Literature-curated protein interaction datasets by Cusick et al. Nature Methods 6, 39 – 46 (2009)  2008 | doi:10.1038/nmeth.1284

I knew from the abstract that it was going to cause some conflama. And I was right.  Soon after an article in Bioinform addressed some of the issues.  Requires a subscription, but here’s the title and the link if you do have one:  Study Finding Erroneous Protein-Protein Interactions in Curated Databases Stirs Debate, by Vivien Marx.

This paper gets at a question that people ask us all the time–how do I know which database to use for X purpose?  So if your question is which database to use for protein interactions, you should read this paper and consider the points they make.   They don’t compare all protein interaction databases, of course–but for those they do examine (IntAct, DIP, MINT) they provide informative comparisons that you should consider for any database.  What does it contain?  What is it missing?  They have some nice Venn diagrams to illustrate the content.  The one I used here is just a representation of that, not attempting to be accurately proportional, go to the paper to see the real ones.

Our position is that you should use all of them, of course  :)  Project goals and funding issues, species specialties, scope…all of this impacts what will be in a database.  (In fact, please go to MINT and support their funding by signing their protest of funding cuts).

One point embedded in the paper caught my attention, though.  One major curation issue was that the species designation of the protein in the interactions was not clear.   I know sometimes this is a problem with the original source paper.  Sometimes it is a curation issue.  But this worries me because of the concern I raised with Wikipedia gene entries.  I made the point that there was no way to distinguish between human genes and mouse genes of the same name (MEF2/Mef2).  This could be true of similar genes in other species too–where the gene might not even be the same gene, just a naming coincidence. I can see it has arisen again.  But if we expect to rely on Wikification projects like Gene Wiki for more and more, I think that would need to be addressed.

New and Updated Online Tutorials for PROSITE, InterPro, IntAct and UniProt

Comprehensive tutorials on the publicly available PROSITE, InterPro, IntAct and UniProt databases enable researchers to quickly and effectively use these invaluable resources.

Seattle January 14, 2009 — OpenHelix today announced the availability of new tutorial suites on PROSITE, InterPro and IntAct, in addition to a newly updated tutorial on UniProt. PROSITE is a database that can be used to browse and search for information on protein domains, functional sites and families, InterPro is a database that integrates protein signature data from the major protein databases into a single comprehensive resource and IntAct is a protein interaction database with valuable tools that can be used to search for, analyze and graphically display protein interaction data from a wide variety of species. UniProt is a detailed curated knowledgebase about known proteins, with predictions and computational assignments for both characterized and uncharacterized proteins. These three new tutorials and an updated UniProt tutorial, in conjunction with the additional OpenHelix tutorials on MINT, PDB, Pfam, STRING, SMART, Entrez Protein, MMDB and many others, give the researcher an excellent set of training resources to assist in their protein research.

The tutorial suites, available for single purchase or through a low- priced yearly subscription to all OpenHelix tutorials, contain a narrated, self-run, online tutorial, slides with full script, handouts
and exercises. With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. These tutorials will teach users:

PROSITE

*how to access information on domains, functional sites and protein families in PROSITE
*to perform a quick and an advanced protein sequence scan
*to find patterns in protein sequences using PRATT
*to use MyDomains to create custom domain graphics

InterPro

  • to use both the basic and advanced search tools to find detailed information on entries in InterPro
  • how to understand and customize the display of your results
  • to use InterProScan to query novel protein sequences for information on domains and families

IntAct

  • how to perform basic and advanced searches to find protein interaction data
  • to effectively navigate and understand the various data views
  • to graphically display and manipulate a protein interaction network

UniProt

  • to perform text searches for relevant protein information
  • to search with sequences as a starting point
  • to understand the different types of UniProt records

To find out more about these and other tutorial suites visit the OpenHelix Tutorial Catalog and OpenHelix or visit the OpenHelix Blog for up-to-date information on genomics.

About OpenHelix
OpenHelix, LLC, provides the genomics knowledge you need when you need it. OpenHelix currently provides online self-run tutorials and on-site training for institutions and companies on the most powerful and popular free, web based, publicly accessible bioinformatics resources. In addition, OpenHelix is contracted by resource providers to provide comprehensive, long-term training and outreach programs.