Tag Archives: MINT

NAR database issue (always a treasure trove)

The advance access release of most of the NAR database issue articles is out. As usual, this database issue includes a wealth of new and updated data repositories and analysis tools. We’ll be writing up additional, more extensive blog posts on it and doing some tips of the week over the next couple of months, but I thought I’d highlight the issue and some of the reports:

There are a lot of updates to many of the databases we know and love (links go to the full-text articles): UCSC Genome Browser, Ensembl, UniProt, MINT, SMART, WormBase, Gene Ontology, ENCODE, KEGG, UCSC Archaeal Browser, IMG/M, DBTSS, InterPro and others (we have tutorials on all those listed here).

And, as an indication of the explosion of data available (itself a subject of a database issue article, SRA), there are a lot of new(ish) databases on specific datatypes such as MINAS, a database of metal ions in nucleic acids (nice name :D); doRiNA, a database of RNA interactions in post-transcriptional regulation; BitterDB, a database of bitter compounds and well over 100 more.

And I’ll give a special shout-out to my former PI at EMBL, because I can. Peer Bork’s group has 4 databases listed in the issue: eggNOG, SMART, STITCH and OGEE (and he and a couple of group members are also on the InterPro paper).

This is going to be a wealth of information to wade through!

UCSC Genome Browser: http://genome.ucsc.edu
Ensembl: http://www.ensembl.org/
UniProt: http://www.uniprot.org/
MINT: http://mint.bio.uniroma2.it/mint/
SMART: http://smart.embl.de/
WormBase: http://www.wormbase.org/
Gene Ontology: http://www.geneontology.org/
ENCODE: http://genome.ucsc.edu/ENCODE/
KEGG: http://www.kegg.jp
UCSC Archaeal Browser: http://archaea.ucsc.edu/
IMG: http://img.jgi.doe.gov/cgi-bin/w/main.cgi
DBTSS: http://dbtss.hgc.jp/
InterPro: http://www.ebi.ac.uk/interpro

 


Tip of the Week: iRefWeb + protein interaction curation

For this week’s tip of the week I’m going to introduce iRefWeb, a resource that provides thousands of data points on protein-protein interactions.  If you follow this blog regularly, you may remember that we had a guest post from the iRefWeb team not too long ago. It was a nice overview of many of the important aspects of this tool, and I won’t go into those again here–you should check that out. Andrei knows those details quite well!

And at the time we also mentioned their webinar was coming up. We were unable to attend that, though, because we were doing workshops at The Stowers Institute. I was delighted to find that their webcast is now available to watch in full. It’s about 40 minutes long and covers much more than my 5-minute appetizer could do.  It details many practical aspects of how to use iRefWeb effectively.

Because they’ve done all the prep work for me, I don’t need to spend much time on the structural and functional features here. What I would like to do is draw your attention to a different aspect of their work. Their project draws together protein interaction data from a variety of source databases–including some of our favorites such as MINT and IntAct (for which we have training suites available for purchase).  They then used the iRefWeb processes and projects to evaluate and consider the issues around curation of protein-protein interaction data, and recently published those results. That’s what I’ll be focusing on in the post.

Every so often a database flame-war erupts in the bioinformatics community. Generally it involves someone writing a review of databases and/or their content. These evaluations are sometimes critical, sometimes not–but often what happens is that the database providers feel that their site is either misrepresented, or unfairly chastised, or at a minimum incompletely detailed on their mission and methods. I remember one flambé that developed not too long ago around a paper by our old friend from our Proteome days–Mike Cusick–and his colleagues (and we talked about that here). As the OpenHelix team has been involved in plenty of software and curation teams, we know how these play out. And we have sympathy for both the authors and the database providers in these situations.

So when the iRefWeb site pointed me to their new paper I thought: oh-oh…shall I wear my asbestos pantsuit for this one???  The title is Literature curation of protein interactions: measuring agreement across major public databases.  Heh–how’s that working out for ya?

Anyway–it turns out not to need protective gear, in my opinion. Because their project brings data from several interaction database sources, they are well-positioned to collect information about the data to compare the data sets. They clearly explain their stringent criteria, and then look at the data from different papers as it is collected across different databases.

A key point is this:

On average, two databases curating the same publication agree on 42% of their interactions. The discrepancies between the sets of proteins annotated from the same publication are typically less pronounced, with the average agreement of 62%, but the overall trend is similar.

So although there is overlap, different databases have different data stored. This won’t be a surprise to most of us in bioinformatics. But I think it is something that end users need to understand. The iRefWeb team acknowledges that there are many sources of difference among data curation teams. Some curate only certain species. Some include all data from high-throughput studies, others take only high-confidence subsets of that data. And it’s fine for different teams to slice the data how they want. Users just need to be aware of this.
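To make the idea of "agreement" concrete, here is a minimal Python sketch that compares what two databases recorded from the same publication. It uses the Jaccard index (shared items divided by all distinct items) as the overlap measure, which is my assumption for illustration; the paper's own scoring may differ. The protein identifiers and database names are hypothetical.

```python
def normalize(pair):
    """Treat A-B and B-A as the same undirected interaction."""
    return frozenset(pair)


def agreement(set_a, set_b):
    """Jaccard index: shared items divided by all distinct items."""
    union = set_a | set_b
    if not union:
        return 1.0  # two empty sets agree trivially
    return len(set_a & set_b) / len(union)


# Hypothetical curation of ONE publication by two databases.
db_one = {normalize(p) for p in [("P1", "P2"), ("P2", "P3"), ("P1", "P4")]}
db_two = {normalize(p) for p in [("P2", "P1"), ("P3", "P2"), ("P3", "P4")]}

# Agreement on the interactions themselves.
interaction_agreement = agreement(db_one, db_two)

# Agreement on the proteins mentioned, ignoring how they are paired.
proteins_one = set().union(*db_one)
proteins_two = set().union(*db_two)
protein_agreement = agreement(proteins_one, proteins_two)

print(f"interactions: {interaction_agreement:.2f}")
print(f"proteins:     {protein_agreement:.2f}")
```

In this toy case the two databases list exactly the same proteins but pair them differently, so protein-level agreement is higher than interaction-level agreement, the same direction as the trend quoted above.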

It seems that in general there’s more agreement between curators on non-vertebrate model organism data sets than there is for vertebrates. Isoform complexity is a major problem among the hairy organisms, it turns out–and this affects how the iRefWeb team scored the data sets. And as always when curation is evaluated–the authors of papers are sometimes found to be at fault for providing some vagueness to their data sets.

The iRefWeb tools offer you a way to assess what’s available from a given paper in a straightforward manner. In their webinar, you can hear them describe that ~30 minutes in. If you use protein-protein interaction data, you should check that out.

Caveat emptor for protein-protein interaction data (well, and all data in databases, really). But iRefWeb provides an indication of what is available and what the sources are–all of it traceable to the original papers.

The paper offers a nice overview of the issues, not specific criticism of any of the sources. They note the importance of the curation standards encouraged by the Proteomics Standards Initiative–Molecular Interaction (PSI-MI) ontologies and efforts. And they use their paper to raise awareness of where there may be dragons. It seems that dragons are quite an issue for multi-protein complex data.

Your mileage may vary. If you are a data provider, you may want to have protective gear for this paper. But as someone not connected directly to any of the projects, I thought it was reasonable. And something to keep in mind as a user of data–especially as more “big data” proteomics projects start rolling out more and more data.

Quick links and References:

iRefWeb http://wodaklab.org/iRefWeb/

Their Webinar: http://www.g-sin.com/home/events/Learn_about_iRefWeb

Turinsky, A., Razick, S., Turner, B., Donaldson, I., & Wodak, S. (2010). Literature curation of protein interactions: measuring agreement across major public databases Database, 2010 DOI: 10.1093/database/baq026

Cusick, M., Yu, H., Smolyar, A., Venkatesan, K., Carvunis, A., Simonis, N., Rual, J., Borick, H., Braun, P., Dreze, M., Vandenhaute, J., Galli, M., Yazaki, J., Hill, D., Ecker, J., Roth, F., & Vidal, M. (2009). Literature-curated protein interaction datasets Nature Methods, 6 (1), 39-46 DOI: 10.1038/nmeth.1284

Tip of the Week: A year in tips III (last half of 2010)

As you may know, we’ve been doing tips-of-the-week for three years now. We have completed around 150 little tidbit introductions to various resources. At the end of the year we’ve established a sort of holiday tradition: we are doing a summary post to collect them all. If you have missed any of them it’s a great way to have a quick look at what might be useful to your work.

Here are the tips from the first half of the year, and below you will find the tips from the last half of 2010 (you can see past years’ tips here: 2008 I, 2008 II, 2009 I, 2009 II):

July

July 7: Mint for Protein Interactions, an introduction to MINT to study protein-protein interactions
July 14: Introduction to Changes to NCBI’s Protein Database, as it states :D
July 21: 1000 Genome Project Browser, 1000 Genomes project has pilot data out, this is the browser.
July 28: R Genetics at Galaxy, the Galaxy analysis and workflow tool added R genetics analysis tools.

August

August 4: YeastMine, SGD adds an InterMine capability to their database search.
August 11: Gaggle Genome Browser, a tool to allow for the visualization of genomic data, part of the “gaggle components”
August 18: Brenda, comprehensive enzyme information.
August 25: Mouse Genomic Pathology, unlike other tips, this is not a video but rather a detailed introduction to a new website.

September

September 1: Galaxy Pages, an introduction to the new community documentation and sharing capability at Galaxy.
September 8: Varitas. A Plaid Database. A resource that integrates human variation data such as SNPs and CNVs.
September 15: CircuitsDB for TF/miRNA/gene regulation networks.
September 21: Pathcase for pathway data.
September 29: Comparative Toxicogenomics Database (CTD), VennViewer. A new tool to create Venn diagrams to compare associated datasets for genes, diseases or chemicals.

October

October 6: BioExtract Server, a server that allows researchers to store data, analyze data and create workflows of data.
October 13: NCBI Epigenomics, “Beyond the Genome” NCBI’s site for information and data on epigenetics.
October 20: Comparing Microbial Databases including IMG, UCSC Microbial and Archaeal browsers, CMR and others.
October 27: iTOL, interactive tree of life

November

November 3: VISTA Enhancer Browser explore possible regulatory elements with comparative genomics
November 10: Getting canonical gene info from the UCSC Browser. Need one gene version to ‘rule them all’?
November 17: ENCODE Data in the UCSC Genome Browser, an entire 35 minute tutorial on the ENCODE project.
November 24: FLink. A tool that links items in one NCBI database to another in a meaningful and weighted manner.

December

December 1: PhylomeDB. A database of gene phylogenies of many species.
December 8: BioGPS for expression data and more.
December 15: RepTar, a database of miRNA target sites.

Tip of the Week: MINT for protein interactions


We’ve long been fans of the tools developed by the team responsible for MINT: Molecular INTeraction database.  MINT is a curated resource full of experimentally verified protein-protein interactions, with some great visualization options.  In addition to the main MINT interface, there are other aspects to the site that bring other types of visualization as well.  We have done a tip on MINT in the past, but we wanted to re-visit this for our SciVee collection, and also mention a handy tool called Connect. Connect can be used to enter a list of up to ~100 proteins and generate the connection map between them.

HomoMINT: this tool extends the experimentally-verified interaction collection to include inferred interactions for human, based on data from model organisms.  So this is homologous interactions, hence the name….

Domino: a look at the domains that are involved in the protein-protein interactions.

VirusMINT: this aspect of MINT explores viral proteins, including how they interact with host proteins to disrupt host physiology.

For this week’s tip I’ll focus mainly on the experimentally-verified portion of MINT and that interface, and introduce the others. You’ll see how to do a quick search, explore protein details, and then load up the network in the visualization tool.  We have a full tutorial on MINT available to subscribers who want to go deeper into the functionality–we can only barely touch on the features within our screencast movie time limit.
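The HomoMINT idea mentioned above, carrying interactions from model organisms over to human through homology, can be sketched in a few lines of Python. This is only an illustration of the general technique, not HomoMINT's actual pipeline; the function name, the ortholog table and all identifiers are hypothetical.

```python
def infer_human_interactions(model_interactions, orthologs):
    """Map each model-organism pair onto human via an ortholog table.

    A pair is transferred only when BOTH partners have at least one
    human ortholog; one-to-many mappings yield several inferred pairs.
    """
    inferred = set()
    for a, b in model_interactions:
        for human_a in orthologs.get(a, ()):
            for human_b in orthologs.get(b, ()):
                inferred.add(frozenset((human_a, human_b)))
    return inferred


# Hypothetical yeast interactions and a hypothetical yeast-to-human table.
yeast_interactions = [("yA", "yB"), ("yB", "yC")]
yeast_to_human = {"yA": ["hA"], "yB": ["hB1", "hB2"]}

inferred = infer_human_interactions(yeast_interactions, yeast_to_human)
for pair in sorted(sorted(p) for p in inferred):
    print(" - ".join(pair))
```

Note that the yB/yC interaction is dropped because yC has no human ortholog in the table. Inferred pairs like these are predictions, which is why it helps that the site keeps them apart from the experimentally verified collection.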

Edit: I should have put the MINT link more clearly: http://mint.bio.uniroma2.it/mint/Welcome.do

Ceol, A., Chatr Aryamontri, A., Licata, L., Peluso, D., Briganti, L., Perfetto, L., Castagnoli, L., & Cesareni, G. (2009). MINT, the molecular interaction database: 2009 update Nucleic Acids Research, 38 (Database) DOI: 10.1093/nar/gkp983

Tip of the Week: VirusMINT

The MINT or Molecular Interaction database for examination of protein interaction networks has long been a favorite tool of mine.   The regular “flavor” of MINT includes over 100,000 interactions with a focus on experimentally verified protein interaction data.  But recently I became aware of the VirusMINT data that is now available as well.

The VirusMINT paper describes the initial emphasis on medically relevant viruses for their curation efforts, and how the work differs from efforts like this PLoS Pathogens paper and the individual virus sites like NCBI’s HIV Interactions collection and the PIG (Pathogen Interaction Gateway) site.

Manual curation of data is labor-intensive, but I really appreciate the quality of that data.  Some of the data they curated themselves, and some was downloaded from existing curated sites.  Once at the site for VirusMINT, it is really simple to load up a virus network by simply clicking on a virus button, and then the proteins load and generate a network interaction group.  The proteins are clickable and you can find out more about the proteins and their sources, and domain information if that is available.  You can also click on the numbers between the interactions to find out which paper provided the interaction data and link quickly to PubMed from there.  And not only can you interact with the data using the MINT software framework, but you can download the data and use it in other tools as well.

This brief Tip-of-the-Week introduces a few of the basic features of VirusMINT.  We have additional details about how to interact with the software in our full MINT tutorial.

Chatr-aryamontri, A., Ceol, A., Peluso, D., Nardozza, A., Panni, S., Sacco, F., Tinti, M., Smolyar, A., Castagnoli, L., Vidal, M., Cusick, M., & Cesareni, G. (2009). VirusMINT: a viral protein interaction database Nucleic Acids Research, 37 (Database) DOI: 10.1093/nar/gkn739

VirusMINT site directly: http://mint.bio.uniroma2.it/virusmint/Welcome.do

MINT main site directly: http://mint.bio.uniroma2.it/mint/Welcome.do

BioCreative

Looking through one of my zillion mailing lists today I came across a project that was new to me. BioCreative is a competition/challenge for the text-mining community in biology.  It’s been around for a while according to the “about” page–and I recognize a bunch of those names.  But I’m not sure why I wasn’t aware of the current state of this.  I suppose since my Proteome days (now part of BIOBASE) I’ve drifted a bit from text mining back into sequences and SNPs and more…

Anyway–seems like a project I’d like to keep an eye on.  They are now up to BioCreative II.5, and in the midst of a new round of evaluations.  You can read more about the current project here.  That also links to some helpful background on the project and the issues.

Paper compares interaction databases

I wish I had more time to go into this paper in more detail–but I wanted to let you know that the paper is out there now.  It came in my recent Nature Methods in paper version, and if I wasn’t crazy busy on a very cool project that we hope to launch this week I’d go deeper….

The paper is:  Literature-curated protein interaction datasets by Cusick et al. Nature Methods 6, 39–46 (2009), doi:10.1038/nmeth.1284

I knew from the abstract that it was going to cause some conflama. And I was right.  Soon after an article in Bioinform addressed some of the issues.  Requires a subscription, but here’s the title and the link if you do have one:  Study Finding Erroneous Protein-Protein Interactions in Curated Databases Stirs Debate, by Vivien Marx.

This paper gets at a question that people ask us all the time–how do I know which database to use for X purpose?  So if your question is which database to use for protein interactions, you should read this paper and consider the points they make.   They don’t compare all protein interaction databases, of course–but for those they do examine (IntAct, DIP, MINT) they provide informative comparisons that you should consider for any database.  What does it contain?  What is it missing?  They have some nice Venn diagrams to illustrate the content.  The one I used here is just a representation of that, not attempting to be accurately proportional; go to the paper to see the real ones.

Our position is that you should use all of them, of course  :)  Project goals and funding issues, species specialties, scope…all of this impacts what will be in a database.  (In fact, please go to MINT and support their funding by signing their protest of funding cuts).

One point embedded in the paper caught my attention, though.  One major curation issue was that the species designation of the protein in the interactions was not clear.   I know sometimes this is a problem with the original source paper.  Sometimes it is a curation issue.  But this worries me because of the concern I raised with Wikipedia gene entries.  I made the point that there was no way to distinguish between human genes and mouse genes of the same name (MEF2/Mef2).  This could be true of similar genes in other species too–where the gene might not even be the same gene, just a naming coincidence. I can see it has arisen again.  But if we expect to rely on Wikification projects like Gene Wiki for more and more, I think that would need to be addressed.
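A tiny Python sketch shows why the species designation matters when records are keyed by name alone. The taxon numbers are the NCBI Taxonomy IDs for human (9606) and mouse (10090); the record structure and the notes are hypothetical, made up for illustration.

```python
from collections import namedtuple

# A gene is identified by species AND symbol, not by symbol alone.
Gene = namedtuple("Gene", ["taxon_id", "symbol"])

HUMAN, MOUSE = 9606, 10090  # NCBI Taxonomy IDs

# Hypothetical records: (species, symbol as written, annotation).
records = [
    (HUMAN, "MEF2", "note about the human gene"),
    (MOUSE, "Mef2", "note about the mouse gene"),
]

by_symbol = {}  # keyed by name only, case-insensitive
by_gene = {}    # keyed by (species, name)
for taxon_id, symbol, note in records:
    by_symbol[symbol.upper()] = note  # the mouse note overwrites the human one
    by_gene[Gene(taxon_id, symbol.upper())] = note

print(len(by_symbol), "entry when keyed by symbol alone")
print(len(by_gene), "entries when keyed by species and symbol")
```

Keying on the species as well keeps the two records apart, which is the kind of disambiguation I would want from any wiki-style or curated resource.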

New and updated Online Tutorials for MINT and Reactome

OpenHelix today announced the availability of a new tutorial suite on MINT, a highly used database of protein-protein interactions, and an update to the Reactome tutorial. MINT is a collection of molecular interaction databases that can be used to search for, analyze and graphically display molecular interaction networks from a wide variety of species. Reactome is a knowledgebase of biological processes that is a high quality, deeply curated assembly of information about biological pathways and their components, including both biological and chemical entities.

The tutorial suites, available for single purchase or through a low-priced yearly subscription to all OpenHelix tutorials, contain a narrated, self-run, online tutorial, slides with full script, handouts and exercises. With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. These tutorials will teach users:

MINT

  • how to search for protein interaction data in MINT
  • how to search for inferred human interaction data in HomoMINT
  • how to search Domino for peptide domain interactions
  • to edit and manipulate interaction data in the MINT viewer

Reactome

  • to navigate through the high-quality biochemical pathway information in Reactome
  • how to find diagrams and details about biological pathways
  • ways to link to information about specific pathways and participating molecules
  • to use the Reactome Mart interface to generate custom queries of the underlying database

To find out more about these and other tutorial suites visit the OpenHelix Tutorial Catalog and OpenHelix or visit the OpenHelix Blog for up-to-date information on genomics.

About OpenHelix
OpenHelix, LLC, (http://www.openhelix.com) provides the genomics knowledge you need when you need it. OpenHelix currently provides online self-run tutorials and on-site training for institutions and companies on the most powerful and popular free, web based, publicly accessible bioinformatics resources. In addition, OpenHelix is contracted by resource providers to provide comprehensive, long-term training and outreach programs.

Open source molecular modeling–finally?

My Bio SmartBrief newsletter today had a reference to a paper in a rather…um…obscure journal. Maybe it is just something I have missed over the years, but the Journal of the Royal Society Interface has really just never come across my desk before. Nevertheless, Wired seems to think this software is finally meeting our needs in biological modeling. Finally?

The open-source software movement has finally met the world of biological modeling.

Both a language and a program, “little b” gives systems biologists an infrastructure for building and sharing models of cellular activity.

Ok–this may be fabulous software. I’ll have a look. But to say that this is the one we have been holding our breath for is rather presumptuous. I’m not paying $49 for the paper, so I can’t assess it from the text. I will go and evaluate it at the developer’s site. For software evaluation I do read the papers (unless they cost $49), but I don’t believe anything until I kick the tires quite a bit anyway.

But from the breathless Wired article I can’t see why this is the solution rather than GenMapp, or BiologicalNetworks, or Cytoscape, or NAViGaTOR, or VisANT or….the half dozen others that we are looking at for tutorial development. Or the ones I intend to learn about at the ISSB meeting in Sweden next month. The choice of tutorials there had me stumped on which ones I could fit into my schedule.

This Wired line about the image they show cracked me up:

Image: Detail from a gene regulation network, courtesy of PNAS. Wouldn’t it be great not to have to duplicate this in every new model?

Um….I can reproduce most of that now with about 10 different tools. If I wanted to do it quickly with stored information I could go to MINT and check out the curated interaction data and their very cool MINT Viewer (you can watch me do that in a movie here). Well, except it doesn’t show a picture of the Golgi in the background. Is that what’s new–despite that being from some unreferenced PNAS paper that may have nothing to do with this software? I would bet that, if I asked, most of these teams would let me load up a cell graphic in the background, or I could create a network and layer it in with my image editing software. But I don’t think that’s it.

I hope little b is great. But like most software in this field there are other options–and some tools are right for some tasks, others are right for other tasks, even when they are in the same space. As we say in the blogosphere, YMMV {your mileage may vary}.

Tip of the week: Harvester, a "Swiss army knife" of bioinformatics

This week I’m going to introduce a tool that searches a whole bunch of resources for you with one single click. Harvester, from the Karlsruhe Institute of Technology, offers a really simple interface for searching. If your species is one of the ones collected in their search, you will find that Harvester will enable you to search a slew of databases with just one query–NCBI, UCSC, MINT, STRING and many others. The results will provide quick links to some databases, and some results pages will be embedded in one big web page that you can scroll down and overview really quickly. The embedded pages aren’t just summary text–they are the actual database pages in situ! You can see them and interact with them just as if you were on that site doing the search.

This 3 minute movie introduces you to Harvester. If you quickly need a summary of what’s in all the databases they collect, it is a very handy tool. It does remind me of a Swiss army knife–not earth shatteringly novel from an algorithmic perspective, but many useful tools pulled together in one place. Try it out!