Tag Archives: iRefWeb

Tip of the Week: DAnCER for disease-annotated epigenetics data

Epigenetics and epigenomics are becoming more exciting areas of investigation, and we are seeing more requests for database resources to support them, and for the sources of data from these types of experiments. If you aren’t aware of these investigations at this point, check out their entries in the Talking Glossary of Genetic Terms:

Epigenetics: Epigenetics is an emerging field of science that studies heritable changes caused by the activation and deactivation of genes without any change in the underlying DNA sequence of the organism. The word epigenetics is of Greek origin and literally means over and above (epi) the genome.

Epigenome: The term epigenome is derived from the Greek word epi which literally means “above” the genome. The epigenome consists of chemical compounds that modify, or mark, the genome in a way that tells it what to do, where to do it, and when to do it. Different cells have different epigenetic marks. These epigenetic marks, which are not part of the DNA itself, can be passed on from cell to cell as cells divide, and from one generation to the next.

And for the talking part–you can hear Dr. Laura Elnitski talk about these in more detail–have a listen at each entry. And just today an article providing an epigenetics primer appeared in my inbox: Epigenetics: A Primer.

These intriguing–and sometimes puzzling–chromatin modification (CM) signals and leads that are being unveiled in many labs and projects now are becoming more widely available in different databases. For this week’s tip of the week I’ll introduce DAnCER: Disease-Annotated Chromatin Epigenetics Resource, one of the tools that is organizing this type of data and enabling additional explorations. You can find DAnCER here: http://wodaklab.org/dancer/

In the associated publication, the DAnCER team describes other useful resources that provide epigenetics data. These include ChromDB, ChromatinDB (for yeast), and the Human Histone Modification Database (HHMD), among others. I’m also aware of other sources. A few months back I introduced the NCBI Epigenomics resource as my tip-of-the-week. (At that time I promised that when the publication became available I’d mention it–that’s now at the bottom in the references section below.) There’s also quite a bit of this data flowing in to the UCSC Genome Browser ENCODE DCC. Including–may I add–some data from the very cool Elnitski bi-directional promoter studies.  You can find similar data types via the modENCODE project as well.

So, there are lots of resources out there. Each provider has different projects, species, goals, displays, etc. But the group that developed DAnCER wanted to fill a niche they didn’t see available already: linking these epigenetic changes to possible disease association data. Here’s how they describe their work:

Our research effort therefore strives to explore CM-related genes in the context of their protein-interaction network, their partnership in multi-protein complexes and cellular pathways, as well as their gene expression profiles….

They are well-suited to linking this kind of information. You may remember our previous explorations and discussions of iRefWeb. The kind of network and interaction data that they assemble in that context can be brought to the chromatin-modification arena. The point is that you can take steps beyond the modifications you know about, to explore their neighborhood of interactions, and potentially unearth important disease relationships from that.

The data includes several species, and because of that evolutionary conservation can also be explored.

So if you find that you are interested in exploring chromatin modifications, and want to take that data further, check out DAnCER, and the other tools and projects that are providing this type of information. If you have used the iRefWeb interface, you’ll see some similarities in structure. Search options with many filters are available. Color-coded and sortable results are provided. Links to gene details within the Wodak lab tools and external links are offered. On the gene pages at DAnCER you’ll have many types of annotations, including Gene Ontology descriptions, evidence type and references, neighbors, and protein domain information as well. And besides the texty-table based stuff, you can choose to load up the interactive network/interaction graphic, just like with the iRefWeb tool.

There’s a lot of opportunity to learn things from this tool. Try it out.

Quick Links and References:

DAnCER http://wodaklab.org/dancer/

Turinsky, A., Turner, B., Borja, R., Gleeson, J., Heath, M., Pu, S., Switzer, T., Dong, D., Gong, Y., On, T., Xiong, X., Emili, A., Greenblatt, J., Parkinson, J., Zhang, Z., & Wodak, S. (2010). DAnCER: Disease-Annotated Chromatin Epigenetics Resource Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq857

Fingerman, I., McDaniel, L., Zhang, X., Ratzat, W., Hassan, T., Jiang, Z., Cohen, R., & Schuler, G. (2010). NCBI Epigenomics: a new public resource for exploring epigenomic data sets Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1146

Tip of the Week: iRefWeb + protein interaction curation

For this week’s tip of the week I’m going to introduce iRefWeb, a resource that provides thousands of data points on protein-protein interactions.  If you follow this blog regularly, you may remember that we had a guest post from the iRefWeb team not too long ago. It was a nice overview of many of the important aspects of this tool, and I won’t go into those again here–you should check that out. Andrei knows those details quite well!

And at the time we also mentioned their webinar was coming up. We were unable to attend that, though, because we were doing workshops at The Stowers Institute. I was delighted to find that their webcast is now available to watch in full. It’s about 40 minutes long and covers much more than my 5-minute appetizer could do.  It details many practical aspects of how to use iRefWeb effectively.

Because they’ve done all the prep work for me, I don’t need to spend much time on the structural and functional features here. What I would like to do is draw your attention to a different aspect of their work. Their project draws together protein interaction data from a variety of source databases–including some of our favorites such as MINT and IntAct (for which we have training suites available for purchase).  They then used the iRefWeb processes and projects to evaluate and consider the issues around curation of protein-protein interaction data, and recently published those results. That’s what I’ll be focusing on in the post.

Every so often a database flame-war erupts in the bioinformatics community. Generally it involves someone writing a review of databases and/or their content. These evaluations are sometimes critical, sometimes not–but often what happens is that the database providers feel that their site is either mis-represented, or unfairly chastised, or at a minimum incompletely detailed on their mission and methods. I remember one  flambé developed not too long ago around a paper by our old friend from our Proteome days–Mike Cusick–and his colleagues (and we talked about that here). As the OpenHelix team has been involved in plenty of software and curation teams, we know how these play out. And we have sympathy for both the authors and the database providers in these situations.

So when the iRefWeb site pointed me to their new paper I thought: oh-oh…shall I wear my asbestos pantsuit for this one???  The title is Literature curation of protein interactions: measuring agreement across major public databases.  Heh–how’s that working out for ya?

Anyway–it turns out not to need protective gear, in my opinion. Because their project brings data from several interaction database sources, they are well-positioned to collect information about the data to compare the data sets. They clearly explain their stringent criteria, and then look at the data from different papers as it is collected across different databases.

A key point is this:

On average, two databases curating the same publication agree on 42% of their interactions. The discrepancies between the sets of proteins annotated from the same publication are typically less pronounced, with the average agreement of 62%, but the overall trend is similar.

So although there is overlap, different database have different data stored. This won’t be a surprise to most of us in bioinformatics. But I think it is something that end users need to understand. The iRefWeb team acknowledges that there are many sources of difference among data curation teams. Some curate only certain species. Some include all data from high-throughput studies, others take only high-confidence subsets of that data. And it’s fine for different teams to slice the data how they want. Users just need to be aware of this.

It seems that in general there’s more agreement between curators on non-vertebrate model organism data sets than there is for vertebrates. Isoform complexity is a major problem among the hairy organisms, it turns out–and this affects how the iRefWeb team scored the data sets. And as always when curation is evaluated–the authors of papers are sometimes found to be at fault for providing some vagueness to their data sets.

The iRefWeb tools offer you a way to assess what’s available from a given paper in a straightforward manner. In their webinar, you can hear them describe that ~30 minutes in. If you use protein-protein interaction data, you should check that out.

Caveat emptor for protein-protein interaction data (well, and all data in databases, really). But iRefWeb provides an indication of what is available and what the sources are–all of it traceable to the original papers.

The paper is a nice awareness of the issues, not specific criticism of any of the sources. They note the importance of the curation standards encouraged by the Proteomics Standards Initiative–Molecular Interaction (PSI-MI) ontologies and efforts. And they use their paper to raise awareness of where there may be dragons. It seems that dragons are quite an issue for multi-protein complex data.

Your mileage may vary. If you are a data provider, you may want to have protective gear for this paper. But as someone not connected directly to any of the projects, I thought it was reasonable. And something to keep in mind as a user of data–especially as more “big data” proteomics projects start rolling out more and more data.

Quick links and References:

iRefWeb http://wodaklab.org/iRefWeb/

Their Webinar: http://www.g-sin.com/home/events/Learn_about_iRefWeb

Turinsky, A., Razick, S., Turner, B., Donaldson, I., & Wodak, S. (2010). Literature curation of protein interactions: measuring agreement across major public databases Database, 2010 DOI: 10.1093/database/baq026

Cusick, M., Yu, H., Smolyar, A., Venkatesan, K., Carvunis, A., Simonis, N., Rual, J., Borick, H., Braun, P., Dreze, M., Vandenhaute, J., Galli, M., Yazaki, J., Hill, D., Ecker, J., Roth, F., & Vidal, M. (2009). Literature-curated protein interaction datasets Nature Methods, 6 (1), 39-46 DOI: 10.1038/nmeth.1284

Guest Post: iRefWeb — Andrei Turinsky

This next post in our continuing semi-regular Guest Post series is from Andrei Turinsky, one of the developers of iRefWeb. If you are a provider of a free, publicly available genomics tool, database or resource and would like to convey something to users on our guest post feature, please feel free to contact us at wlathe AT openhelix DOT com or the contact form (write ‘guest post’ as subject heading). We welcome introductions to your resource, information on updates, highlights of little known gems or opinion pieces on the state of genomic research and databases.

What is iRefWeb?

Protein-protein interactions (PPI) have become an important tool in biomedical research. Yet the PPI data for a specific organism tend to be distributed over a number of different databases. Comparison and integration of PPI information across databases remains a challenging task.

iRefWeb (Turner et al. (2010) Database, Vol. 2010, Article ID baq023.) is a web interface to a broad integrated landscape of protein-protein interactions (PPIs). For a given gene or protein, you can access all PPI records and protein complexes, consolidated non-redundantly from ten major public databases: BIND, BioGRID, CORUM, DIP, IntAct, HPRD, MINT, MPact, MPPI and OPHID. iRefWeb also presents various supporting evidence, helping you to gauge the reliability of an interaction. Versatile search filters allows you to retrieve the PPIs with a given level of support. Other features facilitate the analysis of possible inconsistencies across PPI data and the examination of PPI statistics. Data consolidation procedure effectively combines redundant records using the iRefIndex process (Razick et al (2008) BMC Bioinformatics 9, 405.).

Figure 1: The iRefIndex process aggregated over 916,059 original PPI records from source databases, 75% of which were redundant. The consolidation merged the redundant PPIs, reducing their number four-fold (orange). Only 232,612 PPIs were non-redundant (blue)

Continue reading