Tip of the week: ORegAnno for regulatory annotation

Lately we’ve been getting a lot of questions about ways to analyze promoters and other regulatory aspects of genes. For a while we mostly pointed to the prediction data available in the UCSC Genome Browser’s TFBS Conserved track. TFBS Conserved is a track of computationally predicted transcription factor binding sites (TFBS) that are conserved across human, mouse, and rat, based on Transfac v7.0 from Biobase. As the track description says, it’s important to know this:

The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites.

Though this is useful, people have been wanting more evidence based on real binding and/or activity data. Today’s tip covers two ways to get data that goes beyond computational predictions. First we’ll explore ORegAnno so you’ll understand the data sources, and then we’ll look at that data in the context of the UCSC Genome Browser, alongside some useful data from the ENCODE project.

ORegAnno is the Open Regulatory Annotation Database, a community literature curation project for regulatory information. Anyone can participate in the curation: the project provides curation tools plus automated cross-linking and checking features that make it easier. You register and curate, and the data becomes available to anyone. Through the available curator tools, the data is also loaded into projects that coordinate with ORegAnno, including the ORegAnno track at the UCSC Genome Browser.

In the paper published in NAR 2008, they stated this:

The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for over 3853 genes and 465 transcription factors from 19 species.

So that’s a nice set with traceable data that’s not just computational predictions. In the tip I’ll show one example of Stat1 binding, in human, near the Il10 gene. If you look at that record, you’ll see several pieces of evidence that support this data and a link to the publication that offers it.

Now, if you look at ORegAnno data in the UCSC Genome Browser, you can compare it to the computational predictions, or to TFBS data from other projects such as the ENCODE ChIP-seq data sets (Yale TFBS and HAIB, for example; note: you may have to go back an assembly because not all of the ENCODE data is on the current assembly at this time). This is what I show in the movie: I take an ORegAnno annotated item and visualize it alongside the TFBS Conserved predictions and some ENCODE project data. So you get all 3 types of data with a few clicks.
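If you would rather check the overlap outside the browser, the comparison in the movie can be roughly sketched in a few lines of Python. This is only an illustration under some assumptions: the intervals are written as BED-style (chromosome, start, end) tuples, the coordinates below are made up for the example and are not real ORegAnno, TFBS Conserved, or ENCODE records, and the function names are hypothetical. In practice you would fill the lists from whatever you export from the tracks.

```python
from typing import Dict, List, Tuple

# (chrom, start, end), 0-based and half-open, as in BED files
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def supporting_sources(site: Interval, tracks: Dict[str, List[Interval]]) -> List[str]:
    """Names of the tracks that have at least one item overlapping the site."""
    return sorted(
        name
        for name, intervals in tracks.items()
        if any(overlaps(site, iv) for iv in intervals)
    )


# Hypothetical coordinates, for illustration only.
curated_site = ("chr1", 1000, 1020)  # stands in for one curated ORegAnno item
tracks = {
    "TFBS Conserved": [("chr1", 990, 1005), ("chr1", 5000, 5012)],
    "ENCODE ChIP-seq": [("chr1", 800, 1400)],
    "Other track": [("chr2", 1000, 1020)],  # same positions, different chromosome
}

print(supporting_sources(curated_site, tracks))
# prints ['ENCODE ChIP-seq', 'TFBS Conserved']
```

A curated site that is also covered by a conserved prediction and a ChIP-seq peak has three independent kinds of support, which is the same thing you are eyeballing when the three tracks line up in the browser.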

So there are several ways to look for TFBS data–some of it computational predictions, some literature curation, and some big data stuff from the ENCODE teams. All of them have strengths and caveats. Computational predictions may be genome wide and independent of a given cell or tissue type, but are subject to the constraints of the algorithms. Community literature curation can offer quality evidence, but may be selected by interested groups and not as broadly representative of the genome-wide situation. Big data projects can be genome-wide and have evidence in some cell types, but may be in progress and subject to checking as they are pre-publication data.  But effectively using them all could help you to understand regulation of genes that you might be interested in.

Quick Links:

ORegAnno: http://www.oreganno.org/

Biobase and Transfac: http://www.gene-regulation.com/pub/databases.html

UCSC Genome Browser: http://genome.ucsc.edu/

ENCODE data at UCSC: http://genome.ucsc.edu/ENCODE/

Griffith, O., Montgomery, S., Bernier, B., Chu, B., Kasaian, K., Aerts, S., Mahony, S., Sleumer, M., Bilenky, M., Haeussler, M., Griffith, M., Gallo, S., Giardine, B., Hooghe, B., Van Loo, P., Blanco, E., Ticoll, A., Lithwick, S., Portales-Casamar, E., Donaldson, I., Robertson, G., Wadelius, C., De Bleser, P., Vlieghe, D., Halfon, M., Wasserman, W., Hardison, R., Bergman, C., Jones, S., & The Open Regulatory Annotation Consortium. (2007). ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Research, 36 (Database). DOI: 10.1093/nar/gkm967

3 thoughts on “Tip of the week: ORegAnno for regulatory annotation”

  1. Shaun Mahony

    Hi Mary —

    If you’re plugging the expensive proprietary Transfac, it’s also worth mentioning the free & open-source (but less populated) Jaspar:

    I’ve also found UniProbe to be an excellent source of information on the binding preferences of mammalian transcription factors.
    It’s a repository for Martha Bulyk’s protein binding microarray analysis of the in vitro binding preference of many TFs. Some families are quite deeply profiled (e.g. Homeodomains & ETS).

  2. Mary Post author

    Hi Shaun–

    I’m not plugging Transfac–it’s just the track that is available at UCSC that I was talking about. And there’s no cost to use it on the UCSC Genome Browser as shown here.

    But thanks for the other tools as well. Like most issues in bioinformatics, there are a lot of tools with different perspectives and project goals. And different amounts of data. So it’s nice to have options.

  3. Trey

    Thanks Shaun for the tool suggestions.

    This did remind me of something I’d like to reiterate.

    I just want to point out to our readers, if they haven’t already figured it out, that we focus on publicly available, open-access databases and tools (hence our name). Nearly all the resources and tools we train on, or mention here on the blog, are thus.

    We are not against proprietary tools, but if we do mention them we’ll obviously point out that they are proprietary. And if we are paid for advertising them, we’ll definitely disclose that.

    And as Mary mentioned, the post was about UCSC (publicly available) track of Transfac (openly accessed on UCSC) and other publicly available tools like ORegAnno.

    Just thought this was a good place to reiterate that :D
