Lately we’ve been getting a lot of questions about ways to analyze the promoters and other regulatory aspects of genes. For a while we mostly pointed to the prediction data available in the UCSC Genome Browser’s TFBS Conserved track. TFBS Conserved is a track of computationally predicted transcription factor binding sites (TFBS) that are conserved across human, mouse, and rat, based on Transfac v7.0 from Biobase. As the track description says, it’s important to know this:
The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites.
Though this is useful, people have been wanting more evidence based on real binding and/or activity data. Today’s tip covers two ways to get data that goes beyond computational predictions. First we’ll explore ORegAnno so you’ll understand the data sources, and then we’ll look at that data in the context of the UCSC Genome Browser, alongside some useful data from the ENCODE project.
ORegAnno is the Open Regulatory Annotation Database, a community literature curation project for regulatory information. Anyone can participate in the curation: the project provides curation tools plus automated cross-linking and checking features that make it easier. You register and curate, and the data becomes available to anyone. Through the curator tools, the data is also loaded into projects that coordinate with ORegAnno, including the ORegAnno track at the UCSC Genome Browser.
In the paper published in NAR 2008, they stated this:
The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for over 3853 genes and 465 transcription factors from 19 species.
So that’s a nice set with traceable data that’s not just computational predictions. In the tip I’ll show one example of Stat1 binding, in human, near the Il10 gene. If you look at that record, you’ll see several pieces of evidence that support this data and a link to the publication that offers it.
Now, if you look at ORegAnno data in the UCSC Genome Browser, you can compare it to the computational predictions, or to TFBS data from other projects such as the ENCODE ChIP-seq data sets (Yale TFBS and HAIB, for example; note: you may have to go back an assembly because the ENCODE data is not all on the current assembly at this time). This is what I show in the movie: I take an ORegAnno annotated item and visualize it with the TFBS Conserved predictions and with some ENCODE project data. So you get all three types of data with a few clicks.
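If you export the same three kinds of tracks as tables instead of viewing them, the comparison comes down to asking which features overlap a curated site. Here is a minimal sketch of that idea in Python. It is not how the browser itself works, the names `Interval`, `overlaps`, and `supporting_evidence` are hypothetical helpers, and the coordinates are made-up placeholders, not real annotations.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """One feature from a track, using BED-style coordinates."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both features are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def supporting_evidence(query: Interval,
                        tracks: Dict[str, List[Interval]]) -> Dict[str, List[str]]:
    """For each track, list the names of features that overlap the query."""
    return {
        track: [iv.name for iv in features if overlaps(query, iv)]
        for track, features in tracks.items()
    }


# Placeholder data only: these coordinates are invented for illustration.
curated = Interval("chr1", 1000, 1020, "curated_site")
tracks = {
    "predicted": [
        Interval("chr1", 1005, 1015, "pred_A"),
        Interval("chr1", 5000, 5010, "pred_B"),
    ],
    "chip_seq": [
        Interval("chr1", 900, 1100, "peak_1"),
        Interval("chr2", 1000, 1020, "peak_2"),
    ],
}

result = supporting_evidence(curated, tracks)
print(result)
```

A curated site that also overlaps a conserved prediction and a ChIP-seq peak has three independent lines of support, which is the same thing you are checking by eye when the tracks are stacked in the browser.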
So there are several ways to look for TFBS data: some of it computational predictions, some literature curation, and some big data from the ENCODE teams. All of them have strengths and caveats. Computational predictions may be genome-wide and independent of a given cell or tissue type, but are subject to the constraints of the algorithms. Community literature curation can offer quality evidence, but the entries may reflect the interests of the groups doing the curating and so may not be broadly representative of the genome-wide situation. Big data projects can be genome-wide and have evidence in some cell types, but may be in progress and subject to checking as they are pre-publication data. Using them all effectively could help you understand the regulation of genes that you might be interested in.
Biobase and Transfac: http://www.gene-regulation.com/pub/databases.html
UCSC Genome Browser: http://genome.ucsc.edu/
ENCODE data at UCSC: http://genome.ucsc.edu/ENCODE/
Griffith, O., Montgomery, S., Bernier, B., Chu, B., Kasaian, K., Aerts, S., Mahony, S., Sleumer, M., Bilenky, M., Haeussler, M., Griffith, M., Gallo, S., Giardine, B., Hooghe, B., Van Loo, P., Blanco, E., Ticoll, A., Lithwick, S., Portales-Casamar, E., Donaldson, I., Robertson, G., Wadelius, C., De Bleser, P., Vlieghe, D., Halfon, M., Wasserman, W., Hardison, R., Bergman, C., Jones, S., & The Open Regulatory Annotation Consortium. (2007). ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Research, 36 (Database). DOI: 10.1093/nar/gkm967