The ENCODE project is one of the “big data” projects generating genome-wide data on many different aspects of genome biology. It’s been around for a while, and some people have heard about it but haven’t really begun to dive into the data yet. They really should.
We’ve had our hands on the ENCODE data since the earliest days of the new scale-up or production phase. We’ve been doing outreach for the UCSC Genome Browser’s DCC portion of the ENCODE work, meaning workshops, online materials, etc. So as a “user” of the data, I can tell you that there is some amazing stuff coming out of this project now. And a lot of it is brand new information to you–I assure you.
I talked recently about some great stuff that was published about chromatin state: ENCODE Chromatin state data offers nice insights. Take this and run with it. That paper is a closer look at one of the data types that’s coming in, and provides some nice guidance on how to explore your own regions of interest for the signals they detect. But the ENCODE team as a whole has just published a paper that looks across the whole project, gives you more background on how the data is generated and what the individual project pieces are supposed to be generating, and offers some tips on how to use the data. They’ve published a “User’s Guide”.
Figure 1 provides a nice summary of many of the types of data that are coming out of the project, and the techniques used to analyze the features: many types of long-range regulatory elements such as enhancers, silencers, etc.; short-range regulatory elements such as promoters and transcription factor binding sites; many types of RNAs being made in the cells, and so on. I also love the copy number variation data on the cell lines and the other structural info coming out. The other part of the figure lays out the structure of the project and its data flow, which is helpful to understand as well.
But most importantly, here’s where you come in–when you start to use this and make discoveries, you can feed back into the project with your insights:
Examples applying ENCODE data at individual loci to specific biological or medical issues are a good starting point for exploration and use of the data. Thus, we also provide a collection of examples at the “session gallery” at the ENCODE portal. Users are encouraged to submit additional examples; we anticipate that this community-based sharing of insights will accelerate the use and impact of the ENCODE data.
I know people have made discoveries in the ENCODE data. When we did a workshop at the NIH there was a woman sitting in the front row giggling about some TFBS data we showed her how to obtain. I have seen chromatin signals that identify a tissue-specific splice site that I know about, but which is not annotated in human. My 23andMe SNPs have sometimes been in really sparse regions, and the only data I have seen in that region is some intriguing ENCODE regulation data: I think I’ve found an un-annotated gene. I’ve seen CNV data that led me to curious correlations in the International Cancer Genome Consortium (ICGC) data. It’s just sitting there waiting for you.
There’s good and novel stuff in there. You need to look at your regions of interest and add this context to them.
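If you would rather do that kind of look-up in a script than by eye in the browser, here is a minimal sketch of the idea. It assumes you have already exported a track as BED-style text (chromosome, start, end, name); the track name, peak names, and coordinates below are made up for illustration and are not real ENCODE data, and `parse_bed` and `overlapping` are just hypothetical helper names.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive
    name: str


def parse_bed(text: str) -> List[Interval]:
    """Parse minimal BED lines: chrom, start, end, and an optional name."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else "."
        intervals.append(Interval(fields[0], int(fields[1]), int(fields[2]), name))
    return intervals


def overlapping(region: Interval, features: List[Interval]) -> List[Interval]:
    """Return features on the same chromosome that overlap the region."""
    return [
        f for f in features
        if f.chrom == region.chrom and f.start < region.end and f.end > region.start
    ]


# Hypothetical peaks; coordinates are invented for this example.
EXAMPLE_BED = """\
track name=hypothetical_peaks
chr1\t1000\t1500\tpeak_a
chr1\t4000\t4600\tpeak_b
chr2\t1200\t1300\tpeak_c
"""

region = Interval("chr1", 1400, 4100, "my_region")
hits = overlapping(region, parse_bed(EXAMPLE_BED))
print([h.name for h in hits])  # ['peak_a', 'peak_b']
```

A simple linear scan like this is fine for a handful of regions; for genome-scale comparisons you would probably want a dedicated interval tool instead.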
The paper also talks about the limitations of the ENCODE data. One is that the cells are not synchronized, so the data gives a population-level look at cells in mixed cell cycle states. A couple of the cell lines are known to have some genome instability (and yeah, that’s what I saw in the CNV data). And cell lines are not human tissues. There are also some limitations of the sequence reads that are generated. Still, the data that is coming along should keep you busy.
You can read the user’s guide for more details, and it will be really helpful as a reference as you get into the data. You can also explore the tutorial that we developed with the UCSC team for an overview.
Special note for software junkies: be sure to see the Supplemental Data. Table S1 has a nice summary of the software tools that are being used to generate the data (warning: it’s a Word doc). Some of the software isn’t published yet, but is worth keeping an eye out for. One point we keep making in the workshops is that even if you don’t care much about these specific data types, the ENCODE project is offering nice strategies and tools that you can use for analyzing and displaying your own NGS data with genome context.
The ENCODE Project Consortium (2011). A User’s Guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biology, 9 (4). DOI: 10.1371/journal.pbio.1001046
Tutorial and Portal:
ENCODE tutorial at OpenHelix, freely available because it is sponsored by UCSC: http://openhelix.com/ENCODE
ENCODE portal: http://encodeproject.org/
Other reading about this:
RT @sangerinstitute: Genomes are deep stores of nuanced messages: ENCODE project hunts meaning http://bit.ly/dFoaqH
NHGRI Press Release: New user’s guide and tutorial helps disease researchers interpret human genome
Nature (2011). Genomics: A guided tour of the genome. Nature, 473 (7345), 8-9. DOI: 10.1038/473008d