One year of ENCODE data
We’ve talked about the ENCODE data before, and you can see a number of entries about the project with the ENCODE tag. But last week I came across the ENCODE paper in the Nucleic Acids Research Advance Access collection, so it seemed like a good time to review some of the information about this project.
ENCODE stands for ENCyclopedia Of DNA Elements. It is one of the big data projects wrangled by NHGRI. A pilot phase explored the utility and methods of assessing 1% of the genome in extensive detail, looking beyond the known and predicted genes at many more aspects of the genome. After the results of the pilot phase were in, the project was examined again, certain choices were made on how to proceed, and the scale-up or production phase ensued.
The paper from the UCSC team describes the framework for the scale-up phase, starting with a focus on the choices that were made for cell types and data types that are used for the ongoing work. Table 1 is a nice summary overview of that to give you a sense of the scope.
They go on to describe some of the issues around housing and displaying the data from these projects. UCSC is the DCC, or Data Coordination Center, for the data. The different cell types and data sets often required new display strategies. One point they mention is that the methodology for several aspects of the project changed after the pilot: much more next-gen sequencing, short-read data is coming out of the scale-up. What this might mean for you, even if you don’t care about human data or this project specifically: if you are trying to figure out good ways to display your next-gen data, you may find useful examples of strategies in this collection. As we’ve done training on the UCSC Genome Browser and ENCODE, we found people were certainly interested in the data from that perspective.
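If you want to experiment with displaying your own short-read coverage alongside the ENCODE tracks, a custom track is one route. Here is a minimal sketch in Python, assuming you already have coverage summarized as intervals and want to write it out in the bedGraph format the UCSC Genome Browser accepts for custom tracks. The function name, track name, and values are made up for illustration and are not from the paper.

```python
def to_bedgraph(name, description, intervals):
    """Build bedGraph custom-track text.

    intervals: iterable of (chrom, start, end, value) tuples, with
    0-based start and exclusive end, as bedGraph expects.
    """
    # The track line tells the browser how to interpret the rows below it.
    lines = ['track type=bedGraph name="%s" description="%s"' % (name, description)]
    for chrom, start, end, value in intervals:
        if end <= start:
            raise ValueError("interval end must be greater than start")
        lines.append("%s\t%d\t%d\t%g" % (chrom, start, end, value))
    return "\n".join(lines) + "\n"


# Hypothetical coverage values, for illustration only.
example = to_bedgraph(
    "my_reads",
    "Example coverage",
    [("chr1", 0, 100, 3.5), ("chr1", 100, 200, 7)],
)
print(example)
```

The resulting text could be pasted into the browser's custom track form. For real data sets you would likely want to look at how the ENCODE tracks themselves are configured before settling on a display.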
The three main ways to interact with the data are covered next: the regular browser, the Table Browser, and downloading every bit of it, if you like. A major difference in the regular browser from the pilot phase is that, since the data is now genome-wide, the ENCODE tracks can be integrated fully with all the other data, just like any other track. Since it isn’t set off as a special project with limited coverage, you will now find ENCODE tracks in the track sections where you would expect them, such as regulation or expression, depending on the data type. The pilot ones were in separate ENCODE track group areas. Now you just have to look for the ENCODE icon next to the tracks to know they are part of this project.
They also stress the Data Use Policy, which includes free access to the data but under the Fort Lauderdale sort of embargo strategy. If you are going to use the data (and they want you to make discoveries, so please do) just keep an eye on the time stamp of the embargo and properly cite those sources. There’s more detail on that on the Data Policy page.
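If you are tracking several data sets, the embargo check is easy to script. This is only a sketch, assuming you have copied down the embargo time stamp for each data set yourself. The track labels and dates below are hypothetical, and treating the data as unrestricted on the stamped date itself is my assumption, so check the Data Policy page for the actual rule.

```python
from datetime import date


def embargo_lifted(restricted_until, today):
    """Return True if today is on or after the embargo end date.

    Counting the end date itself as unrestricted is an assumption.
    """
    return today >= restricted_until


def split_by_embargo(records, today):
    """Split (label, restricted_until) records into two lists of labels."""
    free = [label for label, until in records if embargo_lifted(until, today)]
    restricted = [label for label, until in records if not embargo_lifted(until, today)]
    return free, restricted


# Hypothetical labels and dates, for illustration only.
datasets = [
    ("example_track_A", date(2009, 9, 1)),
    ("example_track_B", date(2010, 6, 1)),
]

free, restricted = split_by_embargo(datasets, date(2009, 12, 1))
print("unrestricted:", free)
print("still under embargo:", restricted)
```

Either way, the sources still need to be cited properly; the script only tells you which time stamps have passed.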
The paper also references the OpenHelix tutorials on the UCSC Genome Browser and ENCODE data. UCSC sponsors us to provide the training freely, and you can access three tutorials on our site:
- Introduction, for an overview of how the main browser works, with display features and definitions for menus and such.
- Table Browser and Custom Tracks, for more complex custom query and display options.
- Additional Tools, which covers tools associated with the UCSC Browser; this is where you’ll find the ENCODE section. Or you can view the ENCODE section separately here in a previous post about it (I’ve added it again below). It covers much of the same material that the paper does and should supplement your reading nicely.
You can download the slides and use them in your own talks, use the exercises for students or workshops, or just point folks to the materials if you like.
One other note: there is a separate DCC for the modENCODE project with Drosophila and C. elegans, and we touch on that in a post here.
Stand-alone ENCODE tutorial section: http://www.openhelix.com/downloads/jing/encode/encode_movie.html
Rosenbloom, K., Dreszer, T., Pheasant, M., Barber, G., Meyer, L., Pohl, A., Raney, B., Wang, T., Hinrichs, A., Zweig, A., Fujita, P., Learned, K., Rhead, B., Smith, K., Kuhn, R., Karolchik, D., Haussler, D., & Kent, W. (2009). ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Research. DOI: 10.1093/nar/gkp961