modENCODE: the data bonanza ensues

Another of the “big data” projects underway is the ENCODE project, or Encyclopedia of DNA Elements, which aims to provide comprehensive annotation of genomic elements. Some people are aware of this and are using the data already. If you aren’t, you should check out the online tutorial, freely available because it is sponsored by the UCSC ENCODE Data Coordination Center (DCC) team, for an overview of the organization and availability of the ENCODE mammal data that you can find in the UCSC Genome Browser. That data is flowing in, and you can start looking at it now.

There’s another branch of ENCODE, though, that is not housed at UCSC and that you should be aware of: modENCODE. The modENCODE project, as you might guess from the name, is aimed at model organisms. The principles are similar: to explore and analyze all the functional elements of the genome. But the focus is on two model organism species: Drosophila and C. elegans. The data coordination center for modENCODE is run separately from the mammalian branch, but the groups coordinate and interact in other project arenas.

There’s a marker paper from 2009 that establishes the foundation and the framework for the modENCODE project. But just before Christmas two papers came out that provide terrific overviews of the status of the modENCODE projects, one for each organism.

One of the things that really struck me about modENCODE is the opportunity to explore developmental life stages, which isn’t possible with the human ENCODE data. As someone who studied developmental biology in the lab, I’m particularly keen on that aspect. So much of what we know about humans comes from adult or cell line data, and there’s so much to learn when you can explore over time in this way. Very neat.

Both papers follow the fairly standard “big data” paper framework: why they did this, what they did, summary statistics for the things they analyzed, and a few compelling sample tidbits. But as with all of the big data papers, the actual data you might need isn’t in there. There’s going to be a lot more in the supplement. Mostly, though, you’ll have to go to the DCC databases to browse around and query for items and regions of interest for your work. You should go over to the modENCODE site and start your mining with the modMINE tools.

I just noticed in my Twitter feed today, though, that there’s more you should know about if this project is relevant for your work: there is a special issue of Genome Research that collects the more detailed data papers from the modENCODE projects. (Hat tip to @bachinsky for that. PS: this is why I use Twitter for work.)

I haven’t had time to read the Genome Research papers yet, but I can see they cover the data, methods, and the reagents/resources that are associated with the project. There’s going to be a wealth of stuff over there. Check it all out.

