In this 5 minute tip I want to offer a constructive way to start to engage with ENCODE data.
When the ENCODE consortium publications were released last week, a media blitzkrieg ensued. Soon after, there was a backlash by scientists based on some of the claims that they were seeing made. Some of the issues were due to flawed representations in the press that were legitimate targets of the scientists. Some of the attacks on the science writers were unfair. Some folks had issues with the publication process. Some pushback on the “big science” structure and funding arose. Another thread of discussion was about some of the global claims by the ENCODE team—largely about the parsing of the term “functional”. But this parsing discussion was actually quite informative and useful—the good kind of “inside baseball” that goes on among scientists. Although to people outside the field it may be misunderstood, that’s the way we challenge each other and it’s not personal—it’s about the data. It was like watching a huge world-wide lab meeting take place over a few days via twitter and blogs, and it was really pretty cool. (My favorite take on that drama so far was Sean Eddy’s piece: ENCODE says what?)
But what I hope isn’t lost in the conflama is that there is now an enormous opportunity for folks who do smaller science. It would be very unfortunate if the drama dissuaded bench biomedical researchers, and even citizen scientists, from wading into the ENCODE data. ENCODE is a resource project: it provides the foundation for you to go further. It was not an end point, all done, wrapped up and tidy, with all the answers. This is something that the media is not well positioned to convey, which explains part of the drama.
Yes, there was a bolus of 30 papers dropped on you last week. You don’t need to read them all (probably), and certainly not all at once. Do at least read the overall summary paper to get a sense of what’s going on. I am going to offer you a way to consider approaching the rest of them. But even more important than the papers is the fact that the data isn’t really in those papers. Only a fraction of what researchers generated can be discussed or shown in a paper, or even the supplements. The data is in the data repositories—the databases—and that’s where I think you could start. You can start with just a web browser. You don’t need to run the virtual machine. You don’t need to write code.
In a separate commentary in Nature, Ewan Birney, who led the consortium, makes this point:
The ENCODE project has delivered an incredible amount of information because of its sheer scale: more than 1,600 experiments on 147 cell types, including 235 antibodies or other assay protocols.
Essentially what this means is that you have just been handed a box of puzzle pieces. A huge box of pieces: 1640 x 147 x 235 = 56,653,800 pieces (a back-of-the-envelope estimate; pedants may well correct this in the comments). And you get a preliminary evaluation of the data across the minds of ~450 authors on 30 papers. But there are not 56 million pieces in the papers; they generally give you only some representative, compelling examples. You have to turn to the databases and repositories for access to the puzzle box.
In this 56 million piece puzzle, there’s a catch though: some of the pieces in that box have wonky edges, and won’t fit in your puzzle. Some of them are going to fit nicely. Some will need to be finessed a bit. Some might be missing (maybe they didn’t do your main transcription factor of interest; the main paper says they’ve done 119 out of 1800 known TFs). But you are capable of figuring out how some of it fits, at least in the section of the puzzle that you’ve been thinking about for years already anyway. The “big data” guys don’t have your depth on your topic/region of interest. But you can make the connections: you have a prepared mind. You just need to be pointed to the tools to do it.
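For anyone who wants to check the back-of-the-envelope numbers, here is a minimal Python sketch. The figures are the ones quoted in this post (1640 experiments, 147 cell types, 235 antibodies or assay protocols, and 119 of 1800 known TFs); the function names are my own illustration and are not part of any ENCODE tool. Multiplying the three counts is only a rough upper bound on the size of the puzzle box, not a count of real data sets.

```python
# Figures quoted in the post; the helper names below are illustrative only.
EXPERIMENTS = 1640
CELL_TYPES = 147
ASSAY_PROTOCOLS = 235
TFS_ASSAYED = 119
TFS_KNOWN = 1800


def puzzle_pieces(experiments, cell_types, protocols):
    """Rough upper bound: every experiment x cell type x protocol combination."""
    return experiments * cell_types * protocols


def coverage_percent(assayed, known):
    """Share of known transcription factors that were assayed, in percent."""
    return 100.0 * assayed / known


pieces = puzzle_pieces(EXPERIMENTS, CELL_TYPES, ASSAY_PROTOCOLS)
print(f"{pieces:,} puzzle pieces")
print(f"{coverage_percent(TFS_ASSAYED, TFS_KNOWN):.1f}% of known TFs assayed")
```

So the estimate in the text holds up as arithmetic, and the TF coverage works out to under 7% of the known factors, which is why your own factor of interest may well be one of the missing pieces.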
There’s a ton of information there that doesn’t get into the papers. But you can look at your genes/regions/pathways of interest, turn on ENCODE tracks, and look around. Certainly most of the time we are looking under the flashlight for what we are able to assess. ENCODE is a license to look deeper for validity by at least starting from where there *might* be something interesting. It can help you aim and design your studies.
What ENCODE gives you is a more powerful flashlight than you had before. It doesn’t excuse you from confirming your hunches and discoveries.
Don’t be daunted. In fact, part of this is *your* job. You need to go in and look at these putative “functional” sites. You look upstream and downstream and inside of your genes/regions of interest. Will you draw an instant and immovable conclusion from what you see in the UCSC Genome Browser? Of course not. You will do what you always do with a hunch or a lead–look harder. Try things. Re-try things. Tweak it and try something else. Try to talk yourself out of it. Then try it again. Run it up the flagpole at lab meeting. Let your labmates challenge you (in that healthy inside-baseball kind of way). And go back and try other stuff.
One other tidbit that I haven’t seen noted anywhere yet: the ENCODE data includes human embryonic stem cell assays (H1-hESC in Tier 1). There are not that many people who’ve had access to these cells over the years, and this is an excellent use of them. Compare these cells to more differentiated cells and see if there are differences. If I was doing developmental biology I’d definitely have a hard look at that data compared to cell lines and tissues with other characteristics.
There are other things you can learn about individual cell lines you might use or might find useful. I found one rather large deletion in one of the cell lines that might matter if I was researching cancer, for example. Or maybe that’s a feature that would cause you to want to avoid that cell line.
In a response to some of the criticisms of the ENCODE blitz, Ewan posted another blog piece, “Response on ENCODE reaction.” In that piece he says this:
The real measure of a foundational resource such as ENCODE is not the press reaction, nor the papers, but the use of its data by many scientists in the future.
And I couldn’t agree more. This really is the point of ENCODE. This is what needs to happen now. I’m going to help you get started on deciding whether there’s value in the ENCODE data for your work. This data is accessible to you. And it’s up to you to unearth the great stuff.
Here’s my recommendation on how to proceed to evaluate ENCODE data (in short form):
1. Watch the full OpenHelix tutorial I created that lasts about an hour, and try the step-by-step exercises on how to recognize and investigate ENCODE data tracks in the UCSC Genome Browser. We’ve done this material in workshops and people seemed to connect with it. Find your region of interest or one that you know well (if you don’t know how to do that see the UCSC basics tutorial first).
2. Turn on some ENCODE data tracks you might be interested in, or even already know some details about, using the track features. Transcription factor binding? Chromatin state? Variation? Consider the control rows. Do these data make sense with what you know? Is there something else curious going on?
3. Now go to the Nature Portal thread for that topic and find the specific papers relevant for that data. Read those. See how the researchers made the calls they did. Look again at your region. Read more from the literature that existed before. Look at more locations: where you expect something, and places you don’t, and places you don’t have any idea about.
4. See if there are cell lines relevant for your focus with this awesome Matrix (note: not all cell lines or antibodies are turned on when you hit “show” on a track–you may have to go in and click them. This is shown in the tutorial.). Same outcome? Different? What about in the stem cell line?
5. Think about confirming what you see. Get the cell lines and grow them with ENCODE conditions. Can you see it again? Can you tweak it?
6. Check out your favorite cell line/tissue in your lab. Do you see the same thing? Or is it different? Different in a non-reproducible way or different in an interesting way?
7. Publish. Or blog about it. Let’s talk about it. If you do see—or if you don’t see—what the ENCODE data suggested, we want to know. It will help us all to understand which puzzle pieces are good, and which pieces might need more work or more thought, or re-analysis.
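None of the steps above requires code, and a web browser is all you need. But if you prefer to script the "find your region" part, here is a minimal Python sketch that builds a UCSC Genome Browser link for a region, with optional flanking sequence so you can look upstream and downstream. It relies on the browser's `db` and `position` URL parameters. The `hg19` assembly is my assumption about where to look for these tracks, the function name is hypothetical, and the coordinates are placeholders rather than a region of any particular interest.

```python
from urllib.parse import urlencode

# UCSC Genome Browser track display page (same host as the link in this post).
HGTRACKS = "http://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_url(chrom, start, end, flank=0, assembly="hg19"):
    """Build a browser link for chrom:start-end, widened by `flank` bases per side.

    `assembly` defaults to hg19 as an assumption; change it if your
    tracks of interest live on a different assembly.
    """
    if start < 1 or end < start or flank < 0:
        raise ValueError("expected 1 <= start <= end and flank >= 0")
    lo = max(1, start - flank)  # never run off the start of the chromosome
    hi = end + flank
    params = {"db": assembly, "position": f"{chrom}:{lo}-{hi}"}
    # safe=":" keeps the position readable instead of percent-encoding the colon
    return f"{HGTRACKS}?{urlencode(params, safe=':')}"


# Placeholder coordinates; substitute your own gene or region.
print(browser_url("chr1", 1000000, 1100000, flank=50000))
```

Open the printed link, then turn on the ENCODE tracks by hand as described in step 2. Which tracks to show is still your call.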
I hope that this tip is a constructive way to proceed. Please have a look at the data. It really could help your work—to understand current features, and to design future directions to pursue. That’s what ENCODE is there for.
ENCODE tutorial: http://www.openhelix.com/ENCODE2
UCSC Genome Browser Introduction tutorial: http://www.openhelix.com/ucsc
ENCODE portal at UCSC Genome Browser: http://encodeproject.org/
Main UCSC Genome Browser starting point: http://genome.ucsc.edu/
The YouTube tip page directly: http://youtu.be/QS1c7BpIvmI
Order Quick Reference Cards to keep near your computer for handy guides: http://bit.ly/QRCencode
Birney, E. (2012). The making of ENCODE: Lessons for big-data projects. Nature, 489 (7414), 49-51. DOI: 10.1038/489049a
The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489 (7414), 57-74. DOI: 10.1038/nature11247
Disclaimer: Over the past few years we had a contract from the UCSC DCC to provide training and outreach on ENCODE data in the UCSC Genome Browser, and the training materials I offer were developed under that contract. That contract has completed and we are not officially associated with ENCODE in any manner at this time.