Tag Archives: modENCODE


Friday SNPpets

This week’s SNPpets include finding hidden treasures in a “big data” repository, genomic epidemiology and malaria, cannabis strain phylogeny, hackathons and lessons learned, ClinGen for clinical genomics, and more….


Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…



Spanking #ENCODE

While I was on the road last week (ironically, to do workshops including one on ENCODE data in the UCSC Genome Browser), a conflama erupted over a newly published paper that essentially spanked the ENCODE team for some of the claims they made. Some of the first notes I saw:

Luckily I happened to be in a library when this drama broke, so while Trey talked about the Table Browser I ran to an unoccupied workshop computer and read the paper quickly. I will re-read it when I have more time, but I wanted to offer my initial thoughts while the chatter is ongoing.

Subsequently I saw a range of reactions to this: PZ Myers says ENCODE gets a public reaming; Mick Watson’s Dear ENCODE…. ; Homolog.us’ critique of Mick Watson’s response Biomickwatson’s Ridiculous Criticism of ENCODE Critics ; Biostars’ forum ENCODE commentary from Dan Graur …and more. I’m sure there will be further fallout.

My first thoughts were that the paper was the snarkiest scientific paper I have ever read, and I thought it was hilarious. I also think some of the criticisms were completely valid. Some less so.

First I should establish some cred on this topic, and explain my role. I was not part of the official ENCODE analysis team, and was not an author on any of the papers. As OpenHelix we were engaged in some outreach for the UCSC Genome Browser Data Coordination Center–but we worked for and reported to them and not to the main ENCODE group. As such, we delivered training materials and workshops for several years, and although these touched on various data sets presented by many ENCODE teams, we did not have contact with other teams. The materials were aimed at how to locate and use the ENCODE data in the UCSC framework (ENCODE Foundations and ENCODE Data at UCSC were the recorded materials). However, we are no longer receiving any ENCODE-related funds in any manner.

So I was exploring ENCODE data a lot earlier than most people. I was making discoveries and finding out interesting new things years ago. And I was also with new users of ENCODE data in workshops around the country. This is the framework that you should use to assess my comments.

On the immortality of television sets….

In the Graur et al paper, there are a number of aspects of the ENCODE project that come under fire. The largest portion of this was aimed at the claim of 80% functionality of the genome. This statement caused problems from day 1, and I agree that it was not a well-crafted statement. It was bound to catch media attention and it irked pretty much everyone. Nearly immediately Ewan Birney tried to explain the position, but most people still found this 80% thing unsatisfying and unhelpful. And I think the Graur et al paper presents why it was so problematic pretty clearly.

Another criticism of the work is that the ENCODE project was focused on cell lines.

“We note that ENCODE used almost exclusively pluripotent stem cells and cancer cells, which are known as transcriptionally permissive environments.”

I understand this concern and even raised it myself in the past in the workshops. But there are 2 important things to note. First, in order to get everyone sufficient sample material to enable comparisons across techniques, technologies, and replications, it would not have been possible to use human tissue samples. It just would be physically impossible. Second, a lot of non-ENCODE experimental work is carried out in these cell lines, and understanding the differences among cell lines may be incredibly useful in the long run. Making better choices about which ones mirror human conditions, or not using the wrong cell line to test some compound if it’s missing some key receptor, could be great information to have. I wish one of the papers had characterized the cell lines, actually.

But another thing everyone missed: STEM CELLS. We now have the largest set of genome-wide data on human embryonic stem cells. This has been information that was particularly hard to obtain in the US, but now everyone can look around at that. I was really sorry to see that aspect of this project got no love whatsoever.

Besides that, the mouse ENCODE project did deliver tissue data, since with mice we can share strains and treatment protocols to get sufficient materials. Additionally, the modENCODE project got some really fascinating information on developmental stages that we couldn’t get on humans. I think all of these features are missing in the snark-fest.

Another criticism in the paper is the sensitivity vs specificity choice for reporting on the data.

 At this point, we must ask ourselves, what is the aim of ENCODE: Is it to identify every possible functional element at the expense of increasing the number of elements that are falsely identified as functional? Or is it to create a list of functional elements that is as free of false positives as possible. If the former, then sensitivity should be favored over selectivity; if the latter then selectivity should be favored over sensitivity. ENCODE chose to bias its results by excessively favoring sensitivity over specificity. In fact, they could have saved millions of dollars and many thousands of research hours by ignoring selectivity altogether, and proclaiming a priori that 100% of the genome is functional. Not one functional element would have been missed by using this procedure.

Maybe the Graur et al team thinks that it should have been the other way. That’s fine–they can take all of this data and re-examine it, reprocess it, and deliver it with their thresholds. But I think at this time over-prediction is not the worst sin. Some of this technology is still being worked on. Some of the techniques will undoubtedly be refined as we go forward. But some of that will shake out once we look at regions and understand why some calls should or shouldn’t be made. Certainly there are going to be artifacts. But there may also be subtle and useful things that researchers on a specific topic and with interests in a specific region will be able to suss out because they had some leads. Maybe some won’t pan out. But certainly that’s not impossible with under-prediction or false negatives either.
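To make the sensitivity vs specificity trade-off concrete, here is a minimal sketch with a toy "genome" of 10 regions. Everything in it is made up for illustration (the regions, the truth labels, and both callers); it only shows the arithmetic behind the quip that calling 100% of the genome functional would miss nothing.

```python
def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity) for two parallel lists of booleans.

    Assumes the truth list contains at least one True and one False,
    so neither denominator is zero.
    """
    pairs = list(zip(truth, calls))
    tp = sum(t and c for t, c in pairs)              # real elements that were called
    fn = sum(t and not c for t, c in pairs)          # real elements that were missed
    tn = sum((not t) and (not c) for t, c in pairs)  # non-elements correctly left out
    fp = sum((not t) and c for t, c in pairs)        # non-elements called anyway
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy genome: 10 regions, only the first 2 truly functional.
truth = [True, True] + [False] * 8

# "Proclaim everything functional": no element is missed, none is excluded.
call_everything = [True] * 10

# A made-up stricter caller: finds one real element and makes no false calls.
strict_caller = [True] + [False] * 9

everything = sensitivity_specificity(truth, call_everything)
strict = sensitivity_specificity(truth, strict_caller)
print("call everything:", everything)
print("strict caller:  ", strict)
```

The first caller gets perfect sensitivity and zero specificity, and the second gives up half of the real elements to avoid any false calls. Where a project should sit between those two is exactly the judgment call being argued about.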

I don’t know how many of you have stood in front of rooms of researchers and opened up new data sets to them. I’ve done this quite a bit. I have heard the giggles of a researcher at NIH who was delighted to discover in our workshop that GATA1 binding evidence was present in front of a region she was interested in–and this evidence looked very solid to me. This data came from ENCODE years ago, and she could go back to her lab that afternoon and start to ask new questions and design new experiments long before the controversial statements. Just the other day there was a researcher who found new RNA-seq signals in an area he cares about. Will these turn out to be something? I don’t know. But he was eager to go back to the lab and look harder with the new knowledge.

Big science vs. small science

Another segment of the Graur paper is called “Big Science,” “small science,” and ENCODE. I tell researchers in workshops that they need to take the leads they get from this and look at them again, confirm them, and poke around with other tools and other cell lines or tissues. But I have seen that the ENCODE data has offered new paths and new ideas to researchers. As I wrote a while ago, ENCODE enables smaller science, and people who had no contact with the initial project are making new discoveries with this data. And I think this statement is unfair:

Unfortunately, the ENCODE data are neither easily accessible nor very useful—without ENCODE, researchers would have had to examine 3.5 billion nucleotides in search of function, with ENCODE, they would have to sift through 2.7 billion nucleotides.

Most researchers don’t need 3.5 billion or 2.7 billion nucleotides. But they are very interested in some specific regions, and many of those regions now have new and actionable information that these researchers didn’t have before. And it’s not hard to access this–although we would love to have been funded to do more workshops to show people how they can get to it*.

Alas

So in short, I thought the spanking was funny and partially deserved. Some of it was unwarranted. I was a bit surprised to see this level of snarkiness in a scientific paper rather than a blog post or some other format, and I think if that became a publishing trend it might not serve us well. But we are also coming to a point where the literature is less important than the data–because the data isn’t in the papers anymore. What will matter is what we see downstream as people use the ENCODE data. And I hope they do, because I think there’s gold in there. I’ve seen some. But you’ll have to verify it. I think the saddest thing would be if the drama over the claims made at the end causes people to walk away from the good that came from this. That would be a huge waste.

*If anyone asked me (not that anyone has), I think that outreach on big data projects should be improved in a number of ways. There should be a branch of the project whose only role is outreach–not attached to a specific project team–that has access to all of the teams, but can still maintain some distance. It would help to understand what new users face when they see the project. Often we find that teams on software or data projects are a bit too close to the materials and need to understand what it’s like to be an outsider looking in. And we find that people tell us things that they might not be willing to say to the development team directly, which can be very useful feedback. This is not specific to ENCODE but I have seen this numerous times in other projects as well.

 

Related posts:

Mining the “big data” is…fascinating. And necessary.

Video Tip of the Week: ENCODE enables smaller science

 

References:

Graur, D., Zheng, Y., Price, N., Azevedo, R., Zufall, R., & Elhaik, E. (2013). On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE Genome Biology and Evolution DOI: 10.1093/gbe/evt028

Rosenbloom, K., Sloan, C., Malladi, V., Dreszer, T., Learned, K., Kirkup, V., Wong, M., Maddren, M., Fang, R., Heitner, S., Lee, B., Barber, G., Harte, R., Diekhans, M., Long, J., Wilder, S., Zweig, A., Karolchik, D., Kuhn, R., Haussler, D., & Kent, W. (2012). ENCODE Data in the UCSC Genome Browser: year 5 update Nucleic Acids Research, 41 (D1) DOI: 10.1093/nar/gks1172

++++++++

Update to add more blowback:

The Guardian: Scientists attacked over claim that ‘junk DNA’ is vital to life

Josh Witten at The Finch and Pea: So I take it you aren’t happy with ENCODE…

John Farrell at Forbes: ENCODE Papers Get A Fisking

RT @leonidkruglyak: And of course, there’s now a @FakeEncode parody account…

Jalees Rehman at SciLogs: The ENCODE Controversy And Professionalism In Science  (this also has a Storify of some of the chatter that’s gone on via twitter, where much of this goes on these days)

This was an early one but I missed it in my travel days, by Ashutosh Jogalekar at SciAm: ENCODE, Apple Maps and function: Why definitions matter

Anshul Kundaje takes issue with some of the conclusions drawn based on the data used: https://twitter.com/anshul

Derek Lowe at In the Pipeline: ENCODE: The Nastiest Dissent I’ve Seen in Quite Some Time

Mike’s Fourth Try, Mike Lin–an author in the consortium: My thoughts on the immortality of television sets

Rebecca Boyle at PopSci: The Drama Over Project Encode, And Why Big Science And Small Science Are Different

Gencode Genes: On the annotation of functionality in GENCODE (or: our continuing efforts to understand how a television set works).

Nicolas Le Novère: Coming out of the closet: I like ENCODE results, and I think the “switches” may be functional

W. Ford Doolittle: Is junk DNA bunk? A critique of ENCODE

Larry Moran: Ford Doolittle’s Critique of ENCODE

Nature Editorial: Form and function

MendelsPod has a couple of podcasts: Debating ENCODE: Dan Graur, Michael Eisen and Debating ENCODE Part II: Ross Hardison, Penn St.

Peter, a kiwi, on The ENCODE War Continues

Richard Gayle: ENCODE and the Truth

Sean Eddy (subscription req) #openaccess : The ENCODE project: Missteps overshadowing a success

Tip of the Week: InterMine for mining “big data”

Integrating large data sets for queries within–and across–various collections is one of the arenas that has lately been pretty active in bioinformatics. As more and more “big data” projects yield huge numbers of data points and data types, this is only becoming more necessary.  I love to browse data, but there are times when a large-scale customized query is what you’ll want to make some broader discoveries.

Right now there are a number of resources and interfaces that I turn to for structured and customized queries of data collections. The UCSC Table Browser, BioMart, Galaxy–these are the ones I have my hands on almost continuously. But there is another warehouse and interface system that we’re seeing more and more: InterMine.

My first real encounter with InterMine was for the modENCODE data. There’s some really terrific data flowing out of that project now (I talked a bit about that recently here), and the interface and storage system they are using is InterMine.

FlyMine was the initial impetus for the “Mine” system. Some years back FlyMine was created as a warehouse and query system for the increasing amounts of fly data that were coming from various projects. The goal was to have a system powerful enough for bioinformaticians and super users, but also a friendly yet powerful interface for bench biologists to use.

The initial paper described the user interface, which has 3 primary components: a Quick Search that’s great for browsing; a Template library that lets users access some pre-defined standard or likely query types that they can tweak for their needs; and a fully customizable Query Builder for the most advanced access. Since this paper, development has continued, and there are other new and cool features present as well.

Another big goal of the FlyMine effort was to be able to deal with lists. One of the most common questions we still get in workshops is: “I have a list of _____.  What’s the best way to deal with that?” FlyMine–and the InterMines in general–help people to query and manage their explorations with lists of stuff.
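For readers who like to see what a list-based query might look like outside the web interface, here is a rough sketch that turns a list of identifiers into a query document. Please treat the details as assumptions: the element and attribute names follow my understanding of InterMine's XML path-query format and should be checked against the InterMine documentation, the helper `build_list_query` is hypothetical, and the class name, field names, and gene symbols are only illustrative.

```python
import xml.etree.ElementTree as ET


def build_list_query(root_class, view_fields, id_field, identifiers):
    """Build a path-query-style XML string that filters on a list of identifiers.

    Hypothetical helper: the XML layout is an assumption, not a confirmed spec.
    """
    query = ET.Element("query", {
        "model": "genomic",  # assumed model name
        # Output columns, written as space-separated paths like "Gene.symbol".
        "view": " ".join(f"{root_class}.{field}" for field in view_fields),
    })
    # One constraint that matches any identifier in the list.
    constraint = ET.SubElement(query, "constraint", {
        "path": f"{root_class}.{id_field}",
        "op": "ONE OF",
    })
    for identifier in identifiers:
        ET.SubElement(constraint, "value").text = identifier
    return ET.tostring(query, encoding="unicode")


# "I have a list of genes. What's the best way to deal with that?"
my_genes = ["zen", "eve", "bib"]  # illustrative symbols only
query_xml = build_list_query("Gene", ["symbol", "primaryIdentifier"], "symbol", my_genes)
print(query_xml)
```

This only builds the query text and does not contact any Mine. In practice the Template library and Query Builder do this work for you, and uploading the list through the web interface is the easier route for most people.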

The MyMine feature of the InterMines is also a nice component. You can create a login and store things you want to have repeated access to: queries, lists, etc.

There are other people using InterMine for their systems too–a recent paper on TargetMine, for “Gene Prioritization and Target Discovery” is available, and might appear as an upcoming tip! Jennifer did a tip on YeastMine from SGD once as well.

But what triggered me to do this tip is that a letter came from the RGD mailing list last week that said this:

Effective Friday, May 20th, 2011 the MCW BioMart tool will be retired by RGD and the MCW Proteomics Center. For mining rat data, we have found that the RatMine tool is easier to use, more flexible and incorporates more types of data than BioMart. In addition, RatMine includes analysis tools not found in BioMart, giving RatMine users a single, intuitive interface for both obtaining and analyzing data.

So they are retiring the Rat BioMart and moving fully to RatMine, their InterMine installation. So this tip of the week will explore InterMine, RatMine, and some other Mines. That’s a lot of ground to cover–but it’s probably worth your time to know about InterMine as it becomes more broadly available. It’s also important to understand how to query with the Mines if you want to bring the data to Galaxy for further analysis. If you visit Galaxy you’ll see that their “Get Data” section lets you access Mine tools–but you still need to know how to do the basic queries at the host site first.

Although this tip will touch on RatMine, the focus is the more general InterMine suite. RGD also said this in their notice:

For an overview of RatMine and how to use it, go to the RGD tutorial video, “An Introduction to the RatMine Database”, at http://rgd.mcw.edu/wg/home/rgd_rat_community_videos/an-introduction-to-the-ratmine-database2.  Alternatively, follow the “self-guided tour” of RatMine by clicking the “Take a tour” link at the top of any RatMine page.

To try out RatMine for yourself, go to http://ratmine.mcw.edu/ and get started with simplified data mining and analysis.

So if you want to have more specific information about using RatMine, be sure to check out their introduction.

Quick Links:

InterMine: http://intermine.org/

RatMine: http://ratmine.mcw.edu/

modENCODE: http://www.modencode.org/

Galaxy: http://usegalaxy.org/

Reference:
Lyne, R., Smith, R., Rutherford, K., Wakeling, M., Varley, A., Guillier, F., Janssens, H., Ji, W., Mclaren, P., North, P., Rana, D., Riley, T., Sullivan, J., Watkins, X., Woodbridge, M., Lilley, K., Russell, S., Ashburner, M., Mizuguchi, K., & Micklem, G. (2007). FlyMine: an integrated database for Drosophila and Anopheles genomics Genome Biology, 8 (7) DOI: 10.1186/gb-2007-8-7-r129

ENCODE usability survey is up: please share

Hello folks: the team at UCSC involved with the ENCODE project is really interested in hearing from ENCODE data users about their interactions with the data. They’ve created a usability survey, and it would really help them out if you could offer your thoughts on this. Go to a gateway page, and you’ll see the yellow highlight that offers the link to take the survey:

When we do the ENCODE workshops, we often get feedback from the people who are there. We always deliver that to the UCSC team. But it’s not always so easy to contact users with different levels of knowledge, different locations, different project goals.

I’m not going to cloud your answers by telling you what we are hearing–but if you have had some experiences using the ENCODE data please share your thoughts.

If you haven’t been using ENCODE much yet, now is a good time to get started! The tutorial that is sponsored by the UCSC ENCODE team covers the same stuff that we do in our workshops. There’s also a paper that was just published called a “users guide” to ENCODE. I’ve got a blog post planned on that soon, but haven’t had a chance to work it up yet.

Introductory Tutorial: http://openhelix.com/ENCODE

New paper from the ENCODE team in PLoS Biology: A User’s Guide to the Encyclopedia of DNA Elements (ENCODE)

EDIT: As soon as I posted this, I saw a tweet about the modENCODE user survey too–if you use that data there’s a place for your feedback too!

RT @wormbase: Please take a moment to complete the modENCODE User Survey: http://bit.ly/kBNWt7

modENCODE: the data bonanza ensues

Another of the “big data” projects that is underway is the ENCODE project, or Encyclopedia of DNA Elements, to provide comprehensive annotation of genomic elements.  Some people are aware of this and are using the data already. If you aren’t, you should check out the online tutorial, freely available because it is sponsored by the UCSC ENCODE Data Coordination Center (DCC) team, for an overview of the organization and availability of the ENCODE mammal data that you can find in the UCSC Genome Browser. That data is flowing in, and you can start looking at it now.

There’s another branch of ENCODE, though, which is not housed at UCSC, that you should be aware of: modENCODE. The modENCODE project–as you might guess from the name–is aimed at model organisms. The principles are similar: to explore and analyze all the functional elements that make up the genome. But the focus is on model organism species: Drosophila and C. elegans. The data coordination center for modENCODE is handled separately from the mammalian branch, but the groups coordinate and interact in other project arenas.

There’s a marker paper from 2009 that establishes the foundation and the framework for the modENCODE project. But just before Christmas there were 2 papers that came out that provide terrific overviews of the status of the modENCODE projects. There’s one for each organism.

One of the parts that really struck me about the modENCODE features is that they have the opportunity to explore developmental life stages that aren’t possible with the human ENCODE data. As someone who studied developmental biology in the lab, that’s a particularly keen aspect of this for me. So much of what we know about human is adult or cell line data, and there’s so much to learn when you can explore over time in this way. Very neat.

Both papers provide the fairly standard sort of “big data” paper framework: why we did this, what we did, summary statistics for things they analyzed, and some compelling examples of a few sample tidbits. But like all of the big data papers, the real data you might need really isn’t in there. There’s going to be a lot more in the supplement. But mostly you’ll have to go to the DCC databases to browse around and query for items and regions of interest for your work. You should go over to the modENCODE site and start your mining with the modMINE tools.

I just noticed in my twitter feed today though that there’s more you should know about if this project is relevant for your work: there is a special issue of Genome Research that collects the more detailed data papers from the modENCODE projects. (hat tip to @bachinsky for that. PS: this is why I use twitter for work).

I haven’t had time to read the Genome Research papers yet, but I can see they cover the data, methods, and the reagents/resources that are associated with the project. There’s going to be a wealth of stuff over there. Check it all out.

References for modENCODE:

Marker paper 2009:

Celniker, S., Dillon, L., Gerstein, M., Gunsalus, K., Henikoff, S., Karpen, G., Kellis, M., Lai, E., Lieb, J., MacAlpine, D., Micklem, G., Piano, F., Snyder, M., Stein, L., White, K., & Waterston, R. (2009). Unlocking the secrets of the genome Nature, 459 (7249), 927-930 DOI: 10.1038/459927a

New papers 2010:
Gerstein, M., Lu, Z., Van Nostrand, E., Cheng, C., Arshinoff, B., Liu, T., Yip, K., Robilotto, R., Rechtsteiner, A., Ikegami, K., Alves, P., Chateigner, A., Perry, M., Morris, M., Auerbach, R., Feng, X., Leng, J., Vielle, A., Niu, W., Rhrissorrakrai, K., Agarwal, A., Alexander, R., Barber, G., Brdlik, C., Brennan, J., Brouillet, J., Carr, A., Cheung, M., Clawson, H., Contrino, S., Dannenberg, L., Dernburg, A., Desai, A., Dick, L., Dose, A., Du, J., Egelhofer, T., Ercan, S., Euskirchen, G., Ewing, B., Feingold, E., Gassmann, R., Good, P., Green, P., Gullier, F., Gutwein, M., Guyer, M., Habegger, L., Han, T., Henikoff, J., Henz, S., Hinrichs, A., Holster, H., Hyman, T., Iniguez, A., Janette, J., Jensen, M., Kato, M., Kent, W., Kephart, E., Khivansara, V., Khurana, E., Kim, J., Kolasinska-Zwierz, P., Lai, E., Latorre, I., Leahey, A., Lewis, S., Lloyd, P., Lochovsky, L., Lowdon, R., Lubling, Y., Lyne, R., MacCoss, M., Mackowiak, S., Mangone, M., McKay, S., Mecenas, D., Merrihew, G., Miller, D., Muroyama, A., Murray, J., Ooi, S., Pham, H., Phippen, T., Preston, E., Rajewsky, N., Ratsch, G., Rosenbaum, H., Rozowsky, J., Rutherford, K., Ruzanov, P., Sarov, M., Sasidharan, R., Sboner, A., Scheid, P., Segal, E., Shin, H., Shou, C., Slack, F., Slightam, C., Smith, R., Spencer, W., Stinson, E., Taing, S., Takasaki, T., Vafeados, D., Voronina, K., Wang, G., Washington, N., Whittle, C., Wu, B., Yan, K., Zeller, G., Zha, Z., Zhong, M., Zhou, X., , ., Ahringer, J., Strome, S., Gunsalus, K., Micklem, G., Liu, X., Reinke, V., Kim, S., Hillier, L., Henikoff, S., Piano, F., Snyder, M., Stein, L., Lieb, J., & Waterston, R. (2010). Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project Science, 330 (6012), 1775-1787 DOI: 10.1126/science.1196914

The modENCODE Consortium., Roy, S., Ernst, J., Kharchenko, P., Kheradpour, P., Negre, N., Eaton, M., Landolin, J., Bristow, C., Ma, L., Lin, M., Washietl, S., Arshinoff, B., Ay, F., Meyer, P., Robine, N., Washington, N., Di Stefano, L., Berezikov, E., Brown, C., Candeias, R., Carlson, J., Carr, A., Jungreis, I., Marbach, D., Sealfon, R., Tolstorukov, M., Will, S., Alekseyenko, A., Artieri, C., Booth, B., Brooks, A., Dai, Q., Davis, C., Duff, M., Feng, X., Gorchakov, A., Gu, T., Henikoff, J., Kapranov, P., Li, R., MacAlpine, H., Malone, J., Minoda, A., Nordman, J., Okamura, K., Perry, M., Powell, S., Riddle, N., Sakai, A., Samsonova, A., Sandler, J., Schwartz, Y., Sher, N., Spokony, R., Sturgill, D., van Baren, M., Wan, K., Yang, L., Yu, C., Feingold, E., Good, P., Guyer, M., Lowdon, R., Ahmad, K., Andrews, J., Berger, B., Brenner, S., Brent, M., Cherbas, L., Elgin, S., Gingeras, T., Grossman, R., Hoskins, R., Kaufman, T., Kent, W., Kuroda, M., Orr-Weaver, T., Perrimon, N., Pirrotta, V., Posakony, J., Ren, B., Russell, S., Cherbas, P., Graveley, B., Lewis, S., Micklem, G., Oliver, B., Park, P., Celniker, S., Henikoff, S., Karpen, G., Lai, E., MacAlpine, D., Stein, L., White, K., Kellis, M., Acevedo, D., Auburn, R., Barber, G., Bellen, H., Bishop, E., Bryson, T., Chateigner, A., Chen, J., Clawson, H., Comstock, C., Contrino, S., DeNapoli, L., Ding, Q., Dobin, A., Domanus, M., Drenkow, J., Dudoit, S., Dumais, J., Eng, T., Fagegaltier, D., Gadel, S., Ghosh, S., Guillier, F., Hanley, D., Hannon, G., Hansen, K., Heinz, E., Hinrichs, A., Hirst, M., Jha, S., Jiang, L., Jung, Y., Kashevsky, H., Kennedy, C., Kephart, E., Langton, L., Lee, O., Li, S., Li, Z., Lin, W., Linder-Basso, D., Lloyd, P., Lyne, R., Marchetti, S., Marra, M., Mattiuzzo, N., McKay, S., Meyer, F., Miller, D., Miller, S., Moore, R., Morrison, C., Prinz, J., Rooks, M., Moore, R., Rutherford, K., Ruzanov, P., Scheftner, D., Senderowicz, L., Shah, P., Shanower, G., Smith, R., Stinson, E., Suchy, S., Tenney, A., Tian, F., Venken, K., Wang, H., White, R., Wilkening, J., Willingham, A., Zaleski, C., Zha, Z., Zhang, D., Zhao, Y., & Zieba, J. (2010). Identification of Functional Elements and Regulatory Circuits by Drosophila modENCODE Science, 330 (6012), 1787-1797 DOI: 10.1126/science.1198374

Special issue of Genome Research: http://genome.cshlp.org/content/21/2.toc


One year of ENCODE data


We’ve talked about the ENCODE data before, and you can see a number of entries about the project with the ENCODE tag.  But last week I came across the ENCODE paper in the Nucleic Acids Research Advanced Access collection, so it seemed like a good time to review some of the information about this project.

ENCODE stands for ENCyclopedia Of DNA Elements.  It is one of the big data projects wrangled by NHGRI.  There was a pilot phase project to explore the utility and methods of assessing in extensive detail 1% of the genome–looking beyond the known and predicted genes at many more aspects of the genome.  After the results of the pilot phase were in, the project was examined again, certain choices were made on how to proceed, and the scale-up or production phase ensued.

The paper from the UCSC team describes the framework for the scale-up phase, starting with a focus on the choices that were made for cell types and data types that are used for the ongoing work. Table 1 is a nice summary overview of that to give you a sense of the scope.

They go on to describe some of the issues around housing and displaying the data from these projects. UCSC is the DCC, or Data Coordination Center, for the data. It often required new strategies to display the different cell types and data sets. One point they mention is that the methodology for several aspects of the project changed after the pilot: there is much more next-gen sequencing short-read data coming out of the scale-up. What this might mean for you even if you don’t care about human data or this project specifically: if you are trying to figure out nice ways to display your next-gen data you may find nice examples of strategies in this collection. As we’ve done training on the UCSC Genome Browser and ENCODE we found people were certainly interested in the data from that perspective.

The 3 main ways to interact with the data are provided next: the regular browser, the Table Browser, and downloading every bit of it, if you like.  A major difference in the regular browser from the pilot phase is that since now the data is genome wide, the ENCODE tracks can be integrated fully with all the other data as any other track.  Since it isn’t set off as a special project with limited coverage, you now will find ENCODE tracks in the track sections where they would be expected to be found–such as regulation, or expression, depending on the data type.  The pilot ones were in separate ENCODE track group areas.  Now you just have to look for the ENCODE icon next to the tracks to know they are part of this project.

They also stress the Data Use Policy, which includes free access to the data but under the Fort Lauderdale sort of embargo strategy.  If you are going to use the data (and they want you to make discoveries, so please do) just keep an eye on the time stamp of the embargo and properly cite those sources.  There’s more detail on that on the Data Policy page.

The paper also references the OpenHelix tutorials on the UCSC Genome Browser and ENCODE data.  UCSC sponsors us to provide the training freely, and you can access three tutorials on our site:

  • Introduction, for an overview of how the main browser works, with display features and definitions for menus and such.
  • Additional Tools, this has tools associated with the UCSC Browser and this is where you’ll find the ENCODE section.   Or you can view the ENCODE section separately here in a previous post about it (and I added it below again too).  It covers much of the same material that the paper does and should supplement your reading nicely.

You can download the slides and use them in your own talks, use the exercises for students or workshops, or just point folks to the materials if you like.

One other note: there is a separate DCC for the modENCODE project with Drosophila and C. elegans, and we touch on that in a post here.

Stand-alone ENCODE tutorial section: http://www.openhelix.com/downloads/jing/encode/encode_movie.html


Rosenbloom, K., Dreszer, T., Pheasant, M., Barber, G., Meyer, L., Pohl, A., Raney, B., Wang, T., Hinrichs, A., Zweig, A., Fujita, P., Learned, K., Rhead, B., Smith, K., Kuhn, R., Karolchik, D., Haussler, D., & Kent, W. (2009). ENCODE whole-genome data in the UCSC Genome Browser Nucleic Acids Research DOI: 10.1093/nar/gkp961

ENCODE wants your input on the data release policy

This week’s Tip of the Week is a bit different than some of the others that I have done in the past. I’m going to take you through parts of a document–the newly released draft of the Data Release Policy for ENCODE (go over to this page at NHGRI and get a copy of the document). I know–you expect software from us. But I will also show you a bit of software at the end, if you can stick with me for that. OK?

We’ve been talking about the ENCODE projects about once a month lately. We are hoping to raise awareness and understanding about the framework, foundations, and goals for ENCODE. That’s because a TON of genome-wide data is going to be collected and offered to researchers worldwide as this project progresses. And as we proceed I’ll be showing you how to access that data in the UCSC Genome Browser, since UCSC is the DCC (or data coordination center) to wrangle the human data around ENCODE.

However, if you are going to use ENCODE data, you need to know about the guidelines for using that data. That’s what I’ll cover today. And I’ll also give you a peek at some of the first data to come through the process at UCSC on the test server*. It is a sample of ChIP-Seq data from HudsonAlpha that I’ll use as an example.

In short, this data policy tries to balance the needs of the users of this publicly-funded data with those of the scientists who are generating this data. They are proposing a 9-month non-scoop window: the providers will release the data and have 9 months to submit their manuscripts on it. In the meantime, you can look at the data and start to use it. But in general, they ask that you don’t submit a paper without the consent of the ENCODE team in that window. The appendix offers a couple of nice scenarios about the appropriate use of the data so it helps to clarify this.
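If you want to keep track of when a given data set leaves its window, the date arithmetic is simple enough to script. This is only a sketch under my own assumptions: the release date below is hypothetical, the helper names are invented, and I am counting the 9 months as calendar months with the day clamped to the end of shorter months. The policy document and the time stamp on the data itself are the real authority.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    If the target month is shorter than the start day (e.g. the 31st),
    the day is clamped to the last day of the target month.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def embargo_end(release_date, window_months=9):
    """End of the non-scoop window, assuming a calendar-month count."""
    return add_months(release_date, window_months)


release = date(2009, 5, 31)  # hypothetical release time stamp
end = embargo_end(release)
print("window ends:", end.isoformat())
```

For a data set released on that made-up date, the window would run to the end of February 2010 under these assumptions.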

I hope you’ll have a look at the ENCODE draft data release policy and think about using the ENCODE data. And please give NHGRI and the ENCODE team feedback on this.

*Note on the test server: this is a sandbox for developers at UCSC, the data might not all have been QCed yet, and data here should not be considered final form. But you can have a look.

There’s been some coverage of the request for comment elsewhere, too, if you want to read more about this: http://www.genomeweb.com/issues/news/149419-1.html

UCSC Genome Browser “News” item has a link to the document as well.

Video Tip of the Week: modENCODE

We have talked about the ENCODE project before–both the successful pilot project and the current new phase of the ENCODE project that is going genome-wide, beyond the 1% coverage of the pilot project. One thing you may have noticed about the ENCODE data we talked about at the UCSC Genome Browser, though, is that it is very human-centric. But fear not–model organisms are in da house! There is actually a separate aspect of the ENCODE effort that I wanted to introduce today: modENCODE.

NHGRI has funded modENCODE researchers to take the ENCODE-style strategies and tools to some of our favorite model organisms: fly and worm–as you probably guessed from the logo. Already we are seeing data from the project, which you can access at the modENCODE project web site: http://www.modencode.org/

In this 4-minute tip of the week movie we’ll take a quick look at the resources available to examine the modENCODE data.

For the main modENCODE web site: http://www.modencode.org/

For the InterMine query tool for modENCODE data: http://intermine.modencode.org/release-3/begin.do