Spanking #ENCODE

While I was on the road last week (ironically, to do workshops, including one on ENCODE data in the UCSC Genome Browser), a conflama erupted over a newly published paper that essentially spanked the ENCODE team for some of the claims they made. Some of the first notes I saw:

Luckily I happened to be in a library when this drama broke, so while Trey talked about the Table Browser I ran to an unoccupied workshop computer and read the paper quickly. I will re-read it when I have more time, but I wanted to offer my initial thoughts while the chatter is ongoing.

Subsequently I saw a range of reactions to this: PZ Myers says ENCODE gets a public reaming; Mick Watson’s Dear ENCODE…; Homolog.us’ critique of Mick Watson’s response, Biomickwatson’s Ridiculous Criticism of ENCODE Critics; Biostars’ forum, ENCODE commentary from Dan Graur; and more. I’m sure there will be further fallout.

My first thoughts were that the paper was the snarkiest scientific paper I have ever read, and I thought it was hilarious. I also think some of the criticisms were completely valid. Some less so.

First I should establish some cred on this topic, and explain my role. I was not part of the official ENCODE analysis team, and was not an author on any of the papers. As OpenHelix we were engaged in some outreach for the UCSC Genome Browser Data Coordination Center–but we worked for and reported to them and not to the main ENCODE group. As such, we delivered training materials and workshops for several years, and although these touched on various data sets presented by many ENCODE teams, we did not have contact with other teams. The materials focused on how to locate and use the ENCODE data in the UCSC framework (ENCODE Foundations and ENCODE Data at UCSC were the recorded materials). However, we no longer receive any ENCODE-related funds in any manner.

So I was exploring ENCODE data a lot earlier than most people. I was making discoveries and finding out interesting new things years ago. And I was also with new users of ENCODE data in workshops around the country. This is the framework that you should use to assess my comments.

On the immortality of television sets….

In the Graur et al. paper, a number of aspects of the ENCODE project come under fire. The largest portion of the criticism is aimed at the claim of 80% functionality of the genome. This statement caused problems from day 1, and I agree that it was not a well-crafted statement. It was bound to catch media attention and it irked pretty much everyone. Nearly immediately Ewan Birney tried to explain the position, but most people still found this 80% thing unsatisfying and unhelpful. And I think the Graur et al. paper presents pretty clearly why it was so problematic.

Another criticism of the work is that the ENCODE project was focused on cell lines.

“We note that ENCODE used almost exclusively pluripotent stem cells and cancer cells, which are known as transcriptionally permissive environments.”

I understand this concern and even raised it myself in the past in the workshops. But there are 2 important things to note about that. First, in order to get everyone sufficient sample material to enable comparisons across techniques, technologies, and replications, it would not be possible to use human tissue samples. It just would be physically impossible. Second, a lot of non-ENCODE experimental work is carried out in these cell lines, and understanding the differences among cell lines may be incredibly useful in the long run. Making better choices about which ones mirror human conditions, or not using the wrong cell line to test some compound if it’s missing some key receptor, could be great information to have. I actually wish one of the papers had characterized the cell lines.

But another thing everyone missed: STEM CELLS. We now have the largest set of genome-wide data on human embryonic stem cells. This has been information that was particularly hard to obtain in the US, but now everyone can look around at that. I was really sorry to see that aspect of this project got no love whatsoever.

Besides that, the mouse ENCODE project did deliver tissue data, because with mice we can share strains and treatment protocols to get sufficient materials. Additionally, the modENCODE project got some really fascinating information on developmental stages that we couldn’t get on humans. I think all of these features are missing in the snark-fest.

Another criticism in the paper is the sensitivity vs specificity choice for reporting on the data.

 At this point, we must ask ourselves, what is the aim of ENCODE: Is it to identify every possible functional element at the expense of increasing the number of elements that are falsely identified as functional? Or is it to create a list of functional elements that is as free of false positives as possible. If the former, then sensitivity should be favored over selectivity; if the latter then selectivity should be favored over sensitivity. ENCODE chose to bias its results by excessively favoring sensitivity over specificity. In fact, they could have saved millions of dollars and many thousands of research hours by ignoring selectivity altogether, and proclaiming a priori that 100% of the genome is functional. Not one functional element would have been missed by using this procedure.

Maybe the Graur et al. team thinks that it should have been the other way. That’s fine–they can take all of this data and re-examine it, reprocess it, and deliver it with their thresholds. But I think at this time over-prediction is not the worst sin. Some of this technology is still being worked on, and some of the techniques will undoubtedly be refined as we go forward. Some of that will shake out once we look at regions and understand why some calls should or shouldn’t be made. Certainly there are going to be artifacts. But there may also be subtle and useful things that researchers with interests in a specific topic and a specific region will be able to suss out because they had some leads. Maybe some won’t pan out. But under-prediction and false negatives certainly carry that risk too.
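To make the trade-off in that quote concrete, here is a minimal toy sketch. The ten-position "genome", the three callers, and the function name are all made up for illustration; none of this is ENCODE's actual pipeline or thresholds. It only shows the arithmetic behind the jab that calling 100% of the genome functional misses nothing while having zero specificity.

```python
def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity) for paired boolean labels.

    Assumes `truth` contains at least one True and one False,
    so neither denominator is zero.
    """
    pairs = list(zip(truth, calls))
    tp = sum(t and c for t, c in pairs)          # functional, called functional
    fn = sum(t and not c for t, c in pairs)      # functional, missed
    tn = sum(not t and not c for t, c in pairs)  # non-functional, not called
    fp = sum(not t and c for t, c in pairs)      # non-functional, called anyway
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy genome: 10 positions, 2 of them truly functional.
truth = [True, True] + [False] * 8

call_everything = [True] * 10                    # "100% of the genome is functional"
permissive_caller = [True] * 4 + [False] * 6     # finds both, plus 2 false positives
strict_caller = [True, False] + [False] * 8      # no false positives, misses one

print(sensitivity_specificity(truth, call_everything))    # (1.0, 0.0)
print(sensitivity_specificity(truth, permissive_caller))  # (1.0, 0.75)
print(sensitivity_specificity(truth, strict_caller))      # (0.5, 1.0)
```

The permissive caller is the case I am defending above: it hands you some false leads to verify, but it does not hide the real ones.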

I don’t know how many of you have stood in front of rooms of researchers and opened up new data sets to them. I’ve done this quite a bit. I have heard the giggles of a researcher at NIH who was delighted to discover in our workshop that GATA1 binding evidence was present in front of a region she was interested in–and this evidence looked very solid to me. This data came from ENCODE years ago, and she could go back to her lab that afternoon and start to ask new questions and design new experiments long before the controversial statements. Just the other day there was a researcher who found new RNA-seq signals in an area he cares about. Will these turn out to be something? I don’t know. But he was eager to go back to the lab and look harder with the new knowledge.

Big science vs. small science

Another segment of the Graur paper is called “‘Big Science,’ ‘small science,’ and ENCODE.” I tell researchers in workshops that they need to take the leads they get from this data and look at them again, confirm them, and poke around with other tools and other cell lines or tissues. But I have seen that the ENCODE data has offered new paths and new ideas to researchers. As I wrote a while ago, ENCODE enables smaller science, and people who had no contact with the initial project are making new discoveries with this data. And I think this statement is unfair:

Unfortunately, the ENCODE data are neither easily accessible nor very useful—without ENCODE, researchers would have had to examine 3.5 billion nucleotides in search of function, with ENCODE, they would have to sift through 2.7 billion nucleotides.

Most researchers don’t need 3.5 billion or 2.7 billion nucleotides. But they are very interested in some specific regions, and many of those regions now have new and actionable information that these researchers didn’t have before. And it’s not hard to access this–although we would love to have been funded to do more workshops to show people how they can get to it*.

Alas

So in short, I thought the spanking was funny and partially deserved. Some of it was unwarranted. I was a bit surprised to see this level of snarkiness in a scientific paper rather than a blog post or some other format, and I think if that became a publishing trend it might not serve us well. But we are also coming to a point where the literature is less important than the data–because the data isn’t in the papers anymore. What will matter is what we see downstream as people use the ENCODE data. And I hope they do, because I think there’s gold in there. I’ve seen some. But you’ll have to verify it. I think the saddest thing would be if the drama over the claims made at the end causes people to walk away from the good that came from this. That would be a huge waste.

*If anyone asked me (not that anyone has), I think that outreach on big data projects should be improved in a number of ways. There should be a branch of the project whose only role is outreach–not attached to a specific project team–that has access to all of the teams, but can still maintain some distance. It would help to understand what new users face when they see the project. Often we find that teams on software or data projects are a bit too close to the materials and need to understand what it’s like to be an outsider looking in. And we find that people tell us things that they might not be willing to say to the development team directly, which can be very useful feedback. This is not specific to ENCODE but I have seen this numerous times in other projects as well.

 

Related posts:

Mining the “big data” is…fascinating. And necessary.

Video Tip of the Week: ENCODE enables smaller science

 

References:

Graur, D., Zheng, Y., Price, N., Azevedo, R., Zufall, R., & Elhaik, E. (2013). On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE. Genome Biology and Evolution. DOI: 10.1093/gbe/evt028

Rosenbloom, K., Sloan, C., Malladi, V., Dreszer, T., Learned, K., Kirkup, V., Wong, M., Maddren, M., Fang, R., Heitner, S., Lee, B., Barber, G., Harte, R., Diekhans, M., Long, J., Wilder, S., Zweig, A., Karolchik, D., Kuhn, R., Haussler, D., & Kent, W. (2012). ENCODE Data in the UCSC Genome Browser: year 5 update. Nucleic Acids Research, 41 (D1). DOI: 10.1093/nar/gks1172

++++++++

Update to add more blowback:

The Guardian: Scientists attacked over claim that ‘junk DNA’ is vital to life

Josh Witten at The Finch and Pea: So I take it you aren’t happy with ENCODE…

John Farrell at Forbes: ENCODE Papers Get A Fisking

RT @leonidkruglyak: And of course, there’s now a @FakeEncode parody account…

Jalees Rehman at SciLogs: The ENCODE Controversy And Professionalism In Science  (this also has a Storify of some of the chatter that’s gone on via twitter, where much of this goes on these days)

This was an early one but I missed it in my travel days, by Ashutosh Jogalekar at SciAm: ENCODE, Apple Maps and function: Why definitions matter

Anshul Kundaje takes issue with some of the conclusions drawn based on the data used: https://twitter.com/anshul

Derek Lowe at In the Pipeline: ENCODE: The Nastiest Dissent I’ve Seen in Quite Some Time

Mike’s Fourth Try, Mike Lin–an author in the consortium: My thoughts on the immortality of television sets

Rebecca Boyle at PopSci: The Drama Over Project Encode, And Why Big Science And Small Science Are Different

Gencode Genes: On the annotation of functionality in GENCODE (or: our continuing efforts to understand how a television set works).

Nicolas Le Novère: Coming out of the closet: I like ENCODE results, and I think the “switches” may be functional

W. Ford Doolittle: Is junk DNA bunk? A critique of ENCODE

Larry Moran: Ford Doolittle’s Critique of ENCODE

Nature Editorial: Form and function

MendelsPod has a couple of podcasts: Debating ENCODE: Dan Graur, Michael Eisen and Debating ENCODE Part II: Ross Hardison, Penn St.

Peter, a kiwi, on The ENCODE War Continues

Richard Gayle: ENCODE and the Truth

Sean Eddy (subscription req, #openaccess): The ENCODE project: Missteps overshadowing a success

10 thoughts on “Spanking #ENCODE”

  1. Neuroskeptic

    I don’t know much about junk DNA but (seriously) I think it’s good that scientific journals are publishing this kind of criticism. Some might say it’s unprofessional and dangerous to disagree with other scientists in this way and perhaps it is – but it happens. You hear much worse things said in private about research all the time. Much better that it be out in the open, than hidden.

  2. Mary Post author

    I think the junk vs functional discussion is very much worthwhile. And it won’t end here.

    I just hope the dispute doesn’t cloud the rest of the value in the data. If it drives people away from exploring ENCODE for other reasons it would be too bad.

    And personally I must have the high-tolerance version of the snark gene, so I don’t mind the way it was done. But I can see why some people don’t like the strategy.

  3. Pingback: The ENCODE Controversy And Professionalism In Science | The Next Regeneration

  4. Pingback: Critiche su ENCODE, forse si erano espressi male | Prometeus - ANBI Magazine

  5. Bert Overduin

    Thanks for this sane view of the ENCODE controversy / cat fight. All the points you raise are very valid. I especially like the “ENCODE enables smaller science” part!

    Greetings from a fellow outreacher (who has to give an ENCODE workshop soon)

  6. Mary Post author

    Thanks Bert!

    The ENCODE workshops go really well–people are totally intrigued by this whole slew of new stuff that’s available to them. And they aren’t stupid, they know you still have to verify things. But they really like how easy it is to look across multiple cell lines and technologies and use the weight of the evidence to assess the possibilities.

    Good luck with yours.

  7. Pingback: Role of ‘Professionalism’ in Science « Homologus

  8. Pingback: Interactions: February High Five | Altmetric.com

  9. THEMAYAN

    Just reading the abstract alone sounded more like a hit piece than professional scientific journalism. The mean spirited tone reeked of anger and bias.
    As I read further, I was surprised to find the authors paraphrasing Frank Zappa. Don’t get me wrong, I loved Zappa, but I think even he would have said that it would be very silly to use any of his utterances in a science journal, and especially one which seems to be more personalized than unbiased.

    “Data is not information, information is not knowledge, knowledge is not wisdom, wisdom is not truth,” —Robert Royar (1994) paraphrasing Frank Zappa’s (1979) anadiplosis

    I also found it interesting that they quoted T. R. Gregory, who is critical of ENCODE but for completely different reasons. According to Gregory, we supposedly knew about function decades ago, so this should be no big surprise. Of course, I had to remind him that maybe one of the problems lay in the fact that many scientists ignored this data (they should have just stuck to science and not gotten involved in the culture war); it is well documented that many instead held up this useless junk DNA paradigm as a poster child for bad design, with all this supposed empirical evidence to back it up. Like many others, Gregory is of the sort that follows the logic that if the data is incongruent with the theory, then the data must be wrong, as when he speaks of his “onion test” concerning the C-value paradox below.

    “The onion test is a simple reality check for anyone who thinks they can assign a function to every nucleotide in the human genome. Whatever your proposed functions are, ask yourself this question: Why does an onion need a genome that is about five times larger than ours?” –T. Ryan Gregory

    Dan Graur
    
    “playing fast and loose with the term “function,” by divorcing genomic analysis from its evolutionary context and ignoring a century of population genetics theory”…

    Dan, maybe it’s time to update these 80-year-old constructs, as the paper below, which is one of many, indicates…
    The new biology: beyond the Modern Synthesis. Michael R Rose and Todd H Oakley. “The last third of the 20th Century featured an accumulation of research findings that severely challenged the assumptions of the “Modern Synthesis” which provided the foundations for most biological research during that century. The foundations of that “Modernist” biology had thus largely crumbled by the start of the 21st Century. This in turn raises the question of foundations for biology in the 21st Century.”


    Dan Graur
    
    “There are two almost identical sequences in the genome. The first, TATAAA, has been maintained by natural selection to bind a transcription factor, hence, its selected effect function is to bind this transcription factor. A second sequence has arisen by mutation and, purely by chance, it resembles the first sequence; therefore, it also binds the transcription factor. However, transcription factor binding to the second sequence does not result in transcription, i.e., it has no adaptive or maladaptive consequence. Thus, the second sequence has no selected effect function, but its causal role function is to bind a transcription factor”

    Here is what ENCODE’s lead analysis coordinator E. Birney says about this…

    “Rather than being inert, the portions of DNA that do not code for genes contain about 4 million so-called gene switches, transcription factors that control when our genes turn on and off and how much protein they make, not only affecting all the cells and organs in our body, but doing so at different points in our lifetime. Somewhere amidst that 80% of DNA, for example, lie the instructions that coax an uncommitted cell in a growing embryo to form a brain neuron, or direct a cell in the pancreas to churn out insulin after a meal, or guide a skin cell to bud off and replace a predecessor that has sloughed off”

    Dan Graur 

    “The human genome is rife with dead copies of protein-coding and RNA-specifying genes that have been rendered inactive by mutation. These elements are called pseudogenes (Karro et al. 2007). Pseudogenes come in many flavors (e.g., processed, duplicated, unitary) and, by definition, they are nonfunctional”

    Not according to the paper below…
    PSEUDOGENES: Are They “Junk” or Functional DNA? Annual Review of Genetics
    Vol. 37: 123-151 (Volume publication date December 2003)
    First published online as a Review in Advance on June 25, 2003
    DOI: 10.1146/annurev.genet.37.040103.103949

    “Pseudogenes have been defined as nonfunctional sequences of genomic DNA originally derived from functional genes. It is therefore assumed that all pseudogene mutations are selectively neutral and have equal probability to become fixed in the population. Rather, pseudogenes that have been suitably investigated often exhibit functional roles, such as gene expression, gene regulation, generation of genetic (antibody, antigenic, and other) diversity. Pseudogenes are involved in gene conversion or recombination with functional genes. Pseudogenes exhibit evolutionary conservation of gene sequence, reduced nucleotide variability, excess synonymous over nonsynonymous nucleotide polymorphism, and other features that are expected in genes or DNA sequences that have functional roles”…

    It seems the biggest criticism in this paper is of how the word function is used, as ENCODE’s definition of function is broad, but it also seems kind of silly not to expect such a broad definition when the findings themselves are so broad. And again, just because the findings seem incongruent with how we view selection based on the modern synthesis (and/or what Stewart Newman refers to as these old entrenched dogmas), it does not mean the theory should trump scientific revelation and the discovery of new empirical data. Maybe it’s the theory that needs changing. One very well known scientist once told me: scientists don’t change their minds, they just die.

  10. Pingback: Al Gore on the ENCODE project | The OpenHelix Blog
