Tag Archives: ENCODE

Video Tips of the Week, Annual Review 2013 (part 1)

As you may know, we’ve been doing these video tips-of-the-week for six years now. We have completed or collected around 300 little tidbit introductions to various resources through this past year, 2013. At first we had to do all of our own video intros, but as the movie technology became more accessible and more teams made their own, we were able to find a lot more that were done by the resource providers themselves. So we began to collect those as well. At the end of the year we’ve established a sort of holiday tradition: we are doing a summary post to collect them all. If you have missed any of them it’s a great way to have a quick look at what might be useful to your work.

You can see past years’ tips here: 2008 I, 2008 II, 2009 I, 2009 II, 2010 I, 2010 II, 2011 I, 2011 II, 2012 I, 2012 II, 2013 II (next week).

Annual Review VI:

January 2013:
January 2: Annual Review V part deux
January 9: The New and Improved OMIM®
January 16: InSilico DB
January 23: ZooBank and species nomenclature
January 30: ScienceGameCenter #edtech

February 2013:
February 6: MotifLab workbench for TFBS analysis
February 13: UCSC Genome Browser restriction enzyme display
February 20: ENCODE Data at UCSC (reminder)
February 27: NetGestalt

March 2013:
March 6: NCBI Genomics Workbench
March 13: FlyBase
March 20: figshare + GenoCAD = outreach
March 27: Enzyme Portal and User-Centered Design

April 2013:
April 3: Phytozome and the Peach Genome
April 10: Introductory Cheminformatics
April 17: Sharing H7N9 data at GISAID.org with EpiFlu™
April 24: Cancer Atlas roadmap

May 2013:
May 1: My Cancer Genome
May 8: Transfac (and HGMD, Proteome, etc)
May 15: Influenza Research Database (IRD)
May 22: Canary Database for sentinels of human health
May 29: QIIME for Quantitative Insights Into Microbial Ecology

June 2013:
June 5: Prezi and other nonlinear presentation methods
June 12: TrioVis for family genome data sets
June 19: ENCODE ChIP-Seq Significance Tool
June 26: InnateDB, Systems Biology of the Innate Immune Response

VideoTip of the Week: ENCODE @ Ensembl

We have a lot of tutorials (2 in fact, ENCODE Foundations & ENCODE @ UCSC), tips and information about ENCODE. We also have a lot of tutorials (again 2, Ensembl and Ensembl Legacy, the latter on the older versions), tips and information about Ensembl, the database and browser at EBI.

Now here is a tip of the week on both Ensembl AND ENCODE. This is one of the more recent additions to Ensembl’s video tutorials. This video looks at how to identify sequences that may be involved in gene regulation. Most of this data at Ensembl is based on ENCODE data. The video uses the “Matrix,” a way to select the regulation data you need based on cell types and TFs. At the end of the 8-minute video they discuss a bit more about how to get all ENCODE data.

So, now you have a wealth of information here at OpenHelix through our tutorials and our blog about ENCODE and Ensembl.

Quick links:

ENCODE: http://encodeproject.org/ENCODE/
ENCODE @ UCSC: http://genome.ucsc.edu/ENCODE/
Ensembl: http://www.ensembl.org
ENCODE Tutorials: http://openhelix.com/encode
Ensembl Tutorials: http://openhelix.com/cgi/tutorialInfo.cgi?id=95

Video Tip of the Week: ENCODE ChIP-Seq Significance Tool

We’ve been doing training and workshops on the UCSC Genome Browser for 10 years now. It’s a tremendous tool that has to be a foundational item in your toolkit in genomics. But there may be times when you want to examine some of the data that you can find there in another way, with a different focus or emphasis. It might be possible to craft some clever Table Browser queries that get you what you want. Sometimes, though, someone else has created a way for you to query the underlying data for a topic that could be useful too. And today’s tip of the week is exactly this kind of tool: a web interface to query the ENCODE data that resides in the UCSC Genome Browser, with a focus on finding transcription factors with enriched binding in a region that you might be interested in exploring. Today’s video tip is for the ENCODE ChIP-Seq Significance Tool.

There’s a ton of great data that flowed into the UCSC Genome Browser as part of the ENCODE project. It’s going to provide years of mining for biologists. What would be great is for biomedical researchers who have interest in specific genes–or sets of genes–to take a look at the ENCODE data to see if they can unearth some useful insights about the regulation of these genes or lists of genes. You can use the ChIP-Seq Significance tool to sift through the data.
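To give a feel for what "enriched binding" means for a gene list, here is a minimal sketch of an enrichment calculation. It assumes a one-sided hypergeometric test, which is a common choice for this kind of question; the tool's actual statistics and corrections are described in the paper cited below. The function name and all of the counts are hypothetical, chosen only for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """Return P(X >= observed) for a hypergeometric draw.

    population: number of background genes (hypothetical)
    successes:  background genes with a binding peak for the factor
    draws:      size of the gene list being tested
    observed:   genes in the list that have a peak
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the probability of seeing `observed` or more bound genes by chance.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 1,500 bound by the factor,
# a 50-gene list of interest, 12 of which are bound.
p_value = hypergeom_enrichment_p(20000, 1500, 50, 12)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value here would only say that the factor's peaks are over-represented near your genes compared with the background. When many factors are tested at once, a multiple-testing correction is needed as well.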

The video that the Butte lab team did is very nice. Very specific guidance on how to use their tool–what to choose for the menu options, what the choices are, and what to expect from the results. Here’s their video:

Of course you should read their paper about this tool for the background you need (linked below), and the references that will also help you to understand what this tool offers. You should also read up on the associated ENCODE data. The supplement with the paper is also nicely written in clear language to help you to understand the features.

One of the things I was curious about was whether this might be extended to the mouse data too. One thing that people grouse to me about is that ENCODE is cell line data, and tissue data would really be great. But I saw discussion at Stephen Turner’s blog (read the comments) about the focus on human for now. There was also discussion of the CScan tool, though, which does cover the mouse data. So if this is a tool you are interested in, you might want to explore CScan too.

Hat tip to Stephen Turner for the awareness:

Quick links:

ENCODE ChIP-Seq Significance Tool: http://encodeqt.stanford.edu/

CScan: http://www.beaconlab.it/cscan

Reference:

Auerbach, R., Chen, B., & Butte, A. (2013). Relating Genes to Function: Identifying Enriched Transcription Factors using the ENCODE ChIP-Seq Significance Tool Bioinformatics DOI: 10.1093/bioinformatics/btt316

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

Yes.


What’s the Answer? (making multiwig)

BioStar is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question is a bit different. It has to do more with creating the overlay track type that became available at UCSC that lets you show multiple results–like the colored peak and valley wiggle tracks you might recognize from the default ENCODE regulation data. People have been asking at our workshops how those are done, and we point them to the multiwig info, but this is a nice guide to doing that as well.

Tutorial: Overlay Multiple Tracks in UCSC Browser [Quick Minimal Tutorial]

Hi, just wrote a quick tutorial for overlaying multiple tracks in ucsc.

Follow the link http://biofeed.tumblr.com/post/45676161703/overlay-multiple-tracks-in-ucsc-browser.

For full documentation and detailed syntax refer to

http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html

http://genome.ucsc.edu/goldenPath/help/trackDb/trackDbHub.html

Sukhdeep Singh

Have a look, and try it out for your data. And check out that new visualization tool in the comments below as well.
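If you want to script this, here is a minimal sketch that writes out a multiWig overlay stanza for a track hub. The settings it emits (container multiWig, aggregate transparentOverlay, parent, bigDataUrl and so on) are the ones described in the UCSC trackDb documentation linked above; the function name, track names, and URLs are hypothetical placeholders, so swap in your own bigWig files.

```python
def multiwig_stanza(name, label, subtracks):
    """Build trackDb text for a multiWig container with overlaid bigWig subtracks.

    subtracks: list of (subtrack_name, bigwig_url, "R,G,B") tuples.
    """
    lines = [
        f"track {name}",
        "container multiWig",
        f"shortLabel {label}",
        f"longLabel {label}",
        "type bigWig",
        "aggregate transparentOverlay",  # draw the subtracks on top of each other
        "showSubtrackColorOnUi on",
        "visibility full",
        "",
    ]
    for sub_name, url, color in subtracks:
        lines += [
            f"    track {sub_name}",
            f"    parent {name}",
            f"    bigDataUrl {url}",
            f"    shortLabel {sub_name}",
            f"    longLabel {sub_name}",
            "    type bigWig",
            f"    color {color}",
            "",
        ]
    return "\n".join(lines)


# Placeholder subtracks: two signal files shown in red and blue.
print(multiwig_stanza(
    "myOverlay",
    "My overlay",
    [
        ("rep1", "http://example.org/rep1.bw", "255,0,0"),
        ("rep2", "http://example.org/rep2.bw", "0,0,255"),
    ],
))
```

The printed text would go into the trackDb.txt file of your hub. Check the result against the documentation, since your hub may need additional settings.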

Protip: check the genome of your cell line. HeLa cells are “strikingly aberrant”

This is a paper I’ve been waiting for: the analysis of the HeLa genome. I was aware of a lot of issues with the cell lines and missing or duplicated regions from the ENCODE data that was coming along some time ago: Mining the “big data” is…fascinating. And necessary.

People may be familiar with HeLa cells even if they aren’t in biomedical research because of the great book by Rebecca Skloot: The Immortal Life of Henrietta Lacks which explored the history of these cells and the woman whose terrible cancer led to their existence.

But there were many discussions over the years about how different these cells are from actual tissues, and concerns over how representative they are for actual human research issues. Here are some:

So a new paper has been published that explores this–and it’s at the top of my reading list for later today.

Here’s the paper itself: http://www.g3journal.org/content/early/2013/03/11/g3.113.005777.abstract 

Hat tip Ward Plunet via twitter:
RT @WardPlunet: Havoc in biology’s most-used human cell line: Genome of HeLa cells sequenced for the first time http://t.co/VVpZmiwIiX .

Update: A piece from one of the paper’s authors:

Reference:

Landry JJM, Pyl PT, Rausch T, Zichner T, Tekkedil MM, Stütz AM, Jauch A, Aiyar RS, Pau G, Delhomme N, Gagneur J, Korbel JO, Huber W, & Steinmetz LM (2013). The Genomic and Transcriptomic Landscape of a HeLa Cell Line G3 DOI: 10.1534/g3.113.005777

Spanking #ENCODE

While I was on the road last week (ironically, to do workshops including one on ENCODE data in the UCSC Genome Browser), a conflama erupted over a newly published paper that essentially spanked the ENCODE team for some of the claims they made. Some of the first notes I saw:

Luckily I happened to be in a library when this drama broke, so while Trey talked about the Table Browser I ran to an unoccupied workshop computer and read the paper quickly. I will re-read it when I have more time, but I wanted to offer my initial thoughts while the chatter is ongoing.

Subsequently I saw a range of reactions to this: PZ Myers says ENCODE gets a public reaming; Mick Watson’s Dear ENCODE…. ; Homolog.us’ critique of Mick Watson’s response Biomickwatson’s Ridiculous Criticism of ENCODE Critics ; Biostars’ forum ENCODE commentary from Dan Graur …and more. I’m sure there will be further fallout.

My first thoughts were that the paper was the snarkiest scientific paper I have ever read, and I thought it was hilarious. I also think some of the criticisms were completely valid. Some less so.

First I should establish some cred on this topic, and explain my role. I was not part of the official ENCODE analysis team, and was not an author on any of the papers. As OpenHelix we were engaged in some outreach for the UCSC Genome Browser Data Coordination Center–but we worked for and reported to them and not to the main ENCODE group. As such, we delivered training materials and workshops for several years, and although these touched on various data sets presented by many ENCODE teams, we did not have contact with other teams. The materials were aimed at how to locate and use the ENCODE data in the UCSC framework. (ENCODE Foundations and ENCODE Data at UCSC were the recorded materials). However, we are now no longer receiving any ENCODE-related funds in any manner.

So I was exploring ENCODE data a lot earlier than most people. I was making discoveries and finding out interesting new things years ago. And I was also with new users of ENCODE data in workshops around the country. This is the framework that you should use to assess my comments.

On the immortality of television sets….

In the Graur et al paper, there are a number of aspects of the ENCODE project that come under fire. The largest portion of this was aimed at the claim of 80% functionality of the genome. This statement caused problems from day 1, and I agree that it was not a well-crafted statement. It was bound to catch media attention and it irked pretty much everyone. Nearly immediately Ewan Birney tried to explain the position, but most people still found this 80% thing unsatisfying and unhelpful. And I think the Graur et al paper presents why it was so problematic pretty clearly.

Another criticism of the work is that the ENCODE project was focused on cell lines.

“We note that ENCODE used almost exclusively pluripotent stem cells and cancer cells, which are known as transcriptionally permissive environments.”

I understand this concern and even raised it myself in the past in the workshops. But there are 2 important things to note about that: in order to get everyone sufficient sample material to enable comparisons across techniques, technologies, and replications, it would not be possible to use human tissue samples. It just would be physically impossible. Further, a lot of non-ENCODE experimental work is carried out in these cell lines and understanding the difference among cell lines may be incredibly useful in the long run. Making better choices about which ones mirror human conditions, or not using the wrong cell line to test some compound if it’s missing some key receptor could be great information to have. I wish there had been one of the papers that characterized the cell lines, actually.

But another thing everyone missed: STEM CELLS. We now have the largest set of genome-wide data on human embryonic stem cells. This has been information that was particularly hard to obtain in the US, but now everyone can look around at that. I was really sorry to see that aspect of this project got no love whatsoever.

Besides that, the mouse ENCODE project did deliver tissue data, because with mice we can share strains and treatment protocols to get sufficient materials. Additionally the modENCODE project got some really fascinating information on developmental stages that we couldn’t get on humans. I think all of these features are missing in the snark-fest.

Another criticism in the paper is the sensitivity vs specificity choice for reporting on the data.

 At this point, we must ask ourselves, what is the aim of ENCODE: Is it to identify every possible functional element at the expense of increasing the number of elements that are falsely identified as functional? Or is it to create a list of functional elements that is as free of false positives as possible. If the former, then sensitivity should be favored over selectivity; if the latter then selectivity should be favored over sensitivity. ENCODE chose to bias its results by excessively favoring sensitivity over specificity. In fact, they could have saved millions of dollars and many thousands of research hours by ignoring selectivity altogether, and proclaiming a priori that 100% of the genome is functional. Not one functional element would have been missed by using this procedure.

Maybe the Graur et al team thinks that it should have been the other way. That’s fine–they can take all of this data and re-examine it, reprocess it, and deliver it with their thresholds. But I think at this time over-prediction is not the worst sin. Some of this technology is still being worked on. Some of the techniques will undoubtedly be refined as we go forward. But some of that will shake out once we look at regions and understand why some calls should or shouldn’t be made. Certainly there are going to be artifacts. But there may also be subtle and useful things that researchers on a specific topic and with interests in a specific region will be able to suss out because they had some leads. Maybe some won’t pan out. But certainly that’s not impossible with under-prediction or false negatives either.
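The trade-off in that quote is easy to see with a little arithmetic. This is a minimal sketch using the standard definitions of sensitivity and specificity. The counts are made up for illustration and do not come from ENCODE or from the Graur et al paper.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of truly functional elements that were called
    specificity = tn / (tn + fp)  # share of non-functional elements correctly left out
    return sensitivity, specificity


# Hypothetical genome of 1,000 positions, 100 of them truly functional.

# Calling every position functional misses nothing, but excludes nothing either.
print(sensitivity_specificity(tp=100, fp=900, tn=0, fn=0))  # (1.0, 0.0)

# A stricter, equally hypothetical threshold trades missed elements for fewer false calls.
strict = sensitivity_specificity(tp=60, fp=30, tn=870, fn=40)
print(strict)
```

The first case is the "proclaim 100% of the genome functional" scenario from the quote. The second shows what under-prediction costs: 40 real elements that nobody gets a lead on.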

I don’t know how many of you have stood in front of rooms of researchers and opened up new data sets to them. I’ve done this quite a bit. I have heard the giggles of a researcher at NIH who was delighted to discover in our workshop that GATA1 binding evidence was present in front of a region she was interested in–and this evidence looked very solid to me. This data came from ENCODE years ago, and she could go back to her lab that afternoon and start to ask new questions and design new experiments long before the controversial statements. Just the other day there was a researcher who found new RNA-seq signals in an area he cares about. Will these turn out to be something? I don’t know. But he was eager to go back to the lab and look harder with the new knowledge.

Big science vs. small science

Another segment of the Graur paper is called “‘Big Science,’ ‘small science,’ and ENCODE.” I tell researchers in workshops that they need to take the leads they get from this and look at it again, confirm it, and poke around with other tools and other cell lines or tissues. But I have seen that the ENCODE data has offered new paths and new ideas to researchers. As I wrote a while ago, ENCODE enables smaller science and people who had no contact with the initial project are making new discoveries with this data. And I think this statement is unfair:

Unfortunately, the ENCODE data are neither easily accessible nor very useful—without ENCODE, researchers would have had to examine 3.5 billion nucleotides in search of function, with ENCODE, they would have to sift through 2.7 billion nucleotides.

Most researchers don’t need 3.5 billion or 2.7 billion nucleotides. But they are very interested in some specific regions, and many of those regions now have new and actionable information that these researchers didn’t have before. And it’s not hard to access this–although we would love to have been funded to do more workshops to show people how they can get to it*.

Alas

So in short, I thought the spanking was funny and partially deserved. Some of it was unwarranted. I was a bit surprised to see this level of snarkiness in a scientific paper rather than a blog post or some other format, and I think if that became a publishing trend it might not serve us well. But we are also coming to a point where the literature is less important than the data–because the data isn’t in the papers anymore. What will matter is what we see downstream as people use the ENCODE data. And I hope they do, because I think there’s gold in there. I’ve seen some. But you’ll have to verify it. I think the saddest thing would be if the drama over the claims made at the end caused people to walk away from the good that came from this. That would be a huge waste.

*If anyone asked me (not that anyone has), I think that outreach on big data projects should be improved in a number of ways. There should be a branch of the project whose only role is outreach–not attached to a specific project team–that has access to all of the teams, but can still maintain some distance. It would help to understand what new users face when they see the project. Often we find that teams on software or data projects are a bit too close to the materials and need to understand what it’s like to be an outsider looking in. And we find that people tell us things that they might not be willing to say to the development team directly, which can be very useful feedback. This is not specific to ENCODE but I have seen this numerous times in other projects as well.

 

Related posts:

Mining the “big data” is…fascinating. And necessary.

Video Tip of the Week: ENCODE enables smaller science

 

References:

Graur, D., Zheng, Y., Price, N., Azevedo, R., Zufall, R., & Elhaik, E. (2013). On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE Genome Biology and Evolution DOI: 10.1093/gbe/evt028

Rosenbloom, K., Sloan, C., Malladi, V., Dreszer, T., Learned, K., Kirkup, V., Wong, M., Maddren, M., Fang, R., Heitner, S., Lee, B., Barber, G., Harte, R., Diekhans, M., Long, J., Wilder, S., Zweig, A., Karolchik, D., Kuhn, R., Haussler, D., & Kent, W. (2012). ENCODE Data in the UCSC Genome Browser: year 5 update Nucleic Acids Research, 41 (D1) DOI: 10.1093/nar/gks1172

++++++++

Update to add more blowback:

The Guardian: Scientists attacked over claim that ‘junk DNA’ is vital to life

Josh Witten at The Finch and Pea: So I take it you aren’t happy with ENCODE…

John Farrell at Forbes: ENCODE Papers Get A Fisking

RT @leonidkruglyak: And of course, there’s now a @FakeEncode parody account…

Jalees Rehman at SciLogs: The ENCODE Controversy And Professionalism In Science  (this also has a Storify of some of the chatter that’s gone on via twitter, where much of this goes on these days)

This was an early one but I missed it in my travel days, by Ashutosh Jogalekar at SciAm: ENCODE, Apple Maps and function: Why definitions matter

Anshul Kundaje takes issue with some of the conclusions drawn based on the data used: https://twitter.com/anshul

Derek Lowe at In the Pipeline: ENCODE: The Nastiest Dissent I’ve Seen in Quite Some Time

Mike’s Fourth Try, Mike Lin–an author in the consortium: My thoughts on the immortality of television sets

Rebecca Boyle at PopSci: The Drama Over Project Encode, And Why Big Science And Small Science Are Different

Gencode Genes: On the annotation of functionality in GENCODE (or: our continuing efforts to understand how a television set works).

Nicolas Le Novère: Coming out of the closet: I like ENCODE results, and I think the “switches” may be functional

W. Ford Doolittle Is junk DNA bunk? A critique of ENCODE

Larry Moran: Ford Doolittle’s Critique of ENCODE

Nature Editorial: Form and function

MendelsPod has a couple of podcasts: Debating ENCODE: Dan Graur, Michael Eisen and Debating ENCODE Part II: Ross Hardison, Penn St.

Peter, a kiwi, on The ENCODE War Continues

Richard Gayle ENCODE and the Truth

Sean Eddy (subscription req) #openaccess : The ENCODE project: Missteps overshadowing a success

Video Tip of the Week: ENCODE Data at UCSC (reminder)

This week’s tip of the week is a reminder: go and watch the ENCODE tutorial that is sponsored by the UCSC Genome Browser’s ENCODE team. There is now just one week of free access left.

One week left for free access to the movie, slides, and exercises

For a number of years a team at the UCSC Genome Browser group was designated as the DCC, or Data Coordination Center, charged with wrangling and display of ENCODE data so that everyone could access it as soon as possible. We covered the early stages of this in a tutorial about the ENCODE Foundations.

As the project matured and more and more data flowed in, the DCC folks created new tracks and displays and new management tools to work with the data. We created a second tutorial to explain and explore that: ENCODE Data at UCSC.

Thousands of people have come and watched the set of “free” suites that we have. However, sometimes the grants come to an end and we no longer have the support to keep them available. So soon we’ll be taking them from the “free” page and putting them in the subscription side. The materials will still be available by individual purchase or by institutional subscription.

A recent paper (cited below) provides details on a number of the features that we explore in the tutorial. The data types and the display strategies are discussed. Access tools and distribution details are provided. If you have a look at that paper and watch the tutorial you’ll have a great grasp of how to interact with the ENCODE data in the UCSC Genome Browser.

But you still have a week! Go watch the video from the “launch tutorial” button. Download the slides and exercises.

We’ll be keeping an eye on the next steps for the ENCODE project and the transition to the new DCC. And we’ll update the materials when it’s needed.

Quick links:

ENCODE Foundations tutorial suite: www.openhelix.com/ENCODE

ENCODE Data at UCSC tutorial suite: www.openhelix.com/ENCODE2

ENCODE Project landing page: www.encodeproject.org

References:

Rosenbloom, K., Sloan, C., Malladi, V., Dreszer, T., Learned, K., Kirkup, V., Wong, M., Maddren, M., Fang, R., Heitner, S., Lee, B., Barber, G., Harte, R., Diekhans, M., Long, J., Wilder, S., Zweig, A., Karolchik, D., Kuhn, R., Haussler, D., & Kent, W. (2012). ENCODE Data in the UCSC Genome Browser: year 5 update Nucleic Acids Research, 41 (D1) DOI: 10.1093/nar/gks1172

Video Tip of the Week: MotifLab workbench for TFBS analysis

When we do workshops, I sing the praises of the ENCODE data that’s genome-wide, and how it is offering amazing new opportunities to explore and discover new features of your genomic regions of interest. But I know that’s all I can do to introduce folks to the data in a short session–and they need to take it to the next level themselves. In the future I’ll be pointing them to MotifLab as a way they might want to proceed after their UCSC and ENCODE training.

MotifLab is a software tool that lets you take segments of interest and process them with a number of other useful data types and tools, integrated into one place. You can then apply various motif finding and analysis tools to assess the region. And you can layer on other data types to help you to further understand what’s going on in that spot.

In the past I might have taken regions of interest from the UCSC Genome Browser and gone to other sites to accomplish many of the things that are integrated into MotifLab. And some of those tools–while nice–don’t offer me the visual track-based additional data I want to consider once I’ve analyzed my stuff. I’d end up taking it back to UCSC as a custom track, uploading that, and then exploring some more. But over many regions, that can be hard to visualize simultaneously.

In fact, it reminds me of a question that a lot of trainees ask in our UCSC Genome Browser workshops: can I have several regions open at the same time? And I think that MotifLab will essentially let you do that, in one place.

They have a series of tutorials to work through to help you to understand what they offer and how to accomplish it. I will point you there, because I haven’t had the time to fully examine all the details myself. But I have plans to take some data I’m interested in, and run it through the paces there. It might not be the right tool for everyone, but it’s got the right combination of tools and graphics to work the way I like to think about the data.

The tutorials they have aren’t embeddable, so I link you to them with the screenshot above. Or you can go directly to the list. They take a while to load, and there’s no audio–you will click through the segments. But they give a good grasp of the kinds of things you can do.

I found out about MotifLab via Chris Upton. Hat tip to his handy Scoop-It collection.

Quick links:

MotifLab: http://www.motiflab.org

MotifLab tutorials list: http://tare.medisin.ntnu.no/motiflab/index.php?page=tutorial

Reference:

Klepper, K., & Drabløs, F. (2013). MotifLab: a tools and data integration workbench for motif discovery and regulatory sequence analysis BMC Bioinformatics, 14 (1) DOI: 10.1186/1471-2105-14-9