Getting leads from Big Data

The tension that I see between “big data” projects and biologists who could benefit from the data–but don’t seem to be mining it enough–is frustrating to me. In almost every ENCODE workshop we do, we hear from at least one person who has made some new connection between their project and some of the data we’ve just shown them how to explore. And the “big data” teams and funding agencies want you to succeed with this data–it will make them look really good and like money well spent :) . But to get the most out of these projects, we need biologists with specialties in an area to take their knowledge to the databases and mine out some insights relevant to their work.

Sometimes I wonder about how to fix this–how to make more people realize there are features waiting to be unearthed in the data that could help their research. We can’t seem to get everywhere to do workshops, and the online training reaches a lot of people, but not everyone has found it yet. And what’s waiting may not be as obvious as a new SNP in the middle of the gene they study (although that would be great), or a transcription-factor binding site that is active in genes they study (although this was also great, and we had a woman giggling out loud when we showed her this in an ENCODE data workshop once). People aren’t aware of the range of data that’s being handed out.

So I thought maybe I could show examples of stuff I find interesting that might be useful to someone. I was trained as a cell biologist, so I come to big data from that perspective. I can see both sides of the divide here. What I have decided to illustrate–in an open science way–is some information about a cell line that might help researchers decide whether it is–or possibly is not–a good model for their work.

I loved working with cell cultures. But I was always aware of issues with them that can keep you from drawing conclusions about their relevance to in vivo biology. And one of the things that I became specifically interested in as the ENCODE data was flowing in was: how much do these cell lines look like the reference genome? For the first time you could look across the whole genome of some cell lines and make that comparison.

So one day I turned on the Copy Number Variation data for cell lines from the ENCODE project (note: this is on the March 2006 assembly). I wanted to see if the cell lines showed evidence of missing pieces of their genomes. I browsed them chromosome by chromosome, looking for big red deletions. And you know what–they did have big chunks missing in some cases.

Here’s an example of one region I became specifically interested in and took a bit further. It is the 9p21 segment, with the CNV data for the ENCODE cell lines GM12878, HepG2, and K562 turned on in “full” visibility. You can see them down at the bottom in the image. I have created a “session” you can load to see it yourself on the browser by clicking here. I have loaded a static screen capture of this region to FigShare. Abbreviated screen capture of this region below:

At the bottom of the image, you can see red bars that indicate homozygous deletions compared to the reference genome in this region for the K562 cells. If you look at the UCSC Genes track in the region you can see it includes quite a lot of genes. It might be important for you to know these are absent in this cell line.
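If you would rather check this kind of thing in a script than by eye, the idea boils down to an interval overlap test. Here is a minimal sketch, assuming you have already exported the CNV calls and gene positions into simple records. The coordinates, the gene names, and the "homozygous_deletion" label are all made up for illustration; they are not the actual ENCODE track values or file format.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive (half-open, as in BED-style coordinates)
    label: str   # gene name, or the CNV call for that segment


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def genes_hit_by_deletions(cnv_calls, genes, deletion_label="homozygous_deletion"):
    """Return names of genes that overlap any segment called as a deletion.

    The label string is a hypothetical placeholder; use whatever your
    exported CNV table actually calls a homozygous deletion.
    """
    deletions = [c for c in cnv_calls if c.label == deletion_label]
    return [g.label for g in genes if any(overlaps(g, d) for d in deletions)]


# Made-up example records, NOT real genome coordinates.
cnv_calls = [
    Interval("chr9", 1_000, 5_000, "homozygous_deletion"),
    Interval("chr9", 8_000, 9_000, "normal"),
    Interval("chr1", 1_000, 5_000, "homozygous_deletion"),
]
genes = [
    Interval("chr9", 1_500, 2_500, "GENE_A"),  # fully inside the chr9 deletion
    Interval("chr9", 4_800, 6_000, "GENE_B"),  # partly inside the chr9 deletion
    Interval("chr9", 8_200, 8_800, "GENE_C"),  # inside a segment called normal
    Interval("chr9", 5_000, 5_500, "GENE_D"),  # starts where the deletion ends
]

affected = genes_hit_by_deletions(cnv_calls, genes)
print(affected)
```

With these toy records, only the genes that share at least one base with a deleted segment on the same chromosome are reported. A gene that merely touches the end of a deletion is not counted because the intervals are half-open.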

This data made me curious about the region. I learned that there was reason to suspect there would be missing pieces here. A paper from Gursky et al. in 2001 showed evidence of a deletion, but with the technology and information they had, and with their focus on specific genes, they could only demonstrate a small region. They could only look with the flashlight they had. Now, with this data, we can see the real extent of the loss in that area. We’ve got a floodlight on the region.

As that paper indicated years ago, the 9p21 region was already known to be a suspicious section in a number of cancers and in vivo samples. Work continues to find deletions and other variations there, as you can see with a PubMed search today (9p21 AND cancer). I also found numerous examples in the ICGC database of deletions of genes in this region in tumors from patients with various cancers. Here’s a sample of a search for the MTAP gene that’s in that deleted region. You’ll see that multiple cancer types and many samples show that loss as well. This is a query of their BioMart interface. [Edit: here’s a link directly to an MTAP summary page, not sure if the link will work.]
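The PubMed search mentioned above can also be written down as a URL, which is handy if you want to rerun it later or share it. This is a small sketch that only builds the query address for NCBI's E-utilities ESearch service and does not contact the server. The helper name pubmed_search_url and the retmax default are my own choices for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an ESearch URL for a PubMed query (hypothetical helper)."""
    params = {
        "db": "pubmed",     # search the PubMed database
        "term": term,       # the same text you would type in the search box
        "retmax": retmax,   # maximum number of record IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = pubmed_search_url("9p21 AND cancer")
print(url)
```

Pasting the printed address into a browser should return a list of matching PubMed IDs. If you script repeated queries, do check NCBI's current usage guidelines first.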

So maybe this isn’t earth-shattering information. But it confirms previous work illustrating deletions in this region–and adds to it. Now we know the extent of the deletion in K562 cells. If you wanted to use those cells as a test of cancer treatments, it may help you to understand what’s happening. You could even transform the cells with individual genes to see if that affects them and your treatment strategies. Or, knowing this, you might not want to use K562 cells for your particular investigations.

You may not be interested in cancer research. You may not be interested in this particular cell line. You might not even be interested in humans–but it turns out this region is also interesting for cancer studies in Bernese Mountain dogs. But the ENCODE project has explored many cell lines–including stem cells–that might offer you insights into the system you use in your lab. Or maybe it will offer you leads on cell lines you might want to use–or to avoid. I just wanted to illustrate that there might be nuggets in there–stuff that does not become the “compelling example” in the papers, which can only briefly touch on some aspects of the Big Data work. Here I hope I showed you something you didn’t know from traditional publications. And maybe you didn’t know this was available from ENCODE project teams.

And my other point–that I keep harping on, I know–you have to use the public databases to get this information. It may be in the repositories of the big data projects well before publication (but may be embargoed, do keep that in mind). It may never appear in a publication in the form you might expect it. But the data is there for you to mine. Today.

Free tutorials on how to explore ENCODE data:

ENCODE Foundations:

ENCODE in the UCSC Genome Browser part II:


Gursky, S., Olopade, O.I. & Rowley, J.D. (2001). Identification of a 1.2 Kb cDNA fragment from a region on 9p21 commonly deleted in multiple tumor types, Cancer Genetics and Cytogenetics, 129 (2) 101. DOI: 10.1016/S0165-4608(01)00444-7

Shearin, A.L., Hedan, B., Cadieu, E., Erich, S.A., Schmidt, E.V., Faden, D.L., Cullen, J., Abadie, J., Kwon, E.M., Grone, A. et al. (2012). The MTAP-CDKN2A Locus Confers Susceptibility to a Naturally Occurring Canine Cancer, Cancer Epidemiology Biomarkers & Prevention, 21 (7) 1027. DOI: 10.1158/1055-9965.EPI-12-0190-T

The ENCODE Project Consortium (2011). A User’s Guide to the Encyclopedia of DNA Elements (ENCODE), PLoS Biol, 9 (4). DOI: 10.1371/journal.pbio.1001046

Hudson (Chairperson), T.J., Anderson, W., Aretz, A., Barker, A.D., Bell, C., Bernabé, R.R., Bhan, M.K., Calvo, F., Eerola, I., Gerhard, D.S. et al. (2010). International network of cancer genome projects, Nature, 464 (7291) 998. DOI: 10.1038/nature08987