Mining the “big data” is…fascinating. And necessary.

When we have workshops coming up, I spend some time tooling around in the big data to see if there have been changes since the last time I talked about it, updating the slides if necessary, and sometimes forming a hypothesis and testing it. (PS: we’re at Baylor next, if anyone is looking for a workshop there.) On Friday I totally lost myself in a query that began at UCSC in the ENCODE data and ended up in the ICGC BioMart. And wow. Do I wish I had a lab some days….

One of the comments at our last workshop was that the ENCODE data on cell lines is not the same as looking at tissues. And I totally agree with that–but the mouse ENCODE data is going to help provide that sort of data. As someone who spent a lot of time culturing cells in the past, though, I’m interested to know how different cell lines are from the “reference” genome complement. And there’s one specific part of the human ENCODE project that’s looking at this: the Common Cell CNV track.

Here’s what I did: a Table Browser query to look for the types of structural variation turning up in the three cell lines that have been examined: GM12878, HepG2, and K562. I wondered to myself: how many of these CNVs overlap with known genes? And what types of variation are there? Here’s a sample of how I structured that query for one of the cell lines:
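The query itself was built in the Table Browser’s point-and-click interface, but the intersection it performs (CNV segments from one cell line, overlapped against known genes) can be sketched in plain Python. Everything below is made up for illustration: the coordinates, the gene names, and the CNV calls are placeholders, not real track data.

```python
# Hedged sketch of the gene-vs-CNV intersection a Table Browser query does.
# All coordinates, calls, and gene symbols here are invented placeholders.

# (chrom, start, end, call) -- call is the CNV state a track might report
cnvs = [
    ("chr1", 1000, 5000, "amplification"),
    ("chr1", 9000, 12000, "homozygous deletion"),
    ("chr2", 500, 2000, "normal"),
]

# (chrom, start, end, symbol) -- hypothetical gene coordinates
genes = [
    ("chr1", 4000, 7000, "GENE_A"),
    ("chr1", 10000, 11000, "GENE_B"),
    ("chr2", 3000, 4000, "GENE_C"),
]

def overlaps(a_start, a_end, b_start, b_end):
    """Half-open interval overlap test, BED-style coordinates."""
    return a_start < b_end and b_start < a_end

# Which genes fall inside a CNV segment, and what kind of segment it is
hits = [
    (symbol, call)
    for chrom, g_start, g_end, symbol in genes
    for c_chrom, c_start, c_end, call in cnvs
    if chrom == c_chrom and overlaps(g_start, g_end, c_start, c_end)
]

print(hits)  # -> [('GENE_A', 'amplification'), ('GENE_B', 'homozygous deletion')]
```

The Table Browser does all of this for you server-side; the sketch just shows what “intersect this track with known genes” means underneath.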

This query yields normal sections, amplifications, and deletions–and some deletions are homozygous while others are heterozygous. One of the points I make in the ENCODE workshop is that if I were using a cell line I’d be curious to know these sorts of things about it–I wish someone would do HeLa and the other big cell lines out there too. (Probably someone is, but I don’t know about the data. If someone has it, give me a holler.)

So I was working around these variations, and I got curious about one particular region in one of the cell lines. It took out a region with some rather important-looking genes. I went to the literature and found that this region is known to be a problem in some cancers.

I went to look at the ICGC data to see if anything interesting was turning up with these genes. And wow–whadda ya know: there’s not a ton of data in that data set yet, but I found a striking correspondence between some of the data already in there from real tumors and what I saw in the cell line. It’s too early to draw conclusions. It’s hard to know in these big data projects what you *aren’t* seeing: how much is already in there, how much isn’t, and so on. But I checked a bunch of other genes, and none showed the sort of pattern I was seeing.

Because of the ICGC usage policy, I don’t think I can speak specifically about what I saw. But it was very curious. If I had a lab I would have put a student on it this morning ;)

And my point is this: the data is not in the papers anymore. It’s in the databases. And you need to be mining it–these big data projects are handing you the pick-axes and pointing you to the mines.


What you’ll need to do what I did:

1. A grasp of the UCSC Genome Browser functions and the ENCODE data. Check out our tutorials on both; they’re freely available, sponsored by UCSC and the UCSC ENCODE team.

2. BioMart: we have a tutorial on this, but it is in our subscription package.
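BioMart interfaces build an XML query for you behind the scenes, and that XML is also how you’d query a mart programmatically. Here’s a minimal sketch of that query format; the dataset, filter, and attribute names below are placeholders, not real ICGC BioMart identifiers, so you’d look up the real ones in the mart you’re querying.

```python
# Hedged sketch: the standard BioMart XML query shape, with placeholder
# dataset/filter/attribute names (not real ICGC BioMart identifiers).
import xml.etree.ElementTree as ET

query = ET.Element("Query", {
    "virtualSchemaName": "default",
    "formatter": "TSV",
    "header": "0",
    "uniqueRows": "1",
})
dataset = ET.SubElement(query, "Dataset",
                        {"name": "example_dataset", "interface": "default"})
# Restrict to one gene of interest (placeholder filter name and value)
ET.SubElement(dataset, "Filter", {"name": "gene_symbol", "value": "GENE_A"})
# Columns to return (placeholder attribute names)
ET.SubElement(dataset, "Attribute", {"name": "mutation_type"})
ET.SubElement(dataset, "Attribute", {"name": "donor_id"})

xml_query = ET.tostring(query, encoding="unicode")
print(xml_query)
# This string is what gets sent to a mart's web-service endpoint;
# the web interface just constructs it for you from the form fields.
```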

What you don’t need: current literature. It’s not in the papers, and may never be. The “big data” stuff is in the databases, and only small amounts can really be published in the traditional way.
