Tag Archives: biomart

Video Tip of the Week: TargetMine, Data Warehouse for Drug Discovery

Browsing around genomic regions, layering on lots of associated data, and beginning to explore new data types I might come across are things that really fire up my brain. For me, visualization is key to forming new ideas about the relationships between genomic features and patterns of data. But frequently I want to take this to the next step: asking where else these patterns appear, how many other instances of this situation there are in a data set, and maybe adding complexity to the problem to refine the quest. This is not always easy to do with primarily visual software tools. This is when I turn to tools like the UCSC Table Browser, BioMart, and InterMine to handle some list of genes, or regions, or features.

We’ve touched on all of these before, sometimes with full tutorial suites (UCSC, BioMart), and sometimes as a Tip of the Week (InterMine, and InterMine for complex queries). Learning about the foundations of these tools will let you use various versions or flavors of them at other sites. I love to see tools that are re-used for different topics when that’s possible, rather than building a whole new system. There are modENCODE, rat, and yeast mines, and more. This week’s tip is about one of those others: TargetMine is built on the InterMine foundation, with a specific focus on prioritizing candidate genes for pharmaceutical interventions. From their site overview, I’ll add this description they use:

TargetMine is an integrated data warehouse system which has been primarily developed for the purpose of target prioritisation and early stage drug discovery.

For more details about their framework and philosophy, you should see their papers (linked below). The earlier one sets out the rationale, the data types, and the data sources they are incorporating. They also establish their place in the ecosystem of other databases in this arena, which helps you to understand their role. But you should see the next paper for a really good grasp of how their candidate prioritization works with the “Integrated Pathway Clusters” concept they’ve added. They combined data from KEGG, Reactome, and NCI’s PID collections to enhance the features of their data warehouse system.

This week’s Video Tip of the Week highlights one of the tutorial movies that the TargetMine team provides. There’s no spoken audio with it, but the captions that help you to understand what’s going on are in English. I followed along on a browser with their example–they have a sample list to simply click on, and you can see various enrichments of the sets–pathways, Gene Ontology, Disease Ontology, InterPro, CATH, and compounds. They call these the “biological themes” and I find them really useful. You can create new lists from these theme collections. They also illustrate the “template” option–pre-defined queries with typical features people may wish to search. The example shows how to go from the list of genes you had to pathways–but there are other templates as well.
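For a rough feel of what a list enrichment like this measures, here is a minimal Python sketch of a hypergeometric over-representation test, which is a common way to score gene-list enrichment. I’m assuming this style of test purely for illustration: the function name and all of the numbers are hypothetical, and TargetMine’s actual statistics (including any multiple-testing correction) are described in their papers.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a list.

    population: number of genes in the background set
    annotated:  background genes that carry the theme (e.g. one pathway)
    sample:     size of the gene list being tested
    hits:       genes in the list that carry the theme
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail P(X >= hits) of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 of them in a pathway,
# and a list of 50 genes in which 5 belong to that pathway.
p_value = hypergeom_enrichment_p(20000, 200, 50, 5)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value here only says the theme shows up in the list more often than chance would suggest; with many themes tested at once, a correction for multiple testing would normally follow.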

Another section of the video has an example of a custom query with the Query Builder. They ask for structural information for proteins targeted by acetaminophen. It’s a nice example of how to go from a compound to protein structure–a question I’ve seen come up before in discussion threads.

In their more recent paper (also below), they have some case studies that illustrate the concepts of prioritizing targets for different disease situations with their system. They also expand on the functions with additional software to explore the pathways: http://targetmine.mizuguchilab.org/pathclust/

So have a look at the features of TargetMine for prioritization of candidate genes. I think the numerous “themes” are a really useful way to assess lists of genes (or whatever you are starting with).

Quick Links:

TargetMine: http://targetmine.mizuguchilab.org/ [note: their domain name has changed since the publications, this is the one that will persist.]

InterMine: http://intermine.github.io/intermine.org/


Chen, Y., Tripathi, L., & Mizuguchi, K. (2011). TargetMine, an Integrated Data Warehouse for Candidate Gene Prioritisation and Target Discovery. PLoS ONE, 6 (3). DOI: 10.1371/journal.pone.0017844

Chen, Y., Tripathi, L., Dessailly, B., Nyström-Persson, J., Ahmad, S., & Mizuguchi, K. (2014). Integrated Pathway Clusters with Coherent Biological Themes for Target Prioritisation. PLoS ONE, 9 (6). DOI: 10.1371/journal.pone.0099030

Kalderimis, A., Lyne, R., Butano, D., Contrino, S., Lyne, M., Heimbach, J., Hu, F., Smith, R., Stěpán, R., Sullivan, J., & Micklem, G. (2014). InterMine: extensive web services for modern biology. Nucleic Acids Research, 42 (W1), W468-W472. DOI: 10.1093/nar/gku301

Video Tip of the Week: InterMine for complex queries

We’ve been fans of InterMine for a long time. We did a tip-of-the-week a while ago that highlighted ways that this software can be used to mine from big data projects of many types. The generic framework of InterMine can be customized for use at different projects. Today I’ll include videos from the FlyMine installation and the YeastMine flavor, but you may find versions of this handy tool in many other places as well.

The first video is a broader overview of different types of things you can do–and although this is FlyMine, you’ll find similar behavior at the other Mines too.

This next video is more specific about a task that people need to accomplish–working with a list of genes. This example was recently produced by the YeastMine folks, but again this should work in a similar way across other Mines. You should also read the SGD blog post on it–Create, Analyze, Save: the Power of Gene Lists in YeastMine.

The other thing that I noticed about this framework is the effort of several of these model organism Mines to coordinate into this InterMOD structure. Although I am often wary of “one search to rule them all” sorts of efforts, there can be value in this as a central organizing principle as we keep adding more species genomes that may not have as well-developed communities and infrastructure to support them.

I certainly use a lot of query tools that are similar to these, like the UCSC Table Browser and BioMart. UniProt offers ways to build queries that are different but conceptually similar. Using these interfaces you can construct some clever and complex ways to extract information out of data repositories.

Quick links:

InterMine: http://intermine.github.io/intermine.org/

FlyMine: http://www.flymine.org/

YeastMine: http://yeastmine.yeastgenome.org/

InterMOD: http://intermod.intermine.org


Smith, R.N., Aleksic, J., Butano, D., Carr, A., Contrino, S., Hu, F., Lyne, M., Lyne, R., Kalderimis, A., Rutherford, K., et al. (2012). InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics (Oxford, England).

Lyne, R., Smith, R., Rutherford, K., Wakeling, M., Varley, A., Guillier, F., Janssens, H., Ji, W., Mclaren, P., North, P., et al. (2007). FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biology, 8 (7). DOI: 10.1186/gb-2007-8-7-r129

Balakrishnan, R., Park, J., Karra, K., Hitz, B.C., Binkley, G., Hong, E.L., Sullivan, J., Micklem, G., & Cherry, J.M. (2012). YeastMine–an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit. Database: the journal of biological databases and curation.

Sullivan, J., Karra, K., Moxon, S.A.T., Vallejos, A., Motenko, H., Wong, J.D., Aleksic, J., Balakrishnan, R., Binkley, G., Harris, T., et al. (2013). InterMOD: integrated data and tools for the unification of model organism research. Scientific Reports, 3 (1802).

Video Tip of the Week: ICGC portal for cancer genomics

A question at Biostar about cancer “gene sets” recently got me looking at one of my favorite data sources again–the ICGC, International Cancer Genome Consortium, and their data portal. Previous posts we’ve done were based on their legacy portal (which is still available on their site). They changed things up a bit with a release last fall, and I hadn’t covered those changes yet.

Conveniently, they have done a short video explaining how to access the data that they offer. They’ve continued to add new data, and to refine the software. You should check it out.

ICGC Data Portal Tutorial from ICGC on Vimeo.

In the past I found some really useful info to compare with a lung cancer cell line I had been examining. I saw the same mutation in actual tumor samples as had been found in this cell line years back. But there have also been publications recently that talk in more detail about the project and some interesting outcomes from data that’s been found there (linked below).

You really need to be mining these projects for data if they cover your research area. There’s a lot to learn that hasn’t been published yet–just be sure to read up on their usage policies before you deliver your great discoveries to the journals!

Quick links:

Data portal: http://dcc.icgc.org/

Project homepage: http://icgc.org/


Hudson, T.J. (Chairperson), Anderson, W., Aretz, A., Barker, A.D., Bell, C., Bernabé, R.R., Bhan, M.K., Calvo, F., Eerola, I., Gerhard, D.S., & many others in a large consortium (2010). International network of cancer genome projects. Nature, 464 (7291), 993-998.

Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Aparicio, S.A.J.R., Behjati, S., Biankin, A.V., Bignell, G.R., Bolli, N., Borg, A., Børresen-Dale, A.L., & many others in a large consortium (2013). Signatures of mutational processes in human cancer. Nature, 500 (7463), 415-421.

Gonzalez-Perez, A., Mustonen, V., Reva, B., Ritchie, G.R.S., Creixell, P., Karchin, R., Vazquez, M., Fink, J.L., Kassahn, K.S., Pearson, J.V., & many others in a large consortium (2013). Computational approaches to identify functional genetic variants in cancer genomes. Nature Methods, 10 (8), 723-729.

What’s the Answer? (Gene ID conversion)

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question:

What is a good “gene ID conversion tool”?

This is an older question, from 2 years ago, but it is still relevant, and the answers are still quite helpful and full of resources such as DAVID, BioDBnet, BioMart and others.

Check it out. You might also want to check out the third exercise of our UCSC Advanced Tutorial. The exercise:

“From a list of UCSC genes, add gene symbols and GO IDs for additional information about the gene set. Bonus step: add GO terms.”

It walks through how you might be able to do this with the UCSC Table Browser with some simple modifications.
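Conceptually, that exercise is a pair of table joins: gene ID to symbol, then symbol to GO IDs. Here is a minimal Python sketch of the idea. It is not how the Table Browser works internally, and every identifier, symbol and GO ID in it is a made-up placeholder standing in for tables you would export yourself.

```python
def annotate_gene_list(gene_ids, id_to_symbol, symbol_to_go):
    """Attach a gene symbol and GO IDs to each ID in a list.

    Unknown IDs are kept in the output and marked "NA", so that nothing
    silently drops out of the list during conversion.
    """
    rows = []
    for gene_id in gene_ids:
        symbol = id_to_symbol.get(gene_id)
        go_ids = sorted(symbol_to_go.get(symbol, [])) if symbol else []
        rows.append((gene_id, symbol or "NA", ";".join(go_ids) or "NA"))
    return rows


# Hypothetical lookup tables (placeholders, not real annotations).
id_to_symbol = {"id_0001": "GENE1", "id_0002": "GENE2"}
symbol_to_go = {"GENE1": {"GO:0000002", "GO:0000001"}}

annotated_rows = annotate_gene_list(
    ["id_0001", "id_0002", "id_0003"], id_to_symbol, symbol_to_go
)
for row in annotated_rows:
    print("\t".join(row))
```

Keeping the unmatched IDs visible is a deliberate choice: with any ID conversion tool, the entries that fail to map are often the ones you most need to look at.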

Video Tips of the Week: Annual Review IV, 2nd half

As you may know, we’ve been doing these video tips-of-the-week for FOUR years now. Through the end of last year, 2011 (yep, it’s 2012 now), we have completed around 200 little tidbit introductions to various resources. At the end of each year we’ve established a sort of holiday tradition: a summary post to collect them all. If you have missed any of them, it’s a great way to have a quick look at what might be useful to your work.

You can see past years’ tips here: 2008 I, 2008 II, 2009 I, 2009 II, 2010 I, 2010 II. The summary of the first half of 2011 is available from last week.

July 2011

July 6: Prioritizing genes using the Gene Prioritization Portal

July 13: PolySearch, searching many databases at once

July 20: Human Epigenomics Visualization Hub

July 27: The new SIB Bioinformatics Resource Portal


August 2011

August 3: SNPexp, correlation between SNPs and gene expression 

August 10: CompaGB for comparing genome browser software

August 17: CoGe, comparing genomes revisited

August 24: Domain Draw for quick motif diagrams

August 31: From UniProt to the PSI SBKB and back again


September 2011

September 7: Plant comparative genomics using Plaza

September 14: phiGENOME for bacteriophage genome exploration

September 21: Getting flanking sequences of genomic locations

September 28: Introduction to R statistical software 


October 2011

October 5: VnD resource for genetic variation and drug information

October 12: Track Hubs in UCSC Genome Browser

October 19: Mitochondrial Transcriptome GBrowser 

October 26: Variation data from Ensembl


November 2011

November 2: MizBee Synteny Browser

November 9: The new database of genomic variants: DGV2

November 16: MapMi, automated mapping of microRNA loci

November 23: BioMart’s new central portal

November 30: Phosida, a post-translational modification database

December 2011

December 7: VarSifter, for identifying key sequence variations

December 14: Big changes to NCBI’s genome resources

December 21: eggNOG for the Holidays (or to explore orthologous genes)

December 28: Video Tips of the Week: Annual Review IV (first half of 2011)

Video Tip of the Week: BioMart’s new central portal

BioMart is widely used open-source data management software, with an interface that enables end-users to generate complex and customized queries across many types and sources of biological data. It’s part of the GMOD tool kit, and many project teams that have big data have chosen the BioMart software to organize and make their data available to you.

We’ve been fans of BioMart for years. It was one of the earliest software tools we described, as it was integrated into many of the sites that we covered–such as Ensembl. Eventually we broke it out into its own tutorial suite, though, as there are now dozens of groups that have built Marts of their own. Although the skin may change and the data sets that are available will vary at different sites, the underlying software features are the same. Learning to use the main BioMart portal will help you to use all of them. Until recently the list of data providers that used BioMart was on the homepage, but here’s a taste of that list from my slides:

In this video tip I’ll introduce the newly re-designed BioMart main site, and touch on some of the other versions of BioMart that you should get to know. We’ll be updating our tutorial suite with the new look soon, but most of the software functionality is the same as we’ve covered otherwise (available by subscription).

There are two main versions of BioMart circulating right now. The v 0.7 is the one that will probably be most familiar to people who have encountered BioMart at any of the genomics sites that have installations right now. But there’s a new and re-designed v 0.8 that is under development. It’s the one that’s used at the International Cancer Genome Consortium (ICGC.org) and there’s also a 0.8 central BioMart portal available to try out. Eventually this may replace many of the 0.7 setups, but this depends on the site. Some may persist with 0.7 for a while rather than updating. So it’s probably wise to have an idea of how to use both of them at this time.

One of the features of the new BioMart interface that’s already got bioinformatics folks talking is the ID converter. This is a common problem in the field, and Steven Turner thought this was a nice aspect of the facelift: BioMart Gene ID converter.
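If you would rather script an ID conversion than click through it, BioMart installations of the 0.7 style have, as far as I know, accepted an XML query document through a "martservice" endpoint. The sketch below only builds such a query and a request URL; it does not contact any server. The endpoint URL is a placeholder, and the dataset, filter and attribute names are assumptions that vary by installation and release, so check them against the Mart you actually use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_biomart_query(dataset, filter_name, ids, attributes):
    """Build a BioMart-style XML query that looks up attributes for a list of IDs."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One filter carrying the comma-separated input IDs.
    ET.SubElement(ds, "Filter", {"name": filter_name, "value": ",".join(ids)})
    # One Attribute element per output column.
    for attribute in attributes:
        ET.SubElement(ds, "Attribute", {"name": attribute})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


# Assumed names: an Ensembl-style human gene dataset, filtering on Ensembl
# gene IDs and returning the matching gene symbols.
xml_text = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filter_name="ensembl_gene_id",
    ids=["ENSG00000139618", "ENSG00000141510"],
    attributes=["ensembl_gene_id", "hgnc_symbol"],
)

# Placeholder endpoint; substitute the martservice URL of a real installation.
request_url = "https://example.org/biomart/martservice?" + urlencode({"query": xml_text})
print(request_url)
```

Building the XML with a library rather than by string pasting keeps the quoting right when an ID list gets long or contains odd characters.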

I also wanted to note that BioMart is one of the tools that you can use at Galaxy to access large swaths of data for further analysis. At Galaxy, open the “Get Data” menu to see that BioMart is one of your options.

There was also a lot of buzz about BioMart last week when a “Virtual Issue” of the journal Database was released that had not only an overview article about BioMart as a whole, but also several articles on the resources that use BioMart for their management and query interfaces. So you can see how widely useful this software is, among many different types of data providers. You can use the local installations of BioMart at a provider’s site, or you can use the main site to query from any of these sources as well, and more powerfully you can query across databases too.

Quick links:

BioMart main site: http://www.biomart.org/

BioMart new-style Central Portal: http://central.biomart.org/

BioMart pages at GMOD: http://gmod.org/wiki/BioMart

Virtual Issue of Database on BioMart: http://www.oxfordjournals.org/our_journals/databa/biomart_virtual_issue.html


Kasprzyk, A. (2011). BioMart: driving a paradigm change in biological data management. Database, 2011. DOI: 10.1093/database/bar049

Zhang, J., Haider, S., Baran, J., Cros, A., Guberman, J., Hsu, J., Liang, Y., Yao, L., & Kasprzyk, A. (2011). BioMart: a data federation framework for large collaborative projects. Database, 2011. DOI: 10.1093/database/bar038

Guberman, J., Ai, J., Arnaiz, O., Baran, J., Blake, A., Baldock, R., Chelala, C., Croft, D., Cros, A., Cutts, R., Di Genova, A., Forbes, S., Fujisawa, T., Gadaleta, E., Goodstein, D., Gundem, G., Haggarty, B., Haider, S., Hall, M., Harris, T., Haw, R., Hu, S., Hubbard, S., Hsu, J., Iyer, V., Jones, P., Katayama, T., Kinsella, R., Kong, L., Lawson, D., Liang, Y., Lopez-Bigas, N., Luo, J., Lush, M., Mason, J., Moreews, F., Ndegwa, N., Oakley, D., Perez-Llamas, C., Primig, M., Rivkin, E., Rosanoff, S., Shepherd, R., Simon, R., Skarnes, B., Smedley, D., Sperling, L., Spooner, W., Stevenson, P., Stone, K., Teague, J., Wang, J., Wang, J., Whitty, B., Wong, D., Wong-Erasmus, M., Yao, L., Youens-Clark, K., Yung, C., Zhang, J., & Kasprzyk, A. (2011). BioMart Central Portal: an open database network for the biological community. Database, 2011. DOI: 10.1093/database/bar041

Haider, S., Ballester, B., Smedley, D., Zhang, J., Rice, P., & Kasprzyk, A. (2009). BioMart Central Portal–unified access to biological data. Nucleic Acids Research, 37 (Web Server). DOI: 10.1093/nar/gkp265

World tour of workshops, recent stop: Morocco, Africa

Trainers & organizers

Last year I had the opportunity to give a workshop in Ifrane, Morocco (UCSC Genome and Table browsers, Galaxy) at Al Akhawayn University. This year, Mary and I returned for a longer 3-day workshop at University Hassan II in Mohammadia. OpenHelix was a co-sponsor of the workshop (donating our time, materials and expertise). The workshop covered a plethora of topics, from a world tour of resources (tutorial-free), an introductory UCSC Genome Browser session (tutorial-free) and ENCODE (tutorial-free) to genome variation analysis in dbSNP (tutorial-subscription) and analysis using Galaxy (tutorial-subscription). You can see the full list of topics in the Mohammadia Workshop Schedule (pdf).

As last year, we were impressed with the students (there were 117 total, about 50/50 gender ratio). English is their 3rd or 4th language in most cases, Moroccan Arabic, French or various African languages being their language of choice. Yet, they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.

The workshop students

We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.

Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some more answers from our readers:

* One student was looking for wheat genome resources for designing primers. The wheat genome is as yet incomplete, but there are some resources to get started:
Wheat Genome Sequencing Consortium
Gramene’s wheat resources
Wheat Genetic and Genomic Resource Center @ Kansas State
Perhaps also COGE for conserved sequences
edited to add:
CerealsDB and
James’ post on the wheat draft sequence might give some insight into that huge genome.
* Another student asked about dotplot tools:
Galaxy offers a large collection of EMBOSS tools including dotplot analysis, as does the EBI EMBOSS tool

* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space for a dynamic programming solution: for more than a few sequences it is too computationally intensive. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProps and this list at Wikipedia.

Do our readers have any other guidance on this?
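To make the multiple alignment point concrete, here is a small Python sketch. The first function is a plain Needleman-Wunsch global alignment score for two sequences, which is the classic dynamic programming case. The second simply counts the cells a full dynamic programming table would need for N sequences of equal length, showing how fast the exact approach blows up. The scoring values are arbitrary choices of mine for illustration.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences (Needleman-Wunsch)."""
    # prev holds the previous row of the DP table; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dp_cells(length, n_sequences):
    """Cells in a full DP table for n_sequences sequences of the given length."""
    return (length + 1) ** n_sequences


print(nw_score("AC", "AGC"))
for n in (2, 3, 5, 10):
    print(n, "sequences of length 100:", dp_cells(100, n), "cells")
```

Two sequences of length 100 need about ten thousand cells, but ten such sequences would need on the order of 10^20, before even counting the work per cell. That growth is why practical multiple alignment tools rely on heuristics.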

Teaching moment

* Another student asked  if we know how to find DC-area internships in biological sciences. Another student (mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?

If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!


And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech and later my family and I toured for 8 days visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.

Gates and doors of Fes are beautiful

camel excursion to the Sahara





International Cancer Genome Consortium; interview with Tom Hudson

We’ve talked about the International Cancer Genome Consortium (ICGC) before a number of times, and we had a Tip of the Week on the project and database last year. It may be time for a new tip because their site and software have changed. One of the very cool aspects of the data access is that they are using the BioMart query tool for the interface, but it is the v0.8 cutting-edge style of BioMart that has some nice new features.

Anyway, I saw a tweet this morning about an interview with one of the principals of the ICGC, Tom Hudson. It’s a nice interview that talks about the project, the progress, and more. If you haven’t been following the ICGC’s work you might use this interview as a nice entry point to that. And then check out the data–and the BioMart interface that’s available at the site.

Interview (and hat tip to the tweeter that pointed me there):

RT @ResearchMedia: Dr Thomas Hudson of the ICGC Secretariat outlines the benefit of working as a consortium in the fight against #cancer http://t.co/CqM1UQm

Visit the ICGC: http://www.icgc.org/ and click on the Data Portal to start looking at the data that’s flowing in now.


Tip of the Week: InterMine for mining “big data”

Integrating large data sets for queries within–and across–various collections is one of the arenas that has lately been pretty active in bioinformatics. As more and more “big data” projects yield huge numbers of data points and data types, this is only becoming more necessary.  I love to browse data, but there are times when a large-scale customized query is what you’ll want to make some broader discoveries.

Right now there are a number of resources and interfaces that I turn to for structured and customized queries of data collections. The UCSC Table Browser, BioMart, Galaxy–these are the ones I have my hands on almost continuously. But there is another warehouse and interface system that we’re seeing more and more: InterMine.

My first real encounter with InterMine was for the modENCODE data. There’s some really terrific data flowing out of that project now (I talked a bit about that recently here), and the interface and storage system they are using is InterMine.

FlyMine was the initial impetus for the “Mine” system. Some years back FlyMine was created as a warehouse and query system for the increasing amounts of fly data that was coming from various projects. The goal was to have a system powerful enough for bioinformatics + super users, but also a friendly yet powerful interface for bench biologists to use.

The initial paper described the basic components of the user interface, which has 3 primary parts: a Quick Search that’s great for browsing; a Template library that lets users access some pre-defined standard or likely query types that they can tweak for their needs; and a fully customizable Query Builder for the most advanced access. Since this paper, development has continued, and there are other new and cool features present as well.

Another big goal of the FlyMine effort was to be able to deal with lists. One of the most common questions we still get in workshops is: “I have a list of _____.  What’s the best way to deal with that?” FlyMine–and the InterMines in general–help people to query and manage their explorations with lists of stuff.
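The kinds of list handling I mean are mostly set operations: combine two lists, find what they share, or subtract one from the other. Here is a minimal Python sketch of those operations. The function and key names are my own and the gene names are placeholders; this shows the idea, not the InterMine implementation.

```python
def compare_lists(list_a, list_b):
    """Basic set operations for comparing two gene lists."""
    a, b = set(list_a), set(list_b)
    return {
        "union": a | b,          # everything in either list
        "intersection": a & b,   # only what both lists share
        "a_minus_b": a - b,      # in the first list but not the second
        "b_minus_a": b - a,      # in the second list but not the first
    }


# Placeholder gene names, e.g. hits from two different experiments.
screen_hits = ["geneA", "geneB", "geneC"]
pathway_members = ["geneB", "geneC", "geneD"]

comparison = compare_lists(screen_hits, pathway_members)
for name, members in comparison.items():
    print(name, sorted(members))
```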

The MyMine feature of the InterMines is also a nice component. You can create a login and store things you want to have repeated access to: queries, lists, etc.

There are other people using InterMine for their systems too–a recent paper on TargetMine, for “Gene Prioritization and Target Discovery” is available, and might appear as an upcoming tip! Jennifer did a tip on YeastMine from SGD once as well.

But what triggered me to do this tip is that a letter came from the RGD mailing list last week that said this:

Effective Friday, May 20th, 2011 the MCW BioMart tool will be retired by RGD and the MCW Proteomics Center.  For mining rat data, we have found that the RatMIne tool is easier to use, more flexible and incorporates more types of data than BioMart.  In addition, RatMine includes analysis tools not found in BioMart, giving RatMine users a single, intuitive interface for both obtaining and analyzing data.

So they are moving fully to InterMine and retiring the Rat BioMart, exclusively using RatMine at their installation. So this tip of the week will explore InterMine, RatMine, and some other Mines. That’s a lot of ground to cover–but it’s probably worth your time to know about InterMine as it becomes more broadly available.  It’s also important to understand how to query with the Mines if you want to bring the data to Galaxy for further analysis. If you visit Galaxy you’ll see that their “Get Data” section lets you access Mine tools–but you still need to know how to do the basic queries at the host site first.

Although this tip will touch on RatMine, the focus is the more general InterMine suite. RGD also said this in their notice:

For an overview of RatMine and how to use it, go to the RGD tutorial video, “An Introduction to the RatMine Database”, at http://rgd.mcw.edu/wg/home/rgd_rat_community_videos/an-introduction-to-the-ratmine-database2.  Alternatively, follow the “self-guided tour” of RatMine by clicking the “Take a tour” link at the top of any RatMine page.

To try out RatMine for yourself, go to http://ratmine.mcw.edu/ and get started with simplified data mining and analysis.

So if you want to have more specific information about using RatMine, be sure to check out their introduction.

Quick Links:

InterMine: http://intermine.org/

RatMine: http://ratmine.mcw.edu/

modENCODE: http://www.modencode.org/

Galaxy: http://usegalaxy.org/

Lyne, R., Smith, R., Rutherford, K., Wakeling, M., Varley, A., Guillier, F., Janssens, H., Ji, W., Mclaren, P., North, P., Rana, D., Riley, T., Sullivan, J., Watkins, X., Woodbridge, M., Lilley, K., Russell, S., Ashburner, M., Mizuguchi, K., & Micklem, G. (2007). FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biology, 8 (7). DOI: 10.1186/gb-2007-8-7-r129

Mining the “big data” is…fascinating. And necessary.

When we have workshops coming up, I spend some time tooling around in the big data to see if there have been changes since the last time I talked about it, updating the slides if necessary, and sometimes forming a hypothesis and testing it. (PS: we’re at Baylor next, if anyone is looking for a workshop there.) On Friday I totally lost myself in a query that began at UCSC in the ENCODE data, and ended up in the ICGC BioMart. And wow. Do I wish I had a lab some days….

One of the comments at our last workshop was that the ENCODE data on cell lines is not the same as looking at tissues. And I totally agree with that, but the mouse ENCODE data is going to help get that sort of data. But as someone who spent a lot of time culturing cells in the past, I am interested to know how different cell lines are from the “reference” genome complement. And there’s one specific part of the human ENCODE project that’s looking at this: Common Cell CNV track.

Here’s what I did: a Table Browser query to look for the types of structural variations that were coming up in the 3 cell lines that have been examined: GM12878, HepG2, and K562. I wondered to myself: how many of these CNVs overlap with known genes? And what types of variations are there? Here’s a sample of how I structured that query for one of the cell lines:

This query yields normal sections, amplifications, deletions–and some deletions are homozygous and some are heterozygous. One of the points I make in the ENCODE workshop is that if I was using a cell line I’d be curious to know these sorts of things about it–I wish someone would do HeLa and the other big cell lines out there too. (Probably someone is, but I don’t know about the data. If someone has it, give me a holler.)
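Under the hood, the question "how many of these CNVs overlap known genes, and of what type?" is an interval intersection followed by a tally. Here is a minimal Python sketch of that logic. It is not the Table Browser’s own code: the coordinates, gene names and CNV type labels are invented, and I’m assuming BED-style 0-based, half-open intervals.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def count_cnv_types_overlapping_genes(cnvs, genes):
    """Tally CNV calls by type, counting only calls that overlap a gene.

    cnvs:  iterable of (chrom, start, end, cnv_type)
    genes: iterable of (chrom, start, end, name)
    """
    counts = Counter()
    for chrom, start, end, cnv_type in cnvs:
        if any(
            g_chrom == chrom and overlaps(start, end, g_start, g_end)
            for g_chrom, g_start, g_end, _name in genes
        ):
            counts[cnv_type] += 1
    return counts


# Invented example data for one cell line.
cnvs = [
    ("chr1", 100, 200, "amplified"),
    ("chr1", 500, 600, "heterozygous_deletion"),
    ("chr2", 100, 200, "homozygous_deletion"),
    ("chr1", 250, 300, "normal"),
]
genes = [("chr1", 150, 400, "geneA"), ("chr2", 300, 400, "geneB")]

cnv_counts = count_cnv_types_overlapping_genes(cnvs, genes)
print(dict(cnv_counts))
```

The nested scan is fine for a few thousand regions; for genome-scale tables, a sorted sweep or an interval tree would be the usual upgrade.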

So I’m working around these variations, and I got curious about one particular region in one of the cell lines. It took out a region with some rather important-looking genes. I went to the literature to find that this region is known to be a problem in some cancers.

I went to look at the ICGC data to see if anything interesting was turning up with these genes. And wow–whadda ya know: there’s not a ton of data in that data set yet, but I found a significant correspondence between some of the data already in there from real tumors and what I found in the cell line. It’s too early for conclusions about that. It’s hard to know in these big data projects what you *aren’t* seeing, how much is already in there, how much isn’t, etc. But I checked a bunch of other genes and none showed this sort of pattern I was seeing.

Because of the ICGC usage policy, I don’t think I can speak specifically about what I saw. But it was very curious. If I had a lab I would have put a student on it this morning ;)

And my point is this: the data is not in the papers anymore. It’s in the databases. And you need to be mining it–these big data projects are handing you the pick-axes and pointing you to the mines.


What you need to do what I did:

1. A grasp of the UCSC functions and the ENCODE data. Check out our tutorials on those that are freely available as they are sponsored by UCSC and the ENCODE team at UCSC.

2. BioMart: we have a tutorial on this, but it is in our subscription package.

What you don’t need: current literature. It’s not in the papers, and may never be. The “big data” stuff is in the databases, and only small amounts can really be published in the traditional way.