Tag Archives: ICGC

Video Tip of the Week: ICGC portal for cancer genomics

A question at Biostar about cancer “gene sets” recently got me looking at one of my favorite data sources again–the ICGC, International Cancer Genome Consortium, and their data portal. Previous posts we’ve done were based on their legacy portal (which is still available on their site). They changed things up a bit with a release last fall, and I hadn’t covered those changes yet.

Conveniently, they have done a short video explaining how to access the data that they offer. They’ve continued to add new data, and to refine the software. You should check it out.

ICGC Data Portal Tutorial from ICGC on Vimeo.

In the past I found some really useful info to compare with a lung cancer cell line I had been examining. I saw the same mutation in actual tumor samples as had been found in this cell line years back. But there have also been publications recently that talk in more detail about the project and some interesting outcomes from data that’s been found there (linked below).

You really need to be mining these projects for data if they cover your research area. There’s a lot to learn that hasn’t been published yet–just be sure to read up on their usage policies before you deliver your great discoveries to the journals!

Quick links:

Data portal: http://dcc.icgc.org/

Project homepage: http://icgc.org/


Hudson (Chairperson) T.J., Anderson W., Aretz A., Barker A.D., Bell C., Bernabé R.R., Bhan M.K., Calvo F., Eerola I. & Gerhard D.S. & many others in a large consortium… (2010). International network of cancer genome projects, Nature, 464 (7291) 993-998. DOI:

Alexandrov L.B., Nik-Zainal S., Wedge D.C., Aparicio S.A.J.R., Behjati S., Biankin A.V., Bignell G.R., Bolli N., Borg A. & Børresen-Dale A.L. & many others in a large consortium… (2013). Signatures of mutational processes in human cancer, Nature, 500 (7463) 415-421. DOI:

Gonzalez-Perez A., Mustonen V., Reva B., Ritchie G.R.S., Creixell P., Karchin R., Vazquez M., Fink J.L., Kassahn K.S. & Pearson J.V. & many others in a large consortium… (2013). Computational approaches to identify functional genetic variants in cancer genomes, Nature Methods, 10 (8) 723-729. DOI:

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

International Cancer Genome Consortium; interview with Tom Hudson

We’ve talked about the International Cancer Genome Consortium (ICGC) a number of times before, and we had a Tip of the Week on the project and database last year. It may be time for a new tip because their site and software have changed. One of the very cool aspects of the data access is that they are using the BioMart query tool for the interface–but it is the v0.8 cutting-edge style of BioMart that has some nice new features.

Anyway, I saw a tweet this morning about an interview with one of the principals of the ICGC, Tom Hudson. It’s a nice interview that talks about the project, the progress, and more. If you haven’t been following the ICGC’s work you might use this interview as a nice entry point to that. And then check out the data–and the BioMart interface that’s available at the site.

Interview (and hat tip to the tweeter that pointed me there):

RT @ResearchMedia: Dr Thomas Hudson of the ICGC Secretariat outlines the benefit of working as a consortium in the fight against #cancer http://t.co/CqM1UQm

Visit the ICGC: http://www.icgc.org/ and click on the Data Portal to start looking at the data that’s flowing in now.


Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • Interesting story, but NOT what our legislation did, unfortunately: GINA does NOT cover life + long term care insurance. RT @gmopundit: RT @Genomengin: Genome power is about to sweep world: Nobel laureate http://t.co/CS41Rby via @theage [Mary]
  • RT @bffo: ICGC Data Coordination Center released version 5 of the ICGC data portal http://dcc.icgc.org/ #cancer #genomics #bioinformatics [Mary]
  • RT @westr: ISCB Honors Michael Ashburner and Olga Troyanskaya with Top Bioinformatics/Computational Biology Awards for 2011 http://bit.ly/ivsA2q [Mary]
  • Nature mentoring awards, this year in France. Nominations are due by June 27th, 2011. [Jennifer]
  • Sounds good–investigating now: RT @EinsteinMed: Einstein offers easy-to-use, open-source GenPlay #genome analyzer to scientific community http://ein.st/mkqCtg [Mary]
  • RT @davidweisss: RT @GenePattern: InSilicoDB, a sophisticated query engine for selecting GEO datasets and analyzing in @GenePattern: http://insilico.ulb.ac.be #bioinformatics [Mary]
  • Articles about how synonymous mutations possibly cause disease by altering miRNA regulation: The Sound of Silence and the original Crohn’s research article (subscription required for both) [Jennifer]
  • ‘nother day, ‘nother genome: RT @GenomeBrowser: Today we released the newest genome assembly of the green anole lizard, Anolis carolinensis (produced by @broadinstitute): anoCar2. [Mary]
  • KEGG 3D Mapping tool–here’s a sample image: http://bit.ly/iWTI3Q and you can see the recent changes indicated currently on the KEGG Mapper page, including the new Color Pathway features (with more planned). Hat tip to @d_kihara for retweeting @takujida ‘s item that I would have missed otherwise. [Mary]

Mining the “big data” is…fascinating. And necessary.

When we have workshops coming up, I spend some time tooling around in the big data to see if there have been changes since the last time I talked about it, update the slides if necessary, and sometimes form a hypothesis and test it. (PS: we’re at Baylor next, if anyone is looking for a workshop there.) On Friday I totally lost myself in a query that began at UCSC in the ENCODE data, and ended up in the ICGC BioMart. And wow. Do I wish I had a lab some days….

One of the comments at our last workshop was that the ENCODE data on cell lines is not the same as looking at tissues. And I totally agree with that–but the mouse ENCODE data is going to help get that sort of data. But as someone who spent a lot of time culturing cells in the past, I am interested to know how different cell lines are from the “reference” genome complement. And there’s one specific part of the human ENCODE project that’s looking at this: the Common Cell CNV track.

Here’s what I did: a Table Browser query to look for the types of structural variations that were coming up in the 3 cell lines that have been examined: GM12878, HepG2, and K562. I wondered to myself: how many of these CNVs overlap with known genes? And what types of variations are there? Here’s a sample of how I structured that query for one of the cell lines:

This query yields normal sections, amplifications, deletions–and some deletions are homozygous and some are heterozygous. One of the points I make in the ENCODE workshop is that if I were using a cell line I’d be curious to know these sorts of things about it–I wish someone would do HeLa and the other big cell lines out there too. (Probably someone is, but I don’t know about the data. If someone has it, give me a holler.)
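For anyone who would rather script this kind of question than click through the Table Browser, the overlap step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the query I ran: the CNV and gene records are made-up placeholders, the call-type labels are hypothetical, and the record layout (chromosome, start, end, label) is an assumption that loosely follows BED-style half-open coordinates.

```python
from collections import defaultdict

# Hypothetical CNV calls for one cell line: (chrom, start, end, call type).
# Coordinates are invented for illustration; they are not ENCODE data.
cnvs = [
    ("chr1", 1000, 5000, "amplified"),
    ("chr1", 8000, 9000, "normal"),
    ("chr2", 200, 900, "homozygous_deletion"),
    ("chr2", 3000, 4000, "heterozygous_deletion"),
]

# Hypothetical gene intervals: (chrom, start, end, gene name).
genes = [
    ("chr1", 4500, 6000, "GENE_A"),
    ("chr2", 100, 300, "GENE_B"),
    ("chr2", 5000, 6000, "GENE_C"),
]


def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def genes_by_cnv_type(cnvs, genes):
    """Group the names of overlapped genes by CNV call type."""
    hits = defaultdict(set)
    for c_chrom, c_start, c_end, c_type in cnvs:
        for g_chrom, g_start, g_end, g_name in genes:
            if c_chrom == g_chrom and overlaps(c_start, c_end, g_start, g_end):
                hits[c_type].add(g_name)
    # Call types with no overlapped gene are simply absent from the result.
    return {cnv_type: sorted(names) for cnv_type, names in hits.items()}


result = genes_by_cnv_type(cnvs, genes)
for cnv_type, names in sorted(result.items()):
    print(cnv_type, names)
```

The nested loop is fine for a handful of regions; for genome-scale tables you would want sorted intervals or a dedicated interval tool instead.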

So I’m working around these variations, and I got curious about one particular region in one of the cell lines. It took out a region with some rather important-looking genes. I went to the literature to find that this region is known to be a problem in some cancers.

I went to look at the ICGC data to see if anything interesting was turning up with these genes. And wow–whadda ya know: there’s not a ton of data in that data set yet, but I found a significant correspondence between some of the data already in there from real tumors and what I found in the cell line. It’s too early for conclusions about that. It’s hard to know in these big data projects what you *aren’t* seeing, how much is already in there, how much isn’t, etc. But I checked a bunch of other genes and none showed this sort of pattern I was seeing.

Because of the ICGC usage policy, I don’t think I can speak specifically about what I saw. But it was very curious. If I had a lab I would have put a student on it this morning ;)

And my point is this: the data is not in the papers anymore. It’s in the databases. And you need to be mining it–these big data projects are handing you the pick-axes and pointing you to the mines.


What you need to do what I did:

1. A grasp of the UCSC functions and the ENCODE data. Check out our tutorials on those, which are freely available because they are sponsored by UCSC and the ENCODE team at UCSC.

2. BioMart: we have a tutorial on this, but it is in our subscription package.

What you don’t need: current literature. It’s not in the papers, and may never be. The “big data” stuff is in the databases, and only small amounts can really be published in the traditional way.

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • Chromothripsis – new model for some cancers? From GenomeWeb Daily News. I’m interested in seeing follow up studies on this. [Jennifer]
  • A new data source added to the BioMart Central portal: “EMAGE, a database of in situ gene expression data in the mouse embryo, has been added to BioMart Central Portal. The EMAGE website can be found at http://www.emouseatlas.org/emage/ and the EMAGE BioMart server can be found at http://biomart.emouseatlas.org/” (via the Mart-dev mailing list) [Mary]
  • Another potential outlet for scientists wanting to get involved: the Global Knowledge Initiative, whose goal is: [Jennifer]

    We build global knowledge partnerships between individuals and institutions of higher education and research. We help partners access the global knowledge, technology, and human resources needed to sustain growth and achieve prosperity for all.

  • From GenomeWeb – an announcement about MoDEL the ‘World’s Largest Protein Video Database’ – it is free for academic, not-for-profit use. I haven’t tried it at all, but it sounds like it might be cool. Let us know if you check it out! [Jennifer]
  • Announcement from the International Cancer Genome Consortium (where you can access the data using the cutting edge BioMart build). Hat tip to @bffo: Update on ICGC website with a simplified application process for controlled access data #bioinformatics #cancer #genomics http://icgc.org/ [Mary]
  • Another resource for protein-protein and drug-protein interactions: PROMISCUOUS [Jennifer]
  • There’s a new Announcement mailing list for BioMart, as it gets migrated from the former EBI location.  Announce and Users lists are available–if you were on them you probably got automatically migrated. If you want to sign up, see this note:  [mart-announce] New BioMart announce and users mailing lists.  Hmm, that’s not entirely helpful as it hides the addresses you need. They are: mart-dev@ebi.ac.uk becomes users@biomart.org and mart-announce@ebi.ac.uk becomes announce@biomart.org [Mary]
  • REViGO – a resource for reducing and visualizing Gene Ontology trees, described in this paper: Supek F et al. PLoS Genet 6(6): e1001004. [Jennifer]

The data isn’t in the papers anymore, you know.

This week I was working on finishing up some training materials on the ENCODE data. We’ve talked about this before, and we’ve had some materials out already to support the ENCODE project, since we have a contract with the folks at UCSC to do some training on it. (The new materials should be out later this month.) But we were out doing a workshop on this data/software recently and we had a really great thing happen.

In the workshop we got to the exercise where I showed the attendees how to add the data for the GATA1 transcription factor binding sites to the visualization. This data is part of the Yale Transcription Factor Binding Site track.

In the front row of the training room, a researcher actually started to giggle.  Sometimes you can have fun in software training, but this was different. This woman was so happy to have discovered something she didn’t know before about GATA1 binding near her gene of interest that she was beside herself with delight.

Maybe this happens when she reads papers, too. But it struck me that what she had just done was come across something that isn’t in the papers. And that specific item may not be in the papers for a long time. But because she knew how to use the UCSC Genome Browser, and because she is now aware of the ENCODE data in the browser, she discovered something important for her research.

And that’s not in the literature. It’s in the databases.

I was also recently using the International Cancer Genome Consortium site’s new BioMart interface at their Data Coordination Center. Their recent update added some new features, and I was using the new view of “Affected Genes” on that page. I picked a cancer type, I loaded up the Protein Coding genes, and there I was looking at the genes that had been repeatedly found to be affected in patient after patient. Some of the genes were not a surprise, certainly. But I sat there looking at data that a lot of people don’t know about–because it’s not in the papers yet. And it may not be for a long time.
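The idea behind a view like “Affected Genes” can be sketched as a simple tally: for each gene, count how many distinct donors carry at least one mutation in it. This is only an illustration under assumptions. The donor IDs, gene names, record layout, and the `min_donors` cutoff are all hypothetical, and I’m not claiming this is how the portal builds its table.

```python
from collections import defaultdict

# Hypothetical (donor_id, gene) mutation records. A donor may carry several
# mutations in the same gene; those should count only once per donor.
records = [
    ("donor1", "GENE_A"),
    ("donor1", "GENE_A"),
    ("donor1", "GENE_B"),
    ("donor2", "GENE_A"),
    ("donor3", "GENE_A"),
    ("donor3", "GENE_C"),
]


def donors_per_gene(records):
    """Map each gene to the number of distinct donors with a mutation in it."""
    donors = defaultdict(set)
    for donor_id, gene in records:
        donors[gene].add(donor_id)
    return {gene: len(ids) for gene, ids in donors.items()}


def recurrent_genes(records, min_donors=2):
    """Genes affected in at least min_donors donors, most frequent first."""
    counts = donors_per_gene(records)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(gene, n) for gene, n in ranked if n >= min_donors]


counts = donors_per_gene(records)
top = recurrent_genes(records, min_donors=2)
print(top)
```

Counting donors rather than raw mutation records keeps one heavily mutated sample from making a gene look recurrent across patients.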

Now, both of these “big data” projects have caveats: this data is pre-publication. Although there are some levels of QC, it should be considered as preliminary and you need to do due diligence before running off with conclusions about it. And both projects have data usage policies about how far you can take it before the embargo or moratorium is considered lifted. But still: you could make discoveries that no one else has made yet if you 1) are aware that this data is there, and 2) know how to use the software to get at it. There’s really no other way to know it.

That said, I know there are issues with the information in databases. A paper spoke to some issues of mis-annotation of data (Schnoes et al., below):

Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth…..

So you need to be aware of that. And you need to confirm what you are seeing. But again–you need: 1) awareness of the tools used to do this, and 2) training on how to use tools to be sure you are getting appropriate information.

That’s also not in the papers anymore. It’s up to you.

There are so many projects of this nature out there now. We know of many species, data types, and topics that are just tossing great stuff into the ether….and so many people don’t realize it. I just wish I had time to mine it all myself. There are some real gems of discovery out there. But you need a map, and you need some tools. And I want to hear more giggling, people. Get on it, please.

Schnoes, A., Brown, S., Dodevski, I., & Babbitt, P. (2009). Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies PLoS Computational Biology, 5 (12) DOI: 10.1371/journal.pcbi.1000605

Rosenbloom, K., Dreszer, T., Pheasant, M., Barber, G., Meyer, L., Pohl, A., Raney, B., Wang, T., Hinrichs, A., Zweig, A., Fujita, P., Learned, K., Rhead, B., Smith, K., Kuhn, R., Karolchik, D., Haussler, D., & Kent, W. (2009). ENCODE whole-genome data in the UCSC Genome Browser Nucleic Acids Research, 38 (Database) DOI: 10.1093/nar/gkp961

Hudson (Chairperson), T., Anderson, W., Aretz, A., Barker, A., Bell, C., Bernabé, R., Bhan, M., Calvo, F., Eerola, I., Gerhard, D., & many others in a large consortium… (2010). International network of cancer genome projects Nature, 464 (7291), 993-998 DOI: 10.1038/nature08987