A question at Biostar about cancer “gene sets” recently got me looking at one of my favorite data sources again–the ICGC, International Cancer Genome Consortium, and their data portal. Previous posts we’ve done were based on their legacy portal (which is still available on their site). They changed things up a bit with a release last fall, and I hadn’t covered those changes yet.
Conveniently, they have done a short video explaining how to access the data that they offer. They’ve continued to add new data, and to refine the software. You should check it out.
In the past I found some really useful info to compare with a lung cancer cell line I had been examining. I saw the same mutation in actual tumor samples as had been found in this cell line years back. But there have also been publications recently that talk in more detail about the project and some interesting outcomes from data that’s been found there (linked below).
You really need to be mining these projects for data if they cover your research area. There’s a lot to learn that hasn’t been published yet–just be sure to read up on their usage policies before you deliver your great discoveries to the journals!
Hudson (Chairperson) T.J., Anderson W., Aretz A., Barker A.D., Bell C., Bernabé R.R., Bhan M.K., Calvo F., Eerola I. & Gerhard D.S. & many others in a large consortium… (2010). International network of cancer genome projects, Nature, 464 (7291) 993-998. DOI: 10.1038/nature08987
Alexandrov L.B., Nik-Zainal S., Wedge D.C., Aparicio S.A.J.R., Behjati S., Biankin A.V., Bignell G.R., Bolli N., Borg A. & Børresen-Dale A.L. & many others in a large consortium… (2013). Signatures of mutational processes in human cancer, Nature, 500 (7463) 415-421. DOI: 10.1038/nature12477
Gonzalez-Perez A., Mustonen V., Reva B., Ritchie G.R.S., Creixell P., Karchin R., Vazquez M., Fink J.L., Kassahn K.S. & Pearson J.V. & many others in a large consortium… (2013). Computational approaches to identify functional genetic variants in cancer genomes, Nature Methods, 10 (8) 723-729. DOI: 10.1038/nmeth.2562
BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.
As you may know, we’ve been doing these video tips-of-the-week for FOUR years now. We have completed around 200 little tidbit introductions to various resources from last year, 2011 (yep, it’s 2012 now). At the end of the year we’ve established a sort of holiday tradition: we are doing a summary post to collect them all. If you have missed any of them it’s a great way to have a quick look at what might be useful to your work.
BioMart is widely used open-source data management software, with an interface that enables end-users to generate complex and customized queries across many types and sources of biological data. It’s part of the GMOD tool kit, and many project teams that have big data have chosen the BioMart software to organize and make their data available to you.
We’ve been fans of BioMart for years. It was one of the earliest software tools we described, as it was integrated into many of the sites that we covered–such as Ensembl. Eventually we broke it out into its own tutorial suite, though, as there are now dozens of groups that have built Marts of their own. Although the skin may change and the data sets that are available will vary at different sites, the underlying software features are the same. Learning to use the main BioMart portal will help you to use all of them. Until recently the list of data providers that used BioMart was on the homepage, but here’s a taste of that list from my slides:
In this video tip I’ll introduce the newly re-designed BioMart main site, and touch on some of the other versions of BioMart that you should get to know. We’ll be updating our tutorial suite with the new look soon, but most of the software functionality is the same as we’ve covered otherwise (available by subscription).
There are two main versions of BioMart circulating right now. The v 0.7 is the one that will probably be most familiar to people who have encountered BioMart at any of the genomics sites that have installations right now. But there’s a new and re-designed v 0.8 that is under development. It’s the one that’s used at the International Cancer Genome Consortium (ICGC.org) and there’s also a 0.8 central BioMart portal available to try out. Eventually this may replace many of the 0.7 setups, but this depends on the site. Some may persist with 0.7 for a while rather than updating. So it’s probably wise to have an idea of how to use both of them at this time.
One of the features of the new BioMart interface that’s already got bioinformatics folks talking is the ID converter. Converting between gene identifier types is a common problem in the field, and Steven Turner thought this was a nice aspect of the facelift: BioMart Gene ID converter.
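For readers who would rather script an ID conversion than click through it, here is a minimal sketch of building the kind of XML query that 0.7-style BioMart installations accept through their "martservice" interface. The dataset name (`hsapiens_gene_ensembl`), the filter and attribute names (`hgnc_symbol`, `ensembl_gene_id`), and the helper `build_id_conversion_query` are assumptions for illustration; each Mart publishes its own names, so check the site you are querying. The sketch only builds the query text and does not contact any server.

```python
import xml.etree.ElementTree as ET


def build_id_conversion_query(dataset, filter_name, ids, attributes):
    """Build a BioMart 0.7-style XML query that maps one ID type to others.

    All dataset, filter, and attribute names are supplied by the caller;
    the values used below are illustrative assumptions, not a fixed API.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",          # tab-separated results
        "header": "1",               # include a header row
        "uniqueRows": "1",           # collapse duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # A list of input IDs is passed as one comma-separated filter value.
    ET.SubElement(ds, "Filter", {"name": filter_name, "value": ",".join(ids)})
    # Each Attribute becomes one output column.
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Hypothetical example: gene symbols in, Ensembl gene IDs out.
query_xml = build_id_conversion_query(
    dataset="hsapiens_gene_ensembl",
    filter_name="hgnc_symbol",
    ids=["TP53", "KRAS"],
    attributes=["hgnc_symbol", "ensembl_gene_id"],
)
print(query_xml)
```

The resulting text would typically be sent as the `query` parameter of a request to the Mart's martservice URL; the exact address depends on the installation you are using.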
I also wanted to note that BioMart is one of the tools that you can use at Galaxy to access large swaths of data for further analysis. At Galaxy, open the “Get Data” menu to see that BioMart is one of your options.
There was also a lot of buzz about BioMart last week when a “Virtual Issue” of the journal Database was released that had not only an overview article about BioMart as a whole, but also articles on several of the resources that use BioMart for their management and query interfaces. So you can see how widely useful this software is, among many different types of data providers. You can use the local installations of BioMart at a provider’s site, or you can use the main site to query from any of these sources as well–and more powerfully you can cross-database query too.
Kasprzyk, A. (2011). BioMart: driving a paradigm change in biological data management Database, 2011 DOI: 10.1093/database/bar049
Zhang, J., Haider, S., Baran, J., Cros, A., Guberman, J., Hsu, J., Liang, Y., Yao, L., & Kasprzyk, A. (2011). BioMart: a data federation framework for large collaborative projects Database, 2011 DOI: 10.1093/database/bar038
Guberman, J., Ai, J., Arnaiz, O., Baran, J., Blake, A., Baldock, R., Chelala, C., Croft, D., Cros, A., Cutts, R., Di Genova, A., Forbes, S., Fujisawa, T., Gadaleta, E., Goodstein, D., Gundem, G., Haggarty, B., Haider, S., Hall, M., Harris, T., Haw, R., Hu, S., Hubbard, S., Hsu, J., Iyer, V., Jones, P., Katayama, T., Kinsella, R., Kong, L., Lawson, D., Liang, Y., Lopez-Bigas, N., Luo, J., Lush, M., Mason, J., Moreews, F., Ndegwa, N., Oakley, D., Perez-Llamas, C., Primig, M., Rivkin, E., Rosanoff, S., Shepherd, R., Simon, R., Skarnes, B., Smedley, D., Sperling, L., Spooner, W., Stevenson, P., Stone, K., Teague, J., Wang, J., Wang, J., Whitty, B., Wong, D., Wong-Erasmus, M., Yao, L., Youens-Clark, K., Yung, C., Zhang, J., & Kasprzyk, A. (2011). BioMart Central Portal: an open database network for the biological community Database, 2011 DOI: 10.1093/database/bar041
Haider, S., Ballester, B., Smedley, D., Zhang, J., Rice, P., & Kasprzyk, A. (2009). BioMart Central Portal–unified access to biological data Nucleic Acids Research, 37 (Web Server) DOI: 10.1093/nar/gkp265
As last year, we were impressed with the students (there were 117 total, about 50/50 gender ratio). English is their 3rd or 4th language in most cases, with Moroccan Arabic, French or various African languages being their language of choice. Yet they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.
(Photo: the workshop students)
We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.
Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some answers from our readers:
* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space for an exact dynamic programming solution: it grows so quickly with the number of sequences that the computation becomes too intensive. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProbs and this list at Wikipedia.
Do our readers have any other guidance on this?
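To make the complexity point concrete, here is a minimal sketch (not taken from the slide set) of the standard Needleman-Wunsch dynamic programming recurrence for two sequences, plus a small helper that counts how many table cells the same exact approach would need for more sequences. The scoring values and the function names are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of: align two residues, or gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


def msa_table_cells(lengths):
    """Cells in the exact DP table: one dimension per sequence, (length + 1) each."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(msa_table_cells([100, 100]))   # two sequences of length 100
print(msa_table_cells([100] * 10))   # ten sequences of length 100
```

For two sequences the table is small, but with one dimension per sequence the cell count is the product of the sequence lengths, so ten sequences of length 100 already need on the order of 10^20 cells. That is why practical multiple alignment tools fall back on heuristics.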
* Another student asked if we know how to find DC-area internships in biological sciences. Another student (mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?
If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!
And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech and later my family and I toured for 8 days visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.
We’ve talked about the International Cancer Genome Consortium (ICGC) before a number of times, and we had a Tip of the Week on the project and database last year. It may be time for a new tip because their site and software have changed. One of the very cool aspects of the data access is that they are using the BioMart query tool for the interface–but it is the v0.8 cutting-edge style of BioMart that has some nice new features.
Anyway, I saw a tweet this morning about an interview with one of the principals of the ICGC, Tom Hudson. It’s a nice interview that talks about the project, the progress, and more. If you haven’t been following the ICGC’s work you might use this interview as a nice entry point to that. And then check out the data–and the BioMart interface that’s available at the site.
Interview (and hat tip to the tweeter that pointed me there):
RT @ResearchMedia: Dr Thomas Hudson of the ICGC Secretariat outlines the benefit of working as a consortium in the fight against #cancer http://t.co/CqM1UQm
Visit the ICGC: http://www.icgc.org/ and click on the Data Portal to start looking at the data that’s flowing in now.
Integrating large data sets for queries within–and across–various collections is one of the arenas that has lately been pretty active in bioinformatics. As more and more “big data” projects yield huge numbers of data points and data types, this is only becoming more necessary. I love to browse data, but there are times when a large-scale customized query is what you’ll want to make some broader discoveries.
Right now there are a number of resources and interfaces that I turn to for structured and customized queries of data collections. The UCSC Table Browser, BioMart, Galaxy–these are the ones I have my hands on almost continuously. But there is another warehouse and interface system that we’re seeing more and more: InterMine.
My first real encounter with InterMine was for the modENCODE data. There’s some really terrific data flowing out of that project now (I talked a bit about that recently here), and the interface and storage system they are using is InterMine.
FlyMine was the initial impetus for the “Mine” system. Some years back FlyMine was created as a warehouse and query system for the increasing amounts of fly data that was coming from various projects. The goal was to have a system powerful enough for bioinformatics + super users, but also a friendly yet powerful interface for bench biologists to use.
The initial paper described the basics: a user interface with 3 primary components: a Quick Search that’s great for browsing; a Template library that lets users access some pre-defined standard or likely query types that they can tweak for their needs; and a fully customizable Query Builder for the most advanced access. Since this paper, development has continued, and there are other new and cool features present as well.
Another big goal of the FlyMine effort was to be able to deal with lists. One of the most common questions we still get in workshops is: “I have a list of _____. What’s the best way to deal with that?” FlyMine–and the InterMines in general–help people to query and manage their explorations with lists of stuff.
The MyMine feature of the InterMines is also a nice component. You can create a login and store things you want to have repeated access to: queries, lists, etc.
There are other people using InterMine for their systems too–a recent paper on TargetMine, for “Gene Prioritization and Target Discovery” is available, and might appear as an upcoming tip! Jennifer did a tip on YeastMine from SGD once as well.
But what triggered me to do this tip is that a letter came from the RGD mailing list last week that said this:
Effective Friday, May 20th, 2011 the MCW BioMart tool will be retired by RGD and the MCW Proteomics Center. For mining rat data, we have found that the RatMine tool is easier to use, more flexible and incorporates more types of data than BioMart. In addition, RatMine includes analysis tools not found in BioMart, giving RatMine users a single, intuitive interface for both obtaining and analyzing data.
So they are moving fully to InterMine and retiring the Rat BioMart, exclusively using RatMine at their installation. So this tip of the week will explore InterMine, RatMine, and some other Mines. That’s a lot of ground to cover–but it’s probably worth your time to know about InterMine as it becomes more broadly available. It’s also important to understand how to query with the Mines if you want to bring the data to Galaxy for further analysis. If you visit Galaxy you’ll see that their “Get Data” section lets you access Mine tools–but you still need to know how to do the basic queries at the host site first.
Although this tip will touch on RatMine, the focus is the more general InterMine suite. RGD also said this in their notice:
Reference: Lyne, R., Smith, R., Rutherford, K., Wakeling, M., Varley, A., Guillier, F., Janssens, H., Ji, W., Mclaren, P., North, P., Rana, D., Riley, T., Sullivan, J., Watkins, X., Woodbridge, M., Lilley, K., Russell, S., Ashburner, M., Mizuguchi, K., & Micklem, G. (2007). FlyMine: an integrated database for Drosophila and Anopheles genomics Genome Biology, 8 (7) DOI: 10.1186/gb-2007-8-7-r129
When we have workshops coming up, I spend some time tooling around in the big data to see if there have been changes since the last time I talked about it, update the slides if necessary, and sometimes form a hypothesis and test it. (PS: we’re at Baylor next, if anyone is looking for a workshop there.) On Friday I totally lost myself in a query that began at UCSC in the ENCODE data, and ended up in the ICGC BioMart. And wow. Do I wish I had a lab some days….
One of the comments at our last workshop was that the ENCODE data on cell lines is not the same as looking at tissues. And I totally agree with that–but the mouse ENCODE project is going to help get that sort of data. But as someone who spent a lot of time culturing cells in the past, I am interested to know how different cell lines are from the “reference” genome complement. And there’s one specific part of the human ENCODE project that’s looking at this: Common Cell CNV track.
Here’s what I did: a Table Browser query to look for the types of structural variations that were coming up in the 3 cell lines that have been examined: GM12878, HepG2, and K562. I wondered to myself: how many of these CNVs overlap with known genes? And what types of variations are there? Here’s a sample of how I structured that query for one of the cell lines:
This query yields normal sections, amplifications, deletions–and some deletions are homozygous and some are heterozygous. One of the points I make in the ENCODE workshop is that if I was using a cell line I’d be curious to know these sorts of things about it–I wish someone would do HeLa and the other big cell lines out there too. (Probably someone is, but I don’t know about the data. If someone has it, give me a holler.)
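If you export the CNV calls and a gene track from the Table Browser, the “which CNVs overlap known genes?” question can also be sketched in a few lines of code. This is only an illustration of the overlap logic, not the Table Browser’s own implementation. The `Interval` record, the helper names, and all coordinates and labels below are made up for the example, and the sketch assumes 0-based, half-open coordinates as in BED-style output.

```python
from collections import namedtuple

# Hypothetical minimal record for a CNV call or a gene.
Interval = namedtuple("Interval", ["chrom", "start", "end", "name"])


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def genes_hit_by_cnvs(cnvs, genes):
    """Map each CNV call name to the names of the genes it overlaps."""
    hits = {}
    for cnv in cnvs:
        hits[cnv.name] = [g.name for g in genes if overlaps(cnv, g)]
    return hits


# Made-up coordinates and labels, for illustration only.
cnvs = [
    Interval("chr1", 1000, 5000, "homozygous_deletion_1"),
    Interval("chr1", 9000, 12000, "amplification_1"),
    Interval("chr2", 1000, 5000, "normal_1"),
]
genes = [
    Interval("chr1", 4000, 6000, "geneA"),
    Interval("chr1", 6500, 8000, "geneB"),
    Interval("chr1", 11000, 15000, "geneC"),
]

hits = genes_hit_by_cnvs(cnvs, genes)
for cnv_name, gene_names in hits.items():
    print(cnv_name, gene_names)
```

The all-against-all comparison is fine for a handful of cell lines; for genome-wide tracks you would normally let the Table Browser’s intersection feature or Galaxy do this work for you.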
So I’m working around these variations, and I got curious about one particular region in one of the cell lines. It took out a region with some rather important-looking genes. I went to the literature to find that this region is known to be a problem in some cancers.
I went to look at the ICGC data to see if anything interesting was turning up with these genes. And wow–whadda ya know: there’s not a ton of data in that data set yet, but I found a significant correspondence between some of the data already in there from real tumors and what I found in the cell line. It’s too early for conclusions about that. It’s hard to know in these big data projects what you *aren’t* seeing, how much is already in there, how much isn’t, etc. But I checked a bunch of other genes and none showed this sort of pattern I was seeing.
Because of the ICGC usage policy, I don’t think I can speak specifically about what I saw. But it was very curious. If I had a lab I would have put a student on it this morning.
And my point is this: the data is not in the papers anymore. It’s in the databases. And you need to be mining it–these big data projects are handing you the pick-axes and pointing you to the mines.
What you need to do what I did:
1. A grasp of the UCSC functions and the ENCODE data. Check out our tutorials on those that are freely available as they are sponsored by UCSC and the ENCODE team at UCSC.
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
We build global knowledge partnerships between individuals and institutions of higher education and research. We help partners access the global knowledge, technology, and human resources needed to sustain growth and achieve prosperity for all.
From GenomeWeb – an announcement about MoDEL the ‘World’s Largest Protein Video Database’ – it is free for academic, not-for-profit use. I haven’t tried it at all, but it sounds like it might be cool. Let us know if you check it out! [Jennifer]
Announcement from the International Cancer Genome Consortium (where you can access the data using the cutting edge BioMart build)… Hat tip to @bffo: Update on ICGC website with a simplified application process for controlled access data #bioinformatics #cancer #genomics http://icgc.org/ [Mary]
Another resource for protein-protein and drug-protein interactions: PROMISCUOUS [Jennifer]
As you may know, we’ve been doing tips-of-the-week for three years now. We have completed around 150 little tidbit introductions to various resources*. At the end of the year we’ve established a sort of holiday tradition: we are doing a summary post to collect them all. If you have missed any of them it’s a great way to have a quick look at what might be useful to your work.