Tag Archives: biomart

Video Tip of the Week: InterMine for complex queries

We’ve been fans of InterMine for a long time. We did a tip-of-the-week a while ago that highlighted ways that this software can be used to mine big data projects of many types. The generic framework of InterMine can be customized for use at different projects–today I’ll include videos from the FlyMine installation and the YeastMine flavor–but you may find versions of this handy tool in many other places as well.

The first video is a broader overview of different types of things you can do–and although this is FlyMine, you’ll find similar behavior at the other Mines too.

This next video is more specific about a task that people need to accomplish–working with a list of genes. This example was recently produced by the YeastMine folks, but again this should work in a similar way across other Mines. You should also read the SGD blog post on it–Create, Analyze, Save: the Power of Gene Lists in YeastMine.

The other thing that I noticed about this framework is the effort of several of these model organism Mines to coordinate into this InterMOD structure. Although I am often wary of “one search to rule them all” sorts of efforts, there can be value in this as a central organizing principle as we keep adding more species genomes that may not have as well-developed communities and infrastructure to support them.

I certainly use a lot of query tools that are similar to these–like the UCSC Table Browser and BioMart. UniProt offers ways to build queries that are different but conceptually similar. Using these interfaces you can construct some clever and complex ways to extract information out of data repositories.

Quick links:

InterMine: http://intermine.github.io/intermine.org/

FlyMine: http://www.flymine.org/

YeastMine: http://yeastmine.yeastgenome.org/

InterMOD: http://intermod.intermine.org

References:

Smith R.N., Aleksic J., Butano D., Carr A., Contrino S., Hu F., Lyne M., Lyne R., Kalderimis A., Rutherford K. et al. (2012). InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics (Oxford, England).

Lyne R., Smith R., Rutherford K., Wakeling M., Varley A., Guillier F., Janssens H., Ji W., Mclaren P., North P. et al. (2007). FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biology.

Balakrishnan R., Park J., Karra K., Hitz B.C., Binkley G., Hong E.L., Sullivan J., Micklem G. & Cherry J.M. (2012). YeastMine–an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit. Database: the journal of biological databases and curation.

Sullivan J., Karra K., Moxon S.A.T., Vallejos A., Motenko H., Wong J.D., Aleksic J., Balakrishnan R., Binkley G., Harris T. et al. (2013). InterMOD: integrated data and tools for the unification of model organism research. Scientific Reports, 3 (1802).

Video Tip of the Week: ICGC portal for cancer genomics

A question at Biostar about cancer “gene sets” recently got me looking at one of my favorite data sources again–the ICGC, International Cancer Genome Consortium, and their data portal. Previous posts we’ve done were based on their legacy portal (which is still available on their site). They changed things up a bit with a release last fall, and I hadn’t covered those changes yet.

Conveniently, they have done a short video explaining how to access the data that they offer. They’ve continued to add new data, and to refine the software. You should check it out.

ICGC Data Portal Tutorial from ICGC on Vimeo.

In the past I found some really useful info to compare with a lung cancer cell line I had been examining. I saw the same mutation in actual tumor samples as had been found in this cell line years back. But there have also been publications recently that talk in more detail about the project and some interesting outcomes from data that’s been found there (linked below).

You really need to be mining these projects for data if they cover your research area. There’s a lot to learn that hasn’t been published yet–just be sure to read up on their usage policies before you deliver your great discoveries to the journals!

Quick links:

Data portal: http://dcc.icgc.org/

Project homepage: http://icgc.org/

References:

Hudson (Chairperson) T.J., Anderson W., Aretz A., Barker A.D., Bell C., Bernabé R.R., Bhan M.K., Calvo F., Eerola I., Gerhard D.S. & many others in a large consortium… (2010). International network of cancer genome projects. Nature, 464 (7291) 993-998.

Alexandrov L.B., Nik-Zainal S., Wedge D.C., Aparicio S.A.J.R., Behjati S., Biankin A.V., Bignell G.R., Bolli N., Borg A., Børresen-Dale A.L. & many others in a large consortium… (2013). Signatures of mutational processes in human cancer. Nature, 500 (7463) 415-421.

Gonzalez-Perez A., Mustonen V., Reva B., Ritchie G.R.S., Creixell P., Karchin R., Vazquez M., Fink J.L., Kassahn K.S., Pearson J.V. & many others in a large consortium… (2013). Computational approaches to identify functional genetic variants in cancer genomes. Nature Methods, 10 (8) 723-729.

What’s the Answer? (Gene ID conversion)

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question:

What is a good “gene ID conversion tool”

This is an older question, from 2 years ago, but still relevant and the answers still quite helpful and full of resources such as DAVID, BioDBnet, BioMart and others.

Check it out. You might also want to check out the third exercise of our UCSC Advanced Tutorial. The exercise:

“From a list of UCSC genes, add gene symbols and GO IDs for additional information about the gene set. Bonus step: add GO terms.”

It walks through how you might be able to do this with the UCSC Table Browser with some simple modifications.
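If you end up with a mapping table exported from one of these tools and would rather script the conversion, something like the following Python sketch may help. It is only an illustration: the column names (`ucsc_id`, `symbol`, `go_id`) and every ID in the sample table are made-up placeholders, and a real export from the Table Browser, BioMart or DAVID will have its own headers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping table, as it might look after export from an ID
# conversion tool. All IDs and annotations are placeholders, not real data.
MAPPING_TSV = (
    "ucsc_id\tsymbol\tgo_id\n"
    "uc_example1\tGENE_A\tGO:example1\n"
    "uc_example1\tGENE_A\tGO:example2\n"
    "uc_example2\tGENE_B\tGO:example3\n"
)


def load_mapping(handle):
    """Read a tab-separated mapping into {source_id: {"symbol": ..., "go_ids": [...]}}."""
    mapping = defaultdict(lambda: {"symbol": None, "go_ids": []})
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = mapping[row["ucsc_id"]]
        entry["symbol"] = row["symbol"]
        # One gene can carry several GO IDs, so collect them without duplicates.
        if row["go_id"] and row["go_id"] not in entry["go_ids"]:
            entry["go_ids"].append(row["go_id"])
    return dict(mapping)


def convert(ids, mapping):
    """Return (converted, unmapped) so missing IDs are reported, not silently dropped."""
    converted = {i: mapping[i] for i in ids if i in mapping}
    unmapped = [i for i in ids if i not in mapping]
    return converted, unmapped


mapping = load_mapping(io.StringIO(MAPPING_TSV))
converted, unmapped = convert(["uc_example1", "uc_example2", "uc_missing"], mapping)
for source_id, info in converted.items():
    print(source_id, info["symbol"], ",".join(info["go_ids"]))
print("unmapped:", unmapped)
```

Reporting the unmapped IDs separately is a deliberate choice here: in my experience the IDs that fail to convert are the ones you most need to look at by hand.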

Video Tips of the Week: Annual Review IV, 2nd half

As you may know, we’ve been doing these video tips-of-the-week for FOUR years now. We have completed around 200 little tidbit introductions to various resources through last year, 2011 (yep, it’s 2012 now). At the end of each year we’ve established a sort of holiday tradition: a summary post to collect them all. If you have missed any of them, it’s a great way to have a quick look at what might be useful to your work.

You can see past years’ tips here: 2008 I, 2008 II, 2009 I, 2009 II, 2010 I, 2010 II. The summary of the first half of 2011 is available from last week.

July 2011

July 6: Prioritizing genes using the Gene Prioritization Portal

July 13: PolySearch, searching many databases at once

July 20: Human Epigenomics Visualization Hub

July 27: The new SIB Bioinformatics Resource Portal

 

August 2011

August 3: SNPexp, correlation between SNPs and gene expression 

August 10: CompaGB for comparing genome browser software

August 17: CoGe, comparing genomes revisited

August 24: Domain Draw for quick motif diagrams

August 31: From UniProt to the PSI SBKB and back again

 

September 2011

September 7: Plant comparative genomics using Plaza

September 14: phiGENOME for bacteriophage genome exploration

September 21: Getting flanking sequences of genomic locations

September 28: Introduction to R statistical software 

 

October 2011

October 5: VnD resource for genetic variation and drug information

October 12: Track Hubs in UCSC Genome Browser

October 19: Mitochondrial Transcriptome GBrowser 

October 26: Variation data from Ensembl

 

November 2011

November 2: MizBee Synteny Browser

November 9: The new database of genomic variants: DGV2

November 16: MapMi, automated mapping of microRNA loci

November 23: BioMart’s new central portal

November 30: Phosida, a post-translational modification database

December 2011

December 7: VarSifter, for identifying key sequence variations

December 14: Big changes to NCBI’s genome resources

December 21: eggNOG for the Holidays (or to explore orthologous genes)

December 28: Video Tips of the Week: Annual Review IV (first half of 2011)

Video Tip of the Week: BioMart’s new central portal

BioMart is widely used open-source data management software, with an interface that enables end-users to generate complex and customized queries across many types and sources of biological data. It’s part of the GMOD tool kit, and many project teams that have big data have chosen the BioMart software to organize and make their data available to you.

We’ve been fans of BioMart for years. It was one of the earliest software tools we described, as it was integrated into many of the sites that we covered–such as Ensembl. Eventually we broke it out into its own tutorial suite, though, as there are now dozens of groups that have built Marts of their own. Although the skin may change and the data sets that are available will vary at different sites, the underlying software features are the same. Learning to use the main BioMart portal will help you to use all of them. Until recently the list of data providers that used BioMart was on the homepage, but here’s a taste of that list from my slides:

In this video tip I’ll introduce the newly re-designed BioMart main site, and touch on some of the other versions of BioMart that you should get to know. We’ll be updating our tutorial suite with the new look soon, but most of the software functionality is the same as what we’ve already covered there (available by subscription).

There are two main versions of BioMart circulating right now. The v 0.7 is the one that will probably be most familiar to people who have encountered BioMart at any of the genomics sites that have installations right now. But there’s a new and re-designed v 0.8 that is under development. It’s the one that’s used at the International Cancer Genome Consortium (ICGC.org) and there’s also a 0.8 central BioMart portal available to try out. Eventually this may replace many of the 0.7 setups, but this depends on the site. Some may persist with 0.7 for a while rather than updating. So it’s probably wise to have an idea of how to use both of them at this time.

One of the features of the new BioMart interface that’s already got bioinformatics folks talking is the ID converter. This is a common problem in the field, and Steven Turner thought this was a nice aspect of the facelift: BioMart Gene ID converter.

I also wanted to note that BioMart is one of the tools that you can use at Galaxy to access large swaths of data for further analysis. At Galaxy, open the “Get Data” menu to see that BioMart is one of your options.

There was also a lot of buzz about BioMart last week when a “Virtual Issue” of the journal Database was released that had not only an overview article about BioMart as a whole, but also articles on several of the resources that use BioMart for their management and query interfaces. So you can see how widely useful this software is among many different types of data providers. You can use the local installations of BioMart at a provider’s site, or you can use the main site to query any of these sources–and, more powerfully, you can query across databases too.
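For those who prefer scripting to clicking, BioMart 0.7 installations also accept queries written as a small XML document, with a dataset, some filters and a list of attributes. Here is a rough Python sketch that only builds such a document and prints it; nothing is sent over the network. The dataset, filter and attribute names below are assumptions for illustration (they vary between marts and releases), so check the names offered by the mart you are using before relying on them.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters=None, formatter="TSV"):
    """Build a BioMart-style XML query string (dataset + filters + attributes)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters restrict which rows come back.
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Attributes choose which columns come back.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Assumed names, for illustration only: verify them at the mart you query.
xml_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id", "external_gene_id"],
    filters={"chromosome_name": "17"},
)
print(xml_query)
```

The same filters-plus-attributes idea is what the web interface is assembling for you behind the scenes, which is why learning it once tends to carry over between the different Marts.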

Quick links:

BioMart main site: http://www.biomart.org/

BioMart new-style Central Portal: http://central.biomart.org/

BioMart pages at GMOD: http://gmod.org/wiki/BioMart

Virtual Issue of Database on BioMart: http://www.oxfordjournals.org/our_journals/databa/biomart_virtual_issue.html

References:

Kasprzyk, A. (2011). BioMart: driving a paradigm change in biological data management. Database, 2011 DOI: 10.1093/database/bar049

Zhang, J., Haider, S., Baran, J., Cros, A., Guberman, J., Hsu, J., Liang, Y., Yao, L., & Kasprzyk, A. (2011). BioMart: a data federation framework for large collaborative projects. Database, 2011 DOI: 10.1093/database/bar038

Guberman, J., Ai, J., Arnaiz, O., Baran, J., Blake, A., Baldock, R., Chelala, C., Croft, D., Cros, A., Cutts, R., Di Genova, A., Forbes, S., Fujisawa, T., Gadaleta, E., Goodstein, D., Gundem, G., Haggarty, B., Haider, S., Hall, M., Harris, T., Haw, R., Hu, S., Hubbard, S., Hsu, J., Iyer, V., Jones, P., Katayama, T., Kinsella, R., Kong, L., Lawson, D., Liang, Y., Lopez-Bigas, N., Luo, J., Lush, M., Mason, J., Moreews, F., Ndegwa, N., Oakley, D., Perez-Llamas, C., Primig, M., Rivkin, E., Rosanoff, S., Shepherd, R., Simon, R., Skarnes, B., Smedley, D., Sperling, L., Spooner, W., Stevenson, P., Stone, K., Teague, J., Wang, J., Wang, J., Whitty, B., Wong, D., Wong-Erasmus, M., Yao, L., Youens-Clark, K., Yung, C., Zhang, J., & Kasprzyk, A. (2011). BioMart Central Portal: an open database network for the biological community. Database, 2011 DOI: 10.1093/database/bar041

Haider, S., Ballester, B., Smedley, D., Zhang, J., Rice, P., & Kasprzyk, A. (2009). BioMart Central Portal–unified access to biological data. Nucleic Acids Research, 37 (Web Server) DOI: 10.1093/nar/gkp265

World tour of workshops, recent stop: Morocco, Africa

Trainers & organizers

Last year I had the opportunity to give a workshop in Ifrane, Morocco (UCSC Genome and Table browsers, Galaxy) at Al Akhawayn University. This year, Mary and I returned for a longer 3-day workshop at University Hassan II in Mohammadia. OpenHelix was a co-sponsor of the workshop (donating our time, materials and expertise). The workshop covered a plethora of topics, from a world tour of resources (tutorial-free), an introduction to the UCSC Genome Browser (tutorial-free) and ENCODE (tutorial-free) to genome variation analysis in dbSNP (tutorial-subscription) and analysis using Galaxy (tutorial-subscription). You can see the full schedule of topics here: Mohammadia Workshop Schedule (pdf).

As last year, we were impressed with the students (there were 117 total, with about a 50/50 gender ratio). English is their 3rd or 4th language in most cases, with Moroccan Arabic, French or various African languages being their languages of choice. Yet they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.

The workshop students

We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.

Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some more answers from our readers:

*One student was looking for wheat genome resources for designing primers. The wheat genome is as yet incomplete, but there are some resources to get started:
Wheat Genome Sequencing Consortium
Gramene’s wheat resources
Wheat Genetic and Genomic Resource Center @ Kansas State
Perhaps also CoGe for conserved sequences
edited to add:
CerealsDB and
James’ post on the wheat draft sequence might give some insight into that huge genome.
*Another student asked about dotplot tools:
Galaxy offers a large collection of EMBOSS tools including dotplot analysis, as do the EMBOSS tools at EBI

* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space: the cost of the dynamic programming solution grows exponentially with the number of sequences, so it is too computationally intensive for anything beyond a handful of short sequences. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProbs and this list at Wikipedia.

Do our readers have any other guidance on this?
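To make the dynamic programming point a bit more concrete, here is a small Python sketch of my own (not from the slide set). The first function computes the optimal global alignment score for two sequences in the Needleman-Wunsch style; the scoring values (match 1, mismatch -1, gap -1) are arbitrary choices for illustration. The second function simply counts how many table cells the same approach would need if it were extended to more sequences, which is where things blow up.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score for two sequences (Needleman-Wunsch style)."""
    # Only the previous row of the table is needed to compute the score.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two residues, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dp_cells(lengths):
    """Number of cells in the full table: one dimension per sequence."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


print(nw_score("ACGT", "AGT"))   # two short sequences: trivial
print(dp_cells([100, 100]))      # 2 sequences of length 100
print(dp_cells([100] * 10))      # 10 sequences of length 100
```

Two sequences of length 100 need about ten thousand cells; ten such sequences need 101 to the 10th power, before even counting the work per cell. That is why the widely used multiple alignment tools are heuristic.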

Teaching moment

* Another student asked if we know how to find DC-area internships in biological sciences. Another student (a mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?

If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!

 

And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech, and later my family and I toured for 8 days, visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.

Gates and doors of Fes are beautiful

camel excursion to the Sahara

International Cancer Genome Consortium; interview with Tom Hudson

We’ve talked about the International Cancer Genome Consortium (ICGC) a number of times before, and we had a Tip of the Week on the project and database last year. It may be time for a new tip because their site and software have changed. One of the very cool aspects of the data access is that they are using the BioMart query tool for the interface–but it is the v0.8 cutting-edge style of BioMart that has some nice new features.

Anyway, I saw a tweet this morning about an interview with one of the principals of the ICGC, Tom Hudson. It’s a nice interview that talks about the project, the progress, and more. If you haven’t been following the ICGC’s work you might use this interview as a nice entry point to that. And then check out the data–and the BioMart interface that’s available at the site.

Interview (and hat tip to the tweeter that pointed me there):

RT @ResearchMedia: Dr Thomas Hudson of the ICGC Secretariat outlines the benefit of working as a consortium in the fight against #cancer http://t.co/CqM1UQm

Visit the ICGC: http://www.icgc.org/ and click on the Data Portal to start looking at the data that’s flowing in now.

 

Tip of the Week: InterMine for mining “big data”

Integrating large data sets for queries within–and across–various collections is one of the arenas that has lately been pretty active in bioinformatics. As more and more “big data” projects yield huge numbers of data points and data types, this is only becoming more necessary.  I love to browse data, but there are times when a large-scale customized query is what you’ll want to make some broader discoveries.

Right now there are a number of resources and interfaces that I turn to for structured and customized queries of data collections. The UCSC Table Browser, BioMart, Galaxy–these are the ones I have my hands on almost continuously. But there is another warehouse and interface system that we’re seeing more and more: InterMine.

My first real encounter with InterMine was for the modENCODE data. There’s some really terrific data flowing out of that project now (I talked a bit about that recently here), and the interface and storage system they are using is InterMine.

FlyMine was the initial impetus for the “Mine” system. Some years back FlyMine was created as a warehouse and query system for the increasing amounts of fly data that was coming from various projects. The goal was to have a system powerful enough for bioinformatics + super users, but also a friendly yet powerful interface for bench biologists to use.

The initial paper described a user interface with 3 primary components: a Quick Search that’s great for browsing; a Template library that lets users access some pre-defined standard or likely query types that they can tweak for their needs; and a fully customizable Query Builder for the most advanced access. Since this paper, development has continued, and there are other new and cool features present as well.

Another big goal of the FlyMine effort was to be able to deal with lists. One of the most common questions we still get in workshops is: “I have a list of _____.  What’s the best way to deal with that?” FlyMine–and the InterMines in general–help people to query and manage their explorations with lists of stuff.
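If you have never thought of a gene list as something you can do arithmetic on, a tiny Python sketch may help. To be clear, this is not the InterMine API, just plain Python showing the kind of combine, intersect and subtract operations that list tools are built around. The gene names are placeholders.

```python
def normalize(ids):
    """Strip whitespace, drop blanks, and de-duplicate while keeping first-seen order."""
    seen = set()
    out = []
    for raw in ids:
        gene = raw.strip()
        if gene and gene not in seen:
            seen.add(gene)
            out.append(gene)
    return out


def combine(list_a, list_b):
    """Compare two identifier lists and return the usual set-style results."""
    a, b = normalize(list_a), normalize(list_b)
    set_a, set_b = set(a), set(b)
    return {
        "union": a + [g for g in b if g not in set_a],
        "intersection": [g for g in a if g in set_b],
        "only_in_a": [g for g in a if g not in set_b],
        "only_in_b": [g for g in b if g not in set_a],
    }


# Placeholder lists, e.g. hits from two different experiments.
screen_hits = ["geneA", "geneB ", "geneC", "geneB"]
expression_hits = ["geneB", "geneD", ""]
result = combine(screen_hits, expression_hits)
for name, genes in result.items():
    print(name, genes)
```

Note that the comparison is case-sensitive and assumes both lists use the same type of identifier; mixed identifier types are exactly the problem that the list upload steps at the Mines try to sort out for you.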

The MyMine feature of the InterMines is also a nice component. You can create a login and store things you want to have repeated access to: queries, lists, etc.

There are other people using InterMine for their systems too–a recent paper on TargetMine, for “Gene Prioritization and Target Discovery” is available, and might appear as an upcoming tip! Jennifer did a tip on YeastMine from SGD once as well.

But what triggered me to do this tip is that a letter came from the RGD mailing list last week that said this:

Effective Friday, May 20th, 2011 the MCW BioMart tool will be retired by RGD and the MCW Proteomics Center. For mining rat data, we have found that the RatMine tool is easier to use, more flexible and incorporates more types of data than BioMart. In addition, RatMine includes analysis tools not found in BioMart, giving RatMine users a single, intuitive interface for both obtaining and analyzing data.

So they are moving fully to InterMine and retiring the Rat BioMart, exclusively using RatMine at their installation. So this tip of the week will explore InterMine, RatMine, and some other Mines. That’s a lot of ground to cover–but it’s probably worth your time to know about InterMine as it becomes more broadly available.  It’s also important to understand how to query with the Mines if you want to bring the data to Galaxy for further analysis. If you visit Galaxy you’ll see that their “Get Data” section lets you access Mine tools–but you still need to know how to do the basic queries at the host site first.

Although this tip will touch on RatMine, the focus is the more general InterMine suite. RGD also said this in their notice:

For an overview of RatMine and how to use it, go to the RGD tutorial video, “An Introduction to the RatMine Database”, at http://rgd.mcw.edu/wg/home/rgd_rat_community_videos/an-introduction-to-the-ratmine-database2.  Alternatively, follow the “self-guided tour” of RatMine by clicking the “Take a tour” link at the top of any RatMine page.

To try out RatMine for yourself, go to http://ratmine.mcw.edu/ and get started with simplified data mining and analysis.

So if you want to have more specific information about using RatMine, be sure to check out their introduction.

Quick Links:

InterMine: http://intermine.org/

RatMine: http://ratmine.mcw.edu/

modENCODE: http://www.modencode.org/

Galaxy: http://usegalaxy.org/

Reference:
Lyne, R., Smith, R., Rutherford, K., Wakeling, M., Varley, A., Guillier, F., Janssens, H., Ji, W., Mclaren, P., North, P., Rana, D., Riley, T., Sullivan, J., Watkins, X., Woodbridge, M., Lilley, K., Russell, S., Ashburner, M., Mizuguchi, K., & Micklem, G. (2007). FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biology, 8 (7) DOI: 10.1186/gb-2007-8-7-r129

Mining the “big data” is…fascinating. And necessary.

When we have workshops coming up, I spend some time tooling around in the big data to see if there have been changes since the last time I talked about it, update the slides if necessary, and sometimes form a hypothesis and test it. (PS: we’re at Baylor next, if anyone is looking for a workshop there.) On Friday I totally lost myself in a query that began at UCSC in the ENCODE data, and ended up in the ICGC BioMart. And wow. Do I wish I had a lab some days….

One of the comments at our last workshop was that the ENCODE data on cell lines is not the same as looking at tissues. And I totally agree with that–but the mouse ENCODE project is going to help get that sort of data. But as someone who spent a lot of time culturing cells in the past, I am interested to know how different cell lines are from the “reference” genome complement. And there’s one specific part of the human ENCODE project that’s looking at this: the Common Cell CNV track.

Here’s what I did: a Table Browser query to look for the types of structural variations that were coming up in the 3 cell lines that have been examined: GM12878, HepG2, and K562. I wondered to myself: how many of these CNVs overlap with known genes? And what types of variations are there? Here’s a sample of how I structured that query for one of the cell lines:

This query yields normal sections, amplifications, deletions–and some deletions are homozygous and some are heterozygous. One of the points I make in the ENCODE workshop is that if I was using a cell line I’d be curious to know these sorts of things about it–I wish someone would do HeLa and the other big cell lines out there too. (Probably someone is, but I don’t know about the data. If someone has it, give me a holler.)
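The “which CNV regions overlap known genes?” question boils down to an interval intersection, which the Table Browser does for you. If you export the regions and want to repeat the check yourself, a minimal Python sketch might look like the one below. All coordinates, gene names and CNV labels are invented for illustration, and I am assuming BED-style half-open coordinates (start included, end excluded).

```python
from collections import namedtuple

Region = namedtuple("Region", "chrom start end label")


def overlaps(a, b):
    """True if two half-open intervals [start, end) on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def genes_by_cnv(cnvs, genes):
    """For each CNV region, list the labels of the genes it overlaps."""
    return [(cnv, [g.label for g in genes if overlaps(cnv, g)]) for cnv in cnvs]


# Hypothetical regions: not real data from any cell line.
cnvs = [
    Region("chr1", 1000, 5000, "amplified"),
    Region("chr1", 8000, 9000, "homozygous_deletion"),
    Region("chr2", 1000, 5000, "normal"),
]
genes = [
    Region("chr1", 4500, 6000, "GENE_A"),
    Region("chr1", 9000, 9500, "GENE_B"),  # starts exactly where the deletion ends: no overlap
    Region("chr2", 2000, 3000, "GENE_C"),
]

hits = genes_by_cnv(cnvs, genes)
for cnv, labels in hits:
    print(cnv.chrom, cnv.start, cnv.end, cnv.label, labels)
```

This compares every CNV against every gene, which is fine for a few thousand regions but would be slow genome-wide; for larger sets a sorted sweep or a dedicated interval tool would be the better route.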

So I’m working around these variations, and I got curious about one particular region in one of the cell lines. It took out a region with some rather important-looking genes. I went to the literature to find that this region is known to be a problem in some cancers.

I went to look at the ICGC data to see if anything interesting was turning up with these genes. And wow–whadda ya know: there’s not a ton of data in that data set yet, but I found a significant correspondence between some of the data already in there from real tumors and what I found in the cell line. It’s too early for conclusions about that. It’s hard to know in these big data projects what you *aren’t* seeing, how much is already in there, how much isn’t, etc. But I checked a bunch of other genes and none showed this sort of pattern I was seeing.

Because of the ICGC usage policy, I don’t think I can speak specifically about what I saw. But it was very curious. If I had a lab I would have put a student on it this morning ;)

And my point is this: the data is not in the papers anymore. It’s in the databases. And you need to be mining it–these big data projects are handing you the pick-axes and pointing you to the mines.

++++++++++++

What you need to do what I did:

1. A grasp of the UCSC functions and the ENCODE data. Check out our tutorials on those that are freely available as they are sponsored by UCSC and the ENCODE team at UCSC.

2. BioMart: we have a tutorial on this, but it is in our subscription package.

What you don’t need: current literature. It’s not in the papers, and may never be. The “big data” stuff is in the databases, and only small amounts can really be published in the traditional way.

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • Chromothripsis – new model for some cancers? From GenomeWeb Daily News. I’m interested in seeing follow up studies on this. [Jennifer]
  • A new data source added to the BioMart Central portal: “EMAGE, a database of in situ gene expression data in the mouse embryo, has been added to BioMart Central Portal. The EMAGE website can be found at http://www.emouseatlas.org/emage/ and the EMAGE BioMart server can be found at http://biomart.emouseatlas.org/” (via the Mart-dev mailing list) [Mary]
  • Another potential outlet for scientists wanting to get involved: the Global Knowledge Initiative, whose goal is [Jennifer]

    We build global knowledge partnerships between individuals and institutions of higher education and research. We help partners access the global knowledge, technology, and human resources needed to sustain growth and achieve prosperity for all.

  • From GenomeWeb – an announcement about MoDEL the ‘World’s Largest Protein Video Database’ – it is free for academic, not-for-profit use. I haven’t tried it at all, but it sounds like it might be cool. Let us know if you check it out! [Jennifer]
  • Announcement from the International Cancer Genome Consortium (where you can access the data using the cutting-edge BioMart build…). Hat tip to @bffo: Update on ICGC website with a simplified application process for controlled access data #bioinformatics #cancer #genomics http://icgc.org/ [Mary]
  • Another resource for protein-protein and drug-protein interactions: PROMISCUOUS [Jennifer]
  • There’s a new Announcement mailing list for BioMart, as it gets migrated from the former EBI location.  Announce and Users lists are available–if you were on them you probably got automatically migrated. If you want to sign up, see this note:  [mart-announce] New BioMart announce and users mailing lists.  Hmm, that’s not entirely helpful as it hides the addresses you need. They are: mart-dev@ebi.ac.uk becomes users@biomart.org and mart-announce@ebi.ac.uk becomes announce@biomart.org [Mary]
  • REViGO – a resource for reducing and visualizing Gene Ontology trees, described in this paper: Supek F et al. PLoS Genet 6(6): e1001004. [Jennifer]