Tag Archives: ensembl

Tip of the week: CompaGB for comparing genome browser software

Here at OpenHelix we think a lot about the differences between nominally similar software that will accomplish some given task.  For example, in our workshops we are often asked about the differences between genome browsers.  Although UCSC sponsors our workshops and training materials on their browser, we know they aren’t the only genome browser out there and we can talk about them all–in fact, that’s one of the coolest things about being separate from UCSC or a specific software tool provider/grant–we can talk about everyone! And our answer is usually something like this:

The basic foundation of the “official reference sequence” is usually the same in all the main browsers. However, the way they choose to organize the display, the tools for showing/hiding annotation data, and the custom query and display options vary. But they generally all have some mechanism for this. For me, usually the choice comes down to what data I need to look at–and how a given software tool shows me that and lets me interact with it.

I know that’s largely an end-user perspective, but that’s who is attending our workshops. I can remember talking to one guy at our conference booth who only wanted to use a genome browser with the reference sequence display organized vertically. I gave him Map Viewer. Some people need a specific species–and no matter how good the software is, if your research species isn’t in there, it just doesn’t matter…. I’ve seen super-users on twitter complain about the look of the background at one browser or another. That doesn’t have much bearing on my choice–but I do have to say I really hate “hidden” menus and features you have to hover and dig to find, in general. What you don’t see is just impossible to know as an end-user.

But quite frankly when I’m looking for some details in a given region for a research use, I often explore all the browsers I know because of their differences in display and available data to show–to make sure I’m not missing anything. It doesn’t take that long to use them all (if you know your way around, and I think I do…).

But one group has tried to quantify the differences between software tools in a standardized way with specific metrics. A group from INRA has collected and assessed various characteristics of genome browsers, and has developed a database where you can look at what they have curated. It’s called CompaGB.

You can assess the features as one of these profiles: biologist, computational biologist, or computer scientist at this time.  In this tip of the week I explore the CompaGB interface, from an end-user biologist perspective. I’ll choose a couple of browsers to compare, and we can look at the type of things that the CompaGB team scores to give you a sense of what you can find. For developers you’ll see there are different metrics and you should go back and explore those as well.

In their paper they describe their inspiration for this project–which is QSOS. The Qualification and Selection of Open Source software project provides a model and framework to standardize descriptions of available software features. The QSOS framework is illustrated in this graphic on their Welcome page:

In short, they have 4 steps: defining frames of reference appropriate for the software tool; assessing the features; qualifying the features with a weighting mechanism; and selecting the appropriate tool.

You can easily see how the CompaGB team integrated these ideas in their database of genome browser comparisons. They let you choose criteria you are interested in, and offer a radar plot display as well as a tabular representation of the scores so you can consider the overall view or the details.

There are scores for “full, limited/medium, and poor” but not a lot of detail on that. They assessed the tools in 4 sections: (1) technical features, (2) data content, (3) GUI, and (4) annotation editing and creation.  There is apparently no swimsuit competition. Alas.
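To make the weighting idea concrete, here is a minimal sketch of how section scores and user-chosen weights could be combined into one number per browser. This is not CompaGB’s actual code: the numeric mapping of the three levels (0, 1, 2), the weights, and the two example browsers are all made up for illustration. Only the four section names come from the description above.

```python
# Assumed numeric mapping for the three score levels (an assumption, not
# taken from the CompaGB paper or database).
LEVELS = {"poor": 0, "limited/medium": 1, "full": 2}

# The four sections the CompaGB team assessed.
SECTIONS = [
    "technical features",
    "data content",
    "GUI",
    "annotation editing and creation",
]


def weighted_score(section_levels, weights):
    """Return the weight-averaged score (between 0 and 2) over the four sections."""
    total_weight = sum(weights[s] for s in SECTIONS)
    if total_weight == 0:
        raise ValueError("at least one section needs a non-zero weight")
    weighted_sum = sum(LEVELS[section_levels[s]] * weights[s] for s in SECTIONS)
    return weighted_sum / total_weight


# Hypothetical "biologist" profile: cares most about data content and the GUI.
biologist_weights = {
    "technical features": 1,
    "data content": 3,
    "GUI": 3,
    "annotation editing and creation": 1,
}

# Made-up evaluations for two imaginary browsers (not real CompaGB scores).
browser_a = {
    "technical features": "full",
    "data content": "limited/medium",
    "GUI": "full",
    "annotation editing and creation": "poor",
}
browser_b = {
    "technical features": "limited/medium",
    "data content": "full",
    "GUI": "limited/medium",
    "annotation editing and creation": "full",
}

for name, levels in [("browser_a", browser_a), ("browser_b", browser_b)]:
    print(name, weighted_score(levels, biologist_weights))
```

Changing the weights (say, to a hypothetical computer scientist profile that cares more about technical features) can change which browser comes out on top, which is the point of the “qualify” step.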

The paper says that 4 different evaluators examined the tools (at this time 7 different genome browsers: MuGeN, GBrowse, UCSC, Ensembl, Artemis, JBrowse, Dalliance). They have version numbers–for example you can compare the 2 widely used GBrowse versions right now. How often these will get re-evaluated I don’t know. And how to compare different installations of GBrowse at different sites is not really clear to me–they can vary a lot by what the project team wants and needs to implement.

One evaluator did each tool in most cases. And reportedly the results were sent to the database providers for checking. I have no idea what was sent to UCSC on the training issue… [*cough* I have issues with the UCSC training score details, for example...Yeah, we do workshops and so does UCSC. Lots of workshops around the world, we have slides and exercises...I'll show you in the tip where I saw it.] They do encourage users to comment or suggest on their web site if you have supplementary information–I may want to add some details later ;)  And it appears possible to create new items and curate, but I haven’t tried this. They also say they are re-vamping the evaluation process going forward to simplify it.

But…this statement in their paper surprised me:

The UCSC browser natively displays a broad range of human annotations, including cross-species comparisons. UCSC browser’s underlying strategy focuses upon centralizing data on UCSC servers and, as far as we know, no external lab has installed it locally for the purpose of storing and browsing their own data.

Ummm–no. We talk to people all over the place who maintain local installations of UCSC. Quite often in hospital situations where patient data privacy is a major issue there are local installs; certain companies have them. But there are others as well–among them a bunch of mirrors around the world. There’s a whole separate mailing list where people discuss their issues with their own installations. But we’ve also seen the UCSC infrastructure used for other species that UCSC doesn’t support such as HIV, malaria, phage browsers, and more.  Maybe this unusual setup of the UCSC software at the Epigenetics Visualization Hub would be interesting to be aware of. And we know the UCSC team consults with groups and helps them to do it.  And by the way–we mention in our tutorials and workshops that we’ve done around the world that other installations are possible and available.  And we know that the materials we provide are used in many countries to do local trainings as well.

So it was an interesting attempt to measure software features, and I understand why they attempted it, but it seems challenging to scale and maintain. And the curation strategy will have to be considered when evaluating the data. These are fixable if the project proceeds beyond this early set of browsers and branches out to other types of open source software. It really is hard to know what’s worth spending your time on, I admit. And that’s why we hope end-users have a look at our training materials to get introduced to a specific site and see if it suits their needs, and they can kick the tires with the exercises.

+++++++++++++++++++++++

Quick links

CompaGB: http://genome.jouy.inra.fr/CompaGB/

QSOS: http://www.qsos.org/

+++++++++++++++++++++++

Reference
Lacroix, T., Loux, V., Gendrault, A., Gibrat, J., & Chiapello, H. (2011). CompaGB: An open framework for genome browsers comparison. BMC Research Notes, 4 (1). DOI: 10.1186/1756-0500-4-133

Naked Mole Rat, another day, another genome

The latest genome to be completed is the naked mole rat (Heterocephalus glaber). Now, could there be a cooler (if ugly) mammal on the planet? It’s one of only two truly eusocial mammals in the world, it lives up to 28 long years (my daughter’s rat, no relation, lived only 3 years) and is surprisingly resistant to a lot of diseases.

So, no wonder the genome was sequenced. Maybe we can learn some things about social behavior and longevity.

Of course there is a resource for it at http://www.naked-mole-rat.org/ though it’s basically just a blast server and some downloads. I’m counting down to the day it’s available at UCSC or Ensembl :D. I have some genes I’m interested in comparing.

Alternate sequences in UCSC and Ensembl

If you go to the UCSC Genome Browser and type “vars” in the “gene” text box (human genome, 2009 assembly), you’ll notice something different. The chromosome region listed is “chr6_apd_hap1:3,060,047-3,078,462”. Those are the haplotype sequence coordinates. With the addition of the hg19 assembly, now provided by GRC, additional alternative sequences were included: haplotypes, alternative loci and patches.

Now type in “vars” in the “position or search term” box, or “chr6:30,000,000-31,000,000”, and submit.

Now go to the Mapping and Sequencing tracks and change the “GRC Patch Release” menu to “pack.” Click the title link and you’ll see you can turn on either the haplotypes or the patch releases.

Once you do that, click refresh.

Here you will be able to see all those alternative sequences provided by the GRC.

It will look something like the screenshot below.

 

 

Click on any of the sequence icons and get more information about that sequence. You can read more about the GRC haplotype and patch release at the site.
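If you want to handle position strings like the ones above in a script, here is a small sketch that splits them into a sequence name, start and end. The function names are hypothetical, and the test for “is this a haplotype sequence?” is only a naming heuristic I’m assuming from the “_hap1” style suffix seen in the example, not an official rule.

```python
import re

# Matches strings such as "chr6:30,000,000-31,000,000": a sequence name,
# a colon, then start and end coordinates that may contain commas.
POSITION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_position(text):
    """Split a browser-style position string into (name, start, end)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a position string: {text!r}")
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return match.group("chrom"), start, end


def looks_like_haplotype(chrom):
    """Heuristic (an assumption): haplotype sequence names contain '_hap'."""
    return "_hap" in chrom


for position in ["chr6_apd_hap1:3,060,047-3,078,462", "chr6:30,000,000-31,000,000"]:
    chrom, start, end = parse_position(position)
    print(chrom, start, end, looks_like_haplotype(chrom))
```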

The Ensembl browser too of course has these alternative sequences included. Instead of going through how to access them here, I’ll point you to their very informative blog post on that very subject.

 

 

 

Workshop: World Tour of Genome Browsers and Galaxy of Analysis Tools

We would just like to announce that Mary and I will be giving an all-day hands-on workshop on Tuesday, November 2nd, 2010 in Washington DC (my home town), right before the ASHG conference (which we will also be attending). The title of the workshop is A World Tour of Genome Browsers and a Galaxy of Analysis Tools. We’ll be covering the UCSC Genome and Table Browsers, an overview of other genome browsers, BioMart, Galaxy and a tour of genome resources and how to find them. For more information on location, cost and topics, you can continue reading here. There are workshops on UCSC and Galaxy at ASHG for attendees (we will be at them, but Bob, Anton and others will be presenting), but those have sold out. We are offering this workshop for those who would like to learn these topics and more, both DC residents and ASHG attendees.

To purchase a seat and register, go to our upcoming workshops page.


Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

Tip of the Week: 1000 Genomes Project Browser


You may have been hearing about the 1000 Genomes project–it’s one of the ongoing “big data” projects that is going to yield a great deal of variation information about the human genome. The goal is to sequence well over 1000 genomes to identify “most genetic variants that have frequencies of at least 1% in the populations studied”. They are doing this by sequencing large numbers of samples with 4x coverage. You can read more about their strategy in the About page on their web site. It also lists the anticipated sample populations.
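A quick back-of-the-envelope sketch shows why “many samples at low coverage” is a reasonable way to find variants at 1% frequency. This is not the project’s own analysis: it assumes individuals are sampled independently from one population, and that read depth at a site follows a simple Poisson model. Both are simplifications I’m adding for illustration.

```python
import math


def prob_variant_sampled(freq, n_individuals):
    """Chance that an allele at population frequency `freq` is carried by
    at least one of `n_individuals` diploid people (2 chromosomes each)."""
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - freq) ** n_chromosomes


def prob_site_uncovered(mean_depth):
    """Poisson probability that a given site gets zero reads in one sample
    sequenced to the given mean depth (e.g. 4 for 4x coverage)."""
    return math.exp(-mean_depth)


# A 1% variant is very likely to be present somewhere among 1000 people,
# but not among a handful of genomes.
print(prob_variant_sampled(0.01, 1000))
print(prob_variant_sampled(0.01, 6))

# At 4x, a small fraction of sites in any one sample get no reads at all,
# which is why many samples are pooled to call variants.
print(prob_site_uncovered(4))
```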

In this week’s Tip of the Week I’m going to take a quick spin through their browser. (You can also download all the data, but I’ll be focusing on the browser.) They have begun to release data now, and there are 6 individual sequences available at this time.  These are part of their “pilot” studies.  You can get some details on the pilot from their about page, which links to this PDF about the samples.

They are using the Ensembl framework to display their data. So if you are familiar with using Ensembl you’ll have some facility moving around this browser.  One thing that isn’t apparent right away from the site is that you can click the Resembl link on the display to turn on a track that puts the read/coverage data on the viewer. I also liked the alignment display  of all 6 genomes–but I’m sure that’s going to get challenging to view later with more and more genomes.

In an exchange with their very helpful help desk yesterday, I got this quick summary of the samples you’ll see:

For the high coverage populations NA12891, NA12892 and NA12878 are the CEU trio, NA19238, NA19239 and NA19240 are the YRI trio both father, mother, child respectively and both children were daughters.

If you have questions about their data, be sure to go ask them for help–they were very speedy with answers for me :) .

Some of the project data has also been picked up by UCSC and you can access the same sequences in the UCSC Genome Browser in the Genome Variants track on the March 2006 human assembly. (You’ll also see Venter, Watson, and some other individual genomes there).

Quick links:

The Project: http://www.1000genomes.org/

The Browser: http://browser.1000genomes.org/

An article in Science with some background:  A Plan to Capture Human Diversity in 1000 Genomes

Friday SNPpets

Welcome to our Friday feature link dump: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

Tip of the Week: WAVe, Web Analysis of the Variome

Today’s Tip of the Week is a short introduction to WAVe, or the Web Analysis of the Variome. The tool was recently introduced to us, and I’ve found it a welcome introduction to the tools available to the researcher to analyze human variation. This is apropos considering the recent paper we’ve been discussing on the clinical assessment of a personal genome (here, here and here) and that paper’s implications for personalized medicine and the use of online variation resources. WAVe also has introduced me to some additional tools I’ve either not been aware of, or haven’t used, which might be of use, such as LOVD (Leiden Open Variation Database), QuExT (Query Expansion Tool, from the same developers as WAVe), and others. Of course there is also database information pulled in from Ensembl, Reactome, KEGG, InterPro, PDB, UniProt, NCBI and many others. Take some time to check it out.

Choosing a genome browser for your organism…

There are a number of genome browsers out there–we’ve covered that a number of times.  And there are always new ones coming along.  With the onslaught of sequence data we’re about to get from high-throughput sequencing, more and more research groups, communities, and individuals are going to need to choose a genome browser to use to display their data.

One time I stumbled across the survey results for a group that was choosing a new platform to display their community’s data: MaizeGDB.  I wrote about it then because I thought it was interesting, and because I know people are facing this pretty regularly now.  We get asked.  But since that time they have progressed, implemented, and they wrote up their experience.  It’s now been published in Database.

It’s a pretty straightforward paper.  They describe their needs and their assessment of the resources their community had and used.  They surveyed likely users to see what they wanted, and how they felt about the pieces that already existed.  One piece they specifically noted–when asked, many users did not say they used Ensembl, but the Ensembl software was the foundation of one of the items they did say they used.  MaizeGDB writes:

This result shows that users may not be aware of the underlying browser software that the various web sites use.

Ah, yeah.  Here’s another thing this shows: database end users are definitely not thinking about browser software the same way database developers are.  And I do not mean end users are stupid.  They just do not think about this stuff the way software providers think they do.  We keep trying to tell providers this.  It’s not always well received.

So anyway, they move on to assess the candidates for their new implementation.  They focus on Ensembl, GBrowse, Map Viewer, UCSC Genome Browser, and xGDB.  They describe the framework, possibilities, and limitations of each for their purposes.  I think this is a nice look at the various options that lots of people considering the issue should find useful.  They also acknowledge that there are other browsers that have since come along, or may still come along in the future, that could be considered, but at the time these were the focus.

They go on to describe their implementation experience.  They seem pleased with it.  And they highlight one of their favorite pieces, a Locus Lookup tool, that they have added as well.  It sounds like it’s serving their community really nicely.

This is a highly useful paper for the people in the market for genome browsers.  It’s not for everyone, for sure.  Well, at least not yet.  But your day is coming. You’ll need a browser eventually….

You can check out their GBrowse implementation at: http://gbrowse.maizegdb.org/

And if you are interested you can see our free GBrowse training suite here: http://www.openhelix.com/gbrowse

References:
Sen, T., Harper, L., Schaeffer, M., Andorf, C., Seigfried, T., Campbell, D., & Lawrence, C. (2010). Choosing a genome browser for a Model Organism Database: surveying the Maize community. Database, 2010. DOI: 10.1093/database/baq007

Andorf, C., Lawrence, C., Harper, L., Schaeffer, M., Campbell, D., & Sen, T. (2010). The Locus Lookup tool at MaizeGDB: identification of genomic regions in maize by integrating sequence information with physical and genetic maps. Bioinformatics, 26 (3), 434-436. DOI: 10.1093/bioinformatics/btp556

EDIT: added links to a couple of older blog posts, should have had them in before….

New and Updated Online Tutorials for Ensembl Legacy and Overview of Genome Browsers

Comprehensive tutorials on the publicly available Ensembl and an overview of genome browsers enable researchers to quickly and effectively use these invaluable resources.

Seattle, WA (PRWEB) April 26, 2010 — OpenHelix today announced the availability of a new tutorial on Ensembl, and an updated tutorial suite on the Overview of Genome Browsers.

Ensembl is a genome browser to visualize and analyze the genomes of human and many other species. Though Ensembl recently updated the browser software, many species genome browsers still use the older versions of the browser. OpenHelix has a tutorial on the latest version, and has now created a new tutorial, Ensembl Legacy, to acquaint researchers with the older versions they might encounter. Overview of Genome Browsers is an updated tutorial that introduces researchers to some of the more popular genome browsers including Ensembl, Map Viewer, UCSC Genome Browser, the Integrated Microbial Genomes (IMG) browser and the GBrowse software. These two tutorials, in conjunction with larger, in-depth OpenHelix tutorials on UCSC Genome and Table Browsers, GBrowse, IMG, IMG/M, Ensembl, MapViewer and others, will give you a set of training resources to help you be efficient and effective at accessing and analyzing genome data.

The tutorial suites, available through an annual OpenHelix subscription, contain an online, narrated, multimedia tutorial, which runs in just about any browser connected to the web, along with slides with full script, handouts and exercises. With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. The scripts, handouts and other materials can also be used as a reference or for training others.

These tutorials will teach users:

Ensembl Legacy

*about the Ensembl software and its developers
*how to access older versions of the browser from the Ensembl archive
*the differences and similarities between versions
*about some example installations of Ensembl at other databases

Overview of Genome Browsers

*where to find these 5 useful tools
*an overview of the organization and display features
*some guidance on how or why to choose a given browser for your research needs
To find out more about these and over 85 other tutorial suites visit the OpenHelix Catalog and OpenHelix. Or visit the OpenHelix Blog for up-to-date information on genomics and genomics resources.

About OpenHelix
OpenHelix, LLC, (www.openhelix.com) provides a bioinformatics and genomics search and training portal, giving researchers one place to find and learn how to use resources and databases on the web. The OpenHelix Search portal searches hundreds of resources, tutorial suites and other material to direct researchers to the most relevant resources and OpenHelix training materials for their needs. Researchers and institutions can save time, budget and staff resources by leveraging a subscription to nearly 100 online tutorial suites available through the portal. More efficient use of the most relevant resources means quicker and more effective research.