This week’s SNPpets include something unusual: bioinformatics software becoming a mainstream discussion. A recent NYT piece about Zika genomics included Bandage-based illustrations, and a subsequent explainer piece in SciAm covered it. Zika was big this week. Of course, we covered Bandage months ago…. A reprise and riff on Tardigate was good reading. Also this week: GBrowse for peanut, FireBrowse for the Broad, updates to GeneMania, a Galaxy record hit, and the opposite of an update: UCSC Genome Browser in ASCII. Impersonal genomics made me laugh.
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
For this week’s video tip of the week we’ll explore the Yak Genome Database. Honestly I wouldn’t have predicted a week where I talk about the Sasquatch genome, the abominable snowman (really, it was a Nature paper), and yaks. But genomics is pretty wild these days.
Some folks are getting jaded about the new genome every day we seem to be getting–Carl Zimmer called it YAGS: “yet another genome syndrome” a couple of years ago already. But I’m delighted every time I see a new genome. Certainly the press releases are overselling the results in many cases. However, as Carl also points out:
What remains truly exciting is the kind of research starts after the genomes are sequenced: discovering what genes do, mapping out the networks in which genes cooperate, and reconstructing the deep history of life.
And I completely agree with this. However, I think the research teams deserve a bit of horn-tooting when they roll out their sequencing paper. The foundation for future work needs to be laid and made available with some initial analysis. Then others can take that work further, and the original team can continue to learn more.
The other great thing about the falling price of sequencing and the access it gives new research groups is the range of species we now see coming along. Mushrooms. Birch trees. Puerto Rican parrots. Watermelon. Bananas (with the best Venn diagram in genome papers so far). Some of these are species that only small research groups have focused on before. But the sequence data leads to so many potential novelties in our understanding of their biological niche. Such is the case with yak.
Yaks are probably not on the radar of a lot of American researchers. But this is an important agricultural species for Tibet. It also has climate adaptations that are useful to understand. If we continue to face potentially rapid climate alterations, there are a lot of things we are going to want to know about how species adapt to different scenarios. We may need to help protect them from emerging pathogens. We may need to help coax some to different breeding cycles. And the more data we have about those species, the better.
However, the “big” genome data centers are not always able to absorb new and less-supported species quickly. They have funding focus issues and limited resources too. So these species groups often have to deliver genome access themselves. Most often I see these groups setting up an installation of GBrowse. So understanding how to interact with that software can be really helpful as you look for new insights from new genomes.
So I offer you the Yak Genome Database:
We used the Generic Genome Browser (GBrowse) developed as part of the Generic Model Organism Database project (GMOD; http://gmod.org/wiki/GMOD) to visualize the genome of the yak. In addition, predicted genes, single nucleotide variants (SNVs), multiple types of RNA sets and repeats contained within the YGD can be visualized using GBrowse.
Have a browse around the yak genome. In the browser paper they highlight the ARG2 gene page in Figure 1, and the region of the GPR125 G-protein coupled receptor in Figure 2. I’ll show that in the video as well.
A couple of months back when the Heliconius (Postman) Butterfly genome paper was released, we got to see another example of how the new sequencing technologies are giving us access to more and more genome data–in species that are not the main model organisms. Monarch butterfly genome data had been released prior to that as well. And you may not know that there’s a huge effort to get thousands of insect genomes–the i5k project. I think that’s my favorite thing about where we are today: we can examine more species in more detail than we ever have before. Not only do we get interesting details from the genome sequence framework, but interesting info about species evolutionary relationships, and intriguing and novel biology features can be explored as well. I mean–the human genome and its variations are great–but Monarch butterflies have a sun compass! How cool is that??
And like most genome papers today, only a fraction of the data that was obtained is in the main body of the paper. The “compelling examples” might be there. But of the “12,699 predicted protein-coding genes” of the Heliconius genome, only a handful are really addressed in the text. A few more handfuls in some figures. The earlier Monarch butterfly paper delivered “a set of 16,866 protein-coding genes” (and 10 supplements beyond the paper!). But to access the data yourself and compare to your genes and species of interest you need to turn to the browsers that accompany the papers.
In this case you have two choices for browser styles: the Heliconius Genome Consortium (authors of the paper) maintain a GBrowse installation at their Butterflygenome.org site. The Monarch group has a GBrowse at MonarchBase. In addition, the data for both is also now included in Ensembl as of the July 2012 release 15. [note: see administrative details in the comments --mm]
For this week’s tip we fly around from the species-specific GBrowsers to the collected sets at Ensembl. It’s great to have the species-specific sites for depth of information about the projects and resources, but it’s also nice to have the additional tools and displays of the larger genome browsers. Community browsers can offer very current and new data that might not yet be included in the super-browsers, and the super-browsers may offer additional tools and infrastructure that is not available from the community browsers. Your best bet is to be aware of both, and to get comfortable with the main software features and their strengths and weaknesses.
The bugs are coming–and thousands of them. Be ready. And beware: look for the right superhero…
Note: I have been unable to locate the Mothra genome that’s been all atwitter for the last couple of days.
Useful Links:
MaizeGDB
MaizeGDB tutorials
GBrowse
OpenHelix GBrowse Tutorial

Harper LC, Schaeffer ML, Thistle J, Gardiner JM, Andorf CM, Campbell DA, Cannon EK, Braun BL, Birkett SM, Lawrence CJ, & Sen TZ (2011). The MaizeGDB Genome Browser tutorial: one example of database outreach to biologists via video. Database: the journal of biological databases and curation, 2011. PMID: 21565781
As many of you know, OpenHelix specializes in helping people access and utilize the gold mine of public bioscience data in order to further research. One of the ways that we do this is by creating materials to train people (researchers, clinicians, librarians, and anyone interested in science) on where to find data they are interested in, and how to access data at particular public databases and data repositories. We’ve got over 100 such tutorials on everything from PubMed to the Functional Glycomics Gateway (more on that later).
In addition to creating these tutorials, we also spend a lot of time keeping them accurate and up-to-date. This can be a challenge, especially when many databases or resources have major releases around the same time. Our team continually assesses and updates our materials, and in this post I am happy to announce recently released updates to three of our tutorials: UniProt, World Tour, and Overview of Genome Browsers.
Our Introductory UniProt tutorial shows users how to: perform text searches at UniProt for relevant protein information, search with sequences as a starting point, understand the different types of UniProt records, and create multi-sequence alignments from protein records using Clustal.
In the latest update of Ensembl, the developers added the ability to save configurations. This allows you to set your track views and analysis to a specific configuration and load that configuration at a later time. The blog post linked previously (or here) explains the steps to creating your own configurations you can save and return to. In the future they will be adding the ability to share your configurations with colleagues and other researchers.
As the data deluge continues, and those next-gen sequencing setups and labs continue to crank out more and more data, the details cannot be captured in the papers anymore. They just can’t. Authors can summarize the key findings, and show compelling examples and representative pieces. But they simply can’t show the volume of data that comprises the complete oeuvre from a given project anymore.
This is a point we keep hammering on. Knowing how to effectively use the software that stores and displays this data is now just as important as learning how to read publications in the first place. In the stone age when I was in grad school, most of what you needed to grasp from a paper was within the text and figures in the main body. Those days are gone in genomics, and they are never coming back. However, the software has limitations too. I’ll get to that later…
I was alerted to this interesting paper on Google+ by Robert West (but the specific item was unlinkable, sorry). The research involves analysis of the human mitochondrial transcriptome. Which even as a 1-off sort of assessment would have been interesting. But this group evaluated the transcriptome in over a dozen tissues and cell lines. That’s a lot of data.
And the paper summarizes key highlights–like the fact that the transcriptome does vary by tissue. Heart and muscle have different energy requirements and it appears to be reflected in their mitochondria at the level of transcript abundance. And there is a terrific Circos diagram (Figure 1) to summarize a lot of what they examined and mapped.
But: there’s no way for you to convey in a traditional publication all of those results. No. Way. And yes, I realize there are 6 large supplements attached to this paper. But that’s still not good enough.
So in this week’s tip of the week I’ll show you how to look at the data from this paper in the custom GBrowse that was built for it. We’ll have a look at how to display the tracks you want to explore.
As great as this special browser is, though, this paper made me aware of a limitation of this representation as well. The team of researchers was also interested in nuclear-encoded genes for mitochondrial proteins. That’s also intriguing to think about, because you can imagine tissue-specific issues around nuclear gene expression impacting the functions of the mitochondria. But what you can’t do in this browser is layer that on. I can imagine a way to kludge it together, in fact: you could add those genes to one end of the linear representation, with some spacers, and sort of fake it out, pretending the whole concatenation is the reference sequence.
And then you could compare them all together. But it’s certainly a work-around rather than a real complex visualization. We need better visualization tools. (I have a thought here that a custom Caleydo would work, but I’d be interested in other ideas too).
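To make the kludge concrete, here is a minimal sketch of that workaround: concatenating the mitochondrial reference and the nuclear-encoded genes into one pseudo-reference, with runs of N as spacers, so a single linear browser track could show everything at once. The gene names and sequences here are hypothetical placeholders, not data from the paper.

```python
# Sketch of the "fake reference" workaround: glue the mitochondrial genome
# and nuclear-encoded mitochondrial genes into one sequence, separated by
# runs of N, so one linear browser coordinate system covers both.
# Gene names and sequences are placeholders, not real data.

SPACER = "N" * 1000  # padding so features don't visually run together

mito_genome = "GATCACAGGT"          # the ~16.6 kb mtDNA reference would go here
nuclear_genes = {
    "geneX": "ATGTGGTTCA",          # hypothetical nuclear-encoded gene
    "geneY": "ATGTCGGGAT",
}

def build_fake_reference(mito, genes, spacer=SPACER):
    """Return (sequence, offsets): the concatenated pseudo-reference and the
    0-based start coordinate of each appended gene, so you could write
    annotation features (e.g. GFF3 lines) against the new coordinates."""
    parts = [mito]
    offsets = {}
    pos = len(mito)
    for name, seq in genes.items():
        pos += len(spacer)
        offsets[name] = pos
        parts.append(spacer)
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), offsets

ref, starts = build_fake_reference(mito_genome, nuclear_genes)
```

With the offsets in hand you could emit annotation features in the pseudo-reference coordinates, which is exactly why this is a work-around: the spacer coordinates are meaningless biologically.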
So that’s what I think, to summarize: the data’s not in the papers; you need to be as adept at software as you are at reading; and we need more and better visualization tools. But this was one cool example of all of that, plus a very cool and informative set of results. I’ve been thinking about this for a while since I read it. And those are my favorite papers–the ones that make me think about a whole bunch of different things.
Reference: Mercer, T., Neph, S., Dinger, M., Crawford, J., Smith, M., Shearwood, A., Haugen, E., Bracken, C., Rackham, O., Stamatoyannopoulos, J., Filipovska, A., & Mattick, J. (2011). The Human Mitochondrial Transcriptome. Cell, 146(4), 645-658. DOI: 10.1016/j.cell.2011.06.051
Here at OpenHelix we think a lot about the differences between nominally similar software that will accomplish some given task. For example, in our workshops we are often asked about the differences between genome browsers. Although UCSC sponsors our workshops and training materials on their browser, we know they aren’t the only genome browser out there and we can talk about them all–in fact, that’s one of the coolest things about being separate from UCSC or a specific software tool provider/grant–we can talk about everyone! And our answer is usually something like this:
The basic foundation of the “official reference sequence” is usually the same in all the main browsers. However, the way they choose to organize the display, the tools for showing/hiding annotation data, and the custom query and display options vary. But they generally all have some mechanism for this. For me, usually the choice comes down to what data I need to look at–and how a given software tool shows me that and lets me interact with it.
I know that’s largely an end-user perspective, but that’s who is attending our workshops. I can remember talking to one guy at our conference booth who only wanted to use a genome browser with the reference sequence display organized vertically. I gave him Map Viewer. Some people need a specific species–and no matter how good the software is, if your research species isn’t in there, it just doesn’t matter…. I’ve seen super-users on twitter complain about the look of the background at one browser or another. That doesn’t have much bearing on my choice–but I do have to say I really hate “hidden” menus and features you have to hover and dig to find, in general. What you don’t see is just impossible to know as an end-user.
But quite frankly when I’m looking for some details in a given region for a research use, I often explore all the browsers I know because of their differences in display and available data to show–to make sure I’m not missing anything. It doesn’t take that long to use them all (if you know your way around, and I think I do…).
But one group has tried to quantify the differences between software tools in a standardized way with specific metrics. A group from INRA has collected and assessed various characteristics of genome browsers, and has developed a database where you can look at what they have curated. It’s called CompaGB.
You can assess the features as one of these profiles: biologist, computational biologist, or computer scientist at this time. In this tip of the week I explore the CompaGB interface, from an end-user biologist perspective. I’ll choose a couple of browsers to compare, and we can look at the type of things that the CompaGB team scores to give you a sense of what you can find. For developers you’ll see there are different metrics and you should go back and explore those as well.
In their paper they describe their inspiration for this project–which is QSOS. The Qualification and Selection of Open Sources software project provides a model and framework to standardize descriptions of available software features. The QSOS framework is illustrated in this graphic on their Welcome page:
In short, they have 4 steps: defining frames of reference appropriate for the software tool; assessing the features; qualifying the features with a weighting mechanism; and selecting the appropriate tool.
You can easily see how the CompaGB team integrated these ideas in their database of genome browser comparisons. They let you choose criteria you are interested in, and offer a radar plot display as well as a tabular representation of the scores so you can consider the overall view or the details.
There are scores for “full, limited/medium, and poor” but not a lot of detail on that. They assessed the tools in 4 sections: (1) technical features, (2) data content, (3) GUI, and (4) annotation editing and creation. There is apparently no swimsuit competition. Alas.
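To illustrate how a QSOS-style qualify-and-weight step can turn those qualitative scores into a ranking, here is a small sketch. The numeric mapping, the browser scores, and the weights are all hypothetical illustrations, not CompaGB’s actual data or formula.

```python
# Sketch of QSOS-style weighted comparison: map qualitative scores to
# numbers, weight each criterion by what matters to you, and rank tools.
# All scores and weights below are made-up illustrations.

LEVELS = {"poor": 0, "limited": 1, "full": 2}  # qualitative -> numeric

def weighted_score(scores, weights):
    """Combine per-criterion qualitative scores with user-chosen weights."""
    return sum(LEVELS[scores[c]] * weights.get(c, 1) for c in scores)

browsers = {
    "BrowserA": {"technical": "full", "data": "limited", "gui": "full"},
    "BrowserB": {"technical": "limited", "data": "full", "gui": "poor"},
}

# An end-user biologist might weight the GUI heavily, a sysadmin might not:
my_weights = {"technical": 1, "data": 2, "gui": 3}

ranked = sorted(browsers,
                key=lambda b: weighted_score(browsers[b], my_weights),
                reverse=True)
```

The point of the weighting step is that the same raw assessments can produce different “winners” for different user profiles, which is why CompaGB offers biologist, computational biologist, and computer scientist views.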
The paper says that 4 different evaluators examined the tools (at this time, 7 different genome browsers: MuGeN, GBrowse, UCSC, Ensembl, Artemis, JBrowse, Dalliance). They have version numbers; for example, you can compare the 2 widely used GBrowse versions right now. How often these will get re-evaluated I don’t know. And how to compare different installations of GBrowse at different sites is not really clear to me, since they can vary a lot by what the project team wants and needs to implement.
One evaluator did each tool in most cases. And reportedly the results were sent to the database providers for checking. I have no idea what was sent to UCSC on the training issue… [*cough* I have issues with the UCSC training score details, for example...Yeah, we do workshops and so does UCSC. Lots of workshops around the world, we have slides and exercises...I'll show you in the tip where I saw it.] They do encourage users to comment or suggest on their web site if you have supplementary information–I may want to add some details later ;) And it appears possible to create new items and curate, but I haven’t tried this. They also say they are re-vamping the evaluation process going forward to simplify it.
But…this statement in their paper surprised me:
The UCSC browser natively displays a broad range of human annotations, including cross-species comparisons. UCSC browser’s underlying strategy focuses upon centralizing data on UCSC servers and, as far as we know, no external lab has installed it locally for the purpose of storing and browsing their own data.
Ummm–no. We talk to people all over the place who maintain local installations of UCSC. Quite often in hospital situations where patient data privacy is a major issue there are local installs; certain companies have them. But there are others as well–among them a bunch of mirrors around the world. There’s a whole separate mailing list where people discuss their issues with their own installations. But we’ve also seen the UCSC infrastructure used for other species that UCSC doesn’t support such as HIV, malaria, phage browsers, and more. Maybe this unusual setup of the UCSC software at the Epigenetics Visualization Hub would be interesting to be aware of. And we know the UCSC team consults with groups and helps them to do it. And by the way–we mention in our tutorials and workshops that we’ve done around the world that other installations are possible and available. And we know that the materials we provide are used in many countries to do local trainings as well.
So it was an interesting attempt to measure software features, and I understand why they attempted it, but it seems challenging to scale and maintain. And the curation strategy will have to be considered when evaluating the data. These are fixable if the project proceeds beyond this early set of browsers and branches out to other types of open source software. It really is hard to know what’s worth spending your time on, I admit. And that’s why we hope end-users have a look at our training materials to get introduced to a specific site and see if it suits their needs, and they can kick the tires with the exercises.
I am pleased to formally announce pre-registration for the upcoming GMOD community meeting which will take place October 12-13 in Toronto, Ontario, Canada, hosted by the Ontario Institute for Cancer Research.
And while it is still under construction, the registration page should be available by the end of next week, along with information about the keynote speaker(s) and logistics like hotels. In the meantime, I urge you to go to the meeting page and add suggested topics and talks to the appropriate section.
Finally, the meeting itself will be limited in size, so when registration is open, I urge you to register as soon as possible, since I may need to close registration when we are full.
Thanks and I look forward to seeing you in Toronto, Scott
From the ethers comes word of the GMOD spring trainings. I hear how valuable these are for folks who are working with the Generic Model Organism tools like GBrowse, Apollo, Chado, and more from the installation/configuration perspective. (Our trainings focus on the end users.) And this time they are also listing Galaxy as one of the tools they’ll be training on! Here’s the full text, with the links to find out more:
Applications are now being accepted for the 2011 GMOD Spring Training
course, a five-day hands-on school aimed at teaching new GMOD
administrators how to install, configure and integrate popular GMOD
components. The course will be held March 8-12 at the US National
Evolutionary Synthesis Center (NESCent) in Durham, North Carolina, as
part of GMOD Americas 2011.
These components will be covered:
* Apollo – genome annotation editor
* Chado – biological database schema
* Galaxy – workflow system
* GBrowse – genome viewer
* GBrowse_syn – synteny viewer
* GFF3 – genome annotation file format and tools
* InterMine – biological data mining system
* JBrowse – next generation genome browser
* MAKER – genome annotation pipeline
* Tripal – web front end to Chado databases
The deadline for applying is the end of Friday, January 7, 2011.
Admission is competitive and is based on the strength of the
application, especially the statement of interest. The 2010 school had
over 60 applicants for the 25 slots. Any application received after
deadline will be automatically placed on the waiting list.
The course requires some knowledge of Linux as a prerequisite. The
registration fee will be $265 (only $53 per day!). There will be a
limited number of scholarships available.
This may be the only GMOD School offered in 2011. If you are
interested, you are strongly encouraged to apply by January 7.