Last fall I noticed an announcement at Biostar about an upcoming webinar that would illustrate some new features in the IGB browser. And at the time I highlighted some of their materials as our video tip that week. So for more details you can check out that overview.
But recently I was told that their longer-form introduction is available again. If you are interested in the different functionalities of various browsers this is perhaps a better overview. So now that it’s viewable again I thought I’d offer that as the tip this week. It’s in 2 parts; the first one is here:
To get an idea of how it’s used in the field, have a look at data about blueberries. There’s a SlideShare of Ann Loraine’s recent presentation that shows you where to find their blueberry RNA-Seq data. On slide 13 there’s a neat example of the different transcripts present in ripe and unripe fruit. With those details (focus on the Cuff.187.1 region, load up the RNA-Seq Berry Development tracks, and load the “coverage” files in the Graphs folder) I was able to see exactly what they show as the difference among the data sets. And I loved how they were color-coded to match the berry stages–I thought that was very effective. The slides go on to show further steps of annotation and exploration with Blast2GO and PlantCyc. And they show some sample data of pathways that are altered over developmental time points.
Nicol J.W., Helt G.A., Blanchard S.G., Raja A. & Loraine A.E. (2009). The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets, Bioinformatics, 25 (20) 2730-2731. DOI: 10.1093/bioinformatics/btp472
One of the topics I keep an eye on is visualization of various types of genomics data, and I’m always interested in new tools for graphical representations. In the past some of our most popular posts have been tools that aren’t heavy-lifting analysis types of tools–but better ways to visualize and explore data, or different ways to present it.
This week’s tip of the week is a tool of this type–Ambiscript Mosaic offers a new way to look at nucleotides in stretches of sequence data. Now, I know–you think: new ways to look at A, T, G, and C? Really? Do we need this? And I’ll admit I wasn’t convinced at first. When I read the first paper on it the sequences just looked like Elvish–which I thought was cute, but I wasn’t convinced it was useful. But the more I thought about this interesting abstraction and read more about it, the more I liked the concept.
The basic idea is that the Roman letters for ATGC are certainly important and useful. But they can be represented with graphical elements that convey more detail visually. And the 5′ to 3′ representation of letter-based sequence info offers one way to think about the sequence, but the reverse complement of that requires a translation step. However, if represented graphically, the same data is just a physical flip away with no additional changes.
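To make the "translation step" concrete, here is a minimal Python sketch of what it takes to get the reverse complement from a letter-based sequence: every base is swapped for its partner and the string is then reversed. The function names are my own for illustration and have nothing to do with the Ambiscript software; the sketch also ignores ambiguity codes such as N or R.

```python
# Complement table for the four unambiguous DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    # Step 1: swap each base for its pairing partner; step 2: reverse the string.
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True when a site reads the same on both strands (e.g. GAATTC)."""
    return seq.upper() == reverse_complement(seq.upper())


print(reverse_complement("ATGCCG"))  # CGGCAT
print(is_palindromic("GAATTC"))      # True
```

With letters, the lookup table is unavoidable. The appeal of a graphical notation is that the same result comes from physically flipping the text.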
This strategy isn’t one that you’d want to replace every view of sequence data, of course. But for some purposes this might offer a new view of the information that will be better suited to seeing some types of motifs or patterns.
In this week’s tip I’ll illustrate an example of how this type of visualization could offer a complementary way to evaluate a particular DNA motif. As a bonus, I’ll also provide the video of the presentation by the Rozak team that helped me to understand why this offers something different from the letter system. You can see it on their site, but I wanted to have a video version of it as well for cross-platform access.
For the demonstration video, I chose to compare the sequence logo style representation generated by the MEME suite tools with this graphical notation. MEME is a tool that I would use to identify motifs–to do the heavy lifting of the analysis part–and then visualize the results. They offer several ways to visually examine the results, and one of them is a sequence logo. The MEME documentation offers a sample motif, which I used to then display the Mosaic style. Here is MEME above, and Mosaic below it:
In the demonstration video I don’t have the time to cover a number of the useful aspects of the graphical strategies employed by the Ambiscript tool–this just covers the basics. Be sure to read their papers and see that other background video to understand more about the actual graphical representational choices and details of colors and shading, for example. There’s a lot more thought behind this than I had time to cover. I didn’t show gaps here either, but it can account for gaps.
This bonus video offers some of the background and foundations of the graphical representations they’ve selected. It is based on the prior work, so it doesn’t have some of the additional features that the Mosaic paper describes. But it helps to explain the conceptual basis for the styles. It helped me to connect to the ideas about the choices for graphics. There’s no audio with it, it’s just a conversion of the slide walk-through.
This tool is unusual, I know–I’m sure not everyone will want to let go of ATGCs as letters. And it won’t be suited for every sequence visualization purpose. It took me a while to wrap my head around the idea of not having the letters there. But as a different way to consider sequence data, I think it could be useful for exploring some features. You’ll still want to use algorithms like those in the MEME suite to discover features like possible transcription factor binding motifs. But you can think about seeing them differently with Ambiscript Mosaic.
Credits or quick links to things you saw in the demo video:
Thanks to David Rozak for permission to convert the slide presentation to video.
Rozak D. & Rozak A. (2008). Simplicity, function, and legibility in an enhanced ambigraphic nucleic acid notation, BioTechniques, 44 (6) 811-813. DOI: 10.2144/000112727
Rozak D.A. & Rozak A.J. (2014). Using a color-coded ambigraphic nucleic acid notation to visualize conserved palindromic motifs within and across genomes, BMC Genomics, 15 (1) 52. DOI: 10.1186/1471-2164-15-52
Bailey T.L., Boden M., Buske F.A., Frith M., Grant C.E., Clementi L., Ren J., Li W.W. & Noble W.S. (2009). MEME SUITE: tools for motif discovery and searching, Nucleic Acids Research, 37 (Web Server) W202-W208. DOI: 10.1093/nar/gkp335
This week’s Tip of the Week is a bit different. The database resource that’s the focus of this piece doesn’t exist yet. Parts of it do, but there’s a ways to go before we actually have a Centralized Model Organism Database (CMOD).
The ideas that Andrew Su offers for CMOD in this talk are ones that we have to start moving towards. There has got to be a way to capture more of the annotation information that scientists have–and others need–from and for all of these sequencing projects that are flowing in daily at this point.
Using the current infrastructure of GMOD (Generic Model Organism Database) and the large community of users of the resources like the numerous GBrowses that are already out there, we’ve got access to a lot of organism-specific community-based information (even yak, butterflies, and trees among them; will these all continue to be supported individually?). But some of these species coming along lack the community size or resourcing that the big ones have. And the way we are doing things now just doesn’t scale.
I know there have been multiple attempts to capture the Wikipediean-type model of community curation, with varying success. Personally I still want a group of professional curators involved–but if we can supplement their work with additional information from the wider community that would be great. And if we can have the professionals help to seed and maintain information with new tools and strategies it would help encourage volunteer curation too. So in this talk you’ll hear more about these issues and how Su’s collaborators have approached this so far, with the reasons for the directions they chose and their experiences with Gene Wiki curation.
The video misses a bit of the intro and the questions at the end, but there’s plenty to chew on still. And you can follow along with the slide deck too (I put that in below, but you can also go directly there).
For a paper on the framework of the Gene Wiki project: Good B.M., Clarke E.L., de Alfaro L. & Su A.I. (2011). The Gene Wiki in 2011: community intelligence applied to human gene annotation, Nucleic Acids Research, 40 (D1) D1255-D1261. DOI: 10.1093/nar/gkr925
Issues around biocuration in general: Howe D., Costanzo M., Fey P., Gojobori T., Hannick L., Hide W., Hill D.P., Kania R., Schaeffer M., St Pierre S. et al. (2008). Big data: The future of biocuration, Nature, 455 (7209) 47-50. DOI: 10.1038/455047a
Metagenomics analysis can be a bit daunting at times, but there are a good number of tools out there to assist a researcher in analysis. Integrated Microbial Genomes at JGI has some excellent tools such as IMG/M and IMG HMP M (OpenHelix tutorial). QIIME is another excellent tool that I suggest you check out.
But the above is not per se a metagenomics tutorial; rather, it’s a short screencast of how to use the Galaxy interface for loading data and datatypes. Why? Because another excellent set of tools to use for metagenomic analysis is MetaPhlAn from the Huttenhower lab at Harvard.
This tip isn’t bioinformatics per se–but it’s a tool that I recently found very quick and handy to prioritize a giant pile of literature that I had in my lap. I’ve been participating in a curation project in which all the papers have to get into a database–but because the data extraction process is uneven I wanted to prioritize some groups in a meaningful (but quick) way. I needed rapid and bespoke text-mining.
“Overview” will do that for you. You can take a giant pile of documents–in my case PDFs–and ask it to quickly sort them into subsets based on words of interest to you. It’s pretty flexible–you can ask it for new sorting or tagging words on the fly. But then you can also tag the subsets with handy reminders, or other categorizations that you need.
Certainly there may be more text-mining you want to do with your literature after–but for a quick sort, and potential way to do discovery on some word combinations–this is a really handy way to explore. And of course it’s not limited to PDFs. You could do a batch of tweets from a conference. You could sort emails. You could sort NSA- or WikiLeaks-style document dumps–should you be so inclined.
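If you want a feel for what a quick keyword sort involves, here is a minimal Python sketch. It is not how Overview works internally; it only illustrates grouping documents by words of interest. The function name, the file names, and the sample text are all hypothetical, and it assumes the text has already been extracted from the PDFs.

```python
import re
from collections import defaultdict


def tag_documents(docs, keywords):
    """Group document names under each keyword found in their text.

    docs: dict mapping a document name to its plain text (hypothetical input).
    keywords: words of interest, matched case-insensitively as whole words.
    """
    groups = defaultdict(list)
    for name, text in docs.items():
        # Break the text into lowercase word tokens for whole-word matching.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        matched = [kw for kw in keywords if kw.lower() in words]
        for kw in matched:
            groups[kw].append(name)
        if not matched:
            # Keep track of documents that none of the keywords caught.
            groups["untagged"].append(name)
    return dict(groups)


# Made-up example documents.
docs = {
    "paper1.pdf": "Expression profiling in mouse liver tissue.",
    "paper2.pdf": "A zebrafish model of heart regeneration.",
    "paper3.pdf": "Comparative genomics of mouse and zebrafish.",
    "paper4.pdf": "Methods for database curation.",
}
print(tag_documents(docs, ["mouse", "zebrafish"]))
```

A document can land in more than one group, which matches how I wanted to use the tags: a paper about two species should show up under both.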
Overview is a new free tool designed for investigative journalists and researchers interested in finding relevant information within large collections of text documents, from reports to social media tweets.
Overview greatly simplifies the task of analyzing, indexing and visualising large document collections in ways that can allow a journalist to identify relevant patterns and threads across thousands of different documents.
I’ll let their video describe how it works–I found it was really simple and effective on a huge folder of papers I had. I could sort them by species, and then by other useful terms, and more, really quickly once everything was loaded.
I like the intuitive folder flow. I like the color coding. I found the tagging really handy. There’s another video I found helpful to get started with my documents: Learn Overview in 90 seconds. I had to look up a couple of other things, but I found everything I needed to get working with the data set very quickly at their site.
Their site: Overviewproject.org and you can use it online. Or you can download the code from Github and set up your own.
A couple of weeks back we did a workshop on the UCSC Genome Browser, and I was asked a question we see pretty frequently: Is there a way to export the browser view that you selected with specific tracks, filters, regions, etc? People may want to have a record of their customized view in a lab notebook, or use it for teaching, or in a seminar perhaps–or of course to publish your awesome observations in journals.
Most of the time I just take screen shots of what I need with a screen capture tool (my personal favorite is Snag-It from TechSmith). But there may be times you want something a bit heavier-duty. If you are going to do a poster, or submit it for publication, for example, you might want a nice PostScript version you can work with and edit further. At UCSC, the way to do that is with the “View” menu option here for PDF/PS:
Export the browser image to a file for further editing or use.
When you get a file, you can take it down and use Adobe graphics tools if you have them, or a free open-source one like Inkscape. You can change the colors, delete stuff, add more annotations, etc.
So when I saw that there was a similar function with the NCBI’s Sequence Viewer tool, I thought I should mention that as well. They have a nice and clear video that illustrates how to accomplish getting the image out of the Viewer and into a file.
Click the “Graphics” link on the page to open the Sequence Viewer.
After you get to the Sequence Viewer, follow the instructions just as they play out in the YouTube video. It’s pretty straightforward–just be careful to click the right menu for PDFs.
If you haven’t used the NCBI Sequence Viewer much, you should definitely check it out. There are some other helpful videos for more features as well. And another neat feature is that you can embed the Sequence Viewer in your own web pages.
All of the genome browsers have different features and functions, and it’s nice to know that there are various strategies to accomplish tasks you might need to get done.
Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M. et al. (2013). The UCSC Genome Browser database: 2014 update, Nucleic Acids Research, 42 (D1) D764-D770. DOI: 10.1093/nar/gkt1168
Acland A., Agarwala R., Barrett T., Beck J., Benson D.A., Bollin C., Bolton E., Bryant S.H., Canese K., Church D.M. et al. (2013). Database resources of the National Center for Biotechnology Information, Nucleic Acids Research, 42 (D1) D7-D17. DOI: 10.1093/nar/gkt1146
Last week I talked about some of the terrific visualization tools from the Caleydo team, the ones that are focused on looking at pathway data. There’s another tool that I learned about in their newsletter that offers another type of visualization, which you can also supplement with pathway data. StratomeX offers a look at comparisons within stratified data. For example, you might want to look at some cancer data with certain subsets of patients to evaluate subtypes.
A review paper by Schroeder et al last year examined a number of cancer-related analysis tools and websites. As with any group of tools, some have specific features that are suited for specific tasks. And there may be times you are going to take the data you obtain from one tool and explore it with another. So it’s useful to have a number of these in your back pocket. They described StratomeX this way:
Caleydo StratomeX is especially well suited to exploring relationships between groups of samples (Figure 2). These relationships are visualized as ribbons of varying width drawn between neighboring columns. Wide ribbons encode a high co-occurrence of samples in different groupings, whereas their absence indicates mutual exclusion. This coding provides a straightforward and scalable overview of the consistency of group memberships of tumor samples across different data types.
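To see what the ribbons are encoding, here is a minimal Python sketch of the underlying idea: count how many samples each pair of groups shares across two different groupings. This is not Caleydo code; the function name, sample IDs, and group labels are all made up for illustration.

```python
from collections import Counter


def ribbon_widths(grouping_a, grouping_b):
    """Count samples shared by each pair of groups from two stratifications.

    grouping_a, grouping_b: dicts mapping a sample ID to its group label.
    Returns a Counter keyed by (group_in_a, group_in_b). Pairs that never
    co-occur are absent, so looking them up gives 0.
    """
    # Only samples present in both groupings can contribute to a ribbon.
    shared = grouping_a.keys() & grouping_b.keys()
    return Counter((grouping_a[s], grouping_b[s]) for s in shared)


# Made-up example: five samples stratified two different ways.
expression_cluster = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "B"}
mutation_status = {
    "s1": "mutant",
    "s2": "mutant",
    "s3": "wildtype",
    "s4": "wildtype",
    "s5": "mutant",
}

widths = ribbon_widths(expression_cluster, mutation_status)
for pair, count in sorted(widths.items()):
    print(pair, count)
```

A large count corresponds to a wide ribbon, and a missing pair (here, cluster A with wildtype) corresponds to the mutual exclusion that the quote describes.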
Of course, the StratomeX team also has publications and presentations that describe the work in more detail, which you can obtain from the introduction page they have provided. That page offers a nice overview level, though, which can help you to begin to assess whether this tool would suit your research goals.
But also have a look at their video, which covers their Data View Integrator, and StratomeX. It’s a really compelling example of glioblastoma subtypes. The gene expression data can be supplemented with other data types–in this case “days to death” curves, and then pathway data can be brought in to provide further insights.
This tool is another winner from the Caleydo team. Usually we highlight web-based tools, but these tools need a little bit more overhead: you have to download and install them. But I think they are really worth it. You should try the Caleydo team’s tools.
References: Lex A., Streit M., Schulz H.J., Partl C., Schmalstieg D., Park P.J. & Gehlenborg N. (2012). StratomeX: Visual Analysis of Large-Scale Heterogeneous Genomics Data for Cancer Subtype Characterization, Computer Graphics Forum, 31 (3pt3) 1175-1184. DOI: 10.1111/j.1467-8659.2012.03110.x
Schroeder M.P., Gonzalez-Perez A. & Lopez-Bigas N. (2013). Visualizing multidimensional cancer genomics data, Genome Medicine, 5 (1) 9. DOI: 10.1186/gm413
Have you dreamed of looking at genomic pathway data, with experimental information aligned with known pathway details, and wandering easily from one pathway node to another as you consider the implication of increased/decreased gene expression, or potential copy number variations? Easily hopping to related pathways to keep looking? Yeah–me too, for years. If this is something that you think might help your work–have a look at the tools from the Caleydo team.
Last month I got a newsletter from the team. But due to the holidays and some deadlines I hadn’t had time to look into it. When I went back last week to have a look at some of their new features, I found myself just as impressed as when I looked at their software years ago.
Back then, I loved the idea of having the different views combined for the pathways and the expression levels. And now there are even more ways to do this, and the tools are even cooler.
Here I’ll focus on their “Entourage” and “enRoute” visualization strategies for the pathway data. They are different but related. With these tools you can wander through the KEGG pathway maps. With Entourage–you can have one of the maps in the main area loaded up as the “focus” pathway–but off to the side you can choose to explore other related pathways (context), or jump to ones with specific genes. You can take “portals” to other pathways in such a slick way it’s just amazing.
To get a sense of these features, first take a look at their video for Entourage.
Next, here is their YouTube video of the enRoute components. You can again load up a KEGG pathway, or something from WikiPathways. Nearby you can visualize expression levels or copy number status.
I liked the visualizations from Caleydo before, and these additional pieces on that Caleydo framework are even better now. Have a look and kick the tires. There’s more than I covered here, and more than appears in the videos. So read the papers, check out their site, and download the software. I think there’s a lot of utility here.
Lex A., Partl C., Kalkofen D., Streit M., Gratzl S., Wassermann A.M., Schmalstieg D. & Pfister H. (2013). Entourage: Visualizing Relationships between Biological Pathways using Contextual Subsets, IEEE Transactions on Visualization and Computer Graphics, 19 (12) 2536-2545. DOI: 10.1109/TVCG.2013.154
Partl C., Lex A., Streit M., Kalkofen D., Kashofer K. & Schmalstieg D. (2013). enRoute: dynamic path extraction from biological pathway maps for exploring heterogeneous experimental datasets, BMC Bioinformatics, 14 (Suppl 19) S3. DOI: 10.1186/1471-2105-14-S19-S3
BioStar is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at BioStar.
There’s a range of topics that come up at Biostar. But there are some that I find particularly suited to this kind of community discussion that can be hard to find from other places. When you are talking to hardware vendors, they may have good information–but they may prefer to guide you towards pieces that aren’t so right for your anticipated needs. This week someone needs some guidance on hardware setup for crop plant genomics. If you have any insight on this, please help out.
We are creating a specification for a new server for our research scientists in applied genetics and breeding (I am a computer scientist).
Any advice regarding the number of processors/cores and RAM would be most helpful. Here is a background to the work being carried out, kindly written by a colleague of mine:
[list of stuff they do, go see it over there]
Unfortunately, the budget is small for the server itself – approximately £10,000 (~ 16,000 USD), though it could be increased if it is not enough to work with. Some of the servers I have looked at offer 256GB of RAM. Would this be enough to run the tools cited above? Do they typically page to disk if memory is not available or just fall over?
On the data storage side, I think we would be ok to start with as there will be a dedicated SAN of 24TB with spare bays.
Finally, we would like to use VMWare and use the server as a host (but just for one virtual server). Would this present any issues?
What’s awesome about that is how eager people are to deliver their information to other people, and go beyond PowerPoint. (That said, you can go overboard on these animations and styles, and I find they have some drawbacks around re-using the materials or providing them in other formats, like printout/handouts…). But sometimes they are really effective for the right materials and situation.
So when I saw this tweet the other day, I thought I ought to check out this tool called Slidify:
From the Slidify homepage you have access to more details. They also provide one video to introduce you to the functions. But there are other videos that offer further examples as well–including the recently uploaded Slidify Playground (and you can dance to it). For this week’s video tip of the week I’ll highlight the intro one:
Certainly Slidify (and RStudio) aren’t for everyone–but if you are using R tools it might be of interest to you. If you don’t know anything about using R for analysis there was a nice intro to it a while back that we highlighted: Introduction to R Statistical Software (with video).