This week’s tips offer some software, including a pre-print from one of my favorite groups–the folks who do great visualizations of sets. I have talked about UpSet before, but now there’s an R package for it. Speaking of great visuals, check out the sponge-microbe symbiosis. And 10 legume genomes. Cannabis as a gateway to plant genomics. And a story of entrepreneurship, and how it ends.
And for kicks, there’s a video that I helped to script-edit that’s been popular: Are GMOs Good or Bad?
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
This week’s SNPpets offer a rather eclectic collection. Visualization with PanViz, LepBase for lepidopterans, a simulated data generator, and a new collection of community-curated phylogenetic estimates. But the big noise was the cancer “moonshot” data commons and the clinical trial for NCI-MATCH and precision medicine. Also newborn genome sequencing. Funniest thing: passive-aggressive bioinformatics. Coolest thing: Paul Simon and CRISPR (scroll to the bottom of the list).
This week’s highlighted question is from the Bioinformatics subreddit. A simple question about advice for some poster design turned into quite the conversation about what people want from software posters.
I’m a first year PhD student in Bioinformatics. I’m creating a poster for the symposium of the department to present a software I developed.
There is no algorithm to discuss, no discovery, no findings as it’s mainly a data visualisation tool. The public will be biologists and bioinformaticians
I have problems to organise the abstract and the poster, do you have some advices / examples of posters “only” presenting a software. What are the important things to talk about, the things I don’t have to forget ?
I guess no one is really interested to read something about the implementation, software architecture and other developper’s stuff.
So the discussion ensued about the role of the poster. And the role of the presenter. Whether to have code or not to have code. I was actually kind of amused at how deep people’s feelings were on some of these matters. Anyway, I thought it was a fun read. Have a look.
At the recent Discovery On Target conference, a workshop on data and analytics for drug discovery contained several informative talks. This week’s Video Tip of the Week was inspired by the first speaker in that session, Georges Grinstein. Not only was the software he talked about something I wanted to examine right away (Weave)–his philosophy on visualization of data was so in line with my informal thoughts on the topic that I just connected with it immediately. But also–stay for the “living figures” down below.
Grinstein on dataviz at VIZBI.
Grinstein has been working on dataviz for a long time. And he’s been working with big data since long before big data was trendy. For some of his background and philosophy, check out this talk at a VIZBI conference. Because so many of the problems are the same across big data types, the software that he’s been working on could really be useful for the new issues facing big data in biology. But I don’t know that I’ve heard about it among the genoscenti just yet. (In this talk he also covers RadViz, a radial visualization tool that some folks might find useful. It was also mentioned in the workshop.)
One of the key things that he wanted us to take away from the workshop was that we need to offer people multiple, interactive visualizations for them to get the most out of the data. This is something I’ve been looking for quite a bit. I fell in love with an early version of the Caleydo stuff for exactly this reason. But I understand that it can be tricky.
Weave, or the Web-based Analysis and Visualization Environment, gets closer to this, and with more responsiveness, than anything I’ve seen elsewhere. This week’s Video Tip is a short intro to this platform, but I’ll link you below to a longer form that you should watch if you want to dive into this tool. Here you’ll see that just by dragging a CSV file in, you can then set up a scatter plot, bar chart, parallel coordinates, a color histogram, and a table. In seconds. Really.
This brief intro doesn’t do full justice to this tool, of course. I joined the Weave-users discussion group and found a recent webinar recording that you should watch. But you’ll have to grab it from the group, because it doesn’t appear to be stored on a video platform site (search for the thread called “IVPR Update on Weave Monday 3/23”). It goes into more detail on the features, of course. It also covers sharing data, and reproducibility of the visualizations with the session history options.
I downloaded the Weave Desktop and ran it on my little system. I grabbed some transcription factor score data from the ENCODE project with the UCSC Table Browser, got it in csv format, pulled it in, and within seconds was looking over all the data on the X chromosome for this TFBS I was interested in. Clicking an item in my table highlighted it in my histogram. And that was just to kick the tires. According to the video, you could have had a tile of Cytoscape (because you can integrate with Cytoscape–I didn’t get that far yet though) and checked out interaction data as well. Although I mention Cytoscape because readers here probably know it, that’s just one of the linkable tools. R is embedded, and other stats tools, and you can modify your scripts right from Weave. Some of these additional features may be part of the Analyst Workstation sub-project. I couldn’t always tell which tool had which features in my early explorations.
But if there’s one thing I’d like you to do after reading this post (if you read this far), it’s to look at this paper that is just out. As I was noodling on Weave, I thought to myself that it was PERFECT to create the kind of “living figures” that I want to see in more papers. Now go see Dynamic Data Visualization with Weave and Brain Choropleths. I don’t care if you aren’t interested in brain choropleths–go look at the figures. In each one, there’s a link to a Weave demo, like this:
Click on those demos to load them. You can be interacting with the data on the brain maps, with pre-set Weave tiles of different features of the data set for you. Open the gears icons to change the settings. Now imagine this with gene expression maps in C. elegans bodies. Or with transcription factors and scores in mouse embryos. Or Venns with big piles of GO terms (but what I really want there is UpSet anyway). Or any of a dozen other types of data we get in big data papers now that are really impossible to explore in traditional publication format. I want this for genomics papers in the future, okay?
This software has a lot of potential for analysis, visualization, and sharing of data. I can’t cover it all in a brief blog post. The Weave team has thought carefully about sharing with colleagues, reusable templates, and provenance of data, and all this is built right into this tool. If you are analyzing data for others, you can set up dashboards for them to see specific views. See their help and info docs for more details, and check out the longer videos in the forum. I think it would connect with a lot of people–and could benefit the genomics community greatly. Have a look. I think you’ll like it.
This week’s tip is not our usual short video. We’ll connect you to our newest tutorial suite, our World Tour of Genomics Resources, part II. Our previous tour was really popular–because as much as bench researchers know about the tools they currently use–everyone realizes there are more tools out there. And many of them don’t realize that there could be some very handy ones for tasks that they have.
This time the tour discusses not only tools for which we have full tutorial suites (video, slides, handouts, exercises), but also a lot of the handy problem-solving tools that we cover in our weekly tips. Things like UpSet for exploring data relationships among sets–which scales way better than Venn diagrams for genomics data sets. Or like Slidify to make slides from RStudio directly. We won’t have full training suites on these, but people will find them really useful in their daily work.
Sometimes we will also add tips about tools for which we have suites, but that have new features. For example, although thousands of people watch our UCSC Genome Browser full trainings, we also have tips that highlight new features or tools that aren’t part of the basic intro–such as new wiggle track features, or the Genome Browser in a Box. So we help people keep current in the field this way, even with existing tools they use.
But still we adhere to the philosophy that we explained in our paper (below): raising awareness of tools that are out there, and helping people find and use them effectively. This World Tour illustrates that.
This week’s highlighted item from Biostars gets back to the visualization challenges that I love to think about. The question posted asked for help for an 11-set Venn diagram. What was funny about the response was that the overwhelming consensus was: please, no! And alternatives were offered.
Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.
The question was a frequent problem in various data sets. You want to find the members of groups that overlap in different conditions, treatment situations, genes present or absent in different species, whatever. In the most famous case of Venn illustration, the banana genome team created a much-discussed masterpiece of sets of genes in shared gene families among other plant species. It was so astonishing to look at that it even got Cory Doctorow’s attention: Just look at that banana genome Venn diagram. But genomics Venn diagrams get around. Here’s one that became fashion:
However, part of the problem with the Venn is that it was so difficult to interpret. As a developer of visualization tools told me later, Venns do not scale well for genomics types of data. He was UpSet about genome folks trying to force the data in, and created the very neat UpSet tool to help that: Video Tip of the Week: UpSet about genomics Venn Diagrams?
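To make the scaling problem concrete, here is a minimal Python sketch of the counting that sits behind an UpSet-style plot. It is not code from the UpSet tool or its R package; the function name `exclusive_intersections` and the gene family identifiers are made up for illustration. Each item is assigned to the exact combination of sets that contain it, and the combinations are then listed by size.

```python
from collections import Counter


def exclusive_intersections(sets_by_name):
    """Count items by the exact combination of sets they belong to.

    Each item lands in exactly one combination (the full list of sets
    that contain it), so the counts add up to the number of distinct items.
    """
    membership = {}
    for name, members in sets_by_name.items():
        for item in members:
            membership.setdefault(item, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


# Hypothetical gene families per species (identifiers are invented).
families = {
    "banana": {"F1", "F2", "F3", "F4", "F7"},
    "rice": {"F2", "F3", "F5"},
    "grape": {"F3", "F4", "F6"},
}

counts = exclusive_intersections(families)

# Largest combinations first, similar to sorted bars in an UpSet-style plot.
for combo, size in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(" & ".join(sorted(combo)), size)

# A Venn diagram of n sets has to draw one region per non-empty combination.
venn_regions_for_11_sets = 2 ** 11 - 1
print(venn_regions_for_11_sets)  # 2047
```

With 11 sets there are 2047 possible combinations, which is why a table or bar chart of only the non-empty combinations tends to be easier to read than a drawn Venn.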
As I’ve mentioned before, once I start looking over some new tools I’m often led to others in the same arena that offer related but different features. That’s what happened when I looked at the Proband iPad app for human pedigrees. I noted that they are using important community standards, and I decided to follow those threads a bit. That led me to last week’s tip, the Human Phenotype Ontology (HPO).
HPO has been around for a while and I’ve been aware of it, but this recent re-investigation made me realize how mature it has become, and I was impressed with the amount of adoption there’s been in the genomics community in the big projects. But it also led me to some new tools that I hadn’t encountered before. This week’s tip highlights PhenogramViz–combining my appreciation for controlled vocabularies, standards, and data visualization.
The PhenogramViz team illustrates how they analyze and visualize gene-phenotype relationships
Here’s how the PhenogramViz team describes their tool:
A tool that automatically analyses and visualizes gene-to-phenotype relations for a set of genes affected by CNV of a patient and a set of HPO-terms representing the symptoms of said patient. The tool makes full use of the cross-species phenotype ontology “uberpheno” (see here).
So if you have a patient with copy-number variation issues in their genome, you may be able to use this tool to lead to the genes in that CNV segment that convey certain phenotypes. So the goal–as stated in their paper linked below–is to assist with the clinical interpretation of the genome alterations.
The additional layer of this effort that I find useful is that they use another ontology to take this even further for supporting information. They employ the “Uberpheno” cross-species phenotype ontology to find further details in model organisms.
I’ll let you get a sense of how this works with one of their tutorial videos from their YouTube channel. They have others too, which will help you with everything from installation to analyses. I’ll embed the one that shows how you start with a list of patient symptoms or phenotypes, then load the CNVs or genes, and then simply click in the results list for graphical representations of the gene-phenotype relationships. Then with the Cytoscape tools you can interact with the “phenograms” in more detail. There’s no sound, but you can read the guidance in the callouts.
The videos include some abbreviations–like HPO. That’s why I talked last week about the Human Phenotype Ontology. I was prepping you for this one. And in another video (Prioritization of pathogenic CNVs) they reference the scoring strategies, which are explained further in their paper linked below (the Journal of Medical Genetics one). I would spend some time looking over how the scoring and ranking happens to understand what’s shown.
Although the focus of this is using the data for human diagnosis, I think it could also help researchers to choose more appropriate animal models for further testing. There are lots of complaints about the unsuitability of animal models for a range of subjects, so refining those choices would also be a huge benefit. Saving resources by helping to choose the right animal model would be another worthwhile use of this tool.
Köhler, S., Doelken, S., Mungall, C., Bauer, S., Firth, H., Bailleul-Forestier, I., Black, G., Brown, D., Brudno, M., Campbell, J., FitzPatrick, D., Eppig, J., Jackson, A., Freson, K., Girdea, M., Helbig, I., Hurst, J., Jahn, J., Jackson, L., Kelly, A., Ledbetter, D., Mansour, S., Martin, C., Moss, C., Mumford, A., Ouwehand, W., Park, S., Riggs, E., Scott, R., Sisodiya, S., Vooren, S., Wapner, R., Wilkie, A., Wright, C., Vulto-van Silfhout, A., Leeuw, N., de Vries, B., Washington, N., Smith, C., Westerfield, M., Schofield, P., Ruef, B., Gkoutos, G., Haendel, M., Smedley, D., Lewis, S., & Robinson, P. (2013). The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Research, 42 (D1) DOI: 10.1093/nar/gkt1026
Köhler S., Doelken S.C., Ruef B.J., Bauer S., Washington N., Westerfield M., Gkoutos G., Schofield P., Smedley D. & Lewis S.E. (2013). Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Research, PMID: http://www.ncbi.nlm.nih.gov/pubmed/24358873
Köhler, S., Schoeneberg, U., Czeschik, J., Doelken, S., Hehir-Kwa, J., Ibn-Salem, J., Mungall, C., Smedley, D., Haendel, M., & Robinson, P. (2014). Clinical interpretation of CNVs with cross-species phenotype data Journal of Medical Genetics, 51 (11), 766-772 DOI: 10.1136/jmedgenet-2014-102633
Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N., Schwikowski B. & Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks., Genome research, PMID: http://www.ncbi.nlm.nih.gov/pubmed/14597658
The multiple sequence alignment editing question recently on our What’s the Answer? feature was popular. We have covered MSA editors in the past, and we include a bit on Jalview in our Clustal tutorial, but I hadn’t revisited them lately. In preparation for that post I specifically looked over at the Jalview site, and I realized that they have recently provided a number of training videos to help people use their tools. So this week’s tip of the week will highlight them.
At the Jalview site, they give this brief description of the features:
Jalview is a free program for multiple sequence alignment editing, visualisation and analysis. Use it to view and edit sequence alignments, analyse them with phylogenetic trees and principal components analysis (PCA) plots and explore molecular structures and annotation.
On the Jalview online training Youtube channel, they have a number of videos. Some are general overview, some are specific tasks. For a general overview of what it does, this intro video will help you to decide if it’s a tool that would help you:
If you are ready to try it out, there are some handy tips in this video with more details about actually using the features of the software. It covers basic navigation, understanding the interface layout, working on editing, and good tips for accomplishing things efficiently.
For more of the philosophy and foundations of Jalview, check out their paper (linked below). And check out their other videos to go further.
Waterhouse, A.M., Procter, J.B., Martin, D.M.A, Clamp, M. and Barton, G. J. (2009)
“Jalview Version 2 – a multiple sequence alignment editor and analysis workbench”
Bioinformatics 25 (9) 1189-1191 doi: 10.1093/bioinformatics/btp033
Caleydo, from the Institute of Computer Graphics and Vision, is a suite of genomics and biomolecular visualization tools. As the project developers state, its strength is “the visualization of interdependencies between multiple datasets.” The tip of the week this week is a video introducing one of their newest tools: LineUp.
LineUp is an open source, scalable visualization technique for ranking systems that use several disparate ranks. LineUp was developed to
address [the] need to understand the ranking of genes by mutation frequency and other clinical parameters in a group of patients,…It is an ideal tool to create and visualize complex combined scores of bioinformatics algorithms.
Yet it can be used for many different ranking systems, whether that is to view rankings of universities or restaurants, or ranked datasets from various sources. In the video above, the users explain how to use LineUp to look at and visualize the ranking of universities based on several different rankings such as student reputation, student-to-faculty ratio and many others. The tool allows users to assign weights to different parameters to create a custom ranking.
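The idea of weighting several parameters to get a custom ranking can be sketched in a few lines of Python. This is only an illustration of a weighted-sum ranking, not LineUp’s own implementation; the function names, the school names, and all the numbers are hypothetical. Each attribute is rescaled to the 0 to 1 range, attributes where smaller is better are flipped, and the weighted values are summed into one score.

```python
def min_max_normalize(values):
    """Rescale values to the 0..1 range; constant columns become all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_ranking(rows, weights, lower_is_better=()):
    """Rank row names by a weighted sum of normalized attributes.

    rows: {name: {attribute: value}}
    weights: {attribute: weight}; weights are rescaled to sum to 1
    lower_is_better: attributes where a smaller value should score higher
    """
    names = list(rows)
    scores = {name: 0.0 for name in names}
    total_weight = sum(weights.values())
    for attr, weight in weights.items():
        normalized = min_max_normalize([rows[name][attr] for name in names])
        for name, value in zip(names, normalized):
            if attr in lower_is_better:
                value = 1.0 - value
            scores[name] += (weight / total_weight) * value
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked, scores


# Hypothetical schools with made-up numbers.
schools = {
    "School A": {"distance_km": 1.0, "academic": 70, "students_per_teacher": 25},
    "School B": {"distance_km": 5.0, "academic": 90, "students_per_teacher": 20},
    "School C": {"distance_km": 3.0, "academic": 80, "students_per_teacher": 15},
}
smaller_wins = ("distance_km", "students_per_teacher")

# Equal weights on every attribute.
equal_order, _ = combined_ranking(
    schools,
    {"distance_km": 1, "academic": 1, "students_per_teacher": 1},
    lower_is_better=smaller_wins,
)
print(equal_order)  # ['School C', 'School B', 'School A']

# A parent who cares mostly about distance.
nearby_order, _ = combined_ranking(
    schools,
    {"distance_km": 4, "academic": 1, "students_per_teacher": 1},
    lower_is_better=smaller_wins,
)
print(nearby_order)  # ['School A', 'School C', 'School B']
```

Changing the weights reorders the list, which is the kind of exploration the tool lets you do interactively.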
You really need to watch the video to understand the power of the visualization tool and its broad applicability. I immediately saw several uses in research, and even for choosing schools for my children. In San Francisco schools are assigned by “lottery,” and you rank the schools by preference. There are so many datasets that affect that choice for parents: distance, academic ranking, teacher-to-student ratio, diversity ranking and several more. I could see this tool as a great way to determine the ranking of our choices. The uses are endless.
Gratzl S, Lex A, Gehlenborg N, Pfister H, & Streit M (2013). LineUp: visual analysis of multi-attribute rankings. IEEE transactions on visualization and computer graphics, 19 (12), 2277-86 PMID: 24051794
Before I discuss NCBI’s 1000 Genomes Dataset Browser, I’d like to spend a bit of time on the 1000 Genomes project, in order to distinguish what is from NCBI and what is from the project itself. From the 1000 Genomes Pilot paper:
“The aim of the 1000 Genomes Project is to discover, genotype and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations. Specifically, the goal is to characterize over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have allele frequency of 1% or higher (the classical definition of polymorphism) in each of five major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa and the Americas).”
You can access the full paper from the link below. The project has now moved past the pilot phase and is releasing new data all the time. You can see announcements and project details, or access that data, through the official 1000 Genomes project site, or through the official 1000 Genomes version of the Ensembl Browser. As you might imagine for a “big data” project such as this, data has been added to a variety of NCBI databases, including dbSNP, the Sequence Read Archive (SRA) and BioSample. Although you could search for this data through the universal Entrez search system, previously you had to view individual results at each separate database. The 1000 Genomes Browser at NCBI has been created as a powerful interface for comprehensively searching for, and viewing, 1000 Genomes data contained in NCBI resources on a single page.
In the video tip I will familiarize you with the various areas of the page - the browser is built from a series of widgets, each with its own function. I will not be able to cover all of the features, or demonstrate how users can upload their own variation data to the browser – I’ll leave you the fun of exploring those on your own. Because the tool is so young, bug reports and suggestions/comments are still being actively requested – if you find something, check out the FAQs (which discuss bugs at various stages of being fixed) and then email the team.