What’s the Answer? (publication quality alignment images)

This week’s highlighted question is from the Bioinformatics subreddit, and I’m using it because it made me laugh. I dare someone to name their alignment tool “Fancy Pants”. That said, the thread did provide links to a number of different tools that people might find useful, depending on what you want to do to make your stuff look niiiiiiiiice.

Looking for publication quality alignment (‘fancy pants alignment’ are the words my colleague specifically used)

submitted by CatsVansBags

Hey all- my colleague and I are looking for an alignment tool (software, website, etc) that makes beautiful looking alignments for publications. ClustalW is what everyone keeps suggesting and its nice but its not niiiiiiiiice. If anyone has any favorites, please fill me in, Thanks!

Some of the tools are things we’ve talked about before. Jalview and AliView, for example. But there are some others too. Have a look at the chatter and check ‘em out yourself if you need some alignments. Or, if you have other tools, suggest them.

What’s the Answer? (pan-genome graphs)

This week’s highlighted discussion is about pan-genome graphs, which are ways to represent the variation we find in genomes instead of a linear reference sequence view. I was really struggling with these concepts until I heard a talk at the #TRICON meeting recently. David Haussler had some really helpful visuals. I don’t have an audio link to the talk I heard, but I found a similar one. I think it’s a concept people need to consider, because these are going to be coming to us in the near future.

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the Biostars community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

Forum: Pan-genome graphs make the popular science press

There’s much talk on my twittersphere about this piece in MIT Technology Review: Rebooting the Human Genome. It talks about how the current reference genome concept misses so much of the human variation that we need to capture as we sequence more and more people’s personal genome data.

But there’s also some confusion. I don’t think the concept of the graphs was really well described in there. In an earlier thread here we talked about it a little, but I wasn’t able to find the talk I’d heard about this, which was helpful to me. But I found a similar one, and maybe this will help people to get the idea of the graphs instead of just the current linear view we have of the reference genome.

You can watch the whole thing, of course. But the part about the graph ideas comes in around 52 minutes into this talk.

So the idea is that we have to be able to account for the “bubbles” that don’t match a linear reference string. Some bubbles will be alterations, some insertions, some deletions, some inversions. They are all valid, and we can capture them with graph representations that go beyond our current tools. We need to know and see this variation better.
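To make the “bubble” idea a bit more concrete, here is a minimal sketch in Python. It is only a toy, not how any real pan-genome tool stores its graphs: the node numbers, the sequences, and the function name `enumerate_haplotypes` are all made up for illustration. Each node holds a piece of sequence, and wherever a node has two successors the paths split and later rejoin, which is the bubble.

```python
# Hypothetical toy graph: node id -> (sequence fragment, successor node ids).
# Nodes 2 and 3 form a substitution bubble; node 5 is an insertion that
# some paths take and others skip.
GRAPH = {
    1: ("ACGT", [2, 3]),
    2: ("A", [4]),       # allele carried by the linear reference
    3: ("G", [4]),       # alternate allele
    4: ("TTC", [5, 6]),
    5: ("GGG", [6]),     # inserted sequence, absent from the linear reference
    6: ("CA", []),
}


def enumerate_haplotypes(graph, start):
    """Return every sequence spelled by a path from `start` to a node
    that has no successors."""
    sequence, successors = graph[start]
    if not successors:
        return [sequence]
    results = []
    for next_node in successors:
        for tail in enumerate_haplotypes(graph, next_node):
            results.append(sequence + tail)
    return results


haplotypes = enumerate_haplotypes(GRAPH, 1)
for haplotype in haplotypes:
    print(haplotype)
```

In this toy, the linear reference would be the single path 1-2-4-6, while the graph keeps all four haplotypes side by side. Inversions and larger rearrangements need a richer representation than this sketch has.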

Anyway, I’m posting because I think it’s important to be aware of. And I think that even researchers in the field aren’t that familiar with the ideas yet.

This paper was also helpful to me to understand the concepts, but unfortunately is not open access: Building a pan-genome reference for a population. doi: 10.1089/cmb.2014.0146 http://www.ncbi.nlm.nih.gov/pubmed/25565268

If anyone else has good introductions to the representations of these variant graph concepts I’d like to see them.

Edit to add: this paper has some of Haussler’s graphs too: http://arxiv.org/abs/1404.5010


Hat tip to Mike Schatz for leading me to the Tech Review article originally.


Benedict Paten, Adam Novak, & David Haussler (2014). Mapping to a Reference Genome Structure arXiv.org arXiv: 1404.5010v1

Nguyen, N., Hickey, G., Zerbino, D., Raney, B., Earl, D., Armstrong, J., Kent, W., Haussler, D., & Paten, B. (2015). Building a Pan-Genome Reference for a Population Journal of Computational Biology, 22 (5), 387-401 DOI: 10.1089/cmb.2014.0146

Video Tip of The Week: Jalview for multiple sequence alignment editing and visualization

The multiple sequence alignment editing question recently on our What’s the Answer? feature was popular. We have covered MSA editors in the past, and we include a bit on Jalview in our Clustal tutorial, but I hadn’t revisited them lately. In preparation for that post I specifically looked over at the Jalview site, and I realized that they have recently provided a number of training videos to help people use their tools. So this week’s tip of the week will highlight them.

At the Jalview site, they give this brief description of the features:

Jalview is a free program for multiple sequence alignment editing, visualisation and analysis. Use it to view and edit sequence alignments, analyse them with phylogenetic trees and principal components analysis (PCA) plots and explore molecular structures and annotation.

There are 2 flavors of Jalview. There is a JalviewLite applet you can demo by simply clicking on some examples at their site. Or you can run the Jalview desktop for more features (you can do this from the web or by downloading a local copy). The description on their About page will tell you more about the distinctions. You may also encounter Jalview that’s being incorporated in other tools. Here’s a handy list of those on their Community resources page.

On the Jalview online training YouTube channel, they have a number of videos. Some are general overviews, some cover specific tasks. For a general overview of what it does, this intro video will help you to decide if it’s a tool that would help you:

If you are ready to try it out, there are some handy tips in this video with more details about actually using the features of the software. It covers basic navigation, understanding the interface layout, working on editing, and good tips for accomplishing things efficiently.

For more of the philosophy and foundations of Jalview, check out their paper (linked below). And check out their other videos to go further.

Quick link:

Jalview: http://www.jalview.org/


Waterhouse, A.M., Procter, J.B., Martin, D.M.A, Clamp, M. and Barton, G. J. (2009)
“Jalview Version 2 – a multiple sequence alignment editor and analysis workbench”
Bioinformatics 25 (9) 1189-1191 doi: 10.1093/bioinformatics/btp033

World tour of workshops, recent stop: Morocco, Africa

Trainers & organizers

Last year I had the opportunity to give a workshop in Ifrane, Morocco (UCSC Genome and Table browsers, Galaxy) at Al Akhawayn University. This year, Mary and I returned for a longer 3-day workshop at University Hassan II in Mohammadia. OpenHelix was a co-sponsor of the workshop (donating our time, materials and expertise). The workshop covered a plethora of topics, from a world tour of resources (tutorial-free) and introductory UCSC Genome Browser (tutorial-free) and ENCODE (tutorial-free) to genome variation analysis in dbSNP (tutorial-subscription) and analysis using Galaxy (tutorial-subscription). You can see the full schedule of topics in the Mohammadia Workshop Schedule here (pdf).

As last year, we were impressed with the students (there were 117 total, with about a 50/50 gender ratio). English is their 3rd or 4th language in most cases, with Moroccan Arabic, French or various African languages being their languages of choice. Yet they were attentive and asked very perceptive and fascinating questions. They were also very enthusiastic learners. It was a delight to teach them.

The workshop students

We’d like to thank Mohammed Bourdi at NIH, who spent large amounts of time and financial resources to organize this (and last year’s) workshop. We hope to repeat and expand these for next year and perhaps years to come. We will be looking for sponsors.

Several questions were asked at the workshop. We’d like to reiterate the answers here and seek some further answers from our readers:

* One student was looking for wheat genome resources for designing primers. The wheat genome is as yet incomplete, but there are some resources to get started:
Wheat Genome Sequencing Consortium
Gramene’s wheat resources
Wheat Genetic and Genomic Resource Center @ Kansas State
Perhaps also COGE for conserved sequences
edited to add:
CerealsDB and
James’ post on the wheat draft sequence might give some insight into that huge genome.
* Another student asked about dotplot tools:
Galaxy offers a large collection of EMBOSS tools, including dotplot analysis, as does the EBI EMBOSS tool.
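
For anyone who wants to see what a dotplot actually computes before reaching for those tools, here is a minimal sketch in Python. It is not how the EMBOSS programs are implemented; the function names `dotplot` and `render` and the `window` parameter are just illustrative choices. A dot is placed wherever the short word starting at position i of one sequence exactly matches the word starting at position j of the other.

```python
def dotplot(seq_a, seq_b, window=3):
    """Return a grid of booleans where grid[i][j] is True when the
    `window`-length words starting at seq_a[i] and seq_b[j] are identical."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(grid):
    """Draw the grid as text, with '*' for a match and '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in grid
    )


# A sequence compared with itself: the main diagonal is the self-match,
# and the off-diagonal runs reveal the internal ACGT repeat.
grid = dotplot("ACGTACGT", "ACGTACGT", window=4)
print(render(grid))
```

A larger window cuts down on random background dots at the cost of missing short matches, which is the main knob in most dotplot tools.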

* Another question concerned finding a ‘dynamic programming’ (optimal solution) multiple sequence alignment tool as opposed to a heuristic one. The issue is the size of the search space: the work for a dynamic programming solution grows exponentially with the number of sequences, so it is too computationally intensive for more than a handful of them. This slide set might help with the understanding, particularly slides 1-5 and 17-22. That said, the student might want to check out MSAProps and this list at Wikipedia.

Do our readers have any other guidance on this?
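
To put rough numbers on the dynamic programming question above, here is a small Python sketch. The first function is a textbook global (Needleman-Wunsch style) alignment score for two sequences, with scoring values I picked arbitrarily for illustration. The second one, a hypothetical helper called `lattice_cells`, just counts how many cells the same approach would have to fill for N sequences, since the table gains one dimension per sequence.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score for two sequences, keeping only
    one row of the dynamic programming table in memory at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def lattice_cells(lengths):
    """Number of cells in the full dynamic programming lattice for
    sequences of the given lengths (one dimension per sequence)."""
    cells = 1
    for length in lengths:
        cells *= length + 1
    return cells


print(global_alignment_score("GATTACA", "GATTACA"))
print(lattice_cells([100, 100]))   # two sequences of length 100
print(lattice_cells([100] * 5))    # five sequences of length 100
```

Two sequences of length 100 need about ten thousand cells, while five need about ten billion, and each cell in the N-sequence case also has 2^N - 1 neighbours to check. That blow-up is why practical tools fall back on heuristics.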

Teaching moment

* Another student asked if we know how to find DC-area internships in biological sciences. Another student (a mathematician from Mali) was looking for something in the US in bioinformatics. Any ideas of programs to bring African biology students to the US or Canada?

If our Moroccan students (or anyone else) have any additional questions, please feel free to ask them here!


And a side note. Last year I had all of 3 hours to tour Fes. This year I took advantage of my trip. Mary and I spent a few days in Fes and Marrakech. My family joined us in Marrakech, and later my family and I toured for 8 days, visiting the Atlas mountains, the Sahara and Fes. Needless to say, it was a trip of a lifetime. Morocco is a fascinating and beautiful place. I look forward to visiting again.

Gates and doors of Fes are beautiful

camel excursion to the Sahara





What’s the answer? Open thread

BioStar Question of the Week:

Multiple sequence alignment of thousands of proteins

I want to track the evolution of several domains, and for doing so, I need to align and cluster 1000′s of sequences. is it possible? and what is the best software to use for that? Eventually I want to understand which is the most “basal” sequence that might lead me to the most ancient protein containing this sequence.

–by Dror

The selected answer:

“mafft --auto” is stable for up to hundreds of thousands of proteins and produces reasonable alignments: http://mafft.cbrc.jp/alignment/software/

–by avilella
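
If you would rather script that suggestion than type it at the command line, a minimal Python sketch might look like the following. It assumes MAFFT is already installed and on your PATH, and the file names and the helper names `build_mafft_command` and `run_mafft` are hypothetical. MAFFT writes the alignment to standard output, so the sketch redirects that into a file.

```python
import shutil
import subprocess


def build_mafft_command(fasta_path):
    """Build the command line for the `mafft --auto` run suggested above."""
    return ["mafft", "--auto", fasta_path]


def run_mafft(fasta_path, output_path):
    """Align the sequences in `fasta_path` and write the alignment to
    `output_path`. Assumes the mafft executable is available on PATH."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on PATH")
    with open(output_path, "w") as handle:
        # MAFFT prints the alignment to stdout and progress messages to stderr.
        subprocess.run(build_mafft_command(fasta_path), stdout=handle, check=True)


# Example with hypothetical file names:
# run_mafft("domains.fasta", "domains.aligned.fasta")
print(" ".join(build_mafft_command("domains.fasta")))
```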

But there are a couple of other options as well, as with most bioinformatics solutions!  This includes a hot-off-the-press lead on the new Clustal version (Clustal Omega). Check out the others over there.


I have a vague memory of reading about COBALT a while back, but at the time it was an executable file to download and I think I put it away as “to do.”  Well, a couple days ago I was over at the NCBI BLAST site for something (tip of the week?), and noticed there was a “new” flash for COBALT. So, COBALT is now integrated as a web-tool on the NCBI site. The short description of what COBALT is, from the site:

COBALT is a multiple sequence alignment tool that finds a collection of pairwise constraints derived from conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.
Pairwise constraints are then incorporated into a progressive multiple alignment.

I haven’t tried it out yet or compared it to other multiple sequence alignment tools, but thought I’d point it out to those who haven’t yet noticed it.