This week’s highlighted discussion is about pan-genome graphs, which are ways to represent the variation we find in genomes, rather than relying on a single linear reference sequence. I was really struggling with these concepts until I heard a talk at the #TRICON meeting recently. David Haussler had some really helpful visuals. I don’t have an audio link to the talk I heard, but I found a similar one. I think it’s a concept people need to consider, because these graphs are going to be coming to us in the near future.
Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.
There’s much talk on my twittersphere about this piece in MIT Technology Review: Rebooting the Human Genome. It talks about how the current reference genome concept misses so much of the human variation that we need to capture as we sequence more and more people’s personal genome data.
But there’s also some confusion. I don’t think the concept of the graphs was really well described in there. In an earlier thread here we talked about it a little, but I wasn’t able to find the talk I’d heard about this, which had been helpful to me. But I found a similar one, and maybe it will help people get the idea of the graphs, as opposed to the current linear view we have of the reference genome.
You can watch the whole thing, of course. But the part about the graph ideas comes in around 52 minutes into the talk.
So the idea is that we have to be able to account for the “bubbles” that don’t match a linear reference string. Some bubbles will be alterations, some insertions, some deletions, some inversions. They are all valid, and we can capture them with graph representations that go beyond our current tools. We need to know and see this variation better.
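To make the bubble idea a bit more concrete, here is a minimal sketch in Python of a toy sequence graph. Everything in it is my own illustration, not anything from Haussler’s talk or the papers: the `SequenceGraph` class, the node names, and the short sequences are all hypothetical. It assumes a simple directed, acyclic graph, so it only models a substitution bubble, an insertion bubble, and a deletion bubble. Inversions would need a graph that can be walked in both directions, which this sketch does not attempt.

```python
class SequenceGraph:
    """Toy directed sequence graph: nodes carry DNA, edges say what may follow."""

    def __init__(self):
        self.seq = {}    # node id -> DNA string
        self.edges = {}  # node id -> list of successor node ids

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def paths(self, start, end):
        """Yield every node path from start to end (assumes no cycles)."""
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            if node == end:
                yield path
                continue
            for nxt in self.edges[node]:
                stack.append((nxt, path + [nxt]))

    def spell(self, path):
        """Concatenate the node sequences along a path into one haplotype."""
        return "".join(self.seq[n] for n in path)


def build_toy_graph():
    """Hypothetical example with three bubbles between shared segments."""
    g = SequenceGraph()
    for node_id, sequence in [
        ("s1", "ACG"), ("ref_T", "T"), ("alt_C", "C"),  # substitution bubble
        ("s2", "GA"), ("ins", "TTT"),                   # insertion bubble
        ("s3", "CC"), ("del", "GGG"),                   # deletion bubble
        ("s4", "AT"),
    ]:
        g.add_node(node_id, sequence)

    # Substitution: two alternative single-base nodes between s1 and s2.
    g.add_edge("s1", "ref_T"); g.add_edge("s1", "alt_C")
    g.add_edge("ref_T", "s2"); g.add_edge("alt_C", "s2")
    # Insertion: the reference goes straight s2 -> s3, some genomes pass through "ins".
    g.add_edge("s2", "s3"); g.add_edge("s2", "ins"); g.add_edge("ins", "s3")
    # Deletion: the reference includes "del", some genomes skip it.
    g.add_edge("s3", "del"); g.add_edge("del", "s4"); g.add_edge("s3", "s4")
    return g


graph = build_toy_graph()
reference_path = ["s1", "ref_T", "s2", "s3", "del", "s4"]
print("linear reference:", graph.spell(reference_path))
for path in graph.paths("s1", "s4"):
    print(" -> ".join(path), graph.spell(path))
```

The linear reference is just one walk through this graph. Every other walk from `s1` to `s4` spells out an equally valid sequence, which is the part a single reference string can’t show you.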
Anyway, I’m posting this because I think these ideas are important to be aware of. And I think that even researchers in the field aren’t that familiar with them yet.
This paper also helped me understand the concepts, but unfortunately it is not open access: Building a pan-genome reference for a population. doi: 10.1089/cmb.2014.0146 http://www.ncbi.nlm.nih.gov/pubmed/25565268
If anyone else has good introductions to the representations of these variant graph concepts I’d like to see them.
Edit to add: this paper has some of Haussler’s graphs too: http://arxiv.org/abs/1404.5010
Hat tip to Mike Schatz for leading me to the Tech Review article originally.
Tweet from Michael Schatz (@mike_schatz), June 3, 2015
Paten, B., Novak, A., & Haussler, D. (2014). Mapping to a Reference Genome Structure. arXiv: 1404.5010v1
Nguyen, N., Hickey, G., Zerbino, D., Raney, B., Earl, D., Armstrong, J., Kent, W., Haussler, D., & Paten, B. (2015). Building a Pan-Genome Reference for a Population. Journal of Computational Biology, 22 (5), 387-401. DOI: 10.1089/cmb.2014.0146