Category Archives: Genomics Research


Video Tip of the Week: Beacon, to locate genome variants of potential clinical significance

This week’s Video Tip of the Week follows on last week’s chatter about the Internet of DNA. As I mentioned then, the Beacon tool we touched on was going to get more coverage. So this week’s video is provided by the Beacon team, part of the larger Global Alliance for Genomics and Health project (GA4GH).

I’ve touched on some of the GA4GH work in the past. I heard more about a very interesting piece of it from David Haussler at the recent TRICON meeting.

D. Haussler, slide from TRICON talk.


The talk was called “Stable Reference Structures for Human Genome Analysis” and it was important for me to see this. I’ve been wrestling with some of the literature (linked below) that describes ways to represent genome variations among massive numbers of humans. It really helped me to hear it described and shown as cartoons on slides that were less like equations. And how this will play out in graphs and visualizations with software tools is of particular interest to me.

So one branch of the Data Working Group of the GA4GH is tasked with figuring out how to represent the variations as multiple paths through a graph, instead of the one linear reference genome we think of today. It has to accommodate many types of variation: inversions, deletions, and duplications, as well as plain SNPs. So, as the kids say today, it’s complicated. But we have to figure it out. Stay tuned, I’m sure we’ll be talking more about this in the years to come.
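To make the "multiple paths through a graph" idea a bit more concrete, here is a toy sketch in Python. It is not the GA4GH data model and it is not taken from the talk; the node numbers, the sequences, and the `spell` helper are all made up for illustration. Each node carries a short stretch of sequence, edges say which nodes may follow which, and an individual genome is one walk through the graph.

```python
# Toy variation graph: every name and sequence here is hypothetical.
# Each node holds a stretch of sequence; edges say which nodes may follow which.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTC", 5: "GGA"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (2, 5)}


def spell(path):
    """Return the sequence spelled by walking the given node path."""
    for left, right in zip(path, path[1:]):
        if (left, right) not in edges:
            raise ValueError(f"no edge from node {left} to node {right}")
    return "".join(nodes[n] for n in path)


reference = spell([1, 2, 4, 5])  # the familiar linear reference
snp = spell([1, 3, 4, 5])        # node 3 swaps in for node 2 (A -> G)
deletion = spell([1, 2, 5])      # the edge 2 -> 5 skips node 4 entirely

print(reference, snp, deletion)
```

In this picture the linear reference is just one path among several. Inversions and duplications need more machinery than this sketch has (strand orientation, repeated visits to a node), which is part of why it gets complicated.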


Beacon is like SETI for genome variations.

Another branch of this project is tasked with trying to figure out how to share genomic data among all the international producers of this data. If we can’t share the data, we won’t be able to look at the variations among humans and learn from them, never mind display them. This has additional layers of social and legal complexity we are just beginning to face. As a first pass at sharing this data, a “Beacon” system has been implemented to help researchers locate variations of interest to them.

You should read up on the whole Beacon philosophy and see its current implementation at their site. From what I gather, it is a minimal way to share genome information, without running into the privacy and consent barriers you might hit if you were pulling down a whole genome. You can query any site that implements a Beacon to ask: do you have a variation at this position? And the Beacons can respond with “yes” or “no”. If there are useful variations, you can then pursue them from there, and if you need access to more, you can go through the proper channels at that point. But at least you’ve possibly found some needles in some haystacks that you might not have known about otherwise.
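Here is a rough sketch of that yes/no idea in Python. To be clear, this is my own toy model and not the actual Beacon API: the `ToyBeacon` class, the `has_variant` method, and the example positions are all hypothetical. The point is only that the site keeps its data to itself and answers a single narrow question.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBeacon:
    """Hypothetical stand-in for a site that answers yes/no variant queries."""
    name: str
    # Private store of (chromosome, position, allele) tuples; never returned.
    variants: set = field(default_factory=set)

    def has_variant(self, chromosome, position, allele):
        """Answer only True or False; no genotypes or sample details leak out."""
        return (chromosome, position, allele) in self.variants


# Made-up data for illustration only.
beacon = ToyBeacon("example-lab", {("1", 12345, "T"), ("7", 5550, "G")})

answer_present = beacon.has_variant("1", 12345, "T")
answer_absent = beacon.has_variant("1", 12345, "C")
print(answer_present, answer_absent)
```

A real Beacon would sit behind a web service and would need to agree with the asker on things like the reference assembly, none of which is modeled here.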

The Beacon team has done a short video explaining this. It has no audio, just explanatory text with the graphics. Marc Fiume gave me permission to embed it here.

The “Beacon of Beacons” aggregates the query to send it out to all the known Beacons. You can use it today to search for this kind of data. The video also notes that you can cloak the name of the institution to protect patient privacy.
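The aggregation step can be sketched the same way. Again, this is a toy under my own assumptions and not how the real Beacon of Beacons is built: `make_beacon`, `query_all`, the lab names, and the "anonymous beacon" labels are all invented for the example. It fans one question out to every known beacon and hides the institution name for any beacon that asked to be cloaked.

```python
def make_beacon(name, variants, cloaked=False):
    """Build a hypothetical beacon record; 'cloaked' hides the institution name."""
    return {"name": name, "variants": set(variants), "cloaked": cloaked}


def query_all(beacons, chromosome, position, allele):
    """Ask every beacon the same yes/no question and collect (label, answer) pairs."""
    results = []
    for index, beacon in enumerate(beacons, start=1):
        label = f"anonymous beacon {index}" if beacon["cloaked"] else beacon["name"]
        found = (chromosome, position, allele) in beacon["variants"]
        results.append((label, found))
    return results


# Made-up beacons and variants for illustration only.
beacons = [
    make_beacon("Lab A", [("1", 12345, "T")]),
    make_beacon("Lab B", [("2", 999, "C")]),
    make_beacon("Lab C", [("1", 12345, "T")], cloaked=True),
]

results = query_all(beacons, "1", 12345, "T")
for label, found in results:
    print(label, "yes" if found else "no")
```

Even with the name cloaked, the asker still learns that the variant exists somewhere, which is the needle-in-a-haystack hint that makes it worth following up through the proper channels.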

I have been more acutely concerned about genomic privacy issues than some of my cohorts in this arena. And I fully accept that there will not be privacy–what I want is protection from misuse of the information, which I find lacking in the US legal framework right now. That said, I think that Beacon is a nice work-around for that. If I had a variant of concern, I could ping these other sites to see if others had it. Or vice-versa. But the framework under which the donor of that material provided the data would not be pierced. This makes total sense to me, and I can accept this strategy.

Sharing the genomic data from sequenced individuals is going to be tricky and complex. But I’m keen to see the GA4GH group tackle it. I like several of the directions that I’ve seen so far. But right now–check out Beacon. Implement one if you have this kind of data, and let’s see if it works.

Quick links:

Global Alliance for Genomics and Health:

Beacon (project details page):

Beacon of Beacons (where you would do a search):


Nguyen N., Glenn Hickey, Daniel R. Zerbino, Brian Raney, Dent Earl, Joel Armstrong, W. James Kent, David Haussler & Benedict Paten (2015). Building a Pan-Genome Reference for a Population, Journal of Computational Biology, 150107093755006. DOI:

During David Haussler’s talk, he also referenced these papers:

TIL: There’s a chief data scientist for the US. DJ Patil.

I know there’s lots of hype and drama over “big data”, some of which is over-the-top. But there are real needs and real opportunity in all sorts of data we are generating as well. So we now have a chief data scientist in the US. I found the news on the NIH Data Science blog, where they have more links and include this video where DJ Patil explains more about this role and the reasons.

Highlights of the video in case you can’t listen right now:

~6 min: he calls out “bioinformatics” as an area of emphasis.

~10 min: he specifically talks about working with Phil Bourne and NIH on bringing data science and bioinformatics together.

The White House release about Patil references the Precision Medicine efforts.

Precision medicine. Medical and genomic data provides an incredible opportunity to transition from a “one-size-fits-all” approach to health care towards a truly personalized system, one that takes into account individual differences in people’s genes, environments, and lifestyles in order to optimally prevent and treat disease. We will work through collaborative public and private efforts carried out under the President’s new Precision Medicine Initiative to catalyze a new era of responsible and secure data-based health care.

He asks for your help. They are building out teams. He wants everyone to check out the site and see if they can contribute.

US Data Service:

Follow @dpatil on twitter:

Hat tip to Beth Russell at the NIH Data Science blog called Input | Output:

Statistics for Biologists

In a curious coincidence (not statistically relevant), this week I planned to highlight some useful statistical software as my Video Tip of the Week and the Answer post. In order to lure you back for the other pieces this week, I bring you a handy collection from Nature that was just announced:

Direct link in case the tweet breaks later: “Statistics for biologists – A free Nature Collection” is the announcement post.

The collection is here:

Molecular Medicine Tri-Con 2015, early registration ends soon (#TRICON)

Quick note about this upcoming conference in San Francisco, February 15-20: OpenHelix will have a booth there, and I’ll post details about that later, but wanted to draw your attention right now to some of the content.

There’s a lot of stuff going on at this conference, but some particular talks of note to readers of this blog include some folks in the Informatics Channel you might want to hear from:

Genome and Transcriptome Analysis:

Integrating Transcriptome and Genome Sequencing to Understand Functional Variation in Human Genomes

Tuuli Lappalainen, Ph.D., Principal Investigator & Core Member, New York Genome Center; Assistant Professor, Systems Biology, Columbia University

Detailed characterization of cellular effects of genetic variants is essential for understanding biological processes that underlie genetic associations to disease. Integration of genome and transcriptome data has allowed us to characterize regulatory and loss-of-function genetic variants as well as imprinting both at the population and individual level, as well as their tissue-specificity and role in disease associations.


Stable Reference Structures for Human Genome Analysis

David Haussler, Ph.D., Distinguished Professor and Scientific Director, UC Santa Cruz Genomics Institute, University of California Santa Cruz

Currently there are many different ways to map individual patient DNA and call genetic variants relative to the human reference genome GRCh38, and on top of this, when an expanded version GRCh39 arrives, quite a bit of remapping and recalling turmoil will be created. I describe a new scheme being developed with assistance from the Global Alliance for Genomics and Health in which mapping to the reference genome and calling variants would become a precisely defined and relatively stable process, with a well-defined incremental update when the reference genome expands to a more comprehensive version. This will enable a better standardized and more accurate discourse about human genetic variation for science and medicine.


Accessible and Reproducible Large-Scale Analysis with Galaxy

James Taylor, Ph.D., Ralph S. O’Connor Associate Professor, Biology; Associate Professor, Computer Science, Johns Hopkins University

I will discuss the Galaxy framework for accessible genomic data analysis. I will particularly highlight new features of Galaxy which are enabling analysis at increasingly larger scales, including UI and backend improvements, as well as other recent improvements to Galaxy.


There’s a lot more going on as well, but this track seemed particularly well suited to our readers. Have a look.

Note: OpenHelix is a part of Cambridge Healthtech Institute.

Margaret Oakley Dayhoff, going on #ThatOtherShirt.

I’ve been a fan of Margaret Oakley Dayhoff for a long time. One of the most popular posts on this blog is the one linked in this tweet below. I can tell when students have been assigned a project to read up on her, because suddenly I see an influx of hits to the page.

And one night over twitter I had to help identify her, so I know there’s a need for wider recognition:

Not much has changed since I wrote that earlier Dayhoff post, but a few links aren’t working so well anymore. I don’t want the important history of this field going into the memory hole. The other day I came across a paper by Bruno Strasser that’s worth pointing folks to, for additional details on the time frame of Dayhoff’s work, and her role in the sphere that became “bioinformatics”.

Anyway, I really wanted to see such a pioneer on the shirt with all those other women. And who doesn’t need a Hawaiian shirt with Rosalind Franklin and Barbara McClintock too? Looking forward to wearing it to training events.
If you want a shirt, you only have a couple of days left on the Kickstarter:


Strasser B. (2011). The Experimenter’s Museum: GenBank, Natural History, and the Moral Economies of Biomedicine, Isis, 102 (1) 60-96. DOI:

Survey: microbial ecology visualizations. Help out a student.

I know it’s a holiday week, and you might not be up for heavy-lifting posts. But if you have some time on your hands, have a look at this survey on visualizations of microbial communities. I think visualizations are becoming increasingly challenging as we get so much new data on so many genomics problems. And Meg Pirrung is tackling this problem for her thesis. Help a student out and fill out the survey.

Read her request on G+, also pasted below:

I am conducting a survey on microbial ecology visualizations for my PhD thesis. The survey has 3 parts and then an optional demographic survey, and should take somewhere around 15 to 20 minutes to complete. Please complete the survey on a computer and not on a mobile device.

You don’t need any prior knowledge of the subject matter to complete the survey, I want responses from all different parts of the population.

If you wouldn’t mind completing it for me, I would be extremely grateful. Thanks!

Direct link to survey:



Oxford plots from the gibbon genome paper

A while back I talked about the software in the gibbon genome paper. I went through to try to pull out as much of the software as I could as sort of a catalog of a representative genome project. Of course, there was a lot in there. Some of it, though, consisted of unpublished code.

One of the figures I liked very much, because it conveys a lot of information quickly, was Figure 2 from the main paper, with the Oxford plots for comparison, and then the view of the phylogenetic tree. I mused about whether the code for it was available somewhere, and I contacted the team to find out. Javier Herrero has been really terrific about answering my questions and getting back to me with more details. The plot code was an internal script, and the tree layout wasn’t a special tool, but just a graphical arrangement done by hand later.

So knowing my interest in this software, Javier let me know the other day that he’s put that code for the plots on Github. You can access it yourself there. Note–it requires eHive and Kent libraries. And this makes the dot plots, but you still would have to lay out the tree by hand.

But now you can plot these types of comparisons if you want to try it out.

Quick link:

Oxford plots:


Carbone L., R. Alan Harris, Sante Gnerre, Krishna R. Veeramah, Belen Lorente-Galdos, John Huddleston, Thomas J. Meyer, Javier Herrero, Christian Roos, Bronwen Aken, Fabio Anaclerio, et al. (2014). Gibbon genome and the fast karyotype evolution of small apes, Nature, 513 (7517) 195-201. DOI:

Oy. I worry about this with cell line studies a lot. Mis-IDed + contaminated.

Via NCBI Announce mailing list:

NCBI BioSample includes curated list of over 400 known misidentified and contaminated cell lines

The NCBI BioSample database now includes a curated list of over 400 known misidentified and contaminated cell lines. Scientists should check this list before they start working with a new cell line to see if that cell line is known to be misidentified.

Continuous cell lines are used widely in research as model systems for normal cellular processes and disease states. However, as noted by many (e.g. PubMed 23235867, 20143388, 19003294, 18072586, and 17522957), cell line cross-contamination or misidentification represents a serious and widespread problem, and researchers should take great care to check that their cell line is what they think it is. Cell lines can be easily mislabeled or become overgrown by cells derived from a different individual, tissue or species.

This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified (PubMed 20448633). The first step in combating this problem is to make sure your cell line is not on the list of known misidentified and cross-contaminated cell lines. Detailed information about how to test your cell lines is provided by the International Cell Line Authentication Committee.

NCBI BioSample curated list of misidentified and contaminated cell lines:

Articles on cell line cross-contamination and misidentification in PubMed mentioned above:

The International Cell Line Authentication Committee:

I also worry about SNVs and all sorts of other issues within the cell lines. When the first data was coming out on CNVs in the ENCODE cell lines, I found duplications, and homozygous and heterozygous deletions, that would have concerned me if I were working on certain pathways. If I were still doing cell biology, I’d sequence my cell line of choice before I did another experiment with it. Below I’ve linked to the PubMed reference they provided in the body.


American Type Culture Collection Standards Development Organization Workgroup ASN-0002. (2010). Cell line misidentification: the beginning of the end, Nature Reviews Cancer, 10 (6) 441-448. DOI:

Genome Editing with CRISPR-Cas9, nifty animation

I saw this come across my twitter feed the other day, and as a nice Friday afternoon diversion I posted it to Google+. I was surprised how popular it was. So I thought–hey, I have a blog too. Let’s put it there…. So grab some coffee and watch, a nice gentle way to get your Monday underway.

This animation depicts the CRISPR-Cas9 method for genome editing – a powerful new technology with many applications in biomedical research, including the potential to treat human genetic disease. Feng Zhang, a leader in the development of this technology, is a faculty member at MIT, an investigator at the McGovern Institute for Brain Research, and a core member of the Broad Institute. Further information can be found on Prof. Zhang’s website.

Images and footage courtesy of Sputnik Animation, the Broad Institute of MIT and Harvard, Justin Knight and pond5.

The publications page at the Zhang lab has some nice examples of CRISPR work, including the knockin mouse paper with cancer modeling applications. I’ve been meaning to get that one but don’t have a subscription to Cell, so that was handy.

Platt R., Sidi Chen, Yang Zhou, Michael J. Yim, Lukasz Swiech, Hannah R. Kempton, James E. Dahlman, Oren Parnas, Thomas M. Eisenhaure, Marko Jovanovic, Daniel B. Graham, et al. (2014). CRISPR-Cas9 Knockin Mice for Genome Editing and Cancer Modeling, Cell, 159 (2) 440-455. DOI: