Category Archives: Genomics Research

TIL: There’s a chief data scientist for the US. DJ Patil.

I know there’s lots of hype and drama over “big data”, some of which is over-the-top. But there are also real needs and real opportunities in all sorts of data we are generating. So we now have a chief data scientist in the US. I found the news on the NIH Data Science blog, where they have more links and include this video where DJ Patil explains more about this role and the reasons for it.

Highlights of the video in case you can’t listen right now:

~6min he calls out “bioinformatics” as an area of emphasis

~10min he specifically talks about working with Phil Bourne and NIH on bringing data science and bioinformatics together.

The White House release about Patil references the Precision Medicine efforts.

Precision medicine. Medical and genomic data provides an incredible opportunity to transition from a “one-size-fits-all” approach to health care towards a truly personalized system, one that takes into account individual differences in people’s genes, environments, and lifestyles in order to optimally prevent and treat disease. We will work through collaborative public and private efforts carried out under the President’s new Precision Medicine Initiative to catalyze a new era of responsible and secure data-based health care.

He asks for your help. They are building out teams. He wants everyone to check out the site and see if they can contribute.

US Data Service:

Follow @dpatil on twitter:

Hat tip to Beth Russell at the NIH Data Science blog called Input | Output:

Statistics for Biologists

In a curious coincidence (not statistically relevant), this week I planned to highlight some useful statistical software as my Video Tip of the Week and the Answer post. In order to lure you back for the other pieces this week, I bring you a handy collection from Nature that was just announced:

Direct link in case the tweet breaks later: the announcement post is “Statistics for biologists – A free Nature Collection”.

The collection is here:

Molecular Medicine Tri-Con 2015, early registration ends soon (#TRICON)

Quick note about this upcoming conference in San Francisco, February 15-20: OpenHelix will have a booth there, and I’ll post details about that later, but I wanted to draw your attention right now to some of the content.

There’s a lot of stuff going on at this conference, but talks of particular note to readers of this blog include these from folks in the Informatics Channel:

Genome and Transcriptome Analysis:

Integrating Transcriptome and Genome Sequencing to Understand Functional Variation in Human Genomes

Tuuli Lappalainen, Ph.D., Principal Investigator & Core Member, New York Genome Center; Assistant Professor, Systems Biology, Columbia University

Detailed characterization of cellular effects of genetic variants is essential for understanding biological processes that underlie genetic associations to disease. Integration of genome and transcriptome data has allowed us to characterize regulatory and loss-of-function genetic variants as well as imprinting both at the population and individual level, as well as their tissue-specificity and role in disease associations.


Stable Reference Structures for Human Genome Analysis

David Haussler, Ph.D., Distinguished Professor and Scientific Director, UC Santa Cruz Genomics Institute, University of California Santa Cruz

Currently there are many different ways to map individual patient DNA and call genetic variants relative to the human reference genome GRCh38, and on top of this, when an expanded version GRCh39 arrives, quite a bit of remapping and recalling turmoil will be created. I describe a new scheme being developed with assistance from the Global Alliance for Genomics and Health in which mapping to the reference genome and calling variants would become a precisely defined and relatively stable process, with a well-defined incremental update when the reference genome expands to a more comprehensive version. This will enable a better standardized and more accurate discourse about human genetic variation for science and medicine.


Accessible and Reproducible Large-Scale Analysis with Galaxy

James Taylor, Ph.D., Ralph S. O’Connor Associate Professor, Biology; Associate Professor, Computer Science, Johns Hopkins University

I will discuss the Galaxy framework for accessible genomic data analysis. I will particularly highlight new features of Galaxy which are enabling analysis at increasingly larger scales, including UI and backend improvements, as well as other recent improvements to Galaxy.


There’s a lot more going on as well, but this track seemed particularly well suited to our readers. Have a look.

Note: OpenHelix is a part of Cambridge Healthtech Institute.

Margaret Oakley Dayhoff, going on #ThatOtherShirt.

I’ve been a fan of Margaret Oakley Dayhoff for a long time. One of the most popular posts on this blog is the one linked in this tweet below. I can tell when students have been assigned a project to read up on her, because suddenly I see an influx of hits to the page.

And one night over twitter I had to help identify her, so I know there’s a need for wider recognition:

Not much has changed since I wrote that earlier Dayhoff post, but a few links aren’t working so well anymore. I don’t want the important history of this field going into the memory hole. The other day I came across a paper by Bruno Strasser that’s worth pointing folks to, for additional details on the time frame of Dayhoff’s work, and her role in the sphere that became “bioinformatics”.

Anyway, I really wanted to see such a pioneer on the shirt with all those other women. And who doesn’t need a Hawaiian shirt with Rosalind Franklin and Barbara McClintock too? Looking forward to wearing it to training events.
If you want a shirt, you only have a couple of days left on the Kickstarter:


Strasser B. (2011). The Experimenter’s Museum: GenBank, Natural History, and the Moral Economies of Biomedicine, Isis, 102 (1) 60-96. DOI:

Survey: microbial ecology visualizations. Help out a student.

I know it’s a holiday week, and you might not be up for heavy-lifting posts. But if you have some time on your hands, have a look at this survey on visualizations of microbial communities. I think visualizations are becoming increasingly challenging as we get so much new data on so many genomics problems. And Meg Pirrung is tackling this problem for her thesis. Help a student out and fill out the survey.

Read her request on G+, also pasted below:

I am conducting a survey on microbial ecology visualizations for my PhD thesis. The survey has 3 parts and then an optional demographic survey, and should take somewhere around 15 to 20 minutes to complete. Please complete the survey on a computer and not on a mobile device.

You don’t need any prior knowledge of the subject matter to complete the survey, I want responses from all different parts of the population.

If you wouldn’t mind completing it for me, I would be extremely grateful. Thanks!

Direct link to survey:



Oxford plots from the gibbon genome paper

A while back I talked about the software in the gibbon genome paper. I went through to try to pull out as much of the software as I could as sort of a catalog of a representative genome project. Of course, there was a lot in there. Some of it, though, consisted of unpublished code.

One of the figures I liked very much, because it conveyed a lot of information quickly, was Figure 2 from the main paper, with the Oxford plots for comparison and then the view of the phylogenetic tree. I mused about whether this was available somewhere, and I contacted the team to find out. Javier Herrero has been really terrific about answering my questions and getting back to me with more details. The plot code was an internal script, and the tree layout wasn’t a special tool, but just a graphical arrangement done by hand later.

So knowing my interest in this software, Javier let me know the other day that he’s put the code for the plots on GitHub. You can access it yourself there. Note: it requires eHive and the Kent libraries. And it makes the dot plots, but you would still have to lay out the tree by hand.

But now you can plot these types of comparisons if you want to try it out.
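For anyone new to these plots: an Oxford plot is a dot plot of one genome against another, with each dot marking where a matching segment sits in both. What follows is not Javier’s script (which needs eHive and the Kent libraries). It is just a rough Python sketch of the coordinate step, and the chromosome names, sizes and positions in it are made up for illustration. It lays the chromosomes of each genome end to end and turns each match into an (x, y) point you could hand to any plotting library.

```python
from itertools import accumulate


def chrom_offsets(chrom_sizes):
    """Map each chromosome to where its band starts on a concatenated axis.

    chrom_sizes is a dict of name -> length; dict order sets the axis order.
    """
    names = list(chrom_sizes)
    ends = list(accumulate(chrom_sizes[name] for name in names))
    starts = [0] + ends[:-1]
    return dict(zip(names, starts))


def oxford_points(matches, sizes_a, sizes_b):
    """Convert (chrom_a, pos_a, chrom_b, pos_b) matches into (x, y) points."""
    offsets_a = chrom_offsets(sizes_a)
    offsets_b = chrom_offsets(sizes_b)
    return [
        (offsets_a[chrom_a] + pos_a, offsets_b[chrom_b] + pos_b)
        for chrom_a, pos_a, chrom_b, pos_b in matches
    ]


# Made-up genomes and matches, for illustration only.
sizes_a = {"a1": 100, "a2": 50}
sizes_b = {"b1": 80, "b2": 70}
matches = [
    ("a1", 10, "b1", 20),
    ("a2", 5, "b2", 30),
    ("a1", 90, "b2", 0),
]

points = oxford_points(matches, sizes_a, sizes_b)
# points == [(10, 20), (105, 110), (90, 80)]
```

Drawing grid lines at the chromosome offsets gives the familiar boxes, so a run of dots that jumps from one box to another shows a rearrangement between the two genomes.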

Quick link:

Oxford plots:


Carbone L., R. Alan Harris, Sante Gnerre, Krishna R. Veeramah, Belen Lorente-Galdos, John Huddleston, Thomas J. Meyer, Javier Herrero, Christian Roos, Bronwen Aken, Fabio Anaclerio, et al. (2014). Gibbon genome and the fast karyotype evolution of small apes, Nature, 513 (7517) 195-201. DOI:

Oy. I worry about this with cell line studies a lot. Mis-IDed + contaminated.

Via the NCBI Announce mailing list:

NCBI BioSample includes curated list of over 400 known misidentified and contaminated cell lines

The NCBI BioSample database now includes a curated list of over 400 known misidentified and contaminated cell lines. Scientists should check this list before they start working with a new cell line to see if that cell line is known to be misidentified.

Continuous cell lines are used widely in research as model systems for normal cellular processes and disease states. However, as noted by many (e.g. PubMed 23235867, 20143388, 19003294, 18072586, and 17522957), cell line cross-contamination or misidentification represents a serious and widespread problem, and researchers should take great care to check that their cell line is what they think it is. Cell lines can be easily mislabeled or become overgrown by cells derived from a different individual, tissue or species.

This problem is so common it is thought that thousands of misleading and potentially erroneous papers have been published using cell lines that are incorrectly identified (PubMed 20448633). The first step in combating this problem is to make sure your cell line is not on the list of known misidentified and cross-contaminated cell lines. Detailed information about how to test your cell lines is provided by the International Cell Line Authentication Committee.

NCBI BioSample curated list of misidentified and contaminated cell lines:[Attribute]

Articles on cell line cross-contamination and misidentification in PubMed mentioned above:

The International Cell Line Authentication Committee:

I also worry about SNVs and all sorts of other issues within the cell lines. When the first data was coming out on CNVs in the ENCODE cell lines, I found duplications, and homozygous and heterozygous deletions, that would have concerned me if I was working on certain pathways. If I was still doing cell biology, I’d sequence my cell line of choice before I did another experiment with it. Below I’ve linked to the PubMed reference they provided in the body.
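If you keep a local copy of that curated list, the check the announcement asks for is easy to script. Here is a minimal Python sketch, not anything NCBI provides. It assumes you have saved the listed names one per line as plain text; that layout is my assumption, and the function names and cell line names below are placeholders. It ignores case, spaces and punctuation, since the same line often gets written several ways.

```python
import re


def normalize(name):
    """Lowercase a cell line name and drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def load_flagged(lines):
    """Build a lookup from normalized name to the name as it appears in the list.

    `lines` can be any iterable of strings, such as an open text file.
    Blank lines and lines starting with '#' are skipped.
    """
    flagged = {}
    for line in lines:
        name = line.strip()
        if name and not name.startswith("#"):
            flagged[normalize(name)] = name
    return flagged


def check(cell_lines, flagged):
    """Return {name: listed name or None} for each cell line you plan to use."""
    return {name: flagged.get(normalize(name)) for name in cell_lines}


# Placeholder entries, not real cell line names.
listed = ["# one name per line", "Example-Line 1", "DEMO 22"]
flagged = load_flagged(listed)

result = check(["example line1", "demo-22", "MyLine"], flagged)
for name, hit in result.items():
    status = f"on the list as {hit!r}" if hit else "not on the list"
    print(f"{name}: {status}")
```

A match only tells you the name is on the list. A clean result says nothing about whether your own stock is what the label says, which is where the authentication testing comes in.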


American Type Culture Collection Standards Development Organization Workgroup ASN-0002. (2010). Cell line misidentification: the beginning of the end, Nature Reviews Cancer, 10 (6) 441-448. DOI:

Genome Editing with CRISPR-Cas9, nifty animation

I saw this come across my twitter feed the other day, and as a nice Friday afternoon diversion I posted it to Google+. I was surprised how popular it was. So I thought–hey, I have a blog too. Let’s put it there…. So grab some coffee and watch, a nice gentle way to get your Monday underway.

This animation depicts the CRISPR-Cas9 method for genome editing – a powerful new technology with many applications in biomedical research, including the potential to treat human genetic disease. Feng Zhang, a leader in the development of this technology, is a faculty member at MIT, an investigator at the McGovern Institute for Brain Research, and a core member of the Broad Institute. Further information can be found on Prof. Zhang’s website.

Images and footage courtesy of Sputnik Animation, the Broad Institute of MIT and Harvard, Justin Knight and pond5.

The publications page at the Zhang lab has some nice examples of CRISPR, including that knockin mouse one with cancer modeling applications. I’ve been meaning to get that but don’t have a subscription to Cell, so that was handy.

Platt R., Sidi Chen, Yang Zhou, Michael J. Yim, Lukasz Swiech, Hannah R. Kempton, James E. Dahlman, Oren Parnas, Thomas M. Eisenhaure, Marko Jovanovic, Daniel B. Graham, et al. (2014). CRISPR-Cas9 Knockin Mice for Genome Editing and Cancer Modeling, Cell, 159 (2) 440-455. DOI:

Bioinformatics tools extracted from a typical mammalian genome project [supplement]

This is Table 1 that accompanies the full blog post: Bioinformatics tools extracted from a typical mammalian genome project. See the main post for the details and explanation. The table is too long to keep in the post, but I wanted it to be web-searchable. A copy also resides at FigShare:
