Category Archives: Tip of the Week

Video Tip of the Week: VectorBase, for invertebrate vectors of human pathogens

I wish I had been clever enough to coordinate this week’s Video Tip of the Week with “Mosquito Week” a couple of months back. There was a bunch of chatter at that time about this infographic that was released by Bill Gates, which illustrated the contribution of various human-killing species. The mosquito was deemed: The Deadliest Animal in the World. Jonathan Eisen took issue with the numbers, however, noting that if you are consistent about the way you count disease vectors, humans come out on top (or, bottom, I guess, in this category). Still, Eisen noted, mosquitoes are important and demand attention. But there are lots of other vectors to keep in mind as well.

Luckily, the team at VectorBase is on it. VectorBase has been providing information on invertebrate vectors of human pathogens for a long time. They collect data on a variety of species, including mosquitoes, but also a lot more–ticks, lice, flies, etc. Check out the list of organisms on their site. They have information not only on basic biology, but also on the key problem of insecticide resistance.

We’ve been fans of VectorBase for years, and have highlighted them in the past, after a site redesign a couple of years ago, and a few other times with various other news tidbits. But I was delighted to discover recently that they have a new overview video which is my favorite kind to highlight in these tips. If you are new to a resource, a brief overview is the most helpful way to understand the kinds of data and tools you’ll see at their site. They have a lot of other slide/PDF tutorials as well, which focus on specific tools and features that will supplement an overview. But in our experience, a video overview is a bit more tempting when you are first becoming acquainted with a resource.

So here I’ve embedded the VectorBase overview, which you can also find here: The slides to accompany it are also available there.

So have a look at VectorBase’s important collection of species data and tools. You can also read more about their foundations and directions in their publications, including the one below. I keep up with news about their new features from their newsletter, but you can also see other types of community outreach strategies over at their site.

Quick link:



Megy K., D. Lawson, D. Campbell, E. Dialynas, D. S. T. Hughes, G. Koscielny, C. Louis, R. M. MacCallum, S. N. Redmond, A. Sheehan et al. (2012). VectorBase: improvements to a bioinformatics resource for invertebrate vector genomics, Nucleic Acids Research, 40 (D1) D729-D734. DOI:

Bonus video: The Gates blog hosted this highly-produced video about mosquito bites and their impact.

Video Tip of the Week: Google Genomics, API and GAbrowse

This week’s video tip comes to us from Google–it’s about their participation in the “Global Alliance for Genomics and Health” coalition. Global Alliance is aimed at developing genomic data standards for interoperability, and they’ve been working on creating the framework (some background links below in the references will provide further details). It has over 170 members, and one of these members is Google. Although Google talked about this earlier this year when they joined this group, more recently pieces have begun to emerge about the directions and specific tools. Google’s efforts made the mainstream news recently in their announcement about working on a project to examine genomic data associated with autism.

Although this video doesn’t talk about a single specific tool like we usually cover, it provides more detail about this framework for building tools which is important to understand. And in this video I learned about a new browser developed under this project that I did have a quick look at, and I’ll add below.

The browser that they reference is called GAbrowse–I assume that means Global Alliance browse–but there’s not a lot of detail. Their “about” dialog box says this:

GABrowse is a sample application designed to demonstrate the capabilities of the GA4GH API v0.1.

Currently, you can view data from Google, NCBI and EBI.

  • Use the button on the left to select a Readset or Callset.
  • Once loaded, choose a chromosome and zoom or drag the main graph to explore Read data.
  • Individual bases will appear once you zoom in far enough.

The code for this application is in GitHub and is a work in progress. Patches welcome!

I kicked the tires a bit, but it’s clearly not fully fleshed out at this point. When I tried to zoom up from the nucleotide level it went up a bit, but eventually you hit a point that says “This zoom level is coming soon!” So certainly there’s more to come, and a lot more functionality that would be necessary. But it’s early. And it’s just a demo. I have no idea if it’s intended to become a stand-alone public browser.

So if you are interested in the issue of cross-compatibility of human genomic data (and as far as I can tell this is all human-centric; I’d like to see a wider conversation on this), it’s probably worth knowing what Google is offering here. You should also be aware of what the Global Alliance is working on. Below I’ve added some of the publications and media I’ve seen about their efforts.

Hat tip to Can Holyavkin on Google+ for the link to the video.

Quick links:

Global Alliance for Genomics and Health:

Google genomics:


(2013). Global Alliance to Create Standards For Sharing Genomic Data, American Journal of Medical Genetics Part A, 161 (9) xi-xi. DOI:

Callaway E. (2014). Global genomic data-sharing effort kicks off, Nature, DOI:

White paper 2013:

Framework for Responsible Sharing of Genomic and Health-Related Data – DRAFT # 7

Terry S.F. (2014). The Global Alliance for Genomics & Health, Genetic Testing and Molecular Biomarkers, 18 (6) 375-376. DOI: [available here from GA:]

Video Tip of the Week: NCBI Variation Viewer

The folks at NCBI recently hosted a webinar that covered a number of resources: GTR, ClinVar, and MedGen. It was a nice introduction to these resources using a case study of exploring information about a 9-year-old child who needed to get clearance for participation in sports. So they follow the course of some details about this kid across the different resources at NCBI to show what you could learn at the different sites.

I was hoping that recording would become available so that could be a triple-tip of the week, but I haven’t seen any announcements of it; I’ll keep an eye out and highlight it in the future if it does. Below I have also referenced a paper that covers some of the same ground as that webinar. But in the meantime they also recently added a new short video about the Variation Viewer that I found handy as well. So that will be this week’s video tip.

I particularly liked the way you can easily select an exon to focus on, with the little bubbles near the top. That wasn’t obvious to me at first.  People are often asking me for handy ways to focus in on the specifics of a single exon.

In addition to this video, I will also offer a screen-cap of one of the slides from the longer webinar that linked to related resources around NCBI. If you haven’t checked out these associated tools you will want to look at them as well. There are a lot of terrific tools available and they are always adding new useful features. Follow them on Twitter for announcements about their tools and trainings–that’s how I stay on top of the new items.

NCBI webinar sites

Quick links:

Variation Viewer:





Landrum M.J., G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church & D. R. Maglott (2014). ClinVar: public archive of relationships among sequence variation and human phenotype, Nucleic Acids Research, 42 (D1) D980-D985. DOI:

Video Tip of the Week: Leukemia outcome predictions challenge

Although I had other tips in the pipeline, I’m bumping this one up because it is time sensitive. It’s about a competition (or challenge, as they describe it) to use data from cases of leukemia to model and predict outcomes, which could help drive treatment decisions someday. It is called the Acute Myeloid Leukemia Outcome Prediction Challenge.

I found out about it on Google+ via Amina Qutub. In case you can’t see G+, here’s the detail from that post:

Join the crowd to make an impact on cancer research!

The crowd-sourced DREAM 9 Challenge Wiki site opened to all interested scientists, mathematicians, computer scientists, engineers and clinicians around the world. This DREAM Challenge’s goal is to use the wisdom of the crowds to develop new algorithms to understand and treat leukemia, using data provided by M.D. Anderson Cancer Center.

To join, learn about the Challenge, & interact with the data using new visualization tools, visit the DREAM Wiki:!Synapse:syn2455683/wiki/

You can sign up to access the data to begin to work with it. But even before that you can check out the visualization options they provide. This video illustrates a tool they have, which lets you examine specific proteins and specific clinical features, as well as the survival data. (As they note about this tool, though: “The DiBS Data Visualization modules are proprietary, patent-pending tools.”)

From their Data Visualization page, you can click below the video to start looking at that heat map and survival curve data–without having to sign up to access the underlying data. Under the video, click the “BioWheel Interactive Visualization” button to kick the tires a bit.

I started to click around with the visualization tools, but I can’t quite figure out what the YWHAE and STK11 heat map patterns mean; they looked very different to me in the visualization. I have signed up to look at the data itself but I haven’t had a chance to dig any more yet. But it’s available to anyone who agrees to the terms of use, and maybe you can suss out some of the signals that would meet the challenge’s goals:

Subchallenge 1: Determine the best model to predict which AML patients will have Complete Remission or will be Primary Resistant.

Subchallenge 2: For patients who have Complete Remission, predict remission duration.

Subchallenge 3: Predict the overall survival time for each patient

Researchers around the world are collecting lots of data on many disease scenarios. It needs to get closer to patients. Projects like this, with many eyes on them, are a nice way to help us get there. Here’s a recent piece about other similar efforts: New platforms aim to obliterate silos of participatory science. There are other challenges from the Sage Bionetworks folks as well. They describe their mission this way:

As a 501(c)(3) nonprofit organization, Sage Bionetworks’ mission is to catalyze a cultural transition from the traditional single-PI, single-lab, and single-company research paradigm to a model founded on broad precompetitive collaboration. This structure would benefit patients by accelerating development of disease treatments, and society as a whole by reducing the cost of health care and biological research. Sage Bionetworks is actively engaged with academic, industrial, governmental, and philanthropic collaborators in developing this distributed research model.

And there will be more challenges in the future–a reference below explains more of the foundation for these types of efforts. Keep an eye out for them, and hack away.


Boutros P.C., Kyle Ellrott, Thea C Norman, Kristen K Dang, Yin Hu, Michael R Kellen, Christine Suver, J Christopher Bare, Lincoln D Stein, Paul T Spellman et al. (2014). Global optimization of somatic variant identification in cancer genomes with a global community challenge, Nature Genetics, 46 (4) 318-319. DOI:

Dolgin E. (2014). New platforms aim to obliterate silos of participatory science, Nature Medicine, 20 (6) 565-566. DOI:

Video Tip of the Week: e-PathGen, Using Genomics to Support Public Health

Recently I saw the Director of Public Health Genomics for the CDC tweet about a resource that was new to me, ePathGen: Pathogen Genomics for Epidemiology. This is an area that I’m glad to see getting attention. My undergrad degree was in microbiology, and certainly the most memorable class I had in college was about pathogenic bacteria and viruses and their consequences throughout history and to the present. One thing that was stressed to us, though, was that we could only study the things we could culture. Some things were really challenging to grow or couldn’t be grown at all with current methods. I was struck by this again in a seminar I heard where a physician described the assessment of the organisms in a brain abscess sample: they were able to culture 22 organisms. With PCR, the same sample showed 72. Eek.

But our new ability to look at unculturable organisms by sequencing them rapidly, and then to more quickly and appropriately target infections, is even getting NYT press coverage at this point: In a First, Test of DNA Finds Root of Illness. And that’s just one kid’s illness–this can also be used to more quickly put the brakes on community-wide issues too. So here’s the tweet that caught my eye:

And I went to see what e-PathGen was about. What they provide are a couple of video tutorials, but I can’t embed them here because they are part of a learning module that also has additional details and two case studies to work through. The tutorials offer some guidance for folks who might be new to genomics and the sequencing technology from a public health perspective. Then the two case studies show how this type of information might be used on a specific outbreak of illness.

So here’s a look at their landing page, and you can click to go over there and watch their videos:

ePathGen Tutorials and Case Studies -- click to visit them.


Or go to the site with this link:

Usually we like to highlight specific database resources or other bioinformatics tools in our tips, and in the first case study I came across a database collection that was new to me–the BIGSdb system, Bacterial Isolate Genome Sequence Database. This is a framework that offers researchers and clinicians a place to store details of specific isolates from patient or environmental samples. It doesn’t require whole genome data, but it is flexible enough to support it, and we will increasingly see more of that kind of data coming along.

This framework has now been used by a number of different groups to create databases with their organisms of interest. Check out this list of organisms that you can find individual samples from: I hope to take a look at this in a future tip–I’ve already gone longer than I like to in our weekly introduction to a new resource. So check back with us for more on this later.

Quick links:

ePathGen videos and case studies:

Bacterial Isolate Genome Sequence Database (BIGSdb):



Jefferies J. & McCulloch J. (2014). ePathGen – a new e-learning package in pathogen genomics., Euro surveillance : bulletin Européen sur les maladies transmissibles = European communicable disease bulletin, PMID:

Jolley K.A. & Maiden M.C.J. (2010). BIGSdb: Scalable analysis of bacterial genome variation at the population level., BMC bioinformatics, DOI:

Video Tip of the Week: InterMine for complex queries

We’ve been fans of InterMine for a long time. We did a tip-of-the-week a while ago that highlighted ways that this software can be used to mine data from big data projects of many types. The generic framework of InterMine can be customized for use at different projects–today I’ll include videos from the FlyMine installation and the YeastMine flavor–but you may find versions of this handy tool in many other places as well.

The first video is a broader overview of different types of things you can do–and although this is FlyMine, you’ll find similar behavior at the other Mines too.

This next video is more specific about a task that people need to accomplish–working with a list of genes. This example was recently produced by the YeastMine folks, but again this should work in a similar way across other Mines. You should also read the SGD blog post on it–Create, Analyze, Save: the Power of Gene Lists in YeastMine.

The other thing that I noticed about this framework is the effort of several of these model organism Mines to coordinate into this InterMOD structure. Although I am often wary of “one search to rule them all” sorts of efforts, there can be value in this as a central organizing principle as we keep adding more species genomes that may not have as well-developed communities and infrastructure to support them.

I certainly use a lot of query tools that are similar to these, like the UCSC Table Browser and BioMart. UniProt offers ways to build queries that are different but conceptually similar. Using these interfaces you can construct some clever and complex ways to extract information out of data repositories.

Quick links:






Smith R.N., Aleksic J., Butano D., Carr A., Contrino S., Hu F., Lyne M., Lyne R., Kalderimis A., Rutherford K. et al. (2012). InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data., Bioinformatics (Oxford, England), DOI:

Lyne R., Smith R., Rutherford K., Wakeling M., Varley A., Guillier F., Janssens H., Ji W., Mclaren P., North P. et al. (2012). FlyMine: an integrated database for Drosophila and Anopheles genomics., Genome biology, PMID:

Balakrishnan R., Park J., Karra K., Hitz B.C., Binkley G., Hong E.L., Sullivan J., Micklem G. & Cherry J.M. (2012). YeastMine–an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit., Database : the journal of biological databases and curation, PMID:

Sullivan J., Karra K., Moxon S.A.T., Vallejos A., Motenko H., Wong J.D., Aleksic J., Balakrishnan R., Binkley G., Harris T. et al. (2013). InterMOD: integrated data and tools for the unification of model organism research., Scientific reports, 3 (1802) PMID:

Video Tip of the Week: LineUp, data ranking visualization tool


Caleydo, from the Institute of Computer Graphics and Vision, is a suite of genomics and biomolecular visualization tools. As the project developers state, its strength is “the visualization of interdependencies between multiple datasets.” The tip of the week this week is a video introducing one of their newest tools: LineUp.

LineUp is an open source, scalable visualization technique for ranking systems that use several disparate ranks. LineUp was developed to

address [the] need to understand the ranking of genes by mutation frequency and other clinical parameters in a group of patients,…It is an ideal tool to create and visualize complex combined scores of bioinformatics algorithms.

Yet it can be used for many different ranking systems, whether that is to view rankings of universities or restaurants, or ranked datasets from various sources. In the video above, the users explain how to use LineUp to look at and visualize the ranking of universities based on several different rankings such as student reputation, student-to-faculty ratio and many others. The tool allows users to assign weights to different parameters to create a custom ranking.
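To make the idea of a weighted custom ranking a bit more concrete, here is a minimal Python sketch. It is not LineUp’s code and I’m not claiming this is how LineUp computes its scores; it just assumes the simplest approach, where each attribute is rescaled to 0..1 and combined as a weighted sum. The school names, attribute names, and numbers are all made up for illustration.

```python
from typing import Dict, List, Tuple


def normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale one attribute to the 0..1 range so attributes on
    different scales can be combined (assumption: min-max scaling)."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


def weighted_ranking(
    attributes: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Combine several per-item attributes into one score per item
    using a weighted sum, then sort from best to worst."""
    total = sum(weights.values())
    normalized = {name: normalize(vals) for name, vals in attributes.items()}
    items = next(iter(attributes.values())).keys()
    scores = {
        item: sum(weights[name] / total * normalized[name][item]
                  for name in attributes)
        for item in items
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: higher is better for both attributes.
attributes = {
    "academic_score": {"School A": 90.0, "School B": 70.0, "School C": 80.0},
    "closeness": {"School A": 1.0, "School B": 5.0, "School C": 3.0},
}

# A parent who cares mostly about academics...
academics_first = weighted_ranking(
    attributes, {"academic_score": 3.0, "closeness": 1.0})
# ...and one who cares mostly about distance.
distance_first = weighted_ranking(
    attributes, {"academic_score": 1.0, "closeness": 3.0})

print([name for name, _ in academics_first])
print([name for name, _ in distance_first])
```

Changing the weights flips the order of the top and bottom schools here, which is the kind of effect the tool lets you explore interactively.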

You really need to watch the video to understand the power of the visualization tool and its broad applicability. I immediately saw several uses in research, but even down to choosing schools for my children. In San Francisco schools are by “lottery,” and you rank the schools by preference. There are so many datasets that affect that for parents: distance, academic ranking, teacher-to-student ratio, diversity ranking and several more. I could see this tool as a great way to determine the ranking of our choices. The uses are endless.

Quick Links:




Gratzl S, Lex A, Gehlenborg N, Pfister H, & Streit M (2013). LineUp: visual analysis of multi-attribute rankings. IEEE transactions on visualization and computer graphics, 19 (12), 2277-86 PMID: 24051794

Video Tip of the Week: New UCSC “stacked” wiggle track view

This week’s video tip shows you a new way to look at the multiWig track data at the UCSC Genome Browser. A new option has recently been released (see 06 May 2014), a “stacked” view, and it’s a handy way to look at the data with a new strategy. But I’ll admit it took me a little while of working with it to understand the details. So in this tip I hope you’ll see what the new visualization offers.

I won’t go into the background on the many types of annotation tracks available–if you need to be introduced to the idea of the basic track views, start out with our introduction tutorial that touches on the different types of graphical representations. Custom tracks are touched on in the advanced tutorial. For guidance specifically on how to create the different track types, see the UCSC documentation. The type of track I’m illustrating in the video today, a MultiWig track, has its own section over there too. Basically, if you are completely new to this, the “wiggle” style is a way to show a histogram display across a region. MultiWig lets you overlay several of these histograms in one space. In the example I’ll show here, the results of looking at 7 different cell lines are shown for some histone mark signals (Layered H3K27Ac track).

Annotation track cell lines


When I saw the announcement, I thought this was a good way to show all of the data simultaneously. When we do basic workshops, we don’t always have time to go into the details of this view, although we do explore it in the ENCODE material, because the track I’m using is one of the ENCODE data sets. I’ll use the same track in the same region as the announcement, which is shown here:

Stack announcement

But when I first looked at this, I wasn’t sure if the peak–focus on the pink peak that represents the NHLF cell line–was meant to cover the whole area underneath or not. What I was trying to figure out is essentially this (a graphical representation of my thought process follows):


By trying out the various styles I was pretty sure I had the idea of what was really being shown, but I confirmed that with one of the track developers. The value is only the pink band segment, not the whole area below it. And Matthew also noted to me that they are sorting the tracks in reverse alphabetical order (so NHLF is the highest in the stack). That was an aspect I hadn’t realized yet. They are not sorting based on the values at that spot. This makes sense, of course, but it wasn’t obvious to me at first.
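Here is a small Python sketch of how I now think about the stacked view at a single position. This is not UCSC’s code, and the signal values are invented; it just assumes that each cell line’s band sits on top of the running total of the bands below it, with names ordered so that the alphabetically last one (NHLF) ends up on top.

```python
from typing import Dict, List, Tuple


def stacked_bands(signal: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (cell_line, bottom, top) for each band, from the bottom
    of the stack upward. Each band's height equals that cell line's
    own value, not the total area beneath it."""
    bands = []
    bottom = 0.0
    # Alphabetical from the bottom up, so reading top-down the order is
    # reverse alphabetical (an assumption based on the description above).
    for name in sorted(signal):
        top = bottom + signal[name]
        bands.append((name, bottom, top))
        bottom = top
    return bands


# Hypothetical signal values at one base position for three cell lines.
signal = {"NHLF": 4.0, "GM12878": 10.0, "K562": 2.5}
bands = stacked_bands(signal)

# Print from the top of the stack down.
for name, bottom, top in reversed(bands):
    print(f"{name}: drawn from {bottom} to {top}, value = {top - bottom}")
```

In this toy case the NHLF band is drawn from 12.5 to 16.5, but its value is only 4.0, which is the band segment and not everything underneath.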

I like this option very much–but I figured if I had to do some noodling on what it actually meant others might have the same questions.

In the video I’ll show you how this segment looks with the different “Overlay method” settings on that track page. I’ll be looking at the SOD1 area, like the announcement example.  I tweaked a couple of the other settings from the defaults so it would be easier to see on the video (see arrowheads for my changes). But I hope this conveys the options you have now to look at this type of track data effectively.

Track settings for video

So here is the video with the SOD1 5′ region in the center, using the 4 different choices of overlay method, illustrating the histone mark data in the 7 cell lines. I’m not going into the details of the data here, but I’ll point you to a reference associated with this work for more on how it’s done–see the Bernstein lab paper below. I wanted to just demonstrate this new type of viewing option that will be available on wiggle tracks. Some tracks will have too much data for one type or another, or will be clearer with one or another style. But now you have an additional way to consider it.

Quick links:

UCSC Genome Browser:

UCSC Intro tutorial:

UCSC Advanced tutorial:

These tutorials are freely available because UCSC sponsors us to do training and outreach on the UCSC Genome Browser.


Kent W.J., Zweig A.S., Barber G., Hinrichs A.S. & Karolchik D. (2010). BigWig and BigBed: enabling browsing of large distributed datasets., Bioinformatics (Oxford, England), PMID:

Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M. et al. (2013). The UCSC Genome Browser database: 2014 update., Nucleic acids research, PMID:

Ram O., Goren A., Amit I., Shoresh N., Yosef N., Ernst J., Kellis M., Gymrek M., Issner R., Coyne M. et al. Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells., Cell, PMID:

The ENCODE Project Consortium, Bernstein B.E., Birney E., Dunham I., Green E.D., Gunter C. & Snyder M. et al. (2012). An integrated encyclopedia of DNA elements in the human genome., Nature, 489 PMID:

Also see the Nature special issue on ENCODE data, especially the chromatin accessibility and histone modification subset (section 02):

Video Tip of the Week: PhenX, standardizing phenotype measurements

This week’s tip is actually sort of a mega-tip. It’s not just one video–it’s a series of videos that the Genetic Alliance has provided (and there are more to come) with the theme: “Managing the Mass of Measures: Real People’s Real Data Made Useful”. It is part of their Standards and Tools webinar series, which offers outreach and knowledge on tools that aim to bridge the research–>patient gap.

Although I’ll select the PhenX webinar to highlight, PhenX is just one of the tools in a sort of ecosystem that is building more support from research knowledge to patient phenotypes, which can potentially link up with electronic health records (EHR) data, and hopefully lead to new insights and treatments. PhenX is a big team, with various moving parts, and their toolkit is the PhenX Toolkit, a way to standardize and collect important measurements about human biology and factors that influence health.

On their home page they describe this aspect of the work:

The Toolkit provides standard measures related to complex diseases, phenotypic traits and environmental exposures. Use of PhenX measures facilitates combining data from a variety of studies, and makes it easy for investigators to expand a study design beyond the primary research focus.

That should give you an idea of the types of things they intend to capture, but I would encourage you to have a look at their webinar to learn more about their project and how it fits into the scheme of translational medicine. But rather than me talking on about it, watch their presentation from the webinar series page, or embedded here:

Also in this webinar there were references to other projects and tools you might be interested in which relate to this: the Electronic Medical Records and Genomics network (aka eMERGE), PheKB (the Phenotype KnowledgeBase), and the PROMIS system (Patient Reported Outcomes Measurement Information System). These things are certainly downstream of most of the bioinformatics I’ve been involved with, but they are an important direction for getting the research work tied more closely to the clinical side, to ultimately have impacts on human health.

Quick links:
PhenX team site:
PhenX Toolkit:


Hamilton C.M., Strader L.C., Pratt J.G., Maiese D., Hendershot T., Kwok R.K., Hammond J.A., Huggins W., Jackman D., Pan H. et al. (2011). The PhenX Toolkit: Get the Most From Your Measures, American Journal of Epidemiology, 174 (3) 253-260. DOI:

Video Tip of the Week: PheGenI, Phenotype-Genotype Integrator

The hunt for variations in genes and genomes has been both fruitful and frustrating. We can see genome variations in a variety of ways, but we can’t always connect them with a phenotype easily. And vice versa, of course. Another problem is that the kinds of data that we want to mine for further analysis are stored in different silos. PheGenI (Phenotype-Genotype Integrator) is an attempt to wrangle some silos together.

As they describe on their landing page:

The Phenotype-Genotype Integrator (PheGenI), merges NHGRI genome-wide association study (GWAS) catalog data with several databases housed at the National Center for Biotechnology Information (NCBI), including Gene, dbGaP, OMIM, GTEx and dbSNP.

The GWAS catalog is something I’ve turned to a number of times looking for samples of studies on different topics. It’s possible to search it from their site, or just browse around the enormous table. But as of right now, it’s just getting bigger and bigger: “As of 05/03/14, the catalog includes 1912 publications and 13270 SNPs“. Kind of a lot to browse at this point.

But of course we use Gene, dbGaP, OMIM, and dbSNP too (and we have training on these). GTEx stands for Genotype-Tissue Expression, and here it refers to the eQTL (expression quantitative trait loci) browser (I have got to write up something on GTEx).
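If it helps to picture what “wrangling silos together” means in practice, here is a tiny Python sketch of the general idea: joining association records to annotation records on a shared SNP identifier. This is only an illustration of a join and not how PheGenI is implemented. The SNP IDs, gene names, traits, and field names are all placeholders that I made up.

```python
from typing import Dict, List, Optional


def integrate(
    gwas_rows: List[Dict[str, object]],
    snp_annotations: Dict[str, Dict[str, str]],
) -> List[Dict[str, Optional[object]]]:
    """Attach gene and eQTL tissue annotations to each association row,
    matching on the SNP identifier. Unmatched SNPs get None."""
    merged = []
    for row in gwas_rows:
        annotation = snp_annotations.get(str(row["snp"]), {})
        merged.append({
            **row,
            "gene": annotation.get("gene"),
            "eqtl_tissue": annotation.get("eqtl_tissue"),
        })
    return merged


# Hypothetical "silo" 1: association results (placeholder identifiers).
gwas_rows = [
    {"snp": "rs_example_1", "trait": "Trait X", "p_value": 1e-9},
    {"snp": "rs_example_2", "trait": "Trait Y", "p_value": 3e-8},
    {"snp": "rs_example_3", "trait": "Trait X", "p_value": 5e-8},
]

# Hypothetical "silo" 2: per-SNP annotations from other resources.
snp_annotations = {
    "rs_example_1": {"gene": "GENE_A", "eqtl_tissue": "liver"},
    "rs_example_2": {"gene": "GENE_B"},
}

merged = integrate(gwas_rows, snp_annotations)
for record in merged:
    print(record)
```

The third record has no annotation in the second source, so its gene comes back empty, which is exactly the kind of gap you notice once the silos are side by side.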

At the recent Biology of Genomes meeting (#BoG14), this problem was illustrated thus:

So PheGenI offers a way to navigate among these different types of resources more easily. You can learn more about the resource in this video, and from the paper linked below.

The place they recommend in the video for an overview of the goals of PheGenI: New Web Portal Expands View of Genetic Association Data for Researchers. And, of course, check out their paper below.

Quick link:

PheGenI homepage:


Ramos E.M., Hoffman D., Junkins H.A., Maglott D., Phan L., Sherry S.T., Feolo M. & Hindorff L.A. (2013). Phenotype–Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources, European Journal of Human Genetics, 22 (1) 144-147. DOI: