Category Archives: Tip of the Week

Video Tip of the Week: PaleoBioDB, for your paleobiology searches

Yeah, I know, it’s not genomics–but it’s the history of life on this planet–right?  The Paleobiology Database has been keeping records of this ancient biology for a while now, and they have some really nice tools to explore the fossil records and resources that have become available. It’s also interesting to me to see the informatics needs of this type of project. It has a lot of overlap with databases of more recent biology, like the GOLD one–they need taxonomy for the organisms, they need literature links–but they have other needs to capture both geographical regions and the layers of time as well.

There are a couple of ways to access the data. When you arrive at the main landing page, you have the choice to “Launch PBDB”, or “Launch Navigator”. PBDB is a “classic” interface, with typical search boxes and query results. Since this is the internet, I used that “quick search” and looked for paleo cats, and found a lot of Felis in there. But that’s not the only way to look around. They have a newer graphical access mechanism that’s called the Navigator. You can use the navigator to search the world, filter for specific items or time periods–but my favorite thing is you can reset the planet to be what it looked like eons ago. This is covered in their intro video that is this week’s Tip of the Week:

They have other videos as well; you can see that they offer help with both this Navigator interface and the classic style. Their “apps” offer other types of searches too. You can even search for insect size. Another way to access information is via R. I began to look around at this because David Bapst on Google+ pointed to their new publication announcement (linked below), offering their R package for accessing their underlying data.
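The R package is the route the paper describes, but the same underlying data can in principle be requested from any language. Here is a minimal sketch in Python that only builds a query URL for fossil occurrences of a taxon. The base URL, the `occs/list` path, and the `base_name`, `show`, and `interval` parameters are my assumptions about the PBDB data service and should be checked against the project's own documentation; the function name is made up for this example.

```python
from urllib.parse import urlencode

# Assumed base URL for the PBDB data service; verify before relying on it.
PBDB_BASE = "https://paleobiodb.org/data1.2"


def occurrence_query_url(taxon, interval=None, fmt="json"):
    """Build a URL asking for occurrences of a taxon and its subtaxa.

    Parameter names (base_name, show, interval) are assumptions about
    the service, not something stated in the post.
    """
    params = {"base_name": taxon, "show": "coords"}
    if interval:
        params["interval"] = interval
    return f"{PBDB_BASE}/occs/list.{fmt}?{urlencode(params)}"


if __name__ == "__main__":
    # The paleo cats from the quick search, with and without a time filter.
    print(occurrence_query_url("Felis"))
    print(occurrence_query_url("Felis", interval="Miocene"))
```

Nothing is fetched here; pasting the printed URL into a browser (or handing it to your HTTP library of choice) is the step where you would find out whether the assumed parameters still match the live service.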

According to their publications page, this resource supports a wide range (and copious amount) of research in this field. It was really neat to have a look at a rather different scale of bioinformatics across the time horizon. Check out the Paleobiology Database resources for your fossil needs.

Quick link:

Paleobiology DB:

Varela S., González-Hernández J., Sgarbi L., Marshall C., Uhen M., Peters S. & McClennen M. (2014). paleobioDB: an R package for downloading, visualizing and processing data from the Paleobiology Database, Ecography, DOI: 10.1111/ecog.01154

Video Tip of the Week: SeqMonk

Always on the lookout for effective visualization tools, I recently came across a series of videos about the SeqMonk software. It’s not software that I had used before, so I wanted to look at the videos, and then try it out. It downloaded quickly, offered me an extensive list of genomes to load up, and then right away I was kicking the tires. And I was impressed. It was easy to locate and explore different regions and the different tracks that were available. And it appears to be very straightforward to load up your own data as well. The video I’ll highlight here is called “Creating Custom Genomes with SeqMonk” which gives a nice intro to their setup.

But they have a whole BabrahamBioinf channel with helpful videos, including a nice short one on how to export graphical representations to use for presentations and publications and such. This is a request I hear a lot from people, and this is a nice guide.

Then I went to look for references for the software to learn more. The group that has developed it–Babraham Bioinformatics–hasn’t published papers specifically on their tools, apparently. They are a services and support group for an institution and not a research group. But they make many of their tools available to the public.

As I’ve noted, though, I really like to get a sense of how people are using the tools, and who is using tools, by looking deeply at the literature. When something has no official citation, it’s harder to assess. And as I’ve pointed out, many papers don’t even cite the tools in the main paper; sometimes the mention is only in figure legends or supplements.

A lot of folks have found SeqMonk useful. But it took me 3 different site searches to figure out how useful. I searched at PubMed, PubMedCentral, and Google Scholar. The results were pretty interesting, actually. Just a basic search for SeqMonk yields these differences:

Literature search site    Number of results
PubMed                    1
PubMedCentral             53
Google Scholar            110

The paper in PubMed wasn’t in PubMedCentral, but it was among the 100+ in Google Scholar. Of the 53 in PMC, 2 were absent from Scholar–one had SeqMonk in a figure legend, one had SeqMonk in supplemental procedures. Google Scholar obviously had the biggest range–it also included meeting abstracts, theses, and patent documents, and also a few false positives (from 1840?, 1929, and a couple of other things I couldn’t figure out). Oddly, sometimes the titles differed between PMC and Scholar, but they appeared to be the same paper.  As I’ve noted before, it’s challenging to find out where software is being used, since the way people reference it can be so variable. This was another interesting example of this variability.
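For the two NCBI databases, this kind of count comparison can be scripted with the E-utilities `esearch` service. The sketch below only builds the request URLs and parses the hit count out of a JSON reply; the sample reply is hand-written for illustration, not a real response, and the helper names are mine. Google Scholar has no equivalent here, so that column still means a manual search.

```python
import json
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build an esearch request for one Entrez database (e.g. pubmed, pmc)."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term, 'retmode': 'json'})}"


def hit_count(response_text):
    """Pull the total hit count out of an esearch JSON reply."""
    return int(json.loads(response_text)["esearchresult"]["count"])


if __name__ == "__main__":
    for db in ("pubmed", "pmc"):
        print(db, esearch_url(db, "SeqMonk"))

    # Hand-written stand-in for a reply, trimmed to the one field used here.
    sample_reply = '{"esearchresult": {"count": "53"}}'
    print(hit_count(sample_reply))
```

Counts from a live query will differ from the ones in my table, since the literature keeps growing.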

But that aside, I was certainly impressed by the various types of data and species that SeqMonk has supported. The variety of species included archaea, chloroplast genome studies, bacteria, ancient maize, yeast, medicinal mushroom mitochondria, zebrafish, and a lot of mammalian research. It has supported a wide range of explorations and topics–lots of epigenetics, PCR techniques, telomere erosion, methylomes of tumors, and even comparison of sequence alignment software. Figure 1 of that aligners paper gives you a nice look at SeqMonk in the wild.

So have a look at the features of SeqMonk for visualization, analysis, and display of existing genomes or your own data. It’s a flexible and effective tool for many purposes.

Quick links:


Their video channel:

Their training materials:

Follow them on twitter:


Chatterjee A., P. A. Stockwell, E. J. Rodger & I. M. Morison (2012). Comparison of alignment software for genome-wide bisulphite sequence data, Nucleic Acids Research, 40 (10) e79-e79. DOI:

Video Tip of the Week: MedGen, GTR, and ClinVar

The terrific folks at NCBI have been increasing their outreach with a series of webinars recently. I talked about one of them not too long ago, and I mentioned that when I found the whole webinar I’d highlight that. This recording is now available, and if you are interested in using these medical genetics resources, you should check this out.

I was reminded of this webinar by a detailed post over at the NCBI Insights blog: NCBI’s 3 Newest Medical Genetics Resources: GTR, MedGen & ClinVar. There’s no reason for me to repeat all of that–I’ll conserve the electrons and direct you over there for more details about the features of these various tools. There is a lot of information in these resources, and the webinar touches on these features and also describes the relationships and differences among them.

I’ve been catching notices of their webinars by following their Twitter announcements. The next one is coming up on October 15th, announced here, on E-Utilities. Follow them to keep up with the new offerings: @NCBI.

Quick links:


GTR, Genetic Testing Registry:



Acland A., R. Agarwala, T. Barrett, J. Beck, D. A. Benson, C. Bollin, E. Bolton, S. H. Bryant, K. Canese, D. M. Church, K. Clark et al. (2013). Database resources of the National Center for Biotechnology Information, Nucleic Acids Research, 42 (D1) D7-D17. DOI:

Video Tip of the Week: UCSC #Ebola Genome Portal

Although I had other tips in my queue already, over the last week I’ve seen a lot of talk about the new Ebola virus portal from the UCSC Genome Browser team. And it struck me that researchers who have worked primarily on viral sequences may not be as familiar with the functions of the UCSC tools. So I wanted to do a tip with an overview for new folks who may not have used the browser much before.

There is great urgency in understanding the Ebola virus, examining different isolates, and developing possible interventions to help tackle this killer. Jim Kent was made aware of the CDC’s concerns by his sister–who edits the CDC’s “Morbidity and Mortality Weekly Report”, according to this story:

“It wasn’t until talking to Charlotte that I realized this one was special,” Jim Kent said. “It had broken out of the containments that had worked previously, and really, if a good response wasn’t made, the entire developing world was at risk.”

Jim Kent redirected his team of 30 genome analysts to devote all resources toward developing the Ebola genome. They worked through the night for a week to develop a map for other scientists to determine where on the virus to target treatment.

So the folks at UCSC have created a portal where you can explore the sequence information and variations among different isolated strains, annotations about the features of the genes and proteins, and they even added a track for the Immune Epitope Database (IEDB, which happened to be a video tip not long ago)–where antibodies have been shown to bind Ebola protein sequences. The portal also provides links to publications and further research related to these efforts.

The reference sequence that forms the framework for the browser is a sample from Sierra Leone. It was isolated from a patient this past May, and I don’t see a publication attached to it–the submission is from the Broad’s Viral Hemorrhagic Fever Consortium. The announcement email has more details, along with thanks to the Pardis Sabeti lab for the sequence. So, as we keep seeing, we need to have access to the data long before publications become available. The work happens in the databases now; we can’t wait for traditional publishing.

In a side note, I also just learned that the NLM (National Library of Medicine) has a disaster response function, and they have a special Ebola section now because of the needs: Ebola Outbreak 2014: Information Resources. And for more of Jim Kent’s thoughts on Ebola, check out the blog that the UCSC folks have just started: 2014 Ebola Epidemic.

The goal of this tip was to provide an overview of the layout and features for folks who might be new to the UCSC software ecosystem. If you already know how to use it, it won’t be new to you. But if you are interested in getting the most out of the UCSC tools, you can also explore our longer training videos. UCSC has sponsored us to provide free online training materials on the existing tools, and this portal is based on the same underlying software. So you can go further, including using the Table Browser for queries beyond just browsing, if you learn the basics that we cover in the longer suites.

Quick links:

Ebola virus portal at UCSC:

UCSC browser intro training:

UCSC advanced training:


Karolchik D., G. P. Barber, J. Casper, H. Clawson, M. S. Cline, M. Diekhans, T. R. Dreszer, P. A. Fujita, L. Guruvadoo, M. Haeussler, R. A. Harte et al. (2013). The UCSC Genome Browser database: 2014 update, Nucleic Acids Research, 42 (D1) D764-D770. DOI:

Gire S.K., A. Goba, K. G. Andersen, R. S. G. Sealfon, D. J. Park, L. Kanneh, S. Jalloh, M. Momoh, M. Fullah, G. Dudas, S. Wohl et al. (2014). Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak, Science, 345 (6202) 1369-1372. DOI:

Video Tip of the Week: MEGA, Molecular Evolutionary Genetics Analysis

This week’s tip of the week highlights the MEGA tools–MEGA is a collection of tools that perform Molecular Evolutionary Genetics Analysis. MEGA tools are not new–they’ve been developed and supported over many years. In fact, on their landing page you can see the first reference to MEGA was in 1994. How much computing were you doing in 1994, and what kind of computer was that on?

Here’s the summary of the tools from their homepage:

MEGA is an integrated tool for conducting sequence alignment, inferring phylogenetic trees, estimating divergence times, mining online databases, estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses.

But you can see they’ve progressed regularly and deeply since 1994, continuing to add new features and tools, and the current version is MEGA6. Although we usually focus on web-based interfaces, there are some tools that run on a desktop installation instead. So you will have to download and install MEGA to try it out, but the number of things you can do with it makes it worth your time.

I decided to take a fresh look at MEGA because it was referenced in the gibbon genome paper that I’ve been perusing to find software tools in use by the genomics community. And I happened to find some training videos developed by the NIAID Bioinformatics team about using MEGA. Their focus in these videos is MEGA5, but much of the foundational information will be the same.

The first video illustrates a file conversion and preparation–getting your data into the right format for MEGA. I won’t embed that here, but when you are ready to kick the tires yourself you should have a look. I’ll jump right to the second video, which includes a bit more action about the things you can do with MEGA. This covers generating a neighbor-joining tree, and several subsequent options for modifying and saving it.
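If you haven’t built one of these trees before, it may help to know that neighbor-joining starts from a table of pairwise distances between the aligned sequences. As a rough illustration (not MEGA’s own code, and using only the simplest distance measure), here is a Python sketch that computes the proportion of differing sites, often called the p-distance, for each pair, skipping sites where either sequence has a gap. The toy sequences and taxon names are invented.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared sites at which two aligned sequences differ.

    Sites where either sequence has a gap are skipped for that pair.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differing / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a dict mapping taxon name -> aligned sequence."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    # Invented toy alignment, just to show the shape of the output.
    toy = {"taxonA": "ACGTACGT", "taxonB": "ACGTACGA", "taxonC": "AC-TTCGA"}
    for pair, dist in distance_matrix(toy).items():
        print(pair, round(dist, 3))
```

A real analysis would normally use a substitution model that corrects for multiple changes at the same site, which is the sort of option the MEGA dialogs let you choose.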

But this is just one aspect of what you can do with the MEGA tools. Be sure to explore the range of things you can do. Their documentation contains a section aimed at the “first time user” that can help you to understand various options you have.  They also have sample data for you to try out the tools.

Tamura K., N. Peterson, G. Stecher, M. Nei & S. Kumar (2011). MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods, Molecular Biology and Evolution, 28 (10) 2731-2739. DOI:

Tamura K., G.Stecher, D. Peterson, A. Filipski & S. Kumar (2013). MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0, Molecular Biology and Evolution, 30 (12) 2725-2729. DOI:

Video Tip of the Week: StratomeX for genomic stratification of diseases

The Caleydo team and the tools they develop have been on my short list of favorites for a long time. I’ve been talking about their clever visualizations for years now. My first post on their work was in 2010, with the tip I did on their Caleydo tool that combined gene expression and pathway visualization. They’ve continued to refine their visualizations, and enable new data types to be brought into the analysis, and earlier this year we featured Entourage, enRoute, LineUp, and also StratomeX. They have lots of options for wrangling “big data”. But recently they published a paper on StratomeX and a nice video overview, so I wanted to bring it to your attention again now that the paper is out.

The emphasis in this paper is cancer subtype analysis, using some data from The Cancer Genome Atlas (TCGA). But it’s certainly not limited to cancer analysis–any research area that’s currently flooded with multiple types of data and outcomes could be run through this stratification and visualization software. I find the weighting of the lines and connections among the subsets to be really effective when thinking about relationships among the data types. The recent schizophrenia work that used this sort of stratification and clustering to suss out the relationships among different sub-types is the kind of thing that’s going to be really useful (but I don’t know what software they used, because paywall…). And I expect that strategy to become increasingly important for a lot of conditions.

So have a look at this new paper (below), and their well-crafted video with examples.

If you are going to start working with StratomeX, be sure to also see their documentation pages. There are some features and options there that aren’t covered in the intro video and that you’ll want to know about.

The team is a cross-institutional and international bunch: this is a joint project between a lab at Harvard, led by Hanspeter Pfister, Peter Park’s lab at the Center for Biomedical Informatics at Harvard Medical School, and collaborators at Johannes Kepler University in Linz and the Graz University of Technology (both in Austria). And look for upcoming tools from them as well–there’s new stuff over at their site. They keep developing useful items, and I expect to be highlighting those in future Tips of the Week.

Quick links:

StratomeX project page:

Caleydo tools homepage:


Marc Streit, Alexander Lex, Samuel Gratzl, Christian Partl, Dieter Schmalstieg, Hanspeter Pfister, Peter J Park & Nils Gehlenborg (2014). Guided visual exploration of genomic stratifications in cancer, Nature Methods, 11 (9) 884-885. DOI:

Video Tip of the Week: GOLD, Genomes OnLine Database

Yes, I know some people suffer from YAGS-malaise (Yet Another Genome Syndrome), but I don’t. I continue to be psyched for every genome I hear about. I even liked the salmon lice one. And Yaks. The crowd-funded Puerto Rican parrot project was so very neat. These genomes may not matter much for your everyday life, and may not exactly be celebrities among species. But we’ll learn something new and interesting from every one of them. It’s also very cool that it’s bringing new researchers, trainees, and citizens into the field.

The good news is there is opportunity still for many, many more species. And decreasing costs will make it possible for more research teams to do locally-important species. But–it would be a shame if we wasted resources by doing 30 versions of something cute, rather than tackling new problems. A central registry for sequencing projects may help to manage this. Genomes OnLine Database has been cataloging projects for years, and it would be great if folks would register their research there.

I was reminded of this by a tweet I saw come through my #bioinformatics column. This is what I saw flying by:

As much as I enjoy Twitter and think that science nerds are pretty good at it, it’s hard to know if the right people will see a tweet. Anyway, I suggested that this researcher check out GOLD and BioProject to see if anyone had registered anything.

I realized that although we have talked about GOLD in the past, it hadn’t been highlighted in our Tips of the Week before. So here I will include a video from a talk about GOLD. Ioanna Pagani gives an overview of GOLD, the foundations and the purpose. And then she goes on to demonstrate how to enter project metadata into their registry (~12min). Watching this will help you to understand the usefulness of GOLD, and what you can expect to find there. She describes both single-species project entry, and another option for entering metagenome data projects (~25min).

In the News at GOLD, they mention that their update this summer resulted in some changes to the interface–so the specifics might be a bit different from the video. But the basic structural features are still going to be useful to understand the goals and strategies. It may also help to convey the importance of appropriate metadata for genome projects. If you are involved with these projects, checking out the team’s paper on the structure and use of metadata is certainly worthwhile.

In times of all this sequencing capacity, people are going to start looking for new organisms to cover. Of course, some people will want to look at another strain, isolate, geographical sample for good reasons–but keeping a lot of unnecessary duplication from happening would be nice too. And it would be great if submitters also conformed to the standards for genome metadata–the ‘Minimum Information about a Genome Sequence’ (MIGS, now in the broader collection of standard checklists in the MIxS project) standards being developed by the Genomic Standards Consortium. (You can see how GOLD conformed to this in their other paper below.) Let’s spread the resources around to get new knowledge when we can. I would like to see a more formal mechanism that connects people who have some genome of interest with researchers who might have the bandwidth to do it, as well. Social sequencing?

Quick links:


Genomic Standards Consortium:

Pagani I., J. Jansson, I.-M. A. Chen, T. Smirnova, B. Nosrat, V. M. Markowitz & N. C. Kyrpides (2011). The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata, Nucleic Acids Research, 40 (D1) D571-D579. DOI:

Liolios K., Lynette Hirschman, Ioanna Pagani, Bahador Nosrat, Peter Sterk, Owen White, Philippe Rocca-Serra, Susanna-Assunta Sansone, Chris Taylor, Nikos C. Kyrpides et al. (2012). The Metadata Coverage Index (MCI): A standardized metric for quantifying database metadata richness, Standards in Genomic Sciences, 6 (3) 444-453. DOI:

Field D., Tanya Gray, Norman Morrison, Jeremy Selengut, Peter Sterk, Tatiana Tatusova, Nicholas Thomson, Michael J Allen, Samuel V Angiuoli, Michael Ashburner et al. (2008). The minimum information about a genome sequence (MIGS) specification, Nature Biotechnology, 26 (5) 541-547. DOI:

Video Tip of the Week: #Docker, shipping containers for software and data

Breaking into the zeitgeist recently, Docker popped into my sphere from several disparate sources. Seems to me that this is a potential problem-solver for some of the reproducibility and sharing dramas that we have been wrestling with in genomics. Sharing of data sets and versions of analysis software is being tackled in a number of ways. FigShare, Github, and some publishers have been making strides among the genoscenti. We’ve seen virtual machines offered as a way to get access to some data and tool collections*. But Docker offers a lighter-weight way to package and deliver these types of things in a quicker and more straightforward manner.

One of the discussions I saw about Docker came from Melissa Gymrek, with this post about the potential to use it for managing these things: Using docker for reproducible computational publications. Other chatter led me to this piece as well: Continuous, reproducible genome assembler benchmarking. And at the same time as all this was bubbling up, a discussion on Reddit covered other details: Question: Does using docker hit performance?

Of course, balancing the hype and reality is important, and this discussion thrashed that about a bit (click the timestamp on the Nextflow tweet to see the chatter):

To get a better handle on the utility of Docker, I went looking for some videos, and these are now the video tip of the week. This is different from our usual topics, but because users might find themselves on the receiving end of these containers at some point, it seemed relevant for our readers.

The first one I’ll mention gave me a good overview of the concept. The CTO of Docker, Solomon Hykes, talks at Twitter University about the basis and benefits of their software (Introduction to Docker). He describes Docker as being like the innovation of shipping containers–which don’t really sound particularly remarkable to most of us, but in fact the case has been made that they changed the global economy completely. I read that book that Bill Gates recommended last year, The Box, and it was quite astonishing to see how metal boxes changed everything. This brought standardization and efficiencies that were previously unavailable. And those are two things we really need in genomics data and software.

Hykes explains that the problem of shipping stuff–coffee beans, or whatever–had to be solved at each place the goods might end up. This is a good analogy, as explained in the shipping container book. How to handle an item, appropriate infrastructure, local expertise, etc, was a real barrier to sharing goods. And this happens with bioinformatics tools and data right now. But with containerization, everyone could agree on the size of the piece, the locks, the label position and contents, and everything standardized on that system. This brought efficiency, automation, and really changed the world economy. As Hykes concisely describes [~8min in]:

“So the goal really is to try and do the same thing for software, right? Because I think it’s embarrassing, personally, that on average, it’ll take more time and energy to get…a collection of software to move from one data center to the next, than it is to ship physical goods from one side of the planet to the other. I think we can do better than that….”

This high-level overview of the concept in less than 10min is really effective. He then takes a question about Docker vs a VM (virtual machine). I think this is the essential take-away: containerizing the necessary items  [~18min]:

“…Which means we can now define a new unit of software delivery, that’s more lightweight than a VM [virtual machine], but can ship more than just the application-specific piece…”

After this point there’s a live demo of Docker to cover some of the features. But if you really do want to get started with Docker, I’d recommend a second video from the Docker team. They have a Docker 101 explanation that covers things starting from installation, to poking around, destroying stuff in the container to show how that works, demoing some of the other nuts and bolts, and the ease of sharing a container.
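To make the “unit of software delivery” idea a bit more concrete for analysis code, here is a small Python sketch that writes out the text of a Dockerfile pinning a base image and exact package versions alongside one script. This is only my illustration of the concept, not something from the videos; the base image tag, the pinned package, and the script name are all hypothetical stand-ins.

```python
def render_dockerfile(base_image, pinned_packages, script):
    """Return Dockerfile text that pins a base image and package versions.

    pinned_packages maps package name -> exact version, so a rebuild
    installs the same versions the analysis was originally run with.
    """
    lines = [f"FROM {base_image}", "WORKDIR /analysis"]
    if pinned_packages:
        pins = " ".join(
            f"{name}=={version}"
            for name, version in sorted(pinned_packages.items())
        )
        lines.append(f"RUN pip install --no-cache-dir {pins}")
    lines.append(f"COPY {script} /analysis/{script}")
    lines.append(f'ENTRYPOINT ["python", "/analysis/{script}"]')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical image tag, package pin, and script name.
    text = render_dockerfile(
        "python:3.11-slim", {"numpy": "1.26.4"}, "run_analysis.py"
    )
    print(text, end="")
```

Saved as a file named Dockerfile next to the script, that text is what `docker build` would consume, and the resulting image is the box that gets shipped around.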

So this is making waves among the genomics folks. This also drifted through my feed:

Check it out–there seem to be some really nice features of Docker that can impact this field. It doesn’t solve everything–and it shouldn’t be used as an escape mechanism to not put your data into standard formats. And Melissa addresses a number of unmet challenges too. But it does seem that it can help with the reproducibility and data-access issues that are currently hurdles (or, plagues) in this field. Docker is also under active development and they appear to want to make it better. But sharing our stuff: it’s not trivial–there are real consequences to public health from inaccessible data and tools (1). But there are broader applications beyond bioinformatics, of course. And wide appeal and adoption seem to be a good thing for ongoing development and support. More chatter on the larger picture of Docker:

And this discussion was helpful: IDF 2014: Bare Metal, Docker Containers, and Virtualization.

And, er…

I laughed. And wrote this anyway.

Quick links:

Docker main site:

Docker Github:

(1) Baggerly K. (2010). Disclose all data in publications, Nature, 467 (7314) 401-401. DOI:

*Ironically, this ENCODE VM is gone, illustrating the problem:


Video Tip of the Week: NIH 3D Print Exchange

The other day I was joking about how I was 3D-printing a baby sweater–the old way, with yarn and knitting needles. And I also mentioned that I assumed my niece-in-law was 3D-printing the baby separately. I’ve been musing (and reading) about 3D printing a lot lately–sometimes the plastic model part, sometimes the bioprinting of tissues part. So when I came across this new NIH 3D Print Exchange information, it seemed worthy of highlighting.

Although I haven’t had access to a 3D printer setup yet (though I’m planning to take a course soon at the local Artisan Asylum), I’ve been seeing quite a bit of chatter about it. Some folks are designing gel combs (rather than paying ridiculous catalog prices). Some folks print skulls and other bones. There is so much opportunity for a wide range of helpful scientific applications across many fields that it seems an introduction to this topic would be wise for a lot of folks.

So when someone pointed me to the 3D printing initiative at NIH, I was hooked. The public announcement and site launch were in mid-June, according to their blog and press release. I was catching up by reading other items on their site, including some press coverage that provides context for this and other government initiatives on 3D printing. Make Magazine’s piece “The Scramble To Build is On!” notes that the Smithsonian and NASA also have projects underway. But for me, molecules in 3D are what I’m most interested in, so I’ll focus on this NIH version below.

An intro video provides an overview of the kinds of things that will be available on their site. But there’s also a YouTube channel with more.

At the site now you will find a number of ways to get started. At the “Share” navigation area you will find already there is a section for custom lab gear, anatomical stuff, and biological structures and even some organisms. So if you have models to share, you can load ‘em up. With the “Create” space you can quickly generate some items with a handy quick start feature. Because I’m fascinated with the beautiful structures of hemolysins (have you seen these things?) I picked one out, entered a PDB ID, and within a half hour I was notified that the printable model was available to me–and you can see it here. But you can build your own from scratch as well, of course. There are other tutorials that will help you get some foundations in place.

[Image: Hemolysin 3D printable model]

Or you can look around–from the “Discover” page you can browse or search for examples of models people have done. At this time, there are 347 (including the one I just did yesterday). But there will be more. I want to get mine printed up, and then see some other proteins too.

Ok, so it’s not like I made a kidney or something (although we know that day is coming). But the 3D printing process, file types, and various options are probably worth noodling on. Getting your feet wet with a little protein structure or organelle might be a good way to get started. Check it out, and start thinking in other dimensions.

Quick links:

NIH 3D Print Exchange:

Hemolysin for image:

Model Generated for hemolysin from PDB record:

Murphy S.V. (2014). 3D bioprinting of tissues and organs, Nature Biotechnology, 32 (8) 773-785. DOI:

Video Tip of the Week: Phenoscape, captures phenotype data across taxa

Development of the skeleton is a good example of a process that is highly regulated, requires a lot of precision, is conserved and has important relationships across species, and is fairly easy to detect when it’s gone awry. I mean–it’s hard to know at a glance if all the neurons in an organism got to the right place at the right time or if all the liver cells are in the right place still. But skeletal morphology–length, shape, location, abnormalities–can be apparent and is amenable to straightforward observations and measurements. Some of these have been collected for decades by fish researchers. This makes them a good model for creating a searchable, stored, phenotype collection.

The team at Phenoscape is trying to wrangle this sort of phenotype information. I completely agree with this statement of the need:

Although the emphasis has been on genomic data (Pennisi, 2011), there is growing recognition that a corresponding sea of phenomic data must also be organized and made computable in relation to genomic data.

They have over half a million phenotype observations cataloged. These include observations in thousands of fish taxa. They created and used an annotation suite of tools called Phenex to facilitate this. They describe Phenex as:

Annotation of phenotypic data using ontologies and globally unique taxonomic identifiers will allow biologists to integrate phenotypic data from different organisms and studies, leveraging decades of work in systematics and comparative morphology.

That’s great data to capture to provide important context for all the sequencing data we are now able to obtain. I think this is a nice example of combining important physical observations, mutant studies, and more, with genomics to begin to get at questions about evolutionary relationships among genes and regulatory regions that aren’t obvious only from the sequence data. You may not be personally interested in fish skeletons–but as an informative way to think about structuring these data types across species to make them useful for hypothesis generation–this is a useful example.

Here’s an intro video provided by the Phenoscape team that walks you through a search starting with a gene of interest, and taking you through the kinds of things you can find.

So have a look around Phenoscape to see a way to go from the physical observations of phenotype to gene details, or vice versa.

Quick links:



Mabee B.P., Balhoff J.P., Dahdul W.M., Lapp H., Midford P.E., Vision T.J. & Westerfield M. (2012). 500,000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton., Zeitschrift fur angewandte Ichthyologie = Journal of applied ichthyology, PMID:

Balhoff J.P., Cartik R. Kothari, Hilmar Lapp, John G. Lundberg, Paula Mabee, Peter E. Midford, Monte Westerfield & Todd J. Vision (2010). Phenex: Ontological Annotation of Phenotypic Diversity, PLoS ONE, 5 (5) e10500. DOI: