Category Archives: New Resource

Have some NGS SAM/BAM files? Get a GUI interface

A recent paper on a GUI interface introduces SAMMate. As the paper states:

With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files.

You might want to check it out if you have Next Generation Sequencing data in the form of BAM/SAM files. A nice feature I haven’t been able to check is that it will export a ‘wiggle’ file for alignment visualization in the UCSC Genome Browser.
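If you are curious what a wiggle export involves, here is a minimal sketch in Python. It is not SAMMate's code and I have not seen how SAMMate does it; it just counts per-base read depth from plain-text SAM records and writes the counts in the variableStep wiggle layout. The function names, the track name, and the tiny example reads are all made up for illustration, and the sketch ignores mapping quality and other filtering a real tool would probably apply.

```python
import re
from collections import defaultdict

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_to_coverage(sam_lines):
    """Return {chrom: {1-based position: depth}} for the mapped reads."""
    coverage = defaultdict(lambda: defaultdict(int))
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, chrom = int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":
            continue  # unmapped read, nothing to count
        ref_pos = pos  # SAM positions are 1-based
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":  # aligned bases: count them
                for p in range(ref_pos, ref_pos + length):
                    coverage[chrom][p] += 1
                ref_pos += length
            elif op in "DN":  # deletion or skip: advance without counting
                ref_pos += length
            # I, S, H and P do not consume reference bases
    return coverage


def coverage_to_wiggle(coverage, track_name="coverage"):
    """Format the depth table as variableStep wiggle text."""
    lines = [f'track type=wiggle_0 name="{track_name}"']
    for chrom in sorted(coverage):
        lines.append(f"variableStep chrom={chrom}")
        for p in sorted(coverage[chrom]):
            lines.append(f"{p} {coverage[chrom][p]}")
    return "\n".join(lines)


# Hypothetical reads, only to exercise the functions.
example_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t102\t60\t2M1D2M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
example_coverage = sam_to_coverage(example_sam)
wiggle_text = coverage_to_wiggle(example_coverage)
print(wiggle_text)
```

The resulting text could be pasted into the UCSC Genome Browser as a custom track. For BAM files you would first need to convert to SAM text or use a BAM-reading library, which this sketch does not cover.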

NAR Database issue…get it while it’s hot!

Ok, it’s hot now–but it’s something we refer back to all year long, actually. For people who don’t know about the NAR Database Issue, since the mid-90s Nucleic Acids Research has been collecting bioinformatics databases and tools that are of use to a huge range of researchers. We’ve watched it grow over the years and we’ve even graphed it. We’ll have to update that graph with the new data point for this year.  But here’s the graph as we published it last year:

(You can get this figure from our paper here, it is Figure 1)

You can see steady growth in the resources collected in the NAR set. But that’s certainly not all of them–others can be found in their server issue in the summer, and some just aren’t listed in a lot of places. We think there are in the range of 3000 tools and resources of some sort around.

A nice overview of the state of play is always provided in the introduction paper for that issue. As they state, this year we are up to 1330 data sources in their list. And they also highlight a couple of editorials that address important issues in this arena. One is about the need for data sources to talk to each other. This is an important point:

these databases risk functioning increasingly as isolated islands in a sea of disparate biological data

And there’s another editorial that speaks to the understanding of the data we have in our hands–and the need to understand it better. It describes COMBREX–a very cool effort:

This project is designed to serve as a clearinghouse, collecting functional predictions from specialists in bioinformatics and functional genomics and then sending these predictions for testing by experimentalists.

This is the kind of thing that makes me wish I still had a lab. There’s so much opportunity here…alas. The road not taken. But a hot opportunity for smart youngsters who might like to carve out a niche with a lab that mines the computational materials and pairs it with great projects for students to do the bench characterizations. And it offers grants to do this work….

Anyway–check out the NAR database issue. It’s worth your time. Really.

EDIT: there’s a fun and interesting crowd-sourced analysis of the NAR databases in the list for features of utility to bioinformatics geeks going on at BioStar.

Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box Briefings in Bioinformatics, 11 (6), 598-609 DOI: 10.1093/bib/bbq026

Galperin, M., & Cochrane, G. (2010). The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1243

Gaudet, P., Bairoch, A., Field, D., Sansone, S., Taylor, C., Attwood, T., Bateman, A., Blake, J., Bult, C., Cherry, J., Chisholm, R., Cochrane, G., Cook, C., Eppig, J., Galperin, M., Gentleman, R., Goble, C., Gojobori, T., Hancock, J., Howe, D., Imanishi, T., Kelso, J., Landsman, D., Lewis, S., Mizrachi, I., Orchard, S., Ouellette, B., Ranganathan, S., Richardson, L., Rocca-Serra, P., Schofield, P., Smedley, D., Southan, C., Tan, T., Tatusova, T., Whetzel, P., White, O., & Yamasaki, C. (2010). Towards BioDBcore: a community-defined information specification for biological databases Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1173

Roberts, R., Chang, Y., Hu, Z., Rachlin, J., Anton, B., Pokrzywa, R., Choi, H., Faller, L., Guleria, J., Housman, G., Klitgord, N., Mazumdar, V., McGettrick, M., Osmani, L., Swaminathan, R., Tao, K., Letovsky, S., Vitkup, D., Segre, D., Salzberg, S., Delisi, C., Steffen, M., & Kasif, S. (2010). COMBREX: a project to accelerate the functional annotation of prokaryotic genomes Nucleic Acids Research, 39 (Database) DOI: 10.1093/nar/gkq1168

The Phenoscape Project, upcoming webinar

I just got this announcement about an upcoming webinar from the NCBO–the National Center for Biomedical Ontology. It’s a project that’s new to me, so I can’t give you any insights on its utility for our readers. It sounds like text mining + evolution, and ways to extract information that’s not been extensively used previously. But I love to find out about new projects and tools in this arena and I think I’ll listen in.

Here’s the notice I got via the Biocurator mailing list:

The next NCBO Webinar will be presented by Dr. Hilmar Lapp, Assistant Director for Informatics at the National Evolutionary Synthesis Center (NESCent) on “Bringing reason to phenotype diversity, character change, and common descent” at 10:00am PST, Wednesday, November 17. Below is information on how to join the online meeting via WebEx and accompanying teleconference. For the full schedule of the NCBO Webinar presentations see:

For more than a century, systematic biologists have meticulously documented the stunning biodiversity of phenotypes across the tree of life in the comparative systematics literature. This vast store of often complex character and character state descriptions informs our understanding of the evolutionary transitions that gave rise to the present diversity of life on earth. Yet, as free text in natural language, these descriptive data are not amenable to even simple computational processing, such as comparison of organisms by phenotype similarity, much less large-scale data integration and knowledge mining. I will present the approach that we have adopted within the Phenoscape project to expose these data to machine reasoning. Phenoscape uses the Entity-Quality (EQ) model to transform characters and character states into formal phenotype assertions. Data transformed in this way from the systematics literature are integrated with mutant phenotype data from model organisms in a large knowledge base, in order to generate hypotheses about the genetic causes of evolutionary character transitions. I will discuss both successes and challenges in blending formal knowledge representation methods with descriptive biology and hypotheses of descent. An important remaining challenge is a logic framework for reasoning over homology, i.e. descent from a common ancestor, which is required for many forms of evolutionary inference.


Hilmar Lapp is the Assistant Director for Informatics at the National Evolutionary Synthesis Center (NESCent). His research interests are in reusable and interoperable software and data, large-scale data integration, and building sustainable cyberinfrastructure. A biologist by training, he has also been programming for more than two decades, ranging from commercial applications to real-time data acquisition to bioinformatics data integration and standards. In his role at NESCent, he is involved in many of the Center’s cyberinfrastructure initiatives, and serves as senior personnel in the NSF-funded Phenoscape project, as well as the Dryad digital repository for data supporting scientific publications. Before joining NESCent in 2006, he worked for almost 10 years in functional genome informatics in the biopharmaceutical sector. At the Genomics Institute of the Novartis Research Foundation (GNF) in San Diego, CA, he built SymAtlas, one of the first decidedly gene-centric databases integrating genome annotation databases with gene function data.

And it also came with webinar details, but I’ll send you over to NCBO for that rather than posting them here. Go over to their webinar page and click on this talk. The details about how to join and call in are over there.
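To get a feel for what "formal phenotype assertions" and "comparison of organisms by phenotype similarity" might mean in practice, here is a toy sketch in Python. It is not Phenoscape's data model or code. I am assuming an EQ statement can be pictured as an (entity, quality) pair, using plain strings where the real project would use ontology terms, and I am swapping in a simple Jaccard overlap for whatever reasoning the knowledge base actually performs. The class name, function name, and example phenotypes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQPhenotype:
    """A toy Entity-Quality statement, e.g. entity 'dorsal fin', quality 'absent'."""
    entity: str
    quality: str


def jaccard_similarity(first, second):
    """Shared phenotypes divided by all phenotypes seen in either set."""
    first, second = set(first), set(second)
    if not first and not second:
        return 1.0  # two empty descriptions are treated as identical
    return len(first & second) / len(first | second)


# Hypothetical annotations: one set from the systematics literature,
# one from a model organism mutant.
taxon_phenotypes = {
    EQPhenotype("dorsal fin", "absent"),
    EQPhenotype("barbel", "elongated"),
}
mutant_phenotypes = {
    EQPhenotype("dorsal fin", "absent"),
    EQPhenotype("eye", "decreased size"),
}

similarity = jaccard_similarity(taxon_phenotypes, mutant_phenotypes)
print(f"phenotype similarity: {similarity:.2f}")
```

Once descriptions are structured like this, a shared assertion between a taxon and a mutant becomes something a program can find, which is the kind of link the abstract says is used to generate hypotheses about genetic causes.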

New NCBI Image Database

Mary brought up a paper just recently about what we are missing when data mining papers: Figures and figure legends.

Enter the NCBI Image database. This very new database includes over 3 million images that are found in the full-text resources (i.e. PubMed Central) at NCBI. So, I did a search for “drosophila phylogeny” and found some great images and figures. The results will not only pull out the figure, but also the figure legend. I got over 200 results. The links in the search result figure titles take you directly to the figure. Below the legend you can see links to the full text. It’s a great start to searching figures and figure legends.

Along with this, PubMed search results now are enhanced with images from this database (if, remember, the article is in the full-text resources... but over time a lot of research published with NIH funding will go there, won’t it?). For example, go to this abstract for the paper “Text mining and manual curation of the chemical-gene-disease networks for the comparative toxicogenomics database.” Scroll down just a bit, and you’ll see the figures from this paper, which have been deposited in the NCBI image database. You can go directly to the link to all the figures or to the papers.

Of course, as stated, not all articles will have images in the database, only those deposited in PubMed Central. You’ll find a lot of your searches won’t have this image strip because the journal isn’t deposited there. But with 3 million images and more journal articles going to PMC every day, this database and feature of PubMed could prove to be quite useful.

Hattip: APD at CTD :)

New databases and resources from NAR database issue

If you haven’t noticed, articles in the Database issue of Nucleic Acids Research have been going to Advance Access in the last week. There is a wealth of new resources and databases, as always, in this issue. I’ll be going through these in the coming month or so and will post more in-depth reviews of them then, but I thought I’d list some that were released Friday. Go to the link above for even more from earlier in the week:

AmoebaDB (article)
MicrosporidiaDB (article)
BriX (article)
WebGeSTer DB: a transcription terminator database. (article)
SCLD: Stem cell lineage database (article)
OrthoDB: Hierarchical Catalog of Eukaryotic Orthologs (article)
CaSNP: a database for interrogating copy number alterations of cancer genome from SNP array data (article)

Like shooting catfish genomes in a barrel

You know, when the catfish genome is complete, that will be a cool addition to our “Yet another Genome” posts (which I should make a regular series or some update somewhere). Till the genome is complete, you can view and analyze catfish genomic data at cBARBEL, reported in this week’s NAR advance access: Catfish Breeder And Researcher Bioinformatics Entry Location. Among other tools and schema, they use GBrowse (we do have a free tutorial ;) to compare the data to the Zebrafish genome.

Mark this database (as with many others) as one whose acronyms were created to fit the name. Barbels are the whiskers on a catfish and cBARBEL stands for “Catfish Breeder And Researcher Bioinformatics Entry Location.” See, I was thinking more along the lines of “Catfish Breeder And Researcher Research Entry Location” or Catfish Barrel, but that is too culturally obscure and specific, isn’t it? cBARBEL is good :D.

All kidding aside, it’s a great start to another agriculturally important model organism database.

Tip of the Week: NCBI Epigenomics “Beyond the Genome”

We spend a lot of time talking about sequence data: where to find it, how to analyze it, etc. But increasingly we are seeing more and more data that comes from epigenomics projects. Recently a tweet from NCBI got me to look at their Epigenetics site again.

Their definition of epigenetics is:

What is Epigenetics?

Interest in epigenetics has exploded in recent years, but the central question it aims to answer has been with us for decades: how do the many cell types of the body maintain drastically different gene expression patterns while sharing exactly the same DNA?

Epigenetics refers to a gene activity state that may be stable over long periods of time, persist through many cell divisions, or even be inherited through several generations, all without any change to the primary DNA sequence (Roloff and Nuber 2005, Ng and Gurdon 2008, Probst, et al. 2009).

This is a nice site that offers a lot of helpful background, project information about the NIH Roadmap for Epigenomics, and then of course access to the data itself.  They have separate guidance on the types of data that you will find in here: About DNA Methylation, About Histone Modification, and About Chromatin Structure. So if you are ready to go “Beyond the Genome” as their tag line indicates, you can learn about the data types and find the data too.

This tip of the week will take a look at access to the data. I’ll be taking a look at what happens when you use the Sample Browser as a starting point to see some of the data via browsing. You can do more complex and custom queries with the Advanced Query form, which looks like other query building tools at NCBI. I won’t have time to cover that, but I wanted you to know it was available.

For my example I just chose the top sample that was in the list at the time I did this tip. And it was fortuitous for a couple of reasons.  First it was exactly the kind of paper that I was talking about in my recent post (The data isn’t in the papers anymore, you know.) This paper (referenced below) has a huge volume of data. It looks at 39 types of histone modifications, and looks at them genome wide.  There’s no way to publish all that as figures in this paper.  There are summary figures, but not individual ones for that data collection. You’d have to visualize this yourself elsewhere.  The second reason it was cool was because the data perfectly validates some of the data I’ve been using to develop the ENCODE project tutorial we’ve just created with the UCSC ENCODE team.

Anyway–check out the NCBI Epigenomics resource for a great way to visualize data on this topic. Data that you will not find in the papers.

Quick links:

NCBI Epigenomics (the tip site):


Epigenome Browser:


By the way: I also asked the hive mind at BioStar what tools they are using for epigenomics or epigenetics, and you can go and see that question over there. People told me about the Epigenome Atlas and EPIGRAPH. And as I was researching this tip I came across a Roadmap Epigenomics site that offers a link to a browser. It’s a UCSC Genome Browser framework focused on this kind of data: Epigenome Browser–but that’s a different installation than the main UCSC Genome Browser that I illustrate in this tip.

Reference for data used and shown in the tip:

Wang, Z., Zang, C., Rosenfeld, J., Schones, D., Barski, A., Cuddapah, S., Cui, K., Roh, T., Peng, W., Zhang, M., & Zhao, K. (2008). Combinatorial patterns of histone acetylations and methylations in the human genome Nature Genetics, 40 (7), 897-903 DOI: 10.1038/ng.154

Currently there isn’t a reference for NCBI Epigenomics. I contacted the Help Desk to be sure, and they told me it’s been submitted but isn’t out yet. I’ll update this when that reference becomes available.

Next-Generation Analysis Tools

MassGenomics points to a new structured programming framework for analyzing NGS data: A Foundation for Next-Generation Analysis Tools, the GATK (Genome Analysis Toolkit) and points to a few tools that use it.

While I’m at it, let me remind you of an NGS discussion group, GBrowse help for visualizing NGS data, and Galaxy’s NGS toolbox.

Tip of the Week: Gaggle Genome Browser

For this week’s tip of the week we’ll be looking at the Gaggle Genome Browser. As we are seeing more and more data for species or individuals coming along from high-throughput sequencing projects, metagenomics data sets, and additional annotation track types coming from various projects–we’re gonna need more visualization options. Gaggle Browser provides the foundation for a new kind of visualization and interaction with the data.

The Gaggle Browser is one piece of the Gaggle components, actually. Gaggle is a framework that lets different Gaggle-enabled tools interact with each other. If a database or program can interact with this system, it can be called a “Goose”. Some of the current Geese include the Browser, Cytoscape, and other pieces. Check out the components page for additional interactors. And check out that figure that illustrates the Firegoose toolbar interacting with important web tools such as DAVID, STRING, KEGG, and so on. That will give you a sense of the goals of this project, and the possibility of extending and integrating the data types that might be useful for your projects.

For this tip, though, we’ll focus on the browser. A paper has just come out that provides a lot more background on the project–including extensive descriptions of the underlying software and connections.  For end users, though, the key piece is what this can do for you–and the team’s goal was stated as this:

“Our choice to build the GGB as a desktop application was largely motivated by the need to support large user-generated datasets.”

This is crucial. We love the big browsers and the range that are out there now, and we rely on them every day. And newer ones like JBrowse are also contributing to our choices. But at every workshop we do these days someone says to us: I have this giant data set I need to look at, how can I do it? And they want to be able to customize more exactly what they show–for data visualization and for making figures, etc. For some people Gaggle Browser may be an answer.

In my interactions with the browser so far, I can easily see how smooth and fast the navigation around large regions can be. For my purposes I think it would be great to load up genomic regions of interest, and then add a type of annotation track I’m interested in, and then scan around. For example, I was loading up the human genome data in GGB, and then I pulled down the Transcription Factor Binding Sites (TFBS conserved) from chromosome 1 using the UCSC Table Browser. I sent that to Galaxy to convert to the GFF file format I needed, and then loaded up the TFBS on the browser to look around. I had a little trouble with the upload at first, but after we worked that out (with the help of the handy discussion group list) I was able to do what I wanted. I’d still probably go back to UCSC to do queries and other visualizations of this data–but for a quick look around at the landscape–Gaggle was a really nice option.
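I used Galaxy for the conversion step, but if you would rather script it, here is a rough Python sketch of the same idea. It assumes the Table Browser output was saved as tab-separated BED (chrom, start, end, name, score, strand) and writes nine-column GFF lines. The main thing to get right is the coordinates: BED starts are 0-based, GFF starts are 1-based. The function name, the "UCSC" source label, the "TFBS" feature type, and the Name= attribute are my own choices for illustration; I have not checked which attribute syntax the Gaggle Genome Browser expects.

```python
def bed_to_gff(bed_lines, source="UCSC", feature="TFBS"):
    """Convert tab-separated BED records to nine-column GFF lines."""
    gff_lines = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the track/browser/comment lines BED files may carry.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        cols = line.split("\t")
        chrom, start, end = cols[0], int(cols[1]), int(cols[2])
        name = cols[3] if len(cols) > 3 else "."
        score = cols[4] if len(cols) > 4 else "."
        strand = cols[5] if len(cols) > 5 else "."
        gff_lines.append("\t".join([
            chrom,
            source,
            feature,
            str(start + 1),   # BED start is 0-based; GFF start is 1-based
            str(end),         # BED end is exclusive, so it equals the inclusive GFF end
            score,
            strand,
            ".",              # phase/frame does not apply to binding sites
            f"Name={name}",   # assumed attribute layout
        ]))
    return gff_lines


# A made-up record, only to show the coordinate shift.
example_bed = [
    'track name="tfbsConsSites"',
    "chr1\t999\t1010\tV$AP1_Q2\t850\t+",
]
for gff_line in bed_to_gff(example_bed):
    print(gff_line)
```

Getting the off-by-one wrong will not produce an error anywhere; the features just show up shifted by a base, so it is worth spot-checking one site against the UCSC display.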

In this tip I’ll load up a sample data set and move around some, showing you some of the display aspects.

For my example I’m going to load  up the Bacillus anthracis demo and interact with the interface a bit to help you get started.  I’ll mention some of the features, but in this short movie I won’t have time to illustrate how easy it was to load up a track of my own on the human genome.  But I can certainly see the advantages of the quick custom browser I could build with data of interest.

It’s a young browser, and features are still being explored and added–but I think it could be very nice for people to interact with their large data sets of interest.  I’ve already offered 2 feature requests: I would like to have labels for my tracks, and I would like to be able to choose different genome assemblies. And the Gaggle Browser team was very responsive and friendly to those inquiries.

So load up a Gaggle Browser and try it out. And start to imagine ways to load your favorite data on this framework.

Quick links:

Gaggle Genome Browser (GGB):

Gaggle components homepage:

Bare, J., Koide, T., Reiss, D., Tenenbaum, D., & Baliga, N. (2010). Integration and visualization of systems biology data in context of the genome BMC Bioinformatics, 11 (1) DOI: 10.1186/1471-2105-11-382

Tip of the Week: 1000 Genomes Project Browser

You may have been hearing about the 1000 Genomes project–it’s one of the ongoing “big data” projects that is going to yield a great deal of variation information about the human genome. The goal is to sequence well over 1000 genomes to identify “most genetic variants that have frequencies of at least 1% in the populations studied”. They are doing this by sequencing large numbers of samples with 4x coverage. You can read more about their strategy in their About page on their web site. It also lists the anticipated sample populations.
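What does 4x coverage buy you for any one sample? Here is a back-of-the-envelope sketch in Python. It assumes read depth at a site follows a Poisson distribution with a mean of 4, which is a textbook idealization and my assumption, not something taken from the project's documents; real sequencing depth is usually more uneven. The function names are just for this illustration.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k reads at a site with the given mean depth."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_at_least(k, mean):
    """Probability of k or more reads at a site."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


mean_depth = 4.0  # the "4x" figure quoted above
p_uncovered = poisson_pmf(0, mean_depth)
p_two_or_more = prob_at_least(2, mean_depth)

print(f"sites with no reads at all: {p_uncovered:.3f}")
print(f"sites with two or more reads: {p_two_or_more:.3f}")
```

Under this simple model a small fraction of sites in any single low-coverage genome has no reads at all, and more have only one, which is one way to see why the design leans on large numbers of samples rather than on any single sample.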

In this week’s Tip of the Week I’m going to take a quick spin through their browser. (You can also download all the data, but I’ll be focusing on the browser.) They have begun to release data now, and there are 6 individual sequences available at this time.  These are part of their “pilot” studies.  You can get some details on the pilot from their about page, which links to this PDF about the samples.

They are using the Ensembl framework to display their data. So if you are familiar with using Ensembl you’ll have some facility moving around this browser.  One thing that isn’t apparent right away from the site is that you can click the Resembl link on the display to turn on a track that puts the read/coverage data on the viewer. I also liked the alignment display  of all 6 genomes–but I’m sure that’s going to get challenging to view later with more and more genomes.

In an exchange with their very helpful help desk yesterday, I got this quick summary of the samples you’ll see:

For the high coverage populations NA12891, NA12892 and NA12878 are the CEU trio, NA19238, NA19239 and NA19240 are the YRI trio both father, mother, child respectively and both children were daughters.

If you have questions about their data, be sure to go ask them for help–they were very speedy with answers for me :) .

Some of the project data has also been picked up by UCSC and you can access the same sequences in the UCSC Genome Browser in the Genome Variants track on the March 2006 human assembly. (You’ll also see Venter, Watson, and some other individual genomes there).

Quick links:

The Project:

The Browser:

An article in Science with some background:  A Plan to Capture Human Diversity in 1000 Genomes