Tag Archives: Gene Ontology

Video Tip of the Week: Human Phenotype Ontology, HPO

Typically, our Tips-of-the-Week cover a specific software tool or feature that we think readers might like to try out. This week’s tip is a bit different: it covers a concept that is important in its own right, and also references several software tools built on that concept to enable interoperability, helping us link different data types in a common framework.

Conceptually, the Human Phenotype Ontology (HPO) is much like other controlled vocabulary systems you may have used in genomics tools, such as Gene Ontology, Sequence Ontology, or others you might find at the National Center for Biomedical Ontology. We’ve covered the idea of broad parent terms, increasingly precise child terms, and standard definitions in our tutorial suites. It’s important to standardize and share the same language to describe the same things across projects and software providers, and as more genomics moves into the clinic, shared descriptors for human phenotypes and conditions will be crucial.
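To make the parent/child idea concrete, here is a tiny Python sketch of how annotation to a precise child term implies all of its broader ancestors (the “true path” idea shared by GO, HPO, and similar ontologies). The term IDs below are invented for illustration, not real HPO identifiers:

```python
# A minimal sketch of hierarchical ontology terms: annotating a record
# with a precise child term implies all of its broader parent terms.
# These IDs are made up for illustration, not real HPO terms.

PARENTS = {
    "HP:0000003": ["HP:0000002"],   # precise child -> its parent term(s)
    "HP:0000002": ["HP:0000001"],
    "HP:0000001": [],               # root, e.g. "Phenotypic abnormality"
}

def ancestors(term, parents=PARENTS):
    """Return the set of all broader terms implied by `term`."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

print(sorted(ancestors("HP:0000003")))  # ['HP:0000001', 'HP:0000002']
```

This is why two projects annotating at different levels of precision can still be compared: the broader terms are recoverable from the precise ones.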

The concepts and strategies are maturing, and lots of folks now agree on the need for these shared descriptors. A really nice overview of the state of phenotype descriptions, and of how to use them for discovery and for integration across many data resources, was published earlier this year: Finding Our Way through Phenotypes. It also offers recommendations for researchers, publishers, and developers to support and use a common vocabulary.

For this week’s video, I’m highlighting a lecture by one of the authors of that paper, Peter Robinson. It’s a seminar-length video, but it covers the key conceptual features of the HPO, provides examples of how it can be useful in translational research settings, and describes the range of tools and databases that use the HPO now. I think it’s worth the time to hear the whole thing. The audio is a bit uneven in parts, but you can get the crucial stuff.

The early part is about the concepts of specific terms, synonyms, and shared terms that can mean completely different things (think American football and European football). He describes the phenotype ontology itself, with examples of research that leads to phenotypes which are then used as discovery and diagnostic tools. He also covers tools that utilize the HPO right now, including Phenomizer for obtaining or exploring appropriate terms, and PhenIX (Phenotypic Interpretation of eXomes) for prioritization of candidate genes in exome sequencing data sets. There is also PhenoTips, which can help you to collect and analyze patient data (and also edit pedigrees).

Many large scale projects and key genomics tools employ the human phenotype ontology.

He also notes how tools like DECIPHER, the NCBI Genetic Testing Registry, GWAS Central, and many more include the human phenotype vocabulary. It’s a great sign for a project like this that it is being adopted by so many groups and tools world-wide. The team has also worked with key large-scale projects in this arena to ensure that the vocabulary is suitable and workable, updating it when needed. They credit OMIM and Orphanet as crucial to their efforts as well. As part of the Monarch Initiative, there seems to be solid support going forward.

There are more tools to discuss, but I’m going to save those for another post. This one is already loaded with things you should check out, so be sure to come back for further exploration of HPO-related tools and projects.

Quick links:

Human Phenotype Ontology: http://www.human-phenotype-ontology.org/

Phenomizer: http://compbio.charite.de/phenomizer/

PhenIX: http://compbio.charite.de/PhenIX/

PhenExplorer: http://compbio.charite.de/phenexplorer/

PhenoTips: https://phenotips.org/

Monarch Initiative: http://monarchinitiative.org/

Deans, A.R., Lewis, S.E., Huala, E., Anzaldo, S.S., Ashburner, M., Balhoff, J.P., Blackburn, D.C., Blake, J.A., Burleigh, J.G., Chanet, B., Cooper, L.D., et al. (2015). Finding Our Way through Phenotypes. PLoS Biology, 13 (1) e1002033. DOI: 10.1371/journal.pbio.1002033

Kohler, S., Doelken, S., Mungall, C., Bauer, S., Firth, H., Bailleul-Forestier, I., Black, G., Brown, D., Brudno, M., Campbell, J., FitzPatrick, D., Eppig, J., Jackson, A., Freson, K., Girdea, M., Helbig, I., Hurst, J., Jahn, J., Jackson, L., Kelly, A., Ledbetter, D., Mansour, S., Martin, C., Moss, C., Mumford, A., Ouwehand, W., Park, S., Riggs, E., Scott, R., Sisodiya, S., Vooren, S., Wapner, R., Wilkie, A., Wright, C., Vulto-van Silfhout, A., Leeuw, N., de Vries, B., Washington, N., Smith, C., Westerfield, M., Schofield, P., Ruef, B., Gkoutos, G., Haendel, M., Smedley, D., Lewis, S., & Robinson, P. (2013). The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Research, 42 (D1) DOI: 10.1093/nar/gkt1026

Köhler, S., Schulz, M., Krawitz, P., Bauer, S., Dölken, S., Ott, C., Mundlos, C., Horn, D., Mundlos, S., & Robinson, P. (2009). Clinical Diagnostics in Human Genetics with Semantic Similarity Searches in Ontologies. The American Journal of Human Genetics, 85 (4), 457-464 DOI: 10.1016/j.ajhg.2009.09.003

Zemojtel, T., Kohler, S., Mackenroth, L., Jager, M., Hecht, J., Krawitz, P., Graul-Neumann, L., Doelken, S., Ehmke, N., Spielmann, M., Oien, N., Schweiger, M., Kruger, U., Frommer, G., Fischer, B., Kornak, U., Flottmann, R., Ardeshirdavani, A., Moreau, Y., Lewis, S., Haendel, M., Smedley, D., Horn, D., Mundlos, S., & Robinson, P. (2014). Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Science Translational Medicine, 6 (252) DOI: 10.1126/scitranslmed.3009262

Girdea, M., Dumitriu, S., Fiume, M., Bowdin, S., Boycott, K., Chénier, S., Chitayat, D., Faghfoury, H., Meyn, M., Ray, P., So, J., Stavropoulos, D., & Brudno, M. (2013). PhenoTips: Patient Phenotyping Software for Clinical and Research Use. Human Mutation, 34 (8), 1057-1065 DOI: 10.1002/humu.22347

Video Tip of the Week: TargetMine, Data Warehouse for Drug Discovery

Browsing around genomic regions, layering on lots of associated data, and exploring new data types I come across are things that really fire up my brain. For me, visualization is key to forming new ideas about the relationships between genomic features and patterns of data. But frequently I want to take this to the next step: asking where else these patterns appear, how many other instances of this situation there are in a data set, and maybe adding complexity to the question to refine the quest. This is not always easy to do with primarily visual software tools. That is when I turn to tools like the UCSC Table Browser, BioMart, and InterMine to handle a list of genes, regions, or features.

We’ve touched on all of these before, sometimes with full tutorial suites (UCSC, BioMart), and sometimes as a Tip of the Week (InterMine, and InterMine for complex queries). Learning the foundations of these tools will let you use various versions or flavors of them at other sites. I love to see tools re-used for different topics when that’s possible, rather than a whole new system being built; there are ModENCODE, rat, and yeast mines, and more. This week’s tip is about one of those others: TargetMine is built on the InterMine foundation, with a specific focus on prioritizing candidate genes for pharmaceutical interventions. From their site overview, I’ll add the description they use:

TargetMine is an integrated data warehouse system which has been primarily developed for the purpose of target prioritisation and early stage drug discovery.

For more details about their framework and philosophy, you should see their papers (linked below). The earlier one sets out the rationale, the data types, and the data sources they incorporate. It also establishes their place in the ecosystem of other databases in this arena, which helps you to understand their role. But you should see the later paper for a really good grasp of how their candidate prioritization works with the “Integrated Pathway Clusters” concept they’ve added: they combined data from KEGG, Reactome, and NCI’s PID collections to enhance the features of their data warehouse system.

This week’s Video Tip of the Week highlights one of the tutorial movies that the TargetMine team provides. There’s no spoken audio, but English captions help you follow what’s going on. I followed along in a browser with their example: they have a sample list to simply click on, and you can see various enrichments of the sets (pathways, Gene Ontology, Disease Ontology, InterPro, CATH, and compounds). They call these the “biological themes,” and I find them really useful. You can create new lists from these theme collections. They also illustrate the “template” option: pre-defined queries with typical features people may wish to search. The example shows how to go from your list of genes to pathways, but there are other templates as well.
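For a feel of the math behind an enrichment calculation like those driving the “biological themes,” here is a generic one-sided hypergeometric sketch in Python. This is standard enrichment arithmetic, not TargetMine’s actual implementation, and the toy numbers are invented:

```python
from math import comb

def enrichment_p(list_hits, list_size, category_size, universe_size):
    """One-sided hypergeometric P(X >= list_hits): the chance that a random
    gene list of this size overlaps the category at least this much."""
    total = comb(universe_size, list_size)
    p = 0.0
    for k in range(list_hits, min(list_size, category_size) + 1):
        p += (comb(category_size, k)
              * comb(universe_size - category_size, list_size - k)) / total
    return p

# Toy numbers: 8 of 20 listed genes fall in a pathway of 50 genes,
# out of a universe of 1000. Expected overlap by chance is only ~1.
print(enrichment_p(8, 20, 50, 1000))
```

A small P here is what lets a tool flag a pathway or ontology category as over-represented in your list rather than a chance overlap.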

Another section of the video has an example of a custom query with the Query Builder. They ask for structural information for proteins targeted by acetaminophen. It’s a nice example of how to go from a compound to protein structure–a question I’ve seen come up before in discussion threads.

In their more recent paper (also below), they have some case studies that illustrate the concepts of prioritizing targets for different disease situations with their system. They also expand on the functions with additional software to explore the pathways: http://targetmine.mizuguchilab.org/pathclust/.

So have a look at the features of TargetMine for prioritization of candidate genes. I think the numerous “themes” are a really useful way to assess lists of genes (or whatever you are starting with).

Quick Links:

TargetMine: http://targetmine.mizuguchilab.org/ [note: their domain name has changed since the publications, this is the one that will persist.]

InterMine: http://intermine.github.io/intermine.org/


Chen, Y., Tripathi, L., & Mizuguchi, K. (2011). TargetMine, an Integrated Data Warehouse for Candidate Gene Prioritisation and Target Discovery PLoS ONE, 6 (3) DOI: 10.1371/journal.pone.0017844

Chen, Y., Tripathi, L., Dessailly, B., Nyström-Persson, J., Ahmad, S., & Mizuguchi, K. (2014). Integrated Pathway Clusters with Coherent Biological Themes for Target Prioritisation PLoS ONE, 9 (6) DOI: 10.1371/journal.pone.0099030

Kalderimis, A., Lyne, R., Butano, D., Contrino, S., Lyne, M., Heimbach, J., Hu, F., Smith, R., Stěpán, R., Sullivan, J., & Micklem, G. (2014). InterMine: extensive web services for modern biology. Nucleic Acids Research, 42 (W1) W468-W472. DOI: 10.1093/nar/gku301

Bioinformatics tools extracted from a typical mammalian genome project

In this extended blog post, I describe my efforts to extract the bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises for the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and an extensively-documented collection of their processes and resources. The issues I want to highlight concern access to bioinformatics tools in general; they are not specific to this project, but apply across the field.


In the field of bioinformatics, there is a lot of discussion about data and code availability, and reproducibility or replication of research using the resources described in previous work. To explore the scope of the problem, I used the recent publication of the well-documented gibbon genome sequence project as a launching point to assess the tools, repositories, data sources, and other bioinformatics-related items that had been in use in a current project. Details of the named bioinformatics items were extracted from the publication, and location and information about the tools was then explored.

Only a small fraction of the bioinformatics items from the project were denoted in the main body of the paper (~16%). Most of them were found in the supplementary materials. As we’ve noted in the past, neither the data nor the necessary tools are published in the traditional paper structure any more. Among the over 100 bioinformatics items described in the work, availability and usability varies greatly. Some reside on faculty or student web sites, some on project sites, some in code repositories. Some are published in the traditional literature, some are student thesis publications, some are not ever published and only a web site or software documentation manual serves to provide required details. This means that information about how to use the tools is very uneven, and support is often non-existent. Access to different software versions poses an additional challenge, either for open source tools or commercial products.

New publication and storage strategies, new technological tools, and broad community awareness and support are beginning to change these things for the better, and will certainly help going forward. Strategies for consistently referencing tools, versions, and information about them would be extremely beneficial. The bioinformatics community may also want to consider the need to manage some of the historical, foundational pieces that are important for this field, some of which may need to be rescued from their current status in order to remain available to the community in the future.


From the Nature website, I obtained a copy of the recently published paper: Gibbon genome and the fast karyotype evolution of small apes (Carbone et al., 2014). From the text of the paper and the supplements, I manually extracted all the references to named database tools, data source sites, file types, programs, utilities, or other computational moving parts that I could identify. There may be some missed by this process, for example, names that I didn’t recognize or didn’t connect with some existing tool (or some image generated from a tool, perhaps). References to “in house Perl scripts” or other “custom” scenarios were not generally included unless they had been made available. Pieces deemed as being done “in a manner similar to that already described” in some other reference were present in the count, but I did not go upstream to prior papers to extract those details. Software associated with laboratory equipment, such as sequencers (located at various institutions) or PCR machines, was not included. So this likely represents an under-count of the software items in use. I also contacted the research team for a couple of additional things, and quickly received help and guidance. Using typical internet search engines or internal searches at publisher or resource sites, I tried to match the items to sources of software or citations for the items.

What I put in the bucket included specific names of items or objects likely to be necessary and/or unfamiliar to students or researchers outside the bioinformatics community. Some are related, but different. For example, you need to understand what “Gene Ontology” is as a whole, but you also need to know what “GOslim” is; these are conceptually different and count as separate objects in my designation system here. Some are sub-components of other tools but important aspects to understand (GOTERM_BP_FAT at DAVID, or randomBed from BEDTools), and these are individual named items in the report, as they might be obscure to non-practitioners. Other bioinformatics professionals might disagree with their assignment to this collection; we can discuss removing or including them in future iterations of the list.


After creating a master list of references to bioinformatics objects or items, the list was checked and culled for duplicates or untraceable aspects. References to “in house Perl scripts” or other “custom” scripts were usually eliminated, unless special reference to a code repository was provided. This resulted in 133 items remaining.

How are they referenced? Where in the work?
Both the main publication (14 PDF pages) and the first Supplementary Information file (133 PDF pages) provided the names of bioinformatics objects in use for this project. All of the items referenced in the main paper were also referenced in the supplement. The number of named objects in the main paper was 21 of the 133 listed components (~16%). This is consistent with other similar types of consortium or “big data” papers I’ve explored before: the bulk of the necessary information about software tools, data sources, methods, parameters, and features has been in the extensive supplemental materials.

The items are referenced in various ways. Sometimes they are named in the body of the main text, or the methods. Sometimes they are included as notes. Sometimes tools are mentioned only in figure legends, or only in references. In this case, some details were found in the “Author information” section.


As noted above, most were found in the supplemental information. And in this example, this could be in the text or in tables. This is quite typical of these large project papers, in our experience. Anyone attempting to text-mine publications for this type of information should be aware of this variety of locations for this information.
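The dictionary-lookup flavor of that text-mining task can be sketched in a few lines of Python. The tool names below are just examples from this post; real mining of supplements would also need synonym handling and disambiguation:

```python
import re

# Toy version of the extraction task: scan text for mentions of known
# tool names and count them. This mirrors the manual process described
# above; it is an illustration, not a production text-mining pipeline.
KNOWN_TOOLS = ["RepeatMasker", "BEDTools", "BLAST", "Gene Ontology"]

def find_tool_mentions(text):
    found = {}
    for tool in KNOWN_TOOLS:
        hits = re.findall(re.escape(tool), text)
        if hits:
            found[tool] = len(hits)
    return found

supplement = ("Repeats were annotated with RepeatMasker; overlaps were "
              "computed with BEDTools, then re-checked with RepeatMasker.")
print(find_tool_mentions(supplement))  # {'RepeatMasker': 2, 'BEDTools': 1}
```

Even this naive approach makes the core difficulty plain: you can only count the names you already know, which is exactly why “in house scripts” and obscure sub-components are so easy to miss.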

Which bioinformatics objects are involved in this paper?
Describing bioinformatics tools, resources, databases, files, etc., has always been challenging. These are analogous to the “reagents” I would have put in my benchwork biology papers years ago: details such as enzyme vendors, mouse strain versions, or antibody species that may matter to the outcome. They constitute things you would need to reproduce or extend the work, or to appropriately understand its context. But in the case of bioinformatics, this can mean file formats such as FASTQ, or the axt format from the UCSC Genome Browser. It can mean repository resources like the SRA. It can mean various versioned data sets downloaded from ENSEMBL (versions 67, 69, 70, and 73 here, counted only once as ENSEMBL). It might mean references to Reactome in a table.

With this broad definition in mind, Table 1 provides the list of named bioinformatics objects extracted from this project. The name or nickname or designation, the site at which it can be found (if available), and a publication or some citation is included when possible. Finally, a column designates whether it was found in the main paper as well.

What is not indicated is that some are referenced multiple times in different contexts and usages, which might cause people to not realize how frequently they are used. For example, ironically, RepeatMasker was referenced so many times that I eventually stopped marking it up.

Table 1. Software tools, objects, formats, files, and resources extracted from a typical mammalian genome sequencing project. See the web version supplement to this blog post: http://blog.openhelix.eu/?p=20002, or access at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867


What can we learn about the source or use of these items?
Searches for information about the source code, data sets, file types, repositories, and associated descriptions of the items reveal a wide variety of access. Some objects are associated with traditional scientific publications and have valid and current links to software or data (but are also sometimes incorrectly cited). Some are paywalled in certain publications, or are described in unavailable meeting papers. Some do not have associated publications at all, or are described as submitted or in preparation. Some tools remain unpublished in the literature long after they’ve gone into wide use, and their documentation or manual is cited instead. Some reside on faculty research pages; some are student dissertations. Some tools are found on project-specific pages. Some exist on code repositories, sometimes deprecated ones that may disappear. A number of them have moved from their initial publication sites, without forwarding addresses. Some are allusions to procedures in other publications. Some are like time travel right back to the 1990s, with pages that appear original to the era. Some may be at risk of disappearing completely the next time an update at a university web site changes site access.

Other tools include commercial packages that may have unknown details, versions, or questionable sustainability and future access.

When details of data processing or software implementations are provided, the amount can vary. Sometimes parameters are included, others not.

Missing tool I wanted to have
One of my favorite data representations in the project results was Figure 2 in the main paper, Oxford grids of the species comparisons organized in a phylogenetic tree structure. This conveyed an enormous amount of information in a small area very effectively. I had hoped that this was an existing tool somewhere, but upon writing to the team I found it’s an R script by one of the authors, with a subsequent tree arrangement in the graphics program “Illustrator” by another collaborator. I really liked this, though, and hope it becomes available more broadly.

Easter eggs
The most fun citation I came across was the page for PHYLIP, and the FAQ and credits were remarkable. Despite the fact that there is no traditional publication available to me, a lengthy “credits” page offers some interesting insights about the project. The “No thanks to” portion was actually a fascinating look at the tribulations of getting funding to support software development and maintenance. The part about “outreach” was particularly amusing to us:

“Does all this “outreach” stuff mean I have to devote time to giving workshops to mystified culinary arts students? These grants are for development of advanced methods, and briefing “the public or non-university educators” about those methods would seem to be a waste of time — though I do spend some effort on fighting creationists and Intelligent Design advocates, but I don’t bring up these methods in doing so.”

Even the idea of “outreach” and support for use of the tools is apparently unclear to tool providers. Training? Yeah, not in any formal way.


The gibbon genome sequencing project provided an important and well-documented example of a typical project in this arena. In my experience, this was a more detailed collection and description than many other projects I’ve explored, and it introduced some tools that were new and interesting to me. Clearly an enormous number and range of bioinformatics items, tools, repositories, and concepts are required for the scope of a genome sequencing project. Tracing their provenance, though, is uneven and challenging, and this is not unique to this project: it’s a problem across the field. Current access to bioinformatics objects is also uneven, and future access may be even more of a hurdle as aging project pages disappear or become unusable. This project has provided an interesting snapshot of the state of play, and a good overview of the scope of awareness, skills, resources, and knowledge that researchers, support staff, or students would need to accomplish projects of similar scope.

It used to be simpler. We used to use the small number of tools on the VAX, uphill, in the snow, both ways, of course. When I was a grad student, one day in the back of the lab in the early 1990s, my colleague Trey and I were poking around at something we’d just heard about: the World Wide Web. We had one of those funny little Macs with the teeny screens, and we found people making texty web pages with banal fonts and odd colors, talking about their research.

Although we had both been using a variety of installed programs or command lines for sequence reading and alignment, manipulation, plasmid maps, literature searching and storage, image processing, phylogenies, and so on—we knew that this web thing was going to break the topic wide open.

Not long after, I was spending more and more time in the back room of the lab, pulling out sequences from this NCBI place (see a mid-1990s interface here) and looking for novel splice variants. I found them. Just by typing: no radioactivity or gels required by me! How cool was that? We relied on Pedro’s List to locate more useful tools (archive of Pedro’s Molecular Biology Search and Analysis Tools).

Both of us then went off into postdocs and jobs that were heavily into biological software and/or database development. We’ve had a front seat to the changes over this period, and it’s been really amazing to watch. And it’s been great for us—we developed our interests into a company that helps people use these tools more effectively, and it has been really rewarding.

At OpenHelix, we are always trying to keep an eye on what tools people are using. We regularly trawl through the long, long, long supplementary materials from the “big data” sorts of projects, using a gill net to extract the software tools that are in use in the community. What databases and sites are people relying on? What are the foundational things everyone needs? What are the cutting-edge things to keep a lookout for? What file formats or terms would people need to connect with a resource?

But as I began to do it, I thought: maybe I should use this as a launching point to discuss some of the issues of software tools and data in genomics. If you were new to the field and had to figure out how a project like this goes, or what knowledge, skills, and tools you’d need, could you establish some idea of where to aim? So I used this paper to analyze the state of play: what bioinformatics sites/tools/formats/objects/items are included in a work of this scope? Can you locate them? Where are the barriers or hazards? Could you learn to use them and replicate the work, or drive forward from here?

It was illuminating to me to actually assemble it all in one place. It took quite a bit of time to track the tools down and locate information about them. But it seemed to be a snapshot worth taking. And I hope it highlights some of the needs in the field, before some of the key pieces become lost to the vagaries of time and technology. And also I hope the awareness encourages good behavior in the future. Things seem to be getting better—community pressure to publish data sets and code in supported repositories has increased. We could use some standardized citation strategies for the tools, sources, and parameters. The US NIH getting serious about managing “big data” and ensuring that it can be used properly has been met with great enthusiasm. But there are still some hills left to climb before we’re on top of this.


Carbone, L., Harris, R.A., Gnerre, S., Veeramah, K.R., Lorente-Galdos, B., Huddleston, J., Meyer, T.J., Herrero, J., Roos, C., Aken, B., Anaclerio, F., et al. (2014). Gibbon genome and the fast karyotype evolution of small apes. Nature, 513 (7517) 195-201. DOI: 10.1038/nature13679

FigShare version of this post: http://dx.doi.org/10.6084/m9.figshare.1194879

Video Tip of the Week: eGIFT, extracting gene information from text

eGIFT, as the tag line says, is a tool to extract gene information from text. It allows you to search for and explore terms and documents related to a gene or set of genes. There are many ways to search and explore eGIFT: find genes given a specific term, find terms related to a set of genes, and more. How does the tool do this? You can check out the user guide to find out more, but here is a brief summary from the site:

We look at PubMed references (titles and abstracts), gather those references which focus on the given gene, and automatically identify terms which are statistically more likely to be relevant to this gene than to genes in general. In order to understand the relationship between a specific iTerm and the given gene, we allow the users to see all sentences mentioning the iTerm, as well as the abstracts from which these sentences were extracted.
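The flavor of “statistically more likely to be relevant to this gene than to genes in general” can be sketched with a simple smoothed log-odds score, comparing a term’s frequency in one gene’s abstracts against the background literature. This is an illustration only, not eGIFT’s published scoring, and the counts are invented:

```python
from math import log

def term_score(term_count_gene, total_terms_gene, term_count_bg, total_terms_bg):
    """Log-odds of seeing the term in this gene's abstracts vs the background.
    Illustrative only; eGIFT's published method is more involved."""
    p_gene = (term_count_gene + 1) / (total_terms_gene + 2)  # add-one smoothing
    p_bg = (term_count_bg + 1) / (total_terms_bg + 2)
    return log(p_gene / p_bg)

# "kinase" appears 40 times in 1,000 tokens for our gene,
# but only 100 times in 100,000 tokens overall.
print(term_score(40, 1000, 100, 100000) > 0)  # True: term is gene-specific
```

Terms with high positive scores are the kind of “iTerms” the site surfaces, with links back to the sentences and abstracts that produced the counts.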

To learn more about how this tool was put together and the calculations involved, you can check out the BMC Bioinformatics publication about it from 2010, eGIFT: Mining Gene Information from the Literature.

But, for today, take a tour of the site and some of the things you can do in today’s Tip of the Week.

Relevant Links:
PubMed (tutorial)
XplorMed (tutorial)
Literature & Text Mining Resource Tutorials

Tudor, C., Schmidt, C., & Vijay-Shanker, K. (2010). eGIFT: Mining Gene Information from the Literature BMC Bioinformatics, 11 (1) DOI: 10.1186/1471-2105-11-418

What’s the Answer: genes implicated in…

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

Question of the week: This week’s question was specific:

I’m looking for a database for genes implicated in lymphoid and myeloid development

The short answer is that there isn’t one (that I or anyone else can find), but two answers as of this posting do give good methods for finding what the question is after. From one of them:

We’ve developed a text-mining approach, called GETM (Gene Expression Text Miner) to associate genes with anatomical locations based on tagging of gene names and species-specific anatomy ontologies that might help with your problem.

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • Chromothripsis – new model for some cancers? From GenomeWeb Daily News. I’m interested in seeing follow up studies on this. [Jennifer]
  • A new data source added to the BioMart Central portal: “EMAGE, a database of in situ gene expression data in the mouse embryo, has been added to BioMart Central Portal. The EMAGE website can be found at http://www.emouseatlas.org/emage/ and the EMAGE BioMart server can be found at http://biomart.emouseatlas.org/” (via the Mart-dev mailing list) [Mary]
  • Another potential outlet for scientists wanting to get involved: the Global Knowledge Initiative, whose goal is [Jennifer]

    We build global knowledge partnerships between individuals and institutions of higher education and research. We help partners access the global knowledge, technology, and human resources needed to sustain growth and achieve prosperity for all.

  • From GenomeWeb – an announcement about MoDEL the ‘World’s Largest Protein Video Database’ – it is free for academic, not-for-profit use. I haven’t tried it at all, but it sounds like it might be cool. Let us know if you check it out! [Jennifer]
  • Announcement from the International Cancer Genome Consortium (where you can access the data using the cutting edge BioMart build…Hat tip to @bffo: Update on ICGC website with a simplified application process for controlled access data  #bioinformatics #cancer #genomics  http://icgc.org/ [Mary]
  • Another resource for protein-protein and drug-protein interactions: PROMISCUOUS [Jennifer]
  • There’s a new Announcement mailing list for BioMart, as it gets migrated from its former EBI location. Announce and Users lists are available; if you were on them, you were probably migrated automatically. If you want to sign up, see this note: [mart-announce] New BioMart announce and users mailing lists. Hmm, that’s not entirely helpful, as it hides the addresses you need. They are: mart-dev@ebi.ac.uk becomes users@biomart.org, and mart-announce@ebi.ac.uk becomes announce@biomart.org [Mary]
  • REViGO – a resource for reducing and visualizing Gene Ontology trees, described in this paper: Supek F et al. PLoS Genet 6(6): e1001004. [Jennifer]

Tip of the Week: PathCase for pathway data

We spend a lot of time exploring genomic data, variations, and annotations. But of course a linear perspective on the genes and sequences is not the only way to examine the data. Understanding the pathways in which genes and molecular entities interact is crucial to understanding systems biology.

There are a number of tools that can help you visualize and explore this kind of data. KEGG is one of the most venerable tools in bioinformatics, BioCyc is well known and widely used, and Reactome is one of our favorites. More recently NCBI BioSystems has come along, and the BioModels tool at EBI provides more data of this type as well. The Pathway Interaction Database is another place to try. What you’ll find is that each one has a different emphasis, species focus, or set of available data, and different tools for graphically displaying the databases. The ways to customize or interact with the data vary as well, so you may need to try several to find the one that suits your purposes.

But for today’s tip of the week I will highlight PathCase, a Pathways Database System from Case Western Reserve University. This is a tool I’ve had my eye on for a number of years, and they continue to add new features and data sets to their visualization and search interface, which is very nicely done.

PathCase offers you several ways to browse and search for pathways, processes, organisms, and also molecular entities (such as ATP, ions, etc) as well as genes and proteins. It’s all integrated into the system, so when you find an item of interest you can move to the other related pieces.  For example, from the Pathways you can find genes and learn more about the genes. From genes you can load the pathways in which they participate.

When you have the pathway graphics loaded, you can interact with the pathway by clicking, dragging, re-organizing and more. Right-clicking offers more details about the items and additional ways to visualize the data. One option I didn’t have time to show in the movie: you can use the H2O/CO2 box to load pathways that are linked to the one you are viewing, going even further along any route you might be interested in. Here’s just a quick sample of that: from the NARS2 gene page I loaded the alanine pathway, and then added the fatty acid metabolism pathway. Now I can explore both of them with all the standard PathCase tools and understand many of their relationships. Once you start exploring these pathways, you’ll be amazed at the complex visualizations that are possible.

So if you are interested in biological pathways, exploring them and representing them, check out PathCase.

PathCase site: http://nashua.cwru.edu/pathwaysweb/

Elliott, B., Kirac, M., Cakmak, A., Yavas, G., Mayes, S., Cheng, E., Wang, Y., Gupta, C., Ozsoyoglu, G., & Meral Ozsoyoglu, Z. (2008). PathCase: pathways database system Bioinformatics, 24 (21), 2526-2533 DOI: 10.1093/bioinformatics/btn459

Ok, say you get a genome. What next?

There may be a lot of opportunity to get one’s genome of interest sequenced in the near future.  And I don’t only mean your own personal genome–I mean your species of interest.  There should be academic centers and service providers who offer genome sequencing at increasingly reasonable pricing soon.

So let’s say you get your favorite sample done: what next?  We have talked about how cool it is to be able to use the GBrowse or WebGBrowse tools to display your data.  But that is missing a step, actually.  To get appropriate data to display you need to annotate your genome.  You need to curate your genome.  The GMOD suite offers some tools to do this–Apollo can be a part of your strategy. JCVI has an annotation service for prokaryotic genomes that uses Manatee. They frequently offer a prokaryotic annotation course around that process as well.

But the other day I heard about another option that I thought I would mention: the BLAST2GO team is announcing a course on Automated Functional Annotation and Genome Mining. Here is their announcement via the GOFriends mailing list. Personally, I would choose the one in Valencia :)


• Are you working on sequencing projects?
(EST projects, Next Generation Sequencing, microarray design, etc.)
• Do you have thousands of novel sequences that need functional annotation?
• Do you need a user-friendly tool to functionally analyze your data?
The Blast2GO Team is very pleased to announce:


In this course you will learn tools and tips for functional
annotation, visualization and analysis of novel sequence data making use of Blast2GO.

The course will be offered to 35 participants at 2 locations:
• Valencia, Spain: 28 to 30 September 2009
• Florida, USA: 14 to 16 October 2009
For more information, and for registration until the 1st of September, please visit:

But Florida in October wouldn’t be a bad option either!

For more information about just the BLAST2GO part, check out this site: http://blast2go.org/

Bioinformatics resource tweets

Yes, I know, but I used to be resistant to blogging too….

I’m starting to get a number of announcements about Twitter feeds from bioinformatics resource groups.  I think it’s time to start a collection of those.  This post will have a couple to get started, but I’m going to use this as a collector for others that will inevitably come across over time.

WormBase: (announcement), their feed is http://twitter.com/wormbase

GO: (announcement), their feed is http://twitter.com/news4go

If you have others, let me know in the comments. Or for my co-bloggers feel free to edit this post to grow the list.

New Gene Ontology connections

No doubt you’ve seen all those Gene Ontology (GO) terms in the databases. There’s a nice story behind the development, structure and use of GO terms, which we describe in our full tutorial on the topic. But another important feature of GO is that it continues to evolve and improve, and a new feature is in the process of rolling out.

I learned about this from a GO mailing list. If you find yourself relying on GO terms for tools you support or queries you run to annotate lists of genes, you may want to keep up with changes like this.

On February 17th the GO team implemented major new features, one of which expands the use of “regulates” to the Molecular Function hierarchy. Earlier in GO development there was no “regulates” relationship at all; there were only “is a” and “part of” relationships between terms. But “regulates” was added to the Biological Process ontology some time ago. Here’s a sample of how that looks:


The really big new aspect of this, though, is not just this functionality within Molecular Function terms. It is that they are now creating “regulates” relationships between Molecular Function and Biological Process terms; these are called inter-ontology links.

This is a nifty new way to annotate genes and gene products. It carries more information than a single definition alone would. But it is also rather a challenge for software developers, because it is a larger conceptual leap than the straight hierarchies are. So it may take a while to fully roll out to your favorite database that uses GO. But watch for it: it is a big deal.

For more details and an example (or if you have software that may break because of this), see the GO wiki:


One example is this:

Specifically, we have made the implicit regulatory relationships between ‘regulation of molecular function’ BP terms and the corresponding MF terms explicit. For example:

   * regulation of kinase activity (BP) regulates kinase activity (MF)

It makes sense in English, I know. But it’s not so simple for a computer. They have to be told this. And that’s what this change can do.
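Telling the computer amounts to storing the link as an explicit, typed edge that software can follow like any other relationship. Here is a minimal sketch of how a tool might represent and query such a cross-ontology edge; the two GO accessions are the standard IDs for these terms, but the graph code itself is just an illustration, not actual GO tooling.

```python
from collections import defaultdict

# term ID -> (name, ontology)
terms = {
    "GO:0043549": ("regulation of kinase activity", "biological_process"),
    "GO:0016301": ("kinase activity", "molecular_function"),
}

# edges[source] -> list of (relation, target) pairs
edges = defaultdict(list)

def add_edge(source, relation, target):
    edges[source].append((relation, target))

# The new cross-ontology link: a BP term explicitly regulates an MF term.
add_edge("GO:0043549", "regulates", "GO:0016301")

def regulated_by(term_id):
    """Terms that this term explicitly regulates."""
    return [t for rel, t in edges[term_id] if rel == "regulates"]

print(regulated_by("GO:0043549"))  # ['GO:0016301']
```

Once the edge is explicit, a query for “what does regulation of kinase activity regulate?” becomes a simple graph lookup instead of a guess based on matching English strings.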

I’ll concede that this may not be among the most exciting things in your life lately.  But it is big for what we do.  So I just thought I’d mention it….