Category Archives: Genomics Research

Genome Editing with CRISPR-Cas9, nifty animation

I saw this come across my Twitter feed the other day, and as a nice Friday afternoon diversion I posted it to Google+. I was surprised how popular it was. So I thought: hey, I have a blog too. Let’s put it here. So grab some coffee and watch; it’s a nice gentle way to get your Monday underway.

This animation depicts the CRISPR-Cas9 method for genome editing – a powerful new technology with many applications in biomedical research, including the potential to treat human genetic disease. Feng Zhang, a leader in the development of this technology, is a faculty member at MIT, an investigator at the McGovern Institute for Brain Research, and a core member of the Broad Institute. Further information can be found on Prof. Zhang’s website.

Images and footage courtesy of Sputnik Animation, the Broad Institute of MIT and Harvard, Justin Knight and pond5.

The publications page at the Zhang lab has some nice examples of CRISPR, including that knockin mouse one with cancer modeling applications. I’ve been meaning to get that but don’t have a subscription to Cell, so that was handy.

Platt R., Chen S., Zhou Y., Yim M.J., Swiech L., Kempton H.R., Dahlman J.E., Parnas O., Eisenhaure T.M., Jovanovic M., Graham D.B. et al. (2014). CRISPR-Cas9 Knockin Mice for Genome Editing and Cancer Modeling. Cell, 159(2), 440-455. DOI:

Bioinformatics tools extracted from a typical mammalian genome project [supplement]

This is Table 1 that accompanies the full blog post: Bioinformatics tools extracted from a typical mammalian genome project. See the main post for the details and explanation. The table is too long to keep in the post, but I wanted it to be web-searchable. A copy also resides at FigShare:


Bioinformatics tools extracted from a typical mammalian genome project

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.


In the field of bioinformatics, there is a lot of discussion about data and code availability, and reproducibility or replication of research using the resources described in previous work. To explore the scope of the problem, I used the recent publication of the well-documented gibbon genome sequence project as a launching point to assess the tools, repositories, data sources, and other bioinformatics-related items that had been in use in a current project. Details of the named bioinformatics items were extracted from the publication, and location and information about the tools was then explored.

Only a small fraction of the bioinformatics items from the project were denoted in the main body of the paper (~16%). Most of them were found in the supplementary materials. As we’ve noted in the past, neither the data nor the necessary tools are published in the traditional paper structure any more. Among the over 100 bioinformatics items described in the work, availability and usability vary greatly. Some reside on faculty or student web sites, some on project sites, some in code repositories. Some are published in the traditional literature, some are student thesis publications, and some were never published, with only a web site or a software documentation manual providing the required details. This means that information about how to use the tools is very uneven, and support is often non-existent. Access to different software versions poses an additional challenge, for both open source tools and commercial products.

New publication and storage strategies, new technological tools, and broad community awareness and support are beginning to change these things for the better, and will certainly help going forward. Strategies for consistently referencing tools, versions, and information about them would be extremely beneficial. The bioinformatics community may also want to consider the need to manage some of the historical, foundational pieces that are important for this field, some of which may need to be rescued from their current status in order to remain available to the community in the future.


From the Nature website, I obtained a copy of the recently published paper: Gibbon genome and the fast karyotype evolution of small apes (Carbone et al, 2014). From the text of the paper and the supplements, I manually extracted all the references to named database tools, data source sites, file types, programs, utilities, or other computational moving parts that I could identify. There may be some missed by this process, for example, names that I didn’t recognize or didn’t connect with some existing tool (or some image generated from a tool, perhaps). References to “in house Perl scripts” or other “custom” scenarios were not generally included unless the code had been made available. Pieces deemed as being done “in a manner similar to that already described” in some other reference were included, but I did not go upstream to prior papers to extract those details. Software associated with laboratory equipment, such as sequencers (located at various institutions) or PCR machines, was not included. So this likely represents an under-count of the software items in use. I also contacted the research team for a couple of additional things, and quickly received help and guidance. Using typical internet search engines or internal searches at publisher or resource sites, I tried to match the items to sources of software or citations for the items.

What I put in the bucket included specific names of items or objects that would be likely to be necessary and/or unfamiliar to students or researchers outside of the bioinformatics community. Some are related, but different. For example, you need to understand what “Gene Ontology” is as a whole, but you also need to know what “GOslim” is; these are conceptually different and are separate objects in my designation system here. Some are sub-components of other tools, but important aspects to understand (GOTERM_BP_FAT at DAVID or randomBed from BEDTools), and are individual named items in the report, as these might be obscure to non-practitioners. Other bioinformatics professionals might disagree with their assignment to this collection; removal or inclusion of particular items can be discussed in future iterations of the list.


After creating a master list of references to bioinformatics objects or items, the list was checked and culled for duplicates or untraceable aspects. References to “in house Perl scripts” or other “custom” scripts were usually eliminated, unless special reference to a code repository was provided. This resulted in 133 items remaining.

How are they referenced? Where in the work?
Both the main publication (14 PDF pages) and the first Supplementary Information file (133 PDF pages) provided the names of bioinformatics objects in use for this project. All of the items referenced in the main paper were also referenced in the supplement. The number of named objects in the main paper was 21 of the 133 listed components (~16%). This is consistent with other similar types of consortium or “big data” papers that I’ve explored before: the bulk of the necessary information about software tools, data sources, methods, parameters, and features has been in the extensive supplemental materials.

The items are referenced in various ways. Sometimes they are named in the body of the main text, or the methods. Sometimes they are included as notes. Sometimes tools are mentioned only in figure legends, or only in references. In this case, some details were found in the “Author information” section.


As noted above, most were found in the supplemental information. And in this example, this could be in the text or in tables. This is quite typical of these large project papers, in our experience. Anyone attempting to text-mine publications for this type of information should be aware of this variety of locations for this information.
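As a rough sketch of what text-mining for tool mentions might look like, here is a minimal Python example. The tool names and the sample sentence are invented for illustration; a real pipeline would also need to handle aliases, case variants, and names split across supplementary files:

```python
import re

def count_mentions(text, names):
    """Count whole-word mentions of each named bioinformatics item in the text."""
    counts = {}
    for name in names:
        # Lookarounds keep e.g. "BEDTools" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        counts[name] = len(re.findall(pattern, text))
    return counts

# A tiny hand-curated subset of names (the real list had 133 items).
tool_names = ["RepeatMasker", "BEDTools", "ENSEMBL", "FASTQ"]
sample = ("Repeats were annotated with RepeatMasker; RepeatMasker output was "
          "intersected with gene models (ENSEMBL v70) using BEDTools.")
print(count_mentions(sample, tool_names))
```

Even this toy version shows why manual curation was still needed: a name list has to exist before any matching can happen, and "in house Perl scripts" leave nothing to match on.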

Which bioinformatics objects are involved in this paper?
Describing bioinformatics tools, resources, databases, files, etc, has always been challenging. These are analogous to the “reagents” that I would have put in my benchwork biology papers years ago, where details such as enzyme vendors, mouse strain versions, or antibody species may matter to the outcome. They constitute things you would need to reproduce or extend the work, or to appropriately understand the context. But in the case of bioinformatics, this can mean file formats such as FASTQ or the axt format from UCSC Genome Browser. They can mean repository resources like the SRA. They can be various different versioned downloaded data sets from ENSEMBL (version 67, 69, 70, or 73 here, which were counted only once as ENSEMBL). It might be references to Reactome in a table.
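To make the “file formats as reagents” point concrete: FASTQ stores each sequencing read as four lines (an @-prefixed identifier, the sequence, a “+” separator, and a quality string). A toy parser, just to illustrate the structure rather than serve as a production tool, might look like:

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ-formatted lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                    # the '+' separator line; ignored here
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

# A single made-up record:
records = list(parse_fastq(["@read1", "ACGTACGT", "+", "IIIIIIII"]))
print(records)  # [('read1', 'ACGTACGT', 'IIIIIIII')]
```

Knowing this kind of structural detail is exactly the sort of background a newcomer would need before the supplementary methods of a genome paper make sense.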

With this broad definition in mind, Table 1 provides the list of named bioinformatics objects extracted from this project. The name or nickname or designation, the site at which it can be found (if available), and a publication or some citation is included when possible. Finally, a column designates whether it was found in the main paper as well.

What is not indicated is that some are referenced multiple times in different contexts and usages, which might cause people to not realize how frequently these are used. For example, ironically, RepeatMasker was referenced so many times that I eventually stopped marking it up.

Table 1. Software tools, objects, formats, files, and resources extracted from a typical mammalian genome sequencing project. See the web version supplement to this blog post, or access it at FigShare:



What can we learn about the source or use of these items?
Searches for information about the source code, data sets, file types, repositories, and associated descriptive information yield widely varying levels of access. Some objects are associated with traditional scientific publications and have valid and current links to software or data (but are also sometimes incorrectly cited). These may be paywalled in certain publications, or are described in unavailable meeting papers. Some do not have associated publications at all, or are described as submitted or in preparation. Some tools remain unpublished in the literature, long after they’ve gone into wide use, and their documentation or manual is cited instead. Some reside on faculty research pages, some are student dissertations. Some tools are found on project-specific pages. Some exist on code repositories, sometimes deprecated ones that may disappear. A number of them have moved from their initial publications, without forwarding addresses. Some are allusions to procedures in other publications. Some of them are like time travel right back to the 1990s, with pages that appear to be original for the time. Some may be at risk of disappearing completely the next time an update to a university web site changes site access.

Other tools include commercial packages that may have unknown details, versions, or questionable sustainability and future access.

When details of data processing or software implementations are provided, the amount can vary. Sometimes parameters are included, others not.

Missing tool I wanted to have
One of my favorite data representations in the project results was Figure 2 in the main paper, Oxford grids of the species comparisons organized in a phylogenetic tree structure. This conveyed an enormous amount of information in a small area very effectively. I had hoped that this was an existing tool somewhere, but upon writing to the team I found it’s an R script by one of the authors, with a subsequent tree arrangement in the graphics program “Illustrator” by another collaborator. I really liked this, though, and hope it becomes available more broadly.

Easter eggs
The most fun citation I came across was the page for PHYLIP, and the FAQ and credits were remarkable. Despite the fact that there is no traditional publication available to me, a lengthy “credits” page offers some interesting insights about the project. The “No thanks to” portion was actually a fascinating look at the tribulations of getting funding to support software development and maintenance. The part about “outreach” was particularly amusing to us:

“Does all this “outreach” stuff mean I have to devote time to giving workshops to mystified culinary arts students? These grants are for development of advanced methods, and briefing “the public or non-university educators” about those methods would seem to be a waste of time — though I do spend some effort on fighting creationists and Intelligent Design advocates, but I don’t bring up these methods in doing so.”

Even the idea of “outreach” and support for use of the tools is apparently unclear to the tool providers. Training? Yeah, not in any formal way.


The gibbon genome sequencing project provided an important and well-documented example of a typical project in this arena. In my experience, this was a more detailed collection and description than many other projects I’ve explored, and some tools that were new and interesting to me were provided. Clearly an enormous number and range of bioinformatics items, tools, repositories, and concepts are required for the scope of a genome sequencing project. Tracing the provenance of them, though, is uneven and challenging, and this is not unique to this project—it’s a problem among the field. Current access to bioinformatics objects is also uneven, and future access may be even more of a hurdle as aging project pages may disappear or become unusable. This project has provided an interesting snapshot of the state of play, and good overview of the scope of awareness, skills, resources, and knowledge that researchers, support staff, or students would need to accomplish projects of similar scope.

It used to be simpler. We used to use the small number of tools on the VAX, uphill, in the snow, both ways, of course. When I was a grad student, one day in the back of the lab in the early 1990s, my colleague Trey and I were poking around at something we’d just heard about—the World Wide Web. We had one of those little funny Macs with the teeny screens, and we found people were making texty web pages with banal fonts and odd colors, and talking about their research.

Although we had both been using a variety of installed programs or command lines for sequence reading and alignment, manipulation, plasmid maps, literature searching and storage, image processing, phylogenies, and so on—we knew that this web thing was going to break the topic wide open.

Not long after, I was spending more and more time in the back room of the lab, pulling out sequences from this NCBI place (see a mid-1990s interface here), and looking for novel splice variants. I found them. Just by typing: no radioactivity and gels required by me! How cool was that? We relied on Pedro’s List to locate more useful tools (archive of Pedro’s Molecular Biology Search and Analysis Tools).

Both of us then went off into postdocs and jobs that were heavily into biological software and/or database development. We’ve had a front seat to the changes over this period, and it’s been really amazing to watch. And it’s been great for us—we developed our interests into a company that helps people use these tools more effectively, and it has been really rewarding.

At OpenHelix, we are always trying to keep an eye on what tools people are using. We regularly trawl through the long, long, long supplementary materials from the “big data” sorts of projects, using a gill net to extract the software tools that are in use in the community. What databases and sites are people relying on? What are the foundational things everyone needs? What are the cutting-edge things to keep a lookout for? What file formats or terms would people need to connect with a resource?

But as I began to do it, I thought: maybe I should use this as a launching point to discuss some of the issues of software tools and data in genomics. If you were new to the field and had to figure out how a project like this goes, or what knowledge, skills, and tools you’d need, could you establish some idea of where to aim? So I used this paper to analyze the state of play: what bioinformatics sites/tools/formats/objects/items are included in a work of this scope? Can you locate them? Where are the barriers or hazards? Could you learn to use them and replicate the work, or drive forward from here?

It was illuminating to me to actually assemble it all in one place. It took quite a bit of time to track the tools down and locate information about them. But it seemed to be a snapshot worth taking. And I hope it highlights some of the needs in the field, before some of the key pieces become lost to the vagaries of time and technology. And also I hope the awareness encourages good behavior in the future. Things seem to be getting better—community pressure to publish data sets and code in supported repositories has increased. We could use some standardized citation strategies for the tools, sources, and parameters. The US NIH getting serious about managing “big data” and ensuring that it can be used properly has been met with great enthusiasm. But there are still some hills left to climb before we’re on top of this.


Carbone L., Harris R.A., Gnerre S., Veeramah K.R., Lorente-Galdos B., Huddleston J., Meyer T.J., Herrero J., Roos C., Aken B., Anaclerio F. et al. (2014). Gibbon genome and the fast karyotype evolution of small apes. Nature, 513(7517), 195-201. DOI:

FigShare version of this post:

A History of Bioinformatics (told from the Year 2039)

A week or so back I was watching the chatter around the #ISMB / #BOSC2014 meeting, and saw a number of amusing and intriguing comments about Titus Brown’s keynote talk.

You can see a lot of chatter about it in the Storify. I was delighted to soon see this follow up tweet:

I didn’t have time to watch it right away, but when I did, I really enjoyed it. It’s worth your time if you have some interest in the directions of this field. It’s not easy to pull off a talk as if you are 25 years into the future. It’s also rife with danger, as later people might use pieces of it against you. Lincoln Stein wrote an amusing follow-up to a prediction talk he gave in 2003, entitled: Bioinformatics: Gone in 2012 (follow-up piece linked below). Or it could just end up so embarrassingly off-target that you’ll look like some of the folks that Titus highlights in the talk, whose predictions about future technologies were pretty…um…well, you’ll see. But it’s a clever way to think about the future that we want, and how the path could look to get us there.

SPOILERS: Here are some of my favorite tidbits, mostly for my own notes:

  • Bioinformatics sweatshops [I fear this too]
  • California has disappeared [egads, but...]
  • MicrosoftElsevier [snicker]
  • Universities have collapsed [hmm, not convinced on this]
  • Pioneering appointment of Phil Bourne: “NIH finally realized that training was important” [~20min; oh, please let this come true]
  • the problems of “Glam Data” [contrast to "glam journals" today]
  • in the future, because of better education, 80% of the US will accept evolution [from your lips to...wait...]
  • ~33min, interesting look at the actual outcomes of techno-progress and how they diverged from predictions; via Heinlein’s “Where To?” with 4 curves of predicted human progress (linked below). [Heh, I'm in this argument a lot, this could be handy--piece + chart linked below]
  • “I have no idea what I’m doing, but I’m trying new things.” [~38min, about forging uncharted directions in a young field]
  • At the end, ~56min: “Let the crazy people do the crazy things. See what happens.” [Testify.]

Boy, the pressure is on Phil Bourne to solve everything. This is a recurring theme at every genomics and bioinformatics event I see lately…I wish him luck sorting this out. Good news from this talk is that he seems to have done it.

And the slides are here, with Talk notes for the Bioinformatics Open Source Conference (2014) at Titus’ blog.


Stein L.D. (2008). Bioinformatics: alive and kicking, Genome Biology, 9 (12) 114. DOI:

Heinlein R. (1952). Where to?, Galaxy Magazine, February 13-22. ["Your personal telephone will be small enough to carry in your handbag." Well, he nailed that one.]

{sorry, had to republish to get it into the ResearchBlogging queue. RB was down yesterday.}

Video Tip of the Week: New UCSC “stacked” wiggle track view

This week’s video tip shows you a new way to look at the multiWig track data at the UCSC Genome Browser. A new option has recently been released (see 06 May 2014), a “stacked” view, and it’s a handy way to look at the data with a new strategy. But I’ll admit it took me a little while of working with it to understand the details. So in this tip I hope you’ll see what the new visualization offers.

I won’t go into the background on the many types of annotation tracks available; if you need to be introduced to the idea of the basic track views, start out with our introduction tutorial that touches on the different types of graphical representations. Custom tracks are touched on in the advanced tutorial. For guidance specifically on how to create the different track types, see the UCSC documentation. The type of track I’m illustrating in the video today, a MultiWig track, has its own section over there too. Basically, if you are completely new to this, the “wiggle” style is a way to show a histogram display across a region. MultiWig lets you overlay several of these histograms in one space. In the example I’ll show here, the results of looking at 7 different cell lines are shown for some histone mark signals (Layered H3K27Ac track).

Annotation track cell lines


When I saw the announcement, I thought this was a good way to show all of the data simultaneously. When we do basic workshops, we don’t always have time to go into the details of this view, although we do explore it in the ENCODE material, because the track I’m using is one of the ENCODE data sets. I’ll use the same track in the same region as the announcement, which is shown here:

But when I first looked at this, I wasn’t sure if the peak (focus on the pink peak that represents the NHLF cell line) was meant to cover the whole area underneath or not. What I was trying to figure out is essentially this (a graphical representation of my thought process follows):


By trying out the various styles I was pretty sure I had the idea of what was really being shown, but I confirmed that with one of the track developers. The value is only the pink band segment, not the whole area below it. And Matthew also noted to me that they are sorting the tracks in reverse alphabetical order (so NHLF is the highest in the stack). That was an aspect I hadn’t realized yet. They are not sorting based on the values at that spot. This makes sense, of course, but it wasn’t obvious to me at first.
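The way I now understand the “stacked” view can be sketched in a few lines of Python: each band’s height is its own signal value, and each band is drawn on top of the cumulative sum of the bands below it, so the top edge of a band is not that track’s value on the y-axis. The cell-line names match the track, but the signal values here are made up for illustration:

```python
def stacked_bands(values):
    """Return (name, bottom, top) for each band, accumulating from the bottom up."""
    bands, bottom = [], 0.0
    for name, value in values:
        bands.append((name, bottom, bottom + value))
        bottom += value
    return bands

# Alphabetical order from the bottom up leaves NHLF (last alphabetically)
# on top of the stack, matching the reverse-alphabetical top-down ordering
# the track developers described. Signal values are invented.
signals = sorted([("NHLF", 5.0), ("K562", 2.0), ("GM12878", 3.0)])
for name, low, high in stacked_bands(signals):
    print(name, low, high)
# NHLF spans 5.0 to 10.0: its own value is 5.0, not the 10.0 where it tops out.
```

That accumulation step is exactly the source of my initial confusion: the visual height of the whole stack is the sum, while each band contributes only its own segment.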

I like this option very much–but I figured if I had to do some noodling on what it actually meant others might have the same questions.

In the video I’ll show you how this segment looks with the different “Overlay method” settings on that track page. I’ll be looking at the SOD1 area, like the announcement example.  I tweaked a couple of the other settings from the defaults so it would be easier to see on the video (see arrowheads for my changes). But I hope this conveys the options you have now to look at this type of track data effectively.

So here is the video with the SOD1 5′ region in the center, using the 4 different choices of overlay method, illustrating the histone mark data in the 7 cell lines. I’m not going into the details of the data here, but I’ll point you to a reference associated with this work for more on how it’s done; see the Bernstein lab paper below. I wanted to just demonstrate these new viewing options that will be available on wiggle tracks. Some tracks will have too much data for one type or another, or will be clearer with one or another style. But now you have an additional way to consider it.

Quick links:

UCSC Genome Browser:

UCSC Intro tutorial:

UCSC Advanced tutorial:

These tutorials are freely available because UCSC sponsors us to do training and outreach on the UCSC Genome Browser.


Kent W.J., Zweig A.S., Barber G., Hinrichs A.S. & Karolchik D. (2010). BigWig and BigBed: enabling browsing of large distributed datasets., Bioinformatics (Oxford, England), PMID:

Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M. et al. (2013). The UCSC Genome Browser database: 2014 update. Nucleic Acids Research. PMID:

Ram O., Goren A., Amit I., Shoresh N., Yosef N., Ernst J., Kellis M., Gymrek M., Issner R., Coyne M. et al. Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells. Cell. PMID:

The ENCODE Project Consortium, Bernstein B.E., Birney E., Dunham I., Green E.D., Gunter C. & Snyder M. et al. (2012). An integrated encyclopedia of DNA elements in the human genome., Nature, 489 PMID:

Also see the Nature special issue on ENCODE data, especially the chromatin accessibility and histone modification subset (section 02):

Participate in an NSF “IDEAS LAB” (generate research agendas and proposals)

The short link: IUSE IDEAS LAB:

NSF’s education directorate has a funding opportunity called “Improving Undergraduate STEM Education” (IUSE).

The IUSE program description [PD 14-7513] outlines a broad funding opportunity to support projects that address immediate challenges and opportunities facing undergraduate science, technology, engineering, and math (STEM) education.
To generate research agendas and proposals for this, NSF is holding an… 

Ideas Lab:
Ideas Labs are meetings that bring together researchers, educators and others in an “intensive, interactive and free-thinking environment, where participants immerse themselves in a collaborative dialog in order to construct bold and innovative approaches and develop research projects.” More often than not, these Ideas Labs produce new collaborations and research project proposals that go on to be funded. The Ideas Lab is patterned after the Ideas Factory process.
“to make new connections, which are frequently cross disciplinary, and also generate novel research projects coupled with real-time peer review.”
This NSF Ideas lab has several purposes, but the one most pertinent to this community is finding new ways, and develop research proposals, to infuse computational thinking, literacy and competency into the core curriculum for undergraduate education.
Individuals apply to the Ideas Lab with a 2-page proposal, which is DUE FEBRUARY 4 (next Tuesday). Funding is provided for the trip. These Ideas Labs are excellent ways to meet and discuss genomics, biology and education, build new collaborations, and develop new research proposals.
The letter and more information (read the link):
A Dear Colleague Letter on the topic of “Preparing Applications to Participate in Phase I Ideas Labs on Undergraduate STEM Education” [NSF 14-033] has been posted on the NSF web site.
If you have any questions, you can ask here or by email (wlathe AT ). I am _not_ a project officer at NSF and don’t have all the answers, but I can direct you to the places you might find answers.
PLEASE feel free to disseminate!

The Thanksgiving Genomes

Happy Thanksgiving to those who celebrate. For those of you who don’t, have a nice Thursday.

Light posting this week due to the holiday, but this might be fun for you to keep in your back pocket for dinner discussion–genome information of the traditional foods.

The Genome of Your Thanksgiving Supper

The genetic sequences of the turkey, apple, potato, and other traditional Thanksgiving ingredients are providing bountiful lessons for scientists.

Public service announcement: CAFA2 for protein functional annotations

Just got this email on the Biocurators mailing list, wanted to spread the word:

Announcing CAFA 2: The Second Critical Assessment of protein Function Annotations

Friends and Colleagues,

We are pleased to announce the Second Critical Assessment of protein Function Annotation (CAFA) challenge. The goal of the challenge is to predict functional annotations of genes/proteins. In CAFA, the organizers provide a set of about 100,000 protein sequences, of which most are completely unannotated and some are partially annotated with respect to their function. The participants are asked to predict functional annotation of these proteins before January 15, 2014. At that time, all predictions will be stored and we will wait for 6-12 months until new annotations are available in the biomedical literature and/or major databases. The initial evaluation will be provided in July 2014, during the ISMB conference (Boston, MA). Anyone in the world is welcome to participate.

In brief:

Web site:

Prediction submission deadline: January 15, 2014

Initial evaluation: July 12, 2014 in Boston

All targets can be downloaded from the CAFA web site, which also contains training data; however, the participants are *not* required to use it, and even if they do, they can use any additional data of their choice, including the literature. The CAFA challenge is different from many other similar challenges because not even the organizers know which of the original target sequences will be functionally annotated after the submission deadline.

The CAFA 1 experiment is described in the following paper:

P. Radivojac et al. A large-scale evaluation of computational protein function prediction. Nature Methods (2013) 10(3): 221-227.

A brief introduction to the problem for computer scientists is provided at:

The mission of the Automated Function Prediction Special Interest Group (AFP-SIG) is to bring together computational biologists who are dealing with the important problem of gene and gene product function prediction, to share ideas and create collaborations. We also aim to facilitate interactions with experimental biologists and biocurators.

We hope that AFP-SIG serves an important role in stimulating research in annotation of biological macromolecules, but also related fields.

New in CAFA 2:

In CAFA 2, we would like to evaluate the performance of protein function prediction tools/methods and also expand the CAFA challenge to include prediction of human phenotypes associated with genes and gene products. As last time, CAFA will be a part of the Automated Function Prediction Special Interest Group (AFP-SIG) meeting that will be held alongside the ISMB conference. AFP-SIG will be held as a two-day meeting in July 2014 in Boston.

About the CAFA experiment:

The problem: There are far too many proteins for which the sequence is known, but the function is not. The gap between what we know and what we do not know is growing. A major challenge in the field of bioinformatics is to predict the function of a protein from its sequence (and all other data one can find). At the same time, how can we judge how well these function prediction algorithms are performing and whether we are making progress over time?

The solution: The Critical Assessment of protein Function Annotation algorithms (CAFA) is an experiment designed to provide a large-scale assessment of computational methods dedicated to predicting protein function. We will evaluate methods in predicting the Gene Ontology (GO) terms in the categories of Molecular Function, Biological Process, and Cellular Component. In addition, predictors may use the Human Phenotype Ontology (HPO) for the human dataset. A set of protein sequences is provided by the organizers, and participants are expected to submit their predictions by the submission deadline, January 15, 2014. The predictions will be evaluated during the Automated Function Prediction (AFP) meeting, which has been approved as a Special Interest Group (SIG) meeting, at the ISMB 2014 conference (Boston, USA).
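A much-simplified sketch of that kind of assessment: treat a method’s output as a set of GO terms per protein and score it against the annotations that later accumulate. CAFA’s real evaluation is more sophisticated (predictions carry confidence scores and are assessed across thresholds), and the GO term IDs below are arbitrary examples:

```python
def precision_recall(predicted, true_terms):
    """Set-based precision/recall of predicted GO terms vs. true annotations."""
    predicted, true_terms = set(predicted), set(true_terms)
    true_positives = len(predicted & true_terms)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_terms) if true_terms else 0.0
    return precision, recall

# Arbitrary example GO term IDs for one target protein.
predicted = ["GO:0003677", "GO:0005634", "GO:0008270"]
annotated = ["GO:0003677", "GO:0005634", "GO:0006355"]
print(precision_recall(predicted, annotated))  # two of three correct each way: 2/3, 2/3
```

The hard part CAFA solves is not this arithmetic but the benchmark itself: holding predictions fixed until nature (and the biocurators) supply the answer key.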

History: The first CAFA experiment was conducted in 2010-2011. Twenty-three groups submitted fifty-four algorithms for assessment. The results and most methods were published in Nature Methods and in a special supplement in BMC Bioinformatics. CAFA 1 brought together a large group of computational predictors and, for the first time, provided us with a clear picture of the state of this important field. As with other critical assessment experiments, the aim of CAFA is to improve protein function prediction by continuously challenging groups to develop more accurate methods.

How to participate in CAFA 2?

1. Go to

2. Download target proteins, already available

3. Submit predictions on or before January 15, 2014

4. Join us at the AFP-SIG, July 11-12, 2014 in Boston for the eighth protein function prediction meeting, to hear the CAFA 2 results, to present your work, and to learn about the latest research in computational protein function prediction

More details at:

Confirmed keynote speakers:

Fiona Brinkman, Simon Fraser University, Canada

Mark Gerstein, Yale University, USA

We look forward to hearing from you!

The CAFA organizing Team: Predrag Radivojac, Michal Linial, Sean Mooney and Iddo Friedberg