Tag Archives: curation


Video Tip of the Week: Introduction to Biocuration and the career path


The ISB is a professional organization for biocurators

At OpenHelix, we’ve long sung the praises of curators. Some of us have been curators and worked with curation and database development teams. All of us have relied on quality information in the databases for research and teaching. But I think there are a lot of people who don’t understand the value of quality curation, how it’s done, and who curators are. They are widely taken for granted.

A recent talk by Claire O’Donovan of EMBL-EBI helps to explain the roles and the importance of biocurators. So although this talk isn’t a typical software talk, I think understanding this is crucial to everyone’s appreciation of how information you rely on gets into the databases you use. And if you find yourself in situations where you are guiding students, knowing about this career is also worthwhile.

Claire O’Donovan has had a front row seat to the development of this field, and has great enthusiasm for the future. And going forward, in your doctor’s office as precision medicine and treatments become a thing–how much do you want correct information in the databases? Mining data, standardizing language for descriptions of features, and sharing this information is crucial for all of us.

Here’s what’s covered in this video, from the agenda slide:

  • Introduction to the concept of biocuration.
  • The different kinds of biocurators, and the skill set needed.
  • Our community: Biocuration Society and conference.
  • The future of biocuration and career paths.

Specific examples of what curators do are illustrated (~6:30 min). A sample UniProt entry illustrates what kind of information is captured and where it appears. She also touches on their work with Gene Ontology, and gives a bit about the ecosystem of curation: how teams at different resources help each other while avoiding duplicated work, using HGNC nomenclature as an example.

At about 8 min, the skill sets for biocuration are covered: data basics, curation skills, programming and database concepts, ontologies, and usability of the data collected. Also covered are data access and management, as well as dissemination and outreach, which includes user training (yay!) and the concepts of data analysis for users.

There’s no formal degree path for curation practitioners at this point, and different groups will have different needs. But the community is beginning to think about this, and about professional qualifications. She also mentioned a recent report from the National Academies Press on the topic of the future workforce skills and needs (linked below). This is an alternative career route for people with science training, and it’s important to understand not only the science but also the computational pieces. And it should be taken seriously as a discipline. There is now a journal that reflects this (also linked below).

Claire also takes a look at the future of biocuration, using the Centre for Therapeutic Target Validation (CTTV) as an example. And she talks about the importance of quality information in medical records as we increasingly have genomic details in diagnosis and treatment situations. If we want precision medicine to work, we have to have the precise and correct information in the databases. So respect and value the curators. They are worth it. And if you know anyone who deserves special recognition–nominate!

Quick links:

International Society for Biocuration:  http://biocuration.org/

Preparing the Workforce for Digital Curation: http://www.nap.edu/catalog/18590/preparing-the-workforce-for-digital-curation 



Holliday, G., Bairoch, A., Bagos, P., Chatonnet, A., Craik, D., Finn, R., Henrissat, B., Landsman, D., Manning, G., Nagano, N., O’Donovan, C., Pruitt, K., Rawlings, N., Saier, M., Sowdhamini, R., Spedding, M., Srinivasan, N., Vriend, G., Babbitt, P., & Bateman, A. (2015). Key challenges for the creation and maintenance of specialist protein resources. Proteins: Structure, Function, and Bioinformatics, 83 (6), 1005-1013. DOI: 10.1002/prot.24803

Gaudet, P., Munoz-Torres, M., Robinson-Rechavi, M., Attwood, T., Bateman, A., Cherry, J., Kania, R., O’Donovan, C., & Yamasaki, C. (2013). DATABASE, The Journal of Biological Databases and Curation, is now the official journal of the International Society for Biocuration. Database, 2013. DOI: 10.1093/database/bat077

Tip of the week: ORegAnno for regulatory annotation

Lately we’re getting a lot of questions about ways to analyze the promoters and other regulatory aspects of genes. And for a while we were mostly pointing to the prediction data that was available in the UCSC Genome Browser’s TFBS Conserved track. TFBS Conserved is a track of computationally predicted transcription factor binding sites (TFBS) which are conserved across human/mouse/rat and based on Transfac v7.0 by BioBase.  As they say in the track description, it’s important to know this:

The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites.

Though this is useful, people have been wanting more evidence based on real binding and/or activity data. Today’s tip will talk about 2 ways to get other data–beyond computational predictions. First we’ll explore ORegAnno so you’ll understand the data sources, and then we’ll also look at that data in the context of the UCSC Genome Browser and some useful data from the ENCODE project.

ORegAnno is the Open Regulatory Annotation Database, a community literature curation project for regulatory information. Anyone can participate in the curation–they provide helpful curation tools and automated cross-linking and checking features that make it easier. You register and curate, and the data becomes available to anyone. Through the available curator tools, the data is also loaded into projects that coordinate with ORegAnno–including the ORegAnno track at the UCSC Genome Browser.

In the paper published in NAR 2008, they stated this:

The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for over 3853 genes and 465 transcription factors from 19 species.

So that’s a nice set with traceable data that’s not just computational predictions. In the tip I’ll show one example of Stat1 binding, in human, near the Il10 gene. If you look at that record, you’ll see several pieces of evidence that support this data and a link to the publication that offers it.

Now, if you look at ORegAnno data over in the UCSC Genome Browser, you could compare it to the computational predictions, or TFBS data from other projects such as the ENCODE data sets with the ChIP-seq data (Yale TFBS and HAIB, for example; note: you may have to go back an assembly because the ENCODE data is not all on the current assembly at this time). This is what I show in the movie: I take an ORegAnno annotated item, visualize that with the TFBS Conserved predictions and with some ENCODE project data. So you get all 3 types of data with a few clicks.

So there are several ways to look for TFBS data–some of it computational predictions, some literature curation, and some big data stuff from the ENCODE teams. All of them have strengths and caveats. Computational predictions may be genome wide and independent of a given cell or tissue type, but are subject to the constraints of the algorithms. Community literature curation can offer quality evidence, but may be selected by interested groups and not as broadly representative of the genome-wide situation. Big data projects can be genome-wide and have evidence in some cell types, but may be in progress and subject to checking as they are pre-publication data.  But effectively using them all could help you to understand regulation of genes that you might be interested in.
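If you would rather make the same three-way comparison outside the browser, the core operation is just an interval overlap check. Here is a minimal sketch in Python, assuming you have already exported each track as simple chromosome/start/end records. The coordinates, labels, and function names below are made up for illustration and do not come from ORegAnno, TFBS Conserved, or ENCODE.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    label: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def supporting_evidence(query, tracks):
    """Map each track name to the labels of items overlapping the query."""
    return {
        name: [item.label for item in items if overlaps(query, item)]
        for name, items in tracks.items()
    }


# Hypothetical records standing in for one curated item and two other tracks.
curated = Interval("chr1", 1000, 1020, "curated STAT1 site")
tracks = {
    "predicted": [
        Interval("chr1", 1005, 1015, "predicted motif"),
        Interval("chr1", 5000, 5010, "other motif"),
    ],
    "chip_seq": [
        Interval("chr1", 900, 1100, "ChIP-seq peak"),
        Interval("chr2", 1000, 1020, "peak on another chromosome"),
    ],
}

evidence = supporting_evidence(curated, tracks)
for track_name, labels in evidence.items():
    print(track_name, labels)
```

A linear scan like this is fine for a handful of items around one gene. For genome-wide comparisons you would want sorted intervals or a dedicated interval tool instead.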

Quick Links:

ORegAnno: http://www.oreganno.org/

Biobase and Transfac: http://www.gene-regulation.com/pub/databases.html

UCSC Genome Browser: http://genome.ucsc.edu/

ENCODE data at UCSC: http://genome.ucsc.edu/ENCODE/

Griffith, O., Montgomery, S., Bernier, B., Chu, B., Kasaian, K., Aerts, S., Mahony, S., Sleumer, M., Bilenky, M., Haeussler, M., Griffith, M., Gallo, S., Giardine, B., Hooghe, B., Van Loo, P., Blanco, E., Ticoll, A., Lithwick, S., Portales-Casamar, E., Donaldson, I., Robertson, G., Wadelius, C., De Bleser, P., Vlieghe, D., Halfon, M., Wasserman, W., Hardison, R., Bergman, C., Jones, S., & The Open Regulatory Annotation Consortium. (2007). ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Research, 36 (Database). DOI: 10.1093/nar/gkm967

Tip of the Week: Word Add-In for Ontology Recognition

In today’s tip I want to make you aware of a tool that I think will help researchers to present their own data and publications in an accurate and universally searchable way. I learned of the resource (UCSDBioLit) through an article in one of my recent BioMed Central article alert emails. This resource allows authors to mark up their own publications with XML tags AS THEY WRITE their papers. This will allow faster and more accurate semantic searching of their research.

A huge problem in science today is the ability to quickly search the vast literature base and to accurately and efficiently find the data that you are interested in. Here at OpenHelix we focus on ways to effectively and efficiently get information out of public databases and resources, but at the other end of the process is the ability for scientific knowledge to be curated into those resources. We have featured biocurators and the phenomenal work that they do several times in the past, but it is work that never ends and can be very labor intensive. It often involves an initial triaging of a field’s literature, some level of automatic information gathering, and then careful manual effort on the part of scientists at the resource to gather and present the information through their site. I know from personal experience that the process of reading a paper, clarifying research details with an author, and then presenting that information to the author’s satisfaction can be a very long & labor intensive process, for both the curator AND the original author.

For years there has been discussion of ‘expert curation’ in which experts in the field author review or summary pages in a resource, or community curation jamborees, etc. And there have been fruits from many of these efforts, but in general participation is low. But who is more of an expert on the research being published than the author? If authors could/would mark up their own papers during the publication process, not only could they be assured that it would be accurate but they would help make their research universally searchable without the lag required for searchability through a specific resource. Thus far document mark-up has not been an easy process and has largely been deemed ‘not worth the effort’ for the level of attribution/recognition affiliated with it.

The BioMed Central article does a nice job of outlining and discussing many of these issues. It cites many other efforts and resources, explains their motivation and the implementation of their software. A nice feature of the tool is that there are interoperability features, and a real commitment to conforming with existing standards of practice. The article also presents an appendix of resource addresses of other groups involved in semantic searching and literature publication. I especially like this quote from the paper:

The Word add-in presented here will assist authors in this effort using community standards and by making it possible for the author of the document, the absolute expert on the content, to do so during the authoring process and to provide this information in the original source document.
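To make the idea of marking up ontology terms in running text a little more concrete, here is a minimal sketch in Python. It is not how the Word add-in works internally. It is only a toy version of the general technique, recognizing known terms and wrapping them in XML tags that carry an identifier. The two-entry term dictionary, the EX: identifiers, the term tag name, and the mark_up function are all hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mini "ontology": lower-cased term -> made-up identifier.
TERMS = {"dna repair": "EX:0001", "apoptosis": "EX:0002"}


def mark_up(text, terms=TERMS):
    """Wrap each recognized term in a <term id="..."> element.

    Matching is case-insensitive and limited to whole words. Longer terms
    are tried first so that a phrase wins over any shorter term inside it.
    """
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(term) + r"\b" for term in ordered),
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text so the output stays well-formed XML.
        parts.append(escape(text[last:match.start()]))
        term_id = terms[match.group(0).lower()]
        parts.append(
            f"<term id={quoteattr(term_id)}>{escape(match.group(0))}</term>"
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


tagged = mark_up("Loss of DNA repair triggers apoptosis & arrest.")
print(tagged)
```

A real tool needs much more than exact string matching, such as synonyms, ambiguity handling, and author confirmation, which is part of why doing this during authoring, with the author in the loop, is appealing.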

You can also find brief tutorials on using the tool at SciVee: Word Add-in for Ontology Recognition Tutorial (1 of 4): Install Process

As a note, literature mark-up and enabling are currently an active area – Mary found another literature handling resource and paper as well. Check out the tip, the articles & the tools. Tell me what you find/think. Thanks! (OH, and Happy St. Patty’s to ya!)

UCSDBioLit Reference:
Fink, J., Fernicola, P., Chandran, R., Parastatidis, S., Wade, A., Naim, O., Quinn, G., & Bourne, P. (2010). Word add-in for ontology recognition: semantic enrichment of scientific literature. BMC Bioinformatics, 11 (1). DOI: 10.1186/1471-2105-11-103


Are you curious about what biocurators do?

It may not surprise you that we at OpenHelix are pretty heavy-duty users of curated information from databases. It might surprise you to know that some of us have been involved in actually curating them as well. In both public and commercial situations, we’ve been on the curation side. And lately we’ve been heavily on the end-user side.

So we’ve been huge supporters of curators for a long time. We know that they are the ones responsible for the most trustworthy data in the databases. We know the intelligence, the focus, the attention to detail, and the training it takes to do this well.

Biocurators rock. If you do biomedical research and use the data from the databases, you can thank a biocurator.

But maybe you don’t know that much about exactly what biocurators do if you are mostly an end user of the databases. I’d like you to meet some of them. The new International Society for Biocuration has been established to foster development and respect for biocuration as a career choice and career path.

They are also currently holding an election for their board.  Have a look at the slate of candidates, and read some of those statements.  Check out the varied backgrounds on these folks–you’ll be seriously impressed with their skills and dedication to good data.

And if you are a biocurator, and a member of the society, I would encourage you to have a special look at Jennifer Williams of OpenHelix.  Jennifer is an incredible member of our team, and we totally support her candidacy for the ISB board.  She would bring useful skills to the job from the project management perspective, I assure you.  She knows both sides of curation equally well: getting data in and getting data out.  She’s also very much a bridge-builder with a very gentle and effective way of bringing people together in the right place.

If you have a membership and intend to vote, please consider voting for Jennifer Williams for one of the 6 board members.  She is a real gem, and she’s certain to serve capably and effectively.

Tip of the Week: UCSC wiki annotations


In the continuing effort to get scientists and researchers to annotate and curate data and to capture the huge amount of knowledge available, UCSC Genome Browser has added a wiki annotation track to the browser. It’s not the first effort, of course: GeneWiki is an effort, with mixed results so far, to annotate gene function information as a community exercise using Wikipedia. Some journals are requiring wiki entries, and several databases have opened wikis for curation. Wikis could be a solution for capturing the exponentially increasing amount of data, or they could be just another place for adding confusion… or both. I suspect that out of the plethora of wikis becoming available for annotation and curation of genomic data, something will stick and find that Goldilocks balance of a dedicated community, ease of use, usability, and other aspects that will be needed for this to work.

Perhaps UCSC Genome Browser has that balance. It will remain to be seen, but let’s get started. Today’s tip is introducing the new wiki track in the UCSC Genome Browser.

Required Wiki updates?

In the push to ‘communitize’ annotation and curation, one journal, RNA Biology, is requiring submitters to add or update their RNA sequences on Wikipedia. This article suggests that it’s working so far (update, link to the article added):

The first examples of this program in action are already online. The journal is hosting an open access paper that describes a family of RNA molecules found in nematode worms; a corresponding Wikipedia page is already in place. In good Wikipedia form, the phylogenetic analysis of these RNAs is dinged for not providing citations, while the article as a whole is flagged as having excess jargon. (The talk page hosts an interesting discussion of how much jargon can possibly be eliminated from a highly technical description like this.)

So far, everyone is happy with the results. A few scientists have started updating the scientific content of the RNA entries, while the usual Wikipedia denizens have helped out in terms of catching typos and improving the formatting. The people backing the project expect that it will be immune to some of the issues that plague other Wikipedia entries; Nature quotes one of the biologists as saying, “We don’t think vandalism will ever be as much of a problem for a Wikipedia page on transfer RNAs as it is for a page on George Bush.”

And looking at that one entry, it does seem to. But I have a question: if researchers who wish to publish are soon required not only to submit to and annotate a database but also to curate and annotate wikis, doesn’t this start to place an undue burden on researchers who already have grant writing, teaching, and more in addition to actual research? There does need to be a solution to the growing need for curation and annotation of data, and it will be interesting to see if this is one solution that will hold.

Tip of the Week: Adopt-a-Species

Ever wanted to adopt a pet? Perhaps you’ve thought of donating to a zoo by “adopting” a zoo animal. Well, you can do even more. The Encyclopedia of Life (which I’ve written about before) needs someone to adopt a species as an ‘authenticator/curator’. The EOL has a lot of potential, but it’s going to require some volunteer work. This week’s tip introduces the EOL, what kind of data is there in the ‘model’ pages and shows you how to sign up!

Eh, enter your own damn data….

I was looking over the Eurekalert announcements and came across one that I have been percolating about now for some time. It is an effort I fully support and encourage. But I worry about a few aspects of it. The alert is entitled: Controlling a sea of information. The Arabidopsis Information Resource (TAIR) has partnered with the journal Plant Physiology to ensure data from Plant Physiology papers will get into the TAIR database. The longer story is available from the alert and from the associated Editorial. The short story is: there aren’t enough curators to keep up with all the data coming out. This prevents a lot of information from getting into the databases. The TAIR and PlantPhysiol folks have teamed up to create a way for the authors themselves to get this information into TAIR with a simple form.
