Recently I was looking for something on YouTube, and I came across a lecture, "The Cost of Curation", from a recent conference. I've been interested in annotation and curation for a long time, as my entry point for bioinformatics was at The Jackson Lab and the Gene Expression Database project in its early days. Later I was at Proteome, which did curation in a company setting. So I suppose a disclaimer on this would be: some of my best friends have been curators, and I have huge respect for their skills and their knowledge.
From those perspectives I saw the flood of publications come in that needed curation. I saw the software development that made it somewhat more manageable. I saw the curation tool development that aided the curatorial tasks. And I saw a lot of high quality information extracted and organized by well-educated professionals who had great pride in the work.
Despite the fact that everyone wanted quality information in the databases they rely on, it wasn’t clear to me that people were aware of this behind-the-scenes sort of work. It was also increasingly clear as some long-standing resources struggled for funding that the work wasn’t valued to the extent I thought it should be.
A few things came along over time that changed the playing field to some extent. Wikification became hot. The idea was build-it-and-they-will-come: the community would show up and curate the stuff. Although this can succeed, it's clearly not always a solution for high-volume, standardized, and broad information. People might curate stuff they are particularly interested in, but trying to get the time and mind-share on less sexy topics…meh. Maintenance and updates over time? Er…not so much.
There were some hybrid cases, where journals would require some level of curation to help databases locate key publications. That has had some success. I've also heard reports that getting students to curate has worked, as a class exercise and educational opportunity (see CACAO).
But we've reached a new point in the flood of information as the data collection technology has continued to change. As I keep noting, the data is not in the papers anymore. A "marker paper" with some "compelling examples" might be published on a major project. But most of the data languishes, waiting to be examined and connected to other information, in databases and repositories around the intertubz.
So the challenge to manage the data goes on, and new tools and strategies are being developed. Text-mining and other automation methods are being explored. One project that’s been going on for some time is called BioCreAtIve: Critical Assessment of Information Extraction in Biology. You can see more about these ongoing efforts at the BioCreative site.
Active as an organizer and researcher in the BioCreative arena, Lynette Hirschman from MITRE has been tackling information extraction for a long time. Here you can see a recent talk of hers from the iDASH conference (Integrating Data for Analysis, Anonymization, and SHaring). She takes a look at the strategies used to obtain curated biomedical information, and the costs and quality of those efforts.
The summary slide comes at about 33 minutes if you want to jump to that. Essentially there are computational strategies, professional human curator strategies, and low-budget human curation strategies involving Amazon Mechanical Turk, or some combinations of these. I see more references to mTurk around the 'tubz lately. It's becoming pretty popular in academia for getting tasks done. What you can do is set up some tasks in little bites that need a human to look over–maybe it's extracting some features, tagging some item, locating related components of some paragraph, writing a summary, answering survey questions, etc–and put them out as HITs (human intelligence tasks). You establish how long a task should take, and assign a small value to that. You can also screen the people you want to do the tasks for some qualifications, and/or pre-screen with some testing, and select a subset of the Turking community to do them.
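For the curious, the HIT workflow above can be sketched in a few lines of code. Here's a rough example using the parameter names from boto3's Mechanical Turk `create_hit` API; the task text, reward, durations, and the 95% approval-rate screen are all illustrative assumptions on my part, not values from the talk:

```python
# Sketch of setting up one small curation task as a Mechanical Turk HIT.
# The task description, reward, and thresholds below are hypothetical.

def build_curation_hit(reward_usd="0.05", max_assignments=3):
    """Assemble the parameters for one bite-sized curation HIT:
    a short task, a tiny price, and a screen on worker approval rate."""
    return {
        "Title": "Tag gene mentions in one abstract",
        "Description": "Mark every gene or protein name in a short paragraph.",
        "Reward": reward_usd,                # paid per completed assignment
        "AssignmentDurationInSeconds": 600,  # how long one task should take
        "LifetimeInSeconds": 86400,          # HIT stays listed for a day
        "MaxAssignments": max_assignments,   # redundant workers, for agreement
        "QualificationRequirements": [{
            # Amazon's built-in "percent assignments approved" qualification:
            # only workers with a >= 95% approval rate may accept the task.
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
        }],
    }

# To actually post it (requires AWS credentials and a question form):
#   import boto3
#   mturk = boto3.client("mturk",
#       endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com")
#   mturk.create_hit(Question=question_xml, **build_curation_hit())
```

The pricing and redundancy knobs are where the "cost of curation" math happens: reward times assignments times HITs gives you the budget, before you've judged whether the output is any good.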
Here’s the slide with the summary details, but you should watch the video for the whole context:
Curation is a commodity. Its cost can be determined.
But what is the value of curation? Is it worth it to have PhDs use their skills to extract and connect relevant information? When I was reading up on Margaret Dayhoff's struggles to get support for early PDB curation, I saw that it suffered from a stigma: people considered it "stamp collecting". I think that persists.
Is "mediocre" low-budget curation worth it in the long run? I looked into "Turking" a bit to understand it from the curation side. It certainly is low budget. It's a system that requires people to have computers, internet service, and language skills, and it pays very little (in pre-tax dollars) for tasks. It offers no benefits. I think if it weren't on the internet it would violate every labor standard in the world, even those in banana republics. But a large community of "turkers" will do these jobs. It looks to me like the WalMartization of curation. Others may think differently about it, as C. Titus Brown observes in his series "w4s – the awesomeness we're experiencing":
“this” = Experimental Turk.
I wish skilled, professional curation was valued more. But I have seen the future, and I suspect it's Turkers. With my opinion, and $3 (which you can maybe get for an hour or two of turking), you might be able to buy a cup of coffee at Starbucks. Professors in the future can Turk at Starbucks between sessions of their MOOCs to pay for their coffee. I guess it all works out.
Here’s the full video. Have a look, and think about the directions of the field and the type of curation we want.
Hirschman, L., Yeh, A., Blaschke, C., & Valencia, A. (2005). Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6(Suppl 1). DOI: 10.1186/1471-2105-6-S1-S1

Lu, Z., & Hirschman, L. (2012). Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II. Database, 2012. DOI: 10.1093/database/bas043