For quite a while I’ve been watching the development of ContentMine. There have been a number of different ways to text-mine the scientific literature over the years. Most of the efforts I’m familiar with aim at a specific subset of the literature: species-specific mining, topic-specific mining (such as interaction data, or a field like cancer or virology), extraction of gene-related tidbits, and so on. Sometimes the tools have been limited to publicly available abstracts, which misses much of the knowledge that’s embedded in the actual papers and, lately, in the extraordinary “supplemental” sections. Those are making me crazy, because much of the key information I need on software tools is buried deep within them. But the philosophy of ContentMine is to go big across the entire realm of scientific publication, as they describe in their “about” page:
To make this a reality we are building software and training resources so that together we can liberate 100,000,000 facts from the scientific literature.
And they want to make all of this available to you, so you can pull out the subset that’s useful to your research. You can learn about their philosophy and strategies from this video, as well as some of the specific tasks that they have been working on to get to the point where people could use their resources and tools to extract information.
One of the things that always worried me about mining was how much of the information in images and tables and supplements wasn’t available. But they are also tackling this, as the video explains.
The reason this floated to the top of my “blog drafts” list, though, is a great and current example of using their resources for an emerging public health issue. They’ve just released a sample video of accessing information related to the Zika virus. I think it’s a nice concrete demonstration of how ContentMine can be quickly deployed on a topic to pull out relevant research details.
So have a look at their project. Details about specific tools have also been written up in the papers linked below. There are more videos in their YouTube and Vimeo collections that can help you learn more. Some are longer, and some are more specific to a task. There’s a lot more information at their site as well. They are eager to help people get the most out of the literature. You should have a look and see how it can help you, and maybe how you can help them.
ContentMine site: http://contentmine.org/
YouTube channel: https://www.youtube.com/channel/UCM1gxtWZOJeDK7KL7MAZWGA
Vimeo videos: https://vimeo.com/petermr
Follow them on Twitter: https://twitter.com/TheContentMine
Smith-Unna, R., & Murray-Rust, P. (2014). The ContentMine Scraping Stack: Literature-scale Content Mining with Community-maintained Collections of Declarative Scrapers. D-Lib Magazine, 20(11/12). DOI: 10.1045/november14-smith-unna
Murray-Rust, P., Smith-Unna, R., & Mounce, R. (2014). AMI-diagram: Mining Facts from Images. D-Lib Magazine, 20(11/12). DOI: 10.1045/november14-murray-rust