This tip isn’t bioinformatics per se, but it’s a tool that I recently found very quick and handy for prioritizing a giant pile of literature that I had in my lap. I’ve been participating in a curation project in which all the papers have to get into a database, but because the data extraction process is uneven, I wanted to prioritize some groups in a meaningful (but quick) way. I needed rapid and bespoke text-mining.
“Overview” will do that for you. You can take a giant pile of documents (in my case, PDFs) and ask it to quickly sort them into subsets based on words of interest to you. It’s pretty flexible: you can ask it for new sorting or tagging words on the fly. You can also tag the subsets with handy reminders, or with other categorizations that you need.
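To give a feel for the basic idea of sorting documents into subsets by words of interest, here is a minimal Python sketch. It is not Overview's code and makes no claim about how Overview works internally; the function name `tag_documents` and the sample file names are made up for illustration, and it assumes the documents have already been converted to plain text (PDFs would need a text-extraction step first).

```python
import re
from collections import defaultdict

def tag_documents(docs, keywords):
    """Group document names under each keyword that appears in their text.

    Hypothetical helper for illustration only; not part of Overview.
    docs: dict mapping a document name to its plain text.
    keywords: list of single words to look for (case-insensitive).
    """
    groups = defaultdict(list)
    for name, text in docs.items():
        # Split the text into lowercase word tokens for whole-word matching
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for kw in keywords:
            if kw.lower() in words:
                groups[kw].append(name)
    return dict(groups)

# Made-up sample documents standing in for a folder of papers
docs = {
    "paper1.txt": "Gene expression in mouse liver tissue.",
    "paper2.txt": "A zebrafish model of heart regeneration.",
    "paper3.txt": "Comparing mouse and zebrafish genomes.",
}

groups = tag_documents(docs, ["mouse", "zebrafish", "human"])
for keyword, names in groups.items():
    print(keyword, names)
```

A document can land in more than one group, and a keyword that matches nothing simply gets no group. Changing the keyword list and rerunning is the rough equivalent of asking for new sorting words on the fly.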
Certainly there may be more text-mining you want to do with your literature afterward, but for a quick sort, and as a potential way to do discovery on some word combinations, this is a really handy way to explore. And of course it’s not limited to PDFs. You could do a batch of tweets from a conference. You could sort emails. You could sort NSA- or WikiLeaks-style document dumps, should you be so inclined.
Hat tip to Donna Murdoch on Google+ for the lead. The link she found, to Robin Good’s Content Curation World, described it as a terrific tool for journalists, but it’s definitely a broad tool. (The project lead, Jonathan Stray, teaches “computational journalism”. I didn’t know that was a thing, but I like it.)
“Overview is a new free tool designed for investigative journalists and researchers interested in finding relevant information within large collections of text documents, from reports to social media tweets.”
“Overview greatly simplifies the task of analyzing, indexing and visualising large document collections in ways that can allow a journalist to identify relevant patterns and threads across thousands of different documents.”
I’ll let their video describe how it works. I found it really simple and effective on a huge folder of papers I had. Once everything was loaded, I could sort them by species, and then by other useful terms, very quickly.
I like the intuitive folder flow. I like the color coding. I found the tagging really handy. There’s another video I found helpful to get started with my documents: Learn Overview in 90 seconds. I had to look up a couple of other things, but I found everything I needed to get working with the data set very quickly at their site.
Their site is Overviewproject.org, and you can use the tool online there. Or you can download the code from GitHub and set up your own copy.