We have a slide we like to present at some trainings showing the rise in the amount of raw sequence data and number of complete genomes over the last 18 years. There is another slide we show that indicates the rise of the number of databases and analysis tools over the years as listed in the annual database issue of NAR. The number has been doubling every 4 years.
Well, there is another slide we can show too, and this shows the growth of the number of abstract entries into PubMed over the last 20 years (from Hunter and Cohen, 2006). Like data and databases, the number of research articles published and indexed just keeps getting larger. This increase in number is both a bane and a boon to researchers. And of course it is not only the number of papers indexed that is growing: the amount of text is growing too (open access, etc.) and is about to grow even more with the signing of the new open access act. Searching, mining and making sense of all this literature is going to be a challenge. It is a challenge now.
There is a recent paper in PLoS Computational Biology entitled “Getting Started in Text Mining” (see ref 1 below). (Btw, hat tip to Coturnix for his great series: New and Exciting in PLoS Community Journals.) The paper looks to the future. In the abstract they mention some present attempts to search the literature (beyond PubMed), including Chilibot to search protein, gene and keyword relationships, Textpresso for species-specific literature retrieval (ref 2 below) and PreBind. There are others I’ll mention in a moment.
Then they go into the current state of affairs for text mining (and searching): many current tools are created by computational biologists, but few by text-mining specialists. They speculate on the reason for this:
In the introduction, we pointed out that all or most of the demonstrably useful biomedical text mining systems have been built not by text mining specialists, but by computational biologists. Why might this be? Although this has not been systematically investigated, we speculate that it is related to cultural differences between the two groups. Text mining specialists are more likely to build systems that are likely to get them published in computational linguistics conferences. Such systems are not domain-dependent, are usable for a wide variety of tasks, and, if fashionable, rely more on statistical approaches than on knowledge sources.
They suggest that a combination of computational biologists and text mining specialists will be optimum, as they state:
Text mining specialists continue to excel at building system components and designing datasets for evaluation; computational biologists currently appear to be much better at producing useful task definitions. Perhaps the most fruitful approaches are characterized by combined efforts that leverage the abilities of each type of scientist.
And then the authors present a pretty decent outline of how to go about creating text mining tools. So, hopefully there are some computational biologists and text mining specialists who will create some of those future literature mining resources.
Got me to thinking though: what resources are out there for the researcher to search and filter the literature now? This question is different from machine text mining of a huge database of literature to computationally “curate” the database. I’m thinking more of a smarter manual search of the literature.
Of course there is PubMed. Yet, though it has some great search tools, there are better searching methods. You could try Faculty 1000 (less a search method than a reviewed and recommended list of articles by experts in various fields, kind of like the Coturnix series times well… a thousand :), and I like it for what it does.
But searching and finding literature? There are several options. Some, like Textpresso above, use ontologies to narrow and refine searching. Others, like XplorMed, use co-occurrence of keywords to refine your search to relevant literature. Still others are ways to be notified of new citations.
It’s beyond the scope of this blog post to review these, but I thought I’d list some I’ve come across and used:
XplorMed: from the about section, “The XplorMed server allows you to explore a set of abstracts derived from a MEDLINE search. The system gives you the main associations between the words in groups of abstracts. Then, you can select a subset of your abstracts based on selected groups of related words and iterate your analysis on them.”
GoPubMed: an ontology-based and semantic search of PubMed.
PubGene: a search tool to find gene/protein pairs that are ‘co-cited’ in the literature.
PubMatrix: “a simple way to rapidly and systematically compare any list of terms against any other list of terms in PubMed. It reports back the frequency of co-occurrence between all pairwise comparisons between the two lists as a matrix table.”
iHOP: “A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function… By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource, bringing all advantages of the internet to scientific literature research.”
HubMed: Less a different search methodology than it is a different, simpler, interface to PubMed.
PubCrawler: A literature notification system.
eTBLAST: A way to find experts and journals based on your keyword search of PubMed (or NASA, or CRISP or …) and a history of citations on your search term.
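Several of the tools above (XplorMed, PubGene, PubMatrix) lean on co-occurrence, so here is a rough idea of what that means in practice. This is only a toy sketch in Python, not how any of those tools is actually implemented: the function name `cooccurrence_matrix`, the term lists and the three "abstracts" are all made up for illustration, and matching is a naive case-insensitive substring test on local strings rather than a PubMed query.

```python
from itertools import product


def cooccurrence_matrix(abstracts, terms_a, terms_b):
    """Count, for every (a, b) pair, the abstracts that mention both terms.

    Hypothetical helper for illustration only: matching is a naive,
    case-insensitive substring test.
    """
    lowered = [text.lower() for text in abstracts]
    matrix = {}
    for a, b in product(terms_a, terms_b):
        matrix[(a, b)] = sum(
            1 for text in lowered if a.lower() in text and b.lower() in text
        )
    return matrix


# Made-up toy "abstracts", not real PubMed records.
abstracts = [
    "BRCA1 mutations are associated with breast cancer risk.",
    "TP53 is frequently mutated in breast cancer and lung cancer.",
    "BRCA1 interacts with TP53 in the DNA damage response.",
]
genes = ["BRCA1", "TP53"]
diseases = ["breast cancer", "lung cancer"]

matrix = cooccurrence_matrix(abstracts, genes, diseases)

# Print one row per gene, one column per disease.
for gene in genes:
    print(gene, [matrix[(gene, disease)] for disease in diseases])
```

A real tool would count against the whole of PubMed and would need to deal with synonyms and ambiguous gene names, which is exactly where ontologies and curated term lists come in.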
There are others, but hopefully that should get you started in taming the literature explosion. And I’m sure, with publications like the ones cited above, there are more to come.
1. Cohen, K.B., Hunter, L. (2008). Getting Started in Text Mining. PLoS Computational Biology, 4(1), e20. DOI: 10.1371/journal.pcbi.0040020
2. Muller, H., Kenny, E.E., Sternberg, P.W. (2004). Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biology, 2(11), e309. DOI: 10.1371/journal.pbio.0020309