Navigating the literature

We have a slide we like to present at some trainings showing the rise in the amount of raw sequence data and number of complete genomes over the last 18 years. There is another slide we show that indicates the rise in the number of databases and analysis tools over the years, as listed in the annual database issue of NAR. The number has been doubling every 4 years.

Well, there is another slide we can show too, and this shows the growth in the number of abstract entries into PubMed over the last 20 years (from Hunter and Cohen, 2006). Like data and databases, the number of research articles published and indexed just keeps getting larger. This increase in number is both a bane and a boon to researchers. And of course it is not only the number of papers indexed that is growing; the amount of text is growing too (open access, etc.) and is about to grow even more with the signing of the new open access act. Searching, mining and making sense of all this literature is going to be a challenge. It is a challenge now.

There is a recent paper in PLoS Computational Biology entitled “Getting Started in Text Mining” (see ref 1 below). (By the way, hat tip to Coturnix for his great series, New and Exciting in PLoS Community Journals.) The paper looks to the future. In the abstract they mention some present attempts to search the literature (beyond PubMed), including Chilibot for searching protein, gene and keyword relationships, Textpresso for species-specific literature retrieval (ref 2 below), and PreBind. There are others I’ll mention in a moment.

Then they go into the current state of affairs for text mining (and searching): many current tools are created by computational biologists, but few by text-mining specialists. The reason for this, they speculate:

In the introduction, we pointed out that all or most of the demonstrably useful biomedical text mining systems have been built not by text mining specialists, but by computational biologists. Why might this be? Although this has not been systematically investigated, we speculate that it is related to cultural differences between the two groups. Text mining specialists are more likely to build systems that are likely to get them published in computational linguistics conferences. Such systems are not domain-dependent, are usable for a wide variety of tasks, and, if fashionable, rely more on statistical approaches than on knowledge sources.

They suggest that a combination of computational biologists and text mining specialists will be optimum, as they state:

Text mining specialists continue to excel at building system components and designing datasets for evaluation; computational biologists currently appear to be much better at producing useful task definitions. Perhaps the most fruitful approaches are characterized by combined efforts that leverage the abilities of each type of scientist.

And then the authors present a pretty decent outline on how to go about creating text mining tools. So, hopefully there are some computational biologists and text mining specialists that will create some of those future literature mining resources.

That got me thinking, though: what resources are out there for the researcher to search and filter the literature now? This question is different from machine text mining of a huge database of literature to computationally “curate” the database. I’m thinking more of a smarter manual search of the literature.

Of course there is PubMed. Yet, though it has some great search tools, there are better searching methods. You could try Faculty 1000 (less a search method than a reviewed and recommended list of articles by experts in various fields, kind of like the Coturnix series times well… a thousand :), and I like it for what it does.

But searching and finding literature? There are several options. Some, like Textpresso above, use ontologies to narrow and refine searching. Others, like XplorMed, use co-occurrence of keywords to refine your search to relevant literature. Still others are ways to be notified of new citations.

It’s beyond the scope of this blog post to review these, but I thought I’d list some I’ve come across and used:

XplorMed: from the about section, “The XplorMed server allows you to explore a set of abstracts derived from a MEDLINE search. The system gives you the main associations between the words in groups of abstracts. Then, you can select a subset of your abstracts based on selected groups of related words and iterate your analysis on them.”

GoPubMed: an ontology-based and semantic search of PubMed.

PubGene: a search tool to find gene/protein pairs that are ‘co-cited’ in the literature.

PubMatrix: “a simple way to rapidly and systematically compare any list of terms against any other list of terms in PubMed. It reports back the frequency of co-occurrence between all pairwise comparisons between the two lists as a matrix table.”

iHOP: “A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function… By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource, bringing all advantages of the internet to scientific literature research.”

My NCBI: A literature notification system, and much, much more.

HubMed: Less a different search methodology than it is a different, simpler, interface to PubMed.

PubCrawler: A literature notification system.

eTBlast: A way to find experts and journals based on your keyword search of PubMed (or NASA, or CRISP or …), along with a history of citations on your search term.
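Several of the tools above (XplorMed, PubGene, PubMatrix) lean on the same basic idea: count how often two terms show up together in the same abstract. Here is a minimal Python sketch of that idea, in the spirit of the PubMatrix description. To be clear, this is not how any of these tools is actually implemented. The function name `cooccurrence_matrix` is made up for illustration, the abstracts are invented toy sentences rather than real PubMed records, and the matching is naive case-insensitive substring matching.

```python
from itertools import product


def cooccurrence_matrix(abstracts, terms_a, terms_b):
    """Count, for every (a, b) pair, the abstracts that mention both terms.

    Hypothetical helper for illustration only: matching is a naive,
    case-insensitive substring test, not real entity recognition.
    """
    lowered = [text.lower() for text in abstracts]
    matrix = {}
    for a, b in product(terms_a, terms_b):
        matrix[(a, b)] = sum(
            1 for text in lowered if a.lower() in text and b.lower() in text
        )
    return matrix


# Toy abstracts invented for this example; not real PubMed records.
abstracts = [
    "BRCA1 mutations are associated with breast cancer risk.",
    "TP53 is frequently mutated in breast cancer and lung cancer.",
    "BRCA1 interacts with TP53 in the DNA damage response.",
]
genes = ["BRCA1", "TP53"]
topics = ["breast cancer", "lung cancer"]

counts = cooccurrence_matrix(abstracts, genes, topics)

# Print one row per gene, one column per topic.
for gene in genes:
    print(gene, [counts[(gene, topic)] for topic in topics])
```

A real tool would run each pairwise query against PubMed itself and would need to handle synonyms and ambiguous gene names, which is exactly where the ontologies and curated term lists mentioned above come in.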

There are others, but hopefully that should get you started in taming the literature explosion. And I’m sure, with publications like the one cited above, there are more to come.

1. Cohen, K.B., Hunter, L. (2008). Getting Started in Text Mining. PLoS Computational Biology, 4(1), e20. DOI: 10.1371/journal.pcbi.0040020

2. Muller, H., Kenny, E.E., Sternberg, P.W. (2004). Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biology, 2(11), e309. DOI: 10.1371/journal.pbio.0020309

2 thoughts on “Navigating the literature”

  1. Trey

    Thanks bt27uk,

    It definitely is an interesting article. They actually used eTBlast, listed above and have a database of the duplicates found at Dejavu.

    And though it does seem to point to the use of text mining as a useful tool, I think it also points out one of the caveats you’d have to consider. Lots of the same-author duplicates have nearly identical abstracts and titles. I wonder why an author would find that advantageous. Surely anyone looking at the CV would notice identical titles (and abstracts), making republication for padding purposes kind of useless. The first duplicate I checked was like this. The first publication was in a peer-reviewed original research journal; the second was a chapter in an annual book which allowed republications. I saw several like this.

    It does seem there are a lot of republished works, but it’s going to take algorithm refinement and/or manual checking to get rid of instances like this. Anyway, it’s a very interesting use of text-mining software.
