Every so often something comes up in your weekly literature search that makes you go: huh. That happened to me today with a paper on text mining. Now, I have used a variety of text-mining tools (Textpresso, iHOP, PubMatrix, XplorMed, etc. are among the ones we have subscription tutorials on), and they have all sorts of strengths and weaknesses. And I'm convinced of their utility for making new connections, finding related literature, examining over-represented terms, etc. Because of gene nomenclature issues, though, they haven't been as effective as I'd like for the kinds of interaction data I'd love to be able to extract from the literature. That's still best done by professional curators, IMHO.
When I saw this paper, though, I thought: yeah, figures and figure legends. There could be some real utility there. And I wondered: do the mining tools I've been using take the figure legends into account? And what about the supplemental materials that are becoming so crucial (and overwhelming) in these "big data" projects? It was one of those realizations that you don't know what you aren't looking at….
So this specific paper took thousands of figures from a variety of publications, and mined them:
According to our pathway definition described in the previous section, we manually checked the 75,350 figures and identified 375 pathway figures to be positive data. Another 11,251 figures other than pathway figures were randomly selected as negative data.
There were a lot of pieces of the regular text-mining strategies (stemming*, decision trees, weighting, etc.), and the details are provided in the paper. Their method's claimed novelty is combining the figure text with the paper body, which gives them improved results for figure information. But for me the issue was just the awareness of 1) the potential value of figures and legends, and 2) the fact that I don't know whether those data are included in the other text-mining tools I'm using.
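To make the weighting idea concrete: here's a toy sketch (emphatically not the paper's actual method, which uses trained classifiers) of what it might look like to weight cue words found in a figure legend more heavily than the same words in the paper body when scoring a figure for pathway relevance. The cue-word list and the weights are invented for illustration:

```python
from collections import Counter

# Hypothetical cue words; a real system learns its features rather than
# using a hand-picked list like this.
PATHWAY_CUES = {"pathway", "signaling", "cascade", "activates", "inhibits"}

def score_figure(legend_text, body_text, legend_weight=2.0, body_weight=1.0):
    """Toy relevance score: cue-word hits in the figure legend count
    more than hits in the surrounding paper body."""
    def hits(text):
        counts = Counter(text.lower().split())
        return sum(counts[cue] for cue in PATHWAY_CUES)
    return legend_weight * hits(legend_text) + body_weight * hits(body_text)

# A legend dense with pathway language outscores a figure whose legend
# says nothing pathway-like, even if the body mentions pathways.
print(score_figure("the mapk signaling cascade pathway", "this pathway activates growth"))
```

The point of the sketch is just the intuition from the paper: the legend may carry pathway information stated nowhere else, so treating it as ordinary body text throws signal away.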
Like all paper components, the quality and depth of figure legends vary, of course. But it did strike me that especially for pathway data people might assume the figure conveys a lot of useful information that might not be explicitly stated in the body of the paper.
As far as I can tell there’s no web interface around this. One link in the paper that was supposed to have some more info is currently 403, so I’ve written to the team. Their introduction also led me to a different tool called FigSearch that sounded like a web interface for a similar type of analysis, but that doesn’t seem to be available any more. Such is the world of software….sigh.
But still: I like it when a paper gives me a realization that I need to think about what I’m not seeing when I’m using software. It’s an easy thing to forget.
Ishii, N., Koike, A., Yamamoto, Y., & Takagi, T. (2010). Figure classification in biomedical literature to elucidate disease mechanisms, based on pathways. Artificial Intelligence in Medicine, 49(3), 135-143. DOI: 10.1016/j.artmed.2010.04.005
*The stemming example cracked me up. It appeared to be partially LOLcat: “This algorithm removes suffixes from words and leaves the stem (e.g., pathway or pathways becomes pathwai).”
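For the curious, "pathwai" is what the original Porter stemming rules produce: strip the plural "s", then turn a terminal "y" into "i" when the stem contains a vowel. A greatly simplified toy sketch of just those two steps (not the paper's code, and nowhere near the full Porter algorithm):

```python
def toy_stem(word):
    """Toy sketch of two Porter-style steps; real stemmers have many more rules."""
    word = word.lower()
    # Step 1a (simplified): strip plural endings ("sses"->"ss", "ies"->"i", "s"->"")
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Step 1c (original Porter): terminal "y" -> "i" if the stem contains a vowel
    if word.endswith("y") and any(c in "aeiou" for c in word[:-1]):
        word = word[:-1] + "i"
    return word

print(toy_stem("pathways"))  # both "pathway" and "pathways" collapse to "pathwai"
```

Full implementations (e.g., NLTK's PorterStemmer) apply several more suffix-stripping steps; this fragment only reproduces the bit that makes "pathway" and "pathways" land on the same LOLcat-flavored stem.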