Mining figure legends. Huh.

Every so often something comes up in your weekly literature search that makes you go: huh. That happened to me today with a paper on text mining. Now, I have used a variety of text-mining tools (Textpresso, iHOP, PubMatrix, XplorMed, etc. are among the ones we have subscription tutorials on) and they have all sorts of strengths and weaknesses. And I’m convinced of their utility for making new connections, finding related literature, examining over-represented terms, etc. Because of gene nomenclature issues, though, they haven’t always been as effective as I’d like for the different sorts of interaction data that I’d love to be able to extract from the literature. That’s still best done by professional curators, IMHO.

When I saw this paper, though, I thought: yeah, figures and figure legends. There could be some real utility there. And I wondered whether the mining tools I’ve been using take the figure legends into account. It also led me to wonder about the supplemental materials that are becoming so crucial (and overwhelming) from these “big data” projects. It was one of those realizations that you don’t know what you aren’t looking at….

So this specific paper took thousands of figures from a variety of publications, and mined them:

According to our pathway definition described in the previous
section, we manually checked the 75,350 figures and identified 375
pathway figures to be positive data. Another 11,251 figures other
than pathway figures were randomly selected as negative data.

There were a lot of pieces of the regular text mining strategies (stemming*, decision trees, weighting, etc.). The details of this are provided. And their method is supposedly novel in combining figure text and the paper body, which gives them improved results for figure information. But for me the issue was just the awareness of 1) the potential value of figures and legends, and 2) the fact that in the other text mining tools I’m using I don’t know if those data are in there.

Like all paper components, the quality and depth of figure legends vary, of course.  But it did strike me that especially for pathway data people might assume the figure conveys a lot of useful information that might not be explicitly stated in the body of the paper.

As far as I can tell there’s no web interface around this. One link in the paper that was supposed to have some more info is currently 403, so I’ve written to the team. Their introduction also led me to a different tool called FigSearch that sounded like a web interface for a similar type of analysis, but that doesn’t seem to be available any more. Such is the world of software….sigh.

But still: I like it when a paper gives me a realization that I need to think about what I’m not seeing when I’m using software.  It’s an easy thing to forget.


Ishii, N., Koike, A., Yamamoto, Y., & Takagi, T. (2010). Figure classification in biomedical literature to elucidate disease mechanisms, based on pathways. Artificial Intelligence in Medicine, 49(3), 135-143. DOI: 10.1016/j.artmed.2010.04.005


*The stemming example cracked me up. It appeared to be partially LOLcat: “This algorithm removes suffixes from words and leaves the stem (e.g., pathway or pathways becomes pathwai).”
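For the curious, “pathwai” is what you would expect from a Porter-style stemmer, although the quoted sentence doesn’t name the algorithm, so that is my assumption. Here is a minimal sketch of just the two rules that matter for this word: drop a plural “s”, then turn a final “y” into “i” when the rest of the word contains a vowel. The function names are mine, and a real stemmer has many more steps than this.

```python
def _is_consonant(word: str, i: int) -> bool:
    """Porter-style consonant test: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not _is_consonant(word, i - 1)
    return True


def _has_vowel(stem: str) -> bool:
    """True if any position in the stem is a vowel."""
    return any(not _is_consonant(stem, i) for i in range(len(stem)))


def strip_plural(word: str) -> str:
    """Plural rules: sses -> ss, ies -> i, ss unchanged, otherwise drop a final s."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def y_to_i(word: str) -> str:
    """Replace a final 'y' with 'i' when the preceding stem contains a vowel."""
    if word.endswith("y") and _has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


def partial_stem(word: str) -> str:
    """Apply only the two rules above; NOT a complete stemmer."""
    return y_to_i(strip_plural(word.lower()))


for w in ("pathway", "pathways"):
    print(w, "->", partial_stem(w))
```

So both forms collapse to the same token, which is the whole point: the stem only has to be consistent, not a real word.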

8 thoughts on “Mining figure legends. Huh.”

  1. Mary Post author

    Very cool, thanks Casey!

    I can’t get Figureome to work. And Image Finder isn’t loading for me right now either. But I’ll definitely check out the Figure Search.

    I still want to know if the other non-figure-centric tools cover the figures. And if supplemental figures get into these.

  2. Mary Post author

    Just tested Figure Search. I tried to access the “about” and “help” but those are unavailable, so I will have to dig a bit more to understand what to expect from the collection. These are mostly notes to myself so I can remember what I tried. I’d like to try it in other ones too.

    I took an article that was not recent, so it should have had time to get in, and one that I knew would have significant supplemental data (there’s a 100+ page supplement, in fact).

    I searched the images with some items from Figure 7 of the regular paper: (H3K4me2 H3K27me2). I also separately tried Gencode (from Figure 4), but didn’t pull up this paper with either search.

    The supplement has the word QQplot in the first supplemental figure (text = Supplementary Figure 1: QQplot demonstrating the approximate gaussianity of the overlap statistics from simulation)

    I couldn’t get this paper back with QQplot as a search term either.

    But I’m not sure which part isn’t working for me. Note to self: ENCODE is a bad search term….

    Hmmm….I’m going to try these in the other tools later and see what I can determine.

  3. Pingback: New NCBI Image Database | The OpenHelix Blog

  4. Mary Post author

    Hi Casey–

    Well, that’s interesting. Selling images. Well, many are apparently free (289,100 of 2 million).

    Thanks for catching that.
