Tag Archives: text mining


Video Tip of the Week: ContentMine, with a #Zika example

For quite a while I’ve been watching the development of ContentMine. There have been a number of different ways to text-mine the scientific literature over the years. Most of the efforts I’m familiar with aim at a specific subset of the literature. This could be species-specific mining, topic-specific mining (such as interaction data, or a field like cancer or virology), extraction of gene-related tidbits, and so on. Sometimes the tools have been limited to publicly available abstracts, which misses much of the knowledge that’s embedded in the actual papers and, lately, in the extraordinary “supplemental” sections–which are making me crazy because much of the key information I need on software tools is buried deep within those. But the philosophy of ContentMine is to go big across the entire realm of scientific publication–as they describe in their “about” page:

To make this a reality we are building software and training resources so that together we can liberate 100,000,000 facts from the scientific literature.

And they want to make all of this available to you, so you can pull out the subset that’s useful to your research. You can learn about their philosophy and strategies from this video, as well as some of the specific tasks that they have been working on to get to the point where people could use their resources and tools to extract information.

TheContentMine II from Peter Murray-Rust on Vimeo.

One of the things that always worried me about mining was how much of the information in images and tables and supplements wasn’t available. But they are also tackling this, as the video explains.

The reason this floated to the top of my “blog drafts” list, though, was because of this great and current example of using their resources for an emerging public health issue. They’ve got a sample video of accessing information related to the Zika virus that they’ve just released. I think it’s a nice concrete demonstration of how ContentMine can be quickly deployed on a topic to pull out relevant research details.

So have a look at their project. Specific tools have also been written up in the papers linked below. And there are more videos in their YouTube and Vimeo collections that can help you to learn more. Some are longer, and some are more specific for a task. There’s a lot more information at their site as well. They are eager to help people get the most out of the literature. You should have a look and see how it can help you–and maybe how you can help them.

Quick links:

ContentMine site: http://contentmine.org/

YouTube channel: https://www.youtube.com/channel/UCM1gxtWZOJeDK7KL7MAZWGA

Vimeo videos: https://vimeo.com/petermr

Follow them on twitter: https://twitter.com/TheContentMine


Smith-Unna, R., & Murray-Rust, P. (2014). The ContentMine Scraping Stack: Literature-scale Content Mining with Community-maintained Collections of Declarative Scrapers D-Lib Magazine, 20 (11/12) DOI: 10.1045/november14-smith-unna

Murray-Rust, P., Smith-Unna, R., & Mounce, R. (2014). AMI-diagram: Mining Facts from Images D-Lib Magazine, 20 (11/12) DOI: 10.1045/november14-murray-rust

Video Tip of the Week: Nowomics, set up alert feeds for new data

Yeah, I know you know. There’s a lot of genomics and proteomics data coming out every day–some of it through the traditional publication route, and some of it not–and it’s only getting harder and harder to wrangle the useful information and pull the signal from the noise. I can remember when merely looking through the (er, paper-based) table of contents of Cell and Nature would get me up to speed for a week. But increasingly, the data I need isn’t even coming through the papers.

Like everyone else, I have a variety of strategies to keep notified of different things I need to see. I use the MyNCBI stored searches to keep me posted on things that come via the NCBI system. I signed up for the new OMIM “MIM-Match” service as well. But there’s still a lot of room for new ways to collect and filter new data and information. Today’s tip focuses on a service to do that: Nowomics. This is a freely available tool to help you keep track of important new data. Here’s a quick video overview of how to see what’s going on with Nowomics.

The goal of Nowomics is to offer you an actively updated feed of relevant information on genes or topics of interest, using text mining and ontology term harvesting from a range of sources. What makes them different from MyNCBI or OMIM is the range and types of data sources they use. The user sets up some genes or Gene Ontology terms to “follow”, and the software regularly checks for changes in the source sites. You can go in and look at your feed, you can filter it for different types of data, and you can see what’s new (“latest”) or what’s being hotly chattered about (“popular”) using Altmetric strategies. For example, here’s a paper that people seemed to find worth talking about, based on the tweets and the Mendeley occurrences.

This tool is in early stages of development–if there are features you’d like to see or other sources you think would be useful, the Nowomics team is eager for feedback. You can find a link to contact them over at their site, or locate them on Facebook and Twitter. You can also learn more from their blog, and more about the philosophy and foundations of Nowomics from their slide presentation below.


Quick links:

Nowomics: http://nowomics.com/

Example gene feed: http://nowomics.com/gene/human/BRCA2


Acland A., T. Barrett, J. Beck, D. A. Benson, C. Bollin, E. Bolton, S. H. Bryant, K. Canese, D. M. Church, K. Clark, et al. (2014). Database resources of the National Center for Biotechnology Information, Nucleic Acids Research, 42 (D1) D7-D17. DOI: http://dx.doi.org/10.1093/nar/gkt1146

Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), July 22 2014. World Wide Web URL: http://omim.org/

Video Tip of the Week: “Overview” for document sorting and mining

This tip isn’t bioinformatics per se–but it’s a tool that I recently found very quick and handy to prioritize a giant pile of literature that I had in my lap. I’ve been participating in a curation project in which all the papers have to get in to a database–but because the data extraction process is uneven I wanted to prioritize some groups in a meaningful (but quick) way. I needed rapid and bespoke text-mining.

“Overview” will do that for you. You can take a giant pile of documents–in my case PDFs–and ask it to quickly sort them into subsets based on words of interest to you. It’s pretty flexible–you can ask it for new sorting or tagging words on the fly. You can also tag the subsets with handy reminders, or other categorizations that you need.
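To make the "sort into subsets based on words of interest" idea concrete, here is a minimal sketch in Python. It is not how Overview works internally, just a rough stand-in for the tagging step; the function name `tag_documents`, the tag names, and the sample documents are all made up for illustration.

```python
import re

def tag_documents(docs, tag_words):
    """Group document names under each tag whose words appear in the text.

    docs      -- dict mapping a document name to its plain text
    tag_words -- dict mapping a tag (hypothetical labels) to a list of words
    A document can land under several tags, or under none.
    """
    tagged = {tag: [] for tag in tag_words}
    for name, text in docs.items():
        # Lowercased word set, so matching is case-insensitive and whole-word.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        for tag, words in tag_words.items():
            if tokens & {w.lower() for w in words}:
                tagged[tag].append(name)
    return tagged

# Hypothetical mini-collection standing in for a folder of extracted PDF text.
docs = {
    "paper1": "Expression of the gene in mouse liver tissue",
    "paper2": "A zebrafish model of heart regeneration",
    "paper3": "Comparing mouse and zebrafish orthologs",
}
tags = {"mouse": ["mouse", "murine"], "fish": ["zebrafish"]}
print(tag_documents(docs, tags))
```

Adding a new sorting word "on the fly" is then just another entry in `tags` and a re-run, which is roughly the workflow that made the tool so handy for me.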

Certainly there may be more text-mining you want to do with your literature after–but for a quick sort, and potential way to do discovery on some word combinations–this is a really handy way to explore. And of course it’s not limited to PDFs. You could do a batch of tweets from a conference. You could sort emails. You could sort NSA- or WikiLeaks-style document dumps–should you be so inclined.

Hat tip to Donna Murdoch on Google+ for the lead.  It was described at the link she found to Robin Good’s Content Curation World as a terrific tool for journalists–it’s definitely a broad tool. (The project lead, Jonathan Stray, teaches “computational journalism”. I didn’t know that was a thing, but I like it.)

Overview is a new free tool designed for investigative journalists and researchers interested in finding relevant information within large collections of text documents, from reports to social media tweets.

Overview greatly simplifies the task of analyzing, indexing and visualising large document collections in ways that can allow a journalist to identify relevant patterns and threads across thousands of different documents.

I’ll let their video describe how it works–I found it was really simple and effective on a huge folder of papers I had. I could sort them by species, and then by other useful terms, and more, really quickly once everything was loaded.

Reading through thousands of documents quickly with Overview from Jonathan Stray on Vimeo.

I like the intuitive folder flow. I like the color coding. I found the tagging really handy. There’s another video I found helpful to get started with my documents: Learn Overview in 90 seconds. I had to look up a couple of other things, but I found everything I needed to get working with the data set very quickly at their site.

Their site: Overviewproject.org and you can use it online. Or you can download the code from Github and set up your own.

Video Tip of the Week: eGIFT, extracting gene information from text

eGIFT, as the tag line says, is a tool to extract gene information from text. It’s a tool that allows you to search for and explore terms and documents related to a gene or set of genes. There are many ways to search and explore eGIFT: find genes given a specific term, find terms related to a set of genes, and more. How does the tool do this? You can check out the user guide to find out more, but here is a brief summary from the site:

We look at PubMed references (titles and abstracts), gather those references which focus on the given gene, and automatically identify terms which are statistically more likely to be relevant to this gene than to genes in general. In order to understand the relationship between a specific iTerm and the given gene, we allow the users to see all sentences mentioning the iTerm, as well as the abstracts from which these sentences were extracted.

To learn more about how this tool was put together and the calculations involved, you can check out the BMC Bioinformatics publication about it from 2010, eGIFT: Mining Gene Information from the Literature.
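The core idea in that summary, terms that show up more often in abstracts about one gene than in abstracts about genes in general, can be sketched in a few lines of Python. To be clear, this is not eGIFT's actual scoring (see the paper for that). It is a toy ratio of document frequencies with add-one smoothing, and the function names, the `min_docs` parameter, and the sample abstracts are my own assumptions.

```python
import re
from collections import Counter

def doc_freq(abstracts):
    """Count, for each term, how many abstracts mention it at least once."""
    counts = Counter()
    for text in abstracts:
        counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return counts

def enriched_terms(gene_abstracts, background_abstracts, min_docs=2):
    """Rank terms by (rate in gene abstracts) / (smoothed rate in background)."""
    gene_df = doc_freq(gene_abstracts)
    bg_df = doc_freq(background_abstracts)
    n_gene, n_bg = len(gene_abstracts), len(background_abstracts)
    scores = {}
    for term, k in gene_df.items():
        if k < min_docs:  # ignore terms seen in too few gene abstracts
            continue
        gene_rate = k / n_gene
        bg_rate = (bg_df[term] + 1) / (n_bg + 1)  # add-one smoothing
        scores[term] = gene_rate / bg_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up abstracts, purely for illustration.
gene_abstracts = [
    "BRCA2 mediates DNA repair",
    "DNA repair requires BRCA2",
    "the gene regulates repair",
]
background_abstracts = [
    "the gene is expressed",
    "DNA sequence of the gene",
    "the protein binds DNA",
    "the cell cycle",
]
for term, score in enriched_terms(gene_abstracts, background_abstracts):
    print(term, round(score, 2))
```

A common word like "DNA" scores low because it is just as frequent in the background set, which is the intuition behind picking terms that are specific to the gene rather than merely frequent.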

But, for today, take a tour of the site and some of the things you can do in today’s Tip of the Week.

Relevant Links:
PubMed (tutorial)
XplorMed (tutorial)
Literature & Text Mining Resource Tutorials

Tudor, C., Schmidt, C., & Vijay-Shanker, K. (2010). eGIFT: Mining Gene Information from the Literature BMC Bioinformatics, 11 (1) DOI: 10.1186/1471-2105-11-418

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

  • Wow: the little skate genome. 49 chromosomes, 59x coverage. Still tough to assemble: RT @Biomedical101: Assembling the Little Skate Genome: This past week I was visiting the University of Delaware to attend the 3rd S… http://bit.ly/kxC8BK [Mary]
  • RT @biogrid: BioGRID version 3.1.77 released with 2,890 new physical and genetic interactions. http://bit.ly/lsTeFY #bioinformatics #biology #biogrid [Mary]
  • Initiative to get scientists into schools to help teachers and improve science education, explained here: Those who can [Jennifer]
  • Does look cool: RT @hurstej: Awesome! I love LigerCat – this is a cool resource for genomic literature visualized from Medline http://bit.ly/flI6nF #bmispring2011 [Mary]
  • Another effort to connect scientists to the public: The Science & Entertainment Exchange is “a program of the National Academy of Sciences that connects entertainment industry professionals with top scientists and engineers” HT book review by Kevin Hand [Jennifer]
  • Iz handy: RT @larry_parnell: Just what do all those histone modifications mean? See http://bit.ly/jP7UpK for handy chart listing function and modifying enzyme [Mary]
  • RT @paulblaser: What is Public Health Genomics? A Day in the Invisible Life of Public Health #Genomics – http://goo.gl/izoX6 [Mary]

Mining figure legends. Huh.

Every so often something comes up in your weekly literature search that makes you go: huh. That happened to me today with a paper on text mining. Now, I have used a variety of text-mining tools (Textpresso, iHOP, PubMatrix, XplorMed, etc. are among the ones we have subscription tutorials on) and they have all sorts of strengths and weaknesses. And I’m convinced of their utility for making new connections, finding related literature, examining over-represented terms, etc. Because of gene nomenclature issues they haven’t always been quite as effective as I’d like for the different sorts of interaction data that I’d love to be able to extract from the literature. That’s still best done by professional curators, IMHO.

When I saw this paper, though, I thought–yeah, figures and figure legends. There could be some real utility there. And I wondered if the mining tools I’ve been using take the figure legends into account? And then it also led me to wonder about the supplemental materials that are becoming so crucial (and overwhelming) from these “big data” projects?  It was one of those realizations that you don’t know what you aren’t looking at….

So this specific paper took thousands of figures from a variety of publications, and mined them:

According to our pathway definition described in the previous section, we manually checked the 75,350 figures and identified 375 pathway figures to be positive data. Another 11,251 figures other than pathway figures were randomly selected as negative data.
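Those numbers are worth a second look, because the classes are very lopsided. Here is a quick back-of-the-envelope sketch (mine, not the authors'; the only inputs are the counts quoted above, and the variable names are my own). Pathway figures are roughly half a percent of everything they checked, and a do-nothing classifier that labels every training figure "not a pathway" would still be right about 97% of the time, so plain accuracy would be a misleading yardstick here.

```python
# Counts taken from the quoted passage.
total_figures = 75_350
pathway_figures = 375      # positive data
negative_sample = 11_251   # randomly selected non-pathway figures

# Share of pathway figures among everything that was manually checked.
share_of_all = pathway_figures / total_figures

# Share of pathway figures in the positive + negative set described above.
training_size = pathway_figures + negative_sample
share_of_training = pathway_figures / training_size

# Accuracy of a trivial classifier that always answers "not a pathway".
always_negative_accuracy = negative_sample / training_size

print(f"pathway share of all figures:   {share_of_all:.2%}")
print(f"pathway share of training set:  {share_of_training:.2%}")
print(f"always-negative accuracy:       {always_negative_accuracy:.2%}")
```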

There were a lot of pieces of the regular text mining strategies (stemming*, decision trees, weighting, etc.). The details of this are provided. And their method is reportedly novel in combining figure text and the paper body–which gives them improved results for figure information. But for me the issue was just the awareness of 1) the potential value of figures and legends, and 2) the fact that in other text mining tools I’m using I don’t know if those data are in there.

Like all paper components, the quality and depth of figure legends vary, of course.  But it did strike me that especially for pathway data people might assume the figure conveys a lot of useful information that might not be explicitly stated in the body of the paper.

As far as I can tell there’s no web interface around this. One link in the paper that was supposed to have some more info is currently 403, so I’ve written to the team. Their introduction also led me to a different tool called FigSearch that sounded like a web interface for a similar type of analysis, but that doesn’t seem to be available any more. Such is the world of software….sigh.

But still: I like it when a paper gives me a realization that I need to think about what I’m not seeing when I’m using software.  It’s an easy thing to forget.


Ishii, N., Koike, A., Yamamoto, Y., & Takagi, T. (2010). Figure classification in biomedical literature to elucidate disease mechanisms, based on pathways Artificial Intelligence in Medicine, 49 (3), 135-143 DOI: 10.1016/j.artmed.2010.04.005


*The stemming example cracked me up. It appeared to be partially LOLcat: “This algorithm removes suffixes from words and leaves the stem (e.g., pathway or pathways becomes pathwai).”
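That "pathwai" is not a typo, by the way: it is what the Porter stemming rules produce. The paper doesn't spell out the steps, so here is my own minimal sketch in Python of just the two Porter rules involved (step 1a, which strips plural endings, and step 1c, which turns a final y into i when the rest of the word contains a vowel). It skips every other step of the real algorithm, so treat it as an illustration and not a usable stemmer; the function names are my own.

```python
def _is_vowel(word, i):
    """a, e, i, o, u are vowels; y counts as one only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return True
    return ch == "y" and i > 0 and not _is_vowel(word, i - 1)

def _contains_vowel(stem):
    return any(_is_vowel(stem, i) for i in range(len(stem)))

def step_1a(word):
    """Porter step 1a: sses -> ss, ies -> i, ss -> ss, s -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def step_1c(word):
    """Porter step 1c: final y -> i if the preceding stem contains a vowel."""
    if word.endswith("y") and _contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

def partial_porter_stem(word):
    """Apply only steps 1a and 1c; the full algorithm has several more."""
    return step_1c(step_1a(word.lower()))

for w in ["pathway", "pathways", "sky"]:
    print(w, "->", partial_porter_stem(w))
```

The point of the odd-looking stem is that "pathway" and "pathways" collapse to the same token, which is all a text miner needs; nobody is supposed to read the stems.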

Tip of the Week: GRAIL for prioritizing SNPs

Perusing my copy of Nature Genetics last week, I was flipping through the pages and noticed an unusual graphic.  I looked at it a little closer and was convinced it was one of the Spirographs that I used to make as a kid.  (Remember those? I always liked that….)  I looked a little bit closer and realized it was somewhat more informative than the Spirographs I used to draw.  This represented the relationships between genes, based on the literature.  Hmmm….how did they do this, exactly?

The paper I was reading was Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk by Raychaudhuri et al, which was interesting enough.  I like to read the GWAS papers to see what the current techniques and strategies are, not only for the specific genes themselves.  And this paper reported the strategy that they used to prioritize their SNPs, and that they used GRAIL to generate the data for this graphic of gene relationships.  Check out Figure 1 for the strategy.

When I saw the name GRAIL I thought–huh….GRAIL is back with a new use?  I thought that was…ah…retired…at this point.  But this isn’t that GRAIL (http://compbio.ornl.gov/Grail-1.3/, Gene Recognition and Assembly Internet Link).  This is a different GRAIL–the new one is Gene Relationships Among Implicated Loci. So I had to go and read that paper, which is  Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions by Raychaudhuri et al.

This new GRAIL is all about text mining.  It is a tool that relies on statistical text mining of the literature for genes in a region and examines the relationships among those genes in the text.  The focus in their case is disease regions, but there’s no reason that you couldn’t use it for a variety of other topics.   As the authors state:

Given only a collection of disease regions, GRAIL uses our text-based definition of relatedness (or alternative metrics of relatedness) to identify a subset of genes, more highly related than by chance; it also assigns a select set of keywords that suggest putative biological pathways.

So you pull a set of genes out of the literature based on SNPs or locations of interest, and you can begin to assess what’s interesting in the set.   Now, the tool makes a lot of assumptions that you should be aware of if you are going to use it.  It assumes each region contains a single pathogenic gene.  I’m not sure that’s always going to be the case, but for this tool as long as you know that, that’s a fair assumption.  They suggest this helps to keep multigenic regions from dominating the analysis.  Fair enough, but…what if that is the interesting aspect?  Still–that’s ok as long as you know.
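To get a feel for what "text-based relatedness" with one gene per region might look like, here is a toy sketch in Python. This is emphatically not GRAIL's statistical framework (read the paper for that). I am simply swapping in cosine similarity between word-count vectors, and picking in each region the gene that is most similar to genes in the other regions. The gene names, the text snippets, and the function names are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_gene_per_region(regions, gene_text):
    """Pick one gene per region: the one most related to the other regions.

    regions   -- dict mapping a region name to a list of gene names
    gene_text -- dict mapping a gene name to text about it (e.g. abstracts)
    """
    vectors = {g: Counter(t.lower().split()) for g, t in gene_text.items()}

    def score(gene, region):
        # Sum, over every other region, the best similarity to any gene there.
        total = 0.0
        for other, other_genes in regions.items():
            if other != region:
                total += max(
                    (cosine(vectors[gene], vectors[o]) for o in other_genes),
                    default=0.0,
                )
        return total

    return {
        region: max(genes, key=lambda g: score(g, region))
        for region, genes in regions.items()
    }

# Hypothetical regions and gene descriptions, for illustration only.
regions = {"region1": ["GENE_A", "GENE_B"], "region2": ["GENE_C", "GENE_D"]}
gene_text = {
    "GENE_A": "cholesterol synthesis lipid",
    "GENE_B": "neuron axon",
    "GENE_C": "lipid cholesterol transport",
    "GENE_D": "keratin hair",
}
print(best_gene_per_region(regions, gene_text))
```

Note how the single-gene assumption is baked in: each region contributes at most its best match, so a region crowded with genes gets no more say than a region with one.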

In the paper they use validated SNPs from 4 different research areas:

  • SNPs associated with serum lipid levels: GRAIL finds genes in the cholesterol biosynthesis pathway.
  • SNPs associated with height; they identify pathways they consider plausible.
  • Crohn’s disease; they confirm associations that have been seen.
  • Schizophrenia–and here they used rare deletions as the items of interest; they find related genes, many highly enriched in the CNS. So this suggests the strategy may be useful not only for SNPs but for CNVs as well.

Their Figure 1 nicely summarizes the strategy:


One curious tweak of the data analysis was that they used the literature prior to December 2006, because right after that there was an onslaught of GWAS papers that would list a whole bunch of genes associated with regions that might be more tenuous still.  I understand this in theory, but I imagine it also eliminates more current research on genes of interest from other methods too.  I saw in the tool you could choose either pre-Dec 06 or a more up-to-date literature set.  It would be useful to try both if you use GRAIL and keep that in mind.

Another point to keep in mind: some genes are just not found in the abstracts, and they mention that is an issue.   So the set you can examine are those that were in the abstracts, and were identified properly with nomenclature, spelling, etc.  Text mining is cool, but has a lot of limitations around those aspects, and the use of synonyms too in general. It’s not just an issue for GRAIL, but for all text mining tools at this point.

They also devise a way to use Gene Ontology (GO) and some expression data in GRAIL as other “relatedness” metrics.  You’ll find those available from the GRAIL tool as well.

They don’t show any spirographs in their figures in this first GRAIL paper.  That one that drew me in was Figure 2 in the arthritis paper.  So I went over to the software to try to generate these myself.  The outcome at this point is a web page with text and links to UCSC Genome Browser, and Entrez Gene (from the individual genes and from the keyword list–keywords collect multiple Entrez Genes).  I was a little surprised that the keyword link wasn’t to PubMed as well.  Currently it doesn’t provide the graphic, but maybe that will come along over time.  If it does I’ll be sure to mention it on the blog.

One final note on the paper: in the supplemental section they compare GRAIL to other tools in this arena.  If you are interested in tools like we are here you may find some of them interesting as well.   The tools are listed with URLs in Table S5, and the comparison outcome is in Text S1:

Prioritizer [2], Gene2Disease (G2D) [3,4,5], Commonality of Functional Annotation (CFA) [6], and Prospectr [7]. There were five supervised tools: Endeavour [8], GeneSeeker [9], SUSPECTS [10], TOM [11], and CANDID [12]

So check out GRAIL and see if you find gene relationships.  But don’t forget those caveats about the genes not listed in the abstracts, or the literature coverage dates.  The software can be found here:  http://www.broad.mit.edu/mpg/grail/

I know it’s a beta.  But I think it has a lot of potential to help people sift through the results they are getting from a variety of techniques.  Check it out.

NOTE: you may find periods that you can’t run GRAIL because it puts a burden on the servers.  You should try again during off hours if you are seeing problems with getting it to run. This happened to me during my testing of it last week.

The list of GWAS data I used to test GRAIL came from the NHGRI catalog, which we discussed here:  List of GWAS studies.  I tried the straight hair SNP list, and got a pretty interesting set of results that certainly included “epidermis” and “skin” as keywords, among other things.

++++++++++++ Citations ++++++++++++
Raychaudhuri, S., Plenge, R., Rossin, E., Ng, A., International Schizophrenia Consortium, Purcell, S., Sklar, P., Scolnick, E., Xavier, R., Altshuler, D., & Daly, M. (2009). Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions PLoS Genetics, 5 (6) DOI: 10.1371/journal.pgen.1000534

Raychaudhuri, S., Thomson, B., Remmers, E., Eyre, S., Hinks, A., Guiducci, C., Catanese, J., Xie, G., Stahl, E., Chen, R., Alfredsson, L., Amos, C., Ardlie, K., Barton, A., Bowes, J., Burtt, N., Chang, M., Coblyn, J., Costenbader, K., Criswell, L., Crusius, J., Cui, J., De Jager, P., Ding, B., Emery, P., Flynn, E., Harrison, P., Hocking, L., Huizinga, T., Kastner, D., Ke, X., Kurreeman, F., Lee, A., Liu, X., Li, Y., Martin, P., Morgan, A., Padyukov, L., Reid, D., Seielstad, M., Seldin, M., Shadick, N., Steer, S., Tak, P., Thomson, W., van der Helm-van Mil, A., van der Horst-Bruinsma, I., Weinblatt, M., Wilson, A., Wolbink, G., Wordsworth, P., Altshuler, D., Karlson, E., Toes, R., de Vries, N., Begovich, A., Siminovitch, K., Worthington, J., Klareskog, L., Gregersen, P., Daly, M., & Plenge, R. (2009). Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk Nature Genetics, 41 (12), 1313-1318 DOI: 10.1038/ng.479

Medland, S., Nyholt, D., Painter, J., McEvoy, B., McRae, A., Zhu, G., Gordon, S., Ferreira, M., Wright, M., & Henders, A. (2009). Common Variants in the Trichohyalin Gene Are Associated with Straight Hair in Europeans The American Journal of Human Genetics, 85 (5), 750-755 DOI: 10.1016/j.ajhg.2009.10.009

Tip of the Week: Fable, text mining for literature on human genes

A couple of weeks ago we brought you a tip of the week about the CHOP CNV Database. The same people who bring you that database also do FABLE (Fast Automated Biomedical Literature Extraction), a literature mining tool. The tool uses an advanced algorithm to find human genes that are directly related to the keywords you search on, and then finds literature on those genes. The tool has some great features and is a great way to quickly find the literature on a gene of interest. Today’s tip will give you a quick intro to the tool.

Tip of the Week: PLAN2L for Arabidopsis literature

For this tip of the week we look at a text-mining tool for the Arabidopsis literature, Plan2L, or PLant ANnotation to Literature.  It has a very straightforward interface that permits searching of the paper space, and you can do that with a variety of focal points: the bibliome as a whole, or with emphasis on interactions, regulation, cell cycle, and more.  The results offer links to the PubMed abstracts, and tabular results of the statistics of the term occurrence in that area of focus.  Green results indicate positive scores and likely relevance, while red results are likely to be non-relevant: a graphical guide to quickly finding the data of interest. Links to other resources including the BioCreative server, WikiGenes, iHOP and TAIR are provided as well.

The current emphasis for this resource is Arabidopsis, but it would be quite useful for other species too.  If you are interested in text mining Arabidopsis I would also encourage you to compare the results with the Textpresso installation at TAIR to see what you discover in a different text miner interface as well.

Plan2L site: http://zope.bioinfo.cnio.es/plan2l/plan2l.html

For their recent paper on Plan2L see: http://www.ncbi.nlm.nih.gov/pubmed/19520768 or the full article freely available in PubMedCentral:  http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=19520768


Looking through one of my zillion mailing lists today I came across a project that was new to me. BioCreative is a competition/challenge for the text-mining community in biology.  It’s been around for a while according to the “about” page–and I recognize a bunch of those names.  But I’m not sure why I wasn’t aware of the current state of this.  I suppose since my Proteome days (now part of BIOBASE) I’ve drifted a bit from text mining back into sequences and SNPs and more…

Anyway–seems like a project I’d like to keep an eye on.  They are now up to BioCreative II.5, and in the midst of a new round of evaluations.  You can read more about the current project here.  That also links to some helpful background on the project and the issues.