Tag Archives: literature

Video Tip of the Week: Publications track in UCSC Genome Browser

There’s great stuff in genome databases. And there’s great stuff in the literature. Sometimes there are clear links between the two–awesome curators work hard to source quality information, and sometimes automated processes can help. Of course, there’s also tons of stuff flowing into databases that’s not made it into the literature yet and may not ever in full, and that’s a whole ‘nother issue. But what if there was a way to put them together…. [Americans of a certain age will crack up at this]

But a lot of people are working on mining the literature for useful information and making it more accessible in other ways, as well as adding context to a genomic region. The project I’m highlighting today does exactly that.

Last year a program from Elsevier started to enable database and software providers and others to have access to their text corpus, and build “apps” that add value to the literature. We talked about it here in the context of NCBI’s app. Using this mechanism, you can also add value to genome databases by linking to the literature directly. That’s what the new Publications track in the UCSC Genome Browser does.

You can learn more about the track details from the link to a Publications landing page. There was also an announcement from the track’s lead developer, Max Haeussler, that came over the Biocurator mailing list and described some features. I’ll quote that here:

Look for it in the group “Mapping and Sequencing”, name “publications” on the UCSC genome browser for human and major model organisms (mouse, fly, zebrafish, etc). It currently contains data mined from around 3 million research articles, with sequences found in around 200k papers.

So millions of Elsevier papers and PubMed Central articles (with more sources likely to come) have been mined to find sequences. These sequences were aligned against genome sequences with BLAT, and the matches are shown on the UCSC Genome Browser. You can get more details about the strategy from the paper I’ve linked to below, and on the Publications track page.
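To make the mining step a little more concrete, here is a minimal sketch of how DNA-like strings might be pulled out of article text before alignment. It is an illustration only, not the pipeline the track actually uses: the regular expression, the `MIN_LENGTH` threshold, and the `find_sequences` function are all assumptions made for this example.

```python
import re

# Hypothetical threshold: shorter runs are too likely to be ordinary words.
MIN_LENGTH = 20

# A run of nucleotide letters, optionally broken up by single spaces or
# hyphens, as often happens when sequences are typeset in a paper.
DNA_RUN = re.compile(r"[ACGTUacgtu](?:[ \-]?[ACGTUacgtu])+")


def find_sequences(text, min_length=MIN_LENGTH):
    """Return DNA-like strings found in text, uppercased with separators removed."""
    hits = []
    for match in DNA_RUN.finditer(text):
        # Strip separators and normalise RNA-style U to T.
        seq = re.sub(r"[ \-]", "", match.group()).upper().replace("U", "T")
        if len(seq) >= min_length:
            hits.append(seq)
    return hits


sample = ("The forward primer was 5'-ACGTTGCA GGCTTAACGT TTGACC-3' and "
          "a cat sat on the mat.")
print(find_sequences(sample))
```

The length filter is what keeps everyday text such as "a cat" from being reported, since short words made of the letters A, C, G and T also match the pattern. Each surviving string would then be handed to an aligner such as BLAT.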

What’s so cool about this is that you can look at your genomic regions of interest, and now see if others have come across this region. Some of them will be papers you know, of course–but there might be other papers you didn’t know that could bring new insights about that region.

Also in the Biocurator letter Max offered a sample region to look at–here’s the link for that. Click it to load up the example and look around:

Here is a link to the genome browser with the track activated and zoomed to the EGF gene:

In my video tip I’ll show how to get to that and the track details as well. If you need to learn more about the basic functions of the UCSC Genome Browser you can see the freely available sponsored training materials on that here: http://openhelix.com/ucsc

The other thing you can do is add the “app” at Elsevier to your personal set of apps if you have access to SciVerse. And when you find yourself in an article with sequence data, you’ll be able to click from that to go to UCSC. Jennifer described how to do that for our app, but the process would be similar if you wanted to add this UCSC app as well. You can find and add the UCSC app here, and while you’re at it you can add the OpenHelix app here ;) . That will mine the text for the databases and software that authors mention, and will link you to training so you can learn how to use resources like the UCSC Genome Browser.

Special note: Max and his team are eager for feedback on this new track–if there’s something not quite right, or if there are other aspects you might want to see. He’s great on bug reports (I know, I sent some): takes them seriously and roots them out! And if you have any constructive thoughts I’m sure he’ll come by for a look. He has contact info on the Publications Track page as well. And he can correct anything I’ve got not fully right here in return for my bug hunting :)

Editing note October 25, 2013: At the time of this video, the Publications track was located elsewhere, but it can now be found in its own “Literature” track group. Look for it between “Genes and Gene Predictions” and “mRNA and ESTs”.


Quick links:

Publications Track details page for human on UCSC Genome Browser: http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=278356441&c=chr4&g=pubs

UCSC Genome Matches app at SciVerse –add it to supplement your literature browsing: http://bit.ly/Netpka

UCSC Genome Browser Intro training: http://openhelix.com/ucsc

OpenHelix SciVerse App Description: http://bit.ly/xtGcco


Haeussler, M., Gerner, M., & Bergman, C. (2011). Annotating genes and genomes with DNA sequences extracted from biomedical articles Bioinformatics, 27 (7), 980-986 DOI: 10.1093/bioinformatics/btr043

Video Tip of the Week: eGIFT, extracting gene information from text

eGIFT, as the tag line says, is a tool to extract gene information from text. It allows you to search for and explore terms and documents related to a gene or set of genes. There are many ways to search and explore eGIFT: find genes given a specific term, find terms related to a set of genes, and more. How does the tool do this? You can check out the user guide to find out more, but here is a brief summary from the site:

We look at PubMed references (titles and abstracts), gather those references which focus on the given gene, and automatically identify terms which are statistically more likely to be relevant to this gene than to genes in general. In order to understand the relationship between a specific iTerm and the given gene, we allow the users to see all sentences mentioning the iTerm, as well as the abstracts from which these sentences were extracted.
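The idea in that summary can be sketched in a few lines. This is a simplified stand-in, not eGIFT’s actual scoring (the paper describes that): it assumes a term’s score is the fraction of gene-focused abstracts containing it minus the fraction of background abstracts containing it. The function names and the toy abstracts are hypothetical.

```python
from collections import Counter


def document_frequency(abstracts):
    """Fraction of abstracts in which each lowercase word appears at least once."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(set(abstract.lower().split()))
    total = len(abstracts)
    return {term: n / total for term, n in counts.items()}


def rank_terms(gene_abstracts, background_abstracts):
    """Rank terms by how much more often they occur for the gene than in general."""
    gene_df = document_frequency(gene_abstracts)
    background_df = document_frequency(background_abstracts)
    scores = {term: freq - background_df.get(term, 0.0)
              for term, freq in gene_df.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
gene_abstracts = [
    "egf stimulates receptor phosphorylation in cells",
    "receptor phosphorylation follows egf binding",
]
background_abstracts = [
    "expression was measured in cells",
    "binding was measured in yeast",
    "cells were grown in culture",
    "mutants were grown in culture",
]

for term, score in rank_terms(gene_abstracts, background_abstracts)[:3]:
    print(term, score)
```

Words that are common everywhere, such as "in", end up with low or negative scores, while words concentrated in the gene’s own abstracts rise to the top.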

To learn more about how this tool was put together and the calculations involved, you can check out the BMC Bioinformatics publication about it from 2010, eGIFT: Mining Gene Information from the Literature.

But, for today, take a tour of the site and some of the things you can do in today’s Tip of the Week.

Relevant Links:
PubMed (tutorial)
XplorMed (tutorial)
Literature & Text Mining Resource Tutorials

Tudor, C., Schmidt, C., & Vijay-Shanker, K. (2010). eGIFT: Mining Gene Information from the Literature BMC Bioinformatics, 11 (1) DOI: 10.1186/1471-2105-11-418

Video tip of the week: OpenHelix App on SciVerse to Extend Research

We’ve all seen the discussions – on twitter, in journals, lots of places – on how to collect, store, find and use all the data that is and will be generated. Here at OpenHelix we believe that there is a gold mine of bioscience data that is being vastly underutilized, and our goal is to help make that data more accessible to researchers, clinicians, librarians, students and anyone else who is interested in science.

We go at our goal in a variety of ways, including: this blog with its weekly tips, answers and other posts; with our online tutorial materials on over 100 different biological databases and resources; and with our live trainings, many of which are sponsored by resource providers such as the UCSC Genome Browser group.

In today’s tip I will introduce you to another one of our efforts to “extend research” by showing you a glimpse of an OpenHelix app that we designed for the SciVerse platform, which Elsevier has described as an “ecosystem providing workflow solutions to improve scientist productivity and help them in their research process”. This app scans a ScienceDirect journal article for any database names or URLs that we train on, and then displays a list of such resources in the window of the app. A researcher can use this list to go from a research article to our training on how to use the resource, and to the resource itself. We believe this type of integration will help extend research by making it easier to find, access and use data associated with a paper. If you have access to articles through ScienceDirect, and you try out our app, please comment here & let us know what you think, or suggest future enhancements. Also you could consider reviewing it for the app gallery. Thanks!
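For readers curious about the scanning step described above, here is a rough sketch of dictionary-based matching of resource names and URLs in article text. It is not the app’s actual code: the `CATALOG` entries and the `find_resources` function are assumptions made for illustration.

```python
import re

# Hypothetical catalog: resource name -> strings (names or URL hosts) to look for.
CATALOG = {
    "UCSC Genome Browser": ["UCSC Genome Browser", "genome.ucsc.edu"],
    "PubMed": ["PubMed"],
    "GeneMANIA": ["GeneMANIA"],
}


def find_resources(article_text, catalog=CATALOG):
    """Return the catalog resources mentioned in article_text, in catalog order."""
    found = []
    for resource, patterns in catalog.items():
        for pattern in patterns:
            # Guard both ends so a name is not matched inside a longer word.
            regex = r"(?<!\w)" + re.escape(pattern) + r"(?!\w)"
            if re.search(regex, article_text, re.IGNORECASE):
                found.append(resource)
                break  # one hit per resource is enough
    return found


text = ("Tracks were viewed at http://genome.ucsc.edu and abstracts "
        "were retrieved from PubMed.")
print(find_resources(text))
```

A real catalog would also carry the link to the matching training material for each resource, so that every hit can point to both the resource and its tutorial.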

Quick links:

SciVerse Hub http://www.hub.sciverse.com

SciVerse Application Gallery http://www.applications.sciverse.com

OpenHelix SciVerse App Description http://bit.ly/xtGcco

Reference shown in Tip (subscription required): Mortensen, H., & Euling, S. (2011). Integrating mechanistic and polymorphism data to characterize human genetic susceptibility for environmental chemical risk assessment in the 21st century Toxicology and Applied Pharmacology DOI: 10.1016/j.taap.2011.01.015

OpenHelix Reference (free from PMC here): Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box Briefings in Bioinformatics, 11 (6), 598-609 DOI: 10.1093/bib/bbq026

SciVerse Reference (subscription required): Bengtson, J. (2011). ScienceDirect Through SciVerse: A New Way To Approach Elsevier Medical Reference Services Quarterly, 30 (1), 42-49 DOI: 10.1080/02763869.2011.541346

New feature, research articles, on OpenHelix site

If you go over to the OpenHelix home page, you’ll start to notice some differences. Our tutorial landing pages have an added new feature. We now have a section on each landing page that shows the most recent research in the BioMed Central journals. For example, the screenshot here is for our GeneMANIA tutorial suite. At the bottom left corner you’ll see the 5 most recent research articles that have used GeneMANIA from the BioMed Central catalog of research journals. This will be useful to users to get an idea of what kind of data  a resource has and research that is done using the resource. This will also be helpful to find other resources that might be of use (often a related-resource will be mentioned in conjunction with the resource of interest).

This is the beginning of a series of new features we hope to add to the landing pages that will enhance the ability of researchers to learn all they can about the data and tools available to them. We will be adding a “recent video tips” section and more.

And, if you are a publisher of research literature and would like to get a ‘recent research list’ like this on our landing pages for your journals (will need access to full text of articles), please contact us and we’d love to work with you to do so for the benefit of your journals, the resources, the researchers and yes, OpenHelix :).

In our continuing effort to maintain and expand our search database and engine, our tutorials, our blog and more, you will also notice we have added advertising to our site. The ads are a top banner ad, a side skyscraper ad and a small ad on tutorial landing pages. Please consider viewing these sponsors if they interest you. Ads will not be on sponsored tutorials such as UCSC Genome Browser and PDB, and others, nor will they be visible for any subscribers to the catalog of tutorials.

If you would like to sponsor a tutorial so that it is publicly available, whether you are the developer of the tool or a company, please contact us about the opportunity to get training and outreach to a large number of researchers.

Also, though we have a dozen sponsored tutorials that are publicly available, we do have a large catalog of over 90 additional tutorials available for subscription. You can subscribe to access these as an individual, department or institution.


Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

Tip of the Week: Word Add-In for Ontology Recognition

In today’s tip I want to make you aware of a tool that I think will help researchers to present their own data and publications in an accurate and universally searchable way. I learned of the resource (UCSDBioLit) through an article in one of my recent BioMed Central article alert emails. This resource allows authors to mark-up their own publications with XML tags AS THEY WRITE their papers. This will allow faster and more accurate semantic searching of their research.

A huge problem in science today is the ability to quickly search the vast literature base and to accurately and efficiently find the data that you are interested in. Here at OpenHelix we focus on ways of effectively and efficiently getting information out of public databases and resources, but at the other end of the process is the ability for scientific knowledge to be curated into those resources. We have featured biocurators and the phenomenal work that they do several times in the past, but it is work that never ends and can be very labor intensive. It often involves an initial triaging of a field’s literature, some level of automatic information gathering, and then careful manual effort on the part of scientists at the resource to gather and present the information through their site. I know from personal experience that the process of reading a paper, clarifying research details with an author, and then presenting that information to the author’s satisfaction can be a very long & labor intensive process, for both the curator AND the original author.

For years there has been discussion of ‘expert curation’ in which experts in the field author review or summary pages in a resource, or community curation jamborees, etc. And there have been fruits from many of these efforts, but in general participation is low. But who is more of an expert on the research being published than the authors themselves? If authors could/would mark up their own papers during the publication process, not only could they be assured that it would be accurate but they would help make their research universally searchable without the lag required for searchability through a specific resource. Thus far document mark-up has not been an easy process and has largely been deemed ‘not worth the effort’ for the level of attribution/recognition affiliated with it.

The BioMed Central article does a nice job of outlining and discussing many of these issues. It cites many other efforts and resources, explains their motivation and the implementation of their software. A nice feature of the tool is that there are interoperability features, and a real commitment to conforming with existing standards of practice. The article also presents an appendix of resource addresses of other groups involved in semantic searching and literature publication. I especially like this quote from the paper:

The Word add-in presented here will assist authors in this effort using community standards and by making it possible for the author of the document, the absolute expert on the content, to do so during the authoring process and to provide this information in the original source document.

You can also find brief tutorials on using the tool at SciVee: Word Add-in for Ontology Recognition Tutorial (1 of 4): Install Process

As a note, literature mark-up and enabling are currently an active area – Mary found another literature handling resource and paper as well: Check out the tip, the articles & the tools. Tell me what you find/think. Thanks! (OH, and Happy St. Patty’s to ya!)

UCSDBioLit Reference:
Fink, J., Fernicola, P., Chandran, R., Parastatidis, S., Wade, A., Naim, O., Quinn, G., & Bourne, P. (2010). Word add-in for ontology recognition: semantic enrichment of scientific literature BMC Bioinformatics, 11 (1) DOI: 10.1186/1471-2105-11-103

Tip of the Week: Managing & sharing references with Mendeley

This week’s tip is a bit off-topic (as in genomics databases), but it is science/biology-related and something we all need. There are a lot of reference management software possibilities out there like EndNote, some great web 2.0 social networking sites like Connotea (Nature Publishing) and CiteULike (Springer) and some PDF management tools. Mendeley wants to be all three. I like the idea a lot. Instead of having 2 or 3 separate applications, desktop and/or web, etc., you have one to rule them all. Of course EndNote has EndNote web, but it’s not free. Mendeley is free, and the features they offer now will stay free; they will offer new features for professional users later for a fee. You can export your references in Connotea and CiteULike, but it’s an extra annoying step. My first experiences with Mendeley have been quite positive, so I thought I’d introduce them to you here.

Impact Factor

I remember considering the “Impact Factor” of journals when submitting research papers, and wondering, out of curiosity, what the impact factor of a specific paper I published might be. Not particularly seriously: my field was narrow enough in my Ph.D. research that there were just a few journals to even consider, so it was usually pretty simple choosing. And for individual articles, I am pretty sure I knew the 4 people in the world outside my lab that were interested in my research (I jest, a little). During my postdoc, my PI was pretty good at choosing journals based on the article, the journal’s audience… and impact factor.

But impact factor measuring has its issues (Article-Level Metrics and the Evolution of Scientific Impact, Neylon and Wu. PLoS Biol 7: e1000242), and there is always a search to measure the impact of journals and individual articles better, or at least differently. Well, one of my favorite science sites and one of my favorite journal publishers, ResearchBlogging.org and PLoS, have worked together to measure the impact of journal articles. PLoS has a lot of metrics to see what the ‘impact’ of an article might be, and now they’ve added a metric to see how many times it’s been written about on blogs using blog aggregators like Postgenomic, Bloglines and Nature Blogs, and now ResearchBlogging.

I like the partnership with ResearchBlogging specifically because, whereas the other blog aggregators are not necessarily picking up posts that discuss the science of the article (Postgenomic) or aggregate only a subset of the science blogs out there (Nature Blogs), ResearchBlogging consists specifically of blog posts discussing the research in peer-reviewed articles.

Of course I don’t find this particularly useful to compare one article against another (the best articles aren’t always written about, and those that are might not be in the blog aggregators), but I do think this will be a great way to carry on the conversation and dig deeper into the research topic.

You can view that metric at PLoS for any article. For example, on the one I link to above, click on the “metric” tab and scroll down a bit until you see the heading “Blog Coverage.” For that article, you’ll see two ResearchBlogging posts (as of this writing): a metric for this paper about metrics :).

Tip of the Week: Fable, text mining for literature on human genes

A couple of weeks ago we brought you a tip of the week about the CHOP CNV Database. The same people who bring you that database also do FABLE (Fast Automated Biomedical Literature Extraction), a literature mining tool. The tool uses an advanced algorithm to find human genes that are directly related to the keywords searched on and then finds literature on those genes. The tool has some great features and is a great way to quickly find the literature on a gene of interest. Today’s tip will give you a quick intro to the tool.

(re)Funding Databases II

So, I wrote about defunding resources and briefly mentioned a paper in Database about funding (or ‘re’funding) databases and resources. I’d like to discuss this a bit further. The paper, by Chandras et al., discusses how databases and, to use their term, Biological Resource Centers (BRCs) are to maintain financial viability.

Let me state first, I completely agree with their premise, that databases and resources have become imperative. The earlier model of “publication of experimental results and sharing of the related research materials” needs to be extended. As they state:

It is however no longer adequate to share data through traditional modes of publication, and, particularly with high throughput (‘-omics) technologies, sharing of datasets requires submission to public databases as has long been the case with nucleic acid and protein sequence data.

The authors state, factually, that the financial model for most biological databases (we are talking about the thousands that exist) has often been 3-5 years of development funding; once that runs out, the infrastructure needs to be supported by another source. In fact, this has led to the defunding of databases such as TAIR and VBRC (and many others), excellent resources with irreplaceable data and tools, which then must struggle to find funding to cover the considerable costs of maintaining infrastructure and continued development.

The demands of scientific research, open, shared data, require a funding model that maintains the publicly available nature of these databases. And thus the problem as they state:

If, for financial reasons, BRCs are unable to perform their tasks under conditions that meet the requirements of scientific research and the demands of industry, scientists will either see valuable information lost or being transferred into strictly commercial environment with at least two consequences: (i) blockade of access to this information and/or high costs and (ii) loss of data and potential for technology transfer for the foreseeable future. In either case the effect on both the scientific and broader community will be detrimental.

Again, I agree.

They discuss several possible solutions to maintaining the viability of publicly available databases, including a private-public dual tier system where for-profits pay an annual fee and academic researchers have free access. They mention UniProt, which underwent a funding crisis over a decade ago, as an example. UniProt (then Swiss-Prot) went back to complete public funding in 2002. There are still several other databases that are attempting to fund themselves by such a model. BioBase is one into which several databases have been folded. TransFac is one. There is a free, reduced functionality version that is available to academics through gene-regulation.com, and the fuller version is available for a subscription at BioBase. The free version allows some data to be shared, as one could see at VISTA or UCSC. I am not privy to the financials of BioBase and other similar models, and I assume that will work for some, but I agree with the authors that many useful databases and resources would be hard-pressed to be maintained this way.

Other possibilities include fully including databases under a single public institution funding mechanism. The many databases of NCBI and EBI fit this model. In fact, there is even a recent case of a resource being folded into this model at NCBI. Again, this works for some, but not all useful resources.

Most will have to find a variety of methods for funding their databases. Considering the importance of doing so, it is imperative that viable models are found. The authors reject, out of hand, advertising. As they mention, most advertisers will not be drawn to website advertising without a visibility of at least 10,000 visitors per month. There might be some truth to this (and I need to read the reference they cite to back that up).

But the next model they suggest seems to me to have the same drawback. In this model, the database or resource would have a ‘partnership of core competencies.’ An example they cite is MMdb (not to be confused with MMDB). This virtual mutant mouse repository provides direct trial links to Invitrogen from its gene information to the product page. They mention that though 6 companies were approached, only one responded. It would seem that this model has the same issues as directly selling advertising.

They also mention that, at least for their research community of mouse functional genomics, “Institutional Funding” seems the best solution for long-term viability and open access. Unfortunately, until institutions like NIH and EMBL are willing or able to fund these databases, I’m not sure that’s a solution.

As they mention in the paper, the growth in the amounts and types of data being generated is exponential. I am not sure that government or institutional funding can financially keep up with housing the infrastructure needed to maintain and further develop these databases so that all the data generated can remain publicly and freely accessible.

Information should be free, but unfortunately it is not without cost. It will be interesting to see how funding of databases and resources evolves in this fast growing genomics world (and it is imperative that we figure out solutions).

PS: On a personal note, the authors use their resource, EMMA (European Mouse Mutant Archive), as an example in the paper. I like the name since it’s the name of my daughter, but it just goes to prove that names come in waves. We named our daughter thinking few would give their daughters the same name. When even databases share the name, you know that’s not the case.

Chandras, C., Weaver, T., Zouberakis, M., Smedley, D., Schughart, K., Rosenthal, N., Hancock, J., Kollias, G., Schofield, P., & Aidinis, V. (2009). Models for financial sustainability of biological databases and resources Database, 2009 DOI: 10.1093/database/bap017