Tag Archives: databases

What’s the answer? Database anomalies

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

The question for the week:

Incorrect/unusual entries in main databases (GenBank, UniProt, PDB)? Pierre Poulain asks, "I… advise my students to be cautious with the data they can find in these databases. To illustrate this, I found quite unusual entries in GenBank:…" and he then lists some good ones.

There were several interesting, and funny, answers including one from our own Mary,

My favorite bizarre database item was a PubMed one. This was long before that NCBI ROFL blog was created. I was searching for genes identified in the transition to gray hair. This was not useful….


This is the TITLE (note, not the abstract):

I am a 64-year-old man, and I’ve always been proud of my perfect health record. I’ve also been proud of my full head of hair, even after the gray started creeping in. Four months ago I caught pneumonia and spent eight days in the hospital (three in intensive care). It took a while, but I’m finally back to normal – except that my hair is falling out. It comes out in clumps when I shampoo or even comb it, and it’s gotten noticeably thin all over. I remember reading about Propecia in your newsletter but I don’t have the old issue. Should I try the medication?

Check out the other answers for good examples of why researchers should always double-check the data.

There’s a database for everything, even uber-operons

I was playing around with Google Scholar's new citation feature, which allowed me to collect my papers in one place easily (it worked pretty well, btw, save a few glitches; see below), when I noticed it missed a paper of mine from 2000: "Gene context conservation of a higher order than operons." The abstract:

Operons, co-transcribed and co-regulated contiguous sets of genes, are poorly conserved over short periods of evolutionary time. The gene order, gene content and regulatory mechanisms of operons can be very different, even in closely related species. Here, we present several lines of evidence which suggest that, although an operon and its individual genes and regulatory structures are rearranged when comparing the genomes of different species, this rearrangement is a conservative process. Genomic rearrangements invariably maintain individual genes in very specific functional and regulatory contexts. We call this conserved context an uber-operon.

The uber-operon. It was my PI's suggested term. Living and working in Germany at the time, I thought it was kind of funny. Anyway, I never expanded on that research beyond another paper or so, and kind of lost track of whether the paper resulted in much. I typed 'uber-operon' into Google today and found that it's been cited a few times (88), and, I found this interesting: there have been a few databases built of "uber-operons."

A Chinese research group created the Uber-Operon Database. The paper looks interesting, but unfortunately the server is down (whether this is temporary or permanent, I do not know). The ODB (Operon Database) uses uber-operons (which they call reference operons) to predict operons in the database; Nebulon is another; HUGO is another. Read the chapter on computational methods for predicting uber-operons :)

Just goes to show you, there’s a database for everything.

Oh, and back to Google Scholar Citations. It did find nearly every paper I've published, though it missed two (including the one above) and had two false positives. Additionally, many citations are missing (like the 88 for this paper, and many from other papers). That's not to say it isn't useful; I find it a nice tool, but it's not perfect. You can find out more about Google Scholar Citations here, and about Microsoft's similar feature here.

Oh, and does this post put me in the HumbleBrag Hall of Fame? If that's reserved for Twitter, then maybe I should tweet this so I can get there :). (Though I'm not sure pointing out relatively small databases based on a relatively minor paper constitutes bragging, humbly or not, LOL.)

“What’s the Answer”

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

Today’s question and answer is:

Recommend easy to use microarray clustering software

The most highly voted answer (from the author who posted the recommendation thread):

One of my favorites is the MeV microarray data analysis tool. It is simple to use and it has a very large number of features.

Works well for any type of data. You can also load into it data from a file that is in a simple text format:

GENE1, value1, value2
GENE2, value1, value2

Feel free to post your favorite clustering tool.
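The two-column format above is simple enough to parse yourself. As a rough sketch in plain Python (the gene names, values, and the toy distance-threshold grouping rule are made up for illustration; this is not MeV's actual clustering):

```python
# Parse the simple "GENE, value1, value2" format and group genes
# whose expression profiles are close together (toy clustering).
import io
import math

# Stand-in for a file in the simple text format described above
raw = """GENE1, 0.5, 1.2
GENE2, 0.4, 1.1
GENE3, -2.0, 0.3"""

profiles = {}
for line in io.StringIO(raw):
    name, *vals = [field.strip() for field in line.split(",")]
    profiles[name] = [float(v) for v in vals]

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Put a gene in the first cluster containing a near neighbor,
# otherwise start a new cluster.
threshold = 0.5
clusters = []
for gene, prof in profiles.items():
    for cluster in clusters:
        if any(dist(prof, profiles[g]) < threshold for g in cluster):
            cluster.append(gene)
            break
    else:
        clusters.append([gene])

print(clusters)  # GENE1 and GENE2 group together; GENE3 stands alone
```

Real tools like MeV offer proper hierarchical and k-means clustering on this kind of input, but the parsing step is this straightforward.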

Several other excellent tools were suggested, you can check them out here.

Real bioinformaticians write code, real scientists…

Just over a week ago, Neil Saunders wrote a post I agreed with: Real bioinformaticians write code. The post was in response to a tweet conversation that started:

Many #biostar questions begin “I am looking for a resource..”. The answer is often that you need to code a solution using the data you have.

He’s right, and that’s very true for bioinformaticists to whom he’s talking. My concern is for the rest of biological researchers. He states in the post:

In other words: know the data sources, know the right tools and you can always sculpt a solution for your own situation.

This is very true and I wholeheartedly agree. So many solutions exist already in thousands of databases and analysis tools. That's what we do here at OpenHelix: help experimental biologists, genomics researchers and bioinformaticists find the right data sources and tools and then go and "sculpt a solution for their situation."

In the last part of my comment, I said:

BioMart, UCSC Genome Browser, Galaxy, etc, etc are excellent tools and data sources and could probably answer about 80% of most posed questions :). But my caveat would be that knowing the data sources and right tools can be a bit of a daunting task.

And it is, despite the somewhat dismissive response :). We’ve all seen the graphs, exponentially rising amounts of data over time. It’s an issue as the Chronicle of Higher Education article title states:

Dumped on by Data: Scientists Say a Deluge is Drowning Research

The journal Science also had an entire 10-article section on the issue. It's not a problem that will go away.

Along with that deluge of data has come a deluge of databases and data analysis tools (created for the most part by bioinformaticists!), many of which are, _alone_, quite daunting to find the right data and tool within. There are thousands of such databases and tools. I've lost count.

Neil Saunders is correct. The solution is out there: find the right tools and data, sculpt a solution. He responded to my comment with "Learning what you need to know in bioinformatics can certainly be daunting. But then, science isn't for the easily daunted :-)." In other words, "if you are daunted, you aren't a scientist?"

We give workshops to researchers around the world, from Singapore to the US to Morocco, and at institutions as varied as Harvard, Stanford, University of Missouri, Mt. Sinai, Stowers and HudsonAlpha. The researchers we've given workshops to and answered questions from were just as varied: developmental biologists, evolutionary biologists, medical researchers, bioinformaticists, researchers quite well versed in genomics and those not.

The overriding theme is that finding and knowing the data and the tools is not only daunting, but sometimes not possible. Not because they don't exist, but because finding and knowing them is a drain on personal and lab resources, considering the sheer, growing number of things to find and know. I refer you to the Chronicle article: drowning in data.

They are real scientists, not easily daunted, but daunted just the same by what's in front of them. And yes, many of those specific questions about specific research needs can be answered by existing tools. We come across many questions on BioStar that a well-crafted database search or analysis step will answer beautifully, without the need for reinventing the wheel with more code (and yet the answers are often code).

I suspect that most of those scientists out there who call themselves 'bioinformaticists' should have a grasp of the tools and databases available to them (though I can tell you, even the brightest of them sometimes don't). So, the advice and final words of the linked blog post above…

In other words: know the data sources, know the right tools and you can always sculpt a solution for your own situation…. real bioinformaticists write code

Yes, real bioinformaticists write code, but this advice is insufficient for the other 90% of real scientists, who don't. Perhaps BioStar is not the solution (I suspect a lot of the questions he points out are being asked by non-bioinformaticists who have only a basic knowledge of coding, if any, and no access to those who do). Perhaps it, or something like it, can be.

Tip of the Week: PhylomeDB

Gene phylogenies (as opposed to species phylogenies) can be very useful in determining gene function, history, and orthology and paralogy predictions. PhylomeDB (link added!) is a database of gene phylogenies (or, as they call them, phylomes… no end to the 'omes, is there? :). Currently there are over a dozen such phylomes from species like human and yeast. The database allows you to obtain phylogenies of genes by gene ID or BLAST, and you can also get orthology predictions, alignments and more. Today's tip introduces you to the database.

New NCBI Image Database

Mary recently brought up a paper about what we are missing when data mining papers: figures and figure legends.

Enter the NCBI Image database. This very new database includes over 3 million images that are found in the full-text resources (i.e. PubMed Central) at NCBI. So, I did a search for “drosophila phylogeny” and found some great images and figures. The results will not only pull out the figure, but also the figure legend. I got over 200 results. The links in the search result figure titles take you directly to the figure. Below the legend you can see links to the full text. It’s a great start to searching figures and figure legends.

Along with this, PubMed search results are now enhanced with images from this database (if, remember, the article is in the full-text resources… but over time a lot of research published with NIH funding will go there, won't it?). For example, go to this abstract for the paper "Text mining and manual curation of the chemical-gene-disease networks for the comparative toxicogenomics database." Scroll down just a bit and you'll see the figures from this paper, which have been deposited in the NCBI Image database. You can go directly to the link to all the figures or to the papers.

Of course, as stated, not all articles will have images in the database, only those deposited in PubMed Central. You'll find a lot of your searches won't have this image strip because the journal isn't deposited there. But with 3 million images and more journal articles going to PMC every day, this database and feature of PubMed could prove quite useful.

Hattip: APD at CTD :)

We’ve got widgets

I’ve mentioned others’ widgets before. They can be very handy tools on websites and blogs to add content and useful interactive searches, etc.

Well, we now have our own. As many of our readers know, we have a genomics and bioinformatics search engine that helps the researcher find the database or analysis tool that best fits their need. Type in a term and you get a list of genomics resources ranked by relevancy. In addition, you are shown the context in which the term was found (the resource web site, or our tutorials or blog if it appears there). Additionally, you'll find tutorials we've created on nearly 100 of them, about a dozen free to the user, like PDB, SGKB and the UCSC Genome Browser, and another 80 or so by subscription.

Anyway, you can now put the search (which of course is publicly available) on your blog or web site using one of the widgets we've just had created (by the same people who helped create our database search). We have three sizes, and you can find them and the code for them at this page.

You’ll also see I’ve put the smaller widget on the right column here on the blog. You can put a term in there and test it out. It will open another page with the results of our search. Try it out!

Tip of the Week: WAVe, Web Analysis of the Variome

Today’s Tip of the Week is a short introduction to WAVe, or Web Analysis of the Variome. The tool was recently introduced to us, and I’ve found it a welcome addition to the tools available to the researcher for analyzing human variation. This is apropos considering the recent paper we’ve been discussing on the clinical assessment of a personal genome (here, here and here) and that paper’s implications for personalized medicine and the use of online variation resources. WAVe has also introduced me to some additional tools I’ve either not been aware of or haven’t used, which might be of use, such as LOVD (Leiden Open Variation Database), QuExT (Query Expansion Tool, from the same developers as WAVe), and others. Of course, there is also database information pulled in from Ensembl, Reactome, KEGG, InterPro, PDB, UniProt, NCBI and many others. Take some time to check it out.

Guest Post: CHOP’s new tool, CNV Workshop – Xiaowu Gai

This next post in our continuing semi-regular Guest Post series is from Xiaowu Gai, the Bioinformatics Core Director at CHOP. If you are a provider of a free, publicly available genomics tool, database or resource and would like to convey something to users via our guest post feature, please feel free to contact us at wlathe AT openhelix DOT com.

Thanks to Mary for running a Tip of the Week on the CHOP CNV Database a couple of months back. The CHOP CNV Database is a high-resolution genome-wide survey of copy number variations in a large number (2,026) of apparently healthy individuals. It is publicly accessible and has been widely used by a large number of research groups world-wide. I am now pleased to announce the public release of the software system behind it: CNV Workshop. CNV Workshop is a suite of software tools that we have developed over the last few years. It provides a comprehensive workflow for analyzing, managing, and visualizing genome copy number variation (CNV) data.

It can be used for almost any CNV research or clinical project by offering the following capabilities for both individual samples and cohort studies:

CNV identification
Implements a modified circular binary segmentation algorithm that reduces false positives
Fully configurable parameters for sensitivity/specificity management
Individual locus-specific annotations such as position, type of variation, call metrics, and overlap with CNVs of other data sets, including the Database of Genomic Variants.
Functional gene annotations such as genes affected and known disease associations
Accepts user-provided annotations
GBrowse-enabled visuals for querying, browsing, interpreting, and reporting CNVs
Export of results into Excel, XML, CSV, and BED files
Direct links to public resources such as the UCSC Genome Browser, NCBI Entrez, Entrez Gene, and FABLE
Project and Account Management
Authentication and permission scheme that is especially useful for clinical diagnostic settings
Analysis result sharing within and between projects
Simple Web-based administrative interface
Remote access and administration enabled
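The segmentation idea behind the first capability can be illustrated with a toy recursive binary split: find the position where the difference in means between the two sides of a probe-level log-ratio signal is largest, split there, and recurse. This is only a sketch of plain binary segmentation, not CHOP's modified circular binary segmentation, and the signal values and thresholds below are invented for illustration:

```python
# Toy binary segmentation of a copy-number log-ratio signal.
def mean(xs):
    return sum(xs) / len(xs)

def segment(signal, start=0, min_size=3, min_gap=0.5):
    """Return (start, end) index pairs for regions of roughly constant level."""
    n = len(signal)
    if n < 2 * min_size:
        return [(start, start + n)]
    # Find the split point with the largest jump between segment means
    best_i, best_gap = None, 0.0
    for i in range(min_size, n - min_size):
        gap = abs(mean(signal[:i]) - mean(signal[i:]))
        if gap > best_gap:
            best_i, best_gap = i, gap
    # Stop when no split produces a large enough change in mean
    if best_i is None or best_gap < min_gap:
        return [(start, start + n)]
    return (segment(signal[:best_i], start, min_size, min_gap)
            + segment(signal[best_i:], start + best_i, min_size, min_gap))

# Flat near 0, a gained region near +1, then back near 0
probes = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0, 0.1, 0.0, -0.1, 0.05]
print(segment(probes))  # three segments: (0, 4), (4, 8), (8, 12)
```

Real CBS adds permutation-based significance testing and the "circular" trick of testing both edges of a candidate segment at once, which is where the false-positive control described above comes in.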

CNV Workshop currently accepts genotyping array data from Illumina’s 550k, 610- and 660-Quad, and Omni arrays, along with Affymetrix’s 5.0 and 6.0 arrays, and can be easily configured to accept data from other platforms. The package comes preloaded with publicly available reference data from more than 2,000 healthy control subjects (the CHOP CNV Database). CNV Workshop also allows the user to upload already processed CNV calls for annotation and presentation.

The software package is freely available at http://sourceforge.net/projects/cnv/. It is also described in more detail in our recent paper in BMC Bioinformatics.

-Xiaowu Gai

Coming up, Guest Posts

Greetings! The OpenHelix Blog is instituting a new semi-regular feature. Every Wednesday we have our “Tip of the Week,” on Thursdays we have our “What’s Your Problem,” and now on occasional Tuesdays we will have a “Provider Guest Post.” These will be posts from providers of genomics tools and databases: opinions, updates and upcoming features of the resource, whatever the provider would like to convey to users. We have several lined up for the coming weeks, so keep checking back.

Additionally, if you are a developer or provider of a free, publicly available genomics or biological resource, database or analysis tool and would like to post in our guest feature, be it an introduction to your tool, updates or upcoming features, or even an opinion about the current state of genomics research and data, please write us at wlathe AT openhelix DOT com. We would love to put you in the queue for the next guest post.

Our first guest post next Tuesday will be from Inna Dubchak, principal investigator at the LBNL/JGI group, developers of the VISTA comparative genomics resource (which sponsors a tutorial, free to users). She’ll discuss some new tools at VISTA and give you a quick preview of some upcoming features.