Tag Archives: databases

(re)Funding Databases II

So, I wrote about defunding resources and briefly mentioned a paper in Database about funding (or ‘re’funding) databases and resources. I’d like to discuss this a bit further. The paper, by Chandras et al., discusses how databases and, to use their term, Biological Resource Centers (BRCs), are to maintain financial viability.

Let me state first that I completely agree with their premise: databases and resources have become imperative. The earlier model of “publication of experimental results and sharing of the related research materials” needs to be extended. As they state:

It is however no longer adequate to share data through traditional modes of publication, and, particularly with high throughput (‘-omics) technologies, sharing of datasets requires submission to public databases as has long been the case with nucleic acid and protein sequence data.

The authors state, factually, that the financial model for most biological databases (we are talking about the thousands that exist) has often been 3-5 years of development funding; once that runs out, the infrastructure needs to be supported by another source. In fact, this has led to the defunding of databases such as TAIR and VBRC (and many others), excellent resources with irreplaceable data and tools, which then must struggle to find funding to cover the considerable costs of maintaining infrastructure and continuing development.

The demands of scientific research for open, shared data require a funding model that maintains the publicly available nature of these databases. And thus the problem, as they state:

If, for financial reasons, BRCs are unable to perform their tasks under conditions that meet the requirements of scientific research and the demands of industry, scientists will either see valuable information lost or being transferred into a strictly commercial environment with at least two consequences: (i) blockade of access to this information and/or high costs and (ii) loss of data and potential for technology transfer for the foreseeable future. In either case the effect on both the scientific and broader community will be detrimental.

Again, I agree.

They discuss several possible solutions for maintaining the viability of publicly available databases, including a private-public dual-tier system in which for-profits pay an annual fee and academic researchers have free access. They mention UniProt, which underwent a funding crisis over a decade ago, as an example; UniProt (then Swiss-Prot) went back to complete public funding in 2002. There are still several other databases attempting to fund themselves by such a model. BioBase is one into which several databases have been folded; TransFac is an example. There is a free, reduced-functionality version available to academics through gene-regulation.com, and the fuller version by subscription at BioBase. The free version allows some data to be shared, as one can see at VISTA or UCSC. I am not privy to the financials of BioBase and other similar models, and I assume this will work for some, but I agree with the authors that many useful databases and resources would be hard-pressed to maintain themselves this way.

Other possibilities include folding databases fully under a single public institution’s funding mechanism. The many databases of NCBI and EBI fit this model. In fact, there is even a recent case of a resource being folded into this model at NCBI. Again, this works for some, but not all, useful resources.

Most will have to find other methods for funding their databases. Considering the importance of doing so, it is imperative that viable models are found. The authors reject advertising out of hand. As they mention, most advertisers will not be drawn to a website with fewer than 10,000 visitors per month. There might be some truth to this (and I need to read the reference they cite to back that up).

But the next model they suggest seems to me to have the same drawback. In this model, the database or resource would have a ‘partnership of core competencies.’ An example they cite is MMdb (not to be confused with MMDB). This virtual mutant mouse repository provides direct trial links to Invitrogen from its gene information pages to the product pages. They mention that though six companies were approached, only one responded. It would seem that this model has the same issues as directly selling advertising.

They also mention that, at least for their research community of mouse functional genomics, “Institutional Funding” seems the best solution for long-term viability and open access. Unfortunately, until institutions like NIH and EMBL are willing or able to fund these databases, I’m not sure that’s a solution.

As they mention in the paper, the amounts and types of data being generated are growing exponentially. I am not sure that government or institutional funding can keep up financially with housing the infrastructure needed to maintain and further develop these databases so that all the data generated can remain publicly and freely accessible.

Information should be free, but unfortunately it is not without cost. It will be interesting to see how the funding of databases and resources evolves in this fast-growing genomics world (and it is imperative we figure out solutions).

PS: On a personal note, the authors use their resource, EMMA (European Mouse Mutant Archive), as an example in the paper. I like the name since it’s the name of my daughter, but it just goes to prove that names come in waves. We named our daughter thinking few would name their daughters the same. When even databases have the same name, you know that’s not the case.

Chandras, C., Weaver, T., Zouberakis, M., Smedley, D., Schughart, K., Rosenthal, N., Hancock, J., Kollias, G., Schofield, P., & Aidinis, V. (2009). Models for financial sustainability of biological databases and resources. Database, 2009, bap017. DOI: 10.1093/database/bap017

Whole genome association studies

Genetic Future reports: First ever association study using whole genome sequences.

New-technology DNA sequencing provider Complete Genomics will provide near-complete genome sequences of 100 individuals to the Institute for Systems Biology, driving the first ever association study for a complex trait using whole-genome sequencing. Here’s the press release, and GenomeWeb has some additional information.

This study was done by Complete Genomics, and as Daniel mentions, it does indicate some changes and advances to come. Read the entire post; he mentions some things learned at ASHG about how these studies will look in the future, and particularly, this sentence…

Now the real challenge - coming up with ways of handling the massive volumes of data generated by these technologies

goes to the heart of something I see as a very important question: not only building the right tools, but funding them.

An embarrassment of riches.

(re)Funding Databases I

I blogged recently about (de)funding databases and lo and behold, a new paper was just published in Database (which is a new journal I just blogged about earlier this year) on that very subject:

Models for financial sustainability of biological databases and resources — Chandras et al. 2009, bap017 — Database.

I will be writing up a longer review and some thoughts about the article. I’m having a bad blog week and kind of lost some stuff I was writing, but I would like to point out the article for now and post my thoughts soon (later this evening?).

Tip of the Week: Finding the right genomics resource

OpenHelix just opened our new web site. We will still be offering the tutorials we’ve always offered (80 and growing!), but now we have a new search engine and a database of many more resources. And it’s publicly available and free.

There are now thousands of databases and analysis tools for the researcher to use when doing research in biology and genomics. The first problem the researcher has is just finding those resources, finding the right one. In another step toward helping that happen, we’ve put together a highly relevant, curated database of genomics and biological resources available to the researcher, and a search engine to find them based on the context of keywords found at the resource site, in tutorials, and even in our blog posts that mention the database. You will find that your searches lead you to resources relevant to your needs. Today’s tip introduces this new search.

We strongly believe this is the best method available now for the researcher not only to find relevant resources, but also to find training on how to use those resources. And check back: our database of resources will be growing, as will our features.

Blog and Site are moving

We’ll be having a bigger announcement later, but we are moving!

The URL is the same: http://www.openhelix.com. Though we also now have http://www.openhelix.eu for our European users. They’ll both go to the same place of course! (Though in the next couple of days, as our domain moves to the new servers, the first will be our old site and the latter will be our new one.)

We have a new site and functionality! OpenHelix now has a great new search engine (free and publicly available) to search through hundreds of genomics resources to find exactly what you are looking for. We believe this will be a great boon to finding your data. A large number of those resources (approaching 100 now) will have links to our tutorials for you to view and learn how to use them; some are sponsored and free to the user, others are available for a reasonable subscription cost.

Our blog will have a new look and a new URL: http://blog.openhelix.com. Of course, if you use http://www.openhelix.com/blog, you’ll be easily redirected.

We have lots of exciting plans for the search, for the site, for the blog and for our tutorials over the next few months and into the future, so keep an eye out! And please, if you find a feature you want or hit a snag, let us know!

Tip of the Week: TARGeT

Today’s tip is on TARGeT. TARGeT is, as the paper’s title in this year’s NAR issue states, “a web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences.” There are several things you can do at TARGeT. Using BLAST, PHI-BLAST, MUSCLE and TreeBest, the main function of TARGeT is to quickly obtain gene and transposon families from a query sequence. The tip today is a quick intro to the tool and a search on an R1 non-LTR transposon.

Tip of the Week: Acetylome, STRING and a new database

I recently read an article in Science entitled “Lysine Acetylation Targets Protein Complexes and Co-Regulates Major Cellular Functions,” written by Choudhary et al. The research uses “high-resolution mass spectrometry to identify 3600 lysine acetylation sites on 1750 proteins” and “demonstrate[s] that the regulatory scope of lysine acetylation is broad and comparable with that of other major posttranslational modifications.”

I’m going to admit, I know little about acetylation as a regulatory mechanism, but after reading through the paper, I found it quite an interesting read, and it suggests to me that genomics has a lot to offer in advancing our understanding of regulation and evolution.

Three things jumped out at me though.

The first is minor. The authors use the term Acetylome. You can now add that to the huge list of -omics terms to keep straight :D.

The second is that they use STRING to complete an analysis of networked interactions among the proteins discovered in their study and the processes where they are found, as you can see in their figure.

I did my postdoc and some later research in the lab (Peer Bork, EMBL) that developed STRING, and I’ve created a tutorial on it, so any time it’s used, I’m interested :D. So, I went to Methods and Materials to see how the analysis was done. Though there was a decent explanation of the process, it was not enough for me to recreate the analysis. This is not a criticism of the paper or the authors, but of how papers are being published. More and more, papers include genomics analysis, but rarely are these reported in the research paper in the detail needed to easily reproduce the analysis. Projects like Galaxy (publicly available tutorial) and Taverna are filling that void, so I’d like to see more Methods and Materials sections include analysis histories and workflows. It definitely would help in the advancement of science.
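To make the point concrete, even a small amount of machine-readable provenance published with a paper would help. The sketch below is my own illustration, not anything from the Choudhary paper or STRING; the step name, parameters, and protein IDs are hypothetical. It shows one minimal way an analysis step could be recorded: what ran, with which parameters, on exactly which inputs.

```python
import hashlib
import json
import sys
from datetime import datetime, timezone

def record_analysis(step_name, params, input_ids):
    """Return a provenance record for one analysis step: what was run,
    with which parameters, and a digest of exactly which inputs."""
    return {
        "step": step_name,
        "parameters": params,
        "n_inputs": len(input_ids),
        # Digest of the sorted input list, so anyone re-running the
        # analysis can verify they used the same set of identifiers.
        "input_digest": hashlib.sha256(
            "\n".join(sorted(input_ids)).encode()
        ).hexdigest()[:16],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Hypothetical protein accessions and parameters, for illustration only.
    proteins = ["P12345", "Q67890", "O43521"]
    step = record_analysis(
        "network_enrichment",
        {"confidence_cutoff": 0.7, "species": "Homo sapiens"},
        proteins,
    )
    json.dump(step, sys.stdout, indent=2)
```

In practice one would also record tool and database versions, but the idea is simply that a record like this, published alongside the figures (as Galaxy histories and Taverna workflows do more completely), would let readers re-run the same analysis.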

And now to the tip of the week. The paper also refers to a database that was new to me (it’s at least two years old and was reported in “Phosida: management, structural and evolutionary investigation and prediction of phosphosites”) called Phosida. The database “allows retrieval of phosphorylation and acetylation data of any protein of interest.” Today’s Tip of the Week is a quick introduction to that database.

Choudhary, C., Kumar, C., Gnad, F., Nielsen, M., Rehman, M., Walther, T., Olsen, J., & Mann, M. (2009). Lysine Acetylation Targets Protein Complexes and Co-Regulates Major Cellular Functions. Science, 325 (5942), 834-840. DOI: 10.1126/science.1175371

New and Updated Online Tutorials for PROSITE, InterPro, IntAct and UniProt

Comprehensive tutorials on the publicly available PROSITE, InterPro, IntAct and UniProt databases enable researchers to quickly and effectively use these invaluable resources.

Seattle, January 14, 2009 — OpenHelix today announced the availability of new tutorial suites on PROSITE, InterPro and IntAct, in addition to a newly updated tutorial on UniProt. PROSITE is a database that can be used to browse and search for information on protein domains, functional sites and families; InterPro is a database that integrates protein signature data from the major protein databases into a single comprehensive resource; and IntAct is a protein interaction database with valuable tools that can be used to search for, analyze and graphically display protein interaction data from a wide variety of species. UniProt is a detailed, curated knowledgebase of known proteins, with predictions and computational assignments for both characterized and uncharacterized proteins. These three new tutorials and the updated UniProt tutorial, in conjunction with the additional OpenHelix tutorials on MINT, PDB, Pfam, STRING, SMART, Entrez Protein, MMDB and many others, give the researcher an excellent set of training resources to assist in their protein research.

The tutorial suites, available for single purchase or through a low-priced yearly subscription to all OpenHelix tutorials, contain a narrated, self-run, online tutorial, slides with full script, handouts and exercises. With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. These tutorials will teach users:


* how to access information on domains, functional sites and protein families in PROSITE
* to perform a quick and an advanced protein sequence scan
* to find patterns in protein sequences using PRATT
* to use MyDomains to create custom domain graphics

* to use both the basic and advanced search tools to find detailed information on entries in InterPro
* how to understand and customize the display of your results
* to use InterProScan to query novel protein sequences for information on domains and families

* how to perform basic and advanced searches to find protein interaction data
* to effectively navigate and understand the various data views
* to graphically display and manipulate a protein interaction network

* to perform text searches for relevant protein information
* to search with sequences as a starting point
* to understand the different types of UniProt records

To find out more about these and other tutorial suites visit the OpenHelix Tutorial Catalog and OpenHelix or visit the OpenHelix Blog for up-to-date information on genomics.

About OpenHelix
OpenHelix, LLC, provides the genomics knowledge you need when you need it. OpenHelix currently provides online self-run tutorials and on-site training for institutions and companies on the most powerful and popular free, web based, publicly accessible bioinformatics resources. In addition, OpenHelix is contracted by resource providers to provide comprehensive, long-term training and outreach programs.

TrEMBLing in the face of so many protein databases

I’m not a protein person (DNA, arthropods, SNPs, RNA, that’s me), so as I was doing some research using the protein databases, I came across this tidbit of information. UniProt is a central repository of protein sequences from Swiss-Prot, TrEMBL, and PIR. Check, I knew that. What I just learned (yes, slow on the uptake, I know) is that the IPI (International Protein Index) is somewhat different.

From the FAQ:

IPI protein sets are made for a limited number of higher eukaryotic species whose genomic sequence has been completely determined but where there are a large number of predicted protein sequences that are not yet in UniProt. IPI takes data from UniProt and also from sources of such predictions, and combines them non-redundantly into a comprehensive proteome set for each species.

Just saying.

Updated Online Tutorials for DBTSS, Pfam and PDB

Seattle, WA (PRWEB) October 29, 2008 — OpenHelix today announced the availability of newly updated tutorial suites on the DataBase of Transcriptional Start Sites (DBTSS), Pfam and the Protein Data Bank (PDB). DBTSS is a public resource for the analysis of promoter regions; Pfam is a comprehensive database of protein families manually created from multiple sequence alignments and hidden Markov models; and PDB is a repository for a tremendous collection of structural information about proteins and other macromolecular structures. These three updated tutorials, in conjunction with the additional OpenHelix tutorials on ASTD, Entrez Protein and MMDB, give the researcher an excellent set of resources to carry their research from transcript to 3D protein structure.

The tutorial suites, available for single purchase or through a low-priced yearly subscription to all OpenHelix tutorials, contain a narrated, self-run, online tutorial, slides with full script, handouts and exercises. With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. These tutorials will teach users:


* to examine human promoter regions, and those in selected other species as well
* to locate transcription start sites, promoters, transcription factor binding sites and SNPs
* to use multiple query strategies to identify data of interest to your projects


* a variety of ways to search Pfam, including by keyword or by protein sequence
* how to use the information in Pfam to predict functions for uncharacterized proteins
* where you can access domain interaction data in Pfam
* about Pfam Clans, which are groups of domains from a single evolutionary origin


* how to search for structures and related information using a variety of strategies
* to understand the results pages
* how to access various tools to visualize and examine structural details

To find out more about these and other tutorial suites visit the OpenHelix Tutorial Catalog and OpenHelix or visit the OpenHelix Blog for up-to-date information on genomics.
