Tag Archives: TAIR

(re)Funding Databases II

So, I wrote about defunding resources and briefly mentioned a paper in Database about funding (or ‘re’funding) databases and resources. I’d like to discuss this a bit further. The paper, by Chandras et al., discusses how databases and, to use their term, Biological Resource Centers (BRCs) are to maintain financial viability.

Let me state first that I completely agree with their premise that databases and resources have become imperative. The earlier model of “publication of experimental results and sharing of the related research materials” needs to be extended. As they state:

It is however no longer adequate to share data through traditional modes of publication, and, particularly with high throughput (‘-omics) technologies, sharing of datasets requires submission to public databases as has long been the case with nucleic acid and protein sequence data.

The authors state, factually, that the financial model for most biological databases (we are talking about the thousands that exist) has often been 3-5 years of development funding; once that runs out, the infrastructure needs to be supported by another source. In fact, this has led to the defunding of databases such as TAIR and VBRC (and many others), excellent resources with irreplaceable data and tools, which then must struggle to find funding to cover the considerable costs of maintaining infrastructure and continued development.

The demands of scientific research, open, shared data, require a funding model that maintains the publicly available nature of these databases. And thus the problem as they state:

If, for financial reasons, BRCs are unable to perform their tasks under conditions that meet the requirements of scientific research and the demands of industry, scientists will either see valuable information lost or being transferred into strictly commercial environment with at least two consequences: (i) blockade of access to this information and/or high costs and (ii) loss of data and potential for technology transfer for the foreseeable future. In either case the effect on both the scientific and broader community will be detrimental.

Again, I agree.

They discuss several possible solutions for keeping publicly available databases viable, including a two-tier public-private system in which for-profit users pay an annual fee and academic researchers have free access. They mention Uniprot, which underwent a funding crisis over a decade ago, as an example. Uniprot (then Swissprot) went back to complete public funding in 2002. Several other databases are still attempting to fund themselves with such a model. BioBase, into which several databases have been folded, is one example, and TransFac is one of its databases. A free, reduced-functionality version is available to academics through gene-regulation.com, and the fuller version is available by subscription at BioBase. The free version allows some data to be shared, as one can see at VISTA or UCSC. I am not privy to the financials of BioBase and other similar models, and I assume that will work for some, but I agree with the authors that many useful databases and resources would be hard-pressed to be maintained this way.

Other possibilities include folding databases fully into a single public institution’s funding mechanism. The many databases of NCBI and EBI fit this model. In fact, there is even a recent case of a resource being folded into this model at NCBI. Again, this works for some, but not all useful resources.

Most will have to find various methods for funding their databases. Considering the importance of doing so, it is imperative that viable models are found. The authors reject advertising out of hand. As they mention, most advertisers will not be drawn to website advertising without a visibility of at least 10,000 visitors per month. There might be some truth to this (and I need to read the reference they cite to back that up).

But the next model they suggest seems to me to have the same drawback. In this model, the database or resource would have a ‘partnership of core competencies.’ An example they cite is MMdb (not to be confused with MMDB). This virtual mutant mouse repository provides direct trial links to Invitrogen from its gene information to the product page. They mention that though 6 companies were approached, only one responded. It would seem that this model has the same issues as directly selling advertising.

They also mention that, at least for their research community of mouse functional genomics, “Institutional Funding” seems the best solution for long-term viability and open access. Unfortunately, until institutions like NIH and EMBL are willing or able to fund these databases, I’m not sure that’s a solution.

As they mention in the paper, the growth in the amounts and types of data being generated is exponential. I am not sure that government or institutional funding can keep up with the cost of housing the infrastructure needed to maintain and further develop these databases so that all the data generated can remain publicly and freely accessible.

Information should be free, but unfortunately it is not without cost. It will be interesting to see how funding of databases and resources evolves in this fast-growing genomics world (and it is imperative that we figure out solutions).

PS: On a personal note, the authors use their resource, EMMA (European Mouse Mutant Archive), as an example in the paper. I like the name since it’s the name of my daughter, but it just goes to prove that names come in waves. We named our daughter thinking few would name their daughter the same. When even databases share the name, you know that’s not the case.

Chandras, C., Weaver, T., Zouberakis, M., Smedley, D., Schughart, K., Rosenthal, N., Hancock, J., Kollias, G., Schofield, P., & Aidinis, V. (2009). Models for financial sustainability of biological databases and resources. Database, 2009. DOI: 10.1093/database/bap017

(de)Funding Databases

From Deepak Singh:

Scientists spend years collecting and generating increasing amounts of data. The data ranges from raw instrument data, “finished” data (e.g. a genome sequence which is constructed after aligning all the short reads from a next-gen sequencer), and annotated data, which has been marked up to add additional information. We have repositories where a lot of this data goes, RCSB, NCBI, etc. In many cases there is clarity in these destinations and for the better part, resources like RCSB and NCBI are well funded and long lived (although I am always nervous about RCSB). However, many data repositories are dependent on funding, with no guarantees that the funding will be renewed. Given the size of some of these data resources, shouldn’t we be thinking of a more sustainable model for funding? This is a general problem for infrastructure resources, given the cost and the fact that you shouldn’t be looking at these from a 3-5 year perspective. This especially baffles me when libraries come into play. Shouldn’t the timescale there be in the 10’s of years?

mndoci.com, The disconnect in funding data resources, Oct 2009

You should read the whole article.

A recent example of this is the Arabidopsis resource, TAIR.

1001 Genomes: plant researchers raise by 1

There is plenty of buzz out there for the big data biology projects, but usually the focus is the human data (with a few token model organisms thrown in). But this week plant researchers renewed the call for big plant data. I’m totally on board with that.

The 1000 Genomes project to obtain more human variation information is well underway, funded, and has companies supporting it.  And that’s great–I’m all for this too!  But as someone who survives largely on the kindness of plants I want more plant research going on.  I want to see this funded and supported.  And as we face increasing stresses on resources from limitations like oil and water supplies to wacky climate conditions and environmental consequences I think we could well afford to spend less time gazing at our human genomic navels and devote more attention to the plants.

There is already some work on this Arabidopsis project.  The first paper with data on this effort came out last fall.  But the researchers are still having to go out and lobby for this project.  A new opinion piece in Genome Biology calls out for awareness and support for this effort.

They have already done a first generation green HapMap. The paper last fall illustrated the feasibility of the project by looking at the reference Col-0 (Columbia) and the Bur-0 and Tsu-1 strains. The paper presents the process and compares their pipeline software, SHORE, which they developed, with another package, MAQ. They have a GBrowse installation that presents the data (and you can get free training on GBrowse here to effectively use the site). They also provide data to TAIR.

I think this is important and I hope it gets the same level of support and respect that 1000 humans will get.

1001 Genomes main site: http://1001genomes.org/

1001 Genomes GBrowse: http://gbrowse.weigelworld.org/cgi-bin/gbrowse/ath_reseq_1001/

Clark, R., Schweikert, G., Toomajian, C., Ossowski, S., Zeller, G., Shinn, P., Warthmann, N., Hu, T., Fu, G., Hinds, D., Chen, H., Frazer, K., Huson, D., Scholkopf, B., Nordborg, M., Ratsch, G., Ecker, J., & Weigel, D. (2007). Common Sequence Polymorphisms Shaping Genetic Diversity in Arabidopsis thaliana. Science, 317 (5836), 338-342. DOI: 10.1126/science.1138632

Ossowski, S., Schneeberger, K., Clark, R., Lanz, C., Warthmann, N., & Weigel, D. (2008). Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Research, 18 (12), 2024-2033. DOI: 10.1101/gr.080200.108

Weigel, D., & Mott, R. (2009). The 1001 Genomes Project for Arabidopsis thaliana. Genome Biology, 10 (5). DOI: 10.1186/gb-2009-10-5-107

Wikification of Genbank

Speaking of Genbank’s 25th, a few weeks ago Science had a news piece, “Proposal to ‘Wikify’ Genbank Meets Stiff Resistance.” Apparently, those in the mycology research community have found many inaccuracies in the Genbank records and wish to see a change that would allow annotations to be made by the community:

a scheme like those used in herbaria and museums, where specimens often have multiple annotations: listing original and new entries side by side. It would be a community operation, like Wikipedia, in which the users themselves update and add information, but not anonymously.

But the idea is meeting resistance from Genbank’s managers:

Eh, enter your own damn data….

I was looking over the Eurekalert announcements and came across one that I have been mulling over for some time. It is an effort I fully support and encourage. But I worry about a few aspects of it. The alert is entitled “Controlling a sea of information.” The Arabidopsis Information Resource (TAIR) has partnered with the journal Plant Physiology to ensure data from Plant Physiology papers will get into the TAIR database. The longer story is available from the alert and from the associated Editorial. The short story is: there aren’t enough curators to keep up with all the data coming out. This prevents a lot of information from getting into the databases. The TAIR and PlantPhysiol folks have teamed up to create a way for the authors themselves to get this information into TAIR with a simple form.

Finding Flies

I finally got around to reading last month’s Nature paper on the genomic sequence of 12 Drosophila species. In addition to being genomics research (which is my field now :), it also looks at 12 of the couple dozen species I studied for my Ph.D. (though I was only looking at the evolution of R1 & R2 retroposons in arthropods). Interesting paper, and I might go into it in more depth later (what genomics means and doesn’t mean for evolutionary studies).

But I did get to thinking: where would I go to browse and search the genomic sequence data for these 12 species (hey, I might want to recreate my work, though the Eickbush lab already has, and extended it)? Of course there are the two browsers mentioned in the paper ;-), Flybase and UCSC Genome Browser, though UCSC doesn’t include D. willistoni as I write this. I checked the other two major general genome browsers, as opposed to species- or taxa-specific ones: Ensembl and NCBI’s MapViewer.