Database "openness"

We train on publicly available databases and resources. When deciding whether to develop training for a resource, our definition is relatively straightforward: can an academic researcher access the data without cost or license restriction? If the answer is yes, our next step is to determine whether we can develop training materials based on the resource without cost or license restriction, and to ask the providers specifically for permission to do so. We ask permission for several reasons: to let the developer know what we are doing, to verify the restrictions or lack thereof, to build good relationships, and so on.

That first decision, “is it publicly available?”, would seem a relatively clear-cut criterion, but we have found that it isn’t always. There are several problems. Often, the ‘terms of use’ or copyright documentation is difficult to find on the web site, or nonexistent. Even when it is available, the terms, language and restrictions can vary quite a bit across databases, across countries and at times even within a single resource. Determining what “publicly available” means and which resources fit that definition can be less than simple, to say the least.

There is an attempt to offer a definition of “open” using the Creative Commons license.

Science Commons has a FAQ on how a database can apply a Creative Commons license, giving databases and biological resources a single, standardized, more easily understood definition of publicly available. Even that can be a bit complicated, as Ethan Zuckerman of “My Heart’s in Accra” states:

…a wonderfully complex FAQ on applying Creative Commons licenses to databases – the first question read “Can a Creative Commons license be applied to a database?” After a six paragraph answer to that question, the third question read, “So, a Creative Commons license can be applied to a database?”

As he mentions though, Science Commons is now offering a more straightforward protocol (the ever-required FAQ):

…the complexities of asking scientists to release their data under Creative Commons licenses was so severe that Science Commons has ended up advocating for data to be released public domain, under the auspices of their protocol, instead.

Science Commons has found that opening the data is not quite that simple and the criteria across databases and resources can be quite different. Their goal is to make it simpler and thus more open.

Melanie Dulong de Rosnay has been doing research in this very area: “how open is the data”. Her work to date can be seen in this Nature Precedings article, which outlines which databases fit the criteria of technically and legally open access by determining the following:

1. The website provides a file transfer protocol or a link to download the whole dataset without registration. The ability to download the whole dataset without registration constitute the double requirement to be considered as technically accessible.
2. TECHNICAL RESTRICTION: the database can be accessed only through registration, batch or query-based system. Technical accessibility is not achieved.
3. PUBLIC DOMAIN POLICY: the website provides simple and clear terms of use informing users that the data are in the public domain. Data are thus free to integrate. Legal accessibility is achieved.
4. NO POLICY: the website does not provide terms of use. Legal accessibility is not achieved.
5. LEGAL RESTRICTIONS: the terms of use impose contractual restrictions, such as heavy contractual requirements for attribution, limitation to non-commercial usages, prohibition to modify data, or other constraints on their redistribution or modification. Legal accessibility is not achieved. The data are not free to integrate.
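
To make the taxonomy above concrete, here is a minimal Python sketch of how the five categories might combine into a yes/no answer on technical and legal accessibility. The `DatabaseRecord` fields, the function names and the `ExampleDB` entries are all hypothetical, invented for illustration; this is not the researchers' actual tooling. Treating any terms of use that do not clearly declare the data public domain as a legal restriction is also an assumption made here.

```python
from dataclasses import dataclass
from enum import Enum


class LegalPolicy(Enum):
    PUBLIC_DOMAIN = "public domain policy"
    NO_POLICY = "no policy"
    RESTRICTED = "legal restrictions"


@dataclass
class DatabaseRecord:
    """Hypothetical record of what a reviewer found on a database's web site."""
    name: str
    whole_dataset_download: bool   # FTP or a link to download the whole dataset
    requires_registration: bool    # registration needed before access
    has_terms_of_use: bool         # any terms of use published at all
    declares_public_domain: bool   # terms clearly state the data are public domain


def technically_accessible(db: DatabaseRecord) -> bool:
    # Categories 1 and 2: whole-dataset download AND no registration.
    return db.whole_dataset_download and not db.requires_registration


def legal_policy(db: DatabaseRecord) -> LegalPolicy:
    # Categories 3, 4 and 5.
    if not db.has_terms_of_use:
        return LegalPolicy.NO_POLICY
    if db.declares_public_domain:
        return LegalPolicy.PUBLIC_DOMAIN
    # Assumption: terms that exist but do not declare public domain
    # are counted as legal restrictions.
    return LegalPolicy.RESTRICTED


def legally_accessible(db: DatabaseRecord) -> bool:
    return legal_policy(db) is LegalPolicy.PUBLIC_DOMAIN


def is_open(db: DatabaseRecord) -> bool:
    # Open only when both technical and legal accessibility are achieved.
    return technically_accessible(db) and legally_accessible(db)


# Made-up example entries, not real databases.
examples = [
    DatabaseRecord("ExampleDB-A", True, False, True, True),
    DatabaseRecord("ExampleDB-B", True, False, False, False),
    DatabaseRecord("ExampleDB-C", False, True, True, True),
]

for db in examples:
    print(db.name, technically_accessible(db), legal_policy(db).value, is_open(db))
```

Note that a database with no terms of use at all fails the legal test even if the data are freely downloadable, which is exactly the situation that makes “publicly available” harder to pin down than it first appears.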

A public interface to this analysis so far can be found at Shirley Fung’s site here (60 databases have been analyzed to date, though as far as I can tell only 34 are on the site so far). Only a few meet all the criteria, and only 7 meet the criteria of the Science Commons protocol, which is basically that the data are in the public domain with a published terms of use policy and downloadable in whole without restriction.

Ensembl doesn’t make the Science Commons protocol list on the site, though from my reading of it, it should. It is downloadable in whole without restriction, has a published terms of use policy and is in the public domain. I’ve read the protocol and checked Ensembl, and it seems like it should meet the criteria. In fact, GeneID and GOA seem to fit the criteria also, but aren’t on the list. But as we have found, even with this simplification of the protocol, perhaps I am missing something. Why would these three databases not be considered to fulfill the protocol?

The list does have some other issues*, but for the most part it is a great start.

Even if Ensembl, GeneID and GOA are incorrectly eliminated as databases that fulfill the protocol (and the jury is out on this; I could very well be reading it incorrectly), this is a great start. I am hopeful that it will lead to more standardization for database openness and get database developers thinking along those lines.

The researchers have their work cut out for them in building and maintaining this list. There are over 2,000 publicly available databases, and they are changing constantly: the databases themselves change, and new ones are born and often fade away. Even if the researchers cover only a small percentage of the available databases, it is starting the discussion. Already I’ve been looking at several databases, like the UCSC Genome Browser, to see if they fit the Science Commons protocol (yes, in my estimation :).

*There is a “minor” one for EcoCyc: the link in the heading of the EcoCyc record goes to the wrong database (though the other links, to the download and the terms of use policy, are correct).

**HUGE hat tip to Bora and Blog around the Clock for pointing this all out.

One thought on “Database "openness"

  1. Mary

    I spend a lot of my time looking for the terms and licensing details. Sometimes I end up just writing to the developers (who may or may not have valid emails). And plenty of times I just end up on the phone with tech transfer folks trying to explain our needs. This can go on for weeks. It is very challenging.
