Busting an Embargo

Not me, and not one of the press embargoes. I’m talking about a data embargo. On the flight to a workshop this week I was reading my paper issue of Science, and I was intrigued by the story of what happened when a data embargo was broken. The story is: Paper Retracted Following Genome Data Breach. It describes a paper that used data from dbGaP and was published before its authors were permitted to publish on that data.

The scientist who helped to develop our dbGaP tutorial had alerted me to this story (hat tip to Cyndy :) ), because she knows how the dbGaP data access system works. In fact, let me quote the part of our tutorial, on slide 12, that explains it very clearly:

Next is the linked study title, followed by the Embargo Release date for each study. Investigators contributing data to dbGaP may retain the exclusive right to publish analyses of their datasets for a defined period of time. Prior to the Embargo Release date, other investigators may be granted access to download and analyze data, but they may not seek publication of their results until after this time.

This is a great but risky feature of these large-scale data projects. Investigators are asked by the NIH data sharing rules to submit data to the appropriate repository even before they’ve had a chance to publish on it. The risk is that other people will scoop the submitters. And that’s apparently what happened in this case.

We’ve also spoken to data embargo issues in the context of the ENCODE project. In fact, one segment of our tutorial on ENCODE covers that issue. As more and more “big data” projects roll out in this manner, there are likely to be more of these issues cropping up. I think PNAS had a good idea: adding an item to their author checklist that specifies whether data is under embargo rules. (Oh, and they retracted the paper and you can see the stub here.) But I think it’s also up to the projects and databases to explain the data embargoes clearly. The people associated with the big data projects understand the rules, but I don’t know that this understanding has percolated fully through the scientific end-user community. We’re trying to help get the word out with ENCODE and dbGaP in our training materials, but I know the process varies by project. I think this episode offers a nice “teachable moment” for this. I’ll be referring to it in future workshops, for sure.

So keep an eye out for this as you use “big data” resources. But do use them; don’t let this dissuade you. Just keep an eye on the calendar.
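If you like to automate your calendar-watching, here is a minimal Python sketch of that check. It is only an illustration and not anything dbGaP or ENCODE provides: the function names, the study labels, and the dates are all made up, and I’m assuming that “after this time” means strictly after the Embargo Release date. Always take the real date from the project’s own study page.

```python
from datetime import date
from typing import Dict, List, Optional


def may_seek_publication(embargo_release: date, today: Optional[date] = None) -> bool:
    """Return True only once the embargo release date has passed.

    Assumption: the release date itself still counts as embargoed.
    """
    if today is None:
        today = date.today()
    return today > embargo_release


def still_embargoed(studies: Dict[str, date], today: Optional[date] = None) -> List[str]:
    """List the (hypothetical) study labels whose embargo has not yet lifted."""
    return sorted(
        name for name, release in studies.items()
        if not may_seek_publication(release, today)
    )


# Hypothetical studies and dates, for illustration only.
my_downloads = {
    "example-study-A": date(2010, 1, 15),
    "example-study-B": date(2030, 6, 30),
}

if __name__ == "__main__":
    for name in still_embargoed(my_downloads):
        print(f"Hold off on submitting results from {name}")
```

Downloading and analyzing the data before the date is fine under the rules quoted above; the check only concerns when you may seek publication.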

dbGaP: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gap

UCSC Genome Browser (with ENCODE data): http://genome.ucsc.edu/ and http://genome.ucsc.edu/ENCODE/