What’s the answer? (SNP data restrictions)

BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

This week’s highlighted question concerns something that often surprises people digging into data they expect to be publicly and freely accessible: sometimes there are constraints.

Why is SNP data generally restricted?

Whenever I’ve looked at public data websites like Gene Expression Omnibus or The Cancer Genome Atlas, it seems like SNP datasets are restricted access. I vaguely understand that this is related to privacy concerns, since a SNP profile could theoretically uniquely identify a person. However, this seems ridiculous because to uniquely identify the person from SNP data, you’d need the person’s genome or SNP profile. These are not things that can be obtained easily and covertly, or legally without consent. Furthermore, such a policy of burying SNP data in a layer of red tape and requiring a separate request to be filed for every specific use discourages exploratory research and data mining.

Why is there so much concern about what seems to be such a theoretical issue? Is there anywhere where large amounts of de-identified human SNP data are available for data mining purposes without layers of red tape?

EDIT: I’m mostly interested in case-control SNP data, which seems particularly hard to find.


There are many interesting answers, including the point that huge numbers of public data sets are accessible right now without restrictions. It is also worth knowing that some projects have embargo windows for publication. More commonly, though, for patient-based samples it really is about privacy concerns. Sometimes the patients have signed agreements that don’t allow just anyone to access the samples. And it has been shown that project participants could potentially be re-identified from SNP data, which caused NHGRI to withdraw access to some samples that had previously been public.
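To give a feel for why a SNP profile can act as an identifier, here is a rough back-of-the-envelope sketch in Python. It is not the method used in any re-identification study; it only estimates how many SNPs are needed before a chance genotype match between two unrelated people becomes unlikely. The function names are made up for illustration, and the assumptions are strong ones: SNPs are independent, genotypes follow Hardy-Weinberg proportions, every SNP has the same minor allele frequency, and the population size of 8 billion is just a round number.

```python
import math


def genotype_match_probability(maf: float) -> float:
    """Chance that two unrelated people share a genotype at one biallelic SNP.

    Assumes Hardy-Weinberg proportions: genotype frequencies p^2, 2pq, q^2.
    """
    p = maf
    q = 1.0 - p
    genotype_freqs = (p * p, 2 * p * q, q * q)
    # Two people match if both carry the same genotype.
    return sum(f * f for f in genotype_freqs)


def snps_needed(population: int, maf: float = 0.5) -> int:
    """Smallest number of independent SNPs for which the expected number of
    chance matches in `population` drops below one (illustrative assumption)."""
    per_snp = genotype_match_probability(maf)
    ratio = math.log(population) / -math.log(per_snp)
    return math.floor(ratio) + 1


if __name__ == "__main__":
    print(genotype_match_probability(0.5))
    print(snps_needed(8_000_000_000))
```

Under these idealized assumptions a few dozen common SNPs are enough, which is tiny compared with the hundreds of thousands of markers on a typical genotyping array. Real data have linkage between nearby SNPs, relatives, and varying allele frequencies, so treat the number as an order-of-magnitude illustration only.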

Anyway, go have a look at the replies. They are helpful for understanding the issues with this data.