Re-identification and its Discontents

Last fall, a paper from Homer et al. in PLoS Genetics made waves by demonstrating that it was possible, in principle, to identify an individual’s genomic data within a large dataset of pooled genomic data. Pooled or aggregated genomic data had previously been considered to provide individual research participants with a strong measure of privacy. The paper from Homer et al. produced an immediate reaction from the genomic research community, prompting the National Human Genome Research Institute (NHGRI) to restrict pooled genomic data that had previously been accessible to the public. Other institutions, including the Wellcome Trust and the Broad Institute, quickly followed suit.

Twelve months later, the issue of genomic privacy is still a hot topic, at least in the pages of scientific journals. Last week, in particular, saw a flurry of activity, with Nature Genetics publishing “A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies,” which followed close on the heels of last month’s “Genomic privacy and limits of individual detection in a pool.” Over at PLoS Genetics, the current issue offers up a pair of similarly focused papers: “Needles in the Haystack: Identifying Individuals Present in Pooled Genomic Data” and “The Limits of Individual Identification from Sample Allele Frequencies: Theory and Statistical Analysis.”

I. The Limits of Genomic Privacy

All of these articles, in some fashion, build upon or critique the landmark work of Nils Homer and his team. The statistical techniques underlying these papers are complex and, at the margins, there is plenty of room for debate over the ability to infer an individual’s presence from a particular genomic dataset. The number of individuals pooling their data, the number of genetic markers disclosed for each individual, the frequency of those genetic markers, and the frequency of genotyping error represent only some of the factors that make the identification of a single individual in a genomic dataset more or less difficult.
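To make those factors a little more concrete, here is a minimal sketch of the kind of membership test at issue, run on simulated data. It is loosely modeled on the distance statistic described by Homer et al. (for each marker, is the individual closer to the pool's allele frequency than to a reference population's?), but it is a simplification under stated assumptions, not the authors' implementation: markers are treated as independent, genotyping error is ignored, and every name and parameter (`membership_t`, `simulate`, `num_markers`, `pool_size`) is hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def genotype(p, rng):
    """Draw one diploid genotype as an allele fraction: 0.0, 0.5 or 1.0."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def column_means(genotypes):
    """Per-marker allele frequency across a list of individuals."""
    return [sum(col) / len(col) for col in zip(*genotypes)]


def membership_t(person, pool_freq, ref_freq):
    """T-like score: positive values suggest the person is closer to the pool
    than to the reference sample across markers (illustrative only)."""
    d = [abs(y - r) - abs(y - m)
         for y, m, r in zip(person, pool_freq, ref_freq)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def simulate(num_markers=5000, pool_size=20, seed=0):
    """Simulate a pool, a same-sized reference sample and one outsider,
    then score a pool member and the outsider against the pooled data."""
    rng = random.Random(seed)
    # Assumed true allele frequencies, independent across markers.
    true_freq = [rng.uniform(0.1, 0.9) for _ in range(num_markers)]

    def draw_person():
        return [genotype(p, rng) for p in true_freq]

    pool = [draw_person() for _ in range(pool_size)]
    reference = [draw_person() for _ in range(pool_size)]
    outsider = draw_person()

    pool_freq = column_means(pool)      # what a public pooled dataset exposes
    ref_freq = column_means(reference)  # stand-in for population frequencies

    t_member = membership_t(pool[0], pool_freq, ref_freq)
    t_outsider = membership_t(outsider, pool_freq, ref_freq)
    return t_member, t_outsider


if __name__ == "__main__":
    t_member, t_outsider = simulate()
    print(f"pool member score: {t_member:.2f}")
    print(f"outsider score:    {t_outsider:.2f}")
```

In this toy setting, raising `pool_size` or lowering `num_markers` should shrink the gap between the two scores, which is the intuition behind the first two factors listed above. Real data adds complications the sketch leaves out, including correlated markers, genotyping error and an imperfectly matched reference population.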

Furthermore, as a practical matter, the likelihood of genomic re-identification of the type discussed by Homer and others appears to be low. In addition to a publicly available dataset of pooled genomic data, identified genomic data from the individual to be re-identified (or a close relative) would be required. Re-identification would also require non-trivial technical expertise and resources, along with a motive (likely beyond mere curiosity) for attempting it in the first place. That each of those elements would fall into line at the same time and for the same individual seems unlikely indeed.

Statistically speaking, it may be much more likely that an individual is associated with his or her genetic information through intentional or accidental mishandling of data, malicious hacking or through the DNA of a first-degree relative. But focusing on any particular pathway for re-identification obscures the underlying point: the issue is not how, specifically, re-identification has happened or might happen, but rather that it can happen, and that the pathways for re-identification appear to be expanding and widening.

Although the boundaries of what is possible when it comes to protecting genomic privacy will continue to be tested—both by those seeking to provide greater security and those seeking to develop increasingly sophisticated and powerful identification techniques—what is clear is that it is no longer technologically possible, or ethically responsible, to offer individuals unequivocal promises of privacy when soliciting their genomic data for use in the large genomic datasets that represent the future of genomics research and commerce.

The implications of this conclusion have been teased out most clearly in the genomics research space. In an article last year in Nature Reviews Genetics, I argued with my co-authors that the fundamental inability to promise research participants that genomic privacy would be maintained required, ethically and legally, that researchers employ an “open consent” model in which risks were carefully identified and openly disclosed. With respect to data privacy, we argued that participants should not be promised that their genomic data would remain anonymous without exception. The open consent approach forms the foundation of the informed consent model employed by the Personal Genome Project, and several commentaries, also in this week’s PLoS Genetics, strike similar themes (see Church and Heeney et al.).

II. Genomic Privacy in the DTC Context

But the last year has seen significant changes in the genomic research landscape; specifically, the emergence of the DTC genomic research movement. As we’ve discussed numerous times at the Genomics Law Report (see here, here, here and here), DTC and other commercial genomics companies have become increasingly ambitious in their plans to develop and capitalize upon large datasets consisting of their customers’ genotypic and phenotypic data.

Although these proprietary datasets are unlikely to be made public in the way that, for instance, the NIH provided dbGaP data to the public prior to Homer et al.’s research, their structure and composition are highly similar to traditional genomic datasets. And while these companies may not be subject in every instance to the same human subjects research regulations that govern traditional research projects (in most instances, the so-called “Common Rule”), the possibility of re-identification for DTC customers is quite real.

As an example of this tension, industry leader 23andMe includes the following statement in its Consent and Legal Agreement:

23andMe may grant researchers associated with partner organizations access to aggregated data from our database of genetic and other contributed personal information for specific research queries. 23andMe will only provide individual level data to external researchers upon individual consent from each customer.

23andMe makes a distinction here between “aggregated data,” which it may share without consent, and “individual level data,” which it will not share without obtaining consent. But as the flood of recent scholarship demonstrates, this distinction is not as strong as 23andMe’s policy suggests. At least in certain instances, researchers (and others) may be able to extract “individual level data” from the “aggregated data.”

Jacobs et al. conclude their recent article in Nature Genetics with the following recommendation:

In light of these developments, the policies and practices guiding genomic data sharing should continue to evolve in order to promote quality science, minimize duplicative research and merit the ongoing trust of the research subjects who consent to participate in scientific studies.

That final point—securing the ongoing trust of individuals—is crucial to the continued development and success of both commercial and research genomics. For that reason, and in light of the increasing prevalence of genomic data sharing conducted by commercial genomics companies, we must add “customers who purchase genomic services” to the population of individuals whose ongoing trust must be preserved.

The failure to acknowledge the limitations of genomic privacy and to preserve the trust of participants and consumers is one that could have dire consequences for both genomics research and commerce. In many respects, the loss of public trust in genomics research and commerce might produce far greater and longer-lasting damage than any particular harm that is likely to result from the re-identification of an individual’s genomic data.

But preserving trust does not require that genomics researchers or companies protect individuals’ data at all costs, including refraining from sharing aggregate (or even individual-level) data to the detriment of their research or their businesses. It simply requires that researchers and businesses lay plain the risks of such data sharing to their participants and customers, respectively, and provide them with the information they need to make a truly informed decision.

Read More: While the PLoS Genetics articles are freely accessible, the articles in Nature Genetics are not. For those without a subscription, GenomeWeb (free registration) and Medscape Today have excellent summaries of the paper from Jacobs et al.