Genomic Privacy and Re-Identification Redux

New research published this week in the Proceedings of the National Academy of Sciences by Loukides et al. (“Anonymization of electronic medical records for validating genome-wide association studies”) offers a new method for preserving individual privacy while linking genomic and healthcare data. Daniel Cressey of Nature News and Katharine Gammon of Technology Review have concise (and free) summaries.

As we’ve written earlier (“Back to the Future: NIH to Revisit Genomic Data-Sharing Policy”), the ability to link – and to share – genotype and phenotype data (including medical records, particularly treatment and outcome data) will be essential to the development of the next generation of genomic research. One of the most common ways to link genotype and phenotype data is to combine genomic data with electronic medical records (EMRs). A particular patient’s EMR may contain everything from basic biographical information to family medical history to current diagnoses, including International Classification of Diseases (ICD) codes. When it comes to associating genes with medical conditions, researchers rely on ICD codes to categorize individual patients by disease type and search for shared genetic variations that might play a causal role.

Cracking the Codes. Information that is obviously identifying (e.g., biographical information) must generally be removed under HIPAA regulations. ICD codes, however, are sometimes retained for purposes of genetic association research and, in some circumstances, a set of otherwise anonymous ICD codes pulled from an EMR can be traced back to the specific individual whose record supplied the codes.
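To make the linkage risk concrete, here is a minimal sketch of how a combination of diagnosis codes can narrow a de-identified dataset down to a single record. The records, the patient labels and the `matching_patients` helper are all made up for illustration, and the attack model (someone who already knows a few of a patient's diagnosis codes) is an assumption rather than a description of any specific published attack.

```python
# Toy, made-up data: each "de-identified" record is just a set of ICD codes.
# The code values are illustrative and do not come from any real patient.
records = {
    "p1": frozenset({"250.00", "401.9"}),
    "p2": frozenset({"250.00", "401.9"}),
    "p3": frozenset({"250.00", "401.9", "493.90"}),
    "p4": frozenset({"296.20", "493.90"}),
}


def matching_patients(known_codes, records):
    """Return the IDs of all records that contain every code in known_codes."""
    known = frozenset(known_codes)
    return sorted(pid for pid, codes in records.items() if known <= codes)


# A common pair of codes matches several records, so it singles out no one.
common = matching_patients({"250.00", "401.9"}, records)

# A rarer combination matches exactly one record, which is the linkage risk.
rare = matching_patients({"401.9", "493.90"}, records)
```

In this toy dataset the first query matches three records while the second matches only one, so anyone who knows that second pair of diagnoses can tie the attached genomic data to a single person even though no name appears anywhere.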

The new research from Loukides et al., a team which includes data privacy pioneer Bradley Malin, recognizes the potential for genomic privacy risks created by linked genotype-phenotype datasets. Loukides and his colleagues propose a mechanism for modifying such datasets to eliminate one route to individual re-identification while retaining enough information to make the data useful. From the abstract:

This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt’s University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks.

The approach from Loukides et al. involves (i) designating individual-level medical data that are potentially identifiable (the ICD codes) and then (ii) modifying the data in such a way that they no longer pose a risk of re-identification. The team’s approach combines a privacy policy (determined by reference to the size of subsets that can be created using the ICD codes) with a utility policy (a set of diseases that can be categorized by combining various ICD codes without overly distorting the phenotypic information those codes represent) to construct a dataset that “provides provable protection from individual reidentification based on clinical features” while enabling important GWAS research.
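The interplay between the two policies can be illustrated with a deliberately simplified sketch. This is not the authors' algorithm: it assumes the privacy policy is a single threshold `k` (every code set present in the data must be shared by at least `k` records) and that the utility policy is a fixed table of codes that may be merged into one generalized label. The function names, the threshold and the toy records are hypothetical.

```python
def support(code_set, records):
    """Count the records that contain every code in code_set."""
    return sum(1 for codes in records if code_set <= codes)


def violations(records, k):
    """Return the distinct code sets shared by fewer than k records."""
    return {codes for codes in records if support(codes, records) < k}


def generalize(records, groups):
    """Replace each code with its group label where the utility policy allows it."""
    lookup = {code: label for label, members in groups.items() for code in members}
    return [frozenset(lookup.get(code, code) for code in codes) for codes in records]


# Toy, made-up records (illustrative code values only).
records = [
    frozenset({"250.00", "401.9"}),
    frozenset({"250.01", "401.9"}),
    frozenset({"250.02", "401.9"}),
    frozenset({"493.90"}),
    frozenset({"493.90"}),
    frozenset({"493.90"}),
]

# Hypothetical utility policy: these three codes may be reported as one label.
groups = {"250.0x": {"250.00", "250.01", "250.02"}}

before = violations(records, k=3)
after = violations(generalize(records, groups), k=3)
```

Before generalization, three of the code sets are each held by a single record and so fall below the threshold; after the three related codes are merged into one label, every code set is shared by three records and no violations remain. The design trade-off is visible even at this scale: the merged label still supports a disease-level association, but the finer distinction between the original codes is lost.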

A Balancing Act. The primary reason why genomic privacy is an issue at all, of course, is that most individuals are uncomfortable publicly sharing their genomic and medical data. Although some “information altruists” agree to waive their privacy rights and participate in research projects – most notably the Personal Genome Project, which employs a fully public data release and consent model – most genomic research, and particularly research that combines genomic and other medical data, is premised upon some level of privacy for the participants.

The fundamental tension is how to balance individual desires for privacy with a collective interest in employing linked genotypic and phenotypic data to advance scientific understanding and, ultimately, provide improved medical care to individuals. Pure privacy – or sharing no data that could possibly be re-identified – is an untenable solution, because it cannot be achieved in practice. On the other hand, requiring participants to waive all privacy rights is equally untenable because it would, in all likelihood, dramatically restrict the available pool of research participants. (And, as The Wall Street Journal reported today, patients may lie to their doctors if they believe their EMRs will ultimately be shared without appropriate privacy protections, behavior that would hamper both research and medical care.)

Viewed in light of this ever-present tension, the model proposed by Loukides et al. should be applauded for its contribution to the ongoing effort to balance robust individual data privacy against broad access to linked medical and genomic datasets. As Malin puts it: “Generating data is expensive, and it’s both good science and good etiquette to reuse data. The challenge is to do it while protecting people.”

By seeking to block a significant path to re-identification (even if it is impossible to eliminate all possible re-identification scenarios) while preserving the utility of the published data, the approach put forth by Loukides et al. can provide needed comfort to researchers, institutions and participants considering the publication of linked genotype-phenotype datasets. After all, simply because data might be identified does not mean that it need be easily identifiable, and in many research settings robust privacy protection mechanisms will continue to serve a critical function.

Teri Manolio, director of the Office of Population Genomics at the NHGRI, agrees that the team’s approach shows promise. “It does a good job of trying to maximize the information shared while minimizing the risk for re-identification, recognizing that these goals are in dynamic tension and both cannot be fully met at the same time.” Encouraging words from an agency that has struggled to strike the proper balance between privacy and access when it comes to genomic data.

One Kind of Re-Identification. Whether the Loukides method will be adopted remains to be seen, and a technical analysis of the algorithm is beyond the scope of this article. Either way, while the approach described by Loukides and his team – if validated – appears promising, it is important to emphasize that this particular privacy protection mechanism addresses only one pathway of genomic data re-identification. Even if the Loukides et al. method “eliminates the threat of individual reidentification” using statistical measures applied to certain linked genotype-phenotype datasets, researchers have recognized that re-identification can occur in a variety of ways.

As George Church pointed out, one of the most prevalent forms of re-identification occurs through accidental or intentional releases of data that were never intended to be public, such as the data breaches tracked by the Privacy Rights Clearinghouse. Such unintended data releases could, at least in theory, compromise otherwise secure datasets. Re-identification is thus a risk that is unlikely ever to be eliminated completely. (For a more complete discussion of this issue, see our previous post, “Re-Identification and its Discontents.”)

The Genomic Privacy Two-Step. Loukides and his colleagues recognize that they are providing only a partial solution, and note that genomic privacy tools such as theirs are only effective when applied in an appropriate fashion. As the authors point out, “as is true of all data anonymization methods, our approach leaves the decision of selecting a suitable privacy protection level…to data owners or policy officials.”

Furthermore, striking a sensible balance between privacy and access is only the first step in developing a responsible approach to privacy in genomic research. Researchers and institutions must also be sure to communicate the relevant trade-offs to those individuals whose data will be used in the research, to ensure that they understand – and agree with – whatever risk of identification has been deemed appropriate to the proposed research.

Tackling both prongs of genomic privacy – the risk of re-identification and accurate communication of that risk – is necessary to ensure that the next generation of genomic research is conducted in a way that is technically robust, as well as ethically, legally and socially responsible.