Back to the Future: NIH to Revisit Genomic Data-Sharing Policy

As first reported by GenomeWeb, last week the NIH issued a “Notice on Development of Data Sharing Policy for Sequence and Related Genomic Data.” Although the title doesn’t exactly trip off the tongue, the NIH’s announcement provides an opportunity to review where we are and where we have already been when it comes to genomic data-sharing.

At the heart of the NIH’s announcement is a desire to increase the availability of genomic datasets. From last week’s notice:


Consistent with the NIH mission to improve public health through research and the longstanding NIH policy to make data publicly available from the research activities that it funds, the NIH has concluded that the full value of sequence-based genomic data can best be realized by making the sequence, as well as other genomic and phenotype datasets derived from large-scale studies, available as broadly as possible to a wide range of scientific investigators.

For NIH-funded genomic researchers, this language should have a familiar ring. In 2007, the NIH published a policy covering data-sharing for genome-wide association studies (GWAS) that required that data from all NIH-funded GWAS research be deposited in a central data repository. Here’s the mission statement from the 2007 policy:

Consistent with the NIH mission to improve public health through research, the NIH believes that the full value of GWAS to the public can be realized only if the genotype and phenotype datasets are made available as rapidly as possible to a wide range of scientific investigators.

Although the 2007 and 2009 statements are similar, a comparison of the two shows some important differences as well (text marked [-like this-] was deleted from the 2007 statement, while text marked {+like this+} was inserted in the 2009 version):

Consistent with the NIH mission to improve public health through research[-, the NIH believes-] {+and the longstanding NIH policy to make data publicly available from the research activities that it funds, the NIH has concluded+} that the full value of [-GWAS to the public can be realized only if the genotype-] {+sequence-based genomic data can best be realized by making the sequence, as well as other genomic+} and phenotype datasets [-are made-] {+derived from large-scale studies,+} available as [-rapidly-] {+broadly+} as possible to a wide range of scientific investigators.

It may be only a paragraph, but in looking at the language that has been changed, as well as conserved, there are plenty of opportunities to read between the lines and ponder what has changed in the nearly three years since the NIH last tackled the issue of genomic data-sharing.

From GWAS to WGS.

One of the most obvious changes is the shift from the language of GWAS and genotyping to that of “sequence-based genomic data.” The transition from GWAS to whole-genome sequencing (WGS) is “being made possible by maturing, more-effective methods and technologies for generating very large sequence data sets,” which has resulted in the production of research datasets of increasing size and scope, including the likely incorporation of “human clinical/phenotypic information.” Whether you believe that genome-wide association studies have been a failure or success, it is clear that the NIH is preparing for a future of sequencing-driven genomic research.

The Public’s Role in Genomic Data-Sharing.

As expressed in 2007, the goal of genomic data-sharing policy was to deliver “the full value of GWAS to the public” by making data available to “a wide range of scientific investigators.” Although the 2009 notice still targets the “wide range of scientific investigators,” the emphasis has shifted ever so slightly from realizing the value of genomic research on behalf of the public to supporting the NIH’s “longstanding” policy of making “data publicly available.”

The difference is subtle, but significant, and reflects the distinction between crowd-sourcing and open-sourcing in genomic research. In its GWAS policy, the NIH focused primarily on collecting all of the disparate GWA studies in a central repository, thus creating a more powerful resource for scientific investigators, an example of crowd-sourcing genomic research. It is possible that the emphasis in the NIH’s next attempt at a genomic data-sharing policy will shift, or at least expand, to focus on making that data more broadly available, to the public as well as to scientific investigators. That would be an example of open-sourcing genomic research.

This is not the first time the NIH has attempted to encourage open-source genomic research by making genomic data publicly accessible. Late in 2006 the NIH unveiled a central GWAS data repository: the Database of Genotypes and Phenotypes (dbGaP). As first implemented, dbGaP was designed to support two different levels of data access: aggregate or pooled study data was made available to the general public as “open-access data,” while individualized study data was de-identified and made available only as “controlled-access data” to authorized investigators. The NIH’s 2007 policy formalized this two-tiered approach to data-sharing.

From December 2006 until the fall of 2008, the open-access portion of the data included aggregate information from a wide range of genome-wide association studies, data that was accessed hundreds of times. Then, in September 2008, a paper from Homer et al. in PLoS Genetics demonstrated that it was possible, in principle, to identify an individual within a large dataset of pooled genomic data. The NIH and others had assumed, mistakenly it appeared, that aggregated or pooled GWAS data could not be manipulated in order to identify individuals participating in a particular genomic research study. Caught off guard, the NIH quickly restricted access to pooled genomic data that had previously been included in the open-access portion of dbGaP.[1]

This abrupt about-face, induced by Homer et al.’s research and supported by subsequent research, was an unexpected blow for the NIH, dbGaP and attempts at broader sharing of genomic research data. But to its credit, the NIH has continued to assert the importance of making genomic data more broadly accessible. It is reasonable to assume that the NIH’s emphasis here on the importance of “publicly available” data is a continuation of that theme.

From Genetic Privacy to Open Consent?

The decision to pull previously publicly available data from the open-access section of dbGaP was necessitated by the expectation, shared by the NIH, researchers and individual participants alike, that genetic information would be shared only in a confidential or de-identified fashion.[1] This expectation was generated by the structure and informed consent protocols employed by individual research projects and reinforced by the NIH’s own data-sharing policy, which concluded that “protecting the privacy of the research participants and the confidentiality of their data” was “critically important.”

Promises of confidentiality had been called into question prior to September 2008. But for the NIH, the paper from Homer et al. was clearly a wake-up call (and one that has been echoed by a torrent of subsequent research). It was apparent that existing informed consent protocols were likely inadequate to permit researchers or the NIH to continue public genomic data-sharing projects such as dbGaP without modification.

The biggest obstacle was that the informed consent protocols supporting GWAS research typically failed to notify participants of the potential risks of data-sharing, including that it might not be possible to keep their genetic information private in all scenarios. But how to address this issue? The NIH spokeswoman who announced the decision to remove certain information from the open-access section of dbGaP framed the problem succinctly: “How much do you tell people without scaring them? How do you communicate the level of risk? What level of risks are people willing to tolerate?”

Just over a month after the NIH restricted access to certain dbGaP data, the Personal Genome Project (PGP) announced the release of publicly available and identifiable data from its first ten participants. Using an “open consent” model that expressly eliminates promises of privacy, the PGP provides interested participants with a comprehensive list of possible “risks and discomforts” associated with the project, some (identification with genetic data and associated loss of privacy) more likely than others (production of synthetic DNA to be planted at a crime scene). Particularly in October 2008, the PGP’s approach to data-sharing stood in stark contrast to the NIH’s policy, which, while it clearly identified the importance of adequately informed consent, was intent on reinforcing traditional expectations of genetic data privacy.

The scientific, ethical, legal and social developments of the past several years have clarified that the next generation of large-scale and integrated genomic/phenomic datasets, such as Kaiser’s recently announced 100,000+ person genetic database, must be accompanied by new attitudes toward genetic privacy and new methods for ensuring truly informed participant consent. And the NIH clearly intends to do just that, listing the following as one of its three core purposes in issuing last week’s notice:

Encourage investigators and IRBs to consider the potential for broad sharing of sequence and related genomic data in developing informed consent processes and documents for such studies involving human sequence data.

There is little likelihood that the NIH’s new data-sharing policy will require genomic researchers to adopt the PGP model in all respects, including forgoing attempts to safeguard participants’ genetic privacy in order to conduct genomic research in full view of the public. Although supported by the NIH (the PGP is funded in part by the National Heart, Lung and Blood Institute) and embraced by both domestic and international research groups, for all its virtues the PGP model is not the only or necessarily the best approach for all forms of genomic research.

The PGP is an ambitious project that promotes responsible public genomics research through rigorous informed consent protocols and fully public data-sharing. But the PGP model is not without its costs. Setting a higher bar for informed consent, through more robust risk disclosures and more stringent requirements of genetic literacy, limits and biases participant populations. And although the PGP’s approach of directly linking genomic and phenomic data and placing it into the public domain without any promise of privacy allows it to avoid the risk of an unanticipated breach of privacy, the fact is that extremely robust, even if imperfect, mechanisms do exist to protect participants’ genetic information from disclosure and identification.

To safeguard the ongoing trust of both the public and individual research participants in genomic research, it is critically important that researchers and the NIH alike acknowledge that the privacy of such participants and the confidentiality of their genomic data cannot be unequivocally guaranteed. In addressing this issue, through restructured research projects and revised informed consent protocols, the PGP represents one option along a spectrum of possible approaches. But it is an option with associated costs. Balancing the competing desires of genetic privacy and public availability will continue to be a delicate and iterative task, but the fact that the NIH is tackling this issue yet again is a welcome development for the future of public genomic research.


[1] Angrist M. Eyes wide open: The Personal Genome Project, citizen science and veracity in informed consent. Personalized Medicine. In press (Volume 6, Issue 6; November 2009). (Angrist’s excellent article provides tremendous insight into the rationale behind the NIH’s 2007 GWAS policy and the climate of de-identification and confidentiality in which the PGP first arose.)