Sequencing genomes to diagnose puzzling symptoms presents a conundrum: how to interpret whether a person’s genotype causes the syndrome without comparison to many other human genome sequences? Put another way, a gene variant (mutation) that people with the same symptoms share must also be absent in people without the syndrome for it to be labeled “causal,” rather than the disturbingly vague “variant of uncertain significance.”
The challenge is in de-identifying the hordes of healthy genomes needed to add diagnostic context to those with disease-causing mutations. A team of biologists, computer scientists, and cryptographers at Stanford University described in Science magazine a new computational tool “to make certain that genomic discrimination doesn’t happen,” according to co-author Gill Bejerano, PhD, associate professor of developmental biology, pediatrics, and computer science.
A genome sequence can reveal much more than the needle-in-the-haystack mutations that might underlie a diagnosis: parentage, ancestry, susceptibilities and risk factors, even whether a certain drug will work, whether binge drinking will make you violently ill, or whether smoking is likely to cause lung cancer. How can genome sequencing provide useful information without sacrificing privacy?
THE CLASSIC GYMREK STUDY
One of my favorite papers is also from Science, “Identifying Personal Genomes by Surname Inference,” if you can call a report from 2013 a classic. First author was then-grad student Melissa Gymrek, who now heads a lab at UCSD.
Gymrek and her co-workers tackled the 1000 Genomes Project, which ran from 2008-2015 and spawned a supposedly anonymous database. The informed consent form read, “. . . it will be hard for anyone to find out anything about you personally from any of this research.”
Right. Online searches easily shattered that premature promise of privacy.
Gymrek was then a student of Yaniv Erlich, a researcher at the Whitehead Institute who had worked with databases at banks. Together they tried to identify people who’d anonymously donated DNA to the 1000 Genomes Project, just to see if they could.
They looked at sets of short tandem repeats (STRs), the bits of sequence of 2-13 DNA bases used in forensics and genetic genealogy to distinguish individuals. Consulting public genealogy databases, they found surnames corresponding to specific Y haplotypes (sets of STRs linked on the Y chromosome).
Basic public information such as state of residence and birth year was easy to find. DNA data posted on family websites confirmed some identifications. The researchers found women by cross-referencing DNA sequences in the Coriell Cell Repositories in New Jersey to other data. Searching mutation databases for disease, hometown, and date of birth identified children.
When Gymrek had identified 50 people fairly easily, Dr. Erlich, alarmed, notified the NIH, catalyzing efforts to begin to hide some of the DNA data, although of course they couldn’t control people who’d post anything on social media. Their report in Science became a rallying cry of sorts for the ease of assigning names to DNA sequences – something that’s much easier today, with more than a million of our genomes sequenced and with the ability to carry such information on our smartphones.
In the new paper, the researchers used a cryptographic approach called Yao’s protocol with cloud computing to enable a genome peruser to zero in on the DNA sequences of clinical interest, while ignoring all else. It’s a genomic cloaking device, for those familiar with the Romulan invention from Star Trek that makes a spaceship seem to vanish. It irked Captain Kirk.
A terrific news release by Krista Conger at Stanford explained it all:
“Using the technique, the researchers were able to identify the responsible gene mutations in groups of patients with four rare diseases; pinpoint the likely culprit of a genetic disease in a baby by comparing his DNA with that of his parents; and determine which out of hundreds of patients at two individual medical centers with similar symptoms also shared gene mutations. They did this all while keeping 97 percent or more of the participants’ unique genetic information completely hidden from anyone other than the individuals themselves,” the release said.
Many “news aggregators” just publish news releases verbatim, but I dug a little deeper:
• For the four already-known diseases, the technique identified 211-374 “rare functional gene variants” in 210-356 genes (meaning more than one mutation in some genes) among the patients, then selected the most likely candidates. The computation correctly identified the mutation in all four – across all 20,663 genes, and in 5 to 10 seconds. Anyone who reads this blog regularly knows that a diagnostic odyssey for a rare genetic disease can take years, using conventional medicine.
• The baby was XY (chromosomally male) with female genitalia. The child and the parents each had 164-185 rare functional variants found with exome sequencing, and the computation revealed only two unique to the child. A review of the genetics literature found that one, ACTB, made sense – and it had been found in the 1000 Genomes Project! Only the two meaningful variants were reported to the parents and their provider, leaving what the researchers call a “protection quotient” of 99.6%. (Definition: “the fraction of private information that is exposed neither to the other participants nor to the entity running the computation.”) This more complex test took just under an hour.
• The researchers compared 928 patients from one medical center to 282 patients at another, generating a list of 5,000+ rare functional variants seen in at least one patient, then whittled it down to 159 variants seen among patients in both hospitals. The info diagnosed patients with specific heart problems, and also revealed previously unrecognized gene-disease connections – so the computation is a discovery tool too.
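To make the comparisons in these examples concrete, here is a rough plaintext sketch of the underlying set logic. It is not the researchers’ method: the real computation runs under Yao’s protocol so that no party sees the inputs, while this toy version works on data in the clear. All variant names, patient sets, and counts are made up for illustration, and `protection_quotient` is a hypothetical helper that simply applies the quoted definition as one minus the exposed fraction.

```python
# Toy variant identifiers; every name and number here is invented for illustration.
child = {"GENE1:var1", "GENE2:var7", "GENE3:var2", "GENE4:var9"}
mother = {"GENE3:var2", "GENE5:var1"}
father = {"GENE4:var9", "GENE6:var4"}

# Trio comparison: variants the child carries that neither parent does.
child_only = child - (mother | father)

# Two-hospital comparison: each hospital pools the rare variants seen in
# at least one of its patients, and only the overlap is reported.
hospital_a_patients = [{"G1:v1", "G2:v3"}, {"G2:v3", "G7:v5"}]
hospital_b_patients = [{"G2:v3"}, {"G8:v2", "G7:v5"}]
seen_at_a = set().union(*hospital_a_patients)
seen_at_b = set().union(*hospital_b_patients)
shared = seen_at_a & seen_at_b


def protection_quotient(total_private_items: int, exposed_items: int) -> float:
    """Fraction of private items that are exposed to no one else."""
    if total_private_items <= 0:
        raise ValueError("total_private_items must be positive")
    if not 0 <= exposed_items <= total_private_items:
        raise ValueError("exposed_items must be between 0 and the total")
    return 1 - exposed_items / total_private_items


# Assumed toy counts: 500 private variants in total, 2 of them reported.
pq = protection_quotient(total_private_items=500, exposed_items=2)

print(sorted(child_only))
print(sorted(shared))
print(f"{pq:.1%}")
```

The point of the cryptography is that the same answers come back without anyone ever holding the full input sets that this sketch handles openly.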
The beauty of the technique, and the secret to the privacy promise, is that the patient enters the data into a smartphone, tablet, or computer. That shouldn’t sound scary, for we send our info into the ether all the time, from ordering concert tickets to making plane reservations. “In this way, no person or computer, other than the individuals themselves, has access to the complete set of genetic information,” said Dr. Bejerano.
The computation encrypts a genome sequence into a linear series of values that rates each gene variant according to several criteria well-established among genome researchers:
• Could the gene’s function explain a patient’s symptoms?
• Is the variant rare? This is where the need for a backup million or so sequenced genomes comes in. If a variant is common, it can’t be making people too sick to reproduce.
• Is the variant functional? What does it do?
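The three questions above amount to a filter. A minimal sketch, assuming each variant has already been annotated: the `Variant` fields, the 1% rarity cutoff, and the gene names are all hypothetical choices for illustration, not values from the paper, and the filter runs on unencrypted data, unlike the real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    population_frequency: float  # share of reference genomes carrying the variant
    is_functional: bool          # predicted to change what the gene product does


RARE_THRESHOLD = 0.01  # assumed cutoff for "rare"; chosen only for this example


def candidate_variants(variants, symptom_genes, rare_threshold=RARE_THRESHOLD):
    """Keep variants that pass all three criteria from the list above."""
    return [
        v
        for v in variants
        if v.gene in symptom_genes                    # could the gene explain the symptoms?
        and v.population_frequency < rare_threshold   # is the variant rare?
        and v.is_functional                           # is the variant functional?
    ]


# Made-up patient data.
patient = [
    Variant("GENE_A", 0.0002, True),   # passes all three checks
    Variant("GENE_A", 0.2500, True),   # too common
    Variant("GENE_B", 0.0001, False),  # rare, but not functional
    Variant("GENE_C", 0.0003, True),   # gene not linked to the symptoms
]
hits = candidate_variants(patient, symptom_genes={"GENE_A", "GENE_B"})
print(hits)
```

The rarity check is the step that depends on the large pool of healthy reference genomes, since `population_frequency` can only be estimated from them.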
The direct involvement of the patient and the return of only relevant data from the cloud can avoid the genetic red flags that might underlie denial of a loan or life insurance, neither of which is protected under the Genetic Information Nondiscrimination Act (GINA), should it survive the Trump administration. And the data from the healthy genomes is aggregated without identifiers.
Genome cloaking at some point requires interpretation and communication by health care providers who are familiar and comfortable with DNA information. That might still be a rare breed. Here’s a quick test that I just invented for a provider discussing genetic testing: define SNP, CNV, VUS, and exome. If she or he can’t, find a genetic counselor pronto. The media’s common depiction of physicians as scientists (the Dana Scully effect, from the X-Files doctor constantly calling herself a scientist) can set up unrealistic expectations of expertise.
Another advance that could come from genome cloaking would be, finally, the ability to track sets of genes. This is important because gene actions can oppose. What’s the use of finding out you have a gene variant that increases the risk of Alzheimer’s, like APOE e4, yet not knowing that you also inherited a gene variant that lowers the risk (APOE e2)?
With the ability to nail disease-causing gene variants, while offering the privacy that Melissa Gymrek showed years ago to be easily compromised, genome cloaking may be able to catapult DNA science into the research lab and clinic, by providing reassurance both to families with genetic disease and to the healthy population whose genome sequences are vital to providing context.
(Thanks to NHGRI for images.)