Gorilla Genome 2.0: Lessons for the Clinic?
The unveiling of a new and improved gorilla genome sequence this week in Science isn’t a “first,” but the differences between it and gorGor3, from 2012, echo clinical situations that can arise when genetic information is incomplete.
First, the gorilla news.
SUSIE
The new gorilla genome sequence comes from Susie, who lives at the Lincoln Park Zoo. The work was led by bioinformatics specialist David Gordon at the University of Washington, part of Evan Eichler’s group. The findings in a nutshell:
• A human genome differs from a gorilla’s by 117,512 indels (insertions and deletions) and 697 inversions, most of which are newly identified.
• Susie’s genome has a more extensive major histocompatibility complex, the gene cluster that controls antigen presentation and hence immune function. Gorilla genomes have 3 large insertions there.
• We are slightly more closely related to gorillas than the 2012 sequence indicated. Repeats might have garbled earlier estimates.
• Susie’s genome shows slightly more evidence of inbreeding, reflecting a shrinking population. Severity of the most recent population bottleneck, over the past 100,000 years, might have been underestimated by 50%.
Thousands of human genomes have been sequenced, compared with a handful of gorilla genomes, which means that we have many more differences to evaluate.
VARIANTS OF UNCERTAIN SIGNIFICANCE
With phrases such as “settled science” and “scientific proof” part of the lexicon, it’s little wonder that health care consumers expect medical test results to be yes or no, not maybe. Few clinical experiences are as unsettling as receiving a “VUS” – variant of uncertain (or unknown) significance – as a genetic test result. “Yes, your gene has an unusual DNA sequence, but we don’t know what it means.”
VUS arise from the informational nature of a gene. The hundreds to thousands of DNA building blocks can vary in ways that do not affect the structure or function of the encoded protein, a little like a typo in this sentence changing “blocks” to “blokks”. Other changes matter a great deal. For example, the three “Ashkenazi” mutations in BRCA1 and BRCA2 remove 2 bases or add or delete one – changes that disrupt the 3-base language of DNA, greatly altering protein structure. But some mutations don’t alter protein structure at all, or do so in a way that doesn’t appreciably impact the protein’s function.
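To make the “3-base language” concrete, here is a minimal Python sketch. The 15-base sequence is invented for illustration (it is not a real BRCA sequence), and `translate` is a hypothetical helper built on the standard genetic code. It shows a 2-base deletion shifting the reading frame into an early stop signal, while a single-base swap that spells the same amino acid leaves the protein untouched.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Read the DNA three bases at a time until a stop codon ('*') appears."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up reference sequence: ATG GCT GAA AAA GGT
reference = "ATGGCTGAAAAAGGT"

# Delete 2 bases ("GC"): every downstream codon is now read out of frame.
two_base_deletion = reference[:3] + reference[5:]

# Swap one base (GCT -> GCC): both codons encode alanine, so nothing changes.
silent_change = reference[:5] + "C" + reference[6:]

print(translate(reference))          # MAEKG
print(translate(two_base_deletion))  # M (the shifted frame hits a stop codon)
print(translate(silent_change))      # MAEKG
```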
The VUS issue will resolve with time. It’s a bioinformatics challenge that requires identifying all gene variants that nearly always track with a suspected clinical condition, and almost never appear in people who don’t have the condition. The exceptions are due to gene-gene interactions (more on that soon).
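One rough way to picture that task is to count how often a variant turns up in people with and without the condition. The Python sketch below computes an odds ratio from such counts. The counts and the function name are invented for illustration, and real variant curation weighs far more evidence than a single ratio.

```python
def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds of carrying the variant among cases vs. among controls.

    Adds 0.5 to every cell (the Haldane-Anscombe correction) so that a
    count of zero does not cause a division by zero.
    """
    a = cases_with + 0.5
    b = cases_without + 0.5
    c = controls_with + 0.5
    d = controls_without + 0.5
    return (a * d) / (b * c)


# Hypothetical variant seen in 18 of 20 patients but only 1 of 200 controls:
# it tracks closely with the condition.
tracking = odds_ratio(18, 2, 1, 199)

# Hypothetical variant carried by 10% of patients and 10% of controls:
# no association, so the ratio is about 1.
neutral = odds_ratio(10, 90, 10, 90)

print(round(tracking, 1), round(neutral, 1))
```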
While researchers are “curating” genomes to identify gene functions, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology last year issued a joint consensus recommendation, “Standards and Guidelines for the Interpretation of Sequence Variants,” establishing these categories:
• pathogenic
• likely pathogenic
• variant of uncertain significance (“unknown” seems to have taken over)
• likely benign
• benign
The categories between “pathogenic” and “benign” are unlikely to come as good news to a patient struggling to understand a condition that she or he has likely never heard of. And it’s not really as simple as 5 categories. Different groups define “pathogenic” differently, for example. Some may require wet lab validation, others just statistical correlation.
Harvard’s Heidi Rehm, final author on the guidelines paper, talked about the challenge of evaluating variants at last year’s annual meeting of the American Society of Human Genetics. “We routinely throw out genes from panels as we do curation. Half the genes on clinical panels today, we’re not really sure what they do.” She discussed current projects to bring meaning to all the variants in human genomes, including the Matchmaker Exchange, the Global Alliance for Genomics and Health, ClinVar, and the Human Gene Mutation Database.
INTERACTING OR REDUNDANT GENES
Standing back from genomes to consider three examples at the single-gene level shows how important it is to identify all variants of a gene, and all genes that cause a particular condition.
1. Inheriting two copies of apolipoprotein E4 (apoE4) hikes risk of late-onset Alzheimer disease 15-fold, and one copy, 3-fold. Yet the apoE2 variant lowers risk. Presumably a test for the gene would reveal either variant. But what about the amyloid-β precursor protein (APP) gene? One variant, A673T, protects against Alzheimer’s. Do apoE4 and A673T cancel each other out?
2. Spinal muscular atrophy (SMA) blocks innervation of muscles and is usually fatal in early childhood. The abnormal protein shortens axons. But some siblings of patients who also inherited the SMA genotype never develop symptoms. They can thank a variant of another gene, plastin 3, which increases production of the protein (actin) that extends axons.
3. When only one gene was implicated in osteogenesis imperfecta, some parents of children with recurrent fractures were falsely accused of child abuse when genetic testing didn’t find the only known mutation. Eight distinct genes that cause the disease when mutant are now recognized.
KEEPING THE NEEDLES IN THE GENOMIC HAYSTACKS
When I taught at a large university, I wouldn’t have dreamed of using an essay exam to evaluate the 600 students in my class. A short-answer exam to neatly group them by letter grade was fast and effective, although perhaps missing some talented students with test anxiety, and rewarding a few good guessers.
Similarly, analyzing genes and their variants with across-the-board cut-off values for pathogenicity isn’t perfect. Tools that predict effects of single base variants on protein function include SIFT (sorting intolerant from tolerant), PolyPhen-2 (polymorphism phenotyping version 2), and CADD (combined annotation-dependent depletion). For PolyPhen-2 and CADD, the higher the score, the more damaging a mutation is predicted to be (SIFT runs the other way, with lower scores predicted to be more damaging). The predictions are based on such factors as the frequency with which specific variants appear in people with a particular condition, and predicting protein function (or malfunction) based on the site of a DNA base change.
Across-the-board cut-offs bothered Yuval Itan, a research associate in the St. Giles Laboratory of Human Genetics of Infectious Diseases at Rockefeller University. He introduced a more specific way to do this, in a recent Nature Methods article that discusses the “mutation significance cutoff,” or MSC. It’s necessary because “patient information may include tens of thousands of variants, a classic needle in a haystack situation,” he said.
“I estimate the effect of the mutation in the context of the gene in which the mutation is harbored, while most previous methods completely disregarded the gene entity. While previously, when the frequency of the mutation in the general population was considered, there was at least a 35% chance of removing true disease-causing mutations from the analysis. With my method the risk is only 2%,” he told me recently. DNA Science previously covered Dr. Itan’s work on establishing a human connectome (“A GPS View of the Human Genome”).
Itan explains the limitation of the existing tools. “All methods use a specific cut off to differentiate benign and damaging. In the CADD method, zero is the most benign and 99 is the most damaging. Most people use a cut-off of 15. Any variant with a value higher than 15 is considered to be potentially damaging. If it is benign, it is removed from the data, although CADD developers rightfully recommend against such usage.” The problem arises when using the cut-off for all gene variants retains the benign or ditches the pathogenic – a research designation that could have profound clinical consequences. With the new tool “there is a very low risk of removing the needle from the haystack,” Itan said.
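The difference between one cut-off for everything and a cut-off per gene can be sketched in a few lines of Python. This is not Itan’s actual MSC calculation: the gene names, CADD scores, and per-gene cut-offs below are all made up, and the only number taken from the text is the commonly used cut-off of 15.

```python
GLOBAL_CADD_CUTOFF = 15.0  # the across-the-board value mentioned above

# Hypothetical gene-specific cut-offs (placeholders, not published MSC values).
GENE_CUTOFFS = {"GENE_A": 8.0, "GENE_B": 22.0}

# Hypothetical variants from one patient.
variants = [
    {"id": "v1", "gene": "GENE_A", "cadd": 12.0},
    {"id": "v2", "gene": "GENE_B", "cadd": 18.0},
    {"id": "v3", "gene": "GENE_A", "cadd": 3.0},
]


def keep_global(variants, cutoff=GLOBAL_CADD_CUTOFF):
    """Keep variants scoring above a single cut-off used for every gene."""
    return [v["id"] for v in variants if v["cadd"] > cutoff]


def keep_gene_specific(variants, gene_cutoffs, default=GLOBAL_CADD_CUTOFF):
    """Keep variants scoring above the cut-off of the gene they sit in.

    Genes without their own cut-off fall back to the global value.
    """
    return [
        v["id"]
        for v in variants
        if v["cadd"] > gene_cutoffs.get(v["gene"], default)
    ]


print(keep_global(variants))                       # ['v2']
print(keep_gene_specific(variants, GENE_CUTOFFS))  # ['v1']
```

In this toy data set, the global cut-off throws out v1 (the possible needle, if damaging mutations in GENE_A tend to score low) and keeps v2, while the gene-specific cut-offs do the reverse.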
THE BIG PICTURE
I think that we are in a period of discovery that might not have been imagined 20 years ago when people were first starting to talk about sequencing “the” human genome. Our genetic diversity is simply astounding, yet also perfectly predictable from the very nature of the genetic material. But in time, especially with the accelerating pace of bioinformatics improvements, we will sort through it all enough to fully embrace genetic information in the clinic.
Thanks, Susie, for inspiring this week’s post. I have a special affinity for gorillas because I wore a gorilla suit for many Halloweens in a haunted house, and my children loved the edition of my textbook that had my name over a cover image of a gorilla.
(Featured image credit: Lincoln Park Zoo)
“Variants near CHRNA3/5 and APOE have age- and sex-related effects on human lifespan”
http://dx.doi.org/10.1038/ncomms11174
Excerpt:
The signal we observe is driven by rs429358, a non-synonymous Cys112Arg variant, which defines the ε4 allele and which has not previously been shown to be the causal variant influencing lifespan.
My question and comments:
Is there any reason to not report all causal variants, such as the Cys112Arg variant, in the context of nutrient-dependent RNA-mediated amino acid substitutions? I think some people may be confused by terms that link the mouse to human model of cell type differences in expression of the EDAR variant, rs3827760, which also is known as 1540T/C, 370A, EDARV370A or Val370Ala, which is a single nucleotide polymorphism (SNP) in the ectodysplasin A receptor (EDAR) gene on chromosome 2.
See: Modeling Recent Human Evolution in Mice by Expression of a Selected EDAR Variant
http://linkinghub.elsevier.com/retrieve/pii/S0092867413000676
The differences in the way causal variants are reported may have stalled the “Precision Medicine Initiative.” But, others are now linking metabolic networks to genetic networks and to differences in human behavior during life history transitions by fixation of single amino acid substitutions and supercoiled DNA that appears to protect all organized genomes from virus-driven entropy.