Revealing the Subtext in DNA Sequences

Whenever the first copy of a book I’ve written arrives on my doorstep, I’m afraid to look at it. I still haven’t leafed through the 12th edition of my human genetics textbook, delivered more than a month ago.

Why? I’m afraid there will be errors.

Not misspellings or perish-the-thought incorrect grammar, but the sorts of mistakes that would have flown under the radar of the copyeditors, proofreaders, spellchecks, and grammarchecks.

The missed errors are of two types:

1. Those that repeat a word or part of one – codon codon codon, or hippopotapotapotamus.

2. Phrases that mysteriously moved from where they should be to where they shouldn’t, a sentence from one chapter appearing in another, out of context yet likely undetectable by a bored student.

Unusual repeats and transpositions also happen in genomes, as well as flipped DNA sequences, which thankfully I’ve not seen in a book. Conventional DNA sequencing can’t see these glitches because the sequences haven’t changed – they’ve just been relocated. Clinically, the hiding-in-plain-sight of such repeats and rearrangements can delay diagnosis as false negatives accrue.

Two just-published papers address the genetic subtext of alterations that aren’t easy to spot because they don’t affect the DNA sequence.

The FBI uses 13 STRs in forensic investigations.

TREDPARSE Tracks Repeats

Short tandem repeats (STRs) lie behind the FBI’s forensic testing (CODIS) and older genetic genealogy tests, and trios of DNA bases are one type of STR. Expanded triplet repeats in certain genes cause devastating diseases. Huntington’s disease (HD) is the classic, caused by 40 or more repeats of the DNA sequence CAG in the first protein-encoding part of the gene, gunking up the encoded protein. DNA Science covered a family here in which a little girl, Karli, had 99 repeats.

Other triplet repeat diseases (“TREDs”) are myotonic dystrophy type 1 (CTG repeats), Friedreich’s ataxia (GAA repeats), and fragile X syndrome (CGG repeats). There are about a dozen more, as well as a few with larger motifs, such as a form of ALS in which GGGGCC echoes.

Repeats tend to grow when the two strands of the double helix misalign, a little like mismatching the sides of a zipper and zipping anyway, leaving a longer half hanging. Repeats can be in the protein-encoding parts (HD) or in control regions (fragile X syndrome).

Wherever a repeat happens to be, and whatever its size, many copies register as one to a DNA sequencing machine. CAG looks much the same as CAGCAGCAG or even Karli’s CAG x 99.
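To make the counting problem concrete, here is a minimal Python sketch that finds the longest uninterrupted run of a motif in a stretch of DNA. It is not how TREDPARSE works; it assumes the whole region is already available as one clean string, which short sequencing reads generally can't guarantee for long repeats. The function names and the toy sequences are made up for illustration, and the threshold of 40 is the HD figure quoted above.

```python
import re

HD_THRESHOLD = 40  # CAG repeats at or above this count are associated with HD (from the text)

def longest_repeat_run(sequence: str, motif: str = "CAG") -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)

def is_expanded(sequence: str, motif: str = "CAG", threshold: int = HD_THRESHOLD) -> bool:
    """True if the longest run of `motif` reaches the threshold."""
    return longest_repeat_run(sequence, motif) >= threshold

# Toy sequences (hypothetical flanking bases, not real gene sequence)
toy_typical = "ATG" + "CAG" * 20 + "CCGCCA"
toy_expanded = "ATG" + "CAG" * 99 + "CCGCCA"

print(longest_repeat_run(toy_typical), is_expanded(toy_typical))
print(longest_repeat_run(toy_expanded), is_expanded(toy_expanded))
```

The catch is that 99 copies of CAG span 297 bases, so a sequencing read shorter than that can never contain the whole run plus its flanks. That is one way to picture why long repeats can only be reported as a range.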

A new software program called TREDPARSE, developed by Haibao Tang, PhD, and colleagues at Human Longevity Inc., can measure all but the very longest repeats. And that’s important because in HD, size matters; severity tracks with repeat length. Their article, “Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes,” appears in this month’s American Journal of Human Genetics.

The team deployed TREDPARSE on 12,632 sequenced genomes, and found 138 people with extra repeats in the genes implicated in 15 triplet repeat diseases. The repeat lengths were the same within families that had more than one affected person, such as in a parent with HD and children who had inherited the mutation and were therefore at risk.

Although the method hasn’t yet been compared to existing tests to measure repeats, the team has done many validations, and “for individuals with HD repeats over 40, I strongly believe we can identify most risk alleles. Our method is computational, and one of the best in its class, based on whole genome sequencing data. One caveat is that for long repeats, we cannot assay the exact number of repeats. However, we would still be able to report a range that facilitates a qualitative assessment of whether an individual is ‘at-risk’ or not,” said Dr. Tang.

The research repercussions of the new repeat test may be as important as the clinical ones, or more so, because finding telltale expanded DNA repeats in families who do not have the disease can reveal genes other than the one behind HD that protect against it. Population studies reveal 6.5 to 15 HD cases per 100,000 in the US, but the new study finds the triplet repeat expansion to be about three times as common as this.
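A quick back-of-the-envelope calculation shows what "about three times as common" implies. The sketch below only multiplies the reported range by three and converts it to "1 in N" form; the factor of three is the approximate figure from the paragraph above, so the outputs are rough estimates, not numbers taken from the paper.

```python
# Reported HD prevalence in the US, cases per 100,000 (from the text)
reported_low, reported_high = 6.5, 15.0
fold_increase = 3  # "about three times as common" (approximate)

implied_low = reported_low * fold_increase
implied_high = reported_high * fold_increase

def one_in(rate_per_100k: float) -> float:
    """Convert a rate per 100,000 people into a '1 in N' figure."""
    return 100_000 / rate_per_100k

print(f"Implied expansion frequency: {implied_low} to {implied_high} per 100,000")
print(f"Roughly 1 in {one_in(implied_high):.0f} to 1 in {one_in(implied_low):.0f} people")
```

The gap between how many people carry an expansion and how many are diagnosed is what makes the search for protective genes plausible.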

A balanced translocation (top) doesn’t alter the amount of information, just rearranges it (credit: CNIO).

Rearrangements That Underlie Duchenne Muscular Dystrophy

An even more subtle genetic change than a triplet repeat is a rearrangement that shuffles things around – inversions and translocations (such as two different chromosomes swapping parts). They cause harm if they disrupt a vital gene’s DNA sequence or jettison DNA bases, but can silently pass through generations, only alerting a genetic counselor when a family has several instances of infertility, pregnancy loss, or birth defects, due to sperm or eggs with extra or missing chromosome parts. (Here’s my take on how translocations could fashion a new human species.)

In the second of the new papers, Eric Vilain, MD, PhD, at the Children’s National Health System, and colleagues used Bionano genome mapping to spot a flipped chunk of the X chromosome in a boy with Duchenne muscular dystrophy (DMD). Sequencing of the dystrophin gene and exome sequencing had missed the mutation, and neither chromosomal microarray analysis nor MLPA, which both detect deletions and small-scale repeats, had picked it up. The boy had been diagnosed from the symptoms and findings from a painful muscle biopsy. The results are published in Genome Medicine (“Next-generation mapping: a novel approach for detection of pathogenic structural variants with a potential utility in clinical diagnosis”).

Bionano Genomics developed the technology (next-generation mapping or NGM) behind the test (Saphyr). It detects “order and orientation,” spotting deletions, inversions, and translocations rather than just the DNA sequence. It works by fluorescently marking landmarks in big hunks of DNA, creating glowing tags that “are read efficiently by molecular combing within nanochannels,” which sounds very poetic and has something to do with seeing single molecules.
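The idea of reading "order and orientation" instead of the letters themselves can be illustrated with a toy Python sketch. It treats a chromosome as an ordered list of landmark labels and looks for one block that appears in reverse order compared with a reference. This is not Bionano's algorithm; the label names and the `find_inversion` helper are hypothetical, and the sketch assumes exactly one simple inversion with no missing or extra landmarks.

```python
def find_inversion(reference, sample):
    """Return (start, end) indices of a single reversed block, or None.

    Assumes both lists hold the same landmark labels and differ, if at all,
    by one inverted segment.
    """
    if sorted(reference) != sorted(sample) or reference == sample:
        return None
    differing = [i for i in range(len(reference)) if reference[i] != sample[i]]
    start, end = differing[0], differing[-1]
    if sample[start:end + 1] == reference[start:end + 1][::-1]:
        return (start, end)
    return None

# Hypothetical landmarks along a stretch of reference DNA
reference = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"]
# Same landmarks, but the block L3..L6 is flipped
sample = reference[:2] + reference[2:6][::-1] + reference[6:]

print(sample)
print(find_inversion(reference, sample))
```

Both lists contain exactly the same labels, which mirrors why a test that only inventories the content can come back clean while the layout is scrambled.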

The researchers tackled the dystrophin gene because it’s the largest known human gene and DMD most often arises from missing parts of it. The technology identified the mutations in 8 affected boys and 3 of their mothers, including the boy who had 5.1 million DNA bases flipped.

This isn’t the reverse of the global warming hockey stick but a depiction of plummeting genome sequencing costs.

Practically speaking, NGM provides results within 2 weeks, and the cost is the same as that for whole genome sequencing, which has plummeted. The DMD case illustrates that the new test could, in some cases, replace four different types of testing, promising to shorten the diagnostic odysseys that many families with rare diseases endure. “The scientific community will be able to solve a larger fraction of undiagnosed genetic diseases,” the researchers conclude.

A Bit of Perspective

The various incarnations of “the” human genome sequence – the draft in 2001, the official version in 2003 – were only the beginning. Even the clinical exome sequencing that came along a few years later only probes the 2% or so of the genome that encodes protein, missing the controls and many mysteries. Some of those mysteries lie not in the DNA base sequences, but in how the information is laid out.

The two new technologies should open up the field of reading the subtext in the sequences, enabling more families to put a name to their loved ones’ symptoms.

