How DNA Ancestry Testing is Like the Wheel of Fortune: Filling in the Blanks
“Ricki, we have another one.”
AncestryDNA has spit out yet another half-sibling.
I’m a curious hybrid: a geneticist and an “NPE” – “not parent expected” – individual. My few earlier posts about it were from January, when I thought I had only one half-sibling. The discovery of others, some of whom know they were donor-conceived (DC), more or less confirms half of my origin.
So I wondered, on what percentage of a human genome’s 3.2 billion DNA bases do the consumer DNA ancestry companies base these deductions that can shatter lives?
Do some tube-spitters and cheek-swabbers assume entire genomes are compared? No, not for $100, not just yet. Others might not know what SNPs are: the single-DNA-base markers that align in haplotypes on chromosomes and are used to match people, by algorithms trained on known relationships. With millions of people taking ancestry tests, the associations used in the matches are quite robust.
But I thought I’d do a small calculation to clarify what’s probed when you plop your genetic essence into the mail, and how it represents an individual.
A Literary Analogy
A “single nucleotide polymorphism” is a site in the genome where the DNA base – A, T, C, or G – varies in a population. Let’s say an ancestry company compares about 770,000 of them (it’s a little less). Charting 770,000 SNPs is a way to distinguish individuals, but it’s a tiny sample of the total 3.2 billion bases.
Dividing 770,000 by 7, which I take to be the average number of letters in a word, gives 110,000 – about the number of words for a 300-or-so page book. In fact, linked SNPs do tend to pass from generation to generation in groups along their chromosome, like words or sentences.
My book about the first FDA-approved gene therapy is 101,305 words – close to the standard 100,000 stipulated in my contract. So I’ll use that figure for the analogy.
Expressed as 7-SNP words, the markers used in ancestry testing, compared to the number of DNA words in a human genome (3.2 billion DNA bases divided by 7), are the equivalent of 24 of the 101,305 words of my book, or about 0.024 percent. But it’s still a powerful stand-in.
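For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation. The figures (770,000 SNPs, 3.2 billion bases, 7 letters per word, 101,305 words) come from the text above; the constant and function names are made up for illustration.

```python
# Figures are taken from the text; all names are illustrative only.
SNPS_TESTED = 770_000          # approximate markers an ancestry company compares
GENOME_BASES = 3_200_000_000   # DNA bases in a human genome
LETTERS_PER_WORD = 7           # assumed average number of letters in a word
BOOK_WORDS = 101_305           # word count of the book used in the analogy


def snp_words(snps: int = SNPS_TESTED, letters: int = LETTERS_PER_WORD) -> float:
    """Number of 7-letter 'words' the tested SNPs would spell."""
    return snps / letters


def book_equivalent(snps: int = SNPS_TESTED,
                    bases: int = GENOME_BASES,
                    book_words: int = BOOK_WORDS) -> float:
    """Scale the fraction of the genome probed to a book of `book_words` words."""
    return snps / bases * book_words


if __name__ == "__main__":
    print(f"SNP 'words': {snp_words():,.0f}")
    print(f"Fraction of genome probed: {SNPS_TESTED / GENOME_BASES:.4%}")
    print(f"Equivalent words in the book: {book_equivalent():.1f}")
```

The letters-per-word figure cancels out of the final ratio, because both the SNPs and the genome are divided by 7. The book comparison therefore depends only on the fraction of bases probed.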
Not All SNPs are Equally Meaningful
The algorithms that chew up all that DNA ancestry data seek rare SNP combos to match people. Using another literary analogy, consider these words as representing common SNP sequences:
bumblebee
xenophobia
tuba
spleen
cupcake
Introduce rare spelling errors (mutations, shown in CAPS):
bumblebeT
Zenophobia
tubE
sAleen
cupcUke
If these typos were in a term paper, they’d shout plagiarism, which is perhaps a twisted metaphor for people discovering surprise relatives. Shared ancestry is a more likely explanation for two people sharing swaths of SNPs than the coincidence of undergoing several identical mutations, just as five term papers with the same errors are more likely to have come from the same online source than from five students making the exact same spelling errors. (I get weekly Google Alerts when students buy papers about my book.) The more rare SNPs two people share, the more closely related they are.
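The term-paper logic can be sketched as a toy in Python. This is only an illustration of the analogy, not how any ancestry company’s matching algorithm works. The function names and the three sample “papers” are hypothetical.

```python
# Toy illustration of the term-paper analogy; not a real matching algorithm.
COMMON = ["bumblebee", "xenophobia", "tuba", "spleen", "cupcake"]


def rare_typos(paper: list[str], reference: list[str] = COMMON) -> set[tuple[int, str]]:
    """Return (position, spelling) pairs where a paper departs from the common spelling."""
    return {(i, word) for i, (word, ref) in enumerate(zip(paper, reference)) if word != ref}


def shared_typos(paper_a: list[str], paper_b: list[str]) -> int:
    """Count identical rare misspellings found at the same positions in both papers."""
    return len(rare_typos(paper_a) & rare_typos(paper_b))


# Hypothetical papers: the first carries all five typos from the list above.
paper_1 = ["bumblebet", "zenophobia", "tube", "saleen", "cupcuke"]
paper_2 = ["bumblebet", "zenophobia", "tube", "spleen", "cupcake"]  # three of the same typos
paper_3 = ["bumblebee", "xenophobia", "tuba", "spleen", "cupcake"]  # common spellings only

print(shared_typos(paper_1, paper_2))  # 3
print(shared_typos(paper_1, paper_3))  # 0
```

Papers 1 and 2 share three identical rare errors, which points to a common source. Paper 3 matches only the common spellings, which say nothing about where it came from.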
Size matters too. The bigger the DNA words (haplotypes) and the fewer of them, the more recent the shared ancestry. This is because it takes time, measured in generations, for hunks of chromosomes to exchange places and break up the genetic linkage.
When I printed out my chromosomes aligned with those of my first-discovered half-sister, the swatches of yellow revealed our common ancestry, beyond the backdrop of sameness that all of us of Ashkenazi Jewish descent share. I didn’t need numbers or DNA base sequences to see that.
Deducing relatedness from chromosomal territory defined by the same linked SNPs is like filling in the blanks on Wheel of Fortune to reveal enough of a phrase to recognize it. My first half-sister and I share enough chromosomal real estate for the message to have jumped out at me.
A Planet of the Apes Ending?
In the final scene of the original Planet of the Apes film, a few words uttered in anguish make all the difference, like 24 words in a 100,000-word book, or 770,000 SNPs representing a human genome.
“You maniacs! You blew it up! Damn you! Goddamn you all to hell! You finally really did it!” Charlton Heston’s stranded astronaut Colonel George Taylor bellows as he plunges his fists into the sand upon seeing the upper torso and head of the Statue of Liberty emerging from the surf. His words bring the meaning crashing down.
Finding, at last count, 7 half-siblings hit me like Colonel Taylor’s view of the statue: shock, disbelief, and then shuffling through alternative explanations until acknowledging a totally unexpected reality. The relatives that keep popping up at AncestryDNA are gradually filling in the most recent twigs on a complicated family tree. Will a paternal identity eventually emerge, like a hidden Wheel of Fortune message? I’m not sure I want to know.
But I think some geneticists may have foreseen the current explosion of genetic revelations roiling established families and jumpstarting new ones.
TMGI: Too Much Genetic Information
In the years following the sequencing of the first human genomes, publishers snapped up books written by prominent journalists and scientists, all with the same idea: have their genomes sequenced, report on what was and is, and predict what would be. Meanwhile, it was a time of bizarre firsts. The first Asian genome! The first ancient African genome! As if modern people are as different as Romulans and Klingons. At the same time, many researchers were onto a more meaningful quest: identifying how we differ, at each spot in the genome.
It was called the HapMap Project, announced just a few months after the first human genome sequences were published in February 2001. The two giant sequencing papers followed the staged announcement in the White House Rose Garden the previous June, the date chosen because it was the only open one in the calendar.
The HapMap Project identified 10 million single-base landmarks, SNPs, splattered across the chromosomes and then winnowed them down to roughly 500,000 “tag SNPs,” like chapter titles in a book that indicate content.
SNPs, organized into haplotype blocks, are the bits of information ancestry DNA testing companies compare to spring those surprise relatives on us. Haplotypes gathered into much longer haplogroups are used to trace geographic origins, generating the pretty pie charts that festoon the TV ads. The Ashkenazi ones are boringly the same hue.
And so the HapMap data begat consumer DNA ancestry testing. Surely some if not many geneticists had an inkling, perhaps a vision, of the day when ordinary people could easily compare key parts of their genomes. I wrote in 2007 about the initial intensely negative reaction of attendees at the American Society of Human Genetics annual meeting to presentations by three of the pioneer direct-to-consumer genetic testing companies describing what they were about to launch.
In retrospect, being an NPE, I’d say unleash.
Perhaps when the consumer DNA tests hit the market in 2008 and became the butt of jokes – the New Yorker’s “spit parties” – the scientists at the companies were only thinking of the haplogroups that would reveal, confirm, or counter where people thought they came from, at the continental or country level. Not so much the haplotypes that would be fleshed out as more genomes were sequenced.
But were they thinking that finding that one’s BCF (birth certificate father) is not in fact the BF (birth father) could expose affairs and rapes, break up families, and out sperm donors once assured anonymity? It works the other way too: egg donors who never meant to be identified are found out.
Were the scientists at the companies initially thinking of the consequences of having to tell a nonagenarian that her secrets of a lifetime ago were now revealed, without her consent?
Bioethics seems to be lagging behind the science, dumping a mess onto the mental health community. Dragging out the trite comparisons – a can of worms, Pandora’s box – doesn’t quite cut it for me, as a DC-NPE as well as a geneticist. I don’t have the words to adequately describe the churning emotions that 770,000 SNPs can evoke.