When it comes to estimating the risk of a disease that is either genetic or has a genetic component, an individual’s ancestry plays an important role. That’s because increased risk of a particular health condition may be associated with a gene variant (aka mutation) in one population, but not in another. Someone from a group not represented in the data on which a clinical test is based could receive an incorrect risk assessment, or even be prescribed a drug unlikely to work.
A team from the Johns Hopkins Bloomberg School of Public Health and the National Cancer Institute has developed a new algorithm for genetic risk-scoring for major diseases across diverse ancestral populations. Their findings are published in Nature Genetics.
Although the algorithm is a start, and takes a logical approach to address health care disparities, it doesn’t go far enough. Considering large groups – like Latinos or Africans – doesn’t parse humanity sufficiently to hold much predictive power for genetic diseases, or conditions with large genetic components.
Tools to Track Disease: Biobanks to AI
The new investigation looks at points in a human genome that can be any of the four types of DNA building blocks – A, C, T, or G. The patterns of these single nucleotide polymorphisms – SNPs – differ among population groups. Because considering thousands or millions of SNPs in genome-wide association studies (GWAS) is cumbersome, genome researchers invented an abbreviated form of the data: a polygenic risk score (PRS).
Assembling panels of SNPs and associating them with health conditions began in 2002 with the International HapMap Project, shortly after the first human genomes were sequenced. A few hundred thousand or even a million SNPs is a shortcut, less to analyze than the 3.2 billion base pairs of a genome. A PRS is simpler still.
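To make "simpler still" concrete: in its most basic textbook form, a PRS boils a person's SNPs down to one number, a weighted sum of how many copies of each risk allele the person carries. The sketch below is only an illustration of that idea, not the method in the new paper. The SNP names, the weights, and the function `polygenic_risk_score` are all made up for the example.

```python
# Hypothetical per-allele weights for three invented SNP IDs.
# Real weights would be estimated in a GWAS; these numbers are illustrative only.
weights = {"snp_a": 0.12, "snp_b": -0.05, "snp_c": 0.30}


def polygenic_risk_score(genotype, weights):
    """Return the weighted sum of risk-allele counts.

    genotype maps a SNP ID to the number of risk alleles carried (0, 1, or 2).
    SNPs that were not genotyped are simply skipped in this sketch.
    """
    score = 0.0
    for snp, weight in weights.items():
        count = genotype.get(snp)
        if count is None:
            continue  # no data for this SNP
        if count not in (0, 1, 2):
            raise ValueError(f"allele count for {snp} must be 0, 1, or 2")
        score += count * weight
    return score


# An invented person: two risk alleles at snp_a, one at snp_b, none at snp_c.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
print(polygenic_risk_score(person, weights))
```

The weights are the weak point. If they were estimated in one population, nothing guarantees that they carry over to another.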
That’s a lot of abbreviations: SNP, GWAS, PRS. They all build on places where genomes vary, and thus provide information on specific traits, like disease risks.
People with a particular illness – say, gout – may have a distinctly different genetic profile (SNP pattern) than people who do not have the painful condition. These sorts of data are associations, risk-raisers – not evidence quite definitive enough to back up a clinical diagnosis based on exams and other types of tests.
One reason why polygenic risk scores are fuzzy is that for many years, the populations from which they were derived, and on which the initial GWAS were based, were predominantly European (aka white). DNA samples used in genetic research came mostly from deCODE Genetics (begun in 1996, on the Icelandic population) and the UK Biobank (begun in 2006).
The new study considers 19 million SNPs across human genomes sampled from four data sources: the UK Biobank, All of Us from the US, 23andMe, and the Global Lipids Genetics Consortium. Together they involve 5.1 million individuals of diverse ancestry, of whom 1.18 million come from four non-European populations.
The investigation considered 13 “complex traits” – those that have genetic as well as environmental influences. These are common conditions; the news release for the new paper mentions “cancers, coronary artery disease, and depression” analyzed among five ancestry categories: European, African, Latino, East Asian, and South Asian.
The researchers also conducted simulation studies with their new method, a machine-learning-based tool called CT-SLEB, extrapolating from biobank data.
“Our method can help close the risk-scoring performance gap for non-European-ancestry populations. At the same time, we can’t fully close the gap with new methods alone—we also need larger datasets on these populations,” said senior author Nilanjan Chatterjee.
A New Approach: Incorporate Pangenomics
But the challenge is more than just increasing the numbers in existing biobanks – it is in capturing the genetic fine structure of populations.
Consider Steel syndrome, a collagen (COL27A1) disorder. Symptoms include joint pain, hip dislocation, fused finger and toe bones, scoliosis, and a pinched neck. The person is short with a large head and a characteristic long, oval face, with a prominent forehead, broad nose, and small low ears turned slightly backward.
Eimear Kenny, of the Institute for Genomic Health at the Icahn School of Medicine at Mount Sinai, and her colleagues have studied Steel syndrome in East Harlem for years. It is rare in the general population, and even among Latinos living in East Harlem, yet clustered among the 8,000 or so people of Puerto Rican ancestry in that community. Knowing the ancestry of a patient can refine and target the diagnosis. Otherwise, if the syndrome is misdiagnosed as a different orthopedic condition, a physician might perform hip or other surgery that can actually worsen symptoms.
The Mount Sinai team identified a genetic signature of Puerto Rican descent so distinctive that they could even tell people who knew only that they were Hispanic/Latino that their ancestors came from the island. To identify the genetic clues, the researchers analyzed 600,000 SNPs among 11,000 DNA samples from a biobank of New York City residents.
But another way of viewing genomes may soon overtake SNPs and the polygenic risk scores that represent them.
Enter the Pangenome
It was with a feeling of déjà vu that I wrote The Age of the Pangenome Dawns here at DNA Science a few months ago, as I recalled the hoopla over sequencing “the” human genome back in the late 1980s.
The human pangenome is on a different scale. In theory, it is the idea that human genomes vary in many ways. In practice, the broader view is being used to generate representations, called genome graphs, that depict maximal human genome diversity – that is, all of the ways that the sequence of some 3.05 billion DNA base pairs can vary.
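A toy example may help picture a genome graph. In the sketch below, each node holds a stretch of DNA sequence, each edge says which stretch may follow which, and every route through the graph spells out one possible version of the sequence. This is my own minimal illustration, not the consortium's data format or software. The sequences, the node numbers, and the function `spell_paths` are invented, and the sketch assumes the graph has no loops.

```python
# Toy genome graph. Nodes hold sequence segments; edges list allowed successors.
# Nodes 2 and 3 are alternative alleles at one variable site.
nodes = {
    1: "ACGT",
    2: "A",
    3: "G",
    4: "TTC",
}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(node, nodes, edges):
    """Yield every sequence spelled by a path from `node` to a dead-end node.

    Assumes the graph is acyclic; a loop would make this recurse forever.
    """
    successors = edges[node]
    if not successors:
        yield nodes[node]
        return
    for nxt in successors:
        for tail in spell_paths(nxt, nodes, edges):
            yield nodes[node] + tail


haplotypes = list(spell_paths(1, nodes, edges))
print(haplotypes)  # ['ACGTATTC', 'ACGTGTTC']
```

One variable site gives two paths. Each additional independent site doubles the count, which is why the full human graph is a job for serious computing power.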
The sponsoring Human Pangenome Reference Consortium is creating a “genome reference representation that can capture all human genome variation and support research on the full diversity of populations.” But it’s yet to make many headlines. Perhaps a science-weary public is tired of endless “breakthroughs,” and has become a bit genomed-out.
Although the first public release of a genome graph represented a mere 47 individuals, with 350 as the goal, the challenge is clearly one meant for artificial intelligence. An AI approach to all possible human genome sequences could also embrace copy number variants (short repeated DNA sequences, the number of repeats serving as the informational content), as well as gene-gene interactions (epistasis).
Overlaying population-specific data onto the genome sequences of patients for diagnostic purposes could counter the health care inequity of assuming all genetic backgrounds echo those of the UK Biobank or another homogeneous resource. But I fear that information technology may quickly render SNP arrays and polygenic risk scores obsolete for that purpose. Would it be more economically feasible to wait for AI-guided genomic analysis to catch up?