A “deep-learning” algorithm shines a light on mutations in once-obscure areas of the genome
The so-called “streetlight effect” has often fettered scientists who study complex hereditary diseases. The term refers to an old joke about a drunk searching for his lost keys under a streetlight. A cop asks, “Are you sure this is where you lost them?” The drunk says, “No, I lost them in the park, but the light is better here.”
For researchers who study the genetic roots of human diseases, most of the light has shone down on the two per cent of the human genome that includes protein-coding DNA sequences. “That’s fine. Lots of diseases are caused by mutations there, but those mutations are low-hanging fruit,” says Brendan Frey, a U of T computer engineer who studies genetic networks. “They’re easy to find because the mutation changes one amino acid to another one, and that very much changes the protein.”
The trouble is, many disease-related mutations also happen in non-coding regions of the genome – the parts that do not directly make proteins but that still regulate how genes behave. Scientists have long been aware of how valuable it would be to analyze the other 98 per cent, but there has not been a practical way to do it.
Now Frey has developed a “deep-learning” algorithm that effectively shines a light on the entire genome, identifying patterns of mutation across coding and non-coding DNA alike. The algorithm can also predict how likely each variant is to contribute to a given disease. “Our system can predict whether or not a mutation will cause a change in RNA splicing that could lead to a disease phenotype,” he says. RNA splicing is one of the major steps in turning genetic blueprints into living organisms. Splicing determines which bits of DNA code get included in the messenger-RNA strings that build proteins. Different configurations yield different proteins. Misregulated splicing contributes to an estimated 15 to 60 per cent of human genetic diseases.
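For readers who think in code, the idea that splicing decides which segments end up in the messenger RNA can be pictured with a toy sketch. Everything here is invented for illustration: the `splice` function, the exon names and the short sequences are assumptions, not part of Frey's system, and real splicing is far more intricate.

```python
# Toy illustration only: exon names and sequences are invented.
def splice(exons, included):
    """Join the exons whose names are in `included`, keeping their original order."""
    return "".join(seq for name, seq in exons if name in included)

# Hypothetical transcript with three exons (introns omitted for brevity).
exons = [("exon1", "AUGGCU"), ("exon2", "GAAUUC"), ("exon3", "CCAUAA")]

# Usual splicing keeps all three exons.
normal = splice(exons, {"exon1", "exon2", "exon3"})

# A splicing change, such as one a mutation might cause, skips exon2.
skipped = splice(exons, {"exon1", "exon3"})

print(normal)
print(skipped)
```

The two messages differ only in whether the middle segment is included, which is the sense in which different splicing configurations yield different proteins.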
Frey, who holds the Canada Research Chair in Information Processing and Machine Learning, trained his algorithm using millions of data points. The algorithm was then able to extrapolate how likely it was that any of tens of thousands of mutations could cause a splicing error associated with a particular disease.
The research team tested the method by showing it could detect genes related to spinal muscular atrophy as well as nonpolyposis colorectal cancer. Frey says the team’s “most ambitious case” was its study of autism spectrum disorder; about 100 genes are known to be associated with it. In fact, many researchers think it is likely that autism comprises many disorders, each arising from unique mutations but all producing common symptoms.
Working with U of T autism researcher Stephen Scherer, Frey compared mutations in autism patients’ genomes with those of controls. Nothing unusual popped up. But when Frey and Scherer tested the genomes against the mutations flagged by Frey’s algorithm, they “saw patterns emerge.” According to Frey, “Kids with autism are more likely to have these ‘high-scoring’ mutations that change the meaning of the genome, and that are thought to be involved with brain functions and developmental functions.”
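The comparison described above can be pictured as asking what share of each group's mutations count as "high-scoring." The sketch below is a rough illustration under stated assumptions: the scores, the cutoff and the function name `fraction_high_scoring` are all hypothetical and are not data or methods from the study.

```python
def fraction_high_scoring(scores, threshold):
    """Share of a group's mutations whose score meets or exceeds the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

# Invented scores for illustration, NOT results from the study.
case_scores = [0.9, 0.2, 0.8, 0.7, 0.1]
control_scores = [0.3, 0.1, 0.8, 0.2, 0.4]
THRESHOLD = 0.6  # hypothetical cutoff for a "high-scoring" mutation

case_frac = fraction_high_scoring(case_scores, THRESHOLD)
control_frac = fraction_high_scoring(control_scores, THRESHOLD)

print(f"cases: {case_frac:.2f}, controls: {control_frac:.2f}")
```

In a real analysis the difference between the two shares would need a proper statistical test on many genomes; the sketch shows only the shape of the comparison.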
Not only did the algorithm’s analysis fit with existing knowledge about autism genetics, but it also identified 17 new disease-causing gene candidates. With each of the three diseases addressed in the study, the algorithm both made predictions consistent with existing data and pointed toward additional regions of the genome where researchers might search next.
This is a shorter version of an article that appeared on scientificamerican.com in December 2014.