Scientists are using A.I. to create artificial human genetic code


Since at least 1950, when Alan Turing’s famous “Computing Machinery and Intelligence” paper was first published in the journal Mind, computer scientists interested in artificial intelligence have been fascinated by the notion of coding the mind. The mind, so the theory goes, is substrate independent, meaning that its processing ability does not, by necessity, have to be attached to the wetware of the brain. We could upload minds to computers or, conceivably, build entirely new ones wholly in the world of software.

This is all familiar stuff. While we have yet to build or re-create a mind in software, outside of the lowest-resolution abstractions that are modern neural networks, there is no shortage of computer scientists working on this effort right now.

Artificial genetic data

“Creating artificial genetic data that are realistic enough, without directly copying the sequences, is a very hard problem,” Flora Jay, a researcher specializing in machine learning and population genetics at Paris-Saclay University, told Digital Trends. “Genetic data is of high dimension, and you cannot just eyeball what’s important or not. We thus turned to cutting-edge techniques [being] applied to the computer vision, text, music, or protein world. These generative networks — GANs and [restricted Boltzmann machines] — are designed so that they can progressively and automatically learn how to create artificial genetic sequences.”

A GAN, or generative adversarial network, is a class of machine-learning framework introduced by researcher (and current Apple employee) Ian Goodfellow. It uses a combative, tug-of-war approach to improve its generative outcomes, and consists of two neural networks, a “generator” and a “discriminator,” which pass outputs to each other.

The generator’s job is to create something, be it an A.I. painting or a chunk of code representing an artificial genome in the form of ones and zeroes. The discriminator, like a bot version of J.K. Simmons’ perfectionist music instructor in the movie Whiplash, then critiques its efforts and sends this back to the generator. The generator learns from this feedback, while the discriminator similarly gets ever better at guessing what’s been created by the generator and what is the genuine article. Eventually, the generator is so good at creating fake versions of whatever it is attempting that the discriminator can be fooled. It’s no longer able to differentiate real from fake.
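For readers who want to see that tug-of-war in code, here is a deliberately tiny sketch in plain Python. It is not the architecture used in the study: the eight-site "genome," the made-up allele frequencies in `REAL_FREQS`, the single-layer `Generator` and `Discriminator` classes, and the learning rate are all assumptions chosen to keep the example short. The generator turns random noise into per-site probabilities of a 1, the discriminator scores how "real" a sequence looks, and each is updated against the other in turn.

```python
import math
import random

N_SITES = 8  # hypothetical number of variable positions in the toy "genome"
# Made-up allele frequencies standing in for a real reference panel.
REAL_FREQS = [0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6]


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_real(rng):
    # Draw one "real" binary sequence from the toy frequencies.
    return [1.0 if rng.random() < f else 0.0 for f in REAL_FREQS]


class Generator:
    """Maps a scalar noise value z to a per-site probability of a 1."""

    def __init__(self, rng):
        self.a = [rng.gauss(0, 0.1) for _ in range(N_SITES)]
        self.b = [0.0] * N_SITES

    def forward(self, z):
        return [sigmoid(a * z + b) for a, b in zip(self.a, self.b)]


class Discriminator:
    """Logistic regression that outputs the probability that x is real."""

    def __init__(self, rng):
        self.v = [rng.gauss(0, 0.1) for _ in range(N_SITES)]
        self.c = 0.0

    def forward(self, x):
        return sigmoid(sum(v * xi for v, xi in zip(self.v, x)) + self.c)


def train(steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    G, D = Generator(rng), Discriminator(rng)
    for _ in range(steps):
        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        real = sample_real(rng)
        fake = G.forward(rng.gauss(0, 1))
        d_real, d_fake = D.forward(real), D.forward(fake)
        for i in range(N_SITES):
            D.v[i] -= lr * ((d_real - 1.0) * real[i] + d_fake * fake[i])
        D.c -= lr * ((d_real - 1.0) + d_fake)

        # Generator step: minimise -log D(fake), i.e. try to fool D.
        z = rng.gauss(0, 1)
        fake = G.forward(z)
        d_fake = D.forward(fake)
        for i in range(N_SITES):
            # Chain rule through D's logit and the generator's sigmoid.
            grad = (d_fake - 1.0) * D.v[i] * fake[i] * (1.0 - fake[i])
            G.a[i] -= lr * grad * z
            G.b[i] -= lr * grad
    return G, D


def generate(G, rng):
    # Turn the generator's probabilities into a string of ones and zeroes.
    probs = G.forward(rng.gauss(0, 1))
    return [1 if rng.random() < p else 0 for p in probs]
```

Calling `G, D = train()` and then `generate(G, random.Random(1))` returns one artificial eight-bit sequence. A real system would use deep networks and far longer sequences, and a discriminator this simple can only pick up on crude differences between real and generated data.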

“One of the main problems here is assessing the quality of artificial genomes,” Burak Yelmen, a Ph.D. student at the University of Tartu’s Institute of Genomics, told Digital Trends. “You can look at an image and decide if it looks real, but this is not possible for genomes. [The] majority of the analyses we performed in our study was to see whether the artificial genome chunks we generated really looked like the real ones.”
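The article does not spell out which analyses the team ran, so the following is only an illustration of the general idea: since you cannot eyeball a genome, you compare summary statistics instead. This sketch assumes sequences are encoded as lists of ones and zeroes, computes how often a 1 appears at each position in a real and an artificial panel, and then measures how closely the two sets of frequencies track each other. The function names and the tiny example panels are hypothetical.

```python
import math


def allele_frequencies(haplotypes):
    """Fraction of sequences carrying a 1 at each position."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    return [sum(h[i] for h in haplotypes) / n for i in range(n_sites)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Made-up panels: four sequences, four positions each.
real = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 0, 1], [0, 0, 1, 1]]
fake = [[1, 0, 1, 1], [1, 0, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1]]

real_af = allele_frequencies(real)  # [0.75, 0.25, 0.5, 1.0]
fake_af = allele_frequencies(fake)  # [1.0, 0.25, 0.5, 0.75]
similarity = pearson(real_af, fake_af)  # approximately 0.8
```

A correlation close to 1 would suggest the artificial panel reproduces the real per-site frequencies. It says nothing about subtler structure, such as how positions vary together, which is why a single statistic like this would never be enough on its own.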

Don’t worry, though. Despite a growing mass of articles about highly dubious gene tampering designed to rewrite the human code, this work is not about trying to “write” new parentless humans who could be created with the aid of supercomputers.

A chromosome emerges from random digital noise — Burak Yelmen

“To be clear, the objective of our work is to better understand and encode the existing genetic diversity of thousands or millions of people around the world, not to create artificial cells,” Jay said. “The neural networks are trained on this existing diversity, so the generated genomic regions do not carry additional novel mutations that could easily disrupt the functionality of a sequence — and they include, untouched, the segments that are conserved across human populations.”

Jay noted that, at the whole-genome scale, it is “difficult to say” whether a specific combination of millions of generated nucleotides could indeed be “functional.” In other words, don’t expect to compile and run this code and have a fully formed person (or their blueprints) emerge at the other end. Instead, the purpose is something altogether less sinister and, potentially, more useful.

All about data privacy

“There is an immense amount of data in biobanks and it keeps increasing every day,” said Yelmen. “However, genomic data is sensitive data and accessing these biobanks can be difficult for researchers due to ethical concerns. The main goal of our work is to create high-quality surrogates of existing genome banks and provide a solution to this accessibility barrier within a safe ethical framework. It is important to note that our study was a first step: There is still work to do.”

Added Jay: “The idea behind our study is to start investigating whether releasing artificial genomes instead of the real ones could preserve the privacy of genome donors, while providing useful information to the population genetics community. [Possible] applications of artificial genomes could range from better understanding of our evolutionary past to providing insights in medical genetics, including a wider range of diversity.”
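One basic sanity check that follows from this goal, and from Jay's earlier point about not directly copying sequences, is to ask how close each artificial sequence sits to its nearest real one. The sketch below is an assumption about how such a check might look, not the privacy analysis from the study. It counts the positions at which two sequences differ (the Hamming distance) and flags artificial sequences that are exact copies of a donor's. All names and data here are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_real_distance(artificial, real_panel):
    """Distance from one artificial sequence to its closest real sequence."""
    return min(hamming(artificial, r) for r in real_panel)


def count_exact_copies(artificial_panel, real_panel):
    """How many artificial sequences are identical to some real sequence."""
    real_set = {tuple(r) for r in real_panel}
    return sum(tuple(a) in real_set for a in artificial_panel)


# Made-up panels: three sequences, five positions each.
real_panel = [[0, 1, 1, 0, 1], [1, 1, 0, 0, 1], [0, 0, 1, 1, 0]]
artificial_panel = [[0, 1, 1, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 1, 1]]

distances = [nearest_real_distance(a, real_panel) for a in artificial_panel]
# distances == [0, 1, 2]: the first artificial sequence duplicates a real one
copies = count_exact_copies(artificial_panel, real_panel)  # 1
```

A distance of zero means the generator has simply reproduced a donor's sequence, which is exactly what a privacy-preserving surrogate should avoid. Passing this check is necessary rather than sufficient: near-copies can leak information too.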

In some ways, the work is reminiscent of the trend, seen a couple of years ago, in which GANs were used to create images of imaginary people, animals, and more, as epitomized by the generative website ThisPersonDoesNotExist.com. Only this time, of course, it involves actual genetic code, rather than simple pictures.

A paper describing the project, titled “Creating artificial human genomes using generative neural networks,” was recently published in the journal PLOS Genetics.