Can A.I. solve one of the oldest mysteries of linguistics?

There are many things that distinguish humans from other species, but one of the most important is language. The ability to string together various elements in essentially infinite combinations is a trait that “has often in the past been considered to be the core defining feature of modern humans, the source of human creativity, cultural enrichment, and complex social structure,” as linguist Noam Chomsky once said.

But as important as language has been in the evolution of humans, there is still much we don’t know about how language has evolved. While dead languages like Latin have a wealth of written records and descendants through which we can better understand them, some languages are lost to history.

Researchers have been able to reconstruct some lost languages, but the process of deciphering them can be a long one. For example, the ancient script Linear B was “solved” over half a century after its discovery, and some of those who worked on it didn’t live to see the work completed. An older script called Linear A, the writing system of the Minoan civilization, remains undeciphered.

Modern linguists have a powerful tool at their disposal, however: artificial intelligence. By training A.I. to locate the patterns in undeciphered languages, researchers can reconstruct them, unlocking the secrets of the ancient world. A recent, novel neural approach by researchers at the Massachusetts Institute of Technology (MIT) has already shown success at deciphering Linear B, and could one day lead to solving other lost languages.

Resurrecting the dead (languages)

Much like skinning a cat, there is more than one way to decode a lost language. In some cases, the language has no written records, so linguists try to reconstruct it by tracing the evolution of sounds through its descendants. Such is the case with Proto-Indo-European, the hypothetical ancestor of numerous languages across Europe and Asia.

In other cases, archaeologists unearth written records, which was the case with Linear B. After archaeologists discovered tablets on the island of Crete, researchers spent decades puzzling over the writings, eventually deciphering them. Unfortunately, this isn’t currently possible with Linear A, as researchers don’t have nearly as much source material to study. But that might not be necessary.

A project by researchers at MIT illustrates the difficulties of decipherment, as well as the potential of A.I. to revolutionize the field. The researchers developed a neural approach to deciphering lost languages “informed by patterns in language change documented in historical linguistics.” As detailed in a 2019 paper, while previous A.I. for deciphering languages had to be tailored to a specific language, this one does not.

“If you look at any commercially available translator or translation product,” says Jiaming Luo, the lead author on the paper, “all of these technologies have access to a large number of what we call parallel data. You can think of them as Rosetta Stones, but in a very large quantity.”

A parallel corpus is a collection of texts in two different languages. Imagine, for example, a series of sentences in both English and French. Even if you don’t know French, by comparing the two sets and observing patterns, you can map words in one language onto the equivalent words in the other.
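To make the idea concrete, here is a minimal sketch in Python of how word pairs could be read off a parallel corpus by counting which words tend to show up together. The four sentence pairs and the function name `cooccurrence_map` are invented for illustration, and the scoring (a Dice coefficient over sentence-level co-occurrence) is an assumption chosen for simplicity, not how any commercial translator works.

```python
from collections import Counter, defaultdict

# Toy parallel corpus, invented for illustration: (English, French) pairs.
PARALLEL = [
    ("the dog sleeps", "le chien dort"),
    ("the cat sleeps", "le chat dort"),
    ("the dog eats", "le chien mange"),
    ("the cat eats", "le chat mange"),
]


def cooccurrence_map(pairs):
    """Map each source word to the target word it co-occurs with most distinctively."""
    co = defaultdict(Counter)   # co[s][t] = number of sentence pairs containing both
    src_count = Counter()       # number of sentence pairs containing source word s
    tgt_count = Counter()       # number of sentence pairs containing target word t
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        for s in s_words:
            for t in t_words:
                co[s][t] += 1

    def dice(s, t):
        # Dice coefficient: high only when s and t appear in the same sentences.
        return 2 * co[s][t] / (src_count[s] + tgt_count[t])

    # Ties (none in this toy corpus) are broken alphabetically by target word.
    return {s: max(co[s], key=lambda t: (dice(s, t), t)) for s in co}


mapping = cooccurrence_map(PARALLEL)
print(mapping["dog"])  # chien
```

Even with four sentences, "dog" lines up with "chien" because the two always appear together and never apart. With millions of sentence pairs, the same kind of signal becomes far more reliable, which is the luxury a lost language does not offer.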

“If you train a human to do this, if you see 40-plus-million parallel sentences,” Luo explains, “I’m confident that you will be able to figure out a translation.”

But English and French are living languages with centuries of cultural overlap. Deciphering a lost language is far trickier.

“We don’t have that luxury of parallel data,” Luo explains. “So we have to rely on some specific linguistic knowledge about how language evolves, how words evolve into their descendants.”

In order to create a model that could be used regardless of the languages involved, the team set constraints based on trends that can be observed through the evolution of languages.

“We have to rely on two levels of insights on linguistics,” Luo says. “One is on the character level, which is all we know that when words evolve, they usually evolve from left to right. You can think about this evolution as sort of like a string. So maybe a string in Latin is ABCDE that most likely you were going to change that to ABD or ABC, you still preserve the original order in a way. That’s what we call monotonic.”
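Luo's string example can be checked mechanically. The short Python sketch below tests whether a descendant string keeps its characters in the same left-to-right order as the ancestor. The function name `preserves_order` is hypothetical, and the check only covers dropped characters; real sound change also substitutes and inserts sounds, so this illustrates the monotonic idea rather than the MIT model itself.

```python
def preserves_order(ancestor: str, descendant: str) -> bool:
    """Return True if descendant is an order-preserving subsequence of ancestor."""
    remaining = iter(ancestor)
    # `ch in remaining` advances the iterator until it finds ch, so each
    # character must be found to the right of the previous match.
    return all(ch in remaining for ch in descendant)


print(preserves_order("ABCDE", "ABD"))  # True: C and E were dropped, order kept
print(preserves_order("ABCDE", "ABC"))  # True
print(preserves_order("ABCDE", "ADB"))  # False: B would have to jump past D
```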

At the level of vocabulary (the words that make up a language), the team used a technique called “one-to-one mapping.”

“That means that if you pull out the entire vocabulary of Latin and pull out the entire vocabulary of Italian, you will see some kind of one-to-one matching,” Luo offers as an example. “The Latin word for ‘dog’ will probably evolve into the Italian word for ‘dog’ and the Latin word for ‘cat’ will probably evolve to the Italian word for ‘cat.’”
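As a rough illustration of what a one-to-one constraint buys, the Python sketch below pairs up two tiny word lists so that every word is used exactly once and the total spelling difference is as small as possible. Everything here is an assumption made for the example: the three-word vocabularies, the helper names `edit_distance` and `one_to_one_match`, and the use of plain edit distance as the similarity measure. The article does not describe how the MIT system scores candidate pairs.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def one_to_one_match(source, target):
    """Pair each source word with a distinct target word, minimizing total distance.

    Brute force over permutations: fine for a toy list, hopeless for a real
    vocabulary. Assumes len(target) >= len(source).
    """
    best = min(
        permutations(target, len(source)),
        key=lambda perm: sum(edit_distance(s, t) for s, t in zip(source, perm)),
    )
    return dict(zip(source, best))


latin = ["canis", "cattus", "luna"]   # toy examples
italian = ["luna", "gatto", "cane"]   # deliberately shuffled
print(one_to_one_match(latin, italian))
# {'canis': 'cane', 'cattus': 'gatto', 'luna': 'luna'}
```

Because each Italian word can be claimed only once, a good match for one Latin word narrows the choices for the others, which is the intuition behind the constraint.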

To test the model, the team used a few datasets. They translated the ancient language Ugaritic to Hebrew and Linear B to Greek. Then, to confirm the efficacy of the model, they used it to detect cognates (words with common ancestry) within the Romance languages Spanish, Italian, and Portuguese.

It was the first known attempt to automatically decipher Linear B, and the model successfully translated 67.3% of the cognates. The system also improved on previous models for translating Ugaritic. Because the languages come from different families, the results suggest that the model is flexible as well as more accurate than previous systems.

The future

Linear A remains one of language’s great mysteries, and cracking that ancient nut would be a remarkable feat for A.I. For now, Luo says, something like that is entirely theoretical, for a couple of reasons.

First, Linear A offers a smaller quantity of data than even Linear B does. There’s also the matter of figuring out just what kind of script Linear A even is.

“I would say the unique challenge for Linear A is that you have a lot of pictorial or logographic characters or symbols,” Luo says. “And usually when you have a lot of these symbols, it’s going to be much harder.”

As an example, Luo compares English and Chinese.

“English has 26 letters if you don’t count capitalization, and Russian has 33. These are called alphabetic systems. So you just have to figure out a map for these 26 or 30-something characters,” he says.

“But for Chinese, you have to deal with thousands of them,” he continues. “I think an estimation of the minimal amount of characters to master just to read a newspaper would be about 3,000 or 5,000. Linear A is not Chinese, but because of its pictorial or logographic symbols and stuff like that, it’s definitely harder than Linear B.”

Although Linear A is still undeciphered, the success of MIT’s novel neural decipherment approach in automatically deciphering Linear B, moving beyond the need for a parallel corpus, is a promising sign.
