
Researchers teach computer to understand dialects by reading Twitter

Computers don’t harbor the more problematic prejudices that are unfortunately still found in parts of society, but that isn’t to say they’re without their faults. One task machines frequently prove less adept at is understanding dialects that differ from the standardized form of a language, such as an English dialect considered to originate in some African-American communities. (Researchers term the dialect “African-American English,” which we realize may be regarded as inaccurate by African Americans who don’t share it.) Now, researchers are training AI to recognize and use this dialect.

There is a logical reason why computers are worse at understanding some dialects than others: computer scientists who have spent the past 30 years teaching machines to read have frequently used readily available data, such as back issues of the Wall Street Journal, to carry out the training. Training on such formal written language has rendered many natural language processing (NLP) systems less adept at understanding language that doesn’t conform to that very specific type.

“If you think about traditional media that have existed for a long time — things like books or, more recently, newspapers — you’re seeing a very standardized dialect of language, associated with elite education and the like,” Brendan O’Connor, a natural language processing expert at the University of Massachusetts Amherst, told Digital Trends. “That’s not specific to English: you see it in every language in the world.”

As O’Connor noted, this no longer has to be the case. The internet, and particularly social media, has opened up a rich data stream of different dialects that can be used to train the next wave of NLP systems. In a new paper, O’Connor and other researchers created the largest dataset for studying African-American English from online communication, composed of 59 million tweets from 2.8 million users.

“The African-American English dialect has … millions of speakers and is different from standard English in several interesting ways,” O’Connor said. “It’s different enough that our artificial intelligence tools — which are designed for standardized English — perform worse with them; they’re less intelligent at understanding that dialect. African-American English is often incorrectly characterized as ‘not English’ by current classifiers.”

In their paper, O’Connor and his colleagues showed that properly fine-tuned NLP systems are capable of understanding African-American English. The authors plan to release their new model within the next year to better identify English written in this dialect.

“The future next step is to make systems that can do deeper analysis of sentences that are written in different types of English dialects,” he said. “Embracing linguistic diversity is certainly something that needs to be focused on. We highlight the importance of engineering systems that are better at handling different forms of dialect.”

Ultimately, AI systems that can understand everyone equally well will be the best possible outcome for all.
