Speech-recognition systems may not yet be perfect, but as the likes of Amazon Echo show, they’re getting both better and more ubiquitous all the time.
New research from investigators at the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory (CSAIL) suggests a different technique for training these systems: getting them to learn by looking at images.
“This is an attempt to get machines to require less supervised training to learn about spoken language,” Jim Glass, a senior research scientist at CSAIL, told Digital Trends. “The conventional way to train speech recognition systems is by using recordings of people talking and, for each utterance, transcribing exactly what words have been said. Ideally, you have hundreds or thousands of hours of speech in order for the system to work properly. Some of the biggest companies doing this — like Baidu and Google — are using tens of thousands of hours for training. The more annotated data that they have, the better these systems perform.”
So what’s wrong with that? After all, as noted, speech-recognition tech is continuously getting better. Whatever computer scientists are doing is obviously working.
That may be true, but this new approach is interesting for a couple of reasons. First, letting a machine train itself by looking at paired images and audio (eventually, you could imagine it learning by watching YouTube) is much closer to the way we learn as human beings.
Second, and arguably more important, it could help bring speech recognition to parts of the world that stand to benefit greatly from this kind of technology.
“Annotated data is expensive to produce,” Glass continued. “Speech recognition has been going on for decades, and the majority of it has been for languages in countries that can afford to invest in these kinds of resources. When it comes to language, it tends to be those which companies think will help them make a profit. English has received by far the most attention, followed by western European languages, and other languages like Japanese and Mandarin. The problem is that there are around 7,000 languages spoken in the world and around 300 that are spoken by more than 1 million people. A lot of these just haven’t received much attention — if any.”
In parts of the world where literacy levels are low, it’s easy to see how speech recognition could be a game changer in terms of providing people with access to information. Hopefully, this technology can help toward that goal.
As exciting as the research is, however, Glass notes it is still in its very early stages. At present, CSAIL researchers have been feeding their system a database of 1,000 images, each paired with a free-form spoken description that relates to it in some way. They then test the system by giving it a recording and asking it to retrieve the 10 images that best match what it hears.
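The paper's actual system learns neural embeddings for audio and images, but the retrieval step it is evaluated on can be sketched simply: score every image against the audio clip in a shared embedding space, then return the top matches. The sketch below is illustrative only; the embedding vectors, their dimensionality, and the `retrieve_top_k` helper are all hypothetical stand-ins, and cosine similarity is one plausible scoring choice rather than the authors' confirmed one.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_top_k(audio_emb, image_embs, k=10):
    """Return indices of the k images whose embeddings best match the audio.

    audio_emb:  embedding of the spoken description (hypothetical encoder output)
    image_embs: list of image embeddings from the 1,000-image database
    """
    ranked = sorted(range(len(image_embs)),
                    key=lambda i: cosine(audio_emb, image_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy example: 4-dimensional embeddings for five images (made-up values).
images = [[1, 0, 0, 0], [0, 1, 0, 0], [0.9, 0.1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
audio = [1, 0.05, 0, 0]
print(retrieve_top_k(audio, images, k=3))  # → [0, 2, 1]
```

In the real evaluation, a retrieval counts as a success when the image the recording actually describes appears among the top results, which is why the database-wide ranking, not just a single best guess, matters.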
Over time, the hope is that such approaches will improve to the point where laboriously labeled speech training data is no longer a necessity.
If all goes according to plan, that should be better for everyone — whether you’re an English speaker in the U.S. or a speaker of Xhosa in South Africa.