Microsoft's new speech recognition system achieves human parity in audible words

Computers can do some amazing things lately, with parallel processing, machine intelligence, and more powerful hardware allowing extraordinary advancements on what seems like a daily basis. Microsoft is in the thick of things when it comes to artificial intelligence, and machine learning is at the center of it all. On Tuesday, the company announced another significant breakthrough.

The most natural way for humans to interact with computers is by speaking with them, and Microsoft has created technology that can understand spoken language as well as humans, according to the Microsoft blog. Reaching human parity in speech recognition is a historic achievement and Microsoft achieved this milestone more quickly than it expected. “Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, executive vice president in charge of Microsoft’s Intelligence and Research Group.

According to a paper published on Monday, Microsoft’s research team has created a speech-recognition system that achieves a word error rate (WER) of only 5.9 percent, a reduction from the 6.3 percent reported just a month ago. Human beings who transcribe the same conversation used in the test also achieve around a 5.9 percent WER, meaning that for the first time, a computer performs just as well as humans do in the industry-standard Switchboard task.
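
For readers unfamiliar with the metric, WER is commonly computed as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not Microsoft's evaluation code; the function name `word_error_rate` and the sample sentences are made up for illustration, and real evaluations typically normalize text (case, punctuation, filler words) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


# Hypothetical example: one wrong word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.1%}")
```

By this measure, a 5.9 percent WER corresponds to roughly 59 word errors for every 1,000 words in the reference transcript.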

Speech-recognition research began in the early 1970s at the Defense Advanced Research Projects Agency (DARPA), and the computer industry took up the challenge and has been working ever since toward the goal of a human-like ability to understand what is being said. Now that this milestone has been reached, we can expect digital assistants and other tools to dramatically improve their ability to interact with us in a more natural fashion. “This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum said.

Microsoft’s new speech-recognition system does not achieve perfection in recognizing spoken conversation, but then again, neither do we. To overcome the usual mistakes in recognizing language, the system uses neural language models that can make the same kinds of inferences humans make when correcting for misheard words.

The team used a few existing tools to achieve the speech-recognition milestone. For example, the researchers used the Computational Network Toolkit, an open source Microsoft system for applying deep learning to computing tasks, which speeds up deep-learning algorithms by running them on specialized graphics processing units (GPUs) working in parallel. The team also drew on technologies used for other tasks, such as image processing.

The researchers are not resting on their laurels, however. Work remains to make the speech-recognition technology work in more real-world settings where background noise and context can make recognizing conversational speaking a much more difficult task. As Geoffrey Zweig, manager of Microsoft’s Speech & Dialog research group, put it, “The next frontier is to move from recognition to understanding.”