With speech recognition getting better every day, it’s remarkable how well Siri, Alexa, and Cortana can parse human speech. But what about cheering crowds or crashing waves? Can our AI personal assistants tell the difference between those? Well, probably not. Recognizing sounds, particularly natural ones, is actually very difficult for computers.
Computers, from our smartphones to our most advanced supercomputers, recognize images and speech fairly well across the board. Natural sounds have been an exception, but that may be about to change. Scientists at the Massachusetts Institute of Technology might have found a solution.
According to Phys.Org, a group of researchers at MIT’s Computer Science and Artificial Intelligence Laboratory, or CSAIL, has pioneered a new way to teach computers to recognize sound: cutting out the middlemen.
Normally, humans need to annotate vast databases of sounds by hand to teach computers how to recognize and identify particular sounds. This new method, however, circumvents the human element by using video.
“Computer vision has gotten so good that we can transfer it to other domains. We’re capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabeled video to understand sound,” Carl Vondrick, an MIT graduate student in electrical engineering, told Phys.Org.
The new system essentially leverages a computer’s ability to recognize visual information and tie that recognition to its understanding of the sounds in the videos. Think of it this way: the computer recognizes objects in the video and looks for correlations between the appearance of those objects and the sound information it is processing.
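To make the idea concrete, here is a minimal sketch in Python of how a vision model's guesses could stand in for human labels. It is not the CSAIL team's code. The two categories, the two-number audio features, the hard-coded "vision" probabilities, the simple linear audio model, and the KL-divergence loss are all assumptions made for illustration; the article does not describe the actual network or training objective.

```python
import math

# Hypothetical categories an existing vision model might report for a video.
CLASSES = ["ocean", "crowd"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, features):
    """Audio model: one linear score per class, then softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)

def kl_divergence(p, q):
    """How far the audio model's guess q is from the vision model's guess p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(weights, audio, vision_probs):
    pairs = zip(audio, vision_probs)
    return sum(kl_divergence(p, predict(weights, x)) for x, p in pairs) / len(audio)

def train(audio, vision_probs, epochs=200, lr=0.5):
    """Fit the audio model to the vision model's outputs; no human labels used."""
    n_classes, n_features = len(vision_probs[0]), len(audio[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        for x, p in zip(audio, vision_probs):
            q = predict(weights, x)
            for c in range(n_classes):
                # Gradient of KL(p || q) with respect to class score c is q[c] - p[c].
                for j in range(n_features):
                    weights[c][j] -= lr * (q[c] - p[c]) * x[j]
    return weights

# Made-up audio features for four unlabeled video clips (assumed, not real data).
audio = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
# What a hypothetical vision model "saw" in the frames of those same clips.
vision_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

untrained = [[0.0, 0.0], [0.0, 0.0]]
loss_before = mean_kl(untrained, audio, vision_probs)
weights = train(audio, vision_probs)
loss_after = mean_kl(weights, audio, vision_probs)

# Classify the sound of a new clip without looking at any video.
guess = predict(weights, [0.95, 0.15])
print(CLASSES[guess.index(max(guess))], loss_before, loss_after)
```

The point of the sketch is that the sound model only ever sees the vision model's output for each clip, which is why unlabeled video is enough to train it.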
It’s a quicker, easier, and more accurate way to train computers to recognize sounds. According to a research paper, it’s between 13 and 15 percent more accurate than the previous method of hand-annotating massive libraries of sounds and feeding that information into a computer.
The CSAIL research team’s full conclusions will be presented at the Neural Information Processing Systems conference in early December.