Google is learning to differentiate between your voice and your friend’s

[Video: Looking to Listen: Stand-up]

We may be able to pick out our best friend’s or our mother’s voice from a crowd, but can the same be said for our smart speakers? For the time being, the answer may be “no.” Smart assistants aren’t always right about who’s speaking, but Google is looking to change that with a pretty elegant solution.

In a new paper titled “Looking to Listen at the Cocktail Party,” Google researchers detail how a deep learning system can identify and isolate voices simply by looking at people’s faces as they speak.

“People are remarkably good at focusing their attention on a particular person in a noisy environment, mentally ‘muting’ all other voices and sounds,” Inbar Mosseri and Oran Lang, software engineers at Google Research, noted in a blog post. And while this ability is innate to human beings, “automatic speech separation — separating an audio signal into its individual speech sources — while a well-studied problem, remains a significant challenge for computers.”

Mosseri and Lang, however, have created a deep learning audio-visual model capable of isolating speech signals from a variety of other auditory inputs, like additional voices and background noise. “We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking,” the duo said.

So how did they do it? The first step was training the system to identify individual voices (paired with their faces) speaking uninterrupted in an aurally clean environment. The researchers fed the system about 2,000 hours of video, all of which featured a single person in the camera frame with no background interference. Once this was complete, they added virtual noise, such as other voices, to the clean recordings to teach the A.I. system to differentiate among audio tracks and identify which track belongs to which speaker.
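The mixing step described above is straightforward in principle: clean single-speaker tracks are summed into a synthetic “cocktail” mixture, and the original clean tracks become the training targets the model must recover. Here is a minimal, hypothetical sketch of that data-generation idea in NumPy (the function and example signals are illustrative, not Google's actual pipeline):

```python
import numpy as np

def make_cocktail_mixture(clean_tracks, noise=None):
    """Sum clean single-speaker tracks (plus optional background noise)
    into one synthetic mixture. The clean tracks are kept as the
    training targets the separation model must reconstruct."""
    mixture = np.sum(clean_tracks, axis=0)
    if noise is not None:
        mixture = mixture + noise
    return mixture, clean_tracks

# Hypothetical example: two "speakers" stand in as simple sine tones.
sr = 16000                       # 16 kHz sample rate, one second of audio
t = np.arange(sr) / sr
speaker_a = 0.5 * np.sin(2 * np.pi * 220 * t)
speaker_b = 0.5 * np.sin(2 * np.pi * 330 * t)
mix, targets = make_cocktail_mixture([speaker_a, speaker_b])
```

In the real system, each mixture is also paired with the video frames of the speakers' faces, which is what lets the model tie each recovered track to a person on screen.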

Ultimately, the researchers were able to train the system to “split the synthetic cocktail mixture into separate audio streams for each speaker in the video.” As you can see in the video, the A.I. can identify the voices of two comedians even as they speak over one another, simply by looking at their faces.
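The paper's model predicts a per-speaker mask over the mixture's spectrogram; multiplying the mixture by a speaker's mask yields that speaker's stream. As a loose toy illustration of the masking idea only (an ideal ratio mask on a single made-up spectral frame, not Google's audio-visual network), the separation step looks like this:

```python
import numpy as np

def ideal_ratio_masks(clean_specs, eps=1e-8):
    """Per-speaker ratio masks computed from clean magnitude spectra.
    A separation network is trained to *predict* masks like these;
    applying a speaker's mask to the mixture isolates that speaker."""
    total = np.sum(clean_specs, axis=0) + eps
    return [s / total for s in clean_specs]

# Toy frame: two speakers occupying different frequency bins.
a = np.array([1.0, 0.0, 2.0, 0.0])   # speaker A's magnitude spectrum
b = np.array([0.0, 3.0, 0.0, 1.0])   # speaker B's magnitude spectrum
mixture = a + b
masks = ideal_ratio_masks([a, b])
separated_a = mixture * masks[0]     # recovers speaker A's bins
```

Because the two toy speakers occupy disjoint frequency bins, the mask recovers each one almost exactly; real overlapping speech is harder, which is where the visual cue from the speaker's face earns its keep.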

“Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context,” Mosseri and Lang wrote.

We’ll just have to see how this new methodology is ultimately implemented in Google products.

Lulu Chang
Former Digital Trends Contributor