
Google is learning to differentiate between your voice and your friend’s

Looking to Listen: Stand-up

We may be able to pick out our best friend’s or our mother’s voice from a crowd, but can the same be said for our smart speakers? For the time being, the answer may be “no.” Smart assistants aren’t always right about who’s speaking, but Google is looking to change that with a pretty elegant solution.


In a paper titled “Looking to Listen at the Cocktail Party,” Google researchers explain how a new deep learning system can isolate individual voices simply by looking at people’s faces as they speak.

“People are remarkably good at focusing their attention on a particular person in a noisy environment, mentally ‘muting’ all other voices and sounds,” Inbar Mosseri and Oran Lang, software engineers at Google Research, noted in a blog post. And while this ability is innate to human beings, “automatic speech separation — separating an audio signal into its individual speech sources — while a well-studied problem, remains a significant challenge for computers.”

Mosseri and Lang, however, have created a deep learning audio-visual model capable of isolating speech signals from a variety of other auditory inputs, like additional voices and background noise. “We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking,” the duo said.

So how did they do it? The first step was training the system to identify individual voices (paired with their faces) speaking uninterrupted in an aurally clean environment. The researchers presented the system with about 2,000 hours of video, all of which featured a single person in the camera frame with no background interference. Once this was complete, they began to add virtual noise — like other voices — teaching the A.I. system to differentiate among audio tracks and identify which track belongs to which speaker.
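The core of that data-generation step is simple to picture: mix clean, single-speaker recordings together (optionally with background noise), while keeping the originals around as training targets. The sketch below illustrates the idea in Python with NumPy; the function name and parameters are hypothetical, not Google's actual pipeline.

```python
import numpy as np

def make_synthetic_mixture(clean_tracks, noise=None, noise_gain=0.3):
    """Sum several clean single-speaker waveforms (plus optional noise)
    into one 'cocktail party' mixture. Hypothetical illustration only;
    the paper's real pipeline operates on far larger video datasets."""
    length = min(len(t) for t in clean_tracks)
    tracks = [np.asarray(t[:length], dtype=np.float64) for t in clean_tracks]
    mixture = np.sum(tracks, axis=0)
    if noise is not None:
        mixture = mixture + noise_gain * np.asarray(noise[:length], dtype=np.float64)
    # Rescale if the sum clips, keeping the clean targets aligned.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture = mixture / peak
        tracks = [t / peak for t in tracks]
    return mixture, tracks
```

The clean tracks returned alongside the mixture serve as the ground truth the network learns to recover.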

Ultimately, the researchers were able to train the system to “split the synthetic cocktail mixture into separate audio streams for each speaker in the video.” As the demo video shows, the A.I. can identify the voices of two comedians even as they speak over one another, simply by looking at their faces.
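The paper describes the model as predicting a spectrogram mask for each visible speaker, which is then applied to the mixture to carve out that speaker's audio. As a rough, hypothetical sketch of just that final masking step (the mask prediction itself, which fuses audio with face embeddings, is the hard part and is omitted here):

```python
import numpy as np

def separate_with_masks(mixture_spec, raw_masks, eps=1e-8):
    """Apply per-speaker masks to a mixture spectrogram.
    Illustrative only: the real system predicts complex-valued masks
    from joint audio-visual features; here we just normalize arbitrary
    non-negative masks so they sum to 1 per time-frequency bin."""
    raw = np.stack([np.asarray(m, dtype=np.float64) for m in raw_masks])
    masks = raw / (raw.sum(axis=0, keepdims=True) + eps)
    # Each speaker's stream is the mixture weighted by their mask.
    return [m * mixture_spec for m in masks]
```

Because the normalized masks sum to one in every bin, the recovered streams add back up to the original mixture.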

“Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context,” Mosseri and Lang wrote.

We’ll just have to see how this new methodology is ultimately implemented in Google products.

Lulu Chang
Former Digital Trends Contributor
Fascinated by the effects of technology on human interaction, Lulu believes that if her parents can use your new app…