
Google’s new AI generates audio soundtracks from pixels

An AI-generated wolf howling
Google DeepMind

DeepMind showed off the latest results from its generative AI video-to-audio research on Tuesday. The novel system combines what it sees on-screen with the user’s written prompt to create synced audio soundscapes for a given video clip.


The V2A AI can be paired with video-generation models like Veo, DeepMind’s generative audio team wrote in a blog post, and can create soundtracks, sound effects, and even dialogue for the on-screen action. What’s more, DeepMind claims that its new system can generate “an unlimited number of soundtracks for any video input” by tuning the model with positive and negative prompts that encourage or discourage the use of a particular sound, respectively.
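DeepMind hasn’t detailed how that prompt tuning works under the hood. A common mechanism in diffusion models is classifier-free guidance, where the sample is pushed along the direction from the negative-prompt prediction toward the positive-prompt prediction. The NumPy sketch below illustrates that general idea with made-up vectors; it is an assumption about the approach, not DeepMind’s actual method.

```python
import numpy as np

def guided_prediction(pred_pos, pred_neg, guidance_scale=3.0):
    # Steer the denoiser toward the positive prompt and away from the
    # negative one; a larger guidance_scale means stronger steering.
    return pred_neg + guidance_scale * (pred_pos - pred_neg)

# Toy vectors standing in for two conditioned passes of a real model
pred_pos = np.array([0.8, -0.2, 0.5])   # conditioned on, say, "wolf howling"
pred_neg = np.array([0.1, 0.4, -0.3])   # conditioned on, say, "music"
print(guided_prediction(pred_pos, pred_neg))
```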

V2A Cars

The system works by first encoding and compressing the video input into a representation the diffusion model can work with. The diffusion model then iteratively refines the desired audio out of random noise, guided by that visual input and by the user’s optional text prompt. This audio output is finally decoded and exported as a waveform that can then be recombined with the video input.
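To make the shape of that pipeline concrete, here is a minimal sketch. DeepMind hasn’t released its encoder, diffusion network, or audio decoder, so every function below is a toy stand-in that only mimics the flow the team describes: encode the video, iteratively refine audio from noise under the video and prompt conditioning, then decode a waveform.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames):
    # Stand-in for the compressed video representation (8 features per frame).
    return frames.reshape(frames.shape[0], -1)[:, :8]

def denoiser(noisy_audio, video_feats, prompt_vec, step):
    # Toy "diffusion model": a real one is a learned network; this just
    # pulls the noisy sample toward a function of the conditioning.
    target = np.tanh(video_feats.mean() + prompt_vec.mean())
    return noisy_audio + 0.1 * (target - noisy_audio)

def decode_audio(latent, sample_rate=16_000, seconds=1.0):
    # Stand-in decoder: map the refined latent to a playable waveform.
    t = np.linspace(0, seconds, int(sample_rate * seconds), endpoint=False)
    freq = 220 + 220 * float(np.clip(latent.mean(), 0, 1))
    return np.sin(2 * np.pi * freq * t)

frames = rng.random((24, 32, 32))      # fake 24-frame clip
video_feats = encode_video(frames)
prompt_vec = rng.random(16)            # fake text-prompt embedding

audio = rng.standard_normal(8)         # start from pure noise
for step in reversed(range(50)):       # iterative refinement
    audio = denoiser(audio, video_feats, prompt_vec, step)

waveform = decode_audio(audio)         # ready to recombine with the video
```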

The best part is that the user doesn’t have to go in and manually (read: tediously) sync the audio and video tracks, as the V2A system does it automatically. “By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” the DeepMind team wrote.
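In other words, the training data appears to consist of aligned triplets of video, audio, and text annotations. As a rough illustration, one such training record might look like the following; the field names and shapes are guesses for illustration, not DeepMind’s actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class V2ATrainingExample:
    # One aligned (video, audio, annotation) triplet -- illustrative only.
    frames: np.ndarray       # (T, H, W, 3) video clip
    waveform: np.ndarray     # time-aligned soundtrack for the same clip
    annotations: list[str]   # sound descriptions and/or dialogue transcripts

example = V2ATrainingExample(
    frames=np.zeros((24, 64, 64, 3)),
    waveform=np.zeros(16_000),
    annotations=["a lone wolf howling"],
)
```

Because the audio in each example is already time-aligned with its clip, a model trained this way can pick up the timing of sound events directly, which is what removes the manual syncing step.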

V2A Wolf

The system is not yet perfected, however. For one, the output audio quality depends on the fidelity of the video input, and the system gets tripped up when video artifacts or other distortions are present. According to the DeepMind team, lip-syncing generated dialogue to the on-screen characters remains an ongoing challenge.

V2A Claymation family

“V2A attempts to generate speech from the input transcripts and synchronize it with characters’ lip movements,” the team explained. “But the paired video-generation model may not be conditioned on transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript.”

The system still needs to undergo “rigorous safety assessments and testing” before the team will consider releasing it to the public. Every video and soundtrack generated by the system will be affixed with DeepMind’s SynthID watermarks. V2A is far from the only audio-generating AI on the market, either: Stability AI dropped a similar product just last week, while ElevenLabs released its sound effects tool last month.

Andrew Tarantola
Former Computing Writer
Andrew Tarantola is a journalist with more than a decade reporting on emerging technologies ranging from robotics and machine…
Can Google’s new AI experiment help me learn a language?
AI translation on Android phone.

I've lived in Germany for around a decade now, and an enduring source of shame for me is my rather underwhelming German language skills. Like many language learners, I've reached the point where I can understand what's said to me, make casual conversation, and handle everyday situations, but I've never really achieved the level of comfort with the language and the breadth of vocabulary that's required for true fluency.

I'd love to improve my language skills, but I've struggled to commit the time and resources that intensive classes require. Perhaps there's a solution hiding in plain sight on my phone.

Read more
You can now interact with Google’s AI Mode in search results
Google AI Mode

Google has been working on adding more AI features to Search, and now an integrated AI Mode is being rolled out to the public. Unlike the AI Overviews that Google has included as a default part of Search since last year, AI Mode is a chatbot that users can interact with as part of their search results.

AI Mode has previously been available only as a Google Labs experiment, but now Google says it will be coming to Search for "a small percentage of people" in the U.S. over the coming weeks. Those who are part of the test will see an "AI Mode" tab in Search, and clicking it will bring up chatbot-generated information related to their search.

Read more
Meta’s new AI app lets you share your favorite prompts with friends
Meta AI WhatsApp widget.

Meta has been playing the AI game for a while now, but unlike ChatGPT, its models have usually been integrated into existing platforms rather than released as standalone apps. That trend ends today -- the company has launched the Meta AI app, and it appears to do everything ChatGPT does and more.

Powered by the latest Llama 4 model, the app is designed to "get to know you" using the conversations you have and information from your public Meta profiles. It works primarily with voice, and Meta says it has improved responses to feel more personal and conversational. There's experimental voice tech included too, which you can toggle on and off to test -- the difference, apparently, is that the full-duplex speech technology generates audio directly rather than reading out written responses.

Read more