
Google’s AI just got ears


AI chatbots are already capable of “seeing” the world through images and video. Now, Google has announced audio understanding as part of its latest update to Gemini Pro: Gemini 1.5 Pro can “hear” audio files uploaded into its system and extract the text from them.

The company has made this LLM version available as a public preview on its Vertex AI development platform, opening it up to more enterprise-focused users. When the model was first announced in February, it was offered only to a limited group of developers and enterprise customers.

1. Breaking down + understanding a long video

I uploaded the entire NBA dunk contest from last night and asked which dunk had the highest score.

Gemini 1.5 was incredibly able to find the specific perfect 50 dunk and details from just its long context video understanding! pic.twitter.com/01iUfqfiAO

— Rowan Cheung (@rowancheung) February 18, 2024

Google shared the details of the update at its Cloud Next conference, currently taking place in Las Vegas. Having previously called Gemini Ultra, the LLM that powers its Gemini Advanced chatbot, the most powerful model in the Gemini family, Google now describes Gemini 1.5 Pro as its most capable generative model. The company added that this version is better at learning new tasks without additional fine-tuning.

Gemini 1.5 Pro is multimodal: it can transcribe many kinds of audio into text, including TV shows, movies, radio broadcasts, and conference call recordings. It’s also multilingual, able to process audio in several different languages. The LLM can generate transcripts from videos as well, though their quality may be unreliable, as TechCrunch notes.
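For developers trying the public preview, the flow looks roughly like this. The sketch below uses the Vertex AI Python SDK; the project ID, bucket path, and preview model name are placeholders and may differ from what Google ships:

```python
# Minimal sketch: asking Gemini 1.5 Pro to transcribe an audio file on Vertex AI.
# The project ID, bucket path, and preview model name below are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-pro-preview-0409")  # preview name may vary

# Audio is passed as a file reference from Cloud Storage alongside a text prompt.
audio = Part.from_uri("gs://your-bucket/conference-call.mp3", mime_type="audio/mpeg")
response = model.generate_content([audio, "Transcribe this recording."])

print(response.text)
```

The same call pattern accepts video and text parts, which is what enables long-context demos like the one in the tweet embedded above.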

When it was first announced, Google explained that Gemini 1.5 Pro uses a token system to process raw data. A million tokens equate to approximately 700,000 words or 30,000 lines of code; in media form, that’s an hour of video or around 11 hours of audio.
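Those ratios make it easy to estimate whether a file will fit in the context window. The back-of-the-envelope sketch below derives rough per-second rates from the article’s figures; they are approximations, not official constants:

```python
# Back-of-the-envelope token budgeting using the article's figures.
# Derived rates are approximations, not official constants.
CONTEXT_TOKENS = 1_000_000

AUDIO_TOKENS_PER_SEC = CONTEXT_TOKENS / (11 * 3600)  # ~25 tokens/sec of audio
VIDEO_TOKENS_PER_SEC = CONTEXT_TOKENS / (1 * 3600)   # ~278 tokens/sec of video

def audio_fits(hours: float) -> bool:
    """Rough check that an audio file fits in the 1M-token window."""
    return hours * 3600 * AUDIO_TOKENS_PER_SEC <= CONTEXT_TOKENS

print(audio_fits(2.5))   # True: well under the ~11-hour ceiling
print(audio_fits(12.0))  # False: past the window
```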

Some private preview demos of Gemini 1.5 Pro have shown the LLM finding specific moments in a long video. AI enthusiast Rowan Cheung, who got early access, detailed how the model located an exact action shot in a sports contest and summarized the event, as seen in the tweet embedded above.

However, Google noted that other early adopters, including United Wholesale Mortgage, TBS, and Replit, are opting for more enterprise-focused use cases, such as mortgage underwriting, automating metadata tagging, and generating, explaining, and updating code.
