Google’s AI just got ears

The Google Gemini AI logo.
Google

AI chatbots are already capable of “seeing” the world through images and video. Now, Google has announced audio-to-text capabilities as part of its latest update to Gemini Pro. In Gemini 1.5 Pro, the chatbot can “hear” audio files uploaded into its system and extract text information from them.

The company has made this version of the LLM available as a public preview on its Vertex AI development platform. The public preview lets more enterprise-focused users experiment with the feature and broadens the model's user base. When Gemini 1.5 Pro was first announced in February, it was offered only to a limited group of developers and enterprise customers in a private rollout.

1. Breaking down + understanding a long video

I uploaded the entire NBA dunk contest from last night and asked which dunk had the highest score.

Gemini 1.5 was incredibly able to find the specific perfect 50 dunk and details from just its long context video understanding! pic.twitter.com/01iUfqfiAO

— Rowan Cheung (@rowancheung) February 18, 2024

Google shared details of the update at its Cloud Next conference, which is currently taking place in Las Vegas. Having previously called Gemini Ultra, the LLM that powers its Gemini Advanced chatbot, the most powerful model in the Gemini family, Google now calls Gemini 1.5 Pro its most capable generative model. The company added that this version is better at learning without additional fine-tuning of the model.

Gemini 1.5 Pro is multimodal in that it can turn different types of audio into text, including audio from TV shows, movies, radio broadcasts, and conference call recordings. It is also multilingual, able to process audio in several different languages. The LLM may also be able to create transcripts from videos; however, their quality may be unreliable, as TechCrunch has noted.

When first announced, Google explained that Gemini 1.5 Pro used a token system to process raw data. A million tokens equate to approximately 700,000 words or 30,000 lines of code. In media form, it equals an hour of video or around 11 hours of audio.
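To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the round numbers quoted above and assumes that token usage scales linearly with the length of the input, which is a simplification. The function names are hypothetical and are not part of any Google API, and the resulting rates are implied averages, not official tokenization figures.

```python
# Round figures quoted above for a 1,000,000-token context window.
CONTEXT_TOKENS = 1_000_000
WORDS = 700_000
LINES_OF_CODE = 30_000
VIDEO_SECONDS = 1 * 3600    # one hour of video
AUDIO_SECONDS = 11 * 3600   # around 11 hours of audio

# Implied average rates (assumption: tokens scale linearly with input size).
tokens_per_word = CONTEXT_TOKENS / WORDS
tokens_per_line = CONTEXT_TOKENS / LINES_OF_CODE
tokens_per_video_second = CONTEXT_TOKENS / VIDEO_SECONDS
tokens_per_audio_second = CONTEXT_TOKENS / AUDIO_SECONDS


def estimate_audio_tokens(minutes: float) -> int:
    """Estimate the tokens used by an audio clip of the given length (hypothetical helper)."""
    return round(minutes * 60 * tokens_per_audio_second)


def fits_in_context(minutes: float) -> bool:
    """Check whether an audio clip of the given length fits in one million tokens."""
    return estimate_audio_tokens(minutes) <= CONTEXT_TOKENS


if __name__ == "__main__":
    print(f"tokens per word:         {tokens_per_word:.2f}")
    print(f"tokens per line of code: {tokens_per_line:.1f}")
    print(f"tokens per video second: {tokens_per_video_second:.1f}")
    print(f"tokens per audio second: {tokens_per_audio_second:.2f}")
    print(f"90-minute recording:     {estimate_audio_tokens(90)} tokens")
```

By this estimate, a 90-minute conference call would use well under a fifth of the window, while a second of video costs roughly 11 times as many tokens as a second of audio.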

There have been some private preview demos of Gemini 1.5 Pro that demonstrate how the LLM is able to find specific moments in a video transcript. For example, AI enthusiast Rowan Cheung got early access and detailed how his demo found an exact action shot in a sports contest and summarized the event, as seen in the tweet embedded above.

However, Google noted that other early adopters, including United Wholesale Mortgage, TBS, and Replit, are opting for more enterprise-focused use cases, such as mortgage underwriting, automating metadata tagging, and generating, explaining, and updating code.

Fionna Agomuoh
Fionna Agomuoh is a Computing Writer at Digital Trends. She covers a range of topics in the computing space, including…