
I tested the future of AI image generation. It’s astoundingly fast.

Imagery generated by HART.
MIT / HART

One of the core problems with AI is its notoriously high power and computing demand, especially for tasks such as media generation. On mobile phones, only a handful of expensive devices with powerful silicon can run these features natively, and even implementing them at scale in the cloud is a costly affair.

Nvidia may have quietly addressed that challenge in partnership with the folks over at the Massachusetts Institute of Technology and Tsinghua University. The team created a hybrid AI image generation tool called HART (hybrid autoregressive transformer) that essentially combines two of the most widely used AI image creation techniques. The result is a blazingly fast tool with dramatically lower compute requirements.


Just to give you an idea of how fast it is, I asked it to create an image of a parrot playing a bass guitar. It returned the following picture in about a second; I could barely even follow the progress bar. When I gave the same prompt to Google's Imagen 3 model in Gemini, it took roughly 9-10 seconds on a 200 Mbps internet connection.

Image of a parrot generated by HART.
MIT / HART

A massive breakthrough

When AI images first started making waves, the diffusion technique was behind it all, powering products such as OpenAI's DALL-E image generator, Google's Imagen, and Stable Diffusion. This method can produce images with an extremely high level of detail. However, it builds each image through an iterative, multi-step denoising process, and as a result it is slow and computationally expensive.
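To see why that slowness is baked in, here is a minimal Python sketch of the diffusion idea. It is purely illustrative, with a placeholder standing in for the large denoising network rather than code from any of these products, but it shows the key point: every one of the dozens of steps requires a full pass through a big model.

```python
# Illustrative sketch of diffusion sampling (not code from DALL-E, Imagen, or
# Stable Diffusion): the image is refined over many denoising steps, and each
# step is a full forward pass through a large model.
import numpy as np

def fake_denoiser(noisy_image: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for the large network that predicts the noise to remove."""
    return noisy_image * 0.05  # placeholder prediction

def diffusion_sample(shape=(64, 64, 3), num_steps=30) -> np.ndarray:
    image = np.random.randn(*shape)          # start from pure noise
    for step in reversed(range(num_steps)):  # 30+ sequential model calls
        predicted_noise = fake_denoiser(image, step)
        image = image - predicted_noise      # gradually denoise toward an image
    return image

sample = diffusion_sample()
print(sample.shape)  # (64, 64, 3)
```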

The second approach, which has recently gained popularity, is autoregressive modeling. These models work in much the same fashion as chatbots, generating an image piece by piece as a sequence of predicted tokens. It is a faster, but also more error-prone, method of creating images with AI.
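The contrast is easiest to see in a similarly hedged Python sketch. The names below are made up and a random stand-in replaces the actual transformer, but the mechanic is the same: the image is a sequence of discrete tokens, and the model produces them one at a time, just as a chatbot produces the next word.

```python
# Illustrative sketch of autoregressive image generation (not any real model's
# code): tokens are predicted sequentially, one model call per token.
import random

VOCAB_SIZE = 1024       # size of a hypothetical image-token codebook
TOKENS_PER_IMAGE = 256  # e.g. a 16 x 16 grid of tokens

def fake_next_token(prefix: list[int]) -> int:
    """Stand-in for a transformer that picks the next token given the prefix."""
    return random.randrange(VOCAB_SIZE)

def autoregressive_sample() -> list[int]:
    tokens: list[int] = []
    for _ in range(TOKENS_PER_IMAGE):   # one model call per token
        tokens.append(fake_next_token(tokens))
    return tokens                       # a decoder would map tokens back to pixels

print(len(autoregressive_sample()))  # 256
```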

On-device demo for HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

The team at MIT fused both methods into a single package called HART. It relies on an autoregressive model to predict a compressed image as a sequence of discrete tokens, while a small diffusion model handles the rest, compensating for the quality lost in compression. The overall approach cuts the number of diffusion steps from over two dozen to just eight.
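Based purely on that description, the pipeline might look something like the rough sketch below. Every function name here is hypothetical rather than HART's actual API, but it captures the division of labor: the large autoregressive model lays down the coarse image as discrete tokens, and the small diffusion model runs a short, eight-step refinement pass to recover fine detail.

```python
# Rough sketch of the hybrid idea as described in the article, with made-up
# function names; this is not HART's real implementation or API.
import numpy as np
import random

def ar_generate_tokens(prompt: str, n_tokens: int = 256) -> list[int]:
    """Stand-in for the large (~700M-parameter) autoregressive model."""
    return [random.randrange(1024) for _ in range(n_tokens)]

def decode_tokens(tokens: list[int], shape=(64, 64, 3)) -> np.ndarray:
    """Stand-in decoder: maps discrete tokens to a coarse image."""
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(shape)

def small_diffusion_refine(coarse: np.ndarray, num_steps: int = 8) -> np.ndarray:
    """Stand-in for the small (~37M-parameter) diffusion model adding detail."""
    image = coarse.copy()
    for _ in range(num_steps):       # only 8 steps instead of 30+
        residual = image * 0.02      # placeholder residual prediction
        image = image - residual
    return image

image = small_diffusion_refine(decode_tokens(ar_generate_tokens("a parrot playing bass")))
print(image.shape)  # (64, 64, 3)
```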

The experts behind HART claim that it can "generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster." HART pairs an autoregressive model with roughly 700 million parameters with a small diffusion model of just 37 million parameters.

Evolution of image training for HART.
MIT / HART

Solving the computing cost crisis

Interestingly, this hybrid tool was able to create images that matched the quality of top-shelf models with roughly 2 billion parameters. Most importantly, HART achieved that milestone while generating images nine times faster and using 31% less computation.

According to the team, the low-compute approach allows HART to run locally on phones and laptops, which is a huge win. So far, the most popular mass-market products, such as ChatGPT and Gemini, require an internet connection for image generation because the computing happens on cloud servers.

In the test video, the team showcased it running natively on an MSI laptop with an Intel Core series processor and an Nvidia GeForce RTX graphics card. That's a combination you can find in the majority of gaming laptops out there without spending a fortune.

Comparative analysis of AI images.
MIT / HART

HART is capable of producing 1:1 aspect ratio images at a respectable 1,024 x 1,024-pixel resolution. The level of detail in these images is impressive, and so are the stylistic variation and scenery accuracy. During their tests, the team noted that the hybrid AI tool was anywhere between three and six times faster than comparable models and offered more than seven times higher throughput.

The future potential is exciting, especially when integrating HART’s image capabilities with language models. “In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture,” says the team at MIT.

They are already exploring that idea, and even plan to test the HART approach on audio and video generation. You can try it out on MIT's web dashboard.

Some rough edges

Before we dive into the quality debate, do keep in mind that HART is very much a research project that is still in its early stages. On the technical side, the team highlights a few rough spots, such as overheads during inference and training.

Failures of HART.
HART / Nadeem Sarwar

These challenges are minor in the bigger scheme of things and can likely be fixed, or at least tolerated. Considering the sheer benefits HART delivers in computing efficiency, speed, and latency, they are unlikely to cause any major performance issues even if they persist.

In my brief time prompt-testing HART, I was astonished by the pace of image generation. I rarely ran into a scenario where the free web tool took more than two seconds to create an image. Even with prompts spanning three paragraphs (roughly 200 words), HART was able to create images that adhered tightly to the description.

AI images sample generated with HART.
HART / Nadeem Sarwar

Aside from descriptive accuracy, there was plenty of detail in the images. However, HART suffers from the typical failings of an AI image generator. It struggles with digits, basic depictions such as people eating food, character consistency, and perspective.

Photorealistic humans are one area where I noticed glaring failures. On a few occasions, it simply got the concept of basic objects wrong, like confusing a ring with a necklace. But overall, those errors were few and far between, and largely expected; plenty of AI tools that have been around for a while still can't get these things right.

Overall, I am particularly excited by the immense potential of HART. It would be interesting to see whether MIT and Nvidia create a product out of it, or simply adopt the hybrid AI image generation approach in an existing product. Either way, it’s a glimpse into a very promising future.
