
I tested the future of AI image generation. It’s astoundingly fast.

Imagery generated by HART.
MIT / HART

One of the core problems with AI is its notoriously high power and computing demand, especially for tasks such as media generation. On mobile phones, only a handful of expensive devices with powerful silicon can run these features natively, and even implementing them at scale in the cloud is a costly affair.

Nvidia may have quietly addressed that challenge in partnership with the folks over at the Massachusetts Institute of Technology and Tsinghua University. The team created a hybrid AI image generation tool called HART (hybrid autoregressive transformer) that essentially combines two of the most widely used AI image creation techniques. The result is a blazingly fast tool with dramatically lower compute requirements.


Just to give you an idea of how fast it is, I asked it to create an image of a parrot playing a bass guitar. It returned the following picture in about a second; I could barely even follow the progress bar. When I gave the same prompt to Google's Imagen 3 model in Gemini, it took roughly 9-10 seconds on a 200 Mbps internet connection.

Image of a parrot generated by HART.
MIT / HART

A massive breakthrough

When AI images first started making waves, the diffusion technique was behind it all, powering products such as OpenAI's DALL-E image generator, Google's Imagen, and Stable Diffusion. This method can produce images with an extremely high level of detail. However, it builds each image through an iterative, multi-step denoising process, and as a result it is slow and computationally expensive.
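To see why that slowness is baked in, here is a minimal Python sketch of the diffusion idea. It is purely illustrative, with a placeholder standing in for the large denoising network rather than code from any of these products, but it shows the key point: every one of the dozens of steps requires a full pass through a big model.

```python
# Illustrative sketch of diffusion sampling (not code from DALL-E, Imagen, or
# Stable Diffusion): the image is refined over many denoising steps, and each
# step is a full forward pass through a large model.
import numpy as np

def fake_denoiser(noisy_image: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for the large network that predicts the noise to remove."""
    return noisy_image * 0.05  # placeholder prediction

def diffusion_sample(shape=(64, 64, 3), num_steps=30) -> np.ndarray:
    image = np.random.randn(*shape)          # start from pure noise
    for step in reversed(range(num_steps)):  # 30+ sequential model calls
        predicted_noise = fake_denoiser(image, step)
        image = image - predicted_noise      # gradually denoise toward an image
    return image

sample = diffusion_sample()
print(sample.shape)  # (64, 64, 3)
```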

The second approach, which has recently gained popularity, is autoregressive modeling. These models work in much the same fashion as chatbots, generating an image piece by piece as a sequence of predicted tokens. It is a faster, but also more error-prone, method of creating images with AI.
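The contrast is easiest to see in a similarly hedged Python sketch. The names below are made up and a random stand-in replaces the actual transformer, but the mechanic is the same: the image is a sequence of discrete tokens, and the model produces them one at a time, just as a chatbot produces the next word.

```python
# Illustrative sketch of autoregressive image generation (not any real model's
# code): tokens are predicted sequentially, one model call per token.
import random

VOCAB_SIZE = 1024       # size of a hypothetical image-token codebook
TOKENS_PER_IMAGE = 256  # e.g. a 16 x 16 grid of tokens

def fake_next_token(prefix: list[int]) -> int:
    """Stand-in for a transformer that picks the next token given the prefix."""
    return random.randrange(VOCAB_SIZE)

def autoregressive_sample() -> list[int]:
    tokens: list[int] = []
    for _ in range(TOKENS_PER_IMAGE):   # one model call per token
        tokens.append(fake_next_token(tokens))
    return tokens                       # a decoder would map tokens back to pixels

print(len(autoregressive_sample()))  # 256
```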

On-device demo for HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

The team at MIT fused both methods into a single package called HART. It relies on an autoregressive model to predict a compressed image as a sequence of discrete tokens, while a small diffusion model handles the rest, compensating for the quality lost in compression. The overall approach cuts the number of diffusion steps from over two dozen to just eight.
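Based purely on that description, the pipeline might look something like the rough sketch below. Every function name here is hypothetical rather than HART's actual API, but it captures the division of labor: the large autoregressive model lays down the coarse image as discrete tokens, and the small diffusion model runs a short, eight-step refinement pass to recover fine detail.

```python
# Rough sketch of the hybrid idea as described in the article, with made-up
# function names; this is not HART's real implementation or API.
import numpy as np
import random

def ar_generate_tokens(prompt: str, n_tokens: int = 256) -> list[int]:
    """Stand-in for the large (~700M-parameter) autoregressive model."""
    return [random.randrange(1024) for _ in range(n_tokens)]

def decode_tokens(tokens: list[int], shape=(64, 64, 3)) -> np.ndarray:
    """Stand-in decoder: maps discrete tokens to a coarse image."""
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(shape)

def small_diffusion_refine(coarse: np.ndarray, num_steps: int = 8) -> np.ndarray:
    """Stand-in for the small (~37M-parameter) diffusion model adding detail."""
    image = coarse.copy()
    for _ in range(num_steps):       # only 8 steps instead of 30+
        residual = image * 0.02      # placeholder residual prediction
        image = image - residual
    return image

image = small_diffusion_refine(decode_tokens(ar_generate_tokens("a parrot playing bass")))
print(image.shape)  # (64, 64, 3)
```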

The experts behind HART claim that it can "generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster." HART pairs an autoregressive model with roughly 700 million parameters with a small diffusion model of just 37 million parameters.

Evolution of image training for HART.
MIT / HART

Solving the computing cost crisis

Interestingly, this hybrid tool was able to create images that matched the quality of top-shelf models with roughly 2 billion parameters. Most importantly, HART achieved that milestone while generating images nine times faster and using 31% less computation.

According to the team, the low-compute approach allows HART to run locally on phones and laptops, which is a huge win. So far, the most popular mass-market products, such as ChatGPT and Gemini, require an internet connection for image generation because the computing happens on cloud servers.

In the test video, the team showcased it running natively on an MSI laptop with an Intel Core series processor and an Nvidia GeForce RTX graphics card. That's a combination you can find in the majority of gaming laptops out there without spending a fortune.

Comparative analysis of AI images.
MIT / HART

HART is capable of producing 1:1 aspect ratio images at a respectable 1,024 x 1,024-pixel resolution. The level of detail in these images is impressive, and so are the stylistic variation and scenery accuracy. During their tests, the team noted that the hybrid AI tool was anywhere between three and six times faster than comparable models and offered more than seven times higher throughput.

The future potential is exciting, especially when integrating HART’s image capabilities with language models. “In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture,” says the team at MIT.

They are already exploring that idea, and even plan to test the HART approach on audio and video generation. You can try it out on MIT's web dashboard.

Some rough edges

Before we dive into the quality debate, do keep in mind that HART is very much a research project that is still in its early stages. On the technical side, the team highlights a few rough spots, such as overheads during inference and training.

Failures of HART.
HART / Nadeem Sarwar

These challenges are minor in the bigger scheme of things and can likely be fixed, or at least tolerated. Considering the sheer benefits HART delivers in computing efficiency, speed, and latency, they are unlikely to cause any major performance issues even if they persist.

In my brief time prompt-testing HART, I was astonished by the pace of image generation. I rarely ran into a scenario where the free web tool took more than two seconds to create an image. Even with prompts spanning three paragraphs (roughly 200 words), HART was able to create images that adhered tightly to the description.

AI images sample generated with HART.
HART / Nadeem Sarwar

Aside from descriptive accuracy, there was plenty of detail in the images. However, HART suffers from the typical failings of an AI image generator. It struggles with digits, basic depictions such as people eating food, character consistency, and perspective.

Photorealistic humans are one area where I noticed glaring failures. On a few occasions, it simply got the concept of basic objects wrong, like confusing a ring with a necklace. But overall, those errors were few and far between, and largely expected; plenty of AI tools that have been around for a while still can't get these things right.

Overall, I am particularly excited by the immense potential of HART. It would be interesting to see whether MIT and Nvidia create a product out of it, or simply adopt the hybrid AI image generation approach in an existing product. Either way, it’s a glimpse into a very promising future.
