MIT’s latest A.I. is freakishly good at determining what’s going on in videos

By Luke Dormehl October 2, 2018

How a Temporal Relation Network understands what's going on in a video

A few frames that tell a story are often all we need to understand what is going on. This is, after all, the basis of comic books, which provide just enough of the key story beats for us to follow what has happened. Robots equipped with computer vision technology, however, have struggled to do this. Until now, at least.

Researchers at the Massachusetts Institute of Technology (MIT) recently demonstrated a new type of artificial intelligence system that uses a neural network to fill in the gaps between a handful of video frames and work out what activity is taking place. The result is a system that is astonishingly good at recognizing what is happening in a video.

“The newly developed temporal relation modules enable the A.I. system to analyze a few key frames and estimate the temporal relation among them, in order to understand what’s going on in the video — such as a stack of objects [being] knocked down,” Bolei Zhou, a former Ph.D. student in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), who is now an assistant professor of computer science at the Chinese University of Hong Kong, told Digital Trends. “Because the model works with key frames sparsely sampled from the incoming video, the processing efficiency is greatly improved, enabling real-time activity recognition.”
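To make the idea of sparsely sampled key frames concrete, here is a minimal Python sketch. It is not MIT's code: the function names sample_key_frames and ordered_frame_tuples are hypothetical, and taking the middle frame of each equal-length segment is an assumption made for illustration. It shows how a long clip can be reduced to a few key frames, and how those frames can be grouped into time-ordered pairs or triples that a relation module could then compare.

```python
from itertools import combinations


def sample_key_frames(num_frames: int, num_segments: int) -> list[int]:
    """Split the video into equal-length segments and keep one frame from each.

    Assumption for illustration: the middle frame of each segment is used,
    so the result is deterministic and sorted in time order.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def ordered_frame_tuples(key_frames: list[int], scale: int) -> list[tuple[int, ...]]:
    """Return every group of `scale` key frames, kept in temporal order.

    combinations() preserves the input order, so when key_frames is sorted
    each tuple runs from earlier to later frames. These are the kinds of
    groups a hypothetical relation module would compare.
    """
    return list(combinations(key_frames, scale))


if __name__ == "__main__":
    frames = sample_key_frames(num_frames=300, num_segments=8)
    print(frames)
    for scale in (2, 3):
        print(scale, len(ordered_frame_tuples(frames, scale)))
```

With a 300-frame clip split into 8 segments, only 8 frames are examined, yielding 28 ordered pairs and 56 ordered triples to reason over. Looking at so few frames is what makes the processing cheap enough for real-time use.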

Another exciting property of the A.I. model is that it can forecast what will happen after viewing only the early frames of a video. For instance, if it sees a person holding a bottle, the algorithm anticipates that they might take a drink or squeeze it. Such anticipation will be essential for artificial intelligence in domains like autonomous driving, where a system could help prevent accidents by predicting what will happen from moment to moment.

“It [could also] be used to monitor human behaviors, such as a home robot assistant which could anticipate your intention by delivering things beforehand,” Zhou continued. “It [could additionally be employed] to analyze the massive [number of] videos online, to do better video understanding and video retrieval.”

The next step of the project will involve expanding the range of objects and activities the A.I. can recognize. The team is also working with robotics researchers to build this activity recognition into robot systems, which could give those robots better perception and visual reasoning skills.
