The program starts by organizing all of the footage, which often comes from multiple takes and camera angles. Those clips are matched to the script, making it easy to find several video options for each line of dialogue.
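The researchers have not released their code, but the heart of this step is an index from script lines to candidate takes. A minimal Python sketch of what such an index might look like follows; every class, field, and file name here is an illustrative assumption, not the paper's actual data structure.

```python
# Hypothetical sketch of a script-to-footage index; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Take:
    clip_path: str   # source video file
    camera: str      # which camera/angle the take came from
    start: float     # in-point within the clip, in seconds
    end: float       # out-point within the clip, in seconds

@dataclass
class ScriptLine:
    speaker: str
    text: str
    takes: list[Take] = field(default_factory=list)  # every candidate recording of this line

# After aligning the footage to the script, each line of dialogue
# knows all of its options:
script = [
    ScriptLine("ANNA", "Did you hear that?"),
    ScriptLine("BEN", "Hear what?"),
]
script[0].takes.append(Take("take01_wide.mp4", "cam_A", 3.2, 5.0))
script[0].takes.append(Take("take02_closeup.mp4", "cam_B", 2.9, 4.8))

for line in script:
    print(f"{line.speaker}: {line.text!r} -> {len(line.takes)} candidate take(s)")
```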
The program then analyzes exactly what is inside those clips. Using facial recognition alongside emotion recognition and other computational imaging techniques, the program determines what is in each frame. For example, it flags whether a shot is a wide-angle or a close-up and which characters the shot includes.
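The shot-type labeling can be illustrated with a simple heuristic: once faces have been found by any off-the-shelf detector, the framing can be classified by how much of the frame the largest face fills. The sketch below is an assumption about the general idea, and the thresholds are illustrative guesses, not values from the paper.

```python
# Hypothetical labeling pass: face_boxes come from any off-the-shelf
# face detector as (x, y, w, h) tuples in pixels.
def classify_shot(face_boxes, frame_height):
    """Classify framing by the largest face's share of the frame height."""
    if not face_boxes:
        return "no_face"
    largest = max(h for (_, _, _, h) in face_boxes)
    ratio = largest / frame_height
    if ratio > 0.5:       # face dominates the frame
        return "close_up"
    if ratio > 0.2:       # face clearly visible but not dominant
        return "medium"
    return "wide"

# Example: one face filling 60% of a 1080px-tall frame reads as a close-up.
print(classify_shot([(400, 100, 500, 648)], 1080))  # -> "close_up"
```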
With everything organized, the video editor then instructs the program on how the footage should be edited, using different styles and techniques the researchers call idioms. For example, a common convention is to show a character's face during his or her lines; if the editor wants that to happen, he or she just drags that idiom over. Idioms can also be negated. For example, the idiom "avoid jump cuts" can be applied as-is to actually avoid them, or negated to intentionally introduce jump cuts whenever possible.
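One natural way to model an idiom in code is as a small scoring function that penalizes a property of a candidate cut, with negation flipping the penalty into a reward. This framing is a sketch of the general idea, not the paper's actual formulation, and the function names are hypothetical.

```python
# An idiom as a cost function over a transition between two takes.
def avoid_jump_cuts(prev_take, next_take):
    """Penalize cutting between two takes from the same camera angle,
    which is what produces a visible jump cut."""
    return 1.0 if prev_take["camera"] == next_take["camera"] else 0.0

def negate(idiom):
    """A negated idiom inverts the preference: what was penalized
    is now rewarded, so jump cuts get chosen whenever possible."""
    return lambda prev, nxt: -idiom(prev, nxt)

prefer_jump_cuts = negate(avoid_jump_cuts)
a = {"camera": "cam_A"}
b = {"camera": "cam_B"}
print(avoid_jump_cuts(a, a), prefer_jump_cuts(a, a))  # -> 1.0 -1.0
print(avoid_jump_cuts(a, b), prefer_jump_cuts(a, b))  # -> 0.0 -0.0
```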
The editor can drag over multiple idioms to build an editing style. In a video demonstrating the technology, the researchers created a cinematic edit using idioms that tell the software to keep the speaker visible while talking, to start with a wide-angle shot, to mix in close-ups, and to avoid jump cuts. To edit the video in a completely different, fast-paced style, the researchers instead dragged over idioms for including jump cuts, favoring fast-paced performances, and keeping the zoom consistent.
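Combining idioms then amounts to an optimization: weight the active idioms, sum their costs over every possible sequence of takes, and choose the cheapest sequence. The toy dynamic-programming solver below illustrates that idea under those assumptions; the paper's actual optimization is more elaborate, and all names here are hypothetical.

```python
# Toy optimizer: pick one take per line of dialogue so that the
# weighted sum of idiom costs across consecutive takes is minimal.
def transition_cost(prev, nxt, idioms, weights):
    return sum(w * idiom(prev, nxt) for idiom, w in zip(idioms, weights))

def best_edit(options_per_line, idioms, weights):
    """options_per_line: list of lists of takes (dicts), one list per line."""
    # Dynamic programming: best[i][j] = (cheapest cost ending at take j
    # of line i, index of the previous line's take on that cheapest path).
    best = [{j: (0.0, None) for j in range(len(options_per_line[0]))}]
    for i in range(1, len(options_per_line)):
        layer = {}
        for j, nxt in enumerate(options_per_line[i]):
            layer[j] = min(
                (best[i-1][k][0] + transition_cost(prev, nxt, idioms, weights), k)
                for k, prev in enumerate(options_per_line[i-1])
            )
        best.append(layer)
    # Trace the cheapest path back to the first line.
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(best) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))

# Usage: two lines of dialogue, two camera angles each, one weighted idiom.
lines = [
    [{"camera": "cam_A"}, {"camera": "cam_B"}],
    [{"camera": "cam_A"}, {"camera": "cam_B"}],
]
avoid_jump = lambda p, n: 1.0 if p["camera"] == n["camera"] else 0.0
print(best_edit(lines, [avoid_jump], [1.0]))  # -> [1, 0]: alternate cameras
```

Under this framing, switching from the cinematic style to the fast-paced one is just a matter of passing in a different set of idioms and weights, which mirrors the drag-and-drop workflow described above.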
Editing styles can be saved and recalled later, and with the idioms in place, a stylized video edit is generated with a click. Alternative clips are arranged next to the computer's edit so editors can quickly make adjustments if something's not quite right.
The program speeds up video editing using artificial intelligence, but still allows actual humans to set the creative parameters that achieve a certain style. The researchers acknowledged a few shortcomings of the program. The system is designed for dialogue-driven videos, and further work would be needed for it to handle other types of shots, such as action shots. The program also can't prevent continuity errors, where an actor's hands or a prop appears in a different position from one clip to the next.
The study, conducted by researchers at Stanford University and Adobe Research, appears in the July issue of the journal ACM Transactions on Graphics.