In November 2016, we spoke with Adobe about Project StyLit, a research project that used artificial intelligence to apply an artistic style to a three-dimensional model. It was an intriguing technology, but one with many limitations and, seemingly, limited usefulness. Building on what the team learned with Project StyLit, Adobe Research has now taken a big step forward with a new project that, for now, is known simply as “Stylized Facial Animation.” It can apply an input style (or “style exemplar”) to video with very natural results, imitating hand-drawn animation far more convincingly than other techniques. While it is currently limited to faces, it shows promise for how animators will one day be able to automate the rotoscoping process, or at least aspects of it.
The technology will be publicly demonstrated on August 3 at SIGGRAPH, but Digital Trends had the chance to speak with Adobe’s David Simons about the project ahead of the presentation. Simons holds the role of Senior Principal Scientist within Adobe’s Advanced Product Development Group and is a 25-year veteran of After Effects, Adobe’s special effects program for video. After Effects seems a likely candidate for where the technology demonstrated in Stylized Facial Animation could one day end up. At this time, Adobe has not announced any plans for a commercial release of the tech, but it is the mission of Simons and the Advanced Product Development Group to turn research projects into marketable products, so it may not be that far off from finding its way into a future Creative Cloud update (it’s highly likely that it will be demonstrated at Adobe’s upcoming MAX conference).
Simons opened by explaining some key differences between Project StyLit and Stylized Facial Animation. StyLit relied on the information channels available in 3D models in order to appropriately map the style to the model, and styles were based on simple “sphere study” sketches that clearly laid out tonality and color. “The idea was to make similar stylized synthesis, driven by your choice of style, but be able to apply it to a photograph or video,” Simons told Digital Trends.
In this case, rather than relying on sketches of spheres, the style exemplars are portraits of actual faces, and the qualities of the style — textures, lines, brush strokes — are appropriately redrawn in the shape of the subject’s face in the target video. In the resulting output animation, the human subject is clearly recognizable, yet appears to be hand drawn in the technique used in the style exemplar, even when the exemplar face is a different shape, age, or gender from the person in the video.
Neural networks to the rescue
As detailed in a research paper, Stylized Facial Animation is based on previous research on neural network computing, conducted by both Adobe and others. One of the earliest and most popular consumer applications of neural network technology was the Prisma app, which applied an artistic style to an image and, later, to video. An important distinction between the likes of Prisma and simple image filters is that apps based on neural networks actually redraw the image from scratch, rather than overlaying the style on top of the photograph. But while Prisma could do this very quickly, it did a relatively poor job of maintaining details present in the source style.
While it still struggles in some specific areas, like forehead wrinkles, Adobe’s approach significantly increases the amount of artistic detail that makes it through to the final output. It accomplishes this by analyzing both the style exemplar and the target video, creating “guiding channels” that ensure the style is appropriately applied to the target. One channel is used to align the position of the exemplar and the target, another maintains the shading present in the target subject, and a third matches semantically similar regions (eyes to eyes, hair to hair, etc.).
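To make the idea of guiding channels concrete, here is a toy sketch (not Adobe's implementation; the function name, array layout, and labels are hypothetical) of the three kinds of guidance the paper describes: a positional map for aligning exemplar and target, a shading map, and a semantic map that labels regions like eyes and hair.

```python
import numpy as np

def build_guiding_channels(h, w, shading, segmentation):
    """Toy sketch of three guidance maps (hypothetical layout):
    - positional: normalized (x, y) coordinates, used to align the
      exemplar face with the target face
    - shading: per-pixel luminance of the target subject
    - semantic: integer labels marking regions (eyes, hair, skin, ...)
      so eyes get matched to eyes, hair to hair, and so on
    """
    ys, xs = np.mgrid[0:h, 0:w]
    positional = np.stack([xs / (w - 1), ys / (h - 1)], axis=-1)
    return {
        "positional": positional,   # shape (h, w, 2), values in [0, 1]
        "shading": shading,         # shape (h, w), grayscale
        "semantic": segmentation,   # shape (h, w), region labels
    }

# Tiny 4x4 example: flat mid-gray shading, two semantic regions.
h, w = 4, 4
shading = np.full((h, w), 0.5)
segmentation = np.array([[0] * 4] * 2 + [[1] * 4] * 2)
channels = build_guiding_channels(h, w, shading, segmentation)
```

In the real system these channels jointly constrain where each patch of the style exemplar is allowed to land in the output; the sketch only illustrates what data each channel carries.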
The underlying technology is actually based on the Content-Aware Fill feature in Adobe Photoshop. “It looks at nearby areas and takes patches that fit together. It’s like trying to fill in a jigsaw puzzle with pieces from the rest of the area that match up,” Simons said.
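Simons' jigsaw analogy can be sketched as a brute-force patch search: for a given target patch, scan every patch of the source and keep the one whose pixels match most closely. This is a deliberately naive illustration of the idea, not the optimized matching Adobe uses.

```python
import numpy as np

def best_matching_patch(source, target_patch, size=3):
    """Brute-force 'jigsaw' search: compare every size x size patch
    in `source` against `target_patch` by sum of squared differences
    and return the closest fit."""
    h, w = source.shape
    best, best_cost = None, float("inf")
    for y in range(h - size + 1):
        for x in range(w - size + 1):
            patch = source[y:y + size, x:x + size]
            cost = np.sum((patch - target_patch) ** 2)
            if cost < best_cost:
                best, best_cost = patch, cost
    return best

# A source with a dark and a bright region; asking for an all-bright
# patch should find the bright corner exactly.
source = np.zeros((5, 5))
source[2:, 2:] = 1.0
match = best_matching_patch(source, np.ones((3, 3)))
```

Production systems replace this exhaustive scan with fast approximate search, since scanning every patch per pixel is far too slow at image scale.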
The result looks much as you would expect such a technology to work, but it differs considerably from what we’ve seen in apps like Prisma because it preserves so much more detail from the source style. When it comes to applying the effect to multiple frames for an animation, though, things get even trickier. With Project StyLit, animations could only be created by applying a style one frame at a time, then manually connecting the frames in Adobe Premiere Pro or another video-editing application.
“A lot of it is about avoiding the uncanny valley. You want a level of abstraction.”
“We want it to look redrawn on every frame,” Simons explained. “But not so flickery that you can’t see what’s going on. A lot of it is about avoiding the uncanny valley. You want a level of abstraction.”
The solution came in allowing control over “temporal noise” by using a separate guiding channel to determine coherence. As each frame is synthesized from the preceding one, controlling the amount of blur in a frame’s temporal guiding channel raises or lowers coherence in the next frame, making it possible to find a balance between an animation that is too choppy and one that is too smooth.
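A minimal sketch of that knob, assuming (hypothetically) that the temporal guide is just a blurred copy of the previous synthesized frame: the blur radius becomes the dial that trades per-frame "boil" against a frozen, rotoscoped look. A simple box blur stands in here for whatever filter the paper actually uses.

```python
import numpy as np

def box_blur(frame, radius):
    """Separable box blur (stand-in for the paper's blur control).
    np.convolve with mode='same' zero-pads at the borders, which is
    fine for this illustration."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    smooth = lambda row: np.convolve(row, kernel, mode="same")
    out = np.apply_along_axis(smooth, 0, frame)
    return np.apply_along_axis(smooth, 1, out)

def temporal_guide(prev_frame, blur_radius):
    """Blurring the previous synthesized frame before it guides the
    next one is the coherence dial: a softer guide constrains the
    synthesizer less tightly, a sharper one holds strokes in place."""
    return box_blur(prev_frame, blur_radius)
```

How strongly the synthesizer weights this channel against the other guides is a separate tuning question; the sketch only shows where the blur parameter enters.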
Applying the technology
While Simons was quick to reaffirm that the project is definitely “still research at this time,” there are two possible applications for it in the future. One would be for animating live-action footage, as demonstrated in the research paper, and the other would be for creating puppets to use in Adobe’s Character Animator, which automatically animates a character based on video of an actor.
Any portrait — a sketch, painting, or even a photograph — can be used to style the video.
Going forward, one of the main goals will be getting to the point where the program can run in real time, or at least close to it. Currently, it takes about 30 seconds just to compute the guiding channels, and three minutes to perform full synthesis of a single 1-megapixel frame using a quad-core processor running at 3 gigahertz, according to the research paper. However, the team has already demonstrated that vast performance improvements are possible by using a graphics processing unit (GPU), which reduces the time required for single-frame synthesis to just 5 seconds. Given that most video runs at 24 or 30 frames per second, this is still a ways off from real-time synthesis, but the project is also still in very early development.
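A quick back-of-envelope check of that gap, using the paper's GPU figure and a typical 24-fps film frame rate:

```python
# Back-of-envelope: how far is 5 s/frame from real time?
seconds_per_frame_gpu = 5.0   # GPU single-frame synthesis time (paper)
target_fps = 24               # typical film frame rate

frames_per_second_now = 1 / seconds_per_frame_gpu     # 0.2 fps today
speedup_needed = target_fps * seconds_per_frame_gpu   # 120x for real time
```

In other words, even with the GPU gains, synthesis would need to get roughly two orders of magnitude faster to keep up with playback.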
Whatever the eventual outcome of Stylized Facial Animation, it is clear that Adobe is very interested in continuing to develop the technology. The paper’s lead author, Jakub Fiser, started out as an intern at Adobe. This project was his Ph.D. thesis, and was enough for Adobe to bring him on full time. Fiser joins an experienced group of individuals, of whom Simons spoke highly, even while keeping modest with regard to his own role. “Everyone else is a Ph.D.,” he said. “They’re the real researchers.”