Viral "ControlNet for Video" goes open source! It restyles videos precisely from text prompts, and it comes from an all-Chinese team
"ControlNet for Video" is here!
It turns the "God of War in blue" into a Disney princess in seconds:
Before and after processing, nothing changes except the art style.
The girl's mouth movements while speaking stay exactly the same.
Jiang Wen, mid sword thrust, can likewise turn into an ape from Rise of the Planet of the Apes "a second later".
This is CoDeF, the latest video processing algorithm, created by an all-Chinese team. Released only a few days ago, it quickly went viral online.
Netizens who saw it exclaimed:
Day by day, it is becoming harder to tell what is real from what is not!
You only need to shoot some footage yourself and apply a new style over it to turn it into all kinds of animation.
Some say that within a year it could be used in film production.
Others quickly agreed: technology really is advancing at a crazy pace.
Currently, the team has open-sourced this method on GitHub.
The poses stay the same, and the art-style "skin" can be swapped at will
The main reason why it is called "ControlNet for video" is that CoDeF can precisely control the original video.
(ControlNet realizes precise control of image element changes based on prompt words, such as character movements, image structure, etc.)
Given a text prompt, it changes only the style of the video, and it does so across the entire video.
For example, if you enter "Chinese ink painting", the landscape documentary can instantly become a masterpiece of Chinese ink painting.
Even flowing water is tracked well, and the overall fluid motion is unchanged.
Even for a large patch of fringe, the frequency and amplitude of its swaying are exactly the same as in the original video after the style change.
In terms of changing the painting style, CoDeF has also done a lot of detail processing to make the effect more realistic and reasonable.
With the prompt "from spring to winter", the originally rippling river becomes still and the clouds in the sky are replaced by the sun, which better suits a winter scene.
After Taylor Swift became a magical girl, the earrings were replaced with glowing gems, and the apple in her hand was replaced with a magic ball.
In this way, it is much easier to make movie characters age with one click.
Wrinkles can "quietly" appear on the face, and everything else remains unchanged.
So, how is CoDeF implemented?
Water and smoke can be tracked with greater consistency across frames
CoDeF is short for "content deformation field", a new method the authors propose for video style transfer tasks.
Compared with style transfer on static images, the difficulty of this task lies in temporal consistency and smoothness.
For example, when dealing with elements such as water and smoke, consistency between consecutive frames is very important.
Here the authors had a clever idea: use image algorithms to solve video tasks directly.
They run the algorithm on a single image only, then lift image-to-image translation to video-to-video translation and keypoint detection to keypoint tracking, without any additional training.
This enables better cross-frame consistency and even tracking of non-rigid objects compared to traditional methods.
Specifically, CoDeF decomposes the input video into a 2D canonical content field and a 3D temporal deformation field:
The former aggregates the static content of the entire video; the latter records how each individual frame is transformed along the time axis.
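To make the decomposition concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the canonical content is a tiny fixed grid instead of a learned field, and `shift_right` is a hypothetical hand-written deformation standing in for the learned temporal deformation field. It only illustrates the idea that each frame is rebuilt by looking up, for every pixel, a position in one shared canonical image.

```python
import math

def bilinear_sample(image, x, y):
    """Sample a 2D grid (list of rows) at continuous (x, y), clamping at the borders."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def render_frame(canonical, deformation, t, height, width):
    """Rebuild frame t: map each pixel to its canonical position, then read the canonical image."""
    frame = []
    for y in range(height):
        row = []
        for x in range(width):
            cx, cy = deformation(x, y, t)  # (x, y, t) -> position in the canonical image
            row.append(bilinear_sample(canonical, cx, cy))
        frame.append(row)
    return frame

# Toy canonical image (assumption): one bright pixel at row 1, column 1.
canonical = [[0.0] * 5 for _ in range(3)]
canonical[1][1] = 1.0

def shift_right(x, y, t):
    """Hypothetical deformation: the content moves right by one pixel per frame."""
    return x - t, y

frames = [render_frame(canonical, shift_right, t, 3, 5) for t in range(3)]
for t, frame in enumerate(frames):
    print(t, frame[1])  # the bright pixel moves one column to the right per frame
```

All motion lives in the deformation function, while the canonical image holds the content only once. That is why an edit made to the canonical image can appear consistently in every frame.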
Each field is represented by a multi-resolution 2D or 3D hash table paired with an MLP (multilayer perceptron).
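As a rough illustration of what a multi-resolution 2D hash table looks like, the sketch below follows the general hash-grid encoding scheme popularized by Instant-NGP. The table sizes, resolutions, hash constant, and single scalar feature per entry are all assumptions chosen for brevity, not values from CoDeF. In a real system the resulting feature vector would be passed to a small MLP that outputs a color (for the content field) or an offset (for the deformation field).

```python
import math
import random

PRIME_Y = 2654435761  # large prime used to mix the y coordinate (assumed hash constant)

def hash_index(ix, iy, table_size):
    """Map integer grid coordinates to a slot in a fixed-size table."""
    return (ix ^ (iy * PRIME_Y)) % table_size

def encode_2d(x, y, tables, base_resolution=4, growth=2.0):
    """Return one bilinearly interpolated feature per resolution level for (x, y) in [0, 1]^2."""
    features = []
    for level, table in enumerate(tables):
        res = int(base_resolution * growth ** level)  # finer grid at each level
        gx, gy = x * res, y * res
        x0, y0 = int(math.floor(gx)), int(math.floor(gy))
        fx, fy = gx - x0, gy - y0
        value = 0.0
        # Blend the four surrounding grid corners, each looked up through the hash.
        for dx, wx in ((0, 1 - fx), (1, fx)):
            for dy, wy in ((0, 1 - fy), (1, fy)):
                value += wx * wy * table[hash_index(x0 + dx, y0 + dy, len(table))]
        features.append(value)
    return features

# Three levels with 64 entries each, randomly initialised (these would be learned in practice).
rng = random.Random(0)
tables = [[rng.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(3)]
print(encode_2d(0.3, 0.7, tables))  # three features, one per resolution level
```

Coarse levels capture smooth, large-scale structure and fine levels capture detail. A 3D version for the deformation field would add a time coordinate and blend eight corners instead of four.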
Here the authors deliberately introduce regularization so that the canonical content field inherits the semantic information (such as object shapes) of the original video.
As shown in the figure above, this series of designs lets CoDeF automatically support applying various image algorithms directly to video processing:
That is, it only needs to run the corresponding algorithm on the canonical image, and then propagate the result along the time axis through the temporal deformation field.
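The propagation step can be sketched as follows, again in plain Python and under loud assumptions: `invert` is a trivial stand-in for a real image algorithm, `shift_right` is a hypothetical deformation, and nearest-neighbour lookup replaces proper interpolation. The point is only that the image algorithm runs once, on the canonical image, and every frame is then produced by warping that single result.

```python
def sample_nearest(image, x, y):
    """Read the pixel nearest to (x, y), clamping at the borders."""
    h, w = len(image), len(image[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return image[yi][xi]

def propagate(canonical, deformation, num_frames, image_algorithm):
    """Run an image algorithm once on the canonical image, then warp the result to every frame."""
    edited = image_algorithm(canonical)  # single call, independent of video length
    h, w = len(canonical), len(canonical[0])
    video = []
    for t in range(num_frames):
        frame = [[sample_nearest(edited, *deformation(x, y, t)) for x in range(w)]
                 for y in range(h)]
        video.append(frame)
    return video

def invert(image):
    """Stand-in for any image algorithm (a stylizer, a segmenter, a super-resolver, ...)."""
    return [[1.0 - v for v in row] for row in image]

def shift_right(x, y, t):
    """Hypothetical deformation: the content moves right by one pixel per frame."""
    return x - t, y

canonical = [[0.0, 1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0]]
video = propagate(canonical, shift_right, num_frames=2, image_algorithm=invert)
for t, frame in enumerate(video):
    print(t, frame[0])
```

Because every frame reads from the same edited image, the edit cannot flicker from frame to frame. Any inconsistency would have to come from the deformation itself.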
For example, "plugging" ControlNet, which was originally designed for image processing, into CoDeF accomplishes the "translation" of video style (that is, the effects shown at the beginning of this article and in the first section):
"Plug in" the segmentation algorithm SAM, and we can easily track objects in the video and complete dynamic segmentation tasks:
"Plug in" Real-ESRGAN, and super-resolving the video is just as easy...
The whole process is painless and requires no adjustment or preprocessing of the input video.
Processing is not only simple; the results are also solid, with good temporal consistency and synthesis quality.
As shown in the figure below, compared with the Layered Neural Atlases algorithm introduced in 2021, CoDeF reproduces details that are highly faithful to the original video, without distortion or damage.
In the comparison on modifying video style from text prompts, CoDeF stands out: its results not only match the given prompt best but are also more polished.
The cross-frame consistency is shown in the following figure:
One of the first authors is a fresh graduate
This research was conducted jointly by the Hong Kong University of Science and Technology, Ant Group, and the CAD&CG Laboratory of Zhejiang University.
There are three co-first authors: Ouyang Hao, Qiuyu Wang and Yuxi Xiao.
Among them, Ouyang Hao is a PhD student at the Hong Kong University of Science and Technology, supervised by Chen Qifeng (one of the corresponding authors of this paper); Jia Jiaya was his undergraduate advisor. He has interned at MSRA, SenseTime, and Tencent Youtu Lab, and is currently interning at Google.
The second is Qiuyu Wang. Yujun Shen is one of the corresponding authors.
Yujun Shen is a Senior Research Scientist at Ant Research Institute, where he heads the Interactive Intelligence Lab. His research focuses on computer vision and deep learning, and he is especially interested in generative models and 3D vision.
The third is Yuxi Xiao, who has just graduated from Wuhan University and will begin his PhD at the CAD&CG Laboratory of Zhejiang University in September this year.
His first-author paper, Level-S2fM: Structure from Motion on Neural Level Set of Implicit Surfaces, was accepted to CVPR 2023.
Paper address: https://arxiv.org/abs/2308.07926
Project address: https://qiuyu96.github.io/CoDeF/
Reference link: https://twitter.com/LinusEkenstam/status/1692492872392626284