Picture + audio can be converted into video in seconds! Xi'an Jiaotong University open-sources SadTalker: Head and lip movements are supernatural, bilingual in Chinese and English, and can sing

Hayo News
April 27th, 2023
Make a picture move along with your audio — a matching sd-webui plugin has already been released!

With the popularity of the digital human concept and the continuous development of generative technology, making the person in a photo move along with input audio is no longer a problem.

However, "generating a talking-head video from a face image and a piece of voice audio" still suffers from many problems, such as unnatural head movement, distorted facial expressions, and faces in the video that differ too much from the input picture.

Recently, researchers from Xi'an Jiaotong University and elsewhere proposed the SadTalker model, which learns to generate the 3D motion coefficients of a 3DMM (head pose, expression) from audio, and uses a novel 3D-aware face renderer to synthesize head motion.

Paper link: https://arxiv.org/pdf/2211.12194.pdf

Project homepage: https://sadtalker.github.io/

The audio can be English, Chinese, or even a song, and the blinking frequency of the character in the video can also be controlled!

To learn realistic motion coefficients, the researchers explicitly model the connections between audio and each type of motion coefficient separately: ExpNet learns accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces, while a conditional VAE, PoseVAE, is designed to synthesize head motions in different styles.

Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoint space of the face renderer, and the final video is synthesized.

Experiments demonstrate that the method achieves state-of-the-art performance in terms of motion synchronization and video quality.

At present, the stable-diffusion-webui plugin has also been released!

photo + audio = video

Many fields, such as digital human creation and video conferencing, need the technology of "animating still photos with voice audio", but it remains a very challenging task.

Previous work has mainly focused on generating "lip motion", because the relationship between lip motion and speech is the strongest. Other work has tried to generate face videos with additional related motions (such as head pose), but the resulting video quality remains unnatural and limited by preferred poses, blurring, identity drift, and facial distortion.

Another popular approach is latent-based facial animation, which focuses on specific categories of motion in conversational facial animation but also struggles to synthesize high-quality video: although 3D facial models offer highly decoupled representations that can be used to learn the motion trajectories of different facial regions separately, they still yield inaccurate expressions and unnatural motion sequences.

Based on these observations, the researchers proposed SadTalker (Stylized Audio-Driven Talking-head), a stylized audio-driven talking-head video generation system based on implicit modulation of 3D coefficients.

To achieve this, the researchers use the motion coefficients of a 3DMM as an intermediate representation and divide the task into two main parts (expression and pose), aiming to generate more realistic motion coefficients from audio (e.g. head pose, lip movement, and eye blinking) and to learn each motion individually to reduce uncertainty.

The source image is finally driven by a 3D-aware face renderer inspired by face-vid2vid.

3D face

Because real-world videos are shot in 3D environments, 3D information is crucial to the realism of generated videos; however, previous work rarely considered 3D space, since sparse original 3D coefficients are difficult to obtain and high-quality face renderers are difficult to design.

Inspired by recent single-image deep 3D reconstruction methods, the researchers use the space of predicted 3D deformable models (3DMMs) as an intermediate representation.

In a 3DMM, the 3D face shape S can be decoupled as:

S = S̄ + α U_id + β U_exp

where S̄ is the average 3D face shape, U_id and U_exp are the identity and expression bases of the LSFM morphable model, and the coefficients α (80 dimensions) and β (64 dimensions) describe the identity and expression of the person, respectively. To preserve pose differences, the coefficients r and t denote head rotation and translation; to achieve identity-independent coefficient generation, only the motion parameters {β, r, t} are modeled.

That is, head pose ρ = [r, t] and expression coefficients β are learned separately from the driving audio, and the face renderer is then implicitly modulated by these motion coefficients for final video synthesis.
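As a toy illustration of this decomposition, the 3DMM shape is a linear combination of bases. The dimensions below are made up for illustration; the real model uses dense vertex bases with 80 identity and 64 expression coefficients:

```python
# Toy sketch of the 3DMM decomposition S = S_mean + U_id * alpha + U_exp * beta.
# Dimensions are tiny placeholders, not the real 80/64-dimensional bases.

def assemble_shape(s_mean, u_id, u_exp, alpha, beta):
    """Return S = S_mean + U_id @ alpha + U_exp @ beta for flat vertex vectors."""
    shape = list(s_mean)
    for i in range(len(shape)):
        shape[i] += sum(u_id[i][k] * alpha[k] for k in range(len(alpha)))
        shape[i] += sum(u_exp[i][k] * beta[k] for k in range(len(beta)))
    return shape

# A 3-vertex "face" with 2 identity and 2 expression basis vectors.
s_mean = [0.0, 1.0, 2.0]
u_id   = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
u_exp  = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
alpha  = [1.0, 2.0]   # identity coefficients (fixed per person)
beta   = [0.5, 0.5]   # expression coefficients (driven by audio per frame)

print(assemble_shape(s_mean, u_id, u_exp, alpha, beta))
```

Since α stays fixed for a given person while {β, r, t} vary per frame, only the latter need to be predicted from audio.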

Generating Motion Coefficients from Audio

The 3D motion coefficients include head pose and expression: head pose is a global motion, while expression is relatively local. Learning all the coefficients jointly therefore introduces huge uncertainty for the network, because the relationship between head pose and audio is relatively weak, while lip movement is highly audio-correlated.

So SadTalker uses PoseVAE and ExpNet, described below, to generate head-pose and expression motion respectively.


Learning a general model that can "generate accurate expression coefficients from audio" is very difficult for two reasons:

1) Audio-to-expression is not a one-to-one mapping task for different characters;

2) There are some audio-irrelevant motions in the expression coefficients, which affect the accuracy of prediction.

The design goal of ExpNet is to reduce these uncertainties; as for the character-identity problem, the researchers use the expression coefficients of the first frame to tie the expression motion to a specific person.

To reduce the weight of other facial components' motion in natural conversation, pre-trained Wav2Lip and deep 3D reconstruction networks are used to produce lip-motion-only coefficients as the target.

Other subtle facial movements (such as eye blinking) are introduced through an additional landmark loss on the rendered images.
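A hedged sketch of the loss idea: predicted expression coefficients are supervised against lip-only targets distilled from a Wav2Lip-style teacher, while subtle motions like blinking are constrained by a landmark loss on rendered images. The helper names and the weighting below are illustrative, not the paper's exact formulation:

```python
# Illustrative combination of the two supervision signals described above.
# Inputs are placeholder flat lists; the real losses operate on coefficient
# tensors and 2D landmarks of rendered faces.

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def expnet_loss(pred_coeffs, lip_target_coeffs,
                pred_landmarks, gt_landmarks, lambda_lks=0.01):
    """Lip-only coefficient distillation loss + weighted landmark loss
    (the weight lambda_lks is a made-up value)."""
    coeff_loss = l2(pred_coeffs, lip_target_coeffs)    # distilled lip motion
    landmark_loss = l1(pred_landmarks, gt_landmarks)   # blinks etc. on renders
    return coeff_loss + lambda_lks * landmark_loss

print(expnet_loss([0.2, 0.4], [0.0, 0.4], [1.0, 2.0], [1.0, 2.5]))
```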


The researchers designed a VAE-based model to learn realistic, identity-aware stylized head movements in talking videos.

In training, PoseVAE is trained on fixed-length sequences using an encoder-decoder structure, where both the encoder and decoder are two-layer MLPs; the input is a continuous t-frame head-pose sequence, which is embedded into a Gaussian distribution, and the decoder learns to generate t-frame poses from the sampled distribution.

Note that PoseVAE does not generate poses directly, but learns residuals relative to the conditional pose of the first frame, which also allows the method to generate longer, more stable, and more continuous head motion conditioned on the first frame at test time.

Following CVAE, the corresponding audio features and a style identity are also added to PoseVAE as conditions, providing rhythm awareness and identity-specific style.

The model uses KL divergence to constrain the distribution of the generated motion, and mean-squared and adversarial losses to guarantee generation quality.
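Two pieces of this design can be sketched numerically: the residual-over-first-frame trick, and the closed-form KL divergence between the encoder's Gaussian and a standard normal prior that VAE training typically uses (the paper's exact loss weights may differ):

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims (standard VAE term)."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def poses_from_residuals(first_pose, residuals):
    """PoseVAE decodes residuals; absolute poses are first_pose + residual."""
    return [[p + r for p, r in zip(first_pose, res)] for res in residuals]

first = [0.0, 0.1, 0.0]                              # e.g. [yaw, pitch, roll] of frame 0
residuals = [[0.0, 0.0, 0.0], [0.05, -0.02, 0.0]]    # decoded per-frame residuals
print(poses_from_residuals(first, residuals))
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # KL is 0 when posterior matches prior
```

Anchoring on the first frame means frame 0 always reproduces the reference pose exactly, which is what makes long generated sequences stay continuous.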

3D-aware facial rendering

After generating realistic 3D motion coefficients, the researchers rendered the final video through an elaborate 3D image animator.

The recently proposed image animation method face-vid2vid can implicitly learn 3D information from a single image, but it requires a real video as the motion-driving signal; the face renderer proposed in this paper can instead be driven by 3DMM coefficients.

The researchers propose MappingNet to learn the relationship between the explicit 3DMM motion coefficients (head pose and expression) and the implicit unsupervised 3D keypoints.

MappingNet is built from several 1D convolutional layers and, like PIRenderer, smooths coefficients over a temporal window; the difference is that the researchers found the face-alignment motion coefficients in PIRenderer greatly harm the naturalness of audio-driven motion, so MappingNet uses only the expression and head-pose coefficients.
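The temporal smoothing itself can be sketched as a 1D convolution over a sliding window of coefficients; the window size and weights below are illustrative, not MappingNet's learned filters:

```python
def smooth_coefficients(seq, kernel):
    """1D convolution over time ('same' output length, edges clamped),
    applied to a sequence of scalar motion coefficients."""
    half = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), len(seq) - 1)  # clamp at sequence edges
            acc += w * seq[idx]
        out.append(acc)
    return out

# A jittery head-pose coefficient track smoothed with a simple averaging kernel.
track = [0.0, 1.0, 0.0, 1.0, 0.0]
print(smooth_coefficients(track, [1/3, 1/3, 1/3]))
```

A learned network replaces the fixed averaging kernel, but the sliding-window structure over time is the same idea.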

The training phase consists of two steps: first, face-vid2vid is trained in a self-supervised manner following the original paper; then all parameters of the appearance encoder, canonical keypoint estimator, and image generator are frozen, and MappingNet is fine-tuned on the 3DMM coefficients reconstructed from ground-truth videos.

Supervised training uses an L1 loss in the domain of the unsupervised keypoints and, following the original face-vid2vid implementation, produces the final generated videos.

Experimental results

To demonstrate the superiority of the method, the researchers selected the Frechet Inception Distance (FID) and Cumulative Probability of Blur Detection (CPBD) metrics to evaluate image quality, where FID mainly evaluates the realism of the generated frames and CPBD evaluates their sharpness.

To evaluate the degree of identity preservation, ArcFace is used to extract the identity embedding of the image, and then the cosine similarity (CSIM) of the identity embedding between the source image and the generated frame is calculated.
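Cosine similarity between identity embeddings (CSIM) is straightforward to compute; the embeddings below are toy vectors, whereas the paper uses ArcFace features:

```python
import math

def cosine_similarity(a, b):
    """CSIM: cosine of the angle between two identity embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

src_embedding = [1.0, 0.0, 1.0]   # toy embedding of the source image
gen_embedding = [1.0, 0.0, 1.0]   # toy embedding of a generated frame
print(cosine_similarity(src_embedding, gen_embedding))  # identical vectors -> ~1.0
```

A CSIM near 1 means the generated frame preserves the source identity; lower values indicate identity drift.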

To assess lip synchronization, the researchers used the perceptual lip-sync metrics from Wav2Lip, including the distance score (LSE-D) and confidence score (LSE-C).

For head-motion evaluation, the standard deviation of the head-motion feature embeddings extracted by Hopenet from the generated frames measures the diversity of the generated head motion, and the Beat Align Score evaluates the consistency between the audio and the generated head motion.
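The diversity metric reduces to a standard deviation over per-frame pose features; with Python's statistics module over toy yaw angles (the paper uses Hopenet feature embeddings rather than raw angles):

```python
import statistics

# Toy yaw angles (degrees) extracted from generated frames.
static_head = [0.0, 0.1, 0.0, 0.1]      # barely moving head -> low diversity
lively_head = [-10.0, 5.0, 12.0, -3.0]  # varied head motion -> higher diversity

print(statistics.stdev(static_head))
print(statistics.stdev(lively_head))
```

A higher standard deviation indicates the model produces more varied head motion rather than a nearly frozen pose.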

For comparison, several state-of-the-art talking-head generation methods were selected, including MakeItTalk, Audio2Head, and audio-to-expression generation methods (Wav2Lip, PC-AVS), all evaluated using their public checkpoint weights.

The experimental results show that the proposed method achieves better overall video quality and head-pose diversity, while performing comparably to other full talking-head generation methods on lip-sync metrics.

The researchers argue that these lip-sync metrics are so sensitive to audio that unnatural lip movements may even score better; the proposed method achieves scores similar to real videos, which also demonstrates its advantage.

As the visual results of the different methods show, the proposed method's visual quality is very close to the original target video while also matching the expected variety of head poses.

Compared to other methods, Wav2Lip generates blurred half-faces; PC-AVS and Audio2Head struggle to preserve the identity of the source image; Audio2Head can only generate frontal talking faces; and MakeItTalk and Audio2Head generate distorted face videos due to 2D warping.



Reprinted from 新智元 (New Zhiyuan)

