Learn2Talk: 3D Talking Face Learns from 2D Talking Face

Yixiang Zhuang1, Baoping Cheng2, Yao Cheng2, Yuntao Jin1, Renshuai Liu1, Chengyang Li1, Juncong Lin1, Jing Liao3, Xuan Cheng1*
1School of Informatics, Xiamen University, 2China Mobile (Hangzhou) Information Technology Co., Ltd.
3Department of Computer Science, City University of Hong Kong

Fig. 1: Given speech audio, our method synthesizes facial motion and head motion, and drives a 3D Gaussian Splatting based head avatar.

Abstract

Speech-driven facial animation methods generally fall into two classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking face has not gone as deep as that on 2D talking face in terms of lip synchronization (lip-sync) and speech perception.

To bridge the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which constructs a better 3D talking face network by exploiting two points of expertise from the field of 2D talking face. First, inspired by the audio-video sync network, a 3D lip-sync expert model is devised to pursue lip-sync between audio and 3D facial motion. Second, a teacher model selected from the 2D talking face methods is used to guide the training of the audio-to-3D-motion regression network, yielding higher 3D vertex accuracy.
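
To make the first point concrete, below is a minimal sketch of a SyncNet-style 3D lip-sync expert and its loss, assuming an audio-window encoder and a lip-motion-window encoder that embed into a shared space. The module names, feature dimensions and windowing are illustrative assumptions, not the paper's actual implementation; a pretrained expert of this kind can then score synchronization between audio and predicted motion during training.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LipSyncExpert3D(nn.Module):
        """Embeds short audio windows and lip-vertex windows into one space."""
        def __init__(self, audio_dim=80 * 16, motion_dim=5 * 300 * 3, embed_dim=256):
            super().__init__()
            # Audio branch: a mel-spectrogram window (B, 80, 16) -> embedding.
            self.audio_enc = nn.Sequential(
                nn.Linear(audio_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
            # Motion branch: lip-region vertices of a few frames (B, 5, 300, 3) -> embedding.
            self.motion_enc = nn.Sequential(
                nn.Linear(motion_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))

        def forward(self, audio_win, lip_verts_win):
            a = F.normalize(self.audio_enc(audio_win.flatten(1)), dim=-1)
            m = F.normalize(self.motion_enc(lip_verts_win.flatten(1)), dim=-1)
            return a, m

    def lip_sync_loss(audio_emb, motion_emb, is_synced):
        # Cosine similarity mapped to [0, 1] and supervised with BCE,
        # in the spirit of the SyncNet used by 2D talking-face methods.
        sim = (F.cosine_similarity(audio_emb, motion_emb) + 1.0) / 2.0
        return F.binary_cross_entropy(sim, is_synced.float())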

Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-art methods. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.

Video

Due to video compression, the online video contains some lip-sync artifacts. For a better audio-visual experience, please download the high-definition video.

Qualitative Evaluation

We visually compare our method with the competitors, illustrating six typical frames of synthesized facial animation speaking specific syllables. Compared with the competitors, the lip movements produced by Learn2Talk are more accurately articulated with the speech signals and more consistent with the reference.

Fig. 2: Visual comparisons of sampled facial motions animated by different methods on VOCA-Test (left) and BIWI-Test-A (right).

Fig. 3 shows the 3D reconstruction error heatmaps of the sampled frames produced by the three methods. The per-vertex errors are color-coded on the reconstructed mesh for visualization. It can be observed that our method obtains more accurate vertices in both the mouth region and the upper-face region.

Fig. 3: 3D reconstruction error heatmaps of sampled facial motions predicted by different methods on VOCA-Test (left) and BIWI-Test-A (right).
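
For reference, the per-vertex errors visualized in Fig. 3 can be computed as plain L2 distances between predicted and ground-truth vertices and then mapped to a colormap. The sketch below assumes a fixed error range and the jet colormap, which may differ from the figure's actual settings.

    import numpy as np
    import matplotlib.pyplot as plt

    def vertex_error_colors(pred_verts, gt_verts, max_err=10.0):
        """pred_verts, gt_verts: (V, 3) arrays; max_err sets the color range."""
        err = np.linalg.norm(pred_verts - gt_verts, axis=1)   # (V,) per-vertex L2 error
        norm = np.clip(err / max_err, 0.0, 1.0)               # normalize for display
        colors = plt.get_cmap("jet")(norm)[:, :3]             # RGB color per vertex
        return err, colors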

Fig. 4 shows visual comparisons on other languages, including Chinese, Japanese and Korean. Although trained on English datasets, our method generalizes well to other languages, because it relies on the signal characteristics of the speech audio rather than on words or linguistics. From Fig. 4, it can be observed that our method generates more expressive animation, with clear mouth-opening, pouting and mouth-closing movements. Comparisons of the facial animations over all frames are available in the supplementary material.

Fig. 4: Visual comparisons of sampled facial motions predicted by different methods across languages, including Chinese, Japanese and Korean.
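
As an illustration of this language-agnostic design, the sketch below extracts features with a pretrained wav2vec 2.0 encoder, a common audio front-end in 3D talking-face models; the exact feature extractor used by Learn2Talk may differ, and the input file name is hypothetical.

    import torch
    import librosa
    from transformers import Wav2Vec2Processor, Wav2Vec2Model

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

    # Load speech in any language; only the 16 kHz waveform matters.
    wav, _ = librosa.load("speech_any_language.wav", sr=16000)   # hypothetical file
    inputs = processor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        feats = encoder(inputs.input_values).last_hidden_state   # (1, T, 768)
    # The features encode acoustic structure rather than words, which is why
    # a model trained on English audio can transfer to other languages.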

Applications

Audio-visual speech recognition. The first application is to synthesize audio-visual data for the task of audio-visual speech recognition. We use Learn2Talk to synthesize video clips from the audio of a subset of LRS3, and then construct labeled audio-visual corpora that can be used to train and test audio-visual speech recognition networks. Such synthesized audio-visual data is valuable for labeled data augmentation and personal privacy protection.
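
One possible way to assemble such a corpus is sketched below: each LRS3 audio clip is rendered into a talking-face video and paired with its existing transcript. The learn2talk_synthesize callable and the directory layout are hypothetical placeholders, not the released code.

    import csv
    from pathlib import Path

    def build_av_corpus(lrs3_subset_dir, out_dir, learn2talk_synthesize):
        """Pair each LRS3 audio clip with a synthesized video and its transcript."""
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / "manifest.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["audio", "video", "transcript"])
            for wav_path in sorted(Path(lrs3_subset_dir).glob("**/*.wav")):
                txt_path = wav_path.with_suffix(".txt")                # LRS3 transcript file
                video_path = out_dir / (wav_path.stem + ".mp4")
                learn2talk_synthesize(str(wav_path), str(video_path))  # render the 3D talking face
                writer.writerow([str(wav_path), str(video_path), txt_path.read_text().strip()])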

Speech-driven 3DGS-based avatar animation. The recent 3DGS method achieves high rendering quality for novel-view synthesis with real-time performance. Instead of training a neural network over 3D space as NeRF does, it optimizes discrete geometric primitives, namely 3D Gaussians. 3DGS has already been used to reconstruct head avatars and human body avatars from multi-view images. The second application of our method is speech-driven animation of a 3DGS-based head avatar. Our speech-driven 3DGS animation stays in sync with the driving audio and supports multi-view synthesis in each frame.
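
The per-frame driving loop could look like the sketch below, where speech-predicted FLAME parameters deform the mesh and the Gaussians bound to it are re-splatted for each requested view. The flame_model call, avatar.update_from_mesh and the renderer are placeholders for an actual mesh-rigged 3DGS avatar implementation, not the paper's API.

    def animate_avatar(flame_model, avatar, renderer, exprs, jaw_poses, cameras):
        """exprs: (T, E) expression codes; jaw_poses: (T, 3); cameras: T view specs."""
        frames = []
        for t in range(exprs.shape[0]):
            # 1) Deform the FLAME mesh with the speech-predicted parameters.
            verts = flame_model(expression=exprs[t:t + 1], jaw_pose=jaw_poses[t:t + 1])
            # 2) Move the Gaussians along with the mesh they are bound to.
            avatar.update_from_mesh(verts)
            # 3) Splat the updated Gaussians for the requested viewpoint.
            frames.append(renderer(avatar, cameras[t]))
        return frames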

Fig. 5: Visualization of sampled motions of the speech-driven 3DGS-based head avatar. In each frame, we show the FLAME model generated by our method (1st row, left), the front view (1st row, right), and two side views (2nd row) of the driven avatar.

BibTeX


    @misc{zhuang2024learn2talk,
          title={Learn2Talk: 3D Talking Face Learns from 2D Talking Face}, 
          author={Yixiang Zhuang and Baoping Cheng and Yao Cheng and Yuntao Jin and Renshuai Liu and Chengyang Li and Xuan Cheng and Jing Liao and Juncong Lin},
          year={2024},
          eprint={2404.12888},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
    }