Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

tl;dr

Diff2Lip: arbitrary speech + face videos → high quality lip-sync.
Applications: movies, education, virtual avatars, (eventually) video conferencing*.
*Diff2Lip is not real time yet, but we are hopeful that future versions will be.

(a) Video Source (b) Wav2Lip [1] (c) PC-AVS [2] (d) Diff2Lip (ours)

Abstract

The task of lip synchronization (lip-sync) seeks to match the lips of human faces with different audio. It has various applications in the film industry as well as for creating virtual avatars and for video conferencing. This is a challenging problem, as one needs to simultaneously introduce detailed, realistic lip movements while preserving the identity, pose, emotions, and image quality. Many previous methods that try to solve this problem suffer from image quality degradation due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model which is able to do lip synchronization in-the-wild while preserving these qualities. We train our model on VoxCeleb2, a video dataset containing in-the-wild talking face videos. Extensive studies show that our method outperforms popular methods like Wav2Lip and PC-AVS on the Fréchet Inception Distance (FID) metric. We show results in both the reconstruction (same audio-video inputs) and cross (different audio-video inputs) settings on the VoxCeleb2 and LRW datasets.
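For readers unfamiliar with the metric, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception network features of real and generated images. The symbols below follow the usual convention and are not taken from this page: \mu_r, \Sigma_r are the mean and covariance of the features of real frames, and \mu_g, \Sigma_g those of generated frames.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated frames are statistically closer to real ones.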

Overview of our approach:
Top: Diff2Lip uses an audio-conditioned diffusion model to generate lip-synchronized videos.
Bottom: Zooming in on the mouth region shows that our method generates high-quality video frames without suffering from identity loss.

In-The-Wild Examples

Diffusion Process Visualization

Intermediate States

The frame on the left (Masked Frame) is iteratively denoised into the frame on the right (Generated Frame).
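The iterative denoising shown in this section can be illustrated with a generic DDPM-style sampling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the linear noise schedule, the step count, the flat list-of-floats "frame", and the names `make_schedule`, `sample`, `predict_noise`, and `make_oracle` are all hypothetical. In a real system `predict_noise` would be a trained network conditioned on the masked frame and the audio; here an oracle that already knows the target frame stands in for it so the loop can be run end to end.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed) and cumulative products of 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def sample(predict_noise, masked_frame, audio, num_steps=50, seed=0):
    """Start from Gaussian noise and apply the standard DDPM reverse update."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in masked_frame]
    for t in reversed(range(num_steps)):
        # Hypothetical conditional noise predictor (a network in practice).
        eps = predict_noise(x, t, masked_frame, audio)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        # No fresh noise is added on the final step.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [inv_sqrt_alpha * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x


def make_oracle(target, alpha_bars):
    """Stand-in predictor: returns the exact noise implied by a known target."""
    def predict_noise(x, t, masked_frame, audio):
        a = alpha_bars[t]
        return [(xi - math.sqrt(a) * ti) / math.sqrt(1.0 - a)
                for xi, ti in zip(x, target)]
    return predict_noise


if __name__ == "__main__":
    target = [0.2, -0.5, 0.9, 0.0]   # toy "ground-truth" pixels
    masked = [0.2, -0.5, 0.0, 0.0]   # toy frame with the last two pixels masked
    _, alpha_bars = make_schedule(50)
    frame = sample(make_oracle(target, alpha_bars), masked, audio=None)
    print([round(v, 4) for v in frame])
```

With the oracle predictor the final step recovers the target frame up to floating-point error, which makes the update rule easy to check. A learned predictor would only approximate this.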


More Video Comparisons

(a) Video Source (b) Wav2Lip [1] (c) PC-AVS [2] (d) Diff2Lip (ours)
