Microsoft Research Asia has unveiled an AI model that can generate realistic talking-head deepfake videos from a single still image and an audio track.
The model was trained on footage of approximately 6,000 talking faces from the VoxCeleb2 dataset. It can animate a still image so that it lip-syncs to a supplied voice track, producing realistic facial expressions and natural head movements.
The technology, called VASA-1, can reportedly generate synced videos at 512x512 resolution and up to 40 frames per second with negligible starting latency.
Photo credit: Microsoft Research Asia