Date: 2024-04-22

According to Microsoft’s official website, VASA stands for Virtual Avatar Speech Animation, a revolutionary framework designed to generate lifelike talking faces from single static images and audio clips.

VASA-1, its flagship model, synchronises lip movements with audio while capturing a wide range of facial nuances and natural head motions, which lends authenticity and liveliness to virtual characters.

The core innovations are a holistic model that generates facial dynamics and head movements in a face latent space, and an expressive, disentangled face latent space learned from videos.

"Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively," Microsoft said.

"Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors," it added.