Alibaba launches ChatAnyone! A real-time AI character video generation framework

Alibaba releases a real-time AI video generation framework that points toward the future of interactive video.
Core content:
1. ChatAnyone: Alibaba's real-time AI character video generation framework
2. Generates high-fidelity portrait videos from audio input
3. Supports real-time interaction and a wide range of application scenarios
Overview
ChatAnyone is a real-time stylized portrait video generation framework from Alibaba's Tongyi Lab. From audio input, it generates portrait videos with rich facial expressions and upper-body movements.
By combining an efficient hierarchical motion diffusion model with a hybrid control fusion generation model, it achieves high-fidelity, natural video generation with real-time interaction, making it suitable for scenarios such as virtual anchors, video conferencing, content creation, education, customer service, marketing, social entertainment, and healthcare. ChatAnyone also supports stylized control, adjusting expression styles on demand for personalized animation.
Abstract
Real-time interactive video-chat portraits are increasingly seen as a key direction for the future, especially given the remarkable progress in text and voice chat technologies. However, existing methods focus mainly on generating head movements in real time and struggle to produce synchronized body movements that match them.
In addition, fine-grained control over the nuances of speaking styles and facial expressions remains a challenge. To address these limitations, we introduce a new framework for stylized real-time portrait video generation, enabling expressive and flexible video chats from talking heads to upper-body interactions. Our approach consists of the following two stages.
The first stage is an efficient hierarchical motion diffusion model that takes audio as input and accounts for both explicit and implicit motion representations, generating diverse facial expressions with style control while keeping head and body movements synchronized.
The second stage aims to generate portrait videos featuring upper body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements and further perform facial refinement to enhance the overall realism and expressiveness of the portrait videos.
In addition, our method supports efficient, continuous generation of upper-body portrait video at resolutions up to 512 × 768 and up to 30 fps on a 4090 GPU, enabling real-time interactive video chat. Experimental results show that our method produces portrait videos with rich expressiveness and natural upper-body movements.
Method
An efficient hierarchical motion diffusion model is proposed for audio-to-motion representation: it hierarchically generates facial and then body control signals from the input audio, considering both explicit and implicit motion signals to achieve accurate facial expressions. In addition, fine-grained expression control is introduced to vary expression intensity and to transfer stylized expressions from reference videos, producing controllable and personalized expressions.
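The hierarchical idea can be pictured as two conditioned denoising passes: one for facial motion, and one for body motion conditioned on the generated facial motion. The sketch below is a toy illustration only; the step count, feature dimensions, style scaling, and the stand-in denoiser are assumptions, not the released model.

```python
# A minimal sketch of the hierarchical audio-to-motion idea described above.
# All shapes, the style scaling, and the simplified denoising loop are
# assumptions for illustration, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

T = 50                                   # number of denoising steps (assumed)
audio_feat = rng.normal(size=(100, 64))  # 100 frames of audio features (assumed dims)
style = 1.3                              # expression-intensity control; >1 exaggerates

def denoise_step(x_t, cond, t):
    """Stand-in for a learned denoiser: a toy pull toward the conditioning
    signal so the loop runs without trained weights."""
    return x_t + 0.1 * (cond - x_t) * (t / T)

# Stage A: facial motion (explicit keypoints + implicit expression latent).
face_cond = audio_feat @ rng.normal(size=(64, 32))   # toy audio-to-face conditioning
face_motion = rng.normal(size=(100, 32))             # start from noise
for t in reversed(range(1, T + 1)):
    face_motion = denoise_step(face_motion, face_cond, t)
face_motion *= style                                 # fine-grained expression control

# Stage B: body motion generated *conditioned on* the facial motion,
# which is what keeps head and upper-body movements synchronized.
body_cond = np.concatenate([face_cond, face_motion], axis=-1) @ rng.normal(size=(64, 16))
body_motion = rng.normal(size=(100, 16))
for t in reversed(range(1, T + 1)):
    body_motion = denoise_step(body_motion, body_cond, t)

print(face_motion.shape, body_motion.shape)  # (100, 32) (100, 16)
```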
Designed for upper body image generation, the hybrid control fusion generative model exploits explicit keypoints for direct and editable facial expression generation while introducing implicit offsets based on explicit signals to capture facial variations across different avatar styles. We also inject explicit hand control for more accurate and realistic hand textures and movements. Furthermore, a facial refinement module is employed to enhance facial fidelity, ensuring highly expressive and realistic portrait videos.
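The hybrid control idea, explicit editable keypoints combined with learned implicit offsets, plus explicit hand control injected into the generator and a final face refinement pass, can be sketched roughly as follows. The heatmap rendering, shapes, and the stand-in generator and refinement steps are illustrative assumptions rather than the actual architecture.

```python
# A toy sketch of hybrid control fusion: explicit keypoints give a direct,
# editable control signal, implicit offsets add per-style facial detail, and
# an explicit hand control map is injected for hand fidelity. Shapes and the
# stand-in generator/refinement are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 96, 64, 3

ref_frame = rng.random((H, W, C))                       # reference portrait image
face_kpts = rng.random((68, 2)) * [W, H]                # explicit facial keypoints (assumed 68)
implicit_offset = 0.05 * rng.standard_normal((68, 2))   # learned residual per keypoint
hand_kpts = rng.random((21, 2)) * [W, H]                # explicit hand keypoints (assumed 21)

def render_heatmap(points, h, w, sigma=3.0):
    """Rasterize keypoints into a Gaussian heatmap used as a control signal."""
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros((h, w))
    for x, y in points:
        hm += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return hm / hm.max()

# Fuse explicit keypoints with implicit offsets before rendering the control map.
face_ctrl = render_heatmap(face_kpts + implicit_offset, H, W)
hand_ctrl = render_heatmap(hand_kpts, H, W)

# Stand-in "generator": modulate the reference frame with both control maps.
frame = ref_frame * (0.7 + 0.3 * face_ctrl[..., None]) + 0.2 * hand_ctrl[..., None]

# Stand-in face refinement: adjust only the region covered by the face control map.
face_mask = (face_ctrl > 0.5)[..., None]
frame = np.where(face_mask, np.clip(frame * 1.1, 0, 1), np.clip(frame, 0, 1))

print(frame.shape)  # (96, 64, 3)
```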
We built a scalable real-time generation framework for interactive video chat that adapts to various scenarios through flexible sub-module combinations, supporting tasks ranging from talking-head animation to upper-body generation with gestures. In addition, we built an efficient streaming inference pipeline that reaches 30 fps at a maximum resolution of 512 × 768 on a 4090 GPU, ensuring a smooth and immersive real-time video chat experience.
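A streaming pipeline of this kind can be pictured as a paced per-frame loop over incoming audio chunks. The code below only illustrates the pacing logic for a 30 fps, 512 × 768 target; the audio source and the per-frame generator are placeholders, not the actual model.

```python
# A minimal sketch of a streaming inference loop for real-time video chat,
# assuming a chunked audio feed and a per-frame generator. Only the 30 fps
# pacing logic is shown; both callables below are placeholders.
import time
import numpy as np

FPS = 30
FRAME_SIZE = (768, 512, 3)           # H x W x C target resolution
rng = np.random.default_rng(0)

def next_audio_chunk():
    """Placeholder for the live audio stream (one frame's worth of features)."""
    return rng.normal(size=(1, 64))

def generate_frame(audio_chunk):
    """Placeholder for motion generation + rendering of a single frame."""
    return np.zeros(FRAME_SIZE, dtype=np.uint8)

def stream(num_frames=90):
    interval = 1.0 / FPS
    next_deadline = time.perf_counter()
    for _ in range(num_frames):
        frame = generate_frame(next_audio_chunk())
        # ... push `frame` to the video sink / chat client here ...
        next_deadline += interval
        sleep = next_deadline - time.perf_counter()
        if sleep > 0:
            time.sleep(sleep)        # hold a steady 30 fps when generation is faster
        # if generation takes longer than ~33 ms, frames would have to be dropped

if __name__ == "__main__":
    stream()
```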
Audio-driven upper-body animation
We can generate highly expressive audio-driven upper-body digital human videos in different scenarios, with or without hands.
Audio-driven Talking Head animation
We can achieve highly accurate lip-sync results and generate expressive facial expressions and natural head poses.
Audio-driven stylized animation
We can generate audio-driven results for stylized characters, while also enabling the creation of highly expressive singing videos.
Dual Host AI Podcast Demo
We can also generate dual-host podcasts to enable AI-driven conversations.
Interactive Demo
Our method achieves real-time generation at 30fps on a 4090 GPU, supporting practical applications in interactive video chat.
Application scenarios
Virtual anchors and video conferencing: virtual avatars for news broadcasting, live streaming, and video conferencing.
Content creation and entertainment: stylized animated characters, virtual concerts, AI podcasts, and more.
Education and training: virtual teachers and virtual characters for training simulations.
Customer service: virtual customer-service avatars that provide vivid answers and interactions.
Marketing and advertising: virtual spokespersons and highly interactive advertising content.