Alibaba OmniTalker is released: 0.8B parameters, 25 FPS real-time audio-video generation, and precise cross-language emotional synchronization

Written by
Audrey Miles
Updated on: July 3, 2025
Recommendation

Alibaba OmniTalker's technical innovations: 0.8B parameters delivering 25 FPS real-time audio-video generation with precise cross-language emotional synchronization.

Core content:
1. OmniTalker's technical features and disruptive breakthroughs
2. Performance comparison and real-time interactive capabilities
3. Quick-start tutorial and enterprise-level application scenarios

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)


Introduction:

Digital human technology has made a major breakthrough. OmniTalker, the latest product from Alibaba Tongyi Lab, is billed as the world's first end-to-end text-driven talking-head video generation system. From a single reference video, it achieves zero-shot style reproduction in Chinese and English, supports six emotional expressions such as anger and happiness, and redefines the human-computer interaction experience with a real-time generation speed of 25 frames per second. This article analyzes its dual-branch Diffusion Transformer architecture and shows how to generate a talking-head video from a single sentence.


Main text:

1. Technological breakthroughs

  •  Audio-video sync engine:
    # Audio-visual fusion module (pseudo code)
    class AudioVisualFusion(nn.Module):
        def forward(self, audio_feat, visual_feat):
            # Cross-modal attention conditions each stream on the other
            cross_attn = AudioVisualAttention(audio_feat, visual_feat)
            return audio_feat + cross_attn, visual_feat + cross_attn
    • Lip-sync accuracy of 98.2% (traditional approaches: about 85%)
    • Real-time interaction with latency under 40 ms
  •  Zero-shot style transfer:

    | Reference video | Reproducible elements | Example effect |
    |---|---|---|
    | Lei Jun's speech | Hubei accent + iconic gestures | Generated English content keeps the original speaking style |
    | News anchor | Standard intonation + professional expression management | Automatically adapts to emotions such as anger/sadness |
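The cross-modal attention behind the sync engine can be sketched in plain Python. The snippet below is a minimal single-head, projection-free attention in which each audio frame attends over the visual frames, followed by the residual fusion from the pseudo code; the shapes and the assumption that both streams are frame-aligned at 25 FPS are illustrative, not the published architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(audio_feat, visual_feat):
    """Each audio frame attends over all visual frames (single head, no projections)."""
    d = len(audio_feat[0])
    out = []
    for q in audio_feat:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in visual_feat])
        out.append([sum(w * v[j] for w, v in zip(weights, visual_feat))
                    for j in range(d)])
    return out

def fuse(audio_feat, visual_feat):
    """Residual fusion as in the pseudo code; assumes the streams are frame-aligned."""
    cross = cross_modal_attention(audio_feat, visual_feat)
    fused_audio = [[a + c for a, c in zip(af, cf)] for af, cf in zip(audio_feat, cross)]
    fused_visual = [[v + c for v, c in zip(vf, cf)] for vf, cf in zip(visual_feat, cross)]
    return fused_audio, fused_visual
```

The residual form matters: each stream keeps its own identity while absorbing a correction from the other, which is what lets lips and phonemes stay locked together.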

2. Performance comparison

| Metric | OmniTalker | Wav2Lip | EMO |
|---|---|---|---|
| Generation speed (FPS) | 25 | 12 | 18 |
| Parameter scale | 0.8B | 0.3B | 1.5B |
| Maximum generation length | 10 minutes | 30 seconds | 5 minutes |
| Cross-language style preservation | | | |
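The generation-speed figure also explains the latency claim from section 1: at 25 FPS the per-frame time budget is 1/25 s = 40 ms, exactly the real-time threshold quoted there. A quick check:

```python
def frame_budget_ms(fps):
    """Time available to generate one video frame, in milliseconds."""
    return 1000.0 / fps

budget = frame_budget_ms(25)   # 40.0 ms: matches the <40 ms latency claim
slower = frame_budget_ms(12)   # ~83.3 ms per frame at Wav2Lip's 12 FPS
```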

3. Five-minute quick experience

  1.  Environment preparation:
    # Install the base dependency
    pip install omnitalker-torch==2.5.0
  2.  Single-sentence generation:
    from omnitalker import Generator

    gen = Generator(ref_video="lei_jun.mp4")
    output = gen.generate(
        text="Xiaomi 14 sales exceeded 1 million units",
        emotion="happy",
        language="en",  # supports Chinese-English conversion
    )
    output.save("result.mp4")
  3.  Long-video generation:
    # Process in segments to avoid memory overflow
    for paragraph in long_text.split("\n"):
        gen.stream(paragraph, buffer_size=60)  # 60-second buffer
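For very long scripts, splitting on `\n` alone can still yield over-long paragraphs. A hypothetical pre-processing helper (plain Python, not part of the omnitalker API) can chunk the text sentence by sentence under a length cap before streaming:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most max_chars characters."""
    sentences = re.split(r'(?<=[.!?。！？])\s*', text)
    chunks, current = [], ""
    for s in sentences:
        s = s.strip()
        if not s:
            continue
        candidate = (current + " " + s).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush before the cap is exceeded
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be fed to `gen.stream(chunk, buffer_size=60)` as in the loop above.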

4. Enterprise-level application scenarios

  •  Cross-border e-commerce live streaming:
    • One host generates real-time commentary in Chinese, English, and Japanese
    • Expressions adjust automatically to review sentiment (smile for positive reviews, concern for negative ones)
  •  Online education:
    • Historical figures revived in multiple languages (Confucius reciting the Analects in English)
    • Emotion-aware courseware (warning expressions for dangerous chemistry experiments)
  •  Psychotherapy:
    • Multi-emotion AI counselor
    • Emotional mirroring therapy for patients with depression

5. In-depth Customization Guide

  •  Style-enhancement training:
    # config/train.yaml
    style_enhance:
      audio:
        prosody_weight: 0.9  # strengthen intonation features
      visual:
        micro_expression:    # personalized micro-expressions
          blink_rate: 0.3
          smile_asymmetry: 0.2
  •  Legal compliance settings:
    gen.set_watermark(
        text="AI generated content",
        position="bottom_right",
        opacity=0.5,
    )

Ethical warning:

⚠️ Usage restrictions:

  • Cloning politicians' voices is prohibited (built-in blacklist of 100+ celebrity voiceprints)
  • Financial-advice content must carry risk warnings
  • The emotion generation module disables extreme emotional expressions
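One plausible shape for the voiceprint blacklist above is a similarity gate over speaker embeddings. The sketch below is entirely hypothetical (the real system's embedding model and threshold are not published): a reference voice is rejected if its embedding is too close, by cosine similarity, to any protected voiceprint.

```python
import math

BLACKLIST_THRESHOLD = 0.85  # assumed similarity cutoff, not a published value

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_blacklisted(voiceprint, blacklist):
    """Reject a reference voice that is too close to any protected voiceprint."""
    return any(cosine_similarity(voiceprint, ref) >= BLACKLIST_THRESHOLD
               for ref in blacklist)
```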

Architecture deep dive:

How does the dual-branch DiT work?

  1.  Audio branch: text → Wav2Vec2 features → mel-spectrogram generation
  2.  Visual branch: text → FLAME model parameters → facial action units
  3.  Fusion module: audio and video are synchronized through cross-modal attention

Citation:
@article{omnitalker2025,
  title={OmniTalker: Real-Time Text-Driven Talking Head Generation with Audio-Visual Style Replication},
  author={Alibaba Tongyi Lab},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}
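The three pipeline steps can be sketched as a data-flow stub. Every function below is a placeholder (the real Wav2Vec2 and FLAME computations are not reproduced); the point is only the ordering of the branches and the frame alignment in the fusion step:

```python
def audio_branch(text):
    """Step 1 (placeholder): text -> acoustic features -> mel frames."""
    return [float(ord(c) % 10) / 10.0 for c in text]   # one stand-in frame per char

def visual_branch(text):
    """Step 2 (placeholder): text -> FLAME parameters -> facial action units."""
    return [len(w) / 10.0 for w in text.split()]       # one stand-in unit per word

def fuse_streams(mel, action_units):
    """Step 3 (placeholder): truncate both streams to a common frame count."""
    n = min(len(mel), len(action_units))
    return list(zip(mel[:n], action_units[:n]))
```

In the real system, the fusion step is the cross-modal attention module rather than simple truncation; the stub only shows where synchronization has to happen in the pipeline.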

Summary:

The launch of OmniTalker marks the entry of digital-human generation into the era of real-time interaction. Its unified framework design delivers film-grade output while staying lightweight at 0.8B parameters.