Alibaba OmniTalker is released: 0.8B parameters, 25 FPS real-time audio-video generation, and precise cross-language emotional synchronization

Written by
Audrey Miles
Updated on: July 3, 2025
Recommendation

Alibaba OmniTalker's technical innovations: 0.8B parameters delivering 25 FPS real-time audio-video generation with precise cross-language emotional synchronization.

Core content:
1. OmniTalker's technical features and disruptive breakthroughs
2. Performance comparison and real-time interactive capabilities
3. Quick-start tutorial and enterprise-level application scenarios

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)


Introduction:

Digital human technology has made a major breakthrough. OmniTalker, the latest product from Alibaba Tongyi Lab, is billed as the world's first end-to-end text-driven talking-head video generation system. From a single reference video, it achieves zero-shot style reproduction in Chinese and English, supports six emotional expressions such as anger and happiness, and redefines the human-computer interaction experience with a real-time generation speed of 25 frames per second. This article analyzes its dual-branch Diffusion Transformer architecture and shows how to generate a talking-head video from a single sentence.


Main text:

1. Technological breakthroughs

  •  Audio-video sync engine:
    # Audio-visual fusion module (pseudo code)
    class AudioVisualFusion(nn.Module):
        def forward(self, audio_feat, visual_feat):
            # Cross-modal attention conditions each stream on the other
            cross_attn = AudioVisualAttention(audio_feat, visual_feat)
            return audio_feat + cross_attn, visual_feat + cross_attn
    • Lip-sync accuracy of 98.2% (traditional approaches: about 85%)
    • Real-time interaction with latency under 40 ms
  •  Zero-shot style transfer:

    | Reference video | Reproducible elements | Example effect |
    |---|---|---|
    | Lei Jun's speech | Hubei accent + iconic gestures | Generated English content keeps the original speaking style |
    | News anchor | Standard intonation + professional expression management | Automatically adapts to emotions such as anger/sadness |
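The cross-modal attention behind the sync engine can be sketched in plain Python. The snippet below is a minimal single-head, projection-free attention in which each audio frame attends over the visual frames, followed by the residual fusion from the pseudo code; the shapes and the assumption that both streams are frame-aligned at 25 FPS are illustrative, not the published architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(audio_feat, visual_feat):
    """Each audio frame attends over all visual frames (single head, no projections)."""
    d = len(audio_feat[0])
    out = []
    for q in audio_feat:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in visual_feat])
        out.append([sum(w * v[j] for w, v in zip(weights, visual_feat))
                    for j in range(d)])
    return out

def fuse(audio_feat, visual_feat):
    """Residual fusion as in the pseudo code; assumes the streams are frame-aligned."""
    cross = cross_modal_attention(audio_feat, visual_feat)
    fused_audio = [[a + c for a, c in zip(af, cf)] for af, cf in zip(audio_feat, cross)]
    fused_visual = [[v + c for v, c in zip(vf, cf)] for vf, cf in zip(visual_feat, cross)]
    return fused_audio, fused_visual
```

The residual form matters: each stream keeps its own identity while absorbing a correction from the other, which is what lets lips and phonemes stay locked together.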

2. Performance comparison

| Metric | OmniTalker | Wav2Lip | EMO |
|---|---|---|---|
| Generation speed (FPS) | 25 | 12 | 18 |
| Parameter scale | 0.8B | 0.3B | 1.5B |
| Maximum generation length | 10 minutes | 30 seconds | 5 minutes |
| Cross-language style preservation | | | |
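The generation-speed figure also explains the latency claim from section 1: at 25 FPS the per-frame time budget is 1/25 s = 40 ms, exactly the real-time threshold quoted there. A quick check:

```python
def frame_budget_ms(fps):
    """Time available to generate one video frame, in milliseconds."""
    return 1000.0 / fps

budget = frame_budget_ms(25)   # 40.0 ms: matches the <40 ms latency claim
slower = frame_budget_ms(12)   # ~83.3 ms per frame at Wav2Lip's 12 FPS
```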

3. Five-minute quick experience

  1.  Environment preparation:
    # Install the base dependency
    pip install omnitalker-torch==2.5.0
  2.  Single-sentence generation:
    from omnitalker import Generator

    gen = Generator(ref_video="lei_jun.mp4")
    output = gen.generate(
        text="Xiaomi 14 sales exceeded 1 million units",
        emotion="happy",
        language="en",  # supports Chinese-English conversion
    )
    output.save("result.mp4")
  3.  Long-video generation:
    # Process in segments to avoid memory overflow
    for paragraph in long_text.split("\n"):
        gen.stream(paragraph, buffer_size=60)  # 60-second buffer
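For very long scripts, splitting on `\n` alone can still yield over-long paragraphs. A hypothetical pre-processing helper (plain Python, not part of the omnitalker API) can chunk the text sentence by sentence under a length cap before streaming:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most max_chars characters."""
    sentences = re.split(r'(?<=[.!?。！？])\s*', text)
    chunks, current = [], ""
    for s in sentences:
        s = s.strip()
        if not s:
            continue
        candidate = (current + " " + s).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush before the cap is exceeded
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be fed to `gen.stream(chunk, buffer_size=60)` as in the loop above.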

4. Enterprise-level application scenarios

  •  Cross-border e-commerce live streaming:
    • One host generates real-time commentary in Chinese, English, and Japanese
    • Expressions adjust automatically to review sentiment (smile for positive reviews, concern for negative ones)
  •  Online education:
    • Historical figures revived in multiple languages (Confucius reciting the Analects in English)
    • Emotion-aware courseware (warning expressions for dangerous chemistry experiments)
  •  Psychotherapy:
    • Multi-emotion AI counselor
    • Emotional mirroring therapy for patients with depression

5. In-depth Customization Guide

  •  Style-enhancement training:
    # config/train.yaml
    style_enhance:
      audio:
        prosody_weight: 0.9  # strengthen intonation features
      visual:
        micro_expression:    # personalized micro-expressions
          blink_rate: 0.3
          smile_asymmetry: 0.2
  •  Legal compliance settings:
    gen.set_watermark(
        text="AI generated content",
        position="bottom_right",
        opacity=0.5,
    )

Ethical warning:

⚠️ Usage restrictions:

  • Cloning politicians' voices is prohibited (built-in blacklist of 100+ celebrity voiceprints)
  • Financial-advice content must carry risk warnings
  • The emotion generation module disables extreme emotional expressions
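One plausible shape for the voiceprint blacklist above is a similarity gate over speaker embeddings. The sketch below is entirely hypothetical (the real system's embedding model and threshold are not published): a reference voice is rejected if its embedding is too close, by cosine similarity, to any protected voiceprint.

```python
import math

BLACKLIST_THRESHOLD = 0.85  # assumed similarity cutoff, not a published value

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_blacklisted(voiceprint, blacklist):
    """Reject a reference voice that is too close to any protected voiceprint."""
    return any(cosine_similarity(voiceprint, ref) >= BLACKLIST_THRESHOLD
               for ref in blacklist)
```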

Architecture deep dive:

How does the dual-branch DiT work?

  1.  Audio branch: text → Wav2Vec2 features → mel-spectrogram generation
  2.  Visual branch: text → FLAME model parameters → facial action units
  3.  Fusion module: audio and video are synchronized through cross-modal attention

Citation:
@article{omnitalker2025,
  title={OmniTalker: Real-Time Text-Driven Talking Head Generation with Audio-Visual Style Replication},
  author={Alibaba Tongyi Lab},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}
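The three pipeline steps can be sketched as a data-flow stub. Every function below is a placeholder (the real Wav2Vec2 and FLAME computations are not reproduced); the point is only the ordering of the branches and the frame alignment in the fusion step:

```python
def audio_branch(text):
    """Step 1 (placeholder): text -> acoustic features -> mel frames."""
    return [float(ord(c) % 10) / 10.0 for c in text]   # one stand-in frame per char

def visual_branch(text):
    """Step 2 (placeholder): text -> FLAME parameters -> facial action units."""
    return [len(w) / 10.0 for w in text.split()]       # one stand-in unit per word

def fuse_streams(mel, action_units):
    """Step 3 (placeholder): truncate both streams to a common frame count."""
    n = min(len(mel), len(action_units))
    return list(zip(mel[:n], action_units[:n]))
```

In the real system, the fusion step is the cross-modal attention module rather than simple truncation; the stub only shows where synchronization has to happen in the pipeline.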

Summary:

The launch of OmniTalker marks the entry of digital-human generation into the era of real-time interaction. Its unified framework design delivers film-grade output while staying lightweight at 0.8B parameters.