Alibaba Qwen's take on advanced voice mode and real-time video chat is here: 10 trials per day

Written by Silas Grey
Updated on: July 9, 2025

Qwen Chat's voice and video chat modes, powered by the all-round AI model Qwen2.5-Omni-7B, have arrived; you can try them 10 times a day.

Core content:
1. Alibaba's Qwen Chat adds real-time voice and video chat features
2. The Qwen2.5-Omni-7B model: an all-round AI supporting text, audio, image, video, and other inputs
3. Qwen2.5-Omni performance evaluation: cross-modal capabilities reach the SOTA level


Alibaba's Qwen Chat now allows real-time voice and video chats, up to 10 times a day



All-rounder Qwen2.5-Omni is now available and open source

The new features are powered by the newly released Qwen2.5-Omni-7B model, an Omni (all-round) model. Simply put, a single model can understand multiple inputs, such as text, audio, images, and video, at the same time, and can output both text and audio.

Alibaba continues its push for open source and has open-sourced the Qwen2.5-Omni-7B model under the Apache 2.0 license. A detailed technical report has also been published, and it is full of useful information.
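Since the weights are openly licensed, you can pull them down directly. Here is a minimal download sketch; it assumes the standard huggingface_hub client, which the announcement itself does not mention:

```python
# Minimal sketch: fetch the Apache-2.0-licensed weights into the local cache.
# Assumes the standard huggingface_hub client; the ModelScope mirror linked
# below works similarly with its own SDK.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Qwen/Qwen2.5-Omni-7B")
print(local_dir)  # path to the downloaded snapshot
```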

Here are all the links for studying the model in depth and getting started (a minimal inference sketch follows the list):

  • Experience the new Qwen Chat features: https://chat.qwenlm.ai
  • Technical report (paper): https://github.com/QwenLM/Qwen2.5-Omni/blob/main/assets/Qwen2.5_Omni.pdf
  • Official blog: https://qwenlm.github.io/blog/qwen2.5-omni
  • GitHub repository: https://github.com/QwenLM/Qwen2.5-Omni
  • Hugging Face model: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
  • ModelScope model: https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B
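For hands-on use, here is a minimal inference sketch following the pattern on the Hugging Face model card. The Qwen2_5OmniForConditionalGeneration and Qwen2_5OmniProcessor class names and the qwen_omni_utils helper reflect that card at the time of writing, so check the GitHub repository for the current API:

```python
# Minimal multimodal inference sketch, adapted from the Hugging Face model
# card pattern; class and helper names are assumptions based on that card.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with the Qwen repo

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# One conversation turn mixing an image (any local path or URL) with text.
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "text", "text": "Describe this picture."},
    ]},
]

text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# The Omni model returns both text token ids and a synthesized waveform.
text_ids, audio = model.generate(**inputs)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```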

Core Architecture: “Thinker-Talker”

The key to Qwen2.5-Omni's versatility lies in its "Thinker-Talker" architecture. This clever design allows the model to think and speak at the same time:

  1. Thinker: plays the role of the brain. It processes input from multiple modalities, such as text, audio, and video, extracting information through dedicated audio and video encoders, then uses a Transformer decoder to understand it, and finally produces high-level semantic representations and the corresponding text.
  2. Talker: serves as the mouth. It receives the high-level representations and text generated by the Thinker in a streaming manner and uses a dual-track autoregressive Transformer decoder to smoothly synthesize and output discrete speech units (tokens).

The key point is that the Talker does not work independently: it directly consumes the high-dimensional representations produced by the Thinker and shares all of the Thinker's historical context. The Thinker and Talker therefore form a single, tightly coordinated model that can be trained and run end to end. This design is the core of the low-latency, high-fluency voice interaction, as the sketch below illustrates.
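To make the streaming handoff concrete, here is a toy Python sketch of the idea. The classes, placeholder representations, and token strings below are purely illustrative and are not the actual implementation:

```python
# Conceptual sketch (not the real implementation): the Thinker streams
# (hidden_state, text_token) pairs; the Talker consumes them incrementally
# while conditioning on the Thinker's entire history.
from typing import Iterator, List, Tuple

class Thinker:
    """Toy stand-in for the multimodal Transformer decoder ('the brain')."""
    def run(self, multimodal_input: str) -> Iterator[Tuple[List[float], str]]:
        for word in multimodal_input.split():
            hidden = [float(len(word))]  # placeholder high-level representation
            yield hidden, word           # streamed to the Talker as produced

class Talker:
    """Toy stand-in for the dual-track autoregressive speech decoder ('the mouth')."""
    def __init__(self) -> None:
        self.context: List[List[float]] = []  # shares ALL of the Thinker's history

    def speak(self, hidden: List[float], token: str) -> str:
        self.context.append(hidden)           # condition on the full shared context
        return f"<speech_unit:{token}:{len(self.context)}>"  # discrete speech token

thinker, talker = Thinker(), Talker()
for hidden, token in thinker.run("hello omni world"):
    print(token, "->", talker.speak(hidden, token))  # text and audio in lockstep
```

The shared context list is the design choice described above: the Talker never re-encodes the input, it conditions directly on everything the Thinker has already produced, which is what keeps latency low.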

How does it perform? Comprehensive and powerful

The research team conducted a comprehensive evaluation of Qwen2.5-Omni, and the results are quite impressive:

Cross-modal capability at SOTA level: Qwen2.5-Omni reaches the current state of the art on tasks that require integrating information from multiple modalities (such as the OmniBench benchmark).

Strong unimodal capabilities: Compared with similarly sized unimodal models (such as Qwen2.5-VL-7B and Qwen2-Audio) and some strong closed-source models (such as Gemini 1.5 Pro), Qwen2.5-Omni is also highly competitive across individual unimodal tasks, including:

  • Speech recognition: Common Voice
  • Speech translation: CoVoST2
  • Audio understanding: MMAU
  • Image reasoning: MMMU, MMStar
  • Video understanding: MVBench
  • Speech generation: Seed-TTS-eval and subjective naturalness evaluation

It can be said that Qwen2.5-Omni maintains its all-round capabilities without sacrificing performance in individual vertical domains.

Summary: