Alibaba Qwen's take on advanced voice mode and real-time video chat is here: 10 trials per day

Written by Silas Grey
Updated on: July 9, 2025

Qwen Chat's voice and video chat modes, powered by the all-round AI model Qwen2.5-Omni-7B, have arrived; you can try them 10 times a day.

Core content:
1. Alibaba's Qwen Chat adds real-time voice and video chat features
2. The Qwen2.5-Omni-7B model: an all-round AI supporting text, audio, image, video, and other inputs
3. Qwen2.5-Omni performance evaluation: cross-modal capabilities reach the SOTA level


Alibaba's Qwen Chat now allows real-time voice and video chats, up to 10 times a day



All-rounder Qwen2.5-Omni is now available and open source

The new features are powered by the newly released Qwen2.5-Omni-7B model, an Omni (all-round) model. Simply put, a single model can understand multiple inputs, such as text, audio, images, and video, at the same time, and can output both text and audio.

Alibaba continues its push for open source and has open-sourced the Qwen2.5-Omni-7B model under the Apache 2.0 license. A detailed technical report has also been published, and it is full of useful information.
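Since the weights are openly licensed, you can pull them down directly. Here is a minimal download sketch; it assumes the standard huggingface_hub client, which the announcement itself does not mention:

```python
# Minimal sketch: fetch the Apache-2.0-licensed weights into the local cache.
# Assumes the standard huggingface_hub client; the ModelScope mirror linked
# below works similarly with its own SDK.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Qwen/Qwen2.5-Omni-7B")
print(local_dir)  # path to the downloaded snapshot
```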

Here are all the links for studying the model in depth and getting started (a minimal inference sketch follows the list):

  • Experience the new Qwen Chat features: https://chat.qwenlm.ai
  • Technical report (paper): https://github.com/QwenLM/Qwen2.5-Omni/blob/main/assets/Qwen2.5_Omni.pdf
  • Official blog: https://qwenlm.github.io/blog/qwen2.5-omni
  • GitHub repository: https://github.com/QwenLM/Qwen2.5-Omni
  • Hugging Face model: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
  • ModelScope model: https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B
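For hands-on use, here is a minimal inference sketch following the pattern on the Hugging Face model card. The Qwen2_5OmniForConditionalGeneration and Qwen2_5OmniProcessor class names and the qwen_omni_utils helper reflect that card at the time of writing, so check the GitHub repository for the current API:

```python
# Minimal multimodal inference sketch, adapted from the Hugging Face model
# card pattern; class and helper names are assumptions based on that card.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with the Qwen repo

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# One conversation turn mixing an image (any local path or URL) with text.
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "text", "text": "Describe this picture."},
    ]},
]

text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# The Omni model returns both text token ids and a synthesized waveform.
text_ids, audio = model.generate(**inputs)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```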

Core Architecture: “Thinker-Talker”

The key to Qwen2.5-Omni's versatility lies in its "Thinker-Talker" architecture. This clever design allows the model to think and speak at the same time:

  1. Thinker: plays the role of the brain. It processes input from multiple modalities, such as text, audio, and video, extracting information through dedicated audio and video encoders, then uses a Transformer decoder to understand it, and finally produces high-level semantic representations and the corresponding text.
  2. Talker: serves as the mouth. It receives the high-level representations and text generated by the Thinker in a streaming manner and uses a dual-track autoregressive Transformer decoder to smoothly synthesize and output discrete speech units (tokens).

The key point is that the Talker does not work independently: it directly consumes the high-dimensional representations produced by the Thinker and shares all of the Thinker's historical context. The Thinker and Talker therefore form a single, tightly coordinated model that can be trained and run end to end. This design is the core of the low-latency, high-fluency voice interaction, as the sketch below illustrates.
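To make the streaming handoff concrete, here is a toy Python sketch of the idea. The classes, placeholder representations, and token strings below are purely illustrative and are not the actual implementation:

```python
# Conceptual sketch (not the real implementation): the Thinker streams
# (hidden_state, text_token) pairs; the Talker consumes them incrementally
# while conditioning on the Thinker's entire history.
from typing import Iterator, List, Tuple

class Thinker:
    """Toy stand-in for the multimodal Transformer decoder ('the brain')."""
    def run(self, multimodal_input: str) -> Iterator[Tuple[List[float], str]]:
        for word in multimodal_input.split():
            hidden = [float(len(word))]  # placeholder high-level representation
            yield hidden, word           # streamed to the Talker as produced

class Talker:
    """Toy stand-in for the dual-track autoregressive speech decoder ('the mouth')."""
    def __init__(self) -> None:
        self.context: List[List[float]] = []  # shares ALL of the Thinker's history

    def speak(self, hidden: List[float], token: str) -> str:
        self.context.append(hidden)           # condition on the full shared context
        return f"<speech_unit:{token}:{len(self.context)}>"  # discrete speech token

thinker, talker = Thinker(), Talker()
for hidden, token in thinker.run("hello omni world"):
    print(token, "->", talker.speak(hidden, token))  # text and audio in lockstep
```

The shared context list is the design choice described above: the Talker never re-encodes the input, it conditions directly on everything the Thinker has already produced, which is what keeps latency low.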

How does it perform? Comprehensive and powerful

The research team conducted a comprehensive evaluation of Qwen2.5-Omni, and the results are quite impressive:

Cross-modal capability at SOTA level: Qwen2.5-Omni reaches the current state of the art on tasks that require integrating information from multiple modalities (such as the OmniBench benchmark).

Strong unimodal capabilities: Compared with similarly sized unimodal models (such as Qwen2.5-VL-7B and Qwen2-Audio) and some strong closed-source models (such as Gemini 1.5 Pro), Qwen2.5-Omni is also highly competitive across individual unimodal tasks, including:

  • Speech recognition: Common Voice
  • Speech translation: CoVoST2
  • Audio understanding: MMAU
  • Image reasoning: MMMU, MMStar
  • Video understanding: MVBench
  • Speech generation: Seed-TTS-eval and subjective naturalness evaluation

It can be said that Qwen2.5-Omni maintains its all-round capabilities without sacrificing performance in individual vertical domains.

Summary: