Alibaba open-sources Qwen2.5-Omni late at night: a 7B-parameter model that can see, hear, speak, and write

Written by
Audrey Miles
Updated on: July 9, 2025
Recommendation

Alibaba's latest open-source multimodal model leads a new era of voice and video interaction.

Core content:
1. Qwen2.5-Omni model released, supporting multiple inputs such as text, image, audio, and video
2. Using Thinker-Talker architecture to achieve real-time voice and video chat
3. 7B-parameter model open-sourced, with performance surpassing similarly sized single-modal models

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)



In the early morning of March 27, Alibaba's Tongyi Qianwen (Qwen) team released Qwen2.5-Omni.


This is the new flagship multimodal large model in the Qwen series, designed for comprehensive multimodal perception. It can seamlessly process various inputs including text, images, audio and video, while supporting streaming text generation and natural speech synthesis output.


From now on, you can chat with Qwen just as you would on a phone or video call. In other words, both voice chat and video chat are now a reality.


Experience address: https://chat.qwen.ai/

More importantly, the team has open-sourced the model that supports all of this, Qwen2.5-Omni-7B, under the Apache 2.0 license, and published a technical report sharing all the details!

Now, developers and enterprises can download Qwen2.5-Omni for free, including for commercial use, and it can also be readily deployed and run on mobile phones and other edge devices.
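For developers who want to try local deployment, below is a minimal sketch of loading the open-source Qwen2.5-Omni-7B checkpoint through Hugging Face Transformers. The class names (Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor) and the return_audio flag follow the model card at release time and should be treated as assumptions; check the official repository for the current API.

```python
# Minimal sketch: loading Qwen2.5-Omni-7B via Hugging Face Transformers.
# Class names and flags follow the release-time model card and may differ in
# newer versions; see https://huggingface.co/Qwen/Qwen2.5-Omni-7B for details.
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_id = "Qwen/Qwen2.5-Omni-7B"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # a 7B model fits on a single modern GPU in bf16
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# A text-only prompt; image/audio/video entries can be added to "content".
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Introduce yourself briefly."}]}
]

text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text_prompt, return_tensors="pt").to(model.device)

# With return_audio=True the model can also stream back a speech waveform;
# here we keep text-only output for simplicity.
output_ids = model.generate(**inputs, max_new_tokens=128, return_audio=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```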



Some netizens said that this is the real Open AI.


You can experience the real performance of Qwen2.5-Omni through the official demo.


Qwen2.5-Omni model architecture

Qwen2.5-Omni has the following features:

  • Omni and innovative architecture: The team proposes the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive multiple modalities including text, images, audio, and video, while generating text and natural speech responses in a streaming manner. In addition, the team proposes a new positional embedding, TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video and audio inputs (see the sketch after this list);

  • Real-time voice and video chat: The architecture is designed for fully real-time interaction, supporting chunked input and immediate output;

  • Natural and robust speech generation: In terms of speech generation, Qwen2.5-Omni surpasses many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness;

  • Strong multi-modal performance: When benchmarked against similarly sized single-modal models, Qwen2.5-Omni demonstrates superior performance across all modalities: it surpasses the similarly sized Qwen2-Audio in audio capabilities and performs comparably to Qwen2.5-VL-7B on vision tasks;

  • Excellent end-to-end voice instruction following: Qwen2.5-Omni's end-to-end voice instruction following is comparable in effectiveness to text input, as demonstrated on benchmarks such as MMLU and GSM8K.
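The exact TMRoPE formulation is given in the technical report. As a rough conceptual sketch only (not the released implementation), the time-alignment idea can be pictured as giving audio frames and video frames that cover the same moment the same temporal position index, so the two streams can be interleaved in chronological order. The 40 ms granularity and frame rates below are illustrative assumptions.

```python
# Conceptual sketch of time-aligned temporal positions (not the official TMRoPE code).
# Tokens from different modalities that cover the same wall-clock time share a
# temporal position index.

def temporal_position_ids(num_frames: int, frame_interval_s: float, time_step_s: float = 0.04):
    """Map each frame to a coarse temporal index of width `time_step_s` seconds."""
    return [round(i * frame_interval_s / time_step_s) for i in range(num_frames)]

# Hypothetical example: video sampled at 2 fps, audio frames every 40 ms.
video_pos = temporal_position_ids(num_frames=25, frame_interval_s=0.5)
audio_pos = temporal_position_ids(num_frames=250, frame_interval_s=0.04)

# Video frame 2 (at t = 1.0 s) and audio frame 25 (also at t = 1.0 s)
# receive the same temporal position id, so they align in the sequence.
assert video_pos[2] == audio_pos[25]
```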


As we mentioned above, Qwen2.5-Omni adopts the Thinker-Talker architecture.

Thinker is like a brain, responsible for processing and understanding inputs from text, audio, and video modalities, generating high-level representations and corresponding text.

Talker is like a human mouth, receiving high-level representations and text produced by Thinker in a streaming manner and smoothly outputting discrete speech tokens.

Thinker is a Transformer decoder equipped with audio and image encoders to facilitate information extraction. Talker, in contrast, is designed as a dual-track autoregressive Transformer decoder.

During training and inference, Talker directly receives high-dimensional representations from Thinker and shares all historical context information of Thinker. Therefore, the entire architecture runs as a unified single model, achieving end-to-end training and inference.
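To make this division of labor concrete, here is a schematic, purely illustrative Python sketch (not the released code) of a Thinker-Talker style streaming loop: the Thinker consumes chunked input and yields high-level representations plus text tokens, while the Talker shares that context and emits discrete speech tokens immediately.

```python
# Schematic of a Thinker-Talker streaming loop; illustrative stand-ins only,
# not the Qwen2.5-Omni implementation.
from typing import Iterator, List, Tuple


def thinker_step(context: List[str], chunk: str) -> Tuple[List[float], str]:
    """Stand-in for the Thinker: returns a high-level representation and a text token."""
    hidden = [float(len(chunk) + len(context))]  # placeholder for a hidden representation
    text_token = chunk.upper()                   # placeholder for the generated text
    return hidden, text_token


def talker_step(hidden: List[float], text_token: str) -> int:
    """Stand-in for the Talker: maps the shared context to a discrete speech token id."""
    return hash((tuple(hidden), text_token)) % 4096  # placeholder codec token id


def stream_response(input_chunks: Iterator[str]):
    context: List[str] = []
    for chunk in input_chunks:              # chunked multimodal input arrives over time
        hidden, text_token = thinker_step(context, chunk)
        context.append(text_token)          # Talker shares the Thinker's full history
        speech_token = talker_step(hidden, text_token)
        yield text_token, speech_token      # text and speech are emitted immediately


for text_tok, speech_tok in stream_response(iter(["hel", "lo ", "wor", "ld"])):
    print(text_tok, speech_tok)
```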

Figure: Qwen2.5-Omni model architecture

Model performance

The team conducted a comprehensive evaluation of Qwen2.5-Omni and found that it outperforms similarly sized single-modal models such as Qwen2.5-VL-7B and Qwen2-Audio, as well as closed-source models such as Gemini-1.5-Pro, across all modalities.

In tasks that require integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art results.

In addition, in unimodal tasks, Qwen2.5-Omni excels in multiple fields, including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).


