Alibaba open-sources the multimodal large model Qwen2.5-Omni

Written by Clara Bennett
Updated on: July 8, 2025
Recommendation

Alibaba ushers in a new era of multimodal AI interaction: the Qwen2.5-Omni large model is now open source.

Core content:
1. Qwen2.5-Omni's innovative Thinker-Talker architecture design
2. Cross-modal performance advantages and outstanding performance in single-modal tasks
3. Application of time-aligned multimodal rotary position embedding (TMRoPE) technology

Multimodal models have become a hot area of research and application. Among them, Qwen2.5-Omni, developed by the Alibaba team, stands out: with its innovative architecture, excellent performance and rich application scenarios, it offers a new solution for multimodal interaction and moves artificial intelligence toward a more intelligent and natural era of interaction.

1. Qwen2.5-Omni Architecture Innovation

Qwen2.5-Omni adopts a unique Thinker-Talker architecture whose core goal is end-to-end multimodal perception and interaction. It can process text, image, audio and video inputs simultaneously and generate text and natural speech responses in a streaming manner, breaking through the limitations of traditional models in handling multimodal information and greatly improving the real-time responsiveness and smoothness of interaction.

In this architecture, the time-aligned multimodal rotary position embedding (TMRoPE) technique is particularly critical. When processing video and audio inputs, information from different modalities often carries mismatched timestamps, which can hamper the model's accurate understanding of the input. TMRoPE uses an innovative position embedding method to precisely synchronize the timestamps of video and audio inputs, ensuring that the temporal relationships between modalities are represented accurately when the model processes multimodal information. This allows data from different modalities to be fused more accurately and improves the model's understanding of complex scenes. This architectural design lays a solid foundation for Qwen2.5-Omni's strong performance in multimodal interaction.
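To make the idea of time alignment concrete, here is a purely illustrative sketch, not the actual TMRoPE implementation: tokens from different modalities are mapped onto a shared time grid so that audio tokens and video frames occurring at the same moment receive the same temporal position index. The 40 ms grid step and the helper function below are assumptions made only for this example.

# Purely illustrative sketch of cross-modal time alignment (not the real TMRoPE code):
# map each token's timestamp onto a shared time grid so that audio tokens and video
# frames from the same moment get the same temporal position index.

def time_aligned_positions(audio_times, video_times, time_step=0.04):
    """Convert per-token timestamps (seconds) into integer positions on a shared grid.
    The 0.04 s grid step is an assumption made for this example."""
    audio_pos = [round(t / time_step) for t in audio_times]
    video_pos = [round(t / time_step) for t in video_times]
    return audio_pos, video_pos

# Hypothetical input: one second of audio tokens every 40 ms, plus three video frames.
audio_times = [i * 0.04 for i in range(25)]
video_times = [0.0, 0.5, 1.0]
audio_pos, video_pos = time_aligned_positions(audio_times, video_times)
print(video_pos)  # e.g. [0, 12, 25]: each frame lands on the same grid as the audio tokens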

2. Excellent performance

1. Comprehensive advantages of cross-modality

Qwen2.5-Omni demonstrates strong cross-modal performance and holds up well against unimodal models of a similar scale in every modality. In terms of audio capabilities, it surpasses the similarly sized Qwen2-Audio, showing higher accuracy and better comprehension in tasks such as speech recognition and audio understanding. In image and video processing, it achieves performance comparable to Qwen2.5-VL-7B and can accurately analyze and interpret the relevant information, whether the task is image reasoning or video understanding.

In the OmniBench benchmark test of comprehensive multimodal tasks, Qwen2.5-Omni achieved a leading result. Behind this achievement is the model's powerful multimodal fusion capability, which can effectively integrate information from different modalities and conduct comprehensive and in-depth analysis, thereby demonstrating excellent performance in complex multimodal tasks and providing users with more accurate and useful answers.

2. Excellent performance on unimodal tasks

Qwen2.5-Omni also achieves remarkable results on unimodal tasks. In speech recognition (such as the Common Voice dataset), it converts speech to text accurately, with high recognition accuracy and strong adaptability to different accents and language environments. In translation tasks (such as the CoVoST2 dataset), whether speech-to-text translation or text-to-text translation, it delivers high-quality results with natural, fluent language.


In audio understanding (MMAU), Qwen2.5-Omni deeply understands the semantic information in audio: it not only recognizes the speech content but also analyzes deeper information such as emotion and intent. In image reasoning (MMMU, MMStar), it accurately reasons about the objects, scenes and relationships in an image, understands its meaning and makes reasonable predictions. In video understanding (MVBench), it processes dynamic video information and follows the actions and unfolding events in a video. In speech generation (Seed-tts-eval and subjective naturalness evaluation), the generated speech is natural and fluent, highly similar to human speech, and surpasses many existing models in robustness and naturalness.

3. End-to-end voice command execution capability

Qwen2.5-Omni performs well in end-to-end voice command following. Benchmarks such as MMLU and GSM8K show that its performance in understanding and executing voice commands is on par with how it handles the same commands as text input. This means users can interact with the model through natural voice commands; whether for complex problem solving, task execution or information queries, Qwen2.5-Omni accurately understands the user's intent and gives an appropriate response, greatly improving the convenience and efficiency of interaction.

3. Convenient usage methods and tools

1. Installation and environment configuration

To use Qwen2.5-Omni, users need to perform some installation and environment configuration. Since its code on Hugging Face Transformers is still at the pull-request stage and has not yet been merged into the main branch, users may need to build and install from source. First, uninstall any existing transformers library with the command pip uninstall transformers, then install the specific revision with pip install git+https://github.com/huggingface/transformers@3a1ead0aabed473eafe527915eea8c197d424356. The accelerate library should also be installed to optimize model running performance.
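For convenience, the installation commands described above are, in summary:

# Remove any previously installed transformers, then install the specific revision
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@3a1ead0aabed473eafe527915eea8c197d424356
pip install accelerate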

To make it easier to handle various audio and visual inputs, Qwen2.5-Omni provides the qwen-omni-utils toolkit. If your system has ffmpeg and you want videos to load faster, install the toolkit with pip install qwen-omni-utils[decord]; the decord library accelerates video processing. If your system is not Linux, decord may not be installable from PyPI; in that case use pip install qwen-omni-utils, which falls back to torchvision for video processing. Users can also build decord from source and use it when loading videos.
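Summarizing the two installation variants described above:

# On Linux with ffmpeg available: install with decord for faster video loading
pip install qwen-omni-utils[decord]

# On other systems: plain install; video processing falls back to torchvision
pip install qwen-omni-utils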

2. Usage examples

When using Qwen2.5-Omni for multimodal interaction, users can refer to the following code examples:

import soundfile as sf
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Load model
model = Qwen2_5OmniModel.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Build conversation
conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
    },
    {
        "role": "user",
        "content": [{"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"}]
    }
]

# Data preprocessing
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)

# Inference
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)


In this code, the Qwen2.5-Omni model and processor are loaded first, and a conversation containing a video input is then constructed. The process_mm_info function processes the multimodal information and converts the conversation content into an input format the model can handle. In the inference phase, the model generates text and audio output from the input; finally, the generated text is printed and the audio is saved to the file output.wav.

4. Usage tips and precautions

1. Audio output settings

If the user needs audio output, the system prompt must be set to specific content: "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech." Otherwise, the audio output may not work properly. This is because when the model processes the audio output, it needs to activate the corresponding speech generation function according to specific system prompts to ensure that the generated speech conforms to the expected role setting and interaction logic.

2. Use of video and audio

In multimodal interaction, videos are usually accompanied by audio, which is crucial for the model to understand the video content and provide a better interactive experience. Qwen2.5-Omni therefore provides a parameter to control whether the audio track in a video is used. In the preprocessing phase, audios, images, videos = process_mm_info(conversations, use_audio_in_video=True) enables the audio in the video; in the inference phase, text_ids, audio = model.generate(**inputs, use_audio_in_video=True) ensures that the model considers the video's audio when generating output. Note that in multi-round dialogues the use_audio_in_video parameter must be set to the same value in these calls, otherwise unexpected results may occur and the model may not process the multimodal information accurately, as shown in the sketch below.
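A minimal sketch of keeping the flag consistent, reusing the variable names from the usage example above:

# Define the flag once and pass the same value to preprocessing and generation.
USE_AUDIO_IN_VIDEO = True

audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)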

3. Whether to use audio output

Qwen2.5-Omni supports both text and audio output. If audio output is not needed, set enable_audio_output=False in the from_pretrained function, which saves about 2 GB of GPU memory; note, however, that the return_audio option of the generate function can then only be set to False. For a more flexible experience, it is recommended to set enable_audio_output=True when initializing the model and then decide for each generate call whether to return audio. When return_audio is set to False, the model returns only text output, which yields text responses faster.
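A minimal sketch of the recommended pattern, assuming (as described above) that generate returns only text ids when return_audio=False, and reusing the inputs from the usage example:

# Load with audio output enabled so speech generation remains available when needed.
model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=True,
)

# Decide per call whether to return audio.
text_ids, audio = model.generate(**inputs, return_audio=True)  # text + speech
text_ids = model.generate(**inputs, return_audio=False)        # text only, faster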

4. Changing the voice type of output audio

Qwen2.5-Omni supports changing the voice used for the output audio. The "Qwen/Qwen2.5-Omni-7B" checkpoint supports two voice types: Chelsie (a female voice, sweet and soft, with gentle warmth and bright clarity) and Ethan (a male voice, bright and optimistic, full of appeal, with a warm and friendly feel). Users can specify the voice type with the spk parameter of the generate function; if spk is not specified, the Chelsie voice is used by default. For example:

text_ids, audio = model.generate(**inputs, spk="Chelsie")
text_ids, audio = model.generate(**inputs, spk="Ethan")

5. Speed up with Flash-Attention 2

To further improve generation speed, Qwen2.5-Omni supports Flash-Attention 2. First, make sure the latest version of Flash-Attention 2 is installed, for example with pip install -U flash-attn --no-build-isolation; the hardware must also be compatible with Flash-Attention 2 (see the official Flash-Attention documentation for details). Flash-Attention 2 can only be used when the model is loaded in torch.float16 or torch.bfloat16. It is enabled by adding the attn_implementation="flash_attention_2" parameter when loading the model, for example:

import torch
from transformers import Qwen2_5OmniModel

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

As an innovative multimodal model, Qwen2.5-Omni, with its unique architecture, excellent performance and rich feature set, has brought new breakthroughs and directions to the field of multimodal interaction. It not only shows great potential in current artificial intelligence research and applications, but also lays a solid foundation for the future development of intelligent interaction.