Alibaba Qwen's take on advanced voice mode and real-time video chat is here: 10 trials per day

Alibaba Qwen's voice and video chat modes have arrived, powered by the all-round AI model Qwen2.5-Omni-7B; you can try them 10 times a day.
Core content:
1. Alibaba's Qwen Chat adds real-time voice and video chat
2. Qwen2.5-Omni-7B: an all-round AI model that accepts text, audio, image, video, and other inputs
3. Qwen2.5-Omni performance evaluation: cross-modal capability reaches SOTA level
Alibaba's Qwen Chat now allows real-time voice and video chats, up to 10 times a day
All-rounder Qwen2.5-Omni is now available and open source
The new features are powered by the newly released Qwen2.5-Omni-7B, an Omni (all-round) model. Simply put, a single model can understand text, audio, image, and video inputs at the same time, and can output both text and audio.
Alibaba continues to push open source: Qwen2.5-Omni-7B is released under the Apache 2.0 license, and a detailed technical report has also been published, packed with useful information.
Here are all the links you need to dig deeper and get started:
• Experience the new Qwen Chat features: https://chat.qwenlm.ai
• Technical report (paper): https://github.com/QwenLM/Qwen2.5-Omni/blob/main/assets/Qwen2.5_Omni.pdf
• Official blog: https://qwenlm.github.io/blog/qwen2.5-omni
• GitHub repository: https://github.com/QwenLM/Qwen2.5-Omni
• Hugging Face model: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
• ModelScope model: https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B
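If you want to skip the web UI and run the open-source weights locally, here is a minimal sketch of loading the model from Hugging Face with transformers. The class names (Qwen2_5OmniModel, Qwen2_5OmniProcessor), the qwen_omni_utils helper, and the exact argument names follow the model card at the time of writing and are assumptions on my part; they may differ in your transformers version, so check the repository links above for the current API.

```python
# Minimal sketch: load Qwen2.5-Omni-7B and run a mixed text+audio prompt.
# Class and argument names follow the Hugging Face model card; verify them
# against your installed transformers version (a recent build is required,
# and newer releases may expose Qwen2_5OmniForConditionalGeneration instead).
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with the Qwen repo

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# A chat-style message mixing an audio clip (local path or URL) with text.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio": "example.wav"},
        {"type": "text", "text": "What is being said in this clip?"},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audios=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# The model returns both text token ids and a synthesized speech waveform.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```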
Core Architecture: “Thinker-Talker”
The key to Qwen2.5-Omni's all-round capability is its "Thinker-Talker" architecture. The design is clever: it lets the model think and speak at the same time:
1. Thinker: plays the role of the brain. It handles input from multiple modalities (text, audio, video), extracts information with dedicated audio and video encoders, understands and processes it with a Transformer decoder, and finally produces high-level semantic representations along with the corresponding text.
2. Talker: serves as the mouth. It receives the Thinker's high-level representations and text in a streaming manner and uses a dual-track autoregressive Transformer decoder to smoothly synthesize and output discrete speech units (tokens).
The key point is that the Talker does not work in isolation: it directly consumes the high-dimensional representations produced by the Thinker and shares the Thinker's entire historical context. The two therefore form a single, tightly coupled model that can be trained and run end to end, and this design is the core of the low-latency, high-fluency voice interaction.
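To make the coupling concrete, here is a toy, purely illustrative Python sketch of the data flow described above (all class names and token formats are invented for illustration; this is not the model's actual code). The point it shows is that the Talker conditions on the Thinker's hidden states and accumulates the same context, so speech tokens can be emitted step by step while text is still being generated.

```python
# Conceptual sketch (not the real implementation) of the Thinker-Talker split:
# the Thinker streams text tokens plus high-level hidden states, and the Talker
# consumes those states to emit discrete speech tokens with low latency.
from dataclasses import dataclass


@dataclass
class ThinkerStep:
    text_token: str            # next text token from the Thinker
    hidden_state: list         # high-level semantic representation for this step


class Thinker:
    """Stands in for the multimodal Transformer decoder ("the brain")."""

    def run(self, prompt: str):
        for word in prompt.split():  # toy stand-in for autoregressive generation
            yield ThinkerStep(text_token=word, hidden_state=[float(len(word))])


class Talker:
    """Stands in for the dual-track autoregressive decoder ("the mouth")."""

    def __init__(self):
        self.context = []  # shares the Thinker's full history, not just the last step

    def speak(self, step: ThinkerStep) -> str:
        self.context.append(step)  # conditions on all prior Thinker states
        return f"<speech_token:{int(step.hidden_state[0])}>"


thinker, talker = Thinker(), Talker()
for step in thinker.run("hello real time voice chat"):
    # Text and speech are produced side by side, step by step.
    print(step.text_token, talker.speak(step))
```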
How does it perform? Strong across the board
The research team ran a comprehensive evaluation of Qwen2.5-Omni, and the results are impressive:
Cross-modal capability at SOTA level: Qwen2.5-Omni reaches the current state of the art on tasks that require integrating information across multiple modalities (e.g., the OmniBench benchmark).
Strong unimodal capabilities: compared with similarly sized unimodal models (such as Qwen2.5-VL-7B and Qwen2-Audio) and some strong closed-source models (such as Gemini-1.5-Pro), Qwen2.5-Omni is also highly competitive across individual unimodal tasks, including:
* Speech recognition: Common Voice
* Speech translation: CoVoST2
* Audio understanding: MMAU
* Image reasoning: MMMU, MMStar
* Video understanding: MVBench
* Speech generation: Seed-TTS-eval and subjective naturalness evaluation
In short, Qwen2.5-Omni does not sacrifice capability in individual domains in order to be an all-rounder.
Summary: