PAI-Model Gallery supports one-click deployment of Qwen3 full-size models on the cloud

Written by
Audrey Miles
Updated on: June 26, 2025
A cloud deployment guide for the Qwen3 large language models: experience AI efficiency and convenience with a single click.

Core content:
1. Qwen3 model features and breakthroughs in multiple fields
2. PAI-Model Gallery integrates Qwen3 model and provides zero-code deployment solution
3. Detailed steps and configuration guide for one-click deployment of Qwen3 model


01

Model Introduction

Qwen3 is the latest generation of large language models in the Qwen series, offering both dense and mixture-of-experts (MoE) models. Built on extensive training, Qwen3 achieves breakthrough progress in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

  • Unique support for seamless switching between thinking mode (for complex logical reasoning, mathematics, and coding) and non-thinking mode (for efficient general conversation), ensuring optimal performance across scenarios.
  • Significantly enhanced reasoning, surpassing the earlier QwQ model (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) in mathematics, code generation, and commonsense logical reasoning.
  • Superior alignment with human preferences: excels at creative writing, role-playing, multi-turn dialogue, and instruction following, delivering a more natural, engaging, and immersive conversational experience.
  • Strong agent capabilities: accurately integrates external tools in both thinking and non-thinking modes, leading open-source models on complex agent-based tasks.
  • Support for more than 100 languages and dialects, with strong multilingual understanding, reasoning, instruction following, and generation.
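The thinking/non-thinking switch can be driven per request. As a minimal sketch, assuming an OpenAI-style message format and Qwen3's documented `/think` and `/no_think` soft switches (the helper name `build_messages` is illustrative, not part of any official API):

```python
def build_messages(prompt: str, thinking: bool) -> list:
    """Build a chat message list for Qwen3, toggling thinking mode.

    Qwen3 honors the soft switches /think and /no_think inside a user
    turn; appending one overrides the default mode for that request.
    """
    switch = "/think" if thinking else "/no_think"
    return [{"role": "user", "content": f"{prompt} {switch}"}]


# Complex reasoning: keep thinking mode on.
msgs = build_messages("Prove that the sum of two even numbers is even.", thinking=True)

# Quick chat: turn thinking off for lower latency.
fast = build_messages("Say hello in French.", thinking=False)
```

In a multi-turn conversation, the most recent switch wins, so you can mix fast turns and deep-reasoning turns within one session.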

02

PAI-Model Gallery Introduction

Model Gallery is a product component of Alibaba Cloud's artificial intelligence platform PAI. It integrates high-quality pre-trained models from AI open-source communities at home and abroad, covering fields such as LLM, AIGC, CV, and NLP. Because PAI has already adapted these models, users can complete the entire workflow from training to deployment to inference with zero code, simplifying model development and giving developers and enterprise users a faster, more efficient, and more convenient AI development experience.

PAI-Model Gallery access address: https://pai.console.aliyun.com/#/quick-start/models

Alibaba Cloud PAI-Model Gallery added day-one support for the full lineup of Qwen3 open-source models, providing an enterprise-grade deployment solution:

✅ Zero-code, one-click deployment

✅ Automatic adaptation to cloud resources

✅ Out-of-the-box API

✅ Fully managed operations and maintenance

✅ Enterprise-grade security: data never leaves your network domain

03

One-click deployment of Qwen3 solution

The following uses the Qwen3-8B model as an example (its inference cost is low, making it suitable for quick verification).

1. Find the Qwen3-8B model in the Model Gallery, or go directly to it via this link: https://x.sm.cn/W5Qpfy
2. Click "Deploy" in the upper-right corner of the model details page. Both the SGLang and vLLM high-performance inference frameworks are supported; after selecting computing resources, the model is deployed to the cloud with one click.
3. After deployment succeeds, click "View call information" on the service page to obtain the service endpoint and token. To learn how to call the service, follow the pre-trained model link back to the model introduction page and consult the invocation instructions.
4. Use the inference service. You can debug the deployed Qwen3 service online on the PAI-EAS inference platform. As the figure shows, the model's responses demonstrate strong chain-of-thought ability.
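Once you have the endpoint and token from step 3, calling the service from code is straightforward. The sketch below assumes the service exposes an OpenAI-compatible `/v1/chat/completions` route, which vLLM and SGLang deployments typically do; the endpoint URL, token, and helper name are placeholders, not values from the console:

```python
import json
import urllib.request


def build_chat_request(endpoint: str, token: str, messages: list,
                       model: str = "Qwen3-8B") -> urllib.request.Request:
    """Assemble an OpenAI-compatible chat request for a deployed service.

    `endpoint` and `token` come from "View call information" on the
    PAI-EAS service page.
    """
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "https://your-service.example.com",   # placeholder endpoint
    "<your-token>",                       # placeholder token
    [{"role": "user", "content": "What is 2 + 2?"}],
)
# resp = urllib.request.urlopen(req)  # send once real credentials are filled in
```

The exact authorization header format may differ depending on how the service is configured, so always cross-check against the invocation instructions on the model introduction page.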

The following table lists the minimum configuration required for deployment, as well as the maximum number of tokens each inference framework supports on different instance types.

04

More model support

In addition to the full Qwen3 lineup, PAI-Model Gallery continues to provide rapid deployment, training, and evaluation recipes for popular open-source community models.

  • DeepSeek-R1, inference-performance-optimized edition

Inference performance is substantially improved: under the same latency constraint, throughput increases by up to 492%; at the same throughput, time-to-first-token drops by 86% and inter-token latency by 69%.