Good news for older GPUs! Meituan releases the first open-source lossless INT8 version of the full-scale DeepSeek R1

Meituan's technical team has open-sourced an INT8-quantized DeepSeek R1 model that keeps accuracy essentially lossless while improving inference throughput.
Core content:
1. Background and challenges of the DeepSeek R1 model
2. The practice and principles of INT8 quantized inference
3. Evaluation and deployment of the quantized model, with links to the open-source resources
1. Background
2. INT8 Quantized Inference Practice
2.1 Basic principles of quantization
2.2 Introduction to DeepSeek R1 Quantization
2.3 Quantization Method Design
2.4 Quantized Model Evaluation
2.5 Quantized Model Deployment
3. Summary and Outlook
1. Background
After DeepSeek R1 was released, many companies and individual users tried to deploy the full-scale model. However, the native weights are released in the FP8 data format, which places a strict restriction on GPU type: FP8 is only supported by NVIDIA's newer GPUs (such as Ada and Hopper architecture chips), so other GPUs (such as the A100) cannot deploy it directly. Although the FP8 weights can be dequantized to BF16 for inference on GPUs such as the A100, this doubles the memory requirement and reduces inference throughput.
To address these problems, Meituan's search and recommendation platform team applied INT8 quantization to the DeepSeek R1 model and found that model accuracy remained essentially intact. With INT8 quantization, DeepSeek R1 is no longer tied to FP8-capable chips and can be deployed on other GPU models such as the A100; compared with BF16, it also delivers a 50% throughput improvement, further reducing inference cost. The quantization code has been released in the open-source LLM inference framework SGLang, and the quantized models have been published to the Hugging Face community for easy use.
2. INT8 Quantized Inference Practice
| 2.1 Basic principles of quantization
Model quantization converts a model's weights and activations from high precision (such as BF16) to low precision (such as INT8) while keeping the model's behavior before and after the conversion as close as possible. Taking common symmetric INT8 quantization as an example, the process is:
1. Compute the quantization scale from the maximum absolute value of the tensor. Quantization formula: $s = \max(|x|)/127$, $x_{\mathrm{int8}} = \mathrm{round}(x/s)$
2. Quantize and dequantize at the appropriate locations. Dequantization formula: $\hat{x} = x_{\mathrm{int8}} \cdot s$
3. FP16/BF16 computations can then be converted to INT8 computations, i.e. an INT8 matrix multiplication followed by a dequantization step.
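As a concrete illustration of the steps above, here is a minimal NumPy sketch of symmetric INT8 quantization, dequantization, and an INT8 matrix multiplication; the function names are ours and not taken from the SGLang implementation:

# Minimal sketch of symmetric per-tensor INT8 quantization (illustrative only).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric quantization: scale = max|x| / 127, x_int8 = round(x / scale)."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_int8, scale

def dequantize_int8(x_int8: np.ndarray, scale: float) -> np.ndarray:
    """Dequantization: x is approximately x_int8 * scale."""
    return x_int8.astype(np.float32) * scale

# A BF16/FP16 matmul y = a @ w can be replaced by an INT8 matmul (with INT32
# accumulation) followed by a single dequantization with the product of scales.
a = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 16).astype(np.float32)
a_q, s_a = quantize_int8(a)
w_q, s_w = quantize_int8(w)
y = (a_q.astype(np.int32) @ w_q.astype(np.int32)).astype(np.float32) * (s_a * s_w)
print(np.abs(y - a @ w).max())                         # small quantization error
print(np.abs(dequantize_int8(a_q, s_a) - a).max())     # round-trip error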
| 2.2 Introduction to DeepSeek R1 Quantization
According to the latest technical report released by DeepSeek, the breakthrough training cost control of V3/R1 relies mainly on the FP8-precision training scheme. FP8 is a typical model quantization technique: compared with the BF16 precision commonly used in the industry, FP8 halves the data bit width and thus significantly reduces the cost of each computation, but it also introduces some precision loss. In practice, DeepSeek R1 uses a mixed-precision training mechanism to effectively mitigate this loss.
Since DeepSeek R1 is trained at FP8 precision, its open-source native weights are in FP8. For inference, in order to minimize precision loss while keeping throughput close to FP8, we naturally turned to INT8, which has the same bit width as FP8. In addition, INT8 is natively supported by a wide range of hardware, so INT8 precision greatly expands the range of hardware on which the DeepSeek model can be deployed. We therefore began to explore the feasibility of INT8 quantization for DeepSeek R1.
| 2.3 Quantization Method Design
After weighing the accuracy and inference performance of the quantized model, we chose two schemes: block-wise quantization and channel-wise quantization.
Block-wise quantization: Block-wise quantization is one of the key techniques DeepSeek V3/R1 uses to reduce quantization loss. By partitioning the weight matrix into fine-grained [128, 128] blocks, it restricts each quantization operation to a single block, reducing the chance of a dispersed value distribution and thus keeping the loss of each quantization step well under control. To minimize the accuracy loss of the quantized model, we followed the quantization strategy used in DeepSeek's training and likewise quantized weights block by block over [128, 128] tiles, ensuring consistency between training and inference. Note, however, that because each block has its own quantization scale, the INT8 matrix multiplication requires multiple dequantization operations, and dequantization runs on CUDA Cores, whose computational throughput is lower, which reduces the efficiency of the matrix multiplication to some extent. In practice, since DeepSeek does not officially provide half-precision floating-point (BF16) weights, the native FP8 weights must first be dequantized to BF16 and then block-quantized to INT8. To match the block-wise quantization of the weights, activations use online token-group quantization: the embedding vector of each token is split into groups and quantized group by group. The multiplication of block-quantized weights with group-quantized activations is illustrated in the left figure below.
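For illustration, a simplified NumPy sketch of [128, 128] block-wise weight quantization might look as follows; it assumes both weight dimensions are multiples of 128, and the function name is ours, not the SGLang/DeepSeek implementation:

# Simplified sketch of [128, 128] block-wise weight quantization (illustrative only).
import numpy as np

BLOCK = 128

def quantize_weight_blockwise(w: np.ndarray):
    """Quantize each [128, 128] block of w with its own symmetric INT8 scale."""
    rows, cols = w.shape
    w_int8 = np.empty_like(w, dtype=np.int8)
    scales = np.empty((rows // BLOCK, cols // BLOCK), dtype=np.float32)
    for i in range(0, rows, BLOCK):
        for j in range(0, cols, BLOCK):
            block = w[i:i + BLOCK, j:j + BLOCK]
            s = max(float(np.abs(block).max()) / 127.0, 1e-8)
            scales[i // BLOCK, j // BLOCK] = s
            w_int8[i:i + BLOCK, j:j + BLOCK] = np.clip(np.round(block / s), -127, 127).astype(np.int8)
    return w_int8, scales

w = np.random.randn(256, 384).astype(np.float32)
w_q, s = quantize_weight_blockwise(w)
print(w_q.shape, s.shape)  # (256, 384) and (2, 3): one scale per [128, 128] block

Because every block carries its own scale, the result of the INT8 matrix multiplication has to be rescaled block by block, which is exactly the extra dequantization overhead mentioned above.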
Channel-wise quantization: Besides block-wise quantization, we also explored the more efficient channel-wise quantization, in which each column of the weight matrix is quantized as a group. After the INT8 matrix multiplication, channel-wise quantization needs only a single dequantization, so its computational overhead is lower than that of block-wise quantization. However, because quantizing an entire column of elements is more likely to be affected by outliers, it incurs more precision loss than block-wise quantization. In practice, the native FP8 weights are first dequantized to BF16 and then quantized to INT8 channel by channel, while activations use online per-token quantization to minimize activation quantization loss. The multiplication of channel-quantized weights with per-token-quantized activations is illustrated in the right figure below.
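Similarly, a minimal NumPy sketch of channel-wise weight quantization combined with per-token activation quantization might look as follows (names and shapes are illustrative); the single rescale after the INT8 matmul is what keeps the dequantization overhead low:

# Sketch of channel-wise weight quantization and per-token activation quantization (illustrative only).
import numpy as np

def quantize_weight_per_channel(w: np.ndarray):
    """One symmetric INT8 scale per weight column (output channel)."""
    scales = np.maximum(np.abs(w).max(axis=0), 1e-8) / 127.0                 # shape [out_features]
    return np.clip(np.round(w / scales), -127, 127).astype(np.int8), scales

def quantize_activation_per_token(a: np.ndarray):
    """One symmetric INT8 scale per token (row of the activation matrix)."""
    scales = np.maximum(np.abs(a).max(axis=1, keepdims=True), 1e-8) / 127.0  # shape [tokens, 1]
    return np.clip(np.round(a / scales), -127, 127).astype(np.int8), scales

a = np.random.randn(4, 64).astype(np.float32)     # activations: [tokens, in_features]
w = np.random.randn(64, 32).astype(np.float32)    # weights: [in_features, out_features]
a_q, s_a = quantize_activation_per_token(a)
w_q, s_w = quantize_weight_per_channel(w)
# INT8 GEMM with INT32 accumulation, then a single outer-product rescale.
y = (a_q.astype(np.int32) @ w_q.astype(np.int32)).astype(np.float32) * (s_a * s_w)
print(np.abs(y - a @ w).max())                    # small quantization error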
| 2.4 Quantized Model Evaluation
2.4.1 Accuracy
We applied the two quantization methods above to the open-source DeepSeek R1 model and evaluated the accuracy of the quantized models on the GSM8K and MMLU datasets. As shown in the table below, compared with the BF16 and FP8 baselines, the accuracy of both INT8-quantized models is essentially intact.
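The full evaluation pipeline is beyond the scope of this article, but as a rough illustration only, a small GSM8K spot-check against a running endpoint (see Section 2.5) could be scripted roughly as follows; the URL, prompt format, and answer parsing below are simplified assumptions, not the pipeline behind the reported numbers:

# Rough GSM8K spot-check against the OpenAI-compatible endpoint from Section 2.5.
# Illustrative only: URL, prompt format, and answer parsing are simplified assumptions.
import re
import requests
from datasets import load_dataset  # pip install datasets

URL = "http://HEAD_IP:5000/v1/chat/completions"  # replace HEAD_IP with the master node IP

def last_number(text: str) -> str:
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else ""

samples = load_dataset("gsm8k", "main", split="test").select(range(50))
correct = 0
for ex in samples:
    reply = requests.post(URL, json={
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": ex["question"]}],
    }, timeout=600).json()["choices"][0]["message"]["content"]
    gold = ex["answer"].split("####")[-1].strip()  # GSM8K gold answers end with "#### <number>"
    correct += last_number(reply) == last_number(gold)
print(f"accuracy on {len(samples)} samples: {correct / len(samples):.2%}")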
2.4.2 Inference Throughput
We have contributed inference support for both INT8 quantization methods (block-wise and channel-wise) to the well-known open-source inference framework SGLang. SGLang is the current state-of-the-art open-source LLM inference framework: it delivers the best inference performance on the DeepSeek model family and is widely used in the industry.
Using the BF16 model as the baseline, we evaluated the inference throughput of the two INT8 models on A100-80G GPUs. Thanks to their lower memory footprint, the INT8-quantized models need only 16 A100 GPUs for inference, whereas the BF16 model needs 32. To keep the comparison fair, all throughput tests were run on 32 A100 GPUs. As the table below shows, block-wise INT8 inference improves throughput by 33% over BF16, while channel-wise INT8 inference, thanks to its lower dequantization overhead, achieves a 50% throughput improvement.
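For a quick sanity check rather than a rigorous benchmark, a rough client-side estimate of output throughput can be obtained by firing concurrent requests at the server's OpenAI-compatible endpoint (started as in Section 2.5); the URL, concurrency, and prompt below are placeholders, and this is not the benchmark harness behind the numbers above:

# Rough client-side throughput estimate; placeholders only, not the actual benchmark.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://HEAD_IP:5000/v1/chat/completions"  # replace HEAD_IP with the master node IP
PAYLOAD = {
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "Explain INT8 quantization in one paragraph."}],
    "max_tokens": 256,
}

def one_request(_):
    resp = requests.post(URL, json=PAYLOAD, timeout=600).json()
    return resp["usage"]["completion_tokens"]  # tokens generated for this request

start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:  # 32 concurrent requests
    completion_tokens = list(pool.map(one_request, range(128)))
elapsed = time.time() - start
print(f"output throughput ~ {sum(completion_tokens) / elapsed:.1f} tokens/s")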
| 2.5 Quantized Model Deployment
Taking two nodes with 8 A100 GPUs each as an example, developers need to install the latest version of SGLang on both deployment nodes and then run the following commands on each:
# Block-wise quantization INT8 inference
# Master node
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 \
    --dist-init-addr HEAD_IP:5000 --nnodes 2 --node-rank 0 \
    --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8
# Worker node
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Block-INT8 --tp 16 \
    --dist-init-addr HEAD_IP:5000 --nnodes 2 --node-rank 1 \
    --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8

# Channel-wise quantization INT8 inference
# Master node
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Channel-INT8 --tp 16 \
    --dist-init-addr HEAD_IP:5000 --nnodes 2 --node-rank 0 \
    --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 \
    --quantization w8a8_int8
# Worker node
python3 -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Channel-INT8 --tp 16 \
    --dist-init-addr HEAD_IP:5000 --nnodes 2 --node-rank 1 \
    --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 \
    --quantization w8a8_int8
Here are some interesting cases:
Demonstrating deep reasoning ability
Here we use curl as an example to send a request:
curl -X POST 'http://HEAD_IP:5000/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    -d '{
        "model": "deepseek-r1",
        "messages": [
            {"role": "user", "content": "Among the following options, find the one that is different: 1. Aluminum 2. Tin 3. Steel 4. Iron 5. Copper"}
        ]
    }'
In the INT8 R1's answer, we can see its characteristic reflective style:
However, I need to confirm whether this is correct. For example, is it possible that there are alloys in other options? For example, tin is sometimes mixed with other metals, but tin itself is a pure metal. Similarly, aluminum, iron, and copper are all pure metals. So steel, as an alloy, should be a key difference. In addition, other properties can be considered, such as their uses, physical properties, or chemical properties. For example, conductivity: silver is the best conductor, but there is no silver here, and copper is also very conductive, aluminum is also good, and iron and steel are relatively poor in conductivity. But if classified in this way, it may be difficult to determine which one is different. ...But going back to the original classification, steel is the only alloy, and the others are pure metals, which may be more basic. Therefore, the answer should be 3. Steel. However, it is necessary to verify whether this conclusion is correct.
The final answer is also correct:
Answer: 3. Steel
Simulate the movement of a large number of small balls
Using a prompt from other LLM tests as input, we compared the rendered results of the code generated by the INT8 model (left figure below) and the FP8 model (right figure below); the INT8 output is in no way inferior to FP8.
write a script for 100 bouncing balls within a sphere, make sure to handle collision detection properly. make the sphere slowly rotate. make sure balls stays within the sphere. implement it in p5.js
3. Summary and Outlook
In summary, we explored INT8 quantization for DeepSeek R1 and contributed inference support to the SGLang framework. With the quantized model's accuracy preserved, DeepSeek R1 can now be deployed on older GPUs such as the A100 with improved inference throughput. We hope the open-source code and weights will benefit more users and business teams, and we welcome everyone to exchange ideas on related techniques and help build and give back to the open-source community.