NVIDIA is chasing Qwen3: first release of the Llama-Nemotron efficient reasoning model technical report

Written by
Clara Bennett
Updated on: June 25, 2025
Recommendation

A new breakthrough in AI: how NVIDIA improves reasoning efficiency with the Llama-Nemotron models.

Core content:
1. NVIDIA's Llama-Nemotron series of models and their dynamic mode-switching capability
2. Detailed explanation of neural architecture search (NAS) and block-level local distillation technology
3. Application of mixed integer programming (MIP) and FFN fusion technology in model optimization


NVIDIA released the Llama-Nemotron series of models, which can dynamically switch between reasoning mode and standard chat mode to adapt to different task requirements.
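For readers who want to see the switch in action, here is a minimal sketch assuming an OpenAI-compatible chat endpoint; the base URL and model identifier below are placeholders rather than confirmed values, so check NVIDIA's model cards for the real ones. The only Nemotron-specific element is the system prompt, which is exactly the "detailed thinking on/off" toggle described in the report.

```python
# Minimal sketch of the mode toggle. Assumes an OpenAI-compatible chat endpoint;
# the base URL and model identifier below are placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_KEY")


def ask(question: str, reasoning: bool) -> str:
    # The system prompt is the documented "detailed thinking on/off" switch.
    mode = "detailed thinking on" if reasoning else "detailed thinking off"
    resp = client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-ultra-253b-v1",   # placeholder model id
        messages=[
            {"role": "system", "content": mode},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content


print(ask("How many primes are below 20?", reasoning=True))   # step-by-step trace
print(ask("How many primes are below 20?", reasoning=False))  # concise answer
```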

Interpretation of Llama-Nemotron's key technologies

Neural Architecture Search (NAS)

  • Block-level local distillation :

    • Starting from a Llama 3 Instruct model, the Puzzle framework trains each alternative sub-block independently and in parallel, so that it preserves the functionality of its parent block while improving computational performance, for example by reducing latency and memory usage or by increasing throughput.

    • For the LN-Ultra model, for instance, this process starts from Llama 3.1-405B-Instruct: each replacement block is trained to approximate the parent block's behavior while being cheaper to compute.

    • For example, some blocks reduce computation and KV-cache memory by removing the attention mechanism entirely, while others achieve different degrees of compression by shrinking the intermediate size of the feed-forward network (FFN).

  • Mixed Integer Programming (MIP) :

    • After building a library of alternative blocks, the Puzzle framework utilizes a mixed integer programming solver to select the optimal block for each layer according to the given constraints.

    • For example, for the LN-Super model, the constraints include achieving at least a 5x throughput improvement on a single NVIDIA H100 GPU and supporting a KV cache of about 300K tokens at FP8 precision.

    • Given constraints such as hardware compatibility, maximum allowed latency, total memory budget, or target inference throughput, the MIP solver optimizes an objective function and picks the optimal block for each layer from the block library, assembling a complete model that satisfies all constraints.

    • For example, the final LN-Ultra model achieves at least a 1.5x latency improvement on 8 H100 GPUs and supports a KV cache of up to 3M tokens at FP8 precision.

  • FFN Fusion :

    • For the LN-Ultra model, an FFN Fusion technique is introduced: after Puzzle removes some attention layers, consecutive FFN blocks often appear in the model.

      For example, if two consecutive FFN blocks remain in the model, FFN Fusion replaces them with a single, wider FFN layer whose parts can execute in parallel, reducing the number of sequential steps and improving compute utilization (see the sketch after this list).

    • Through FFN Fusion, the LN-Ultra model achieves a significant reduction in inference latency, ultimately reaching a 1.71x latency improvement.
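To make the fusion idea concrete, below is a small, self-contained sketch (not NVIDIA's implementation) showing how two consecutive residual FFN blocks can be merged into one wider FFN whose output is the sum of the two originals. The merge is exact only insofar as the second block's input is dominated by the residual stream; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn


class FFN(nn.Module):
    """A plain (non-gated) transformer feed-forward block: down(act(up(x)))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))


def fuse_ffns(a: FFN, b: FFN) -> FFN:
    """Build one wider FFN whose output equals a(x) + b(x) for the same input x."""
    d_model = a.up.in_features
    fused = FFN(d_model, a.up.out_features + b.up.out_features)
    with torch.no_grad():
        # Stack the up-projections along the hidden dimension...
        fused.up.weight.copy_(torch.cat([a.up.weight, b.up.weight], dim=0))
        fused.up.bias.copy_(torch.cat([a.up.bias, b.up.bias], dim=0))
        # ...and concatenate the down-projections so their contributions sum.
        fused.down.weight.copy_(torch.cat([a.down.weight, b.down.weight], dim=1))
        fused.down.bias.copy_(a.down.bias + b.down.bias)
    return fused


x = torch.randn(4, 16, 256)
ffn_a, ffn_b = FFN(256, 1024), FFN(256, 1024)

# Sequential residual blocks: two dependent steps.
seq = x + ffn_a(x)
seq = seq + ffn_b(seq)

# Fused: one wider step computing x + a(x) + b(x). This matches the sequential
# result only insofar as b(x + a(x)) ≈ b(x), the approximation FFN fusion relies on.
fused = fuse_ffns(ffn_a, ffn_b)
par = x + fused(x)

print((seq - par).abs().mean())  # small but non-zero: the fusion is approximate
```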

Knowledge distillation and continuous pre-training

  • Knowledge Distillation :

    • The LN-Super model undergoes knowledge distillation on the Distillation Mix dataset with 40B tokens.

    • In this process, the LN-Super model's outputs are compared with the teacher model's outputs, and its parameters are adjusted so that it better approximates the teacher's behavior (a generic sketch of this kind of loss appears at the end of this section).

    • The LN-Ultra model is first trained for knowledge distillation on the Distillation Mix dataset with 65B tokens, and then continues to be pre-trained on the Nemotron-H stage 4 pre-training dataset with 88B tokens.

    • In the knowledge distillation stage, the LN-Ultra model gradually improves its performance by learning from the teacher model's outputs;

    • During the continued pre-training phase, the model further expands its knowledge and ultimately surpasses the reference model Llama 3.1-405B-Instruct on key benchmarks.

  • Continuous pre-training :

    • LN-Ultra continues to be pre-trained on the Nemotron-H stage 4 pre-training dataset after knowledge distillation to further improve the performance.

    • For example, during the continued pre-training phase the LN-Ultra model learns from a large amount of unlabeled data, broadening its knowledge and language coverage and improving its performance on reasoning tasks.
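The description above is conceptual; the sketch below shows the generic logit-distillation recipe (a KL term pulling the student toward the teacher's token distribution, mixed with the usual cross-entropy on ground-truth tokens) that this kind of training typically builds on. It is a minimal sketch, not NVIDIA's training code, and the temperature and loss weight are illustrative.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft KL term toward the teacher with hard-label cross-entropy."""
    # Soft targets: KL(teacher || student) at temperature T,
    # summed over the vocabulary and averaged over tokens.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(s, t, reduction="none").sum(-1).mean() * temperature ** 2

    # Hard targets: standard next-token cross-entropy (padding marked as -100).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kl + (1 - alpha) * ce


# Toy shapes: batch=2, sequence=8, vocabulary=32.
student_logits = torch.randn(2, 8, 32, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32)
labels = torch.randint(0, 32, (2, 8))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```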

Supervised Fine-tuning (SFT)

  • Data preparation :

    • A mixed dataset containing both reasoning and non-reasoning data is constructed.

    • In the reasoning data, each prompt carries the instruction "detailed thinking on", and the model must output its detailed reasoning process;

    • in the non-reasoning data, each prompt carries the instruction "detailed thinking off", and the model must output a concise response.

    • The reasoning data is further broken down into math, coding, science, and general domains.

    • For example, in the math domain, problems are collected from the Art of Problem Solving (AoPS) community forums, and models such as DeepSeek-R1 and Qwen2.5-Math-7B-Instruct are used to generate reasoning and non-reasoning solutions, which then go through filtering and verification steps to ensure data quality and correctness.

  • Training process :

    • The initial stage focuses on reasoning data;

    • non-reasoning data is introduced in the intermediate stage;

    • the final stage focuses on a mix of chat, instruction-following, and tool-calling data.

    • All models are trained with a token-level cross-entropy loss over the instruction data (a sketch combining this loss with the mode prompt appears after this list).

    • During training, the model's output is compared with the target output, and the model's parameters are updated according to the cross-entropy loss.

    • Depending on model size and requirements, different learning rates, sequence lengths, and numbers of training epochs are used.

    • For example, the LN-Nano model uses a three-stage SFT process;

    • the LN-Super model is trained in a single pass over the full dataset;

    • the LN-Ultra model uses a more elaborate strategy, including linear warm-up and cosine-decay learning-rate scheduling, to ensure training stability and convergence.
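To tie the data format and the loss together, here is a minimal sketch (with a made-up chat template, made-up tag names, and toy tensors, not NVIDIA's pipeline) of how the "detailed thinking on/off" system prompt selects whether the target includes a reasoning trace, and how the token-level cross-entropy is computed only over the response tokens.

```python
import torch
import torch.nn.functional as F


def build_example(question: str, answer: str, reasoning: str | None):
    """Format one SFT example; the chat template below is made up for illustration."""
    mode = "detailed thinking on" if reasoning is not None else "detailed thinking off"
    prompt = f"<system>{mode}</system>\n<user>{question}</user>\n<assistant>"
    target = f"<think>{reasoning}</think>\n{answer}" if reasoning is not None else answer
    return prompt, target


def sft_loss(logits, token_ids, prompt_lengths):
    """Token-level cross-entropy; prompt positions are masked out with -100."""
    labels = token_ids.clone()
    for i, plen in enumerate(prompt_lengths):
        labels[i, :plen] = -100                       # no loss on prompt tokens
    # Shift so that position t predicts token t+1 (standard causal-LM objective).
    shifted_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shifted_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, shifted_labels, ignore_index=-100)


# Reasoning-mode and non-reasoning-mode variants of the same question.
print(build_example("What is 12*7?", "84", reasoning="12*7 = 70 + 14 = 84"))
print(build_example("What is 12*7?", "84", reasoning=None))

# Toy loss computation: batch=2, sequence=10, vocabulary=50.
logits = torch.randn(2, 10, 50, requires_grad=True)
token_ids = torch.randint(0, 50, (2, 10))
loss = sft_loss(logits, token_ids, prompt_lengths=[4, 6])
loss.backward()
print(loss.item())
```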

Large-Scale Reinforcement Learning

  • Training Algorithm :

    • For LN-Ultra, the Group Relative Policy Optimization (GRPO) algorithm is used for reinforcement learning to improve scientific reasoning capabilities.

  • Data processing :

    • Multiple responses are generated independently for each prompt and pass rates are computed; prompts with low pass rates are selected for training, which raises the difficulty of the training data.

    • At the same time, a curriculum training strategy is adopted: using the pre-computed pass rate as a difficulty indicator, the difficulty distribution of each batch is adjusted dynamically so that the model progresses from simple to complex tasks.

  • Reward mechanism :

    • Accuracy rewards and format rewards are used to guide model learning (a small sketch of both appears after this list).

    • The accuracy reward checks whether the generated response matches the ground-truth answer, rewarding the model for answering questions correctly;

    • the format reward guides the model toward the correct output format by checking that it emits the thinking process in reasoning mode and avoids emitting thinking tags in non-reasoning mode.

  • Inference mode switching :

    • Through the lightweight system prompt "detailed thinking on/off", dynamic switching between reasoning mode and normal chat mode is achieved.
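As a compact illustration of this reward design and of the group-relative advantage that gives GRPO its name, the sketch below combines an accuracy check against a reference answer with a format check on the thinking tags, then standardizes the rewards within one prompt's group of sampled responses. The tag names, reward weights, and example responses are assumptions for illustration, not values from the report.

```python
import re
import statistics


def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0. Exact match here;
    a real verifier would normalize math expressions or run test cases."""
    answer = response.split("</think>")[-1].strip()
    return 1.0 if answer == reference_answer.strip() else 0.0


def format_reward(response: str, reasoning_on: bool) -> float:
    """Reward correct use of the thinking tags for the current mode."""
    has_think = bool(re.search(r"<think>.*?</think>", response, flags=re.S))
    return 1.0 if has_think == reasoning_on else 0.0


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0          # avoid division by zero
    return [(r - mean) / std for r in rewards]


# One prompt, a group of four sampled responses in reasoning mode.
reference = "84"
group = [
    "<think>12*7 = 70 + 14 = 84</think>84",
    "<think>12*7 = 82</think>82",
    "84",                                            # missing the required think block
    "<think>7*12 = 84</think>84",
]
rewards = [0.8 * accuracy_reward(r, reference) + 0.2 * format_reward(r, True)
           for r in group]
print(rewards)            # [1.0, 0.2, 0.8, 1.0]
print(grpo_advantages(rewards))
```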