A Feynman-style explanation of large model parameter fine-tuning - even beginners can understand it

Written by Silas Grey
Updated on: June 27, 2025
Recommendation

A Feynman-style technical explanation that makes large model parameter fine-tuning easy to understand.

Core content:
1. A home-study analogy for fine-tuning the parameters of large models
2. How a large model's neurons compute: the telegraph-code encoding and decoding process
3. How students learn from teachers: the small-model parameter learning (distillation) paradigm

Yang Fangxian, Founder of 53A / Tencent Cloud Most Valuable Expert (TVP)

 

The story goes like this

Character group (1): a teacher and a student. Character group (2): Dad, Mom, me, and my younger sister. Task 1: Mom supervises my study. I study science, and I mainly look for tricks to improve my problem-solving skills and methods; I want to score high on exams so Mom will give me more pocket money. Task 2: Dad supervises my sister's study. My sister studies liberal arts; she mainly memorizes everything in her books and learns by rote memorization.


[Figure: Task 1 - Mom supervises my science learning (find problem-solving tricks, improve skills and methods, score high on exams, get more pocket money). Task 2 - Dad supervises my sister's liberal arts learning (recite book knowledge, memorize standard answers).]

The relationship between the human brain and the large language model

Everyone has used ChatGPT or DeepSeek: we type a question into the large model's web page and get an answer right away. The input is text, and the large model also answers in text. So how does the large model's "brain" think and then produce the answer? Look at the picture below, focusing on the blue steps.

  • In the first step, starting from the left, we say a sentence to the large model: "Let's have a meeting today."
  • In the second step, each Chinese character is encoded into a telegraph code: the position belonging to that character is set to 1 and all other positions to 0. These telegraph codes are sent one by one to the core computing unit of the large model, the neuron.
  • In the third step, the large model's neurons (computing units) calculate new telegraph codes, one by one, from the input telegraph codes.
  • In the fourth step, the new telegraph codes are decoded back into Chinese characters.
  • In the fifth step, the decoded Chinese characters are output one by one from the large model, for example "Everyone is very happy." This is the large model's answer to the question we asked.
[Figure: (1) input sentence → (2) Chinese character encoding (e.g. "today" = [1,0,0,...]) → (3) neuron calculation → (4) telegraph code decoding (e.g. [0,1,0,...] = "day") → (5) output result.]
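To make steps (2) and (4) concrete, here is a minimal Python sketch of the "telegraph code" idea: each character maps to a one-hot vector, and a vector is decoded back by finding the position of the 1. The tiny vocabulary and example sentence are made up for illustration; real large models use learned tokenizers and embeddings rather than raw one-hot codes.

```python
# Toy "telegraph code": one-hot encode characters, then decode them back.
# The vocabulary below is invented for illustration only.
vocab = ["今", "天", "开", "会", "大", "家", "很", "高", "兴"]

def encode(char: str) -> list[int]:
    """Character -> telegraph code: 1 at its own position, 0 everywhere else."""
    code = [0] * len(vocab)
    code[vocab.index(char)] = 1
    return code

def decode(code: list[int]) -> str:
    """Telegraph code -> character: locate the 1 and look it up."""
    return vocab[code.index(1)]

sentence = "今天开会"                         # "Let's have a meeting today"
codes = [encode(ch) for ch in sentence]       # step (2): characters -> codes
# ... step (3) would be the neurons computing new codes here ...
print([decode(c) for c in codes])             # step (4): codes -> characters
```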

The number of neurons in a large language model is enormous: the model uses a huge number of artificial neurons to imitate the tens of billions of neurons in the human brain, which is why a large language model is loosely similar to a brain. Each neuron is really a linear or nonlinear calculation formula with coefficients (for example, a simple linear formula y = w·x + b has the coefficients w and b). Billions of neurons mean billions of similar formulas, which are combined through addition, subtraction, multiplication, division, or exponentiation into one enormous formula with tens or even hundreds of billions of coefficients. These coefficients are the parameters of the large model.

This is not a rigorous statement, but the key point is that a large language model is made of formulas plus formula coefficients. The formulas determine how the model thinks: it reasons about problems by following mathematical calculations. The coefficients determine the configuration of that thinking. Thinking mode plus thinking configuration ultimately determines what the large language model outputs.
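To make "formulas plus coefficients" concrete, here is a minimal PyTorch sketch (an illustration, not the architecture of any particular model): a tiny stack of linear layers with a nonlinearity, whose learnable weights and biases are exactly the "parameters" described above, and which can be counted directly. The layer sizes are invented for the example.

```python
import torch
import torch.nn as nn

# A toy "brain": each Linear layer is a batch of neurons, each neuron a formula
# like y = w1*x1 + w2*x2 + ... + b. The weights w and biases b are the parameters.
toy_model = nn.Sequential(
    nn.Linear(8, 16),   # 8 inputs -> 16 neurons (16*8 weights + 16 biases)
    nn.ReLU(),          # a nonlinear formula with no coefficients of its own
    nn.Linear(16, 8),   # 16 inputs -> 8 neurons
)

num_params = sum(p.numel() for p in toy_model.parameters())
print(f"parameter count: {num_params}")   # real LLMs have billions of these

x = torch.randn(1, 8)   # an input "telegraph code" (here just random numbers)
y = toy_model(x)        # the formulas compute a new code from the input
print(y.shape)
```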

How students learn knowledge from teachers - the small model parameter learning / distillation paradigm

We know that large language models can think about problems, but how do they learn (that is, how are they trained)? Their learning method is similar to how humans learn, only clumsier and less flexible. Humans learn whenever they want and learn from whatever they encounter, with no fixed method; machines cannot do that, because they are machines.

The figure below shows how a student (a small language model that knows little) learns knowledge from a teacher (a large language model that knows a lot). ① Right figure: the model parameter learning / distillation paradigm, read from bottom to top. A student without much knowledge can either learn from a teacher or learn alone. Learning from the teacher ultimately transfers the knowledge in the teacher's head into the student's head (the small language model). Learning alone means reading books, and books are themselves knowledge created by earlier teachers. Either way, the method in the figure applies: if the student is going to learn mathematics, prepare a large set of good exam papers, have the student and the teacher answer them separately, compare the two sets of answers, and feed the comparison result back to the student in different ways. (A minimal code sketch of this compare-and-feed-back idea follows below.)
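Here is a minimal, hedged sketch of that compare-and-feed-back loop as it is commonly done in knowledge distillation: the student is trained to make its output distribution close to the teacher's on the same "exam paper" (input). The toy model sizes are invented, and the KL-divergence loss is a standard distillation choice used here for illustration rather than the procedure of any specific paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy teacher (stands in for a large, already-trained model) and toy student.
teacher = nn.Linear(16, 10)
student = nn.Linear(16, 10)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

exam_paper = torch.randn(32, 16)   # a batch of "exam questions"

for step in range(100):
    with torch.no_grad():
        teacher_answers = teacher(exam_paper)   # teacher answers the paper
    student_answers = student(exam_paper)       # student answers the same paper

    # Compare the two answer sheets: KL divergence between output distributions.
    loss = F.kl_div(
        F.log_softmax(student_answers, dim=-1),
        F.softmax(teacher_answers, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()    # the comparison result is fed back to the student
    optimizer.step()
```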



② Left figure: the feedback method directly determines how the student acts and thinks while learning; read from top to bottom.
  • If my younger sister studies with Dad: she recites every day and Dad checks her work. If Dad's feedback each time is "compare your answer with the standard answer; a large difference means more punishment, a small difference means less," then this is supervised learning. The best strategy for my sister is to memorize both the working and the answer by rote and not consider anything else. The learning process essentially reinforces the memory of the younger sister (the small model): the more she memorizes, the better she does. It suits liberal arts content, especially common-sense questions. Early ChatGPT, DeepSeek-V3, LLaMA and similar models all belong to this type: if the model has seen something it knows it, but if it has not, answering is very hard. For example, first memorize the ten numbers 1-10. To compute 2+3=?, you must have already seen the problem 2+3=5; if you do not remember it, then even knowing 2+2=4 you are unlikely to know 2+3=5. Supervised learning is therefore strong on memory, while the reasoning ability of those early models was very weak or absent (it was improved later).
  • If I study with Mom: I study every day and Mom checks my work. If Mom's feedback each time is "here is how the gap between your score and the teacher's score changed; if the gap grew you get no extra pocket money, if it shrank you get more," then this is reinforcement learning. Given the change in the score gap, I know whether I have improved. My goal is no longer, as in supervised learning, to memorize things verbatim; as long as my score improves a lot, I have improved, and I get more pocket money. How I learn is my own choice: I can choose to memorize, but I can also choose logical reasoning. Since I study science, I will certainly prefer reasoning, because memorizing questions never ends (mathematics and physics have endless problems), whereas finding better reasoning skills and mastering the underlying principles lets me handle whatever comes, and my scores improve faster. If I studied liberal arts, I would have to memorize, which is very time-consuming. Take 2+3=? again, after first memorizing the ten numbers 1-10: even if I have never seen the problem 2+3=5, as long as I can reason that 1+1=2 and 1+1+1=3, then 2+3 decomposes into 1+1+1+1+1, which equals 5. Reinforcement learning therefore gives strong reasoning ability and ordinary memory ability. Typical models: DeepSeek-R1, OpenAI o1~o3. (A toy sketch contrasting the two feedback signals appears after the figure note below.)
[Figure: science learning mode vs. liberal arts learning mode. Supervised path (typical early models: ChatGPT/LLaMA): start reciting → input standard answers → mechanical memory → check differences → strengthen or reduce punishment as differences are large or small → reinforce rote memorization. Reinforcement path (typical new-generation models: DeepSeek-R1/o3): start learning → explore problem-solving strategies → autonomous reasoning → compare results → reduce or increase rewards as differences increase or decrease → adjust method.]
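Below is a toy Python sketch of the two feedback signals from the story, staying inside the made-up pocket-money analogy: the supervised signal is the gap between the student's answer and the standard answer, while the reinforcement signal is the change in the student's score over time. It is an analogy-level illustration only, not how any real training loop is implemented.

```python
# The two feedback signals from the story, in toy form.

def supervised_feedback(student_answer: float, standard_answer: float) -> float:
    """Dad's rule: punishment grows with the gap to the standard answer."""
    return abs(student_answer - standard_answer)      # big gap -> big punishment

def reinforcement_feedback(previous_score: float, new_score: float) -> float:
    """Mom's rule: reward is the improvement in score, not any fixed answer."""
    return new_score - previous_score                  # improvement -> more pocket money

# Sister memorized the standard answer "5" for 2+3:
print(supervised_feedback(student_answer=5.0, standard_answer=5.0))   # 0.0 -> no punishment

# I raised my exam score from 70 to 85 by learning to reason:
print(reinforcement_feedback(previous_score=70.0, new_score=85.0))    # 15.0 -> more pocket money
```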

In short, supervised learning corresponds to increasing how much the student has memorized, while reinforcement learning corresponds to improving the student's logical reasoning. Which method to use depends on whether the subject is learned well by memory or by reasoning: liberal arts mostly need memory, science mostly needs reasoning. Looking back, the two learning methods correspond to two different thinking patterns in the student, and also to two different algorithms for updating the parameters of the small language model.

Using my sister's and my methods to train students through the whole process

Training students is an end-to-end process: we have to consider everything from how they learn to where they will work. That is a responsible attitude toward education and toward the future. Training must have a purpose, and my sister's and my learning methods are only one part of it; we also need to choose the right learning materials and tools so that graduates end up in the right jobs.

  • ① Determine the goal. Will our students mainly do writing work or calligraphy art? If writing work, choose an open-source text generation model; if calligraphy art, choose an open-source multimodal model.
  • ② Select learning materials. Both basic courses (open-source data sets) and professional courses are needed. Basic courses do not require that many textbooks, but professional courses must use high-quality reading material closely tied to the student's future career. The material must be carefully selected so students do not go wrong or waste effort while learning; only with very good materials and books can we make sure their learning does not go astray.
  • ③ Choose learning tools. Tools should suit the student's learning style, learning conditions, and school, as well as the position and workplace the student will take after mastering the major.
  • ④ Determine the learning method. The two methods introduced above: supervised learning and reinforcement learning.
[Figure: ① determine the training objectives (text creation vs. art design, talent demand analysis → text large model vs. multimodal model) → ② select learning materials (open-source data sets, selected textbooks; general knowledge e.g. Wikipedia, professional skills e.g. industry documents) → ③ configure learning tools (teaching platform Jupyter, training framework PyTorch, deployment environment Docker) → ④ determine the learning method (declarative vs. procedural knowledge → supervised vs. reinforcement learning) → employment output.]

The learning process of the large language model corresponds to the following four steps.

1. Pre-trained model selection. Choose the base model architecture according to business scenario requirements: ① text large model (suited to NLP tasks) ② multimodal large model (suited to image-and-text generation tasks)


2. Professional data processing. Build high-quality training data sets:

  1. Data sources: open-source data sets plus internal data from real business scenarios (several types of typical scenario data)
  2. Preprocessing: data deduplication → cleaning (outlier/noise handling) → formatting and vectorization (a small preprocessing sketch follows this list)
  3. Data optimization: diversity screening via clustering algorithms, combined with the THINK process to generate annotation labels
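Below is a minimal Python sketch of the preprocessing chain named above (deduplication → cleaning → formatting). The field names, the cleaning rule, and the instruction-style JSONL output are assumptions for illustration, not the pipeline of any specific toolkit.

```python
import json

# Toy raw records; the "question"/"answer" field names are assumed for illustration.
raw_records = [
    {"question": "What is fine-tuning?", "answer": "Adapting a pre-trained model."},
    {"question": "What is fine-tuning?", "answer": "Adapting a pre-trained model."},  # duplicate
    {"question": "???", "answer": ""},                                                # noise
]

# 1) Deduplication: drop exact repeats.
seen, deduped = set(), []
for r in raw_records:
    key = (r["question"], r["answer"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2) Cleaning: drop records with empty or too-short fields (a stand-in rule).
cleaned = [r for r in deduped if r["answer"].strip() and len(r["question"]) > 3]

# 3) Formatting: write instruction-style JSONL, a common input format for SFT toolchains.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for r in cleaned:
        f.write(json.dumps({"instruction": r["question"], "output": r["answer"]},
                           ensure_ascii=False) + "\n")
```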

3. Fine-tuning toolchain configuration. Build a complete training framework:

  1. Core tools: the LLaMA-Factory framework / ByteDance's open-source verl framework / the unsloth acceleration library
  2. Deployment: integrate inference optimization tools such as vLLM and SGLang (a small inference sketch follows this list)
  3. Hardware adaptation: support distributed training solutions such as PyTorch Lightning and DeepSpeed
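As an example of the deployment/inference side, here is a minimal vLLM usage sketch. The model path is a placeholder and the exact arguments can differ across vLLM versions, so treat it as an assumption-laden illustration rather than a verified recipe.

```python
# Minimal vLLM offline-inference sketch; the model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/your-finetuned-model")            # load the fine-tuned weights
params = SamplingParams(temperature=0.7, max_tokens=128)    # decoding settings

outputs = llm.generate(["Let's have a meeting today."], params)
for out in outputs:
    print(out.outputs[0].text)    # the model's generated answer
```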

4. Fine-tuning algorithm implementation. Two mainstream training methods are used. 1. Supervised fine-tuning (SFT) process: custom data set → cross-entropy loss function → distributed training loop → evaluation metric check (Accuracy/BLEU) → iterative optimization. (A minimal SFT training-step sketch follows.)
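Here is a minimal single-GPU sketch of one SFT step using Hugging Face transformers: the model computes cross-entropy loss against the labels and the optimizer updates the parameters. The model name and the toy example are placeholders, and a real setup would add a data loader plus a distributed framework such as DeepSpeed, so treat this as an outline of the SFT loop, not a production recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-base-model"   # placeholder for an open-source causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One toy instruction/answer pair; real SFT iterates over a full data set.
text = "Question: What is 2+3?\nAnswer: 5"
batch = tokenizer(text, return_tensors="pt")

model.train()
outputs = model(**batch, labels=batch["input_ids"])  # cross-entropy loss over the token sequence
outputs.loss.backward()                              # backpropagate the loss
optimizer.step()                                     # update the parameters (coefficients)
optimizer.zero_grad()
print(float(outputs.loss))
```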


2. Reinforcement learning fine-tuning (RL), implemented in four stages:

  1. Data preparation: collect preference-comparison data annotated by humans or AI
  2. Reward modeling: train an independent reward evaluation model
  3. Response generation: initialize from the supervised model and generate responses, e.g. via beam search
  4. Policy optimization: optimize the policy model with algorithms such as PPO (e.g. via TRL), using a KL-divergence constraint to prevent it drifting too far (a toy sketch of this objective follows)
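The code below is a toy PyTorch sketch of stage 4's core objective: maximize the reward-model score of the generated response while a KL penalty keeps the policy close to the supervised reference model. It is a REINFORCE-style simplification with invented stand-in modules and sizes; a real pipeline would use a full PPO loop from a library such as TRL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 100, 16                   # toy sizes, invented for illustration
policy = nn.Linear(hidden, vocab_size)         # stage 3: policy model (initialized from the SFT model)
reference = nn.Linear(hidden, vocab_size)      # frozen copy of the SFT model
reward_model = nn.Linear(hidden, 1)            # stage 2: trained reward model (stand-in)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
beta = 0.1                                     # KL penalty weight (assumed value)

state = torch.randn(8, hidden)                 # stand-in for prompt/response features

logits = policy(state)
probs = F.softmax(logits, dim=-1)
tokens = torch.multinomial(probs, num_samples=1)            # sample response tokens
log_prob = F.log_softmax(logits, dim=-1).gather(1, tokens)  # log-prob of the sampled tokens

with torch.no_grad():
    reward = reward_model(state)               # reward-model score of the response
    ref_logits = reference(state)              # reference model's distribution

# KL divergence keeps the policy close to the supervised reference model.
kl = F.kl_div(F.log_softmax(logits, dim=-1),
              F.softmax(ref_logits, dim=-1), reduction="batchmean")

loss = -(log_prob * reward).mean() + beta * kl  # maximize reward, penalize divergence
optimizer.zero_grad()
loss.backward()
optimizer.step()
```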

Refer to the picture below.

[Figure: the four-step fine-tuning method. ① Select a pre-trained model based on business needs (text large model vs. multimodal large model). ② Build the professional data set (open-source professional data sets + internal real-scenario data; deduplication, cleaning, formatting/vectorization, clustering-based diversity screening, annotation generation via the THINK process). ③ Fine-tuning tool framework (LLaMA-Factory, verl (ByteDance open source), unsloth; deployment/inference tools vLLM, SGLang). ④ Main fine-tuning algorithms: supervised fine-tuning SFT (self-collected data set preprocessing, cross-entropy loss, training loop with PyTorch Lightning / DeepSpeed / Megatron, validation-set evaluation with Accuracy/BLEU, model iteration) and reinforcement learning fine-tuning RLHF with GRPO/PPO/DAPO (preference data collection via human/AI annotation, reward modeling, policy initialization from the supervised model, response generation via beam search, reward model scoring, policy optimization with PPO / TRL / RL4LMs, KL divergence constraint, model iteration).]