Karpathy's 3-hour course has helped me understand AI better than 99% of people. Here are the complete notes!

A must-take course for an in-depth understanding of AI language models. Karpathy takes you behind the scenes of AI.
Core content:
1. Analysis of the pre-training stage of large language models
2. How post-training makes the model better fit human needs
3. Key techniques and practices to improve model accuracy
With the popularity of DeepSeek, short videos on the topic are everywhere. I watched them for several months and listened to many experts' explanations, but still had not formed a comprehensive understanding, until I found the large language model course by Andrej Karpathy, a founding member of OpenAI, to whom I am grateful. I went through it repeatedly and recorded the complete notes below:
How did large language models such as ChatGPT come about?
1. Pre-training stage
Through massive amounts of text data, the model learns the basic rules of language, including vocabulary, grammar, and semantic associations, forming general capabilities for language generation and comprehension.
1. Collect data from across the internet to form a dataset
The data must be filtered to remove advertisements, harmful content, racist material, adult websites, and so on.
2. Tokenization
The raw data is in HTML format. After it is converted into plain text, the text is split into tokens the model can process; high-frequency character combinations are merged into single tokens (byte-pair encoding), which shortens the sequences the model sees. Each token is the basic unit for neural network training.
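To make tokenization concrete, here is a minimal sketch using the tiktoken library (my choice for illustration; the course demonstrates the idea with web tools, and any BPE tokenizer behaves similarly): text goes in, a sequence of integer token ids comes out, and decoding recovers the original text.

```python
# Minimal tokenization sketch, assuming the tiktoken library
# (the BPE tokenizer used for GPT-2); other BPE tokenizers work similarly.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's vocabulary of merged byte pairs

text = "Large language models predict the next token."
tokens = enc.encode(text)            # text -> list of integer token ids
print(tokens)                        # each id is one training unit for the network
print(enc.decode(tokens) == text)    # decoding recovers the original text
```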
3. Training the Neural Network
The network is built from code and mathematical expressions: a fixed mathematical function with a huge number of parameters mapping inputs to outputs. Information flows through it so that it predicts the next token from the preceding tokens (0 to n). At the start, next-token prediction (inference) is essentially random. By continuously training the network and adjusting its parameters, the probability assigned to the appropriate next token is increased. Finally, a specific set of parameter weights is fixed; ChatGPT is one particular set of weights, and it works very well.
Demo: reproducing GPT-2
Predictions are updated step by step: each line of the training output is one update to the neural network, and every update improves the prediction of all tokens in the training set. We can follow the training process by watching the loss function. Training on 1 million words is expected to take about 2 days. The better the quality of the training data, the better the hardware, and the better optimized the training software, the lower the training cost.
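A toy sketch of the next-token-prediction loop described above, in PyTorch. The tiny model, the synthetic "+1" token stream, and all the sizes are invented for illustration; a real run trains a Transformer on trillions of tokens, but the loop is the same: predict, measure the loss, adjust the parameters, and watch the loss fall.

```python
# Toy next-token prediction: not Karpathy's GPT-2 code, just the shape of the idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, context_len, dim = 50, 8, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)   # a score for every possible next token

    def forward(self, idx):                       # idx: (batch, context_len) token ids
        h = self.embed(idx)[:, -1, :]             # toy model: look only at the last token
        return self.head(h)                       # logits: (batch, vocab_size)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

for step in range(300):
    # synthetic "text": each next token is simply the previous token plus one
    start = torch.randint(0, vocab_size, (64, 1))
    ctx = (start + torch.arange(context_len)) % vocab_size   # (64, context_len)
    target = (ctx[:, -1] + 1) % vocab_size                   # the true next token
    loss = F.cross_entropy(model(ctx), target)               # next-token prediction loss
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 3))   # watching the loss fall, as in the demo
```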
4. Base Model
In essence it is an internet document simulator: source code (standard) plus parameters (non-standard, and where the real value lies), e.g. GPT-2, Llama 3.
Recommended website for interacting with the base model: Hyperbolic
The base model is not an assistant. If you ask it a question, it only auto-completes an answer based on the statistics of the training set. Sampling is stochastic, so the same prompt will keep producing different answers. The base model is a lossy compression of the entire dataset, something like storing a vague overall impression of the world; the information is not explicitly stored in any single parameter. It is inherently vague, probabilistic and statistical, and content that appears frequently is more likely to be remembered by the model.
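A minimal sketch of why the same prompt keeps producing different answers: the next token is sampled from a probability distribution rather than chosen deterministically. The logits and candidate tokens below are invented for illustration.

```python
# Sampling the next token: same distribution, different draw each time.
import numpy as np

logits = np.array([2.0, 1.5, 0.3, -1.0])        # scores for 4 candidate next tokens
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probabilities

rng = np.random.default_rng()
for _ in range(3):
    print(rng.choice(["cat", "dog", "tree", "car"], p=probs))
```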
2. Post-training stage: supervised fine-tuning
Transform the LLM into an assistant so that its output meets human needs.
1. Training
A conversation dataset is created, programming the model implicitly by example. New special tokens that never appeared during pre-training are introduced so the model can learn where a conversation turn begins. Through an encoding scheme, each conversation is then turned into a one-dimensional token sequence, and inference proceeds exactly as before.
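A sketch of how a conversation becomes one flat token sequence, assuming a ChatML-style template with <|im_start|>/<|im_end|> special tokens (the exact tokens differ per model, and render_chat is a hypothetical helper, not part of any specific library).

```python
# Turn a list of chat messages into a single one-dimensional text/token stream.
def render_chat(messages):
    parts = []
    for m in messages:
        # <|im_start|> / <|im_end|> are new special tokens added in post-training,
        # so the model learns where a turn begins and ends
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")   # the model completes from here
    return "".join(parts)

print(render_chat([
    {"role": "user", "content": "What is 2+2?"},
]))
```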
2. Model hallucination
The LLM completely fabricates information.
The model is imitating the training set. For example, the training set contains confident, accurate answers to questions like "Who is X?"; the model imitates that confident style and may fabricate an answer when it does not actually know. Asking the model to use web search can reduce hallucinations.
3. Better prompts
1) The model's parameters are like knowledge we learned long ago, while the context window is like what we were experiencing and perceiving a few minutes ago. For example, it is better to paste the actual text of a book into the context and then give the prompt, rather than simply asking the model to summarize the book from memory.
2) Teach the model to reason better by spreading computation across tokens. The model works left to right over a one-dimensional sequence, and the amount of computation per token is limited, so reasoning and computation should be distributed across many tokens.
E.g.: in the course's example, the first training answer is clearly worse because it crams all the computation into producing the answer "3" right away; the second works left to right, creating intermediate steps that let the model reach the conclusion gradually.
3) Asking the model to compute directly is like relying on mental arithmetic; instead, teach it to use tools, such as code (see the sketch below).
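A hedged sketch of "use a tool instead of mental arithmetic": rather than asking the model to state a product directly, have it emit code and execute that code. run_python below is a hypothetical stand-in for whatever sandboxed code-execution tool the assistant actually has.

```python
# Offloading arithmetic to code instead of "mental arithmetic" in the token stream.
def run_python(code: str) -> str:
    scope = {}
    exec(code, scope)            # a real system would run this in a sandbox
    return str(scope.get("result"))

# Imagine the model, prompted "use code", emits this snippet instead of guessing:
model_emitted_code = "result = 123456789 * 987654321"
print(run_python(model_emitted_code))   # exact answer, no mental arithmetic
```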
3. Reinforcement Learning (RL)
Big models need education just like we go to school.
1. Reinforcement Learning
We let the model try many different kinds of solutions for each problem; every attempt is a different path, and the correct ones are encouraged. This does not rely on humans: the model knows the final correct answer, works out for itself which forms of solution are effective, finds the good features in the correct attempts (even the best solution styles), and trains on them. Once the parameters are updated, the model becomes more inclined to choose those paths. The process is not labeled by humans; the signal comes from the model itself (see the sketch below).
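A simplified, self-contained sketch of the idea: sample many attempts, keep only those that reach the known correct answer, and update toward them. Everything here is a toy stand-in; the random "paths" and guesses are invented, and real systems use policy-gradient methods over a Transformer rather than this filtering loop.

```python
# Toy illustration: only the final answer is supplied by us; the paths come from the model.
import random

def sample_attempt():
    """Toy 'model': guesses an answer and records the path it took."""
    path = random.choice(["direct guess", "step-by-step", "work backwards"])
    answer = random.randint(1, 10)
    return {"path": path, "answer": answer}

def reinforce(correct_answer, n_attempts=100):
    good = [a for a in (sample_attempt() for _ in range(n_attempts))
            if a["answer"] == correct_answer]   # humans only check the final answer
    # a real update would shift the model's parameters toward these trajectories;
    # here we just report which solution styles happened to succeed
    return [a["path"] for a in good]

print(reinforce(correct_answer=7))
```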
2. Is it possible for RL to surpass humans in reasoning or thinking?
This fine-tuning stage is usually done inside LLM companies, but DeepSeek has publicly discussed RL and its importance for large language models. The research found that the model learns to backtrack and to spend more tokens trying to solve a problem; it does much of what we humans do when working through mathematical problems. It is rediscovering what happens in the human mind, rather than merely copying the worked examples it was given.
The model has learned chains of thought! Incredible! It is discovering ways to think: how to approach a problem, how to look at it from different angles, how to introduce analogies or try something different, and how to try many different things over time. The only thing we give it is the correct answer, and its attempts to reach that answer produce these incredible behaviors. Does this mean RL has the potential to surpass humans in reasoning or thinking?
3. Special case: in verifiable domains, every problem has a unique correct answer that can be checked automatically. In unverifiable domains such as poetry and creative writing, however, it becomes difficult to score different solutions. Here we introduce a reward model plus human preference rankings, and keep updating the reward model with that human data (a sketch of the standard training signal follows).
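A minimal sketch of the standard reward-model training signal, assuming human rankings provide (preferred, rejected) answer pairs. The "embeddings" below are random placeholders; the point is the pairwise loss that pushes the preferred answer's score above the rejected one's.

```python
# Training a reward model from human preference pairs (Bradley-Terry style loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(16, 1)                 # maps a response embedding to a score
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

for step in range(100):
    preferred = torch.randn(8, 16)              # embeddings of human-preferred answers
    rejected = torch.randn(8, 16)               # embeddings of the losing answers
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()         # preferred should score higher than rejected
    opt.zero_grad(); loss.backward(); opt.step()
```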
4. Thoughts on RLHF
RLHF (Reinforcement Learning from Human Feedback):
Advantages:
1) Running RL results in a better model
2) It allows people to contribute supervision without having to perform extremely difficult tasks themselves (ranking answers is easier than writing them)
Disadvantages:
1) The RL is not run against actual humans and real human judgment, but against a lossy simulation of humans (the reward model), which may be misleading
2) RL is very good at gaming the reward model; run for too long, it produces adversarial examples (nonsense outputs that the reward model nevertheless scores highly). Currently such adversarial examples can only be patched by hand.
So the best practice at the moment is to run RLHF until the model improves and then stop updating; do not run too many steps against the reward model, because the optimization starts to exploit it.
4. Summary and Future Thoughts
1. Multimodality will quickly be handled within a single model. Audio and images can be turned into token streams and interleaved with text.
2. Models cannot yet correct their own errors in a coherent way over long stretches of work. We will see more long-running agents.
3. Humans will become supervisors of many agents (think in terms of the ratio of humans to machine agents).
4. AI will be pervasive and invisible, because it will be integrated into tools and be everywhere.
5. The field needs more new ideas (the human brain can learn continuously, but a trained model's parameters are fixed).
6. Sources for the latest AI news:
Reference: https://lmarena.ai/ ; subscribe to https://buttondown.com/ainews ; follow X/Twitter.