An overview of classic prompt engineering technical routes for large models

Exploring the core techniques and application prospects of prompt engineering for large models.
Core content:
1. The definition of prompt engineering and its application in large models
2. The implementation and effectiveness of the CoT prompting mode
3. A discussion of the advantages and limitations of prompt engineering
Prompt engineering is a technique that extends the capabilities of large models without modifying their parameters. It activates relevant knowledge through task-specific instructions or contextual prompts, integrating models with downstream tasks. The field has seen success in applications such as question answering and commonsense reasoning. This article outlines several classic prompt engineering methods, summarizes their key ideas, analyzes their advantages and limitations, and shares the author's thoughts.
Technical route 1: CoT Prompting mode
Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
1. Concept: When solving a complex problem, humans usually break it down and reason step by step until they reach the answer. This method imitates that thinking process and expresses it in natural language, namely the "chain of thought".
2. Implementation : When given a question (such as a math word problem), not only the question itself is provided, but also one or more examples of intermediate reasoning steps (i.e., chain of thought) and the final answer are provided. These examples are used as part of the prompt to guide the model on how to reason.
3. Examples: The paper illustrates the approach with a math problem: "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?" A standard prompt asks for the answer directly, and the model frequently gets such problems wrong, while a chain-of-thought prompt demonstrates the intermediate steps:
First calculate the number of apples remaining: 23 - 20 = 3.
Then add the number of newly purchased apples: 3 + 6 = 9.
The final answer is: they now have 9 apples.
4. Training and fine-tuning: This approach requires no additional training or fine-tuning of the model. Instead, it relies on the language understanding the model has already acquired through pre-training on large amounts of data. By including chain-of-thought examples in the prompt, the model learns to imitate this reasoning process.
5. Effect: Experiments show that chain-of-thought prompting significantly improves model performance on a range of reasoning tasks, especially arithmetic, commonsense, and symbolic reasoning. For example, on the GSM8K math word problem benchmark, the PaLM 540B model with this method achieved state-of-the-art accuracy.
6. Advantages : The advantage of this approach is that it allows the model to decompose complex problems into smaller, more manageable steps, provides transparency into the model reasoning process, and helps understand and debug the model's behavior. In addition, it is applicable to many types of reasoning tasks, not just mathematical problems.
7. Limitations : Although chain-of-thought prompting works well in large models, it may not be able to effectively stimulate reasoning capabilities for small models or when there are not enough examples. In addition, this method depends on the size of the model, and significant performance improvements can only be observed when the model reaches a certain parameter size.
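Mechanically, few-shot CoT prompting is just string construction: one or more worked examples, reasoning steps included, are prepended to the new question. A minimal Python sketch (the exemplar follows the paper's apples example; no model is actually called, and the final string would be sent to whatever LLM API is in use):

```python
# Few-shot chain-of-thought prompt construction: a worked example
# (question + reasoning steps + answer) is prepended to the new question.
# The resulting string would be sent to an LLM; no model is called here.

COT_EXEMPLAR = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A: They started with 23 apples. They used 20, so 23 - 20 = 3 remain. "
    "They bought 6 more, so 3 + 6 = 9. The answer is 9.\n"
)

def build_cot_prompt(question):
    """Prepend the worked exemplar so the model imitates its reasoning."""
    return COT_EXEMPLAR + "\nQ: " + question + "\nA:"

prompt = build_cot_prompt(
    "There are 3 cars in the parking lot and 2 more arrive. "
    "How many cars are in the parking lot?"
)
```

Ending the prompt at "A:" nudges the model to continue with a reasoning chain rather than a bare answer.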
Technical route 2: Self-Consistency + CoT mode
Paper: Self-Consistency Improves Chain of Thought Reasoning in Language Models
In the chain-of-thought prompting method, a greedy decoding strategy is used because in the traditional decoding process, the model selects the word with the highest probability to generate text at each step. This method is simple and direct, but it does not consider other possible word combinations, so it may not fully explore the generation space of the language model, especially in tasks that require complex reasoning.
In the paper, the authors point out this limitation of greedy decoding and propose the self-consistency method as an improvement. The following example from the paper illustrates greedy decoding under chain-of-thought prompting:
Suppose there is a math problem: "If there were originally 3 cars in the parking lot and 2 more cars arrived, how many cars are there in the parking lot now?" A greedy decoding strategy using chain-of-thought prompting might generate the following answer:
There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
In this example, the model directly gives the correct answer "5", but in the actual reasoning process, the model may generate multiple different reasoning paths, each of which may have different intermediate steps and final answers. Greedy decoding only selects the most direct path without considering other possible reasoning methods.
The Self-Consistency method is different: instead of a single greedy path, it samples multiple reasoning paths for the same question. In another illustration from the paper, the sampled paths reach different final answers; although one path makes an arithmetic error and answers "26", the majority of paths answer "18", so by comparing the paths the method identifies and selects the correct answer "18".
This example shows that greedy decoding may ignore other reasonable reasoning paths. To overcome this limitation, the paper proposes the self-consistency method. Its core idea is that a complex reasoning problem can usually be solved through multiple different reasoning paths, but the correct answer should be unique. By sampling multiple reasoning paths and choosing the most consistent answer, the model's reasoning accuracy can be improved.
Steps and improvements of the self-consistency method:
1. The self-consistency method consists of three steps: (1) prompt the language model with chain-of-thought (CoT) exemplars; (2) replace the greedy decoding of CoT prompting by sampling from the language model's decoder, generating a diverse set of reasoning paths; (3) marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
2. Improvements :
Diversity : The self-consistency method generates multiple reasoning paths by sampling, which increases the diversity of the decoding process, allowing the model to explore more reasoning possibilities.
Consistency : By marginalizing the final answers from multiple paths, self-consistency methods are able to identify the most consistent answer, which is usually more likely to be the correct answer.
Robustness : Even if some paths produce incorrect reasoning or answers, the self-consistent method can still correct these errors through a majority voting mechanism, thereby improving the overall accuracy.
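The three steps above can be sketched in a few lines of Python. Sampling from the model is stubbed out with a hard-coded list of paths (the numbers follow the paper's illustration, in which one path slips to $26); only the answer extraction and the majority vote are real:

```python
import re
from collections import Counter

def extract_answer(path):
    """Pull the final dollar amount out of a sampled reasoning path."""
    m = re.search(r"The answer is \$?(\d+)", path)
    return m.group(1) if m else None

def self_consistency(paths):
    """Majority-vote over the final answers of the sampled paths."""
    answers = [a for a in map(extract_answer, paths) if a is not None]
    return Counter(answers).most_common(1)[0][0]

# Three sampled paths: two reach $18, one makes an arithmetic slip to $26.
sampled = [
    "She has 16 - 3 - 4 = 9 eggs left, so 9 * $2 = $18. The answer is $18.",
    "She sells the remaining eggs for 2 * (16 - 4 - 3). The answer is $26.",
    "16 eggs minus the 7 she uses leaves 9; 9 * 2 = 18. The answer is $18.",
]
print(self_consistency(sampled))  # majority answer: 18
```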
Technical route 3: Least-to-Most Prompting mode
Paper: Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
1. Background: Traditional chain-of-thought prompting performs poorly on problems harder than the examples shown in the prompt. To overcome this challenge, the paper proposes the least-to-most prompting strategy, which solves complex problems by breaking them down into a series of simpler sub-problems.
2. Least-to-Most Prompting strategy: The strategy has two stages: problem decomposition and sub-problem solving. In the decomposition stage, a fixed set of examples shows how to break a problem down; in the solving stage, the prompt contains three parts: examples of solved sub-problems, the list of already-answered sub-problems with their generated solutions, and the next sub-problem to be answered.
3. Example: Elsa has 5 apples, and Anna has 2 more apples than Elsa. How many apples do they have together?
Least-to-Most Prompting solution process:
Determine how many apples Anna has: Anna has 5 + 2 = 7 apples.
Determine how many apples Elsa and Anna have together: 5 + 7 = 12.
Output: Together they have 12 apples.
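The two stages can be sketched as follows, assuming hand-written stand-ins for both LLM calls (`decompose` and `solve_subproblem` are hypothetical helpers; a real system would prompt the model at each step, feeding the already-solved pairs forward):

```python
def decompose(problem):
    """Stage 1: an LLM would return the ordered sub-problems; hard-coded here."""
    return [
        "How many apples does Anna have?",
        "How many apples do Elsa and Anna have together?",
    ]

def solve_subproblem(question, solved):
    """Stage 2: the real prompt would contain the already-solved pairs in
    `solved` plus the next question; a rule-based stand-in answers here."""
    if question == "How many apples does Anna have?":
        return "Anna has 5 + 2 = 7 apples."
    return "Together they have 5 + 7 = 12 apples."

def least_to_most(problem):
    solved = []  # (sub-problem, solution) pairs, fed forward into each prompt
    for sub in decompose(problem):
        solved.append((sub, solve_subproblem(sub, solved)))
    return solved[-1][1]

answer = least_to_most("Elsa has 5 apples; Anna has 2 more apples than Elsa. "
                       "How many apples do they have together?")
```

The key design choice is that each sub-problem's answer becomes context for the next one, which is what lets the model climb from easy to hard.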
Comparison between L2M and CoT
Least-to-Most Prompting (L2M) and Chain-of-Thought Prompting (CoT) are two different language model prompting strategies, both of which aim to guide the model to complete a specific task by providing examples. Here is a comparison of the two modes, including their similarities and differences:
Similarities :
1. Prompts based on few-shot learning : Both L2M and CoT use few-shot prompts to guide language models to reason or answer questions. These examples are part of the input and help the model understand the requirements of the task.
2. Flexibility : Both L2M and CoT provide a flexible way to leverage pre-trained language models to solve a variety of problems without having to train or fine-tune the models for specific tasks.
Differences :
1. Complexity of the example :
L2M : Least-to-Most Prompting starts from the simplest sub-problem and gradually increases complexity. The assumption is that solving easy sub-problems first gives the model a basic foothold, and their answers then guide the model through progressively harder sub-problems.
CoT : Chain-of-Thought Prompting focuses on providing a series of intermediate reasoning steps that simulate the thinking process of humans when solving problems. Each example shows how to break down the problem and arrive at the answer step by step.
2. Transparency of reasoning process :
L2M : As the sub-problems grow more complex, the model's overall reasoning may become harder to follow, because the decomposition does not always spell out the reasoning inside each step.
CoT : By explicitly showing the intermediate reasoning steps, CoT provides transparency into the model’s reasoning process, making the process of generating the final answer clearer.
3. Task suitability :
L2M : May be better suited to tasks the model already partly understands, challenging and improving the model by gradually raising the difficulty of the sub-problems.
CoT : Particularly suitable for tasks that require multi-step logical reasoning, such as math problem solving, logic puzzles, etc., because it directly demonstrates the specific steps of reasoning through examples.
4. Consistency of results :
L2M : Because sub-problem complexity increases step by step, the model's output can shift as the decomposition changes, which may make results less consistent than CoT.
CoT : By spelling out the intermediate steps explicitly (and, when combined with self-consistency, sampling multiple reasoning paths and voting), CoT tends to produce more consistent and accurate answers.
All three technical routes above are work from Google.
Technical route 4: XoT, assorted variants built on the CoT idea
Boosting of Thoughts (BoT, 2024.02)
Boosting of Thoughts (BoT): Trial-and-Error Problem Solving with Large Language Models
University of Toronto, University of Alberta
Core content overview :
1. Research background: LLMs rely on chain-of-thought prompting when solving complex problems, but existing methods usually require manual annotation and generalize poorly to new tasks.
2.BoT method : BoT automatically explores and evaluates a large number of thinking trees through an iterative process to gain reasoning experience. These experiences are used to revise prompts and enhance the generation of reasoning steps until the final answer is obtained. This method does not rely on manual annotations for specific tasks, but gradually improves the reasoning steps by learning from the errors generated by the model.
3. Experiments: Experiments with GPT-4 and Llama2 on several complex reasoning datasets, including MMLU, SVAMP, GSM8K, AQuA, and MATH.
4. Results : BoT achieves comparable or higher problem solving rates than human annotations on most datasets, especially in the absence of human annotations, where its performance is significantly improved.
5. Contributions : BoT proposes a new framework that does not require task-specific manual annotations, has good scalability, and converges quickly to a solution through an iterative process.
6. Conclusion : BoT demonstrates the ability to guide LLMs to perform effective reasoning through enhanced hints, maintaining high performance in a variety of tasks even without human annotations.
Steps of BoT method :
1. Initial Prompt : Start with a simple initial prompt without any human-annotated examples.
2. Generate thinking structures : Generate a large number of thinking structures (e.g., tree structures) in parallel, which represent possible reasoning paths.
3. Aggregate thinking chains : Extract the thinking chains that are most likely to succeed from the generated thinking structure.
4. Self-assessment : Use LLM to evaluate the aggregated thinking chain and generate feedback, including error analysis and improvement suggestions.
5. Iterative improvement : Incorporate feedback into prompts and use it as experience to guide the generation of thinking structures for the next iteration.
6. Convergence and solution : Through multiple iterations, experience is accumulated and an accurate solution is finally obtained.
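As an illustration only, the generate-evaluate skeleton of steps 2 through 6 can be sketched on the 24 game. The LLM components are replaced here by brute-force enumeration of left-to-right operation chains and an exact arithmetic check (division is omitted to avoid division by zero), so this is a degenerate stand-in for BoT's learned iteration, not the method itself:

```python
import itertools
import operator

# Division omitted for simplicity (avoids zero-division in enumeration).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def generate_chains(numbers):
    """Step 2 stand-in: enumerate left-to-right operation chains as thoughts."""
    for perm in itertools.permutations(numbers):
        for ops in itertools.product(OPS, repeat=3):
            yield perm, ops

def evaluate(chain):
    """Step 4 stand-in: self-assessment reduced to an exact arithmetic check."""
    (a, b, c, d), (o1, o2, o3) = chain
    return OPS[o3](OPS[o2](OPS[o1](a, b), c), d) == 24

def solve24(numbers):
    """Steps 2-6 collapsed: keep generating until a chain passes the check."""
    for chain in generate_chains(numbers):
        if evaluate(chain):
            return chain
    return None

chain = solve24((2, 4, 5, 5))  # e.g. (5 + 5) * 2 + 4 = 24
```

In real BoT the evaluator is the LLM itself, and its textual feedback (rather than a boolean) is folded back into the next prompt.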
Example (the 24 game): the four given numbers are 2, 4, 5, 5
The BoT solution process is as follows:
1. Initial prompt : BoT starts with a simple prompt, telling the model that it needs to use the given four numbers and basic arithmetic operations (addition, subtraction, multiplication, and division) to get 24.
2. First iteration : The model generates some preliminary thinking steps, but these steps may not be close to the solution. For example, the model may first try to add two 5s to get 10, and then try to multiply 10 by 4 to get 40, which is obviously not the right direction.
3. Error analysis : The model evaluates the generated thought chain and identifies errors or invalid steps. At this stage, the model may recognize that 40 is not a reasonable intermediate result because it is too far from the target 24.
4. Second iteration : Based on the error analysis from the first iteration, the model adjusts the prompts and tries different operations and number combinations. This time, the model might try to use a combination of addition and multiplication, such as adding 2 and 4 to get 6, then adding two 5s to get 10, and finally multiplying 6 and 10 to get 60.
5. Further iterations : The model continues the iterative process, each time improving the thinking steps based on previous experience and error analysis. In subsequent iterations, the model may try more combinations until it finds a valid solution.
6. Final solution : After repeated iterations, the model eventually finds a correct chain of thought. For example: add the two 5s to get 10, multiply 10 by 2 to get 20, and finally add 4 to reach 24 (i.e., (5 + 5) × 2 + 4 = 24).
The initial prompt is a simple text description that guides the Large Language Model (LLM) to start solving the problem. For the 24-point game problem, the initial prompt might be:
"In the game of 24, you are given four numbers, and the goal is to use basic arithmetic operations (+, -, *, /) to combine these numbers to obtain a result of 24. You can only use each number once, and parentheses can be used to change the order of operations."
This prompt provides the model with the basic rules of the game, which is to use the four basic arithmetic operations of addition, subtraction, multiplication, and division to combine the four given numbers to get the result 24. This initial prompt does not contain specific problem-solving steps or examples, but provides the contextual information needed to solve the problem. Subsequently, the Boosting of Thoughts framework will gradually guide the model to explore possible problem-solving paths through an iterative process, and optimize the prompts based on the model's self-assessment after each iteration to better solve specific 24-point problems.
Tree of Thoughts (ToT, 2023.05)
Paper: Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Princeton University + Google DeepMind
Tree of Thoughts (ToT) is a structure proposed in the paper to enhance the reasoning ability of large language models (LLMs) in complex problem solving. ToT generalizes the traditional serialized reasoning steps into a tree structure, allowing the model to explore multiple possible thinking paths during reasoning. Each node in the tree represents a reasoning step, and branches represent different reasoning directions or choices.
Tree of Thoughts Features :
1. Tree structure : ToT uses a tree structure to represent the reasoning process, where the root node represents the initial state of the problem and the leaf nodes represent possible solutions or the final steps of reasoning.
2. Branch exploration : At each level of the tree, multiple branches are generated, each representing a different reasoning path. These branches can be generated based on the model’s understanding of the problem and possible solutions.
3. Backtracking mechanism : ToT allows the model to backtrack in the tree, that is, if a certain reasoning path does not seem to lead to the correct solution, the model can return to the previous node and try another path.
4. Dynamic expansion : The tree structure expands dynamically during the reasoning process, and the model can continuously add new nodes and branches based on new information and intermediate results.
5. Complex Problem Solving : ToT is particularly suitable for solving complex problems that require multi-step logical reasoning and multiple possible solutions.
Application in Boosting of Thoughts :
In the Boosting of Thoughts framework, the ToT structure is used to generate and evaluate a large number of reasoning paths. BoT builds these tree structures through an iterative process, and each iteration adjusts and improves the prompts based on the feedback generated by the LLM. In this way, even starting from a simple initial prompt, BoT is able to gradually guide the model to generate more effective reasoning chains through accumulated experience, and ultimately solve complex problems.
For example, when solving a math problem, BoT might generate multiple ToT structures containing different math operations and intermediate results. By evaluating these structures, BoT can identify which sequence of operations is more likely to get the correct answer, and use this information to guide the next round of reasoning. In this way, BoT is able to simulate the thinking process of humans when solving problems, gradually approaching the final solution through trial and error and self-correction.
Comments 1
There are three classic search strategies:
- greedy search
- exhaustive search
- beam search

| Feature | Greedy Search | Exhaustive Search | Beam Search |
| --- | --- | --- | --- |
| Definition | At each step, pick the locally optimal option, without considering the global optimum. | Systematically enumerate all possible options until the optimal solution is found. | A heuristic search that keeps only a fixed number of best candidates (the "beam width") at each step and builds the solution incrementally. |
| Goal | Quickly find a locally optimal solution. | Find the globally optimal solution. | Find a near-optimal solution with limited computing resources. |
| Efficiency | High: decisions are made immediately and most paths are never explored. | Low: every possibility must be explored, which can be very time-consuming. | Medium: limiting the search width balances efficiency and quality. |
| Completeness | Incomplete; the optimal solution may be missed. | Complete; guaranteed to find the optimal solution (if one exists). | Possibly incomplete, but usually finds a good solution. |
| Applicability | Problems with a small solution space or a simple structure. | Problems with a large solution space where the true optimum is required. | Problems with a large solution space where a good solution is needed in limited time. |
| Examples | Hill climbing; greedy shortest-path algorithms such as Dijkstra's. | Depth-first search (DFS), breadth-first search (BFS). | Beam-search decoding in sequence models; related heuristic searches include A* and Monte Carlo Tree Search (MCTS). |
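The difference between greedy and beam decoding shows up even on a two-step toy "language model" (the transition table below is invented for illustration): greedy commits to the locally best first token and misses the globally best sequence, while a width-2 beam recovers it.

```python
# Two-step toy "language model": probability of each next token given the
# prefix. Greedy takes "a" first (p=0.6), but the best full sequence is
# "by" (0.4 * 0.9 = 0.36 versus 0.6 * 0.5 = 0.30).
NEXT = {
    "":  [("a", 0.6), ("b", 0.4)],
    "a": [("x", 0.5), ("y", 0.5)],
    "b": [("x", 0.1), ("y", 0.9)],
}

def greedy_decode():
    """Pick the single most probable token at every step."""
    seq, p = "", 1.0
    while seq in NEXT:
        tok, q = max(NEXT[seq], key=lambda t: t[1])
        seq, p = seq + tok, p * q
    return seq, p

def beam_decode(width=2):
    """Keep the `width` most probable partial sequences at every step."""
    beams = [("", 1.0)]
    for _ in range(2):  # two decoding steps
        cands = [(s + t, p * q) for s, p in beams for t, q in NEXT.get(s, [])]
        beams = sorted(cands, key=lambda b: b[1], reverse=True)[:width]
    return beams[0]

print(greedy_decode()[0], beam_decode()[0])  # greedy: ax, beam: by
```

Exhaustive search would enumerate all four sequences; with a real vocabulary that blows up exponentially, which is why beam search is the usual compromise.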
Technical route 5: Multiple self-iteration mode (AutoGPT)
1. Amazon’s work: Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions
The paper explores the effectiveness and flexibility of autonomous agents in online decision-making tasks. Auto-GPT is an autonomous agent based on large language models (LLMs) that can connect to the Internet and attempt a wide variety of tasks. Although Auto-GPT has attracted widespread attention, its performance on real-world tasks remained uncertain.
2. Auto-GPT Features
- Receive high-level goals and instructions without step-by-step guidance from humans.
- Conduct self-monologue by generating “Thinking”, “Reasoning”, “Planning” and “Criticism”.
- Ability to integrate various tools with simple tool instructions and several examples.
- Contains long-term self-memory and memory retrieval mechanisms.
- Adaptation for specific tasks should require minimal effort, such as providing a goal definition and a tool description.
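The loop behind these features can be sketched schematically. Everything here is hypothetical: `fake_llm` is a hard-coded stand-in for the model (a real agent would prompt an LLM for a structured thought/command record), and the tool registry contains a single toy calculator:

```python
import json

# Toy tool registry; eval is acceptable only because the input is hard-coded.
TOOLS = {"calculator": lambda expr: str(eval(expr))}

def fake_llm(goal, memory):
    """Hard-coded stand-in for the LLM's thought/command JSON output."""
    if not memory:  # first step: plan to use the calculator tool
        return json.dumps({
            "thought": "I should compute the product with the calculator.",
            "command": {"name": "calculator", "args": "17 * 3"},
        })
    # later step: the needed result is already in memory, so finish
    return json.dumps({
        "thought": "The result is known; finish.",
        "command": {"name": "finish", "args": memory[-1]},
    })

def run_agent(goal, max_steps=5):
    memory = []  # long-term memory of tool results, fed back each step
    for _ in range(max_steps):
        step = json.loads(fake_llm(goal, memory))
        cmd = step["command"]
        if cmd["name"] == "finish":
            return cmd["args"]
        memory.append(TOOLS[cmd["name"]](cmd["args"]))
    return None

result = run_agent("What is 17 * 3?")
```

The essential shape is the same as the real system: think, pick a tool, execute, store the observation, repeat until a finish command.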
3. Experimental setup: WebShop and ALFWorld are two simulated environments for evaluating language models on online decision-making tasks that require responding to an unknown external environment.
WebShop : WebShop is an environment that simulates an online shopping experience. It creates a realistic action space by crawling 1,181,436 product information from Amazon.com and hosting these products on an isolated server. The environment provides agents with the option to perform product searches, click on items, return to previous pages, and make purchases. Equipped with an integrated search engine, the environment provides shopping agents with real-time observations similar to those of a web browser. The evaluation process involves judging whether the agent can successfully purchase the intended product based on the product description, which requires a full match of the product itself, attributes, options, and price.
ALFWorld : ALFWorld is a groundbreaking research environment that combines the complex, task-oriented language understanding of the ALFRED dataset with the immersive, interactive fiction of TextWorld. The ALFRED (Action Learning From Realistic Environments and Directives) benchmark provides a powerful testing ground for models to learn to parse and execute tasks from language instructions in detailed, interactive 3D environments. At the same time, TextWorld serves as a dynamic learning playground for training and evaluating reinforcement learning agents in text-based games. ALFWorld interweaves these two platforms, combining the language understanding and decision-making challenges of text-based games with physical interactions in 3D environments, and is a key step in fusing natural language instructions with real-world physical interactions. The environment contains more than 25,000 unique tasks generated in realistic environments in various areas such as kitchens, living rooms, and bedrooms. These tasks require complex problem-solving skills and a thorough understanding of language and environment, providing a higher benchmark for AI performance.
4. Model comparison: Comparison of the performance of popular LLMs such as GPT-4, GPT-3.5, Claude, and Vicuna on Auto-GPT-style decision-making tasks.
5. Additional Opinions algorithm : The Additional Opinions algorithm is introduced as an effective way to incorporate supervised/imitation-based learners into the Auto-GPT scheme. This approach enables lightweight supervised learning without the need to fine-tune the base LLMs.
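A plausible sketch of the idea (the function names and prompt wording are assumptions, not the paper's exact format): the expert model's top-k actions are appended to the prompt as a non-binding opinion, and the LLM remains the final decision maker.

```python
def expert_opinions(observation, k=2):
    """Stand-in for a small imitation-learning model ranking candidate actions."""
    ranked = ["click[buy now]", "click[blue]", "search[running shoes]"]
    return ranked[:k]

def build_prompt(observation, actions):
    """Fold the expert's (non-binding) suggestions into the LLM prompt."""
    opinions = expert_opinions(observation)
    return (
        "Observation: " + observation + "\n"
        "Available actions: " + ", ".join(actions) + "\n"
        "An expert model suggests (you may ignore this): "
        + ", ".join(opinions) + "\n"
        "Choose one action:"
    )

prompt = build_prompt("product page for blue running shoes",
                      ["click[buy now]", "click[blue]", "click[back]"])
```

Because the opinion is advisory text rather than a constraint, no fine-tuning of the base LLM is needed, which matches the "lightweight supervised learning" framing above.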
6. Experimental results: Through careful baseline comparisons and ablation studies, the authors demonstrate that the Additional Opinions algorithm significantly improves performance on online decision-making benchmarks, including WebShop and ALFWorld.
7. Conclusion: Auto-GPT not only demonstrates its potential in practical use, but also, driven by GPT-4, outperforms supervised IL (imitation learning) models designed specifically for these tasks. In addition, by introducing additional opinions from external expert models, the decision-making ability of the Auto-GPT style agent is further improved, which is especially beneficial to GPT-4.
This paper demonstrates the potential of Auto-GPT in handling complex online decision-making tasks and proposes a new method to utilize external models as providers of additional opinions, which opens up new possibilities for the use of AI models in practical applications.
Comments 2
By now, the AutoGPT recipe (2023.03) already looks somewhat dated. Specifically:
1. Fixed prompt structure : AutoGPT relies on a predefined prompt structure, which limits the model's flexibility and adaptability. Faced with new tasks, generalization inevitably suffers, which significantly increases deployment complexity.
2. Lack of contextual understanding : AutoGPT follows a "one step at a time" logic. Although it can generate coherent text, it falls short at understanding and maintaining long-range context. As a result, in long conversations or complex tasks the model may lose topic consistency or ignore important contextual information.
3. Limitations in reasoning ability : AutoGPT mainly relies on pattern matching and associative learning rather than real reasoning, and still follows the old beam-search path. For problems that require logical reasoning and deep understanding, the model may fail to give accurate answers.
4. Data dependence : AutoGPT's performance depends heavily on the quality and diversity of the expert hints. If the expert hints are biased or insufficient, the results can be disastrous.
This article was compiled with heavy use of large models. Although a large model cannot yet produce a complete summary document with its own opinions, examples, and logic, and sometimes emits low-information filler (yes, nonsense), it has shown remarkable ability at extracting summaries from long texts. In the past, organizing a document like this might have taken an experienced student a full day; this article took only about two hours to put together.