Google's Official Prompt Engineering White Paper: Complete Translation and PDF Download

Written by Iris Vance
Updated on: July 2, 2025

Recommendation

Explore Google's official white paper on prompt engineering and master the rigorous development and review process of high-quality technical documents in the field of artificial intelligence.

Core content:
1. The process of creating high-quality technical documents through collaboration among experts in multiple fields
2. The teaching structure of prompt engineering, the progressive relationship from basic to advanced skills
3. The complexity and influencing factors of prompt writing, and the importance of the iterative process

Yang Fangxian, Founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)


The completion of this white paper benefited from the collaboration and contributions of many experts. The participation of different roles such as reviewers, contributors, planners, editors, technical writers, and designers reflects the multi-faceted effort and rigorous development and review process involved in creating high-quality technical documents in the field of artificial intelligence. This shows that although the concept of prompt engineering is relatively easy to understand, its effective practice and knowledge dissemination still require a structured approach and clear presentation, reflecting the growing importance and complexity of this field¹.


The table of contents clearly shows the structure of the white paper: starting with basic concepts (Introduction, Basics), diving into specific techniques (from zero-shot to ReAct), covering key applications (Code Prompting), mentioning future directions (Multimodality), and ending with practical advice (Best Practices). This didactic structure helps readers understand the flow of information and how the content is organized¹. The progression from basic configuration and zero-shot prompts to complex reasoning techniques such as CoT, ToT, and ReAct shows that advanced LLM applications require increasingly sophisticated skills. The inclusion of "Code Prompting" as a major section highlights the importance of LLMs in software development. The considerable space devoted to "Best Practices" emphasizes the empirical and iterative nature of this field¹.

Introduction

When talking about inputs and outputs of large language models (LLMs), a text prompt (sometimes accompanied by other modalities such as images) is the input the model uses to predict a specific output¹. Writing prompts is not the preserve of data scientists or machine learning engineers; anyone can do it. However, building the most effective prompts can be quite complex. The effectiveness of a prompt is influenced by many factors: the model used, the data the model was trained on, the model configuration, word choice, style and tone, structure, and context all matter¹. Therefore, prompt engineering is an iterative process. Inappropriate prompts can lead to vague, inaccurate responses and hinder the model's ability to provide meaningful output¹.

You don't need to be a data scientist or a machine learning engineer – everyone can write a prompt

This statement lowers the bar for beginners, but the list of influencing factors that follows (model selection, training data, configuration, wording, style, tone, structure, context) reveals the inherent complexity. It shows that while basic interaction is easy, obtaining reliable, high-quality results requires deliberate effort and accumulated knowledge, which is the realm of "engineering"¹.

When users interact with the Gemini chatbot¹, they are essentially writing prompts. However, this white paper focuses on writing prompts for the Gemini model in Vertex AI or through the API, because direct prompting of the model gives access to configuration parameters such as temperature¹. This explicit focus on direct model interaction (via Vertex AI or the API rather than a chat interface) suggests that fine-grained control over configurations such as temperature is considered fundamental to advanced prompt engineering, as distinct from casual chatbot use. Mastering prompt engineering involves not only the prompt text itself, but also manipulating the model's generation parameters, which is critical for tasks that require a particular level of creativity or certainty¹.

This white paper explores prompt engineering in detail. We'll look at various prompting techniques, help you get started, and share tips and best practices for becoming a prompting expert. We'll also discuss some of the challenges you may encounter when building prompts¹.

Prompt engineering

It is important to understand how an LLM works: it is a prediction engine. The model takes sequential text as input and then predicts what the next token should be, based on the data it was trained on. The LLM does this over and over again, adding each previously predicted token to the end of the sequence to predict the following token. Each next-token prediction is based on the relationship between the content of the preceding tokens and what the LLM has seen during training¹.
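To make the predict-sample-append loop concrete, here is a minimal sketch (not from the white paper). The toy `next_token_probs` table stands in for a real model's forward pass, which any actual LLM API hides behind a single generation call.

```python
import random

# Toy stand-in for a real model's next-token distribution (a real LLM computes
# this from billions of parameters; here it is a hard-coded example).
def next_token_probs(tokens):
    table = {
        ("the",): {"cat": 0.5, "dog": 0.4, "sky": 0.1},
        ("cat",): {"sat": 0.7, "ran": 0.3},
        ("dog",): {"barked": 0.8, "sat": 0.2},
    }
    return table.get((tokens[-1],), {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)             # predict a distribution
        choices, weights = zip(*probs.items())
        token = random.choices(choices, weights)[0]  # sample the next token
        if token == "<eos>":
            break
        tokens.append(token)                         # append and repeat
    return tokens

print(generate(["the"]))  # e.g. ['the', 'cat', 'sat']
```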

When you write a prompt, you are attempting to guide the LLM to predict the right sequence of tokens. Prompt engineering is the process of designing high-quality prompts that guide the LLM to produce accurate output. This process involves iterating to find the best prompt, optimizing prompt length, and evaluating a prompt's writing style and structure in relation to the task¹. In the context of natural language processing and LLMs, a prompt is an input provided to the model to generate a response or prediction¹.

The term "engineering" is appropriate here because it describes a systematic process involving "design," "optimization," "evaluation," and "debugging." This is more than just writing; it is a process of systematic improvement in response to requirements, similar to traditional engineering disciplines. It elevates prompt creation from a simple act of asking questions to a purposeful, goal-oriented design process¹.

Prompts can be used to achieve a variety of understanding and generation tasks, such as text summarization, information extraction, question answering, text classification, language or code translation, code generation, and code documentation or reasoning¹.

Refer to Google's prompting guide²,³ for examples of simple and effective prompts¹.

When engineering prompts, you first need to choose a model. Whether you use the Gemini language models in Vertex AI, GPT, Claude, or an open-source model such as Gemma or LLaMA, prompts may need to be optimized for the specific model¹. Explicitly stating that prompts may need to be optimized for a specific model (e.g., Gemini, GPT, Claude, Gemma, LLaMA) emphasizes that prompt engineering is not a one-size-fits-all skill. Techniques may be general, but the optimal wording and structure may be model-dependent due to differences in model architecture, training data, and fine-tuning. Effective prompt engineering requires understanding the characteristics of the target model¹.

In addition to the prompt itself, you also need to tune the various configurations of the LLM¹.

LLM output configuration

Once a model has been selected, the model configuration needs to be determined. Most LLMs come with a variety of configuration options that control their output. Effective prompt engineering requires optimizing these configurations for the specific task¹.

Output length

An important configuration setting is the number of tokens to generate in a response. Generating more tokens requires more computation by the LLM, resulting in higher energy consumption, potentially slower response times, and higher costs¹.

Reducing the output length of an LLM does not make the LLM stylistically or textually more concise in the output it creates; it simply causes the LLM to stop predicting more tokens once the limit is reached. If your requirements call for a short output length, you may also need to design your prompt accordingly¹. To be clear: limiting the number of tokens (e.g., max_tokens) via configuration is a forced truncation, which by itself does not make the model produce concise content. Achieving conciseness usually requires a specific instruction in the prompt itself (e.g., "summarize in one sentence")¹.
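As a rough illustration of the difference between truncation and conciseness, here is a sketch using the Vertex AI Python SDK; the project ID and model name are placeholders, and exact parameter names may vary with the SDK version.

```python
# Sketch: capping output length vs. asking for conciseness
# (assumes `pip install google-cloud-aiplatform` and a GCP project).
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-flash")                    # placeholder model name

# Hard cap: generation simply stops after ~30 tokens, mid-sentence if need be.
truncated = model.generate_content(
    "Explain prompt engineering.",
    generation_config=GenerationConfig(max_output_tokens=30),
)

# Conciseness has to be requested in the prompt itself.
concise = model.generate_content("Explain prompt engineering in one sentence.")

print(truncated.text)
print(concise.text)
```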

Output length limits are particularly important for some LLM prompting techniques, such as ReAct, because the LLM may continue to emit useless tokens after obtaining the required response¹.

Sampling controls

LLMs do not formally predict a single token. Instead, an LLM predicts probabilities for what the next token could be, with every token in the LLM's vocabulary assigned a probability. These token probabilities are then sampled to determine the next token to be generated¹.

Temperature, Top-K, and Top-P are the most common configuration settings that determine how the predicted token probabilities are processed to select a single output token¹.

Temperature

Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that expect a more deterministic response, while higher temperatures may lead to more diverse or unexpected results. A temperature of 0 (greedy decoding) is deterministic: the token with the highest probability is always chosen (but note that if two tokens have the same highest predicted probability, a temperature of 0 may not always give the same output, depending on how ties are handled)¹.

Temperatures close to the maximum tend to produce more random output. As the temperature gets higher, all tokens become equally likely to be the next predicted token¹. The temperature parameter provides a fundamental trade-off between predictability/factual accuracy (low temperature) and creativity/diversity (high temperature). Choosing the right temperature for each task is critical: factual question answering calls for low temperatures, while story generation may benefit from high temperatures¹.

Gemini's temperature control can be understood similarly to the softmax function used in machine learning. A low Gemini temperature setting is similar to a low softmax temperature (T), emphasizing a single preferred temperature with high certainty. A higher Gemini temperature setting is similar to a high softmax temperature, making a wider range of temperatures around the selected setting more acceptable. This increased uncertainty accommodates scenarios where a strictly precise temperature is not required, such as when experimenting with creative output¹.
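The softmax analogy can be made concrete with a few lines of NumPy (an illustrative sketch, not taken from the white paper): the same logits become sharply peaked at a low temperature and nearly uniform at a high one.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                      # greedy: all mass on the argmax
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    z = logits / temperature
    z -= z.max()                              # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_with_temperature(logits, 0.2))   # sharply peaked on the top token
print(softmax_with_temperature(logits, 1.0))   # the "raw" distribution
print(softmax_with_temperature(logits, 10.0))  # nearly uniform: every token likely
```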

Top-K and top-P

Top-K and Top-P (also known as nucleus sampling)⁴ are two sampling settings used in LLMs to restrict the predicted next token to come from the tokens with the highest predicted probabilities. Like temperature, these sampling settings control the randomness and diversity of the generated text¹.

  • Top-K sampling selects the K tokens with the highest probability from the distribution predicted by the model. The higher the Top-K value, the more creative and diverse the model's output; the lower the Top-K value, the more restricted and fact-based the output. A Top-K of 1 is equivalent to greedy decoding¹.

  • Top-P sampling selects the highest-probability tokens whose cumulative probability does not exceed a certain value (P). Values of P range from 0 (greedy decoding) to 1 (all tokens in the LLM's vocabulary)¹.

Top-K and Top-P provide complementary ways, alongside temperature, to shape the probability distribution before sampling. Top-K limits the number of choices, while Top-P limits based on cumulative probability mass. Top-K sets a hard limit on the number of tokens considered (e.g., only look at the 40 most likely). Top-P sets a limit based on the sum of probabilities (e.g., consider tokens until their probabilities add up to 0.95). These are different mechanisms. Understanding this difference allows for finer control over output diversity than using temperature alone¹.

The best way to choose between Top-K and Top-P is to experiment with both methods (or with both together) to see which produces the results you are looking for¹.
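A small NumPy sketch (illustrative, not from the white paper) makes the two truncation rules explicit: Top-K keeps a fixed number of tokens, while Top-P keeps however many tokens it takes to reach a cumulative probability.

```python
import numpy as np

def top_k_filter(probs, k):
    probs = np.asarray(probs, dtype=float)
    keep = np.zeros_like(probs, dtype=bool)
    keep[np.argsort(probs)[::-1][:k]] = True      # the K most probable tokens
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()              # renormalize the survivors

def top_p_filter(probs, p):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]               # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix reaching p
    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:cutoff]] = True
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, k=2))    # only the two most probable tokens survive
print(top_p_filter(probs, p=0.9))  # tokens kept until cumulative probability reaches 0.9
```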

Putting it all together

The choice among Top-K, Top-P, temperature, and the number of tokens to generate depends on the specific application and the desired results, and these settings affect each other¹. It is also important to understand how your chosen model combines the different sampling settings.

If Temperature, Top-K, and Top-P are all available (as in Vertex AI Studio), tokens that satisfy both the Top-K and Top-P criteria become candidates for the next predicted token, and Temperature is then applied to sample from the tokens that pass both criteria. If only Top-K or only Top-P is available, the behavior is the same, but only that single Top-K or Top-P setting is used¹.

If temperature is not available, a token is randomly selected from those satisfying the Top-K and/or Top-P criteria to produce the single next predicted token¹.

At extreme settings of one sampling configuration value, that sampling setting either cancels out the other configuration settings or becomes irrelevant¹:

  • If you set the temperature to 0, Top-K and Top-P become irrelevant; the token with the highest probability becomes the next predicted token. If you set the temperature extremely high (above 1, usually on the order of 10), the temperature becomes irrelevant, and any token that passes the Top-K and/or Top-P criteria is then randomly sampled to select the next predicted token¹.

  • If you set Top-K to 1, Temperature and Top-P become irrelevant. Only one token passes the Top-K criterion, and that token is the next predicted token. If you set Top-K extremely high, such as to the size of the LLM's vocabulary, any token with a non-zero probability of being the next token will satisfy the Top-K criterion, and no tokens are filtered out¹.

  • If Top-P is set to 0 (or a very small value), most LLM sampling implementations will only consider the single most probable token to satisfy the Top-P criterion, making Temperature and Top-K irrelevant. If Top-P is set to 1, any token with a non-zero probability of being the next token will satisfy the Top-P criterion, and no tokens are filtered out¹.

These configuration parameters do not operate independently. Their interactions are complex, and some settings can dominate or counteract others. Understanding these interactions is critical for predictable control: for example, setting Temperature to 0 or Top-K to 1 invalidates the other sampling parameters. Effective configuration requires a global perspective¹.
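The combination order described above (Top-K and Top-P restrict the candidate set, then temperature governs sampling among the survivors) can be sketched as follows. This is an illustrative implementation; real inference engines may differ in details such as tie-breaking and normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(probs, temperature=0.2, top_k=30, top_p=0.95):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]

    in_top_k = np.zeros_like(probs, dtype=bool)
    in_top_k[order[:top_k]] = True                                   # Top-K criterion

    cumulative = np.cumsum(probs[order])
    in_top_p = np.zeros_like(probs, dtype=bool)
    in_top_p[order[:np.searchsorted(cumulative, top_p) + 1]] = True  # Top-P criterion

    candidates = in_top_k & in_top_p                                 # must satisfy both
    if temperature == 0:                                             # greedy decoding
        return int(order[0])
    logits = np.log(np.where(candidates, probs, 1e-12)) / temperature
    logits -= logits.max()
    weights = np.where(candidates, np.exp(logits), 0.0)
    weights /= weights.sum()
    return int(rng.choice(len(probs), p=weights))

probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
print(sample_next_token(probs, temperature=0.2, top_k=3, top_p=0.9))
```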

As a general starting point, a Temperature of 0.2, a Top-P of 0.95, and a Top-K of 30 will give relatively coherent results that can be creative but not excessively so. If you want particularly creative results, try starting with a Temperature of 0.9, a Top-P of 0.99, and a Top-K of 40. If you want less creative results, try starting with a Temperature of 0.1, a Top-P of 0.9, and a Top-K of 20. Finally, if the task always has a single correct answer (e.g., answering math problems), start with a Temperature of 0¹. Providing specific starting values for different task types (coherent, creative, factual) offers valuable practical guidance. It acknowledges that finding the optimal settings requires experimentation, but the search can be guided by heuristics that reduce the initial search space¹.
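Expressed as configuration objects for the Vertex AI Python SDK (a sketch; the class and parameter names assume the google-cloud-aiplatform package), these starting points look like this:

```python
from vertexai.generative_models import GenerationConfig

balanced = GenerationConfig(temperature=0.2, top_p=0.95, top_k=30)  # coherent, mildly creative
creative = GenerationConfig(temperature=0.9, top_p=0.99, top_k=40)  # diverse, surprising
factual  = GenerationConfig(temperature=0.1, top_p=0.9,  top_k=20)  # conservative
single_answer = GenerationConfig(temperature=0.0)                   # e.g. math problems
```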

Note: The more degrees of freedom you allow (higher Temperature, Top-K, Top-P, and output token limits), the less relevant the text the LLM produces may become¹.

Prompting techniques

LLMs are tuned to follow instructions and trained on large amounts of data, so they are able to understand prompts and generate answers. But LLMs are not perfect; the clearer the prompt text, the better the LLM predicts the next likely text. In addition, specific techniques that take advantage of how LLMs are trained and how they work will help you get relevant results from LLMs¹.

Now that we understand what prompt engineering is and what it involves, let's dive into examples of some of the most important prompting techniques¹.

General prompting / zero shot

A zero-shot⁵ prompt is the simplest type of prompt. It provides only a description of the task and some text for the LLM to start with. This input can be anything: a question, the beginning of a story, or instructions. The name "zero-shot" stands for "no examples"¹.

Let's use Vertex AI Studio (for Language)⁶ in Vertex AI, which provides a platform for testing prompts. In Table 1, you'll see an example of a zero-shot prompt for classifying movie reviews¹.

The table format used below is a great way to document prompts. Your prompts may go through many iterations before they end up in a code base, so it is important to track your prompt engineering work in a disciplined, structured way. For more about this table format, the importance of tracking prompt engineering work, and the prompt development process, see the "Best Practices" section later in this chapter ("Document the various prompt attempts")¹.

The model temperature should be set to a low number, since no creativity is needed, and we use the gemini-pro default Top-K and Top-P values, which effectively disable both settings (see "LLM Output Configuration" above). Pay attention to the generated output. The words "disturbing" and "masterpiece" should make the prediction a little more complicated, since both are used in the same sentence¹.

Table 1: An example of zero-shot prompting 

Name: 1_1_movie_classification
Goal: Classify movie reviews as positive, neutral or negative.
Model: gemini-pro
Temperature: 0.1
Token Limit: 5
Top-K: N/A
Top-P: 1

Prompt: Classify movie reviews as POSITIVE, NEUTRAL, or NEGATIVE. Review: "Her" is a disturbing study in where humanity is headed if artificial intelligence is allowed to continue to evolve unchecked. I wish there were more movies like this masterpiece. Sentiment:

Output: POSITIVE

  • This table provides a concrete, minimal example illustrating the zero-shot concept. It shows the basic structure of a prompt for a common task (classification) and includes metadata (goal, model, configuration) that is critical for reproducibility and documentation, as advocated in the text¹.
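For reference, here is a sketch of how the prompt and configuration from Table 1 might be submitted through the Vertex AI Python SDK; the project ID and model name are placeholders rather than the exact setup used in the white paper.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-flash")                    # placeholder model name

prompt = (
    "Classify movie reviews as POSITIVE, NEUTRAL, or NEGATIVE.\n"
    'Review: "Her" is a disturbing study in where humanity is headed if artificial '
    "intelligence is allowed to continue to evolve unchecked. I wish there were more "
    "movies like this masterpiece.\n"
    "Sentiment:"
)

response = model.generate_content(
    prompt,
    generation_config=GenerationConfig(temperature=0.1, max_output_tokens=5),
)
print(response.text)  # expected: POSITIVE
```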

When zero-shot does not work, you can provide demonstrations or examples in the prompt, which leads to "one-shot" and "few-shot" prompting¹.

One-shot & few-shot

When creating prompts for an AI model, it's helpful to provide examples. These examples help the model understand what you want. Examples are especially useful when you want to guide the model to follow a specific output structure or pattern¹.

A one-shot prompt provides a single example, hence the name one-shot. The idea is that the model has one example it can imitate to best complete the task¹.

A few-shot prompt⁷ provides multiple examples to the model. This approach shows the model a pattern that it needs to follow. The idea is similar to one-shot, but multiple examples of the desired pattern increase the chance that the model follows it¹.

The number of examples required for a few-shot prompt depends on several factors, including the complexity of the task, the quality of the examples, and the capabilities of the generative AI (gen AI) model you are using. As a general rule of thumb, few-shot prompts should use at least three to five examples. However, for more complex tasks you may need to use more examples, or, due to the input length limitations of your model, you may need to use fewer¹.

Table 2 shows an example of a few-shot prompt, using the same gemini-pro model configuration settings as before, except that the token limit is increased to accommodate longer responses¹.

Table 2: An example of few-shot prompting 

Goal: Parse pizza orders to JSON
Model: gemini-pro
Temperature: 0.1
Token Limit: 250
Top-K: N/A
Top-P: 1

Prompt:

Parse the customer's pizza order into valid JSON:

Example:
I'd like a small pizza with cheese, tomato sauce and pepperoni.
JSON response:
{
  "size": "small",
  "type": "normal",
  "ingredients": [["cheese", "tomato sauce", "peperoni"]]
}

Example:
Can I have a large pizza with tomato sauce, basil, and mozzarella?
JSON response:
{
  "size": "large",
  "type": "normal",
  "ingredients": [["tomato sauce", "bazel", "mozzarella"]]
}

Now, I want a large pizza with half cheese and mozzarella. The other half is tomato sauce, ham, and pineapple.
JSON response:

Output:
{
  "size": "large",
  "type": "half-half",
  "ingredients": [["cheese", "mozzarella"], ["tomato sauce", "ham", "pineapple"]]
}

  • This table demonstrates the power of few-shot prompting on a structured-output task (JSON generation). It shows how providing examples of varying complexity (a simple pizza vs. a half-and-half pizza) teaches the model the expected format and logic, allowing it to accurately handle new, similar requests¹. Few-shot prompting leverages the model's in-context learning capabilities. By observing examples, the model can infer the underlying task and expected output format without explicit instruction tuning. The quality and diversity of the examples are therefore critical¹.

When choosing examples for your prompt, use examples that are relevant to the task you want to perform. The examples should be diverse, of high quality, and well written. One small mistake can confuse the model and lead to undesired output¹.

If you are trying to generate output that is robust to a wide variety of inputs, it is important to include edge cases in your examples. Edge cases are inputs that are unusual or unexpected but that the model should still be able to handle¹.
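Because few-shot examples are assets worth curating and versioning, it often pays to assemble the prompt programmatically. The following sketch (not from the white paper) rebuilds the pizza-order prompt of Table 2 from a list of examples:

```python
import json

# Curated examples: (customer text, expected parsed order)
EXAMPLES = [
    ("I'd like a small pizza with cheese, tomato sauce and pepperoni.",
     {"size": "small", "type": "normal",
      "ingredients": [["cheese", "tomato sauce", "pepperoni"]]}),
    ("Can I have a large pizza with tomato sauce, basil and mozzarella?",
     {"size": "large", "type": "normal",
      "ingredients": [["tomato sauce", "basil", "mozzarella"]]}),
]

def build_few_shot_prompt(order: str) -> str:
    parts = ["Parse the customer's pizza order into valid JSON:\n"]
    for text, parsed in EXAMPLES:
        parts.append(f"Example:\n{text}\nJSON response:\n{json.dumps(parsed, indent=2)}\n")
    parts.append(f"{order}\nJSON response:\n")
    return "\n".join(parts)

print(build_few_shot_prompt(
    "Now, I want a large pizza, half cheese and mozzarella, "
    "the other half tomato sauce, ham and pineapple."
))
```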

System, contextual and role prompting

System, contextual, and role prompting are all techniques used to guide how LLMs generate text, but they focus on different aspects¹:

  • System prompting sets the overall context and purpose for the language model. It defines the "big picture" of what the model should be doing, such as translating a language, classifying a review, and so on¹.

  • Contextual prompting provides specific details or background information relevant to the current conversation or task. It helps the model understand the nuances of what is being asked and tailor the response accordingly¹.

  • Role prompting assigns the language model a specific role or identity to adopt. This helps the model generate responses that are consistent with the assigned role and its associated knowledge and behavior¹.

There can be considerable overlap between system, contextual, and role prompting. For example, a prompt that assigns a role to the model can also include context¹.

However, each type of prompt serves a slightly different primary purpose¹:

  • System prompt: defines the model's fundamental capabilities and overarching purpose¹.

  • Contextual prompt: provides immediate, task-specific information to guide the response. It is highly specific to the current task or input and is dynamic¹.

  • Role prompt: frames the model's output style and voice. It adds a layer of specificity and personality¹.

These three prompt types represent different levels or dimensions of guiding an LLM. System prompts set the stage, contextual prompts provide immediate scene details, and role prompts define the character of the "actor." They can be used alone or in combination for fine-grained control. For example, a system prompt might define the task as "translate text," a contextual prompt might provide the source text and target language, and a role prompt might specify "act as a professional translator specializing in legal documents." This layered approach allows highly specific control over the output¹.

Distinguishing between system, contextual, and role prompts provides a framework for designing prompts with clear intent, allows flexible combinations, and makes it easier to analyze how each prompt type affects the language model's output¹.
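A minimal sketch of this layering, with illustrative wording of my own rather than text from the white paper:

```python
# Three layers composed into one prompt (wording is illustrative only).
SYSTEM = "You translate text from English to Dutch."                        # what to do
ROLE = "Act as a professional translator specializing in legal documents."  # how to behave
CONTEXT = "The text below is a clause from a commercial lease agreement."   # what it's about

def build_prompt(user_text: str) -> str:
    return "\n\n".join([SYSTEM, ROLE, CONTEXT, f"Text:\n{user_text}\n\nTranslation:"])

print(build_prompt("The tenant shall return the premises in good condition."))
```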

Let’s take a closer look at these three different types of prompts.

System prompting

Table 3 contains a system prompt that specifies additional information about how to return the output. The temperature was increased for a higher creativity level, and a higher token limit was specified. However, because of the clear instruction on how to return the output, the model did not return extra text¹.

Table 3: An example of system prompting¹

Goal: Classify movie reviews as positive, neutral or negative. Only return the label in uppercase.
Model: gemini-pro
Temperature: 1
Token Limit: 5
Top-K: 40
Top-P: 0.8

Prompt: Classify movie reviews as POSITIVE, NEUTRAL, or NEGATIVE. Return labels in uppercase only. Review: "Her" is a disturbing study in where humanity is headed if artificial intelligence is allowed to continue to evolve unchecked. It was so disturbing that I couldn't watch it. Sentiment:

Output: NEGATIVE

  • This example clearly demonstrates how a system prompt enforces a specific output-format constraint (an uppercase label) even at a high temperature setting, which would normally encourage more verbose output¹.

System prompts are very useful for generating output that meets specific requirements. The name "system prompt" actually stands for "providing an additional task to the system"¹. For example, you can use a system prompt to generate code snippets that are compatible with a specific programming language, or to return output with a certain structure. Consider Table 4, where the output is returned in JSON format¹.

Table 4: An example of system prompting with JSON format 

Goal: Classify movie reviews as positive, neutral or negative, return JSON.
Model: gemini-pro
Temperature: 1
Token Limit: 1024
Top-K: 40
Top-P: 0.8

Prompt:

Classify a movie review as positive (POSITIVE), neutral (NEUTRAL), or negative (NEGATIVE). Return valid JSON:
Review: "Her" is a disturbing study in where humanity is headed if artificial intelligence is allowed to continue to evolve unchecked. It is so disturbing that I could not watch it.
Schema:
MOVIE: {
  "sentiment": String "POSITIVE"

Output:
{
  "Movie_reviews":
}

  • This table illustrates a more complex use case for system prompts: enforcing structured data output (JSON) by providing a schema. This is valuable for integrating LLM output into downstream applications. It also highlights the benefit of enforcing structure to help limit hallucinations¹.

There are some benefits to returning JSON objects from prompts that extract data. In practice, the JSON does not have to be created manually, the data can be returned in a sorted order (very convenient when dealing with datetime objects), and, most importantly, requiring JSON format through the prompt forces the model to create structure and limits hallucinations¹. System prompts show that direct instructions, especially about output format or constraints such as safety, can effectively override default model behavior or tendencies influenced by other settings such as a high temperature¹.

System prompts are also very useful for controlling safety and toxicity. To control the output, simply add a line to the prompt, such as: "Your answers should be respectful."¹
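Here is a sketch of how a system prompt and a JSON requirement might be passed through the Vertex AI Python SDK, which exposes a separate system_instruction argument and a response_mime_type setting; the project ID, model name, and instruction wording are illustrative assumptions.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="us-central1")  # placeholders

model = GenerativeModel(
    "gemini-1.5-flash",  # placeholder model name
    system_instruction=(
        "Classify movie reviews as POSITIVE, NEUTRAL, or NEGATIVE. "
        "Return valid JSON only. Your answers should be respectful."
    ),
)

review = ('Review: "Her" is a disturbing study in where humanity is headed if '
          "artificial intelligence is allowed to continue to evolve unchecked.")

response = model.generate_content(
    review,
    generation_config=GenerationConfig(response_mime_type="application/json"),
)
print(response.text)  # e.g. {"sentiment": "NEGATIVE"}
```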

Role prompting

Role prompting is a technique in prompt engineering that involves assigning a specific role to the generative AI model. This can help the model generate more relevant and informative output, because the model can craft its responses based on the specific role it has been assigned¹.

For example, you could role-prompt a generative AI model to be a book editor, a kindergarten teacher, or a motivational speaker. Once the model has been assigned a role, you can give it prompts specific to that role. For example, you could prompt a teacher to create a lesson plan that they can then review. See Table 5, where the model plays the role of a travel guide¹.

Table 5: An example of role prompting 

Goal: Act as travel guide and provide 3 travel suggestions
Model: gemini-pro
Temperature: 1
Token Limit: 1024
Top-K: 40
Top-P: 0.8

Prompt: I want you to act as a travel guide. I will tell you where I am and you will suggest 3 places worth visiting nearby. In some cases, I will also tell you the type of places I will visit. My suggestion: "I am in Amsterdam and just want to visit museums." Travel suggestions:

Output:
1. Rijksmuseum: A world-renowned museum known for Dutch masterpieces like Rembrandt's "The Night Watch." Don't miss the incredible collection of paintings and artifacts showcasing Dutch art and history.
2. Van Gogh Museum: Immerse yourself in the fascinating work of Vincent van Gogh. The museum houses the largest collection of his paintings and drawings, including "The Starry Night" and "Sunflowers."
3. Stedelijk Museum Amsterdam: Explore modern and contemporary art from around the world. The museum's collection includes works by Picasso, Kandinsky, and Rietveld, and is housed in a striking modern building.

  • This table demonstrates how assigning a role ("travel guide") steers the model to provide relevant recommendations (museums in Amsterdam) in an appropriate format (a list with descriptions)¹.

Defining a role perspective for an AI model gives it a blueprint for the desired tone, style, and focused expertise, improving the quality, relevance, and effectiveness of the output¹. Role prompting lets an LLM impersonate a persona, adopting its expected knowledge, vocabulary, tone, and style. This is very effective for tailoring output to specific audiences or communication goals¹.

Here are some effective styles to choose from:

Confrontational, Descriptive, Direct, Formal, Humorous, Influential, Informal, Inspirational, Persuasive

Let's change our prompt to include a humorous and inspirational style, as shown in Table 6¹.

Table 6: An example of role prompting with a humorous tone and style 

Goal: Act as travel guide and provide 3 travel suggestions
Model: gemini-pro
Temperature: 1
Token Limit: 1024
Top-K: 40
Top-P: 0.8

Prompt: I want you to play the role of a travel guide. I will tell you where I am and you will suggest 3 nearby places worth visiting, in a humorous way. My suggestion: "I'm in Manhattan." Travel suggestions:

Output:
1. Behold the Empire State of Mind: Ascend the dizzying heights of the Empire State Building and bask in the glory of the Manhattan skyline. Prepare to feel like King Kong atop the Big Apple, only without the giant ape-sized bananas.
2. Get Artsy-Fartsy at MoMA: Unleash your inner art lover at the Museum of Modern Art (MoMA). Gaze upon masterpieces that will blow your mind and make you question whether your stick figures have any artistic value at all.
3. Shop 'Til You Drop on Fifth Avenue: Indulge in some retail therapy on iconic Fifth Avenue. Be prepared for price shock as you browse (or actually shop) at designer boutiques that will make your wallet weep. But hey, you're in Manhattan, so you might as well embrace the awesomeness!

  • This table shows that role prompting can define not only expertise but also tone and style ("humorous"), significantly changing the character of the output while still satisfying the core request¹.

Contextual prompting

By providing contextual prompts, you help ensure that your AI interactions are as seamless and efficient as possible. The model will understand your request more quickly and be able to generate more accurate and relevant responses, as the example in Table 7 shows¹. Contextual prompts ground the LLM's response in specific background information, preventing generic or irrelevant output and ensuring the response is tailored to the situation at hand¹.

Table 7: An example of contextual prompting 

Goal: Suggest articles for a blog about retro games
Model: gemini-pro
Temperature: 1
Token Limit: 1024
Top-K: 40
Top-P: 0.8

Prompt: Context: You are writing for a blog about retro arcade video games from the 80s. Suggest 3 topics you could write an article about, along with a few lines of description of what each article should contain.

Output:
1. The Evolution of Arcade Cabinet Design: This article will explore the evolution of arcade cabinet design, from the early wooden and metal cabinets of the 1970s to the sleek, neon designs of the 1980s and beyond.
2. Blast From The Past: Iconic Arcade Games of The 80's: This article will briefly explore the history, influence, and legacy of some of the most iconic and influential arcade games of the 1980s, focusing on their innovations, popular mechanics, and enduring appeal.
3. The Rise and Retro Revival of Pixel Art: This article will delve into the evolution of pixel art as the defining visual style of early arcade games, tracing its roots, exploring its techniques, and discussing the resurgence of pixel art in modern games and digital art.

  • This table clearly illustrates how providing explicit context ("a blog about retro 80s arcade video games") focuses the model's output on highly relevant suggestions, compared with a generic request for "blog topics"¹.

Step-back prompting

Step-back⁸ prompting is a technique for improving performance by prompting the LLM to first consider a more general question related to the specific task at hand, and then feeding the answer to that general question into a subsequent prompt for the specific task. This "step back" allows the LLM to activate relevant background knowledge and reasoning processes before attempting to solve the specific problem¹.

By considering broader, underlying principles, the LLM can generate more accurate and insightful responses. Step-back prompting encourages the LLM to think critically and apply its knowledge in new and creative ways. It changes the final prompt for the task by drawing on more of the knowledge in the LLM's parameters than would come into play if the LLM were prompted directly¹. By forcing the model to first think about abstract principles or general concepts related to the task, the step-back prompt helps it subsequently generate more grounded, creative, and relevant specific output. It is a form of guided brainstorming¹.

It can also help mitigate bias in LLM responses by focusing on general principles rather than specific details¹.

Let's look at some examples to see how step-back prompting can improve results. We first review a traditional prompt (Table 8) and then compare it with a step-back prompt (Tables 9 and 10)¹:

Table 8: A traditional prompt before we compare it with a step back prompt 

Goal: Write a storyline for a level of a first-person shooter video game.
Model: gemini-pro
Temperature: 1
Token Limit: 1024
Top-K: 40
Top-P: 0.8

Prompt: Write a challenging and engaging storyline for a new first-person shooter video game level (one paragraph).

Output: At the start of the level, the player's squad is ambushed by a ruthless enemy faction in a dense urban area. The player must fight their way through a maze of back alleys and dilapidated buildings, using stealth and precision shooting to eliminate enemy forces and avoid detection. The player must use the environment to their advantage, setting traps, using cover, and taking calculated actions to survive. Along the way, they discover intelligence that the enemy is planning a devastating attack, and the player must race against time to collect evidence and transmit it back to headquarters before the enemy's plan can be carried out.

  • This table serves as a baseline, showing the kind of output that might be produced when requesting creative content directly, especially at a high temperature¹.

When you set the temperature to 1, you may get all kinds of creative storyline writing, but it is also fairly random and generic. So let's step back and think first¹:

Table 9: First step of step-back prompting - generating general concepts 

Goal: Generate 5 fictional key settings that help build a challenging and engaging first-person shooter level storyline.
Model: gemini-pro
Temperature: 1
Token Limit: 1024
Top-K: 40
Top-P: 0.8

Prompt: Based on popular first-person shooter action games, what are 5 key fictional settings that help build challenging and engaging level storylines in first-person shooter video games?

Output:
1. Abandoned Military Base: A sprawling, post-apocalyptic military complex crawling with mutant soldiers and rogue robots, perfect for challenging gunfights.
2. Cyberpunk City: A neon-lit, futuristic urban environment with towering skyscrapers and dense alleyways, featuring cybernetically enhanced enemies and hacking mechanics.
3. Alien Spaceship: A massive alien spaceship stranded on Earth, featuring eerie corridors, zero-gravity areas, and alien creatures to encounter.
4. Zombie-Infested Town: A desolate town overrun by hordes of aggressive zombies, featuring intense close-quarters combat and puzzle-solving to find a safe passage.
5. Underwater Research Facility: A flooded deep-sea laboratory filled with mutated aquatic life, requiring stealth and underwater exploration skills to survive.

  • This table illustrates the "step back" itself: asking a more abstract, principle-based question to elicit underlying ideas or themes related to the specific task¹.

These themes look like a good fit for a first-person video game. Let's go back to the original prompt, but this time include the answer from the step-back question as context, and see what it returns¹.

Table 10: Final prompt using the step-back context 

Goal: Write a challenging and engaging level storyline for a first-person shooter game based on one of the 5 themes provided.
Model: gemini-pro
Temperature: 1
Token Limit: 1024
Top-K: 40
Top-P: 0.8

Prompt:

Background: 5 compelling first-person shooter video game themes:
1. Abandoned Military Base: ...
2. Cyberpunk City: ...
3. Alien Spaceship: ...
4. Zombie-Infested Town: ...
5. Underwater Research Facility: ...
Choose one of these themes and write a challenging and engaging storyline (one paragraph) for a new first-person shooter video game level.

Output: At the heart of the Dark Abyss lies a dilapidated underwater research facility that stands as a testament to humanity's ambitions and their disastrous consequences. Shrouded in darkness, pulsating with the hum of malfunctioning machinery, and haunted by the echoes of drowned souls, the facility is now a breeding ground for the nightmares of the deep. The player, an elite marine equipped with advanced diving gear and experimental weaponry, is sent on an adventure into this terrifying aquatic realm. With tenacity and resourcefulness, they must navigate treacherous corridors, solve mysterious puzzles, and confront terrifying sea monsters lurking in the depths. From coping with the immense pressure of the deep to outwitting cunning aquatic predators, every moment in this unknown underworld tests the limits of human endurance and courage.

  • This table demonstrates the effectiveness of the step-back approach. By first generating general concepts (Table 9) and then using them as context (Table 10), the final output is more specific, more on-theme, and arguably more engaging than the direct approach (Table 8)¹.

That looks like a fun video game! Using the step-back prompting technique can improve the accuracy of your prompts¹.
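The two-step flow of Tables 9 and 10 is also easy to script. The sketch below is illustrative: the small generate() wrapper, project ID, and model name are placeholder assumptions, not code from the white paper.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-flash")                    # placeholder model name

def generate(prompt, temperature=1.0):
    return model.generate_content(
        prompt, generation_config=GenerationConfig(temperature=temperature)
    ).text

# Step 1: step back and ask the general question first.
settings = generate(
    "Based on popular first-person shooter action games, what are 5 key fictional "
    "settings that help build challenging and engaging level storylines in "
    "first-person shooter video games?"
)

# Step 2: feed the general answer back in as context for the specific task.
storyline = generate(
    "Background: 5 compelling first-person shooter video game themes:\n"
    f"{settings}\n\n"
    "Choose one of these themes and write a challenging and engaging storyline "
    "(one paragraph) for a new first-person shooter video game level."
)
print(storyline)
```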

Chain of Thought (CoT)

Chain of Thought (CoT)⁹ prompting is a technique for improving the reasoning capabilities of LLMs by generating intermediate reasoning steps. This helps the LLM produce more accurate answers. It can be combined with few-shot prompting to get better results on more complex tasks that require reasoning before responding, since reasoning of this kind is a challenge for a purely zero-shot chain of thought¹.

CoT has many advantages. First, it is low-effort yet very effective, and it works with off-the-shelf LLMs (so no fine-tuning is needed). You also gain interpretability with CoT prompting, as you can learn from the LLM's responses and see the reasoning steps that were followed. If there is a failure, you will be able to identify it. Chain of thought also appears to improve robustness when moving between different LLM versions: the performance of a CoT prompt should drift less across LLMs than that of a prompt that does not use reasoning chains. There are, of course, disadvantages, but they are somewhat intuitive¹.

Because the LLM response includes the chain-of-thought reasoning, it contains more output tokens, which in turn means predictions cost more and take longer¹.

To explain the example in Table 11, let's first try to create a prompt that does not use CoT, to demonstrate a shortcoming of large language models¹.

Table 11: An example of a prompt which is trying to solve a mathematical problem 

Prompt: When I was 3, my partner was 3 times my age. Now, I am 20. How old is my partner?

Output: 63 years old

  • This table illustrates a common LLM failure mode: even a simple arithmetic/logical reasoning question can be answered incorrectly when asked directly, laying the groundwork for introducing CoT¹.

Oops. That is clearly the wrong answer. In fact, LLMs often struggle with mathematical tasks and may provide incorrect answers, even for a task as simple as multiplying two numbers. This is because they are trained on large volumes of text, and math may require a different approach. So let's see whether intermediate reasoning steps improve the output¹.

Table 12: An example of Chain of Thought prompting 

Prompt: When I was 3, my partner was 3 times my age. Now, I am 20. How old is my partner? Let's think about it step by step.

Output:
1. Let's assume that my current age is "x" years old.
2. When I was 3 years old, my age was 3.
3. My partner's age at that time was 3 times my age, which means my partner was 3 * 3 = 9 years old.
4. Now I am 20 years old, which means my age has increased by 20 - 3 = 17 years since I was 3.
5. Since my partner was 9 years old when I was 3, and my age has increased by 17 years, my partner's age must also have increased by 17 years.
6. Therefore, my partner's current age is 9 + 17 = 26 years old.
So, my partner is 26 years old.
  • This table clearly shows the effect of zero-shot CoT. Simply adding the trigger phrase "Let's think about it step by step" elicits a reasoning process that leads to the correct answer¹. CoT works by forcing the model to externalize its reasoning as a sequence of tokens. Generating the steps one by one appears to help the model stay on track and perform calculations or logical deductions more reliably than trying to compute the answer in a single "thought"¹.

OK, now the final answer is correct. This is because we explicitly instructed the LLM to explain each step instead of just returning an answer. Interestingly, the model adds up the additional 17 years. Personally, I would compute the age difference between me and my partner and add it: 20 + (9 - 3). Let's help the model think a bit more like me¹.

Table 12 is an example of "zero-shot" chain of thought. Chain of thought prompting can be very powerful when combined with a single shot or a few shots, as you can see in Table 13¹:

Table 13: An example of chain of thought prompting with a single-shot 

Prompt:

Q: When my brother was 2 years old, I was twice his age. Now I am 40. How old is my brother? Let's think about it step by step.
A: When my brother was 2, I was 2 * 2 = 4 years old. The age difference is 2 years, and I am older. Now I am 40, so my brother is 40 - 2 = 38. The answer is 38.

Q: When I was 3 years old, my partner was 3 times my age. Now I am 20. How old is my partner? Let's think about it step by step.
A:

Output: When I was 3, my partner was 3 * 3 = 9 years old. The age difference is 6 years, and my partner is older. Now I am 20, so my partner is 20 + 6 = 26 years old. The answer is 26.

  • This table demonstrates few-shot CoT. Providing an example of the expected reasoning process guides the model to follow a similar structure and logic for the new problem, potentially producing more reliable and more structured reasoning than zero-shot CoT alone¹.

Chain of thought can be used for a variety of use cases. Think of code generation, where a request is broken down into steps and those steps are mapped to specific lines of code. Or synthetic data creation, when you have some kind of seed such as "the product is called XYZ; write a description guiding the model through the assumptions you would make based on the product title." In general, any task that can be solved by "talking it through" is a good candidate for chain of thought. If you can explain the steps for solving a problem, try chain of thought¹.
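In code, zero-shot CoT amounts to appending the trigger phrase and then reading the answer off the end of the reasoning. The sketch below is illustrative; the wrapper, project ID, model name, and the regex-based answer extraction are assumptions, not code from the white paper.

```python
import re
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-flash")                    # placeholder model name

def generate(prompt, temperature=0.0):  # best practice: temperature 0 for CoT
    return model.generate_content(
        prompt, generation_config=GenerationConfig(temperature=temperature)
    ).text

question = ("When I was 3 years old, my partner was 3 times my age. "
            "Now, I am 20 years old. How old is my partner?")

reasoning = generate(question + " Let's think step by step.")  # the CoT trigger
numbers = re.findall(r"\d+", reasoning)
answer = numbers[-1] if numbers else None  # naive: take the last number mentioned
print(reasoning)
print("Extracted answer:", answer)
```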

Please refer to the notebook¹⁰ hosted in the GoogleCloudPlatform GitHub repository, which explores CoT prompting in more detail¹.

In the Best Practices section of this chapter, we will learn some best practices specific to chain of thought prompting¹.

Self-consistency

While large language models have shown impressive success on various NLP tasks, their reasoning ability is often seen as a limitation that cannot be overcome simply by increasing model size. As we learned in the previous chain of thought section, a model can be prompted to generate reasoning steps the way a human solves a problem. However, CoT uses a simple "greedy decoding" strategy, which limits its effectiveness. Self-consistency¹¹ combines sampling and majority voting to generate diverse reasoning paths and select the most consistent answer. It improves the accuracy and coherence of LLM-generated responses¹. Self-consistency exploits the idea that while a single reasoning path may be flawed, the correct answer is likely to be reached via multiple valid paths. By sampling diverse paths and looking for consensus, it filters out anomalous reasoning and increases confidence in the final answer. It trades computational cost for robustness¹.

Self-consistency gives a pseudo-probability that an answer is correct, but it obviously comes at a high cost¹.

It follows these steps (a minimal sketch in code appears after the list)¹:

  1. Generate diverse reasoning paths:  Provide the same prompt to the LLM multiple times. High temperature settings encourage the model to generate different reasoning paths and different perspectives on the problem.

  2. Extract the answer from each generated response.

  3. Select the most common answer.
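Here is a minimal sketch of those three steps; the wrapper, project ID, model name, and answer-extraction regex are illustrative assumptions, not code from the white paper.

```python
import re
from collections import Counter

import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-flash")                    # placeholder model name

def generate(prompt, temperature):
    return model.generate_content(
        prompt, generation_config=GenerationConfig(temperature=temperature)
    ).text

def self_consistent_classify(email_text, samples=5):
    prompt = (f"EMAIL:\n{email_text}\n\n"
              "Categorize the above email as IMPORTANT or NOT IMPORTANT. "
              "Let's think step by step and explain why.")
    votes = []
    for _ in range(samples):
        reasoning = generate(prompt, temperature=1.0)            # diverse reasoning paths
        match = re.search(r"NOT IMPORTANT|IMPORTANT", reasoning.upper())
        if match:
            votes.append(match.group())                          # extract the answer
    return Counter(votes).most_common(1)[0][0]                   # majority vote

# usage: self_consistent_classify(open("email.txt").read())
```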

Let's look at an example of an email classification system that classifies emails as IMPORTANT or NOT IMPORTANT. A zero-shot chain of thought prompt is sent to the LLM multiple times to see whether the responses differ between submissions. Note the friendly tone, word choice, and sarcasm used in the email; all of these can fool the LLM¹.

Table 14: An example of prompting for self consistency 

Prompt:

Email:
Hi, I see your site uses WordPress. A great open source content management system. I have used it in the past as well. It comes with many great user plugins. And it is fairly easy to set up. I did find a bug in the contact form that occurs when you select the name field. See the attached screenshot of me entering text into the name field. Notice the JavaScript alert box I invoked. But other than that, it is a great site. I enjoy reading it. Feel free to leave bugs on the site as it gives me more interesting things to read. Best wishes, Hacker Harry.

Categorize the above email as IMPORTANT or NOT IMPORTANT. Let's think through it step by step and explain why.

Output - Attempt 1:

Step 1: Identify the purpose of the message. The purpose of the message is to inform the recipient of a bug in their website's contact form. The sender also mentioned that they enjoyed reading the website and that they were leaving the bug in place for now.
Step 2: Assess the potential impact of the bug. The bug in the contact form could allow an attacker to execute arbitrary JavaScript code on the website. This could result in an attacker being able to take control of the website, steal sensitive information, or deface the website.
Step 3: Consider the sender's credibility. The sender of the message claims to be a hacker named Harry. However, the sender's identity cannot be verified. The sender could be a legitimate hacker trying to help the recipient, but could also be a malicious actor trying to trick the recipient into clicking a malicious link or downloading a malicious file.
Conclusion: IMPORTANT. Based on the potential impact of the bug and the sender's credibility, this message should be classified as IMPORTANT. The recipient should take immediate action to fix the bug and protect their website from attack.

Output - Attempt 2:

Explanation:
1. Lack of Urgency: The email does not convey any sense of urgency or time sensitivity. It starts with a casual greeting and presents the information in a factual tone.
2. Non-critical Bug Report: The bug reported in the contact form is described as interesting but non-critical. The email sender even suggests that the website owner can keep the bug.
3. Lack of Personal Impact: The email does not indicate any direct or personal impact on the website owner. It simply mentions that the sender enjoys reading the website.
4. Lack of Request for Action: The email does not explicitly request the website owner to take any action. It does not request to fix the bug or provide any specific suggestions for improvement.
5. Sender's Intent: The sender's intention seems to be more focused on sharing observations and expressing interest in the website rather than raising serious concerns or requesting help.
Conclusion: NOT IMPORTANT

Output - Attempt 3:

Step 1: Identify the purpose of the message. The purpose of the message is to inform the recipient of a bug in the contact form of their website. The sender also mentioned that they enjoyed reading the website and that they would leave the bug in place for now.
Step 2: Assess the potential impact of the bug. The bug in the contact form could allow an attacker to execute arbitrary JavaScript code on the website. This could result in an attacker being able to take control of the website, steal sensitive information, or deface the website.
Step 3: Consider the credibility of the sender. The sender of the message claims to be a hacker named Harry. However, the identity of the sender cannot be verified. The sender could be a legitimate hacker trying to help the recipient, but could also be a malicious actor trying to exploit the bug for personal gain.
Conclusion: IMPORTANT. The bug in the contact form is a serious security risk, and the credibility of the sender is unknown. The recipient should take immediate action to fix the bug and investigate the sender's claims.

  • This table perfectly demonstrates the self-consistency mechanism. It shows that even with CoT, a single run (especially at a high temperature that encourages diversity) can lead to an incorrect or less robust conclusion (as in Attempt 2). Generating multiple reasoning paths and taking a majority vote (two of the three attempts conclude IMPORTANT) yields a more robust final answer¹.

You can use the prompt above and try to see whether it returns a consistent classification. Depending on the model and the temperature configuration you use, it may return "IMPORTANT" or "NOT IMPORTANT"¹.

By generating many chains of thought and taking the most common answer ("IMPORTANT"), we can get a more consistently correct answer from the LLM¹.

This example shows how self-consistency prompting can be used to improve the accuracy of an LLM's responses by considering multiple perspectives and selecting the most consistent answer¹.

Tree of Thoughts (ToT)

Now that we are familiar with chain of thought and self-consistency prompting, let's look at the Tree of Thoughts (ToT)¹². It generalizes the concept of CoT prompting by allowing the LLM to explore multiple different reasoning paths simultaneously, rather than just following a single linear chain of thought. This is depicted in Figure 1. ToT moves away from linear or independent reasoning paths toward a more structured exploration strategy. It allows the model to consider alternatives at each step, potentially backtrack, and evaluate different branches, mimicking a more deliberate human approach to problem solving¹.

Figure 1: A visualization of chain of thought prompting (left) versus Tree of Thoughts prompting (right)

This approach makes ToT particularly well suited for complex tasks that require exploration. It works by maintaining a tree of thoughts, where each thought represents a coherent language sequence that serves as an intermediate step toward solving a problem. The model can then explore different reasoning paths by branching out from different nodes in the tree¹.

There is a great notebook that goes into more detail on the Tree of Thoughts (ToT), based on the paper Large Language Model Guided Tree-of-Thought⁹.
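To give a flavor of the idea, here is a very small breadth-first sketch: at each step the model proposes several candidate thoughts, the candidates are scored, and only the best branches are kept. This is an illustrative toy, not the algorithm from the referenced paper; the wrapper, project ID, model name, prompts, and beam sizes are all assumptions.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-flash")                    # placeholder model name

def generate(prompt, temperature=1.0):
    return model.generate_content(
        prompt, generation_config=GenerationConfig(temperature=temperature)
    ).text

def tree_of_thoughts(problem, depth=3, branches=3, beam=2):
    paths = [""]  # each path is the text of the thoughts explored so far
    for _ in range(depth):
        candidates = []
        for path in paths:
            for _ in range(branches):  # branch: propose alternative next thoughts
                thought = generate(
                    f"Problem: {problem}\nThoughts so far:\n{path}\n"
                    "Propose the next short reasoning step:"
                )
                candidates.append(path + "\n- " + thought.strip())
        scored = []
        for candidate in candidates:  # evaluate each partial reasoning path
            reply = generate(
                f"Problem: {problem}\nPartial reasoning:\n{candidate}\n"
                "Rate how promising this reasoning is from 1 to 10. "
                "Reply with the number only.",
                temperature=0.0,
            )
            try:
                score = float(reply.strip().split()[0])
            except (ValueError, IndexError):
                score = 0.0
            scored.append((score, candidate))
        # keep only the most promising branches (a simple beam search over thoughts)
        paths = [c for _, c in sorted(scored, key=lambda s: s[0], reverse=True)[:beam]]
    return paths[0]
```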

ReAct (reason & act)

Reason and act (ReAct)¹³ prompting is a paradigm that enables LLMs to solve complex tasks by combining natural-language reasoning with external tools (search, a code interpreter, and so on), allowing the LLM to perform actions such as calling external APIs to retrieve information, which is a first step toward modeling agents¹. ReAct fundamentally expands what an LLM can do by letting it interact with the outside world (or specific tools) to gather information or take actions that text generation alone cannot. This grounds the reasoning process in real-time, external data¹.

ReAct mimics how humans operate in the real world: we reason verbally and can take actions to acquire information. ReAct performs well against other prompt engineering approaches across a variety of domains¹.

ReAct prompting works by combining reasoning and acting into a thought-action loop. The LLM first reasons about the problem and generates a plan of action. It then performs the actions in the plan and observes the results. Next, the LLM uses the observations to update its reasoning and generate a new plan of action. This continues until the LLM reaches a solution to the problem¹.

To see this in action, you need to write some code. Code Snippet 1 uses the langchain framework for Python, together with the VertexAI (google-cloud-aiplatform) and google-search-results pip packages¹.

To run this example, you must create a (free) SerpAPI key at https://serpapi.com/manage-api-key and set the SERPAPI_API_KEY environment variable¹.

Next, let's write some Python code that asks the LLM to find out how many children the band members of Metallica have¹.


Python

# Code Snippet 1: Create a ReAct agent using LangChain and Vertex AI
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
from langchain.llms import VertexAI

# Prompt: how many children do the band members of Metallica have?
prompt = "How many kids do the band members of Metallica have?"

# Initialize the LLM (Vertex AI) with a low temperature for a more factual response
llm = VertexAI(temperature=0.1)

# Load the tools (here, SerpAPI for web search)
tools = load_tools(["serpapi"], llm=llm)

# Initialize the ReAct agent
# AgentType.ZERO_SHOT_REACT_DESCRIPTION uses a zero-shot approach and decides
# which tool to use based on the tool descriptions
agent = initialize_agent(tools, llm,
                         agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True)

# Run the agent
agent.run(prompt)
Code Snippet 1: Creating a ReAct Agent with LangChain and VertexAI 

Code Snippet 2 shows the result. Notice that ReAct performs a chain of five searches. In effect, the LLM is scraping Google Search results to find the names of the band members. It then lists each result as an observation and chains the thought to the next search¹.

Code snippet 2: Output of executing ReAct Agent in the terminal

Entering new AgentExecutor chain...

Thought: Metallica has 4 members.
Action: Search
Action Input: How many children does James Hetfield have?
Observation: three children
Thought: 1/4 Metallica band members have 3 children
Action: Search
Action Input: How many children does Lars Ulrich have?
Observation: 3
Thought: 2/4 Metallica band members have 6 children
Action: Search
Action Input: How many children does Kirk Hammett have?
Observation: Hammett has been married to his second wife, Lani, since 1998. They have two sons, Angel (born September 29, 2006) and Vincenzo (born June 28, 2008).
Thought: 3/4 Metallica band members have 8 children
Action: Search
Action Input: How many children does Robert Trujillo have?
Observation: 2
Thought: 4/4 Metallica band members have 10 children
Final Answer: 10

Finished AgentExecutor chain.

Code Snippet 2: Output of executing the ReAct agent in the terminal 1

The agent in Code Snippet 2 finds out that the band Metallica has four members. It then searches for each band member to retrieve the number of their children and adds up the totals. Finally, it returns the total number of children as the final answer1.

Applying ReAct prompting in practice requires understanding that you have to continually resend the previous prompts/responses (and prune the additionally generated content), and that the model must be set up with appropriate examples/instructions. Please refer to the notebook¹⁴ hosted in the GoogleCloudPlatform GitHub repository, which shows the actual LLM inputs and outputs in more detail, with more elaborate examples1.

Automatic Prompt Engineering (APE)

At this point you may be realizing that writing prompts can be complex. Wouldn’t it be great if we could automate this process (write a prompt to write prompts)? Well, there is a way: Automatic Prompt Engineering (APE). This approach¹⁵ not only alleviates the need for human input, but also improves the performance of models on a variety of tasks1  . APE represents a meta-level application of LLMs - using them to optimize inputs (prompts) to themselves or other systems. This suggests that in the future AI may provide important assistance in the process of interacting with and controlling  AI1 .

You will prompt a model to generate more prompts, evaluate them, possibly modify the good ones, and then repeat1.

For example, we can use automatic prompt engineering to help train a chatbot for an online store selling T-shirts. We want to find all the different ways a customer might phrase an order for a band T-shirt1.

  1. Write the prompt that will generate the output variants.  In this example, 10 instructions are generated using gemini-pro. See Table 15 1:

Table 15: Use an LLM to generate prompts 1 

Prompt:
We have an online store for band merchandise T-shirts, and to train the chatbot, we need multiple ways to order: "One Metallica T-shirt, size S." Generate 10 variations with the same semantics but keeping the original meaning.

Output:
1. I would like to purchase a Metallica T-shirt, size small.
2. Can I order a Metallica T-shirt, size small?
3. I would like to purchase a Metallica T-shirt, size small.
4. One Metallica shirt, size small, thank you.
5. I would like to purchase a Metallica T-shirt, size small.
6. I would like to place an order for a Metallica T-shirt, size small.
7. One Metallica T-shirt, size small, thank you.
8. I am looking for a Metallica T-shirt, size small.
9. Can you get me a Metallica T-shirt, size small?
10. One Metallica T-shirt, size small, thank you.

  • This table demonstrates the core idea of APE: using the LLM itself to generate diverse wordings or instructions for a given task, automating part of the creative work of prompt design or data augmentation1.

  2. Evaluate all instruction candidates.  Score the candidates against a chosen metric, for example BLEU (Bilingual Evaluation Understudy) or ROUGE (Recall-Oriented Understudy for Gisting Evaluation)1.

  3. Select the instruction candidate with the highest evaluation score.  This candidate will be the final prompt that you can use in your software application or chatbot. You can also adjust the selected prompt and evaluate it again1. A minimal sketch of this generate-evaluate-select loop follows.
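As an illustration only, here is a minimal Python sketch of that loop. The generate helper returns canned candidates so the sketch runs end to end (replace it with a real LLM call), and the toy overlap metric stands in for a proper metric such as BLEU or ROUGE.

# Hypothetical APE loop: generate prompt candidates, score them, keep the best.
def generate(meta_prompt: str) -> str:
    # Placeholder: replace with a real LLM call (e.g., via the Vertex AI SDK).
    # Canned candidates are returned here so the sketch runs end to end.
    return (
        "I would like to purchase a Metallica T-shirt, size small.\n"
        "Can I order a Metallica T-shirt, size small?\n"
        "One Metallica shirt, size small, thank you."
    )

def score(candidate: str, reference: str) -> float:
    # Toy lexical-overlap metric standing in for BLEU/ROUGE.
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / max(len(ref), 1)

reference = "One Metallica T-shirt, size S."
meta_prompt = (
    "We have an online store for band merchandise T-shirts. Generate 10 variants of "
    "the order 'One Metallica T-shirt, size S.' with the same semantics, one per line."
)

candidates = [line.strip() for line in generate(meta_prompt).splitlines() if line.strip()]
best_prompt = max(candidates, key=lambda c: score(c, reference))  # highest-scoring candidate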

Code prompting

Gemini primarily works with text-based prompts, which also includes writing prompts that return code. Let's go into Vertex AI Studio and test these prompts with some coding examples1.

Prompts for writing code

Gemini can also play the role of a developer, helping you write code in any programming language of your choice. As a developer, this can help you speed up the process of writing  code1 .

Imagine you have a folder on your machine with hundreds of files that need to be renamed. Renaming each file one by one would take a lot of time. If you know a little Bash, you could write a script to automate this process, but that might also take some time. So let's write a prompt. You can write prompts in Gemini, the public consumer chatbot, or if you're more concerned about confidentiality, you can write these prompts in your Google Cloud account and open Vertex AI Studio. The advantage of Vertex AI Studio is that you can configure parameters such as  temperature1 .

Table 16: Using Gemini to generate code 1

Name: 1_bash_rename_files_in_folder
Goal: Write a prompt that generates Bash code to rename files in a folder.
Model: gemini-pro
Temperature: 0.1
Token Limit: 1024
Top-K: N/A
Top-P: 1

Prompt:
Write a Bash code snippet that asks for a folder name. It then gets the contents of that folder and renames all the files in it by prefixing the name with "draft".

Output:
#!/bin/bash

# Ask for a folder name
echo "Enter folder name: "
read folder_name

# Check if the folder exists
if [ ! -d "$folder_name" ]; then
  echo "Folder does not exist."
  exit 1
fi

# Get a list of files in the folder
files=( "$folder_name"/* )

# Rename each file by prepending "draft_" to the file name
for file in "${files[@]}"; do
  # Extract the base file name from the full path
  base_name=$(basename "$file")
  # Construct the new file path
  new_file_path="$folder_name/draft_$base_name"
  # Rename the file
  mv "$file" "$new_file_path"
done

echo "File renamed successfully."

  • This table provides a clear example of using an LLM to generate code from a natural-language description, demonstrating the potential for automating simple scripting tasks1.

This looks like nice code; it even has documentation! However, since LLMs cannot truly reason and will repeat training data, it is important to read and test your code first1.

The moment we have all been waiting for: does it really work? 1

Let's try it out first with a test folder containing just a few files, which need to be renamed from filename.txt to draft_filename.txt 1.

  1. Copy the code from the Output row of Table 16 and paste it into a new file called "rename_files.sh".

  2. Open a terminal window and type: . rename_files.sh. It will ask you to enter a folder name, such as test, and press enter.

  3. The script appears to have run fine. You will see the message: File renamed successfully.

When you look in the test folder, you'll see that all the files have been perfectly renamed to draft_filename.txt  1 .

It worked! 1

Prompts for explaining code

As a developer, when working in a team, you have to read other people’s code. Gemini can help with this as well. Let’s take the code output from Table 16, remove the comments, and ask a large language model to explain what is happening, see Table 17  1 :

Table 17: A prompt for explaining code 1 

Name: 1_bash_explain_the_code
Goal: Write a prompt to explain Bash code.
Model: gemini-pro
Temperature: 0.1
Token Limit: 1024
Top-K: N/A
Top-P: 1

Prompt:
Please explain the following Bash code to me:

#!/bin/bash
echo "Enter folder name: "
read folder_name
if [ ! -d "$folder_name" ]; then
  echo "Folder does not exist."
  exit 1
fi
files=( "$folder_name"/* )
for file in "${files[@]}"; do
  base_name=$(basename "$file")
  new_file_path="$folder_name/draft_$base_name"
  mv "$file" "$new_file_path"
done
echo "File renamed successfully."

Output:
This Bash script performs the following tasks:
1. User Input:
   - It prompts the user to enter a folder name using the echo and read commands. The entered folder name is stored in the folder_name variable.
2. Folder Existence Check:
   - It checks whether a folder with the specified name exists using the [ ! -d "$folder_name" ] condition. If the folder does not exist, it displays an error message ("Folder does not exist.") and exits the script with exit code 1.
3. File Listing:
   - It lists all the files inside the specified folder using the files=( "$folder_name"/* ) command, which creates an array named files containing the paths of all the files in the folder.
4. File Renaming:
   - It iterates through each file in the files array using a for loop.
   - For each file, it extracts the base file name using base_name=$(basename "$file") and then builds a new file path using new_file_path="$folder_name/draft_$base_name". The new file name is created by prepending the string "draft_" to the original base file name.
   - It then renames the file using the mv "$file" "$new_file_path" command, which moves the file to the new file path.
5. Success Message:
   - After renaming all the files, it displays a success message ("File renamed successfully.") to indicate that the renaming operation was successful.

  • This table demonstrates the usefulness of an LLM for code comprehension. The model accurately breaks down the script and explains the purpose of each command block in natural language1.

Prompts for translating code

The Bash code in Table 16 seems to work fine. However, this script would be truly reusable if it could prompt me for file names and, ideally, work as a standalone application with a UI. As a starting point, Python is a better language than Bash for (web) applications. LLMs can help translate code from one language to another. See the example in Table 18 1:

Table 18: A prompt for translating code from Bash to Python 1 

Name: 1_python_rename_files
Goal: Write a prompt to translate Bash code to Python.
Model: gemini-pro
Temperature: 0.1
Token Limit: 1024
Top-K: N/A
Top-P: 1

Prompt:
Translate the following Bash code into a Python snippet.

#!/bin/bash
echo "Enter folder name: "
read folder_name
if [ ! -d "$folder_name" ]; then
  echo "Folder does not exist."
  exit 1
fi
files=( "$folder_name"/* )
for file in "${files[@]}"; do
  base_name=$(basename "$file")
  new_file_path="$folder_name/draft_$base_name"
  mv "$file" "$new_file_path"
done
echo "File renamed successfully."

Output:
import os
import shutil

# Get a folder name from the user
folder_name = input("Enter folder name: ")

# Check if the folder exists
if not os.path.isdir(folder_name):
    print("Folder does not exist.")
    exit(1)

# Get a list of all files in the folder
files = os.listdir(folder_name)

# Iterate over the files
for file in files:
    # Create a new file name with the prefix "draft_"
    new_file_name = f"draft_{file}"
    # Construct the full paths of the original and new files
    old_path = os.path.join(folder_name, file)
    new_path = os.path.join(folder_name, new_file_name)
    # Move the file to its new name
    shutil.move(old_path, new_path)

# Print a success message
print("File renamed successfully.")

  • This table demonstrates the LLM's code-translation capability, mapping functionality from the idioms of one language (Bash commands) to another (e.g., Python's os and shutil libraries) 1.

Read and review the code. Copy the output from the prompt and paste it into a new file: file_renamer.py. Test the code by opening a terminal window and executing the following command: python file_renamer.py  1 .

Note:  When prompting for (Python) code in the Language Studio of Vertex AI, you must click the "Markdown" button. Otherwise you will receive plain text that lacks proper indentation, which matters for running Python code1.

Prompts for debugging and reviewing code

Let's manually edit the code from Table 18. It should prompt the user for a file name prefix and write the prefix in uppercase characters. See the sample code in Code Snippet 3, but oops: it now returns a Python error!


Python



# Code Snippet 3: A broken Python script
import os
import shutil

folder_name = input("Enter folder name: ")
prefix = input("Enter prefix string to be prepended to file name: ")

# Error: use of the undefined function toUpperCase
text = toUpperCase(prefix)

if not os.path.isdir(folder_name):
    print("Folder does not exist.")
    exit(1)

files = os.listdir(folder_name)

for file in files:
    # Error: inconsistent variable names (new_filename vs. new_file_name)
    new_filename = f"{text}_{file}"
    old_path = os.path.join(folder_name, file)
    # new_file_name is never defined in the loop
    new_path = os.path.join(folder_name, new_file_name)
    shutil.move(old_path, new_path)

print("File renamed successfully.")

Code Snippet 3: A broken Python script 

Oops! Looks like a bug  1 :


The following Python code gives an error:

Traceback (most recent call last):

File "/Users/leeboonstra/Documents/test_folder/rename_files.py", line 7, in <module>

text = toUpperCase(prefix)

NameError: name 'toUpperCase' is not defined

Code Snippet 4: I broke the Python code

Let’s see if we can ask a large language model to debug and review code. See Table 19  1 :

Table 19: A prompt for debugging and reviewing Python code 1 

Name: 1_python_debug_code
Goal: Write a prompt to debug and review Python code.
Model: gemini-pro
Temperature: 0.1
Token Limit: 1024
Top-K: N/A
Top-P: 1

Prompt:
The following Python code gives an error:

Traceback (most recent call last):
  File "/Users/leeboonstra/Documents/test_folder/rename_files.py", line 7, in <module>
    text = toUpperCase(prefix)
NameError: name 'toUpperCase' is not defined

Debug where the error is and explain how to improve the code.

import os
import shutil

folder_name = input("Enter folder name: ")
prefix = input("Enter the prefix string to be added to the file name: ")
text = toUpperCase(prefix)

if not os.path.isdir(folder_name):
    print("Folder does not exist.")
    exit(1)

files = os.listdir(folder_name)

for file in files:
    new_filename = f"{text}_{file}"
    old_path = os.path.join(folder_name, file)
    new_path = os.path.join(folder_name, new_file_name)
    shutil.move(old_path, new_path)

print("File renamed successfully.")

Output:
The code calls the toUpperCase function to convert prefix to uppercase, but that function is not defined. To fix this problem, you can use the upper() method of the string class, which converts a given string to uppercase. The modified code is as follows:

import os
import shutil

folder_name = input("Enter folder name: ")
prefix = input("Enter the prefix string to be prepended to the file name: ")
text = prefix.upper()  # Use the .upper() method

if not os.path.isdir(folder_name):
    print("Folder does not exist.")
    exit(1)

files = os.listdir(folder_name)

for file in files:
    new_filename = f"{text}_{file}"  # Consistent variable name
    old_path = os.path.join(folder_name, file)
    new_path = os.path.join(folder_name, new_filename)
    shutil.move(old_path, new_path)

print("File renamed successfully.")

In addition, there are some other areas where the code can be improved:

1. The file extension of the new file name is not preserved. It is better to extract the extension from the original file name and append it to the new file name.
2. The code assumes that there are no spaces in the folder name. However, it is good practice to handle spaces gracefully, for example by replacing them with underscores or hyphens.
3. Using f-string syntax for string interpolation is more readable and concise than the traditional + operator.
4. The code does not handle errors that may occur during the renaming process. It is better to wrap the shutil.move call in a try...except block to catch any potential errors.

Here is the improved code with these suggestions:

import os
import shutil

folder_name = input("Enter folder name: ")
prefix = input("Enter prefix string to be prepended to file name: ").upper()  # Convert to uppercase directly

if not os.path.isdir(folder_name):
    print("Folder does not exist.")
    exit(1)

files = os.listdir(folder_name)
print(f"Renaming files in {folder_name}...")

for file in files:
    old_path = os.path.join(folder_name, file)
    # Process only files, skipping subdirectories if necessary
    if os.path.isfile(old_path):
        # Separate the file name and extension
        name, ext = os.path.splitext(file)
        # Create the new file name (keeping the extension)
        new_filename = f"{prefix}_{name}{ext}"
        new_path = os.path.join(folder_name, new_filename)
        # Try to move the file
        try:
            shutil.move(old_path, new_path)
            print(f"  '{file}' -> '{new_filename}'")
        except Exception as e:
            print(f"Error moving file {file}: {e}")

print("File renaming process completed.")

These improvements make the code more robust and flexible, while also making it easier to read and understand... (The response was truncated because it reached the token limit. If you need a longer response, try increasing the token limit.)

  • This table is a strong demonstration that an LLM can go beyond fixing a specific bug: it can also perform a proactive code review and make best-practice recommendations. Providing the error message/traceback along with the code gives the model critical context for locating the problem, making debugging prompts far more effective than just providing the code and saying "fix it"1.

That was great. Not only did it tell me how to fix the problem, it also found more things that were wrong with my code and explained how to fix them. The last part of its response gives overall suggestions for improving the code1.

What about multimodal prompting?

Writing prompts for code still uses the same regular large language models. Multimodal prompting is a separate concern: it refers to techniques that use multiple input formats (rather than relying solely on text) to guide a large language model. These can include text, images, audio, code, or even a combination of other formats, depending on the capabilities of the model and the task at hand1. This section acknowledges the existence of multimodal prompting but explicitly excludes it from detailed discussion, keeping the whitepaper focused on text-based (including code) prompt engineering1.

Best Practices

Finding the right prompts can take a lot of trial and error. The Language Studio in Vertex AI is the perfect place to try out your prompts and test them against a variety of  models1 .

Become an expert in prompt engineering using the following best practices1  .

Provide examples

The most important best practice is to provide (one-shot/few-shot) examples in the prompt. This is very effective because it acts as a powerful teaching tool. These examples demonstrate the desired output or similar responses, allowing the model to learn from them and adjust its own generation accordingly. This is like giving the model a reference point or goal to improve the accuracy, style, and tone of its responses so that they better match your expectations1  . This once again emphasizes the effectiveness of few-shot learning seen earlier. Explicitly calling it the “most important” practice highlights its huge impact across a variety of  tasks1 .

Design with simplicity

Prompts should be concise, clear, and easy to understand (both for you and the model). As a rule of thumb, if it’s confusing to you, it’s likely to be just as confusing to the model. Try not to use complex language or provide unnecessary information1  . Sometimes less is more. While context is good, unnecessary jargon or overly complex sentence structure can confuse the model. Direct, clear language is  preferred1 .

Example: 1 

BEFORE:

I am currently visiting New York and I would like to learn more about good places to stay. I am with my two 3 year olds. Where should we go on our vacation?

AFTER THE REWRITE:

Play the role of a tourist's travel guide. Describe good places to visit with a 3-year-old in Manhattan, New York.

Try using verbs that describe actions. Here are some examples1:

Act, Analyze, Categorize, Classify, Contrast, Compare, Create, Describe, Define, Evaluate, Extract, Find, Generate, Identify, List, Measure, Organize, Parse, Pick, Predict, Provide, Rank, Recommend, Return, Retrieve, Rewrite, Select, Show, Sort, Summarize, Translate, Write.

Be specific about the output

Be specific about the desired output. A concise instruction may not be enough to guide the LLM, or may be too general. Providing specific details in the prompt (through system or contextual prompts) helps the model focus on the relevant content and improves overall accuracy1. Don't assume the model knows what you want: explicitly state constraints such as length ("3 paragraphs"), content focus ("top 5 video game consoles"), and style ("conversational") 1.

Example: 1 

Recommended (DO):

Generate a 3-paragraph blog post about the top 5 video game consoles.

The blog post should be informative and engaging and should be written in a conversational style.

DO NOT:

Generate a blog post about video game consoles.

Use Instructions over Constraints

Instructions and constraints are used in prompts to guide the output of the  LLM1 .

  • An instruction  provides explicit instructions about the expected format, style, or content of a response. It tells the model what to do or  produce1 .

  • A constraint  is a set of limits or bounds on a response. It restricts what the model should not do or  avoid1 .

A growing body of research suggests that focusing on positive instructions in prompts may be more effective than relying heavily on constraints. This approach aligns with the human preference for positive instructions over lists of “don’ts”1. Requests framed in a positive way (what to include) are often clearer and less likely to be misinterpreted or to conflict than negative constraints (what to exclude)1.

Instructions directly communicate the desired outcome, whereas constraints may leave the model guessing about what is allowed. It provides flexibility and encourages creativity within defined boundaries, whereas constraints may limit the model’s potential. Furthermore, a bunch of constraints may conflict with each  other1 .

Constraints can still be valuable in certain situations, for example to prevent a model from generating harmful or biased content, or when a strict output format or style is  required1 .

When possible, use positive instructions: instead of telling the model what not to do, tell it what to do. This avoids confusion and improves the accuracy of the  output1 .

Example: 1 

Recommended (DO):

Generate a 1-paragraph blog post about the top 5 video game consoles.

Only discuss the console itself, the company that made it, the year, and total sales.

DO NOT:

Generate a 1-paragraph blog post about the top 5 video game consoles.

Do not list video game titles.

As a best practice, prioritize instructions first, clearly stating what you want the model to do, and use constraints only when necessary for safety, clarity, or specific requirements. Experiment and iterate, testing different combinations of instructions and constraints to find what works best for your specific task, and document these  attempts1 .

Control the max token length

To control the length of the generated LLM response, you can set a maximum token limit in the configuration, or explicitly request a specific length in the prompt. For example1:

"Quantum physics explained in a tweet-length message."

Length can be managed in both technical (configuration) and semantic (prompt) ways. Prompt-based controls allow for more granular length specifications (e.g., “use 3 sentences”, “tweet length”) 1 .
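As a rough illustration of those two levers (the values are made up, and the exact configuration field names depend on the API you use):

# Two complementary ways to bound response length (illustrative values).
generation_config = {
    "temperature": 0.1,
    "max_output_tokens": 60,  # technical limit set in the model configuration
}

# Semantic limit expressed directly in the prompt
prompt = "Explain quantum physics in a tweet-length message."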

Use variables in prompts

To reuse prompts and make them more dynamic, use variables in prompts that can change for different inputs. For example, consider a prompt that provides facts about a city. Instead of hard-coding the city name in the prompt, use a variable. Variables can save time and effort by allowing you to avoid repeating yourself. If you need to use the same information in multiple prompts, you can store it in a variable and then reference the variable in each prompt. This makes sense when you integrate the prompt into your own  application1 .

Table 20: Using variables in prompts 1 

VARIABLES:
{city} = "Amsterdam"

PROMPT:
You are a travel guide. Tell me one fact about the city {city}.

Output:
Amsterdam is a beautiful city filled with canals, bridges and narrow streets. It is a great tourist destination for its rich history, culture and nightlife.

  • This table clearly demonstrates a simple templating mechanism for making prompts reusable and adaptable to different inputs without rewriting the entire prompt structure, which is essential for programmatic use1. A minimal templating sketch follows.
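As a minimal sketch (the template text mirrors Table 20; the function name is illustrative), the same idea in Python looks like this:

# Reusable prompt template with a {city} variable (mirrors Table 20).
PROMPT_TEMPLATE = "You are a travel guide. Tell me one fact about the city {city}."

def build_prompt(city: str) -> str:
    # Fill in the variable; the result is what gets sent to the model.
    return PROMPT_TEMPLATE.format(city=city)

print(build_prompt("Amsterdam"))
# You are a travel guide. Tell me one fact about the city Amsterdam.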

Experiment with input formats and writing styles

Different models, model configurations, prompt formats, wording choices, and even repeated submissions can produce different results. Therefore, it is important to experiment with prompt attributes such as style, wording, and prompt type (zero-shot, few-shot, system prompt) 1. LLMs can be surprisingly sensitive to the precise wording and format of the input; small changes can lead to different outputs, which again emphasizes the need for experimentation1.

For example, a goal is to generate textual prompts about the revolutionary video game console Sega Dreamcast, which can be formulated as questions, statements, or instructions, resulting in different  outputs1 :

  • Question:  What is the Sega Dreamcast and why is it such a revolutionary gaming console?

  • Statement:  The Sega Dreamcast is a sixth-generation video game console released by Sega in 1999. It...

  • Instructions:  Write a paragraph describing the Sega Dreamcast gaming console and explaining why it is revolutionary.

For few-shot prompting with classification tasks, mix up the classes

In general, the order of few-shot examples should not matter much. However, when doing classification tasks, make sure to mix the possible response categories in your few-shot examples. This is done because you might otherwise overfit to the specific order of the examples. By mixing the possible response categories, you ensure that the model is learning key features to identify each category, rather than just memorizing the order of the examples. This will lead to more robust and better generalization on unseen data1  . In few-shot classification, the order of examples can inadvertently become a signal. Mixing the categories forces the model to rely on the actual content of the examples to distinguish between the categories, rather than just their position in the  sequence1 .

A good rule of thumb is to start with 6 few-shot examples and test accuracy from there1. A short sketch of a shuffled few-shot classification prompt follows.
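As an illustration only (the reviews and labels below are made-up examples, not taken from the whitepaper), this is what mixing the classes in a few-shot classification prompt can look like:

import random

# Made-up reviews and labels; the point is that POSITIVE and NEGATIVE examples
# are shuffled so their order carries no signal the model could memorize.
examples = [
    ("I loved this movie, it was fantastic!", "POSITIVE"),
    ("The plot was dull and the acting was worse.", "NEGATIVE"),
    ("An absolute masterpiece from start to finish.", "POSITIVE"),
    ("I want those two hours of my life back.", "NEGATIVE"),
    ("Great soundtrack, great cast, great fun.", "POSITIVE"),
    ("Predictable, slow, and far too long.", "NEGATIVE"),
]
random.shuffle(examples)  # mix the classes

prompt = "Classify the sentiment of the review as POSITIVE or NEGATIVE.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += "Review: The visuals were stunning but the story fell flat.\nSentiment:"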

Adapt to model updates

It’s important to stay current with changes to model architectures, new data, and capabilities. Try newer model versions, and adjust your prompts to take better advantage of new model features. Tools like Vertex AI Studio are great for storing, testing, and documenting the various versions of your prompts1  . Prompt engineering is not a one-time task. A prompt that works well today may need to be adjusted as the underlying model evolves. Continuous testing and adaptation is  necessary1 .

Experiment with output formats

In addition to the input format of the prompt, consider experimenting with the output format. For non-creative tasks such as extracting, selecting, parsing, sorting, ranking, or classifying data, try having the output returned in a structured format such as JSON or XML1. Explicitly requiring structured output such as JSON acts as a strong constraint, improving reliability and reducing meaningless or unstructured responses, especially for data-oriented tasks1.

Returning a JSON object from a prompt that extracts data has several benefits. In practice, I don't need to manually create this JSON format, the data can already be returned in sorted order (very handy when dealing with datetime objects), and, most importantly, requiring the JSON format through the prompt forces the model to create structure and limits hallucinations1.

Table 4 in the few-shot prompting section shows an example of how to return structured output 1. The sketch below shows how such output can be requested and parsed.
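As a hedged sketch (the schema and the parsing step are illustrative, and call_llm is a hypothetical stand-in for your real model call), requesting and validating JSON output might look like this:

import json

# Hypothetical stand-in for a real model call (e.g., via the Vertex AI SDK);
# it returns a canned JSON string so the sketch runs end to end.
def call_llm(prompt: str) -> str:
    return '[{"name": "Sega Dreamcast", "manufacturer": "Sega", "year": 1999}]'

prompt = (
    "List one revolutionary video game console as a JSON array of objects "
    "with the keys: name, manufacturer, year. Return only valid JSON."
)

raw = call_llm(prompt)
consoles = json.loads(raw)          # fails loudly if the model ignored the requested format
print(consoles[0]["manufacturer"])  # Sega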

Experiment together with other prompt engineers

If you are in a situation where you have to try to come up with a good prompt, you might want to get more than one person to try. When everyone follows best practices (such as those listed in this chapter), you will see performance differences between all the different prompt attempts1  . Different people approach problems in different ways. Collaborating on prompt design can bring in diverse perspectives and wordings, potentially leading to a better solution faster than if a single person worked independently1  .

CoT Best practices

For CoT prompts, the answer must be placed after the reasoning, because generating the reasoning changes the tokens the model receives when it predicts the final answer1.

For CoT and self-consistency, you need to be able to extract the final answer from the model's response and separate it from the reasoning1.

For CoT prompts, set the temperature to 0 1. Chain-of-thought prompting is based on greedy decoding, predicting the next word in the sequence based on the highest probability assigned by the language model. In general, when reasoning is used to arrive at a final answer, there is likely to be a single correct answer, so the temperature should always be set to 0 1. Effective use of CoT therefore requires attention to implementation details such as answer placement, extractability, and appropriate configuration (low/zero temperature) 1. A small sketch of answer extraction follows.
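As an illustration only (the response text is made up, and the "The answer is" convention is an assumption you would enforce in your own prompt), extracting the final answer and separating it from the reasoning might look like this:

import re

# Made-up CoT response; the prompt is assumed to instruct the model to end with
# "The answer is <number>." so the answer can be split off from the reasoning.
cot_response = (
    "The band has 4 members with 3, 3, 2 and 2 children respectively. "
    "3 + 3 + 2 + 2 = 10. The answer is 10."
)

match = re.search(r"The answer is\s*(-?\d+)", cot_response)
final_answer = int(match.group(1)) if match else None
print(final_answer)  # 10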

Document the various prompt attempts

This last piece of advice was mentioned earlier in this chapter, but we cannot emphasize its importance enough: keep detailed records of your prompt attempts so that you can learn over time what works well and what does not1. Effective prompt engineering means treating it as a scientific or engineering process, which requires rigorous documentation of experiments (prompts, settings, results) so you can learn, compare, and replicate1.

Prompt outputs can vary across models, sampling settings, and even different versions of the same model. Furthermore, even for the same prompt from the same model, there can be small differences in output sentence format and wording choices. (For example, as mentioned earlier, if two tokens have the same predicted probability, the tie may be broken randomly. This then affects subsequent predicted tokens.) 1 .

We recommend creating a Google Sheet and using Table 21 as a template. The advantage of this approach is that when you inevitably need to revisit your prompting work—whether to pick it up again in the future (you’ll be surprised how much you forget after even a short break), to test prompting performance on different versions of your model, and to help debug future bugs—you have a complete  record1 .

In addition to the fields in this table, it is helpful to keep track of the version (iteration) of the prompt, a field to record whether the result was OK/NOT OK/SOMETIMES OK, and a field to record feedback. If you are lucky enough to use Vertex AI Studio, save your prompt (using the same name and version as listed in the documentation) and keep track of a hyperlink to the saved prompt in the table. This way, you can rerun your prompt with a single  click1 .

When working with a retrieval augmented generation (RAG) system, you should also capture the specific aspects of the RAG system that influence what is inserted into the prompt, including the query, chunk settings, chunk output, and other information1.

Table 21: A template for documenting prompts 1 

Name: [name and version of your prompt]
Goal: [one-sentence explanation of the goal of this attempt]
Model: [name and version of the model used]
Temperature: [value between 0 and 1]
Token Limit: [number]
Top-K: [number]
Top-P: [number]
Prompt: [write out the full prompt]
Output: [write out the output or multiple outputs]

  • This table provides a concrete, structured template for key documentation practices. Following this template ensures that key information (goals, inputs, configurations, outputs) for each prompt experiment is systematically captured to facilitate subsequent analysis, comparison, and  replication1 .

Once you feel that your prompt is close to perfect, incorporate it into your project codebase. In the codebase, keep prompts in separate files from the code for easier maintenance. Finally, ideally, your prompt becomes part of an operationalized system, and as a prompt engineer you should rely on automated testing and evaluation procedures to understand how well your prompt generalizes to the task1. A small sketch of keeping prompts in separate files follows.
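As an illustrative sketch (the file path, file contents, and function are hypothetical, not a prescribed layout), separating prompts from application code can be as simple as:

from pathlib import Path

# Hypothetical layout: the prompt text lives in its own versioned file, not in the code.
PROMPT_FILE = Path("prompts/travel_guide_v3.txt")  # e.g. "You are a travel guide. Tell me one fact about the city {city}."

def load_prompt(**variables: str) -> str:
    # Read the prompt template and fill in its variables.
    return PROMPT_FILE.read_text(encoding="utf-8").format(**variables)

prompt = load_prompt(city="Amsterdam")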

Prompt engineering is an iterative process. Craft and test different prompts, then analyze and document the results. Refine your prompts based on the model's performance. Keep experimenting until you achieve the desired output. When you change the model or the model configuration, go back and continue experimenting with the previously used prompts1.

Summary

This whitepaper discussed prompt engineering. We learned about various prompting techniques, such as1:

  • Zero-shot prompting

  • Few-shot prompting

  • System prompting

  • Role prompting

  • Contextual prompting

  • Step-back prompting

  • Chain of thought (CoT)

  • Self-consistency

  • Tree of thoughts (ToT)

  • ReAct

  • Automatic Prompt Engineering (APE)

We even looked at how to automate your  prompts1 .

The whitepaper then discussed the challenges of generative AI, such as what can happen when your prompts are insufficient, and ended with best practices on how to become a better prompt engineer.