Structured output guide: three essential prompt tips

Written by Silas Grey
Updated on: June 20, 2025

Recommendation: Master the skills of structured output to improve application efficiency.

Core content:
1. What structured output is and why it matters in applications
2. Two ways to obtain structured output: native model features and prompt engineering
3. A detailed look at the specific applications and techniques of prompt engineering

When we try to integrate LLMs into real applications or workflows, we often run into a tricky problem: the format of the information the model outputs is not always what we expect. Its answers are often paragraphs of free-form text, which may be fine for human readers but hard for code and automated systems to handle. This is where structured output becomes important. This article introduces the concept and importance of structured output for beginners and explains in depth three essential tips for getting usable structured data from large language models.

1. The Importance of Structured Output

In simple terms, structured output means that a large language model returns information in a clear, predictable, and organized format, such as a neat list, labeled data fields (key-value pairs of the form key: value), or even a simple table, rather than long conversational text. This matters a great deal in practical applications.

Take, for example, an application that uses AI to analyze customer reviews. When we ask the model to "summarize this review and tell me the product mentioned, the customer, and the overall sentiment," the model might respond with something like: "Well, it looks like customer Jane D bought the 'MegaWidget 3000' and was very unhappy with it. She mentioned that the product broke after only two days and that she was very frustrated with the whole experience." This response is easy for a human to understand, but it is challenging for the application's code. The code needs to wade through the text to find the product name ("MegaWidget 3000"), determine the sentiment (is "unhappy" or "frustrated" the main sentiment?), and maybe extract the customer name if that is needed. And if the model uses different wording next time, say "disappointed" instead of "unhappy," the code that parses the sentiment might get it wrong.

In contrast, structured output avoids these problems. For example:

review_analysis:
  product_name: MegaWidget 3000
  sentiment: Negative
  summary: Customer reported that the product was damaged after two days, causing frustration.

This structured format transforms messy, conversational text into clean, organized data that your code can process reliably.
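
For illustration, here is a minimal Python sketch of how application code could consume this output. It assumes the PyYAML package and a hypothetical llm_response string holding the YAML block above:

import yaml  # PyYAML, assumed to be installed (pip install pyyaml)

# Hypothetical raw text returned by the model (the YAML shown above)
llm_response = """
review_analysis:
  product_name: MegaWidget 3000
  sentiment: Negative
  summary: Customer reported that the product was damaged after two days, causing frustration.
"""

analysis = yaml.safe_load(llm_response)["review_analysis"]
print(analysis["product_name"])  # MegaWidget 3000
print(analysis["sentiment"])     # Negative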

2. Methods for Obtaining Structured Output

Generally, there are two main approaches to obtaining structured outputs from large language models: native model capabilities and prompt engineering.

As AI models have developed, some have added built-in ways to request structured data. For example, Google's Gemini models can usually accept function declarations or response schemas directly, and sometimes integrate with tools such as Pydantic; OpenAI's models (such as GPT-4) provide features like "JSON mode" and "function calling" that force the output into a specific JSON structure. However, although these native features are powerful, they are model-specific. Different models request structured data in different ways, which can lock developers into a particular vendor and require learning its specific API or library, adding complexity.
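
As one concrete illustration of such a native feature, here is a minimal sketch of OpenAI's JSON mode using the openai Python SDK; the model name and prompts are placeholders, and details vary by vendor and SDK version:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    response_format={"type": "json_object"},  # "JSON mode": output must be valid JSON
    messages=[
        {"role": "system", "content": "Extract review data and reply in JSON."},
        {"role": "user", "content": "Summarize this review as JSON: ..."},
    ],
)
print(response.choices[0].message.content)  # a JSON string your code can parse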

Prompt engineering, by contrast, is a more general approach. The core idea is simple: in the instructions (prompts) you send to the model, explicitly state the output format you expect. This method does not depend on a specific model and works with almost any large language model. It leverages the model's core ability to understand and follow instructions, and it lets developers flexibly define the required structure, such as YAML or a particular JSON shape. Obtaining structured output through prompt engineering also requires no model-specific APIs, which lowers the barrier to entry and makes it a friendlier choice for beginners. The rest of this article focuses on three essential prompt-engineering tips.


Tip 1: Use YAML instead of relying solely on JSON

When requesting structured output from a large language model, choosing the right format is crucial. JSON (JavaScript Object Notation) is widely used in web development and APIs, but it can cause problems for large language models, especially when the data contains quotes or multi-line text.

JSON has strict rules: strings must be enclosed in double quotes ("), any double quote inside the text must be escaped with a backslash (\"), and newlines inside strings must be written as \n. Large language models often struggle with these rules, largely because of the way they process text: tokenization. A model breaks text into smaller pieces (tokens), which may be whole words, parts of words, or single characters and symbols. During tokenization, escape characters like \ or formatting sequences like \n are sometimes split awkwardly, and the model may also have had difficulty learning, from its huge training data, exactly when and how to apply these context-dependent rules. Because of this underlying tokenization mechanism, large language models do not handle escape characters consistently.

For example, suppose the model is asked to extract the dialogue Alice said: "Hello Bob. How are you?" (with a line break before "How are you?") and output it in JSON. The correct output is {"dialogue": "Alice said: \"Hello Bob.\nHow are you?\""}. Because of the tokenization challenges described above, it is not easy for the model to generate exactly this format every time, and errors are common.
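
To see why this matters in code, the sketch below (standard-library Python) contrasts a correctly escaped JSON string with the kind of unescaped output a model often produces instead:

import json

# Correctly escaped: inner quotes become \" and the line break becomes \n
good = '{"dialogue": "Alice said: \\"Hello Bob.\\nHow are you?\\""}'
print(json.loads(good)["dialogue"])

# Typical model mistake: inner quotes left unescaped, so the JSON is invalid
bad = '{"dialogue": "Alice said: "Hello Bob. How are you?""}'
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("Parse failed:", err)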

In contrast, YAML (YAML Ain't Markup Language) has a clear advantage in these situations. YAML is designed to be easy for humans to read and has more flexible rules for strings, especially multi-line strings, which makes it less prone to escaping and formatting errors. Expressing the conversation above in YAML is much cleaner:

speaker: Alice
dialogue: |
  Alice said: "Hello Bob.
  How are you?"

In this example, the quotes need no escaping and the line break appears naturally; the YAML block scalar style (|) is used here.
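
A quick check with a YAML parser (PyYAML here, with the block above stored in a string) shows that the quotes and the line break come through intact:

import yaml

snippet = """
speaker: Alice
dialogue: |
  Alice said: "Hello Bob.
  How are you?"
"""

data = yaml.safe_load(snippet)
print(data["dialogue"])
# Alice said: "Hello Bob.
# How are you?"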

YAML provides several powerful ways to handle multi-line strings. The literal style (|) preserves line breaks exactly: each new line in the source YAML becomes a newline (\n) in the resulting string. For example:

literal_style: |
  Line 1
  Line 2

  Line 4

The resulting string is: "Line 1\nLine 2\n\nLine 4\n" (note the double line feed and the final line feed).

The folded style (>) collapses most line breaks within a block into spaces, treating the text as one long line that is wrapped only for readability. Blank lines are preserved and become \n in the resulting string. For example:

folded_style: >
  This is actually
  just one long sentence,
  folded for readability.

  This starts a new paragraph.

The resulting string is: "This is actually just one long sentence, folded for readability.\nThis starts a new paragraph.\n" (note the space and the single \n).
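
If you want to confirm these results yourself, a small PyYAML check (assuming both examples are combined into one document) verifies the expected strings:

import yaml

doc = """
literal_style: |
  Line 1
  Line 2

  Line 4
folded_style: >
  This is actually
  just one long sentence,
  folded for readability.

  This starts a new paragraph.
"""

data = yaml.safe_load(doc)
assert data["literal_style"] == "Line 1\nLine 2\n\nLine 4\n"
assert data["folded_style"] == ("This is actually just one long sentence, "
                                "folded for readability.\nThis starts a new paragraph.\n")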

In addition, you can control how newlines at the end of a block scalar are handled by adding a "chomping indicator" (+ or -). By default (no indicator, i.e. | or >), a single trailing newline is preserved and any additional trailing newlines are removed; |+ or >+ keeps all trailing newlines; |- or >- strips all trailing newlines, including the last one.
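
As a small check of these rules (again a PyYAML sketch; the key names are arbitrary), the chomping indicator changes only the trailing newlines of the loaded strings:

import yaml

doc = """
clip: |
  one trailing newline is kept
strip: |-
  no trailing newline at all
keep: |+
  every trailing newline is kept

"""

data = yaml.safe_load(doc)
assert data["clip"] == "one trailing newline is kept\n"
assert data["strip"] == "no trailing newline at all"
assert data["keep"] == "every trailing newline is kept\n\n"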

In practice, when prompting a large language model for structured output that may contain complex strings, explicitly request YAML and ask the model to wrap the output in a fenced ```yaml ... ``` block. If multi-line text is likely and formatting matters, also specify a multi-line style (| or >) in the prompt. You generally do not need to specify a chomping indicator unless you have a specific requirement. And even though YAML is forgiving, always parse and validate the output in your code, using assertions or other schema checks to ensure the data is accurate and usable. By using YAML, especially its multi-line capabilities, you can significantly reduce the formatting errors caused by JSON's strict rules and the tokenization challenges of large language models.
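
Putting this advice together, here is a minimal sketch of extracting and validating such a reply; it assumes PyYAML, a hypothetical llm_reply string containing a fenced ```yaml block, and a hypothetical top-level review_analysis key:

import re
import yaml

def parse_yaml_reply(llm_reply: str) -> dict:
    """Extract the first ```yaml fenced block, parse it, and validate the keys we need."""
    match = re.search(r"```yaml\s*\n(.*?)```", llm_reply, re.DOTALL)
    text = match.group(1) if match else llm_reply  # fall back to the raw reply
    data = yaml.safe_load(text)

    # Validate the structure before trusting it
    assert isinstance(data, dict), "Expected a YAML mapping"
    assert "review_analysis" in data, "Missing review_analysis key"  # hypothetical key
    return data["review_analysis"]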

Tip 2: Request indexes instead of just words

In many practical tasks, we need a large language model to identify specific items from a provided list, for example filtering or selecting items based on certain criteria. A common practice is to ask the model to return the text of the items it selects, but this is often unreliable with real-world text, which can contain all kinds of formatting issues and noise.

Suppose we have a batch of product reviews and we want to use a large language model to mark out the reviews that look like spam (e.g., reviews that contain suspicious links or nonsense content). The input list of reviews might look like this:

review_list = [
  "Great product, really loved it! Highly recommend.", # Index 0
  "DONT BUY!! Its a scam! Visit my site -> www.getrichfast-totallylegit.biz", # Index 1
  " Item arrived broken. Very disappointed :( ", # Index 2 (extra space, emoticons)
  "????? ?????? ?????? click here for prize >>> http://phish.ing/xxx", # Index 3 (garbled code, link)
  "Works as expected. Good value for the price.", # Index 4
  "¡¡¡ AMAZING DEAL just for YOU -> check my profile link !!!" # Index 5 (weird punctuation, instructions)
]

The text in this list shows all kinds of variation in punctuation, capitalization, spacing, and symbols, and real data may also contain spelling errors (not deliberately added here, but common in practice).


If we give a large language model a prompt like: "Review the list below. Identify any reviews that appear to be spam or contain suspicious links. Return the full text of the reviews that should be removed.", the responses can vary. It might reproduce the content at index 1 perfectly, and it might correctly return the content at index 3. But for index 5, it might normalize "¡¡¡ AMAZING DEAL just for YOU -> check my profile link !!!" to "!!! AMAZING DEAL just for YOU -> check my profile link !!!", or subtly change the spacing or punctuation of other reviews. If our code tries to remove items based on the exact text returned by the model (e.g., if llm_output_text in review_list:), then even if the model correctly identifies a spam review, any slight variation means that review will not be found in the original list. Large language models are not designed to perfectly reproduce a potentially noisy input string; they process the meaning of the text and generate output, sometimes introducing small variations.

To fix this, we change the prompt to ask the model to output the indexes (position numbers) of the reviews that should be removed, rather than their potentially complex and varied content. We can rewrite the prompt like this: "Analyze the list of product reviews provided below, each marked with an index number (0 to 5). Identify any reviews that seem like spam or contain suspicious links/instructions. Output ONLY a list of the integer indexes corresponding to the reviews that should be removed." and include a numbered list in the prompt:

# Product Reviews (Output indexes of spam/suspicious ones):
# 0: Great product, really loved it! Highly recommend.
# 1: DONT BUY!! Its a scam! Visit my site -> www.getrichfast-totallylegit.biz
# 2: Item arrived broken. Very disappointed :(
# 3: ????? ?????? ?????? click here for prize >>> http://phish.ing/xxx
# 4: Works as expected. Good value for the price.
# 5: ¡¡¡ AMAZING DEAL just for YOU -> check my profile link !!!

With this prompt, the expected output for this example is simply a list of numbers, formatted in the requested YAML structure (combining this with Tip 1):

reviews_to_remove_indexes:
  - 1
  - 3
  - 5

This output has many advantages. It is simple: just a list of integers. The integers are stable, with no spelling errors, spacing problems, or punctuation variations. It is easy to verify: just check that the output is a list of valid integers in the expected range (0-5). And it can be used directly in code: by iterating over these indexes, we can reliably access or delete the corresponding reviews in the original list, no matter how messy those reviews are.

Therefore, when asking a large language model to select or identify items from a list of strings that may be complex or messy, provide clear indexes (or unique, simple identifiers) for the list items in the prompt and instruct the model to output only the indexes/identifiers of the selected items, then verify that the output is a list of valid indexes/identifiers. This approach greatly improves reliability for tasks that select from noisy real-world text, avoids fragile string matching, and relies on stable indexes to keep the results accurate.
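
In code, validating and applying such an index list is straightforward. The sketch below (plain Python, reusing the review_list defined above and a hypothetical reviews_to_remove_indexes list parsed from the model's YAML) shows the idea:

# Hypothetical list parsed from the model's YAML output (see Tip 1)
reviews_to_remove_indexes = [1, 3, 5]

# Validate: every item must be an integer inside the list's range
assert all(isinstance(i, int) and 0 <= i < len(review_list)
           for i in reviews_to_remove_indexes), "Invalid index in model output"

# Apply: keep only the reviews the model did not flag
to_remove = set(reviews_to_remove_indexes)
clean_reviews = [r for i, r in enumerate(review_list) if i not in to_remove]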

Tip 3: Embed Reasoning via Comments

The third tip may seem counterintuitive at first glance: deliberately ask the large language model to add "extra" natural language to its structured output, in the form of YAML comments (#). This is done not just to improve the readability of the output, but more importantly to improve the accuracy of the structured data itself.

When we ask a large language model to perform a complex task (like analyzing multiple reviews and outputting a list of review indexes to remove) and produce structured data right away, the model can sometimes "rush" to complete the task. Without a clear step to consolidate its findings or reason about its choices before committing to a structured format, it can easily make mistakes: missing an item, including the wrong item, or slipping up on a tricky classification. Jumping directly from analysis to a final structure can be unreliable.

To alleviate this problem, we can instruct the large language model to generate a natural-language comment explaining its reasoning before it outputs the key structured data. This is not just to make the output easier for us to understand later (though that is a nice side effect); more importantly, it forces the model to perform a "chain of thought" step exactly when it matters most.

When processing an input such as a list of reviews, the model first performs its analysis. Before outputting the list of indexes, it must first generate a comment summarizing why those specific indexes were chosen. This pulls it back into natural-language reasoning mode to consolidate its findings. Having clearly articulated its reasoning, the model is better prepared to output a correct list of indexes or an accurate structured value. Generating the comment acts like a "cognitive speed bump": it interrupts the jump straight to structured output and prompts the model to think for a moment, which often leads to more accurate results, especially for tasks that require synthesis or judgment (such as selecting multiple items from a list or making fine-grained classifications).

Let's revisit the spam review filtering task from Tip 2. We modify the prompt to read: "Analyze the list of product reviews… Output ONLY a YAML block containing the key reviews_to_remove_indexes with a list of integers. Crucially, add a YAML comment line starting with # immediately before the reviews_to_remove_indexes list, briefly summarizing which reviews were identified as spam/suspicious and why." With this prompt, the large language model might generate the following output:

# Identified reviews 1, 3, 5 as spam/suspicious due to external links, gibberish, or spammy language.
reviews_to_remove_indexes:
  - 1
  - 3
  - 5

By forcing the model to generate the "# Identified reviews…" comment first, we increase the likelihood that the list [1, 3, 5] that follows is accurate, because the model must explicitly explain its choices in natural language before committing to the numbers.

In practice, to use embedded reasoning to improve accuracy, first identify the key structured outputs that require the model to make judgments or synthesize information (e.g., lists of selected items, classifications, summary fields). Then instruct the model to add a YAML comment (# reasoning...) immediately before those specific fields, summarizing its findings or reasons before the data itself. This approach works best when the model is doing more than extracting simple facts, that is, when it is making selections or condensing analysis into a structured form. Think of it as asking the model to "show its preliminary work" in a comment before committing to the structured answer. This embedded reasoning step is a powerful technique for improving the reliability and accuracy of structured outputs.
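
To close the loop, note that standard YAML parsers simply ignore comments, so the reasoning line costs nothing at parse time; the sketch below (PyYAML, with a hypothetical llm_reply string shaped like the output above) validates the indexes and keeps the comment around for logging:

import yaml

# Hypothetical model reply following the Tip 3 prompt
llm_reply = """
# Identified reviews 1, 3, 5 as spam/suspicious due to external links, gibberish, or spammy language.
reviews_to_remove_indexes:
  - 1
  - 3
  - 5
"""

data = yaml.safe_load(llm_reply)  # the comment line is ignored by the parser
indexes = data["reviews_to_remove_indexes"]
assert all(isinstance(i, int) for i in indexes)

# Optionally keep the model's stated reasoning for logging or debugging
reasoning_lines = [line for line in llm_reply.splitlines() if line.strip().startswith("#")]
print(reasoning_lines[0] if reasoning_lines else "(no reasoning comment found)")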