Practical guide: how to get an LLM to generate perfect JSON output

Master the practical skill of getting an LLM to generate perfectly formatted JSON output.
Core content:
1. Applications and challenges of getting an LLM to produce output in a specific format
2. Constraining the LLM's output format by adjusting its probability distribution
3. A hands-on example of defining JSON syntax rules with an open-source project
If you often use an LLM for tasks such as writing code, generating structured data, or calling APIs, you may have heard of features like json_mode or "function calling". These features all solve one core problem: getting the LLM to output content in the format we want.
Except for DeepSeek, all other models on the Silicon Mobility API support JSON-format returns; you can try it out.
The Nature of LLM-Generated Content
We all know that today's LLMs generate content with a probabilistic model, producing one token at a time. Each token is chosen from a probability distribution; in other words, the next token is "guessed" from the context.
This process is very flexible, which is what lets an LLM generate such a wide variety of text. But that flexibility also raises a problem: what if we want the output in a specific format, such as JSON?
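To make the token-by-token picture concrete, here is a toy sketch of a single generation step in Python (the vocabulary and probabilities are invented for illustration; a real model has a vocabulary of tens of thousands of tokens):
import numpy as np

# One step of free generation: the model outputs a probability
# distribution over the whole vocabulary and we sample the next token.
# The vocabulary and probabilities below are invented for illustration.
vocab = ['The', 'cat', 'sat', '{', '"']
probs = np.array([0.05, 0.30, 0.40, 0.15, 0.10])  # hypothetical model output

next_token = np.random.choice(vocab, p=probs)
print(next_token)  # any token can appear: nothing enforces a format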
Adjusting the probability distribution
In addition to letting the LLM generate freely, we can also use technical means to constrain its behavior. Today we focus on a very practical technique: forcing the LLM to generate content that conforms to a specific grammar by manually intervening in the probability distribution.
That is, we set the probability of every token that does not conform to the target format to 0, ensuring that the LLM cannot deviate from the rules we set while generating. This method is already widely used in practice, for example in json_mode, structured output, and function calling, as sketched below.
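A minimal sketch of that intervention, reusing the toy vocabulary above (here the "grammar check" is just a hard-coded allowed set; real implementations such as llama.cpp derive the allowed set from the grammar state at each step):
import numpy as np

# The model's raw scores (logits) for the next token; invented numbers.
vocab = ['The', 'cat', 'sat', '{', '"']
logits = np.array([1.2, 2.5, 2.9, 0.3, 0.7])

# Suppose the JSON grammar says the next token must start an object or a
# string. Set the logits of every other token to -inf, so its probability
# becomes exactly 0 after softmax.
allowed = {'{', '"'}
mask = np.array([tok in allowed for tok in vocab])
masked = np.where(mask, logits, -np.inf)

# Softmax over the masked logits: disallowed tokens get probability 0.
probs = np.exp(masked - masked.max())
probs /= probs.sum()

next_token = np.random.choice(vocab, p=probs)
print(next_token)  # always '{' or '"', never 'sat'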
The open-source project llama.cpp on GitHub ships a grammars/json.gbnf file (https://github.com/ggml-org/llama.cpp/blob/master/grammars/json.gbnf) that defines the rules of JSON syntax:
root   ::= object
value  ::= object | array | string | number | ("true" | "false" | "null") ws
object ::=
  "{" ws (
            string ":" ws value
    ("," ws string ":" ws value)*
  )? "}" ws
array  ::=
  "[" ws (
            value
    ("," ws value)*
  )? "]" ws
string ::=
  "\"" (
    [^"\\\x7F\x00-\x1F] |
    "\\" (["\\bfnrt] | "u" [0-9a-fA-F]{4}) # escapes
  )* "\"" ws
number ::= ("-"? ([0-9] | [1-9] [0-9]{0,15})) ("." [0-9]+)? ([eE] [-+]? [0-9] [1-9]{0,15})? ws
# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= | " " | "\n" [ \t]{0,20}
With this grammar in place, the LLM strictly adheres to the JSON syntax rules while generating its response, with no deviation.
Frameworks that offer GBNF support, such as LangChain, are likewise implemented on top of llama.cpp.
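If standard JSON is all you need, llama-cpp-python can load that shipped grammar file directly. A minimal sketch, assuming json.gbnf has been downloaded locally and using a placeholder model path:
from llama_cpp import Llama, LlamaGrammar

# Load the grammar shipped with llama.cpp (assumes the file was saved locally)
json_grammar = LlamaGrammar.from_file("grammars/json.gbnf")

llm = Llama(model_path="path/to/your-model.gguf")  # placeholder path

# Every completion is forced to be syntactically valid JSON
output = llm("Describe a fruit as a JSON object.", max_tokens=200, grammar=json_grammar)
print(output['choices'][0]['text'])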
Or, if you want the output to follow your own custom JSON structure, you can write a grammar like this:
from llama_cpp import Llama, LlamaGrammar

# Define your GBNF grammar
grammar_string = r"""
root ::= "{" pair ("," pair)* "}"
pair ::= string ":" list
list ::= "[" value ("," value)* "]"
value ::= string | number
string ::= "\"" [a-zA-Z0-9_]+ "\""
number ::= [0-9]+
"""

# Create a LlamaGrammar object
my_grammar = LlamaGrammar.from_string(grammar_string)

# Initialize the Llama model from a local GGUF file
llm = Llama(model_path="C:/Users/sride/PycharmProjects/gbnf_implemen/llama-2-7b.Q4_K_S.gguf")

# Generate constrained output
prompt = "Give me list of fruits"
output = llm(prompt, max_tokens=100, temperature=0.7, grammar=my_grammar)
print(output['choices'][0]['text'])
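Note the trade-off in this toy grammar: the string rule only allows [a-zA-Z0-9_]+, so a value like "dragon fruit" (with a space) can never be generated, and number only matches non-negative integers. The output is guaranteed to match the {"key":["value", ...]} structure, but if you need richer strings or numbers, widen those rules or reuse the definitions from json.gbnf above.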