How to stop Gemini from cutting its output short and get it to generate a 10,000-word article in one go

Written by
Audrey Miles
Updated on: June 13, 2025
Recommendation

Improve Gemini's long-text output capability so it can produce a 10,000-word novel draft in a single pass.

Core content:
1. Analysis of the reasons why Gemini output becomes shorter
2. Practical techniques for optimizing Gemini output
3. Gemini 2.5 Pro's long text output capability

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

 

Today, while I was working on a novel-writing tutorial, a friend in my group chat asked me:

"Why has my Gemini output become so short? It seems that compared with 0325, the number of words spit out at a time is much shorter, often 2,000 to 3,000 tokens, and it seemed that it could reach tens of thousands of tokens before."

This question is very common. I don't usually generate very long texts myself; I'm more used to short outputs, which make it easy to steer the AI and have it correct things at any point.

But we must admit that when writing a novel, sometimes you simply need a complete ten-thousand-word draft in one pass, which leaves more room to expand or trim the key descriptive passages.

In practice, Gemini's output length often fluctuates across different environments. This does not mean the official team has cut the model's output capacity.

The model's basic capability parameters usually do not change casually, but the server-side compute available at any given moment and the model's internal structure after an iteration do change, and its overall "personality" shifts while our configuration stays the same. Combine the two and you naturally get very different results.

In that case, all users can do is keep fine-tuning their prompts to adapt to the model's new personality as best they can.

So, how should we configure things so that Gemini produces long outputs again?

The short version (key settings):

  • Enable the output-token option and set max_output_tokens to 65,000, close to the upper limit
  • Check client-specific parameters (if applicable) and make sure stopSequences is an empty array
  • Provide a clear, specific, and long outline, and assign each part as large a target token count as possible
  • Instruct the AI to keep writing until it hits the output limit, with no conclusion or summary
  • Turn on streaming output
  • Allow a large thinking-chain budget
  • Rule out special output formats such as Markdown; reduce the chance of the AI emitting sensitive words that would trigger a safety truncation
  • Use high temperature for creativity (not directly related to length; a conservative middle value such as 0.7 works, or go as high as 1 if you want to maximize word count and creativity)
  • A recursive continuation method may allow you to exceed the 65,535-token limit (not fully tested yet)

Let's take Gemini 2.5 Pro's recent 0506 or 0605 versions as an example. The hard numbers of the Gemini 2.5 Pro API are impressive: the official model pages on Vertex AI and Google AI Studio state that a single input can reach 1,048,576 tokens and a single output is capped at 65,535 tokens. Converted, 65,000-plus tokens correspond to roughly 250,000-260,000 Chinese characters, enough to write a medium-length web-novel volume in one breath.

Why, then, is the actual output you get from these models in everyday use so much shorter than this?

  1. Front-end/SDK protection valves: to avoid locking up the browser, the Gemini web app and many third-party clients cut responses down to a few thousand words; plenty of community threads confirm this.
  2. Safety and budget strategies: Gemini 2.5 Pro adds a "configurable thinking budget", and Google truncates generation internally based on load and safety policies. Even with maxOutputTokens set to the maximum, generation may end early when a safety or budget threshold is triggered.
  3. finish_reason = MAX_TOKENS: when the model hits the cap or is intercepted by stopSequences, the finish_reason field in the response records it. Unlike earlier models it does not throw an error; it simply returns a valid but truncated candidate text.
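To see which of these stops actually fired, check finishReason on the returned candidate. A minimal sketch with the google-generativeai Python SDK (the API key, model name, and prompt are placeholders to adapt):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.5-pro")  # placeholder; use the exact model version you call
response = model.generate_content(
    "Write the first chapter of the outline below...",  # your actual long-form prompt
    generation_config={"max_output_tokens": 65000},
)
# STOP = natural ending, MAX_TOKENS = hit the cap, SAFETY = cut by the safety filter
print(response.candidates[0].finish_reason)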

Direct variables that determine single-output length:

  • maxOutputTokens: hard cap; the officially allowed range is 1-65,535, and when it is not set explicitly the default cap is usually 8,192.
  • stopSequences: up to 5 strings; the model stops writing as soon as it generates any of them. A wrong stop string is the most common cause of an unexpected early stop.
  • Safety filtering: hitting a sensitive word triggers a SAFETY stop even if there is token budget left.
  • Thinking budget: when you turn on a "budget" in AI Studio/Vertex AI, or the SDK defaults to a medium budget, the number of internal reasoning steps is limited and generation ends earlier.
  • Network timeouts and client caching: if the client does not consume the streamed output for about 30 seconds, the server may close the connection; non-streaming generateContent requests are also subject to HTTP timeout limits.
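The first three variables map directly onto SDK parameters. A minimal sketch with the google-generativeai Python SDK (the relaxed safety thresholds here are only an illustration; use whatever your content actually needs):

import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

model = genai.GenerativeModel(
    "gemini-2.5-pro",  # placeholder model name
    generation_config={"max_output_tokens": 65000, "stop_sequences": []},  # hard cap maxed out, no stop strings
    safety_settings={  # relax categories only as far as your use case allows
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
)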

Tip: temperature, topP, and topK only affect the randomness of word selection and have little to do with the length of the output.


Three steps to make Gemini keep spitting out long texts

Step 1: Open the streaming output interface

Use streamGenerateContent instead (or stream=True in the SDK). Receiving the long text as it is being written avoids HTTP timeouts and keeps memory usage low.
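In the Python SDK that is just stream=True. A minimal sketch (assumes the google-generativeai package is installed and configured; long_novel_prompt stands in for your real prompt):

import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content(
    long_novel_prompt,  # your outline-driven prompt
    generation_config={"max_output_tokens": 65000},
    stream=True,
)
for chunk in response:  # consume chunks as they arrive to keep the connection alive
    print(chunk.text, end="", flush=True)  # or write them to a file instead of printing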

The streaming output option in Cherry Studio is shown in the figure.

Then max out the cap in generationConfig:

{ "maxOutputTokens": 65000, "stopSequences": [], "temperature": 0.9 }

If your project has a "thinking budget" enabled, set the budget to a large value so the model does not wrap up internally too early.


The budget setting is shown in the figure.
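If you call the API from code, the newer google-genai Python SDK exposes the same budget as a thinkingConfig field. A hedged sketch (field names per the current documentation; verify the allowed budget range for your model and SDK version, and long_novel_prompt is again a placeholder):

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
config = types.GenerateContentConfig(
    max_output_tokens=65000,
    stop_sequences=[],
    thinking_config=types.ThinkingConfig(thinking_budget=32768),  # a large budget; check the documented range
)
for chunk in client.models.generate_content_stream(
    model="gemini-2.5-pro", contents=long_novel_prompt, config=config
):
    print(chunk.text, end="", flush=True)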

Step 2: Structure the instructions, forbid wrapping up, keep the writing going

Gemini 2.5 Pro's official upper limit is 65,535 tokens, and setting stopSequences to an empty array disables soft truncation entirely. The rest is up to the prompt: use the system instruction to give the model a hard order: stay immersed in the writing until the hard limit is reached.

{
  "systemInstruction": {
    "parts": [
      {
        "text": "You are writing the first volume of a long fantasy novel. Maintain the same narrative perspective and style, focus on scenes and detailed description, do not summarize, and do not end prematurely. Never stop unless the output reaches the model's maximum limit."
      }
    ]
  },
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "The outline of the first volume is as follows:\n——During the Spring Festival in the capital, the young guard Leon accidentally picks up the Sealed Sword;\n——A melee breaks out during the festival, the princess is kidnapped, and Leon is forced onto a pursuit journey;\n——Along the way he encounters the Thieves' Alliance, the Hermit Alchemist, and ancient ruins;\n——In the final chapter he fights a decisive battle at the top of the Obsidian Tower with the Silver Dragon Knight who kidnapped the princess.\nPlease write with an immersive lens, starting from the hustle and bustle of the market in the early morning of the Spring Festival, and fully unfold the entire plot of the first volume."
        }
      ]
    }
  ],
  "generationConfig": {
    "maxOutputTokens": 65000,
    "stopSequences": [],
    "temperature": 0.7
  }
}

Key points for designing the prompt:

  1. Say only one thing: don't stop writing. Forbid summarizing or generalizing; forbid ending at a chapter break.
  2. Give the user instruction a clear, linear outline. The more specific the outline, the less the model circles back on itself, and the more tokens get filled with plot instead of repetition.
  3. Open on a concrete scene. Starting from a specific scene such as "the hustle and bustle of the market in the early morning of the Spring Festival", the model naturally advances along the timeline and keeps consuming tokens.
  4. Coordinate the parameters. Leave stopSequences empty; temperature can be anything from medium to high; use streamGenerateContent for streaming output.

The above is a complete prompt generated with AI assistance, and it can also be written in Markdown. Using it as a template, flesh out the chapter-by-chapter outline as much as you can, keep it clear and concise, and end every outline section with a concrete scene and plot point for the AI to unfold. Ideally, called in this format, Gemini will write from the smoke of the market all the way to the dragon wings atop the Obsidian Tower without ever asking "continue?", until it reaches 65,000 tokens and breaks off at the end of a sentence with finish_reason = "MAX_TOKENS".

Step 3 (exploring further gains): detailed outline control plus "relay prompts" to keep the writing uninterrupted

Next is a new approach put together with the help of OpenAI and Gemini Deep Research. It has not been tested yet, but it may be useful.

First, make sure you have a detailed outline that lets the AI write each chunk as long as possible.

Use the following prompt template to drive the model to write in chunks. When a chunk hits the 60K+ token ceiling, finish_reason="MAX_TOKENS" appears automatically and the writing can be resumed immediately:

<system>
You are generating the text of a long web novel. Keep the style consistent, and output "[PART_END]" at the end of each part. When you reach the output limit, stop at the end of a sentence; do not write a summary.
</system>

<user>
The novel synopsis is as follows…
Now start with Chapter 1 and expand on it without leaving out any details.
</user>

Client-side logic, a sketch using the google-generativeai Python SDK (outline_prompt and save() are placeholders for your own outline and persistence code):

import google.generativeai as genai  # assumes an API key is already configured
model = genai.GenerativeModel("gemini-2.5-pro")  # placeholder model name
messages = [{"role": "user", "parts": [outline_prompt]}]  # outline_prompt holds the detailed outline
while True:
    response = model.generate_content(messages, stream=True,
                                      generation_config={"max_output_tokens": 65000, "stop_sequences": []})
    text = "".join(chunk.text for chunk in response)  # consume the stream as it arrives
    save(text)  # persist this chunk (your own function)
    if response.candidates[0].finish_reason.name != "MAX_TOKENS":
        break
    # keep the generated text in context, then ask the model to continue
    messages += [{"role": "model", "parts": [text]}, {"role": "user", "parts": ["Continue."]}]

This "check finish_reason, then append and continue" conversational recursion can seamlessly splice together multiple 65K-token outputs; in theory it is limited only by your account's daily token quota and your bill.


Lessons Learned

  • The key adjustments are the basic settings: the max_output_tokens output cap, a high thinking-chain budget, and streaming output.
  • The more specific the content, the clearer the instructions, and the less sensitive the material, the less likely you are to trigger Gemini's safety cutoffs.
  • Putting a well-structured long outline, character list, and world-building notes into the prompt in advance reduces the tokens wasted by the model "going in circles".
  • If you worry that 65K in a single round is not enough, have the model output a complete chapter directory first and then call it chapter by chapter (see the sketch after this list), so you never lose the continuation point if a safety policy interrupts generation midway.
  • If you hit sudden blank output or HTML blobs, a Markdown code block or table was probably filtered; try banning Markdown in the prompt and using plain text instead.
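A minimal sketch of the "directory first, then chapter by chapter" idea mentioned above (google-generativeai Python SDK again; the directory prompt and the line-by-line splitting are simplified placeholders):

import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.5-pro")  # placeholder; assumes a configured API key
toc = model.generate_content("Output only a numbered chapter directory for Volume 1, one chapter per line.").text
book = []
for line in toc.splitlines():
    if not line.strip():
        continue  # skip blank lines in the directory
    chapter = model.generate_content(
        f"Chapter directory:\n{toc}\n\nWrite the full text of \"{line.strip()}\". Do not summarize or end early.",
        generation_config={"max_output_tokens": 65000, "stop_sequences": []},
    )
    book.append(chapter.text)  # each chapter is capped separately, so an interruption stays local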

With these settings, Gemini 2.5 Pro can reliably produce long texts of 20,000 or even 100,000 tokens in practice, and the stopping point depends only on when you say "that's enough".

Finally, let me talk about my own experience.

Mainstream large models such as Gemini are large language models, and the essence of an LLM's output is probability computation that imitates the shape of human language. In other words, when the AI "feels it should go this way", it produces content according to the fuzzy picture it holds; it is not a computer program that executes exactly what it receives.


To a large model, a request to output exactly 8,000 or 9,000 tokens carries little precise meaning; at best it has learned from training that this means "a long text". The amount we ask for is really only a rough signal.


Therefore, if we want the output to be as long as possible, we should not only set the output token limit as high as we can, but also keep the request reasonable. Only when the prompt is accurate, clear, specific, and sensible, combined with an instruction asking for a long output, can we reliably get long text.


Getting long output from a large model is often just a matter of making it feel that "the atmosphere is right".

For example, suppose we ask the AI to output a 60,000-token novel in one go based solely on the plot of Wang Xiaoming getting up in the morning and frying himself some eggs. Such a request is obviously unreasonable.

It is also difficult for AI to output such a long text.

But suppose we weave in Wang Xiaoming's life story from childhood to adulthood, his relationships, his ambitions, and his decision to go to the battlefield and save the world after eating the fried eggs, inserting all of this rich plot, montage-style, into the scene of making fried eggs in the morning. If we tell the AI that this morning's fried eggs are really just the most symbolic fragment of the most critical day of his life, then we have in fact expanded the outline to a point that can support a 100,000-word novella.

With the atmosphere set up like this, it becomes easy for the AI to generate a long text.