Save up to 75% on token costs: Gemini 2.5 launches implicit caching

Written by
Iris Vance
Updated on: June 28, 2025
Recommendation

The Gemini 2.5 model's implicit caching feature can save you up to 75% of token costs in repeated scenarios, making AI development more efficient and affordable.

Core content:
1. Gemini 2.5's new implicit caching feature greatly reduces token costs in repeated scenarios
2. How developers can use implicit caching to optimize applications such as AI Q&A bots and reap the cache savings
3. The Gemini team keeps pushing the Pareto frontier toward the optimal balance of AI performance and cost


In May 2024, the Gemini API launched the context caching feature, which can save up to 75% of token costs in repeated scenarios.

In fact, DeepSeek in China already has a similar caching mode.

Until now, though, the cache had to be configured manually, and the process was a bit cumbersome.

Yesterday, the Gemini 2.5 model brought a smarter “implicit caching” feature to make saving money even easier.

What is implicit caching?

Simply put, you no longer need to set up a cache yourself; the Gemini API automatically works out which content qualifies for savings.

As long as your request starts with the same content as a previous request, that shared portion can hit the cache and enjoy a 75% token discount.

In other words, per Google's announcement, we don't need to write a single line of cache code.

Now, Gemini 2.5's implicit caching effectively builds the savings directly into the API.

Developers only need to put the unchanged content at the beginning of the request and the changing content at the end to maximize the benefits of caching.

For example:

When building an AI Q&A bot, put the general instructions and background material up front and the user's question at the end. That way, every new question can trigger the cache and greatly reduce costs, as the sketch below illustrates.
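
Here is a minimal sketch of that structure, assuming the google-genai Python SDK and a GEMINI_API_KEY in the environment; the file name manual.txt and the bot's wording are placeholders:

```python
# Minimal sketch, assuming the google-genai Python SDK
# (pip install google-genai) and a GEMINI_API_KEY in the environment.
# "manual.txt" and the instruction text are placeholders.
from google import genai

client = genai.Client()  # picks up the API key from the environment

# Stable content first: this long block is identical across requests,
# so the implicit cache can recognize it as a shared prefix.
BACKGROUND = (
    "You are a customer-support bot for ExampleCorp.\n"
    "Product manual:\n"
    + open("manual.txt").read()
)

def answer(question: str) -> str:
    # Variable content last: only the user's question changes per request.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=BACKGROUND + "\n\nUser question: " + question,
    )
    # usage_metadata reports cache hits; the count may be None or 0
    # when nothing was served from cache.
    print("cached tokens:", response.usage_metadata.cached_content_token_count)
    return response.text

print(answer("How do I reset my password?"))
print(answer("What is the warranty period?"))  # same prefix, likely cache hit
```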

Of course, caching has its limits.

The 2.5 Flash model requires a minimum of 1,024 tokens to trigger caching, while 2.5 Pro requires 2,048 tokens. A quick way to check your prefix length is sketched below.
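
If you want to verify that your shared prefix clears the threshold, you can count its tokens first. A rough sketch with the same SDK; the file path is illustrative:

```python
# Rough sketch: check whether a shared prefix clears the implicit-cache
# minimum (1,024 tokens for 2.5 Flash, 2,048 for 2.5 Pro).
# Same SDK as above; "manual.txt" is a placeholder.
from google import genai

client = genai.Client()
MIN_TOKENS = {"gemini-2.5-flash": 1024, "gemini-2.5-pro": 2048}

def prefix_is_cacheable(model: str, prefix: str) -> bool:
    # count_tokens asks the API how many tokens the prefix occupies
    count = client.models.count_tokens(model=model, contents=prefix)
    return count.total_tokens >= MIN_TOKENS[model]

prefix = open("manual.txt").read()  # the stable prefix you plan to reuse
print(prefix_is_cacheable("gemini-2.5-flash", prefix))
```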

In fact, most scenarios can benefit from implicit caching.

Currently, Gemini 2.5 also retains the explicit caching API, so caches can still be created and managed manually when you want guaranteed savings; a sketch follows.
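
Here is a sketch of the explicit route, again assuming the google-genai SDK; the TTL, file name, and instruction text are placeholders, and explicit caching may require a specific versioned model name:

```python
# Sketch of the explicit caching API, assuming the google-genai SDK.
# The TTL, file name, and instruction text are placeholders; explicit
# caching may require a specific versioned model name.
from google import genai
from google.genai import types

client = genai.Client()

# Create a cache with a fixed lifetime; you control exactly what is
# cached and for how long (storage for the cache is billed separately).
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a customer-support bot for ExampleCorp.",
        contents=[open("manual.txt").read()],
        ttl="3600s",  # keep the cached content alive for one hour
    ),
)

# Reference the cache by name in subsequent requests.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How do I reset my password?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```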

The Gemini team put it well: they want to keep pushing the "Pareto frontier" (more on this concept later in the article) so that AI becomes more efficient and affordable both to use and to develop.

If you haven't used Gemini 2.5's implicit caching yet, you can enjoy its benefits through AI Studio or Vertex AI, so give it a try!

Vertex AI offers new users $300 in credits for a 90-day free trial. For setup and usage details on Google Cloud Platform, see this tutorial: Using Vertex AI to Call Gemini 2.5 Pro in Google Cloud.

You have to admit: Google really is eager to get developers spending on its platform.

Extension: the Pareto Frontier

The Pareto frontier is about achieving the optimal balance under limited resources.

For example, you have two goals: one is to improve the performance of AI, and the other is to reduce costs.

You can't have the best of both worlds; there are always trade-offs. The Pareto frontier is the set of all optimal points where going one step further on one goal means sacrificing the other.
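
To make the idea concrete, here is a small hypothetical sketch that filters (performance, cost) candidates down to their Pareto frontier, treating higher performance and lower cost as the two goals; the numbers are invented:

```python
# Hypothetical illustration: filter (performance, cost) candidates down
# to the Pareto frontier, where higher performance and lower cost are
# both desirable.

def pareto_frontier(points):
    """Keep a point unless some other point dominates it, i.e. is at
    least as good on both goals and strictly better on one."""
    frontier = []
    for p, c in points:
        dominated = any(
            p2 >= p and c2 <= c and (p2 > p or c2 < c)
            for p2, c2 in points
        )
        if not dominated:
            frontier.append((p, c))
    return frontier

# (performance score, cost) for five imaginary models
models = [(90, 10), (85, 4), (70, 2), (60, 3), (89, 11)]
print(pareto_frontier(models))  # -> [(90, 10), (85, 4), (70, 2)]
```

Every point left in the output is one of the "improving one goal means sacrificing the other" trade-offs described above; the two dropped points were strictly worse deals.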

The same is true for AI products.

Every technological advance actually pushes the Pareto frontier forward a little, letting you gain benefits where you previously couldn't.

When Google says it wants to "push the Pareto frontier," it means moving the two seemingly contradictory goals of high performance and low cost forward together.