10 basic capabilities that AI gateways need to have

Explore the key role and necessary capabilities of AI gateways in modern enterprises.
Core content:
1. The importance of AI gateways in large-model applications
2. The relationship and differences between AI gateways and API gateways
3. The key capabilities of AI gateways in API development, provisioning, and consumption scenarios
It has become an industry consensus that the main battlefield of large models has shifted from training to inference. More and more companies are designing large-model applications for internal needs and external business directions, and deploying them in production environments. In this process, a series of new requirements emerges that differs from those of the early, proof-of-concept stage; these requirements stem mainly from the need for scale and for safe use. Among them, the AI gateway has become one of the most-discussed key components of AI infrastructure.
We believe the AI gateway is not a new form independent of the API gateway. In essence, it is an API gateway that has been specifically extended for the new needs of AI scenarios: it is both an inheritance of the API gateway and an evolution of it. We therefore classify the AI gateway's capabilities from the API perspective, to help build consensus around the concept.
Inheritance from the API gateway
Because many gateway capabilities are provided around APIs and many roles are involved, we classify all capabilities by their users into three scenarios: development, provisioning, and consumption. These correspond to the team that develops the API interfaces, the team that develops and operates the API platform, and the external callers of the API platform.
API First means defining the API specification before writing code. Unlike coding directly without a defined API, API First emphasizes designing and developing the API interface before building the application, treating the API as a core architectural component of the system and achieving modularity through well-defined interface specifications. For example, public cloud products all provide API calling methods, and the WeChat mini-program and DingTalk open platforms also provide API interfaces for developers. Like a modular system of Lego bricks, standard interfaces allow services to be combined flexibly, improving the system's scalability and maintainability and, in turn, the efficiency of the ecosystem.
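As a minimal illustration of API First in Go, the sketch below defines the contract before any implementation exists; all names here (ChatRequest, ChatResponse, ChatService) are hypothetical, chosen only to show the idea:

```go
package api

// API First: the contract is written down before any implementation.
// Clients, mocks, and the gateway can all be built against it in parallel.

// ChatRequest is the agreed-upon input shape for a chat endpoint.
type ChatRequest struct {
	Model     string `json:"model"`                // which backend model to use
	Prompt    string `json:"prompt"`               // user input
	MaxTokens int    `json:"max_tokens,omitempty"` // optional output cap
}

// ChatResponse is the agreed-upon output shape, including token usage.
type ChatResponse struct {
	Content string `json:"content"`
	Usage   struct {
		InputTokens  int `json:"input_tokens"`
		OutputTokens int `json:"output_tokens"`
	} `json:"usage"`
}

// ChatService is the interface every implementation must satisfy.
type ChatService interface {
	Chat(req ChatRequest) (ChatResponse, error)
}
```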
The API provisioning scenario refers to the process by which an API provider (such as an enterprise, platform, or service) exposes data or functions through a standardized interface. Its core is to create, manage, and maintain APIs to ensure their availability, security, and efficiency. Core capabilities include:
- API security: protects APIs from security threats, ensures that only authorized users and applications can access them, and guarantees the confidentiality, integrity, and availability of data in transit and at rest. Examples include identity authentication, authorization management, data encryption and decryption, and anti-attack mechanisms.
- Grayscale release: a strategy for gradually introducing a new API version or feature in production. A portion of users or request traffic is directed to the new version while the rest stays on the old one, so the new API can be tested and verified without affecting overall system stability or user experience.
- Caching: temporarily stores API responses on a cache server. When the same request arrives again, the response is served directly from the cache without touching the backend server, improving API response time and overall system performance (see the sketch below).
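The caching capability boils down to a cache-aside lookup in front of the backend. Below is a minimal in-process sketch in Go; a production gateway would use a shared store such as Redis with TTLs and invalidation, and fetchWithCache and its signature are purely illustrative:

```go
package gatewaycache

import "sync"

// In-process cache; a real gateway would use a shared store with TTLs.
var (
	cache   = map[string][]byte{}
	cacheMu sync.RWMutex
)

// fetchWithCache returns the cached response for key if present;
// otherwise it calls the backend once and stores the result.
func fetchWithCache(key string, backend func() ([]byte, error)) ([]byte, error) {
	cacheMu.RLock()
	resp, ok := cache[key]
	cacheMu.RUnlock()
	if ok {
		return resp, nil // cache hit: the backend is never touched
	}

	resp, err := backend() // cache miss: forward to the backend service
	if err != nil {
		return nil, err
	}
	cacheMu.Lock()
	cache[key] = resp
	cacheMu.Unlock()
	return resp, nil
}
```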
API consumption scenarios refer to the process by which callers (such as applications or developers) quickly implement functionality or obtain data by integrating external APIs. The core is using the capabilities or data provided by the platform to meet business needs. Core capabilities include:
- Call auditing: comprehensively records, monitors, and analyzes API call activity, capturing detailed information about each call, including call time, caller identity, the API interface called, request parameters, response results, and response time.
- Caller quota rate limiting: the mechanism by which the API gateway limits the number of API calls, the traffic volume, or the resource usage of each caller (such as a user, application, or IP address) within a given period, according to preset rules (see the sketch below).
- Backend-protection rate limiting: manages and controls API traffic so that the API runs stably and efficiently, avoiding crashes and performance degradation caused by excessive or abnormal traffic. This includes load balancing, rate limiting, degradation, and circuit breaking.
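As a sketch of caller quota rate limiting, the Go snippet below keeps a token bucket per caller using the golang.org/x/time/rate package. The keys, rates, and burst sizes are illustrative; a production gateway would keep counters in a shared store so limits hold across gateway replicas:

```go
package ratelimit

import (
	"sync"

	"golang.org/x/time/rate"
)

// callerLimiter keeps one token bucket per caller identity.
type callerLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit // allowed requests per second per caller
	burst    int        // short-term burst allowance
}

func newCallerLimiter(rps rate.Limit, burst int) *callerLimiter {
	return &callerLimiter{
		limiters: map[string]*rate.Limiter{},
		rps:      rps,
		burst:    burst,
	}
}

// Allow reports whether the caller identified by key (API key, app ID,
// or client IP) may make another request right now.
func (c *callerLimiter) Allow(key string) bool {
	c.mu.Lock()
	lim, ok := c.limiters[key]
	if !ok {
		lim = rate.NewLimiter(c.rps, c.burst)
		c.limiters[key] = lim
	}
	c.mu.Unlock()
	return lim.Allow()
}
```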
The evolution of the API gateway
In large-model scenarios, the model serves external callers through APIs, which brings more diverse demands to the development, provisioning, and consumption scenarios.
"API First", or "the API is a first-class citizen", is no longer a slogan but has gradually become a real application development norm. Developing and operating an agent requires calling APIs, and an agent that serves externally through an open platform also exposes APIs. The API gateway can cover every stage of the API lifecycle, including design, development, testing, release, monetization, operations and monitoring, security management, and retirement, and enterprises will have stronger demands here. On top of the API gateway, plugin capabilities can further improve agent-development efficiency, such as AI prompt templates [1], API AI Agent [2], and JSON formatting [3], which structures the AI response according to a default or user-configured JSON Schema (see the sketch below).
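A minimal sketch of what such a JSON-formatting plugin might do, with hypothetical names and a stand-in call function (real Higress plugins hook into the gateway's request and response phases as Wasm modules):

```go
package jsonplugin

import (
	"encoding/json"
	"errors"
	"fmt"
)

// wrapPrompt injects the desired schema so the model is instructed
// to answer in JSON.
func wrapPrompt(userPrompt, jsonSchema string) string {
	return fmt.Sprintf(
		"%s\n\nRespond ONLY with JSON matching this JSON Schema:\n%s",
		userPrompt, jsonSchema)
}

// ensureJSON retries the model call until the reply parses as JSON.
// A real plugin would also validate the reply against the schema itself.
func ensureJSON(call func(prompt string) (string, error), prompt string, retries int) (string, error) {
	for i := 0; i <= retries; i++ {
		reply, err := call(prompt)
		if err != nil {
			return "", err
		}
		var v any
		if json.Unmarshal([]byte(reply), &v) == nil {
			return reply, nil // reply is well-formed JSON
		}
	}
	return "", errors.New("model did not return valid JSON")
}
```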
- Flexible multi-model switching and fallback retry: connecting to multiple large models has become a standard feature of large-model application backends. It lets users choose which backend model to use, and it provides a fallback mechanism when a model fails or hits capacity limits (see the sketch below). [4]
- Content security and compliance: filters out harmful or inappropriate content, detects and blocks requests containing sensitive data, and audits AI-generated content for quality and compliance using content-security plugins. [5]
- Semantic caching: large-model API services are often priced at X yuan per million input tokens on a cache hit versus Y yuan on a cache miss, where X is far lower than Y; for the Tongyi series, X is only 40% of Y. By caching LLM responses in an in-memory database through a gateway plugin, both inference latency and cost can be reduced. The gateway layer automatically caches each user's conversation history and fills in the context in subsequent turns, so the model understands the conversational semantics. [6]
- Multi-API-key balancing: an API key identifies and authenticates a caller and controls its access rights to the API. Multi-API-key balancing means that when multiple API keys are available, the gateway distributes API requests across them evenly or according to specific rules.
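A minimal sketch of the fallback-retry idea: providers are tried in priority order and the first success wins. The provider type and call signature are illustrative; a real gateway would also distinguish retryable failures (timeouts, rate limits) from permanent errors:

```go
package fallback

import (
	"errors"
	"fmt"
)

// provider pairs a model name with a function that calls it.
type provider struct {
	name string
	call func(prompt string) (string, error)
}

// completeWithFallback walks the provider list in order, falling back
// to the next model whenever the current one fails or is overloaded.
func completeWithFallback(providers []provider, prompt string) (string, error) {
	var lastErr error
	for _, p := range providers {
		reply, err := p.call(prompt)
		if err == nil {
			return reply, nil // first healthy provider answers
		}
		lastErr = fmt.Errorf("%s failed: %w", p.name, err)
	}
	if lastErr == nil {
		lastErr = errors.New("no providers configured")
	}
	return "", fmt.Errorf("all providers failed: %w", lastErr)
}
```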
- Token quota management and flow control: the token is the common unit of measurement for large-model applications and accurately quantifies the amount of data they process. Just as traditional gateways manage service access, an AI gateway needs the ability to manage tokens, including observing usage, providing flow control, and configuring precise call-quota limits for each calling tenant (see the sketch below). [7][8]
- Traffic grayscale: both base models and large-model applications are continuously improving the quality of generated content, which keeps the change frequency of large-model applications high and makes model iteration rely heavily on A/B testing and grayscale release. As the traffic entry point, the AI gateway must play a key role in traffic grayscale and observation, including grayscale labeling and monitoring metrics such as entry-traffic latency and success rate.
- Call cost auditing: a large-model call consumes far more computing resources than a web application request, so the need to control call costs is more rigid. These costs include direct economic costs, such as the fees paid for third-party API services or the internal computing resources (servers, storage, bandwidth, etc.) consumed by API calls, and indirect costs, such as the resources wasted by failed API calls.
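A minimal sketch of per-tenant token quota management: each call's measured token usage is deducted from the tenant's remaining quota, and calls are rejected once it runs out. In production the counters would live in a shared store rather than process memory; all names here are illustrative:

```go
package quota

import (
	"fmt"
	"sync"
)

// tokenQuota tracks how many tokens each tenant may still consume
// in the current billing period.
type tokenQuota struct {
	mu        sync.Mutex
	remaining map[string]int64 // tenant ID -> tokens left
}

// Charge deducts used tokens from the tenant's quota, returning an
// error when the quota is exceeded so the gateway can reject the call.
func (q *tokenQuota) Charge(tenant string, used int64) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	left, ok := q.remaining[tenant]
	if !ok || left < used {
		return fmt.Errorf("tenant %s: token quota exhausted", tenant)
	}
	q.remaining[tenant] = left - used
	return nil
}
```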
Why on the gateway, rather than in the large-model service layer?
- Functional separation: the gateway and the large-model service layer have different core functions. The large-model service layer focuses on complex computing tasks, such as natural language processing and image recognition, to provide users with intelligent responses, while the API gateway manages API access, including security authentication, flow control, and protocol conversion. Implementing gateway capabilities on the gateway yields a clear separation of functions, makes each component's responsibilities clearer, and eases system development, maintenance, and expansion.
- System decoupling: if the API gateway function were implemented in the large-model service layer, the model service would be tightly coupled with API management. Whenever an API management policy changed (such as the security authentication method or the traffic-limiting rules), the stability and performance of the model service could be affected. Implementing gateway capabilities on the gateway decouples the model service from API management, letting the two evolve and upgrade independently and reducing system complexity and maintenance costs.
- Reducing the load on large models: large models usually need substantial computing resources and memory, and complex inference tasks already consume much of the system's capacity. If gateway functions such as authentication, rate limiting, and caching were implemented in the large-model service layer, they would further increase the model's load and hurt its processing speed and response time. Implementing them on the gateway pre-processes and filters requests before they reach the model, reducing unnecessary traffic into the service layer and improving the model's performance and efficiency.
- Improving concurrent processing: the gateway can distribute a large number of API requests evenly across multiple large-model service instances through load balancing and similar techniques, improving the system's concurrency (see the sketch below). If each model instance handled API management independently, the system's concurrent processing capacity would be limited; the gateway handles these tasks centrally and copes better with high-concurrency scenarios.
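As a sketch of the load-balancing point above, the snippet below rotates requests across several model service instances round-robin; the endpoints are hypothetical, and real gateways add health checks and weighted or least-load strategies:

```go
package balance

import "sync/atomic"

// roundRobin rotates requests across large-model service instances
// so no single instance absorbs all API traffic.
type roundRobin struct {
	endpoints []string      // backend instance addresses
	next      atomic.Uint64 // monotonically increasing pick counter
}

// Pick returns the next backend instance in rotation.
func (r *roundRobin) Pick() string {
	n := r.next.Add(1)
	return r.endpoints[(n-1)%uint64(len(r.endpoints))]
}
```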
- Unified security protection: as the entrance to the system, the gateway can run comprehensive security checks on every API request entering the system, forming a unified line of defense. Implementing identity authentication, authorization, and anti-attack functions on the gateway effectively keeps malicious requests out of the large-model service layer and protects the models and their data. If security functions were implemented in the service layer instead, the decentralization of model services could leave gaps in protection.
- Data protection: the gateway can encrypt and desensitize request and response data, securing it in transit and at rest. Handling these tasks in the large-model service layer would add complexity and computational burden to the model; handling them centrally on the gateway better protects users' sensitive information and spares the model from touching sensitive data directly.
- Convenient integration of new functions: as the business develops, API management may need new functions, such as supporting new security authentication protocols or introducing new flow-control algorithms. Implementing gateway capabilities on the gateway makes it easier to integrate these functions without large-scale changes to the model service layer, enabling quick responses to changing business needs and improving the system's scalability.
- Multi-model access: in practice, several different large-model services may be used at once. The gateway can act as a unified access point that provides the same API management services for all of them, simplifying the management and scheduling of multiple models. Implementing the gateway function separately at each model service would increase system complexity and management difficulty.
- Centralized monitoring and analysis: the gateway can centrally monitor and analyze all API requests and collect metrics such as response time, call frequency, and error rate. Analyzing this data reveals problems such as performance bottlenecks and security vulnerabilities in time, so corresponding optimizations and fixes can be made. Monitoring implemented in the model service layer would struggle to give a full picture of API calls across the whole system.
- Troubleshooting and fault location: when an API call fails, it is easier to troubleshoot and locate the fault on the gateway, which records detailed information about every request, including its source, parameters, and response. Analyzing this information quickly pinpoints the cause and location of the failure, reducing the time and cost of repair.
The evolution direction of AI gateways
Thanks to the dynamic extension capability of Wasm plugins, Higress has evolved rapidly in the AI era and developed AI gateway capabilities. The underlying large-model API management capabilities described in this article are already available in open-source Higress and in the Alibaba Cloud native API gateway:
(Screenshot: Alibaba Cloud Native API Gateway console)
At the same time, we provide AI API management capabilities on Alibaba Cloud's cloud-native API gateway, making it more convenient and efficient to manage APIs in the AI era.