The main battlefield of large model reasoning: What is the standard communication protocol?

Written by
Jasper Cole
Updated on: July 15, 2025

In the era of large model applications, how do SSE and WebSocket become the new standard for communication protocols?

Core content:
1. The demand for large model reasoning has increased sharply, and the main battlefield for performance improvement has shifted
2. The definition of SSE and WebSocket and their advantages in large model applications
3. The technical challenges and coping strategies of SSE and WebSocket

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
DeepSeek has accelerated the democratization of models, which has led to a surge in demand for large model reasoning. The main battlefield for improving large model performance has shifted from training to reasoning, and the increase in reasoning concurrency will give rise to new engineering requirements in computing, storage, networking, middleware, databases, and other fields.
This article will share SSE and WebSocket, two standard network communication protocols for large model applications, and let us get to know these two old friends in the new era again.
Table of contents:
  • What are SSE and WebSocket?
  • What was the mainstream network communication protocol before the emergence of large model applications?
  • Why don't large-scale model applications use the mainstream communication protocols of Web applications?
  • Why are SSE and WebSocket more suitable for supporting large model applications?
  • Technical Challenges and Solutions of Real-time Communication Protocols
  • What's Next?

What are SSE and WebSocket?

SSE (Server-Sent Events) is a network communication protocol based on HTTP that allows the server to push real-time data to the client in one direction. In large model scenarios, generated content needs to be transmitted to the client frequently: the server can send results in chunks as they are generated, so the user sees the content appear step by step instead of waiting for the server to finish processing everything. Main features:
  • Efficient one-way communication: Designed for one-way communication from server to client, it perfectly matches large model scenarios (the client sends a request once, and the server continuously returns streaming results).
  • Low latency: Each time a logical paragraph or token is generated, it can be pushed immediately, avoiding the long wait of the traditional HTTP request-response model.
  • Lightweight protocol: Based on HTTP/HTTPS, no additional protocol handshake is required (such as WebSocket's two-way negotiation), reducing connection overhead.
WebSocket is a network communication protocol that establishes a full-duplex, persistent connection between client and server for real-time, two-way data transmission. Unlike SSE, once a WebSocket connection is established, either party can send data at any time: the client can initiate a new request without waiting for the server to finish returning content. It is suitable for real-time synchronization of multiplayer online game operations, chat rooms in social software, and multi-user collaborative editing of online documents. The main features are:
  • Full-duplex communication : The client and server can send and receive data at the same time.

  • Persistent connection : After the connection is established, it remains open until it is actively closed.
  • Low latency : Data can be transmitted instantly, suitable for real-time applications.

What was the mainstream network communication protocol before the emergence of large model applications?

Before large model applications emerged, online Internet applications were mainly Web applications running in browsers and communicating with servers over HTTP/HTTPS. Examples include e-commerce, new retail/new finance/travel and other transaction applications; industry applications in education, media, and healthcare; and public websites, CRM, and other internal enterprise applications. Among them, HTTPS is the secure version of HTTP: it encrypts the transmitted data via SSL/TLS, at the cost of some performance loss during encryption and decryption.
Different types of network communication protocols from the perspective of API management
HTTPS is a stateless, application-layer protocol used to transmit hypertext (such as HTML files, images, videos, etc.) between clients (such as browsers) and servers. It has the following characteristics:
  • Based on request-response model.

  • Stateless: Each request is independent and the server does not save the client's state.
  • Data encryption prevents data from being eavesdropped or tampered with; identity authentication ensures that the client communicates with the correct server; data integrity prevents data from being modified during transmission. (HTTP is plain text transmission)
The advantages are:
  • Simple and easy to use : The HTTP protocol is simple in design and easy to implement and use.

  • Widely supported : Almost all browsers, servers, and development frameworks support HTTP.
  • Flexibility : Supports multiple data formats (such as JSON, XML, HTML) and content types.
  • Stateless : Simplifies server design and is suitable for distributed systems.
  • Security and Compliance : Protect data transmission through encryption technology to prevent eavesdropping and tampering; comply with modern cybersecurity standards (such as GDPR, PCI DSS).
Here we take TLS 1.3 as an example to understand the HTTPS handshake process between client and server. (TLS 1.3 simplifies the earlier handshake to a single round trip, with less performance loss and faster response.)

Why don't large-scale model applications use the mainstream communication protocols of Web applications?

Different types of large model applications have different requirements for network communication, but almost all of them are inseparable from the following requirements:
  • Real-time dialogue : Users interact with the model continuously, and the model needs to respond immediately. For example, Tongyi Qianwen and the Q&A robot on the Higress official website need to respond immediately to customer questions.

  • Streaming output : When a large model generates content, it returns the results word by word or sentence by sentence, rather than all at once. (By contrast, in messaging apps such as DingTalk and WeChat, a message between two people is delivered all at once rather than streamed.)
  • Long-running task processing : Large models may take a long time to process complex tasks and need to provide progress feedback to the client, especially when handling long texts and multimodal content such as images and videos. Responses that rely on large model computation consume far more resources than responses driven by hand-written business logic; this is also why large model computation relies on GPUs rather than CPUs, which are far weaker at parallel computing and large-scale matrix computation.
  • Multiple rounds of interaction : Users and models need to interact multiple times to maintain context. This is a necessary capability for large model applications to ensure user experience.
These scenarios have high requirements for real-time and two-way communication. If HTTPS, the mainstream communication protocol for Web applications, is used, the following problems will occur:
  • It only supports one-way communication, that is, the request-response model. The server can only respond when the client initiates the request. It cannot perform two-way communication, resulting in the inability to support streaming output and handle long-term tasks.

  • Each time the client makes a request, it needs to re-establish a connection, which increases latency and makes it impossible to support real-time conversations.
  • HTTPS is a stateless communication protocol. Each request is independent and the server does not save the client's status. Even if the client can repeatedly send context information with each request, it will bring additional network overhead and make it impossible to efficiently support multi-round interaction scenarios.
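The extra network overhead of resending context can be sketched with a short example. This is a minimal illustration (the model name and message contents are hypothetical): because the protocol is stateless, every new turn must carry the full conversation history, so the request payload grows round after round.

```javascript
// Minimal sketch: with a stateless protocol, each turn must resend the
// entire conversation history, so payloads grow round after round.
// (Model name and messages are hypothetical, for illustration only.)
function buildRequestBody(history, userInput) {
  // The full history plus the new user message goes into every request.
  const messages = [...history, { role: 'user', content: userInput }];
  return JSON.stringify({ model: 'some-chat-model', messages, stream: true });
}

const history = [];
const sizes = [];
for (const question of ['Hello', 'Tell me more', 'Summarize that']) {
  const body = buildRequestBody(history, question);
  sizes.push(body.length);
  // Pretend the server answered; both turns join the history we must resend.
  history.push({ role: 'user', content: question });
  history.push({ role: 'assistant', content: 'A reasonably long answer ...' });
}

console.log(sizes); // payload size increases every round
```

The growing `sizes` array is exactly the per-turn overhead the bullet above describes.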
Although HTTP has evolved into HTTP/2 and HTTP/3 with improved performance, it is still not native enough for scenarios with high real-time requirements such as large model applications, and has not become the mainstream communication protocol in such scenarios.

Why are SSE and WebSocket more suitable for supporting large model applications?


The workflow of SSE is as follows:

1. The client initiates an SSE connection

  • The client uses the JavaScript EventSource API to initiate an HTTP request to the server.

  • The request header contains Accept: text/event-stream, indicating that the client supports the SSE protocol.
  • Sample code:
const eventSource = new EventSource('https://example.com/sse-endpoint');

2. The server returns an HTTP response

  • The server response header must contain the following fields:
    • Content-Type: text/event-stream: Indicates that the response content is SSE data stream.

    • Cache-Control: no-cache: Disable cache to ensure data is updated in real time.
    • Connection: keep-alive: Maintain a long connection.
  • Example response header:
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
3. Server push data stream
  • The server continuously pushes data over the persistent HTTP connection. Each message begins with data: and ends with two newline characters (\n\n).
  • Example data flow:
data: { "message" : "Hello" }
data: { "message" : "World" }

4. Client processes data

  • The client listens for data pushed by the server via the onmessage event of EventSource.
  • Sample code:
eventSource.onmessage = (event) => {
  console.log('Received data:', event.data);
};

5. Connection closing or error handling

  • If the connection is lost (such as due to a network problem), the client automatically attempts to reconnect.
  • The server can send a retry: field to specify the reconnection interval in milliseconds.

  • Example reconnect settings:

retry: 5000
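The wire format used in steps 3 and 5 can be produced by a small helper. A minimal sketch (the function name is ours): it serializes one event into the data: ... framing and optionally emits a retry: line.

```javascript
// Minimal sketch of SSE wire framing: each event is one or more
// "data:" lines terminated by a blank line; "retry:" sets the
// client's reconnection interval in milliseconds.
function formatSseEvent(payload, { retryMs } = {}) {
  let out = '';
  if (retryMs !== undefined) out += `retry: ${retryMs}\n`;
  // Multi-line payloads become multiple data: lines in the same event.
  for (const line of String(payload).split('\n')) {
    out += `data: ${line}\n`;
  }
  return out + '\n'; // the blank line ends the event
}

console.log(formatSseEvent(JSON.stringify({ message: 'Hello' })));
console.log(formatSseEvent('ping', { retryMs: 5000 }));
```

A server would write each returned string to the open response stream as soon as a chunk of model output is ready.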
We can verify the network communication protocol used by the large model APP in the following way: 
  • Call the API and check the response headers :
    Request a streaming response with the stream=true parameter, then check the returned Content-Type through the browser developer tools or curl. If it is text/event-stream, the API is clearly using SSE.
    Example command: 
curl -X POST "https://api.deepseek.com/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-chat", "messages":[{"role":"user","content":"Hello"}], "stream":true}' \
  -v  # View detailed response headers
Expected Output: 
< HTTP/1.1 200 OK
< Content-Type: text/event-stream
< Transfer-Encoding: chunked
  • Data format verification :
    The data body of the streaming response has the format data: {...}\n\n, in accordance with the SSE specification [1].
    Example response snippet: 
data: {"id":"123","choices":[{"delta":{"content":"Hi"}}]}

data: [DONE]
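Note that the browser's EventSource API only issues GET requests, so clients calling POST-based chat-completion endpoints typically parse the SSE stream themselves. A minimal parser sketch (the function name is ours): split the stream on blank lines, collect the data: payloads, and stop at the [DONE] sentinel used by OpenAI-compatible APIs.

```javascript
// Minimal sketch: split an SSE byte stream into events on blank lines,
// collect the "data:" payloads, and stop at the [DONE] sentinel used by
// OpenAI-compatible APIs. (Function name is ours, for illustration.)
function parseSseChunk(text) {
  const payloads = [];
  for (const event of text.split('\n\n')) {
    for (const line of event.split('\n')) {
      if (!line.startsWith('data:')) continue;
      const data = line.slice('data:'.length).trim();
      if (data === '[DONE]') return { payloads, done: true };
      if (data) payloads.push(data);
    }
  }
  return { payloads, done: false };
}

const stream =
  'data: {"id":"123","choices":[{"delta":{"content":"Hi"}}]}\n\n' +
  'data: [DONE]\n\n';
const { payloads, done } = parseSseChunk(stream);
console.log(payloads.length, done); // 1 true
```

A real client would run this incrementally over chunks arriving from a fetch() response body rather than over one complete string.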

The workflow of WebSocket is as follows:

1. The client initiates a WebSocket handshake request

  • The client initiates a WebSocket handshake via an HTTP request, and the request header contains the following fields:
    • Upgrade: websocket: Indicates that the client wishes to upgrade to the WebSocket protocol.
    • Connection: Upgrade: Indicates that the client wishes to upgrade the connection.
    • Sec-WebSocket-Key: A randomly generated Base64 encoded string used for handshake verification.
  • Example request header:
GET /ws-endpoint HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

2. The server returns a WebSocket handshake response

  • The server verifies the client's handshake request and returns an HTTP 101 status code (Switching Protocols), indicating that the protocol upgrade was successful.
  • The response header contains the following fields:
    • Upgrade: websocket: Confirm the protocol upgrade.
    • Connection: Upgrade: Confirm the connection upgrade.
    • Sec-WebSocket-Accept: A value computed from the client's Sec-WebSocket-Key, used to verify the handshake.
  • Example response header:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

3. WebSocket connection establishment

  • After a successful handshake, the HTTP connection is upgraded to a WebSocket connection, and the client and server can communicate bidirectionally via the WebSocket protocol.
  • The connection is based on TCP and supports full-duplex communication.

4. Data Transfer

  • The client and server send and receive data frames through the WebSocket protocol. Data frames can be in text or binary format.
  • Text frame : used to transmit text data such as JSON and strings.
{"message": "Hello"}
  • Binary frame : used to transmit binary data such as pictures, audio, and video.
[0x01, 0x02, 0x03]

5. Connection closed

  • The client or server can actively send a Close Frame to terminate the connection.
  • The Close frame contains a close status code and an optional reason.
  • Example close frame:
Close Frame:
- Code: 1000 (Normal Closure)
- Reason: "Connection closed by client"
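On the wire, a Close frame is a small binary structure: FIN plus opcode 0x8 in the first byte, the payload length in the second, then a 2-byte status code followed by the UTF-8 reason. A minimal encoding sketch for the server side (server-to-client frames are unmasked; client-to-server frames must additionally be masked per RFC 6455):

```javascript
// Minimal sketch: encode an unmasked (server-side) WebSocket Close frame.
// Byte 0: FIN=1, opcode=0x8 (Close) -> 0x88
// Byte 1: payload length (status code + reason; assumes length < 126)
// Payload: 2-byte big-endian status code, then the UTF-8 reason text.
function encodeCloseFrame(code, reason = '') {
  const reasonBuf = Buffer.from(reason, 'utf8');
  const frame = Buffer.alloc(2 + 2 + reasonBuf.length);
  frame[0] = 0x88;
  frame[1] = 2 + reasonBuf.length; // no mask bit for server frames
  frame.writeUInt16BE(code, 2);    // e.g. 1000 = Normal Closure
  reasonBuf.copy(frame, 4);
  return frame;
}

const frame = encodeCloseFrame(1000, 'Connection closed by client');
console.log(frame[0].toString(16), frame.readUInt16BE(2)); // 88 1000
```

In practice, WebSocket libraries handle this framing; the sketch only makes the close status code and reason visible at the byte level.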
For example, OpenAI's Realtime API (suitable for scenarios with higher real-time requirements, where the client can initiate a request without waiting for the server to send the content) recommends using the WebSocket protocol to support low-latency multimodal interactions, including text and audio input and output. It has the following features: [2] 
  • Event-based communication: The Realtime API uses WebSocket for stateful, event-driven interaction; the client and server communicate by sending and receiving events in JSON format.
  • Persistent connections: The WebSocket protocol enables APIs to maintain a persistent two-way connection, allowing for immediate data flow, which is critical for real-time conversations and interactions.
  • Multimodal support: The API not only supports text input, but can also process audio data to provide a richer and more natural user experience.
It can be seen that both SSE and WebSocket can better support the real-time requirements of large model applications. The former is more lightweight, and the latter has more advantages in scenarios with higher real-time requirements because of its two-way communication. Here we compare the characteristics of each protocol through a table.
| Characteristic | HTTP/1.1 | HTTP/2 | SSE | WebSocket |
|---|---|---|---|---|
| Protocol basis | Based on TCP, text protocol | Based on TCP, binary protocol | Based on HTTP/1.1 or HTTP/2 | Based on TCP, independent full-duplex protocol |
| Communication mode | Request-response | Request-response | One-way (server → client) | Bidirectional (client ↔ server) |
| Connection multiplexing | Not supported (short connection by default) | Supported (multiplexing) | Supported (based on HTTP/1.1 or HTTP/2) | Supported (long connection) |
| Header compression | Not supported | Supported (HPACK compression) | Depends on the underlying HTTP protocol | Not supported |
| Data format | Text | Binary | Text | Text or binary |
| Latency | Higher (head-of-line blocking) | Lower (multiplexing solves head-of-line blocking) | Low (server push) | Very low (full-duplex real-time communication) |
| Reconnect after disconnection | Must be implemented manually | Must be implemented manually | Automatic reconnection supported | Must be implemented manually |
| Applicable scenarios | Traditional web pages, API requests | High-performance web pages and API requests | Real-time notifications, log streaming, progress updates | Real-time chat, online games, collaborative editing |
| Security | Requires HTTPS encryption | Requires HTTPS encryption | Requires HTTPS encryption | Requires WSS (WebSocket Secure) encryption |
| Protocol upgrade | No upgrade required | No upgrade required | No upgrade required | Requires protocol upgrade (HTTP → WebSocket) |
| Typical applications | Static resource loading, form submission | Streaming media, parallel loading of multiple resources | Question-and-answer large model applications | Online games, multi-player collaboration, and large model scenarios with higher real-time requirements |
In summary, HTTP/1.1 is suitable for traditional web pages and simple API requests, but its performance is relatively low. HTTP/2 is suitable for modern high-performance web pages and solves the head-of-line blocking problem of HTTP/1.1. SSE is suitable for scenarios where the server actively pushes real-time data, such as large model applications with one-question-one-answer. WebSocket is suitable for scenarios that require two-way real-time communication, such as online games, multi-person collaboration, and large model application scenarios with higher real-time requirements (scenarios where the generation process can be interrupted at any time to ask questions, such as large model real-time debate platforms).
In addition, WebRTC is also widely used in large model application scenarios. For example, when calling the Realtime API, OpenAI officially recommends using WebSocket or WebRTC[3].

Technical Challenges and Solutions of Real-time Communication Protocols

Although SSE and WebSocket are naturally suited to games, social networking, large model applications, and other scenarios that require real-time communication, some engineering challenges still arise as the user base expands.
If data is compared to cargo, HTTPS is a small ferry suited to short-distance, small-volume transport, while SSE and WebSocket are large cargo ships suited to long-distance, large-volume transport. The gateway then acts as the transit hub connecting land and water: it controls the order and direction of ships (routing, load balancing), performs safety checks on cargo ships (authentication), and sets up emergency and backup channels (traffic control, high-availability assurance). Since large cargo ships enter the hub continuously (long connections), each carrying a large amount of data for many users, higher demands are placed on the gateway that establishes and maintains SSE and WebSocket connections.
The following lists specific challenges and solutions at the gateway layer. The solutions are for reference only. There is no single answer to engineering problems. You should choose a solution that suits you based on the actual situation of the business and technical teams.

Stability risks caused by software changes and service expansion and contraction

Technical Challenges:
  • The faster an application evolves, especially for emerging applications, the higher the frequency of software changes, and gateway upgrades are an important part of those changes. However, gateway upgrades usually involve service restarts, configuration changes, or network switches, which directly affect the stability of SSE and WebSocket connections.
  • During scale-out (adding instances), existing SSE and WebSocket clients may fail to connect to the new instances. During scale-in (removing instances), existing SSE and WebSocket connections may be forcibly closed when a service goes offline. For applications with high real-time requirements, such as games and real-time large model chat, this degrades the user experience.
Solution:
  • Lossless on- and off-line capability: This capability is widely used when microservices are changed, and can effectively reduce the stability risk of version release and scaling. It is commonly found in commercial capabilities of cloud products. For example, Alibaba Cloud's cloud-native API gateway provides microservice governance capabilities for HTTPS/WebSocket.
  • Client reconnection mechanism: Design an automatic reconnection mechanism on the client to reduce the impact of interruptions. As with lossless on/offlining, heartbeat packets are used to detect the connection status; once interrupted, the client reconnects automatically. In addition, the server can record the data already sent to support resuming from a breakpoint.
  • Protocol switching mechanism: When SSE and WebSocket are unavailable, fall back to long polling, but this depends on whether the gateway itself supports these long connections.
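The reconnection mechanism above is commonly implemented with exponential backoff plus a cap, so a flapping server is not hammered with retries. A minimal sketch of the delay schedule (the numbers are illustrative, not a recommendation):

```javascript
// Minimal sketch: exponential backoff schedule for client reconnection.
// The delay doubles per failed attempt, capped at maxDelayMs.
// (Values are illustrative; real clients often add random jitter.)
function backoffDelayMs(attempt, baseMs = 1000, maxDelayMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxDelayMs);
}

const schedule = [0, 1, 2, 3, 4, 5, 6].map((n) => backoffDelayMs(n));
console.log(schedule); // [ 1000, 2000, 4000, 8000, 16000, 30000, 30000 ]
```

A real client would wait backoffDelayMs(attempt) between reconnection attempts and reset the attempt counter after a successful heartbeat.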

Large bandwidth leads to rapid increase in memory stability risk and bandwidth cost

Technical Challenges:
Large models often need to process long texts, as well as multimodal content such as images and videos. The bandwidth consumption far exceeds that of Web applications, resulting in a rapid increase in memory usage and higher bandwidth costs.
Solution:
Choose a gateway that supports streaming (such as Higress) to transmit the generated content in blocks to reduce the amount of data transmitted at a time. At the same time, use compression algorithms (such as Gzip) to reduce data transmission volume and control bandwidth costs. Alibaba Cloud's cloud-native API gateway will soon launch a software and hardware integrated content compression solution, which can reduce bandwidth transmission costs by 20%+.

High latency increases the resource cost of preventing malicious attacks

Technical challenges: Compared with Web applications, large model applications consume more computing resources during inference. For example, when a DDoS attack occurs, Web applications will consume 1:1 computing resources to respond to the attack, while large model applications will consume 1:x (x is much greater than 1) backend resources, making large models more vulnerable to malicious attacks.
Countermeasures: Deploy three-dimensional protection measures at the gateway layer, including authentication, security protection, traffic control, etc.
  • Authentication: Check the compliance of client requests. Select a third-party authentication protocol based on specific business needs; in our experience with customers, most choose OAuth2 and JWT.

  • Security protection: Design security protection measures through IP restrictions or based on features such as URLs and request headers.
  • Traffic control: Token-level traffic control based on URL parameters, HTTP request headers, client IP addresses, consumer names, or keys in cookies.
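Token-level traffic control is often implemented as a token bucket per client key: each request spends tokens proportional to what it consumes, and the bucket refills at a fixed rate. A minimal sketch (the class and parameter names are ours, and real gateways track buckets per consumer/IP/key):

```javascript
// Minimal token-bucket sketch for per-key traffic control.
// Each request spends "cost" tokens; the bucket refills at refillPerSec.
// (Class and parameter names are ours, for illustration.)
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSec = refillPerSec;
    this.last = 0; // logical clock in seconds, passed into tryConsume
  }

  tryConsume(cost, nowSec) {
    const elapsed = nowSec - this.last;
    this.last = nowSec;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    if (this.tokens < cost) return false; // over the limit: reject or throttle
    this.tokens -= cost;
    return true;
  }
}

const bucket = new TokenBucket(100, 10); // 100-token burst, 10 tokens/sec
console.log(bucket.tryConsume(80, 0));   // true  (within the burst)
console.log(bucket.tryConsume(80, 0));   // false (bucket nearly empty)
console.log(bucket.tryConsume(80, 10));  // true  (refilled, capped at 100)
```

For large model traffic, the "cost" can be the number of model tokens in a request rather than the request count, which matches the token-level control described above.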
Higress and the commercial version of the cloud-native API gateway provide a wealth of out-of-the-box plug-ins (available in the product's plug-in marketplace), reducing users' development, adaptation, and maintenance costs.
These protective measures can not only improve the robustness of the system in the face of malicious attacks, but also improve the stability of the system when facing external irregular traffic.

What's Next?

In addition to driving more frequent use of SSE and WebSocket, large model applications are also promoting the concept of API First. In the past, online applications exposed their capabilities through services, but large model applications will provide service capabilities through APIs. Beyond the base model vendors that already serve a large number of developers through APIs, large model application vendors have also begun to provide API services.
For example, Perplexity recently launched its AI search API service for enterprise customers and developers, the basic Sonar and the advanced Sonar Pro, allowing enterprises and developers to build Perplexity's generative AI search into their own applications.
The benefits of this are twofold. First, Perplexity can make its AI search ubiquitous, not limited to its own app and website. One example is its client Zoom: Sonar lets Zoom's AI chatbot provide real-time, citation-backed answers based on web searches, without requiring Zoom's video users to leave the chat window.