Replicating Xiaozhi AI, Part 2: Learning Its WebSocket Protocol Through Two Core Flow Charts

Written by Clara Bennett
Updated on: June 30, 2025
Recommendation

Replicate Xiaozhi AI, master the core flow of its WebSocket protocol, and implement real-time communication.

Core content:
1. Background introduction of replicating Xiaozhi AI based on Arduino framework
2. The role and data type of WebSocket protocol in Xiaozhi AI communication
3. Detailed steps and configuration parameters for establishing WebSocket connection


Preface

Continuing the effort to replicate Xiaozhi AI, this time based on the Arduino framework.

Last week, I finished setting up the ESP32-S3 + ESP-SR + ESP-TTS development environment with VSCode + PlatformIO + Arduino (see the article "Replicating Xiaozhi AI, ESP32-S3 to build Arduino + ESP-SR + ESP-TTS development environment step record"). The main voice wake-up, command recognition, and text-to-speech functions are all running, so the next step is connecting to the WebSocket protocol of the Xiaozhi AI server.

However, the original author's 78/xiaozhi-esp32 project is somewhat complex and not easy to read, and I don't want to use the ESP-IDF build environment, so I looked for implementations on other platforms and actually found one: huangjunsen0406/py-xiaozhi. That project is a desktop client with a GUI written in Python + PyTk. It supports switching between manual and automatic dialogue modes, and it is also a good way to learn about lightweight speech recognition on a PC.


Communication Process

The Xiaozhi AI client and server can communicate over either WebSocket or MQTT. For convenience, we will use the WebSocket protocol here.

Protocol Overview

In the communication process of Xiaozhi AI, WebSocket is used to achieve real-time, two-way communication between the client and the server. It mainly transmits the following types of data:

  • Control instructions: such as start/stop listening, interrupting TTS, etc.

  • Text information: such as the LLM's responses, emotion indicators, configuration information, etc.

  • Audio data:

    • Client -> Server: recorded Opus-encoded audio stream.

    • Server -> Client: Opus-encoded audio stream generated by TTS.

  • Status synchronization: such as TTS playback start/end.

There are two main formats used for communication:

  • JSON: used to transfer text, control commands, and status information.

  • Binary: used to transmit Opus-encoded audio data.

Establishing a connection

  1. The client initiates the connection: the client sends a WebSocket connection request to the server at WEBSOCKET_URL.

  2. Send header information: when establishing the WebSocket connection, the client needs to send the necessary HTTP headers, including:

  • Authorization: Bearer <access_token> (configured via WEBSOCKET_ACCESS_TOKEN)

  • Protocol-Version: 1 (protocol version number)

  • Client-Id: client ID

  • Device-Id: device identifier (usually the device's MAC address)

  • Currently, except for Device-Id, which must be generated by the client, the other fields are fixed values and can be set as follows:

    "WEBSOCKET_URL": "wss://api.tenclass.net/xiaozhi/v1/",
    "WEBSOCKET_ACCESS_TOKEN": "test-token",
    "CLIENT_ID": "1dd91545-082a-454e-a131-1c8251375c9c",
  • Server response: the server accepts the connection.

  • Client sends hello: after the connection is successfully established, the client needs to send a hello message (JSON format).

    hello_message = {
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
            "format": AudioConfig.FORMAT,
            "sample_rate": AudioConfig.SAMPLE_RATE,
            "channels": AudioConfig.CHANNELS,
            "frame_duration": AudioConfig.FRAME_DURATION,
        },
    }

    The audio encoding parameters are preset here, but that is not a problem: the server will later push the settings it can accept.

  • Server responds with hello: providing a session ID and a possible initial configuration.

    {
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
            "format": "opus",
            "sample_rate": 24000,
            "channels": 1,
            "frame_duration": 20
        },
        "session_id": "a1f81xs89"
    }

    Note: the client must store session_id and use it in all subsequent messages that require a session identifier.

    Note 2: the client needs to use audio_params to update its local Opus encoding settings.

  • Server Authentication

    When connecting to the official Xiaozhi AI backend for the first time, you need to add the device in the console.

    Adding a device is convenient: when the client connects to the server and sends its first voice message, the server returns a voice message containing a 6-digit verification code, with which the device can be added in the backend console.

    At this point, the WebSocket connection with the Xiaozhi AI server has been established, and the subsequent dialogue process can begin.
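Putting the steps above together, here is a minimal handshake sketch in Python. It assumes the third-party `websockets` library; `DEVICE_ID` is a placeholder for the real MAC address, and the client-side audio parameters are illustrative presets (the server's hello may override them):

```python
import json

# Fixed values from the configuration above; Device-Id must be generated
# by the client (usually the device's MAC address).
WEBSOCKET_URL = "wss://api.tenclass.net/xiaozhi/v1/"
ACCESS_TOKEN = "test-token"
CLIENT_ID = "1dd91545-082a-454e-a131-1c8251375c9c"
DEVICE_ID = "aa:bb:cc:dd:ee:ff"  # placeholder MAC address

def build_headers():
    """HTTP headers sent when opening the WebSocket connection."""
    return {
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Protocol-Version": "1",
        "Client-Id": CLIENT_ID,
        "Device-Id": DEVICE_ID,
    }

def build_hello():
    """Client 'hello' message sent right after the connection opens."""
    return json.dumps({
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
            "format": "opus",
            "sample_rate": 16000,   # illustrative presets; the server's
            "channels": 1,          # hello may push different values
            "frame_duration": 60,
        },
    })

async def connect():
    """Open the connection, exchange hellos, and return the session info."""
    import websockets  # third-party: pip install websockets

    # Note: older websockets releases name this parameter 'extra_headers';
    # newer ones use 'additional_headers'.
    async with websockets.connect(WEBSOCKET_URL,
                                  extra_headers=build_headers()) as ws:
        await ws.send(build_hello())
        server_hello = json.loads(await ws.recv())
        # Store session_id for all later messages, and use audio_params
        # to update the local Opus settings.
        return server_hello["session_id"], server_hello["audio_params"]
```

This is a sketch rather than a definitive implementation; the official repositories remain the source of truth for the exact handshake behavior.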


    Client Messages

    To converse with Xiaozhi AI, the client generally needs to actively initiate the conversation, sending either the first audio data or a wake-word notification.

    listen (JSON)

    Controls the state of audio listening (recording).

    • Start listening:

      {
          "session_id": "session-id",
          "type": "listen",
          "state": "start",
          "mode": "manual"   // or "auto" / "realtime": the listening mode
      }

    • Stop listening:

      {
          "session_id": "session-id",
          "type": "listen",
          "state": "stop"
      }
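As a sketch, the two messages above can be produced by small helper functions (the function names are my own; only the JSON fields come from the protocol):

```python
import json

def listen_start(session_id: str, mode: str = "manual") -> str:
    """Ask the server to start listening; mode is 'manual', 'auto' or 'realtime'."""
    assert mode in ("manual", "auto", "realtime")
    return json.dumps({
        "session_id": session_id,
        "type": "listen",
        "state": "start",
        "mode": mode,
    })

def listen_stop(session_id: str) -> str:
    """Ask the server to stop listening."""
    return json.dumps({
        "session_id": session_id,
        "type": "listen",
        "state": "stop",
    })
```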

    wake_word (JSON)

    If you start a conversation with a wake word, use another variant of the listen message to notify the server that the wake word was detected; the server will then immediately return a voice response.

    • Format :

      {
          "session_id": "session-id",
          "type": "listen",
          "state": "detect",
          "text": "Hello Xiaozhi"   // adjust to the actual wake word
      }

    abort (JSON)

    Request the server to interrupt the current operation (mainly TTS voice playback).

    • Format :

      {
          "session_id": "session-id",
          "type": "abort",
          "reason": "wake_word_detected"   // (optional) abort reason
      }

    This is mainly used when the Xiaozhi AI server is outputting a long voice response but you want to start a new conversation.

    audio (Binary)

    Send recorded audio data.

    • Format : Binary Frame.

    • Content: audio data blocks encoded in the format agreed in the session's audio_config (Opus by default).
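A hedged sketch of the sending side: encode each raw PCM frame with Opus and send it as a binary WebSocket frame. The encoder is injected here so any Opus binding can be used (e.g. the third-party `opuslib` package, which is an assumption, not part of the Xiaozhi protocol):

```python
def send_audio(ws, pcm_frames, encoder, samples_per_frame):
    """Encode each raw PCM frame and send it as a binary WebSocket frame.

    'encoder' is any object with an encode(pcm_bytes, frame_size) method
    (for example an Opus encoder from a binding such as opuslib); 'ws'
    only needs a send(bytes) method. Audio goes out as raw binary frames,
    with no JSON wrapper around the Opus packets.
    """
    for pcm in pcm_frames:
        packet = encoder.encode(pcm, samples_per_frame)
        ws.send(packet)

# For 16 kHz mono audio with a 60 ms frame duration, each frame holds
# 16000 * 60 / 1000 = 960 samples.
SAMPLES_PER_FRAME_16K_60MS = 16000 * 60 // 1000
```

The frame size must match the frame_duration negotiated in the hello exchange, otherwise the server cannot decode the stream.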

    IoT Messages

    I won't cover this for now; I'll study its specific format later.


    Server Messages

    The message types returned by the Xiaozhi AI server are likewise divided into JSON and binary. JSON messages rely on the type field to distinguish their actual content.

    Example JSON message format:

{
    "type": "tts",
    "state": "start",
    "sample_rate": 24000,
    "session_id": "session-id"
}

The type field identifies the message type: llm, tts, stt, and so on.

type=tts (JSON)

This is the main message type returned by the Xiaozhi AI server: emotion cues, voice playback, and speech-to-text display all revolve around messages of this type.

It can be said that in the entire interaction process of Xiaozhi AI, the main workload is completed by the server, and the client implementation can be relatively lightweight.

Within type=tts messages, different values of the state field need to be handled specifically.

state=start

After receiving voice data from the client, the Xiaozhi AI server generates the corresponding LLM chat response and starts returning voice data. A sample_rate parameter for the audio is also given here and can be used to update the playback configuration.

{
    "type": "tts",
    "state": "start",
    "sample_rate": 24000,
    "session_id": "session-id"
}

state=sentence_start

The beginning of a sentence in the response returned by Xiaozhi AI; the text field contains the text of the speech being spoken.

{
    "type": "tts",
    "state": "sentence_start",
    "text": "You seem to be in a bad mood, what happened?",
    "session_id": "session-id"
}

state=sentence_end

The end of a sentence in the dialogue returned by Xiaozhi AI.

{
    "type": "tts",
    "state": "sentence_end",
    "text": "You seem to be in a bad mood, what happened?",
    "session_id": "session-id"
}

state=stop

Xiaozhi AI has finished the response generated for the previously received voice, and the client can resume recording.

{
    "type": "tts",
    "state": "stop",
    "session_id": "session-id"
}
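The four tts states above can be handled with a small dispatcher. This is a sketch with placeholder callbacks (the function and callback names are my own, not part of the protocol):

```python
def handle_tts(msg, on_audio_start, on_sentence, on_done):
    """Dispatch a type=tts JSON message on its 'state' field.

    The callbacks are placeholders: on_audio_start(sample_rate) reconfigures
    the audio player, on_sentence(text) displays the subtitle text, and
    on_done() lets the client resume recording.
    """
    state = msg["state"]
    if state == "start":
        # 24000 is only a fallback; the message normally carries sample_rate.
        on_audio_start(msg.get("sample_rate", 24000))
    elif state == "sentence_start":
        on_sentence(msg["text"])
    elif state == "sentence_end":
        pass  # sentence finished; nothing required beyond bookkeeping
    elif state == "stop":
        on_done()
```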

type=llm (JSON)

This message returns the emotion the large model expresses in its reply. text is an Emoji; emotion is the corresponding emotion word, which can be mapped to images on devices that cannot display Emoji.

{
    "type": "llm",
    "text": "?",
    "emotion": "thinking",
    "session_id": "session-id"
}

The possible values of emotion are as follows:

static const std::vector<Emotion> emotions = {
    {"?", "neutral"},    {"?", "happy"},      {"?", "laughing"},
    {"?", "funny"},      {"?", "sad"},        {"?", "angry"},
    {"?", "crying"},     {"?", "loving"},     {"?", "embarrassed"},
    {"?", "surprised"},  {"?", "shocked"},    {"?", "thinking"},
    {"?", "winking"},    {"?", "cool"},       {"?", "relaxed"},
    {"?", "delicious"},  {"?", "kissy"},      {"?", "confident"},
    {"?", "sleepy"},     {"?", "silly"},      {"?", "confused"}
};
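On a display without Emoji support, the emotion word can be mapped to a local image, as the text above suggests. A minimal sketch (the file names and the fallback choice are hypothetical):

```python
# Hypothetical mapping from emotion keywords to local image assets;
# in practice there would be one entry per emotion in the table above.
EMOTION_IMAGES = {
    "neutral": "img/neutral.png",
    "happy": "img/happy.png",
    "thinking": "img/thinking.png",
}

def emotion_image(emotion: str) -> str:
    """Return the image path for an emotion, falling back to neutral."""
    return EMOTION_IMAGES.get(emotion, EMOTION_IMAGES["neutral"])
```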

type=stt (JSON)

This is the text recognized from the voice the client sent to the Xiaozhi AI server; it can be displayed on screen to show both sides of the conversation.

{
    "type": "stt",
    "text": "What's the weather like today",
    "session_id": "session-id"
}

type=iot (JSON)

Just like the client-side IoT message, this has not been studied yet; we will look into it later.

audio (Binary)

TTS audio data sent by Xiaozhi AI server.

  • Format : Binary Frame.

  • Content : TTS audio data blocks encoded in the format agreed in the hello message's audio_params (Opus by default). The client should decode and play them immediately upon receipt.
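Since the server mixes JSON text frames and binary Opus frames on the same connection, incoming frames can be routed by their Python type. A sketch (the handler names are my own):

```python
import json

def route_frame(frame, on_json, on_audio):
    """Route one incoming WebSocket frame.

    Text frames carry JSON control/status messages; binary frames carry
    Opus audio. 'on_json(dict)' and 'on_audio(bytes)' are placeholder
    callbacks supplied by the client.
    """
    if isinstance(frame, (bytes, bytearray)):
        on_audio(bytes(frame))    # Opus packet: decode and play immediately
    else:
        msg = json.loads(frame)   # JSON: dispatch further on its 'type' field
        on_json(msg)
```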


Core interaction flow chart

Manual dialogue interaction process

Automatic dialogue interaction flow chart


Exception handling

The server actively disconnects

When you say "goodbye" to Xiaozhi AI, the server actively disconnects. If you then restart a manual conversation or trigger one with a wake word, you need to reconnect to the server first.

Network anomaly

When a network anomaly occurs, simply reconnect the WebSocket following the normal initialization process.
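Both disconnect cases can be covered by the same reconnect wrapper: re-run the normal initialization with a short backoff. A sketch, where `connect_and_run` is a placeholder coroutine for the handshake described earlier (the exception type caught here is illustrative; your WebSocket library may raise its own connection-closed errors):

```python
import asyncio

async def run_with_reconnect(connect_and_run, retries=None,
                             initial_backoff=1, max_backoff=30):
    """Re-run the normal initialization whenever the connection ends.

    'connect_and_run' is a placeholder coroutine that performs the full
    handshake (headers + hello exchange) and processes messages until the
    server disconnects or the network fails. retries=None retries forever.
    """
    backoff = initial_backoff
    attempt = 0
    while retries is None or attempt < retries:
        attempt += 1
        try:
            await connect_and_run()
            backoff = initial_backoff   # clean exit (e.g. "goodbye"): reset
        except OSError:
            pass                        # network error: retry with backoff
        await asyncio.sleep(backoff)
        backoff = min(backoff * 2, max_backoff)
```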


Summary

In general, Xiaozhi AI's communication protocol is fairly simple. After this rough review, you can use Cursor + AI to quickly build a Python version of the client, and then try porting it to the ESP32.

The flows and messages here were summarized with reference to the official repository and observed interactions, so there may be inaccuracies; corrections are welcome.


References

  • https://github.com/78/xiaozhi-esp32

  • https://github.com/huangjunsen0406/py-xiaozhi

