Replicating Xiaozhi AI, Step 2: Learning Its WebSocket Protocol Through Two Core Flow Charts

Replicate Xiaozhi AI, master the core flow of its WebSocket protocol, and implement real-time communication.
Core content:
1. Background on replicating Xiaozhi AI on the Arduino framework
2. The role of the WebSocket protocol in Xiaozhi AI communication and the data types it carries
3. Detailed steps and configuration parameters for establishing a WebSocket connection
Preface
Continuing the attempt to replicate Xiaozhi AI, this time on the Arduino framework.
Last week I finished building the ESP32-S3 + ESP-SR + ESP-TTS development environment with VSCode + PlatformIO + Arduino (see the article "Replicating Xiaozhi AI, ESP32-S3 to build Arduino + ESP-SR + ESP-TTS development environment step record"). The main voice wake-up, command recognition, and text-to-speech functions are all running, so the next step is connecting to the WebSocket protocol of the Xiaozhi AI server.
However, the original author's 78/xiaozhi-esp32 project is somewhat complex and not very convenient to read, and I don't want to use the IDF build environment, so I looked for implementations on other platforms and actually found one: huangjunsen0406/py-xiaozhi.
That project is a desktop client with a GUI written in Python + PyTk. It supports switching between manual and automatic dialogue modes, and it is also a good way to learn about lightweight speech recognition on the PC.
Communication Process
The Xiaozhi AI client and server can communicate over either the WebSocket or the MQTT protocol. For convenience, we will use the WebSocket protocol here.
Protocol Overview
In the Xiaozhi AI communication flow, WebSocket provides real-time, two-way communication between the client and the server. It mainly carries the following kinds of data:
Control instructions: such as start/stop listening, interrupting TTS, etc.
Text information: such as the LLM's responses, emotion instructions, configuration information, etc.
Audio data:
Client -> Server: the recorded Opus-encoded audio stream.
Server -> Client: the Opus-encoded audio stream generated by TTS.
Status synchronization: such as TTS playback start/end.
Two formats are used for communication:
JSON: carries text, control commands, and status information.
Binary: carries Opus-encoded audio data.
Establishing a connection
The client initiates the connection: the client sends a WebSocket connection request to the server at WEBSOCKET_URL.
Send header information: when establishing the WebSocket connection, the client needs to send the necessary HTTP headers, including:
Authorization: Bearer <access_token> (configured via WEBSOCKET_ACCESS_TOKEN)
Protocol-Version: 1 (protocol version number)
Client-Id: client ID
Device-Id: device identifier (usually the device MAC address)
Currently, except for Device-Id, which needs to be generated by the client, the other fields are fixed values and can be set as follows:
"WEBSOCKET_URL": "wss://api.tenclass.net/xiaozhi/v1/",
"WEBSOCKET_ACCESS_TOKEN": "test-token",
"CLIENT_ID": "1dd91545-082a-454e-a131-1c8251375c9c",
Server response: the server accepts the connection.
Client sends hello: after the connection is successfully established, the client needs to send a hello message (JSON format).
hello_message = {
    "type": "hello",
    "version": 1,
    "transport": "websocket",
    "audio_params": {
        "format": AudioConfig.FORMAT,
        "sample_rate": AudioConfig.SAMPLE_RATE,
        "channels": AudioConfig.CHANNELS,
        "frame_duration": AudioConfig.FRAME_DURATION,
    },
}
The audio encoding parameters are preset here, but that is not a big problem, as the server will later push the settings it can accept.
Server responds with hello: providing a session ID and possibly initial configuration.
{
    "type": "hello",
    "version": 1,
    "transport": "websocket",
    "audio_params": {
        "format": "opus",
        "sample_rate": 24000,
        "channels": 1,
        "frame_duration": 20
    },
    "session_id": "a1f81xs89"
}
Note: the client must store session_id and use it in all subsequent messages that require a session identifier.
Note 2: the client needs to use audio_params to update its local Opus encoding settings.
Start listening:
{
    "session_id": "session-id",
    "type": "listen",
    "state": "start",
    "mode": "manual" | "auto" | "realtime"  // listening mode
}
Stop listening:
{
    "session_id": "session-id",
    "type": "listen",
    "state": "stop"
}
Wake word detected, format:
{
    "session_id": "session-id",
    "type": "listen",
    "state": "detect",
    "text": "Hello Xiaozhi"  // change according to the actual wake word
}
Abort, format:
{
    "session_id": "session-id",
    "type": "abort",
    "reason": "wake_word_detected"  // (optional) abort reason
}
Audio data, format: binary frame.
Content: audio data blocks encoded in the format agreed in the audio_config of session_info (Opus by default).
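On the wire, each binary frame carries one encoded audio packet. As an illustration (my own sketch; a real client would feed each chunk through an Opus encoder, e.g. the `opuslib` package, before sending it), raw mono PCM can be split into fixed-duration chunks like this:

```python
SAMPLE_RATE = 16000       # assumed capture rate, not mandated by the protocol
FRAME_DURATION_MS = 60    # one chunk per future Opus frame
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_DURATION_MS // 1000  # 960 samples

def pcm_to_chunks(pcm: bytes, bytes_per_sample: int = 2):
    """Yield fixed-size PCM chunks, one per Opus frame.

    Each chunk, once Opus-encoded, would be sent as one binary
    WebSocket frame.
    """
    frame_bytes = SAMPLES_PER_FRAME * bytes_per_sample  # 1920 bytes for 16-bit mono
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[i:i + frame_bytes]
```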
Server Authentication
When connecting to the official Xiaozhi AI backend for the first time, you need to add the device in the console.
Adding a device is convenient: when the client connects to the server and sends its first voice message, the server returns a voice message containing a 6-digit verification code, which can then be used to add the device in the console.
At this point, the WebSocket connection with the Xiaozhi AI server is established, and the subsequent dialogue flow can begin.
Client Messages
To communicate with Xiaozhi AI, the client generally needs to initiate the conversation by sending either the first audio data or the wake word.
listen (JSON)
Controls the state of audio listening (recording).
wake_word (JSON)
If a conversation is started with a wake word, another kind of listen message (state=detect) notifies the server that the wake word was detected, so that the server immediately returns a voice reply.
abort (JSON)
Requests the server to interrupt its current operation (mainly TTS voice playback).
This is mainly used when the Xiaozhi AI server is outputting a long voice reply but you want to start a new conversation.
audio (Binary)
Sends the recorded audio data.
IoT Messages
I won't play with this for now; I'll study its specific format later.
Server Messages
The message types returned by the Xiaozhi AI server are also divided into JSON and binary. JSON messages are distinguished by their type field.
Example JSON message format:
{
    "type": "tts",
    "state": "start",
    "sample_rate": 24000,
    "session_id": "session-id"
}
The type field identifies the message type: llm, tts, stt, etc.
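A minimal client-side dispatcher (my own sketch) routes each incoming frame on exactly this distinction: binary frames are audio, text frames are JSON keyed by `type`:

```python
import json

def classify_frame(raw):
    """Return (kind, payload) for one incoming WebSocket frame."""
    if isinstance(raw, (bytes, bytearray)):
        return "audio", raw           # Opus audio data
    msg = json.loads(raw)
    return msg.get("type"), msg       # "tts", "llm", "stt", ...
```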
type=tts (JSON)
This is the main message type returned by the Xiaozhi AI server: emotions, voice playback, and speech-to-text are all returned through these server messages.
It is fair to say that in the whole Xiaozhi AI interaction flow, the main workload is handled by the server, so the client implementation can stay relatively lightweight.
Within type=tts messages, the different state values need to be handled separately.
state=start
After receiving voice data from the client, the Xiaozhi AI server generates the corresponding LLM chat response and starts returning voice data. An audio sample_rate parameter is also given here and can be used to update the playback configuration.
{
    "type": "tts",
    "state": "start",
    "sample_rate": 24000,
    "session_id": "session-id"
}
state=sentence_start
The beginning of a sentence in the reply from Xiaozhi AI; the text field contains the text of the spoken sentence.
{
    "type": "tts",
    "state": "sentence_start",
    "text": "You seem to be in a bad mood, what happened?",
    "session_id": "session-id"
}
state=sentence_end
The end of a sentence in the reply from Xiaozhi AI.
{
    "type": "tts",
    "state": "sentence_end",
    "text": "You seem to be in a bad mood, what happened?",
    "session_id": "session-id"
}
state=stop
Xiaozhi AI has finished generating the response to the previously received voice, and the client can resume recording.
{
    "type": "tts",
    "state": "stop",
    "session_id": "session-id"
}
type=llm (JSON)
This message carries the emotion the large model wants to express in its reply. text is an Emoji; emotion is the corresponding emotion word, which can be mapped to a picture on devices that cannot display Emoji.
{
    "type": "llm",
    "text": "🤔",
    "emotion": "thinking",
    "session_id": "session-id"
}
The possible values of emotion are as follows:
static const std::vector<Emotion> emotions = {
    {"😶", "neutral"},    {"🙂", "happy"},      {"😆", "laughing"},
    {"😂", "funny"},      {"😔", "sad"},        {"😠", "angry"},
    {"😭", "crying"},     {"😍", "loving"},     {"😳", "embarrassed"},
    {"😯", "surprised"},  {"😱", "shocked"},    {"🤔", "thinking"},
    {"😉", "winking"},    {"😎", "cool"},       {"😌", "relaxed"},
    {"🤤", "delicious"},  {"😘", "kissy"},      {"😏", "confident"},
    {"😴", "sleepy"},     {"😜", "silly"},      {"🙄", "confused"}
};
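On a display that cannot render Emoji, the emotion word can be mapped to an image asset instead. The file names below are hypothetical, purely to illustrate the mapping:

```python
# Hypothetical asset file names, one per emotion word from the table above.
EMOTION_IMAGES = {name: f"{name}.png" for name in [
    "neutral", "happy", "laughing", "funny", "sad", "angry", "crying",
    "loving", "embarrassed", "surprised", "shocked", "thinking", "winking",
    "cool", "relaxed", "delicious", "kissy", "confident", "sleepy",
    "silly", "confused",
]}

def emotion_asset(emotion: str) -> str:
    """Fall back to neutral for unknown emotion words."""
    return EMOTION_IMAGES.get(emotion, EMOTION_IMAGES["neutral"])
```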
type=stt (JSON)
This is the text recognized from the voice the client sent to the Xiaozhi AI server. It can be shown on screen so that the full conversation of both parties is visible.
{
    "type": "stt",
    "text": "What's the weather like today",
    "session_id": "session-id"
}
type=iot (JSON)
As with the client-side IoT message, I have not studied this yet and will look into it later.
audio (Binary)
TTS audio data sent by the Xiaozhi AI server.
Format: binary frame.
Content: TTS audio data blocks encoded in the format agreed in the audio_params of the hello message (Opus by default). The client should decode and play them as soon as they arrive.
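Since the server can push these frames faster than they are played, it helps to decouple the WebSocket receive loop from playback with a small queue. This class is my own sketch, not the project's implementation:

```python
import queue

class PlaybackBuffer:
    """Decouple the WebSocket receive loop from audio playback.

    The receiver pushes raw Opus frames; a separate playback task pops
    them, decodes them (e.g. with the opuslib package) and plays the PCM.
    """

    def __init__(self):
        self._frames = queue.Queue()

    def push(self, opus_frame: bytes):
        self._frames.put(opus_frame)

    def pop(self, timeout: float = 0.1):
        """Return the next frame, or None if nothing arrived in time."""
        try:
            return self._frames.get(timeout=timeout)
        except queue.Empty:
            return None
```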
Core interaction flow charts
Manual dialogue interaction flow chart
Automatic dialogue interaction flow chart
Exception handling
The server actively disconnects
When you say "goodbye" to Xiaozhi AI, the server actively closes the connection. So if you then restart a manual conversation or trigger one with the wake word, the client needs to reconnect to the server first.
Network anomalies
When the network fails, simply re-establish the WebSocket connection following the normal initialization flow.
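For both cases, a reconnect loop with exponential backoff is a common pattern. The specific delay values below are my own choice, not something mandated by the protocol:

```python
def reconnect_delays(max_attempts: int = 5, base: float = 1.0, cap: float = 30.0):
    """Exponential backoff delays (seconds) between reconnection attempts.

    Each failed attempt doubles the wait, capped so that a long outage
    does not push the delay to absurd values.
    """
    return [min(base * (2 ** i), cap) for i in range(max_attempts)]
```

A reconnect loop would sleep for each delay in turn, retry the full initialization flow (connect, headers, hello), and reset the backoff after a successful hello.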
Summary
Overall, the Xiaozhi AI communication protocol is relatively simple. After a rough review, you can use Cursor + AI to quickly build a Python version of the client, and then try porting it to the ESP32.
Note that the flows and messages here were summarized from the official repository and from observing actual interactions, so there may be inaccuracies. If you find any errors, please point them out.
References
https://github.com/78/xiaozhi-esp32
https://github.com/huangjunsen0406/py-xiaozhi