Low-latency Xiaozhi AI server construction - ASR (continued): CPU can run it

Monkey Brother shares how to build a low-latency ASR for an AI server, with a detailed walkthrough of a CPU inference solution.
Core content:
1. Introduction to sherpa-onnx, a cross-platform open-source speech processing toolkit
2. ASR model selection: a streaming-inference paraformer model
3. CPU inference solution: model loading and implementation
The cost of GPU inference is enough to deter most hobbyists. If you:
are averse to high costs, or are an individual player who doesn't mind a slightly degraded user experience,
then you can try the CPU inference solution below.
This article first introduces sherpa-onnx, a high-performance open-source speech processing project.
Then it selects one of its models and implements real-time CPU inference for the Xiaozhi AI server's ASR.
1. Introduction to sherpa-onnx
https://github.com/k2-fsa/sherpa-onnx
sherpa-onnx is a cross-platform, multi-language speech processing toolkit built on onnxruntime. It supports a wide range of functions, including:
voice activity detection, speech-to-text, text-to-speech, and speaker segmentation.
As usual, let me briefly introduce the highlights:
Cross-platform compatibility: runs on operating systems including Windows, macOS, Linux, Android, and iOS, as well as various embedded systems.
Multi-language API: provides interfaces in 11 mainstream programming languages.
High performance: based on ONNX Runtime, suitable for deployment on devices with a wide range of compute capabilities.
2. Model selection
Model list: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
According to the actual tests in the previous article, the VAD model has few parameters and its latency is negligible.
Therefore, the VAD model choice stays the same as in the previous article.
2.1 ASR Model
If we want the ASR recognition quality to be acceptable, a sufficient number of model parameters is essential.
At the same time, latency must be as low as possible.
In summary, for the ASR model we can choose one that supports streaming inference: paraformer, which is also the streaming inference model used on Alibaba Cloud.
2.2 Voiceprint Vector Model
Considering that inference latency is positively correlated with parameter count, to balance accuracy and latency you can choose 3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.
3. CPU Inference Solution
3.1 Model loading
The VAD model is loaded as in the previous article. The ASR and voiceprint recognition models are loaded via sherpa-onnx.
import sherpa_onnx
from funasr import AutoModel  # FunASR provides the FSMN VAD model

class ModelManager:
    def __init__(self):
        self.vad_model = None
        self.asr_model = None
        self.sv_model = None

    def load_models(self):
        # VAD: same FSMN model as in the previous article
        self.vad_model = AutoModel(model="ckpts/speech_fsmn_vad_zh-cn-16k-common-pytorch")

        # ASR: streaming bilingual paraformer, int8-quantized
        model_dir = "ckpts/sherpa-onnx-streaming-paraformer-bilingual-zh-en"
        self.asr_model = sherpa_onnx.OnlineRecognizer.from_paraformer(
            encoder=f"{model_dir}/encoder.int8.onnx",
            decoder=f"{model_dir}/decoder.int8.onnx",
            tokens=f"{model_dir}/tokens.txt",
        )

        # Voiceprint: CAM++ speaker-verification model
        model_dir = "ckpts/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx"
        config = sherpa_onnx.SpeakerEmbeddingExtractorConfig(model=model_dir)
        self.sv_model = sherpa_onnx.SpeakerEmbeddingExtractor(config)
3.2 Model Inference
The core logic of the service is:
When VAD detects active audio, asr_model creates an audio stream self.stream:
if self.stream is None:
    self.stream = self.model_manager.asr_model.create_stream()
    self.input_stream(self.audio_buffer)
Then, each incoming audio frame is pushed into the audio stream:
def on_audio_frame(self, frame):
    # Convert 16-bit PCM bytes to float32 samples in [-1, 1)
    frame_fp32 = np.frombuffer(frame, dtype=np.int16).astype(np.float32) / 32768
    if self.stream:
        self.input_stream(frame_fp32)
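As a quick sanity check (the sample values below are illustrative, not from the article), the int16-to-float32 conversion above normalizes PCM samples into the range [-1, 1):

```python
import numpy as np

# Four representative 16-bit PCM samples: silence, half scale, minimum, maximum
frame = np.array([0, 16384, -32768, 32767], dtype=np.int16).tobytes()

# Same conversion as in on_audio_frame: reinterpret the bytes, then scale by 2**15
samples = np.frombuffer(frame, dtype=np.int16).astype(np.float32) / 32768

# samples: 0.0, 0.5, -1.0, ~0.99997
print(samples)
```

Dividing by 32768 (not 32767) keeps the conversion a pure power-of-two scale, which is why the positive end lands just below 1.0.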
Testing revealed how streaming-paraformer's so-called streaming inference actually works: decoding is triggered once for every ten 60 ms audio chunks.
A single decode takes less than 0.3 s.
So the RTF (real-time factor) can indeed stay below 1.
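To make the real-time-factor claim concrete, here is the arithmetic using the article's own numbers (ten 60 ms chunks per decode, at most 0.3 s per decode):

```python
# RTF (real-time factor) = processing time / audio duration; RTF < 1 means real time
chunk_ms = 60           # each audio chunk is 60 ms
chunks_per_decode = 10  # decoding fires once every 10 chunks
decode_s = 0.3          # measured upper bound for a single decode

audio_s = chunk_ms * chunks_per_decode / 1000  # 0.6 s of audio per decode
rtf = decode_s / audio_s

print(rtf)  # 0.5, i.e. decoding runs 2x faster than real time
```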
Therefore, the audio data is pushed into a consumer queue for asynchronous processing, which in theory enables real-time inference.
The changes are as follows:
First, audio data is no longer decoded directly, but pushed into a queue:
# self.input_stream(frame_fp32)
self.asr_queue.put(frame_fp32)
Then, create a worker thread that asynchronously consumes audio from the queue:
def _decode_worker(self):
    while True:
        try:
            # Get audio data with a blocking call and a short timeout
            chunk = self.asr_queue.get(timeout=0.01)
            if self.stream:  # make sure the stream exists
                self.running = True
                self.input_stream(chunk)
        except Empty:
            self.running = False  # clear the decoding flag
            continue  # keep waiting while the queue is empty
        except Exception as e:
            logging.error(f"Error in decode worker: {e}", exc_info=True)
Finally, the voiceprint vector model extracts an embedding from the audio:
def generate_embedding(self):
    if self.audio_buffer.shape[0] == 0:
        return []
    # Use the last 3 seconds of audio for the speaker embedding
    stream = self.model_manager.sv_model.create_stream()
    stream.accept_waveform(SAMPLE_RATE, self.audio_buffer[-SAMPLE_RATE * 3:])
    stream.input_finished()
    embedding = self.model_manager.sv_model.compute(stream)
    return embedding
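The article does not show how the embedding is used downstream, but speaker-verification embeddings are typically compared with cosine similarity. A minimal sketch follows; the vectors and the 0.6 threshold are illustrative assumptions, not values from the article:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative embeddings only; real ones come from sv_model.compute(stream)
enrolled = np.array([0.10, 0.90, 0.30], dtype=np.float32)
incoming = np.array([0.12, 0.88, 0.33], dtype=np.float32)

score = cosine_similarity(enrolled, incoming)
is_same_speaker = score > 0.6  # threshold is an assumption; tune it on real data
print(score, is_same_speaker)
```

In practice the threshold should be calibrated on enrollment data for the target speakers.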
3.3 Latency Test
streaming-paraformer streaming inference: decoding is triggered once for every ten 60 ms audio chunks.
Voiceprint vector model: audio duration is capped at 3 s per single inference.
3.4 Server Configuration and Cost
Once loaded, the three models above occupy a little over 1.0 GB of memory.
Therefore, if high concurrency is not required, the virtual machine needs at least a 2-core / 2 GB (2c2g) configuration.
For mainstream cloud vendors, the monthly and annual subscription prices are as follows:
Alibaba Cloud: https://www.aliyun.com/minisite/goods?userCode=ggqtukm3
Tencent Cloud: https://curl.qcloud.com/BLm2fgkN
Of course, if you just want to play around, you may not want to commit to a full year.
In that case, try sealos, which bills on demand for the resources actually used.
Registration: https://cloud.sealos.run/?uid=QDGJoX2_Qp
New users receive a 10-yuan credit on registration, which is enough to play with for a while.
Taking 2c2g with a 20 GB disk as an example, the daily cost is about 1.5 yuan. Just shut it down when not in use.
Last words
This article presented a CPU inference solution for the Xiaozhi AI server's ASR and estimated the required server configuration and cost.