Building a low-latency Xiaozhi AI server - ASR (continued): it runs on CPU

Written by
Clara Bennett
Updated on: June 29, 2025
Recommendation

Monkey Brother shares how to build a low-latency AI server ASR, with a detailed explanation of the CPU inference solution.

Core content:
1. Introduction to sherpa-onnx, a cross-platform open-source speech processing toolkit
2. ASR model selection: a paraformer model with streaming inference
3. CPU inference solution: model loading and implementation

Yang Fangxian
Founder of 53AI / Tencent Cloud Most Valuable Professional (TVP)
It is undeniable that the cost of GPU inference is enough to deter most players.

If you:

  • are averse to high costs;
  • are an individual player who doesn't mind some compromise in user experience,

you can try the CPU inference solution below.

This article will first introduce sherpa-onnx, a high-performance open-source speech processing project.

Then we will select one of its models and implement real-time CPU inference for the Xiaozhi AI server's ASR.

1. Introduction to sherpa-onnx

https://github.com/k2-fsa/sherpa-onnx

sherpa-onnx is a cross-platform, multi-language speech processing toolkit implemented with onnxruntime. It supports a wide range of functions, including:

  • Voice activity detection (VAD)
  • Speech-to-text (ASR)
  • Text-to-speech (TTS)
  • Speaker diarization

As usual, let me briefly introduce the highlights:

  • Cross-platform compatibility: runs on Windows, macOS, Linux, Android, and iOS, as well as various embedded systems.

  • Multi-language API: provides interfaces in 11 mainstream programming languages.

  • High performance: built on ONNX Runtime, suitable for deployment on devices with a wide range of computing capabilities.
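
To try it out, the Python bindings can be installed from PyPI; a minimal sanity check (CPU build assumed):

# pip install sherpa-onnx
import sherpa_onnx  # if this import succeeds, the toolkit is ready to use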

2. Model selection

Model list: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html

According to the tests in the previous article, the VAD model has few parameters and its latency is negligible.

Therefore, the VAD model choice remains the same as in the previous article.

2.1 ASR Model

To keep ASR recognition quality from suffering, the model needs a sufficient number of parameters.

At the same time, latency must be as low as possible.

In summary, for the ASR model we can choose a paraformer that supports streaming inference, which is also the streaming inference model used on Alibaba Cloud.

2.2 Voiceprint Vector Model

Considering that inference latency grows with the number of model parameters, to balance performance and latency you can choose 3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.

3. CPU Inference Solution

3.1 Model Loading

The VAD model is loaded the same way as in the previous article. The ASR and voiceprint recognition models are loaded via sherpa-onnx.

import sherpa_onnx
from funasr import AutoModel  # FunASR's AutoModel, as in the previous article

class ModelManager:
    def __init__(self):
        self.vad_model = None
        self.asr_model = None
        self.sv_model = None

    def load_models(self):
        # VAD: loaded via FunASR, same as in the previous article
        self.vad_model = AutoModel(model="ckpts/speech_fsmn_vad_zh-cn-16k-common-pytorch")

        # Streaming ASR: int8-quantized bilingual paraformer
        model_dir = "ckpts/sherpa-onnx-streaming-paraformer-bilingual-zh-en"
        self.asr_model = sherpa_onnx.OnlineRecognizer.from_paraformer(
            encoder=f"{model_dir}/encoder.int8.onnx",
            decoder=f"{model_dir}/decoder.int8.onnx",
            tokens=f"{model_dir}/tokens.txt")

        # Voiceprint (speaker embedding) model
        model_dir = "ckpts/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx"
        config = sherpa_onnx.SpeakerEmbeddingExtractorConfig(model=model_dir)
        self.sv_model = sherpa_onnx.SpeakerEmbeddingExtractor(config)
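
With the class above, the models are loaded once at service startup; a usage sketch (paths assumed to follow the ckpts/ layout shown):

manager = ModelManager()
manager.load_models()  # loads VAD, ASR, and voiceprint models into memory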

3.2 Model Inference

The core logic of the service:

When VAD detects active audio, asr_model creates an audio stream self.stream:

if self.stream is None:
    self.stream = self.model_manager.asr_model.create_stream()
    self.input_stream(self.audio_buffer)

Then, each time an audio frame arrives, it is pushed into the audio stream:

def on_audio_frame(self, frame):
    # Convert raw int16 PCM bytes to float32 samples in [-1, 1)
    frame_fp32 = np.frombuffer(frame, dtype=np.int16).astype(np.float32) / 32768
    if self.stream:
        self.input_stream(frame_fp32)
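
The input_stream helper is not shown in the article; a plausible sketch based on the public sherpa-onnx OnlineRecognizer API (accept_waveform / is_ready / decode_stream / get_result), with self.partial_text as an assumed place to keep the result:

def input_stream(self, samples):
    # Feed float32 samples into the ASR stream; decode whenever the
    # recognizer has accumulated enough frames, then read the partial text.
    self.stream.accept_waveform(SAMPLE_RATE, samples)
    asr = self.model_manager.asr_model
    while asr.is_ready(self.stream):
        asr.decode_stream(self.stream)
    self.partial_text = asr.get_result(self.stream)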

Testing revealed what streaming-paraformer's so-called streaming inference actually does: decoding is triggered once every 10 chunks of 60 ms audio, i.e., every 600 ms.

A single decoding pass takes less than 0.3 s.

So the real-time factor (RTF) can indeed stay below 1: processing 600 ms of audio in under 0.3 s gives an RTF below 0.5.

To that end, the audio data is pushed into a consumer queue for asynchronous processing, which in theory enables real-time inference.

The changes are as follows:

First, audio data is no longer decoded directly, but pushed into a queue:

# self.input_stream(frame_fp32)
self.asr_queue.put(frame_fp32)

Then, create a worker to asynchronously consume the audio in the queue:

def _decode_worker(self):
    while True:
        try:
            # Wait briefly (blocking) for the next audio chunk
            chunk = self.asr_queue.get(timeout=0.01)
            if self.stream:  # make sure the stream exists
                self.running = True
                self.input_stream(chunk)
        except Empty:
            self.running = False  # clear the decoding flag
            continue  # keep waiting while the queue is empty
        except Exception as e:
            logging.error(f"Error in decode worker: {e}", exc_info=True)
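
Neither the queue nor the worker thread is shown being created; a minimal setup, assumed to live in the service's __init__, might look like this:

import threading
from queue import Queue, Empty  # Empty is caught in _decode_worker

# Inside the service's __init__ (assumed):
self.asr_queue = Queue()
self.running = False
# Daemon thread: exits with the process instead of blocking shutdown
threading.Thread(target=self._decode_worker, daemon=True).start()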

Finally, the voiceprint vector model extracts audio features (the speaker embedding):

def generate_embedding(self):
    if self.audio_buffer.shape[0] == 0:
        return []
    # Use only the last 3 seconds of audio for the speaker embedding
    stream = self.model_manager.sv_model.create_stream()
    stream.accept_waveform(SAMPLE_RATE, self.audio_buffer[-SAMPLE_RATE * 3:])
    stream.input_finished()
    embedding = self.model_manager.sv_model.compute(stream)
    return embedding
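
The returned embedding is only useful when compared against a previously registered one. A common approach is cosine similarity; the sketch below is illustrative, and the 0.6 threshold is an assumption to be tuned on real data:

import numpy as np

def same_speaker(emb_a, emb_b, threshold=0.6):
    # Cosine similarity between two speaker embeddings
    a, b = np.asarray(emb_a), np.asarray(emb_b)
    score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return score >= threshold, score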

3.3 Latency Test

streaming-paraformer streaming inference, with decoding triggered every 10 chunks of 60 ms audio:

Model              CPU inference time (s)
paraformer         0.09-0.25
paraformer-int8    0.08-0.09
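
These numbers can be reproduced by timing the decode loop; a rough sketch (recognizer and stream are stand-in names for the objects above):

import time

t0 = time.perf_counter()
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
elapsed = time.perf_counter() - t0
# 10 chunks x 60 ms = 0.6 s of audio per decode
print(f"decode took {elapsed:.3f}s, RTF = {elapsed / 0.6:.2f}")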

Voiceprint vector model, single inference on up to 3 s of audio:

Model              CPU inference time (s)
3dspeaker          0.09

3.4 Server Configuration and Cost

Loading the three models above takes a little over 1.0 GB of memory.

Therefore, even without considering high concurrency, the virtual machine needs at least 2 vCPUs and 2 GB of RAM (2c2g).

For the mainstream cloud vendors, the monthly/annual subscription prices are as follows:

Alibaba Cloud: https://www.aliyun.com/minisite/goods?userCode=ggqtukm3

Tencent Cloud: https://curl.qcloud.com/BLm2fgkN

Of course, if you just want to play around, you probably don't want to commit to a full year.

In that case, try Sealos, which bills on demand for the resources you actually use.

Registration experience: https://cloud.sealos.run/?uid=QDGJoX2_Qp

New users receive a 10-yuan balance on registration, which is enough to experiment for a while.

Take 2c2g as an example with a 20 GB disk mounted: the daily cost is about 1.5 yuan, and you can simply shut it down when not in use.

Last words

This article shared a CPU inference solution for the Xiaozhi AI server's ASR and estimated the required server configuration and cost.