A newly open-sourced TTS voice model: 25ms ultra-low latency for real-time conversation, with four model sizes to cover every scenario!

Explore the latest open-source TTS technology and experience real-time conversation combined with expressive emotion.
Core content:
1. The unique advantages and technical highlights of the Orpheus TTS model
2. Full-scenario support for zero-shot voice cloning and emotion control
3. A quick guide to deployment and the online demo
In TTS, natural emotional expression and real-time performance have long been two major challenges: traditional models struggle to balance latency against voice quality.
Over the past two years, increasingly capable and comprehensive TTS models have been released.
Orpheus TTS is a newly released open-source TTS model that has quickly become a focus of the open-source community thanks to its near-human emotional expression, ultra-low-latency real-time output, and strong zero-shot voice cloning.
It not only generates smooth, natural, emotional speech, but also compresses latency to a remarkable 25-50 milliseconds, making it well suited to real-time conversation.
It also ships four models, ranging from 150M to 3B parameters, to cover different scenarios, and supports zero-shot voice cloning and flexible emotion control, so anyone can easily customize their own voice.
Key highlights
• Ultra-low latency: supports real-time streaming inference, with latency around 200 milliseconds out of the box, reducible to roughly 25-50 milliseconds with input streaming
• Natural emotional expression: supports rich emotion and tone control, including happiness, sadness, anger, sleepiness, and more
• Zero-shot voice cloning: no fine-tuning required; just provide reference audio to clone the target voice
• Four model sizes: Medium (3B), Small (1B), Tiny (400M), Nano (150M)
• End-to-end speech generation: not yet released, but once available it should further improve speech naturalness, controllability, and generation speed
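As a sketch of how emotion control tends to look in practice: Orpheus-style models accept inline tags embedded in the prompt text. The tag names below (such as <sigh> and <laugh>) and the `tag_prompt` helper are assumptions for illustration; check the model card for the exact supported set.

```python
# Sketch: inserting an inline emotion tag into a TTS prompt.
# Tag names like <sigh> and <laugh> are assumptions based on common
# Orpheus-TTS examples; verify against the model card.

def tag_prompt(text, emotion=None):
    """Prepend an inline emotion tag to a prompt, if one is given."""
    return f"<{emotion}> {text}" if emotion else text

print(tag_prompt("I can't believe we won!", "laugh"))
print(tag_prompt("It's been a long day.", "sigh"))
```

The tagged string would then be passed as the `prompt` argument when generating speech.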
Quick Use
Orpheus TTS has a simple installation and usage process and supports local deployment.
If you want to try the tool directly, there is also an online demo on the Hugging Face platform (access may require a proxy or VPN in some regions).
Online Demo:
https://huggingface.co/spaces/MohamedRashad/Orpheus-TTS
Local deployment steps:
① Clone project
git clone https://github.com/canopyai/Orpheus-TTS.git
cd Orpheus-TTS
② Install dependencies
pip install orpheus-speech
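After installing, a quick stdlib-only check can confirm the package is importable before downloading any model weights (the `is_installed` helper is ours, for illustration):

```python
# Sanity check that the package installed correctly
# (pure stdlib; no model download needed).
import importlib.util

def is_installed(package):
    """Return True if `package` can be imported in this environment."""
    return importlib.util.find_spec(package) is not None

if is_installed("orpheus_tts"):
    print("orpheus-speech is installed")
else:
    print("orpheus-speech not found; run: pip install orpheus-speech")
```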
③ Python call example
from orpheus_tts import OrpheusModel
import wave
import time

model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")
prompt = '''Man, the way social media has, um, completely changed how we interact is just wild, right? Like, we're all connected 24/7 but somehow people feel more alone than ever. And don't even get me started on how it's messing with kids' self-esteem and mental health and whatnot.'''

start_time = time.monotonic()
syn_tokens = model.generate_speech(
    prompt=prompt,
    voice="tara",
)

with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(24000)   # 24 kHz output

    total_frames = 0
    chunk_counter = 0
    for audio_chunk in syn_tokens:  # streamed output
        chunk_counter += 1
        frame_count = len(audio_chunk) // (wf.getsampwidth() * wf.getnchannels())
        total_frames += frame_count
        wf.writeframes(audio_chunk)
    duration = total_frames / wf.getframerate()

end_time = time.monotonic()
print(f"It took {end_time - start_time} seconds to generate {duration:.2f} seconds of audio")
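The timing printed above is often summarized as a real-time factor (RTF): synthesis time divided by audio duration, where a value below 1.0 means the model generates audio faster than it plays back. A minimal sketch, independent of the model (the helper name is ours):

```python
def real_time_factor(wall_seconds, audio_seconds):
    """RTF = synthesis time / audio duration; < 1.0 is faster than real time."""
    return wall_seconds / audio_seconds

# e.g. 2.0 s of wall-clock time to produce 8.0 s of audio:
print(real_time_factor(2.0, 8.0))  # → 0.25
```

For streaming use cases, time-to-first-chunk matters more than total RTF, which is why the loop above counts chunks as they arrive.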
Last words
Traditional text-to-speech (TTS) systems face three core challenges: stiff emotional expression, high inference latency (typically over 500 ms), and the large amounts of data needed to clone a voice.
Using a mixture-of-experts (MoE) architecture and KV-cache optimization, Orpheus TTS achieves, across its 150M-3B parameter range, a MOS score of 4.6, end-to-end latency as low as 25 ms, zero-shot voice cloning, and fine-grained emotion control.
It fits a wide range of applications, including AI voice assistants, game dubbing, audiobooks, virtual customer service, and intelligent voice interaction, balancing high-quality speech synthesis with real-time interactive experience, and is one of the most promising open-source TTS solutions available today!