A newly open-sourced TTS voice model: 25ms ultra-low latency for real-time conversation, with four model sizes to cover every scenario!

Explore the latest open-source TTS technology and experience real-time conversation combined with expressive emotion.
Core content:
1. The unique advantages and technical highlights of the Orpheus TTS model
2. Full-scenario support for zero-shot voice cloning and emotion control
3. A quick guide to deployment and the online demo
In TTS, natural emotional expression and real-time performance have long been two major challenges: traditional models struggle to balance latency against voice quality.
Over the past two years, increasingly capable and comprehensive TTS models have been released.
Orpheus TTS is a newly released open-source TTS model that has quickly become a focus of the open-source community thanks to its near-human emotional expression, ultra-low-latency real-time output, and strong zero-shot voice cloning.
It not only generates smooth, natural, emotional speech, but also compresses latency to a remarkable 25-50 milliseconds, making it well suited to real-time conversation.
It also ships four models, ranging from 150M to 3B parameters, to cover different scenarios, and supports zero-shot voice cloning and flexible emotion control, so anyone can easily customize their own voice.
Key highlights
• Ultra-low latency: supports real-time streaming inference, with latency around 200 milliseconds out of the box, reducible to roughly 25-50 milliseconds with input streaming
• Natural emotional expression: supports rich emotion and tone control, including happiness, sadness, anger, sleepiness, and more
• Zero-shot voice cloning: no fine-tuning required; just provide reference audio to clone the target voice
• Four model sizes: Medium (3B), Small (1B), Tiny (400M), Nano (150M)
• End-to-end speech generation: not yet released, but once available it should further improve speech naturalness, controllability, and generation speed
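As a sketch of how emotion control tends to look in practice: Orpheus-style models accept inline tags embedded in the prompt text. The tag names below (such as <sigh> and <laugh>) and the `tag_prompt` helper are assumptions for illustration; check the model card for the exact supported set.

```python
# Sketch: inserting an inline emotion tag into a TTS prompt.
# Tag names like <sigh> and <laugh> are assumptions based on common
# Orpheus-TTS examples; verify against the model card.

def tag_prompt(text, emotion=None):
    """Prepend an inline emotion tag to a prompt, if one is given."""
    return f"<{emotion}> {text}" if emotion else text

print(tag_prompt("I can't believe we won!", "laugh"))
print(tag_prompt("It's been a long day.", "sigh"))
```

The tagged string would then be passed as the `prompt` argument when generating speech.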
Quick Use
Orpheus TTS has a simple installation and usage process and supports local deployment.
If you want to try the tool directly, there is also an online demo on the Hugging Face platform (access may require a proxy or VPN in some regions).
Online Demo:
https://huggingface.co/spaces/MohamedRashad/Orpheus-TTS
Local deployment steps:
① Clone project
git clone https://github.com/canopyai/Orpheus-TTS.git
cd Orpheus-TTS
② Install dependencies
pip install orpheus-speech
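After installing, a quick stdlib-only check can confirm the package is importable before downloading any model weights (the `is_installed` helper is ours, for illustration):

```python
# Sanity check that the package installed correctly
# (pure stdlib; no model download needed).
import importlib.util

def is_installed(package):
    """Return True if `package` can be imported in this environment."""
    return importlib.util.find_spec(package) is not None

if is_installed("orpheus_tts"):
    print("orpheus-speech is installed")
else:
    print("orpheus-speech not found; run: pip install orpheus-speech")
```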
③ Python call example
from orpheus_tts import OrpheusModel
import wave
import time

model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")
prompt = '''Man, the way social media has, um, completely changed how we interact is just wild, right? Like, we're all connected 24/7 but somehow people feel more alone than ever. And don't even get me started on how it's messing with kids' self-esteem and mental health and whatnot.'''

start_time = time.monotonic()
syn_tokens = model.generate_speech(
    prompt=prompt,
    voice="tara",
)

with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(24000)   # 24 kHz output

    total_frames = 0
    chunk_counter = 0
    for audio_chunk in syn_tokens:  # streamed output
        chunk_counter += 1
        frame_count = len(audio_chunk) // (wf.getsampwidth() * wf.getnchannels())
        total_frames += frame_count
        wf.writeframes(audio_chunk)
    duration = total_frames / wf.getframerate()

end_time = time.monotonic()
print(f"It took {end_time - start_time} seconds to generate {duration:.2f} seconds of audio")
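The timing printed above is often summarized as a real-time factor (RTF): synthesis time divided by audio duration, where a value below 1.0 means the model generates audio faster than it plays back. A minimal sketch, independent of the model (the helper name is ours):

```python
def real_time_factor(wall_seconds, audio_seconds):
    """RTF = synthesis time / audio duration; < 1.0 is faster than real time."""
    return wall_seconds / audio_seconds

# e.g. 2.0 s of wall-clock time to produce 8.0 s of audio:
print(real_time_factor(2.0, 8.0))  # → 0.25
```

For streaming use cases, time-to-first-chunk matters more than total RTF, which is why the loop above counts chunks as they arrive.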
Last words
Traditional text-to-speech (TTS) systems face three core challenges: stiff emotional expression, high inference latency (typically over 500 ms), and the large amounts of data needed to clone a voice.
Using a mixture-of-experts (MoE) architecture and KV-cache optimization, Orpheus TTS achieves, across its 150M-3B parameter range, a MOS score of 4.6, end-to-end latency as low as 25 ms, zero-shot voice cloning, and fine-grained emotion control.
It fits a wide range of applications, including AI voice assistants, game dubbing, audiobooks, virtual customer service, and intelligent voice interaction, balancing high-quality speech synthesis with real-time interactive experience, and is one of the most promising open-source TTS solutions available today!