35k stars, a revolutionary text-to-speech tool, now open source!

Written by

Iris Vance

Updated on: July 8, 2025

In recent years, with the rapid development of generative AI, the field of text-to-speech (TTS) has gained a disruptive new player: ChatTTS. The project has 35.2k stars on GitHub and has been praised as "the open source TTS model that is closest to real human voice features."

Highlights

Conversational TTS: ChatTTS is optimized for conversational tasks and enables natural and expressive synthesized speech. It supports multiple speakers to facilitate interactive dialogues.
Fine-grained control: The model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
Better prosody: ChatTTS surpasses most open source TTS models in prosody. The project provides pre-trained models to support further research and development.
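
The prosody controls listed above are written as inline tokens in the input text. The following is a minimal sketch, assuming the token spellings `[uv_break]`, `[laugh]`, and `[lbreak]` and the `Chat.load` and `Chat.infer(..., skip_refine_text=True)` calls shown in the upstream README at the time of writing. `add_prosody` and `synthesize` are hypothetical helper names, not part of ChatTTS, and the token set may differ between versions.

```python
# Inline control tokens, spelled as in the upstream README examples
# (an assumption: check the README of the version you install).
PAUSE = "[uv_break]"   # short pause inside a sentence
LAUGH = "[laugh]"      # laughter
LINE_END = "[lbreak]"  # end-of-line break


def add_prosody(phrases, laugh=False):
    """Hypothetical helper: join phrases with pauses, optionally end with a laugh."""
    text = f" {PAUSE} ".join(p.strip() for p in phrases)
    if laugh:
        text += LAUGH
    return text + LINE_END


def synthesize(text):
    """Run ChatTTS on text that already contains control tokens."""
    import ChatTTS  # imported lazily so add_prosody works without the model

    chat = ChatTTS.Chat()
    chat.load(compile=False)
    # skip_refine_text=True skips the model's own text-refinement pass,
    # so the hand-written tokens are used as given.
    return chat.infer([text], skip_refine_text=True)


print(add_prosody(["What is", "your favorite English food?"], laugh=True))
```

Only `add_prosody` runs without the model installed; calling `synthesize` requires the dependencies from the tutorial below and downloads the pre-trained weights on first use.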

Tutorial

Clone the repository

git clone https://github.com/2noise/ChatTTS
cd ChatTTS

Install Dependencies

1. Direct installation

pip install --upgrade -r requirements.txt

2. Install using conda

conda create -n chattts python=3.11
conda activate chattts
pip install -r requirements.txt

Optional: If you are using an NVIDIA GPU (Linux only), you can also install TransformerEngine.

Quick Start

Make sure you are in the project root directory when executing the following commands.

1. WebUI visual interface

python examples/web/webui.py

2. Command line interaction

The generated audio is saved as ./output_audio_n.mp3, where n numbers the output files.

python examples/cmd/run.py "Your text 1." "Your text 2."
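
For batch jobs, the command-line entry point can be driven from a script. The sketch below is one way to do that, assuming only the script path from the command above; `build_command` and `synthesize_batch` are hypothetical helper names, and `repo_root` should point at the cloned ChatTTS directory.

```python
import subprocess
import sys
from pathlib import Path

# Script path taken from the command above, relative to the repository root.
RUN_SCRIPT = Path("examples") / "cmd" / "run.py"


def build_command(texts):
    """Return the argv list equivalent to: python examples/cmd/run.py "t1" "t2" ..."""
    if not texts:
        raise ValueError("at least one text is required")
    # sys.executable reuses the interpreter of the active environment.
    return [sys.executable, str(RUN_SCRIPT), *texts]


def synthesize_batch(texts, repo_root="."):
    """Run the CLI with repo_root as the working directory."""
    # cwd pins the working directory to the project root, as required above.
    subprocess.run(build_command(texts), cwd=repo_root, check=True)
```

Passing the texts as a list avoids shell quoting problems with punctuation and spaces, and `check=True` raises an exception if the script exits with an error.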

Advantages and Disadvantages Analysis

Advantages:

High generation quality: ChatTTS uses a Transformer architecture and large-scale pre-training to generate highly natural speech that is close to a real human voice.
Strong flexibility: ChatTTS supports multiple speakers and fine-grained control of prosodic features such as laughter, pauses, and interjections, so the same model can serve different conversational styles. It is a speech synthesis model only; it does not perform other language tasks such as translation or summarization.
Open source community support: ChatTTS is an open source project that has received extensive community support and contributions, and provides rich resources and tools for developers to use.

Disadvantages:

High computing resource requirements: High-quality speech generation requires a lot of computing resources, especially in the training and fine-tuning stages, which places high demands on hardware performance.
Strong data dependence: The generation effect is heavily dependent on the quality and diversity of the training data. In some specific application scenarios, a large amount of specific data may be required for fine-tuning.
Lack of real-time performance: Due to the complexity of the generation process, there may be delays in some real-time applications, especially when processing complex texts and generating long segments of speech.
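
A common way to quantify the real-time limitation is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below is a generic measurement helper, not part of ChatTTS. It assumes a 24 kHz sample rate, taken from the save example in the upstream README, and a hypothetical `synthesize(text)` callable that returns a one-dimensional sequence of samples.

```python
import time

SAMPLE_RATE = 24_000  # Hz; assumed from the upstream README's save example


def real_time_factor(synthesis_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio produced")
    return synthesis_seconds / audio_seconds


def timed(synthesize, text):
    """Call synthesize(text) and return (samples, RTF) for that call."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return samples, real_time_factor(elapsed, len(samples))
```

As a purely illustrative calculation, taking 6 seconds to produce 72,000 samples (3 seconds of audio at 24 kHz) gives an RTF of 2.0. Values above 1 mean synthesis is slower than playback, which is where delays become noticeable in interactive use.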

Application Scenarios

Smart assistants: Add human-like voice interaction to LLMs such as ChatGPT.
Audio content creation: Automatically generate audiobooks and podcast narration, with different voices for different roles.
Education: Create language learning materials with emotional feedback.
Accessibility services: Provide a more natural screen-reading experience for visually impaired users.