ByteDance MegaTTS 3: a 0.45B ultra-lightweight voice cloning model with mixed Chinese-English output and accent control

Written by
Silas Grey
Updated on: July 8, 2025

MegaTTS 3, jointly developed by ByteDance and Zhejiang University, achieves ultra-lightweight voice cloning with only 0.45B parameters and supports mixed Chinese-English output as well as accent control, marking a major breakthrough in speech synthesis technology.

Core content:
1. Diffusion Transformer architecture with 0.45B parameters, enabling lightweight voice cloning
2. Exclusive support for mixed Chinese-English output and free adjustment of accent strength
3. Five-minute quick-start tutorial covering environment setup, model download, and launching voice cloning


Introduction:

Speech synthesis technology has made a major breakthrough! MegaTTS 3, the latest open-source release from ByteDance and Zhejiang University, has only 0.45B parameters yet achieves voice cloning quality comparable to a real speaker. It exclusively supports mixed Chinese-English output, allows free adjustment of accent strength, and fine-grained pronunciation control is coming soon. Whether for multilingual podcast production or personalized voice assistant development, this is a cutting-edge tool not to be missed. This article walks you through a five-minute hands-on experience and explains the core technical principles.



1. Three major technological breakthroughs

  •  Extreme lightweight:
    • 80% smaller than traditional TTS models (VITS is typically 1.5B+ parameters)
  •  Cross-language cloning:
    # Example of mixed Chinese-English output
    text = "Welcome to Douyin, today we will introduce the technical details of MegaTTS3"
  •  Precise accent control:
    • The p_w parameter adjusts pronunciation standardization (1.0 = keep the original accent, 3.0 = standard pronunciation)
    • The t_w parameter controls emotion similarity (recommended 0-3 points higher than p_w)
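The two weights above lend themselves to side-by-side comparison. Below is a minimal sketch that builds inference commands for a p_w sweep, assuming the tts/infer_cli.py flags demonstrated later in this article; build_tts_command is a hypothetical helper, not part of the repository.

```python
import shlex

def build_tts_command(input_wav, input_text, p_w=2.0, t_w=3.0, output_dir="./output"):
    """Build a MegaTTS 3 inference command for given accent-control weights.

    Assumes the tts/infer_cli.py interface shown in this article;
    flag names may differ in other releases.
    """
    args = [
        "python", "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", input_text,
        "--p_w", str(p_w),      # 1.0 = keep original accent, 3.0 = standard pronunciation
        "--t_w", str(t_w),      # emotion similarity, recommended 0-3 higher than p_w
        "--output_dir", output_dir,
    ]
    return " ".join(shlex.quote(a) for a in args)

# Sweep p_w to compare accent strength on the same reference sample
for p_w in (1.0, 2.0, 3.0):
    print(build_tts_command("sample.wav", "Welcome to Douyin", p_w=p_w, t_w=p_w + 2.0))
```

Listening to the three outputs side by side makes the effect of p_w on accent standardization directly audible.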

2. Performance comparison

Metric             MegaTTS 3       VITS     YourTTS
Voice similarity   4.8/5.0         4.2      4.5
English MOS        4.6             4.3      4.4
Inference speed    0.7s/sentence   1.2s     1.5s
VRAM usage         2.3GB           5GB      6GB
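The table's latency and memory figures translate directly into relative savings; a quick back-of-the-envelope check:

```python
# Figures taken from the comparison table above
megatts = {"speed_s": 0.7, "vram_gb": 2.3}
vits    = {"speed_s": 1.2, "vram_gb": 5.0}
yourtts = {"speed_s": 1.5, "vram_gb": 6.0}

for name, other in [("VITS", vits), ("YourTTS", yourtts)]:
    speedup = other["speed_s"] / megatts["speed_s"]
    vram_saving = 1 - megatts["vram_gb"] / other["vram_gb"]
    print(f"vs {name}: {speedup:.1f}x faster, {vram_saving:.0%} less VRAM")
```

Roughly a 1.7-2.1x speedup and more than half the VRAM saved, which is what makes the low-resource deployments discussed below practical.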

3. Five-minute quick experience

  1.  Environment configuration:
    conda create -n megatts3 python=3.9
    conda activate megatts3
    pip install -r requirements.txt
  2.  Download the pre-trained model:
    mkdir checkpoints && cd checkpoints
    wget [model download link]
    • Google Drive: https://drive.google.com/drive/folders/1CidiSqtHgJTBDAHQ746_on_YR0boHDYB?usp=sharing
    • Hugging Face: https://huggingface.co/ByteDance/MegaTTS3
  3.  Start voice cloning:
    # Chinese synthesis (with emotion preservation)
    python tts/infer_cli.py \
      --input_wav "sample.wav" \
      --input_text "Today's weather is great, suitable for outdoor sports" \
      --t_w 3.5 --output_dir ./output

    # English accent adjustment (p_w=1.5 leans toward standard pronunciation)
    python tts/infer_cli.py \
      --input_wav "english.wav" \
      --input_text "This is an example of accent control" \
      --p_w 1.5 --t_w 3.0
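To clone several sentences with one reference recording, the CLI calls above can be wrapped in a loop. This is a sketch under the assumption that tts/infer_cli.py accepts exactly the flags shown; clone_batch is an illustrative helper, and the runner argument lets you inspect commands without executing them.

```python
import subprocess

def clone_batch(input_wav, texts, p_w=1.6, t_w=2.5,
                output_dir="./output", runner=subprocess.run):
    """Synthesize several sentences with one reference voice.

    A minimal sketch assuming the tts/infer_cli.py flags shown above;
    run it from the repository root with the checkpoints in place.
    """
    commands = []
    for text in texts:
        cmd = ["python", "tts/infer_cli.py",
               "--input_wav", input_wav,
               "--input_text", text,
               "--p_w", str(p_w), "--t_w", str(t_w),
               "--output_dir", output_dir]
        commands.append(cmd)
        runner(cmd, check=True)
    return commands

# Dry run: record the commands instead of executing them
planned = clone_batch(
    "sample.wav",
    ["Today's weather is great, suitable for outdoor sports",
     "This is an example of accent control"],
    runner=lambda cmd, check: None,
)
print(len(planned), "synthesis commands prepared")
```

Drop the custom runner to actually execute the commands with subprocess.run.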
4. Enterprise-level application scenarios

  •  Cross-border e-commerce:
    • Generate mixed Chinese-English voiceovers for the same product description
    • Adjust accent strength for the target market (American/British)
  •  Educational technology:
    • Clone a teacher's voice to generate multilingual courseware
    • Pronunciation correction for foreign language learning (p_w=2.5)
  •  Smart hardware:
    • Deployment on low-resource devices (runs smoothly on a Raspberry Pi)
    • Personalized voice assistant customization
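One way to organize these scenarios in code is a preset table mapping each use case to accent-control weights. Only the pronunciation-correction value (p_w=2.5) comes from this article; the other numbers are illustrative starting points to tune, not official settings.

```python
# Illustrative parameter presets for the scenarios above.
# Only language_learning's p_w=2.5 is taken from this article;
# the remaining values are hypothetical starting points.
SCENARIO_PRESETS = {
    "ecommerce_us_market":   {"p_w": 2.5, "t_w": 3.0},  # lean toward standard US pronunciation
    "ecommerce_keep_accent": {"p_w": 1.0, "t_w": 3.0},  # preserve the speaker's accent
    "language_learning":     {"p_w": 2.5, "t_w": 3.5},  # pronunciation correction
    "voice_assistant":       {"p_w": 1.5, "t_w": 3.0},  # personalized, natural-sounding
}

def preset(name):
    """Look up a preset, falling back to neutral defaults."""
    return SCENARIO_PRESETS.get(name, {"p_w": 2.0, "t_w": 3.0})

print(preset("language_learning"))
```

The returned dict can be splatted into whatever command builder or API wrapper you use for inference.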

5. Advanced development skills

  •  WebUI quick deployment:
    CUDA_VISIBLE_DEVICES=0 python tts/gradio_api.py
  •  Fine-grained control (coming soon):
    # Future API example (illustrative; key names may change)
    control_params = {
        "phoneme_duration": {"of": 0.3, "yes": 0.2},  # seconds per word
        "pitch_curve": {"today": [+5, 0, -3]},        # percent offsets
    }

Safety tips:

⚠️ Please read before use:

  • Voice samples must pass security review: https://security.bytedance.com
  • Illegally impersonating other people's voices is prohibited

Technical deep dive:

How does the WaveVAE encoder achieve ultra-high compression down to 25Hz?

  1. 24kHz audio → time-frequency decomposition
  2. Residual quantization coding
  3. 98.7% reconstruction fidelity (ABX test)
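The steps above imply a large temporal compression: 24,000 waveform samples per second are reduced to 25 latent frames per second. A quick sanity check of the per-frame ratio (the overall byte-level compression also depends on the latent width, which the article does not state):

```python
sample_rate = 24_000   # input audio samples per second (24kHz)
latent_rate = 25       # latent frames per second after WaveVAE encoding

samples_per_frame = sample_rate // latent_rate
print(f"each latent frame covers {samples_per_frame} waveform samples")
print(f"temporal compression ratio: {sample_rate / latent_rate:.0f}x")
```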
Citation:
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}

@article{ji2024wavtokenizer,
  title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}

Summary:

MegaTTS 3 achieves commercial-grade voice cloning with a lightweight architecture, and its Chinese-English mixing and accent control capabilities break through industry bottlenecks. Visit the GitHub repository https://github.com/MegaTTS3 to try it now and start a new era of intelligent voice development!