ByteDance's MegaTTS 3: an ultra-lightweight 0.45B voice cloning model with mixed Chinese-English output and cutting-edge accent control

MegaTTS 3, jointly developed by ByteDance and Zhejiang University, achieves ultra-lightweight voice cloning with only 0.45B parameters and supports mixed Chinese-English output plus accent control, a major breakthrough in speech synthesis technology.
Core content:
1. Diffusion Transformer architecture with 0.45B parameters, enabling lightweight voice cloning
2. Exclusive support for mixed Chinese-English output and free adjustment of accent strength
3. Five-minute quick-start tutorial covering environment configuration, model download, and launching voice cloning
Introduction:
Speech synthesis technology has just made a major breakthrough! MegaTTS 3, newly open-sourced by ByteDance and Zhejiang University, has only 0.45B parameters yet achieves voice cloning that rivals real human speech. It exclusively supports mixed Chinese-English output, lets you freely adjust accent intensity, and fine-grained pronunciation control is coming soon. Whether for multilingual podcast production or personalized voice assistant development, this is a cutting-edge tool you should not miss. This article walks you through a five-minute hands-on experience and explains the core technical principles.
Main text:
1. Three major technological breakthroughs
• Extremely lightweight: about 80% smaller than traditional TTS models (VITS-class models are typically 1.5B+ parameters)
• Cross-language cloning: Chinese and English can be mixed within a single sentence:

```python
# Example of mixed Chinese and English output
text = "Welcome to Douyin, today we will introduce the technical details of MegaTTS3"
```

• Precise accent control:
  • p_w adjusts pronunciation standardization (1.0 = keep the original accent, 3.0 = standard pronunciation)
  • t_w controls emotional similarity to the reference voice (recommended 0-3 points higher than p_w)
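The guidance for the two weights can be captured in a small helper. This is only an illustrative sketch: the function name `choose_weights` and the fixed "+2" offset are my own assumptions; only the p_w range (1.0-3.0) and the "t_w should sit 0-3 points above p_w" rule come from the notes above.

```python
def choose_weights(accent_strength: float) -> tuple[float, float]:
    """Pick (p_w, t_w) for tts/infer_cli.py.

    accent_strength: 0.0 keeps the speaker's original accent,
    1.0 pushes toward fully standard pronunciation.
    """
    # p_w spans 1.0 (original accent) to 3.0 (standard pronunciation)
    p_w = 1.0 + 2.0 * min(max(accent_strength, 0.0), 1.0)
    # t_w is recommended to sit 0-3 points above p_w; use a middle value of +2
    t_w = p_w + 2.0
    return round(p_w, 2), round(t_w, 2)

print(choose_weights(0.0))   # keep the original accent
print(choose_weights(1.0))   # fully standard pronunciation
```

The two returned values can be passed straight to the `--p_w` and `--t_w` flags shown in the tutorial below.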
2. Performance comparison
3. Five-minute quick experience
1. Environment configuration:

```shell
conda create -n megatts3 python=3.9
conda activate megatts3
pip install -r requirements.txt
```

2. Download the pre-trained model:

```shell
mkdir checkpoints && cd checkpoints
wget [model download link]
```

• Google Drive: https://drive.google.com/drive/folders/1CidiSqtHgJTBDAHQ746_on_YR0boHDYB?usp=sharing
• Hugging Face: https://huggingface.co/ByteDance/MegaTTS3
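If you prefer scripting the download, the Hugging Face repo linked above can be fetched from Python. A sketch, assuming the `huggingface_hub` package is installed; `download_checkpoints` is an illustrative helper, not part of the MegaTTS 3 repo:

```python
from pathlib import Path

# Repo id from the Hugging Face link above; target dir matches the tutorial
REPO_ID = "ByteDance/MegaTTS3"
CHECKPOINT_DIR = Path("checkpoints")

def download_checkpoints() -> Path:
    # Imported lazily so the constants above work without the dependency
    from huggingface_hub import snapshot_download
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    return Path(snapshot_download(repo_id=REPO_ID, local_dir=str(CHECKPOINT_DIR)))

print(f"Will download {REPO_ID} into {CHECKPOINT_DIR}/")
```

Calling `download_checkpoints()` mirrors the `mkdir checkpoints && wget ...` step above, but resumes partial downloads and verifies file hashes.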
3. Start voice cloning:

```shell
# Chinese synthesis (with emotion preservation)
python tts/infer_cli.py \
  --input_wav "sample.wav" \
  --input_text "Today's weather is great, suitable for outdoor sports" \
  --t_w 3.5 --output_dir ./output
```
```shell
# English accent adjustment (p_w=1.5 tends toward standard pronunciation)
python tts/infer_cli.py \
  --input_wav "english.wav" \
  --input_text "This is an example of accent control" \
  --p_w 1.5 --t_w 3.0
```

4. Enterprise-level application scenarios
• Cross-border e-commerce:
  • Generate mixed Chinese-English voiceovers for the same product description
  • Adjust accent strength for the target market (American/British)
• Educational technology:
  • Clone a teacher's voice to generate multilingual courseware
  • Pronunciation-correction mode for foreign-language learning (p_w=2.5)
• Smart hardware:
  • Deployment on low-resource devices (runs smoothly on a Raspberry Pi)
  • Personalized voice assistant customization

5. Advanced Development Skills
• WebUI quick deployment:

```shell
CUDA_VISIBLE_DEVICES=0 python tts/gradio_api.py
```

• Fine-grained control (coming soon):

```python
# Future API example (interface not yet released; values are illustrative)
control_params = {
    "phoneme_duration": {"of": 0.3, "yes": 0.2},   # per-word duration in seconds
    "pitch_curve": {"today": [+0.05, 0, -0.03]}    # relative pitch offsets (+5%, 0, -3%)
}
```

Safety Tips:
Please read before use:
• Voice samples must pass security review: https://security.bytedance.com
• Using cloned voices to impersonate others for illegal purposes is prohibited

Technical Digging:
How does the WaveVAE encoder achieve ultra-high compression down to 25Hz?
1. 24kHz audio → time-frequency decomposition
2. Residual quantization coding
3. 98.7% reconstruction fidelity (ABX test)
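The compression ratio implied by those numbers is easy to verify: going from 24kHz waveform samples down to 25Hz latent frames means each latent frame summarizes 960 raw samples, i.e. 40 ms of audio per frame.

```python
SAMPLE_RATE_HZ = 24_000   # input audio sample rate, as stated above
LATENT_RATE_HZ = 25       # WaveVAE latent frame rate

# Temporal compression: raw samples represented by one latent frame
samples_per_frame = SAMPLE_RATE_HZ // LATENT_RATE_HZ
# Audio duration covered by each latent frame
frame_duration_ms = 1000 / LATENT_RATE_HZ

print(samples_per_frame)   # → 960 samples per latent frame
print(frame_duration_ms)   # → 40.0 ms of audio per frame
```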
Citation:
@article{jiang2025sparse,
title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
journal={arXiv preprint arXiv:2502.18924},
year={2025}
}
@article{ji2024wavtokenizer,
title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
journal={arXiv preprint arXiv:2408.16532},
year={2024}
}
Summary:
MegaTTS 3 achieves commercial-grade voice cloning with a lightweight architecture, and its Chinese-English mixing and accent control capabilities break through long-standing industry bottlenecks. Visit the GitHub repository https://github.com/MegaTTS3 to try it now and start a new era of intelligent voice development!