You'll be amazed when you hear it! Xiaozhi AI's voice is brimming with emotion, and it's all thanks to CosyVoice 2.0, which handles multiple languages, dialects, and even voice cloning!

Explore the latest open source TTS technology and experience the revolutionary progress of speech synthesis.
Core content:
1. Multi-language and dialect support capabilities of CosyVoice 2.0
2. Zero-shot voice cloning and cross-lingual / mixed-language speech synthesis
3. Bidirectional streaming processing for fast response and real-time interaction
I've been in the tech world for some years now, and I like to tinker with all kinds of new things, especially technologies that can change our lives and improve efficiency. I always feel that just knowing how to use something isn't enough; I have to figure out how it's implemented and what practical problems it solves before it really gets exciting.
Recently, voice technology has become more and more popular. From smart speakers to voice assistants in various apps and even virtual live broadcasts, they all rely on a core technology - speech synthesis, or TTS (Text-to-Speech). Making machines speak naturally like humans sounds simple, but it is not easy to do.
There are many TTS tools on the market, but they are either mediocre or closed source and paid, which makes customization and in-depth research a hassle. Just recently, though, I found that the FunAudioLLM team has open-sourced a project called CosyVoice. In particular, the newly released CosyVoice 2.0 [1] was so eye-catching that I couldn't help but talk to you about it.
This is not just any TTS. It is not only amazingly effective but also incredibly powerful, and the key is that it is open source! The Apache-2.0 license means you can use, modify, and distribute it freely. For us developers, this is great news!
CosyVoice 2.0: What’s so great about it?
Let’s get straight to the point and see what amazing features CosyVoice 2.0 has:
1. A master of languages who can even handle dialects!
• This is not some "basic version" that only speaks Mandarin and English. CosyVoice 2.0 supports Chinese, English, Japanese, and Korean. And that's not all: it also supports a bunch of Chinese dialects! Cantonese, Sichuanese, Shanghainese, Tianjin dialect, Wuhan dialect, and more can all be imitated. Imagine your app talking to users in their native dialect. Wouldn't that feel instantly familiar?
2. Voice cloning + cross-language? Done in 3 seconds!
• This is one of the coolest features, in my opinion: Zero-Shot Voice Cloning. What does it mean? You only need to provide a short recording of the target voice (officially just a few seconds, e.g. 3 seconds), and CosyVoice 2.0 will imitate the timbre and rhythm of that voice, then use it to read any text you give it!
• Even more amazing is cross-lingual and mixed-lingual synthesis. Not only can the cloned voice speak the language of the original recording, it can also speak the other supported languages! For example, record a passage in Chinese and it can imitate your voice speaking English, Japanese, or even a mix of Chinese and English. It's like giving the voice a "simultaneous interpretation" plus "voice changer" buff!
3. Fast! Ridiculously fast response speed!
• For applications that need real-time interaction, such as intelligent customer service and voice assistants, the response speed of TTS is crucial. CosyVoice 2.0 has put real effort into this and supports bidirectional streaming, which means it can start synthesizing speech while still receiving text instead of waiting for the whole sentence to arrive.
• How good is it? Officially, the first-packet latency of speech synthesis is as low as 150 milliseconds. In other words, almost as soon as you enter the text, you hear the first audio chunk; the experience is very smooth, with barely any perceptible delay. (A quick code sketch follows at the end of this section.)
4. Accurate pronunciation and stable timbre, say goodbye to "machine voice"!
• The biggest fear with TTS is inaccurate pronunciation or an unstable timbre that sounds fake. Compared with version 1.0, the pronunciation error rate of CosyVoice 2.0 has been reduced by 30% to 50%. In addition, timbre consistency and stability have been greatly improved for zero-shot cloning and cross-lingual synthesis. No more embarrassing moments where the voice changes after a few sentences, or the timbre shifts drastically when switching between Chinese and English.
5. As natural as possible, with control over emotion and accent!
• It is not enough to speak correctly; it also has to sound pleasant and natural. CosyVoice 2.0 has been optimized for prosody and sound quality, making it sound more like a real person. The official MOS score (a metric for speech naturalness) improved from 5.4 in version 1.0 to 5.53, which is a very high score in the TTS field.
• Even more interesting, it now supports finer-grained emotion control and accent adjustment. You can make the synthesized voice sound happier, more serious, or carry a specific accent style, which greatly increases the fun.
It sounds great, but what is the principle behind it? (Just a quick chat)
Although the official documentation does not go into detail about the underlying architecture, you can tell from the technical terms (Flow Matching, LLM, KV cache) and the project dependencies (FunASR, Matcha-TTS, etc.) that CosyVoice integrates several of the more cutting-edge techniques in current AI:
• Large models as the foundation: a model like CosyVoice2-0.5B has a large number of parameters (0.5B means 500 million), which lets it learn sufficiently rich speech knowledge.
• Flow Matching: a relatively new generative-model training technique, likely used to produce a more natural, high-fidelity acoustic/vocoder stage.
• Zero-Shot Learning: by pre-training on a large amount of data from many different speakers, the model learns to quickly capture the timbre of a new voice and apply it to new text. This is the key to zero-shot cloning.
• Streaming processing: with caching (KV cache) and an optimized attention implementation (SDPA), the model can generate the audio stream incrementally while processing the input text, achieving low latency.
In short: you give it text and a sample of your voice, and it can "read the text aloud" in your voice.
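To make the low-latency claim a bit more concrete, here is a minimal sketch of how you might measure first-packet latency with streaming synthesis. It assumes the same CosyVoice2 Python API used in the demo later in this article (inference_zero_shot with stream=True yielding audio chunks); whether use_flow_cache needs to be enabled for streaming may depend on the version, so treat this as an illustration rather than official benchmark code.
import sys
import time
import torch
import torchaudio
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Load the model (same paths as the demo further down; use_flow_cache is the
# streaming-oriented acceleration option, check the official README for your version)
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False, use_flow_cache=True)
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

text = 'Streaming synthesis lets you hear the first audio chunk almost immediately.'
prompt_text = 'I hope you can do better than me in the future.'

start = time.time()
chunks = []
# stream=True: the generator yields audio chunks as they are synthesized
for i, output in enumerate(cosyvoice.inference_zero_shot(text, prompt_text, prompt_speech_16k, stream=True)):
    if i == 0:
        print(f'First chunk arrived after {time.time() - start:.3f} s')  # rough first-packet latency
    chunks.append(output['tts_speech'])

# Stitch the chunks together and save the full utterance
torchaudio.save('streaming_output.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)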
Want to try it out? Follow me!
The CosyVoice team has thoughtfully provided detailed tutorials and pre-trained models, so it is not difficult to get started.
Step 1: Pull the code down
# Clone the main repository, remember to use --recursive to download submodules as well
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If the submodule download fails (network reasons, you know), cd to the CosyVoice directory and try a few more times
cd CosyVoice
git submodule update --init --recursive
Step 2: Create a Python environment and install dependencies
The official recommendation is to use Conda, which is more convenient and can solve some cross-platform dependency issues.
# Create a new environment called cosyvoice, using Python 3.10
conda create -n cosyvoice -y python=3.10
# Activate the environment
conda activate cosyvoice
# Install pynini (required by WeTextProcessing), it is more stable to install with conda
conda install -y -c conda-forge pynini==2.1.5
# Install other Python dependencies. Using Alibaba Cloud's image will be much faster.
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# If you encounter sox-related errors during subsequent runs, install sox according to the system
# Ubuntu/Debian:
# sudo apt-get update && sudo apt-get install sox libsox-dev
# CentOS:
# sudo yum install sox sox-devel
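Before downloading models, I like to run a quick sanity check (my own habit, not an official step) to confirm that PyTorch and torchaudio import cleanly and to see whether a GPU is visible; inference also works on CPU, just more slowly:
# Quick environment sanity check (optional)
import torch
import torchaudio
print('torch:', torch.__version__, '| torchaudio:', torchaudio.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))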
Step 3: Download the pre-trained model
The model files are fairly large, and the project offers two download methods: the ModelScope SDK and Git LFS. The ModelScope SDK is strongly recommended: downloads are fast (especially within China) and the files are easy to manage.
# Run this code in the Python environment to download the model
from modelscope import snapshot_download
# Recommended to download the best CosyVoice 2.0 model (0.5B parameters)
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
# Other models are downloaded on demand (such as version 1.0, versions fine-tuned for specific tasks, etc.)
# snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
# snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
# snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
# Optionally, download the ttsfrd text front-end resource (if you skip it, WeTextProcessing is used instead)
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
(Optional) If you want to use the better ttsfrd package for text normalization, go into the pretrained_models/CosyVoice-ttsfrd/ directory, unzip resource.zip, and install the corresponding .whl. If you skip this step, WeTextProcessing is used automatically.
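Before moving on, if you want to confirm the model files actually landed where you expect (again just a habit of mine, not an official step), a quick directory check does the job; the exact file names inside the model folder can vary between releases:
import os
model_dir = 'pretrained_models/CosyVoice2-0.5B'
print('Model directory exists:', os.path.isdir(model_dir))
for name in sorted(os.listdir(model_dir)):
    print('  ', name)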
Step 4: Run a demo to get a feel for it
The project provides a very concise Python API. Let's try the coolest feature first: zero-shot cloning.
import sys
# Make sure the code in the submodule can be found
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice2  # Note that it is CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# --- CosyVoice 2.0 usage example ---
# Load the model. Here we use the recommended CosyVoice2-0.5B.
# Tunable parameters: load_jit/load_trt control whether optimized models are loaded, fp16 enables half-precision acceleration, use_flow_cache is another acceleration option
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False, use_flow_cache=False)

# Prepare a prompt speech clip, i.e. the voice sample you want to clone
# It must be a WAV file at a 16 kHz sampling rate. Replace './asset/zero_shot_prompt.wav' with your own recording.
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# The text to be synthesized
text_to_speak = 'I received a birthday gift from a friend from afar. The unexpected surprise and deep blessings filled my heart with sweet joy and my smile blossomed like a flower.'
# The reference text (affects the prosodic style)
prompt_text = 'I hope you can do better than me in the future.'

print("Synthesizing text using prompt audio './asset/zero_shot_prompt.wav' ...")
# Call zero-shot inference
# stream=False means non-streaming: the whole utterance is generated at once
# The return value is a generator and may yield multiple results (e.g. the text is automatically split on punctuation)
for i, output in enumerate(cosyvoice.inference_zero_shot(text_to_speak, prompt_text, prompt_speech_16k, stream=False)):
    # Get the synthesized speech data (a PyTorch tensor)
    tts_speech = output['tts_speech']
    # Save it as a WAV file
    output_filename = f'zero_shot_output_{i}.wav'
    torchaudio.save(output_filename, tts_speech, cosyvoice.sample_rate)
    print(f"Successfully synthesized speech segment {i + 1}, saved as {output_filename}")

print("Zero-shot synthesis completed!")

# --- If you want to try dialects or special effects (Instruct mode) ---
# text_instruct = 'I received a birthday gift from a friend from afar. The unexpected surprise and deep blessings filled my heart with sweet joy and my smile blossomed like a flower.'
# instruction = 'Say this sentence in Sichuan dialect'  # or 'Say this sentence in Cantonese', 'Say it in a happy tone', etc.
# for i, output in enumerate(cosyvoice.inference_instruct2(text_instruct, instruction, prompt_speech_16k, stream=False)):
#     torchaudio.save(f'instruct_output_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
#     print(f"Successfully synthesized Instruct-mode speech, saved as instruct_output_{i}.wav")
Run this code. If everything goes well, you will find zero_shot_output_0.wav in the current directory. Open it and have a listen: doesn't it sound very similar to the zero_shot_prompt.wav you provided?
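Since cross-lingual synthesis got so much praise above, here is a small follow-up sketch. It assumes the repository's inference_cross_lingual method, which (as far as I can tell) takes the target text and the prompt speech and returns a generator just like inference_zero_shot; if your version differs, check the examples in the official README. It reuses cosyvoice and prompt_speech_16k from the demo above:
# Same cloned voice, but now speaking English (cross-lingual synthesis)
english_text = 'Hello everyone, this voice was cloned from a Chinese recording, but now it is speaking English.'
for i, output in enumerate(cosyvoice.inference_cross_lingual(english_text, prompt_speech_16k, stream=False)):
    torchaudio.save(f'cross_lingual_output_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
    print(f'Saved cross_lingual_output_{i}.wav')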
There is also a Web UI to play with!
If you don't want to write code and just want a quick taste, the project also provides a web interface. Start it with the following command (the CosyVoice-300M model is used by default; pass --model_dir to point it at the 2.0 model):
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice2-0.5B
Then open http://localhost:50000 in your browser and play around.
CosyVoice 2.0 vs other solutions?
There are many TTS solutions on the market. Comparing them comes down to a few dimensions: multi-language support, dialect support, zero-shot cloning, low-latency streaming, open source and licensing, cost, customization capability, ease of use, and naturalness of the output.
In simple terms:
• Commercial cloud TTS: saves time and effort, ready to use out of the box, generally good results, but billed by usage and hard to customize.
• Other open source TTS: high degree of freedom and free to use, but quality, features, and community activity vary widely, and it is not easy to find one that is both comprehensive and easy to use.
• CosyVoice 2.0: combines the advantages of both, with top-notch quality, comprehensive features (especially multi-language, dialects, and zero-shot cloning), completely open source and free, and great customization potential. The downsides are that deploying and using it requires some hands-on skill, and it has non-trivial compute requirements.
For developers and teams who want to delve deeper into voice technology, require a high degree of customization, or want to integrate high-quality TTS into their own products without being constrained by commercial APIs, CosyVoice 2.0 is definitely an option worth focusing on and trying.
The future is promising and the community is active
Looking at the project's roadmap, the team is still iterating: they plan to release a model with a higher compression rate (25 Hz), a voice conversion model, improved inference stability, and more. The project is also quite active on GitHub (12.8k stars!), and there is an official DingTalk group for discussion.
If you are interested in speech synthesis, or are looking for a powerful, flexible, and effective TTS solution for your project, I strongly recommend heading to GitHub (search for FunAudioLLM/CosyVoice) and checking out CosyVoice 2.0. Run the demo yourself and listen to the results. You might be just as amazed by its capabilities as I was!