Voice cloning revolution: OpenAudio S1 is live! With a single sentence, AI voice actors can perform any emotion you want!

Written by

Jasper Cole

Updated on: June 13, 2025

In voice cloning, generating one or two sentences with a particular emotion is easy. But what if you need to generate a whole paragraph?

For example, a user who has already written a paragraph usually has a clear idea of its tone, pace, and emotional expression.

The voice generated by the model is "definitely in the right direction", but it falls short of a real person's narration in the details, especially when several paragraphs call for delicate emotions. Users can only try again and again, "drawing cards" until a version they are reasonably satisfied with appears.

Now, the voice cloning tool that understands you best is here. It lets you control emotions at will and insert the emotions you need into the text, as if a real person were speaking to you.

Fish Audio's newly updated OpenAudio S1 speech generation model aims for the expressiveness and naturalness of professional voice actors, with highly natural sound, rich tone control, and strong instruction-following capabilities.

The Fish Audio team said: "If we want AI to reach or even surpass human levels, then it must execute human instructions, rather than just generate based on text. So we have done a lot of research and training on open-domain instructions in the past year. The S1 model, which we will release in early June, will be the first to fully implement this capability: users can directly instruct the model to generate specific tones, roles, emotions, rhythms, and backgrounds through natural language, truly realizing the freedom of voice control."

OpenAudio S1 adopts a dual autoregressive architecture and RLHF training, and ranks first in TTS-Arena. It supports zero-shot and few-shot voice cloning and comes in two versions, S1 and S1-mini, to meet the needs of different users. A real-time voice interaction feature is planned for the future.

Looking for naturalness and expressiveness comparable to professional voice actors? Fish Audio's OpenAudio S1 voice model is exactly what you need!

Fish Audio official website:

https://fish.audio

Key highlights of OpenAudio S1:

Fine-grained emotion and style control: Supports a variety of emotion markers (such as anger, sadness, excitement, sarcasm, etc.), tone markers (such as hurried, shouting, whispering, etc.) and special markers (such as laughter, sobbing, sighing, etc.), giving precise control over the emotion and style of the voice, comparable to professional voice actors.

Multi-language support: Supports 13 languages including English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish and Portuguese, with strong global applicability.

Excellent cost-effectiveness: Positioned as the most cost-effective high-quality TTS model on the market, OpenAudio S1 is priced at only US$15 per million bytes of text (approximately US$0.8 per hour of generated audio), a significant price advantage.
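To make the pricing concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that billing is based on the UTF-8 byte length of the input text and that one megabyte means one million bytes; neither detail is confirmed here, and the function names are illustrative only. Only the two prices come from the figures quoted above.

```python
# Prices quoted in this article; everything else is an assumption.
PRICE_PER_MILLION_BYTES_USD = 15.0
APPROX_PRICE_PER_HOUR_USD = 0.8


def estimate_cost_usd(text: str) -> float:
    """Estimate the cost of synthesizing `text`.

    Assumes billing by UTF-8 byte length (an assumption, not a confirmed rule).
    """
    n_bytes = len(text.encode("utf-8"))
    return n_bytes / 1_000_000 * PRICE_PER_MILLION_BYTES_USD


def implied_bytes_per_hour() -> float:
    """Bytes of text per hour of audio implied by the two quoted prices."""
    return APPROX_PRICE_PER_HOUR_USD / PRICE_PER_MILLION_BYTES_USD * 1_000_000


if __name__ == "__main__":
    script = "Hello, and welcome to today's episode. " * 500
    print(f"Script size: {len(script.encode('utf-8'))} bytes")
    print(f"Estimated cost: ${estimate_cost_usd(script):.4f}")
    print(f"Implied text per hour of audio: {implied_bytes_per_hour():.0f} bytes")
```

Under these assumptions, the two quoted prices imply roughly 53,000 bytes of text per hour of audio. Note that non-Latin scripts such as Chinese or Japanese take about three bytes per character in UTF-8, so the same character count would cost more than English text.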

With emotion control, Fish Audio's voice cloning model can be used for video dubbing, audiobooks, and even advertising content.
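As an illustration of how the emotion, tone, and special markers from the highlights might be woven into a longer script, here is a minimal Python sketch. The parenthesized marker syntax, the marker spellings, and the helper names `tag` and `build_script` are all assumptions made for illustration; check Fish Audio's documentation for the actual format before relying on it.

```python
from typing import Optional


def tag(marker: str, text: str) -> str:
    """Prefix `text` with a marker in parentheses (hypothetical syntax)."""
    return f"({marker}) {text}"


def build_script(lines: list[tuple[Optional[str], str]]) -> str:
    """Join (marker, sentence) pairs into one annotated script.

    A marker of None leaves the sentence untagged.
    """
    parts = [tag(marker, text) if marker else text for marker, text in lines]
    return " ".join(parts)


if __name__ == "__main__":
    script = build_script([
        ("sadness", "I thought you would stay."),
        (None, "It does not matter now."),
        ("whispering", "Just do not tell anyone."),
        ("sighing", "We had a good run."),
    ])
    # The annotated text would then be sent to the speech model.
    print(script)
```

Tagging sentence by sentence like this is one way to set the emotion for each part of a paragraph up front, instead of regenerating the whole passage until a take happens to sound right.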

Fish Audio is an efficient assistant for content creators. Say goodbye to expensive recording studios and cumbersome dubbing processes, and get professional-grade audio instantly with one-click input.

For voice actors, Fish Audio is an AI tool that can lighten the burden of the profession. It helps relieve the vocal cord strain and stress caused by long dubbing sessions.

Fish Audio also announced that it will soon launch a voice copyright registration and profit-sharing mechanism, which will let voice actors preserve their voice at its peak and keep earning passive income from it.