TEN, an open source Voice Agent framework, lets your AI Agent listen and speak!

Written by
Jasper Cole
Updated on: July 10, 2025
Recommendation

TEN Framework, an open source framework for building Voice Agents, enables AI to listen and speak, and delivers low-latency, interruptible audio and video interaction.

Core content:
1. TEN Framework solves the multimodal data transmission and latency problems that arise when building a Voice Agent
2. Supports multimodal transmission, a low-latency and interruptible interactive experience, a rich set of plug-ins, and flexible orchestration
3. Supports multiple programming languages and platforms, and quickly delivers audio and video interaction scenarios such as AI outbound call centers

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

Building a Voice Agent is like putting an elephant into a refrigerator. It seems to take only three simple steps:

1) Select the LLM, STT, and TTS models

2) Connect WebRTC or WebSockets for real-time transmission

3) Tune the parameters and package everything up
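
On paper, these steps amount to chaining three calls. The minimal Python sketch below illustrates that cascade. The stage functions (`fake_stt`, `fake_llm`, `fake_tts`) and the `CascadePipeline` class are hypothetical stand-ins for real model clients, not part of TEN or any SDK, and the real-time transport in step 2 is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real STT/LLM/TTS clients would be wrapped to fit them.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadePipeline:
    """Chains STT -> LLM -> TTS for one conversational turn."""

    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def run_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # audio -> text
        reply = self.llm(transcript)     # text -> reply text
        return self.tts(reply)           # reply text -> audio


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are the spoken words


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the text bytes are synthesized audio


pipeline = CascadePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Here `pipeline.run_turn(b"hello")` returns `b"You said: hello"`, and swapping a model only means passing a different callable for one stage.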

However, in practice there are many difficulties:

"The echo is too loud and there is too much noise." "The voices overlap so much that nothing can be heard clearly."

"Is this artificial intelligence or artificial stupidity? It cannot even be interrupted while it is talking."

"The latency is too high and the response is slow." "A new model has come out and I have to integrate it all over again?"

"The three steps look simple, so why are they so hard to implement?"

"Real-time transmission of multimodal data is too troublesome to handle."

"Why is the CPU consumption so high?!"

This is why TEN Framework, an open source framework for conversational Voice Agents, came into being!

TEN addresses the complex multimodal data transmission and high latency involved in building a Voice Agent. It also modularizes models such as LLM, STT, and TTS so that they can be called freely. This reduces the engineering burden on developers, lets them focus on scenarios and business content, and helps them complete product implementation and verification quickly, so the result can actually be used in production.

So, what is TEN?

TEN is a real-time conversational Voice Agent engine that helps developers quickly build AI Agents capable of audio and video interaction.

It currently supports major STT, LLM, and TTS providers worldwide, including DeepSeek, OpenAI, and Gemini.

TEN also supports integration with Dify and Coze. You only need to configure the bot ID/API to make your bot speak.

What are the advantages of TEN?

1. Multimodal transmission: handles voice, text, and image input and output

    • Supports the transmission of voice, text, images, and other data, making full use of multimodal capabilities

    • Supports both cascade mode (STT-LLM-TTS) and end-to-end mode for building audio and video interactions


2. Low latency and interruptibility: built-in, optimized real-time communication makes interaction low-latency and interruptible

    • Built-in RTC addresses latency during voice interaction. An Agent built on TEN Framework can reach a latency of only 650 ms under the best conditions.

    • With built-in VAD, you can interrupt the AI at any time while it is speaking and then resume the conversation, as in a real dialogue

3. Rich plug-ins and flexible orchestration: plug in the world's mainstream STT, LLM, and TTS services and use them right away

    • Plug-ins for mainstream STT, LLM, TTS, and other services are already available; just configure the key

    • Keeps up with the latest technology: integration with the OpenAI Realtime API and Gemini 2.0 was completed within 24 hours

4. Multi-language, cross-platform: supports mainstream programming languages, and Agents connect seamlessly across platforms

    • Supports programming languages such as C++, Go, Python, and Node.js (JavaScript support is coming soon)

    • Agents can be used across Windows, Mac, Linux, mobile devices, and other platforms
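
The interruption behaviour described in item 2 can be illustrated with a small sketch. It assumes a simple energy-threshold VAD over 16-bit PCM frames in native byte order, plus a hypothetical `Playback` class standing in for the agent's audio output. TEN's built-in VAD and playback control are not shown here and may work quite differently.

```python
import array
import math
from typing import Iterable


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit signed PCM samples."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class EnergyVAD:
    """Reports speech after `min_frames` consecutive frames at or above `threshold`."""

    def __init__(self, threshold: float = 500.0, min_frames: int = 3) -> None:
        self.threshold = threshold    # assumed value; tune for the microphone in use
        self.min_frames = min_frames  # consecutive loud frames required
        self._loud_run = 0

    def is_speech(self, frame: bytes) -> bool:
        if frame_rms(frame) >= self.threshold:
            self._loud_run += 1
        else:
            self._loud_run = 0
        return self._loud_run >= self.min_frames


class Playback:
    """Hypothetical handle for the agent's TTS audio output."""

    def __init__(self) -> None:
        self.playing = True

    def stop(self) -> None:
        self.playing = False


def listen_for_barge_in(
    mic_frames: Iterable[bytes], vad: EnergyVAD, playback: Playback
) -> int:
    """Stops playback once the user speaks; returns that frame's index, or -1."""
    for index, frame in enumerate(mic_frames):
        if playback.playing and vad.is_speech(frame):
            playback.stop()
            return index
    return -1


def pcm_frame(amplitude: int, samples: int = 160) -> bytes:
    """Builds a constant-amplitude test frame (160 samples = 10 ms at 16 kHz)."""
    return array.array("h", [amplitude] * samples).tobytes()
```

Requiring several consecutive loud frames keeps a single click or cough from cutting the agent off. A production system would typically use a trained VAD model and echo cancellation rather than a raw energy threshold.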

What can you do with TEN?

1. TEN + SIP: AI Outbound Call Center

AI call centers, for example corporate customer service, outbound calling, professional consulting...

Let customers call your customized AI Agent experts!


The demo shows a psychological counseling expert. You can hear the Agent lower its tone when it hears "me" say I am in a bad mood. Voice suits this scenario better than text.

2. TEN + Hardware: Smart Toys

Story machines, smart speakers, AI toys, smart home devices...

ESP32 is now supported. You can have a low-latency, interruptible conversation directly with an ESP32 and let it tell you a story.

3. TEN + Digital Human: Virtual Companionship

TEN currently supports Trulience avatars, which can serve as your AI shopping guide, virtual pet, AI game companion...

You can have the puppy avatar switch dialects and talk with you by voice.


You can also play chess against the AI, controlling it by voice and freeing your hands.


4. TEN + Computer Use: Voice Control of Your Computer

Natural language user interfaces (LUI) will become more and more integrated into our lives.

Use your voice to open browsers, desktop apps, memos... You can also use TEN to create your own "Jarvis".

5. TEN + Games: AI Game Companion

A voice-driven murder mystery script based on Murder on the Orient Express.

Chat with NPCs about what they were doing when the crime took place. It is an immersive experience, and you can play a murder mystery game on your own.

6. TEN + Gemini 2.0: A Personal Assistant That Can See

With the Gemini 2.0 model, TEN can not only hear but also see!

When you share images with TEN through a webcam or screen sharing, it can accurately identify not only a kitten's color but also its specific breed!

7. TEN + Storytelling: A Story Machine That Can Speak and Draw

TEN provides Storyteller as a use case, with a built-in text-to-image model plug-in that guides users through completing a story together and generates illustrations to go with it!


How to use TEN?

If you are a beginner and want to learn how to use TEN Agent step by step, please refer to the tutorial by the YouTube creator Developer Digest.


The following video is from the Xiaohongshu creator @T8.star.

If you already have a basic understanding of TEN, you are also welcome to try the latest digital human, TEN + Trulience.


Finally, if you are interested in TEN, you are welcome to star the project, support it, and follow the latest developments!