A must-read for AI companion hardware companies: How Agora helps Robopoet build the next generation of AI companion hardware | Case study

Written by

Clara Bennett

Updated on: June 27, 2025

With the rise of AIGC (AI-generated content) technology, the AI hardware market has unprecedented opportunities for growth. AI hardware opens up a new kind of human-computer interaction through real-time companionship, immersive storytelling, and dynamic plots. However, the voice interaction of many current AI hardware products is unsatisfactory. Traditional voice solutions built on technologies such as WebSocket have not kept pace with the rapid development of AIGC, and the voice experience suffers as a result. Delivering smooth, natural AI voice interaction involves many technical challenges, such as noise interference in complex environments, unreliable communication where the signal is weak (for example, in underground garages), and intelligent interruption handling.

For AI hardware companies, solving these problems alone is costly and time-consuming, which makes it hard to respond quickly to demand in a fast-changing AI market. Companies are therefore better off focusing on their own business logic and core technologies and leaving the underlying technical problems to specialist suppliers.

Against this backdrop, Shanghai Robopoet Intelligent Technology Co., Ltd. (hereinafter "Robopoet") overcame these technical difficulties by working with Agora and brought Fuzzoo, an AI hardware product with an excellent interactive experience, to market. The case shows how AI hardware companies can iterate and launch products quickly by working with technology suppliers, and it offers useful experience for the industry.

01

To create an excellent interactive experience, Robopoet faced multiple technical challenges in voice interaction and urgently needed a partner to accelerate its product launch.

Robopoet was founded in January 2024 and focuses on developing AI companion robots. Its first product is the AI companion pet Fuzzoo, which is aimed mainly at women and seeks to reinvent emotional companionship through technology. Fuzzoo is equipped with Robopoet's original multimodal emotion model (MEM), which can listen to, perceive, and soothe users' emotions. Fuzzoo can also be nurtured over time, giving users real-time, personalized companionship.

However, in the field of AI toys, traditional hardware products generally rely on non-real-time solutions for voice interaction. As a result, users often notice obvious delays when talking with AI toys, which makes the interaction far less fluid. In addition, when there is background noise, the toy's accuracy in recognizing commands drops significantly, and its responses feel "mechanical" to the user.

To create an excellent interactive experience, Robopoet set out the following key requirements for Fuzzoo:

1. Immediacy of interactive feedback: Fuzzoo needs to respond quickly to user instructions and questions and provide a smooth, seamless experience, so that users are never left waiting because of delays.

2. Voice recognition in noisy environments: Even in noisy places such as subway stations, shopping malls, or parties, Fuzzoo must clearly recognize the user's voice commands and avoid treating background noise as valid input, so that interactions stay accurate.

3. Communication capability in low-bandwidth environments: Outdoors or wherever the network signal is weak, such as in underground parking lots, Fuzzoo needs to transmit the user's voice efficiently and accurately to the backend large model over limited bandwidth, so that the model can clearly interpret the user's intent.

4. Accuracy of speech recognition: When a user is speaking, other people may be talking nearby. Fuzzoo needs to pick out the speaker's voice accurately and avoid mistaking other people's voices for the speaker's instructions.

5. Support for interruption: During an interaction, users may want to interrupt Fuzzoo's response at any time. Fuzzoo needs to support this flexible way of interacting rather than only rigid, one-way communication.

Solving these problems in-house would have required heavy investment and a long development cycle, and Robopoet wanted to launch Fuzzoo as soon as possible. The company therefore decided to work with a professional technology supplier to overcome these technical difficulties and achieve rapid product iteration and market launch.

With technical advantages such as low latency, noise reduction, network stability, accurate voice identification, and intelligent interruption, as well as its compatibility with mainstream large models, Agora became an ideal partner for Robopoet.

Robopoet's founding team is young and efficient. After talking with Agora, the two sides quickly agreed to cooperate. On the one hand, Agora and Robopoet share similar views on market trends and are optimistic about the huge potential of the AI emotional companionship market. On the other hand, Agora's technical capabilities in conversational AI closely match Robopoet's needs.

In voice interaction, low latency is the key to a smooth experience. When the delay reaches 3 seconds, users clearly notice the lag, whereas Agora's median response delay is only 650 milliseconds. This figure has been verified in real-world tests in major cities across China, the United States, Europe, and Southeast Asia. A response this fast comes close to natural human conversation and effectively removes the user's anxiety about waiting.

In noise reduction, Agora has deep experience with 3A algorithms (acoustic echo cancellation, automatic gain control, and automatic noise suppression) and with AI noise reduction. Traditional 3A algorithms can effectively handle steady-state noise, such as continuous buzzing or applause, while AI noise reduction focuses on transient noise, such as the sudden sound of drilling when a user walks past a construction site. Together, these capabilities clean up the voice signal and improve interaction quality.

In complex network environments, Agora's software-defined real-time network (SD-RTN) has demonstrated strong stability. Agora has built more than 200 data centers around the world and uses intelligent routing and weak-network resilience algorithms to keep voice interaction smooth in places with poor signal, such as subways and underground garages. Even at an 80% packet loss rate, communication between users and the AI remains stable, and even if the network drops for 3-5 seconds, the conversation can continue seamlessly.

Agora's "Selective Attention Locking" technology can block 95% of ambient human voices and noise interference and accurately identify the voice of the person in the conversation. When several people share a microphone, the technology can distinguish between the different speakers, extract a specific voice according to the user's needs, and treat the other voices as noise to be suppressed, which provides a better voice interaction experience.

In addition, the "intelligent interruption" technology developed by Agora can simulate the rhythm of real-life conversation, allowing users to interrupt the AI at any time. Its interruption response time is as low as 340 milliseconds, which makes for a natural, smooth conversation. Compared with traditional AI dialogue systems, Agora's technology can also identify the user's intent. For example, when the user makes a sound such as "hmm", the system does not mistake it for an interruption command, so it more closely resembles natural communication between people.
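The idea of not treating "hmm" as an interruption can be illustrated with a minimal sketch. This is not Agora's implementation: the `BACKCHANNELS` word list, the `UserSpeech` type, the `should_interrupt` function, and the 300 ms minimum duration are all hypothetical names and values chosen for illustration, and a production system would more likely rely on a trained model than on a word list.

```python
from dataclasses import dataclass

# Hypothetical list of acknowledgement sounds that should NOT stop the AI.
BACKCHANNELS = {"hmm", "mm", "mhm", "uh-huh", "yeah", "ok", "okay", "right"}


@dataclass
class UserSpeech:
    transcript: str   # partial transcript of what the user has said so far
    duration_ms: int  # how long the user has been speaking


def should_interrupt(speech: UserSpeech, ai_is_speaking: bool,
                     min_duration_ms: int = 300) -> bool:
    """Decide whether the AI should stop talking (assumed heuristic)."""
    if not ai_is_speaking:
        return False  # nothing to interrupt
    # Normalize words: drop surrounding punctuation and lowercase them.
    words = [w.strip(".,!?").lower() for w in speech.transcript.split()]
    words = [w for w in words if w]
    if not words:
        return False  # no recognizable speech yet
    if all(w in BACKCHANNELS for w in words):
        return False  # pure acknowledgement, keep talking
    # Require a minimum amount of speech to avoid reacting to blips.
    return speech.duration_ms >= min_duration_ms
```

With these assumptions, `should_interrupt(UserSpeech("hmm", 400), True)` returns `False`, while `should_interrupt(UserSpeech("wait, stop", 500), True)` returns `True`.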

Beyond its deep expertise in intelligent voice technology, Agora has also completed integration with almost all of the world's mainstream large model providers (such as DeepSeek and ChatGPT). This means that Robopoet can switch freely between large models as its needs change, without being tied to a single supplier, and can better keep up with a market in which large models iterate rapidly.
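One common way to keep a product independent of any single model supplier is to hide each model behind the same small interface. The sketch below shows that pattern only; it is not Agora's API. `ModelRouter`, `fake_model_a`, and `fake_model_b` are hypothetical names, and the two backends are stand-in functions rather than real model clients.

```python
from typing import Callable, Dict, Optional

# In this sketch a "model backend" is any function from prompt text to reply text.
ModelBackend = Callable[[str], str]


class ModelRouter:
    """Routes prompts to whichever registered backend is currently active."""

    def __init__(self) -> None:
        self._backends: Dict[str, ModelBackend] = {}
        self._active: Optional[str] = None

    def register(self, name: str, backend: ModelBackend) -> None:
        self._backends[name] = backend
        if self._active is None:
            self._active = name  # the first registered backend is the default

    def use(self, name: str) -> None:
        if name not in self._backends:
            raise KeyError(f"unknown model backend: {name}")
        self._active = name

    def reply(self, prompt: str) -> str:
        if self._active is None:
            raise RuntimeError("no model backend registered")
        return self._backends[self._active](prompt)


# Stand-in backends; real ones would call a model provider's service.
def fake_model_a(prompt: str) -> str:
    return f"[model-a] {prompt}"


def fake_model_b(prompt: str) -> str:
    return f"[model-b] {prompt}"


router = ModelRouter()
router.register("model-a", fake_model_a)
router.register("model-b", fake_model_b)
```

Calling `router.use("model-b")` switches every later `router.reply(...)` call to the second backend, and the calling code does not change.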

Agora supports Robopoet with end-to-end software and hardware solutions, enabling it to focus on optimizing its core business logic and emotion model, and the two companies jointly drive the development of Fuzzoo.

Agora provides Robopoet with an end-to-end solution, covering all aspects of software and hardware support.

At the software level, Agora provides a conversational AI development kit. Within it, advanced voice activity detection (VAD) accurately identifies voice signals and reduces interference from background noise, which keeps voice recognition accuracy high. Real-time speech synthesis delivers rapid responses, making the interaction smoother and more natural. Intelligent interruption handling gives the device flexible conversational abilities and lets it adjust in real time to what the user says. This greatly improves the adaptability and fluency of the interaction, does away with "mechanical" responses, and helps Fuzzoo deliver a smoother, faster interactive experience.

At the hardware level, Agora also provides comprehensive support for Robopoet, covering key aspects such as chip selection, power consumption design, and vibration motors, to ensure that hardware performance and software functions are well matched.

Under this cooperation model, Robopoet can focus on the core of its business. For example, how Fuzzoo's business logic works and how the dolls interact socially are key questions for Robopoet. Fuzzoo's core competitiveness lies in Robopoet's self-developed multimodal emotion model (MEM), and refining and optimizing this model is likewise Robopoet's responsibility. Agora, through its technological advantages, provides Robopoet with solid underlying support to ensure Fuzzoo's excellent interactive performance.
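As a rough illustration of what voice activity detection does, the sketch below uses a classic energy threshold with a short "hangover" so that speech is not cut off mid-word. This is a basic textbook technique, not the advanced VAD in Agora's kit. The function names, the 160-sample frame size, the 0.02 threshold, and the two-frame hangover are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_voice(samples: Sequence[float], frame_size: int = 160,
                 threshold: float = 0.02,
                 hangover_frames: int = 2) -> List[bool]:
    """Return one flag per full frame: True if the frame counts as speech.

    A frame is speech if its energy reaches the threshold, or if it falls
    within `hangover_frames` frames after the last loud frame.
    """
    flags: List[bool] = []
    hangover = 0
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        if frame_rms(frame) >= threshold:
            flags.append(True)
            hangover = hangover_frames  # restart the hangover window
        elif hangover > 0:
            flags.append(True)          # quiet, but just after speech
            hangover -= 1
        else:
            flags.append(False)
    return flags
```

An energy threshold alone cannot tell speech from loud noise, which is one reason production systems combine VAD with noise suppression or use learned models.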

Fuzzoo was successfully launched at MWC and received high attention and recognition from the market

Robopoet gave a sneak preview of Fuzzoo at the Mobile World Congress (MWC) in 2025. Fuzzoo can stay with users at all times and listen to their needs. It perceives emotional changes through users' language, expressions, and behavior, and responds with comforting words, vibration, or changes of expression. It currently has more than 200 expressions built in. As interaction time and frequency increase, Fuzzoo not only builds a deeper emotional connection with its user but also develops a unique personality. In addition, Fuzzoo records daily interactions with users from its own perspective and generates a "diary" to enhance intimacy. Fuzzoo also includes NFC, so two pets can become good friends with just a light touch, which shows off its social side. Robopoet plans to officially release Fuzzoo in June 2025 and start online pre-sales at the same time.

"Agora's conversational AI technology empowers the next generation of AI hardware and robots to sense, think, react, and communicate in real time," said Pan Yunan, co-founder and CTO of Robopoet. "With ultra-low latency response, intelligent interruption, and advanced voice processing capabilities, Agora makes human-machine interaction more natural and smooth, and always ensures the stability and reliability of the interactive experience."