Vector Database Zilliz x Xihu Xinchen: Giving Agents Emotional Intelligence

Written by
Clara Bennett
Updated on: July 11, 2025

How does Hangzhou-based Xihu Xinchen use AI to inject a "soul" into robots and meet the emotional needs of modern people?

Core content:
1. Background of Xihu Xinchen and its distinctive positioning in the AI field
2. Emotional companionship needs against the backdrop of an aging population and a growing single population
3. Market potential and user behavior in the AI companionship economy


Preface

During this year's Spring Festival Gala, the handkerchief-twirling robots from Unitree (Yushu) became an overnight sensation.

Flexible limbs, precise motion control, tight formations, northeastern floral cotton jackets and bright red handkerchiefs: on the most traditional of New Year's Eves, the door to a new cyber world was thrown open for the whole world to see.

So, folding quilts, making milk tea, cooking, mopping the floor... ordinary people began wondering how these mechanical "beasts of burden" might finally liberate the human ones.

Investors, local governments, and entrepreneurs started another round of soul-searching about their own cities, with Hangzhou as the keyword.

But discussions inside the industry have already moved on to the robots' next step: is there a market for packaging a soul into the hardware?

The focus of this discussion is another Hangzhou-based AI player: Xihu Xinchen.

In April 2023, when the well-known A-share company Tom Cat wired a strategic investment into Xihu Xinchen's account, Tom Cat's own stock price soared from a low of around 3 yuan at the start of the year to around 9 yuan. The company that made its name with a "talking cat" has since taken on a new identity in the A-share market: "spiritual infrastructure leader for Generation Z."

It was also at that point that Xihu Xinchen, then barely two years old, unexpectedly came into the spotlight. It is a well-known large-model startup whose founder had served as head of the Deep Learning Lab and as a doctoral supervisor at Westlake University, and which had once set off wide discussion in the industry by publicly recruiting a CEO.

From a business perspective, among large-model players racing to top one benchmark after another, Xihu Xinchen is one of the few "outliers" that focuses on emotion and AI companionship, on packaging a "soul" for robots.

01 

Why do we need robots to understand you better?


At night, batches of code parse, package, and compute human confessions. Every month, millions of lonely souls talk with "Liaohui Xiaotian," the free psychological counseling platform under Xihu Xinchen. The words typed most often are "can't sleep" and "why."

But the companionship economy is not a business born in the era of big models and robots.

On one side is the aging population: in 2023, people aged 65 and above accounted for 15.4% of China's population, and one in every four elderly people in the world is Chinese. On the other side is the rise of singles: China has 240 million single people, and 30% of Generation Z are willing to treat virtual partners as an emotional supplement.

More than a decade ago, the online chat-companion services sold on Taobao laid the groundwork for today's short-video and live-chat companionship offerings. In 2014, the founder of the Wangyu Internet Cafe chain launched BiXin, which became China's earliest online companionship platform. In 2016 came Momo, a generation of young people's first attempt at making friends with strangers. Overseas, Tinder, the originator of the category, had already mastered this playbook and taken the company public in one stroke, becoming a synonym for stranger socializing in the English-speaking world.

When the atomization of individual careers is compounded by an aging, increasingly single society, AI, as a substitute for offline companionship that costs less, understands its users better, and is available 24 hours a day, has become a spiritual consumption staple for a generation.

And how to make robots understand you better has become a new urgent need.

Meanwhile, in the internet world, how long users spend in an app is also a proxy for the market. Time spent in emotional-companionship AI apps is clearly far longer than in purely instrumental search or factual text-generation tools.

Data from 2023 shows that C.AI, a representative companion AI, received more than 200 million web visits per month, with users staying an average of 29 minutes per visit. Users of Xihu Xinchen's "Liaohui Xiaotian" AI psychological counseling application average more than 90 chat turns per session. By comparison, GPT users in the same period averaged only 8 minutes.

In addition, unlike the "IQ-type" large models that post new SOTA results every month, a defining trait of "EQ-type" large models is that their capability and moat are built jointly by the company and its users. Every historical conversation deepens the model's understanding of a user, producing more accurate feedback and a positive cycle between time spent and quality of experience.

The theory sounds great, so why, in reality, has companion AI stayed lukewarm?

02

Is companion AI underestimated?

Hao Shaochun, an AI algorithm engineer at Xihu Xinchen, concluded that most companion AIs on the market today merely take up users' time without providing real companionship.

In other words, they don’t have enough emotional value.

So when it launched Xinchen Lingo, the country's first end-to-end general voice model benchmarked against GPT-4o, together with its first product, the free psychological counseling platform "Liaohui Xiaotian," Xihu Xinchen concentrated its R&D on "listening" and "empathy": letting the model keenly pick up the user's tone, rhythm, and emotions, converse in the form closest to natural human speech and voice, and be interrupted at any time. It has now reached the level of an intermediate psychological counselor.

But how can we make AI more human-like?

Xihu Xinchen’s answer is to launch a universal model with high emotional intelligence.

On one hand, Xihu Xinchen consulted and interviewed many psychology experts and patients, accumulated a large volume of high-quality corpus, and on that basis built its own affective computing and empathy modules to make answers more emotional and vivid.

On the other hand, Xihu Xinchen built an end-to-end voice-call framework to replace the cumbersome traditional pipeline of user voice → transcribed text → AI-generated text answer → synthesized voice answer, which both reduces voice-output latency and makes expression more emotional, with more rise and fall in tone.

Finally, beyond feeling, tone, and emotion, true high emotional intelligence also requires the large model to understand what the user is saying and to remember what the user has said in the past.

To meet this need, Xihu Xinchen launched VoiceRAG, a speech-based retrieval-augmented generation (RAG) system, built on the Zilliz Cloud vector database.

03 

How can vector databases help large models understand and remember better?

Specifically, VoiceRAG's voice interaction faces three major difficulties:

1. Accurate understanding

In voice interaction, accurately grasping the user's intent and needs is crucial. This requires not only automatic speech recognition (ASR) but also speech understanding. Only with an accurate understanding of the user's input can RAG be accurate in real-time voice scenarios.

2. Real-time

Real-time performance (in both simplex and duplex modes) is one of the core features of voice interaction. In real-time scenarios, VoiceRAG must complete voice information extraction, knowledge-base retrieval, and response generation within a limited time budget to meet users' expectations.

3. Natural interaction

The goal of voice interaction is a natural, smooth user experience. In real-time scenarios, RAG must handle continuous follow-up questions, feedback, and real-time interruptions to provide a seamless experience; at the same time, the retrieved content must not disrupt the LLM's conversational speaking style.

To meet the above three requirements, RAG needs to be able to quickly retrieve information related to user needs.

Matching the voice input against the knowledge base allows the system to respond to user needs quickly.

Unlike the messy complexity of human memory, a machine's "memory" can be quantified with metrics, mainly along two axes: faster retrieval and more accurate recall.
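As a simple illustration of those two metrics, here is a minimal sketch (not from Xihu Xinchen) of how retrieval latency and recall@k are typically measured; the toy corpus and the `search_fn` callable are placeholders for illustration only.

```python
# Minimal sketch: quantifying a retrieval system's "memory" by latency and recall@k.
# The toy search function and labeled relevance data are placeholders.
import time

def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the truly relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def timed_search(search_fn, query):
    """Run one query and report (results, latency in milliseconds)."""
    start = time.perf_counter()
    results = search_fn(query)
    return results, (time.perf_counter() - start) * 1000.0

# Example with a toy in-memory "index":
corpus = {1: "sleep hygiene tips", 2: "breathing exercise", 3: "workplace stress"}
toy_search = lambda q: [doc_id for doc_id, text in corpus.items() if q in text]
ids, latency_ms = timed_search(toy_search, "sleep")
print(recall_at_k(ids, relevant_ids={1}, k=3), f"{latency_ms:.2f} ms")
```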

At Xihu Xinchen, the entire knowledge-base corpus is processed and stored in the Zilliz Cloud vector database: the original corpus is converted into vectors by a text embedding model and written to Zilliz Cloud for later query and recall.
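As a rough illustration of this ingestion step, the sketch below loads a toy corpus into a Zilliz Cloud (Milvus) collection with the pymilvus `MilvusClient`. The cluster URI, token, collection name, sample documents, and the open-source embedding model are all placeholders; Xihu Xinchen's actual corpus and embedding model are not described in this article.

```python
# Sketch: embedding a knowledge-base corpus and writing it to Zilliz Cloud.
# URI, token, collection, and model below are placeholders, not the production setup.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in text embedding model (384-dim)
client = MilvusClient(uri="https://<your-cluster>.zillizcloud.com", token="<api-key>")

client.create_collection(
    collection_name="counseling_corpus",
    dimension=384,               # must match the embedding model's output size
    metric_type="COSINE",
)

docs = [
    "Grounding exercises can help when anxiety makes it hard to fall asleep.",
    "Reflective listening: restate the speaker's feeling before giving advice.",
]
client.insert(
    collection_name="counseling_corpus",
    data=[
        {"id": i, "vector": embedder.encode(doc).tolist(), "text": doc}
        for i, doc in enumerate(docs)
    ],
)
```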

Next comes the voice-information-extraction and knowledge-base-retrieval phase. When a user asks a question by voice, it is transcribed by ASR and converted into a vector by the embedding model, or matched directly against the corpus vectors via a speech embedding. The top-k search results are recalled, re-ranked, and returned to the Lingo voice model.
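The sketch below continues the previous one at query time: an ASR transcript stands in for the user's speech, the question is embedded and searched in Zilliz Cloud, and a placeholder `rerank_score` marks where a real re-ranking model would sit before results are handed to the voice model. It is an assumed workflow, not Xihu Xinchen's actual code.

```python
# Sketch: query-time recall and re-ranking for a VoiceRAG-style flow.
# `embedder` and `client` come from the ingestion sketch above; rerank_score is a placeholder.
query = "I keep waking up at 3 a.m. and can't stop worrying."   # ASR transcript of the user's speech
query_vec = embedder.encode(query).tolist()

hits = client.search(
    collection_name="counseling_corpus",
    data=[query_vec],
    limit=5,                         # top-k recall from the vector database
    output_fields=["text"],
)[0]

def rerank_score(question: str, passage: str) -> float:
    """Placeholder scorer; a cross-encoder re-ranker would normally go here."""
    return sum(word in passage.lower() for word in question.lower().split())

reranked = sorted(hits, key=lambda h: rerank_score(query, h["entity"]["text"]), reverse=True)
context = "\n".join(h["entity"]["text"] for h in reranked[:3])   # handed to the voice model as context
```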

Finally, the large model combines its own generative ability with the top-k results recalled from the vector database to produce a natural, fluent response, which is rendered as speech through TTS.

Moreover, to deliver emotional value accurately, accurate understanding is the prerequisite for everything. Yet real scenarios are full of noisy backgrounds, accented speech, and users whose voices tremble or break because they are overwhelmed with emotion.

To handle this, the system must still retrieve relevant information and generate a response in real time, without letting the extra processing add latency.

GPTCache, developed by Zilliz, is a solution that balances knowledge-base retrieval speed and generation quality. Using semantic caching, GPTCache stores responses already generated by the language model so that similar requests can be answered from the cache, improving the application's response speed and overall efficiency.
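For reference, here is a minimal semantic-cache setup following GPTCache's published quick-start. The OpenAI model is only a stand-in for whichever LLM sits behind the cache; how Xihu Xinchen wires GPTCache into Lingo is not detailed in this article.

```python
# Sketch: semantic caching with GPTCache, per its documented quick-start.
# The OpenAI model is a stand-in LLM; swap in your own backend as needed.
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

onnx = Onnx()                                          # embeds incoming questions for similarity lookup
data_manager = get_data_manager(
    CacheBase("sqlite"),                               # scalar store for cached answers
    VectorBase("faiss", dimension=onnx.dimension),     # vector index used to find similar past questions
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# A question semantically close to an earlier one ("can't sleep", "why can't I
# fall asleep") is served from the cache instead of triggering a new generation.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "I can't sleep. What can I do right now?"}],
)
```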

04

In an era where data is an asset, how to choose a suitable vector database

Whether it is the knowledge base corpus or the user's historical interaction records, they are undoubtedly the core data assets of an enterprise. Why did Xihu Xinchen choose Zilliz products?

In Hao Shaochun's view, performance was the first consideration. "From when I first encountered Milvus in 2018 until now, Zilliz's products have been genuinely powerful."

Generally speaking, for a voice system to give users a smooth interactive experience, it must complete the whole chain of speech recognition, knowledge retrieval, and text generation in a very short time. Within that budget, only 200-500 ms is left for RAG.

Zilliz Cloud, the fully managed commercial vector database from Zilliz, can search billions of vectors in milliseconds with high precision, comfortably supporting VoiceRAG's demands for a smooth interactive experience. According to Xihu Xinchen's test feedback, with Zilliz Cloud the search latency of VoiceRAG is at most 100 ms and 50 ms on average, roughly one tenth of the allotted budget.

Behind this is Zilliz Cloud's self-developed Cardinal search engine, a multi-threaded, highly efficient vector search engine built in modern C++ around practical approximate nearest neighbor search (ANNS) algorithms. Compared with cloud vendors' RAG offerings and open-source vector database products, it can improve performance (QPS) by more than 10x.

In addition, Cardinal exploits heterogeneous computing: it uses SIMD (Single Instruction, Multiple Data) instruction sets such as x86's AVX-512 and ARM's NEON and SVE to provide code paths optimized for efficient vector computation.

For customers with high concurrency and multi-resource deployments, Zilliz Cloud uses a distributed architecture that spreads the load and raises overall system performance. Its storage-compute separation decouples the compute layer from the persistence layer, making it easy to scale out replicas quickly and limiting the impact of a single compute-node failure. Meanwhile, separating reads, writes, and indexing minimizes interference between workloads.

For enterprises with pronounced traffic peaks and troughs, Zilliz Cloud also offers elastic scaling, dynamically adjusting cluster capacity to real-time usage, preventing writes from being blocked by resource shortages, and helping developers cut operating costs.

Epilogue

In the AI era, re-evaluating the value of understanding and companionship

In sociology, one view holds that the growing loneliness of contemporary people is a by-product of a highly specialized society and professional atomization. As tools advance, the fine-grained division of labor confines individuals to narrow professional fields (programmers, accountants), and differences between knowledge systems raise barriers to cross-industry communication.

The time poverty caused by high-intensity work means that offline social interaction and high-quality companionship have become luxuries for this generation of young people.

So as companion AI gradually takes center stage, the story of giving robots a "soul" starts to come full circle; after all, being remembered and being understood have always been basic spiritual needs written into human genes. If the real world cannot provide them, why not leave it to AI and the vector database?

Note: This article is based on a talk shared by Hao Shaochun, AI algorithm engineer at Xihu Xinchen.