Stanford Multimodal Interaction Agent Review: Agent AI Integration and Its Technical Challenges

Written by
Jasper Cole
Updated on: July 17, 2025
Recommendation

An in-depth look at the latest progress in Agent AI and the technical challenges it faces in multimodal interaction.

Core content:
1. The concept of Agent AI and its connection to artificial general intelligence
2. Challenges and solutions in integrating Agent AI
3. Learning strategies for Agent AI and its cross-modal interaction capabilities
4. Application scenarios for Agent AI across different fields
5. Continuous self-improvement and the future development of Agent AI



This paper explores multimodal artificial intelligence systems in depth, especially the interactivity of agents in physical and virtual environments. It not only provides a research roadmap for researchers in the AI field, but also offers deep insights into the future development of AI. The core content of the paper is divided into the following parts:

1. The Concept of Agent AI

The paper introduces the background, motivation, and future goals of Agent AI, and discusses how it can become an important path toward artificial general intelligence (AGI).

2. Challenges faced by Agent AI

The challenges encountered in integrating Agent AI with existing large-scale base models (such as LLMs and VLMs), such as hallucinations, bias, and data privacy, are discussed, and corresponding solutions are explored.

3. Learning strategy of Agent AI

Different strategies and mechanisms for training Agent AI are explored, including reinforcement learning, imitation learning, and in-context learning.

4. Classification and Application of Agent AI

The different types of Agent AI are classified and their practical application scenarios in fields such as gaming, robotics, healthcare, etc. are explored.

5. Cross-modal, cross-domain, and cross-reality Agent AI

The paper discusses how Agent AI interacts with and understands different modalities, domains, and realities, and how to achieve the transition from simulation to reality. This is one of the most forward-looking discussions in the paper.

6. Continuous Self-Improvement of Agent AI

It explores how Agent AI can continuously learn and improve itself through interaction with the external environment and users, while pointing out the current challenges and difficulties.

This article explores the challenges facing Agent AI.

Table of Contents

01 Infinite Agents

02 Agent AI based on large foundation models

03 Agent AI for emergent capabilities

Multimodal AI systems are likely to become ubiquitous in our daily lives.

One promising approach is to embody these systems as intelligent agents in physical and virtual environments, making them more interactive.

Current systems leverage existing foundation models as basic building blocks for creating embodied agents. Embedding agents into such environments helps the models process and interpret visual and contextual data, which is critical for creating more complex and context-aware AI systems.

For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective emotion of a scene can be used to inform and guide the responses of an agent in a given environment.

To accelerate research in agent-based multimodal intelligence, we define "agent AI" as a class of interactive systems that perceive visual stimuli, language input, and other environmental ground truth data and are able to generate meaningful embodied actions. In particular, we explore systems that aim to improve agents based on predictions of the next embodied action by integrating external knowledge, multisensory input, and human feedback.
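To make this definition concrete, here is a minimal sketch of such a perceive-act loop; every class and function name below is our own illustration (not from the paper), with a placeholder policy standing in for the underlying LLM/VLM:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical container for the multimodal inputs described above:
# visual stimuli, language input, and other environmental ground-truth data.
@dataclass
class Observation:
    image: Optional[bytes] = None
    text: str = ""
    environment_state: dict = field(default_factory=dict)

class EmbodiedAgent:
    """Toy agent AI loop: perceive -> predict next embodied action -> act.
    External knowledge and human feedback enter as extra conditioning."""

    def __init__(self, knowledge_base: dict):
        self.knowledge_base = knowledge_base  # external knowledge source
        self.feedback_log: list = []          # accumulated human feedback

    def predict_next_action(self, obs: Observation) -> str:
        # Placeholder policy; a real system would query an LLM/VLM here.
        if "obstacle" in obs.environment_state:
            return "stop"
        return "move_forward"

    def incorporate_feedback(self, feedback: str) -> None:
        # Human feedback is stored to refine future predictions.
        self.feedback_log.append(feedback)

agent = EmbodiedAgent(knowledge_base={"map": "kitchen"})
print(agent.predict_next_action(Observation(text="go to the table")))
```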

We argue that developing AI systems grounded in embodied environments can also mitigate the hallucination problem of large foundation models and their tendency to generate outputs that do not correspond to the environment. The emerging field of Agent AI covers the broader embodied and agentic aspects of multimodal interaction.

In addition to agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulation scenario and interact with embodied agents in the virtual environment.

An overview of Agent AI systems that can perceive and act across different domains and applications.

Agent AI is emerging as a promising path toward artificial general intelligence (AGI). Agent AI training has demonstrated the ability to understand the physical world across multiple modalities, and by combining generative AI with multiple independent data sources it constructs a training framework that is decoupled from physical reality. Large foundation models of agents and behaviors, trained on cross-reality data, can be applied in both the physical and virtual worlds. We propose a general framework for Agent AI systems that can perceive and act across domains and scenarios, which we expect to become a viable path to AGI through the agent paradigm.

2. Agent AI Integration

Although base models based on large language models (LLMs) and visual language models (VLMs) have been applied in the field of embodied intelligence, their performance is still limited, especially in understanding, generating, editing, and interacting with unseen environments or scenes. This makes it difficult for AI agents to achieve optimal output.

Current agent-centric AI modeling approaches focus on directly accessible and well-defined data, such as text or world states in the form of strings. Such approaches typically use domain- and environment-independent patterns learned through large-scale pre-training to predict action outputs for each specific environment. In our previous research, we explored knowledge-guided collaborative and interactive scene generation tasks by integrating large foundation models and achieved encouraging results. This shows that knowledge-based LLM agents can significantly improve the performance of 2D and 3D scene understanding, generation, and editing, while also showing advantages in human-agent interactions.

By integrating the Agent AI framework, large foundation models can understand user input more accurately and deeply, enabling complex and highly adaptable human-computer interaction (HCI) systems. In the field of generative AI, LLMs and VLMs act as an invisible underlying architecture whose emergent capabilities have gradually come to the fore; they are widely used in embodied AI, knowledge enhancement for multimodal learning, mixed-reality generation, text-to-visual editing, and human-computer interaction in games or robotic tasks involving 2D/3D simulation.

Recent advances in foundation models for Agent AI provide a key catalyst for unlocking general intelligence in embodied agents. Large action models, or Agent-Vision-Language models, open up new possibilities for general embodied systems, enabling efficient planning, problem solving, and continuous learning in complex environments. Agent AI not only takes a further step toward testing in the metaverse, but also points the way to an early version of artificial general intelligence (AGI).

2.1. Infinite Agents

Artificial intelligence (AI) agents have the ability to interpret, predict, and respond based on their training data and input data. While these capabilities are advanced and continue to improve, it is critical to recognize their limitations and the impact of the underlying data on which they are trained.

AI agent systems typically have the following capabilities:

  • Predictive modeling: AI agents can predict likely outcomes or suggest next actions based on historical data and trends. For example, they can predict the continuation of a text, the answer to a question, the bot’s next move, or the solution to a scenario.

  • Decision making: In some applications, AI agents can make decisions based on their reasoning. Typically, agents make decisions based on the reasoning that is most likely to achieve a specific goal. For AI applications such as recommender systems, agents can decide which products or content to recommend based on reasoning about user preferences.

  • Handling ambiguity: AI agents are usually able to handle ambiguous inputs and choose the most likely interpretation by reasoning based on context and training data. However, their ability to handle ambiguity is limited by the scope of their training data and algorithms. 

  • Continuous Optimization: While some AI agents are able to learn from new data and interactions, many large language models do not continuously update their knowledge base or internal representations after training. Their reasoning is often based only on the data available as of the last training update.

An augmented interactive agent with multimodal, cross-reality, unbiased integration and an emergent mechanism is shown below. An AI agent normally needs to collect a large amount of training data for each new task, which can be expensive or impossible in many domains. In this study, we developed an infinite agent that learns to transfer memory information from general foundation models (such as GPT-X or DALL-E) to new domains or scenes, achieving understanding, generation, and interactive editing of scenes in the physical or virtual world.


Multimodal agent AI for 2D/3D embodied generation and editing interactions across real-world environments.

In the field of robotics, one application of this infinite agent is RoboGen (Wang et al., 2023d). In this study, the authors proposed a pipeline that autonomously runs the task proposal, environment generation, and skill learning loop. RoboGen aims to transfer the knowledge embedded in large models to the field of robotics.

2.2. Agent AI based on large foundation models

Recent studies have shown that large foundation models play a key role in generating data that serves as a benchmark for determining an agent's behavior under the constraints imposed by its environment.

For example, foundation models have been used for robotic manipulation and navigation: Black et al. used an image-editing model as a high-level planner to generate images of future subgoals that guide low-level policies, while for robotic navigation Shah et al. proposed a system that uses an LLM to identify landmarks from text and a visual language model (VLM) to associate those landmarks with visual input, enhancing navigation through natural-language instructions.
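As a rough caricature of the Shah et al. pipeline described above (a sketch under our own assumptions; the two stub functions stand in for real LLM and VLM calls):

```python
from typing import List, Optional

def extract_landmarks(instruction: str) -> List[str]:
    # Stub for the LLM step: pull landmark phrases out of the instruction.
    # A real system would prompt an LLM; a fixed vocabulary fakes it here.
    known = ["red door", "staircase", "vending machine"]
    return [lm for lm in known if lm in instruction.lower()]

def ground_landmark(landmark: str, frames: List[str]) -> Optional[int]:
    # Stub for the VLM step: score each camera frame against the landmark
    # text and return the index of the best match (None if no match).
    matches = [i for i, frame in enumerate(frames) if landmark in frame]
    return matches[0] if matches else None

instruction = "Go past the red door and stop at the vending machine"
frames = ["empty hallway", "red door ahead", "vending machine on the left"]
for lm in extract_landmarks(instruction):
    print(lm, "-> frame", ground_landmark(lm, frames))
```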

In addition, there is growing interest in generating human movements conditioned by language and environmental factors. Several AI systems have been proposed to generate actions tailored to specific language instructions and adapt to various 3D scenes. This line of research highlights the growing power of generative models in enhancing the adaptability and responsiveness of AI agents in diverse scenarios.

2.2.1. Hallucinations

Agents that generate text are often prone to hallucinations, where the generated text is nonsensical or unfaithful to the provided source content. Hallucinations can be divided into two categories: intrinsic hallucinations and extrinsic hallucinations.

Intrinsic hallucinations are hallucinations that contradict the source material, while extrinsic hallucinations are when the generated text contains additional information that was not originally included in the source material.

Some promising approaches to reduce the hallucination rate in language generation include using retrieval-augmented generation or other methods that generate natural language output through external knowledge retrieval. Typically, these methods aim to enhance language generation by retrieving additional source material and providing mechanisms to check whether the generated response is inconsistent with the source material.
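A minimal sketch of this retrieval-augmented pattern, with a toy word-overlap retriever standing in for a real retrieval system and a stub in place of the language model call (all names here are our own, for illustration):

```python
from typing import List

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM API call.
    return "Paris is the capital of France."

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank source passages by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))[:k]

def generate_with_sources(query: str, corpus: List[str]) -> dict:
    passages = retrieve(query, corpus)
    prompt = ("Answer using ONLY these sources; say 'unknown' if they are "
              "insufficient.\n"
              + "\n".join(f"- {p}" for p in passages)
              + f"\nQuestion: {query}")
    answer = call_llm(prompt)
    # Crude consistency check: flag answers sharing no content words with
    # any retrieved source, a cheap proxy for "inconsistent with sources".
    source_words = set(" ".join(passages).lower().split())
    supported = bool(source_words & set(answer.lower().split()))
    return {"answer": answer, "sources": passages, "supported": supported}

corpus = ["Paris is the capital of France.", "The Seine flows through Paris."]
print(generate_with_sources("What is the capital of France?", corpus))
```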

In multimodal agent systems, visual language models (VLMs) have also been found to produce hallucinations. A common cause of hallucinations in visually generated language is an over-reliance on the co-occurrence of objects with visual cues in the training data. AI agents that rely solely on pre-trained models (such as LLMs) or VLMs, and perform only limited environment-specific fine-tuning, are particularly prone to hallucinations. This is because they rely on the internal knowledge base of the pre-trained model to generate actions, which may not accurately understand the dynamics of the world state in the deployment environment.

2.2.2. Bias and Inclusivity

AI agents based on LLMs or LMMs (Large Multimodal Models) are biased due to inherent factors in their design and training process. When designing these AI agents, we must consider inclusivity and pay attention to the needs of all end users and stakeholders.

In the context of AI agents, inclusivity refers to the measures and principles adopted to ensure that the agent’s responses and interactions are inclusive, respectful, and sensitive to a wide range of users from all backgrounds.

We list the key aspects of agent bias and inclusivity below.

  • Training Data:

The underlying model is trained on a large amount of text data collected from the Internet, including books, articles, websites, and other text sources. This data often reflects biases that exist in human society, and the model may inadvertently learn and replicate these biases. This includes stereotypes, prejudices, and biased views related to race, gender, ethnicity, religion, and other personal attributes.

In particular, by training on internet data, typically using only English text, models implicitly learn the cultural norms of Western, educated, industrialized, wealthy, and democratic societies, which have a disproportionately large presence on the internet.

However, it is important to recognize that datasets created by humans cannot be completely free of bias, as they often reflect societal biases and the tendencies of the individuals who originally generated and/or compiled the data.

  • Historical and Cultural Biases:

AI models are trained on large datasets from a wide variety of content. As a result, training data often contains historical texts or material from a variety of cultural contexts. In particular, training data from historical sources may contain offensive or derogatory language that represents specific sociocultural norms, attitudes, and biases. This can lead to models perpetuating outdated stereotypes or failing to fully understand contemporary cultural shifts and nuances.

  • Language and context restrictions:

Language models may have difficulty understanding and accurately expressing nuances in language, such as sarcasm, humor, or cultural references. This can lead to misinterpretations or biased responses in certain contexts.

Additionally, there are many aspects of spoken language that cannot be captured through text-only data, which can lead to a potential disconnect between humans’ understanding of language and models’ understanding of language.

  • Policies and Guidelines:

Artificial intelligence (AI) agents operate under strict policies and guidelines to ensure fairness and inclusivity.

For example, when generating images, there are rules that require the depiction of diverse people and avoid stereotypes related to race, gender, and other attributes.

  • Overgeneralization:

These models tend to generate responses based on patterns observed in the training data. This can lead to overgeneralization, where the model may generate responses that appear to stereotype certain groups or make broad assumptions.

  • Continuous monitoring and updating:

AI systems are continually monitored and updated to address any emerging bias or inclusion issues. User feedback and ongoing research in the field of AI ethics play a key role in this process.

  • Amplification of mainstream views:

Because training data often contains more content from the dominant culture or group, the model may be more biased towards those views and may underestimate or misinterpret minority views.

  • Ethical and Inclusive Design:

AI tools should be designed with ethical considerations and inclusivity as core principles. This includes respecting cultural differences, promoting diversity, and ensuring that AI does not perpetuate harmful stereotypes.

  • User guidance:

Users are also guided on how to interact with AI in a way that promotes inclusivity and respect. This includes avoiding requests that could result in biased or inappropriate outputs. Additionally, this helps reduce the risk of models learning harmful content from user interactions.


Despite these measures, AI agents continue to exhibit biases. Ongoing efforts in Agent AI research and development focus on further reducing these biases and improving the inclusiveness and fairness of Agent AI systems.

Here are some efforts to mitigate bias:

  • Diverse and inclusive training data: Strive to include a more diverse and inclusive range of sources in training data.

  • Bias detection and correction: Ongoing research is devoted to detecting and correcting bias in model responses.

  • Ethical codes and policies: Models are often subject to ethical codes and policies designed to reduce bias and ensure respectful and inclusive interactions.

  • Diverse representation: Ensure that generated content or responses provided by AI agents reflect a broad range of human experiences, cultures, ethnicities, and identities. This is particularly important in scenarios such as image generation or narrative construction.

  • Bias mitigation: Actively take steps to reduce bias in AI responses. This includes bias related to race, gender, age, disability, sexual orientation, and other personal characteristics. The goal is to provide fair, balanced responses that avoid perpetuating stereotypes or biases.

  • Cultural sensitivity: The AI must be culturally sensitive, recognizing and respecting the diversity of cultural norms, practices, and values. This includes understanding and responding appropriately to cultural references and nuances.

  • Accessibility: Ensure AI agents are accessible to users of different abilities, including people with disabilities. This may involve incorporating features that make interaction easier for people with visual, auditory, motor, or cognitive impairments.

  • Linguistic inclusivity: Support multiple languages and dialects to meet the needs of global users, and be sensitive to linguistic nuances and variations.

  • Ethical and respectful interactions: The agent is programmed to interact with all users in an ethical and respectful manner, avoiding responses that could be viewed as offensive, harmful, or disrespectful.

  • User feedback and adaptation: Continuously improve the inclusivity and effectiveness of AI agents by incorporating user feedback. This includes learning from interactions to better understand and serve a diverse user base.

  • Adherence to inclusion guidelines: Follow inclusion guidelines and standards established for AI agents, which are often set by industry associations, ethics committees, or regulators.

Despite these efforts, it is important to be aware of the potential for bias in responses and to interpret them with critical thinking. Continuous improvements in AI agent technology and ethical practices aim to reduce these biases over time. One of the overall goals of achieving inclusivity in agent AI is to create an agent that is respectful and accessible to all users, regardless of background or identity.

2.2.3. Data privacy and use

A key ethical consideration for AI agents is understanding how these systems process, store, and potentially retrieve user data. We discuss key aspects below:

Data collection, use, and purpose. When leveraging user data to improve model performance, model developers gain access to the data AI agents collect while interacting with users in production. Some systems allow users to view their data through their accounts or by making requests to the service provider. It is important to recognize what data AI agents collect during these interactions: this may include text input, usage patterns, personal preferences, and sometimes more sensitive personal information. Users should also understand how data collected from their interactions is used. If for some reason the AI holds incorrect information about a specific individual or group, there should be mechanisms for users to help correct it once it has been identified; this matters both for accuracy and out of respect for all users and groups. Common uses of retrieved and analyzed user data include improving user interactions, personalizing responses, and optimizing the system. It is critical that developers ensure the data is not used for purposes users have not consented to, such as unsolicited marketing.

Storage and security. Developers should understand where user interaction data is stored and what security measures are in place to prevent unauthorized access or data leakage, including encryption, secure servers, and data-protection protocols. It is also critical to determine whether agent data is shared with third parties and under what conditions; this should be transparent and usually requires user consent.

Data deletion and retention . It is also important for users to understand how long their data is stored and how to request deletion of their data. Many data protection regulations give users the "right to be forgotten," which means that users have the right to request deletion of their data. AI agents must comply with data protection regulations such as the EU's GDPR or California's CCPA. These regulations regulate data processing practices and users' rights over their personal data.

Data portability and privacy policy. In addition, developers must create a privacy policy for the AI agent that explains to users how their data is processed, detailing data collection, use, storage, and user rights. Developers should ensure that user consent to data collection is obtained, especially when sensitive information is involved. Users can usually opt out or limit the data they provide. In some jurisdictions, users even have the right to request a copy of their data in a format that can be transferred to another service provider.

Anonymization . Data used for broader analysis or AI training should ideally be anonymized to protect personally identifiable information. Developers must understand how their AI agents retrieve and use historical user data during interactions.
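As one hedged illustration of that anonymization step (the two regex patterns below are simplistic examples of our own; real PII removal needs far broader coverage and review):

```python
import re

# Simplistic example patterns; real PII removal needs far more than this.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(text: str) -> str:
    """Replace recognizable identifiers with placeholder tokens before
    the text is logged or reused for training."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(anonymize("Contact jane.doe@example.com or +1 (555) 123-4567."))
# -> "Contact [EMAIL] or [PHONE]."
```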

2.2.4. Interpretability and Explainability

  • Imitation learning → decoupling.

In reinforcement learning (RL) or imitation learning (IL), agents are typically trained through a continuous feedback loop, starting from a randomly initialized policy. However, this approach struggles to obtain initial rewards in unfamiliar environments, especially when rewards are sparse or only available at the end of a long-horizon interaction. A better solution is therefore an infinite-memory agent trained through IL, which can learn policies from expert data, as shown in the figure below, improving exploration and exploitation of unseen environment space through an emergent infrastructure. With the help of expert features, agents can better explore and exploit unseen environment space and learn policies and new paradigm flows directly from that expert data.

Traditional IL actively generates policies by having the agent imitate the behavior of an expert demonstrator. However, directly learning an expert policy may not always be the best approach, as the agent may not generalize well to unseen situations. To address this issue, we propose to learn an agent with contextual cues or an implicit reward function that captures key aspects of the expert’s behavior, as shown in Figure 3. This equips the infinite-memory agent with physical-world behavior data for task execution, which is learned from expert demonstrations. It helps overcome the shortcomings of existing imitation learning, such as the need for large amounts of expert data and possible errors in complex tasks.

The key ideas behind Agent AI are divided into two parts:

1) Infinite agents collect physical-world expert demonstrations as state-action pairs;

2) A virtual-environment simulation agent generator.

The actions generated by the imitating agent mimic the expert's behavior, while the agent learns a policy mapping from states to actions by minimizing a loss function defined as the difference between the expert's actions and the actions generated by the learned policy.
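The loss described here is essentially behavior cloning. A minimal PyTorch sketch of that state-to-action mapping, using random stand-in demonstration data rather than anything from the paper:

```python
import torch
import torch.nn as nn

# Stand-in expert demonstrations: (state, action) pairs.
states = torch.randn(256, 8)          # 256 demos, 8-dim states
expert_actions = torch.randn(256, 2)  # 2-dim continuous actions

# Policy: a small MLP mapping states to actions.
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(200):
    pred_actions = policy(states)
    # Loss = difference between expert actions and the policy's actions.
    loss = nn.functional.mse_loss(pred_actions, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```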

  • Decoupling → Generalization.

Instead of relying on task-specific reward functions, agents acquire knowledge by learning from expert demonstrations, which provide a set of state-action pairs covering various aspects of the task. The agent then learns a policy that maps states to actions by imitating the expert's behavior. In imitation learning, decoupling refers to separating the learning process from the task-specific reward function, allowing the policy to generalize across different tasks without explicitly relying on a task-specific reward function. Through decoupling, agents are able to learn from expert demonstrations and master policies that adapt to a variety of situations. Decoupling also supports transfer learning, where policies learned in one domain can be adapted to other domains with a small amount of fine-tuning. By learning general policies that do not rely on a specific reward function, agents can transfer the knowledge acquired on one task to other related tasks and perform well on them.

Since the agent is no longer dependent on a specific reward function, it can adapt to changes in the reward function or the environment without requiring large-scale retraining. This makes the learned policy more robust and generalizable across environments. In this context, decoupling refers to the separation of two tasks in the learning process: learning the reward function and learning the optimal policy.
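In symbols (our own formalization of the two decoupled stages just described, assuming a standard inverse-RL setup): first fit a reward to the demonstrations, then optimize the policy against the learned reward:

```latex
\hat{R} = \arg\max_{R} \ \mathcal{L}\!\left(R;\, \mathcal{D}_{\text{expert}}\right)
\qquad
\pi^{*} = \arg\max_{\pi} \ \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, \hat{R}(s_t, a_t)\right]
```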

Example of an emergent interaction mechanism: using an agent to identify the text associated with an image from a set of candidates. This task uses multimodal AI agents to integrate external-world information from the web with knowledge-interaction samples annotated by humans.

  • Generalization → emergent behavior.

Emergence theory explains how new properties or behaviors emerge from simple components or rules in complex systems. The core idea is to identify the basic elements or rules of system behavior, such as a single neuron or a basic algorithm.

By observing the interactions between these simple components or rules, we find that these interactions often lead to the emergence of complex behaviors that cannot be predicted by analyzing the individual components alone. The system's ability to generalize across different levels of complexity allows it to learn universal principles that apply across these levels, giving rise to emergent properties.

This enables the system to adapt to new situations and exhibit more complex behaviors arising from simple rules. In addition, the ability to generalize across different levels of complexity facilitates the transfer of knowledge from one domain to another, which helps the system to generate complex behaviors in new contexts when adapting to new environments.

2.2.5. Reasoning Enhancement

The reasoning power of an AI agent lies in its ability to interpret, predict, and respond based on its training and input data. While these capabilities are advanced and continually improving, it is important to recognize their limitations and the impact of the underlying data on which their training relies.

Specifically in the context of large language models, this refers to their ability to draw conclusions, make predictions, and generate responses based on the training data and inputs they receive.

Reasoning enhancement in AI agents refers to the process of improving the natural reasoning capabilities of AI through additional tools, techniques, and data to increase its performance, accuracy, and usefulness. This is particularly important in complex decision-making scenarios or when dealing with nuanced or specialized content.

We list particularly important sources of reasoning enhancement below:

Data augmentation. Incorporating additional, often external, data sources to provide more background or context can help AI agents make more informed inferences, especially in domains where their training data may be limited. For example, AI agents can infer meaning based on the context of a conversation or text. They analyze the information they are given and use it to understand the intent and relevant details of a user's query. These models excel at identifying patterns in data. They use this ability to make inferences about language, user behavior, or other related phenomena based on the patterns learned during training.

Algorithmic enhancement. Improving reasoning capabilities by improving the algorithms that underlie AI. This may include adopting more advanced machine learning models, integrating different types of AI (such as combining natural language processing with image recognition), or updating algorithms to better handle complex tasks. Reasoning in language models involves understanding and generating human language. This includes capturing tone, intent, and the nuances of different language structures.

Human-in-the-loop (HITL). In critical areas that require human judgment, such as ethical considerations, creative tasks, or ambiguous situations, it is particularly important to introduce human input to enhance AI’s reasoning capabilities. Humans can provide guidance, correct errors, or share insights that AI agents cannot infer on their own.

Real-time feedback integration. Leveraging real-time feedback from users or the environment to enhance reasoning is another promising approach to improve performance during reasoning. For example, an AI can adjust its recommendations based on real-time user feedback or changing conditions in a dynamic system. Or, if an action taken by an agent in a simulated environment violates certain rules, feedback can be given to the agent dynamically to help it self-correct.
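A toy sketch of that dynamic-feedback loop (the rule, the candidate moves, and the retry logic are all illustrative assumptions of ours):

```python
from typing import List, Optional, Tuple

FORBIDDEN_ZONES = {(2, 3), (4, 1)}  # illustrative rule: cells the agent may not enter

def check_rules(position: Tuple[int, int]) -> Optional[str]:
    """Environment-side check that emits feedback when a rule is violated."""
    if position in FORBIDDEN_ZONES:
        return f"Rule violation: {position} is a forbidden zone."
    return None

def act_with_feedback(candidates: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Try candidate moves in order, self-correcting on rule feedback."""
    for move in candidates:
        feedback = check_rules(move)
        if feedback is None:
            return move
        print(feedback)  # feedback is surfaced to the agent dynamically
    raise RuntimeError("No rule-compliant move found.")

print(act_with_feedback([(2, 3), (1, 1)]))  # violates once, corrects to (1, 1)
```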

Cross-domain knowledge transfer. Leveraging knowledge or models from one domain to improve reasoning in another domain is particularly useful when generating discipline-specific outputs. For example, techniques used for language translation might be applied to code generation, or insights from medical diagnostics might help improve predictive maintenance of machines.

Customization for specific use cases. Tailoring AI reasoning capabilities for specific applications or industries may involve training AI on specialized datasets or fine-tuning models to better suit specific tasks, such as legal analysis, medical diagnosis, or financial forecasting. Because the specific language or information within a domain may differ greatly from the language in other domains, it is beneficial to fine-tune the agent on domain-specific information.

Ethical and bias considerations. It is critical to ensure that the enhancement process does not introduce new biases or ethical issues. This requires careful consideration of the sources of additional data, or the impact of new reasoning enhancement algorithms on fairness and transparency. AI agents sometimes need to deal with ethical considerations when performing reasoning, especially on sensitive topics. This includes avoiding harmful stereotypes, respecting privacy, and ensuring fairness.

Continuous learning and adaptation. Regularly update and optimize AI capabilities to keep pace with new developments, a changing data landscape, and evolving user needs.

In summary, reasoning enhancement in AI agents involves methods to enhance their natural reasoning capabilities through additional data, improved algorithms, human input, and other techniques. Depending on the specific application scenario, this enhancement is often critical to handling complex tasks and ensuring the accuracy of the agent's output.

2.2.6. Regulations

The field of Agent AI has made remarkable progress, and its integration with embodied systems has opened up new possibilities for interacting with intelligent agents, resulting in more immersive, dynamic, and engaging experiences.

To accelerate development and reduce the heavy lifting in Agent AI development, we propose developing a next-generation AI-enhanced agent interaction pipeline and building a human-machine collaborative system in which humans and machines communicate and interact meaningfully. The system can communicate with human players and identify their needs by leveraging the conversational capabilities of an LLM or visual language model (VLM) and a rich action library. It then performs appropriate actions based on the request to assist the human player.

The robot teaching system.

(Left) System workflow, consisting of three steps: task planning, in which ChatGPT plans the robot's task based on instructions and environmental information, and demonstration, in which the user visually demonstrates the action sequence. All steps require user review; if any step fails or is insufficient, the user can return to the previous step and make adjustments as needed.

(Right) A web application that supports uploading demonstration data and enables interaction between users and ChatGPT.


When using LLM/VLM for human-robot collaborative systems, it is important to note that these models operate as black boxes and generate unpredictable outputs. This uncertainty may become critical in a physical environment, such as operating an actual robot.

One way to address this challenge is to constrain the focus of the LLM/VLM through prompt engineering. For example, in robotic task planning from instructions, it has been shown that providing environmental information in prompts can produce more stable outputs than relying solely on text.

Another approach is to design prompts so that the LLM/VLM produces explanatory text, helping users understand what the model is attending to or has recognized. In addition, implementing a higher-level module under human guidance, allowing validation and modification before execution, can facilitate the operation of systems running under that guidance.
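A hedged sketch combining the three ideas above: environmental information embedded in the prompt, a request for explanatory text, and a human validation gate before execution (the prompt wording and function names are our own, not from the paper):

```python
def build_robot_prompt(instruction: str, environment: dict) -> str:
    """Constrain the LLM/VLM by spelling out the environment in the prompt
    and requesting an explanation alongside the plan."""
    env_lines = "\n".join(f"- {k}: {v}" for k, v in environment.items())
    return ("You are a robot task planner.\n"
            f"Environment:\n{env_lines}\n"
            f"Instruction: {instruction}\n"
            "Output a numbered action plan, then briefly explain which "
            "objects and constraints you relied on.")

def execute_with_human_review(plan: str) -> bool:
    """Higher-level module: a human validates (and may reject) the plan
    before anything runs on the physical robot."""
    print(plan)
    return input("Execute this plan? [y/N] ").strip().lower() == "y"

prompt = build_robot_prompt(
    "put the cup in the sink",
    {"objects": ["cup", "sink", "sponge"], "robot": "single-arm manipulator"},
)
# plan = call_llm(prompt)  # stub LLM call, as in the earlier sketch
# if execute_with_human_review(plan): ...  # run only after human approval
```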

2.3. Agent AI for emergent capabilities

Despite the growing adoption of interactive agent AI systems, most proposed solutions still face challenges in generalization performance when faced with unseen environments or scenarios.

Current modeling practices require developers to prepare large-scale datasets for each domain to fine-tune or pre-train models; however, this process is costly and even becomes infeasible when the domain is new.

To address this problem, we built interactive agents that leverage the knowledge memory of general foundation models (such as ChatGPT, DALL-E, and GPT-4) to cope with novel scenarios; specifically, they generate collaboration spaces between humans and agents. We discovered an emergent mechanism, which we named "Mixed Reality for Knowledge Reasoning Interaction," that facilitates collaboration with humans to solve challenging tasks in complex real-world environments and enables exploration of unseen environments, thereby supporting adaptation to virtual reality.

We propose a new agent paradigm for multimodal general agents.

It has five main modules: 1) Environment and Perception, including task planning and skill observation; 2) Agent Learning; 3) Memory; 4) Agent Action; and 5) Cognition.

For this mechanism, the agent learns:

i) Cross-modal micro-responses: collecting relevant individual knowledge for each interactive task (e.g., understanding unseen scenes) from explicit web resources and implicitly inferring it from the outputs of pre-trained models;

ii) Reality-agnostic macro-behaviors: enhancing interaction dimensions and modes in the language and multimodal domains, and adapting them to role characteristics, specific target variables, and the diversity of collaborative information involved in mixed reality and LLMs.

We study the synergy of knowledge-guided interactions in collaborative scene-generation tasks that combine various OpenAI models, and show how a system of interacting agents can further improve upon the promising results of large foundation models.