The engineering of AI agents is underestimated

Written by
Caleb Hayes
Updated on: June 13, 2025
Recommendation

The engineering of AI agents is the key to their real-world adoption. Only by combining product and technology can we build truly usable intelligent agents.

Core content:
1. Product engineering: from requirements modeling to interaction design, making AI usable and easy to use
2. Technical engineering: building the technical systems that support large-scale deployment of AI agents
3. Case study: how Manus achieves its "active collaborator" positioning through engineering

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

Table of contents


01 Product Engineering

02 Technical Engineering

03 Conclusion


Two recent popular articles [1] [2] both argue that the role of engineering in AI applications has been underestimated.


  • “For example, better virtual machines, longer contexts, a large number of MCPs, and even smart contracts…a series of engineering problems are huge demands.”
  • “There are many engineering tools for AI, such as LangGraph and LangChain. These are Lego blocks for building. The richer the blocks, the stronger the ability to assemble into complex structures.”


However, "engineering" is a very general term that covers a wide range of work. Broadly speaking, any non-algorithmic technical implementation, as well as product design, can be classified as engineering. This article tentatively divides engineering into product engineering and technical engineering, and uses this lens to break down the engineering system behind AI agents.


Engineering = Product Engineering + Technical Engineering


These two parts work together to determine whether an AI Agent is "usable, easy to use, and scalable."


01

Product Engineering

Goal: make AI usable and easy to use, so that users understand it, feel comfortable with it, and keep using it. From a growth perspective, this means not just download volume but retention and activity.


Product engineering covers holistic thinking about product philosophy, product business, interaction design, and user experience, so that AI is no longer a "black box" but something that can be perceived, guided, given feedback, and self-corrected. We first break down product engineering, then pick key modules to elaborate on their role in building a successful AI agent.


| Module | Definition |
| --- | --- |
| Requirements modeling | Clarify who the AI application serves and what problems it solves; avoid using AI for its own sake |
| UI/UX design | Turn the AI's complex behavior into interfaces and flows users can understand and operate |
| Human-computer interaction flow | Let the AI "ask questions" and "confirm decisions" so tasks proceed rhythmically, as with an assistant |
| Prompt engineering | Use prompts to improve the quality and consistency of AI output |
| Feedback loop | Let users give feedback so the system can learn, improve, or signal failure |
| Permissions and compliance | Control who can use what data, preventing AI abuse or leakage |


1️⃣ Requirements modeling

In traditional software development, we first ask: what is the user's core pain point? The same applies to AI agents. Without understanding who will use it and in what scenario, it is easy to end up in the awkward situation of "it looks like it can answer everything, but it's useless in practice."


The first step in building an AI agent is not choosing a model, but answering questions like a product manager: Who will this AI help? What problem will it solve, and how? To what extent can the problem be solved? Are users willing to pay for this function? This is the point of fit between product and market.


Take Manus [3], billed as the world's first general-purpose AI agent, as an example. Its core concept is "using both hands and brain", emphasizing the shift of AI from passive tool to active collaborator.


1. Clarify the role of AI

In the requirements modeling phase, Manus positions AI as an “active collaborator” that can think, plan, and execute complex tasks independently, rather than just providing suggestions or answers.


Example: a user enters "7-day Thailand island tour, budget 20,000 yuan", and Manus automatically completes currency conversion, hotel price comparison, and itinerary planning, then exports a PDF guide.


This role setting requires that the AI's responsibilities and behavioral boundaries be clearly stated in the system prompts to ensure its autonomy and reliability when performing tasks.


2. Design of task closed-loop capability

Manus emphasizes the closed-loop execution capability of tasks, that is, the automation of the entire process from receiving instructions to delivering complete results.


Example :  A user requests an analysis of xx stock, and Manus automatically obtains relevant data, performs financial modeling, generates an interactive dashboard, and deploys it as an accessible website.


In requirements modeling, user needs must be broken down into multiple subtasks, with corresponding execution flows and tool-calling strategies designed so that each task can be completed end to end.


3. Multi-agent collaborative architecture

Manus adopts a multi-agent architecture, which is divided into a planning layer, an execution layer, and a verification layer. Each layer works together to complete complex tasks.


  • Planning layer: decomposes tasks and plans the process.
  • Execution layer: calls tools and APIs to perform specific subtasks.
  • Verification layer: checks task results to ensure the accuracy and compliance of the output.


During the requirements modeling phase, it is necessary to define the responsibilities and interaction methods of each layer to ensure the collaborative work of the entire system.
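The three-layer division of labor can be sketched in code. This is a minimal illustration only; the layer interfaces and names here are invented for the sketch, not Manus's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    subtasks: list = field(default_factory=list)
    results: dict = field(default_factory=dict)

def plan(task: Task) -> Task:
    """Planning layer: decompose the request into ordered subtasks."""
    task.subtasks = [s.strip() for s in task.description.split(";") if s.strip()]
    return task

def execute(task: Task) -> Task:
    """Execution layer: call tools/APIs for each subtask (stubbed here)."""
    for sub in task.subtasks:
        task.results[sub] = f"done: {sub}"
    return task

def verify(task: Task) -> bool:
    """Verification layer: check that every subtask produced an output."""
    return all(sub in task.results for sub in task.subtasks)

task = execute(plan(Task("convert exchange rates; compare hotel prices; export PDF guide")))
```

The point of the sketch is the handoff: planning produces structure, execution fills it in, and verification gates delivery, which is exactly the interaction contract that must be defined during requirements modeling.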


4. Innovation in human-machine collaboration model

Manus supports asynchronous processing and mid-process intervention. Users can add instructions or modify task parameters during task execution to simulate the collaboration mode in the real workplace.


Example :  When executing a task in Manus, the user can shut down the device or add instructions at any time, and Manus will adjust the task execution process according to the new instructions.


This design requires that the flexibility and controllability of human-computer interaction be considered during requirements modeling, ensuring users retain sufficient control throughout the collaboration.


This approach to requirements modeling is akin to segmenting the user's task flow and finding the links where AI is strongest and most worth inserting. It avoids being "big and general", and lets users feel the efficiency gain immediately.


Just as designing microservice boundaries means deciding which responsibilities belong to the order service and which to the inventory service, AI applications must likewise clarify which functions the AI owns and which are backed by business logic. These decisions directly determine the application's final experience.


2️⃣ UI/UX design

AI agents differ from traditional software: their output is often uncertain, delayed, and unpredictable. This means UI/UX design is not just about a "beautiful interface" but about grasping the user's psychology and behavioral rhythm.


For example, DeepSeek was the first to visualize a large model's "thinking process". Before generating a response, it displays the model's chain of thought, letting users see that the AI is not just guessing but reasoning logically. Users no longer passively accept results; they participate in the thinking process and form a partnership in solving the problem.


This design effectively improves user trust and acceptance, especially in scenarios with multi-step tasks, complex document summaries, and cross-information references.


At present, interaction strategies such as "progressive information presentation", "visualization of reasoning", and "structured results" have become standard features of AI agents. Users can view call chains, trace reference sources, and so on. Qwen even provides an option to delete reference sources, further reducing hallucinations caused by unreliable Internet sources.


3️⃣ System prompts

System prompts are a set of instructions or constraints preset by AI agent developers to define the model's behavior framework, role setting, interaction style, and output rules in a specific application scenario. Unlike user prompts, system prompts are a more fundamental control mechanism, with a decisive impact on the quality, style, consistency, and security of the model's output. Their core components usually include role definition, behavior constraints, task orientation, and context management. [4]


For example, the core concept of NotebookLM is that users upload their own material and the AI assistant answers questions and offers suggestions based on it, acting as a "trusted knowledge consultant" that enables more efficient and intelligent learning and research.


This product positioning determines that the system prompts behind NotebookLM must not only provide basic language guidance, but also implement more complex task instructions and safety constraints. We can analyze the design ideas of its system prompts from several key dimensions:


1. Role definition: You are my research assistant

NotebookLM's system prompt sets the model up as a "document knowledge consultant", emphasizing that it may only answer based on the notes, PDFs, web pages, and other materials the user uploads. This is critical: it ensures the model's answers do not wander off arbitrarily or introduce hallucinated information, but always revolve around the content the user provided.

Prompt:
> You are a research assistant. Your job is to help users understand the content of the documents they upload. When answering questions, cite only these materials and do not make subjective assumptions.

This role setting not only controls the output range of the model, but also makes it easier for users to psychologically trust the source of the answer.


2. Behavioral constraints: citing and indicating materials

NotebookLM emphasizes that answers must "cite the source material", which requires the system prompt to explicitly instruct the model to attach the corresponding document and paragraph reference when answering. Developers usually add constraints like this to the prompt:

Prompt:
> When answering each question, quote the relevant paragraph in the document and list its title and paragraph index in markdown format.

This allows users to trace back and verify the source at any time when reading the model's answers, forming a good traceability experience. Trust in the quality of AI output often comes from this "sourced" design.


3. Task-oriented: Question-driven content generation

NotebookLM not only answers questions, but also supports structured content generation, such as automatically summarizing documents, generating research outlines, etc. Each task mode uses different system prompts. For example, in the summary mode, the prompt may guide the model like this:

Prompt:
> Based on the documents provided by the user, extract the 5 most important points and list them one by one. Keep the language objective and neutral, and do not provide extended explanations.

Task-oriented prompt orchestration goes beyond traditional "conversational Q&A"; it works more like a "task script", laying the foundation for multimodal AI applications.
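The prompt patterns above (role definition, behavior constraints, task orientation) can be assembled programmatically. A minimal sketch, assuming a hypothetical `build_system_prompt` helper; NotebookLM's actual prompts are not public:

```python
# Illustrative only: composes a system prompt from role, constraints, and task.
def build_system_prompt(role: str, constraints: list[str], task: str) -> str:
    lines = [f"Role: {role}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]  # one bullet per behavior constraint
    lines.append(f"Task: {task}")
    return "\n".join(lines)

summary_prompt = build_system_prompt(
    role="research assistant answering only from the user's uploaded documents",
    constraints=[
        "cite the source paragraph for every claim",
        "do not make subjective assumptions beyond the material",
    ],
    task="extract the 5 most important points and list them one by one",
)
```

Keeping each task mode as a separate assembled prompt, rather than one monolithic string, makes the "task script" idea maintainable: modes can be added or changed without touching the role and constraint layers.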


4️⃣ Feedback loop

No AI agent outputs accurate content from day one. In reality it performs poorly on certain inputs and drifts semantically on certain data. This is why "failure cases" must be collected, forming a closed loop from input to evaluation.


For example, with Monica's memory function, when users browse recommended memory items and click "Accept as fact", those items are written into the memory database for use in subsequent conversations; items the user does not adopt are not memorized. This "feedback → selective absorption → next round of tuning" mechanism is an enhancement over the traditional prompt-plus-chat mode.


Monica improves the accuracy with which it understands needs and responds by continuously learning user characteristics. In essence, this rebuilds the context-awareness mechanism of conversational interaction. As with communication between people: the longer you spend together, the more familiar you become, and the better you understand what the other person is expressing.
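The "feedback → selective absorption" loop can be sketched as follows. The `MemoryStore` interface here is hypothetical, not Monica's actual implementation:

```python
class MemoryStore:
    def __init__(self) -> None:
        self.facts: list[str] = []

    def review(self, candidates: list[str], accept) -> None:
        # Only items the user explicitly accepts are written to memory;
        # rejected candidates are simply dropped, not memorized.
        for item in candidates:
            if accept(item):
                self.facts.append(item)

store = MemoryStore()
store.review(
    ["user prefers concise answers", "user's budget is 20,000 yuan"],
    accept=lambda item: "concise" in item,  # stand-in for an "Accept as fact" click
)
```

The key design choice is that the write path runs through the user's explicit decision, so the memory database accumulates only vetted facts rather than everything the model inferred.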


02

Technical Engineering

Goal: make the system behind the AI start fast, run stably, scale out, and stay observable.


Technical engineering exists to validate product engineering. Just as the Internet era rewarded the fast fish eating the slow fish, the AI era values engineering efficiency: get product engineering running as soon as possible, validate it in the market, and iterate quickly. Technical engineering is the logistics system supporting AI applications, covering architecture and modularization, tool-calling mechanisms, model and service integration, traffic and access control, data management and structured output, security and isolation mechanisms, and DevOps and observability.

| Module | Definition |
| --- | --- |
| Architecture and modularity | Break the AI application into small modules with clear responsibilities, making them easier to combine and maintain |
| Tool-calling mechanism | Let the AI query databases, check the weather, place orders, etc., and truly "get things done" |
| Model and service integration | Integrate multiple models (DeepSeek, Qwen, local large models, etc.) for unified calling and management |
| Traffic and access control | Control the usage frequency and access rights of different users and models to prevent abuse or overload |
| Data management and structured output | Convert the AI's free text into structured data the system can process further or store in a database |
| Security and isolation mechanisms | Prevent data from being cross-used or operated on without authorization, especially in multi-tenant/enterprise applications |
| DevOps and observability | Support canary releases, rollback of new features, and performance alerts; record what happens on each call to locate problems, optimize metrics, and keep the system running stably |
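As an illustration of the "data management and structured output" module above, a minimal validation sketch: ask the model for JSON, then check the fields before downstream systems consume them. The schema and field names are invented for illustration:

```python
import json

# Hypothetical schema for a travel-planning response.
REQUIRED_FIELDS = {"city": str, "nights": int, "budget_cny": (int, float)}

def parse_itinerary(raw: str) -> dict:
    data = json.loads(raw)  # raises if the model returned free text instead of JSON
    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"missing or mistyped field: {key}")
    return data

model_output = '{"city": "Phuket", "nights": 7, "budget_cny": 20000}'
itinerary = parse_itinerary(model_output)
```

Validating at the boundary like this turns the model's probabilistic text into data the rest of the system can trust, and makes malformed outputs fail loudly instead of propagating.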


1️⃣ Application architecture and modularity

As AI applications move from prototype to production, a key challenge is how to organize multiple heterogeneous model capabilities (prompts, augmented LLMs, advisors, retrieval, chat memory, tools, evaluation, and MCP) through a unified architecture while achieving high maintainability, observability, and scalability. This is where modular architecture and process-orchestration tools come into play.


In addition to Python-based ecosystems such as LangChain and LangGraph, Spring AI Alibaba provides native, enterprise-grade AI application orchestration for the Java community, making it one of the important tools for building modular AI applications. Beyond the eight basic capabilities mentioned above, its core features also include:


  • Multi-agent framework: Graph is one of the core implementations of the Spring AI Alibaba community, and the design concept that distinguishes the framework from Spring AI, which provides only low-level atomic abstractions. Spring AI Alibaba aims to help developers build intelligent applications more easily.

  • AI ecosystem integration that addresses the pain points of enterprise adoption: Spring AI Alibaba supports deep integration with the Bailian platform, providing model access and RAG knowledge-base solutions; seamless access to observability products such as ARMS and Langfuse; and enterprise-grade MCP integration, including Nacos MCP Registry distributed registration and discovery, automatic router routing, and more.

  • Exploration of general-purpose agent products and platforms with autonomous planning capabilities: the community has released the JManus agent, built on the Spring AI Alibaba framework. Beyond benchmarking against Manus's general-agent capabilities, the goal is to explore autonomous planning in agent development based on JManus and to give developers more flexible options for building agents, from low code and high code to zero code.
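Orchestration frameworks of this kind typically chain named modules over shared state. A generic sketch of the idea in Python; this is not the LangGraph or Spring AI Alibaba API, just an illustration of the pattern:

```python
from typing import Callable

class Pipeline:
    def __init__(self) -> None:
        self.steps: list[tuple[str, Callable[[dict], dict]]] = []

    def add(self, name: str, fn: Callable[[dict], dict]) -> "Pipeline":
        self.steps.append((name, fn))
        return self

    def run(self, state: dict) -> dict:
        for name, fn in self.steps:  # each module reads and extends shared state
            state = fn(state)
        return state

# A toy retrieval-then-generate flow with stubbed modules.
rag = (Pipeline()
       .add("retrieve", lambda s: {**s, "docs": ["doc-1", "doc-2"]})
       .add("generate", lambda s: {**s, "answer": f"answer grounded in {len(s['docs'])} docs"}))

result = rag.run({"question": "What is MCP?"})
```

Because each module only touches the shared state dict, modules can be swapped, reordered, or tested in isolation, which is the maintainability property the orchestration tools above are built around.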


2️⃣ The less controllable the backend logic, the more important traffic and user access control becomes

Large models are powerful, but compared with classic applications built entirely on business-logic code, they introduce uncertainty. This must be hedged with stronger traffic and user access control to ensure availability, control cost, and prevent abuse. When a classic application is upgraded into an AI agent, two new information-processing links are added: the LLM and the MCP server. The AI gateway sits between the AI agent, the LLM, and the MCP server, handling traffic and user access control.


Flow Control:

  • key-rate-limit: set an upper limit on access frequency for each access key (API Key). For example, at most 10 requests per second; or free users may call the AI interface only 100 times a day while paid users may call it 10,000 times.

  • http-real-ip: obtain the user's real IP address behind multiple proxy layers. For example, identify the real address of malicious visitors (some repeatedly register accounts through VPNs), or determine the user's country or region for later content recommendation or risk-control strategies.

  • HTTP Strict Transport Security (HSTS): force clients to communicate over HTTPS to prevent man-in-the-middle attacks. First, sensitive data such as AI chat records, training-data uploads, and inference requests must be transmitted encrypted. Second, when AI services are used by governments or financial institutions, a high security level is required to ensure data is not monitored or tampered with in transit.

  • canary-header: decide whether to route a request to the "canary version" of a service based on a specific request header. For example, when releasing a new version of a model interface, let 5% of users try it first and roll it out fully if no problems appear; or for A/B testing, route only requests with a specific header to the test version to compare two prompt templates or two fine-tuned models.

  • traffic-tag: attach specific tags to requests and apply policies based on them. For example, in multi-version model scheduling, send the same question to different model versions (such as Qwen3 32B and Qwen3 235B) based on "paid user" and "new user" tags; or do personalized policy routing, where developers tag requests by user behavior (such as "prefers images" or "many code requests") and route them to customized processing links.

  • cluster-key-rate-limit: enforce a unified access-frequency limit across multiple gateway instances, i.e. cluster-wide rate limiting. For example, a high-availability AI platform deployed across regions can still enforce a total per-user limit, preventing single-instance limits from being bypassed; it also protects backend model resources by capping traffic during holidays or viral traffic spikes.
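The key-rate-limit policy above is commonly implemented as a token bucket per API key. A minimal sketch, with the 10 requests/second figure from the example used as an illustrative parameter:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def check(api_key: str) -> bool:
    """Gateway-side check: one bucket per API key."""
    bucket = buckets.setdefault(api_key, TokenBucket(rate=10, capacity=10))
    return bucket.allow()

decisions = [check("user-key-1") for _ in range(11)]  # 11th burst request is rejected
```

A cluster-wide variant (cluster-key-rate-limit) keeps the same logic but stores the bucket state in a shared store such as Redis instead of process-local memory.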


User Access Control:

  • key-auth: authentication via pre-distributed API Keys; the client must carry a specific key in the request to call the service. For example, an open platform requires developers to register an account and obtain their own key before calling the interface; an intranet service or BFF layer can also use a key for simple identity authentication when calling the gateway.

  • basic-auth: username/password authentication based on HTTP Basic Auth; the client transmits Base64-encoded credentials in the request header. As a low-complexity internal mechanism, it can quickly protect interfaces during internal enterprise testing or early prototyping, and gate internal AI tools such as visual analysis panels and prompt-management platforms.

  • hmac-auth: signature authentication based on HMAC (Hash-based Message Authentication Code). Each request carries a signature value, and the gateway verifies the signature against the key. Typical uses include webhook callback verification, where third-party platforms (such as Stripe and OpenAI) send signed events so the receiver can verify the source, and tamper-proofing sensitive interfaces (such as submitting inference tasks). It is like putting a unique seal on a package before shipping: the recipient accepts it only if the seal is intact. Signatures are usually generated from request body + timestamp + key to prevent replay and forged requests.

  • jwt-auth: user authentication and authorization via JWT (JSON Web Token). The request carries an issued JWT, and the gateway judges validity from the signature and payload. Typical uses include user identity authentication (the user obtains a token after logging in and uses it to access AI assistants and conversation systems) and role-based permission control (role information such as role=admin is embedded in the JWT, and the gateway allows or restricts access accordingly). It is like a digital pass saying "I am a VIP customer": the system scans it to confirm identity and permissions. It is common in enterprise systems with front-end/back-end separation and SSO (single sign-on).

  • jwt-logout: implements JWT invalidation. Although a JWT carries its own expiration time, once a user actively logs out it must be invalidated at the system level. For example, in security-hardening scenarios you do not want an old token to keep calling the model interface after logout, and in AI SaaS session control you must actively clear a user's credentials when they switch devices and log out. It is like revoking an access card so it can no longer be used even though it has not yet expired.

  • oauth: three-party authorized login and access control based on the OAuth 2.0 protocol. Users log in through authorization platforms such as Google, GitHub, and WeChat. For example, one-click login: an agent lets users sign in with Google, obtains an identity token after authorization, and integrates with enterprise identity systems.
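The hmac-auth scheme can be sketched as follows, signing the request body plus a timestamp with a shared secret. The secret value and message layout are illustrative, not any particular gateway's wire format:

```python
import hashlib
import hmac
import time

SECRET = b"shared-secret"  # illustrative; distributed out of band to the client

def sign(body: bytes, ts: int) -> str:
    # Signature covers body + timestamp so neither can be changed independently.
    return hmac.new(SECRET, body + str(ts).encode(), hashlib.sha256).hexdigest()

def verify(body: bytes, ts: int, signature: str, max_age_s: int = 300) -> bool:
    if abs(time.time() - ts) > max_age_s:   # reject replays of old requests
        return False
    return hmac.compare_digest(sign(body, ts), signature)  # constant-time compare

ts = int(time.time())
sig = sign(b'{"task": "inference"}', ts)
ok = verify(b'{"task": "inference"}', ts, sig)
tampered = verify(b'{"task": "something else"}', ts, sig)  # body changed -> fails
```

Using `hmac.compare_digest` rather than `==` avoids timing side channels, and the timestamp window is what gives the "prevent replay" property described above.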


In addition, AI gateways such as Higress and Alibaba Cloud API Gateway have extended further capabilities at the gateway layer through a flexible, scalable plug-in mechanism, and users can develop custom plug-ins to enrich them.

  • AI retrieval-augmented generation: simplify the development of RAG applications by integrating with the vector search service (DashVector), improving the content generated by large models.

  • Security compliance: intercept input and output at three layers. At the gateway layer, content-security policies scan request and response content in real time to identify potential risks. At the large-model layer, real-time checks of the model's input and output ensure risky information is not accidentally exposed or spread, reducing the risk of data leakage. At the web-search layer, search-engine responses are inspected to identify risks such as illegal information, malicious links, and sensitive data.

  • Content caching: in highly repetitive AI request scenarios, cache the responses generated by the large language model in Redis, avoiding repeated model calls and improving response speed.
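The content-caching idea can be sketched with a plain dict standing in for Redis; the cache key is a hash of the normalized prompt, so trivially different phrasings of the same question hit the same entry:

```python
import hashlib

cache: dict[str, str] = {}  # stand-in for Redis in this sketch

def cached_completion(prompt: str, call_model) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in cache:
        return cache[key]           # hit: skip the expensive model call
    answer = call_model(prompt)     # miss: call the LLM once and store the result
    cache[key] = answer
    return answer

model_calls: list[str] = []

def fake_model(prompt: str) -> str:
    model_calls.append(prompt)
    return "RAG retrieves documents and grounds the model's answer in them."

a1 = cached_completion("What is RAG?", fake_model)
a2 = cached_completion("  what is rag?", fake_model)  # normalizes to the same key
```

In production the dict would be replaced by Redis with a TTL, and the normalization step might be a semantic embedding lookup rather than simple lowercasing; both are deployment choices beyond this sketch.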


3️⃣ Logging and monitoring

The difference between large-model applications and traditional applications is not as simple as "adding model-call logs". Below are several key comparison dimensions and their core differences.


Differences in observable objects: from code execution to model behavior

| Category | Traditional applications | Large model applications |
| --- | --- | --- |
| Observable objects | Backend logic, database queries, API calls | Prompt input/output, model reasoning process, context changes, chain of thought |
| Focus | Performance bottlenecks, service status, exception stacks | Response reasonableness, consistency, deviation, hallucination, and whether the model exceeds its authority |
| Observability granularity | Function level, call-chain level | Token level, semantic level, behavior-path level |

For example, classic applications focus on "did this interface time out?", while large-model applications must also ask "why did the model say something it shouldn't have?" and "did it misunderstand the user's intent?"


The shift in observability goals: from availability to semantic correctness

For traditional systems, the observability goal is whether the system is running normally: are responses timing out, are CPU and memory alarms firing, is the error rate spiking. For large-model applications, the system may be running normally while the result is still wrong. Observability must therefore cover not only availability metrics but also:

  • Is the response content correct and reasonable (semantic correctness)?
  • Does the model deviate from its configured behavior (such as the system role)?
  • Are there hallucinations, overstepping of authority, or toxic language?
  • Is the model's behavior correlated with version, prompt, or context?


For example, an AI medical question-answering system may have no crash logs at all, yet the model outputs the wrong suggestion that "cold medicine is not recommended". Catching this kind of error requires observability at the semantic layer.
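A semantic-layer check of this kind might look like the following sketch; the policy phrases and citation marker are invented stand-ins for a real evaluation pipeline:

```python
# Hypothetical policy list and citation convention, for illustration only.
FORBIDDEN_PHRASES = ["cold medicine is not recommended"]

def semantic_check(response: str, system_role: str) -> list[str]:
    """Scan one model response and return a list of semantic-layer findings."""
    findings = []
    lowered = response.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            findings.append(f"policy violation: {phrase!r}")
    # Role-deviation check: a document assistant is expected to cite sources.
    if system_role == "document assistant" and "[source:" not in lowered:
        findings.append("missing citation: answer does not reference a source")
    return findings

issues = semantic_check(
    "For this condition, cold medicine is not recommended.",
    system_role="document assistant",
)
```

The point is that such checks run even when all availability metrics are green: they consume the same model logs as tracing, but judge the content of the output rather than the health of the service.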


To address these issues, we first need to understand which components a call passes through, then connect all of them through the call chain. Full-link diagnosis requires these links to be connected first: when a request goes wrong, we can then quickly identify which link failed, whether in the AI application or in the model's internal reasoning. [5]


Second, we need a full-stack observability data platform that correlates all of this data, including not only traces but also metrics, such as GPU utilization inside the model. Correlation analysis tells us whether a problem lies in the application layer or in the underlying model.


Finally, we also need model logs to understand the input and output of each call, and to use this data for evaluation and analysis that verifies the quality of the AI application.

Starting from these three methods, we have laid out observation approaches and core concerns at different levels of monitoring.


03

Conclusion

Although this article attempts to sort out the core modules and key paths of AI agent engineering, we must admit that real-world challenges are far more complex than described here. Whether it is the organic collaboration between human logic and large models, the processing paradigms for complex tasks, or the evaluation criteria for generated content, every link hides a large number of unresolved product- and technical-engineering details. Engineering is never the icing on the cake; it is the only way to turn model capability into real productivity.


More importantly, advancing AI agent engineering is not just a topic for agent builders; it concerns the evolution of the entire industry. Only by continuously investing across dimensions such as development platforms, traffic and access control, toolchains, observability, and security, and by building stable, reliable, and reusable application infrastructure, can we truly drive the large-scale adoption of upper-layer agent applications and foster a new-generation "application supply chain" ecosystem for large models.