Recommendation
A practical guide to building intelligent agents, exploring the low-threshold future of AI applications.
Core content:
1. Misunderstandings and actual barriers to building intelligent agents
2. Logical thinking in the design of intelligent agent workflows
3. Application of "split personality" capabilities in building intelligent agents
Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)
Recently, I have often seen live streams on short-video platforms teaching people how to build intelligent agents. Some are based on GPTs, others on Coze (known as Kouzi in China, ByteDance's agent-building platform, from the maker of Doubao). Some speak sensationally under the banner of monetization; others are more pragmatic and teach step by step. But without exception, they describe intelligent agents as low-barrier tools that anyone can pick up.

[Screenshot of the Coze official website]

On the second day of last month's FORCE conference, I attended the Coze breakout session. The organizers invited many guests to share their experiences: individuals who had monetized through Coze, business leaders who had applied Coze to their business processes, and startup founders who had built applications on it.

A few months ago, I tried to build an agent on Coze to help me decide what to eat for dinner. It wasn't a success, but I did learn something. Yesterday, I launched a Coze-based AI customer service bot on my official account, which answers questions based on my published articles. I am also developing an agent to record my son's daily performance.

Although I cannot call myself an expert on intelligent agents, I do have some personal insights. Agent platforms such as Coze do lower the threshold for application development, especially for applications with LLM capabilities. But I don't think they are completely barrier-free, or that everyone can easily get started. You need at least two abilities: logic and "split personality". Let me explain.

You can also create an agent in the Doubao app, but its functions are limited.
You can only define the agent's functions through its character description, so at best it is a conversational bot with personality. Agents created in Coze are more powerful: they can have their own knowledge base, acquire skills through plugins (such as searching the web), and record structured information from conversations in a database. You can even build applications with a UI in Coze, rather than being limited to conversation.

In a Coze agent, the most critical component is the workflow. As the name and the interface suggest, it is essentially a flowchart, with each step called a node. A workflow has a start and an end; the output of one node is the input of the next. Nodes execute in order according to specific rules and logic, and paths can fork and merge. Unlike traditional programs, where every function is implemented in code, a node in a Coze workflow can be an LLM: its input and output are both natural language, and the intermediate processing is handled by the model. A node can also be an image-generation model.

Using an enterprise project flowchart as an analogy, you can imagine each LLM node as a person who completes a specific task in the project. Anyone with experience in flowchart design knows that this demands strong logical thinking. Programmers are perhaps one of the groups with the strongest logical abilities; the few bloggers I saw teaching agent building on short-video platforms all had programming backgrounds.
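The workflow idea above can be sketched in a few lines of Python. This is a minimal illustration, not Coze's actual API: nodes are functions whose output feeds the next node's input, and edge functions decide where to branch. All names here are my own invention for illustration.

```python
# A minimal sketch of a Coze-style workflow: nodes are steps whose output
# becomes the next node's input, with simple conditional branching.
# In a real workflow, a node could be an LLM call or a plugin.

from typing import Callable, Dict

Node = Callable[[str], str]

def run_workflow(start: str, nodes: Dict[str, Node],
                 edges: Dict[str, Callable[[str], str]], text: str) -> str:
    """Run nodes from `start`; each edge function inspects the current
    node's output and names the next node ("" means stop)."""
    current = start
    while current:
        text = nodes[current](text)            # execute the node
        current = edges.get(current, lambda _: "")(text)  # pick next node
    return text

# Two toy "nodes": in Coze these would be LLM or plugin nodes.
nodes = {
    "classify": lambda t: "greeting" if t.lower() in ("hi", "hello") else "question",
    "greet": lambda t: "Hello! Ask me about AI.",
    "answer": lambda t: "Let me look that up...",
}
# After classification, fork to the greeting branch or the answer branch.
edges = {"classify": lambda label: "greet" if label == "greeting" else "answer"}

print(run_workflow("classify", nodes, edges, "hello"))  # → Hello! Ask me about AI.
```

The point of the sketch is the shape, not the code: each node does one small thing, and the branching logic lives outside the nodes, just as in a flowchart.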
They may not realize that although using Coze requires no programming-language knowledge, it does require the logical thinking that programming cultivates.

I admit the term "split personality" is a bit of an exaggeration; it refers to a finer-grained level of logical ability. In multi-person collaborative tasks, you need to clarify the logical relationships between different people's tasks: whose output is another person's input, who completes which step under what circumstances, and so on. As mentioned earlier, we can imagine each LLM node in Coze as a person, but that is not accurate enough. An LLM node does not handle a complex series of tasks well; rather, it completes one specific step that requires human-like capability. This means you must be able to logically split one person's job, like splitting a person into multiple avatars, each performing only one step. Sometimes you even need to let "yourself" evaluate and review "your own" output.

Take my official account's customer service as an example. In reality, one person can handle customer service, so it seems like a task a single LLM could complete. However, to limit the agent's answers to the scope of my articles, I set up four LLM nodes:

- The first determines whether the reader's message is AI-related, a general greeting (such as "hello" or "thank you"), or "other"
- The second handles general greetings and directs readers to AI-related questions
- The third rewrites the user's question according to the conversation context. For example, if a reader first asks "What is AI" and then asks "What can it do", the LLM rewrites the second question into "What can AI do" so that relevant articles can be retrieved.
- The fourth combines the user's question with the retrieved article content and composes the answer
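The four nodes above can be sketched as plain Python functions. These are deterministic stand-ins, not real LLM calls: in the actual workflow each function would be a prompt to an LLM node, and every name here is hypothetical.

```python
# Stand-ins for the four LLM nodes of the customer-service workflow.
# Each function fakes what a prompted LLM would do, to show the routing.

def classify(msg: str) -> str:
    """Node 1: is the message a greeting, AI-related, or "other"?"""
    if msg.lower() in ("hello", "hi", "thank you"):
        return "greeting"
    if "ai" in msg.lower():
        return "ai"
    return "other"

def greet(msg: str) -> str:
    """Node 2: handle greetings and steer readers toward AI topics."""
    return "Hi! Feel free to ask me anything about my AI articles."

def rewrite(msg: str, history: list[str]) -> str:
    """Node 3: rewrite the question using context, e.g. resolve 'it'."""
    if "it" in msg.lower() and history:
        return msg.replace("it", history[-1])  # crude stand-in for an LLM rewrite
    return msg

def answer(question: str, retrieved: str) -> str:
    """Node 4: combine the question with retrieved article text."""
    return f"Based on my articles: {retrieved} (re: {question})"

def handle(msg: str, history: list[str]) -> str:
    label = classify(msg)
    if label == "greeting":
        return greet(msg)
    if label == "other":
        return "Sorry, I only answer questions about my AI articles."  # standard response
    q = rewrite(msg, history)
    return answer(q, retrieved="AI can draft text, write code, and more")
```

Splitting the job this way is exactly the "split personality" point: one human customer-service role becomes four narrow steps, each simple enough for a single LLM node to do reliably.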
I classify many situations (like "Who is the most beautiful woman in the world?") as "other" and answer them with a standard response (rather than an improvised reply such as "the one who just asked me a question" or "it's not you"). Even so, this workflow already has 11 nodes.

Of course, you could instead describe all the logic in language, as an LLM persona similar to an SOP. But first, I'm not sure how well that would work in Coze; second, a two-dimensional diagram has advantages over linear language when expressing complex logic.

I saw a video a few days ago: Andrew Ng's DeepLearning.AI invited an OpenAI employee to demonstrate how to use the latest models. He introduced a technique called Meta Prompting: let the strongest model (o1) write a super-detailed SOP, then let GPT-4o mini follow that SOP to play the role of airline customer service, responding to customer questions and choosing the corresponding procedures (such as refunding tickets). The SOP even used five levels of numbering (such as 3a2b2)!

From this, it is not hard to understand why one company leader, applying Coze to business processes, built hundreds of agents with tens of thousands of nodes.

The capability of an LLM is a double-edged sword.
It can indeed complete many tasks without programming, but its freedom and flexibility force us to control its behavior through workflow to obtain the expected results.
It's like managing an exceptionally capable employee: you don't want to stifle her creativity when you need her to create, but you can't let her run free when you need her to follow procedure.
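The Meta Prompting pattern mentioned earlier is worth a sketch: a strong model writes a detailed SOP once, offline, and a cheaper model then follows that SOP as its system prompt at serving time. The two `call_*` functions below are hypothetical stand-ins for real API calls (e.g. to o1 and GPT-4o mini); only the two-step structure is the point.

```python
# A sketch of Meta Prompting: a strong model authors an SOP, a cheap
# model executes it. Both model calls are faked with deterministic stubs.

def call_strong_model(prompt: str) -> str:
    # Stand-in: a real call would return a multi-level, numbered SOP.
    return ("1. Greet the customer.\n"
            "1a. If the request is a refund, go to step 2.\n"
            "2. Verify the ticket number before refunding.")

def call_cheap_model(system: str, user: str) -> str:
    # Stand-in: the cheap model answers while constrained by the SOP.
    return f"[following SOP: {system.splitlines()[0]}] Reply to: {user}"

# Step 1 (done once, offline): have the strong model write the SOP.
sop = call_strong_model(
    "Write a detailed, hierarchically numbered SOP for airline customer "
    "service, covering refunds, rebooking, and escalation."
)

# Step 2 (at serving time): the cheap model role-plays using the SOP.
reply = call_cheap_model(system=sop, user="I want a refund for flight CA123")
```

The design trade-off is the same one the article describes: the SOP (like a workflow) constrains the model's freedom exactly where you need it to follow procedure.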