Dissecting the "Open Source Version of Manus"

Exploring the open source take on Manus, for insight into how AI agents are engineered in practice.
Core content:
1. How the OpenManus project reproduces Manus's functionality, as a hands-on look at AI agent technology
2. The differences and challenges between single-agent and multi-agent systems
3. The key role and impact of specialized small models in an agent architecture
Manus was all over my screen over the weekend. I watched the hands-on test videos posted by people who got invitation codes, and it does look very cool. But Monica's hunger-marketing strategy drove the price of a Manus invitation code to 20,000 yuan apiece. Being short on funds, I instead found the OpenManus project on GitHub, which claims to reproduce Manus, and analyzed it as a stand-in to satisfy my curiosity.
Conclusions
Experience
It is billed as the open source version of Manus, but from what is publicly known, Manus is a multi-agent system, whereas OpenManus is a single-agent system assembled quickly from existing components.
Overall, OpenManus implements all three core agent capabilities: planning, execution, and memory. Its tool layer, however, does not use the now-popular MCP protocol; to extend the toolbox you must inherit from the internal BaseTool base class and implement the execute method by hand, which carries a certain cost (a minimal sketch follows below).
OpenManus can handle simple tasks, such as searching the web for specific content and generating a summary report, but its accuracy and stability are not high. I suspect the free LLM (large language model) I configured for it directly limits how well it performs.
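For illustration, here is what that extension cost looks like. This is a minimal sketch of a hypothetical custom tool; the field names (name, description, parameters) and the async execute contract are my assumptions based on the JSON-schema style of tool definitions, so check app/tool/base.py in the repo for the actual interface:

```python
# Minimal sketch of a custom tool (field names and the execute() contract
# are assumptions; see app/tool/base.py in the repo for the real BaseTool).
from app.tool.base import BaseTool


class WordCount(BaseTool):
    """Hypothetical tool: count the words in a piece of text."""

    name: str = "word_count"
    description: str = "Count the number of words in the given text."
    parameters: dict = {
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "The text to analyze."}
        },
        "required": ["text"],
    }

    async def execute(self, text: str) -> str:
        # The agent calls execute() with arguments generated by the LLM.
        return f"word count: {len(text.split())}"
```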
Thoughts
Product innovation built on engineering integration can still create real user value. To make OpenManus concrete, picture the system as a band conductor: when a consumer (the user) makes a request, it thinks with its own brain (the LLM), writes a score to match the user's needs, selects suitable musicians (the toolbox), and coordinates them to play from that score. Of course, the whole process leans on predefined chain-of-thought (CoT) prompting and trial-and-error enumeration within a limited combination space.
Agents like Manus can indeed improve efficiency, but they can neither get things right on the first try nor be intervened on and tuned while running as a black box, and that is the biggest obstacle to enterprise adoption. When I used OpenManus, the results for the same command were highly unstable, with outright failures and irrelevant answers. Worse, because the process is completely hidden from the user, there is no way to correct it by intervening mid-run. In practice, you would need to fold in user feedback and proprietary knowledge to align the process and results, continuously tuning its huge prompts and calibrating the predefined tool library.
Specialized small models still matter: the capability of the small-model cluster sets the ceiling for the agent. Generally, in vertical domains, because data cannot leave the domain and the knowledge is not public, it is hard to obtain high-quality problem-solving capability from outside. Given how agents work, the boundary of what an agent can solve is determined by what is stocked in its tool library, so the capability boundary of specialized tools, such as the small models we build ourselves, directly defines the agent's ceiling.
Multi-agent architectures like Manus may be more direct, effective, and easier to land in specific vertical domains. Compared with SOP-style agent architectures, the addition of automatic planning lets them handle non-standard tasks. But studies show that multi-agent systems also have serious bottlenecks (Manus included): 78% of their failures stem from inter-agent interaction problems, and only 22% from single-agent defects. Common errors include violating role definitions, violating task specifications, irrelevant answers, plans disconnected from actions, and wrong termination or missing verification.
So in the long run, the ceiling of end-to-end models is bound to be higher; for example, they are better positioned to benefit from the "emergent abilities" or "aha moment" phenomena brought by scaling laws or RL. But we still need to explore and find the best ways to train and use them.
In the short term: I configured qwen-max (which supports streaming calls) in OpenManus. Since its executable tool set is fairly simple (PythonExecute, GoogleSearch, BrowserUseTool, FileSaver, Terminate, plus a PlanningTool; see the sketch below), it is usable with some effort. Manus is said to have 27 tools; to guarantee execution accuracy, it presumably uses a self-developed large reasoning model to generate plans.
In the long term: if every stage of the agent calls an LLM independently, or different stages even call different models, the inherent defects of large models get amplified layer by layer. So eventually we will have to train our own end-to-end large model for the agent (imagine managing a multi-level funnel). OpenAI has already customized an end-to-end model trained from o3 for its DeepResearch.
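As a reference for how small that tool set is, here is a sketch of how the single agent and its toolbox are wired together. The class names match the tools listed above, but the import paths and field names are my reconstruction, not verbatim source; see app/agent/manus.py in the repo for the real definition:

```python
# Sketch of how the single agent's toolbox is assembled from the tools
# listed above (import paths reconstructed for illustration; see
# app/agent/manus.py in the repo for the authoritative version).
from app.agent.toolcall import ToolCallAgent
from app.tool import ToolCollection
from app.tool.python_execute import PythonExecute
from app.tool.google_search import GoogleSearch
from app.tool.browser_use_tool import BrowserUseTool
from app.tool.file_saver import FileSaver
from app.tool.terminate import Terminate


class Manus(ToolCallAgent):
    name: str = "Manus"
    description: str = "A versatile agent that solves tasks with a fixed tool set."

    # The agent's entire capability boundary is this enumeration of tools:
    available_tools: ToolCollection = ToolCollection(
        PythonExecute(), GoogleSearch(), BrowserUseTool(), FileSaver(), Terminate()
    )
```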
| Comparison | Short-term solution | Long-term plan |
| --- | --- | --- |
| Design goals | Rapidly acquire reliable delivery capability for complex tasks | Open-ended exploration and domain creativity that can produce new cognition and new methods |
| Architecture | Multi-agent architecture pursuing a combination of specialized capabilities: 1. use separate specialized models for different tasks; 2. give each subtask an independent execution module; 3. manage inter-agent communication through an explicit coordination layer | End-to-end model architecture pursuing holistic intelligence emergence: 1. a single model understands all task requirements; 2. it interacts with external tools by generating code or instructions; 3. task management and execution logic are built into the same model |
| Advantages | Highest ROI: in a vertical domain, engineering "LLM + small-model cluster + domain knowledge base" quickly reaches usability. Stronger specialization: task-specific models achieve higher accuracy. More agile iteration: a single agent can be optimized or replaced without affecting the whole. Verifiable process: clear division of responsibilities makes problems easier to diagnose. Stronger scalability: easier to add features or adapt to new areas | Stronger consistency: one model handles the whole process, so thinking, execution, feedback, and decision-making are more integrated and coherent. More extreme efficiency: no inter-agent communication or start-stop overhead. Smoother experience: faster, more efficient responses to user requests. Creativity beyond human experience: end-to-end models can produce "counter-intuitive but explainable" decision patterns, and holistic thinking can yield unexpected solutions |
| Disadvantages | Higher complexity: multi-agent collaboration requires sophisticated coordination mechanisms. Higher maintenance cost: multi-component systems cost more to maintain and debug. Higher online latency: inter-agent communication may introduce extra latency. Lack of holistic thinking: divide-and-conquer can sacrifice overall perspective and creativity | Harder to update: full-model update cycles are long, and optimizing a single capability costs more. Higher resource demands: a much larger model is needed to cover all capabilities, which makes training difficult and requires exploring new methods |
| Core process | User request → identify task requirements → select appropriate tools (GoogleSearch, PythonExecute, etc.) → call external tools and execute → collect and integrate tool results → decide the next action | User request → hand all context to the end-to-end model → model generates a comprehensive response (reasoning, actions, etc.) → executes all functions internally → updates internal state → feeds the result into the next round |
Getting the Project
Project address, for reference:
https://github.com/mannaandpoem/OpenManus
Run the following in a local terminal.

Prepare the Python environment:

```shell
conda create -n open_manus python=3.12
conda activate open_manus
```

Fetch the project:

```shell
git clone https://github.com/mannaandpoem/OpenManus.git
cd OpenManus
```

Install the dependencies:

```shell
pip install -r requirements.txt
```

Create the LLM configuration file from the template:

```shell
cp config/config.example.toml config/config.toml
```
Modify the three parameters in the file (model, base_url, api_key) to match your own API:

```toml
# Global LLM configuration
[llm]
model = "gpt-4o"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."  # Replace with your actual API key
max_tokens = 4096
temperature = 0.0

# Optional configuration for specific LLM models
[llm.vision]
model = "gpt-4o"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."  # Replace with your actual API key
```
Hands-on Test
Instruction
Visit the Zhihu website, browse posts related to the keyword "manus", extract the analysis and commentary, and turn the results into a web page file in the style of Xiaohongshu. The Xiaohongshu style: 1. an attractive title 2. high-quality visuals 3. appropriate tags and keywords 4. personalization
Execution Process
Execution is very slow, requires patience, and fails with some probability.
Delivered Results
It does look quite polished; at first glance the layout and content are close to what I would have written myself. But that quality comes straight from the LLM. As I said above, OpenManus is really more of an engineering wrapper.
System Analysis
Process Breakdown
Overview
When the user enters a command (for example: "Analyze Zhihu hot searches"), the OpenManus execution process is as follows:
1. Start the system
The user starts main.py and enters a command (a simplified sketch of this entry point follows the walkthrough below)
The system creates an "intelligent assistant" instance named OpenManus to receive and process the user's instruction
2. Analysis and planning
The large language model first analyzes the user's instruction to understand what task the user wants to complete
It then thinks through a solution and develops a detailed implementation plan
Finally, it decides which tools each step needs in order to complete the task
3. Execution and feedback loop
Select the tool recommended by the LLM (for example, use BrowserUseTool to access Zhihu)
Execute the tool operation and return the result (such as web page content)
The LLM analyzes the result of the previous step
Based on that assessment, the LLM decides the next action, e.g. re-plan or execute the next step (a schematic sketch of this loop also follows the walkthrough)
4. Processing and creation
If data is obtained, the LLM writes Python code to process it
The system executes this code (via the PythonExecute tool)
The LLM continues to guide how to generate reports and visualization charts
5. Save and Finish
Once the results are ready, files are saved as needed (this is what the file-operation tool FileSaver is for)
The LLM confirms the task is complete, the system returns an execution summary, and it waits for the user's next instruction
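Step 1 above corresponds to the project's entry point. The following is a simplified sketch of main.py, not a verbatim copy; check the repo for the authoritative version:

```python
# Simplified sketch of the entry point (see main.py in the repo for the
# real version; Manus is the agent class the project exposes).
import asyncio

from app.agent.manus import Manus


async def main():
    agent = Manus()
    prompt = input("Enter your prompt: ")
    await agent.run(prompt)


if __name__ == "__main__":
    asyncio.run(main())
```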
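And step 3, the execution and feedback loop, can be summarized as below. This is explanatory pseudocode written for this article, with invented method and attribute names; the real loop lives in the repo's agent base classes:

```python
# Illustrative plan-act-observe loop (names invented for explanation only).
async def run_loop(agent, request: str) -> str:
    agent.memory.add_user_message(request)
    for _ in range(agent.max_steps):
        # Think: the LLM reads the memory and proposes the next tool call
        tool_call = await agent.think()
        if tool_call is None or tool_call.name == "terminate":
            break  # the LLM judged the task complete (or gave up)
        # Act: dispatch to the chosen tool with the LLM-generated arguments
        tool = agent.toolbox[tool_call.name]
        result = await tool.execute(**tool_call.arguments)
        # Observe: record the result so the next think() can adjust the plan
        agent.memory.add_tool_result(tool_call.name, result)
    return agent.memory.summary()
```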
The Key Role of the LLM
Here I want to highlight the role of the LLM in OpenManus. It is both the cognitive core of the system and its execution controller, continuously planning, executing, analyzing, and adapting throughout the process. This iterative approach lets the system handle complex tasks that require multiple steps, tool interactions, and dynamic adjustment based on intermediate results. In short, the LLM is the "brain" of OpenManus.
1. Command Center
The LLM receives the user's instruction, understands the task requirements, and plans the entire execution process: which steps to take, in what order, and with which tools.
2. Tool Scheduling
During task execution, the LLM selects the most appropriate tool for the current step from the default toolbox (browser, Python executor, file saver, etc.) and also generates the tool's invocation parameters.
{ "tool_calls": [ { "id": "call_9c3e5d7f8b", "type": "function", "function": { "name": "browser_use", "arguments": { "action": "execute_js", "script": "function extractHotList() {\n const hotItems = document.querySelectorAll('.HotList-item');\n const result = [];\n \n hotItems.forEach((item, index) => {\n const titleElement = item.querySelector('.HotList-itemTitle');\n const metricElement = item.querySelector('.HotList-itemMetrics');\n const topicElement = item.querySelector('.HotList-itemCategory');\n \n const title = titleElement ? titleElement.textContent.trim() : 'Unknown';\n const metric = metricElement ? metricElement.textContent.trim() : 'Unknown';\n const topic = topicElement ? topicElement.textContent.trim() : 'Unknown';\n \n result.push({\n index: index + 1,\n title: title,\n hot_metric: metric,\n topic: topic\n });\n });\n \n return JSON.stringify(result, null, 2);\n}\n\nextractHotList();" } } } ], "content": "Now that we have successfully visited the Zhihu hot list page, the next step is to extract the hot list data. Zhihu hot list data is usually contained in specific HTML elements, and I need to use JavaScript to extract this information. I will use the execute_js function of the browser_use tool to run the JavaScript code to extract information such as the title, heat, and category of the hot list items. "}
3. Code Generation
When data needs processing or charts need creating, the LLM writes appropriate Python code for the situation: scripts for cleaning data, analyzing information, and generating charts and reports for the current task. (In practice, most dependency packages still have to be installed manually after the prompts appear.)
{ "tool_calls": [ { "id": "call_6b8d9e2c5a", "type": "function", "function": { "name": "python_execute", "arguments": { "code": "import json\nimport re\n\n# Assume that this is the hot list data obtained from the previous step\nraw_data = '''\n[{\"index\":1,\"title\":\"The General Office of the State Council issued a document to promote consumption policies\",\"hot_metric\":\"14.25 million heat\",\"topic\":\"Finance\"},\n{\"index\":2,\"title\":\"Today's Spring Equinox\",\"hot_metric\":\"10.01 million heat\",\"topic\":\"Nature\"}]\n'''\n\n# Parse JSON data\ntry:\n hot_list = json.loads(raw_data)\n print(f\"Successfully loaded {len(hot_list)} hot list data\")\n \n # Processing heat values\n for item in hot_list:\n # Extracting the digital part\n if 'hot_metric' in item:\n match = re.search(r'(\\d+(\\.\\d+)?)', item['hot_metric'])\n if match:\n hot_value = float(match.group(1))\n # Adjust value according to unit\n if '万' in item['hot_metric']:\n hot_value *= 10000\n item['hot_value'] = hot_value\n else:\n item['hot_value'] = 0\n \n # Statistical topic distribution\n topics = {}\n for item in hot_list:\n topic = item.get('topic', '未知')\n topics[topic] = topics.get(topic, 0) + 1\n \n print(\"\\nDistribution of hot list topics:\")\n for topic, count in sorted(topics.items(), key=lambda x: x[1], reverse=True):\n print(f\"{topic}: {count} items\")\n \n # Sort by heat\n sorted_list = sorted(hot_list, key=lambda x: x.get('hot_value', 0), reverse=True)\n print(\"\\nTop 5 hot lists sorted by heat:\")\n for i, item in enumerate(sorted_list[:5]):\n print(f\"{i+1}. {item['title']} ({item['hot_metric']})\")\n \n # Save the processed data for later use\n processed_data = {\n 'raw_list': hot_list,\n 'sorted_list': sorted_list,\n 'topic_stats': topics\n }\n print(\"\\nData processing completed, ready for subsequent analysis\")\n \n except json.JSONDecodeError as e:\n print(f\"JSON parsing error: {e}\")\nexcept Exception as e:\n print(f\"Error in processing data: {e}\")" } } } ], "content": "Now that we have successfully extracted the Zhihu hot list data, the next step is to clean and process the data. I need to write Python code to process the JSON data, extract the heat value, standardize the topic classification, and prepare the data structure for subsequent analysis. I will use the python_execute tool to run the data processing code. "}
4. Process Management
When it hits difficulties (such as a complex web page structure or failed data extraction), the LLM can analyze the cause of failure, adjust its strategy, and try a new approach. It has a capacity for "reflection": it can learn from mistakes and improve the solution.
5. Node Decisions
After each step executes, the LLM evaluates the result and decides the next course of action: continue as planned, modify the strategy, or finish the task.
6. Result Evaluation
The LLM judges whether the task is complete and whether the results meet the requirements. It can generate a summary report and decide when to end task execution.
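"Deciding when to end" is itself modeled as a tool call: when the LLM believes the task is done, it invokes the Terminate tool mentioned earlier, and the loop stops. Below is a simplified sketch; the repo's actual implementation may differ in details:

```python
# Simplified sketch of a Terminate-style tool: the LLM calls it like any
# other tool to signal completion, and the agent loop stops iterating.
from app.tool.base import BaseTool


class Terminate(BaseTool):
    name: str = "terminate"
    description: str = "Finish the task when the request is satisfied or cannot proceed."
    parameters: dict = {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["success", "failure"]}
        },
        "required": ["status"],
    }

    async def execute(self, status: str) -> str:
        # The agent loop watches for this tool name and exits when it is called.
        return f"Task finished with status: {status}"
```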
Flow Chart
Here we focus on the LLM, which sits at the core of the project: generating plans, executing actions together with tools, and deciding whether to adjust the plan based on execution results are all driven by it.
Prompt for the Planning Phase
You are an all-capable AI assistant driven by OpenManus, designed to solve any task the user raises. You can call multiple tools to efficiently complete complex requests such as programming, information retrieval, file processing, and web browsing.
You can interact with the computer through the following tools:
- Python Execution Tool (PythonExecute): run Python code for system interaction, data processing, and automated tasks.
- File Saver Tool (FileSaver): save important content and information as local files, supporting formats such as .txt, .py, and .html.
- Browser Control Tool (BrowserUseTool): open, browse, and operate web browsers. To open a local HTML file, you must provide the absolute path to the file.
- Web Information Retrieval Tool (GoogleSearch): perform web information retrieval.

Operation strategy: proactively select the most appropriate tool or combination of tools for the user's needs. For complex tasks, break the problem down and solve it step by step with different tools. After using each tool, clearly explain the execution result and recommend the next step.
Now the user gives you the task: manus has been very active recently; please collect relevant information and compile it into an in-depth report, as an HTML file with pictures and text.
System timing diagram
Use Case Diagram
Returning to the example at the beginning, this diagram shows what happens inside the system after we type the command "Collect posts about manus on Zhihu...", and the calling order between modules. It also makes intuitively clear how decisive a role the LLM plays in the agent.