Use multimodal models to write a new generation of crawlers

Explore the next generation of automated crawler technology and how Midscene.js uses multimodal models to simplify front-end automated testing.
Core content:
1. Introduction to the Midscene.js plugin and its application in the crawler field
2. Plugin API: Action, Query, Assert function details
3. Plugin installation, configuration and code integration practical guide
ByteDance has a very practical but not very well-known project called Midscene.js, which has only about 10,000 installations on the Chrome Web Store. It is a front-end automated testing plugin driven by a multimodal model. I rarely use automated testing, but I find it particularly suitable for writing crawlers...
Midscene.js has three major APIs: Action, Query, and Assert.
Action Interaction
Describe the steps in natural language and Midscene.js performs the interaction. For example, on GitHub: find the Twikoo project, open its details page, and click Star.
Query Extraction
"Understand" the UI and extract data from it. The return value is JSON, in whatever data structure you ask for. For example, on an interview question collection website, the query "string[], all interview questions" extracts every question as an array of strings.
Assert Assertion
Determine whether the specified conditions are met. For example, on the smart home page, assert that the computer is turned off.
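Put together, the three calls might be combined along these lines. This is only a sketch: the method names aiAction, aiQuery, and aiAssert follow the Midscene.js agent, but the UiAgent interface, the collectQuestions function, and the prompts are made up for illustration so that the snippet stands on its own.

```typescript
// Minimal interface covering only the three calls used below. The method
// names follow the Midscene.js agent; the interface itself is an assumption
// so that this sketch does not depend on any installed package.
interface UiAgent {
  aiAction(prompt: string): Promise<unknown>;
  aiQuery(prompt: string): Promise<unknown>;
  aiAssert(prompt: string): Promise<unknown>;
}

// Hypothetical workflow: interact, check the page state, then extract.
async function collectQuestions(agent: UiAgent): Promise<string[]> {
  // Action: describe the interaction in natural language.
  await agent.aiAction("Scroll down until the list of interview questions is visible");

  // Assert: confirm the page is in the expected state before extracting.
  await agent.aiAssert("The page shows a list of interview questions");

  // Query: the part before the comma describes the shape of the result.
  const raw = await agent.aiQuery("string[], all interview questions");

  // The data comes from a model, so check the shape before trusting it.
  if (!Array.isArray(raw)) {
    return [];
  }
  return raw.filter((item): item is string => typeof item === "string");
}
```

Any object with these three methods can be passed in, which also makes the workflow easy to exercise with a fake agent before pointing it at a real browser.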
Large model support
Initially, the project only supported GPT-4o, and running a single line of a test case cost around ¥0.1, which was quite expensive. Later it added support for Qwen-2.5-VL and UI-TARS, which reduced the cost considerably. The following uses the Qwen model as an example to help you get started with this plugin.
Install
The plugin can be installed directly from the Chrome Web Store: https://chromewebstore.google.com/detail/midscene/gbldofcpkknbggpkmbdaefngejllnief
Configuration
Open the Midscene.js sidebar from the extensions menu in the upper right corner of the browser. It will show a "No config" prompt. Click the button to open the Env Config dialog and set the following variables:
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1
You need to apply for the OPENAI_API_KEY yourself at: https://bailian.console.aliyun.com/?apiKey=1#/api-key
The link above is not a referral link. If you are activating Alibaba Cloud Bailian for the first time, new users receive a free quota. Keep an eye on its validity period so that it does not go to waste.
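If the same four settings are later supplied to a Node.js script through environment variables instead of the Env Config dialog, a quick check for missing values can save a confusing failure further on. A minimal sketch: the helper name missingSettings is hypothetical and not part of Midscene.js, and it takes the environment as a parameter so that it can be tried without touching real variables.

```typescript
// The four settings used in this article.
const REQUIRED_KEYS = [
  "OPENAI_BASE_URL",
  "OPENAI_API_KEY",
  "MIDSCENE_MODEL_NAME",
  "MIDSCENE_USE_QWEN_VL",
] as const;

// Hypothetical helper: returns the names of settings that are unset or blank.
function missingSettings(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => (env[key] ?? "").trim() === "");
}

// Hypothetical helper: fail early with a readable message.
function assertConfigured(env: Record<string, string | undefined>): void {
  const missing = missingSettings(env);
  if (missing.length > 0) {
    throw new Error("Missing settings: " + missing.join(", "));
  }
}
```

In a Node.js script this would be called as assertConfigured(process.env) before creating the agent.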
Test
Next, write an instruction in natural language, click the Run button, and watch the AI take over your browser...
Code Integration
Next, we will write a crawler that combines these APIs to complete a more complex automation task.
Create a new Node.js project and install the required dependencies:
pnpm install @midscene/web tsx --save-dev
Write a script, main.ts, that performs the operations you want. For example, open Bing, enter iMaeGoo, click Search, and output the search results:
import { AgentOverChromeBridge } from "@midscene/web/bridge-mode";

// Resolve after the given number of milliseconds.
function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

async function main() {
  // Model configuration, same values as in the extension's Env Config.
  process.env.OPENAI_BASE_URL =
    "https://dashscope.aliyuncs.com/compatible-mode/v1";
  process.env.OPENAI_API_KEY = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
  process.env.MIDSCENE_MODEL_NAME = "qwen-vl-max-latest";
  // Environment variables are strings, so use "1" rather than the number 1.
  process.env.MIDSCENE_USE_QWEN_VL = "1";

  const agent = new AgentOverChromeBridge();

  // This method will connect to your desktop Chrome's new tab page.
  // Remember to enable your Chrome extension and click the 'Allow connection'
  // button. Otherwise you will get a timeout error.
  await agent.connectNewTabWithUrl("https://www.bing.com");

  // These methods are the same as the normal Midscene agent.
  await agent.ai("Enter iMaeGoo and click to search");
  const result = await agent.aiQuery(
    "{title: string, url: string}[], search results"
  );
  console.log("search result", result);

  await sleep(3000);
  await agent.destroy();
}

main();
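One thing to note about the script: agent.destroy() only runs if every step before it succeeds, so a timeout or a failed AI step would skip the cleanup. A try/finally wrapper is one way to handle that. The sketch below is an assumption rather than anything Midscene.js provides: withAgent and the Destroyable interface are hypothetical names, and the only thing relied on is that the agent has a destroy() method, as the script shows.

```typescript
// Only the one method this helper needs. The bridge agent used in the
// script has a destroy() method, so it fits this shape.
interface Destroyable {
  destroy(): Promise<unknown>;
}

// Hypothetical helper: run a task with an agent and always release it,
// even when a step inside the task throws.
async function withAgent<A extends Destroyable, T>(
  agent: A,
  task: (agent: A) => Promise<T>
): Promise<T> {
  try {
    return await task(agent);
  } finally {
    await agent.destroy();
  }
}
```

With this in place, the body of main could be passed as the task, for example withAgent(new AgentOverChromeBridge(), async (agent) => { ... }), and the agent would be destroyed whether the crawl succeeds or fails.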
Open the Chrome extension, switch to Bridge Mode, and click the 'Allow connection' button. Then run the script, for example with pnpm exec tsx main.ts.
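The search results printed by the script are whatever the model read off the page, so a crawler will probably want to check them before storing them. A minimal sketch of such a check for the {title: string, url: string}[] shape requested in the query; the SearchResult type and both function names are hypothetical and not part of Midscene.js.

```typescript
// Shape requested by the query "{title: string, url: string}[], search results".
interface SearchResult {
  title: string;
  url: string;
}

// Hypothetical type guard for a single entry.
function isSearchResult(value: unknown): value is SearchResult {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.title === "string" && typeof record.url === "string";
}

// Keep only well-formed entries and drop entries whose URL was already seen.
function cleanSearchResults(raw: unknown): SearchResult[] {
  if (!Array.isArray(raw)) return [];
  const seen = new Set<string>();
  const cleaned: SearchResult[] = [];
  for (const item of raw) {
    if (!isSearchResult(item) || seen.has(item.url)) continue;
    seen.add(item.url);
    cleaned.push({ title: item.title.trim(), url: item.url });
  }
  return cleaned;
}
```

In main.ts, the value returned by agent.aiQuery could be passed through cleanSearchResults before it is logged or saved.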