Use multimodal models to write a new generation of crawlers

Written by
Caleb Hayes
Updated on: July 9, 2025
Recommendation

Explore the next generation of automated crawler technology and how Midscene.js uses multimodal models to simplify front-end automated testing.

Core content:
1. Introduction to the Midscene.js plugin and its application in the crawler field
2. Plugin APIs: details of the Action, Query, and Assert functions
3. Plugin installation, configuration and code integration practical guide

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

 

ByteDance has a very practical but little-known project called Midscene.js, which has only 10,000 installations on the Chrome Web Store. It is a front-end automated testing plugin driven by a multimodal model. I rarely do automated testing, but I find it particularly well suited to writing crawlers...

Midscene.js has three major APIs: Action, Query, and Assert

Action Interaction

Describe the steps in natural language, and the model performs the interaction. For example, on GitHub: find the Twikoo project, open its details page, and click Star.
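As a rough sketch of driving the GitHub example from code, the steps can be passed to the agent one prompt at a time. The only agent method assumed is ai, which appears in the script later in this article; runSteps and githubSteps are hypothetical names made up for illustration, and whether one combined prompt or several small ones works better will depend on the model.

```typescript
// Minimal shape of the agent used here; only `ai` is assumed.
interface ActionAgent {
  ai(prompt: string): Promise<unknown>;
}

// Hypothetical helper: run natural-language steps one at a time, in order,
// and report how many completed.
async function runSteps(agent: ActionAgent, steps: string[]): Promise<number> {
  let completed = 0;
  for (const step of steps) {
    await agent.ai(step); // each call is one Action
    completed += 1;
  }
  return completed;
}

// The GitHub example, split into three prompts (wording is illustrative).
const githubSteps: string[] = [
  "Search for the Twikoo project",
  "Open the project's details page",
  "Click the Star button",
];
```

If a step fails, the rejected promise stops the loop, so later steps never run against a page in an unexpected state.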

Query Extraction

"Understand" the UI and extract data from it. The return value is JSON, in whatever data structure you ask for. For example, on an interview question collection website, the query "string[], all interview questions" returns every interview question as an array of strings.
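Because the data comes back from a model, it seems prudent to check its shape before using it. The sketch below assumes the query result arrives as already-parsed JSON; isStringArray and cleanQuestions are hypothetical helpers, not part of Midscene.js.

```typescript
// Query prompt in the "<shape>, <description>" form from the example above.
const questionsPrompt = "string[], all interview questions";

// Type guard: true only for an array whose items are all strings.
function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Hypothetical post-processing: validate, trim, drop empties and duplicates.
function cleanQuestions(raw: unknown): string[] {
  if (!isStringArray(raw)) {
    throw new TypeError("expected a JSON array of strings");
  }
  const trimmed = raw.map((q) => q.trim()).filter((q) => q.length > 0);
  return Array.from(new Set(trimmed)); // Set keeps first-seen order
}
```

The value returned for questionsPrompt would be passed through cleanQuestions before being stored.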

Assert Assertion

Determine whether the specified conditions are met. For example, on the smart home page, assert that the computer is turned off.
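In a crawler, an assertion is often more useful as a yes/no check than as a hard failure. The article does not name the method, so the sketch below assumes it is called aiAssert and that it rejects when the condition does not hold; that matches my understanding of the Midscene.js API, but check it against the current documentation. The helper holds is hypothetical.

```typescript
// Assumed shape: `aiAssert` resolves when the condition holds and
// rejects when it does not.
interface AssertAgent {
  aiAssert(assertion: string): Promise<unknown>;
}

// Hypothetical helper: turn a throwing assertion into a boolean.
// Caveat: any other failure (network, timeout) is also reported as false.
async function holds(agent: AssertAgent, assertion: string): Promise<boolean> {
  try {
    await agent.aiAssert(assertion);
    return true;
  } catch {
    return false;
  }
}
```

For the smart home example, the call would be holds(agent, "The computer is turned off").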

Large model support

Initially, the project only supported the GPT-4o model, and running a single line of a test case cost around ¥0.1, which was quite expensive. It later added support for Qwen-2.5-VL and UI-TARS, which greatly reduced the cost. The following uses the Qwen (Qianwen) model as an example to help you get started with this magical plugin.

Install

The plugin can be installed directly from the Chrome Web Store: https://chromewebstore.google.com/detail/midscene/gbldofcpkknbggpkmbdaefngejllnief

Configuration

Open the Midscene.js sidebar from the plugin menu in the upper right corner of the browser. It will show a "No config" prompt. Click the button to open the Env Config dialog, and configure the following variables in it:

OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1

You need to apply for the OPENAI_API_KEY value yourself at: https://bailian.console.aliyun.com/?apiKey=1#/api-key

The link above is not a referral link. If you are activating Alibaba Cloud Bailian for the first time, new users receive a free quota. Keep an eye on the quota's validity period so that it does not go to waste.
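The same four variables are needed again when Midscene.js is driven from a Node.js script, and a missing one tends to surface only as a confusing model error. A minimal sketch of a pre-flight check follows; REQUIRED_VARS and missingVars are hypothetical names, and the function takes the environment as a parameter so it can be tried without touching real settings.

```typescript
// The four variables from the configuration above.
const REQUIRED_VARS: string[] = [
  "OPENAI_BASE_URL",
  "OPENAI_API_KEY",
  "MIDSCENE_MODEL_NAME",
  "MIDSCENE_USE_QWEN_VL",
];

type Env = Record<string, string | undefined>;

// Hypothetical helper: names of required variables that are unset or blank.
function missingVars(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => (env[name] ?? "").trim() === "");
}

// In a Node.js script this would be called as: missingVars(process.env)
```

An empty result means all four variables are present; otherwise the returned names say exactly what to fix.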

Test

Next, write a command in natural language, click the Run button, and watch the AI take over your browser...

Code Integration

Next, we will try to write a crawler, combining these three APIs to complete complex automation tasks.

Create a new Node.js project and install the required dependencies:

pnpm install @midscene/web tsx --save-dev

Write a script, main.ts, that performs the operations you want. For example, open Bing, enter iMaeGoo, click Search, and output the search results:

import { AgentOverChromeBridge } from "@midscene/web/bridge-mode";

// Resolve after `ms` milliseconds
function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

async function main() {
  // Model configuration; replace the placeholder key with your own
  process.env.OPENAI_BASE_URL =
    "https://dashscope.aliyuncs.com/compatible-mode/v1";
  process.env.OPENAI_API_KEY = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
  process.env.MIDSCENE_MODEL_NAME = "qwen-vl-max-latest";
  // Environment variables are strings, so use "1" rather than the number 1
  process.env.MIDSCENE_USE_QWEN_VL = "1";

  const agent = new AgentOverChromeBridge();
  // This method will connect to a new tab in your desktop Chrome
  // Remember to enable your Chrome extension and click the 'Allow connection' button. Otherwise you will get a timeout error
  await agent.connectNewTabWithUrl("https://www.bing.com");

  // These methods are the same as on the normal Midscene agent
  await agent.ai("Enter iMaeGoo and click to search");
  const result = await agent.aiQuery(
    "{title: string, url: string}[], search results"
  );
  console.log("search result", result);

  await sleep(3000);
  await agent.destroy();
}

main();
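One gap in a script shaped like the one above: if any step throws, the final destroy() call is skipped and the bridge connection may be left open. A possible remedy is a small try/finally wrapper. The sketch assumes only that the agent has an async destroy() method, as used in the script; withAgent is a hypothetical helper.

```typescript
// Anything with an async destroy() method, such as the bridge agent above.
interface Destroyable {
  destroy(): Promise<unknown>;
}

// Hypothetical helper: run a task with the agent and always destroy it
// afterwards, whether the task succeeds or throws.
async function withAgent<A extends Destroyable, T>(
  agent: A,
  task: (agent: A) => Promise<T>
): Promise<T> {
  try {
    return await task(agent);
  } finally {
    await agent.destroy();
  }
}
```

With this in place, the body of main would move into the task callback, and the explicit destroy() call at the end would no longer be needed.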

Launch your Chrome extension, click Bridge Mode, and then click the 'Allow connection' button.