Crawlee: Unlocking the superpowers of data crawling and automation

Explore Crawlee, a new generation of crawler tool, and open up a new era of data crawling and automation.
Core content:
1. The outstanding features of Crawlee as a next-generation crawler tool
2. The comparative advantages of Crawlee over existing crawler frameworks
3. Application scenarios of Crawlee in data mining, automated testing and web monitoring
Thank you for taking the time to read this article.
1. Crawlee: The rising star of the crawler world
In the data-driven era, web crawlers and browser automation tools have become powerful assistants for obtaining information. From data collection in academic research, to market intelligence analysis in business, to data indexing in search engines, these tools play an indispensable role. Crawlee, a next-generation web crawler and browser automation tool, is steadily emerging and giving developers a new experience with its powerful features and excellent performance. It not only copes easily with complex network environments and anti-crawler mechanisms, but also provides a simple, efficient programming interface that makes data crawling and browser automation tasks easier and more productive. Next, let's explore the wonderful world of Crawlee in depth.
2. A first meeting with Crawlee
What is Crawlee
Crawlee is an open-source web crawler and browser automation library carefully crafted by the Apify team. Like a master key, it opens the door to efficient data crawling and browser automation for developers. It was designed to simplify and speed up the development of web crawlers and browser automation tasks, letting developers focus on business logic rather than low-level technical details. Through its API, Crawlee helps developers quickly crawl web content and automate browser operations, and it can be used in data mining, automated testing, web monitoring, and other fields.
Crawlee supports multiple programming languages, including the widely used JavaScript, Python, and TypeScript, so developers with different technical backgrounds can get started easily and choose the language that fits their preferences and project requirements. In today's digital age, the value of data is self-evident. In data mining, Crawlee helps enterprises extract valuable information from massive amounts of web data and provides strong support for decision-making. In automated testing, it can simulate user operations in the browser to automate the testing of web applications, improving testing efficiency and accuracy. In web monitoring, it can watch pages for changes in real time and notify users promptly, helping keep a website running normally.
Why choose Crawlee
There are already many excellent frameworks in the crawler field, such as Scrapy, BeautifulSoup, and Selenium. So why does Crawlee stand out and attract the attention of so many developers? Let's find out through a comparison.
Scrapy is a powerful Python crawler framework. It performs well on large-scale data crawling tasks, with efficient asynchronous processing and rich middleware extensions. However, when faced with dynamic, interactive pages that require JavaScript, Scrapy falls short: it focuses mainly on traditional HTTP requests and static page crawling, so it struggles to obtain complete data from content rendered by JavaScript.
BeautifulSoup is a simple, easy-to-use HTML/XML parsing library, usually paired with Requests. Its advantage is ease of use: it can parse HTML or XML documents with little effort. However, its functionality is relatively limited and focused on parsing page structure; it offers no comprehensive support for complex crawler tasks such as handling dynamic content, automatic scaling, or proxy rotation.
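For context, the typical Requests plus BeautifulSoup workflow looks like the short sketch below. The URL and the h2.title selector are placeholders chosen for illustration; the point is that only the HTML already present in the response can be parsed.
# A minimal Requests + BeautifulSoup sketch: fetch a static page and parse it.
# The URL and the 'h2.title' selector are placeholders for illustration only.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/articles', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the text of every matching heading from the already-rendered HTML.
for heading in soup.select('h2.title'):
    print(heading.get_text(strip=True))

# Anything injected later by JavaScript never appears in response.text,
# so it cannot be extracted this way.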
Selenium, as a browser automation tool, supports multiple programming languages and can simulate user operations in the browser, handling dynamically loaded content and complex interactions. However, Selenium has performance limitations: it runs relatively slowly and consumes a lot of resources on large-scale tasks. In addition, its lack of persistent queues and automatic scaling limits its applicability in some scenarios.
In contrast, Crawlee's advantages are obvious. First, it handles dynamic content well. Built on Playwright and Puppeteer, it can simulate real user actions in the browser, such as clicking, scrolling, and typing, and can cope with complex interaction scenarios such as captcha recognition and slider verification. This makes it easy to obtain complete and accurate data from pages that rely on JavaScript for dynamic loading.
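To make that concrete, here is a minimal sketch of driving a dynamic page through Crawlee for Python's Playwright-based crawler using ordinary Playwright page calls (fill, click, wait_for_selector). The import path, URL, and selectors are assumptions for illustration and may need adjusting to your installed Crawlee version.
# A minimal sketch of driving a dynamic page with Crawlee's Playwright-based crawler.
# Import path, URL and selectors are assumptions; adjust them to your version and site.
import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handle(context: PlaywrightCrawlingContext) -> None:
        page = context.page  # a regular Playwright Page object
        await page.fill('input[name="q"]', 'crawlee')    # type into a search box
        await page.click('button[type="submit"]')        # click the search button
        await page.wait_for_selector('.result-item')     # wait for JS-rendered results
        titles = await page.locator('.result-item h3').all_text_contents()
        await context.push_data({'url': context.request.url, 'titles': titles})

    await crawler.run(['https://example.com/search'])

if __name__ == '__main__':
    asyncio.run(main())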
Secondly, Crawlee has automatic scaling. It manages concurrency automatically according to the available system resources, improving crawler efficiency. When processing large-scale crawling tasks, it dynamically adjusts resource allocation to the needs of the task so that the crawl runs efficiently and stably. This auto-scaling capability lets Crawlee make the most of its performance on large data-crawling jobs, saving significant development time and resource cost.
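As a hedged sketch of how that scaling can be bounded in Crawlee for Python, the snippet below passes a ConcurrencySettings object to the crawler. The class name, its parameters, and the import paths are assumptions based on recent releases, so verify them against the documentation of your installed version.
# A sketch of constraining Crawlee's automatic concurrency scaling.
# ConcurrencySettings and its parameter names are assumptions based on recent
# Crawlee for Python releases; verify them against your installed version.
import asyncio
from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        concurrency_settings=ConcurrencySettings(
            min_concurrency=2,    # never drop below two parallel tasks
            max_concurrency=20,   # hard upper bound, even if resources allow more
        ),
    )

    @crawler.router.default_handler
    async def handle(context: PlaywrightCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    # Crawlee scales the actual concurrency between the configured bounds
    # based on available CPU and memory.
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())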
Furthermore, Crawlee has a powerful built-in proxy rotation feature. Using proxy servers is one of the important means of bypassing geographical restrictions and avoiding IP bans. Crawlee's proxy rotation manages proxies intelligently, discarding those that frequently time out, return network errors, or return bad HTTP codes (such as 401 or 403), keeping the proxy pool healthy and improving the stability and reliability of crawling. This lets Crawlee respond flexibly to various anti-crawler mechanisms and keep crawl tasks running smoothly.
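A minimal sketch of wiring a proxy list into a crawler follows. The ProxyConfiguration import path and its proxy_urls parameter are assumptions based on recent Crawlee for Python releases, and the proxy URLs themselves are placeholders.
# A sketch of proxy rotation: Crawlee picks proxies from this list for outgoing
# requests and retires unhealthy ones. The proxy URLs are placeholders, and the
# ProxyConfiguration import path and parameters are assumptions to verify.
import asyncio
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-1.example.com:8000',
            'http://proxy-2.example.com:8000',
        ],
    )
    crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)

    @crawler.router.default_handler
    async def handle(context: PlaywrightCrawlingContext) -> None:
        # context.proxy_info (when available) reports which proxy served the request.
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())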
In addition, Crawlee provides flexible queue management and supports multiple queue types, such as priority queues and scheduled queues, so developers can adjust the crawling strategy and arrange the execution order of tasks as needed. Captured data can be stored in a variety of databases and storage systems, such as MySQL, MongoDB, and Elasticsearch, which makes subsequent processing and analysis convenient. Crawlee also benefits from a rich plug-in ecosystem: its active community provides many practical plug-ins, such as proxy plug-ins and data analysis plug-ins, which further extend Crawlee's functionality and help developers complete crawler projects efficiently.
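The sketch below shows the storage primitives that typically back such a workflow in Crawlee for Python: a RequestQueue of pending URLs and a Dataset of scraped records. The import paths and method names are assumptions based on recent releases; pushing results into an external database such as MySQL or MongoDB would be done by your own export code afterwards.
# A sketch of Crawlee's storage primitives: a request queue of URLs to visit and
# a dataset of scraped items. Import paths and method names are assumptions based
# on recent Crawlee for Python releases; loading the data into MySQL, MongoDB,
# Elasticsearch, etc. would happen in your own code after export.
import asyncio
from crawlee import Request
from crawlee.storages import Dataset, RequestQueue

async def main() -> None:
    request_queue = await RequestQueue.open()        # persistent queue of pending requests
    await request_queue.add_request(Request.from_url('https://example.com/page-1'))
    await request_queue.add_request(Request.from_url('https://example.com/page-2'))

    dataset = await Dataset.open()                   # tabular store for scraped records
    await dataset.push_data({'title': 'Example', 'url': 'https://example.com/page-1'})

if __name__ == '__main__':
    asyncio.run(main())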
3. Crawlee in action: from installation to data crawling
Install Crawlee
Before we start using Crawlee, we need to install it into our development environment. Crawlee can be installed with pip, Python's package management tool, which makes it easy to install, upgrade, and manage Python packages. We also need to install Playwright, because Crawlee builds on Playwright for browser automation. Execute the following commands on the command line:
# Crawlee is the crawlee package on PyPI. This package contains the core functionality, while other features are provided as optional extras to minimize dependencies and package size.
# You can install Crawlee with all features or select only the features you need. To install everything with the pip package manager:
python -m pip install 'crawlee[all]'
# Verify that the installation is successful
python -c 'import crawlee; print(crawlee.__version__)'
# Install Playwright
playwright install
During installation, pip will automatically download Crawlee and its dependencies and install them into your Python environment. Make sure your Python environment is correctly configured and your network connection is working to avoid problems during installation. If you run into permission issues, try running the command line with administrator privileges, or use a virtual environment to isolate project dependencies so the system environment is not affected. Once installation is complete, we can import Crawlee into our Python project and begin our data crawling journey.
Basic crawler structure
When using Crawlee to crawl data, we first need to understand its basic crawler structure. Let's take crawling a website's article list as an example to explain how to build a simple crawler with Crawlee.
First, we import the Crawlee library and related modules. In Python, we can import them with the following code:
from crawlee import Request, Spider, run_spider
Here, Request is used to define an HTTP request, Spider is the base class of the crawler that we inherit to create our own crawler, and run_spider is used to start the crawler.
Next, we create a crawler class that inherits from Spider. In this class we define a start_requests method, which generates the initial requests. For example:
class ArticleSpider(Spider):
    async def start_requests(self):
        yield Request(url='https://testerroad.com/articles', callback=self.parse_articles)
In this example, we generate a request and specify a callback function, parse_articles, to process the response returned for that request.
Then we define the parse_articles method, which parses the response and extracts the data we need. For example:
async def parse_articles(self, response):
    articles = response.css('div.article')
    for article in articles:
        title = article.css('h2.title::text').get()
        link = article.css('a::attr(href)').get()
        yield {
            'title': title,
            'link': link
        }
In this method, we use response.css to select elements on the page; div.article selects all div elements whose class is article. We then use CSS selectors to extract each article's title and link and return them as a dictionary. We use the yield keyword because we may extract data from many articles: yield returns the items one by one instead of all at once, which saves memory.
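For comparison, the same article-list idea can also be written in the handler-based style documented for recent Crawlee for Python releases, where a crawler instance and a request handler take the place of a Spider subclass. This is only a sketch: the import path, URL, and CSS selectors are assumptions.
# A minimal handler-based sketch of the same article-list crawl.
# Import path, URL and CSS selectors are assumptions for illustration.
import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handle_articles(context: PlaywrightCrawlingContext) -> None:
        # Iterate over every article card rendered on the page.
        for article in await context.page.locator('div.article').all():
            await context.push_data({
                'title': await article.locator('h2.title').text_content(),
                'link': await article.locator('a').first.get_attribute('href'),
            })

    await crawler.run(['https://testerroad.com/articles'])

if __name__ == '__main__':
    asyncio.run(main())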
Data crawling practice
Below, a complete code example shows how to use Crawlee to crawl web page content and save the captured data. The following Python example, using crawlee and Playwright, demonstrates how to crawl a dynamic web page (taking an e-commerce product list as an example) and store the data.
from crawlee import (
    PlaywrightCrawler,
    Request,
    RequestQueue,
    Dataset,
    ProxyConfiguration
)
from playwright.async_api import Page
import asyncio

'''
@Author : TesterRoad
@Time : 2025/3
@Desc: Official account: Test engineer growth path
@Software: PyCharm
'''

# Target website example (please replace with an actual crawlable website)
TARGET_URL = 'https://www.testerroad.com.cn/products'

async def handle_page(page: Page, request: Request):
    try:
        # Wait for the product list to load
        await page.wait_for_selector('.product-item', timeout=5000)
        # Get all product items
        product_items = await page.query_selector_all('.product-item')
        results = []
        for item in product_items:
            # Extract product data from the child elements of each item
            title_el = await item.query_selector('.product-title')
            price_el = await item.query_selector('.price')
            desc_el = await item.query_selector('.description')
            title = await title_el.text_content() if title_el else None
            price = await price_el.text_content() if price_el else None
            description = await desc_el.text_content() if desc_el else None
            results.append({
                'title': title.strip() if title else None,
                'price': float(price.replace('$', '')) if price else None,
                'description': description.strip() if description else None,
                'url': page.url
            })
        # Save the data to the dataset
        await Dataset.push_data(results)
        # Handle pagination (example: find the next-page button)
        next_button = await page.query_selector('.pagination .next')
        if next_button:
            next_url = await next_button.get_attribute('href')
            if next_url:
                await RequestQueue.add_request(
                    Request(url=page.url.split('?')[0] + next_url)
                )
    except Exception as e:
        print(f'Error processing {request.url}: {str(e)}')

# Configure the crawler
async def main():
    # Initialize the request queue
    request_queue = await RequestQueue.open()
    await request_queue.add_request(Request(url=TARGET_URL))
    # Proxy configuration (optional)
    proxy_config = ProxyConfiguration({
        'proxyGroups': ['RESIDENTIAL']
    })
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        request_queue=request_queue,
        handle_page_function=handle_page,
        proxy_configuration=proxy_config,
        browser_launch_options={
            'headless': True,  # headless mode
            'timeout': 15000
        },
        navigation_options={
            'wait_until': 'domcontentloaded',
            'timeout': 30000
        }
    )
    # Run the crawler
    await crawler.run()
    # Export the data to a JSON file
    await Dataset.export_to_json('products.json')

if __name__ == '__main__':
    asyncio.run(main())
Code Analysis
1. Dependency imports:
• PlaywrightCrawler: the Playwright-based crawler
• Request / RequestQueue: manage the request queue
• Dataset: the data storage module
• ProxyConfiguration: proxy configuration
2. Page processing function (handle_page):
• Waits for the product list to load
• Uses CSS selectors to extract data
• Handles pagination logic automatically
• Includes an error-handling mechanism
3. Core configuration:
• Proxy settings (a valid proxy account is required)
• Browser launch parameters
• Navigation timeout settings
4. Data storage:
• Uses the built-in Dataset module
• Finally exports the results to a JSON document
4. Crawlee Advanced Applications and Techniques
Handling complex web page structures
In real crawling tasks we often encounter complex page structures, such as nested pages and pagination, which make data capture harder. Crawlee provides powerful features and flexible methods to help us meet these challenges.
For nested pages, Crawlee allows us to generate new requests inside callback functions. For example, when crawling an e-commerce site, we may need to follow links from the product list page to the product detail pages to get more detailed information. In the callback that parses the product list page, we can generate a new request for each product link and specify a new callback to handle the detail page. A code example follows:
async def parse_product_list(self, response):
    products = response.css('div.product-item')
    for product in products:
        product_link = product.css('a::attr(href)').get()
        yield Request(url=product_link, callback=self.parse_product_detail)

async def parse_product_detail(self, response):
    title = response.css('h1.product-title::text').get()
    price = response.css('span.product-price::text').get()
    description = response.css('div.product-description::text').get()
    yield {
        'title': title,
        'price': price,
        'description': description
    }
In the code above, the parse_product_list method parses the product list page, extracts each product's link, and generates a new request for each link to be handled by parse_product_detail. The parse_product_detail method then parses the product detail page and extracts information such as the product title, price, and description.
Paginated pages are also easy to handle with Crawlee. By analyzing a site's pagination rules, we can generate requests for the different page numbers inside a callback. For example, for a paginated news list with 10 items per page, we can generate requests for each page number in a loop to crawl all the news. A code example follows:
async def start_requests(self):
    base_url = 'https://testerroad.com/news?page={}'
    for page in range(1, 11):  # Assume there are 10 pages in total
        url = base_url.format(page)
        yield Request(url=url, callback=self.parse_news_list)

async def parse_news_list(self, response):
    news = response.css('div.news-item')
    for item in news:
        title = item.css('h2.news-title::text').get()
        link = item.css('a::attr(href)').get()
        yield {
            'title': title,
            'link': link
        }
In this example, the start_requests method loops over pages 1 to 10, formatting the URL with each page number. The parse_news_list method parses the news list on each page and extracts each item's title and link.
Dealing with anti-crawler mechanisms
In the world of web crawlers, anti-crawler mechanisms are a challenge we often face. Fortunately, Crawlee has a series of powerful anti-blocking features built in that help us deal with them effectively and keep crawl tasks running smoothly.
Crawlee provides proxy rotation. By using proxy servers, we can hide our real IP address and reduce the risk of being blocked by the target website. Crawlee manages proxies intelligently, automatically discarding those that frequently time out, return network errors, or return bad HTTP codes (such as 401 or 403), keeping the proxy pool healthy. When configuring proxies, we can implement rotation by supplying a proxy list. A code example follows:
from crawlee import Request, Spider, run_spider

'''
@Author : TesterRoad
@Time : 2025/3
@Desc: Official account: Test engineer growth path
@Software: PyCharm
'''

class MySpider(Spider):
    async def start_requests(self):
        proxy_list = [
            'http://proxy1.testerroad.com:8080',
            'http://proxy2.testerroad.com:8080',
            'http://proxy3.testerroad.com:8080'
        ]
        for proxy in proxy_list:
            yield Request(url='https://testerroad.com', proxy=proxy, callback=self.parse_page)

    async def parse_page(self, response):
        # Process the response data
        pass

if __name__ == "__main__":
    run_spider(MySpider)
In the code above, we define a proxy list, proxy_list, and then in the start_requests method set a different proxy for each request to achieve proxy rotation.
In addition, Crawlee has human-like fingerprint generation: it can simulate real users' browser fingerprints, including browser type, version, operating system, and other details, making the crawler's behavior closer to that of a real user and reducing the chance of detection by anti-crawler mechanisms. In Crawlee these features are enabled by default and need no extra configuration, which is very convenient.
Distributed crawler deployment
As data volumes keep growing and crawl tasks become more complex, a single-machine crawler may hit its limits. To improve crawling efficiency, Crawlee supports distributed deployment, letting us spread crawl tasks across multiple nodes to run in parallel. Distributed deployment makes full use of the computing resources of multiple machines, greatly speeds up data capture, and improves the stability and reliability of the crawler system. Although configuring a distributed crawler is relatively complex, Crawlee provides documentation and tooling to help developers set up a distributed environment quickly. For readers with large-scale data crawling needs, distributed deployment is a direction worth studying and practicing in depth; it can bring higher efficiency and greater flexibility to data collection.
5. Conclusion
As a powerful, flexible, and easy-to-use next-generation web crawler and browser automation tool, Crawlee shows great advantages in the field of data crawling. It not only has strong crawling capabilities that cope with complex page structures and anti-crawler mechanisms, but also offers a simple, efficient programming interface that lets developers implement data crawling and browser automation with ease. With Crawlee, we can obtain web data efficiently and provide solid data support for tasks such as data analysis and machine learning.
For developers, Crawlee is undoubtedly a tool worth trying. Whether you are new to crawlers or an experienced crawler developer, Crawlee can bring you a fresh experience and real convenience. Its rich functionality and active community will help you keep moving forward in crawler development and turn more of your ideas into reality.