Crawlee: Unlocking the superpowers of data crawling and automation

Written by
Clara Bennett
Updated on: June 27, 2025
Recommendation

Explore Crawlee, a new generation of crawler tools, and open a new era of data crawling and automation.

Core content:
1. The outstanding features of Crawlee as a next-generation crawler tool
2. The comparative advantages of Crawlee over existing crawler frameworks
3. Application scenarios of Crawlee in data mining, automated testing and web monitoring

1. Crawlee: The rising star of the crawler world

In the data-driven era, web crawlers and browser automation tools have become powerful assistants for obtaining information. From data collection in academic research, to market intelligence analysis in business, to data indexing for search engines, these tools play an indispensable role. Crawlee, a next-generation web crawling and browser automation tool, is steadily gaining ground and giving developers a new experience with its powerful features and excellent performance. It can easily cope with complex network environments and anti-crawler mechanisms, and it provides a simple, efficient programming interface that makes data crawling and browser automation tasks easier and faster. Next, let's explore the world of Crawlee in depth.

2. A First Look at Crawlee

What is Crawlee

Crawlee is an open-source web crawling and browser automation library carefully built by the Apify team. Like a master key, it opens the door to efficient data crawling and browser automation for developers. It was designed to simplify and speed up the development of web crawlers and browser automation tasks, letting developers focus on business logic instead of low-level technical details. With a simple API, Crawlee helps developers quickly crawl web content and automate browser operations, and it can be used in data mining, automated testing, web monitoring and other fields.

Crawlee supports multiple programming languages, including the widely used JavaScript, TypeScript and Python, so developers with different technical backgrounds can get started easily and choose the language that fits their preferences and project requirements. In today's digital age, the value of data is self-evident. In data mining, Crawlee helps enterprises extract valuable information from massive amounts of web data and provides strong support for decision-making; in automated testing, it can simulate a user's operations in the browser to automate testing of web applications and improve testing efficiency and accuracy; in web monitoring, it can watch pages for changes in real time and notify users promptly, helping keep a website running normally.

Why choose Crawlee

There are already many excellent frameworks in the crawler field, such as Scrapy, BeautifulSoup and Selenium. So why can Crawlee stand out and attract the attention of so many developers? Let's find out through comparison.

Scrapy is a powerful Python crawler framework. It performs well on large-scale crawling tasks, with efficient asynchronous processing and rich middleware extensions. However, when faced with dynamic, JavaScript-driven interactive web pages, Scrapy falls short: it focuses mainly on traditional HTTP requests and static page crawling, so it is difficult for it to directly obtain complete data from content rendered by JavaScript.

BeautifulSoup is a simple, easy-to-use HTML/XML parsing library, usually paired with Requests. Its advantage is ease of use: it can parse HTML or XML documents with very little code. However, its functionality is relatively limited and focused on parsing page structure; it provides no comprehensive support for complex crawler tasks such as handling dynamic content, automatic scaling, or proxy rotation.

Selenium, as a browser automation tool, supports multiple programming languages, can simulate user operations in the browser, and can handle dynamically loaded content and complex user interactions. But Selenium has performance limitations: it runs relatively slowly and consumes a lot of resources on large-scale tasks. In addition, it lacks persistent queues and automatic scaling, which limits its use in some scenarios.

In contrast, Crawlee's advantages are obvious. First, it is particularly strong at handling dynamic content. Built on Playwright and Puppeteer, it can simulate real user actions in the browser, such as clicking, scrolling and typing, and can handle complex interaction scenarios such as CAPTCHA recognition and slider verification. This makes it easy to obtain complete, accurate data from pages that rely on JavaScript for dynamic loading.

Second, Crawlee scales automatically. It manages concurrency based on the available system resources, improving crawler throughput. When processing large-scale crawling tasks, it dynamically adjusts resource allocation so that tasks run efficiently and stably. This auto-scaling capability lets Crawlee make full use of its performance advantages on large crawls and saves a great deal of development time and resource cost.
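
If you want to set explicit concurrency limits rather than rely purely on the defaults, the following is a minimal sketch, assuming the ConcurrencySettings helper exported by recent crawlee releases (parameter names may differ slightly between versions); the numbers are illustrative only:

from crawlee import ConcurrencySettings

# Illustrative auto-scaling limits (assumed API; adjust to your installed version)
concurrency = ConcurrencySettings(
    min_concurrency=2,         # never drop below 2 parallel tasks
    max_concurrency=50,        # hard upper bound; Crawlee scales within it
    max_tasks_per_minute=200   # optional overall rate limit for the run
)
# The settings object is then passed to a crawler constructor,
# for example PlaywrightCrawler(concurrency_settings=concurrency).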

Furthermore, Crawlee has powerful proxy rotation built in. Using proxy servers is one of the important ways to bypass geographic restrictions and avoid IP bans. Crawlee's proxy rotation intelligently manages proxies, discarding those that frequently time out, return network errors, or return bad HTTP codes (such as 401 or 403), keeping the proxy pool healthy and improving the stability and reliability of crawling. This lets Crawlee respond flexibly to various anti-crawler mechanisms and keeps crawl jobs running smoothly.
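
As a concrete illustration, the sketch below configures rotation over a list of your own proxy servers. It assumes the ProxyConfiguration class shipped with Crawlee for Python (the exact import path and parameters may vary by version), and the proxy URLs are placeholders reused from the example later in this article:

from crawlee.proxy_configuration import ProxyConfiguration

# Placeholder proxy URLs; Crawlee rotates them and drops ones that keep failing
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://proxy1.testerroad.com:8080',
        'http://proxy2.testerroad.com:8080',
        'http://proxy3.testerroad.com:8080',
    ]
)
# Passed to a crawler, e.g. PlaywrightCrawler(proxy_configuration=proxy_configuration)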

In addition, Crawlee provides flexible queue management and supports multiple queue types, such as priority queues and scheduled queues, so developers can adjust the crawling strategy and arrange task execution order as needed. It also supports storing crawled data in a variety of databases and storage systems, such as MySQL, MongoDB and Elasticsearch, which makes subsequent processing and analysis easier (see the sketch below). A rich plug-in ecosystem is another strength: the active community provides a large number of practical plug-ins, such as proxy plug-ins and data-parsing plug-ins, which further extend Crawlee and help developers complete crawler projects efficiently.
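
As a simple illustration of feeding crawled data into an external store, the sketch below loads the products.json file exported by the example later in this article into MongoDB using pymongo. This is a generic post-processing step rather than a Crawlee API; the connection string, database name and collection name are assumptions:

import json
from pymongo import MongoClient  # third-party MongoDB driver, not part of Crawlee

def load_into_mongo(path: str = 'products.json') -> None:
    # Read the JSON file exported by the crawler run
    with open(path, encoding='utf-8') as f:
        items = json.load(f)
    # Assumed local MongoDB instance, database and collection names
    collection = MongoClient('mongodb://localhost:27017')['crawlee_demo']['products']
    if items:
        collection.insert_many(items)  # one document per crawled record

if __name__ == '__main__':
    load_into_mongo()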

3. Crawlee in action: from installation to data crawling

Install Crawlee

Before we start using Crawlee, we need to install it into our development environment. Crawlee can be installed with pip, Python's package management tool, which makes it easy to install, upgrade and manage Python packages. When installing Crawlee we also need to install Playwright, because Crawlee builds on Playwright for browser automation. Execute the following commands on the command line:

# Crawlee is published on PyPI as the "crawlee" package. The base package contains the core functionality,
# while other features are provided as optional extras to keep dependencies and package size small.
# Install Crawlee with all features, or select only the extras you need:
python -m pip install 'crawlee[all]'
# Verify that the installation succeeded
python -c 'import crawlee; print(crawlee.__version__)'
# Install the Playwright browsers
playwright install

During installation, pip automatically downloads Crawlee and its dependencies and installs them into your Python environment. Make sure your Python environment is correctly configured and your network connection is working to avoid problems during installation. If you run into permission issues, try running the command line with administrator privileges, or use a virtual environment to isolate project dependencies so the system environment is not affected (see the sketch below). Once installation is complete, we can import Crawlee into a Python project and begin our data crawling journey.
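
If you prefer the virtual-environment route, a minimal sketch looks like this (macOS/Linux shell; on Windows, activate with .venv\Scripts\activate instead of the "source" line):

# Create and activate an isolated environment, then install Crawlee into it
python -m venv .venv
source .venv/bin/activate
python -m pip install 'crawlee[all]'
playwright install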

Basic crawler structure

When using Crawlee to crawl data, we first need to understand its basic crawler structure. Let's take crawling a website's article list as an example to show how to build a simple crawler with Crawlee.

First, we need to import the Crawlee library and related modules. In Python we can do so with the following code:

from crawlee import Request, Spider, run_spider

Here, Request is used to define an HTTP request, Spider is the base class for crawlers, which we inherit from to create our own crawler, and run_spider is used to start the crawler.

Next, we create a crawler class that inherits from Spider. In this class we define a start_requests method, which generates the initial requests. For example:

class ArticleSpider(Spider):
    async def start_requests(self):
        yield Request(url='https://testerroad.com/articles', callback=self.parse_articles)

In this example, we generate a request and specify a callback function, parse_articles, which processes the response returned for the request.

Then we define the parse_articles method, which parses the response and extracts the data we need. For example:

async def parse_articles(self, response):
    articles = response.css('div.article')
    for article in articles:
        title = article.css('h2.title::text').get()
        link = article.css('a::attr(href)').get()
        yield {
            'title': title,
            'link': link
        }

In this method we use response.css to select elements on the page; 'div.article' selects all div elements whose class is article. We then use CSS selectors to extract each article's title and link and return them as a dictionary. We use the yield keyword because we may extract data for many articles: yield returns the items one by one instead of all at once, which saves memory.

Data crawling practice

Below, a complete code example shows how to use Crawlee to crawl web content and save the captured data. The following Python example uses crawlee together with Playwright to crawl a dynamic web page (taking the product list of an e-commerce site as an example) and store the data.

from crawlee import (
    PlaywrightCrawler,
    Request,
    RequestQueue,
    Dataset,
    ProxyConfiguration
)
from playwright.async_api import Page
import asyncio

'''
 @Author : TesterRoad
 @Time : 2025/3
 @Desc: Official account: Test engineer growth path
 @Software: PyCharm
'''


# Target website example (please replace with an actual crawlable website)
TARGET_URL = 'https://www.testerroad.com.cn/products'

async def handle_page(page: Page, request: Request):
    try:
        # Wait for the product list to load
        await page.wait_for_selector('.product-item', timeout=5000)

        # Get all product items
        product_items = await page.query_selector_all('.product-item')

        results = []
        for item in product_items:
            # Extract product data: query each sub-element, then read its text
            title_el = await item.query_selector('.product-title')
            price_el = await item.query_selector('.price')
            desc_el = await item.query_selector('.description')
            title = await title_el.text_content() if title_el else None
            price = await price_el.text_content() if price_el else None
            description = await desc_el.text_content() if desc_el else None

            results.append({
                'title': title.strip() if title else None,
                'price': float(price.replace('$', '')) if price else None,
                'description': description.strip() if description else None,
                'url': page.url
            })

        # Save the data to the dataset
        await Dataset.push_data(results)

        # Handle paging (example: find the next-page button)
        next_button = await page.query_selector('.pagination .next')
        if next_button:
            next_url = await next_button.get_attribute('href')
            if next_url:
                await RequestQueue.add_request(
                    Request(url=page.url.split('?')[0] + next_url)
                )

    except Exception as e:
        print(f'Error processing {request.url}: {e}')

# Configure the crawler
async def main():
    # Initialize the request queue
    request_queue = await RequestQueue.open()
    await request_queue.add_request(Request(url=TARGET_URL))

    # Proxy configuration (optional)
    proxy_config = ProxyConfiguration({
        'proxyGroups': ['RESIDENTIAL']
    })

    # Create a crawler instance
    crawler = PlaywrightCrawler(
        request_queue=request_queue,
        handle_page_function=handle_page,
        proxy_configuration=proxy_config,
        browser_launch_options={
            'headless': True,  # headless mode
            'timeout': 15000
        },
        navigation_options={
            'wait_until': 'domcontentloaded',
            'timeout': 30000
        }
    )

    # Run the crawler
    await crawler.run()

    # Export the data to a JSON file
    await Dataset.export_to_json('products.json')

if __name__ == '__main__':
    asyncio.run(main())

Code Analysis

1. Dependency imports:
   • PlaywrightCrawler: a Playwright-based crawler
   • Request / RequestQueue: manage the request queue
   • Dataset: the data storage module
   • ProxyConfiguration: proxy configuration
2. Page-handling function (handle_page):
   • Waits for the product list to load
   • Uses CSS selectors to extract the data
   • Automatically handles the paging logic
   • Includes an error-handling mechanism
3. Core configuration:
   • Proxy settings (a valid proxy account is required)
   • Browser launch parameters
   • Navigation timeout settings
4. Data storage:
   • Uses the built-in Dataset module
   • Finally exports the data as a JSON file

4. Crawlee Advanced Applications and Techniques

Handling complex web page structures

In real web crawling tasks we often encounter complex page structures, such as nested pages and pagination, which make data crawling harder. Crawlee provides powerful features and flexible methods that help us meet these challenges with ease.

For nested pages, Crawlee allows us to generate new requests inside callback functions. For example, when crawling an e-commerce website we may need to go from the product list page to each product detail page to get more detailed product information. In the callback that parses the product list page, we can generate a new request for each product link and specify a new callback to process the detail page. The code example is as follows:

async def parse_product_list(self, response):
    products = response.css('div.product-item')
    for product in products:
        product_link = product.css('a::attr(href)').get()
        yield Request(url=product_link, callback=self.parse_product_detail)

async def parse_product_detail(self, response):
    title = response.css('h1.product-title::text').get()
    price = response.css('span.product-price::text').get()
    description = response.css('div.product-description::text').get()
    yield {
        'title': title,
        'price': price,
        'description': description
    }

In the code above, the parse_product_list method parses the product list page, extracts each product's link, and generates a new request for each link, which is handled by parse_product_detail. The parse_product_detail method parses the product detail page and extracts information such as the product title, price, and description.

Paginated pages are just as easy to handle with Crawlee. By analyzing a site's pagination rules, we can generate requests for different page numbers in the callback function. For example, for a paginated news list with 10 news items per page, we can generate requests for each page number in a loop to crawl all the news. The code example is as follows:

async def start_requests(self):
    base_url = 'https://testerroad.com/news?page={}'
    for page in range(1, 11):  # Assume there are 10 pages in total
        url = base_url.format(page)
        yield Request(url=url, callback=self.parse_news_list)

async def parse_news_list(self, response):
    news = response.css('div.news-item')
    for item in news:
        title = item.css('h2.news-title::text').get()
        link = item.css('a::attr(href)').get()
        yield {
            'title': title,
            'link': link
        }

In this example, the start_requests method loops over page numbers 1 to 10 and formats the URL for each page. The parse_news_list method parses the news list on each page and extracts the title and link of each item.

Dealing with anti-crawler mechanisms

In the world of web crawlers, anti-crawler mechanisms are a challenge we face constantly. Fortunately, Crawlee has a series of powerful anti-blocking features built in that help us deal with anti-crawler measures effectively and keep crawl jobs running smoothly.

Crawlee provides proxy rotation. By using proxy servers we can hide our real IP address and reduce the risk of being blocked by the target site. Crawlee intelligently manages proxies, automatically discarding those that frequently time out, return network errors, or return bad HTTP codes (such as 401 or 403), keeping the proxy pool healthy. When configuring proxies, we can implement rotation by supplying a proxy list. The code example is as follows:

from crawlee import Request, Spider, run_spider

'''
 @Author : TesterRoad
 @Time : 2025/3
 @Desc: Official account: Test engineer growth path
 @Software: PyCharm
'''


class MySpider(Spider):
    async def start_requests(self):
        proxy_list = [
            'http://proxy1.testerroad.com:8080',
            'http://proxy2.testerroad.com:8080',
            'http://proxy3.testerroad.com:8080'
        ]
        for proxy in proxy_list:
            yield Request(url='https://testerroad.com', proxy=proxy, callback=self.parse_page)

    async def parse_page(self, response):
        # Process the response data
        pass


if __name__ == "__main__":
    run_spider(MySpider)

In the code above, we define a proxy list, proxy_list, and then in the start_requests method we assign a different proxy to each request to achieve proxy rotation.

In addition, Crawlee has human-like fingerprint generation, which simulates the browser fingerprints of real users, including browser type, version, operating system and other details. This makes the crawler's behavior closer to that of a real user and reduces the probability of detection by anti-crawler mechanisms. These features are enabled by default and require no extra configuration, which is very convenient.

Distributed crawler deployment

As data volumes keep growing and crawler tasks become increasingly complex, the performance of a single-machine crawler may hit its limits. To improve crawling efficiency, Crawlee supports distributed crawler deployment, allowing us to spread crawler tasks across multiple nodes for parallel execution. Distributed deployment makes full use of the computing resources of multiple machines, greatly speeds up data crawling, and also improves the stability and reliability of the crawler system. Although configuring a distributed crawler is relatively complex, Crawlee provides documentation and tooling to help developers build a distributed crawling environment quickly. For readers with large-scale data crawling needs, distributed deployment is a direction worth studying and practicing in depth, as it brings higher efficiency and greater flexibility to data collection work.

5. Conclusion

As a powerful, flexible and easy-to-use next-generation web crawling and browser automation tool, Crawlee has shown great advantages in the field of data crawling. It not only has strong crawling capabilities that cope with all kinds of complex page structures and anti-crawler mechanisms, but also provides a simple, efficient programming interface that lets developers easily implement data crawling and browser automation. With Crawlee, we can efficiently obtain web data and provide strong data support for tasks such as data analysis and machine learning.

For developers, Crawlee is undoubtedly a tool worth trying. Whether you are new to crawlers or an experienced crawler developer, Crawlee can bring you a brand-new experience and real convenience. Its rich features and active community will help you keep moving forward in crawler development and turn more ideas into reality.