Firecrawl in depth: the basics

Written by
Jasper Cole
Updated on: July 2, 2025

Master Firecrawl and obtain website data efficiently.

Core content:
1. An introduction to the Firecrawl cloud crawling service and its open-source version
2. Single-URL scraping and the handling of complex content
3. Batch scraping of multiple URLs, with SDK call examples


1. What is Firecrawl

Firecrawl is a cloud crawling service that can crawl an entire website, a single web page, or a sitemap, even when the site does not provide one; it uses built-in LLMs to extract content from crawled pages using natural language. Firecrawl also has a corresponding open-source version, which you can download and run yourself.

2. Firecrawl's main functions (illustrated with the cloud service)

2.1 Scrape: crawl a single URL

Convert any URL into clean data

Firecrawl converts web pages to markdown, which is perfect for LLM applications.

It:

  • manages complexity: proxies, caching, rate limits, JS-blocked content
  • handles dynamic content: dynamic websites, JS-rendered pages, PDFs, images
  • outputs clean markdown, structured data, screenshots, or HTML

The endpoint is called as follows:

curl --request POST \
  --url https://host:port/v1/scrape \
  --header  'Content-Type: application/json'  \
  --data  '{
  "url": "<string>",
  "formats": [
    "markdown"
  ],
  "onlyMainContent": true,
  "includeTags": [
    "<string>"
  ],
  "excludeTags": [
    "<string>"
  ],
  "headers": {},
  "waitFor": 0,
  "mobile": false,
  "skipTlsVerification": false,
  "timeout": 30000,
  "jsonOptions": {
    "schema": {},
    "systemPrompt": "<string>",
    "prompt": "<string>"
  },
  "actions": [
    {
      "type": "wait",
      "milliseconds": 2,
      "selector": "#my-element"
    }
  ],
  "location": {
    "country": "US",
    "languages": [
      "en-US"
    ]
  },
  "removeBase64Images": true,
  "blockAds": true,
  "proxy": "basic"
}'
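
For comparison, the same call through the Python SDK might look like the following. This is a minimal sketch assuming the firecrawl-py package; its method names and return shapes vary between releases, so treat it as illustrative rather than definitive.

# pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a single URL and request markdown output only.
result = app.scrape_url("https://firecrawl.dev", params={"formats": ["markdown"]})

# Depending on the SDK version, the result is a dict or a typed object.
print(result["markdown"][:500])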

2.2 Batch Scrape: crawl multiple URLs

Batch scrape crawls multiple URLs at the same time. It takes a list of URLs and optional parameters as arguments; the params parameter lets you specify additional options for the batch scrape job, such as the output format.

It works very similarly to the /crawl endpoint: it submits a batch scrape job and returns a job ID that can be used to check the status of the batch scrape.

The SDK provides two methods: synchronous and asynchronous. The synchronous method returns the results of the batch crawl job, while the asynchronous method returns a job ID that you can use to check the status of the batch crawl.

curl --request POST \
  --url https://host:port/v1/batch/scrape \
  --header  'Content-Type: application/json'  \
  --data  '{
  "urls": [
    "<string>"
  ],
  "webhook": {
    "url": "<string>",
    "headers": {},
    "metadata": {},
    "events": [
      "completed"
    ]
  },
  "ignoreInvalidURLs": false,
  "formats": [
    "markdown"
  ],
  "onlyMainContent": true,
  "includeTags": [
    "<string>"
  ],
  "excludeTags": [
    "<string>"
  ],
  "headers": {},
  "waitFor": 0,
  "mobile": false,
  "skipTlsVerification": false,
  "timeout": 30000,
  "jsonOptions": {
    "schema": {},
    "systemPrompt": "<string>",
    "prompt": "<string>"
  },
  "actions": [
    {
      "type": "wait",
      "milliseconds": 2,
      "selector": "#my-element"
    }
  ],
  "location": {
    "country": "US",
    "languages": [
      "en-US"
    ]
  },
  "removeBase64Images": true,
  "blockAds": true,
  "proxy": "basic"
}'
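
The synchronous and asynchronous SDK methods described above might look like the following sketch, again assuming the firecrawl-py package; the method names (batch_scrape_urls, async_batch_scrape_urls, check_batch_scrape_status) and response shapes differ between SDK releases.

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
urls = ["https://firecrawl.dev", "https://docs.firecrawl.dev"]

# Synchronous: blocks until the whole batch finishes, then returns the results.
results = app.batch_scrape_urls(urls, params={"formats": ["markdown"]})

# Asynchronous: returns immediately with a job ID; poll for status yourself.
job = app.async_batch_scrape_urls(urls, params={"formats": ["markdown"]})
status = app.check_batch_scrape_status(job["id"])
print(status["status"])  # e.g. "scraping" or "completed"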

2.3 Scrape with Extract Function (LLM)

Firecrawl can scrape a page and extract structured data from it in a single call: it uses large language models (LLMs) to efficiently extract structured data from web pages. This approach simplifies data extraction, reduces manual processing, and improves efficiency. There are two ways to use it:

2.3.1 Schema-based Extraction

curl -X POST https://host:port/v1/scrape \
    -H  'Content-Type: application/json'  \
    -d  '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["json"],
      "jsonOptions": {
        "schema": {
          "type": "object",
          "properties": {
            "company_mission": {
                      "type": "string"
            },
            "supports_sso": {
                      "type": "boolean"
            },
            "is_open_source": {
                      "type": "boolean"
            },
            "is_in_yc": {
                      "type": "boolean"
            }
          },
          "required": [
            "company_mission",
            "supports_sso",
            "is_open_source",
            "is_in_yc"
          ]
        }
      }
    }'

2.3.2 Schemaless Extraction

Simply pass a prompt to the endpoint and the data is extracted without defining a schema.

curl -X POST https://host:port/v1/scrape \
    -H  'Content-Type: application/json'  \
    -d  '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["json"],
      "jsonOptions": {
        "prompt": "Extract the company mission from the page."
      }
    }'
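
Reading the result from Python might look like the following sketch using the raw HTTP API. The Authorization header applies to the cloud service, and the response shape is an assumption based on recent v1 behavior, where extracted fields are returned under data.json (older releases used data.extract or data.llm_extraction).

import requests

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",  # or your self-hosted host:port
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={
        "url": "https://docs.firecrawl.dev/",
        "formats": ["json"],
        "jsonOptions": {"prompt": "Extract the company mission from the page."},
    },
    timeout=60,
)
resp.raise_for_status()

# With the "json" format, the extracted fields land under data["json"].
data = resp.json()["data"]
print(data.get("json"))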

2.4 Crawl the entire site

Firecrawl can recursively crawl a URL's subpages and collect their content.

Firecrawl crawls the site thoroughly, working around common blocking mechanisms, to ensure comprehensive data extraction; this makes exhaustive crawling and data collection possible from any starting URL.

curl -X POST https://host:port/v1/crawl \
    -H  'Content-Type: application/json'  \
    -d  '{
      "url": "https://docs.firecrawl.dev",
      "limit": 100,
      "scrapeOptions": {
        "formats": ["markdown", "html"]
      }
    }'
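
The crawl endpoint is asynchronous: the POST above returns a job ID rather than the crawled pages. A sketch of submitting a crawl and polling its status from Python (the Authorization header applies to the cloud service):

import time
import requests

BASE = "https://api.firecrawl.dev"  # or your self-hosted host:port
HEADERS = {"Authorization": "Bearer fc-YOUR_API_KEY"}

# Submit the crawl job; the response carries the job ID.
job = requests.post(
    f"{BASE}/v1/crawl",
    headers=HEADERS,
    json={"url": "https://docs.firecrawl.dev", "limit": 100,
          "scrapeOptions": {"formats": ["markdown", "html"]}},
    timeout=60,
).json()

# Poll the job until it completes, then read the collected pages.
while True:
    status = requests.get(f"{BASE}/v1/crawl/{job['id']}", headers=HEADERS, timeout=60).json()
    if status["status"] == "completed":
        break
    time.sleep(5)

print(len(status["data"]), "pages crawled")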

2.5 Map: grab the entire sitemap of a site

Enter a website and get all the URLs on that website - blazingly fast

This is the easiest way to turn a single URL into an entire sitemap. It is useful when:

  • You need to prompt the end user to select which links to crawl
  • You need a quick overview of the links on a website
  • You need to crawl only the pages related to a specific topic (using the search parameter; see the sketch below)
  • You only want to crawl specific pages of a website

curl -X POST https://host:port/v1/map \
    -H  'Content-Type: application/json'  \
    -d  '{
      "url": "https://firecrawl.dev"
    }'
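
To narrow the map to a specific topic, the endpoint accepts a search parameter. A sketch from Python, assuming the v1 map response returns matched links under a links field:

import requests

resp = requests.post(
    "https://api.firecrawl.dev/v1/map",  # or your self-hosted host:port
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={"url": "https://firecrawl.dev", "search": "docs"},
    timeout=60,
)

# Links most relevant to the search term are returned first.
print(resp.json()["links"][:10])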

2.6 Extract multiple pages automatically

The /extract endpoint simplifies the process of collecting structured data from any number of URLs or entire domains. Simply provide a list of URLs (optionally using wildcards, such as example.com/*), and a prompt or schema that describes the information you want. Firecrawl takes care of the details of crawling, parsing, and organizing large or small datasets.

Using /extract you can extract structured data from one or more URLs, including wildcards:

  • Single-page example: https://firecrawl.dev/some-page
  • Multiple-page/full-domain example: https://firecrawl.dev/*

When you use /*, Firecrawl automatically crawls and parses all URLs it finds under that domain, and then extracts the requested data.

curl -X POST https://host:port/v1/extract \
    -H  'Content-Type: application/json'  \
    -d  '{
      "urls": [
        "https://firecrawl.dev/", 
        "https://docs.firecrawl.dev/", 
        "https://www.ycombinator.com/companies"
      ],
      "prompt": "Extract the company mission, whether it supports SSO, whether it is open source, and whether it is in Y Combinator from the page.",
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": {
            "type": "string"
          },
          "supports_sso": {
            "type": "boolean"
          },
          "is_open_source": {
            "type": "boolean"
          },
          "is_in_yc": {
            "type": "boolean"
          }
        },
        "required": [
          "company_mission",
          "supports_sso",
          "is_open_source",
          "is_in_yc"
        ]
      }
    }'
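
In recent releases /extract also runs as an asynchronous job: the POST returns a job ID that you poll at GET /v1/extract/{id}. This behavior may differ on older or self-hosted versions, so the following is a sketch:

import time
import requests

BASE = "https://api.firecrawl.dev"  # or your self-hosted host:port
HEADERS = {"Authorization": "Bearer fc-YOUR_API_KEY"}

# Submit the extraction job over one or more URLs (wildcards allowed).
job = requests.post(
    f"{BASE}/v1/extract",
    headers=HEADERS,
    json={"urls": ["https://firecrawl.dev/*"],
          "prompt": "Extract the company mission from the page."},
    timeout=60,
).json()

# Poll until extraction completes; results arrive under "data".
while True:
    status = requests.get(f"{BASE}/v1/extract/{job['id']}", headers=HEADERS, timeout=60).json()
    if status.get("status") == "completed":
        break
    time.sleep(2)

print(status["data"])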

3. Access methods supported by the cloud version

  • API: see the official API documentation
  • SDK: Python, Node, Go, Rust
  • LLM framework: Langchain (python), Langchain (js), Llama Index, Crew.ai, Composio, PraisonAI, Superinterface, Vectorize
  • Low-code frameworks: dify, Langflow, Flowise AI, Cargo, Pipedream
  • Others: Zapier, Pabbly Connect

4. Cloud version and open source version

Firecrawl is open source, available under the AGPL-3.0 license. Firecrawl Cloud, available at firecrawl.dev, offers a range of features that are not in the open-source version.

The differences between the open-source version and the cloud version are as follows:

Features available in both versions:

  1. Scraping a single web page

  2. Crawling an entire site

  3. Extraction via LLM

  4. Mapping a site's sitemap

  5. LLM-friendly output formats (markdown, JSON, etc.)

  6. SDKs for Python, Node, Go, and Rust

Additional features of the cloud version:

  7. Bypassing bot protection

  8. Proxy rotation

  9. Crawl management console

  10. Content interaction and click actions

  11. Headless browser

  12. Other Enterprise Edition features
