An in-depth look at Firecrawl basics

Master Firecrawl and obtain website data efficiently.
Core content:
1. An introduction to the Firecrawl cloud crawling service and its open source version
2. Single-URL scraping and handling of complex content
3. Batch scraping of multiple URLs, with call examples
1. What is Firecrawl
Firecrawl is a cloud crawling service that can scrape an entire website (even one without a sitemap), a single web page, or a site's full map of URLs, and it uses a built-in LLM to extract structured content from crawled pages using natural language. Firecrawl also has a corresponding open source version, so you can download the code and run it yourself.
2. Firecrawl's main features (illustrated with the cloud service)
2.1 Scrape: crawl a single URL
Convert any URL into clean data
Firecrawl converts web pages to markdown, which is perfect for LLM applications.
- Manages the complexity for you: proxies, caching, rate limits, and JS-blocked content
- Handles dynamic content: dynamic websites, JS-rendered pages, PDFs, and images
- Outputs clean markdown, structured data, screenshots, or HTML
The endpoint to call is as follows:
curl --request POST \
--url https://host:port/v1/scrape \
--header 'Content-Type: application/json' \
--data '{
"url": "<string>",
"formats": [
"markdown"
],
"onlyMainContent": true,
"includeTags": [
"<string>"
],
"excludeTags": [
"<string>"
],
"headers": {},
"waitFor": 0,
"mobile": false,
"skipTlsVerification": false,
"timeout": 30000,
"jsonOptions": {
"schema": {},
"systemPrompt": "<string>",
"prompt": "<string>"
},
"actions": [
{
"type": "wait",
"milliseconds": 2,
"selector": "#my-element"
}
],
"location": {
"country": "US",
"languages": [
"en-US"
]
},
"removeBase64Images": true,
"blockAds": true,
"proxy": "basic"
}'
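For readers calling the API from code, here is a minimal Python sketch of the same scrape request using the requests library. The base URL, the Authorization header (needed for the cloud service, usually not for a default self-hosted instance), and the response layout (markdown under data.markdown) are assumptions based on the cloud API and may differ for your deployment.

import requests

BASE_URL = "https://host:port"  # replace with your Firecrawl host
HEADERS = {
    "Content-Type": "application/json",
    # "Authorization": "Bearer fc-YOUR_API_KEY",  # required by the cloud service
}

resp = requests.post(
    f"{BASE_URL}/v1/scrape",
    headers=HEADERS,
    json={
        "url": "https://docs.firecrawl.dev/",
        "formats": ["markdown"],
        "onlyMainContent": True,
    },
    timeout=60,
)
resp.raise_for_status()
body = resp.json()

# Assumed response shape: {"success": true, "data": {"markdown": "...", "metadata": {...}}}
if body.get("success"):
    print(body["data"]["markdown"][:500])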
2.2 Batch Scrape: crawl multiple URLs
You can batch scrape multiple URLs at the same time. The endpoint takes the list of URLs and optional parameters as arguments; the parameters let you specify additional options for the batch scrape job, such as the output formats.
It works very similarly to the /crawl endpoint: it submits a batch scrape job and returns a job ID that can be used to check the status of the batch.
The SDKs provide both a synchronous and an asynchronous method. The synchronous method waits for and returns the results of the batch scrape job, while the asynchronous method returns a job ID that you can use to check the status yourself (the Python sketch after the curl example below shows the equivalent submit-and-poll flow against the HTTP API).
curl --request POST \
--url https://host:port/v1/batch/scrape \
--header 'Content-Type: application/json' \
--data '{
"urls": [
"<string>"
],
"webhook": {
"url": "<string>",
"headers": {},
"metadata": {},
"events": [
"completed"
]
},
"ignoreInvalidURLs": false,
"formats": [
"markdown"
],
"onlyMainContent": true,
"includeTags": [
"<string>"
],
"excludeTags": [
"<string>"
],
"headers": {},
"waitFor": 0,
"mobile": false,
"skipTlsVerification": false,
"timeout": 30000,
"jsonOptions": {
"schema": {},
"systemPrompt": "<string>",
"prompt": "<string>"
},
"actions": [
{
"type": "wait",
"milliseconds": 2,
"selector": "#my-element"
}
],
"location": {
"country": "US",
"languages": [
"en-US"
]
},
"removeBase64Images": true,
"blockAds": true,
"proxy": "basic"
}'
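To make the synchronous/asynchronous distinction concrete, here is a minimal Python sketch using requests that submits a batch scrape job and then polls it. The status endpoint (GET /v1/batch/scrape/{id}) and the response fields (id, status, data) are assumptions based on the cloud API reference; the official SDKs wrap this submit-and-poll loop for you.

import time
import requests

BASE_URL = "https://host:port"  # replace with your Firecrawl host
HEADERS = {"Content-Type": "application/json"}  # add an Authorization header for the cloud service

# Submit the batch scrape job; the response is assumed to contain a job id.
submit = requests.post(
    f"{BASE_URL}/v1/batch/scrape",
    headers=HEADERS,
    json={
        "urls": ["https://firecrawl.dev", "https://docs.firecrawl.dev"],
        "formats": ["markdown"],
    },
    timeout=60,
)
submit.raise_for_status()
job_id = submit.json()["id"]

# Poll until the job completes (assumed status endpoint and field names).
while True:
    status = requests.get(f"{BASE_URL}/v1/batch/scrape/{job_id}",
                          headers=HEADERS, timeout=60).json()
    if status.get("status") == "completed":
        for page in status.get("data", []):
            print(page.get("metadata", {}).get("sourceURL"))
        break
    time.sleep(5)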
2.3 Scrape with LLM extraction
Use Firecrawl to scrape a page and extract structured data from it: Firecrawl uses large language models (LLMs) to efficiently extract structured data from web pages, as shown below.
This approach simplifies data extraction, reduces manual processing, and improves efficiency.
2.3.1 Schema-Based Extraction
curl -X POST https://host:port/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://docs.firecrawl.dev/",
"formats": ["json"],
"jsonOptions": {
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
}
}'
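The same schema can be built as a plain Python dict and sent with requests. A minimal sketch, assuming the structured result is returned under data.json when the json format is requested (the exact field name may vary between API versions):

import requests

BASE_URL = "https://host:port"  # replace with your Firecrawl host

schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
        "is_in_yc": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
}

resp = requests.post(
    f"{BASE_URL}/v1/scrape",
    json={
        "url": "https://docs.firecrawl.dev/",
        "formats": ["json"],
        "jsonOptions": {"schema": schema},
    },
    timeout=120,
)
resp.raise_for_status()

# Assumed response shape: the structured object extracted by the LLM sits under data.json.
print(resp.json()["data"]["json"])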
2.3.2 Schemaless Extraction
Without defining a schema, you can simply pass a prompt to the endpoint and let the model decide the structure of the extracted data.
curl -X POST https://host:port/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://docs.firecrawl.dev/",
"formats": ["json"],
"jsonOptions": {
"prompt": "Extract the company mission from the page."
}
}'
2.4 Crawl: scrape an entire site
Firecrawl can recursively traverse a site's subpages and collect their content.
Firecrawl crawls the site thoroughly, ensuring comprehensive data extraction while bypassing web-blocking mechanisms. It works as follows:
This method guarantees exhaustive crawling and data collection from any starting URL.
curl -X POST https://host:port/v1/crawl \
-H 'Content-Type: application/json' \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 100,
"scrapeOptions": {
"formats": ["markdown", "html"]
}
}'
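Crawling is asynchronous: the request returns a job ID, and results are fetched from a status endpoint. The following Python sketch assumes a GET /v1/crawl/{id} status endpoint that reports status, completed/total counters, and a data array, as in the cloud API reference.

import time
import requests

BASE_URL = "https://host:port"  # replace with your Firecrawl host

# Submit a crawl of up to 100 pages.
submit = requests.post(
    f"{BASE_URL}/v1/crawl",
    json={
        "url": "https://docs.firecrawl.dev",
        "limit": 100,
        "scrapeOptions": {"formats": ["markdown", "html"]},
    },
    timeout=60,
)
submit.raise_for_status()
job_id = submit.json()["id"]

# Poll until the crawl finishes (assumed status endpoint and field names).
while True:
    status = requests.get(f"{BASE_URL}/v1/crawl/{job_id}", timeout=60).json()
    print(f"{status.get('completed', 0)}/{status.get('total', '?')} pages crawled")
    if status.get("status") == "completed":
        break
    time.sleep(10)

# Each entry in data is one scraped page in the requested formats.
for page in status.get("data", []):
    print(page.get("metadata", {}).get("sourceURL"))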
2.5 Map: get a site's entire sitemap
Enter a website and get all the URLs on that website - blazingly fast
This is the easiest way to convert a single URL into an entire sitemap. This feature is useful in the following situations:
- When you need to prompt the end user to select which links to crawl
- When you need to quickly understand the links on a website
- When you need to crawl only the pages related to a specific topic (using the search parameter)
- When you only want to crawl specific pages of a website
curl -X POST https://host:port/v1/map \
-H 'Content-Type: application/json' \
-d '{
"url": "https://firecrawl.dev"
}'
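A minimal Python sketch of the same map call. The search parameter (used to narrow results to topic-related URLs, as mentioned above) and the assumption that the discovered URLs come back in a links array are based on the cloud API and may differ for your deployment.

import requests

BASE_URL = "https://host:port"  # replace with your Firecrawl host

resp = requests.post(
    f"{BASE_URL}/v1/map",
    json={
        "url": "https://firecrawl.dev",
        "search": "docs",  # optional: only return URLs related to this term
    },
    timeout=60,
)
resp.raise_for_status()

# Assumed response shape: {"success": true, "links": ["https://...", ...]}
for link in resp.json().get("links", []):
    print(link)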
2.6 Extract multiple pages automatically
The /extract endpoint simplifies the process of collecting structured data from any number of URLs or entire domains. Simply provide a list of URLs (optionally using wildcards, such as example.com/*), and a prompt or schema that describes the information you want. Firecrawl takes care of the details of crawling, parsing, and organizing large or small datasets.
Using /extract you can extract structured data from one or more URLs, including wildcards:
- Single page, e.g. https://firecrawl.dev/some-page
- Multiple pages / full domain, e.g. https://firecrawl.dev/*
When you use /*, Firecrawl automatically crawls and parses all URLs it finds under that domain and then extracts the requested data.
curl -X POST https://host:port/v1/extract \
-H 'Content-Type: application/json' \
-d '{
"urls": [
"https://firecrawl.dev/",
"https://docs.firecrawl.dev/",
"https://www.ycombinator.com/companies"
],
"prompt": "Extract the company mission, whether it supports SSO, whether it is open source, and whether it is in Y Combinator from the page.",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
}'
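A Python sketch for the /extract call follows. Depending on the API version, /extract may return the extracted data directly or return a job ID to be polled at GET /v1/extract/{id}; the sketch handles both cases, and the field names involved are assumptions.

import time
import requests

BASE_URL = "https://host:port"  # replace with your Firecrawl host

payload = {
    "urls": ["https://firecrawl.dev/", "https://docs.firecrawl.dev/"],
    "prompt": "Extract the company mission, whether it supports SSO, "
              "whether it is open source, and whether it is in Y Combinator.",
}

resp = requests.post(f"{BASE_URL}/v1/extract", json=payload, timeout=120)
resp.raise_for_status()
body = resp.json()

if body.get("data"):
    # Older behaviour: the extracted object is returned directly.
    print(body["data"])
else:
    # Newer behaviour: poll the job until extraction completes.
    job_id = body["id"]
    while True:
        status = requests.get(f"{BASE_URL}/v1/extract/{job_id}", timeout=60).json()
        if status.get("status") == "completed":
            print(status.get("data"))
            break
        time.sleep(5)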
3. Access methods supported by the cloud version
API: see the API documentation
SDKs: Python, Node, Go, Rust
LLM frameworks: Langchain (Python), Langchain (JS), Llama Index, Crew.ai, Composio, PraisonAI, Superinterface, Vectorize
Low-code frameworks: Dify, Langflow, Flowise AI, Cargo, Pipedream
Others: Zapier, Pabbly Connect
4. Cloud version and open source version
Firecrawl is open source and available under the AGPL-3.0 license. Firecrawl Cloud is available at firecrawl.dev, and cloud hosting offers a range of features not available in the open source version.
The differences between the open source version and the cloud version are summarized below.
Features in both versions:
- Scraping of a single web page
- Crawling of an entire site
- LLM-based extraction
- Sitemap mapping
- Output in LLM-friendly formats such as markdown and JSON
- SDKs for Python, Node, Go, and Rust
Additional features of the cloud version:
- Bypassing bot protection
- Proxy rotation
- A crawling console
- Content interaction and click actions
- Headless browser support
- Other Enterprise Edition features