
Use Firecrawl Extract 2.0 for agent-driven web scraping

reference

Scraping complex websites requiring login, navigation, and multi-step interactions fails with traditional crawlers

agents · firecrawl · web-scraping · extraction

Problem

Traditional web scrapers break on modern websites that require authentication, JavaScript rendering, pagination, and multi-step navigation. Sites with dynamic content, infinite scroll, or data gated behind login walls are especially problematic. Tools like BeautifulSoup (an HTML parser) and Scrapy (a crawling framework) cannot handle these interactions without extensive custom code for each target site.

Common failures with traditional scraping:

  • Login walls and session management
  • JavaScript-rendered content that does not exist in raw HTML
  • Paginated results requiring click-through navigation
  • Anti-bot protections and rate limiting
  • Dynamic content loaded via API calls after page render

Solution

Firecrawl Extract 2.0 uses AI agents that can perform multi-step browser interactions to scrape data from complex websites.

Basic extraction with a schema:

import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

const result = await firecrawl.extract({
  urls: ["https://example.com/products/*"],
  prompt: "Extract all product names, prices, and availability status",
  schema: {
    type: "object",
    properties: {
      products: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            price: { type: "number" },
            inStock: { type: "boolean" },
          },
        },
      },
    },
  },
});
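Assuming the call resolves with data shaped by the schema above (the exact response envelope may vary by SDK version, so treat the payload as unknown until checked), a small type guard can validate the result before your code relies on it. The `Product` and `ExtractData` names below are illustrative, not part of the SDK:

```typescript
// Shapes implied by the JSON Schema in the extract call above.
interface Product {
  name: string;
  price: number;
  inStock: boolean;
}

interface ExtractData {
  products: Product[];
}

// Narrow an unknown payload to ExtractData before using it.
function isExtractData(value: unknown): value is ExtractData {
  if (typeof value !== "object" || value === null) return false;
  const products = (value as { products?: unknown }).products;
  return (
    Array.isArray(products) &&
    products.every(
      (p) =>
        typeof p === "object" &&
        p !== null &&
        typeof (p as Product).name === "string" &&
        typeof (p as Product).price === "number" &&
        typeof (p as Product).inStock === "boolean",
    )
  );
}

// Sample payload in the shape the schema requests (invented values).
const sample: unknown = {
  products: [{ name: "Widget", price: 9.99, inStock: true }],
};
console.log(isExtractData(sample)); // → true
```

Running the guard before use keeps a schema drift or partial extraction from surfacing as a runtime crash deeper in your pipeline.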

Agent-driven extraction with authentication and navigation:

const result = await firecrawl.extract({
  urls: ["https://dashboard.example.com/analytics"],
  prompt: "Log in, navigate to the analytics tab, and extract monthly revenue figures",
  actions: [
    { type: "click", selector: "#login-button" },
    { type: "fill", selector: "#email", value: "user@example.com" },
    { type: "fill", selector: "#password", value: process.env.SITE_PASSWORD },
    { type: "click", selector: "#submit" },
    { type: "wait", milliseconds: 2000 },
    { type: "click", selector: "[data-tab='analytics']" },
  ],
  schema: {
    type: "object",
    properties: {
      monthlyRevenue: { type: "array", items: { type: "number" } },
      period: { type: "string" },
    },
  },
});
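Agent-driven extractions are long-running and can fail transiently on timeouts or anti-bot challenges, so wrapping the call in a retry helper is prudent. This generic helper is not part of the Firecrawl SDK, just a sketch of one way to do it:

```typescript
// Generic retry-with-exponential-backoff helper (not part of the
// Firecrawl SDK). Retries the given async function up to `attempts`
// times, doubling the delay between tries: 1s, 2s, 4s, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

Usage would look like `const result = await withRetry(() => firecrawl.extract({ ... }))`, so a single flaky login step does not fail the whole job.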

Why It Works

Firecrawl Extract 2.0 combines a headless browser with an AI agent that interprets page structure and executes multi-step workflows. The agent handles JavaScript rendering, waits for dynamic content, and can navigate authenticated sessions. Because you define a schema, extracted data comes back in a structured format rather than as raw HTML, eliminating the need for custom parsing logic. The agent approach means you describe what you want instead of writing brittle CSS selectors that break when the site changes.
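To make the selector-versus-schema contrast concrete, here is a toy sketch (the HTML snippet and values are invented, and no Firecrawl code is involved): the traditional path parses markup by hand, while the schema path receives the same fields as structured JSON.

```typescript
// Toy illustration: raw-HTML scraping needs site-specific parsing,
// while schema-based extraction returns JSON directly.
const rawHtml =
  '<div class="product"><span class="name">Widget</span><span class="price">$9.99</span></div>';

// Traditional approach: brittle, per-site regex/selector parsing.
const name = rawHtml.match(/class="name">([^<]+)</)?.[1] ?? "";
const price = Number(rawHtml.match(/\$([\d.]+)/)?.[1] ?? NaN);

// Schema approach: the same data arrives already structured.
const extracted = { products: [{ name: "Widget", price: 9.99, inStock: true }] };

console.log(name === extracted.products[0].name); // → true
console.log(price === extracted.products[0].price); // → true
```

The regexes above break the moment the site renames a CSS class; the schema-based call does not reference markup at all, which is why it survives site redesigns.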

Context

  • Firecrawl offers both cloud-hosted and self-hosted options
  • The Extract 2.0 agent feature launched in April 2025 and supports login flows, tab navigation, and pagination
  • For bulk scraping (tens of thousands of pages), contact Firecrawl directly for increased rate limits
  • Alternative tools include Apify for LinkedIn-specific scraping and Scrapin.io for structured LinkedIn data
  • Firecrawl also provides an MCP server for integration with Claude and other AI assistants
About this share

  • Contributor: mblode
  • Repository: mblode/shares
  • Created: Feb 10, 2026