How to scrape Indeed in 2026? One of the best protected websites in the world

11 February 2026

Indeed deploys a full arsenal against scraping: Cloudflare, TLS fingerprinting, dynamic tokens, behavioral detection. Learn how to scrape it with a lightweight script, no headless browser needed.


Disclaimer: This article is published for educational and research purposes only. Please respect the Terms of Service (ToS) and the rules of each website before performing any data extraction.

Indeed is one of the best-protected websites against scraping. Its defenses include:

Cloudflare
TLS Fingerprinting
Dynamic pagination tokens
Behavioral detection

fr.indeed.com deploys the complete arsenal. I'll show you how to scrape the site with a lightweight script, without resorting to expensive and unreliable headless browsers. Want to find out how it works? Let's go!

The full source code is available on GitHub.

The overall architecture

The scraper is a single script (scrape_indeed.py) that relies on two dependencies:

pip install curl_cffi lxml

curl_cffi -- a Python binding for curl that enables TLS impersonation of real browsers
lxml -- HTML parsing to extract structured data from pages

The flow is simple: load the results page, extract job listings as JSON, then visit each listing individually via an "embedded" URL to enrich the data.

curl_cffi Session (Chrome TLS)
    |
    v
GET listing (SERP) --> parse Mosaic/Legacy --> JSON job listings
    |
    v
For each job:
    GET viewjob?viewtype=embedded --> parse JSON/JSON-LD --> enrich listing
    |
    v
Full JSON output + optional export

curl_cffi: the main weapon

Why not `requests`?

The fundamental problem with scraping in 2026 is TLS fingerprinting. When a client opens an HTTPS connection, its TLS ClientHello carries a characteristic signature made up of:

The order of cipher suites
The supported TLS extensions
The elliptic curves
ALPN support (HTTP/2)
The order in which all of these are listed

A hash of this signature is called a JA3 fingerprint (JA4 is its newer successor). Python's requests relies on the standard ssl module, which is backed by OpenSSL, and the resulting handshake is instantly recognizable as "not a real browser".
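
To make the idea concrete, here is a minimal sketch of how a JA3 hash is derived: five ClientHello fields are written as decimal numbers, joined with dashes inside a field and commas between fields, then hashed with MD5. The numeric values below are illustrative placeholders (an assumption), not a captured Chrome or OpenSSL handshake.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, each a dash-joined list."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(*fields):
    """MD5 of the JA3 string, as a 32-character hex digest."""
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()


# Illustrative values only (assumption): same cipher suites, different order.
client_a = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = (771, [4867, 4866, 4865], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(ja3_string(*client_a))
print(ja3_hash(*client_a) == ja3_hash(*client_b))  # False: reordering changes the hash
```

Because the order is part of the hashed string, two clients that support exactly the same cipher suites still end up with different fingerprints as soon as they list them differently.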

How curl_cffi solves the problem

curl_cffi (4900+ stars on GitHub) is a binding for curl-impersonate, a curl fork that faithfully reproduces the TLS handshake of real browsers. In one line:

from curl_cffi import requests

session = requests.Session(impersonate="chrome")

In the script, that's exactly what we do:

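from curl_cffi import requests

IMPERSONATE = "chrome"  # latest Chrome profile bundled with curl_cffi

def create_session(proxy_url: str | None = None) -> requests.Session:
    # Every request sent through this session uses Chrome's TLS handshake.
    session = requests.Session(impersonate=IMPERSONATE)

    # Route both HTTP and HTTPS traffic through the proxy when one is given.
    if proxy_url:
        session.proxies = {"http": proxy_url, "https": proxy_url}

    return session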

The "chrome" profile without a version number automatically uses the latest Chrome version available in curl_cffi (currently Chrome 142). You can also pin a specific version with "chrome124" or "chrome131" for example. Each request produces a TLS handshake identical to Chrome's: same cipher suites, same order, same extensions, same ALPN.

Available profiles

curl_cffi supports dozens of profiles:

Browser	Available versions
Chrome	99, 100, 101, 104, 107, 110, 116, 119, 120, 123, 124, 131, 133a, 136, 142
Firefox	133, 135, 144
Safari	15.3, 15.5, 17.0, 18.0, 18.4, 26.0
Edge	99, 101
Tor	145

Each profile faithfully reproduces the TLS specifics of the target browser down to the version.

HTTP headers: consistency with impersonation

TLS impersonation isn't enough. HTTP headers must be consistent with the simulated browser. Indeed specifically checks Sec-Fetch headers.

Headers for the listing page (SERP)

LISTING_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:148.0) "
                  "Gecko/20100101 Firefox/148.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fr,fr-FR;q=0.9,en-US;q=0.8,en;q=0.7",
    "Alt-Used": "fr.indeed.com",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Priority": "u=0, i",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
}

Why these headers matter

Sec-Fetch-*: real browsers send these headers automatically. They describe the request context:

Sec-Fetch-Site: same-origin = intra-site navigation (not an external API call)
Sec-Fetch-Mode: navigate = user navigation (not a JS fetch)
Sec-Fetch-Dest: document = requesting an HTML document
Sec-Fetch-User: ?1 = the request is user-initiated (click)

A script without these headers is immediately identifiable.

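Accept-Language: fr,fr-FR;q=0.9,en-US;q=0.8,en;q=0.7 is consistent with a French user on fr.indeed.com. A standalone en-US would be suspicious.

User-Agent: the value above declares Firefox, and Alt-Used is a Firefox-specific header, while the TLS handshake impersonates Chrome. For strict consistency, either drop the User-Agent override so that curl_cffi's default Chrome headers apply, or impersonate a Firefox profile instead.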

Referer: The referer is dynamically computed from the search URL:

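from urllib.parse import urlparse

headers = dict(LISTING_HEADERS)  # copy, so the shared constant is never mutated
parsed = urlparse(url)           # url is the search results URL
headers["Referer"] = f"{parsed.scheme}://{parsed.netloc}/jobs"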

Headers for detail pages (job listings)

DETAIL_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:148.0) "
                  "Gecko/20100101 Firefox/148.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fr,fr-FR;q=0.9,en-US;q=0.8,en;q=0.7",
}

The headers are lighter for detail pages -- matching what the browser actually does. The referer points to the job URL in the listing:

headers = dict(DETAIL_HEADERS)
headers["Referer"] = job.get("url", "https://fr.indeed.com/jobs")

The proxy

The script accepts a proxy via the PROXY_URL environment variable:

PROXY_URL="http://user:pass@host:port" python scrape_indeed.py

This is configured at session creation:

if proxy_url:
    session.proxies = {"http": proxy_url, "https": proxy_url}

Without a proxy, the script works but Indeed will likely block after a few requests. Residential proxy providers (Decodo/Smartproxy, Bright Data, etc.) are recommended -- datacenter proxies are detected more easily.
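
As a rough sketch of the proxy handling described above, the snippet below reads PROXY_URL from the environment and builds the proxies mapping. The helper names load_proxies and mask_proxy, and the proxy.example host, are hypothetical and only serve the illustration; masking credentials before logging is a suggestion, not something the original script is known to do.

```python
import os
from urllib.parse import urlsplit


def load_proxies(env=os.environ):
    """Return a proxies mapping built from PROXY_URL, or None when it is unset."""
    proxy_url = env.get("PROXY_URL", "").strip()
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def mask_proxy(proxy_url):
    """Hide credentials before a proxy URL is written to the logs (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    if not parts.username:
        return proxy_url
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://***@{parts.hostname}{port}"


# Placeholder proxy (assumption): any real provider URL has the same shape.
proxies = load_proxies({"PROXY_URL": "http://user:pass@proxy.example:8000"})
print(proxies)
print(mask_proxy(proxies["https"]))
```

Passing the environment as a parameter keeps the function easy to exercise without touching the real process environment.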

Bot detection

Indeed relies on Cloudflare for protection. When scraping the first page only, a block typically shows up as an HTTP 403 with a Cloudflare captcha. In that case, there are two options:

Switch proxy -- a new residential IP is often enough to get through
Use a captcha solving service like CapSolver with their AntiCloudflareTask, which solves the challenge and returns cf_clearance cookies to inject into the session

We won't cover this part here.

Pagination: Complete shutdown since February 2026

Indeed has now completely blocked pagination for non-logged-in users. Pagination tokens (pp) are still present in the response (in pageLinks), but they no longer work without an authenticated session: any attempt to use them or manually increment the start parameter redirects to the login page.

This is a major change from 2024, when these same pp tokens still worked anonymously. Today, Indeed requires a login for any navigation beyond the first page of results.

The standalone script therefore scrapes the first page of results (typically 15 listings), then enriches each one via its detail page.
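
A script can at least tell a login wall apart from a Cloudflare block. The sketch below assumes the login page lives under secure.indeed.com/auth (the redirect target named in this article's summary table) and that the final URL after redirects is available, for example from the response object. The function names and the returned labels are hypothetical.

```python
from urllib.parse import urlsplit

LOGIN_HOST = "secure.indeed.com"  # redirect target named in the summary table
LOGIN_PATH_PREFIX = "/auth"


def is_login_redirect(final_url):
    """True when the URL reached after redirects is Indeed's login page."""
    parts = urlsplit(final_url)
    return parts.hostname == LOGIN_HOST and parts.path.startswith(LOGIN_PATH_PREFIX)


def diagnose(status_code, final_url):
    """Map a response to a coarse label (the labels are illustrative)."""
    if is_login_redirect(final_url):
        return "login_required"
    if status_code == 403:
        return "blocked"
    return "ok" if status_code == 200 else "unexpected"


print(diagnose(200, "https://secure.indeed.com/auth"))
print(diagnose(403, "https://fr.indeed.com/jobs"))
```

Checking the login redirect before the status code matters: a redirected request usually ends with a 200 on the login page, which would otherwise be mistaken for a successful fetch.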

At crawlergrid.ai, we handle all types of data extraction and automation for you, with no pagination limits or captcha blocking.

Parsing

Listing page: the Mosaic format

Indeed uses a "Mosaic" architecture (2026) that exposes data via JavaScript assignments:

window.mosaic.providerData["mosaic-provider-jobcards"] = {...};

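The scraper locates these assignments by regex, then extracts each JSON object by counting braces. A block can span hundreds of KB on a single line and is followed by more JavaScript, so a regex alone cannot reliably find where the object ends, and json.loads fails on anything but the exact object. Job listings are in the mosaic-provider-jobcards provider, the total count in MosaicProviderRichSearchDaemon.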

A legacy fallback is also supported for older pages (<script id="comp-initialData">).
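
The Mosaic extraction described above can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the assigned value is valid JSON, and the function names are hypothetical. The brace counter tracks string literals so that a "}" inside a job description does not end the object early.

```python
import json
import re

ASSIGNMENT = re.compile(r'window\.mosaic\.providerData\["([^"]+)"\]\s*=\s*')


def extract_object(text, start):
    """Return the balanced {...} literal that begins at text[start]."""
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False       # the character after a backslash is skipped
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError("unbalanced braces")


def parse_providers(html):
    """Map provider name -> decoded JSON for every Mosaic assignment found."""
    providers = {}
    for match in ASSIGNMENT.finditer(html):
        if match.end() < len(html) and html[match.end()] == "{":
            providers[match.group(1)] = json.loads(extract_object(html, match.end()))
    return providers


sample = 'window.mosaic.providerData["mosaic-provider-jobcards"] = {"a": {"t": "x}y"}};'
print(parse_providers(sample))
```

When the payload is strict JSON, the standard library's json.JSONDecoder().raw_decode(text, index) is an alternative worth considering: it parses one value starting at the given index and reports where it ended, which removes the need for a hand-written counter.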

Detail page: the embedded format

Each listing is loaded via viewtype=embedded, which returns pure JSON instead of HTML:

detail_url = (
    f"https://fr.indeed.com/viewjob?viewtype=embedded"
    f"&jk={jk}&from=shareddesktop_copy&adid=0&spa=1&hidecmpheader=1"
)

The JSON is deeply nested -- each field (company, salary, contract, description) is extracted with multiple fallback paths. If the embedded JSON fails, the script falls back to the JobPosting JSON-LD (schema.org) present in the HTML.
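
A multi-path lookup of this kind might look like the sketch below. The name dict_get comes from the summary table of this article, but its signature here is an assumption, first_of is a hypothetical helper, and the keys in the sample payload are invented for illustration; they are not Indeed's actual schema.

```python
def dict_get(data, path, default=None):
    """Follow a dotted path through nested dicts and lists; return default on a miss."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            return default
    return current


def first_of(data, paths, default=None):
    """Return the first non-empty value found among several candidate paths."""
    for path in paths:
        value = dict_get(data, path)
        if value not in (None, "", [], {}):
            return value
    return default


# Invented payload (assumption): the first path exists but is empty.
payload = {
    "header": {"company": ""},
    "model": {"employer": {"name": "TechCorp"}, "tags": ["Apprenticeship"]},
}

print(first_of(payload, ["header.company", "model.employer.name"]))
print(first_of(payload, ["model.contract", "model.tags.0"]))
print(first_of(payload, ["model.salary"], default=None))
```

Treating empty strings and empty containers as misses is a design choice: it lets a later, more reliable path win when an earlier one is present but blank.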

Listing + detail merge

Listing data (title, company, location) is enriched with detail data (description, contract, full salary).
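
One plausible way to implement this merge is shown below. It assumes that a non-empty detail value takes precedence over the listing value and that empty detail values never erase listing data; the original script may apply different rules. The function name merge_job is hypothetical.

```python
def merge_job(listing, detail):
    """Return (merged, enriched): listing fields overlaid with non-empty detail fields."""
    merged = dict(listing)
    enriched = []
    for key, value in detail.items():
        if value in (None, "", [], {}):
            continue  # never overwrite listing data with an empty detail value
        if merged.get(key) != value:
            merged[key] = value
            enriched.append(key)
    return merged, enriched


listing = {
    "job_key": "abc123def456",
    "title": "Web Developer Apprenticeship",
    "location": "Paris (75)",
    "salary": "1,200 EUR per month",
}
detail = {"location": "Paris 75001", "contract_type": "Apprenticeship", "salary": None}

merged, enriched = merge_job(listing, detail)
print(enriched)
print(merged["location"], "|", merged["salary"])
```

Returning the list of enriched keys alongside the merged record makes it straightforward to log which fields the detail page actually contributed.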

Request timing

Timing is crucial to avoid detection and behave like a regular user:

import random
import time

# Pause between detail pages (job listings) to mimic a human reading pace
delay = random.uniform(2, 4)  # seconds
time.sleep(delay)

Usage and output

Commands

# Without proxy (risk of blocking)
python scrape_indeed.py

# With proxy (recommended)
PROXY_URL="http://user:pass@host:port" python scrape_indeed.py

# Custom search, limit to 5 listings
python scrape_indeed.py --url "https://fr.indeed.com/jobs?q=python&l=Paris" --max 5

# Listing only (no detail page loading)
python scrape_indeed.py --no-detail

# Export results as JSON
python scrape_indeed.py --json-output
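
For readers who want to reproduce this interface, the flags above could be declared with argparse roughly as follows. Only the flag names come from the commands shown; the default URL, the help texts, and the choice of boolean switches are assumptions.

```python
import argparse

# Placeholder default (assumption), taken from the custom search example above.
DEFAULT_URL = "https://fr.indeed.com/jobs?q=python&l=Paris"


def build_parser():
    """Declare the four command-line flags used in this article."""
    parser = argparse.ArgumentParser(description="Scrape the first page of an Indeed search.")
    parser.add_argument("--url", default=DEFAULT_URL, help="search results URL")
    parser.add_argument("--max", type=int, default=None, dest="max_jobs",
                        help="maximum number of listings to process")
    parser.add_argument("--no-detail", action="store_true",
                        help="listing only, skip the detail pages")
    parser.add_argument("--json-output", action="store_true",
                        help="export the results as JSON")
    return parser


args = build_parser().parse_args(["--max", "5", "--no-detail"])
print(args.max_jobs, args.no_detail, args.json_output)
```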

What the script displays

The script displays each pipeline step with structured, colored logs. Each phase is timestamped:

======================================================================
  Indeed Scraper -- curl_cffi + Chrome TLS Impersonation
======================================================================

[14:32:01] Creating curl_cffi session
  > TLS impersonation: chrome
  !! No proxy configured -- high risk of blocking

[14:32:01] Loading results page (SERP)
  > URL: https://fr.indeed.com/jobs?q=alternance&l=France&sort=date...
  > HTTP 200 -- 847,231 bytes
  OK Page received, parsing in progress...
  OK Format detected: Mosaic (2026+)
  > Providers found: mosaic-provider-jobcards, MosaicProviderRichSearchDaemon, ...
  OK 15 listings extracted from this page (Indeed total: 2847)

Right after parsing the listing, the extracted JSON for each job is displayed:

[14:32:02] JSON extracted from listing (15 jobs)

----------------------------------------------------------------------
  --- Job #1 -- Web Developer Apprenticeship ---
{
  "job_key": "abc123def456",
  "title": "Web Developer Apprenticeship",
  "company": "TechCorp",
  "location": "Paris (75)",
  "salary": "1,200 EUR per month",
  "published_at": "2026-02-10T14:00:00+00:00",
  "url": "https://fr.indeed.com/viewjob?jk=abc123def456"
}

Then for each detailed listing, the script shows progress and enriched fields:

[14:32:03] Loading details (5 jobs)
  > [1/5] Detail for: Web Developer Apprenticeship (jk=abc123def456)
  OK Enriched fields: company, location, description, contract_type, published_at
  > Pause 2.7s...
  > [2/5] Detail for: Sales Assistant (jk=xyz789...)
  OK Enriched fields: company, location, salary, description, published_at

At the end, a compact summary followed by the full enriched JSON (listing + detail merged):

[14:32:18] Final results (5 jobs)

  #1 Web Developer Apprenticeship @ TechCorp -- Paris (75) | 1,200 EUR per month
  #2 Sales Assistant @ SARL Dupont -- Lyon 69001
  ...

[14:32:18] Full enriched JSON

----------------------------------------------------------------------
  --- Job #1 -- Web Developer Apprenticeship ---
{
  "job_key": "abc123def456",
  "title": "Web Developer Apprenticeship",
  "company": "TechCorp",
  "location": "Paris 75001",
  "salary": "1,200 EUR per month",
  "contract_type": "Apprenticeship",
  "published_at": "2026-02-10T14:00:00+00:00",
  "url": "https://fr.indeed.com/viewjob?jk=abc123def456",
  "description": "We are looking for a web developer apprentice... (1847 chars)"
}

Long descriptions are truncated in display (200 characters) but kept in full in the --json-output export.
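
The display truncation can be sketched in a few lines. The 200-character limit and the "(N chars)" suffix follow the sample output above; the function name preview is hypothetical, and the exact trimming rules of the original script are not known.

```python
def preview(text, limit=200):
    """Shorten text for display and append its full length, e.g. '... (1847 chars)'."""
    if len(text) <= limit:
        return text
    return f"{text[:limit].rstrip()}... ({len(text)} chars)"


print(preview("Short description"))
print(preview("x" * 1847))
```

Only the printed copy is shortened; the record written by --json-output keeps the full description.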

Technical summary

Layer	Implementation	Role
TLS	curl_cffi + `impersonate="chrome"`	TLS fingerprint identical to Chrome
HTTP	Sec-Fetch-* headers, Accept-Language, dynamic Referer	Application consistency
Network	Single proxy via `PROXY_URL`	IP rotation (manual)
Anti-bot	Redirect detection `secure.indeed.com/auth`	Blocking diagnosis
Pagination	First page only (login required since 2026)	Known limitation
Listing parsing	Mosaic providers + legacy fallback `comp-initialData`	Multi-format support
Detail parsing	Embedded JSON + JSON-LD `JobPosting` fallback	Multi-format support
Extraction	`dict_get` multi-path + per-field fallbacks	Resilience to variations
Timing	Random 2-4s delays between details	Human-like behavior
Output	Colored logs + JSON pretty-print + `--json-output` export	Traceability and debug

The key is consistency across all these layers. The TLS fingerprint alone isn't enough if the HTTP headers are inconsistent. Good headers aren't enough if the IP is blacklisted. And since 2026, even with all that, pagination remains a challenge -- crawlergrid.ai handles all these constraints in production, reliably and sustainably.
