Web Scraping for AI Pipelines: What Actually Works in 2026

Aditya Pathak, Senior Software Engineer

Kamalpreet Saluja, Tech Lead

June 30, 2026

The Real Ceiling Is Not the Model

When you build an AI-powered application that needs to generate current, structured output – whether that’s a presentation, a research brief, a job-match report, or a product analysis – the conversation almost always gravitates to the same question first: which LLM? GPT-4o? Gemini? Claude? Which prompt strategy?

That is the wrong question to lead with.

At 47Billion, we build 7Seers – an AI-powered education-to-employment platform – and we learned this the hard way. When a teacher asks for a slide deck on “Union Budget 2026 Highlights,” no amount of prompt engineering fixes the fact that most language models have a training cutoff twelve to eighteen months prior. The model simply does not have that content, and it will hallucinate confidently in its absence. The problem is not the model. The problem is what you feed it.

The actual ceiling of any AI-generated output is not the model. It is the content pipeline behind it.

If you want accurate, current output on live topics, you need to fetch content from the live web – rendered completely, extracted cleanly, and structured so your AI agent can reason over it. Building that pipeline turned out to be significantly harder than we expected, and this post is an honest account of what we built, what failed, and where the real unsolved problems still are.

A Note on Responsible Scraping

One thing we are deliberate about: we respect websites that do not want their data scraped. Before building a scraping path for any source, we check its robots.txt directives and Terms of Service. Where a site signals it does not want automated access, we do not override that signal – we look for an official API, a licensed data feed, or a direct data partnership instead. The technical capability to bypass a block is not the same as the right to do so. We think that distinction matters, and we build accordingly.
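As a rough illustration of the robots.txt part of that check, here is a minimal sketch using Python's standard-library urllib.robotparser. The user-agent string and the sample rules are hypothetical, and the sketch covers robots.txt only; Terms of Service still need a human read.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-pipeline-bot"  # hypothetical user-agent string


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt body permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt body, inlined so the example needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/articles/budget"))  # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/report"))   # False
```

In a real pipeline the robots.txt body would be fetched once per domain and cached, and a False result would route the source to the API or licensing path rather than to the scraper.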

Why Scraping for LLMs Is a Different Problem

Traditional web scraping had a simple success criterion: did the data arrive? You wrote a CSS selector, ran a cron job, stored rows in a database. The bar was binary.

Scraping for LLM applications is a different discipline entirely. The question is not “did the data arrive?” but “can a language model reason coherently over this?” That distinction changes every design decision in the pipeline.

Raw HTML is effectively toxic for LLMs. Nested containers, navigation menus, cookie banners, JavaScript loaders – these all consume tokens and confuse the model. A typical web page is 60–70% noise relative to its actual content. When you feed that noise to a model, it treats a sidebar link to a 2019 listicle as content equal to the article it is embedded in. The model does not hallucinate because it is a bad model – it produces incoherent output because the input was incoherent. Fix the pipeline, and the model’s output quality follows without any prompt changes.
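To make the noise point concrete, the sketch below estimates what share of a page's raw characters sits outside the main content. It uses only the standard library and assumes, for illustration, that the main content lives inside an article or main element; the sample page is made up, and real pages need a more robust content detector.

```python
from html.parser import HTMLParser


class MainTextCounter(HTMLParser):
    """Counts visible text characters inside <article> or <main> elements."""

    CONTENT_TAGS = {"article", "main"}  # assumption: content lives in these tags

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.content_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.content_chars += len(data.strip())


def noise_ratio(html: str) -> float:
    """Fraction of raw HTML characters that are not main-content text."""
    if not html:
        return 0.0
    counter = MainTextCounter()
    counter.feed(html)
    counter.close()
    return 1.0 - counter.content_chars / len(html)


# Hypothetical page: one sentence of content wrapped in navigation and footer.
SAMPLE_PAGE = (
    "<html><body>"
    "<nav><a href='/'>Home</a><a href='/old'>10 tips from 2019</a></nav>"
    "<article><p>The budget raises education spending.</p></article>"
    "<footer>Cookie settings | Privacy | Terms</footer>"
    "</body></html>"
)

print(f"noise ratio: {noise_ratio(SAMPLE_PAGE):.2f}")
```

Every character counted as noise here is a character the model would otherwise pay for in tokens and have to reason around.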

Freshness matters at the granularity of days, not months. A RAG system grounded in week-old government policy data gives wrong answers. A job-matching engine pulling stale listings misrepresents the actual market. And the model’s training data is not a substitute for either – even if the model technically encountered your target article during training, that version is frozen. If the policy changed last Tuesday, the model does not know.

This is why teams building AI applications on live web data cannot avoid building a proper content pipeline. And this is precisely the problem we were solving for 7Seers.

Attempt 1 – Naive HTTP Scraping

We started where everyone starts: Python’s requests library to fetch pages, BeautifulSoup to parse the HTML, and a simple routine to pull the text out.

It worked cleanly for about two days – on Wikipedia, a few static government pages, some open documentation sites. Then we hit the real web.

The problem is structural. The vast majority of modern content sites render their actual content via JavaScript. News portals, research platforms, edtech sites, job boards – the initial HTTP response is essentially a shell document. What arrives from a raw HTTP request is a sparse HTML file with containers and JavaScript loader references. The actual article, the policy details, the job description — all of that loads after a browser executes JavaScript on the client side.

So our scraper would fetch the page, parse the HTML, and return a beautifully structured empty document. No errors raised. No exceptions. No stack traces. Just structurally valid, completely empty content that looked fine to the pipeline – until the LLM tried to generate slides from it and produced something that technically had the right format but said absolutely nothing.

This is worse than a clean failure. A clean failure is caught immediately. A silent failure ships to users.
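One way to turn that silent failure into a loud one is a content guard between extraction and the LLM. The sketch below is an assumption about how such a guard could look, not our production code; the exception name and the 50-word threshold are hypothetical and would need tuning per source.

```python
class EmptyContentError(ValueError):
    """Raised when extraction yields too little text to be real content."""


MIN_WORDS = 50  # hypothetical threshold; tune per source


def require_content(text: str, url: str, min_words: int = MIN_WORDS) -> str:
    """Return text unchanged, or raise if it looks like an empty shell page."""
    word_count = len(text.split())
    if word_count < min_words:
        raise EmptyContentError(
            f"{url}: only {word_count} words extracted (minimum {min_words})"
        )
    return text


try:
    require_content("Loading...", "https://example.com/article")
except EmptyContentError as exc:
    print(f"rejected before reaching the LLM: {exc}")
```

The check is crude, but it moves the failure from the user's slide deck to the pipeline's logs, which is where it belongs.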

Attempt 2 – Browser Automation with Playwright

We moved to Playwright – Microsoft’s browser automation framework that controls a real headless Chromium instance, executes JavaScript, waits for network requests to settle, and returns the fully rendered DOM. Worth knowing: Playwright was not designed for scraping. It was built for end-to-end testing. But because it ships with full JavaScript execution, network interception, and device emulation, it has become the de facto engine underneath almost every modern scraping framework. Crawl4AI uses it. Firecrawl runs it under the hood. Even managed scraping APIs launch Playwright instances on their backend.

With Playwright, dynamic content finally rendered. Single-page applications loaded properly. We could retrieve what a real user would see.

Until the content sources started seeing us too.

Rate limiting kicked in. CAPTCHAs appeared. Our IP addresses got flagged and blocked – often within two or three requests from the same IP. Understanding why requires understanding how modern bot detection actually works, because it operates across multiple independent layers simultaneously.

The first layer is at the protocol level. Playwright communicates with the browser via the Chrome DevTools Protocol (CDP), and in its default configuration it issues a specific command – Runtime.enable – in the main browser context. Anti-bot systems like Cloudflare and DataDome specifically watch for this command as an automation signal. The moment it fires, the session is flagged. This is not a JavaScript-level leak that you can patch with an injection – it happens at the protocol layer, before any page script runs.

The second layer is browser fingerprinting. Chromium under automation control sets the navigator.webdriver flag to true, whether or not it runs headless – the single most widely checked bot indicator. Beyond that, headless Chromium produces automation-consistent signatures in canvas fingerprinting, WebGL renderer strings, and audio context fingerprinting. These are stable, identifiable patterns that anti-bot systems catalogue.

The third layer is IP reputation. Most proxy providers serve datacenter IP addresses – addresses that come from server farms hosted by AWS, Google Cloud, or similar providers. Anti-bot systems maintain continuously updated blocklists of these IP address ranges. The moment a request arrives from a known datacenter ASN (Autonomous System Number), it is suspect before anything else is evaluated.

Vanilla Playwright triggers all three layers at once. Browser automation is necessary – but it is nowhere near sufficient.
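To see why the IP reputation layer is so cheap for a defender, consider the sketch below: a range lookup against a blocklist. The CIDR ranges are reserved documentation addresses standing in for a real, continuously updated datacenter list, and a production system would resolve ASNs from a routing database rather than a hard-coded list.

```python
import ipaddress

# Hypothetical blocklist built from reserved documentation ranges (RFC 5737),
# standing in for real cloud-provider ranges.
DATACENTER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("198.51.100.0/24", "203.0.113.0/24")
]


def is_datacenter_ip(ip: str) -> bool:
    """Return True if ip falls inside any listed datacenter range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in DATACENTER_RANGES)


print(is_datacenter_ip("203.0.113.7"))  # True: inside a listed range
print(is_datacenter_ip("192.0.2.10"))   # False: not in the list
```

The lookup needs nothing from the request except its source address, which is why it can run before any fingerprint or behavior check.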

Where We Landed – A Layered Architecture

After two partial failures, we stopped looking for a single tool that solved everything and built a layered system instead. The core insight is that each failure mode requires a different solution at a different level of the stack. Solving one layer does not help with the others.

URL Input ──► Decision Router
                     │
      ┌──────────────┴──────────────┐
      │                             │
[Static HTML]            [Dynamic / JS-rendered]
      │                             │
requests +           Playwright (Headless Chromium)
BeautifulSoup                       │
      │                 ┌───────────┴────────────┐
      │                 │     STEALTH LAYER      │
      │                 │ playwright-stealth     │
      │                 │ Residential Proxies    │
      │                 │ Fingerprint Spoofing   │
      │                 └───────────┬────────────┘
      │                             │
      │                 ┌───────────┴────────────┐
      │                 │     BEHAVIOR LAYER     │
      │                 │ Homepage Entry First   │
      │                 │ Randomized Timing      │
      │                 │ Natural Navigation     │
      │                 └───────────┬────────────┘
      │                             │
      └──────────────┬──────────────┘
                     │
   ┌─────────────────▼─────────────────┐
   │         EXTRACTION LAYER          │
   │ Crawl4AI (HTML → Markdown/JSON)   │
   │ LLM Extractor (messy pages)       │
   │ Firecrawl (prototyping)           │
   └─────────────────┬─────────────────┘
                     │
   ┌─────────────────▼─────────────────┐
   │ Clean Markdown / JSON             │
   │ ──► LLM Agent                     │
   └───────────────────────────────────┘

Layer 1 – Stealth: Residential Proxies and CDP Patching

On proxies: datacenter IPs fail because an ASN lookup resolves them immediately to a known cloud hosting provider. Cloudflare maintains live blocklists of these ranges and rejects them before any other detection logic runs. Residential proxies use IP addresses assigned by real ISPs to actual home subscribers — from a detection system’s perspective, the traffic is indistinguishable from a real person browsing from their home connection. The success rate difference is significant: residential proxies achieve 85–99% pass rates on Cloudflare-protected sites compared to 40–60% for datacenter IPs. We use WebShare.io for rotating residential proxies. It is a mandatory infrastructure cost, not optional.

On CDP patching: proxies solve the IP layer and nothing else. The protocol leak – Playwright issuing Runtime.enable in the main browser context – is still present, and residential IPs do not hide it.

On top of residential proxies, we layer playwright-stealth patching – hiding the navigator.webdriver flag, spoofing canvas fingerprinting, fixing timezone leakage, and addressing a dozen other signals that bot detectors actively scan for. The idea is straightforward: make the browser look as close to a real human session as possible at every signal point. There are more aggressive CDP-level patching approaches in the community – tools that patch Playwright’s protocol communication itself rather than just the JavaScript layer – but we have not gone down that path yet. For our current sources, the combination of residential proxies and stealth patching gets us where we need to be.
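For readers who want to see the shape of this layer, here is a minimal sketch of a proxied Playwright session with a single fingerprint patch. It is deliberately not playwright-stealth: it hides only navigator.webdriver through an init script, which is a small fraction of what the stealth plugin covers. The proxy endpoint and credentials are hypothetical placeholders, and Playwright is imported lazily so the helper above it works without it installed.

```python
# Overrides the most commonly checked automation flag before any page script runs.
# A stand-in for the much broader patch set that playwright-stealth applies.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)


def build_proxy_settings(server: str, username: str, password: str) -> dict:
    """Build the proxy dict accepted by Playwright's chromium.launch(proxy=...)."""
    return {"server": server, "username": username, "password": password}


def fetch_rendered_html(url: str, proxy: dict) -> str:
    """Render url in headless Chromium through the given proxy; return the DOM."""
    # Lazy import: requires `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=proxy)
        try:
            context = browser.new_context()
            context.add_init_script(HIDE_WEBDRIVER_JS)
            page = context.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


# Hypothetical endpoint and credentials; substitute your provider's values.
proxy = build_proxy_settings("http://proxy.example.com:8080", "user", "secret")
# html = fetch_rendered_html("https://example.com/", proxy)
```

Note that neither the proxy nor the init script touches the Runtime.enable leak; that stays open until the protocol layer itself is patched.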

Layer 2 – Behavior: Making the Session Look Human

Technical stealth handles fingerprints. Behavioral analysis handles how the session acts over time. These are evaluated independently by modern anti-bot systems – a session can pass every fingerprint check and still get blocked because it navigates like a machine.

The clearest bot signature we hit: going directly to a deep content URL with no referrer and no prior navigation history. A real user opens a browser, lands on a homepage, maybe searches or browses a category, then reaches the article. Our scrapers were teleporting directly to the target URL with an empty session history. On several major content sources, that single pattern was sufficient to trigger a block.

What we changed: every session now starts at the domain root before navigating to the target. We route through internal search where possible, establishing a realistic referrer chain. Delays between actions are randomized around human-observed timing distributions rather than being fixed intervals — fixed-interval pauses are one of the clearest automation signatures because real humans do not wait exactly 1,000 milliseconds between every action. On content-heavy pages, we simulate gradual scroll at reading speed with occasional pauses. None of this is a guarantee against sophisticated behavioral analysis, but it eliminates the low-hanging signatures that account for most of the blocks we were actually encountering.
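The two core changes, entering through the domain root and randomizing delays, can be sketched with the standard library alone. The log-normal distribution and its parameters below are our illustrative assumption for "human-like" timing, not measured values, and the function names are hypothetical.

```python
import math
import random
from urllib.parse import urlsplit


def navigation_plan(target_url: str) -> list:
    """Return the URLs to visit in order: domain root first, then the target."""
    parts = urlsplit(target_url)
    root = f"{parts.scheme}://{parts.netloc}/"
    if parts.path in ("", "/") and not parts.query:
        return [root]  # the target already is the homepage
    return [root, target_url]


def human_delay(rng: random.Random, median_s: float = 2.0,
                sigma: float = 0.5, floor_s: float = 0.3) -> float:
    """Sample a pause in seconds from a log-normal distribution.

    Assumed parameters: a median of about two seconds with a long right
    tail, floored so no pause is implausibly short.
    """
    return max(floor_s, rng.lognormvariate(math.log(median_s), sigma))


rng = random.Random()
for url in navigation_plan("https://example.com/news/budget-2026"):
    print(f"visit {url}, then pause {human_delay(rng):.2f}s")
```

A skewed distribution matters more than the exact parameters: most pauses are short, a few are long, and no two are identical, which is the opposite of a fixed 1,000-millisecond sleep.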

Layer 3 – Intelligent Extraction: Crawl4AI and the Fallback Stack

Getting the rendered HTML is only the first half. Turning that HTML into something a language model can reason over is the second – and often the harder half.

Every website structures its content differently. Wikipedia uses clean semantic HTML. Government portals use structured tables and section tags. Most news sites wrap content in deeply nested containers with class names that change every few weeks. Some pages carry critical information inside embedded PDFs rendered as iframes. You cannot write CSS selectors to reliably handle all of these – any selector you write against a modern website has a half-life measured in weeks, and when it breaks, it fails silently.

Crawl4AI is our primary extraction tool. It is an open-source Python crawler built specifically for LLM data pipelines – not a general-purpose crawler adapted for AI, but one designed from the start around the question “can a model reason over this output?” It uses Playwright for JavaScript rendering and packages it with a content preparation step that converts the rendered DOM into clean Markdown or structured JSON.

One output distinction that matters in practice: Crawl4AI produces both raw_markdown (the full page content) and fit_markdown (boilerplate stripped, main content only). For LLM consumption – especially when you are paying per token – fit_markdown is almost always the right choice. It strips navigation menus, footers, sidebars, and other boilerplate, leaving the semantic content that actually matters. The tradeoff is that you need to trust its boilerplate detection, which works well on article-structured content and less reliably on non-standard layouts.
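Because fit_markdown depends on boilerplate detection that can misfire, one defensive option is to fall back to raw_markdown when the stripped version looks suspiciously small. The sketch below works on two plain strings and is our assumption about a reasonable guard, not Crawl4AI behavior; both thresholds are hypothetical.

```python
def pick_markdown(raw_markdown: str, fit_markdown: str,
                  min_chars: int = 200, min_ratio: float = 0.10) -> str:
    """Prefer the stripped markdown unless it looks over-pruned.

    min_chars and min_ratio are hypothetical thresholds: the stripped text
    must be non-trivial in absolute size and relative to the full page.
    """
    raw = raw_markdown.strip()
    fit = (fit_markdown or "").strip()
    if len(fit) >= min_chars and len(fit) >= min_ratio * len(raw):
        return fit
    return raw


article = "The budget raises education spending. " * 20
full_page = "Home | About | Login\n" + article + "\nCookie settings | Terms"

# Healthy case: the stripped version holds the article, so it is used.
print(pick_markdown(full_page, article) == article.strip())       # True
# Over-pruned case: stripping kept almost nothing, so fall back to raw.
print(pick_markdown(full_page, "Login") == full_page.strip())     # True
```

The fallback costs more tokens on the pages where it triggers, but it avoids handing the model a near-empty document on non-standard layouts.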

For sources with consistent, predictable HTML structure – Wikipedia, government data portals, standardized academic databases – Crawl4AI uses CSS or XPath selectors. Fast, cheap, deterministic. For messy, freeform pages where selectors break, Crawl4AI supports LLM-based extraction: you define a JSON schema describing what you want, provide a natural-language instruction, and the framework passes the page content through any LLM provider you configure (including local models via Ollama) and returns structured JSON regardless of how chaotic the source HTML is. The tradeoff with LLM-based extraction is speed and cost – it is meaningfully slower and more expensive per page than selector-based extraction. Our pipeline chooses between the two based on whether the source has a known-good selector configuration.
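The choice between the two extraction modes reduces to a lookup: does this host have a known-good selector configuration? Here is a minimal sketch of that decision, with a hypothetical registry and illustrative Wikipedia selectors; it returns only the decision and leaves the actual extraction call to whichever framework is in use.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class SelectorConfig:
    """CSS selectors known to work for one host."""
    title: str
    body: str


# Hypothetical registry; the selectors are illustrative and need maintenance.
KNOWN_GOOD_SELECTORS = {
    "en.wikipedia.org": SelectorConfig(
        title="h1#firstHeading", body="div#mw-content-text"
    ),
}


def choose_extraction(url: str) -> Tuple[str, Optional[SelectorConfig]]:
    """Return ("css", config) for known hosts, otherwise ("llm", None)."""
    host = urlsplit(url).hostname or ""
    config = KNOWN_GOOD_SELECTORS.get(host)
    if config is not None:
        return "css", config
    return "llm", None


print(choose_extraction("https://en.wikipedia.org/wiki/Web_scraping")[0])  # css
print(choose_extraction("https://news.example.com/story/123")[0])          # llm
```

Keeping the registry explicit makes the cost trade-off visible: every host added to it moves traffic from the slow, expensive path to the fast, deterministic one, at the price of maintaining its selectors.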

Firecrawl is a managed API service we evaluated and did not adopt for production. It handles proxy rotation and anti-bot measures internally and returns clean Markdown – the fastest path to working output in a prototype. It integrates directly with LangChain and LlamaIndex. The reason we did not adopt it: per-request pricing at any meaningful volume makes it economically unviable for a high-frequency pipeline. For early-stage prototyping where speed of iteration matters more than cost, it is worth a look.

One tool we are still evaluating is Scrapling – an adaptive Python scraping framework that takes a different approach to the broken-selector problem. Rather than requiring you to maintain CSS or XPath selectors, Scrapling learns from website changes and automatically relocates elements when page structure updates. Its StealthyFetcher also comes with built-in fingerprint spoofing and claims to bypass Cloudflare Turnstile out of the box. On paper, it addresses two of the three biggest pain points in our pipeline simultaneously. We have not put it through a proper production evaluation yet – it is on the list.

Another idea on our list is leveraging browser reading mode – most modern browsers expose a reader view that strips navigation, ads, and boilerplate and surfaces only the main article content. Hacking into that pipeline programmatically as a lightweight extraction step is something we want to explore. The appeal is obvious: essentially free boilerplate removal that the browser itself does, without any LLM call.

For open, static pages – plain HTML with no authentication, no JavaScript rendering, no anti-bot protection – we skip all of the above entirely and use a lightweight HTTP + BeautifulSoup path. Launching a headless Chromium process uses 200–400 MB of RAM. Doing that for a Wikipedia article is unnecessary. Our decision router makes this call with a quick preflight fetch that checks whether meaningful content is present in the raw response.
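The preflight decision can be approximated by counting visible words in the raw HTTP response. The sketch below is one possible heuristic under our own assumptions; the 150-word threshold is hypothetical, and a real router would also look at status codes and challenge pages.

```python
from html.parser import HTMLParser


class VisibleWordCounter(HTMLParser):
    """Counts words outside script, style, noscript and template elements."""

    SKIP_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.words += len(data.split())


def choose_path(raw_html: str, min_words: int = 150) -> str:
    """Return "static" if the raw response already holds content, else "browser"."""
    counter = VisibleWordCounter()
    counter.feed(raw_html)
    counter.close()
    return "static" if counter.words >= min_words else "browser"


# Hypothetical responses: a JavaScript shell and a server-rendered article.
SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
ARTICLE = "<html><body><article>" + "policy detail " * 100 + "</article></body></html>"

print(choose_path(SHELL))    # browser
print(choose_path(ARTICLE))  # static
```

The preflight fetch costs one cheap HTTP request; a wrong "static" verdict is caught by the same kind of word-count check after extraction, so the heuristic can afford to be simple.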

The Unsolved Problem – CAPTCHA

Let us be direct about where we remain stuck.

There is no reliable, freely available solution for solving CAPTCHAs at scale. Residential proxies and stealth browser patching prevent the majority of CAPTCHA encounters – when your traffic genuinely resembles a human on a home ISP connection, you generally do not trigger challenges in the first place. But “generally” leaves a real tail. Edge cases hit Cloudflare Turnstile or reCAPTCHA v3, and when they do, those requests currently fail.

Paid services – 2Captcha, CapSolver, Anti-CAPTCHA – use a combination of human workers and AI models to solve challenges. They claim 90%+ solve rates. The cost is non-trivial, and even the well-regarded ones acknowledge the gap: a 10% unsolved rate on a high-volume pipeline is an operational problem, not an acceptable edge case. We are actively monitoring this space. It remains the one layer of the pipeline without a satisfying solution.
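A quick back-of-the-envelope calculation shows why the unsolved tail matters. The request volume and challenge rate below are hypothetical inputs chosen for illustration; only the 90% solve rate comes from the vendors' claims.

```python
def expected_daily_failures(requests_per_day: int, challenge_rate: float,
                            solve_rate: float) -> float:
    """Requests per day that hit a CAPTCHA and remain unsolved."""
    challenged = requests_per_day * challenge_rate
    return challenged * (1.0 - solve_rate)


# Hypothetical pipeline: 100,000 requests a day, 2% of them challenged,
# solver succeeds on 90% of challenges.
failures = expected_daily_failures(100_000, 0.02, 0.90)
print(round(failures))  # 200
```

Under these assumed numbers, a paid solver still leaves about 200 failed fetches a day, each of which needs a retry path, a fallback source, or an explicit gap in the output.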

What This Work Taught Us

A few things we learned that most tutorials skip over.

The broken-selector problem is more dangerous than bot detection. Everyone writes about getting blocked. Fewer people write about the fact that any CSS selector you write against a modern website has a half-life measured in weeks. A site redesign, a class name change, a frontend framework migration – and your pipeline fails silently with no error, which is the worst kind of failure. LLM-based extraction is slower and costs more per call, but it is structurally resilient to these changes in a way selector-based approaches are not. For high-value sources, that resilience is worth the cost.

“Garbage in, garbage out” is the primary AI failure mode, not hallucination. We consistently observed this: the model does not produce bad output because it is a bad model. It produces bad output because we fed it empty, malformed, or boilerplate-heavy content. Improving the pipeline improved output quality immediately, without any prompt changes.

Compute cost is a real architecture constraint, not a secondary concern. A headless Chromium process runs at 200–400 MB of RAM. At any meaningful scale, you cannot spin up a new instance per request. Session pooling, instance reuse, and intelligent routing between the lightweight and heavyweight paths are production requirements.
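The simplest piece of that discipline is a hard cap on concurrent browser work. The sketch below bounds concurrency with an asyncio semaphore and uses a fake render function in place of a real browser; it illustrates the cap only, not instance reuse, and all names are hypothetical.

```python
import asyncio


class BoundedBrowserPool:
    """Caps how many heavyweight browser jobs run at the same time."""

    def __init__(self, max_instances: int):
        self._slots = asyncio.Semaphore(max_instances)
        self.active = 0
        self.peak = 0  # highest concurrency observed, for monitoring

    async def run(self, job):
        """Wait for a free slot, then await the coroutine returned by job()."""
        async with self._slots:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1


async def fake_render(url: str) -> str:
    """Stand-in for a real browser render; sleeps instead of loading a page."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def demo(urls, max_instances: int):
    # The pool is created inside the running event loop.
    pool = BoundedBrowserPool(max_instances)
    pages = await asyncio.gather(
        *(pool.run(lambda u=u: fake_render(u)) for u in urls)
    )
    return pages, pool.peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
pages, peak = asyncio.run(demo(urls, max_instances=3))
print(f"rendered {len(pages)} pages, peak concurrency {peak}")
```

With a cap of three and 200–400 MB per instance, the browser tier's memory stays within roughly 0.6–1.2 GB no matter how many URLs are queued behind it.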

The paywall trend is real and accelerating. As LLMs become more capable at ingesting publicly available content, more publishers are moving high-quality material behind authentication. This is not a distant concern – it is happening now, and it makes the scraping problem structurally harder even as the tooling improves. We do not have a solution to this, and we are honest about that.

LLM training data and live targeted scraping are not the same thing. The question came up internally: if the model already saw this article during training, why are we scraping it again? Precision and freshness. A model trained a year ago does not have last week’s policy update. And even for content the model technically knows, our pipeline builds a targeted RAG context anchored to the specific source the user provided – not a generalized representation of that domain. Grounded, source-attributed answers are qualitatively different from what the model’s prior produces.

Closing Thought

If you are building any AI application that depends on current, external information – presentation generators, RAG systems, job-matching engines, research tools, autonomous agents – the content pipeline is not a commodity layer you can skip past. It is a real engineering problem with multiple distinct failure modes, each requiring a different solution at a different level of the stack. Every deployment requires calibration: which sources need the full stealth stack, which can use the lightweight path, which need LLM-based extraction versus CSS selectors, and where the compute budget is best spent. Getting these decisions wrong does not just affect pipeline performance – it directly degrades the quality of every AI output downstream.

The model is not the bottleneck. The content pipeline is. Fix the pipeline first – and build it responsibly.

47Billion builds AI-powered enterprise platforms across education, employment, and talent intelligence. If you are working on similar content pipeline or AI infrastructure challenges, reach out – we are building in this space every day.

Frequently Asked Questions

1. What is web scraping for AI pipelines?

Web scraping for AI pipelines is the process of collecting, rendering, extracting, and structuring live web content so large language models (LLMs) can generate accurate, up-to-date responses. Unlike traditional scraping, AI pipelines require clean, context-rich data in formats such as Markdown or JSON to support Retrieval-Augmented Generation (RAG), AI agents, and enterprise AI applications.

2. Why isn't prompt engineering enough for AI applications that use live data?

Prompt engineering cannot compensate for missing or outdated information. Most LLMs have knowledge cutoffs and cannot access recent events, policy changes, or newly published content without external data retrieval. A robust content pipeline that fetches, cleans, and structures live web data enables AI models to produce more reliable and factually grounded outputs.

3. Why do AI web scraping pipelines use Playwright instead of traditional web scraping libraries?

Modern websites rely heavily on JavaScript to render content dynamically. Traditional libraries like Requests and BeautifulSoup can only retrieve the initial HTML, often missing the actual page content. Playwright automates a real browser, executes JavaScript, and captures the fully rendered page, making it the preferred choice for scraping dynamic websites used in AI and RAG pipelines.

4. How can enterprises reduce bot detection while scraping websites responsibly?

Organizations can improve scraping reliability by using browser automation with stealth techniques, rotating residential proxies, realistic browsing behavior, and intelligent request routing. At the same time, responsible scraping requires respecting robots.txt directives, complying with website Terms of Service, and using official APIs or licensed data sources whenever available.

5. What makes a production-ready AI content pipeline?

A production-ready AI content pipeline combines multiple components, including intelligent URL routing, browser rendering for dynamic pages, content extraction, boilerplate removal, structured output generation, monitoring, caching, and seamless integration with RAG systems or AI agents. The goal is to provide clean, relevant, and current information that enables LLMs to deliver accurate, source-grounded responses at scale.
