The Problem Nobody Talks About in EdTech
Every EdTech platform eventually hits the same wall: content at scale is hard.
You can build an elegant platform, nail the UX, get universities onboarded – and then stall because the question bank is thin, repetitive, or factually unreliable. Coursera outsources to faculty networks. Udemy crowdsources from instructors. Most platforms either spend years curating content manually or quietly rely on the same shortcut: ask an LLM to generate questions and hope for the best.
We tried that. It does not hold up at production scale.
This post walks through what we actually built at 47Billion – an AI-assisted content generation pipeline for an education-to-employment platform – covering the problems we hit, the architectural decisions we made, and where we landed.

Why Naive LLM Prompting Fails at Scale
The instinct is understandable: give an LLM a topic, ask for 20 questions, write them to a database. It works for a demo. It falls apart fast when you run it across thousands of topics with multiple publishers.
Hallucination and outdated facts are the most visible failure. LLMs generate questions confidently even when the underlying facts are stale or wrong. Interviews with professors confirm this: they are hesitant, and they name hallucination as the primary obstacle to LLM adoption in education – static internal knowledge and hallucination together make models unreliable for assessment content without grounding in current sources.
No grounding in source material is the deeper issue. A free-running LLM ignores a publisher’s specific syllabus and generates from parametric memory – knowledge baked in at training time. For professional certification or interview prep content, the gap between what the model learned in 2023 and what a candidate will be tested on in 2025 is exactly where it fails.
Duplication compounds this. Ask an LLM for questions on “REST APIs” across ten sessions and you get variations of the same five concepts. At scale, the question bank becomes a shallow pool that students recognize after their second practice session.
We needed a pipeline that was grounded, structured, and deduplication-aware from the start.
Pipeline Architecture: The High-Level View
The pipeline has five stages, each targeting a specific failure mode:
Topic / Syllabus Input
↓
Web Search + Content Scraping
↓
Noise Reduction + Chunking + Summarization
↓
Semantic Clustering via LLM Agent
↓
Typed Question Generation + Deduplication

Stage 1: Grounding in Source Content
Instead of generating from memory, the pipeline starts with retrieval. When a publisher enters a topic – say, “JWT Authentication” – the system searches the web, pulls the top relevant links, and scrapes content from each page. The LLM is never asked to generate anything until real, current source material is in hand.
This is the core principle behind Retrieval-Augmented Generation (RAG): augment generation with retrieved external knowledge rather than relying solely on what the model learned during training. For private use cases – universities with proprietary archives, enterprises with documentation behind firewalls – publishers can upload documents directly. The same downstream pipeline applies to both paths.
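In sketch form, the retrieval-first flow reduces to a function that refuses to involve a model until sources exist. The `search` and `scrape` callables here are injected stand-ins – the production pipeline's actual search API and scraper are not shown, so these names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SourceDoc:
    url: str
    text: str

def build_topic_corpus(
    topic: str,
    search: Callable[[str], list[str]],   # stand-in: topic -> candidate URLs
    scrape: Callable[[str], str],         # stand-in: URL -> raw page text
    max_sources: int = 5,
) -> list[SourceDoc]:
    """Retrieval-first grounding: collect real source material before
    any generation step runs. Empty or failed scrapes are skipped."""
    docs: list[SourceDoc] = []
    for url in search(topic)[:max_sources]:
        text = scrape(url)
        if text.strip():
            docs.append(SourceDoc(url=url, text=text))
    return docs
```

The same function serves the private-upload path: a publisher's documents simply bypass `search` and enter as pre-fetched `SourceDoc`s.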

Stage 2: Noise Reduction and Structured Chunking
Raw scraped HTML is not usable. A typical web page is 60–70% noise: navigation menus, cookie banners, sidebar ads, footer links. Fed this directly, an LLM treats a footer link to “10 Best JavaScript Frameworks of 2019” as content equal to the actual article.
We added a readability extraction layer – implementing Mozilla’s Readability algorithm, the same logic behind Firefox’s reader view – which strips noise while preserving the article’s actual structure: headings, paragraphs, code blocks, in the order the author wrote them. Structure matters as much as noise removal, because it tells the LLM what is primary content versus incidental detail.
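A full Readability port is beyond the scope of this post, but the shape of the idea – keep text from content tags in document order, drop everything inside navigation and footer containers – can be sketched with the standard library. The tag lists are illustrative, not what Readability actually scores on:

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"h1", "h2", "h3", "p", "pre", "li"}
NOISE_TAGS = {"nav", "footer", "aside", "header", "script", "style"}

class ContentExtractor(HTMLParser):
    """Crude readability-style pass: collect text appearing inside content
    tags, in document order, unless it sits inside a noise container."""
    def __init__(self):
        super().__init__()
        self.noise_depth = 0          # how many noise containers we are inside
        self.keep = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in CONTENT_TAGS and self.noise_depth == 0:
            self.keep = True

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in CONTENT_TAGS:
            self.keep = False

    def handle_data(self, data):
        if self.keep and data.strip():
            self.parts.append(data.strip())

def extract_content(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

A production extractor additionally scores blocks by text density and link ratio; this sketch only demonstrates why structure-aware stripping beats flattening the page to plain text.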
Content from multiple URLs on a complex topic can easily run to 50,000+ words. We chunk it into segments of roughly 500–800 tokens, then generate a one-to-two sentence summary per chunk. The chunk preserves specific facts; the summary provides a navigable index. All chunks across all sources for a topic are merged into a single structured corpus – a topic-specific book assembled from multiple authoritative references.
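A minimal chunker along these lines, splitting on paragraph boundaries and approximating token counts with whitespace words – a real pipeline would count with the embedding model's own tokenizer:

```python
def chunk_text(text: str, target_tokens: int = 650) -> list[str]:
    """Split a corpus into roughly target_tokens-sized chunks, never
    breaking inside a paragraph. A single paragraph longer than the
    target is kept whole rather than split mid-thought."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        words = len(para.split())          # crude token proxy
        if current and count + words > target_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk then gets its one-to-two sentence summary; the summaries become the navigable index over the merged corpus.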
Stage 3: LLM-Assisted Semantic Clustering
This is the most architecturally interesting stage – and where purely algorithmic approaches failed us.
The problem: chunks that are semantically identical do not always register as similar to a cosine similarity function. “JWT tokens expire after 1 hour by default” and “The default JWT expiration is set to 60 minutes” mean the same thing, but their embedding vectors diverge because the surface-level word choice is different. Cosine similarity measures the angular distance between high-dimensional text vectors – when two sentences say the same thing with different vocabulary, the angle between their vectors is larger than intuition suggests.
We tried cosine similarity thresholding on chunk embeddings first. It helped but consistently failed on paraphrase-level equivalences exactly like the JWT example. Chunks that should have been grouped were treated as distinct, generating near-duplicate questions downstream.
The solution was an LLM agent handling the clustering step. The agent receives chunk summaries as input and decides which summaries belong in the same semantic cluster. When uncertain, it can request the full chunk content via chunk ID to make a more informed decision. This correctly handles paraphrase equivalences, groups complementary chunks together, and makes the deduplication step downstream significantly cleaner.
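A sketch of that agent loop, with the LLM call and its JSON reply format stubbed as assumptions – the production agent's actual tool protocol is not shown here:

```python
import json
from typing import Callable

def cluster_summaries(
    summaries: dict[str, str],       # chunk_id -> one-line summary
    full_chunks: dict[str, str],     # chunk_id -> full content (the agent's "tool")
    call_llm: Callable[[str], str],  # assumed to return JSON:
                                     # {"clusters": [[id, ...], ...], "need_full": [id, ...]}
) -> list[list[str]]:
    """One round of agentic clustering: ask the model to group summaries;
    if it requests full chunk content for uncertain cases, re-prompt with
    that content included and take the second answer."""
    prompt = "Group these chunk summaries into semantic clusters:\n" + json.dumps(summaries)
    reply = json.loads(call_llm(prompt))
    if reply.get("need_full"):
        extra = {cid: full_chunks[cid] for cid in reply["need_full"]}
        prompt += "\nFull content for uncertain chunks:\n" + json.dumps(extra)
        reply = json.loads(call_llm(prompt))
    return reply["clusters"]
```

The key design point survives the simplification: the model sees cheap summaries by default and pays for full chunks only when ambiguity demands it.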
Stage 4: Typed Question Generation and Deduplication
With clean, semantically clustered chunks, generation is constrained and structured. Per cluster, the pipeline generates exactly four questions across four distinct types: conceptual (what it is and why), practical (apply the concept), scenario-based (reason through a real situation), and case-based (multi-step analysis). Fixing the count prevents a heavily sourced topic from flooding the question bank and thinner concepts from being starved. The type constraint forces diversity – you cannot generate four conceptual questions from a single cluster.
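In sketch form, the constraint lives in the prompt plus a validation gate on the model's output. The prompt wording and the output schema below are illustrative, not the production versions:

```python
QUESTION_TYPES = ["conceptual", "practical", "scenario", "case"]

def build_generation_prompt(cluster_chunks: list[str]) -> str:
    """Constrain generation: exactly one question per type, grounded
    only in the cluster's own source material."""
    source = "\n---\n".join(cluster_chunks)
    return (
        "Using ONLY the source material below, write exactly 4 questions, "
        "one of each type: " + ", ".join(QUESTION_TYPES) + ". "
        "Tag each question with a difficulty of easy, medium, or hard.\n\n"
        f"SOURCE:\n{source}"
    )

def validate_questions(questions: list[dict]) -> bool:
    """Reject model output unless all four types appear exactly once."""
    return sorted(q["type"] for q in questions) == sorted(QUESTION_TYPES)
```

Outputs that fail validation are regenerated rather than patched, which keeps the per-cluster guarantee hard rather than best-effort.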
Questions are tagged at generation time as easy, medium, or hard. A more reliable direction we are moving toward is sourcing from existing exam and interview archives, where difficulty is implicitly encoded in context – a question from a senior engineering interview carries a different signal than one from a first-year university paper. This also lays the groundwork for adaptive assessment: start a student at medium difficulty, adjust based on their response pattern, and shift the band dynamically – the same approach CAT and the GRE have used for decades.
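The simplest version of that band shift is a one-step adjustment per response – a deliberate simplification of what CAT-style systems actually do, since they estimate ability from an item-response model rather than a single answer:

```python
BANDS = ["easy", "medium", "hard"]

def next_difficulty(current: str, correct: bool) -> str:
    """Naive adaptive band: step up one level on a correct answer,
    down one level on an incorrect one, clamped at the ends."""
    i = BANDS.index(current)
    i = min(i + 1, len(BANDS) - 1) if correct else max(i - 1, 0)
    return BANDS[i]
```

Even this crude rule changes the product shape: the question bank must carry enough depth at every band, which is exactly what the fixed-count, typed generation above provides.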

Semantic Deduplication: Cosine Similarity + BM25 + PGVector
Even with clustered generation, overlap across clusters is possible. All generated questions are embedded, and a pairwise cosine similarity matrix is computed across the full topic set. For any pair above a ~0.88 threshold, only one question is retained.
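A greedy version of that dedup pass, with plain-Python cosine for clarity – production code would batch this through a vector library rather than loop pairwise:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dedupe(
    questions: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.88,
) -> list[str]:
    """Keep a question only if it stays below the similarity threshold
    against every question already kept; first occurrence wins."""
    kept: list[int] = []
    for i in range(len(questions)):
        if all(cosine(embeddings[i], embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]
```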
For publisher-uploaded content – question sets the platform didn’t generate – cosine similarity alone is insufficient. A question phrased in dense technical language and one phrased simply may score below the similarity threshold even if they test the same thing. We are extending deduplication with a hybrid approach directly within PostgreSQL: pgvector for embedding storage and cosine similarity queries, combined with BM25-based keyword matching via full-text search on the same table, fused using Reciprocal Rank Fusion (RRF).
BM25 and cosine similarity are complementary by design. BM25 – which scores documents by term frequency, inverse document frequency, and length normalization – excels at exact-match detection: same specific technical terms, same values. Cosine similarity catches paraphrase-level duplicates where vocabulary differs but meaning aligns. Running both within a single PostgreSQL instance eliminates the need for a separate vector database while achieving recall that either method alone cannot match.
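RRF itself is only a few lines: each ranked list contributes 1/(k + rank) per document, and the fused order is by summed score. The k = 60 constant is the conventional default from the original RRF paper; the input rankings here would be the BM25 result list and the cosine-similarity result list from the same PostgreSQL table:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank),
    where rank is 1-based position in each list. Documents missing from a
    list simply contribute nothing for it."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the hardest part of hybrid search: BM25 scores and cosine similarities live on incomparable scales, and rank fusion never needs to normalize between them.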
What This Enables Beyond Assessments
The corpus built by this pipeline – structured, chunked, summarized, semantically indexed – is not single-use. The same topic corpus powers assessment question banks, mock interview sets, personalized study guides, and faculty presentation content. The pipeline runs once per topic; every downstream use case queries the corpus without regenerating from scratch. The corpus is the asset; questions are one output of it.
Where We Are and What Is Next
The pipeline is live and generating question banks for production use. Subject matter expert review remains in the loop – deliberately – but the reviewer’s job has shifted from authoring questions to validating AI-generated ones. That is a meaningful productivity gain.
Immediate next iterations: adaptive difficulty sequencing based on student response patterns; cross-question answer verification using the knowledge base as ground truth for open-ended response evaluation; and curriculum alignment mapping to NCRF levels and Bloom’s taxonomy tiers so publishers can verify coverage before deployment.
Closing Thought
Retrieval quality determines generation quality. You cannot prompt your way to a reliable, grounded question bank. The information architecture has to come first – clean content, structured chunking, semantic clustering – and generation is the final, constrained step of a well-designed system.
The LLM is not the pipeline. It is one component within it.
Build This With 47Billion
At 47Billion, we design and deploy production-grade AI pipelines for EdTech platforms, universities, and enterprises – RAG-grounded retrieval systems, semantic deduplication, LLM agent orchestration, and PostgreSQL-native vector search, without the overhead of separate vector database infrastructure.
If you are building a knowledge-intensive AI product and hitting the walls this post describes, we would like to talk.