RAG-Powered Form Authoring Pipeline

End-to-end test of whether an LLM agent can generate a complete, regulation-compliant government benefits application from a policy corpus alone — no source PDF, no hand-crafted template.

Status

Working pipeline against a Wisconsin SNAP policy corpus. RAG’s contribution is measured via an ablation (see rag-ablation-snap). The corpus is load-bearing: without it, the pipeline falls back to a single skeletal page. With it, the model produces a 14-page form that tracks the ground truth’s topical structure across eligibility, income, assets, work requirements, verification, expedited service, and Wisconsin-specific administration.

Absolute scores against ground truth are low (10 % recall) — a scorer artifact, not a pipeline failure. See the ablation writeup for why.

How the pipeline works

Four stages, each independently configurable via the variant system (Settings → Authoring):

  1. Criteria analysis — LLM reads the corpus and produces evaluation criteria with regulatory citations. Each criterion is an atomic check a reviewer can approve or reject before generation starts.
  2. Structure generation — LLM derives pages and groups from the approved criteria + policy corpus. The prompt explicitly tells the model to call no tools when input is insufficient, so empty input produces an empty form (validated by the RAG ablation).
  3. Section generation — per page, the LLM proposes fields with types, labels, required-ness, and sensitivity tags, citing the regulation that requires each.
  4. Auto-evaluation — LLM-as-judge scores the generated form against the approved criteria. (Deferred for v1; the scorer exists, but wiring it into the pipeline is follow-up work.)
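
The stages hand off small, reviewable artifacts. Below is a minimal sketch of the kinds of shapes involved, in TypeScript; the names are illustrative, not the repo's actual types.

```ts
// Illustrative only: these interfaces are assumptions, not the repo's actual types.

/** Stage 1 output: an atomic, citable check a reviewer approves or rejects. */
interface Criterion {
  id: string;
  description: string;        // e.g. "Collect gross monthly earned income per household member"
  citation: string;           // regulatory citation backing the criterion
  status: "pending" | "approved" | "rejected";
}

/** Stage 2 output: pages and groups derived from approved criteria + corpus. */
interface FormStructure {
  pages: { title: string; groups: string[] }[];
}

/** Stage 3 output: one proposed field, tied back to the regulation that requires it. */
interface FieldProposal {
  name: string;               // e.g. "earned_income_source"
  label: string;
  type: "text" | "number" | "date" | "boolean" | "select";
  required: boolean;
  sensitive: boolean;         // sensitivity / PII tag
  citation: string;
}
```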

Corpus

catalog/references/snap-wisconsin.md — 21 sections covering eligibility (household, income, resources, citizenship, students), deductions, work requirements, expedited service, application/interview/verification, categorical eligibility (TANF/SSI, BBCE), ABAWD time limits, certification periods, reporting changes, fair hearings, and Wisconsin-specific administration (Wis. Stat. 49.79, DHS 1507, ACCESS portal, QUEST card, IM consortia, FSET). Expanded from the original 13-chunk corpus on 2026-04-20 to exercise retrieval more seriously.
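
For illustration only, here is a sketch of how a markdown reference like this could be split into per-section chunks for retrieval; the heading convention and the helper are assumptions, not the project's actual chunking code.

```ts
// Hypothetical sketch: split the reference markdown into per-section chunks.
// Assumes each of the 21 sections starts with an "## " heading.
import { readFile } from "node:fs/promises";

interface CorpusChunk {
  heading: string;
  body: string;
}

async function loadSnapCorpus(
  path = "catalog/references/snap-wisconsin.md",
): Promise<CorpusChunk[]> {
  const text = await readFile(path, "utf8");
  return text
    .split(/\n(?=## )/)                 // one chunk per "## Section" heading
    .map((section) => {
      const [heading, ...rest] = section.split("\n");
      return {
        heading: heading.replace(/^#+\s*/, ""),
        body: rest.join("\n").trim(),
      };
    })
    .filter((chunk) => chunk.body.length > 0);
}
```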

Key findings

  1. RAG more than doubles recall on SNAP authoring. With the corpus: 10.6 % recall, 14 pages, 140 fields. Without: 4.7 %, 1 page, 7 fields. See rag-ablation-snap.

  2. Prompt engineering must not impersonate retrieval. The first ablation run reported a 7.1 % / 7.1 % tie — the structure prompt had the 8 SNAP page titles hardcoded, so the model reproduced them even with an empty corpus. Rewriting the prompt to derive structure from criteria + corpus exposed the real RAG contribution. A generalisable lesson: before concluding retrieval adds no value, audit the prompts for hidden domain knowledge.

  3. Absolute numbers are gated by the scorer. Both variants cap around 10 % recall under deterministic field-name matching because the model emits earned_income_source where ground truth has wages_salary_source. An LLM-as-judge scorer would plausibly lift both variants to 40-60 %. The fix is an investment in measurement, not in the pipeline (see the sketch after this list).

  4. Haiku is a viable generation model. Sonnet for criteria + structure (one-time, benefits from richer reasoning); Haiku for per-section generation (repeated, speed matters). Latency drops from ~3 min to ~80 s with no type-accuracy regression.
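
To make finding 3 concrete, here is a minimal sketch of deterministic field-name matching, assuming exact matching on normalized names; the real scorer may differ in detail.

```ts
// Assumed, not the repo's actual scorer: recall is exact-match on normalized
// field names, so a semantically correct name never matches a differently
// worded ground-truth name.
function fieldRecall(generated: string[], groundTruth: string[]): number {
  const normalize = (name: string) => name.trim().toLowerCase();
  const generatedSet = new Set(generated.map(normalize));
  const hits = groundTruth.filter((name) => generatedSet.has(normalize(name)));
  return hits.length / groundTruth.length;
}

// "earned_income_source" vs "wages_salary_source" counts as a miss,
// even though a human (or an LLM judge) would call them equivalent.
fieldRecall(["earned_income_source"], ["wages_salary_source"]); // => 0
```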

Variants

| Variant id | Criteria | Structure | Generation | Use case |
| --- | --- | --- | --- | --- |
| all-sonnet | Sonnet | Sonnet | Sonnet | Highest-quality run for new corpora |
| no-rag-sonnet | Sonnet | Sonnet | Sonnet | Ablation control — passes empty corpus to every stage |
| haiku-generation | Sonnet | Sonnet | Haiku | Fast interactive use |
| all-haiku | Haiku | Haiku | Haiku | Cheapest; trades reasoning depth for speed |
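
Conceptually, a variant is one model choice per stage plus an ablation flag. A hedged sketch of that shape, not the repo's actual config schema:

```ts
// Illustrative only: field names are assumptions.
type Model = "sonnet" | "haiku";

interface AuthoringVariant {
  id: string;
  criteriaModel: Model;
  structureModel: Model;
  generationModel: Model;
  emptyCorpus?: boolean;   // ablation control: pass an empty corpus to every stage
}

const variants: AuthoringVariant[] = [
  { id: "all-sonnet",       criteriaModel: "sonnet", structureModel: "sonnet", generationModel: "sonnet" },
  { id: "no-rag-sonnet",    criteriaModel: "sonnet", structureModel: "sonnet", generationModel: "sonnet", emptyCorpus: true },
  { id: "haiku-generation", criteriaModel: "sonnet", structureModel: "sonnet", generationModel: "haiku" },
  { id: "all-haiku",        criteriaModel: "haiku",  structureModel: "haiku",  generationModel: "haiku" },
];
```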

Running the pipeline

# Run a single variant end-to-end against the SNAP fixture
bun run cli evaluate authoring all-sonnet

# Ablation: isolate RAG's contribution
bun run cli evaluate authoring no-rag-sonnet

# Compare model-choice variants
bun run cli evaluate authoring compare

Artifacts land in a temporary directory by default; pass --out-dir <path> to pin them.

Limitations and next steps

  • Scorer. Move to an LLM-as-judge scorer so semantic equivalents stop being scored as misses.
  • Per-stage retrieval scoping. Section generation currently receives the full corpus. A per-section retriever call that pulls only the chunks relevant to that section’s topic (e.g. “Household Composition”) would let fine-grained regulatory detail show through. The retriever exists at src/services/rag/retrieval.ts; this is a wiring exercise (sketched after this list).
  • Auto-evaluation retry loop. Once the judge is wired in, failed criteria can feed back into section generation as targeted edits.
  • Harder corpora. SNAP is well-represented in Sonnet’s pre-training. A state waiver form or a recent amendment the model has not seen would separate parametric knowledge from retrieved context more cleanly.
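
A hedged sketch of that per-section wiring, assuming a retrieve(query, options) helper exported from src/services/rag/retrieval.ts; the actual export and signature may differ.

```ts
// Assumption: retrieve() returns ranked chunks with a `text` field.
// The real API in src/services/rag/retrieval.ts may look different.
import { retrieve } from "../services/rag/retrieval";

async function corpusForSection(
  pageTitle: string,
  approvedCriteria: string[],
): Promise<string> {
  // Query with the page title plus its approved criteria so "Household
  // Composition" pulls household/eligibility chunks instead of the whole corpus.
  const query = [pageTitle, ...approvedCriteria].join("\n");
  const chunks: { text: string }[] = await retrieve(query, { topK: 5 });
  return chunks.map((chunk) => chunk.text).join("\n\n");
}
```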

Related

Runs
