Forms Lab — Presentation

The Problem

Good forms require engineering, domain expertise, and UX focus
That combination is expensive and scarce
Sources of truth: policy (regulations, rules) and existing PDFs (layout, not structure)

The Thesis

Separate "what to collect" from "how to present it"

DataCollectionSpec

→

FormSpec

→

Submission

"Let LLMs handle the hard parts — behind well-defined interfaces"

Architecture

Service Layer

One-way dependency flow
Services own their types
Strategy pattern for variants

Deployment

Hono on Bun
NixOS on EC2
Every branch gets its own app

Development Process

"Claude Code as collaborator, not autocomplete"

Catalog as shared context: personas, stories, architecture, decisions
Session lifecycle: start-story → finish-story
Time invested in structure made LLM integrations plug in cleanly

LLM Integration Points

Extraction

PDF → structured fields, 11 variants

Shaping

NL → form edit commands, tool-use

Filling

Conversational completion, agent tools

Evaluation

LLM-as-judge, deterministic + semantic

Experimentation Framework

Strategy registry: plug in a variant, get evaluation for free
Runtime variant picker: toggle implementations without restart
Evaluation harness: recall, precision, type accuracy, sensitivity
Ground truth from Opus; judge handles semantic equivalence

Key Findings

A hybrid prompt wins PDF extraction. Hybrid prompt = one short instruction + one inline exemplar + temperature=0. It Pareto-dominates verbose-guideline and 3-exemplar few-shot variants: 73% recall, 99% precision, +24pp sensitivity over baseline Sonnet at the same cost.
RAG is load-bearing for policy-grounded form authoring. Given a 21-chunk SNAP regulatory corpus, the pipeline produces a 14-page form tracking the ground truth's topical structure. With an empty corpus it produces a single skeletal page. Recall doubles (4.7% → 10.6%); field coverage grows 20×.
Capability boundaries are real. Non-Anthropic small multimodal models (Nova Pro, Nova Lite, Llama 3.2 Vision) fail at field-level PDF extraction. Model selection beats prompt engineering once you pass a model's capability line.

Extraction Evaluation

Five Sonnet-based variants against three government-form fixtures. Hybrid v1 = one short instruction + one inline exemplar + temperature=0.

Variant	Recall	Precision	Sensitivity
Sonnet baseline	62%	79%	27%
Few-shot (3 exemplars)	55%	87%	21%
Tool-use	35%	96%	79%
Hybrid v1	73%	99%	51%
RAG-grounded	56%	93%	53%

RAG-Guided Form Authoring

Policy corpus in, evaluated form out

Policy Corpus

→

Criteria

→

Structure

→

Fields

→

Evaluation

Same command language as manual shaping
LLM-as-judge scores sections against regulatory criteria
User approves criteria before generation begins

What's Next

Conversational filling maturity
Cost optimization: flagship → smaller → local models
The framework makes each step an experiment, not a rewrite

Demo →