Forms Lab

LLM-Assisted Forms Platform

Daniel Naab

The Problem

The Thesis

Separate "what to collect" from "how to present it"

DataCollectionSpec
FormSpec
Submission

"Let LLMs handle the hard parts — behind well-defined interfaces"

Architecture

Service Layer

  • One-way dependency flow
  • Services own their types
  • Strategy pattern for variants (sketched below)
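A sketch of what the strategy pattern might look like for the extraction variants; the interface and registry are assumptions, not the service's actual API.

  // Hypothetical strategy interface; names are illustrative.
  interface ExtractedField { name: string; label: string; type: string }

  interface ExtractionStrategy {
    readonly name: string;
    extract(pdf: Uint8Array): Promise<ExtractedField[]>;
  }

  // Variants register by name; callers pick one without depending
  // on its internals, preserving the one-way dependency flow.
  const strategies = new Map<string, ExtractionStrategy>();

  function getStrategy(name: string): ExtractionStrategy {
    const s = strategies.get(name);
    if (!s) throw new Error(`unknown extraction strategy: ${name}`);
    return s;
  }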

Deployment

  • Hono on Bun (entrypoint sketched below)
  • NixOS on EC2
  • Every branch gets its own app
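For context, a minimal Hono-on-Bun entrypoint; the route is a placeholder, not the app's real surface.

  import { Hono } from "hono";

  const app = new Hono();

  // Placeholder route; the real app exposes more than this.
  app.get("/health", (c) => c.json({ ok: true }));

  // Bun serves the default export's fetch handler.
  export default app;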

Development Process

"Claude Code as collaborator, not autocomplete"

LLM Integration Points

Extraction

PDF → structured fields, 11 variants
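A sketch of the PDF → structured fields call using the Anthropic SDK's document content block; the prompt wording and model id are assumptions, and the evaluated variants differ from this.

  import Anthropic from "@anthropic-ai/sdk";

  const client = new Anthropic();

  async function extractFields(pdfBase64: string) {
    const msg = await client.messages.create({
      model: "claude-sonnet-4-5", // placeholder model id
      max_tokens: 2048,
      temperature: 0,
      messages: [{
        role: "user",
        content: [
          { type: "document",
            source: { type: "base64", media_type: "application/pdf", data: pdfBase64 } },
          { type: "text",
            text: "List every fillable field in this form as JSON: [{name, label, type}]." },
        ],
      }],
    });
    return msg.content;
  }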

Shaping

Natural language → form edit commands, via tool-use
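Shaping might expose edit commands as tools along these lines; the command set and schemas are hypothetical.

  // Hypothetical form edit tools; the real command set may differ.
  const formEditTools = [
    {
      name: "add_field",
      description: "Add a field to the form",
      input_schema: {
        type: "object" as const,
        properties: {
          name: { type: "string" },
          label: { type: "string" },
          fieldType: { type: "string", enum: ["text", "number", "date"] },
        },
        required: ["name", "label", "fieldType"],
      },
    },
    {
      name: "remove_field",
      description: "Remove a field from the form",
      input_schema: {
        type: "object" as const,
        properties: { name: { type: "string" } },
        required: ["name"],
      },
    },
  ];
  // Passed as `tools` to messages.create; each tool_use block in the
  // reply maps one-to-one onto a form edit command.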

Filling

Conversational completion, agent tools
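A sketch of the agent loop that conversational filling implies; executeTool and the model id are hypothetical stand-ins.

  import Anthropic from "@anthropic-ai/sdk";

  // Hypothetical tool executor supplied by the host application.
  declare function executeTool(name: string, input: unknown): Promise<unknown>;

  async function fillLoop(client: Anthropic, tools: Anthropic.Messages.Tool[]) {
    const messages: Anthropic.Messages.MessageParam[] = [
      { role: "user", content: "Help me complete this form." },
    ];
    while (true) {
      const msg = await client.messages.create({
        model: "claude-sonnet-4-5", // placeholder model id
        max_tokens: 1024,
        tools,
        messages,
      });
      if (msg.stop_reason !== "tool_use") return msg;
      messages.push({ role: "assistant", content: msg.content });
      // Run every requested tool, then return the results in one user turn.
      const results: Anthropic.Messages.ToolResultBlockParam[] = [];
      for (const block of msg.content) {
        if (block.type === "tool_use") {
          results.push({
            type: "tool_result",
            tool_use_id: block.id,
            content: JSON.stringify(await executeTool(block.name, block.input)),
          });
        }
      }
      messages.push({ role: "user", content: results });
    }
  }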

Evaluation

LLM-as-judge, deterministic + semantic
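One way the deterministic + semantic split might look; the rubric and yes/no protocol are assumptions.

  import Anthropic from "@anthropic-ai/sdk";

  // Deterministic: exact-match recall over expected field names.
  function fieldRecall(expected: string[], got: string[]): number {
    return expected.filter((f) => got.includes(f)).length / expected.length;
  }

  // Semantic: an LLM judge decides whether two fields ask for the same thing.
  async function sameField(client: Anthropic, a: string, b: string): Promise<boolean> {
    const msg = await client.messages.create({
      model: "claude-sonnet-4-5", // placeholder model id
      max_tokens: 8,
      temperature: 0,
      messages: [{
        role: "user",
        content: `Do these form fields ask for the same information?\nA: ${a}\nB: ${b}\nAnswer yes or no.`,
      }],
    });
    const first = msg.content[0];
    return first.type === "text" && first.text.trim().toLowerCase().startsWith("yes");
  }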

Experimentation Framework

Key Findings

  1. A hybrid prompt wins PDF extraction. Hybrid prompt = one short instruction + one inline exemplar + temperature=0 (sketched after this list). It Pareto-dominates the verbose-guideline and 3-exemplar few-shot variants: 73% recall, 99% precision, +24pp sensitivity over baseline Sonnet at the same cost.
  2. RAG is load-bearing for policy-grounded form authoring. Given a 21-chunk SNAP regulatory corpus, the pipeline produces a 14-page form tracking the ground truth's topical structure. With an empty corpus it produces a single skeletal page. Recall more than doubles (4.7% → 10.6%); field coverage grows 20×.
  3. Capability boundaries are real. Non-Anthropic small multimodal models (Nova Pro, Nova Lite, Llama 3.2 Vision) fail at field-level PDF extraction. Once a task falls outside a model's capability boundary, model selection beats prompt engineering.
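The hybrid variant's prompt shape, as finding 1 describes it; the exact wording here is an assumption.

  // One short instruction plus one inline exemplar, sent at temperature 0
  // (it rides in the text block of the extraction call sketched earlier).
  const hybridPrompt = [
    "Extract every fillable field from the attached form.",
    "Return JSON shaped like this example:",
    JSON.stringify([{ name: "applicant_name", label: "Applicant name", type: "text" }]),
  ].join("\n");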

Extraction Evaluation

Five Sonnet-based variants against three government-form fixtures. Hybrid v1 = one short instruction + one inline exemplar + temperature=0.

Variant                  Recall   Precision   Sensitivity
Sonnet baseline          62%      79%         27%
Few-shot (3 exemplars)   55%      87%         21%
Tool-use                 35%      96%         79%
Hybrid v1                73%      99%         51%
RAG-grounded             56%      93%         53%

RAG-Guided Form Authoring

Policy corpus in, evaluated form out

Policy Corpus → Criteria → Structure → Fields → Evaluation
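The stages might compose roughly like this; every function here is a hypothetical stand-in, and FormSpec is the interface sketched under The Thesis.

  interface Chunk { id: string; text: string }

  // Hypothetical stage functions; signatures are stand-ins.
  declare function retrieve(query: string, corpus: Chunk[], k: number): Chunk[];
  declare function deriveCriteria(chunks: Chunk[]): Promise<string[]>;
  declare function deriveStructure(criteria: string[]): Promise<FormSpec>;
  declare function populateFields(spec: FormSpec, chunks: Chunk[]): Promise<FormSpec>;
  declare function evaluateForm(spec: FormSpec): Promise<number>;

  async function authorForm(corpus: Chunk[]): Promise<FormSpec> {
    const grounding = retrieve("eligibility and required information", corpus, 10);
    const criteria = await deriveCriteria(grounding);        // Criteria
    const skeleton = await deriveStructure(criteria);        // Structure
    const form = await populateFields(skeleton, grounding);  // Fields
    await evaluateForm(form);                                // Evaluation
    return form;
  }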

What's Next
