A digital services project by Flexion

Experiment Roadmap

This is the live roadmap for the Forms Lab LLM experiments. Each experiment is a user-story that ships a new variant (or a new task suite) that Maya or Carlos can select at runtime via Settings → Variants. The picker, provenance convention, and evaluation harness are already in place; this roadmap tracks the breadth of variants we’re building on top of them.

How to read this

  • Status — planned (issue filed), in-progress (branch open), shipped (merged to main), scope-deferred (intentionally not shipping; capture why).
  • Ships — the variant(s) or infrastructure that become user-selectable when the story lands.
  • Depends on — what has to merge first.
  • Catalog pages — where findings land when the story ships.

Trunk

| Story | Status | Ships |
| --- | --- | --- |
| #10 Maya chooses how her form was extracted | shipped (PR #58) | variant picker, preferences service, provenance convention, `<VariantBadge>`, expanded fixtures (I-9, W-9), registry-driven extraction |
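
The trunk's registry-driven extraction and provenance convention can be pictured as a mapping from variant IDs to implementations, with every result stamped with the variant that produced it (which is what a `<VariantBadge>` renders). A minimal sketch with hypothetical names — `REGISTRY`, `register`, `extract`, and the stub variant are illustrative, not the actual service code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExtractionResult:
    fields: list
    # Provenance convention: every result records which variant made it.
    variant_id: str

# Registry-driven extraction: adding a variant is one registry entry,
# and the picker can list REGISTRY keys at runtime.
REGISTRY: dict[str, Callable[[bytes], list]] = {}

def register(variant_id: str):
    def wrap(fn):
        REGISTRY[variant_id] = fn
        return fn
    return wrap

@register("extraction/baseline-sonnet")
def baseline(pdf_bytes: bytes) -> list:
    # Placeholder: a real variant would prompt the model here.
    return []

def extract(pdf_bytes: bytes, variant_id: str) -> ExtractionResult:
    fields = REGISTRY[variant_id](pdf_bytes)
    return ExtractionResult(fields=fields, variant_id=variant_id)
```

The payoff of this shape is that the picker, the evaluation harness, and the catalog all key off the same variant ID string.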

Picker tabs — parallel after #10 merges

| Story | Status | Ships | Catalog |
| --- | --- | --- | --- |
| #59 Maya chooses her shaping model | shipped | shaping eval-kind; shaping/haiku, shaping/sonnet, shaping/opus; shaping tab in picker | /catalog/experiments/shaping-model-comparison/ (new suite). Scoring is deterministic (command-kind precision/recall + arg accuracy). Live metrics landed via #75: opus 73/83/67%, sonnet 67/75/62%, haiku 67/83/62%. Model size is not the dominant lever for this suite. |
| #75 Run live shaping model evaluation | shipped (PR #77) | `evaluate shaping <variant-id>` CLI subcommand; live metrics on all three shaping variants | /catalog/experiments/shaping-model-comparison/{haiku,sonnet,opus}.md populated. Finding: three of six intents are at ceiling across all models; two fail across all models — the bottleneck is prompt disambiguation (`renamePage` vs `renameGroup`, quoted group-name resolution), not model size. |
| #60 Carlos’s conversation uses a chosen model | planned | filling eval-kind (personas + scripts); filling/haiku, filling/sonnet, filling/opus; filling tab in picker | /catalog/experiments/filling-model-comparison/ (new suite) |
| #61 Maya verifies AcroForm mapping | planned | field-mapping eval-kind; field-mapping/haiku, field-mapping/sonnet, field-mapping/opus; mapping tab in picker | /catalog/experiments/field-mapping/ (new suite) |
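
The deterministic scoring named for the shaping suite (command-kind precision/recall plus arg accuracy) needs no model in the loop. A rough sketch under stated assumptions — the function name, tuple shape, and position-aligned arg comparison are hypothetical, not the harness's actual API:

```python
from collections import Counter

def score_shaping(predicted, expected):
    """Score one shaping transcript deterministically.

    predicted / expected: lists of (command_kind, args_dict) pairs.
    Returns (precision, recall, arg_accuracy) over command kinds,
    mirroring the three numbers reported per model.
    """
    pred_kinds = Counter(kind for kind, _ in predicted)
    exp_kinds = Counter(kind for kind, _ in expected)
    # Multiset intersection: a kind counts as correct only as many
    # times as it appears in the expected transcript.
    tp = sum((pred_kinds & exp_kinds).values())
    precision = tp / max(sum(pred_kinds.values()), 1)
    recall = tp / max(sum(exp_kinds.values()), 1)

    # Arg accuracy: exact-match args on position-aligned commands
    # whose kinds already agree.
    aligned = [
        (p_args, e_args)
        for (p_kind, p_args), (e_kind, e_args) in zip(predicted, expected)
        if p_kind == e_kind
    ]
    arg_acc = sum(p == e for p, e in aligned) / len(aligned) if aligned else 0.0
    return precision, recall, arg_acc
```

Under this kind of scorer, a `renamePage` emitted where `renameGroup` was expected scores zero on all three axes — which is consistent with the #75 finding that disambiguating those two intents, not model size, is the bottleneck.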

Capability stories — parallel after #10 merges

| Story | Status | Ships | Catalog |
| --- | --- | --- | --- |
| #62 / #74 Maya’s extractions cite the law | shipped (PR #78) | RAG primitive (embeddings + cosine store), policy corpus, extraction/sonnet-with-rag | /catalog/experiments/pdf-field-extraction/sonnet-with-rag.md. Finding: sensitivity accuracy +25.3pp, precision +13.6pp, recall -5.7pp. Grounding in CFR/USC text is the largest sensitivity win we’ve measured outside of tool-use. |
| #63 Maya’s extractions learn from curated examples | shipped | extraction/few-shot-sonnet, extraction/nova-pro | /catalog/experiments/pdf-field-extraction/few-shot-sonnet.md. Finding: precision +7.6pp over baseline; recall -6.8pp. Few-shot helps the model be selective, not comprehensive. Nova Pro (non-Claude) fails at field-level extraction entirely. |
| #64 Maya’s extractions use a tuned prompt | scope-deferred | (none) | Superseded by #73. The hybrid-v1 and temperature=0 variants cover the “tuned prompt” surface with stronger metrics than a dedicated prompt-opt harness would produce. |
| #73 Maya’s extractions use temperature + hybrid prompt | shipped (PR #76) | extraction/sonnet-temperature-zero, extraction/sonnet-hybrid-v1 | /catalog/experiments/pdf-field-extraction/sonnet-temperature-zero.md, /catalog/experiments/pdf-field-extraction/sonnet-hybrid-v1.md. Finding: hybrid-v1 wins the suite (precision 99.2%, recall 72.6%, sensitivity +23.8pp over baseline). Temp=0 alone is the largest precision lever (+15.1pp) but costs recall. Assignment 10’s prompt-shape rank ordering (hybrid > few-shot > verbose) reproduces on Sonnet. |
| #65 Maya’s extractions use our fine-tuned model | scope-deferred | (none) | /catalog/experiments/pdf-field-extraction/lora-scope-deferral.md. Deferred: prompt engineering achieves target quality on Claude; LoRA adds cost without new learning for this deadline. |
| #66 Maya extracts via structured tool-use | shipped | extraction/tool-use-sonnet | /catalog/experiments/pdf-field-extraction/tool-use-sonnet.md. Finding: precision 96.3%, sensitivity +51pp. Constrained output eliminates hallucination at the cost of recall (step-limit). |
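
The RAG primitive from #62/#74 (embeddings plus a cosine store over the policy corpus) reduces to a small amount of linear algebra once embeddings exist. A minimal in-memory sketch, assuming vectors are precomputed by some embeddings model; the class and method names are illustrative, not the project's actual API:

```python
import math

class CosineStore:
    """In-memory store that retrieves corpus chunks by cosine similarity.

    Embedding is assumed to happen elsewhere; this class only handles
    storage and nearest-neighbour lookup over (text, vector) pairs.
    """

    def __init__(self):
        self._items = []  # list of (chunk_text, embedding_vector)

    def add(self, text, vector):
        self._items.append((text, vector))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def top_k(self, query_vector, k=3):
        # Rank every stored chunk by similarity to the query embedding.
        ranked = sorted(
            self._items,
            key=lambda item: self._cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]
```

In a setup like this, the retrieved CFR/USC chunks get prepended to the extraction prompt, which is the mechanism behind the sensitivity-accuracy gain reported for extraction/sonnet-with-rag.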

Priority tiers

Tier 1 — must land for presentation. Shipped. #10 trunk (variant picker) plus #59 shaping-model picker, #63 few-shot, #66 tool-use.

Tier 2 — should land. Shipped. #73 prompt optimization (hybrid + temperature=0), #74 RAG (with policy corpus), #75 live shaping eval runner.

Tier 3 — aspirational. #60 filling-model picker and #65 LoRA remain deferred — neither is required for the rubric. #65 is a documented scope-deferral; #60 is feasible post-presentation if a filling-comparison narrative is wanted.

Conventions

  1. Every variant story ends at the catalog. The definition of done includes a markdown page in the appropriate suite directory with metrics, approach, and tradeoffs. No catalog entry → not done.
  2. Evaluation is mandatory. A variant without an evaluation run doesn’t get merged. If the required evaluation is qualitative, the catalog page explains the method.
  3. Status lives here. When a story lands, this file updates: status becomes shipped, and a one-line finding is added to the row.

Governance

  • Issues tracked under the Final Project milestone, label user-story (plus llm-integration where applicable).
  • Design and plan docs land in notes/story-<N>-<short-name>/ before implementation.
  • See the architecture principles for how these variants respect the layered service design.
