# Form Shaping: Claude Opus 4.6
Selectable in Settings → Variants → Shaping.
Status: experimental
## Summary
| Metric | Value |
|---|---|
| Command-Kind Recall | 73.3% |
| Command-Kind Precision | 83.3% |
| Argument Accuracy | 66.7% |
Run timestamp: 2026-04-19T15:37:22.121Z. Spec version: 2026-04-19.
## Approach
Uses `createBedrockFormShaper({ model: OPUS_MODEL_ID })` with the standard 25-command tool-use prompt. Opus is the frontier model, expected to deliver the highest quality at higher cost and latency. Each scripted intent is evaluated against expected `Command[]` output via the `shaping-commands` kind (deterministic: kind precision/recall plus argument accuracy, no LLM judge).
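The deterministic scorer described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the `Command` shape and the first-match pairing strategy are assumptions.

```typescript
// Sketch of a deterministic shaping-commands scorer: kind recall/precision
// plus argument accuracy over matched commands. The Command shape and the
// greedy first-match pairing are illustrative assumptions.
type Command = { kind: string; args: Record<string, unknown> };

function scoreIntent(expected: Command[], actual: Command[]) {
  const remaining = [...expected];
  let kindMatches = 0;
  let argMatches = 0;
  for (const cmd of actual) {
    const i = remaining.findIndex((e) => e.kind === cmd.kind);
    if (i === -1) continue; // extra command: hurts precision only
    kindMatches++;
    if (JSON.stringify(remaining[i].args) === JSON.stringify(cmd.args)) {
      argMatches++;
    }
    remaining.splice(i, 1); // each expected command matches at most once
  }
  return {
    recall: expected.length ? kindMatches / expected.length : 1,
    precision: actual.length ? kindMatches / actual.length : 1,
    argAccuracy: kindMatches ? argMatches / kindMatches : 0,
  };
}
```

Under this scheme, an output of `[mergePages, renamePage]` against an expected `[mergePages]` scores 100% recall but 50% precision, matching the merge-employment row in the table below.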
## Per-intent Results
| Intent | Recall | Precision | Arg Acc | Matched | Missing | Extra |
|---|---|---|---|---|---|---|
| swap-pages | 100% | 100% | 100% | swapPages | — | — |
| merge-employment | 100% | 50% | 100% | mergePages | — | renamePage |
| optional-middle-name | 100% | 100% | 100% | setRequired | — | — |
| move-military | 0% | 100% | 0% | — | moveGroup | — |
| rename-personal-info | 100% | 50% | 100% | renamePage | — | renameGroup |
| suggest-delivery-modes | 40% | 100% | 0% | setDeliveryMode, setDeliveryMode | setDeliveryMode, setDeliveryMode, setDeliveryMode | — |
## Findings
- **Best-in-suite on entity resolution, edge case on multi-step.** Opus is the only model that correctly maps “personal info” to `renamePage` and produces a `renamePage` command for the scripted rename intent, though with an extra `renameGroup` also emitted, hence 50% precision on that case. The move-military intent still fails across all three models, suggesting the failure lives in the prompt’s handling of quoted group names rather than in any single model’s capability.
- **Delivery-mode recall regresses vs Sonnet/Haiku.** On `suggest-delivery-modes`, Opus produces only 2 of the 5 expected `setDeliveryMode` commands. The model appears to take “suggest delivery modes” more literally as “pick the ones that need suggestion”: an interpretation error, not a capability gap. Sonnet and Haiku both over-apply `setDeliveryMode` to all 5 pages and accidentally line up with the scripted ground truth.
- **Highest overall quality, at ~2x the latency.** The Opus run took 19.8s vs 15.9s for Sonnet and 10.8s for Haiku. For interactive shaping UX, this is the Haiku-vs-Sonnet tradeoff from Assignment 10 all over again: the simplest intents are handled equivalently by the cheapest model; only the rare multi-step or entity-resolution cases need the frontier model, and even then the scorer rewards Sonnet for emitting more commands.
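One way to act on that tradeoff is tiered routing: send simple intents to the cheapest model and reserve the frontier model for multi-step or entity-resolution cases. The sketch below is hypothetical; the complexity signal (counting shaping verbs) is an illustrative assumption, not anything the suite implements.

```typescript
// Hypothetical tiered routing for shaping intents. The heuristic signal
// (multiple shaping verbs, or explicit sequencing words) is an assumption
// made for illustration only.
type ModelTier = "haiku" | "sonnet" | "opus";

function pickShapingModel(intent: string): ModelTier {
  const verbs = intent.match(/\b(rename|merge|move|swap|split)\b/gi) ?? [];
  const multiStep = verbs.length > 1 || /\b(and then|then)\b/i.test(intent);
  return multiStep ? "opus" : "haiku";
}
```

A router like this would keep latency near Haiku's 10.8s for the common case while paying Opus's ~20s only where it is likely to matter.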
## Cost
Bedrock on-demand pricing for Claude Opus 4.6 (model id `us.anthropic.claude-opus-4-6-v1`): ~$15.00 per 1M input tokens, ~$75.00 per 1M output tokens. Each scripted intent is a single short tool-calling turn (~2k input tokens, ~200 output tokens), so the full run works out to roughly 6 × ($0.03 + $0.015) ≈ $0.27 for all six intents.
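The back-of-envelope arithmetic, worked from the per-token prices and token counts quoted above (both are approximations from this report, not measured billing data):

```typescript
// Estimate run cost from the approximate figures above:
// ~$15 / 1M input tokens, ~$75 / 1M output tokens,
// ~2k input + ~200 output tokens per intent, 6 intents.
const INPUT_PRICE_PER_TOKEN = 15 / 1_000_000;
const OUTPUT_PRICE_PER_TOKEN = 75 / 1_000_000;

function runCost(intents: number, inputTokens: number, outputTokens: number): number {
  return intents * (inputTokens * INPUT_PRICE_PER_TOKEN + outputTokens * OUTPUT_PRICE_PER_TOKEN);
}

const total = runCost(6, 2_000, 200); // ≈ $0.27
```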
A digital services project by Flexion