A digital services project by Flexion

Form Shaping: Claude Opus 4.6

Selectable in Settings → Variants → Shaping.

Status: experimental

Summary

| Metric | Value |
| --- | --- |
| Command-Kind Recall | 73.3% |
| Command-Kind Precision | 83.3% |
| Argument Accuracy | 66.7% |

Run timestamp: 2026-04-19T15:37:22.121Z. Spec version: 2026-04-19.

Approach

Uses `createBedrockFormShaper({ model: OPUS_MODEL_ID })` with the standard 25-command tool-use prompt. Opus is the frontier model, expected to deliver the highest quality at the highest cost and latency. Each scripted intent is evaluated against its expected `Command[]` output via the `shaping-commands` kind (deterministic: kind precision/recall plus argument accuracy, no LLM judge).
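The deterministic scorer can be sketched as follows. The `Command` shape (`kind` plus an `args` record), the greedy multiset match on kind, and the JSON-stringify argument comparison are all assumptions for illustration, not the project's actual implementation:

```typescript
// Hypothetical command shape; the real project's type may differ.
type Command = { kind: string; args: Record<string, unknown> };

interface ShapingScore {
  recall: number;      // matched expected kinds / total expected kinds
  precision: number;   // matched emitted kinds / total emitted (vacuously 1 if none emitted)
  argAccuracy: number; // matched pairs whose args compare equal / matched pairs
}

function scoreCommands(expected: Command[], actual: Command[]): ShapingScore {
  // Greedy multiset match: each emitted command satisfies at most one expected one.
  const remaining = [...actual];
  const matchedPairs: Array<[Command, Command]> = [];
  for (const exp of expected) {
    const i = remaining.findIndex((a) => a.kind === exp.kind);
    if (i !== -1) {
      matchedPairs.push([exp, remaining[i]]);
      remaining.splice(i, 1);
    }
  }
  const matched = matchedPairs.length;
  // Simplified arg comparison: JSON.stringify is key-order-sensitive.
  const argsOk = matchedPairs.filter(
    ([e, a]) => JSON.stringify(e.args) === JSON.stringify(a.args),
  ).length;
  return {
    recall: expected.length ? matched / expected.length : 1,
    precision: actual.length ? matched / actual.length : 1,
    argAccuracy: matched ? argsOk / matched : 0,
  };
}
```

Under this scheme an intent that emits nothing (as in the move-military row below) scores 0% recall but a vacuous 100% precision, and extra commands dilute precision without touching recall.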

Per-intent Results

| Intent | Recall | Precision | Arg Acc | Matched | Missing | Extra |
| --- | --- | --- | --- | --- | --- | --- |
| swap-pages | 100% | 100% | 100% | swapPages | — | — |
| merge-employment | 100% | 50% | 100% | mergePages | — | renamePage |
| optional-middle-name | 100% | 100% | 100% | setRequired | — | — |
| move-military | 0% | 100% | 0% | — | moveGroup | — |
| rename-personal-info | 100% | 50% | 100% | renamePage | — | renameGroup |
| suggest-delivery-modes | 40% | 100% | 0% | setDeliveryMode, setDeliveryMode | setDeliveryMode, setDeliveryMode, setDeliveryMode | — |

Findings

  • Best-in-suite on entity resolution, edge case on multi-step. Opus is the only model that correctly maps “personal info” to renamePage and produces a renamePage command for the scripted rename intent, though it also emits an extra renameGroup, hence the 50% precision on that case. The move-military intent still fails across all three models, which suggests the failure lives in the prompt’s handling of quoted group names rather than in any single model’s capability.
  • Delivery-mode recall regresses vs Sonnet/Haiku. On suggest-delivery-modes, Opus produces only 2 of the 5 expected setDeliveryMode commands. The model appears to read “suggest delivery modes” more literally as “pick the ones that need a suggestion”: an interpretation error, not a capability gap. Sonnet and Haiku both over-apply setDeliveryMode to all 5 pages and so accidentally line up with the scripted ground truth.
  • Highest overall quality, at ~2x the latency. The Opus run took 19.8s vs Sonnet’s 15.9s and Haiku’s 10.8s. For interactive shaping UX, this is the Haiku-vs-Sonnet tradeoff from Assignment 10 all over again: the simplest intents are handled equivalently by the cheapest model; only the rare multi-step or entity-resolution cases need the frontier model, and even then the scorer rewards Sonnet for emitting more commands.

Cost

Bedrock on-demand pricing for Claude Opus 4.6 (model id `us.anthropic.claude-opus-4-6-v1`): ~$15.00 per 1M input tokens, ~$75.00 per 1M output tokens. Each scripted intent is a single short tool-calling turn (~2k input tokens, ~200 output tokens), giving ~$0.045 per intent and a total run cost of ~$0.27 for all six intents.
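The total follows directly from the quoted per-token rates; `runCost` below is just illustrative arithmetic using the report's own rate and token-count figures:

```typescript
// On-demand rates quoted above, in USD per token.
const INPUT_RATE = 15 / 1_000_000;
const OUTPUT_RATE = 75 / 1_000_000;

// Total cost of a run: each intent is one short tool-calling turn.
function runCost(intents: number, inputTokens: number, outputTokens: number): number {
  return intents * (inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE);
}

// 6 intents * (2,000 * $15/1M + 200 * $75/1M) = 6 * $0.045 = $0.27
```

Note that output tokens dominate per-token price (5x the input rate) but not per-turn cost here, since each turn emits only ~200 tokens against ~2k of prompt.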
