U.S. flagA digital services project by Flexion

closed
llm-integration
GitHub #75

User Story

As Maya, in order to see quantitative benchmarks for the shaping model variants, I want a CLI runner that evaluates all three shaping models against the scripted intent suite.

Context

Story #59 shipped the shaping-commands evaluation kind and 6 scripted intents with expected outputs. The scoring logic is tested. What’s missing is a CLI command to actually call each model with the intents and record results.

Acceptance Criteria

  • CLI command: bun run cli evaluate shaping run <variant-id>
  • Runs all 6 scripted intents against the selected shaping variant
  • Writes results to catalog/experiments/shaping-model-comparison/{variant}.json
  • Generates markdown catalog page with metrics
  • All three variants evaluated: bedrock-haiku, bedrock-sonnet, bedrock-opus
  • Catalog pages updated with real metrics

Definition of Done

  • Tests pass
  • All three variants evaluated
  • Catalog pages populated