From PDF to Structured Data

The core LLM integration takes a government PDF form and extracts a complete DataCollectionSpec: fields, types, grouping, conditions, and sensitivity classifications.

How It Works

Upload: Maya (form creator persona) uploads a PDF form
Extract: The system sends the PDF to Claude via Amazon Bedrock
Structure: Claude returns a structured DataCollectionSpec with fields, types, groups, conditions, and sensitivity classifications
Review: Maya reviews the extracted spec in the catalog browser
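
Before the review step, the model's reply usually needs a basic sanity check. The sketch below assumes the reply arrives as a JSON object with a "fields" list; that shape, the function name parse_spec_response, and the specific checks are illustrative assumptions, not the project's actual format. The call to Bedrock itself is omitted.

```python
import json

# Field types named in the extraction prompt.
ALLOWED_TYPES = {"text", "number", "date", "boolean", "select"}


def parse_spec_response(raw: str) -> dict:
    """Parse a model reply (assumed to be a JSON object) and run basic checks.

    Hypothetical helper: the real response shape may differ.
    """
    spec = json.loads(raw)
    fields = spec.get("fields")
    if not isinstance(fields, list) or not fields:
        raise ValueError("spec must contain a non-empty 'fields' list")

    seen = set()
    for field in fields:
        name = field.get("name")
        if not name:
            raise ValueError("every field needs a name")
        if name in seen:
            raise ValueError(f"duplicate field name: {name}")
        seen.add(name)
        if field.get("type") not in ALLOWED_TYPES:
            raise ValueError(
                f"field {name!r} has unsupported type {field.get('type')!r}"
            )
    return spec


# Invented sample reply, for illustration only.
sample_reply = (
    '{"fields": [{"name": "full_name", "label": "Full name", "type": "text"}]}'
)
spec = parse_spec_response(sample_reply)
```

Rejecting malformed replies early keeps obviously broken specs out of the catalog browser, so the reviewer can focus on extraction quality instead of formatting errors.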

The Prompt

The extraction prompt asks Claude to identify:

Fields: Name, label, type (text, number, date, boolean, select), help text
Groups: Logical grouping of related fields (e.g., “Personal Information”)
Conditions: When a field should appear based on other field values
Sensitivity: PII classification (name, SSN, address, etc.)
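
The four items above map naturally onto a small data model. The following is a minimal sketch, assuming a flat field list where each field carries its own group, condition, and sensitivity tag. The class and attribute names (FieldSpec, Condition, depends_on, and so on) are hypothetical, and the real DataCollectionSpec may be shaped differently; the example fields are invented and not taken from any actual form.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Condition:
    """Show a field only when another field has a given value (assumed form)."""
    depends_on: str
    equals: Any


@dataclass
class FieldSpec:
    name: str
    label: str
    type: str  # one of: text, number, date, boolean, select
    help_text: str = ""
    group: Optional[str] = None
    condition: Optional[Condition] = None
    sensitivity: Optional[str] = None  # PII category, e.g. "name" or "ssn"


@dataclass
class DataCollectionSpec:
    title: str
    fields: list[FieldSpec] = field(default_factory=list)

    def visible_fields(self, answers: dict[str, Any]) -> list[str]:
        """Names of fields whose condition, if any, is met by the answers."""
        return [
            f.name
            for f in self.fields
            if f.condition is None
            or answers.get(f.condition.depends_on) == f.condition.equals
        ]

    def sensitive_fields(self) -> list[str]:
        """Names of fields tagged with any PII category."""
        return [f.name for f in self.fields if f.sensitivity is not None]


# Invented example data.
example = DataCollectionSpec(
    title="Example form",
    fields=[
        FieldSpec("full_name", "Full name", "text",
                  group="Personal Information", sensitivity="name"),
        FieldSpec("has_prior_conviction", "Any prior convictions?", "boolean"),
        FieldSpec("conviction_date", "Date of conviction", "date",
                  condition=Condition("has_prior_conviction", True)),
    ],
)
```

With this shape, example.visible_fields({"has_prior_conviction": False}) leaves out conviction_date, and example.sensitive_fields() returns only full_name.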

Strategy Pattern

The extractor uses a strategy pattern (PdfExtractor interface) so implementations can be swapped for experimentation:

ApiPdfExtractor — Current implementation using Claude via Bedrock
Future: alternative models, different prompting strategies, chunking approaches
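
The swap-in idea can be sketched as follows. Only the names PdfExtractor and ApiPdfExtractor come from this page; the extract method, its signature, the dict return type, and the two stand-in strategies below are assumptions made for illustration. Neither stand-in calls a model.

```python
from typing import Protocol


class PdfExtractor(Protocol):
    """Strategy interface: anything with a matching extract method qualifies."""

    def extract(self, pdf_bytes: bytes) -> dict: ...


class EmptySpecExtractor:
    """Hypothetical stand-in strategy that returns a spec with no fields."""

    def extract(self, pdf_bytes: bytes) -> dict:
        return {"fields": []}


class SingleFieldExtractor:
    """Hypothetical stand-in strategy that returns one canned field."""

    def extract(self, pdf_bytes: bytes) -> dict:
        return {"fields": [{"name": "full_name", "type": "text"}]}


def build_spec(extractor: PdfExtractor, pdf_bytes: bytes) -> dict:
    """Caller depends only on the interface, so strategies are interchangeable."""
    return extractor.extract(pdf_bytes)


pdf = b"%PDF-1.7 placeholder bytes"
empty_spec = build_spec(EmptySpecExtractor(), pdf)
single_spec = build_spec(SingleFieldExtractor(), pdf)
```

Because build_spec only sees the interface, an experiment with a different model, prompt, or chunking approach becomes a new class rather than a change to the calling code.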

Test Form

The primary evaluation form is a 24-page DOJ Pardon Application — a complex, real-world government form with conditional logic, multiple sections, and various field types.

See: Experiments | Story #3