From PDF to Structured Data
The core LLM integration: given a government PDF form, extract a complete DataCollectionSpec — fields, types, grouping, conditions, and sensitivity classifications.
How It Works
- Upload: Maya (form creator persona) uploads a PDF form
- Extract: The system sends the PDF to Claude via Amazon Bedrock
- Structure: Claude returns a structured DataCollectionSpec with fields, types, groups, conditions
- Review: Maya reviews the extracted spec in the catalog browser
The Prompt
The extraction prompt asks Claude to identify:
- Fields: Name, label, type (text, number, date, boolean, select), help text
- Groups: Logical grouping of related fields (e.g., “Personal Information”)
- Conditions: When a field should appear based on other field values
- Sensitivity: PII classification (name, SSN, address, etc.)
Strategy Pattern
The extractor uses a strategy pattern (PdfExtractor interface) so implementations can be swapped for experimentation:
ApiPdfExtractor— Current implementation using Claude via Bedrock- Future: alternative models, different prompting strategies, chunking approaches
Test Form
The primary evaluation form is a 24-page DOJ Pardon Application — a complex, real-world government form with conditional logic, multiple sections, and various field types.
See: Experiments | Story #3
A digital services project by Flexion