Extract structured data from any PDF using Claude AI. Supports multiple document types — each with its own schema, prompt, enrichment rules, validation rules, and output formats — all configurable without code changes.
| Tab | Purpose |
|---|---|
| Documents | Upload PDFs (or ZIPs), select document type, view status |
| Extract | Run AI extraction, edit JSON results, reprocess (free), download CSV. Select document type or auto-classify. Load Previous reloads from DB without API cost. |
| Bulk | Upload multiple PDFs or ZIPs, extract all, download combined CSV. Supports ZIP archives (ignores hidden/dot files from Mac). |
| Rules | Configure keywords, thresholds, enrichment rules, validation rules, and prompts — per document type |
| Doc Types | Create, edit, clone, publish, and Generate from PDF (AI builds schema + rules from a sample) |
| Review | Compare extraction vs ground truth, save corrections for regression testing |
| Regression | Run extraction on all ground-truth documents and compare results automatically |
```
PDF Upload
    |
    v
1. Classification (Haiku ~$0.002) — auto-detects document type from all active types in DB
   Uses: type descriptions + classification hints keywords
    |
    v
2. Segmentation (free) — splits multi-account PDFs into separate segments
    |
    v
3. Extraction (Sonnet ~$0.04) — reads PDF natively, extracts structured JSON per schema
    |
    v
4. Enrichment (free) — 14 rule types for post-processing
    |
    v
5. Validation (free) — 17 rule types for quality checks
    |
    v
6. Recovery (Haiku ~$0.002, only if validation fails) — re-extracts failed fields
    |
    v
7. Output (free) — CSV via output templates + JSON
```
Typical cost: ~$0.04 per PDF. Steps 4-7 can be re-run free via the Reprocess button.
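The per-step costs above can be sanity-checked with a short sketch (illustrative only: the step names and dollar figures come from the diagram, and the real accounting happens server-side):

```python
# Approximate per-step API costs (USD) from the pipeline diagram.
# "recovery" runs only when validation fails.
STEP_COSTS = {
    "classification": 0.002,  # Haiku
    "segmentation": 0.0,
    "extraction": 0.04,       # Sonnet
    "enrichment": 0.0,
    "validation": 0.0,
    "recovery": 0.002,        # Haiku, conditional
    "output": 0.0,
}

def estimate_cost(validation_failed: bool = False) -> float:
    """Estimate total API cost for one PDF through the pipeline."""
    cost = sum(c for step, c in STEP_COSTS.items() if step != "recovery")
    if validation_failed:
        cost += STEP_COSTS["recovery"]
    return round(cost, 4)

print(estimate_cost())      # 0.042
print(estimate_cost(True))  # 0.044
```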
Two approaches to creating a document type: manual (define the schema and rules yourself) or AI-assisted (Generate from PDF builds them from a sample).
Clone: Copy an existing type (including all rules) as a starting point.
Schema field types: string, float, int, bool, date, array (with nested item_fields)
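A hypothetical schema fragment using these field types (the key names, including `item_fields`, follow the wording above and are assumptions about the exact request shape, not a verified payload):

```python
# Illustrative schema fragment; key names are assumptions.
schema_fields = [
    {"name": "invoice_number", "type": "string", "required": True},
    {"name": "issue_date", "type": "date"},
    {"name": "total", "type": "float"},
    {"name": "is_credit_note", "type": "bool"},
    {"name": "line_items", "type": "array", "item_fields": [
        {"name": "description", "type": "string"},
        {"name": "quantity", "type": "int"},
        {"name": "amount", "type": "float"},
    ]},
]

# Every field uses one of the six supported types.
assert all(f["type"] in {"string", "float", "int", "bool", "date", "array"}
           for f in schema_fields)
```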
Run after extraction to clean, transform, and normalize data. All free — no API calls.
| Rule Type | What It Does | Example Parameters |
|---|---|---|
| field_transform | Transform a field value in-place | strip_chars, trim, to_uppercase, regex_replace, pad_left, date_format |
| value_conversion | Convert numeric values | divide_by, multiply_by, round, negate, to_number |
| row_filter | Exclude/include array items | exclude_if_name_matches, include_only_if, exclude_if_value_null |
| field_injection | Set or copy field values | set_if_null, set_always, copy_if_null, calculate_if_null |
| rename_value | Normalize field values | replace_exact (mapping table), replace_contains |
| calculation | Cross-field calculations | sum_and_set, evaluate_expression ("qty * price") |
| currency_detect | Detect currency from symbols | source_field: scan for currency symbols, set GBP/USD/EUR |
| date_normalize | Parse dates from any format | 15 format patterns: "13 Mar 2026", "03/13/26", "13.03.2026" etc. |
| lookup_map | Map values via lookup table | mappings: {"uk": "United Kingdom"}, case_insensitive, output_field |
| array_sort | Sort array items by field | sort_by: "amount", descending: true |
| array_aggregate | Sum/count/min/max/avg across array | source_path: "items[].amount", operation: "sum" |
| string_concat | Concatenate multiple fields | fields: ["first_name", "last_name"], separator: " " |
| conditional_set | Set field based on if/else chain | conditions: [{condition: {field, op, value}, set_value}], default |
| array_deduplicate | Remove duplicate array items | compare_fields: ["name", "amount"] (or all fields) |
Target paths: `vendor_name`, `items[].amount`, `bills[].charges[].total`
Conditions: `{"field": "measure", "op": "equals", "value": "MWh"}` — operators: equals, not_equals, contains, gt, lt, gte, lte, is_null, is_not_null, matches_regex, in_set
Priority: lower = runs first. Use 10, 20, 30... for easy reordering.
`keywords_ref`: Reference a shared keyword set instead of inline keywords.
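As an illustration, a condition object with the operators listed above could be evaluated like this (a minimal sketch, not the tool's actual rule engine):

```python
import re

def eval_condition(data: dict, cond: dict) -> bool:
    """Evaluate a condition like {"field": "measure", "op": "equals", "value": "MWh"}."""
    actual = data.get(cond["field"])
    op, expected = cond["op"], cond.get("value")
    if op == "is_null":
        return actual is None
    if op == "is_not_null":
        return actual is not None
    if actual is None:  # missing values fail every other operator
        return False
    ops = {
        "equals": lambda: actual == expected,
        "not_equals": lambda: actual != expected,
        "contains": lambda: str(expected) in str(actual),
        "gt": lambda: actual > expected,
        "lt": lambda: actual < expected,
        "gte": lambda: actual >= expected,
        "lte": lambda: actual <= expected,
        "matches_regex": lambda: re.search(expected, str(actual)) is not None,
        "in_set": lambda: actual in expected,
    }
    return ops[op]()

row = {"measure": "MWh", "amount": 120.0}
assert eval_condition(row, {"field": "measure", "op": "equals", "value": "MWh"})
assert eval_condition(row, {"field": "amount", "op": "gt", "value": 100})
assert not eval_condition(row, {"field": "supplier", "op": "contains", "value": "SSE"})
```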
Check extracted data for errors and warnings. Run after enrichment.
| Rule Type | What It Checks | Parameters |
|---|---|---|
| required | Field not null/empty | (none) |
| format | Regex pattern match | pattern: "^\d{4}-\d{2}-\d{2}$" |
| range | Numeric min/max bounds | min: 0, max: 100000 |
| length | String length bounds | min: 1, max: 200 |
| cross_field | Compare two fields | other_field, op: lt/gt/eq/ne/lte/gte |
| calculation | Verify computed relationship | expression: "qty * price", tolerance_pct, tolerance_min |
| conditional | Nested check when condition met | condition + check (inner rule) |
| unique | No duplicates in array field | (none) |
| not_empty_array | Array has min items | min_items: 1 |
| email_format | Valid email address | (none) |
| sum_check | Array items sum to total | source_path: "items[].amount", tolerance_pct |
| date_range | Date within allowed range | min_date/max_date: "today", "-365d", "+30d", "2025-01-01" |
| depends_on | Required when another field set | trigger_field, trigger_op: "gt", trigger_value: 0 |
| one_of | Value from allowed set | allowed: ["GBP", "USD", "EUR"], case_insensitive: true |
| count_check | Array item count range | min_items: 1, max_items: 100 |
| consistency | Same value across array items | (none) — checks all resolved values match |
Severity: error (blocks) or warning (informational).
Error codes: Custom identifiers for reports (e.g. MISSING_REQUIRED, SUM_MISMATCH).
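For instance, a sum_check rule with tolerance_pct could be evaluated roughly as follows (an illustrative sketch: `resolve_path` here only handles single-level `items[].field` paths, while the real engine presumably resolves deeper paths too):

```python
def resolve_path(data: dict, path: str) -> list:
    """Resolve a simple 'items[].amount' style path to a list of values."""
    key, field = path.split("[].")
    return [item.get(field) for item in data.get(key, [])
            if item.get(field) is not None]

def sum_check(data: dict, source_path: str, total_field: str,
              tolerance_pct: float = 0.5):
    """Return (ok, message) comparing the sum of array values to a total field."""
    total = data.get(total_field)
    items_sum = sum(resolve_path(data, source_path))
    tolerance = abs(total) * tolerance_pct / 100
    if abs(items_sum - total) <= tolerance:
        return True, "ok"
    return False, f"SUM_MISMATCH: items sum {items_sum} != {total_field} {total}"

doc = {"total": 150.0, "items": [{"amount": 100.0}, {"amount": 50.0}]}
ok, msg = sum_check(doc, "items[].amount", "total")
print(ok)  # True
```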
Keyword sets are named lists stored in the database. Enrichment rules reference them via keywords_ref
so you can change a list in one place without editing every rule.
Step 1: Create a keyword set (Rules tab or CLI):
```
POST /api/cli/keyword/add
{
  "document_type_id": "travel_expense",
  "category": "enrichment_keyword_set",
  "key": "exclude_descriptions",
  "value": "cancelled"
}
```
Step 2: Reference it in an enrichment rule:
```json
{
  "rule_type": "row_filter",
  "parameters_json": {
    "action": "exclude_if_name_matches",
    "field": "destination",
    "keywords_ref": "exclude_descriptions"
  }
}
```
Change the keyword list → all referencing rules pick up the change automatically.
Each document type can have multiple named output templates. The same extracted JSON can produce different CSV/JSON layouts.
Each field mapping pairs an Output Name with a source_path. Array paths such as `items[].amount` create one row per array item, and expressions such as `origin + ' to ' + destination` combine fields. Manage templates via the Manage Formats button in the Extract tab, and drag fields from the JSON to build your output visually.
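To illustrate how array source paths fan out into rows, here is a rough sketch of expanding a path like `bills[].charges[].total` into one value per charge (hypothetical code, assuming each `[]` segment iterates a list):

```python
def expand_rows(data, path_parts):
    """Yield one scalar value per leaf for a path like bills[].charges[].total."""
    part = path_parts[0]
    if len(path_parts) == 1:
        yield data.get(part)          # leaf: plain field name
        return
    for item in data.get(part.rstrip("[]"), []):
        yield from expand_rows(item, path_parts[1:])

doc = {"bills": [{"charges": [{"total": 10.0}, {"total": 5.5}]},
                 {"charges": [{"total": 7.25}]}]}
rows = list(expand_rows(doc, "bills[].charges[].total".split(".")))
print(rows)  # [10.0, 5.5, 7.25]  -> three CSV rows, one per charge
```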
The Bulk tab processes multiple PDFs in one go and produces a combined CSV output.
- Supports ZIP archives (hidden `__MACOSX` and `._` files are ignored)
- The combined CSV includes a `_source_file` column identifying which PDF each row came from

All operations are available via the REST API. Authenticate with the `X-API-Key` header. Interactive docs at `/docs` (Swagger) and `/redoc` (ReDoc).
```bash
# Login (returns session cookie)
curl -X POST http://localhost:8080/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "your-password"}' \
  -c cookies.txt

# Or use an API key (recommended for automation)
curl http://localhost:8080/api/documents/ \
  -H "X-API-Key: ce_your_api_key_here"
```
```bash
# 1. Upload a PDF
curl -X POST http://localhost:8080/api/documents/ \
  -H "X-API-Key: ce_xxx" \
  -F "file=@invoice.pdf" \
  -F "document_type_id=utility_bill"
# Response:
# { "id": "abc123", "filename": "invoice.pdf", "status": "pending" }

# 2. Run extraction (~$0.04, 30-60 seconds)
curl -X POST http://localhost:8080/api/extractions/run/abc123 \
  -H "X-API-Key: ce_xxx"
# Response:
# {
#   "extraction_id": "ext_789",
#   "success": true,
#   "steps": [
#     {"name": "classify", "success": true, "duration_ms": 1200, "cost": {"total_cost": 0.002}},
#     {"name": "extract", "success": true, "duration_ms": 28000, "cost": {"total_cost": 0.038}},
#     {"name": "enrich", "success": true, "duration_ms": 15},
#     {"name": "validate", "success": true, "duration_ms": 3}
#   ],
#   "extraction_data": { "supplier_name": "SSE", "bills": [...] },
#   "validation_result": { "errors": [], "warnings": [...] },
#   "total_cost": { "total_cost": 0.042, "model": "claude-sonnet-4-20250514" },
#   "total_duration_ms": 29500,
#   "_confidence": { "overall": 0.87, "low_confidence_fields": [...] }
# }

# 3. Reprocess (free — re-run enrichment + validation after editing JSON)
curl -X POST http://localhost:8080/api/extractions/tools/reprocess \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"extraction_data": {"supplier_name": "SSE", ...}, "document_type_id": "utility_bill"}'
# Response: enriched data + validation + CSV rows (all free)
```
```bash
# Option A: Multipart file upload (simplest)
curl -X POST http://localhost:8080/api/extractions/tools/extract-file \
  -H "X-API-Key: ce_xxx" \
  -F "file=@invoice.pdf" \
  -F "document_type_id=utility_bill" \
  -F "store_result=true"

# Option B: Base64-encoded bytes (for programmatic use)
curl -X POST http://localhost:8080/api/extractions/tools/extract-bytes \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'$(base64 -w0 invoice.pdf)'",
    "document_type_id": "utility_bill",
    "store_result": false
  }'

# Both return the same response format as /run/{doc_id}
# Use store_result=false for stateless extraction (no DB records)
# Use store_result=true to save for later review/comparison
```

Python example:

```python
import base64, requests

with open("invoice.pdf", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/api/extractions/tools/extract-bytes",
    json={"pdf_base64": b64, "document_type_id": "utility_bill"},
    headers={"X-API-Key": "ce_xxx"},
)
data = resp.json()["extraction_data"]
```
```bash
# 1. Submit batch job (up to 10,000 PDFs)
curl -X POST "http://localhost:8080/api/batch/submit?document_type_id=utility_bill" \
  -H "X-API-Key: ce_xxx" \
  -F "files=@bill1.pdf" \
  -F "files=@bill2.pdf" \
  -F "files=@bill3.pdf"
# Response:
# { "job_id": "job_abc123", "batch_api_id": "msgbatch_xxx", "total_files": 15,
#   "status": "processing",
#   "message": "Batch submitted with 15 files. Results may take up to 24 hours." }

# 2. Check status (poll every few minutes)
curl http://localhost:8080/api/batch/status/job_abc123 \
  -H "X-API-Key: ce_xxx"
# Response (processing):
# { "id": "job_abc123", "status": "processing", "api_status": "in_progress",
#   "total_files": 15, "succeeded": 8, "failed": 0, "processing": 7 }
# Response (completed):
# { "id": "job_abc123", "status": "completed", "succeeded": 14, "failed": 1 }

# 3. Retrieve results
curl http://localhost:8080/api/batch/results/job_abc123 \
  -H "X-API-Key: ce_xxx"
# Response:
# { "job_id": "job_abc123", "status": "completed", "results": [
#     { "custom_id": "a1b2c3_0", "filename": "bill1.pdf", "success": true,
#       "extraction_data": { "supplier_name": "SSE", "bills": [...] } },
#     { "custom_id": "d4e5f6_1", "filename": "bill2.pdf", "success": true,
#       "extraction_data": { ... } },
#     { "custom_id": "g7h8i9_2", "filename": "corrupt.pdf", "success": false,
#       "error": "JSON parse error" }
# ] }

# 4. List all batch jobs
curl http://localhost:8080/api/batch/ -H "X-API-Key: ce_xxx"
```
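A client-side polling loop around the status endpoint might look like this (a sketch using only the endpoints shown above; the API key and poll interval are placeholders):

```python
import json
import time
import urllib.request

def fetch_json(url: str, api_key: str = "ce_xxx") -> dict:
    """GET a JSON endpoint with the X-API-Key header."""
    req = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def is_terminal(status: dict) -> bool:
    """True once a batch job has finished (successfully or not)."""
    return status.get("status") in ("completed", "failed")

def wait_for_batch(job_id: str, base: str = "http://localhost:8080",
                   poll_seconds: int = 120) -> dict:
    """Poll the status endpoint, then fetch results when the job is done."""
    while True:
        status = fetch_json(f"{base}/api/batch/status/{job_id}")
        if is_terminal(status):
            return fetch_json(f"{base}/api/batch/results/{job_id}")
        time.sleep(poll_seconds)
```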
```bash
# Create a webhook
curl -X POST http://localhost:8080/api/webhooks/ \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Slack Notification",
    "url": "https://hooks.slack.com/services/xxx/yyy/zzz",
    "events": "extraction.complete,batch.complete"
  }'
# Events: extraction.complete, extraction.failed, batch.complete,
#         ground_truth.saved, regression.complete

# Webhook payload (sent as POST to your URL):
# {
#   "event": "extraction.complete",
#   "timestamp": "2026-03-31T14:30:00",
#   "data": {
#     "document_id": "abc123", "filename": "invoice.pdf",
#     "document_type": "utility_bill", "success": true,
#     "cost": 0.042, "duration_ms": 29500
#   }
# }

# Test a webhook
curl -X POST http://localhost:8080/api/webhooks/webhook_id/test \
  -H "X-API-Key: ce_xxx"

# List / update / delete
curl http://localhost:8080/api/webhooks/ -H "X-API-Key: ce_xxx"
curl -X PUT http://localhost:8080/api/webhooks/id -d '{"is_active": false}'
curl -X DELETE http://localhost:8080/api/webhooks/id
```
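On the receiving end, a minimal handler for the payload above could look like this (illustrative only; any HTTP server that accepts JSON POSTs works, and the port is arbitrary):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(payload: dict) -> str:
    """Route a webhook payload by its event name."""
    event = payload.get("event", "unknown")
    if event == "extraction.complete":
        doc = payload["data"]
        return f"{doc['filename']} extracted for ${doc['cost']:.3f}"
    return f"unhandled event: {event}"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(handle_event(payload))
        self.send_response(200)  # acknowledge so the sender doesn't retry
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 9000), WebhookHandler).serve_forever()
```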
```bash
# Start watching a folder (PDFs dropped here are auto-extracted)
curl -X POST http://localhost:8080/api/watcher/start \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "watch_dir": "/data/incoming-pdfs",
    "document_type_id": "utility_bill",
    "poll_interval": 10
  }'
# Files are moved to /data/incoming-pdfs/processed/ or /data/incoming-pdfs/failed/

# Check status
curl http://localhost:8080/api/watcher/status -H "X-API-Key: ce_xxx"

# Stop watcher
curl -X POST http://localhost:8080/api/watcher/stop -H "X-API-Key: ce_xxx"
```
```bash
# Create an output template with transforms and aggregations
curl -X POST http://localhost:8080/api/output-templates/ \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type_id": "utility_bill",
    "name": "BDL Import Format",
    "format": "csv",
    "field_mappings": [
      {"output_name": "Supplier", "source": "supplier_name", "transform": "upper"},
      {"output_name": "Bill Date", "source": "bill_date", "transform": "date_format",
       "transform_args": {"format": "%d/%m/%Y"}},
      {"output_name": "Charge Name", "source": "bills[].charges[].name"},
      {"output_name": "Amount", "source": "bills[].charges[].total",
       "transform": "round", "transform_args": {"decimals": 2}},
      {"output_name": "Calculated", "source": "{quantity} * {rate}"}
    ],
    "aggregations": [
      {"column": "Amount", "function": "sum", "label": "Grand Total"},
      {"column": "Charge Name", "function": "count", "label": "Row Count"}
    ]
  }'

# Apply template to extraction data
curl -X POST http://localhost:8080/api/output-templates/apply/template_id \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"extraction_data": {...}, "document_type_id": "utility_bill"}'

# Export as Excel (direct download)
curl -X POST http://localhost:8080/api/output-templates/template_id/export-xlsx \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"extraction_data": {...}}' -o output.xlsx

# Export as XML
curl -X POST http://localhost:8080/api/output-templates/template_id/export-xml \
  -H "X-API-Key: ce_xxx" \
  -d '{"extraction_data": {...}}' -o output.xml

# Available transforms
curl http://localhost:8080/api/output-templates/available-transforms

# AI: Generate template from sample CSV
curl -X POST http://localhost:8080/api/output-templates/generate-from-sample \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type_id": "utility_bill",
    "sample_columns": ["Supplier", "Bill Date", "Amount", "VAT"],
    "available_fields": ["supplier_name", "bill_date", "bills[].charges[].total"]
  }'
```
```bash
# List all document types
curl http://localhost:8080/api/document-types/ -H "X-API-Key: ce_xxx"

# Create a new type
curl -X POST http://localhost:8080/api/document-types/ \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "type_id": "purchase_order",
    "display_name": "Purchase Order",
    "description": "Standard PO documents",
    "schema_fields": [
      {"name": "po_number", "type": "string", "required": true},
      {"name": "vendor", "type": "string"},
      {"name": "line_items", "type": "array", "items": [
        {"name": "description", "type": "string"},
        {"name": "quantity", "type": "number"},
        {"name": "unit_price", "type": "number"}
      ]}
    ],
    "extraction_prompt": "Extract all data from this purchase order..."
  }'

# AI: Generate type from a sample PDF
curl -X POST http://localhost:8080/api/document-types/generate-from-pdf \
  -H "X-API-Key: ce_xxx" \
  -F "file=@sample.pdf" \
  -F "description=Hotel invoice with room charges, taxes, and guest details" \
  -F "requirements=Extract guest name, dates, room charges, taxes, total"

# Clone a type
curl -X POST http://localhost:8080/api/document-types/utility_bill/clone \
  -d '{"new_type_id": "utility_bill_v2", "new_display_name": "Utility Bill v2"}'

# Export/Import config (for sharing between environments)
curl http://localhost:8080/api/document-types/utility_bill/export -o config.json
curl -X POST http://localhost:8080/api/document-types/import \
  -H "Content-Type: application/json" -d @config.json
```
```bash
# Save corrected data as ground truth
curl -X POST http://localhost:8080/api/ground-truth/doc_abc123 \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"corrected_data": {"supplier_name": "SSE Energy", ...}}'
# Response: auto-computed diff against latest extraction
# { "id": "gt_xyz", "correction_count": 3 }

# Run regression test (re-extracts all docs with ground truth)
curl -X POST http://localhost:8080/api/regression/run \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"document_type_id": "utility_bill"}'
# Response:
# { "total_documents": 12, "passed": 11, "failed": 1,
#   "summary": { "pass_rate": 0.917, "avg_score": 0.95 },
#   "results": [
#     { "filename": "sse_bill.pdf", "score": 1.0, "status": "pass" },
#     { "filename": "edf_bill.pdf", "score": 0.8, "status": "fail",
#       "comparison": { "mismatch_count": 2, "mismatches": [...] } }
#   ] }

# AI: Suggest improvements based on corrections
curl -X POST http://localhost:8080/api/ground-truth/doc_abc123/suggest-improvements \
  -d '{"document_type_id": "utility_bill"}'
```
```bash
# Complete automation flow:

# 1. Set up webhook for notifications
curl -X POST /api/webhooks/ -d '{
  "name": "Process Complete",
  "url": "https://your-system.com/webhook",
  "events": "extraction.complete"
}'

# 2. Start folder watcher (or submit via API)
curl -X POST /api/watcher/start -d '{
  "watch_dir": "/incoming",
  "document_type_id": "utility_bill"
}'

# 3. Drop PDFs into /incoming/ → auto-extracted → webhook fires

# 4. Your system receives the webhook and fetches results:
curl /api/documents/                           # list docs
curl /api/extractions/document/doc_id          # get extraction
curl -X POST /api/output-templates/apply/template_id \
  -d '{"extraction_data": ...}'                # format output
curl -X POST /api/output-templates/template_id/export-xlsx \
  -d '{"extraction_data": ...}' -o out.xlsx    # Excel export
```
| Strategy | Saving | How |
|---|---|---|
| Reprocess | 100% free | Edit JSON + Reprocess to test enrichment/validation changes |
| Load Previous | 100% free | Reload last extraction from DB |
| Prompt Caching | ~90% on prompt | Automatic — 5 min cache across sequential PDFs |
| Skip Classification | ~$0.002/PDF | Select the document type manually instead of auto-classifying |
| Batch API | 50% discount | For regression testing — results in up to 24h |
| Generate from PDF | ~$0.04 once | AI builds complete type definition from one sample |
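As a quick sanity check on the table, here is the arithmetic for a hypothetical 1,000-PDF run (using the approximate ~$0.04 extraction and ~$0.002 classification figures quoted above):

```python
def run_cost(n_pdfs: int, extraction: float = 0.04, classification: float = 0.002,
             skip_classification: bool = False, batch: bool = False) -> float:
    """Rough total cost (USD) for a run under the saving strategies above."""
    per_pdf = extraction + (0 if skip_classification else classification)
    if batch:
        per_pdf *= 0.5  # Batch API: 50% discount
    return round(n_pdfs * per_pdf, 2)

print(run_cost(1000))                            # 42.0  baseline
print(run_cost(1000, skip_classification=True))  # 40.0  type chosen manually
print(run_cost(1000, batch=True))                # 21.0  via Batch API
```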