Extract structured data from any PDF using Claude AI. Supports multiple document types — each with its own schema, prompt, enrichment rules, validation rules, and output formats — all configurable without code changes.
| Tab | Purpose |
|---|---|
| Documents | Upload PDFs (or ZIPs), select document type, view status |
| Extract | Run AI extraction, edit JSON results, reprocess (free), download CSV. Select document type or auto-classify. Load Previous reloads from DB without API cost. |
| Bulk | Upload multiple PDFs or ZIPs, extract all, download combined CSV. Supports ZIP archives (ignores hidden/dot files from Mac). |
| Rules | Configure keywords, thresholds, enrichment rules, validation rules, and prompts — per document type |
| Doc Types | Create, edit, clone, publish, and Generate from PDF (AI builds schema + rules from a sample) |
| Review | Compare extraction vs ground truth, save corrections for regression testing |
| Regression | Run extraction on all ground-truth documents and compare results automatically |
```
PDF Upload
    |
    v
1. Classification (Haiku ~$0.002) — auto-detects document type from all active types in DB
   Uses: type descriptions + classification hints keywords
    |
    v
2. Segmentation (free) — splits multi-account PDFs into separate segments
    |
    v
3. Extraction (Sonnet ~$0.04) — reads PDF natively, extracts structured JSON per schema
    |
    v
4. Enrichment (free) — 14 rule types for post-processing
    |
    v
5. Validation (free) — 17 rule types for quality checks
    |
    v
6. Recovery (Haiku ~$0.002, only if validation fails) — re-extracts failed fields
    |
    v
7. Output (free) — CSV via output templates + JSON
```
Typical cost: ~$0.04 per PDF. Steps 4-7 can be re-run free via the Reprocess button.
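The per-step costs above can be sanity-checked with a short sketch (illustrative only: the step names and dollar figures come from the diagram, and the real accounting happens server-side):

```python
# Approximate per-step API costs (USD) from the pipeline diagram.
# "recovery" runs only when validation fails.
STEP_COSTS = {
    "classification": 0.002,  # Haiku
    "segmentation": 0.0,
    "extraction": 0.04,       # Sonnet
    "enrichment": 0.0,
    "validation": 0.0,
    "recovery": 0.002,        # Haiku, conditional
    "output": 0.0,
}

def estimate_cost(validation_failed: bool = False) -> float:
    """Estimate total API cost for one PDF through the pipeline."""
    cost = sum(c for step, c in STEP_COSTS.items() if step != "recovery")
    if validation_failed:
        cost += STEP_COSTS["recovery"]
    return round(cost, 4)

print(estimate_cost())      # 0.042
print(estimate_cost(True))  # 0.044
```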
Two approaches to creating a document type: manual (define the schema and rules yourself) or AI-assisted (Generate from PDF builds them from a sample).
Clone: Copy an existing type (including all rules) as a starting point.
Schema field types: string, float, int, bool, date, array (with nested item_fields)
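A hypothetical schema fragment using these field types (the key names, including `item_fields`, follow the wording above and are assumptions about the exact request shape, not a verified payload):

```python
# Illustrative schema fragment; key names are assumptions.
schema_fields = [
    {"name": "invoice_number", "type": "string", "required": True},
    {"name": "issue_date", "type": "date"},
    {"name": "total", "type": "float"},
    {"name": "is_credit_note", "type": "bool"},
    {"name": "line_items", "type": "array", "item_fields": [
        {"name": "description", "type": "string"},
        {"name": "quantity", "type": "int"},
        {"name": "amount", "type": "float"},
    ]},
]

# Every field uses one of the six supported types.
assert all(f["type"] in {"string", "float", "int", "bool", "date", "array"}
           for f in schema_fields)
```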
Run after extraction to clean, transform, and normalize data. All free — no API calls.
| Rule Type | What It Does | Example Parameters |
|---|---|---|
| field_transform | Transform a field value in-place | strip_chars, trim, to_uppercase, regex_replace, pad_left, date_format |
| value_conversion | Convert numeric values | divide_by, multiply_by, round, negate, to_number |
| row_filter | Exclude/include array items | exclude_if_name_matches, include_only_if, exclude_if_value_null |
| field_injection | Set or copy field values | set_if_null, set_always, copy_if_null, calculate_if_null |
| rename_value | Normalize field values | replace_exact (mapping table), replace_contains |
| calculation | Cross-field calculations | sum_and_set, evaluate_expression ("qty * price") |
| currency_detect | Detect currency from symbols | source_field: scan for currency symbols, set GBP/USD/EUR |
| date_normalize | Parse dates from any format | 15 format patterns: "13 Mar 2026", "03/13/26", "13.03.2026" etc. |
| lookup_map | Map values via lookup table | mappings: {"uk": "United Kingdom"}, case_insensitive, output_field |
| array_sort | Sort array items by field | sort_by: "amount", descending: true |
| array_aggregate | Sum/count/min/max/avg across array | source_path: "items[].amount", operation: "sum" |
| string_concat | Concatenate multiple fields | fields: ["first_name", "last_name"], separator: " " |
| conditional_set | Set field based on if/else chain | conditions: [{condition: {field, op, value}, set_value}], default |
| array_deduplicate | Remove duplicate array items | compare_fields: ["name", "amount"] (or all fields) |
Target paths: `vendor_name`, `items[].amount`, `bills[].charges[].total`
Conditions: `{"field": "measure", "op": "equals", "value": "MWh"}` — operators: equals, not_equals, contains, gt, lt, gte, lte, is_null, is_not_null, matches_regex, in_set
Priority: lower = runs first. Use 10, 20, 30... for easy reordering.
`keywords_ref`: Reference a shared keyword set instead of inline keywords.
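As an illustration, a condition object with the operators listed above could be evaluated like this (a minimal sketch, not the tool's actual rule engine):

```python
import re

def eval_condition(data: dict, cond: dict) -> bool:
    """Evaluate a condition like {"field": "measure", "op": "equals", "value": "MWh"}."""
    actual = data.get(cond["field"])
    op, expected = cond["op"], cond.get("value")
    if op == "is_null":
        return actual is None
    if op == "is_not_null":
        return actual is not None
    if actual is None:  # missing values fail every other operator
        return False
    ops = {
        "equals": lambda: actual == expected,
        "not_equals": lambda: actual != expected,
        "contains": lambda: str(expected) in str(actual),
        "gt": lambda: actual > expected,
        "lt": lambda: actual < expected,
        "gte": lambda: actual >= expected,
        "lte": lambda: actual <= expected,
        "matches_regex": lambda: re.search(expected, str(actual)) is not None,
        "in_set": lambda: actual in expected,
    }
    return ops[op]()

row = {"measure": "MWh", "amount": 120.0}
assert eval_condition(row, {"field": "measure", "op": "equals", "value": "MWh"})
assert eval_condition(row, {"field": "amount", "op": "gt", "value": 100})
assert not eval_condition(row, {"field": "supplier", "op": "contains", "value": "SSE"})
```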
Check extracted data for errors and warnings. Run after enrichment.
| Rule Type | What It Checks | Parameters |
|---|---|---|
| required | Field not null/empty | (none) |
| format | Regex pattern match | pattern: "^\d{4}-\d{2}-\d{2}$" |
| range | Numeric min/max bounds | min: 0, max: 100000 |
| length | String length bounds | min: 1, max: 200 |
| cross_field | Compare two fields | other_field, op: lt/gt/eq/ne/lte/gte |
| calculation | Verify computed relationship | expression: "qty * price", tolerance_pct, tolerance_min |
| conditional | Nested check when condition met | condition + check (inner rule) |
| unique | No duplicates in array field | (none) |
| not_empty_array | Array has min items | min_items: 1 |
| email_format | Valid email address | (none) |
| sum_check | Array items sum to total | source_path: "items[].amount", tolerance_pct |
| date_range | Date within allowed range | min_date/max_date: "today", "-365d", "+30d", "2025-01-01" |
| depends_on | Required when another field set | trigger_field, trigger_op: "gt", trigger_value: 0 |
| one_of | Value from allowed set | allowed: ["GBP", "USD", "EUR"], case_insensitive: true |
| count_check | Array item count range | min_items: 1, max_items: 100 |
| consistency | Same value across array items | (none) — checks all resolved values match |
Severity: error (blocks) or warning (informational).
Error codes: Custom identifiers for reports (e.g. MISSING_REQUIRED, SUM_MISMATCH).
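For instance, a sum_check rule with tolerance_pct could be evaluated roughly as follows (an illustrative sketch: `resolve_path` here only handles single-level `items[].field` paths, while the real engine presumably resolves deeper paths too):

```python
def resolve_path(data: dict, path: str) -> list:
    """Resolve a simple 'items[].amount' style path to a list of values."""
    key, field = path.split("[].")
    return [item.get(field) for item in data.get(key, [])
            if item.get(field) is not None]

def sum_check(data: dict, source_path: str, total_field: str,
              tolerance_pct: float = 0.5):
    """Return (ok, message) comparing the sum of array values to a total field."""
    total = data.get(total_field)
    items_sum = sum(resolve_path(data, source_path))
    tolerance = abs(total) * tolerance_pct / 100
    if abs(items_sum - total) <= tolerance:
        return True, "ok"
    return False, f"SUM_MISMATCH: items sum {items_sum} != {total_field} {total}"

doc = {"total": 150.0, "items": [{"amount": 100.0}, {"amount": 50.0}]}
ok, msg = sum_check(doc, "items[].amount", "total")
print(ok)  # True
```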
Keyword sets are named lists stored in the database. Enrichment rules reference them via keywords_ref
so you can change a list in one place without editing every rule.
Step 1: Create a keyword set (Rules tab or CLI):
```
POST /api/cli/keyword/add
{
  "document_type_id": "travel_expense",
  "category": "enrichment_keyword_set",
  "key": "exclude_descriptions",
  "value": "cancelled"
}
```
Step 2: Reference it in an enrichment rule:
```json
{
  "rule_type": "row_filter",
  "parameters_json": {
    "action": "exclude_if_name_matches",
    "field": "destination",
    "keywords_ref": "exclude_descriptions"
  }
}
```
Change the keyword list → all referencing rules pick up the change automatically.
Each document type can have multiple named output templates. The same extracted JSON can produce different CSV/JSON layouts.
Each field mapping pairs an Output Name with a source_path. Array paths such as `items[].amount` create one row per array item, and expressions such as `origin + ' to ' + destination` combine fields. Manage templates via the Manage Formats button in the Extract tab, and drag fields from the JSON to build your output visually.
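To illustrate how array source paths fan out into rows, here is a rough sketch of expanding a path like `bills[].charges[].total` into one value per charge (hypothetical code, assuming each `[]` segment iterates a list):

```python
def expand_rows(data, path_parts):
    """Yield one scalar value per leaf for a path like bills[].charges[].total."""
    part = path_parts[0]
    if len(path_parts) == 1:
        yield data.get(part)          # leaf: plain field name
        return
    for item in data.get(part.rstrip("[]"), []):
        yield from expand_rows(item, path_parts[1:])

doc = {"bills": [{"charges": [{"total": 10.0}, {"total": 5.5}]},
                 {"charges": [{"total": 7.25}]}]}
rows = list(expand_rows(doc, "bills[].charges[].total".split(".")))
print(rows)  # [10.0, 5.5, 7.25]  -> three CSV rows, one per charge
```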
The Bulk tab processes multiple PDFs in one go and produces a combined CSV output.
- Supports ZIP archives (hidden `__MACOSX` and `._` files are ignored)
- The combined CSV includes a `_source_file` column identifying which PDF each row came from

All operations are available via the REST API. Authenticate with the `X-API-Key` header. Interactive docs at `/docs` (Swagger) and `/redoc` (ReDoc).
```bash
# Login (returns session cookie)
curl -X POST http://localhost:8080/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "your-password"}' \
  -c cookies.txt

# Or use an API key (recommended for automation)
curl http://localhost:8080/api/documents/ \
  -H "X-API-Key: ce_your_api_key_here"
```
```bash
# 1. Upload a PDF
curl -X POST http://localhost:8080/api/documents/ \
  -H "X-API-Key: ce_xxx" \
  -F "file=@invoice.pdf" \
  -F "document_type_id=utility_bill"
# Response:
# { "id": "abc123", "filename": "invoice.pdf", "status": "pending" }

# 2. Run extraction (~$0.04, 30-60 seconds)
curl -X POST http://localhost:8080/api/extractions/run/abc123 \
  -H "X-API-Key: ce_xxx"
# Response:
# {
#   "extraction_id": "ext_789",
#   "success": true,
#   "steps": [
#     {"name": "classify", "success": true, "duration_ms": 1200, "cost": {"total_cost": 0.002}},
#     {"name": "extract", "success": true, "duration_ms": 28000, "cost": {"total_cost": 0.038}},
#     {"name": "enrich", "success": true, "duration_ms": 15},
#     {"name": "validate", "success": true, "duration_ms": 3}
#   ],
#   "extraction_data": { "supplier_name": "SSE", "bills": [...] },
#   "validation_result": { "errors": [], "warnings": [...] },
#   "total_cost": { "total_cost": 0.042, "model": "claude-sonnet-4-20250514" },
#   "total_duration_ms": 29500,
#   "_confidence": { "overall": 0.87, "low_confidence_fields": [...] }
# }

# 3. Reprocess (free — re-run enrichment + validation after editing JSON)
curl -X POST http://localhost:8080/api/extractions/tools/reprocess \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"extraction_data": {"supplier_name": "SSE", ...}, "document_type_id": "utility_bill"}'
# Response: enriched data + validation + CSV rows (all free)
```
```bash
# Option A: Multipart file upload (simplest)
curl -X POST http://localhost:8080/api/extractions/tools/extract-file \
  -H "X-API-Key: ce_xxx" \
  -F "file=@invoice.pdf" \
  -F "document_type_id=utility_bill" \
  -F "store_result=true"

# Option B: Base64-encoded bytes (for programmatic use)
curl -X POST http://localhost:8080/api/extractions/tools/extract-bytes \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'$(base64 -w0 invoice.pdf)'",
    "document_type_id": "utility_bill",
    "store_result": false
  }'

# Both return the same response format as /run/{doc_id}
# Use store_result=false for stateless extraction (no DB records)
# Use store_result=true to save for later review/comparison
```

Python example:

```python
import base64, requests

with open("invoice.pdf", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/api/extractions/tools/extract-bytes",
    json={"pdf_base64": b64, "document_type_id": "utility_bill"},
    headers={"X-API-Key": "ce_xxx"},
)
data = resp.json()["extraction_data"]
```
```bash
# 1. Submit batch job (up to 10,000 PDFs)
curl -X POST "http://localhost:8080/api/batch/submit?document_type_id=utility_bill" \
  -H "X-API-Key: ce_xxx" \
  -F "files=@bill1.pdf" \
  -F "files=@bill2.pdf" \
  -F "files=@bill3.pdf"
# Response:
# { "job_id": "job_abc123", "batch_api_id": "msgbatch_xxx", "total_files": 15,
#   "status": "processing",
#   "message": "Batch submitted with 15 files. Results may take up to 24 hours." }

# 2. Check status (poll every few minutes)
curl http://localhost:8080/api/batch/status/job_abc123 \
  -H "X-API-Key: ce_xxx"
# Response (processing):
# { "id": "job_abc123", "status": "processing", "api_status": "in_progress",
#   "total_files": 15, "succeeded": 8, "failed": 0, "processing": 7 }
# Response (completed):
# { "id": "job_abc123", "status": "completed", "succeeded": 14, "failed": 1 }

# 3. Retrieve results
curl http://localhost:8080/api/batch/results/job_abc123 \
  -H "X-API-Key: ce_xxx"
# Response:
# { "job_id": "job_abc123", "status": "completed", "results": [
#     { "custom_id": "a1b2c3_0", "filename": "bill1.pdf", "success": true,
#       "extraction_data": { "supplier_name": "SSE", "bills": [...] } },
#     { "custom_id": "d4e5f6_1", "filename": "bill2.pdf", "success": true,
#       "extraction_data": { ... } },
#     { "custom_id": "g7h8i9_2", "filename": "corrupt.pdf", "success": false,
#       "error": "JSON parse error" }
# ] }

# 4. List all batch jobs
curl http://localhost:8080/api/batch/ -H "X-API-Key: ce_xxx"
```
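A client-side polling loop around the status endpoint might look like this (a sketch using only the endpoints shown above; the API key and poll interval are placeholders):

```python
import json
import time
import urllib.request

def fetch_json(url: str, api_key: str = "ce_xxx") -> dict:
    """GET a JSON endpoint with the X-API-Key header."""
    req = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def is_terminal(status: dict) -> bool:
    """True once a batch job has finished (successfully or not)."""
    return status.get("status") in ("completed", "failed")

def wait_for_batch(job_id: str, base: str = "http://localhost:8080",
                   poll_seconds: int = 120) -> dict:
    """Poll the status endpoint, then fetch results when the job is done."""
    while True:
        status = fetch_json(f"{base}/api/batch/status/{job_id}")
        if is_terminal(status):
            return fetch_json(f"{base}/api/batch/results/{job_id}")
        time.sleep(poll_seconds)
```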
```bash
# Create a webhook
curl -X POST http://localhost:8080/api/webhooks/ \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Slack Notification",
    "url": "https://hooks.slack.com/services/xxx/yyy/zzz",
    "events": "extraction.complete,batch.complete"
  }'
# Events: extraction.complete, extraction.failed, batch.complete,
#         ground_truth.saved, regression.complete

# Webhook payload (sent as POST to your URL):
# {
#   "event": "extraction.complete",
#   "timestamp": "2026-03-31T14:30:00",
#   "data": {
#     "document_id": "abc123", "filename": "invoice.pdf",
#     "document_type": "utility_bill", "success": true,
#     "cost": 0.042, "duration_ms": 29500
#   }
# }

# Test a webhook
curl -X POST http://localhost:8080/api/webhooks/webhook_id/test \
  -H "X-API-Key: ce_xxx"

# List / update / delete
curl http://localhost:8080/api/webhooks/ -H "X-API-Key: ce_xxx"
curl -X PUT http://localhost:8080/api/webhooks/id -d '{"is_active": false}'
curl -X DELETE http://localhost:8080/api/webhooks/id
```
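On the receiving end, a minimal handler for the payload above could look like this (illustrative only; any HTTP server that accepts JSON POSTs works, and the port is arbitrary):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(payload: dict) -> str:
    """Route a webhook payload by its event name."""
    event = payload.get("event", "unknown")
    if event == "extraction.complete":
        doc = payload["data"]
        return f"{doc['filename']} extracted for ${doc['cost']:.3f}"
    return f"unhandled event: {event}"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(handle_event(payload))
        self.send_response(200)  # acknowledge so the sender doesn't retry
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 9000), WebhookHandler).serve_forever()
```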
```bash
# Start watching a folder (PDFs dropped here are auto-extracted)
curl -X POST http://localhost:8080/api/watcher/start \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "watch_dir": "/data/incoming-pdfs",
    "document_type_id": "utility_bill",
    "poll_interval": 10
  }'
# Files are moved to /data/incoming-pdfs/processed/ or /data/incoming-pdfs/failed/

# Check status
curl http://localhost:8080/api/watcher/status -H "X-API-Key: ce_xxx"

# Stop watcher
curl -X POST http://localhost:8080/api/watcher/stop -H "X-API-Key: ce_xxx"
```
```bash
# Create an output template with transforms and aggregations
curl -X POST http://localhost:8080/api/output-templates/ \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type_id": "utility_bill",
    "name": "BDL Import Format",
    "format": "csv",
    "field_mappings": [
      {"output_name": "Supplier", "source": "supplier_name", "transform": "upper"},
      {"output_name": "Bill Date", "source": "bill_date", "transform": "date_format",
       "transform_args": {"format": "%d/%m/%Y"}},
      {"output_name": "Charge Name", "source": "bills[].charges[].name"},
      {"output_name": "Amount", "source": "bills[].charges[].total",
       "transform": "round", "transform_args": {"decimals": 2}},
      {"output_name": "Calculated", "source": "{quantity} * {rate}"}
    ],
    "aggregations": [
      {"column": "Amount", "function": "sum", "label": "Grand Total"},
      {"column": "Charge Name", "function": "count", "label": "Row Count"}
    ]
  }'

# Apply template to extraction data
curl -X POST http://localhost:8080/api/output-templates/apply/template_id \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"extraction_data": {...}, "document_type_id": "utility_bill"}'

# Export as Excel (direct download)
curl -X POST http://localhost:8080/api/output-templates/template_id/export-xlsx \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"extraction_data": {...}}' -o output.xlsx

# Export as XML
curl -X POST http://localhost:8080/api/output-templates/template_id/export-xml \
  -H "X-API-Key: ce_xxx" \
  -d '{"extraction_data": {...}}' -o output.xml

# Available transforms
curl http://localhost:8080/api/output-templates/available-transforms

# AI: Generate template from sample CSV
curl -X POST http://localhost:8080/api/output-templates/generate-from-sample \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type_id": "utility_bill",
    "sample_columns": ["Supplier", "Bill Date", "Amount", "VAT"],
    "available_fields": ["supplier_name", "bill_date", "bills[].charges[].total"]
  }'
```
```bash
# List all document types
curl http://localhost:8080/api/document-types/ -H "X-API-Key: ce_xxx"

# Create a new type
curl -X POST http://localhost:8080/api/document-types/ \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "type_id": "purchase_order",
    "display_name": "Purchase Order",
    "description": "Standard PO documents",
    "schema_fields": [
      {"name": "po_number", "type": "string", "required": true},
      {"name": "vendor", "type": "string"},
      {"name": "line_items", "type": "array", "items": [
        {"name": "description", "type": "string"},
        {"name": "quantity", "type": "number"},
        {"name": "unit_price", "type": "number"}
      ]}
    ],
    "extraction_prompt": "Extract all data from this purchase order..."
  }'

# AI: Generate type from a sample PDF
curl -X POST http://localhost:8080/api/document-types/generate-from-pdf \
  -H "X-API-Key: ce_xxx" \
  -F "file=@sample.pdf" \
  -F "description=Hotel invoice with room charges, taxes, and guest details" \
  -F "requirements=Extract guest name, dates, room charges, taxes, total"

# Clone a type
curl -X POST http://localhost:8080/api/document-types/utility_bill/clone \
  -d '{"new_type_id": "utility_bill_v2", "new_display_name": "Utility Bill v2"}'

# Export/Import config (for sharing between environments)
curl http://localhost:8080/api/document-types/utility_bill/export -o config.json
curl -X POST http://localhost:8080/api/document-types/import \
  -H "Content-Type: application/json" -d @config.json
```
```bash
# Save corrected data as ground truth
curl -X POST http://localhost:8080/api/ground-truth/doc_abc123 \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"corrected_data": {"supplier_name": "SSE Energy", ...}}'
# Response: auto-computed diff against latest extraction
# { "id": "gt_xyz", "correction_count": 3 }

# Run regression test (re-extracts all docs with ground truth)
curl -X POST http://localhost:8080/api/regression/run \
  -H "X-API-Key: ce_xxx" \
  -H "Content-Type: application/json" \
  -d '{"document_type_id": "utility_bill"}'
# Response:
# { "total_documents": 12, "passed": 11, "failed": 1,
#   "summary": { "pass_rate": 0.917, "avg_score": 0.95 },
#   "results": [
#     { "filename": "sse_bill.pdf", "score": 1.0, "status": "pass" },
#     { "filename": "edf_bill.pdf", "score": 0.8, "status": "fail",
#       "comparison": { "mismatch_count": 2, "mismatches": [...] } }
#   ] }

# AI: Suggest improvements based on corrections
curl -X POST http://localhost:8080/api/ground-truth/doc_abc123/suggest-improvements \
  -d '{"document_type_id": "utility_bill"}'
```
```bash
# Complete automation flow:

# 1. Set up webhook for notifications
curl -X POST /api/webhooks/ -d '{
  "name": "Process Complete",
  "url": "https://your-system.com/webhook",
  "events": "extraction.complete"
}'

# 2. Start folder watcher (or submit via API)
curl -X POST /api/watcher/start -d '{
  "watch_dir": "/incoming",
  "document_type_id": "utility_bill"
}'

# 3. Drop PDFs into /incoming/ → auto-extracted → webhook fires

# 4. Your system receives the webhook and fetches results:
curl /api/documents/                           # list docs
curl /api/extractions/document/doc_id          # get extraction
curl -X POST /api/output-templates/apply/template_id \
  -d '{"extraction_data": ...}'                # format output
curl -X POST /api/output-templates/template_id/export-xlsx \
  -d '{"extraction_data": ...}' -o out.xlsx    # Excel export
```
| Strategy | Saving | How |
|---|---|---|
| Reprocess | 100% free | Edit JSON + Reprocess to test enrichment/validation changes |
| Load Previous | 100% free | Reload last extraction from DB |
| Prompt Caching | ~90% on prompt | Automatic — 5 min cache across sequential PDFs |
| Skip Classification | ~$0.002/PDF | Select the document type manually instead of auto-classifying |
| Batch API | 50% discount | For regression testing — results in up to 24h |
| Generate from PDF | ~$0.04 once | AI builds complete type definition from one sample |
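As a quick sanity check on the table, here is the arithmetic for a hypothetical 1,000-PDF run (using the approximate ~$0.04 extraction and ~$0.002 classification figures quoted above):

```python
def run_cost(n_pdfs: int, extraction: float = 0.04, classification: float = 0.002,
             skip_classification: bool = False, batch: bool = False) -> float:
    """Rough total cost (USD) for a run under the saving strategies above."""
    per_pdf = extraction + (0 if skip_classification else classification)
    if batch:
        per_pdf *= 0.5  # Batch API: 50% discount
    return round(n_pdfs * per_pdf, 2)

print(run_cost(1000))                            # 42.0  baseline
print(run_cost(1000, skip_classification=True))  # 40.0  type chosen manually
print(run_cost(1000, batch=True))                # 21.0  via Batch API
```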