Catch bad data
before your customers do.

Field-level evals for LLM-powered web extraction pipelines. Compare page-extracted outputs against trusted labels, catch schema-valid JSON that is still wrong, and see which fields broke before they ship.

Request beta access

Plugs into your existing stack.

experiments

Run field-level experiments on every pipeline you ship.

Experiments give each critical field its own grader. Choose from LLM graders, fuzzy string matching, numeric ranges, and more. Find regressions before they flow downstream.

POST /v1/evaluate

{
  "product_name": "TrailRunner X200",
  "sale_price": 160.00 129.99,      [inverted]
  "original_price": 160 160.00,     [wrong number format]
  "review_count": 2847,
  "sku": "TRX-200-GRY-42",
  "shipping_days": 2 null           [incorrect]
}

Use cases

truths

Build ground truth datasets with ease.

Create ground truth datasets from imported HTML or captured web snapshots. Add truths by pairing each page with golden JSON, or label values directly on the page with the Visual Labeler.

$1,295,000

42 Shaw Street, Toronto, ON M6K 2P3

3 Beds

2.5 Baths

1,850 Sq Ft

Renovated Leslieville semi on a walkable stretch of Queen East.

{
  "address": null,
  "price": null,
  "cover_img": null,
  "bedrooms": null,
  "bathrooms": null,
  "square_feet": null
}

arena

Pick the best accuracy per dollar.

Run models, prompts, and pipelines against the same Truth Set so you can cut costs, improve accuracy, and ship with confidence.

Sonnet 4.8 (low)

Sonnet 4.8 (max)

Gemini 3.1 Pro

GPT-5.5

DeepSeek V4 Flash

Kimi K2.6

Haiku 4.5 (tuned)

Prompt	Acc.	$/1K	Acc/$
Baseline	88%	$1.95	45.1
+ Few-shot	93%	$2.10	44.3
+ Schema-anchored	97%	$1.95	49.7

Winner: Haiku 4.5 (tuned), with 97% accuracy at $1.95 / 1K. It beats GPT-5.5 (96% @ $7.25) at roughly a quarter of the cost. More expensive ≠ more accurate.
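The Acc/$ column appears to be accuracy in percent divided by the $/1K cost. That reading is inferred from the numbers in the table, not from a documented formula, and the function name below is illustrative. A minimal sketch that reproduces the column:

```python
# Rows copied from the prompt table above: (variant, accuracy in %, cost in $ per 1K).
rows = [
    ("Baseline", 88, 1.95),
    ("+ Few-shot", 93, 2.10),
    ("+ Schema-anchored", 97, 1.95),
]


def accuracy_per_dollar(accuracy_pct, cost_per_1k):
    """Assumed definition: accuracy (%) divided by cost ($ per 1K), rounded to 1 decimal."""
    return round(accuracy_pct / cost_per_1k, 1)


scores = {name: accuracy_per_dollar(acc, cost) for name, acc, cost in rows}
best = max(scores, key=scores.get)

print(scores)  # {'Baseline': 45.1, '+ Few-shot': 44.3, '+ Schema-anchored': 49.7}
print(best)    # + Schema-anchored
```

Few-shot raises accuracy but scores lower on this ratio than the baseline, because its cost rises faster than its accuracy.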

auditor

Catch cost spikes and schema drift in production.

A few lines of code stream live telemetry (cost per page, schema adherence, and null rates) to Auditor. From there, promote the worst offenders into Truth Sets.

Production telemetry by URL pattern

URL pattern	Pages	Cost	Schema
/products/*	9,841	$0.021/page	99.1%
/search/*	5,212	$0.088/page	92.4%
/reviews/*	2,633	$0.047/page	96.8%
/category/*	3,094	$0.019/page	99.4%

shipping_days missing on 8.2% of search pages

Promote /search?q=running+shoe to 'Catalog' Truth Set?
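To make the rollup above concrete, here is a rough sketch of grouping per-page telemetry by URL pattern. It is not the Auditor implementation: the event shape (`url`, `cost`, `schema_valid`, `output`), the `url_pattern` rule, and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def url_pattern(url):
    """Collapse a URL to '/<first path segment>/*' (a simplifying assumption)."""
    first = urlparse(url).path.strip("/").split("/")[0]
    return f"/{first}/*" if first else "/"


def summarize(events, field):
    """Per URL pattern: page count, mean cost, schema adherence, null rate for `field`."""
    groups = defaultdict(list)
    for event in events:
        groups[url_pattern(event["url"])].append(event)

    summary = {}
    for pattern, evs in groups.items():
        n = len(evs)
        summary[pattern] = {
            "pages": n,
            "cost_per_page": sum(e["cost"] for e in evs) / n,
            "schema_adherence": sum(e["schema_valid"] for e in evs) / n,
            # A field counts as null when it is absent or explicitly None.
            "null_rate": sum(e["output"].get(field) is None for e in evs) / n,
        }
    return summary


# Hypothetical events; a real pipeline would emit one per extracted page.
events = [
    {"url": "https://shop.example/products/trx-200", "cost": 0.02,
     "schema_valid": True, "output": {"shipping_days": 2}},
    {"url": "https://shop.example/search?q=running+shoe", "cost": 0.09,
     "schema_valid": True, "output": {"shipping_days": None}},
    {"url": "https://shop.example/search/boots", "cost": 0.07,
     "schema_valid": False, "output": {}},
]

report = summarize(events, "shipping_days")
for pattern, stats in report.items():
    print(pattern, stats)
```

Patterns with a high null rate or low schema adherence are the natural candidates to promote into a Truth Set.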

api

Own your infra. Use it anywhere.

Drop Truths.dev into CI, batch jobs, or agent loops. Define what correct looks like once, then gate every deploy on field accuracy.

# score one output against ground truth
result = truths.evaluate(
    truth_set_id="ts_trailrunner_pages",
    output=extractor.run(url),
)

# gate the deploy on field accuracy
assert result.verdict != "failed"

# compare pipelines, monitor production
truths.arena.compare(["exp_prompt_v6", "exp_prompt_v7"])
truths.auditor.events.create(result.telemetry)

Why teams choose Truths.dev

LLM eval platforms are built for conversations. Scraping tools are built for access. Truths.dev is built for the messy middle: proving extracted web data is correct, cheap enough, and schema-valid in production.

	Generic LLM evals	Scraping monitors	In-house scripts	Truths.dev
Field-level JSON accuracy	Partial	—	Manual	✓
Missing vs incorrect failure types	—	—	Hard	✓
Visual DOM-to-JSON debugging	—	—	—	✓
Model / prompt / pipeline Arena	Partial	—	Manual	Arena
Token cost per page in production	Partial	—	Manual	✓
Production schema adherence	—	Partial	Manual	✓
Promote failures into Truths	—	—	—	✓

Common questions

Quick answers about Truth Sets, evals, Arena, Auditor, and how Truths fits your extraction stack.

What is Truths.dev, and how is it different from schema validation?

Truths.dev audits LLM-powered web extraction by scoring extracted JSON against labeled correct values. Schema validation only checks shape and types. Truths scores each field, handles exact numbers, fuzzy text, nulls, and custom tolerances, and distinguishes missing values from incorrect ones.
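The difference from schema validation is easiest to see in code. The sketch below is not the Truths.dev grader; it is a minimal illustration of per-field rules (exact match, numeric tolerance, fuzzy text) that reports missing and incorrect as separate verdicts. The function names, rule names, and sample record are all assumptions, and the fuzzy matcher uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def grade_field(expected, actual, rule="exact", tolerance=0.0, min_ratio=0.9):
    """Return 'correct', 'missing', or 'incorrect' for a single field."""
    if expected is None:
        # The label says the page has no value, so only null is correct.
        return "correct" if actual is None else "incorrect"
    if actual is None:
        return "missing"
    if rule == "numeric":
        if isinstance(actual, bool) or not isinstance(actual, (int, float)):
            return "incorrect"
        return "correct" if abs(actual - expected) <= tolerance else "incorrect"
    if rule == "fuzzy":
        ratio = SequenceMatcher(None, str(expected).lower(), str(actual).lower()).ratio()
        return "correct" if ratio >= min_ratio else "incorrect"
    return "correct" if actual == expected else "incorrect"


def grade_record(truth, output, rules):
    """Grade every labeled field; fields without a rule fall back to exact match."""
    return {
        field: grade_field(expected, output.get(field), **rules.get(field, {}))
        for field, expected in truth.items()
    }


# Illustrative data only: the output passes a type check, yet two fields are wrong.
truth = {"product_name": "TrailRunner X200", "sale_price": 129.99,
         "review_count": 2847, "shipping_days": 2}
output = {"product_name": "TrailRunner X-200", "sale_price": 160.00,
          "review_count": 2847, "shipping_days": None}
rules = {"product_name": {"rule": "fuzzy", "min_ratio": 0.9},
         "sale_price": {"rule": "numeric", "tolerance": 0.01}}

verdicts = grade_record(truth, output, rules)
print(verdicts)
```

Here `sale_price` is a valid number and `shipping_days` is a permitted null, so a schema check alone would not flag either; the grader marks them incorrect and missing respectively.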

What is a Truth Set, and what do I need to get started?

A Truth Set is your golden dataset for one extraction task: a schema, labeled examples, and per-field match rules. Import records from your existing pipeline or capture web snapshots, label a handful of pages, then reuse the set across Experiments, Arena, and production audits.

Does Truths.dev replace my scraper, browser agent, or backend?

No. Truths sits downstream of extraction. Keep your current stack for fetching and parsing, whether that is Playwright, Firecrawl, Browser Use, your own backend, or something else. Use the API to trigger eval runs, fetch field-level verdicts, and gate deploys on accuracy thresholds.
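As a sketch of what gating a deploy on accuracy thresholds could look like once field-level verdicts are in hand: the verdict strings, helper names, and thresholds below are assumptions for illustration, not the Truths.dev API.

```python
def field_accuracy(verdicts_per_page):
    """verdicts_per_page: list of {field: 'correct' | 'missing' | 'incorrect'} dicts."""
    totals, correct = {}, {}
    for verdicts in verdicts_per_page:
        for field, verdict in verdicts.items():
            totals[field] = totals.get(field, 0) + 1
            correct[field] = correct.get(field, 0) + (verdict == "correct")
    return {field: correct[field] / totals[field] for field in totals}


def failing_fields(accuracy, thresholds, default=0.95):
    """Fields whose accuracy is below their threshold (or the default threshold)."""
    return sorted(f for f, acc in accuracy.items() if acc < thresholds.get(f, default))


# Hypothetical verdicts for four pages from one eval run.
pages = [
    {"sale_price": "correct", "shipping_days": "missing"},
    {"sale_price": "correct", "shipping_days": "correct"},
    {"sale_price": "incorrect", "shipping_days": "correct"},
    {"sale_price": "correct", "shipping_days": "correct"},
]

accuracy = field_accuracy(pages)
failures = failing_fields(accuracy, {"sale_price": 0.99, "shipping_days": 0.70})

# In CI, pass this to sys.exit() so a failing field blocks the deploy.
exit_code = 1 if failures else 0
print(accuracy, failures, exit_code)
```

Per-field thresholds let a price field be held to a stricter bar than a field where occasional gaps are tolerable.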

How do Experiments and Arena help me improve extraction quality?

Experiments run your extractor against a Truth Set to establish field-level accuracy. Arena compares models, prompts, and pipelines by field accuracy, failure type, cost, and latency, so you can improve the fields that matter instead of chasing aggregate pass/fail.

What does Auditor monitor, and what do you do with that data?

Auditor receives lightweight telemetry from live extraction runs: token usage, cost, model and prompt version, URL pattern, schema health, null rates, latency, retries, and extraction status. Your extraction data stays yours: we only store data for running evals, never sell it, never use it to train models, and support deletion requests. Read the privacy policy for more detail.

Pricing intelligence

Lead generation

Agent automation

Data aggregation
