rag-forge audit

Run evaluation on pipeline telemetry and generate an audit report.

Synopsis

rag-forge audit [--input <file>] [--golden-set <file>] [options]

Description

audit evaluates your RAG pipeline’s quality and produces a structured report. It requires at least one of --input (a JSONL file of captured telemetry) or --golden-set (a JSON file of ground-truth question/answer pairs). Both can be provided at the same time.

The command delegates to the Python rag_forge_evaluator.cli module. It scores each sample across a set of metrics (faithfulness, relevance, answer completeness, and others depending on the evaluator engine) and maps the overall score to an RMM level (0–5). The RMM — RAG Maturity Model — gives you a single headline grade that indicates where your pipeline sits on the quality spectrum.
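As a hypothetical illustration of the score-to-level mapping (the thresholds below are invented for the sketch, not rag-forge's documented cut-offs), converting an overall score in [0.0, 1.0] to an RMM level might look like:

```python
def rmm_level(overall_score: float) -> int:
    """Map an overall score in [0.0, 1.0] to an RMM level 0-5.

    The threshold values are illustrative assumptions, not the
    evaluator's actual boundaries.
    """
    thresholds = [0.2, 0.4, 0.6, 0.75, 0.9]  # hypothetical boundaries
    level = 0
    for t in thresholds:
        if overall_score >= t:
            level += 1
    return level
```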

Three evaluator engines are supported: llm-judge (default, uses a language model to score each metric), ragas (open-source evaluation framework, internally uses gpt-4o-mini regardless of --judge), and deepeval (another evaluation framework). The --judge flag selects the provider powering llm-judge; --judge-model lets you pin a specific model id for that provider.

On completion, audit saves an HTML report and a JSON report to the output directory. Pass --pdf to also render a PDF — Playwright availability is checked at audit start, not at the end, so missing dependencies fail fast before judge calls run.

The golden set file for a scaffolded project lives at eval/golden_set.json.
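Concretely, a minimal golden set might look like the following sketch. The field names ("question", "expected_answer") are assumptions for illustration; inspect a scaffolded eval/golden_set.json for the real schema.

```python
import json

# Minimal golden set: ground-truth question/answer pairs.
golden_set = [
    {
        "question": "What is the refund window for annual plans?",
        "expected_answer": "30 days from the date of purchase.",
    },
    {
        "question": "Which regions does the service cover?",
        "expected_answer": "US, EU, and APAC.",
    },
]

with open("golden_set.json", "w", encoding="utf-8") as f:
    json.dump(golden_set, f, indent=2)
```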

Options

-i, --input <file>
        Path to telemetry JSONL file. Either --input or --golden-set is required.

-g, --golden-set <file>
        Path to golden set JSON file. Either --input or --golden-set is required.

-j, --judge <model>  (default: mock)
        Judge provider for the llm-judge engine: mock | claude | openai. Unknown aliases fail loudly with a ConfigurationError.

--judge-model <name>  (default: provider default)
        Specific judge model id (e.g. claude-opus-4-6, gpt-4-turbo). Falls back to the RAG_FORGE_JUDGE_MODEL env var, then the provider default.

-o, --output <dir>  (default: ./reports)
        Output directory for reports.

--evaluator <engine>  (default: llm-judge)
        Evaluator engine: llm-judge | ragas | deepeval. Note: --evaluator ragas requires --judge openai, since RAGAS uses its own internal OpenAI judge regardless.

--pdf
        Generate a PDF report (requires Playwright). Availability is checked at audit start.
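The three-step precedence for the judge model id (flag, then env var, then provider default) can be sketched as follows. The provider-default ids are invented placeholders:

```python
import os

# Hypothetical per-provider default model ids, for illustration only.
PROVIDER_DEFAULTS = {
    "mock": "mock-judge-v1",
    "claude": "claude-default",
    "openai": "openai-default",
}

def resolve_judge_model(judge, judge_model_flag=None, env=None):
    """Resolve the judge model id: --judge-model wins, then the
    RAG_FORGE_JUDGE_MODEL env var, then the provider default."""
    env = os.environ if env is None else env
    if judge_model_flag:
        return judge_model_flag
    env_model = env.get("RAG_FORGE_JUDGE_MODEL")
    if env_model:
        return env_model
    return PROVIDER_DEFAULTS[judge]
```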

Environment variables

RAG_FORGE_JUDGE_MODEL  (default: provider default)
        Override the judge model id. Lower precedence than --judge-model.

RAG_FORGE_JUDGE_MAX_TOKENS  (default: 4096)
        Max output tokens per judge call. Bumped from 1024 in v0.1.1; the old default truncated mid-array on long structured responses.

RAG_FORGE_JUDGE_MAX_RETRIES  (default: 5)
        Max retries on transient 429/5xx from the judge SDK, plus the attempt cap for the 529-specific retry wrapper.

RAG_FORGE_JUDGE_OVERLOAD_BUDGET_SECONDS  (default: 300)
        Wall-clock budget for 529 Overloaded retries (Anthropic-specific). The retry wrapper exits when either MAX_RETRIES attempts are exhausted or cumulative backoff would exceed this budget, whichever trips first. Raise to 600 for long audits during sustained capacity events. Added in v0.1.3.

RAG_FORGE_LOG_QUERIES  (default: unset)
        Set to 1 to show query previews in the per-sample progress stream. Default is to redact query content (PHI/PII protection).
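The dual-cap retry behavior (attempt count or wall-clock budget, whichever trips first) can be sketched as below. This is not rag-forge's actual implementation: the exception name and the doubling backoff schedule are assumptions.

```python
import sys
import time

class OverloadedError(Exception):
    """Stand-in for the judge SDK's 529 Overloaded error (hypothetical name)."""

def call_with_overload_budget(call, max_retries=5, budget_seconds=300,
                              initial_backoff=2.0):
    """Retry `call` on 529 Overloaded until either the attempt cap or
    the wall-clock backoff budget trips, whichever comes first."""
    start = time.monotonic()
    backoff = initial_backoff
    for attempt in range(1, max_retries + 1):
        try:
            return call()
        except OverloadedError:
            elapsed = time.monotonic() - start
            # Stop when attempts are exhausted or the next sleep would
            # exceed the budget; the caller can then flush partial results.
            if attempt == max_retries or elapsed + backoff > budget_seconds:
                raise
            print(f"[judge: 529 overload, retry {attempt}, "
                  f"elapsed {int(elapsed)}s / {budget_seconds}s budget]",
                  file=sys.stderr)
            time.sleep(backoff)
            backoff *= 2
```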

Pre-run banner and confirmation

Starting in v0.1.1, audit prints a pre-run banner to stderr showing sample count, judge model, evaluator engine, total judge calls, estimated time, and estimated USD cost — and asks for confirmation before any judge calls are made. The npm CLI auto-confirms (the TypeScript wrapper passes --yes to the Python subprocess). Direct python -m rag_forge_evaluator.cli audit ... invocations require either an interactive terminal or an explicit --yes flag.

While the audit runs, a per-sample progress line streams to stderr with metric scores and elapsed time. On completion, a summary line shows total elapsed, scored count, skipped count, overall score, RMM level, and report path.

Resilience: partial reports and retry budget

Starting in v0.1.3, audit is resilient to mid-loop failures. If the run aborts — via Ctrl+C, an unhandled exception, or an exhausted 529 retry budget during a sustained Anthropic capacity event — the orchestrator flushes everything scored so far to audit-report.partial.json before propagating the error. The partial file preserves per-sample results so a long paid audit never vaporizes its own progress.

Partial reports use a dual-surface design: top-level metrics, rmm_level, and overall_score are unconditionally null so the standard report shape visibly lacks data. Subset aggregates live under partial_metrics.by_metric with an in-band caveat. RMM level is omitted entirely — threshold claims apply to the whole pipeline, not a subset.
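A minimal sketch of that dual-surface shape, assuming each scored sample carries a "scores" dict of metric name to score (the sample field names are assumptions, not the evaluator's real schema):

```python
def build_partial_report(scored_samples):
    """Assemble a partial report: top-level aggregates are null, subset
    aggregates live under partial_metrics.by_metric with a caveat, and
    no RMM level is computed for the subset."""
    by_metric = {}
    for sample in scored_samples:
        for metric, score in sample["scores"].items():
            by_metric.setdefault(metric, []).append(score)
    return {
        "metrics": None,        # visibly absent: full-run aggregate only
        "overall_score": None,  # visibly absent
        "rmm_level": None,      # never derived from a subset
        "partial_metrics": {
            "caveat": "Aggregates cover only samples scored before the abort.",
            "by_metric": {m: sum(v) / len(v) for m, v in by_metric.items()},
        },
        "samples": scored_samples,
    }
```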

When the Claude judge hits a 529 Overloaded error, the retry wrapper streams a one-line notice to stderr on each retry so a long audit never looks frozen:

[judge: 529 overload, retry 2, elapsed 6s / 300s budget]

Tune the retry window via RAG_FORGE_JUDGE_OVERLOAD_BUDGET_SECONDS (see environment variables above).

Exit codes

Code  Meaning
0     Audit completed successfully; full report written.
1     Hard failure: unrecoverable error, no report.
2     Usage / config error (e.g. bad flag, missing file).
3     Partial audit: audit-report.partial.json written with everything scored so far.

CI scripts can branch on exit code 3 to surface partial artifacts without treating them as hard failures:

rag-forge audit --input telemetry.jsonl --judge claude --yes
code=$?
if [ $code -eq 3 ]; then
  echo "Partial audit — inspect $OUTDIR/audit-report.partial.json"
fi

Examples

Audit against a golden set with mock judge

rag-forge audit --golden-set eval/golden_set.json

Audit production telemetry with Claude as judge

rag-forge audit --input ./telemetry/pipeline.jsonl --judge claude --output ./reports/prod

Pin a specific Claude model

rag-forge audit --input ./telemetry/pipeline.jsonl --judge claude --judge-model claude-opus-4-6

Use RAGAS as the evaluator engine

# Note: ragas requires --judge openai (or omit --judge entirely)
rag-forge audit --golden-set eval/golden_set.json --evaluator ragas --judge openai

Generate a PDF report

rag-forge audit --golden-set eval/golden_set.json --judge claude --pdf