rag-forge audit
Run evaluation on pipeline telemetry and generate an audit report.
Synopsis
rag-forge audit [--input <file>] [--golden-set <file>] [options]
Description
audit evaluates your RAG pipeline’s quality and produces a structured report. It requires at least one of --input (a JSONL file of captured telemetry) or --golden-set (a JSON file of ground-truth question/answer pairs). Both can be provided at the same time.
The command delegates to the Python rag_forge_evaluator.cli module. It scores each sample across a set of metrics (faithfulness, relevance, answer completeness, and others depending on the evaluator engine) and maps the overall score to an RMM level (0–5). The RMM — RAG Maturity Model — gives you a single headline grade that indicates where your pipeline sits on the quality spectrum.
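The score-to-level mapping can be pictured as a simple threshold ladder. This is an illustrative sketch only: the 0–5 levels are documented above, but these particular cutoffs are invented for the example and are not rag-forge's actual thresholds.

```python
def rmm_level(overall_score: float) -> int:
    """Map an overall score in [0, 1] to an RMM level 0-5.

    The threshold values below are illustrative assumptions, not
    rag-forge's real cutoffs.
    """
    thresholds = [0.2, 0.4, 0.6, 0.75, 0.9]  # assumed upper bounds for levels 0-4
    for level, upper in enumerate(thresholds):
        if overall_score < upper:
            return level
    return 5  # anything at or above the top threshold

print(rmm_level(0.85))  # 4 under these assumed thresholds
```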
Three evaluator engines are supported: llm-judge (default, uses a language model to score each metric), ragas (open-source evaluation framework, internally uses gpt-4o-mini regardless of --judge), and deepeval (another evaluation framework). The --judge flag selects the provider powering llm-judge; --judge-model lets you pin a specific model id for that provider.
On completion, audit saves an HTML report and a JSON report to the output directory. Pass --pdf to also render a PDF — Playwright availability is checked at audit start, not at the end, so missing dependencies fail fast before judge calls run.
The golden set file for a scaffolded project lives at eval/golden_set.json.
Options
| Flag | Default | Description |
|---|---|---|
| -i, --input <file> | — | Path to telemetry JSONL file. Either --input or --golden-set is required. |
| -g, --golden-set <file> | — | Path to golden set JSON file. Either --input or --golden-set is required. |
| -j, --judge <model> | mock | Judge provider for the llm-judge engine: mock \| claude \| openai. Unknown aliases fail loudly with a ConfigurationError. |
| --judge-model <name> | provider default | Specific judge model id (e.g. claude-opus-4-6, gpt-4-turbo). Falls back to the RAG_FORGE_JUDGE_MODEL env var, then the provider default. |
| -o, --output <dir> | ./reports | Output directory for reports. |
| --evaluator <engine> | llm-judge | Evaluator engine: llm-judge \| ragas \| deepeval. Note: --evaluator ragas requires --judge openai, since RAGAS uses its own internal OpenAI judge regardless of the flag. |
| --pdf | — | Generate PDF report (requires Playwright). Availability is checked at audit start. |
Environment variables
| Variable | Default | Description |
|---|---|---|
| RAG_FORGE_JUDGE_MODEL | provider default | Override the judge model id. Lower precedence than --judge-model. |
| RAG_FORGE_JUDGE_MAX_TOKENS | 4096 | Max output tokens per judge call. Bumped from 1024 in v0.1.1 — the old default truncated mid-array on long structured responses. |
| RAG_FORGE_JUDGE_MAX_RETRIES | 5 | Max retries on transient 429/5xx from the judge SDK, plus the attempt cap for the 529-specific retry wrapper. |
| RAG_FORGE_JUDGE_OVERLOAD_BUDGET_SECONDS | 300 | Wall-clock budget for 529 Overloaded retries (Anthropic-specific). The retry wrapper exits when either MAX_RETRIES attempts are exhausted or cumulative backoff would exceed this budget — whichever trips first. Raise to 600 for long audits during sustained capacity events. Added in v0.1.3. |
| RAG_FORGE_LOG_QUERIES | unset | Set to 1 to show query previews in the per-sample progress stream. Default is to redact query content (PHI/PII protection). |
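The dual stopping condition for 529 retries (attempt cap or wall-clock budget, whichever trips first) can be sketched as a backoff planner. The exponential schedule with a 2-second base is an assumption for the example; the documentation above only specifies the two exit conditions.

```python
def backoff_plan(max_retries: int = 5, budget_seconds: float = 300.0) -> list[float]:
    """Return the backoff delays the retry wrapper would actually sleep.

    Stops when either max_retries attempts are exhausted or the next
    delay would push cumulative backoff past the wall-clock budget.
    The exponential schedule (base 2s) is an illustrative assumption.
    """
    delays: list[float] = []
    elapsed = 0.0
    for attempt in range(max_retries):
        delay = 2.0 * (2 ** attempt)          # 2s, 4s, 8s, ... (assumed)
        if elapsed + delay > budget_seconds:  # budget would be exceeded: stop
            break
        delays.append(delay)
        elapsed += delay
    return delays
```

With the defaults (5 retries, 300 s budget) all five attempts fit; shrinking the budget trips the wall-clock condition first.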
Pre-run banner and confirmation
Starting in v0.1.1, audit prints a pre-run banner to stderr showing sample count, judge model, evaluator engine, total judge calls, estimated time, and estimated USD cost — and asks for confirmation before any judge calls are made. The npm CLI auto-confirms (the TypeScript wrapper passes --yes to the Python subprocess). Direct python -m rag_forge_evaluator.cli audit ... invocations require either an interactive terminal or an explicit --yes flag.
While the audit runs, a per-sample progress line streams to stderr with metric scores and elapsed time. On completion, a summary line shows total elapsed, scored count, skipped count, overall score, RMM level, and report path.
Resilience: partial reports and retry budget
Starting in v0.1.3, audit is resilient to mid-loop failures. If the run aborts — via Ctrl+C, an unhandled exception, or an exhausted 529 retry budget during a sustained Anthropic capacity event — the orchestrator flushes everything scored so far to audit-report.partial.json before propagating the error. The partial file preserves per-sample results so a long paid audit never vaporizes its own progress.
Partial reports use a dual-surface design: top-level metrics, rmm_level, and overall_score are unconditionally null so the standard report shape visibly lacks data. Subset aggregates live under partial_metrics.by_metric with an in-band caveat. RMM level is omitted entirely — threshold claims apply to the whole pipeline, not a subset.
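The dual-surface shape described above can be sketched as a JSON document. The field names metrics, overall_score, rmm_level, and partial_metrics.by_metric come from the description; the caveat wording, metric names, and sample payloads are invented for illustration.

```python
import json

# Illustrative partial-report shape (sample values are invented).
partial_report = {
    "metrics": None,         # unconditionally null: the full-report surface lacks data
    "overall_score": None,
    "rmm_level": None,       # no level claim: thresholds apply to the whole pipeline
    "partial_metrics": {
        "caveat": "subset only: 12 of 40 samples scored before abort",  # assumed wording
        "by_metric": {"faithfulness": 0.81, "relevance": 0.77},
    },
    "samples": [
        {"id": "q-001", "faithfulness": 0.90, "relevance": 0.80},
    ],
}

print(json.dumps(partial_report, indent=2))
```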
When the Claude judge hits a 529 Overloaded error, the retry wrapper streams a one-line notice to stderr on each retry so a long audit never looks frozen:
[judge: 529 overload, retry 2, elapsed 6s / 300s budget]
Tune the retry window via RAG_FORGE_JUDGE_OVERLOAD_BUDGET_SECONDS (see environment variables above).
Exit codes
| Code | Meaning |
|---|---|
| 0 | Audit completed successfully; full report written. |
| 1 | Hard failure — unrecoverable error, no report. |
| 2 | Usage / config error (e.g. bad flag, missing file). |
| 3 | Partial audit — audit-report.partial.json written with everything scored so far. |
CI scripts can branch on exit code 3 to surface partial artifacts without treating them as hard failures:
rag-forge audit --input telemetry.jsonl --judge claude --yes
code=$?
if [ $code -eq 3 ]; then
  echo "Partial audit — inspect $OUTDIR/audit-report.partial.json"
fi
Examples
Audit against a golden set with mock judge
rag-forge audit --golden-set eval/golden_set.json
Audit production telemetry with Claude as judge
rag-forge audit --input ./telemetry/pipeline.jsonl --judge claude --output ./reports/prod
Pin a specific Claude model
rag-forge audit --input ./telemetry/pipeline.jsonl --judge claude --judge-model claude-opus-4-6
Use RAGAS as the evaluator engine
# Note: ragas requires --judge openai (or omit --judge entirely)
rag-forge audit --golden-set eval/golden_set.json --evaluator ragas --judge openai
Generate a PDF report
rag-forge audit --golden-set eval/golden_set.json --judge claude --pdf
Related commands
rag-forge golden — manage the golden question/answer set
rag-forge assess — one-shot assessment without a full audit run