rag-forge index

Index documents into the vector store.

Synopsis

rag-forge index --source <dir> [options]

Description

index is the core ingestion command. It reads source documents from a directory, parses and chunks them, generates embeddings, and writes the resulting chunks into the vector store under the specified collection name.

--source is required; every other option is optional, so the minimal invocation is rag-forge index --source ./docs.

By default the command uses the mock embedding provider so you can test the pipeline without an API key. Switch to openai or local for production use. The chunking strategy and chunk size mirror the options available in chunk — use rag-forge chunk first to validate your settings before running a full index.
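The validate-then-index workflow recommended above might look like the following sketch. The flags passed to rag-forge chunk are an assumption based on the note that its options mirror those of index; consult the chunk reference for the exact synopsis.

```shell
# Preview chunk boundaries without touching the vector store
# (flags assumed to mirror `index`, per the note above)
rag-forge chunk --source ./docs --strategy recursive

# Once the chunk output looks right, run the full ingestion
# with the same strategy
rag-forge index --source ./docs --strategy recursive
```

Keeping the strategy flags identical across both commands ensures the chunks you inspected are the chunks that get embedded.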

When --enrich is set, the command prepends a document-level summary to each chunk before embedding. This increases retrieval quality at the cost of additional LLM calls. --enrichment-generator selects which model performs the summarization. If --strategy llm-driven is set, --chunking-generator is required.

The optional --sparse-index-path flag persists a BM25 sparse index to disk alongside the dense vector store, enabling hybrid retrieval in subsequent query calls.

Options

| Flag | Default | Description |
| --- | --- | --- |
| -s, --source <dir> | required | Source directory of documents to index |
| -c, --collection <name> | rag-forge | Collection name in the vector store |
| -e, --embedding <provider> | mock | Embedding provider: openai, local, or mock |
| --strategy <name> | recursive | Chunking strategy: fixed, recursive, semantic, structural, or llm-driven |
| --chunking-generator <provider> | — | Generator for LLM-driven chunking: claude, openai, or mock. Required when --strategy llm-driven |
| --enrich | — | Enable contextual enrichment (document summary prepending) |
| --enrichment-generator <provider> | — | Generator for enrichment summaries: claude, openai, or mock. Requires --enrich |
| --sparse-index-path <path> | — | Path to persist the BM25 sparse index |

Examples

Minimal indexing with mock embeddings

rag-forge index --source ./docs

Production indexing with OpenAI embeddings

rag-forge index --source ./docs --embedding openai --collection my-project

Hybrid retrieval setup (dense + sparse)

rag-forge index --source ./docs --embedding openai --sparse-index-path ./bm25.pkl

LLM-driven chunking with contextual enrichment

rag-forge index --source ./docs \
  --strategy llm-driven \
  --chunking-generator claude \
  --enrich \
  --enrichment-generator claude