Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an architecture pattern that addresses a fundamental limitation of large language models: they know only what they learned during training and what is present in their context window, so on their own they cannot answer from private or recently changed documents. RAG adds a retrieval step before generation. When a user asks a question, the system first searches a knowledge base for relevant passages, then places those passages in the LLM prompt alongside the question. The model then generates an answer grounded in the retrieved evidence rather than relying solely on its training data.
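As a rough illustration of the "feed retrieved passages into the prompt" step, the sketch below assembles a prompt from a question and a list of passages. The function name build_prompt, the prompt wording, and the sample passages are all made up for this example and are not taken from any particular product.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble an LLM prompt that places retrieved passages ahead of the question."""
    # Number each passage so the model can refer to it as [1], [2], ...
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative inputs only; in a real system the passages come from the retrieval step.
prompt = build_prompt(
    "When does the lease expire?",
    ["The lease term ends on 31 March 2027.", "Rent is due monthly."],
)
print(prompt)
```

Numbering the passages is one simple way to let the model point back at its sources; other prompt layouts would work equally well.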
The retrieval mechanism typically uses vector search. Documents are split into chunks (paragraphs, sections, or semantically coherent segments), each chunk is converted into a dense vector embedding using a model like OpenAI's text-embedding-3-small, and the vectors are stored in a database with vector search capability. At query time, the user's question is embedded with the same model, the stored chunks are ranked by cosine similarity to the question vector, and the highest-ranked chunks are returned. DocumentIQ uses pgvector on PostgreSQL for this, with configurable chunking strategies (semantic, structural, or fixed-token) that can be switched at runtime without code changes.
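The chunk, embed, and rank flow described above can be sketched in memory as follows. This is a toy under stated assumptions: toy_embed is a hypothetical bag-of-words stand-in for a real dense embedding model, chunk_by_paragraph is only one possible structural strategy, and the sample document is invented. A production system would call an embedding model and let the database perform the similarity search.

```python
import math
import re
from collections import Counter


def chunk_by_paragraph(text: str) -> list[str]:
    """Structural chunking: split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Embed the query the same way as the chunks and return the k best matches."""
    q = toy_embed(query)
    scored = [(cosine_similarity(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


# Invented sample document with three paragraphs.
document = """The lease term ends on 31 March 2027.

Rent is due on the first day of each month.

The tenant may keep one pet with written consent."""

chunks = chunk_by_paragraph(document)
results = retrieve("When does the lease term end?", chunks, k=1)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With pgvector, the same ranking is usually pushed into SQL, for instance by ordering rows with the cosine distance operator `<=>`, so the application never loads every vector into memory.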
DocumentIQ's project chat assistant is a full RAG implementation. Each project's documents are automatically chunked and embedded after upload. When a user asks a question in the chat interface, the system retrieves the most relevant chunks from that project's documents, combines them with the project's extracted field values as structured context, and generates a streaming response with inline citations to specific source documents. Retrieval parameters (how many chunks to fetch, the token budget for context, and the minimum similarity threshold) are configurable per billing plan, so higher-tier plans can give the model more context and produce more comprehensive answers.
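One way the three retrieval parameters could interact is sketched below. Everything here is an assumption made for illustration: the plan names, the numeric values, the RetrievalParams and select_context names, and the whitespace-based token estimate are hypothetical and do not describe DocumentIQ's actual configuration or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalParams:
    top_k: int             # maximum number of chunks to include
    token_budget: int      # maximum total tokens of chunk text
    min_similarity: float  # chunks scoring below this are ignored


# Hypothetical plan names and values, chosen only to make the example visible.
PLAN_PARAMS = {
    "free": RetrievalParams(top_k=2, token_budget=10, min_similarity=0.5),
    "pro": RetrievalParams(top_k=5, token_budget=60, min_similarity=0.3),
}


def estimate_tokens(text: str) -> int:
    """Crude whitespace word count standing in for a real tokenizer."""
    return len(text.split())


def select_context(
    scored_chunks: list[tuple[float, str]], params: RetrievalParams
) -> list[str]:
    """Keep the best-scoring chunks that pass the threshold and fit the budget."""
    selected: list[str] = []
    used = 0
    for score, text in sorted(scored_chunks, key=lambda s: s[0], reverse=True):
        # Scores are descending, so nothing after this point can qualify.
        if score < params.min_similarity or len(selected) >= params.top_k:
            break
        cost = estimate_tokens(text)
        if used + cost > params.token_budget:
            continue  # too large for the remaining budget; a later chunk may fit
        selected.append(text)
        used += cost
    return selected


# Invented similarity scores and chunk texts.
scored = [
    (0.82, "The lease term ends on 31 March 2027."),
    (0.61, "Either party may terminate with 90 days written notice."),
    (0.41, "Rent is due on the first day of each month."),
    (0.12, "The building has 40 parking spaces."),
]

free_context = select_context(scored, PLAN_PARAMS["free"])
pro_context = select_context(scored, PLAN_PARAMS["pro"])
print(len(free_context), len(pro_context))
```

In this sketch the lower tier is limited mainly by its token budget and threshold, while the higher tier admits lower-scoring chunks and more text, which is the sense in which a larger plan yields a fuller context.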