Chat With Your PDFs: How RAG-Powered Document Assistants Actually Work
You have 400 supplier contracts sitting in a shared drive. Your CFO walks in and asks, "Which of our agreements include a force majeure clause that covers cyber incidents?" You have two choices: spend the afternoon opening PDFs one by one, or type the question into an AI assistant that has already read every document and will answer in seconds -- with a link to the exact page in each contract where it found the answer.
The second option is what "chat with your PDFs" really means. And it is no longer a research demo. In 2026 it is a standard feature of serious document intelligence platforms. This guide explains how it works under the hood, where the engineering choices matter, and how to evaluate an AI document assistant that you can trust in production.
What "Chat With Your Documents" Actually Means
The term gets used loosely. There are three distinct implementations floating around, and they have very different trade-offs:
- Full-context stuffing -- dump the entire document (or several documents) into the model's context window and ask the question. Works for one short PDF. Breaks as soon as you have 50 long contracts because you blow past token limits, burn credits on irrelevant text, and the model loses focus in the noise.
- Keyword search + LLM summarization -- run a classic full-text search on the documents, pull the top hits, and ask the model to summarize. Misses anything that does not share vocabulary with the query. "Escalation" does not match "price adjustment." "Termination for convenience" does not match "either party may end this agreement."
- Retrieval-Augmented Generation (RAG) -- chunk documents, embed them into a vector space, and retrieve the semantically relevant passages for every question. This is the approach that actually scales and the one DocumentIQ's chat assistant uses.
The rest of this post is about option three.
How RAG Works: The Four-Stage Pipeline
A RAG-powered document assistant has four stages. Each one has a meaningful impact on answer quality.
Stage 1: Text Extraction
Before anything else, you need clean text from every document. For native PDFs this is fast -- libraries like PyMuPDF or pdfplumber pull embedded text in milliseconds. For scanned PDFs or image-only documents, OCR runs first to produce text. DOCX files get parsed directly.
This step sounds trivial. It is not. Bad text extraction at this stage corrupts everything downstream. Multi-column layouts read across columns instead of within them. Tables lose their row structure. Headers bleed into body text. A good document assistant invests in high-quality extraction, because no amount of clever retrieval fixes garbage input.
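One way to picture the split between native extraction and OCR is a per-page check on how much embedded text came back. The sketch below is only an illustration: the `PageText` type, the `pages_needing_ocr` helper, and the 20-character threshold are all assumptions, not a description of any particular product's extractor.

```python
from dataclasses import dataclass

# Assumed threshold: pages with fewer embedded characters than this
# are treated as scanned and routed to OCR.
MIN_EMBEDDED_CHARS = 20


@dataclass
class PageText:
    page_number: int
    text: str  # text pulled from the PDF's embedded text layer ("" if none)


def pages_needing_ocr(pages: list[PageText],
                      min_chars: int = MIN_EMBEDDED_CHARS) -> list[int]:
    """Return the page numbers whose embedded text is too sparse to trust."""
    return [p.page_number for p in pages if len(p.text.strip()) < min_chars]


pages = [
    PageText(1, "MASTER SUPPLY AGREEMENT between Acme Ltd and Globex Inc"),
    PageText(2, ""),        # scanned signature page: no text layer at all
    PageText(3, "  \n "),   # whitespace only
]
ocr_queue = pages_needing_ocr(pages)
```

In a real pipeline the `text` field would come from an extractor such as PyMuPDF or pdfplumber, and the flagged pages would be sent to an OCR engine before chunking.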
Stage 2: Chunking
You cannot embed a 40-page contract as a single vector. You need to split it into pieces that are small enough to be semantically focused but large enough to preserve context. This is called chunking, and it is the single most underestimated part of a RAG pipeline.
There are three common strategies:
- Fixed-token chunking -- split every N tokens with a bit of overlap. Simple and predictable, but it ignores semantic boundaries. Works fine as a baseline.
- Structural chunking -- split on detected headings, numbered clauses (5.3), and blank lines. Good for well-structured documents like contracts and specifications.
- Semantic chunking -- embed every sentence, then scan consecutive sentences for cosine-similarity drops. A drop signals a topic shift and becomes a chunk boundary. The result: every chunk is semantically cohesive, not cut at an arbitrary word count.
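The fixed-token baseline from the list above is easy to sketch. This version assumes whitespace-separated words stand in for tokens (a real pipeline would use the embedding model's tokenizer), and the function name and default sizes are illustrative.

```python
def fixed_token_chunks(tokens: list[str], size: int = 200,
                       overlap: int = 20) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens
    with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace split is a stand-in for a real tokenizer.
tokens = ("the supplier may adjust prices once per calendar year "
          "upon written notice").split()
chunks = fixed_token_chunks(tokens, size=5, overlap=2)
```

Note how the windows ignore sentence and clause boundaries entirely, which is the weakness the other two strategies try to address.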
DocumentIQ uses semantic chunking by default, with structural and fixed-token as swappable alternatives. The active strategy is a single configuration flag -- no code change required to switch.
Why this matters for chat: when a user asks "What is the indemnification cap?", the retriever should pull the chunk that contains the indemnification discussion -- whole and intact. If the chunk boundary falls in the middle of the clause, the retrieved context will be missing half the answer, and the model will either hallucinate the rest or tell the user it does not know.
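To make the similarity-drop idea concrete, here is a minimal sketch of semantic chunking. It is not DocumentIQ's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an assumption chosen for this toy data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str],
                    threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever consecutive sentences fall below
    `threshold` cosine similarity (an assumed cut-off)."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity drop: topic shift
        else:
            chunks[-1].append(sentence)  # same topic: extend current chunk
    return chunks


sentences = [
    "The supplier shall indemnify the buyer against third party claims.",
    "The indemnity of the supplier is capped at the annual fees.",
    "Invoices are payable within thirty days.",
    "Late invoices are subject to interest.",
]
chunks = semantic_chunks(sentences)
```

With these sentences the indemnification pair stays together and the invoicing pair forms a second chunk, so a question about the indemnification cap would retrieve the clause intact.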
Stage 3: Embedding and Vector Storage
Each chunk is passed through an embedding model (text-embedding-3-small or similar, 1536 dimensions). The resulting vector is stored alongside the chunk text in a vector database. DocumentIQ uses pgvector -- the PostgreSQL extension -- so the vectors live in the same database as the rest of the data. No separate vector service to run, monitor, or pay for.
The vector captures the meaning of the chunk. Two chunks about "price escalation" and "annual fee adjustment" will land close together in vector space, even though they share no keywords. That is what makes the search semantic.
Stage 4: Retrieval and Generation
When the user sends a question, the assistant does this:
- Embed the question using the same model that embedded the chunks.
- Run a cosine-similarity search across the vector store, filtered to the current project so cross-project data never leaks.
- Retrieve the top N chunks above a similarity threshold. In DocumentIQ, N and the threshold are configurable per billing plan -- higher tiers get more candidates and a lower similarity floor for broader coverage.
- Fit chunks into a token budget -- greedy fill from highest similarity down, stopping when the budget is hit. This prevents the model's context from being flooded.
- Assemble the prompt: system instructions + project's extracted data summary + the retrieved chunks labelled with their source document and page range + the user's question + prior turns of conversation.
- Stream the answer token-by-token over Server-Sent Events so the UI feels responsive on long answers.
- Resolve citations -- parse [filename.pdf] references from the model's response and link them back to the source documents so the user can verify every claim.
That is the full pipeline. The engineering detail that matters in production is in the second, fourth, and seventh steps of that final stage: project-filtered search, token budgeting, and citation resolution.
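The retrieval steps can be sketched in plain Python. Everything here is illustrative: the `Chunk` type, the two-dimensional vectors, the default limits, and the word-count token estimate are assumptions. In a pgvector setup the filter and similarity ordering would normally be done in SQL rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    project_id: str
    document: str
    pages: str
    text: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0


def retrieve(question_vec, chunks, project_id,
             top_n=3, min_similarity=0.5, token_budget=40):
    # Hard filter by project BEFORE scoring, so other projects never compete.
    candidates = [c for c in chunks if c.project_id == project_id]
    scored = sorted(((cosine(question_vec, c.vector), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for score, chunk in scored[:top_n]:
        if score < min_similarity:
            break                       # below the similarity floor
        cost = len(chunk.text.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            break                       # greedy fill stops at the budget
        selected.append(chunk)
        used += cost
    return selected


def build_context(chunks) -> str:
    """Label each retrieved chunk with its source document and page range."""
    return "\n\n".join(f"[{c.document}] (pages {c.pages})\n{c.text}"
                       for c in chunks)


store = [
    Chunk("A", "msa.pdf", "12-13",
          "Liability is capped at twelve months of fees.", [0.9, 0.1]),
    Chunk("A", "nda.pdf", "2",
          "Confidential information excludes public data.", [0.1, 0.9]),
    Chunk("B", "other.pdf", "4",
          "Liability is uncapped for wilful misconduct.", [1.0, 0.0]),
]
hits = retrieve([1.0, 0.0], store, project_id="A")
context = build_context(hits)
```

The Project B chunk is the closest match to the question vector, yet it never appears in the results because it is filtered out before scoring rather than after.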
Why "Chat With Your Documents" Demos Lie to You
Most demos look great because they use one short document and a question the model can answer from its training data anyway. The failure modes only show up at scale:
- Retrieval misses -- the relevant chunk does not score highly for the question vector. This is usually a chunking or embedding-model issue. Re-indexing with a better chunking strategy or embedding model usually fixes it.
- Context flood -- 50 marginally related chunks get retrieved and the model loses the thread. Strict similarity thresholds and token budgets keep this in check.
- Hallucinated citations -- the model invents a filename that never existed. Mitigated by post-processing that resolves citation strings against the actual project's document list and drops unresolved ones.
- Cross-project leakage -- a user in Project A gets an answer that quotes a document from Project B. Mitigated by hard-filtering every vector search to the current project's document IDs at query time, not after.
- Stale retrieval -- documents were updated but embeddings were never refreshed. Fixed by triggering a re-index whenever a document is re-uploaded, and by including a chunking-strategy version field so you can detect and purge stale chunks after a strategy change.
If a vendor cannot explain how they handle each of these, assume they do not handle them.
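The stale-retrieval fix is the easiest of these to sketch. The example below assumes each stored chunk carries a document revision number and a chunking-strategy version label; the field names and the "semantic-v2" label are hypothetical.

```python
from dataclasses import dataclass

CURRENT_STRATEGY_VERSION = "semantic-v2"  # hypothetical version label


@dataclass
class StoredChunk:
    document_id: str
    document_revision: int
    strategy_version: str


def stale_chunks(stored: list[StoredChunk],
                 latest_revision: dict[str, int],
                 current_version: str = CURRENT_STRATEGY_VERSION):
    """Chunks built from an old upload or with an old chunking strategy."""
    return [
        c for c in stored
        if c.strategy_version != current_version
        or c.document_revision < latest_revision.get(c.document_id,
                                                     c.document_revision)
    ]


stored = [
    StoredChunk("doc1", 1, "semantic-v2"),  # current
    StoredChunk("doc2", 1, "semantic-v2"),  # document re-uploaded since
    StoredChunk("doc3", 3, "fixed-v1"),     # chunked with an old strategy
]
stale = stale_chunks(stored, {"doc1": 1, "doc2": 2, "doc3": 3})
```

Anything this check returns would be purged and re-indexed before it can be retrieved again.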
Citations: The Feature That Makes Chat Trustworthy
No enterprise will roll out a document chat assistant that cannot cite its sources. The business reason is obvious: if the assistant claims "the supplier agreement allows 30 days of notice for termination," the lawyer reading the answer needs to verify it on the actual page, not take the AI's word for it.
Every answer in DocumentIQ's chat includes inline citation chips that link to the source document and the relevant page range. Clicking a chip opens a side panel showing the exact extracted text used in the context. To keep the model from answering out of its training data, the system prompt explicitly tells it to respond "this information is not in the provided documents" when the retrieved context does not contain the answer -- and low-temperature settings keep that behaviour consistent.
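Citation resolution can be sketched as a post-processing pass over the model's answer. This is a simplified illustration, not DocumentIQ's code: the regular expression, the function name, and the choice to silently drop unresolved citations are assumptions.

```python
import re

# Matches bracketed filenames such as [acme_msa.pdf] or [terms.docx].
CITATION = re.compile(r"\[([^\[\]]+\.(?:pdf|docx))\]", re.IGNORECASE)


def resolve_citations(answer: str,
                      project_documents: set[str]) -> tuple[str, list[str]]:
    """Keep citations that name a real project document; drop the rest."""
    resolved: list[str] = []

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in project_documents:
            if name not in resolved:
                resolved.append(name)
            return match.group(0)  # keep the citation as written
        return ""                  # hallucinated filename: remove it

    cleaned = CITATION.sub(replace, answer)
    cleaned = re.sub(r"\s+([.,;])", r"\1", cleaned)   # tidy " ." leftovers
    cleaned = re.sub(r" {2,}", " ", cleaned).strip()
    return cleaned, resolved


answer = ("Notice period is 30 days [acme_msa.pdf]. "
          "Renewal is automatic [made_up.pdf].")
cleaned, cited = resolve_citations(answer, {"acme_msa.pdf", "globex_nda.pdf"})
```

The `cited` list is what a UI would turn into citation chips; a stricter variant might flag the sentence that lost its citation instead of leaving it unmarked.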
Scope: Why Chat Should Be Project-Scoped
Cross-document search across an entire organization sounds powerful. In practice it is a security hazard and a relevance killer. Mixing 400 supplier contracts with 2,000 HR documents and 10,000 customer emails means every query gets distracted by irrelevant matches, and sensitive data from one team ends up in answers meant for another.
DocumentIQ scopes chat to a single project. Every query runs against only the documents in that project. Different teams, different projects, different chat spaces. Permissions follow project membership. You get the relevance benefits of a focused corpus and the security benefits of hard isolation.
Chat vs. Extraction: When to Use Which
Chat and structured extraction are complementary, not competing.
Use structured extraction when:
- You need the same fields across many documents in a repeatable, exportable format (all invoice totals, all contract end dates, all line items).
- Downstream systems need structured JSON or rows in a database.
- The output will feed a dashboard, ledger, or reporting tool.
Use chat when:
- You have ad-hoc, open-ended questions ("Which contracts mention data residency in the EU?").
- The question is exploratory and might spawn follow-up questions.
- Different users ask different things of the same document corpus and you cannot pre-define every field.
In DocumentIQ, both run on the same documents in the same project. The extraction engine populates structured tables; the chat assistant uses both those extracted values and the raw document chunks as context. Asking "What is the total value of all contracts renewed this quarter?" uses the extracted contract_value and renewal_date fields. Asking "Explain the indemnification structure in the Acme agreement" uses the raw document chunks. Most real questions use both.
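The "total value of all contracts renewed this quarter" question is really a filter and a sum over extracted fields. Here is a small sketch using the contract_value and renewal_date fields mentioned above; the row layout, sample values, and helper names are made up for illustration.

```python
from datetime import date

# Hypothetical extracted rows; only the field names come from the text above.
contracts = [
    {"name": "Acme", "contract_value": 120_000,
     "renewal_date": date(2026, 2, 10)},
    {"name": "Globex", "contract_value": 80_000,
     "renewal_date": date(2026, 3, 31)},
    {"name": "Initech", "contract_value": 50_000,
     "renewal_date": date(2026, 4, 1)},
]


def quarter_bounds(year: int, quarter: int) -> tuple[date, date]:
    """Return [start, end) dates for a calendar quarter."""
    start_month = 3 * (quarter - 1) + 1
    start = date(year, start_month, 1)
    end = (date(year + 1, 1, 1) if quarter == 4
           else date(year, start_month + 3, 1))
    return start, end


def total_renewed(rows, year: int, quarter: int) -> int:
    start, end = quarter_bounds(year, quarter)
    return sum(r["contract_value"] for r in rows
               if start <= r["renewal_date"] < end)


q1_total = total_renewed(contracts, 2026, 1)
```

An answer built this way is exact arithmetic over structured data, which is why numeric questions are better served by extracted fields than by asking a model to add up figures it read in chunks.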
A Two-Minute Tour of Building This Yourself in DocumentIQ
If you want to try this on your own documents:
- Create a project and upload your PDFs or DOCX files. Text extraction and semantic chunking run in the background.
- Wait for indexing to complete -- you can see progress per-document in the project view. For a typical contract (20 pages), indexing takes under 10 seconds.
- Open the Chat tab on the project. Click "New Chat."
- Pick a model in the chat settings. Different plans include different models; higher-quality models cost more credits per message but handle nuanced legal or technical language better.
- Ask a question. The answer streams back with inline citation chips. Click any chip to jump to the source excerpt.
You do not need to configure extraction fields first to use chat. Extraction and chat are independent features -- you can turn on chat and skip structured extraction entirely, or combine them.
Cost and Credits
Every chat message consumes credits from the organization's wallet. The cost depends on two things:
- The selected model. Frontier models (GPT-4o, Claude 3.5 Sonnet, Opus 4) cost more per message than smaller models (GPT-4o Mini).
- The plan's discount factor. Enterprise plans apply a flat discount to every credit transaction.
In practice, a typical question with 20 retrieved chunks and a 400-token answer runs between 0.2 and 1.5 credits depending on the model. That is about one-tenth of the cost of an extraction run over the same document set, which is why chat tends to be the first feature teams adopt after extraction.
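As a rough illustration of how the two factors combine, the sketch below multiplies a per-message model rate by one minus the plan discount. The rates, the model labels, and the multiplicative formula itself are assumptions made for the example, not published pricing.

```python
from decimal import Decimal

# Hypothetical per-message rates in credits, picked from the range above.
MODEL_RATE = {
    "small-model": Decimal("0.2"),
    "frontier-model": Decimal("1.5"),
}


def message_cost(model: str, plan_discount: Decimal = Decimal("0")) -> Decimal:
    """Credits for one message: model rate reduced by the plan's flat discount
    (assumed to be a fraction between 0 and 1)."""
    if not Decimal("0") <= plan_discount < Decimal("1"):
        raise ValueError("plan_discount must be in [0, 1)")
    return MODEL_RATE[model] * (1 - plan_discount)


standard = message_cost("frontier-model")
enterprise = message_cost("frontier-model", Decimal("0.20"))
```

Decimal is used instead of float so that credit balances do not accumulate rounding error across many small transactions.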
What Good Looks Like: A Checklist
When you evaluate a document chat feature from any vendor, check for these:
- [x] Citations that link to page-level source excerpts -- not just "this document."
- [x] Configurable chunking strategy that can be swapped without a redeploy.
- [x] Project-scoped retrieval with hard filtering at query time.
- [x] Streaming responses so the UI does not freeze on long answers.
- [x] Model choice per session so users can pick speed vs. quality.
- [x] Graceful handling of "not in the documents" -- the assistant admits when it does not know.
- [x] Re-index on document update so answers reflect the current version of every file.
- [x] Pluggable embedding model so you can upgrade embedding quality later without re-architecting.
DocumentIQ ships all of these out of the box. If you are building your own RAG pipeline, treat this list as the minimum bar.
The Bottom Line
Chat with your PDFs is not a gimmick. It is the most natural interface for a specific, valuable task: answering ad-hoc questions across a large document corpus with verifiable citations. The engineering is non-trivial -- chunking, retrieval, prompt assembly, citation resolution, and scope control all matter -- but when the pieces are done well, the result feels like magic and works in production.
If you are still searching through contracts and reports by hand, you are paying a hidden tax every day. The good news: that tax is optional.
Create a free account and try the project chat on your own documents. No credit card, no sales call, no setup. Upload a PDF, ask a question, see the citations.