
RAG Architecture: The Definitive Guide for SaaS Engineers

The complete guide to building production RAG systems. Covers ingestion pipelines, chunking strategies, embedding models, vector databases, retrieval patterns, re-ranking, and the evaluation framework that ensures quality.

18 min read · Updated Mar 11, 2026

What RAG Actually Is (And Isn't)

Retrieval-Augmented Generation (RAG) is a pattern that gives LLMs access to your data at query time. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from your database and includes them as context for the LLM's response.

RAG is not fine-tuning. Fine-tuning changes the model's weights. RAG changes the model's context. This distinction matters because RAG is dramatically simpler to implement, easier to update, and doesn't require ML expertise.

For SaaS engineers, RAG is how you build AI features that know about your product's data — customer support bots that reference your docs, search features that understand your content, and assistants that answer questions about user-specific data.

The RAG Pipeline: 5 Stages

Every RAG system has five stages. Getting each one right compounds into quality. Getting any one wrong caps your system's potential.

Stage 1: Ingestion — Getting Data Into the System

Ingestion is the process of taking your source documents (PDFs, HTML, Markdown, database records) and preparing them for the RAG pipeline.

Document processing: Each format requires specific handling. PDFs need OCR for scanned content and layout analysis for tables. HTML needs boilerplate removal. Markdown is the easiest — already structured.

Metadata extraction: Extract and store metadata with each document: source URL, title, author, date, section hierarchy, document type. Metadata enables filtering at retrieval time ("only search the API docs" or "only results from the last 6 months").

Update strategy: Your data changes. Design ingestion for incremental updates, not full reindexing. Track document hashes and only re-process changed documents. A full reindex should be possible but shouldn't be necessary for daily operations.
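A hash-based change detector for this update strategy can be sketched in a few lines of Python. The function name and dict shapes here are illustrative, not part of any specific library:

```python
import hashlib

def plan_ingestion(documents, stored_hashes):
    """Decide which documents need (re-)processing.

    documents: {doc_id: raw_text} from the current crawl.
    stored_hashes: {doc_id: sha256 hex} recorded during the last run.
    """
    to_process, unchanged = [], []
    new_hashes = {}
    for doc_id, text in documents.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[doc_id] = h
        if stored_hashes.get(doc_id) != h:
            to_process.append(doc_id)   # new or changed: re-chunk and re-embed
        else:
            unchanged.append(doc_id)    # skip: content identical to last run
    # Documents that disappeared from the source since the last run.
    deleted = [d for d in stored_hashes if d not in documents]
    return to_process, unchanged, deleted, new_hashes
```

Deleted documents also need their chunks removed from the vector index, which is why the sketch reports them separately.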

Stage 2: Chunking — Splitting Documents Intelligently

Chunking is the most impactful decision in RAG pipeline design. The right chunking strategy can improve retrieval quality by 30-40%.

Fixed-size chunking (simplest): Split every N tokens with M token overlap. Start with 512 tokens and 50 token overlap. Simple, predictable, works reasonably well for homogeneous content.
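A minimal sketch of fixed-size chunking over a pre-tokenized document (the token IDs below stand in for whatever your tokenizer produces):

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into fixed-size chunks with overlap.

    Consecutive chunks share `overlap` tokens so that a sentence cut
    at a chunk boundary still appears whole in one of the two chunks.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks
```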

Recursive chunking (recommended default): Split on natural boundaries — paragraphs first, then sentences, then characters. Respects document structure better than fixed-size. LangChain's RecursiveCharacterTextSplitter implements this.

Semantic chunking: Use embeddings to detect topic shifts and split at semantic boundaries. Better quality but more complex and slower. Worth it for diverse document types.

Parent-child chunking: Index small chunks for retrieval precision, but retrieve the parent (larger) chunk for LLM context. This gives you the best of both worlds — precise retrieval with sufficient context.

Our recommendation: Start with recursive chunking at 512 tokens, 50 token overlap. Move to parent-child chunking when you need better quality and have evaluation infrastructure to measure the improvement.
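As an illustration of the recursive idea, here is a simplified character-based splitter. Production implementations such as LangChain's RecursiveCharacterTextSplitter additionally count tokens and manage overlap; this sketch only shows the fall-through from coarse to fine boundaries:

```python
def recursive_split(text, chunk_size=512,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that appears in the text:
    paragraphs first, then lines, sentences, words, finally characters."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(part) > chunk_size:
                    # A single part is still too big: recurse with
                    # finer separators.
                    chunks.extend(recursive_split(part, chunk_size, separators))
                    current = ""
                else:
                    current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator present at all: hard split on characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```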

Stage 3: Embedding — Converting Text to Vectors

Embedding models convert text chunks into numerical vectors that capture semantic meaning. Similar texts produce similar vectors.

Model selection:

  • OpenAI text-embedding-3-small (1536 dimensions): Best cost/quality ratio. $0.02 per 1M tokens. Good enough for most SaaS use cases.
  • OpenAI text-embedding-3-large (3072 dimensions): Higher quality at roughly 6x the cost ($0.13 per 1M tokens). Use when retrieval precision is critical.
  • Cohere embed-v3: Strong multilingual support. Good choice if your content spans multiple languages.
  • Open source (BGE, E5): Self-hosted, no API costs. Quality approaches commercial models for English content.

Batch processing: Embed documents in batches of 100-500. This maximizes throughput and minimizes API call overhead.
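Batching can be as simple as slicing the chunk list; `embed_fn` below is a placeholder for your embedding client call (e.g. a wrapper around the OpenAI embeddings endpoint), not a real API:

```python
def batched(items, batch_size=100):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, embed_fn, batch_size=100):
    """Embed all chunks in batches.

    embed_fn: callable taking a list of strings and returning a list
    of vectors in the same order (one API call per batch).
    """
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```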

Consistency rule: Always use the same embedding model for indexing and querying. Vectors from different models are not compatible.

Stage 4: Retrieval — Finding Relevant Chunks

Retrieval is where most RAG systems fail — not because the relevant chunks don't exist, but because the retrieval strategy doesn't find them.

Vector search: Query is embedded using the same model, then find the N most similar chunks using cosine similarity. This is the basic approach and works well for semantic queries ("How do I configure authentication?").
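Conceptually, vector search is just cosine similarity plus a top-N sort. A pure-Python sketch (real systems use an approximate-nearest-neighbor index instead of this linear scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_n(query_vec, index, n=5):
    """index: list of (chunk_id, vector). Returns the n most similar
    chunks as (chunk_id, score) pairs, best first."""
    scored = [(cid, cosine_similarity(query_vec, v)) for cid, v in index]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:n]
```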

Keyword search (BM25): Traditional keyword matching. Essential for exact matches — product names, error codes, IDs, technical terms. Vector search often misses these.

Hybrid search (recommended): Run both vector and keyword search in parallel. Combine results using Reciprocal Rank Fusion (RRF). This consistently outperforms either approach alone by 15-25% in our evaluations.

Implementation with pgvector:

sql
-- Vector search. <=> is pgvector's cosine distance operator.
-- Note: Postgres doesn't allow referencing the SELECT alias in WHERE,
-- so the expression is repeated instead of filtering on "similarity".
SELECT id, content, 1 - (embedding <=> query_embedding) AS similarity
FROM documents
WHERE 1 - (embedding <=> query_embedding) > 0.7
ORDER BY embedding <=> query_embedding
LIMIT 20;

-- Keyword search. ts_rank is Postgres's built-in full-text ranking
-- (an approximation of BM25-style scoring); query is a tsquery,
-- e.g. plainto_tsquery('english', user_input).
SELECT id, content, ts_rank(tsv, query) AS bm25_score
FROM documents
WHERE tsv @@ query
ORDER BY bm25_score DESC
LIMIT 20;

Merge results from both queries using RRF, then re-rank.
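Reciprocal Rank Fusion itself is only a few lines: each result list contributes 1/(k + rank) per document, so documents found by both searches accumulate score. A sketch (k=60 is the commonly used constant):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion.

    rankings: list of ranked doc-id lists, e.g. one from vector search
    and one from keyword search. Returns a single fused ranking.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked mid-list in both searches beats a document that appears high in only one, which is exactly the behavior you want from hybrid search.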

Stage 5: Generation — Producing the Answer

The generation stage takes retrieved chunks and produces a response using an LLM.

Context construction: Arrange retrieved chunks in relevance order. Include metadata (source, section title). Add a clear instruction: "Answer based only on the provided context. If the context doesn't contain the answer, say so."

Prompt template:

System: You are a helpful assistant for [Product]. Answer questions using only the provided context. If unsure, say "I don't have enough information to answer that."

Context:
[Retrieved Chunk 1 — Source: API Docs > Authentication]
[Retrieved Chunk 2 — Source: FAQ > Login Issues]
[Retrieved Chunk 3 — Source: Troubleshooting Guide]

User: {user_question}
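A template like the one above can be assembled with a small helper. The chunk dict keys (`content`, `source`) are assumptions about your retrieval output, not a fixed schema:

```python
def build_prompt(chunks, question):
    """Assemble a system prompt and user message from retrieved chunks.

    chunks: list of dicts with 'content' and 'source' keys, already
    ordered by relevance (most relevant first).
    """
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['content']}" for c in chunks
    )
    system = (
        "You are a helpful assistant for [Product]. Answer questions "
        "using only the provided context. If unsure, say \"I don't have "
        "enough information to answer that.\"\n\n"
        "Context:\n" + context
    )
    return system, question
```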

Streaming: Always stream responses for real-time RAG features. Users expect immediate feedback. Time to first token should be under 1 second.

Citation: Include source references in the response. This builds trust and lets users verify the AI's claims. Format: "According to the API Documentation [1], authentication uses..."

Re-Ranking: Improving Retrieval Precision

Initial retrieval (vector + BM25) casts a wide net — 20-50 chunks. Re-ranking narrows this to the 3-5 most relevant chunks using a cross-encoder model.

Cross-encoders are more accurate than embedding similarity because they see both the query and the document together. The trade-off is speed — cross-encoders process each query-document pair individually.

Implementation: Retrieve 20-30 chunks with hybrid search. Re-rank using Cohere Rerank or a local cross-encoder model. Pass the top 5 chunks to the LLM.
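The re-ranking step itself reduces to scoring and sorting; `score_fn` below stands in for the cross-encoder call (Cohere Rerank or a local model), which this sketch deliberately does not implement:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Narrow hybrid-search candidates to the top_k most relevant.

    candidates: list of (doc_id, text) pairs from hybrid search.
    score_fn: callable (query, text) -> relevance score; in practice
    a cross-encoder that sees query and document together.
    """
    scored = [(doc_id, text, score_fn(query, text))
              for doc_id, text in candidates]
    scored.sort(key=lambda t: t[2], reverse=True)
    return [(doc_id, text) for doc_id, text, _ in scored[:top_k]]
```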

Impact: In our evaluations, re-ranking improves answer quality by 15-25%. The cost is 100-200ms additional latency and a small API cost ($0.001-0.002 per query for Cohere Rerank).

Evaluation: Measuring RAG Quality

Without evaluation, you can't improve your RAG system. You need metrics at two levels.

Retrieval Metrics

  • Precision@K: Of the top K retrieved chunks, what fraction are relevant? Target: 70%+ at K=5.
  • Recall@K: Of all relevant chunks in your corpus, what fraction appear in the top K? Target: 85%+ at K=20.
  • MRR (Mean Reciprocal Rank): How high does the first relevant chunk rank? Target: 0.7+ (first relevant chunk in the top 2 positions on average).
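These three retrieval metrics are straightforward to implement. A sketch over lists of retrieved chunk IDs and sets of relevant IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k."""
    top = set(retrieved[:k])
    return sum(1 for d in relevant if d in top) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant hit per query.

    queries: list of (retrieved_ids, relevant_set) pairs; a query
    with no relevant hit retrieved contributes 0.
    """
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```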

Generation Metrics

  • Faithfulness: Does the response only contain information from the retrieved context? (No hallucination.)
  • Relevance: Does the response actually answer the question?
  • Completeness: Does the response cover all relevant aspects from the retrieved context?

Use LLM-as-judge to evaluate these metrics automatically on a sample of queries.

The Evaluation Dataset

Build a dataset of 100+ query-answer pairs. Include:

  • Common questions (what users actually ask)
  • Edge cases (ambiguous queries, multi-topic queries)
  • Adversarial queries (questions that shouldn't be answerable from your data)

Run this evaluation weekly and track trends. Any sustained quality drop triggers investigation.

Production Concerns

Handling Updates

When source documents change, you need to re-embed affected chunks. Track document versions and chunk-to-document mappings. Implement incremental re-indexing that processes only changed documents.

Scaling

For most SaaS products (under 1M documents), pgvector with proper indexing handles retrieval in < 50ms. Beyond 1M documents, consider Pinecone or Weaviate for managed scaling.

Cost

RAG costs come from three sources: embedding (one-time per document), storage (vector database), and retrieval + generation (per query). At typical SaaS scale, embedding and storage are negligible. Generation cost dominates — optimize with caching and context window management.

Conclusion

RAG is the most practical way to build AI features that understand your data. The quality of your RAG system depends on getting all five stages right — ingestion, chunking, embedding, retrieval, and generation.

Start simple: recursive chunking, OpenAI embeddings, pgvector, hybrid search, and Claude Sonnet for generation. This stack handles 90% of SaaS RAG use cases. Add re-ranking and advanced chunking when your evaluation metrics show they're needed.

About the author

Written by Rafael Danieli, founder of StoAI. Systems engineer specializing in production AI for SaaS companies. Background in distributed systems, reliability engineering, and integration architecture.