
Building a Production RAG Pipeline: From Ingestion to Response

Step-by-step guide to building a production RAG pipeline. Covers document processing, chunking, embedding, indexing, retrieval, and response generation — with code examples and architecture diagrams.

15 min read · Updated Mar 11, 2026

Architecture Overview: The Production RAG Pipeline

This guide walks through building a complete RAG pipeline that's production-ready — not a tutorial that works in a notebook but breaks under real traffic. We'll cover each component with the specific decisions that matter for SaaS applications.

The pipeline has four major components:

Ingestion Pipeline → Vector Store → Retrieval API → Generation Layer → Response
(async, batch)       (pgvector)    (sync, real-time)  (Claude/GPT)     (streaming)

Ingestion runs asynchronously — documents are processed and indexed in the background. Retrieval and generation run synchronously in the request path, with strict latency budgets.

Document Processing: Handling Real-World Formats

Production documents aren't clean Markdown files. They're PDFs with headers and footers, HTML with navigation chrome, Word documents with formatting artifacts, and API responses with boilerplate.

PDF Processing

Use pdf-parse for text-based PDFs and Tesseract or Amazon Textract for scanned documents. Key decisions:

  • Header/footer removal: PDFs repeat headers and footers on every page. Detect and remove them by comparing content across page boundaries.
  • Table extraction: Standard text extraction linearizes tables into nonsense. Use dedicated table extraction (Textract, or Claude's vision capabilities for complex tables).
  • Layout analysis: Multi-column PDFs need column detection before text extraction. Process left-to-right, top-to-bottom within each column.
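A minimal sketch of the header/footer heuristic, assuming page-level text is already available: lines that repeat at the top or bottom of most pages are treated as boilerplate. The two-line window and 60% threshold are assumptions, not fixed rules.

```typescript
// Strip lines that repeat at the edges of most pages (likely headers/footers).
// `pages` holds each page's extracted text.
function stripRepeatedLines(pages: string[], threshold = 0.6): string[] {
  const counts = new Map<string, number>();
  for (const page of pages) {
    const lines = page.split("\n");
    // Only the first and last couple of lines can be headers/footers
    const edges = new Set([...lines.slice(0, 2), ...lines.slice(-2)]);
    for (const line of edges) {
      const key = line.trim();
      if (key) counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  const repeated = new Set(
    [...counts].filter(([, n]) => n / pages.length >= threshold).map(([l]) => l)
  );
  return pages.map((page) =>
    page.split("\n").filter((l) => !repeated.has(l.trim())).join("\n")
  );
}
```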

HTML Processing

Strip navigation, sidebars, footers, and scripts. Keep the main content area. Mozilla's Readability library is the best tool for this — it's what Firefox Reader Mode uses.

Preserve heading hierarchy (h1 → h6) as metadata. This helps chunking respect document structure.
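One way to capture that hierarchy, sketched here under the assumption that content has been normalized to markdown-style `#` headings; the output shape mirrors the `headingHierarchy` metadata used later in chunking:

```typescript
// Walk markdown-style headings and record the hierarchy ("breadcrumb")
// in effect at each heading, for later attachment to chunks.
function headingHierarchies(markdown: string): Array<{ heading: string; path: string[] }> {
  const stack: { level: number; text: string }[] = [];
  const out: Array<{ heading: string; path: string[] }> = [];
  for (const line of markdown.split("\n")) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (!m) continue;
    const level = m[1].length;
    // Pop siblings and deeper levels before pushing the new heading
    while (stack.length && stack[stack.length - 1].level >= level) stack.pop();
    stack.push({ level, text: m[2] });
    out.push({ heading: m[2], path: stack.map((h) => h.text) });
  }
  return out;
}
```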

Metadata Extraction

For every document, extract and store:

typescript
interface DocumentMetadata {
  source: string;        // URL or file path
  title: string;         // Document title
  section: string;       // Section/chapter name
  contentType: string;   // "api-docs", "faq", "tutorial", etc.
  lastModified: string;  // For freshness filtering
  hash: string;          // For change detection
}

This metadata enables filtered retrieval ("search only API docs") and freshness management ("prefer recent content").
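The hash field is what makes change detection cheap. A sketch, assuming SHA-256 over the processed text (`contentHash` and `needsReindex` are hypothetical helper names):

```typescript
import { createHash } from "node:crypto";

// Hash the processed document text for change detection.
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Re-embed only when the content hash differs from the one stored in
// DocumentMetadata for the last indexed version (undefined if never indexed).
function needsReindex(content: string, storedHash?: string): boolean {
  return contentHash(content) !== storedHash;
}
```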

Chunking: Recursive Strategy With Metadata Inheritance

After document processing, split content into chunks that are small enough for precise retrieval but large enough to contain meaningful context.

The Recursive Strategy

typescript
// Recursive strategy: split on section boundaries first (## headings),
// then paragraph boundaries (double newline), then sentence boundaries,
// until each piece fits the 400-600 token target. (The 50-token overlap
// step is omitted here for brevity.)
function chunkDocument(text: string): string[] {
  const separators = [/\n(?=## )/, /\n{2,}/, /(?<=\.) /];
  const maxChars = 600 * 4; // ~4 characters per token heuristic
  const split = (s: string, level = 0): string[] =>
    s.length <= maxChars || level >= separators.length
      ? [s.trim()]
      : s.split(separators[level]).flatMap((part) => split(part, level + 1));
  return split(text).filter((chunk) => chunk.length > 0);
}

Metadata Inheritance

Each chunk inherits metadata from its parent document plus chunk-specific metadata:

typescript
interface Chunk {
  id: string;
  content: string;
  embedding: number[];
  metadata: DocumentMetadata & {
    chunkIndex: number;
    headingHierarchy: string[];  // ["Authentication", "OAuth 2.0", "Token Refresh"]
    previousChunkId: string | null;
    nextChunkId: string | null;
  };
}

The heading hierarchy is critical — it provides context that the chunk text alone doesn't have. A chunk about "token refresh" makes much more sense when you know it's under "Authentication > OAuth 2.0."

Parent-Child Chunking

For complex documents, index small chunks (256 tokens) for precise retrieval but store a reference to the parent chunk (1024 tokens). When a small chunk is retrieved, return the parent chunk to the LLM for more context.

Document Section (2048 tokens)
├── Parent Chunk 1 (1024 tokens)
│   ├── Child Chunk 1a (256 tokens) ← indexed for retrieval
│   ├── Child Chunk 1b (256 tokens) ← indexed for retrieval
│   ├── Child Chunk 1c (256 tokens) ← indexed for retrieval
│   └── Child Chunk 1d (256 tokens) ← indexed for retrieval
└── Parent Chunk 2 (1024 tokens)
    ├── Child Chunk 2a (256 tokens)
    └── ...

When child chunk 1b is retrieved, the LLM receives parent chunk 1 (all 1024 tokens). This gives precise retrieval with rich context.
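The child-to-parent swap at query time can be sketched as follows; the `ChildHit` shape and the merging rule (keep each parent once, at its best child's score) are assumptions:

```typescript
interface ChildHit {
  id: string;
  parentId: string;
  score: number;
}

// Swap retrieved child chunks for their parents, deduplicating so each
// parent appears once, ranked by its best-scoring child.
function toParentHits(hits: ChildHit[]): Array<{ parentId: string; score: number }> {
  const best = new Map<string, number>();
  for (const h of hits) {
    best.set(h.parentId, Math.max(best.get(h.parentId) ?? -Infinity, h.score));
  }
  return [...best.entries()]
    .map(([parentId, score]) => ({ parentId, score }))
    .sort((a, b) => b.score - a.score);
}
```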

Embedding: Model Selection and Batching

Model Choice

For most SaaS RAG systems, OpenAI text-embedding-3-small offers the best cost/quality ratio. It's $0.02 per 1M tokens and produces 1536-dimensional vectors.

If retrieval quality is your top priority (e.g., medical or legal applications), use text-embedding-3-large (3072 dimensions, $0.13 per 1M tokens).

Batching Strategy

Embed chunks in batches of 100-500 for optimal throughput:

typescript
async function embedBatch(chunks: string[]): Promise<number[][]> {
  const BATCH_SIZE = 200;
  const results: number[][] = [];

  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
    });
    results.push(...response.data.map(d => d.embedding));
  }

  return results;
}

Query Embedding

At query time, embed the user's question with the same model. This is a single API call with sub-100ms latency.

Vector Database: pgvector Setup and Optimization

For SaaS products already using PostgreSQL, pgvector is the pragmatic choice. No additional infrastructure to manage — it's a Postgres extension.

Schema

sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  metadata JSONB NOT NULL,
  document_id UUID REFERENCES documents(id),
  created_at TIMESTAMPTZ DEFAULT now()
);

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- GIN index for metadata filtering
CREATE INDEX ON chunks USING gin (metadata);

-- Full-text search index for lexical ranking (Postgres ts_rank is not true
-- BM25, but serves the same keyword-matching role alongside vector search)
ALTER TABLE chunks ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON chunks USING gin (tsv);

Performance Tuning

  • HNSW parameters: m = 16 and ef_construction = 64 give good recall (95%+) with fast queries. Increase ef_construction to 128 for higher recall at the cost of slower index builds.
  • Query parameters: Set hnsw.ef_search = 40 for a good recall/speed trade-off at query time.
  • Memory: pgvector index needs to fit in RAM for best performance. 1M vectors × 1536 dimensions × 4 bytes = ~6GB. Ensure your Postgres instance has sufficient shared_buffers.
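The query-time setting can be applied per session, or scoped to a single query with SET LOCAL inside a transaction:

```sql
-- Session-wide default
SET hnsw.ef_search = 40;

-- Or scoped to one query: higher recall for this search only
BEGIN;
SET LOCAL hnsw.ef_search = 100;
-- ... run the vector search here ...
COMMIT;
```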

Retrieval: Hybrid Search Implementation

The Hybrid Pipeline

typescript
async function retrieve(query: string, filters?: Record<string, string>, topK = 5) {
  // 1. Embed the query (same model as at indexing time)
  const queryEmbedding = await embed(query);
  // pgvector accepts the '[1,2,...]' text format, which JSON.stringify produces
  const vectorParam = JSON.stringify(queryEmbedding);
  const filterParam = filters ? JSON.stringify(filters) : null;

  // 2. Vector search (top 20)
  const { rows: vectorResults } = await db.query(`
    SELECT id, content, metadata,
           1 - (embedding <=> $1) AS score
    FROM chunks
    WHERE ($2::jsonb IS NULL OR metadata @> $2::jsonb)
    ORDER BY embedding <=> $1
    LIMIT 20
  `, [vectorParam, filterParam]);

  // 3. Lexical (BM25-style) search (top 20)
  const { rows: bm25Results } = await db.query(`
    SELECT id, content, metadata,
           ts_rank(tsv, plainto_tsquery('english', $1)) AS score
    FROM chunks
    WHERE tsv @@ plainto_tsquery('english', $1)
      AND ($2::jsonb IS NULL OR metadata @> $2::jsonb)
    ORDER BY score DESC
    LIMIT 20
  `, [query, filterParam]);

  // 4. Reciprocal Rank Fusion
  const fused = reciprocalRankFusion(vectorResults, bm25Results);

  // 5. Return top K
  return fused.slice(0, topK);
}

Reciprocal Rank Fusion

RRF combines rankings from multiple sources without needing to normalize scores:

typescript
function reciprocalRankFusion<T extends { id: string }>(
  ...resultSets: T[][]
): Array<T & { rrfScore: number }> {
  const K = 60; // standard constant from the original RRF paper
  const byId = new Map<string, T & { rrfScore: number }>();

  for (const results of resultSets) {
    results.forEach((result, rank) => {
      // Keep the full row (content, metadata) so the caller can return it
      const entry = byId.get(result.id) ?? { ...result, rrfScore: 0 };
      entry.rrfScore += 1 / (K + rank + 1); // rank is 0-based
      byId.set(result.id, entry);
    });
  }

  return [...byId.values()].sort((a, b) => b.rrfScore - a.rrfScore);
}

Generation: Prompt Construction and Streaming

Context Assembly

typescript
function buildPrompt(query: string, chunks: Chunk[]): string {
  const context = chunks
    .map((chunk, i) =>
      `[Source ${i + 1}: ${chunk.metadata.title} > ${chunk.metadata.headingHierarchy.join(" > ")}]\n${chunk.content}`
    )
    .join("\n\n---\n\n");

  return `Answer the following question using only the provided context. Include source references [1], [2], etc. If the context doesn't contain enough information, say so.

Context:
${context}

Question: ${query}`;
}

Streaming Response

Always stream RAG responses. Users expect immediate feedback, and a complete answer often takes 2-5 seconds to generate:

typescript
async function* generateResponse(query: string, chunks: Chunk[]) {
  const prompt = buildPrompt(query, chunks);

  const stream = await anthropic.messages.stream({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: "You are a helpful assistant. Answer questions accurately based on the provided context.",
    messages: [{ role: "user", content: prompt }],
  });

  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      yield event.delta.text;
    }
  }
}

Production Hardening

Error Handling

  • Embedding API failure: Queue for retry, return "processing" status to user.
  • Vector search timeout: Fall back to BM25-only search.
  • LLM timeout (>5s): Return retrieved chunks with a "generating..." message, then update via WebSocket.
  • Empty retrieval: Return "I don't have information about that" instead of hallucinating.
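The timeout-with-fallback pattern behind the second bullet can be sketched generically; the helper names and the timeout budget are assumptions:

```typescript
// Reject a promise if it doesn't settle within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  return Promise.race([
    p.finally(() => clearTimeout(timer)),
    new Promise<T>((_, reject) => {
      timer = setTimeout(() => reject(new Error("search timeout")), ms);
    }),
  ]);
}

// Run the primary search (e.g. vector) with a timeout, falling back to a
// secondary search (e.g. BM25-only) if it is too slow or fails.
async function searchWithFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
  timeoutMs = 300
): Promise<T> {
  try {
    return await withTimeout(primary(), timeoutMs);
  } catch {
    return fallback();
  }
}
```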

Monitoring

Track these RAG-specific metrics:

  • Retrieval latency (p95 < 100ms)
  • Generation latency (p95 < 3s)
  • Empty retrieval rate (target < 5%)
  • User feedback on RAG responses
  • Cache hit rate for repeated queries
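To make the latency targets measurable, here is a toy in-process p95 tracker; the class name is hypothetical, and in production you would use a metrics library (e.g. a Prometheus histogram):

```typescript
// Naive in-process percentile tracker for retrieval/generation latency.
class LatencyTracker {
  private samples: number[] = [];

  record(ms: number): void {
    this.samples.push(ms);
  }

  // p95 via nearest-rank on a sorted copy; fine for modest sample counts.
  p95(): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    return sorted[Math.ceil(sorted.length * 0.95) - 1] ?? 0;
  }
}
```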

Scaling

For most SaaS products (< 500K chunks), a single Postgres instance with pgvector handles production traffic. Beyond that, consider:

  • Read replicas for search queries
  • Pinecone or Weaviate for managed scaling
  • Separate embedding computation from serving

Conclusion

A production RAG pipeline comes down to five stages that each need to work well: document processing, chunking, embedding, retrieval, and generation. Start with the simple version of each — recursive chunking, pgvector, hybrid search, Claude Sonnet — and optimize based on evaluation metrics. The infrastructure described here handles 90% of SaaS RAG use cases.

About the Author

Written by Rafael Danieli, founder of StoAI. Systems engineer specializing in production AI for SaaS companies. Background in distributed systems, reliability engineering, and integration architecture.