RAG Patterns

RAG (retrieval-augmented generation) gives your LLM access to private data without fine-tuning. You retrieve relevant context from a vector database and inject it into the prompt. Cloudflare gives you two paths: build it yourself with Vectorize, or use AI Search for a managed pipeline.

Prerequisites: Vectorize RAG quickstart

|                    | DIY (Vectorize)                  | AI Search (Managed)              |
| ------------------ | -------------------------------- | -------------------------------- |
| Chunking           | You control strategy and size    | Automatic                        |
| Embedding model    | Your choice (BGE, custom)        | Cloudflare’s default             |
| Reranking          | Implement yourself               | Built-in                         |
| Metadata filtering | Full control                     | Limited                          |
| Data sources       | Anything you can ingest          | URLs, documents, APIs            |
| Setup effort       | Medium - build the pipeline      | Low - point at data sources      |
| Customization      | Full control over every step     | Minimal                          |
| Best for           | Custom search, complex pipelines | Quick internal search, prototypes |

Use AI Search when you want a search experience over documents without building infrastructure. Point it at URLs or documents, and it handles chunking, embedding, indexing, and retrieval automatically.
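
With AI Search, retrieval and generation collapse into a single binding call. The sketch below stubs the binding shape so it stays self-contained; the instance name `my-search`, the `autorag(...).aiSearch(...)` method, and the `response` field reflect the current AI Search (formerly AutoRAG) binding, but verify them against the docs before relying on them:

```typescript
// Hypothetical minimal shape of the AI Search binding on env.AI
interface AiSearchLike {
  autorag(name: string): {
    aiSearch(input: { query: string }): Promise<{ response: string }>;
  };
}

// One call covers query embedding, retrieval, and answer generation
async function ask(ai: AiSearchLike, question: string): Promise<string> {
  const result = await ai.autorag("my-search").aiSearch({ query: question });
  return result.response;
}
```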

Use Vectorize when you need control over any part of the pipeline: custom chunking, specific embedding models, metadata filtering, or integration with other services.

The RAG Pipeline

flowchart TD
    subgraph ingest["Ingest Pipeline"]
        direction TB
        D["Documents"] --> CH["Chunk"]
        CH --> EM["Embed"]
        EM --> ST[("Vectorize")]
    end

    subgraph query["Query Pipeline"]
        direction TB
        Q["User Question"] --> QE["Embed Query"]
        QE --> VS["Vector Search"]
        VS --> RR["Rerank"]
        RR --> CTX["Build Context"]
        CTX --> LLM["Generate Answer"]
    end

    ST -.->|"Similarity search"| VS

The ingest pipeline runs once (or on updates). The query pipeline runs on every user request. Reranking is optional but significantly improves result quality when you have more than a handful of results.

Chunking Strategies

Chunking determines what the vector database can retrieve. Bad chunks mean bad retrieval, regardless of how good your embedding model is.

Fixed-Size Chunking

Split text every N characters or tokens. Simple and predictable.

function chunkFixedSize(text: string, size: number, overlap: number = 0): string[] {
  // Guard: overlap >= size would stop `start` from advancing (infinite loop)
  if (overlap >= size) {
    throw new Error("overlap must be smaller than size");
  }

  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const chunk = text.slice(start, start + size);
    if (chunk.trim().length > 0) {
      chunks.push(chunk.trim());
    }
    start += size - overlap;
  }

  return chunks;
}

// 500 chars per chunk, 50 char overlap
const chunks = chunkFixedSize(document, 500, 50);

Pros: Easy to implement, predictable chunk sizes. Cons: Splits mid-sentence, mid-paragraph, even mid-word. Context bleeds across chunk boundaries.

Paragraph-Based Chunking

Split on natural text boundaries (double newlines), then group into chunks up to a size limit. This is what the Vectorize RAG quickstart uses.

function chunkByParagraph(text: string, maxLength: number): string[] {
  const paragraphs = text.split(/\n\n+/);
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (current.length + para.length > maxLength && current.length > 0) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }

  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }

  return chunks;
}

Pros: Respects natural text boundaries. Semantically coherent chunks. Cons: Variable chunk sizes. Very long paragraphs still need splitting.
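
One way to handle the oversized-paragraph case is to keep the paragraph grouping but hard-split any single paragraph that exceeds the limit. A self-contained sketch (the function name and the hard-split fallback are illustrative, not part of the quickstart):

```typescript
// Paragraph chunking with a fixed-size fallback for oversized paragraphs
function chunkParagraphsWithFallback(text: string, maxLength: number): string[] {
  const chunks: string[] = [];
  let current = "";

  const flush = () => {
    if (current.trim().length > 0) chunks.push(current.trim());
    current = "";
  };

  for (const para of text.split(/\n\n+/)) {
    if (para.length > maxLength) {
      // A single paragraph larger than the limit: flush, then hard-split it
      flush();
      for (let i = 0; i < para.length; i += maxLength) {
        chunks.push(para.slice(i, i + maxLength).trim());
      }
      continue;
    }
    if (current.length + para.length > maxLength && current.length > 0) flush();
    current += para + "\n\n";
  }
  flush();

  return chunks;
}
```

Every chunk this returns is at most maxLength characters, so nothing slips past the embedding model's input limit unnoticed.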

Recursive Chunking

Split by the largest delimiter first (sections, then paragraphs, then sentences, then words). Each level only activates when the previous level produces chunks that are too large.

function chunkRecursive(
  text: string,
  maxLength: number,
  separators: string[] = ["\n## ", "\n\n", "\n", ". ", " "]
): string[] {
  if (text.length <= maxLength) return [text];

  const sep = separators[0];
  const remaining = separators.slice(1);
  const parts = text.split(sep);

  const chunks: string[] = [];
  let current = "";

  const flush = () => {
    if (current.trim().length > 0) chunks.push(current.trim());
    current = "";
  };

  for (const part of parts) {
    // A part that is itself too large drops down to the next separator level
    if (part.length > maxLength && remaining.length > 0) {
      flush();
      chunks.push(...chunkRecursive(part, maxLength, remaining));
      continue;
    }

    const candidate = current.length > 0 ? current + sep + part : part;
    if (candidate.length > maxLength && current.length > 0) {
      flush();
      current = part;
    } else {
      current = candidate;
    }
  }

  flush();

  return chunks;
}

// Split markdown: try headers first, then paragraphs, then sentences
const chunks = chunkRecursive(markdownDoc, 500);

Pros: Best semantic coherence. Adapts to document structure. Cons: More complex. Separator order matters for your content type.

Which to Use

| Content                | Strategy          | Reasoning                                |
| ---------------------- | ----------------- | ---------------------------------------- |
| Plain text, logs       | Fixed-size        | No structure to preserve                 |
| Articles, blog posts   | Paragraph-based   | Natural paragraph boundaries             |
| Markdown docs, code    | Recursive         | Preserves heading hierarchy              |
| Structured data (JSON) | Custom per schema | Split by logical units (records, objects) |

Gotcha: The embedding model has a token limit (512 tokens for BGE base). Chunks exceeding this limit get truncated silently, degrading embedding quality. Keep chunks well under the model’s limit.
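
A cheap guard catches chunks at risk of silent truncation before they reach the model. The 4-characters-per-token ratio below is a rough heuristic for English text, not an exact tokenizer count, so leave headroom:

```typescript
// BGE models cap input at 512 tokens; English averages roughly 4 chars/token.
// This is a heuristic check, not a real tokenizer - keep chunks well below it.
const MAX_TOKENS = 512;
const CHARS_PER_TOKEN = 4;

function likelyTruncated(chunk: string): boolean {
  return chunk.length / CHARS_PER_TOKEN > MAX_TOKENS;
}
```

Run this over your chunks at ingest time and re-split (or log) anything it flags.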

Embedding Model Selection

The embedding model determines the quality of your vector representations. On Workers AI:

| Model                       | Dimensions | Speed  | Quality | Use Case                  |
| --------------------------- | ---------- | ------ | ------- | ------------------------- |
| @cf/baai/bge-small-en-v1.5  | 384        | Fast   | Good    | Prototyping, low-latency  |
| @cf/baai/bge-base-en-v1.5   | 768        | Medium | Better  | Production default        |
| @cf/baai/bge-large-en-v1.5  | 1024       | Slow   | Best    | When quality matters most |

Start with BGE base. It’s the best balance of quality and speed. Only move to large if retrieval precision is clearly the bottleneck, and only move to small if latency is the constraint.

The Vectorize index dimension must match your embedding model. You set this at index creation time and cannot change it later:

# 768 dimensions for bge-base
npx wrangler vectorize create my-index --dimensions=768 --metric=cosine

# 384 dimensions for bge-small
npx wrangler vectorize create my-index --dimensions=384 --metric=cosine

Gotcha: A Vectorize index’s dimensions are fixed at creation. If you switch to an embedding model with different dimensions, you need a new index and must re-embed all your data.

Batch Embeddings

Embedding one text at a time is slow. Workers AI accepts arrays, so batch your requests:

// Slow: one at a time
for (const chunk of chunks) {
  const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [chunk.text],
  });
  // store...
}

// Fast: batch
const batchSize = 100;
for (let i = 0; i < chunks.length; i += batchSize) {
  const batch = chunks.slice(i, i + batchSize);
  const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: batch.map((c) => c.text),
  });

  const vectors = batch.map((chunk, j) => ({
    id: chunk.id,
    values: embeddings.data[j],
    metadata: { text: chunk.text, source: chunk.source },
  }));

  await env.VECTORIZE.upsert(vectors);
}

Batch embedding is especially important during initial ingest. For known content that doesn’t change, pre-compute embeddings and store them rather than generating on-the-fly.

Reranking

Vector search returns results by embedding similarity, which is a rough proxy for relevance. Reranking re-scores the results using a more expensive model that reads both the query and each result, producing a more accurate relevance score.

The pattern: retrieve more candidates than you need from the vector search, then rerank and take the top results.

app.post("/ask", async (c) => {
  const { question } = await c.req.json<{ question: string }>();

  // 1. Embed the question
  const queryEmbedding = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [question],
  });

  // 2. Retrieve more candidates than needed
  const candidates = await c.env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 20,  // Retrieve 20 candidates
    returnMetadata: "all",
  });

  // 3. Rerank with a cross-encoder model
  const reranked = await rerank(c.env.AI, question, candidates.matches);

  // 4. Use top 5 reranked results for context
  const topResults = reranked.slice(0, 5);
  const context = topResults
    .map((r) => r.metadata?.text ?? "")
    .join("\n\n");

  // 5. Generate answer
  const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: `Answer based on this context:\n\n${context}`,
      },
      { role: "user", content: question },
    ],
  });

  return c.json({ answer: response.response });
});

A simple reranking function using the LLM as a cross-encoder:

interface ScoredMatch {
  id: string;
  score: number;
  metadata?: Record<string, string>;
}

async function rerank(
  ai: Ai,
  query: string,
  matches: VectorizeMatch[]
): Promise<ScoredMatch[]> {
  // Score each candidate by asking the LLM to rate relevance
  const scored = await Promise.all(
    matches.map(async (match) => {
      const text = match.metadata?.text ?? "";
      const response = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
        messages: [
          {
            role: "system",
            content:
              "Rate the relevance of the passage to the query on a scale of 0-10. Respond with just the number.",
          },
          {
            role: "user",
            content: `Query: ${query}\n\nPassage: ${text}`,
          },
        ],
        max_tokens: 5,
      });

      const score = parseFloat(response.response ?? "0");
      return {
        id: match.id,
        score: isNaN(score) ? 0 : score,
        metadata: match.metadata as Record<string, string>,
      };
    })
  );

  return scored.sort((a, b) => b.score - a.score);
}

Gotcha: LLM-based reranking is expensive - one LLM call per candidate. Keep your candidate set small (20-50) and use a fast model. For high-volume production, consider a dedicated cross-encoder model if available.
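
Workers AI does ship a dedicated reranker model (`@cf/baai/bge-reranker-base` at the time of writing), which scores all candidates in one call instead of one LLM call each. The input/output shape below (a `query` plus `contexts` array in, index-based `id` and `score` out) matches the model catalog but should be verified against current docs; the binding is stubbed so the sketch stays self-contained:

```typescript
interface RerankResult {
  id: number; // index into the contexts array
  score: number;
}

// Hedged sketch: rerank passages with a cross-encoder model in one call
async function rerankWithModel(
  ai: { run: (model: string, input: unknown) => Promise<{ response: RerankResult[] }> },
  query: string,
  passages: string[]
): Promise<string[]> {
  const result = await ai.run("@cf/baai/bge-reranker-base", {
    query,
    contexts: passages.map((text) => ({ text })),
  });

  // Sort by relevance score, highest first, and map ids back to passages
  return [...result.response]
    .sort((a, b) => b.score - a.score)
    .map((r) => passages[r.id]);
}
```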

Hybrid Search

Pure vector search misses exact keyword matches. Pure keyword search misses semantically similar content. Hybrid search combines both: run a keyword search and a vector search in parallel, then merge the results.

app.post("/search", async (c) => {
  const { query } = await c.req.json<{ query: string }>();

  // 1. Vector search
  const queryEmbedding = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [query],
  });
  const vectorResults = await c.env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 10,
    returnMetadata: "all",
  });

  // 2. Keyword search (using D1 full-text search)
  const keywordResults = await c.env.DB.prepare(
    `SELECT id, title, content, rank
     FROM documents_fts
     WHERE documents_fts MATCH ?
     ORDER BY rank
     LIMIT 10`
  )
    .bind(query)
    .all<{ id: string; title: string; content: string; rank: number }>();

  // 3. Merge with reciprocal rank fusion
  const merged = reciprocalRankFusion(
    vectorResults.matches.map((m) => m.id),
    keywordResults.results.map((r) => r.id),
    60 // RRF constant k (higher = flatter weighting, less dominated by top ranks)
  );

  return c.json({ results: merged.slice(0, 10) });
});

function reciprocalRankFusion(
  vectorRanks: string[],
  keywordRanks: string[],
  k: number = 60
): string[] {
  const scores = new Map<string, number>();

  vectorRanks.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
  });

  keywordRanks.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
  });

  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

Hybrid search works well when:

  • Users search with exact terms (product names, error codes)
  • Your content has important keywords that embedding models might dilute
  • You want the best of both worlds without choosing one approach
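
The keyword branch above assumes a D1 full-text index named documents_fts. D1 is SQLite-based, so FTS5 virtual tables are available; a minimal setup might look like this (the database name `my-db` and table/column names are illustrative):

```shell
# Hypothetical FTS5 virtual table backing the documents_fts keyword search
npx wrangler d1 execute my-db --command \
  "CREATE VIRTUAL TABLE documents_fts USING fts5(id UNINDEXED, title, content)"

# Populate it from the source table (re-run on changes, or use triggers)
npx wrangler d1 execute my-db --command \
  "INSERT INTO documents_fts (id, title, content) SELECT id, title, content FROM documents"
```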

Performance Tips

  1. Batch embeddings during ingest. Don’t embed one document at a time. Workers AI accepts arrays - send 50-100 texts per call.

  2. Pre-compute for known content. If your documents don’t change often, embed them once and store the vectors. Don’t re-embed on every deployment.

  3. Use metadata filtering before vector search. Filtering by category, date range, or tenant ID before similarity ranking reduces the search space and improves both speed and relevance.

  4. Cache frequent queries. Pipe your LLM calls through AI Gateway with caching enabled. Identical questions return cached answers.

  5. Right-size your chunks. Smaller chunks (200-300 chars) give more precise retrieval but need more storage and more LLM context. Larger chunks (500-800 chars) give more context per result but may include irrelevant text. Test with your actual data.

  6. Monitor retrieval quality. Log the query, retrieved chunks, and user feedback. If users consistently don’t find what they need, the problem is usually chunking or embedding model choice, not the LLM.
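
Tip 3 in code: Vectorize accepts a filter option on query, applied to metadata fields for which you have created a metadata index (via wrangler’s create-metadata-index command in current versions). The index binding is stubbed below so the sketch is self-contained; the filter shape (exact-match keys, with range operators like $gte also supported) follows the Vectorize docs but should be verified:

```typescript
interface VectorizeLike {
  query(
    vector: number[],
    opts: { topK: number; filter?: Record<string, unknown>; returnMetadata?: string }
  ): Promise<{ matches: { id: string }[] }>;
}

// Sketch: narrow the search space to one tenant before similarity ranking
async function searchTenant(
  index: VectorizeLike,
  queryVector: number[],
  tenantId: string
) {
  return index.query(queryVector, {
    topK: 5,
    filter: { tenantId }, // exact match on an indexed metadata field
    returnMetadata: "all",
  });
}
```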

What’s Next