RAG Patterns
RAG (retrieval-augmented generation) gives your LLM access to private data without fine-tuning. You retrieve relevant context from a vector database and inject it into the prompt. Cloudflare gives you two paths: build it yourself with Vectorize, or use AI Search for a managed pipeline.
Prerequisites: Vectorize RAG quickstart
DIY RAG vs AI Search
| Aspect | DIY (Vectorize) | AI Search (Managed) |
|---|---|---|
| Chunking | You control strategy and size | Automatic |
| Embedding model | Your choice (BGE, custom) | Cloudflare’s default |
| Reranking | Implement yourself | Built-in |
| Metadata filtering | Full control | Limited |
| Data sources | Anything you can ingest | URLs, documents, APIs |
| Setup effort | Medium - build the pipeline | Low - point at data sources |
| Customization | Full control over every step | Minimal |
| Best for | Custom search, complex pipelines | Quick internal search, prototypes |
Use AI Search when you want a search experience over documents without building infrastructure. Point it at URLs or documents, and it handles chunking, embedding, indexing, and retrieval automatically.
Use Vectorize when you need control over any part of the pipeline: custom chunking, specific embedding models, metadata filtering, or integration with other services.
The RAG Pipeline
flowchart TD
subgraph ingest["Ingest Pipeline"]
direction TB
D["Documents"] --> CH["Chunk"]
CH --> EM["Embed"]
EM --> ST[("Vectorize")]
end
subgraph query["Query Pipeline"]
direction TB
Q["User Question"] --> QE["Embed Query"]
QE --> VS["Vector Search"]
VS --> RR["Rerank"]
RR --> CTX["Build Context"]
CTX --> LLM["Generate Answer"]
end
ST -.->|"Similarity search"| VS
The ingest pipeline runs once (or on updates). The query pipeline runs on every user request. Reranking is optional but significantly improves result quality when you have more than a handful of results.
Chunking Strategies
Chunking determines what the vector database can retrieve. Bad chunks mean bad retrieval, regardless of how good your embedding model is.
Fixed-Size Chunking
Split text every N characters or tokens. Simple and predictable.
function chunkFixedSize(text: string, size: number, overlap: number = 0): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size"); // otherwise the loop never advances
  const chunks: string[] = [];
  let start = 0;
while (start < text.length) {
const chunk = text.slice(start, start + size);
if (chunk.trim().length > 0) {
chunks.push(chunk.trim());
}
start += size - overlap;
}
return chunks;
}
// 500 chars per chunk, 50 char overlap
const chunks = chunkFixedSize(document, 500, 50);
Pros: Easy to implement, predictable chunk sizes. Cons: Splits mid-sentence, mid-paragraph, even mid-word. Context bleeds across chunk boundaries.
Paragraph-Based Chunking
Split on natural text boundaries (double newlines), then group into chunks up to a size limit. This is what the Vectorize RAG quickstart uses.
function chunkByParagraph(text: string, maxLength: number): string[] {
const paragraphs = text.split(/\n\n+/);
const chunks: string[] = [];
let current = "";
for (const para of paragraphs) {
if (current.length + para.length > maxLength && current.length > 0) {
chunks.push(current.trim());
current = "";
}
current += para + "\n\n";
}
if (current.trim().length > 0) {
chunks.push(current.trim());
}
return chunks;
}
Pros: Respects natural text boundaries. Semantically coherent chunks. Cons: Variable chunk sizes. Very long paragraphs still need splitting.
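One way to cover the long-paragraph case from the cons above is a fixed-size fallback applied to any chunk that is still too large. This helper is an illustrative sketch, not part of the quickstart:

```typescript
// Fallback for chunks that exceed maxLength even after paragraph grouping.
// Splits the oversized chunk into fixed-size pieces as a last resort.
function splitOversized(chunk: string, maxLength: number): string[] {
  if (chunk.length <= maxLength) return [chunk];
  const pieces: string[] = [];
  for (let start = 0; start < chunk.length; start += maxLength) {
    const piece = chunk.slice(start, start + maxLength).trim();
    if (piece.length > 0) pieces.push(piece);
  }
  return pieces;
}
```

Combine it with the paragraph chunker via `chunkByParagraph(text, 500).flatMap((c) => splitOversized(c, 500))` so no chunk exceeds the limit.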
Recursive Chunking
Split by the largest delimiter first (sections, then paragraphs, then sentences, then words). Each level only activates when the previous level produces chunks that are too large.
function chunkRecursive(
text: string,
maxLength: number,
separators: string[] = ["\n## ", "\n\n", "\n", ". ", " "]
): string[] {
if (text.length <= maxLength) return [text];
const sep = separators[0];
const remaining = separators.slice(1);
const parts = text.split(sep);
const chunks: string[] = [];
let current = "";
  for (const part of parts) {
    const candidate = current.length > 0 ? current + sep + part : part;
    if (candidate.length > maxLength && current.length > 0) {
      // A single part can itself exceed maxLength; recurse with finer separators
      if (current.length > maxLength && remaining.length > 0) {
        chunks.push(...chunkRecursive(current, maxLength, remaining));
      } else {
        chunks.push(current.trim());
      }
      current = part;
    } else {
      current = candidate;
    }
  }
if (current.trim().length > 0) {
if (current.length > maxLength && remaining.length > 0) {
chunks.push(...chunkRecursive(current, maxLength, remaining));
} else {
chunks.push(current.trim());
}
}
return chunks;
}
// Split markdown: try headers first, then paragraphs, then sentences
const chunks = chunkRecursive(markdownDoc, 500);
Pros: Best semantic coherence. Adapts to document structure. Cons: More complex. Separator order matters for your content type.
Which to Use
| Content | Strategy | Reasoning |
|---|---|---|
| Plain text, logs | Fixed-size | No structure to preserve |
| Articles, blog posts | Paragraph-based | Natural paragraph boundaries |
| Markdown docs, code | Recursive | Preserves heading hierarchy |
| Structured data (JSON) | Custom per schema | Split by logical units (records, objects) |
Gotcha: The embedding model has a token limit (512 tokens for BGE base). Chunks exceeding this limit get truncated silently, degrading embedding quality. Keep chunks well under the model’s limit.
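A cheap guard catches oversized chunks before they hit the embedding model. Exact counts require the model's tokenizer; the ~4 characters per token figure below is a rough English-text heuristic, not a BGE guarantee:

```typescript
// Flag chunks likely to blow the embedding model's token limit.
// Assumes roughly 4 characters per token, which is only an approximation.
function likelyExceedsTokenLimit(chunk: string, maxTokens: number = 512): boolean {
  const approxTokens = Math.ceil(chunk.length / 4);
  return approxTokens > maxTokens;
}
```

Run this over your chunks at ingest and re-chunk (or at least log) anything it flags, rather than letting truncation silently degrade the embedding.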
Embedding Model Selection
The embedding model determines the quality of your vector representations. On Workers AI:
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| @cf/baai/bge-small-en-v1.5 | 384 | Fast | Good | Prototyping, low-latency |
| @cf/baai/bge-base-en-v1.5 | 768 | Medium | Better | Production default |
| @cf/baai/bge-large-en-v1.5 | 1024 | Slow | Best | When quality matters most |
Start with BGE base. It’s the best balance of quality and speed. Only move to large if retrieval precision is clearly the bottleneck, and only move to small if latency is the constraint.
The Vectorize index dimension must match your embedding model. You set this at index creation time and cannot change it later:
# 768 dimensions for bge-base
npx wrangler vectorize create my-index --dimensions=768 --metric=cosine
# 384 dimensions for bge-small
npx wrangler vectorize create my-index --dimensions=384 --metric=cosine
Gotcha: A Vectorize index’s dimensions are fixed at creation. If you switch to an embedding model with different dimensions, you need a new index and must re-embed all your data.
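To keep the model and index dimension from drifting apart, one option is to define the pairing once in code. The model IDs and dimensions come from the table above; the constant name is just a suggestion:

```typescript
// Single source of truth for model -> dimension, matching the table above.
const EMBEDDING_MODELS = {
  "@cf/baai/bge-small-en-v1.5": 384,
  "@cf/baai/bge-base-en-v1.5": 768,
  "@cf/baai/bge-large-en-v1.5": 1024,
} as const;

type EmbeddingModel = keyof typeof EMBEDDING_MODELS;

// Use this when creating the index and when validating config at startup.
function dimensionsFor(model: EmbeddingModel): number {
  return EMBEDDING_MODELS[model];
}
```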
Batch Embeddings
Embedding one text at a time is slow. Workers AI accepts arrays, so batch your requests:
// Slow: one at a time
for (const chunk of chunks) {
const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: [chunk.text],
});
// store...
}
// Fast: batch
const batchSize = 100;
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: batch.map((c) => c.text),
});
const vectors = batch.map((chunk, j) => ({
id: chunk.id,
values: embeddings.data[j],
metadata: { text: chunk.text, source: chunk.source },
}));
await env.VECTORIZE.upsert(vectors);
}
Batch embedding is especially important during initial ingest. For known content that doesn’t change, pre-compute embeddings and store them rather than generating on-the-fly.
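A sketch of the change-detection half of that advice: hash each chunk's text, store the hash alongside the vector (for example in metadata), and skip chunks whose hash is unchanged on the next ingest. The storage shape here is an assumption:

```typescript
// Chunks whose stored hash matches the current hash don't need re-embedding.
function changedChunks(
  chunks: { id: string; hash: string }[],
  storedHashes: Map<string, string>
): { id: string; hash: string }[] {
  return chunks.filter((c) => storedHashes.get(c.id) !== c.hash);
}

// Content hash via Web Crypto (available in Workers and modern Node).
async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(text)
  );
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```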
Reranking
Vector search returns results by embedding similarity, which is a rough proxy for relevance. Reranking re-scores the results using a more expensive model that reads both the query and each result, producing a more accurate relevance score.
The pattern: retrieve more candidates than you need from the vector search, then rerank and take the top results.
app.post("/ask", async (c) => {
const { question } = await c.req.json<{ question: string }>();
// 1. Embed the question
const queryEmbedding = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: [question],
});
// 2. Retrieve more candidates than needed
const candidates = await c.env.VECTORIZE.query(queryEmbedding.data[0], {
topK: 20, // Retrieve 20 candidates
returnMetadata: "all",
});
// 3. Rerank with a cross-encoder model
const reranked = await rerank(c.env.AI, question, candidates.matches);
// 4. Use top 5 reranked results for context
const topResults = reranked.slice(0, 5);
const context = topResults
.map((r) => r.metadata?.text ?? "")
.join("\n\n");
// 5. Generate answer
const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
messages: [
{
role: "system",
content: `Answer based on this context:\n\n${context}`,
},
{ role: "user", content: question },
],
});
return c.json({ answer: response.response });
});
A simple reranking function using the LLM as a cross-encoder:
interface ScoredMatch {
id: string;
score: number;
metadata?: Record<string, string>;
}
async function rerank(
ai: Ai,
query: string,
matches: VectorizeMatch[]
): Promise<ScoredMatch[]> {
// Score each candidate by asking the LLM to rate relevance
const scored = await Promise.all(
matches.map(async (match) => {
const text = match.metadata?.text ?? "";
const response = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
messages: [
{
role: "system",
content:
"Rate the relevance of the passage to the query on a scale of 0-10. Respond with just the number.",
},
{
role: "user",
content: `Query: ${query}\n\nPassage: ${text}`,
},
],
max_tokens: 5,
});
const score = parseFloat(response.response ?? "0");
return {
id: match.id,
score: isNaN(score) ? 0 : score,
metadata: match.metadata as Record<string, string>,
};
})
);
return scored.sort((a, b) => b.score - a.score);
}
Gotcha: LLM-based reranking is expensive - one LLM call per candidate. Keep your candidate set small (20-50) and use a fast model. For high-volume production, consider a dedicated cross-encoder model if available.
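If you do keep LLM-based reranking, a small concurrency limiter avoids firing every scoring call at once (Workers have subrequest limits). `mapLimit` below is a generic sketch, not a Workers API:

```typescript
// Run fn over items with at most `limit` calls in flight at a time.
async function mapLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker())
  );
  return results;
}
```

In `rerank`, replacing `Promise.all(matches.map(...))` with `mapLimit(matches, 5, ...)` caps concurrent LLM calls; 5 is an arbitrary starting point, not a recommended value.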
Hybrid Search
Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Hybrid search combines both: run a keyword search and a vector search in parallel, then merge the results.
app.post("/search", async (c) => {
const { query } = await c.req.json<{ query: string }>();
// 1. Vector search
const queryEmbedding = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: [query],
});
const vectorResults = await c.env.VECTORIZE.query(queryEmbedding.data[0], {
topK: 10,
returnMetadata: "all",
});
// 2. Keyword search (using D1 full-text search)
const keywordResults = await c.env.DB.prepare(
`SELECT id, title, content, rank
FROM documents_fts
WHERE documents_fts MATCH ?
ORDER BY rank
LIMIT 10`
)
.bind(query)
.all<{ id: string; title: string; content: string; rank: number }>();
// 3. Merge with reciprocal rank fusion
const merged = reciprocalRankFusion(
vectorResults.matches.map((m) => m.id),
keywordResults.results.map((r) => r.id),
60 // RRF constant k (higher k flattens the weight difference between ranks)
);
return c.json({ results: merged.slice(0, 10) });
});
function reciprocalRankFusion(
vectorRanks: string[],
keywordRanks: string[],
k: number = 60
): string[] {
const scores = new Map<string, number>();
vectorRanks.forEach((id, i) => {
scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
});
keywordRanks.forEach((id, i) => {
scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
});
return [...scores.entries()]
.sort((a, b) => b[1] - a[1])
.map(([id]) => id);
}
Hybrid search works well when:
- Users search with exact terms (product names, error codes)
- Your content has important keywords that embedding models might dilute
- You want the best of both worlds without choosing one approach
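To see the fusion behave, here is the function exercised on a tiny example (redefined so the snippet runs standalone); the document IDs are made up:

```typescript
function reciprocalRankFusion(
  vectorRanks: string[],
  keywordRanks: string[],
  k: number = 60
): string[] {
  const scores = new Map<string, number>();
  vectorRanks.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
  });
  keywordRanks.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
  });
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// doc-b appears in both result lists, so its scores add up and it
// outranks doc-a, which only the vector search returned.
const merged = reciprocalRankFusion(
  ["doc-a", "doc-b"], // vector search order
  ["doc-b", "doc-c"]  // keyword search order
);
// merged is ["doc-b", "doc-a", "doc-c"]
```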
Performance Tips
- Batch embeddings during ingest. Don’t embed one document at a time. Workers AI accepts arrays - send 50-100 texts per call.
- Pre-compute for known content. If your documents don’t change often, embed them once and store the vectors. Don’t re-embed on every deployment.
- Use metadata filtering before vector search. Filtering by category, date range, or tenant ID before similarity ranking reduces the search space and improves both speed and relevance.
- Cache frequent queries. Pipe your LLM calls through AI Gateway with caching enabled. Identical questions return cached answers.
- Right-size your chunks. Smaller chunks (200-300 chars) give more precise retrieval but need more storage and more LLM context. Larger chunks (500-800 chars) give more context per result but may include irrelevant text. Test with your actual data.
- Monitor retrieval quality. Log the query, retrieved chunks, and user feedback. If users consistently don’t find what they need, the problem is usually chunking or embedding model choice, not the LLM.
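The metadata-filtering tip in practice: pass a filter to the Vectorize query so similarity ranking only runs over the matching subset. The `tenantId` field and the `VectorizeLike` interface below are illustrative assumptions, and note that Vectorize generally requires a metadata index on a field before you can filter on it:

```typescript
// Minimal interface for the part of Vectorize this sketch uses.
interface VectorizeLike {
  query(
    vector: number[],
    options: {
      topK: number;
      filter?: Record<string, unknown>;
      returnMetadata?: "all" | "indexed" | "none";
    }
  ): Promise<{ matches: { id: string }[] }>;
}

// Restrict search to one tenant's documents before similarity ranking.
async function tenantSearch(
  index: VectorizeLike,
  queryVector: number[],
  tenantId: string
): Promise<string[]> {
  const results = await index.query(queryVector, {
    topK: 5,
    filter: { tenantId }, // equality match on a metadata field
    returnMetadata: "all",
  });
  return results.matches.map((m) => m.id);
}
```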
What’s Next
- Agent Patterns - Build agents that use RAG as a tool
- Model Catalog - Embedding model specs and pricing
- AI Landscape - How Vectorize fits in the stack