Vectorize RAG
Build a complete retrieval-augmented generation (RAG) pipeline in a single Worker. Chunk your text, generate embeddings with Workers AI, store them in Vectorize, query for relevant context, and generate an answer with an LLM. Everything runs on Cloudflare’s network.
Prerequisites: AI Landscape, Workers AI
How RAG Works
RAG augments an LLM’s knowledge with your own data. Instead of fine-tuning a model, you retrieve relevant documents at query time and inject them into the prompt.
The pipeline:
- Chunk: Split documents into smaller pieces
- Embed: Convert each chunk into a vector using an embedding model
- Store: Save vectors in Vectorize with metadata
- Query: Convert the user’s question into a vector, find similar chunks
- Generate: Send the question + retrieved chunks to an LLM
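Each step hands off a slightly different data shape. A rough sketch of those shapes in TypeScript (the names are illustrative, not part of the pipeline code below):
type Chunk = { id: string; text: string; metadata: Record<string, string> }; // after chunking
type StoredVector = { id: string; values: number[]; metadata: Record<string, string> }; // after embedding (768 floats for bge-base); what goes into Vectorize
type Match = { id: string; score: number; metadata?: Record<string, string> }; // what a query returns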
Setup
Create a Vectorize index and configure bindings:
# Create the index (768 dimensions for bge-base)
npx wrangler vectorize create knowledge-base --dimensions=768 --metric=cosine
wrangler.jsonc:
{
  "name": "rag-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  },
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "knowledge-base"
    }
  ]
}
Run npx wrangler types to update the Env interface.
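If you prefer to see what the examples expect, the generated Env should contain roughly the following (exact type names depend on your @cloudflare/workers-types version):
interface Env {
  AI: Ai;             // Workers AI binding
  VECTORIZE: Vectorize; // Vectorize index binding
}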
The Full Pipeline
Here is the complete Worker with ingest and query endpoints:
import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

interface EmbeddingResponse {
  shape: number[];
  data: number[][];
}

// --- Ingest: chunk, embed, store ---
app.post("/ingest", async (c) => {
  const { documents } = await c.req.json<{
    documents: { id: string; title: string; content: string }[];
  }>();

  // 1. Chunk the documents (chunkText is defined in the Chunking section below)
  const chunks: { id: string; text: string; metadata: Record<string, string> }[] = [];
  for (const doc of documents) {
    const docChunks = chunkText(doc.content, 500);
    docChunks.forEach((text, i) => {
      chunks.push({
        id: `${doc.id}-${i}`,
        text,
        metadata: { title: doc.title, docId: doc.id },
      });
    });
  }

  // 2. Generate embeddings
  const texts = chunks.map((chunk) => chunk.text);
  const embeddings: EmbeddingResponse = await c.env.AI.run(
    "@cf/baai/bge-base-en-v1.5",
    { text: texts }
  );

  // 3. Store in Vectorize, including the chunk text in metadata so /ask can build context from it
  const vectors = chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings.data[i],
    metadata: { ...chunk.metadata, text: chunk.text },
  }));
  const result = await c.env.VECTORIZE.upsert(vectors);

  return c.json({ indexed: vectors.length, result });
});
// --- Query: embed question, search, generate ---
app.post("/ask", async (c) => {
  const { question } = await c.req.json<{ question: string }>();

  // 1. Embed the question
  const queryEmbedding: EmbeddingResponse = await c.env.AI.run(
    "@cf/baai/bge-base-en-v1.5",
    { text: [question] }
  );

  // 2. Search Vectorize for similar chunks
  const matches = await c.env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 5,
    returnMetadata: "all",
  });

  // 3. Build context from matched chunks
  const context = matches.matches
    .map((match) => `[${match.metadata?.title}]: ${match.metadata?.text ?? ""}`)
    .join("\n\n");

  // 4. Generate answer with context
  const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: `Answer the question based on the following context. If the context doesn't contain the answer, say so.\n\nContext:\n${context}`,
      },
      { role: "user", content: question },
    ],
  });

  return c.json({
    answer: response.response,
    sources: matches.matches.map((m) => ({
      id: m.id,
      score: m.score,
      title: m.metadata?.title,
    })),
  });
});
export default app;
Chunking
Chunking splits long documents into pieces small enough to embed meaningfully. A simple approach splits by paragraph with a character limit:
function chunkText(text: string, maxLength: number): string[] {
  const paragraphs = text.split(/\n\n+/);
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (current.length + para.length > maxLength && current.length > 0) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }

  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }
  return chunks;
}
Guidelines:
- ~500 characters per chunk is a reasonable starting point
- Keep chunks semantically coherent (don’t split mid-sentence)
- Include overlap between chunks if precision matters (e.g., 50 characters of overlap; see the sketch after the gotcha below)
- Store the chunk text in metadata so you can retrieve it with the vector
Gotcha: The embedding model has a token limit. BGE base handles up to 512 tokens per input. If your chunks exceed this, the embedding quality degrades. Keep chunks under the model’s limit.
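If you want that overlap, one lightweight option is to reuse chunkText and prepend the tail of the previous chunk. This is a sketch, not part of the pipeline above; the 50-character default is an assumption to tune for your content:
function chunkTextWithOverlap(text: string, maxLength = 500, overlap = 50): string[] {
  const chunks = chunkText(text, maxLength);
  // Prepend the tail of the previous chunk so sentences that straddle a
  // boundary appear in both chunks. This pushes chunks slightly past
  // maxLength, so leave headroom relative to the model's 512-token limit.
  return chunks.map((chunk, i) =>
    i === 0 ? chunk : `${chunks[i - 1].slice(-overlap)} ${chunk}`
  );
}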
Storing Metadata
Vectorize supports metadata on each vector. Use it to store the original text, source document ID, and any fields you want to filter on:
const vectors = chunks.map((chunk, i) => ({
  id: chunk.id,
  values: embeddings.data[i],
  metadata: {
    title: chunk.metadata.title,
    text: chunk.text, // Store the chunk text for retrieval
    docId: chunk.metadata.docId,
    category: "documentation",
  },
}));
When querying, you can filter by metadata:
const matches = await c.env.VECTORIZE.query(queryVector, {
  topK: 5,
  returnMetadata: "all",
  filter: { category: "documentation" },
});
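Filtering on a metadata field generally requires a metadata index on that property, created before the vectors are upserted. At the time of writing the command looks roughly like this (check npx wrangler vectorize --help for the current syntax):
npx wrangler vectorize create-metadata-index knowledge-base --property-name=category --type=string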
Vectorize Operations
| Operation | Method | Description |
|---|---|---|
| Insert/update | upsert(vectors) | Add or replace vectors by ID |
| Query | query(vector, options) | Find similar vectors |
| Get by ID | getByIds(ids) | Retrieve specific vectors |
| Delete | deleteByIds(ids) | Remove vectors |
| Describe | describe() | Index stats (count, dimensions) |
upsert is idempotent - calling it with the same ID replaces the vector. This makes re-indexing safe.
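For example, when re-indexing a single document whose chunk IDs follow the doc.id plus index pattern from the ingest handler, you can clear its old chunks first. A sketch using the operations above, inside a handler with the same bindings (the three-chunk ID list is illustrative):
// Remove a document's old chunks before re-upserting (IDs follow `${doc.id}-${i}`)
await c.env.VECTORIZE.deleteByIds(["doc-1-0", "doc-1-1", "doc-1-2"]);

// Fetch specific vectors back, e.g. to verify what was stored
const stored = await c.env.VECTORIZE.getByIds(["doc-1-0"]);

// Index stats: dimensions, vector count
const info = await c.env.VECTORIZE.describe();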
Similarity Metrics
Choose when creating the index:
| Metric | When to Use |
|---|---|
| cosine | Text embeddings (most common; direction matters, not magnitude) |
| euclidean | When absolute distance matters |
| dot-product | When vectors are normalized |
For BGE embeddings, use cosine. Cosine scores range from -1 to 1, where 1 means identical direction; in practice, text-embedding matches usually land between 0 and 1, with higher meaning more similar.
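Scores are handy for dropping weak matches before they reach the prompt. Inside the /ask handler above, a sketch might look like this; the 0.6 cutoff is an arbitrary assumption to tune against your own data:
const MIN_SCORE = 0.6; // assumed threshold; tune empirically
const relevant = matches.matches.filter((m) => m.score >= MIN_SCORE);
if (relevant.length === 0) {
  return c.json({ answer: "I couldn't find relevant context for that question.", sources: [] });
}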
Testing the Pipeline
# Ingest some documents
curl -X POST http://localhost:8787/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [
      {
        "id": "doc-1",
        "title": "Workers AI",
        "content": "Workers AI lets you run AI models on Cloudflare GPUs. You can generate text, create embeddings, and produce images. Models include Llama, Mistral, and BGE."
      },
      {
        "id": "doc-2",
        "title": "Vectorize",
        "content": "Vectorize is a vector database for storing embeddings. It supports cosine similarity, metadata filtering, and integrates with Workers AI for a complete RAG pipeline."
      }
    ]
  }'

# Ask a question
curl -X POST http://localhost:8787/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I generate embeddings?"}'
Gotcha: Vectorize indexing is eventually consistent. After an upsert, the vectors may not be immediately queryable. In production this is fast (seconds), but in tests you might need a small delay.
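In tests, a small retry loop is usually enough. A sketch, assuming describe() exposes a vector count as suggested in the operations table above (the attempt count and wait time are arbitrary):
async function waitForVectors(env: Env, expected: number, attempts = 10): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    const info = await env.VECTORIZE.describe();
    if (info.vectorCount >= expected) return; // index has caught up (field name assumed; check describe()'s result shape)
    await new Promise((resolve) => setTimeout(resolve, 1000)); // wait 1s, then retry
  }
  throw new Error("Vectorize index did not catch up in time");
}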
What’s Next
- AI Gateway - Add caching to the LLM call in the generate step
- Agents SDK - Wrap the RAG pipeline in a stateful conversational agent
- RAG Patterns - Advanced chunking, reranking, and hybrid search