Vectorize RAG
Build a complete retrieval-augmented generation (RAG) pipeline in a single Worker. Chunk your text, generate embeddings with Workers AI, store them in Vectorize, query for relevant context, and generate an answer with an LLM. Everything runs on Cloudflare’s network.
Prerequisites: AI Landscape, Workers AI
How RAG Works
RAG augments an LLM’s knowledge with your own data. Instead of fine-tuning a model, you retrieve relevant documents at query time and inject them into the prompt.
The pipeline:
- Chunk: Split documents into smaller pieces
- Embed: Convert each chunk into a vector using an embedding model
- Store: Save vectors in Vectorize with metadata
- Query: Convert the user’s question into a vector, find similar chunks
- Generate: Send the question + retrieved chunks to an LLM
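Each step hands off a slightly different data shape. A rough sketch of those shapes in TypeScript (the names are illustrative, not part of the pipeline code below):
type Chunk = { id: string; text: string; metadata: Record<string, string> }; // after chunking
type StoredVector = { id: string; values: number[]; metadata: Record<string, string> }; // after embedding (768 floats for bge-base); what goes into Vectorize
type Match = { id: string; score: number; metadata?: Record<string, string> }; // what a query returns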
Setup
Create a Vectorize index and configure bindings:
# Create the index (768 dimensions for bge-base)
npx wrangler vectorize create knowledge-base --dimensions=768 --metric=cosine
wrangler.jsonc:
{
  "name": "rag-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  },
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "knowledge-base"
    }
  ]
}
Run npx wrangler types to update the Env interface.
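If you prefer to see what the examples expect, the generated Env should contain roughly the following (exact type names depend on your @cloudflare/workers-types version):
interface Env {
  AI: Ai;             // Workers AI binding
  VECTORIZE: Vectorize; // Vectorize index binding
}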
The Full Pipeline
Here is the complete Worker with ingest and query endpoints:
import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

interface EmbeddingResponse {
  shape: number[];
  data: number[][];
}

// --- Ingest: chunk, embed, store ---
app.post("/ingest", async (c) => {
  const { documents } = await c.req.json<{
    documents: { id: string; title: string; content: string }[];
  }>();

  // 1. Chunk the documents (chunkText is defined in the Chunking section below)
  const chunks: { id: string; text: string; metadata: Record<string, string> }[] = [];
  for (const doc of documents) {
    const docChunks = chunkText(doc.content, 500);
    docChunks.forEach((text, i) => {
      chunks.push({
        id: `${doc.id}-${i}`,
        text,
        metadata: { title: doc.title, docId: doc.id },
      });
    });
  }

  // 2. Generate embeddings
  const texts = chunks.map((chunk) => chunk.text);
  const embeddings: EmbeddingResponse = await c.env.AI.run(
    "@cf/baai/bge-base-en-v1.5",
    { text: texts }
  );

  // 3. Store in Vectorize, including the chunk text in metadata so /ask can build context from it
  const vectors = chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings.data[i],
    metadata: { ...chunk.metadata, text: chunk.text },
  }));
  const result = await c.env.VECTORIZE.upsert(vectors);

  return c.json({ indexed: vectors.length, result });
});
// --- Query: embed question, search, generate ---
app.post("/ask", async (c) => {
  const { question } = await c.req.json<{ question: string }>();

  // 1. Embed the question
  const queryEmbedding: EmbeddingResponse = await c.env.AI.run(
    "@cf/baai/bge-base-en-v1.5",
    { text: [question] }
  );

  // 2. Search Vectorize for similar chunks
  const matches = await c.env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 5,
    returnMetadata: "all",
  });

  // 3. Build context from matched chunks
  const context = matches.matches
    .map((match) => `[${match.metadata?.title}]: ${match.metadata?.text ?? ""}`)
    .join("\n\n");

  // 4. Generate answer with context
  const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: `Answer the question based on the following context. If the context doesn't contain the answer, say so.\n\nContext:\n${context}`,
      },
      { role: "user", content: question },
    ],
  });

  return c.json({
    answer: response.response,
    sources: matches.matches.map((m) => ({
      id: m.id,
      score: m.score,
      title: m.metadata?.title,
    })),
  });
});
export default app;
Chunking
Chunking splits long documents into pieces small enough to embed meaningfully. A simple approach splits by paragraph with a character limit:
function chunkText(text: string, maxLength: number): string[] {
  const paragraphs = text.split(/\n\n+/);
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (current.length + para.length > maxLength && current.length > 0) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }

  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }
  return chunks;
}
Guidelines:
- ~500 characters per chunk is a reasonable starting point
- Keep chunks semantically coherent (don’t split mid-sentence)
- Include overlap between chunks if precision matters (e.g., 50 characters of overlap; see the sketch after the gotcha below)
- Store the chunk text in metadata so you can retrieve it with the vector
Gotcha: The embedding model has a token limit. BGE base handles up to 512 tokens per input. If your chunks exceed this, the embedding quality degrades. Keep chunks under the model’s limit.
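If you want that overlap, one lightweight option is to reuse chunkText and prepend the tail of the previous chunk. This is a sketch, not part of the pipeline above; the 50-character default is an assumption to tune for your content:
function chunkTextWithOverlap(text: string, maxLength = 500, overlap = 50): string[] {
  const chunks = chunkText(text, maxLength);
  // Prepend the tail of the previous chunk so sentences that straddle a
  // boundary appear in both chunks. This pushes chunks slightly past
  // maxLength, so leave headroom relative to the model's 512-token limit.
  return chunks.map((chunk, i) =>
    i === 0 ? chunk : `${chunks[i - 1].slice(-overlap)} ${chunk}`
  );
}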
Storing Metadata
Vectorize supports metadata on each vector. Use it to store the original text, source document ID, and any fields you want to filter on:
const vectors = chunks.map((chunk, i) => ({
  id: chunk.id,
  values: embeddings.data[i],
  metadata: {
    title: chunk.metadata.title,
    text: chunk.text, // Store the chunk text for retrieval
    docId: chunk.metadata.docId,
    category: "documentation",
  },
}));
When querying, you can filter by metadata:
const matches = await c.env.VECTORIZE.query(queryVector, {
  topK: 5,
  returnMetadata: "all",
  filter: { category: "documentation" },
});
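Filtering on a metadata field generally requires a metadata index on that property, created before the vectors are upserted. At the time of writing the command looks roughly like this (check npx wrangler vectorize --help for the current syntax):
npx wrangler vectorize create-metadata-index knowledge-base --property-name=category --type=string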
Vectorize Operations
| Operation | Method | Description |
|---|---|---|
| Insert/update | upsert(vectors) | Add or replace vectors by ID |
| Query | query(vector, options) | Find similar vectors |
| Get by ID | getByIds(ids) | Retrieve specific vectors |
| Delete | deleteByIds(ids) | Remove vectors |
| Describe | describe() | Index stats (count, dimensions) |
upsert is idempotent - calling it with the same ID replaces the vector. This makes re-indexing safe.
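For example, when re-indexing a single document whose chunk IDs follow the doc.id plus index pattern from the ingest handler, you can clear its old chunks first. A sketch using the operations above, inside a handler with the same bindings (the three-chunk ID list is illustrative):
// Remove a document's old chunks before re-upserting (IDs follow `${doc.id}-${i}`)
await c.env.VECTORIZE.deleteByIds(["doc-1-0", "doc-1-1", "doc-1-2"]);

// Fetch specific vectors back, e.g. to verify what was stored
const stored = await c.env.VECTORIZE.getByIds(["doc-1-0"]);

// Index stats: dimensions, vector count
const info = await c.env.VECTORIZE.describe();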
Similarity Metrics
Choose when creating the index:
| Metric | When to Use |
|---|---|
| cosine | Text embeddings (most common; direction matters, not magnitude) |
| euclidean | When absolute distance matters |
| dot-product | When vectors are normalized |
For BGE embeddings, use cosine. Cosine scores range from -1 to 1, where 1 means identical direction; in practice, text-embedding matches usually land between 0 and 1, with higher meaning more similar.
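Scores are handy for dropping weak matches before they reach the prompt. Inside the /ask handler above, a sketch might look like this; the 0.6 cutoff is an arbitrary assumption to tune against your own data:
const MIN_SCORE = 0.6; // assumed threshold; tune empirically
const relevant = matches.matches.filter((m) => m.score >= MIN_SCORE);
if (relevant.length === 0) {
  return c.json({ answer: "I couldn't find relevant context for that question.", sources: [] });
}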
Testing the Pipeline
# Ingest some documents
curl -X POST http://localhost:8787/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [
      {
        "id": "doc-1",
        "title": "Workers AI",
        "content": "Workers AI lets you run AI models on Cloudflare GPUs. You can generate text, create embeddings, and produce images. Models include Llama, Mistral, and BGE."
      },
      {
        "id": "doc-2",
        "title": "Vectorize",
        "content": "Vectorize is a vector database for storing embeddings. It supports cosine similarity, metadata filtering, and integrates with Workers AI for a complete RAG pipeline."
      }
    ]
  }'

# Ask a question
curl -X POST http://localhost:8787/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I generate embeddings?"}'
Gotcha: Vectorize indexing is eventually consistent. After an upsert, the vectors may not be immediately queryable. In production this is fast (seconds), but in tests you might need a small delay.
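In tests, a small retry loop is usually enough. A sketch, assuming describe() exposes a vector count as suggested in the operations table above (the attempt count and wait time are arbitrary):
async function waitForVectors(env: Env, expected: number, attempts = 10): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    const info = await env.VECTORIZE.describe();
    if (info.vectorCount >= expected) return; // index has caught up (field name assumed; check describe()'s result shape)
    await new Promise((resolve) => setTimeout(resolve, 1000)); // wait 1s, then retry
  }
  throw new Error("Vectorize index did not catch up in time");
}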
What’s Next
- AI Gateway - Add caching to the LLM call in the generate step
- Agents SDK - Wrap the RAG pipeline in a stateful conversational agent
- RAG Patterns - Advanced chunking, reranking, and hybrid search