Vectorize RAG

Build a complete retrieval-augmented generation (RAG) pipeline in a single Worker. Chunk your text, generate embeddings with Workers AI, store them in Vectorize, query for relevant context, and generate an answer with an LLM. Everything runs on Cloudflare’s network.

Prerequisites: AI Landscape, Workers AI

How RAG Works

RAG augments an LLM’s knowledge with your own data. Instead of fine-tuning a model, you retrieve relevant documents at query time and inject them into the prompt.

The pipeline:

  1. Chunk: Split documents into smaller pieces
  2. Embed: Convert each chunk into a vector using an embedding model
  3. Store: Save vectors in Vectorize with metadata
  4. Query: Convert the user’s question into a vector, find similar chunks
  5. Generate: Send the question + retrieved chunks to an LLM

Setup

Create a Vectorize index and configure bindings:

# Create the index (768 dimensions for bge-base)
npx wrangler vectorize create knowledge-base --dimensions=768 --metric=cosine

wrangler.jsonc:

{
  "name": "rag-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  },
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "knowledge-base"
    }
  ]
}

Run npx wrangler types to update the Env interface.
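After running it, the generated `Env` interface should look roughly like this (a sketch only; the real declarations land in the generated worker-configuration.d.ts, and the `Ai` and `Vectorize` types come from Cloudflare's workers types):

```typescript
// Sketch of the generated bindings; `wrangler types` emits the real one.
interface Env {
  AI: Ai;
  VECTORIZE: Vectorize;
}
```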

The Full Pipeline

Here is the complete Worker with ingest and query endpoints:

import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

interface EmbeddingResponse {
  shape: number[];
  data: number[][];
}

// --- Ingest: chunk, embed, store ---

app.post("/ingest", async (c) => {
  const { documents } = await c.req.json<{
    documents: { id: string; title: string; content: string }[];
  }>();

  // 1. Chunk the documents
  const chunks: { id: string; text: string; metadata: Record<string, string> }[] = [];
  for (const doc of documents) {
    const docChunks = chunkText(doc.content, 500);
    docChunks.forEach((text, i) => {
      chunks.push({
        id: `${doc.id}-${i}`,
        text,
        metadata: { title: doc.title, docId: doc.id },
      });
    });
  }

  // 2. Generate embeddings
  const texts = chunks.map((chunk) => chunk.text);
  const embeddings: EmbeddingResponse = await c.env.AI.run(
    "@cf/baai/bge-base-en-v1.5",
    { text: texts }
  );

  // 3. Store in Vectorize
  const vectors = chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings.data[i],
    // Include the chunk text so /ask can rebuild context from matches
    metadata: { ...chunk.metadata, text: chunk.text },
  }));

  const result = await c.env.VECTORIZE.upsert(vectors);
  return c.json({ indexed: vectors.length, result });
});

// --- Query: embed question, search, generate ---

app.post("/ask", async (c) => {
  const { question } = await c.req.json<{ question: string }>();

  // 1. Embed the question
  const queryEmbedding: EmbeddingResponse = await c.env.AI.run(
    "@cf/baai/bge-base-en-v1.5",
    { text: [question] }
  );

  // 2. Search Vectorize for similar chunks
  const matches = await c.env.VECTORIZE.query(queryEmbedding.data[0], {
    topK: 5,
    returnMetadata: "all",
  });

  // 3. Build context from matched chunks
  const context = matches.matches
    .map((match) => `[${match.metadata?.title}]: ${match.metadata?.text ?? ""}`)
    .join("\n\n");

  // 4. Generate answer with context
  const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: `Answer the question based on the following context. If the context doesn't contain the answer, say so.\n\nContext:\n${context}`,
      },
      { role: "user", content: question },
    ],
  });

  return c.json({
    answer: response.response,
    sources: matches.matches.map((m) => ({
      id: m.id,
      score: m.score,
      title: m.metadata?.title,
    })),
  });
});

export default app;

Chunking

Chunking splits long documents into pieces small enough to embed meaningfully. A simple approach splits by paragraph with a character limit:

function chunkText(text: string, maxLength: number): string[] {
  const paragraphs = text.split(/\n\n+/);
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (current.length + para.length > maxLength && current.length > 0) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }

  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }

  return chunks;
}

Guidelines:

  • ~500 characters per chunk is a reasonable starting point
  • Keep chunks semantically coherent (don’t split mid-sentence)
  • Include overlap between chunks if precision matters (e.g., 50 characters of overlap)
  • Store the chunk text in metadata so you can retrieve it with the vector
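The overlap guideline above can be sketched as a character-based variant of chunkText. This is illustrative only: the function name and parameters are hypothetical, and production chunkers usually step back to a sentence or word boundary rather than a raw character offset.

```typescript
// Hypothetical variant of chunkText: each chunk repeats the last
// `overlap` characters of the previous one, so content split at a
// boundary still appears in both neighboring chunks.
function chunkTextWithOverlap(
  text: string,
  maxLength: number,
  overlap: number
): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxLength, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlap; // step back so chunks share `overlap` chars
  }
  return chunks;
}
```

With maxLength 500 and overlap 50, each chunk after the first starts 450 characters after the previous one, trading a little extra storage for better recall near chunk boundaries.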

Gotcha: The embedding model has a token limit. BGE base handles up to 512 tokens per input. If your chunks exceed this, the embedding quality degrades. Keep chunks under the model’s limit.
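One quick way to stay under that limit is a rough character-based token estimate. This is a common heuristic (English text averages around four characters per token), not the model's actual tokenizer, so treat it as a safety margin rather than an exact count:

```typescript
// Rough heuristic: ~4 characters per token for English text.
// An approximation, not the embedding model's real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Example: flag chunks that may exceed a 512-token budget.
function exceedsTokenBudget(chunk: string, budget = 512): boolean {
  return estimateTokens(chunk) > budget;
}
```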

Storing Metadata

Vectorize supports metadata on each vector. Use it to store the original text, source document ID, and any fields you want to filter on:

const vectors = chunks.map((chunk, i) => ({
  id: chunk.id,
  values: embeddings.data[i],
  metadata: {
    title: chunk.metadata.title,
    text: chunk.text,        // Store the chunk text for retrieval
    docId: chunk.metadata.docId,
    category: "documentation",
  },
}));

When querying, you can filter by metadata:

const matches = await c.env.VECTORIZE.query(queryVector, {
  topK: 5,
  returnMetadata: "all",
  filter: { category: "documentation" },
});
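Note that Vectorize only allows filtering on properties that have a metadata index, and the index must exist before the vectors are inserted. A sketch of creating one with wrangler (flag names per current wrangler releases; check your version's help output):

```shell
# Allow filtering on the string property "category"
npx wrangler vectorize create-metadata-index knowledge-base \
  --property-name=category --type=string
```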

Vectorize Operations

| Operation     | Method                   | Description                    |
| ------------- | ------------------------ | ------------------------------ |
| Insert/update | upsert(vectors)          | Add or replace vectors by ID   |
| Query         | query(vector, options)   | Find similar vectors           |
| Get by ID     | getByIds(ids)            | Retrieve specific vectors      |
| Delete        | deleteByIds(ids)         | Remove vectors                 |
| Describe      | describe()               | Index stats (count, dimensions) |

upsert is idempotent - calling it with the same ID replaces the vector. This makes re-indexing safe.

Similarity Metrics

Choose when creating the index:

| Metric      | When to Use                                                  |
| ----------- | ------------------------------------------------------------ |
| cosine      | Text embeddings (most common; direction matters, not magnitude) |
| euclidean   | When absolute distance matters                               |
| dot-product | When vectors are normalized                                  |

For BGE embeddings, use cosine. Cosine similarity ranges from -1 to 1 in general; for these text embeddings, scores in practice fall between roughly 0 (unrelated) and 1 (identical).
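For intuition, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. Vectorize computes this server-side; the sketch below is only to show what the score means:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1 means same direction, 0 means orthogonal, -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the score depends only on direction, two chunks about the same topic score high even if one is much longer than the other.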

Testing the Pipeline

# Ingest some documents
curl -X POST http://localhost:8787/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [
      {
        "id": "doc-1",
        "title": "Workers AI",
        "content": "Workers AI lets you run AI models on Cloudflare GPUs. You can generate text, create embeddings, and produce images. Models include Llama, Mistral, and BGE."
      },
      {
        "id": "doc-2",
        "title": "Vectorize",
        "content": "Vectorize is a vector database for storing embeddings. It supports cosine similarity, metadata filtering, and integrates with Workers AI for a complete RAG pipeline."
      }
    ]
  }'

# Ask a question
curl -X POST http://localhost:8787/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I generate embeddings?"}'

Gotcha: Vectorize indexing is eventually consistent. After an upsert, the vectors may not be immediately queryable. In production this is fast (seconds), but in tests you might need a small delay.
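For tests, rather than a fixed sleep, a small polling helper is more robust. The helper below is illustrative (not part of any SDK): it retries an async call until a predicate passes or attempts run out, and you would pass it a closure that queries the index.

```typescript
// Poll `fn` until `done(result)` is true, waiting `delayMs` between
// attempts. Returns the last result either way. Names are illustrative.
async function pollUntil<T>(
  fn: () => Promise<T>,
  done: (result: T) => boolean,
  attempts = 10,
  delayMs = 500
): Promise<T> {
  let last!: T;
  for (let i = 0; i < attempts; i++) {
    last = await fn();
    if (done(last)) return last;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return last;
}
```

In a test you might wrap the /ask query with it, e.g. poll until the response lists at least one source before asserting on the answer.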

What’s Next

  • AI Gateway - Add caching to the LLM call in the generate step
  • Agents SDK - Wrap the RAG pipeline in a stateful conversational agent
  • RAG Patterns - Advanced chunking, reranking, and hybrid search