AI Landscape

Cloudflare’s AI stack has four main products. Each solves a different problem, and they compose into end-to-end AI applications.

Prerequisites: Platform Model, Workers

The Stack at a Glance

| Product | What It Does | Binding Type |
| --- | --- | --- |
| Workers AI | Run models on Cloudflare GPUs (LLMs, embeddings, image gen) | Ai |
| AI Gateway | Proxy AI requests with caching, rate limiting, fallback | Via env.AI.gateway() |
| Vectorize | Vector database for similarity search | Vectorize |
| AI Search | Managed RAG: index data, query with natural language | API-based |

How They Connect

flowchart LR
    subgraph app["Your Worker"]
        W["Hono App"]
    end

    subgraph gateway["AI Gateway"]
        GW["Proxy + Cache"]
    end

    subgraph inference["Inference"]
        WAI["Workers AI"]
        OAI["OpenAI"]
        ANT["Anthropic"]
    end

    subgraph storage["Vector Storage"]
        VZ["Vectorize"]
    end

    W -->|"env.AI.gateway()"| GW
    GW -->|Primary| WAI
    GW -->|Fallback| OAI
    GW -->|Fallback| ANT
    W -->|"env.AI.run()"| WAI
    W -->|"env.VECTORIZE.query()"| VZ
    WAI -->|Embeddings| VZ

A typical flow: your Worker generates embeddings with Workers AI and stores them in Vectorize. When a user asks a question, the Worker queries Vectorize for relevant context, then calls an LLM through AI Gateway, which caches repeated queries and falls back to another provider if the primary is down.

Workers AI

Serverless GPU inference. You call env.AI.run() with a model name and input, and get a response. No GPU provisioning, no model deployment, no infrastructure.

// Text generation
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Explain edge computing" }],
});

// Embeddings
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["document to embed"],
});

// Image generation
const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", {
  prompt: "a sunset over the ocean",
});

Key characteristics:

  • Model catalog: Llama, Mistral, Gemma (text), BGE (embeddings), FLUX/Stable Diffusion (images), Whisper (speech-to-text)
  • No cold starts: Models are pre-loaded on Cloudflare GPUs
  • Token limits vary by model: Check the model catalog for context windows and output limits
  • Streaming: Pass stream: true to get a Server-Sent Events response
  • Binding: Single ai binding in wrangler.jsonc, accessible as env.AI
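
The streaming case above can be sketched as follows. This is a hedged sketch: parseSSE is a hypothetical helper, and the exact event payload shape (a response field per event, a [DONE] terminator) is an assumption about the SSE format for illustration, not a documented contract.

```typescript
// Hypothetical helper: decode a Server-Sent Events body into text tokens.
// With stream: true, env.AI.run() returns a stream of events; each event line
// looks like `data: {"response":"token"}` and the stream ends with `data: [DONE]`.
function parseSSE(raw: string): string[] {
  const tokens: string[] = [];
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // assumed end-of-stream marker
    tokens.push(JSON.parse(payload).response);
  }
  return tokens;
}

// Example SSE body in the assumed shape:
const body =
  'data: {"response":"Edge"}\n\ndata: {"response":" computing"}\n\ndata: [DONE]\n\n';
console.log(parseSSE(body).join(""));
```

In a real Worker you would read the response body chunk by chunk with a TextDecoder and feed each chunk through the same parsing logic.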

See Workers AI quickstart for a full walkthrough.

AI Gateway

A proxy that sits between your app and any AI provider. You can use it with Workers AI, OpenAI, Anthropic, or any provider that speaks the OpenAI-compatible API format.

What it gives you:

  • Caching: Identical prompts return cached responses (saves cost and latency)
  • Rate limiting: Throttle requests per user, per IP, or globally
  • Fallback: If provider A fails, automatically try provider B
  • Logging: Every request logged with prompt, response, tokens, latency
  • Analytics: Dashboard showing usage, costs, and error rates
  • Retries: Automatic retry on transient failures

Two ways to use it:

  1. Workers AI binding: env.AI.gateway("my-gateway").run(...) - adds gateway features to Workers AI calls
  2. Universal endpoint: Proxy any provider through https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}/...
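
The universal endpoint pattern can be sketched as a URL builder. This is a minimal sketch: gatewayUrl is a hypothetical helper, and the account/gateway values are placeholders you would replace with your own.

```typescript
// Hypothetical helper: build a universal-endpoint URL for any provider.
// Request body and auth headers stay in the provider's native format;
// only the base URL changes to route through the gateway.
function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path: string,
): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}/${path}`;
}

// Placeholder IDs; a real call would then fetch(url, { method: "POST", ... })
const url = gatewayUrl("my-account-id", "my-gateway", "openai", "chat/completions");
console.log(url);
```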

See AI Gateway quickstart for setup and examples.

Vectorize

A vector database optimized for Workers. You store embedding vectors with metadata, then query for the most similar vectors. This is the foundation for RAG (retrieval-augmented generation).

// Store vectors
await env.VECTORIZE.upsert([
  { id: "doc-1", values: embeddingVector, metadata: { title: "My Doc" } },
]);

// Query for similar vectors
const matches = await env.VECTORIZE.query(queryVector, {
  topK: 5,
  filter: { category: "docs" },
});

Key characteristics:

  • Dimensions: Configurable per index (768 for BGE base, 384 for BGE small)
  • Metrics: Cosine similarity, Euclidean distance, dot product
  • Metadata filtering: Filter results by metadata fields before similarity ranking
  • Binding: vectorize binding in wrangler.jsonc, accessible as env.VECTORIZE
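
The query semantics (filter by metadata, then rank by similarity) can be sketched in memory. This is a toy illustration of what the index computes, not how Vectorize is implemented; the query function and Entry type are invented for the sketch.

```typescript
// In-memory sketch of a Vectorize-style query: apply the metadata filter first,
// then rank the remaining vectors by cosine similarity and keep the top K.
type Entry = { id: string; values: number[]; metadata: Record<string, string> };

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function query(
  entries: Entry[],
  vector: number[],
  topK: number,
  filter: Record<string, string>,
) {
  return entries
    .filter((e) => Object.entries(filter).every(([k, v]) => e.metadata[k] === v))
    .map((e) => ({ id: e.id, score: cosine(vector, e.values) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const entries: Entry[] = [
  { id: "doc-1", values: [1, 0, 0], metadata: { category: "docs" } },
  { id: "doc-2", values: [0.9, 0.1, 0], metadata: { category: "blog" } }, // filtered out
  { id: "doc-3", values: [0, 1, 0], metadata: { category: "docs" } },
];
console.log(query(entries, [1, 0, 0], 2, { category: "docs" }));
```

Note that doc-2 is the second-closest vector but never appears in the results: the metadata filter removes it before ranking, which is the behavior the filter option gives you.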

See Vectorize RAG quickstart for a complete pipeline.

AI Search

Managed RAG as a service. You point it at your data sources (URLs, documents, APIs), and it handles chunking, embedding, indexing, and retrieval. When a user queries, it returns relevant context with source citations.

AI Search is the “I don’t want to build RAG infrastructure” option. If you need custom chunking strategies, embedding models, or retrieval logic, use Vectorize directly. If you want a turnkey search experience, AI Search handles the pipeline.

When to Use What

| Scenario | Products |
| --- | --- |
| Add AI chat to your app | Workers AI (or AI Gateway + OpenAI) |
| Reduce AI API costs | AI Gateway (caching) |
| Use multiple AI providers with fallback | AI Gateway |
| Search your own documents | Vectorize + Workers AI |
| Quick RAG without building infrastructure | AI Search |
| Build a stateful AI assistant | Agents SDK + Workers AI |
| Monitor AI usage and costs | AI Gateway (analytics) |

Composition Example

A typical AI-powered search feature uses three products together:

  1. Ingest: Worker receives documents, generates embeddings with Workers AI, stores in Vectorize
  2. Query: User asks a question, Worker embeds the query, searches Vectorize for relevant chunks
  3. Generate: Worker sends the user question + retrieved context to an LLM through AI Gateway
  4. Respond: Streaming response back to the user

Each step uses a binding: env.AI for inference, env.VECTORIZE for vector search, and env.AI.gateway() for the final LLM call with caching.
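
The steps above can be sketched end to end. This is a runnable toy, not Worker code: embed is a fake keyword detector standing in for a real embedding model, the in-memory index stands in for Vectorize, and the synchronous calls stand in for the real async bindings (await env.AI.run(...), await env.VECTORIZE.query(...)).

```typescript
// Toy stand-ins: a fake embedding (keyword detector) and an in-memory index.
type Doc = { id: string; values: number[]; text: string };
const index: Doc[] = [];

const KEYWORDS = ["edge", "cache", "gpu"];
function embed(text: string): number[] {
  const t = text.toLowerCase();
  return KEYWORDS.map((w) => (t.includes(w) ? 1 : 0));
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b) || 1); // guard against zero vectors
}

// 1. Ingest: embed each document and store it (real code: env.VECTORIZE.upsert)
function ingest(id: string, text: string) {
  index.push({ id, values: embed(text), text });
}

// 2 + 3. Query: embed the question, retrieve the best chunk, build the LLM prompt.
// In a real Worker the prompt would go to an LLM through env.AI.gateway(...).
function buildPrompt(question: string): string {
  const qv = embed(question);
  const best = [...index].sort(
    (a, b) => cosine(qv, b.values) - cosine(qv, a.values),
  )[0];
  return `Context: ${best.text}\n\nQuestion: ${question}`;
}

ingest("doc-1", "Edge computing runs code close to users.");
ingest("doc-2", "GPUs accelerate model inference.");
console.log(buildPrompt("What is edge computing?"));
```

Step 4 (streaming the response back) is omitted here; in a Worker you would return the LLM's streamed body directly as the Response.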