# AI Landscape
Cloudflare’s AI stack has four main products. Each solves a different problem, and they compose together for end-to-end AI applications.
Prerequisites: Platform Model, Workers
## The Stack at a Glance
| Product | What It Does | Binding Type |
|---|---|---|
| Workers AI | Run models on Cloudflare GPUs (LLMs, embeddings, image gen) | `Ai` |
| AI Gateway | Proxy AI requests with caching, rate limiting, fallback | Via `env.AI.gateway()` |
| Vectorize | Vector database for similarity search | `Vectorize` |
| AI Search | Managed RAG - index data, query with natural language | API-based |
## How They Connect
```mermaid
flowchart LR
    subgraph app["Your Worker"]
        W["Hono App"]
    end
    subgraph gateway["AI Gateway"]
        GW["Proxy + Cache"]
    end
    subgraph inference["Inference"]
        WAI["Workers AI"]
        OAI["OpenAI"]
        ANT["Anthropic"]
    end
    subgraph storage["Vector Storage"]
        VZ["Vectorize"]
    end
    W -->|"env.AI.gateway()"| GW
    GW -->|Primary| WAI
    GW -->|Fallback| OAI
    GW -->|Fallback| ANT
    W -->|"env.AI.run()"| WAI
    W -->|"env.VECTORIZE.query()"| VZ
    WAI -->|Embeddings| VZ
```
A typical flow: your Worker generates embeddings with Workers AI, stores them in Vectorize, and when a user asks a question, queries Vectorize for relevant context, then calls an LLM through AI Gateway (which caches repeated queries and falls back to another provider if the primary is down).
## Workers AI
Serverless GPU inference. You call `env.AI.run()` with a model name and input, and get a response. No GPU provisioning, no model deployment, no infrastructure.
```ts
// Text generation
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Explain edge computing" }],
});

// Embeddings
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["document to embed"],
});

// Image generation
const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", {
  prompt: "a sunset over the ocean",
});
```
Key characteristics:
- Model catalog: Llama, Mistral, Gemma (text), BGE (embeddings), FLUX/Stable Diffusion (images), Whisper (speech-to-text)
- No cold starts: Models are pre-loaded on Cloudflare GPUs
- Token limits vary by model: Check the model catalog for context windows and output limits
- Streaming: Pass `stream: true` for a Server-Sent Events response
- Binding: A single `ai` binding in `wrangler.jsonc`, accessible as `env.AI`
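The streaming bullet above can be sketched as a pass-through Worker. The SSE payload shape shown here (`data: {"response":"..."}` lines ending with `data: [DONE]`) is an assumption to illustrate the parsing; check the model catalog for the exact format your model emits:

```typescript
// Sketch: stream tokens from Workers AI to the client as SSE.
// Assumes `env.AI.run` with `stream: true` resolves to a ReadableStream
// of Server-Sent Events bytes.
export interface Env {
  AI: { run(model: string, input: unknown): Promise<ReadableStream> };
}

// Pure helper: pull the generated text out of one SSE data line.
// Returns null for non-data lines and for the terminal `data: [DONE]` marker.
export function tokenFromSseLine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;
  try {
    const parsed = JSON.parse(payload) as { response?: string };
    return parsed.response ?? null;
  } catch {
    return null;
  }
}

export default {
  async fetch(_req: Request, env: Env): Promise<Response> {
    const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [{ role: "user", content: "Explain edge computing" }],
      stream: true,
    });
    // Pass the SSE bytes straight through to the browser.
    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
};
```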
See Workers AI quickstart for a full walkthrough.
## AI Gateway
A proxy that sits between your app and any AI provider. You can use it with Workers AI, OpenAI, Anthropic, or any provider that speaks the OpenAI-compatible API format.
What it gives you:
- Caching: Identical prompts return cached responses (saves cost and latency)
- Rate limiting: Throttle requests per user, per IP, or globally
- Fallback: If provider A fails, automatically try provider B
- Logging: Every request logged with prompt, response, tokens, latency
- Analytics: Dashboard showing usage, costs, and error rates
- Retries: Automatic retry on transient failures
Two ways to use it:
- Workers AI binding: `env.AI.gateway("my-gateway").run(...)` adds gateway features to Workers AI calls
- Universal endpoint: Proxy any provider through `https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}/...`
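A minimal sketch of the universal endpoint, assuming an OpenAI-style chat completions call; `ACCOUNT_ID` and `GATEWAY_ID` are placeholders for your own values, and the request body follows the provider's API, not anything gateway-specific:

```typescript
// Build the universal endpoint URL from the pattern above.
export function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path: string,
): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}/${path}`;
}

// Sketch: call OpenAI through the gateway. The provider key still
// authenticates the request; the gateway adds caching, logging, etc.
export async function chatViaGateway(apiKey: string): Promise<unknown> {
  const url = gatewayUrl("ACCOUNT_ID", "GATEWAY_ID", "openai", "chat/completions");
  const res = await fetch(url, {
    method: "POST",
    headers: {
      authorization: `Bearer ${apiKey}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Hello" }],
    }),
  });
  return res.json();
}
```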
See AI Gateway quickstart for setup and examples.
## Vectorize
A vector database optimized for Workers. You store embedding vectors with metadata, then query for the most similar vectors. This is the foundation for RAG (retrieval-augmented generation).
```ts
// Store vectors
await env.VECTORIZE.upsert([
  { id: "doc-1", values: embeddingVector, metadata: { title: "My Doc" } },
]);

// Query for similar vectors
const matches = await env.VECTORIZE.query(queryVector, {
  topK: 5,
  filter: { category: "docs" },
});
```
Key characteristics:
- Dimensions: Configurable per index (768 for BGE base, 384 for BGE small)
- Metrics: Cosine similarity, Euclidean distance, dot product
- Metadata filtering: Filter results by metadata fields before similarity ranking
- Binding: A `vectorize` binding in `wrangler.jsonc`, accessible as `env.VECTORIZE`
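As a sketch, the binding might look like this in `wrangler.jsonc` (the index name is illustrative; dimensions and metric are set when the index is created, not in the binding):

```jsonc
{
  // Created beforehand with, e.g.:
  //   npx wrangler vectorize create my-index --dimensions=768 --metric=cosine
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "my-index"
    }
  ]
}
```

Note that the index dimensions must match your embedding model (768 for BGE base).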
See Vectorize RAG quickstart for a complete pipeline.
## AI Search
Managed RAG as a service. You point it at your data sources (URLs, documents, APIs), and it handles chunking, embedding, indexing, and retrieval. When a user queries, it returns relevant context with source citations.
AI Search is the “I don’t want to build RAG infrastructure” option. If you need custom chunking strategies, embedding models, or retrieval logic, use Vectorize directly. If you want a turnkey search experience, AI Search handles the pipeline.
## When to Use What
| Scenario | Products |
|---|---|
| Add AI chat to your app | Workers AI (or AI Gateway + OpenAI) |
| Reduce AI API costs | AI Gateway (caching) |
| Use multiple AI providers with fallback | AI Gateway |
| Search your own documents | Vectorize + Workers AI |
| Quick RAG without building infrastructure | AI Search |
| Build a stateful AI assistant | Agents SDK + Workers AI |
| Monitor AI usage and costs | AI Gateway (analytics) |
## Composition Example
A typical AI-powered search feature uses three products together:
1. Ingest: Worker receives documents, generates embeddings with Workers AI, and stores them in Vectorize
2. Query: User asks a question; the Worker embeds the query and searches Vectorize for relevant chunks
3. Generate: Worker sends the user question + retrieved context to an LLM through AI Gateway
4. Respond: Streaming response back to the user
Each step uses a binding: `env.AI` for inference, `env.VECTORIZE` for vector search, and `env.AI.gateway()` for the final LLM call with caching.
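The steps above can be sketched in one handler. The index name, gateway id, metadata shape, and the `Env` interface here are illustrative assumptions, not the official types; consult each product's docs for exact signatures:

```typescript
// Sketch of the query/generate/respond path (steps 2-4).
export interface Env {
  AI: {
    run(
      model: string,
      input: unknown,
      options?: { gateway?: { id: string } },
    ): Promise<any>;
  };
  VECTORIZE: {
    query(
      vector: number[],
      options: { topK: number; returnMetadata?: boolean },
    ): Promise<{ matches: { id: string; metadata?: Record<string, unknown> }[] }>;
  };
}

// Pure helper: fold retrieved chunks into a single prompt.
export function buildRagPrompt(question: string, chunks: string[]): string {
  return `Context:\n${chunks.join("\n---\n")}\n\nQuestion: ${question}`;
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const question = new URL(req.url).searchParams.get("q") ?? "";

    // Embed the query with Workers AI.
    const emb = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });
    const queryVector: number[] = emb.data[0];

    // Retrieve the closest chunks from Vectorize.
    const { matches } = await env.VECTORIZE.query(queryVector, {
      topK: 5,
      returnMetadata: true,
    });
    const chunks = matches.map((m) => String(m.metadata?.text ?? ""));

    // Generate through AI Gateway (gateway id is a placeholder).
    const answer = await env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      { messages: [{ role: "user", content: buildRagPrompt(question, chunks) }] },
      { gateway: { id: "my-gateway" } },
    );

    // Return the result (swap in a streamed Response for token streaming).
    return Response.json(answer);
  },
};
```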