# AI Landscape
Cloudflare’s AI stack has four main products. Each solves a different problem, and they compose together for end-to-end AI applications.
Prerequisites: Platform Model, Workers
## The Stack at a Glance
| Product | What It Does | Binding Type |
|---|---|---|
| Workers AI | Run models on Cloudflare GPUs (LLMs, embeddings, image gen) | `Ai` |
| AI Gateway | Proxy AI requests with caching, rate limiting, fallback | Via `env.AI.gateway()` |
| Vectorize | Vector database for similarity search | `Vectorize` |
| AI Search | Managed RAG - index data, query with natural language | API-based |
## How They Connect
```mermaid
flowchart LR
    subgraph app["Your Worker"]
        W["Hono App"]
    end
    subgraph gateway["AI Gateway"]
        GW["Proxy + Cache"]
    end
    subgraph inference["Inference"]
        WAI["Workers AI"]
        OAI["OpenAI"]
        ANT["Anthropic"]
    end
    subgraph storage["Vector Storage"]
        VZ["Vectorize"]
    end
    W -->|"env.AI.gateway()"| GW
    GW -->|Primary| WAI
    GW -->|Fallback| OAI
    GW -->|Fallback| ANT
    W -->|"env.AI.run()"| WAI
    W -->|"env.VECTORIZE.query()"| VZ
    WAI -->|Embeddings| VZ
```
A typical flow: your Worker generates embeddings with Workers AI, stores them in Vectorize, and when a user asks a question, queries Vectorize for relevant context, then calls an LLM through AI Gateway (which caches repeated queries and falls back to another provider if the primary is down).
## Workers AI
Serverless GPU inference. You call `env.AI.run()` with a model name and input, and get a response. No GPU provisioning, no model deployment, no infrastructure.
```ts
// Text generation
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "Explain edge computing" }],
});

// Embeddings
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["document to embed"],
});

// Image generation
const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", {
  prompt: "a sunset over the ocean",
});
```
Key characteristics:
- Model catalog: Llama, Mistral, Gemma (text), BGE (embeddings), FLUX/Stable Diffusion (images), Whisper (speech-to-text)
- No cold starts: Models are pre-loaded on Cloudflare GPUs
- Token limits vary by model: Check the model catalog for context windows and output limits
- Streaming: Pass `stream: true` for a Server-Sent Events response
- Binding: A single `ai` binding in `wrangler.jsonc`, accessible as `env.AI`
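The streaming bullet above can be sketched as a pass-through Worker. The SSE payload shape shown here (`data: {"response":"..."}` lines ending with `data: [DONE]`) is an assumption to illustrate the parsing; check the model catalog for the exact format your model emits:

```typescript
// Sketch: stream tokens from Workers AI to the client as SSE.
// Assumes `env.AI.run` with `stream: true` resolves to a ReadableStream
// of Server-Sent Events bytes.
export interface Env {
  AI: { run(model: string, input: unknown): Promise<ReadableStream> };
}

// Pure helper: pull the generated text out of one SSE data line.
// Returns null for non-data lines and for the terminal `data: [DONE]` marker.
export function tokenFromSseLine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;
  try {
    const parsed = JSON.parse(payload) as { response?: string };
    return parsed.response ?? null;
  } catch {
    return null;
  }
}

export default {
  async fetch(_req: Request, env: Env): Promise<Response> {
    const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [{ role: "user", content: "Explain edge computing" }],
      stream: true,
    });
    // Pass the SSE bytes straight through to the browser.
    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
};
```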
See Workers AI quickstart for a full walkthrough.
## AI Gateway
A proxy that sits between your app and any AI provider. You can use it with Workers AI, OpenAI, Anthropic, or any provider that speaks the OpenAI-compatible API format.
What it gives you:
- Caching: Identical prompts return cached responses (saves cost and latency)
- Rate limiting: Throttle requests per user, per IP, or globally
- Fallback: If provider A fails, automatically try provider B
- Logging: Every request logged with prompt, response, tokens, latency
- Analytics: Dashboard showing usage, costs, and error rates
- Retries: Automatic retry on transient failures
Two ways to use it:
- Workers AI binding: `env.AI.gateway("my-gateway").run(...)` adds gateway features to Workers AI calls
- Universal endpoint: Proxy any provider through `https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}/...`
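A minimal sketch of the universal endpoint, assuming an OpenAI-style chat completions call; `ACCOUNT_ID` and `GATEWAY_ID` are placeholders for your own values, and the request body follows the provider's API, not anything gateway-specific:

```typescript
// Build the universal endpoint URL from the pattern above.
export function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path: string,
): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}/${path}`;
}

// Sketch: call OpenAI through the gateway. The provider key still
// authenticates the request; the gateway adds caching, logging, etc.
export async function chatViaGateway(apiKey: string): Promise<unknown> {
  const url = gatewayUrl("ACCOUNT_ID", "GATEWAY_ID", "openai", "chat/completions");
  const res = await fetch(url, {
    method: "POST",
    headers: {
      authorization: `Bearer ${apiKey}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Hello" }],
    }),
  });
  return res.json();
}
```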
See AI Gateway quickstart for setup and examples.
## Vectorize
A vector database optimized for Workers. You store embedding vectors with metadata, then query for the most similar vectors. This is the foundation for RAG (retrieval-augmented generation).
```ts
// Store vectors
await env.VECTORIZE.upsert([
  { id: "doc-1", values: embeddingVector, metadata: { title: "My Doc" } },
]);

// Query for similar vectors
const matches = await env.VECTORIZE.query(queryVector, {
  topK: 5,
  filter: { category: "docs" },
});
```
Key characteristics:
- Dimensions: Configurable per index (768 for BGE base, 384 for BGE small)
- Metrics: Cosine similarity, Euclidean distance, dot product
- Metadata filtering: Filter results by metadata fields before similarity ranking
- Binding: A `vectorize` binding in `wrangler.jsonc`, accessible as `env.VECTORIZE`
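As a sketch, the binding might look like this in `wrangler.jsonc` (the index name is illustrative; dimensions and metric are set when the index is created, not in the binding):

```jsonc
{
  // Created beforehand with, e.g.:
  //   npx wrangler vectorize create my-index --dimensions=768 --metric=cosine
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "my-index"
    }
  ]
}
```

Note that the index dimensions must match your embedding model (768 for BGE base).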
See Vectorize RAG quickstart for a complete pipeline.
## AI Search
Managed RAG as a service. You point it at your data sources (URLs, documents, APIs), and it handles chunking, embedding, indexing, and retrieval. When a user queries, it returns relevant context with source citations.
AI Search is the “I don’t want to build RAG infrastructure” option. If you need custom chunking strategies, embedding models, or retrieval logic, use Vectorize directly. If you want a turnkey search experience, AI Search handles the pipeline.
## When to Use What
| Scenario | Products |
|---|---|
| Add AI chat to your app | Workers AI (or AI Gateway + OpenAI) |
| Reduce AI API costs | AI Gateway (caching) |
| Use multiple AI providers with fallback | AI Gateway |
| Search your own documents | Vectorize + Workers AI |
| Quick RAG without building infrastructure | AI Search |
| Build a stateful AI assistant | Agents SDK + Workers AI |
| Monitor AI usage and costs | AI Gateway (analytics) |
## Composition Example
A typical AI-powered search feature uses three products together:
1. Ingest: Worker receives documents, generates embeddings with Workers AI, and stores them in Vectorize
2. Query: User asks a question; the Worker embeds the query and searches Vectorize for relevant chunks
3. Generate: Worker sends the user question + retrieved context to an LLM through AI Gateway
4. Respond: Streaming response back to the user
Each step uses a binding: `env.AI` for inference, `env.VECTORIZE` for vector search, and `env.AI.gateway()` for the final LLM call with caching.
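The steps above can be sketched in one handler. The index name, gateway id, metadata shape, and the `Env` interface here are illustrative assumptions, not the official types; consult each product's docs for exact signatures:

```typescript
// Sketch of the query/generate/respond path (steps 2-4).
export interface Env {
  AI: {
    run(
      model: string,
      input: unknown,
      options?: { gateway?: { id: string } },
    ): Promise<any>;
  };
  VECTORIZE: {
    query(
      vector: number[],
      options: { topK: number; returnMetadata?: boolean },
    ): Promise<{ matches: { id: string; metadata?: Record<string, unknown> }[] }>;
  };
}

// Pure helper: fold retrieved chunks into a single prompt.
export function buildRagPrompt(question: string, chunks: string[]): string {
  return `Context:\n${chunks.join("\n---\n")}\n\nQuestion: ${question}`;
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const question = new URL(req.url).searchParams.get("q") ?? "";

    // Embed the query with Workers AI.
    const emb = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [question] });
    const queryVector: number[] = emb.data[0];

    // Retrieve the closest chunks from Vectorize.
    const { matches } = await env.VECTORIZE.query(queryVector, {
      topK: 5,
      returnMetadata: true,
    });
    const chunks = matches.map((m) => String(m.metadata?.text ?? ""));

    // Generate through AI Gateway (gateway id is a placeholder).
    const answer = await env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      { messages: [{ role: "user", content: buildRagPrompt(question, chunks) }] },
      { gateway: { id: "my-gateway" } },
    );

    // Return the result (swap in a streamed Response for token streaming).
    return Response.json(answer);
  },
};
```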