AI Gateway
AI Gateway sits between your app and AI providers. It caches repeated prompts, rate limits requests, logs everything, and falls back to alternate providers when one goes down. Works with Workers AI, OpenAI, Anthropic, and any OpenAI-compatible API.
Prerequisites: AI Landscape, Workers AI
Create a Gateway
Create a gateway in the Cloudflare dashboard:
- Go to AI > AI Gateway in the dashboard
- Click Create Gateway
- Name it (e.g., production) and save
You get a gateway ID. You’ll reference this in your Worker code.
Use with Workers AI Binding
The simplest way. Add gateway options to your existing env.AI.run() calls:
```ts
import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

app.post("/chat", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const response = await c.env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct",
    {
      messages: [{ role: "user", content: prompt }],
    },
    {
      gateway: {
        id: "production",
        skipCache: false,
        cacheTtl: 3600, // Cache for 1 hour
      },
    }
  );

  return c.json(response);
});

export default app;
```
Gateway options:
| Parameter | Type | Default | Description |
|---|---|---|---|
| id | string | - | Gateway name (must exist in your account) |
| skipCache | boolean | false | Bypass cache for this request |
| cacheTtl | number | - | Cache TTL in seconds |
That’s it. The same env.AI binding, same model call, but now your requests flow through the gateway with caching, logging, and analytics.
Universal Endpoint
For non-Workers AI providers, use the Universal Endpoint. This is a URL that proxies requests to any AI provider:
```
https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}
```
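As a sketch, the URL can be assembled from its parts (gatewayUrl is a hypothetical helper for illustration, not part of any SDK):

```ts
// Hypothetical helper: build a Universal Endpoint URL from its parts.
function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: string,
  path = ""
): string {
  const base = `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}`;
  return path ? `${base}/${path}` : base;
}

// An OpenAI chat completions URL through a gateway named "production":
const url = gatewayUrl("abc123", "production", "openai", "chat/completions");
// https://gateway.ai.cloudflare.com/v1/abc123/production/openai/chat/completions
```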
OpenAI Through the Gateway
Replace the OpenAI base URL with the gateway URL:
```ts
app.post("/chat/openai", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const response = await fetch(
    `https://gateway.ai.cloudflare.com/v1/${c.env.ACCOUNT_ID}/production/openai/chat/completions`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${c.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
      }),
    }
  );

  return new Response(response.body, {
    headers: { "Content-Type": "application/json" },
  });
});
```
Supported providers via Universal Endpoint: openai, anthropic, azure-openai, google-ai-studio, cohere, groq, huggingface, mistral, perplexity-ai, replicate.
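Each provider keeps its native API shape behind the gateway; only the base URL changes. As a rough sketch, an Anthropic request carries Anthropic's own headers and hits the /v1/messages path (the helper name, ANTHROPIC_API_KEY secret, and model name are assumptions for illustration):

```ts
// Sketch: request init for Anthropic through the gateway. The x-api-key and
// anthropic-version headers come from Anthropic's API, not the gateway.
function anthropicInit(prompt: string, apiKey: string): RequestInit {
  return {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-5-haiku-20241022", // assumed model name
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// const res = await fetch(
//   `https://gateway.ai.cloudflare.com/v1/${accountId}/production/anthropic/v1/messages`,
//   anthropicInit(prompt, env.ANTHROPIC_API_KEY)
// );
```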
Caching
AI Gateway caches responses by matching the full request (provider, model, messages/prompt). Identical requests return the cached response without hitting the provider.
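For requests sent to the Universal Endpoint over HTTP, you can observe cache behavior on the response; this sketch assumes the cf-aig-cache-status response header, which reports "HIT" or "MISS":

```ts
// Sketch: check the gateway's cache-status response header on a fetch result.
function wasCacheHit(headers: Headers): boolean {
  return headers.get("cf-aig-cache-status") === "HIT";
}

// const res = await fetch(gatewayUrl, init);
// if (wasCacheHit(res.headers)) { /* served from cache, no provider call */ }
```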
When to use caching:
- FAQ bots: Same questions get asked repeatedly
- Autocomplete: Identical partial prompts
- Batch processing: Same transformation applied to many items
When to skip cache:
- Conversations with history: Each message adds new context
- Real-time data: Responses depend on current state
- Creative generation: You want variety
```ts
// Force fresh response
const response = await c.env.AI.run(
  "@cf/meta/llama-3.1-8b-instruct",
  { messages: [{ role: "user", content: prompt }] },
  { gateway: { id: "production", skipCache: true } }
);
```
Rate Limiting
Configure rate limits in the dashboard under your gateway settings. You can limit by:
- Requests per minute/hour: Global or per-user
- Tokens per minute: Control cost
- Concurrent requests: Prevent overload
When a limit is hit, the gateway returns a 429 Too Many Requests response.
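A client-side sketch for handling those 429s: retry with exponential backoff. The helper names and delay values here are illustrative, not part of any SDK:

```ts
// Sketch: exponential backoff for 429 responses from the gateway.
function backoffMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** attempt; // 500ms, 1s, 2s, ...
}

async function withRetry(
  call: () => Promise<Response>,
  maxAttempts = 3
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await call();
    if (res.status !== 429 || attempt >= maxAttempts - 1) return res;
    await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)));
  }
}
```

For example, withRetry(() => fetch(gatewayUrl, init)) retries up to twice before returning the 429 to the caller.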
Provider Fallback
Use the env.AI.gateway() binding method to send requests in the Universal Endpoint format. Passing an array of request steps defines a fallback chain: if the first provider fails, the gateway tries the next one:
```ts
app.post("/chat/reliable", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();
  const messages = [{ role: "user", content: prompt }];

  // An array of steps defines the fallback chain: the gateway tries each
  // provider in order until one succeeds. The Anthropic step assumes an
  // ANTHROPIC_API_KEY secret alongside OPENAI_API_KEY.
  const response = await c.env.AI.gateway("production").run([
    {
      provider: "openai",
      endpoint: "chat/completions",
      headers: {
        authorization: `Bearer ${c.env.OPENAI_API_KEY}`,
        "content-type": "application/json",
      },
      query: { model: "gpt-4o-mini", messages },
    },
    {
      provider: "anthropic",
      endpoint: "v1/messages",
      headers: {
        "x-api-key": c.env.ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      query: { model: "claude-3-5-haiku-20241022", max_tokens: 1024, messages },
    },
  ]);

  return response;
});
```
The gateway works through the steps in order and returns the first successful response. Fallback requests are cached and logged like any other.
Logging and Analytics
Every request through the gateway is logged. The dashboard shows:
- Request log: Prompt, response, tokens, latency, cache hit/miss
- Analytics: Total requests, cost estimates, error rates, cache hit ratio
- Model usage: Breakdown by model and provider
This is valuable for:
- Cost tracking: See exactly how much you’re spending per model
- Debugging: Inspect individual prompts and responses
- Optimization: Identify which requests could benefit from caching
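One way to make the request log filterable by user or feature is to tag requests with custom metadata. This sketch assumes the cf-aig-metadata request header, which carries a JSON object that the gateway attaches to the log entry:

```ts
// Sketch: merge a custom-metadata header into an outgoing request's headers.
function withMetadata(
  headers: Record<string, string>,
  meta: Record<string, string>
): Record<string, string> {
  return { ...headers, "cf-aig-metadata": JSON.stringify(meta) };
}

const headers = withMetadata(
  { "Content-Type": "application/json" },
  { userId: "u_42", feature: "chat" } // shows up on the log entry
);
```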
Full Example: Cached Chat with Fallback
```ts
import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

app.post("/chat", async (c) => {
  const { prompt, skipCache } = await c.req.json<{
    prompt: string;
    skipCache?: boolean;
  }>();

  try {
    // Primary: Workers AI through the gateway
    const response = await c.env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      {
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: prompt },
        ],
      },
      {
        gateway: {
          id: "production",
          skipCache: skipCache ?? false,
          cacheTtl: 1800,
        },
      }
    );
    return c.json({ source: "workers-ai", ...response });
  } catch (err) {
    // Fallback: OpenAI directly (also through the gateway for logging)
    const fallback = await fetch(
      `https://gateway.ai.cloudflare.com/v1/${c.env.ACCOUNT_ID}/production/openai/chat/completions`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${c.env.OPENAI_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4o-mini",
          messages: [{ role: "user", content: prompt }],
        }),
      }
    );
    const data = await fallback.json();
    return c.json({ source: "openai-fallback", ...data });
  }
});

export default app;
```
wrangler.jsonc for this example:
```jsonc
{
  "name": "ai-gateway-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  },
  "vars": {
    "ACCOUNT_ID": "your-account-id"
  }
}
```
Store OPENAI_API_KEY as a secret:
```sh
npx wrangler secret put OPENAI_API_KEY
```
Gotcha: The gateway caches based on the exact request body. If you include timestamps or random IDs in your prompt, every request is a cache miss. Keep prompts deterministic for cache hits.
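A sketch of the fix: normalize the question text and keep volatile values (request IDs, timestamps) out of the prompt entirely. The helper name is illustrative:

```ts
// Sketch: deterministic prompt builder. Identical questions produce
// byte-identical prompts, so the gateway can serve repeats from cache.
function buildPrompt(question: string): string {
  return question.trim().replace(/\s+/g, " ");
}

// Cache miss every time (unique ID baked into the prompt text):
//   `[req ${crypto.randomUUID()}] ${question}`
// Cache hit for repeated questions:
//   buildPrompt(question)
```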
What’s Next
- Vectorize RAG - Build a search pipeline that uses the gateway for LLM calls
- Agents SDK - Stateful agents with AI Gateway for reliable inference