AI Gateway

AI Gateway sits between your app and AI providers. It caches repeated prompts, rate limits requests, logs everything, and falls back to alternate providers when one goes down. Works with Workers AI, OpenAI, Anthropic, and any OpenAI-compatible API.

Prerequisites: AI Landscape, Workers AI

Create a Gateway

Create a gateway in the Cloudflare dashboard:

  1. Go to AI > AI Gateway in the dashboard
  2. Click Create Gateway
  3. Name it (e.g., production) and save

You get a gateway ID. You’ll reference this in your Worker code.

Use with Workers AI Binding

The simplest integration: add gateway options to your existing env.AI.run() calls:

import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

app.post("/chat", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const response = await c.env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct",
    {
      messages: [{ role: "user", content: prompt }],
    },
    {
      gateway: {
        id: "production",
        skipCache: false,
        cacheTtl: 3600, // Cache for 1 hour
      },
    }
  );

  return c.json(response);
});

export default app;

Gateway options:

  Parameter   Type      Default   Description
  id          string    -         Gateway name (must exist in your account)
  skipCache   boolean   false     Bypass cache for this request
  cacheTtl    number    -         Cache TTL in seconds

That’s it. The same env.AI binding, same model call, but now your requests flow through the gateway with caching, logging, and analytics.

Universal Endpoint

For non-Workers AI providers, use the Universal Endpoint. This is a URL that proxies requests to any AI provider:

https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}
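The URL pattern above is easy to get subtly wrong in string concatenation, so it can help to centralize it in a tiny helper. This is a sketch; gatewayUrl and the Provider union are illustrative names, not part of any SDK:

```typescript
// Providers the Universal Endpoint supports, as a TypeScript union
// so a typo in the provider segment fails at compile time.
type Provider =
  | "openai"
  | "anthropic"
  | "azure-openai"
  | "google-ai-studio"
  | "cohere"
  | "groq"
  | "huggingface"
  | "mistral"
  | "perplexity-ai"
  | "replicate";

// Build a Universal Endpoint URL following the pattern:
// https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}
function gatewayUrl(
  accountId: string,
  gatewayId: string,
  provider: Provider,
  path = ""
): string {
  const base = `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${provider}`;
  return path ? `${base}/${path}` : base;
}
```

In a Worker you would call it as, for example, `gatewayUrl(c.env.ACCOUNT_ID, "production", "openai", "chat/completions")`.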

OpenAI Through the Gateway

Replace the OpenAI base URL with the gateway URL:

app.post("/chat/openai", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const response = await fetch(
    `https://gateway.ai.cloudflare.com/v1/${c.env.ACCOUNT_ID}/production/openai/chat/completions`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${c.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
      }),
    }
  );

  return new Response(response.body, {
    headers: { "Content-Type": "application/json" },
  });
});

Supported providers via Universal Endpoint: openai, anthropic, azure-openai, google-ai-studio, cohere, groq, huggingface, mistral, perplexity-ai, replicate.

Caching

AI Gateway caches responses by matching the full request (provider, model, messages/prompt). Identical requests return the cached response without hitting the provider.

When to use caching:

  • FAQ bots: Same questions get asked repeatedly
  • Autocomplete: Identical partial prompts
  • Batch processing: Same transformation applied to many items

When to skip cache:

  • Conversations with history: Each message adds new context
  • Real-time data: Responses depend on current state
  • Creative generation: You want variety

// Force fresh response
const response = await c.env.AI.run(
  "@cf/meta/llama-3.1-8b-instruct",
  { messages: [{ role: "user", content: prompt }] },
  { gateway: { id: "production", skipCache: true } }
);

Rate Limiting

Configure rate limits in the dashboard under your gateway settings. You can limit by:

  • Requests per minute/hour: Global or per-user
  • Tokens per minute: Control cost
  • Concurrent requests: Prevent overload

When a limit is hit, the gateway returns a 429 Too Many Requests response.
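On the client side, a 429 is usually worth retrying with backoff rather than surfacing to the user. A minimal sketch, assuming a hypothetical retryFetch helper (not part of any SDK) that honors a Retry-After header when the gateway sends one:

```typescript
// Retry a request when the response is 429 Too Many Requests.
// Uses the Retry-After header if present, otherwise exponential backoff.
async function retryFetch(
  doFetch: () => Promise<Response>,
  maxAttempts = 3
): Promise<Response> {
  for (let attempt = 1; ; attempt++) {
    const res = await doFetch();
    if (res.status !== 429 || attempt === maxAttempts) return res;
    const retryAfter = Number(res.headers.get("Retry-After"));
    const delayMs =
      Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : 2 ** attempt * 250; // 500ms, 1s, 2s, ...
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

You would wrap your gateway fetch in it: `await retryFetch(() => fetch(url, init))`.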

Provider Fallback

Use the env.AI.gateway() method with Universal Endpoint request objects. Passing an array of requests defines a fallback chain: if the first provider fails, the gateway tries the next one:

app.post("/chat/reliable", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  // Use the gateway's universal provider for fallback
  // Pass an array of requests; the gateway tries them in order
  const response = await c.env.AI.gateway("production").run([
    {
      provider: "compat",
      endpoint: "chat/completions",
      headers: { authorization: `Bearer ${c.env.OPENAI_API_KEY}` },
      query: {
        model: "openai/gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
      },
    },
    // Fallback step: only tried if the first request fails
    {
      provider: "compat",
      endpoint: "chat/completions",
      headers: { authorization: `Bearer ${c.env.OPENAI_API_KEY}` },
      query: {
        model: "openai/gpt-4o",
        messages: [{ role: "user", content: prompt }],
      },
    },
  ]);

  return response;
});

The gateway tries each step in the chain in order and returns the first successful response, so fallback happens in a single round trip from your Worker.
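When you need the same behavior in your own Worker code (for example, falling back across providers the gateway does not proxy), the logic is a simple try-in-order loop. This is a client-side sketch; firstSuccessful is an illustrative helper name:

```typescript
// Try a list of async attempts in order, returning the first success.
// Mirrors what the gateway's fallback chain does server-side.
async function firstSuccessful<T>(
  attempts: Array<() => Promise<T>>
): Promise<T> {
  let lastError: unknown;
  for (const attempt of attempts) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err; // remember the failure, move to the next provider
    }
  }
  throw lastError; // every provider failed; surface the last error
}
```

Each attempt would be a closure that calls one provider, e.g. `firstSuccessful([callWorkersAI, callOpenAI])`.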

Logging and Analytics

Every request through the gateway is logged. The dashboard shows:

  • Request log: Prompt, response, tokens, latency, cache hit/miss
  • Analytics: Total requests, cost estimates, error rates, cache hit ratio
  • Model usage: Breakdown by model and provider

This is valuable for:

  • Cost tracking: See exactly how much you’re spending per model
  • Debugging: Inspect individual prompts and responses
  • Optimization: Identify which requests could benefit from caching
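Finding cache candidates can be as simple as counting repeated prompts in your request log. The log entry shape below is an assumption for illustration, not the gateway's actual export format:

```typescript
// Minimal log entry shape (assumed for this sketch).
interface LogEntry {
  prompt: string;
  cached: boolean; // was this request served from cache?
}

// Prompts seen more than once that missed the cache more than once are
// candidates for caching (or a longer cacheTtl).
function cacheCandidates(logs: LogEntry[]): string[] {
  const counts = new Map<string, { total: number; misses: number }>();
  for (const { prompt, cached } of logs) {
    const entry = counts.get(prompt) ?? { total: 0, misses: 0 };
    entry.total++;
    if (!cached) entry.misses++;
    counts.set(prompt, entry);
  }
  return [...counts.entries()]
    .filter(([, c]) => c.total > 1 && c.misses > 1)
    .map(([prompt]) => prompt);
}
```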

Full Example: Cached Chat with Fallback

import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

app.post("/chat", async (c) => {
  const { prompt, skipCache } = await c.req.json<{
    prompt: string;
    skipCache?: boolean;
  }>();

  try {
    // Primary: Workers AI through the gateway
    const response = await c.env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      {
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: prompt },
        ],
      },
      {
        gateway: {
          id: "production",
          skipCache: skipCache ?? false,
          cacheTtl: 1800,
        },
      }
    );

    return c.json({ source: "workers-ai", ...response });
  } catch (err) {
    // Fallback: OpenAI directly (also through gateway for logging)
    const fallback = await fetch(
      `https://gateway.ai.cloudflare.com/v1/${c.env.ACCOUNT_ID}/production/openai/chat/completions`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${c.env.OPENAI_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4o-mini",
          messages: [{ role: "user", content: prompt }],
        }),
      }
    );

    const data = await fallback.json();
    return c.json({ source: "openai-fallback", ...data });
  }
});

export default app;

wrangler.jsonc for this example:

{
  "name": "ai-gateway-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  },
  "vars": {
    "ACCOUNT_ID": "your-account-id"
  }
}

Store OPENAI_API_KEY as a secret:

npx wrangler secret put OPENAI_API_KEY

Gotcha: The gateway caches based on the exact request body. If you include timestamps or random IDs in your prompt, every request is a cache miss. Keep prompts deterministic for cache hits.
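One way to keep requests cache-friendly is to build messages from stable inputs only, normalizing anything that varies cosmetically. A sketch, where buildMessages is a made-up helper (volatile values like timestamps or request IDs would be kept out of the body entirely):

```typescript
// Build a deterministic messages array: identical questions always
// produce byte-identical request bodies, so the gateway can cache them.
function buildMessages(question: string) {
  return [
    { role: "system", content: "You are a helpful assistant." },
    // Normalize whitespace so "hours? " and "hours?" hit the same entry
    { role: "user", content: question.trim() },
  ];
}
```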

What’s Next

  • Vectorize RAG - Build a search pipeline that uses the gateway for LLM calls
  • Agents SDK - Stateful agents with AI Gateway for reliable inference