Workers AI

Call AI models directly from a Worker. Workers AI gives you serverless GPU inference for text generation, embeddings, image generation, and more. One binding, one function call, no infrastructure.

Prerequisites: AI Landscape, First Worker

Setup

Add the AI binding to wrangler.jsonc:

{
  "name": "ai-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  }
}

Run npx wrangler types to generate the Env interface with the AI: Ai binding.

Text Generation

The most common use case. Call an LLM with a prompt and get a response:

import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

app.post("/chat", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: prompt },
    ],
  });

  return c.json(response);
});

export default app;

The response object:

{
  "response": "Edge computing runs code closer to users...",
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 87,
    "total_tokens": 111
  }
}

Two input formats:

  • messages: Array of { role, content } objects (chat-style, recommended)
  • prompt: Single string (simpler, but no system prompt or history)
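With the messages format, multi-turn conversations are carried by appending prior turns to the array before each call. A minimal sketch of a helper that does this (the `buildMessages` name and the `history` shape are illustrative, not part of the Workers AI API):

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Builds a messages array from a system prompt, prior turns, and the new
// user prompt — the shape expected by env.AI.run's `messages` option.
function buildMessages(
  system: string,
  history: Array<{ user: string; assistant: string }>,
  prompt: string
): Message[] {
  const messages: Message[] = [{ role: "system", content: system }];
  for (const turn of history) {
    messages.push({ role: "user", content: turn.user });
    messages.push({ role: "assistant", content: turn.assistant });
  }
  messages.push({ role: "user", content: prompt });
  return messages;
}
```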

Streaming Responses

Pass stream: true to get a Server-Sent Events stream. This lets you show the response as it generates, rather than waiting for the full completion:

app.post("/chat/stream", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const stream = await c.env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct",
    {
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }
  );

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
});

The client receives SSE events:

data: {"response":"Edge "}
data: {"response":"computing "}
data: {"response":"runs "}
data: [DONE]

To consume from the browser:

const response = await fetch("/chat/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: "Explain edge computing" }),
});

const reader = response.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const text = decoder.decode(value, { stream: true }); // stream: true handles multi-byte chars split across chunks
  // Parse SSE lines, append to UI
  console.log(text);
}
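The "parse SSE lines" step in the loop above can be sketched as a small helper that pulls the text out of each `data:` event. This is plain client-side parsing, not a Workers AI API (a production client would also buffer events split across chunks; the incomplete-JSON branch below just skips them):

```typescript
// Extracts the text fragments from a chunk of SSE lines like
//   data: {"response":"Edge "}
// and stops at the [DONE] sentinel.
function parseSSEChunk(chunk: string): string[] {
  const out: string[] = [];
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") continue;
    try {
      const parsed = JSON.parse(payload) as { response?: string };
      if (parsed.response) out.push(parsed.response);
    } catch {
      // Incomplete JSON (event split across chunks) — skip; a real client buffers it.
    }
  }
  return out;
}
```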

Embeddings

Generate vector embeddings for text. Embeddings convert text into numeric vectors that capture semantic meaning, so you can compare documents by similarity.

app.post("/embed", async (c) => {
  const { texts } = await c.req.json<{ texts: string[] }>();

  const result = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: texts,
  });

  return c.json({
    shape: result.shape,   // e.g. [3, 768]: 3 input texts, 768 dimensions each
    vectors: result.data,  // Array of float arrays, one per input text
  });
});

Common embedding models:

Model                        Dimensions   Use Case
@cf/baai/bge-base-en-v1.5    768          General English text
@cf/baai/bge-small-en-v1.5   384          Faster, lower memory
@cf/baai/bge-large-en-v1.5   1024         Higher quality, slower

Embeddings pair with Vectorize for similarity search and RAG.
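For a quick similarity check without a vector database, cosine similarity over two of the returned float arrays is the usual measure. A minimal sketch in plain TypeScript (no Workers APIs involved):

```typescript
// Cosine similarity between two embedding vectors (e.g. rows of result.data).
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions must match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```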

Image Generation

Generate images from text prompts:

app.post("/image", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const result = await c.env.AI.run(
    "@cf/black-forest-labs/flux-1-schnell",
    { prompt }
  );

  // result.image is a base64-encoded PNG
  const imageBytes = Uint8Array.from(atob(result.image), (c) =>
    c.charCodeAt(0)
  );

  return new Response(imageBytes, {
    headers: { "Content-Type": "image/png" },
  });
});

Image models:

Model                                          Speed    Quality
@cf/black-forest-labs/flux-1-schnell           Fast     Good
@cf/stabilityai/stable-diffusion-xl-base-1.0   Medium   Good

Using the Vercel AI SDK

The workers-ai-provider package integrates Workers AI with the Vercel AI SDK, giving you streamText, generateText, and other high-level helpers:

import { createWorkersAI } from "workers-ai-provider";
import { streamText } from "ai";

app.post("/chat/ai-sdk", async (c) => {
  const workersai = createWorkersAI({ binding: c.env.AI });

  const result = streamText({
    model: workersai("@cf/meta/llama-3.1-8b-instruct"),
    prompt: "Explain edge computing in one paragraph.",
  });

  return result.toTextStreamResponse({
    headers: {
      "Content-Type": "text/x-unknown",
      "content-encoding": "identity",
      "transfer-encoding": "chunked",
    },
  });
});

Install with npm install workers-ai-provider ai.

Key Models

Category         Model                                  Notes
Text (large)     @cf/meta/llama-3.1-8b-instruct         Best general-purpose
Text (fast)      @cf/mistral/mistral-7b-instruct-v0.1   Lower latency
Embeddings       @cf/baai/bge-base-en-v1.5              768 dimensions
Image            @cf/black-forest-labs/flux-1-schnell   Fast generation
Speech-to-text   @cf/openai/whisper                     Audio transcription
Translation      @cf/meta/m2m100-1.2b                   100+ language pairs

The full catalog is at developers.cloudflare.com/workers-ai/models. Models are added regularly.

Gotcha: Token limits vary significantly by model. Llama 3.1 8B supports 128K context input but has lower output limits on Workers AI. Always check the model card in the catalog for exact limits.
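Workers doesn't ship a client-side tokenizer for these models, but a rough pre-flight check can catch obviously oversized inputs before you spend a request. The sketch below assumes the common ~4-characters-per-token rule of thumb for English prose; it is a heuristic, not an exact count, and real tokenizers vary by model:

```typescript
// Heuristic only: roughly 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Checks an input against a model's context window, reserving headroom
// for the completion (reserveForOutput is an illustrative default).
function fitsContext(
  text: string,
  contextTokens: number,
  reserveForOutput = 1024
): boolean {
  return estimateTokens(text) <= contextTokens - reserveForOutput;
}
```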

Gotcha: Workers AI has a free tier of 10,000 neurons per day. A “neuron” is Cloudflare’s unit of AI compute; different models consume neurons at different rates. Monitor usage in the dashboard.

What’s Next