Workers AI

Call AI models directly from a Worker. Workers AI gives you serverless GPU inference for text generation, embeddings, image generation, and more. One binding, one function call, no infrastructure.

Prerequisites: AI Landscape, First Worker

Setup

Add the AI binding to wrangler.jsonc:

{
  "name": "ai-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  }
}

Run npx wrangler types to generate the Env interface with the AI: Ai binding.

Text Generation

The most common use case. Call an LLM with a prompt and get a response:

import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

app.post("/chat", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: prompt },
    ],
  });

  return c.json(response);
});

export default app;

The response object:

{
  "response": "Edge computing runs code closer to users...",
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 87,
    "total_tokens": 111
  }
}

Two input formats:

  • messages: Array of { role, content } objects (chat-style, recommended)
  • prompt: Single string (simpler, but no system prompt or history)
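With the messages format, multi-turn conversations are carried by appending prior turns to the array before each call. A minimal sketch of a helper that does this (the `buildMessages` name and the `history` shape are illustrative, not part of the Workers AI API):

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Builds a messages array from a system prompt, prior turns, and the new
// user prompt — the shape expected by env.AI.run's `messages` option.
function buildMessages(
  system: string,
  history: Array<{ user: string; assistant: string }>,
  prompt: string
): Message[] {
  const messages: Message[] = [{ role: "system", content: system }];
  for (const turn of history) {
    messages.push({ role: "user", content: turn.user });
    messages.push({ role: "assistant", content: turn.assistant });
  }
  messages.push({ role: "user", content: prompt });
  return messages;
}
```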

Streaming Responses

Pass stream: true to get a Server-Sent Events stream. This lets you show the response as it generates, rather than waiting for the full completion:

app.post("/chat/stream", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const stream = await c.env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct",
    {
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }
  );

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
});

The client receives SSE events:

data: {"response":"Edge "}
data: {"response":"computing "}
data: {"response":"runs "}
data: [DONE]

To consume from the browser:

const response = await fetch("/chat/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: "Explain edge computing" }),
});

const reader = response.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const text = decoder.decode(value, { stream: true }); // stream: true handles multi-byte chars split across chunks
  // Parse SSE lines, append to UI
  console.log(text);
}
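The "parse SSE lines" step in the loop above can be sketched as a small helper that pulls the text out of each `data:` event. This is plain client-side parsing, not a Workers AI API (a production client would also buffer events split across chunks; the incomplete-JSON branch below just skips them):

```typescript
// Extracts the text fragments from a chunk of SSE lines like
//   data: {"response":"Edge "}
// and stops at the [DONE] sentinel.
function parseSSEChunk(chunk: string): string[] {
  const out: string[] = [];
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") continue;
    try {
      const parsed = JSON.parse(payload) as { response?: string };
      if (parsed.response) out.push(parsed.response);
    } catch {
      // Incomplete JSON (event split across chunks) — skip; a real client buffers it.
    }
  }
  return out;
}
```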

Embeddings

Generate vector embeddings for text. Embeddings convert text into numeric vectors that capture semantic meaning, so you can compare documents by similarity.

app.post("/embed", async (c) => {
  const { texts } = await c.req.json<{ texts: string[] }>();

  const result = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: texts,
  });

  return c.json({
    shape: result.shape,   // e.g. [3, 768]: 3 input texts, 768 dimensions each
    vectors: result.data,  // Array of float arrays, one per input text
  });
});

Common embedding models:

Model                        Dimensions   Use Case
@cf/baai/bge-base-en-v1.5    768          General English text
@cf/baai/bge-small-en-v1.5   384          Faster, lower memory
@cf/baai/bge-large-en-v1.5   1024         Higher quality, slower

Embeddings pair with Vectorize for similarity search and RAG.
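For a quick similarity check without a vector database, cosine similarity over two of the returned float arrays is the usual measure. A minimal sketch in plain TypeScript (no Workers APIs involved):

```typescript
// Cosine similarity between two embedding vectors (e.g. rows of result.data).
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions must match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```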

Image Generation

Generate images from text prompts:

app.post("/image", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const result = await c.env.AI.run(
    "@cf/black-forest-labs/flux-1-schnell",
    { prompt }
  );

  // result.image is a base64-encoded PNG
  const imageBytes = Uint8Array.from(atob(result.image), (c) =>
    c.charCodeAt(0)
  );

  return new Response(imageBytes, {
    headers: { "Content-Type": "image/png" },
  });
});

Image models:

Model                                          Speed    Quality
@cf/black-forest-labs/flux-1-schnell           Fast     Good
@cf/stabilityai/stable-diffusion-xl-base-1.0   Medium   Good

Using the Vercel AI SDK

The workers-ai-provider package integrates Workers AI with the Vercel AI SDK, giving you streamText, generateText, and other high-level helpers:

import { createWorkersAI } from "workers-ai-provider";
import { streamText } from "ai";

app.post("/chat/ai-sdk", async (c) => {
  const workersai = createWorkersAI({ binding: c.env.AI });

  const result = streamText({
    model: workersai("@cf/meta/llama-3.1-8b-instruct"),
    prompt: "Explain edge computing in one paragraph.",
  });

  return result.toTextStreamResponse({
    headers: {
      "Content-Type": "text/x-unknown",
      "content-encoding": "identity",
      "transfer-encoding": "chunked",
    },
  });
});

Install with npm install workers-ai-provider ai.

Key Models

Category         Model                                  Notes
Text (large)     @cf/meta/llama-3.1-8b-instruct         Best general-purpose
Text (fast)      @cf/mistral/mistral-7b-instruct-v0.1   Lower latency
Embeddings       @cf/baai/bge-base-en-v1.5              768 dimensions
Image            @cf/black-forest-labs/flux-1-schnell   Fast generation
Speech-to-text   @cf/openai/whisper                     Audio transcription
Translation      @cf/meta/m2m100-1.2b                   100+ language pairs

The full catalog is at developers.cloudflare.com/workers-ai/models. Models are added regularly.

Gotcha: Token limits vary significantly by model. Llama 3.1 8B supports 128K context input but has lower output limits on Workers AI. Always check the model card in the catalog for exact limits.
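Workers doesn't ship a client-side tokenizer for these models, but a rough pre-flight check can catch obviously oversized inputs before you spend a request. The sketch below assumes the common ~4-characters-per-token rule of thumb for English prose; it is a heuristic, not an exact count, and real tokenizers vary by model:

```typescript
// Heuristic only: roughly 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Checks an input against a model's context window, reserving headroom
// for the completion (reserveForOutput is an illustrative default).
function fitsContext(
  text: string,
  contextTokens: number,
  reserveForOutput = 1024
): boolean {
  return estimateTokens(text) <= contextTokens - reserveForOutput;
}
```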

Gotcha: Workers AI has a free tier of 10,000 neurons per day. A “neuron” is Cloudflare’s unit of AI compute; different models consume neurons at different rates. Monitor usage in the dashboard.

What’s Next