Workers AI
Call AI models directly from a Worker. Workers AI gives you serverless GPU inference for text generation, embeddings, image generation, and more. One binding, one function call, no infrastructure.
Prerequisites: AI Landscape, First Worker
Setup
Add the AI binding to wrangler.jsonc:
{
  "name": "ai-demo",
  "main": "src/index.ts",
  "compatibility_date": "2025-01-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": {
    "binding": "AI"
  }
}
Run npx wrangler types to generate the Env interface with the AI: Ai binding.
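The generated file declares the binding on the Env interface. A sketch of its shape (an assumption for illustration; run the command to get the real generated file, which also pulls in the Ai type from Cloudflare's workers-types):

```typescript
// Sketch of the generated Env interface (actual output may include more bindings)
interface Env {
  AI: Ai;
}
```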
Text Generation
The most common use case. Call an LLM with a prompt and get a response:
import { Hono } from "hono";

const app = new Hono<{ Bindings: Env }>();

app.post("/chat", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();
  const response = await c.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: prompt },
    ],
  });
  return c.json(response);
});

export default app;
The response object:
{
  "response": "Edge computing runs code closer to users...",
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 87,
    "total_tokens": 111
  }
}
Two input formats:
- messages: array of { role, content } objects (chat-style, recommended)
- prompt: a single string (simpler, but no system prompt or conversation history)
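As a minimal illustration of the two shapes (plain objects, independent of any binding):

```typescript
// Chat-style input: roles allow a system prompt and multi-turn history
const chatInput = {
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain edge computing." },
  ],
};

// Prompt-style input: one string, no roles, no history
const promptInput = { prompt: "Explain edge computing." };
```

Either object can be passed as the second argument to AI.run() for text models; the chat form is the one to reach for as soon as you need a system prompt.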
Streaming Responses
Pass stream: true to get a Server-Sent Events stream. This lets you show the response as it generates, rather than waiting for the full completion:
app.post("/chat/stream", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();
  const stream = await c.env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct",
    {
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }
  );
  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
});
The client receives SSE events:
data: {"response":"Edge "}
data: {"response":"computing "}
data: {"response":"runs "}
data: [DONE]
To consume from the browser:
const response = await fetch("/chat/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: "Explain edge computing" }),
});

const reader = response.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  const text = decoder.decode(value);
  // Parse SSE lines, append to UI
  console.log(text);
}
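The "parse SSE lines" step can be sketched as a small helper, assuming each event's data field carries a response chunk in the format shown above:

```typescript
// Extract text chunks from a buffer of SSE lines like: data: {"response":"Edge "}
function parseSSEChunk(chunk: string): string[] {
  const out: string[] = [];
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") continue; // end-of-stream marker, not JSON
    try {
      const parsed = JSON.parse(payload);
      if (typeof parsed.response === "string") out.push(parsed.response);
    } catch {
      // A JSON object split across two reads lands here; production code
      // should buffer incomplete lines instead of dropping them.
    }
  }
  return out;
}
```

In the read loop above, you would call parseSSEChunk(text) and append the returned strings to the UI.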
Embeddings
Generate vector embeddings for text. Embeddings convert text into numeric vectors that capture semantic meaning, so you can compare documents by similarity.
app.post("/embed", async (c) => {
  const { texts } = await c.req.json<{ texts: string[] }>();
  const result = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: texts,
  });
  return c.json({
    shape: result.shape, // e.g. [3, 768]: 3 input texts, 768 dimensions each
    vectors: result.data, // Array of float arrays, one per input text
  });
});
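To compare the returned vectors, the standard measure is cosine similarity (this helper is not part of the Workers AI API; it operates on the plain number arrays in result.data):

```typescript
// Cosine similarity between two embedding vectors:
// 1 = same direction (semantically similar), 0 = orthogonal (unrelated)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

This is fine for comparing a handful of vectors in a request; for searching across many stored embeddings, use Vectorize instead of scanning in the Worker.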
Common embedding models:
| Model | Dimensions | Use Case |
|---|---|---|
| @cf/baai/bge-base-en-v1.5 | 768 | General English text |
| @cf/baai/bge-small-en-v1.5 | 384 | Faster, lower memory |
| @cf/baai/bge-large-en-v1.5 | 1024 | Higher quality, slower |
Embeddings pair with Vectorize for similarity search and RAG.
Image Generation
Generate images from text prompts:
app.post("/image", async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();
  const result = await c.env.AI.run(
    "@cf/black-forest-labs/flux-1-schnell",
    { prompt }
  );
  // result.image is a base64-encoded PNG
  const imageBytes = Uint8Array.from(atob(result.image), (c) =>
    c.charCodeAt(0)
  );
  return new Response(imageBytes, {
    headers: { "Content-Type": "image/png" },
  });
});
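The decode step can be factored into a reusable helper; it works anywhere atob is available (Workers and modern Node alike):

```typescript
// Decode a base64 string into raw bytes suitable for a binary Response body
function base64ToBytes(b64: string): Uint8Array {
  const binary = atob(b64); // base64 -> binary string, one char per byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}
```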
Image models:
| Model | Speed | Quality |
|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | Fast | Good |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | Medium | Good |
Using the Vercel AI SDK
The workers-ai-provider package integrates Workers AI with the Vercel AI SDK, giving you streamText, generateText, and other high-level helpers:
import { createWorkersAI } from "workers-ai-provider";
import { streamText } from "ai";

app.post("/chat/ai-sdk", async (c) => {
  const workersai = createWorkersAI({ binding: c.env.AI });
  const result = streamText({
    model: workersai("@cf/meta/llama-3.1-8b-instruct"),
    prompt: "Explain edge computing in one paragraph.",
  });
  return result.toTextStreamResponse({
    headers: {
      "Content-Type": "text/x-unknown",
      "content-encoding": "identity",
      "transfer-encoding": "chunked",
    },
  });
});
Install with npm install workers-ai-provider ai.
Key Models
| Category | Model | Notes |
|---|---|---|
| Text (large) | @cf/meta/llama-3.1-8b-instruct | Best general-purpose |
| Text (fast) | @cf/mistral/mistral-7b-instruct-v0.1 | Lower latency |
| Embeddings | @cf/baai/bge-base-en-v1.5 | 768 dimensions |
| Image | @cf/black-forest-labs/flux-1-schnell | Fast generation |
| Speech-to-text | @cf/openai/whisper | Audio transcription |
| Translation | @cf/meta/m2m100-1.2b | 100+ language pairs |
The full catalog is at developers.cloudflare.com/workers-ai/models. Models are added regularly.
Gotcha: Token limits vary significantly by model. Llama 3.1 8B supports 128K context input but has lower output limits on Workers AI. Always check the model card in the catalog for exact limits.
Gotcha: Workers AI has a free tier of 10,000 neurons per day. A “neuron” is Cloudflare’s unit of AI compute; different models consume neurons at different rates. Monitor usage in the dashboard.
What’s Next
- AI Gateway - Add caching and fallback to your AI calls
- Vectorize RAG - Store embeddings and build a search pipeline
- Agents SDK - Build a stateful agent with tools
- Model Catalog - Full model specs, speed tiers, and pricing