Model Catalog
Workers AI runs models on Cloudflare’s GPU infrastructure. You call env.AI.run() with a model name and get results. No provisioning, no deployment, no GPUs to manage. This page covers the key models, their specs, and pricing.
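Before any of the examples below will work, the Worker needs the AI binding declared in its configuration. A minimal sketch (the binding name AI is the convention used throughout this page, but any name works):

```toml
# wrangler.toml - enable the Workers AI binding
[ai]
binding = "AI"
```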
Text Generation Models
| Model | Context Window | Speed | Best For |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | 128K tokens | Medium | General-purpose chat, tool use, reasoning |
| @cf/meta/llama-3.1-70b-instruct | 128K tokens | Slow | Complex reasoning, higher quality |
| @cf/mistral/mistral-7b-instruct-v0.1 | ~3K tokens | Fast | Low-latency, simpler tasks |
| @hf/google/gemma-7b-it | 8K tokens | Fast | Lightweight chat, short responses |
| @cf/meta/llama-3-8b-instruct | 8K tokens | Medium | Previous gen, still capable |
Note: Mistral v0.1 has a ~3K token context window (2,824 tokens per Cloudflare docs). The 32K context window applies to Mistral v0.2, not v0.1.
Recommendation: Start with Llama 3.1 8B Instruct. It has the best balance of quality, speed, and context window. Move to 70B only if quality is clearly insufficient for your use case.
// Default choice for most use cases
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: prompt }],
});

// Lower latency, simpler tasks
const fast = await env.AI.run("@cf/mistral/mistral-7b-instruct-v0.1", {
  messages: [{ role: "user", content: prompt }],
});
Gotcha: Token limits on Workers AI may differ from the model’s native limits. Llama 3.1 supports 128K context natively, but Workers AI may impose lower output token limits. Check the model card in the catalog for exact limits.
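Because of this, it's safer to pass max_tokens explicitly than to rely on defaults. A small sketch that clamps requests to a per-model output cap; the limit values below are illustrative placeholders, not figures from the catalog:

```typescript
// Illustrative per-model output caps - check each model card for real values
const OUTPUT_LIMITS: Record<string, number> = {
  "@cf/meta/llama-3.1-8b-instruct": 2048,
  "@cf/mistral/mistral-7b-instruct-v0.1": 1024,
};

// Clamp a requested output length to the model's cap (conservative 256 when unknown)
function clampMaxTokens(model: string, requested: number): number {
  return Math.min(requested, OUTPUT_LIMITS[model] ?? 256);
}

// Usage inside a Worker:
// const response = await env.AI.run(model, {
//   messages: [{ role: "user", content: prompt }],
//   max_tokens: clampMaxTokens(model, 4096),
// });
```

Centralizing the caps this way means updating one table, not every call site, when a model card changes.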
Embedding Models
| Model | Dimensions | Max Tokens | Speed | Quality |
|---|---|---|---|---|
| @cf/baai/bge-small-en-v1.5 | 384 | 512 | Fast | Good |
| @cf/baai/bge-base-en-v1.5 | 768 | 512 | Medium | Better |
| @cf/baai/bge-large-en-v1.5 | 1024 | 512 | Slow | Best |
Embedding models convert text into numeric vectors. Use them with Vectorize for similarity search and RAG.
Recommendation: BGE base is the production default. Use BGE small for prototyping or when latency matters more than precision. BGE large is for maximum retrieval accuracy.
// Embed a batch of texts
const result = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["first document", "second document", "third document"],
});
// result.data = [[0.012, -0.034, ...], [...], [...]]
// result.shape = [3, 768]
Gotcha: All BGE models have a 512-token input limit. Text beyond this gets truncated silently. If your chunks regularly exceed 512 tokens, embedding quality degrades without any error.
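One cheap defense is a rough token estimate before embedding. The four-characters-per-token rule below is a heuristic for English text, not BGE's actual tokenizer, so leave some headroom:

```typescript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic, not the BGE tokenizer - treat it as approximate.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Return the chunks likely to be silently truncated at the 512-token limit
function findOversizedChunks(chunks: string[], limit = 512): string[] {
  return chunks.filter((chunk) => estimateTokens(chunk) > limit);
}
```

Run this over your chunker's output during ingest; oversized chunks should be re-split, not embedded as-is.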
Image Generation Models
| Model | Speed | Output | Notes |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | Fast | PNG (base64) | Good quality, fast generation |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 (Beta) | Medium | PNG (base64) | SDXL, 1024x1024 default |
const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", {
  prompt: "a mountain landscape at sunset, photorealistic",
});
// image.image is a base64-encoded PNG string
Image generation consumes significantly more neurons than text generation. Use it sparingly in high-traffic applications.
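To serve the result as an actual image rather than a base64 string, decode it back to bytes first. A sketch, assuming the image field shown in the FLUX example above (atob is available in the Workers runtime):

```typescript
// Decode a base64 string into raw bytes
function base64ToBytes(b64: string): Uint8Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Usage inside a Worker handler:
// const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", { prompt });
// return new Response(base64ToBytes(image.image), {
//   headers: { "Content-Type": "image/png" },
// });
```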
Speech-to-Text
| Model | Languages | Speed | Notes |
|---|---|---|---|
| @cf/openai/whisper | 100+ | Medium | Automatic language detection |
app.post("/transcribe", async (c) => {
  const audioData = await c.req.arrayBuffer();
  const result = await c.env.AI.run("@cf/openai/whisper", {
    audio: [...new Uint8Array(audioData)],
  });
  return c.json({
    text: result.text,
    language: result.detected_language,
    word_count: result.word_count,
  });
});
Whisper accepts raw audio bytes. Supported formats include WAV, MP3, FLAC, and OGG.
Image Classification
| Model | Speed | Notes |
|---|---|---|
| @cf/microsoft/resnet-50 | Fast | ImageNet categories |
const result = await env.AI.run("@cf/microsoft/resnet-50", {
  image: [...new Uint8Array(imageData)],
});
// result = [{ label: "golden retriever", score: 0.95 }, ...]
Translation
| Model | Language Pairs | Speed |
|---|---|---|
| @cf/meta/m2m100-1.2b | 100+ pairs | Medium |
const result = await env.AI.run("@cf/meta/m2m100-1.2b", {
  text: "Hello, how are you?",
  source_lang: "english",
  target_lang: "french",
});
// result.translated_text = "Bonjour, comment allez-vous?"
Pricing: Neurons
Workers AI uses “neurons” as its billing unit. A neuron is an abstract unit of AI compute - different models consume different numbers of neurons per request.
Free Tier
- 10,000 neurons per day (resets at midnight UTC)
- Enough for light prototyping and testing
- No credit card required
Paid Tier
- $0.011 per 1,000 neurons (beyond the free tier)
- Usage tracked in the Cloudflare dashboard under Workers AI
Neurons per Request
The neuron cost depends on the model and the input/output size:
| Task | Typical Neurons | Example |
|---|---|---|
| Text generation (short) | 50-200 | Quick Q&A with Llama 8B |
| Text generation (long) | 200-1000+ | Multi-paragraph response |
| Embedding (single text) | 5-10 | One document chunk |
| Embedding (batch of 50) | 50-100 | Batch ingest |
| Image generation | 1000-5000 | One FLUX image |
| Speech-to-text | 100-500 | 30-second audio clip |
These are estimates. Actual neuron consumption depends on input length, output length, and model choice. Monitor usage in the dashboard.
Cost Estimation
For a RAG application handling 1,000 queries per day:
Per query:
- Embed the question: ~10 neurons
- Generate the answer: ~150 neurons
- Total: ~160 neurons per query

Daily usage: 1,000 queries * 160 neurons = 160,000 neurons
Free tier: -10,000 neurons
Billed: 150,000 neurons
Cost: 150 * $0.011 = $1.65/day, or ~$50/month

For the initial ingest of 10,000 document chunks:
- Embed 10,000 chunks (batched): ~2,000 neurons
- Cost: 2 * $0.011 = $0.022 (one-time)
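The arithmetic above generalizes to a small helper; the price and free-tier constants mirror the figures in this section:

```typescript
// Estimate daily Workers AI cost in USD:
// $0.011 per 1,000 neurons, beyond the 10,000-neuron daily free tier
function dailyCostUSD(
  neuronsPerDay: number,
  freePerDay = 10_000,
  pricePer1kNeurons = 0.011,
): number {
  const billed = Math.max(0, neuronsPerDay - freePerDay);
  return (billed / 1000) * pricePer1kNeurons;
}

// 1,000 RAG queries/day at ~160 neurons each: ~$1.65/day
const daily = dailyCostUSD(1000 * 160);
```

Treat the output as a rough budget figure: actual neuron consumption varies with input and output length, so verify against the dashboard.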
Gotcha: Image generation is the most neuron-intensive task. A single FLUX image can cost as much as 50 text generation calls. Budget accordingly if you’re generating images at scale.
Model Availability
The model catalog changes over time:
- New models are added regularly
- Some models are in beta and may be removed
- Beta models may have different rate limits
- Cloudflare occasionally deprecates older model versions
Always check the official model catalog for the current list. Don't hard-code model names throughout your application; keep them in environment variables or config files so you can swap models without redeploying.
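A minimal sketch of that pattern, assuming a TEXT_MODEL var (a hypothetical name, not a built-in) defined in your Worker's configuration:

```typescript
// wrangler.toml (illustrative):
//   [vars]
//   TEXT_MODEL = "@cf/meta/llama-3.1-8b-instruct"

// Pick the configured model, falling back to a known-good default
function pickModel(env: { TEXT_MODEL?: string }): string {
  return env.TEXT_MODEL ?? "@cf/meta/llama-3.1-8b-instruct";
}

// Usage inside a Worker:
// const response = await env.AI.run(pickModel(env), { messages });
```

Swapping models then becomes a config change rather than a code change.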
What’s Next
- Workers AI quickstart - Get started with inference
- RAG Patterns - Build search with embeddings
- Gotchas - Common pitfalls with Workers AI models