Model Catalog

Workers AI runs models on Cloudflare’s GPU infrastructure. You call env.AI.run() with a model name and get results. No provisioning, no deployment, no GPUs to manage. This page covers the key models, their specs, and pricing.

Text Generation Models

| Model | Context Window | Speed | Best For |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | 128K tokens | Medium | General-purpose chat, tool use, reasoning |
| @cf/meta/llama-3.1-70b-instruct | 128K tokens | Slow | Complex reasoning, higher quality |
| @cf/mistral/mistral-7b-instruct-v0.1 | ~3K tokens | Fast | Low-latency, simpler tasks |
| @hf/google/gemma-7b-it | 8K tokens | Fast | Lightweight chat, short responses |
| @cf/meta/llama-3-8b-instruct | 8K tokens | Medium | Previous gen, still capable |

Note: Mistral v0.1 has a ~3K token context window (2,824 tokens per Cloudflare docs). The 32K context window applies to Mistral v0.2, not v0.1.

Recommendation: Start with Llama 3.1 8B Instruct. It has the best balance of quality, speed, and context window. Move to 70B only if quality is clearly insufficient for your use case.

// Default choice for most use cases
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: prompt }],
});

// Lower latency, simpler tasks
const fast = await env.AI.run("@cf/mistral/mistral-7b-instruct-v0.1", {
  messages: [{ role: "user", content: prompt }],
});

Gotcha: Token limits on Workers AI may differ from the model’s native limits. Llama 3.1 supports 128K context natively, but Workers AI may impose lower output token limits. Check the model card in the catalog for exact limits.
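When a long conversation approaches the context limit, older turns have to be dropped before the request is sent. A minimal sketch of that idea — the chars/4 token estimate is a rough heuristic, and estimateTokens/trimMessages are illustrative helpers, not part of the Workers AI API:

```javascript
// Rough token estimate: ~4 characters per token (heuristic, not a real tokenizer)
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Drop the oldest messages until the conversation fits the budget.
// Always keeps the system message (index 0) if present.
function trimMessages(messages, maxTokens) {
  const system = messages[0]?.role === "system" ? [messages[0]] : [];
  const rest = messages.slice(system.length);
  const kept = [];
  let used = system.reduce((n, m) => n + estimateTokens(m.content), 0);
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(rest[i]);
    used += cost;
  }
  return [...system, ...kept];
}
```

Pass the trimmed array as `messages` to env.AI.run() as usual. For exact limits, defer to the model card rather than the heuristic.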

Embedding Models

| Model | Dimensions | Max Tokens | Speed | Quality |
|---|---|---|---|---|
| @cf/baai/bge-small-en-v1.5 | 384 | 512 | Fast | Good |
| @cf/baai/bge-base-en-v1.5 | 768 | 512 | Medium | Better |
| @cf/baai/bge-large-en-v1.5 | 1024 | 512 | Slow | Best |

Embedding models convert text into numeric vectors. Use them with Vectorize for similarity search and RAG.

Recommendation: BGE base is the production default. Use BGE small for prototyping or when latency matters more than precision. BGE large is for maximum retrieval accuracy.

// Embed a batch of texts
const result = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["first document", "second document", "third document"],
});
// result.data = [[0.012, -0.034, ...], [...], [...]]
// result.shape = [3, 768]

Gotcha: All BGE models have a 512-token input limit. Text beyond this gets truncated silently. If your chunks regularly exceed 512 tokens, embedding quality degrades without any error.
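To avoid silent truncation, you can bound chunk size before embedding. A hedged sketch using a rough chars/4 token heuristic; splitToTokenLimit is an illustrative helper, not part of the Workers AI API:

```javascript
// ~4 characters per token is a rough heuristic for English text
const MAX_EMBED_TOKENS = 512;
const MAX_EMBED_CHARS = MAX_EMBED_TOKENS * 4;

// Split text into pieces that should each fit under the 512-token input limit
function splitToTokenLimit(text, maxChars = MAX_EMBED_CHARS) {
  const pieces = [];
  for (let i = 0; i < text.length; i += maxChars) {
    pieces.push(text.slice(i, i + maxChars));
  }
  return pieces;
}
```

In practice you would split on sentence or paragraph boundaries rather than raw character offsets, but the guard is the same: never send a chunk you haven't bounded.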

Image Generation Models

| Model | Speed | Output | Notes |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | Fast | PNG (base64) | Good quality, fast generation |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 (Beta) | Medium | PNG (base64) | SDXL, 1024x1024 default |

const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", {
  prompt: "a mountain landscape at sunset, photorealistic",
});
// image.image is a base64-encoded PNG string
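To serve the image to a browser, the base64 payload needs to be decoded into raw bytes first. A sketch under that assumption — base64ToBytes is an illustrative helper (atob is available in Workers and modern Node):

```javascript
// Decode a base64 string into raw bytes
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// In a Worker handler, return the decoded PNG directly:
// return new Response(base64ToBytes(image.image), {
//   headers: { "Content-Type": "image/png" },
// });
```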

Image generation consumes significantly more neurons than text generation. Use it sparingly in high-traffic applications.

Speech-to-Text

| Model | Languages | Speed | Notes |
|---|---|---|---|
| @cf/openai/whisper | 100+ | Medium | Automatic language detection |

// Hono route handler; assumes `app` is a Hono instance with an AI binding
app.post("/transcribe", async (c) => {
  const audioData = await c.req.arrayBuffer();

  const result = await c.env.AI.run("@cf/openai/whisper", {
    audio: [...new Uint8Array(audioData)],
  });

  return c.json({
    language: result.detected_language,
    word_count: result.word_count,
  });
});

Whisper accepts raw audio bytes. Supported formats include WAV, MP3, FLAC, and OGG.

Image Classification

| Model | Speed | Notes |
|---|---|---|
| @cf/microsoft/resnet-50 | Fast | ImageNet categories |

const result = await env.AI.run("@cf/microsoft/resnet-50", {
  image: [...new Uint8Array(imageData)],
});
// result = [{ label: "golden retriever", score: 0.95 }, ...]
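Often you only need the single most likely label from that list. A small sketch; topLabel is an illustrative helper, not part of the API:

```javascript
// Pick the prediction with the highest confidence score
function topLabel(predictions) {
  return predictions.reduce((best, p) => (p.score > best.score ? p : best));
}
```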

Translation

| Model | Language Pairs | Speed |
|---|---|---|
| @cf/meta/m2m100-1.2b | 100+ pairs | Medium |

const result = await env.AI.run("@cf/meta/m2m100-1.2b", {
  text: "Hello, how are you?",
  source_lang: "english",
  target_lang: "french",
});
// result.translated_text = "Bonjour, comment allez-vous?"

Pricing: Neurons

Workers AI uses “neurons” as its billing unit. A neuron is an abstract unit of AI compute - different models consume different numbers of neurons per request.

Free Tier

  • 10,000 neurons per day (resets at midnight UTC)
  • Enough for light prototyping and testing
  • No credit card required
  • Beyond the free tier, usage costs $0.011 per 1,000 neurons
  • Usage is tracked in the Cloudflare dashboard under Workers AI

Neurons per Request

The neuron cost depends on the model and the input/output size:

| Task | Typical Neurons | Example |
|---|---|---|
| Text generation (short) | 50-200 | Quick Q&A with Llama 8B |
| Text generation (long) | 200-1000+ | Multi-paragraph response |
| Embedding (single text) | 5-10 | One document chunk |
| Embedding (batch of 50) | 50-100 | Batch ingest |
| Image generation | 1000-5000 | One FLUX image |
| Speech-to-text | 100-500 | 30-second audio clip |

These are estimates. Actual neuron consumption depends on input length, output length, and model choice. Monitor usage in the dashboard.
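Using the free-tier allowance and per-neuron price above, a rough cost estimate can be scripted. The estimator itself is illustrative, and a 30-day month is assumed:

```javascript
const FREE_NEURONS_PER_DAY = 10_000;
const PRICE_PER_1K_NEURONS = 0.011; // USD, beyond the free tier

// Estimate daily and ~30-day cost for a steady workload
function estimateCost(neuronsPerRequest, requestsPerDay) {
  const dailyNeurons = neuronsPerRequest * requestsPerDay;
  const billedNeurons = Math.max(0, dailyNeurons - FREE_NEURONS_PER_DAY);
  const dailyCost = (billedNeurons / 1000) * PRICE_PER_1K_NEURONS;
  return { dailyNeurons, billedNeurons, dailyCost, monthlyCost: dailyCost * 30 };
}

// e.g. ~160 neurons/query at 1,000 queries/day, as in the RAG example below
const rag = estimateCost(160, 1000);
```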

Cost Estimation

For a RAG application handling 1,000 queries per day:

Per query:
  - Embed question:     ~10 neurons
  - Generate answer:    ~150 neurons
  Total:               ~160 neurons per query

Daily:  1,000 * 160 = 160,000 neurons
Free:   -10,000
Billed: 150,000 neurons
Cost:   150 * $0.011 = $1.65/day = ~$50/month

For initial ingest of 10,000 document chunks:

  - Embed 10,000 chunks (batched): ~2,000 neurons
  Cost: 2 * $0.011 = $0.022 (one-time; in practice this fits within a single day's free allowance)

Gotcha: Image generation is the most neuron-intensive task. A single FLUX image can cost as much as 50 text generation calls. Budget accordingly if you’re generating images at scale.

Model Availability

The model catalog changes over time:

  • New models are added regularly
  • Some models are in beta and may be removed
  • Beta models may have different rate limits
  • Cloudflare occasionally deprecates older model versions

Always check the official model catalog for the current list. Don't hard-code model names throughout your application code - keep them in environment variables or config files so you can swap models without redeploying.
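One way to keep the model name swappable is to read it from an environment binding with a fallback. A sketch under that assumption — MODEL_TEXT is a hypothetical var name you would set under [vars] in wrangler.toml:

```javascript
// Read the model name from an env binding, falling back to a sensible default.
// (MODEL_TEXT is an assumed variable name, not a built-in binding.)
function resolveModel(env) {
  return env.MODEL_TEXT ?? "@cf/meta/llama-3.1-8b-instruct";
}

// Usage in a Worker:
// const response = await env.AI.run(resolveModel(env), {
//   messages: [{ role: "user", content: prompt }],
// });
```

Changing the model then only requires updating the variable, not redeploying new code.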

What’s Next