Model Catalog
Workers AI runs models on Cloudflare’s GPU infrastructure. You call env.AI.run() with a model name and get results. No provisioning, no deployment, no GPUs to manage. This page covers the key models, their specs, and pricing.
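Before any of the examples below will work, the Worker needs the AI binding declared in its configuration. A minimal sketch (the binding name AI is the convention used throughout this page, but any name works):

```toml
# wrangler.toml - enable the Workers AI binding
[ai]
binding = "AI"
```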
Text Generation Models
| Model | Context Window | Speed | Best For |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | 128K tokens | Medium | General-purpose chat, tool use, reasoning |
| @cf/meta/llama-3.1-70b-instruct | 128K tokens | Slow | Complex reasoning, higher quality |
| @cf/mistral/mistral-7b-instruct-v0.1 | ~3K tokens | Fast | Low-latency, simpler tasks |
| @hf/google/gemma-7b-it | 8K tokens | Fast | Lightweight chat, short responses |
| @cf/meta/llama-3-8b-instruct | 8K tokens | Medium | Previous gen, still capable |
Note: Mistral v0.1 has a ~3K token context window (2,824 tokens per Cloudflare docs). The 32K context window applies to Mistral v0.2, not v0.1.
Recommendation: Start with Llama 3.1 8B Instruct. It has the best balance of quality, speed, and context window. Move to 70B only if quality is clearly insufficient for your use case.
// Default choice for most use cases
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: prompt }],
});

// Lower latency, simpler tasks
const fast = await env.AI.run("@cf/mistral/mistral-7b-instruct-v0.1", {
  messages: [{ role: "user", content: prompt }],
});
Gotcha: Token limits on Workers AI may differ from the model’s native limits. Llama 3.1 supports 128K context natively, but Workers AI may impose lower output token limits. Check the model card in the catalog for exact limits.
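Because of this, it's safer to pass max_tokens explicitly than to rely on defaults. A small sketch that clamps requests to a per-model output cap; the limit values below are illustrative placeholders, not figures from the catalog:

```typescript
// Illustrative per-model output caps - check each model card for real values
const OUTPUT_LIMITS: Record<string, number> = {
  "@cf/meta/llama-3.1-8b-instruct": 2048,
  "@cf/mistral/mistral-7b-instruct-v0.1": 1024,
};

// Clamp a requested output length to the model's cap (conservative 256 when unknown)
function clampMaxTokens(model: string, requested: number): number {
  return Math.min(requested, OUTPUT_LIMITS[model] ?? 256);
}

// Usage inside a Worker:
// const response = await env.AI.run(model, {
//   messages: [{ role: "user", content: prompt }],
//   max_tokens: clampMaxTokens(model, 4096),
// });
```

Centralizing the caps this way means updating one table, not every call site, when a model card changes.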
Embedding Models
| Model | Dimensions | Max Tokens | Speed | Quality |
|---|---|---|---|---|
| @cf/baai/bge-small-en-v1.5 | 384 | 512 | Fast | Good |
| @cf/baai/bge-base-en-v1.5 | 768 | 512 | Medium | Better |
| @cf/baai/bge-large-en-v1.5 | 1024 | 512 | Slow | Best |
Embedding models convert text into numeric vectors. Use them with Vectorize for similarity search and RAG.
Recommendation: BGE base is the production default. Use BGE small for prototyping or when latency matters more than precision. BGE large is for maximum retrieval accuracy.
// Embed a batch of texts
const result = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
  text: ["first document", "second document", "third document"],
});
// result.data = [[0.012, -0.034, ...], [...], [...]]
// result.shape = [3, 768]
Gotcha: All BGE models have a 512-token input limit. Text beyond this gets truncated silently. If your chunks regularly exceed 512 tokens, embedding quality degrades without any error.
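One cheap defense is a rough token estimate before embedding. The four-characters-per-token rule below is a heuristic for English text, not BGE's actual tokenizer, so leave some headroom:

```typescript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic, not the BGE tokenizer - treat it as approximate.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Return the chunks likely to be silently truncated at the 512-token limit
function findOversizedChunks(chunks: string[], limit = 512): string[] {
  return chunks.filter((chunk) => estimateTokens(chunk) > limit);
}
```

Run this over your chunker's output during ingest; oversized chunks should be re-split, not embedded as-is.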
Image Generation Models
| Model | Speed | Output | Notes |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | Fast | PNG (base64) | Good quality, fast generation |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 (Beta) | Medium | PNG (base64) | SDXL, 1024x1024 default |
const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", {
  prompt: "a mountain landscape at sunset, photorealistic",
});
// image.image is a base64-encoded PNG string
Image generation consumes significantly more neurons than text generation. Use it sparingly in high-traffic applications.
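To serve the result as an actual image rather than a base64 string, decode it back to bytes first. A sketch, assuming the image field shown in the FLUX example above (atob is available in the Workers runtime):

```typescript
// Decode a base64 string into raw bytes
function base64ToBytes(b64: string): Uint8Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Usage inside a Worker handler:
// const image = await env.AI.run("@cf/black-forest-labs/flux-1-schnell", { prompt });
// return new Response(base64ToBytes(image.image), {
//   headers: { "Content-Type": "image/png" },
// });
```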
Speech-to-Text
| Model | Languages | Speed | Notes |
|---|---|---|---|
| @cf/openai/whisper | 100+ | Medium | Automatic language detection |
app.post("/transcribe", async (c) => {
  const audioData = await c.req.arrayBuffer();
  const result = await c.env.AI.run("@cf/openai/whisper", {
    audio: [...new Uint8Array(audioData)],
  });
  return c.json({
    text: result.text,
    language: result.detected_language,
    word_count: result.word_count,
  });
});
Whisper accepts raw audio bytes. Supported formats include WAV, MP3, FLAC, and OGG.
Image Classification
| Model | Speed | Notes |
|---|---|---|
| @cf/microsoft/resnet-50 | Fast | ImageNet categories |
const result = await env.AI.run("@cf/microsoft/resnet-50", {
  image: [...new Uint8Array(imageData)],
});
// result = [{ label: "golden retriever", score: 0.95 }, ...]
Translation
| Model | Language Pairs | Speed |
|---|---|---|
| @cf/meta/m2m100-1.2b | 100+ pairs | Medium |
const result = await env.AI.run("@cf/meta/m2m100-1.2b", {
  text: "Hello, how are you?",
  source_lang: "english",
  target_lang: "french",
});
// result.translated_text = "Bonjour, comment allez-vous?"
Pricing: Neurons
Workers AI uses “neurons” as its billing unit. A neuron is an abstract unit of AI compute - different models consume different numbers of neurons per request.
Free Tier
- 10,000 neurons per day (resets at midnight UTC)
- Enough for light prototyping and testing
- No credit card required
Paid Tier
- $0.011 per 1,000 neurons (beyond the free tier)
- Usage tracked in the Cloudflare dashboard under Workers AI
Neurons per Request
The neuron cost depends on the model and the input/output size:
| Task | Typical Neurons | Example |
|---|---|---|
| Text generation (short) | 50-200 | Quick Q&A with Llama 8B |
| Text generation (long) | 200-1000+ | Multi-paragraph response |
| Embedding (single text) | 5-10 | One document chunk |
| Embedding (batch of 50) | 50-100 | Batch ingest |
| Image generation | 1000-5000 | One FLUX image |
| Speech-to-text | 100-500 | 30-second audio clip |
These are estimates. Actual neuron consumption depends on input length, output length, and model choice. Monitor usage in the dashboard.
Cost Estimation
For a RAG application handling 1,000 queries per day:
Per query:
- Embed the question: ~10 neurons
- Generate the answer: ~150 neurons
- Total: ~160 neurons per query

Daily usage: 1,000 queries * 160 neurons = 160,000 neurons
Free tier: -10,000 neurons
Billed: 150,000 neurons
Cost: 150 * $0.011 = $1.65/day, or ~$50/month

For the initial ingest of 10,000 document chunks:
- Embed 10,000 chunks (batched): ~2,000 neurons
- Cost: 2 * $0.011 = $0.022 (one-time)
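The arithmetic above generalizes to a small helper; the price and free-tier constants mirror the figures in this section:

```typescript
// Estimate daily Workers AI cost in USD:
// $0.011 per 1,000 neurons, beyond the 10,000-neuron daily free tier
function dailyCostUSD(
  neuronsPerDay: number,
  freePerDay = 10_000,
  pricePer1kNeurons = 0.011,
): number {
  const billed = Math.max(0, neuronsPerDay - freePerDay);
  return (billed / 1000) * pricePer1kNeurons;
}

// 1,000 RAG queries/day at ~160 neurons each: ~$1.65/day
const daily = dailyCostUSD(1000 * 160);
```

Treat the output as a rough budget figure: actual neuron consumption varies with input and output length, so verify against the dashboard.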
Gotcha: Image generation is the most neuron-intensive task. A single FLUX image can cost as much as 50 text generation calls. Budget accordingly if you’re generating images at scale.
Model Availability
The model catalog changes over time:
- New models are added regularly
- Some models are in beta and may be removed
- Beta models may have different rate limits
- Cloudflare occasionally deprecates older model versions
Always check the official model catalog for the current list. Don't hard-code model names throughout your application; keep them in environment variables or config files so you can swap models without redeploying.
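A minimal sketch of that pattern, assuming a TEXT_MODEL var (a hypothetical name, not a built-in) defined in your Worker's configuration:

```typescript
// wrangler.toml (illustrative):
//   [vars]
//   TEXT_MODEL = "@cf/meta/llama-3.1-8b-instruct"

// Pick the configured model, falling back to a known-good default
function pickModel(env: { TEXT_MODEL?: string }): string {
  return env.TEXT_MODEL ?? "@cf/meta/llama-3.1-8b-instruct";
}

// Usage inside a Worker:
// const response = await env.AI.run(pickModel(env), { messages });
```

Swapping models then becomes a config change rather than a code change.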
What’s Next
- Workers AI quickstart - Get started with inference
- RAG Patterns - Build search with embeddings
- Gotchas - Common pitfalls with Workers AI models