AI Gotchas

A collection of common pitfalls, sharp edges, and non-obvious behavior across Cloudflare’s AI stack. Read this before debugging for hours.

Workers AI

Model availability is not guaranteed

Some models in the catalog are in beta. Beta models can be removed, have their API changed, or have different rate limits without notice. Don’t build production features on beta models without a fallback.

// Defensive: wrap model calls with fallback
async function generateText(env: Env, messages: AiTextGenerationInput["messages"]) {
  try {
    return await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages });
  } catch (err) {
    // Fallback to a different model if primary is unavailable
    return await env.AI.run("@cf/mistral/mistral-7b-instruct-v0.1", { messages });
  }
}

Token limits vary by model and differ from native limits

Workers AI may impose different token limits than the model’s native capabilities. Llama 3.1 natively supports 128K context, but the Workers AI deployment may have lower input or output limits. Always check the model card for the Workers AI-specific limits.
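
If you want a guardrail in code, a rough character-based estimate can catch oversized prompts before they reach the model. This is a heuristic sketch, not a real tokenizer: the ~4 characters/token ratio and the 4096-token budget are assumptions for illustration - always check the model card for the deployment's actual limits.

```typescript
// Rough token estimate: ~4 characters per token is a common rule of
// thumb for English text. Not exact - use it as a guardrail only.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// The 4096 budget here is an assumed example, not a documented limit.
function fitsInputBudget(prompt: string, maxInputTokens = 4096): boolean {
  return estimateTokens(prompt) <= maxInputTokens;
}
```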

Cold start latency for large models

Smaller models (7-8B parameters) are typically pre-loaded and respond quickly. Larger models (70B) may have noticeable cold start latency on the first request. Subsequent requests to the same model are fast.

If latency matters, prefer smaller models:

Model Size | Typical First-Request Latency
7-8B       | Low (pre-loaded)
13B        | Moderate
70B        | Higher (may need loading)

Streaming responses need correct headers

When using stream: true, the response is a ReadableStream of Server-Sent Events. You must set the right headers or the client may buffer the entire response:

const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: prompt }],
  stream: true,
});

// These headers are required for streaming to work in browsers
return new Response(stream, {
  headers: {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
  },
});

Embedding truncation is silent

BGE embedding models have a 512-token input limit. If your text exceeds this, the model truncates it and embeds only the first 512 tokens. There is no error or warning. The resulting embedding represents only the beginning of your text.

Fix: Chunk your text to stay under the token limit before embedding. A safe heuristic is ~400 words or ~1500 characters per chunk.
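
A minimal word-based chunker along those lines (the 400-word cap follows the heuristic above; it is not an exact token count, so leave headroom):

```typescript
// Split text into chunks of at most maxWords words, so each chunk
// stays safely under the embedding model's 512-token input limit.
function chunkText(text: string, maxWords = 400): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}
```

Splitting on sentence or paragraph boundaries usually gives better retrieval quality than a hard word cut, but the budget logic is the same.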

Vectorize

Dimension mismatch is permanent

The dimension count is set at index creation and cannot be changed. If you switch embedding models (e.g., from BGE base at 768 to BGE small at 384), you need to:

  1. Create a new index with the correct dimensions
  2. Re-embed all your data
  3. Switch your code to use the new index
  4. Delete the old index

# This is permanent - choose carefully
npx wrangler vectorize create my-index --dimensions=768 --metric=cosine

Indexing is eventually consistent

After upsert(), vectors are not immediately queryable. In production, the delay is typically seconds. In tests, you may need to add a small delay:

await env.VECTORIZE.upsert(vectors);
// In tests, vectors may not be queryable immediately
await new Promise((resolve) => setTimeout(resolve, 1000));
const results = await env.VECTORIZE.query(queryVector, { topK: 5 });
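
In tests, polling until the query returns results is more robust than a fixed sleep. A generic helper sketch (not part of the Vectorize API):

```typescript
// Poll an async operation until a readiness predicate holds, instead
// of sleeping for a fixed duration and hoping the index caught up.
async function waitUntil<T>(
  fn: () => Promise<T>,
  ready: (result: T) => boolean,
  { attempts = 10, delayMs = 200 } = {}
): Promise<T> {
  let last!: T;
  for (let i = 0; i < attempts; i++) {
    last = await fn();
    if (ready(last)) return last;
    await new Promise((r) => setTimeout(r, delayMs));
  }
  // Give the caller the last result so the test failure is informative
  return last;
}
```

Against Vectorize this might look like `await waitUntil(() => env.VECTORIZE.query(queryVector, { topK: 5 }), (r) => r.matches.length > 0)`.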

Metadata size limits

Each vector can have metadata, but there are size limits. Avoid storing large text blobs in metadata. Instead, store a reference (document ID) and look up the full text from D1 or R2 when needed.

// Risky: storing full chunk text in metadata
{ id: "chunk-1", values: [...], metadata: { text: veryLongString } }

// Safer: store a reference, look up text when needed
{ id: "chunk-1", values: [...], metadata: { docId: "doc-1", chunkIndex: 0 } }

Metadata filter types are limited

Vectorize metadata filtering supports equality checks and basic comparisons, but not full-text search or complex queries. If you need advanced filtering, pre-filter candidates using D1/KV before the vector search.
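
One way to combine the two: fetch the allowed document IDs from D1/KV first, then keep only the vector matches that belong to those documents. A minimal sketch - the `Match` shape here is an assumption modeled loosely on Vectorize query results, not the exact SDK type:

```typescript
// Assumed shape of a vector query match, for illustration only.
interface Match {
  id: string;
  score: number;
  metadata?: { docId?: string };
}

// Keep only matches whose docId appears in the candidate set that a
// D1/KV query produced (e.g. "documents this user can access").
function filterByAllowedDocs(matches: Match[], allowedDocIds: Set<string>): Match[] {
  return matches.filter((m) => {
    const docId = m.metadata?.docId;
    return docId !== undefined && allowedDocIds.has(docId);
  });
}
```

When the filter discards many matches, request a larger topK up front so enough results survive the intersection.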

AI Gateway

Caching requires identical requests

AI Gateway caches by exact request match. If any part of the request differs - different timestamp, different request ID, different ordering of messages - it’s a cache miss.

Common cache-busters to avoid:

// Cache-busting: timestamp in the prompt
messages: [{ role: "system", content: `Current time: ${Date.now()}. Answer...` }]

// Cache-busting: random instruction variation
messages: [{ role: "system", content: `Session ${crypto.randomUUID()}. Answer...` }]

// Cache-friendly: deterministic prompts
messages: [{ role: "system", content: "You are a helpful assistant." }]

Gotcha: This means caching does not help with conversational AI where each message adds new context. It’s most effective for FAQ-style queries, batch processing, and autocomplete.

AI Gateway caching is not semantic caching

The cache matches on exact request body, not on the meaning of the request. “What is the weather?” and “Tell me today’s weather” are different cache keys, even though they mean the same thing. If you need semantic caching, implement it yourself with embeddings and Vectorize.
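
The shape of a semantic cache: embed the incoming prompt, compare it to the embeddings of cached prompts, and reuse the stored response when similarity clears a threshold. A minimal in-memory sketch - in practice the lookup would be a Vectorize query, and the 0.92 threshold is an arbitrary assumption you should tune:

```typescript
// Standard cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface CacheEntry {
  embedding: number[]; // embedding of the original prompt
  response: string;    // cached model response
}

// Return the cached response of the most similar prompt, or null if
// nothing clears the threshold.
function lookupSemantic(entries: CacheEntry[], query: number[], threshold = 0.92): string | null {
  let best: CacheEntry | null = null;
  let bestScore = threshold;
  for (const e of entries) {
    const s = cosineSimilarity(e.embedding, query);
    if (s >= bestScore) {
      best = e;
      bestScore = s;
    }
  }
  return best ? best.response : null;
}
```

Too low a threshold returns wrong answers for merely related questions; too high and you never get a hit, so measure hit quality on real traffic.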

Rate limit responses need handling

When AI Gateway rate-limits a request, it returns 429 Too Many Requests. Your code should handle this gracefully:

let response;
try {
  response = await env.AI.run(
    "@cf/meta/llama-3.1-8b-instruct",
    { messages },
    { gateway: { id: "production" } }
  );
} catch (err) {
  // A rate-limited call surfaces as a thrown error (429 Too Many Requests)
  // Back off and retry, or degrade gracefully instead of crashing
  return new Response("Model is busy, try again shortly", { status: 429 });
}
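
When retrying makes sense, a generic exponential-backoff wrapper keeps the handling in one place. This is a sketch, not part of the Workers AI API:

```typescript
// Retry an async call with exponential backoff. Suitable for calls
// that throw on transient failures such as 429 rate limits.
async function withRetry<T>(
  fn: () => Promise<T>,
  { retries = 3, baseDelayMs = 500 } = {}
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < retries) {
        // Backoff schedule: 500ms, 1s, 2s, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastErr;
}
```

Usage might look like `await withRetry(() => env.AI.run(model, { messages }, { gateway: { id: "production" } }))`. Consider retrying only on 429/5xx rather than every error.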

Agents SDK

The package is pre-1.0

The agents npm package is pre-1.0. The API may change between minor versions. Always pin your version:

{
  "dependencies": {
    "agents": "0.0.67"
  }
}

Test after every upgrade. Breaking changes are possible in 0.x releases.

Agent names are case-sensitive

"MyAgent" and "myagent" refer to different Durable Object instances with completely separate state. Be consistent with naming:

// These are DIFFERENT agent instances
const agent1 = new AgentClient({ agent: "ChatAgent", name: "Room-1" });
const agent2 = new AgentClient({ agent: "ChatAgent", name: "room-1" });
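
If instance names come from user input or URLs, normalizing them before constructing the client avoids accidentally forking state across casings. A trivial, hypothetical helper - not part of the Agents SDK:

```typescript
// Canonicalize an agent instance name so "Room-1", " room-1 ", and
// "ROOM-1" all resolve to the same Durable Object instance.
function normalizeAgentName(name: string): string {
  return name.trim().toLowerCase();
}
```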

setState replaces the entire state

this.setState() replaces the state object, not merges it. Always spread the existing state:

// Wrong: loses everything except `counter`
this.setState({ counter: this.state.counter + 1 });

// Right: preserves all existing state
this.setState({ ...this.state, counter: this.state.counter + 1 });

Migration tags are required for new DO classes

When adding a new Agent class, you need both a binding and a migration entry in wrangler.jsonc. Without the migration, deployment fails:

{
  "durable_objects": {
    "bindings": [
      { "name": "AGENT", "class_name": "MyAgent" }
    ]
  },
  "migrations": [
    { "tag": "v1", "new_classes": ["MyAgent"] }
  ]
}

hono-agents middleware must be registered first

The agentsMiddleware() must be registered before other routes in Hono. It intercepts WebSocket upgrade requests for agents. If a catch-all route runs first, agent connections fail silently:

const app = new Hono();

// Correct order: middleware first
app.use("*", agentsMiddleware());
app.get("/api/health", (c) => c.json({ ok: true }));

// Wrong: catch-all before middleware
// app.get("*", (c) => c.text("not found"));
// app.use("*", agentsMiddleware()); // Too late, never reached for agent routes

WebSocket hibernation uses class methods

When using the Agents SDK (which builds on DO WebSocket hibernation), message handling uses class methods like onMessage, onClose, and onError. Do not use ws.addEventListener("message", ...) - the hibernation API requires the class method pattern.
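
The shape of that pattern, sketched with a stub base class so the snippet is self-contained - in real code `Agent` comes from the agents package, which provides typed connections, state persistence, and the hibernation wiring:

```typescript
// Stub standing in for the real Agent base class from the "agents"
// package, purely so this sketch compiles on its own.
class Agent {}

// Hibernation-friendly pattern: handlers are class methods, so the
// runtime can evict the object and re-invoke handlers after waking it.
class ChatAgent extends Agent {
  messages: string[] = [];

  onMessage(connection: unknown, message: string) {
    // Handle an incoming WebSocket message
    this.messages.push(message);
  }

  onClose() {
    // Connection closed; may fire after a hibernation wake-up
  }

  onError(err: unknown) {
    console.error("agent error:", err);
  }
}
```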

See Durable Objects deep-dive for more on the hibernation API.

What’s Next