Cosmos DB Patterns
The main Cosmos DB design mistakes happen before the first production incident: the wrong partition key, documents that blur several access patterns together, and a change feed plan that assumes downstream processing will stay simple forever. This page covers those design pressures for operational workloads; it is not a general guide to Cosmos DB schema design.
Start From Access Patterns, Not Theoretical Distribution
The most common design trap is choosing a partition key based on how evenly it distributes data rather than how the system actually reads and writes.
The useful question is: which unit of work reads and writes this document repeatedly?
```mermaid
flowchart TD
A[Choose partition key] --> B{What is the<br/>primary access pattern?}
B -->|Read/write scoped<br/>to one tenant| C[tenantId]
B -->|Read/write scoped<br/>to one workflow| D[workflowId]
B -->|Read/write scoped<br/>to one device or entity| E[entityId]
C --> F{Will one tenant<br/>dominate writes?}
F -->|No, traffic is<br/>roughly balanced| G[tenantId is fine]
F -->|Yes, one tenant<br/>is very hot| H{Can you scope<br/>narrower?}
H -->|Yes| I[tenantId + suffix<br/>or workflowId]
H -->|No| J[Investigate<br/>hierarchical keys]
D --> K{Multiple independent<br/>workflows per tenant?}
K -->|Yes| L[workflowId works]
K -->|No, one workflow<br/>per tenant| G
style G fill:#4a9,stroke:#333
style L fill:#4a9,stroke:#333
style I fill:#49a,stroke:#333
style J fill:#a94,stroke:#333
```
Good partition keys usually do one of two things:
- keep the records for one operational owner together, or
- keep the records for one active workflow together.
Weak keys fail in two ways:
- Hot partitions because a single key receives disproportionate traffic.
- Cross-partition query pressure because the data you need together is scattered.
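One way to test a candidate key against real traffic rather than theoretical distribution is to replay a sample of write events and see how much of the load lands on the busiest key value. The sketch below is a minimal illustration; `write_share_by_key` and the sample data are hypothetical, and it assumes you can export recent writes as dictionaries carrying the candidate key fields.

```python
from collections import Counter


def write_share_by_key(writes, key_field):
    """Return (key value, fraction of all writes) pairs, busiest first."""
    counts = Counter(w[key_field] for w in writes)
    total = sum(counts.values())
    return [(value, count / total) for value, count in counts.most_common()]


# Hypothetical sample: tenant "t1" dominates writes, but spreads them
# across eight independent workflows.
sample_writes = (
    [{"tenantId": "t1", "workflowId": f"wf-{i % 8}"} for i in range(80)]
    + [{"tenantId": "t2", "workflowId": "wf-100"} for _ in range(10)]
    + [{"tenantId": "t3", "workflowId": "wf-200"} for _ in range(10)]
)

top_tenant = write_share_by_key(sample_writes, "tenantId")[0]
top_workflow = write_share_by_key(sample_writes, "workflowId")[0]
print("busiest tenantId:", top_tenant)
print("busiest workflowId:", top_workflow)
```

In this sample the busiest tenantId carries most of the writes while the busiest workflowId carries a small fraction, which is the signal that a narrower key is worth evaluating. The same sample should also be checked against the read paths, since a key that spreads writes well can still scatter the data a query needs together.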
Multi-Tenant Design
For multi-tenant systems, tenantId is often the most natural partition key because operators, reporting, and failure ownership are usually tenant-scoped.
But it is not automatically correct.
Use tenantId when:
- most reads and writes stay within one tenant,
- operators investigate failures by tenant,
- per-tenant data volume and write rate are unlikely to create a persistent hot partition or to approach the 20 GB limit on a single logical partition.
Use something narrower when:
- one tenant can produce very large bursts (e.g., an enterprise customer driving 80% of writes),
- several independent workflows exist inside the same tenant,
- write amplification within a tenant is the dominant scaling problem.
Examples by domain:
| Domain | Natural partition key | When to go narrower |
|---|---|---|
| E-commerce orders | customerId | High-volume customers; use orderId |
| Multi-tenant SaaS | tenantId | Enterprise tenants with heavy write loads |
| IoT device state | deviceId | Already narrow; rarely needs further splitting |
| Identity provisioning | tenantId | Large org with many concurrent workflows |
The important part is deciding based on who owns the data operationally and where write pressure actually lands.
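When one tenant is too hot for a single key value, a common approach is a synthetic key that appends a deterministic suffix to tenantId. The following is a minimal sketch under stated assumptions: the function names, the `tenantId-bucket` format, and the bucket count of 8 are all hypothetical choices, not a Cosmos DB API.

```python
import hashlib


def suffixed_partition_key(tenant_id: str, item_id: str, buckets: int = 8) -> str:
    """Spread one hot tenant across `buckets` logical partitions.

    The suffix is derived from the item id, so the same item always maps
    to the same key and can still be fetched with a point read.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{tenant_id}-{bucket}"


def all_partition_keys(tenant_id: str, buckets: int = 8) -> list[str]:
    """Every key a tenant-wide query must visit: the cost of the suffix."""
    return [f"{tenant_id}-{b}" for b in range(buckets)]


key = suffixed_partition_key("contoso", "order-1042")
print(key, key in all_partition_keys("contoso"))
```

The trade-off is visible in `all_partition_keys`: writes spread out, but any tenant-wide read now fans out across every bucket. That is why the suffix only makes sense when write pressure, not tenant-wide querying, is the dominant problem.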
RU Cost Pressure
Cosmos DB becomes expensive when the system quietly turns every small step into a document rewrite or a fan-out query.
Common RU pressure points:
- Workers repeatedly upserting large state documents when only one field changed.
- Dashboards querying across partitions for recent failures.
- Checkpoint records updated too frequently.
- Change feed handlers writing several secondary records for each input change.
- Wide indexing on properties that are rarely queried.
Practical ways to reduce pressure:
- Keep documents small and purpose-built. A 50-field document rewritten on every status change costs more than a focused status record.
- Store large artifacts (reports, export files, images) in Azure Storage, not embedded in documents.
- Index for the queries operators and workers actually run. The default policy indexes every property, so exclude the paths nobody filters or sorts on.
- Separate high-churn records from slower-moving summary records when one document shape is doing too many jobs.
The cheapest design is usually not the most normalized one. It is the one that keeps common reads and writes targeted.
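As an illustration of indexing only what is queried, the sketch below builds an opt-in indexing policy: everything is excluded by default and a few paths are included explicitly. The property names (`tenantId`, `state`, `updatedAt`) are assumptions for the example, not part of any real schema; the policy structure follows the Cosmos DB indexing policy JSON format.

```python
import json

# Hypothetical property names; replace them with the paths your own
# queries filter or sort on.
status_container_indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    # Only these paths are indexed.
    "includedPaths": [
        {"path": "/tenantId/?"},
        {"path": "/state/?"},
        {"path": "/updatedAt/?"},
    ],
    # Exclude everything else so writes do not pay to index unused fields.
    "excludedPaths": [
        {"path": "/*"},
    ],
}

print(json.dumps(status_container_indexing_policy, indent=2))
```

A policy like this lowers the write cost of each document, at the price that a query filtering on an excluded property falls back to a scan. Review it whenever a new operator or worker query is introduced.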
Document Shape Should Match Operational Questions
Systems typically need to answer a narrow set of questions quickly:
- What state is this order/workflow/device in?
- Which component owns this failure?
- What was the last successful checkpoint?
- Has this downstream action already been applied?
If those questions require reconstructing several unrelated documents or running broad queries on every request, the design is drifting away from the operational model.
That does not mean everything belongs in one document. It means the dominant read paths should be cheap and obvious.
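One way to keep the dominant read paths cheap is to give each operational question its own small document with a predictable id, so the answer is a point read by id and partition key. The sketch below assumes tenantId as the partition key; the id formats, field names, and helper functions are hypothetical.

```python
def workflow_status_doc(tenant_id, workflow_id, state, failure_owner, last_checkpoint):
    """Small, high-churn record answering: what state is this workflow in,
    who owns the failure, and what was the last successful checkpoint?"""
    return {
        "id": f"status:{workflow_id}",
        "tenantId": tenant_id,  # assumed partition key
        "type": "workflowStatus",
        "state": state,
        "failureOwner": failure_owner,
        "lastCheckpoint": last_checkpoint,
    }


def applied_action_doc(tenant_id, workflow_id, action_id):
    """Marker record answering: has this downstream action been applied?
    Its existence is the answer, so the check is a single point read."""
    return {
        "id": f"applied:{workflow_id}:{action_id}",
        "tenantId": tenant_id,
        "type": "appliedAction",
    }


status = workflow_status_doc("contoso", "wf-17", "failed", "connector-sync", "batch-0042")
marker = applied_action_doc("contoso", "wf-17", "send-welcome-email")
print(status["id"], marker["id"])
```

Both documents share the partition key, so a worker can read or write either one without a cross-partition query, and a status change rewrites only the small status record.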
Consistency as a Contract Decision
Consistency choice should follow the failure cost of stale reads.
Session consistency is the practical default for most workloads because:
- a writer can observe its own updates,
- reads within a session stay monotonic, so the workflow maintains reasonable freshness,
- the system avoids the cost and latency of strong global constraints.
Stronger consistency (bounded staleness or strong) is worth considering when a coordination rule genuinely depends on it. Two workers making contradictory decisions because one read stale state is a real failure. But treat stronger consistency as a deliberate cost trade-off, not a safe default.
Weaker consistency (eventual or consistent prefix) works for analytics, reconciliation views, or eventually-corrected summaries. It is much less acceptable when correctness depends on freshness.
The useful question: what breaks if this reader is briefly behind?
Change Feed Patterns
The change feed lets downstream processors react to item changes without polling the container. It works well for certain patterns and poorly for others.
When change feed works well:
- State transitions are the natural event. An order moving from “pending” to “shipped” is a meaningful signal.
- Downstream consumers want to build projections, analytics views, or secondary indexes.
- The consumer can tolerate the shape and frequency of state changes as-is.
When you need a real broker instead:
- Downstream systems want retries, dead-lettering, or operator inspection of failed work.
- Several unrelated processors depend on the same document mutation pattern.
- The real need is explicit workflow messaging, not state change notification.
- Operators need broker-like remediation, such as replaying or setting aside failed work, but the system only has state changes to work with.
Change feed consumers also inherit design responsibilities:
- They need their own checkpointing and recovery model.
- Duplicate processing must be tolerated.
- Order is meaningful only within the feed's partitioned model: changes arrive in order per partition key value, not as one global sequence across the container.
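The duplicate-tolerance responsibility can be sketched as a handler that records which document versions it has already applied and skips them on redelivery. This is a minimal in-memory illustration, not the change feed processor API: `apply_changes`, the `applied` set, and the sample batch are hypothetical, and it assumes each change carries the document's `id` and `_etag`. In a real consumer the applied-version record would need to be durable.

```python
def apply_changes(changes, applied, project):
    """Process one change feed batch, skipping versions already handled.

    `applied` is a set of (id, _etag) pairs; `project` is the downstream
    side effect. Returns the number of changes actually applied.
    """
    handled = 0
    for doc in changes:
        version = (doc["id"], doc["_etag"])
        if version in applied:
            continue  # redelivered, e.g. after a crash before checkpointing
        project(doc)
        applied.add(version)
        handled += 1
    return handled


projection = {}


def update_projection(doc):
    """Example side effect: keep the latest status per order."""
    projection[doc["id"]] = doc["status"]


applied_versions = set()
batch = [
    {"id": "order-1", "_etag": "a1", "status": "pending"},
    {"id": "order-1", "_etag": "a2", "status": "shipped"},
]

first_pass = apply_changes(batch, applied_versions, update_projection)
replay = apply_changes(batch, applied_versions, update_projection)
print(first_pass, replay, projection)
```

Replaying the batch applies nothing the second time, so a late redelivery of the older "pending" version cannot overwrite the "shipped" state in the projection.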
Anti-Patterns
One giant document per tenant. A single document rewritten by every step becomes a write contention point and costs more RUs on every update. Split by operational concern.
Embedding blobs in state documents. Export payloads, audit files, or images stored as document properties inflate RU cost on every read and write. Use Azure Storage for artifacts.
Using change feed as a hidden broker. The design goes wrong when application state updates exist mainly to trigger consumers. If several unrelated processors depend on the same document mutation, the system needs explicit messaging, not implicit state change coupling.
Wide cross-partition queries for routine views. If the operator dashboard runs a cross-partition query on every page load, the partition key does not match the operational model.
Practical Recommendation
Design Cosmos DB around the operational owner of the state first, then validate the RU and downstream consequences of that choice. Prefer session consistency as the default, keep high-churn state compact, and use change feed only when downstream consumers genuinely want state transitions rather than brokered commands.
If the design starts to look like a queue, a file store, and an analytics sink all inside one container, stop and split the responsibilities before cost and recovery pressure force the issue.
In Entra-adjacent systems, these same patterns apply with identity-specific domain objects: tenant provisioning state, connector checkpoints, and reconciliation snapshots. The partition key and document shape decisions are identical; only the entities change.