Messaging Selection and Failure Modes

Choosing between Event Hubs and Service Bus is one of the first architecture decisions that changes how an Entra-adjacent system behaves under pressure. The question is not which service is more powerful in the abstract. The question is what kind of message the system is carrying, what kind of failure the team expects, and what recovery model operators can live with.

Read this page after the baseline service framing in Event Hubs for Identity Events and Service Bus for Workflows. For a concrete workflow pattern, see Reliable Worker with Service Bus. For a concrete streaming pattern, see Event Stream to Data Explorer.

Start With The Message Contract

The cleanest selection rule is to decide what the message means.

  • Choose Service Bus when the message means “someone must complete this workflow step.”
  • Choose Event Hubs when the message means “here is another event in the stream.”

That difference sounds small, but it changes almost everything that follows.

Service Bus assumes the system cares about ownership, settlement, retries, and inspection of failed work. Event Hubs assumes the system cares about ingesting a high-volume flow, letting several consumers read it independently, and replaying retained history when needed.
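The two contracts can be sketched as message shapes. The field and step names here are illustrative, not SDK types; the point is what each message carries, not how either service serializes it:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import uuid

@dataclass(frozen=True)
class OnboardingCommand:
    # A Service Bus-style message: someone must complete this workflow step,
    # so it carries identity for settlement, retries, and dead-lettering.
    command_id: str
    tenant_id: str
    step: str            # e.g. "provision-entitlement" (hypothetical step name)

@dataclass(frozen=True)
class IdentityEvent:
    # An Event Hubs-style message: one more record in a stream.
    # No owner, no settlement; consumers track their own read position.
    tenant_id: str       # also a natural partition key
    kind: str            # e.g. "user-provisioned"
    occurred_at: str     # ISO-8601 timestamp; events describe the past

def new_command(tenant_id: str, step: str) -> OnboardingCommand:
    return OnboardingCommand(command_id=str(uuid.uuid4()),
                             tenant_id=tenant_id, step=step)

def new_event(tenant_id: str, kind: str) -> IdentityEvent:
    return IdentityEvent(tenant_id=tenant_id, kind=kind,
                         occurred_at=datetime.now(timezone.utc).isoformat())
```

A command has an identity because its delivery lifecycle matters; an event has a timestamp and a key because its position in a stream matters.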

Selection Criteria That Actually Matter

Use the pressure points below rather than product branding.

  • Primary fit. Event Hubs: large event streams. Service Bus: durable workflow coordination.
  • Ordering model. Event Hubs: ordered only within a partition. Service Bus: ordered per queue or subscription, with stronger entity-level ordering via sessions.
  • Replay. Event Hubs: built in through retention and offsets. Service Bus: not a replay system; recovery is usually retry, re-enqueue, or reissue.
  • Throughput profile. Event Hubs: designed for very high ingest and parallel consumers. Service Bus: lower throughput but richer broker behavior.
  • Delivery semantics. Event Hubs: at-least-once stream consumption with client-managed progress. Service Bus: broker-managed delivery with locks, settlement, retries, and dead-lettering.
  • Multiple consumers. Event Hubs: native via consumer groups. Service Bus: fan-out via topics and subscriptions, each with its own delivery lifecycle.
  • Sessions. Event Hubs: not the model. Service Bus: first-class feature for correlated workflow ordering.
  • Dead-lettering. Event Hubs: no built-in dead-letter queue equivalent. Service Bus: first-class dead-letter support.
  • Transactions. Event Hubs: limited fit for workflow-style transactional coordination. Service Bus: better fit when the workflow needs broker-aware coordination.
  • Idempotency expectation. Event Hubs: consumers must assume duplicates and replays. Service Bus: consumers still need idempotency because retries and redelivery happen.

The important detail is that neither service removes the need for idempotent consumers. They fail differently, but both can deliver the same logical work more than once.

How The Choice Changes With The Workload

Graph-driven onboarding workflow

An onboarding or remediation flow usually starts from a control-plane event or a polling result, then runs downstream steps that have real side effects. That work often needs retries, poison-message handling, and ordering for a tenant, user, or request.

Choose Service Bus.

Why:

  • the unit of work has an owner,
  • repeated failure should surface in a dead-letter queue,
  • related steps may need session ordering,
  • operators often need a clear remediation path.
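The broker behavior the workflow relies on can be shown with a local stand-in for the lock/settle/dead-letter cycle. This is a simulation, not the Service Bus SDK; the delivery limit is an assumed value standing in for the queue's max delivery count:

```python
from collections import deque

MAX_DELIVERY_COUNT = 3   # assumed limit; the real value is queue configuration

def drain(messages, handler):
    """Deliver each message until it succeeds or exceeds the delivery limit.

    A failed handler call abandons the message back to the queue; a message
    that fails MAX_DELIVERY_COUNT times is moved aside (dead-lettered)
    instead of blocking useful work forever.
    """
    dead_letter, processed = [], []
    pending = deque((msg, 0) for msg in messages)
    while pending:
        msg, deliveries = pending.popleft()
        deliveries += 1
        try:
            handler(msg)
            processed.append(msg)                    # complete: leaves the queue
        except Exception as reason:
            if deliveries >= MAX_DELIVERY_COUNT:
                dead_letter.append((msg, str(reason)))   # isolate the poison message
            else:
                pending.append((msg, deliveries))        # abandon: redelivered later
    return processed, dead_letter

def handler(msg):
    if msg == "malformed":                           # a message that can never succeed
        raise ValueError("cannot parse payload")

processed, dlq = drain(["step-1", "malformed", "step-2"], handler)
# "step-1" and "step-2" complete; "malformed" is dead-lettered after 3 tries
```

The dead-letter list is the operator's remediation surface: failed work is visible and inspectable rather than silently retried forever.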

High-volume audit or telemetry feed

A system emitting worker telemetry, reconciliation results, or audit-adjacent records usually cares more about throughput, retention, and several downstream readers than about per-message coordination.

Choose Event Hubs.

Why:

  • the event stream may feed analytics, alerting, and enrichment at the same time,
  • replay matters for incident investigation or new consumers,
  • no single consumer should own the event forever.
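The independent-reader model can be sketched with an in-memory stand-in for one partition's log. Real consumer groups and checkpoints live in the service and a checkpoint store; the property shown here, that each group keeps its own offset and can rewind, is the one that matters:

```python
class Stream:
    """Minimal stand-in for one Event Hubs partition: an append-only log
    where each consumer group tracks its own read offset."""

    def __init__(self):
        self.log = []
        self.offsets = {}                      # consumer group -> next offset

    def append(self, event):
        self.log.append(event)

    def read(self, group, max_events=100):
        start = self.offsets.get(group, 0)     # new groups start at the beginning
        batch = self.log[start:start + max_events]
        self.offsets[group] = start + len(batch)
        return batch

    def rewind(self, group, offset=0):
        self.offsets[group] = offset           # replay retained history

stream = Stream()
for e in ["login", "role-assigned", "login"]:
    stream.append(e)

alerts = stream.read("alerting")        # one group drains the stream...
analytics = stream.read("analytics")    # ...without moving another group's position
stream.rewind("analytics")              # replay: re-read from offset 0
replayed = stream.read("analytics")
```

No read settles or removes anything; the event belongs to the log, not to any consumer.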

Mixed workload: command plus evidence stream

Some systems have both. A reliable command flow tells a worker to act, and a separate event stream records what happened for diagnostics or analytics.

Use both, but keep the contracts separate:

  • Service Bus carries the command or workflow step.
  • Event Hubs carries the resulting telemetry or event evidence.

The common mistake is trying to force one service to do both jobs and then rebuilding the missing behavior in code.

Ordering Is Narrower Than People Expect

Ordering guarantees are often overstated during design reviews.

With Event Hubs, ordering exists only within a partition. If two related events land in different partitions, the stream does not promise cross-partition order. That is acceptable for many analytics and event-processing systems, but it can break workflows that assume a single global sequence.

With Service Bus, ordering is still not magic. A queue gives a cleaner brokered work stream, and sessions can keep related messages together, but ordering only helps if the session key actually matches the workflow boundary you care about.

If the architecture requirement is really “all updates for tenant X must stay in order,” the design work is not done until the partition key or session key reflects tenant X consistently.
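The key-to-ordering dependency can be made concrete with stable hash routing. Real clients use their own partition-assignment logic; the property demonstrated, that the same key always lands on the same partition, is what the design has to guarantee end to end:

```python
import hashlib

PARTITION_COUNT = 4

def partition_for(key: str) -> int:
    """Stable hash routing: identical keys always map to the same partition,
    which is the only scope where Event Hubs promises ordering."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % PARTITION_COUNT

# "All updates for tenant X stay in order" only holds if every producer
# uses tenant X as the key, every time.
events = [("tenant-x", "created"), ("tenant-x", "licensed"), ("tenant-y", "created")]
placements = [(partition_for(tenant), payload) for tenant, payload in events]

# Both tenant-x events share one partition, so their relative order survives.
tenant_x_partitions = {p for p, _ in placements[:2]}
```

The same reasoning applies to a Service Bus session key: ordering follows the key, so the key must follow the workflow boundary.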

Replay, Recovery, And Historical Reprocessing

Event Hubs is the clear choice when replay is a real operational tool. A new consumer can start from an earlier offset, an investigation can reread retained data, and a bug fix can reprocess historical events if the retention window still covers them.

Service Bus is not built around that recovery model. Recovery usually looks like one of these instead:

  • retrying the locked message,
  • moving the message to the dead-letter queue,
  • fixing the payload or dependency and resubmitting work,
  • recreating a command from a state store or source system.

That is not a weakness. It is a different contract. Workflow systems usually need controlled re-execution, not stream replay.
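The last recovery path above, recreating a command from a state store, can be sketched as follows. The state-store shape and field names are hypothetical; the point is that the durable workflow state, not the broker, is the source the command is rebuilt from:

```python
# Durable workflow state, simplified to a dict; in practice this would be
# an external store such as a document database.
workflow_state = {
    "case-42": {"tenant_id": "tenant-x", "next_step": "provision-entitlement"},
}

def recreate_command(case_id: str) -> dict:
    """Reissue the pending step for a workflow case from its stored state,
    e.g. after a dead-lettered message was diagnosed and discarded."""
    state = workflow_state[case_id]
    return {
        "case_id": case_id,
        "tenant_id": state["tenant_id"],
        "step": state["next_step"],
        "reissued": True,    # marked so handlers and operators can audit resubmission
    }

cmd = recreate_command("case-42")
```

This is controlled re-execution: the operator decides what runs again, rather than a stream deciding what gets replayed.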

Throughput And Cost Pressure

Event Hubs and Service Bus can both become expensive if the wrong shape of workload lands on them.

Event Hubs pressure usually comes from:

  • underestimating partition needs,
  • treating a single hot partition as “good enough,”
  • retaining large streams longer than the investigation model really needs,
  • pushing workflow-like commands through a service optimized for ingest.
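The hot-partition pressure above is easy to measure once per-partition assignments are visible. A simple skew ratio, sketched here with made-up assignment data, flags when one partition carries far more than its even share:

```python
from collections import Counter

def partition_skew(assignments):
    """Ratio of the busiest partition's load to a perfectly even share.
    A ratio near 1.0 is balanced; a large ratio flags a hot partition."""
    counts = Counter(assignments)
    even_share = len(assignments) / len(counts)
    return max(counts.values()) / even_share

# One dominant key concentrates traffic on a single partition.
balanced = partition_skew([0, 1, 2, 3, 0, 1, 2, 3])   # ratio 1.0
hot = partition_skew([0, 0, 0, 0, 0, 0, 1, 2])        # partition 0 is hot
```

Watching this ratio over time is cheaper than discovering the hot partition through consumer lag.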

Service Bus pressure usually comes from:

  • using broker features for traffic that is really just telemetry,
  • serializing too much work through a narrow session key,
  • letting retries and dead-letter volume grow without fixing root causes,
  • using topics and subscriptions where a simple queue or stream would do.

The practical rule is simple: high event volume pushes toward Event Hubs; expensive failure handling and workflow ownership push toward Service Bus.

Typical Failure Modes

Event Hubs failure patterns

The builder needs to be ready for these:

  • Checkpoint mistakes cause duplicate processing or gaps in what the application thinks it has consumed.
  • Hot partitions create lag because one key receives most of the traffic.
  • Slow consumers fall behind the stream and may lose the chance to replay once retention expires.
  • Schema drift in events breaks downstream parsing across multiple consumers at once.
  • Assuming workflow semantics leads teams to bolt on custom retry, poison-event handling, and ownership logic outside the platform.

Mitigation patterns:

  • checkpoint deliberately and test restart behavior,
  • choose partition keys that reflect scale and ordering needs together,
  • keep consumers idempotent,
  • version event payloads carefully,
  • move delivery-sensitive workflow steps to Service Bus instead of simulating a broker in application code.
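The first mitigation, checkpoint deliberately and test restart behavior, can be exercised locally. This sketch stands in for a real checkpoint store and shows why a stale checkpoint produces duplicates rather than loss:

```python
def process_with_checkpoints(log, checkpoint, batch_size=2):
    """Consume from the last checkpoint, writing the checkpoint after each
    batch. A restart from a checkpoint re-delivers anything processed after
    the last successful checkpoint write, so duplicates must be tolerated."""
    seen = []
    pos = checkpoint["offset"]
    while pos < len(log):
        batch = log[pos:pos + batch_size]
        seen.extend(batch)
        pos += len(batch)
        checkpoint["offset"] = pos     # persist progress only after the batch
    return seen

log = ["e1", "e2", "e3", "e4", "e5"]
checkpoint = {"offset": 0}             # durable checkpoint store, simplified to a dict

first_run = process_with_checkpoints(log, checkpoint)

# A crash after processing but before the checkpoint write leaves the stored
# offset behind the real progress; the restart replays from the old offset.
checkpoint["offset"] = 2               # simulate a stale checkpoint
second_run = process_with_checkpoints(log, checkpoint)
# e3..e5 arrive a second time, which is exactly why consumers stay idempotent
```

Testing the restart path with a deliberately stale checkpoint, as above, is the cheapest way to verify the consumer tolerates redelivery before production proves it.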

Service Bus failure patterns

The builder needs to be ready for these:

  • Poison messages repeatedly fail and block useful work until they are dead-lettered or isolated.
  • Non-idempotent handlers create duplicate side effects during retry or redelivery.
  • Bad session-key choice serializes unrelated work or fails to preserve the entity ordering you expected.
  • Long-running handlers lose locks or create noisy retry behavior.
  • Subscription sprawl makes topic-based fan-out harder to reason about operationally.

Mitigation patterns:

  • design handlers to be idempotent against repeated delivery,
  • persist workflow progress outside the message body when the step has side effects,
  • choose session keys that match the entity or workflow boundary,
  • keep handlers short and push longer stateful coordination into external state stores like Cosmos DB for Identity State,
  • use dead-letter inspection as an operator tool, not as a permanent overflow bin.
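The first two mitigations combine into one pattern: keep workflow progress in a store outside the message, and check it before repeating a side effect. The store and step names here are illustrative; the dict stands in for a durable external store:

```python
# (case_id, step) -> done; stands in for a durable progress store
progress = {}
side_effects = []     # records real work for demonstration purposes

def handle(message):
    """Idempotent handler: redelivery after a lock loss or retry finds the
    completed step in the progress store and skips the side effect."""
    key = (message["case_id"], message["step"])
    if progress.get(key):
        return "skipped"               # step already completed on a prior delivery
    side_effects.append(key)           # the real provisioning call would go here
    progress[key] = True               # record completion before settling the message
    return "done"

msg = {"case_id": "case-42", "step": "provision-entitlement"}
first = handle(msg)       # does the work
second = handle(msg)      # redelivery: detected and skipped
```

Because completion is recorded outside the transport, losing the message lock costs a redelivery, not a duplicate entitlement.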

Transactions And Coordination Boundaries

If the system truly needs broker-aware coordination across message operations, Service Bus is the closer fit. If the system needs to atomically update workflow state and then react to that state change downstream, the cleaner answer is often to combine Service Bus with a durable store, or to use a state-driven pattern with Cosmos DB and change feed when the trigger is a state transition rather than a workflow command.

What usually fails is trying to make Event Hubs behave like a transactional coordinator. It can move a stream efficiently, but it does not want to be the place where workflow invariants live.

Idempotency Is Mandatory In Both Models

This is the part teams skip until production says otherwise.

With Event Hubs, duplicates happen because consumers restart, replay, or recover from checkpoint ambiguity. With Service Bus, duplicates happen because messages are retried, redelivered after lock loss, or resubmitted after remediation. In both cases, the consumer should be able to detect whether the side effect already happened.

That usually means storing enough workflow or processing state outside the transport to answer questions like:

  • did tenant onboarding case X already reach step Y,
  • did we already provision entitlement Z,
  • did we already emit the downstream command for this checkpoint.

If the answer lives only in memory or only in the transport, recovery will be fragile.

Practical Recommendation

Start with Service Bus for delivery-sensitive identity workflows and with Event Hubs for high-volume identity event streams. Only mix them when the workload clearly contains both a command path and an analytics or telemetry path.

Do not optimize for future optionality by hiding the difference behind a generic messaging abstraction. The services are opinionated because the workloads are different. Let the contract stay visible in the design.