Hybrid Worker on VM

What you will build

A hybrid architecture where cloud services handle messaging, state, and coordination, while a VM-hosted worker reaches systems that are only accessible from a private or on-premises network. The VM does the last-mile work; everything else stays in managed cloud services.

Scenario

You need to run a worker that must reach a target system on a private network. The target might be an on-premises database, a legacy application behind a firewall, a private API that is not exposed to the internet, or a system that can only be reached through software installed on the host.

The worker needs to live inside the network boundary that can reach the target, but the workflow coordination, message delivery, and state persistence should stay in cloud services where they are easier to operate.

Examples:

Syncing data from a cloud system to an on-premises SQL Server or Active Directory
Calling a legacy SOAP API that is only accessible from a corporate VLAN
Running a vendor connector that requires locally installed drivers or agents
Pushing configuration updates to systems behind a firewall

Architecture

```mermaid
flowchart LR
    subgraph cloud["Azure Cloud Services"]
        bus[Service Bus]
        state[Cosmos DB / Storage]
        api[Cloud APIs: Graph, etc.]
    end

    subgraph boundary["Private / On-Premises Network"]
        vm[VM Worker]
        target1[On-prem database]
        target2[Legacy application]
        target3[Private API]
    end

    api -->|"work items"| bus
    bus -->|"receive under lock"| vm
    vm -->|"reach targets"| target1
    vm -->|"reach targets"| target2
    vm -->|"reach targets"| target3
    vm -->|"persist state"| state
    vm -->|"report back"| bus
```

The critical boundary: the VM lives where the targets are reachable. Cloud services live where they are easy to manage. The connection between them is intentional and narrow.

Why a VM

You choose a VM over managed compute (Functions, Container Apps) when the constraints are about the host or the network, not just the application code:

Private network placement. The VM sits in a VNet that is peered with the on-premises network or connected via VPN/ExpressRoute. Managed compute cannot reach those targets without significant networking setup, and sometimes not at all.
OS-level dependencies. The target system requires installed drivers, agents, or runtime libraries that are not available in managed compute environments.
Machine control. You need to configure the OS, manage certificates, or run background services alongside the worker.
Legacy software. The connector or client library only runs on a specific OS version or architecture.

This is a boundary-driven decision, not a rejection of managed compute. If the constraint goes away (the target gets a public API, the library gets containerized), revisit the decision.

Keep workflow logic in cloud services

The most common mistake with hybrid workers is putting too much logic on the VM. The VM should do the minimum necessary: receive a work item, call the private target, report the result. Everything else belongs in cloud services.

Use Service Bus for work delivery. The VM pulls messages from a Service Bus queue. This gives you message locking (one consumer owns a message at a time), redelivery on failure, and dead-lettering without building any of that into the VM application. See Reliable Worker with Service Bus for the full pattern.
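A minimal sketch of the receive, call, settle loop, assuming the `azure-servicebus` and `azure-identity` Python packages. The names `handle_message`, `run_worker`, `call_target`, and `PermanentError` are hypothetical, and the JSON message body is an assumption rather than something this pattern prescribes.

```python
import json
from typing import Any, Callable


class PermanentError(Exception):
    """Hypothetical marker for failures that a retry cannot fix."""


def handle_message(receiver: Any, message: Any, call_target: Callable[[dict], None]) -> str:
    """Process one locked message, settle it, and return the action taken."""
    try:
        # str() of a received Service Bus message yields its body as text.
        work_item = json.loads(str(message))
    except json.JSONDecodeError as exc:
        receiver.dead_letter_message(message, reason="bad-payload", error_description=str(exc))
        return "dead-lettered"
    try:
        call_target(work_item)  # the only step that needs the private network
    except PermanentError as exc:
        receiver.dead_letter_message(message, reason="permanent-failure", error_description=str(exc))
        return "dead-lettered"
    except Exception:
        # Assumed transient: release the lock so Service Bus redelivers the message.
        receiver.abandon_message(message)
        return "abandoned"
    receiver.complete_message(message)
    return "completed"


def run_worker(namespace: str, queue_name: str, call_target: Callable[[dict], None]) -> None:
    """Pull messages under peek-lock until the process is stopped."""
    # Imported lazily so handle_message can be exercised without the SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.servicebus import ServiceBusClient

    with ServiceBusClient(namespace, DefaultAzureCredential()) as client:
        with client.get_queue_receiver(queue_name=queue_name) as receiver:
            for message in receiver:
                handle_message(receiver, message, call_target)
```

Repeated abandons eventually reach the queue's maximum delivery count, at which point Service Bus moves the message to the dead-letter queue on its own, so the worker does not need retry counters of its own.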

Use Cosmos DB or Storage for state. Workflow progress, checkpoints, and artifacts should live in cloud-side stores (Cosmos DB, Azure Storage), not on the VM’s local disk. If the VM dies, the state survives. See State and Artifacts for the storage split.
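One way to keep progress off the local disk is to checkpoint after every unit of work. The sketch below assumes Azure Blob Storage via the `azure-storage-blob` package; the class names, the `next_index` checkpoint shape, and the one-blob-per-job layout are illustrative assumptions, not part of this pattern's definition.

```python
import json
from typing import Callable, Dict, Optional, Sequence


class InMemoryCheckpointStore:
    """Stand-in with the same interface, useful for local tests only."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def load(self, job_id: str) -> Optional[dict]:
        raw = self._data.get(job_id)
        return json.loads(raw) if raw is not None else None

    def save(self, job_id: str, checkpoint: dict) -> None:
        self._data[job_id] = json.dumps(checkpoint)


class BlobCheckpointStore:
    """Hypothetical cloud-side store: one JSON blob per job."""

    def __init__(self, account_url: str, container: str) -> None:
        # Imported lazily so the rest of the sketch runs without the SDK installed.
        from azure.identity import DefaultAzureCredential
        from azure.storage.blob import BlobServiceClient

        service = BlobServiceClient(account_url, credential=DefaultAzureCredential())
        self._container = service.get_container_client(container)

    def load(self, job_id: str) -> Optional[dict]:
        from azure.core.exceptions import ResourceNotFoundError

        try:
            data = self._container.download_blob(f"{job_id}.json").readall()
        except ResourceNotFoundError:
            return None  # no checkpoint yet: start from the beginning
        return json.loads(data)

    def save(self, job_id: str, checkpoint: dict) -> None:
        self._container.upload_blob(f"{job_id}.json", json.dumps(checkpoint), overwrite=True)


def process_batch(store, job_id: str, records: Sequence[str], push: Callable[[str], None]) -> int:
    """Push records to the private target, resuming from the last saved checkpoint."""
    checkpoint = store.load(job_id) or {"next_index": 0}
    processed = 0
    for index in range(checkpoint["next_index"], len(records)):
        push(records[index])
        store.save(job_id, {"next_index": index + 1})
        processed += 1
    return processed
```

If the VM dies between `push` and `save`, the record is pushed again on the next run, so this sketch only makes sense when `push` is idempotent on the target side.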

Keep scheduling in cloud services. If the work is timer-driven, use a cloud-side timer (Azure Functions, Logic Apps) that enqueues work items to Service Bus. The VM consumes from the queue. Do not run cron jobs on the VM for workflow orchestration.
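The cloud-side half of that split could look roughly like the following: a function that a timer trigger calls on each tick to build work items and enqueue them. This is a sketch assuming the `azure-servicebus` package; `build_work_items`, `on_timer`, the target names, and the per-minute `id` scheme are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, List


def build_work_items(now: datetime, targets: Iterable[str]) -> List[dict]:
    """Build one work item per target, keyed by target and schedule slot."""
    slot = now.replace(second=0, microsecond=0).isoformat()
    return [{"id": f"{target}:{slot}", "target": target, "scheduled_for": slot} for target in targets]


def on_timer(namespace: str, queue_name: str, targets: Iterable[str]) -> None:
    """Body of a cloud-side timer handler: enqueue work, do nothing else."""
    # Imported lazily so build_work_items can be tested without the SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    items = build_work_items(datetime.now(timezone.utc), targets)
    with ServiceBusClient(namespace, DefaultAzureCredential()) as client:
        with client.get_queue_sender(queue_name) as sender:
            sender.send_messages(
                [ServiceBusMessage(json.dumps(item), message_id=item["id"]) for item in items]
            )
```

Setting `message_id` to a deterministic value lets a queue with duplicate detection enabled drop the second copy if the timer handler is retried within the same slot.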

Security considerations

A VM in a private network with access to sensitive targets needs careful security:

Managed identity. Use Azure managed identity for the VM to authenticate to cloud services (Service Bus, Cosmos DB, Storage, Key Vault). No credentials stored on disk or in environment variables. The VM’s identity gets only the RBAC roles it needs.
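As a rough illustration, the worker's configuration can be limited to endpoint names, with the credential obtained from the VM's identity at runtime. The sketch assumes the `azure-identity`, `azure-servicebus`, and `azure-storage-blob` packages; the environment variable names, `Endpoints`, `load_endpoints`, and `build_clients` are hypothetical, and the secret check is a simple heuristic rather than a complete safeguard.

```python
from dataclasses import dataclass
from typing import Mapping

# Fragments that indicate a key-bearing connection string or SAS token.
SECRET_MARKERS = ("SharedAccessKey=", "AccountKey=", "SharedAccessSignature=")


@dataclass(frozen=True)
class Endpoints:
    servicebus_namespace: str  # fully qualified namespace host name
    storage_account_url: str   # blob endpoint URL


def load_endpoints(env: Mapping[str, str]) -> Endpoints:
    """Read endpoint names from the environment and refuse anything that carries a key."""
    values = {
        "servicebus_namespace": env["WORKER_SERVICEBUS_NAMESPACE"],
        "storage_account_url": env["WORKER_STORAGE_ACCOUNT_URL"],
    }
    for name, value in values.items():
        if any(marker in value for marker in SECRET_MARKERS):
            raise ValueError(f"{name} looks like a connection string with an embedded key")
    return Endpoints(**values)


def build_clients(endpoints: Endpoints):
    """Create SDK clients that authenticate with the VM's managed identity."""
    # Imported lazily so load_endpoints can be tested without the SDKs installed.
    from azure.identity import ManagedIdentityCredential
    from azure.servicebus import ServiceBusClient
    from azure.storage.blob import BlobServiceClient

    credential = ManagedIdentityCredential()  # no arguments: system-assigned identity
    return (
        ServiceBusClient(endpoints.servicebus_namespace, credential),
        BlobServiceClient(endpoints.storage_account_url, credential=credential),
    )
```

The identity still needs RBAC role assignments on each resource before these clients can do anything; the code only removes the need to store a secret.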

Private endpoints. Connect the VM to Azure services (Service Bus, Cosmos DB, Storage) over private endpoints so traffic stays on the Azure backbone and never traverses the public internet.
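When a private endpoint and its private DNS zone are set up correctly, the service host name resolves to a private IP address from inside the VNet. A quick startup check along these lines can catch a misconfiguration before traffic quietly goes out over a public address. This is a sketch using only the standard library; `resolve` and `all_private` are hypothetical helper names.

```python
import ipaddress
import socket
from typing import Iterable, List


def resolve(hostname: str, port: int = 443) -> List[str]:
    """Return the distinct IP addresses the host name resolves to from this machine."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses: Iterable[str]) -> bool:
    """True only if there is at least one address and every address is in a private range."""
    address_list = list(addresses)
    return bool(address_list) and all(
        ipaddress.ip_address(address).is_private for address in address_list
    )
```

A worker could call `all_private(resolve(namespace))` at startup and refuse to run if it returns `False`. Note that `is_private` also accepts loopback and link-local ranges, so treat this as a coarse sanity check, not as proof that a private endpoint is in use.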

NSG rules. Network Security Groups should restrict the VM’s inbound and outbound traffic to exactly what is needed and deny everything else. The worker initiates its own connections, both to the on-prem targets and to the Azure services it uses, so allow those outbound flows explicitly and limit inbound traffic to the management access you actually need.

Minimal surface. Do not install anything on the VM beyond what the worker needs. No development tools, no admin portals, no SSH from the internet. Treat it as an appliance.

Trade-offs

This pattern solves real reachability and host-control problems, but it adds operational cost:

You own the machine lifecycle. Patching, monitoring, disk management, and OS updates are your responsibility. Managed compute handles this for you.
Scaling is manual. Adding capacity means provisioning more VMs, not adjusting a slider. Plan for peak load or accept that scaling takes time.
Failure is louder. A crashed VM stops processing entirely until it is restarted. Managed compute auto-recovers. Build monitoring and auto-restart into your operational model.
Cost is constant. The VM runs (and costs money) whether it is processing work or idle. Right-size it for the actual workload.
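Because a crashed VM fails silently from the queue's point of view, one common approach is a heartbeat: the worker periodically writes a timestamp to a cloud-side store, and a cloud-side check raises an alert or triggers a restart when the timestamp goes stale. The sketch below shows only the staleness logic; the record shape, the five-minute threshold, and the function names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_record(worker_id: str, now: datetime) -> dict:
    """Record the worker writes to a cloud-side store on each loop iteration."""
    return {"worker_id": worker_id, "last_seen": now.isoformat()}


def is_stale(record: dict, now: datetime, max_age: timedelta = timedelta(minutes=5)) -> bool:
    """True when the worker has not reported within max_age."""
    last_seen = datetime.fromisoformat(record["last_seen"])
    return now - last_seen > max_age
```

The check itself should run outside the VM (for example in the same cloud-side timer that enqueues work), since a monitor that lives on the machine dies with it.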

Choose this pattern because the boundary requires it, not because it feels familiar.

When not to use this pattern

This pattern is wrong when:

The target is publicly reachable. If the downstream system exposes a public API (or can be reached through an API gateway), use managed compute. A VM adds complexity for no benefit.
Managed compute can reach the target. Azure Functions with VNet integration, Container Apps with private networking, or Azure App Service with Hybrid Connections can reach many “private” targets without a dedicated VM.
There is no machine-specific dependency. If the worker is pure application code with no OS-level requirements, containerize it and run it on managed compute.
You are avoiding learning managed services. Adding VMs because they are familiar, when managed compute would work, creates unnecessary operational burden.

If the network and host constraints are absent, use the simpler patterns: Graph-Driven Automation for Functions-based work, or Reliable Worker for Service Bus-driven processing on managed compute.