UTM HA Cluster: What You’re Building and Why It Matters

You’ve built the Simple cluster and seen what happens when a single etcd node dies or the only master goes down. Now it’s time to eliminate every one of those single points of failure. This post is the conceptual overview of the UTM HA cluster — what the 11 VMs do, what production problems they solve, and why UTM remains the right choice when maximum realism matters.

When you’re ready for the full technical walkthrough, the UTM HA deep dive covers every one of the 17 deployment steps. The same HA architecture is also available with Vagrant for declarative infrastructure-as-code and OrbStack for the lowest resource footprint. For a side-by-side look at what each HA tool offers at the overview level, see the Vagrant HA overview and the OrbStack HA overview.

From 6 VMs to 11 — What Changes and Why

The Simple cluster runs 6 VMs: vault, jump, one etcd node, one master, and two workers. It works, but it has three deliberate weak points — single etcd, single master, and no load balancer. The HA cluster adds 5 VMs to eliminate all of them.

HAProxy (new). A dedicated load balancer that sits in front of the two master nodes. Workers and kubectl commands connect to HAProxy on port 6443, and HAProxy routes traffic to whichever master is healthy. If master-1 goes down, traffic flows to master-2 automatically. In the Simple cluster, workers pointed directly at a single master, so if it died the control plane was gone. HAProxy is what lets clients keep reaching the API server when one master fails.

etcd-1, etcd-2, etcd-3 (was 1 node). The single etcd node becomes a 3-node cluster with Raft consensus. Any one etcd node can fail and the cluster keeps running because 2 out of 3 still form a majority (quorum). This is the single most important HA upgrade — etcd holds every piece of cluster state, and losing it means losing everything. Three nodes is the minimum for meaningful fault tolerance: you can survive one failure while maintaining consensus.

master-1, master-2 (was 1 node). Two control plane nodes, each running kube-apiserver, kube-controller-manager, and kube-scheduler. Both API servers are active and connect to the same etcd cluster. The controller-manager and scheduler use leader election: for each component, only one instance is active at a time, with the other on standby. If the master hosting the active instance fails, the standby takes over within seconds, once the old leader stops renewing its lease.

worker-1, worker-2, worker-3 (was 2 nodes). A third worker adds capacity and lets you experiment with pod distribution, node drains, and scheduling constraints across more nodes.

Vault and jump keep the same roles as in the Simple cluster: Vault manages the 3-tier PKI CA hierarchy, and jump is the bastion/Ansible controller.

What You’ll Actually Learn

The HA cluster teaches concepts that a Simple cluster physically cannot demonstrate. These are the lessons that separate homelab experience from production readiness.

How etcd quorum works in practice. With 3 etcd nodes, you can kill one and watch the cluster continue operating. Kill two and it stops — that’s the quorum boundary. You’ll understand why production clusters run 3 or 5 etcd nodes (never 2 or 4) and why an even number of nodes doesn’t improve fault tolerance.
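The odd-number rule falls out of simple majority arithmetic. Here is a minimal sketch of that arithmetic (the function names are made up for illustration and are not part of etcd):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 but still tolerates only one failure, which is why the fourth node adds cost without adding resilience.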

How control plane leader election works. With two masters, you can check which one currently holds leadership for the controller-manager and the scheduler by inspecting the leader election records in the kube-system namespace (Lease objects on current Kubernetes releases). Kill the active master and watch the standby take over. This is the mechanism that keeps reconciliation loops running during maintenance windows and node failures.
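One way to picture the takeover is as a lock with an expiry time. The sketch below is a toy model, not the Kubernetes API: the Lease class, its fields, and the 15-second duration are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Toy stand-in for a leader-election lock; not the Kubernetes Lease object."""
    duration: float                 # seconds a holder stays leader without renewing
    holder: Optional[str] = None
    renewed_at: float = 0.0

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Renew if `candidate` already leads, or take over a free or expired lease."""
        expired = self.holder is None or now - self.renewed_at > self.duration
        if self.holder == candidate or expired:
            self.holder = candidate
            self.renewed_at = now
            return True
        return False


lease = Lease(duration=15.0)            # 15 s is an illustrative value
lease.try_acquire("master-1", now=0)    # master-1 becomes leader
lease.try_acquire("master-2", now=5)    # refused: the lease is still fresh
lease.try_acquire("master-1", now=10)   # master-1 renews
# master-1 is stopped here and never renews again
lease.try_acquire("master-2", now=20)   # refused: only 10 s since the last renewal
lease.try_acquire("master-2", now=26)   # succeeds: 16 s without a renewal
print(lease.holder)
```

The standby never needs to be told that the leader died. It simply keeps trying, and wins as soon as the renewals stop for longer than the lease duration.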

How load balancers front the API server. Here HAProxy runs as a plain L4 (TCP) load balancer: it doesn’t terminate TLS or inspect traffic, it just forwards TCP connections to healthy backends. Understanding this pattern is directly transferable to cloud load balancers (AWS NLB, GCP TCP LB) that serve the same function in managed Kubernetes deployments.
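The core of an L4 balancer fits in a few lines. This is a rough sketch of the idea, not how HAProxy is implemented: the TcpBalancer class and the master IPs are hypothetical, and a real balancer runs health checks on a timer rather than at pick time.

```python
import socket
from typing import Callable, List, Optional, Tuple

Backend = Tuple[str, int]


def tcp_check(backend: Backend, timeout: float = 1.0) -> bool:
    """Healthy means the TCP port accepts a connection; no TLS, no HTTP."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        return False


class TcpBalancer:
    """Round-robin over backends, skipping any that fail the health check."""

    def __init__(self, backends: List[Backend],
                 is_healthy: Callable[[Backend], bool] = tcp_check) -> None:
        self.backends = list(backends)
        self.is_healthy = is_healthy
        self._next = 0

    def pick(self) -> Optional[Backend]:
        """Return the next healthy backend, or None if every backend is down."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if self.is_healthy(backend):
                return backend
        return None


masters = [("192.168.64.21", 6443), ("192.168.64.22", 6443)]  # hypothetical IPs
down = {masters[0]}                                            # pretend master-1 is stopped
lb = TcpBalancer(masters, is_healthy=lambda b: b not in down)
print([lb.pick() for _ in range(3)])  # every pick lands on the second master
```

The fake health check passed to the balancer keeps the example runnable without a cluster. Swap it for tcp_check to probe real ports.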

How certificate SANs expand for HA. The Simple cluster’s API server cert has SANs for one master IP. The HA version adds SANs for both master IPs, the HAProxy IP, and the service cluster IP. You’ll see firsthand why certificate planning matters before deployment — adding a SAN after the fact means reissuing and redeploying certificates across multiple nodes.
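To make the SAN planning concrete, here is a small sketch that works out which IP SANs an HA API server certificate needs and which ones a Simple-style certificate lacks. All addresses are hypothetical, including 10.96.0.1 for the service cluster IP, and both helper functions are made up for illustration.

```python
import ipaddress
from typing import List


def api_server_ip_sans(master_ips: List[str], lb_ip: str, service_ip: str) -> List[str]:
    """IP SANs clients may dial: every master, the load balancer, the service IP."""
    sans: List[str] = []
    for ip in [*master_ips, lb_ip, service_ip]:
        normalized = str(ipaddress.ip_address(ip))  # raises ValueError on a bad address
        if normalized not in sans:                  # de-duplicate, keep order
            sans.append(normalized)
    return sans


def missing_sans(cert_sans: List[str], required: List[str]) -> List[str]:
    """Addresses a client could connect to that the certificate does not cover."""
    return [ip for ip in required if ip not in cert_sans]


simple_cert = ["192.168.64.21"]             # hypothetical: one master IP only
required = api_server_ip_sans(
    master_ips=["192.168.64.21", "192.168.64.22"],
    lb_ip="192.168.64.10",                  # hypothetical HAProxy address
    service_ip="10.96.0.1",                 # assumed service cluster IP
)
print(missing_sans(simple_cert, required))
```

Any address in the missing list is one where TLS verification would fail, which is the cue to reissue the certificate before going HA rather than after.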

How Ansible orchestrates multi-node deployments. The HA cluster uses the same Ansible roles as the Simple cluster, but with parallel execution across more nodes. etcd and HAProxy deploy simultaneously since they’re independent. Watching forks=12 distribute certificates to 11 nodes in parallel shows what configuration management looks like at a meaningful scale.
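A rough way to see why forks=12 matters: with fewer forks than hosts, a task has to work through the inventory in several passes. The sketch below approximates that as fixed batches, which is a simplification of how Ansible's worker pool actually schedules hosts. The host names follow the VM list in this post, and 5 is Ansible's default forks value.

```python
from typing import List


def batches(hosts: List[str], forks: int) -> List[List[str]]:
    """Split hosts into groups of at most `forks`, one group per parallel pass."""
    return [hosts[i:i + forks] for i in range(0, len(hosts), forks)]


hosts = [
    "vault", "jump", "haproxy",
    "etcd-1", "etcd-2", "etcd-3",
    "master-1", "master-2",
    "worker-1", "worker-2", "worker-3",
]

for forks in (5, 12):
    print(f"forks={forks}: {len(batches(hosts, forks))} pass(es) for {len(hosts)} hosts")
```

With forks at or above the host count, every node is handled in a single pass.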

Why UTM for HA

The same reasons that make UTM the right choice for Simple apply even more strongly at HA scale.

Single NIC per VM. Each of the 11 VMs has one network interface with one IP on UTM’s 192.168.64.0/24 shared network. No dual-NIC confusion, no wrong-interface binding. When you’re debugging etcd peer communication or HAProxy health checks across 11 nodes, clean networking eliminates an entire class of problems. Compare this to Vagrant’s dual-NIC setup (covered in the Vagrant HA overview) where every bind address must explicitly choose the right interface.

Fastest HA deployment. The full 11-VM HA cluster deploys in about 6 minutes 13 seconds — faster than both Vagrant (8m 10s) and OrbStack (7m 26s). UTM’s cloud-init ISO provisioning and single-NIC networking contribute to the speed advantage.

Full kernel isolation. Each VM boots its own Ubuntu 24.04 kernel. etcd’s Raft consensus, kubelet’s node registration, and Calico’s pod networking all behave identically to how they’d work on bare metal or cloud instances. OrbStack’s shared-kernel approach (covered in the OrbStack HA overview) is lighter but requires workarounds like failSwapOn: false and conntrack.maxPerCore: 0 that UTM doesn’t need.

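Cloud-init ISO provisioning. Each VM is configured on first boot by cloud-init, the same tool that images on AWS, GCP, and Azure use. Here the configuration arrives through the NoCloud datasource on an attached ISO rather than through a cloud metadata service, but the user-data format is the same, so the knowledge transfers directly to cloud infrastructure work.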
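A NoCloud seed is just two small text files, meta-data and user-data, placed on a volume labeled cidata. The sketch below writes a minimal pair for one VM. The helper name and the file contents are illustrative assumptions, not the project's actual templates, and building the ISO from the directory is left out.

```python
import tempfile
from pathlib import Path


def write_nocloud_seed(directory: Path, hostname: str) -> None:
    """Write the two files a NoCloud seed needs: meta-data and user-data."""
    (directory / "meta-data").write_text(
        f"instance-id: {hostname}\n"
        f"local-hostname: {hostname}\n"
    )
    (directory / "user-data").write_text(
        "#cloud-config\n"            # required first line for cloud-config user-data
        f"hostname: {hostname}\n"
    )


seed_dir = Path(tempfile.mkdtemp())
write_nocloud_seed(seed_dir, "etcd-1")
print(sorted(p.name for p in seed_dir.iterdir()))
```

On a cloud provider the same user-data would be passed through the instance launch API instead of an ISO, which is what makes the skill portable.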

The tradeoff is resources. 11 UTM VMs need about 22 vCPUs, 38 GB RAM, and 300 GB disk. You’ll want a Mac with 48 GB+ total memory. If that’s more than your machine can handle, the OrbStack HA cluster delivers the same architecture with a fraction of the footprint.

Failure Experiments to Try

The whole point of HA is fault tolerance, and the best way to understand it is to break things deliberately.

Kill one etcd node. SSH to an etcd node and stop the service. Run etcdctl endpoint health from jump — two nodes report healthy, one is unreachable. Deploy a new pod. It works. The cluster barely noticed.

Kill two etcd nodes. Stop a second etcd node. Now try deploying a pod — it hangs. The API server can’t write to etcd because quorum is lost (1 out of 3 is not a majority). This is the boundary of your fault tolerance.

Kill the active master. Find which master is the leader for controller-manager, then stop that master’s services. Within seconds, the standby master’s controller-manager wins the leader election. HAProxy stops routing to the dead master. The cluster continues operating.

Stop HAProxy. Now workers can’t reach the API server at all, even though both masters are running. This shows that the load balancer is itself a single point of failure in this design. Production environments typically solve this with a redundant load balancer pair sharing a virtual IP (for example, managed by keepalived) or with a cloud provider’s managed load balancer. That’s a topic for a future project.
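The four experiments can be summarized as one availability rule. This is a deliberately simplified model of this design (the function name and the state table are illustrative, and it ignores timing such as failover delay):

```python
def api_reachable_and_writable(etcd_up: int, masters_up: int, haproxy_up: bool,
                               etcd_total: int = 3) -> bool:
    """True when a worker or kubectl can reach an API server that can commit writes."""
    has_quorum = etcd_up >= etcd_total // 2 + 1
    return haproxy_up and masters_up >= 1 and has_quorum


experiments = {
    "healthy":            dict(etcd_up=3, masters_up=2, haproxy_up=True),
    "one etcd down":      dict(etcd_up=2, masters_up=2, haproxy_up=True),
    "two etcd down":      dict(etcd_up=1, masters_up=2, haproxy_up=True),
    "active master down": dict(etcd_up=3, masters_up=1, haproxy_up=True),
    "HAProxy down":       dict(etcd_up=3, masters_up=2, haproxy_up=False),
}

for name, state in experiments.items():
    status = "OK" if api_reachable_and_writable(**state) else "DOWN"
    print(f"{name:20} -> {status}")
```

Two of the five rows come out DOWN: losing etcd quorum and losing the single HAProxy. Those are the limits of this cluster's fault tolerance.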

Who Should Build This

The UTM HA cluster is the right choice if you’ve completed the UTM Simple cluster and want to understand what production-grade looks like, want the cleanest networking and fastest deployment time for HA, have a Mac with 48 GB+ RAM and disk to spare, and plan to work with cloud infrastructure where full VM behavior matters.

If infrastructure-as-code and Vagrantfile-driven lifecycle management matter more, see the Vagrant HA overview. If you need HA on a 16 GB Mac, see the OrbStack HA overview.

What’s Next

Ready for the technical details? The UTM HA deep dive walks through all 17 deployment steps — from cloud image download to Calico CNI. The Vault PKI deep dive covers the certificate authority hierarchy that secures the entire cluster. And the Learning Path maps the full progression across all three tools.

Big tech, small lab. One reel at a time.

Questions, corrections, or want to share how you’re using these repos?

labitlearnit@gmail.com