Vagrant HA Cluster: What You’re Building and Why It Matters

You’ve built the Simple cluster and experienced every single point of failure firsthand. Now it’s time to scale from 6 VMs to 11 and eliminate all of them. This post is the conceptual overview of the Vagrant HA cluster — what the 11 VMs do, what production problems they solve, and why Vagrant is the right choice when infrastructure-as-code matters most.

When you’re ready for the full technical walkthrough, the Vagrant HA deep dive covers the Vagrantfile, socket_vmnet networking, and the complete deployment flow. The same HA architecture is also available with UTM for the fastest deployment and cleanest networking, and OrbStack for the lowest resource footprint. For a side-by-side look at what each HA tool offers at the overview level, see the UTM HA overview and the OrbStack HA overview.

From 6 VMs to 11 — What Changes and Why

The Simple cluster runs 6 VMs defined in a single Vagrantfile: vault, jump, one etcd node, one master, and two workers. It works, but it has three deliberate weak points: a single etcd node, a single master, and no load balancer in front of the API server. The HA cluster adds 5 VMs to eliminate all of them.

HAProxy (new). A dedicated TCP load balancer that sits in front of the two master nodes on port 6443. Workers and kubectl commands connect to HAProxy, which routes traffic to whichever master is healthy. If master-1 goes down, traffic flows to master-2 automatically. This is the component that makes the control plane fault-tolerant.

etcd-1, etcd-2, etcd-3 (was 1 node). A 3-node etcd cluster with Raft consensus replaces the single node. Any one node can fail and the cluster keeps running — 2 out of 3 still form quorum. This is the most critical HA upgrade because etcd holds every piece of cluster state. Losing etcd means losing everything.

master-1, master-2 (was 1 node). Two control plane nodes, each running kube-apiserver, kube-controller-manager, and kube-scheduler. The controller-manager and scheduler use leader election — only one instance is active, with the other on standby for automatic failover.

worker-1, worker-2, worker-3 (was 2 nodes). A third worker adds capacity for pod distribution experiments, node drains, and scheduling constraints.

Vault and Jump remain unchanged — Vault manages the 3-tier PKI hierarchy, and jump is the bastion/Ansible controller.

What You’ll Actually Learn

The HA cluster teaches concepts that a Simple cluster cannot demonstrate. These are the lessons that bridge the gap between homelab and production.

How etcd quorum works in practice. Kill one etcd node and the cluster continues. Kill two and it stops accepting writes because quorum is lost. You’ll understand why production clusters use 3 or 5 etcd nodes (never 2 or 4): a 4-node cluster needs 3 members for quorum, so it tolerates only one failure, exactly like a 3-node cluster.
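The quorum arithmetic is easy to check by hand. The sketch below assumes the standard majority rule (more than half of the voting members); the function names are illustrative and not part of the project.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority still survives."""
    return members - quorum(members)


# Print the table for cluster sizes 1 through 5.
for n in range(1, 6):
    print(f"{n} node(s): quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Sizes 3 and 4 both tolerate one failure, and sizes 1 and 2 tolerate none, which is why the extra even-numbered node buys nothing.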

How control plane leader election works. Check which master is the active controller-manager, kill it, and watch the standby take over once the old leader’s lease expires, usually within seconds. This is how Kubernetes maintains reconciliation loops during failures and maintenance.
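Leader election here is lease-based: the active instance keeps renewing a lease, and a standby can only take over after that lease goes stale. The following is a toy, in-memory model of that idea, not the Kubernetes implementation. The `Lease` class, the function name, and the 15-second duration are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 15.0  # assumed lease duration for this sketch


@dataclass
class Lease:
    holder: Optional[str] = None
    renewed_at: float = 0.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if `candidate` holds the lease after this attempt."""
    expired = lease.holder is None or now - lease.renewed_at > LEASE_SECONDS
    if lease.holder == candidate or expired:
        lease.holder = candidate
        lease.renewed_at = now
        return True
    return False


lease = Lease()
try_acquire_or_renew(lease, "master-1", now=0.0)   # master-1 becomes leader
try_acquire_or_renew(lease, "master-2", now=2.0)   # standby is refused
try_acquire_or_renew(lease, "master-1", now=10.0)  # leader renews
# master-1 is stopped here and never renews again.
try_acquire_or_renew(lease, "master-2", now=20.0)  # refused: lease is only 10 s old
try_acquire_or_renew(lease, "master-2", now=26.0)  # lease expired, standby takes over
print(lease.holder)
```

The gap between the last renewal and the takeover is the failover delay you observe when you stop the active master.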

How load balancers front the API server. HAProxy does L4 TCP forwarding with health checks — no TLS termination, no traffic inspection. Understanding this pattern transfers directly to cloud load balancers (AWS NLB, GCP TCP LB) in managed Kubernetes.
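To make "L4 health check" concrete: at this layer the balancer only cares whether a TCP connection to the backend opens. It never looks inside the TLS stream. A minimal sketch of that check follows; the function names are hypothetical and this is not how HAProxy is implemented internally.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """L4 health check: healthy if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_backends(backends, check=tcp_check):
    """Return the (host, port) pairs that currently pass the check."""
    return [(host, port) for host, port in backends if check(host, port)]
```

Calling `healthy_backends` with the two master addresses on port 6443 would return only the masters whose API server port is accepting connections, which is the set a balancer would route to.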

How dual-NIC networking scales to 11 nodes. This is Vagrant-specific and critically important. Every one of the 11 VMs has two network interfaces — eth0 (NAT for internet) and eth1 (vmnet for cluster communication). Every etcd peer URL, every API server bind address, every kubelet registration must use the vmnet IP on eth1. Getting this wrong at HA scale means nodes that appear healthy but can’t actually communicate. The UTM HA cluster avoids this entirely with single-NIC VMs; the OrbStack HA cluster has its own dual-IP variant on a single interface.
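One way to catch the wrong-interface mistake early is to verify that every configured URL points into the vmnet subnet. The sketch below assumes a hypothetical vmnet subnet of 192.168.105.0/24 and made-up node addresses; substitute whatever your socket_vmnet setup actually uses.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical vmnet subnet; replace with your own.
VMNET = ipaddress.ip_network("192.168.105.0/24")


def on_vmnet(url: str, subnet=VMNET) -> bool:
    """True if the URL's host is an IP address inside the vmnet subnet."""
    host = urlsplit(url).hostname
    return ipaddress.ip_address(host) in subnet


# Illustrative peer URLs: the second one uses a NAT-side address by mistake.
peer_urls = ["https://192.168.105.21:2380", "https://10.0.2.15:2380"]
for url in peer_urls:
    print(url, "ok" if on_vmnet(url) else "WRONG INTERFACE")
```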

How certificate SANs expand for HA. The Simple cluster’s API server cert has SANs for one master IP. The HA version adds SANs for both master IPs, the HAProxy IP, and the service cluster IP. Certificate planning before deployment is essential — adding SANs afterward means reissuing and redeploying across multiple nodes.
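Planning the SAN list up front can be as simple as writing it down in one place. This sketch builds the IP entries in the `IP:<address>` form used by OpenSSL's subjectAltName syntax. All addresses are placeholders, and the function name is hypothetical; the project's real values come from its own inventory.

```python
import ipaddress


def apiserver_ip_sans(master_ips, lb_ip, service_ip):
    """Build a comma-separated IP SAN string, de-duplicated, order preserved."""
    seen = []
    for ip in [*master_ips, lb_ip, service_ip]:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        if ip not in seen:
            seen.append(ip)
    return ",".join(f"IP:{ip}" for ip in seen)


# Placeholder addresses for two masters, the load balancer, and the service IP.
print(apiserver_ip_sans(
    master_ips=["192.168.105.31", "192.168.105.32"],
    lb_ip="192.168.105.30",
    service_ip="10.96.0.1",
))
```

If a client connects through an address that is missing from this list, TLS verification fails, which is why the load balancer's IP has to be included before the certificate is issued.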

Why Vagrant for HA

Vagrant’s core strength — declarative infrastructure-as-code — becomes even more valuable at HA scale.

One Vagrantfile defines all 11 VMs. Every VM’s name, IP suffix, vCPU count, and RAM allocation lives in a single file. Change a worker’s memory from 6 GB to 8 GB and the diff is a one-line change in Git. Add a fourth worker and it’s a few lines added to the VM definitions array. This is infrastructure-as-code at its most practical — the entire cluster topology is version-controlled, diffable, and shareable.

Clean lifecycle management. vagrant up --provider=qemu creates the full cluster. vagrant destroy -f tears it down completely. vagrant status shows which VMs are running. No orphaned disk images, no stale UTM plists, no forgotten OrbStack machines. The lifecycle is explicit and repeatable.

Full kernel isolation. Like UTM, Vagrant uses QEMU to create real VMs with their own kernels. etcd Raft consensus, kubelet registration, and Calico networking all behave identically to bare metal. The OrbStack HA cluster trades this for lower resource consumption by sharing the host kernel.

Auto-detected networking. The Vagrantfile and deploy script auto-detect the socket_vmnet subnet prefix by reading the LaunchDaemon plist. No hardcoded IPs — the project adapts to whatever subnet socket_vmnet is configured with.
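As a rough illustration of the idea (not the project's actual detection code, which lives in the Vagrantfile and deploy script), here is how a subnet prefix could be pulled out of a LaunchDaemon plist. It assumes the plist passes socket_vmnet a `--vmnet-gateway=<ip>` argument; the sample plist contents, label, and paths below are made up for the example.

```python
import plistlib
import re

# Made-up plist for illustration; a real one lives under /Library/LaunchDaemons.
SAMPLE_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Label</key><string>example.socket_vmnet</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/socket_vmnet/bin/socket_vmnet</string>
    <string>--vmnet-gateway=192.168.105.1</string>
    <string>/var/run/socket_vmnet</string>
  </array>
</dict>
</plist>
"""


def subnet_prefix(plist_bytes: bytes) -> str:
    """Return the first three octets of the gateway found in ProgramArguments."""
    args = plistlib.loads(plist_bytes).get("ProgramArguments", [])
    for arg in args:
        match = re.fullmatch(r"--vmnet-gateway=(\d+\.\d+\.\d+)\.\d+", arg)
        if match:
            return match.group(1)
    raise ValueError("no --vmnet-gateway argument found")


print(subnet_prefix(SAMPLE_PLIST))
```

Each VM's address is then the detected prefix plus that VM's IP suffix, so nothing in the repository needs to hardcode the subnet.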

The tradeoffs: dual-NIC networking adds configuration complexity, and deployment takes about 8 minutes 10 seconds (versus UTM’s 6m 13s and OrbStack’s 7m 26s). You’ll need about 42 GB of free RAM. If clean single-NIC networking and fastest deployment matter more, the UTM HA cluster is the better fit.

Failure Experiments to Try

HA exists to survive failures. The best way to learn it is to cause them.

Kill one etcd node. Stop the etcd service on one node. Check cluster health from jump — two nodes healthy, one unreachable. Deploy a pod. It succeeds. The cluster barely noticed.

Kill two etcd nodes. Stop a second. Now pod deployments hang — quorum is lost. This is the fault tolerance boundary for a 3-node cluster.

Kill the active master. Find the controller-manager leader, stop that master. The standby wins the leader election within seconds. HAProxy routes all traffic to the surviving master.

Destroy and recreate a VM. vagrant destroy worker-3 -f followed by vagrant up worker-3 --provider=qemu. Then re-run the worker playbook from jump to re-provision it. This tests both the Vagrant lifecycle and the Ansible idempotency — the playbook should bring the new worker to the same state as the others.

Who Should Build This

The Vagrant HA cluster is the right choice if you’ve completed the Vagrant Simple cluster and want to see infrastructure-as-code at real scale, want everything in a version-controlled Vagrantfile you can share with teammates, want full VM isolation with declarative lifecycle management, and have a Mac with 48 GB+ RAM.

If single-NIC networking and the fastest deployment time matter more, see the UTM HA overview. If you need HA on a resource-constrained Mac, see the OrbStack HA overview.

What’s Next

Ready for the technical details? The Vagrant HA deep dive covers the Vagrantfile internals, the socket_vmnet bridge, dual-NIC configuration, and deployment timing for every phase. The Vault PKI deep dive covers the certificate authority hierarchy shared across all three projects. And the Learning Path maps the full progression across all three tools.

Big tech, small lab. One reel at a time.

Questions, corrections, or want to share how you’re using these repos?

labitlearnit@gmail.com
