The comparison post established that Vagrant is the right choice when declarative, version-controlled infrastructure matters most. This post goes deeper — walking through every layer of the k8s-vagrant-ha-homelab project, from the Vagrantfile that defines 11 VMs to the Ansible roles that deploy a full HA Kubernetes cluster with Vault PKI on Apple Silicon.
The goal: run ./k8s-vagrant-ha-homelab.sh and walk away. Eight minutes later, you have a production-grade cluster with a 3-node etcd cluster, 2 control plane masters behind HAProxy, 3 workers, HashiCorp Vault with a 3-tier PKI CA hierarchy, and a bastion jump server — all running on your MacBook.
The Architecture
The cluster runs 11 VMs across 6 roles, all created by Vagrant using the QEMU provider with socket_vmnet for host-routable networking on the 192.168.105.0/24 subnet (auto-detected from the socket_vmnet config).
| VM | Role | IP Suffix | vCPU | RAM |
|---|---|---|---|---|
| haproxy | API Server Load Balancer | .10 | 2 | 2 GB |
| vault | PKI & Secrets (Vault 1.15.4) | .11 | 2 | 4 GB |
| jump | Bastion / Ansible Controller / kubectl | .12 | 2 | 4 GB |
| etcd-1/2/3 | etcd cluster (3 nodes) | .21 / .22 / .23 | 2 each | 2 GB each |
| master-1/2 | K8s control plane | .31 / .32 | 2 each | 4 GB each |
| worker-1/2/3 | K8s worker nodes | .41 / .42 / .43 | 2 each | 6 GB each |
Total footprint: 22 vCPUs, 42 GB RAM. You’ll need a Mac with 48 GB+ total system memory to run the full HA setup comfortably.
Why Vagrant + QEMU on Apple Silicon?
VirtualBox 7.1+ added ARM64 host support, but it’s limited to ARM64 guests (no x86 emulation) and Vagrant’s VirtualBox provider on Apple Silicon is still less mature than the QEMU path for multi-VM setups. Parallels and VMware Fusion cost money. Docker Desktop’s Linux VMs share a kernel and don’t give you real VM isolation. That leaves QEMU — the same hypervisor UTM wraps in a GUI — but accessed through Vagrant’s declarative abstraction layer.
The vagrant-qemu plugin provides a provider that runs QEMU natively on ARM64. But QEMU on macOS has a networking gap: out of the box it gives VMs only user-mode (SLIRP) networking, which means no IPs routable from the host. That's where socket_vmnet enters the picture.
The socket_vmnet Bridge
socket_vmnet is a lightweight daemon from the Lima project that exposes macOS’s vmnet.framework to QEMU through a Unix socket. It runs as root (via a LaunchDaemon installed by Homebrew), but the QEMU process itself stays unprivileged. Each VM gets a real IP on the vmnet subnet — accessible from the Mac host and from other VMs.
The deploy script auto-detects the subnet prefix by reading the socket_vmnet LaunchDaemon plist at /Library/LaunchDaemons/homebrew.mxcl.socket_vmnet.plist, extracting the gateway IP, and deriving the network prefix. This means the project adapts automatically to whatever subnet socket_vmnet was configured with — no hardcoded IPs.
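The detection logic is small enough to sketch. This is an assumed reconstruction in shell, not the project's literal code — the plist path is the real one, but the parsing is illustrative:

```shell
# Assumed reconstruction of the subnet auto-detection: grab the first
# IPv4 address from the socket_vmnet LaunchDaemon plist (the vmnet
# gateway) and keep its first three octets as the network prefix.
detect_prefix() {
  plist="$1"  # normally /Library/LaunchDaemons/homebrew.mxcl.socket_vmnet.plist
  gateway=$(grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$plist" | head -n 1)
  echo "${gateway%.*}"  # 192.168.105.1 -> 192.168.105
}
```

With the default Homebrew setup this yields 192.168.105, which every IP in the architecture table builds on.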
The tradeoff: every Vagrant VM ends up with two network interfaces. The first is a NAT interface (required by the QEMU provider for internet access), and the second is the vmnet interface providing the routable IP. This dual-NIC design means the Vagrantfile and Ansible inventory must be explicit about which interface to use for cluster communication. Getting this wrong is the #1 source of mysterious failures — nodes appear to join the cluster but can’t actually talk to each other.
Inside the Vagrantfile
The Vagrantfile defines all 11 VMs declaratively. Each VM is specified with its role, IP suffix, vCPU count, and RAM allocation. The network prefix is auto-detected from the socket_vmnet config — the same detection logic used by the deploy script — so the Vagrantfile and the shell script always agree on which subnet to use.
Key design decisions in the Vagrantfile:
SSH port forwarding: Each VM gets a unique forwarded SSH port (51010 for haproxy, 51011 for vault, etc.) so Vagrant can reach each VM individually through the NAT interface. But for actual cluster work, all SSH goes through the jump server over the vmnet interface.
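The numbering implied here (51010, 51011, …) looks like a base port plus the VM's position in a fixed order. Only the first two ports are confirmed above; the rest of this sketch follows that pattern by assumption:

```shell
# Assumed port-numbering scheme: base port 51010 plus the VM's index
# in the architecture table's order. Ports beyond haproxy/vault are
# extrapolated, not confirmed by the project.
ssh_port() {
  i=0
  for name in haproxy vault jump etcd-1 etcd-2 etcd-3 \
              master-1 master-2 worker-1 worker-2 worker-3; do
    if [ "$name" = "$1" ]; then echo $((51010 + i)); return 0; fi
    i=$((i + 1))
  done
  return 1  # unknown VM name
}
```

Under this scheme, `ssh -p "$(ssh_port vault)" vagrant@127.0.0.1` would reach the vault VM through its NAT interface.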
Shell provisioner over cloud-init: Unlike the UTM version (which generates cloud-init ISO images), Vagrant uses its built-in shell provisioner to configure each VM after boot. The provisioner sets the hostname, injects SSH keys, writes /etc/hosts with all 11 VM entries, configures the static IP on the vmnet interface using Netplan, and installs role-specific packages. Workers get socat and conntrack (required by kubelet), the jump server gets Ansible and the Vault CLI, and so on.
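As a rough illustration of the networking step, here's what the provisioner might generate for the vmnet NIC. The interface name (eth1), file path, and function shape are assumptions; only the prefix-plus-suffix addressing comes from the project:

```shell
# Hedged sketch of a provisioner step: write a Netplan fragment giving
# the second (vmnet) NIC a static IP built from the detected prefix
# plus the node's suffix. Interface name and path are illustrative.
write_netplan() {
  out="$1"     # e.g. /etc/netplan/60-vmnet.yaml (path assumed)
  prefix="$2"  # auto-detected, e.g. 192.168.105
  suffix="$3"  # per-node, e.g. 21 for etcd-1
  cat > "$out" <<EOF
network:
  version: 2
  ethernets:
    eth1:
      dhcp4: false
      addresses: [${prefix}.${suffix}/24]
EOF
}
```

Inside the VM, this would be followed by netplan apply to bring eth1 up with its static address.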
The QEMU provider block: Each VM’s QEMU configuration specifies the architecture (aarch64), the Ubuntu 24.04 ARM64 cloud image as the base box, the socket_vmnet path for bridged networking, and machine-specific resource allocations. The extra_qemu_args array handles the vmnet socket path and any additional QEMU flags needed for Apple Silicon compatibility.
The Bastion Architecture
A deliberate design choice: the Mac host connects only to the jump server. Every other VM is accessed through jump. This mirrors real-world bastion patterns and means the Mac’s SSH config has exactly one entry. The jump server’s SSH config has entries for all 10 other VMs.
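Generating that jump-side config is mechanical. Here's a hypothetical version — host names and IP suffixes come straight from the architecture table, everything else (user name, output shape) is assumed:

```shell
# Hypothetical generator for the jump-side SSH config: one Host block
# per VM, each resolving to its vmnet IP.
gen_ssh_config() {
  prefix="$1"  # e.g. 192.168.105
  out="$2"     # e.g. ~/.ssh/config on the jump server
  : > "$out"
  while read -r name suffix; do
    printf 'Host %s\n  HostName %s.%s\n  User vagrant\n\n' \
      "$name" "$prefix" "$suffix" >> "$out"
  done <<'EOF'
haproxy 10
vault 11
etcd-1 21
etcd-2 22
etcd-3 23
master-1 31
master-2 32
worker-1 41
worker-2 42
worker-3 43
EOF
}
```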
Ansible playbooks run on the jump server, not from the Mac. The deploy script SSHes into jump and executes each playbook from there, using the jump-side inventory that references VMs by their vmnet IPs. This keeps the Mac clean — no Ansible inventory pointing to ephemeral VM IPs, no SSH config pollution with 11 different hosts.
kubectl also runs from jump. After deployment, you ssh jump and immediately have cluster access. The admin kubeconfig is deployed to jump during the control plane setup step.
The Deploy Script — What Happens Step by Step
The k8s-vagrant-ha-homelab.sh script orchestrates the entire deployment. It supports three modes: full deploy (VM creation + all playbooks), --ansible-only (skip vagrant up, run playbooks only), and --from-step N (resume from a specific step). Every step is timed individually, and a summary table is printed at the end.
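A minimal sketch of the mode dispatch — the flag names are the script's real ones, but the parsing shape is an assumption:

```shell
# Assumed dispatch for the three documented modes. Flag names come
# from the post; the parsing logic itself is illustrative.
parse_mode() {
  case "$1" in
    "")             echo "full" ;;          # vagrant up + all playbooks
    --ansible-only) echo "ansible-only" ;;  # skip vagrant up
    --from-step)    echo "from-step:$2" ;;  # resume at step N
    *)              echo "unknown"; return 1 ;;
  esac
}
```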
Here’s the high-level flow:
Phase 1 — VM Creation (runs on Mac): The script runs vagrant up --provider=qemu, which creates all 11 VMs through the QEMU provider. This takes about 1m 42s. Each VM boots Ubuntu 24.04, applies the shell provisioner for hostname/networking/packages, and becomes SSH-accessible. The script then configures the Mac’s /etc/hosts (adding entries for jump and vault) and sets up the SSH config for the jump server.
Phase 2 — Jump Server Setup (runs on Mac → jump): The SSH private key is copied to jump. The project’s entire ansible/ directory is synced to jump via SCP. An SSH config is written on jump with entries for all 10 other VMs. A connectivity test verifies that jump can reach every node. Kubernetes binaries (pre-downloaded on the Mac) are cached on jump so Ansible roles don’t need to download them inside each VM.
Phase 3 — Ansible Deployment (runs on jump): The script SSHes into jump and executes each Ansible playbook in sequence. The deployment steps mirror the UTM version’s 17-step flow, because both projects share the same Ansible roles. The key steps are: Vault bootstrap and PKI setup, certificate issuance for all components, etcd cluster deployment, HAProxy configuration, control plane deployment, worker node deployment, and Calico CNI installation.
Shared Ansible Roles — No Code Duplication
One of the most important design decisions: the Vagrant project includes the exact same Ansible roles as the UTM project. The roles are not forked or modified — they’re the same code. What changes between UTM and Vagrant is only the VM creation layer and the inventory files.
The project structure makes this explicit:
| UTM Project | Vagrant Project | Purpose |
|---|---|---|
| scripts/k8s-utm-ha-homelab.sh | k8s-vagrant-ha-homelab.sh | Shell script — full deploy with timing |
| ansible/playbooks/k8s-utm-ha-homelab.yml | ansible/playbooks/k8s-vagrant-ha-homelab.yml | Ansible playbook — full deploy |
| ansible/roles/ | ansible/roles/ | All roles (identical) |
This means improvements to any role — say, a better etcd health check or a new certificate rotation task — benefit both projects immediately. The separation between “how VMs are created” and “what runs inside them” is clean and intentional.
The Vault PKI Pipeline
The Vault setup is identical to the UTM version (covered in depth in the UTM deep-dive post), but here’s the summary for context. HashiCorp Vault is installed on a dedicated VM and initialized with Shamir’s Secret Sharing. A 3-tier CA hierarchy is configured: Root CA → Intermediate CA → three leaf CAs (Kubernetes, etcd, Front Proxy). Each CA has PKI roles that control what certificates can be issued — allowed domains, key usage, IP SANs, and TTLs.
The k8s-certs.yml playbook then issues certificates for every component: etcd server/peer/client certs, API server certs with all the required SANs, controller-manager and scheduler certs, per-node kubelet certs, kube-proxy certs, and the front-proxy client cert. Certificates are deployed to standardized paths on each node with correct ownership and permissions.
The critical difference from most homelab setups: a compromised etcd certificate cannot authenticate to the Kubernetes API, and vice versa. The CA separation enforces this boundary.
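You can demonstrate that boundary outside Vault with two throwaway openssl CAs standing in for the real etcd and Kubernetes CAs (these are local scratch certs, not the ones the playbooks build): a leaf signed by one CA fails verification against the other.

```shell
# Throwaway demo of CA separation: a cert issued by the "etcd" CA is
# rejected when verified against the "kubernetes" CA. Names are
# illustrative; Vault issues the real certificates.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=etcd-ca" \
  -keyout "$dir/etcd-ca.key" -out "$dir/etcd-ca.crt" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=kubernetes-ca" \
  -keyout "$dir/k8s-ca.key" -out "$dir/k8s-ca.crt" 2>/dev/null
openssl req -newkey rsa:2048 -nodes -subj "/CN=etcd-1" \
  -keyout "$dir/leaf.key" -out "$dir/leaf.csr" 2>/dev/null
openssl x509 -req -in "$dir/leaf.csr" -CA "$dir/etcd-ca.crt" \
  -CAkey "$dir/etcd-ca.key" -CAcreateserial -days 1 \
  -out "$dir/leaf.crt" 2>/dev/null
openssl verify -CAfile "$dir/etcd-ca.crt" "$dir/leaf.crt"   # succeeds: "...: OK"
openssl verify -CAfile "$dir/k8s-ca.crt" "$dir/leaf.crt" \
  || echo "rejected by the kubernetes CA, as intended"
```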
Networking: The Dual-NIC Reality
This is the part that trips people up. Every Vagrant VM in this setup has two network interfaces:
eth0 (NAT): Managed by the QEMU provider. Provides internet access through the Mac. Vagrant uses this interface for SSH and provisioning. IP is typically in the 10.0.2.x range and is not routable from the host or other VMs.
eth1 (vmnet): Provided by socket_vmnet. Gets a static IP on the 192.168.105.x subnet (or whatever subnet socket_vmnet is configured with). This is the interface used for all cluster communication — etcd peers, API server traffic, kubelet registration, pod networking.
Every Ansible playbook, every systemd unit file, every kubeconfig must explicitly reference the vmnet IP. Auto-detection of the “primary” IP will grab eth0’s NAT address, which is wrong. The Ansible inventory and group variables handle this by always specifying the exact bind address derived from the auto-detected network prefix plus each node’s suffix.
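A quick sanity check — not part of the project, just a hedged debugging aid — is to grep generated configs for QEMU's NAT range before trusting a node:

```shell
# Scan a directory of generated configs (kubeconfigs, systemd units)
# for NAT-range addresses; any hit means a component is bound to eth0
# instead of the vmnet interface.
check_bind_addrs() {
  if grep -rnE '10\.0\.2\.[0-9]+' "$1"; then
    echo "WARNING: NAT (eth0) addresses found -- cluster traffic must use vmnet IPs"
    return 1
  fi
  echo "OK: no NAT addresses under $1"
}
```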
Compare this to the UTM version, where each VM has a single interface with one IP — the cleanest setup. Or OrbStack, where there’s one interface but two IPs stacked on it. Vagrant’s dual-NIC approach is the most explicit but also the most configuration-heavy.
Deployment Timing
From a cold start (no pre-existing VMs) to a fully working HA Kubernetes cluster:
| Phase | Duration |
|---|---|
| vagrant up (11 VMs) | ~1m 42s |
| Vault setup (bootstrap + PKI) | ~42s |
| Certificate issuance | ~36s |
| etcd cluster + HAProxy | ~36s |
| Control plane (2 masters) | ~2m 6s |
| Workers (3 nodes) + Calico | ~2m 6s |
| Total | ~8m 10s |
For comparison, the UTM version completes in 6m 13s and OrbStack in 7m 26s. Vagrant’s overhead comes from the vagrant up phase (QEMU provider is slower to boot VMs than UTM’s direct approach) and the dual-NIC configuration overhead during provisioning.
Two Ways to Deploy
The project offers two fully automated deployment paths:
Option A — Shell Script (k8s-vagrant-ha-homelab.sh): Runs vagrant up, configures the jump server, then SSHes into jump to execute each playbook with per-step timing. Supports --ansible-only to skip VM creation and --from-step N to resume mid-deployment. This is the approach that mirrors the UTM project’s deploy script.
Option B — Ansible Playbook (k8s-vagrant-ha-homelab.yml): A single Ansible playbook that orchestrates the entire deployment. Phases 1 and 2 run on the Mac (localhost); Phase 3 runs on the jump server. Run vagrant up first, then ansible-playbook -i inventory/localhost.yml playbooks/k8s-vagrant-ha-homelab.yml --ask-become-pass. This is the pure infrastructure-as-code approach.
Both paths produce the same result. The shell script is better for debugging (you can resume from any step), while the Ansible playbook is better for CI/CD integration.
Gotchas and Troubleshooting
Host key verification after VM recreation: When VMs are destroyed and recreated, they generate new SSH host keys. Stale keys cached on the Mac will cause SSH and Ansible to reject connections. Before re-running the deploy script, clean up stale SSH state by removing cached host keys for all VM IPs and deleting Ansible’s SSH control sockets at ~/.ansible/cp/. The deploy script handles this automatically, but manual Ansible runs may need cleanup.
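For manual runs, that cleanup might look like this — the suffix list comes from the architecture table, and the known_hosts path is a parameter so you can point it at a non-default file:

```shell
# Sketch of the stale-SSH-state cleanup: drop cached host keys for
# every VM IP, then remove Ansible's SSH control sockets.
clean_stale_keys() {
  kh="$1"      # usually ~/.ssh/known_hosts
  prefix="$2"  # e.g. 192.168.105
  for suffix in 10 11 12 21 22 23 31 32 41 42 43; do
    ssh-keygen -R "${prefix}.${suffix}" -f "$kh" >/dev/null 2>&1
  done
  rm -f ~/.ansible/cp/* 2>/dev/null  # Ansible's SSH control sockets
}
```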
Duplicate SSH config entries: Both the HA and simple Vagrant projects add a Host jump block to ~/.ssh/config. Running both creates duplicates. SSH uses the first match, so it still works, but check with grep -c "Host jump" ~/.ssh/config and clean up if needed.
VM running but unreachable: If vagrant status shows a VM as running but SSH fails with “No route to host”, the socket_vmnet bridge may need a restart. Check that the bridge100 interface exists on the Mac with ifconfig bridge100, then restart socket_vmnet with sudo brew services restart socket_vmnet.
Wrong interface binding: If Kubernetes components bind to eth0 (NAT) instead of eth1 (vmnet), nodes will appear to join the cluster but can’t communicate. Always verify bind addresses in systemd unit files and kubeconfig files point to the 192.168.105.x addresses.
Calico pods in Init state: After deployment completes, Calico pods often show as ContainerCreating or Init:2/3. This is normal — Calico needs a minute or two to fully initialize. Check again with kubectl get pods -A and everything should be Running.
Vagrant vs UTM — When to Choose Which
Both use QEMU under the hood. Both produce full VMs with their own kernels and genuine isolation. The difference is the management layer:
Choose Vagrant when you want everything in a single, version-controlled Vagrantfile. vagrant up creates the cluster, vagrant destroy -f tears it down. The Vagrantfile is committable to Git. Teammates can clone the repo and run the same command to get an identical environment. The lifecycle management is declarative and repeatable.
Choose UTM when you want the fastest deployment (6m 13s vs 8m 10s), the cleanest networking (single NIC per VM), and direct control over VM configuration. UTM with utmctl gives you raw access to QEMU’s capabilities without an abstraction layer. The tradeoff is that VM definitions live in shell scripts and cloud-init ISOs rather than a declarative file.
For a deeper comparison across UTM, Vagrant, and OrbStack, see the main comparison post.
Running It Yourself
Prerequisites: macOS on Apple Silicon, Vagrant with the QEMU provider (vagrant plugin install vagrant-qemu), socket_vmnet (brew install socket_vmnet), Ansible (brew install ansible), and the Python hvac library (pip3 install hvac). You’ll need about 42 GB of free RAM.
```shell
git clone https://github.com/labitlearnit/k8s-vagrant-ha-homelab.git
cd k8s-vagrant-ha-homelab
./k8s-vagrant-ha-homelab.sh
```
Or the Ansible-native approach:
```shell
vagrant up
cd ansible
ansible-playbook -i inventory/localhost.yml playbooks/k8s-vagrant-ha-homelab.yml --ask-become-pass
```
After deployment, ssh jump from your Mac, then kubectl get nodes to see your cluster.
Wrapping Up
Vagrant adds an abstraction layer over QEMU that trades some speed and networking simplicity for declarative infrastructure management. The Vagrantfile becomes the single source of truth for your cluster’s topology. Combined with shared Ansible roles that work identically across UTM, Vagrant, and OrbStack, you get the flexibility to switch virtualization layers without rewriting your automation.
The full source code is at github.com/labitlearnit/k8s-vagrant-ha-homelab. Star the repo if you find it useful, and feel free to open issues or PRs.
Big tech, small lab. One reel at a time.