Vagrant Gotchas — What Breaks with QEMU and socket_vmnet

Vagrant + QEMU + socket_vmnet is the declarative infrastructure-as-code path for Kubernetes on Apple Silicon. It’s powerful, it’s repeatable, and it has more moving parts than the other two options, UTM and OrbStack. This post covers what breaks, what confuses, and what silently fails when building K8s clusters with Vagrant on an M-series Mac, drawn from real deployment experience with both the Simple cluster and the full 11-VM HA deployment.

Each gotcha follows the same format: what happens, why it happens, and how to fix it. For gotchas that apply to all three tools (UTM, Vagrant, OrbStack), see the HA-Specific Gotchas post. For UTM and OrbStack-specific issues, see the UTM Gotchas and OrbStack Gotchas posts.

Gotcha #1: The Dual-NIC Problem — Binding to the Wrong Interface

What happens: Kubernetes components (etcd, kube-apiserver, kubelet) start successfully, nodes appear to join the cluster, but they can’t actually communicate. kubectl get nodes shows nodes as Ready, but pods can’t be scheduled or networking fails. Alternatively, etcd peers can’t form a cluster despite all being “running.”

Why it happens: Every Vagrant VM in this setup has two network interfaces. eth0 is the NAT interface managed by the QEMU provider — it provides internet access but uses a non-routable 10.0.2.x address. eth1 is the socket_vmnet interface with the static 192.168.105.x address used for all cluster communication. Each Kubernetes component in this project — etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy — is deployed as its own systemd service with its own unit file, rendered from a Jinja2 template by an Ansible role. If the bind address in that template is wrong or missing, the component listens on eth0’s NAT address instead of eth1’s vmnet address. Other nodes can’t reach it because 10.0.2.x is only routable from within that specific VM.

How to fix it: Every systemd unit file must explicitly bind to the vmnet IP. In this project, the Ansible roles derive the correct IP from inventory host variables and pass it into the Jinja2 templates — so the fix lives in your group_vars or host_vars, not in the unit file itself. For example, if a node’s vmnet IP isn’t being picked up correctly, check the inventory variables first:

# In group_vars or host_vars, the node IP must resolve to the eth1 (vmnet) address
# WRONG — points to eth0 NAT address
node_ip: 10.0.2.15

# RIGHT — points to the socket_vmnet address
node_ip: 192.168.105.21

Each Ansible role then uses this variable in its unit file template. For example, the etcd role renders the peer and client URLs from the host’s vmnet IP, so the resulting systemd unit explicitly binds to the right address — never relying on hostname resolution or interface auto-detection. If you’re debugging connectivity issues, the first thing to check is which IP each component is actually binding to:

# Check what IP etcd is listening on
ss -tlnp | grep 2379

# Check what IP the API server is advertising
systemctl cat kube-apiserver | grep advertise

# Check what IP kubelet registered with
kubectl get nodes -o wide  # The INTERNAL-IP column tells you
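
The manual checks above can also be scripted. What follows is a minimal sketch, not part of the project’s deploy script: classify_ip and listen_addrs are hypothetical helper names, and the 10.0.2.x and 192.168.105.x ranges are the ones described in this post. It reads ss -tln output and labels each listening address as NAT or vmnet.

```shell
# Hypothetical helpers (not from the project): spot components bound to the NAT NIC.

# classify_ip IP [VMNET_PREFIX]: print nat, vmnet, or other.
classify_ip() {
    local ip="$1" prefix="${2:-192.168.105}"
    case "$ip" in
        10.0.2.*)     echo "nat" ;;
        "$prefix".*)  echo "vmnet" ;;
        *)            echo "other" ;;
    esac
}

# listen_addrs PORT: read `ss -tln` output on stdin and print each local
# address that is listening on PORT, one per line.
listen_addrs() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")
            if (parts[n] == port) {
                addr = $4
                sub(/:[0-9]+$/, "", addr)
                print addr
            }
        }'
}

# Example (run inside a VM): label every address etcd listens on.
#   ss -tln | listen_addrs 2379 | while read -r a; do
#       echo "$a -> $(classify_ip "$a")"
#   done
```

Any line labeled nat for a cluster port such as 2379 or 6443 suggests the inventory variable behind that unit file is worth a second look.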

Compare this to UTM, where each VM has a single interface with one IP and no dual-NIC confusion, or to OrbStack, where there’s one interface but two IPs stacked on it (a different flavor of the same problem).

Gotcha #2: socket_vmnet Bridge Disappears

What happens: vagrant status shows VMs as running, but SSH to the vmnet IP fails with “No route to host” or “Connection timed out.” The VMs are alive (you can reach them on the NAT interface via Vagrant’s forwarded ports) but the vmnet network is dead.

Why it happens: socket_vmnet runs as a LaunchDaemon and manages the bridge100 interface on macOS. If macOS puts the daemon to sleep, if a system update restarts networking, or if the daemon crashes, the bridge interface disappears. QEMU VMs keep running but lose their vmnet connectivity.

How to fix it: Check if the bridge exists and restart socket_vmnet if needed:

# Check if bridge100 exists
ifconfig bridge100
# If "interface does not exist" — socket_vmnet needs a restart

# Restart socket_vmnet
sudo brew services restart socket_vmnet

# Verify the bridge is back
ifconfig bridge100
# Should show an inet address in the 192.168.105.x range

# After bridge recovery, VMs should be reachable again
ping -c 3 192.168.105.12  # Test jump server

If the bridge comes back but VMs are still unreachable, the VMs may need their vmnet interfaces re-initialized. The quickest fix is vagrant halt followed by vagrant up — this restarts the QEMU processes and re-attaches them to the socket_vmnet bridge.
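
If you want to check the bridge from a script rather than by eye, a small parser is enough. This is a sketch under the assumptions used in this post (bridge100, 192.168.105.x); bridge_state is a hypothetical name, and it only inspects the ifconfig text piped into it.

```shell
# bridge_state [PREFIX]: read `ifconfig bridge100` output on stdin and print
#   ok            - the first inet address is inside PREFIX
#   wrong-subnet  - an inet address exists but is outside PREFIX
#   missing       - no inet line at all (interface absent or unconfigured)
bridge_state() {
    local prefix="${1:-192.168.105}" addr
    addr=$(awk '$1 == "inet" { print $2; exit }')
    case "$addr" in
        "")           echo "missing" ;;
        "$prefix".*)  echo "ok" ;;
        *)            echo "wrong-subnet" ;;
    esac
}

# Example (on the Mac): restart socket_vmnet only when the bridge is gone.
#   state=$(ifconfig bridge100 2>/dev/null | bridge_state)
#   [ "$state" = "ok" ] || sudo brew services restart socket_vmnet
```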

Gotcha #3: Stale SSH Host Keys After VM Recreation

What happens: After destroying and recreating VMs (vagrant destroy -f followed by vagrant up), SSH connections and Ansible playbooks fail with “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” or “Host key verification failed.”

Why it happens: When VMs are recreated, they generate new SSH host keys. The old keys are cached in ~/.ssh/known_hosts on the Mac and on the jump server. SSH sees the key mismatch and refuses to connect as a man-in-the-middle protection.

How to fix it: Clean up stale SSH state before re-running the deploy:

# Remove stale host keys for all VM IPs on the Mac
for suffix in 10 11 12 21 22 23 31 32 41 42 43; do
    ssh-keygen -R "192.168.105.${suffix}" 2>/dev/null
done

# Remove Ansible SSH control sockets
rm -rf ~/.ansible/cp/*

# Also clean up on the jump server if it's accessible
ssh jump 'rm -f ~/.ssh/known_hosts'

The deploy script handles this automatically. For manual Ansible runs after VM recreation, you’ll need to clean up yourself. Alternatively, the SSH configs generated by the deploy script include StrictHostKeyChecking no and UserKnownHostsFile /dev/null to bypass this entirely — acceptable for ephemeral lab VMs, not for production.
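
For reference, an entry with those two options might look like the output of the sketch below. This is illustrative only: lab_ssh_block is a hypothetical helper, the vagrant user name is an assumption, and the deploy script’s real output may differ.

```shell
# lab_ssh_block NAME IP [USER]: print an ssh_config stanza that skips
# host-key checks. USER defaults to "vagrant", which is an assumption.
lab_ssh_block() {
    local name="$1" ip="$2" user="${3:-vagrant}"
    cat <<EOF
Host ${name}
    HostName ${ip}
    User ${user}
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
}

# Example: preview the stanza for the jump server (prints to stdout only).
#   lab_ssh_block jump 192.168.105.12
```

Printing to stdout rather than appending to ~/.ssh/config keeps the sketch side-effect free, so you can review the stanza before deciding where it goes.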

Gotcha #4: Duplicate SSH Config Entries

What happens: SSH connections go to the wrong VM, or you’re prompted for a password when key-based auth should work. Running ssh jump connects somewhere unexpected.

Why it happens: Both the HA and Simple Vagrant projects add a Host jump block to ~/.ssh/config. If you’ve deployed both projects (or redeployed the same project multiple times without cleaning up), you end up with duplicate entries. SSH uses the first value it obtains for each option, so the settings in whichever Host jump block appears first win. If the Simple cluster’s jump entry comes before the HA cluster’s entry (both point to 192.168.105.12, but with different settings), SSH may apply outdated config.

How to fix it: Check for and clean up duplicates:

# Count Host blocks named exactly "jump" (ignores jump-vagrant, jump-utm, etc.)
grep -cE '^[[:space:]]*Host[[:space:]]+jump[[:space:]]*$' ~/.ssh/config
# Should be exactly 1

# If more than 1, edit and remove duplicates
vi ~/.ssh/config
# Keep only the most recent entry

The same issue can occur with UTM and OrbStack deploy scripts since they all add jump entries. If you’re switching between tools, either clean up ~/.ssh/config manually or use tool-specific hostnames (e.g., jump-vagrant, jump-utm).
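
To catch duplicates for every alias at once, not just jump, you can count Host patterns across the whole file. A minimal sketch, with duplicate_hosts as a hypothetical helper name:

```shell
# duplicate_hosts: read an ssh_config file on stdin and print
# "ALIAS COUNT" for every Host pattern that appears more than once.
duplicate_hosts() {
    awk 'tolower($1) == "host" {
             for (i = 2; i <= NF; i++) count[$i]++
         }
         END {
             for (h in count) if (count[h] > 1) print h, count[h]
         }' | sort
}

# Example (on the Mac):
#   duplicate_hosts < ~/.ssh/config
```

No output means every alias is defined once. A line such as jump 2 tells you which block to hunt down before SSH silently picks the first one.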

Gotcha #5: vagrant-qemu Plugin SSH Port Conflicts

What happens: vagrant up fails during the SSH phase for some VMs with “The forwarded port to [port] is already in use on the host machine.” Or, some VMs appear to start but Vagrant can’t SSH into them for provisioning.

Why it happens: Each VM gets a forwarded SSH port on the Mac (51010, 51011, 51012, etc.) so Vagrant can reach them through the NAT interface. If a previous vagrant up left orphaned QEMU processes, or if another Vagrant project uses overlapping port numbers, the ports are already bound.

How to fix it: Find and kill orphaned QEMU processes, then retry:

# Find orphaned QEMU processes
ps aux | grep qemu-system-aarch64

# Kill orphaned QEMU processes (be careful not to kill VMs you want running)
kill $(ps aux | grep '[q]emu-system-aarch64' | awk '{print $2}')

# Check which ports are in use
lsof -i :51010-51043

# Then retry
vagrant up --provider=qemu

If port conflicts persist, you can change the forwarded port assignments in the Vagrantfile. The port numbers are arbitrary — they just need to be unique and not conflict with other services.
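
Before changing port numbers, it helps to see exactly which process holds each forwarded port. The sketch below filters lsof output down to listeners inside a port range; listeners_in_range is a hypothetical helper, and the 51010-51043 range is the one used in the commands above.

```shell
# listeners_in_range LO HI: read `lsof -nP -iTCP -sTCP:LISTEN` output on
# stdin and print "PORT COMMAND PID" for listeners with LO <= PORT <= HI.
listeners_in_range() {
    awk -v lo="$1" -v hi="$2" '
        $NF == "(LISTEN)" {
            port = $(NF - 1)
            sub(/^.*:/, "", port)      # keep only the text after the last colon
            if (port + 0 >= lo + 0 && port + 0 <= hi + 0) print port, $1, $2
        }'
}

# Example (on the Mac):
#   lsof -nP -iTCP -sTCP:LISTEN | listeners_in_range 51010 51043
```

This lets you kill one specific PID instead of every QEMU process on the machine.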

Gotcha #6: socket_vmnet Subnet Detection Fails

What happens: The deploy script or Vagrantfile reports that it can’t detect the network prefix, or uses a wrong subnet. VMs boot but have incorrect IPs.

Why it happens: The subnet is auto-detected by reading the socket_vmnet LaunchDaemon plist at /Library/LaunchDaemons/homebrew.mxcl.socket_vmnet.plist. If socket_vmnet was installed differently (not via Homebrew), if the plist was customized, or if the plist format changed in a newer version, the parsing can fail or return an unexpected value.

How to fix it: Verify the plist exists and check what subnet it’s configured with:

# Check the socket_vmnet plist
cat /Library/LaunchDaemons/homebrew.mxcl.socket_vmnet.plist

# Look for the gateway IP (e.g., 192.168.105.1)
# The network prefix is derived from this (192.168.105)

# Verify the bridge interface matches
ifconfig bridge100 | grep inet
# Should show an IP in the same subnet

If auto-detection fails, you can override the network prefix by setting it directly in the Vagrantfile or deploy script. Look for the subnet detection function and replace it with a hardcoded value for your environment.
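
As a rough illustration of detection with an override, the sketch below pulls the gateway out of the plist text and trims the last octet. It assumes the plist passes the gateway as a --vmnet-gateway=... argument, which is how the Homebrew service is normally defined; the project’s own parsing may work differently. vmnet_prefix and the NETWORK_PREFIX variable are hypothetical names.

```shell
# vmnet_prefix: read the socket_vmnet plist on stdin and print the network
# prefix (first three octets of the gateway). If NETWORK_PREFIX is set
# (a hypothetical override variable), print that instead.
# Returns 1 when no gateway argument is found.
vmnet_prefix() {
    if [ -n "${NETWORK_PREFIX:-}" ]; then
        echo "$NETWORK_PREFIX"
        return 0
    fi
    local gw
    gw=$(grep -Eo -- '--vmnet-gateway=[0-9]+(\.[0-9]+){3}' | head -n 1 | cut -d= -f2)
    [ -n "$gw" ] || return 1
    echo "${gw%.*}"
}

# Example (on the Mac):
#   vmnet_prefix < /Library/LaunchDaemons/homebrew.mxcl.socket_vmnet.plist \
#       || echo "could not detect prefix; set NETWORK_PREFIX" >&2
```

Failing with a non-zero status, rather than printing an empty or guessed prefix, is what keeps VMs from booting with wrong IPs.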

Gotcha #7: Vagrant Box Download Hangs or Fails

What happens: The first vagrant up stalls while downloading the Ubuntu ARM64 box, or fails with a network error. Subsequent runs continue to try downloading.

Why it happens: Vagrant downloads the base box from Vagrant Cloud on first use. The ARM64 Ubuntu box is ~600 MB. Slow connections, corporate proxies, or Vagrant Cloud CDN issues can cause the download to hang or fail. Unlike the main deploy script, which validates download size, Vagrant’s box download has less robust error handling.

How to fix it: Download the box separately and add it manually:

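# vagrant box add expects a packaged .box archive, not a raw cloud image (.img).
# BOX_URL is a placeholder: set it to the .box download URL for the box named in your Vagrantfile.
curl -L -C - -o ubuntu-arm64.box "$BOX_URL"

# Add it under the same name your Vagrantfile's config.vm.box uses (shown here as ubuntu/noble64)
vagrant box add --name ubuntu/noble64 ubuntu-arm64.box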

# Verify the box is available
vagrant box list

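Once the box is cached locally, all future vagrant up runs use the local copy. A box added from a local file carries no version metadata, so vagrant box update can’t refresh it; to update later, download the newer file and re-add it with vagrant box add --force.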
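
A quick sanity check before adding a downloaded file can save a confusing failure later. A Vagrant box is a tar archive (optionally compressed) with a metadata.json at its top level, so listing the archive is enough to tell a real box from a truncated download or a raw disk image. A minimal sketch, with looks_like_box as a hypothetical helper name:

```shell
# looks_like_box FILE: succeed if FILE is a tar archive that contains a
# top-level metadata.json, which every Vagrant box carries.
looks_like_box() {
    tar -tf "$1" 2>/dev/null | grep -qx -e 'metadata.json' -e './metadata.json'
}

# Example (file name is a placeholder):
#   looks_like_box ubuntu-arm64.box && vagrant box add --name ubuntu/noble64 ubuntu-arm64.box
```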

Gotcha #8: vagrant destroy Doesn’t Clean Up Everything

What happens: After vagrant destroy -f, you re-run vagrant up but hit SSH issues, port conflicts, or stale state. The new deployment behaves as if remnants of the old cluster are still present.

Why it happens: vagrant destroy removes the VMs and their disks, but it doesn’t clean up: SSH known_hosts entries on the Mac, the jump entry in ~/.ssh/config, Ansible SSH control sockets, Mac /etc/hosts entries, or QEMU processes that didn’t terminate cleanly.

How to fix it: After vagrant destroy -f, also clean up host-side state:

# Clean up stale SSH host keys
for suffix in 10 11 12 21 22 23 31 32 41 42 43; do
    ssh-keygen -R "192.168.105.${suffix}" 2>/dev/null
done

# Clean up Ansible control sockets
rm -rf ~/.ansible/cp/*

# Kill any orphaned QEMU processes
pkill -f qemu-system-aarch64 2>/dev/null

# Optionally remove the jump entry from ~/.ssh/config
# and the entries from /etc/hosts

The deploy script handles most of this automatically on the next run, but for the cleanest experience, do a full cleanup before redeploying.
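
If you run this cleanup often, wrapping it in a function with a dry-run switch makes it safer to repeat. This is a sketch, not the deploy script’s own logic: run, cleanup_host_state, and the DRY_RUN variable are hypothetical names, and the IP suffixes are the ones listed above.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN=1.
# Errors are ignored because most of these targets may already be gone.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@" 2>/dev/null || true
    fi
}

# cleanup_host_state [PREFIX]: remove host-side leftovers after vagrant destroy.
cleanup_host_state() {
    local prefix="${1:-192.168.105}" suffix
    for suffix in 10 11 12 21 22 23 31 32 41 42 43; do
        run ssh-keygen -R "${prefix}.${suffix}"
    done
    # Delete Ansible control sockets but keep the directory itself.
    run find "$HOME/.ansible/cp" -mindepth 1 -delete
    # Kills every matching QEMU process, including VMs you may want to keep.
    run pkill -f qemu-system-aarch64
}

# Example: preview first, then run for real.
#   DRY_RUN=1 cleanup_host_state
#   cleanup_host_state
```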

Gotcha #9: QEMU Provider Is Slower Than Native UTM

What happens: vagrant up for 11 VMs takes ~1m 42s, while UTM boots the same VMs faster. Vagrant’s provisioning phase also adds overhead compared to UTM’s cloud-init-only approach.

Why it happens: Vagrant adds an abstraction layer over QEMU. Each VM boot goes through Vagrant’s lifecycle: create the QEMU process, wait for SSH on the NAT interface, run the shell provisioner, then hand off to the deploy script. UTM boots VMs directly with QEMU and uses cloud-init (which runs inside the VM in parallel with the boot) for provisioning. The total deployment (8m 10s for Vagrant vs 6m 13s for UTM) reflects this overhead.

How to fix it: This isn’t a bug — it’s the cost of Vagrant’s declarative abstraction. If speed is your top priority, use UTM instead. If you’re re-running just the Ansible phase (VMs already exist), use --ansible-only to skip vagrant up entirely:

# Skip VM creation, run only the Ansible deployment
./k8s-vagrant-ha-homelab.sh --ansible-only

# Or resume from a specific step
./k8s-vagrant-ha-homelab.sh --from-step 5
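
To put a number on that overhead, the totals quoted above (8m 10s for Vagrant, 6m 13s for UTM) work out to a gap of 117 seconds, roughly 31% longer than UTM. The sketch below shows the arithmetic; to_seconds is a hypothetical helper, handy if you time your own runs and want to compare them the same way.

```shell
# to_seconds "Xm Ys": convert a duration such as "8m 10s" into seconds.
to_seconds() {
    echo "$1" | awk '{
        m = $1; s = $2
        sub(/m/, "", m)
        sub(/s/, "", s)
        print m * 60 + s
    }'
}

# Example: gap between the two total deployment times quoted in this post.
#   vagrant=$(to_seconds "8m 10s")
#   utm=$(to_seconds "6m 13s")
#   echo "gap: $(( vagrant - utm ))s, $(( (vagrant - utm) * 100 / utm ))% over UTM"
```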

Gotcha #10: Netplan Static IP Not Applied on eth1

What happens: The VM boots and eth0 (NAT) has an IP, but eth1 (vmnet) shows no IP address or has the wrong one. ip addr show eth1 shows the interface is UP but has no inet address.

Why it happens: The Netplan configuration written by the shell provisioner needs to be applied explicitly. If the provisioner runs before the vmnet interface is fully initialized by QEMU, or if there’s a timing issue between interface creation and Netplan apply, the static IP configuration is silently skipped.

How to fix it: The provisioner applies Netplan after writing the config, but if the IP is missing, apply it manually:

# Check current interface state
ip addr show eth1

# Check the Netplan config
cat /etc/netplan/01-vmnet.yaml

# Apply the config
sudo netplan apply

# Verify
ip addr show eth1
# Should now show the 192.168.105.x address

If the interface name isn’t eth1 (depending on the guest image and the QEMU device layout, the interfaces may be named enp0s3 / enp0s4 instead), run ip link show to find the correct name and update the Netplan config accordingly.
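
Because the root cause is timing, a retry loop is a reasonable workaround when a single netplan apply isn’t enough. The sketch below is an assumption about how you might script it, not what the provisioner does: iface_ipv4, retry, and apply_and_check are hypothetical names, and eth1 is the interface name used in this post.

```shell
# iface_ipv4 IFACE: read `ip -o -4 addr show` output on stdin and print the
# IPv4 address (without the /prefix length) assigned to IFACE, if any.
iface_ipv4() {
    awk -v ifc="$1" '$2 == ifc && $3 == "inet" { sub(/\/.*/, "", $4); print $4 }'
}

# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or attempts run out.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
        i=$((i + 1))
    done
    return 1
}

# Example (inside the VM): re-apply Netplan up to 5 times, 2 seconds apart,
# until eth1 carries an IPv4 address.
#   apply_and_check() {
#       sudo netplan apply
#       [ -n "$(ip -o -4 addr show | iface_ipv4 eth1)" ]
#   }
#   retry 5 2 apply_and_check || echo "eth1 still has no IPv4 address" >&2
```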

Gotcha #11: socket_vmnet Looks Healthy but vagrant up Still Fails with “Connection refused”

What happens: vagrant up fails with Failed to connect to "/opt/homebrew/var/run/socket_vmnet": Connection refused. But when you investigate, the service looks fine — sudo launchctl list | grep socket_vmnet shows it loaded, the socket file exists at /opt/homebrew/var/run/socket_vmnet, and pgrep -lf socket_vmnet returns a live PID.

Why it happens: The error message points at the socket, but the real culprit can be further up the stack. Two things commonly trigger it even when socket_vmnet itself looks healthy. First, after macOS updates, Homebrew upgrades, or reboots, the socket file can end up with stale ownership. This is easy to misdiagnose: socket_vmnet is a root-owned service, so running brew services list | grep socket_vmnet without sudo returns nothing, and only sudo brew services list shows its real state. Second, previous failed vagrant up attempts leave stale directories under ~/.vagrant.d/tmp/vagrant-qemu/ (one per attempted VM, like vq_N0Ean20imRs). These hold stale PID files and socket references that confuse the next run.

How to fix it: Walk through the state in order — service, socket, then Vagrant’s own stale state:

# 1. Confirm the service is actually running (use sudo!)
sudo brew services list | grep socket_vmnet
# Should show: started  root  /Library/LaunchDaemons/homebrew.mxcl.socket_vmnet.plist

# 2. Test the socket directly as your user
nc -U /opt/homebrew/var/run/socket_vmnet < /dev/null; echo "exit: $?"
# exit: 0 means the socket accepts connections from your user

# 3. Restart socket_vmnet to refresh ownership and the socket file
sudo brew services restart socket_vmnet

# 4. Verify a fresh PID and new socket timestamp
pgrep -lf socket_vmnet
ls -l /opt/homebrew/var/run/socket_vmnet

# 5. Clean up stale Vagrant QEMU state from prior failed runs
pgrep -lf qemu  # Confirm no QEMU is running first
rm -rf ~/.vagrant.d/tmp/vagrant-qemu/vq_*

# 6. Retry
vagrant up

The socket permissions on /opt/homebrew/var/run/socket_vmnet should be srwxrwx--- owned by root:staff. Since your user is in the staff group on macOS by default, this works out of the box. If you ever see the socket owned by a different group, that alone can make the connection fail even when everything else looks healthy.
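
That permission check is easy to automate. The sketch below compares one ls -l line against the mode and ownership described above; socket_perms is a hypothetical helper name, and the expected values are taken from this post rather than from socket_vmnet’s documentation.

```shell
# socket_perms: read one `ls -l` line on stdin and report whether it matches
# the expected mode and ownership (srwxrwx--- root:staff).
socket_perms() {
    awk '{
        mode = substr($1, 1, 10)   # drop a trailing "@" or "+" that macOS may append
        if (mode != "srwxrwx---")   print "unexpected mode: " mode
        else if ($3 != "root")      print "unexpected owner: " $3
        else if ($4 != "staff")     print "unexpected group: " $4
        else                        print "ok"
    }'
}

# Example (on the Mac):
#   ls -l /opt/homebrew/var/run/socket_vmnet | socket_perms
```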

Quick Reference: Vagrant Diagnostics

When something goes wrong with the Vagrant deployment, these commands help narrow down the issue:

# Check VM status
vagrant status

# Check socket_vmnet bridge
ifconfig bridge100

# SSH via Vagrant (NAT path) vs direct (vmnet path)
vagrant ssh jump         # Uses NAT forwarded port
ssh jump                 # Uses vmnet IP via ~/.ssh/config

# Check which IPs each interface has inside a VM
vagrant ssh jump -c 'ip addr show'

# Check QEMU processes
ps aux | grep qemu-system-aarch64

# Check forwarded port assignments
vagrant port jump

# Debug provisioning issues
vagrant provision jump --debug

# Nuclear option: full destroy and cleanup
vagrant destroy -f
rm -rf .vagrant/
pkill -f qemu-system-aarch64

Where to Go Next

These gotchas cover the Vagrant-specific issues. For problems that hit during the Ansible deployment phase — Vault seal/unseal, certificate SANs, etcd quorum, Calico initialization — see the HA-Specific Gotchas post, which covers cross-tool issues that apply regardless of whether you’re running UTM, Vagrant, or OrbStack.

For the full deployment walkthrough, see the Vagrant HA deep dive. For the full roadmap from simple to HA, see From Simple to HA: A Learning Path for Kubernetes on Apple Silicon.

Big tech, small lab. One reel at a time.

Questions, corrections, or want to share how you’re using these repos?

labitlearnit@gmail.com
