UTM Gotchas — What Breaks When Building K8s Clusters on Apple Silicon

UTM is the cleanest virtualization option for Kubernetes on Apple Silicon — single NIC per VM, real QEMU-backed isolation, cloud-init ISO provisioning. But “cleanest” doesn’t mean “no gotchas.” This post covers everything that breaks, confuses, or silently fails when building K8s clusters with UTM on an M-series Mac, drawn from real deployment experience with both the Simple cluster and the full 11-VM HA deployment.

Each gotcha follows the same format: what happens, why it happens, and how to fix it. For gotchas that apply to all three tools (UTM, Vagrant, OrbStack), see the HA-Specific Gotchas post. For Vagrant and OrbStack-specific issues, see the Vagrant Gotchas and OrbStack Gotchas posts.

Gotcha #1: Cloud-Init ISO Must Be qcow2, Not Raw

What happens: A freshly created VM boots but cloud-init doesn’t run. The hostname stays ubuntu, the k8s user doesn’t exist, and the static IP isn’t configured. The VM is essentially unconfigured.

Why it happens: UTM expects all disk images in qcow2 format. The cloud-init ISO is generated as a raw image using mkisofs (or hdiutil makehybrid on macOS), but UTM’s QEMU backend won’t read raw disk images attached as VirtIO drives. The ISO silently fails to mount inside the VM, so cloud-init finds no datasource and skips all configuration.

How to fix it: Convert the ISO to qcow2 before attaching it to the VM:

qemu-img convert -f raw -O qcow2 cloud-init.iso cloud-init.qcow2

The deploy script handles this automatically, but if you’re building VMs manually or debugging the provisioning pipeline, this is the first thing to check. Verify that cloud-init ran by SSHing into the VM and running cloud-init status; it should report “status: done” with no errors.
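If you want to confirm what you’re about to attach without trusting the file extension, you can look at the first four bytes of the file: qcow2 images begin with the magic bytes QFI followed by 0xFB. The helper below is a minimal sketch and not part of the deploy script; the name is_qcow2 is made up for illustration.

```bash
# Hypothetical helper: succeed only if the file starts with the qcow2
# magic bytes 0x51 0x46 0x49 0xFB ("QFI" followed by 0xFB).
is_qcow2() {
    local magic
    # Read the first 4 bytes and render them as a compact hex string.
    magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d '[:space:]')
    [ "$magic" = "514649fb" ]
}
```

A typical use would be is_qcow2 cloud-init.qcow2 || echo "not qcow2, convert it first". Running qemu-img info on the file reports the detected format as well.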

Gotcha #2: UTM Doesn’t Auto-Detect New VM Directories

What happens: You create VM directories under ~/Library/Containers/com.utmapp.UTM/Data/Documents/ with valid config.plist files and disk images, but the VMs don’t appear in UTM’s sidebar.

Why it happens: UTM scans its documents directory on launch. If UTM is already running when you create new VM directories, it doesn’t detect them until it’s restarted. There’s no filesystem watcher or refresh button.

How to fix it: Restart UTM after creating VM directories:

# Kill UTM gracefully
killall UTM 2>/dev/null
sleep 2

# Re-open UTM
open -a UTM
sleep 5  # Give it time to scan and display VMs

The deploy script does this in Step 4. If you’re scripting VM creation outside the provided script, always include the kill-wait-reopen cycle. The 5-second sleep after reopening is important — UTM needs time to parse all the plists before utmctl commands will work.
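A fixed sleep is a guess about how long UTM needs. If you’d rather poll, a small retry helper like the one below can replace it. This is a sketch under one assumption that you should verify on your machine: that utmctl list exits non-zero until UTM is ready to answer. The name wait_until is hypothetical and not something the deploy script provides.

```bash
# Hypothetical helper: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 if TIMEOUT_SECONDS pass first.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    until "$@" >/dev/null 2>&1; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

After open -a UTM you could then run wait_until 30 1 utmctl list || echo "UTM did not respond in time" instead of sleeping for a fixed 5 seconds.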

Gotcha #3: utmctl start Returns an Error on Already-Running VMs

What happens: Running utmctl start vm-name on a VM that’s already running produces an error message and a failed command. In a scripted start loop this either aborts the script partway through or, once you suppress the output, leaves you unable to tell a VM that failed to start from one that was simply already up.

Why it happens: utmctl returns a non-zero exit code when asked to start an already-running VM. In a bash script with set -e enabled, that exit code stops the script at the first such VM, and the remaining VMs are never started.

How to fix it: Catch the error gracefully in your start loop:

for vm in haproxy vault jump etcd-1 etcd-2 etcd-3 master-1 master-2 worker-1 worker-2 worker-3; do
    utmctl start "$vm" 2>/dev/null || true
    sleep 2
done

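The || true ensures the loop continues even if a VM is already running. Be aware that, together with 2>/dev/null, it also hides genuine start failures, so run utmctl list afterwards to confirm every VM is actually up. The deploy script handles this, but if you’re writing your own automation around utmctl, this pattern is essential.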
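If you want real start failures to stay visible, one option is to ask for the VM’s state first and only call start when needed. The sketch below assumes that utmctl status prints the single word started for a running VM; check the output on your UTM version before relying on it. The function name start_if_stopped and the UTMCTL override variable are hypothetical.

```bash
# Command to call; overridable so the logic can be exercised without UTM.
UTMCTL=${UTMCTL:-utmctl}

# Hypothetical helper: start a VM only if it is not already running.
# Assumes `utmctl status NAME` prints "started" for a running VM.
start_if_stopped() {
    local vm=$1 state
    state=$("$UTMCTL" status "$vm" 2>/dev/null) || {
        echo "cannot query state of $vm" >&2
        return 1
    }
    if [ "$state" = "started" ]; then
        echo "$vm already running"
        return 0
    fi
    # Errors from a genuine start failure are left visible on purpose.
    "$UTMCTL" start "$vm"
}
```

In the loop above, utmctl start "$vm" 2>/dev/null || true would then become start_if_stopped "$vm" || echo "failed: $vm".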

Gotcha #4: MAC Address Collisions Cause Network Failures

What happens: Two or more VMs get the same IP address on the 192.168.64.0/24 subnet, or one VM can’t reach the network at all. SSH connections intermittently succeed and fail between runs.

Why it happens: UTM’s shared networking uses macOS’s vmnet framework, which assigns IPs via DHCP based on MAC addresses. If two VMs have the same MAC address (which can happen if you copy a VM directory without regenerating the plist), they’ll fight over the same DHCP lease. Static IPs from cloud-init don’t rescue you here: two interfaces with the same MAC on one layer-2 segment still make frame delivery unpredictable, which is why connectivity comes and goes.

How to fix it: Every VM must have a unique MAC address in its config.plist. The deploy script generates random MACs for each VM:

# Generate a random MAC address with the 52:54:00 prefix conventionally used by QEMU/KVM
MAC=$(printf '52:54:00:%02X:%02X:%02X' $((RANDOM%256)) $((RANDOM%256)) $((RANDOM%256)))

If you suspect a MAC collision, check each VM’s plist for the MACAddress field and ensure they’re all different. You can also check the Mac’s ARP table with arp -a | grep 192.168.64 to see which MACs are associated with which IPs.
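Comparing a handful of plists by eye is error-prone, so it can help to collect the values into "name MAC" lines and let a script find repeats. The sketch below only does the comparison; how you extract each MAC from config.plist is left out because the exact key path depends on your UTM version. The function name find_duplicate_macs is hypothetical.

```bash
# Hypothetical helper: read "vm-name mac-address" lines on stdin and print
# every MAC that is used by more than one VM, followed by the VM names.
# Comparison is case-insensitive. Returns 1 if any duplicate was found.
find_duplicate_macs() {
    awk 'NF >= 2 {
             mac = toupper($2)
             names[mac] = names[mac] " " $1
             count[mac]++
         }
         END {
             for (m in count) {
                 if (count[m] > 1) {
                     printf "%s:%s\n", m, names[m]
                     dup = 1
                 }
             }
             exit (dup ? 1 : 0)
         }'
}
```

Fed two VMs that share a MAC, it prints one line naming the address and both VMs; with all-unique input it prints nothing and returns 0.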

Gotcha #5: Disk Resize Doesn’t Expand the Filesystem

What happens: You resize the qcow2 disk image with qemu-img resize to give a VM more space, but inside the VM, df -h still shows the original size. The extra space is invisible.

Why it happens: qemu-img resize expands the virtual disk device, but the partition table and filesystem inside the disk don’t know about the new space. The partition needs to be grown first, then the filesystem needs to be resized to fill it.

How to fix it: The deploy script handles this in cloud-init’s runcmd section:

runcmd:
  - growpart /dev/vda 1
  - resize2fs /dev/vda1

growpart expands partition 1 to fill all available space on the disk device. resize2fs then resizes the ext4 filesystem to match the new partition size. Both commands are safe to re-run: if the partition already fills the disk, growpart reports NOCHANGE and resize2fs has nothing left to do.

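If you need to resize after the VM has already been provisioned, shut it down before running qemu-img resize (the image should not be modified while a VM is using it), then boot it, SSH in, and run these two commands manually. No further reboot is needed, because an ext4 filesystem can be grown while mounted.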

Gotcha #6: UEFI Boot and the 10-Minute Timeout

What happens: The deploy script’s VM-readiness check times out at 10 minutes (600 seconds), but most VMs boot in under 2 minutes. Occasionally, a VM gets stuck and the timeout fires, leaving you with a partially deployed cluster.

Why it happens: UTM VMs use UEFI boot, which adds overhead compared to BIOS. If the cloud-init ISO has issues (see Gotcha #1), if the Mac is under heavy memory pressure (see Gotcha #7), or if the VM’s disk image is corrupted from a partial download, the VM can hang during boot without producing useful error output.

How to fix it: Confirm the VM’s state with utmctl list, then open UTM’s GUI and check the VM’s console output for the common causes below:

# Check if the VM is actually running
utmctl list

# If a VM shows as "stopped" unexpectedly, check UTM GUI console
# Common causes:
# 1. cloud-init ISO not qcow2 (see Gotcha #1)
# 2. Corrupt disk image — delete and re-download the base cloud image
# 3. Insufficient Mac RAM — check Activity Monitor for memory pressure

The deploy script has a smart early-exit: if jump is ready and 8 out of 11 VMs are up, it proceeds rather than waiting for stragglers. This handles the case where one or two VMs are slow without blocking the entire deployment.
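The early-exit rule is simple enough to sketch. The version below is an illustration, not the deploy script’s code: it reads "8 out of 11" as "at least 8", and it takes the readiness probe as a parameter so you can plug in whatever check you use (an SSH test, a ping). The names count_ready and can_proceed are hypothetical.

```bash
# Hypothetical helper: count how many of the named VMs pass the probe command.
# PROBE is called as: PROBE vm-name, and must return 0 when the VM is ready.
count_ready() {
    local probe=$1 ready=0 vm
    shift
    for vm in "$@"; do
        if "$probe" "$vm" >/dev/null 2>&1; then
            ready=$((ready + 1))
        fi
    done
    echo "$ready"
}

# Succeed when jump passes the probe and at least MIN_READY of the
# listed VMs (jump included, if listed) pass it too.
can_proceed() {
    local probe=$1 min_ready=$2
    shift 2
    "$probe" jump >/dev/null 2>&1 || return 1
    [ "$(count_ready "$probe" "$@")" -ge "$min_ready" ]
}
```

With a probe function of your own, for example one wrapping ssh -o ConnectTimeout=5, the check would read can_proceed my_probe 8 followed by the 11 VM names.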

Gotcha #7: Mac Memory Pressure Kills VMs

What happens: After starting all 11 VMs, some become unreachable or extremely slow. macOS starts swapping heavily, and Activity Monitor shows memory pressure in the red zone. Some VMs may crash entirely.

Why it happens: The HA cluster allocates 38 GB of RAM across 11 VMs. UTM (via QEMU) pre-allocates memory for each VM, unlike OrbStack, which shares the host kernel and dynamically allocates memory. Even on a 64 GB Mac, 38 GB for VMs plus macOS overhead plus browser tabs and other applications can push the system into swap if enough else is running, and on smaller machines it happens much sooner. macOS’s memory compressor can delay the pain, but eventually the system degrades.

How to fix it: Close unnecessary applications before deploying. For the full HA cluster, a Mac with at least 48 GB RAM is recommended, and 64 GB is comfortable. If you’re on a 32 GB Mac, the Simple cluster (6 VMs, ~26 GB RAM) is a better fit — or consider OrbStack, which runs the same 11-VM HA cluster with dramatically lower memory consumption.

You can check per-VM memory allocation by inspecting the config.plist files, or by listing the VMs and then reading each one’s setting in the GUI:

# List all VMs and their state (the output does not include memory)
utmctl list
# Then check each VM's RAM in UTM GUI > VM Settings > System
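Before starting a cluster, a quick budget check can tell you whether the planned allocation fits at all. The sketch below is only arithmetic: host RAM comes from sysctl on macOS, the VM total is the figure you pass in (38 for the HA cluster described here), and the reserve for macOS and your other applications is a number you choose yourself. The function names are hypothetical.

```bash
# Host RAM in whole GiB. macOS only: hw.memsize is reported in bytes.
host_ram_gb() {
    echo $(( $(sysctl -n hw.memsize) / 1073741824 ))
}

# Hypothetical helper: fits_in_ram HOST_GB VM_TOTAL_GB RESERVE_GB
# Succeeds when the VM allocation plus the reserve fits in host RAM.
fits_in_ram() {
    local host_gb=$1 vm_gb=$2 reserve_gb=$3
    [ $(( vm_gb + reserve_gb )) -le "$host_gb" ]
}
```

For example, fits_in_ram "$(host_ram_gb)" 38 8 || echo "not enough headroom for the HA cluster" uses an 8 GB reserve, which is an arbitrary illustration rather than a measured value.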

Gotcha #8: NoCloud Datasource Must Be Set Explicitly

What happens: Cloud-init runs but takes 30+ extra seconds during boot. You see log entries about probing for AWS, GCP, or Azure metadata endpoints before cloud-init finally falls back to the NoCloud datasource.

Why it happens: Unless told otherwise, cloud-init works through a list of candidate datasources, which on a stock Ubuntu image includes cloud providers such as AWS, GCP, and Azure alongside NoCloud and None. Each network-based probe has a timeout. On a local VM with no cloud metadata service, any cloud probe that runs before NoCloud is settled on must time out first.

How to fix it: Set the datasource list explicitly in the cloud-init user-data to skip cloud probing entirely:

#cloud-config
datasource_list: [NoCloud, None]
# ... rest of your cloud-init config

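This tells cloud-init to only look for NoCloud (the ISO we attached) and None (the fallback). The deploy script includes this in every VM’s user-data, saving 30+ seconds per VM boot. One caveat: cloud-init picks its datasource before it reads user-data, so if the probing delay persists on first boot, put the same datasource_list line in a file under /etc/cloud/cloud.cfg.d/ inside the base image instead.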

Gotcha #9: config.plist UUID and Identifier Must Be Unique

What happens: You create multiple VMs by copying an existing VM directory, but UTM only shows one of them, or multiple VMs show the same name, or utmctl commands target the wrong VM.

Why it happens: Each VM’s config.plist contains UUID and identifier fields that UTM uses to distinguish VMs. Copying a directory without regenerating these values creates duplicates that UTM can’t differentiate.

How to fix it: Always generate fresh UUIDs when creating new VMs. The deploy script uses a UUID generation function for each VM:

# Generate a unique UUID for each VM
generate_uuid() {
    python3 -c "import uuid; print(str(uuid.uuid4()).upper())"
}

If you’re manually creating VMs, never copy a VM directory without editing the plist to change the UUID, identifier, and display name. Duplicate values here can cause UTM to behave unpredictably.

Gotcha #10: The Shared Networking Subnet Is Configurable — But Must Match Your cloud-init

What happens: You configure static IPs in your cloud-init network-config using a subnet (e.g., 10.0.0.0/24), but VMs can’t reach the Mac host or each other, even though they boot successfully.

Why it happens: UTM’s shared networking subnet is set by the VlanGuestAddress key in each VM’s config.plist — it is not hardcoded by macOS. The gateway is always the .1 address of whatever subnet is configured there. If your cloud-init network-config assigns a static IP in a different subnet than VlanGuestAddress, the VM’s traffic won’t route through the vmnet gateway and connectivity will fail.

How to fix it: Make sure the subnet in your cloud-init network-config matches the VlanGuestAddress in config.plist. The deploy script sets both consistently to 192.168.64.0/24:

# In config.plist:
<key>VlanGuestAddress</key>
<string>192.168.64.0/24</string>

# In cloud-init network-config:
version: 2
ethernets:
  enp0s1:
    dhcp4: false
    addresses:
      - 192.168.64.10/24
    routes:
      - to: default
        via: 192.168.64.1

If you want to use a different subnet, change VlanGuestAddress in every VM’s plist and update the gateway and static IP assignments in cloud-init to match. The deploy script uses .10 through .43 within 192.168.64.0/24, leaving plenty of room. If another vmnet-based tool is already using that range, you can pick a different subnet — just keep plist and cloud-init in sync.
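A mismatch between the two files is easy to miss, so a tiny pre-flight check can catch it before you boot anything. The sketch below only handles /24 networks, which is what this setup uses, and simply compares the first three octets. The function name same_subnet_24 is hypothetical.

```bash
# Hypothetical helper: same_subnet_24 ADDR_A ADDR_B
# Succeeds when both IPv4 addresses share their first three octets.
# Any /prefix suffix is ignored; only meaningful for /24 networks.
same_subnet_24() {
    local a=${1%/*} b=${2%/*}      # drop a trailing "/24" if present
    [ "${a%.*}" = "${b%.*}" ]      # compare everything before the last octet
}
```

For the values above, same_subnet_24 192.168.64.10/24 192.168.64.0/24 succeeds, while a static IP of 10.0.0.5/24 against the same VlanGuestAddress fails.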

Quick Reference: UTM Diagnostics

When something goes wrong with UTM VMs, these commands help narrow down the issue:

# List all VMs and their status
utmctl list

# Check if the Mac can reach a VM
ping -c 3 192.168.64.10

# Check SSH connectivity
ssh -i ~/.ssh/k8slab.key -o ConnectTimeout=5 k8s@192.168.64.10 "hostname"

# Check cloud-init status inside a VM
ssh jump 'cloud-init status'

# View cloud-init logs for errors
ssh jump 'tail -n 50 /var/log/cloud-init-output.log'

# Check Mac ARP table for IP/MAC mappings
arp -a | grep 192.168.64

# Check Mac memory pressure
vm_stat | head -5

Where to Go Next

These gotchas cover the UTM-specific issues. For problems that hit during the Ansible deployment phase — Vault seal/unseal, certificate SANs, etcd quorum, Calico initialization — see the HA-Specific Gotchas post, which covers cross-tool issues that apply regardless of whether you’re running UTM, Vagrant, or OrbStack.

For the full deployment walkthrough, see the UTM HA deep dive. For the full roadmap from simple to HA, see From Simple to HA: A Learning Path for Kubernetes on Apple Silicon.

Questions, corrections, or want to share how you’re using these repos?

labitlearnit@gmail.com