HA-Specific Gotchas — Problems That Hit Every Tool

Some problems don’t care which virtualization tool you’re running. Whether you deploy the 11-VM HA cluster on UTM, Vagrant, or OrbStack, the Ansible roles are identical — and so are the gotchas they produce. This post covers HA-specific issues that hit at the Kubernetes, etcd, Vault, and Ansible layers regardless of your VM backend.

For tool-specific gotchas, see the UTM Gotchas, Vagrant Gotchas, and OrbStack Gotchas posts.

Gotcha #1: Vault Is Sealed After Every Restart

What happens: You restart the Vault VM (or reboot the Mac, which restarts all VMs), then try to run Ansible playbooks that issue certificates. They fail with “Vault is sealed” or “connection refused” errors. The Vault UI shows a seal status screen instead of the login page.

Why it happens: HashiCorp Vault uses Shamir’s Secret Sharing for its seal mechanism. Vault always starts in a sealed state: it knows where its storage backend is but can’t decrypt anything in it until enough unseal keys are provided. The initial vault-full-setup.yml playbook initializes Vault with 5 key shares and a threshold of 3, so any 3 of the 5 keys must be provided after every Vault restart to unseal it. This is a security feature, not a bug. It ensures that a stolen Vault server can’t be read without the unseal keys.

How to fix it: The deploy script creates a vault-unseal helper function in the jump server’s .bashrc that reads the unseal keys from the credentials file and applies them automatically:

			
# SSH to the jump server
ssh jump
# Run the unseal helper
vault-unseal
# Verify Vault is unsealed
vault status
# Should show: Sealed = false

		

If the vault-unseal function isn’t available (e.g., you’re on a fresh SSH session before .bashrc was configured), unseal manually:

			
# Read the first three unseal keys from the credentials file and apply each one
export VAULT_ADDR=http://vault:8200
python3 -c "
import json, sys
data = json.load(sys.stdin)
for key in data['unseal_keys_b64'][:3]:
    print(key)
" < ~/.vault-credentials/vault-init.json | while read -r key; do
    # /dev/null keeps vault from consuming the remaining keys on stdin
    vault operator unseal "$key" < /dev/null
done

		

For a deeper understanding of the Vault PKI setup and how the unseal mechanism integrates with the 3-tier CA hierarchy, see the Vault PKI deep dive.
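
When playbooks run unattended, it can help to check the seal state before anything tries to issue certificates. The sketch below relies on the exit codes of vault status (0 when unsealed, 2 when sealed, 1 on errors such as “connection refused”). The function name vault_seal_state is hypothetical and not part of the deploy scripts.

```shell
# vault_seal_state: print "unsealed", "sealed", or "error".
# Relies on the exit code of `vault status`: 0 = unsealed, 2 = sealed, 1 = error.
# Assumes VAULT_ADDR is already exported (for example http://vault:8200).
vault_seal_state() {
    rc=0
    vault status >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0) echo "unsealed" ;;
        2) echo "sealed" ;;
        *) echo "error" ;;
    esac
}
```

On the jump server, a guard such as `[ "$(vault_seal_state)" = "sealed" ] && vault-unseal` could then run before any certificate playbook.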

Gotcha #2: Certificate SAN Mismatches Break the API Server

What happens: The kube-apiserver fails to start, or kubelet nodes can’t connect to it. Error logs show “x509: certificate is valid for [list of IPs], not [the IP being used].” Or etcd clients get TLS handshake errors when connecting to etcd servers.

Why it happens: Every TLS certificate issued by the Vault PKI has Subject Alternative Names (SANs) that define which hostnames and IPs the certificate is valid for. The API server certificate must include SANs for: all master node IPs, the HAProxy IP, the Kubernetes service IP (first IP in the service CIDR, typically 10.96.0.1), kubernetes, kubernetes.default, kubernetes.default.svc, and localhost. If any component connects to the API server using an IP or hostname that’s not in the SANs, TLS verification fails.
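
The rule above (the address a client dials must appear in the SAN list) can be sketched as a small text check. The helper san_covers is a hypothetical name; it takes the comma-separated SAN line that openssl prints under “X509v3 Subject Alternative Name” and one address, and succeeds only on an exact match.

```shell
# san_covers SAN_LINE ADDRESS
# SAN_LINE looks like: "DNS:kubernetes, DNS:kubernetes.default, IP Address:10.96.0.1"
# Returns 0 if ADDRESS is one of the entries, 1 otherwise.
san_covers() {
    san_line="$1"
    address="$2"
    # One entry per line, trim whitespace, drop the type prefix up to the first colon,
    # then require an exact whole-line match so 10.96.0.1 does not match 10.96.0.10.
    printf '%s\n' "$san_line" | tr ',' '\n' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/^[^:]*://' \
        | grep -Fqx -- "$address"
}

# Example (run on a master):
#   sans=$(openssl x509 -in /etc/kubernetes/pki/kube-apiserver.crt -noout -text \
#          | grep -A1 "Subject Alternative Name" | tail -n 1)
#   san_covers "$sans" 10.96.0.1 && echo "covered"
```

The exact match matters here because a substring search would report 10.96.0.1 as present when only 10.96.0.10 is in the certificate.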

How to fix it: Check which SANs are on the certificate versus which address is being used:

			
# Inspect the API server certificate SANs
openssl x509 -in /etc/kubernetes/pki/kube-apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"
# The output should include all expected addresses
# (openssl prints IP entries as "IP Address:"):
# IP Address:192.168.x.31, IP Address:192.168.x.32, IP Address:192.168.x.10,
# IP Address:10.96.0.1, IP Address:127.0.0.1,
# DNS:kubernetes, DNS:kubernetes.default, etc.
# Check what address kubelet is using to reach the API server
cat /etc/kubernetes/kubelet-kubeconfig.yaml | grep server
# This must match one of the SANs

		

The most common SAN mismatch happens when you change the network prefix (e.g., switching between tools) but don’t reissue certificates. Certificates are bound to specific IPs. If the IPs change, the certs must be reissued:

			
# Re-issue all certificates from the jump server
ssh jump
cd ~/k8s-*-ha-homelab/ansible
ansible-playbook -i inventory/homelab.yml playbooks/k8s-certs.yml

Gotcha #3: etcd Cluster Won’t Form Quorum

What happens: The etcd-cluster.yml playbook completes, but etcdctl endpoint health shows one or more nodes as unhealthy. The Kubernetes API server can’t connect to etcd, and the entire cluster is stuck.

Why it happens: During initial bootstrap, etcd members find each other through the peer URLs listed in the initial-cluster flag, and the cluster can’t elect a leader until a majority of those members are connected. A 3-node etcd cluster needs at least 2 healthy members for quorum. If a node is slow to start, if its peer certificate has the wrong SANs, or if its peer URL uses an unreachable IP (the dual-NIC problem on Vagrant, the dual-IP problem on OrbStack), that member stays unhealthy, and once two members are affected the cluster can’t form quorum at all.
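
The majority rule is simple integer arithmetic, shown here as a minimal sketch. The function names quorum_size and fault_tolerance are illustrative only.

```shell
# quorum_size N: members that must be healthy in an N-member cluster (a majority).
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while the cluster keeps quorum.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# quorum_size 3      prints 2
# fault_tolerance 3  prints 1
```

This is also why even member counts add little: a 4-member cluster needs 3 healthy members and still tolerates only one failure.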

How to fix it: Diagnose which node is failing and why:

			
# Check etcd cluster health from the jump server
ssh jump
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/etcd-healthcheck-client.crt \
  --key=/etc/etcd/pki/etcd-healthcheck-client.key \
  endpoint health
# Check etcd logs on a failing node
ssh etcd-1
sudo journalctl -u etcd -n 50 --no-pager
# Common error patterns:
# "rejected connection" — peer certificate issue
# "dial tcp: connect: connection refused" — wrong IP or node not started
# "member not found" — initial-cluster config mismatch

		

If etcd is in a broken state after a failed bootstrap, the cleanest fix is to wipe the data directory on all nodes and re-run the playbook:

			
# On each etcd node, stop etcd and wipe data
for node in etcd-1 etcd-2 etcd-3; do
    ssh "$node" 'sudo systemctl stop etcd && sudo rm -rf /var/lib/etcd/*'
done
# Re-run the etcd playbook from jump
cd ~/k8s-*-ha-homelab/ansible
ansible-playbook -i inventory/homelab.yml playbooks/etcd-cluster.yml

		

Gotcha #4: HAProxy Health Checks Show Backend Down

What happens: HAProxy is running on port 6443, but kubectl commands from the jump server return “connection refused” or time out. Checking HAProxy stats (if enabled) shows both master backends as DOWN.

Why it happens: HAProxy performs TCP health checks against master-1:6443 and master-2:6443. If the kube-apiserver hasn’t started yet (because it’s waiting for etcd, or because certificates haven’t been deployed), HAProxy marks the backends as down. This is working as designed — HAProxy correctly reports that the API servers aren’t responding.

How to fix it: The problem is almost always in the API servers behind HAProxy, not in HAProxy itself. Check the API server on each master:

			
# Check if kube-apiserver is running on each master
ssh master-1 'sudo systemctl status kube-apiserver'
ssh master-2 'sudo systemctl status kube-apiserver'
# Check API server logs for errors
ssh master-1 'sudo journalctl -u kube-apiserver -n 30 --no-pager'
# Common causes:
# "etcd cluster is unavailable" — fix etcd first (see Gotcha #3)
# "certificate" errors — reissue certs (see Gotcha #2)
# "bind: address already in use" — port 6443 is taken by something else
# Test direct API server connectivity (bypassing HAProxy)
curl -k https://192.168.x.31:6443/healthz
curl -k https://192.168.x.32:6443/healthz
# Should return "ok" if the API server is healthy
# Check HAProxy config
ssh haproxy 'cat /etc/haproxy/haproxy.cfg'
# Verify backend IPs match the actual master IPs

		

The deploy script runs etcd and HAProxy setup in parallel (Step 15), then deploys the control plane (Step 16). HAProxy backends are expected to be DOWN until Step 16 completes. If backends are still DOWN after the full deployment, the problem is in the control plane setup.
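
Because the backends are expected to be DOWN for a while, a manual run can poll the health endpoint instead of failing on the first refused connection. This is a rough sketch with a hypothetical function name, wait_for_healthz; the retry count and delay are arbitrary defaults, not values taken from the deploy script.

```shell
# wait_for_healthz URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until the response body is exactly "ok". Returns 0 on success,
# 1 if every attempt failed. -k skips TLS verification, as in the curl checks above.
wait_for_healthz() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-5}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        body=$(curl -ks --max-time 5 "$url" 2>/dev/null) || body=""
        if [ "$body" = "ok" ]; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt is coming
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example: wait_for_healthz https://haproxy:6443/healthz 30 5 || echo "API still down"
```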

Gotcha #5: Calico Pods Stuck in Init or ContainerCreating

What happens: After deployment completes, kubectl get pods -A shows Calico pods in Init:2/3, ContainerCreating, or CrashLoopBackOff. Worker nodes show as NotReady in kubectl get nodes.

Why it happens: Calico needs time to initialize after first deployment. The calico-node DaemonSet runs an init container that installs the CNI plugin binaries and configuration, then the main container establishes BGP peering with other nodes. This process typically takes 1–2 minutes after the manifest is applied. During this window, pods show as initializing.

How to fix it: Wait 2 minutes and check again:

			
# Check Calico pod status
kubectl get pods -n kube-system -l k8s-app=calico-node
# If still stuck after 2 minutes, check the logs
kubectl logs -n kube-system -l k8s-app=calico-node -c install-cni
kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node
# Common issues:
# "Unable to connect to BIRDv4 socket" — BIRD daemon starting up, wait longer
# "felix not ready" — Felix is initializing, normal during startup
# "error getting ClusterInformation" — API server connectivity issue
# Verify nodes transition to Ready
kubectl get nodes -w  # Watch mode — nodes flip to Ready as Calico initializes

		

If Calico pods remain in CrashLoopBackOff for more than 5 minutes, the cause is usually one of two things: the API server isn’t reachable from the workers (check the kubeconfig and network), or the pod CIDR (10.244.0.0/16) conflicts with an existing network range. This behavior is consistent across all three tools.
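
Instead of re-running kubectl get by hand during the startup window, the wait can be scripted with kubectl rollout status and kubectl wait. The sketch assumes the DaemonSet is named calico-node in kube-system (the usual name for manifest-based installs, matching the label used above); wait_for_calico is a hypothetical helper.

```shell
# wait_for_calico [TIMEOUT]
# Blocks until the calico-node DaemonSet has rolled out and every node is Ready.
# TIMEOUT uses kubectl's duration syntax (default 180s) and applies to each step.
wait_for_calico() {
    timeout="${1:-180s}"
    # Step 1: every calico-node pod is up to date and ready
    kubectl rollout status daemonset/calico-node -n kube-system --timeout="$timeout" || return 1
    # Step 2: all nodes report the Ready condition
    kubectl wait --for=condition=Ready nodes --all --timeout="$timeout"
}
```

A non-zero exit after the timeout is the signal to start reading the install-cni and calico-node logs.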

Gotcha #6: Ansible SSH Control Sockets Go Stale

What happens: Ansible playbooks that previously worked start failing with SSH connection errors. The error messages mention “Control socket connect” or “mux_client” failures. Running individual SSH commands works fine, but Ansible can’t connect.

Why it happens: Ansible uses SSH connection multiplexing (ControlMaster) to share SSH connections across tasks. The control sockets are stored in ~/.ansible/cp/. If a VM is restarted, destroyed and recreated, or if the network changes, these sockets become stale. Ansible tries to reuse the dead socket instead of creating a new connection, and the task fails.

How to fix it: Delete the stale control sockets:

			
# On the Mac (if running Ansible from Mac)
rm -rf ~/.ansible/cp/*
# On the jump server (if running Ansible from jump)
ssh jump 'rm -rf ~/.ansible/cp/*'
# Then retry the playbook
ansible-playbook -i inventory/homelab.yml playbooks/ping.yml

		

The deploy scripts clean up control sockets automatically before running. This gotcha hits when you’re running playbooks manually or iterating on configuration changes between runs.

Gotcha #7: Binary Pre-Cache Markers Cause Skipped Downloads

What happens: Ansible roles report that binaries are already cached and skip the distribution step, but the binaries are actually missing or corrupt on the target nodes. Kubernetes components fail to start because the binaries aren’t in /usr/local/bin/.

Why it happens: The deploy script creates .pre-cached marker files in /tmp/k8s-binaries/, /tmp/etcd-cache/, and /tmp/containerd-cache/ on the jump server after copying binaries from the Mac. Ansible roles check for these markers — if they exist, the download step is skipped. But if the jump server was rebooted (which clears /tmp/) and the markers were recreated without the actual binary files, or if SCP partially failed, the markers exist but the binaries don’t.
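
One way to catch a marker without its binaries before a run is to check both together. The helper below is a sketch with a hypothetical name, check_cache; the file names passed to it are whatever that cache directory is expected to hold.

```shell
# check_cache DIR FILE...
# If DIR/.pre-cached exists but any expected FILE is missing or empty,
# remove the marker so the download step is not skipped, and return 1.
check_cache() {
    dir="$1"
    shift
    marker="$dir/.pre-cached"
    # No marker means the roles will download anyway, so nothing to fix
    [ -e "$marker" ] || return 0
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "stale marker: $dir/$f is missing or empty" >&2
            rm -f "$marker"
            return 1
        fi
    done
    return 0
}

# Example:
#   check_cache /tmp/etcd-cache etcd-v3.5.12-linux-arm64.tar.gz
#   check_cache /tmp/k8s-binaries kube-apiserver kubelet kubectl
```

The check uses -s rather than -e so that a zero-byte file left behind by a failed SCP also counts as missing.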

How to fix it: Verify that the actual binaries exist alongside the markers, and remove markers if binaries are missing:

			
# Check on the jump server
ssh jump
# Verify K8s binaries
ls -la /tmp/k8s-binaries/
# Should contain: kube-apiserver, kubelet, kubectl, etc.
# Verify etcd cache
ls -la /tmp/etcd-cache/
# Should contain: etcd-v3.5.12-linux-arm64.tar.gz
# Verify containerd cache
ls -la /tmp/containerd-cache/
# Should contain: containerd-1.7.24-linux-arm64.tar.gz, runc.arm64
# If binaries are missing but markers exist, remove markers
rm -f /tmp/k8s-binaries/.pre-cached
rm -f /tmp/etcd-cache/.pre-cached
rm -f /tmp/containerd-cache/.pre-cached
# Then re-run the deploy script (it will re-copy binaries)
# Or re-run just the binary caching step

		

Gotcha #8: kubeconfig Points to Wrong API Server Address

What happens: kubectl commands from the jump server return “Unable to connect to the server” or “dial tcp: connection refused.” The cluster is running but kubectl can’t reach it.

Why it happens: The admin kubeconfig deployed to the jump server contains the API server address. In the HA setup, this should point to the HAProxy load balancer IP (not a specific master). If the kubeconfig was generated with the wrong address (e.g., pointing to master-1 directly instead of haproxy), kubectl will fail when that specific master is down. Or if the network prefix changed between runs, the IP in the kubeconfig doesn’t match the current environment.
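
A quick guard can compare the address in the active kubeconfig context against the one you expect. This is a minimal sketch; kubeconfig_points_to is a hypothetical name, and the expected URL is supplied by the caller.

```shell
# kubeconfig_points_to EXPECTED_URL
# Returns 0 if the current context's cluster server equals EXPECTED_URL
# (for example https://192.168.x.10:6443, the HAProxy address), 1 otherwise.
kubeconfig_points_to() {
    expected="$1"
    # --minify limits the output to the current context's cluster
    actual=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
    if [ "$actual" = "$expected" ]; then
        return 0
    fi
    echo "kubeconfig server is '$actual', expected '$expected'" >&2
    return 1
}
```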

How to fix it: Check and fix the kubeconfig:

			
# Check what address kubectl is using
ssh jump
kubectl config view | grep server
# Should show: https://192.168.x.10:6443 (haproxy IP)
# NOT: https://192.168.x.31:6443 (direct master IP)
# Test connectivity to that address
curl -k https://192.168.x.10:6443/healthz
# If the address is wrong, re-run the control plane playbook
# which regenerates and deploys the kubeconfig
cd ~/k8s-*-ha-homelab/ansible
ansible-playbook -i inventory/homelab.yml playbooks/control-plane.yml

		

Gotcha #9: Service Account Signing Key Mismatch Between Masters

What happens: Service account tokens work when requests hit master-1 but fail with “unauthorized” when HAProxy routes to master-2 (or vice versa). Pods that use service accounts intermittently get authentication errors.

Why it happens: Both kube-apiserver instances and the kube-controller-manager must use the same service account signing key pair. Tokens are signed with the private key (by the controller-manager for legacy Secret-based tokens, and by the API server itself for bound tokens issued through the TokenRequest API), and each API server verifies them with the public key. If the key pair was generated separately on each master (rather than generated once and distributed), each master has a different key, and tokens signed by one can’t be verified by the other.
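
The “generated once and distributed” requirement can be verified by hashing the same file on every master and comparing the digests. The helper same_on_all is a hypothetical sketch; it assumes passwordless ssh and sudo on each host, and uses sha256sum where md5sum would work equally well for a comparison.

```shell
# same_on_all FILE HOST...
# Returns 0 if FILE has the same SHA-256 digest on every HOST,
# 1 on a mismatch, 2 if a digest could not be read.
same_on_all() {
    file="$1"
    shift
    first=""
    for host in "$@"; do
        sum=$(ssh "$host" "sudo sha256sum '$file'" | awk '{print $1}')
        if [ -z "$sum" ]; then
            echo "could not hash $file on $host" >&2
            return 2
        fi
        if [ -z "$first" ]; then
            first="$sum"
        elif [ "$sum" != "$first" ]; then
            echo "$file differs on $host" >&2
            return 1
        fi
    done
    return 0
}

# Example:
#   same_on_all /etc/kubernetes/pki/sa.pub master-1 master-2
#   same_on_all /etc/kubernetes/pki/sa.key master-1 master-2
```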

How to fix it: The Ansible roles handle this correctly by issuing the service account key pair once from the Vault PKI and distributing the same key to both masters. If you suspect a key mismatch:

			
# Compare the service account public key on both masters
ssh master-1 'md5sum /etc/kubernetes/pki/sa.pub'
ssh master-2 'md5sum /etc/kubernetes/pki/sa.pub'
# These MUST match
ssh master-1 'md5sum /etc/kubernetes/pki/sa.key'
ssh master-2 'md5sum /etc/kubernetes/pki/sa.key'
# These MUST also match
# If they don't match, re-run the certificate playbook
cd ~/k8s-*-ha-homelab/ansible
ansible-playbook -i inventory/homelab.yml playbooks/k8s-certs.yml
# Then restart the control plane
ansible-playbook -i inventory/homelab.yml playbooks/control-plane.yml

		

Gotcha #10: Ansible Galaxy Collection Not Installed on Jump

What happens: The Vault setup playbook fails with “ERROR! couldn’t resolve module/action ‘community.hashi_vault.vault_read’” or similar collection-not-found errors.

Why it happens: The Vault Ansible roles use the community.hashi_vault collection for interacting with Vault’s API. This collection isn’t included with Ansible by default — it must be installed separately with ansible-galaxy. If the jump server’s cloud-init finished installing Ansible but the deploy script ran the Vault playbook before the Galaxy collection was installed, the modules aren’t available.

How to fix it: Install the collection on the jump server:

			
# SSH to jump and install the collection
ssh jump
ansible-galaxy collection install community.hashi_vault
# Verify it's installed
ansible-galaxy collection list | grep hashi_vault
# Then retry the Vault playbook
cd ~/k8s-*-ha-homelab/ansible
ansible-playbook -i inventory/homelab.yml playbooks/vault-full-setup.yml

		

The deploy script installs this collection in Step 12 after confirming Ansible is available. If you’re running playbooks manually, make sure the collection is installed first.
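
For manual runs, the check and the install can be combined so the step is safe to repeat. This is a sketch only; ensure_collection is a hypothetical name, and it assumes the default ansible-galaxy collection list output, where each row starts with the collection name.

```shell
# ensure_collection NAME
# Installs the Ansible collection NAME only if it is not already listed.
ensure_collection() {
    name="$1"
    # Take the first column of each row and look for an exact name match
    if ansible-galaxy collection list 2>/dev/null | awk '{print $1}' | grep -Fqx -- "$name"; then
        echo "$name already installed"
    else
        ansible-galaxy collection install "$name"
    fi
}

# Example: ensure_collection community.hashi_vault
```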

Gotcha #11: Worker Node kubelet Fails to Register

What happens: Worker nodes don’t appear in kubectl get nodes after the worker playbook completes. The kubelet service is running but the node never registers with the API server.

Why it happens: kubelet registration requires: the kubelet client certificate to be valid and trusted by the API server, the kubeconfig to point to the correct API server address (HAProxy IP), and the API server to be reachable from the worker. Any failure in this chain causes silent registration failure — kubelet keeps trying but doesn’t log obvious errors unless you look carefully.

How to fix it: Check the kubelet logs and verify the connection chain:

			
# Check kubelet status and logs on a worker
ssh worker-1
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 50 --no-pager
# Look for these patterns:
# "Unable to register node" — certificate or API server issue
# "dial tcp: connection refused" — wrong API server address
# "x509: certificate" — TLS trust chain broken
# Verify the kubelet kubeconfig points to HAProxy
cat /etc/kubernetes/kubelet-kubeconfig.yaml | grep server
# Must be: https://192.168.x.10:6443
# Verify the kubelet client certificate is valid
openssl x509 -in /etc/kubernetes/pki/kubelet-worker-1.crt -noout -dates
openssl x509 -in /etc/kubernetes/pki/kubelet-worker-1.crt -noout -issuer
# Test API server connectivity from the worker
curl -k https://192.168.x.10:6443/healthz
# Should return "ok"

		

Quick Reference: HA Cluster Diagnostics

When the HA cluster is misbehaving, work through the stack from bottom to top:

			
# 1. Is Vault unsealed?
vault status
# If sealed: vault-unseal
# 2. Is etcd healthy?
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/etcd-healthcheck-client.crt \
  --key=/etc/etcd/pki/etcd-healthcheck-client.key \
  endpoint health
# 3. Are API servers running?
for m in master-1 master-2; do
    echo "--- $m ---"
    ssh "$m" 'sudo systemctl is-active kube-apiserver'
done
# 4. Is HAProxy routing correctly?
curl -k https://haproxy:6443/healthz
# 5. Are all nodes registered?
kubectl get nodes -o wide
# 6. Are all pods running?
kubectl get pods -A
# 7. Can pods communicate across nodes?
kubectl run test-1 --image=busybox --command -- sleep 3600
kubectl run test-2 --image=busybox --command -- sleep 3600
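# Wait for both to be running, then:
kubectl wait --for=condition=Ready pod/test-1 pod/test-2 --timeout=120s
kubectl exec test-1 -- ping -c 3 "$(kubectl get pod test-2 -o jsonpath='{.status.podIP}')"
# Clean up the test pods afterwards
kubectl delete pod test-1 test-2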

		

Where to Go Next

These cross-tool gotchas cover the HA-specific issues. For tool-specific problems (networking quirks, VM lifecycle issues, provider-specific failures), see the dedicated posts: UTM Gotchas, Vagrant Gotchas, and OrbStack Gotchas.

For the full deployment walkthroughs, see the UTM HA deep dive, Vagrant HA deep dive, and OrbStack HA deep dive. For the full roadmap from simple to HA, see From Simple to HA: A Learning Path for Kubernetes on Apple Silicon.

Big tech, small lab. One reel at a time.

Questions, corrections, or want to share how you’re using these repos?

labitlearnit@gmail.com