Why Your Homelab K8s Cluster Isn’t Production-Ready (And How to Fix It)

You built a Kubernetes cluster. It works. Pods deploy, services route, kubectl responds. And yet, if you held it up against even a basic production checklist, it would fail on almost every line item.

That’s fine — it’s a homelab. But the gap between “works” and “production-ready” is exactly where the most valuable learning happens. This post walks through the most common shortcuts in homelab K8s clusters, explains why each one matters, and shows how to fix them using the same Ansible automation and Vault PKI infrastructure covered across this series.

If you’re new to the lab, start with the UTM vs Vagrant vs OrbStack comparison to pick your virtualization tool. Everything below applies regardless of which tool you choose.

Figure: side-by-side comparison of a typical homelab Kubernetes cluster (single master with embedded etcd, single self-signed CA, direct SSH from Mac to every node) and the HA project architecture (jump bastion, Vault 3-tier PKI hierarchy, HAProxy load balancer, two masters, 3-node etcd with mTLS, workers connecting via the load balancer). Caption: The 10-item gap, visualized. Same workloads, same kubectl, completely different blast radius.

The Checklist

Here’s the production-readiness audit. Most homelab clusters fail on six or more of these. Each section explains the problem, the risk, and the fix.

1. Single Control Plane Node

The shortcut: One master node running kube-apiserver, controller-manager, and scheduler. The cluster works perfectly — until you reboot the master or it runs out of memory. When the single master goes down, kubectl stops responding, no new pods get scheduled, and existing pods on workers keep running but can’t be managed. The cluster isn’t “down” in the traditional sense — workloads survive — but you’ve lost all control plane operations.

Why it matters: In production, the control plane is the brain. Losing it means no scaling, no deployments, no rolling updates, no self-healing. If the master fails in the middle of a deployment, the rollout stays half-finished, with no way to complete it or roll it back until the master returns. For a homelab, a single-master failure is a learning inconvenience. For production, it’s an incident.

The fix: Run at least two control plane nodes behind a load balancer. The HA setup across all three projects in this series uses two masters behind HAProxy. HAProxy listens on port 6443 and round-robins API server requests to both masters with health checks. If one master goes down, its health check fails and all traffic routes to the surviving one within seconds. The kubeconfig on every node and on the jump server points to the HAProxy address, not to any individual master. This makes a master failure nearly invisible to clients: a request in flight to the failed master may error out, but the retry lands on the healthy one.

The UTM 17-step deployment post covers the HAProxy configuration in detail (Step 15). The Ansible haproxy.yml playbook handles the entire setup — install, configure, and verify in under 45 seconds.
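To make the load-balancing idea concrete, here is a minimal sketch of the kind of HAProxy stanza described above, generated by a small shell function. It is not the project's actual haproxy.yml output: the frontend and backend names and the master IP addresses are hypothetical, and the global/defaults sections (timeouts, logging) are left out.

```shell
# Print a minimal HAProxy TCP config fragment for the Kubernetes API.
# Arguments are "name:ip" pairs; every name and address here is hypothetical.
render_haproxy_cfg() {
    cat <<'EOF'
frontend k8s-api
    bind *:6443
    mode tcp
    default_backend k8s-masters

backend k8s-masters
    mode tcp
    balance roundrobin
    option tcp-check
EOF
    for pair in "$@"; do
        name=${pair%%:*}   # text before the first colon
        ip=${pair#*:}      # text after the first colon
        # "check" turns on health checks, so a dead master leaves the rotation
        printf '    server %s %s:6443 check\n' "$name" "$ip"
    done
}

render_haproxy_cfg master-1:10.0.0.11 master-2:10.0.0.12
```

The TCP mode matters here: HAProxy passes the TLS connection straight through, so the API server's own certificate is what clients verify.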

2. Self-Signed Certificates (or Worse, No TLS)

The shortcut: Running openssl req -x509 once, generating a single self-signed CA, and using it for everything. Or letting kubeadm handle it (it generates its own self-signed CAs and signs every certificate for you, so you never inspect any of them). Or the worst case: disabling TLS verification with --insecure-skip-tls-verify and calling it done.

Why it matters: A flat certificate structure — one CA signing everything — means there’s no blast radius containment. If that single CA’s private key leaks, an attacker can forge a certificate for any identity in the cluster: impersonate the API server, create fake kubelet identities, or connect directly to etcd and read every secret stored in the cluster. With a single CA, there’s no cryptographic boundary between the Kubernetes control plane, the etcd data store, and the API aggregation layer.

The fix: Separate CAs with a proper hierarchy. The project uses a 3-tier PKI with HashiCorp Vault: a root CA (the trust anchor, which signs only the intermediate), an intermediate CA (which signs the leaf CAs), and three leaf CAs: one for Kubernetes components, one for etcd, and one for the front proxy. A compromised etcd certificate can’t be used to authenticate to the Kubernetes API. A stolen kubelet cert can’t access the etcd data store. Each trust domain is cryptographically isolated.

The Vault PKI deep dive explains the entire hierarchy — why three CAs, how the TTLs are designed (365-day root, 180-day intermediate, 90-day leaf), and how the three Ansible roles (vault-bootstrap, vault-pki-setup, k8s-certs) automate it end to end.
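As a rough illustration of how those TTLs translate into Vault settings, the sketch below converts each tier's lifetime from days to hours and prints the Vault CLI commands that would mount a PKI engine and cap its TTL. It only prints the commands, it does not run them, and the mount paths (pki_root, pki_int, and so on) are hypothetical names, not necessarily the ones the Ansible roles use.

```shell
# Convert a TTL in days to the hour string Vault accepts (e.g. 90 -> 2160h).
days_to_hours() {
    echo "$(( $1 * 24 ))h"
}

# Print (not run) the commands that mount one PKI engine and cap its TTL.
# Mount paths are hypothetical names chosen for this sketch.
plan_pki_mount() {
    mount=$1
    days=$2
    echo "vault secrets enable -path=$mount pki"
    echo "vault secrets tune -max-lease-ttl=$(days_to_hours "$days") $mount"
}

plan_pki_mount pki_root 365            # trust anchor
plan_pki_mount pki_int 180             # intermediate
for leaf in pki_k8s pki_etcd pki_front_proxy; do
    plan_pki_mount "$leaf" 90          # one CA per trust domain
done
```

Each tier lives for roughly half as long as the one above it (8760h, 4320h, 2160h), which leaves room to renew a lower tier before its parent expires.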

3. No Bastion / Jump Server

The shortcut: Every node in the cluster is directly accessible from your Mac via SSH. All IPs are reachable, all ports are open, and your SSH config has an entry for every node. It’s convenient — ssh master-1 from anywhere.

Why it matters: In production, cluster nodes sit in private subnets with no direct internet or host access. The attack surface of exposing SSH on every node is significant: each open SSH port is a potential entry point, and if one node is compromised, lateral movement to every other node is trivial since they’re all on the same flat network with the same SSH key. A bastion host creates a chokepoint — all access flows through a single, hardened entry point that can be monitored, rate-limited, and audited.

The fix: A dedicated jump server that serves as both the SSH bastion and the Ansible controller. In this project, the Mac only has SSH config for jump. To reach any other node, you go through jump: ssh jump, then ssh master-1. The jump server’s ~/.ssh/config has entries for all 10 other VMs. All Ansible playbooks run from jump, not from the Mac. This mirrors the production pattern where an operations host (or CI/CD runner) sits inside the private network and manages the cluster.

The bastion architecture is consistent across all three virtualization tools. The OrbStack post and the Vagrant post both detail how the jump server is provisioned and configured for each platform.
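For a sense of what the jump server's SSH configuration might look like, here is a small sketch that prints one ssh_config stanza per node. The user name, key path, node names, and IP addresses are all assumptions for illustration. The Mac's own config would contain only a single stanza like these, for jump.

```shell
# Assumed defaults for this sketch; adjust to match your VMs.
SSH_USER=ubuntu
SSH_KEY='~/.ssh/id_ed25519'

# Print one ssh_config stanza per "name:ip" argument.
render_ssh_hosts() {
    for pair in "$@"; do
        printf 'Host %s\n' "${pair%%:*}"
        printf '    HostName %s\n' "${pair#*:}"
        printf '    User %s\n' "$SSH_USER"
        printf '    IdentityFile %s\n\n' "$SSH_KEY"
    done
}

# Hypothetical subset of the nodes behind the bastion.
render_ssh_hosts master-1:10.0.0.11 master-2:10.0.0.12 etcd-1:10.0.0.21
```

On the jump server the output could be appended to ~/.ssh/config, after which ssh master-1 works from jump and nowhere else.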

4. Single etcd Node

The shortcut: One etcd instance, usually running on the same machine as the API server. It works — etcd is just a key-value store, right?

Why it matters: etcd is the source of truth for the entire cluster. Every resource (pods, services, secrets, configmaps, RBAC policies) lives in etcd. With a single etcd node, one disk failure loses the cluster state, and one process crash takes the control plane’s datastore offline until etcd restarts. Even with backups, restoring etcd from a snapshot means losing any changes made between the last backup and the failure. More importantly, a single etcd node doesn’t teach you how etcd’s Raft consensus protocol actually works. You never experience leader election, quorum requirements, or the behavior of a cluster when a member goes down and comes back.

The fix: A 3-node etcd cluster with mutual TLS. Three nodes give you fault tolerance for one node failure (Raft requires a majority — 2 out of 3 — for quorum). The HA setup deploys three dedicated etcd VMs, each with its own server certificate (with node-specific SANs), peer certificate (for inter-node Raft replication), and client certificate. The API server connects to all three etcd endpoints and etcd handles the leader routing internally.
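The majority rule is easy to work out by hand, but a few lines of shell arithmetic make the pattern visible. This sketch computes the quorum size and the number of member failures a cluster of a given size can survive; the function names are made up for this example.

```shell
# Raft needs a strict majority of members to commit a write.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster keeps quorum.
fault_tolerance() {
    echo $(( $1 - $(quorum "$1") ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

Two members tolerate no more failures than one, and four tolerate no more than three, which is why etcd clusters are usually sized at odd numbers.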

Each etcd node runs with --client-cert-auth=true and --peer-client-cert-auth=true, meaning both client and peer connections require valid certificates signed by the etcd CA. No certificate, no access — even if you can reach the etcd port.

5. kubeadm Hiding the Complexity

The shortcut: kubeadm init and kubeadm join. Two commands and the cluster exists. Certificates are auto-generated, and etcd and the control plane components all run as static pods managed by kubelet on the same node. It’s fast, reliable, and officially supported.

Why it matters: kubeadm is excellent for getting a cluster running quickly. But it deliberately hides the details that matter most for understanding Kubernetes internals. You don’t learn which certificates the API server needs, how etcd bootstraps a cluster, what flags the controller-manager requires, or how kubelet registers with the API server. When something breaks in a kubeadm cluster, the debugging is harder because you didn’t build the layer that failed.

The fix: Install Kubernetes the hard way — from raw binaries. The project downloads kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy, and kubectl as individual ARM64 binaries. Each component gets its own systemd unit file with explicit configuration flags. There’s no abstraction layer. When the API server won’t start, you read its systemd journal and see the exact flag that’s wrong. When kubelet can’t register, you check its kubeconfig and certificate paths.

The hard-way approach also means control plane components run as systemd services on the master nodes, not as pods managed by kubelet. This is why masters don’t appear in kubectl get nodes output — only workers run kubelet and register with the API server. It’s a deliberate architectural choice that teaches you the difference between the control plane and the data plane.
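To show what "explicit configuration flags" looks like in practice, here is a deliberately incomplete sketch that renders a systemd unit for kube-apiserver. The flags shown are real kube-apiserver flags, but the binary path, certificate paths, and etcd IP addresses are hypothetical, and a working unit needs more flags than this (service account keys, service IP range, and others).

```shell
# Join etcd IPs into the comma-separated URL list kube-apiserver expects.
etcd_servers() {
    list=""
    for ip in "$@"; do
        list="${list:+$list,}https://$ip:2379"
    done
    echo "$list"
}

# Print a minimal unit file; binary and certificate paths are hypothetical.
render_apiserver_unit() {
    cat <<EOF
[Unit]
Description=Kubernetes API Server

[Service]
ExecStart=/usr/local/bin/kube-apiserver \\
  --etcd-servers=$(etcd_servers "$@") \\
  --etcd-cafile=/etc/kubernetes/pki/etcd-ca.crt \\
  --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt \\
  --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key \\
  --client-ca-file=/etc/kubernetes/pki/ca.crt \\
  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt \\
  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

render_apiserver_unit 10.0.0.21 10.0.0.22 10.0.0.23
```

With a unit like this in place, journalctl -u kube-apiserver is where a mistyped flag or a wrong certificate path shows up.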

6. No Automation (Manual Setup)

The shortcut: Following a tutorial step by step, copying and pasting commands into each node one at a time. The cluster works at the end, but reproducing it means following the same tutorial again. And if you need to change one thing — a different IP range, an extra worker node — you’re back to manual edits across multiple nodes.

Why it matters: In production, infrastructure that can’t be recreated from code is a liability. If the cluster dies and the only knowledge of how to rebuild it lives in someone’s browser history, that’s a bus-factor-one situation. Automation isn’t just about speed — it’s about reproducibility, version control, and the ability to iterate without fear of breaking things permanently.

The fix: Ansible for everything. The project uses Ansible roles for every phase of the deployment: Vault bootstrap, PKI configuration, certificate issuance, etcd clustering, HAProxy setup, control plane deployment, and worker node configuration. The playbooks are idempotent: run them twice and the second run changes nothing. The entire cluster can be destroyed and recreated from scratch in roughly 6 to 8 minutes (UTM HA: 6m 13s, OrbStack HA: 7m 26s, Vagrant HA: 8m 10s).

The Ansible roles are identical across UTM, Vagrant, and OrbStack. The only thing that changes is the inventory file — different IPs for different virtualization layers. This separation was a deliberate design decision covered in the comparison post.
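One way to check the idempotency claim for yourself is to look at the PLAY RECAP of a second run: every host should report changed=0. The sketch below is a small helper, not part of the project, that reads a recap from stdin and succeeds only if no host changed. The playbook and inventory names in the usage comment are hypothetical.

```shell
# Succeeds only if the Ansible PLAY RECAP on stdin has at least one host
# line and every host line reports changed=0.
recap_is_idempotent() {
    awk '
        /changed=/ { hosts++ }
        /changed=/ && !/changed=0( |$)/ { dirty++ }
        END { exit ((hosts > 0 && dirty == 0) ? 0 : 1) }
    '
}

# Sample recap lines in the format Ansible prints at the end of a run.
sample_recap='master-1 : ok=12 changed=0 unreachable=0 failed=0
worker-1 : ok=9 changed=0 unreachable=0 failed=0'

printf '%s\n' "$sample_recap" | recap_is_idempotent && echo "second run changed nothing"

# Hypothetical real-world usage, run from the jump server:
#   ansible-playbook -i inventory.ini site.yml
#   ansible-playbook -i inventory.ini site.yml | recap_is_idempotent
```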

7. No Certificate Rotation Strategy

The shortcut: Certificates are generated once during setup and never touched again. Maybe they have a 10-year TTL. Maybe they have a 1-year TTL and you’ll deal with it when they expire (you won’t).

Why it matters: Certificates expire. When they do, components stop trusting each other and the cluster breaks — often with cryptic TLS handshake errors that don’t obviously point to expired certs. In production, certificate rotation is a regular operational task. Short-lived certificates (90 days or less) are a security best practice because they limit the window of exposure if a certificate is compromised.

The fix: Vault’s PKI engine makes rotation straightforward. Because certificates are issued via API calls (not manual openssl commands), re-running the k8s-certs Ansible role issues fresh certificates from Vault and deploys them to every node. The leaf CAs have 90-day TTLs, so rotation has to be routine, and with Vault it is as simple as re-running a playbook. Vault also supports CRL (Certificate Revocation List) publishing, so a compromised certificate can be revoked right away instead of waiting for it to expire.

Compare this to the openssl approach: rotating a certificate means regenerating it manually, copying it to every node that uses it, and restarting every affected service. With Vault + Ansible, it’s one command.
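Deciding when to re-run that command comes down to simple date arithmetic. Here is a sketch that works on Unix timestamps so it runs anywhere; the 30-day renewal threshold is an assumption for illustration, not a value taken from the project.

```shell
# Whole days between "now" and a certificate's expiry, both as Unix epochs.
days_left() {
    echo $(( ($1 - $2) / 86400 ))
}

# Succeeds when the cert is inside the renewal window.
# Arguments: expiry epoch, current epoch, threshold in days.
needs_rotation() {
    [ "$(days_left "$1" "$2")" -lt "$3" ]
}

expiry=$(( 90 * 86400 ))   # a cert issued at epoch 0 with a 90-day TTL
now=$(( 70 * 86400 ))      # pretend today is day 70
if needs_rotation "$expiry" "$now" 30; then
    echo "rotate: $(days_left "$expiry" "$now") days left"
fi
```

Against a real certificate file, openssl x509 -checkend 2592000 -noout -in cert.pem performs the same 30-day check and reports the result through its exit status.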

8. etcd Without TLS (or Without Mutual TLS)

The shortcut: etcd running with --client-cert-auth=false and no peer TLS. Any process that can reach port 2379 can read and write the entire cluster state — including all Kubernetes secrets (which are base64-encoded, not encrypted, in etcd by default).

Why it matters: etcd stores everything. If an attacker gains network access to the etcd port without TLS authentication, they can dump every secret, modify RBAC policies to grant themselves cluster-admin, or delete critical resources. Even in a homelab, understanding why etcd access must be restricted is essential for building the right security instincts.

The fix: Full mutual TLS on all etcd connections. The project configures etcd with separate server, peer, and client certificates — all signed by the dedicated etcd CA. The API server authenticates to etcd using an etcd client certificate. Peer communication between etcd nodes uses peer certificates. Health checks use a dedicated healthcheck client certificate. No anonymous access is possible.
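The flags involved are worth seeing in one place. This sketch prints the TLS-related etcd flags for a single member; the flags themselves are real etcd options, while the PKI directory and the file naming scheme are assumptions made for the example.

```shell
# Assumed location and naming of the certificates for this sketch.
PKI_DIR=/etc/etcd/pki

# Print the TLS-related etcd flags for one member, one flag per line.
etcd_tls_flags() {
    node=$1
    cat <<EOF
--client-cert-auth=true
--trusted-ca-file=$PKI_DIR/etcd-ca.crt
--cert-file=$PKI_DIR/$node-server.crt
--key-file=$PKI_DIR/$node-server.key
--peer-client-cert-auth=true
--peer-trusted-ca-file=$PKI_DIR/etcd-ca.crt
--peer-cert-file=$PKI_DIR/$node-peer.crt
--peer-key-file=$PKI_DIR/$node-peer.key
EOF
}

etcd_tls_flags etcd-1

# A health check then has to present a client certificate, for example:
#   etcdctl --endpoints=https://10.0.0.21:2379 \
#     --cacert=$PKI_DIR/etcd-ca.crt \
#     --cert=$PKI_DIR/healthcheck-client.crt \
#     --key=$PKI_DIR/healthcheck-client.key \
#     endpoint health
```

The first four flags cover client-to-server traffic on port 2379; the peer flags cover replication between members.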

9. No Network Policy / Flat Pod Network

The shortcut: Installing a CNI plugin (Flannel, Calico, Cilium) and leaving the default configuration. Every pod can talk to every other pod across all namespaces. Every pod can reach the API server, the metadata endpoint, and external services.

Why it matters: A flat network is Kubernetes’s default, and it’s deliberately permissive. But in production, network policies are the firewall between namespaces and workloads. Without them, a compromised pod in the frontend namespace can reach the database pods in the backend namespace, exfiltrate data to external endpoints, or pivot to the API server.

The fix: The project deploys Calico 3.28.0 as the CNI, which supports Kubernetes NetworkPolicy resources. Calico can enforce ingress and egress rules at the pod level. The next step (as an exercise for readers) is writing NetworkPolicy manifests that restrict cross-namespace traffic and limit egress to known endpoints. The cluster is already running a CNI that supports it — the policies just need to be applied.
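As a starting point for that exercise, here is a minimal sketch of two standard NetworkPolicy manifests, generated per namespace: one that denies all ingress, and one that re-allows traffic between pods in the same namespace. The policy names and the backend namespace are hypothetical, and real workloads will usually need further allow rules (ingress controllers, monitoring, and so on).

```shell
# Print a NetworkPolicy that denies all ingress to every pod in a namespace.
default_deny_ingress() {
    cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}

# Print a NetworkPolicy that allows ingress only from pods in the same namespace.
allow_same_namespace() {
    cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
EOF
}

default_deny_ingress backend
echo "---"
allow_same_namespace backend
```

Piping the combined output into kubectl apply -f - would apply both policies; with them in place, a pod in another namespace can no longer open connections to pods in backend.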

10. No Monitoring or Observability

The shortcut: No Prometheus, no Grafana, no alerting. The only monitoring is running kubectl get pods and eyeballing the output.

Why it matters: You can’t fix what you can’t see. In production, observability is how you detect problems before they become outages. Prometheus metrics expose control plane health (API server latency, etcd leader changes, scheduler queue depth), node health (CPU, memory, disk pressure), and workload health (pod restarts, OOMKills, failed probes). Without monitoring, you’re flying blind.

The fix: This is an area the project doesn’t cover yet — and that’s intentional. The cluster is designed to be a clean foundation. Adding Prometheus + Grafana on top is a natural next step. The hard-way cluster exposes all the standard Kubernetes metrics endpoints, and the etcd cluster exposes its own metrics on port 2381. The infrastructure is ready for monitoring — it just needs the monitoring stack deployed.
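Before deploying a full monitoring stack, it can help to read a metrics endpoint by hand. The sketch below extracts a single unlabelled metric from Prometheus text-format output. The sample input is illustrative, and the IP address in the usage comment is hypothetical; whether the endpoint speaks plain HTTP or HTTPS depends on how etcd's metrics listener is configured.

```shell
# Read Prometheus text-format metrics on stdin and print the value of one
# unlabelled metric. Comment lines (# HELP, # TYPE) never match because
# the first field must equal the metric name exactly.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Illustrative sample in the Prometheus exposition format.
sample='# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1'

printf '%s\n' "$sample" | metric_value etcd_server_has_leader

# Hypothetical usage against a live etcd member:
#   curl -s http://10.0.0.21:2381/metrics | metric_value etcd_server_has_leader
```

A value of 1 for etcd_server_has_leader means the member currently sees a Raft leader, which makes it a reasonable first alert once Prometheus is in place.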

The Full Scorecard

Here’s where a typical homelab cluster stands against this checklist, compared to the HA setup in this project:

Item	Typical Homelab	This Project (HA)
HA control plane	Single master	2 masters + HAProxy
TLS certificates	Self-signed / kubeadm	Vault 3-tier CA
Bastion / jump server	Direct access	Dedicated jump host
etcd clustering	Single node	3-node with mTLS
Installation method	kubeadm	Hard way (binaries)
Automation	Manual / scripts	Full Ansible
Cert rotation	None	Vault API + Ansible
etcd TLS	None or one-way	Full mTLS
Network policies	CNI only, no policies	Calico deployed, policies TBD
Monitoring	None	Not yet (infra ready)

An entry in the project column doesn’t mean “perfect.” For the first eight rows it means the production pattern is implemented; the last two are marked as still open. There’s always more to harden: RBAC fine-tuning, pod security standards, secrets encryption at rest, audit logging, etcd backups. But fixing the items in this list gets a homelab cluster from “demo” to “would survive a basic production review.”

Where to Start

If your current homelab cluster matches most of the “Typical Homelab” column above, don’t try to fix everything at once. Here’s a reasonable progression:

Start with automation. Get your current setup into Ansible or a similar tool. Even if the architecture is single-master with self-signed certs, having it automated means you can iterate quickly. Tear down, change one thing, rebuild.

Add the bastion. This is the cheapest architectural improvement — one extra VM that becomes the single entry point. It changes your SSH workflow slightly but adds a real security boundary.

Fix the certificates. Replace self-signed certs with a Vault-managed PKI hierarchy. This is the biggest learning curve but also the most valuable skill for production work.

Scale to HA. Add the second master, the HAProxy load balancer, and the 3-node etcd cluster. This is where the architecture starts to feel real.

For a structured walkthrough of this entire progression — from a 6-VM simple cluster to the full 11-VM HA setup — see From Simple to HA: A Learning Path for Kubernetes on Apple Silicon.

Or skip the progression entirely and clone one of the HA repos — all of this is already built and automated:

# Pick your tool and go
git clone https://github.com/labitlearnit/k8s-utm-ha-homelab.git       # Full VMs, fastest
git clone https://github.com/labitlearnit/k8s-vagrant-ha-homelab.git   # Declarative, Vagrantfile
git clone https://github.com/labitlearnit/k8s-orbstack-ha-homelab.git  # Lightweight, lowest RAM

The Deeper Dives

This post is the “what” and “why.” The rest of the series covers the “how”:

From Simple to HA: A Learning Path for Kubernetes on Apple Silicon — the structured progression from 6-VM simple to 11-VM HA, with a step-by-step learning order and working code at every level.

UTM vs Vagrant vs OrbStack — the same cluster built six different ways, with deployment times, resource usage, and networking compared side by side.

Building an 11-VM HA Cluster on UTM — the full 17-step deployment flow, from cloud image download to working cluster.

Vagrant + QEMU + Ansible — how the Vagrantfile, QEMU provider, and socket_vmnet work together on Apple Silicon.

OrbStack for Kubernetes — 11 VMs on minimal RAM, the lightweight path.

Vault PKI: 3-Tier CA the Right Way — the deep dive into certificate hierarchy, Vault operations, and the three Ansible roles.

All source code is on GitHub. Star the repos if you find them useful.

Big tech, small lab. One reel at a time.

Questions, corrections, or want to share how you’re using these repos?

labitlearnit@gmail.com