Understanding etcd Quorum — Why 3 Nodes, Never 2 or 4

Every Kubernetes cluster stores its entire state — deployments, services, secrets, config maps, node registrations — in etcd. If etcd stops accepting writes, the cluster is operationally dead. Not crashed — pods keep running on workers — but dead in the sense that nothing can change. No new deployments, no scaling, no healing.

The HA clusters in this project (UTM, Vagrant, OrbStack) all run 3-node etcd clusters. The Simple clusters (UTM, Vagrant, OrbStack) deliberately run a single etcd node so you can experience exactly what breaks when there’s no redundancy: stop that one node and quorum is gone. This post explains why the number 3 matters, how the Raft consensus protocol makes it work, and what you can do on your homelab clusters to see it firsthand.

What Quorum Actually Means

Quorum is the minimum number of nodes that must agree before a write is committed. In etcd, the formula is simple:

quorum = ⌊n/2⌋ + 1

For a 3-node cluster, that’s 2. For a 5-node cluster, that’s 3. The cluster can tolerate n – quorum failures — one failure with 3 nodes, two failures with 5 nodes.

Here’s the math laid out:

Cluster Size	Quorum	Failures Tolerated
1	1	0
2	2	0
3	2	1
4	3	1
5	3	2
6	4	2
7	4	3

Notice the pattern: 2 nodes tolerates zero failures — same as 1 node. Adding a second etcd node buys you nothing except added complexity. You need both nodes to agree, and if either dies, quorum is lost. This is why 2 is worse than useless — it creates the illusion of redundancy.

Similarly, 4 nodes tolerates one failure — same as 3 nodes. You’ve added an extra node (more RAM, more disk, more network traffic for peer replication) without improving fault tolerance. The jump from 3 to 5 is what gets you the next level — two tolerated failures.

This is why production etcd clusters run odd numbers: 3, 5, or 7. Each step to the next odd size tolerates one more failure. Stepping to an even size only adds cost.
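The table above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function names `quorum` and `failures_tolerated` are made up for illustration and are not part of etcd.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster: floor(n/2) + 1."""
    return n // 2 + 1


def failures_tolerated(n: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return n - quorum(n)


# Print the same columns as the table: size, quorum, failures tolerated.
for n in range(1, 8):
    print(n, quorum(n), failures_tolerated(n))
```

Comparing `failures_tolerated(3)` with `failures_tolerated(4)` shows the even-size penalty directly: both return 1.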

Figure: Four etcd clusters of 2, 3, 4, and 5 nodes, each shown at its maximum survivable failure state. The “tolerates N” line shows why 4 nodes tolerate no more failures than 3, and why 2 nodes tolerate no more than 1.

How Raft Consensus Works

etcd implements the Raft consensus algorithm. Raft is designed to be understandable — it was created specifically because Paxos (the previous gold standard) was notoriously hard to reason about. Raft breaks consensus into three subproblems: leader election, log replication, and safety.

Leader election. At any point, one etcd node is the leader and the others are followers. The leader handles all client writes. If the leader stops sending heartbeats (because it crashed, got network-partitioned, or was stopped for maintenance), followers notice the silence. After a randomized election timeout (the Raft paper suggests 150-300 ms; etcd’s default election timeout is 1000 ms, with a 100 ms heartbeat interval), a follower promotes itself to candidate and requests votes from the other nodes. If it gets a majority, it becomes the new leader. The randomized timeout makes split votes unlikely: two followers rarely time out at the same instant, so one usually starts its election first and collects a majority before the other becomes a candidate.
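To make the role of the randomized timeout concrete, here is a minimal sketch in Python. It is a toy model, not etcd’s implementation: the helper names, the 150-300 ms range, and the 10 ms round-trip figure used to flag a possible split vote are all assumptions chosen for illustration.

```python
import random


def draw_timeouts(nodes, low_ms=150, high_ms=300, rng=None):
    """Give each follower an independent random election timeout (ms)."""
    rng = rng or random.Random()
    return {node: rng.uniform(low_ms, high_ms) for node in nodes}


def first_candidate(timeouts):
    """The follower whose timer fires first becomes the candidate."""
    return min(timeouts, key=timeouts.get)


def split_vote_risk(timeouts, rtt_ms=10):
    """True if a second timer fires before the first candidate's vote
    request could plausibly have arrived (a deliberate simplification)."""
    ordered = sorted(timeouts.values())
    return len(ordered) > 1 and ordered[1] - ordered[0] < rtt_ms


# The leader (etcd-1) has stopped; the two followers race their timers.
timeouts = draw_timeouts(["etcd-2", "etcd-3"], rng=random.Random(7))
print("candidate:", first_candidate(timeouts))
print("split vote possible:", split_vote_risk(timeouts))
```

The wider the timeout range is relative to the network round trip, the less often `split_vote_risk` returns True, which is the intuition behind randomizing the timer.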

Terms. Raft divides time into terms — monotonically increasing integers. Each term begins with an election. If a candidate wins, it serves as leader for the rest of that term. If no one wins (split vote), the term ends with no leader and a new term starts. Terms act as a logical clock — any node that sees a message with a higher term number knows its own information is stale.
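The term rule can be sketched as a single function. The `Node` class and `observe_term` helper below are hypothetical names for illustration; a real Raft node tracks much more state than this.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    term: int = 0
    role: str = "follower"  # "follower", "candidate", or "leader"


def observe_term(node: Node, message_term: int) -> bool:
    """Apply the term rule to an incoming message.

    Returns True if the message should be processed, False if it is stale.
    """
    if message_term > node.term:
        # Someone is ahead of us: adopt the newer term and step down.
        node.term = message_term
        node.role = "follower"
        return True
    # Same term is fine; a lower term means the sender is out of date.
    return message_term == node.term


old_leader = Node("etcd-1", term=3, role="leader")
observe_term(old_leader, 5)   # a message from term 5 arrives
print(old_leader)             # etcd-1 is now a follower in term 5
```

This is how a leader that was partitioned away learns it has been replaced: the first message it sees from the newer term demotes it.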

Log replication. When a client sends a write to the leader (e.g., “create deployment nginx”), the leader appends the entry to its log and sends it to all followers. Once a majority of nodes have written the entry to their logs and acknowledged it, the leader considers the entry committed and applies it to its state machine. The leader then tells the followers to apply it too. This is the two-phase pattern: propose → commit.

In a 3-node cluster, the leader needs one follower’s acknowledgment (leader + 1 follower = 2 = quorum). In a 5-node cluster, it needs two. The write isn’t visible to clients until committed — this guarantees that any committed write survives as long as a majority of nodes are alive.
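One way to picture the commit rule: the leader tracks the highest log index each member has persisted and commits up to the highest index that a majority has reached. The sketch below assumes a hypothetical `match_index` mapping that includes the leader itself, and it omits Raft’s additional requirement that the entry belong to the leader’s current term.

```python
def commit_index(match_index: dict) -> int:
    """Highest log index stored on a majority of members.

    match_index maps each member (leader included) to the last log
    index known to be persisted on that member.
    """
    quorum = len(match_index) // 2 + 1
    ranked = sorted(match_index.values(), reverse=True)
    # The quorum-th largest value is held by at least `quorum` members.
    return ranked[quorum - 1]


# Leader and one follower have entry 7; the other follower lags at 4.
print(commit_index({"etcd-1": 7, "etcd-2": 7, "etcd-3": 4}))
```

With only the leader at index 7 and both followers at 4, the same function returns 4: entry 7 exists on the leader but is not yet committed.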

Safety. Raft guarantees that once an entry is committed, it will be present in the logs of all future leaders. A candidate can’t win an election unless its log is at least as up-to-date as a majority of nodes. This prevents a stale node from becoming leader and overwriting recent commits.
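The “at least as up-to-date” check is a comparison each voter makes before granting its vote, as described in the Raft paper: compare the term of the last log entry first, then the log length. A minimal sketch, with a hypothetical function name and argument layout:

```python
def log_is_up_to_date(candidate_last_term: int, candidate_last_index: int,
                      voter_last_term: int, voter_last_index: int) -> bool:
    """Return True if the voter may grant its vote to the candidate.

    The log whose last entry has the higher term wins; if the last
    terms are equal, the longer log wins (ties are acceptable).
    """
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index


# A candidate with a longer log from an older term is still refused.
print(log_is_up_to_date(2, 10, 3, 6))
```

Because every committed entry lives on a majority, and a candidate needs votes from a majority, at least one voter in any winning coalition holds each committed entry and will refuse a candidate that lacks it.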

What Happens When Quorum Is Lost

When a 3-node etcd cluster loses 2 nodes, the surviving node can’t form a quorum by itself. Here’s what happens to your Kubernetes cluster:

Writes stop completely. The API server can’t create, update, or delete any resource. kubectl apply hangs. Deployments can’t scale. Pods can’t be scheduled. The API server’s connection to etcd times out and starts returning errors.

Reads depend on the consistency mode. etcd supports linearizable reads (the default), which require quorum confirmation, and serializable reads, which any single node can serve from its local state. With no quorum, linearizable reads fail. Serializable reads still work on the surviving node, but the data may be stale and won’t reflect any writes that were in flight when quorum was lost.

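Existing workloads keep running. Pods already scheduled on workers continue operating. kubelet on each worker maintains its containers regardless of control plane state, including restarting a container that crashes or fails its liveness probe. But nothing that has to go through the API server can change: no new pods, no rescheduling of pods away from a failed node, no Service endpoint updates when a readiness probe fails, no rolling updates. The cluster’s desired state is frozen at its last known value.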

Recovery is possible. Bring the failed nodes back with their data intact and they’ll rejoin the cluster. As soon as a majority is up again, the members elect a leader, lagging members catch up from the leader’s log, and writes resume. If the nodes can’t be recovered, etcd provides disaster recovery procedures, but that’s a different scenario from normal operations.

Visualizing the Voting Scenarios

It helps to think through the scenarios concretely for a 3-node cluster (etcd-1, etcd-2, etcd-3):

Figure: The four voting scenarios in a 3-node etcd cluster walked through below: all 3 healthy, 1 node down, 2 nodes down, and leader dies with a new leader elected. Red arrows = write replication; green dashed = ACK or vote; orange dashed = RequestVote during election.

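All 3 healthy. etcd-1 is leader. A write comes in. etcd-1 sends it to etcd-2 and etcd-3. Both acknowledge. The entry is committed as soon as the first acknowledgment arrives (2/3 is already quorum), and the second acknowledgment brings it to 3/3. This is the normal state: full redundancy, and the commit waits only for the faster of the two followers.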

1 node down (e.g., etcd-3 stops). etcd-1 is still leader. A write comes in. etcd-1 sends it to etcd-2 and etcd-3. etcd-2 acknowledges. etcd-3 doesn’t respond. That’s fine — 2/3 is quorum. The entry is committed. When etcd-3 comes back, it catches up by replaying the leader’s log.

2 nodes down (e.g., etcd-2 and etcd-3 stop). etcd-1 is alone. A write comes in. etcd-1 can’t get any acknowledgments. 1/3 is not a majority, so the write can never be committed and the client’s request eventually times out. If etcd-1 was a follower when the leader died, it starts an election but can’t win it, because it needs at least one other node to vote for it. The cluster is stuck.

Leader dies (etcd-1 stops, etcd-2 and etcd-3 alive). etcd-2 and etcd-3 notice heartbeats stopped. One of them (say etcd-2, whose election timeout fires first) becomes candidate and requests votes. etcd-3 votes yes. 2/3 is a majority, so etcd-2 becomes the new leader. The cluster continues operating with 2 nodes. Write downtime is roughly one election timeout plus a vote round trip, which with etcd’s default 1000 ms election timeout is on the order of a second or two.
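The four scenarios above reduce to two questions: is a majority still up, and is the old leader among the survivors? A small sketch, using a hypothetical `outcome` function and made-up result labels:

```python
def outcome(members, alive, leader):
    """Classify a cluster snapshot.

    members: all configured etcd members
    alive:   members currently reachable
    leader:  the member that was leader before any failure
    """
    quorum = len(members) // 2 + 1
    up = [m for m in members if m in alive]
    if len(up) < quorum:
        return "stuck"       # no commits, and no election can succeed
    if leader in up:
        return "writes-ok"   # the leader still reaches a majority
    return "re-elect"        # survivors can elect a new leader


members = ["etcd-1", "etcd-2", "etcd-3"]
# All healthy, etcd-3 down, etcd-2 and etcd-3 down, leader etcd-1 down.
for alive in (members, members[:2], members[:1], members[1:]):
    print(sorted(alive), "->", outcome(members, alive, leader="etcd-1"))
```

The same function works for a 5-member list, which makes it easy to check that two simultaneous failures leave a 5-node cluster in the "writes-ok" or "re-elect" state rather than "stuck".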

Why Your Homelab Uses 3 Nodes

Three is the sweet spot for homelabs. It’s the minimum cluster size that provides meaningful fault tolerance (survive 1 failure), and it runs comfortably on a single Mac. The HA clusters in this project allocate 2 GB RAM per etcd node — 6 GB total for the etcd tier.

Five nodes would tolerate 2 failures, but on a homelab Mac you’re unlikely to have independent failure domains anyway — if the Mac crashes, all 5 nodes go down together. The extra resilience of 5 nodes matters in production where nodes are on different racks, different power circuits, or different availability zones. On a single laptop, 3 nodes teaches you all the quorum mechanics without burning extra resources.

The UTM HA deep dive covers the etcd deployment in Step 15 — the systemd unit with all the TLS flags, the initial-cluster flag for bootstrapping, and the health check that confirms quorum. The Vagrant and OrbStack deep dives deploy the identical etcd configuration using the same Ansible roles.

Questions, corrections, or want to share how you’re using these repos?

labitlearnit@gmail.com