When a Server Dies in Silence — How Heartbeat Mechanisms Keep Distributed Systems Alive

1. Introduction
In a distributed system, a server can stop responding while its TCP connection remains open. The load balancer sees it as alive. Requests continue to flow in. Users see timeouts, retries fail, and the failure cascades through the cluster. By the time anyone notices, the damage is done.
Heartbeat mechanisms solve this by making liveness explicit and measurable. Rather than waiting for a request to fail, nodes continuously signal their health, and monitors detect silence before it becomes disaster.
Go’s concurrency primitives — goroutines and channels — make implementing heartbeats elegant and efficient. In this article, we’ll build both sides of a heartbeat system from scratch, then look at the advanced algorithms that production systems like Cassandra and etcd rely on.
2. The Two Models: Push vs. Pull
There are two fundamental approaches to heartbeating:
Push model: Each node actively broadcasts “I am alive” at a fixed interval. The monitor listens and tracks when it last heard from each node.
Pull model: The monitor actively queries each node’s health endpoint. Kubernetes liveness probes use this model.
For most distributed systems, the push model is preferred: it scales better (the monitor doesn’t need to know every node’s address upfront) and detects failures faster (silence itself is the signal).
We’ll focus on the push model.
3. Implementing the Heartbeat Sender
The sender runs a background goroutine that ticks at a regular interval and broadcasts its presence:
```go
package heartbeat

import (
	"context"
	"time"
)

// Sender starts a background goroutine that broadcasts a liveness beat
// at a fixed interval until ctx is cancelled. The send function is
// injected, so the transport stays pluggable.
func Sender(ctx context.Context, nodeID string, interval time.Duration, send func(nodeID string)) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return // clean shutdown on cancellation
			case <-ticker.C:
				send(nodeID)
			}
		}
	}()
}
```
From this code, we can observe:
- time.NewTicker produces ticks at the configured interval without accumulating drift.
- ctx.Done() provides clean shutdown: the goroutine exits when the context is cancelled.
- The send function is injected, keeping the sender decoupled from the transport (it could be gRPC, UDP, a channel, etc.).
4. Implementing the Heartbeat Monitor
The monitor tracks the last time it heard from each node and fires an alert when a node goes silent:
```go
package heartbeat

import (
	"sync"
	"time"
)

// Monitor tracks the last time a beat was received from each node.
type Monitor struct {
	mu       sync.RWMutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

func NewMonitor(timeout time.Duration) *Monitor {
	return &Monitor{lastSeen: make(map[string]time.Time), timeout: timeout}
}

// Receive records a heartbeat; it is called on every beat, possibly
// from many goroutines, so it takes the exclusive write lock.
func (m *Monitor) Receive(nodeID string) {
	m.mu.Lock()
	m.lastSeen[nodeID] = time.Now()
	m.mu.Unlock()
}

// CheckAll returns the IDs of nodes that have been silent longer than
// the timeout. It only reads the map, so the shared read lock suffices.
func (m *Monitor) CheckAll() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	var silent []string
	now := time.Now()
	for id, t := range m.lastSeen {
		if now.Sub(t) > m.timeout {
			silent = append(silent, id)
		}
	}
	return silent
}
```
Note the locking discipline: Receive mutates the map on every beat, so it must take the exclusive write lock, while CheckAll only reads and takes the shared read lock. sync.RWMutex allows multiple readers in parallel but serializes writers, so concurrent status checks (CheckAll, or any other read-only inspection) never block each other.
Practical guideline: set timeout to 3–10× the heartbeat interval to tolerate transient network delays. A 1-second heartbeat interval with a 5-second timeout is a common configuration.
5. Wiring Them Together
```go
// Assumes the heartbeat package from the previous sections is imported,
// along with context, log, and time.
func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	monitor := heartbeat.NewMonitor(5 * time.Second)

	// Three nodes, each beating once per second.
	for _, id := range []string{"node-a", "node-b", "node-c"} {
		heartbeat.Sender(ctx, id, time.Second, monitor.Receive)
	}

	// Check for silent nodes on a slower ticker.
	check := time.NewTicker(2 * time.Second)
	defer check.Stop()
	for range check.C {
		for _, id := range monitor.CheckAll() {
			log.Printf("ALERT: %s has gone silent", id)
		}
	}
}
```
6. Advanced: The Phi (φ) Accrual Failure Detector
Hard timeouts are fragile. Under network congestion, a healthy node might miss one or two heartbeat windows and get falsely marked as dead. Cassandra solves this with the Phi Accrual Failure Detector.
Instead of a binary “alive / dead” decision, phi gives a continuous suspicion level based on statistical analysis of heartbeat arrival times. The phi value grows as silence continues, and the caller decides what phi threshold constitutes “failure”:
```
φ = -log₁₀(1 - CDF(t_now - t_last))
```
Where CDF is the cumulative distribution function fitted to the historical distribution of inter-heartbeat gaps. A φ of 1 means there is a 10% chance the silence is still normal (i.e., marking the node dead would be a mistake); a φ of 3 drops that to 0.1%, or 99.9% confidence that the node has actually failed.
Why this matters: a node under GC pressure might pause for 300ms. With hard timeouts, that’s a false positive. With phi, it’s just a slight increase in suspicion — recoverable when the next heartbeat arrives.
7. Beyond Point-to-Point: Gossip Protocol
When you have hundreds of nodes, having each node send heartbeats to a central monitor creates a bottleneck. The gossip protocol distributes this:
- Each node periodically picks a random peer and shares its view of which nodes are alive.
- That peer merges the information and gossips it to another random peer.
- Failure information spreads exponentially — O(log N) rounds to reach all nodes.
This is how Cassandra, Consul, and Serf implement cluster membership (etcd, by contrast, relies on Raft leader heartbeats rather than gossip). The protocol is self-healing: if a node is falsely marked dead, it can correct the record by gossiping its own liveness.
8. Conclusion
Heartbeat mechanisms transform failure detection from reactive to proactive. Go’s goroutines and time.Ticker make the implementation natural — a background goroutine per node, a mutex-protected map in the monitor, and a clean shutdown via context cancellation.
For production systems, the hard timeout approach is a solid starting point. As your cluster grows and network conditions become less predictable, the Phi Accrual Failure Detector or a gossip-based approach like Serf gives you the resilience to tolerate transient failures without false positives.
Have you implemented heartbeat detection in a Go service? What timeout strategy worked best for your environment? Share your experience below!
More in the “You Should Know In Golang” series:
https://wesley-wei.medium.com/list/you-should-know-in-golang-e9491363cd9a