PostgreSQL HA: A Battle-Tested Playbook

PostgreSQL replication is easy, until it silently isn't. It was a Tuesday afternoon, peak traffic, when the latency alerts started screaming. Our synchronous replica was lagging, badly, and a minor network hiccup almost took down the entire production database. Here's the battle-tested playbook we now use for rock-solid PostgreSQL HA. Minimal example sketches for each step follow the list.

1. We stopped defaulting to async. Speed is tempting, but for our core services even a few seconds of data loss on failover is unthinkable. We now use synchronous replication for transactions that touch the auth and payment tables. The ~5ms latency hit was a small, and honestly necessary, price to pay for zero RPO.

2. Automate failover like your job depends on it, because it does. Manual promotion is a recipe for disaster under pressure. We run Patroni with etcd to manage cluster state and automate leader election. It handles failover in under 30 seconds, which saved us during a partial AWS AZ outage last quarter.

3. PgBouncer is non-negotiable. Your application shouldn't know or care which DB instance is the primary. A central connection pooler like PgBouncer sits in front of the cluster, holding client connections steady while its backend target is repointed to the new primary after a failover. This eliminated the cascade of app-level connection errors we used to see.

4. Monitor replication lag like a hawk. An out-of-sync replica isn't high availability; it's a high-stakes liability. We have a non-negotiable Datadog alert that queries pg_stat_replication and pages the on-call if replay_lag exceeds 500ms for more than two minutes. It's our best early warning for network saturation.

5. Brutally test your recovery. Your HA setup is just a theory until you pull the plug. We use Terraform to spin up a staging clone of our production DB and run weekly chaos tests: terminating the primary instance and severing network connections via security group rules. It's the only way to build real confidence.

True high availability isn't about uptime; it's about predictable, tested recovery.

What's the most counter-intuitive failure you've seen in a PostgreSQL HA setup? Save this as a sanity check for your own cluster. #DevOps #PostgreSQL #HighAvailability #SiteReliability
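Sketch for step 1: Postgres scopes synchronous commit per transaction, session, or role rather than per table, so one way to get "sync only for auth and payments" is a named synchronous standby plus SET LOCAL in the hot-path transactions. The standby name replica1 and the payments statement are placeholders, not our actual schema.

```sql
-- On the primary: require acknowledgement from one named standby.
-- "replica1" is a placeholder; it must match the standby's application_name.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';

-- Keep the default cheap: wait only for the local WAL flush on commit...
ALTER SYSTEM SET synchronous_commit = 'local';
SELECT pg_reload_conf();

-- ...and opt in to full synchronous commit only where zero RPO matters:
BEGIN;
SET LOCAL synchronous_commit = remote_apply;  -- wait until the standby has applied the WAL
UPDATE payments SET status = 'settled' WHERE id = 42;  -- hypothetical table/row
COMMIT;
```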
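Sketch for step 2: a pared-down patroni.yml showing the etcd wiring. Hosts, addresses, and the data_dir are placeholders, and a real file also needs authentication and bootstrap/initdb sections. Note that with synchronous_mode on, Patroni owns synchronous_standby_names itself, so you wouldn't also set it by hand as in the step 1 sketch.

```yaml
scope: prod-pg                  # cluster name, shared by all members
name: node1                     # unique per node

etcd3:
  hosts: 10.0.0.10:2379,10.0.0.11:2379,10.0.0.12:2379  # placeholder etcd cluster

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.5:8008     # placeholder node address

bootstrap:
  dcs:
    ttl: 30                     # leader lease; bounds how fast a dead leader is detected
    loop_wait: 10
    retry_timeout: 10
    synchronous_mode: true      # Patroni manages synchronous_standby_names itself

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.5:5432
  data_dir: /var/lib/postgresql/16/main   # placeholder
```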
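Sketch for step 3: a minimal pgbouncer.ini. One nuance worth spelling out: PgBouncer doesn't detect the primary by itself; this assumes pg-primary.internal is a DNS name (or an HAProxy endpoint) that gets repointed on failover, for example by a Patroni callback. Names and pool sizes are placeholders.

```ini
[databases]
; pg-primary.internal is a placeholder that must always resolve to the
; current primary (repointed on failover by DNS or a load balancer)
appdb = host=pg-primary.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; return server connections after each transaction
max_client_conn = 2000
default_pool_size = 50
```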
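Sketch for step 4: the underlying check, run on the primary. replay_lag has existed in pg_stat_replication since PostgreSQL 10; a Datadog custom query metric can wrap this same statement, with the two-minute persistence handled by the monitor's evaluation window.

```sql
-- Run on the primary: list any standby whose applied WAL trails by > 500 ms.
SELECT application_name,
       client_addr,
       state,
       sync_state,
       replay_lag
FROM   pg_stat_replication
WHERE  replay_lag > interval '500 milliseconds';
```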
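Sketch for step 5: the "pull the plug" part of a chaos run. The post's harness is Terraform-provisioned; this only sketches the kill action in Python with boto3, under assumed tag names, region, ports, and security group IDs, all hypothetical.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Find the staging primary by tag (tag scheme is hypothetical).
resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:role", "Values": ["pg-primary"]},
        {"Name": "tag:env", "Values": ["staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
instance_ids = [
    i["InstanceId"]
    for r in resp["Reservations"]
    for i in r["Instances"]
]

# Chaos action 1: hard-kill the primary, then time how long Patroni
# takes to promote a standby and how the app behaves behind PgBouncer.
if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)

# Chaos action 2 (alternative run): sever replication traffic instead by
# revoking the standby subnet's ingress rule (IDs and CIDR are placeholders).
ec2.revoke_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "IpRanges": [{"CidrIp": "10.0.2.0/24"}],
    }],
)
```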

That's a solid playbook; the reliance on Patroni with etcd is a pragmatic choice for automated leader election and state management. The point about brutally testing recovery via infrastructure-as-code chaos is where most teams stumble, mistaking configuration for proven resilience.
