Engineering · 2026-09-30

Postgres as a queue, and when not to

A defence of using Postgres as the job-queue substrate for application-bound workloads, the four properties that make the choice load-bearing rather than corner-cutting, and the conditions under which we'd switch.

Several of our products run their background work through a Postgres-backed job queue (pg-boss or a close analogue), on the same Postgres instance that holds the rest of the application's data. There is a strain of opinion in performance-oriented circles that this is wrong: that queues belong in a system designed for them (Redis Streams, Kafka, RabbitMQ, NATS, SQS), and that putting them in Postgres is at best a transitional choice. We have heard the argument enough times to be sure of our answer: Postgres is the right choice for a wide class of application-bound workloads, and the scale at which it stops being right is well outside most applications' real trajectory.

The argument for keeping the queue in Postgres has four parts.

One operational thing to back up, monitor, and patch. A separate queue substrate is a separate stateful service, with a separate HA story, a separate backup schedule, a separate version-upgrade cadence, a separate set of metrics, and a separate set of failure modes the on-call engineer must learn. Each of those is a small cost; together, they are not small. For application teams whose product footprint is "Postgres plus the application," the operational simplicity of keeping the queue in the same database is meaningful. It is the difference between a stack that fits in one engineer's head and one that requires an SRE rotation.

Transactional consistency between application data and job enqueues. When the application accepts a request that should produce a side effect later, it inserts the application row, any related records, and the queue job in a single transaction. The job exists if and only if the application data exists. This eliminates a class of bug that lives in every system with a separate queue: the application commits to its database, then fails to enqueue, and the data sits without ever being processed. The standard mitigation is the transactional outbox pattern, which works but adds infrastructure that a single-database deployment doesn't need.
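A minimal sketch of the single-transaction enqueue described above. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs without a server; against Postgres the same shape would apply through a Postgres driver. The table names (evidence, job), the job name, the accept_request function, and the fail_before_commit flag are all hypothetical and exist only for illustration.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for Postgres here (an assumption, for runnability).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE evidence (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE job ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'created')"
    )
    conn.commit()
    return conn


def accept_request(
    conn: sqlite3.Connection, body: str, fail_before_commit: bool = False
) -> int:
    """Insert the application row and its job in one transaction."""
    # `with conn` commits if the block succeeds and rolls back on any exception,
    # so the two inserts below land together or not at all.
    with conn:
        cur = conn.execute("INSERT INTO evidence (body) VALUES (?)", (body,))
        evidence_id = cur.lastrowid
        conn.execute(
            "INSERT INTO job (name, payload) VALUES (?, ?)",
            ("ingest-evidence", json.dumps({"evidence_id": evidence_id})),
        )
        if fail_before_commit:
            # Simulates a crash between the writes and the commit.
            raise RuntimeError("simulated failure before commit")
    return evidence_id


def counts(conn: sqlite3.Connection) -> tuple:
    """Return (application rows, job rows) for a quick consistency check."""
    evidence = conn.execute("SELECT COUNT(*) FROM evidence").fetchone()[0]
    jobs = conn.execute("SELECT COUNT(*) FROM job").fetchone()[0]
    return evidence, jobs
```

A successful call leaves one row in each table; a call that fails before commit leaves neither, which is the "if and only if" property in miniature.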

Visibility from the same tools the application already uses. When something goes wrong with the queue, the on-call engineer's familiar psql (or ORM, or admin UI) is sufficient. Job state, retry count, last error, payload — all reachable with the tool the engineer was already using to look at the application data. Specialised queue substrates have their own admin surfaces, which are usually adequate but which add a tab and a mental model. They also make it harder to write the cross-cut query — show me every job related to this entity in the last hour, joined against the entity's other state — that an incident often actually requires.

Backup and restore are a single operation. A Postgres logical backup contains the queue. A point-in-time restore restores the queue. There is no possibility of restoring application state from one backup and queue state from another and discovering they are inconsistent. For products whose customers ask "what is your recovery story" with some seriousness, the simplicity of the answer matters.

The argument against — and the conditions under which we would switch — is mostly about throughput, latency, and contention.

Polling latency. Most Postgres-backed queues are polling-based, and default polling intervals are typically a few seconds. For interactive jobs, the floor this puts on enqueue-to-execute latency is observable to users. The mitigation is to use Postgres's LISTEN/NOTIFY mechanism for high-priority jobs: the worker wakes immediately on a notification and processes the job before the next scheduled poll. That is a few hundred lines of code, not a substrate change. For most application-bound workloads (evidence ingestion, notifications, scheduled reports, fan-out tasks) the polling floor is not in the critical path.
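The control flow of a notify-or-poll worker can be sketched as follows. This is an illustration under stated assumptions, not a driver-level implementation: a threading.Event stands in for the Postgres notification channel, and the names Waker, run_worker, fetch_job, and handle are hypothetical. In a real deployment the producer would issue NOTIFY after enqueueing and the worker would wait on a connection that has issued LISTEN, with the poll interval as the wait timeout.

```python
import threading
from typing import Any, Callable, Optional


class Waker:
    """Stand-in for a LISTEN/NOTIFY channel (assumption for this sketch)."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def notify(self) -> None:
        # Producer side: called after a job has been enqueued.
        self._event.set()

    def wait(self, timeout: float) -> bool:
        # Worker side: True if woken by a notification,
        # False if the poll interval elapsed first.
        notified = self._event.wait(timeout)
        self._event.clear()
        return notified


def run_worker(
    fetch_job: Callable[[], Optional[Any]],
    handle: Callable[[Any], None],
    waker: Waker,
    poll_interval: float,
    stop: threading.Event,
) -> None:
    """Drain runnable jobs, then sleep until notified or until the next poll."""
    while not stop.is_set():
        # Drain first, so a notification that raced with the previous
        # wake-up is never lost: the job is already visible to fetch_job.
        while (job := fetch_job()) is not None:
            handle(job)
        waker.wait(poll_interval)
```

The scheduled poll remains as a safety net; the notification only shortens the wait, so a missed or dropped notification costs at most one poll interval.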

Index churn under heavy update. Postgres-backed queues mark jobs done by updating their state. Under sustained heavy throughput, the indexes on the job tables accumulate dead tuples faster than autovacuum reclaims them, and the planner starts choosing worse plans over time. Monitoring index bloat per queue table, scheduling periodic VACUUMs outside business hours, and tuning autovacuum thresholds for the queue tables specifically are the standard mitigations. At application scale, this is a periodic operational concern, not a pressing one.
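The monitoring and tuning steps above might look roughly like this. The sketch only builds SQL and interprets numbers; it does not connect to a database. The n_live_tup, n_dead_tup, and last_autovacuum columns of pg_stat_user_tables and the two autovacuum storage parameters are standard Postgres. The table name job_queue, the 0.2 alert ratio, and the helper names are assumptions. Note that the dead-tuple ratio is a coarse table-level proxy, not a direct measure of index bloat.

```python
import re

# Per-table tuple estimates from the statistics views. The %s placeholder
# follows the psycopg parameter style (the driver choice is an assumption).
DEAD_TUPLE_QUERY = """
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = %s
"""

# Only plain lowercase identifiers are accepted when building DDL,
# because identifiers cannot be passed as bound parameters.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def dead_tuple_ratio(n_live_tup: int, n_dead_tup: int) -> float:
    """Fraction of the table's tuples that are dead (0.0 for an empty table)."""
    total = n_live_tup + n_dead_tup
    return n_dead_tup / total if total else 0.0


def needs_attention(n_live_tup: int, n_dead_tup: int, max_ratio: float = 0.2) -> bool:
    """True when the dead-tuple ratio exceeds the (hypothetical) alert ratio."""
    return dead_tuple_ratio(n_live_tup, n_dead_tup) > max_ratio


def autovacuum_override_sql(table: str, scale_factor: float, threshold: int) -> str:
    """Build a per-table autovacuum override for a queue table."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return (
        f"ALTER TABLE {table} SET ("
        f"autovacuum_vacuum_scale_factor = {scale_factor}, "
        f"autovacuum_vacuum_threshold = {threshold});"
    )
```

Lowering the scale factor for the queue table alone makes autovacuum trigger on a smaller fraction of changed rows there, without changing behaviour for the rest of the schema.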

Schema migrations on the same database as the queue. Long migrations on the application schema can interact with queue-table contention in awkward ways — a migration that takes an ACCESS EXCLUSIVE lock on a hot table will block queue activity for the duration. The remedy is discipline (schema migrations off-peak, online-migration patterns for hot tables) rather than substrate change, but it is real, and a separate queue would not have the coupling.

The cliff. At sustained throughput on the order of tens of thousands of jobs per second per tenant, or storage volumes in the hundreds of millions of completed jobs without aggressive archival, a Postgres-backed queue starts to be a poor fit. Most application workloads sit an order of magnitude below those thresholds; some sit two orders below. If a deployment pushed past them, the natural next step would be a substrate that matches the operational shape one already has. For systems that already run a Redis instance, Redis Streams is a small step; for systems that don't, the cost of adding Redis to the operational footprint should be weighed seriously against the alternative of partitioning the workload to keep Postgres comfortable.

What we don't do, deliberately:

Run the queue in a separate Postgres instance from the application. The transactional-consistency argument is the whole point; running them apart loses it. If the operational footprint already has two instances, the queue belongs with the application's data, not on its own.
Push queue throughput by adding workers indefinitely. Worker count is bounded by the database's connection budget, and a runaway worker pool will degrade the application's interactive performance before it improves throughput. Sizing workers per workload, against a budget that leaves headroom, is the load-bearing operational discipline.
Expose the queue to in-cluster services other than the application. The temptation, on a well-run Postgres, is to let a sibling service enqueue work into the same instance. Resist this. The queue is a private implementation detail of the application that owns it. Shared queue infrastructure is the start of a cross-service coupling that is hard to undo.
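The worker-sizing point above reduces to simple arithmetic against the connection budget. A minimal sketch, assuming each worker holds a fixed number of connections: max_connections is the real Postgres setting, while the function name, the reservation parameters, and the headroom percentage are hypothetical.

```python
def max_workers(
    max_connections: int,
    reserved_for_app: int,
    reserved_for_admin: int,
    headroom_percent: int,
    connections_per_worker: int = 1,
) -> int:
    """Upper bound on queue workers that still leaves headroom for everything else."""
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    if connections_per_worker < 1:
        raise ValueError("connections_per_worker must be at least 1")
    # Integer arithmetic keeps the budget exact.
    usable = max_connections * (100 - headroom_percent) // 100
    available = usable - reserved_for_app - reserved_for_admin
    # A negative budget means the interactive pool already exceeds the
    # budget; the answer is then zero workers, not a negative count.
    return max(0, available // connections_per_worker)
```

With max_connections at 100, 40 connections reserved for the application pool, 5 for administration, and 20 percent headroom, the bound is 35 single-connection workers. The number to tune is the reservation, not the worker count.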

The pragmatic position, after running this in production across several products and several years, is that Postgres-as-a-queue is a load-bearing simplification, not a corner-cut. The systems that genuinely need a dedicated queue substrate know they need one; their throughput, fan-out, or durability requirements force the choice. Most applications, including most compliance-flavoured applications, do not.
