Thread

log_postmaster_stats

Jakub Wartak <jakub.wartak@enterprisedb.com> — 2026-05-27T11:39:15Z
Hi -hackers,

We seem to have certain observability about postmaster
(pg_stat_database.{sessions,parallel_workers_launched}), but we do not have
pre-exisiting way to asses how much postmaster was really busy back in the
past. Even checkpointer (log_checkpoints) or startup recovery code is reporting
better what they were doing. One can say we have log_connections, yet bigger
shops cannot afford to log_connections all the time to count what happened
some time ago (and that can cumbersome anyway).

The attached patch introduces log_postmaster_stats in the same way we do have
log_startup_progress_interval, e.g. when set to 10 (seconds), it will show this
during artificial connection storm (log produced every 10s):

LOG:  postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s
LOG:  postmaster stats: avg 1834.30 conns/sec; 1833.60 disconns/sec;
0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.75 s,
elapsed: 9.96 s
LOG:  postmaster stats: avg 1055.75 conns/sec; 1056.25 disconns/sec;
0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.27 s,
elapsed: 16.25 s
LOG:  postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 13.82 s
LOG:  postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
elapsed: 10.00 s

The interesting thing above is that the elapsed time is 6s (with the
setting at 10s), then one
can already tell there was a probem.

Known issues include connection storms, spotting low postmaster/fork()
efficency,
PQ workers causing startvation for new connections and so on. It is somehow
complementary to having those pg_stat_database counters mentioned at the
beggining. It is also complementary to the more recent log_connections with
=setup_durations, which logs timings, but not direct rate of forks()/second.

Another interesting thing above is that there can be discrepeancy
between user+system=~5s
against elapsed wall clock time=~10s above (it does not add up) and that's even
getrusage(RUSAGE_SELF and not RUSAGE_CHILDREN), but this comes apparently from
CPU scheduling at those kind of fork() rates. I was thinking about adding some
message like every now and then:
    "WARNING: postmaster potentially overloaded, stats not gathered in time"
however lot of folks don't like those self diagnosis messages, so that's not in
v1 patch today.

I've thought it would be good idea to actually to enable it by default (@60s?),
but right now it is off to be aligned with others.

Any hints/reviews are welcome.

-J.