Thread
-
log_postmaster_stats
Jakub Wartak <jakub.wartak@enterprisedb.com> — 2026-05-27T11:39:15Z
Hi -hackers,

We seem to have some observability into the postmaster (pg_stat_database.{sessions,parallel_workers_launched}), but we do not have a pre-existing way to assess how busy the postmaster really was at some point in the past. Even the checkpointer (log_checkpoints) and the startup recovery code report better what they were doing. One can say we have log_connections, yet bigger shops cannot afford to run with log_connections all the time just to count what happened some time ago (and that can be cumbersome anyway).

The attached patch introduces log_postmaster_stats, which works in the same way as log_startup_progress_interval. E.g. when set to 10 (seconds), it shows the following during an artificial connection storm (one log line produced every 10s):

LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s, elapsed: 10.00 s
LOG: postmaster stats: avg 1834.30 conns/sec; 1833.60 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.75 s, elapsed: 9.96 s
LOG: postmaster stats: avg 1055.75 conns/sec; 1056.25 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.27 s, elapsed: 16.25 s
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s, elapsed: 13.82 s
LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s, elapsed: 10.00 s

The interesting thing above is that when the elapsed time is 16s (with the setting at 10s), one can already tell there was a problem. Known issues this can help spot include connection storms, low postmaster/fork() efficiency, PQ workers causing starvation for new connections and so on. It is somewhat complementary to the pg_stat_database counters mentioned at the beginning. It is also complementary to the more recent log_connections with =setup_durations, which logs timings, but not the direct rate of fork() calls per second.
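To illustrate how such lines could be consumed after the fact, here is a minimal Python sketch (not part of the patch). It assumes the log format stays exactly as quoted in this email. The names parse_stats_line and late_reports, and the 1 s tolerance, are hypothetical choices made only for this example.

```python
import re

# Matches the "postmaster stats" line format quoted in this email.
# If the patch changes the wording, this pattern has to change too.
STATS_RE = re.compile(
    r"postmaster stats: avg (?P<conns>[\d.]+) conns/sec; "
    r"(?P<disconns>[\d.]+) disconns/sec; "
    r"(?P<workers>[\d.]+) parallel workers started/sec; "
    r"CPU: user: (?P<user>[\d.]+) s, system: (?P<system>[\d.]+) s, "
    r"elapsed: (?P<elapsed>[\d.]+) s"
)


def parse_stats_line(line):
    """Return the numeric fields of one stats line as a dict, or None."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {name: float(value) for name, value in m.groupdict().items()}


def late_reports(lines, interval_s, tolerance_s=1.0):
    """Return parsed reports whose elapsed time overshoots the configured
    interval by more than tolerance_s (hypothetical threshold)."""
    late = []
    for line in lines:
        stats = parse_stats_line(line)
        if stats is not None and stats["elapsed"] > interval_s + tolerance_s:
            late.append(stats)
    return late


# The five sample lines from this email.
SAMPLE = [
    "LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s, elapsed: 10.00 s",
    "LOG: postmaster stats: avg 1834.30 conns/sec; 1833.60 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.75 s, elapsed: 9.96 s",
    "LOG: postmaster stats: avg 1055.75 conns/sec; 1056.25 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.27 s, elapsed: 16.25 s",
    "LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s, elapsed: 13.82 s",
    "LOG: postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00 parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s, elapsed: 10.00 s",
]

if __name__ == "__main__":
    for report in late_reports(SAMPLE, interval_s=10):
        print("late report, elapsed %.2f s" % report["elapsed"])
```

With the sample lines and interval_s=10, only the two reports with elapsed 16.25 s and 13.82 s are flagged, which matches the "stats not gathered in time" situation described above.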
Another interesting thing above is that there can be a discrepancy between user+system (~5s) and the elapsed wall-clock time (~10s). The two do not add up, and that is with getrusage(RUSAGE_SELF), not RUSAGE_CHILDREN. This apparently comes from CPU scheduling at those kinds of fork() rates.

I was thinking about emitting a message every now and then, like:
"WARNING: postmaster potentially overloaded, stats not gathered in time"
however a lot of folks don't like those self-diagnosis messages, so that's not in the v1 patch today. I thought it would be a good idea to actually enable it by default (@60s?), but right now it is off, to be aligned with the others.

Any hints/reviews are welcome.

-J.