Thread

  1. log_postmaster_stats

    Jakub Wartak <jakub.wartak@enterprisedb.com> — 2026-05-27T11:39:15Z

    Hi -hackers,
    
    We seem to have certain observability about postmaster
    (pg_stat_database.{sessions,parallel_workers_launched}), but we do not have
    pre-exisiting way to asses how much postmaster was really busy back in the
    past. Even checkpointer (log_checkpoints) or startup recovery code is reporting
    better what they were doing. One can say we have log_connections, yet bigger
    shops cannot afford to log_connections all the time to count what happened
    some time ago (and that can cumbersome anyway).
    
    The attached patch introduces log_postmaster_stats in the same way we do have
    log_startup_progress_interval, e.g. when set to 10 (seconds), it will show this
    during artificial connection storm (log produced every 10s):
    
    LOG:  postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
    parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
    elapsed: 10.00 s
    LOG:  postmaster stats: avg 1834.30 conns/sec; 1833.60 disconns/sec;
    0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.75 s,
    elapsed: 9.96 s
    LOG:  postmaster stats: avg 1055.75 conns/sec; 1056.25 disconns/sec;
    0.00 parallel workers started/sec; CPU: user: 0.12 s, system: 4.27 s,
    elapsed: 16.25 s
    LOG:  postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
    parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
    elapsed: 13.82 s
    LOG:  postmaster stats: avg 0.00 conns/sec; 0.00 disconns/sec; 0.00
    parallel workers started/sec; CPU: user: 0.00 s, system: 0.00 s,
    elapsed: 10.00 s
    
    The interesting thing above is that the elapsed time is 6s (with the
    setting at 10s), then one
    can already tell there was a probem.
    
    Known issues include connection storms, spotting low postmaster/fork()
    efficency,
    PQ workers causing startvation for new connections and so on. It is somehow
    complementary to having those pg_stat_database counters mentioned at the
    beggining. It is also complementary to the more recent log_connections with
    =setup_durations, which logs timings, but not direct rate of forks()/second.
    
    Another interesting thing above is that there can be discrepeancy
    between user+system=~5s
    against elapsed wall clock time=~10s above (it does not add up) and that's even
    getrusage(RUSAGE_SELF and not RUSAGE_CHILDREN), but this comes apparently from
    CPU scheduling at those kind of fork() rates. I was thinking about adding some
    message like every now and then:
        "WARNING: postmaster potentially overloaded, stats not gathered in time"
    however lot of folks don't like those self diagnosis messages, so that's not in
    v1 patch today.
    
    I've thought it would be good idea to actually to enable it by default (@60s?),
    but right now it is off to be aligned with others.
    
    Any hints/reviews are welcome.
    
    -J.