Thread

  1. Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer

    Soumya S Murali <soumyamurali.work@gmail.com> — 2025-11-26T10:15:06Z

    On Mon, Nov 24, 2025 at 3:37 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
    >
    > On 2025-Nov-24, Michael Banck wrote:
    >
    > > In general I doubt how much those gauges (as opposed to counters) only
    > > pertaining to the last checkpoint are useful in pg_stat_checkpointer.
    > > What would be the use case for those two values?
    >
    > I think it's useful to know how long checkpoint has to work.  It's a bit
    > lame to have only one duration (the last one), but at least with this
    > arrangement you can have external monitoring software connect to the
    > server, extract that value and save it somewhere else.  Monitoring
    > systems do this all the time, and we've been waiting for a better
    > implementation to store monitoring data inside Postgres for years.  I
    > think we shouldn't block this proposal just because of this issue,
    > because it can clearly be useful.
    >
    > However, I'm not sure I'm very interested in knowing only the duration
    > of the checkpoint.  I mean, much of the time the duration is going to be
    > whatever fraction of the checkpoint timeout you have as
    > checkpoint_completion_target, right?  Which includes sleeps.  So I think
    > you really want two durations: one is the duration itself, and the other
    > is what fraction of that did the checkpointer sleep in order to achieve
    > that duration.  So you know how much time checkpointer spent trying to
    > get the operating system do stuff rather than just sit there waiting.
    > We already have that data, kinda, in write_time and sync_time, but those
    > are cumulative rather than just for the last one.  (I guess you can have
    > the monitoring system compute the deltas as it finds each new
    > checkpoint.)  I'm not sure how good this system is.
    
    Thank you for the detailed thoughts. I agree that having only the last
    checkpoint's duration is limited, but it still gives monitoring tools
    a concrete value they can sample and store over time, which is better
    than relying only on counters and logs. I will check whether the total
    duration and the active write/sync time (as opposed to sleep time) can
    be exposed separately and more clearly, as that seems useful for
    deeper diagnosis.
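
    For reference, the delta approach mentioned above (a monitoring system
    computing per-checkpoint values from the cumulative write_time and
    sync_time) could look roughly like the sketch below. This is only an
    illustration and not part of the patch: the Snapshot type, the
    active_time_delta function, and the assumption that the monitoring
    tool has some cumulative count of completed checkpoints to compare
    between samples are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One sample of cumulative checkpointer statistics (hypothetical type)."""
    checkpoints: int      # assumed: cumulative count of completed checkpoints
    write_time_ms: float  # cumulative write_time, in milliseconds
    sync_time_ms: float   # cumulative sync_time, in milliseconds


def active_time_delta(prev: Snapshot, curr: Snapshot) -> Optional[dict]:
    """Return write/sync time spent between two samples, or None.

    None means no new checkpoint completed between the samples, or the
    counters went backwards (for example after a statistics reset).
    """
    new_checkpoints = curr.checkpoints - prev.checkpoints
    write_ms = curr.write_time_ms - prev.write_time_ms
    sync_ms = curr.sync_time_ms - prev.sync_time_ms
    if new_checkpoints <= 0 or write_ms < 0 or sync_ms < 0:
        return None
    return {
        "checkpoints": new_checkpoints,
        "write_ms": write_ms,
        "sync_ms": sync_ms,
        # Average active (non-sleep) time per checkpoint in this interval.
        "active_ms_per_checkpoint": (write_ms + sync_ms) / new_checkpoints,
    }
```

    If more than one checkpoint completes between two samples, this only
    yields an average over the interval, which is one reason a per-checkpoint
    value exposed by the server could still be useful.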
    
    > In the past, I looked at a couple of monitoring dashboards offered by
    > cloud vendors, searching for anything valuable in terms of checkpoints.
    > What I saw was very disappointing -- mostly just "how many checkpoints
    > per minute", which is mostly flat zero with periodic spikes.  Totally
    > useless.  Does anybody know if some vendor has good charts for this?
    > Also, if we were to add this new proposed duration, how could these
    > charts improve?
    
    I will look into this in more depth and will let you know if I find
    something concrete.
    
    Regards
    Soumya