Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Use correct datatype for PID

  2. Improve comments in online checksums code

  3. Fix checksum state transition during promotion

  4. Fix regex searching for page verification failures in tests

  5. Apply data-checksum worker throttling parameters

  6. Skip WAL for unlogged main fork during online checksum enable

  7. Revert "Get rid of WALBufMappingLock"

  8. Get rid of WALBufMappingLock

  9. Improve grammar of options for command arrays in TAP tests

  1. Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-07-03T06:41:01Z

    After some off-list discussion about the desirability of this feature, where
    several hackers opined that it's something that we should have, I've decided to
    rebase this patch and submit it one more time.  There are several (long)
    threads covering the history of this patch [0][1], related work stemming from
    this [2] as well as earlier attempts and discussions [3][4].  Below I try to
    respond to a summary of points raised in those threads.
    
    The mechanics of the patch hasn't changed since the last posted version, it has
    mainly been polished slightly.  A high-level overview of the processing is:
    It's using a launcher/worker model where the launcher will spawn a worker per
    database which will traverse all pages and dirty them in order to calculate and
    set the checksum on them.  During this inprogress state all backends calculated
    and write checksums but don't verify them on read.  Once all pages have been
    checksummed the state of the cluster will switch over to "on" synchronized
    across all backends with a procsignalbarrier.  At this point checksums are
    verified and processing is equal to checksums having been enabled initdb.  When
    a user disables checksums the cluster enters a state where all backends still
    write checksums until all backends have acknowledged that they have stopped
    verifying checksums (again using a procsignalbarrier).  At this point the
    cluster switches to "off" and checksums are neither written nor verified.  In
    case the cluster is restarted, voluntarily or via a crash, processing will have
    to be restarted (more on that further down).
    
    The user facing controls for this are two SQL level functions, for enabling and
    disabling.  The existing data_checksums GUC remains but is expanded with more
    possible states (with on/off retained).
    
    
    Complaints against earlier versions
    ===================================
    Seasoned hackers might remember that this patch has been on -hackers before.
    There has been a lot of review, and AFAICT all specific comments have been
    addressed.  There are however a few larger more generic complaints:
    
    * Restartability - the initial version of the patch did not support stateful
    restarts, a shutdown performed (or crash) before checksums were enabled would
    result in a need to start over from the beginning.  This was deemed the safe
    orchestration method.  The lack of this feature was seen as serious drawback,
    so it was added.  Subsequent review instead found the patch to be too
    complicated with a too large featureset.  I thihk there is merit to both of
    these arguments: being able to restart is a great feature; and being able to
    reason about the correctness of a smaller patch is also great.  As of this
    submission I have removed the ability to restart to keep the scope of the patch
    small (which is where the previous version was, which received no review after
    the removal).  The way I prefer to frame this is to first add scaffolding and
    infrastructure (this patch) and leave refinements and add-on features
    (restartability, but also others like parallel workers, optimizing rare cases,
    etc) for follow-up patches.
    
    * Complexity - it was brought up that this is a very complex patch for a niche
    feature, and there is a lot of truth to that.  It is inherently complex to
    change a pg_control level state of a running cluster.  There might be ways to
    make the current patch less complex, while not sacrificing stability, and if so
    that would be great.  A lot of of the complexity came from being able to
    restart processing, and that's not removed for this version, but it's clearly
    not close to a one-line-diff even without it.
    
    Other complaints were addressed, in part by the invention of procsignalbarriers
    which makes this synchronization possible.  In re-reading the threads I might
    have missed something which is still left open, and if so I do apologize for
    that.
    
    
    Open TODO items:
    ================
    * Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
    order to be able to run the tests in a timely manner on it.  This is overly
    aggressive and dialling it back while still being able to run fast tests is a
    TODO.  Not sure what the best option is there.
    
    * Monitoring - an insightful off-list reviewer asked how the current progress
    of the operation is monitored.  So far I've been using pg_stat_activity but I
    don't disagree that it's not a very sharp tool for this.  Maybe we need a
    specific function or view or something?  There clearly needs to be a way for a
    user to query state and progress of a transition.
    
    * Throttling - right now the patch uses the vacuum access strategy, with the
    same cost options as vacuum, in order to implement throttling.  This is in part
    due to the patch starting out modelled around autovacuum as a worker, but it
    may not be the right match for throttling checksums.
    
    * Naming - the in-between states when data checksums are enabled or disabled
    are called inprogress-on and inprogress-off.  The reason for this is simply
    that early on there were only three states: inprogress, on and off, and the
    process of disabling wasn't labeled with a state.  When this transition state
    was added it seemed like a good idea to tack the end-goal onto the transition.
    These state names make the code easily greppable but might not be the most
    obvious choices for anything user facing.  Is "Enabling" and "Disabling" better
    terms to use (across the board or just user facing) or should we stick to the
    current?
    
    There are ways in which this processing can be optimized to achieve better
    performance, but in order to keep goalposts in sight and patchsize down they
    are left as future work.
    
    --
    Daniel Gustafsson
    
    [0] https://www.postgresql.org/message-id/flat/CABUevExz9hUUOLnJVr2kpw9Cx%3Do4MCr1SVKwbupzuxP7ckNutA%40mail.gmail.com
    [1] https://www.postgresql.org/message-id/flat/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh%2BziCSa5YLLU_BA%40mail.gmail.com
    [2] https://www.postgresql.org/message-id/flat/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
    [3] https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com
    [4] https://www.postgresql.org/message-id/flat/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com
    
    
  2. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2024-07-03T11:20:10Z

    Hi Daniel,
    
    Thanks for rebasing the patch and submitting it again!
    
    On 7/3/24 08:41, Daniel Gustafsson wrote:
    > After some off-list discussion about the desirability of this feature, where
    > several hackers opined that it's something that we should have, I've decided to
    > rebase this patch and submit it one more time.  There are several (long)
    > threads covering the history of this patch [0][1], related work stemming from
    > this [2] as well as earlier attempts and discussions [3][4].  Below I try to
    > respond to a summary of points raised in those threads.
    > 
    > The mechanics of the patch hasn't changed since the last posted version, it has
    > mainly been polished slightly.  A high-level overview of the processing is:
    > It's using a launcher/worker model where the launcher will spawn a worker per
    > database which will traverse all pages and dirty them in order to calculate and
    > set the checksum on them.  During this inprogress state all backends calculated
    > and write checksums but don't verify them on read.  Once all pages have been
    > checksummed the state of the cluster will switch over to "on" synchronized
    > across all backends with a procsignalbarrier.  At this point checksums are
    > verified and processing is equal to checksums having been enabled initdb.  When
    > a user disables checksums the cluster enters a state where all backends still
    > write checksums until all backends have acknowledged that they have stopped
    > verifying checksums (again using a procsignalbarrier).  At this point the
    > cluster switches to "off" and checksums are neither written nor verified.  In
    > case the cluster is restarted, voluntarily or via a crash, processing will have
    > to be restarted (more on that further down).
    > 
    > The user facing controls for this are two SQL level functions, for enabling and
    > disabling.  The existing data_checksums GUC remains but is expanded with more
    > possible states (with on/off retained).
    > 
    > 
    > Complaints against earlier versions
    > ===================================
    > Seasoned hackers might remember that this patch has been on -hackers before.
    > There has been a lot of review, and AFAICT all specific comments have been
    > addressed.  There are however a few larger more generic complaints:
    > 
    > * Restartability - the initial version of the patch did not support stateful
    > restarts, a shutdown performed (or crash) before checksums were enabled would
    > result in a need to start over from the beginning.  This was deemed the safe
    > orchestration method.  The lack of this feature was seen as serious drawback,
    > so it was added.  Subsequent review instead found the patch to be too
    > complicated with a too large featureset.  I thihk there is merit to both of
    > these arguments: being able to restart is a great feature; and being able to
    > reason about the correctness of a smaller patch is also great.  As of this
    > submission I have removed the ability to restart to keep the scope of the patch
    > small (which is where the previous version was, which received no review after
    > the removal).  The way I prefer to frame this is to first add scaffolding and
    > infrastructure (this patch) and leave refinements and add-on features
    > (restartability, but also others like parallel workers, optimizing rare cases,
    > etc) for follow-up patches.
    > 
    
    I 100% support this approach.
    
    Sure, I'd like to have a restartable tool, but clearly that didn't go
    particularly well, and we still have nothing to enable checksums online.
    That doesn't seem to benefit anyone - to me it seems reasonable to get
    the non-restartable tool in, and then maybe later someone can improve
    this to make it restartable. Thanks to the earlier work we know it's
    doable, even if it was too complex.
    
    This way it's at least possible to enable checksums online with some
    additional care (e.g. to make sure no one restarts the cluster etc.).
    I'd bet for vast majority of systems this will work just fine. Huge
    systems with some occasional / forced restarts may not be able to make
    this work - but then again, that's no worse than now.
    
    > * Complexity - it was brought up that this is a very complex patch for a niche
    > feature, and there is a lot of truth to that.  It is inherently complex to
    > change a pg_control level state of a running cluster.  There might be ways to
    > make the current patch less complex, while not sacrificing stability, and if so
    > that would be great.  A lot of of the complexity came from being able to
    > restart processing, and that's not removed for this version, but it's clearly
    > not close to a one-line-diff even without it.
    > 
    
    I'd push back on this a little bit - the patch looks like this:
    
     50 files changed, 2691 insertions(+), 48 deletions(-)
    
    and if we ignore the docs / perl tests, then the two parts that stand
    out are
    
     src/backend/access/transam/xlog.c             |  455 +++++-
     src/backend/postmaster/datachecksumsworker.c  | 1353 +++++++++++++++++
    
    I don't think the worker code is exceptionally complex. Yes, it's not
    trivial, but a lot of the 1353 inserts is comments (which is good) or
    generic infrastructure to start the worker etc.
    
    > Other complaints were addressed, in part by the invention of procsignalbarriers
    > which makes this synchronization possible.  In re-reading the threads I might
    > have missed something which is still left open, and if so I do apologize for
    > that.
    > 
    > 
    > Open TODO items:
    > ================
    > * Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
    > order to be able to run the tests in a timely manner on it.  This is overly
    > aggressive and dialling it back while still being able to run fast tests is a
    > TODO.  Not sure what the best option is there.
    > 
    
    Why not to add a parameter to pg_enable_data_checksums(), specifying
    whether to do immediate checkpoint or wait for the next one? AFAIK
    that's what we do in pg_backup_start, for example.
    
    > * Monitoring - an insightful off-list reviewer asked how the current progress
    > of the operation is monitored.  So far I've been using pg_stat_activity but I
    > don't disagree that it's not a very sharp tool for this.  Maybe we need a
    > specific function or view or something?  There clearly needs to be a way for a
    > user to query state and progress of a transition.
    > 
    
    Yeah, I think a view like pg_stat_progress_checksums would work.
    
    > * Throttling - right now the patch uses the vacuum access strategy, with the
    > same cost options as vacuum, in order to implement throttling.  This is in part
    > due to the patch starting out modelled around autovacuum as a worker, but it
    > may not be the right match for throttling checksums.
    > 
    
    IMHO it's reasonable to reuse the vacuum throttling. Even if it's not
    perfect, it does not seem great to invent something new and end up with
    two different ways to throttle stuff.
    
    > * Naming - the in-between states when data checksums are enabled or disabled
    > are called inprogress-on and inprogress-off.  The reason for this is simply
    > that early on there were only three states: inprogress, on and off, and the
    > process of disabling wasn't labeled with a state.  When this transition state
    > was added it seemed like a good idea to tack the end-goal onto the transition.
    > These state names make the code easily greppable but might not be the most
    > obvious choices for anything user facing.  Is "Enabling" and "Disabling" better
    > terms to use (across the board or just user facing) or should we stick to the
    > current?
    > 
    
    I think the naming is fine. In the worst case we can rename that later,
    seems more like a detail.
    
    > There are ways in which this processing can be optimized to achieve better
    > performance, but in order to keep goalposts in sight and patchsize down they
    > are left as future work.
    > 
    
    +1
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  3. Re: Changing the state of data checksums in a running cluster

    Bruce Momjian <bruce@momjian.us> — 2024-07-06T02:23:35Z

    On Wed, Jul  3, 2024 at 01:20:10PM +0200, Tomas Vondra wrote:
    > > * Restartability - the initial version of the patch did not support stateful
    > > restarts, a shutdown performed (or crash) before checksums were enabled would
    > > result in a need to start over from the beginning.  This was deemed the safe
    > > orchestration method.  The lack of this feature was seen as serious drawback,
    > > so it was added.  Subsequent review instead found the patch to be too
    > > complicated with a too large featureset.  I thihk there is merit to both of
    > > these arguments: being able to restart is a great feature; and being able to
    > > reason about the correctness of a smaller patch is also great.  As of this
    > > submission I have removed the ability to restart to keep the scope of the patch
    > > small (which is where the previous version was, which received no review after
    > > the removal).  The way I prefer to frame this is to first add scaffolding and
    > > infrastructure (this patch) and leave refinements and add-on features
    > > (restartability, but also others like parallel workers, optimizing rare cases,
    > > etc) for follow-up patches.
    > > 
    > 
    > I 100% support this approach.
    
    Yes, I was very disappointed when restartability sunk the patch, and I
    saw this as another case where saying "yes" to every feature improvement
    can lead to failure.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Only you can decide what is important to you.
    
    
    
    
  4. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-09-30T21:21:30Z

    > On 3 Jul 2024, at 13:20, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
    
    > Thanks for rebasing the patch and submitting it again!
    
    Thanks for review, sorry for being so slow to pick this up again.
    
    The attached version is a rebase with some level of cleanup and polish all
    around, and most importantly it adresses the two points raised below.
    
    >> * Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
    >> order to be able to run the tests in a timely manner on it.  This is overly
    >> aggressive and dialling it back while still being able to run fast tests is a
    >> TODO.  Not sure what the best option is there.
    > 
    > Why not to add a parameter to pg_enable_data_checksums(), specifying
    > whether to do immediate checkpoint or wait for the next one? AFAIK
    > that's what we do in pg_backup_start, for example.
    
    That's a good idea, pg_enable_data_checksums now accepts a third parameter
    "fast" (defaults to false) which will enable immediate checkpoints when true.
    
    >> * Monitoring - an insightful off-list reviewer asked how the current progress
    >> of the operation is monitored.  So far I've been using pg_stat_activity but I
    >> don't disagree that it's not a very sharp tool for this.  Maybe we need a
    >> specific function or view or something?  There clearly needs to be a way for a
    >> user to query state and progress of a transition.
    > 
    > Yeah, I think a view like pg_stat_progress_checksums would work.
    
    Added in the attached version.  It probably needs some polish (the docs for
    sure do) but it's at least a start.
    
    --
    Daniel Gustafsson
    
    
  5. Re: Changing the state of data checksums in a running cluster

    Michael Banck <mbanck@gmx.net> — 2024-09-30T22:43:49Z

    Hi,
    
    On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
    > > Yeah, I think a view like pg_stat_progress_checksums would work.
    > 
    > Added in the attached version.  It probably needs some polish (the docs for
    > sure do) but it's at least a start.
    
    Just a nitpick, but we call it data_checksums about everywhere, but the
    new view is called pg_stat_progress_datachecksums - I think
    pg_stat_progress_data_checksums would look better even if it gets quite
    long.
    
    
    Michael
    
    
    
    
  6. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-10-01T18:55:34Z

    > On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
    > 
    > Hi,
    > 
    > On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
    >>> Yeah, I think a view like pg_stat_progress_checksums would work.
    >> 
    >> Added in the attached version.  It probably needs some polish (the docs for
    >> sure do) but it's at least a start.
    > 
    > Just a nitpick, but we call it data_checksums about everywhere, but the
    > new view is called pg_stat_progress_datachecksums - I think
    > pg_stat_progress_data_checksums would look better even if it gets quite
    > long.
    
    That's a fair point, I'll make sure to switch for the next version of the
    patch.
    
    --
    Daniel Gustafsson
    
    
    
    
    
  7. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-10-07T09:42:41Z

    > On 1 Oct 2024, at 20:55, Daniel Gustafsson <daniel@yesql.se> wrote:
    > 
    >> On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
    >> 
    >> Hi,
    >> 
    >> On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
    >>>> Yeah, I think a view like pg_stat_progress_checksums would work.
    >>> 
    >>> Added in the attached version.  It probably needs some polish (the docs for
    >>> sure do) but it's at least a start.
    >> 
    >> Just a nitpick, but we call it data_checksums about everywhere, but the
    >> new view is called pg_stat_progress_datachecksums - I think
    >> pg_stat_progress_data_checksums would look better even if it gets quite
    >> long.
    > 
    > That's a fair point, I'll make sure to switch for the next version of the
    > patch.
    
    A rebased v3 attached with that change.
    
    --
    Daniel Gustafsson
    
    
  8. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2024-10-07T14:46:25Z

    Hi,
    
    I did a quick review today. First a couple minor comments:
    
    1) monitoring.sgml
    
    typos: number of database -> number of databases
           calcuated -> calculated
    
    2) unnecessary newline in heapam.c (no other changes)
    
    3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
    shadowing earlier variable
    
    4) unnecessary comment change in bufmgr.c (no other changes)
    
    5) unnecessary include and newline in bulk_write.c (no other changes)
    
    6) unnecessary newline in pgstat.c (no other changes)
    
    7) controldata.c - maybe this
    
       if (oldctrl->data_checksum_version == 2)
    
    should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
    constant? For "off" we use "0" which seems somewhat acceptable, but for
    other values it's less obvious what the meaning is.
    
    8) xlog_internal.h - xl_checksum_state should be added to typedefs
    
    9) system_functions.sql - Isn't it weird that this only creates the new
    pg_enable_data_checksums function, but not pg_disable_data_checksums? It
    also means it doesn't revoke EXECUTE from public on it, which I guess it
    probably should? Or why should this be different for the two functions?
    
    Also the error message seems to differ:
    
      test=> select pg_enable_data_checksums();
      ERROR:  permission denied for function pg_enable_data_checksums
      test=> select pg_disable_data_checksums();
      ERROR:  must be superuser
    
    Probably harmless, but seems a bit strange.
    
    
    But there also seems to be a more serious problem with recovery. I did a
    simple script that does a loop of
    
    * start a cluster
    * initialize a small pgbench database (scale 1 - 100)
    * run "long" pgbench
    * call pg_enable_data_checksums(), wait for it to complete
    * stop the cluster with "-m immediate"
    * start the cluster
    
    And this unfortunately hits this assert:
    
      bool
      AbsorbChecksumsOnBarrier(void)
      {
        Assert(LocalDataChecksumVersion ==
    PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
        LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
        return true;
      }
    
    Based on our short discussion about this, the controlfile gets updated
    right after pg_enable_data_checksums() completes. The immediate stop
    however forces a recovery since the last checkpoint, which means we see
    the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
    exit recovery, try to start checkpointer and it trips over this, because
    the control file already has the "on" value :-(
    
    I'm not sure what's the best way to fix this. Would it be possible to
    remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
    noop in that case? Or not set the barrier when exiting recovery. I'm not
    sure the relaxed assert would remain meaningful, though. What would it
    check on standbys, for example?
    
    Maybe a better way would be to wait for a checkpoint before updating the
    controlfile, similar to what we do at the beginning? Possibly even with
    the same "fast=true/false" logic. That would prevent us from seeing the
    XLOG_CHECKSUMS wal record with the updated flag. It would extend the
    "window" where a crash would mean we have to redo the checksums, but I
    don't think that matters much. For small databases who cares, and for
    large databases it should not be a meaningful difference (setting the
    checksums already ran over multiple checkpoints, so one checkpoint is
    not a big difference).
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  9. Re: Changing the state of data checksums in a running cluster

    Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> — 2024-10-07T18:42:31Z

    Tomas Vondra <tomas@vondra.me> writes:
    
    > 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
    >    shadowing earlier variable
    
    All the ListCell variables can be eliminated by using the foreach_ptr
    and foreach_oid macros instead of plain foreach.
    
    - ilmari
    
    
    
    
  10. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-10-08T20:38:36Z

    > On 7 Oct 2024, at 16:46, Tomas Vondra <tomas@vondra.me> wrote:
    
    > I did a quick review today. First a couple minor comments:
    
    Thanks for looking! 1-6 are all fixed.
    
    > 7) controldata.c - maybe this
    > 
    >   if (oldctrl->data_checksum_version == 2)
    > 
    > should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
    > constant? For "off" we use "0" which seems somewhat acceptable, but for
    > other values it's less obvious what the meaning is.
    
    It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
    we should move (or mirror) the checksum versions to storage/checksum_impl.h to
    make them available to frontend and backend tools?
    
    > 8) xlog_internal.h - xl_checksum_state should be added to typedefs
    
    Fixed.
    
    > 9) system_functions.sql - Isn't it weird that this only creates the new
    > pg_enable_data_checksums function, but not pg_disable_data_checksums?
    
    We don't need any DEFAULT values for pg_disable_data_checksums so it doesn't
    need to be created there.
    
    > It
    > also means it doesn't revoke EXECUTE from public on it, which I guess it
    > probably should? Or why should this be different for the two functions?
    
    That should however be done, so fixed.
    
    > But there also seems to be a more serious problem with recovery. I did a
    > simple script that does a loop of
    > 
    > * start a cluster
    > * initialize a small pgbench database (scale 1 - 100)
    > * run "long" pgbench
    > * call pg_enable_data_checksums(), wait for it to complete
    > * stop the cluster with "-m immediate"
    > * start the cluster
    > 
    > And this unfortunately hits this assert:
    > 
    >  bool
    >  AbsorbChecksumsOnBarrier(void)
    >  {
    >    Assert(LocalDataChecksumVersion ==
    > PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    >    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
    >    return true;
    >  }
    > 
    > Based on our short discussion about this, the controlfile gets updated
    > right after pg_enable_data_checksums() completes. The immediate stop
    > however forces a recovery since the last checkpoint, which means we see
    > the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
    > exit recovery, try to start checkpointer and it trips over this, because
    > the control file already has the "on" value :-(
    > 
    > I'm not sure what's the best way to fix this. Would it be possible to
    > remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
    > noop in that case? Or not set the barrier when exiting recovery. I'm not
    > sure the relaxed assert would remain meaningful, though. What would it
    > check on standbys, for example?
    > 
    > Maybe a better way would be to wait for a checkpoint before updating the
    > controlfile, similar to what we do at the beginning? Possibly even with
    > the same "fast=true/false" logic. That would prevent us from seeing the
    > XLOG_CHECKSUMS wal record with the updated flag. It would extend the
    > "window" where a crash would mean we have to redo the checksums, but I
    > don't think that matters much. For small databases who cares, and for
    > large databases it should not be a meaningful difference (setting the
    > checksums already ran over multiple checkpoints, so one checkpoint is
    > not a big difference).
    
    The more I think about it the more I think that updating the control file is
    the wrong thing to do for this patch, it should only change the state in memory
    and let the checkpoints update the controlfile.  The attached fixes that and I
    can no longer reproduce the assertion failure you hit.
    
    The attached version also contains updates to the documentation, the aux proc
    counter and other smaller bits of polish.
    
    I did remove parts of the progress reporting for now since it can't be used
    from the dynamic backgroundworker it seems.  I need to regroup and figure out a
    better way there, but I wanted to address your above find sooner rather than
    wait for that.
    
    --
    Daniel Gustafsson
    
    
  11. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-10-08T20:39:57Z

    > On 7 Oct 2024, at 20:42, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
    > 
    > Tomas Vondra <tomas@vondra.me> writes:
    > 
    >> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
    >>   shadowing earlier variable
    > 
    > All the ListCell variables can be eliminated by using the foreach_ptr
    > and foreach_oid macros instead of plain foreach.
    
    Fair point, done in the v4 attached upthread.
    
    --
    Daniel Gustafsson
    
    
    
    
    
  12. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2024-10-09T10:41:02Z

    
    On 10/8/24 22:38, Daniel Gustafsson wrote:
    > 
    >> 7) controldata.c - maybe this
    >>
    >>   if (oldctrl->data_checksum_version == 2)
    >>
    >> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
    >> constant? For "off" we use "0" which seems somewhat acceptable, but for
    >> other values it's less obvious what the meaning is.
    > 
    > It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
    > we should move (or mirror) the checksum versions to storage/checksum_impl.h to
    > make them available to frontend and backend tools?
    > 
    
    +1 to have checksum_impl.h
    
    > 
    >> But there also seems to be a more serious problem with recovery. I did a
    >> simple script that does a loop of
    >>
    >> * start a cluster
    >> * initialize a small pgbench database (scale 1 - 100)
    >> * run "long" pgbench
    >> * call pg_enable_data_checksums(), wait for it to complete
    >> * stop the cluster with "-m immediate"
    >> * start the cluster
    >>
    >> And this unfortunately hits this assert:
    >>
    >>  bool
    >>  AbsorbChecksumsOnBarrier(void)
    >>  {
    >>    Assert(LocalDataChecksumVersion ==
    >> PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    >>    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
    >>    return true;
    >>  }
    >>
    >> Based on our short discussion about this, the controlfile gets updated
    >> right after pg_enable_data_checksums() completes. The immediate stop
    >> however forces a recovery since the last checkpoint, which means we see
    >> the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
    >> exit recovery, try to start checkpointer and it trips over this, because
    >> the control file already has the "on" value :-(
    >>
    >> I'm not sure what's the best way to fix this. Would it be possible to
    >> remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
    >> noop in that case? Or not set the barrier when exiting recovery. I'm not
    >> sure the relaxed assert would remain meaningful, though. What would it
    >> check on standbys, for example?
    >>
    >> Maybe a better way would be to wait for a checkpoint before updating the
    >> controlfile, similar to what we do at the beginning? Possibly even with
    >> the same "fast=true/false" logic. That would prevent us from seeing the
    >> XLOG_CHECKSUMS wal record with the updated flag. It would extend the
    >> "window" where a crash would mean we have to redo the checksums, but I
    >> don't think that matters much. For small databases who cares, and for
    >> large databases it should not be a meaningful difference (setting the
    >> checksums already ran over multiple checkpoints, so one checkpoint is
    >> not a big difference).
    > 
    > The more I think about it the more I think that updating the control file is
    > the wrong thing to do for this patch, it should only change the state in memory
    > and let the checkpoints update the controlfile.  The attached fixes that and I
    > can no longer reproduce the assertion failure you hit.
    > 
    
    I think leaving the update of controlfile to checkpointer is correct,
    and probably the only way to make this correct (without race
    conditions). We need to do that automatically with the checkpoint (which
    updates the redo LSN, guaranteeing we won't see the XLOG_CHECKSUMS
    record again).
    
    I ran the tests with this new patch, and I haven't reproduced the
    crashes. I'll let it run a bit longer, and improve it to test some more
    stuff, but it looks good.
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  13. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-10-11T07:57:13Z

    > On 9 Oct 2024, at 12:41, Tomas Vondra <tomas@vondra.me> wrote:
    
    >>> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
    >>> constant? For "off" we use "0" which seems somewhat acceptable, but for
    >>> other values it's less obvious what the meaning is.
    >> 
    >> It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
    >> we should move (or mirror) the checksum versions to storage/checksum_impl.h to
    >> make them available to frontend and backend tools?
    > 
    > +1 to have checksum_impl.h
    
    I tried various different ways of breaking out the checksum version into
    another header file but all of them ended up messier than the current state due
    to how various tools include the checksum code.  In the end I opted for doing
    the bufpage include to keep it simple.  This patch is big enough as it is
    without additional refactoring of checksum (header) code, that can be done
    separately from this.
    
    > I ran the tests with this new patch, and I haven't reproduced the
    > crashes. I'll let it run a bit longer, and improve it to test some more
    > stuff, but it looks good.
    
    Thanks for testing, I am too unable to reproduce that error.
    
    The attached v5 has the above include fix as well as pgindent and pgperltidy
    runs and some tweaking to the commit message to make it concise. It's also
    rebased to handle a recent conflict in the makefiles.
    
    --
    Daniel Gustafsson
    
    
  14. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-11-04T11:27:51Z

    Attached is a rebased v6 fixing the tests to handle that checksums are now on
    by default, no other changes are made as no outstanding review comments or
    identified bugs exist.
    
    Does anyone object to going ahead with this?
    
    --
    Daniel Gustafsson
    
    
  15. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-11-05T12:51:05Z

    > On 4 Nov 2024, at 12:27, Daniel Gustafsson <daniel@yesql.se> wrote:
    > 
    > Attached is a rebased v6 fixing the tests to handle that checksums are now on
    > by default, no other changes are made as no outstanding review comments or
    > identified bugs exist.
    > 
    > Does anyone object to going ahead with this?
    
    And a new rebase to cope with recent changes,
    
    --
    Daniel Gustafsson
    
    
  16. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2024-11-06T09:27:08Z

    > On 5 Nov 2024, at 13:51, Daniel Gustafsson <daniel@yesql.se> wrote:
    > 
    >> On 4 Nov 2024, at 12:27, Daniel Gustafsson <daniel@yesql.se> wrote:
    >> 
    >> Attached is a rebased v6 fixing the tests to handle that checksums are now on
    >> by default, no other changes are made as no outstanding review comments or
    >> identified bugs exist.
    >> 
    >> Does anyone object to going ahead with this?
    > 
    > And a new rebase to cope with recent changes,
    
    ..and one more since I forgot to git add the new expected output testfile.
    
    --
    Daniel Gustafsson
    
    
  17. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2024-11-08T00:41:10Z

    Hi,
    
    Unfortunately it seems we're not out of the woods yet :-(
    
    I started doing some more testing on the v8 patch. My plan was to do
    some stress testing with physical replication, random restarts and stuff
    like that. But I ran into issues before that.
    
    Attached is a reproducer script, that does this:
    
    1) initializes an instance with a small (scale 10) pgbench database
    
    2) runs a pgbench in the background, and flips checksums
    
    3) restarts the database with fast or immediate mode
    
    4) watches for checksums state until it reaches expected value
    
    5) restarts the instance
    
    Of course, the restart interrupts the checksum enable, with this message
    in the log:
    
    WARNING:  data checksums are being enabled, but no worker is running
    1731024482.102 2024-11-08 01:08:02.102 CET [267066] [startup:]
    [672d5660.4133a:7] [2024-11-08 01:08:00 CET] [/0] HINT:  If checksums
    were being enabled during shutdown then processing must be manually
    restarted.
    
    That's expected, of course. So I did
    
      SELECT pg_enable_data_checksums()
    
    and "datachecksumsworker launcher" appeared in pg_stat_activity, but
    nothing else was happening. It also says:
    
      Waiting for worker in database template0 (pid 258442)
    
    But there are no workers with that PID. Not in the OS, not in the view,
    not in the server log. Seems a bit weird. Maybe it already completed,
    but then why is there a launcher waiting for it?
    
    Ultimately I tried running CHECKPOINT, And that apparently did the
    trick, and the instance restarted. But then on start it hits an assert that:
    
      (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
    
    But this only happens in the final stop is -m immediate. If I change it
    to "-m fast" it works.
    
    I haven't looked into the details, but I guess it's related to the issue
    with controlfile update we dealt with about a month ago.
    
    Attached is the test.sh file (make sure to tweak the paths), and an
    example of the backtraces. I've seen various processes hitting that.
    
    
    Two more comments:
    
    * It's a bit surprising that pg_disable_data_checksums() flips the state
    right away, while pg_enable_data_checksums() waits for a checkpoint. I
    guess it's correct, but maybe the docs should mention this difference?
    
    * The docs currently say:
    
    <para>
     If the cluster is stopped while in <literal>inprogress-on</literal> mode,
     for any reason, then this process must be restarted manually. To do this,
     re-execute the function <function>pg_enable_data_checksums()</function>
     once the cluster has been restarted. The background worker will attempt
     to resume the work from where it was interrupted.
    </para>
    
    I believe that's incorrect/misleading. There's no attempt to resume work
    from where it was interrupted.
    
    
    regards
    
    -- 
    Tomas Vondra
    
  18. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2024-11-26T22:07:12Z

    Hi,
    
    I spent a bit more time doing some testing on the last version of the
    patch from [1]. And I ran into this assert in PostmasterStateMachine()
    when stopping the cluster:
    
      /* All types should be included in targetMask or remainMask */
      Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
    
    At first I was puzzled as this happens on every shutdown, but that's
    because these checks were introduced by a78af0427015 a week ago. So it's
    more a matter of rebasing.
    
    However, I also noticed the progress monitoring does not really work. I
    get stuff like this:
    
        + psql -x test -c 'select * from pg_stat_progress_data_checksums'
        -[ RECORD 1 ]---------------------+---------
        pid                               | 56811
        datid                             | 0
        datname                           |
        phase                             | enabling
        databases_total                   | 4
        relations_total                   |
        databases_processed               | 0
        relations_processed               | 0
        databases_current                 | 16384
        relation_current                  | 0
        relation_current_blocks           | 0
        relation_current_blocks_processed | 0
    
    But I've never seen any of the "processed" fields to be non-zero (and
    relations is even NULL), and the same thing applies to relation_. Also
    what is the datid/datname about? It's empty, not mentioned in sgml docs,
    and we already have databases_current ...
    
    The message [2] from 10/08 says:
    
    > I did remove parts of the progress reporting for now since it can't be
    > used from the dynamic backgroundworker it seems.  I need to regroup
    > and figure out a better way there, but I wanted to address your above
    > find sooner rather than wait for that.
    
    And I guess that would explain why some of the fields are not updated.
    But then the later patch versions seem to imply there are no outstanding
    issues / missing stuff.
    
    
    regards
    
    
    [1]
    https://www.postgresql.org/message-id/CA226DE1-DC9A-4675-A83C-32270C473F0B%40yesql.se
    
    [2]
    https://www.postgresql.org/message-id/DD25705F-E75F-4DCA-B49A-5578F4F55D94%40yesql.se
    
    -- 
    Tomas Vondra
    
    
    
    
    
  19. Re: Changing the state of data checksums in a running cluster

    Michael Paquier <michael@paquier.xyz> — 2024-12-11T04:47:08Z

    On Tue, Nov 26, 2024 at 11:07:12PM +0100, Tomas Vondra wrote:
    > I spent a bit more time doing some testing on the last version of the
    > patch from [1]. And I ran into this assert in PostmasterStateMachine()
    > when stopping the cluster:
    > 
    >   /* All types should be included in targetMask or remainMask */
    >   Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
    > 
    > At first I was puzzled as this happens on every shutdown, but that's
    > because these checks were introduced by a78af0427015 a week ago. So it's
    > more a matter of rebasing.
    
    Looks like the CI is not really happy about this point..  (Please make
    sure to refresh the patch status after a review.)
    --
    Michael
    
  20. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-09T17:57:52Z

    Here's a rebased version of the patch series, addressing the issues I've
    pointed out in the last round of reviews. I've kept the changes in
    separate patches for clarity, but it should be squashed into a single
    patch in the end.
    
    
    1) v20250309-0001-Online-enabling-and-disabling-of-data-chec.patch
    ------------------------------------------------------------------
    
    Original patch, rebased, resolving merge conflicts.
    
    
    2) v20250309-0002-simple-post-rebase-fixes.patch
    ------------------------------------------------
    
    A couple minor fixes, addressing test failures due to stuff committed
    since the previous patch version. Mostly mechanical, the main change is
    I don't think the pgstat_bestart() call is necessary. Or is it?
    
    
    3) v20250309-0003-sync-the-data_checksums-GUC-with-the-local.patch
    ------------------------------------------------------------------
    
    This is the main change, fixing failures in 002_actions.pl - the short
    version is that test does "-C data_checksums", but AFAICS that does not
    quite work because it does not call show_data_checksums() that early,
    and instead just grabs the variable backing the GUC. Which may be out of
    sync, so this patch fixes that by updating them both.
    
    That fixes the issue, but it's it a bit strange we now have three places
    tracking the state of data checksums? We have data_checksum_version in
    the control file, and then data_checksums and LocalDataChecksumVersion
    in the backends.
    
    Would it be possible to "unify" the latter two? That would also mean we
    don't have the duplicate constants like PG_DATA_CHECKSUM_VERSION and
    DATA_CHECKSUM_VERSION. Or why do we need that?
    
    
    4) v20250309-0004-make-progress-reporting-work.patch
    ----------------------------------------------------
    
    The progress reporting got "mostly disabled" in an earlier version, due
    to issues with the bgworkers. AFAICS the issue is the "status" row can
    be updated only by a single process, which does not quite work with the
    launcher + per-db workers architecture.
    
    I've considered a couple different approaches:
    
    
    a) updating the status only from the launcher
    
    This is mostly what CREATE INDEX does with parallel builds, and there
    it's mostly sufficient. But for checksums it'd mean we only have the
    number of databases to process/done, and that seems unsatisfactory,
    considering large clusters often have only a single large database. So
    not good enough, IMHO.
    
    
    b) allowing workers to update the status row, created by the launcher
    
    I guess this would be better, we'd know the relations/blocks counts. And
    I haven't tried coding this, but there would need to be some locking so
    that the workers don't overwrite increments from other workers, etc.
    
    But I don't think it'd work nicely with parallel per-db workers (which
    we don't do now, but we might).
    
    
    c) having one status entry per worker
    
    I ended up doing this, it didn't require any changes to the progress
    infrastructure, and it will work naturally even with multiple workers.
    There will always be one row for launcher status (which only has the
    number of databases total/done), and then one row per worker, with
    database-level info (datid, datname, #relations, #blocks).
    
    I removed the "DONE" phase, because that's right before the launcher
    exists, and I don't think we have that for similar cases. And I added
    "waiting on checkpoint" state, because that's often a long wait when the
    launcher seems to do nothing, so it seems useful to communicate the
    reason for that wait.
    
    
    5) v20250309-0005-update-docs.patch
    -----------------------------------
    
    Minor tweaks to docs, to reflect the changes to the progress reporting
    changes, and also some corrections (no resume after restart, ...).
    
    
    So far this passed all my tests - both chekc-world and stress testing
    (no failures / assert crashes, ...). One thing that puzzles me is I
    wasn't able to reproduce the failures reported in [1] - not even with
    just the rebase + minimal fixes (0001 + 0002).  My best theory is this
    is somehow machine-specific, and my laptop is too slow or something.
    I'll try with the machine I used before once it gets available.
    
    
    regards
    
    
    [1]
    https://www.postgresql.org/message-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e%40vondra.me
    
    
    -- 
    Tomas Vondra
    
  21. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-09T23:35:29Z

    Seems cfbot was unhappy with the patches, so here's an improved version,
    fixing some minor issues in expected output and a compiler warning.
    
    There however seems to be some issue with 003_standby_restarts, which
    causes failures on freebsd and macos. I don't know what that is about,
    but the test runs much longer than on debian.
    
    
    regards
    
    -- 
    Tomas Vondra
    
  22. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-10T00:18:23Z

    On 3/10/25 00:35, Tomas Vondra wrote:
    > Seems cfbot was unhappy with the patches, so here's an improved version,
    > fixing some minor issues in expected output and a compiler warning.
    > 
    > There however seems to be some issue with 003_standby_restarts, which
    > causes failures on freebsd and macos. I don't know what that is about,
    > but the test runs much longer than on debian.
    > 
    
    OK, turns out the failures were caused by the test creating a standby
    from a backup, without a slot, so sometimes the primary removed the
    necessary WAL. Fixed in the attached version.
    
    There's still a failure on windows, though. I'd bet that's due to the
    data_checksum/LocalDatachecksumVersion sync not working correctly on
    builds with EXEC_BACKEND, or something like that, but it's too late so
    I'll take a closer look tomorrow.
    
    
    regards
    
    -- 
    Tomas Vondra
    
  23. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-10T09:46:32Z

    On 3/10/25 01:18, Tomas Vondra wrote:
    >
    > ...
    >
    > There's still a failure on windows, though. I'd bet that's due to the
    > data_checksum/LocalDatachecksumVersion sync not working correctly on
    > builds with EXEC_BACKEND, or something like that, but it's too late so
    > I'll take a closer look tomorrow.
    > 
    
    Just like I suspected, there was a bug in EXEC_BACKEND, although a bit
    different from what I guessed - the worker state in shmem was zeroed
    every time, not just once. And a second issue was child_process_kinds
    got out of sync with BackendType (mea culpa).
    
    For me, this passes all CI tests, hopefully cfbot will be happy too.
    
    regards
    
    -- 
    Tomas Vondra
    
  24. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-10T11:17:58Z

    On 3/10/25 10:46, Tomas Vondra wrote:
    > On 3/10/25 01:18, Tomas Vondra wrote:
    >>
    >> ...
    >>
    >> There's still a failure on windows, though. I'd bet that's due to the
    >> data_checksum/LocalDatachecksumVersion sync not working correctly on
    >> builds with EXEC_BACKEND, or something like that, but it's too late so
    >> I'll take a closer look tomorrow.
    >>
    > 
    > Just like I suspected, there was a bug in EXEC_BACKEND, although a bit
    > different from what I guessed - the worker state in shmem was zeroed
    > every time, not just once. And a second issue was child_process_kinds
    > got out of sync with BackendType (mea culpa).
    > 
    > For me, this passes all CI tests, hopefully cfbot will be happy too.
    > 
    
    A bit embarrassing, I did not notice updating child_process_kinds breaks
    the stats regression test, so here's a version fixing that.
    
    
    regards
    
    -- 
    Tomas Vondra
    
  25. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-03-10T13:27:06Z

    > On 10 Mar 2025, at 12:17, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    > On 3/10/25 10:46, Tomas Vondra wrote:
    >> On 3/10/25 01:18, Tomas Vondra wrote:
    
    Thank you so much for picking up and fixing the blockers, it's highly appreciated!
    
    >> For me, this passes all CI tests, hopefully cfbot will be happy too.
    
    Confirmed, it compiles clean, builds docs and passes all tests for me as well.
    
    A few comments from reading over your changes:
    
    +   launcher worker has this value set, the other worker processes
    +   have this <literal>NULL</literal>.
    There seems to be a word or two missing (same in a few places), should this be
    "have this set to NULL"?
    
    
    +   The command is currently waiting for a checkpoint to update the checksum
    +   state at the end.
    s/at the end/before finishing/?
    
    
    + * XXX aren't PG_DATA_ and DATA_ constants the same? why do we need both?
    They aren't mapping 1:1 as PG_DATA_ has the version numbers, and if checksums
    aren't enabled there is no version and thus there is no PG_DATA_CHECKSUMS_OFF.
    This could of course be remedied.  IIRC one reason for adding the enum was to
    get compiler warnings on missing cases when switch()ing over the value, but I
    don't think the current code has any switch.
    
    
    +   /* XXX isn't it weird there's no wait between the phase updates? */
    It is, I think we should skip PROGRESS_DATACHECKSUMS_PHASE_WAITING_BACKENDS in
    favor of PROGRESS_DATACHECKSUMS_PHASE_ENABLING.
    
    
    +   * When enabling checksums, we have to wait for a checkpoint for the
    +   * checksums to e.
    Seems to be missing the punchline, "for the checksum state to be moved from
    in-progress to on" perhaps?
    
    
    It also needs a pgindent and pgperltidy but there were only small trivial
    changes there.
    
    Thanks again for updating the patch!
    
    --
    Daniel Gustafsson
    
    
    
    
    
  26. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-10T15:01:35Z

    On 3/10/25 14:27, Daniel Gustafsson wrote:
    >> On 10 Mar 2025, at 12:17, Tomas Vondra <tomas@vondra.me> wrote:
    >>
    >> On 3/10/25 10:46, Tomas Vondra wrote:
    >>> On 3/10/25 01:18, Tomas Vondra wrote:
    > 
    > Thank you so much for picking up and fixing the blockers, it's highly appreciated!
    > 
    >>> For me, this passes all CI tests, hopefully cfbot will be happy too.
    > 
    > Confirmed, it compiles clean, builds docs and passes all tests for me as well.
    > 
    > A few comments from reading over your changes:
    > 
    > +   launcher worker has this value set, the other worker processes
    > +   have this <literal>NULL</literal>.
    > There seems to be a word or two missing (same in a few places), should this be
    > "have this set to NULL"?
    > 
    
    done
    
    > 
    > +   The command is currently waiting for a checkpoint to update the checksum
    > +   state at the end.
    > s/at the end/before finishing/?
    > 
    
    done
    
    > 
    > + * XXX aren't PG_DATA_ and DATA_ constants the same? why do we need both?
    > They aren't mapping 1:1 as PG_DATA_ has the version numbers, and if checksums
    > aren't enabled there is no version and thus there is no PG_DATA_CHECKSUMS_OFF.
    > This could of course be remedied.  IIRC one reason for adding the enum was to
    > get compiler warnings on missing cases when switch()ing over the value, but I
    > don't think the current code has any switch.
    > 
    
    I haven't done anything about this. I'm not convinced it's an issue we
    need to fix, and I haven't tried how much work would it be.
    
    > 
    > +   /* XXX isn't it weird there's no wait between the phase updates? */
    > It is, I think we should skip PROGRESS_DATACHECKSUMS_PHASE_WAITING_BACKENDS in
    > favor of PROGRESS_DATACHECKSUMS_PHASE_ENABLING.
    > 
    
    Removed the WAITING_BACKENDS phase.
    
    > 
    > +   * When enabling checksums, we have to wait for a checkpoint for the
    > +   * checksums to e.
    > Seems to be missing the punchline, "for the checksum state to be moved from
    > in-progress to on" perhaps?
    > 
    
    done
    
    > 
    > It also needs a pgindent and pgperltidy but there were only small trivial
    > changes there.
    > 
    
    done
    
    Attached is an updated version.
    
    
    -- 
    Tomas Vondra
    
  27. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-10T15:16:02Z

    One thing I forgot to mention is the progress reporting only updates
    blocks for the FORK_MAIN. It wouldn't be difficult to report blocks for
    each fork, but it'd be confusing - the relation counters would remain
    the same, but the block counters would change for each fork.
    
    I guess we could report the current_relation/fork, but it seems like an
    overkill. The main fork is by far the largest one, so this seems OK.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  28. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-10T17:35:56Z

    Hi,
    
    I continued stress testing this, as I was rather unsure why the assert
    failures reported in [1] disappeared. And I managed to reproduce that
    again, and I think I actually understand why it happens.
    
    I modified the test script (attached) to setup replication, not just a
    single instance. And then it does a bit of work, flips the checksums,
    restarts the instances (randomly, fast/immediate), verifies the checkums
    and so on. And I can hit this assert in AbsorbChecksumsOnBarrier()
    pretty easily:
    
      Assert(LocalDataChecksumVersion ==
             PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    
    The reason is pretty simple - this happens on the standby:
    
    1) standby receives XLOG_CHECKSUMS and applies it from 2 to 1 (i.e. it
    sets ControlFile->data_checksum_version from "inprogress-on" to "on"),
    and signals all other processes to refresh LocalDataChecksumVersion
    
    2) the control file gets written to disk for whatever reason (redo does
    this in a number of places)
    
    3) standby gets restarted with "immediate" mode (I'm not sure if this
    can happen with "fast" mode, I only recall seeing "immediate")
    
    4) the standby receives the XLOG_CHECKSUMS record *again*, updates the
    ControlFile->data_checksum_version (to the same value, no noop), and
    then signals the other processes again
    
    5) the other processes already have LocalDataChecksumVersion=1 (on), but
    the assert says it should be 2 (inprogress-on) => kaboom
    
    I believe this can happen for changes in either direction, although the
    window while disabling checksums is more narrow.
    
    
    I'm not sure what to do about this. Maybe we could relax the assert in
    some way? But that seems a bit ... possibly risky. It's not necessarily
    true we'll see the immediately preceding checksum state, we might see a
    couple updates back (if the control file was not updated in between).
    
    Could this affect checksum verification during recovery? Imagine we get
    to the "on" state, the controlfile gets flushed, and then the standby
    restarts and starts receiving older records again. The control file says
    we should be verifying checksums, but couldn't some of the writes have
    been lost (and so the pages may not have a valid checksum)?
    
    The one idea I have is to create an "immediate" restartpoint in
    xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
    a "spread" restartpoint would not be enough, because then we could get
    into the same situation with a control file of sync (ahead of WAL) after
    a restart. It'd not be cheap, but it should be a rare operation ...
    
    I was wondering if the primary has the same issue, but AFAICS it does
    not. It flushes the control file in only a couple places, I couldn't
    think of a way to get it out of sync.
    
    
    regards
    
    [1]
    https://www.postgresql.org/message-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e%40vondra.me
    
    -- 
    Tomas Vondra
    
  29. Re: Changing the state of data checksums in a running cluster

    Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> — 2025-03-11T13:07:35Z

    As the resident perl style pedant, I'd just like to complain about the
    below:
    
    Tomas Vondra <tomas@vondra.me> writes:
    
    > diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
    > index 666bd2a2d4c..1c66360c16c 100644
    > --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
    > +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
    > @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
    >  	my ($self) = @_;
    >  
    >  	print "### Enabling checksums in \"$self->data_dir\"\n";
    > -	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
    > +	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
    > +		$self->data_dir, '-e');
    >  	return;
    >  }
    
    
    This breaking between the command line options and its arguments is why
    we're switching to using fat commas. We're also using long options for
    improved self-documentation, so this should be written as:
    
    	PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
    		'--pgdata' => $self->data_dir,
    		'--enable');
    
    And likewise below in the disable method.
    
    - ilmari
    
    
    
    
  30. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-12T12:28:50Z

    On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote:
    > As the resident perl style pedant, I'd just like to complain about the
    > below:
    > 
    > Tomas Vondra <tomas@vondra.me> writes:
    > 
    >> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
    >> index 666bd2a2d4c..1c66360c16c 100644
    >> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
    >> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
    >> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
    >>  	my ($self) = @_;
    >>  
    >>  	print "### Enabling checksums in \"$self->data_dir\"\n";
    >> -	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
    >> +	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
    >> +		$self->data_dir, '-e');
    >>  	return;
    >>  }
    > 
    > 
    > This breaking between the command line options and its arguments is why
    > we're switching to using fat commas. We're also using long options for
    > improved self-documentation, so this should be written as:
    > 
    > 	PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
    > 		'--pgdata' => $self->data_dir,
    > 		'--enable');
    > 
    > And likewise below in the disable method.
    > 
    
    I don't know what fat comma is, but that's simply what perltidy did. I
    don't mind formatting it differently, if there's a better way.
    
    
    thanks
    
    -- 
    Tomas Vondra
    
    
    
    
    
  31. Re: Changing the state of data checksums in a running cluster

    Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> — 2025-03-12T13:08:50Z

    Tomas Vondra <tomas@vondra.me> writes:
    
    > On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote:
    >> As the resident perl style pedant, I'd just like to complain about the
    >> below:
    >> 
    >> Tomas Vondra <tomas@vondra.me> writes:
    >> 
    >>> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
    >>> index 666bd2a2d4c..1c66360c16c 100644
    >>> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
    >>> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
    >>> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
    >>>  	my ($self) = @_;
    >>>  
    >>>  	print "### Enabling checksums in \"$self->data_dir\"\n";
    >>> -	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
    >>> +	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
    >>> +		$self->data_dir, '-e');
    >>>  	return;
    >>>  }
    >> 
    >> 
    >> This breaking between the command line options and its arguments is why
    >> we're switching to using fat commas. We're also using long options for
    >> improved self-documentation, so this should be written as:
    >> 
    >> 	PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
    >> 		'--pgdata' => $self->data_dir,
    >> 		'--enable');
    >> 
    >> And likewise below in the disable method.
    >> 
    >
    > I don't know what fat comma is, but that's simply what perltidy did. I
    > don't mind formatting it differently, if there's a better way.
    
    Fat comma is the perlish name for the => arrow, which is semantically
    equivalent to a comma (except it auto-quotes any immediately preceding
    bareword), but looks fatter.  Perltidy knows to not wrap lines around
    them, keeping the key and value (or option and argument in this case)
    together.  See commit ce1b0f9da03 for a large (but not complete, I have
    more patches pending) conversion to this new style.
    
    - ilmari
     
    
    
    
    
  32. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-12T13:16:41Z

    On 3/10/25 18:35, Tomas Vondra wrote:
    > Hi,
    > 
    > I continued stress testing this, as I was rather unsure why the assert
    > failures reported in [1] disappeared. And I managed to reproduce that
    > again, and I think I actually understand why it happens.
    > 
    > I modified the test script (attached) to setup replication, not just a
    > single instance. And then it does a bit of work, flips the checksums,
    > restarts the instances (randomly, fast/immediate), verifies the checkums
    > and so on. And I can hit this assert in AbsorbChecksumsOnBarrier()
    > pretty easily:
    > 
    >   Assert(LocalDataChecksumVersion ==
    >          PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    > 
    > The reason is pretty simple - this happens on the standby:
    > 
    > 1) standby receives XLOG_CHECKSUMS and applies it from 2 to 1 (i.e. it
    > sets ControlFile->data_checksum_version from "inprogress-on" to "on"),
    > and signals all other processes to refresh LocalDataChecksumVersion
    > 
    > 2) the control file gets written to disk for whatever reason (redo does
    > this in a number of places)
    > 
    > 3) standby gets restarted with "immediate" mode (I'm not sure if this
    > can happen with "fast" mode, I only recall seeing "immediate")
    > 
    > 4) the standby receives the XLOG_CHECKSUMS record *again*, updates the
    > ControlFile->data_checksum_version (to the same value, no noop), and
    > then signals the other processes again
    > 
    > 5) the other processes already have LocalDataChecksumVersion=1 (on), but
    > the assert says it should be 2 (inprogress-on) => kaboom
    > 
    > I believe this can happen for changes in either direction, although the
    > window while disabling checksums is more narrow.
    > 
    > 
    > I'm not sure what to do about this. Maybe we could relax the assert in
    > some way? But that seems a bit ... possibly risky. It's not necessarily
    > true we'll see the immediately preceding checksum state, we might see a
    > couple updates back (if the control file was not updated in between).
    > 
    > Could this affect checksum verification during recovery? Imagine we get
    > to the "on" state, the controlfile gets flushed, and then the standby
    > restarts and starts receiving older records again. The control file says
    > we should be verifying checksums, but couldn't some of the writes have
    > been lost (and so the pages may not have a valid checksum)?
    > 
    > The one idea I have is to create an "immediate" restartpoint in
    > xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
    > a "spread" restartpoint would not be enough, because then we could get
    > into the same situation with a control file of sync (ahead of WAL) after
    > a restart. It'd not be cheap, but it should be a rare operation ...
    > 
    > I was wondering if the primary has the same issue, but AFAICS it does
    > not. It flushes the control file in only a couple places, I couldn't
    > think of a way to get it out of sync.
    > 
    
    I continued investigating this and experimenting with alternative
    approaches, and I think the way the patch relies on ControlFile is not
    quite right. That is, it always sets data_checksum_version to the last
    ("current") value, but that's not what ControlFile is for ...
    
    The ControlFile is meant to be a safe/consistent state, e.g. for crash
    recovery. By setting data_checksum_version to the "last" value we've
    seen, that's broken - if the control file gets persisted (haven't seen
    this on primary, but pretty common on replica, per the report), the
    recovery will start with a "future" data_checksum_version value. Which
    is wrong - we'll read the XLOG_CHECKUMS record, triggering the assert. I
    suspect it might also lead to confusion whether checksums should be
    verified or not.
    
    In my earlier message I suggested maybe this could be solved by forcing
    a checkpoint every time we see the XLOG_CHECKUMS record (or rather a
    restart point, as it'd be on the replica). Sure, that would have some
    undesirable consequences (forcing an immediate checkpoint is not cheap,
    and the redo would need to wait for that). But the assumption was it'd
    be very rare (how often you enable checksums?), so this cost might be
    acceptable.
    
    But when I started experimenting with this, I realized it has a couple
    other issues:
    
    1) We can't do the checkpoint/restartpoint when handling XLOG_CHECKUMS,
    because that'd mean we see this XLOG record again, which we don't want.
    So the checkpoint would need to happen the *next* time we update the
    control file.
    
    2) But we can't trigger a checkpoint from UpdateControlFile, because of
    locking (because CreateCheckPoint also calls UpdateControlFile). So this
    would require much more invasive changes to all places updating the
    control file.
    
    3) It does not resolve the mismatch with using ControlFile to store
    "current" data_checksums_version value.
    
    4) ... probably more minor issues that I already forgot about.
    
    In the end, I decided to try to rework this by storing the current value
    elsewhere, and only updating the "persistent" value in the control file
    when necessary.
    
    XLogCtl seemed like a good place, so I used that - after all, it's a
    value from XLOG. Maybe there's a better place? I'm open to suggestions,
    but it does not really affect the overall approach.
    
    So all the places now update XLogCtl->data_checksums_version instead of
    the ControlFile, and also query this flag for *current* value.
    
    The value is copied from XLogCtl to ControlFile when creating checkpoint
    (or restartpoint), and the control file is persisted. This means (a) the
    current value can't get written to the control file prematurely, and (b)
    the value is consistent with the checkpoint (i.e. with the LSN where we
    start crash recovery, if needed).
    
    The attached 0005 patch implements this. It's a bit WIP and I'm sure it
    can be improved, but I'm yet to see a single crash/failure with it. With
    the original patch I've seen crashes after 5-10 loops (i.e. a couple
    minutes), I'm now at loop 1000 and it's still OK.
    
    I believe the approach is correct, but the number of possible states
    (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
    better way to handle this, but I can't think of any. Ideas?
    
    One issue I ran into is the postmaster does not seem to be processing
    the barriers, and thus not getting info about the data_checksum_version
    changes. That's fine until it needs to launch a child process (e.g. a
    walreceiver), which will then see the LocalDataChecksumVersion as of the
    start of the instance, not the "current" one. I fixed this by explicitly
    refreshing the value in postmaster_child_launch(), but maybe I'm missing
    something. (Also, EXEC_BACKEND may need to handle this too.)
    
    
    regards
    
    -- 
    Tomas Vondra
    
  33. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-03-13T09:54:28Z

    > On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
    
    > I continued investigating this and experimenting with alternative
    > approaches, and I think the way the patch relies on ControlFile is not
    > quite right. That is, it always sets data_checksum_version to the last
    > ("current") value, but that's not what ControlFile is for ...
    
    Agreed, that's a thinko on my part.  Reading it makes it clear, but I had
    failed to see that when hacking =/
    
    > XLogCtl seemed like a good place, so I used that - after all, it's a
    > value from XLOG. Maybe there's a better place? I'm open to suggestions,
    > but it does not really affect the overall approach.
    
    Seems like a good place for it.
    
    > So all the places now update XLogCtl->data_checksums_version instead of
    > the ControlFile, and also query this flag for *current* value.
    > 
    > The value is copied from XLogCtl to ControlFile when creating checkpoint
    > (or restartpoint), and the control file is persisted. This means (a) the
    > current value can't get written to the control file prematurely, and (b)
    > the value is consistent with the checkpoint (i.e. with the LSN where we
    > start crash recovery, if needed).
    
    +1
    
    > The attached 0005 patch implements this. It's a bit WIP and I'm sure it
    > can be improved, but I'm yet to see a single crash/failure with it. With
    > the original patch I've seen crashes after 5-10 loops (i.e. a couple
    > minutes), I'm now at loop 1000 and it's still OK.
    
    Given how successful this test has been at stressing out errors that is indeed
    comforting to hear.
    
    > I believe the approach is correct, but the number of possible states
    > (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
    > better way to handle this, but I can't think of any. Ideas?
    
    Not sure if this moves the needle too far in terms of complexity wrt to the
    previous version of the patch, there were already multiple copies.
    
    > One issue I ran into is the postmaster does not seem to be processing
    > the barriers, and thus not getting info about the data_checksum_version
    > changes.
    
    Makes sense, that seems like a pretty reasonable constraint for the barrier.
    
    > That's fine until it needs to launch a child process (e.g. a
    > walreceiver), which will then see the LocalDataChecksumVersion as of the
    > start of the instance, not the "current" one. I fixed this by explicitly
    > refreshing the value in postmaster_child_launch(), but maybe I'm missing
    > something. (Also, EXEC_BACKEND may need to handle this too.)
    
    The pg_checksums test is failing for me on this version due to the GUC not
    being initialized, don't we need something like the below as well?  (With a
    comment explaining why ReadControlFile wasn't enough.)
    
    @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
            Assert(reset || ControlFile == NULL);
            ControlFile = palloc(sizeof(ControlFileData));
            ReadControlFile();
    +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
    
    A few comments on the patchset:
    
    + * Local state fror Controlfile data_checksum_version. After initialization
    s/fror/for/.  Also, this is no longer true as it's a local copy of the XlogCtl
    value and not the Controlfile value (which may or may not be equal).
    
    
    - if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
    + if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
     	ereport(WARNING,
     		(errmsg("data checksums are being enabled, but no worker is running"),
     		 errhint("If checksums were being enabled during shutdown then processing must be manually restarted.")));
    Reading this made me realize what a terrible error message I had placed there,
    the hint is good but the message says checksums are being enabled but they're
    not being enabled.  Maybe "data checksums are marked as being in-progress, but
    no worker is running"
    
    
    +uint32
    +GetLocalDataChecksumVersion(void)
    +{
    + return LocalDataChecksumVersion;
    +}
    +
    +/*
    + * Get the *current* data_checksum_version (might not be written to control
    + * file yet).
    + */
    +uint32
    +GetCurrentDataChecksumVersion(void)
    +{
    + return XLogCtl->data_checksum_version;
    +}
    I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
    it from the Current value, and also to make appear less like actual copies of
    controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
    have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
    should be a better API no?
    
    --
    Daniel Gustafsson
    
    
    
    
    
  34. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-13T11:03:29Z

    On 3/13/25 10:54, Daniel Gustafsson wrote:
    >> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    >> I continued investigating this and experimenting with alternative
    >> approaches, and I think the way the patch relies on ControlFile is not
    >> quite right. That is, it always sets data_checksum_version to the last
    >> ("current") value, but that's not what ControlFile is for ...
    > 
    > Agreed, that's a thinko on my part.  Reading it makes it clear, but I had
    > failed to see that when hacking =/
    > 
    
    It wasn't obvious to me either, until I managed to trigger the failure
    and investigated the root cause.
    
    >> XLogCtl seemed like a good place, so I used that - after all, it's a
    >> value from XLOG. Maybe there's a better place? I'm open to suggestions,
    >> but it does not really affect the overall approach.
    > 
    > Seems like a good place for it.
    > 
    
    OK
    
    >> So all the places now update XLogCtl->data_checksums_version instead of
    >> the ControlFile, and also query this flag for *current* value.
    >>
    >> The value is copied from XLogCtl to ControlFile when creating checkpoint
    >> (or restartpoint), and the control file is persisted. This means (a) the
    >> current value can't get written to the control file prematurely, and (b)
    >> the value is consistent with the checkpoint (i.e. with the LSN where we
    >> start crash recovery, if needed).
    > 
    > +1
    > 
    
    OK. I still want to go over the places once more and double check it
    sets the ControlFile value to the right data_checksum_version.
    
    >> The attached 0005 patch implements this. It's a bit WIP and I'm sure it
    >> can be improved, but I'm yet to see a single crash/failure with it. With
    >> the original patch I've seen crashes after 5-10 loops (i.e. a couple
    >> minutes), I'm now at loop 1000 and it's still OK.
    > 
    > Given how successful this test has been at stressing out errors that is indeed
    > comforting to hear.
    > 
    
    It is. I plan to vary the stress test a bit more, and also run it on
    another machine (rpi5, to get some non-x86 testing).
    
    >> I believe the approach is correct, but the number of possible states
    >> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
    >> better way to handle this, but I can't think of any. Ideas?
    > 
    > Not sure if this moves the needle too far in terms of complexity wrt to the
    > previous version of the patch, there were already multiple copies.
    > 
    
    It does add one more place (XLogCtl->data_checksum_version) to store the
    current state, so it's not that much more complex, ofc. But I was not
    really comparing this to the previous patch version, I meant the state
    space in general - all possible combinations of all the flags (control
    file, local + xlogct).
    
    I wonder if it might be possible to have a more thorough validation of
    the transitions. We already have that for the LocalDataChecksumVersion,
    thanks to the asserts - and it was damn useful, otherwise we would not
    have noticed this issue for a long time, I think.
    
    I wonder if we can have similar checks for the other flags. I'm pretty
    sure we can have the same checks for XLogCtl, right? I'm not quite sure
    about ControlFile - can't that "skip" some of the changes, e.g. if we do
    (enable->disable->enable) within a single checkpoint? Need to check.
    
    This also reminds me I had a question about the barrier - can't it
    happen a process gets to process multiple barriers at the same time? I
    mean, let's say it gets stuck for a while, and the cluster happens to go
    through disable+enable. Won't it then see both barriers? That'd be a
    problem, because the core processes the barriers in the order determined
    by the enum value, not in the order the barriers happened. Which means
    it might break the expected state transitions again (and end with the
    wrong local value). I haven't tried, though.
    
    >> One issue I ran into is the postmaster does not seem to be processing
    >> the barriers, and thus not getting info about the data_checksum_version
    >> changes.
    > 
    > Makes sense, that seems like a pretty reasonable constraint for the barrier.
    > 
    
    Not sure I follow. What's a reasonable constraint?
    
    >> That's fine until it needs to launch a child process (e.g. a
    >> walreceiver), which will then see the LocalDataChecksumVersion as of the
    >> start of the instance, not the "current" one. I fixed this by explicitly
    >> refreshing the value in postmaster_child_launch(), but maybe I'm missing
    >> something. (Also, EXEC_BACKEND may need to handle this too.)
    > 
    > The pg_checksums test is failing for me on this version due to the GUC not
    > being initialized, don't we need something like the below as well?  (With a
    > comment explaining why ReadControlFile wasn't enough.)
    > 
    > @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
    >         Assert(reset || ControlFile == NULL);
    >         ControlFile = palloc(sizeof(ControlFileData));
    >         ReadControlFile();
    > +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
    > 
    
    Yeah, I think this (or something like it) is missing.
    
    > A few comments on the patchset:
    > 
    > + * Local state fror Controlfile data_checksum_version. After initialization
    > s/fror/for/.  Also, this is no longer true as it's a local copy of the XlogCtl
    > value and not the Controlfile value (which may or may not be equal).
    > 
    > 
    > - if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
    > + if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
    >  	ereport(WARNING,
    >  		(errmsg("data checksums are being enabled, but no worker is running"),
    >  		 errhint("If checksums were being enabled during shutdown then processing must be manually restarted.")));
    > Reading this made me realize what a terrible error message I had placed there,
    > the hint is good but the message says checksums are being enabled but they're
    > not being enabled.  Maybe "data checksums are marked as being in-progress, but
    > no worker is running"
    > 
    
    Makes sense, will reword.
    
    > 
    > +uint32
    > +GetLocalDataChecksumVersion(void)
    > +{
    > + return LocalDataChecksumVersion;
    > +}
    > +
    > +/*
    > + * Get the *current* data_checksum_version (might not be written to control
    > + * file yet).
    > + */
    > +uint32
    > +GetCurrentDataChecksumVersion(void)
    > +{
    > + return XLogCtl->data_checksum_version;
    > +}
    > I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
    > it from the Current value, and also to make appear less like actual copies of
    > controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
    > have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
    > should be a better API no?
    > 
    
    FWIW those functions are for debug logging only, I needed to print the
    values in a couple places outside xlog.c. I don't intend to make that
    part of the patch.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  35. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-03-13T12:32:20Z

    > On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
    > On 3/13/25 10:54, Daniel Gustafsson wrote:
    >>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
    
    >>> I believe the approach is correct, but the number of possible states
    >>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
    >>> better way to handle this, but I can't think of any. Ideas?
    >> 
    >> Not sure if this moves the needle too far in terms of complexity wrt to the
    >> previous version of the patch, there were already multiple copies.
    > 
    > It does add one more place (XLogCtl->data_checksum_version) to store the
    > current state, so it's not that much more complex, ofc. But I was not
    > really comparing this to the previous patch version, I meant the state
    > space in general - all possible combinations of all the flags (control
    > file, local + xlogct).
    
    Fair point.
    
    > I wonder if it might be possible to have a more thorough validation of
    > the transitions. We already have that for the LocalDataChecksumVersion,
    > thanks to the asserts - and it was damn useful, otherwise we would not
    > have noticed this issue for a long time, I think.
    > 
    > I wonder if we can have similar checks for the other flags. I'm pretty
    > sure we can have the same checks for XLogCtl, right?
    
    I don't see why not, they should abide by the same rules.
    
    > I'm not quite sure
    > about ControlFile - can't that "skip" some of the changes, e.g. if we do
    > (enable->disable->enable) within a single checkpoint? Need to check.
    
    For enable->disable->enable within a single checkpoint then ControlFile should
    never see the disable state.
    
    > This also reminds me I had a question about the barrier - can't it
    > happen a process gets to process multiple barriers at the same time? I
    > mean, let's say it gets stuck for a while, and the cluster happens to go
    > through disable+enable. Won't it then see both barriers? That'd be a
    > problem, because the core processes the barriers in the order determined
    > by the enum value, not in the order the barriers happened. Which means
    > it might break the expected state transitions again (and end with the
    > wrong local value). I haven't tried, though.
    
    Interesting, that seems like a general deficiency in the barriers, surely
    processing them in-order would be more intuitive?  That would probably require
    some form of Lamport clock though.
    
    >>> One issue I ran into is the postmaster does not seem to be processing
    >>> the barriers, and thus not getting info about the data_checksum_version
    >>> changes.
    >> 
    >> Makes sense, that seems like a pretty reasonable constraint for the barrier.
    > 
    > Not sure I follow. What's a reasonable constraint?
    
    That the postmaster deosn't process them.
    
    >>> That's fine until it needs to launch a child process (e.g. a
    >>> walreceiver), which will then see the LocalDataChecksumVersion as of the
    >>> start of the instance, not the "current" one. I fixed this by explicitly
    >>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
    >>> something. (Also, EXEC_BACKEND may need to handle this too.)
    >> 
    >> The pg_checksums test is failing for me on this version due to the GUC not
    >> being initialized, don't we need something like the below as well?  (With a
    >> comment explaining why ReadControlFile wasn't enough.)
    >> 
    >> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
    >>        Assert(reset || ControlFile == NULL);
    >>        ControlFile = palloc(sizeof(ControlFileData));
    >>        ReadControlFile();
    >> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
    >> 
    > 
    > Yeah, I think this (or something like it) is missing.
    
    Thanks for confirming.
    
    >> +uint32
    >> +GetLocalDataChecksumVersion(void)
    >> +{
    >> + return LocalDataChecksumVersion;
    >> +}
    >> +
    >> +/*
    >> + * Get the *current* data_checksum_version (might not be written to control
    >> + * file yet).
    >> + */
    >> +uint32
    >> +GetCurrentDataChecksumVersion(void)
    >> +{
    >> + return XLogCtl->data_checksum_version;
    >> +}
    >> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
    >> it from the Current value, and also to make appear less like actual copies of
    >> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
    >> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
    >> should be a better API no?
    >> 
    > 
    > FWIW those functions are for debug logging only, I needed to print the
    > values in a couple places outside xlog.c. I don't intend to make that
    > part of the patch.
    
    Ah, gotcha, I never applied the debug patch from the patchset so I figured this
    was a planned API.  The main question still stands though, if LocalDataCheckXX
    can be confusing and CachedDataCheckXX would be better in order to
    distinguish it from actual controlfile copies?
    
    --
    Daniel Gustafsson
    
    
    
    
    
  36. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-13T16:26:47Z

    On 3/13/25 13:32, Daniel Gustafsson wrote:
    >> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
    >> On 3/13/25 10:54, Daniel Gustafsson wrote:
    >>>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    >>>> I believe the approach is correct, but the number of possible states
    >>>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
    >>>> better way to handle this, but I can't think of any. Ideas?
    >>>
    >>> Not sure if this moves the needle too far in terms of complexity wrt to the
    >>> previous version of the patch, there were already multiple copies.
    >>
    >> It does add one more place (XLogCtl->data_checksum_version) to store the
    >> current state, so it's not that much more complex, ofc. But I was not
    >> really comparing this to the previous patch version, I meant the state
    >> space in general - all possible combinations of all the flags (control
    >> file, local + xlogct).
    > 
    > Fair point.
    > 
    >> I wonder if it might be possible to have a more thorough validation of
    >> the transitions. We already have that for the LocalDataChecksumVersion,
    >> thanks to the asserts - and it was damn useful, otherwise we would not
    >> have noticed this issue for a long time, I think.
    >>
    >> I wonder if we can have similar checks for the other flags. I'm pretty
    >> sure we can have the same checks for XLogCtl, right?
    > 
    > I don't see why not, they should abide by the same rules.
    > 
    
    OK, I'll add these asserts.
    
    >> I'm not quite sure
    >> about ControlFile - can't that "skip" some of the changes, e.g. if we do
    >> (enable->disable->enable) within a single checkpoint? Need to check.
    > 
    > For enable->disable->enable within a single checkpoint then ControlFile should
    > never see the disable state.
    > 
    
    Hmm, that means we can't have the same checks for the ControlFile
    fields, but I don't think that's a problem. We've verified the "path" to
    that (on the XLogCtl field), so that seems fine.
    
    >> This also reminds me I had a question about the barrier - can't it
    >> happen a process gets to process multiple barriers at the same time? I
    >> mean, let's say it gets stuck for a while, and the cluster happens to go
    >> through disable+enable. Won't it then see both barriers? That'd be a
    >> problem, because the core processes the barriers in the order determined
    >> by the enum value, not in the order the barriers happened. Which means
    >> it might break the expected state transitions again (and end with the
    >> wrong local value). I haven't tried, though.
    > 
    > Interesting, that seems like a general deficiency in the barriers, surely
    > processing them in-order would be more intuitive?  That would probably require
    > some form of Lamport clock though.
    > 
    
    Yeah, that seems non-trivial. What if we instead ensured there can't be
    two barriers set at the same time? Say, if we (somehow) ensured all
    processes saw the previous barrier before allowing a new one, we would
    not have this issue, right?
    
    But I don't know what would be a good way to ensure this. Is there a way
    to check if all processes saw the barrier? Any ideas?
    
    >>>> One issue I ran into is the postmaster does not seem to be processing
    >>>> the barriers, and thus not getting info about the data_checksum_version
    >>>> changes.
    >>>
    >>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
    >>
    >> Not sure I follow. What's a reasonable constraint?
    > 
    > That the postmaster deosn't process them.
    > 
    
    OK, that means we need a way to "refresh" the value for new child
    processses, similar to what my patch does. But I suspect there might be
    a race condition - if the child process starts while processing the
    XLOG_CHECKUMS record, it might happen to get the new value and then also
    the barrier (if it does the "refresh" in between the XLogCtl update and
    the barrier). Doesn't this need some sort of interlock, preventing this?
    
    The child startup would need to do this:
    
    1) acquire lock
    2) reset barriers
    3) refresh the LocalDataChecksumValue (from XLogCtl)
    4) release lock
    
    while the walreceiver would do this
    
    1) acquire lock
    2) update XLogCtl value
    3) emit barrier
    4) release lock
    
    Or is there a reason why this would be unnecessary?
    
    >>>> That's fine until it needs to launch a child process (e.g. a
    >>>> walreceiver), which will then see the LocalDataChecksumVersion as of the
    >>>> start of the instance, not the "current" one. I fixed this by explicitly
    >>>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
    >>>> something. (Also, EXEC_BACKEND may need to handle this too.)
    >>>
    >>> The pg_checksums test is failing for me on this version due to the GUC not
    >>> being initialized, don't we need something like the below as well?  (With a
    >>> comment explaining why ReadControlFile wasn't enough.)
    >>>
    >>> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
    >>>        Assert(reset || ControlFile == NULL);
    >>>        ControlFile = palloc(sizeof(ControlFileData));
    >>>        ReadControlFile();
    >>> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
    >>>
    >>
    >> Yeah, I think this (or something like it) is missing.
    > 
    > Thanks for confirming.
    > 
    >>> +uint32
    >>> +GetLocalDataChecksumVersion(void)
    >>> +{
    >>> + return LocalDataChecksumVersion;
    >>> +}
    >>> +
    >>> +/*
    >>> + * Get the *current* data_checksum_version (might not be written to control
    >>> + * file yet).
    >>> + */
    >>> +uint32
    >>> +GetCurrentDataChecksumVersion(void)
    >>> +{
    >>> + return XLogCtl->data_checksum_version;
    >>> +}
    >>> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
    >>> it from the Current value, and also to make appear less like actual copies of
    >>> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
    >>> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
    >>> should be a better API no?
    >>>
    >>
    >> FWIW those functions are for debug logging only, I needed to print the
    >> values in a couple places outside xlog.c. I don't intend to make that
    >> part of the patch.
    > 
    > Ah, gotcha, I never applied the debug patch from the patchset so I figured this
    > was a planned API.  The main question still stands though, if LocalDataCheckXX
    > can be confusing and CachedDataCheckXX would be better in order to
    > distinguish it from actual controlfile copies?
    > 
    
    Yeah, I'll think about the naming.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  37. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-13T23:11:39Z

    On 3/13/25 17:26, Tomas Vondra wrote:
    > On 3/13/25 13:32, Daniel Gustafsson wrote:
    >>> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
    >>>
    >>> ...
    >>>
    >>> This also reminds me I had a question about the barrier - can't it
    >>> happen a process gets to process multiple barriers at the same time? I
    >>> mean, let's say it gets stuck for a while, and the cluster happens to go
    >>> through disable+enable. Won't it then see both barriers? That'd be a
    >>> problem, because the core processes the barriers in the order determined
    >>> by the enum value, not in the order the barriers happened. Which means
    >>> it might break the expected state transitions again (and end with the
    >>> wrong local value). I haven't tried, though.
    >>
    >> Interesting, that seems like a general deficiency in the barriers, surely
    >> processing them in-order would be more intuitive?  That would probably require
    >> some form of Lamport clock though.
    >>
    > 
    > Yeah, that seems non-trivial. What if we instead ensured there can't be
    > two barriers set at the same time? Say, if we (somehow) ensured all
    > processes saw the previous barrier before allowing a new one, we would
    > not have this issue, right?
    > 
    > But I don't know what would be a good way to ensure this. Is there a way
    > to check if all processes saw the barrier? Any ideas?
    > 
    
    Actually, scratch this. There already is a way to do this, by using
    WaitForProcSignalBarrier. And the XLOG_CHECKSUMS processing already
    calls this. So we should not see two barriers at the same time ...
    
    >>>>> One issue I ran into is the postmaster does not seem to be processing
    >>>>> the barriers, and thus not getting info about the data_checksum_version
    >>>>> changes.
    >>>>
    >>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
    >>>
    >>> Not sure I follow. What's a reasonable constraint?
    >>
    >> That the postmaster deosn't process them.
    >>
    > 
    > OK, that means we need a way to "refresh" the value for new child
    > processses, similar to what my patch does. But I suspect there might be
    > a race condition - if the child process starts while processing the
    > XLOG_CHECKUMS record, it might happen to get the new value and then also
    > the barrier (if it does the "refresh" in between the XLogCtl update and
    > the barrier). Doesn't this need some sort of interlock, preventing this?
    > 
    > The child startup would need to do this:
    > 
    > 1) acquire lock
    > 2) reset barriers
    > 3) refresh the LocalDataChecksumValue (from XLogCtl)
    > 4) release lock
    > 
    > while the walreceiver would do this
    > 
    > 1) acquire lock
    > 2) update XLogCtl value
    > 3) emit barrier
    > 4) release lock
    > 
    > Or is there a reason why this would be unnecessary?
    > 
    
    I still think this might be a problem. I wonder if we could maybe
    leverage the barrier generation, to detect that we don't need to process
    this barrier, because we already got the value directly ...
    
    FWIW we'd have this problem even if postmaster was processing barriers,
    because there'd always be a "gap" between the fork and ProcSignalInit()
    registering the new process into the procsignal array.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  38. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-14T12:20:37Z

    On 3/14/25 00:11, Tomas Vondra wrote:
    > ...
    >>>>>> One issue I ran into is the postmaster does not seem to be processing
    >>>>>> the barriers, and thus not getting info about the data_checksum_version
    >>>>>> changes.
    >>>>>
    >>>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
    >>>>
    >>>> Not sure I follow. What's a reasonable constraint?
    >>>
    >>> That the postmaster deosn't process them.
    >>>
    >>
    >> OK, that means we need a way to "refresh" the value for new child
    >> processses, similar to what my patch does. But I suspect there might be
    >> a race condition - if the child process starts while processing the
    >> XLOG_CHECKUMS record, it might happen to get the new value and then also
    >> the barrier (if it does the "refresh" in between the XLogCtl update and
    >> the barrier). Doesn't this need some sort of interlock, preventing this?
    >>
    >> The child startup would need to do this:
    >>
    >> 1) acquire lock
    >> 2) reset barriers
    >> 3) refresh the LocalDataChecksumValue (from XLogCtl)
    >> 4) release lock
    >>
    >> while the walreceiver would do this
    >>
    >> 1) acquire lock
    >> 2) update XLogCtl value
    >> 3) emit barrier
    >> 4) release lock
    >>
    >> Or is there a reason why this would be unnecessary?
    >>
    > 
    > I still think this might be a problem. I wonder if we could maybe
    > leverage the barrier generation, to detect that we don't need to process
    > this barrier, because we already got the value directly ...
    > 
    > FWIW we'd have this problem even if postmaster was processing barriers,
    > because there'd always be a "gap" between the fork and ProcSignalInit()
    > registering the new process into the procsignal array.
    > 
    
    I experimented with this a little bit, and unfortunately I ran into not
    one, but two race conditions in this :-( I don't have reproducers, all
    of this was done by manually adding sleep() calls / gdb breakpoints to
    pause the processes for a while, but I'll try to explain what/why ...
    
    
    1) race #1: SetDataChecksumsOn
    
    The function (and all the other "SetDataChecksums" funcs) does this
    
        SpinLockAcquire(&XLogCtl->info_lck);
        XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
        SpinLockRelease(&XLogCtl->info_lck);
    
        barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    
    Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
    process may start during that, and it'll read the current checksum value
    from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
    barrier. So far so good.
    
    But the new backend is already registered in ProcSignal, so it'll get
    the barrier too, and will try to set the local version to "on" again.
    And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():
    
        Assert(LocalDataChecksumVersion ==
               PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    
    The other "SetDataChecksums" have the same issue, except that in those
    cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
    such assert to check the state transition.
    
    This is "ephemeral" in the sense that setting the value to "on" again
    would be harmless, and indeed a non-assert build will run just fine.
    
    
    2) race #2: InitPostgres
    
    The InitPostgres does this:
    
        InitLocalControldata();
    
        ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    
    where InitLocalControldata gets the current checksum value from XLogCtl,
    and ProcSignalInit registers the backend into the procsignal (which is
    what barriers are based on).
    
    Imagine there's a sleep() between these two calls, and the cluster does
    not have checksums enabled. A backend will start, will read "off" from
    XLogCtl, and then gets stuck on the sleep before it gets added to the
    procsignal/barrier array.
    
    Now, we enable checksums, and the instance goes through 'inprogress-on'
    and 'on' states. This completes, and the backend wakes up and registers
    itself into procsignal - but it won't get any barriers, of course.
    
    So we end up with an instance with data_checksums="on", but this one
    backend still believes data_checksums="on". This can cause a lot of
    trouble, because it won't write blocks with checksums. I.e. this is
    persistent data corruption.
    
    
    I have been thinking about how to fix this. One way would be to
    introduce some sort of locking, so that the two steps (update of the
    XLogCtl version + barrier emit) and (local flag init + procsignal init)
    would always happen atomically. So, something like this:
    
        SpinLockAcquire(&XLogCtl->info_lck);
        XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
        barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
        SpinLockRelease(&XLogCtl->info_lck);
    
    and
    
        SpinLockAcquire(&XLogCtl->info_lck);
        InitLocalControldata();
        ProcSignalInit(MyCancelKeyValid, MyCancelKey);
        SpinLockRelease(&XLogCtl->info_lck);
    
    But that seems pretty heavy-handed, it's definitely much more work while
    holding a spinlock than I'm comfortable with, and I wouldn't be
    surprised if there were deadlock cases etc. (FWIW I believe it needs to
    use XLogCtl->info_lck, to make the value consistent with checkpoints.)
    
    
    Anyway, I think a much simpler solution would be to reorder InitPostgres
    like this:
    
        ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    
        InitLocalControldata();
    
    i.e. to first register into procsignal, and then read the new value.
    AFAICS this guarantees we won't lose any checksum version updates. It
    does mean we still can get a barrier for a value we've already seen, but
    I think we should simply ignore this for the very first update.
    
    Opinions? Other ideas how to fix this?
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  39. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-03-14T13:38:06Z

    > On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:
    
    > I experimented with this a little bit, and unfortunately I ran into not
    > one, but two race conditions in this :-( I don't have reproducers, all
    > of this was done by manually adding sleep() calls / gdb breakpoints to
    > pause the processes for a while, but I'll try to explain what/why ...
    
    Ugh. Thanks for this!
    
    > 1) race #1: SetDataChecksumsOn
    > 
    > The function (and all the other "SetDataChecksums" funcs) does this
    > 
    >    SpinLockAcquire(&XLogCtl->info_lck);
    >    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    >    SpinLockRelease(&XLogCtl->info_lck);
    > 
    >    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    > 
    > Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
    > process may start during that, and it'll read the current checksum value
    > from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
    > barrier. So far so good.
    > 
    > But the new backend is already registered in ProcSignal, so it'll get
    > the barrier too, and will try to set the local version to "on" again.
    > And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():
    > 
    >    Assert(LocalDataChecksumVersion ==
    >           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    > 
    > The other "SetDataChecksums" have the same issue, except that in those
    > cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
    > such assert to check the state transition.
    > 
    > This is "ephemeral" in the sense that setting the value to "on" again
    > would be harmless, and indeed a non-assert build will run just fine.
    
    As mentioned off-list, being able to loosen the restriction for the first
    barrier seen seem like a good way to keep this assertion.  Removing it is of
    course the alternative solution, as it's not causing any issues, but given how
    handy it's been to find actual issues it would be good to be able to keep it.
    
    > 2) race #2: InitPostgres
    > 
    > The InitPostgres does this:
    > 
    >    InitLocalControldata();
    > 
    >    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    > 
    > where InitLocalControldata gets the current checksum value from XLogCtl,
    > and ProcSignalInit registers the backend into the procsignal (which is
    > what barriers are based on).
    > 
    > Imagine there's a sleep() between these two calls, and the cluster does
    > not have checksums enabled. A backend will start, will read "off" from
    > XLogCtl, and then gets stuck on the sleep before it gets added to the
    > procsignal/barrier array.
    > 
    > Now, we enable checksums, and the instance goes through 'inprogress-on'
    > and 'on' states. This completes, and the backend wakes up and registers
    > itself into procsignal - but it won't get any barriers, of course.
    > 
    > So we end up with an instance with data_checksums="on", but this one
    > backend still believes data_checksums="on". This can cause a lot of
    > trouble, because it won't write blocks with checksums. I.e. this is
    > persistent data corruption.
    > 
    > I have been thinking about how to fix this. One way would be to
    > introduce some sort of locking, so that the two steps (update of the
    > XLogCtl version + barrier emit) and (local flag init + procsignal init)
    > would always happen atomically. So, something like this:
    > 
    >    SpinLockAcquire(&XLogCtl->info_lck);
    >    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    >    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    >    SpinLockRelease(&XLogCtl->info_lck);
    > 
    > and
    > 
    >    SpinLockAcquire(&XLogCtl->info_lck);
    >    InitLocalControldata();
    >    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    >    SpinLockRelease(&XLogCtl->info_lck);
    > 
    > But that seems pretty heavy-handed, it's definitely much more work while
    > holding a spinlock than I'm comfortable with, and I wouldn't be
    > surprised if there were deadlock cases etc. (FWIW I believe it needs to
    > use XLogCtl->info_lck, to make the value consistent with checkpoints.)
    
    Yeah, that seems quite likely to introduce a new set if issues.
    
    > Anyway, I think a much simpler solution would be to reorder InitPostgres
    > like this:
    > 
    >    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    > 
    >    InitLocalControldata();
    
    Agreed.
    
    > i.e. to first register into procsignal, and then read the new value.
    > AFAICS this guarantees we won't lose any checksum version updates. It
    > does mean we still can get a barrier for a value we've already seen, but
    > I think we should simply ignore this for the very first update.
    
    Calling functions with sideeffects in setting state seems like a bad idea
    before ProcSignalInit has run, that's thinko on my part in this patch.  Your
    solution of reordering seems like the right way to handle this.
    
    --
    Daniel Gustafsson
    
    
    
    
    
  40. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-03-14T14:06:52Z

    > On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel@yesql.se> wrote:
    >> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:
    
    >> This is "ephemeral" in the sense that setting the value to "on" again
    >> would be harmless, and indeed a non-assert build will run just fine.
    > 
    > As mentioned off-list, being able to loosen the restriction for the first
    > barrier seen seem like a good way to keep this assertion.  Removing it is of
    > course the alternative solution, as it's not causing any issues, but given how
    > handy it's been to find actual issues it would be good to be able to keep it.
    > 
    >> i.e. to first register into procsignal, and then read the new value.
    >> AFAICS this guarantees we won't lose any checksum version updates. It
    >> does mean we still can get a barrier for a value we've already seen, but
    >> I think we should simply ignore this for the very first update.
    > 
    > Calling functions with sideeffects in setting state seems like a bad idea
    > before ProcSignalInit has run, that's thinko on my part in this patch.  Your
    > solution of reordering seems like the right way to handle this.
    
    0006 in the attached version is what I have used when testing the above, along
    with an update to the copyright year which I had missed doing earlier.  It also
    contains the fix in LocalProcessControlFile which I had in my local tree, I
    think we need something like that at least.
    
    --
    Daniel Gustafsson
    
    
  41. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-15T15:50:02Z

    
    On 3/14/25 15:06, Daniel Gustafsson wrote:
    >> On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel@yesql.se> wrote:
    >>> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    >>> This is "ephemeral" in the sense that setting the value to "on" again
    >>> would be harmless, and indeed a non-assert build will run just fine.
    >>
    >> As mentioned off-list, being able to loosen the restriction for the first
    >> barrier seen seem like a good way to keep this assertion.  Removing it is of
    >> course the alternative solution, as it's not causing any issues, but given how
    >> handy it's been to find actual issues it would be good to be able to keep it.
    >>
    >>> i.e. to first register into procsignal, and then read the new value.
    >>> AFAICS this guarantees we won't lose any checksum version updates. It
    >>> does mean we still can get a barrier for a value we've already seen, but
    >>> I think we should simply ignore this for the very first update.
    >>
    >> Calling functions with sideeffects in setting state seems like a bad idea
    >> before ProcSignalInit has run, that's thinko on my part in this patch.  Your
    >> solution of reordering seems like the right way to handle this.
    > 
    > 0006 in the attached version is what I have used when testing the above, along
    > with an update to the copyright year which I had missed doing earlier.  It also
    > contains the fix in LocalProcessControlFile which I had in my local tree, I
    > think we need something like that at least.
    >
    
    Thanks, here's an updated patch version - I squashed all the earlier
    parts, but I kept your changes and my adjustments separate, for clarity.
    A couple comments:
    
    1) I don't think the comment before InitialDataChecksumTransition was
    entirely accurate, because it said we can see the duplicate state only
    for "on" state. But AFAICS we can see duplicate values for any states,
    except that we only have an assert for the "on" so we don't notice the
    other cases. I wonder if we could strengthen this a bit, by adding some
    asserts for the other states too.
    
    2) I admit it's rather subjective, but I didn't like how you did the
    assert in AbsorbChecksumsOnBarrier. But looking at it now in the diff,
    maybe it was more readable ...
    
    3) I renamed InitLocalControldata() to InitLocalDataChecksumVersion().
    The name was entirely misleading, because it now initializes the flag in
    XLogCtl, it has nothing to do with control file.
    
    4) I realized AuxiliaryProcessMainCommon() may be a better place to
    initialize the checksum flag for non-backend processes. In fact, doing
    it in postmaster_child_launch() had the same race condition because it
    happened before ProcSignalInit().
    
    I'm sure there's cleanup possible in a bunch of places, but the really
    bad thing is I realized the handling on a standby is not quite correct.
    I don't know what exactly is happening, there's too many moving bits,
    but here's what I see ...
    
    Every now and then, after restarting the standby, it logs a bunch of
    page verification failures. Stuff like this:
    
      WARNING:  page verification failed, calculated checksum 9856 but
                expected 0
      CONTEXT:  WAL redo at 0/3447BA8 for Heap2/VISIBLE:
                snapshotConflictHorizon: 0, flags: 0x03; blkref #0: rel
                1663/16384/16401, fork 2, blk 0 FPW; blkref #1: rel
                1663/16384/16401, blk 0
      WARNING:  page verification failed, LSN 0/CF54C10
    
    This is after an immediate shutdown, but I've seen similar failures for
    fast shutdowns too (the root causes may be different / may need a
    different fix, not sure).
    
    The instance restarts, and the "startup" process starts recovery
    
      LOG:  redo starts at 0/2000028
    
    This matches LSN from the very first start of the standby - there were
    no restart points since then, apparently. And since then the primary did
    this with the checksums (per pg_waldump):
    
      lsn: 0/0ECCFC48, prev 0/0ECCFBA0, desc: CHECKSUMS inprogress-off
      lsn: 0/0ECD0168, prev 0/0ECD0128, desc: CHECKSUMS off
    
    The instance already saw both records before the immediate shutdown (per
    the logging in patch 0004), but after the restart the instance goes back
    to having checksums enabled again
    
      data_checksum_version = 1
    
    Which is correct, because it starts at 0/2000028, which is before either
    of the XLOG_CHECKSUMS records. But then at 0/3447BA8 (which is *before*
    either of the checksum changes) it tries to read a page from disk, and
    hits a checksum error. That page is from the future (per the page LSN
    logged by patch 0004), but it's still before both XLOG_CHECKSUMS
    messages. So how come the page has pd_checksum 0?
    
    I'd have understood if the page came "broken" from the primary, but I've
    not seen a single page verification failure on that side (and it's
    subject to the same fast/immediate restarts, etc).
    
    
    I wonder if this might be related to how we enforce checkpoints only
    when setting the checksums to "on" on the primary. Maybe that's safe on
    primary but not on a standby?
    
    FWIW I've seen similar issues for "fast" shutdowns too - at least the
    symptoms are similar, but the mechanism might be a bit different. In
    particular, I suspect there's some sort of thinko in updating the
    data_checksum_version in the control file, but I can't put my finger on
    it yet.
    
    Another thing I noticed is this comment in CreateRestartPoint(), before
    one of the early exits:
    
    /*
     * If the last checkpoint record we've replayed is already our last
     * restartpoint, we can't perform a new restart point. We still update
     * minRecoveryPoint in that case, so that if this is a shutdown restart
     * point, we won't start up earlier than before. That's not strictly
     * necessary, but when hot standby is enabled, it would be rather weird
     * if the database opened up for read-only connections at a
     * point-in-time before the last shutdown. Such time travel is still
     * possible in case of immediate shutdown, though.
     * ...
    
    I wonder if this "time travel backwards" might be an issue for this too,
    because it might mean we end up picking the wrong data_checksum_version
    from the control file. In any case, if this happens, we don't get to the
    ControlFile->data_checksum_version update a bit further down. And
    there's another condition that can skip that.
    
    
    I'll continue investigating this next week, but at this point I'm quite
    confused and would be grateful for any insights ...
    
    regards
    
    -- 
    Tomas Vondra
    
  42. Re: Changing the state of data checksums in a running cluster

    Andres Freund <andres@anarazel.de> — 2025-03-15T16:26:12Z

    Jo.
    
    On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
    > Thanks, here's an updated patch version
    
    FWIW, this fails in CI;
    
    https://cirrus-ci.com/build/4678473324691456
    On all OSs:
    [16:08:36.331] #   Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr'
    [16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/001_initdb.pl line 132.
    [16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
    [16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0 (UPDATED)
    
    Windows & Compiler warnings:
    [16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error C1083: Cannot open include file: 'execinfo.h': No such file or directory
    
    [16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or directory
    [16:18:52.385]    25 | #include <execinfo.h>
    [16:18:52.385]       |          ^~~~~~~~~~~~
    
    Greetings,
    
    Andres Freund
    
    
    
    
  43. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-03-15T17:32:57Z

    On 3/15/25 17:26, Andres Freund wrote:
    > Jo.
    > 
    > On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
    >> Thanks, here's an updated patch version
    > 
    > FWIW, this fails in CI;
    > 
    > https://cirrus-ci.com/build/4678473324691456
    > On all OSs:
    > [16:08:36.331] #   Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr'
    > [16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/001_initdb.pl line 132.
    > [16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
    > [16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0 (UPDATED)
    > 
    > Windows & Compiler warnings:
    > [16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error C1083: Cannot open include file: 'execinfo.h': No such file or directory
    > 
    > [16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or directory
    > [16:18:52.385]    25 | #include <execinfo.h>
    > [16:18:52.385]       |          ^~~~~~~~~~~~
    > 
    > Greetings,
    
    Yeah, that's just the "debug stuff" - I don't expect any of that to be
    included in the commit, I only posted it for convenience. It adds a lot
    of debug logging, which I hope might help others to understand what the
    problem with checksums on standby is.
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  44. Re: Changing the state of data checksums in a running cluster

    Alexander Korotkov <aekorotkov@gmail.com> — 2025-04-04T07:55:32Z

    Hi!
    
    On Sat, Mar 15, 2025 at 7:33 PM Tomas Vondra <tomas@vondra.me> wrote:
    
    > On 3/15/25 17:26, Andres Freund wrote:
    > > Jo.
    > >
    > > On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
    > >> Thanks, here's an updated patch version
    > >
    > > FWIW, this fails in CI;
    > >
    > > https://cirrus-ci.com/build/4678473324691456
    > > On all OSs:
    > > [16:08:36.331] #   Failed test 'options --locale-provider=icu
    > --locale=und --lc-*=C: no stderr'
    > > [16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/
    > 001_initdb.pl line 132.
    > > [16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153]
    > LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
    > > [16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:
    > XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
    > (UPDATED)
    > >
    > > Windows & Compiler warnings:
    > > [16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error
    > C1083: Cannot open include file: 'execinfo.h': No such file or directory
    > >
    > > [16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or
    > directory
    > > [16:18:52.385]    25 | #include <execinfo.h>
    > > [16:18:52.385]       |          ^~~~~~~~~~~~
    > >
    > > Greetings,
    >
    > Yeah, that's just the "debug stuff" - I don't expect any of that to be
    > included in the commit, I only posted it for convenience. It adds a lot
    > of debug logging, which I hope might help others to understand what the
    > problem with checksums on standby is.
    >
    
    I took a look at this patch.  I have following notes.
    1) I think reporting of these errors could be better, more detailed.
    Especially the second one could be similar to some of other errors on
    checksums processing.
                ereport(ERROR,
                        (errmsg("failed to start background worker to process
    data checksums")));
                ereport(ERROR,
                        (errmsg("unable to enable data checksums in cluster")));
    
    2) ProcessAllDatabases() contains loop, which repeats scanning the new
    databases for checkums.  It continues while there are new database on each
    iteration.  Could we just limit the number of iterations to 2?  Given at
    each step we're calling WaitForAllTransactionsToFinish(), everything that
    gets created after first WaitForAllTransactionsToFinish() call should have
    checksums enabled in the beginning.
    
    ------
    Regards,
    Alexander Korotkov
    Supabase
    
  45. Re: Changing the state of data checksums in a running cluster

    Bernd Helmle <mailings@oopsware.de> — 2025-07-11T15:53:58Z

    Am Samstag, dem 15.03.2025 um 16:50 +0100 schrieb Tomas Vondra:
    > 
    
    > I wonder if this "time travel backwards" might be an issue for this
    > too,
    > because it might mean we end up picking the wrong
    > data_checksum_version
    > from the control file. In any case, if this happens, we don't get to
    > the
    > ControlFile->data_checksum_version update a bit further down. And
    > there's another condition that can skip that.
    > 
    > 
    > I'll continue investigating this next week, but at this point I'm
    > quite
    > confused and would be grateful for any insights ...
    > 
    
    Hi,
    
    Since i wanted to dig a little deeper in this patch i took the
    opportunity and rebased it to current master, hopefully not having
    broken something seriously.
    
    Thanks,
    	Bernd
    
  46. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-07-12T19:39:30Z

    > On 11 Jul 2025, at 17:53, Bernd Helmle <mailings@oopsware.de> wrote:
    
    > Since i wanted to dig a little deeper in this patch i took the
    > opportunity and rebased it to current master, hopefully not having
    > broken something seriously.
    
    Thanks, much appreciated!
    
    --
    Daniel Gustafsson
    
    
    
    
    
  47. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-08-16T19:34:03Z

    Attached is a rebase on top of the func.sgml changes which caused this to no
    longer apply.
    
    This version is also substantially updated with a new injection point based
    test suite, fixed a few bugs (found by said test suite), added checkpoint to
    disabling checksums, code cleanup, more granular wait events, comment rewrites
    and additions and more smaller cleanups.
    
    --
    Daniel Gustafsson
    
    
  48. Re: Changing the state of data checksums in a running cluster

    Bruce Momjian <bruce@momjian.us> — 2025-08-19T16:21:41Z

    On Sat, Aug 16, 2025 at 09:34:03PM +0200, Daniel Gustafsson wrote:
    > Attached is a rebase on top of the func.sgml changes which caused this to no
    > longer apply.
    > 
    > This version is also substantially updated with a new injection point based
    > test suite, fixed a few bugs (found by said test suite), added checkpoint to
    > disabling checksums, code cleanup, more granular wait events, comment rewrites
    > and additions and more smaller cleanups.
    
    I am very glad you went simple and didn't attempt restarting this
    process from the place it stopped:
    
        If the cluster is stopped while in <literal>inprogress-on</literal>
        mode, for any reason, then this process must be
        restarted manually. To do this, re-execute the function
        <function>pg_enable_data_checksums()</function> once the cluster has
        been restarted. The process will start over, there is no support for
        resuming work from where it was interrupted.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Do not let urgent matters crowd out time for investment in the future.
    
    
    
    
  49. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-20T14:37:33Z

    On 8/16/25 21:34, Daniel Gustafsson wrote:
    > Attached is a rebase on top of the func.sgml changes which caused this to no
    > longer apply.
    > 
    > This version is also substantially updated with a new injection point based
    > test suite, fixed a few bugs (found by said test suite), added checkpoint to
    > disabling checksums, code cleanup, more granular wait events, comment rewrites
    > and additions and more smaller cleanups.
    > 
    
    Thanks for the updated patch.
    
    The injection points seem like a huge improvement, allowing testing of
    different code paths in a more deterministic way.
    
    I started running the stress test, using pretty much exactly the version
    posted in March [1]. And so far I noticed only one issue, when the
    standby reports mismatched checksums on a fsm:
    
    LOG:  page verification failed, calculated checksum 24786 but expected 24760
    CONTEXT:  WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples:
    185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
    LOG:  invalid page in block 2 of relation base/16384/16403_fsm; zeroing
    out page
    CONTEXT:  WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples:
    185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
    WARNING:  invalid page in block 2 of relation base/16384/16403_fsm;
    zeroing out page
    CONTEXT:  WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples:
    185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
    LOG:  page verification failed, calculated checksum 37048 but expected 0
    CONTEXT:  WAL redo at 0/0344D7E0 for Heap2/MULTI_INSERT+INIT: ntuples:
    61, flags: 0x28; blkref #0: rel 1663/16384/16400, blk 0
    LOG:  invalid page in block 2 of relation base/16384/16400_fsm; zeroing
    out page
    
    This happens quite regularly, it's not hard to hit. But I've only seen
    it to happen on a FSM, and only right after immediate shutdown. I don't
    think that's quite expected.
    
    I believe the built-in TAP tests (with injection points) can't catch
    this, because there's no concurrent activity while flipping checksums
    on/off. It'd be good to do something like that, by running pgbench in
    the background, or something like that.
    
    I also don't see any restarts of the primary/standby. That might be good
    to do too.
    
    I plan to randomize the stress test a bit more, once this FSM issue gets
    fixed. Maybe that'll find some additional issues.
    
    
    [1]
    https://www.postgresql.org/message-id/f528413c-477a-4ec3-a0df-e22a80ffbe41@vondra.me
    
    -- 
    Tomas Vondra
    
    
    
    
    
  50. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-20T17:02:57Z

    Hi,
    
    I think there's a minor issue in how pg_checksums validates state before
    checking the data.
    
    The current patch simply does:
    
      if (ControlFile->data_checksum_version == 0 &&
          mode == PG_MODE_CHECK)
          pg_fatal("data checksums are not enabled in cluster");
    
    and that worked when the version was either 0 or 1. But now it can be
    also 2 or 3, for inprogress-on / inprogress-off, and if the cluster gets
    shut down at the right moment, that can end in the control file.
    
    It doesn't make sense to verify checksums in such cluster, pg_checksums
    should handle that as "off", i.e. error out.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  51. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-08-25T18:32:51Z

    > On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:
    
    > This happens quite regularly, it's not hard to hit. But I've only seen
    > it to happen on a FSM, and only right after immediate shutdown. I don't
    > think that's quite expected.
    > 
    > I believe the built-in TAP tests (with injection points) can't catch
    > this, because there's no concurrent activity while flipping checksums
    > on/off. It'd be good to do something like that, by running pgbench in
    > the background, or something like that.
    
    In searching for this bug I opted for implementing a version of the stress
    tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
    version.  It's gated behind PG_TEST_EXTRA since it's clearly not something
    which can be enabled by default (if this goes in this need to be re-done to
    provide two levels IMO, but during testing this is more convenient).  I'm
    curious to see which improvements you can think to make it stress the code to
    the breaking point.
    
    > I think there's a minor issue in how pg_checksums validates state before
    > checking the data.
    > 
    > The current patch simply does:
    > 
    >  if (ControlFile->data_checksum_version == 0 &&
    >      mode == PG_MODE_CHECK)
    >      pg_fatal("data checksums are not enabled in cluster");
    > 
    > and that worked when the version was either 0 or 1. But now it can be
    > also 2 or 3, for inprogress-on / inprogress-off, and if the cluster gets
    > shut down at the right moment, that can end in the control file.
    
    Good point, I've changed the test to check for checksums being enabled rather
    than checking if they are disabled.
    
    --
    Daniel Gustafsson
    
    
  52. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-25T23:06:24Z

    On 8/25/25 20:32, Daniel Gustafsson wrote:
    >> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    >> This happens quite regularly, it's not hard to hit. But I've only seen
    >> it to happen on a FSM, and only right after immediate shutdown. I don't
    >> think that's quite expected.
    >>
    >> I believe the built-in TAP tests (with injection points) can't catch
    >> this, because there's no concurrent activity while flipping checksums
    >> on/off. It'd be good to do something like that, by running pgbench in
    >> the background, or something like that.
    > 
    > In searching for this bug I opted for implementing a version of the stress
    > tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
    > version.  It's gated behind PG_TEST_EXTRA since it's clearly not something
    > which can be enabled by default (if this goes in this need to be re-done to
    > provide two levels IMO, but during testing this is more convenient).  I'm
    > curious to see which improvements you can think to make it stress the code to
    > the breaking point.
    > 
    
    I think this TAP looks very nice, but there's a couple issues with it.
    See the attached patch fixing those.
    
    1) I think test_checksums should be in src/test/modules/Makefile?
    
    2) The test_checksums/Makefile didn't seem to work for me, I was getting
    
    Makefile:23: *** recipe commences before first target.  Stop.
    
    Because there was a missing "\" so I had to fix that. And then it was
    complaining about Makefile.global or something, so I fixed that by
    cargo-culting what other Makefiles in test modules do. Now it seems to
    work for me. I guess you're on meson?
    
    3) I'm no perl expert, but AFAICS the test wasn't really running the
    pgbench, for a couple of reasons. It was passing "-q" to pgbench, but
    that's only for initialization. The clusters had max_connections=10, but
    the pgbench was using "-c 10", so I was getting "too many connections".
    It was not setting "$pgbench_running = 1" so the other loops were
    getting "too many connections" too. Another thing is I'm not sure it's
    OK to pass '' to IPC::Run::start, I think it'll take it as an argument,
    confusing pgbench.
    
    With these changes it runs for me, and I even saw some
    
       LOG: page verification failed
    
    in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
    while - a couple minutes, maybe? I think I saw it at
    
        t/006_concurrent_pgbench.pl .. 427/?
    
    or something like that. I think the bash version did a couple things
    differently, which might make the failures more frequent (but it's just
    a wild guess).
    
    In particular, I think the script restarts the two nodes independently,
    while the TAP always stops both primary and standby, in this order. I
    think it'd be useful to restart one or both.
    
    The other thing is the bash script added some random delays/sleep, which
    increases the test duration, but it also means generating somewhat
    random amounts of data, etc. It also randomized some other stuff (scale,
    client count, ...). But that can wait.
    
    
    regards
    
    -- 
    Tomas Vondra
    
  53. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-08-27T08:30:18Z

    > On 26 Aug 2025, at 01:06, Tomas Vondra <tomas@vondra.me> wrote:
    
    > I think this TAP looks very nice, but there's a couple issues with it.
    > See the attached patch fixing those.
    
    Thanks, I have incorporated (most of) your patch in the attached.  I did keep
    the PG_TEST_EXTRA check for injection points though which I assume were removed
    out of mistake.
    
    > With these changes it runs for me, and I even saw some
    > 
    >   LOG: page verification failed
    > 
    > in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
    > while - a couple minutes, maybe? I think I saw it at
    > 
    >    t/006_concurrent_pgbench.pl .. 427/?
    
    That's very interesting, I have been running it to timeout several times in a
    row without hitting any verification failures.  Will keep running.
    
    > or something like that. I think the bash version did a couple things
    > differently, which might make the failures more frequent (but it's just
    > a wild guess).
    > 
    > In particular, I think the script restarts the two nodes independently,
    > while the TAP always stops both primary and standby, in this order. I
    > think it'd be useful to restart one or both.
    
    Done in the attached, it will now randomly stop one or both or none.  If the
    node is stopped I've added an offline pg_checksum step to validate the
    datafiles as a why-not test.
    
    > The other thing is the bash script added some random delays/sleep, which
    > increases the test duration, but it also means generating somewhat
    > random amounts of data, etc. It also randomized some other stuff (scale,
    > client count, ...). But that can wait.
    
    Added as well in a few places, maybe more can be sprinkled in.
    
    --
    Daniel Gustafsson
    
    
  54. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-27T09:39:35Z

    
    On 8/27/25 10:30, Daniel Gustafsson wrote:
    >> On 26 Aug 2025, at 01:06, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    >> I think this TAP looks very nice, but there's a couple issues with it.
    >> See the attached patch fixing those.
    > 
    > Thanks, I have incorporated (most of) your patch in the attached.  I did keep
    > the PG_TEST_EXTRA check for injection points though which I assume were removed
    > out of mistake.
    > 
    
    Yes, that was a mistake.
    
    >> With these changes it runs for me, and I even saw some
    >>
    >>   LOG: page verification failed
    >>
    >> in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
    >> while - a couple minutes, maybe? I think I saw it at
    >>
    >>    t/006_concurrent_pgbench.pl .. 427/?
    > 
    > That's very interesting, I have been running it to timeout several times in a
    > row without hitting any verification failures.  Will keep running.
    > 
    
    Just to be clear - I don't see any pg_checksums failures either. I only
    see failures in the standby log, and I don't think the script checks
    that (it probably should).
    
    >> or something like that. I think the bash version did a couple things
    >> differently, which might make the failures more frequent (but it's just
    >> a wild guess).
    >>
    >> In particular, I think the script restarts the two nodes independently,
    >> while the TAP always stops both primary and standby, in this order. I
    >> think it'd be useful to restart one or both.
    > 
    > Done in the attached, it will now randomly stop one or both or none.  If the
    > node is stopped I've added an offline pg_checksum step to validate the
    > datafiles as a why-not test.
    > 
    >> The other thing is the bash script added some random delays/sleep, which
    >> increases the test duration, but it also means generating somewhat
    >> random amounts of data, etc. It also randomized some other stuff (scale,
    >> client count, ...). But that can wait.
    > 
    > Added as well in a few places, maybe more can be sprinkled in.
    > 
    
    Thanks. I'll take a look.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  55. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-08-27T11:00:04Z

    > On 27 Aug 2025, at 11:39, Tomas Vondra <tomas@vondra.me> wrote:
    
    > Just to be clear - I don't see any pg_checksums failures either. I only
    > see failures in the standby log, and I don't think the script checks
    > that (it probably should).
    
    Right, that's what I'm been checking too.  I have been considering adding
    another background process for monitoring all the log entries but I just
    thought of a much simpler solution.  When the clusters are turned off we can
    take the opportunity to slurp the log written since last restart and inspect
    it. The attached adds this.
    
    It would probably be good to at some point clean this up a little by placing
    all of variables for a single node in an associative hash which can be passed
    around, and place repeated code in subroutines etc..
    
    --
    Daniel Gustafsson
    
    
  56. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-27T12:39:38Z

    On 8/27/25 13:00, Daniel Gustafsson wrote:
    >> On 27 Aug 2025, at 11:39, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    >> Just to be clear - I don't see any pg_checksums failures either. I only
    >> see failures in the standby log, and I don't think the script checks
    >> that (it probably should).
    > 
    > Right, that's what I'm been checking too.  I have been considering adding
    > another background process for monitoring all the log entries but I just
    > thought of a much simpler solution.  When the clusters are turned off we can
    > take the opportunity to slurp the log written since last restart and inspect
    > it. The attached adds this.
    > 
    
    There's still a couple issues, unfortunately. First, this may not do
    what you intended:
    
      sub cointoss
      {
        return int(rand(2) == 1);
      }
    
    The rand() call returns values in [0, 2.0), it's very unlikely to
    produce 1, and so there are no restarts. I fixed this to
    
      sub cointoss
      {
        return int(rand() < 0.5);
      }
    
    and then it starts working, restarting with ~50% probability.
    
    Then it hits another problem when calling pg_checksums. That only works
    after a "clean" shutdown, not after immediate one.This can't just check
    is_alive, it needs to remember how exactly was the server stopped and
    only verify checksums for 'fast' mode. I never saw any failures in
    pg_checksums, so I just commented that out.
    
    Then it starts working as intended, I think. I only did a couple runs so
    far, but I haven't found any checksum failures ... Then I realized I did
    the earlier runs with slightly stale master HEAD (38c5fbd97ee6a to be
    precise), while this time I did git pull.
    
    And this happened on Friday:
    
    commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
    Author: Alexander Korotkov <akorotkov@postgresql.org>
    Date:   Fri Aug 22 18:44:39 2025 +0300
    
        Revert "Get rid of WALBufMappingLock"
    
        This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
        It appears that conditional variables are not suitable for use
        inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
        face postmaster death, they exit, releasing all locks instead of
        PANIC.  In certain situations, this leads to data corruption.
    
        ...
    
    I think it's very likely the checksums were broken by this. After all,
    that linked thread has subject "VM corruption on standby" and I've only
    ever seen checksum failures on standby on the _vm fork.
    
    There's also an issue when calling teardown_node at the end. That won't
    work if the instance got stopped in the last round, and the test will
    fail like this:
    
    t/006_concurrent_pgbench.pl .. 379/? # Tests were run but no plan was
    declared and done_testing() was not seen.
    # Looks like your test exited with 29 just after 379.
    t/006_concurrent_pgbench.pl .. Dubious, test returned 29 (wstat 7424,
    0x1d00)
    All 379 subtests passed
    
    
    > It would probably be good to at some point clean this up a little by placing
    > all of variables for a single node in an associative hash which can be passed
    > around, and place repeated code in subroutines etc..
    > 
    
    Yeah. For me Perl is hard to read in any case, but if we can cleanup
    this a little bit before additional cases, that'd be nice.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  57. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-27T12:42:05Z

    On 8/27/25 14:39, Tomas Vondra wrote:
    > ...
    >
    > And this happened on Friday:
    > 
    > commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
    > Author: Alexander Korotkov <akorotkov@postgresql.org>
    > Date:   Fri Aug 22 18:44:39 2025 +0300
    > 
    >     Revert "Get rid of WALBufMappingLock"
    > 
    >     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
    >     It appears that conditional variables are not suitable for use
    >     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
    >     face postmaster death, they exit, releasing all locks instead of
    >     PANIC.  In certain situations, this leads to data corruption.
    > 
    >     ...
    > 
    > I think it's very likely the checksums were broken by this. After all,
    > that linked thread has subject "VM corruption on standby" and I've only
    > ever seen checksum failures on standby on the _vm fork.
    > 
    
    Forgot to mention - I did try with c13070a27b reverted, and with that I
    can reproduce the checksum failures again (using the fixed TAP test).
    
    It's not a definitive proof, but it's a hint c13070a27b63 was causing
    the checksum failures.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  58. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-28T16:11:18Z

    Hi,
    
    I spent a bit more time fixing the TAP test. The attached patch makes it
    "work" for me (or I think it should, in principle). I'm not saying it's
    the best way to do stuff.
    
    With the patch applied, I tried running it, and I got a failure when
    running pg_checksums. There's a log snippet describing the issue, but
    AFAICS it's happening like this:
    
    1) checksums are disabled
    2) flip_data_checksums gets called
    3) both clusters go through 'inprogress-on' and 'on' states
    4) primary gets shutdown in 'immediate' mode
    5) standby gets shutdown in 'fast' mode
    6) we try to validate checksums on the standby, but control file still
    says checksums=inprogress-on
    
    This seems like a bug to me - AFAICS the expectation is that after fast
    shutdown, we don't forget the checksum state. Or is that expected? In
    that case the TAP test probably needs to check the control file, instead
    of relying on the perl variable $data_checksum_state. Or maybe it should
    check that the control file has the correct / expected state?
    
    FWIW I don't think the primary shutdown matters. I've seen multiple of
    these failures, and it happens even without primary shutdown. But the
    standby "fast" shutdown is always there.
    
    But this also shows a limitation of the TAP test - it never triggers the
    shutdowns while flipping the checksums (in flip_data_checksums). I think
    that's something worth testing.
    
    regards
    
    -- 
    Tomas Vondra
    
  59. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-29T14:26:41Z

    On 8/27/25 14:42, Tomas Vondra wrote:
    > On 8/27/25 14:39, Tomas Vondra wrote:
    >> ...
    >>
    >> And this happened on Friday:
    >>
    >> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
    >> Author: Alexander Korotkov <akorotkov@postgresql.org>
    >> Date:   Fri Aug 22 18:44:39 2025 +0300
    >>
    >>     Revert "Get rid of WALBufMappingLock"
    >>
    >>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
    >>     It appears that conditional variables are not suitable for use
    >>     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
    >>     face postmaster death, they exit, releasing all locks instead of
    >>     PANIC.  In certain situations, this leads to data corruption.
    >>
    >>     ...
    >>
    >> I think it's very likely the checksums were broken by this. After all,
    >> that linked thread has subject "VM corruption on standby" and I've only
    >> ever seen checksum failures on standby on the _vm fork.
    >>
    > 
    > Forgot to mention - I did try with c13070a27b reverted, and with that I
    > can reproduce the checksum failures again (using the fixed TAP test).
    > 
    > It's not a definitive proof, but it's a hint c13070a27b63 was causing
    > the checksum failures.
    > 
    
    Unfortunately, it seems I spoke too soon :-( I decided to test this on
    multiple machines overnight, and it still fails on the slower ones.
    
    Attached is a patch addressing a couple more issues, to makes the TAP
    test work well enough. (Attached as .txt, to not confuse cfbot).
    
    - The pgbench started by IPC::Run::start() needs to be finished, to
    release resources. Otherwise it leaks file descriptors (and there's a
    bunch of "defunct" pgbench processes), which may be a problem with
    increased number of iterations.
    
    - AFAICS the pgbench can't use stdin/stdout/stderr, otherwise the pipes
    get broken when the command fails (after restart). I just used /dev/null
    instead, the stdout/stderr was not used anyway (or was it?).
    
    - commented out the pg_checksums call, because of the issues mentioned
    earlier (I was trying to make it work by remembering the state, but it
    seems to not make it into control file before shutdown occasionally)
    
    I increased the number of iterations to 1000+ and ran it on three machines:
    
    - ryzen (new machine from ~2024)
    - xeon (old slow machine from ~2016)
    - rpi5 (very slow machine)
    
    I haven't seen a single failure on ryzen, after ~3000 iterations. But
    both xeon and rpi5 show a number of failures. Xeon has about 35 reports
    of 'Failed test', rpi5 and about 10.
    
    My guess is it's something about timing. It works on the "fast" ryzen,
    but fails on xeon which is ~3-4x slower. And rpi5, which is even slower.
    
    The other reason why it seems unrelated to the reverted commit is that
    it's not just about visibility maps (which was got corrupted). I see
    checksum failures on VM and FSM. I think I forgot about the FSM cases,
    and by the fact that I saw no failures on the ryzen post revert. But
    clearly, other machines still have issues.
    
    Another interesting fact is that the checksum failures happen both on
    the primary and the standby, it's not just a standby issue. But again,
    this sees to be machine-dependent. On the rpi5 I've only seen standby
    issues. The xeon sees failures both on primary/standby (roughly 1:1).
    
    There are more weird things. If I grep for page failures, I see this (a
    more detailed log attached):
    
    -----------
    # 2025-08-28 22:33:28.195 CEST startup[177466] LOG:  page verification
    failed, calculated checksum 25350 but expected 44559
    # 2025-08-28 22:33:28.197 CEST startup[177466] LOG:  page verification
    failed, calculated checksum 25350 but expected 44559
    # 2025-08-28 22:33:28.199 CEST startup[177466] LOG:  page verification
    failed, calculated checksum 59909 but expected 53920
    # 2025-08-28 22:33:28.201 CEST startup[177466] LOG:  page verification
    failed, calculated checksum 59909 but expected 53920
    # 2025-08-28 22:33:28.206 CEST startup[177466] LOG:  page verification
    failed, calculated checksum 59909 but expected 53920
    # 2025-08-28 22:33:28.207 CEST startup[177466] LOG:  page verification
    failed, calculated checksum 25350 but expected 44559
    -----------
    
    This is right after a single restart, while doing the recovery. The
    weird thing is, this is all for just two FSM pages!
    
    -----------
    LOG:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing
    out page
    LOG:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing
    out page
    -----------
    
    And the calculated/expected checksums repeat! It's just different WAL
    records hitting the same page, and complaining about the same issue,
    after claiming the page was zeroed out. Isn't that weird? How come the
    page doesn't "get" the correct checksum after the first redo?
    
    I've seen these failures after changing checksums in both directions,
    both after enabling and disabling. But I've only ever saw this after
    immediate shutdown, never after fast shutdown. (It's interesting the
    pg_checksums failed only after fast shutdowns ...).
    
    Could it be that the redo happens to start from an older position, but
    using the new checksum version?
    
    
    regards
    
    -- 
    Tomas Vondra
    
  60. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-08-29T14:38:22Z

    On 8/29/25 16:26, Tomas Vondra wrote:
    > ...
    > 
    > I've seen these failures after changing checksums in both directions,
    > both after enabling and disabling. But I've only ever saw this after
    > immediate shutdown, never after fast shutdown. (It's interesting the
    > pg_checksums failed only after fast shutdowns ...).
    > 
    
    Of course, right after I send a message, it fails after a fast shutdown,
    contradicting my observation ...
    
    > Could it be that the redo happens to start from an older position, but
    > using the new checksum version?
    > 
    
    ... but it also provided more data supporting this hypothesis. I added
    logging of pg_current_wal_lsn() before / after changing checksums on the
    primary, and I see this:
    
    1) LSN before: 14/2B0F26A8
    2) enable checksums
    3) LSN after: 14/EE335D60
    4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
    5) standby restarts with -m fast
    6) redo starts at 14/230043B0, which is *before* enabling checksums
    
    I guess this is the root cause. A bit more detailed log attached.
    
    
    regards
    
    -- 
    Tomas Vondra
    
  61. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-09-01T12:11:06Z

    On 8/29/25 16:38, Tomas Vondra wrote:
    > On 8/29/25 16:26, Tomas Vondra wrote:
    >> ...
    >>
    >> I've seen these failures after changing checksums in both directions,
    >> both after enabling and disabling. But I've only ever saw this after
    >> immediate shutdown, never after fast shutdown. (It's interesting the
    >> pg_checksums failed only after fast shutdowns ...).
    >>
    > 
    > Of course, right after I send a message, it fails after a fast shutdown,
    > contradicting my observation ...
    > 
    >> Could it be that the redo happens to start from an older position, but
    >> using the new checksum version?
    >>
    > 
    > ... but it also provided more data supporting this hypothesis. I added
    > logging of pg_current_wal_lsn() before / after changing checksums on the
    > primary, and I see this:
    > 
    > 1) LSN before: 14/2B0F26A8
    > 2) enable checksums
    > 3) LSN after: 14/EE335D60
    > 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
    > 5) standby restarts with -m fast
    > 6) redo starts at 14/230043B0, which is *before* enabling checksums
    > 
    > I guess this is the root cause. A bit more detailed log attached.
    > 
    
    I kept stress testing this over the weekend, and I think I found two
    issues causing the checksum failures, both for a single node and on a
    standby:
    
    1) no checkpoint in the "disable path"
    
    In the "enable" path, a checkpoint it enforced before flipping the state
    from "inprogress-on" to "on". It's hidden in the ProcessAllDatabases,
    but it's there. But the "off" path does not do that, probably on the
    assumption that we'll always see the writes in the WAL order, so that
    we'll see the XLOG_CHECKSUMS setting checksums=off before seeing any
    writes without checksums.
    
    And in the happy path this works fine - the standby is happy, etc. But
    what about after a crash / immediate shutdown? Consider a sequence like
    this:
    
    a) we have checksums=on
    b) write to page P, updating the checksum
    c) start disabling checksums
    d) progress to inprogress-off
    e) progress to off
    f) write to page P, without checksum update
    g) the write to P gets evicted (small shared buffers, ...)
    h) crash / immediate shutdown
    
    Recovery starts from a LSN before (a), so we believe checksums=on. We
    try to redo the write to P, which starts by reading the page from disk,
    to check the page LSN. We still think checksums=on, and to read the LSN
    we need to verify the checksum. But the page was modified without the
    checksum, and evicted. Kabooom!
    
    This is not that hard to trigger by hand. Add a long at the end of
    SetDataChecksumsOff, start a pgbench on a scale larger than shared
    buffers and call pg_disable_data_checksums(). Once it gets stuck on the
    sleep, give it more time to dirty and evict some pages, then kill -9. On
    recovery you should get the same checksum failures.
    
    FWIW I've only ever seen failures on fsm/vm forks, which matches what I
    see in the TAP tests. But isn't it a bit strange?
    
    
    I think the "disable" path needs a checkpoint between inprogress-off and
    off states, same as the "enable" path.
    
    
    2) no restart point on the standby
    
    The standby has a similar issue, I think. Even if the primary creates
    all the necessary checkpoints, the standby may not need to create the
    restart point for them. If you look into xlog_redo, it only "remembers"
    the checkpoint position, it does not trigger a restart point. Than only
    happens in XLogPageRead, based on distance from the previous one.
    
    So a very similar failure to the primary is possible, even with the
    extra checkpoint fixing (1). The primary flips checksums in either
    direction, generating checkpoints, but the standby does not create the
    restart points. But it applies WAL, and some of the pages without
    checksums get evicted.
    
    And then the standby fails, and goes to some redo position far back, and
    runs into the same checksum failure when trying to check page LSN.
    
    I think the standby needs some logic to force restart point creation
    when the checksum flag changed.
    
    
    I have an experimental WIP branch at:
    
    https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks
    
    It fixes the TAP issues reported earlier (and a couple more), and it
    does a bunch of additional tweaks:
    
    a) A lot of debug messages that helped me to figure this out. This is
    probably way too much, especially the controlfile updates can be very
    noisy on a standby.
    
    b) Adds a simpler TAP, testing just a single node (should be easier to
    understand than with failures on standby).
    
    c) Adds an explicit checkpoints, to fix (1). It probably adds too many
    checkpoints, though? AFAICS a checkpoint after the "inprogress" phase
    should be enough, a checkpoint after the "on/off" can go away.
    
    d) Forces creating a restart point on the first checkpoint after
    XLOG_CHECKSUMS record. It's done in a bit silly way, using a static
    flag. Maybe there's a more elegant approach, say by comparing the
    checksum value in ControlFile to the received checkpoint?
    
    e) Randomizes a couple more GUC values. This needs more thought, it was
    done blindly before better understanding how the failures happen (it
    requires buffers evicted, not hitting max_wal_size, ...). There are more
    params worth randomizing (e.g. the "fast" flag).
    
    Anyway, with (c) and (d) applied, the checksum failures go away. It may
    not be 100% right (e.g. we could do away with fewer checkpoints), but it
    seems to be the right direction.
    
    I don't have time to cleanup the branch more, I've already spent too
    much time looking at LSNs advancing in weird ways :-( Hopefully it's
    good enough to show what needs to be fixed, etc. If there's a new
    version, I'm happy to rerun the tests on my machines, ofc.
    
    
    However, there still are more bugs. Attached is a log from a crash after
    hitting the assert into AbsorbChecksumsOffBarrier:
    
     Assert((LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) &&
     (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
      LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION));
    
    This happened while flipping checksums to 'off, but the backend already
    thinks checksum are 'off':
    
        LocalDataChecksumVersion==0
    
    I think this implies some bug in setting up LocalDataChecksumVersion
    after connection, because this is for a query checking the checksum
    state, executed by the TAP test (in a new connection, right?).
    
    I haven't looked into this more, but how come the "off" direction does
    not need to check InitialDataChecksumTransition?
    
    
    I think the TAP test turned out to be very useful, so far. While
    investigating on this, I thought about a couple more tweaks to make it
    detect additional issues (on top of the randomization).
    
    - Right now the shutdowns/restarts happen only in very limited places.
    The checksum flips from on/off or off/on, and then a restart happens.
    AFAICS it never happens in the "inprogress" phases, right?
    
    - The pgbench clients connect once, so there are almost no new
    connections while flipping checksums. Maybe some of the pgbenches should
    run with "-C", to open new connections. It was pretty lucky the TAP
    query hit the assert, this would make it more likely.
    
    
    regards
    
    -- 
    Tomas Vondra
    
  62. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-10-06T12:53:01Z

    > On 1 Sep 2025, at 14:11, Tomas Vondra <tomas@vondra.me> wrote:
    
    > I kept stress testing this over the weekend, and I think I found two
    > issues causing the checksum failures, both for a single node and on a
    > standby:
    
    Thanks a lot for all your testing. I was able to reproduce ..
    
    > I have an experimental WIP branch at:
    > 
    > https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks
    
    .. and with your changes I also see the reproducer go away.  I concur that
    you likely found the issue and the right fix for it.  I have absorbed your
    patches into my branch, the debug logging is left as 0002 but the other ones
    are incorporated into 0001.
    
    > It fixes the TAP issues reported earlier (and a couple more), and it
    > does a bunch of additional tweaks:
    > 
    > a) A lot of debug messages that helped me to figure this out. This is
    > probably way too much, especially the controlfile updates can be very
    > noisy on a standby.
    
    I've toned down the logging a bit, but kept most of it in 0002.
    
    > b) Adds a simpler TAP, testing just a single node (should be easier to
    > understand than with failures on standby).
    
    This turned out to be more useful than I initially thought, so I've kept this
    in the attached version.  There could be value in separating the single and
    dual node tests into different PG_TEST_EXTRA values given how intensive the
    latter is.
    
    > Anyway, with (c) and (d) applied, the checksum failures go away. It may
    > not be 100% right (e.g. we could do away with fewer checkpoints), but it
    > seems to be the right direction.
    
    I think so too, and while I have removed one of them due to being issued just
    before (or after) another checkpoint I do believe this is the right fix for the
    issue.  There might well be more issues, but I wanted to get a new version out
    on the thread to get more visibility on the the new tests.
    
    > I haven't looked into this more, but how come the "off" direction does
    > not need to check InitialDataChecksumTransition?
    
    This boiled down into the barrier absorbing functions evolving out of sync with
    one another over multiple versions of the patch.  To address this absorbing the
    barrier has been converted into a single function which is driven by an array
    of ChecksumBarrierCondition structs, one for each target state.  This struct
    defines what the current state of the cluster must be for the barrier to be
    successfully absorbed.  This removed a lot of duplicate code and also unifies
    the previously quite varied levels of assertions at the barrier.
    
    > I think the TAP test turned out to be very useful, so far. While
    > investigating on this, I thought about a couple more tweaks to make it
    > detect additional issues (on top of the randomization).
    > 
    > - Right now the shutdowns/restarts happen only in very limited places.
    > The checksum flips from on/off or off/on, and then a restart happens.
    > AFAICS it never happens in the "inprogress" phases, right?
    
    That would be a good idea, as well as use a crashing injection test on top
    controlled shutdowns.
    
    > - The pgbench clients connect once, so there are almost no new
    > connections while flipping checksums. Maybe some of the pgbenches should
    > run with "-C", to open new connections. It was pretty lucky the TAP
    > query hit the assert, this would make it more likely.
    
    I've added this to both tests using pgbench, randomized with a cointoss call.
    
    The attached has the above as well as a few other changes:
    
    * A new injection test which calls abort() right before checkpointing has been
    added in 005_injection.
    
    * Most ereport calls have gotten proper errcodes in order to aid analysis,
    particurly fleet-wide analysis should this be deployed in a larger setting.
    
    * More code-level documentation of test code and several tweaked (and added)
    code comments to aid readability.
    
    --
    Daniel Gustafsson
    
    
  63. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-11-05T08:56:13Z

    Rebase due to recent conflicts with only a trivial whitespace fix, otherwise
    the same as the previous version.
    
    --
    Daniel Gustafsson
    
    
  64. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-11-05T14:05:48Z

    Hi Daniel,
    
    I took a look at the patch again, focusing mostly on the docs and
    comments. I think the code is OK, I haven't noticed anything serious.
    
    
    testing
    -------
    
    I'm running the TAP tests - so far it looks fine, I've done 2000
    iterations of the "single" test, now running ~2000 iterations of the
    "standby" test. No issues/failures so far.
    
    The question is whether we can/should make it even more "random", by
    doing restarts in more places etc. I might give it a try, if I happen to
    have some free time. But no promises, I'm not sure how feasible it is.
    Making it "truly random" means it's hard to deduce what should be the
    actual state of checksums, etc.
    
    
    review
    ------
    
    Attached is a patch adding FIXME/XXX comments to a bunch of places,
    which I think makes it clearer which place I'm talking about. I'll
    briefly go over the items, and maybe explain them a bit more.
    
    1) func-admin.sgml
    
    - This is missing documentation for the "fast" parameters for both
    functions (enable/disable).
    
    - The paragraph stars with "Initiates data checksums for ..." but that
    doesn't sound right to me. I think you can initiate enabling/disabling,
    not "data checksums".
    
    - I think the docs should at least briefly describe the impact on the
    cluster, and also on a standby, due to having to write everything into
    WAL, waiting for checkpoints etc. And maybe discuss how to mitigate that
    in some way. More about the standby stuff later.
    
    
    2) glossary.sgml
    
    - This describes "checksum worker" as process that enables or disables
    checksums in a specific database, but we don't need any per-database
    processes when disabling, no?
    
    
    3) monitoring.sgml
    
    - I'm not sure what "regardles" implies here. Does that mean we simply
    don't hide/reset the counters?
    
    - I added a brief explanation about using the "launcher" row for overall
    progress, and per-database workers for "current progress".
    
    - Do we want to refer to "datachecksumsworker"? Isn't it a bit too
    low-level detail?
    
    - The table of phases is missing "waiting" and "done" phases. IMHO if
    the progress view can return it, it should be in the docs.
    
    
    4) wal.sgml
    
    - I added a sentence explaining that both enabling and disabling happens
    in phases, with checkpoints in between. I think that's helpful for users
    and DBAs.
    
    - The section only described "enabling checksums", but it probably
    should explain the process for disabling too. Added a para.
    
    - Again, I think we should explain the checkpoints, restartpoints,
    impact on standby ... somewhere. Not sure if this is the right place.
    
    
    5) xlog.c
    
    - Some minor corrections (typos, ...).
    
    - Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place
    where we check InitialDataChecksumTransition stale? AFAIK we now check
    this in AbsorbDataChecksumsBarrier for all transitions, no?
    
    
    6) datachecksumsworker.c
    
    - I understand what the comments at the beginning of the file say, but I
    suspect it's partially due to already knowing the code. There's a couple
    places that might be more explicit, I think. For example:
    
    - One of the items in the synchronization/correctness section states
    that "Backends SHALL NOT violate local data_checksums state" but what
    does "violating local data_checksums state" mean? What even is "local
    state in this context"? Should we explain/define that, or would that be
    unnecessarily detailed?
    
    - The section only talks about "enabling checksums" but also discusses
    all four possible states. Maybe it should talk about disabling too, as
    that requires the same synchronization/correctness.
    
    - Maybe it'd be good make it more explicit at which point the process
    waits on a barrier, which backend initiates that (and which backends are
    required to absorb the signal). It kinda is there, but only indirectly.
    
    - Another idea I had is that maybe it'd help to have some visualization
    of the process (with data_checksum states, barriers, ...) e.g. in the
    form of an ASCII image.
    
    
    open questions
    --------------
    
    For me the main remaining question is impact people should expect in
    production systems, and maybe ways to mitigate that.
    
    In single-node systems this is entirely fine, I think. There will be
    checkpoints, but those will be "spread" and it's just the checksum
    worker waiting for that to complete.
    
    It'll write everything into WAL, but that's fairly well understood /
    should be expected. We should probably mention that in the sgml docs, so
    that people are not surprised their WAL archive gets huge.
    
    I'm much more concerned about streaming replicas, because there it
    forces a restart point - and it *blocks redo* until it completes. Which
    means there'll be replication lag, and for synchronous standbys this
    would even block progress on the primary.
    
    We should very clearly document this. But I'm also wondering if we might
    mitigate this in some way / reduce the impact.
    
    I see some similarities to shutdown checkpoints, which can take a lot of
    time if there happen to be a lot of dirty data, increasing disruption
    during planned restarts (when no one can connect). A common mitigation
    is to run CHECKPOINT shortly before the restart, to flush most of the
    dirty data while still allowing new connections.
    
    Maybe we could do something like that for checksum changes? I don't know
    how exactly we could do that, but let's say we can predict when roughly
    to expect the next state change. And we'd ensure the standby starts
    flushing stuff before that, so that creating the restartpoint is cheap.
    Or maybe we'd (gradually?) lower max_wal_size on the standby, to reduce
    the amount of WAL as we're getting closer to the end?
    
    
    regards
    
    -- 
    Tomas Vondra
    
  65. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-11-10T01:26:07Z

    Hi,
    
    We had some off-list discussions about this patch (me, Daniel and
    Andres), and Andres mentioned he suspects the patch may be breaking PITR
    in some way. We didn't have any example of that, but PITR seems like a
    pretty fundamental feature, so I took it seriously and decided to do
    some stress testing. And yeah, there are issues :-(
    
    I did a similar stress testing in the past, which eventually evolved
    into the two TAP tests in the current patch. Those TAP tests run either
    a single node or primary-standby cluster, and flip checksums on/off
    while restarting the instance(s). And verify the cluster behaves OK,
    with no checksum failures, unexpected states, etc.
    
    I chose to do something similar to test PITR - run a single node with
    pgbench (or some other write activity), flip checksums on/off in a loop,
    while taking basebackups. And then validate the basebackups are valid,
    and can be used to start a new instance. I planned to extend this to
    more elaborate tests with proper PITR using a WAL archive, etc. I didn't
    get that far - this simple test already hit a couple issues.
    
    I'm attaching sets of test scripts, and the scripts used to validate
    backups. The scripts are fairly simple, but need some changes to run -
    adjusting some paths, etc.
    
    1) basebackup-short.tgz - basic basebackup test
    
    2) basebackup-long.tgz - long-runnning basebackup test (throttled)
    
    3) validate.tgz - validate backups from (1) and (2)
    
    The test scripts expect "scale" parameter for pgbench. I used 500, but
    that's mostly arbitrary - maybe a different value would hit the issues
    more often. Not sure.
    
    While testing I ran into two issues while validating the backups (which
    is essentially about performing the usual basebackup redo).
    
    
    1) assert in a checkpointer
    
    TRAP: failed Assert("TransactionIdIsValid(initial)"), File:
    "procarray.c", Line: 1707, PID: 2649754
    postgres: checkpointer (ExceptionalCondition+0x56) [0x55d64ec38f96]
    postgres: checkpointer (+0x5341d8) [0x55d64eac41d8]
    postgres: checkpointer (GetOldestTransactionIdConsideredRunning+0xc)
    [0x55d64eac598c]
    postgres: checkpointer (CreateRestartPoint+0x725) [0x55d64e7dd2f5]
    postgres: checkpointer (CheckpointerMain+0x3ec) [0x55d64ea38a8c]
    postgres: checkpointer (postmaster_child_launch+0x102) [0x55d64ea3b592]
    postgres: checkpointer (+0x4ad74a) [0x55d64ea3d74a]
    postgres: checkpointer (PostmasterMain+0xce7) [0x55d64ea40b97]
    postgres: checkpointer (main+0x1ca) [0x55d64e713b1a]
    /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f55f1a33ca8]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f55f1a33d65]
    postgres: checkpointer (_start+0x21) [0x55d64e713e81]
    
    I was really confused by this initially, because why would this break
    tracking of running transactions in a snapshot, etc.? But that's really
    a red herring - the real issue is calling CreateRestartPoint().
    
    This happens because we need to ensure that a standby does not get
    confused about checksums state, so redo of XLOG_CHECKSUMS forces
    creating a restart point whenever the checksum state changes. Which has
    some negative consequences, as discussed in my previous message.
    
    But AFAICS in this case the root cause is calling this during a regular
    redo, not just on a standby. Presumably there's a way to distinguish
    these two cases, and skip the restart point creation on a simple redo.
    Alternatively, maybe the standby should do something different instead
    of creating a restart point. (More about that later.)
    
    
    2) unexpected state during redo of XLOG_CHECKSUMS record
    
    An example of the failure:
    
        LOG:  redo starts at 36/5B028668
        FATAL:  incorrect data checksum state 3 for target state 1
        CONTEXT:  WAL redo at 37/E136E408 for XLOG/CHECKSUMS: on
        ERROR:  incorrect data checksum state 3 for target state 1
        ERROR:  incorrect data checksum state 3 for target state 1
        ERROR:  incorrect data checksum state 3 for target state 1
        ERROR:  incorrect data checksum state 3 for target state 1
        ERROR:  incorrect data checksum state 3 for target state 1
        LOG:  startup process (PID 2649028) exited with exit code 1
    
    I was really confused about this at first, because I haven't seen this
    during earlier testing (which did redo after a crash), so why should it
    happen here?
    
    But the reason is very simple - the basebackup can run for a while,
    spanning multiple checkpoints, and also multiple changes of checksum
    state (enable/disable). Furthermore, basebackup copies the pg_control
    file last, and that's where we get the checksum state from.
    
    For the earlier crash/restart test that's not the case, because each
    checksum change creates a checkpoint, and the redo starts from that
    point. There can't be multiple XLOG_CHECKSUMS records to apply.
    
    So after a crash we start from the last checkpoint (i.e. from the
    initial checksum state), and then there can be only a single
    XLOG_CHECKSUMS record to reapply.
    
    But with a basebackup we get the checksum state from the pg_control
    copied at the end (so likely the final state, not the initial one), and
    there may be multiple XLOG_CHECKSUMS in between (and multiple
    checkpoints). So what happens is that we start from the final state, and
    then try to apply the earlier XLOG_CHECKSUMS states. Hence the failure.
    
    This seems like a pretty fundamental issue - we can't guarantee a backup
    won't run too long, or anything like that. A full backup may easily run
    for multiple hours or even days.
    
    I suspect it should be possible to hit an issue similar to those we
    observed on the standby in earlier testing (which is why we ended up
    creating startpoints on a standby) with "future" blocks.
    
    Imagine a basebackup starts, we disable checksums, and one of the blocks
    gets written to the disk without a checksum before the basebackup copies
    that file. Then during redo of the backup, we start with checksums=on
    (assume we don't have the issue with incorrect state). If we attempt to
    read the page before the XLOG_CHECKSUMS, that'll fail, because the
    on-disk page does not have a valid checksum, and we need that to read
    the page LSN.
    
    I only speculate this can happen, I haven't actually seen / reproduced
    it. But maybe it's fixed thanks to FPI, or something like that? It
    didn't help on the standby, though. And in that case allowing a random
    mix of pages with/without checksums in a basebackup seems problematic.
    
    What could we do about the root cause? We discussed this with Daniel and
    we've been stuck for quite a while. But then it occurred to us maybe we
    can simply "pause" the checksum state change while there's backup in
    progress. We already enable/disable FPW based on this, so why couldn't
    we check XLogCtl->Insert.runningBackups, and only advance to the next
    checksum state if (runningBackups==0)?
    
    That would mean a single backup does not need to worry about seeing a
    mix of blocks written with different checksum states, and it also means
    the final pg_control file has the correct checksum state, because it is
    not allowed to change during the basebackup.
    
    Of course, this would mean checksum changes may take longer. A corner
    case is that database with a basebackup running 100% of the time won't
    be able to change checksums on-line. But to me that seems acceptable, if
    communicated / documented clearly.
    
    It also occurred to me something like this might help with the standby
    case too. On the standby, the problem happens when it skips checkpoints
    when creating a restart point, in which case redo may go too far back,
    with an incompatible checksum state. Maybe if a standby reported LSN of
    the last restartpoint, the primary could use that to decide whether it's
    safe to update the checksum state. Of course, this is tricky, because
    standbys may be disconnected, there's cascading replications, etc.
    
    
    regards
    
    -- 
    Tomas Vondra
  66. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-11-18T19:06:40Z

    On 11/10/25 02:26, Tomas Vondra wrote:
    > What could we do about the root cause? We discussed this with Daniel and
    > we've been stuck for quite a while. But then it occurred to us maybe we
    > can simply "pause" the checksum state change while there's backup in
    > progress. We already enable/disable FPW based on this, so why couldn't
    > we check XLogCtl->Insert.runningBackups, and only advance to the next
    > checksum state if (runningBackups==0)?
    > 
    > That would mean a single backup does not need to worry about seeing a
    > mix of blocks written with different checksum states, and it also means
    > the final pg_control file has the correct checksum state, because it is
    > not allowed to change during the basebackup.
    > 
    > Of course, this would mean checksum changes may take longer. A corner
    > case is that database with a basebackup running 100% of the time won't
    > be able to change checksums on-line. But to me that seems acceptable, if
    > communicated / documented clearly.
    
    After thinking about this approach a bit, I realized the basebackup may
    also run on the standby. Which means the checksum process won't see it
    by checking XLogCtl->Insert.runningBackups. It will merrily proceed,
    breaking the standby backup just as described earlier ...
    
    Not sure what would be a good fix. One option is to "pause" the redo,
    which is what the patch already does (by forcing an immediate checkpoint
    whenever checksum state changes). We could pause redo until the backup
    completes. But of course, that'd be terrible - especially for syncrep. I
    hoped we'd find a better approach, and pausing redo for longer goes in
    the opposite direction.
    
    On the other hand, we already have similar issue with full_page_writes.
    The backup on standby is not allowed to start if fpw=off, and if the
    setting changes while the backup is running, the backup fails:
    
      pg_basebackup: error: backup failed: ERROR:  WAL generated with
    "full_page_writes=off" was replayed during online backup
    HINT:  This means that the backup being taken on the standby is corrupt
    and should not be used. Enable "full_page_writes" and run CHECKPOINT on
    the primary, and then try an online backup again.
    
    Maybe this would be acceptable for checksums too ...
    
    It's not exactly the same, of course. We don't really expect people to
    change fpw in a running cluster.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  67. Re: Changing the state of data checksums in a running cluster

    Andreas Karlsson <andreas@proxel.se> — 2025-11-19T21:03:37Z

    Hi,
    
    I have been following these discussions but not read the patch in detail.
    
    This patch makes me worried especially with the new issues recently 
    uncovered. This was already a quite big patch and to fix these issues it 
    will likely have to become even bigger and given how this would become a 
    very rarely stressed code paths I wonder if we can actually ever become 
    confident that the patch works in all edge cases.
    
    Something like this need to be easy to understand for us to have any 
    hope at all to be comfortable in the correctness. Can we actually do that?
    
    Andreas
    
    
    
    
  68. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-11-20T10:34:14Z

    On 11/19/25 22:03, Andreas Karlsson wrote:
    > Hi,
    > 
    > I have been following these discussions but not read the patch in detail.
    > 
    > This patch makes me worried especially with the new issues recently
    > uncovered. This was already a quite big patch and to fix these issues it
    > will likely have to become even bigger and given how this would become a
    > very rarely stressed code paths I wonder if we can actually ever become
    > confident that the patch works in all edge cases.
    > 
    > Something like this need to be easy to understand for us to have any
    > hope at all to be comfortable in the correctness. Can we actually do that?
    > 
    
    How's this different from any other complex patch? We get more familiar
    with the problem during review, identify issues, improve the patch to
    address them. And then again and again.
    
    Of course, it'd be great to have a perfect understanding of the problem
    from the very beginning, but that's not always possible. And I can't
    guarantee we'll find/fix all issues.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  69. Re: Changing the state of data checksums in a running cluster

    Andreas Karlsson <andreas@proxel.se> — 2025-11-21T00:44:31Z

    On 11/20/25 11:34 AM, Tomas Vondra wrote:
    > On 11/19/25 22:03, Andreas Karlsson wrote:
    >> Hi,
    >>
    >> I have been following these discussions but not read the patch in detail.
    >>
    >> This patch makes me worried especially with the new issues recently
    >> uncovered. This was already a quite big patch and to fix these issues it
    >> will likely have to become even bigger and given how this would become a
    >> very rarely stressed code paths I wonder if we can actually ever become
    >> confident that the patch works in all edge cases.
    >>
    >> Something like this need to be easy to understand for us to have any
    >> hope at all to be comfortable in the correctness. Can we actually do that?
    >>
    > 
    > How's this different from any other complex patch? We get more familiar
    > with the problem during review, identify issues, improve the patch to
    > address them. And then again and again.
    
    The difference I see is in how rarely anyone actually switches checksum 
    state in a production database, especially now that we enabled them by 
    default. A complex and rarely stressed code path is a minefield.
    
    Andreas
    
    
    
    
    
  70. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2025-11-21T12:17:09Z

    
    On 11/21/25 01:44, Andreas Karlsson wrote:
    > On 11/20/25 11:34 AM, Tomas Vondra wrote:
    >> On 11/19/25 22:03, Andreas Karlsson wrote:
    >>> Hi,
    >>>
    >>> I have been following these discussions but not read the patch in
    >>> detail.
    >>>
    >>> This patch makes me worried especially with the new issues recently
    >>> uncovered. This was already a quite big patch and to fix these issues it
    >>> will likely have to become even bigger and given how this would become a
    >>> very rarely stressed code paths I wonder if we can actually ever become
    >>> confident that the patch works in all edge cases.
    >>>
    >>> Something like this need to be easy to understand for us to have any
    >>> hope at all to be comfortable in the correctness. Can we actually do
    >>> that?
    >>>
    >>
    >> How's this different from any other complex patch? We get more familiar
    >> with the problem during review, identify issues, improve the patch to
    >> address them. And then again and again.
    > 
    > The difference I see is in how rarely anyone actually switches checksum
    > state in a production database, especially now that we enabled them by
    > default. A complex and rarely stressed code path is a minefield.
    > 
    
    True. Hence the stress testing I've been doing - and indeed, that made
    us discover the various issues reported in this thread.
    
    Still, isn't that similar to error paths in various other patches? Those
    also tend to be rarely exercised in practice. I think the right way to
    address that is more testing. Of course, there's a difference between
    "regular bugs" and "design problems". Some of the issues are more about
    the design/architecture not considering something important.
    
    I don't know if / when this will be ready for commit. Maybe never, who
    knows. I prefer going step by step. We know about a couple issues, we
    need to figure out what to do about those. Then we can reconsider.
    
    FWIW I'm not sure the number of people currently enabling checksums on
    production databases is a good metric of how important the patch is.
    Maybe more people would like to do that, but can't accept the downtime.
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  71. Re: Changing the state of data checksums in a running cluster

    Bruce Momjian <bruce@momjian.us> — 2025-11-21T19:57:35Z

    On Fri, Nov 21, 2025 at 01:17:09PM +0100, Tomas Vondra wrote:
    > True. Hence the stress testing I've been doing - and indeed, that made
    > us discover the various issues reported in this thread.
    > 
    > Still, isn't that similar to error paths in various other patches? Those
    > also tend to be rarely exercised in practice. I think the right way to
    > address that is more testing. Of course, there's a difference between
    > "regular bugs" and "design problems". Some of the issues are more about
    > the design/architecture not considering something important.
    > 
    > I don't know if / when this will be ready for commit. Maybe never, who
    > knows. I prefer going step by step. We know about a couple issues, we
    > need to figure out what to do about those. Then we can reconsider.
    > 
    > FWIW I'm not sure the number of people currently enabling checksums on
    > production databases is a good metric of how important the patch is.
    > Maybe more people would like to do that, but can't accept the downtime.
    
    I think it is a worth-while feature.  We would have had it years ago
    except that people asked for re-start-ability after a crash, and since
    we don't have restart logic at the relation level, the patch got too
    complex and was abandoned.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Do not let urgent matters crowd out time for investment in the future.
    
    
    
    
  72. Re: Changing the state of data checksums in a running cluster

    Andres Freund <andres@anarazel.de> — 2025-11-21T20:28:40Z

    Hi,
    
    On 2025-11-21 01:44:31 +0100, Andreas Karlsson wrote:
    > On 11/20/25 11:34 AM, Tomas Vondra wrote:
    > > On 11/19/25 22:03, Andreas Karlsson wrote:
    > > > I have been following these discussions but not read the patch in detail.
    > > > 
    > > > This patch makes me worried especially with the new issues recently
    > > > uncovered. This was already a quite big patch and to fix these issues it
    > > > will likely have to become even bigger and given how this would become a
    > > > very rarely stressed code paths I wonder if we can actually ever become
    > > > confident that the patch works in all edge cases.
    > > > 
    > > > Something like this need to be easy to understand for us to have any
    > > > hope at all to be comfortable in the correctness. Can we actually do that?
    > > > 
    > > 
    > > How's this different from any other complex patch? We get more familiar
    > > with the problem during review, identify issues, improve the patch to
    > > address them. And then again and again.
    > 
    > The difference I see is in how rarely anyone actually switches checksum
    > state in a production database, especially now that we enabled them by
    > default. A complex and rarely stressed code path is a minefield.
    
    FWIW, I think this is actually a good feature build the infrastructure for
    features (i.e. dynamically reconfiguring the cluster while running) like this,
    precisely because it isn't *constantly* used.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  73. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2025-12-01T21:51:04Z

    > On 5 Nov 2025, at 15:05, Tomas Vondra <tomas@vondra.me> wrote:
    
    > I took a look at the patch again, focusing mostly on the docs and
    > comments.
    
    While I am still working on a few ideas on how to handle the PITR issue, I
    didn't want to leave this review hanging longer (and it would be nice to get it
    out of the way and not mix it with other issues).
    
    > 1) func-admin.sgml
    > 
    > - This is missing documentation for the "fast" parameters for both
    > functions (enable/disable).
    
    I had originally omitted them intentionally since they were intended for
    testing, but I agree that they should be documented as they might be useful
    outside of testng as well.
    
    > - The paragraph stars with "Initiates data checksums for ..." but that
    > doesn't sound right to me. I think you can initiate enabling/disabling,
    > not "data checksums".
    
    Fair point.
    
    > - I think the docs should at least briefly describe the impact on the
    > cluster, and also on a standby, due to having to write everything into
    > WAL, waiting for checkpoints etc. And maybe discuss how to mitigate that
    > in some way. More about the standby stuff later.
    
    Agreed, but I think it's better done in a one central place like the data
    checksums section in wal.sgml (which your XXX also mention).  In the preamble
    to the function table I've added a mention of the system performance impact
    with a link to the aforementioned section.
    
    > 2) glossary.sgml
    > 
    > - This describes "checksum worker" as process that enables or disables
    > checksums in a specific database, but we don't need any per-database
    > processes when disabling, no?
    
    That's very true, the Worker Launcher will receive the operation and if it is
    to disable it will proceed without launching any workers since it's cluster
    wide.  Thinking about it, maybe DataChecksumsWorkerLauncher isn't a very good
    name, DataChecksumsController or DataChecksumsCoordinator might be a better
    choice?
    
    Naming seems to be hard, who knew..
    
    > 3) monitoring.sgml
    > 
    > - I'm not sure what "regardles" implies here. Does that mean we simply
    > don't hide/reset the counters?
    
    Correct, I've expanded this para to mention this.
    
    > - I added a brief explanation about using the "launcher" row for overall
    > progress, and per-database workers for "current progress".
    
    +1
    
    > - Do we want to refer to "datachecksumsworker"? Isn't it a bit too
    > low-level detail?
    
    I think so, we should stick to "launcher process" and "worker process" and be
    conistent about it.
    
    > - The table of phases is missing "waiting" and "done" phases. IMHO if
    > the progress view can return it, it should be in the docs.
    
    Nice catch.  The code can't actually return "waiting" since it was broken up
    into three different wait phases, but one of them wasn't documented or added to
    the view properly.  Fixed.
    
    > 4) wal.sgml
    > 
    > - I added a sentence explaining that both enabling and disabling happens
    > in phases, with checkpoints in between. I think that's helpful for users
    > and DBAs.
    
    +1
    
    > - The section only described "enabling checksums", but it probably
    > should explain the process for disabling too. Added a para.
    
    +1
    
    > - Again, I think we should explain the checkpoints, restartpoints,
    > impact on standby ... somewhere. Not sure if this is the right place.
    
    I've added a subsection in the main Data Checksums section for this.
    
    > 5) xlog.c
    
    > - Some minor corrections (typos, ...).
    
    Thanks, I did some additional minor wordsmithing around these.
    
    > - Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place
    > where we check InitialDataChecksumTransition stale? AFAIK we now check
    > this in AbsorbDataChecksumsBarrier for all transitions, no?
    
    Correct, fixed.
    
    > 6) datachecksumsworker.c
    
    > - One of the items in the synchronization/correctness section states
    > that "Backends SHALL NOT violate local data_checksums state" but what
    > does "violating local data_checksums state" mean? What even is "local
    > state in this context"? Should we explain/define that, or would that be
    > unnecessarily detailed?
    
    By "local state" I was referring to the data checksum state that a backend
    knows about.  I've reworded this to hopefully be a little clearer.
    
    > - The section only talks about "enabling checksums" but also discusses
    > all four possible states. Maybe it should talk about disabling too, as
    > that requires the same synchronization/correctness.
    > 
    > - Maybe it'd be good make it more explicit at which point the process
    > waits on a barrier, which backend initiates that (and which backends are
    > required to absorb the signal). It kinda is there, but only indirectly.
    
    I've tried to address these, but they might still be off since I am very
    Stockholm syndromed into this.
    
    > - Another idea I had is that maybe it'd help to have some visualization
    > of the process (with data_checksum states, barriers, ...) e.g. in the
    > form of an ASCII image.
    
    I instead opted for a SVG image in the docs which illustrates the states and
    the transitions.  My Graphviz skills aren't that great so it doesn't look all
    that pretty yet, but it's something to iterate on at least.
    
    > I'm much more concerned about streaming replicas, because there it
    > forces a restart point - and it *blocks redo* until it completes. Which
    > means there'll be replication lag, and for synchronous standbys this
    > would even block progress on the primary.
    > 
    > We should very clearly document this.
    
    Indeed. I've started on such a section.
    
    > I see some similarities to shutdown checkpoints, which can take a lot of
    > time if there happen to be a lot of dirty data, increasing disruption
    > during planned restarts (when no one can connect). A common mitigation
    > is to run CHECKPOINT shortly before the restart, to flush most of the
    > dirty data while still allowing new connections.
    > 
    > Maybe we could do something like that for checksum changes? I don't know
    > how exactly we could do that, but let's say we can predict when roughly
    > to expect the next state change.
    
    Each worker knows how far it has come within its database, but is unaware about
    other databases; the launcher knows how far it has come across all databases,
    but is unaware about the relative size of each database.  Maybe there still is
    a heuristic that can be teased out of imperfect knowledge.
    
    > And we'd ensure the standby starts
    > flushing stuff before that, so that creating the restartpoint is cheap.
    > Or maybe we'd (gradually?) lower max_wal_size on the standby, to reduce
    > the amount of WAL as we're getting closer to the end?
    
    That's an interesting idea, do you know if we have processes taking a similar
    approach today?
    
    The attached is a rebase with the above fixes along with a few more smaller
    fixups and cleanups noticed along the way, nothing which change any
    functionality though.
    
    --
    Daniel Gustafsson
    
    
  74. Re: Changing the state of data checksums in a running cluster

    Andres Freund <andres@anarazel.de> — 2025-12-15T22:21:46Z

    Hi,
    
    
    One high level issue first: I don't think the way this uses checkpoints and
    restartpoints is likely to work out.
    
    Synchronously having to wait for restartpoints during recovery seems generally
    operationally a huge issue, but actually could also easily lead to undetected
    deadlocks, particularly with syncrep.
    
    Using the checksum state from the control file seems very fraught,
    particularly with PITR, as the control file can be "from the future". Which
    can be a problem e.g. if checksums were disabled, but we start recovery with a
    control file with checksums enabled.
    
    Forcing synchronous checkpoints in a bunch of places also will make this a
    good bit slower than necessary, particularly for testing.
    
    
    My suggestion for how to do this instead is to put the checksum state into the
    XLOG_CHECKPOINT_* records.  When starting recovery from an online checkpoint,
    I think we should use the ChecksumType from the XLOG_CHECKPOINT_REDO record,
    that way the standby/recovery environment's assumption about whether checksums
    were enabled is the same as it was at the time the WAL was generated. For
    shutdown checkpoints, we could either start to emit a XLOG_CHECKPOINT_REDO, or
    we can use the information from the checkpoint record itself.
    
    I think if we only ever use the checksum state from the point where we start
    recovery, we might not need to force *any* checkpoints.
    
    
    Daniel and I chatted about that proposal, and couldn't immediately come up
    with scenarios where that would be wrong.  For a while I thought there would
    be problems when doing PITR from a base backup that had checksums initially
    enabled, but where checksums were disabled before the base backup was
    completed. My worry was that a later (once checksums were disabled) hint bit
    write (which would not necessarily be WAL logged) would corrupt the checksum,
    but I don't think that's a problem, because the startup process will only read
    data pages in the process of processing WAL records, and if there's a WAL
    record, there would also have to be an FPW, which would "cure" the
    unchecksummed page.
    
    
    
    More comments below, inline - I wrote these first, so it's possible that I
    missed updating some of them in light of what I now wrote above.
    
    > +/*
    > + * This must match largest number of sets in barrier_eq and barrier_ne in the
    > + * below checksum_barriers definition.
    > + */
    > +#define MAX_BARRIER_CONDITIONS 2
    > +
    > +/*
    > + * Configuration of conditions which must match when absorbing a procsignal
    > + * barrier during data checksum enable/disable operations.  A single function
    > + * is used for absorbing all barriers, and the set of conditions to use is
    > + * looked up in the checksum_barriers struct.  The struct member for the target
    > + * state defines which state the backend must currently be in, and which it
    > + * must not be in.
    > + */
    > +typedef struct ChecksumBarrierCondition
    > +{
    > +	/* The target state of the barrier */
    > +	int			target;
    > +	/* A set of states in which at least one MUST match the current state */
    > +	int			barrier_eq[MAX_BARRIER_CONDITIONS];
    > +	/* The number of elements in the barrier_eq set */
    > +	int			barrier_eq_sz;
    > +	/* A set of states which all MUST NOT match the current state */
    > +	int			barrier_ne[MAX_BARRIER_CONDITIONS];
    > +	/* The number of elements in the barrier_ne set */
    > +	int			barrier_ne_sz;
    > +} ChecksumBarrierCondition;
    > +
    > +static const ChecksumBarrierCondition checksum_barriers[] =
    > +{
    > +	{PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2, {PG_DATA_CHECKSUM_VERSION}, 1},
    > +	{PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, {0}, 0},
    > +	{PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {PG_DATA_CHECKSUM_ANY_VERSION}, 1, {PG_DATA_CHECKSUM_VERSION}, 1},
    > +	{PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 1, {0}, 0},
    > +	{-1}
    > +};
    
    The explanation for this doesn't really explain what the purpose of this thing
    is...  Perhaps worth referencing datachecksumsworker.c or such?
    
    For a local, constantly sized, array, you shouldn't need a -1 terminator, as
    you can instead use lengthof() or such to detect invalid accesses.
    
    
    > +/*
    > + * Flag to remember if the procsignalbarrier being absorbed for checksums is
    > + * the first one.  The first procsignalbarrier can in rare cases be for the
    > + * state we've initialized, i.e. a duplicate.  This may happen for any
    > + * data_checksum_version value when the process is spawned between the update
    > + * of XLogCtl->data_checksum_version and the barrier being emitted.  This can
    > + * only happen on the very first barrier so mark that with this flag.
    > + */
    > +static bool InitialDataChecksumTransition = true;
    
    This is pretty hard to understand right now, at the very least it needs an
    updated comment. But perhaps we can just get rid of this and accept barriers
    that are redundant.
    
    
    
    > @@ -830,9 +905,10 @@ XLogInsertRecord(XLogRecData *rdata,
    >  		 * only happen just after a checkpoint, so it's better to be slow in
    >  		 * this case and fast otherwise.
    >  		 *
    > -		 * Also check to see if fullPageWrites was just turned on or there's a
    > -		 * running backup (which forces full-page writes); if we weren't
    > -		 * already doing full-page writes then go back and recompute.
    > +		 * Also check to see if fullPageWrites was just turned on, there's a
    > +		 * running backup or if checksums are enabled (all of which forces
    > +		 * full-page writes); if we weren't already doing full-page writes
    > +		 * then go back and recompute.
    >  		 *
    >  		 * If we aren't doing full-page writes then RedoRecPtr doesn't
    >  		 * actually affect the contents of the XLOG record, so we'll update
    > @@ -845,7 +921,9 @@ XLogInsertRecord(XLogRecData *rdata,
    >  			Assert(RedoRecPtr < Insert->RedoRecPtr);
    >  			RedoRecPtr = Insert->RedoRecPtr;
    >  		}
    > -		doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
    > +		doPageWrites = (Insert->fullPageWrites ||
    > +						Insert->runningBackups > 0 ||
    > +						DataChecksumsNeedWrite());
    >
    >  		if (doPageWrites &&
    >  			(!prevDoPageWrites ||
    
    Why do we need to separately check for DataChecksumsNeedWrite() if turning on
    checksums also forces Insert->fullPageWrites to on?
    
    
    
    > +/*
    > + * SetDataChecksumsOnInProgress
    > + *		Sets the data checksum state to "inprogress-on" to enable checksums
    > + *
    > + * To start the process of enabling data checksums in a running cluster the
    > + * data_checksum_version state must be changed to "inprogress-on". See
    > + * SetDataChecksumsOn below for a description on how this state change works.
    > + * This function blocks until all backends in the cluster have acknowledged the
    > + * state transition.
    > + */
    > +void
    > +SetDataChecksumsOnInProgress(bool immediate_checkpoint)
    > +{
    > +	uint64		barrier;
    > +	int			flags;
    > +
    > +	Assert(ControlFile != NULL);
    > +
    > +	/*
    > +	 * The state transition is performed in a critical section with
    > +	 * checkpoints held off to provide crash safety.
    > +	 */
    > +	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    > +	START_CRIT_SECTION();
    
    ISTM that delayChkptFlags ought to only be set once within the critical
    section. Obviously we can't fail inbetween as the code stands, but that's not
    guaranteed to stay this way.
    
    
    > +	XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    > +
    > +	SpinLockAcquire(&XLogCtl->info_lck);
    > +	XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
    > +	SpinLockRelease(&XLogCtl->info_lck);
    
    Maybe worth adding an assertion checking that we are currently in an expected
    state (off or inprogress, I think?).
    
    
    > +	barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
    > +
    > +	END_CRIT_SECTION();
    > +	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    
    Swap as well.
    
    Think it might be worth mentioning that we rely on the memory ordering implied
    by XLogChecksums() and WaitForProcSignalBarrier() for the changes to
    delayChkptFlags. Unless we have a different logic around that?
    
    
    > +	/*
    > +	 * Await state change in all backends to ensure that all backends are in
    > +	 * "inprogress-on". Once done we know that all backends are writing data
    > +	 * checksums.
    > +	 */
    > +	WaitForProcSignalBarrier(barrier);
    > +
    > +	/*
    > +	 * force checkpoint to persist the current checksum state in control file
    > +	 * etc.
    > +	 *
    > +	 * XXX is this needed? There's already a checkpoint at the end of
    > +	 * ProcessAllDatabases, maybe this is redundant?
    > +	 */
    > +	flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
    > +	if (immediate_checkpoint)
    > +		flags = flags | CHECKPOINT_FAST;
    > +	RequestCheckpoint(flags);
    
    Why do we need a checkpoint at all?
    
    
    > +}
    > +
    > +/*
    > + * SetDataChecksumsOn
    > + *		Enables data checksums cluster-wide
    > + *
    > + * Enabling data checksums is performed using two barriers, the first one to
    > + * set the state to "inprogress-on" (done by SetDataChecksumsOnInProgress())
    > + * and the second one to set the state to "on" (done here). Below is a short
    > + * description of the processing, a more detailed write-up can be found in
    > + * datachecksumsworker.c.
    > + *
    > + * To start the process of enabling data checksums in a running cluster the
    > + * data_checksum_version state must be changed to "inprogress-on".  This state
    > + * requires data checksums to be written but not verified. This ensures that
    > + * all data pages can be checksummed without the risk of false negatives in
    > + * validation during the process.  When all existing pages are guaranteed to
    > + * have checksums, and all new pages will be initiated with checksums, the
    > + * state can be changed to "on". Once the state is "on" checksums will be both
    > + * written and verified. See datachecksumsworker.c for a longer discussion on
    > + * how data checksums can be enabled in a running cluster.
    > + *
    > + * This function blocks until all backends in the cluster have acknowledged the
    > + * state transition.
    > + */
    > +void
    > +SetDataChecksumsOn(bool immediate_checkpoint)
    >  {
    > +	uint64		barrier;
    > +	int			flags;
    > +
    >  	Assert(ControlFile != NULL);
    > -	return (ControlFile->data_checksum_version > 0);
    > +
    > +	SpinLockAcquire(&XLogCtl->info_lck);
    > +
    > +	/*
    > +	 * The only allowed state transition to "on" is from "inprogress-on" since
    > +	 * that state ensures that all pages will have data checksums written.
    > +	 */
    > +	if (XLogCtl->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
    > +	{
    > +		SpinLockRelease(&XLogCtl->info_lck);
    > +		elog(ERROR, "checksums not in \"inprogress-on\" mode");
    
    Seems like a PANIC condition to me...
    
    > +	}
    > +
    > +	SpinLockRelease(&XLogCtl->info_lck);
    > +
    > +	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    > +	INJECTION_POINT("datachecksums-enable-checksums-delay", NULL);
    > +	START_CRIT_SECTION();
    
    I think it's a really really bad idea to do something fallible, like
    INJECTION_POINT, after setting delayChkptFlags, but before entering the crit
    section. Any error in the injection point will lead to a corrupted
    delayChkptFlags, no?
    
    
    > +	XLogChecksums(PG_DATA_CHECKSUM_VERSION);
    > +
    > +	SpinLockAcquire(&XLogCtl->info_lck);
    > +	XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    > +	SpinLockRelease(&XLogCtl->info_lck);
    > +
    > +	barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    > +
    > +	END_CRIT_SECTION();
    > +	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    > +
    > +	/*
    > +	 * Await state transition of "on" in all backends. When done we know that
    > +	 * data checksums are enabled in all backends and data checksums are both
    > +	 * written and verified.
    > +	 */
    > +	WaitForProcSignalBarrier(barrier);
    > +
    > +	INJECTION_POINT("datachecksums-enable-checksums-pre-checkpoint", NULL);
    > +
    > +	/* XXX is this needed? */
    > +	flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
    > +	if (immediate_checkpoint)
    > +		flags = flags | CHECKPOINT_FAST;
    > +	RequestCheckpoint(flags);
    > +}
    > +
    > +/*
    > + * SetDataChecksumsOff
    > + *		Disables data checksums cluster-wide
    > + *
    > + * Disabling data checksums must be performed with two sets of barriers, each
    > + * carrying a different state. The state is first set to "inprogress-off"
    > + * during which checksums are still written but not verified. This ensures that
    > + * backends which have yet to observe the state change from "on" won't get
    > + * validation errors on concurrently modified pages. Once all backends have
    > + * changed to "inprogress-off", the barrier for moving to "off" can be emitted.
    > + * This function blocks until all backends in the cluster have acknowledged the
    > + * state transition.
    > + */
    > +void
    > +SetDataChecksumsOff(bool immediate_checkpoint)
    > +{
    > +	[...]
    > +	/*
    > +	 * Ensure that we don't incur a checkpoint during disabling checksums.
    > +	 */
    > +	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    > +	START_CRIT_SECTION();
    > +
    > +	XLogChecksums(0);
    
    Why no symbolic name here?
    
    
    > +	SpinLockAcquire(&XLogCtl->info_lck);
    > +	XLogCtl->data_checksum_version = 0;
    > +	SpinLockRelease(&XLogCtl->info_lck);
    > +
    > +	barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
    > +
    > +	END_CRIT_SECTION();
    > +	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    > +
    > +	WaitForProcSignalBarrier(barrier);
    > +
    > +	flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
    > +	if (immediate_checkpoint)
    > +		flags = flags | CHECKPOINT_FAST;
    > +	RequestCheckpoint(flags);
    > +}
    > +
    > +/*
    > + * AbsorbDataChecksumsBarrier
    > + *		Generic function for absorbing data checksum state changes
    > + *
    > + * All procsignalbarriers regarding data checksum state changes are absorbed
    > + * with this function.  The set of conditions required for the state change to
    > + * be accepted are listed in the checksum_barriers struct, target_state is
    > + * used to look up the relevant entry.
    > + */
    > +bool
    > +AbsorbDataChecksumsBarrier(int target_state)
    > +{
    > +	const ChecksumBarrierCondition *condition = checksum_barriers;
    > +	int			current = LocalDataChecksumVersion;
    > +	bool		found = false;
    > +
    > +	/*
    > +	 * Find the barrier condition definition for the target state. Not finding
    > +	 * a condition would be a grave programmer error as the states are a
    > +	 * discrete set.
    > +	 */
    > +	while (condition->target != target_state && condition->target != -1)
    > +		condition++;
    > +	if (unlikely(condition->target == -1))
    > +		ereport(ERROR,
    > +				errcode(ERRCODE_INVALID_PARAMETER_VALUE),
    > +				errmsg("invalid target state %i for data checksum barrier",
    > +					   target_state));
    
    FWIW, you don't need unlikely() when the branch does an ereport(ERROR), as
    ereports >=ERROR are marked "cold" automatically.
    
    
    > +	/*
    > +	 * The current state MUST be equal to one of the EQ states defined in this
    > +	 * barrier condition, or equal to the target_state if - and only if -
    > +	 * InitialDataChecksumTransition is true.
    > +	 */
    > +	for (int i = 0; i < condition->barrier_eq_sz; i++)
    > +	{
    > +		if (current == condition->barrier_eq[i] ||
    > +			condition->barrier_eq[i] == PG_DATA_CHECKSUM_ANY_VERSION)
    > +			found = true;
    > +	}
    > +	if (InitialDataChecksumTransition && current == target_state)
    > +		found = true;
    > +
    > +	/*
    > +	 * The current state MUST NOT be equal to any of the NE states defined in
    > +	 * this barrier condition.
    > +	 */
    > +	for (int i = 0; i < condition->barrier_ne_sz; i++)
    > +	{
    > +		if (current == condition->barrier_ne[i])
    > +			found = false;
    > +	}
    > +
    > +	/*
    > +	 * If the relevent state criteria aren't satisfied, throw an error which
    > +	 * will be caught by the procsignal machinery for a later retry.
    > +	 */
    > +	if (!found)
    > +		ereport(ERROR,
    > +				errcode(ERRCODE_INVALID_PARAMETER_VALUE),
    > +				errmsg("incorrect data checksum state %i for target state %i",
    > +					   current, target_state));
    > +
    > +	SetLocalDataChecksumVersion(target_state);
    > +	InitialDataChecksumTransition = false;
    > +	return true;
    > +}
    
    
    > +/*
    > + * Log the new state of checksums
    > + */
    > +static void
    > +XLogChecksums(uint32 new_type)
    > +{
    > +	xl_checksum_state xlrec;
    > +	XLogRecPtr	recptr;
    > +
    > +	xlrec.new_checksumtype = new_type;
    > +
    > +	XLogBeginInsert();
    > +	XLogRegisterData((char *) &xlrec, sizeof(xl_checksum_state));
    > +
    > +	INJECTION_POINT("datachecksums-xlogchecksums-pre-xloginsert", &new_type);
    > +
    > +	recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKSUMS);
    > +	XLogFlush(recptr);
    > +}
    
    Why an injection point between XLogBeginInsert() and XLogInsert(), rather than
    have the injection point before the XLogBeginInsert()?
    
    
    > +	else if (info == XLOG_CHECKSUMS)
    > +	{
    > +		xl_checksum_state state;
    > +		uint64		barrier;
    > +
    > +		memcpy(&state, XLogRecGetData(record), sizeof(xl_checksum_state));
    > +
    > +		/*
    > +		 * XXX Could this end up written to the control file prematurely? IIRC
    > +		 * that happens during checkpoint, so what if that gets triggered e.g.
    > +		 * because someone runs CHECKPOINT? If we then crash (or something
    > +		 * like that), could that confuse the instance?
    > +		 */
    > +		SpinLockAcquire(&XLogCtl->info_lck);
    > +		XLogCtl->data_checksum_version = state.new_checksumtype;
    > +		SpinLockRelease(&XLogCtl->info_lck);
    > +
    > +		/*
    > +		 * Block on a procsignalbarrier to await all processes having seen the
    > +		 * change to checksum status. Once the barrier has been passed we can
    > +		 * initiate the corresponding processing.
    > +		 */
    > +		switch (state.new_checksumtype)
    > +		{
    > +			case PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION:
    > +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
    > +				WaitForProcSignalBarrier(barrier);
    > +				break;
    > +
    > +			case PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION:
    > +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF);
    > +				WaitForProcSignalBarrier(barrier);
    > +				break;
    > +
    > +			case PG_DATA_CHECKSUM_VERSION:
    > +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    > +				WaitForProcSignalBarrier(barrier);
    > +				break;
    > +
    > +			default:
    > +				Assert(state.new_checksumtype == 0);
    > +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
    > +				WaitForProcSignalBarrier(barrier);
    > +				break;
    
    I'd not add a default clause, but add a case each value of the enum. That way
    we'll get warnings if the set of states changes.
    
    
    These WaitForProcSignalBarrier() are one of the scariest bits of the
    patchset. If the startup process were to hold any lock that backends need, and
    the backends waited for that lock without processing interrupts, we'd have an
    undetected deadlock.  This is much more likely to be a problem for the startup
    process, as it does the work of many backends on the primary.
    
    We do process interrupts while waiting for heavyweight locks, so that at least
    is not an issue. Seems worth to call out rather explicitly.
    
    
    
    > +	if (checksumRestartPoint &&
    > +		(info == XLOG_CHECKPOINT_ONLINE ||
    > +		 info == XLOG_CHECKPOINT_REDO ||
    > +		 info == XLOG_CHECKPOINT_SHUTDOWN))
    > +	{
    > +		int			flags;
    > +
    > +		elog(LOG, "forcing creation of a restartpoint after XLOG_CHECKSUMS");
    > +
    > +		/* We explicitly want an immediate checkpoint here */
    > +		flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_FAST;
    > +		RequestCheckpoint(flags);
    > +
    > +		checksumRestartPoint = false;
    > +	}
    
    As noted above, I don't think we should rely on starting restartpoints.
    
    
    
    > diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
    > index 2be4e069816..baf6c8cc2cc 100644
    > --- a/src/backend/backup/basebackup.c
    > +++ b/src/backend/backup/basebackup.c
    > @@ -1613,7 +1613,8 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
    >  	 * enabled for this cluster, and if this is a relation file, then verify
    >  	 * the checksum.
    >  	 */
    > -	if (!noverify_checksums && DataChecksumsEnabled() &&
    > +	if (!noverify_checksums &&
    > +		DataChecksumsNeedWrite() &&
    >  		RelFileNumberIsValid(relfilenumber))
    >  		verify_checksum = true;
    >
    
    Why is DataChecksumsNeedWrite() being tested here?
    
    
    >  --
    >  -- We also set up some things as accessible to standard roles.
    >  --
    > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
    > index 086c4c8fb6f..6d452b10bce 100644
    > --- a/src/backend/catalog/system_views.sql
    > +++ b/src/backend/catalog/system_views.sql
    > @@ -1374,6 +1374,26 @@ CREATE VIEW pg_stat_progress_copy AS
    >      FROM pg_stat_get_progress_info('COPY') AS S
    >          LEFT JOIN pg_database D ON S.datid = D.oid;
    >
    > +CREATE VIEW pg_stat_progress_data_checksums AS
    > +    SELECT
    > +        S.pid AS pid, S.datid, D.datname AS datname,
    > +        CASE S.param1 WHEN 0 THEN 'enabling'
    > +                      WHEN 1 THEN 'disabling'
    > +                      WHEN 2 THEN 'waiting on temporary tables'
    > +                      WHEN 3 THEN 'waiting on checkpoint'
    > +					  WHEN 4 THEN 'waiting on barrier'
    > +                      WHEN 5 THEN 'done'
    > +                      END AS phase,
    > +        CASE S.param2 WHEN -1 THEN NULL ELSE S.param2 END AS databases_total,
    > +        S.param3 AS databases_done,
    > +        CASE S.param4 WHEN -1 THEN NULL ELSE S.param4 END AS relations_total,
    > +        CASE S.param5 WHEN -1 THEN NULL ELSE S.param5 END AS relations_done,
    > +        CASE S.param6 WHEN -1 THEN NULL ELSE S.param6 END AS blocks_total,
    > +        CASE S.param7 WHEN -1 THEN NULL ELSE S.param7 END AS blocks_done
    > +    FROM pg_stat_get_progress_info('DATACHECKSUMS') AS S
    > +        LEFT JOIN pg_database D ON S.datid = D.oid
    > +    ORDER BY S.datid; -- return the launcher process first
    > +
    
    Not this patch's fault, but I strongly dislike that we do this in SQL. Every
    postgres database in the world has ~110kB of pg_stat_progress view definitions
    in it.  We should just do this in a C function.
    
    
    > diff --git a/src/backend/postmaster/datachecksumsworker.c b/src/backend/postmaster/datachecksumsworker.c
    > new file mode 100644
    > index 00000000000..57311760b2b
    > --- /dev/null
    > +++ b/src/backend/postmaster/datachecksumsworker.c
    > @@ -0,0 +1,1491 @@
    > +/*-------------------------------------------------------------------------
    > + *
    > + * datachecksumsworker.c
    > + *	  Background worker for enabling or disabling data checksums online
    > + *
    > + * When enabling data checksums on a database at initdb time or when shut down
    > + * with pg_checksums, no extra process is required as each page is checksummed,
    > + * and verified, when accessed.  When enabling checksums on an already running
    > + * cluster, this worker will ensure that all pages are checksummed before
    > + * verification of the checksums is turned on. In the case of disabling
    > + * checksums, the state transition is performed only in the control file, no
    > + * changes are performed on the data pages.
    > + *
    > + * Checksums can be either enabled or disabled cluster-wide, with on/off being
    > + * the end state for data_checksums.
    > + *
    > + * Enabling checksums
    > + * ------------------
    > + * When enabling checksums in an online cluster, data_checksums will be set to
    > + * "inprogress-on" which signals that write operations MUST compute and write
    > + * the checksum on the data page, but during reading the checksum SHALL NOT be
    > + * verified. This ensures that all objects created during checksumming will
    > + * have checksums set, but no reads will fail due to incorrect checksum.
    
    Maybe "... due to not yet set checksums."?  Incorrect checksums sounds like
    it's about checksums that are actively wrong, rather than just not set. Except
    for the corner case of a torn page in the process of an hint bit write, after
    having disabled checksums, there shouldn't be incorrect ones, right?
    
    
    > The
    > + * DataChecksumsWorker will compile a list of databases which exist at the
    > + * start of checksumming, and all of these which haven't been dropped during
    > + * the processing MUST have been processed successfully in order for checksums
    > + * to be enabled. Any new relation created during processing will see the
    > + * in-progress state and will automatically be checksummed.
    
    What about new databases created while checksums are being enabled? They could
    be copied before the worker has processed them. At least for the file_copy
    strategy, the copy will be verbatim and thus will not necessarily have
    checksums set.
    
    
    > + * Synchronization and Correctness
    > + * -------------------------------
    > + * The processes involved in enabling, or disabling, data checksums in an
    > + * online cluster must be properly synchronized with the normal backends
    > + * serving concurrent queries to ensure correctness. Correctness is defined
    > + * as the following:
    > + *
    > + *    - Backends SHALL NOT violate the data_checksums state they have agreed to
    > + *      by acknowledging the procsignalbarrier:  This means that all backends
    > + *      MUST calculate and write data checksums during all states except off;
    > + *      MUST validate checksums only in the 'on' state.
    > + *    - Data checksums SHALL NOT be considered enabled cluster-wide until all
    > + *      currently connected backends have state "on": This means that all
    > + *      backends must wait on the procsignalbarrier to be acknowledged by all
    > + *      before proceeding to validate data checksums.
    > + *
    > + * There are two levels of synchronization required for changing data_checksums
    
    Maybe s/levels/steps/?
    
    > + * in an online cluster: (i) changing state in the active backends ("on",
    > + * "off", "inprogress-on" and "inprogress-off"), and (ii) ensuring no
    > + * incompatible objects and processes are left in a database when workers end.
    > + * The former deals with cluster-wide agreement on data checksum state and the
    > + * latter with ensuring that any concurrent activity cannot break the data
    > + * checksum contract during processing.
    > + *
    > + * Synchronizing the state change is done with procsignal barriers, where the
    > + * WAL logging backend updating the global state in the controlfile will wait
    
    It's not entirely obvious what "the WAL logging backend" means.
    
    
    > + * for all other backends to absorb the barrier. Barrier absorption will happen
    > + * during interrupt processing, which means that connected backends will change
    > + * state at different times. To prevent data checksum state changes when
    > + * writing and verifying checksums, interrupts shall be held off before
    > + * interrogating state and resumed when the IO operation has been performed.
    > + *
    > + *   When Enabling Data Checksums
    > + *   ----------------------------
    
    Odd change in indentation here.
    
    
    
    > + *   A process which fails to observe data checksums being enabled can induce
    > + *   two types of errors: failing to write the checksum when modifying the page
    > + *   and failing to validate the data checksum on the page when reading it.
    > + *
    > + *   When processing starts all backends belong to one of the below sets, with
    > + *   one set being empty:
    > + *
    > + *   Bd: Backends in "off" state
    > + *   Bi: Backends in "inprogress-on" state
    > + *
    > + *   If processing is started in an online cluster then all backends are in Bd.
    > + *   If processing was halted by the cluster shutting down, the controlfile
    > + *   state "inprogress-on" will be observed on system startup and all backends
    > + *   will be placed in Bd.
    
    Why not in Bi? Just for simplicities sake? ISTM we already need to be sure
    that new backends start in Bi, as they might never observe the barrier...
    
    
    > Backends transition Bd -> Bi via a procsignalbarrier
    > + *   which is emitted by the DataChecksumsLauncher.  When all backends have
    > + *   acknowledged the barrier then Bd will be empty and the next phase can
    > + *   begin: calculating and writing data checksums with DataChecksumsWorkers.
    > + *   When the DataChecksumsWorker processes have finished writing checksums on
    > + *   all pages and enables data checksums cluster-wide via another
    
    s/enables/enabled/?
    
    > + *   procsignalbarrier, there are four sets of backends where Bd shall be an
    > + *   empty set:
    > + *
    > + *   Bg: Backend updating the global state and emitting the procsignalbarrier
    > + *   Bd: Backends in "off" state
    > + *   Be: Backends in "on" state
    > + *   Bi: Backends in "inprogress-on" state
    > + *
    > + *   Backends in Bi and Be will write checksums when modifying a page, but only
    > + *   backends in Be will verify the checksum during reading. The Bg backend is
    > + *   blocked waiting for all backends in Bi to process interrupts and move to
    > + *   Be. Any backend starting while Bg is waiting on the procsignalbarrier will
    > + *   observe the global state being "on" and will thus automatically belong to
    > + *   Be.  Checksums are enabled cluster-wide when Bi is an empty set. Bi and Be
    > + *   are compatible sets while still operating based on their local state as
    > + *   both write data checksums.
    > + *
    > + *   When Disabling Data Checksums
    > + *   -----------------------------
    > + *   A process which fails to observe that data checksums have been disabled
    > + *   can induce two types of errors: writing the checksum when modifying the
    > + *   page and validating a data checksum which is no longer correct due to
    > + *   modifications to the page.
    
    Hm. I wonder if we ought to zero out old checksums when loading a page into
    s_b with checksums disabled... But that's really independent of this patchset.
    
    
    > + * Potential optimizations
    > + * -----------------------
    > + * Below are some potential optimizations and improvements which were brought
    > + * up during reviews of this feature, but which weren't implemented in the
    > + * initial version. These are ideas listed without any validation on their
    > + * feasibility or potential payoff. More discussion on these can be found on
    > + * the -hackers threads linked to in the commit message of this feature.
    > + *
    > + *   * Launching datachecksumsworker for resuming operation from the startup
    > + *     process: Currently users have to restart processing manually after a
    > + *     restart since dynamic background worker cannot be started from the
    > + *     postmaster. Changing the startup process could make restarting the
    > + *     processing automatic on cluster restart.
    > + *   * Avoid dirtying the page when checksums already match: Iff the checksum
    > + *     on the page happens to already match we still dirty the page. It should
    > + *     be enough to only do the log_newpage_buffer() call in that case.
    > + *   * Invent a lightweight WAL record that doesn't contain the full-page
    > + *     image but just the block number: On replay, the redo routine would read
    > + *     the page from disk.
    
    The last sentence might be truncated?
    
    
    > + *   * Teach pg_checksums to avoid checksummed pages when pg_checksums is used
    > + *     to enable checksums on a cluster which is in inprogress-on state and
    > + *     may have checksummed pages (make pg_checksums be able to resume an
    > + *     online operation).
    > + *   * Restartability (not necessarily with page granularity).
    > + *
    > + *
    > + * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
    > + * Portions Copyright (c) 1994, Regents of the University of California
    > + *
    > + *
    > + * IDENTIFICATION
    > + *	  src/backend/postmaster/datachecksumsworker.c
    
    Hm. The whole set of interactions with checkpoints/restartpoints aren't explored
    here?
    
    
    > +/*
    > + * ProcessSingleRelationFork
    > + *		Enable data checksums in a single relation/fork.
    > + *
    > + * Returns true if successful, and false if *aborted*. On error, an actual
    > + * error is raised in the lower levels.
    > + */
    > +static bool
    > +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
    > +{
    > +	BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
    > +	char		activity[NAMEDATALEN * 2 + 128];
    > +	char	   *relns;
    > +
    > +	relns = get_namespace_name(RelationGetNamespace(reln));
    > +
    > +	if (!relns)
    > +		return false;
    > +
    > +	/* Report the current relation to pgstat_activity */
    > +	snprintf(activity, sizeof(activity) - 1, "processing: %s.%s (%s, %dblocks)",
    > +			 relns, RelationGetRelationName(reln), forkNames[forkNum], numblocks);
    > +	pgstat_report_activity(STATE_RUNNING, activity);
    > +
    > +	/*
    > +	 * As of now we only update the block counter for main forks in order to
    > +	 * not cause too frequent calls. TODO: investigate whether we should do it
    > +	 * more frequent?
    > +	 */
    > +	if (forkNum == MAIN_FORKNUM)
    > +		pgstat_progress_update_param(PROGRESS_DATACHECKSUMS_BLOCKS_TOTAL,
    > +									 numblocks);
    
    That doesn't make much sense to me. Presumably the reason to skip it for the
    other forks is that they're small-ish. But if so, there's no point in skipping
    the reporting either, as presumably there wouldn't be a lot of reporting?
    
    
    > +	/*
    > +	 * We are looping over the blocks which existed at the time of process
    > +	 * start, which is safe since new blocks are created with checksums set
    > +	 * already due to the state being "inprogress-on".
    > +	 */
    > +	for (BlockNumber blknum = 0; blknum < numblocks; blknum++)
    > +	{
    > +		Buffer		buf = ReadBufferExtended(reln, forkNum, blknum, RBM_NORMAL, strategy);
    > +
    > +		/* Need to get an exclusive lock before we can flag as dirty */
    > +		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    > +
    > +		/*
    > +		 * Mark the buffer as dirty and force a full page write.  We have to
    > +		 * re-write the page to WAL even if the checksum hasn't changed,
    > +		 * because if there is a replica it might have a slightly different
    > +		 * version of the page with an invalid checksum, caused by unlogged
    > +		 * changes (e.g. hintbits) on the master happening while checksums
    > +		 * were off. This can happen if there was a valid checksum on the page
    > +		 * at one point in the past, so only when checksums are first on, then
    > +		 * off, and then turned on again.  TODO: investigate if this could be
    > +		 * avoided if the checksum is calculated to be correct and wal_level
    > +		 * is set to "minimal",
    > +		 */
    > +		START_CRIT_SECTION();
    > +		MarkBufferDirty(buf);
    > +		log_newpage_buffer(buf, false);
    > +		END_CRIT_SECTION();
    
    Hm. It's pretty annoying to have to pass page_std = false here, that could
    increase the write volume noticeably. But there's not a great way to know
    what the right value would be :(
    
    
    
    > +/*
    > + * launcher_exit
    > + *
    > + * Internal routine for cleaning up state when the launcher process exits. We
    > + * need to clean up the abort flag to ensure that processing can be restarted
    > + * again after it was previously aborted.
    > + */
    > +static void
    > +launcher_exit(int code, Datum arg)
    > +{
    > +	if (launcher_running)
    > +	{
    > +		LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
    > +		launcher_running = false;
    > +		DataChecksumsWorkerShmem->launcher_running = false;
    > +		LWLockRelease(DataChecksumsWorkerLock);
    > +	}
    > +}
    
    Could we end up exiting this with the worker still running?
    
    
    > +/*
    > + * WaitForAllTransactionsToFinish
    > + *		Blocks awaiting all current transactions to finish
    > + *
    > + * Returns when all transactions which are active at the call of the function
    > + * have ended, or if the postmaster dies while waiting. If the postmaster dies
    > + * the abort flag will be set to indicate that the caller of this shouldn't
    > + * proceed.
    > + *
    > + * NB: this will return early, if aborted by SIGINT or if the target state
    > + * is changed while we're running.
    
    I think either here, or at its callsites, the patch needs to explain *why* we
    are waiting for all transactions to finish.  Presumably this is to ensure that
    other sessions haven't created relations that we can't see yet?
    
    It actually doesn't seem to wait for all transactions, ust for ones with an
    xid?
    
    > + */
    > +static void
    > +WaitForAllTransactionsToFinish(void)
    > +{
    > +	TransactionId waitforxid;
    > +
    > +	LWLockAcquire(XidGenLock, LW_SHARED);
    > +	waitforxid = XidFromFullTransactionId(TransamVariables->nextXid);
    > +	LWLockRelease(XidGenLock);
    > +
    > +	while (TransactionIdPrecedes(GetOldestActiveTransactionId(false, true), waitforxid))
    > +	{
    > +		char		activity[64];
    > +		int			rc;
    > +
    > +		/* Oldest running xid is older than us, so wait */
    > +		snprintf(activity,
    > +				 sizeof(activity),
    > +				 "Waiting for current transactions to finish (waiting for %u)",
    > +				 waitforxid);
    > +		pgstat_report_activity(STATE_RUNNING, activity);
    > +
    > +		/* Retry every 3 seconds */
    > +		ResetLatch(MyLatch);
    > +		rc = WaitLatch(MyLatch,
    > +					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
    > +					   3000,
    > +					   WAIT_EVENT_CHECKSUM_ENABLE_STARTCONDITION);
    > +
    > +		/*
    > +		 * If the postmaster died we won't be able to enable checksums
    > +		 * cluster-wide so abort and hope to continue when restarted.
    > +		 */
    > +		if (rc & WL_POSTMASTER_DEATH)
    > +			ereport(FATAL,
    > +					errcode(ERRCODE_ADMIN_SHUTDOWN),
    > +					errmsg("postmaster exited during data checksum processing"),
    > +					errhint("Restart the database and restart data checksum processing by calling pg_enable_data_checksums()."));
    > +
    > +		LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
    > +		if (DataChecksumsWorkerShmem->launch_operation != operation)
    > +			abort_requested = true;
    > +		LWLockRelease(DataChecksumsWorkerLock);
    > +		if (abort_requested)
    > +			break;
    
    I don't like this much - loops with a timeout are generally a really bad idea
    and we shouldn't add more instances. Presumably this also makes the tests
    slower...
    
    How about collecting the to-be-waited-for virtualxids and then wait for those?
    
    
    > +	if (operation == ENABLE_DATACHECKSUMS)
    > +	{
    > +		/*
    > +		 * If we are asked to enable checksums in a cluster which already has
    > +		 * checksums enabled, exit immediately as there is nothing more to do.
    > +		 * Hold interrupts to make sure state doesn't change during checking.
    > +		 */
    > +		HOLD_INTERRUPTS();
    > +		if (DataChecksumsNeedVerify())
    > +		{
    > +			RESUME_INTERRUPTS();
    > +			goto done;
    > +		}
    > +		RESUME_INTERRUPTS();
    
    I don't understand what this interrupt stuff achieves here?
    
    
    
    > diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
    > index 25d13a798d1..0faaac14b1b 100644
    > --- a/src/include/storage/checksum.h
    > +++ b/src/include/storage/checksum.h
    > @@ -15,6 +15,21 @@
    >
    >  #include "storage/block.h"
    >
    > +/*
    > + * Checksum version 0 is used for when data checksums are disabled (OFF).
    > + * PG_DATA_CHECKSUM_VERSION defines that data checksums are enabled in the
    > + * cluster and PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF}_VERSION defines that data
    > + * checksums are either currently being enabled or disabled.
    > + */
    > +typedef enum ChecksumType
    > +{
    > +	PG_DATA_CHECKSUM_OFF = 0,
    > +	PG_DATA_CHECKSUM_VERSION,
    > +	PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
    > +	PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION,
    > +	PG_DATA_CHECKSUM_ANY_VERSION
    > +} ChecksumType;
    
    Why is there "VERSION" in the name of these?  Feels like that's basically just
    vestigial at this point.
    
    
    >  /*
    >   * There also exist several built-in LWLock tranches.  As with the predefined
    > diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
    > index c6f5ebceefd..d90d35b1d6f 100644
    > --- a/src/include/storage/proc.h
    > +++ b/src/include/storage/proc.h
    > @@ -463,11 +463,11 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
    >   * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
    >   * run during normal operation.  Startup process and WAL receiver also consume
    >   * 2 slots, but WAL writer is launched only after startup has exited, so we
    > - * only need 6 slots.
    > + * only need 6 slots to cover these. The DataChecksums worker and launcher
    > + * can consume 2 slots when data checksums are enabled or disabled.
    >   */
    >  #define MAX_IO_WORKERS          32
    > -#define NUM_AUXILIARY_PROCS		(6 + MAX_IO_WORKERS)
    > -
    > +#define NUM_AUXILIARY_PROCS		(8 + MAX_IO_WORKERS)
    
    Aren't they bgworkers now?
    
    
    > diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
    > index afeeb1ca019..c54c61e2cd8 100644
    > --- a/src/include/storage/procsignal.h
    > +++ b/src/include/storage/procsignal.h
    > @@ -54,6 +54,11 @@ typedef enum
    >  typedef enum
    >  {
    >  	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
    > +
    > +	PROCSIGNAL_BARRIER_CHECKSUM_OFF,
    > +	PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
    > +	PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF,
    > +	PROCSIGNAL_BARRIER_CHECKSUM_ON,
    >  } ProcSignalBarrierType;
    
    I wonder if these really should be different barriers. What if we just made it
    one, and instead drove the transition on the current shmem content?
    
    
    
    Other stuff:
    - what protects against multiple backends enabling checksums at the same time?
    
      Afaict there isn't anything, and we just ignore the second request. Which
      seems ok-ish if it's the same request as before, but not great if it's a
      different one.
    
      Should also have tests for that.
    
    - I think this adds a bit too much of the logic to xlog.c, already an unwieldy
      file. A fair bit of all of this doesn't seem like it needs to be in there.
    
    - the code seems somewhat split brained about bgworkers and auxprocesses
    
    
    Greetings,
    
    Andres Freund
    
    
    
    
  75. Re: Changing the state of data checksums in a running cluster

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-01T16:57:20Z

    Hi Daniel,
    
    On Thu, Apr 30, 2026 at 8:20 AM Daniel Gustafsson <daniel@yesql.se> wrote:
    
    > > On 29 Apr 2026, at 15:42, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    > wrote:
    >
    > > The patchset looks good.
    >
    > Thanks for review, I've applied the patchset after some editorializing.
    >
    
    While further testing this feature, I realized that
    ProcessSingleRelationFork()
    unconditionally called log_newpage_buffer() for every page of every
    relation
    during pg_enable_data_checksums(). This included unlogged relations,
    which by definition never generate WAL for data changes and are reset to
    their
    init fork on any recovery.
    
    Guard the log_newpage_buffer() call with RelationNeedsWAL() so that
    unlogged relations still get their pages dirtied (ensuring the checksum
    is flushed to disk at the next checkpoint) but do not emit WAL.
    
    Attached a patch to address this and added a test for the same. My current
    test checks if standby has main fork, I could just checked WAL to verify
    this
    using pg_waldump. Any other test ideas are welcome.
    
    Thanks,
    Satya
    
  76. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2026-05-01T21:13:16Z

    > On 1 May 2026, at 18:57, SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
    
    > While further testing this feature, I realized that ProcessSingleRelationFork() 
    > unconditionally called log_newpage_buffer() for every page of every relation 
    > during pg_enable_data_checksums(). This included unlogged relations, 
    > which by definition never generate WAL for data changes and are reset to their
    > init fork on any recovery. 
    > 
    > Guard the log_newpage_buffer() call with RelationNeedsWAL() so that
    > unlogged relations still get their pages dirtied (ensuring the checksum
    > is flushed to disk at the next checkpoint) but do not emit WAL.
    > 
    > Attached a patch to address this and added a test for the same. My current
    > test checks if standby has main fork, I could just checked WAL to verify this
    > using pg_waldump. Any other test ideas are welcome.
    
    Thanks for the report, I agree that this is an oversight that should be fixed.
    Your patch looks good on first glance, I am travelling till Sunday evening so
    will take another look when back in the office and will apply it then. Thanks!
    
    --
    Daniel Gustafsson
    
    
    
    
    
  77. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2026-05-04T13:16:34Z

    Hi,
    
    Thanks for getting this feature pushed, and for resolving the failures
    reported since the feature freeze. I consider this to be an important
    improvement, not just for the feature itself, but also because of all
    the useful infrastructure it added.
    
    Attached is a refined version of the TAP tests already posted by Daniel
    some time ago [1]. Unfortunately, that .txt did not apply cleanly for
    some reason, so here's a better version.
    
    I found these tests quite useful when reasoning about how the patch
    behaves in concurrent environment (e.g. with multiple sessions
    triggering checksum enable/disable, or with a checkpoint, crashes, etc).
    
    At this point all the tests pass, but there are a couple cases with
    correct but slightly surprising behavior, worth discussing. Which is
    what this e-mail is going to be about.
    
    I'll explain what the TAP tests aim to do first, and then discuss the
    slightly surprising behavior.
    
    It's not meant for inclusion into PG19, at least not in this shape - I
    wrote those TAP tests while investigating some of the earlier failures
    and/or when wondering about behavior in various situations (sequence of
    concurrent steps, race conditions, ...). So it's more of an exhaustive,
    and the tests are somewhat redundant (N+1 is often just (N + some small
    tweak)).
    
    I can imagine distilling it into a tiny subset, and adding that. But
    that's up to discussion. But that's for later.
    
    
    Let me briefly explain what the various TAP tests aim to do. From the
    very beginning, my main concern regarding this patch was race conditions
    when updating the shared state about effective data_checksum_version.
    Because the state is effectively split into about three or four places:
    
    * LocalDataChecksumVersion (local cache)
    * XLogCtl->data_checksum_version (XLogCtl->info_lck)
    * ControlFile->data_checksum_version (ControlFileLock)
    * state in control file on disk
    
    These pieces are protected by different locks, the protocol for updating
    and/or reading the various flags is not trivial (and some of the fixed
    issues were due to ControlFile->data_checksum_version being updated from
    a place that shouldn't have).
    
    So the primary goal of the TAP tests was to check for race conditions by
    leveraging injection points to step through concurrent processes in a
    deterministic way. The first couple patches (0001-0004) add debug
    logging and injection points into a lot of places. And by "a lot" I mean
    ~80 new injection points, which is about the number of injection points
    we have in master now. Anyway, this allows stepping through concurrent
    checksum changes, and also checksum change vs. checkpointer.
    
    Then come the actual TAP tests:
    
    1) 0005-TAP-10-concurrent-checksum-changes.patch
    
    Two concurrent checksum changes. The first one gets paused at an
    injection point, then the second one gets initiated.
    
    2) 0006-TAP-11-concurrency-with-checkpoints.patch
    
    A checksum change + checkpoint. The change gets paused at an injection
    point, a synchronous checkpoint is performed.
    
    3) 0007-TAP-12-crashes-at-injection-points.patch
    
    Similar to 0006, but with a crash + recovery. A checksum change gets
    paused at an injection point, a synchronous checkpoint is performed. The
    changes gets wpken up and either completes, or pauses on a different
    injection point. A restart/crash happens.
    
    4) 0008-TAP-13-concurrency-with-checkpoint-REDO.patch
    
    Similar to 0007, but the checkpoint is not synchronous - happens in the
    background, so that the TAP can step through both sides and interleave
    them in an arbitrary way. This matters because the checksum change
    updates the different state pieces (XlogCtl/ControlFile), while the
    checkpointer reads them to record initial state for REDO etc.
    
    5) 0009-TAP-14-checkpoints-with-crashes.patch
    
    Similar to 0008, except that the steps are more fine grained, and
    focused on two particular cases with surprisingly different final state.
    
    
    AFAIK everything works as expected, except for two cases in the "TAP
    012" test. One for the "enabling" direction, one for the "disabling"
    direction. I'm going to discuss the "enabling" direction, I believe the
    other case is just a mirror with the same root cause.
    
    The TAP 012 tests checksum change with a concurrent checkpoint, followed
    by a crash, and tests the final state. It pauses the change at an
    injection point, does a checkpoint, proceeds to the next injection
    point, crashes and does recovery. The expectation is that the final
    state "flips" at some injection point, once it gets further enough, and
    stays there. But what actually happens is this:
    
    a) test_checksum_transition(
        'disabled', 'enable', undef,
        'datachecksums-enable-inprogress-checksums-end',
        'datachecksums-enable-checksums-start',
        'off');
    
    b) test_checksum_transition(
        'disabled', 'enable', undef,
        'datachecksums-enable-checksums-start',
        'datachecksums-enable-checksums-after-xlog',
        'on');
    
    c) test_checksum_transition(
        'disabled', 'enable', 'datachecksums-enable-checksums-start',
        'datachecksums-enable-checksums-after-xlogs',
        'datachecksums-enable-checksums-after-xlogctl',
        'off');
    
    This says that if the checkpoint happens after
    'datachecksums-enable-inprogress-checksums-end' or after
    'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
    enabling checksums fails).
    
    But if the checkpoint happens after
    'datachecksums-enable-checksums-start', we end up with "on" (after
    recovery).
    
    This is a bit surprising, because that injection point is before
    'datachecksums-enable-checksums-after-xlog'. So the enabling process
    gets further and further, but the final state flips off -> on -> off,
    contradicting the expectation that it changes once.
    
    I haven't quite wrapped my head around it yet, but my understanding is
    this is due to a race condition between the checksum launcher (writing
    XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
    (reading the shmem state and generating REDO).
    
    The launcher does this sequence of steps:
    
    1) write XLOG2_CHECKSUMS with new state
    2) update XLogCtl->data_checksum_version
    3) update ControlFile->data_checksum_version
    4) UpdateControlFile()
    5) emits barrier
    
    while the checkpointer (CreateCheckPoint) does this:
    
    A) read XLogCtl->data_checksum_version (while holding insert locks)
    B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
    C) UpdateControlFile()
    
    The outcome depends on how exactly these two sequences interleave. For
    example, this can happen:
    
    1) write XLOG2_CHECKSUMS with new state
      A) read XLogCtl->data_checksum_version (while holding insert locks)
      B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
      C) UpdateControlFile()
    2) update XLogCtl->data_checksum_version
    3) update ControlFile->data_checksum_version
    4) UpdateControlFile()
    5) emits barrier
    
    Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
    so redo won't see it), but the checkpoint will still get the old
    checksum state from XLogCtl. And so the outcome is "off", per case (c).
    
    But it can also happen what case (b) does:
    
      A) read XLogCtl->data_checksum_version (while holding insert locks)
      B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
      C) UpdateControlFile()
    1) write XLOG2_CHECKSUMS with new state
    2) update XLogCtl->data_checksum_version
    3) update ControlFile->data_checksum_version
    4) UpdateControlFile()
    5) emits barrier
    
    In which case the REDO will have the old state, but the recovery will
    read the XLOG2_CHECKSUMS, and so end up with "on".
    
    This is the root cause of the surprising behavior in TAP 012, I think.
    
    I attempted to trigger these race conditions in TAP 013, but without
    much success. In the end I realized it probably needs more control,
    waiting for the other process to hit the next injection point before
    unpausing the current one. TAP 014 does that, and it shows that with the
    right interleaving of steps the (c) case can end up with both "on" and
    "off" final state.
    
    As I said, I don't claim I fully understand this yet. But I wouldn't
    call this "bug" - AFAICS it won't produce an incorrect final state (I
    haven't seen any such cases).
    
    Still, I wonder if there's a potential issue I failed to notice.
    
    
    The other question I had when looking at this (concurrency with
    checkpoints) is what we get by doing
    
         MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    
    whenever updating the state in SetDataChecksums... functions. Because
    the only thing that guarantees is the updates happen on one side of the
    checkpoint record. What does that give us, actually?
    
    It does not seem to prevent this surprising behavior, and it does not
    say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.
    
    
    regards
    
    [1]
    https://www.postgresql.org/message-id/9197F930-DDEB-4CAC-82A2-16FEC715CCE8%40yesql.se
    
    -- 
    Tomas Vondra
    
  78. Re: Changing the state of data checksums in a running cluster

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T07:43:03Z

    Hi Hackers, Daniel,
    
    Further testing this feature, I noticed that the cost_delay and cost_limit
    arguments
     to pg_enable_data_checksums() in practice have no effect.
    
    It appears we have two independent issues in DataChecksumsWorkerMain():
    
    (1) The worker writes the user-supplied values to VacuumCostDelay and
    VacuumCostLimit (the GUC-bound globals in globals.c). However,
    vacuum_delay_point() reads vacuum_cost_delay / vacuum_cost_limit
    declared in vacuum.c. The two pairs are kept in sync only by
    VacuumUpdateCosts(), which the worker never calls. Therefore, the napping
    formula always sees the defaults (vacuum_cost_delay = 0) and never
    sleeps.
    
    (2) The worker also resets VacuumCostPageHit/Miss/Dirty to 0 at startup.
    With all per-page weights at zero, VacuumCostBalance never reaches
    vacuum_cost_limit, which would defeat the throttling on its own even
    if (1) were fixed.
    
    Repro:
    
    Create a database and load data (say 3 GB)
    
    SELECT pg_disable_data_checksums();
    SELECT pg_enable_data_checksums(100, 1);   -- 100 ms/page, balance limit 1
    
    Without the fix, this completes in ~10 seconds and pg_stat_activity
    never shows wait_event = VacuumDelay for the worker. With even moderate
    parameters (e.g. (50, 200)) the worker is continuously in VacuumDelay
    after the patch, and total runtime stretches as one would expect.
    Also manually tested with cost_delay 0 and higher cost limits.
    
    Attached a patch to fix this.
    
    Thanks,
    Satya
    
  79. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2026-05-05T13:46:03Z

    > While further testing this feature, I realized that ProcessSingleRelationFork() 
    > unconditionally called log_newpage_buffer() for every page of every relation 
    > during pg_enable_data_checksums(). This included unlogged relations, 
    > which by definition never generate WAL for data changes and are reset to their
    > init fork on any recovery. 
    
    
    I've tested your patch, and also expanded the test you wrote a little with a
    persistence change.
    
    > Further testing this feature, I noticed that the cost_delay and cost_limit arguments
    >  to pg_enable_data_checksums() in practice have no effect.
    
    Ugh, the API for updating the costs changed between when this code was written
    and tested, and when it was committed (and since I was the one committing the
    new API I really should've caught that). Thanks for the report and fix!
    
    While hacking on your patches I realized that the regexes for finding page
    verification failures in the logs were anchoring at the wrong point, the
    attached 0003 fixes that.
    
    Attached are editorialized versions of the patches, as well as my testfix, that
    I'm planning to go ahead with.
    
    --
    Daniel Gustafsson
    
    
  80. Re: Changing the state of data checksums in a running cluster

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T14:24:07Z

    Hi,
    
    On Tue, May 5, 2026 at 6:46 AM Daniel Gustafsson <daniel@yesql.se> wrote:
    
    > > While further testing this feature, I realized that
    > ProcessSingleRelationFork()
    > > unconditionally called log_newpage_buffer() for every page of every
    > relation
    > > during pg_enable_data_checksums(). This included unlogged relations,
    > > which by definition never generate WAL for data changes and are reset to
    > their
    > > init fork on any recovery.
    >
    >
    > I've tested your patch, and also expanded the test you wrote a little with
    > a
    > persistence change.
    >
    > > Further testing this feature, I noticed that the cost_delay and
    > cost_limit arguments
    > >  to pg_enable_data_checksums() in practice have no effect.
    >
    > Ugh, the API for updating the costs changed between when this code was
    > written
    > and tested, and when it was committed (and since I was the one committing
    > the
    > new API I really should've caught that). Thanks for the report and fix!
    >
    
    Any thoughts on adding an injection point test to verify the values are
    configured correctly?
    This can be a different patch, perhaps included in Tomas' tests?
    
    
    >
    > While hacking on your patches I realized that the regexes for finding page
    > verification failures in the logs were anchoring at the wrong point, the
    > attached 0003 fixes that.
    >
    > Attached are editorialized versions of the patches, as well as my testfix,
    > that
    > I'm planning to go ahead with.
    >
    
    LGTM. Thanks!
    
    Thanks,
    Satya
    
  81. Re: Changing the state of data checksums in a running cluster

    Ayush Tiwari <ayushtiwari.slg01@gmail.com> — 2026-05-05T15:21:22Z

    Hi,
    
    On Tue, 5 May 2026 at 19:16, Daniel Gustafsson <daniel@yesql.se> wrote:
    
    > > While further testing this feature, I realized that
    > ProcessSingleRelationFork()
    > > unconditionally called log_newpage_buffer() for every page of every
    > relation
    > > during pg_enable_data_checksums(). This included unlogged relations,
    > > which by definition never generate WAL for data changes and are reset to
    > their
    > > init fork on any recovery.
    >
    >
    > I've tested your patch, and also expanded the test you wrote a little with
    > a
    > persistence change.
    >
    >
    I tested the patches, it mostly looks good.
    
    I've a small concern in 0001.  The new guard uses only
    RelationNeedsWAL(reln),
    but ProcessSingleRelationByOid() iterates all forks.  For unlogged
    relations,
    the init fork is special, there are several existing call sites that
    preserve
    WAL for INIT_FORKNUM, for example using
    
      RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
    
    and catalog/storage.c notes that unlogged init forks need WAL and sync.
    
    So I think the condition in ProcessSingleRelationFork() should preserve the
    init-fork case, e.g.
    
      if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
          log_newpage_buffer(buf, false);
    
    It would also be good to extend the new test so it exercises a non-empty
    init
    fork, perhaps with an unlogged table that has an index, and then verifies
    the
    standby/recovery behavior.
    
    0002 and 0003 look good to me.
    
    Regards,
    Ayush
    
  82. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2026-05-05T19:08:30Z

    > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
    
    > I've a small concern in 0001.  The new guard uses only RelationNeedsWAL(reln),
    > but ProcessSingleRelationByOid() iterates all forks.  For unlogged relations,
    > the init fork is special, there are several existing call sites that preserve
    > WAL for INIT_FORKNUM, for example using
    > 
    >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
    > 
    > and catalog/storage.c notes that unlogged init forks need WAL and sync.
    > 
    > So I think the condition in ProcessSingleRelationFork() should preserve the
    > init-fork case, e.g.
    > 
    >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
    >       log_newpage_buffer(buf, false);
    
    Which failure scenario are you thinking about here?  When dealing with the
    catalog relation I can see the need but here we are reading, and writing, data
    pages.  In which case would we need to issue an FPI for an unlogged relation
    init fork? I might be missing something obvious here.
    
    > 0002 and 0003 look good to me.
    
    Thanks for looking!
    
    --
    Daniel Gustafsson
    
    
    
    
    
  83. Re: Changing the state of data checksums in a running cluster

    Ayush Tiwari <ayushtiwari.slg01@gmail.com> — 2026-05-05T19:18:53Z

    Hi,
    
    On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:
    
    > > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    > wrote:
    >
    > > I've a small concern in 0001.  The new guard uses only
    > RelationNeedsWAL(reln),
    > > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
    > relations,
    > > the init fork is special, there are several existing call sites that
    > preserve
    > > WAL for INIT_FORKNUM, for example using
    > >
    > >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
    > >
    > > and catalog/storage.c notes that unlogged init forks need WAL and sync.
    > >
    > > So I think the condition in ProcessSingleRelationFork() should preserve
    > the
    > > init-fork case, e.g.
    > >
    > >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
    > >       log_newpage_buffer(buf, false);
    >
    > Which failure scenario are you thinking about here?  When dealing with the
    > catalog relation I can see the need but here we are reading, and writing,
    > data
    > pages.  In which case would we need to issue an FPI for an unlogged
    > relation
    > init fork? I might be missing something obvious here.
    >
    
    The case I was thinking about is not the unlogged relation contents
    themselves, but the init fork used as the reset template.  Some unlogged
    indexes can have initialized pages in the init fork, and recovery later
    copies
    that fork to the main fork when resetting unlogged relations.
    
    So my concern was that, during online checksum enable, we might update the
    checksum state of an init-fork page on the primary but not WAL-log an FPI
    for
    it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
    after
    a crash, could still have the old version of that init fork.  If that fork
    is
    later copied to the main fork after checksums are enabled, it might lead to
    checksum verification failures?
    
    Maybe there is another guarantee that makes this impossible, but I did not
    see
    it from the patch/test.  That is why I wondered whether the condition should
    preserve the existing special treatment for INIT_FORKNUM.
    
    Regards,
    Ayush
    
  84. Re: Changing the state of data checksums in a running cluster

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T19:19:00Z

    Hi
    
    On Tue, May 5, 2026 at 12:08 PM Daniel Gustafsson <daniel@yesql.se> wrote:
    
    > > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    > wrote:
    >
    > > I've a small concern in 0001.  The new guard uses only
    > RelationNeedsWAL(reln),
    > > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
    > relations,
    > > the init fork is special, there are several existing call sites that
    > preserve
    > > WAL for INIT_FORKNUM, for example using
    > >
    > >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
    > >
    > > and catalog/storage.c notes that unlogged init forks need WAL and sync.
    > >
    > > So I think the condition in ProcessSingleRelationFork() should preserve
    > the
    > > init-fork case, e.g.
    > >
    > >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
    > >       log_newpage_buffer(buf, false);
    >
    > Which failure scenario are you thinking about here?  When dealing with the
    > catalog relation I can see the need but here we are reading, and writing,
    > data
    > pages.  In which case would we need to issue an FPI for an unlogged
    > relation
    > init fork? I might be missing something obvious here.
    
    
    Good catch,IIUC init page has valid checksum and when we set checksum on
    primary I think we need to pass it to standby. Otherwise, after failover we
    may see invalid checksum for unlogged relations. I haven’t validated it,
    will test it and update the patch.
    
  85. Re: Changing the state of data checksums in a running cluster

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T19:45:09Z

    Hi,
    
    On Tue, May 5, 2026 at 12:19 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    wrote:
    
    > Hi,
    >
    > On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:
    >
    >> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    >> wrote:
    >>
    >> > I've a small concern in 0001.  The new guard uses only
    >> RelationNeedsWAL(reln),
    >> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
    >> relations,
    >> > the init fork is special, there are several existing call sites that
    >> preserve
    >> > WAL for INIT_FORKNUM, for example using
    >> >
    >> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
    >> >
    >> > and catalog/storage.c notes that unlogged init forks need WAL and sync.
    >> >
    >> > So I think the condition in ProcessSingleRelationFork() should preserve
    >> the
    >> > init-fork case, e.g.
    >> >
    >> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
    >> >       log_newpage_buffer(buf, false);
    >>
    >> Which failure scenario are you thinking about here?  When dealing with the
    >> catalog relation I can see the need but here we are reading, and writing,
    >> data
    >> pages.  In which case would we need to issue an FPI for an unlogged
    >> relation
    >> init fork? I might be missing something obvious here.
    >>
    >
    > The case I was thinking about is not the unlogged relation contents
    > themselves, but the init fork used as the reset template.  Some unlogged
    > indexes can have initialized pages in the init fork, and recovery later
    > copies
    > that fork to the main fork when resetting unlogged relations.
    >
    > So my concern was that, during online checksum enable, we might update the
    > checksum state of an init-fork page on the primary but not WAL-log an FPI
    > for
    > it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
    > after
    > a crash, could still have the old version of that init fork.  If that fork
    > is
    > later copied to the main fork after checksums are enabled, it might lead to
    > checksum verification failures?
    >
    > Maybe there is another guarantee that makes this impossible, but I did not
    > see
    > it from the patch/test.  That is why I wondered whether the condition
    > should
    > preserve the existing special treatment for INIT_FORKNUM.
    >
    
    It is a bug in the code, I added a test in the v2 patch to test this
    scenario and the test
    failed earlier.
    
    Thanks,
    Satya
    
    >
    
  86. Re: Changing the state of data checksums in a running cluster

    SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T20:05:51Z

    Hi,
    
    On Tue, May 5, 2026 at 12:45 PM SATYANARAYANA NARLAPURAM <
    satyanarlapuram@gmail.com> wrote:
    
    > Hi,
    >
    > On Tue, May 5, 2026 at 12:19 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    > wrote:
    >
    >> Hi,
    >>
    >> On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:
    >>
    >>> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    >>> wrote:
    >>>
    >>> > I've a small concern in 0001.  The new guard uses only
    >>> RelationNeedsWAL(reln),
    >>> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
    >>> relations,
    >>> > the init fork is special, there are several existing call sites that
    >>> preserve
    >>> > WAL for INIT_FORKNUM, for example using
    >>> >
    >>> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
    >>> >
    >>> > and catalog/storage.c notes that unlogged init forks need WAL and sync.
    >>> >
    >>> > So I think the condition in ProcessSingleRelationFork() should
    >>> preserve the
    >>> > init-fork case, e.g.
    >>> >
    >>> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
    >>> >       log_newpage_buffer(buf, false);
    >>>
    >>> Which failure scenario are you thinking about here?  When dealing with
    >>> the
    >>> catalog relation I can see the need but here we are reading, and
    >>> writing, data
    >>> pages.  In which case would we need to issue an FPI for an unlogged
    >>> relation
    >>> init fork? I might be missing something obvious here.
    >>>
    >>
    >> The case I was thinking about is not the unlogged relation contents
    >> themselves, but the init fork used as the reset template.  Some unlogged
    >> indexes can have initialized pages in the init fork, and recovery later
    >> copies
    >> that fork to the main fork when resetting unlogged relations.
    >>
    >> So my concern was that, during online checksum enable, we might update the
    >> checksum state of an init-fork page on the primary but not WAL-log an FPI
    >> for
    >> it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
    >> after
    >> a crash, could still have the old version of that init fork.  If that
    >> fork is
    >> later copied to the main fork after checksums are enabled, it might lead
    >> to
    >> checksum verification failures?
    >>
    >> Maybe there is another guarantee that makes this impossible, but I did
    >> not see
    >> it from the patch/test.  That is why I wondered whether the condition
    >> should
    >> preserve the existing special treatment for INIT_FORKNUM.
    >>
    >
    > It is a bug in the code, I added a test in the v2 patch to test this
    > scenario and the test
    > failed earlier.
    >
    
    Please find the latest v3. Earlier I took dependency on amcheck and removed
    it in v3
    and instead added queries that forces necessary checks.
    
     Thanks,
    Satya
    
  87. Re: Changing the state of data checksums in a running cluster

    Ayush Tiwari <ayushtiwari.slg01@gmail.com> — 2026-05-05T21:01:59Z

    Hi,
    
    
    On Wed, 6 May 2026 at 01:36, SATYANARAYANA NARLAPURAM <
    satyanarlapuram@gmail.com> wrote:
    
    > Hi,
    >
    > On Tue, May 5, 2026 at 12:45 PM SATYANARAYANA NARLAPURAM <
    > satyanarlapuram@gmail.com> wrote:
    >
    >> Hi,
    >>
    >> On Tue, May 5, 2026 at 12:19 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    >> wrote:
    >>
    >>> Hi,
    >>>
    >>> On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:
    >>>
    >>>> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
    >>>> wrote:
    >>>>
    >>>> > I've a small concern in 0001.  The new guard uses only
    >>>> RelationNeedsWAL(reln),
    >>>> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
    >>>> relations,
    >>>> > the init fork is special, there are several existing call sites that
    >>>> preserve
    >>>> > WAL for INIT_FORKNUM, for example using
    >>>> >
    >>>> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
    >>>> >
    >>>> > and catalog/storage.c notes that unlogged init forks need WAL and
    >>>> sync.
    >>>> >
    >>>> > So I think the condition in ProcessSingleRelationFork() should
    >>>> preserve the
    >>>> > init-fork case, e.g.
    >>>> >
    >>>> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
    >>>> >       log_newpage_buffer(buf, false);
    >>>>
    >>>> Which failure scenario are you thinking about here?  When dealing with
    >>>> the
    >>>> catalog relation I can see the need but here we are reading, and
    >>>> writing, data
    >>>> pages.  In which case would we need to issue an FPI for an unlogged
    >>>> relation
    >>>> init fork? I might be missing something obvious here.
    >>>>
    >>>
    >>> The case I was thinking about is not the unlogged relation contents
    >>> themselves, but the init fork used as the reset template.  Some unlogged
    >>> indexes can have initialized pages in the init fork, and recovery later
    >>> copies
    >>> that fork to the main fork when resetting unlogged relations.
    >>>
    >>> So my concern was that, during online checksum enable, we might update
    >>> the
    >>> checksum state of an init-fork page on the primary but not WAL-log an
    >>> FPI for
    >>> it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
    >>> after
    >>> a crash, could still have the old version of that init fork.  If that
    >>> fork is
    >>> later copied to the main fork after checksums are enabled, it might lead
    >>> to
    >>> checksum verification failures?
    >>>
    >>> Maybe there is another guarantee that makes this impossible, but I did
    >>> not see
    >>> it from the patch/test.  That is why I wondered whether the condition
    >>> should
    >>> preserve the existing special treatment for INIT_FORKNUM.
    >>>
    >>
    >> It is a bug in the code, I added a test in the v2 patch to test this
    >> scenario and the test
    >> failed earlier.
    >>
    >
    > Please find the latest v3. Earlier I took dependency on amcheck and
    > removed it in v3
    > and instead added queries that forces necessary checks.
    >
    
    Thanks, the v3 patch addresses my concern.
    
    I also checked the new test by
    temporarily removing the INIT_FORKNUM exception, and it failed on the
    promoted
    standby while reading the unlogged relation, so the test does cover the init
    fork failure scenario.
    
    The code change looks right to me now.
    
    Regards,
    Ayush
    
  88. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2026-05-06T15:22:47Z

    > On 5 May 2026, at 23:01, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
    
    > Thanks, the v3 patch addresses my concern.
    
    Thanks both of you for the report, patch and testing.  Now I see what you meant
    yesterday and I agree with the conclusion.  I've pushed 0001-0003 today with minor
    tweaks.
    
    --
    Daniel Gustafsson
    
    
    
    
    
  89. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2026-05-26T18:12:35Z

    Hi,
    
    On 5/4/26 15:16, Tomas Vondra wrote:
    > Hi,
    > 
    > ... snip ...
    >
    > Let me briefly explain what the various TAP tests aim to do. From the
    > very beginning, my main concern regarding this patch was race conditions
    > when updating the shared state about effective data_checksum_version.
    > Because the state is effectively split into about three or four places:
    > 
    > * LocalDataChecksumVersion (local cache)
    > * XLogCtl->data_checksum_version (XLogCtl->info_lck)
    > * ControlFile->data_checksum_version (ControlFileLock)
    > * state in control file on disk
    > 
    > These pieces are protected by different locks, the protocol for updating
    > and/or reading the various flags is not trivial (and some of the fixed
    > issues were due to ControlFile->data_checksum_version being updated from
    > a place that shouldn't have).
    > 
    
    I still have a funny feeling about the "global checksum version" stored
    in multiple places with separate locks, and the protocol when updating
    them (e.g. which process does that, when, etc.). I haven't found any
    fundamental correctness issue, though.
    
    I kept looking for issues, and I realized StartupXLOG may not do the
    right thing on the standby after promotion.
    
      if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON)
      {
          XLogChecksums(PG_DATA_CHECKSUM_OFF);
    
          SpinLockAcquire(&XLogCtl->info_lck);
          XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_OFF;
          SetLocalDataChecksumState(XLogCtl->data_checksum_version);
          SpinLockRelease(&XLogCtl->info_lck);
    
          ereport(...);
      }
    
    Consider this sequence of steps:
    
    1) primary/standby cluster has checksums OFF
    2) primary starts enabling checksums
    3) primary moves to inprogress-on
    4) standby receives that and moves to inprogress-on too
    5) primary crashes / shuts down / ...
    6) standby gets promoted, and does the StartupXLOG thing
    7) standby moves from inprogress-on back to off
    
    However, as there's no EmitProcSignalBarrier(), existing backends on the
    hot standby won't be notified about the "off" change. And so there will
    be a somewhat inconsistent checksum state, with new backends knowing
    it's "off", and old backends thinking it's still inprogress-on.
    
    It's not difficult to reproduce this. I guess the existing TAP tests may
    not catch this because they always open a new connection when checking
    the state, and so see the new (and correct) version.
    
    I believe a similar issue happens for the inprogress-off case a couple
    lines later, but I haven't tried reproducing that one.
    
    Ultimately, neither of these cases should cause checksum failures,
    because the "correct" new state is "off", and the old backends are not
    verifying checksums either.
    
    I suppose this means we should not be updating the checksum state
    without emitting the barrier? I think all other places do that.
    
    >
    > ... snip ...
    > 
    > The TAP 012 tests checksum change with a concurrent checkpoint, followed
    > by a crash, and tests the final state. It pauses the change at an
    > injection point, does a checkpoint, proceeds to the next injection
    > point, crashes and does recovery. The expectation is that the final
    > state "flips" at some injection point, once it gets further enough, and
    > stays there. But what actually happens is this:
    > 
    > a) test_checksum_transition(
    >     'disabled', 'enable', undef,
    >     'datachecksums-enable-inprogress-checksums-end',
    >     'datachecksums-enable-checksums-start',
    >     'off');
    > 
    > b) test_checksum_transition(
    >     'disabled', 'enable', undef,
    >     'datachecksums-enable-checksums-start',
    >     'datachecksums-enable-checksums-after-xlog',
    >     'on');
    > 
    > c) test_checksum_transition(
    >     'disabled', 'enable', 'datachecksums-enable-checksums-start',
    >     'datachecksums-enable-checksums-after-xlogs',
    >     'datachecksums-enable-checksums-after-xlogctl',
    >     'off');
    > 
    > This says that if the checkpoint happens after
    > 'datachecksums-enable-inprogress-checksums-end' or after
    > 'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
    > enabling checksums fails).
    > 
    > But if the checkpoint happens after
    > 'datachecksums-enable-checksums-start', we end up with "on" (after
    > recovery).
    > 
    > This is a bit surprising, because that injection point is before
    > 'datachecksums-enable-checksums-after-xlog'. So the enabling process
    > gets further and further, but the final state flips off -> on -> off,
    > contradicting the expectation that it changes once.
    > 
    > I haven't quite wrapped my head around it yet, but my understanding is
    > this is due to a race condition between the checksum launcher (writing
    > XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
    > (reading the shmem state and generating REDO).
    > 
    > The launcher does this sequence of steps:
    > 
    > 1) write XLOG2_CHECKSUMS with new state
    > 2) update XLogCtl->data_checksum_version
    > 3) update ControlFile->data_checksum_version
    > 4) UpdateControlFile()
    > 5) emits barrier
    > 
    > while the checkpointer (CreateCheckPoint) does this:
    > 
    > A) read XLogCtl->data_checksum_version (while holding insert locks)
    > B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
    > C) UpdateControlFile()
    > 
    > The outcome depends on how exactly these two sequences interleave. For
    > example, this can happen:
    > 
    > 1) write XLOG2_CHECKSUMS with new state
    >   A) read XLogCtl->data_checksum_version (while holding insert locks)
    >   B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
    >   C) UpdateControlFile()
    > 2) update XLogCtl->data_checksum_version
    > 3) update ControlFile->data_checksum_version
    > 4) UpdateControlFile()
    > 5) emits barrier
    > 
    > Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
    > so redo won't see it), but the checkpoint will still get the old
    > checksum state from XLogCtl. And so the outcome is "off", per case (c).
    > 
    > But it can also happen what case (b) does:
    > 
    >   A) read XLogCtl->data_checksum_version (while holding insert locks)
    >   B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
    >   C) UpdateControlFile()
    > 1) write XLOG2_CHECKSUMS with new state
    > 2) update XLogCtl->data_checksum_version
    > 3) update ControlFile->data_checksum_version
    > 4) UpdateControlFile()
    > 5) emits barrier
    > 
    > In which case the REDO will have the old state, but the recovery will
    > read the XLOG2_CHECKSUMS, and so end up with "on".
    > 
    > This is the root cause of the surprising behavior in TAP 012, I think.
    > 
    > I attempted to trigger these race conditions in TAP 013, but without
    > much success. In the end I realized it probably needs more control,
    > waiting for the other process to hit the next injection point before
    > unpausing the current one. TAP 014 does that, and it shows that with the
    > right interleaving of steps the (c) case can end up with both "on" and
    > "off" final state.
    > 
    > As I said, I don't claim I fully understand this yet. But I wouldn't
    > call this "bug" - AFAICS it won't produce an incorrect final state (I
    > haven't seen any such cases).
    > 
    > Still, I wonder if there's a potential issue I failed to notice.
    > 
    
    I kept thinking about this non-determinism, and I still think I'm right
    about the root cause. The checkpointer and datachecksum worker may
    interleave in different ways, affecting the final checksum state. The
    existing interlock (or lack of of it) is not sufficient.
    
    To verify this hypothesis, I've done a simple thing - I've introduced a
    new LWLock, protecting the "critical part" so that the checkpoint_redo
    can't happen in between XLogChecksums() and XLogCtl update. Patch
    attached, but I don't propose this for commit, I'm not sure if it's
    right (or safe), it's merely a PoC to confirm the issue. But it resolves
    the weird cases, simply by prohibiting some of the step sequences (so
    some of the tests needed to be commented out, as they'd block).
    
    I'm still not sure if it really is an issue or just an annoyance,
    because I've not been able to find a case where it'd lead to checksum
    failures (or obviously incorrect final state after recovery).
    
    > 
    > The other question I had when looking at this (concurrency with
    > checkpoints) is what we get by doing
    > 
    >      MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    > 
    > whenever updating the state in SetDataChecksums... functions. Because
    > the only thing that guarantees is the updates happen on one side of the
    > checkpoint record. What does that give us, actually?
    > 
    > It does not seem to prevent this surprising behavior, and it does not
    > say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.
    > 
    
    I still don't understand why this needs DELAY_CHKPT_START ...
    
    
    I also noticed a couple minor comment issues, per attached patch (this
    may need pgindent).
    
    
    regards
    
    -- 
    Tomas Vondra
    
  90. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2026-05-28T11:28:49Z

    > On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:
    
    > I suppose this means we should not be updating the checksum state
    > without emitting the barrier? I think all other places do that.
    
    Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
    to maintain cluster consistency.  I ended up writing a test for this very case
    as well.
    
    > I'm still not sure if it really is an issue or just an annoyance,
    > because I've not been able to find a case where it'd lead to checksum
    > failures (or obviously incorrect final state after recovery).
    
    I've tried to get it to reach an incorrect end state but failed, but I do agree
    that maybe we need an improved locking protocol around state updates.  Need to
    spend some more time thinking about this.
    
    > I still don't understand why this needs DELAY_CHKPT_START ...
    
    Having stared at this for some time, and going over old threads, I think this
    is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
    erring on the safe side by leaving as is and looking at removing in 20.  What
    do you think?
    
    > I also noticed a couple minor comment issues, per attached patch (this
    > may need pgindent).
    
    I ended up splitting this into two, one for the comment fixes and one for the
    data type change.
    
    I propose applying the three patches below to v19 to fix the promotion issue
    before we wrap beta1.
    
    --
    Daniel Gustafsson
    
    
  91. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2026-05-28T11:51:14Z

    On 5/28/26 13:28, Daniel Gustafsson wrote:
    >> On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    >> I suppose this means we should not be updating the checksum state
    >> without emitting the barrier? I think all other places do that.
    > 
    > Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
    > to maintain cluster consistency.  I ended up writing a test for this very case
    > as well.
    > 
    
    Good.
    
    >> I'm still not sure if it really is an issue or just an annoyance,
    >> because I've not been able to find a case where it'd lead to checksum
    >> failures (or obviously incorrect final state after recovery).
    > 
    > I've tried to get it to reach an incorrect end state but failed, but I do agree
    > that maybe we need an improved locking protocol around state updates.  Need to
    > spend some more time thinking about this.
    > 
    
    OK
    
    >> I still don't understand why this needs DELAY_CHKPT_START ...
    > 
    > Having stared at this for some time, and going over old threads, I think this
    > is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
    > erring on the safe side by leaving as is and looking at removing in 20.  What
    > do you think?
    > 
    
    I'd probably try to fix this for 19, otherwise it may be confusing
    people looking at the code in the future. We're still months from 19
    getting released. Ofc, maybe I'm underestimating the risk.
    
    >> I also noticed a couple minor comment issues, per attached patch (this
    >> may need pgindent).
    > 
    > I ended up splitting this into two, one for the comment fixes and one for the
    > data type change.
    > 
    > I propose applying the three patches below to v19 to fix the promotion issue
    > before we wrap beta1.
    > 
    
    WFM
    
    > --
    > Daniel Gustafsson
    > 
    
    -- 
    Tomas Vondra
    
    
    
    
    
  92. Re: Changing the state of data checksums in a running cluster

    Daniel Gustafsson <daniel@yesql.se> — 2026-05-29T20:08:11Z

    > On 28 May 2026, at 13:51, Tomas Vondra <tomas@vondra.me> wrote:
    > 
    > On 5/28/26 13:28, Daniel Gustafsson wrote:
    >>> On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:
    >> 
    >>> I suppose this means we should not be updating the checksum state
    >>> without emitting the barrier? I think all other places do that.
    >> 
    >> Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
    >> to maintain cluster consistency.  I ended up writing a test for this very case
    >> as well.
    > 
    > Good.
    
    I've pushed this now, along with your other findings, ahead of the beta1
    deadline, buildfarm seems happy so far.
    
    >>> I still don't understand why this needs DELAY_CHKPT_START ...
    >> 
    >> Having stared at this for some time, and going over old threads, I think this
    >> is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
    >> erring on the safe side by leaving as is and looking at removing in 20.  What
    >> do you think?
    >> 
    > 
    > I'd probably try to fix this for 19, otherwise it may be confusing
    > people looking at the code in the future. We're still months from 19
    > getting released. Ofc, maybe I'm underestimating the risk.
    
    You're probably right.  Once beta1 is out I'll work on getting this fixed.
    
    --
    Daniel Gustafsson
    
    
    
    
    
  93. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2026-05-29T20:27:20Z

    
    On 5/29/26 22:08, Daniel Gustafsson wrote:
    >> On 28 May 2026, at 13:51, Tomas Vondra <tomas@vondra.me> wrote:
    >>
    >> On 5/28/26 13:28, Daniel Gustafsson wrote:
    >>>> On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:
    >>>
    >>>> I suppose this means we should not be updating the checksum state
    >>>> without emitting the barrier? I think all other places do that.
    >>>
    >>> Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
    >>> to maintain cluster consistency.  I ended up writing a test for this very case
    >>> as well.
    >>
    >> Good.
    > 
    > I've pushed this now, along with your other findings, ahead of the beta1
    > deadline, buildfarm seems happy so far.
    > 
    
    Thanks!
    
    >>>> I still don't understand why this needs DELAY_CHKPT_START ...
    >>>
    >>> Having stared at this for some time, and going over old threads, I think this
    >>> is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
    >>> erring on the safe side by leaving as is and looking at removing in 20.  What
    >>> do you think?
    >>>
    >>
    >> I'd probably try to fix this for 19, otherwise it may be confusing
    >> people looking at the code in the future. We're still months from 19
    >> getting released. Ofc, maybe I'm underestimating the risk.
    > 
    > You're probably right.  Once beta1 is out I'll work on getting this fixed.
    > 
    
    +1
    
    
    regards
    
    -- 
    Tomas Vondra