Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Use correct datatype for PID
- 0ca1b3010597 19 (unreleased) landed
Improve comments in online checksums code
- cd857dec0e0a 19 (unreleased) landed
Fix checksum state transition during promotion
- 5fee7cab1b87 19 (unreleased) landed
Fix regex searching for page verification failures in tests
- 486b9a9b9eb4 19 (unreleased) landed
Apply data-checksum worker throttling parameters
- 9a39056c418c 19 (unreleased) landed
Skip WAL for unlogged main fork during online checksum enable
- 2018bd616790 19 (unreleased) landed
Revert "Get rid of WALBufMappingLock"
- c13070a27b63 19 (unreleased) cited
Get rid of WALBufMappingLock
- bc22dc0e0ddc 18.0 cited
Improve grammar of options for command arrays in TAP tests
- ce1b0f9da03e 18.0 cited

Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-07-03T06:41:01Z

After some off-list discussion about the desirability of this feature, where
several hackers opined that it's something that we should have, I've decided to
rebase this patch and submit it one more time. There are several (long)
threads covering the history of this patch [0][1], related work stemming from
this [2] as well as earlier attempts and discussions [3][4]. Below I try to
respond to a summary of points raised in those threads.

The mechanics of the patch hasn't changed since the last posted version, it has
mainly been polished slightly. A high-level overview of the processing is:
It's using a launcher/worker model where the launcher will spawn a worker per
database which will traverse all pages and dirty them in order to calculate and
set the checksum on them. During this inprogress state all backends calculated
and write checksums but don't verify them on read. Once all pages have been
checksummed the state of the cluster will switch over to "on" synchronized
across all backends with a procsignalbarrier. At this point checksums are
verified and processing is equal to checksums having been enabled initdb. When
a user disables checksums the cluster enters a state where all backends still
write checksums until all backends have acknowledged that they have stopped
verifying checksums (again using a procsignalbarrier). At this point the
cluster switches to "off" and checksums are neither written nor verified. In
case the cluster is restarted, voluntarily or via a crash, processing will have
to be restarted (more on that further down).

The user facing controls for this are two SQL level functions, for enabling and
disabling. The existing data_checksums GUC remains but is expanded with more
possible states (with on/off retained).

Complaints against earlier versions
===================================
Seasoned hackers might remember that this patch has been on -hackers before.
There has been a lot of review, and AFAICT all specific comments have been
addressed. There are however a few larger more generic complaints:

* Restartability - the initial version of the patch did not support stateful
restarts, a shutdown performed (or crash) before checksums were enabled would
result in a need to start over from the beginning. This was deemed the safe
orchestration method. The lack of this feature was seen as serious drawback,
so it was added. Subsequent review instead found the patch to be too
complicated with a too large featureset. I thihk there is merit to both of
these arguments: being able to restart is a great feature; and being able to
reason about the correctness of a smaller patch is also great. As of this
submission I have removed the ability to restart to keep the scope of the patch
small (which is where the previous version was, which received no review after
the removal). The way I prefer to frame this is to first add scaffolding and
infrastructure (this patch) and leave refinements and add-on features
(restartability, but also others like parallel workers, optimizing rare cases,
etc) for follow-up patches.

* Complexity - it was brought up that this is a very complex patch for a niche
feature, and there is a lot of truth to that. It is inherently complex to
change a pg_control level state of a running cluster. There might be ways to
make the current patch less complex, while not sacrificing stability, and if so
that would be great. A lot of of the complexity came from being able to
restart processing, and that's not removed for this version, but it's clearly
not close to a one-line-diff even without it.

Other complaints were addressed, in part by the invention of procsignalbarriers
which makes this synchronization possible. In re-reading the threads I might
have missed something which is still left open, and if so I do apologize for
that.

Open TODO items:
================
* Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
order to be able to run the tests in a timely manner on it. This is overly
aggressive and dialling it back while still being able to run fast tests is a
TODO. Not sure what the best option is there.

* Monitoring - an insightful off-list reviewer asked how the current progress
of the operation is monitored. So far I've been using pg_stat_activity but I
don't disagree that it's not a very sharp tool for this. Maybe we need a
specific function or view or something? There clearly needs to be a way for a
user to query state and progress of a transition.

* Throttling - right now the patch uses the vacuum access strategy, with the
same cost options as vacuum, in order to implement throttling. This is in part
due to the patch starting out modelled around autovacuum as a worker, but it
may not be the right match for throttling checksums.

* Naming - the in-between states when data checksums are enabled or disabled
are called inprogress-on and inprogress-off. The reason for this is simply
that early on there were only three states: inprogress, on and off, and the
process of disabling wasn't labeled with a state. When this transition state
was added it seemed like a good idea to tack the end-goal onto the transition.
These state names make the code easily greppable but might not be the most
obvious choices for anything user facing. Is "Enabling" and "Disabling" better
terms to use (across the board or just user facing) or should we stick to the
current?

There are ways in which this processing can be optimized to achieve better
performance, but in order to keep goalposts in sight and patchsize down they
are left as future work.

--
Daniel Gustafsson

[0] https://www.postgresql.org/message-id/flat/CABUevExz9hUUOLnJVr2kpw9Cx%3Do4MCr1SVKwbupzuxP7ckNutA%40mail.gmail.com
[1] https://www.postgresql.org/message-id/flat/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh%2BziCSa5YLLU_BA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/20181030051643.elbxjww5jjgnjaxg%40alap3.anarazel.de
[3] https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com
[4] https://www.postgresql.org/message-id/flat/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas.vondra@enterprisedb.com> — 2024-07-03T11:20:10Z

Hi Daniel,

Thanks for rebasing the patch and submitting it again!

On 7/3/24 08:41, Daniel Gustafsson wrote:
> After some off-list discussion about the desirability of this feature, where
> several hackers opined that it's something that we should have, I've decided to
> rebase this patch and submit it one more time.  There are several (long)
> threads covering the history of this patch [0][1], related work stemming from
> this [2] as well as earlier attempts and discussions [3][4].  Below I try to
> respond to a summary of points raised in those threads.
> 
> The mechanics of the patch hasn't changed since the last posted version, it has
> mainly been polished slightly.  A high-level overview of the processing is:
> It's using a launcher/worker model where the launcher will spawn a worker per
> database which will traverse all pages and dirty them in order to calculate and
> set the checksum on them.  During this inprogress state all backends calculated
> and write checksums but don't verify them on read.  Once all pages have been
> checksummed the state of the cluster will switch over to "on" synchronized
> across all backends with a procsignalbarrier.  At this point checksums are
> verified and processing is equal to checksums having been enabled initdb.  When
> a user disables checksums the cluster enters a state where all backends still
> write checksums until all backends have acknowledged that they have stopped
> verifying checksums (again using a procsignalbarrier).  At this point the
> cluster switches to "off" and checksums are neither written nor verified.  In
> case the cluster is restarted, voluntarily or via a crash, processing will have
> to be restarted (more on that further down).
> 
> The user facing controls for this are two SQL level functions, for enabling and
> disabling.  The existing data_checksums GUC remains but is expanded with more
> possible states (with on/off retained).
> 
> 
> Complaints against earlier versions
> ===================================
> Seasoned hackers might remember that this patch has been on -hackers before.
> There has been a lot of review, and AFAICT all specific comments have been
> addressed.  There are however a few larger more generic complaints:
> 
> * Restartability - the initial version of the patch did not support stateful
> restarts, a shutdown performed (or crash) before checksums were enabled would
> result in a need to start over from the beginning.  This was deemed the safe
> orchestration method.  The lack of this feature was seen as serious drawback,
> so it was added.  Subsequent review instead found the patch to be too
> complicated with a too large featureset.  I thihk there is merit to both of
> these arguments: being able to restart is a great feature; and being able to
> reason about the correctness of a smaller patch is also great.  As of this
> submission I have removed the ability to restart to keep the scope of the patch
> small (which is where the previous version was, which received no review after
> the removal).  The way I prefer to frame this is to first add scaffolding and
> infrastructure (this patch) and leave refinements and add-on features
> (restartability, but also others like parallel workers, optimizing rare cases,
> etc) for follow-up patches.
> 

I 100% support this approach.

Sure, I'd like to have a restartable tool, but clearly that didn't go
particularly well, and we still have nothing to enable checksums online.
That doesn't seem to benefit anyone - to me it seems reasonable to get
the non-restartable tool in, and then maybe later someone can improve
this to make it restartable. Thanks to the earlier work we know it's
doable, even if it was too complex.

This way it's at least possible to enable checksums online with some
additional care (e.g. to make sure no one restarts the cluster etc.).
I'd bet for vast majority of systems this will work just fine. Huge
systems with some occasional / forced restarts may not be able to make
this work - but then again, that's no worse than now.

> * Complexity - it was brought up that this is a very complex patch for a niche
> feature, and there is a lot of truth to that.  It is inherently complex to
> change a pg_control level state of a running cluster.  There might be ways to
> make the current patch less complex, while not sacrificing stability, and if so
> that would be great.  A lot of of the complexity came from being able to
> restart processing, and that's not removed for this version, but it's clearly
> not close to a one-line-diff even without it.
> 

I'd push back on this a little bit - the patch looks like this:

 50 files changed, 2691 insertions(+), 48 deletions(-)

and if we ignore the docs / perl tests, then the two parts that stand
out are

 src/backend/access/transam/xlog.c             |  455 +++++-
 src/backend/postmaster/datachecksumsworker.c  | 1353 +++++++++++++++++

I don't think the worker code is exceptionally complex. Yes, it's not
trivial, but a lot of the 1353 inserts is comments (which is good) or
generic infrastructure to start the worker etc.

> Other complaints were addressed, in part by the invention of procsignalbarriers
> which makes this synchronization possible.  In re-reading the threads I might
> have missed something which is still left open, and if so I do apologize for
> that.
> 
> 
> Open TODO items:
> ================
> * Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
> order to be able to run the tests in a timely manner on it.  This is overly
> aggressive and dialling it back while still being able to run fast tests is a
> TODO.  Not sure what the best option is there.
> 

Why not to add a parameter to pg_enable_data_checksums(), specifying
whether to do immediate checkpoint or wait for the next one? AFAIK
that's what we do in pg_backup_start, for example.

> * Monitoring - an insightful off-list reviewer asked how the current progress
> of the operation is monitored.  So far I've been using pg_stat_activity but I
> don't disagree that it's not a very sharp tool for this.  Maybe we need a
> specific function or view or something?  There clearly needs to be a way for a
> user to query state and progress of a transition.
> 

Yeah, I think a view like pg_stat_progress_checksums would work.

> * Throttling - right now the patch uses the vacuum access strategy, with the
> same cost options as vacuum, in order to implement throttling.  This is in part
> due to the patch starting out modelled around autovacuum as a worker, but it
> may not be the right match for throttling checksums.
> 

IMHO it's reasonable to reuse the vacuum throttling. Even if it's not
perfect, it does not seem great to invent something new and end up with
two different ways to throttle stuff.

> * Naming - the in-between states when data checksums are enabled or disabled
> are called inprogress-on and inprogress-off.  The reason for this is simply
> that early on there were only three states: inprogress, on and off, and the
> process of disabling wasn't labeled with a state.  When this transition state
> was added it seemed like a good idea to tack the end-goal onto the transition.
> These state names make the code easily greppable but might not be the most
> obvious choices for anything user facing.  Is "Enabling" and "Disabling" better
> terms to use (across the board or just user facing) or should we stick to the
> current?
> 

I think the naming is fine. In the worst case we can rename that later,
seems more like a detail.

> There are ways in which this processing can be optimized to achieve better
> performance, but in order to keep goalposts in sight and patchsize down they
> are left as future work.
> 

+1


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Changing the state of data checksums in a running cluster

Bruce Momjian <bruce@momjian.us> — 2024-07-06T02:23:35Z

On Wed, Jul  3, 2024 at 01:20:10PM +0200, Tomas Vondra wrote:
> > * Restartability - the initial version of the patch did not support stateful
> > restarts, a shutdown performed (or crash) before checksums were enabled would
> > result in a need to start over from the beginning.  This was deemed the safe
> > orchestration method.  The lack of this feature was seen as serious drawback,
> > so it was added.  Subsequent review instead found the patch to be too
> > complicated with a too large featureset.  I thihk there is merit to both of
> > these arguments: being able to restart is a great feature; and being able to
> > reason about the correctness of a smaller patch is also great.  As of this
> > submission I have removed the ability to restart to keep the scope of the patch
> > small (which is where the previous version was, which received no review after
> > the removal).  The way I prefer to frame this is to first add scaffolding and
> > infrastructure (this patch) and leave refinements and add-on features
> > (restartability, but also others like parallel workers, optimizing rare cases,
> > etc) for follow-up patches.
> > 
> 
> I 100% support this approach.

Yes, I was very disappointed when restartability sunk the patch, and I
saw this as another case where saying "yes" to every feature improvement
can lead to failure.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-09-30T21:21:30Z

> On 3 Jul 2024, at 13:20, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> Thanks for rebasing the patch and submitting it again!

Thanks for review, sorry for being so slow to pick this up again.

The attached version is a rebase with some level of cleanup and polish all
around, and most importantly it adresses the two points raised below.

>> * Immediate checkpoints - the code is currently using CHECKPOINT_IMMEDIATE in
>> order to be able to run the tests in a timely manner on it.  This is overly
>> aggressive and dialling it back while still being able to run fast tests is a
>> TODO.  Not sure what the best option is there.
> 
> Why not to add a parameter to pg_enable_data_checksums(), specifying
> whether to do immediate checkpoint or wait for the next one? AFAIK
> that's what we do in pg_backup_start, for example.

That's a good idea, pg_enable_data_checksums now accepts a third parameter
"fast" (defaults to false) which will enable immediate checkpoints when true.

>> * Monitoring - an insightful off-list reviewer asked how the current progress
>> of the operation is monitored.  So far I've been using pg_stat_activity but I
>> don't disagree that it's not a very sharp tool for this.  Maybe we need a
>> specific function or view or something?  There clearly needs to be a way for a
>> user to query state and progress of a transition.
> 
> Yeah, I think a view like pg_stat_progress_checksums would work.

Added in the attached version.  It probably needs some polish (the docs for
sure do) but it's at least a start.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Michael Banck <mbanck@gmx.net> — 2024-09-30T22:43:49Z

Hi,

On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
> > Yeah, I think a view like pg_stat_progress_checksums would work.
> 
> Added in the attached version.  It probably needs some polish (the docs for
> sure do) but it's at least a start.

Just a nitpick, but we call it data_checksums about everywhere, but the
new view is called pg_stat_progress_datachecksums - I think
pg_stat_progress_data_checksums would look better even if it gets quite
long.

Michael

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-10-01T18:55:34Z

> On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
> 
> Hi,
> 
> On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
>>> Yeah, I think a view like pg_stat_progress_checksums would work.
>> 
>> Added in the attached version.  It probably needs some polish (the docs for
>> sure do) but it's at least a start.
> 
> Just a nitpick, but we call it data_checksums about everywhere, but the
> new view is called pg_stat_progress_datachecksums - I think
> pg_stat_progress_data_checksums would look better even if it gets quite
> long.

That's a fair point, I'll make sure to switch for the next version of the
patch.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-10-07T09:42:41Z

> On 1 Oct 2024, at 20:55, Daniel Gustafsson <daniel@yesql.se> wrote:
> 
>> On 1 Oct 2024, at 00:43, Michael Banck <mbanck@gmx.net> wrote:
>> 
>> Hi,
>> 
>> On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote:
>>>> Yeah, I think a view like pg_stat_progress_checksums would work.
>>> 
>>> Added in the attached version.  It probably needs some polish (the docs for
>>> sure do) but it's at least a start.
>> 
>> Just a nitpick, but we call it data_checksums about everywhere, but the
>> new view is called pg_stat_progress_datachecksums - I think
>> pg_stat_progress_data_checksums would look better even if it gets quite
>> long.
> 
> That's a fair point, I'll make sure to switch for the next version of the
> patch.

A rebased v3 attached with that change.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2024-10-07T14:46:25Z

Hi,

I did a quick review today. First a couple minor comments:

1) monitoring.sgml

typos: number of database -> number of databases
       calcuated -> calculated

2) unnecessary newline in heapam.c (no other changes)

3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
shadowing earlier variable

4) unnecessary comment change in bufmgr.c (no other changes)

5) unnecessary include and newline in bulk_write.c (no other changes)

6) unnecessary newline in pgstat.c (no other changes)

7) controldata.c - maybe this

   if (oldctrl->data_checksum_version == 2)

should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
constant? For "off" we use "0" which seems somewhat acceptable, but for
other values it's less obvious what the meaning is.

8) xlog_internal.h - xl_checksum_state should be added to typedefs

9) system_functions.sql - Isn't it weird that this only creates the new
pg_enable_data_checksums function, but not pg_disable_data_checksums? It
also means it doesn't revoke EXECUTE from public on it, which I guess it
probably should? Or why should this be different for the two functions?

Also the error message seems to differ:

  test=> select pg_enable_data_checksums();
  ERROR:  permission denied for function pg_enable_data_checksums
  test=> select pg_disable_data_checksums();
  ERROR:  must be superuser

Probably harmless, but seems a bit strange.


But there also seems to be a more serious problem with recovery. I did a
simple script that does a loop of

* start a cluster
* initialize a small pgbench database (scale 1 - 100)
* run "long" pgbench
* call pg_enable_data_checksums(), wait for it to complete
* stop the cluster with "-m immediate"
* start the cluster

And this unfortunately hits this assert:

  bool
  AbsorbChecksumsOnBarrier(void)
  {
    Assert(LocalDataChecksumVersion ==
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
    return true;
  }

Based on our short discussion about this, the controlfile gets updated
right after pg_enable_data_checksums() completes. The immediate stop
however forces a recovery since the last checkpoint, which means we see
the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
exit recovery, try to start checkpointer and it trips over this, because
the control file already has the "on" value :-(

I'm not sure what's the best way to fix this. Would it be possible to
remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
noop in that case? Or not set the barrier when exiting recovery. I'm not
sure the relaxed assert would remain meaningful, though. What would it
check on standbys, for example?

Maybe a better way would be to wait for a checkpoint before updating the
controlfile, similar to what we do at the beginning? Possibly even with
the same "fast=true/false" logic. That would prevent us from seeing the
XLOG_CHECKSUMS wal record with the updated flag. It would extend the
"window" where a crash would mean we have to redo the checksums, but I
don't think that matters much. For small databases who cares, and for
large databases it should not be a meaningful difference (setting the
checksums already ran over multiple checkpoints, so one checkpoint is
not a big difference).


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> — 2024-10-07T18:42:31Z

Tomas Vondra <tomas@vondra.me> writes:

> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
>    shadowing earlier variable

All the ListCell variables can be eliminated by using the foreach_ptr
and foreach_oid macros instead of plain foreach.

- ilmari

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-10-08T20:38:36Z

> On 7 Oct 2024, at 16:46, Tomas Vondra <tomas@vondra.me> wrote:

> I did a quick review today. First a couple minor comments:

Thanks for looking! 1-6 are all fixed.

> 7) controldata.c - maybe this
> 
>   if (oldctrl->data_checksum_version == 2)
> 
> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
> constant? For "off" we use "0" which seems somewhat acceptable, but for
> other values it's less obvious what the meaning is.

It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
we should move (or mirror) the checksum versions to storage/checksum_impl.h to
make them available to frontend and backend tools?

> 8) xlog_internal.h - xl_checksum_state should be added to typedefs

Fixed.

> 9) system_functions.sql - Isn't it weird that this only creates the new
> pg_enable_data_checksums function, but not pg_disable_data_checksums?

We don't need any DEFAULT values for pg_disable_data_checksums so it doesn't
need to be created there.

> It
> also means it doesn't revoke EXECUTE from public on it, which I guess it
> probably should? Or why should this be different for the two functions?

That should however be done, so fixed.

> But there also seems to be a more serious problem with recovery. I did a
> simple script that does a loop of
> 
> * start a cluster
> * initialize a small pgbench database (scale 1 - 100)
> * run "long" pgbench
> * call pg_enable_data_checksums(), wait for it to complete
> * stop the cluster with "-m immediate"
> * start the cluster
> 
> And this unfortunately hits this assert:
> 
>  bool
>  AbsorbChecksumsOnBarrier(void)
>  {
>    Assert(LocalDataChecksumVersion ==
> PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
>    return true;
>  }
> 
> Based on our short discussion about this, the controlfile gets updated
> right after pg_enable_data_checksums() completes. The immediate stop
> however forces a recovery since the last checkpoint, which means we see
> the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
> exit recovery, try to start checkpointer and it trips over this, because
> the control file already has the "on" value :-(
> 
> I'm not sure what's the best way to fix this. Would it be possible to
> remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
> noop in that case? Or not set the barrier when exiting recovery. I'm not
> sure the relaxed assert would remain meaningful, though. What would it
> check on standbys, for example?
> 
> Maybe a better way would be to wait for a checkpoint before updating the
> controlfile, similar to what we do at the beginning? Possibly even with
> the same "fast=true/false" logic. That would prevent us from seeing the
> XLOG_CHECKSUMS wal record with the updated flag. It would extend the
> "window" where a crash would mean we have to redo the checksums, but I
> don't think that matters much. For small databases who cares, and for
> large databases it should not be a meaningful difference (setting the
> checksums already ran over multiple checkpoints, so one checkpoint is
> not a big difference).

The more I think about it the more I think that updating the control file is
the wrong thing to do for this patch, it should only change the state in memory
and let the checkpoints update the controlfile.  The attached fixes that and I
can no longer reproduce the assertion failure you hit.

The attached version also contains updates to the documentation, the aux proc
counter and other smaller bits of polish.

I did remove parts of the progress reporting for now since it can't be used
from the dynamic backgroundworker it seems.  I need to regroup and figure out a
better way there, but I wanted to address your above find sooner rather than
wait for that.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-10-08T20:39:57Z

> On 7 Oct 2024, at 20:42, Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
> 
> Tomas Vondra <tomas@vondra.me> writes:
> 
>> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345,
>>   shadowing earlier variable
> 
> All the ListCell variables can be eliminated by using the foreach_ptr
> and foreach_oid macros instead of plain foreach.

Fair point, done in the v4 attached upthread.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2024-10-09T10:41:02Z


On 10/8/24 22:38, Daniel Gustafsson wrote:
> 
>> 7) controldata.c - maybe this
>>
>>   if (oldctrl->data_checksum_version == 2)
>>
>> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
>> constant? For "off" we use "0" which seems somewhat acceptable, but for
>> other values it's less obvious what the meaning is.
> 
> It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
> we should move (or mirror) the checksum versions to storage/checksum_impl.h to
> make them available to frontend and backend tools?
> 

+1 to have checksum_impl.h

> 
>> But there also seems to be a more serious problem with recovery. I did a
>> simple script that does a loop of
>>
>> * start a cluster
>> * initialize a small pgbench database (scale 1 - 100)
>> * run "long" pgbench
>> * call pg_enable_data_checksums(), wait for it to complete
>> * stop the cluster with "-m immediate"
>> * start the cluster
>>
>> And this unfortunately hits this assert:
>>
>>  bool
>>  AbsorbChecksumsOnBarrier(void)
>>  {
>>    Assert(LocalDataChecksumVersion ==
>> PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>>    LocalDataChecksumVersion = PG_DATA_CHECKSUM_VERSION;
>>    return true;
>>  }
>>
>> Based on our short discussion about this, the controlfile gets updated
>> right after pg_enable_data_checksums() completes. The immediate stop
>> however forces a recovery since the last checkpoint, which means we see
>> the XLOG_CHECKSUMS WAL message again, and set the barrier. And then we
>> exit recovery, try to start checkpointer and it trips over this, because
>> the control file already has the "on" value :-(
>>
>> I'm not sure what's the best way to fix this. Would it be possible to
>> remember we saw the XLOG_CHECKSUMS during recovery, and make the assert
>> noop in that case? Or not set the barrier when exiting recovery. I'm not
>> sure the relaxed assert would remain meaningful, though. What would it
>> check on standbys, for example?
>>
>> Maybe a better way would be to wait for a checkpoint before updating the
>> controlfile, similar to what we do at the beginning? Possibly even with
>> the same "fast=true/false" logic. That would prevent us from seeing the
>> XLOG_CHECKSUMS wal record with the updated flag. It would extend the
>> "window" where a crash would mean we have to redo the checksums, but I
>> don't think that matters much. For small databases who cares, and for
>> large databases it should not be a meaningful difference (setting the
>> checksums already ran over multiple checkpoints, so one checkpoint is
>> not a big difference).
> 
> The more I think about it the more I think that updating the control file is
> the wrong thing to do for this patch, it should only change the state in memory
> and let the checkpoints update the controlfile.  The attached fixes that and I
> can no longer reproduce the assertion failure you hit.
> 

I think leaving the update of controlfile to checkpointer is correct,
and probably the only way to make this correct (without race
conditions). We need to do that automatically with the checkpoint (which
updates the redo LSN, guaranteeing we won't see the XLOG_CHECKSUMS
record again).

I ran the tests with this new patch, and I haven't reproduced the
crashes. I'll let it run a bit longer, and improve it to test some more
stuff, but it looks good.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-10-11T07:57:13Z

> On 9 Oct 2024, at 12:41, Tomas Vondra <tomas@vondra.me> wrote:

>>> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic
>>> constant? For "off" we use "0" which seems somewhat acceptable, but for
>>> other values it's less obvious what the meaning is.
>> 
>> It doesn't seem clean to include storage/bufpage.h in pg_upgrade, I wonder if
>> we should move (or mirror) the checksum versions to storage/checksum_impl.h to
>> make them available to frontend and backend tools?
> 
> +1 to have checksum_impl.h

I tried various different ways of breaking out the checksum version into
another header file but all of them ended up messier than the current state due
to how various tools include the checksum code.  In the end I opted for doing
the bufpage include to keep it simple.  This patch is big enough as it is
without additional refactoring of checksum (header) code, that can be done
separately from this.

> I ran the tests with this new patch, and I haven't reproduced the
> crashes. I'll let it run a bit longer, and improve it to test some more
> stuff, but it looks good.

Thanks for testing, I am too unable to reproduce that error.

The attached v5 has the above include fix as well as pgindent and pgperltidy
runs and some tweaking to the commit message to make it concise. It's also
rebased to handle a recent conflict in the makefiles.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-11-04T11:27:51Z

Attached is a rebased v6 fixing the tests to handle that checksums are now on
by default, no other changes are made as no outstanding review comments or
identified bugs exist.

Does anyone object to going ahead with this?

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-11-05T12:51:05Z

> On 4 Nov 2024, at 12:27, Daniel Gustafsson <daniel@yesql.se> wrote:
> 
> Attached is a rebased v6 fixing the tests to handle that checksums are now on
> by default, no other changes are made as no outstanding review comments or
> identified bugs exist.
> 
> Does anyone object to going ahead with this?

And a new rebase to cope with recent changes,

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2024-11-06T09:27:08Z

> On 5 Nov 2024, at 13:51, Daniel Gustafsson <daniel@yesql.se> wrote:
> 
>> On 4 Nov 2024, at 12:27, Daniel Gustafsson <daniel@yesql.se> wrote:
>> 
>> Attached is a rebased v6 fixing the tests to handle that checksums are now on
>> by default, no other changes are made as no outstanding review comments or
>> identified bugs exist.
>> 
>> Does anyone object to going ahead with this?
> 
> And a new rebase to cope with recent changes,

..and one more since I forgot to git add the new expected output testfile.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2024-11-08T00:41:10Z

Hi,

Unfortunately it seems we're not out of the woods yet :-(

I started doing some more testing on the v8 patch. My plan was to do
some stress testing with physical replication, random restarts and stuff
like that. But I ran into issues before that.

Attached is a reproducer script, that does this:

1) initializes an instance with a small (scale 10) pgbench database

2) runs a pgbench in the background, and flips checksums

3) restarts the database with fast or immediate mode

4) watches for checksums state until it reaches expected value

5) restarts the instance

Of course, the restart interrupts the checksum enable, with this message
in the log:

WARNING:  data checksums are being enabled, but no worker is running
1731024482.102 2024-11-08 01:08:02.102 CET [267066] [startup:]
[672d5660.4133a:7] [2024-11-08 01:08:00 CET] [/0] HINT:  If checksums
were being enabled during shutdown then processing must be manually
restarted.

That's expected, of course. So I did

  SELECT pg_enable_data_checksums()

and "datachecksumsworker launcher" appeared in pg_stat_activity, but
nothing else was happening. It also says:

  Waiting for worker in database template0 (pid 258442)

But there are no workers with that PID. Not in the OS, not in the view,
not in the server log. Seems a bit weird. Maybe it already completed,
but then why is there a launcher waiting for it?

Ultimately I tried running CHECKPOINT, And that apparently did the
trick, and the instance restarted. But then on start it hits an assert that:

  (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)

But this only happens in the final stop is -m immediate. If I change it
to "-m fast" it works.

I haven't looked into the details, but I guess it's related to the issue
with controlfile update we dealt with about a month ago.

Attached is the test.sh file (make sure to tweak the paths), and an
example of the backtraces. I've seen various processes hitting that.


Two more comments:

* It's a bit surprising that pg_disable_data_checksums() flips the state
right away, while pg_enable_data_checksums() waits for a checkpoint. I
guess it's correct, but maybe the docs should mention this difference?

* The docs currently say:

<para>
 If the cluster is stopped while in <literal>inprogress-on</literal> mode,
 for any reason, then this process must be restarted manually. To do this,
 re-execute the function <function>pg_enable_data_checksums()</function>
 once the cluster has been restarted. The background worker will attempt
 to resume the work from where it was interrupted.
</para>

I believe that's incorrect/misleading. There's no attempt to resume work
from where it was interrupted.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2024-11-26T22:07:12Z

Hi,

I spent a bit more time doing some testing on the last version of the
patch from [1]. And I ran into this assert in PostmasterStateMachine()
when stopping the cluster:

  /* All types should be included in targetMask or remainMask */
  Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);

At first I was puzzled as this happens on every shutdown, but that's
because these checks were introduced by a78af0427015 a week ago. So it's
more a matter of rebasing.

However, I also noticed the progress monitoring does not really work. I
get stuff like this:

    + psql -x test -c 'select * from pg_stat_progress_data_checksums'
    -[ RECORD 1 ]---------------------+---------
    pid                               | 56811
    datid                             | 0
    datname                           |
    phase                             | enabling
    databases_total                   | 4
    relations_total                   |
    databases_processed               | 0
    relations_processed               | 0
    databases_current                 | 16384
    relation_current                  | 0
    relation_current_blocks           | 0
    relation_current_blocks_processed | 0

But I've never seen any of the "processed" fields to be non-zero (and
relations is even NULL), and the same thing applies to relation_. Also
what is the datid/datname about? It's empty, not mentioned in sgml docs,
and we already have databases_current ...

The message [2] from 10/08 says:

> I did remove parts of the progress reporting for now since it can't be
> used from the dynamic backgroundworker it seems.  I need to regroup
> and figure out a better way there, but I wanted to address your above
> find sooner rather than wait for that.

And I guess that would explain why some of the fields are not updated.
But then the later patch versions seem to imply there are no outstanding
issues / missing stuff.


regards


[1]
https://www.postgresql.org/message-id/CA226DE1-DC9A-4675-A83C-32270C473F0B%40yesql.se

[2]
https://www.postgresql.org/message-id/DD25705F-E75F-4DCA-B49A-5578F4F55D94%40yesql.se

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Michael Paquier <michael@paquier.xyz> — 2024-12-11T04:47:08Z

On Tue, Nov 26, 2024 at 11:07:12PM +0100, Tomas Vondra wrote:
> I spent a bit more time doing some testing on the last version of the
> patch from [1]. And I ran into this assert in PostmasterStateMachine()
> when stopping the cluster:
> 
>   /* All types should be included in targetMask or remainMask */
>   Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
> 
> At first I was puzzled as this happens on every shutdown, but that's
> because these checks were introduced by a78af0427015 a week ago. So it's
> more a matter of rebasing.

Looks like the CI is not really happy about this point..  (Please make
sure to refresh the patch status after a review.)
--
Michael

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-09T17:57:52Z

Here's a rebased version of the patch series, addressing the issues I've
pointed out in the last round of reviews. I've kept the changes in
separate patches for clarity, but it should be squashed into a single
patch in the end.


1) v20250309-0001-Online-enabling-and-disabling-of-data-chec.patch
------------------------------------------------------------------

Original patch, rebased, resolving merge conflicts.


2) v20250309-0002-simple-post-rebase-fixes.patch
------------------------------------------------

A couple minor fixes, addressing test failures due to stuff committed
since the previous patch version. Mostly mechanical, the main change is
I don't think the pgstat_bestart() call is necessary. Or is it?


3) v20250309-0003-sync-the-data_checksums-GUC-with-the-local.patch
------------------------------------------------------------------

This is the main change, fixing failures in 002_actions.pl - the short
version is that test does "-C data_checksums", but AFAICS that does not
quite work because it does not call show_data_checksums() that early,
and instead just grabs the variable backing the GUC. Which may be out of
sync, so this patch fixes that by updating them both.

That fixes the issue, but it's it a bit strange we now have three places
tracking the state of data checksums? We have data_checksum_version in
the control file, and then data_checksums and LocalDataChecksumVersion
in the backends.

Would it be possible to "unify" the latter two? That would also mean we
don't have the duplicate constants like PG_DATA_CHECKSUM_VERSION and
DATA_CHECKSUM_VERSION. Or why do we need that?


4) v20250309-0004-make-progress-reporting-work.patch
----------------------------------------------------

The progress reporting got "mostly disabled" in an earlier version, due
to issues with the bgworkers. AFAICS the issue is the "status" row can
be updated only by a single process, which does not quite work with the
launcher + per-db workers architecture.

I've considered a couple different approaches:


a) updating the status only from the launcher

This is mostly what CREATE INDEX does with parallel builds, and there
it's mostly sufficient. But for checksums it'd mean we only have the
number of databases to process/done, and that seems unsatisfactory,
considering large clusters often have only a single large database. So
not good enough, IMHO.


b) allowing workers to update the status row, created by the launcher

I guess this would be better, we'd know the relations/blocks counts. And
I haven't tried coding this, but there would need to be some locking so
that the workers don't overwrite increments from other workers, etc.

But I don't think it'd work nicely with parallel per-db workers (which
we don't do now, but we might).


c) having one status entry per worker

I ended up doing this, it didn't require any changes to the progress
infrastructure, and it will work naturally even with multiple workers.
There will always be one row for launcher status (which only has the
number of databases total/done), and then one row per worker, with
database-level info (datid, datname, #relations, #blocks).

I removed the "DONE" phase, because that's right before the launcher
exists, and I don't think we have that for similar cases. And I added
"waiting on checkpoint" state, because that's often a long wait when the
launcher seems to do nothing, so it seems useful to communicate the
reason for that wait.


5) v20250309-0005-update-docs.patch
-----------------------------------

Minor tweaks to docs, to reflect the changes to the progress reporting
changes, and also some corrections (no resume after restart, ...).


So far this passed all my tests - both chekc-world and stress testing
(no failures / assert crashes, ...). One thing that puzzles me is I
wasn't able to reproduce the failures reported in [1] - not even with
just the rebase + minimal fixes (0001 + 0002).  My best theory is this
is somehow machine-specific, and my laptop is too slow or something.
I'll try with the machine I used before once it gets available.


regards


[1]
https://www.postgresql.org/message-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e%40vondra.me


-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-09T23:35:29Z

Seems cfbot was unhappy with the patches, so here's an improved version,
fixing some minor issues in expected output and a compiler warning.

There however seems to be some issue with 003_standby_restarts, which
causes failures on freebsd and macos. I don't know what that is about,
but the test runs much longer than on debian.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-10T00:18:23Z

On 3/10/25 00:35, Tomas Vondra wrote:
> Seems cfbot was unhappy with the patches, so here's an improved version,
> fixing some minor issues in expected output and a compiler warning.
> 
> There however seems to be some issue with 003_standby_restarts, which
> causes failures on freebsd and macos. I don't know what that is about,
> but the test runs much longer than on debian.
> 

OK, turns out the failures were caused by the test creating a standby
from a backup, without a slot, so sometimes the primary removed the
necessary WAL. Fixed in the attached version.

There's still a failure on windows, though. I'd bet that's due to the
data_checksum/LocalDatachecksumVersion sync not working correctly on
builds with EXEC_BACKEND, or something like that, but it's too late so
I'll take a closer look tomorrow.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-10T09:46:32Z

On 3/10/25 01:18, Tomas Vondra wrote:
>
> ...
>
> There's still a failure on windows, though. I'd bet that's due to the
> data_checksum/LocalDatachecksumVersion sync not working correctly on
> builds with EXEC_BACKEND, or something like that, but it's too late so
> I'll take a closer look tomorrow.
> 

Just like I suspected, there was a bug in EXEC_BACKEND, although a bit
different from what I guessed - the worker state in shmem was zeroed
every time, not just once. And a second issue was child_process_kinds
got out of sync with BackendType (mea culpa).

For me, this passes all CI tests, hopefully cfbot will be happy too.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-10T11:17:58Z

On 3/10/25 10:46, Tomas Vondra wrote:
> On 3/10/25 01:18, Tomas Vondra wrote:
>>
>> ...
>>
>> There's still a failure on windows, though. I'd bet that's due to the
>> data_checksum/LocalDatachecksumVersion sync not working correctly on
>> builds with EXEC_BACKEND, or something like that, but it's too late so
>> I'll take a closer look tomorrow.
>>
> 
> Just like I suspected, there was a bug in EXEC_BACKEND, although a bit
> different from what I guessed - the worker state in shmem was zeroed
> every time, not just once. And a second issue was child_process_kinds
> got out of sync with BackendType (mea culpa).
> 
> For me, this passes all CI tests, hopefully cfbot will be happy too.
> 

A bit embarrassing, I did not notice updating child_process_kinds breaks
the stats regression test, so here's a version fixing that.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-03-10T13:27:06Z

> On 10 Mar 2025, at 12:17, Tomas Vondra <tomas@vondra.me> wrote:
> 
> On 3/10/25 10:46, Tomas Vondra wrote:
>> On 3/10/25 01:18, Tomas Vondra wrote:

Thank you so much for picking up and fixing the blockers, it's highly appreciated!

>> For me, this passes all CI tests, hopefully cfbot will be happy too.

Confirmed, it compiles clean, builds docs and passes all tests for me as well.

A few comments from reading over your changes:

+   launcher worker has this value set, the other worker processes
+   have this <literal>NULL</literal>.
There seems to be a word or two missing (same in a few places), should this be
"have this set to NULL"?


+   The command is currently waiting for a checkpoint to update the checksum
+   state at the end.
s/at the end/before finishing/?


+ * XXX aren't PG_DATA_ and DATA_ constants the same? why do we need both?
They aren't mapping 1:1 as PG_DATA_ has the version numbers, and if checksums
aren't enabled there is no version and thus there is no PG_DATA_CHECKSUMS_OFF.
This could of course be remedied.  IIRC one reason for adding the enum was to
get compiler warnings on missing cases when switch()ing over the value, but I
don't think the current code has any switch.


+   /* XXX isn't it weird there's no wait between the phase updates? */
It is, I think we should skip PROGRESS_DATACHECKSUMS_PHASE_WAITING_BACKENDS in
favor of PROGRESS_DATACHECKSUMS_PHASE_ENABLING.


+   * When enabling checksums, we have to wait for a checkpoint for the
+   * checksums to e.
Seems to be missing the punchline, "for the checksum state to be moved from
in-progress to on" perhaps?


It also needs a pgindent and pgperltidy but there were only small trivial
changes there.

Thanks again for updating the patch!

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-10T15:01:35Z

On 3/10/25 14:27, Daniel Gustafsson wrote:
>> On 10 Mar 2025, at 12:17, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 3/10/25 10:46, Tomas Vondra wrote:
>>> On 3/10/25 01:18, Tomas Vondra wrote:
> 
> Thank you so much for picking up and fixing the blockers, it's highly appreciated!
> 
>>> For me, this passes all CI tests, hopefully cfbot will be happy too.
> 
> Confirmed, it compiles clean, builds docs and passes all tests for me as well.
> 
> A few comments from reading over your changes:
> 
> +   launcher worker has this value set, the other worker processes
> +   have this <literal>NULL</literal>.
> There seems to be a word or two missing (same in a few places), should this be
> "have this set to NULL"?
> 

done

> 
> +   The command is currently waiting for a checkpoint to update the checksum
> +   state at the end.
> s/at the end/before finishing/?
> 

done

> 
> + * XXX aren't PG_DATA_ and DATA_ constants the same? why do we need both?
> They aren't mapping 1:1 as PG_DATA_ has the version numbers, and if checksums
> aren't enabled there is no version and thus there is no PG_DATA_CHECKSUMS_OFF.
> This could of course be remedied.  IIRC one reason for adding the enum was to
> get compiler warnings on missing cases when switch()ing over the value, but I
> don't think the current code has any switch.
> 

I haven't done anything about this. I'm not convinced it's an issue we
need to fix, and I haven't tried how much work would it be.

> 
> +   /* XXX isn't it weird there's no wait between the phase updates? */
> It is, I think we should skip PROGRESS_DATACHECKSUMS_PHASE_WAITING_BACKENDS in
> favor of PROGRESS_DATACHECKSUMS_PHASE_ENABLING.
> 

Removed the WAITING_BACKENDS phase.

> 
> +   * When enabling checksums, we have to wait for a checkpoint for the
> +   * checksums to e.
> Seems to be missing the punchline, "for the checksum state to be moved from
> in-progress to on" perhaps?
> 

done

> 
> It also needs a pgindent and pgperltidy but there were only small trivial
> changes there.
> 

done

Attached is an updated version.


-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-10T15:16:02Z

One thing I forgot to mention is the progress reporting only updates
blocks for the FORK_MAIN. It wouldn't be difficult to report blocks for
each fork, but it'd be confusing - the relation counters would remain
the same, but the block counters would change for each fork.

I guess we could report the current_relation/fork, but it seems like an
overkill. The main fork is by far the largest one, so this seems OK.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-10T17:35:56Z

Hi,

I continued stress testing this, as I was rather unsure why the assert
failures reported in [1] disappeared. And I managed to reproduce that
again, and I think I actually understand why it happens.

I modified the test script (attached) to setup replication, not just a
single instance. And then it does a bit of work, flips the checksums,
restarts the instances (randomly, fast/immediate), verifies the checkums
and so on. And I can hit this assert in AbsorbChecksumsOnBarrier()
pretty easily:

  Assert(LocalDataChecksumVersion ==
         PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

The reason is pretty simple - this happens on the standby:

1) standby receives XLOG_CHECKSUMS and applies it from 2 to 1 (i.e. it
sets ControlFile->data_checksum_version from "inprogress-on" to "on"),
and signals all other processes to refresh LocalDataChecksumVersion

2) the control file gets written to disk for whatever reason (redo does
this in a number of places)

3) standby gets restarted with "immediate" mode (I'm not sure if this
can happen with "fast" mode, I only recall seeing "immediate")

4) the standby receives the XLOG_CHECKSUMS record *again*, updates the
ControlFile->data_checksum_version (to the same value, no noop), and
then signals the other processes again

5) the other processes already have LocalDataChecksumVersion=1 (on), but
the assert says it should be 2 (inprogress-on) => kaboom

I believe this can happen for changes in either direction, although the
window while disabling checksums is more narrow.


I'm not sure what to do about this. Maybe we could relax the assert in
some way? But that seems a bit ... possibly risky. It's not necessarily
true we'll see the immediately preceding checksum state, we might see a
couple updates back (if the control file was not updated in between).

Could this affect checksum verification during recovery? Imagine we get
to the "on" state, the controlfile gets flushed, and then the standby
restarts and starts receiving older records again. The control file says
we should be verifying checksums, but couldn't some of the writes have
been lost (and so the pages may not have a valid checksum)?

The one idea I have is to create an "immediate" restartpoint in
xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
a "spread" restartpoint would not be enough, because then we could get
into the same situation with a control file of sync (ahead of WAL) after
a restart. It'd not be cheap, but it should be a rare operation ...

I was wondering if the primary has the same issue, but AFAICS it does
not. It flushes the control file in only a couple places, I couldn't
think of a way to get it out of sync.


regards

[1]
https://www.postgresql.org/message-id/e4dbcb2c-e04a-4ba2-bff0-8d979f55960e%40vondra.me

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> — 2025-03-11T13:07:35Z

As the resident perl style pedant, I'd just like to complain about the
below:

Tomas Vondra <tomas@vondra.me> writes:

> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index 666bd2a2d4c..1c66360c16c 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
>  	my ($self) = @_;
>  
>  	print "### Enabling checksums in \"$self->data_dir\"\n";
> -	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
> +	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
> +		$self->data_dir, '-e');
>  	return;
>  }

This breaking between the command line options and its arguments is why
we're switching to using fat commas. We're also using long options for
improved self-documentation, so this should be written as:

	PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
		'--pgdata' => $self->data_dir,
		'--enable');

And likewise below in the disable method.

- ilmari

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-12T12:28:50Z

On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote:
> As the resident perl style pedant, I'd just like to complain about the
> below:
> 
> Tomas Vondra <tomas@vondra.me> writes:
> 
>> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
>> index 666bd2a2d4c..1c66360c16c 100644
>> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
>> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
>> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
>>  	my ($self) = @_;
>>  
>>  	print "### Enabling checksums in \"$self->data_dir\"\n";
>> -	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
>> +	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
>> +		$self->data_dir, '-e');
>>  	return;
>>  }
> 
> 
> This breaking between the command line options and its arguments is why
> we're switching to using fat commas. We're also using long options for
> improved self-documentation, so this should be written as:
> 
> 	PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
> 		'--pgdata' => $self->data_dir,
> 		'--enable');
> 
> And likewise below in the disable method.
> 

I don't know what fat comma is, but that's simply what perltidy did. I
don't mind formatting it differently, if there's a better way.


thanks

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> — 2025-03-12T13:08:50Z

Tomas Vondra <tomas@vondra.me> writes:

> On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote:
>> As the resident perl style pedant, I'd just like to complain about the
>> below:
>> 
>> Tomas Vondra <tomas@vondra.me> writes:
>> 
>>> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
>>> index 666bd2a2d4c..1c66360c16c 100644
>>> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
>>> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
>>> @@ -3761,7 +3761,8 @@ sub checksum_enable_offline
>>>  	my ($self) = @_;
>>>  
>>>  	print "### Enabling checksums in \"$self->data_dir\"\n";
>>> -	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D', $self->data_dir, '-e');
>>> +	PostgreSQL::Test::Utils::system_or_bail('pg_checksums', '-D',
>>> +		$self->data_dir, '-e');
>>>  	return;
>>>  }
>> 
>> 
>> This breaking between the command line options and its arguments is why
>> we're switching to using fat commas. We're also using long options for
>> improved self-documentation, so this should be written as:
>> 
>> 	PostgreSQL::Test::Utils::system_or_bail('pg_checksums',
>> 		'--pgdata' => $self->data_dir,
>> 		'--enable');
>> 
>> And likewise below in the disable method.
>> 
>
> I don't know what fat comma is, but that's simply what perltidy did. I
> don't mind formatting it differently, if there's a better way.

Fat comma is the perlish name for the => arrow, which is semantically
equivalent to a comma (except it auto-quotes any immediately preceding
bareword), but looks fatter.  Perltidy knows to not wrap lines around
them, keeping the key and value (or option and argument in this case)
together.  See commit ce1b0f9da03 for a large (but not complete, I have
more patches pending) conversion to this new style.

- ilmari

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-12T13:16:41Z

On 3/10/25 18:35, Tomas Vondra wrote:
> Hi,
> 
> I continued stress testing this, as I was rather unsure why the assert
> failures reported in [1] disappeared. And I managed to reproduce that
> again, and I think I actually understand why it happens.
> 
> I modified the test script (attached) to setup replication, not just a
> single instance. And then it does a bit of work, flips the checksums,
> restarts the instances (randomly, fast/immediate), verifies the checkums
> and so on. And I can hit this assert in AbsorbChecksumsOnBarrier()
> pretty easily:
> 
>   Assert(LocalDataChecksumVersion ==
>          PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> 
> The reason is pretty simple - this happens on the standby:
> 
> 1) standby receives XLOG_CHECKSUMS and applies it from 2 to 1 (i.e. it
> sets ControlFile->data_checksum_version from "inprogress-on" to "on"),
> and signals all other processes to refresh LocalDataChecksumVersion
> 
> 2) the control file gets written to disk for whatever reason (redo does
> this in a number of places)
> 
> 3) standby gets restarted with "immediate" mode (I'm not sure if this
> can happen with "fast" mode, I only recall seeing "immediate")
> 
> 4) the standby receives the XLOG_CHECKSUMS record *again*, updates the
> ControlFile->data_checksum_version (to the same value, no noop), and
> then signals the other processes again
> 
> 5) the other processes already have LocalDataChecksumVersion=1 (on), but
> the assert says it should be 2 (inprogress-on) => kaboom
> 
> I believe this can happen for changes in either direction, although the
> window while disabling checksums is more narrow.
> 
> 
> I'm not sure what to do about this. Maybe we could relax the assert in
> some way? But that seems a bit ... possibly risky. It's not necessarily
> true we'll see the immediately preceding checksum state, we might see a
> couple updates back (if the control file was not updated in between).
> 
> Could this affect checksum verification during recovery? Imagine we get
> to the "on" state, the controlfile gets flushed, and then the standby
> restarts and starts receiving older records again. The control file says
> we should be verifying checksums, but couldn't some of the writes have
> been lost (and so the pages may not have a valid checksum)?
> 
> The one idea I have is to create an "immediate" restartpoint in
> xlog_redo() right after XLOG_CHECKSUMS updates the control file. AFAICS
> a "spread" restartpoint would not be enough, because then we could get
> into the same situation with a control file of sync (ahead of WAL) after
> a restart. It'd not be cheap, but it should be a rare operation ...
> 
> I was wondering if the primary has the same issue, but AFAICS it does
> not. It flushes the control file in only a couple places, I couldn't
> think of a way to get it out of sync.
> 

I continued investigating this and experimenting with alternative
approaches, and I think the way the patch relies on ControlFile is not
quite right. That is, it always sets data_checksum_version to the last
("current") value, but that's not what ControlFile is for ...

The ControlFile is meant to be a safe/consistent state, e.g. for crash
recovery. By setting data_checksum_version to the "last" value we've
seen, that's broken - if the control file gets persisted (haven't seen
this on primary, but pretty common on replica, per the report), the
recovery will start with a "future" data_checksum_version value. Which
is wrong - we'll read the XLOG_CHECKUMS record, triggering the assert. I
suspect it might also lead to confusion whether checksums should be
verified or not.

In my earlier message I suggested maybe this could be solved by forcing
a checkpoint every time we see the XLOG_CHECKUMS record (or rather a
restart point, as it'd be on the replica). Sure, that would have some
undesirable consequences (forcing an immediate checkpoint is not cheap,
and the redo would need to wait for that). But the assumption was it'd
be very rare (how often you enable checksums?), so this cost might be
acceptable.

But when I started experimenting with this, I realized it has a couple
other issues:

1) We can't do the checkpoint/restartpoint when handling XLOG_CHECKUMS,
because that'd mean we see this XLOG record again, which we don't want.
So the checkpoint would need to happen the *next* time we update the
control file.

2) But we can't trigger a checkpoint from UpdateControlFile, because of
locking (because CreateCheckPoint also calls UpdateControlFile). So this
would require much more invasive changes to all places updating the
control file.

3) It does not resolve the mismatch with using ControlFile to store
"current" data_checksums_version value.

4) ... probably more minor issues that I already forgot about.

In the end, I decided to try to rework this by storing the current value
elsewhere, and only updating the "persistent" value in the control file
when necessary.

XLogCtl seemed like a good place, so I used that - after all, it's a
value from XLOG. Maybe there's a better place? I'm open to suggestions,
but it does not really affect the overall approach.

So all the places now update XLogCtl->data_checksums_version instead of
the ControlFile, and also query this flag for *current* value.

The value is copied from XLogCtl to ControlFile when creating checkpoint
(or restartpoint), and the control file is persisted. This means (a) the
current value can't get written to the control file prematurely, and (b)
the value is consistent with the checkpoint (i.e. with the LSN where we
start crash recovery, if needed).

The attached 0005 patch implements this. It's a bit WIP and I'm sure it
can be improved, but I'm yet to see a single crash/failure with it. With
the original patch I've seen crashes after 5-10 loops (i.e. a couple
minutes), I'm now at loop 1000 and it's still OK.

I believe the approach is correct, but the number of possible states
(e.g. after a crash/restart) seems a bit complex. I wonder if there's a
better way to handle this, but I can't think of any. Ideas?

One issue I ran into is the postmaster does not seem to be processing
the barriers, and thus not getting info about the data_checksum_version
changes. That's fine until it needs to launch a child process (e.g. a
walreceiver), which will then see the LocalDataChecksumVersion as of the
start of the instance, not the "current" one. I fixed this by explicitly
refreshing the value in postmaster_child_launch(), but maybe I'm missing
something. (Also, EXEC_BACKEND may need to handle this too.)

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-03-13T09:54:28Z

> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:

> I continued investigating this and experimenting with alternative
> approaches, and I think the way the patch relies on ControlFile is not
> quite right. That is, it always sets data_checksum_version to the last
> ("current") value, but that's not what ControlFile is for ...

Agreed, that's a thinko on my part.  Reading it makes it clear, but I had
failed to see that when hacking =/

> XLogCtl seemed like a good place, so I used that - after all, it's a
> value from XLOG. Maybe there's a better place? I'm open to suggestions,
> but it does not really affect the overall approach.

Seems like a good place for it.

> So all the places now update XLogCtl->data_checksums_version instead of
> the ControlFile, and also query this flag for *current* value.
> 
> The value is copied from XLogCtl to ControlFile when creating checkpoint
> (or restartpoint), and the control file is persisted. This means (a) the
> current value can't get written to the control file prematurely, and (b)
> the value is consistent with the checkpoint (i.e. with the LSN where we
> start crash recovery, if needed).

+1

> The attached 0005 patch implements this. It's a bit WIP and I'm sure it
> can be improved, but I'm yet to see a single crash/failure with it. With
> the original patch I've seen crashes after 5-10 loops (i.e. a couple
> minutes), I'm now at loop 1000 and it's still OK.

Given how successful this test has been at stressing out errors that is indeed
comforting to hear.

> I believe the approach is correct, but the number of possible states
> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
> better way to handle this, but I can't think of any. Ideas?

Not sure if this moves the needle too far in terms of complexity wrt to the
previous version of the patch, there were already multiple copies.

> One issue I ran into is the postmaster does not seem to be processing
> the barriers, and thus not getting info about the data_checksum_version
> changes.

Makes sense, that seems like a pretty reasonable constraint for the barrier.

> That's fine until it needs to launch a child process (e.g. a
> walreceiver), which will then see the LocalDataChecksumVersion as of the
> start of the instance, not the "current" one. I fixed this by explicitly
> refreshing the value in postmaster_child_launch(), but maybe I'm missing
> something. (Also, EXEC_BACKEND may need to handle this too.)

The pg_checksums test is failing for me on this version due to the GUC not
being initialized, don't we need something like the below as well?  (With a
comment explaining why ReadControlFile wasn't enough.)

@@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
        Assert(reset || ControlFile == NULL);
        ControlFile = palloc(sizeof(ControlFileData));
        ReadControlFile();
+       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);

A few comments on the patchset:

+ * Local state fror Controlfile data_checksum_version. After initialization
s/fror/for/.  Also, this is no longer true as it's a local copy of the XlogCtl
value and not the Controlfile value (which may or may not be equal).


- if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
+ if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
 	ereport(WARNING,
 		(errmsg("data checksums are being enabled, but no worker is running"),
 		 errhint("If checksums were being enabled during shutdown then processing must be manually restarted.")));
Reading this made me realize what a terrible error message I had placed there,
the hint is good but the message says checksums are being enabled but they're
not being enabled.  Maybe "data checksums are marked as being in-progress, but
no worker is running"


+uint32
+GetLocalDataChecksumVersion(void)
+{
+ return LocalDataChecksumVersion;
+}
+
+/*
+ * Get the *current* data_checksum_version (might not be written to control
+ * file yet).
+ */
+uint32
+GetCurrentDataChecksumVersion(void)
+{
+ return XLogCtl->data_checksum_version;
+}
I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
it from the Current value, and also to make appear less like actual copies of
controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
should be a better API no?

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-13T11:03:29Z

On 3/13/25 10:54, Daniel Gustafsson wrote:
>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> I continued investigating this and experimenting with alternative
>> approaches, and I think the way the patch relies on ControlFile is not
>> quite right. That is, it always sets data_checksum_version to the last
>> ("current") value, but that's not what ControlFile is for ...
> 
> Agreed, that's a thinko on my part.  Reading it makes it clear, but I had
> failed to see that when hacking =/
> 

It wasn't obvious to me either, until I managed to trigger the failure
and investigated the root cause.

>> XLogCtl seemed like a good place, so I used that - after all, it's a
>> value from XLOG. Maybe there's a better place? I'm open to suggestions,
>> but it does not really affect the overall approach.
> 
> Seems like a good place for it.
> 

OK

>> So all the places now update XLogCtl->data_checksums_version instead of
>> the ControlFile, and also query this flag for *current* value.
>>
>> The value is copied from XLogCtl to ControlFile when creating checkpoint
>> (or restartpoint), and the control file is persisted. This means (a) the
>> current value can't get written to the control file prematurely, and (b)
>> the value is consistent with the checkpoint (i.e. with the LSN where we
>> start crash recovery, if needed).
> 
> +1
> 

OK. I still want to go over the places once more and double check it
sets the ControlFile value to the right data_checksum_version.

>> The attached 0005 patch implements this. It's a bit WIP and I'm sure it
>> can be improved, but I'm yet to see a single crash/failure with it. With
>> the original patch I've seen crashes after 5-10 loops (i.e. a couple
>> minutes), I'm now at loop 1000 and it's still OK.
> 
> Given how successful this test has been at stressing out errors that is indeed
> comforting to hear.
> 

It is. I plan to vary the stress test a bit more, and also run it on
another machine (rpi5, to get some non-x86 testing).

>> I believe the approach is correct, but the number of possible states
>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>> better way to handle this, but I can't think of any. Ideas?
> 
> Not sure if this moves the needle too far in terms of complexity wrt to the
> previous version of the patch, there were already multiple copies.
> 

It does add one more place (XLogCtl->data_checksum_version) to store the
current state, so it's not that much more complex, ofc. But I was not
really comparing this to the previous patch version, I meant the state
space in general - all possible combinations of all the flags (control
file, local + xlogct).

I wonder if it might be possible to have a more thorough validation of
the transitions. We already have that for the LocalDataChecksumVersion,
thanks to the asserts - and it was damn useful, otherwise we would not
have noticed this issue for a long time, I think.

I wonder if we can have similar checks for the other flags. I'm pretty
sure we can have the same checks for XLogCtl, right? I'm not quite sure
about ControlFile - can't that "skip" some of the changes, e.g. if we do
(enable->disable->enable) within a single checkpoint? Need to check.

This also reminds me I had a question about the barrier - can't it
happen a process gets to process multiple barriers at the same time? I
mean, let's say it gets stuck for a while, and the cluster happens to go
through disable+enable. Won't it then see both barriers? That'd be a
problem, because the core processes the barriers in the order determined
by the enum value, not in the order the barriers happened. Which means
it might break the expected state transitions again (and end with the
wrong local value). I haven't tried, though.

>> One issue I ran into is the postmaster does not seem to be processing
>> the barriers, and thus not getting info about the data_checksum_version
>> changes.
> 
> Makes sense, that seems like a pretty reasonable constraint for the barrier.
> 

Not sure I follow. What's a reasonable constraint?

>> That's fine until it needs to launch a child process (e.g. a
>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>> start of the instance, not the "current" one. I fixed this by explicitly
>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>> something. (Also, EXEC_BACKEND may need to handle this too.)
> 
> The pg_checksums test is failing for me on this version due to the GUC not
> being initialized, don't we need something like the below as well?  (With a
> comment explaining why ReadControlFile wasn't enough.)
> 
> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
>         Assert(reset || ControlFile == NULL);
>         ControlFile = palloc(sizeof(ControlFileData));
>         ReadControlFile();
> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
> 

Yeah, I think this (or something like it) is missing.

> A few comments on the patchset:
> 
> + * Local state fror Controlfile data_checksum_version. After initialization
> s/fror/for/.  Also, this is no longer true as it's a local copy of the XlogCtl
> value and not the Controlfile value (which may or may not be equal).
> 
> 
> - if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
> + if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
>  	ereport(WARNING,
>  		(errmsg("data checksums are being enabled, but no worker is running"),
>  		 errhint("If checksums were being enabled during shutdown then processing must be manually restarted.")));
> Reading this made me realize what a terrible error message I had placed there,
> the hint is good but the message says checksums are being enabled but they're
> not being enabled.  Maybe "data checksums are marked as being in-progress, but
> no worker is running"
> 

Makes sense, will reword.

> 
> +uint32
> +GetLocalDataChecksumVersion(void)
> +{
> + return LocalDataChecksumVersion;
> +}
> +
> +/*
> + * Get the *current* data_checksum_version (might not be written to control
> + * file yet).
> + */
> +uint32
> +GetCurrentDataChecksumVersion(void)
> +{
> + return XLogCtl->data_checksum_version;
> +}
> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
> it from the Current value, and also to make appear less like actual copies of
> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
> should be a better API no?
> 

FWIW those functions are for debug logging only, I needed to print the
values in a couple places outside xlog.c. I don't intend to make that
part of the patch.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-03-13T12:32:20Z

> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
> On 3/13/25 10:54, Daniel Gustafsson wrote:
>>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:

>>> I believe the approach is correct, but the number of possible states
>>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>>> better way to handle this, but I can't think of any. Ideas?
>> 
>> Not sure if this moves the needle too far in terms of complexity wrt to the
>> previous version of the patch, there were already multiple copies.
> 
> It does add one more place (XLogCtl->data_checksum_version) to store the
> current state, so it's not that much more complex, ofc. But I was not
> really comparing this to the previous patch version, I meant the state
> space in general - all possible combinations of all the flags (control
> file, local + xlogct).

Fair point.

> I wonder if it might be possible to have a more thorough validation of
> the transitions. We already have that for the LocalDataChecksumVersion,
> thanks to the asserts - and it was damn useful, otherwise we would not
> have noticed this issue for a long time, I think.
> 
> I wonder if we can have similar checks for the other flags. I'm pretty
> sure we can have the same checks for XLogCtl, right?

I don't see why not, they should abide by the same rules.

> I'm not quite sure
> about ControlFile - can't that "skip" some of the changes, e.g. if we do
> (enable->disable->enable) within a single checkpoint? Need to check.

For enable->disable->enable within a single checkpoint then ControlFile should
never see the disable state.

> This also reminds me I had a question about the barrier - can't it
> happen a process gets to process multiple barriers at the same time? I
> mean, let's say it gets stuck for a while, and the cluster happens to go
> through disable+enable. Won't it then see both barriers? That'd be a
> problem, because the core processes the barriers in the order determined
> by the enum value, not in the order the barriers happened. Which means
> it might break the expected state transitions again (and end with the
> wrong local value). I haven't tried, though.

Interesting, that seems like a general deficiency in the barriers, surely
processing them in-order would be more intuitive?  That would probably require
some form of Lamport clock though.

>>> One issue I ran into is the postmaster does not seem to be processing
>>> the barriers, and thus not getting info about the data_checksum_version
>>> changes.
>> 
>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
> 
> Not sure I follow. What's a reasonable constraint?

That the postmaster deosn't process them.

>>> That's fine until it needs to launch a child process (e.g. a
>>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>>> start of the instance, not the "current" one. I fixed this by explicitly
>>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>>> something. (Also, EXEC_BACKEND may need to handle this too.)
>> 
>> The pg_checksums test is failing for me on this version due to the GUC not
>> being initialized, don't we need something like the below as well?  (With a
>> comment explaining why ReadControlFile wasn't enough.)
>> 
>> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
>>        Assert(reset || ControlFile == NULL);
>>        ControlFile = palloc(sizeof(ControlFileData));
>>        ReadControlFile();
>> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
>> 
> 
> Yeah, I think this (or something like it) is missing.

Thanks for confirming.

>> +uint32
>> +GetLocalDataChecksumVersion(void)
>> +{
>> + return LocalDataChecksumVersion;
>> +}
>> +
>> +/*
>> + * Get the *current* data_checksum_version (might not be written to control
>> + * file yet).
>> + */
>> +uint32
>> +GetCurrentDataChecksumVersion(void)
>> +{
>> + return XLogCtl->data_checksum_version;
>> +}
>> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
>> it from the Current value, and also to make appear less like actual copies of
>> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
>> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
>> should be a better API no?
>> 
> 
> FWIW those functions are for debug logging only, I needed to print the
> values in a couple places outside xlog.c. I don't intend to make that
> part of the patch.

Ah, gotcha, I never applied the debug patch from the patchset so I figured this
was a planned API.  The main question still stands though, if LocalDataCheckXX
can be confusing and CachedDataCheckXX would be better in order to
distinguish it from actual controlfile copies?

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-13T16:26:47Z

On 3/13/25 13:32, Daniel Gustafsson wrote:
>> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
>> On 3/13/25 10:54, Daniel Gustafsson wrote:
>>>> On 12 Mar 2025, at 14:16, Tomas Vondra <tomas@vondra.me> wrote:
> 
>>>> I believe the approach is correct, but the number of possible states
>>>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a
>>>> better way to handle this, but I can't think of any. Ideas?
>>>
>>> Not sure if this moves the needle too far in terms of complexity wrt to the
>>> previous version of the patch, there were already multiple copies.
>>
>> It does add one more place (XLogCtl->data_checksum_version) to store the
>> current state, so it's not that much more complex, ofc. But I was not
>> really comparing this to the previous patch version, I meant the state
>> space in general - all possible combinations of all the flags (control
>> file, local + xlogct).
> 
> Fair point.
> 
>> I wonder if it might be possible to have a more thorough validation of
>> the transitions. We already have that for the LocalDataChecksumVersion,
>> thanks to the asserts - and it was damn useful, otherwise we would not
>> have noticed this issue for a long time, I think.
>>
>> I wonder if we can have similar checks for the other flags. I'm pretty
>> sure we can have the same checks for XLogCtl, right?
> 
> I don't see why not, they should abide by the same rules.
> 

OK, I'll add these asserts.

>> I'm not quite sure
>> about ControlFile - can't that "skip" some of the changes, e.g. if we do
>> (enable->disable->enable) within a single checkpoint? Need to check.
> 
> For enable->disable->enable within a single checkpoint then ControlFile should
> never see the disable state.
> 

Hmm, that means we can't have the same checks for the ControlFile
fields, but I don't think that's a problem. We've verified the "path" to
that (on the XLogCtl field), so that seems fine.

>> This also reminds me I had a question about the barrier - can't it
>> happen a process gets to process multiple barriers at the same time? I
>> mean, let's say it gets stuck for a while, and the cluster happens to go
>> through disable+enable. Won't it then see both barriers? That'd be a
>> problem, because the core processes the barriers in the order determined
>> by the enum value, not in the order the barriers happened. Which means
>> it might break the expected state transitions again (and end with the
>> wrong local value). I haven't tried, though.
> 
> Interesting, that seems like a general deficiency in the barriers, surely
> processing them in-order would be more intuitive?  That would probably require
> some form of Lamport clock though.
> 

Yeah, that seems non-trivial. What if we instead ensured there can't be
two barriers set at the same time? Say, if we (somehow) ensured all
processes saw the previous barrier before allowing a new one, we would
not have this issue, right?

But I don't know what would be a good way to ensure this. Is there a way
to check if all processes saw the barrier? Any ideas?

>>>> One issue I ran into is the postmaster does not seem to be processing
>>>> the barriers, and thus not getting info about the data_checksum_version
>>>> changes.
>>>
>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>
>> Not sure I follow. What's a reasonable constraint?
> 
> That the postmaster deosn't process them.
> 

OK, that means we need a way to "refresh" the value for new child
processses, similar to what my patch does. But I suspect there might be
a race condition - if the child process starts while processing the
XLOG_CHECKUMS record, it might happen to get the new value and then also
the barrier (if it does the "refresh" in between the XLogCtl update and
the barrier). Doesn't this need some sort of interlock, preventing this?

The child startup would need to do this:

1) acquire lock
2) reset barriers
3) refresh the LocalDataChecksumValue (from XLogCtl)
4) release lock

while the walreceiver would do this

1) acquire lock
2) update XLogCtl value
3) emit barrier
4) release lock

Or is there a reason why this would be unnecessary?

>>>> That's fine until it needs to launch a child process (e.g. a
>>>> walreceiver), which will then see the LocalDataChecksumVersion as of the
>>>> start of the instance, not the "current" one. I fixed this by explicitly
>>>> refreshing the value in postmaster_child_launch(), but maybe I'm missing
>>>> something. (Also, EXEC_BACKEND may need to handle this too.)
>>>
>>> The pg_checksums test is failing for me on this version due to the GUC not
>>> being initialized, don't we need something like the below as well?  (With a
>>> comment explaining why ReadControlFile wasn't enough.)
>>>
>>> @@ -5319,6 +5319,7 @@ LocalProcessControlFile(bool reset)
>>>        Assert(reset || ControlFile == NULL);
>>>        ControlFile = palloc(sizeof(ControlFileData));
>>>        ReadControlFile();
>>> +       SetLocalDataChecksumVersion(ControlFile->data_checksum_version);
>>>
>>
>> Yeah, I think this (or something like it) is missing.
> 
> Thanks for confirming.
> 
>>> +uint32
>>> +GetLocalDataChecksumVersion(void)
>>> +{
>>> + return LocalDataChecksumVersion;
>>> +}
>>> +
>>> +/*
>>> + * Get the *current* data_checksum_version (might not be written to control
>>> + * file yet).
>>> + */
>>> +uint32
>>> +GetCurrentDataChecksumVersion(void)
>>> +{
>>> + return XLogCtl->data_checksum_version;
>>> +}
>>> I wonder if CachedDataChecksumVersion would be more appropriate to distinguish
>>> it from the Current value, and also to make appear less like actual copies of
>>> controlfile values like LocalMinRecoveryPoint.  Another thought is if we should
>>> have the GetLocalDataChecksumVersion() API?  GetCurrentDataChecksumVersion()
>>> should be a better API no?
>>>
>>
>> FWIW those functions are for debug logging only, I needed to print the
>> values in a couple places outside xlog.c. I don't intend to make that
>> part of the patch.
> 
> Ah, gotcha, I never applied the debug patch from the patchset so I figured this
> was a planned API.  The main question still stands though, if LocalDataCheckXX
> can be confusing and CachedDataCheckXX would be better in order to
> distinguish it from actual controlfile copies?
> 

Yeah, I'll think about the naming.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-13T23:11:39Z

On 3/13/25 17:26, Tomas Vondra wrote:
> On 3/13/25 13:32, Daniel Gustafsson wrote:
>>> On 13 Mar 2025, at 12:03, Tomas Vondra <tomas@vondra.me> wrote:
>>>
>>> ...
>>>
>>> This also reminds me I had a question about the barrier - can't it
>>> happen a process gets to process multiple barriers at the same time? I
>>> mean, let's say it gets stuck for a while, and the cluster happens to go
>>> through disable+enable. Won't it then see both barriers? That'd be a
>>> problem, because the core processes the barriers in the order determined
>>> by the enum value, not in the order the barriers happened. Which means
>>> it might break the expected state transitions again (and end with the
>>> wrong local value). I haven't tried, though.
>>
>> Interesting, that seems like a general deficiency in the barriers, surely
>> processing them in-order would be more intuitive?  That would probably require
>> some form of Lamport clock though.
>>
> 
> Yeah, that seems non-trivial. What if we instead ensured there can't be
> two barriers set at the same time? Say, if we (somehow) ensured all
> processes saw the previous barrier before allowing a new one, we would
> not have this issue, right?
> 
> But I don't know what would be a good way to ensure this. Is there a way
> to check if all processes saw the barrier? Any ideas?
> 

Actually, scratch this. There already is a way to do this, by using
WaitForProcSignalBarrier. And the XLOG_CHECKSUMS processing already
calls this. So we should not see two barriers at the same time ...

>>>>> One issue I ran into is the postmaster does not seem to be processing
>>>>> the barriers, and thus not getting info about the data_checksum_version
>>>>> changes.
>>>>
>>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>>
>>> Not sure I follow. What's a reasonable constraint?
>>
>> That the postmaster deosn't process them.
>>
> 
> OK, that means we need a way to "refresh" the value for new child
> processses, similar to what my patch does. But I suspect there might be
> a race condition - if the child process starts while processing the
> XLOG_CHECKUMS record, it might happen to get the new value and then also
> the barrier (if it does the "refresh" in between the XLogCtl update and
> the barrier). Doesn't this need some sort of interlock, preventing this?
> 
> The child startup would need to do this:
> 
> 1) acquire lock
> 2) reset barriers
> 3) refresh the LocalDataChecksumValue (from XLogCtl)
> 4) release lock
> 
> while the walreceiver would do this
> 
> 1) acquire lock
> 2) update XLogCtl value
> 3) emit barrier
> 4) release lock
> 
> Or is there a reason why this would be unnecessary?
> 

I still think this might be a problem. I wonder if we could maybe
leverage the barrier generation, to detect that we don't need to process
this barrier, because we already got the value directly ...

FWIW we'd have this problem even if postmaster was processing barriers,
because there'd always be a "gap" between the fork and ProcSignalInit()
registering the new process into the procsignal array.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-14T12:20:37Z

On 3/14/25 00:11, Tomas Vondra wrote:
> ...
>>>>>> One issue I ran into is the postmaster does not seem to be processing
>>>>>> the barriers, and thus not getting info about the data_checksum_version
>>>>>> changes.
>>>>>
>>>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>>>
>>>> Not sure I follow. What's a reasonable constraint?
>>>
>>> That the postmaster deosn't process them.
>>>
>>
>> OK, that means we need a way to "refresh" the value for new child
>> processses, similar to what my patch does. But I suspect there might be
>> a race condition - if the child process starts while processing the
>> XLOG_CHECKUMS record, it might happen to get the new value and then also
>> the barrier (if it does the "refresh" in between the XLogCtl update and
>> the barrier). Doesn't this need some sort of interlock, preventing this?
>>
>> The child startup would need to do this:
>>
>> 1) acquire lock
>> 2) reset barriers
>> 3) refresh the LocalDataChecksumValue (from XLogCtl)
>> 4) release lock
>>
>> while the walreceiver would do this
>>
>> 1) acquire lock
>> 2) update XLogCtl value
>> 3) emit barrier
>> 4) release lock
>>
>> Or is there a reason why this would be unnecessary?
>>
> 
> I still think this might be a problem. I wonder if we could maybe
> leverage the barrier generation, to detect that we don't need to process
> this barrier, because we already got the value directly ...
> 
> FWIW we'd have this problem even if postmaster was processing barriers,
> because there'd always be a "gap" between the fork and ProcSignalInit()
> registering the new process into the procsignal array.
> 

I experimented with this a little bit, and unfortunately I ran into not
one, but two race conditions in this :-( I don't have reproducers, all
of this was done by manually adding sleep() calls / gdb breakpoints to
pause the processes for a while, but I'll try to explain what/why ...

1) race #1: SetDataChecksumsOn

The function (and all the other "SetDataChecksums" funcs) does this

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    SpinLockRelease(&XLogCtl->info_lck);

    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);

Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
process may start during that, and it'll read the current checksum value
from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
barrier. So far so good.

But the new backend is already registered in ProcSignal, so it'll get
the barrier too, and will try to set the local version to "on" again.
And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():

    Assert(LocalDataChecksumVersion ==
           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

The other "SetDataChecksums" have the same issue, except that in those
cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
such assert to check the state transition.

This is "ephemeral" in the sense that setting the value to "on" again
would be harmless, and indeed a non-assert build will run just fine.

2) race #2: InitPostgres

The InitPostgres does this:

    InitLocalControldata();

    ProcSignalInit(MyCancelKeyValid, MyCancelKey);

where InitLocalControldata gets the current checksum value from XLogCtl,
and ProcSignalInit registers the backend into the procsignal (which is
what barriers are based on).

Imagine there's a sleep() between these two calls, and the cluster does
not have checksums enabled. A backend will start, will read "off" from
XLogCtl, and then gets stuck on the sleep before it gets added to the
procsignal/barrier array.

Now, we enable checksums, and the instance goes through 'inprogress-on'
and 'on' states. This completes, and the backend wakes up and registers
itself into procsignal - but it won't get any barriers, of course.

So we end up with an instance with data_checksums="on", but this one
backend still believes data_checksums="on". This can cause a lot of
trouble, because it won't write blocks with checksums. I.e. this is
persistent data corruption.

I have been thinking about how to fix this. One way would be to
introduce some sort of locking, so that the two steps (update of the
XLogCtl version + barrier emit) and (local flag init + procsignal init)
would always happen atomically. So, something like this:

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    SpinLockRelease(&XLogCtl->info_lck);

and

    SpinLockAcquire(&XLogCtl->info_lck);
    InitLocalControldata();
    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    SpinLockRelease(&XLogCtl->info_lck);

But that seems pretty heavy-handed, it's definitely much more work while
holding a spinlock than I'm comfortable with, and I wouldn't be
surprised if there were deadlock cases etc. (FWIW I believe it needs to
use XLogCtl->info_lck, to make the value consistent with checkpoints.)

Anyway, I think a much simpler solution would be to reorder InitPostgres
like this:

    ProcSignalInit(MyCancelKeyValid, MyCancelKey);

    InitLocalControldata();

i.e. to first register into procsignal, and then read the new value.
AFAICS this guarantees we won't lose any checksum version updates. It
does mean we still can get a barrier for a value we've already seen, but
I think we should simply ignore this for the very first update.

Opinions? Other ideas how to fix this?

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-03-14T13:38:06Z

> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:

> I experimented with this a little bit, and unfortunately I ran into not
> one, but two race conditions in this :-( I don't have reproducers, all
> of this was done by manually adding sleep() calls / gdb breakpoints to
> pause the processes for a while, but I'll try to explain what/why ...

Ugh. Thanks for this!

> 1) race #1: SetDataChecksumsOn
> 
> The function (and all the other "SetDataChecksums" funcs) does this
> 
>    SpinLockAcquire(&XLogCtl->info_lck);
>    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
>    SpinLockRelease(&XLogCtl->info_lck);
> 
>    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
> 
> Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
> process may start during that, and it'll read the current checksum value
> from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
> barrier. So far so good.
> 
> But the new backend is already registered in ProcSignal, so it'll get
> the barrier too, and will try to set the local version to "on" again.
> And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():
> 
>    Assert(LocalDataChecksumVersion ==
>           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> 
> The other "SetDataChecksums" have the same issue, except that in those
> cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
> such assert to check the state transition.
> 
> This is "ephemeral" in the sense that setting the value to "on" again
> would be harmless, and indeed a non-assert build will run just fine.

As mentioned off-list, being able to loosen the restriction for the first
barrier seen seem like a good way to keep this assertion.  Removing it is of
course the alternative solution, as it's not causing any issues, but given how
handy it's been to find actual issues it would be good to be able to keep it.

> 2) race #2: InitPostgres
> 
> The InitPostgres does this:
> 
>    InitLocalControldata();
> 
>    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
> 
> where InitLocalControldata gets the current checksum value from XLogCtl,
> and ProcSignalInit registers the backend into the procsignal (which is
> what barriers are based on).
> 
> Imagine there's a sleep() between these two calls, and the cluster does
> not have checksums enabled. A backend will start, will read "off" from
> XLogCtl, and then gets stuck on the sleep before it gets added to the
> procsignal/barrier array.
> 
> Now, we enable checksums, and the instance goes through 'inprogress-on'
> and 'on' states. This completes, and the backend wakes up and registers
> itself into procsignal - but it won't get any barriers, of course.
> 
> So we end up with an instance with data_checksums="on", but this one
> backend still believes data_checksums="on". This can cause a lot of
> trouble, because it won't write blocks with checksums. I.e. this is
> persistent data corruption.
> 
> I have been thinking about how to fix this. One way would be to
> introduce some sort of locking, so that the two steps (update of the
> XLogCtl version + barrier emit) and (local flag init + procsignal init)
> would always happen atomically. So, something like this:
> 
>    SpinLockAcquire(&XLogCtl->info_lck);
>    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
>    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
>    SpinLockRelease(&XLogCtl->info_lck);
> 
> and
> 
>    SpinLockAcquire(&XLogCtl->info_lck);
>    InitLocalControldata();
>    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
>    SpinLockRelease(&XLogCtl->info_lck);
> 
> But that seems pretty heavy-handed, it's definitely much more work while
> holding a spinlock than I'm comfortable with, and I wouldn't be
> surprised if there were deadlock cases etc. (FWIW I believe it needs to
> use XLogCtl->info_lck, to make the value consistent with checkpoints.)

Yeah, that seems quite likely to introduce a new set if issues.

> Anyway, I think a much simpler solution would be to reorder InitPostgres
> like this:
> 
>    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
> 
>    InitLocalControldata();

Agreed.

> i.e. to first register into procsignal, and then read the new value.
> AFAICS this guarantees we won't lose any checksum version updates. It
> does mean we still can get a barrier for a value we've already seen, but
> I think we should simply ignore this for the very first update.

Calling functions with sideeffects in setting state seems like a bad idea
before ProcSignalInit has run, that's thinko on my part in this patch.  Your
solution of reordering seems like the right way to handle this.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-03-14T14:06:52Z

> On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:

>> This is "ephemeral" in the sense that setting the value to "on" again
>> would be harmless, and indeed a non-assert build will run just fine.
> 
> As mentioned off-list, being able to loosen the restriction for the first
> barrier seen seem like a good way to keep this assertion.  Removing it is of
> course the alternative solution, as it's not causing any issues, but given how
> handy it's been to find actual issues it would be good to be able to keep it.
> 
>> i.e. to first register into procsignal, and then read the new value.
>> AFAICS this guarantees we won't lose any checksum version updates. It
>> does mean we still can get a barrier for a value we've already seen, but
>> I think we should simply ignore this for the very first update.
> 
> Calling functions with sideeffects in setting state seems like a bad idea
> before ProcSignalInit has run, that's thinko on my part in this patch.  Your
> solution of reordering seems like the right way to handle this.

0006 in the attached version is what I have used when testing the above, along
with an update to the copyright year which I had missed doing earlier.  It also
contains the fix in LocalProcessControlFile which I had in my local tree, I
think we need something like that at least.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-15T15:50:02Z

On 3/14/25 15:06, Daniel Gustafsson wrote:
>> On 14 Mar 2025, at 14:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>>> On 14 Mar 2025, at 13:20, Tomas Vondra <tomas@vondra.me> wrote:
> 
>>> This is "ephemeral" in the sense that setting the value to "on" again
>>> would be harmless, and indeed a non-assert build will run just fine.
>>
>> As mentioned off-list, being able to loosen the restriction for the first
>> barrier seen seem like a good way to keep this assertion.  Removing it is of
>> course the alternative solution, as it's not causing any issues, but given how
>> handy it's been to find actual issues it would be good to be able to keep it.
>>
>>> i.e. to first register into procsignal, and then read the new value.
>>> AFAICS this guarantees we won't lose any checksum version updates. It
>>> does mean we still can get a barrier for a value we've already seen, but
>>> I think we should simply ignore this for the very first update.
>>
>> Calling functions with sideeffects in setting state seems like a bad idea
>> before ProcSignalInit has run, that's thinko on my part in this patch.  Your
>> solution of reordering seems like the right way to handle this.
> 
> 0006 in the attached version is what I have used when testing the above, along
> with an update to the copyright year which I had missed doing earlier.  It also
> contains the fix in LocalProcessControlFile which I had in my local tree, I
> think we need something like that at least.
>

Thanks, here's an updated patch version - I squashed all the earlier
parts, but I kept your changes and my adjustments separate, for clarity.
A couple comments:

1) I don't think the comment before InitialDataChecksumTransition was
entirely accurate, because it said we can see the duplicate state only
for "on" state. But AFAICS we can see duplicate values for any states,
except that we only have an assert for the "on" so we don't notice the
other cases. I wonder if we could strengthen this a bit, by adding some
asserts for the other states too.

2) I admit it's rather subjective, but I didn't like how you did the
assert in AbsorbChecksumsOnBarrier. But looking at it now in the diff,
maybe it was more readable ...

3) I renamed InitLocalControldata() to InitLocalDataChecksumVersion().
The name was entirely misleading, because it now initializes the flag in
XLogCtl, it has nothing to do with control file.

4) I realized AuxiliaryProcessMainCommon() may be a better place to
initialize the checksum flag for non-backend processes. In fact, doing
it in postmaster_child_launch() had the same race condition because it
happened before ProcSignalInit().

I'm sure there's cleanup possible in a bunch of places, but the really
bad thing is I realized the handling on a standby is not quite correct.
I don't know what exactly is happening, there's too many moving bits,
but here's what I see ...

Every now and then, after restarting the standby, it logs a bunch of
page verification failures. Stuff like this:

  WARNING:  page verification failed, calculated checksum 9856 but
            expected 0
  CONTEXT:  WAL redo at 0/3447BA8 for Heap2/VISIBLE:
            snapshotConflictHorizon: 0, flags: 0x03; blkref #0: rel
            1663/16384/16401, fork 2, blk 0 FPW; blkref #1: rel
            1663/16384/16401, blk 0
  WARNING:  page verification failed, LSN 0/CF54C10

This is after an immediate shutdown, but I've seen similar failures for
fast shutdowns too (the root causes may be different / may need a
different fix, not sure).

The instance restarts, and the "startup" process starts recovery

  LOG:  redo starts at 0/2000028

This matches LSN from the very first start of the standby - there were
no restart points since then, apparently. And since then the primary did
this with the checksums (per pg_waldump):

  lsn: 0/0ECCFC48, prev 0/0ECCFBA0, desc: CHECKSUMS inprogress-off
  lsn: 0/0ECD0168, prev 0/0ECD0128, desc: CHECKSUMS off

The instance already saw both records before the immediate shutdown (per
the logging in patch 0004), but after the restart the instance goes back
to having checksums enabled again

  data_checksum_version = 1

Which is correct, because it starts at 0/2000028, which is before either
of the XLOG_CHECKSUMS records. But then at 0/3447BA8 (which is *before*
either of the checksum changes) it tries to read a page from disk, and
hits a checksum error. That page is from the future (per the page LSN
logged by patch 0004), but it's still before both XLOG_CHECKSUMS
messages. So how come the page has pd_checksum 0?

I'd have understood if the page came "broken" from the primary, but I've
not seen a single page verification failure on that side (and it's
subject to the same fast/immediate restarts, etc).

I wonder if this might be related to how we enforce checkpoints only
when setting the checksums to "on" on the primary. Maybe that's safe on
primary but not on a standby?

FWIW I've seen similar issues for "fast" shutdowns too - at least the
symptoms are similar, but the mechanism might be a bit different. In
particular, I suspect there's some sort of thinko in updating the
data_checksum_version in the control file, but I can't put my finger on
it yet.

Another thing I noticed is this comment in CreateRestartPoint(), before
one of the early exits:

/*
 * If the last checkpoint record we've replayed is already our last
 * restartpoint, we can't perform a new restart point. We still update
 * minRecoveryPoint in that case, so that if this is a shutdown restart
 * point, we won't start up earlier than before. That's not strictly
 * necessary, but when hot standby is enabled, it would be rather weird
 * if the database opened up for read-only connections at a
 * point-in-time before the last shutdown. Such time travel is still
 * possible in case of immediate shutdown, though.
 * ...

I wonder if this "time travel backwards" might be an issue for this too,
because it might mean we end up picking the wrong data_checksum_version
from the control file. In any case, if this happens, we don't get to the
ControlFile->data_checksum_version update a bit further down. And
there's another condition that can skip that.

I'll continue investigating this next week, but at this point I'm quite
confused and would be grateful for any insights ...

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Andres Freund <andres@anarazel.de> — 2025-03-15T16:26:12Z

Jo.

On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
> Thanks, here's an updated patch version

FWIW, this fails in CI;

https://cirrus-ci.com/build/4678473324691456
On all OSs:
[16:08:36.331] #   Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr'
[16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/001_initdb.pl line 132.
[16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
[16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0 (UPDATED)

Windows & Compiler warnings:
[16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error C1083: Cannot open include file: 'execinfo.h': No such file or directory

[16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or directory
[16:18:52.385]    25 | #include <execinfo.h>
[16:18:52.385]       |          ^~~~~~~~~~~~

Greetings,

Andres Freund

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-03-15T17:32:57Z

On 3/15/25 17:26, Andres Freund wrote:
> Jo.
> 
> On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
>> Thanks, here's an updated patch version
> 
> FWIW, this fails in CI;
> 
> https://cirrus-ci.com/build/4678473324691456
> On all OSs:
> [16:08:36.331] #   Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr'
> [16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/001_initdb.pl line 132.
> [16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
> [16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0 (UPDATED)
> 
> Windows & Compiler warnings:
> [16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error C1083: Cannot open include file: 'execinfo.h': No such file or directory
> 
> [16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or directory
> [16:18:52.385]    25 | #include <execinfo.h>
> [16:18:52.385]       |          ^~~~~~~~~~~~
> 
> Greetings,

Yeah, that's just the "debug stuff" - I don't expect any of that to be
included in the commit, I only posted it for convenience. It adds a lot
of debug logging, which I hope might help others to understand what the
problem with checksums on standby is.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Alexander Korotkov <aekorotkov@gmail.com> — 2025-04-04T07:55:32Z

Hi!

On Sat, Mar 15, 2025 at 7:33 PM Tomas Vondra <tomas@vondra.me> wrote:

> On 3/15/25 17:26, Andres Freund wrote:
> > Jo.
> >
> > On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote:
> >> Thanks, here's an updated patch version
> >
> > FWIW, this fails in CI;
> >
> > https://cirrus-ci.com/build/4678473324691456
> > On all OSs:
> > [16:08:36.331] #   Failed test 'options --locale-provider=icu
> --locale=und --lc-*=C: no stderr'
> > [16:08:36.331] #   at /tmp/cirrus-ci-build/src/bin/initdb/t/
> 001_initdb.pl line 132.
> > [16:08:36.331] #          got: '2025-03-15 16:08:26.216 UTC [63153]
> LOG:  XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
> > [16:08:36.331] # 2025-03-15 16:08:26.216 UTC [63153] LOG:
> XLogCtl->data_checksum_version 0 ControlFile->data_checksum_version 0
> (UPDATED)
> >
> > Windows & Compiler warnings:
> > [16:05:08.723] ../src/backend/storage/page/bufpage.c(25): fatal error
> C1083: Cannot open include file: 'execinfo.h': No such file or directory
> >
> > [16:18:52.385] bufpage.c:25:10: fatal error: execinfo.h: No such file or
> directory
> > [16:18:52.385]    25 | #include <execinfo.h>
> > [16:18:52.385]       |          ^~~~~~~~~~~~
> >
> > Greetings,
>
> Yeah, that's just the "debug stuff" - I don't expect any of that to be
> included in the commit, I only posted it for convenience. It adds a lot
> of debug logging, which I hope might help others to understand what the
> problem with checksums on standby is.
>

I took a look at this patch.  I have following notes.
1) I think reporting of these errors could be better, more detailed.
Especially the second one could be similar to some of other errors on
checksums processing.
            ereport(ERROR,
                    (errmsg("failed to start background worker to process
data checksums")));
            ereport(ERROR,
                    (errmsg("unable to enable data checksums in cluster")));

2) ProcessAllDatabases() contains loop, which repeats scanning the new
databases for checkums.  It continues while there are new database on each
iteration.  Could we just limit the number of iterations to 2?  Given at
each step we're calling WaitForAllTransactionsToFinish(), everything that
gets created after first WaitForAllTransactionsToFinish() call should have
checksums enabled in the beginning.

------
Regards,
Alexander Korotkov
Supabase

Re: Changing the state of data checksums in a running cluster

Bernd Helmle <mailings@oopsware.de> — 2025-07-11T15:53:58Z

Am Samstag, dem 15.03.2025 um 16:50 +0100 schrieb Tomas Vondra:
> 

> I wonder if this "time travel backwards" might be an issue for this
> too,
> because it might mean we end up picking the wrong
> data_checksum_version
> from the control file. In any case, if this happens, we don't get to
> the
> ControlFile->data_checksum_version update a bit further down. And
> there's another condition that can skip that.
> 
> 
> I'll continue investigating this next week, but at this point I'm
> quite
> confused and would be grateful for any insights ...
> 

Hi,

Since i wanted to dig a little deeper in this patch i took the
opportunity and rebased it to current master, hopefully not having
broken something seriously.

Thanks,
	Bernd

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-07-12T19:39:30Z

> On 11 Jul 2025, at 17:53, Bernd Helmle <mailings@oopsware.de> wrote:

> Since i wanted to dig a little deeper in this patch i took the
> opportunity and rebased it to current master, hopefully not having
> broken something seriously.

Thanks, much appreciated!

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-08-16T19:34:03Z

Attached is a rebase on top of the func.sgml changes which caused this to no
longer apply.

This version is also substantially updated with a new injection point based
test suite, fixed a few bugs (found by said test suite), added checkpoint to
disabling checksums, code cleanup, more granular wait events, comment rewrites
and additions and more smaller cleanups.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Bruce Momjian <bruce@momjian.us> — 2025-08-19T16:21:41Z

On Sat, Aug 16, 2025 at 09:34:03PM +0200, Daniel Gustafsson wrote:
> Attached is a rebase on top of the func.sgml changes which caused this to no
> longer apply.
> 
> This version is also substantially updated with a new injection point based
> test suite, fixed a few bugs (found by said test suite), added checkpoint to
> disabling checksums, code cleanup, more granular wait events, comment rewrites
> and additions and more smaller cleanups.

I am very glad you went simple and didn't attempt restarting this
process from the place it stopped:

    If the cluster is stopped while in <literal>inprogress-on</literal>
    mode, for any reason, then this process must be
    restarted manually. To do this, re-execute the function
    <function>pg_enable_data_checksums()</function> once the cluster has
    been restarted. The process will start over, there is no support for
    resuming work from where it was interrupted.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-20T14:37:33Z

On 8/16/25 21:34, Daniel Gustafsson wrote:
> Attached is a rebase on top of the func.sgml changes which caused this to no
> longer apply.
> 
> This version is also substantially updated with a new injection point based
> test suite, fixed a few bugs (found by said test suite), added checkpoint to
> disabling checksums, code cleanup, more granular wait events, comment rewrites
> and additions and more smaller cleanups.
> 

Thanks for the updated patch.

The injection points seem like a huge improvement, allowing testing of
different code paths in a more deterministic way.

I started running the stress test, using pretty much exactly the version
posted in March [1]. And so far I noticed only one issue, when the
standby reports mismatched checksums on a fsm:

LOG:  page verification failed, calculated checksum 24786 but expected 24760
CONTEXT:  WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples:
185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
LOG:  invalid page in block 2 of relation base/16384/16403_fsm; zeroing
out page
CONTEXT:  WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples:
185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
WARNING:  invalid page in block 2 of relation base/16384/16403_fsm;
zeroing out page
CONTEXT:  WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples:
185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
LOG:  page verification failed, calculated checksum 37048 but expected 0
CONTEXT:  WAL redo at 0/0344D7E0 for Heap2/MULTI_INSERT+INIT: ntuples:
61, flags: 0x28; blkref #0: rel 1663/16384/16400, blk 0
LOG:  invalid page in block 2 of relation base/16384/16400_fsm; zeroing
out page

This happens quite regularly, it's not hard to hit. But I've only seen
it to happen on a FSM, and only right after immediate shutdown. I don't
think that's quite expected.

I believe the built-in TAP tests (with injection points) can't catch
this, because there's no concurrent activity while flipping checksums
on/off. It'd be good to do something like that, by running pgbench in
the background, or something like that.

I also don't see any restarts of the primary/standby. That might be good
to do too.

I plan to randomize the stress test a bit more, once this FSM issue gets
fixed. Maybe that'll find some additional issues.

[1]
https://www.postgresql.org/message-id/f528413c-477a-4ec3-a0df-e22a80ffbe41@vondra.me

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-20T17:02:57Z

Hi,

I think there's a minor issue in how pg_checksums validates state before
checking the data.

The current patch simply does:

  if (ControlFile->data_checksum_version == 0 &&
      mode == PG_MODE_CHECK)
      pg_fatal("data checksums are not enabled in cluster");

and that worked when the version was either 0 or 1. But now it can be
also 2 or 3, for inprogress-on / inprogress-off, and if the cluster gets
shut down at the right moment, that can end in the control file.

It doesn't make sense to verify checksums in such cluster, pg_checksums
should handle that as "off", i.e. error out.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-08-25T18:32:51Z

> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:

> This happens quite regularly, it's not hard to hit. But I've only seen
> it to happen on a FSM, and only right after immediate shutdown. I don't
> think that's quite expected.
> 
> I believe the built-in TAP tests (with injection points) can't catch
> this, because there's no concurrent activity while flipping checksums
> on/off. It'd be good to do something like that, by running pgbench in
> the background, or something like that.

In searching for this bug I opted for implementing a version of the stress
tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
version.  It's gated behind PG_TEST_EXTRA since it's clearly not something
which can be enabled by default (if this goes in this need to be re-done to
provide two levels IMO, but during testing this is more convenient).  I'm
curious to see which improvements you can think to make it stress the code to
the breaking point.

> I think there's a minor issue in how pg_checksums validates state before
> checking the data.
> 
> The current patch simply does:
> 
>  if (ControlFile->data_checksum_version == 0 &&
>      mode == PG_MODE_CHECK)
>      pg_fatal("data checksums are not enabled in cluster");
> 
> and that worked when the version was either 0 or 1. But now it can be
> also 2 or 3, for inprogress-on / inprogress-off, and if the cluster gets
> shut down at the right moment, that can end in the control file.

Good point, I've changed the test to check for checksums being enabled rather
than checking if they are disabled.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-25T23:06:24Z

On 8/25/25 20:32, Daniel Gustafsson wrote:
>> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> This happens quite regularly, it's not hard to hit. But I've only seen
>> it to happen on a FSM, and only right after immediate shutdown. I don't
>> think that's quite expected.
>>
>> I believe the built-in TAP tests (with injection points) can't catch
>> this, because there's no concurrent activity while flipping checksums
>> on/off. It'd be good to do something like that, by running pgbench in
>> the background, or something like that.
> 
> In searching for this bug I opted for implementing a version of the stress
> tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
> version.  It's gated behind PG_TEST_EXTRA since it's clearly not something
> which can be enabled by default (if this goes in this need to be re-done to
> provide two levels IMO, but during testing this is more convenient).  I'm
> curious to see which improvements you can think to make it stress the code to
> the breaking point.
> 

I think this TAP looks very nice, but there's a couple issues with it.
See the attached patch fixing those.

1) I think test_checksums should be in src/test/modules/Makefile?

2) The test_checksums/Makefile didn't seem to work for me, I was getting

Makefile:23: *** recipe commences before first target.  Stop.

Because there was a missing "\" so I had to fix that. And then it was
complaining about Makefile.global or something, so I fixed that by
cargo-culting what other Makefiles in test modules do. Now it seems to
work for me. I guess you're on meson?

3) I'm no perl expert, but AFAICS the test wasn't really running the
pgbench, for a couple of reasons. It was passing "-q" to pgbench, but
that's only for initialization. The clusters had max_connections=10, but
the pgbench was using "-c 10", so I was getting "too many connections".
It was not setting "$pgbench_running = 1" so the other loops were
getting "too many connections" too. Another thing is I'm not sure it's
OK to pass '' to IPC::Run::start, I think it'll take it as an argument,
confusing pgbench.

With these changes it runs for me, and I even saw some

   LOG: page verification failed

in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
while - a couple minutes, maybe? I think I saw it at

    t/006_concurrent_pgbench.pl .. 427/?

or something like that. I think the bash version did a couple things
differently, which might make the failures more frequent (but it's just
a wild guess).

In particular, I think the script restarts the two nodes independently,
while the TAP always stops both primary and standby, in this order. I
think it'd be useful to restart one or both.

The other thing is the bash script added some random delays/sleep, which
increases the test duration, but it also means generating somewhat
random amounts of data, etc. It also randomized some other stuff (scale,
client count, ...). But that can wait.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-08-27T08:30:18Z

> On 26 Aug 2025, at 01:06, Tomas Vondra <tomas@vondra.me> wrote:

> I think this TAP looks very nice, but there's a couple issues with it.
> See the attached patch fixing those.

Thanks, I have incorporated (most of) your patch in the attached.  I did keep
the PG_TEST_EXTRA check for injection points though which I assume were removed
out of mistake.

> With these changes it runs for me, and I even saw some
> 
>   LOG: page verification failed
> 
> in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
> while - a couple minutes, maybe? I think I saw it at
> 
>    t/006_concurrent_pgbench.pl .. 427/?

That's very interesting, I have been running it to timeout several times in a
row without hitting any verification failures.  Will keep running.

> or something like that. I think the bash version did a couple things
> differently, which might make the failures more frequent (but it's just
> a wild guess).
> 
> In particular, I think the script restarts the two nodes independently,
> while the TAP always stops both primary and standby, in this order. I
> think it'd be useful to restart one or both.

Done in the attached, it will now randomly stop one or both or none.  If the
node is stopped I've added an offline pg_checksum step to validate the
datafiles as a why-not test.

> The other thing is the bash script added some random delays/sleep, which
> increases the test duration, but it also means generating somewhat
> random amounts of data, etc. It also randomized some other stuff (scale,
> client count, ...). But that can wait.

Added as well in a few places, maybe more can be sprinkled in.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-27T09:39:35Z


On 8/27/25 10:30, Daniel Gustafsson wrote:
>> On 26 Aug 2025, at 01:06, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> I think this TAP looks very nice, but there's a couple issues with it.
>> See the attached patch fixing those.
> 
> Thanks, I have incorporated (most of) your patch in the attached.  I did keep
> the PG_TEST_EXTRA check for injection points though which I assume were removed
> out of mistake.
> 

Yes, that was a mistake.

>> With these changes it runs for me, and I even saw some
>>
>>   LOG: page verification failed
>>
>> in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
>> while - a couple minutes, maybe? I think I saw it at
>>
>>    t/006_concurrent_pgbench.pl .. 427/?
> 
> That's very interesting, I have been running it to timeout several times in a
> row without hitting any verification failures.  Will keep running.
> 

Just to be clear - I don't see any pg_checksums failures either. I only
see failures in the standby log, and I don't think the script checks
that (it probably should).

>> or something like that. I think the bash version did a couple things
>> differently, which might make the failures more frequent (but it's just
>> a wild guess).
>>
>> In particular, I think the script restarts the two nodes independently,
>> while the TAP always stops both primary and standby, in this order. I
>> think it'd be useful to restart one or both.
> 
> Done in the attached, it will now randomly stop one or both or none.  If the
> node is stopped I've added an offline pg_checksum step to validate the
> datafiles as a why-not test.
> 
>> The other thing is the bash script added some random delays/sleep, which
>> increases the test duration, but it also means generating somewhat
>> random amounts of data, etc. It also randomized some other stuff (scale,
>> client count, ...). But that can wait.
> 
> Added as well in a few places, maybe more can be sprinkled in.
> 

Thanks. I'll take a look.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-08-27T11:00:04Z

> On 27 Aug 2025, at 11:39, Tomas Vondra <tomas@vondra.me> wrote:

> Just to be clear - I don't see any pg_checksums failures either. I only
> see failures in the standby log, and I don't think the script checks
> that (it probably should).

Right, that's what I'm been checking too.  I have been considering adding
another background process for monitoring all the log entries but I just
thought of a much simpler solution.  When the clusters are turned off we can
take the opportunity to slurp the log written since last restart and inspect
it. The attached adds this.

It would probably be good to at some point clean this up a little by placing
all of variables for a single node in an associative hash which can be passed
around, and place repeated code in subroutines etc..

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-27T12:39:38Z

On 8/27/25 13:00, Daniel Gustafsson wrote:
>> On 27 Aug 2025, at 11:39, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> Just to be clear - I don't see any pg_checksums failures either. I only
>> see failures in the standby log, and I don't think the script checks
>> that (it probably should).
> 
> Right, that's what I'm been checking too.  I have been considering adding
> another background process for monitoring all the log entries but I just
> thought of a much simpler solution.  When the clusters are turned off we can
> take the opportunity to slurp the log written since last restart and inspect
> it. The attached adds this.
> 

There's still a couple issues, unfortunately. First, this may not do
what you intended:

  sub cointoss
  {
    return int(rand(2) == 1);
  }

The rand() call returns values in [0, 2.0), it's very unlikely to
produce 1, and so there are no restarts. I fixed this to

  sub cointoss
  {
    return int(rand() < 0.5);
  }

and then it starts working, restarting with ~50% probability.

Then it hits another problem when calling pg_checksums. That only works
after a "clean" shutdown, not after immediate one.This can't just check
is_alive, it needs to remember how exactly was the server stopped and
only verify checksums for 'fast' mode. I never saw any failures in
pg_checksums, so I just commented that out.

Then it starts working as intended, I think. I only did a couple runs so
far, but I haven't found any checksum failures ... Then I realized I did
the earlier runs with slightly stale master HEAD (38c5fbd97ee6a to be
precise), while this time I did git pull.

And this happened on Friday:

commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date:   Fri Aug 22 18:44:39 2025 +0300

    Revert "Get rid of WALBufMappingLock"

    This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
    It appears that conditional variables are not suitable for use
    inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
    face postmaster death, they exit, releasing all locks instead of
    PANIC.  In certain situations, this leads to data corruption.

    ...

I think it's very likely the checksums were broken by this. After all,
that linked thread has subject "VM corruption on standby" and I've only
ever seen checksum failures on standby on the _vm fork.

There's also an issue when calling teardown_node at the end. That won't
work if the instance got stopped in the last round, and the test will
fail like this:

t/006_concurrent_pgbench.pl .. 379/? # Tests were run but no plan was
declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 379.
t/006_concurrent_pgbench.pl .. Dubious, test returned 29 (wstat 7424,
0x1d00)
All 379 subtests passed

> It would probably be good to at some point clean this up a little by placing
> all of variables for a single node in an associative hash which can be passed
> around, and place repeated code in subroutines etc..
> 

Yeah. For me Perl is hard to read in any case, but if we can cleanup
this a little bit before additional cases, that'd be nice.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-27T12:42:05Z

On 8/27/25 14:39, Tomas Vondra wrote:
> ...
>
> And this happened on Friday:
> 
> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
> Author: Alexander Korotkov <akorotkov@postgresql.org>
> Date:   Fri Aug 22 18:44:39 2025 +0300
> 
>     Revert "Get rid of WALBufMappingLock"
> 
>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>     It appears that conditional variables are not suitable for use
>     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
>     face postmaster death, they exit, releasing all locks instead of
>     PANIC.  In certain situations, this leads to data corruption.
> 
>     ...
> 
> I think it's very likely the checksums were broken by this. After all,
> that linked thread has subject "VM corruption on standby" and I've only
> ever seen checksum failures on standby on the _vm fork.
> 

Forgot to mention - I did try with c13070a27b reverted, and with that I
can reproduce the checksum failures again (using the fixed TAP test).

It's not a definitive proof, but it's a hint c13070a27b63 was causing
the checksum failures.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-28T16:11:18Z

Hi,

I spent a bit more time fixing the TAP test. The attached patch makes it
"work" for me (or I think it should, in principle). I'm not saying it's
the best way to do stuff.

With the patch applied, I tried running it, and I got a failure when
running pg_checksums. There's a log snippet describing the issue, but
AFAICS it's happening like this:

1) checksums are disabled
2) flip_data_checksums gets called
3) both clusters go through 'inprogress-on' and 'on' states
4) primary gets shutdown in 'immediate' mode
5) standby gets shutdown in 'fast' mode
6) we try to validate checksums on the standby, but control file still
says checksums=inprogress-on

This seems like a bug to me - AFAICS the expectation is that after fast
shutdown, we don't forget the checksum state. Or is that expected? In
that case the TAP test probably needs to check the control file, instead
of relying on the perl variable $data_checksum_state. Or maybe it should
check that the control file has the correct / expected state?

FWIW I don't think the primary shutdown matters. I've seen multiple of
these failures, and it happens even without primary shutdown. But the
standby "fast" shutdown is always there.

But this also shows a limitation of the TAP test - it never triggers the
shutdowns while flipping the checksums (in flip_data_checksums). I think
that's something worth testing.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-29T14:26:41Z

On 8/27/25 14:42, Tomas Vondra wrote:
> On 8/27/25 14:39, Tomas Vondra wrote:
>> ...
>>
>> And this happened on Friday:
>>
>> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
>> Author: Alexander Korotkov <akorotkov@postgresql.org>
>> Date:   Fri Aug 22 18:44:39 2025 +0300
>>
>>     Revert "Get rid of WALBufMappingLock"
>>
>>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>>     It appears that conditional variables are not suitable for use
>>     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
>>     face postmaster death, they exit, releasing all locks instead of
>>     PANIC.  In certain situations, this leads to data corruption.
>>
>>     ...
>>
>> I think it's very likely the checksums were broken by this. After all,
>> that linked thread has subject "VM corruption on standby" and I've only
>> ever seen checksum failures on standby on the _vm fork.
>>
> 
> Forgot to mention - I did try with c13070a27b reverted, and with that I
> can reproduce the checksum failures again (using the fixed TAP test).
> 
> It's not a definitive proof, but it's a hint c13070a27b63 was causing
> the checksum failures.
> 

Unfortunately, it seems I spoke too soon :-( I decided to test this on
multiple machines overnight, and it still fails on the slower ones.

Attached is a patch addressing a couple more issues, to makes the TAP
test work well enough. (Attached as .txt, to not confuse cfbot).

- The pgbench started by IPC::Run::start() needs to be finished, to
release resources. Otherwise it leaks file descriptors (and there's a
bunch of "defunct" pgbench processes), which may be a problem with
increased number of iterations.

- AFAICS the pgbench can't use stdin/stdout/stderr, otherwise the pipes
get broken when the command fails (after restart). I just used /dev/null
instead, the stdout/stderr was not used anyway (or was it?).

- commented out the pg_checksums call, because of the issues mentioned
earlier (I was trying to make it work by remembering the state, but it
seems to not make it into control file before shutdown occasionally)

I increased the number of iterations to 1000+ and ran it on three machines:

- ryzen (new machine from ~2024)
- xeon (old slow machine from ~2016)
- rpi5 (very slow machine)

I haven't seen a single failure on ryzen, after ~3000 iterations. But
both xeon and rpi5 show a number of failures. Xeon has about 35 reports
of 'Failed test', rpi5 and about 10.

My guess is it's something about timing. It works on the "fast" ryzen,
but fails on xeon which is ~3-4x slower. And rpi5, which is even slower.

The other reason why it seems unrelated to the reverted commit is that
it's not just about visibility maps (which was got corrupted). I see
checksum failures on VM and FSM. I think I forgot about the FSM cases,
and by the fact that I saw no failures on the ryzen post revert. But
clearly, other machines still have issues.

Another interesting fact is that the checksum failures happen both on
the primary and the standby, it's not just a standby issue. But again,
this sees to be machine-dependent. On the rpi5 I've only seen standby
issues. The xeon sees failures both on primary/standby (roughly 1:1).

There are more weird things. If I grep for page failures, I see this (a
more detailed log attached):

-----------
# 2025-08-28 22:33:28.195 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.197 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.199 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.201 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.206 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.207 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
-----------

This is right after a single restart, while doing the recovery. The
weird thing is, this is all for just two FSM pages!

-----------
LOG:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing
out page
LOG:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing
out page
-----------

And the calculated/expected checksums repeat! It's just different WAL
records hitting the same page, and complaining about the same issue,
after claiming the page was zeroed out. Isn't that weird? How come the
page doesn't "get" the correct checksum after the first redo?

I've seen these failures after changing checksums in both directions,
both after enabling and disabling. But I've only ever saw this after
immediate shutdown, never after fast shutdown. (It's interesting the
pg_checksums failed only after fast shutdowns ...).

Could it be that the redo happens to start from an older position, but
using the new checksum version?

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-08-29T14:38:22Z

On 8/29/25 16:26, Tomas Vondra wrote:
> ...
> 
> I've seen these failures after changing checksums in both directions,
> both after enabling and disabling. But I've only ever saw this after
> immediate shutdown, never after fast shutdown. (It's interesting the
> pg_checksums failed only after fast shutdowns ...).
> 

Of course, right after I send a message, it fails after a fast shutdown,
contradicting my observation ...

> Could it be that the redo happens to start from an older position, but
> using the new checksum version?
> 

... but it also provided more data supporting this hypothesis. I added
logging of pg_current_wal_lsn() before / after changing checksums on the
primary, and I see this:

1) LSN before: 14/2B0F26A8
2) enable checksums
3) LSN after: 14/EE335D60
4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
5) standby restarts with -m fast
6) redo starts at 14/230043B0, which is *before* enabling checksums

I guess this is the root cause. A bit more detailed log attached.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-09-01T12:11:06Z

On 8/29/25 16:38, Tomas Vondra wrote:
> On 8/29/25 16:26, Tomas Vondra wrote:
>> ...
>>
>> I've seen these failures after changing checksums in both directions,
>> both after enabling and disabling. But I've only ever saw this after
>> immediate shutdown, never after fast shutdown. (It's interesting the
>> pg_checksums failed only after fast shutdowns ...).
>>
> 
> Of course, right after I send a message, it fails after a fast shutdown,
> contradicting my observation ...
> 
>> Could it be that the redo happens to start from an older position, but
>> using the new checksum version?
>>
> 
> ... but it also provided more data supporting this hypothesis. I added
> logging of pg_current_wal_lsn() before / after changing checksums on the
> primary, and I see this:
> 
> 1) LSN before: 14/2B0F26A8
> 2) enable checksums
> 3) LSN after: 14/EE335D60
> 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
> 5) standby restarts with -m fast
> 6) redo starts at 14/230043B0, which is *before* enabling checksums
> 
> I guess this is the root cause. A bit more detailed log attached.
> 

I kept stress testing this over the weekend, and I think I found two
issues causing the checksum failures, both for a single node and on a
standby:

1) no checkpoint in the "disable path"

In the "enable" path, a checkpoint it enforced before flipping the state
from "inprogress-on" to "on". It's hidden in the ProcessAllDatabases,
but it's there. But the "off" path does not do that, probably on the
assumption that we'll always see the writes in the WAL order, so that
we'll see the XLOG_CHECKSUMS setting checksums=off before seeing any
writes without checksums.

And in the happy path this works fine - the standby is happy, etc. But
what about after a crash / immediate shutdown? Consider a sequence like
this:

a) we have checksums=on
b) write to page P, updating the checksum
c) start disabling checksums
d) progress to inprogress-off
e) progress to off
f) write to page P, without checksum update
g) the write to P gets evicted (small shared buffers, ...)
h) crash / immediate shutdown

Recovery starts from a LSN before (a), so we believe checksums=on. We
try to redo the write to P, which starts by reading the page from disk,
to check the page LSN. We still think checksums=on, and to read the LSN
we need to verify the checksum. But the page was modified without the
checksum, and evicted. Kabooom!

This is not that hard to trigger by hand. Add a long at the end of
SetDataChecksumsOff, start a pgbench on a scale larger than shared
buffers and call pg_disable_data_checksums(). Once it gets stuck on the
sleep, give it more time to dirty and evict some pages, then kill -9. On
recovery you should get the same checksum failures.

FWIW I've only ever seen failures on fsm/vm forks, which matches what I
see in the TAP tests. But isn't it a bit strange?

I think the "disable" path needs a checkpoint between inprogress-off and
off states, same as the "enable" path.

2) no restart point on the standby

The standby has a similar issue, I think. Even if the primary creates
all the necessary checkpoints, the standby may not need to create the
restart point for them. If you look into xlog_redo, it only "remembers"
the checkpoint position, it does not trigger a restart point. Than only
happens in XLogPageRead, based on distance from the previous one.

So a very similar failure to the primary is possible, even with the
extra checkpoint fixing (1). The primary flips checksums in either
direction, generating checkpoints, but the standby does not create the
restart points. But it applies WAL, and some of the pages without
checksums get evicted.

And then the standby fails, and goes to some redo position far back, and
runs into the same checksum failure when trying to check page LSN.

I think the standby needs some logic to force restart point creation
when the checksum flag changed.

I have an experimental WIP branch at:

https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks

It fixes the TAP issues reported earlier (and a couple more), and it
does a bunch of additional tweaks:

a) A lot of debug messages that helped me to figure this out. This is
probably way too much, especially the controlfile updates can be very
noisy on a standby.

b) Adds a simpler TAP, testing just a single node (should be easier to
understand than with failures on standby).

c) Adds an explicit checkpoints, to fix (1). It probably adds too many
checkpoints, though? AFAICS a checkpoint after the "inprogress" phase
should be enough, a checkpoint after the "on/off" can go away.

d) Forces creating a restart point on the first checkpoint after
XLOG_CHECKSUMS record. It's done in a bit silly way, using a static
flag. Maybe there's a more elegant approach, say by comparing the
checksum value in ControlFile to the received checkpoint?

e) Randomizes a couple more GUC values. This needs more thought, it was
done blindly before better understanding how the failures happen (it
requires buffers evicted, not hitting max_wal_size, ...). There are more
params worth randomizing (e.g. the "fast" flag).

Anyway, with (c) and (d) applied, the checksum failures go away. It may
not be 100% right (e.g. we could do away with fewer checkpoints), but it
seems to be the right direction.

I don't have time to cleanup the branch more, I've already spent too
much time looking at LSNs advancing in weird ways :-( Hopefully it's
good enough to show what needs to be fixed, etc. If there's a new
version, I'm happy to rerun the tests on my machines, ofc.

However, there still are more bugs. Attached is a log from a crash after
hitting the assert into AbsorbChecksumsOffBarrier:

 Assert((LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) &&
 (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
  LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION));

This happened while flipping checksums to 'off, but the backend already
thinks checksum are 'off':

    LocalDataChecksumVersion==0

I think this implies some bug in setting up LocalDataChecksumVersion
after connection, because this is for a query checking the checksum
state, executed by the TAP test (in a new connection, right?).

I haven't looked into this more, but how come the "off" direction does
not need to check InitialDataChecksumTransition?

I think the TAP test turned out to be very useful, so far. While
investigating on this, I thought about a couple more tweaks to make it
detect additional issues (on top of the randomization).

- Right now the shutdowns/restarts happen only in very limited places.
The checksum flips from on/off or off/on, and then a restart happens.
AFAICS it never happens in the "inprogress" phases, right?

- The pgbench clients connect once, so there are almost no new
connections while flipping checksums. Maybe some of the pgbenches should
run with "-C", to open new connections. It was pretty lucky the TAP
query hit the assert, this would make it more likely.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-10-06T12:53:01Z

> On 1 Sep 2025, at 14:11, Tomas Vondra <tomas@vondra.me> wrote:

> I kept stress testing this over the weekend, and I think I found two
> issues causing the checksum failures, both for a single node and on a
> standby:

Thanks a lot for all your testing. I was able to reproduce ..

> I have an experimental WIP branch at:
> 
> https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks

.. and with your changes I also see the reproducer go away.  I concur that
you likely found the issue and the right fix for it.  I have absorbed your
patches into my branch, the debug logging is left as 0002 but the other ones
are incorporated into 0001.

> It fixes the TAP issues reported earlier (and a couple more), and it
> does a bunch of additional tweaks:
> 
> a) A lot of debug messages that helped me to figure this out. This is
> probably way too much, especially the controlfile updates can be very
> noisy on a standby.

I've toned down the logging a bit, but kept most of it in 0002.

> b) Adds a simpler TAP, testing just a single node (should be easier to
> understand than with failures on standby).

This turned out to be more useful than I initially thought, so I've kept this
in the attached version.  There could be value in separating the single and
dual node tests into different PG_TEST_EXTRA values given how intensive the
latter is.

> Anyway, with (c) and (d) applied, the checksum failures go away. It may
> not be 100% right (e.g. we could do away with fewer checkpoints), but it
> seems to be the right direction.

I think so too, and while I have removed one of them due to being issued just
before (or after) another checkpoint I do believe this is the right fix for the
issue.  There might well be more issues, but I wanted to get a new version out
on the thread to get more visibility on the the new tests.

> I haven't looked into this more, but how come the "off" direction does
> not need to check InitialDataChecksumTransition?

This boiled down into the barrier absorbing functions evolving out of sync with
one another over multiple versions of the patch.  To address this absorbing the
barrier has been converted into a single function which is driven by an array
of ChecksumBarrierCondition structs, one for each target state.  This struct
defines what the current state of the cluster must be for the barrier to be
successfully absorbed.  This removed a lot of duplicate code and also unifies
the previously quite varied levels of assertions at the barrier.

> I think the TAP test turned out to be very useful, so far. While
> investigating on this, I thought about a couple more tweaks to make it
> detect additional issues (on top of the randomization).
> 
> - Right now the shutdowns/restarts happen only in very limited places.
> The checksum flips from on/off or off/on, and then a restart happens.
> AFAICS it never happens in the "inprogress" phases, right?

That would be a good idea, as well as use a crashing injection test on top
controlled shutdowns.

> - The pgbench clients connect once, so there are almost no new
> connections while flipping checksums. Maybe some of the pgbenches should
> run with "-C", to open new connections. It was pretty lucky the TAP
> query hit the assert, this would make it more likely.

I've added this to both tests using pgbench, randomized with a cointoss call.

The attached has the above as well as a few other changes:

* A new injection test which calls abort() right before checkpointing has been
added in 005_injection.

* Most ereport calls have gotten proper errcodes in order to aid analysis,
particurly fleet-wide analysis should this be deployed in a larger setting.

* More code-level documentation of test code and several tweaked (and added)
code comments to aid readability.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-11-05T08:56:13Z

Rebase due to recent conflicts with only a trivial whitespace fix, otherwise
the same as the previous version.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-11-05T14:05:48Z

Hi Daniel,

I took a look at the patch again, focusing mostly on the docs and
comments. I think the code is OK, I haven't noticed anything serious.


testing
-------

I'm running the TAP tests - so far it looks fine, I've done 2000
iterations of the "single" test, now running ~2000 iterations of the
"standby" test. No issues/failures so far.

The question is whether we can/should make it even more "random", by
doing restarts in more places etc. I might give it a try, if I happen to
have some free time. But no promises, I'm not sure how feasible it is.
Making it "truly random" means it's hard to deduce what should be the
actual state of checksums, etc.


review
------

Attached is a patch adding FIXME/XXX comments to a bunch of places,
which I think makes it clearer which place I'm talking about. I'll
briefly go over the items, and maybe explain them a bit more.

1) func-admin.sgml

- This is missing documentation for the "fast" parameters for both
functions (enable/disable).

- The paragraph stars with "Initiates data checksums for ..." but that
doesn't sound right to me. I think you can initiate enabling/disabling,
not "data checksums".

- I think the docs should at least briefly describe the impact on the
cluster, and also on a standby, due to having to write everything into
WAL, waiting for checkpoints etc. And maybe discuss how to mitigate that
in some way. More about the standby stuff later.


2) glossary.sgml

- This describes "checksum worker" as process that enables or disables
checksums in a specific database, but we don't need any per-database
processes when disabling, no?


3) monitoring.sgml

- I'm not sure what "regardles" implies here. Does that mean we simply
don't hide/reset the counters?

- I added a brief explanation about using the "launcher" row for overall
progress, and per-database workers for "current progress".

- Do we want to refer to "datachecksumsworker"? Isn't it a bit too
low-level detail?

- The table of phases is missing "waiting" and "done" phases. IMHO if
the progress view can return it, it should be in the docs.


4) wal.sgml

- I added a sentence explaining that both enabling and disabling happens
in phases, with checkpoints in between. I think that's helpful for users
and DBAs.

- The section only described "enabling checksums", but it probably
should explain the process for disabling too. Added a para.

- Again, I think we should explain the checkpoints, restartpoints,
impact on standby ... somewhere. Not sure if this is the right place.


5) xlog.c

- Some minor corrections (typos, ...).

- Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place
where we check InitialDataChecksumTransition stale? AFAIK we now check
this in AbsorbDataChecksumsBarrier for all transitions, no?


6) datachecksumsworker.c

- I understand what the comments at the beginning of the file say, but I
suspect it's partially due to already knowing the code. There's a couple
places that might be more explicit, I think. For example:

- One of the items in the synchronization/correctness section states
that "Backends SHALL NOT violate local data_checksums state" but what
does "violating local data_checksums state" mean? What even is "local
state in this context"? Should we explain/define that, or would that be
unnecessarily detailed?

- The section only talks about "enabling checksums" but also discusses
all four possible states. Maybe it should talk about disabling too, as
that requires the same synchronization/correctness.

- Maybe it'd be good make it more explicit at which point the process
waits on a barrier, which backend initiates that (and which backends are
required to absorb the signal). It kinda is there, but only indirectly.

- Another idea I had is that maybe it'd help to have some visualization
of the process (with data_checksum states, barriers, ...) e.g. in the
form of an ASCII image.


open questions
--------------

For me the main remaining question is impact people should expect in
production systems, and maybe ways to mitigate that.

In single-node systems this is entirely fine, I think. There will be
checkpoints, but those will be "spread" and it's just the checksum
worker waiting for that to complete.

It'll write everything into WAL, but that's fairly well understood /
should be expected. We should probably mention that in the sgml docs, so
that people are not surprised their WAL archive gets huge.

I'm much more concerned about streaming replicas, because there it
forces a restart point - and it *blocks redo* until it completes. Which
means there'll be replication lag, and for synchronous standbys this
would even block progress on the primary.

We should very clearly document this. But I'm also wondering if we might
mitigate this in some way / reduce the impact.

I see some similarities to shutdown checkpoints, which can take a lot of
time if there happen to be a lot of dirty data, increasing disruption
during planned restarts (when no one can connect). A common mitigation
is to run CHECKPOINT shortly before the restart, to flush most of the
dirty data while still allowing new connections.

Maybe we could do something like that for checksum changes? I don't know
how exactly we could do that, but let's say we can predict when roughly
to expect the next state change. And we'd ensure the standby starts
flushing stuff before that, so that creating the restartpoint is cheap.
Or maybe we'd (gradually?) lower max_wal_size on the standby, to reduce
the amount of WAL as we're getting closer to the end?


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-11-10T01:26:07Z

Hi,

We had some off-list discussions about this patch (me, Daniel and
Andres), and Andres mentioned he suspects the patch may be breaking PITR
in some way. We didn't have any example of that, but PITR seems like a
pretty fundamental feature, so I took it seriously and decided to do
some stress testing. And yeah, there are issues :-(

I did a similar stress testing in the past, which eventually evolved
into the two TAP tests in the current patch. Those TAP tests run either
a single node or primary-standby cluster, and flip checksums on/off
while restarting the instance(s). And verify the cluster behaves OK,
with no checksum failures, unexpected states, etc.

I chose to do something similar to test PITR - run a single node with
pgbench (or some other write activity), flip checksums on/off in a loop,
while taking basebackups. And then validate the basebackups are valid,
and can be used to start a new instance. I planned to extend this to
more elaborate tests with proper PITR using a WAL archive, etc. I didn't
get that far - this simple test already hit a couple issues.

I'm attaching sets of test scripts, and the scripts used to validate
backups. The scripts are fairly simple, but need some changes to run -
adjusting some paths, etc.

1) basebackup-short.tgz - basic basebackup test

2) basebackup-long.tgz - long-runnning basebackup test (throttled)

3) validate.tgz - validate backups from (1) and (2)

The test scripts expect "scale" parameter for pgbench. I used 500, but
that's mostly arbitrary - maybe a different value would hit the issues
more often. Not sure.

While testing I ran into two issues while validating the backups (which
is essentially about performing the usual basebackup redo).


1) assert in a checkpointer

TRAP: failed Assert("TransactionIdIsValid(initial)"), File:
"procarray.c", Line: 1707, PID: 2649754
postgres: checkpointer (ExceptionalCondition+0x56) [0x55d64ec38f96]
postgres: checkpointer (+0x5341d8) [0x55d64eac41d8]
postgres: checkpointer (GetOldestTransactionIdConsideredRunning+0xc)
[0x55d64eac598c]
postgres: checkpointer (CreateRestartPoint+0x725) [0x55d64e7dd2f5]
postgres: checkpointer (CheckpointerMain+0x3ec) [0x55d64ea38a8c]
postgres: checkpointer (postmaster_child_launch+0x102) [0x55d64ea3b592]
postgres: checkpointer (+0x4ad74a) [0x55d64ea3d74a]
postgres: checkpointer (PostmasterMain+0xce7) [0x55d64ea40b97]
postgres: checkpointer (main+0x1ca) [0x55d64e713b1a]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f55f1a33ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f55f1a33d65]
postgres: checkpointer (_start+0x21) [0x55d64e713e81]

I was really confused by this initially, because why would this break
tracking of running transactions in a snapshot, etc.? But that's really
a red herring - the real issue is calling CreateRestartPoint().

This happens because we need to ensure that a standby does not get
confused about checksums state, so redo of XLOG_CHECKSUMS forces
creating a restart point whenever the checksum state changes. Which has
some negative consequences, as discussed in my previous message.

But AFAICS in this case the root cause is calling this during a regular
redo, not just on a standby. Presumably there's a way to distinguish
these two cases, and skip the restart point creation on a simple redo.
Alternatively, maybe the standby should do something different instead
of creating a restart point. (More about that later.)


2) unexpected state during redo of XLOG_CHECKSUMS record

An example of the failure:

    LOG:  redo starts at 36/5B028668
    FATAL:  incorrect data checksum state 3 for target state 1
    CONTEXT:  WAL redo at 37/E136E408 for XLOG/CHECKSUMS: on
    ERROR:  incorrect data checksum state 3 for target state 1
    ERROR:  incorrect data checksum state 3 for target state 1
    ERROR:  incorrect data checksum state 3 for target state 1
    ERROR:  incorrect data checksum state 3 for target state 1
    ERROR:  incorrect data checksum state 3 for target state 1
    LOG:  startup process (PID 2649028) exited with exit code 1

I was really confused about this at first, because I haven't seen this
during earlier testing (which did redo after a crash), so why should it
happen here?

But the reason is very simple - the basebackup can run for a while,
spanning multiple checkpoints, and also multiple changes of checksum
state (enable/disable). Furthermore, basebackup copies the pg_control
file last, and that's where we get the checksum state from.

For the earlier crash/restart test that's not the case, because each
checksum change creates a checkpoint, and the redo starts from that
point. There can't be multiple XLOG_CHECKSUMS records to apply.

So after a crash we start from the last checkpoint (i.e. from the
initial checksum state), and then there can be only a single
XLOG_CHECKSUMS record to reapply.

But with a basebackup we get the checksum state from the pg_control
copied at the end (so likely the final state, not the initial one), and
there may be multiple XLOG_CHECKSUMS in between (and multiple
checkpoints). So what happens is that we start from the final state, and
then try to apply the earlier XLOG_CHECKSUMS states. Hence the failure.

This seems like a pretty fundamental issue - we can't guarantee a backup
won't run too long, or anything like that. A full backup may easily run
for multiple hours or even days.

I suspect it should be possible to hit an issue similar to those we
observed on the standby in earlier testing (which is why we ended up
creating startpoints on a standby) with "future" blocks.

Imagine a basebackup starts, we disable checksums, and one of the blocks
gets written to the disk without a checksum before the basebackup copies
that file. Then during redo of the backup, we start with checksums=on
(assume we don't have the issue with incorrect state). If we attempt to
read the page before the XLOG_CHECKSUMS, that'll fail, because the
on-disk page does not have a valid checksum, and we need that to read
the page LSN.

I only speculate this can happen, I haven't actually seen / reproduced
it. But maybe it's fixed thanks to FPI, or something like that? It
didn't help on the standby, though. And in that case allowing a random
mix of pages with/without checksums in a basebackup seems problematic.

What could we do about the root cause? We discussed this with Daniel and
we've been stuck for quite a while. But then it occurred to us maybe we
can simply "pause" the checksum state change while there's backup in
progress. We already enable/disable FPW based on this, so why couldn't
we check XLogCtl->Insert.runningBackups, and only advance to the next
checksum state if (runningBackups==0)?

That would mean a single backup does not need to worry about seeing a
mix of blocks written with different checksum states, and it also means
the final pg_control file has the correct checksum state, because it is
not allowed to change during the basebackup.

Of course, this would mean checksum changes may take longer. A corner
case is that database with a basebackup running 100% of the time won't
be able to change checksums on-line. But to me that seems acceptable, if
communicated / documented clearly.

It also occurred to me something like this might help with the standby
case too. On the standby, the problem happens when it skips checkpoints
when creating a restart point, in which case redo may go too far back,
with an incompatible checksum state. Maybe if a standby reported LSN of
the last restartpoint, the primary could use that to decide whether it's
safe to update the checksum state. Of course, this is tricky, because
standbys may be disconnected, there's cascading replications, etc.


regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-11-18T19:06:40Z

On 11/10/25 02:26, Tomas Vondra wrote:
> What could we do about the root cause? We discussed this with Daniel and
> we've been stuck for quite a while. But then it occurred to us maybe we
> can simply "pause" the checksum state change while there's backup in
> progress. We already enable/disable FPW based on this, so why couldn't
> we check XLogCtl->Insert.runningBackups, and only advance to the next
> checksum state if (runningBackups==0)?
> 
> That would mean a single backup does not need to worry about seeing a
> mix of blocks written with different checksum states, and it also means
> the final pg_control file has the correct checksum state, because it is
> not allowed to change during the basebackup.
> 
> Of course, this would mean checksum changes may take longer. A corner
> case is that database with a basebackup running 100% of the time won't
> be able to change checksums on-line. But to me that seems acceptable, if
> communicated / documented clearly.

After thinking about this approach a bit, I realized the basebackup may
also run on the standby. Which means the checksum process won't see it
by checking XLogCtl->Insert.runningBackups. It will merrily proceed,
breaking the standby backup just as described earlier ...

Not sure what would be a good fix. One option is to "pause" the redo,
which is what the patch already does (by forcing an immediate checkpoint
whenever checksum state changes). We could pause redo until the backup
completes. But of course, that'd be terrible - especially for syncrep. I
hoped we'd find a better approach, and pausing redo for longer goes in
the opposite direction.

On the other hand, we already have similar issue with full_page_writes.
The backup on standby is not allowed to start if fpw=off, and if the
setting changes while the backup is running, the backup fails:

  pg_basebackup: error: backup failed: ERROR:  WAL generated with
"full_page_writes=off" was replayed during online backup
HINT:  This means that the backup being taken on the standby is corrupt
and should not be used. Enable "full_page_writes" and run CHECKPOINT on
the primary, and then try an online backup again.

Maybe this would be acceptable for checksums too ...

It's not exactly the same, of course. We don't really expect people to
change fpw in a running cluster.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Andreas Karlsson <andreas@proxel.se> — 2025-11-19T21:03:37Z

Hi,

I have been following these discussions but not read the patch in detail.

This patch makes me worried especially with the new issues recently 
uncovered. This was already a quite big patch and to fix these issues it 
will likely have to become even bigger and given how this would become a 
very rarely stressed code paths I wonder if we can actually ever become 
confident that the patch works in all edge cases.

Something like this need to be easy to understand for us to have any 
hope at all to be comfortable in the correctness. Can we actually do that?

Andreas

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-11-20T10:34:14Z

On 11/19/25 22:03, Andreas Karlsson wrote:
> Hi,
> 
> I have been following these discussions but not read the patch in detail.
> 
> This patch makes me worried especially with the new issues recently
> uncovered. This was already a quite big patch and to fix these issues it
> will likely have to become even bigger and given how this would become a
> very rarely stressed code paths I wonder if we can actually ever become
> confident that the patch works in all edge cases.
> 
> Something like this need to be easy to understand for us to have any
> hope at all to be comfortable in the correctness. Can we actually do that?
> 

How's this different from any other complex patch? We get more familiar
with the problem during review, identify issues, improve the patch to
address them. And then again and again.

Of course, it'd be great to have a perfect understanding of the problem
from the very beginning, but that's not always possible. And I can't
guarantee we'll find/fix all issues.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Andreas Karlsson <andreas@proxel.se> — 2025-11-21T00:44:31Z

On 11/20/25 11:34 AM, Tomas Vondra wrote:
> On 11/19/25 22:03, Andreas Karlsson wrote:
>> Hi,
>>
>> I have been following these discussions but not read the patch in detail.
>>
>> This patch makes me worried especially with the new issues recently
>> uncovered. This was already a quite big patch and to fix these issues it
>> will likely have to become even bigger and given how this would become a
>> very rarely stressed code paths I wonder if we can actually ever become
>> confident that the patch works in all edge cases.
>>
>> Something like this need to be easy to understand for us to have any
>> hope at all to be comfortable in the correctness. Can we actually do that?
>>
> 
> How's this different from any other complex patch? We get more familiar
> with the problem during review, identify issues, improve the patch to
> address them. And then again and again.

The difference I see is in how rarely anyone actually switches checksum 
state in a production database, especially now that we enabled them by 
default. A complex and rarely stressed code path is a minefield.

Andreas

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2025-11-21T12:17:09Z

On 11/21/25 01:44, Andreas Karlsson wrote:
> On 11/20/25 11:34 AM, Tomas Vondra wrote:
>> On 11/19/25 22:03, Andreas Karlsson wrote:
>>> Hi,
>>>
>>> I have been following these discussions but not read the patch in
>>> detail.
>>>
>>> This patch makes me worried especially with the new issues recently
>>> uncovered. This was already a quite big patch and to fix these issues it
>>> will likely have to become even bigger and given how this would become a
>>> very rarely stressed code paths I wonder if we can actually ever become
>>> confident that the patch works in all edge cases.
>>>
>>> Something like this need to be easy to understand for us to have any
>>> hope at all to be comfortable in the correctness. Can we actually do
>>> that?
>>>
>>
>> How's this different from any other complex patch? We get more familiar
>> with the problem during review, identify issues, improve the patch to
>> address them. And then again and again.
> 
> The difference I see is in how rarely anyone actually switches checksum
> state in a production database, especially now that we enabled them by
> default. A complex and rarely stressed code path is a minefield.
> 

True. Hence the stress testing I've been doing - and indeed, that made
us discover the various issues reported in this thread.

Still, isn't that similar to error paths in various other patches? Those
also tend to be rarely exercised in practice. I think the right way to
address that is more testing. Of course, there's a difference between
"regular bugs" and "design problems". Some of the issues are more about
the design/architecture not considering something important.

I don't know if / when this will be ready for commit. Maybe never, who
knows. I prefer going step by step. We know about a couple issues, we
need to figure out what to do about those. Then we can reconsider.

FWIW I'm not sure the number of people currently enabling checksums on
production databases is a good metric of how important the patch is.
Maybe more people would like to do that, but can't accept the downtime.

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Bruce Momjian <bruce@momjian.us> — 2025-11-21T19:57:35Z

On Fri, Nov 21, 2025 at 01:17:09PM +0100, Tomas Vondra wrote:
> True. Hence the stress testing I've been doing - and indeed, that made
> us discover the various issues reported in this thread.
> 
> Still, isn't that similar to error paths in various other patches? Those
> also tend to be rarely exercised in practice. I think the right way to
> address that is more testing. Of course, there's a difference between
> "regular bugs" and "design problems". Some of the issues are more about
> the design/architecture not considering something important.
> 
> I don't know if / when this will be ready for commit. Maybe never, who
> knows. I prefer going step by step. We know about a couple issues, we
> need to figure out what to do about those. Then we can reconsider.
> 
> FWIW I'm not sure the number of people currently enabling checksums on
> production databases is a good metric of how important the patch is.
> Maybe more people would like to do that, but can't accept the downtime.

I think it is a worth-while feature.  We would have had it years ago
except that people asked for re-start-ability after a crash, and since
we don't have restart logic at the relation level, the patch got too
complex and was abandoned.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Changing the state of data checksums in a running cluster

Andres Freund <andres@anarazel.de> — 2025-11-21T20:28:40Z

Hi,

On 2025-11-21 01:44:31 +0100, Andreas Karlsson wrote:
> On 11/20/25 11:34 AM, Tomas Vondra wrote:
> > On 11/19/25 22:03, Andreas Karlsson wrote:
> > > I have been following these discussions but not read the patch in detail.
> > > 
> > > This patch makes me worried especially with the new issues recently
> > > uncovered. This was already a quite big patch and to fix these issues it
> > > will likely have to become even bigger and given how this would become a
> > > very rarely stressed code paths I wonder if we can actually ever become
> > > confident that the patch works in all edge cases.
> > > 
> > > Something like this need to be easy to understand for us to have any
> > > hope at all to be comfortable in the correctness. Can we actually do that?
> > > 
> > 
> > How's this different from any other complex patch? We get more familiar
> > with the problem during review, identify issues, improve the patch to
> > address them. And then again and again.
> 
> The difference I see is in how rarely anyone actually switches checksum
> state in a production database, especially now that we enabled them by
> default. A complex and rarely stressed code path is a minefield.

FWIW, I think this is actually a good feature build the infrastructure for
features (i.e. dynamically reconfiguring the cluster while running) like this,
precisely because it isn't *constantly* used.

Greetings,

Andres Freund

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2025-12-01T21:51:04Z

> On 5 Nov 2025, at 15:05, Tomas Vondra <tomas@vondra.me> wrote:

> I took a look at the patch again, focusing mostly on the docs and
> comments.

While I am still working on a few ideas on how to handle the PITR issue, I
didn't want to leave this review hanging longer (and it would be nice to get it
out of the way and not mix it with other issues).

> 1) func-admin.sgml
> 
> - This is missing documentation for the "fast" parameters for both
> functions (enable/disable).

I had originally omitted them intentionally since they were intended for
testing, but I agree that they should be documented as they might be useful
outside of testng as well.

> - The paragraph stars with "Initiates data checksums for ..." but that
> doesn't sound right to me. I think you can initiate enabling/disabling,
> not "data checksums".

Fair point.

> - I think the docs should at least briefly describe the impact on the
> cluster, and also on a standby, due to having to write everything into
> WAL, waiting for checkpoints etc. And maybe discuss how to mitigate that
> in some way. More about the standby stuff later.

Agreed, but I think it's better done in a one central place like the data
checksums section in wal.sgml (which your XXX also mention).  In the preamble
to the function table I've added a mention of the system performance impact
with a link to the aforementioned section.

> 2) glossary.sgml
> 
> - This describes "checksum worker" as process that enables or disables
> checksums in a specific database, but we don't need any per-database
> processes when disabling, no?

That's very true, the Worker Launcher will receive the operation and if it is
to disable it will proceed without launching any workers since it's cluster
wide.  Thinking about it, maybe DataChecksumsWorkerLauncher isn't a very good
name, DataChecksumsController or DataChecksumsCoordinator might be a better
choice?

Naming seems to be hard, who knew..

> 3) monitoring.sgml
> 
> - I'm not sure what "regardles" implies here. Does that mean we simply
> don't hide/reset the counters?

Correct, I've expanded this para to mention this.

> - I added a brief explanation about using the "launcher" row for overall
> progress, and per-database workers for "current progress".

+1

> - Do we want to refer to "datachecksumsworker"? Isn't it a bit too
> low-level detail?

I think so, we should stick to "launcher process" and "worker process" and be
conistent about it.

> - The table of phases is missing "waiting" and "done" phases. IMHO if
> the progress view can return it, it should be in the docs.

Nice catch.  The code can't actually return "waiting" since it was broken up
into three different wait phases, but one of them wasn't documented or added to
the view properly.  Fixed.

> 4) wal.sgml
> 
> - I added a sentence explaining that both enabling and disabling happens
> in phases, with checkpoints in between. I think that's helpful for users
> and DBAs.

+1

> - The section only described "enabling checksums", but it probably
> should explain the process for disabling too. Added a para.

+1

> - Again, I think we should explain the checkpoints, restartpoints,
> impact on standby ... somewhere. Not sure if this is the right place.

I've added a subsection in the main Data Checksums section for this.

> 5) xlog.c

> - Some minor corrections (typos, ...).

Thanks, I did some additional minor wordsmithing around these.

> - Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place
> where we check InitialDataChecksumTransition stale? AFAIK we now check
> this in AbsorbDataChecksumsBarrier for all transitions, no?

Correct, fixed.

> 6) datachecksumsworker.c

> - One of the items in the synchronization/correctness section states
> that "Backends SHALL NOT violate local data_checksums state" but what
> does "violating local data_checksums state" mean? What even is "local
> state in this context"? Should we explain/define that, or would that be
> unnecessarily detailed?

By "local state" I was referring to the data checksum state that a backend
knows about.  I've reworded this to hopefully be a little clearer.

> - The section only talks about "enabling checksums" but also discusses
> all four possible states. Maybe it should talk about disabling too, as
> that requires the same synchronization/correctness.
> 
> - Maybe it'd be good make it more explicit at which point the process
> waits on a barrier, which backend initiates that (and which backends are
> required to absorb the signal). It kinda is there, but only indirectly.

I've tried to address these, but they might still be off since I am very
Stockholm syndromed into this.

> - Another idea I had is that maybe it'd help to have some visualization
> of the process (with data_checksum states, barriers, ...) e.g. in the
> form of an ASCII image.

I instead opted for a SVG image in the docs which illustrates the states and
the transitions.  My Graphviz skills aren't that great so it doesn't look all
that pretty yet, but it's something to iterate on at least.

> I'm much more concerned about streaming replicas, because there it
> forces a restart point - and it *blocks redo* until it completes. Which
> means there'll be replication lag, and for synchronous standbys this
> would even block progress on the primary.
> 
> We should very clearly document this.

Indeed. I've started on such a section.

> I see some similarities to shutdown checkpoints, which can take a lot of
> time if there happen to be a lot of dirty data, increasing disruption
> during planned restarts (when no one can connect). A common mitigation
> is to run CHECKPOINT shortly before the restart, to flush most of the
> dirty data while still allowing new connections.
> 
> Maybe we could do something like that for checksum changes? I don't know
> how exactly we could do that, but let's say we can predict when roughly
> to expect the next state change.

Each worker knows how far it has come within its database, but is unaware about
other databases; the launcher knows how far it has come across all databases,
but is unaware about the relative size of each database.  Maybe there still is
a heuristic that can be teased out of imperfect knowledge.

> And we'd ensure the standby starts
> flushing stuff before that, so that creating the restartpoint is cheap.
> Or maybe we'd (gradually?) lower max_wal_size on the standby, to reduce
> the amount of WAL as we're getting closer to the end?

That's an interesting idea, do you know if we have processes taking a similar
approach today?

The attached is a rebase with the above fixes along with a few more smaller
fixups and cleanups noticed along the way, nothing which change any
functionality though.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Andres Freund <andres@anarazel.de> — 2025-12-15T22:21:46Z

Hi,


One high level issue first: I don't think the way this uses checkpoints and
restartpoints is likely to work out.

Synchronously having to wait for restartpoints during recovery seems generally
operationally a huge issue, but actually could also easily lead to undetected
deadlocks, particularly with syncrep.

Using the checksum state from the control file seems very fraught,
particularly with PITR, as the control file can be "from the future". Which
can be a problem e.g. if checksums were disabled, but we start recovery with a
control file with checksums enabled.

Forcing synchronous checkpoints in a bunch of places also will make this a
good bit slower than necessary, particularly for testing.


My suggestion for how to do this instead is to put the checksum state into the
XLOG_CHECKPOINT_* records.  When starting recovery from an online checkpoint,
I think we should use the ChecksumType from the XLOG_CHECKPOINT_REDO record,
that way the standby/recovery environment's assumption about whether checksums
were enabled is the same as it was at the time the WAL was generated. For
shutdown checkpoints, we could either start to emit a XLOG_CHECKPOINT_REDO, or
we can use the information from the checkpoint record itself.

I think if we only ever use the checksum state from the point where we start
recovery, we might not need to force *any* checkpoints.


Daniel and I chatted about that proposal, and couldn't immediately come up
with scenarios where that would be wrong.  For a while I thought there would
be problems when doing PITR from a base backup that had checksums initially
enabled, but where checksums were disabled before the base backup was
completed. My worry was that a later (once checksums were disabled) hint bit
write (which would not necessarily be WAL logged) would corrupt the checksum,
but I don't think that's a problem, because the startup process will only read
data pages in the process of processing WAL records, and if there's a WAL
record, there would also have to be an FPW, which would "cure" the
unchecksummed page.



More comments below, inline - I wrote these first, so it's possible that I
missed updating some of them in light of what I now wrote above.

> +/*
> + * This must match largest number of sets in barrier_eq and barrier_ne in the
> + * below checksum_barriers definition.
> + */
> +#define MAX_BARRIER_CONDITIONS 2
> +
> +/*
> + * Configuration of conditions which must match when absorbing a procsignal
> + * barrier during data checksum enable/disable operations.  A single function
> + * is used for absorbing all barriers, and the set of conditions to use is
> + * looked up in the checksum_barriers struct.  The struct member for the target
> + * state defines which state the backend must currently be in, and which it
> + * must not be in.
> + */
> +typedef struct ChecksumBarrierCondition
> +{
> +	/* The target state of the barrier */
> +	int			target;
> +	/* A set of states in which at least one MUST match the current state */
> +	int			barrier_eq[MAX_BARRIER_CONDITIONS];
> +	/* The number of elements in the barrier_eq set */
> +	int			barrier_eq_sz;
> +	/* A set of states which all MUST NOT match the current state */
> +	int			barrier_ne[MAX_BARRIER_CONDITIONS];
> +	/* The number of elements in the barrier_ne set */
> +	int			barrier_ne_sz;
> +} ChecksumBarrierCondition;
> +
> +static const ChecksumBarrierCondition checksum_barriers[] =
> +{
> +	{PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2, {PG_DATA_CHECKSUM_VERSION}, 1},
> +	{PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, {0}, 0},
> +	{PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {PG_DATA_CHECKSUM_ANY_VERSION}, 1, {PG_DATA_CHECKSUM_VERSION}, 1},
> +	{PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 1, {0}, 0},
> +	{-1}
> +};

The explanation for this doesn't really explain what the purpose of this thing
is...  Perhaps worth referencing datachecksumsworker.c or such?

For a local, constantly sized, array, you shouldn't need a -1 terminator, as
you can instead use lengthof() or such to detect invalid accesses.


> +/*
> + * Flag to remember if the procsignalbarrier being absorbed for checksums is
> + * the first one.  The first procsignalbarrier can in rare cases be for the
> + * state we've initialized, i.e. a duplicate.  This may happen for any
> + * data_checksum_version value when the process is spawned between the update
> + * of XLogCtl->data_checksum_version and the barrier being emitted.  This can
> + * only happen on the very first barrier so mark that with this flag.
> + */
> +static bool InitialDataChecksumTransition = true;

This is pretty hard to understand right now, at the very least it needs an
updated comment. But perhaps we can just get rid of this and accept barriers
that are redundant.



> @@ -830,9 +905,10 @@ XLogInsertRecord(XLogRecData *rdata,
>  		 * only happen just after a checkpoint, so it's better to be slow in
>  		 * this case and fast otherwise.
>  		 *
> -		 * Also check to see if fullPageWrites was just turned on or there's a
> -		 * running backup (which forces full-page writes); if we weren't
> -		 * already doing full-page writes then go back and recompute.
> +		 * Also check to see if fullPageWrites was just turned on, there's a
> +		 * running backup or if checksums are enabled (all of which forces
> +		 * full-page writes); if we weren't already doing full-page writes
> +		 * then go back and recompute.
>  		 *
>  		 * If we aren't doing full-page writes then RedoRecPtr doesn't
>  		 * actually affect the contents of the XLOG record, so we'll update
> @@ -845,7 +921,9 @@ XLogInsertRecord(XLogRecData *rdata,
>  			Assert(RedoRecPtr < Insert->RedoRecPtr);
>  			RedoRecPtr = Insert->RedoRecPtr;
>  		}
> -		doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
> +		doPageWrites = (Insert->fullPageWrites ||
> +						Insert->runningBackups > 0 ||
> +						DataChecksumsNeedWrite());
>
>  		if (doPageWrites &&
>  			(!prevDoPageWrites ||

Why do we need to separately check for DataChecksumsNeedWrite() if turning on
checksums also forces Insert->fullPageWrites to on?



> +/*
> + * SetDataChecksumsOnInProgress
> + *		Sets the data checksum state to "inprogress-on" to enable checksums
> + *
> + * To start the process of enabling data checksums in a running cluster the
> + * data_checksum_version state must be changed to "inprogress-on". See
> + * SetDataChecksumsOn below for a description on how this state change works.
> + * This function blocks until all backends in the cluster have acknowledged the
> + * state transition.
> + */
> +void
> +SetDataChecksumsOnInProgress(bool immediate_checkpoint)
> +{
> +	uint64		barrier;
> +	int			flags;
> +
> +	Assert(ControlFile != NULL);
> +
> +	/*
> +	 * The state transition is performed in a critical section with
> +	 * checkpoints held off to provide crash safety.
> +	 */
> +	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +	START_CRIT_SECTION();

ISTM that delayChkptFlags ought to only be set once within the critical
section. Obviously we can't fail inbetween as the code stands, but that's not
guaranteed to stay this way.


> +	XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> +
> +	SpinLockAcquire(&XLogCtl->info_lck);
> +	XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
> +	SpinLockRelease(&XLogCtl->info_lck);

Maybe worth adding an assertion checking that we are currently in an expected
state (off or inprogress, I think?).


> +	barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
> +
> +	END_CRIT_SECTION();
> +	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;

Swap as well.

Think it might be worth mentioning that we rely on the memory ordering implied
by XLogChecksums() and WaitForProcSignalBarrier() for the changes to
delayChkptFlags. Unless we have a different logic around that?


> +	/*
> +	 * Await state change in all backends to ensure that all backends are in
> +	 * "inprogress-on". Once done we know that all backends are writing data
> +	 * checksums.
> +	 */
> +	WaitForProcSignalBarrier(barrier);
> +
> +	/*
> +	 * force checkpoint to persist the current checksum state in control file
> +	 * etc.
> +	 *
> +	 * XXX is this needed? There's already a checkpoint at the end of
> +	 * ProcessAllDatabases, maybe this is redundant?
> +	 */
> +	flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
> +	if (immediate_checkpoint)
> +		flags = flags | CHECKPOINT_FAST;
> +	RequestCheckpoint(flags);

Why do we need a checkpoint at all?


> +}
> +
> +/*
> + * SetDataChecksumsOn
> + *		Enables data checksums cluster-wide
> + *
> + * Enabling data checksums is performed using two barriers, the first one to
> + * set the state to "inprogress-on" (done by SetDataChecksumsOnInProgress())
> + * and the second one to set the state to "on" (done here). Below is a short
> + * description of the processing, a more detailed write-up can be found in
> + * datachecksumsworker.c.
> + *
> + * To start the process of enabling data checksums in a running cluster the
> + * data_checksum_version state must be changed to "inprogress-on".  This state
> + * requires data checksums to be written but not verified. This ensures that
> + * all data pages can be checksummed without the risk of false negatives in
> + * validation during the process.  When all existing pages are guaranteed to
> + * have checksums, and all new pages will be initiated with checksums, the
> + * state can be changed to "on". Once the state is "on" checksums will be both
> + * written and verified. See datachecksumsworker.c for a longer discussion on
> + * how data checksums can be enabled in a running cluster.
> + *
> + * This function blocks until all backends in the cluster have acknowledged the
> + * state transition.
> + */
> +void
> +SetDataChecksumsOn(bool immediate_checkpoint)
>  {
> +	uint64		barrier;
> +	int			flags;
> +
>  	Assert(ControlFile != NULL);
> -	return (ControlFile->data_checksum_version > 0);
> +
> +	SpinLockAcquire(&XLogCtl->info_lck);
> +
> +	/*
> +	 * The only allowed state transition to "on" is from "inprogress-on" since
> +	 * that state ensures that all pages will have data checksums written.
> +	 */
> +	if (XLogCtl->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
> +	{
> +		SpinLockRelease(&XLogCtl->info_lck);
> +		elog(ERROR, "checksums not in \"inprogress-on\" mode");

Seems like a PANIC condition to me...

> +	}
> +
> +	SpinLockRelease(&XLogCtl->info_lck);
> +
> +	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +	INJECTION_POINT("datachecksums-enable-checksums-delay", NULL);
> +	START_CRIT_SECTION();

I think it's a really really bad idea to do something fallible, like
INJECTION_POINT, after setting delayChkptFlags, but before entering the crit
section. Any error in the injection point will lead to a corrupted
delayChkptFlags, no?


> +	XLogChecksums(PG_DATA_CHECKSUM_VERSION);
> +
> +	SpinLockAcquire(&XLogCtl->info_lck);
> +	XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
> +	SpinLockRelease(&XLogCtl->info_lck);
> +
> +	barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
> +
> +	END_CRIT_SECTION();
> +	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +
> +	/*
> +	 * Await state transition of "on" in all backends. When done we know that
> +	 * data checksums are enabled in all backends and data checksums are both
> +	 * written and verified.
> +	 */
> +	WaitForProcSignalBarrier(barrier);
> +
> +	INJECTION_POINT("datachecksums-enable-checksums-pre-checkpoint", NULL);
> +
> +	/* XXX is this needed? */
> +	flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
> +	if (immediate_checkpoint)
> +		flags = flags | CHECKPOINT_FAST;
> +	RequestCheckpoint(flags);
> +}
> +
> +/*
> + * SetDataChecksumsOff
> + *		Disables data checksums cluster-wide
> + *
> + * Disabling data checksums must be performed with two sets of barriers, each
> + * carrying a different state. The state is first set to "inprogress-off"
> + * during which checksums are still written but not verified. This ensures that
> + * backends which have yet to observe the state change from "on" won't get
> + * validation errors on concurrently modified pages. Once all backends have
> + * changed to "inprogress-off", the barrier for moving to "off" can be emitted.
> + * This function blocks until all backends in the cluster have acknowledged the
> + * state transition.
> + */
> +void
> +SetDataChecksumsOff(bool immediate_checkpoint)
> +{
> +	[...]
> +	/*
> +	 * Ensure that we don't incur a checkpoint during disabling checksums.
> +	 */
> +	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +	START_CRIT_SECTION();
> +
> +	XLogChecksums(0);

Why no symbolic name here?


> +	SpinLockAcquire(&XLogCtl->info_lck);
> +	XLogCtl->data_checksum_version = 0;
> +	SpinLockRelease(&XLogCtl->info_lck);
> +
> +	barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
> +
> +	END_CRIT_SECTION();
> +	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +
> +	WaitForProcSignalBarrier(barrier);
> +
> +	flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
> +	if (immediate_checkpoint)
> +		flags = flags | CHECKPOINT_FAST;
> +	RequestCheckpoint(flags);
> +}
> +
> +/*
> + * AbsorbDataChecksumsBarrier
> + *		Generic function for absorbing data checksum state changes
> + *
> + * All procsignalbarriers regarding data checksum state changes are absorbed
> + * with this function.  The set of conditions required for the state change to
> + * be accepted are listed in the checksum_barriers struct, target_state is
> + * used to look up the relevant entry.
> + */
> +bool
> +AbsorbDataChecksumsBarrier(int target_state)
> +{
> +	const ChecksumBarrierCondition *condition = checksum_barriers;
> +	int			current = LocalDataChecksumVersion;
> +	bool		found = false;
> +
> +	/*
> +	 * Find the barrier condition definition for the target state. Not finding
> +	 * a condition would be a grave programmer error as the states are a
> +	 * discrete set.
> +	 */
> +	while (condition->target != target_state && condition->target != -1)
> +		condition++;
> +	if (unlikely(condition->target == -1))
> +		ereport(ERROR,
> +				errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +				errmsg("invalid target state %i for data checksum barrier",
> +					   target_state));

FWIW, you don't need unlikely() when the branch does an ereport(ERROR), as
ereports >=ERROR are marked "cold" automatically.


> +	/*
> +	 * The current state MUST be equal to one of the EQ states defined in this
> +	 * barrier condition, or equal to the target_state if - and only if -
> +	 * InitialDataChecksumTransition is true.
> +	 */
> +	for (int i = 0; i < condition->barrier_eq_sz; i++)
> +	{
> +		if (current == condition->barrier_eq[i] ||
> +			condition->barrier_eq[i] == PG_DATA_CHECKSUM_ANY_VERSION)
> +			found = true;
> +	}
> +	if (InitialDataChecksumTransition && current == target_state)
> +		found = true;
> +
> +	/*
> +	 * The current state MUST NOT be equal to any of the NE states defined in
> +	 * this barrier condition.
> +	 */
> +	for (int i = 0; i < condition->barrier_ne_sz; i++)
> +	{
> +		if (current == condition->barrier_ne[i])
> +			found = false;
> +	}
> +
> +	/*
> +	 * If the relevent state criteria aren't satisfied, throw an error which
> +	 * will be caught by the procsignal machinery for a later retry.
> +	 */
> +	if (!found)
> +		ereport(ERROR,
> +				errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +				errmsg("incorrect data checksum state %i for target state %i",
> +					   current, target_state));
> +
> +	SetLocalDataChecksumVersion(target_state);
> +	InitialDataChecksumTransition = false;
> +	return true;
> +}


> +/*
> + * Log the new state of checksums
> + */
> +static void
> +XLogChecksums(uint32 new_type)
> +{
> +	xl_checksum_state xlrec;
> +	XLogRecPtr	recptr;
> +
> +	xlrec.new_checksumtype = new_type;
> +
> +	XLogBeginInsert();
> +	XLogRegisterData((char *) &xlrec, sizeof(xl_checksum_state));
> +
> +	INJECTION_POINT("datachecksums-xlogchecksums-pre-xloginsert", &new_type);
> +
> +	recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKSUMS);
> +	XLogFlush(recptr);
> +}

Why an injection point between XLogBeginInsert() and XLogInsert(), rather than
have the injection point before the XLogBeginInsert()?


> +	else if (info == XLOG_CHECKSUMS)
> +	{
> +		xl_checksum_state state;
> +		uint64		barrier;
> +
> +		memcpy(&state, XLogRecGetData(record), sizeof(xl_checksum_state));
> +
> +		/*
> +		 * XXX Could this end up written to the control file prematurely? IIRC
> +		 * that happens during checkpoint, so what if that gets triggered e.g.
> +		 * because someone runs CHECKPOINT? If we then crash (or something
> +		 * like that), could that confuse the instance?
> +		 */
> +		SpinLockAcquire(&XLogCtl->info_lck);
> +		XLogCtl->data_checksum_version = state.new_checksumtype;
> +		SpinLockRelease(&XLogCtl->info_lck);
> +
> +		/*
> +		 * Block on a procsignalbarrier to await all processes having seen the
> +		 * change to checksum status. Once the barrier has been passed we can
> +		 * initiate the corresponding processing.
> +		 */
> +		switch (state.new_checksumtype)
> +		{
> +			case PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION:
> +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
> +				WaitForProcSignalBarrier(barrier);
> +				break;
> +
> +			case PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION:
> +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF);
> +				WaitForProcSignalBarrier(barrier);
> +				break;
> +
> +			case PG_DATA_CHECKSUM_VERSION:
> +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
> +				WaitForProcSignalBarrier(barrier);
> +				break;
> +
> +			default:
> +				Assert(state.new_checksumtype == 0);
> +				barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
> +				WaitForProcSignalBarrier(barrier);
> +				break;

I'd not add a default clause, but add a case each value of the enum. That way
we'll get warnings if the set of states changes.


These WaitForProcSignalBarrier() are one of the scariest bits of the
patchset. If the startup process were to hold any lock that backends need, and
the backends waited for that lock without processing interrupts, we'd have an
undetected deadlock.  This is much more likely to be a problem for the startup
process, as it does the work of many backends on the primary.

We do process interrupts while waiting for heavyweight locks, so that at least
is not an issue. Seems worth to call out rather explicitly.



> +	if (checksumRestartPoint &&
> +		(info == XLOG_CHECKPOINT_ONLINE ||
> +		 info == XLOG_CHECKPOINT_REDO ||
> +		 info == XLOG_CHECKPOINT_SHUTDOWN))
> +	{
> +		int			flags;
> +
> +		elog(LOG, "forcing creation of a restartpoint after XLOG_CHECKSUMS");
> +
> +		/* We explicitly want an immediate checkpoint here */
> +		flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_FAST;
> +		RequestCheckpoint(flags);
> +
> +		checksumRestartPoint = false;
> +	}

As noted above, I don't think we should rely on starting restartpoints.



> diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
> index 2be4e069816..baf6c8cc2cc 100644
> --- a/src/backend/backup/basebackup.c
> +++ b/src/backend/backup/basebackup.c
> @@ -1613,7 +1613,8 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
>  	 * enabled for this cluster, and if this is a relation file, then verify
>  	 * the checksum.
>  	 */
> -	if (!noverify_checksums && DataChecksumsEnabled() &&
> +	if (!noverify_checksums &&
> +		DataChecksumsNeedWrite() &&
>  		RelFileNumberIsValid(relfilenumber))
>  		verify_checksum = true;
>

Why is DataChecksumsNeedWrite() being tested here?


>  --
>  -- We also set up some things as accessible to standard roles.
>  --
> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> index 086c4c8fb6f..6d452b10bce 100644
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1374,6 +1374,26 @@ CREATE VIEW pg_stat_progress_copy AS
>      FROM pg_stat_get_progress_info('COPY') AS S
>          LEFT JOIN pg_database D ON S.datid = D.oid;
>
> +CREATE VIEW pg_stat_progress_data_checksums AS
> +    SELECT
> +        S.pid AS pid, S.datid, D.datname AS datname,
> +        CASE S.param1 WHEN 0 THEN 'enabling'
> +                      WHEN 1 THEN 'disabling'
> +                      WHEN 2 THEN 'waiting on temporary tables'
> +                      WHEN 3 THEN 'waiting on checkpoint'
> +					  WHEN 4 THEN 'waiting on barrier'
> +                      WHEN 5 THEN 'done'
> +                      END AS phase,
> +        CASE S.param2 WHEN -1 THEN NULL ELSE S.param2 END AS databases_total,
> +        S.param3 AS databases_done,
> +        CASE S.param4 WHEN -1 THEN NULL ELSE S.param4 END AS relations_total,
> +        CASE S.param5 WHEN -1 THEN NULL ELSE S.param5 END AS relations_done,
> +        CASE S.param6 WHEN -1 THEN NULL ELSE S.param6 END AS blocks_total,
> +        CASE S.param7 WHEN -1 THEN NULL ELSE S.param7 END AS blocks_done
> +    FROM pg_stat_get_progress_info('DATACHECKSUMS') AS S
> +        LEFT JOIN pg_database D ON S.datid = D.oid
> +    ORDER BY S.datid; -- return the launcher process first
> +

Not this patch's fault, but I strongly dislike that we do this in SQL. Every
postgres database in the world has ~110kB of pg_stat_progress view definitions
in it.  We should just do this in a C function.


> diff --git a/src/backend/postmaster/datachecksumsworker.c b/src/backend/postmaster/datachecksumsworker.c
> new file mode 100644
> index 00000000000..57311760b2b
> --- /dev/null
> +++ b/src/backend/postmaster/datachecksumsworker.c
> @@ -0,0 +1,1491 @@
> +/*-------------------------------------------------------------------------
> + *
> + * datachecksumsworker.c
> + *	  Background worker for enabling or disabling data checksums online
> + *
> + * When enabling data checksums on a database at initdb time or when shut down
> + * with pg_checksums, no extra process is required as each page is checksummed,
> + * and verified, when accessed.  When enabling checksums on an already running
> + * cluster, this worker will ensure that all pages are checksummed before
> + * verification of the checksums is turned on. In the case of disabling
> + * checksums, the state transition is performed only in the control file, no
> + * changes are performed on the data pages.
> + *
> + * Checksums can be either enabled or disabled cluster-wide, with on/off being
> + * the end state for data_checksums.
> + *
> + * Enabling checksums
> + * ------------------
> + * When enabling checksums in an online cluster, data_checksums will be set to
> + * "inprogress-on" which signals that write operations MUST compute and write
> + * the checksum on the data page, but during reading the checksum SHALL NOT be
> + * verified. This ensures that all objects created during checksumming will
> + * have checksums set, but no reads will fail due to incorrect checksum.

Maybe "... due to not yet set checksums."?  Incorrect checksums sounds like
it's about checksums that are actively wrong, rather than just not set. Except
for the corner case of a torn page in the process of an hint bit write, after
having disabled checksums, there shouldn't be incorrect ones, right?


> The
> + * DataChecksumsWorker will compile a list of databases which exist at the
> + * start of checksumming, and all of these which haven't been dropped during
> + * the processing MUST have been processed successfully in order for checksums
> + * to be enabled. Any new relation created during processing will see the
> + * in-progress state and will automatically be checksummed.

What about new databases created while checksums are being enabled? They could
be copied before the worker has processed them. At least for the file_copy
strategy, the copy will be verbatim and thus will not necessarily have
checksums set.


> + * Synchronization and Correctness
> + * -------------------------------
> + * The processes involved in enabling, or disabling, data checksums in an
> + * online cluster must be properly synchronized with the normal backends
> + * serving concurrent queries to ensure correctness. Correctness is defined
> + * as the following:
> + *
> + *    - Backends SHALL NOT violate the data_checksums state they have agreed to
> + *      by acknowledging the procsignalbarrier:  This means that all backends
> + *      MUST calculate and write data checksums during all states except off;
> + *      MUST validate checksums only in the 'on' state.
> + *    - Data checksums SHALL NOT be considered enabled cluster-wide until all
> + *      currently connected backends have state "on": This means that all
> + *      backends must wait on the procsignalbarrier to be acknowledged by all
> + *      before proceeding to validate data checksums.
> + *
> + * There are two levels of synchronization required for changing data_checksums

Maybe s/levels/steps/?

> + * in an online cluster: (i) changing state in the active backends ("on",
> + * "off", "inprogress-on" and "inprogress-off"), and (ii) ensuring no
> + * incompatible objects and processes are left in a database when workers end.
> + * The former deals with cluster-wide agreement on data checksum state and the
> + * latter with ensuring that any concurrent activity cannot break the data
> + * checksum contract during processing.
> + *
> + * Synchronizing the state change is done with procsignal barriers, where the
> + * WAL logging backend updating the global state in the controlfile will wait

It's not entirely obvious what "the WAL logging backend" means.


> + * for all other backends to absorb the barrier. Barrier absorption will happen
> + * during interrupt processing, which means that connected backends will change
> + * state at different times. To prevent data checksum state changes when
> + * writing and verifying checksums, interrupts shall be held off before
> + * interrogating state and resumed when the IO operation has been performed.
> + *
> + *   When Enabling Data Checksums
> + *   ----------------------------

Odd change in indentation here.



> + *   A process which fails to observe data checksums being enabled can induce
> + *   two types of errors: failing to write the checksum when modifying the page
> + *   and failing to validate the data checksum on the page when reading it.
> + *
> + *   When processing starts all backends belong to one of the below sets, with
> + *   one set being empty:
> + *
> + *   Bd: Backends in "off" state
> + *   Bi: Backends in "inprogress-on" state
> + *
> + *   If processing is started in an online cluster then all backends are in Bd.
> + *   If processing was halted by the cluster shutting down, the controlfile
> + *   state "inprogress-on" will be observed on system startup and all backends
> + *   will be placed in Bd.

Why not in Bi? Just for simplicities sake? ISTM we already need to be sure
that new backends start in Bi, as they might never observe the barrier...


> Backends transition Bd -> Bi via a procsignalbarrier
> + *   which is emitted by the DataChecksumsLauncher.  When all backends have
> + *   acknowledged the barrier then Bd will be empty and the next phase can
> + *   begin: calculating and writing data checksums with DataChecksumsWorkers.
> + *   When the DataChecksumsWorker processes have finished writing checksums on
> + *   all pages and enables data checksums cluster-wide via another

s/enables/enabled/?

> + *   procsignalbarrier, there are four sets of backends where Bd shall be an
> + *   empty set:
> + *
> + *   Bg: Backend updating the global state and emitting the procsignalbarrier
> + *   Bd: Backends in "off" state
> + *   Be: Backends in "on" state
> + *   Bi: Backends in "inprogress-on" state
> + *
> + *   Backends in Bi and Be will write checksums when modifying a page, but only
> + *   backends in Be will verify the checksum during reading. The Bg backend is
> + *   blocked waiting for all backends in Bi to process interrupts and move to
> + *   Be. Any backend starting while Bg is waiting on the procsignalbarrier will
> + *   observe the global state being "on" and will thus automatically belong to
> + *   Be.  Checksums are enabled cluster-wide when Bi is an empty set. Bi and Be
> + *   are compatible sets while still operating based on their local state as
> + *   both write data checksums.
> + *
> + *   When Disabling Data Checksums
> + *   -----------------------------
> + *   A process which fails to observe that data checksums have been disabled
> + *   can induce two types of errors: writing the checksum when modifying the
> + *   page and validating a data checksum which is no longer correct due to
> + *   modifications to the page.

Hm. I wonder if we ought to zero out old checksums when loading a page into
s_b with checksums disabled... But that's really independent of this patchset.


> + * Potential optimizations
> + * -----------------------
> + * Below are some potential optimizations and improvements which were brought
> + * up during reviews of this feature, but which weren't implemented in the
> + * initial version. These are ideas listed without any validation on their
> + * feasibility or potential payoff. More discussion on these can be found on
> + * the -hackers threads linked to in the commit message of this feature.
> + *
> + *   * Launching datachecksumsworker for resuming operation from the startup
> + *     process: Currently users have to restart processing manually after a
> + *     restart since dynamic background worker cannot be started from the
> + *     postmaster. Changing the startup process could make restarting the
> + *     processing automatic on cluster restart.
> + *   * Avoid dirtying the page when checksums already match: Iff the checksum
> + *     on the page happens to already match we still dirty the page. It should
> + *     be enough to only do the log_newpage_buffer() call in that case.
> + *   * Invent a lightweight WAL record that doesn't contain the full-page
> + *     image but just the block number: On replay, the redo routine would read
> + *     the page from disk.

The last sentence might be truncated?


> + *   * Teach pg_checksums to avoid checksummed pages when pg_checksums is used
> + *     to enable checksums on a cluster which is in inprogress-on state and
> + *     may have checksummed pages (make pg_checksums be able to resume an
> + *     online operation).
> + *   * Restartability (not necessarily with page granularity).
> + *
> + *
> + * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *	  src/backend/postmaster/datachecksumsworker.c

Hm. The whole set of interactions with checkpoints/restartpoints aren't explored
here?


> +/*
> + * ProcessSingleRelationFork
> + *		Enable data checksums in a single relation/fork.
> + *
> + * Returns true if successful, and false if *aborted*. On error, an actual
> + * error is raised in the lower levels.
> + */
> +static bool
> +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
> +{
> +	BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
> +	char		activity[NAMEDATALEN * 2 + 128];
> +	char	   *relns;
> +
> +	relns = get_namespace_name(RelationGetNamespace(reln));
> +
> +	if (!relns)
> +		return false;
> +
> +	/* Report the current relation to pgstat_activity */
> +	snprintf(activity, sizeof(activity) - 1, "processing: %s.%s (%s, %dblocks)",
> +			 relns, RelationGetRelationName(reln), forkNames[forkNum], numblocks);
> +	pgstat_report_activity(STATE_RUNNING, activity);
> +
> +	/*
> +	 * As of now we only update the block counter for main forks in order to
> +	 * not cause too frequent calls. TODO: investigate whether we should do it
> +	 * more frequent?
> +	 */
> +	if (forkNum == MAIN_FORKNUM)
> +		pgstat_progress_update_param(PROGRESS_DATACHECKSUMS_BLOCKS_TOTAL,
> +									 numblocks);

That doesn't make much sense to me. Presumably the reason to skip it for the
other forks is that they're small-ish. But if so, there's no point in skipping
the reporting either, as presumably there wouldn't be a lot of reporting?


> +	/*
> +	 * We are looping over the blocks which existed at the time of process
> +	 * start, which is safe since new blocks are created with checksums set
> +	 * already due to the state being "inprogress-on".
> +	 */
> +	for (BlockNumber blknum = 0; blknum < numblocks; blknum++)
> +	{
> +		Buffer		buf = ReadBufferExtended(reln, forkNum, blknum, RBM_NORMAL, strategy);
> +
> +		/* Need to get an exclusive lock before we can flag as dirty */
> +		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +
> +		/*
> +		 * Mark the buffer as dirty and force a full page write.  We have to
> +		 * re-write the page to WAL even if the checksum hasn't changed,
> +		 * because if there is a replica it might have a slightly different
> +		 * version of the page with an invalid checksum, caused by unlogged
> +		 * changes (e.g. hintbits) on the master happening while checksums
> +		 * were off. This can happen if there was a valid checksum on the page
> +		 * at one point in the past, so only when checksums are first on, then
> +		 * off, and then turned on again.  TODO: investigate if this could be
> +		 * avoided if the checksum is calculated to be correct and wal_level
> +		 * is set to "minimal",
> +		 */
> +		START_CRIT_SECTION();
> +		MarkBufferDirty(buf);
> +		log_newpage_buffer(buf, false);
> +		END_CRIT_SECTION();

Hm. It's pretty annoying to have to pass page_std = false here, that could
increase the write volume noticeably. But there's not a great way to know
what the right value would be :(



> +/*
> + * launcher_exit
> + *
> + * Internal routine for cleaning up state when the launcher process exits. We
> + * need to clean up the abort flag to ensure that processing can be restarted
> + * again after it was previously aborted.
> + */
> +static void
> +launcher_exit(int code, Datum arg)
> +{
> +	if (launcher_running)
> +	{
> +		LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
> +		launcher_running = false;
> +		DataChecksumsWorkerShmem->launcher_running = false;
> +		LWLockRelease(DataChecksumsWorkerLock);
> +	}
> +}

Could we end up exiting this with the worker still running?


> +/*
> + * WaitForAllTransactionsToFinish
> + *		Blocks awaiting all current transactions to finish
> + *
> + * Returns when all transactions which are active at the call of the function
> + * have ended, or if the postmaster dies while waiting. If the postmaster dies
> + * the abort flag will be set to indicate that the caller of this shouldn't
> + * proceed.
> + *
> + * NB: this will return early, if aborted by SIGINT or if the target state
> + * is changed while we're running.

I think either here, or at its callsites, the patch needs to explain *why* we
are waiting for all transactions to finish.  Presumably this is to ensure that
other sessions haven't created relations that we can't see yet?

It actually doesn't seem to wait for all transactions, ust for ones with an
xid?

> + */
> +static void
> +WaitForAllTransactionsToFinish(void)
> +{
> +	TransactionId waitforxid;
> +
> +	LWLockAcquire(XidGenLock, LW_SHARED);
> +	waitforxid = XidFromFullTransactionId(TransamVariables->nextXid);
> +	LWLockRelease(XidGenLock);
> +
> +	while (TransactionIdPrecedes(GetOldestActiveTransactionId(false, true), waitforxid))
> +	{
> +		char		activity[64];
> +		int			rc;
> +
> +		/* Oldest running xid is older than us, so wait */
> +		snprintf(activity,
> +				 sizeof(activity),
> +				 "Waiting for current transactions to finish (waiting for %u)",
> +				 waitforxid);
> +		pgstat_report_activity(STATE_RUNNING, activity);
> +
> +		/* Retry every 3 seconds */
> +		ResetLatch(MyLatch);
> +		rc = WaitLatch(MyLatch,
> +					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> +					   3000,
> +					   WAIT_EVENT_CHECKSUM_ENABLE_STARTCONDITION);
> +
> +		/*
> +		 * If the postmaster died we won't be able to enable checksums
> +		 * cluster-wide so abort and hope to continue when restarted.
> +		 */
> +		if (rc & WL_POSTMASTER_DEATH)
> +			ereport(FATAL,
> +					errcode(ERRCODE_ADMIN_SHUTDOWN),
> +					errmsg("postmaster exited during data checksum processing"),
> +					errhint("Restart the database and restart data checksum processing by calling pg_enable_data_checksums()."));
> +
> +		LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
> +		if (DataChecksumsWorkerShmem->launch_operation != operation)
> +			abort_requested = true;
> +		LWLockRelease(DataChecksumsWorkerLock);
> +		if (abort_requested)
> +			break;

I don't like this much - loops with a timeout are generally a really bad idea
and we shouldn't add more instances. Presumably this also makes the tests
slower...

How about collecting the to-be-waited-for virtualxids and then wait for those?


> +	if (operation == ENABLE_DATACHECKSUMS)
> +	{
> +		/*
> +		 * If we are asked to enable checksums in a cluster which already has
> +		 * checksums enabled, exit immediately as there is nothing more to do.
> +		 * Hold interrupts to make sure state doesn't change during checking.
> +		 */
> +		HOLD_INTERRUPTS();
> +		if (DataChecksumsNeedVerify())
> +		{
> +			RESUME_INTERRUPTS();
> +			goto done;
> +		}
> +		RESUME_INTERRUPTS();

I don't understand what this interrupt stuff achieves here?



> diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
> index 25d13a798d1..0faaac14b1b 100644
> --- a/src/include/storage/checksum.h
> +++ b/src/include/storage/checksum.h
> @@ -15,6 +15,21 @@
>
>  #include "storage/block.h"
>
> +/*
> + * Checksum version 0 is used for when data checksums are disabled (OFF).
> + * PG_DATA_CHECKSUM_VERSION defines that data checksums are enabled in the
> + * cluster and PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF}_VERSION defines that data
> + * checksums are either currently being enabled or disabled.
> + */
> +typedef enum ChecksumType
> +{
> +	PG_DATA_CHECKSUM_OFF = 0,
> +	PG_DATA_CHECKSUM_VERSION,
> +	PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
> +	PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION,
> +	PG_DATA_CHECKSUM_ANY_VERSION
> +} ChecksumType;

Why is there "VERSION" in the name of these?  Feels like that's basically just
vestigial at this point.


>  /*
>   * There also exist several built-in LWLock tranches.  As with the predefined
> diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
> index c6f5ebceefd..d90d35b1d6f 100644
> --- a/src/include/storage/proc.h
> +++ b/src/include/storage/proc.h
> @@ -463,11 +463,11 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
>   * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
>   * run during normal operation.  Startup process and WAL receiver also consume
>   * 2 slots, but WAL writer is launched only after startup has exited, so we
> - * only need 6 slots.
> + * only need 6 slots to cover these. The DataChecksums worker and launcher
> + * can consume 2 slots when data checksums are enabled or disabled.
>   */
>  #define MAX_IO_WORKERS          32
> -#define NUM_AUXILIARY_PROCS		(6 + MAX_IO_WORKERS)
> -
> +#define NUM_AUXILIARY_PROCS		(8 + MAX_IO_WORKERS)

Aren't they bgworkers now?


> diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
> index afeeb1ca019..c54c61e2cd8 100644
> --- a/src/include/storage/procsignal.h
> +++ b/src/include/storage/procsignal.h
> @@ -54,6 +54,11 @@ typedef enum
>  typedef enum
>  {
>  	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
> +
> +	PROCSIGNAL_BARRIER_CHECKSUM_OFF,
> +	PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
> +	PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF,
> +	PROCSIGNAL_BARRIER_CHECKSUM_ON,
>  } ProcSignalBarrierType;

I wonder if these really should be different barriers. What if we just made it
one, and instead drove the transition on the current shmem content?



Other stuff:
- what protects against multiple backends enabling checksums at the same time?

  Afaict there isn't anything, and we just ignore the second request. Which
  seems ok-ish if it's the same request as before, but not great if it's a
  different one.

  Should also have tests for that.

- I think this adds a bit too much of the logic to xlog.c, already an unwieldy
  file. A fair bit of all of this doesn't seem like it needs to be in there.

- the code seems somewhat split brained about bgworkers and auxprocesses


Greetings,

Andres Freund

Re: Changing the state of data checksums in a running cluster

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-01T16:57:20Z

Hi Daniel,

On Thu, Apr 30, 2026 at 8:20 AM Daniel Gustafsson <daniel@yesql.se> wrote:

> > On 29 Apr 2026, at 15:42, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
> wrote:
>
> > The patchset looks good.
>
> Thanks for review, I've applied the patchset after some editorializing.
>

While further testing this feature, I realized that
ProcessSingleRelationFork()
unconditionally called log_newpage_buffer() for every page of every
relation
during pg_enable_data_checksums(). This included unlogged relations,
which by definition never generate WAL for data changes and are reset to
their
init fork on any recovery.

Guard the log_newpage_buffer() call with RelationNeedsWAL() so that
unlogged relations still get their pages dirtied (ensuring the checksum
is flushed to disk at the next checkpoint) but do not emit WAL.

Attached a patch to address this and added a test for the same. My current
test checks if standby has main fork, I could just checked WAL to verify
this
using pg_waldump. Any other test ideas are welcome.

Thanks,
Satya

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2026-05-01T21:13:16Z

> On 1 May 2026, at 18:57, SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:

> While further testing this feature, I realized that ProcessSingleRelationFork() 
> unconditionally called log_newpage_buffer() for every page of every relation 
> during pg_enable_data_checksums(). This included unlogged relations, 
> which by definition never generate WAL for data changes and are reset to their
> init fork on any recovery. 
> 
> Guard the log_newpage_buffer() call with RelationNeedsWAL() so that
> unlogged relations still get their pages dirtied (ensuring the checksum
> is flushed to disk at the next checkpoint) but do not emit WAL.
> 
> Attached a patch to address this and added a test for the same. My current
> test checks if standby has main fork, I could just checked WAL to verify this
> using pg_waldump. Any other test ideas are welcome.

Thanks for the report, I agree that this is an oversight that should be fixed.
Your patch looks good on first glance, I am travelling till Sunday evening so
will take another look when back in the office and will apply it then. Thanks!

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2026-05-04T13:16:34Z

Hi,

Thanks for getting this feature pushed, and for resolving the failures
reported since the feature freeze. I consider this to be an important
improvement, not just for the feature itself, but also because of all
the useful infrastructure it added.

Attached is a refined version of the TAP tests already posted by Daniel
some time ago [1]. Unfortunately, that .txt did not apply cleanly for
some reason, so here's a better version.

I found these tests quite useful when reasoning about how the patch
behaves in concurrent environment (e.g. with multiple sessions
triggering checksum enable/disable, or with a checkpoint, crashes, etc).

At this point all the tests pass, but there are a couple cases with
correct but slightly surprising behavior, worth discussing. Which is
what this e-mail is going to be about.

I'll explain what the TAP tests aim to do first, and then discuss the
slightly surprising behavior.

It's not meant for inclusion into PG19, at least not in this shape - I
wrote those TAP tests while investigating some of the earlier failures
and/or when wondering about behavior in various situations (sequence of
concurrent steps, race conditions, ...). So it's more of an exhaustive,
and the tests are somewhat redundant (N+1 is often just (N + some small
tweak)).

I can imagine distilling it into a tiny subset, and adding that. But
that's up to discussion. But that's for later.


Let me briefly explain what the various TAP tests aim to do. From the
very beginning, my main concern regarding this patch was race conditions
when updating the shared state about effective data_checksum_version.
Because the state is effectively split into about three or four places:

* LocalDataChecksumVersion (local cache)
* XLogCtl->data_checksum_version (XLogCtl->info_lck)
* ControlFile->data_checksum_version (ControlFileLock)
* state in control file on disk

These pieces are protected by different locks, the protocol for updating
and/or reading the various flags is not trivial (and some of the fixed
issues were due to ControlFile->data_checksum_version being updated from
a place that shouldn't have).

So the primary goal of the TAP tests was to check for race conditions by
leveraging injection points to step through concurrent processes in a
deterministic way. The first couple patches (0001-0004) add debug
logging and injection points into a lot of places. And by "a lot" I mean
~80 new injection points, which is about the number of injection points
we have in master now. Anyway, this allows stepping through concurrent
checksum changes, and also checksum change vs. checkpointer.

Then come the actual TAP tests:

1) 0005-TAP-10-concurrent-checksum-changes.patch

Two concurrent checksum changes. The first one gets paused at an
injection point, then the second one gets initiated.

2) 0006-TAP-11-concurrency-with-checkpoints.patch

A checksum change + checkpoint. The change gets paused at an injection
point, a synchronous checkpoint is performed.

3) 0007-TAP-12-crashes-at-injection-points.patch

Similar to 0006, but with a crash + recovery. A checksum change gets
paused at an injection point, a synchronous checkpoint is performed. The
changes gets wpken up and either completes, or pauses on a different
injection point. A restart/crash happens.

4) 0008-TAP-13-concurrency-with-checkpoint-REDO.patch

Similar to 0007, but the checkpoint is not synchronous - happens in the
background, so that the TAP can step through both sides and interleave
them in an arbitrary way. This matters because the checksum change
updates the different state pieces (XlogCtl/ControlFile), while the
checkpointer reads them to record initial state for REDO etc.

5) 0009-TAP-14-checkpoints-with-crashes.patch

Similar to 0008, except that the steps are more fine grained, and
focused on two particular cases with surprisingly different final state.


AFAIK everything works as expected, except for two cases in the "TAP
012" test. One for the "enabling" direction, one for the "disabling"
direction. I'm going to discuss the "enabling" direction, I believe the
other case is just a mirror with the same root cause.

The TAP 012 tests checksum change with a concurrent checkpoint, followed
by a crash, and tests the final state. It pauses the change at an
injection point, does a checkpoint, proceeds to the next injection
point, crashes and does recovery. The expectation is that the final
state "flips" at some injection point, once it gets further enough, and
stays there. But what actually happens is this:

a) test_checksum_transition(
    'disabled', 'enable', undef,
    'datachecksums-enable-inprogress-checksums-end',
    'datachecksums-enable-checksums-start',
    'off');

b) test_checksum_transition(
    'disabled', 'enable', undef,
    'datachecksums-enable-checksums-start',
    'datachecksums-enable-checksums-after-xlog',
    'on');

c) test_checksum_transition(
    'disabled', 'enable', 'datachecksums-enable-checksums-start',
    'datachecksums-enable-checksums-after-xlogs',
    'datachecksums-enable-checksums-after-xlogctl',
    'off');

This says that if the checkpoint happens after
'datachecksums-enable-inprogress-checksums-end' or after
'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
enabling checksums fails).

But if the checkpoint happens after
'datachecksums-enable-checksums-start', we end up with "on" (after
recovery).

This is a bit surprising, because that injection point is before
'datachecksums-enable-checksums-after-xlog'. So the enabling process
gets further and further, but the final state flips off -> on -> off,
contradicting the expectation that it changes once.

I haven't quite wrapped my head around it yet, but my understanding is
this is due to a race condition between the checksum launcher (writing
XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
(reading the shmem state and generating REDO).

The launcher does this sequence of steps:

1) write XLOG2_CHECKSUMS with new state
2) update XLogCtl->data_checksum_version
3) update ControlFile->data_checksum_version
4) UpdateControlFile()
5) emits barrier

while the checkpointer (CreateCheckPoint) does this:

A) read XLogCtl->data_checksum_version (while holding insert locks)
B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
C) UpdateControlFile()

The outcome depends on how exactly these two sequences interleave. For
example, this can happen:

1) write XLOG2_CHECKSUMS with new state
  A) read XLogCtl->data_checksum_version (while holding insert locks)
  B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
  C) UpdateControlFile()
2) update XLogCtl->data_checksum_version
3) update ControlFile->data_checksum_version
4) UpdateControlFile()
5) emits barrier

Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
so redo won't see it), but the checkpoint will still get the old
checksum state from XLogCtl. And so the outcome is "off", per case (c).

But it can also happen what case (b) does:

  A) read XLogCtl->data_checksum_version (while holding insert locks)
  B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
  C) UpdateControlFile()
1) write XLOG2_CHECKSUMS with new state
2) update XLogCtl->data_checksum_version
3) update ControlFile->data_checksum_version
4) UpdateControlFile()
5) emits barrier

In which case the REDO will have the old state, but the recovery will
read the XLOG2_CHECKSUMS, and so end up with "on".

This is the root cause of the surprising behavior in TAP 012, I think.

I attempted to trigger these race conditions in TAP 013, but without
much success. In the end I realized it probably needs more control,
waiting for the other process to hit the next injection point before
unpausing the current one. TAP 014 does that, and it shows that with the
right interleaving of steps the (c) case can end up with both "on" and
"off" final state.

As I said, I don't claim I fully understand this yet. But I wouldn't
call this "bug" - AFAICS it won't produce an incorrect final state (I
haven't seen any such cases).

Still, I wonder if there's a potential issue I failed to notice.


The other question I had when looking at this (concurrency with
checkpoints) is what we get by doing

     MyProc->delayChkptFlags |= DELAY_CHKPT_START;

whenever updating the state in SetDataChecksums... functions. Because
the only thing that guarantees is the updates happen on one side of the
checkpoint record. What does that give us, actually?

It does not seem to prevent this surprising behavior, and it does not
say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.


regards

[1]
https://www.postgresql.org/message-id/9197F930-DDEB-4CAC-82A2-16FEC715CCE8%40yesql.se

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T07:43:03Z

Hi Hackers, Daniel,

Further testing this feature, I noticed that the cost_delay and cost_limit
arguments
 to pg_enable_data_checksums() in practice have no effect.

It appears we have two independent issues in DataChecksumsWorkerMain():

(1) The worker writes the user-supplied values to VacuumCostDelay and
VacuumCostLimit (the GUC-bound globals in globals.c). However,
vacuum_delay_point() reads vacuum_cost_delay / vacuum_cost_limit
declared in vacuum.c. The two pairs are kept in sync only by
VacuumUpdateCosts(), which the worker never calls. Therefore, the napping
formula always sees the defaults (vacuum_cost_delay = 0) and never
sleeps.

(2) The worker also resets VacuumCostPageHit/Miss/Dirty to 0 at startup.
With all per-page weights at zero, VacuumCostBalance never reaches
vacuum_cost_limit, which would defeat the throttling on its own even
if (1) were fixed.

Repro:

Create a database and load data (say 3 GB)

SELECT pg_disable_data_checksums();
SELECT pg_enable_data_checksums(100, 1);   -- 100 ms/page, balance limit 1

Without the fix, this completes in ~10 seconds and pg_stat_activity
never shows wait_event = VacuumDelay for the worker. With even moderate
parameters (e.g. (50, 200)) the worker is continuously in VacuumDelay
after the patch, and total runtime stretches as one would expect.
Also manually tested with cost_delay 0 and higher cost limits.

Attached a patch to fix this.

Thanks,
Satya

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2026-05-05T13:46:03Z

> While further testing this feature, I realized that ProcessSingleRelationFork() 
> unconditionally called log_newpage_buffer() for every page of every relation 
> during pg_enable_data_checksums(). This included unlogged relations, 
> which by definition never generate WAL for data changes and are reset to their
> init fork on any recovery. 


I've tested your patch, and also expanded the test you wrote a little with a
persistence change.

> Further testing this feature, I noticed that the cost_delay and cost_limit arguments
>  to pg_enable_data_checksums() in practice have no effect.

Ugh, the API for updating the costs changed between when this code was written
and tested, and when it was committed (and since I was the one committing the
new API I really should've caught that). Thanks for the report and fix!

While hacking on your patches I realized that the regexes for finding page
verification failures in the logs were anchoring at the wrong point, the
attached 0003 fixes that.

Attached are editorialized versions of the patches, as well as my testfix, that
I'm planning to go ahead with.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T14:24:07Z

Hi,

On Tue, May 5, 2026 at 6:46 AM Daniel Gustafsson <daniel@yesql.se> wrote:

> > While further testing this feature, I realized that
> ProcessSingleRelationFork()
> > unconditionally called log_newpage_buffer() for every page of every
> relation
> > during pg_enable_data_checksums(). This included unlogged relations,
> > which by definition never generate WAL for data changes and are reset to
> their
> > init fork on any recovery.
>
>
> I've tested your patch, and also expanded the test you wrote a little with
> a
> persistence change.
>
> > Further testing this feature, I noticed that the cost_delay and
> cost_limit arguments
> >  to pg_enable_data_checksums() in practice have no effect.
>
> Ugh, the API for updating the costs changed between when this code was
> written
> and tested, and when it was committed (and since I was the one committing
> the
> new API I really should've caught that). Thanks for the report and fix!
>

Any thoughts on adding an injection point test to verify the values are
configured correctly?
This can be a different patch, perhaps included in Tomas' tests?


>
> While hacking on your patches I realized that the regexes for finding page
> verification failures in the logs were anchoring at the wrong point, the
> attached 0003 fixes that.
>
> Attached are editorialized versions of the patches, as well as my testfix,
> that
> I'm planning to go ahead with.
>

LGTM. Thanks!

Thanks,
Satya

Re: Changing the state of data checksums in a running cluster

Ayush Tiwari <ayushtiwari.slg01@gmail.com> — 2026-05-05T15:21:22Z

Hi,

On Tue, 5 May 2026 at 19:16, Daniel Gustafsson <daniel@yesql.se> wrote:

> > While further testing this feature, I realized that
> ProcessSingleRelationFork()
> > unconditionally called log_newpage_buffer() for every page of every
> relation
> > during pg_enable_data_checksums(). This included unlogged relations,
> > which by definition never generate WAL for data changes and are reset to
> their
> > init fork on any recovery.
>
>
> I've tested your patch, and also expanded the test you wrote a little with
> a
> persistence change.
>
>
I tested the patches, it mostly looks good.

I've a small concern in 0001.  The new guard uses only
RelationNeedsWAL(reln),
but ProcessSingleRelationByOid() iterates all forks.  For unlogged
relations,
the init fork is special, there are several existing call sites that
preserve
WAL for INIT_FORKNUM, for example using

  RelationNeedsWAL(rel) || forknum == INIT_FORKNUM

and catalog/storage.c notes that unlogged init forks need WAL and sync.

So I think the condition in ProcessSingleRelationFork() should preserve the
init-fork case, e.g.

  if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
      log_newpage_buffer(buf, false);

It would also be good to extend the new test so it exercises a non-empty
init
fork, perhaps with an unlogged table that has an index, and then verifies
the
standby/recovery behavior.

0002 and 0003 look good to me.

Regards,
Ayush

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2026-05-05T19:08:30Z

> On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:

> I've a small concern in 0001.  The new guard uses only RelationNeedsWAL(reln),
> but ProcessSingleRelationByOid() iterates all forks.  For unlogged relations,
> the init fork is special, there are several existing call sites that preserve
> WAL for INIT_FORKNUM, for example using
> 
>   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
> 
> and catalog/storage.c notes that unlogged init forks need WAL and sync.
> 
> So I think the condition in ProcessSingleRelationFork() should preserve the
> init-fork case, e.g.
> 
>   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
>       log_newpage_buffer(buf, false);

Which failure scenario are you thinking about here?  When dealing with the
catalog relation I can see the need but here we are reading, and writing, data
pages.  In which case would we need to issue an FPI for an unlogged relation
init fork? I might be missing something obvious here.

> 0002 and 0003 look good to me.

Thanks for looking!

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Ayush Tiwari <ayushtiwari.slg01@gmail.com> — 2026-05-05T19:18:53Z

Hi,

On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:

> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
> wrote:
>
> > I've a small concern in 0001.  The new guard uses only
> RelationNeedsWAL(reln),
> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
> relations,
> > the init fork is special, there are several existing call sites that
> preserve
> > WAL for INIT_FORKNUM, for example using
> >
> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
> >
> > and catalog/storage.c notes that unlogged init forks need WAL and sync.
> >
> > So I think the condition in ProcessSingleRelationFork() should preserve
> the
> > init-fork case, e.g.
> >
> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
> >       log_newpage_buffer(buf, false);
>
> Which failure scenario are you thinking about here?  When dealing with the
> catalog relation I can see the need but here we are reading, and writing,
> data
> pages.  In which case would we need to issue an FPI for an unlogged
> relation
> init fork? I might be missing something obvious here.
>

The case I was thinking about is not the unlogged relation contents
themselves, but the init fork used as the reset template.  Some unlogged
indexes can have initialized pages in the init fork, and recovery later
copies
that fork to the main fork when resetting unlogged relations.

So my concern was that, during online checksum enable, we might update the
checksum state of an init-fork page on the primary but not WAL-log an FPI
for
it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
after
a crash, could still have the old version of that init fork.  If that fork
is
later copied to the main fork after checksums are enabled, it might lead to
checksum verification failures?

Maybe there is another guarantee that makes this impossible, but I did not
see
it from the patch/test.  That is why I wondered whether the condition should
preserve the existing special treatment for INIT_FORKNUM.

Regards,
Ayush

Re: Changing the state of data checksums in a running cluster

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T19:19:00Z

Hi

On Tue, May 5, 2026 at 12:08 PM Daniel Gustafsson <daniel@yesql.se> wrote:

> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
> wrote:
>
> > I've a small concern in 0001.  The new guard uses only
> RelationNeedsWAL(reln),
> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
> relations,
> > the init fork is special, there are several existing call sites that
> preserve
> > WAL for INIT_FORKNUM, for example using
> >
> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
> >
> > and catalog/storage.c notes that unlogged init forks need WAL and sync.
> >
> > So I think the condition in ProcessSingleRelationFork() should preserve
> the
> > init-fork case, e.g.
> >
> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
> >       log_newpage_buffer(buf, false);
>
> Which failure scenario are you thinking about here?  When dealing with the
> catalog relation I can see the need but here we are reading, and writing,
> data
> pages.  In which case would we need to issue an FPI for an unlogged
> relation
> init fork? I might be missing something obvious here.


Good catch,IIUC init page has valid checksum and when we set checksum on
primary I think we need to pass it to standby. Otherwise, after failover we
may see invalid checksum for unlogged relations. I haven’t validated it,
will test it and update the patch.

Re: Changing the state of data checksums in a running cluster

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T19:45:09Z

Hi,

On Tue, May 5, 2026 at 12:19 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com>
wrote:

> Hi,
>
> On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
>> wrote:
>>
>> > I've a small concern in 0001.  The new guard uses only
>> RelationNeedsWAL(reln),
>> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
>> relations,
>> > the init fork is special, there are several existing call sites that
>> preserve
>> > WAL for INIT_FORKNUM, for example using
>> >
>> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
>> >
>> > and catalog/storage.c notes that unlogged init forks need WAL and sync.
>> >
>> > So I think the condition in ProcessSingleRelationFork() should preserve
>> the
>> > init-fork case, e.g.
>> >
>> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
>> >       log_newpage_buffer(buf, false);
>>
>> Which failure scenario are you thinking about here?  When dealing with the
>> catalog relation I can see the need but here we are reading, and writing,
>> data
>> pages.  In which case would we need to issue an FPI for an unlogged
>> relation
>> init fork? I might be missing something obvious here.
>>
>
> The case I was thinking about is not the unlogged relation contents
> themselves, but the init fork used as the reset template.  Some unlogged
> indexes can have initialized pages in the init fork, and recovery later
> copies
> that fork to the main fork when resetting unlogged relations.
>
> So my concern was that, during online checksum enable, we might update the
> checksum state of an init-fork page on the primary but not WAL-log an FPI
> for
> it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
> after
> a crash, could still have the old version of that init fork.  If that fork
> is
> later copied to the main fork after checksums are enabled, it might lead to
> checksum verification failures?
>
> Maybe there is another guarantee that makes this impossible, but I did not
> see
> it from the patch/test.  That is why I wondered whether the condition
> should
> preserve the existing special treatment for INIT_FORKNUM.
>

It is a bug in the code, I added a test in the v2 patch to test this
scenario and the test
failed earlier.

Thanks,
Satya

>

Re: Changing the state of data checksums in a running cluster

SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> — 2026-05-05T20:05:51Z

Hi,

On Tue, May 5, 2026 at 12:45 PM SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:

> Hi,
>
> On Tue, May 5, 2026 at 12:19 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com>
> wrote:
>
>> Hi,
>>
>> On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
>>> wrote:
>>>
>>> > I've a small concern in 0001.  The new guard uses only
>>> RelationNeedsWAL(reln),
>>> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
>>> relations,
>>> > the init fork is special, there are several existing call sites that
>>> preserve
>>> > WAL for INIT_FORKNUM, for example using
>>> >
>>> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
>>> >
>>> > and catalog/storage.c notes that unlogged init forks need WAL and sync.
>>> >
>>> > So I think the condition in ProcessSingleRelationFork() should
>>> preserve the
>>> > init-fork case, e.g.
>>> >
>>> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
>>> >       log_newpage_buffer(buf, false);
>>>
>>> Which failure scenario are you thinking about here?  When dealing with
>>> the
>>> catalog relation I can see the need but here we are reading, and
>>> writing, data
>>> pages.  In which case would we need to issue an FPI for an unlogged
>>> relation
>>> init fork? I might be missing something obvious here.
>>>
>>
>> The case I was thinking about is not the unlogged relation contents
>> themselves, but the init fork used as the reset template.  Some unlogged
>> indexes can have initialized pages in the init fork, and recovery later
>> copies
>> that fork to the main fork when resetting unlogged relations.
>>
>> So my concern was that, during online checksum enable, we might update the
>> checksum state of an init-fork page on the primary but not WAL-log an FPI
>> for
>> it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
>> after
>> a crash, could still have the old version of that init fork.  If that
>> fork is
>> later copied to the main fork after checksums are enabled, it might lead
>> to
>> checksum verification failures?
>>
>> Maybe there is another guarantee that makes this impossible, but I did
>> not see
>> it from the patch/test.  That is why I wondered whether the condition
>> should
>> preserve the existing special treatment for INIT_FORKNUM.
>>
>
> It is a bug in the code, I added a test in the v2 patch to test this
> scenario and the test
> failed earlier.
>

Please find the latest v3. Earlier I took dependency on amcheck and removed
it in v3
and instead added queries that forces necessary checks.

 Thanks,
Satya

Re: Changing the state of data checksums in a running cluster

Ayush Tiwari <ayushtiwari.slg01@gmail.com> — 2026-05-05T21:01:59Z

Hi,


On Wed, 6 May 2026 at 01:36, SATYANARAYANA NARLAPURAM <
satyanarlapuram@gmail.com> wrote:

> Hi,
>
> On Tue, May 5, 2026 at 12:45 PM SATYANARAYANA NARLAPURAM <
> satyanarlapuram@gmail.com> wrote:
>
>> Hi,
>>
>> On Tue, May 5, 2026 at 12:19 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> On Wed, 6 May 2026 at 00:38, Daniel Gustafsson <daniel@yesql.se> wrote:
>>>
>>>> > On 5 May 2026, at 17:21, Ayush Tiwari <ayushtiwari.slg01@gmail.com>
>>>> wrote:
>>>>
>>>> > I've a small concern in 0001.  The new guard uses only
>>>> RelationNeedsWAL(reln),
>>>> > but ProcessSingleRelationByOid() iterates all forks.  For unlogged
>>>> relations,
>>>> > the init fork is special, there are several existing call sites that
>>>> preserve
>>>> > WAL for INIT_FORKNUM, for example using
>>>> >
>>>> >   RelationNeedsWAL(rel) || forknum == INIT_FORKNUM
>>>> >
>>>> > and catalog/storage.c notes that unlogged init forks need WAL and
>>>> sync.
>>>> >
>>>> > So I think the condition in ProcessSingleRelationFork() should
>>>> preserve the
>>>> > init-fork case, e.g.
>>>> >
>>>> >   if (RelationNeedsWAL(reln) || forkNum == INIT_FORKNUM)
>>>> >       log_newpage_buffer(buf, false);
>>>>
>>>> Which failure scenario are you thinking about here?  When dealing with
>>>> the
>>>> catalog relation I can see the need but here we are reading, and
>>>> writing, data
>>>> pages.  In which case would we need to issue an FPI for an unlogged
>>>> relation
>>>> init fork? I might be missing something obvious here.
>>>>
>>>
>>> The case I was thinking about is not the unlogged relation contents
>>> themselves, but the init fork used as the reset template.  Some unlogged
>>> indexes can have initialized pages in the init fork, and recovery later
>>> copies
>>> that fork to the main fork when resetting unlogged relations.
>>>
>>> So my concern was that, during online checksum enable, we might update
>>> the
>>> checksum state of an init-fork page on the primary but not WAL-log an
>>> FPI for
>>> it because RelationNeedsWAL(reln) is false.  Then a standby, or recovery
>>> after
>>> a crash, could still have the old version of that init fork.  If that
>>> fork is
>>> later copied to the main fork after checksums are enabled, it might lead
>>> to
>>> checksum verification failures?
>>>
>>> Maybe there is another guarantee that makes this impossible, but I did
>>> not see
>>> it from the patch/test.  That is why I wondered whether the condition
>>> should
>>> preserve the existing special treatment for INIT_FORKNUM.
>>>
>>
>> It is a bug in the code, I added a test in the v2 patch to test this
>> scenario and the test
>> failed earlier.
>>
>
> Please find the latest v3. Earlier I took dependency on amcheck and
> removed it in v3
> and instead added queries that forces necessary checks.
>

Thanks, the v3 patch addresses my concern.

I also checked the new test by
temporarily removing the INIT_FORKNUM exception, and it failed on the
promoted
standby while reading the unlogged relation, so the test does cover the init
fork failure scenario.

The code change looks right to me now.

Regards,
Ayush

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2026-05-06T15:22:47Z

> On 5 May 2026, at 23:01, Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:

> Thanks, the v3 patch addresses my concern.

Thanks both of you for the report, patch and testing.  Now I see what you meant
yesterday and I agree with the conclusion.  I've pushed 0001-0003 today with minor
tweaks.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2026-05-26T18:12:35Z

Hi,

On 5/4/26 15:16, Tomas Vondra wrote:
> Hi,
> 
> ... snip ...
>
> Let me briefly explain what the various TAP tests aim to do. From the
> very beginning, my main concern regarding this patch was race conditions
> when updating the shared state about effective data_checksum_version.
> Because the state is effectively split into about three or four places:
> 
> * LocalDataChecksumVersion (local cache)
> * XLogCtl->data_checksum_version (XLogCtl->info_lck)
> * ControlFile->data_checksum_version (ControlFileLock)
> * state in control file on disk
> 
> These pieces are protected by different locks, the protocol for updating
> and/or reading the various flags is not trivial (and some of the fixed
> issues were due to ControlFile->data_checksum_version being updated from
> a place that shouldn't have).
> 

I still have a funny feeling about the "global checksum version" stored
in multiple places with separate locks, and the protocol when updating
them (e.g. which process does that, when, etc.). I haven't found any
fundamental correctness issue, though.

I kept looking for issues, and I realized StartupXLOG may not do the
right thing on the standby after promotion.

  if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON)
  {
      XLogChecksums(PG_DATA_CHECKSUM_OFF);

      SpinLockAcquire(&XLogCtl->info_lck);
      XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_OFF;
      SetLocalDataChecksumState(XLogCtl->data_checksum_version);
      SpinLockRelease(&XLogCtl->info_lck);

      ereport(...);
  }

Consider this sequence of steps:

1) primary/standby cluster has checksums OFF
2) primary starts enabling checksums
3) primary moves to inprogress-on
4) standby receives that and moves to inprogress-on too
5) primary crashes / shuts down / ...
6) standby gets promoted, and does the StartupXLOG thing
7) standby moves from inprogress-on back to off

However, as there's no EmitProcSignalBarrier(), existing backends on the
hot standby won't be notified about the "off" change. And so there will
be a somewhat inconsistent checksum state, with new backends knowing
it's "off", and old backends thinking it's still inprogress-on.

It's not difficult to reproduce this. I guess the existing TAP tests may
not catch this because they always open a new connection when checking
the state, and so see the new (and correct) version.

I believe a similar issue happens for the inprogress-off case a couple
lines later, but I haven't tried reproducing that one.

Ultimately, neither of these cases should cause checksum failures,
because the "correct" new state is "off", and the old backends are not
verifying checksums either.

I suppose this means we should not be updating the checksum state
without emitting the barrier? I think all other places do that.

>
> ... snip ...
> 
> The TAP 012 tests checksum change with a concurrent checkpoint, followed
> by a crash, and tests the final state. It pauses the change at an
> injection point, does a checkpoint, proceeds to the next injection
> point, crashes and does recovery. The expectation is that the final
> state "flips" at some injection point, once it gets further enough, and
> stays there. But what actually happens is this:
> 
> a) test_checksum_transition(
>     'disabled', 'enable', undef,
>     'datachecksums-enable-inprogress-checksums-end',
>     'datachecksums-enable-checksums-start',
>     'off');
> 
> b) test_checksum_transition(
>     'disabled', 'enable', undef,
>     'datachecksums-enable-checksums-start',
>     'datachecksums-enable-checksums-after-xlog',
>     'on');
> 
> c) test_checksum_transition(
>     'disabled', 'enable', 'datachecksums-enable-checksums-start',
>     'datachecksums-enable-checksums-after-xlogs',
>     'datachecksums-enable-checksums-after-xlogctl',
>     'off');
> 
> This says that if the checkpoint happens after
> 'datachecksums-enable-inprogress-checksums-end' or after
> 'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
> enabling checksums fails).
> 
> But if the checkpoint happens after
> 'datachecksums-enable-checksums-start', we end up with "on" (after
> recovery).
> 
> This is a bit surprising, because that injection point is before
> 'datachecksums-enable-checksums-after-xlog'. So the enabling process
> gets further and further, but the final state flips off -> on -> off,
> contradicting the expectation that it changes once.
> 
> I haven't quite wrapped my head around it yet, but my understanding is
> this is due to a race condition between the checksum launcher (writing
> XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
> (reading the shmem state and generating REDO).
> 
> The launcher does this sequence of steps:
> 
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
> 
> while the checkpointer (CreateCheckPoint) does this:
> 
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
> 
> The outcome depends on how exactly these two sequences interleave. For
> example, this can happen:
> 
> 1) write XLOG2_CHECKSUMS with new state
>   A) read XLogCtl->data_checksum_version (while holding insert locks)
>   B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
>   C) UpdateControlFile()
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
> 
> Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
> so redo won't see it), but the checkpoint will still get the old
> checksum state from XLogCtl. And so the outcome is "off", per case (c).
> 
> But it can also happen what case (b) does:
> 
>   A) read XLogCtl->data_checksum_version (while holding insert locks)
>   B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
>   C) UpdateControlFile()
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
> 
> In which case the REDO will have the old state, but the recovery will
> read the XLOG2_CHECKSUMS, and so end up with "on".
> 
> This is the root cause of the surprising behavior in TAP 012, I think.
> 
> I attempted to trigger these race conditions in TAP 013, but without
> much success. In the end I realized it probably needs more control,
> waiting for the other process to hit the next injection point before
> unpausing the current one. TAP 014 does that, and it shows that with the
> right interleaving of steps the (c) case can end up with both "on" and
> "off" final state.
> 
> As I said, I don't claim I fully understand this yet. But I wouldn't
> call this "bug" - AFAICS it won't produce an incorrect final state (I
> haven't seen any such cases).
> 
> Still, I wonder if there's a potential issue I failed to notice.
> 

I kept thinking about this non-determinism, and I still think I'm right
about the root cause. The checkpointer and datachecksum worker may
interleave in different ways, affecting the final checksum state. The
existing interlock (or lack of of it) is not sufficient.

To verify this hypothesis, I've done a simple thing - I've introduced a
new LWLock, protecting the "critical part" so that the checkpoint_redo
can't happen in between XLogChecksums() and XLogCtl update. Patch
attached, but I don't propose this for commit, I'm not sure if it's
right (or safe), it's merely a PoC to confirm the issue. But it resolves
the weird cases, simply by prohibiting some of the step sequences (so
some of the tests needed to be commented out, as they'd block).

I'm still not sure if it really is an issue or just an annoyance,
because I've not been able to find a case where it'd lead to checksum
failures (or obviously incorrect final state after recovery).

> 
> The other question I had when looking at this (concurrency with
> checkpoints) is what we get by doing
> 
>      MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> 
> whenever updating the state in SetDataChecksums... functions. Because
> the only thing that guarantees is the updates happen on one side of the
> checkpoint record. What does that give us, actually?
> 
> It does not seem to prevent this surprising behavior, and it does not
> say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.
> 

I still don't understand why this needs DELAY_CHKPT_START ...

I also noticed a couple minor comment issues, per attached patch (this
may need pgindent).

regards

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2026-05-28T11:28:49Z

> On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:

> I suppose this means we should not be updating the checksum state
> without emitting the barrier? I think all other places do that.

Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
to maintain cluster consistency.  I ended up writing a test for this very case
as well.

> I'm still not sure if it really is an issue or just an annoyance,
> because I've not been able to find a case where it'd lead to checksum
> failures (or obviously incorrect final state after recovery).

I've tried to get it to reach an incorrect end state but failed, but I do agree
that maybe we need an improved locking protocol around state updates.  Need to
spend some more time thinking about this.

> I still don't understand why this needs DELAY_CHKPT_START ...

Having stared at this for some time, and going over old threads, I think this
is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
erring on the safe side by leaving as is and looking at removing in 20.  What
do you think?

> I also noticed a couple minor comment issues, per attached patch (this
> may need pgindent).

I ended up splitting this into two, one for the comment fixes and one for the
data type change.

I propose applying the three patches below to v19 to fix the promotion issue
before we wrap beta1.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2026-05-28T11:51:14Z

On 5/28/26 13:28, Daniel Gustafsson wrote:
>> On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> I suppose this means we should not be updating the checksum state
>> without emitting the barrier? I think all other places do that.
> 
> Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
> to maintain cluster consistency.  I ended up writing a test for this very case
> as well.
> 

Good.

>> I'm still not sure if it really is an issue or just an annoyance,
>> because I've not been able to find a case where it'd lead to checksum
>> failures (or obviously incorrect final state after recovery).
> 
> I've tried to get it to reach an incorrect end state but failed, but I do agree
> that maybe we need an improved locking protocol around state updates.  Need to
> spend some more time thinking about this.
> 

OK

>> I still don't understand why this needs DELAY_CHKPT_START ...
> 
> Having stared at this for some time, and going over old threads, I think this
> is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
> erring on the safe side by leaving as is and looking at removing in 20.  What
> do you think?
> 

I'd probably try to fix this for 19, otherwise it may be confusing
people looking at the code in the future. We're still months from 19
getting released. Ofc, maybe I'm underestimating the risk.

>> I also noticed a couple minor comment issues, per attached patch (this
>> may need pgindent).
> 
> I ended up splitting this into two, one for the comment fixes and one for the
> data type change.
> 
> I propose applying the three patches below to v19 to fix the promotion issue
> before we wrap beta1.
> 

WFM

> --
> Daniel Gustafsson
> 

-- 
Tomas Vondra

Re: Changing the state of data checksums in a running cluster

Daniel Gustafsson <daniel@yesql.se> — 2026-05-29T20:08:11Z

> On 28 May 2026, at 13:51, Tomas Vondra <tomas@vondra.me> wrote:
> 
> On 5/28/26 13:28, Daniel Gustafsson wrote:
>>> On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:
>> 
>>> I suppose this means we should not be updating the checksum state
>>> without emitting the barrier? I think all other places do that.
>> 
>> Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
>> to maintain cluster consistency.  I ended up writing a test for this very case
>> as well.
> 
> Good.

I've pushed this now, along with your other findings, ahead of the beta1
deadline, buildfarm seems happy so far.

>>> I still don't understand why this needs DELAY_CHKPT_START ...
>> 
>> Having stared at this for some time, and going over old threads, I think this
>> is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
>> erring on the safe side by leaving as is and looking at removing in 20.  What
>> do you think?
>> 
> 
> I'd probably try to fix this for 19, otherwise it may be confusing
> people looking at the code in the future. We're still months from 19
> getting released. Ofc, maybe I'm underestimating the risk.

You're probably right.  Once beta1 is out I'll work on getting this fixed.

--
Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2026-05-29T20:27:20Z


On 5/29/26 22:08, Daniel Gustafsson wrote:
>> On 28 May 2026, at 13:51, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 5/28/26 13:28, Daniel Gustafsson wrote:
>>>> On 26 May 2026, at 20:12, Tomas Vondra <tomas@vondra.me> wrote:
>>>
>>>> I suppose this means we should not be updating the checksum state
>>>> without emitting the barrier? I think all other places do that.
>>>
>>> Good catch, it's indeed a bug, any state change must emit a procsignalbarrier
>>> to maintain cluster consistency.  I ended up writing a test for this very case
>>> as well.
>>
>> Good.
> 
> I've pushed this now, along with your other findings, ahead of the beta1
> deadline, buildfarm seems happy so far.
> 

Thanks!

>>>> I still don't understand why this needs DELAY_CHKPT_START ...
>>>
>>> Having stared at this for some time, and going over old threads, I think this
>>> is a mistake.  AFAICT though it cannot cause any error, so I'd lean towards
>>> erring on the safe side by leaving as is and looking at removing in 20.  What
>>> do you think?
>>>
>>
>> I'd probably try to fix this for 19, otherwise it may be confusing
>> people looking at the code in the future. We're still months from 19
>> getting released. Ofc, maybe I'm underestimating the risk.
> 
> You're probably right.  Once beta1 is out I'll work on getting this fixed.
> 

+1


regards

-- 
Tomas Vondra