Re: Changing the state of data checksums in a running cluster
From: Daniel Gustafsson <daniel@yesql.se>
To: Tomas Vondra <tomas@vondra.me>
Cc: Bernd Helmle <mailings@oopsware.de>,
Michael Paquier <michael@paquier.xyz>,
Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-25T18:32:51Z
Lists: pgsql-hackers
Commits
- Use correct datatype for PID: 0ca1b3010597, 19 (unreleased), landed
- Improve comments in online checksums code: cd857dec0e0a, 19 (unreleased), landed
- Fix checksum state transition during promotion: 5fee7cab1b87, 19 (unreleased), landed
- Fix regex searching for page verification failures in tests: 486b9a9b9eb4, 19 (unreleased), landed
- Apply data-checksum worker throttling parameters: 9a39056c418c, 19 (unreleased), landed
- Skip WAL for unlogged main fork during online checksum enable: 2018bd616790, 19 (unreleased), landed
- Revert "Get rid of WALBufMappingLock": c13070a27b63, 19 (unreleased), cited
- Get rid of WALBufMappingLock: bc22dc0e0ddc, 18.0, cited
- Improve grammar of options for command arrays in TAP tests: ce1b0f9da03e, 18.0, cited
Attachments
- v20250825-0001-Online-enabling-and-disabling-of-data-chec.patch (application/octet-stream) patch v20250825-0001
> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:
> This happens quite regularly, it's not hard to hit. But I've only seen
> it to happen on a FSM, and only right after immediate shutdown. I don't
> think that's quite expected.
>
> I believe the built-in TAP tests (with injection points) can't catch
> this, because there's no concurrent activity while flipping checksums
> on/off. It'd be good to do something like that, by running pgbench in
> the background, or something like that.
In searching for this bug I opted to implement a version of the stress
tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
version. It's gated behind PG_TEST_EXTRA since it's clearly not something
which can be enabled by default (if this goes in, it needs to be redone to
provide two levels IMO, but during testing this is more convenient). I'm
curious to see which improvements you can think of to make it stress the code
to the breaking point.
> I think there's a minor issue in how pg_checksums validates state before
> checking the data.
>
> The current patch simply does:
>
> if (ControlFile->data_checksum_version == 0 &&
>     mode == PG_MODE_CHECK)
>     pg_fatal("data checksums are not enabled in cluster");
>
> and that worked when the version was either 0 or 1. But now it can be
> also 2 or 3, for inprogress-on / inprogress-off, and if the cluster gets
> shut down at the right moment, that can end in the control file.
Good point. I've changed the test to check that checksums are enabled, rather
than checking whether they are disabled.
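
To illustrate the difference between the two styles of check, here is a
minimal standalone sketch. It is not the code from the patch: the enum names
and both helper functions are hypothetical, and only the numeric values
(0 = off, 1 = on, 2 = inprogress-on, 3 = inprogress-off) are taken from the
quoted description above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical names for the checksum states. The numeric values follow the
 * quoted description; the identifiers are assumptions for illustration only.
 */
enum checksum_state
{
    CHECKSUMS_OFF = 0,
    CHECKSUMS_ON = 1,
    CHECKSUMS_INPROGRESS_ON = 2,
    CHECKSUMS_INPROGRESS_OFF = 3
};

/*
 * Old style: refuse verification only when the version is 0.  This was
 * sufficient while 0 and 1 were the only possible values, but it lets the
 * in-progress states 2 and 3 through.
 */
static bool
check_allowed_if_not_disabled(uint32_t data_checksum_version)
{
    return data_checksum_version != CHECKSUMS_OFF;
}

/*
 * New style: allow verification only when checksums are fully enabled, so
 * any in-progress state read from the control file is rejected as well.
 */
static bool
check_allowed_if_enabled(uint32_t data_checksum_version)
{
    return data_checksum_version == CHECKSUMS_ON;
}
```

With these helpers, a control file left in state 2 or 3 by a shutdown at the
wrong moment passes the old-style test but is rejected by the new-style one,
while states 0 and 1 behave the same under both.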
--
Daniel Gustafsson