Re: Changing the state of data checksums in a running cluster
From: Daniel Gustafsson <daniel@yesql.se>
To: Tomas Vondra <tomas@vondra.me>
Cc: Bernd Helmle <mailings@oopsware.de>,
Michael Paquier <michael@paquier.xyz>,
Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-27T08:30:18Z
Lists: pgsql-hackers
Commits
- Use correct datatype for PID: 0ca1b3010597, 19 (unreleased), landed
- Improve comments in online checksums code: cd857dec0e0a, 19 (unreleased), landed
- Fix checksum state transition during promotion: 5fee7cab1b87, 19 (unreleased), landed
- Fix regex searching for page verification failures in tests: 486b9a9b9eb4, 19 (unreleased), landed
- Apply data-checksum worker throttling parameters: 9a39056c418c, 19 (unreleased), landed
- Skip WAL for unlogged main fork during online checksum enable: 2018bd616790, 19 (unreleased), landed
- Revert "Get rid of WALBufMappingLock": c13070a27b63, 19 (unreleased), cited
- Get rid of WALBufMappingLock: bc22dc0e0ddc, 18.0, cited
- Improve grammar of options for command arrays in TAP tests: ce1b0f9da03e, 18.0, cited
Attachments
- v20250827-0001-Online-enabling-and-disabling-of-data-chec.patch (application/octet-stream) patch v20250827-0001
> On 26 Aug 2025, at 01:06, Tomas Vondra <tomas@vondra.me> wrote:

> I think this TAP looks very nice, but there's a couple issues with it.
> See the attached patch fixing those.

Thanks, I have incorporated (most of) your patch in the attached. I did keep
the PG_TEST_EXTRA check for injection points though, which I assume was removed
by mistake.

> With these changes it runs for me, and I even saw some
>
>   LOG: page verification failed
>
> in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
> while - a couple minutes, maybe? I think I saw it at
>
>   t/006_concurrent_pgbench.pl .. 427/?

That's very interesting, I have been running it to timeout several times in a
row without hitting any verification failures. Will keep running.

> or something like that. I think the bash version did a couple things
> differently, which might make the failures more frequent (but it's just
> a wild guess).
>
> In particular, I think the script restarts the two nodes independently,
> while the TAP always stops both primary and standby, in this order. I
> think it'd be useful to restart one or both.

Done in the attached, it will now randomly stop one node, both, or neither. If
a node is stopped, I've added an offline pg_checksums step to validate the data
files as a why-not test.

> The other thing is the bash script added some random delays/sleep, which
> increases the test duration, but it also means generating somewhat
> random amounts of data, etc. It also randomized some other stuff (scale,
> client count, ...). But that can wait.

Added as well in a few places, maybe more can be sprinkled in.

--
Daniel Gustafsson